RAGchain.preprocess.text_splitter package
Submodules
RAGchain.preprocess.text_splitter.base module
- class RAGchain.preprocess.text_splitter.base.BaseTextSplitter
Bases:
Runnable[List[Document], List[Passage]], ABC
Base class for text splitters. This framework uses its own text splitters instead of the ones from langchain.
- property InputType: Type[Input]
The type of input this runnable accepts specified as a type annotation.
- property OutputType: Type[Output]
The type of output this runnable produces specified as a type annotation.
- invoke(input: Input, config: RunnableConfig | None = None) Output
Transform a single input into an output. Override to implement.
- Args:
input: The input to the runnable.
config: A config to use when invoking the runnable. The config supports standard keys like ‘tags’ and ‘metadata’ for tracing purposes, ‘max_concurrency’ for controlling how much work to do in parallel, and other keys. Please refer to the RunnableConfig for more details.
- Returns:
The output of the runnable.
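The List[Document] to List[Passage] contract described above can be illustrated with a small stand-alone sketch. It does not import RAGchain or langchain: Document, Passage, SketchTextSplitter, split_one, and FixedWidthSplitter are hypothetical stand-ins invented for this example, and the real base class may define different abstract methods.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Document:
    """Stand-in for a loaded document (hypothetical, not the langchain class)."""
    page_content: str


@dataclass
class Passage:
    """Stand-in for a passage produced by a splitter (hypothetical)."""
    content: str


class SketchTextSplitter(ABC):
    """Hypothetical base class: a list of documents in, a list of passages out."""

    def invoke(self, input: List[Document], config: Optional[dict] = None) -> List[Passage]:
        # The config argument is accepted for shape compatibility and ignored here.
        passages: List[Passage] = []
        for document in input:
            passages.extend(self.split_one(document))
        return passages

    @abstractmethod
    def split_one(self, document: Document) -> List[Passage]:
        """Split a single document. Subclasses override this."""


class FixedWidthSplitter(SketchTextSplitter):
    """Toy subclass that cuts text into fixed-width character windows."""

    def __init__(self, width: int) -> None:
        self.width = width

    def split_one(self, document: Document) -> List[Passage]:
        text = document.page_content
        return [Passage(text[i:i + self.width]) for i in range(0, len(text), self.width)]
```

Calling FixedWidthSplitter(width=3).invoke([...]) on a list of these stand-in documents returns one flat list of passages, which mirrors the input and output types named in the Bases line.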
RAGchain.preprocess.text_splitter.code_splitter module
- class RAGchain.preprocess.text_splitter.code_splitter.CodeSplitter(language_name: str = 'PYTHON', chunk_size: int = 50, chunk_overlap: int = 0, **kwargs)
Bases:
BaseTextSplitter
CodeSplitter is a text splitter that splits documents using the language-specific separators defined by the Language enum in langchain. It inherits from BaseTextSplitter and delegates the splitting to the from_language method of langchain’s RecursiveCharacterTextSplitter. Supported languages: CPP, GO, JAVA, KOTLIN, JS, TS, PHP, PROTO, PYTHON, RST, RUBY, RUST, SCALA, SWIFT, MARKDOWN, LATEX, HTML, SOL, CSHARP.
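The idea of language-specific separators can be sketched with the standard library alone. The SEPARATORS table and the split_code function below are assumptions made for illustration. They are not the separator lists langchain actually uses, and the sketch ignores chunk_size and chunk_overlap.

```python
import re
from typing import Dict, List

# Hypothetical, heavily simplified separator table keyed by language name.
SEPARATORS: Dict[str, List[str]] = {
    "PYTHON": ["\nclass ", "\ndef "],
    "GO": ["\nfunc ", "\ntype "],
}


def split_code(source: str, language_name: str = "PYTHON") -> List[str]:
    """Cut source code just before each language-specific separator."""
    separators = SEPARATORS[language_name.upper()]
    # A zero-width lookahead keeps the separator at the start of the next piece.
    pattern = "(?=" + "|".join(re.escape(sep) for sep in separators) + ")"
    pieces = re.split(pattern, source)
    return [piece.strip("\n") for piece in pieces if piece.strip()]
```

Splitting before the separator rather than on it keeps each def or class keyword attached to its own body, so every piece remains readable on its own.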
RAGchain.preprocess.text_splitter.html_header_splitter module
- class RAGchain.preprocess.text_splitter.html_header_splitter.HTMLHeaderSplitter(headers_to_split_on: Tuple[str, str] | None = None, return_each_element: bool = False)
Bases:
BaseTextSplitter
The HTMLHeaderSplitter class in the RAGchain library is a text splitter that splits documents based on HTML headers. This class inherits from the BaseTextSplitter class and uses the HTMLHeaderTextSplitter.
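Header-based HTML splitting can be approximated with Python’s built-in html.parser. This is a minimal sketch under stated assumptions, not the implementation behind HTMLHeaderSplitter: HeaderSectionParser, split_by_html_headers, and the fixed HEADER_TAGS set are hypothetical names introduced here.

```python
from html.parser import HTMLParser
from typing import Dict, List, Optional, Tuple

HEADER_TAGS = {"h1", "h2", "h3"}  # assumed set of headers to split on


class HeaderSectionParser(HTMLParser):
    """Collect (active headers, text) pairs, starting a new section at each header."""

    def __init__(self) -> None:
        super().__init__()
        self.sections: List[Tuple[Dict[str, str], str]] = []
        self._headers: Dict[str, str] = {}
        self._text: List[str] = []
        self._open_header: Optional[str] = None
        self._header_text: List[str] = []

    def flush(self) -> None:
        # Normalize whitespace and store the section if it has any text.
        content = " ".join(" ".join(self._text).split())
        if content:
            self.sections.append((dict(self._headers), content))
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag in HEADER_TAGS:
            self.flush()
            self._open_header = tag
            self._header_text = []

    def handle_endtag(self, tag):
        if tag == self._open_header:
            level = int(tag[1])
            # A new header replaces headers at the same or a deeper level.
            self._headers = {k: v for k, v in self._headers.items() if int(k[1]) < level}
            self._headers[tag] = " ".join("".join(self._header_text).split())
            self._open_header = None

    def handle_data(self, data):
        if self._open_header:
            self._header_text.append(data)
        else:
            self._text.append(data)


def split_by_html_headers(html: str) -> List[Tuple[Dict[str, str], str]]:
    parser = HeaderSectionParser()
    parser.feed(html)
    parser.close()
    parser.flush()  # emit the final section
    return parser.sections
```

Each returned pair carries the headers in effect for that section, which is the kind of metadata a header-aware splitter can attach to its passages.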
RAGchain.preprocess.text_splitter.markdown_header_splitter module
- class RAGchain.preprocess.text_splitter.markdown_header_splitter.MarkDownHeaderSplitter(headers_to_split_on: List[tuple[str, str]] | None = None, return_each_line: bool = False)
Bases:
BaseTextSplitter
MarkDownHeaderSplitter splits a document into passages based on the document’s Markdown headers, as specified by the list of header separators in headers_to_split_on. Its behavior is largely similar to langchain’s MarkdownHeaderTextSplitter.
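As a rough illustration of header-based splitting, the sketch below groups Markdown lines under the headers that precede them. It uses only the standard library. The function name split_by_markdown_headers and the h1..h6 metadata keys are assumptions for this example, and it does not handle edge cases such as '#' characters inside fenced code.

```python
import re
from typing import Dict, List, Tuple

HEADER_RE = re.compile(r"^(#{1,6})\s+(.*)$")


def split_by_markdown_headers(text: str) -> List[Tuple[Dict[str, str], str]]:
    """Return (header_metadata, body) pairs, one per header-delimited section."""
    sections: List[Tuple[Dict[str, str], str]] = []
    current_headers: Dict[str, str] = {}
    body: List[str] = []

    def flush() -> None:
        content = "\n".join(body).strip()
        if content:
            sections.append((dict(current_headers), content))
        body.clear()

    for line in text.splitlines():
        match = HEADER_RE.match(line)
        if match:
            flush()
            level = len(match.group(1))
            # Drop headers at the same or a deeper level than the new one.
            for key in [k for k in current_headers if int(k[1:]) >= level]:
                del current_headers[key]
            current_headers[f"h{level}"] = match.group(2).strip()
        else:
            body.append(line)
    flush()
    return sections
```

Keeping the active header hierarchy alongside each body lets a downstream retriever see which section a passage came from.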
RAGchain.preprocess.text_splitter.text_splitter module
- class RAGchain.preprocess.text_splitter.text_splitter.RecursiveTextSplitter(separators: List[str] | None = None, keep_separator: bool = True, *args, **kwargs)
Bases:
BaseTextSplitter
Split a document into passages by recursively splitting on a list of separators. You can specify a window_size and overlap_size to split the document into overlapping passages.
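Recursive splitting on a list of separators can be sketched as follows. This is a simplified stand-alone illustration, not the RecursiveTextSplitter implementation: recursive_split and its chunk_size parameter are hypothetical, separators are discarded rather than kept, and small pieces are neither merged nor overlapped.

```python
from typing import List


def recursive_split(text: str, separators: List[str], chunk_size: int) -> List[str]:
    """Split on the first separator, recursing with the remaining (finer)
    separators for any piece still longer than chunk_size characters."""
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators:
        # No separators left: fall back to hard cuts of chunk_size characters.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    head, *rest = separators
    chunks: List[str] = []
    for piece in text.split(head):
        chunks.extend(recursive_split(piece, rest, chunk_size))
    return chunks


if __name__ == "__main__":
    sample = "First paragraph.\n\nSecond paragraph that is a bit longer."
    for chunk in recursive_split(sample, ["\n\n", " "], chunk_size=20):
        print(repr(chunk))
```

Ordering the separators from coarse to fine means paragraphs stay intact when they fit, and only oversized pieces are broken at word or character boundaries.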
RAGchain.preprocess.text_splitter.token_splitter module
- class RAGchain.preprocess.text_splitter.token_splitter.TokenSplitter(tokenizer_name: str = 'tiktoken', chunk_size: int = 100, chunk_overlap: int = 0, pretrained_model_name: str = 'gpt2', **kwargs)
Bases:
BaseTextSplitter
TokenSplitter splits the text of a document into smaller passages by token count. It supports several tokenization methods: ‘tiktoken’, ‘spaCy’, ‘SentenceTransformers’, ‘NLTK’, and ‘huggingFace’.
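The interaction between chunk_size and chunk_overlap can be shown with a small sliding-window sketch. Whitespace-separated words stand in for real tokens here, so this is an assumption-laden illustration, not any of the tokenizers listed above. The function name split_by_tokens is hypothetical.

```python
from typing import List


def split_by_tokens(text: str, chunk_size: int = 100, chunk_overlap: int = 0) -> List[str]:
    """Group tokens into windows of chunk_size that share chunk_overlap tokens."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    tokens = text.split()  # whitespace words stand in for a real tokenizer
    step = chunk_size - chunk_overlap
    chunks: List[str] = []
    for start in range(0, len(tokens), step):
        chunks.append(" ".join(tokens[start:start + chunk_size]))
        if start + chunk_size >= len(tokens):
            break  # the last window already reached the end of the text
    return chunks
```

With chunk_size=3 and chunk_overlap=1, each window starts two tokens after the previous one, so consecutive passages share exactly one token.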