RAGchain.preprocess.text_splitter package

Submodules

RAGchain.preprocess.text_splitter.base module

class RAGchain.preprocess.text_splitter.base.BaseTextSplitter

Bases: Runnable[List[Document], List[Passage]], ABC

Base class for text splitters. In this framework, we use our own text splitters instead of the ones from langchain.

property InputType: Type[Input]

The type of input this runnable accepts specified as a type annotation.

property OutputType: Type[Output]

The type of output this runnable produces specified as a type annotation.

invoke(input: Input, config: RunnableConfig | None = None) Output

Transform a single input into an output. Override to implement.

Args:

input: The input to the runnable.

config: A config to use when invoking the runnable. The config supports standard keys like ‘tags’ and ‘metadata’ for tracing purposes, ‘max_concurrency’ for controlling how much work to do in parallel, and other keys. Please refer to the RunnableConfig for more details.

Returns:

The output of the runnable.

abstract split_document(document: Document) List[Passage]

Split a document into passages.

split_documents(documents: List[Document]) List[List[Passage]]

Split a list of documents into passages. The returned passages form a 2D list, where the first dimension is the document index and the second dimension is the passage index.
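The 2D return shape can be sketched in plain Python. Here strings stand in for RAGchain’s Document and Passage types, and the paragraph-based split_document is a hypothetical stand-in, not the library’s implementation:

```python
from typing import List

def split_document(document: str) -> List[str]:
    # Hypothetical stand-in splitter: one passage per paragraph.
    return [p for p in document.split("\n\n") if p]

def split_documents(documents: List[str]) -> List[List[str]]:
    # 2D result: the first index selects the document, the second the passage.
    return [split_document(doc) for doc in documents]

docs = ["first passage\n\nsecond passage", "only passage"]
result = split_documents(docs)
# result[0] holds the passages of docs[0]; result[1] those of docs[1]
```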

RAGchain.preprocess.text_splitter.code_splitter module

class RAGchain.preprocess.text_splitter.code_splitter.CodeSplitter(language_name: str = 'PYTHON', chunk_size: int = 50, chunk_overlap: int = 0, **kwargs)

Bases: BaseTextSplitter

The CodeSplitter class in the RAGchain library is a text splitter that splits documents using the language-specific separators defined in langchain’s Language enum. It inherits from the BaseTextSplitter class and uses the from_language method of langchain’s RecursiveCharacterTextSplitter class to perform the splitting. CodeSplitter supports CPP, GO, JAVA, KOTLIN, JS, TS, PHP, PROTO, PYTHON, RST, RUBY, RUST, SCALA, SWIFT, MARKDOWN, LATEX, HTML, SOL, and CSHARP.
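The idea behind from_language-style splitting can be sketched without langchain: pick the highest-priority separator that occurs in the text, split on it, and recurse into pieces that are still too large. The separator list below is an illustrative subset for Python, not langchain’s actual list:

```python
from typing import List

# Illustrative subset of Python-aware separators, in priority order.
PYTHON_SEPARATORS = ["\nclass ", "\ndef ", "\n\n", "\n", " "]

def split_code(text: str, separators: List[str], chunk_size: int = 50) -> List[str]:
    # Base case: the piece already fits in one chunk.
    if len(text) <= chunk_size:
        return [text]
    for i, sep in enumerate(separators):
        if sep in text:
            parts = text.split(sep)
            chunks = []
            for j, part in enumerate(parts):
                # Keep the separator attached (like keep_separator=True),
                # and recurse with lower-priority separators only.
                piece = (sep + part) if j > 0 else part
                chunks.extend(split_code(piece, separators[i + 1:], chunk_size))
            return chunks
    return [text]  # no separator applies; return the piece as-is
```

Because separators are kept, joining the chunks reproduces the original source.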

split_document(document: Document) List[Passage]

Split a document into passages.

RAGchain.preprocess.text_splitter.html_header_splitter module

class RAGchain.preprocess.text_splitter.html_header_splitter.HTMLHeaderSplitter(headers_to_split_on: Tuple[str, str] | None = None, return_each_element: bool = False)

Bases: BaseTextSplitter

The HTMLHeaderSplitter class in the RAGchain library is a text splitter that splits documents based on HTML headers. It inherits from the BaseTextSplitter class and uses langchain’s HTMLHeaderTextSplitter to perform the splitting.
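A simplified version of header-based HTML splitting can be built on the standard library’s html.parser. This is a sketch of the technique, not RAGchain’s or langchain’s implementation, and it handles only h1/h2 headers:

```python
from html.parser import HTMLParser
from typing import List, Tuple

class HeaderSectionParser(HTMLParser):
    """Group body text under the most recent h1/h2 heading (sketch only)."""

    def __init__(self):
        super().__init__()
        self.sections: List[Tuple[str, str]] = []  # (header text, body text)
        self.current_header = ""
        self.in_header = False
        self.buffer: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in ("h1", "h2"):
            self._flush()          # a new header closes the previous section
            self.in_header = True

    def handle_endtag(self, tag):
        if tag in ("h1", "h2"):
            self.in_header = False

    def handle_data(self, data):
        if self.in_header:
            self.current_header = data.strip()
        else:
            self.buffer.append(data)

    def _flush(self):
        body = "".join(self.buffer).strip()
        if body:
            self.sections.append((self.current_header, body))
        self.buffer = []

    def close(self):
        super().close()
        self._flush()              # emit the trailing section

def split_html(html: str) -> List[Tuple[str, str]]:
    parser = HeaderSectionParser()
    parser.feed(html)
    parser.close()
    return parser.sections
```

For example, split_html("&lt;h1&gt;A&lt;/h1&gt;&lt;p&gt;one&lt;/p&gt;&lt;h2&gt;B&lt;/h2&gt;&lt;p&gt;two&lt;/p&gt;") yields [("A", "one"), ("B", "two")].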

split_document(document: Document) List[Passage]

Split a document into passages.

RAGchain.preprocess.text_splitter.markdown_header_splitter module

class RAGchain.preprocess.text_splitter.markdown_header_splitter.MarkDownHeaderSplitter(headers_to_split_on: List[tuple[str, str]] | None = None, return_each_line: bool = False)

Bases: BaseTextSplitter

The MarkDownHeaderSplitter splits a document into passages based on the document’s header information, using a configurable list of header separators. Its behavior is similar to langchain’s MarkdownHeaderTextSplitter: the document is split at each header.
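The header-splitting idea can be sketched in a few lines. The headers_to_split_on default mirrors the constructor’s (separator, metadata name) tuples, and, as in langchain’s splitter, header metadata accumulates down the document. A sketch under those assumptions, not the library’s code:

```python
from typing import Dict, List, Tuple

def split_markdown(text: str,
                   headers_to_split_on=(("#", "Header 1"), ("##", "Header 2"))):
    # Walk the lines; whenever a configured header marker starts a line,
    # close the current passage and record the header text as metadata.
    passages: List[Tuple[Dict[str, str], str]] = []
    metadata: Dict[str, str] = {}
    lines: List[str] = []

    def flush():
        body = "\n".join(lines).strip()
        if body:
            passages.append((dict(metadata), body))
        lines.clear()

    # Try the longest marker first so "##" is not mistaken for "#".
    markers = sorted(headers_to_split_on, key=lambda h: -len(h[0]))
    for line in text.splitlines():
        for marker, name in markers:
            if line.startswith(marker + " "):
                flush()
                metadata[name] = line[len(marker) + 1:].strip()
                break
        else:
            lines.append(line)
    flush()
    return passages
```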

split_document(document: Document) List[Passage]

Split a document into passages.

RAGchain.preprocess.text_splitter.text_splitter module

class RAGchain.preprocess.text_splitter.text_splitter.RecursiveTextSplitter(separators: List[str] | None = None, keep_separator: bool = True, *args, **kwargs)

Bases: BaseTextSplitter

Split a document into passages by recursively splitting on a list of separators. You can specify a window_size and overlap_size to split the document into overlapping passages.
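The overlapping-window part of this behavior can be sketched as follows. The window_size and overlap_size names follow the description above; the recursive separator pass is omitted for brevity, so this is an idea sketch rather than the class’s implementation:

```python
from typing import List

def sliding_chunks(text: str, window_size: int = 20, overlap_size: int = 5) -> List[str]:
    # Each window starts window_size - overlap_size characters after the
    # previous one, so consecutive passages share overlap_size characters.
    step = window_size - overlap_size
    if step <= 0:
        raise ValueError("overlap_size must be smaller than window_size")
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + window_size])
        if start + window_size >= len(text):
            break
    return chunks
```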

split_document(document: Document) List[Passage]

Split a document.

RAGchain.preprocess.text_splitter.token_splitter module

class RAGchain.preprocess.text_splitter.token_splitter.TokenSplitter(tokenizer_name: str = 'tiktoken', chunk_size: int = 100, chunk_overlap: int = 0, pretrained_model_name: str = 'gpt2', **kwargs)

Bases: BaseTextSplitter

The TokenSplitter splits a document into passages by token count, dividing the text into smaller chunks measured in tokens rather than characters. The class supports tokenization with ‘tiktoken’, ‘spaCy’, ‘SentenceTransformers’, ‘NLTK’, and ‘huggingFace’.
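Token-based chunking with overlap can be sketched with a naive whitespace tokenizer standing in for tiktoken, spaCy, and the other supported backends; real token counts and boundaries will differ with a proper tokenizer:

```python
from typing import List

def split_by_tokens(text: str, chunk_size: int = 100, chunk_overlap: int = 0) -> List[str]:
    # Whitespace split as a stand-in tokenizer; consecutive chunks share
    # chunk_overlap tokens at their seam.
    tokens = text.split()
    step = chunk_size - chunk_overlap
    if step <= 0:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    chunks = []
    for start in range(0, len(tokens), step):
        window = tokens[start:start + chunk_size]
        chunks.append(" ".join(window))
        if start + chunk_size >= len(tokens):
            break
    return chunks
```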

split_document(document: Document) List[Passage]

Split a document.

Module contents