RAGchain.preprocess.loader package
Submodules
RAGchain.preprocess.loader.dataset_loader module
RAGchain.preprocess.loader.deepdoctection_loader module
- class RAGchain.preprocess.loader.deepdoctection_loader.DeepdoctectionPDFLoader(file_path: str, deepdoctection_host: str)
Bases:
BasePDFLoader
Load a PDF file using NomaDamas’ Deepdoctection API server. You can run the server from the Dockerfile at https://github.com/NomaDamas/deepdoctection-api-server
- extract_pages(result: List[Dict[str, Any]]) List[Dict[str, Any]]
- static find_positions(text, substring)
- lazy_load(*args, **kwargs) Iterator[Document]
Lazily load a PDF file using the Deepdoctection API server. Returns an iterator of Document.
- load(*args, **kwargs) List[Document]
Load a PDF file using the Deepdoctection API server. Returns a list of Document.
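The difference between the two return types can be sketched without an API server. The class below is a hypothetical stand-in, not DeepdoctectionPDFLoader: it uses plain strings instead of Documents, and it assumes that load simply collects everything lazy_load yields, which this documentation does not confirm.

```python
from typing import Iterator, List


class PageLoaderSketch:
    """Hypothetical stand-in loader; pages are plain strings, not Documents."""

    def __init__(self, pages: List[str]) -> None:
        self.pages = pages
        self.produced = 0  # how many pages have actually been produced so far

    def lazy_load(self) -> Iterator[str]:
        # Generator: nothing runs until the caller asks for the next item.
        for page in self.pages:
            self.produced += 1
            yield page

    def load(self) -> List[str]:
        # Assumption: load() materializes everything lazy_load() yields.
        return list(self.lazy_load())
```

With a lazy iterator, work is done one item at a time as the caller consumes it, while load does all the work up front and returns the full list.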
RAGchain.preprocess.loader.excel_loader module
RAGchain.preprocess.loader.file_loader module
- class RAGchain.preprocess.loader.file_loader.FileLoader(target_dir: str, hwp_host_url: str, *args, **kwargs)
Bases:
BaseLoader
Loads documents from a directory. You can load .txt, .pdf, .csv, .xlsx, .hwp files.
- lazy_load(filter_ext: List[str] | None = None) Iterator[Document]
Lazily load all files in the target directory.
- Parameters:
filter_ext – If not None, only files with the given extensions are loaded. Each element must include the leading dot (.), for example .pdf.
- load(filter_ext: List[str] | None = None) List[Document]
Load all files in the target directory.
- Parameters:
filter_ext – If not None, only files with the given extensions are loaded. Each element must include the leading dot (.), for example .pdf.
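The dot-prefix rule for filter_ext can be illustrated with a minimal standalone sketch. It does not call FileLoader; the helper name iter_files_by_ext is hypothetical. It assumes exact, case-sensitive suffix matching and a recursive directory walk, neither of which is stated in this documentation.

```python
from pathlib import Path
from typing import Iterator, List, Optional


def iter_files_by_ext(
    target_dir: str, filter_ext: Optional[List[str]] = None
) -> Iterator[Path]:
    """Hypothetical helper: lazily yield files whose suffix is in filter_ext.

    Assumption: matching is exact and the walk is recursive; FileLoader may differ.
    """
    for path in sorted(Path(target_dir).rglob("*")):
        if not path.is_file():
            continue
        # Path.suffix includes the leading dot (".pdf"), so "pdf" never matches.
        if filter_ext is None or path.suffix in filter_ext:
            yield path
```

Because pathlib reports suffixes with the dot, passing "pdf" instead of ".pdf" silently matches nothing in this sketch, which is the kind of mistake the dot-prefix requirement guards against.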
RAGchain.preprocess.loader.hwp_loader module
- class RAGchain.preprocess.loader.hwp_loader.HwpLoader(path: str, hwp_host_url: str, retry_connection: int = 4)
Bases:
BaseLoader
Load HWP files.
Converts HWP to text using hwp-converter-api, available at https://github.com/NomaDamas/hwp-converter-api
- async async_request()
- lazy_load() Iterator[Document]
Load a document lazily. This method uses asyncio requests.
- load() List[Document]
Load a document.
RAGchain.preprocess.loader.mathpix_markdown_loader module
- class RAGchain.preprocess.loader.mathpix_markdown_loader.MathpixMarkdownLoader(filepath: str)
Bases:
BaseLoader
Load a Mathpix Markdown file. Mathpix Markdown (.mmd) is a Markdown format for scientific papers. This class can split the file into the sections and tables of a paper.
- lazy_load(split_section: bool = True, split_table: bool = True) Iterator[Document]
- Parameters:
split_section – If True, split the file into sections. Default is True.
split_table – If True, split the file into tables. Default is True.
- Returns:
Iterator of Document. If split_section and split_table are True, the iterator yields multiple Documents.
Sections and tables appear in the same order as in the file.
- load(split_section: bool = True, split_table: bool = True) List[Document]
- Parameters:
split_section – If True, split the file into sections. Default is True.
split_table – If True, split the file into tables. Default is True.
- Returns:
List of Document. If split_section and split_table are True, the list contains multiple Documents.
Sections and tables appear in the same order as in the file.
- static split_section(content: str) List[str]
Split Mathpix Markdown content into sections at ‘#’.
- static split_table(content: str) List[str]
Split tables out of Mathpix Markdown content.
- Parameters:
content – Mathpix Markdown content.
- Returns:
A list in which the odd index is the content without table, and the even index is the table.
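The two splitting steps can be sketched as follows. This is an illustration, not the library's implementation: the function names are hypothetical, sections are assumed to start at any line beginning with ‘#’, and tables are assumed to be LaTeX-style \begin{table} ... \end{table} blocks.

```python
import re
from typing import List

# Assumption: tables appear as \begin{table} ... \end{table} blocks.
TABLE_PATTERN = re.compile(r"\\begin\{table\}.*?\\end\{table\}", re.DOTALL)


def split_section_sketch(content: str) -> List[str]:
    """Start a new chunk at every line that begins with '#'."""
    sections: List[str] = []
    current: List[str] = []
    for line in content.splitlines():
        if line.startswith("#") and current:
            sections.append("\n".join(current))
            current = []
        current.append(line)
    if current:
        sections.append("\n".join(current))
    return sections


def split_table_sketch(content: str) -> List[str]:
    """Interleave non-table text and tables, keeping document order."""
    tables = TABLE_PATTERN.findall(content)
    non_tables = TABLE_PATTERN.split(content)
    result: List[str] = []
    for i, text in enumerate(non_tables):
        result.append(text)
        if i < len(tables):
            result.append(tables[i])
    return result
```

In this sketch the result alternates text, table, text, and so on, so counting positions from one puts text at odd positions and tables at even positions. Whether the library counts positions the same way is an assumption.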
RAGchain.preprocess.loader.nougat_pdf_loader module
- class RAGchain.preprocess.loader.nougat_pdf_loader.NougatPDFLoader(file_path: str, nougat_host: str)
Bases:
BasePDFLoader
Load a PDF file using a Nougat API server. You can build the server from the Dockerfile at https://github.com/facebookresearch/nougat
- lazy_load(split_section: bool = True, split_table: bool = True, *args, **kwargs) Iterator[Document]
- Parameters:
split_section – If True, split the document by section.
split_table – If True, split the document by table.
start – Start page number to load. Optional.
stop – Stop page number to load. Optional.
- load(split_section: bool = True, split_table: bool = True, *args, **kwargs) List[Document]
- Parameters:
split_section – If True, split the document by section.
split_table – If True, split the document by table.
start – Start page number to load. Optional.
stop – Stop page number to load. Optional.
RAGchain.preprocess.loader.pdf_link_loader module
RAGchain.preprocess.loader.rem_loader module
- class RAGchain.preprocess.loader.rem_loader.RemLoader(path: str, time_range: List[datetime] | None = None)
Bases:
BaseLoader
Load a rem storage file from a rem SQLite database. You can optionally set a time range to load.
- lazy_load() Iterator[Document]
A lazy loader for Documents.
- load() List[Document]
Load data into Document objects.
RAGchain.preprocess.loader.rust_hwp_loader module
- class RAGchain.preprocess.loader.rust_hwp_loader.RustHwpLoader(path: str)
Bases:
BaseLoader
Load an HWP file using libhwp. It works on any OS. load and lazy_load return Documents from the HWP file, covering all of its paragraphs and tables. The first Document contains all paragraphs of the file, including the text inside each table. The following Documents each contain the paragraphs of one table. Unfortunately, rows and columns within a table cannot be distinguished.
The metadata holds the file path under the key ‘source’ and a page_type, which is ‘text’ or ‘table’.
Other HWP loaders are recommended when available, but this loader is a good option on Mac and Linux: it needs neither an external HWP loader server nor the HWP program, which is only available on Windows.
- lazy_load() Iterator[Document]
A lazy loader for Documents.
- load() List[Document]
Load data into Document objects.
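Since each Document carries a page_type of ‘text’ or ‘table’ in its metadata, the loader output can be separated with a small helper. This is a minimal sketch: the Document dataclass is a stand-in assumed to expose page_content and metadata, and split_by_page_type is a hypothetical name, not part of RAGchain.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class Document:
    """Stand-in for the real Document class (assumed to expose these fields)."""

    page_content: str
    metadata: Dict[str, str] = field(default_factory=dict)


def split_by_page_type(docs: List[Document]) -> Tuple[List[Document], List[Document]]:
    """Separate body-text Documents from table Documents via metadata['page_type']."""
    texts = [d for d in docs if d.metadata.get("page_type") == "text"]
    tables = [d for d in docs if d.metadata.get("page_type") == "table"]
    return texts, tables
```

Separating the two groups is useful with this loader in particular, because the first Document already includes the table text and indexing both groups together would duplicate it.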
RAGchain.preprocess.loader.win32_hwp_loader module
- class RAGchain.preprocess.loader.win32_hwp_loader.Win32HwpLoader(path: str)
Bases:
BaseLoader
Load an HWP file using pywin32. It works only on Windows. load and lazy_load return Documents from an hwp or hwpx file, covering all of its paragraphs and tables. The first Document contains all paragraphs, excluding the text inside tables. The following Documents each contain one table. All table contents are converted to HTML, so rows, columns, and more complicated table structures are preserved.
The metadata holds the file path under the key ‘source’ and a page_type, which is ‘text’ or ‘table’.
This loader is a good option for loading complicated tables from hwp or hwpx files. It is only available on Windows, so choose another HWP loader if you need to run on Mac or Linux.
- static convert_hwp_to_hwpx(input_filepath, output_filepath)
- lazy_load() Iterator[Document]
A lazy loader for Documents.
- load() List[Document]
Load data into Document objects.
- preprocessor() tuple[List, List]
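Because table Documents from Win32HwpLoader hold HTML, their rows and cells can be recovered with the standard library. The sketch below is one possible way to do it, not part of RAGchain: the names TableRows and html_table_to_rows are hypothetical, the exact HTML the loader emits is not documented here, and the parser ignores rowspan, colspan, and nested tables.

```python
from html.parser import HTMLParser
from typing import List


class TableRows(HTMLParser):
    """Collect the cell text of each <tr> into a list of rows."""

    def __init__(self) -> None:
        super().__init__()
        self.rows: List[List[str]] = []
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])  # a new row begins
        elif tag in ("td", "th") and self.rows:
            self.rows[-1].append("")  # a new cell in the current row
            self._in_cell = True

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell:
            self.rows[-1][-1] += data


def html_table_to_rows(html: str) -> List[List[str]]:
    """Return the table as a list of rows, each a list of stripped cell strings."""
    parser = TableRows()
    parser.feed(html)
    parser.close()
    return [[cell.strip() for cell in row] for row in parser.rows]
```

For tables with merged cells, a dedicated HTML table parser would be a better fit than this sketch.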