RAGchain.preprocess.loader package

Submodules

RAGchain.preprocess.loader.dataset_loader module

class RAGchain.preprocess.loader.dataset_loader.KoStrategyQALoader(*args, **kwargs)

Bases: BaseLoader

KoStrategyQA dataset loader. The dataset is downloaded from Hugging Face over the internet.

load() List[Document]

Load data into Document objects.

RAGchain.preprocess.loader.deepdoctection_loader module

class RAGchain.preprocess.loader.deepdoctection_loader.DeepdoctectionPDFLoader(file_path: str, deepdoctection_host: str)

Bases: BasePDFLoader

Load a PDF file using NomaDamas’ Deepdoctection API server. You can run the Deepdoctection API server using the Dockerfile at https://github.com/NomaDamas/deepdoctection-api-server

extract_pages(result: List[Dict[str, Any]]) List[Dict[str, Any]]
static find_positions(text, substring)
lazy_load(*args, **kwargs) Iterator[Document]

Lazily load a PDF file using the Deepdoctection API server. Returns an iterator of Document.

load(*args, **kwargs) List[Document]

Load a PDF file using the Deepdoctection API server. Returns a list of Document.

RAGchain.preprocess.loader.excel_loader module

class RAGchain.preprocess.loader.excel_loader.ExcelLoader(path: str, sheet_name: str | None = None, *args, **kwargs)

Bases: BaseLoader

Load a document from an Excel file.

lazy_load() Iterator[Document]

A lazy loader for Documents.

load() List[Document]

Load data into Document objects.
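The loaders in this package expose both lazy_load and load. As a generic illustration of the pattern those two signatures suggest, the sketch below uses a hypothetical RowLoader class over in-memory strings. It is not a RAGchain class, and it is an assumption that the library's loaders relate the two methods in the same way.

```python
from typing import Iterator, List


class RowLoader:
    """Hypothetical loader over in-memory rows; not a RAGchain class."""

    def __init__(self, rows: List[str]) -> None:
        self.rows = rows

    def lazy_load(self) -> Iterator[str]:
        # A generator yields one item at a time, so nothing is built up front.
        for row in self.rows:
            yield row.strip()

    def load(self) -> List[str]:
        # The eager variant simply drains the lazy one into a list.
        return list(self.lazy_load())
```

The lazy form is useful when a source is large and items can be processed one by one; the eager form is convenient when the whole result is needed at once.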

RAGchain.preprocess.loader.file_loader module

class RAGchain.preprocess.loader.file_loader.FileLoader(target_dir: str, hwp_host_url: str, *args, **kwargs)

Bases: BaseLoader

Loads documents from a directory. You can load .txt, .pdf, .csv, .xlsx, .hwp files.

lazy_load(filter_ext: List[str] | None = None) Iterator[Document]

Lazily load all files in the target directory.

Parameters:
  • filter_ext – If not None, only files with the given extensions will be loaded. filter_ext elements must contain the dot (.) prefix.

load(filter_ext: List[str] | None = None) List[Document]

Load all files in the target directory.

Parameters:
  • filter_ext – If not None, only files with the given extensions will be loaded. filter_ext elements must contain the dot (.) prefix.
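To illustrate why filter_ext elements need the dot prefix, here is a minimal sketch of extension filtering built on pathlib. The helper name iter_files, the recursive directory walk, and the case-insensitive comparison are assumptions made for this example; this is not FileLoader's actual implementation.

```python
from pathlib import Path
from typing import Iterator, List, Optional


def iter_files(target_dir: str, filter_ext: Optional[List[str]] = None) -> Iterator[Path]:
    """Yield files under target_dir, optionally restricted to given extensions."""
    wanted = None if filter_ext is None else {ext.lower() for ext in filter_ext}
    for path in sorted(Path(target_dir).rglob("*")):
        if not path.is_file():
            continue
        # Path.suffix includes the leading dot, so ".pdf" matches but "pdf" never does.
        if wanted is not None and path.suffix.lower() not in wanted:
            continue
        yield path
```

With this sketch, passing [".pdf", ".hwp"] keeps only those two file types, while passing ["pdf"] silently matches nothing.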

RAGchain.preprocess.loader.hwp_loader module

class RAGchain.preprocess.loader.hwp_loader.HwpLoader(path: str, hwp_host_url: str, retry_connection: int = 4)

Bases: BaseLoader

Load Hwp files.

Converts hwp to text using hwp-converter-api, which is available at https://github.com/NomaDamas/hwp-converter-api

async async_request()
lazy_load() Iterator[Document]

Load a document lazily. This method uses asyncio requests.

load() List[Document]

Load a document.

RAGchain.preprocess.loader.mathpix_markdown_loader module

class RAGchain.preprocess.loader.mathpix_markdown_loader.MathpixMarkdownLoader(filepath: str)

Bases: BaseLoader

Load a mathpix markdown file. Mathpix markdown (.mmd) is a markdown format for science papers. This class supports splitting the file into the sections and tables of a science paper.

lazy_load(split_section: bool = True, split_table: bool = True) Iterator[Document]
Parameters:
  • split_section – If True, split the file into sections. Default is True.

  • split_table – If True, split the file into tables. Default is True.

Returns:

Iterator of Document. If split_section and split_table are True, the result contains multiple Documents.

The sections and tables appear in the same order as in the file.

load(split_section: bool = True, split_table: bool = True) List[Document]
Parameters:
  • split_section – If True, split the file into sections. Default is True.

  • split_table – If True, split the file into tables. Default is True.

Returns:

List of Document. If split_section and split_table are True, the list contains multiple Documents.

The sections and tables appear in the same order as in the file.

static split_section(content: str) List[str]

Split mathpix markdown content into sections by ‘#’.
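As a rough illustration of splitting by ‘#’, the sketch below cuts markdown text before each heading line. The function name split_by_heading and the exact rule (a line starting with one or more ‘#’ followed by a space) are assumptions for this example; the library's split_section may treat edge cases differently.

```python
import re
from typing import List


def split_by_heading(content: str) -> List[str]:
    """Split markdown text into chunks, each starting at a '#' heading line."""
    # Zero-width split before any line that begins with '#' characters and a space.
    parts = re.split(r"(?m)^(?=#+ )", content)
    # Drop empty chunks and trim surrounding whitespace.
    return [part.strip() for part in parts if part.strip()]
```

Any text that precedes the first heading is kept as its own chunk, so no content is lost.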

static split_table(content: str) List[str]

Split table from mathpix markdown content.

Parameters:
  • content – mathpix markdown content

Returns:

The odd index is the content without table, and the even index is the table.
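The alternating text/table result can be pictured with a small sketch. It assumes that tables appear as LaTeX-style \begin{table} ... \end{table} blocks in the .mmd content; that delimiter, the name split_tables, and the regular expression are assumptions for this example rather than the library's code.

```python
import re
from typing import List

# Non-greedy match so that each table block is captured separately.
TABLE_PATTERN = re.compile(r"(\\begin\{table\}.*?\\end\{table\})", re.DOTALL)


def split_tables(content: str) -> List[str]:
    """Return alternating chunks: text, table, text, table, ..., text."""
    # The capturing group makes re.split keep the matched tables in the result.
    return TABLE_PATTERN.split(content)
```

In this sketch the first, third, fifth, ... elements are non-table text and the elements between them are tables. A text element may be an empty string when two tables are adjacent or a table sits at the very start or end.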

RAGchain.preprocess.loader.nougat_pdf_loader module

class RAGchain.preprocess.loader.nougat_pdf_loader.NougatPDFLoader(file_path: str, nougat_host: str)

Bases: BasePDFLoader

Load a PDF file using a Nougat API server. You can run a Nougat API server using the Dockerfile at https://github.com/facebookresearch/nougat

lazy_load(split_section: bool = True, split_table: bool = True, *args, **kwargs) Iterator[Document]
Parameters:
  • split_section – If True, split the document by section.

  • split_table – If True, split the document by table.

  • start – Start page number to load. Optional.

  • stop – Stop page number to load. Optional.

load(split_section: bool = True, split_table: bool = True, *args, **kwargs) List[Document]
Parameters:
  • split_section – If True, split the document by section.

  • split_table – If True, split the document by table.

  • start – Start page number to load. Optional.

  • stop – Stop page number to load. Optional.

RAGchain.preprocess.loader.rem_loader module

class RAGchain.preprocess.loader.rem_loader.RemLoader(path: str, time_range: List[datetime] | None = None)

Bases: BaseLoader

Load rem storage files from a rem sqlite database. You can set a time range to load.

lazy_load() Iterator[Document]

A lazy loader for Documents.

load() List[Document]

Load data into Document objects.

RAGchain.preprocess.loader.rust_hwp_loader module

class RAGchain.preprocess.loader.rust_hwp_loader.RustHwpLoader(path: str)

Bases: BaseLoader

Load an HWP file using libhwp. It works on any OS. Using load or lazy_load, you can get a list of Documents from an hwp file. This loader loads all paragraphs and tables from the hwp file. The first Document contains all paragraphs from the hwp file, including the text in each table. The following Documents each contain the paragraphs of one table. Unfortunately, you can’t distinguish rows and columns in a table.

The metadata contains the file path under the key ‘source’ and page_type, which is either ‘text’ or ‘table’.
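Since each Document carries a page_type of ‘text’ or ‘table’, downstream code can separate the two kinds. The sketch below uses a stand-in Document dataclass with only page_content and metadata fields so that it runs on its own; the helper name split_by_page_type and the sample values are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class Document:
    """Stand-in for the real Document class; only the fields used here."""
    page_content: str
    metadata: Dict[str, str] = field(default_factory=dict)


def split_by_page_type(docs: List[Document]) -> Tuple[List[Document], List[Document]]:
    """Separate documents into (text documents, table documents)."""
    texts = [d for d in docs if d.metadata.get("page_type") == "text"]
    tables = [d for d in docs if d.metadata.get("page_type") == "table"]
    return texts, tables


# Hypothetical loader output: one text Document followed by one table Document.
docs = [
    Document("All paragraphs ...", {"source": "report.hwp", "page_type": "text"}),
    Document("Table paragraphs ...", {"source": "report.hwp", "page_type": "table"}),
]
texts, tables = split_by_page_type(docs)
```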

Other hwp loaders are recommended, but this loader is a good option on Mac and Linux. It needs neither an external hwp loader server nor the hwp program, which is only available on Windows.

lazy_load() Iterator[Document]

A lazy loader for Documents.

load() List[Document]

Load data into Document objects.

RAGchain.preprocess.loader.win32_hwp_loader module

class RAGchain.preprocess.loader.win32_hwp_loader.Win32HwpLoader(path: str)

Bases: BaseLoader

Load an HWP file using pywin32. It works only on Windows. Using load or lazy_load, you can get a list of Documents from an hwp or hwpx file. This loader loads all paragraphs and tables from the hwp or hwpx file. The first Document contains all paragraphs, excluding the text in each table. The following Documents each contain one table. All table contents are converted to HTML, so you can recover rows, columns, or any complicated table structure.

The metadata contains the file path under the key ‘source’ and page_type, which is either ‘text’ or ‘table’.

This loader is a good option for loading complicated tables from hwp or hwpx files. However, it is only available on Windows, so choose another hwp loader if you are on Mac or Linux.
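Because table Documents hold HTML, rows and columns can be recovered with the standard library alone. The following is a minimal sketch using html.parser; the exact HTML this loader emits is not specified here, so the sample markup is an assumption, and the sketch ignores rowspan, colspan, and nested tables.

```python
from html.parser import HTMLParser
from typing import List


class TableRows(HTMLParser):
    """Collect the cell text of an HTML table as a list of rows."""

    def __init__(self) -> None:
        super().__init__()
        self.rows: List[List[str]] = []
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            # Each <tr> starts a new row.
            self.rows.append([])
        elif tag in ("td", "th") and self.rows:
            # Each cell starts as an empty string in the current row.
            self.rows[-1].append("")
            self._in_cell = True

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._in_cell:
            # Trim whitespace once the cell is complete.
            self.rows[-1][-1] = self.rows[-1][-1].strip()
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell:
            self.rows[-1][-1] += data


def html_table_to_rows(html: str) -> List[List[str]]:
    """Parse one HTML table into a list of rows of cell strings."""
    parser = TableRows()
    parser.feed(html)
    parser.close()
    return parser.rows
```

For merged cells or nested tables, a dedicated HTML or dataframe library would likely be a better fit than this hand-rolled parser.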

static convert_hwp_to_hwpx(input_filepath, output_filepath)
lazy_load() Iterator[Document]

A lazy loader for Documents.

load() List[Document]

Load data into Document objects.

preprocessor() tuple[List, List]

Module contents