autorag.data.parse package

Submodules

autorag.data.parse.base module

autorag.data.parse.base.parser_node(func)

autorag.data.parse.clova module

autorag.data.parse.clova.clova_ocr(data_path_list: List[str], url: str | None = None, api_key: str | None = None, batch: int = 5, table_detection: bool = False) -> Tuple[List[str], List[str], List[int]]

Parse documents using Naver Clova OCR.

Parameters:
  • data_path_list – The list of data paths to parse.

  • url – The URL for Clova OCR. See the guide at https://guide.ncloud-docs.com/docs/clovaocr-example01 for how to obtain it. Set it through the CLOVA_URL environment variable or pass it directly as a parameter.

  • api_key – The API key for Clova OCR. See the guide at https://guide.ncloud-docs.com/docs/clovaocr-example01 for how to obtain it. Set it through the CLOVA_API_KEY environment variable or pass it directly as a parameter.

  • batch – The batch size for parsing documents. Default is 5.

  • table_detection – Whether to enable table detection. Default is False.

Returns:

A tuple of lists containing the parsed texts, file paths, and page numbers.
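The url and api_key parameters can come either from an argument or from an environment variable. A minimal sketch of that lookup pattern follows. The helper name resolve_setting, the example URLs, and the rule that an explicit argument wins over the environment are all assumptions made for illustration; autorag's own lookup may differ.

```python
import os
from typing import Optional


def resolve_setting(value: Optional[str], env_name: str) -> str:
    """Return the explicit value if given, else the environment variable.

    Hypothetical helper for illustration; not part of autorag.
    """
    if value is not None:
        return value
    env_value = os.getenv(env_name)
    if env_value is None:
        raise KeyError(f"Pass the value directly or set {env_name}.")
    return env_value


# Placeholder URLs; assumption: an explicit argument takes precedence.
os.environ["CLOVA_URL"] = "https://example.invalid/from-env"
explicit_url = resolve_setting("https://example.invalid/explicit", "CLOVA_URL")
env_url = resolve_setting(None, "CLOVA_URL")
```

Failing early with a clear message when neither source is set tends to be easier to debug than a rejected request later on.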

async autorag.data.parse.clova.clova_ocr_pure(image_data: bytes, image_info: dict, url: str, api_key: str, table_detection: bool = False) -> Tuple[str, str, int]

autorag.data.parse.clova.extract_text_from_fields(fields)

autorag.data.parse.clova.generate_image_info(pdf_path: str, num_pages: int) -> List[dict]

Generate image names based on the PDF file name and the number of pages.
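One way to picture per-page image metadata is the sketch below. It is not the library's implementation: the function name sketch_image_info, the dictionary keys, the 1-based page numbering, and the image naming scheme are all assumptions chosen for illustration.

```python
from pathlib import Path
from typing import Dict, List


def sketch_image_info(pdf_path: str, num_pages: int) -> List[Dict[str, object]]:
    """Build one metadata dict per page (hypothetical keys, 1-based pages)."""
    stem = Path(pdf_path).stem
    return [
        {
            "pdf_name": pdf_path,                # original file path
            "pdf_page": page,                    # assumed 1-based page number
            "image_name": f"{stem}_{page}.png",  # assumed naming scheme
        }
        for page in range(1, num_pages + 1)
    ]


info = sketch_image_info("docs/report.pdf", 2)
```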

autorag.data.parse.clova.json_to_html_table(json_data)

autorag.data.parse.clova.pdf_to_images(pdf_path: str) -> List[bytes]

Convert each page of the PDF to an image and return the image data.

autorag.data.parse.langchain_parse module

autorag.data.parse.langchain_parse.langchain_parse(data_path_list: List[str], parse_method: str, **kwargs) -> Tuple[List[str], List[str], List[int]]

Parse documents using a langchain document_loaders(parse) method.

Parameters:
  • data_path_list – The list of data paths to parse.

  • parse_method – A langchain document_loaders(parse) method to use.

  • kwargs – The extra parameters for creating the langchain document_loaders(parse) instance.

Returns:

A tuple of lists containing the parsed texts, file paths, and page numbers.
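The return value is three parallel lists, so the i-th text, path, and page belong together. A minimal sketch of turning such a tuple into per-page records follows. The helper to_records and the sample data are hypothetical, and the page numbering convention shown is an assumption.

```python
from typing import Dict, List, Tuple

ParseResult = Tuple[List[str], List[str], List[int]]


def to_records(result: ParseResult) -> List[Dict[str, object]]:
    """Zip the three parallel lists into one dict per parsed page."""
    texts, paths, pages = result
    if not (len(texts) == len(paths) == len(pages)):
        raise ValueError("Expected three lists of equal length.")
    return [
        {"text": text, "path": path, "page": page}
        for text, path, page in zip(texts, paths, pages)
    ]


# Made-up sample shaped like the documented return type.
sample: ParseResult = (["first page", "second page"], ["a.pdf", "a.pdf"], [1, 2])
records = to_records(sample)
```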

autorag.data.parse.langchain_parse.langchain_parse_pure(data_path: str, parse_method: str, kwargs) -> Tuple[List[str], List[str], List[int]]

Parses a single file using the specified parse method.

Parameters:
  • data_path – The file path to parse.

  • parse_method – The parsing method to use.

  • kwargs – Additional keyword arguments for the parsing method, passed as a dict.

Returns:

A tuple of lists containing the parsed texts, file paths, and page numbers.

autorag.data.parse.langchain_parse.parse_all_files(data_path_list: List[str], parse_method: str, **kwargs) -> Tuple[List[str], List[str]]

autorag.data.parse.llamaparse module

autorag.data.parse.llamaparse.llama_parse(data_path_list: List[str], batch: int = 8, use_vendor_multimodal_model: bool = False, vendor_multimodal_model_name: str = 'openai-gpt4o', use_own_key: bool = False, vendor_multimodal_api_key: str = None, **kwargs) -> Tuple[List[str], List[str], List[int]]

Parse documents using llama_parse. The LLAMA_CLOUD_API_KEY environment variable must be set. You can get the key from https://cloud.llamaindex.ai/api-key

Parameters:
  • data_path_list – The list of data paths to parse.

  • batch – The batch size for parsing documents. Default is 8.

  • use_vendor_multimodal_model – Whether to use the vendor multimodal model. Default is False.

  • vendor_multimodal_model_name – The name of the vendor multimodal model. Default is “openai-gpt4o”.

  • use_own_key – Whether to use your own API key for the vendor multimodal model. Default is False.

  • vendor_multimodal_api_key – The API key for the vendor multimodal model.

  • kwargs – The extra parameters for creating the llama_parse instance.

Returns:

A tuple of lists containing the parsed texts, file paths, and page numbers.
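A batch parameter like the one above is commonly used to cap how many asynchronous parse calls run at once. The sketch below illustrates that general pattern with the standard library only. It is an assumption about how batching could work, not autorag's implementation, and fake_parse is a hypothetical stand-in for a remote parsing call.

```python
import asyncio
from typing import List, Tuple


async def fake_parse(path: str) -> Tuple[str, str, int]:
    """Stand-in for a remote parse call; returns (text, path, page)."""
    await asyncio.sleep(0)  # yield control, as a network call would
    return f"parsed:{path}", path, 1


async def parse_in_batches(paths: List[str], batch: int) -> List[Tuple[str, str, int]]:
    """Run at most `batch` parse calls concurrently, preserving input order."""
    results: List[Tuple[str, str, int]] = []
    for start in range(0, len(paths), batch):
        chunk = paths[start : start + batch]
        # gather returns results in the same order as the awaitables passed in
        results.extend(await asyncio.gather(*(fake_parse(p) for p in chunk)))
    return results


paths = [f"doc_{i}.pdf" for i in range(5)]
parsed = asyncio.run(parse_in_batches(paths, batch=2))
# Split the per-file triples into three parallel lists.
texts, out_paths, pages = (list(column) for column in zip(*parsed))
```

A smaller batch lowers the number of simultaneous requests, which can help with rate limits, at the cost of longer total run time.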

async autorag.data.parse.llamaparse.llama_parse_pure(data_path: str, parse_instance) -> Tuple[List[str], List[str], List[int]]

autorag.data.parse.run module

autorag.data.parse.run.run_parser(modules: List[Callable], module_params: List[Dict], data_path_glob: str, project_dir: str, all_files: bool)

autorag.data.parse.table_hybrid_parse module

Module contents