autorag.data.parse package

Submodules

autorag.data.parse.base module

autorag.data.parse.base.parser_node(func)

autorag.data.parse.clova module

autorag.data.parse.clova.clova_ocr(data_path_list: List[str], url: str | None = None, api_key: str | None = None, batch: int = 5, table_detection: bool = False) → Tuple[List[str], List[str], List[int]]

Parse documents using Naver Clova OCR.

Parameters:

data_path_list – The list of data paths to parse.
url – The URL for Clova OCR. You can get the URL by following the guide at https://guide.ncloud-docs.com/docs/clovaocr-example01. Set it through the CLOVA_URL environment variable, or pass it directly as a parameter.
api_key – The API key for Clova OCR. You can get the API key by following the guide at https://guide.ncloud-docs.com/docs/clovaocr-example01. Set it through the CLOVA_API_KEY environment variable, or pass it directly as a parameter.
batch – The batch size for parsing documents. Default is 5.
table_detection – Whether to enable table detection. Default is False.

Returns:

A tuple of lists containing the parsed texts, the file paths, and the page numbers.
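The fallback described for `url` and `api_key` (explicit parameter or environment variable) can be sketched as follows. This is a minimal illustration, not the library's code: `resolve_clova_credentials` is a hypothetical helper, and the assumption that an explicit argument takes precedence over the environment variable is not confirmed by this reference.

```python
import os
from typing import Optional, Tuple


def resolve_clova_credentials(
    url: Optional[str] = None, api_key: Optional[str] = None
) -> Tuple[str, str]:
    """Hypothetical helper: use explicit arguments, else CLOVA_URL / CLOVA_API_KEY."""
    # Assumption: an explicitly passed value wins over the environment variable.
    resolved_url = url or os.environ.get("CLOVA_URL")
    resolved_key = api_key or os.environ.get("CLOVA_API_KEY")
    if not resolved_url or not resolved_key:
        raise ValueError(
            "Set CLOVA_URL and CLOVA_API_KEY, or pass url and api_key directly."
        )
    return resolved_url, resolved_key
```

Resolving both values before any request is made means a missing credential fails early rather than partway through a batch.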

async autorag.data.parse.clova.clova_ocr_pure(image_data: bytes, image_info: dict, url: str, api_key: str, table_detection: bool = False) → Tuple[str, str, int]

autorag.data.parse.clova.extract_text_from_fields(fields)

autorag.data.parse.clova.generate_image_info(pdf_path: str, num_pages: int) → List[dict]

Generate image names based on the PDF file name and the number of pages.

autorag.data.parse.clova.json_to_html_table(json_data)

autorag.data.parse.clova.pdf_to_images(pdf_path: str) → List[bytes]

Convert each page of the PDF to an image and return the image data.
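To make the idea behind `generate_image_info` concrete, here is a rough sketch of deriving one metadata entry per page from a PDF path. The function name `sketch_image_info`, the dictionary keys, and the `<stem>_<page>.png` naming scheme are all assumptions for illustration. This reference only states that the real function returns a `List[dict]`.

```python
from pathlib import Path
from typing import Dict, List


def sketch_image_info(pdf_path: str, num_pages: int) -> List[Dict[str, object]]:
    """Hypothetical stand-in: build one info dict per page of a PDF."""
    stem = Path(pdf_path).stem  # file name without directory or extension
    return [
        {
            "pdf_path": pdf_path,                 # assumed key
            "page": page,                         # assumed key, 1-based
            "image_name": f"{stem}_{page}.png",   # assumed naming scheme
        }
        for page in range(1, num_pages + 1)
    ]
```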

autorag.data.parse.langchain_parse module

autorag.data.parse.langchain_parse.langchain_parse(data_path_list: List[str], parse_method: str, **kwargs) → Tuple[List[str], List[str], List[int]]

Parse documents using a langchain document_loaders(parse) method.

Parameters:

data_path_list – The list of data paths to parse.
parse_method – A langchain document_loaders(parse) method to use.
kwargs – The extra parameters for creating the langchain document_loaders(parse) instance.

Returns:

A tuple of lists containing the parsed texts, the file paths, and the page numbers.
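The parser functions on this page share the return type `Tuple[List[str], List[str], List[int]]`. A small sketch of turning that shape into per-page records is shown below. It assumes the three lists are aligned index by index, which this reference does not state explicitly. `to_records` is a hypothetical helper, and the sample data is made up rather than produced by a real parser.

```python
from typing import Dict, List, Tuple


def to_records(
    parsed: Tuple[List[str], List[str], List[int]]
) -> List[Dict[str, object]]:
    """Hypothetical helper: combine (texts, paths, pages) into one dict per entry."""
    texts, paths, pages = parsed
    # Assumption: entry i of each list describes the same parsed unit.
    if not (len(texts) == len(paths) == len(pages)):
        raise ValueError("texts, paths, and pages must have the same length")
    return [
        {"text": text, "path": path, "page": page}
        for text, path, page in zip(texts, paths, pages)
    ]


# Made-up stand-in for a parser result.
sample = (["first page", "second page"], ["a.pdf", "a.pdf"], [1, 2])
records = to_records(sample)
```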

autorag.data.parse.langchain_parse.langchain_parse_pure(data_path: str, parse_method: str, kwargs) → Tuple[List[str], List[str], List[int]]

Parses a single file using the specified parse method.

Parameters:

data_path (str) – The file path to parse.
parse_method (str) – The parsing method to use.
kwargs (Dict) – Additional keyword arguments for the parsing method.

Returns:

A tuple of lists containing the parsed texts, the file paths, and the page numbers, per the annotated return type Tuple[List[str], List[str], List[int]].

autorag.data.parse.langchain_parse.parse_all_files(data_path_list: List[str], parse_method: str, **kwargs) → Tuple[List[str], List[str]]

autorag.data.parse.llamaparse module

autorag.data.parse.llamaparse.llama_parse(data_path_list: List[str], batch: int = 8, use_vendor_multimodal_model: bool = False, vendor_multimodal_model_name: str = 'openai-gpt4o', use_own_key: bool = False, vendor_multimodal_api_key: str = None, **kwargs) → Tuple[List[str], List[str], List[int]]

Parse documents using llama_parse. The LLAMA_CLOUD_API_KEY environment variable must be set. You can get the key from https://cloud.llamaindex.ai/api-key

Parameters:

data_path_list – The list of data paths to parse.
batch – The batch size for parsing documents. Default is 8.
use_vendor_multimodal_model – Whether to use the vendor multimodal model. Default is False.
vendor_multimodal_model_name – The name of the vendor multimodal model. Default is “openai-gpt4o”.
use_own_key – Whether to use your own API key. Default is False.
vendor_multimodal_api_key – The API key for the vendor multimodal model.
kwargs – The extra parameters for creating the llama_parse instance.

Returns:

A tuple of lists containing the parsed texts, the file paths, and the page numbers.
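One plausible reading of the `batch` parameter is that files are processed in groups of at most `batch` at a time. The sketch below illustrates that pattern with `asyncio`. It is an assumption about the general technique, not the library's implementation, and `fake_parse` and `parse_in_batches` are hypothetical names standing in for a real async parser call.

```python
import asyncio
from typing import List


async def fake_parse(path: str) -> str:
    """Hypothetical stand-in for an async parse call on a single file."""
    await asyncio.sleep(0)  # yield control, as a real network call would
    return f"parsed:{path}"


async def parse_in_batches(paths: List[str], batch: int = 8) -> List[str]:
    """Run at most `batch` parse calls concurrently, keeping input order."""
    results: List[str] = []
    for start in range(0, len(paths), batch):
        chunk = paths[start:start + batch]
        # asyncio.gather returns results in the order the awaitables were passed.
        results.extend(await asyncio.gather(*(fake_parse(p) for p in chunk)))
    return results
```

Calling `asyncio.run(parse_in_batches(paths, batch=2))` processes the paths two at a time and returns the results in the original order.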

async autorag.data.parse.llamaparse.llama_parse_pure(data_path: str, parse_instance) → Tuple[List[str], List[str], List[int]]

autorag.data.parse.run module

autorag.data.parse.run.run_parser(modules: List[Callable], module_params: List[Dict], data_path_glob: str, project_dir: str, all_files: bool)

autorag.data.parse.table_hybrid_parse module

Module contents