autorag.data.parse package
Submodules
autorag.data.parse.base module
autorag.data.parse.clova module
- autorag.data.parse.clova.clova_ocr(data_path_list: List[str], url: str | None = None, api_key: str | None = None, batch: int = 5, table_detection: bool = False) → Tuple[List[str], List[str], List[int]]
Parse documents using Naver Clova OCR.
- Parameters:
data_path_list – The list of data paths to parse.
url – The URL for Clova OCR. You can get the URL by following the guide at https://guide.ncloud-docs.com/docs/clovaocr-example01. Set it through the CLOVA_URL environment variable or pass it directly as a parameter.
api_key – The API key for Clova OCR. You can get the API key by following the guide at https://guide.ncloud-docs.com/docs/clovaocr-example01. Set it through the CLOVA_API_KEY environment variable or pass it directly as a parameter.
batch – The batch size for parsing documents. Default is 5.
table_detection – Whether to enable table detection. Default is False.
- Returns:
A tuple of lists containing the parsed texts, paths, and pages.
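The `url` and `api_key` parameters can each be passed directly or supplied through an environment variable. The sketch below illustrates one way such a fallback could be resolved. The helper name `resolve_setting` is hypothetical, and the precedence shown (explicit argument first, then the environment) is an assumption rather than confirmed behavior of `clova_ocr`.

```python
import os
from typing import Optional


def resolve_setting(explicit: Optional[str], env_name: str) -> str:
    """Hypothetical helper: prefer the explicit argument, else the env var."""
    if explicit is not None:
        return explicit
    value = os.getenv(env_name)
    if value is None:
        raise KeyError(f"Pass the value directly or set {env_name}.")
    return value


# Illustrative values only; a real URL and key come from the Naver Cloud console.
os.environ["CLOVA_URL"] = "https://example.invalid/ocr"

url = resolve_setting(None, "CLOVA_URL")              # falls back to the environment
api_key = resolve_setting("my-key", "CLOVA_API_KEY")  # explicit argument is used
```

Keeping credentials in environment variables avoids hard-coding them in scripts, while the parameters remain convenient for quick experiments.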
- async autorag.data.parse.clova.clova_ocr_pure(image_data: bytes, image_info: dict, url: str, api_key: str, table_detection: bool = False) → Tuple[str, str, int]
autorag.data.parse.langchain_parse module
- autorag.data.parse.langchain_parse.langchain_parse(data_path_list: List[str], parse_method: str, **kwargs) → Tuple[List[str], List[str], List[int]]
Parse documents using a langchain document_loaders (parse) method.
- Parameters:
data_path_list – The list of data paths to parse.
parse_method – The langchain document_loaders (parse) method to use.
kwargs – The extra parameters for creating the langchain document_loaders(parse) instance.
- Returns:
A tuple of lists containing the parsed texts, paths, and pages.
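The return value is three parallel lists, so the i-th text, path, and page belong together. A minimal sketch of regrouping them into one record per entry follows. The `to_records` helper is hypothetical, and the sample tuple uses made-up values in place of a real parser call.

```python
from typing import Dict, List, Tuple, Union

# Shape of the value returned by the parse functions documented here.
ParseResult = Tuple[List[str], List[str], List[int]]


def to_records(result: ParseResult) -> List[Dict[str, Union[str, int]]]:
    """Hypothetical helper: zip the parallel lists into per-entry dicts."""
    texts, paths, pages = result
    if not (len(texts) == len(paths) == len(pages)):
        raise ValueError("texts, paths, and pages must have the same length")
    return [
        {"text": text, "path": path, "page": page}
        for text, path, page in zip(texts, paths, pages)
    ]


# Made-up sample standing in for a real parse result.
sample: ParseResult = (["first page", "second page"], ["a.pdf", "a.pdf"], [1, 2])
records = to_records(sample)
```

Working with records rather than three separate lists makes it harder to filter or sort one list and leave the others misaligned.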
- autorag.data.parse.langchain_parse.langchain_parse_pure(data_path: str, parse_method: str, kwargs) → Tuple[List[str], List[str], List[int]]
Parses a single file using the specified parse method.
- Parameters:
data_path (str) – The file path to parse.
parse_method (str) – The parsing method to use.
kwargs (Dict) – Additional keyword arguments for the parsing method.
- Returns:
Tuple[List[str], List[str], List[int]]: A tuple of lists containing the parsed texts, the file paths, and the pages.
autorag.data.parse.llamaparse module
- autorag.data.parse.llamaparse.llama_parse(data_path_list: List[str], batch: int = 8, use_vendor_multimodal_model: bool = False, vendor_multimodal_model_name: str = 'openai-gpt4o', use_own_key: bool = False, vendor_multimodal_api_key: str = None, **kwargs) → Tuple[List[str], List[str], List[int]]
Parse documents using llama_parse. The LLAMA_CLOUD_API_KEY environment variable must be set. You can get the key from https://cloud.llamaindex.ai/api-key
- Parameters:
data_path_list – The list of data paths to parse.
batch – The batch size for parsing documents. Default is 8.
use_vendor_multimodal_model – Whether to use the vendor multimodal model. Default is False.
vendor_multimodal_model_name – The name of the vendor multimodal model. Default is “openai-gpt4o”.
use_own_key – Whether to use your own API key for the vendor multimodal model. Default is False.
vendor_multimodal_api_key – The API key for the vendor multimodal model.
kwargs – The extra parameters for creating the llama_parse instance.
- Returns:
A tuple of lists containing the parsed texts, paths, and pages.
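The `batch` parameter sets how many documents are handled per batch. As a rough illustration of what a batch size means for a list of paths, the sketch below splits a list into consecutive groups. The `make_batches` helper is hypothetical and is not the library's implementation; how the library schedules each batch is not covered here.

```python
from typing import List


def make_batches(paths: List[str], batch: int = 8) -> List[List[str]]:
    """Hypothetical helper: split paths into consecutive groups of size `batch`."""
    if batch < 1:
        raise ValueError("batch must be a positive integer")
    return [paths[i:i + batch] for i in range(0, len(paths), batch)]


# Twenty made-up file names split with the documented default of 8.
paths = [f"doc_{i}.pdf" for i in range(20)]
batches = make_batches(paths, batch=8)
sizes = [len(group) for group in batches]
```

With 20 paths and a batch size of 8, the last group holds the remaining 4 paths.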