RAGchain.benchmark.dataset package

Submodules

RAGchain.benchmark.dataset.antique module

class RAGchain.benchmark.dataset.antique.AntiqueEvaluator(run_pipeline: BaseRunPipeline, evaluate_size: int | None = None, metrics: List[str] | None = None)

Bases: BaseDatasetEvaluator

AntiqueEvaluator is a class for evaluating pipeline performance on the Antique dataset.

evaluate(**kwargs) EvaluateResult

Evaluate metrics and return the results.
:param validate_passages: If True, check that the passages in retrieval_gt have already been ingested. If False, you cannot use the context_recall and KF1 metrics. We recommend setting this to True for robust evaluation.
:return: EvaluateResult

ingest(retrievals: List[BaseRetrieval], db: BaseDB, ingest_size: int | None = None, random_state=None)

Ingest the dataset into the given retrievals and db.
:param retrievals: The retrievals to ingest the dataset into.
:param db: The db to ingest the dataset into.
:param ingest_size: The number of data points to ingest. If None, ingest all data.
:param random_state: A random state that fixes the shuffled corpus to ingest. Accepted types: int, array-like, BitGenerator, np.random.RandomState, np.random.Generator (optional).

Notice: If the ingest size is too large, ingestion takes a long time, so we shuffle the corpus and slice it to the ingest size for testing. The retrieval ground-truth corpus is put into the passages so that retrieval can retrieve the ground truth from the db.

If you want to use context_recall metrics, you should ingest all data.
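
The snippet below is a minimal usage sketch for this evaluator, assuming a run pipeline, a retrieval, and a db have already been constructed elsewhere; it only exercises the signatures documented above.

    from RAGchain.benchmark.dataset.antique import AntiqueEvaluator

    # `pipeline`, `retrieval`, and `db` are assumed to be pre-built
    # BaseRunPipeline / BaseRetrieval / BaseDB instances.
    evaluator = AntiqueEvaluator(run_pipeline=pipeline, evaluate_size=50)

    # Ingest a shuffled 500-passage slice of the corpus with a fixed seed.
    # Because not all data is ingested, context_recall is not available.
    evaluator.ingest(retrievals=[retrieval], db=db, ingest_size=500, random_state=42)

    # The returned EvaluateResult holds the computed metric scores.
    result = evaluator.evaluate(validate_passages=True)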

RAGchain.benchmark.dataset.asqa module

class RAGchain.benchmark.dataset.asqa.ASQAEvaluator(run_pipeline: BaseRunPipeline, evaluate_size: int | None = None, metrics: List[str] | None = None)

Bases: BaseDatasetEvaluator

ASQAEvaluator is a class for evaluating pipeline performance on the ASQA dataset.

evaluate(**kwargs) EvaluateResult

Evaluate metrics and return the results.
:param validate_passages: If True, check that the passages in retrieval_gt have already been ingested. If False, you cannot use the context_recall and KF1 metrics. We recommend setting this to True for robust evaluation.
:return: EvaluateResult

ingest(retrievals: List[BaseRetrieval], db: BaseDB, ingest_size: int | None = None)

Ingest the dataset into the given retrievals and db.
:param retrievals: The retrievals to ingest the dataset into.
:param db: The db to ingest the dataset into.
:param ingest_size: The number of data points to ingest. If None, ingest all data.

RAGchain.benchmark.dataset.base module

class RAGchain.benchmark.dataset.base.BaseDatasetEvaluator(run_all: bool = True, metrics: List[str] | None = None)

Bases: BaseEvaluator, ABC

abstract ingest(retrievals: List[BaseRetrieval], db: BaseDB, ingest_size: int | None = None)
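
For reference, the following is a rough sketch of how a concrete dataset evaluator could subclass BaseDatasetEvaluator and fill in the abstract ingest method. It is based only on the signatures documented here; the class name and the stub body are illustrative assumptions.

    from typing import Optional

    from RAGchain.benchmark.dataset.base import BaseDatasetEvaluator

    class MyDatasetEvaluator(BaseDatasetEvaluator):  # hypothetical subclass
        def ingest(self, retrievals, db, ingest_size: Optional[int] = None):
            # Convert the dataset corpus into passages, optionally truncate to
            # ingest_size, then save the passages to `db` and index them in
            # every retrieval in `retrievals`. The conversion is
            # dataset-specific, so it is left as a stub here.
            # A real evaluator also defines evaluate(**kwargs), as the
            # concrete classes in this package do.
            raise NotImplementedError
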
class RAGchain.benchmark.dataset.base.BaseStrategyQA

Bases: object

convert_qa_to_pd(data)

RAGchain.benchmark.dataset.dstc11_track5 module

class RAGchain.benchmark.dataset.dstc11_track5.DSTC11Track5Evaluator(run_pipeline: BaseRunPipeline, evaluate_size: int | None = None, metrics: List[str] | None = None)

Bases: BaseDatasetEvaluator

DSTC11Track5Evaluator is a class for evaluating pipeline performance on the DSTC-11-Track-5 dataset.

evaluate(**kwargs) EvaluateResult

Evaluate metrics and return the results.
:param validate_passages: If True, check that the passages in retrieval_gt have already been ingested. If False, you cannot use the context_recall and KF1 metrics. We recommend setting this to True for robust evaluation.
:return: EvaluateResult

ingest(retrievals: List[BaseRetrieval], db: BaseDB, ingest_size: int | None = None, random_state=None)

Ingest the dataset into the given retrievals and db.
:param retrievals: The retrievals to ingest the dataset into.
:param db: The db to ingest the dataset into.
:param ingest_size: The number of data points to ingest. If None, ingest all data. You must ingest all data to use the context_recall metric.
:param random_state: A random state that fixes the shuffled corpus to ingest. Accepted types: int, array-like, BitGenerator, np.random.RandomState, np.random.Generator (optional).

RAGchain.benchmark.dataset.eli5 module

class RAGchain.benchmark.dataset.eli5.Eli5Evaluator(run_pipeline: BaseRunPipeline, evaluate_size: int | None = None, metrics: List[str] | None = None)

Bases: BaseDatasetEvaluator

Eli5Evaluator is a class for evaluating pipeline performance on the Eli5 dataset.

evaluate(**kwargs) EvaluateResult

Evaluate metrics and return the results.
:param validate_passages: If True, check that the passages in retrieval_gt have already been ingested. If False, you cannot use the context_recall and KF1 metrics. We recommend setting this to True for robust evaluation.
:return: EvaluateResult

ingest(retrievals: List[BaseRetrieval], db: BaseDB, ingest_size: int | None = None)

Ingest the dataset into the given retrievals and db.
:param retrievals: The retrievals to ingest the dataset into.
:param db: The db to ingest the dataset into.
:param ingest_size: The number of data points to ingest. If None, ingest all data.

RAGchain.benchmark.dataset.ko_strategy_qa module

class RAGchain.benchmark.dataset.ko_strategy_qa.KoStrategyQAEvaluator(run_pipeline: BaseRunPipeline, evaluate_size: int | None = None, metrics: List[str] | None = None)

Bases: BaseDatasetEvaluator, BaseStrategyQA

Ko-StrategyQA dataset evaluator

dataset_name = 'NomaDamas/Ko-StrategyQA'
evaluate(validate_passages: bool = True) EvaluateResult

Evaluate pipeline performance on the Ko-StrategyQA dataset.
:return: EvaluateResult

ingest(retrievals: List[BaseRetrieval], db: BaseDB, ingest_size: int | None = None)

Ingest the dataset into the given retrievals and db.
:param retrievals: The retrievals to ingest the dataset into.
:param db: The db to ingest the dataset into.
:param ingest_size: The number of data points to ingest. If None, ingest all data. If you want to use the context_recall and context_precision metrics, you should ingest all data.
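
A usage sketch for the full-ingest case, again assuming pre-built `pipeline`, `retrieval`, and `db` objects; ingesting with ingest_size=None keeps the context_recall and context_precision metrics available.

    from RAGchain.benchmark.dataset.ko_strategy_qa import KoStrategyQAEvaluator

    # `pipeline`, `retrieval`, and `db` are assumed to exist already.
    evaluator = KoStrategyQAEvaluator(run_pipeline=pipeline, evaluate_size=30)

    # ingest_size=None ingests the whole corpus, which is required for
    # context_recall and context_precision.
    evaluator.ingest(retrievals=[retrieval], db=db, ingest_size=None)

    result = evaluator.evaluate(validate_passages=True)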

RAGchain.benchmark.dataset.mr_tydi module

class RAGchain.benchmark.dataset.mr_tydi.MrTydiEvaluator(run_pipeline: BaseRunPipeline, evaluate_size: int | None = None, metrics: List[str] | None = None, language: str = 'english')

Bases: BaseDatasetEvaluator

MrTydiEvaluator is a class for evaluating pipeline performance on the Mr. TyDi dataset.

evaluate(**kwargs) EvaluateResult

Evaluate pipeline performance on the Mr. TyDi dataset. This method always validates passages.

ingest(retrievals: List[BaseRetrieval], db: BaseDB, ingest_size: int | None = None, random_state=None)

Ingest the dataset into the given retrievals and db.
:param retrievals: The retrievals to ingest the dataset into.
:param db: The db to ingest the dataset into.
:param ingest_size: The number of data points to ingest. If None, ingest all data. You must ingest all data to use the context_recall metric. If the ingest size is excessively large, processing takes a long time; to address this, we shuffle the corpus and slice it to the ingest size for testing. The retrieval ground-truth corpus is converted into passages and ingested so that retrieval can retrieve the ground truth from the database.
:param random_state: A random state that fixes the shuffled corpus to ingest. Accepted types: int, array-like, BitGenerator, np.random.RandomState, np.random.Generator (optional).
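
A usage sketch showing the language option, assuming pre-built `pipeline`, `retrieval`, and `db`; 'korean' is used here on the assumption that it is one of the Mr. TyDi language strings this evaluator accepts (the documented default is 'english').

    from RAGchain.benchmark.dataset.mr_tydi import MrTydiEvaluator

    # Evaluate on the Korean split instead of the default English split.
    evaluator = MrTydiEvaluator(run_pipeline=pipeline, evaluate_size=50,
                                language='korean')  # assumed valid language string

    # A fixed random_state makes the shuffled, sliced corpus reproducible.
    evaluator.ingest(retrievals=[retrieval], db=db, ingest_size=1000, random_state=42)

    result = evaluator.evaluate()  # this evaluator always validates passages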

RAGchain.benchmark.dataset.msmarco module

class RAGchain.benchmark.dataset.msmarco.MSMARCOEvaluator(run_pipeline: BaseRunPipeline, evaluate_size: int | None = None, metrics: List[str] | None = None, version: str = 'v1.1')

Bases: BaseDatasetEvaluator

MSMARCOEvaluator is a class for evaluating pipeline performance on the MSMARCO dataset.

evaluate(**kwargs) EvaluateResult

Evaluate metrics and return the results.
:param validate_passages: If True, check that the passages in retrieval_gt have already been ingested. If False, you cannot use the context_recall and KF1 metrics. We recommend setting this to True for robust evaluation.
:return: EvaluateResult

ingest(retrievals: List[BaseRetrieval], db: BaseDB, ingest_size: int | None = None)

Ingest the dataset into the given retrievals and db.
:param retrievals: The retrievals to ingest the dataset into.
:param db: The db to ingest the dataset into.
:param ingest_size: The number of data points to ingest. If None, ingest all data. You must ingest all data to use the context_recall metric.

RAGchain.benchmark.dataset.natural_question module

class RAGchain.benchmark.dataset.natural_question.NaturalQAEvaluator(run_pipeline: BaseRunPipeline, evaluate_size: int | None = None, metrics: List[str] | None = None)

Bases: BaseDatasetEvaluator

NaturalQAEvaluator is a class for evaluating pipeline performance on the Natural Questions dataset.

evaluate(**kwargs) EvaluateResult

Evaluate metrics and return the results.
:param validate_passages: If True, check that the passages in retrieval_gt have already been ingested. If False, you cannot use the context_recall and KF1 metrics. We recommend setting this to True for robust evaluation.
:return: EvaluateResult

ingest(retrievals: List[BaseRetrieval], db: BaseDB, ingest_size: int | None = None)

Ingest the dataset into the given retrievals and db.
:param retrievals: The retrievals to ingest the dataset into.
:param db: The db to ingest the dataset into.
:param ingest_size: The number of data points to ingest. If None, ingest all data. You must ingest all data to use the context_recall metric.

RAGchain.benchmark.dataset.nfcorpus module

class RAGchain.benchmark.dataset.nfcorpus.NFCorpusEvaluator(run_pipeline: BaseRunPipeline, evaluate_size: int | None = None, metrics: List[str] | None = None)

Bases: BaseDatasetEvaluator

NFCorpusEvaluator is a class for evaluating pipeline performance on the NFCorpus dataset.

evaluate(**kwargs) EvaluateResult

Evaluate metrics and return the results.
:param validate_passages: If True, check that the passages in retrieval_gt have already been ingested. If False, you cannot use the context_recall and KF1 metrics. We recommend setting this to True for robust evaluation.
:return: EvaluateResult

ingest(retrievals: List[BaseRetrieval], db: BaseDB, ingest_size: int | None = None, random_state=None)

Ingest the dataset into the given retrievals and db.
:param retrievals: The retrievals to ingest the dataset into.
:param db: The db to ingest the dataset into.
:param ingest_size: The number of data points to ingest. If None, ingest all data.
:param random_state: A random state that fixes the shuffled corpus to ingest. Accepted types: int, array-like, BitGenerator, np.random.RandomState, np.random.Generator (optional).

Notice: If the ingest size is excessively large, processing takes a long time. To address this, we shuffle the corpus and slice it to the ingest size for testing. The retrieval ground-truth corpus is converted into passages and ingested so that retrieval can retrieve the ground truth from the database.

RAGchain.benchmark.dataset.qasper module

class RAGchain.benchmark.dataset.qasper.QasperEvaluator(run_pipeline: BaseRunPipeline, evaluate_size: int, metrics: List[str] | None = None, random_state: int = 42)

Bases: BaseDatasetEvaluator

QasperEvaluator is a class for evaluating pipeline performance on the Qasper dataset.

dataset_name = 'NomaDamas/qasper'
evaluate(**kwargs) EvaluateResult

Evaluate pipeline performance on the Qasper dataset.
:return: EvaluateResult

ingest(retrievals: List[BaseRetrieval], db: BaseDB, ingest_size: int | None = None)

Set the ingestion parameters for evaluating the pipeline. This method does not ingest passages, because the Qasper dataset is not designed for ingesting all paragraphs and retrieving from them; it only contains questions tied to specific papers. Instead, each paper's paragraphs are ingested at evaluation time.
:param retrievals: The retrievals to ingest into.
:param db: The db to ingest into.
:param ingest_size: Defaults to None. You do not need to set this parameter; if you do, it is ignored.
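
A usage sketch for this evaluator, assuming pre-built `pipeline`, `retrieval`, and `db`; note that evaluate_size is a required argument here and that ingest only registers the ingest targets, since passages are ingested per paper during evaluation.

    from RAGchain.benchmark.dataset.qasper import QasperEvaluator

    # evaluate_size is required for this evaluator.
    evaluator = QasperEvaluator(run_pipeline=pipeline, evaluate_size=10)

    # No passages are ingested here; the call only stores the ingest targets.
    evaluator.ingest(retrievals=[retrieval], db=db)

    result = evaluator.evaluate()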

preprocess(data)

Preprocess the Qasper dataset to make it suitable for evaluating the pipeline.

RAGchain.benchmark.dataset.search_qa module

class RAGchain.benchmark.dataset.search_qa.SearchQAEvaluator(run_pipeline: BaseRunPipeline, evaluate_size: int | None = None, metrics: List[str] | None = None)

Bases: BaseDatasetEvaluator

SearchQAEvaluator is a class for evaluating pipeline performance on the SearchQA dataset.

evaluate(**kwargs) EvaluateResult

Evaluate metrics and return the results.
:param validate_passages: If True, check that the passages in retrieval_gt have already been ingested. If False, you cannot use the context_recall and KF1 metrics. We recommend setting this to True for robust evaluation.
:return: EvaluateResult

ingest(retrievals: List[BaseRetrieval], db: BaseDB, ingest_size: int | None = None, random_state=None)

Ingest the dataset into the given retrievals and db.
:param retrievals: The retrievals to ingest the dataset into.
:param db: The db to ingest the dataset into.
:param ingest_size: The number of data points to ingest. If None, ingest all data. If the ingest size is excessively large, processing takes a long time; to address this, we shuffle the corpus and slice it to the ingest size for testing. The retrieval ground-truth corpus is converted into passages and ingested so that retrieval can retrieve the ground truth from the database. This dataset has many retrieval ground truths per query, so it is recommended to set the ingest size to a small value.
:param random_state: A random state that fixes the shuffled corpus to ingest. Accepted types: int, array-like, BitGenerator, np.random.RandomState, np.random.Generator (optional).

RAGchain.benchmark.dataset.strategy_qa module

class RAGchain.benchmark.dataset.strategy_qa.StrategyQAEvaluator(run_pipeline: BaseRunPipeline, evaluate_size: int | None = None, metrics: List[str] | None = None)

Bases: BaseDatasetEvaluator, BaseStrategyQA

StrategyQAEvaluator is a class for evaluating pipeline performance on the StrategyQA dataset.

dataset_name = 'voidful/StrategyQA'
evaluate(validate_passages: bool = True, **kwargs) EvaluateResult

Evaluate metrics and return the results.
:param validate_passages: If True, check that the passages in retrieval_gt have already been ingested. If False, you cannot use the context_recall and KF1 metrics. We recommend setting this to True for robust evaluation.
:return: EvaluateResult

ingest(retrievals: List[BaseRetrieval], db: BaseDB, ingest_size: int | None = None)

Ingest the dataset into the given retrievals and db.
:param retrievals: The retrievals to ingest the dataset into.
:param db: The db to ingest the dataset into.
:param ingest_size: The number of data points to ingest. If None, ingest all data. If you want to use the context_recall and context_precision metrics, you should ingest all data.
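
A usage sketch for the partial-ingest case, assuming pre-built `pipeline`, `retrieval`, and `db`; with only part of the corpus ingested, validate_passages can be set to False, at the cost of losing the context_recall and KF1 metrics.

    from RAGchain.benchmark.dataset.strategy_qa import StrategyQAEvaluator

    evaluator = StrategyQAEvaluator(run_pipeline=pipeline, evaluate_size=30)

    # Only part of the data is ingested, so skip the passage validation and
    # accept that context_recall and KF1 cannot be computed.
    evaluator.ingest(retrievals=[retrieval], db=db, ingest_size=500)
    result = evaluator.evaluate(validate_passages=False)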

RAGchain.benchmark.dataset.triviaqa module

class RAGchain.benchmark.dataset.triviaqa.TriviaQAEvaluator(run_pipeline: BaseRunPipeline, evaluate_size: int | None = None, metrics: List[str] | None = None)

Bases: BaseDatasetEvaluator

TriviaQAEvaluator is a class for evaluating pipeline performance on the TriviaQA dataset.

evaluate(**kwargs) EvaluateResult

Evaluate metrics and return the results.
:param validate_passages: If True, check that the passages in retrieval_gt have already been ingested. If False, you cannot use the context_recall and KF1 metrics. We recommend setting this to True for robust evaluation.
:return: EvaluateResult

ingest(retrievals: List[BaseRetrieval], db: BaseDB, ingest_size: int | None = None)

Ingest the dataset into the given retrievals and db.
:param retrievals: The retrievals to ingest the dataset into.
:param db: The db to ingest the dataset into.
:param ingest_size: The number of data points to ingest. If None, ingest all data.

Module contents