RAGchain.benchmark.dataset package

Submodules

RAGchain.benchmark.dataset.antique module

class RAGchain.benchmark.dataset.antique.AntiqueEvaluator(run_pipeline: BaseRunPipeline, evaluate_size: int | None = None, metrics: List[str] | None = None)

Bases: BaseDatasetEvaluator

AntiqueEvaluator is a class for evaluating pipeline performance on the Antique dataset.

evaluate(**kwargs) EvaluateResult

Evaluate metrics and return the results.
:param validate_passages: If True, check that the passages in retrieval_gt have already been ingested. If False, you cannot use the context_recall and KF1 metrics. We recommend setting this to True for robust evaluation.
:return: EvaluateResult

ingest(retrievals: List[BaseRetrieval], db: BaseDB, ingest_size: int | None = None, random_state=None)

Ingest the dataset into the given retrievals and db.
:param retrievals: The retrievals to ingest the dataset into.
:param db: The db to ingest the dataset into.
:param ingest_size: The number of data points to ingest. If None, ingest all data.
:param random_state: A random state that fixes the shuffled corpus to ingest. Accepted types: int, array-like, BitGenerator, np.random.RandomState, np.random.Generator (optional).

Notice: If the ingest size is too large, ingestion takes a long time, so we shuffle the corpus and slice it to the ingest size for testing. The retrieval ground-truth corpus is put into the passages so that retrieval can retrieve the ground truth from the db.

If you want to use context_recall metrics, you should ingest all data.
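
The snippet below is a minimal usage sketch for this evaluator, assuming a run pipeline, a retrieval, and a db have already been constructed elsewhere; it only exercises the signatures documented above.

    from RAGchain.benchmark.dataset.antique import AntiqueEvaluator

    # `pipeline`, `retrieval`, and `db` are assumed to be pre-built
    # BaseRunPipeline / BaseRetrieval / BaseDB instances.
    evaluator = AntiqueEvaluator(run_pipeline=pipeline, evaluate_size=50)

    # Ingest a shuffled 500-passage slice of the corpus with a fixed seed.
    # Because not all data is ingested, context_recall is not available.
    evaluator.ingest(retrievals=[retrieval], db=db, ingest_size=500, random_state=42)

    # The returned EvaluateResult holds the computed metric scores.
    result = evaluator.evaluate(validate_passages=True)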

RAGchain.benchmark.dataset.asqa module

class RAGchain.benchmark.dataset.asqa.ASQAEvaluator(run_pipeline: BaseRunPipeline, evaluate_size: int | None = None, metrics: List[str] | None = None)

Bases: BaseDatasetEvaluator

ASQAEvaluator is a class for evaluating pipeline performance on the ASQA dataset.

evaluate(**kwargs) EvaluateResult

Evaluate metrics and return the results.
:param validate_passages: If True, check that the passages in retrieval_gt have already been ingested. If False, you cannot use the context_recall and KF1 metrics. We recommend setting this to True for robust evaluation.
:return: EvaluateResult

ingest(retrievals: List[BaseRetrieval], db: BaseDB, ingest_size: int | None = None)

Ingest the dataset into the given retrievals and db.
:param retrievals: The retrievals to ingest the dataset into.
:param db: The db to ingest the dataset into.
:param ingest_size: The number of data points to ingest. If None, ingest all data.

RAGchain.benchmark.dataset.base module

class RAGchain.benchmark.dataset.base.BaseDatasetEvaluator(run_all: bool = True, metrics: List[str] | None = None)

Bases: BaseEvaluator, ABC

abstract ingest(retrievals: List[BaseRetrieval], db: BaseDB, ingest_size: int | None = None)
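
For reference, the following is a rough sketch of how a concrete dataset evaluator could subclass BaseDatasetEvaluator and fill in the abstract ingest method. It is based only on the signatures documented here; the class name and the stub body are illustrative assumptions.

    from typing import Optional

    from RAGchain.benchmark.dataset.base import BaseDatasetEvaluator

    class MyDatasetEvaluator(BaseDatasetEvaluator):  # hypothetical subclass
        def ingest(self, retrievals, db, ingest_size: Optional[int] = None):
            # Convert the dataset corpus into passages, optionally truncate to
            # ingest_size, then save the passages to `db` and index them in
            # every retrieval in `retrievals`. The conversion is
            # dataset-specific, so it is left as a stub here.
            # A real evaluator also defines evaluate(**kwargs), as the
            # concrete classes in this package do.
            raise NotImplementedError
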
class RAGchain.benchmark.dataset.base.BaseStrategyQA

Bases: object

convert_qa_to_pd(data)

RAGchain.benchmark.dataset.dstc11_track5 module

class RAGchain.benchmark.dataset.dstc11_track5.DSTC11Track5Evaluator(run_pipeline: BaseRunPipeline, evaluate_size: int | None = None, metrics: List[str] | None = None)

Bases: BaseDatasetEvaluator

DSTC11Track5Evaluator is a class for evaluating pipeline performance on the DSTC-11-Track-5 dataset.

evaluate(**kwargs) EvaluateResult

Evaluate metrics and return the results.
:param validate_passages: If True, check that the passages in retrieval_gt have already been ingested. If False, you cannot use the context_recall and KF1 metrics. We recommend setting this to True for robust evaluation.
:return: EvaluateResult

ingest(retrievals: List[BaseRetrieval], db: BaseDB, ingest_size: int | None = None, random_state=None)

Ingest the dataset into the given retrievals and db.
:param retrievals: The retrievals to ingest the dataset into.
:param db: The db to ingest the dataset into.
:param ingest_size: The number of data points to ingest. If None, ingest all data. You must ingest all data to use the context_recall metric.
:param random_state: A random state that fixes the shuffled corpus to ingest. Accepted types: int, array-like, BitGenerator, np.random.RandomState, np.random.Generator (optional).

RAGchain.benchmark.dataset.eli5 module

class RAGchain.benchmark.dataset.eli5.Eli5Evaluator(run_pipeline: BaseRunPipeline, evaluate_size: int | None = None, metrics: List[str] | None = None)

Bases: BaseDatasetEvaluator

Eli5Evaluator is a class for evaluating pipeline performance on the Eli5 dataset.

evaluate(**kwargs) EvaluateResult

Evaluate metrics and return the results.
:param validate_passages: If True, check that the passages in retrieval_gt have already been ingested. If False, you cannot use the context_recall and KF1 metrics. We recommend setting this to True for robust evaluation.
:return: EvaluateResult

ingest(retrievals: List[BaseRetrieval], db: BaseDB, ingest_size: int | None = None)

Ingest the dataset into the given retrievals and db.
:param retrievals: The retrievals to ingest the dataset into.
:param db: The db to ingest the dataset into.
:param ingest_size: The number of data points to ingest. If None, ingest all data.

RAGchain.benchmark.dataset.ko_strategy_qa module

class RAGchain.benchmark.dataset.ko_strategy_qa.KoStrategyQAEvaluator(run_pipeline: BaseRunPipeline, evaluate_size: int | None = None, metrics: List[str] | None = None)

Bases: BaseDatasetEvaluator, BaseStrategyQA

Ko-StrategyQA dataset evaluator

dataset_name = 'NomaDamas/Ko-StrategyQA'
evaluate(validate_passages: bool = True) EvaluateResult

Evaluate pipeline performance on the Ko-StrategyQA dataset.
:return: EvaluateResult

ingest(retrievals: List[BaseRetrieval], db: BaseDB, ingest_size: int | None = None)

Ingest the dataset into the given retrievals and db.
:param retrievals: The retrievals to ingest the dataset into.
:param db: The db to ingest the dataset into.
:param ingest_size: The number of data points to ingest. If None, ingest all data. If you want to use the context_recall and context_precision metrics, you should ingest all data.
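
A usage sketch for the full-ingest case, again assuming pre-built `pipeline`, `retrieval`, and `db` objects; ingesting with ingest_size=None keeps the context_recall and context_precision metrics available.

    from RAGchain.benchmark.dataset.ko_strategy_qa import KoStrategyQAEvaluator

    # `pipeline`, `retrieval`, and `db` are assumed to exist already.
    evaluator = KoStrategyQAEvaluator(run_pipeline=pipeline, evaluate_size=30)

    # ingest_size=None ingests the whole corpus, which is required for
    # context_recall and context_precision.
    evaluator.ingest(retrievals=[retrieval], db=db, ingest_size=None)

    result = evaluator.evaluate(validate_passages=True)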

RAGchain.benchmark.dataset.mr_tydi module

class RAGchain.benchmark.dataset.mr_tydi.MrTydiEvaluator(run_pipeline: BaseRunPipeline, evaluate_size: int | None = None, metrics: List[str] | None = None, language: str = 'english')

Bases: BaseDatasetEvaluator

MrTydiEvaluator is a class for evaluating pipeline performance on the Mr. TyDi dataset.

evaluate(**kwargs) EvaluateResult

Evaluate pipeline performance on the Mr. TyDi dataset. This method always validates passages.

ingest(retrievals: List[BaseRetrieval], db: BaseDB, ingest_size: int | None = None, random_state=None)

Ingest the dataset into the given retrievals and db.
:param retrievals: The retrievals to ingest the dataset into.
:param db: The db to ingest the dataset into.
:param ingest_size: The number of data points to ingest. If None, ingest all data. You must ingest all data to use the context_recall metric. If the ingest size is excessively large, processing takes a long time; to address this, we shuffle the corpus and slice it to the ingest size for testing. The retrieval ground-truth corpus is converted into passages and ingested so that retrieval can retrieve the ground truth from the database.
:param random_state: A random state that fixes the shuffled corpus to ingest. Accepted types: int, array-like, BitGenerator, np.random.RandomState, np.random.Generator (optional).
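
A usage sketch showing the language option, assuming pre-built `pipeline`, `retrieval`, and `db`; 'korean' is used here on the assumption that it is one of the Mr. TyDi language strings this evaluator accepts (the documented default is 'english').

    from RAGchain.benchmark.dataset.mr_tydi import MrTydiEvaluator

    # Evaluate on the Korean split instead of the default English split.
    evaluator = MrTydiEvaluator(run_pipeline=pipeline, evaluate_size=50,
                                language='korean')  # assumed valid language string

    # A fixed random_state makes the shuffled, sliced corpus reproducible.
    evaluator.ingest(retrievals=[retrieval], db=db, ingest_size=1000, random_state=42)

    result = evaluator.evaluate()  # this evaluator always validates passages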

RAGchain.benchmark.dataset.msmarco module

class RAGchain.benchmark.dataset.msmarco.MSMARCOEvaluator(run_pipeline: BaseRunPipeline, evaluate_size: int | None = None, metrics: List[str] | None = None, version: str = 'v1.1')

Bases: BaseDatasetEvaluator

MSMARCOEvaluator is a class for evaluating pipeline performance on the MSMARCO dataset.

evaluate(**kwargs) EvaluateResult

Evaluate metrics and return the results.
:param validate_passages: If True, check that the passages in retrieval_gt have already been ingested. If False, you cannot use the context_recall and KF1 metrics. We recommend setting this to True for robust evaluation.
:return: EvaluateResult

ingest(retrievals: List[BaseRetrieval], db: BaseDB, ingest_size: int | None = None)

Ingest the dataset into the given retrievals and db.
:param retrievals: The retrievals to ingest the dataset into.
:param db: The db to ingest the dataset into.
:param ingest_size: The number of data points to ingest. If None, ingest all data. You must ingest all data to use the context_recall metric.

RAGchain.benchmark.dataset.natural_question module

class RAGchain.benchmark.dataset.natural_question.NaturalQAEvaluator(run_pipeline: BaseRunPipeline, evaluate_size: int | None = None, metrics: List[str] | None = None)

Bases: BaseDatasetEvaluator

NaturalQAEvaluator is a class for evaluating pipeline performance on the Natural Questions dataset.

evaluate(**kwargs) EvaluateResult

Evaluate metrics and return the results.
:param validate_passages: If True, check that the passages in retrieval_gt have already been ingested. If False, you cannot use the context_recall and KF1 metrics. We recommend setting this to True for robust evaluation.
:return: EvaluateResult

ingest(retrievals: List[BaseRetrieval], db: BaseDB, ingest_size: int | None = None)

Ingest the dataset into the given retrievals and db.
:param retrievals: The retrievals to ingest the dataset into.
:param db: The db to ingest the dataset into.
:param ingest_size: The number of data points to ingest. If None, ingest all data. You must ingest all data to use the context_recall metric.

RAGchain.benchmark.dataset.nfcorpus module

class RAGchain.benchmark.dataset.nfcorpus.NFCorpusEvaluator(run_pipeline: BaseRunPipeline, evaluate_size: int | None = None, metrics: List[str] | None = None)

Bases: BaseDatasetEvaluator

NFCorpusEvaluator is a class for evaluating pipeline performance on the NFCorpus dataset.

evaluate(**kwargs) EvaluateResult

Evaluate metrics and return the results.
:param validate_passages: If True, check that the passages in retrieval_gt have already been ingested. If False, you cannot use the context_recall and KF1 metrics. We recommend setting this to True for robust evaluation.
:return: EvaluateResult

ingest(retrievals: List[BaseRetrieval], db: BaseDB, ingest_size: int | None = None, random_state=None)

Ingest the dataset into the given retrievals and db.
:param retrievals: The retrievals to ingest the dataset into.
:param db: The db to ingest the dataset into.
:param ingest_size: The number of data points to ingest. If None, ingest all data.
:param random_state: A random state that fixes the shuffled corpus to ingest. Accepted types: int, array-like, BitGenerator, np.random.RandomState, np.random.Generator (optional).

Notice: If the ingest size is excessively large, processing takes a long time. To address this, we shuffle the corpus and slice it to the ingest size for testing. The retrieval ground-truth corpus is converted into passages and ingested so that retrieval can retrieve the ground truth from the database.

RAGchain.benchmark.dataset.qasper module

class RAGchain.benchmark.dataset.qasper.QasperEvaluator(run_pipeline: BaseRunPipeline, evaluate_size: int, metrics: List[str] | None = None, random_state: int = 42)

Bases: BaseDatasetEvaluator

QasperEvaluator is a class for evaluating pipeline performance on the Qasper dataset.

dataset_name = 'NomaDamas/qasper'
evaluate(**kwargs) EvaluateResult

Evaluate pipeline performance on the Qasper dataset.
:return: EvaluateResult

ingest(retrievals: List[BaseRetrieval], db: BaseDB, ingest_size: int | None = None)

Set the ingestion parameters for evaluating the pipeline. This method does not ingest passages, because the Qasper dataset is not designed for ingesting all paragraphs and retrieving from them; it only contains questions tied to specific papers. Instead, each paper's paragraphs are ingested at evaluation time.
:param retrievals: The retrievals to ingest into.
:param db: The db to ingest into.
:param ingest_size: Defaults to None. You do not need to set this parameter; if you do, it is ignored.
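
A usage sketch for this evaluator, assuming pre-built `pipeline`, `retrieval`, and `db`; note that evaluate_size is a required argument here and that ingest only registers the ingest targets, since passages are ingested per paper during evaluation.

    from RAGchain.benchmark.dataset.qasper import QasperEvaluator

    # evaluate_size is required for this evaluator.
    evaluator = QasperEvaluator(run_pipeline=pipeline, evaluate_size=10)

    # No passages are ingested here; the call only stores the ingest targets.
    evaluator.ingest(retrievals=[retrieval], db=db)

    result = evaluator.evaluate()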

preprocess(data)

Preprocess the Qasper dataset to make it suitable for evaluating the pipeline.

RAGchain.benchmark.dataset.search_qa module

class RAGchain.benchmark.dataset.search_qa.SearchQAEvaluator(run_pipeline: BaseRunPipeline, evaluate_size: int | None = None, metrics: List[str] | None = None)

Bases: BaseDatasetEvaluator

SearchQAEvaluator is a class for evaluating pipeline performance on the SearchQA dataset.

evaluate(**kwargs) EvaluateResult

Evaluate metrics and return the results.
:param validate_passages: If True, check that the passages in retrieval_gt have already been ingested. If False, you cannot use the context_recall and KF1 metrics. We recommend setting this to True for robust evaluation.
:return: EvaluateResult

ingest(retrievals: List[BaseRetrieval], db: BaseDB, ingest_size: int | None = None, random_state=None)

Ingest the dataset into the given retrievals and db.
:param retrievals: The retrievals to ingest the dataset into.
:param db: The db to ingest the dataset into.
:param ingest_size: The number of data points to ingest. If None, ingest all data. If the ingest size is excessively large, processing takes a long time; to address this, we shuffle the corpus and slice it to the ingest size for testing. The retrieval ground-truth corpus is converted into passages and ingested so that retrieval can retrieve the ground truth from the database. This dataset has many retrieval ground truths per query, so it is recommended to set the ingest size to a small value.
:param random_state: A random state that fixes the shuffled corpus to ingest. Accepted types: int, array-like, BitGenerator, np.random.RandomState, np.random.Generator (optional).

RAGchain.benchmark.dataset.strategy_qa module

class RAGchain.benchmark.dataset.strategy_qa.StrategyQAEvaluator(run_pipeline: BaseRunPipeline, evaluate_size: int | None = None, metrics: List[str] | None = None)

Bases: BaseDatasetEvaluator, BaseStrategyQA

StrategyQAEvaluator is a class for evaluating pipeline performance on the StrategyQA dataset.

dataset_name = 'voidful/StrategyQA'
evaluate(validate_passages: bool = True, **kwargs) EvaluateResult

Evaluate metrics and return the results.
:param validate_passages: If True, check that the passages in retrieval_gt have already been ingested. If False, you cannot use the context_recall and KF1 metrics. We recommend setting this to True for robust evaluation.
:return: EvaluateResult

ingest(retrievals: List[BaseRetrieval], db: BaseDB, ingest_size: int | None = None)

Ingest the dataset into the given retrievals and db.
:param retrievals: The retrievals to ingest the dataset into.
:param db: The db to ingest the dataset into.
:param ingest_size: The number of data points to ingest. If None, ingest all data. If you want to use the context_recall and context_precision metrics, you should ingest all data.
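
A usage sketch for the partial-ingest case, assuming pre-built `pipeline`, `retrieval`, and `db`; with only part of the corpus ingested, validate_passages can be set to False, at the cost of losing the context_recall and KF1 metrics.

    from RAGchain.benchmark.dataset.strategy_qa import StrategyQAEvaluator

    evaluator = StrategyQAEvaluator(run_pipeline=pipeline, evaluate_size=30)

    # Only part of the data is ingested, so skip the passage validation and
    # accept that context_recall and KF1 cannot be computed.
    evaluator.ingest(retrievals=[retrieval], db=db, ingest_size=500)
    result = evaluator.evaluate(validate_passages=False)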

RAGchain.benchmark.dataset.triviaqa module

class RAGchain.benchmark.dataset.triviaqa.TriviaQAEvaluator(run_pipeline: BaseRunPipeline, evaluate_size: int | None = None, metrics: List[str] | None = None)

Bases: BaseDatasetEvaluator

TriviaQAEvaluator is a class for evaluating pipeline performance on the TriviaQA dataset.

evaluate(**kwargs) EvaluateResult

Evaluate metrics and return the results.
:param validate_passages: If True, check that the passages in retrieval_gt have already been ingested. If False, you cannot use the context_recall and KF1 metrics. We recommend setting this to True for robust evaluation.
:return: EvaluateResult

ingest(retrievals: List[BaseRetrieval], db: BaseDB, ingest_size: int | None = None)

Ingest the dataset into the given retrievals and db.
:param retrievals: The retrievals to ingest the dataset into.
:param db: The db to ingest the dataset into.
:param ingest_size: The number of data points to ingest. If None, ingest all data.

Module contents