Evaluator
The evaluator is the module responsible for assessing the assistant's answers, ensuring they are of high quality and meet the desired criteria.
What is Automatic Evaluation?
Automatic evaluation is a crucial component in the development and optimization of AI assistants. It allows developers to assess an assistant's performance using predefined datasets and metrics.
Purpose of Automatic Evaluation
The primary purpose of automatic evaluation is to:
- Obtain performance metrics of the assistant on a set of known inputs.
- Support regression testing, preventing performance degradation over time.
Configuration
The Evaluator Configuration defines how assistants are tested and evaluated using predefined datasets and metrics. It is organized into two main steps: General and Metrics.
- General – In this step, you define the name of the evaluator and select the assistant that will be evaluated. The selected assistant will be used to generate responses for the test datasets during evaluation.
- Metrics – In this step, you configure the datasets, metrics, and visualizations that determine how evaluation results are measured and displayed.
  - Datasets – Select one or more datasets to use in the evaluation. Each dataset contains a collection of questions or prompts that the assistant will respond to.
  - Metrics – Choose the evaluation metrics to apply to the selected datasets.
  - Visualizations – Define how the results will be displayed.
Once the configuration is complete, click Add configuration to include the selected setup, and then Show metrics to review the configured metrics.
Running Evaluators
To execute an evaluator, navigate to Activity → Evaluation → Configuration. In this section, you will find a list of all existing evaluators. In the Actions column, click the Run button next to the evaluator you want to execute.
Once executed, you can monitor and review its status under Activity → Evaluation → Executions. Here, all evaluations appear with their respective ID, Date, Datasets Evaluated, Status, and Actions.
When an evaluation is completed, its status will display as COMPLETED, and you can access detailed performance metrics by clicking View Metrics. This view lets you analyze key indicators such as accuracy and behavior across datasets, confirming that the assistant responds correctly based on the information provided.
Key Components
Datasets
Datasets used in automatic evaluation typically include:
- In-domain questions with expected answers
- Out-of-context queries
- Prompt hacking attempts
These datasets help test various aspects of the assistant's performance, from accuracy to robustness against potential misuse.
To create a new dataset, enter a Name to identify your dataset and provide a Description that summarizes its purpose or content. Then, upload a Q&A File containing the dataset entries — each row should include a question and its corresponding answer. You can use the XLSX Template provided in the Templates dropdown as a reference for the required format.
Once all fields are completed, click Create to save the dataset. The dataset will then be available for use in evaluators, allowing you to test and measure assistant performance across different types of questions.
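For reference, a Q&A file like the one described above can also be prepared programmatically. The sketch below builds an XLSX with one question/answer pair per row using pandas; the column names "question" and "answer" are illustrative assumptions, so follow the XLSX Template from the Templates dropdown for the exact headers the platform expects.

```python
# A minimal sketch of preparing a Q&A dataset file with pandas.
# Column names "question" and "answer" are illustrative assumptions;
# use the XLSX Template from the Templates dropdown for the real format.
import pandas as pd

rows = [
    {"question": "What is the refund policy?",
     "answer": "Refunds are accepted within 30 days of purchase."},
    {"question": "How do I reset my password?",
     "answer": "Use the 'Forgot password' link on the login page."},
]

# One row per question/answer pair, written to an XLSX file (requires openpyxl).
pd.DataFrame(rows).to_excel("qa_dataset.xlsx", index=False)
```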
Metrics
Automatic evaluation employs three main types of metrics:
- Traditional NLP Metrics
- LLM-based Metrics
- Code Property Evaluation
Code Property Evaluation
This type of evaluation focuses on verifiable properties that can be checked through code, such as response length, presence of specific keywords, or adherence to certain formatting rules. One example is the reference check described below.
- Description: The score is 1 if the prediction has a reference (a number in brackets, e.g., [1]). The aggregation is the average of the scores for each prediction.
- Value Range: 0 (worst) to 1 (best).
- How to Read: The closer the score is to 1, the more responses include a reference in their answer.
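As an illustration of this kind of check, the sketch below implements the reference metric described above: each answer scores 1 if it contains a bracketed citation such as [1], and the dataset-level score is the average over all predictions. It is a minimal sketch, not the platform's actual implementation.

```python
import re

def has_reference(prediction: str) -> int:
    """Score 1 if the answer cites a reference such as [1], else 0."""
    return 1 if re.search(r"\[\d+\]", prediction) else 0

def aggregate(predictions: list[str]) -> float:
    """Average the per-answer scores, as described above."""
    scores = [has_reference(p) for p in predictions]
    return sum(scores) / len(scores) if scores else 0.0

# Two of these three answers include a bracketed reference -> score of about 0.67.
print(aggregate([
    "The warranty lasts two years [1].",
    "See the installation guide [2] for details.",
    "I don't have that information.",
]))
```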
NLP Metrics
Hugging Face provides a range of pre-implemented NLP metrics that can be used for evaluation, including BLEU, ROUGE, and BERTScore F1.
BLEU (Bilingual Evaluation Understudy)
- Description: BLEU is a commonly used metric for evaluating the quality of machine translations. It measures the overlap of words between the generated translation and one or more reference translations.
- Value Range: 0 (worst) to 1 (best). The closer to 1, the better the match with the references.
- How to Read: A BLEU score of 1 indicates a perfect match with the references, while lower values indicate less similarity.
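A minimal sketch of computing BLEU with Hugging Face's evaluate library follows; the example sentences are placeholders, and the platform may compute the score differently internally.

```python
import evaluate

bleu = evaluate.load("bleu")
result = bleu.compute(
    predictions=["the assistant answers questions about the product"],
    references=[["the assistant answers questions about the product catalogue"]],
)
# result["bleu"] ranges from 0 (worst) to 1 (best): word overlap with the reference(s).
print(result["bleu"])
```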
ROUGE (Recall-Oriented Understudy for Gisting Evaluation)
- ROUGE-1 (unigram overlap):
- Description: Measures the overlap of unigrams (individual words) between the reference and the generated output.
- Value Range: 0 (worst) to 1 (best).
- How to Read: The closer to 1, the higher the similarity in terms of individual words.
- ROUGE-2 (bigram overlap):
- Description: Similar to ROUGE-1, but measures the overlap of bigrams (pairs of words).
- Value Range: 0 (worst) to 1 (best).
- ROUGE-L (longest common subsequence):
- Description: Measures the overlap of the longest common subsequence between the reference and the generated output.
- Value Range: 0 (worst) to 1 (best).
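Similarly, a minimal sketch of computing the ROUGE variants with Hugging Face's evaluate library, on placeholder sentences:

```python
import evaluate

rouge = evaluate.load("rouge")
result = rouge.compute(
    predictions=["the refund is processed within five business days"],
    references=["refunds are processed within five business days"],
)
# Each value ranges from 0 (worst) to 1 (best).
print(result["rouge1"], result["rouge2"], result["rougeL"])
```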
BERTScore F1
- Description: BERTScore evaluates the semantic similarity between two texts, using a pre-trained model such as BERT to compute the score.
- Value Range: 0 (worst) to 1 (best).
- How to Read: The closer to 1, the higher the semantic similarity between the reference and the generated output.
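And a sketch of BERTScore F1 with the same evaluate library; the first call downloads a pre-trained model, and the default model choice may differ from what the platform uses.

```python
import evaluate

bertscore = evaluate.load("bertscore")
result = bertscore.compute(
    predictions=["You can reset your password from the login page."],
    references=["Passwords can be reset via the login page."],
    lang="en",
)
# result["f1"] is a list with one semantic-similarity score (0 to 1) per prediction.
print(result["f1"])
```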
LLM Metrics
LLM metrics use the power of large language models to evaluate semantic properties of the assistant's responses, such as relevance to the query, factual accuracy, coherence, and fluency.
- Description: Evaluation using the LangChain evaluator. The criterion, model, and specific prompt can be defined if desired. This evaluator uses a language model to verify that a certain criterion is met in the input text. For the 'correctness' criterion, the question, reference answer, and generated answer must be provided. The evaluator returns 1 if the answer is correct and 0 otherwise, and the final value is the average over all predictions. More information on how to configure these types of evaluators is provided in the next section.
- Value Range: 0 (worst) to 1 (best).
- How to Read: The closer to 1, the more responses are semantically correct.
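The sketch below shows one way to run such a correctness check with LangChain's criteria evaluator. It is a minimal example under assumptions: the exact evaluator type, prompt, and model used by the platform are not specified here, the model name is a placeholder, and the API may vary across LangChain versions.

```python
from langchain.evaluation import load_evaluator
from langchain_openai import ChatOpenAI

# Placeholder model; the platform's configured model and prompt may differ.
llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)
evaluator = load_evaluator("labeled_criteria", criteria="correctness", llm=llm)

result = evaluator.evaluate_strings(
    input="What is the warranty period?",                    # question
    reference="The warranty lasts two years.",               # reference answer
    prediction="Our products carry a two-year warranty.",    # generated answer
)
# "score" is 1 if the answer is judged correct, 0 otherwise; averaging these
# per-prediction scores gives the final metric value.
print(result["score"])
```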
Executions
Automatic evaluation generates dashboards with insights into the assistant's performance. These typically include:
- Overall scores on different metrics
- Breakdown of performance by question type or dataset
- Comparison with previous versions or benchmarks
- Highlighted areas for improvement
Benefits of Automatic Evaluation
- Consistency: Provides a standardized way to measure performance across different versions or configurations.
- Efficiency: Allows for rapid testing of changes without the need for extensive manual review.
- Scalability: Can handle large volumes of test cases that would be impractical for human evaluation.
- Objectivity: Reduces potential bias in evaluation by using predefined criteria.
Limitations
While automatic evaluation is a powerful tool, it does not replace human evaluation entirely. Some aspects of assistant performance, such as nuanced understanding of context or appropriateness of tone, still require human judgment.

In conclusion, automatic evaluation serves as a critical tool in the development and maintenance of AI assistants, providing quantitative insights into performance and guiding ongoing improvements.