Quick Introduction
DeepEval is an open-source evaluation framework for Python that makes it easy to build and iterate on LLM applications, with the following principles in mind:
- Easily "unit test" LLM outputs in a similar way to Pytest.
- Leverage various out-of-the-box LLM-evaluated and classic evaluation metrics.
- Define evaluation datasets in Python code.
- Metrics are simple to customize.
- [alpha] Bring evaluation into production using Python decorators.
Set Up A Python Environment
Go to the root directory of your project and create a virtual environment (if you don't already have one). In the CLI, run:
python3 -m venv venv
source venv/bin/activate
Installation
In your newly created virtual environment, run:
pip install -U deepeval
You can also keep track of all evaluation results by logging into our all-in-one evaluation platform, and use Confident AI's proprietary LLM evaluation agent for evaluation:
deepeval login
Contact us if you're dealing with sensitive data that has to reside in your private VPCs.
Create Your First Test Case
Create a test file in your root directory:
touch test_example.py
Open test_example.py and paste in your first test case:
import pytest
from deepeval import assert_test
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase

def test_answer_relevancy():
    input = "What if these shoes don't fit?"
    retrieval_context = ["All customers are eligible for a 30 day full refund at no extra cost."]
    # Replace this with the actual output of your LLM application
    actual_output = "We offer a 30-day full refund at no extra cost."
    answer_relevancy_metric = AnswerRelevancyMetric(threshold=0.5)
    test_case = LLMTestCase(input=input, actual_output=actual_output, retrieval_context=retrieval_context)
    assert_test(test_case, [answer_relevancy_metric])
Run deepeval test run from the root directory of your project:
deepeval test run test_example.py
Congratulations! Your test case should have passed ✅ Let's break down what happened.
- The variable input mimics a user input, and actual_output is a placeholder for what your application is supposed to output based on this input.
- The variable retrieval_context contains the retrieved context from your knowledge base, and AnswerRelevancyMetric(threshold=0.5) is a default metric provided by deepeval for you to evaluate your LLM output's relevancy based on the provided retrieval context.
- All metric scores range from 0 to 1; the threshold=0.5 argument ultimately determines whether your test has passed or not.
You'll need to set your OPENAI_API_KEY as an environment variable before running the AnswerRelevancyMetric, since the AnswerRelevancyMetric is an LLM-evaluated metric. To use a custom LLM instead of OpenAI, check out this part of the docs.
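If you'd like to inspect a metric's score directly before wiring it into a test, you can also run it standalone. Below is a minimal sketch, assuming the same test case as above and that your OPENAI_API_KEY is already set; it uses the same measure, score, and is_successful interface that custom metrics implement later in this guide.
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase

# Same example test case as in test_answer_relevancy above
test_case = LLMTestCase(
    input="What if these shoes don't fit?",
    actual_output="We offer a 30-day full refund at no extra cost.",
    retrieval_context=["All customers are eligible for a 30 day full refund at no extra cost."],
)

metric = AnswerRelevancyMetric(threshold=0.5)
metric.measure(test_case)       # runs the LLM-evaluated scoring against the test case
print(metric.score)             # a float between 0 and 1
print(metric.is_successful())   # True if the score meets the threshold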
To save the test results locally for each test run, set the DEEPEVAL_RESULTS_FOLDER environment variable to your relative path of choice:
export DEEPEVAL_RESULTS_FOLDER="./data"
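If you'd rather configure this from Python than from your shell, here is a minimal sketch using the standard library; it assumes the variable is set before deepeval reads it (for example, in a conftest.py that loads at the start of the test run):
import os

# Equivalent to the shell export above; this only affects the current process,
# so it must run before deepeval looks up DEEPEVAL_RESULTS_FOLDER.
os.environ["DEEPEVAL_RESULTS_FOLDER"] = "./data"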
Create Your First Custom Metric
deepeval provides two types of custom metrics to evaluate LLM outputs: metrics evaluated with LLMs and metrics evaluated without LLMs. Here is a brief overview of each.
LLM Evaluated Metrics
An LLM-evaluated metric is one where evaluation is carried out by an LLM. deepeval offers G-Eval, a state-of-the-art framework to create an LLM-evaluated metric.
from deepeval import assert_test
from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCase, LLMTestCaseParams
...

def test_summarization():
    input = "What if these shoes don't fit? I want a full refund."
    # Replace this with the actual output from your LLM application
    actual_output = "If the shoes don't fit, the customer wants a full refund."
    summarization_metric = GEval(
        name="Summarization",
        criteria="Summarization - determine if the actual output is an accurate and concise summarization of the input.",
        evaluation_params=[LLMTestCaseParams.INPUT, LLMTestCaseParams.ACTUAL_OUTPUT],
        threshold=0.5
    )
    test_case = LLMTestCase(input=input, actual_output=actual_output)
    assert_test(test_case, [summarization_metric])
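If a single criteria string feels too coarse, G-Eval can also be driven by explicit evaluation steps. Below is a sketch, assuming your installed version of deepeval supports the evaluation_steps parameter of GEval:
from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCaseParams

# Same metric, defined with explicit steps instead of a criteria string
summarization_metric = GEval(
    name="Summarization",
    evaluation_steps=[
        "Check whether the actual output covers every request made in the input.",
        "Penalize summaries that add details not present in the input.",
    ],
    evaluation_params=[LLMTestCaseParams.INPUT, LLMTestCaseParams.ACTUAL_OUTPUT],
    threshold=0.5,
)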
Classic Metrics
A classic metric is a metric whose evaluation is not carried out by another LLM. You can define a custom classic metric by implementing the measure and is_successful methods after inheriting from the BaseMetric class. Paste in the following:
from deepeval.metrics import BaseMetric
...

class LengthMetric(BaseMetric):
    # This metric checks if the output length is greater than 10 characters
    def __init__(self, max_length: int=10):
        self.threshold = max_length

    def measure(self, test_case: LLMTestCase):
        self.success = len(test_case.actual_output) > self.threshold
        if self.success:
            score = 1
        else:
            score = 0
        return score

    def is_successful(self):
        return self.success

    @property
    def __name__(self):
        return "Length"

def test_length():
    input = "What if these shoes don't fit?"
    # Replace this with the actual output of your LLM application
    actual_output = "We offer a 30-day full refund at no extra cost."
    length_metric = LengthMetric(max_length=10)
    test_case = LLMTestCase(input=input, actual_output=actual_output)
    assert_test(test_case, [length_metric])
Run deepeval test run from the root directory of your project again:
deepeval test run test_example.py
You should see both test_answer_relevancy and test_length passing.
Two things to note:
- Custom metrics require a threshold as a passing criterion. In the case of our LengthMetric, the passing criterion was whether the length of actual_output is greater than max_length (10 characters).
- We removed retrieval_context in test_length since it was irrelevant to evaluating output length. However, input and actual_output are always mandatory to create a valid LLMTestCase.
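A custom metric also doesn't have to be binary: since scores range from 0 to 1, measure can return any value in that range. Here is a hypothetical graded variant (the class name and target_length parameter are made up for illustration) that relies only on the BaseMetric interface shown above:
from deepeval.metrics import BaseMetric
from deepeval.test_case import LLMTestCase

class LengthRatioMetric(BaseMetric):
    # Scores how close the output length is to a target length, capped at 1
    def __init__(self, target_length: int = 50, threshold: float = 0.5):
        self.target_length = target_length
        self.threshold = threshold

    def measure(self, test_case: LLMTestCase):
        self.score = min(len(test_case.actual_output) / self.target_length, 1.0)
        self.success = self.score >= self.threshold
        return self.score

    def is_successful(self):
        return self.success

    @property
    def __name__(self):
        return "Length Ratio"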
Combine Your Metrics
You might've noticed we have duplicated test cases for both test_answer_relevancy and test_length (i.e. they have the same input and actual output). To avoid this redundancy, deepeval offers an easy way to apply as many metrics as you wish to a single test case.
...

def test_everything():
    input = "What if these shoes don't fit?"
    retrieval_context = ["All customers are eligible for a 30 day full refund at no extra cost."]
    # Replace this with the actual output of your LLM application
    actual_output = "We offer a 30-day full refund at no extra cost."
    answer_relevancy_metric = AnswerRelevancyMetric(threshold=0.7)
    length_metric = LengthMetric(max_length=10)
    summarization_metric = GEval(
        name="Summarization",
        criteria="Summarization - determine if the actual output is an accurate and concise summarization of the input.",
        evaluation_params=[LLMTestCaseParams.INPUT, LLMTestCaseParams.ACTUAL_OUTPUT],
        threshold=0.5
    )
    test_case = LLMTestCase(input=input, actual_output=actual_output, retrieval_context=retrieval_context)
    assert_test(test_case, [answer_relevancy_metric, length_metric, summarization_metric])
In this scenario, test_everything only passes if all metrics are passing. Run deepeval test run again to see the results:
deepeval test run test_example.py
Evaluate Your Dataset in Bulk
An evaluation dataset in deepeval is a collection of test cases. Using deepeval's Pytest integration, you can utilize the @pytest.mark.parametrize decorator to loop through and evaluate your evaluation dataset.
import pytest
from deepeval import assert_test
from deepeval.metrics import FaithfulnessMetric, AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase
from deepeval.dataset import EvaluationDataset

first_test_case = LLMTestCase(input="...", actual_output="...", context=["..."])
second_test_case = LLMTestCase(input="...", actual_output="...", context=["..."])

# Initialize an evaluation dataset by supplying a list of test cases
dataset = EvaluationDataset(test_cases=[first_test_case, second_test_case])

# Loop through test cases using Pytest
@pytest.mark.parametrize(
    "test_case",
    dataset,
)
def test_customer_chatbot(test_case: LLMTestCase):
    faithfulness_metric = FaithfulnessMetric(threshold=0.3)
    answer_relevancy_metric = AnswerRelevancyMetric(threshold=0.5)
    assert_test(test_case, [faithfulness_metric, answer_relevancy_metric])
You can also run test cases in parallel by using the optional -n flag followed by a number (which determines the number of processes that will be used) when executing deepeval test run:
deepeval test run test_dataset.py -n 2
Alternatively, you can evaluate entire datasets without going through the CLI (if you're in a notebook environment):
from deepeval import evaluate
...
# test_cases is a list of LLMTestCases, e.g. dataset.test_cases from above
evaluate(test_cases, [faithfulness_metric, answer_relevancy_metric])
To learn more about the additional features an EvaluationDataset offers, visit the dataset section.
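As a quick taste, a dataset can also be built up incrementally; here is a brief sketch, assuming your version of deepeval exposes EvaluationDataset.add_test_case:
from deepeval.dataset import EvaluationDataset
from deepeval.test_case import LLMTestCase

dataset = EvaluationDataset()
# Append test cases one at a time, e.g. as you collect real user interactions
dataset.add_test_case(
    LLMTestCase(input="...", actual_output="...", context=["..."])
)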
Visualize Your Results
If you have reached this point, you've likely run deepeval test run multiple times. To keep track of all future evaluation results created by deepeval, log in to Confident AI by running the following command:
deepeval login
Confident AI is the platform powering deepeval, and offers deep insights to help you quickly figure out how to best implement your LLM application. Follow the instructions displayed on the CLI to create an account, get your Confident API key, and paste it in the CLI.
Once you've pasted your Confident API key in the CLI, run:
deepeval test run test_example.py
View Test Run
You should now see a link being returned upon test completion. Paste it in your browser to view results.
View Individual Test Cases
You can also view individual test cases for enhanced debugging.
Compare Hyperparameters
To log hyperparameters (such as the prompt templates used) for your LLM application, paste the following code into test_example.py:
import deepeval
...

@deepeval.set_hyperparameters
def hyperparameters():
    return {
        "chunk_size": 500,
        "temperature": 0,
        "model": "GPT-4",
        "prompt_template": """You are a helpful assistant, answer the following question in a non-judgemental tone.

Question:
{question}
""",
    }
Execute deepeval test run test_example.py again to start comparing hyperparameters for each test run.
Full Example
You can find the full example here on our GitHub.