``eval_utils`` module
================================
The ``eval_utils`` module provides a complete framework for evaluating LLM and model
serving endpoints. Everything starts from an **Experiment** — a bundle of a dataset,
one or more models, and a set of metrics. You define an experiment, run it, and get
back results tracked in MLflow and Delta tables.
.. attention:: These functions are designed to run within Databricks notebooks and require access to
Unity Catalog and MLflow.
Configuring an Experiment
^^^^^^^^^^^^^^^^^^^^^^^^^^
An ``Experiment`` has three parts:
1. **Dataset** — which eval table to use, and which columns contain the input, expected output, and additional context.
2. **Models** — one or more models to evaluate, each targeting a Databricks serving endpoint or a LiteLLM-supported model.
- When configuring a model, you can use a prompt template, a prompt registry name, or a prompt alias. We suggest using the prompt registry for production experiments.
3. **Metrics** — which metrics to compute across all models (``latency``, ``token_count``, ``exact_match``, ``fuzzy_match``, ``llm_judge``, ``hit_rate``, or custom scorers).
You can define an experiment in Python or load it entirely from a YAML file.
Defining in Python
""""""""""""""""""
The example below creates an eval table, defines an experiment, and runs it.
Note the different ways to configure a model: via a Databricks serving endpoint, a
model in the registry, or a LiteLLM-supported provider. All of these are available
through both code and YAML.
.. important::
You must run ``deploy_model_serving_endpoint`` before you can use a custom model via an endpoint.
.. code-block:: python
from ml_toolkit.functions.eval_utils import (
create_eval_table,
run_experiment,
DatasetConfig,
ModelConfig,
Experiment,
Metric,
)
# 1. Create an eval dataset (one-time setup)
table_name = create_eval_table(
schema_name="sandbox",
table_name="vendor_eval",
primary_key="row_id",
df=eval_df,
)
# 2. Define the experiment
experiment = Experiment(
name="vendor-tagger-comparison",
dataset=DatasetConfig(
table=table_name,
primary_key="row_id",
input_column="input",
expected_output_column="expected_output",
additional_context_columns=["candidates"],
),
models=[
ModelConfig(
name="llama-8b",
endpoint="databricks-meta-llama-3-1-8b-instruct",
prompt_template="Tag: <>. Options: <>.",
max_output_tokens=256,
temperature=0.0,
),
ModelConfig(
name="llama-8b-custom-endpoint",
endpoint="endpoint-name",
version=1, # version of the model in the endpoint
prompt_template="Tag: <>. Options: <>.",
max_output_tokens=256,
temperature=0.0,
),
ModelConfig(
name="llama-8b-qlora",
model_name="catalog.schema.my_qlora_model",
version=2, # version of the model in the model registry
prompt_registry_name="catalog.schema.vendor_tagging_prompt",
prompt_alias="prod",
max_output_tokens=256,
temperature=0.0,
),
ModelConfig(
name="gpt-5",
litellm_model="gpt-5",
prompt_registry_name="catalog.schema.vendor_tagging_prompt",
system_prompt="You are a precise tagging assistant.",
max_output_tokens=9000,
temperature=0.0,
),
],
metrics=[Metric.LATENCY, Metric.EXACT_MATCH, Metric.FUZZY_MATCH],
)
# 3. Run the experiment
result = run_experiment(experiment)
result.summary_df.show()
``DatasetConfig`` points to an existing Unity Catalog eval table. ``ModelConfig`` defines
how to reach a model — either via ``endpoint`` (Databricks serving) or ``litellm_model``
(any LiteLLM-supported provider). Prompt templates use ``<>`` placeholders
that reference columns in the dataset.
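To make the placeholder behavior concrete, the sketch below fills ``<>`` markers positionally, first from the input column and then from the additional context columns, as the templates above suggest. ``fill_template`` is a hypothetical helper for illustration only, not part of the library.

```python
# Illustrative sketch of positional "<>" placeholder filling. The library
# performs the real substitution internally; this just mirrors the semantics.
def fill_template(template: str, values: list[str]) -> str:
    """Replace each "<>" placeholder with the next value, left to right."""
    for value in values:
        template = template.replace("<>", value, 1)
    return template

# One dataset row: the input column plus one additional context column.
row = {"input": "AMZN Mktp", "candidates": "Amazon, Amazon Web Services"}
prompt = fill_template("Tag: <>. Options: <>.", [row["input"], row["candidates"]])
print(prompt)  # Tag: AMZN Mktp. Options: Amazon, Amazon Web Services.
```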
To add LLM-as-judge scoring, include ``Metric.LLM_JUDGE`` and provide a
``LLMJudgeConfig``:
.. code-block:: python
from ml_toolkit.functions.eval_utils import LLMJudgeConfig
experiment = Experiment(
name="vendor-tagger-with-judge",
dataset=dataset,
models=[model],
metrics=[Metric.LATENCY, Metric.EXACT_MATCH, Metric.LLM_JUDGE],
llm_judge_config=LLMJudgeConfig(
judge_model="gpt-4o",
criteria="correctness",
rubric="Score 1-5 based on how well the output matches the expected output.",
max_score=5.0,
),
)
.. _yaml-configuration:
Loading from YAML
"""""""""""""""""
Every config type can be loaded from a YAML file. This makes experiments reproducible,
version-controllable, and easy to share.
.. code-block:: python
# Load a full experiment from a single YAML file
experiment = Experiment(experiment_yaml="configs/experiment.yaml")
result = run_experiment(experiment)
# Or load individual configs
dataset = DatasetConfig(dataset_yaml="configs/dataset.yaml")
model = ModelConfig(model_yaml="configs/model.yaml")
judge = LLMJudgeConfig(llm_judge_yaml="configs/judge.yaml")
When a YAML path is provided, the file contents override any other parameters passed
to the constructor.
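The precedence rule can be pictured as a simple dictionary merge. ``merge_config`` below is not a real library function; it is a sketch of the stated behavior, where YAML values win on key collisions.

```python
# Illustrative only: YAML-loaded values take precedence over constructor kwargs.
def merge_config(kwargs: dict, yaml_values: dict) -> dict:
    merged = dict(kwargs)
    merged.update(yaml_values)  # YAML wins on collisions
    return merged

merged = merge_config(
    {"name": "from-code", "temperature": 0.5},  # passed to the constructor
    {"name": "vendor-tagger-comparison"},       # loaded from the YAML file
)
print(merged)  # {'name': 'vendor-tagger-comparison', 'temperature': 0.5}
```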
**Experiment YAML** — bundles dataset, models, and metrics in one file:
.. code-block:: yaml
:caption: configs/experiment.yaml
name: vendor-tagger-comparison
description: Compare Llama-8B vs GPT-5 for vendor tagging
dataset:
table: catalog.schema.vendor_eval
primary_key: row_id
input_column: input
expected_output_column: expected_output
additional_context_columns:
- candidates
models:
- name: llama-8b
endpoint: databricks-meta-llama-3-1-8b-instruct
prompt_template: |
Tag the following vendor name: <>.
Output the vendor name and only the vendor name, as normalized in the options below: <>.
max_output_tokens: 256
temperature: 0.0
- name: gpt-5
litellm_model: gpt-5
prompt_registry_name: catalog.schema.vendor_tagging_prompt
max_output_tokens: 9000
temperature: 0.0
metrics:
- latency
- exact_match
- fuzzy_match
tags:
team: ml
**With LLM-as-judge** — add the ``llm_judge`` metric and an ``llm_judge_config`` block:
.. code-block:: yaml
:caption: configs/experiment_with_judge.yaml
name: vendor-tagger-with-judge
dataset:
table: catalog.schema.vendor_eval
primary_key: row_id
input_column: input
expected_output_column: expected_output
additional_context_columns:
- candidates
models:
- name: llama-8b
endpoint: databricks-meta-llama-3-1-8b-instruct
prompt_registry_name: catalog.schema.vendor_tagging_prompt
prompt_alias: prod
max_output_tokens: 256
temperature: 0.0
metrics:
- latency
- exact_match
- llm_judge
llm_judge_config:
judge_model: gpt-4o
criteria: correctness
rubric: |
Score 1-5 based on how well the model output matches the expected output:
5: Exact match or semantically identical
4: Minor differences but essentially correct
3: Partially correct
2: Mostly incorrect but shows understanding
1: Completely wrong
max_score: 5.0
**Individual config files** — useful when you want to mix and match:
.. code-block:: yaml
:caption: configs/dataset.yaml
table: catalog.schema.vendor_eval
primary_key: row_id
input_column: input
expected_output_column: expected_output
additional_context_columns:
- candidates
.. code-block:: yaml
:caption: configs/model_databricks.yaml — Databricks endpoint
name: llama-8b
endpoint: databricks-meta-llama-3-1-8b-instruct
prompt_registry_name: catalog.schema.vendor_tagging_prompt
prompt_alias: prod
max_output_tokens: 256
temperature: 0.0
.. code-block:: yaml
:caption: configs/model_litellm.yaml — LiteLLM model
name: gpt-5
litellm_model: gpt-5
prompt_registry_name: catalog.schema.vendor_tagging_prompt
prompt_version: 1
max_output_tokens: 9000
temperature: 0.0
Saving and reproducing
""""""""""""""""""""""
Any config object can be serialized back to YAML with ``to_yaml()``, and previously
run experiments can be exported with ``get_experiment_yaml()``:
.. code-block:: python
# Save current experiment config
experiment.to_yaml("configs/my_experiment.yaml")
# Export a past experiment by ID
yaml_content = get_experiment_yaml(experiment_id)
Remote execution
""""""""""""""""
Experiments can be offloaded to Databricks serverless compute using the ``trigger_remote``
parameter. This is useful for long-running evaluations that you don't want blocking your
notebook session.
.. code-block:: python
:caption: Fire-and-forget
experiment = Experiment(experiment_yaml="configs/experiment.yaml")
# Submit to serverless — returns immediately
remote_run = run_experiment(experiment, trigger_remote=True)
# Track progress in the Databricks UI
print(remote_run.databricks_url)
print(remote_run.job_run_id)
# When ready, block until the job finishes and get results
result = remote_run.get_result()
result.summary_df.show()
.. code-block:: python
:caption: Block until the remote job completes
result = run_experiment(experiment, trigger_remote=True, wait=True)
result.summary_df.show()
Under the hood, the ``Experiment`` is serialized to a dictionary, submitted as a
Databricks serverless job, reconstructed on the remote cluster, and run as usual.
Results are tracked in MLflow and can be retrieved with ``get_experiment()``.
.. note::
**Custom callable metrics cannot be used remotely** — they are not serializable.
All built-in ``Metric`` enum values work with remote execution.
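The constraint can be seen with the standard library alone: pickling an ad-hoc callable, the kind a custom scorer typically is, fails because functions are serialized by importable name. The ``scorer`` lambda below stands in for a custom metric; it is illustrative, not a library API.

```python
import pickle

# A stand-in for a custom callable metric: an ad-hoc lambda.
scorer = lambda output, expected: float(output == expected)

# Lambdas and locally defined functions are not picklable, so they cannot
# survive the round trip to a serverless job.
try:
    pickle.dumps(scorer)
    serializable = True
except (pickle.PicklingError, AttributeError, TypeError):
    serializable = False

print(serializable)  # False
```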
API Reference
^^^^^^^^^^^^^^
Running Evaluations
""""""""""""""""""""
.. autoapifunction:: ml_toolkit.functions.eval_utils.eval.run_experiment
.. autoapifunction:: ml_toolkit.functions.eval_utils.eval.run_evaluation
Table Management
"""""""""""""""""
.. autoapifunction:: ml_toolkit.functions.eval_utils.table_management.create_eval_table
.. autoapifunction:: ml_toolkit.functions.eval_utils.table_management.upsert_eval_table
.. autoapifunction:: ml_toolkit.functions.eval_utils.table_management.lock_eval_table
.. autoapifunction:: ml_toolkit.functions.eval_utils.table_management.unlock_eval_table
Retrieving Results
"""""""""""""""""""
Run-level
~~~~~~~~~
.. autoapifunction:: ml_toolkit.functions.eval_utils.results.get_run
.. autoapifunction:: ml_toolkit.functions.eval_utils.results.get_run_df
.. autoapifunction:: ml_toolkit.functions.eval_utils.results.list_runs_df
.. autoapifunction:: ml_toolkit.functions.eval_utils.results.compare_runs_df
Experiment-level
~~~~~~~~~~~~~~~~
.. autoapifunction:: ml_toolkit.functions.eval_utils.results.get_experiment
.. autoapifunction:: ml_toolkit.functions.eval_utils.results.get_experiment_df
.. autoapifunction:: ml_toolkit.functions.eval_utils.results.get_experiment_yaml
.. autoapifunction:: ml_toolkit.functions.eval_utils.results.list_experiments_df
.. autoapifunction:: ml_toolkit.functions.eval_utils.results.compare_experiments_df
Configuration Types
""""""""""""""""""""
.. autoapiclass:: ml_toolkit.functions.eval_utils.helpers.config.Experiment
.. autoapiclass:: ml_toolkit.functions.eval_utils.helpers.config.DatasetConfig
.. autoapiclass:: ml_toolkit.functions.eval_utils.helpers.config.ModelConfig
.. autoapiclass:: ml_toolkit.functions.eval_utils.helpers.types.LLMJudgeConfig
.. autoapiclass:: ml_toolkit.functions.eval_utils.constants.Metric
Result Types
"""""""""""""
.. autoapiclass:: ml_toolkit.functions.eval_utils.helpers.types.EvalRunResult
.. autoapiclass:: ml_toolkit.functions.eval_utils.helpers.types.ExperimentResult
.. autoapiclass:: ml_toolkit.functions.eval_utils.helpers.types.RemoteExperimentRun
.. autoapiclass:: ml_toolkit.functions.eval_utils.helpers.types.CustomScorer
Multi-Model Comparison
"""""""""""""""""""""""
Functions for pivoting multi-model inference results and computing agreement
metrics across models. These are designed to work with the output of the
**Multi-Model Inference UDTF** (see :doc:`llm`), but can also be used
independently on any DataFrame with per-model output columns.
.. code-block:: python
:caption: End-to-end: UDTF output to comparison metrics
from ml_toolkit.functions.eval_utils import pivot_and_compare_results
from pyspark.sql import functions as F
# results_df is the output of the multi-model UDTF
models = ["openai/gpt-4o", "openai/databricks-claude-3-7-sonnet"]
compared = pivot_and_compare_results(results_df, models)
# Aggregate agreement rates
compared.select(
F.avg(F.col("metrics").getItem("all_models_strict")).alias("strict_rate"),
F.avg(F.col("metrics").getItem("all_models_relaxed")).alias("relaxed_rate"),
).show()
.. code-block:: python
:caption: Using add_comparison_metrics on custom data
from ml_toolkit.functions.eval_utils import add_comparison_metrics
# df has one array-typed output column per model: model_a, model_b, model_c
result = add_comparison_metrics(df, ["model_a", "model_b", "model_c"])
# result has original columns + "metrics" map column
The ``metrics`` map contains four keys:
.. list-table::
:header-rows: 1
:widths: 25 75
* - Key
- Description
* - ``all_models_strict``
- **1.0** if every model returned the exact same set of values (order-insensitive).
* - ``all_models_relaxed``
- **1.0** if for every pair of models, one output is a subset of the other.
* - ``majority_strict``
- **1.0** if a strict majority (>50%) of models returned the same set.
* - ``majority_relaxed``
- **1.0** if a relaxed majority (>50%) of models agree by subset criteria.
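The four keys can be reproduced with plain Python set logic. The sketch below follows the definitions in the table; the ``majority_relaxed`` reading (counting models whose output has a subset relation with some candidate set) is our interpretation of "agree by subset criteria", and the library computes these metrics over Spark columns rather than Python lists.

```python
from itertools import combinations

def comparison_metrics(outputs: list[list[str]]) -> dict[str, float]:
    """Illustrative set-based agreement metrics for one row of model outputs."""
    sets = [set(o) for o in outputs]
    n = len(sets)

    # 1.0 if every model returned the exact same set (order-insensitive)
    all_strict = float(all(s == sets[0] for s in sets))

    # 1.0 if, for every pair of models, one output is a subset of the other
    all_relaxed = float(all(a <= b or b <= a for a, b in combinations(sets, 2)))

    # 1.0 if a strict majority (>50%) of models returned the same set
    majority_strict = float(
        any(sum(1 for t in sets if t == s) > n / 2 for s in sets)
    )

    # 1.0 if a >50% group of models agrees with some set by the subset criterion
    majority_relaxed = float(
        any(sum(1 for t in sets if s <= t or t <= s) > n / 2 for s in sets)
    )

    return {
        "all_models_strict": all_strict,
        "all_models_relaxed": all_relaxed,
        "majority_strict": majority_strict,
        "majority_relaxed": majority_relaxed,
    }

# all_models_strict is 0.0 here; the other three keys are 1.0.
print(comparison_metrics([["a", "b"], ["b", "a"], ["a"]]))
```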
.. autoapifunction:: ml_toolkit.functions.eval_utils.helpers.multi_model_comparison.pivot_and_compare_results
.. autoapifunction:: ml_toolkit.functions.eval_utils.helpers.multi_model_comparison.add_comparison_metrics
.. autoapifunction:: ml_toolkit.functions.eval_utils.helpers.multi_model_comparison.normalized_output_array