eval_utils module#

The eval_utils module provides a complete framework for evaluating LLMs and model serving endpoints. Everything starts from an Experiment — a bundle of a dataset, one or more models, and a set of metrics. You define an experiment, run it, and get back tracked results in MLflow and Delta tables.

Attention

These functions are designed to run within Databricks notebooks and require access to Unity Catalog and MLflow.

Configuring an Experiment#

An Experiment has three parts:

  1. Dataset — which eval table to use, and which columns contain the input, expected output, and additional context.

  2. Models — one or more models to evaluate, each targeting a Databricks serving endpoint or a LiteLLM-supported model.

    • When configuring a model, you can use a prompt template, a prompt registry name, or a prompt alias. We suggest using the prompt registry for production experiments.

  3. Metrics — which metrics to compute across all models (latency, token_count, exact_match, fuzzy_match, llm_judge, hit_rate, or custom scorers).

You can define an experiment in Python or load it entirely from a YAML file.

Defining in Python#

This example shows how to define an experiment in Python. It creates an eval table, defines an experiment, and runs it. Note the four ways to configure a model: a Databricks foundation model endpoint, a custom serving endpoint (optionally pinned to a version), a registered Unity Catalog model, or a LiteLLM-supported model. All of these options are available through both code and YAML.

Important

You must run deploy_model_serving_endpoint before you can use a custom model via an endpoint.

from ml_toolkit.functions.eval_utils import (
    create_eval_table,
    run_experiment,
    DatasetConfig,
    ModelConfig,
    Experiment,
    Metric,
)

# 1. Create an eval dataset (one-time setup)
table_name = create_eval_table(
    schema_name="sandbox_dataset_bronze",  # schema must end with '_dataset_bronze'
    table_name="vendor_eval",
    primary_key="row_id",
    df=eval_df,
)

# 2. Define the experiment
experiment = Experiment(
    name="vendor-tagger-comparison",
    dataset=DatasetConfig(
        table=table_name,
        primary_key="row_id",
        input_column="input",
        expected_output_column="expected_output",
        additional_context_columns=["candidates"],
    ),
    models=[
        ModelConfig(
            name="llama-8b-foundation",
            endpoint="databricks-meta-llama-3-1-8b-instruct",
            prompt_template="Tag: <<input>>. Options: <<candidates>>.",
            max_output_tokens=256,
            temperature=0.0,
        ),
        ModelConfig(
            name="llama-8b-custom",
            endpoint="endpoint-name",
            version=1,  # version of the model in the endpoint
            prompt_template="Tag: <<input>>. Options: <<candidates>>.",
            max_output_tokens=256,
            temperature=0.0,
        ),
        ModelConfig(
            name="llama-8b-qlora",
            model_name="catalog.schema.my_qlora_model",
            version=2,  # version of the model in the model registry
            prompt_registry_name="catalog.schema.vendor_tagging_prompt",
            prompt_alias="prod",
            max_output_tokens=256,
            temperature=0.0,
        ),
        ModelConfig(
            name="gpt-5",
            litellm_model="gpt-5",
            prompt_registry_name="catalog.schema.vendor_tagging_prompt",
            system_prompt="You are a precise tagging assistant.",
            max_output_tokens=9000,
            temperature=0.0,
        ),
    ],
    metrics=[Metric.LATENCY, Metric.EXACT_MATCH, Metric.FUZZY_MATCH],
)

# 3. Run the experiment
result = run_experiment(experiment)
result.summary_df.show()

DatasetConfig points to an existing Unity Catalog eval table. ModelConfig defines how to reach a model — either via endpoint (Databricks serving) or litellm_model (any LiteLLM-supported provider). Prompt templates use <<column_name>> placeholders that reference columns in the dataset.
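The placeholder substitution can be sketched in a few lines — this is an illustration of the <<column_name>> semantics only, not the library's actual implementation, and render_prompt is a hypothetical helper:

```python
import re

def render_prompt(template: str, row: dict) -> str:
    # Replace each <<column>> placeholder with the stringified value of the
    # matching dataset column; unknown columns fail early.
    def substitute(match: re.Match) -> str:
        column = match.group(1)
        if column not in row:
            raise KeyError(f"prompt template references unknown column: {column!r}")
        return str(row[column])

    return re.sub(r"<<(\w+)>>", substitute, template)

row = {"input": "acme corp", "candidates": ["Acme", "ACME Corp"]}
print(render_prompt("Tag: <<input>>. Options: <<candidates>>.", row))
# → Tag: acme corp. Options: ['Acme', 'ACME Corp'].
```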

To add LLM-as-judge scoring, include Metric.LLM_JUDGE and provide an LLMJudgeConfig:

from ml_toolkit.functions.eval_utils import LLMJudgeConfig

experiment = Experiment(
    name="vendor-tagger-with-judge",
    dataset=dataset,
    models=[model],
    metrics=[Metric.LATENCY, Metric.EXACT_MATCH, Metric.LLM_JUDGE],
    llm_judge_config=LLMJudgeConfig(
        judge_model="gpt-4o",
        criteria="correctness",
        rubric="Score 1-5 based on how well the output matches the expected output.",
        max_score=5.0,
    ),
)

Loading from YAML#

Every config type can be loaded from a YAML file. This makes experiments reproducible, version-controllable, and easy to share.

# Load a full experiment from a single YAML file
experiment = Experiment(experiment_yaml="configs/experiment.yaml")
result = run_experiment(experiment)

# Or load individual configs
dataset = DatasetConfig(dataset_yaml="configs/dataset.yaml")
model   = ModelConfig(model_yaml="configs/model.yaml")
judge   = LLMJudgeConfig(llm_judge_yaml="configs/judge.yaml")

When a YAML path is provided, the file contents override any other parameters passed to the constructor.
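The override rule amounts to a dictionary merge where the file wins — a hypothetical sketch of the precedence described above, not the toolkit's actual loader:

```python
def merge_config(constructor_kwargs: dict, yaml_values: dict) -> dict:
    # Start from the constructor arguments, then let the YAML file contents
    # win on any key defined in both places.
    merged = dict(constructor_kwargs)
    merged.update(yaml_values)
    return merged

merged = merge_config(
    {"name": "from-code", "description": "kept"},
    {"name": "from-yaml"},
)
print(merged)
# → {'name': 'from-yaml', 'description': 'kept'}
```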

Experiment YAML — bundles dataset, models, and metrics in one file:

configs/experiment.yaml#
name: vendor-tagger-comparison
description: Compare Llama-8B vs GPT-5 for vendor tagging

dataset:
  table: catalog.schema.vendor_eval
  primary_key: row_id
  input_column: input
  expected_output_column: expected_output
  additional_context_columns:
    - candidates

models:
  - name: llama-8b
    endpoint: databricks-meta-llama-3-1-8b-instruct
    prompt_template: |
      Tag the following vendor name: <<input>>.
      Output the vendor name and only the vendor name, as normalized in the options below: <<candidates>>.
    max_output_tokens: 256
    temperature: 0.0

  - name: gpt-5
    litellm_model: gpt-5
    prompt_registry_name: catalog.schema.vendor_tagging_prompt
    max_output_tokens: 9000
    temperature: 0.0

metrics:
  - latency
  - exact_match
  - fuzzy_match

tags:
  team: ml

With LLM-as-judge — add the llm_judge metric and a llm_judge_config block:

configs/experiment_with_judge.yaml#
name: vendor-tagger-with-judge

dataset:
  table: catalog.schema.vendor_eval
  primary_key: row_id
  input_column: input
  expected_output_column: expected_output
  additional_context_columns:
    - candidates

models:
  - name: llama-8b
    endpoint: databricks-meta-llama-3-1-8b-instruct
    prompt_registry_name: catalog.schema.vendor_tagging_prompt
    prompt_alias: prod
    max_output_tokens: 256
    temperature: 0.0

metrics:
  - latency
  - exact_match
  - llm_judge

llm_judge_config:
  judge_model: gpt-4o
  criteria: correctness
  rubric: |
    Score 1-5 based on how well the model output matches the expected output:
    5: Exact match or semantically identical
    4: Minor differences but essentially correct
    3: Partially correct
    2: Mostly incorrect but shows understanding
    1: Completely wrong
  max_score: 5.0

Individual config files — useful when you want to mix and match:

configs/dataset.yaml#
table: catalog.schema.vendor_eval
primary_key: row_id
input_column: input
expected_output_column: expected_output
additional_context_columns:
  - candidates

configs/model_databricks.yaml — Databricks endpoint#
name: llama-8b
endpoint: databricks-meta-llama-3-1-8b-instruct
prompt_registry_name: catalog.schema.vendor_tagging_prompt
prompt_alias: prod
max_output_tokens: 256
temperature: 0.0

configs/model_litellm.yaml — LiteLLM model#
name: gpt-5
litellm_model: gpt-5
prompt_registry_name: catalog.schema.vendor_tagging_prompt
prompt_version: 1
max_output_tokens: 9000
temperature: 0.0

Saving and reproducing#

Any config object can be serialized back to YAML with to_yaml(), and previously-run experiments can be exported with get_experiment_yaml():

# Save current experiment config
experiment.to_yaml("configs/my_experiment.yaml")

# Export a past experiment by ID
yaml_content = get_experiment_yaml(experiment_id)

Remote execution#

Experiments can be offloaded to Databricks serverless compute using the trigger_remote parameter. This is useful for long-running evaluations that you don’t want blocking your notebook session.

Fire-and-forget#
experiment = Experiment(experiment_yaml="configs/experiment.yaml")

# Submit to serverless — returns immediately
remote_run = run_experiment(experiment, trigger_remote=True)

# Track progress in the Databricks UI
print(remote_run.databricks_url)
print(remote_run.job_run_id)

# When ready, block until the job finishes and get results
result = remote_run.get_result()
result.summary_df.show()

Block until the remote job completes#
result = run_experiment(experiment, trigger_remote=True, wait=True)
result.summary_df.show()

Under the hood, the Experiment is serialized to a dictionary, submitted as a Databricks serverless job, reconstructed on the remote cluster, and run as usual. Results are tracked in MLflow and can be retrieved with get_experiment().
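The round trip can be pictured with a simplified stand-in for Experiment (MiniExperiment below is hypothetical; the real class carries the full dataset, model, and metric configs):

```python
import json
from dataclasses import dataclass, asdict

@dataclass
class MiniExperiment:
    # Hypothetical, stripped-down stand-in for the real Experiment class,
    # just to show the serialize/reconstruct round trip.
    name: str
    metrics: list

local = MiniExperiment(name="vendor-tagger-comparison",
                       metrics=["latency", "exact_match"])

# Serialize to a plain payload for the serverless job...
payload = json.dumps(asdict(local))

# ...and reconstruct on the remote cluster before running as usual.
remote = MiniExperiment(**json.loads(payload))
assert remote == local
```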

Note

Custom callable metrics cannot be used remotely — they are not serializable. All built-in Metric enum values work with remote execution.
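The restriction implies a validation step along these lines — a hypothetical sketch, not the toolkit's actual check:

```python
def check_remote_safe(metrics: list) -> None:
    # Custom scorer functions cannot be serialized into the job payload,
    # so reject them before submitting a remote run.
    bad = [m for m in metrics if callable(m)]
    if bad:
        raise ValueError(
            f"{len(bad)} callable metric(s) cannot be used with "
            "trigger_remote=True; use built-in metric names instead."
        )

check_remote_safe(["latency", "exact_match"])  # fine

try:
    check_remote_safe(["latency", lambda out, exp, ctx: 1.0])
except ValueError as err:
    print(err)
```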

API Reference#

Running Evaluations#

ml_toolkit.functions.eval_utils.eval.run_experiment(experiment: ml_toolkit.functions.eval_utils.helpers.config.Experiment, *, mlflow_experiment_name: str | None = None, trigger_remote: bool = False, wait: bool = False) ml_toolkit.functions.eval_utils.helpers.types.ExperimentResult | ml_toolkit.functions.eval_utils.helpers.types.RemoteExperimentRun[source]#

Run an experiment comparing multiple models against a single dataset.

Creates a parent MLflow run with nested child runs for each model. All results are stored in individual tables plus a combined table with a ‘model_name’ column for easy comparison.

Metrics are defined once at the experiment level and applied to all models, eliminating duplication.

Parameters:
  • experiment – Experiment object containing dataset, models, and metrics.

  • mlflow_experiment_name – MLflow experiment name. If not specified, defaults to ‘/Evals/{schema_name}/{table_base_name}’ derived from dataset table.

  • trigger_remote – If True, submit the experiment as a remote Databricks serverless job instead of running locally. The experiment must not contain callable metrics. Defaults to False.

  • wait – Only used when trigger_remote=True. If True, block and poll until the remote job completes, then return a full ExperimentResult. If False (default), return immediately with a RemoteExperimentRun.

Returns:

ExperimentResult if running locally or with trigger_remote=True and wait=True. RemoteExperimentRun if trigger_remote=True and wait=False (fire-and-forget).

Raises:
  • MLOpsToolkitTableNotFoundException – If dataset table doesn’t exist.

  • ValueError – If trigger_remote=True and experiment contains callable metrics.

Example

>>> dataset = DatasetConfig(
...     table="catalog.schema.vendor_eval",
...     primary_key="row_id",
...     input_column="vendor_name",
...     expected_output_column="canonical_name",
... )
>>> experiment = Experiment(
...     name="vendor-tagger-comparison",
...     dataset=dataset,
...     models=[
...         ModelConfig(name="llama-8b", endpoint="llama-endpoint"),
...         ModelConfig(name="gpt-4o", litellm_model="gpt-4o"),
...     ],
...     metrics=[Metric.LATENCY, Metric.EXACT_MATCH],
... )
>>> result = run_experiment(experiment)
>>> result.summary_df.orderBy("exact_match_accuracy", ascending=False).show()
>>> # Run remotely (fire-and-forget):
>>> remote_run = run_experiment(experiment, trigger_remote=True)
>>> print(remote_run.databricks_url)
>>> result = remote_run.get_result()  # blocks until done
ml_toolkit.functions.eval_utils.eval.run_evaluation(eval_table: str | None = None, *, primary_key: str | None = None, model_name: str | None = None, endpoint: str | None = None, litellm_model: str | None = None, version: int | None = None, prompt_template: str | None = None, system_prompt: str | None = None, prompt_registry_name: str | None = None, prompt_version: str | int | None = None, prompt_alias: str | None = None, metrics: list[str | Callable] = ['latency', 'token_count'], llm_judge_config: ml_toolkit.functions.eval_utils.helpers.types.LLMJudgeConfig | None = None, mlflow_experiment: str | None = None, run_name: str | None = None, batch_size: int = DEFAULT_BATCH_SIZE, max_concurrent: int = DEFAULT_MAX_CONCURRENT, timeout_seconds: float = DEFAULT_TIMEOUT_SECONDS, input_column: str | None = 'input', expected_output_column: str | None = 'expected_output', additional_context_columns: list[str] | None = None, tags: dict[str, str] | None = None, max_output_tokens: int | None = DEFAULT_MAX_OUTPUT_TOKENS, litellm_model_kwargs: dict[str, Any] | None = None, dataset: ml_toolkit.functions.eval_utils.helpers.config.DatasetConfig | None = None, model_config: ml_toolkit.functions.eval_utils.helpers.config.ModelConfig | None = None, is_nested: bool = False, _experiment_id: str | None = None, _model_name: str | None = None) ml_toolkit.functions.eval_utils.helpers.types.EvalRunResult[source]#

Execute an evaluation run against a model endpoint.

Runs inference on all rows in the eval table using ai_query or OpenAI-compatible clients, computes specified metrics, and logs results to MLflow.

Parameters can be provided directly or via DatasetConfig/ModelConfig objects. When both are provided, explicit parameters take precedence over object values.

Parameters:
  • eval_table – Fully qualified eval table name (catalog.schema.table). Can be provided via dataset object instead.

  • primary_key – Name of the primary key column. Can be provided via dataset object.

  • model_name – Unity Catalog model name (e.g., ‘catalog.schema.model_name’). If provided without endpoint, the endpoint name will be automatically inferred from the model name. Either model_name/endpoint or litellm_model must be specified. Can be provided via config object.

  • endpoint – Databricks Model Serving endpoint name. Optional if model_name is provided (will be inferred). Either model_name/endpoint or litellm_model must be specified (not both). Can be provided via config object.

  • litellm_model – LiteLLM model identifier (e.g., ‘gpt-4’, ‘claude-3’). Either model_name/endpoint or litellm_model must be specified (not both). Can be provided via config object.

  • version – Optional integer version for routing to a specific model version in multi-version serving endpoints. Only used when endpoint is specified. Example: 1, 2, 3. Defaults to None (latest version). Can be provided via config object.

  • prompt_template – Template for constructing the user message using <<variable>> syntax. Available variables: <<input>>, <<candidates>>, and any columns from additional_context_columns. Defaults to “<<input>>”. Can be provided via config object.

  • system_prompt – Optional system prompt for chat completion. Can be provided via config object.

  • prompt_registry_name – Fully qualified prompt name from MLflow Prompt Registry (e.g., ‘catalog.schema.prompt_name’) or URI (e.g., ‘prompts:/catalog.schema.prompt_name@alias’). If provided, loads prompt_template from registry (converts {{variable}} to <<variable>>). Can be provided via config object.

  • prompt_version – Specific version to load from registry. Ignored if prompt_registry_name includes version/alias or if prompt_alias is provided. Can be provided via config object.

  • prompt_alias – Alias to load from registry (e.g., ‘production’). Takes precedence over prompt_version. Can be provided via config object.

  • metrics – List of metrics to compute. Defaults to [‘latency’, ‘token_count’]. Can include built-in metric names (‘latency’, ‘token_count’, ‘exact_match’, ‘fuzzy_match’, ‘llm_judge’) or custom scorer callables (Callable[[str, str, dict], float]). Can be provided via config object.

  • llm_judge_config – Configuration for LLM-as-judge metric. Required if ‘llm_judge’ is in metrics list. Can be provided via config object.

  • mlflow_experiment – MLflow experiment name. If not specified, defaults to ‘/Evals/{schema_name}/{table_base_name}’ derived from eval_table.

  • run_name – Optional name for the MLflow run. Defaults to ‘{endpoint_or_model}_{timestamp}’. Can be provided via config.name.

  • batch_size – Number of rows to process per batch.

  • max_concurrent – Maximum concurrent requests to the endpoint.

  • timeout_seconds – Timeout per inference request.

  • input_column – Name of the input column. Defaults to ‘input’. Can be provided via dataset object.

  • expected_output_column – Name of the expected output column for comparison metrics. Set to None if not available. Can be provided via dataset object.

  • additional_context_columns – Additional columns to include in prompt template context. Can be provided via dataset object.

  • tags – Optional tags to apply to the MLflow run. Merged with config.tags if both provided.

  • max_output_tokens – Maximum output tokens for model response. Can be provided via config object.

  • litellm_model_kwargs – Additional kwargs for LiteLLM model.

  • dataset – DatasetConfig object containing table reference and column configuration. Values are used as defaults when explicit parameters are not provided.

  • model_config – ModelConfig object containing model and prompt configuration. Values are used as defaults when explicit parameters are not provided.

Returns:

  • run_id: Unique identifier for this eval run

  • mlflow_run_id: MLflow run ID

  • results_table: Fully qualified name of results table

  • metrics_summary: Dict of aggregated metrics

  • row_count: Number of rows evaluated

  • error_count: Number of failed inferences

Return type:

EvalRunResult containing

Raises:
  • ValueError – If neither endpoint nor litellm_model specified.

  • ValueError – If both endpoint and litellm_model specified.

  • ValueError – If ‘llm_judge’ in metrics but llm_judge_config is None.

  • MLOpsToolkitTableNotFoundException – If eval_table doesn’t exist.

  • MLOpsToolkitEvalRunFailedException – If error rate exceeds threshold.

Example

>>> # Using model_name (endpoint automatically inferred):
>>> result = run_evaluation(
...     eval_table="catalog.schema.vendor_eval_20260126",
...     primary_key="row_id",
...     model_name="catalog.schema.vendor_tagger_model",
...     prompt_template="Tag the following vendor name: <<input>>",
...     metrics=["latency", "token_count", "exact_match"],
... )
>>> # Using explicit endpoint:
>>> result = run_evaluation(
...     eval_table="catalog.schema.vendor_eval_20260126",
...     primary_key="row_id",
...     endpoint="vendor-tagger-v1",
...     prompt_template="Tag the following vendor name: <<input>>",
...     metrics=["latency", "token_count", "exact_match"],
... )
>>> # Using version-based routing (defaults to latest version):
>>> result = run_evaluation(
...     eval_table="catalog.schema.vendor_eval",
...     model_name="catalog.schema.my_model",
...     prompt_template="Extract entity: <<input>>",
... )
>>> # Using specific version:
>>> result = run_evaluation(
...     eval_table="catalog.schema.vendor_eval",
...     model_name="catalog.schema.my_model",
...     version=1,
...     prompt_template="Extract entity: <<input>>",
... )
>>> # Using DatasetConfig and ModelConfig objects:
>>> dataset = DatasetConfig(table="catalog.schema.vendor_eval", primary_key="row_id")
>>> model_config = ModelConfig(name="gpt-4o-test", litellm_model="gpt-4o")
>>> result = run_evaluation(dataset=dataset, model_config=model_config)
>>> # Using prompt registry:
>>> result = run_evaluation(
...     eval_table="catalog.schema.vendor_eval_20260126",
...     primary_key="row_id",
...     endpoint="vendor-tagger-v1",
...     prompt_registry_name="mycatalog.myschema.vendor_tagging_prompt",
...     prompt_alias="production",
...     metrics=["latency", "token_count", "exact_match"],
... )

Table Management#

ml_toolkit.functions.eval_utils.table_management.create_eval_table(schema_name: str, table_name: str, primary_key: str, df: pyspark.sql.DataFrame, *, catalog_name: str = EVAL_CATALOG, additional_columns: dict[str, str] | None = None, description: str | None = None, tags: dict[str, str] | None = None) str[source]#

Create a new evaluation dataset table in Unity Catalog.

Creates a Delta table with the standard eval schema plus any additional columns. The table name will be suffixed with a datetime stamp. The schema name must end with ‘_dataset_bronze’.

Parameters:
  • schema_name – Schema name (must end with ‘_dataset_bronze’).

  • table_name – Base table name (a datetime suffix will be appended).

  • primary_key – Name of the primary key column in the DataFrame.

  • df – Spark DataFrame containing the evaluation data. Must include ‘input’ column and the specified primary key column.

  • catalog_name – Unity Catalog name. Defaults to ‘yd_tagging_platform_evals’.

  • additional_columns – Optional dict mapping column names to SQL types for additional columns beyond the standard schema.

  • description – Optional table description for UC metadata.

  • tags – Optional dict of UC tags to apply to the table.

Returns:

‘{catalog_name}.{schema_name}.{table_name}_{timestamp}’

Return type:

Fully qualified table name

Raises:
  • MLOpsToolkitInvalidSchemaNameException – If schema_name doesn’t end with ‘_dataset_bronze’.

  • MLOpsToolkitColumnNotFoundException – If DataFrame is missing ‘input’ or primary key column.

  • MLOpsToolkitDuplicatePrimaryKeyException – If primary key column contains duplicates.

  • MLOpsToolkitSchemaNotFoundException – If schema_name doesn’t exist.

Example

>>> df = spark.createDataFrame([
...     {"row_id": "1", "input": "acme corp", "candidates": ["Acme", "ACME Corp"]},
...     {"row_id": "2", "input": "widgets inc", "candidates": ["Widgets", "Widgets Inc"]},
... ])
>>> table_name = create_eval_table(
...     schema_name="vendor_tagging_dataset_bronze",
...     table_name="vendor_eval",
...     primary_key="row_id",
...     df=df,
...     description="Vendor name tagging evaluation set v1",
...     tags={"domain": "vendor", "version": "1.0"}
... )
>>> print(table_name)
'yd_tagging_platform_evals.vendor_tagging_dataset_bronze.vendor_eval_20260126_143052'
ml_toolkit.functions.eval_utils.table_management.upsert_eval_table(table_name: str, df: pyspark.sql.DataFrame, primary_key: str, *, create_new_version: bool = True) str[source]#

Upsert rows into an existing evaluation table or create a new version.

If create_new_version is True (default), creates a new timestamped table with the combined data. If False, performs an in-place MERGE operation on the existing table (requires the table to be unlocked).

Parameters:
  • table_name – Fully qualified table name (catalog.schema.table).

  • df – Spark DataFrame containing rows to upsert. Must include the primary key column and ‘input’ column.

  • primary_key – Name of the primary key column for merge matching.

  • create_new_version – If True, create a new timestamped table with the merged data. If False, update the existing table in place.

Returns:

Fully qualified table name (new name if versioned, same name if in-place).

Raises:
  • MLOpsToolkitTableNotFoundException – If table_name doesn’t exist.

  • MLOpsToolkitEvalTableLockedException – If table_name is locked and create_new_version is False.

  • MLOpsToolkitColumnNotFoundException – If the DataFrame schema doesn’t match the existing table.

Example

>>> new_rows = spark.createDataFrame([
...     {"row_id": "3", "input": "new vendor", "candidates": ["New Vendor"]}
... ])
>>> new_table = upsert_eval_table(
...     table_name="yd_tagging_platform_evals.vendor_tagging_dataset_bronze.vendor_eval_20260126_143052",
...     df=new_rows,
...     primary_key="row_id",
...     create_new_version=True
... )
ml_toolkit.functions.eval_utils.table_management.lock_eval_table(table_name: str, *, reason: str | None = None, locked_by: str | None = None) None[source]#

Lock an evaluation table to prevent modifications.

Applies a UC tag to mark the table as locked. Locked tables cannot be modified via upsert_eval_table (in-place mode) or deleted.

Parameters:
  • table_name – Fully qualified table name (catalog.schema.table).

  • reason – Optional reason for locking (stored in tag value).

  • locked_by – Optional identifier of who locked the table. Defaults to the current user from the Spark session.

Raises:
  • MLOpsToolkitTableNotFoundException – If table_name doesn’t exist.

  • MLOpsToolkitEvalTableLockedException – If table_name is already locked.

Example

>>> lock_eval_table(
...     table_name="yd_tagging_platform_evals.vendor_tagging_dataset_bronze.vendor_eval_20260126_143052",
...     reason="Production eval set - do not modify",
...     locked_by="ml-team"
... )

Notes

Lock is implemented via UC tags:

  • Tag key: ‘eval_locked’

  • Tag value: JSON with ‘locked_at’, ‘locked_by’, ‘reason’
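Based on the note above, the tag value presumably looks like the following — an illustrative sketch whose field names come from the note; the exact format the toolkit writes may differ:

```python
import json
from datetime import datetime, timezone

# Illustrative lock-tag value; the real toolkit may format this differently.
lock_value = json.dumps({
    "locked_at": datetime.now(timezone.utc).isoformat(),
    "locked_by": "ml-team",
    "reason": "Production eval set - do not modify",
})
print(lock_value)
```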

ml_toolkit.functions.eval_utils.table_management.unlock_eval_table(table_name: str, *, force: bool = False) None[source]#

Unlock a previously locked evaluation table.

Removes the lock tag from the table, allowing modifications.

Parameters:
  • table_name – Fully qualified table name (catalog.schema.table).

  • force – If True, unlock even if locked by a different user. Defaults to False.

Raises:
  • MLOpsToolkitTableNotFoundException – If table_name doesn’t exist.

  • ValueError – If table_name is not locked.

  • PermissionError – If the table was locked by a different user and force=False.

Example

>>> unlock_eval_table(
...     table_name="yd_tagging_platform_evals.vendor_tagging_dataset_bronze.vendor_eval_20260126_143052"
... )

Retrieving Results#

Run-level#

ml_toolkit.functions.eval_utils.results.get_run(run_id: str) ml_toolkit.functions.eval_utils.helpers.types.EvalRunResult[source]#

Get metadata for a single eval run.

Parameters:

run_id – The eval run ID returned from run_evaluation().

Returns:

EvalRunResult containing run metadata and metrics summary.

Raises:

MLOpsToolkitEvalRunNotFoundException – If run not found.

Example

>>> run = get_run("eval_run_20260126_143052_abc123")
>>> print(run.metrics_summary)
{'latency_ms_p50': 145.2, 'exact_match_score_accuracy': 0.87}
ml_toolkit.functions.eval_utils.results.get_run_df(run_id: str, *, include_metrics: bool = True, filter_errors: bool = False) pyspark.sql.DataFrame[source]#

Get per-row results for an eval run as a DataFrame.

Parameters:
  • run_id – The eval run ID returned from run_evaluation().

  • include_metrics – If True, include per-row metric columns. Defaults to True.

  • filter_errors – If True, exclude rows where inference failed. Defaults to False.

Returns:

  • All columns from original eval table

  • model_output: The model’s response text

  • latency_ms: Inference latency (if metric enabled)

  • prompt_tokens: Input token count (if metric enabled)

  • completion_tokens: Output token count (if metric enabled)

  • exact_match_score: 1.0 if exact match, 0.0 otherwise

  • fuzzy_match_score: Fuzzy match ratio [0.0, 1.0]

  • llm_judge_score: LLM judge score (if metric enabled)

  • error_message: Error message if inference failed, else NULL

Return type:

Spark DataFrame with columns

Raises:

MLOpsToolkitEvalRunNotFoundException – If run not found.

Example

>>> df = get_run_df("eval_run_20260126_143052_abc123")
>>> df.filter(df.exact_match_score == 0).show()
ml_toolkit.functions.eval_utils.results.list_runs_df(eval_table: str | None = None, *, experiment_id: str | None = None, status: Literal['all', 'completed', 'running', 'failed'] = 'all', limit: int = 100, order_by: Literal['created_at', 'completed_at', 'run_name'] = 'created_at', descending: bool = True) pyspark.sql.DataFrame[source]#

List eval runs as a DataFrame.

Can filter by eval table, experiment, or status.

Parameters:
  • eval_table – Optional eval table name to filter by.

  • experiment_id – Optional experiment ID to filter by.

  • status – Filter runs by status. Options: ‘all’, ‘completed’, ‘running’, ‘failed’. Defaults to ‘all’.

  • limit – Maximum number of runs to return. Defaults to 100.

  • order_by – Column to sort results by. Defaults to ‘created_at’.

  • descending – Sort in descending order. Defaults to True.

Returns:

Spark DataFrame with run metadata columns.

Example

>>> runs = list_runs_df(eval_table="catalog.schema.vendor_eval")
>>> runs.show()
>>> runs = list_runs_df(experiment_id="exp_20260126_143052_abc123")
>>> runs.show()
ml_toolkit.functions.eval_utils.results.compare_runs_df(run_ids: list[str] | None = None, *, eval_table: str | None = None, limit: int = 100) pyspark.sql.DataFrame[source]#

Get a comparison DataFrame for multiple eval runs.

Retrieves aggregated metrics across multiple runs for comparison.

Parameters:
  • run_ids – Optional list of specific run IDs to include.

  • eval_table – Optional eval table name to filter by.

  • limit – Maximum number of runs to return. Defaults to 100.

Returns:

Spark DataFrame with comparison metrics.

Example

>>> summary = compare_runs_df(eval_table="catalog.schema.vendor_eval")
>>> summary.orderBy("exact_match_accuracy", ascending=False).show()

Experiment-level#

ml_toolkit.functions.eval_utils.results.get_experiment(experiment_id: str) ml_toolkit.functions.eval_utils.helpers.types.ExperimentResult[source]#

Get an experiment by its ID.

Parameters:

experiment_id – The experiment ID returned from run_experiment().

Returns:

ExperimentResult with all model results populated.

Raises:

MLOpsToolkitExperimentNotFoundException – If experiment not found.

Example

>>> exp = get_experiment("exp_20260126_143052_abc123")
>>> exp.summary_df.show()
ml_toolkit.functions.eval_utils.results.get_experiment_df(experiment_id: str, *, include_metrics: bool = True) pyspark.sql.DataFrame[source]#

Get combined per-row results for an experiment.

Returns the combined results table with all model outputs.

Parameters:
  • experiment_id – The experiment ID.

  • include_metrics – If True, include per-row metric columns. Defaults to True.

Returns:

Spark DataFrame with combined results from all models, including a ‘config_name’ column to identify the model.

Raises:

MLOpsToolkitExperimentNotFoundException – If experiment not found.

Example

>>> df = get_experiment_df("exp_20260126_143052_abc123")
>>> df.filter(df.exact_match_score == 0).show()
ml_toolkit.functions.eval_utils.results.get_experiment_yaml(experiment_id: str, file_name: str | None = None) str[source]#

Download the experiment YAML config artifact from MLflow.

Retrieves the YAML configuration that was saved when the experiment was run via run_experiment(). This YAML can be used to recreate the experiment: Experiment(experiment_yaml="restored.yaml").

Parameters:
  • experiment_id – The experiment ID returned from run_experiment().

  • file_name – Optional file path to save the YAML to. If provided, the YAML content is written to this path in addition to being returned as a string.

Returns:

The YAML config content as a string.

Raises:
  • MLOpsToolkitExperimentNotFoundException – If experiment not found.

  • FileNotFoundError – If the YAML artifact does not exist in MLflow (e.g., experiment was run before this feature was added).

Example

>>> yaml_content = get_experiment_yaml("exp_20260126_143052_abc123")
>>> print(yaml_content)
>>> # Save to file and recreate experiment
>>> get_experiment_yaml("exp_20260126_143052_abc123", file_name="config.yaml")
>>> experiment = Experiment(experiment_yaml="config.yaml")
ml_toolkit.functions.eval_utils.results.list_experiments_df(eval_table: str | None = None, *, status: Literal['all', 'completed', 'running', 'partial', 'failed'] = 'all', limit: int = 100, order_by: Literal['created_at', 'completed_at', 'experiment_name'] = 'created_at', descending: bool = True) pyspark.sql.DataFrame[source]#

List experiments as a DataFrame.

Parameters:
  • eval_table – Optional eval table name to filter by.

  • status – Filter experiments by status. Options: ‘all’, ‘completed’, ‘running’, ‘partial’, ‘failed’. Defaults to ‘all’.

  • limit – Maximum number of experiments to return. Defaults to 100.

  • order_by – Column to sort results by. Defaults to ‘created_at’.

  • descending – Sort in descending order. Defaults to True.

Returns:

Spark DataFrame with experiment metadata.

Example

>>> experiments = list_experiments_df(eval_table="catalog.schema.vendor_eval")
>>> experiments.show()
ml_toolkit.functions.eval_utils.results.compare_experiments_df(experiment_ids: list[str] | None = None, *, eval_table: str | None = None, limit: int = 100) pyspark.sql.DataFrame[source]#

Get a comparison DataFrame for multiple experiments.

For each experiment, shows the best-performing model on each metric.

Parameters:
  • experiment_ids – Optional list of specific experiment IDs to include.

  • eval_table – Optional eval table name to filter by.

  • limit – Maximum number of experiments to return. Defaults to 100.

Returns:

Spark DataFrame with experiment comparison metrics.

Example

>>> summary = compare_experiments_df(eval_table="catalog.schema.vendor_eval")
>>> summary.show()

Configuration Types#

class ml_toolkit.functions.eval_utils.helpers.config.Experiment[source]#

Bundles a dataset, model configurations, and metrics together.

An Experiment defines a complete evaluation scenario: what dataset to use, which models to compare, and which metrics to compute. Metrics are defined once and applied to ALL models, eliminating duplication.

Parameters:
  • name – Unique name for this experiment.

  • dataset – DatasetConfig reference to the evaluation dataset.

  • models – List of ModelConfig objects to evaluate.

  • metrics – List of metrics to compute for all models. Can include Metric enum values, string names, or custom scorer callables.

  • llm_judge_config – Configuration for LLM-as-judge metric. Required if ‘llm_judge’ is in metrics list.

  • description – Optional description of the experiment.

  • tags – Optional tags for the experiment.

  • experiment_yaml – Optional path to YAML file to load config from. If provided, all other parameters are ignored.

Example

>>> from ml_toolkit.functions.eval_utils import (
...     DatasetConfig, ModelConfig, Experiment, Metric, LLMJudgeConfig
... )
>>> dataset = DatasetConfig(
...     table="catalog.schema.vendor_eval",
...     primary_key="row_id",
...     input_column="vendor_name",
...     expected_output_column="canonical_name",
... )
>>> experiment = Experiment(
...     name="vendor-tagger-comparison",
...     dataset=dataset,
...     models=[
...         ModelConfig(name="llama-8b", endpoint="llama-endpoint"),
...         ModelConfig(name="gpt-4o", litellm_model="gpt-4o"),
...     ],
...     metrics=[Metric.LATENCY, Metric.EXACT_MATCH, Metric.LLM_JUDGE],
...     llm_judge_config=LLMJudgeConfig(criteria="correctness"),
... )

>>> # Or load from YAML
>>> experiment = Experiment(experiment_yaml="path/to/config.yaml")
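For the YAML path, a config file mirroring the Python example above might look like the following sketch. The key names are assumptions derived from the constructor parameters; a config saved by run_experiment() and retrieved with get_experiment_yaml() is the authoritative schema.

```yaml
name: vendor-tagger-comparison
dataset:
  table: catalog.schema.vendor_eval
  primary_key: row_id
  input_column: vendor_name
  expected_output_column: canonical_name
models:
  - name: llama-8b
    endpoint: llama-endpoint
  - name: gpt-4o
    litellm_model: gpt-4o
metrics:
  - latency
  - exact_match
  - llm_judge
llm_judge_config:
  criteria: correctness
```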

Raises:

ValueError – If validation fails (no models, duplicate names, or llm_judge metric without config).

class ml_toolkit.functions.eval_utils.helpers.config.DatasetConfig[source]#

Reference to an evaluation dataset.

Encapsulates all information needed to locate and use an evaluation dataset for running evaluations.

Parameters:
  • table – Fully qualified table name (catalog.schema.table).

  • primary_key – Name of the primary key column.

  • input_column – Name of the input text column. Defaults to ‘input’.

  • expected_output_column – Name of the expected output column for comparison metrics. Set to None if not available.

  • additional_context_columns – Additional columns to include in prompt template context.

  • description – Optional description of the dataset.

  • tags – Optional tags for the dataset.

  • dataset_yaml – Optional path to YAML file to load config from. If provided, all other parameters are ignored.

Example

>>> dataset = DatasetConfig(
...     table="catalog.schema.vendor_eval_20260126",
...     primary_key="row_id",
...     input_column="vendor_name",
...     expected_output_column="canonical_name",
... )

>>> # Or load from YAML
>>> dataset = DatasetConfig(dataset_yaml="path/to/dataset.yaml")

class ml_toolkit.functions.eval_utils.helpers.config.ModelConfig[source]#

Configuration for a single model/prompt to evaluate.

Represents one model/prompt configuration to evaluate against a dataset. Use with Experiment to compare multiple configurations.

Validation runs automatically on construction via __post_init__.

Parameters:
  • name – Unique name for this config (used in MLflow run name and results).

  • model_name – Unity Catalog model name (e.g., ‘catalog.schema.model_name’). If provided without endpoint, the endpoint name will be automatically inferred from the model name. Either model_name/endpoint or litellm_model must be specified.

  • endpoint – Databricks Model Serving endpoint name. Optional if model_name is provided (will be inferred). Either model_name/endpoint or litellm_model must be specified (not both).

  • litellm_model – LiteLLM model identifier (e.g., ‘gpt-4’, ‘claude-3’). Either model_name/endpoint or litellm_model must be specified (not both).

  • version – Optional integer version for routing to a specific model version in multi-version serving endpoints. Only used when endpoint is specified. Example: 1, 2, 3. Defaults to None (latest version).

  • prompt_template – Jinja2 template for constructing the user message. Available variables: <<input>>, <<candidates>>, and context columns. Use <<variable>> syntax. If prompt_registry_name is provided, this will be overridden by the loaded template.

  • system_prompt – Optional system prompt for chat completion. If prompt_registry_name is provided, this will be overridden.

  • prompt_registry_name – Fully qualified prompt name from MLflow Prompt Registry (e.g., ‘catalog.schema.prompt_name’) or URI (e.g., ‘prompts:/catalog.schema.prompt_name@alias’). If provided, loads prompt_template from registry (converts {{variable}} to <<variable>>).

  • prompt_version – Specific version to load from registry. Ignored if prompt_registry_name includes version/alias or if prompt_alias is provided.

  • prompt_alias – Alias to load from registry (e.g., ‘production’). Takes precedence over prompt_version.

  • max_output_tokens – Maximum output tokens for model response.

  • temperature – Temperature for model sampling (None uses model default).

  • tags – Additional tags to apply to the MLflow run for this config.

  • model_yaml – Optional path to YAML file to load config from. If provided, all other parameters are ignored.

Example

>>> from ml_toolkit.functions.eval_utils import ModelConfig
>>> # Using model_name (endpoint automatically inferred)
>>> config = ModelConfig(
...     name="llama-8b-prompt-v1",
...     model_name="catalog.schema.meta-llama-3-1-8b-instruct",
...     prompt_template="Tag the vendor: <<input>>",
... )
>>> # Using endpoint directly
>>> config = ModelConfig(
...     name="llama-8b-prompt-v1",
...     endpoint="databricks-meta-llama-3-1-8b-instruct",
...     prompt_template="Tag the vendor: <<input>>",
... )
>>> # Load prompt from MLflow Prompt Registry
>>> config = ModelConfig(
...     name="llama-8b-registry",
...     model_name="catalog.schema.meta-llama-3-1-8b-instruct",
...     prompt_registry_name="mycatalog.myschema.vendor_tagging_prompt",
...     prompt_alias="production",  # or prompt_version=2
... )

>>> # Or load from YAML
>>> config = ModelConfig(model_yaml="path/to/model.yaml")

Raises:

ValueError – If validation fails (missing endpoint/model or both specified).

class ml_toolkit.functions.eval_utils.helpers.types.LLMJudgeConfig[source]#

Configuration for LLM-as-judge metric.

Parameters:
  • judge_model – Model to use as judge (e.g., ‘gpt-4’, ‘claude-opus-4’). For Databricks endpoints, use the endpoint name.

  • judge_endpoint – Optional Databricks endpoint for judge. If provided, this takes precedence over judge_model for Databricks serving.

  • criteria – Evaluation criteria (e.g., ‘correctness’, ‘relevance’).

  • rubric – Custom rubric/scoring guidelines for the judge. If judge_rubric_registry_name is provided, this will be overridden.

  • system_prompt – Custom system prompt for the judge LLM. If judge_system_prompt_registry_name is provided, this will be overridden.

  • max_score – Maximum score value. Defaults to 5.0.

  • judge_rubric_registry_name – Fully qualified prompt name from MLflow Prompt Registry for the rubric. If provided, loads rubric from registry.

  • judge_rubric_version – Specific version to load for rubric. Ignored if judge_rubric_registry_name includes version/alias or if judge_rubric_alias is provided.

  • judge_rubric_alias – Alias to load for rubric (e.g., ‘production’).

  • judge_system_prompt_registry_name – Fully qualified prompt name from MLflow Prompt Registry for the system prompt. If provided, loads system_prompt from registry.

  • judge_system_prompt_version – Specific version to load for system prompt.

  • judge_system_prompt_alias – Alias to load for system prompt.

  • llm_judge_yaml – Optional path to YAML file to load config from. If provided, all other parameters are ignored.

Examples

>>> config = LLMJudgeConfig(
...     judge_model="gpt-4",
...     criteria="correctness",
...     rubric="Score 1-5 based on whether the selected tag is the best match",
... )
>>> # Load rubric from registry
>>> config = LLMJudgeConfig(
...     judge_model="gpt-4",
...     judge_rubric_registry_name="mycatalog.myschema.judge_rubric",
...     judge_rubric_alias="production",
... )
>>> # Load both rubric and system prompt from registry
>>> config = LLMJudgeConfig(
...     judge_model="gpt-4",
...     judge_rubric_registry_name="mycatalog.myschema.judge_rubric",
...     judge_system_prompt_registry_name="mycatalog.myschema.judge_system_prompt",
... )
>>> # Or load from YAML:
>>> config = LLMJudgeConfig(llm_judge_yaml="path/to/config.yaml")
class ml_toolkit.functions.eval_utils.constants.Metric[source]#

Built-in evaluation metrics.

Inherits from str for backward compatibility with string metrics.

Example

>>> from ml_toolkit.functions.eval_utils import Metric
>>> metrics = [Metric.LATENCY, Metric.EXACT_MATCH]
>>> # Also works with strings for backward compatibility
>>> "latency" in [m.value for m in Metric]
True
__init__()[source]#

Initialize self. See help(type(self)) for accurate signature.

__new__()[source]#

Create and return a new object. See help(type) for accurate signature.
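Because Metric mixes in str, a member compares equal to its plain-string name, which is what keeps string metrics working alongside enum values. A minimal sketch of the pattern (the class and member names here are illustrative, not the toolkit's actual definition):

```python
from enum import Enum


class MetricSketch(str, Enum):
    """Illustrative str-backed enum mirroring the Metric pattern."""
    LATENCY = "latency"
    EXACT_MATCH = "exact_match"


# Members compare equal to their raw string values, so either form works
print(MetricSketch.LATENCY == "latency")
print("exact_match" in [m.value for m in MetricSketch])
```

This is why a metrics list can freely mix `Metric.LATENCY` and `"exact_match"`.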

Result Types#

class ml_toolkit.functions.eval_utils.helpers.types.EvalRunResult[source]#

Result of a run_evaluation() execution.

run_id[source]#

Unique identifier for this eval run.

mlflow_run_id[source]#

MLflow run ID for tracking.

results_table[source]#

Fully qualified name of the results table.

metrics_summary[source]#

Dict of aggregated metrics (e.g., {‘exact_match’: 0.87}).

row_count[source]#

Number of rows evaluated.

error_count[source]#

Number of failed inferences.

created_at[source]#

Timestamp when the run started.

class ml_toolkit.functions.eval_utils.helpers.types.ExperimentResult[source]#

Result of running an experiment via run_experiment().

Contains results from all model runs and provides comparison utilities.

experiment_id[source]#

Unique identifier for this experiment (e.g., “exp_20260126_143052_abc12345”).

experiment_name[source]#

Name of the experiment that was run.

parent_run_id[source]#

Parent MLflow run ID that contains all nested runs.

mlflow_experiment[source]#

Path to the MLflow experiment.

dataset_hash[source]#

MD5 hash of the dataset for reproducibility tracking.

results_table[source]#

Fully qualified name of the combined results table.

model_results[source]#

Dict mapping model names to their EvalRunResult.

class ml_toolkit.functions.eval_utils.helpers.types.RemoteExperimentRun[source]#

Result returned when run_experiment() is called with trigger_remote=True.

Provides job tracking information and a convenience method to retrieve the full ExperimentResult once the remote job completes.

job_run_id[source]#

Databricks job run ID for tracking.

databricks_url[source]#

Direct URL to the job run in Databricks UI.

experiment_name[source]#

Name of the experiment that was submitted.

experiment_config[source]#

Serialized experiment config dict.

class ml_toolkit.functions.eval_utils.helpers.types.CustomScorer[source]#

Protocol for custom metric scorer functions.

Custom scorers must implement a __call__ method that takes the model output, expected output (may be None), and a context dict containing all row columns and metadata.

Example:

from typing import Any

def vendor_format_scorer(
    model_output: str,
    expected_output: str | None,
    context: dict[str, Any],
) -> float:
    score = 0.0
    # Reward Title Case formatting
    if model_output.istitle():
        score += 0.5
    # Reward absence of special characters
    if not any(c in model_output for c in ['@', '#', '$']):
        score += 0.5
    return score

Multi-Model Comparison#

Functions for pivoting multi-model inference results and computing agreement metrics across models. These are designed to work with the output of the Multi-Model Inference UDTF (see llm module), but can also be used independently on any DataFrame with per-model output columns.

End-to-end: UDTF output to comparison metrics#
from ml_toolkit.functions.eval_utils import pivot_and_compare_results
from pyspark.sql import functions as F

# results_df is the output of the multi-model UDTF
models = ["openai/gpt-4o", "openai/databricks-claude-3-7-sonnet"]
compared = pivot_and_compare_results(results_df, models)

# Aggregate agreement rates
compared.select(
    F.avg(F.col("metrics").getItem("all_models_strict")).alias("strict_rate"),
    F.avg(F.col("metrics").getItem("all_models_relaxed")).alias("relaxed_rate"),
).show()
Using add_comparison_metrics on custom data#
from ml_toolkit.functions.eval_utils import add_comparison_metrics

# df has columns m0, m1, m2 (each array<string>)
result = add_comparison_metrics(df, ["model_a", "model_b", "model_c"])
# result has original columns + "metrics" map column

The metrics map contains four keys:

  • all_models_strict – 1.0 if every model returned the exact same set of values (order-insensitive).

  • all_models_relaxed – 1.0 if for every pair of models, one output is a subset of the other.

  • majority_strict – 1.0 if a strict majority (>50%) of models returned the same set.

  • majority_relaxed – 1.0 if a relaxed majority (>50%) of models agree by subset criteria.
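The strict/relaxed distinction can be illustrated in plain Python over per-model sets of normalized outputs. This sketch mirrors only the documented semantics (the actual implementation runs as Spark column expressions, and the exact majority tie-breaking here is an assumption):

```python
from itertools import combinations


def agreement_metrics(outputs: list[set[str]]) -> dict[str, float]:
    """Compute the four agreement keys over per-model output sets."""
    pairs = list(combinations(outputs, 2))
    strict_pairs = [a == b for a, b in pairs]
    relaxed_pairs = [a <= b or b <= a for a, b in pairs]  # subset either way
    majority = len(outputs) // 2 + 1  # strict majority (>50%)
    # Largest group of models sharing the same exact set
    max_strict = max(sum(1 for o in outputs if o == ref) for ref in outputs)
    # Largest group that is subset-compatible with some model's output
    max_relaxed = max(
        sum(1 for o in outputs if o <= ref or ref <= o) for ref in outputs
    )
    return {
        "all_models_strict": 1.0 if all(strict_pairs) else 0.0,
        "all_models_relaxed": 1.0 if all(relaxed_pairs) else 0.0,
        "majority_strict": 1.0 if max_strict >= majority else 0.0,
        "majority_relaxed": 1.0 if max_relaxed >= majority else 0.0,
    }


# Two of three models match exactly; every pair is subset-compatible
print(agreement_metrics([{"acme"}, {"acme"}, {"acme", "corp"}]))
```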

ml_toolkit.functions.eval_utils.helpers.multi_model_comparison.pivot_and_compare_results(results_df: pyspark.sql.DataFrame, models: list[str]) pyspark.sql.DataFrame[source]#

Pivot multi-model UDTF output and add comparison metrics.

Takes the long-format output from the multi-model UDTF (one row per model per input) and produces a wide-format DataFrame with one row per input, one column per model, and a metrics map.

Parameters:
  • results_df – DataFrame with columns model, output, parameters, error, raw_output (as produced by the multi-model inference UDTF).

  • models – Ordered list of model names (must match the values in the model column).

Returns:

DataFrame with row_id, parameters, model_0 .. model_{n-1}, and metrics.

ml_toolkit.functions.eval_utils.helpers.multi_model_comparison.add_comparison_metrics(df: pyspark.sql.DataFrame, models: list[str]) pyspark.sql.DataFrame[source]#

Add a metrics map column with pairwise and majority agreement scores.

Expects df to have columns m0, m1, … m{n-1} where each is an array<string> of normalized outputs (e.g. from normalized_output_array()).

The returned DataFrame has only the original columns plus a single metrics column of type map<string, double> with keys:

  • all_models_strict: 1.0 if every pairwise comparison is an exact set match.

  • all_models_relaxed: 1.0 if every pairwise comparison satisfies subset-or-superset.

  • majority_strict: 1.0 if a strict majority of outputs are identical.

  • majority_relaxed: 1.0 if a relaxed majority of outputs agree.

Parameters:
  • df – DataFrame with m0 .. m{n-1} array columns.

  • models – List of model names (only len(models) matters).

Returns:

DataFrame with original columns + metrics.

ml_toolkit.functions.eval_utils.helpers.multi_model_comparison.normalized_output_array(colname: str)[source]#

Build a Spark Column that normalizes a model-output JSON string to array<string>.

Expects colname to contain JSON of the form {"model": "...", "output": "[\"a\",\"b\"]"} (the output field is itself a JSON-encoded array string). The function:

  1. Parses colname as struct<model, output>.

  2. Parses the output field as array<string>.

  3. Applies lower(trim(x)) to each element.

  4. Filters out null and empty strings.

  5. Returns array_distinct(...) of the result.

Parameters:

colname – Name of the column containing the JSON string.

Returns:

A PySpark Column expression.
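The five normalization steps can be sketched in plain Python for a single value. The real function builds an equivalent Spark Column expression; this only mirrors the described behavior:

```python
import json


def normalize_output(raw: str) -> list[str]:
    """Mirror the five normalization steps on one JSON string."""
    parsed = json.loads(raw)              # 1. parse as struct<model, output>
    items = json.loads(parsed["output"])  # 2. parse output field as array<string>
    cleaned = [x.lower().strip() for x in items if x is not None]  # 3. lower(trim)
    cleaned = [x for x in cleaned if x]   # 4. drop null/empty strings
    seen, out = set(), []
    for x in cleaned:                     # 5. distinct, preserving first occurrence
        if x not in seen:
            seen.add(x)
            out.append(x)
    return out


raw = json.dumps({"model": "gpt-4o", "output": json.dumps([" Acme ", "acme", ""])})
print(normalize_output(raw))  # ['acme']
```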