``eval_utils`` module
=====================

The ``eval_utils`` module provides a complete framework for evaluating LLMs and model endpoints. Everything starts from an **Experiment** — a bundle of a dataset, one or more models, and a set of metrics. You define an experiment, run it, and get back tracked results in MLflow and Delta tables.

.. attention::

   These functions are designed to run within Databricks notebooks and require access to Unity Catalog and MLflow.

Configuring an Experiment
^^^^^^^^^^^^^^^^^^^^^^^^^

An ``Experiment`` has three parts:

1. **Dataset** — which eval table to use, and which columns contain the input, expected output, and additional context.
2. **Models** — one or more models to evaluate, each targeting a Databricks serving endpoint or a LiteLLM-supported model.

   - When configuring a model, you can use a prompt template, a prompt registry name, or a prompt alias. We suggest using the prompt registry for production experiments.

3. **Metrics** — which metrics to compute across all models (``latency``, ``token_count``, ``exact_match``, ``fuzzy_match``, ``llm_judge``, ``hit_rate``, or custom scorers).

You can define an experiment in Python or load it entirely from a YAML file.

Defining in Python
""""""""""""""""""

This example shows how to define an experiment in Python. It creates an eval table, defines an experiment, and runs it. Pay attention to the different ways to configure a model (via an endpoint or via a model in the registry); all of them are available through both code and YAML.

.. important::

   You must run ``deploy_model_serving_endpoint`` before you can use a custom model via an endpoint.

.. code-block:: python

   from ml_toolkit.functions.eval_utils import (
       create_eval_table,
       run_experiment,
       DatasetConfig,
       ModelConfig,
       Experiment,
       Metric,
   )

   # 1. Create an eval dataset (one-time setup)
   table_name = create_eval_table(
       schema_name="sandbox",
       table_name="vendor_eval",
       primary_key="row_id",
       df=eval_df,
   )

   # 2. Define the experiment
   experiment = Experiment(
       name="vendor-tagger-comparison",
       dataset=DatasetConfig(
           table=table_name,
           primary_key="row_id",
           input_column="input",
           expected_output_column="expected_output",
           additional_context_columns=["candidates"],
       ),
       models=[
           ModelConfig(
               name="llama-8b",
               endpoint="databricks-meta-llama-3-1-8b-instruct",
               prompt_template="Tag: <>. Options: <>.",
               max_output_tokens=256,
               temperature=0.0,
           ),
           ModelConfig(
               name="custom-endpoint-model",
               endpoint="endpoint-name",
               version=1,  # version of the model behind the endpoint
               prompt_template="Tag: <>. Options: <>.",
               max_output_tokens=256,
               temperature=0.0,
           ),
           ModelConfig(
               name="qlora-model",
               model_name="catalog.schema.my_qlora_model",
               version=2,  # version of the model in the model registry
               prompt_registry_name="catalog.schema.vendor_tagging_prompt",
               prompt_alias="prod",
               max_output_tokens=256,
               temperature=0.0,
           ),
           ModelConfig(
               name="gpt-5",
               litellm_model="gpt-5",
               prompt_registry_name="catalog.schema.vendor_tagging_prompt",
               system_prompt="You are a precise tagging assistant.",
               max_output_tokens=9000,
               temperature=0.0,
           ),
       ],
       metrics=[Metric.LATENCY, Metric.EXACT_MATCH, Metric.FUZZY_MATCH],
   )

   # 3. Run the experiment
   result = run_experiment(experiment)
   result.summary_df.show()

``DatasetConfig`` points to an existing Unity Catalog eval table. ``ModelConfig`` defines how to reach a model — either via ``endpoint`` (Databricks serving) or ``litellm_model`` (any LiteLLM-supported provider). Prompt templates use ``<>`` placeholders that reference columns in the dataset.

To add LLM-as-judge scoring, include ``Metric.LLM_JUDGE`` and provide an ``LLMJudgeConfig``:

.. code-block:: python

   from ml_toolkit.functions.eval_utils import LLMJudgeConfig

   experiment = Experiment(
       name="vendor-tagger-with-judge",
       dataset=dataset,
       models=[model],
       metrics=[Metric.LATENCY, Metric.EXACT_MATCH, Metric.LLM_JUDGE],
       llm_judge_config=LLMJudgeConfig(
           judge_model="gpt-4o",
           criteria="correctness",
           rubric="Score 1-5 based on how well the output matches the expected output.",
           max_score=5.0,
       ),
   )

.. _yaml-configuration:

Loading from YAML
"""""""""""""""""

Every config type can be loaded from a YAML file. This makes experiments reproducible, version-controllable, and easy to share.

.. code-block:: python

   # Load a full experiment from a single YAML file
   experiment = Experiment(experiment_yaml="configs/experiment.yaml")
   result = run_experiment(experiment)

   # Or load individual configs
   dataset = DatasetConfig(dataset_yaml="configs/dataset.yaml")
   model = ModelConfig(model_yaml="configs/model.yaml")
   judge = LLMJudgeConfig(llm_judge_yaml="configs/judge.yaml")

When a YAML path is provided, the file contents override any other parameters passed to the constructor.

**Experiment YAML** — bundles dataset, models, and metrics in one file:

.. code-block:: yaml
   :caption: configs/experiment.yaml

   name: vendor-tagger-comparison
   description: Compare Llama-8B vs GPT-5 for vendor tagging
   dataset:
     table: catalog.schema.vendor_eval
     primary_key: row_id
     input_column: input
     expected_output_column: expected_output
     additional_context_columns:
       - candidates
   models:
     - name: llama-8b
       endpoint: databricks-meta-llama-3-1-8b-instruct
       prompt_template: |
         Tag the following vendor name: <>.
         Output the vendor name and only the vendor name, as normalized in the options below: <>.
       max_output_tokens: 256
       temperature: 0.0
     - name: gpt-5
       litellm_model: gpt-5
       prompt_registry_name: catalog.schema.vendor_tagging_prompt
       max_output_tokens: 9000
       temperature: 0.0
   metrics:
     - latency
     - exact_match
     - fuzzy_match
   tags:
     team: ml

**With LLM-as-judge** — add the ``llm_judge`` metric and an ``llm_judge_config`` block:

.. code-block:: yaml
   :caption: configs/experiment_with_judge.yaml

   name: vendor-tagger-with-judge
   dataset:
     table: catalog.schema.vendor_eval
     primary_key: row_id
     input_column: input
     expected_output_column: expected_output
     additional_context_columns:
       - candidates
   models:
     - name: llama-8b
       endpoint: databricks-meta-llama-3-1-8b-instruct
       prompt_registry_name: catalog.schema.vendor_tagging_prompt
       prompt_alias: prod
       max_output_tokens: 256
       temperature: 0.0
   metrics:
     - latency
     - exact_match
     - llm_judge
   llm_judge_config:
     judge_model: gpt-4o
     criteria: correctness
     rubric: |
       Score 1-5 based on how well the model output matches the expected output:
       5: Exact match or semantically identical
       4: Minor differences but essentially correct
       3: Partially correct
       2: Mostly incorrect but shows understanding
       1: Completely wrong
     max_score: 5.0

**Individual config files** — useful when you want to mix and match:

.. code-block:: yaml
   :caption: configs/dataset.yaml

   table: catalog.schema.vendor_eval
   primary_key: row_id
   input_column: input
   expected_output_column: expected_output
   additional_context_columns:
     - candidates

.. code-block:: yaml
   :caption: configs/model_databricks.yaml — Databricks endpoint

   name: llama-8b
   endpoint: databricks-meta-llama-3-1-8b-instruct
   prompt_registry_name: catalog.schema.vendor_tagging_prompt
   prompt_alias: prod
   max_output_tokens: 256
   temperature: 0.0
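
The ``prompt_template`` values in the experiment YAML above use ``<>`` placeholders filled from dataset columns. As a rough plain-Python sketch of how positional ``<>`` substitution can work (``fill_template`` is a hypothetical helper for illustration; the library's own substitution logic may differ):

.. code-block:: python

   def fill_template(template: str, values: list) -> str:
       """Replace each '<>' placeholder, in order, with the next value.

       Hypothetical sketch of the '<>' mechanics; not the library's
       actual implementation.
       """
       parts = template.split("<>")
       if len(parts) - 1 != len(values):
           raise ValueError("placeholder count does not match value count")
       out = [parts[0]]
       for value, part in zip(values, parts[1:]):
           out.append(str(value))
           out.append(part)
       return "".join(out)

   row = {"input": "AMZN Mktp US", "candidates": "Amazon; Walmart; Target"}
   prompt = fill_template(
       "Tag the following vendor name: <>. Options: <>.",
       [row["input"], row["candidates"]],
   )
   # prompt == "Tag the following vendor name: AMZN Mktp US. Options: Amazon; Walmart; Target."

Under this reading, the number of ``<>`` placeholders must match the number of configured input and additional-context columns.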

.. code-block:: yaml
   :caption: configs/model_litellm.yaml — LiteLLM model

   name: gpt-5
   litellm_model: gpt-5
   prompt_registry_name: catalog.schema.vendor_tagging_prompt
   prompt_version: 1
   max_output_tokens: 9000
   temperature: 0.0

Saving and reproducing
""""""""""""""""""""""

Any config object can be serialized back to YAML with ``to_yaml()``, and previously-run experiments can be exported with ``get_experiment_yaml()``:

.. code-block:: python

   from ml_toolkit.functions.eval_utils import get_experiment_yaml

   # Save the current experiment config
   experiment.to_yaml("configs/my_experiment.yaml")

   # Export a past experiment by ID
   yaml_content = get_experiment_yaml(experiment_id)

Remote execution
""""""""""""""""

Experiments can be offloaded to Databricks serverless compute using the ``trigger_remote`` parameter. This is useful for long-running evaluations that you don't want blocking your notebook session.

.. code-block:: python
   :caption: Fire-and-forget

   experiment = Experiment(experiment_yaml="configs/experiment.yaml")

   # Submit to serverless — returns immediately
   remote_run = run_experiment(experiment, trigger_remote=True)

   # Track progress in the Databricks UI
   print(remote_run.databricks_url)
   print(remote_run.job_run_id)

   # When ready, block until the job finishes and get results
   result = remote_run.get_result()
   result.summary_df.show()

.. code-block:: python
   :caption: Block until the remote job completes

   result = run_experiment(experiment, trigger_remote=True, wait=True)
   result.summary_df.show()

Under the hood, the ``Experiment`` is serialized to a dictionary, submitted as a Databricks serverless job, reconstructed on the remote cluster, and run as usual. Results are tracked in MLflow and can be retrieved with ``get_experiment()``.

.. note::

   **Custom callable metrics cannot be used remotely** — they are not serializable. All built-in ``Metric`` enum values work with remote execution.

API Reference
^^^^^^^^^^^^^

Running Evaluations
"""""""""""""""""""

.. autoapifunction:: ml_toolkit.functions.eval_utils.eval.run_experiment

.. autoapifunction:: ml_toolkit.functions.eval_utils.eval.run_evaluation

Table Management
""""""""""""""""

.. autoapifunction:: ml_toolkit.functions.eval_utils.table_management.create_eval_table

.. autoapifunction:: ml_toolkit.functions.eval_utils.table_management.upsert_eval_table

.. autoapifunction:: ml_toolkit.functions.eval_utils.table_management.lock_eval_table

.. autoapifunction:: ml_toolkit.functions.eval_utils.table_management.unlock_eval_table

Retrieving Results
""""""""""""""""""

Run-level
~~~~~~~~~

.. autoapifunction:: ml_toolkit.functions.eval_utils.results.get_run

.. autoapifunction:: ml_toolkit.functions.eval_utils.results.get_run_df

.. autoapifunction:: ml_toolkit.functions.eval_utils.results.list_runs_df

.. autoapifunction:: ml_toolkit.functions.eval_utils.results.compare_runs_df

Experiment-level
~~~~~~~~~~~~~~~~

.. autoapifunction:: ml_toolkit.functions.eval_utils.results.get_experiment

.. autoapifunction:: ml_toolkit.functions.eval_utils.results.get_experiment_df

.. autoapifunction:: ml_toolkit.functions.eval_utils.results.get_experiment_yaml

.. autoapifunction:: ml_toolkit.functions.eval_utils.results.list_experiments_df

.. autoapifunction:: ml_toolkit.functions.eval_utils.results.compare_experiments_df

Configuration Types
"""""""""""""""""""

.. autoapiclass:: ml_toolkit.functions.eval_utils.helpers.config.Experiment

.. autoapiclass:: ml_toolkit.functions.eval_utils.helpers.config.DatasetConfig

.. autoapiclass:: ml_toolkit.functions.eval_utils.helpers.config.ModelConfig

.. autoapiclass:: ml_toolkit.functions.eval_utils.helpers.types.LLMJudgeConfig

.. autoapiclass:: ml_toolkit.functions.eval_utils.constants.Metric

Result Types
""""""""""""

.. autoapiclass:: ml_toolkit.functions.eval_utils.helpers.types.EvalRunResult

.. autoapiclass:: ml_toolkit.functions.eval_utils.helpers.types.ExperimentResult

.. autoapiclass:: ml_toolkit.functions.eval_utils.helpers.types.RemoteExperimentRun

.. autoapiclass:: ml_toolkit.functions.eval_utils.helpers.types.CustomScorer

Multi-Model Comparison
""""""""""""""""""""""

Functions for pivoting multi-model inference results and computing agreement metrics across models. These are designed to work with the output of the **Multi-Model Inference UDTF** (see :doc:`llm`), but can also be used independently on any DataFrame with per-model output columns.

.. code-block:: python
   :caption: End-to-end: UDTF output to comparison metrics

   from ml_toolkit.functions.eval_utils import pivot_and_compare_results
   from pyspark.sql import functions as F

   # results_df is the output of the multi-model UDTF
   models = ["openai/gpt-4o", "openai/databricks-claude-3-7-sonnet"]
   compared = pivot_and_compare_results(results_df, models)

   # Aggregate agreement rates
   compared.select(
       F.avg(F.col("metrics").getItem("all_models_strict")).alias("strict_rate"),
       F.avg(F.col("metrics").getItem("all_models_relaxed")).alias("relaxed_rate"),
   ).show()

.. code-block:: python
   :caption: Using add_comparison_metrics on custom data

   from ml_toolkit.functions.eval_utils import add_comparison_metrics

   # df has columns model_a, model_b, model_c (each an array of outputs)
   result = add_comparison_metrics(df, ["model_a", "model_b", "model_c"])
   # result has the original columns plus a "metrics" map column

The ``metrics`` map contains four keys:

.. list-table::
   :header-rows: 1
   :widths: 25 75

   * - Key
     - Description
   * - ``all_models_strict``
     - **1.0** if every model returned the exact same set of values (order-insensitive).
   * - ``all_models_relaxed``
     - **1.0** if for every pair of models, one output is a subset of the other.
   * - ``majority_strict``
     - **1.0** if a strict majority (>50%) of models returned the same set.
   * - ``majority_relaxed``
     - **1.0** if a relaxed majority (>50%) of models agree by subset criteria.

.. autoapifunction:: ml_toolkit.functions.eval_utils.helpers.multi_model_comparison.pivot_and_compare_results

.. autoapifunction:: ml_toolkit.functions.eval_utils.helpers.multi_model_comparison.add_comparison_metrics

.. autoapifunction:: ml_toolkit.functions.eval_utils.helpers.multi_model_comparison.normalized_output_array
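
The strict and relaxed agreement semantics described in the table above can be sketched in plain Python over sets. This is an illustrative reference only: the library computes these metrics as Spark expressions, and the majority-relaxed rule here is one plausible reading of the criterion.

.. code-block:: python

   from itertools import combinations

   def agreement_metrics(outputs: list) -> dict:
       """Plain-Python sketch of the four agreement keys; not the
       library's Spark implementation."""
       sets = [frozenset(o) for o in outputs]
       n = len(sets)

       # Strict: models agree when they return the exact same set of values
       # (order-insensitive, as in the table above).
       all_strict = float(len(set(sets)) == 1)
       # Majority strict: some set is returned by a strict majority (>50%).
       majority_strict = float(max(sets.count(s) for s in sets) > n / 2)

       # Relaxed: a pair of models agrees when one output is a subset
       # of the other.
       def relaxed(a, b):
           return a <= b or b <= a

       all_relaxed = float(all(relaxed(a, b) for a, b in combinations(sets, 2)))
       # Majority relaxed: some model relaxed-agrees with a strict majority
       # of models (one plausible reading of the criterion).
       majority_relaxed = float(
           max(sum(relaxed(a, b) for b in sets) for a in sets) > n / 2
       )
       return {
           "all_models_strict": all_strict,
           "all_models_relaxed": all_relaxed,
           "majority_strict": majority_strict,
           "majority_relaxed": majority_relaxed,
       }

   # ["a", "b"] and ["b", "a"] are the same set; ["a"] is a subset of both,
   # so strict agreement fails but relaxed agreement holds.
   m = agreement_metrics([["a", "b"], ["b", "a"], ["a"]])

Treating each model's output as a set is what makes ``all_models_strict`` order-insensitive, matching the table's description.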