``eval_utils`` module
================================

The ``eval_utils`` module provides a complete framework for evaluating LLM and model
endpoints. Everything starts from an **Experiment** — a bundle of a dataset, one or more
models, and a set of metrics. You define an experiment, run it, and get back tracked
results in MLflow and Delta tables.

.. attention:: These functions are designed to run within Databricks notebooks and require access to
               Unity Catalog and MLflow.


Configuring an Experiment
^^^^^^^^^^^^^^^^^^^^^^^^^^

An ``Experiment`` has three parts:

1. **Dataset** — which eval table to use, and which columns contain the input, expected output, and additional context.
2. **Models** — one or more models to evaluate, each targeting a Databricks serving endpoint or a LiteLLM-supported model.

   - When configuring a model, you can use a prompt template, a prompt registry name, or a prompt alias. We suggest using the prompt registry for production experiments.

3. **Metrics** — which metrics to compute across all models (``latency``, ``token_count``, ``exact_match``, ``fuzzy_match``, ``llm_judge``, ``hit_rate``, or custom scorers).

You can define an experiment in Python or load it entirely from a YAML file.

Defining in Python
""""""""""""""""""

This example shows how to define an experiment in Python. It creates an eval table, defines an experiment, and runs it.
Pay attention to the different ways to configure a model (either via an endpoint or a model in the registry). All of these are available through both code and YAML.

.. important::

   You must run ``deploy_model_serving_endpoint`` before you can use a custom model via an endpoint.

.. code-block:: python

   from ml_toolkit.functions.eval_utils import (
       create_eval_table,
       run_experiment,
       DatasetConfig,
       ModelConfig,
       Experiment,
       Metric,
   )

   # 1. Create an eval dataset (one-time setup)
   table_name = create_eval_table(
       schema_name="sandbox",
       table_name="vendor_eval",
       primary_key="row_id",
       df=eval_df,
   )

   # 2. Define the experiment
   experiment = Experiment(
       name="vendor-tagger-comparison",
       dataset=DatasetConfig(
           table=table_name,
           primary_key="row_id",
           input_column="input",
           expected_output_column="expected_output",
           additional_context_columns=["candidates"],
       ),
       models=[
           ModelConfig(
               name="llama-8b",
               endpoint="databricks-meta-llama-3-1-8b-instruct",
               prompt_template="Tag: <<input>>. Options: <<candidates>>.",
               max_output_tokens=256,
               temperature=0.0,
           ),
           ModelConfig(
               name="llama-8b",
               endpoint="endpoint-name",
               version=1,  # version of the model in the endpoint
               prompt_template="Tag: <<input>>. Options: <<candidates>>.",
               max_output_tokens=256,
               temperature=0.0,
           ),
           ModelConfig(
               name="llama-8b",
               model_name="catalog.schema.my_qlora_model",
               version=2,  # version of the model in the model registry
               prompt_registry_name="catalog.schema.vendor_tagging_prompt",
               prompt_alias="prod",
               max_output_tokens=256,
               temperature=0.0,
           ),
           ModelConfig(
               name="gpt-5",
               litellm_model="gpt-5",
               prompt_registry_name="catalog.schema.vendor_tagging_prompt",
               system_prompt="You are a precise tagging assistant.",
               max_output_tokens=9000,
               temperature=0.0,
           ),
       ],
       metrics=[Metric.LATENCY, Metric.EXACT_MATCH, Metric.FUZZY_MATCH],
   )

   # 3. Run the experiment
   result = run_experiment(experiment)
   result.summary_df.show()

``DatasetConfig`` points to an existing Unity Catalog eval table. ``ModelConfig`` defines
how to reach a model — either via ``endpoint`` (Databricks serving) or ``litellm_model``
(any LiteLLM-supported provider). Prompt templates use ``<<column_name>>`` placeholders
that reference columns in the dataset.

To add LLM-as-judge scoring, include ``Metric.LLM_JUDGE`` and provide a
``LLMJudgeConfig``:

.. code-block:: python

   from ml_toolkit.functions.eval_utils import LLMJudgeConfig

   experiment = Experiment(
       name="vendor-tagger-with-judge",
       dataset=dataset,
       models=[model],
       metrics=[Metric.LATENCY, Metric.EXACT_MATCH, Metric.LLM_JUDGE],
       llm_judge_config=LLMJudgeConfig(
           judge_model="gpt-4o",
           criteria="correctness",
           rubric="Score 1-5 based on how well the output matches the expected output.",
           max_score=5.0,
       ),
   )

.. _yaml-configuration:

Loading from YAML
"""""""""""""""""

Every config type can be loaded from a YAML file. This makes experiments reproducible,
version-controllable, and easy to share.

.. code-block:: python

   # Load a full experiment from a single YAML file
   experiment = Experiment(experiment_yaml="configs/experiment.yaml")
   result = run_experiment(experiment)

   # Or load individual configs
   dataset = DatasetConfig(dataset_yaml="configs/dataset.yaml")
   model   = ModelConfig(model_yaml="configs/model.yaml")
   judge   = LLMJudgeConfig(llm_judge_yaml="configs/judge.yaml")

When a YAML path is provided, the file contents override any other parameters passed
to the constructor.

**Experiment YAML** — bundles dataset, models, and metrics in one file:

.. code-block:: yaml
   :caption: configs/experiment.yaml

   name: vendor-tagger-comparison
   description: Compare Llama-8B vs GPT-5 for vendor tagging

   dataset:
     table: catalog.schema.vendor_eval
     primary_key: row_id
     input_column: input
     expected_output_column: expected_output
     additional_context_columns:
       - candidates

   models:
     - name: llama-8b
       endpoint: databricks-meta-llama-3-1-8b-instruct
       prompt_template: |
         Tag the following vendor name: <<input>>.
         Output the vendor name and only the vendor name, as normalized in the options below: <<candidates>>.
       max_output_tokens: 256
       temperature: 0.0

     - name: gpt-5
       litellm_model: gpt-5
       prompt_registry_name: catalog.schema.vendor_tagging_prompt
       max_output_tokens: 9000
       temperature: 0.0

   metrics:
     - latency
     - exact_match
     - fuzzy_match

   tags:
     team: ml

**With LLM-as-judge** — add the ``llm_judge`` metric and a ``llm_judge_config`` block:

.. code-block:: yaml
   :caption: configs/experiment_with_judge.yaml

   name: vendor-tagger-with-judge

   dataset:
     table: catalog.schema.vendor_eval
     primary_key: row_id
     input_column: input
     expected_output_column: expected_output
     additional_context_columns:
       - candidates

   models:
     - name: llama-8b
       endpoint: databricks-meta-llama-3-1-8b-instruct
       prompt_registry_name: catalog.schema.vendor_tagging_prompt
       prompt_alias: prod
       max_output_tokens: 256
       temperature: 0.0

   metrics:
     - latency
     - exact_match
     - llm_judge

   llm_judge_config:
     judge_model: gpt-4o
     criteria: correctness
     rubric: |
       Score 1-5 based on how well the model output matches the expected output:
       5: Exact match or semantically identical
       4: Minor differences but essentially correct
       3: Partially correct
       2: Mostly incorrect but shows understanding
       1: Completely wrong
     max_score: 5.0

**Individual config files** — useful when you want to mix and match:

.. code-block:: yaml
   :caption: configs/dataset.yaml

   table: catalog.schema.vendor_eval
   primary_key: row_id
   input_column: input
   expected_output_column: expected_output
   additional_context_columns:
     - candidates

.. code-block:: yaml
   :caption: configs/model_databricks.yaml — Databricks endpoint

   name: llama-8b
   endpoint: databricks-meta-llama-3-1-8b-instruct
   prompt_registry_name: catalog.schema.vendor_tagging_prompt
   prompt_alias: prod
   max_output_tokens: 256
   temperature: 0.0

.. code-block:: yaml
   :caption: configs/model_litellm.yaml — LiteLLM model

   name: gpt-5
   litellm_model: gpt-5
   prompt_registry_name: catalog.schema.vendor_tagging_prompt
   prompt_version: 1
   max_output_tokens: 9000
   temperature: 0.0

Saving and reproducing
""""""""""""""""""""""

Any config object can be serialized back to YAML with ``to_yaml()``, and previously-run
experiments can be exported with ``get_experiment_yaml()``:

.. code-block:: python

   # Save current experiment config
   experiment.to_yaml("configs/my_experiment.yaml")

   # Export a past experiment by ID
   yaml_content = get_experiment_yaml(experiment_id)

Remote execution
""""""""""""""""

Experiments can be offloaded to Databricks serverless compute using the ``trigger_remote``
parameter. This is useful for long-running evaluations that you don't want blocking your
notebook session.

.. code-block:: python
   :caption: Fire-and-forget

   experiment = Experiment(experiment_yaml="configs/experiment.yaml")

   # Submit to serverless — returns immediately
   remote_run = run_experiment(experiment, trigger_remote=True)

   # Track progress in the Databricks UI
   print(remote_run.databricks_url)
   print(remote_run.job_run_id)

   # When ready, block until the job finishes and get results
   result = remote_run.get_result()
   result.summary_df.show()

.. code-block:: python
   :caption: Block until the remote job completes

   result = run_experiment(experiment, trigger_remote=True, wait=True)
   result.summary_df.show()

Under the hood, the ``Experiment`` is serialized to a dictionary, submitted as a
Databricks serverless job, reconstructed on the remote cluster, and run as usual.
Results are tracked in MLflow and can be retrieved with ``get_experiment()``.

.. note::

   **Custom callable metrics cannot be used remotely** — they are not serializable.
   All built-in ``Metric`` enum values work with remote execution.


API Reference
^^^^^^^^^^^^^^

Running Evaluations
""""""""""""""""""""

.. autoapifunction:: ml_toolkit.functions.eval_utils.eval.run_experiment

.. autoapifunction:: ml_toolkit.functions.eval_utils.eval.run_evaluation


Table Management
"""""""""""""""""

.. autoapifunction:: ml_toolkit.functions.eval_utils.table_management.create_eval_table

.. autoapifunction:: ml_toolkit.functions.eval_utils.table_management.upsert_eval_table

.. autoapifunction:: ml_toolkit.functions.eval_utils.table_management.lock_eval_table

.. autoapifunction:: ml_toolkit.functions.eval_utils.table_management.unlock_eval_table


Retrieving Results
"""""""""""""""""""

Run-level
~~~~~~~~~

.. autoapifunction:: ml_toolkit.functions.eval_utils.results.get_run

.. autoapifunction:: ml_toolkit.functions.eval_utils.results.get_run_df

.. autoapifunction:: ml_toolkit.functions.eval_utils.results.list_runs_df

.. autoapifunction:: ml_toolkit.functions.eval_utils.results.compare_runs_df

Experiment-level
~~~~~~~~~~~~~~~~

.. autoapifunction:: ml_toolkit.functions.eval_utils.results.get_experiment

.. autoapifunction:: ml_toolkit.functions.eval_utils.results.get_experiment_df

.. autoapifunction:: ml_toolkit.functions.eval_utils.results.get_experiment_yaml

.. autoapifunction:: ml_toolkit.functions.eval_utils.results.list_experiments_df

.. autoapifunction:: ml_toolkit.functions.eval_utils.results.compare_experiments_df


Configuration Types
""""""""""""""""""""

.. autoapiclass:: ml_toolkit.functions.eval_utils.helpers.config.Experiment

.. autoapiclass:: ml_toolkit.functions.eval_utils.helpers.config.DatasetConfig

.. autoapiclass:: ml_toolkit.functions.eval_utils.helpers.config.ModelConfig

.. autoapiclass:: ml_toolkit.functions.eval_utils.helpers.types.LLMJudgeConfig

.. autoapiclass:: ml_toolkit.functions.eval_utils.constants.Metric


Result Types
"""""""""""""

.. autoapiclass:: ml_toolkit.functions.eval_utils.helpers.types.EvalRunResult

.. autoapiclass:: ml_toolkit.functions.eval_utils.helpers.types.ExperimentResult

.. autoapiclass:: ml_toolkit.functions.eval_utils.helpers.types.RemoteExperimentRun

.. autoapiclass:: ml_toolkit.functions.eval_utils.helpers.types.CustomScorer


Multi-Model Comparison
"""""""""""""""""""""""

Functions for pivoting multi-model inference results and computing agreement
metrics across models. These are designed to work with the output of the
**Multi-Model Inference UDTF** (see :doc:`llm`), but can also be used
independently on any DataFrame with per-model output columns.

.. code-block:: python
   :caption: End-to-end: UDTF output to comparison metrics

   from ml_toolkit.functions.eval_utils import pivot_and_compare_results
   from pyspark.sql import functions as F

   # results_df is the output of the multi-model UDTF
   models = ["openai/gpt-4o", "openai/databricks-claude-3-7-sonnet"]
   compared = pivot_and_compare_results(results_df, models)

   # Aggregate agreement rates
   compared.select(
       F.avg(F.col("metrics").getItem("all_models_strict")).alias("strict_rate"),
       F.avg(F.col("metrics").getItem("all_models_relaxed")).alias("relaxed_rate"),
   ).show()

.. code-block:: python
   :caption: Using add_comparison_metrics on custom data

   from ml_toolkit.functions.eval_utils import add_comparison_metrics

   # df has columns m0, m1, m2 (each array<string>)
   result = add_comparison_metrics(df, ["model_a", "model_b", "model_c"])
   # result has original columns + "metrics" map column

The ``metrics`` map contains four keys:

.. list-table::
   :header-rows: 1
   :widths: 25 75

   * - Key
     - Description
   * - ``all_models_strict``
     - **1.0** if every model returned the exact same set of values (order-insensitive).
   * - ``all_models_relaxed``
     - **1.0** if for every pair of models, one output is a subset of the other.
   * - ``majority_strict``
     - **1.0** if a strict majority (>50%) of models returned the same set.
   * - ``majority_relaxed``
     - **1.0** if a relaxed majority (>50%) of models agree by subset criteria.

.. autoapifunction:: ml_toolkit.functions.eval_utils.helpers.multi_model_comparison.pivot_and_compare_results

.. autoapifunction:: ml_toolkit.functions.eval_utils.helpers.multi_model_comparison.add_comparison_metrics

.. autoapifunction:: ml_toolkit.functions.eval_utils.helpers.multi_model_comparison.normalized_output_array