Anyscale Inference Reference
============================

The Anyscale backend submits LLM batch-inference jobs to Anyscale Jobs
running on a managed Ray cluster instead of Databricks GPU compute. Use it
when you need:

- Multi-node distributed inference at H100 / B200 scale.
- A high-memory CPU head with GPU workers (scaling-laws sweeps).
- Job-Queue-based prioritization across teams and job kinds.

User-facing entrypoints:

- :func:`ml_toolkit.functions.llm.inference.run_inference` with
  ``compute_type="anyscale"``.
- :func:`ml_toolkit.functions.llm.inference.submit_anyscale_inference` for
  advanced callers that want to bypass ``run_inference``'s validation.

Compute configs
---------------

The platform team registers a fixed set of compute configs in Anyscale.
Pass the matching :class:`AnyscaleComputeConfig` member (or its raw string
value) as ``anyscale_compute_config``.

When ``anyscale_compute_config`` is left ``None``, **Family-1** configs
auto-resolve from ``(gpu_type, num_gpus)``. **Family-2** configs (CPU head)
must always be passed explicitly — they are an opt-in topology for
scaling-laws sweeps and are never auto-selected.

Family 1 — GPU head + optional GPU workers
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

For training/inference where the head node also runs vLLM/Ray work.

===================== =========== =========== ========= ====================================
Name                  Total nodes GPU type    GPU count Use when
===================== =========== =========== ========= ====================================
``h100-single-node``  1           H100-8X     8         Single-node training/inference, dev
``h100-two-nodes``    2           H100-8X     16        Small distributed training
``h100-four-nodes``   4           H100-8X     32        Medium distributed training
``b200-single-node``  1           B200-8X     8         Single-node B200 work
``b200-two-nodes``    2           B200-8X     16        Small distributed B200
``b200-four-nodes``   4           B200-8X     32        Medium distributed B200
===================== =========== =========== ========= ====================================

Family 2 — CPU head + GPU workers (scaling-laws sweeps)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

For workloads where a high-memory head node coordinates many GPU workers.
The head is an ``r6i.16xlarge`` (64 vCPU, 512 GiB RAM), which is useful
when the head needs to hold tokenized datasets, dispatch state, or large
lookup tables in memory. **Naming counts GPU workers, not total nodes.**

======================= ============= ===================== =========== ========= ===============================
Name                    Head          GPU workers           Total nodes GPU count Use when
======================= ============= ===================== =========== ========= ===============================
``h100-cpu-head-1gpu``  r6i.16xlarge  1 × p5.48xlarge       2           8         Smallest scaling-laws, H100
``h100-cpu-head-2gpu``  r6i.16xlarge  2 × p5.48xlarge       3           16
``h100-cpu-head-4gpu``  r6i.16xlarge  4 × p5.48xlarge       5           32
``h100-cpu-head-8gpu``  r6i.16xlarge  8 × p5.48xlarge       9           64        Largest scaling-laws, H100
``b200-cpu-head-1gpu``  r6i.16xlarge  1 × p6-b200.48xlarge  2           8         Smallest scaling-laws, B200
``b200-cpu-head-2gpu``  r6i.16xlarge  2 × p6-b200.48xlarge  3           16
``b200-cpu-head-4gpu``  r6i.16xlarge  4 × p6-b200.48xlarge  5           32
``b200-cpu-head-8gpu``  r6i.16xlarge  8 × p6-b200.48xlarge  9           64        Largest scaling-laws, B200
======================= ============= ===================== =========== ========= ===============================

Override example
~~~~~~~~~~~~~~~~

By default, auto-resolution picks a Family-1 config. Override with a
Family-2 config when you want a CPU head:

.. code-block:: python

   from ml_toolkit.functions.llm.inference import (
       run_inference,
       AnyscaleComputeConfig,
   )

   # Auto-resolves to AnyscaleComputeConfig.H100_TWO_NODES
   run_inference(
       "cat.sch.input",
       "cat.sch.model",
       "cat.sch.output",
       compute_type="anyscale",
       gpu_type="h100",
       num_gpus=16,
   )

   # Override: same 16 H100s, but with a high-memory CPU head
   run_inference(
       "cat.sch.input",
       "cat.sch.model",
       "cat.sch.output",
       compute_type="anyscale",
       gpu_type="h100",
       num_gpus=16,
       anyscale_compute_config=AnyscaleComputeConfig.H100_CPU_HEAD_2GPU,
   )

Routing labels (``job_kind`` and ``team``)
------------------------------------------

To land on the right Job Queue partition with the right priority, set
``anyscale_job_kind`` and ``anyscale_team`` consistently. Untagged jobs
fall through to ``shared-fallback`` (priority 10).

``anyscale_job_kind`` — see :class:`AnyscaleJobKind`:

- ``incremental`` — recurring/incremental refresh (highest priority).
- ``backfill`` — one-shot historical recompute.
- ``fine_tuning`` — supervised fine-tuning runs.
- ``training`` — pretraining / from-scratch training.

``anyscale_team`` — see :class:`AnyscaleTeam`:

- ``corporate`` — Corporate workspace.
- ``yoda`` — YODA workspace.

Other team strings fall through to the base routing rules.
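Tagging example
~~~~~~~~~~~~~~~

A minimal sketch of a fully tagged submission, reusing the catalog paths
from the override example above. The parameter names ``anyscale_job_kind``
and ``anyscale_team`` are the documented ones; the enum member names
``AnyscaleJobKind.BACKFILL`` and ``AnyscaleTeam.YODA`` are assumptions
inferred from the documented string values:

.. code-block:: python

   from ml_toolkit.functions.llm.inference import (
       run_inference,
       AnyscaleJobKind,  # member names assumed from the documented values
       AnyscaleTeam,
   )

   # A one-shot historical recompute owned by the YODA team. With both
   # labels set, the job routes to YODA's backfill partition instead of
   # falling through to ``shared-fallback`` (priority 10).
   run_inference(
       "cat.sch.input",
       "cat.sch.model",
       "cat.sch.output",
       compute_type="anyscale",
       gpu_type="h100",
       num_gpus=16,
       anyscale_job_kind=AnyscaleJobKind.BACKFILL,
       anyscale_team=AnyscaleTeam.YODA,
   )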
Code reference
--------------

.. autoclass:: ml_toolkit.functions.llm.inference.AnyscaleComputeConfig
   :members:
   :undoc-members:

.. autoclass:: ml_toolkit.functions.llm.inference.AnyscaleJobKind
   :members:
   :undoc-members:

.. autoclass:: ml_toolkit.functions.llm.inference.AnyscaleTeam
   :members:
   :undoc-members:

.. autofunction:: ml_toolkit.functions.llm.inference.submit_anyscale_inference
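For advanced callers, a sketch of what a direct submission might look like.
``submit_anyscale_inference``'s exact signature is rendered above by
autodoc; the keyword arguments below merely mirror ``run_inference``'s
documented Anyscale parameters and are an assumption, not a confirmed
signature:

.. code-block:: python

   from ml_toolkit.functions.llm.inference import submit_anyscale_inference

   # Hypothetical call: assumes submit_anyscale_inference accepts the same
   # Anyscale keyword arguments as run_inference, minus its validation.
   # Passing the raw string value of a compute config follows the note in
   # "Compute configs" above.
   submit_anyscale_inference(
       "cat.sch.input",
       "cat.sch.model",
       "cat.sch.output",
       anyscale_compute_config="h100-cpu-head-2gpu",
       anyscale_job_kind="backfill",
       anyscale_team="yoda",
   )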