Anyscale Inference Reference
============================

The Anyscale backend submits LLM batch-inference jobs to Anyscale Jobs
running on a managed Ray cluster instead of Databricks GPU compute. Use it
when you need:

- Multi-node distributed inference at H100 / B200 scale.
- A high-memory CPU head with GPU workers (scaling-laws sweeps).
- Job-Queue-based prioritization across teams and job kinds.

User-facing entrypoints:

- :func:`ml_toolkit.functions.llm.inference.run_inference` with
  ``compute_type="anyscale"``.
- :func:`ml_toolkit.functions.llm.inference.submit_anyscale_inference` for
  advanced callers that want to bypass ``run_inference``'s validation.

Compute configs
---------------

The platform team registers a fixed set of compute configs in Anyscale.
Pass the matching :class:`AnyscaleComputeConfig` member (or its raw string
value) as ``anyscale_compute_config``.

When ``anyscale_compute_config`` is left ``None``, **Family-1** configs
auto-resolve from ``(gpu_type, num_gpus)``. **Family-2** configs (CPU head)
must always be passed explicitly — they are an opt-in topology for
scaling-laws sweeps and are never auto-selected.

Family 1 — GPU head + optional GPU workers
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

For training/inference where the head node also runs vLLM/Ray work.

===================== =========== =========== ========= ====================================
Name                  Total nodes GPU type    GPU count Use when
===================== =========== =========== ========= ====================================
``h100-single-node``  1           H100-8X     8         Single-node training/inference, dev
``h100-two-nodes``    2           H100-8X     16        Small distributed training
``h100-four-nodes``   4           H100-8X     32        Medium distributed training
``b200-single-node``  1           B200-8X     8         Single-node B200 work
``b200-two-nodes``    2           B200-8X     16        Small distributed B200
``b200-four-nodes``   4           B200-8X     32        Medium distributed B200
===================== =========== =========== ========= ====================================

Family 2 — CPU head + GPU workers (scaling-laws sweeps)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

For workloads where a high-memory head node coordinates many GPU workers.
The head is an ``r6i.16xlarge`` (64 vCPU, 512 GiB RAM), which is useful
when the head needs to hold tokenized datasets, dispatch state, or large
lookup tables in memory. **Naming counts GPU workers, not total nodes.**

======================= ============= ===================== =========== ========= ===============================
Name                    Head          GPU workers           Total nodes GPU count Use when
======================= ============= ===================== =========== ========= ===============================
``h100-cpu-head-1gpu``  r6i.16xlarge  1 × p5.48xlarge       2           8         Smallest scaling-laws, H100
``h100-cpu-head-2gpu``  r6i.16xlarge  2 × p5.48xlarge       3           16
``h100-cpu-head-4gpu``  r6i.16xlarge  4 × p5.48xlarge       5           32
``h100-cpu-head-8gpu``  r6i.16xlarge  8 × p5.48xlarge       9           64        Largest scaling-laws, H100
``b200-cpu-head-1gpu``  r6i.16xlarge  1 × p6-b200.48xlarge  2           8         Smallest scaling-laws, B200
``b200-cpu-head-2gpu``  r6i.16xlarge  2 × p6-b200.48xlarge  3           16
``b200-cpu-head-4gpu``  r6i.16xlarge  4 × p6-b200.48xlarge  5           32
``b200-cpu-head-8gpu``  r6i.16xlarge  8 × p6-b200.48xlarge  9           64        Largest scaling-laws, B200
======================= ============= ===================== =========== ========= ===============================

Override example
~~~~~~~~~~~~~~~~

By default, auto-resolution picks a Family-1 config. Override with a
Family-2 config when you want a CPU head:

.. code-block:: python

   from ml_toolkit.functions.llm.inference import (
       run_inference,
       AnyscaleComputeConfig,
   )

   # Auto-resolves to AnyscaleComputeConfig.H100_TWO_NODES
   run_inference(
       "cat.sch.input",
       "cat.sch.model",
       "cat.sch.output",
       compute_type="anyscale",
       gpu_type="h100",
       num_gpus=16,
   )

   # Override: same 16 H100s, but with a high-memory CPU head
   run_inference(
       "cat.sch.input",
       "cat.sch.model",
       "cat.sch.output",
       compute_type="anyscale",
       gpu_type="h100",
       num_gpus=16,
       anyscale_compute_config=AnyscaleComputeConfig.H100_CPU_HEAD_2GPU,
   )

Routing labels (``job_kind`` and ``team``)
------------------------------------------

To land on the right Job Queue partition with the right priority, set
``anyscale_job_kind`` and ``anyscale_team`` consistently. Untagged jobs
fall through to ``shared-fallback`` (priority 10).

``anyscale_job_kind`` — see :class:`AnyscaleJobKind`:

- ``incremental`` — recurring/incremental refresh (highest priority).
- ``backfill`` — one-shot historical recompute.
- ``fine_tuning`` — supervised fine-tuning runs.
- ``training`` — pretraining / from-scratch training.

``anyscale_team`` — see :class:`AnyscaleTeam`:

- ``corporate`` — Corporate workspace.
- ``yoda`` — YODA workspace.

Other team strings fall through to the base routing rules.
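Tagging example
~~~~~~~~~~~~~~~

A minimal sketch of a fully tagged submission, reusing the catalog paths
from the override example above. The parameter names ``anyscale_job_kind``
and ``anyscale_team`` are the documented ones; the enum member names
``AnyscaleJobKind.BACKFILL`` and ``AnyscaleTeam.YODA`` are assumptions
inferred from the documented string values:

.. code-block:: python

   from ml_toolkit.functions.llm.inference import (
       run_inference,
       AnyscaleJobKind,  # member names assumed from the documented values
       AnyscaleTeam,
   )

   # A one-shot historical recompute owned by the YODA team. With both
   # labels set, the job routes to YODA's backfill partition instead of
   # falling through to ``shared-fallback`` (priority 10).
   run_inference(
       "cat.sch.input",
       "cat.sch.model",
       "cat.sch.output",
       compute_type="anyscale",
       gpu_type="h100",
       num_gpus=16,
       anyscale_job_kind=AnyscaleJobKind.BACKFILL,
       anyscale_team=AnyscaleTeam.YODA,
   )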
Code reference
--------------

.. autoclass:: ml_toolkit.functions.llm.inference.AnyscaleComputeConfig
   :members:
   :undoc-members:

.. autoclass:: ml_toolkit.functions.llm.inference.AnyscaleJobKind
   :members:
   :undoc-members:

.. autoclass:: ml_toolkit.functions.llm.inference.AnyscaleTeam
   :members:
   :undoc-members:

.. autofunction:: ml_toolkit.functions.llm.inference.submit_anyscale_inference
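For advanced callers, a sketch of what a direct submission might look like.
``submit_anyscale_inference``'s exact signature is rendered above by
autodoc; the keyword arguments below merely mirror ``run_inference``'s
documented Anyscale parameters and are an assumption, not a confirmed
signature:

.. code-block:: python

   from ml_toolkit.functions.llm.inference import submit_anyscale_inference

   # Hypothetical call: assumes submit_anyscale_inference accepts the same
   # Anyscale keyword arguments as run_inference, minus its validation.
   # Passing the raw string value of a compute config follows the note in
   # "Compute configs" above.
   submit_anyscale_inference(
       "cat.sch.input",
       "cat.sch.model",
       "cat.sch.output",
       anyscale_compute_config="h100-cpu-head-2gpu",
       anyscale_job_kind="backfill",
       anyscale_team="yoda",
   )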