Anyscale Inference Reference#
The Anyscale backend submits LLM batch-inference jobs to Anyscale Jobs running on a managed Ray cluster, instead of Databricks GPU compute. Use it when you need:
- Multi-node distributed inference at H100 / B200 scale.
- A high-memory CPU head with GPU workers (scaling-laws sweeps).
- Job-Queue-based prioritization across teams and job kinds.
User-facing entrypoints:
- `ml_toolkit.functions.llm.inference.run_inference()` with `compute_type="anyscale"`.
- `ml_toolkit.functions.llm.inference.submit_anyscale_inference()` for advanced callers that want to bypass `run_inference`'s validation.
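The simplest call routes through `run_inference()`; a minimal sketch (the `cat.sch.*` table names are placeholders, reused from the override example further down):

```python
from ml_toolkit.functions.llm.inference import run_inference

# Single H100 node; the Family-1 compute config is auto-resolved
# from (gpu_type, num_gpus) as described in the next section.
run_inference(
    "cat.sch.input",   # source table of prompts
    "cat.sch.model",   # model name
    "cat.sch.output",  # destination table for completions
    compute_type="anyscale",
    gpu_type="h100",
    num_gpus=8,
)
```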
Compute configs#
The platform team registers a fixed set of compute configs in Anyscale. Pass the matching `AnyscaleComputeConfig` member (or its raw string value) as `anyscale_compute_config`.
When `anyscale_compute_config` is left `None`, Family-1 configs auto-resolve from `(gpu_type, num_gpus)`. Family-2 configs (CPU-head) must always be passed explicitly — they're an opt-in topology for scaling-laws sweeps and never auto-selected.
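Both spellings below request the same registered config; the enum member and its raw string value are interchangeable. A minimal sketch — whether `gpu_type` / `num_gpus` must still accompany an explicit config isn't stated here, so they're included to match the override example below:

```python
from ml_toolkit.functions.llm.inference import run_inference, AnyscaleComputeConfig

# Enum member...
run_inference(
    "cat.sch.input", "cat.sch.model", "cat.sch.output",
    compute_type="anyscale",
    gpu_type="h100", num_gpus=8,
    anyscale_compute_config=AnyscaleComputeConfig.H100_SINGLE_NODE,
)

# ...or its raw string value.
run_inference(
    "cat.sch.input", "cat.sch.model", "cat.sch.output",
    compute_type="anyscale",
    gpu_type="h100", num_gpus=8,
    anyscale_compute_config="h100-single-node",
)
```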
Family 1 — GPU head + optional GPU workers#
For training/inference where the head node also runs vLLM/Ray work.
| Name | Total nodes | GPU type | GPU count | Use when |
|---|---|---|---|---|
| `h100-single-node` | 1 | H100-8X | 8 | Single-node training/inference, dev |
| `h100-two-nodes` | 2 | H100-8X | 16 | Small distributed training |
| `h100-four-nodes` | 4 | H100-8X | 32 | Medium distributed training |
| `b200-single-node` | 1 | B200-8X | 8 | Single-node B200 work |
| `b200-two-nodes` | 2 | B200-8X | 16 | Small distributed B200 |
| `b200-four-nodes` | 4 | B200-8X | 32 | Medium distributed B200 |
Family 2 — CPU head + GPU workers (scaling-laws sweeps)#
For workloads where a high-memory head node coordinates many GPU workers.
The head is an `r6i.16xlarge` (64 vCPU, 512 GiB RAM), useful when it needs to hold tokenized datasets, dispatch state, or large lookup tables in memory.
Naming counts GPU workers, not total nodes.
| Name | Head | GPU workers | Total nodes | GPU count | Use when |
|---|---|---|---|---|---|
| `h100-cpu-head-1gpu` | r6i.16xlarge | 1 × p5.48xlarge | 2 | 8 | Smallest scaling-laws, H100 |
| `h100-cpu-head-2gpu` | r6i.16xlarge | 2 × p5.48xlarge | 3 | 16 | |
| `h100-cpu-head-4gpu` | r6i.16xlarge | 4 × p5.48xlarge | 5 | 32 | |
| `h100-cpu-head-8gpu` | r6i.16xlarge | 8 × p5.48xlarge | 9 | 64 | Largest scaling-laws, H100 |
| `b200-cpu-head-1gpu` | r6i.16xlarge | 1 × p6-b200.48xlarge | 2 | 8 | Smallest scaling-laws, B200 |
| `b200-cpu-head-2gpu` | r6i.16xlarge | 2 × p6-b200.48xlarge | 3 | 16 | |
| `b200-cpu-head-4gpu` | r6i.16xlarge | 4 × p6-b200.48xlarge | 5 | 32 | |
| `b200-cpu-head-8gpu` | r6i.16xlarge | 8 × p6-b200.48xlarge | 9 | 64 | Largest scaling-laws, B200 |
Override example#
By default, auto-resolve picks a Family-1 config; override with a Family-2 config when you want a CPU head:
from ml_toolkit.functions.llm.inference import (
    run_inference,
    AnyscaleComputeConfig,
)

# Auto-resolves to AnyscaleComputeConfig.H100_TWO_NODES
run_inference(
    "cat.sch.input", "cat.sch.model", "cat.sch.output",
    compute_type="anyscale",
    gpu_type="h100", num_gpus=16,
)

# Override: same 16 H100s, but with a high-memory CPU head
run_inference(
    "cat.sch.input", "cat.sch.model", "cat.sch.output",
    compute_type="anyscale",
    gpu_type="h100", num_gpus=16,
    anyscale_compute_config=AnyscaleComputeConfig.H100_CPU_HEAD_2GPU,
)
Routing labels (job_kind and team)#
To land on the right Job Queue partition with the right priority, set `anyscale_job_kind` and `anyscale_team` consistently. Untagged jobs fall through to `shared-fallback` (priority 10).
`anyscale_job_kind` — see `AnyscaleJobKind`:
- `incremental` — recurring/incremental refresh (highest priority).
- `backfill` — one-shot historical recompute.
- `fine_tuning` — supervised fine-tuning runs.
- `training` — pretraining / from-scratch training.
`anyscale_team` — see `AnyscaleTeam`:
- `corporate` — Corporate workspace.
- `yoda` — YODA workspace.
Other team strings fall through to base routing rules.
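Putting both labels together (a minimal sketch; the enum values come straight from `AnyscaleJobKind` and `AnyscaleTeam` above, and the table names are placeholders):

```python
from ml_toolkit.functions.llm.inference import (
    run_inference,
    AnyscaleJobKind,
    AnyscaleTeam,
)

# Tagged job: routed to the YODA partition at incremental
# (highest) priority instead of shared-fallback (priority 10).
run_inference(
    "cat.sch.input", "cat.sch.model", "cat.sch.output",
    compute_type="anyscale",
    gpu_type="h100", num_gpus=8,
    anyscale_job_kind=AnyscaleJobKind.INCREMENTAL,
    anyscale_team=AnyscaleTeam.YODA,
)
```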
Code reference#
- class ml_toolkit.functions.llm.inference.AnyscaleComputeConfig#
Registered Anyscale compute configs.
Two families of cluster shapes are registered with the platform team's Anyscale account. Pass any member of this enum (or its raw string value) as `anyscale_compute_config` to `run_inference()` / `submit_anyscale_inference()`.

**Family 1 — GPU head (single- or multi-node, all GPU)**
| Name | Total nodes | GPU type | GPU count | Use when |
|---|---|---|---|---|
| `h100-single-node` | 1 | H100-8X | 8 | Single-node training/inference, dev |
| `h100-two-nodes` | 2 | H100-8X | 16 | Small distributed training |
| `h100-four-nodes` | 4 | H100-8X | 32 | Medium distributed training |
| `b200-single-node` | 1 | B200-8X | 8 | Single-node B200 work |
| `b200-two-nodes` | 2 | B200-8X | 16 | Small distributed B200 |
| `b200-four-nodes` | 4 | B200-8X | 32 | Medium distributed B200 |
Family 1 is auto-resolved from `(gpu_type, num_gpus)` when `compute_config` is left `None`.

**Family 2 — CPU head (r6i.16xlarge) + GPU workers (scaling-laws sweeps)**

Use when the head needs to hold tokenized datasets, dispatch state, or large lookup tables in memory. Naming counts GPU workers, not total nodes. Always passed explicitly — never auto-resolved.
| Name | Head | GPU workers | Total nodes | GPU count | Use when |
|---|---|---|---|---|---|
| `h100-cpu-head-1gpu` | r6i.16xlarge | 1 × p5.48xlarge | 2 | 8 | Smallest scaling-laws, H100 |
| `h100-cpu-head-2gpu` | r6i.16xlarge | 2 × p5.48xlarge | 3 | 16 | |
| `h100-cpu-head-4gpu` | r6i.16xlarge | 4 × p5.48xlarge | 5 | 32 | |
| `h100-cpu-head-8gpu` | r6i.16xlarge | 8 × p5.48xlarge | 9 | 64 | Largest scaling-laws, H100 |
| `b200-cpu-head-1gpu` | r6i.16xlarge | 1 × p6-b200.48xlarge | 2 | 8 | Smallest scaling-laws, B200 |
| `b200-cpu-head-2gpu` | r6i.16xlarge | 2 × p6-b200.48xlarge | 3 | 16 | |
| `b200-cpu-head-4gpu` | r6i.16xlarge | 4 × p6-b200.48xlarge | 5 | 32 | |
| `b200-cpu-head-8gpu` | r6i.16xlarge | 8 × p6-b200.48xlarge | 9 | 64 | Largest scaling-laws, B200 |
- B200_CPU_HEAD_1GPU = 'b200-cpu-head-1gpu'#
- B200_CPU_HEAD_2GPU = 'b200-cpu-head-2gpu'#
- B200_CPU_HEAD_4GPU = 'b200-cpu-head-4gpu'#
- B200_CPU_HEAD_8GPU = 'b200-cpu-head-8gpu'#
- B200_FOUR_NODES = 'b200-four-nodes'#
- B200_SINGLE_NODE = 'b200-single-node'#
- B200_TWO_NODES = 'b200-two-nodes'#
- H100_CPU_HEAD_1GPU = 'h100-cpu-head-1gpu'#
- H100_CPU_HEAD_2GPU = 'h100-cpu-head-2gpu'#
- H100_CPU_HEAD_4GPU = 'h100-cpu-head-4gpu'#
- H100_CPU_HEAD_8GPU = 'h100-cpu-head-8gpu'#
- H100_FOUR_NODES = 'h100-four-nodes'#
- H100_SINGLE_NODE = 'h100-single-node'#
- H100_TWO_NODES = 'h100-two-nodes'#
- class ml_toolkit.functions.llm.inference.AnyscaleJobKind#
Routes the job to the right Job Queue partition.
Untagged jobs fall through to `shared-fallback` (priority 10).

- BACKFILL = 'backfill'#
- FINE_TUNING = 'fine_tuning'#
- INCREMENTAL = 'incremental'#
- TRAINING = 'training'#
- class ml_toolkit.functions.llm.inference.AnyscaleTeam#
Team label, paired with `AnyscaleJobKind` for queue routing. Untagged teams fall through to base rules.
- CORPORATE = 'corporate'#
- YODA = 'yoda'#
- ml_toolkit.functions.llm.inference.submit_anyscale_inference(input_source, model_name, output_table, *, compute_config=None, image_uri=None, job_kind=None, team=None, job_queue=None, priority=None, timeout_s=None, py_modules=None, working_dir=None, excludes=None, requirements=None, extra_env_vars=None, reasoning_parser='auto', max_actor_restarts=0, max_task_retries=0, **optional_params)#
Submit Anyscale Jobs inference from a Databricks notebook.
See `ml_toolkit.functions.llm.inference.run_inference()` for the user-facing entrypoint that validates and dispatches to this function.

Compute config is auto-resolved from `(gpu_type, num_gpus)` when `compute_config` is left as `None` — see `AnyscaleComputeConfig` for the Family-1 mapping. Pass `compute_config` explicitly to override (typical reason: opting into a Family-2 CPU-head topology like `AnyscaleComputeConfig.H100_CPU_HEAD_2GPU` so an `r6i.16xlarge` head can hold a tokenized dataset while GPU workers run inference).

`job_kind` / `team` set Anyscale tags that route the job to the right Job Queue partition — see `AnyscaleJobKind` and `AnyscaleTeam`. Untagged jobs fall through to `shared-fallback` (priority 10).

`py_modules` defaults to the local `ml_toolkit` package directory so dev iterations override the baked-in image copy without uploading the full repo. Pass `working_dir` instead if you need ancillary files alongside the package.

- Return type:
RemoteAnyscaleRun
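For advanced callers that bypass `run_inference`, a direct submission might look like the sketch below. Only parameters from the signature above are used; the table names, the six-hour timeout, and the retry count are illustrative:

```python
from ml_toolkit.functions.llm.inference import (
    submit_anyscale_inference,
    AnyscaleComputeConfig,
    AnyscaleJobKind,
    AnyscaleTeam,
)

# Direct submission, skipping run_inference's validation layer.
run = submit_anyscale_inference(
    "cat.sch.input", "cat.sch.model", "cat.sch.output",
    compute_config=AnyscaleComputeConfig.H100_CPU_HEAD_2GPU,
    job_kind=AnyscaleJobKind.BACKFILL,
    team=AnyscaleTeam.CORPORATE,
    timeout_s=6 * 3600,   # give up after six hours
    max_task_retries=2,   # retry transient Ray task failures
)
# `run` is the RemoteAnyscaleRun handle for the submitted job.
```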