Anyscale Inference Reference#
The Anyscale backend submits LLM batch-inference jobs to Anyscale Jobs running on a managed Ray cluster, instead of Databricks GPU compute. Use it when you need:
- Multi-node distributed inference at H100 / B200 scale.
- A high-memory CPU head with GPU workers (scaling-laws sweeps).
- Job-Queue-based prioritization across teams and job kinds.
User-facing entrypoints:
- `ml_toolkit.functions.llm.inference.run_inference()` with `compute_type="anyscale"`.
- `ml_toolkit.functions.llm.inference.submit_anyscale_inference()` for advanced callers that want to bypass `run_inference`'s validation.
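The simplest call routes through `run_inference()`; a minimal sketch (the `cat.sch.*` table names are placeholders, reused from the override example further down):

```python
from ml_toolkit.functions.llm.inference import run_inference

# Single H100 node; the Family-1 compute config is auto-resolved
# from (gpu_type, num_gpus) as described in the next section.
run_inference(
    "cat.sch.input",   # source table of prompts
    "cat.sch.model",   # model name
    "cat.sch.output",  # destination table for completions
    compute_type="anyscale",
    gpu_type="h100",
    num_gpus=8,
)
```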
Compute configs#
The platform team registers a fixed set of compute configs in Anyscale. Pass the matching `AnyscaleComputeConfig` member (or its raw string value) as `anyscale_compute_config`.
When `anyscale_compute_config` is left `None`, Family-1 configs auto-resolve from `(gpu_type, num_gpus)`. Family-2 configs (CPU-head) must always be passed explicitly — they're an opt-in topology for scaling-laws sweeps and never auto-selected.
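Both spellings below request the same registered config; the enum member and its raw string value are interchangeable. A minimal sketch — whether `gpu_type` / `num_gpus` must still accompany an explicit config isn't stated here, so they're included to match the override example below:

```python
from ml_toolkit.functions.llm.inference import run_inference, AnyscaleComputeConfig

# Enum member...
run_inference(
    "cat.sch.input", "cat.sch.model", "cat.sch.output",
    compute_type="anyscale",
    gpu_type="h100", num_gpus=8,
    anyscale_compute_config=AnyscaleComputeConfig.H100_SINGLE_NODE,
)

# ...or its raw string value.
run_inference(
    "cat.sch.input", "cat.sch.model", "cat.sch.output",
    compute_type="anyscale",
    gpu_type="h100", num_gpus=8,
    anyscale_compute_config="h100-single-node",
)
```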
Family 1 — GPU head + optional GPU workers#
For training/inference where the head node also runs vLLM/Ray work.
| Name | Total nodes | GPU type | GPU count | Use when |
|---|---|---|---|---|
| `h100-single-node` | 1 | H100-8X | 8 | Single-node training/inference, dev |
| `h100-two-nodes` | 2 | H100-8X | 16 | Small distributed training |
| `h100-four-nodes` | 4 | H100-8X | 32 | Medium distributed training |
| `b200-single-node` | 1 | B200-8X | 8 | Single-node B200 work |
| `b200-two-nodes` | 2 | B200-8X | 16 | Small distributed B200 |
| `b200-four-nodes` | 4 | B200-8X | 32 | Medium distributed B200 |
Family 2 — CPU head + GPU workers (scaling-laws sweeps)#
For workloads where a high-memory head node coordinates many GPU workers.
The head is an `r6i.16xlarge` (64 vCPU, 512 GiB RAM), useful when it needs to hold tokenized datasets, dispatch state, or large lookup tables in memory.
Naming counts GPU workers, not total nodes.
| Name | Head | GPU workers | Total nodes | GPU count | Use when |
|---|---|---|---|---|---|
| `h100-cpu-head-1gpu` | r6i.16xlarge | 1 × p5.48xlarge | 2 | 8 | Smallest scaling-laws, H100 |
| `h100-cpu-head-2gpu` | r6i.16xlarge | 2 × p5.48xlarge | 3 | 16 | |
| `h100-cpu-head-4gpu` | r6i.16xlarge | 4 × p5.48xlarge | 5 | 32 | |
| `h100-cpu-head-8gpu` | r6i.16xlarge | 8 × p5.48xlarge | 9 | 64 | Largest scaling-laws, H100 |
| `b200-cpu-head-1gpu` | r6i.16xlarge | 1 × p6-b200.48xlarge | 2 | 8 | Smallest scaling-laws, B200 |
| `b200-cpu-head-2gpu` | r6i.16xlarge | 2 × p6-b200.48xlarge | 3 | 16 | |
| `b200-cpu-head-4gpu` | r6i.16xlarge | 4 × p6-b200.48xlarge | 5 | 32 | |
| `b200-cpu-head-8gpu` | r6i.16xlarge | 8 × p6-b200.48xlarge | 9 | 64 | Largest scaling-laws, B200 |
Override example#
By default, auto-resolve picks a Family-1 config; override with a Family-2 config when you want a CPU head:
from ml_toolkit.functions.llm.inference import (
    run_inference,
    AnyscaleComputeConfig,
)

# Auto-resolves to AnyscaleComputeConfig.H100_TWO_NODES
run_inference(
    "cat.sch.input", "cat.sch.model", "cat.sch.output",
    compute_type="anyscale",
    gpu_type="h100", num_gpus=16,
)

# Override: same 16 H100s, but with a high-memory CPU head
run_inference(
    "cat.sch.input", "cat.sch.model", "cat.sch.output",
    compute_type="anyscale",
    gpu_type="h100", num_gpus=16,
    anyscale_compute_config=AnyscaleComputeConfig.H100_CPU_HEAD_2GPU,
)
Routing labels (job_kind and team)#
To land on the right Job Queue partition with the right priority, set `anyscale_job_kind` and `anyscale_team` consistently. Untagged jobs fall through to `shared-fallback` (priority 10).
`anyscale_job_kind` — see `AnyscaleJobKind`:
- `incremental` — recurring/incremental refresh (highest priority).
- `backfill` — one-shot historical recompute.
- `fine_tuning` — supervised fine-tuning runs.
- `training` — pretraining / from-scratch training.
`anyscale_team` — see `AnyscaleTeam`:
- `corporate` — Corporate workspace.
- `yoda` — YODA workspace.
Other team strings fall through to base routing rules.
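Putting both labels together (a minimal sketch; the enum values come straight from `AnyscaleJobKind` and `AnyscaleTeam` above, and the table names are placeholders):

```python
from ml_toolkit.functions.llm.inference import (
    run_inference,
    AnyscaleJobKind,
    AnyscaleTeam,
)

# Tagged job: routed to the YODA partition at incremental
# (highest) priority instead of shared-fallback (priority 10).
run_inference(
    "cat.sch.input", "cat.sch.model", "cat.sch.output",
    compute_type="anyscale",
    gpu_type="h100", num_gpus=8,
    anyscale_job_kind=AnyscaleJobKind.INCREMENTAL,
    anyscale_team=AnyscaleTeam.YODA,
)
```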
Code reference#
- class ml_toolkit.functions.llm.inference.AnyscaleComputeConfig#
Registered Anyscale compute configs.
Two families of cluster shapes are registered with the platform team's Anyscale account. Pass any member of this enum (or its raw string value) as `anyscale_compute_config` to `run_inference()` / `submit_anyscale_inference()`.

**Family 1 — GPU head (single- or multi-node, all GPU)**
| Name | Total nodes | GPU type | GPU count | Use when |
|---|---|---|---|---|
| `h100-single-node` | 1 | H100-8X | 8 | Single-node training/inference, dev |
| `h100-two-nodes` | 2 | H100-8X | 16 | Small distributed training |
| `h100-four-nodes` | 4 | H100-8X | 32 | Medium distributed training |
| `b200-single-node` | 1 | B200-8X | 8 | Single-node B200 work |
| `b200-two-nodes` | 2 | B200-8X | 16 | Small distributed B200 |
| `b200-four-nodes` | 4 | B200-8X | 32 | Medium distributed B200 |
Family 1 is auto-resolved from `(gpu_type, num_gpus)` when `compute_config` is left `None`.

**Family 2 — CPU head (r6i.16xlarge) + GPU workers (scaling-laws sweeps)**

Use when the head needs to hold tokenized datasets, dispatch state, or large lookup tables in memory. Naming counts GPU workers, not total nodes. Always passed explicitly — never auto-resolved.
| Name | Head | GPU workers | Total nodes | GPU count | Use when |
|---|---|---|---|---|---|
| `h100-cpu-head-1gpu` | r6i.16xlarge | 1 × p5.48xlarge | 2 | 8 | Smallest scaling-laws, H100 |
| `h100-cpu-head-2gpu` | r6i.16xlarge | 2 × p5.48xlarge | 3 | 16 | |
| `h100-cpu-head-4gpu` | r6i.16xlarge | 4 × p5.48xlarge | 5 | 32 | |
| `h100-cpu-head-8gpu` | r6i.16xlarge | 8 × p5.48xlarge | 9 | 64 | Largest scaling-laws, H100 |
| `b200-cpu-head-1gpu` | r6i.16xlarge | 1 × p6-b200.48xlarge | 2 | 8 | Smallest scaling-laws, B200 |
| `b200-cpu-head-2gpu` | r6i.16xlarge | 2 × p6-b200.48xlarge | 3 | 16 | |
| `b200-cpu-head-4gpu` | r6i.16xlarge | 4 × p6-b200.48xlarge | 5 | 32 | |
| `b200-cpu-head-8gpu` | r6i.16xlarge | 8 × p6-b200.48xlarge | 9 | 64 | Largest scaling-laws, B200 |
- B200_CPU_HEAD_1GPU = 'b200-cpu-head-1gpu'#
- B200_CPU_HEAD_2GPU = 'b200-cpu-head-2gpu'#
- B200_CPU_HEAD_4GPU = 'b200-cpu-head-4gpu'#
- B200_CPU_HEAD_8GPU = 'b200-cpu-head-8gpu'#
- B200_FOUR_NODES = 'b200-four-nodes'#
- B200_SINGLE_NODE = 'b200-single-node'#
- B200_TWO_NODES = 'b200-two-nodes'#
- H100_CPU_HEAD_1GPU = 'h100-cpu-head-1gpu'#
- H100_CPU_HEAD_2GPU = 'h100-cpu-head-2gpu'#
- H100_CPU_HEAD_4GPU = 'h100-cpu-head-4gpu'#
- H100_CPU_HEAD_8GPU = 'h100-cpu-head-8gpu'#
- H100_FOUR_NODES = 'h100-four-nodes'#
- H100_SINGLE_NODE = 'h100-single-node'#
- H100_TWO_NODES = 'h100-two-nodes'#
- class ml_toolkit.functions.llm.inference.AnyscaleJobKind#
Routes the job to the right Job Queue partition.
Untagged jobs fall through to `shared-fallback` (priority 10).

- BACKFILL = 'backfill'#
- FINE_TUNING = 'fine_tuning'#
- INCREMENTAL = 'incremental'#
- TRAINING = 'training'#
- class ml_toolkit.functions.llm.inference.AnyscaleTeam#
Team label, paired with `AnyscaleJobKind` for queue routing. Untagged teams fall through to base rules.
- CORPORATE = 'corporate'#
- YODA = 'yoda'#
- ml_toolkit.functions.llm.inference.submit_anyscale_inference(input_source, model_name, output_table, *, compute_config=None, image_uri=None, job_kind=None, team=None, job_queue=None, priority=None, timeout_s=None, py_modules=None, working_dir=None, excludes=None, requirements=None, extra_env_vars=None, reasoning_parser='auto', max_actor_restarts=0, max_task_retries=0, **optional_params)#
Submit Anyscale Jobs inference from a Databricks notebook.
See `ml_toolkit.functions.llm.inference.run_inference()` for the user-facing entrypoint that validates and dispatches to this function.

Compute config is auto-resolved from `(gpu_type, num_gpus)` when `compute_config` is left as `None` — see `AnyscaleComputeConfig` for the Family-1 mapping. Pass `compute_config` explicitly to override (typical reason: opting into a Family-2 CPU-head topology like `AnyscaleComputeConfig.H100_CPU_HEAD_2GPU` so an `r6i.16xlarge` head can hold a tokenized dataset while GPU workers run inference).

`job_kind` / `team` set Anyscale tags that route the job to the right Job Queue partition — see `AnyscaleJobKind` and `AnyscaleTeam`. Untagged jobs fall through to `shared-fallback` (priority 10).

`py_modules` defaults to the local `ml_toolkit` package directory so dev iterations override the baked-in image copy without uploading the full repo. Pass `working_dir` instead if you need ancillary files alongside the package.

- Return type:
RemoteAnyscaleRun
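For advanced callers that bypass `run_inference`, a direct submission might look like the sketch below. Only parameters from the signature above are used; the table names, the six-hour timeout, and the retry count are illustrative:

```python
from ml_toolkit.functions.llm.inference import (
    submit_anyscale_inference,
    AnyscaleComputeConfig,
    AnyscaleJobKind,
    AnyscaleTeam,
)

# Direct submission, skipping run_inference's validation layer.
run = submit_anyscale_inference(
    "cat.sch.input", "cat.sch.model", "cat.sch.output",
    compute_config=AnyscaleComputeConfig.H100_CPU_HEAD_2GPU,
    job_kind=AnyscaleJobKind.BACKFILL,
    team=AnyscaleTeam.CORPORATE,
    timeout_s=6 * 3600,   # give up after six hours
    max_task_retries=2,   # retry transient Ray task failures
)
# `run` is the RemoteAnyscaleRun handle for the submitted job.
```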