Anyscale Inference Reference#

The Anyscale backend submits LLM batch-inference jobs to Anyscale Jobs running on a managed Ray cluster, instead of Databricks GPU compute. Use it when you need:

  • Multi-node distributed inference at H100 / B200 scale.

  • A high-memory CPU head with GPU workers (scaling-laws sweeps).

  • Job-Queue-based prioritization across teams and job kinds.

User-facing entrypoints:

  • run_inference() — validates arguments and dispatches to the Anyscale backend when compute_type="anyscale".

  • submit_anyscale_inference() — submits the Anyscale Job directly; see the code reference below.

Compute configs#

The platform team registers a fixed set of compute configs in Anyscale. Pass the matching AnyscaleComputeConfig member (or its raw string value) as anyscale_compute_config.

When anyscale_compute_config is left None, Family-1 configs auto-resolve from (gpu_type, num_gpus). Family-2 configs (CPU-head) must always be passed explicitly — they’re an opt-in topology for scaling-laws sweeps and never auto-selected.
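For example, the enum member and its raw string value are interchangeable when overriding explicitly (catalog/table names below are placeholders):

from ml_toolkit.functions.llm.inference import (
    run_inference, AnyscaleComputeConfig,
)

# Explicit Family-1 config passed as an enum member ...
run_inference(
    "cat.sch.input", "cat.sch.model", "cat.sch.output",
    compute_type="anyscale",
    gpu_type="h100", num_gpus=8,
    anyscale_compute_config=AnyscaleComputeConfig.H100_SINGLE_NODE,
)

# ... or, equivalently, as its raw string value:
#   anyscale_compute_config="h100-single-node"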

Family 1 — GPU head + optional GPU workers#

For training/inference where the head node also runs vLLM/Ray work.

| Name | Total nodes | GPU type | GPU count | Use when |
|------|-------------|----------|-----------|----------|
| h100-single-node | 1 | H100-8X | 8 | Single-node training/inference, dev |
| h100-two-nodes | 2 | H100-8X | 16 | Small distributed training |
| h100-four-nodes | 4 | H100-8X | 32 | Medium distributed training |
| b200-single-node | 1 | B200-8X | 8 | Single-node B200 work |
| b200-two-nodes | 2 | B200-8X | 16 | Small distributed B200 |
| b200-four-nodes | 4 | B200-8X | 32 | Medium distributed B200 |

Family 2 — CPU head + GPU workers (scaling-laws sweeps)#

For workloads where a high-memory head node coordinates many GPU workers. The head is an r6i.16xlarge (64 vCPU, 512 GB RAM). Useful when the head needs to hold tokenized datasets, dispatch state, or large lookup tables in memory. Naming counts GPU workers, not total nodes.

| Name | Head | GPU workers | Total nodes | GPU count | Use when |
|------|------|-------------|-------------|-----------|----------|
| h100-cpu-head-1gpu | r6i.16xlarge | 1 × p5.48xlarge | 2 | 8 | Smallest scaling-laws, H100 |
| h100-cpu-head-2gpu | r6i.16xlarge | 2 × p5.48xlarge | 3 | 16 | |
| h100-cpu-head-4gpu | r6i.16xlarge | 4 × p5.48xlarge | 5 | 32 | |
| h100-cpu-head-8gpu | r6i.16xlarge | 8 × p5.48xlarge | 9 | 64 | Largest scaling-laws, H100 |
| b200-cpu-head-1gpu | r6i.16xlarge | 1 × p6-b200.48xlarge | 2 | 8 | Smallest scaling-laws, B200 |
| b200-cpu-head-2gpu | r6i.16xlarge | 2 × p6-b200.48xlarge | 3 | 16 | |
| b200-cpu-head-4gpu | r6i.16xlarge | 4 × p6-b200.48xlarge | 5 | 32 | |
| b200-cpu-head-8gpu | r6i.16xlarge | 8 × p6-b200.48xlarge | 9 | 64 | Largest scaling-laws, B200 |

Override example#

By default, auto-resolution picks a Family-1 config. Override with a Family-2 config when you want a high-memory CPU head:

from ml_toolkit.functions.llm.inference import (
    run_inference, AnyscaleComputeConfig,
)

# Auto-resolves to AnyscaleComputeConfig.H100_TWO_NODES
run_inference(
    "cat.sch.input", "cat.sch.model", "cat.sch.output",
    compute_type="anyscale",
    gpu_type="h100", num_gpus=16,
)

# Override: same 16 H100s, but with a high-memory CPU head
run_inference(
    "cat.sch.input", "cat.sch.model", "cat.sch.output",
    compute_type="anyscale",
    gpu_type="h100", num_gpus=16,
    anyscale_compute_config=AnyscaleComputeConfig.H100_CPU_HEAD_2GPU,
)

Routing labels (job_kind and team)#

To land on the right Job Queue partition with the right priority, set anyscale_job_kind and anyscale_team consistently. Untagged jobs fall through to shared-fallback (priority 10).

anyscale_job_kind — see AnyscaleJobKind:

  • incremental — recurring/incremental refresh (highest priority).

  • backfill — one-shot historical recompute.

  • fine_tuning — supervised fine-tuning runs.

  • training — pretraining / from-scratch training.

anyscale_team — see AnyscaleTeam:

  • corporate — Corporate workspace.

  • yoda — YODA workspace.

Other team strings fall through to base routing rules.
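For example, a recurring refresh owned by the YODA team might be tagged as follows, assuming the enum members are accepted directly (as with anyscale_compute_config) and using placeholder table names:

from ml_toolkit.functions.llm.inference import (
    run_inference, AnyscaleJobKind, AnyscaleTeam,
)

# Tagged job: routed to the incremental partition for the YODA team
# rather than falling through to shared-fallback (priority 10).
run_inference(
    "cat.sch.input", "cat.sch.model", "cat.sch.output",
    compute_type="anyscale",
    gpu_type="h100", num_gpus=8,
    anyscale_job_kind=AnyscaleJobKind.INCREMENTAL,
    anyscale_team=AnyscaleTeam.YODA,
)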

Code reference#

class ml_toolkit.functions.llm.inference.AnyscaleComputeConfig#

Registered Anyscale compute configs.

Two families of cluster shapes are registered with the platform team’s Anyscale account. Pass any member of this enum (or its raw string value) as anyscale_compute_config to run_inference() / submit_anyscale_inference().

Family 1 — GPU head (single- or multi-node, all GPU)

| Name | Total nodes | GPU type | GPU count | Use when |
|------|-------------|----------|-----------|----------|
| h100-single-node | 1 | H100-8X | 8 | Single-node training/inference, dev |
| h100-two-nodes | 2 | H100-8X | 16 | Small distributed training |
| h100-four-nodes | 4 | H100-8X | 32 | Medium distributed training |
| b200-single-node | 1 | B200-8X | 8 | Single-node B200 work |
| b200-two-nodes | 2 | B200-8X | 16 | Small distributed B200 |
| b200-four-nodes | 4 | B200-8X | 32 | Medium distributed B200 |

Family 1 is auto-resolved from (gpu_type, num_gpus) when compute_config is left None.

Family 2 — CPU head (r6i.16xlarge) + GPU workers (scaling-laws sweeps)

Use when the head needs to hold tokenized datasets, dispatch state, or large lookup tables in memory. Naming counts GPU workers, not total nodes. Always passed explicitly — never auto-resolved.

| Name | Head | GPU workers | Total nodes | GPU count | Use when |
|------|------|-------------|-------------|-----------|----------|
| h100-cpu-head-1gpu | r6i.16xlarge | 1 × p5.48xlarge | 2 | 8 | Smallest scaling-laws, H100 |
| h100-cpu-head-2gpu | r6i.16xlarge | 2 × p5.48xlarge | 3 | 16 | |
| h100-cpu-head-4gpu | r6i.16xlarge | 4 × p5.48xlarge | 5 | 32 | |
| h100-cpu-head-8gpu | r6i.16xlarge | 8 × p5.48xlarge | 9 | 64 | Largest scaling-laws, H100 |
| b200-cpu-head-1gpu | r6i.16xlarge | 1 × p6-b200.48xlarge | 2 | 8 | Smallest scaling-laws, B200 |
| b200-cpu-head-2gpu | r6i.16xlarge | 2 × p6-b200.48xlarge | 3 | 16 | |
| b200-cpu-head-4gpu | r6i.16xlarge | 4 × p6-b200.48xlarge | 5 | 32 | |
| b200-cpu-head-8gpu | r6i.16xlarge | 8 × p6-b200.48xlarge | 9 | 64 | Largest scaling-laws, B200 |

B200_CPU_HEAD_1GPU = 'b200-cpu-head-1gpu'#
B200_CPU_HEAD_2GPU = 'b200-cpu-head-2gpu'#
B200_CPU_HEAD_4GPU = 'b200-cpu-head-4gpu'#
B200_CPU_HEAD_8GPU = 'b200-cpu-head-8gpu'#
B200_FOUR_NODES = 'b200-four-nodes'#
B200_SINGLE_NODE = 'b200-single-node'#
B200_TWO_NODES = 'b200-two-nodes'#
H100_CPU_HEAD_1GPU = 'h100-cpu-head-1gpu'#
H100_CPU_HEAD_2GPU = 'h100-cpu-head-2gpu'#
H100_CPU_HEAD_4GPU = 'h100-cpu-head-4gpu'#
H100_CPU_HEAD_8GPU = 'h100-cpu-head-8gpu'#
H100_FOUR_NODES = 'h100-four-nodes'#
H100_SINGLE_NODE = 'h100-single-node'#
H100_TWO_NODES = 'h100-two-nodes'#
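Assuming this is a standard string-valued Enum (as the "raw string value" wording suggests), the registered configs can be listed or looked up from their values:

from ml_toolkit.functions.llm.inference import AnyscaleComputeConfig

# All registered config names as raw strings.
print(sorted(c.value for c in AnyscaleComputeConfig))

# Look a member up from its raw string value.
assert AnyscaleComputeConfig("h100-two-nodes") is AnyscaleComputeConfig.H100_TWO_NODES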
class ml_toolkit.functions.llm.inference.AnyscaleJobKind#

Routes the job to the right Job Queue partition.

Untagged jobs fall through to shared-fallback (priority 10).

BACKFILL = 'backfill'#
FINE_TUNING = 'fine_tuning'#
INCREMENTAL = 'incremental'#
TRAINING = 'training'#
class ml_toolkit.functions.llm.inference.AnyscaleTeam#

Team label, paired with AnyscaleJobKind for queue routing.

Untagged team falls through to base rules.

CORPORATE = 'corporate'#
YODA = 'yoda'#
ml_toolkit.functions.llm.inference.submit_anyscale_inference(input_source, model_name, output_table, *, compute_config=None, image_uri=None, job_kind=None, team=None, job_queue=None, priority=None, timeout_s=None, py_modules=None, working_dir=None, excludes=None, requirements=None, extra_env_vars=None, reasoning_parser='auto', max_actor_restarts=0, max_task_retries=0, **optional_params)#

Submit Anyscale Jobs inference from a Databricks notebook.

See ml_toolkit.functions.llm.inference.run_inference() for the user-facing entrypoint that validates and dispatches to this function.

Compute config is auto-resolved from (gpu_type, num_gpus) when compute_config is left as None — see AnyscaleComputeConfig for the Family-1 mapping. Pass compute_config explicitly to override (typical reason: opting into a Family-2 CPU-head topology like AnyscaleComputeConfig.H100_CPU_HEAD_2GPU so an r6i.16xlarge head can hold a tokenized dataset while GPU workers run inference).

job_kind / team set Anyscale tags that route the job to the right Job Queue partition — see AnyscaleJobKind and AnyscaleTeam. Untagged jobs fall through to shared-fallback (priority 10).

py_modules defaults to the local ml_toolkit package directory so dev iterations override the baked-in image copy without uploading the full repo. Pass working_dir instead if you need ancillary files alongside the package.

Return type:

RemoteAnyscaleRun
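A minimal sketch of a direct call, assuming Anyscale credentials are already configured in the notebook environment (catalog/table names are placeholders; run_inference() remains the usual entrypoint):

from ml_toolkit.functions.llm.inference import (
    submit_anyscale_inference, AnyscaleComputeConfig,
    AnyscaleJobKind, AnyscaleTeam,
)

# Explicit Family-2 topology plus routing tags; everything else defaults.
run = submit_anyscale_inference(
    "cat.sch.input", "cat.sch.model", "cat.sch.output",
    compute_config=AnyscaleComputeConfig.H100_CPU_HEAD_2GPU,
    job_kind=AnyscaleJobKind.BACKFILL,
    team=AnyscaleTeam.YODA,
    timeout_s=6 * 60 * 60,  # assumption: wall-clock timeout in seconds
)
# run is the RemoteAnyscaleRun handle for the submitted job.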