ML Inference

Model inference speed for LLMs, image classification, and generative AI workloads.

[Chart: Llama 2 7B tokens/sec by machine type. Higher is better; sorted by performance.]

Our Recommendations

G2 (g2-standard-8)

Cost-effective GPU inference for most models

The NVIDIA L4 GPU hits the sweet spot for inference: 42 tokens/sec for Llama 2 7B at a fraction of H100 cost. Hardware NVENC for video workloads is a bonus. Best price-performance for most inference use cases.

A3 (a3-highgpu-8g)

Large models and maximum inference throughput

When you need maximum inference speed or are serving large models (70B+ parameters), the H100's 80GB of HBM3 memory and raw compute make it the strongest option benchmarked here: 185 tokens/sec for Llama 2 7B on a single H100.
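The 70B+ guidance follows largely from weight size. A rough sketch, assuming FP16 weights at 2 bytes per parameter and counting only the weights (KV cache, activations, and runtime overhead are ignored, so real requirements are higher). The names `weight_memory_gb` and `gpus_needed` are illustrative and not part of any library.

```python
import math

H100_MEMORY_GB = 80  # per-GPU HBM3 capacity quoted in the text above


def weight_memory_gb(params_billions, bytes_per_param=2.0):
    """Approximate weight footprint in GB (1 GB = 1e9 bytes). Weights only."""
    return params_billions * bytes_per_param


def gpus_needed(params_billions, gpu_memory_gb=H100_MEMORY_GB, bytes_per_param=2.0):
    """Minimum GPU count whose combined memory holds the weights alone."""
    footprint = weight_memory_gb(params_billions, bytes_per_param)
    return math.ceil(footprint / gpu_memory_gb)


for size in (7, 70):
    print(f"{size}B: {weight_memory_gb(size):.0f} GB of FP16 weights, "
          f"at least {gpus_needed(size)} x 80 GB GPU(s)")
```

Under these assumptions a 7B model needs about 14 GB and a 70B model about 140 GB, which is more than a single 80GB GPU holds, so 70B-class FP16 serving implies sharding across GPUs or quantizing the weights.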

C3 (c3-standard-44)

CPU-only inference for small quantized models

CPU-only inference works for quantized small models when GPU cost isn't justified. 3.2 tokens/sec for Llama 2 7B Q4 on 44 vCPUs. Viable for batch processing or low-volume inference.
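To judge whether 3.2 tokens/sec is enough for a batch job, it helps to turn the rate into wall-clock time. A minimal sketch, assuming a single generation stream running continuously at the benchmarked rate; the helper names and the one-million-token job size are hypothetical.

```python
CPU_TOKENS_PER_SEC = 3.2  # c3-standard-44, Llama 2 7B Q4, from the text above


def batch_hours(total_tokens, tokens_per_sec):
    """Hours needed to generate total_tokens at a steady rate."""
    return total_tokens / tokens_per_sec / 3600.0


def tokens_per_day(tokens_per_sec):
    """Tokens generated in 24 hours at a steady rate."""
    return tokens_per_sec * 86_400


job_tokens = 1_000_000  # hypothetical batch size
print(f"{job_tokens:,} tokens: {batch_hours(job_tokens, CPU_TOKENS_PER_SEC):.1f} hours")
print(f"Daily capacity: {tokens_per_day(CPU_TOKENS_PER_SEC):,.0f} tokens")
```

At this rate a one-million-token job takes roughly 87 hours, which is why the CPU option suits overnight batches or low-volume endpoints rather than interactive serving.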

Budget Pick: G2 (g2-standard-4)

A single L4 GPU on the smallest G2 machine type gives you GPU inference capability at the lowest cost in the G2 series. Ideal for development, testing, and low-traffic inference endpoints.

Price-Performance: Llama 2 7B Tokens/sec

VM Series	Machine Type	Performance	Cost/hr	Perf/$ (tokens/s per $/hr)
G2	g2-standard-8	42 tokens/s	$0.918	45.8
A3	a3-highgpu-8g	185 tokens/s	$29.387	6.3
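The Perf/$ column is tokens per second divided by the hourly price. A minimal sketch of that arithmetic using only the two rows above; the helper names are illustrative, and the cost-per-million-tokens figure assumes one request stream running at the benchmarked rate for the full hour, which real traffic will not match exactly.

```python
# (machine type, tokens/sec, cost per hour in USD), copied from the table above.
ROWS = [
    ("g2-standard-8", 42.0, 0.918),
    ("a3-highgpu-8g", 185.0, 29.387),
]


def perf_per_dollar(tokens_per_sec, cost_per_hour):
    """Tokens/sec delivered per dollar of hourly spend."""
    return tokens_per_sec / cost_per_hour


def cost_per_million_tokens(tokens_per_sec, cost_per_hour):
    """USD to generate one million tokens at a steady rate."""
    tokens_per_hour = tokens_per_sec * 3600
    return cost_per_hour / tokens_per_hour * 1_000_000


for name, tps, cost in ROWS:
    print(f"{name}: perf/$ = {perf_per_dollar(tps, cost):.1f}, "
          f"${cost_per_million_tokens(tps, cost):.2f} per 1M tokens")
```

One caveat when reading the A3 row: the benchmark notes list a single H100, while a3-highgpu-8g is an eight-GPU machine type, so the hourly price covers more hardware than the measured run used.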

All Benchmark Data

VM Series	Machine Type	Metric	Result	Notes
G2	g2-standard-8	Llama 2 7B Tokens/sec	42 tokens/s	Llama 2 7B, FP16, single L4 GPU, vLLM
A3	a3-highgpu-8g	Llama 2 7B Tokens/sec	185 tokens/s	Llama 2 7B, FP16, single H100 GPU, vLLM
C3	c3-standard-44	Llama 2 7B Tokens/sec	3.2 tokens/s	Llama 2 7B, CPU-only inference with llama.cpp, Q4 quantized
G2	g2-standard-8	ResNet-50 Images/sec	850 images/s	ResNet-50, batch size 64, FP16, single L4 GPU
A3	a3-highgpu-8g	ResNet-50 Images/sec	4,200 images/s	ResNet-50, batch size 64, FP16, single H100 GPU
G2	g2-standard-8	Stable Diffusion Images/min	8.5 images/min	Stable Diffusion XL, 1024x1024, 30 steps, single L4 GPU
A3	a3-highgpu-8g	Stable Diffusion Images/min	28 images/min	Stable Diffusion XL, 1024x1024, 30 steps, single H100 GPU
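The GPU rows can also be read as per-workload speedups of the H100 over the L4. A short sketch using only the results in the table above; the dictionary layout and the `speedup` helper are illustrative.

```python
# (L4 result, H100 result) per benchmark, copied from the table above.
RESULTS = {
    "Llama 2 7B tokens/s": (42.0, 185.0),
    "ResNet-50 images/s": (850.0, 4200.0),
    "Stable Diffusion XL images/min": (8.5, 28.0),
}


def speedup(baseline, candidate):
    """How many times faster the candidate is than the baseline."""
    return candidate / baseline


for metric, (l4, h100) in RESULTS.items():
    print(f"{metric}: H100 is {speedup(l4, h100):.1f}x the L4")
```

The ratio differs by workload (roughly 3.3x to 4.9x in these results), so the price-performance gap measured on Llama 2 7B should not be assumed to carry over unchanged to image models.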