ML Inference

Model inference speed for LLMs, image classification, and generative AI workloads.

Llama 2 7B Tokens/sec

Higher is better -- sorted by performance

Our Recommendations

G2 (g2-standard-8)

Cost-effective GPU inference for most models

The NVIDIA L4 GPU hits the sweet spot for inference: 42 tokens/sec for Llama 2 7B at a fraction of the H100's cost ($0.918/hr for g2-standard-8 versus $29.387/hr for a3-highgpu-8g). Hardware NVENC video encoding is a bonus for video workloads. Best price-performance for most inference use cases.

A3 (a3-highgpu-8g)

Large models and maximum inference throughput

When you need maximum inference speed or are serving large models (70B+ parameters), the H100's 80 GB of HBM3 memory and raw compute make it the strongest option benchmarked here: 185 tokens/sec for Llama 2 7B.
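
As a rough illustration of why memory capacity matters for large models, the sketch below estimates the memory needed for model weights alone. It assumes 2 bytes per parameter (FP16, as in the benchmark notes) and ignores KV cache, activations, and framework overhead, so real requirements will be higher. The function names and the 7B/70B example sizes are illustrative and not part of the benchmark suite.

```python
import math

H100_MEMORY_GB = 80  # per-GPU HBM3 capacity quoted in the text


def weight_memory_gb(params_billions: float, bytes_per_param: float = 2.0) -> float:
    """Approximate weight memory in GB (1 GB = 1e9 bytes). Weights only."""
    # params_billions * 1e9 params * bytes_per_param / 1e9 bytes per GB
    return params_billions * bytes_per_param


def min_gpus_for_weights(params_billions: float,
                         gpu_memory_gb: float = H100_MEMORY_GB) -> int:
    """Lower bound on the GPU count needed just to hold FP16 weights."""
    return math.ceil(weight_memory_gb(params_billions) / gpu_memory_gb)


for label, size in [("7B", 7), ("70B", 70)]:
    print(f"{label}: ~{weight_memory_gb(size):.0f} GB of weights, "
          f"at least {min_gpus_for_weights(size)} x 80 GB GPU(s)")
```

Under these assumptions a 70B model in FP16 needs about 140 GB for weights alone, which is more than a single 80 GB GPU holds, so it would have to be sharded across GPUs or quantized.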

C3 (c3-standard-44)

CPU-only inference for small quantized models

CPU-only inference works for small quantized models when GPU cost isn't justified: 3.2 tokens/sec for Llama 2 7B (Q4, llama.cpp) on 44 vCPUs. Viable for batch processing or low-volume inference.
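
To judge whether a given throughput is acceptable, it can help to convert tokens/sec into wall-clock time per response. The sketch below does this for a hypothetical 500-token response. It assumes the benchmark figures apply to a single request stream and ignores prompt processing time; the response length and function name are illustrative.

```python
def generation_seconds(num_tokens: int, tokens_per_sec: float) -> float:
    """Time to generate num_tokens at a steady decode rate."""
    return num_tokens / tokens_per_sec


# Throughput figures from this page. Note the C3 result uses a Q4-quantized
# model while the GPU results use FP16, so this is not a like-for-like test.
THROUGHPUT = {
    "c3-standard-44 (CPU, Q4)": 3.2,
    "g2-standard-8 (L4, FP16)": 42.0,
    "a3-highgpu-8g (H100, FP16)": 185.0,
}

RESPONSE_TOKENS = 500  # hypothetical response length

for machine, tps in THROUGHPUT.items():
    secs = generation_seconds(RESPONSE_TOKENS, tps)
    print(f"{machine}: {secs:.1f} s for {RESPONSE_TOKENS} tokens")
```

At 3.2 tokens/sec a 500-token response takes over two and a half minutes, which is why the CPU option suits batch jobs better than interactive use.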

Budget Pick: G2 (g2-standard-4)

A single L4 GPU on the smallest G2 machine type gives you GPU inference at the lowest cost in the series. Ideal for development, testing, and low-traffic inference endpoints.

Price-Performance: Llama 2 7B Tokens/sec

| VM Series | Machine Type | Performance (tokens/sec) | Cost/hr | Perf/$ (tokens/sec per $/hr) |
|---|---|---|---|---|
| G2 | g2-standard-8 | 42 | $0.918 | 45.8 |
| A3 | a3-highgpu-8g | 185 | $29.387 | 6.3 |
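
The Perf/$ column appears to be throughput divided by hourly cost. A minimal sketch that reproduces it from the table values, and also derives an approximate cost per million generated tokens, is shown here. The cost-per-million figure is an added illustration that assumes the machine sustains the benchmarked single-stream rate for the whole hour; function names are hypothetical.

```python
def perf_per_dollar(tokens_per_sec: float, cost_per_hour: float) -> float:
    """Tokens/sec obtained per dollar of hourly spend."""
    return tokens_per_sec / cost_per_hour


def cost_per_million_tokens(tokens_per_sec: float, cost_per_hour: float) -> float:
    """Dollars per 1M tokens, assuming the rate is sustained all hour."""
    tokens_per_hour = tokens_per_sec * 3600
    return cost_per_hour / tokens_per_hour * 1_000_000


# (machine type, tokens/sec, cost per hour in USD) from the table
ROWS = [
    ("g2-standard-8", 42.0, 0.918),
    ("a3-highgpu-8g", 185.0, 29.387),
]

for machine, tps, cost in ROWS:
    print(f"{machine}: perf/$ = {perf_per_dollar(tps, cost):.1f}, "
          f"~${cost_per_million_tokens(tps, cost):.2f} per 1M tokens")
```

Batched serving would raise aggregate throughput on both machines, so these per-token costs are best read as a relative comparison, not a billing estimate.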

All Benchmark Data

| VM Series | Machine Type | Metric | Result | Notes |
|---|---|---|---|---|
| G2 | g2-standard-8 | Llama 2 7B Tokens/sec | 42 tokens/s | Llama 2 7B, FP16, single L4 GPU, vLLM |
| A3 | a3-highgpu-8g | Llama 2 7B Tokens/sec | 185 tokens/s | Llama 2 7B, FP16, single H100 GPU, vLLM |
| C3 | c3-standard-44 | Llama 2 7B Tokens/sec | 3.2 tokens/s | Llama 2 7B, CPU-only inference with llama.cpp, Q4 quantized |
| G2 | g2-standard-8 | ResNet-50 Images/sec | 850 images/s | ResNet-50, batch size 64, FP16, single L4 GPU |
| A3 | a3-highgpu-8g | ResNet-50 Images/sec | 4,200 images/s | ResNet-50, batch size 64, FP16, single H100 GPU |
| G2 | g2-standard-8 | Stable Diffusion Images/min | 8.5 images/min | Stable Diffusion XL, 1024x1024, 30 steps, single L4 GPU |
| A3 | a3-highgpu-8g | Stable Diffusion Images/min | 28 images/min | Stable Diffusion XL, 1024x1024, 30 steps, single H100 GPU |
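
The H100's advantage over the L4 varies by workload. A minimal sketch that computes the speedup ratios from the benchmark results listed above; the values are copied from the table and the dictionary layout is only illustrative.

```python
# (L4 result, H100 result) per metric, taken from the benchmark table
RESULTS = {
    "Llama 2 7B tokens/sec": (42.0, 185.0),
    "ResNet-50 images/sec": (850.0, 4200.0),
    "Stable Diffusion XL images/min": (8.5, 28.0),
}

# Speedup = H100 result divided by L4 result (both are higher-is-better)
speedups = {metric: h100 / l4 for metric, (l4, h100) in RESULTS.items()}

for metric, ratio in speedups.items():
    print(f"{metric}: H100 is {ratio:.2f}x the L4")
```

All three ratios are single-GPU comparisons, so they say nothing about how the full 8-GPU a3-highgpu-8g machine scales.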