ML Inference
Inference speed benchmarks for LLMs, image classification, and image generation workloads.
Llama 2 7B Tokens/sec
Higher is better. Results are sorted by performance.
Our Recommendations
Cost-effective GPU inference for most models
The NVIDIA L4 GPU hits the sweet spot for inference: 42 tokens/sec on Llama 2 7B (FP16, vLLM) at a fraction of the H100's cost. Hardware NVENC encoding is a bonus for video workloads. It offers the best price-performance for most inference use cases.
Large models and maximum inference throughput
For maximum inference speed, or for serving large models (70B+ parameters), the H100's 80 GB of HBM3 memory and raw compute lead this comparison. A single H100 reaches 185 tokens/sec on Llama 2 7B (FP16, vLLM), about 4.4x the L4.
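To make the memory argument concrete, here is a rough sketch of a weight-only footprint estimate (parameters times bytes per parameter). The helper names and the bytes-per-parameter table are illustrative assumptions, not part of the benchmark. The estimate ignores KV cache, activations, and runtime overhead, so real usage will be higher.

```python
# Rough weight-only memory estimate: parameters x bytes per parameter.
# Ignores KV cache, activations, and runtime overhead (real usage is higher).

# Assumed storage cost per weight; "q4" is approximated as 4 bits per weight.
BYTES_PER_PARAM = {"fp16": 2.0, "int8": 1.0, "q4": 0.5}


def weight_memory_gb(params_billions: float, precision: str) -> float:
    """Approximate weight size in GB (1 GB = 1e9 bytes)."""
    return params_billions * BYTES_PER_PARAM[precision]


def fits(params_billions: float, precision: str, gpu_memory_gb: float) -> bool:
    """True if the weights alone fit in the given amount of GPU memory."""
    return weight_memory_gb(params_billions, precision) <= gpu_memory_gb


if __name__ == "__main__":
    h100_gb = 80  # H100 memory size quoted in the text above
    for size in (7, 70):
        gb = weight_memory_gb(size, "fp16")
        ok = fits(size, "fp16", h100_gb)
        print(f"{size}B FP16: ~{gb:.0f} GB of weights, fits on one H100: {ok}")
```

By this estimate a 7B model at FP16 needs about 14 GB for weights, while a 70B model at FP16 needs about 140 GB, so it would have to be quantized or split across several GPUs even on H100 hardware.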
CPU-only inference for small quantized models
CPU-only inference works for small quantized models when GPU cost isn't justified: 3.2 tokens/sec for Llama 2 7B Q4 with llama.cpp on 44 vCPUs (c3-standard-44). That is viable for batch processing or low-volume inference.
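Whether 3.2 tokens/sec is "viable" depends on the size of the job. The sketch below turns a throughput figure into wall-clock hours. It assumes a steady single-stream rate with no batching or parallelism, and the 100,000-token job size is a made-up example rather than a measured workload.

```python
def batch_hours(total_tokens: int, tokens_per_sec: float) -> float:
    """Hours needed to generate total_tokens at a steady tokens_per_sec."""
    if tokens_per_sec <= 0:
        raise ValueError("tokens_per_sec must be positive")
    return total_tokens / tokens_per_sec / 3600


if __name__ == "__main__":
    job_tokens = 100_000  # hypothetical batch job size
    # Throughput figures taken from the benchmark table on this page.
    for name, tps in (("C3 CPU (Q4)", 3.2), ("L4 (FP16)", 42.0)):
        print(f"{name}: {batch_hours(job_tokens, tps):.1f} hours")
```

Under these assumptions the CPU run takes roughly 8.7 hours and the L4 run about 0.7 hours, which is why CPU-only inference suits overnight batch jobs better than interactive endpoints.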
A single L4 GPU on the smallest G2 machine type provides GPU inference at the lowest cost of the GPU options covered here. It is ideal for development, testing, and low-traffic inference endpoints.
Price-Performance: Llama 2 7B Tokens/sec
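One way to reason about price-performance is tokens generated per dollar. The sketch below is an assumption about how such a figure could be computed, not the method used for this page. Both function names are hypothetical, and no prices are included, so supply current hourly pricing yourself. The break-even ratio needs no pricing at all: the faster option wins on tokens per dollar only if it costs less than that multiple of the slower one.

```python
def tokens_per_dollar(tokens_per_sec: float, hourly_price_usd: float) -> float:
    """Tokens generated per dollar at a steady rate (hypothetical helper)."""
    if hourly_price_usd <= 0:
        raise ValueError("hourly_price_usd must be positive")
    return tokens_per_sec * 3600 / hourly_price_usd


def break_even_price_ratio(fast_tps: float, slow_tps: float) -> float:
    """Price multiple below which the faster option wins on tokens per dollar."""
    return fast_tps / slow_tps


if __name__ == "__main__":
    # Llama 2 7B throughput figures from the benchmark table on this page.
    l4_tps, h100_tps, cpu_tps = 42.0, 185.0, 3.2
    print(f"H100 vs L4 break-even: {break_even_price_ratio(h100_tps, l4_tps):.1f}x")
    print(f"L4 vs CPU break-even: {break_even_price_ratio(l4_tps, cpu_tps):.1f}x")
```

Note that the A3 result was measured on a single H100, so a fair comparison should use a per-GPU share of the a3-highgpu-8g price rather than the price of the whole machine.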
All Benchmark Data
| VM Series | Machine Type | Metric | Result | Notes |
|---|---|---|---|---|
| G2 | g2-standard-8 | Llama 2 7B Tokens/sec | 42 tokens/s | Llama 2 7B, FP16, single L4 GPU, vLLM |
| A3 | a3-highgpu-8g | Llama 2 7B Tokens/sec | 185 tokens/s | Llama 2 7B, FP16, single H100 GPU, vLLM |
| C3 | c3-standard-44 | Llama 2 7B Tokens/sec | 3.2 tokens/s | Llama 2 7B, CPU-only inference with llama.cpp, Q4 quantized |
| G2 | g2-standard-8 | ResNet-50 Images/sec | 850 images/s | ResNet-50, batch size 64, FP16, single L4 GPU |
| A3 | a3-highgpu-8g | ResNet-50 Images/sec | 4,200 images/s | ResNet-50, batch size 64, FP16, single H100 GPU |
| G2 | g2-standard-8 | Stable Diffusion XL Images/min | 8.5 images/min | Stable Diffusion XL, 1024x1024, 30 steps, single L4 GPU |
| A3 | a3-highgpu-8g | Stable Diffusion XL Images/min | 28 images/min | Stable Diffusion XL, 1024x1024, 30 steps, single H100 GPU |
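
The table can also be read as relative speedups. The short sketch below divides each single-H100 result by the matching single-L4 result. The dictionary layout and the `speedup` helper are illustrative choices, and the numbers are copied from the table above.

```python
# Single-GPU results copied from the benchmark table above.
RESULTS = {
    "Llama 2 7B (tokens/s)": {"L4": 42.0, "H100": 185.0},
    "ResNet-50 (images/s)": {"L4": 850.0, "H100": 4200.0},
    "Stable Diffusion XL (images/min)": {"L4": 8.5, "H100": 28.0},
}


def speedup(results: dict, baseline: str = "L4", target: str = "H100") -> dict:
    """Return target/baseline throughput ratio for each workload."""
    return {name: row[target] / row[baseline] for name, row in results.items()}


if __name__ == "__main__":
    for name, ratio in speedup(RESULTS).items():
        print(f"{name}: H100 is {ratio:.2f}x the L4")
    # Llama 2 7B (tokens/s): H100 is 4.40x the L4
    # ResNet-50 (images/s): H100 is 4.94x the L4
    # Stable Diffusion XL (images/min): H100 is 3.29x the L4
```

The H100's advantage ranges from about 3.3x to 4.9x depending on the workload, so the LLM figure alone does not predict how other model types will scale.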