ML Training
Training throughput for deep learning models using GPU-accelerated compute.
GPT-2 Training Throughput
Higher is better -- sorted by performance
Our Recommendations
Large-scale model training and fine-tuning
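The a3-highgpu-8g provides 8x H100 GPUs with 1800 Gbps NVLink interconnect, 48,000 tokens/sec GPT-2 training throughput (GPT-2 medium, FP16), and 15.8 PFLOPS FP8 compute. This is the machine for serious training workloads: in the benchmarks below it delivers roughly 9x the GPT-2 throughput and 5x the ResNet-50 throughput of the 8x L4 configuration.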
Budget training and fine-tuning smaller models
The g2-standard-96 provides 8x L4 GPUs, with solid training performance for smaller models at roughly 1/3 the cost of A3 and 5,400 tokens/sec GPT-2 throughput (GPT-2 medium, FP16). Good for fine-tuning, small model training, and research experiments.
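As a rough check on the price-performance trade-off, the sketch below compares throughput per unit of cost using only the figures quoted above. It assumes that "roughly 1/3 the cost" refers to the hourly price of the full 8-GPU machine, and it normalizes the A3 cost to 1.0. The names `relative_cost` and `tokens_per_cost_unit` are illustrative and do not come from any pricing API.

```python
# Figures from the recommendations above; costs are relative, not real prices.
machines = {
    "A3 (8x H100)": {"tokens_per_sec": 48_000, "relative_cost": 1.0},
    # Assumption: "roughly 1/3 the cost of A3" means 1/3 of the hourly price.
    "G2 (8x L4)": {"tokens_per_sec": 5_400, "relative_cost": 1.0 / 3.0},
}


def tokens_per_cost_unit(tokens_per_sec: float, relative_cost: float) -> float:
    """Throughput obtained per unit of relative cost."""
    return tokens_per_sec / relative_cost


efficiency = {name: tokens_per_cost_unit(**spec) for name, spec in machines.items()}
for name, value in efficiency.items():
    print(f"{name}: {value:,.0f} tokens/s per cost unit")
```

Under that assumption, G2 costs less per hour but more per token trained, so it fits best when the job is small enough that the lower hourly price matters more than time to completion.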
A single L4 GPU with enough CPU and memory for most fine-tuning jobs. Use Spot pricing for 70% savings on training runs that can handle interruptions.
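To see how the Spot discount plays out for an interruptible run, here is a minimal sketch. The 70% discount comes from the text above. The on-demand price, the run length, and the fraction of work redone after preemptions are hypothetical placeholders, not GCP prices or measured values.

```python
def spot_run_cost(on_demand_hourly: float, hours: float,
                  discount: float = 0.70, rework_fraction: float = 0.0) -> float:
    """Estimated Spot cost of a training run.

    rework_fraction is the extra compute spent redoing work lost to
    preemptions (0.10 means 10% more hours than an uninterrupted run).
    """
    if not 0.0 <= discount < 1.0:
        raise ValueError("discount must be in [0, 1)")
    if rework_fraction < 0.0:
        raise ValueError("rework_fraction must be non-negative")
    spot_hourly = on_demand_hourly * (1.0 - discount)
    return spot_hourly * hours * (1.0 + rework_fraction)


# Hypothetical inputs: $1.00/hour on demand, 20-hour fine-tuning job.
on_demand_total = 1.00 * 20
spot_total = spot_run_cost(1.00, 20, rework_fraction=0.10)
print(f"on demand: ${on_demand_total:.2f}, spot: ${spot_total:.2f}")
```

Even with 10% of the work redone, the hypothetical run costs about a third of the on-demand total. Frequent checkpointing is what keeps the rework fraction small.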
All Benchmark Data
| VM Series | Machine Type | Metric | Result | Notes |
|---|---|---|---|---|
| A3 | a3-highgpu-8g | GPT-2 Training Throughput | 48,000 tokens/s | GPT-2 medium, 8x H100, DeepSpeed ZeRO-3, FP16 |
| G2 | g2-standard-96 | GPT-2 Training Throughput | 5,400 tokens/s | GPT-2 medium, 8x L4, DeepSpeed ZeRO-2, FP16 |
| A3 | a3-highgpu-8g | ResNet-50 Training Throughput | 12,800 images/s | ResNet-50, 8x H100, mixed precision, PyTorch DDP |
| G2 | g2-standard-96 | ResNet-50 Training Throughput | 2,400 images/s | ResNet-50, 8x L4, mixed precision, PyTorch DDP |
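
The ratios behind the comparison can be derived directly from the table. The sketch below copies the four results and computes the A3-to-G2 speedup and the per-GPU throughput for each benchmark. The variable names are illustrative.

```python
# (benchmark, series) -> (result, unit), copied from the table above.
results = {
    ("GPT-2", "A3"): (48_000, "tokens/s"),
    ("GPT-2", "G2"): (5_400, "tokens/s"),
    ("ResNet-50", "A3"): (12_800, "images/s"),
    ("ResNet-50", "G2"): (2_400, "images/s"),
}
GPUS_PER_MACHINE = 8  # both machine types in the table have 8 GPUs

speedups = {}
for benchmark in ("GPT-2", "ResNet-50"):
    a3, unit = results[(benchmark, "A3")]
    g2, _ = results[(benchmark, "G2")]
    speedups[benchmark] = a3 / g2
    print(f"{benchmark}: A3 is {a3 / g2:.1f}x G2; "
          f"per GPU {a3 / GPUS_PER_MACHINE:,.0f} vs {g2 / GPUS_PER_MACHINE:,.0f} {unit}")
```

The GPT-2 runs used different ZeRO stages (ZeRO-3 on A3, ZeRO-2 on G2), so that ratio reflects the two configurations as benchmarked rather than an identical software setup.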