This table provides an overview of the approximate model sizes (in billions of parameters) that can be run on various VRAM configurations, along with examples of well-known models. These are rough estimates that vary with the specific implementation, architecture, and optimizations used; a quick calculation that reproduces them is sketched after the notes below.
VRAM (GB) | FP32 | FP16/BF16 | INT8 | INT4 | INT2 | Example Models |
---|---|---|---|---|---|---|
16 | 3-4B | 6-8B | 12-16B | 24-32B | 48-64B | GPT-2 (1.5B), BERT-Large (340M) |
24 | 5-6B | 10-12B | 20-24B | 40-48B | 80-96B | GPT-J (6B), BLOOM-7B1 |
48 | 10-12B | 20-24B | 40-48B | 80-96B | 160-192B | T5-11B, BLOOM-7B1 (FP32) |
80 | 18-20B | 36-40B | 72-80B | 144-160B | 288-320B | GPT-NeoX-20B, BLOOM-176B² |
96 | 22-24B | 44-48B | 88-96B | 176-192B | 352-384B | BLOOM-176B², Jurassic-1 Jumbo (178B)² |
128 | 30-32B | 60-64B | 120-128B | 240-256B | 480-512B | GPT-3 175B², PaLM 540B² |
160 | 38-40B | 76-80B | 152-160B | 304-320B | 608-640B | PaLM 540B², Megatron-Turing NLG 530B² |
192 | 46-48B | 92-96B | 184-192B | 368-384B | 736-768B | BLOOM-176B (INT8) |
256 | 62-64B | 124-128B | 248-256B | 496-512B | 992-1024B | GPT-3 175B (INT8), LLaMA 2 70B (FP16) |
320 | 78-80B | 156-160B | 312-320B | 624-640B | 1248-1280B | Chinchilla 70B (FP32) |
384 | 94-96B | 188-192B | 376-384B | 752-768B | 1504-1536B | PaLM 540B (INT4) |
512 | 126-128B | 252-256B | 504-512B | 1008-1024B | 2016-2048B | GPT-3 175B (FP16), BLOOM-176B (FP16) |
Notes:
- ¹ Can run in full precision (FP32).
- ² Requires quantization or other optimization techniques.
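The per-precision figures in the table follow directly from how many bytes each format uses to store one weight (4 for FP32, 2 for FP16/BF16, 1 for INT8, 0.5 for INT4, 0.25 for INT2). The short Python sketch below reproduces them under a simplifying assumption: weights dominate memory, and a nominal 10% of VRAM is reserved for overhead. The `overhead_fraction` value is illustrative, not a measured constant.

```python
# Rough upper-bound estimate of how many parameters fit in a given amount of
# VRAM at different precisions. Assumption: weight storage dominates and a
# fixed fraction of VRAM is reserved for activations, KV cache, and framework
# bookkeeping.

BYTES_PER_PARAM = {
    "FP32": 4.0,
    "FP16/BF16": 2.0,
    "INT8": 1.0,
    "INT4": 0.5,
    "INT2": 0.25,
}

def max_params_billion(vram_gb: float, precision: str,
                       overhead_fraction: float = 0.1) -> float:
    """Largest model (in billions of parameters) whose weights fit in
    `vram_gb` of VRAM at `precision`, after reserving `overhead_fraction`
    of the memory for everything that is not weights."""
    usable_bytes = vram_gb * 1e9 * (1.0 - overhead_fraction)
    return usable_bytes / BYTES_PER_PARAM[precision] / 1e9

for vram in (16, 24, 48, 80, 96, 128):
    estimates = {p: round(max_params_billion(vram, p), 1) for p in BYTES_PER_PARAM}
    print(f"{vram:>3} GB -> {estimates}")
```

For example, 80 GB at FP32 gives roughly 18B parameters, matching the 18-20B range in the table; the spread in each cell simply reflects how much overhead is assumed.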
Additional Considerations:
- These estimates assume the entire VRAM is available for the model, which is often not the case in practice due to memory used by the framework, operating system, and other processes.
- Model parallelism and other advanced techniques can allow running even larger models by distributing them across multiple GPUs.
- Inference typically requires less memory than training, so larger models can often be run for inference on smaller VRAM configurations (a rough comparison is sketched after this list).
- The exact sizes can vary based on model architecture, implementation details, and specific optimizations used.
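To make the inference-versus-training point concrete, here is a back-of-the-envelope sketch. The 16 bytes/parameter figure for Adam-based mixed-precision training is a commonly cited rule of thumb (FP16 weights and gradients plus FP32 master weights and two FP32 Adam moments), not an exact number, and activation memory and the KV cache are ignored entirely.

```python
# Rough comparison of the memory needed to serve a model versus train it,
# counting only weights, gradients, and optimizer state.

def inference_weights_gb(params_billion: float, bytes_per_param: float = 2.0) -> float:
    """Weight memory for FP16/BF16 inference."""
    return params_billion * bytes_per_param

def training_state_gb(params_billion: float, bytes_per_param: float = 16.0) -> float:
    """Weights + gradients + Adam optimizer state for mixed-precision training."""
    return params_billion * bytes_per_param

for size in (7, 13, 70):
    print(f"{size}B parameters: ~{inference_weights_gb(size):.0f} GB to serve in FP16, "
          f"~{training_state_gb(size):.0f} GB of state to train with Adam")
```

A 70B-parameter model, for instance, needs on the order of 140 GB just for FP16 weights at inference time but well over 1 TB of weight, gradient, and optimizer state when trained this way, which is why training is almost always sharded across many GPUs.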
Key Takeaways:
- 16-24GB VRAM: Suitable for most consumer-grade AI tasks and smaller research models.
- 48-96GB VRAM: Enables work with medium to large-scale models, often used in professional and research settings.
- 128-256GB VRAM: Allows running some of the largest publicly available models with various optimizations.
- 320-512GB VRAM: Provides capacity for the largest current models and future developments, often achieved through multi-GPU setups.
This table demonstrates the significant impact of quantization and other optimization techniques in enabling larger models to run on limited VRAM. As AI continues to advance, we can expect further innovations in model compression and memory-efficient architectures to push these boundaries even further.
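As a concrete illustration of that quantization point, the sketch below loads a causal language model with 4-bit weights using the Hugging Face transformers and bitsandbytes libraries (assuming both are installed and a GPU is available). The model ID is a placeholder; the weight footprint ends up at roughly a quarter of the FP16 size, plus quantization constants, activations, and the KV cache.

```python
# Minimal sketch: load a causal LM with 4-bit weight quantization via the
# transformers + bitsandbytes integration. The model ID is a placeholder;
# substitute any causal LM you have access to.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "meta-llama/Llama-2-7b-hf"  # placeholder model ID

quant_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # store weights in 4-bit blocks
    bnb_4bit_compute_dtype=torch.bfloat16,  # dequantize to bf16 for matmuls
)

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=quant_config,
    device_map="auto",  # spread layers across available GPUs/CPU if needed
)
tokenizer = AutoTokenizer.from_pretrained(model_id)
```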
And hold on... it doesn't end with transformers.