GPU Infrastructure Planning for Enterprise LLM Deployment
GPU infrastructure is the foundation of every private LLM deployment. Undersize it and your models run too slowly to be useful. Oversize it and you burn capital on idle silicon. The challenge is that GPU infrastructure decisions are sticky: procurement lead times are long, hardware lifecycles span years, and switching costs are real. Getting the sizing right from the start matters more here than in almost any other infrastructure domain.
The GPU Landscape for LLM Inference
Not all GPUs are created equal for LLM workloads. The two characteristics that matter most for inference are VRAM capacity (how large a model you can load) and memory bandwidth (how quickly you can read model weights during generation). Compute throughput (FLOPS), while important for training, is secondary for inference, where the bottleneck is almost always memory bandwidth.
NVIDIA H100 SXM
- VRAM: 80 GB HBM3
- Memory bandwidth: 3.35 TB/s
- Interconnect: NVLink 4.0 (900 GB/s GPU-to-GPU)
- TDP: 700W
- Street price: $25,000-$35,000 per unit
The H100 SXM is the current performance leader for LLM inference. Its HBM3 memory bandwidth is 67% higher than the A100, translating directly to faster token generation. The NVLink interconnect enables efficient tensor parallelism across 8 GPUs within a node, essential for serving models that exceed single-GPU VRAM capacity.
NVIDIA H100 PCIe
- VRAM: 80 GB HBM2e
- Memory bandwidth: 2.0 TB/s
- Interconnect: PCIe Gen5 (128 GB/s)
- TDP: 350W
- Street price: $20,000-$28,000 per unit
The PCIe variant offers the same VRAM but lower memory bandwidth, and NVLink only via an optional two-card bridge rather than a full NVSwitch fabric. It fits in standard PCIe servers without specialized cooling, making it accessible for existing data center environments. The reduced bandwidth means 30-40% lower tokens per second compared to the SXM, but for smaller models or lower-throughput requirements, the economics can be favorable.
NVIDIA A100 80GB
- VRAM: 80 GB HBM2e
- Memory bandwidth: 2.0 TB/s
- Interconnect: NVLink 3.0 (600 GB/s)
- TDP: 400W (SXM), 300W (PCIe)
- Street price: $10,000-$15,000 (secondary market)
The A100 remains a compelling value proposition, particularly on the secondary market where prices have dropped significantly. Memory bandwidth matches the H100 PCIe, and VRAM capacity is identical. For organizations optimizing total cost of ownership rather than peak performance, the A100 offers 70-80% of H100 SXM inference performance at 40-50% of the price.
NVIDIA L40S
- VRAM: 48 GB GDDR6
- Memory bandwidth: 864 GB/s
- Interconnect: PCIe Gen4
- TDP: 350W
- Street price: $7,000-$10,000 per unit
The L40S targets a different niche: deploying smaller models (7B-13B) in standard data center racks without the cooling and power requirements of HBM-based GPUs. Its GDDR6 memory bandwidth is significantly lower than HBM-based cards, which limits throughput for larger models. However, for serving quantized 7B-13B models, the L40S provides good price-performance in a data center-friendly form factor.
Consumer GPUs: RTX 4090 and RTX 5090
Consumer GPUs are sometimes considered for cost-sensitive deployments. The RTX 4090 (24 GB VRAM, 1 TB/s bandwidth) and RTX 5090 (32 GB VRAM, 1.79 TB/s bandwidth) can run smaller quantized models effectively. However, they are explicitly not designed for data center use: they lack ECC memory, enterprise driver support, and the thermal design for 24/7 operation in rack environments. NVIDIA's EULA also prohibits data center use of GeForce cards. Use them for development and testing only.
VRAM Requirements by Model Size
The VRAM needed to serve a model depends on the number of parameters, the precision of the weights, and the KV cache for concurrent requests. Here are the approximate requirements:
Model Weight Memory
- 7B parameters: 14 GB (FP16), 7 GB (INT8), 3.5 GB (INT4)
- 13B parameters: 26 GB (FP16), 13 GB (INT8), 6.5 GB (INT4)
- 34B parameters: 68 GB (FP16), 34 GB (INT8), 17 GB (INT4)
- 70B parameters: 140 GB (FP16), 70 GB (INT8), 35 GB (INT4)
- 405B parameters: 810 GB (FP16), 405 GB (INT8), ~200 GB (INT4)
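These figures follow directly from parameter count times bytes per parameter (2 bytes for FP16, 1 for INT8, 0.5 for INT4); a minimal sketch:

```python
# Model weight memory = parameter count x bytes per parameter.
BYTES_PER_PARAM = {"fp16": 2.0, "int8": 1.0, "int4": 0.5}

def weight_memory_gb(params_billions: float, precision: str) -> float:
    """Approximate VRAM needed for model weights alone, in GB."""
    return params_billions * BYTES_PER_PARAM[precision]

for size in (7, 13, 34, 70, 405):
    print(f"{size}B: "
          f"{weight_memory_gb(size, 'fp16'):.1f} GB FP16, "
          f"{weight_memory_gb(size, 'int8'):.1f} GB INT8, "
          f"{weight_memory_gb(size, 'int4'):.1f} GB INT4")
```

Real deployments need a few extra gigabytes beyond this for activations, quantization scales, and framework overhead, which the table's figures deliberately omit.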
KV Cache Overhead
Beyond model weights, you must budget VRAM for the key-value cache that stores attention state for each concurrent request. KV cache size scales with context length, batch size, number of layers, number of KV heads (fewer than attention heads under grouped-query attention), and head dimension. For a 70B model with 4096 context length:
- Per-request KV cache: approximately 1-2 GB
- With 16 concurrent requests: 16-32 GB additional VRAM
- With 64 concurrent requests: 64-128 GB additional VRAM (often exceeding the weight memory itself)
This is why PagedAttention (implemented in vLLM) is so important: it manages KV cache memory dynamically, avoiding the need to pre-allocate the maximum possible cache for every request. Without it, you need significantly more VRAM headroom.
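The KV cache arithmetic above can be sketched as follows. The model shape used here (80 layers, 8 KV heads under grouped-query attention, head dimension 128, FP16 cache) is an illustrative Llama-style 70B configuration, not a universal constant:

```python
def kv_cache_gb(context_len: int, batch_size: int, n_layers: int,
                n_kv_heads: int, head_dim: int, bytes_per_elem: int = 2) -> float:
    """KV cache memory in GB: keys + values for every layer, KV head, and position."""
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem  # K and V
    return per_token * context_len * batch_size / 1e9

# Illustrative Llama-style 70B shape: 80 layers, 8 KV heads (GQA), head dim 128.
per_request = kv_cache_gb(4096, 1, 80, 8, 128)
print(f"per request: {per_request:.2f} GB, "
      f"64 concurrent: {64 * per_request:.0f} GB")
```

Note how sensitive the result is to the KV-head count: an older 70B-class model without grouped-query attention (64 KV heads instead of 8) would need eight times this much cache.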
Practical GPU Configurations
- 7B model, moderate throughput: 1x L40S or 1x A100 40GB. Quantized to INT4, the model runs comfortably on either, with ample room for KV cache.
- 70B model, production throughput: 2x H100 80GB (FP16) or 1x H100 80GB (INT4 quantized). For high concurrency, 4x H100 80GB provides headroom for KV cache.
- 405B model, production throughput: 8x H100 80GB (FP16) with tensor parallelism. Quantized to INT4, 4x H100 80GB is feasible.
Multi-GPU Serving Strategies
When a model exceeds a single GPU's VRAM, you must distribute it across multiple GPUs. Two primary parallelism strategies apply:
Tensor Parallelism
Tensor parallelism splits individual matrix operations across GPUs. Each GPU holds a slice of every layer and computes a portion of each operation. The GPUs must communicate intermediate results at every layer, requiring high-bandwidth interconnects (NVLink or NVSwitch). This is the standard approach for multi-GPU inference within a single node.
Key constraint: Tensor parallelism is limited by interconnect bandwidth. Over PCIe (64-128 GB/s), the communication overhead is substantial and limits scaling beyond 2 GPUs. Over NVLink (600-900 GB/s), 4-8 GPU tensor parallelism is efficient.
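A back-of-envelope model of that constraint: assume two ring all-reduces per layer per decode step (one after attention, one after the MLP), each on a batch of hidden-size activations, plus a fixed per-operation launch latency. The model shape (hidden size 8192, 80 layers) and the launch latencies are illustrative assumptions, not measured values:

```python
def tp_comm_ms_per_step(n_gpus: int, batch: int, hidden: int, n_layers: int,
                        link_gb_s: float, launch_us: float,
                        bytes_per_elem: int = 2) -> float:
    """Rough per-decode-step tensor-parallel communication time (ms).

    Assumes two ring all-reduces per layer (after attention, after the MLP).
    A ring all-reduce moves ~2*(N-1)/N of the message over each GPU's link
    and pays a fixed launch/latency cost per operation (launch_us, assumed)."""
    msg_bytes = batch * hidden * bytes_per_elem  # FP16 activations
    wire_bytes = 2 * (n_gpus - 1) / n_gpus * msg_bytes
    n_ops = 2 * n_layers
    return n_ops * (wire_bytes / (link_gb_s * 1e9) + launch_us * 1e-6) * 1e3

# Illustrative 70B-style shape: hidden 8192, 80 layers, batch 16, 4-way TP.
# Launch latencies (3 us NVLink, 10 us PCIe) are rough assumptions.
for name, bw, lat in (("NVLink 900 GB/s", 900, 3), ("PCIe Gen5 ~64 GB/s", 64, 10)):
    print(f"{name}: {tp_comm_ms_per_step(4, 16, 8192, 80, bw, lat):.2f} ms/step")
```

With these inputs, the PCIe case spends several milliseconds per decode step on communication, a substantial fraction of a typical 20-25 ms inter-token latency budget, while the NVLink case stays well under a millisecond.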
Pipeline Parallelism
Pipeline parallelism assigns different layers of the model to different GPUs. GPU 1 runs layers 1-20, GPU 2 runs layers 21-40, and so on. Each request flows through the GPUs sequentially, creating a pipeline. This approach requires less inter-GPU bandwidth since communication only happens between adjacent pipeline stages, but it introduces pipeline bubbles (idle time) when the pipeline is not fully saturated.
Best for: Multi-node deployments where inter-node bandwidth (typically InfiniBand at 200-400 Gb/s) is lower than intra-node NVLink bandwidth. Use tensor parallelism within a node and pipeline parallelism across nodes.
Expert Parallelism (MoE Models)
For mixture-of-experts models like Mixtral or DeepSeek V3, expert parallelism distributes different expert networks across GPUs. Since only a subset of experts activates per token, this can be more efficient than tensor or pipeline parallelism for MoE architectures. However, it requires careful balancing to avoid hotspots when certain experts are activated more frequently.
Capacity Planning
Sizing your GPU cluster requires estimating your inference demand and working backward to hardware.
Demand Estimation
- Request volume: How many inference requests per day? What is the peak-to-average ratio?
- Token volume: Average input and output tokens per request? Maximum context length used?
- Latency SLA: What is the acceptable time-to-first-token and total response time?
- Concurrency: How many simultaneous requests must be served without queuing?
Throughput Benchmarking
Before procurement, benchmark your target model and serving framework on the candidate GPU. Measure tokens per second at various batch sizes and concurrency levels. Key metrics:
- Tokens per second (throughput): Total output tokens generated per second across all concurrent requests
- Time to first token (TTFT): Latency from request arrival to the first output token. This determines perceived responsiveness.
- Inter-token latency (ITL): Time between consecutive output tokens. This determines streaming speed.
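Given per-token arrival timestamps from a streaming benchmark run (timestamp collection itself is framework-specific and omitted here), these three metrics reduce to simple differences:

```python
def latency_metrics(request_start: float, token_times: list[float]) -> dict:
    """Compute TTFT, mean inter-token latency, and per-request throughput.

    token_times are wall-clock timestamps (seconds) at which each output
    token arrived, in order."""
    ttft = token_times[0] - request_start
    gaps = [b - a for a, b in zip(token_times, token_times[1:])]
    itl = sum(gaps) / len(gaps) if gaps else 0.0
    return {"ttft_s": ttft, "mean_itl_s": itl,
            "throughput_tok_s": len(token_times) / (token_times[-1] - request_start)}

# Example: first token after 300 ms, then one token every 25 ms (40 tok/s).
m = latency_metrics(0.0, [0.3 + 0.025 * i for i in range(100)])
print(f"TTFT {m['ttft_s']*1e3:.0f} ms, ITL {m['mean_itl_s']*1e3:.1f} ms")
```

When benchmarking under concurrency, report percentiles (p50/p95) of these per-request values rather than means alone, since tail latency is usually what the SLA constrains.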
A single H100 80GB serving Llama 3 70B in INT4 quantization typically achieves 40-60 tokens per second per request with TTFT of 200-500ms. At batch size 16, total throughput reaches 400-600 tokens per second with higher per-request latency.
The Capacity Formula
Required GPUs = (Peak concurrent requests x Required tokens per second per request) / (Aggregate tokens per second one GPU sustains at the target latency SLA)
Add a 30-50% buffer for headroom, maintenance windows, and growth. Round up to the nearest server configuration (typically 4 or 8 GPUs per server).
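A minimal sketch of this sizing calculation, with the buffer and server rounding applied. The example inputs (200 concurrent requests, 30 tok/s per request, 500 tok/s aggregate per GPU) are assumed benchmark results, not vendor figures:

```python
import math

def required_gpus(peak_concurrent: int, tokens_per_s_per_request: float,
                  gpu_throughput_tok_s: float, buffer: float = 0.4,
                  gpus_per_server: int = 8) -> tuple[int, int]:
    """Size a cluster from the capacity formula, with headroom buffer
    and rounding up to whole servers. Returns (gpus, servers)."""
    demand_tok_s = peak_concurrent * tokens_per_s_per_request
    raw_gpus = demand_tok_s / gpu_throughput_tok_s
    with_buffer = raw_gpus * (1 + buffer)
    servers = math.ceil(with_buffer / gpus_per_server)
    return servers * gpus_per_server, servers

gpus, servers = required_gpus(200, 30, 500)
print(f"{gpus} GPUs across {servers} servers")  # 24 GPUs across 3 servers
```

The per-GPU throughput input must come from your own benchmark at the target latency SLA; the same GPU can report several times higher aggregate throughput if latency is unconstrained.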
Cloud GPU vs. On-Premise Economics
Cloud GPU instances offer flexibility but at a premium. The comparison:
- Cloud GPU hourly cost: An 8x H100 instance on major cloud providers costs $20-$30 per hour, or $175,000-$260,000 per year at 100% utilization. Reserved instances bring this down by 30-40%.
- On-premise 8x H100 amortized cost: Hardware at $280,000 amortized over 3 years is approximately $93,000 per year, plus $36,000-$48,000 for colocation, power, and networking.
- Break-even utilization: On-premise hardware becomes cheaper than cloud instances at roughly 40-50% average utilization when factoring in all operational costs.
The decision also hinges on workload predictability. If your GPU utilization is highly variable (development workloads, batch processing), cloud instances with per-hour billing may be more cost-effective. For steady-state inference serving with predictable demand, on-premise hardware wins decisively.
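The break-even arithmetic can be sketched as follows, using the figures above ($30/hour at the top of the on-demand range, $280,000 hardware amortized over 3 years, $42,000/year for colocation, power, and networking); with these illustrative inputs the break-even lands near the top of the quoted 40-50% range:

```python
HOURS_PER_YEAR = 8760

def annual_cloud_cost(hourly_rate: float, utilization: float) -> float:
    """Cloud cost for the hours actually run (per-hour billing)."""
    return hourly_rate * HOURS_PER_YEAR * utilization

def annual_onprem_cost(hardware: float, amort_years: int, opex: float) -> float:
    """On-prem cost is fixed regardless of utilization."""
    return hardware / amort_years + opex

def breakeven_utilization(hourly_rate: float, hardware: float,
                          amort_years: int, opex: float) -> float:
    """Utilization above which on-prem is cheaper than on-demand cloud."""
    return annual_onprem_cost(hardware, amort_years, opex) / \
           (hourly_rate * HOURS_PER_YEAR)

u = breakeven_utilization(30, 280_000, 3, 42_000)
print(f"break-even at {u:.0%} average utilization")
```

Reserved-instance discounts lower the effective hourly rate and therefore push the break-even utilization higher, which is why steady, predictable workloads are the clearest case for on-premise hardware.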
Procurement Timelines and Strategy
GPU procurement is not like ordering commodity servers. Lead times and availability vary significantly:
- H100 SXM systems (DGX, HGX): Lead times of 8-16 weeks through tier-1 OEMs (Dell, HPE, Lenovo, Supermicro). Shorter for smaller configurations, longer for large clusters.
- A100 systems: Readily available on the secondary market and through OEMs with lead times of 2-4 weeks.
- L40S systems: Generally available with 2-6 week lead times through standard server procurement channels.
Procurement Recommendations
- Start small, scale deliberately: Begin with a 2-4 GPU configuration for validation. Expand to production capacity once you have benchmarked your actual workload.
- Negotiate service contracts: GPU hardware failures, while uncommon, require specialized replacement. Ensure next-business-day parts replacement is included.
- Plan for the next generation: NVIDIA's B200 and subsequent generations will offer significant performance improvements. Design your infrastructure so that GPU servers can be swapped without re-architecting the entire stack.
- Consider refurbished A100s: For workloads that do not require bleeding-edge performance, refurbished A100 80GB systems at 40-50% of new H100 pricing offer compelling economics.
GPU infrastructure planning is ultimately a capacity engineering exercise grounded in workload analysis. Measure your actual inference demand, benchmark candidate hardware against your specific models and latency requirements, and build with enough headroom for growth. Over-provisioning by 30% is cheaper than under-provisioning and being forced into emergency procurement at premium pricing.