## Performance Testing Overview
To help users select the appropriate MIG profile for their workloads, we conducted benchmark tests using LLM fine-tuning, a PyTorch matrix-multiplication benchmark, and GROMACS molecular dynamics simulations. Tests were run on the NVIDIA A100 GPUs (full 80 GB and MIG profiles) available on Wulver.
The results below show differences in runtime, throughput, memory usage, and service unit (SU) cost across profiles. Observations and notes are included to explain the results.
## List of Benchmark Tests
### GROMACS

The GROMACS benchmark was run on each MIG profile and on a full GPU. The results show that the 40 GB MIG comes close to the performance of the full 80 GB GPU while being charged at a lower SU rate. We also calculated a cost/performance metric, the ratio of SU consumption to simulation performance (effectively the SUs charged per nanosecond simulated; a worked example follows the table). By this metric, the 20 GB MIG is the best choice. However, if raw performance matters more than cost/performance, the 40 GB MIG is the recommended option.
GPU Profile | Nodes | Cores | Threads | Performance (ns/day) | Walltime (s) | SU Rate (SU/hr) | Cost/Performance (SU/ns) |
---|---|---|---|---|---|---|---|
A100_10gb MIG | 1 | 1 | 4 | 43.28 | 499.131 | 6 | 3.32717 |
A100_20gb MIG | 1 | 1 | 4 | 81.205 | 265.995 | 8 | 2.36439 |
A100_40gb MIG | 1 | 1 | 4 | 115.507 | 187.003 | 12 | 2.49336 |
A100_80gb full GPU | 1 | 1 | 4 | 118.59 | 182.142 | 20 | 4.04756 |
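For clarity, here is how the cost/performance numbers in the table can be reproduced, assuming the SU column is the hourly charge rate for the job. This is an interpretation on our part; the small helper below is illustrative and not part of the benchmark scripts.

```python
def cost_per_ns(su_per_hour: float, ns_per_day: float) -> float:
    """SUs charged per nanosecond of simulation (illustrative helper)."""
    ns_per_hour = ns_per_day / 24.0
    return su_per_hour / ns_per_hour

# A100_20gb MIG row: 8 SU/hr at 81.205 ns/day
print(round(cost_per_ns(8, 81.205), 5))  # 2.36439, matching the table
```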
### LLM Fine-Tuning (QLoRA)

The LLM benchmark script fine-tunes the Qwen 1.5B Instruct model on the Alpaca-cleaned dataset using QLoRA. Training uses 4-bit quantization to save memory and LoRA adapters so that only a small set of parameters is updated. The Hugging Face TRL SFTTrainer handles training, while the script also logs runtime, GPU/CPU memory, and tokens processed per second. The same setup runs on a full NVIDIA A100 80 GB GPU and on the different MIG slices (10 GB, 20 GB, 40 GB), making it useful for comparing speed and cost across profiles.
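The benchmark script itself is not reproduced in this documentation; the sketch below shows the kind of QLoRA setup described above. The model ID, dataset ID, LoRA settings, and batch sizes are illustrative assumptions, and the exact SFTTrainer/SFTConfig arguments depend on the installed trl version.

```python
import torch
from datasets import load_dataset
from peft import LoraConfig
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from trl import SFTConfig, SFTTrainer

# Minimal QLoRA sketch -- not the benchmark script itself.
# Illustrative IDs; the benchmark's exact model and dataset names may differ.
model_id = "Qwen/Qwen2.5-1.5B-Instruct"

def to_text(example):
    # Alpaca-cleaned rows have "instruction", "input", and "output" fields.
    prompt = example["instruction"]
    if example["input"]:
        prompt += "\n" + example["input"]
    return {"text": f"### Instruction:\n{prompt}\n\n### Response:\n{example['output']}"}

dataset = load_dataset("yahma/alpaca-cleaned", split="train").map(to_text)

# 4-bit (NF4) quantization keeps the frozen base model small enough for a 10 GB MIG slice.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    model_id, quantization_config=bnb_config, device_map="auto"
)

# LoRA adapters: only a small set of low-rank matrices is actually trained.
peft_config = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05, task_type="CAUSAL_LM")

trainer = SFTTrainer(
    model=model,
    args=SFTConfig(
        output_dir="qlora-benchmark",
        per_device_train_batch_size=2,
        gradient_accumulation_steps=8,
        num_train_epochs=1,
    ),
    train_dataset=dataset,
    peft_config=peft_config,
)
trainer.train()
```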
GPU Profile | Walltime (h) | SU Total | Tokens Processed | Tokens/s | Peak Allocated (GB) | Peak Reserved (GB) | SU per 1M Tokens |
---|---|---|---|---|---|---|---|
A100_10gb MIG | 1.092 | 3.28 | 166327 | 42.3 | 5.68 | 8.97 | 19.05 |
A100_20gb MIG | 0.556 | 2.78 | 166327 | 83 | 5.68 | 18.3 | 16.39 |
A100_40gb MIG | 0.353 | 3.18 | 166327 | 130.9 | 5.68 | 23.55 | 18.88 |
A100_80gb full GPU | 0.267 | 4.53 | 166327 | 173.2 | 5.68 | 23.55 | 27.04 |
- Peak Allocated ≈ 5.7 GB across all runs: the model plus LoRA fine-tune has a fixed memory demand, regardless of MIG size.
- Peak Reserved varies (8.9 → 23.5 GB): PyTorch's caching allocator grabs bigger chunks when more GPU memory is available, but this doesn't change training feasibility (see the snippet after this list for how the two metrics are read).
- Efficiency vs. speed: smaller MIGs (e.g., 10 GB, 20 GB) can be more cost-efficient per token, while larger MIGs or the full 80 GB GPU finish training faster.
- Choosing a profile: the right option depends on priorities; use smaller MIGs to save SUs on long jobs, or larger MIGs when walltime (speed) is more important.
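The allocated/reserved figures above correspond to PyTorch's built-in CUDA memory statistics. A minimal sketch of how such peaks can be read (the benchmark script's actual logging code may differ):

```python
import torch

# Illustrative only; reset the counters before the work you want to measure.
torch.cuda.reset_peak_memory_stats()

# ... run the fine-tuning loop here ...

peak_allocated = torch.cuda.max_memory_allocated() / 1e9  # GB actually held by tensors
peak_reserved = torch.cuda.max_memory_reserved() / 1e9    # GB grabbed by the caching allocator
print(f"peak allocated: {peak_allocated:.2f} GB, peak reserved: {peak_reserved:.2f} GB")
```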
Info
SU values are calculated as:
SU = (max(#CPUs, #RAM/4GB) + 16 × (GPU_mem/80)) × hours
Example (A100_20GB, 0.556 hr walltime, 1 CPU, 4 GB RAM, 20 GB GPU):
SU = (max(1, 4/4) + 16 × (20/80)) × 0.556 = (1 + 4) × 0.556 = 2.78
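As an illustration, a tiny helper that reproduces this formula (the function and parameter names are ours, not part of any Wulver tooling):

```python
def service_units(cpus: int, ram_gb: float, gpu_mem_gb: float, hours: float) -> float:
    """SU = (max(#CPUs, RAM/4GB) + 16 * (GPU_mem/80GB)) * hours."""
    return (max(cpus, ram_gb / 4) + 16 * (gpu_mem_gb / 80)) * hours

# A100_20gb example from above: 1 CPU, 4 GB RAM, 20 GB GPU slice, 0.556 hr walltime
print(round(service_units(1, 4, 20, 0.556), 2))  # 2.78
```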
### Matrix Multiplication

We ran a matrix multiplication benchmark on the different NVIDIA A100 MIG profiles and on the full GPU. The test multiplies large square matrices (from 4096×4096 up to 49152×49152) using PyTorch and CUDA.
Matrix multiplication is the core operation in deep learning; it is what neural networks spend most of their time doing. Measuring how many TFLOPs (trillion floating-point operations per second) each MIG slice achieves gives a good picture of its raw compute power.
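The benchmark code is not reproduced here; the following is a minimal sketch of how such a measurement can be made with PyTorch. Matrix sizes, repetition counts, and dtypes are illustrative.

```python
import time
import torch

def matmul_tflops(n: int, dtype: torch.dtype = torch.float16, iters: int = 10) -> float:
    """Time an n x n matrix multiply on the GPU and return the achieved TFLOPs.

    Illustrative sketch only; the actual benchmark script is not shown here.
    """
    a = torch.randn(n, n, device="cuda", dtype=dtype)
    b = torch.randn(n, n, device="cuda", dtype=dtype)

    # Warm-up so one-time CUDA initialization costs are not timed.
    torch.matmul(a, b)
    torch.cuda.synchronize()

    start = time.perf_counter()
    for _ in range(iters):
        torch.matmul(a, b)
    torch.cuda.synchronize()
    seconds = (time.perf_counter() - start) / iters

    flops = 2 * n ** 3             # one n x n matmul is ~2*n^3 floating-point ops
    return flops / seconds / 1e12  # TFLOPs

if __name__ == "__main__":
    # Route FP32 matmuls through TF32 tensor cores, as in the FP32 column below.
    torch.backends.cuda.matmul.allow_tf32 = True
    for n in (4096, 8192, 16384):
        print(f"n={n}: {matmul_tflops(n):.1f} TFLOPs (FP16), "
              f"{matmul_tflops(n, dtype=torch.float32):.1f} TFLOPs (TF32)")
```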
GPU Profile | SMs | Memory (GB) | Peak FP16 TFLOPs | Peak FP32 (TF32) TFLOPs | Peak Matrix Size (n) | Peak GPU Mem Used (GB) | SU Usage Factor |
---|---|---|---|---|---|---|---|
A100_10gb MIG | 14 | 9.5 | 38.694 | 18.985 | 12288 (FP16), 22528 (FP32) | 7.57 | 2 |
A100_20gb MIG | 28 | 19.5 | 79.304 | 37.887 | 20480 (FP16), 32256 (FP32) | 15.52 | 4 |
A100_40gb MIG | 42 | 39.5 | 118.924 | 55.576 | 49152 (FP16), 32768 (FP32) | 18.01 | 8 |
A100_80gb full GPU | 108 | 79.3 | 286.185 | 135.676 | 16384 (FP16), 16384 (FP32) | 18.01 | 16 |
For each profile, the benchmark records:

- Peak FP16 performance (the fast half-precision mode used in AI training).
- Peak FP32 performance (single precision run on TF32 tensor cores: higher accuracy but slower).
- The largest tested matrix size (n) at which peak performance was observed.
- Peak GPU memory usage, to see whether memory or compute was the bottleneck.
- The SU usage factor (the GPU term of the SU formula, 16 × GPU_mem/80), to tie performance back to billing.
The results show that performance scales almost linearly with MIG size (number of SMs), while memory was never the limiting factor. This means compute capacity is the main driver of speed, and users can choose between smaller slices (cheaper but slower) or larger slices (faster but billed at a higher SU rate) depending on their workload needs.