## Performance Testing Overview
To help users select the appropriate MIG profile for their workloads, we conducted benchmark tests using LLM fine-tuning, a PyTorch matrix-multiplication benchmark, and GROMACS molecular dynamics simulations. Tests were run on the NVIDIA A100 GPUs (full 80 GB and MIG profiles) available on Wulver.
The results below show differences in runtime, throughput, memory usage, and service unit (SU) cost across profiles. Observations and notes are included to explain the results.
## List of Benchmark Tests
### GROMACS

The GROMACS benchmark was run on each MIG profile and on a full GPU. The results show that the 40 GB MIG comes close to the performance of the full 80 GB GPU while being charged at a lower SU rate. We also calculated a cost/performance metric, the ratio of SU consumption to simulation performance (effectively the SUs charged per nanosecond simulated; a worked example follows the table). By this metric, the 20 GB MIG is the best choice. However, if raw performance matters more than cost/performance, the 40 GB MIG is the recommended option.
GPU Profile | Nodes | Cores | Threads | Performance (ns/day) | Walltime (s) | SU Rate (SU/hr) | Cost/Performance (SU/ns) |
---|---|---|---|---|---|---|---|
A100_10gb MIG | 1 | 1 | 4 | 43.28 | 499.131 | 6 | 3.32717 |
A100_20gb MIG | 1 | 1 | 4 | 81.205 | 265.995 | 8 | 2.36439 |
A100_40gb MIG | 1 | 1 | 4 | 115.507 | 187.003 | 12 | 2.49336 |
A100_80gb full GPU | 1 | 1 | 4 | 118.59 | 182.142 | 20 | 4.04756 |
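For clarity, here is how the cost/performance numbers in the table can be reproduced, assuming the SU column is the hourly charge rate for the job. This is an interpretation on our part; the small helper below is illustrative and not part of the benchmark scripts.

```python
def cost_per_ns(su_per_hour: float, ns_per_day: float) -> float:
    """SUs charged per nanosecond of simulation (illustrative helper)."""
    ns_per_hour = ns_per_day / 24.0
    return su_per_hour / ns_per_hour

# A100_20gb MIG row: 8 SU/hr at 81.205 ns/day
print(round(cost_per_ns(8, 81.205), 5))  # 2.36439, matching the table
```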
### LLM Fine-Tuning (QLoRA)

The LLM benchmark script fine-tunes the Qwen 1.5B Instruct model on the Alpaca-cleaned dataset using QLoRA. Training uses 4-bit quantization to save memory and LoRA adapters so that only a small set of parameters is updated. The Hugging Face TRL SFTTrainer handles training, while the script also logs runtime, GPU/CPU memory, and tokens processed per second. The same setup runs on a full NVIDIA A100 80 GB GPU and on the different MIG slices (10 GB, 20 GB, 40 GB), making it useful for comparing speed and cost across profiles.
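The benchmark script itself is not reproduced in this documentation; the sketch below shows the kind of QLoRA setup described above. The model ID, dataset ID, LoRA settings, and batch sizes are illustrative assumptions, and the exact SFTTrainer/SFTConfig arguments depend on the installed trl version.

```python
import torch
from datasets import load_dataset
from peft import LoraConfig
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from trl import SFTConfig, SFTTrainer

# Minimal QLoRA sketch -- not the benchmark script itself.
# Illustrative IDs; the benchmark's exact model and dataset names may differ.
model_id = "Qwen/Qwen2.5-1.5B-Instruct"

def to_text(example):
    # Alpaca-cleaned rows have "instruction", "input", and "output" fields.
    prompt = example["instruction"]
    if example["input"]:
        prompt += "\n" + example["input"]
    return {"text": f"### Instruction:\n{prompt}\n\n### Response:\n{example['output']}"}

dataset = load_dataset("yahma/alpaca-cleaned", split="train").map(to_text)

# 4-bit (NF4) quantization keeps the frozen base model small enough for a 10 GB MIG slice.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    model_id, quantization_config=bnb_config, device_map="auto"
)

# LoRA adapters: only a small set of low-rank matrices is actually trained.
peft_config = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05, task_type="CAUSAL_LM")

trainer = SFTTrainer(
    model=model,
    args=SFTConfig(
        output_dir="qlora-benchmark",
        per_device_train_batch_size=2,
        gradient_accumulation_steps=8,
        num_train_epochs=1,
    ),
    train_dataset=dataset,
    peft_config=peft_config,
)
trainer.train()
```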
GPU Profile | Walltime (h) | SU Total | Tokens Processed | Tokens/s | Peak Allocated (GB) | Peak Reserved (GB) | SU per 1M Tokens |
---|---|---|---|---|---|---|---|
A100_10gb MIG | 1.092 | 3.28 | 166327 | 42.3 | 5.68 | 8.97 | 19.05 |
A100_20gb MIG | 0.556 | 2.78 | 166327 | 83 | 5.68 | 18.3 | 16.39 |
A100_40gb MIG | 0.353 | 3.18 | 166327 | 130.9 | 5.68 | 23.55 | 18.88 |
A100_80gb full GPU | 0.267 | 4.53 | 166327 | 173.2 | 5.68 | 23.55 | 27.04 |
- Peak Allocated ≈ 5.7 GB across all runs: the model plus LoRA fine-tune has a fixed memory demand, regardless of MIG size.
- Peak Reserved varies (8.9 → 23.5 GB): PyTorch's caching allocator grabs bigger chunks when more GPU memory is available, but this doesn't change training feasibility (see the snippet after this list for how the two metrics are read).
- Efficiency vs. speed: smaller MIGs (e.g., 10 GB, 20 GB) can be more cost-efficient per token, while larger MIGs or the full 80 GB GPU finish training faster.
- Choosing a profile: the right option depends on priorities; use smaller MIGs to save SUs on long jobs, or larger MIGs when walltime (speed) is more important.
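The allocated/reserved figures above correspond to PyTorch's built-in CUDA memory statistics. A minimal sketch of how such peaks can be read (the benchmark script's actual logging code may differ):

```python
import torch

# Illustrative only; reset the counters before the work you want to measure.
torch.cuda.reset_peak_memory_stats()

# ... run the fine-tuning loop here ...

peak_allocated = torch.cuda.max_memory_allocated() / 1e9  # GB actually held by tensors
peak_reserved = torch.cuda.max_memory_reserved() / 1e9    # GB grabbed by the caching allocator
print(f"peak allocated: {peak_allocated:.2f} GB, peak reserved: {peak_reserved:.2f} GB")
```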
Info
SU values are calculated as:
SU = (max(#CPUs, #RAM/4GB) + 16 × (GPU_mem/80)) × hours
Example (A100_20GB, 0.556 hr walltime, 1 CPU, 4 GB RAM, 20 GB GPU):
SU = (max(1, 4/4) + 16 × (20/80)) × 0.556 = (1 + 4) × 0.556 = 2.78
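As an illustration, a tiny helper that reproduces this formula (the function and parameter names are ours, not part of any Wulver tooling):

```python
def service_units(cpus: int, ram_gb: float, gpu_mem_gb: float, hours: float) -> float:
    """SU = (max(#CPUs, RAM/4GB) + 16 * (GPU_mem/80GB)) * hours."""
    return (max(cpus, ram_gb / 4) + 16 * (gpu_mem_gb / 80)) * hours

# A100_20gb example from above: 1 CPU, 4 GB RAM, 20 GB GPU slice, 0.556 hr walltime
print(round(service_units(1, 4, 20, 0.556), 2))  # 2.78
```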
### Matrix Multiplication

We ran a matrix multiplication benchmark on the different NVIDIA A100 MIG profiles and on the full GPU. The test multiplies large square matrices (from 4096×4096 up to 49152×49152) using PyTorch and CUDA.
Matrix multiplication is the core operation in deep learning; it is what neural networks spend most of their time doing. Measuring how many TFLOPs (trillion floating-point operations per second) each MIG slice achieves gives a good picture of its raw compute power.
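The benchmark code is not reproduced here; the following is a minimal sketch of how such a measurement can be made with PyTorch. Matrix sizes, repetition counts, and dtypes are illustrative.

```python
import time
import torch

def matmul_tflops(n: int, dtype: torch.dtype = torch.float16, iters: int = 10) -> float:
    """Time an n x n matrix multiply on the GPU and return the achieved TFLOPs.

    Illustrative sketch only; the actual benchmark script is not shown here.
    """
    a = torch.randn(n, n, device="cuda", dtype=dtype)
    b = torch.randn(n, n, device="cuda", dtype=dtype)

    # Warm-up so one-time CUDA initialization costs are not timed.
    torch.matmul(a, b)
    torch.cuda.synchronize()

    start = time.perf_counter()
    for _ in range(iters):
        torch.matmul(a, b)
    torch.cuda.synchronize()
    seconds = (time.perf_counter() - start) / iters

    flops = 2 * n ** 3             # one n x n matmul is ~2*n^3 floating-point ops
    return flops / seconds / 1e12  # TFLOPs

if __name__ == "__main__":
    # Route FP32 matmuls through TF32 tensor cores, as in the FP32 column below.
    torch.backends.cuda.matmul.allow_tf32 = True
    for n in (4096, 8192, 16384):
        print(f"n={n}: {matmul_tflops(n):.1f} TFLOPs (FP16), "
              f"{matmul_tflops(n, dtype=torch.float32):.1f} TFLOPs (TF32)")
```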
GPU Profile | SMs | Memory (GB) | Peak FP16 TFLOPs | Peak FP32 (TF32) TFLOPs | Peak Matrix Size (n) | Peak GPU Mem Used (GB) | SU Usage Factor |
---|---|---|---|---|---|---|---|
A100_10gb MIG | 14 | 9.5 | 38.694 | 18.985 | 12288 (FP16), 22528 (FP32) | 7.57 | 2 |
A100_20gb MIG | 28 | 19.5 | 79.304 | 37.887 | 20480 (FP16), 32256 (FP32) | 15.52 | 4 |
A100_40gb MIG | 42 | 39.5 | 118.924 | 55.576 | 49152 (FP16), 32768 (FP32) | 18.01 | 8 |
A100_80gb full GPU | 108 | 79.3 | 286.185 | 135.676 | 16384 (FP16), 16384 (FP32) | 18.01 | 16 |
For each profile, the benchmark records:

- Peak FP16 performance (the fast half-precision mode used in AI training).
- Peak FP32 performance (single precision run on TF32 tensor cores: higher accuracy but slower).
- The largest tested matrix size (n) at which peak performance was observed.
- Peak GPU memory usage, to see whether memory or compute was the bottleneck.
- The SU usage factor (the GPU term of the SU formula, 16 × GPU_mem/80), to tie performance back to billing.
The results show that performance scales almost linearly with MIG size (number of SMs), while memory was never the limiting factor. This means compute capacity is the main driver of speed, and users can choose between smaller slices (cheaper but slower) or larger slices (faster but billed at a higher SU rate) depending on their workload needs.