Performance Testing Overview

To help users select the appropriate MIG profile for their workloads, we conducted benchmark tests covering LLM fine-tuning, a PyTorch matrix-multiplication benchmark, and GROMACS molecular dynamics simulations. Tests were run on the NVIDIA A100 GPUs available on Wulver, both as a full 80 GB GPU and as MIG profiles.

The results below show differences in runtime, accuracy, memory usage, and service unit (SU) cost across profiles. Observations and notes are included to explain results.

List of Benchmark Tests

GROMACS Molecular Dynamics

The GROMACS benchmark was run on each MIG profile and on a full GPU. The results show that the 40 GB MIG comes close to the performance of the full 80 GB GPU. We also calculated a cost/performance metric, the ratio of the hourly SU rate to performance (i.e., SUs spent per nanosecond simulated); a worked example follows the table. By this metric the 20 GB MIG is the best choice, but users who prioritize performance over cost/performance should choose the 40 GB MIG.

| GPU Profile | Nodes | Cores | Threads | Performance (ns/day) | Walltime (s) | SU Rate (SU/hr) | Cost/Performance (SU/ns) |
|---|---|---|---|---|---|---|---|
| A100_10gb MIG | 1 | 1 | 4 | 43.28 | 499.131 | 6 | 3.32717 |
| A100_20gb MIG | 1 | 1 | 4 | 81.205 | 265.995 | 8 | 2.36439 |
| A100_40gb MIG | 1 | 1 | 4 | 115.507 | 187.003 | 12 | 2.49336 |
| A100_80gb full GPU | 1 | 1 | 4 | 118.59 | 182.142 | 20 | 4.04756 |
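
The cost/performance values can be reproduced from the table, assuming the SU rate is the hourly charge and performance is converted from ns/day to ns/hour:

Cost/Performance = SU rate ÷ (performance ÷ 24)

Example (A100_20gb MIG): 8 ÷ (81.205 ÷ 24) = 8 ÷ 3.384 ≈ 2.364 SU per ns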

LLM Fine-Tuning (QLoRA)

The benchmark script fine-tunes the Qwen 1.5B Instruct model on the Alpaca-cleaned dataset using QLoRA. Training uses 4-bit quantization to save memory and LoRA adapters so that only a small set of parameters is updated. The Hugging Face TRL SFTTrainer handles training, while the script also logs runtime, GPU/CPU memory, and tokens processed per second. The setup runs unchanged on both the full NVIDIA A100 80 GB GPU and the MIG slices (10 GB, 20 GB, 40 GB), making it useful for comparing speed and cost across profiles; a sketch of the setup follows.
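
The following is a minimal sketch of such a QLoRA setup, not the exact benchmark script: the Hub model and dataset IDs, LoRA hyperparameters, and batch settings are assumptions, and the calls follow recent versions of transformers, peft, and trl.

```python
# Minimal sketch of a QLoRA fine-tuning setup like the one described above.
# Hub IDs, LoRA hyperparameters, and batch settings are illustrative assumptions.
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig
from trl import SFTConfig, SFTTrainer

model_id = "Qwen/Qwen2.5-1.5B-Instruct"  # assumed checkpoint for "Qwen 1.5B Instruct"

# 4-bit (NF4) quantization keeps the frozen base weights small in GPU memory.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    model_id, quantization_config=bnb_config, device_map="auto"
)

# LoRA adapters: only small low-rank matrices are trained on top of the frozen model.
peft_config = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05, task_type="CAUSAL_LM")

# Flatten Alpaca-style records into a single "text" field for supervised fine-tuning.
def to_text(example):
    return {
        "text": f"### Instruction:\n{example['instruction']}\n\n"
                f"### Input:\n{example['input']}\n\n"
                f"### Response:\n{example['output']}"
    }

dataset = load_dataset("yahma/alpaca-cleaned", split="train").map(to_text)

trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    peft_config=peft_config,
    args=SFTConfig(
        output_dir="qwen-qlora",
        dataset_text_field="text",
        per_device_train_batch_size=4,
        gradient_accumulation_steps=4,
        num_train_epochs=1,
        logging_steps=50,
    ),
)
trainer.train()
```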

| GPU Profile | Walltime (h) | SU | Total Tokens Processed | Tokens/s | Peak Allocated (GB) | Peak Reserved (GB) | SU per 1M Tokens |
|---|---|---|---|---|---|---|---|
| A100_10gb MIG | 1.092 | 3.28 | 166327 | 42.3 | 5.68 | 8.97 | 19.05 |
| A100_20gb MIG | 0.556 | 2.78 | 166327 | 83 | 5.68 | 18.3 | 16.39 |
| A100_40gb MIG | 0.353 | 3.18 | 166327 | 130.9 | 5.68 | 23.55 | 18.88 |
| A100_80gb full GPU | 0.267 | 4.53 | 166327 | 173.2 | 5.68 | 23.55 | 27.04 |
  • Peak Allocated ≈ 5.7 GB across all runs: The model + LoRA fine-tune has a fixed memory demand, regardless of MIG size.

  • Peak Reserved varies (8.9 → 23.5 GB): PyTorch’s caching allocator grabs bigger chunks when more GPU memory is available, but this doesn’t change training feasibility.

  • Efficiency vs. Speed: Smaller MIGs (e.g., 10 GB, 20 GB) can be more cost-efficient per token, while larger MIGs or the full 80 GB GPU finish training faster.

  • Choosing a profile: The right option depends on priorities — use smaller MIGs to save SUs on long jobs, or larger MIGs when wall-time (speed) is more important.

Info

SU values are calculated as:
SU = (max(#CPUs, RAM (GB)/4) + 16 × (GPU memory (GB)/80)) × walltime (hours)

Example (A100_20GB, 0.556 hr walltime, 1 CPU, 4 GB RAM, 20 GB GPU):

SU = (max(1, 4/4) + 16 × (20/80)) × 0.556
   = (1 + 4) × 0.556 = 2.78
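
For quick estimates, the same formula can be scripted; a minimal sketch is below (the helper name su_cost is ours, not part of any Wulver tooling):

```python
# Sketch of the SU formula above; function name and signature are ours.
def su_cost(cpus: int, ram_gb: float, gpu_mem_gb: float, hours: float) -> float:
    """Service units charged for a GPU job: CPU/RAM term plus a GPU-memory term."""
    return (max(cpus, ram_gb / 4) + 16 * (gpu_mem_gb / 80)) * hours

# Worked example above: A100_20gb MIG, 1 CPU, 4 GB RAM, 0.556 h walltime.
print(f"{su_cost(cpus=1, ram_gb=4, gpu_mem_gb=20, hours=0.556):.2f}")  # 2.78
```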

Matrix Multiplication

We ran a matrix multiplication benchmark on the different NVIDIA A100 MIG profiles and the full GPU. The test multiplies large square matrices (from 4096×4096 up to 49152×49152) using PyTorch and CUDA.

Matrix multiplication is the core operation in deep learning — it’s what neural networks spend most of their time doing. Measuring how many TFLOPs (trillion floating point operations per second) each MIG slice achieves gives a good picture of its raw compute power.
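
The sketch below shows the general shape of such a probe (ours, not the exact benchmark script): it times repeated n×n products on the GPU and converts the elapsed time into achieved TFLOPs, with TF32 enabled for the FP32 runs.

```python
# Sketch of an n x n matmul throughput probe (not the exact benchmark script).
import time
import torch

def matmul_tflops(n: int, dtype: torch.dtype = torch.float16, iters: int = 10) -> float:
    """Time `iters` n x n matrix products on the GPU and return achieved TFLOPs."""
    a = torch.randn(n, n, device="cuda", dtype=dtype)
    b = torch.randn(n, n, device="cuda", dtype=dtype)
    _ = a @ b                                 # warm-up (cuBLAS init, kernel caching)
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(iters):
        a @ b
    torch.cuda.synchronize()                  # wait for all queued kernels to finish
    elapsed = time.perf_counter() - start
    flops = 2 * n**3 * iters                  # ~2*n^3 floating point ops per matmul
    return flops / elapsed / 1e12             # convert to TFLOPs

torch.backends.cuda.matmul.allow_tf32 = True  # let FP32 matmuls use TF32 tensor cores
for n in (4096, 8192, 16384):
    print(f"n={n}: {matmul_tflops(n, torch.float16):.1f} FP16 TFLOPs, "
          f"{matmul_tflops(n, torch.float32):.1f} TF32 TFLOPs")
```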

| GPU Profile | SMs | Memory (GB) | Peak FP16 TFLOPs | Peak FP32 (TF32) TFLOPs | Peak Matrix Size (n) | Peak GPU Mem Used (GB) | SU Usage Factor |
|---|---|---|---|---|---|---|---|
| A100_10gb MIG | 14 | 9.5 | 38.694 | 18.985 | 12288 (FP16), 22528 (FP32) | 7.57 | 2 |
| A100_20gb MIG | 28 | 19.5 | 79.304 | 37.887 | 20480 (FP16), 32256 (FP32) | 15.52 | 4 |
| A100_40gb MIG | 42 | 39.5 | 118.924 | 55.576 | 49152 (FP16), 32768 (FP32) | 18.01 | 8 |
| A100_80gb full GPU | 108 | 79.3 | 286.185 | 135.676 | 16384 (FP16), 16384 (FP32) | 18.01 | 16 |
For each profile, the table above reports:

  • Peak FP16 performance (the fast half-precision mode used in AI training).
  • Peak FP32 performance (single precision with TF32 tensor cores; higher accuracy but slower).
  • The largest tested matrix size (n) at which peak performance was observed.
  • Peak GPU memory usage, to show whether memory or compute was the bottleneck.
  • The SU usage factor, to tie performance back to billing.

The results show that performance scales almost linearly with MIG size (number of SMs), while memory was never the limiting factor. This means compute capacity is the main driver of speed, and users can choose between smaller slices (cheaper, slower) and larger slices (faster, higher SU rate) depending on their workload needs.