Fixing Mixed Precision Underutilization for Speed Gains

Issue

Despite using float16 or bfloat16, training was not significantly faster.
Profiling a training step revealed (see the sketch after this list):

  • Tensor Cores underutilized
  • Inconsistent performance across training steps
  • Suboptimal memory bandwidth usage
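
A minimal sketch of how such a step can be profiled with torch.profiler; the model, optimizer, and loss_fn names are illustrative, not the project's actual code:

```python
import torch
from torch.profiler import profile, ProfilerActivity

def profile_training_step(model, optimizer, loss_fn, inputs, targets):
    """Profile one optimizer step and print the top CUDA kernels."""
    with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
                 record_shapes=True) as prof:
        outputs = model(inputs)
        loss = loss_fn(outputs, targets)
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
    # Kernel names hint at precision: fp32 fallbacks typically show up as
    # sgemm-style kernels, while Tensor Core GEMMs often carry tags like
    # 884 or 1688 in their names.
    print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=20))
```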

Solution

We fixed the AMP underutilization by enforcing correct mixed-precision behavior:

1. Enable Autocast + GradScaler

  • Wrapped the forward pass and loss computation in torch.cuda.amp.autocast(); the backward pass runs outside the autocast context (see the sketch after this list)
  • Used GradScaler to handle loss scaling and overflow detection
  • Verified that mixed precision was actually applied at every layer
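
A minimal sketch of that pattern, assuming an existing model, optimizer, loss_fn, and train_loader (illustrative names, not the project's actual loop):

```python
import torch

scaler = torch.cuda.amp.GradScaler()

for inputs, targets in train_loader:
    inputs = inputs.cuda(non_blocking=True)
    targets = targets.cuda(non_blocking=True)
    optimizer.zero_grad(set_to_none=True)

    # Only the forward pass and loss computation run under autocast.
    with torch.cuda.amp.autocast(dtype=torch.float16):
        outputs = model(inputs)
        loss = loss_fn(outputs, targets)

    # Backward runs outside autocast; GradScaler scales the loss and
    # skips the optimizer step if it detects overflowed gradients.
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()
```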

2. Model & Layer Compliance

  • Ensured all model components supported float16 (e.g., no hard-coded float32 casts)
  • Manually overrode layers that silently defaulted to float32 (see the audit sketch after this list)
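
One way to audit this is to register forward hooks and log which modules still produce float32 outputs under autocast; the sketch below assumes an existing model and a sample inputs tensor (illustrative names):

```python
import torch

def report_fp32_outputs(model, inputs):
    """Return module types whose outputs remain float32 under autocast."""
    offenders = []

    def hook(module, args, output):
        if isinstance(output, torch.Tensor) and output.dtype == torch.float32:
            offenders.append(type(module).__name__)

    handles = [m.register_forward_hook(hook) for m in model.modules()]
    with torch.cuda.amp.autocast(dtype=torch.float16):
        model(inputs)
    for h in handles:
        h.remove()
    # Note: autocast keeps some ops (e.g., normalization, softmax) in float32
    # on purpose, so treat this list as a starting point, not a bug list.
    return offenders
```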

🚀 Outcome

  • Training speed increased by 1.3× to 2× on Ampere/Hopper GPUs
  • Lower VRAM usage with bfloat16 on A100s
  • Stable training with no accuracy drop

💡 Takeaway

Mixed precision only helps if configured correctly.
Always validate that your model is truly using AMP and that Tensor Cores are active.
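
A quick sanity check along those lines (illustrative, assumes a CUDA device):

```python
import torch

model = torch.nn.Linear(1024, 1024).cuda()
x = torch.randn(32, 1024, device="cuda")

with torch.cuda.amp.autocast(dtype=torch.float16):
    y = model(x)

# Under autocast the matmul output is half precision, while the parameters
# stay as float32 master weights.
assert y.dtype == torch.float16
assert model.weight.dtype == torch.float32
```

For Tensor Core activity, also confirm in a profiler trace that the GEMM kernels are the half-precision Tensor Core variants, and keep matrix dimensions (hidden sizes, vocab size, batch size) at multiples of 8 for float16.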