12/07/2025
📖 Deep Quantization 📘
📢 Part 3: Introducing PyTorch Quantization: Slimmer, Speedier Models for CPU & Mobile!
🔧 Two Modes to Fit Your Workflow
Eager Mode: Manually fuse layers (Conv+BN+ReLU) and insert quant stubs—straightforward for supported torch.nn modules.
FX Graph Mode: Automatic graph rewriting (fusion plus quant/dequant insertion) for wider model support, including functional ops. Make your model symbolically traceable once and let FX do the rest.
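To make FX Graph Mode concrete, here is a minimal sketch of post-training static quantization with it. Assumptions: a PyTorch build that still ships torch.ao.quantization, a toy model (TinyNet is a made-up name for illustration), and random tensors standing in for real calibration data.

```python
import torch
import torch.nn as nn
from torch.ao.quantization import get_default_qconfig_mapping
from torch.ao.quantization.quantize_fx import prepare_fx, convert_fx


class TinyNet(nn.Module):
    """Hypothetical toy model: no stubs and no manual fusion needed."""

    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(3, 8, kernel_size=3)
        self.bn = nn.BatchNorm2d(8)
        self.relu = nn.ReLU()

    def forward(self, x):
        return self.relu(self.bn(self.conv(x)))


# Pick whichever quantized engine this build supports (assumption: one of the two).
engines = torch.backends.quantized.supported_engines
backend = "fbgemm" if "fbgemm" in engines else "qnnpack"
torch.backends.quantized.engine = backend

model = TinyNet().eval()
example_inputs = (torch.randn(1, 3, 16, 16),)

# FX traces the model, fuses Conv+BN+ReLU, and inserts observers automatically.
prepared = prepare_fx(model, get_default_qconfig_mapping(backend), example_inputs)

# Calibration pass: random data here, a representative dataset in practice.
with torch.no_grad():
    for _ in range(4):
        prepared(torch.randn(1, 3, 16, 16))

quantized = convert_fx(prepared)
with torch.no_grad():
    out = quantized(example_inputs[0])
print(out.shape)
```

The float model itself carries no quantization-specific code here, which is the main practical difference from eager mode.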
🎛️ Three Quantization Algorithms
1. Dynamic Quantization
▪️ Weights are quantized ahead of time, when the model is converted; activations are quantized on the fly at inference.
▪️ Ideal for transformers, LSTMs—drop-in speed boost with minimal fuss.
2. Static (Post-Training) Quantization
▪️ Weights are quantized ahead of time, and activation ranges are calibrated on sample data before inference.
▪️ Leverages FBGEMM on x86 or QNNPACK on ARM; works best when the chosen backend matches the deployment hardware.
3. Quantization-Aware Training (QAT)
▪️ Inserts fake-quantization ops so training sees int8 rounding effects, letting the weights adapt and recover precision.
▪️ Typically yields the highest post-quant accuracy of the three, e.g., for vision and speech nets.
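As an illustration of the simplest of the three, here is a hedged sketch of dynamic quantization on a small stack of Linear layers. The model and its sizes are made up for the example, and it assumes a PyTorch build that provides torch.ao.quantization.quantize_dynamic.

```python
import torch
import torch.nn as nn
from torch.ao.quantization import quantize_dynamic

# Pick whichever quantized engine this build supports (assumption: one of the two).
engines = torch.backends.quantized.supported_engines
torch.backends.quantized.engine = "fbgemm" if "fbgemm" in engines else "qnnpack"

# Hypothetical float32 model; dynamic quantization targets its Linear layers.
float_model = nn.Sequential(
    nn.Linear(32, 64),
    nn.ReLU(),
    nn.Linear(64, 8),
).eval()

# Returns a converted copy: Linear weights become int8 now,
# activations are quantized per batch at inference time.
quantized_model = quantize_dynamic(float_model, {nn.Linear}, dtype=torch.qint8)

x = torch.randn(4, 32)
with torch.no_grad():
    float_out = float_model(x)
    quant_out = quantized_model(x)

# How far the int8 model drifts from the float32 baseline on this batch.
max_abs_diff = (float_out - quant_out).abs().max().item()
print(type(quantized_model[0]).__module__, max_abs_diff)
```

No calibration data and no retraining are involved, which is why this mode is the usual first thing to try.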
🚀 3-Step Workflow
1️⃣ Prepare Your Model
▪️ Fuse adjacent layers for a single-pass compute (e.g., Conv+BN+ReLU).
▪️ In eager mode, add QuantStub/DeQuantStub to mark where tensors switch between float and quantized; place them around specific submodules to quantize only those.
2️⃣ Configure & Quantize
▪️ Pick your algorithm (Dynamic, Static, or QAT).
▪️ Supply a small representative dataset for range calibration (Static); QAT observes ranges during training instead.
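The prepare and quantize steps above can be sketched in eager mode roughly as follows. This is an illustration under assumptions, not a reference recipe: SmallConvNet is a hypothetical model, random tensors stand in for the calibration set, and it relies on the torch.ao.quantization eager-mode API being available in your PyTorch version.

```python
import torch
import torch.nn as nn
from torch.ao.quantization import (
    QuantStub, DeQuantStub, fuse_modules, get_default_qconfig, prepare, convert,
)


class SmallConvNet(nn.Module):
    """Hypothetical model with explicit float <-> quantized boundaries."""

    def __init__(self):
        super().__init__()
        self.quant = QuantStub()      # float -> quantized
        self.conv = nn.Conv2d(3, 8, kernel_size=3)
        self.bn = nn.BatchNorm2d(8)
        self.relu = nn.ReLU()
        self.dequant = DeQuantStub()  # quantized -> float

    def forward(self, x):
        x = self.quant(x)
        x = self.relu(self.bn(self.conv(x)))
        return self.dequant(x)


# Pick whichever quantized engine this build supports (assumption: one of the two).
engines = torch.backends.quantized.supported_engines
backend = "fbgemm" if "fbgemm" in engines else "qnnpack"
torch.backends.quantized.engine = backend

# Step 1: prepare. Fuse Conv+BN+ReLU into one module (model must be in eval mode).
model = SmallConvNet().eval()
fused = fuse_modules(model, [["conv", "bn", "relu"]])

# Step 2: configure and quantize (static post-training quantization).
fused.qconfig = get_default_qconfig(backend)
prepared = prepare(fused)                    # inserts observers
with torch.no_grad():
    for _ in range(8):                       # calibration with stand-in data
        prepared(torch.randn(2, 3, 16, 16))
quantized = convert(prepared)                # swaps in int8 modules

with torch.no_grad():
    out = quantized(torch.randn(2, 3, 16, 16))
print(type(quantized.conv).__name__, out.shape)
```

After fusion, the bn and relu attributes are replaced with nn.Identity and the fused module lives under conv, so the forward pass does the three ops in one quantized kernel.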
3️⃣ Validate on CPU
▪️ Run inference through the PyTorch CPU backend (mobile too!).
▪️ Compare accuracy against your float32 baseline—expect a tiny drop (often