LLaMA Quantized Inference Results

This section presents empirical results on the application of TorchMX to the LLaMA 3.1 series of models, specifically the 8B and 70B variants. Our objective is to evaluate the efficacy of quantization using the Microscaling Floating Point (MXFP) format, which allows low-bit inference across all major tensor operations. We demonstrate that TorchMX enables near-lossless inference—achieving sub-2% accuracy degradation—without requiring post-training calibration.


Quantization Setup

We apply MXFP quantization with a block size of 32 to the following components:

  • All weights and activations in projection and MLP layers
  • Query, Key, and Value (QKV) vectors
  • Attention weight matrices (used in matmul with Value)

Matrix multiplications and softmax are computed in bfloat16 after dequantizing the MXFP operands.
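
The section describes which tensors are quantized but not how a block is quantized in practice. Below is a minimal, self-contained sketch of MX-style block quantization (simulated quantize-dequantize) under the stated block size of 32; it is not TorchMX's actual API. FP8 (E4M3) is used for the element format because recent PyTorch releases ship a native float8_e4m3fn dtype; FP6 would require a custom value grid.

```python
# Conceptual "fake quant" sketch of MX-style block quantization, NOT TorchMX's
# actual API. Each block of 32 elements shares a power-of-two scale (in the
# spirit of the OCP MX E8M0 scale), and elements are cast to a low-bit float.
import torch

FP8_E4M3_MAX = 448.0  # largest finite value representable in float8_e4m3fn

def mxfp8_quantize_dequantize(x: torch.Tensor, block_size: int = 32) -> torch.Tensor:
    """Quantize-dequantize `x` along its last dimension in MXFP8-like blocks.

    Assumes x.numel() is divisible by block_size (true for these models'
    hidden dimensions).
    """
    orig_shape, orig_dtype = x.shape, x.dtype
    blocks = x.reshape(-1, block_size).float()

    # One shared power-of-two scale per block, chosen so the scaled block fits
    # in the FP8 range (a simplification of the OCP MX scale rule).
    block_max = blocks.abs().amax(dim=-1, keepdim=True).clamp(min=2.0**-126)
    scale = torch.exp2(torch.ceil(torch.log2(block_max / FP8_E4M3_MAX)))

    # Cast the scaled block to FP8 and back: the error introduced here is
    # exactly the element-wise quantization error.
    q = (blocks / scale).to(torch.float8_e4m3fn).float()
    return (q * scale).reshape(orig_shape).to(orig_dtype)

# Example: quantize a weight-shaped tensor and measure the relative error.
w = torch.randn(4096, 4096, dtype=torch.bfloat16)
w_q = mxfp8_quantize_dequantize(w)
print((w - w_q).float().norm() / w.float().norm())
```

In an actual deployment the quantized tensors stay in their packed low-bit form and are dequantized inside the kernels, with the matmul and softmax themselves running in bfloat16 as noted above.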


Evaluation Setup

  • Models Evaluated: LLaMA 3.1-8B, LLaMA 3.1-70B
  • Datasets: PIQA, ARC Easy, ARC Challenge, HellaSwag, Winogrande
  • Baseline Precision: bfloat16
  • Inference Hardware: NVIDIA A100 80GB
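
The section does not name an evaluation harness. The sketch below assumes EleutherAI's lm-evaluation-harness (lm_eval >= 0.4), whose task names happen to match the datasets listed above; treat the HF model id, metric key, and batch size as illustrative assumptions rather than the authors' exact setup.

```python
# Hedged reproduction sketch of the bf16 baseline, assuming EleutherAI's
# lm-evaluation-harness; the section does not specify the evaluation tool.
import lm_eval

TASKS = ["piqa", "arc_easy", "arc_challenge", "hellaswag", "winogrande"]

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=meta-llama/Llama-3.1-8B,dtype=bfloat16",  # assumed HF id
    tasks=TASKS,
    batch_size=8,
)

# Macro-average accuracy over the five tasks (assumed to be how "Avg. Acc."
# in the table below is computed); "acc,none" follows lm_eval 0.4 metric keys.
accs = [results["results"][t]["acc,none"] for t in TASKS]
print(f"Average accuracy: {100 * sum(accs) / len(accs):.2f}%")
```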

Accuracy Comparison

| Model | Proj W | Proj A | MLP W | MLP A | Query | Key | Value | Attn W | Avg. Acc. (%) | Acc. Δ (%) |
|---|---|---|---|---|---|---|---|---|---|---|
| LLaMA 3.1-8B (bf16) | - | - | - | - | - | - | - | - | 73.60 | - |
| | FP6 | FP8 | FP6 | FP8 | - | - | - | - | 73.26 | -0.34 |
| | FP6 | FP6 | FP6 | FP6 | - | - | - | - | 73.12 | -0.48 |
| | FP6 | FP8 | FP6 | FP8 | FP6 | FP6 | FP6 | FP6 | 71.79 | -1.81 |
| | FP6 | FP6 | FP6 | FP6 | FP6 | FP6 | FP6 | FP6 | 71.76 | -1.84 |
| LLaMA 3.1-70B (bf16) | - | - | - | - | - | - | - | - | 79.93 | - |
| | FP6 | FP8 | FP6 | FP6 | - | - | - | - | 79.35 | -0.58 |
| | FP6 | FP6 | FP6 | FP6 | - | - | - | - | 78.94 | -1.00 |
| | FP6 | FP8 | FP6 | FP6 | FP6 | FP6 | FP6 | FP6 | 78.63 | -1.30 |
| | FP6 | FP6 | FP6 | FP6 | FP6 | FP6 | FP6 | FP6 | 78.63 | -1.47 |

Analysis and Insights

  • Projection activations appear more sensitive to quantization than MLP activations, especially under FP6.
  • The full-stack FP6 configuration offers a strong accuracy/compression tradeoff, with at most ~1.8% average-accuracy degradation (8B: -1.84%, 70B: -1.47%) for substantial memory savings.
  • Using FP8 for activations (especially projection activations) recovers up to roughly 0.4% average accuracy (70B: -0.58% vs. -1.00%; 8B: -0.34% vs. -0.48%).
  • Value vectors can be further compressed (e.g., MXFP4) with negligible loss (results not shown).

Reproducibility

To replicate these benchmarks, see the example walkthrough:

examples/quantize_llama.md