LLaMA Quantized Inference Results
This section presents empirical results on the application of TorchMX
to the LLaMA 3.1 series of models, specifically the 8B and 70B variants. Our objective is to evaluate the efficacy of quantization using the Microscaling Floating Point (MXFP) format, which allows low-bit inference across all major tensor operations. We demonstrate that TorchMX enables near-lossless inference—achieving sub-2% accuracy degradation—without requiring post-training calibration.
Quantization Setup
We apply MXFP quantization with a block size of 32 to the following components:
- All weights and activations in projection and MLP layers
- Query, Key, and Value (QKV) vectors
- Attention weight matrices (used in matmul with Value)
Matrix multiplications and softmax are computed in dequantized bfloat16.
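For concreteness, the following is a minimal PyTorch sketch of the block quantization described above: each contiguous block of 32 values shares a single power-of-two scale, and elements are cast to a low-bit float before being dequantized for the bfloat16 matmuls. This is not the TorchMX API; the function name is hypothetical, and PyTorch's `float8_e4m3fn` type (available in recent PyTorch releases) stands in for the MXFP6/MXFP8 element types, which plain PyTorch cannot represent natively.

```python
import torch

BLOCK = 32  # MX block size used throughout this section

def mx_quant_dequant(x: torch.Tensor, elem_dtype=torch.float8_e4m3fn) -> torch.Tensor:
    """Fake-quantize `x` MX-style: each contiguous block of 32 values along the
    last dimension shares one power-of-two scale; elements are stored in a
    low-bit float. `elem_dtype` is a float8 stand-in for the MXFP6/MXFP8
    element types (an assumption for illustration). Assumes the tensor's
    element count is a multiple of BLOCK."""
    orig_shape = x.shape
    xb = x.reshape(-1, BLOCK).to(torch.float32)

    # Shared per-block scale: the power of two that brings the block maximum
    # within the representable range of the element type.
    amax = xb.abs().amax(dim=-1, keepdim=True).clamp(min=1e-12)
    elem_max = torch.finfo(elem_dtype).max
    scale = torch.exp2(torch.ceil(torch.log2(amax / elem_max)))

    # Quantize to the low-bit element type, then dequantize back to bfloat16
    # (matmuls and softmax run on dequantized bfloat16, as noted above).
    q = (xb / scale).to(elem_dtype)
    return (q.to(torch.float32) * scale).reshape(orig_shape).to(torch.bfloat16)

# Example: fake-quantize a projection weight.
proj_weight = torch.randn(4096, 4096, dtype=torch.bfloat16)
proj_weight_q = mx_quant_dequant(proj_weight)
```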
Evaluation Setup
- Models Evaluated: LLaMA 3.1-8B, LLaMA 3.1-70B
- Datasets: PIQA, ARC Easy, ARC Challenge, HellaSwag, Winogrande
- Baseline Precision: bfloat16
- Inference Hardware: NVIDIA A100 80GB
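A minimal sketch of how the bf16 baseline numbers could be gathered, assuming the EleutherAI lm-evaluation-harness (the harness actually used is not stated here); the quantized variants would first patch the model's projection, MLP, and attention tensors with the MX fake-quantization sketched above. Task names and the `acc,none` metric key follow recent lm-eval releases.

```python
import lm_eval

# bf16 baseline run on the five evaluation tasks; quantized runs would wrap
# the model with MX quantization before calling the harness.
results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=meta-llama/Llama-3.1-8B,dtype=bfloat16",
    tasks=["piqa", "arc_easy", "arc_challenge", "hellaswag", "winogrande"],
    batch_size=8,
)
for task, metrics in results["results"].items():
    print(task, metrics.get("acc,none"))
```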
Accuracy Comparison
| Model | Proj W | Proj A | MLP W | MLP A | Query | Key | Value | Attn W | Avg. Acc. (%) | Acc. Δ (%) |
|---|---|---|---|---|---|---|---|---|---|---|
| LLaMA 3.1-8B (bf16) | - | - | - | - | - | - | - | - | 73.60 | — |
| LLaMA 3.1-8B | FP6 | FP8 | FP6 | FP8 | - | - | - | - | 73.26 | -0.34 |
| LLaMA 3.1-8B | FP6 | FP6 | FP6 | FP6 | - | - | - | - | 73.12 | -0.48 |
| LLaMA 3.1-8B | FP6 | FP8 | FP6 | FP8 | FP6 | FP6 | FP6 | FP6 | 71.79 | -1.81 |
| LLaMA 3.1-8B | FP6 | FP6 | FP6 | FP6 | FP6 | FP6 | FP6 | FP6 | 71.76 | -1.84 |
| LLaMA 3.1-70B (bf16) | - | - | - | - | - | - | - | - | 79.93 | — |
| LLaMA 3.1-70B | FP6 | FP8 | FP6 | FP6 | - | - | - | - | 79.35 | -0.58 |
| LLaMA 3.1-70B | FP6 | FP6 | FP6 | FP6 | - | - | - | - | 78.94 | -1.00 |
| LLaMA 3.1-70B | FP6 | FP8 | FP6 | FP6 | FP6 | FP6 | FP6 | FP6 | 78.63 | -1.30 |
| LLaMA 3.1-70B | FP6 | FP6 | FP6 | FP6 | FP6 | FP6 | FP6 | FP6 | 78.63 | -1.47 |
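Assuming "Avg. Acc." is the unweighted mean of the five task accuracies and "Acc. Δ" is measured against the bf16 baseline of the same model size (the section does not spell this out), the last two columns reduce to:

```python
# Assumes an unweighted mean over the five tasks and a delta against the
# bf16 baseline of the same model size.
TASKS = ["piqa", "arc_easy", "arc_challenge", "hellaswag", "winogrande"]

def average_accuracy(results: dict) -> float:
    """Mean accuracy (%) over TASKS from an lm-eval-style results dict."""
    accs = [results["results"][t]["acc,none"] for t in TASKS]
    return 100.0 * sum(accs) / len(accs)

# Example with the table's 8B numbers:
baseline_acc = 73.60   # bf16 row
mxfp_acc = 73.26       # FP6 weights / FP8 activations row
print(f"Acc. Δ = {mxfp_acc - baseline_acc:+.2f}%")  # -> -0.34%
```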
Analysis and Insights
- Projection activations appear more sensitive to quantization than MLP activations, especially under FP6.
- A full-stack FP6 configuration achieves a strong accuracy/compression tradeoff, showing only ~1.8% degradation on the 8B model (and under 1.5% on 70B) while offering substantial compression.
- Using FP8 for activations (especially in projection) recovers some accuracy relative to the all-FP6 configurations: up to roughly 0.4% on the 70B model and about 0.1% on the 8B model.
- Value vectors can be further compressed (e.g., MXFP4) with negligible loss (results not shown).
Reproducibility
To replicate these benchmarks, see the example script: