torchmx.quant_api
Quantization API for LLM models.
mx_dynamic_activation_mx_weights
Quantize the model with MXFP Dynamic quantization for activations and MXFP
quantization for weights. This directly replaces the nn.Linear module's weight parameter.
This is a helper function to be used with torchao.quantization.quantize_.
You can use this if you want to quantize all Linear layers in the model with MXFP
Dynamic quantization and do not want to distinguish between Attention and MLP layers.
See below for an example of how to use this function.
Arguments:
weight_elem_dtype (dtypes.DType, optional) - Weight element dtype. Defaults to dtypes.float6_e3m2.
weight_block_size (int, optional) - Weight block size. Defaults to 32.
activation_elem_dtype (dtypes.DType, optional) - Activation element dtype. Defaults to dtypes.float8_e4m3.
activation_block_size (int, optional) - Activation block size. Defaults to 32.
Usage:
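A minimal sketch of the intended wiring, using the defaults listed above. The toy model is illustrative; any module tree containing nn.Linear should work, and torchao must be installed:

```python
import torch.nn as nn
from torchao.quantization import quantize_
from torchmx import dtypes
from torchmx.quant_api import mx_dynamic_activation_mx_weights

# Toy stand-in for an LLM; any model containing nn.Linear layers works.
model = nn.Sequential(nn.Linear(128, 256), nn.ReLU(), nn.Linear(256, 128))

# Swap every nn.Linear weight for its MXFP-quantized counterpart in place.
quantize_(
    model,
    mx_dynamic_activation_mx_weights(
        weight_elem_dtype=dtypes.float6_e3m2,      # default
        weight_block_size=32,                      # default
        activation_elem_dtype=dtypes.float8_e4m3,  # default
        activation_block_size=32,                  # default
    ),
)
```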
quantize_linear_
Quantize an LLM by swapping the Linear layers in place.
This method only replaces/quantizes the linear layers. Treat it as an approximation, since the QKV computation and other attention internals are not quantized. Use it only when a quantized implementation of a specific attention layer is not available.
Arguments:
model (torch.nn.Module) - The model to quantize.
qconfig (QLLMConfig) - The quantization configuration.
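A minimal sketch, assuming QLLMConfig is importable from torchmx.config and can be default-constructed; both are assumptions, so check the import path and constructor fields in your installation:

```python
import torch.nn as nn
from torchmx.quant_api import quantize_linear_
from torchmx.config import QLLMConfig  # import path assumed

# Toy stand-in for an LLM; any module tree containing nn.Linear works.
model = nn.Sequential(nn.Linear(64, 64), nn.GELU(), nn.Linear(64, 64))

# Hypothetical: relies on QLLMConfig's defaults. Field names for a
# custom configuration may differ across torchmx versions.
quantize_linear_(model, qconfig=QLLMConfig())
```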
quantize_llm_
Quantize the LLM by swapping the Attention and MLP layers in place.
The swapped-in layer implementations are expected to handle all possible quantization configurations.
Refer to torchmx/layers/mx_llama_attention.py for more details.
Arguments:
model (torch.nn.Module) - The model to quantize.
qattention_config (QAttentionConfig) - The quantization configuration for the attention layers.
qmlp_config (QLinearConfig) - The quantization configuration for the MLP layers.
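A minimal sketch, assuming QAttentionConfig and QLinearConfig live in torchmx.config and accept default construction (both assumptions). The Hugging Face Llama checkpoint is purely illustrative; the mx_llama_attention reference above suggests Llama-style attention is supported:

```python
from transformers import AutoModelForCausalLM
from torchmx.quant_api import quantize_llm_
from torchmx.config import QAttentionConfig, QLinearConfig  # import path assumed

# Illustrative checkpoint; any LLM with a supported attention layer should work.
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.2-1B")

# Hypothetical: default-constructed configs. Constructor fields may
# differ in your torchmx version.
quantize_llm_(
    model,
    qattention_config=QAttentionConfig(),
    qmlp_config=QLinearConfig(),
)
```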