torchmx.quant_api
Quantization API for LLM models.
mx_dynamic_activation_mx_weights
Quantize the model with MXFP Dynamic quantization for activations and MXFP
quantization for weights. This directly replaces the nn.Linear module's weight parameter.
This is a helper function to be used with torchao.quantization.quantize_.
You can use this if you want to quantize all Linear layers in the model with MXFP
Dynamic quantization and do not want to distinguish between Attention and MLP layers.
See below for an example of how to use this function.
Arguments:
weight_elem_dtype (dtypes.DType, optional) - Weight element dtype. Defaults to dtypes.float6_e3m2.
weight_block_size (int, optional) - Weight block size. Defaults to 32.
activation_elem_dtype (dtypes.DType, optional) - Activation element dtype. Defaults to dtypes.float8_e4m3.
activation_block_size (int, optional) - Activation block size. Defaults to 32.
Usage:
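A minimal sketch of the intended wiring, using the defaults listed above. The toy model is illustrative; any module tree containing nn.Linear should work, and torchao must be installed:

```python
import torch.nn as nn
from torchao.quantization import quantize_
from torchmx import dtypes
from torchmx.quant_api import mx_dynamic_activation_mx_weights

# Toy stand-in for an LLM; any model containing nn.Linear layers works.
model = nn.Sequential(nn.Linear(128, 256), nn.ReLU(), nn.Linear(256, 128))

# Swap every nn.Linear weight for its MXFP-quantized counterpart in place.
quantize_(
    model,
    mx_dynamic_activation_mx_weights(
        weight_elem_dtype=dtypes.float6_e3m2,      # default
        weight_block_size=32,                      # default
        activation_elem_dtype=dtypes.float8_e4m3,  # default
        activation_block_size=32,                  # default
    ),
)
```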
quantize_linear_
Quantize an LLM by swapping the Linear layers in place.
This method only replaces/quantizes the linear layers. Treat it as an approximation, since the QKV computation and other attention internals are not quantized. Use it only when a quantized implementation of a specific attention layer is not available.
Arguments:
model (torch.nn.Module) - The model to quantize.
qconfig (QLLMConfig) - The quantization configuration.
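A minimal sketch, assuming QLLMConfig is importable from torchmx.config and can be default-constructed; both are assumptions, so check the import path and constructor fields in your installation:

```python
import torch.nn as nn
from torchmx.quant_api import quantize_linear_
from torchmx.config import QLLMConfig  # import path assumed

# Toy stand-in for an LLM; any module tree containing nn.Linear works.
model = nn.Sequential(nn.Linear(64, 64), nn.GELU(), nn.Linear(64, 64))

# Hypothetical: relies on QLLMConfig's defaults. Field names for a
# custom configuration may differ across torchmx versions.
quantize_linear_(model, qconfig=QLLMConfig())
```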
quantize_llm_
Quantize the LLM by swapping the Attention and MLP layers in place.
The swapped-in layer implementations are expected to handle all possible quantization configurations.
Refer to torchmx/layers/mx_llama_attention.py for more details.
Arguments:
model (torch.nn.Module) - The model to quantize.
qattention_config (QAttentionConfig) - The quantization configuration for the attention layers.
qmlp_config (QLinearConfig) - The quantization configuration for the MLP layers.
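A minimal sketch, assuming QAttentionConfig and QLinearConfig live in torchmx.config and accept default construction (both assumptions). The Hugging Face Llama checkpoint is purely illustrative; the mx_llama_attention reference above suggests Llama-style attention is supported:

```python
from transformers import AutoModelForCausalLM
from torchmx.quant_api import quantize_llm_
from torchmx.config import QAttentionConfig, QLinearConfig  # import path assumed

# Illustrative checkpoint; any LLM with a supported attention layer should work.
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.2-1B")

# Hypothetical: default-constructed configs. Constructor fields may
# differ in your torchmx version.
quantize_llm_(
    model,
    qattention_config=QAttentionConfig(),
    qmlp_config=QLinearConfig(),
)
```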