torchmx.quant_api

Quantization API for LLM models.

quantize_linear_

[view source]

def quantize_linear_(model: torch.nn.Module, qconfig: QLinearConfig)

Quantize an LLM by swapping its Linear layers in place.

This method only replaces/quantizes the Linear layers. Treat it as an approximation, since QKV handling and other attention internals are not quantized. Use it only when a quantized implementation of the model's attention layer is not available (see the usage sketch after the argument list).

Arguments:

  • model torch.nn.Module - The model to quantize.
  • qconfig QLinearConfig - The quantization configuration for the Linear layers.
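
A minimal usage sketch. The config import path and the QLinearConfig constructor arguments are not shown in this document, so they are left as placeholders; consult torchmx's configuration module for the actual schema.

import torch

from torchmx.config import QLinearConfig  # import path assumed
from torchmx.quant_api import quantize_linear_

# Toy model: every torch.nn.Linear inside is swapped in place.
model = torch.nn.Sequential(
    torch.nn.Linear(512, 1024),
    torch.nn.GELU(),
    torch.nn.Linear(1024, 512),
)

qconfig = QLinearConfig(...)  # placeholder: fill in the real config fields

quantize_linear_(model, qconfig)  # mutates `model`; nothing is returned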

quantize_llm_

[view source]

def quantize_llm_(model: torch.nn.Module, qattention_config: QAttentionConfig,
                  qmlp_config: QLinearConfig)

Quantize the LLM by swapping its Attention and MLP layers in place. The swapped-in layers are expected to handle all supported quantization configurations. Refer to torchmx/layers/mx_llama_attention.py for more details, and see the usage sketch after the argument list below.

Arguments:

  • model torch.nn.Module - The model to quantize.
  • qattention_config QAttentionConfig - The quantization configuration for the attention layers.
  • qmlp_config QLinearConfig - The quantization configuration for the MLP layers.
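
A minimal sketch of the full-model path, assuming a Hugging Face Llama checkpoint. The model-loading code and the config constructor arguments are illustrative placeholders, not part of this API.

import torch

from transformers import AutoModelForCausalLM
from torchmx.config import QAttentionConfig, QLinearConfig  # import path assumed
from torchmx.quant_api import quantize_llm_

# Load a model whose attention layer torchmx implements (e.g. Llama).
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B",
    torch_dtype=torch.bfloat16,
)

qattention_config = QAttentionConfig(...)  # placeholder: see torchmx's config schema
qmlp_config = QLinearConfig(...)           # placeholder: see torchmx's config schema

# Swaps the Attention and MLP layers in place.
quantize_llm_(model, qattention_config, qmlp_config)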