torchmx.quant_api
Quantization API for LLM models.
quantize_linear_
Quantize an LLM by swapping its Linear layers in place.
This method only replaces/quantizes the linear layers; it does not quantize the QKV projections or other attention internals, so treat the result as an approximation. Use it only when a quantized implementation of a specific attention layer is not available.
Arguments:
- model: torch.nn.Module - The model to quantize.
- qconfig: QLLMConfig - The quantization configuration.
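A minimal usage sketch. The import path for QLLMConfig and its default construction are assumptions; consult the torchmx source for the actual fields.

```python
import torch

from torchmx.quant_api import quantize_linear_
from torchmx import QLLMConfig  # import path assumed

# Toy model standing in for an LLM; any torch.nn.Module containing
# torch.nn.Linear layers works, since only Linear layers are swapped.
model = torch.nn.Sequential(
    torch.nn.Linear(1024, 4096),
    torch.nn.ReLU(),
    torch.nn.Linear(4096, 1024),
)

qconfig = QLLMConfig()  # assumed default-constructible
quantize_linear_(model, qconfig)  # trailing underscore: mutates `model` in place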
quantize_llm_
Quantize the LLM by swapping the attention and MLP layers in place.
The swapped-in layers are expected to handle all supported quantization configurations. Refer to torchmx/layers/mx_llama_attention.py for more details.
Arguments:
- model: torch.nn.Module - The model to quantize.
- qattention_config: QAttentionConfig - The quantization configuration for the attention layers.
- qmlp_config: QLinearConfig - The quantization configuration for the MLP layers.
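A minimal sketch of the full-model path. The Hugging Face checkpoint, the config import paths, and default-constructible configs are all assumptions; the keyword names come from the argument list above.

```python
from transformers import AutoModelForCausalLM  # any Llama-style checkpoint assumed

from torchmx.quant_api import quantize_llm_
from torchmx import QAttentionConfig, QLinearConfig  # import paths assumed

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.2-1B")

qattention_config = QAttentionConfig()  # assumed default-constructible
qmlp_config = QLinearConfig()           # assumed default-constructible

# Swaps the attention and MLP layers in place (trailing-underscore convention).
quantize_llm_(
    model,
    qattention_config=qattention_config,
    qmlp_config=qmlp_config,
)
```

Because both functions mutate the model in place, quantize the model before moving it to its target device and wrapping it for inference.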