No-Op-Aware Training and Quantization Framework for Outlier Robust Transformer based Language Models

Sameed Ahmed Khan, A S M Humaun Kabir

Abstract


This paper introduces a no-op-aware training and quantization framework for transformer-based language models that improves robustness to activation outliers while enabling efficient low-precision deployment. We modify an OPT-12L12H model with No-Op-Aware Attention Training (NOAT), which combines conditional per-head gating with a Softmax1-based attention activation to suppress extreme attention activations during training. The trained model is then quantized with two schemes: standard 8-bit uniform quantization and GPTQ-based post-training quantization. Experimental evaluation shows that the NOAT-trained, GPTQ-quantized model not only preserves perplexity but slightly improves it, reaching 10.68 versus 10.96 at full precision. The paper also shows that GPTQ applied to the NOAT model closely matches the statistical structure of the full-precision activations, preserving their kurtosis, whereas uniform quantization produces heavier tails, indicating a higher presence of outliers. Stabilizing attention activations during training substantially enhances the effectiveness of downstream quantization, narrowing the gap between model efficiency and accuracy and enabling more reliable deployment of large language models on low-resource hardware.
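The three quantities the abstract relies on can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. It assumes that Softmax1 takes the commonly used form exp(x_i) / (1 + sum_j exp(x_j)), that "standard 8-bit uniform quantization" means symmetric per-tensor rounding, and that kurtosis is the fourth standardized moment. The function names are hypothetical; the per-head gating and GPTQ are not shown.

```python
import numpy as np

def softmax1(scores, axis=-1):
    """Assumed Softmax1 form: exp(x_i) / (1 + sum_j exp(x_j)).

    The extra 1 in the denominator lets a head assign near-zero total
    attention (a "no-op") instead of being forced to sum to 1.
    """
    # Shift by max(scores, 0) for numerical stability; the implicit zero
    # logit then contributes exp(-m) to the denominator.
    m = np.maximum(scores.max(axis=axis, keepdims=True), 0.0)
    e = np.exp(scores - m)
    return e / (np.exp(-m) + e.sum(axis=axis, keepdims=True))

def quantize_uniform_int8(x):
    """Symmetric per-tensor 8-bit uniform quantization, then dequantization.

    Assumption: one scale per tensor, set by the largest magnitude, so a
    single outlier coarsens the grid for every other value.
    """
    max_abs = np.abs(x).max()
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q.astype(np.float32) * scale

def kurtosis(x):
    """Pearson kurtosis (fourth standardized moment); about 3 for a Gaussian."""
    x = np.asarray(x, dtype=np.float64).ravel()
    centered = x - x.mean()
    return float((centered ** 4).mean() / (centered ** 2).mean() ** 2)
```

Under these assumptions, a row of strongly negative scores passed through `softmax1` yields weights that sum to nearly zero, and comparing `kurtosis` of an activation tensor before and after `quantize_uniform_int8` gives a rough view of how quantization changes the tails.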





ISSN: 2307-8162