Adapting Large Language Models for narrow domains using the exponential moving average method

D.K. Sviridenko, E.V. Bobrova, K.S. Zaytsev, E.V. Dyuldin, B.M. Shifman

Abstract


Adapting Large Language Models (LLMs) to specialized subject areas requires fine-tuning on domain-specific corpora, which inevitably carries the risk of catastrophic forgetting of previously acquired knowledge. This paper presents a comparative study of four fine-tuning strategies: (1) standard supervised fine-tuning with cross-entropy loss (CE); (2) cross-entropy with L1 regularization of the weights (CE-L1); (3) regularization against a static "teacher" model (KL-CE); and (4) the proposed Teacher Exponential Moving Average (TEMA) approach, in which the "teacher" weights are updated dynamically by exponential smoothing of the student model's weights. The methods were evaluated on the Qwen2-0.5B and Qwen2-1.5B models, using 4-bit quantization and LoRA adapters, on a medical corpus of over 27,000 cytological reports based on the Bethesda System. Generation quality was assessed with lexical (BLEU, ROUGE, METEOR, ChrF) and semantic (BLEURT, BERTScore) metrics, and the MMLU benchmark (5-shot) was used to check that general capabilities were preserved. The results show that KL-CE limits the model's adaptation to the new domain, while CE-L1 performs poorly in both generation and knowledge retention. Standard fine-tuning (CE) achieves high quality on the new data but reduces generation quality on "general" data outside the training domain. The proposed TEMA method provides the best balance between plasticity and stability, improves the semantic quality of generation, and minimizes the degradation of general knowledge. These results support recommending TEMA as an effective tool for adapting LLMs to highly specialized tasks such as medical diagnostics.
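The dynamic teacher update described in the abstract can be illustrated with a minimal sketch. It assumes the usual exponential moving average rule, teacher = decay * teacher + (1 - decay) * student, applied parameter by parameter. The abstract gives neither the decay value nor the update schedule, so the name ema_update, the default decay of 0.99, and the plain-dictionary weight layout are all hypothetical and are not taken from the authors' implementation.

```python
from typing import Dict, List

# Hypothetical weight container: parameter name -> flat list of values.
Weights = Dict[str, List[float]]


def ema_update(teacher: Weights, student: Weights, decay: float = 0.99) -> None:
    """Update the teacher in place as an exponential moving average of the student.

    Assumed rule: teacher <- decay * teacher + (1 - decay) * student.
    The default decay is an illustrative value, not one reported in the paper.
    """
    if not 0.0 <= decay <= 1.0:
        raise ValueError("decay must lie in [0, 1]")
    for name, student_values in student.items():
        teacher_values = teacher[name]
        for i, s in enumerate(student_values):
            teacher_values[i] = decay * teacher_values[i] + (1.0 - decay) * s
```

Under this rule, decay = 1.0 leaves the teacher unchanged, which corresponds to a static teacher as in the KL-CE setting, while smaller values let the teacher follow the student more closely. In a LoRA setup the update would presumably be applied only to the trainable adapter weights after each optimizer step, but that detail is likewise an assumption.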






ISSN: 2307-8162