Development of Cross-Language Embeddings for Extracting Chemical Structures from Texts in Russian and English

Alexey Molodchenkov, Dmitry Deviatkin, Sergey Loginov, Alexey Lupatov, Alisa Gisina, Anton Lukin

Abstract


This study describes an algorithm that applies cross-lingual embeddings to extract chemical structures from texts in Russian and English. The proposed approach centers on fine-tuning pre-trained transformer-based models; after an analysis of existing models, mBERT and LaBSE were selected. The training data comprised texts on chemistry and adjacent fields of science: fine-tuning was performed on a collected set of scientific articles and patent texts in Russian and English, and the ChemProt corpus was additionally used for English. The models were trained on masked language modeling and named entity recognition tasks. Comparisons were made with several models, including BioBERT. The experimental results show that the proposed embeddings solve the task of recognizing chemical structure names in Russian and English texts more effectively.
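The fine-tuning step summarized above can be illustrated with a minimal sketch: adapting a multilingual encoder (mBERT here; the LaBSE checkpoint could be substituted) to recognize chemical names as token classification with Hugging Face Transformers. The BIO label set, the toy Russian/English sentences, and the hyperparameters are illustrative assumptions rather than the authors' actual corpora or settings, and the masked language modeling stage is omitted.

# Sketch only: assumed label scheme and toy data, not the authors' pipeline.
import torch
from transformers import (AutoTokenizer, AutoModelForTokenClassification,
                          Trainer, TrainingArguments)

MODEL_NAME = "bert-base-multilingual-cased"        # mBERT; a LaBSE checkpoint could be used instead
LABELS = ["O", "B-CHEM", "I-CHEM"]                 # assumed BIO tags for chemical names
label2id = {label: i for i, label in enumerate(LABELS)}
id2label = {i: label for label, i in label2id.items()}

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForTokenClassification.from_pretrained(
    MODEL_NAME, num_labels=len(LABELS), id2label=id2label, label2id=label2id)

# Toy bilingual examples; real training data would be the annotated articles,
# patents and ChemProt sentences mentioned in the abstract.
train_words = [
    ["Раствор", "гидроксида", "натрия", "добавляли", "по", "каплям"],
    ["Sodium", "hydroxide", "solution", "was", "added", "dropwise"],
]
train_tags = [
    ["O", "B-CHEM", "I-CHEM", "O", "O", "O"],
    ["B-CHEM", "I-CHEM", "O", "O", "O", "O"],
]

def encode(words, tags):
    """Tokenize pre-split words and align BIO labels with the word pieces."""
    enc = tokenizer(words, is_split_into_words=True, truncation=True,
                    padding="max_length", max_length=64)
    enc["labels"] = [
        -100 if word_id is None else label2id[tags[word_id]]   # -100 is ignored by the loss
        for word_id in enc.word_ids()
    ]
    return enc

class ChemNerDataset(torch.utils.data.Dataset):
    """Wraps the encoded examples for the Hugging Face Trainer."""
    def __init__(self, words, tags):
        self.items = [encode(w, t) for w, t in zip(words, tags)]
    def __len__(self):
        return len(self.items)
    def __getitem__(self, i):
        return {k: torch.tensor(v) for k, v in self.items[i].items()}

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="chem-ner-mbert", num_train_epochs=3,
                           per_device_train_batch_size=8, learning_rate=3e-5),
    train_dataset=ChemNerDataset(train_words, train_tags),
)
trainer.train()

After training, the same tokenizer and model can be applied to unseen Russian or English sentences, and spans labeled B-CHEM/I-CHEM are read off as chemical structure names.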


References


T. Mikolov, K. Chen, G. Corrado, J. Dean. Efficient Estimation of Word Representations in Vector Space. 2013. https://doi.org/10.48550/arXiv.1301.3781.

Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, L. D. Jackel. Backpropagation Applied to Handwritten Zip Code Recognition. Neural Computation, 1(4):541-551, Winter 1989.

H. Sak, A. Senior, F. Beaufays. Long Short-Term Memory Based Recurrent Neural Network Architectures for Large Vocabulary Speech Recognition. 2014. https://doi.org/10.48550/arXiv.1402.1128.

A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, I. Polosukhin. Attention Is All You Need. 2017. https://doi.org/10.48550/arXiv.1706.03762.

Sciapp [Electronic resource]. – URL: https://sciapp.ru/ (accessed: 19.09.2024)

Taboureau O. et al. ChemProt: a disease chemical biology database //Nucleic Acids Research. – 2010. – Vol. 39. – No. suppl_1. – pp. D367-D372.

mBERT base model [Electronic resource]. – URL: https://huggingface.co/google-bert/bert-base-multilingual-cased (accessed: 19.09.2024)

Li, B., He, Y., & Xu, W. (2021). Cross-lingual Named Entity Recognition Using Parallel Corpus: A New Approach Using XLM-Roberta Alignment. arXiv preprint arXiv:2101.11112.

Chipman H. A. et al. mBART: Multidimensional Monotone BART //Bayesian Analysis. – 2022. – Vol. 17. – No. 2. – pp. 515-544.

F. Feng, Y. Yang, D. Cer, N. Arivazhagan, W. Wang. Language-agnostic BERT Sentence Embedding //arXiv preprint arXiv:2007.01852. – 2020, doi: https://doi.org/10.48550/arXiv.2007.01852.

LaBSE base model [Electronic resource]. – URL: https://huggingface.co/cointegrated/LaBSE-en-ru (accessed: 19.09.2024)

X. Ouyang, S. Wang, C. Pang, Y. Sun, H. Tian, H. Wu. ERNIE-M: Enhanced Multilingual Representation by Aligning Cross-lingual Semantics with Monolingual Corpora, 2020, https://doi.org/10.48550/arXiv.2012.15674.

M. Artetxe, H. Schwenk. Massively Multilingual Sentence Embeddings for Zero-Shot Cross-Lingual Transfer and Beyond. Transactions of the Association for Computational Linguistics, 2019, 7: 597–610. doi: https://doi.org/10.1162/tacl_a_00288.

F. Luo, W. Wang, J. Liu, Y. Liu, B. Bi. VECO: Variable and Flexible Cross-lingual Pre-training for Language Understanding and Generation. 2020, https://doi.org/10.48550/arXiv.2010.16046.

Y. Fang, S. Wang, Z. Gan, S. Sun, J. Liu. FILTER: An Enhanced Fusion Method for Cross-lingual Language Understanding. 2020, https://doi.org/10.48550/arXiv.2009.05166.

H. Huang, Y. Liang, N. Duan, M. Gong, L. Shou. Unicoder: A Universal Language Encoder by Pre-training with Multiple Cross-lingual Tasks. 2019, https://doi.org/10.48550/arXiv.1909.00964.

Aroca-Ouellette, S., and Rudzicz, F. (2020). "On Losses for Modern Language Models," in Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP) (Association for Computational Linguistics), 4970–4981. Available online at: https://www.aclweb.org/anthology/2020.emnlp-main.403

Lukashkina Yu. N., Vorontsov K. V. Assessing Stability and Completeness of Topic Models of Multidisciplinary Text Collections. [Electronic resource]. – URL: http://www.machinelearning.ru/wiki/images/4/4b/Lukashkina2017MSc.pdf (accessed: 19.10.2024)

Lee, J., Yoon, W., Kim, S., Kim, D., Kim, S., So, C. H., & Kang, J. (2020). BioBERT: A Pre-trained Biomedical Language Representation Model for Biomedical Text Mining. Bioinformatics, 36(4), 1234-1240.

S. Chithrananda, B. Ramsundar, G. Grand. ChemBERTa: Large-Scale Self-Supervised Pretraining for Molecular Property Prediction. 2020. https://doi.org/10.48550/arXiv.2010.09885.

BERT base model [Electronic resource]. – URL: https://huggingface.co/google-bert/bert-base-uncased.





ISSN: 2307-8162