A survey on natural language semantic search algorithms

Nikita Shalagin

Abstract


Semantic search is a modern approach to information retrieval that relies on understanding the meaning and context of queries, yielding more relevant results than traditional keyword-based methods. Natural language processing technologies, such as the Transformer architecture and large pre-trained language models, have significantly improved the quality of semantic search. These models have demonstrated strong performance on a variety of benchmarks, leading to their widespread adoption across numerous domains. The main advantages of semantic search are higher accuracy and relevance of results, an improved user experience, and the ability to express queries in free form. Despite these achievements, however, challenges remain: the computational complexity of the models, the limited length of text they can process, and response time in real-time settings. In solutions that must handle user queries in near real time, developers often have to fall back on less resource-intensive methods, which can reduce search quality. A common dilemma for practitioners building applications is therefore the trade-off between computational cost, processing speed, and search quality. This work reviews current semantic search methods and provides a comparative analysis. Special attention is given to the advantages and disadvantages of the various approaches, as well as to the prospects for their further development and application in different fields.
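To make the cost/quality trade-off concrete, the sketch below (not taken from the paper) shows the widely used two-stage pipeline in the spirit of the retrieve and re-rank setup referenced in the bibliography: a fast bi-encoder retrieves candidates by embedding similarity, and a slower but more accurate cross-encoder re-ranks only those candidates. It assumes the sentence-transformers library; the corpus, query, and model names are illustrative choices, not the survey's.

# Minimal sketch of two-stage semantic search with sentence-transformers.
# Model names are illustrative; any compatible checkpoints would do.
from sentence_transformers import SentenceTransformer, CrossEncoder, util

docs = [
    "Semantic search matches queries to documents by meaning.",
    "BM25 ranks documents by term frequency statistics.",
    "Cross-encoders score query-document pairs jointly.",
]

# Stage 1: cheap bi-encoder -- document embeddings can be precomputed
# and indexed offline, so query time stays low.
bi_encoder = SentenceTransformer("all-MiniLM-L6-v2")
doc_emb = bi_encoder.encode(docs, convert_to_tensor=True)

query = "how does meaning-based retrieval work?"
query_emb = bi_encoder.encode(query, convert_to_tensor=True)
hits = util.semantic_search(query_emb, doc_emb, top_k=3)[0]

# Stage 2: expensive cross-encoder -- applied only to the few
# candidates from stage 1, trading extra latency for quality.
cross_encoder = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
pairs = [(query, docs[h["corpus_id"]]) for h in hits]
scores = cross_encoder.predict(pairs)

for h, s in sorted(zip(hits, scores), key=lambda x: -x[1]):
    print(round(float(s), 3), docs[h["corpus_id"]])

Tuning top_k, or replacing the second stage with a lighter model, moves the pipeline along the same cost/speed/quality curve discussed above.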

Full Text:

PDF (Russian)

References


Vaswani Ashish, Shazeer Noam, Parmar Niki et al. Attention is all you need. — 2017. — URL: https://arxiv.org/abs/1706.03762.

Devlin Jacob, Chang Ming-Wei, Lee Kenton, Toutanova Kristina. Bert: Pre-training of deep bidirectional transformers for language understanding. — 2018. — URL: https://arxiv.org/abs/1810.04805.

Liu Yinhan, Ott Myle, Goyal Naman et al. Roberta: A robustly optimized bert pretraining approach. — 2019. — URL: https://arxiv.org/abs/1907.11692.

Language models are unsupervised multitask learners / Alec Radford, Jeff Wu, Rewon Child et al. — 2019.

Brown Tom B., Mann Benjamin, Ryder Nick et al. Language models are few-shot learners. — 2020. — URL: https://arxiv.org/abs/2005.14165.

Wang Alex, Singh Amanpreet, Michael Julian et al. Glue: A multitask benchmark and analysis platform for natural language understanding. — 2018. — URL: https://arxiv.org/abs/1804.07461.

Rajpurkar Pranav, Zhang Jian, Lopyrev Konstantin, Liang Percy. Squad: 100,000+ questions for machine comprehension of text. — 2016. — URL: https://arxiv.org/abs/1606.05250.

Lai Guokun, Xie Qizhe, Liu Hanxiao et al. Race: Large-scale reading comprehension dataset from examinations. — 2017. — URL: https://arxiv.org/abs/1704.04683.

Zellers Rowan, Holtzman Ari, Bisk Yonatan et al. Hellaswag: Can a machine really finish your sentence? — 2019. — URL: https://arxiv.org/abs/1905.07830.

Position-aware attention and supervised data improve slot filling / Yuhao Zhang, Victor Zhong, Danqi Chen et al. // Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing. — Copenhagen, Denmark : Association for Computational Linguistics, 2017. — P. 35–45. — URL: https://aclanthology.org/D17-1004.

Yang Yinfei, Cer Daniel, Ahmad Amin et al. Multilingual universal sentence encoder for semantic retrieval. — 2019. — URL: https://arxiv.org/abs/1907.04307.

Embedding-based retrieval in Facebook search / Jui-Ting Huang, Ashish Sharma, Shuying Sun et al. // CoRR. — 2020. — Vol. abs/2006.11632. — URL: https://arxiv.org/abs/2006.11632.

Zhang Yanzhao, Long Dingkun, Xu Guangwei, Xie Pengjun. Hlatr: Enhance multi-stage text retrieval with hybrid list aware transformer reranking. — 2022. — URL: https://arxiv.org/abs/2205.10569.

Karpukhin Vladimir, Oğuz Barlas, Min Sewon et al. Dense passage retrieval for open-domain question answering. — 2020. — URL: https://arxiv.org/abs/2004.04906.

Borgeaud Sebastian, Mensch Arthur, Hoffmann Jordan et al. Improving language models by retrieving from trillions of tokens. — 2022. — URL: https://arxiv.org/abs/2112.04426.

Thakur Nandan, Reimers Nils, Daxenberger Johannes, Gurevych Iryna. Augmented sbert: Data augmentation method for improving bi-encoders for pairwise sentence scoring tasks. — 2021. — URL: https://arxiv.org/abs/2010.08240.

Okapi at trec-6 automatic ad hoc, vlc, routing, filtering and qsdr / Steve Walker, Stephen E Robertson, Mohand Boughanem et al. // NIST Special Publication SP. — 1998. — P. 125–136.

Penha Gustavo, Palumbo Enrico, Aziz Maryam et al. Improving content retrievability in search with controllable query generation. — 2023. — URL: https://arxiv.org/abs/2303.11648.

Jagerman Rolf, Zhuang Honglei, Qin Zhen et al. Query expansion by prompting large language models. — 2023. — URL: https://arxiv.org/abs/2305.03653.

Zhang Yang, Bartley Travis M., Graterol-Fuenmayor Mariana et al. A chat about boring problems: Studying gpt-based text normalization. — 2024. — URL: https://arxiv.org/abs/2309.13426.

Bengio Yoshua, Courville Aaron, Vincent Pascal. Representation learning: A review and new perspectives. — 2014. — URL: https://arxiv.org/abs/1206.5538.

Gao Luyu, Callan Jamie. Unsupervised corpus aware language model pre-training for dense passage retrieval. — 2021. — URL: https://arxiv.org/abs/2108.05540.

Xiao Shitao, Liu Zheng, Shao Yingxia, Cao Zhao. Retromae: Pre-training retrieval-oriented language models via masked autoencoder. — 2022. — URL: https://arxiv.org/abs/2205.12035.

Wu Xing, Ma Guangyuan, Lin Meng et al. Contextual masked autoencoder for dense passage retrieval. — 2022. — URL: https://arxiv.org/abs/2208.07670.

Malkov Yu. A., Yashunin D. A. Efficient and robust approximate nearest neighbor search using hierarchical navigable small world graphs. — 2018. — URL: https://arxiv.org/abs/1603.09320.

Guo Ruiqi, Sun Philip, Lindgren Erik et al. Accelerating large-scale inference with anisotropic vector quantization. — 2020. — URL: https://arxiv.org/abs/1908.10396.

Retrieve re-rank. — URL: https://www.sbert.net/examples/applications/retrieve_rerank/README.html#retrieve-re-rank. — Accessed: 2022-12-21.

Nogueira Rodrigo, Cho Kyunghyun. Passage re-ranking with bert. — 2020. — URL: https://arxiv.org/abs/1901.04085.

Dai Zhuyun, Callan Jamie. Deeper text understanding for IR with contextual neural language modeling // Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval. — ACM, 2019.

CEDR: Contextualized embeddings for document ranking / Sean MacAvaney, Andrew Yates, Arman Cohan, Nazli Goharian // Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval. — ACM, 2019.

Cross-domain modeling of sentence-level evidence for document retrieval / Zeynep Akkalyoncu Yilmaz, Wei Yang, Haotian Zhang, Jimmy Lin // Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). — Hong Kong, China : Association for Computational Linguistics, 2019. — P. 3490–3496. — URL: https://aclanthology.org/D19-1352.

Li Canjia, Yates Andrew, MacAvaney Sean et al. Parade: Passage representation aggregation for document reranking. — 2021. — URL: https://arxiv.org/abs/2008.09093.

Bajaj Payal, Campos Daniel, Craswell Nick et al. Ms marco: A human generated machine reading comprehension dataset. — 2018. — URL: https://arxiv.org/abs/1611.09268.

Tay Yi, Tran Vinh Q., Dehghani Mostafa et al. Transformer memory as a differentiable search index. — 2022. — URL: https://arxiv.org/abs/2202.06991.

Raffel Colin, Shazeer Noam, Roberts Adam et al. Exploring the limits of transfer learning with a unified text-to-text transformer. — 2019. — URL: https://arxiv.org/abs/1910.10683.

Sutskever Ilya, Vinyals Oriol, Le Quoc V. Sequence to sequence learning with neural networks. — 2014. — URL: https://arxiv.org/abs/1409.3215.

Wang Yujing, Hou Yingyan, Wang Haonan et al. A neural corpus indexer for document retrieval. — 2023. — URL: https://arxiv.org/abs/2206.02743.

Tang Yubao, Zhang Ruqing, Guo Jiafeng et al. Listwise generative retrieval models via a sequential learning process. — 2024. — URL: https://arxiv.org/abs/2403.12499.

Mehta Sanket Vaibhav, Gupta Jai, Tay Yi et al. Dsi++: Updating transformer memory with new documents. — 2022. — URL: https://arxiv.org/abs/2212.09744.

Searching for answers in a pandemic: An overview of trec-covid / Ellen M. Voorhees, Ian Soboroff, Kirk Roberts et al. // Journal of Biomedical Informatics. — 2021. — Vol. 121. — URL: https://doi.org/10.1016/j.jbi.2021.103865.

Natural questions: a benchmark for question answering research / Tom Kwiatkowski, Jennimaria Palomaki, Olivia Redfield et al. // Transactions of the Association for Computational Linguistics. — 2019. — Vol. 7. — P. 452–466.

TriviaQA: A large scale distantly supervised challenge dataset for reading comprehension / Mandar Joshi, Eunsol Choi, Daniel Weld, Luke Zettlemoyer // Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). — Vancouver, Canada : Association for Computational Linguistics, 2017. — P. 1601–1611. — URL: https://aclanthology.org/P17-1147.

Nentidis Anastasios, Krithara Anastasia, Paliouras Georgios, Bougiatiotis Konstantinos. Bioasq: A challenge on large-scale biomedical semantic indexing and question answering. — 2021. — URL: http://participants-area.bioasq.org/. — Accessed: 2024-07-17.

Quora. Quora question pairs. — 2017. — URL: https://www.kaggle.com/c/quora-question-pairs. — Accessed: 2024-07-17.

FEVER: a large-scale dataset for fact extraction and VERification / James Thorne, Andreas Vlachos, Christos Christodoulopoulos, Arpit Mittal // NAACL-HLT. — New Orleans, Louisiana : Association for Computational Linguistics, 2018. — P. 809–819. — URL: https://aclanthology.org/N18-1074.

HotpotQA: A dataset for diverse, explainable multi-hop question answering / Zhilin Yang, Peng Qi, Saizheng Zhang et al. // Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing / Association for Computational Linguistics. — 2018. — P. 2369–2380. — URL: https://arxiv.org/abs/1809.09600.

Www’18 open challenge: Financial opinion mining and question answering / Saulo Macedo Maia, Siegfried Handschuh, André Freitas et al. // Companion Proceedings of the The Web Conference 2018. — 2018. — URL: https://github.com/dayanfcosta/fiqa-2018-task1/blob/master/datasets/Readme_task1.pdf.

Fact or fiction: Verifying scientific claims / David Wadden, Shanchuan Lin, Kyle Lo et al. // Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP). — Online : Association for Computational Linguistics, 2020. — P. 7534–7550. — URL: https://aclanthology.org/2020.emnlp-main.609.

Cohan Arman, Feldman Sergey, Beltagy Iz et al. SciDocs: A Benchmark Suite for Document-Level Representation Learning. — 2020. — Version 1.0. — URL: https://allenai.org/data/scidocs.



