About automatic recognition of printed text
Abstract
This article examines the internal structure of a system that recognizes printed data and converts it into a format convenient for further use. The developed project converts images and PDF files containing printed text and formulas into LaTeX format using free, open-source libraries and a custom-trained model for data classification, detection and segmentation. The study gives a detailed overview of the development stages, from the analysis of existing algorithms to the creation of a custom model for text and formula recognition. The focus is on selecting tools and libraries suited to automating the recognition of printed documents and their conversion into LaTeX format, and on attempts to improve the performance of these libraries. The process of creating and preparing the dataset for model training, which includes image and text annotation, is also described. The outcomes of model training are presented in tables showing the recognition accuracy achieved for each data class. The article presents the results of testing the system on text and formulas in both Russian and English. It also details the strengths and weaknesses of the developed system and the challenges encountered during its development. Building on this work, we created a website with an interface for converting images and PDF files into LaTeX format, as well as a Telegram bot with similar functionality.
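The abstract does not say how recognized regions are assembled into a LaTeX document, so the following is only a minimal sketch of one plausible final step, not the authors' implementation. It assumes that an upstream detector has already produced regions labelled as text or formula, each with a bounding-box origin and recognized content. The names `Region`, `escape_latex` and `regions_to_latex` are hypothetical, and sorting by top-left corner is a simplifying assumption that would not handle multi-column layouts.

```python
from dataclasses import dataclass


@dataclass
class Region:
    """One detected page region (hypothetical structure, not from the article)."""
    kind: str      # "text" or "formula" (assumed class labels)
    x: float       # left edge of the bounding box
    y: float       # top edge of the bounding box
    content: str   # recognized plain text, or LaTeX markup for a formula


# LaTeX reserved characters and their escaped forms.
_SPECIALS = {
    "\\": r"\textbackslash{}",
    "&": r"\&",
    "%": r"\%",
    "$": r"\$",
    "#": r"\#",
    "_": r"\_",
    "{": r"\{",
    "}": r"\}",
    "~": r"\textasciitilde{}",
    "^": r"\textasciicircum{}",
}


def escape_latex(text: str) -> str:
    """Escape reserved characters one at a time to avoid double escaping."""
    return "".join(_SPECIALS.get(ch, ch) for ch in text)


def regions_to_latex(regions: list[Region]) -> str:
    """Assemble regions into a LaTeX document in top-to-bottom, left-to-right order."""
    ordered = sorted(regions, key=lambda r: (r.y, r.x))
    body = []
    for region in ordered:
        if region.kind == "formula":
            # Formula content is assumed to be valid LaTeX already; wrap it as display math.
            body.append("\\[\n" + region.content + "\n\\]")
        else:
            body.append(escape_latex(region.content))
    preamble = "\n".join([
        r"\documentclass{article}",
        r"\usepackage[T2A]{fontenc}",
        r"\usepackage[utf8]{inputenc}",
        r"\usepackage[russian,english]{babel}",
        r"\usepackage{amsmath}",
    ])
    return (
        preamble
        + "\n\\begin{document}\n\n"
        + "\n\n".join(body)
        + "\n\n\\end{document}\n"
    )


if __name__ == "__main__":
    page = [
        Region("formula", 0, 50, r"E = mc^2"),
        Region("text", 0, 10, "Energy & mass: 100%"),
    ]
    print(regions_to_latex(page))
```

Plain text is escaped so that characters such as `&` and `%` do not break compilation, while formula content is passed through unchanged on the assumption that the recognizer already emits LaTeX.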
ISSN: 2307-8162