Thematic classification of Narod.ru hosting websites as part of the strategy for preserving early Internet sites

I. Aslanov, A. Kozlova, I. Bibilov, E. Kotelnikov

Abstract


The study focuses on the preservation and analysis of websites hosted on “Narod.ru,” a web-hosting platform that was active from 2000 to 2013. The hosted websites are treated as disappearing objects of digital heritage whose preservation and study may interest experts in various fields, particularly cultural scholars and researchers of early-internet digital folklore.
The study proposes an approach to the thematic classification of websites using large language models. First, a sample of archived websites from the hosting platform was manually annotated: the main page of each website was assigned thematic categories from Google’s taxonomy. Then, using the annotated sample, 17 proprietary and open large language models were evaluated on the task of multi-label thematic classification of web pages. The best result was achieved by the gemini-2.5-pro model (Samples F1 = 0.708). The proposed approach to the thematic classification of early internet websites enables researchers to identify and analyze cultural, social, and communicative patterns in the formation of digital society.
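As a rough illustration of the reported metric, the sketch below computes a samples-averaged F1 for multi-label predictions. It assumes that "Samples F1" means the per-page F1 between the gold and predicted label sets, averaged over all pages. The function name `samples_f1`, the category names, and the convention of scoring 0 for a page with no gold and no predicted labels are assumptions made for this example, not details taken from the study.

```python
from typing import Sequence, Set


def samples_f1(y_true: Sequence[Set[str]], y_pred: Sequence[Set[str]]) -> float:
    """Average the per-page F1 between gold and predicted label sets."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("at least one sample is required")
    total = 0.0
    for gold, pred in zip(y_true, y_pred):
        denom = len(gold) + len(pred)
        # Per-page F1 = 2 * |gold & pred| / (|gold| + |pred|).
        # Assumption: a page with no gold and no predicted labels scores 0.
        total += 2 * len(gold & pred) / denom if denom else 0.0
    return total / len(y_true)


# Hypothetical category names, for illustration only.
gold = [{"Arts & Entertainment", "Hobbies & Leisure"}, {"Sports"}, {"Science"}]
pred = [{"Arts & Entertainment"}, {"Sports"}, {"Games"}]
score = samples_f1(gold, pred)
```

Averaging per page rather than per category gives every website equal weight, however many categories it carries.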


ISSN: 2307-8162