Accelerating Latency-Critical Applications with AI-Powered Semi-Automatic Fine-Grained Parallelization on SMT Processors
Abstract
Latency-critical applications tend to show low utilization of functional units due to frequent cache misses and branch mispredictions during speculative execution on high-performance superscalar processors. However, because of its significant impact on single-thread performance, Simultaneous Multithreading (SMT) technology is rarely used with the heavy threads of latency-critical applications. In this paper, we explore the use of SMT technology to support fine-grained parallelization of latency-critical applications. Following recent advances in Large Language Models (LLMs), we introduce Aira, an AI-powered parallelization adviser. To implement Aira, we extend the AI coding agent in the Cursor IDE with additional tools connected through the Model Context Protocol, yielding an end-to-end AI agent for parallelization. The connected tools enable LLM-guided hotspot detection, collection of dynamic dependencies via Dynamic Binary Instrumentation, and SMT-aware performance simulation to estimate performance gains. We apply Aira together with the Relic parallel framework for fine-grained task parallelism on SMT cores to parallelize latency-critical benchmarks representative of real-world industrial applications. We show a 17% geomean performance gain from parallelizing these benchmarks with Aira and the Relic framework.
ISSN: 2307-8162