Sound augmentation methods

Yulia Romanovskaya, Eugene Ilyushin


The problem of sound recognition is becoming more relevant and in demand every year. Considering the task of recognizing voice commands, it becomes clear that a large amount of training data is required, since models must take into account differences in timbre, speaking rate, diction, and many other factors. Collecting such data manually is very time-consuming and, in practice, often impossible. As a result, the search for algorithms for the automatic creation of synthetic training datasets is actively underway. Augmentation is a method of creating additional data based on existing data. There are two fundamentally different approaches. The first takes existing data as input and returns the same data with changed characteristics (e.g., accelerated or louder samples). The second uses the original data only to train a model, which then generates new data independently. This article provides an overview of the entire spectrum of existing augmentation methods. We try several methods in our experiments and draw conclusions about the application of the presented approaches, as well as their impact on the quality of sound recognition, using the example of a voice recognition task.
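The first approach mentioned above (transforming existing samples) can be illustrated with a minimal sketch. The code below is not taken from the article; it assumes a mono waveform stored as a NumPy array and shows three common signal-level transformations: a gain change in decibels, a naive speed change via resampling, and additive white noise at a target signal-to-noise ratio.

```python
import numpy as np

def change_gain(samples: np.ndarray, gain_db: float) -> np.ndarray:
    """Scale the amplitude by a factor expressed in decibels."""
    return samples * (10.0 ** (gain_db / 20.0))

def change_speed(samples: np.ndarray, rate: float) -> np.ndarray:
    """Speed up (rate > 1) or slow down (rate < 1) by naive linear
    resampling. Note: this also shifts the pitch; pitch-preserving
    stretching requires a phase vocoder (e.g. librosa's time_stretch)."""
    old_idx = np.arange(len(samples))
    new_len = int(len(samples) / rate)
    new_idx = np.linspace(0, len(samples) - 1, new_len)
    return np.interp(new_idx, old_idx, samples)

def add_noise(samples: np.ndarray, snr_db: float, seed=None) -> np.ndarray:
    """Mix in white Gaussian noise at a target signal-to-noise ratio."""
    rng = np.random.default_rng(seed)
    signal_power = np.mean(samples ** 2)
    noise_power = signal_power / (10.0 ** (snr_db / 10.0))
    noise = rng.normal(0.0, np.sqrt(noise_power), size=len(samples))
    return samples + noise
```

Applying a few such transformations with randomly drawn parameters to each original recording can multiply the effective size of a training set without new data collection, at the cost of the new samples being correlated with the originals.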






ISSN: 2307-8162