Comparative Study on End-to-End Speech Recognition Using Pre-trained Models

Document Type: Original Article

Authors

1 Electronics and Communication Department, Fayoum University, Fayoum, Egypt

2 Kyman Faryes Faculty of Engineering

3 Computers and Systems Engineering Department, Faculty of Engineering, Fayoum University, Fayoum, Egypt

Abstract

In the field of speech and audio signal processing, pre-trained models (PTMs) are widely available. PTMs provide a set of initial weights and biases that can be fine-tuned for a particular task, which makes them a popular starting point for machine learning model development. Representations from pre-trained models have achieved state-of-the-art performance in speech recognition, natural language processing, and other applications, and the embeddings obtained from these models serve as inputs to learning algorithms for a variety of downstream tasks. This study compares pre-trained models to show how they perform in Automatic Speech Recognition (ASR). The literature review indicates that self-supervised models based on Wav2Vec2.0 and fully supervised models such as Whisper are currently the principal paradigms and approaches for ASR. This study evaluates and compares these approaches to assess how well they perform across a wide range of test scenarios, and aims to serve as a practical guide for understanding, using, and building PTMs for different NLP tasks.
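Both model families in such a comparison, Wav2Vec2.0-based and Whisper-style systems alike, are conventionally scored with word error rate (WER). As background for the evaluation described above, a minimal sketch of this standard metric follows; it is a generic word-level edit distance, not code from the study itself.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: (substitutions + insertions + deletions) / reference length.

    Computed via a dynamic-programming word-level edit distance
    (Levenshtein) between the reference transcript and the ASR output.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    # d[i][j] = edit distance between the first i reference words
    # and the first j hypothesis words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution
    return d[len(ref)][len(hyp)] / len(ref)

# One deleted word out of a six-word reference gives WER = 1/6.
print(wer("the cat sat on the mat", "the cat sat on mat"))
```

Lower WER is better; a perfect transcript scores 0.0, and WER can exceed 1.0 when the hypothesis contains many insertions.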
