Speech assistants (Alexa, Siri, etc.) have found their way into people’s homes and established a natural and easy way of communicating with electronic devices. We at the Institute for Communications Technology conduct research in the field of automatic speech recognition (ASR), the core technology driving such devices, ranging from basic research to commercial ASR system building. Driven by recent advances in deep learning, we use state-of-the-art technologies in a diverse set of research and development projects, aiming at improved recognition accuracy and higher robustness of ASR engines in adverse acoustic conditions.
We as humans process information in a multi-modal fashion, meaning that we use multiple sources of available information (e.g., auditory, visual, experience, contextual knowledge) to understand and communicate with other humans. Common automatic speech recognition systems use only a single microphone to capture speech, yet are expected to recognize speech as precisely as humans do. We at the Signal Processing and Machine Learning group at IfN explore novel methods to incorporate additional information sources into ASR systems, using conventional hybrid ASR methods as well as the most recent end-to-end modeling approaches such as transformers.
 
The ubiquitous presence of electronic devices equipped with camera sensors offers an additional information source for automatic speech recognition: lip reading. By fusing the acoustic microphone signal with visual tracking of the speaker’s lip movements, the performance of speech recognition systems can be greatly improved, especially under adverse acoustic conditions, since the visual stream of information is not affected by acoustic noise. As a technology partner in the consortium of the large-scale national SPEAKER project (funded by the Bundesministerium für Wirtschaft und Energie, BMWi), we are developing such an audiovisual ASR method for a German privacy-driven speech assistant, jointly with 18 partners from German research institutions and industry.
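To make the idea of fusing an acoustic and a visual stream more concrete, the following is a minimal sketch of one generic option: weighted log-linear fusion of per-frame class posteriors. It is an illustration under assumptions, not the method developed in the SPEAKER project or in the publications listed here. The function name `fuse_posteriors`, the parameter `audio_weight`, and the toy probabilities are all hypothetical.

```python
import math


def fuse_posteriors(audio_post, video_post, audio_weight=0.7):
    """Weighted log-linear fusion of two per-frame posterior streams.

    audio_post, video_post: lists of frames; each frame is a list of class
    probabilities (same number of frames and classes in both streams).
    audio_weight: hypothetical stream weight in [0, 1]; the visual stream
    receives 1 - audio_weight.
    """
    if len(audio_post) != len(video_post):
        raise ValueError("both streams need the same number of frames")
    eps = 1e-12  # avoids log(0) for zero-probability classes
    fused = []
    for a_frame, v_frame in zip(audio_post, video_post):
        if len(a_frame) != len(v_frame):
            raise ValueError("both streams need the same number of classes")
        # Weighted sum of log-probabilities (a product in the linear domain).
        log_scores = [
            audio_weight * math.log(a + eps)
            + (1.0 - audio_weight) * math.log(v + eps)
            for a, v in zip(a_frame, v_frame)
        ]
        # Renormalize to a valid distribution via log-sum-exp.
        peak = max(log_scores)
        log_norm = peak + math.log(sum(math.exp(s - peak) for s in log_scores))
        fused.append([math.exp(s - log_norm) for s in log_scores])
    return fused


# Toy example (made-up numbers): the noisy audio stream is nearly undecided,
# while the visual stream clearly prefers class 1.
audio = [[0.40, 0.35, 0.25]]
video = [[0.10, 0.80, 0.10]]
fused = fuse_posteriors(audio, video, audio_weight=0.5)
best_class = max(range(len(fused[0])), key=fused[0].__getitem__)
```

In this toy case the audio stream alone would pick class 0, whereas the fused distribution favors class 1. In practice, the stream weight would typically be tuned or estimated from the acoustic conditions rather than fixed by hand.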
[1] S. Receveur, R. Weiss, T. Fingscheidt, Turbo Automatic Speech Recognition, IEEE/ACM Transactions on Audio, Speech, and Language Processing, Vol. 24, Issue 5, 2016
[2] T. Lohrenz, T. Fingscheidt, Turbo Fusion of Magnitude and Phase Information for DNN-Based Phoneme Recognition, Proceedings of ASRU Workshop, Okinawa, Japan, 2017
[3] M. Strake, P. Behr, T. Lohrenz, T. Fingscheidt, Densenet BLSTM for Acoustic Modeling in Robust ASR, Proceedings of SLT Workshop, Athens, Greece, 2018
[4] T. Lohrenz, T. Fingscheidt, BLSTM-Driven Stream Fusion for Automatic Speech Recognition: Novel Methods and a Multi-Size Window Fusion Example, Proceedings of INTERSPEECH, Shanghai, China, 2020
[5] T. Lohrenz, Z. Li, T. Fingscheidt, Multi-Encoder Learning and Stream Fusion for Transformer-Based End-to-End Automatic Speech Recognition, Proceedings of INTERSPEECH, Brno, Czech Republic, 2021 (https://arxiv.org/abs/2104.00120)