HAL 9000 reading lips

11/11/2023

It was not long after the advent of digital computers in the 1950s that the idea of using computers to recognize speech began to be investigated. In the ensuing years, researchers worldwide developed numerous techniques for treating the speech signal, giving rise today to a wide variety of tools such as Automatic Speech Recognition (ASR) applications, powerful speech compression tools, Text-To-Speech (TTS) synthesis, speaker identification and, more recently, diarization tools, to name only a few. Despite these enormous gains, though, we may rightfully speak of a new kind of revolution in speech processing today. Indeed, while speech has always been something of a “specialist” field, requiring fluency in topics such as Mel Frequency Cepstral Coefficients (MFCC), Gaussian Mixture Model-Hidden Markov Models (GMM-HMM), and the like, the staggering growth of Machine Learning techniques and an increasing preference for Open Source solutions are today propelling speech processing into the mainstream. Concomitant with this “democratization” of speech processing is a desire to free Future Speech Interfaces from some inherent difficulties that have traditionally handicapped speech applications.

The first article in the Audio-Visual Speech Recognition (AVSR) category is “Reliability-Based Large-Vocabulary Audio-Visual Speech Recognition” by Wentao Yu, Steffen Zeiler and Dorothea Kolossa, of the Institute of Communication Acoustics at Ruhr University in Bochum, Germany. The authors propose a novel dynamic stream weighting technique for combining the audio and visual input streams, in order to improve the robustness of AVSR in noisy conditions. Called the Dynamic Fusion Network (DFN), the approach combines aspects of existing decision and representation fusion strategies in a unified view, using the posterior probabilities of the single-modality models as representations of the uni-modal streams.
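
To make the idea of stream weighting concrete, here is a minimal sketch of the classical log-linear combination of audio and video posteriors. It is not the authors' network, which learns the fusion from data. It only illustrates the simpler weighting scheme that such approaches build on. The function name `fuse_posteriors`, the `audio_weight` parameter, and the example posteriors are all hypothetical and chosen for illustration.

```python
import math


def fuse_posteriors(audio_post, video_post, audio_weight):
    """Log-linear fusion of two posterior distributions for one frame.

    audio_weight lies in [0, 1]; the video stream receives 1 - audio_weight.
    (Hypothetical helper, not taken from the article.)
    """
    if not 0.0 <= audio_weight <= 1.0:
        raise ValueError("audio_weight must lie in [0, 1]")
    if len(audio_post) != len(video_post):
        raise ValueError("both streams must cover the same classes")

    eps = 1e-12  # avoids log(0) for classes with zero posterior
    log_scores = [
        audio_weight * math.log(a + eps)
        + (1.0 - audio_weight) * math.log(v + eps)
        for a, v in zip(audio_post, video_post)
    ]

    # Renormalise with a numerically stable softmax.
    peak = max(log_scores)
    unnorm = [math.exp(s - peak) for s in log_scores]
    total = sum(unnorm)
    return [u / total for u in unnorm]


# Made-up posteriors over three classes for a single frame.
audio = [0.2, 0.5, 0.3]  # the audio model favours class 1
video = [0.7, 0.2, 0.1]  # the lip-reading model favours class 0

clean = fuse_posteriors(audio, video, audio_weight=0.9)  # trust the audio
noisy = fuse_posteriors(audio, video, audio_weight=0.2)  # trust the video

print(clean.index(max(clean)), noisy.index(max(noisy)))
```

In a dynamic scheme the weight would not be fixed by hand as it is here. It would be re-estimated frame by frame from some indicator of how reliable each stream currently is, so that the video stream gains influence as the audio becomes noisier.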
