Researchers from Google have developed a new AI system that can pick out individual voices in a crowd by suppressing all other sounds.
This is a breakthrough because computers have long struggled to focus attention on a particular person in a noisy environment: automatic speech separation, the task of splitting an audio signal into its individual speech sources, remains a significant challenge. In a new paper, the researchers present a deep-learning audio-visual model for isolating a single speech signal from a mixture of sounds such as other voices and background noise. “In this work, we are able to computationally produce videos in which speech of specific people is enhanced while all other sounds are suppressed,” Mosseri and Lang said.
The method works on ordinary videos with a single audio track. Users select the face of the person in the video they want to hear; alternatively, the face can be selected algorithmically based on context. The researchers believe this capability can have a wide range of applications, from speech enhancement and recognition in videos, through video conferencing, to improved hearing aids, especially in situations where multiple people are speaking.
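To make the idea of enhancing one selected speaker concrete, here is a minimal NumPy sketch of mask-based separation on a toy spectrogram. This is not Google's model: in the actual system a deep network predicts a time-frequency mask for the selected face from combined audio and visual features, whereas here an ideal mask is computed directly from known toy sources purely for illustration.

```python
import numpy as np

# Simplified sketch of mask-based speech separation (not Google's actual model).
# A real audio-visual model would *predict* the mask from the mixed audio and
# the selected speaker's face; here we use the ideal ratio mask on toy data.

rng = np.random.default_rng(0)

# Toy magnitude spectrogram: 64 frequency bins x 100 time frames,
# with two "speakers" occupying different frequency bands.
speaker_a = np.zeros((64, 100))
speaker_a[:32, :] = rng.random((32, 100))    # target speaker: low bands
speaker_b = np.zeros((64, 100))
speaker_b[32:, :] = rng.random((32, 100))    # interfering speaker: high bands
mixture = speaker_a + speaker_b              # single mixed audio track

# Stand-in for the network's output: ideal ratio mask for the target speaker.
eps = 1e-8
mask = speaker_a / (mixture + eps)

# Applying the mask to the mixture suppresses the other speaker and
# recovers the target speaker's spectrogram.
separated = mask * mixture

print(np.allclose(separated, speaker_a, atol=1e-6))  # → True
```

The design point this illustrates: the model never synthesizes speech from scratch; it reweights the time-frequency content of the single mixed track so that only the selected speaker's energy survives.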
The visual signal not only significantly improves the speech separation quality in cases of mixed speech, but also associates the separated, clean speech tracks with the visible speakers in the video, the researchers said. “A unique aspect of our technique is in combining both the auditory and visual signals of an input video to separate the speech,” they noted. “Intuitively, movements of a person’s mouth, for example, should correlate with the sounds produced as that person is speaking, which in turn can help identify which parts of the audio correspond to that person,” they explained.
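The quoted intuition can be demonstrated on synthetic signals: the mouth motion of the person actually speaking correlates with the loudness of the mixed audio, while another face's motion does not. The signals and the correlation test below are entirely made up for illustration; the real model learns this association inside a neural network rather than via explicit Pearson correlation.

```python
import numpy as np

# Toy illustration of the mouth-motion/audio correlation intuition.
# All signals are synthetic; this is not the paper's model.

rng = np.random.default_rng(1)
frames = 200

# Energy envelope of the mixed audio track.
audio_envelope = np.abs(rng.standard_normal(frames))

# Face 1 is the one talking: its mouth motion tracks the envelope (plus noise).
mouth_motion_1 = audio_envelope + 0.1 * rng.standard_normal(frames)
# Face 2 is silent: its motion is unrelated.
mouth_motion_2 = np.abs(rng.standard_normal(frames))

def corr(a, b):
    """Pearson correlation between two equal-length signals."""
    a = a - a.mean()
    b = b - b.mean()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

c1 = corr(mouth_motion_1, audio_envelope)
c2 = corr(mouth_motion_2, audio_envelope)
print(c1 > c2)  # the talking face correlates far more strongly with the audio
```

This is why adding the visual stream also solves the labeling problem the researchers mention: once each audio component is tied to the face whose motion it tracks, the separated tracks come pre-assigned to visible speakers.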