
Showing posts from 2019

Speech recognition in Hangouts Meet

There are many possible applications for speech recognition in Real Time Communication services: live captions, simultaneous translation, voice commands, or storing and summarising audio conversations.

Speech recognition in the form of live captioning has been available in Hangouts Meet for some months, but it was recently promoted to a button in the main UI and I have started to use it almost every day.


I'm mostly interested in the recognition technology itself, and specifically in how to integrate DeepSpeech into RTC media servers to provide a cost-effective solution, but in this post I wanted to spend some time analysing how Hangouts Meet implemented captioning from a signalling point of view.

At a very high level there are at least three possible architectures for speech recognition in RTC services:

Implementing P2P-SFU transitions in WebRTC

One of the more disruptive aspects of WebRTC is the ability to establish P2P connections without any server involved in the media path. However, this doesn't scale well for multiparty audio/video calls, as the bandwidth and CPU required for a full mesh of N:N P2P connections are too much in most cases.
But supporting multiparty calls doesn't mean you shouldn't use P2P connections when there are only two participants in the room, switching to the SFU when a third participant joins or when you need to enable recording or broadcasting capabilities that are only available in the SFU.
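The decision of when to use each path can be sketched as a small piece of pure logic. This is only an illustration of the rule described above, not code from any real product; the `RoomState` shape and `selectMode` function are hypothetical names.

```typescript
// Hypothetical room state used to decide the media path.
type ConnectionMode = "p2p" | "sfu";

interface RoomState {
  participantCount: number;
  recording: boolean;     // features only available through the SFU
  broadcasting: boolean;
}

// Use P2P only with exactly two (or fewer) participants and no
// server-side feature enabled; otherwise route media through the SFU.
function selectMode(room: RoomState): ConnectionMode {
  const needsSfuFeatures = room.recording || room.broadcasting;
  if (room.participantCount <= 2 && !needsSfuFeatures) {
    return "p2p";
  }
  return "sfu";
}
```

In practice applying this decision also requires renegotiating the existing PeerConnections (or creating new ones) every time the result changes, which is where most of the implementation complexity lives.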



In fact, this approach of using P2P when only two participants are connected and switching to the SFU otherwise has been successfully implemented in many products in recent years (Jitsi, Hangouts, Facebook). The implementation of this feature is not trivial, but it provides enough advantages to be worth it: it minimises the network hops, reducing the end-to-end delay and the cha…