Showing posts from 2019

Speech recognition in Hangouts Meet

There are many possible applications for speech recognition in Real Time Communication (RTC) services: live captions, simultaneous translation, voice commands, or storing and summarising audio conversations. Speech recognition in the form of live captioning has been available in Hangouts Meet for some months, but it was recently promoted to a button in the main UI and I have started to use it almost every day. I'm mostly interested in the recognition technology, and specifically in how to integrate DeepSpeech into RTC media servers to provide a cost-effective solution, but in this post I wanted to spend some time analysing how Hangouts Meet implemented captioning from a signalling point of view. At a very high level there are at least three possible architectures for speech recognition in RTC services: A) On-device speech recognition: This is the cheapest option, but not all devices support it, the quality of the models is not as good as that of the cloud ones and it requir

Implementing P2P-SFU transitions in WebRTC

One of the more disruptive aspects of WebRTC is the ability to establish P2P connections without any server involved in the media path. However, this doesn’t scale well for multiparty audio/video calls, as the bandwidth and CPU required for a full mesh of N:N P2P connections is too much in most cases. But supporting multiparty calls doesn’t mean that you shouldn’t use P2P connections when there are only two participants in the room, and switch to the SFU when a third participant joins or when you need recording or broadcasting capabilities only available in the SFU. In fact, this approach of using P2P when only two participants are connected and switching to the SFU otherwise has been successfully implemented in many products in recent years (Jitsi, Hangouts, Facebook). The implementation of this feature is not trivial, but it provides enough advantages to be worth it: Minimise the network hops reducing the end to en