Speech recognition in Hangouts Meet

There are many possible applications for speech recognition in Real Time Communication services: live captions, simultaneous translation, voice commands, or storing/summarising audio conversations. Speech recognition in the form of live captioning has been available in Hangouts Meet for some months, but recently it was promoted to a button in the main UI and I have started to use it almost every day. I'm mostly interested in the recognition technology, and specifically in how to integrate DeepSpeech in RTC media servers to provide a cost-effective solution, but in this post I wanted to spend some time analysing how Hangouts Meet implemented captioning from a signalling point of view. At a very high level there are at least three possible architectures for speech recognition in RTC services: A) On-device speech recognition: this is the cheapest option, but not all devices support it, the quality of the models is not as good as the cloud ones, and it requires …
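Whatever the architecture, live captioning UIs generally consume a stream of recognition results in which interim hypotheses are repeatedly replaced until a final result arrives (this is how both the browser Web Speech API and typical cloud recognizers report text). A minimal sketch of that accumulation logic, with field names of my own choosing rather than anything Meet actually uses:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class CaptionBuffer:
    """Accumulates ASR results: final texts are appended, interim ones replaced."""
    finals: List[str] = field(default_factory=list)
    interim: str = ""

    def on_result(self, text: str, is_final: bool) -> None:
        if is_final:
            self.finals.append(text)
            self.interim = ""    # a final result supersedes any interim text
        else:
            self.interim = text  # each interim hypothesis overwrites the previous one

    def render(self) -> str:
        """Text to display as the current caption line."""
        return " ".join(self.finals + ([self.interim] if self.interim else []))

buf = CaptionBuffer()
buf.on_result("hello", is_final=False)
buf.on_result("hello world", is_final=False)
buf.on_result("hello world", is_final=True)
buf.on_result("how are", is_final=False)
print(buf.render())  # hello world how are
```

The same pattern applies regardless of where recognition runs (device, server, or cloud); only the transport of the result events changes.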

Implementing P2P-SFU transitions in WebRTC

One of the more disruptive aspects of WebRTC is the ability to establish P2P connections without any server involved in the media path. However, this doesn't scale well for multiparty audio/video calls, as the bandwidth and CPU required for a full mesh of N:N P2P connections is too much in most cases. But the fact that you support multiparty calls doesn't mean that you shouldn't consider using P2P connections when there are only two participants in the room, switching back to the SFU when the third participant joins or when you need to enable some recording or broadcasting capabilities only available in the SFU. In fact, this approach of using P2P when only two participants are connected and switching to the SFU has been successfully implemented in many products in recent years (Jitsi, Hangouts, Facebook). The implementation of this feature is not trivial but it provides enough advantages to be worth it: minimise the network hops, reducing the end-to-end …
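The switching rule described above can be captured in a few lines: stay P2P while exactly two participants are connected and no server-side feature is needed, otherwise route through the SFU. A hedged sketch (the function and flag names are illustrative, not any particular product's API):

```python
def select_topology(num_participants: int,
                    recording: bool = False,
                    broadcasting: bool = False) -> str:
    """Return 'p2p' or 'sfu' following the two-participant rule.

    P2P is only viable for exactly two peers, and only when no
    server-side capability (recording, broadcasting) is required.
    """
    if num_participants == 2 and not (recording or broadcasting):
        return "p2p"
    return "sfu"

# The third participant joining, or enabling recording, forces the SFU path:
print(select_topology(2))                  # p2p
print(select_topology(3))                  # sfu
print(select_topology(2, recording=True))  # sfu
```

The hard part in practice is not this decision but the renegotiation it triggers: a new offer/answer exchange and ICE restart to move the media from the direct connection to the SFU (or back) without a noticeable gap.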

Sending Packet Loss Feedback in WebRTC SFUs

One of the responsibilities of WebRTC SFUs is to receive and send RTCP packets. RTCP packets include different types of feedback about audio and video streams, and one of the most important RTCP packets is the Receiver Report (RR). RR packets are sent from the receiver of a media stream towards the sender of that media stream. In the case of an SFU, RRs are generated and sent both from the SFU to the media stream sender and from every stream receiver to the SFU (Figure 1). The feedback sent inside RR packets includes fields to calculate the round-trip time, the jitter and the packet loss introduced by the network. The packet loss reported in these RR packets is important because the audio and video being sent will be adjusted based on that parameter: in the case of audio streams, network packet loss modifies the robustness level of the Opus codec. In the presence of high packet loss the sender increases the level of redundancy of the forward error correction …
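The packet-loss figure an RR carries is the 8-bit "fraction lost" field defined in RFC 3550: packets lost during the reporting interval divided by packets expected in that interval, scaled by 256 and clamped at zero (duplicates and reordering can make the raw loss count negative). A sketch of that computation, following the integer arithmetic of RFC 3550 Appendix A.3, with variable names of my own choosing:

```python
def fraction_lost(expected_interval: int, received_interval: int) -> int:
    """Compute the RFC 3550 'fraction lost' field for one RR interval.

    expected_interval: packets expected since the last report
    received_interval: packets actually received in that interval
    Returns an 8-bit fixed-point value: (lost / expected) * 256, truncated.
    """
    if expected_interval == 0:
        return 0
    lost = expected_interval - received_interval
    if lost <= 0:  # duplicates/reordering can make this negative
        return 0
    return (lost << 8) // expected_interval  # integer math, as in RFC 3550 A.3

# 5 packets lost out of 100 expected -> 256 * 5 / 100 = 12 (truncated)
print(fraction_lost(100, 95))  # 12
```

An SFU that terminates RTCP has to compute this field itself for the RRs it sends upstream, rather than simply forwarding the values reported by the downstream receivers.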