Speech recognition in Hangouts Meet

There are many possible applications for speech recognition in Real Time Communication services: live captions, simultaneous translation, voice commands, or storing and summarising audio conversations.

Speech recognition in the form of live captioning has been available in Hangouts Meet for some months, but it was recently promoted to a button in the main UI and I have started to use it almost every day.

I'm mostly interested in the recognition technology itself, and specifically in how to integrate DeepSpeech in RTC media servers to provide a cost-effective solution, but in this post I wanted to spend some time analysing how Hangouts Meet implemented captioning from a signalling point of view.

At a very high level there are at least three possible architectures for speech recognition in RTC services:

Implementing P2P-SFU transitions in WebRTC

One of the most disruptive aspects of WebRTC is the ability to establish P2P connections without any server involved in the media path. However, this doesn't scale well for multiparty audio/video calls, as the bandwidth and CPU required for a full mesh of N:N P2P connections is prohibitive in most cases.
But the fact that you support multiparty calls doesn't mean that you shouldn't use P2P connections when there are only two participants in the room, switching to the SFU when a third participant joins or when you need to enable recording or broadcasting capabilities that are only available in the SFU.
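To see why the full mesh stops scaling, it helps to count streams per participant in each topology. The sketch below is an illustrative calculation (the function names are mine, not from any WebRTC API): in a mesh each client sends and receives N-1 streams, while with an SFU each client sends a single stream and only receives N-1.

```python
def mesh_streams(n: int) -> tuple[int, int]:
    """Full mesh: each participant sends to and receives from
    every other participant -> (N-1 upstream, N-1 downstream)."""
    return n - 1, n - 1


def sfu_streams(n: int) -> tuple[int, int]:
    """SFU: each participant sends one stream to the server and
    receives N-1 forwarded streams -> (1 upstream, N-1 downstream)."""
    return 1, n - 1


for n in (2, 4, 8):
    print(n, "mesh:", mesh_streams(n), "sfu:", sfu_streams(n))
```

With two participants both topologies cost the same per client, which is exactly why P2P is attractive for the two-party case.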

In fact, this approach of using P2P when only two participants are connected and switching to the SFU has been successfully implemented in many products in recent years (Jitsi, Hangouts, Facebook). The implementation of this feature is not trivial, but it provides enough advantages to be worth it: minimising the network hops reduces the end-to-end delay and the cha…
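The decision logic behind such a transition can be sketched as a small state machine. This is a hypothetical illustration (not the actual signalling of any of the products mentioned, and the `Room` class is invented for this example): the room stays in P2P mode while there are at most two participants and no server-side feature is needed, and moves to the SFU otherwise.

```python
class Room:
    """Toy model of a room that switches between P2P and SFU modes."""

    def __init__(self):
        self.participants = set()
        self.recording = False
        self.mode = "p2p"

    def _update_mode(self):
        # SFU is required for >2 participants or server-side features
        # such as recording/broadcasting; otherwise P2P is enough.
        if len(self.participants) > 2 or self.recording:
            self.mode = "sfu"
        else:
            self.mode = "p2p"

    def join(self, who):
        self.participants.add(who)
        self._update_mode()

    def leave(self, who):
        self.participants.discard(who)
        self._update_mode()

    def start_recording(self):
        self.recording = True
        self._update_mode()
```

In a real implementation each mode change implies renegotiating the underlying peer connections, so some products deliberately avoid switching back from SFU to P2P to limit renegotiation glitches; the sketch above switches in both directions for simplicity.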

Sending Packet Loss Feedback in WebRTC SFUs

One of the responsibilities of WebRTC SFUs is to receive and send RTCP packets. RTCP packets carry different types of feedback about audio and video streams, and one of the most important RTCP packet types is the Receiver Report (RR).

RR packets are sent from the receiver of a media stream towards the sender of that media stream. In the case of an SFU, RRs are generated and sent from the SFU to the media stream sender, and also from every stream receiver to the SFU (Figure 1).
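To make the packet concrete, here is a sketch of how an RTCP Receiver Report with a single report block is laid out on the wire, following RFC 3550 (PT=201; the helper function and example SSRC values are mine): a 4-byte header, the reporting endpoint's SSRC, then a 24-byte report block per reported source.

```python
import struct


def build_rr(sender_ssrc, source_ssrc, fraction_lost, cumulative_lost,
             highest_seq, jitter, lsr, dlsr):
    """Pack a minimal RTCP RR (RFC 3550, PT=201) with one report block."""
    # Header: V=2, P=0, RC=1, PT=201, length=7 (32-bit words minus one:
    # the whole packet is 32 bytes = 8 words).
    header = struct.pack("!BBH", (2 << 6) | 1, 201, 7)
    # Report block: source SSRC, fraction lost (8 bits) packed with the
    # 24-bit cumulative loss, extended highest sequence, jitter, LSR, DLSR.
    block = struct.pack(
        "!IIIIII",
        source_ssrc,
        (fraction_lost << 24) | (cumulative_lost & 0xFFFFFF),
        highest_seq, jitter, lsr, dlsr,
    )
    return header + struct.pack("!I", sender_ssrc) + block
```

An SFU terminating RTCP would build packets like this towards the sender, using the statistics it measures itself on the receiving leg rather than forwarding the receivers' RRs verbatim.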

The feedback sent inside RR packets includes fields to calculate the round-trip time, the jitter, and the packet loss introduced by the network.
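Two of those fields are computed with well-defined formulas in RFC 3550, sketched below (the function names are mine): the fraction lost is an 8-bit fixed-point ratio of packets lost to packets expected since the last report, and the interarrival jitter is a running estimate updated as J += (|D| - J) / 16, where D is the difference in relative transit time between consecutive packets.

```python
def fraction_lost(expected_interval: int, received_interval: int) -> int:
    """Fraction of packets lost since the last RR, as an 8-bit
    fixed-point number (RFC 3550, Appendix A.3)."""
    lost = expected_interval - received_interval
    if expected_interval == 0 or lost <= 0:
        return 0
    return (lost << 8) // expected_interval


def update_jitter(jitter: float, transit: float, prev_transit: float) -> float:
    """Interarrival jitter estimator: J += (|D| - J) / 16
    (RFC 3550, Appendix A.8)."""
    d = abs(transit - prev_transit)
    return jitter + (d - jitter) / 16.0
```

The round-trip time, in turn, is derived from the LSR and DLSR fields: the sender subtracts the LSR timestamp and the DLSR delay from the arrival time of the RR.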