Speech recognition in Hangouts Meet

There are many possible applications for speech recognition in Real Time Communication services: live captions, simultaneous translation, voice commands, or storing and summarising audio conversations.

Speech recognition in the form of live captioning has been available in Hangouts Meet for some months, but it was recently promoted to a button in the main UI and I have started using it almost every day.


I'm mostly interested in the recognition technology, and specifically in how to integrate DeepSpeech into RTC media servers to provide a cost-effective solution, but in this post I wanted to spend some time analysing how Hangouts Meet implements captioning from a signalling point of view.

At a very high level there are at least three possible architectures for speech recognition in RTC services:

A) On-device speech recognition: This is the cheapest option, but not all devices support it, the quality of the models is not as good as that of the cloud ones, and it requires some additional CPU usage that can be a problem for limited devices.
B) Recognition in a separate server: This is very inefficient in terms of network usage because the client needs to send the audio twice in parallel; it is expensive for the service provider but doesn't require changes in the media server.
C) Recognition from the media server: This is very efficient from the client and network point of view, but it requires changes in the media server and is expensive for the service provider.

Given that Google owns its own speech recognition service, and that cost is therefore probably not its biggest concern, the most reasonable approach for Hangouts Meet looks like Option C. So I tried to confirm that and see how the transcriptions are sent between the media server and the browser.

The first thing I did was to take a look at the minified code and at the HTTP requests sent by the Hangouts Meet web page during the conversation, but I couldn't find any reference to stt/transcription/recognition or anything like that. I also ran a very simple WebSpeech snippet in the browser console while connected to Hangouts and saw that the results were different in the Hangouts page and in the on-device recognition (they were better in the page).

So next I decided to take a look at the DataChannels in chrome://webrtc-internals and voilà: there were a lot of messages received while I was talking, so those had to be the messages with the speech recognition content.

After that I checked the webrtc-internals events and saw that a specific DataChannel is created for the captions the first time you enable them, in addition to the default DataChannel that is always open in Hangouts Meet.

To try to figure out the exact options of that DataChannel and the format of the information sent in those messages, I replaced the RTCPeerConnection.createDataChannel API with my own, using a simple snippet in the browser console to intercept those calls.

With that in place I saw that the unreliable DataChannel for the captions is created with maxRetransmits=10 and that the payload is binary data, probably in protobuf format. Still, we can convert it to a string and see a couple of fields with the user id and the text inside:
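As a rough illustration (the helper name and the framing regex are mine, and the real message is a protobuf that would need proper decoding), the binary payload can be dumped as text like this:

```javascript
// Sketch: decode a binary caption message (an ArrayBuffer) as UTF-8 and
// replace the non-printable protobuf framing bytes with spaces, so the
// readable fields (user id, transcribed text) stand out.
function dumpCaptionMessage(data) {
  const text = new TextDecoder('utf-8', { fatal: false }).decode(new Uint8Array(data));
  return text.replace(/[^\x20-\x7e]+/g, ' ').trim();
}
```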


And that's all for today. In the next post I'll try to share some information about my DeepSpeech experiments. As usual, feedback is more than welcome, either here or on Twitter.



