Speech recognition in Hangouts Meet

There are many possible applications for speech recognition in Real Time Communication services: live captions, simultaneous translation, voice commands, or storing and summarising audio conversations.

Speech recognition in the form of live captioning has been available in Hangouts Meet for some months, but it was recently promoted to a button in the main UI and I have started to use it almost every day.

I'm mostly interested in the recognition technology, and specifically in how to integrate DeepSpeech into RTC media servers to provide a cost-effective solution, but in this post I wanted to spend some time analysing how Hangouts Meet implemented captioning from a signalling point of view.

At a very high level there are at least three possible architectures for speech recognition in RTC services:

A) On-device speech recognition: This is the cheapest option, but not all devices support it, the quality of the models is not as good as that of the cloud ones and it requir