Improving Real Time Communications with Machine Learning
When we talk about the applications of Artificial Intelligence / Machine Learning (AI/ML) for Real Time Communications (RTC), we can group them into two different planes:
- Service Level: There are many features that can be added to a videoconference service, for example participant identification, augmented reality, emotion detection, speech transcription, or audio translation. These features are usually based on image and speech recognition and natural language processing.
- Infrastructure Level: There are many ways to apply ML that do not provide new features but instead improve the quality and/or reliability of the audio/video transmission.
Service-level applications are fun, but they are more of a Product Manager's territory, and I'm more interested in the technology itself. So in the next sections I will describe possible applications of AI/ML for Real Time Communications at the infrastructure level, organizing those ideas into five different categories.
Optimizing video quality
The first way would be to select the best possible encoding parameters for a specific video, or for a specific region of the frame. For example, if we can detect the most important parts of the scene (typically the talking head), we can use better encoding quality (a lower quantization level) for those areas. As another example, we can detect the type of content and prioritize framerate over quality, or the reverse, depending on whether it is a high-motion video or a typical conversation.
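To make the first idea concrete, here is a minimal sketch of region-of-interest encoding: given a detected face rectangle, we build a per-macroblock quantization map that assigns a lower QP (higher quality) to blocks overlapping the face and a higher QP elsewhere. The function name, block size, and QP values are illustrative, not taken from any specific encoder API.

```python
import numpy as np

def build_qp_map(frame_shape, roi, base_qp=32, roi_qp=24, block=16):
    """Build a per-macroblock quantization map for a frame.

    Blocks overlapping the region of interest (e.g. a detected
    talking head) get a lower QP, i.e. better quality; the rest
    of the frame keeps the base QP.

    frame_shape: (height, width) in pixels.
    roi: (x, y, w, h) rectangle in pixels.
    All names and defaults are illustrative.
    """
    h, w = frame_shape
    rows, cols = h // block, w // block
    qp_map = np.full((rows, cols), base_qp, dtype=np.int32)
    x, y, rw, rh = roi
    # Convert the pixel rectangle to an inclusive block range,
    # rounding the far edge up so partially covered blocks count.
    r0, r1 = y // block, min(rows, -(-(y + rh) // block))
    c0, c1 = x // block, min(cols, -(-(x + rw) // block))
    qp_map[r0:r1, c0:c1] = roi_qp
    return qp_map
```

A real encoder integration would feed such a map into the rate-control module (many encoders accept per-block QP offsets), with the ROI coming from a lightweight face or saliency detector running on downscaled frames.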
The second way could be to reduce the amount of information being sent by removing information that can be regenerated by the receiver. As an extreme example: everyone knows what human hair looks like, so why not send a lower-quality version and reconstruct the hair detail at the receiver? One example of this approach can be seen in the RAISR demos by Google.
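The pipeline behind this idea is: the sender transmits a downscaled frame, and the receiver upscales it, ideally with a learned super-resolution model (as RAISR does with trained per-patch filters). The sketch below uses plain bilinear interpolation as a stand-in for the learned model, just to show where such a model would plug in; it is not the RAISR algorithm itself.

```python
import numpy as np

def upsample_2x(img):
    """Receiver-side 2x upscaling of a grayscale frame.

    This bilinear interpolation is a placeholder: in a RAISR-style
    system, a trained model would replace (or refine) this step to
    hallucinate plausible high-frequency detail such as hair texture.
    """
    img = np.asarray(img, dtype=np.float64)
    h, w = img.shape
    # Source coordinates for each output pixel, clipped to the image.
    ys = (np.arange(2 * h) / 2).clip(0, h - 1)
    xs = (np.arange(2 * w) / 2).clip(0, w - 1)
    y0 = np.floor(ys).astype(int)
    y1 = np.minimum(y0 + 1, h - 1)
    x0 = np.floor(xs).astype(int)
    x1 = np.minimum(x0 + 1, w - 1)
    wy = (ys - y0)[:, None]
    wx = (xs - x0)[None, :]
    # Blend the four neighboring source pixels.
    top = img[y0][:, x0] * (1 - wx) + img[y0][:, x1] * wx
    bot = img[y1][:, x0] * (1 - wx) + img[y1][:, x1] * wx
    return top * (1 - wy) + bot * wy
```

The bandwidth win comes from the sender encoding a frame with a quarter of the pixels; the quality then depends entirely on how well the receiver-side model reconstructs the missing detail.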