Improving Real Time Communications with Machine Learning
When we talk about the applications of Artificial Intelligence / Machine Learning (AI/ML) for Real Time Communications (RTC), we can group them into two different planes:
- Service Level: There are many features that can be added to a videoconference service, for example participant identification, augmented reality, emotion detection, speech transcription, or audio translation. These features are usually based on image and speech recognition and natural language processing.
- Infrastructure Level: There are many ways to apply ML that do not provide new features but instead improve the quality and/or reliability of the audio/video transmission.
Service-level applications are fun, but they are more of a Product Manager's territory, and I'm more interested in the technology itself. So in the next sections I will describe possible applications of AI/ML for Real Time Communications at the infrastructure level, organizing those ideas into five different categories.
Optimizing video quality
The first way would be to select the best possible encoding parameters for a specific video, or for a specific region of the frame. For example, if we can detect the most important parts of the scene (typically the talking head), we can use better encoding quality (a lower quantization level) for those areas. As another example, we can detect the type of content and prioritize framerate over quality, or the reverse, depending on whether it is a high-motion video or a typical conversation.
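To make the first idea concrete, here is a minimal sketch of region-of-interest encoding: given a detected face rectangle, we build a per-macroblock quantization map that assigns a lower QP (higher quality) to blocks overlapping the face and a higher QP elsewhere. The function name, block size, and QP values are illustrative, not taken from any specific encoder API.

```python
import numpy as np

def build_qp_map(frame_shape, roi, base_qp=32, roi_qp=24, block=16):
    """Build a per-macroblock quantization map for a frame.

    Blocks overlapping the region of interest (e.g. a detected
    talking head) get a lower QP, i.e. better quality; the rest
    of the frame keeps the base QP.

    frame_shape: (height, width) in pixels.
    roi: (x, y, w, h) rectangle in pixels.
    All names and defaults are illustrative.
    """
    h, w = frame_shape
    rows, cols = h // block, w // block
    qp_map = np.full((rows, cols), base_qp, dtype=np.int32)
    x, y, rw, rh = roi
    # Convert the pixel rectangle to an inclusive block range,
    # rounding the far edge up so partially covered blocks count.
    r0, r1 = y // block, min(rows, -(-(y + rh) // block))
    c0, c1 = x // block, min(cols, -(-(x + rw) // block))
    qp_map[r0:r1, c0:c1] = roi_qp
    return qp_map
```

A real encoder integration would feed such a map into the rate-control module (many encoders accept per-block QP offsets), with the ROI coming from a lightweight face or saliency detector running on downscaled frames.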
The second way could be to reduce the amount of information being sent by removing information that can be regenerated by the receiver. As an extreme example: everyone knows what human hair looks like, so why not send a lower-quality version and reconstruct the hair detail at the receiver? One example of this approach can be seen in the RAISR demos by Google.
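The pipeline behind this idea is: the sender transmits a downscaled frame, and the receiver upscales it, ideally with a learned super-resolution model (as RAISR does with trained per-patch filters). The sketch below uses plain bilinear interpolation as a stand-in for the learned model, just to show where such a model would plug in; it is not the RAISR algorithm itself.

```python
import numpy as np

def upsample_2x(img):
    """Receiver-side 2x upscaling of a grayscale frame.

    This bilinear interpolation is a placeholder: in a RAISR-style
    system, a trained model would replace (or refine) this step to
    hallucinate plausible high-frequency detail such as hair texture.
    """
    img = np.asarray(img, dtype=np.float64)
    h, w = img.shape
    # Source coordinates for each output pixel, clipped to the image.
    ys = (np.arange(2 * h) / 2).clip(0, h - 1)
    xs = (np.arange(2 * w) / 2).clip(0, w - 1)
    y0 = np.floor(ys).astype(int)
    y1 = np.minimum(y0 + 1, h - 1)
    x0 = np.floor(xs).astype(int)
    x1 = np.minimum(x0 + 1, w - 1)
    wy = (ys - y0)[:, None]
    wx = (xs - x0)[None, :]
    # Blend the four neighboring source pixels.
    top = img[y0][:, x0] * (1 - wx) + img[y0][:, x1] * wx
    bot = img[y1][:, x0] * (1 - wx) + img[y1][:, x1] * wx
    return top * (1 - wy) + bot * wy
```

The bandwidth win comes from the sender encoding a frame with a quarter of the pixels; the quality then depends entirely on how well the receiver-side model reconstructs the missing detail.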