Measuring WebRTC video quality for different bitrates - Playing with VMAF

October 05, 2018

I've been wanting to play with Netflix's Video Multi-Method Assessment Fusion (VMAF) for a while, and yesterday I found the time and the motivation to give it a try.

Netflix VMAF is an algorithm that generates a video quality score by comparing a reference image/video with a distorted image/video. To do that, VMAF calculates scores using traditional image quality metrics like VIF or DLM and then aggregates them using a Machine Learning model (SVM) trained with videos and the scores given to them by real users. Smart, isn't it? (You can see a high-level description of the metrics that are aggregated in this Netflix post or the Wikipedia page)

It is important to note that VMAF works on a per-frame basis, so it is NOT a good tool to measure the quality impact of many of the artefacts that happen in Real Time Communications (delays, reduced/frozen framerate, audio/video desync). However, we can use it to measure the impact of different encoding settings, like the average bitrate of the encoding.

As you would expect from something coming from Netflix, VMAF was designed for video streaming and trained using images from movies. Anyway, I was interested in seeing how it performs with mobile videoconference-like videos, so I recorded a short, typical VGA video of a talking face with a not very stable camera.

I re-encoded that sample video in VP8 using ffmpeg with different bitrates (50 kbps, 100 kbps, 200 kbps, 400 kbps, 600 kbps, 800 kbps, 1.2 Mbps, 2 Mbps), then used the ffmpeg2vmaf command line tool to calculate the score of each of those videos and presented the scores in the following graph:

[Graph: VMAF scores]
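The encode-and-score loop described above can be sketched in a few lines of Python. This is only a dry run that prints the commands, not the exact script used for this post: the file names (`reference.mp4`, `distorted_<N>k.webm`) are hypothetical, the `-an` flag is my own choice to drop audio, and the `ffmpeg2vmaf` argument order (width, height, reference, distorted, `--out-fmt json`) is an assumption you should check against the version of the tool you have installed.

```python
import shlex

# Target bitrates in kbps, taken from the list in the post.
BITRATES_KBPS = [50, 100, 200, 400, 600, 800, 1200, 2000]


def encode_command(reference, bitrate_kbps):
    """Build an ffmpeg command that re-encodes `reference` to VP8."""
    output = f"distorted_{bitrate_kbps}k.webm"  # hypothetical naming scheme
    cmd = [
        "ffmpeg", "-y",
        "-i", reference,
        "-c:v", "libvpx",            # VP8 encoder
        "-b:v", f"{bitrate_kbps}k",  # target average video bitrate
        "-an",                       # drop audio, only video is scored
        output,
    ]
    return cmd, output


def vmaf_command(reference, distorted, width=640, height=480):
    """Build an ffmpeg2vmaf command (argument order is an assumption)."""
    return ["ffmpeg2vmaf", str(width), str(height), reference, distorted,
            "--out-fmt", "json"]


def build_plan(reference):
    """Return the encode and score commands for every target bitrate."""
    plan = []
    for kbps in BITRATES_KBPS:
        encode, output = encode_command(reference, kbps)
        plan.append(encode)
        plan.append(vmaf_command(reference, output))
    return plan


if __name__ == "__main__":
    # Print the commands instead of running them, so they can be reviewed first.
    for cmd in build_plan("reference.mp4"):
        print(shlex.join(cmd))
```

The default 640x480 matches the VGA resolution of the test video. Once the printed commands look right for your setup, each one could be passed to `subprocess.run(cmd, check=True)`.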

What we can see in this test is that beyond 600 kbps (or even a little bit lower) the quality improvement is not that high, and beyond 1200 kbps it is barely noticeable. Remember that these results are based on the default VMAF model (not tuned for videoconference videos) and on my specific test video, but they don't look very different from what our experience with real users in production tells us.
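If you want to turn that "diminishing returns" reading of the graph into a number, one possible approach is to look at the VMAF gain per extra 100 kbps between consecutive measurements. The sketch below is my own illustration, not part of VMAF: the function names and the `min_gain` threshold of 1 VMAF point per 100 kbps are hypothetical, and it expects your own list of `(bitrate_kbps, vmaf_score)` measurements with distinct bitrates.

```python
def gain_per_100kbps(points):
    """Return (from_kbps, to_kbps, gain) for each consecutive pair.

    `points` is a list of (bitrate_kbps, vmaf_score) tuples with distinct
    bitrates. `gain` is the VMAF improvement normalised to a 100 kbps step.
    """
    ordered = sorted(points)
    gains = []
    for (b0, s0), (b1, s1) in zip(ordered, ordered[1:]):
        gains.append((b0, b1, (s1 - s0) * 100.0 / (b1 - b0)))
    return gains


def knee_bitrate(points, min_gain=1.0):
    """Lowest bitrate after which every further step gains < `min_gain`.

    `min_gain` is a hypothetical threshold in VMAF points per 100 kbps.
    If even the last step gains at least `min_gain`, the highest measured
    bitrate is returned. `points` must not be empty.
    """
    ordered = sorted(points)
    knee = ordered[-1][0]
    # Walk from the highest bitrate down while the steps stay "flat".
    for b0, _b1, gain in reversed(gain_per_100kbps(ordered)):
        if gain >= min_gain:
            break
        knee = b0
    return knee
```

Calling `knee_bitrate(measurements)` on your own scores gives a candidate max bitrate to validate with real users. The threshold is a judgment call and depends on how much a VMAF point is worth in your use case.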

This kind of test can help us decide the max bitrate we want to use for our WebRTC conferences, although there are other implications beyond quality, like battery consumption, and the results depend on the type of video, the use case and how picky the users of your application are. That's the reason why we have to be careful when playing with video bitrates in production. As an example, Facebook explained how increasing the bitrate led to lower user scores because of the implications for battery consumption. At Houseparty we always do A/B testing to quantify the impact of any relevant change like this and decide the optimal video bitrate for our specific use case.

Google is including VMAF in the WebRTC test suite and implementing some frame alignment to overcome the limitation of having to compare a specific reference frame with the corresponding distorted one. It would be nice if in the future we could expand the VMAF idea by including new metrics in the ML model to account for delays, framerate or audio/video desynchronization. That core idea of multi-method fusion looks very powerful!

You can follow me on Twitter if you are interested in Real Time Communications.