Media Ingestion protocols Review

This post summarises the most relevant protocols used for media ingestion. In these media ingestion flows, a client that is broadcasting some content sends it to an endpoint (typically a server) that re-formats it for distribution to a potentially large audience, using a different protocol optimised for media playback. The following sections show the main ideas behind each protocol as well as some of their advantages and disadvantages. The protocols are organised into three buckets (traditional protocols, modern protocols and next generation protocols) that correspond to three waves of protocols: the first one based on TCP, the second on UDP or HTTP and the third on QUIC. You can see a summary of those characteristics as well as the timeline and adoption in the following diagram. Traditional protocols RTMP RTMP (Real-Time Messaging Protocol) is the most widely used protocol for media ingestion nowadays. It was developed more than 20 years ago and popularised by Ado

Different types of latency measurements in WebRTC

When building WebRTC services, one of the most important metrics for measuring the user experience is the latency of the communications. Latency is important because it has an impact on conversational interactivity, but also on video quality when using retransmissions (which is the most common case), because the effectiveness of retransmissions depends on how fast you get them. And to be fair, at the end of the day latency is what differentiates Real Time Communications from other types of communications and protocols, like the ones used for streaming use cases that are less sensitive to delays, so it is clear that latency is an important metric to track. However, there is no single measurement of latency, and different platforms, APIs and people usually measure different types of latency. From what I've seen in the past, the differences fall along the four axes described below. One Hop latency vs End to End latency When there are multiple servers involved in a conversation the na

Screensharing content detection

One interesting feature in WebRTC is the ability to configure a content hint for the media tracks so that WebRTC can optimize the transmission for that specific type of content. That way, if the content hint is "text" it will try to optimize the transmission for readability, and if the content hint is "motion" it will try to optimize the transmission for fluidity, even if that means reducing the resolution or the definition of the video. This is especially useful when sharing documents or slides, where the "crispness" of the text is very important for the user experience. You can see the impact of those hints on the video encoding in this screenshot taken from the W3C spec: This is very useful but there is a small problem. What happens when we don't know the type of content being shared? How do we know if the browser tab being shared has some static text slides or a YouTube video being played? One possible option could be to do some type of image pro

New look at WebRTC usage in Google Meet

I hadn't looked at Google Meet's WebRTC internals for a while, so while I was having a meeting last week I decided that it was a good time to check what the latest changes were. P2P Connections One of the first things that I checked was whether Google Meet was using P2P connections when there are only two participants in the room, and I was surprised that it was not the case. P2P support was included in the past (P2P-SFU transitions discussion) but apparently has been removed. This increases the infrastructure cost (not an issue for Google) and slightly increases the end to end latency for 1:1 calls, but given that Google Meet is probably deployed in many points of presence that's probably not a big increase, and the simplicity of not having to handle another type of connection and the transition between them (P2P <-> SFU) is a big benefit, so it looks reasonable. ICE candidates and (NO) TURN servers Google Meet is not configuring any ICE servers anymore and t

Handling back-pressure in RTC services

In the context of software, back-pressure refers to actions taken by systems to “push back” downstream forces. As such, it is a defensive action taken unilaterally while under duress or if the aggregate call pattern exhibits too many spikes, or is too bursty. This approach is commonly used in microservices infrastructures, but we don't talk much about it in the context of Real-Time Communication platforms, even though it can be equally important there to handle spikes of load while keeping the quality of service within reasonable limits. During times of spikes or burstiness, the quality could be slightly degraded but the service should still be usable and the servers should never go down. The typical techniques to apply back-pressure and get some protection against those unsustainable patterns of traffic are buffering, dropping, and controlling the sender: Buffering : Messages can be queued and processed at a lower rate when needed to limit the impact of spikes or high load. Dropping : Mes
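
The three techniques named in this excerpt (buffering, dropping, and controlling the sender) can be combined in a single bounded queue. The following is only a minimal sketch, not code from the post: the class name `BackPressureQueue`, the drop-oldest policy, and the 75% high watermark are all assumptions chosen for illustration.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class BackPressureQueue:
    """Hypothetical bounded buffer illustrating three back-pressure techniques."""

    capacity: int = 4
    high_watermark: float = 0.75  # assumed fill ratio at which the sender is asked to slow down
    dropped: int = 0
    _items: deque = field(default_factory=deque)

    def push(self, message: str) -> bool:
        """Queue a message and return True if the sender should slow down."""
        if len(self._items) >= self.capacity:
            # Dropping: the buffer is full, so discard the oldest message
            # (assumed policy; dropping the newest is equally possible).
            self._items.popleft()
            self.dropped += 1
        # Buffering: keep the message until the consumer is ready for it.
        self._items.append(message)
        # Controlling the sender: signal once the buffer is nearly full.
        return len(self._items) >= self.capacity * self.high_watermark

    def pop(self):
        """Return the oldest buffered message, or None if the buffer is empty."""
        return self._items.popleft() if self._items else None


# Simulate a burst of six messages arriving before the consumer reads anything.
queue = BackPressureQueue(capacity=4)
signals = [queue.push(f"msg-{i}") for i in range(6)]
```

In this sketch the first two pushes return `False`, the remaining four return `True` (the sender is asked to slow down), and the two oldest messages are dropped once the buffer is full. In a real RTC service the "slow down" signal would have to be mapped to whatever mechanism the platform offers for reducing the sender's rate.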

The role of Real Time Communications in the Multiverse

Recently many companies have been talking about something that they usually call the metaverse. There is no single clear definition of what the metaverse is, but something like this sounds close enough: "A highly connected environment with lots of interactive players and complex simulation creating rich experiences, something more than a game but less than the real world" (What is the metaverse, and why is it worth so much money?). For me, the easiest way to understand it is to think about something similar to the world shown in the Ready Player One movie, where users can interact with each other in a digital environment making use of virtual reality devices and sensors. Perhaps the biggest company speaking publicly about the metaverse is Facebook, which is betting hard on it, saying that they want to move from being a "social company to a metaverse company", but there are also many gaming companies working in that direction or at least talking about it. In this blo

WebRTC Video Codecs performance evaluation

The standard and most popular codecs in WebRTC are VP8 and H.264, but those are not the only options we have. VP9 has been available for a while and some big services are using it, and AV1 support has recently been added to Chrome. When comparing codecs there are interesting considerations like interoperability and licensing, but probably the most important factors are how good the codec is in terms of compression and how cheap the codec is in terms of CPU and memory usage. The compression ratio is usually the first thing we look at and there are many comparisons available for that, but the resource consumption is equally important if we want to be able to use the codecs for real time use cases. Given that AV1 is available in the Chrome Canary versions, I decided to run some tests to get an estimate of where we stand in terms of CPU usage for the 4 codecs available in the WebRTC ecosystem. The idea of the test is to compare the whole video pipeline with those 4 codecs and not just the co