Review of Signaling in different WebRTC applications

  This post provides a quick review of the signaling channel implementation in various popular WebRTC platforms. It examines the protocol used for the channel, how messages are serialized, and whether the applications send the Session Description Protocol (SDP) as an opaque string over the wire or instead send the required parameters in a custom format. To provide a variety of platforms, I have included a mix of popular end-user applications, cloud providers, and open-source implementations in the table. If you would like, I am happy to add others to the list. How was it tested? Join a room and check in Chrome Developer Tools whether there are WebSocket connections established or periodic HTTP requests being made. Then inspect the messages of those connections and requests and check whether the format is binary, JSON, or XML. In the case of binary messages it is harder to see the content, and there is a chance that the information is compressed or encrypted…
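To make the "SDP as an opaque string" pattern concrete, here is a minimal sketch of a JSON-serialized signaling message of the kind you would see in the WebSocket frames of many of these platforms. The message shape (the `type`, `room`, and `sdp` fields) is hypothetical and not taken from any specific platform.

```javascript
// Encode a signaling message as JSON, carrying the SDP as an opaque string.
function encodeSignal(type, room, sdp) {
  return JSON.stringify({ type, room, sdp });
}

// Decode and minimally validate an incoming signaling message.
function decodeSignal(raw) {
  const msg = JSON.parse(raw);
  if (!msg.type || !msg.room) throw new Error("malformed signaling message");
  return msg;
}

// In a browser this would travel over a WebSocket, e.g.:
//   ws.send(encodeSignal("offer", "room-42", pc.localDescription.sdp));
const wire = encodeSignal("offer", "room-42", "v=0\r\no=- 0 0 IN IP4 127.0.0.1\r\n");
const msg = decodeSignal(wire);
console.log(msg.type, msg.room);
```

The alternative design mentioned above (custom format) would instead parse the SDP on the sender side and put only the required parameters (ICE candidates, fingerprints, codecs) into the message fields.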

Perfect Interactive Broadcasting Architecture

While we might sometimes talk about low-latency or interactive broadcasting in a generic way, it's important to note that there are actually two distinct types of streaming use cases that require different levels of interactivity. Conversational use cases, where multiple participants are talking together and that conversation is being streamed to many other viewers; these viewers can potentially become speakers at some point too. Single-stream use cases, where just one person is streaming their video feed (their camera, their screen, or a combination of both) to many other viewers who can interact in different ways: the most obvious way is through chat messages, but it can also include emoji reactions or even bids on an auction being streamed. The conversational use case has specific requirements. For instance, it demands effective synchronization of multiple streams, ultra-low latency (less than 250ms) only between the users who are speaking, and an element that…

WebRTC header extensions review

WebRTC supports the concept of RTP header extensions to extend media packets with additional metadata. One of the most common use cases is to attach the audio level to audio packets so that the server can calculate active speakers without having to decode the audio packets. Some of these header extensions are standard and have been used for a while, but there are others that are added by Google when needed and are only lightly documented on the website and in the libwebrtc code. These header extensions and their usage are not very well known, and this post is an attempt to give them visibility in the WebRTC community. To discover some of these headers you can usually take a look at the Offer/Answer of a Google Meet session or at the libwebrtc source code here: Audio Levels (urn:ietf:params:rtp-hdrext:ssrc-audio-level) Doc [Very common] The header contains the audio level (volume) of the audio…
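The header extensions negotiated by a session are visible in the SDP as `a=extmap:` lines, so inspecting an Offer/Answer is mostly a matter of collecting those lines. Here is a hedged sketch that does that parsing; the sample SDP below is illustrative, not a real Google Meet offer.

```javascript
// Extract negotiated RTP header extensions from an SDP blob by parsing
// its "a=extmap:<id>[/<direction>] <URI>" lines.
function parseHeaderExtensions(sdp) {
  const extensions = [];
  for (const line of sdp.split(/\r?\n/)) {
    const m = line.match(/^a=extmap:(\d+)(?:\/\S+)? (\S+)/);
    if (m) extensions.push({ id: Number(m[1]), uri: m[2] });
  }
  return extensions;
}

// Illustrative audio m-section with two common extensions.
const sampleSdp = [
  "m=audio 9 UDP/TLS/RTP/SAVPF 111",
  "a=extmap:1 urn:ietf:params:rtp-hdrext:ssrc-audio-level",
  "a=extmap:2 http://www.webrtc.org/experiments/rtp-hdrext/abs-send-time",
].join("\r\n");

console.log(parseHeaderExtensions(sampleSdp));
// Two extensions: audio-level (id 1) and abs-send-time (id 2)
```

In a browser you could feed it `pc.localDescription.sdp` after negotiation to see which of the extensions discussed in this post a given service actually uses.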

Existing WebRTC is not great for broadcasting use cases

WebRTC was originally designed for real-time communication with a small number of participants, where latency requirements are extremely strict (typically <250ms). However, it has also been utilized for broadcasting use cases, such as YouTube Studio or Cloudflare CDN, which were traditionally served by different protocols, typically Adobe’s RTMP and protocols based on HTTP. WebRTC enables a new range of broadcasting use cases, particularly those requiring hyper-low latency, such as those with audience interactivity, for instance, user reactions or auction use cases. However, choosing WebRTC comes with tradeoffs, including increased complexity, scalability challenges, and lower quality. While it's possible to address the first two with enough time and effort, the primary concern should be how to obtain the best possible quality. Why do we have lower quality when using WebRTC? First of all, a clarification. In a perfect network with infinite bandwidth there is not much difference in quality…

Media Ingestion protocols Review

This post summarises the most relevant protocols used for media ingestion. In these media ingestion flows, a client that is broadcasting some content sends it to an endpoint (typically a server) that will re-format it for distribution to a potentially large audience, using a different protocol optimised for media playback. The following sections show the main ideas behind each protocol as well as some of their advantages and disadvantages. The protocols are organised into three buckets (traditional protocols, modern protocols, and next-generation protocols) that correspond to three waves of protocols: the first one based on TCP, the second one on UDP or HTTP, and the third one on QUIC. You can see a summary of those characteristics as well as the timeline and adoption in the following diagram. Traditional protocols RTMP RTMP (Real-Time Messaging Protocol) is the most widely used protocol for media ingestion nowadays. It was developed more than 20 years ago and popularised by Adobe…

Different types of latency measurements in WebRTC

When building WebRTC services, one of the most important metrics for measuring the user experience is the latency of the communications. Latency is important because it has an impact on conversational interactivity, but also on video quality when using retransmissions (the most common case), because the effectiveness of retransmissions depends on how fast you get them. And, to be fair, at the end of the day latency is what differentiates Real Time Communications from other types of communications and protocols, like the ones used for streaming use cases that are less sensitive to delays, so it is clear that latency is an important metric to track. However, there is no single measurement of latency, and different platforms, APIs, and people usually measure different types of latency. From what I've seen in the past, we can see differences in the four axes described below. One Hop latency vs End to End latency: when there are multiple servers involved in a conversation, the…
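The one-hop vs end-to-end distinction can be sketched numerically: each server on the media path typically only reports the latency of its own hop, while the end-to-end figure is (to a first approximation) the sum of the hops. The helper below also shows the common trick of estimating one-way latency as half of a measured round-trip time, which assumes a symmetric path. All the numbers and names here are made up for illustration.

```javascript
// End-to-end latency approximated as the sum of per-hop one-way latencies
// along the media path (e.g. sender -> SFU A -> SFU B -> receiver).
function endToEndLatencyMs(hopLatenciesMs) {
  return hopLatenciesMs.reduce((total, hop) => total + hop, 0);
}

// One-way latency estimated from a round-trip time measurement
// (e.g. currentRoundTripTime from getStats), assuming a symmetric path.
function oneWayFromRttMs(rttMs) {
  return rttMs / 2;
}

// Example: 30ms to SFU A, 20ms between SFUs, 40ms to the receiver.
const hops = [30, 20, 40];
console.log(endToEndLatencyMs(hops)); // 90
console.log(oneWayFromRttMs(100));    // 50
```

This is why a platform that reports "latency" from a single server's perspective can show a much smaller number than what the two users at the ends of the conversation actually experience.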

Screensharing content detection

One interesting feature in WebRTC is the ability to configure a content hint for media tracks so that WebRTC can optimize the transmission for that specific type of content. That way, if the content hint is "text" it will try to optimize the transmission for readability, and if the content hint is "motion" it will try to optimize the transmission for fluidity, even if that means reducing the resolution or the definition of the video. This is especially useful when sharing documents or slides, where the "crispiness" of the text is very important for the user experience. You can see the impact of those hints on the video encoding in this screenshot taken from the W3C spec: This is very useful, but there is a small problem. What happens when we don't know the type of content being shared? How do we know if the browser tab being shared contains some static text slides or a YouTube video being played? One possible option could be to do some type of image processing…
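One possible heuristic along those lines, sketched here with plain arrays instead of real video frames: compare consecutive grayscale frames and classify the content as "motion" when many pixels change, or "text" (static slides) when few do. The 10% pixel-change threshold and the per-pixel tolerance of 8 are arbitrary assumptions, not tuned values.

```javascript
// Classify screenshared content by the fraction of pixels that changed
// between two consecutive grayscale frames (arrays of 0-255 values).
function classifyContent(prevFrame, currFrame, threshold = 0.1) {
  let changed = 0;
  for (let i = 0; i < prevFrame.length; i++) {
    if (Math.abs(prevFrame[i] - currFrame[i]) > 8) changed++;
  }
  const ratio = changed / prevFrame.length;
  return ratio > threshold ? "motion" : "text";
}

// A mostly static frame pair vs a heavily changed one.
const staticA = new Array(100).fill(200);
const staticB = staticA.slice();
staticB[0] = 0; // a single pixel changed
const movingB = new Array(100).fill(0); // every pixel changed

console.log(classifyContent(staticA, staticB)); // "text"
console.log(classifyContent(staticA, movingB)); // "motion"
```

In a real implementation the frames would come from sampling the captured track (for example via a canvas), and the result could then be assigned to the track's `contentHint` property.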