Posts

Loss based bandwidth estimation in WebRTC

Measuring available bandwidth and avoiding congestion is the most critical and complex part of the video pipeline in WebRTC. The concept of bandwidth estimation (BWE) is simple: monitor packet latency, and if latency increases or packet loss occurs, back off and send less data. The first part is known as delay-based estimation, while the second part, less known, is referred to as loss-based estimation. In the original implementation of WebRTC, the logic for loss-based estimation was straightforward: if there was more than 2% packet loss, don't increase the bitrate being sent, and if there was more than 10%, reduce it. However, this naive approach had a flaw: some networks experience packet loss that is not due to congestion but inherent to the network itself (e.g., certain WiFi networks). We call that static or inherent packet loss. To address this issue, the latest versions of Google’s WebRTC library introduced a more modern and sophisticated solution…
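As a rough illustration of that original rule, here is a minimal sketch of the threshold-based logic described above. The 2% and 10% thresholds come from the post; the back-off proportional to the observed loss and the 5% probe step are illustrative assumptions, not the exact values used in libwebrtc.

```typescript
// Minimal sketch of the original loss-based rule (illustrative constants).
function updateBitrateOnLoss(currentBps: number, lossFraction: number): number {
  if (lossFraction > 0.10) {
    // Heavy loss: back off proportionally to the loss observed (assumed factor).
    return currentBps * (1 - 0.5 * lossFraction);
  }
  if (lossFraction > 0.02) {
    // Moderate loss: hold the current bitrate, do not increase.
    return currentBps;
  }
  // Low loss: probe upwards slowly (assumed 5% step).
  return currentBps * 1.05;
}
```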

Audio Mixing or Forwarding

How many audio streams should your WebRTC server forward to the participants in a room? There are various options, ranging from the simplest approach of forwarding everything to the most extreme option of mixing all audio and sending just a single stream. A few weeks ago, we engaged in a Twitter conversation about this very topic. Following that discussion, bloggeek also wrote a post on the subject. For me it is always interesting to see what different types of applications are doing, because at least in some of those cases they have the ability to do A/B testing and compare the results with millions of users before making a decision. The simplest way to determine the best approach is to enter a room with different applications and inspect the SDP (Session Description Protocol) in chrome://webrtc-internals. Within this tool, you can examine how many channels are being forwarded when you're in a room and look for potential clues within the SDP…
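If you prefer to check this from the page's own JavaScript console instead of chrome://webrtc-internals, a quick approximation is to count the audio m-lines in the remote SDP and the audio receivers with a live track. This is only a sketch and assumes you can get hold of the application's RTCPeerConnection (here called pc).

```typescript
// Rough check of how many audio streams the server is forwarding to this client.
function inspectForwardedAudio(pc: RTCPeerConnection): void {
  const sdp = pc.remoteDescription?.sdp ?? '';
  const audioMLines = sdp.split('\n').filter((line) => line.startsWith('m=audio')).length;
  const liveAudioReceivers = pc
    .getReceivers()
    .filter((r) => r.track.kind === 'audio' && r.track.readyState === 'live').length;
  console.log(`audio m-lines: ${audioMLines}, live audio receivers: ${liveAudioReceivers}`);
}
```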

Architecture for AI integration in conferencing applications

With the latest improvements in ML technology, especially generative algorithms and large language models, more and more conferencing applications are adding these capabilities to their offerings. This ML technology can be applied to conferencing applications at two different levels: the infrastructure level, with improvements in media handling and transmission, and the application level, with new features or capabilities for the users. At the infrastructure level (codecs, noise suppression, etc.) most of the high-level ideas were covered in this other post. Some interesting recent advances are applying “ML codecs” for audio redundancy, and the next frontier is applying generative algorithms to video as well, together with more general applications such as photorealistic avatars. This post focuses on the second level (the application part) and how to implement typical features such as summarization, image generation, or moderation. The idea here is to present a reference architecture that can be used…
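As a taste of the application-level integration, here is a minimal sketch of a feature such as summarization: a server-side worker takes the meeting transcript and asks an LLM for a summary. The endpoint, model name, and payload shape are hypothetical placeholders and would be replaced by whatever provider the application actually uses.

```typescript
// Hypothetical summarization call; endpoint, model and schema are placeholders.
async function summarizeTranscript(transcript: string, apiKey: string): Promise<string> {
  const response = await fetch('https://llm.example.com/v1/chat/completions', {
    method: 'POST',
    headers: {
      'Content-Type': 'application/json',
      Authorization: `Bearer ${apiKey}`,
    },
    body: JSON.stringify({
      model: 'example-model',
      messages: [
        { role: 'system', content: 'Summarize this meeting transcript in a few bullet points.' },
        { role: 'user', content: transcript },
      ],
    }),
  });
  const data = await response.json();
  return data.choices?.[0]?.message?.content ?? '';
}
```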

Review of Signaling in different WebRTC applications

This post provides a quick review of the signaling channel implementation in various popular WebRTC platforms. It examines the protocol used for the channel, how messages are serialized, and whether the applications use Session Description Protocol (SDP) as an opaque string over the wire, or if they instead send the required parameters in a custom format. To provide a variety of platforms, I have included a mix of popular end-user applications, cloud providers, and open-source implementations in the table. If you would like, I am happy to add others to the list. How was it tested? Join a room and check in Chrome Developer Tools whether there are WebSocket connections established or periodic HTTP requests being made. Then, inspect the messages of those connections and requests and check if the format is Binary/JSON/XML. In the case of binary messages it's harder to see the content, and there's a chance that the information is compressed or encrypted…
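For context, this is the kind of JSON-over-WebSocket signaling exchange you typically see in these inspections. The field names are illustrative; each platform defines its own schema, and some send the whole SDP as an opaque string while others send only the parameters they need.

```typescript
// Illustrative JSON-over-WebSocket signaling; the message schema is made up.
const ws = new WebSocket('wss://signaling.example.com/rooms/123');

ws.onopen = () => {
  ws.send(
    JSON.stringify({
      type: 'join',
      roomId: '123',
      // Some platforms send the full SDP blob as an opaque string...
      sdp: 'v=0\r\no=- 46117317 2 IN IP4 127.0.0.1\r\n...',
      // ...while others send only the parameters they need.
      codecs: ['opus', 'vp8'],
    })
  );
};

ws.onmessage = (event) => {
  const message = JSON.parse(event.data);
  console.log('signaling message', message.type, message);
};
```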

Perfect Interactive Broadcasting Architecture

While we might sometimes talk about low-latency or interactive broadcasting in a generic way, it's important to note that there are actually two distinct types of streaming use cases that require different levels of interactivity. Conversational use cases are those where multiple participants are talking together and that conversation is being streamed to many other viewers; these viewers can potentially become speakers at some point too. Single-stream use cases are those where just one person is streaming their video feed (it can be their camera, their screen, or a combination of both) to many other viewers who can interact in different ways. The most obvious way is through chat messages, but it can also include emoji reactions or even bids on an auction being streamed. The conversational use case has specific requirements. For instance, it demands effective synchronization of multiple streams, ultra-low latency (less than 250ms) only between the users who are speaking, and an element that…

WebRTC header extensions review

WebRTC supports the concept of RTP header extensions to extend media packets with additional metadata. One of the most common use cases is to attach the audio level to audio packets so that the server can calculate active speakers without having to decode the audio packets. Some of these header extensions are standard and have been used for a while, but there are others that are added by Google when needed and are only lightly documented on the website and in the libwebrtc code. These header extensions and their usage are not very well known, and this post is an attempt to give them visibility in the WebRTC community. To discover some of these headers you can usually take a look at the Offer/Answer of a Google Meet session or at the libwebrtc source code here: https://chromium.googlesource.com/external/webrtc/+/master/api/rtp_parameters.h Audio Levels (urn:ietf:params:rtp-hdrext:ssrc-audio-level) Doc [Very common] The header contains the audio level (volume) of the audio…
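To make the audio-level example concrete, here is a small sketch of how a server can read the one-byte ssrc-audio-level extension (RFC 6464) without decoding the audio: the most significant bit is the voice activity flag and the lower 7 bits are the level in -dBov (0 is the loudest, 127 is silence). It assumes the one-byte extension body has already been extracted from the RTP packet.

```typescript
// Parse the ssrc-audio-level one-byte header extension body (RFC 6464).
interface AudioLevel {
  voiceActivity: boolean; // V flag: speech likely present
  levelMinusDbov: number; // 0..127, audio level expressed as -dBov
}

function parseAudioLevelExtension(extensionBody: Uint8Array): AudioLevel {
  const byte = extensionBody[0];
  return {
    voiceActivity: (byte & 0x80) !== 0,
    levelMinusDbov: byte & 0x7f,
  };
}
```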

Existing WebRTC is not great for broadcasting use cases

WebRTC was originally designed for real-time communication with a small number of participants, where latency requirements are extremely strict (typically <250ms). However, it has also been utilized for broadcasting use cases, such as YouTube Studio or Cloudflare CDN, where the protocols used in the past have been different, typically Adobe’s RTMP and HTTP-based protocols. WebRTC enables a new range of broadcasting use cases, particularly those requiring hyper-low latency, such as those with audience interactivity, for instance user reactions or auction use cases. However, choosing WebRTC comes with tradeoffs, including increased complexity, scalability challenges, and lower quality. While it's possible to address the first two with enough time and effort, the primary concern should be how to obtain the best possible quality. Why do we have lower quality when using WebRTC? First of all, a clarification: in a perfect network with infinite bandwidth there is not much difference in quality…