Showing posts from 2023

Audio Mixing or Forwarding

How many audio streams should your WebRTC server forward to the participants in a room? There are various options, ranging from the simplest approach of forwarding everything to the most extreme option of mixing all audio and sending just a single stream. A few weeks ago, we engaged in a Twitter conversation about this very topic. Following that discussion, bloggeek also wrote a post on the subject. I always find it interesting to see what different types of applications are doing, because at least some of them can run A/B tests and compare the results across millions of users before making a decision. The simplest way to determine the best approach is to enter a room with different applications and inspect the SDP (Session Description Protocol) in chrome://webrtc-internals. Within this tool, you can examine how many channels are being forwarded when you're in a room and look for potential clues within the SDP (some people use the "mixed&q
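The SDP inspection described above can be partly automated. The following is a minimal sketch, assuming you have copied the remote description out of chrome://webrtc-internals as plain text; the function names and the sample SDP are hypothetical and only illustrate the idea of counting audio sections by direction. It ignores session-level direction attributes and rejected (port 0) sections.

```python
# Direction attributes that may appear inside an SDP media section.
DIRECTIONS = {"sendrecv", "sendonly", "recvonly", "inactive"}


def audio_sections(sdp):
    """Return one dict per audio m= section with its mid and direction."""
    sections = []
    current = None
    for raw in sdp.splitlines():
        line = raw.strip()
        if line.startswith("m="):
            kind = line[2:].split(" ", 1)[0]
            # sendrecv is the default when no direction attribute is present.
            current = {"kind": kind, "mid": None, "direction": "sendrecv"}
            sections.append(current)
        elif current is not None and line.startswith("a="):
            attr = line[2:]
            if attr in DIRECTIONS:
                current["direction"] = attr
            elif attr.startswith("mid:"):
                current["mid"] = attr[4:]
    return [s for s in sections if s["kind"] == "audio"]


def count_incoming_audio(remote_sdp):
    """Count audio sections on which the remote side declares it may send."""
    return sum(
        1 for s in audio_sections(remote_sdp)
        if s["direction"] in ("sendonly", "sendrecv")
    )


# Hypothetical remote description: three audio sections, one of them inactive.
SAMPLE_REMOTE_SDP = "\r\n".join([
    "v=0",
    "o=- 0 0 IN IP4 127.0.0.1",
    "s=-",
    "t=0 0",
    "m=audio 9 UDP/TLS/RTP/SAVPF 111",
    "a=mid:0",
    "a=sendrecv",
    "m=audio 9 UDP/TLS/RTP/SAVPF 111",
    "a=mid:1",
    "a=sendonly",
    "m=video 9 UDP/TLS/RTP/SAVPF 96",
    "a=mid:2",
    "a=sendonly",
    "m=audio 9 UDP/TLS/RTP/SAVPF 111",
    "a=mid:3",
    "a=inactive",
]) + "\r\n"

print(count_incoming_audio(SAMPLE_REMOTE_SDP))
```

A count of one active audio section usually suggests server-side mixing, while a handful of sections suggests that a few streams are being forwarded. This is only a hint, since an application can also negotiate sections it never uses.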

Architecture for AI integration in conferencing applications

With the latest improvements in ML technology, especially generative algorithms and large language models, more and more conferencing applications are adding these capabilities to their offerings. This ML technology can be applied to conferencing applications at two different levels: the infrastructure level with improvements in media handling and transmission, and the application level with new features or capabilities for the users. At the infrastructure level (codecs, noise suppression, etc.) most of the high-level ideas were covered in this other post. Some interesting recent advances are applying “ML codecs” for audio redundancy, and the next frontier is applying generative algorithms also to video, as well as general applications to photorealistic avatars. This post focuses on the second level (the application part) and how to implement typical features such as summarization, image generation, or moderation. The idea here is to present a reference architecture that can be used

Review of Signaling in different WebRTC applications

This post provides a quick review of the signaling channel implementation in various popular WebRTC platforms. It examines the protocol used for the channel, how messages are serialized, and whether the applications use Session Description Protocol (SDP) as an opaque string over the wire, or if they instead send the required parameters in a custom format. To provide a variety of platforms, I have included a mix of popular end-user applications, cloud providers, and open-source implementations in the table. If you would like, I am happy to add others to the list. How was it tested? To test it, join a room and check in Chrome Developer Tools whether there are WebSocket connections established or periodic HTTP requests being made. Then, inspect the messages of those connections and requests and check if the format is Binary/JSON/XML. In case of Binary messages, it's harder to see the content, and there's a chance that the information is compressed/encrypted, and there's s
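The manual check described above (is a message Binary, JSON, or XML, and does it carry SDP as an opaque string?) can be sketched in a few lines. This is only an illustration under stated assumptions: the function names are hypothetical, the payloads are ones you would copy out of the Developer Tools network panel yourself, and the SDP check is a simple heuristic that looks for string values starting with `v=0`.

```python
import json
import xml.etree.ElementTree as ET


def classify_frame(payload):
    """Roughly classify a signaling message as json, xml, text or binary."""
    if isinstance(payload, bytes):
        try:
            text = payload.decode("utf-8")
        except UnicodeDecodeError:
            return "binary"
    else:
        text = payload
    stripped = text.strip()
    if not stripped:
        return "unknown"
    if stripped[0] in "{[":
        try:
            json.loads(stripped)
            return "json"
        except ValueError:
            pass
    if stripped[0] == "<":
        try:
            ET.fromstring(stripped)
            return "xml"
        except ET.ParseError:
            pass
    # Control characters usually indicate a binary (or compressed) encoding.
    if all(ch.isprintable() or ch.isspace() for ch in stripped):
        return "text"
    return "binary"


def contains_opaque_sdp(value):
    """Heuristic: does any string inside a parsed JSON value look like SDP?"""
    if isinstance(value, str):
        return value.lstrip().startswith("v=0")
    if isinstance(value, dict):
        return any(contains_opaque_sdp(v) for v in value.values())
    if isinstance(value, list):
        return any(contains_opaque_sdp(v) for v in value)
    return False


# Hypothetical message shaped like a typical JSON offer.
message = '{"type": "offer", "sdp": "v=0\\r\\no=- 1 2 IN IP4 0.0.0.0\\r\\n"}'
print(classify_frame(message), contains_opaque_sdp(json.loads(message)))
```

A frame classified as binary may still contain SDP once decompressed or decoded, so a negative result here says nothing definitive about the wire format.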

Perfect Interactive Broadcasting Architecture

While we might sometimes talk about low-latency or interactive broadcasting in a generic way, it's important to note that there are actually two distinct types of streaming use cases that require different levels of interactivity. The first is the conversational use case, where multiple participants are talking together and that conversation is being streamed to many other viewers. These viewers can potentially become speakers at some point too. The second is the single-stream use case, where just one person is streaming their video feed (it can be their camera, their screen, or a combination of both) to many other viewers who can interact in different ways. The most obvious way is through chat messages, but it can also include emoji reactions or even bids on an auction being streamed. The conversational use case has specific requirements. For instance, it demands effective synchronization of multiple streams, ultra-low latency (less than 250ms) only between the users who are speaking, and an element that per

WebRTC header extensions review

WebRTC supports the concept of RTP header extensions to extend media packets with additional metadata. One of the most common use cases is to attach the audio level to audio packets so that the server can calculate active speakers without having to decode the audio packets. Some of these header extensions are standard and have been used for a while, but others are added by Google when needed and are only lightly documented on the website and in the libwebrtc code. These header extensions and their usage are not very well known, and this post is an attempt to make them more visible to the WebRTC community. To discover some of these headers you can usually take a look at the Offer/Answer of a Google Meet session or take a look at the libwebrtc source code here: Audio Levels (urn:ietf:params:rtp-hdrext:ssrc-audio-level) Doc [Very common] The header contains the audio level (volume) of the audi
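To make the audio level use case concrete, here is a minimal sketch of how a server could read that value without decoding any audio. It assumes the one-byte header extension format (the `0xBEDE` profile from RFC 8285) and the RFC 6464 payload layout of one voice-activity bit followed by a 7-bit level in -dBov. The function name is hypothetical, and `ext_id` is an assumption: the real ID is whatever the `a=extmap` line in the negotiated SDP assigns to `urn:ietf:params:rtp-hdrext:ssrc-audio-level`.

```python
import struct


def parse_audio_level(packet, ext_id):
    """Return (voice_activity, level_dbov) from an RTP packet, or None."""
    if len(packet) < 12 or packet[0] >> 6 != 2:
        return None  # too short, or not RTP version 2
    has_extension = bool(packet[0] & 0x10)
    csrc_count = packet[0] & 0x0F
    if not has_extension:
        return None
    offset = 12 + 4 * csrc_count  # skip the fixed header and the CSRC list
    if len(packet) < offset + 4:
        return None
    profile, words = struct.unpack_from("!HH", packet, offset)
    if profile != 0xBEDE:
        return None  # only the one-byte header form is handled here
    offset += 4
    end = offset + 4 * words
    if end > len(packet):
        return None
    while offset < end:
        byte = packet[offset]
        if byte == 0:  # padding byte between elements
            offset += 1
            continue
        elem_id = byte >> 4
        length = (byte & 0x0F) + 1  # the length field stores size minus one
        if elem_id == 15:  # reserved ID: stop parsing
            break
        data_start = offset + 1
        if data_start + length > end:
            break
        if elem_id == ext_id:
            value = packet[data_start]
            # MSB is the voice activity flag, low 7 bits are the level in -dBov.
            return bool(value & 0x80), -(value & 0x7F)
        offset = data_start + length
    return None


# Hypothetical packet: 12-byte RTP header with X=1, then one extension word
# holding ID 1 with voice activity set and a level of -30 dBov.
sample = (
    bytes([0x90, 111])
    + struct.pack("!HII", 1, 0, 0x1234)
    + struct.pack("!HH", 0xBEDE, 1)
    + bytes([0x10, 0x9E, 0x00, 0x00])
)
print(parse_audio_level(sample, ext_id=1))
```

A level of 0 is the loudest value and -127 is the quietest, so an active speaker detector would typically compare these values across participants over a short time window.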

Existing WebRTC is not great for broadcasting use cases

WebRTC was originally designed for real-time communication with a small number of participants, where latency requirements are extremely strict (typically <250ms). However, it has also been utilized for broadcasting use cases, such as YouTube Studio or Cloudflare CDN, where protocols used in the past have been different, typically Adobe’s RTMP and protocols based on HTTP. WebRTC offers a new range of broadcasting use cases, particularly those requiring hyper-low latency, such as those with audience interactivity, for instance, user reactions or auction use cases. However, choosing WebRTC comes with tradeoffs, including increased complexity, scalability challenges, or lower quality. While it's possible to address the first two with enough time and effort, the primary concern should be how to obtain the best possible quality. Why do we have lower quality when using WebRTC? First of all, a clarification. In a perfect network with infinite bandwidth there is not much difference in q

Media Ingestion protocols Review

This post summarises the most relevant protocols used for media ingestion. In these media ingestion flows a client that is broadcasting some content sends it to an endpoint (typically a server) that will re-format it for distribution to a potentially large audience, using a different protocol optimised for media playback. The following sections show the main ideas behind each protocol as well as some of their advantages and disadvantages. The protocols are organised into three buckets (traditional protocols, modern protocols and next generation protocols) that correspond to three waves of protocols, with the first one based on TCP, the second one on UDP or HTTP and the third one on QUIC. You can see a summary of those characteristics as well as the timeline and adoption in the following diagram. Traditional protocols RTMP RTMP (Real-Time Messaging Protocol) is the most widely used protocol for media ingestion nowadays. It was developed more than 20 years ago and popularised by Ado