Media Ingestion Protocols Review
This post summarises the most relevant protocols used for media ingestion. In these media ingestion flows, a client that is broadcasting some content sends it to an endpoint (typically a server) that re-formats it for distribution to a potentially large audience, using a different protocol optimised for media playback.
The following sections show the main ideas behind each protocol as well as some of their advantages and disadvantages. The protocols are organised into three buckets (traditional protocols, modern protocols and next generation protocols) that correspond to three waves of protocols: the first one based on TCP, the second one on UDP or HTTP and the third one on QUIC. You can see a summary of those characteristics as well as the timeline and adoption in the following diagram.
RTMP (Real-Time Messaging Protocol) is the most widely used protocol for media ingestion nowadays. It was developed more than 20 years ago and popularised by Adobe Flash, initially as a playback protocol. It is supported by the biggest Internet video platforms like YouTube, Facebook or Twitch.
The protocol transports media over TCP, includes an initial handshake and supports arbitrary content (metadata, RPC and media) using a proprietary binary format. The feedback mechanisms to control the media transmission are limited, as are the supported codecs. There are some variations of RTMP developed for proxy traversal or P2P but they are not that popular.
Today it is a protocol used for ingestion and not for playback, so in most cases it requires another protocol and some repacketization for distribution (typically HLS).
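As an illustration of the initial handshake, here is a minimal sketch of the first bytes a client writes on the TCP connection, following the plain handshake described in the public RTMP 1.0 specification (a one-byte version, C0, followed by a 1536-byte C1 block). The function name `build_c0_c1` is made up for this example, and real deployments may use handshake variants that are not covered here.

```python
import os
import struct
import time

RTMP_VERSION = 3        # version byte carried in C0
HANDSHAKE_SIZE = 1536   # size in bytes of the C1 block


def build_c0_c1(timestamp_ms=None):
    """Return C0 + C1, the first bytes the client sends after connecting."""
    if timestamp_ms is None:
        timestamp_ms = int(time.monotonic() * 1000)
    c0 = bytes([RTMP_VERSION])
    # C1 = 4-byte timestamp, 4 zero bytes, then random filler up to 1536 bytes.
    c1 = struct.pack("!II", timestamp_ms & 0xFFFFFFFF, 0)
    c1 += os.urandom(HANDSHAKE_SIZE - len(c1))
    return c0 + c1


packet = build_c0_c1(timestamp_ms=0)
print(len(packet), packet[0])
```

The server answers with its own S0, S1 and S2 blocks, and only after that exchange does the client start sending RTMP chunks with metadata, commands and media.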
RTSP (Real-Time Streaming Protocol) is another protocol developed in the 90s, but in this case it was fully standardised from the beginning and reuses other Internet standard protocols like RTP (Real-time Transport Protocol), RTCP (RTP Control Protocol) and SDP (Session Description Protocol). The main motivations were its standardisation and the lower latency compared with RTMP. It is not used that much today and is mostly limited to specific use cases like IP cameras.
It is a protocol that separates the control and the media planes. The session establishment part of RTSP is similar to HTTP and uses SDP to describe the media content, while the media part uses RTP over UDP. It could support most of the feedback mechanisms and capabilities of the RTP extensions, but implementations are usually limited to the basic RTP profile.
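To make the HTTP-like control plane more concrete, the following sketch builds an RTSP `ANNOUNCE` request whose body is a small SDP description of one H.264 video stream. The URL, the host address and the helper names (`build_sdp`, `build_announce`) are assumptions for the example, and the SDP is reduced to a few lines for readability.

```python
def build_sdp(host, payload_type=96):
    """Return a minimal SDP body describing one H.264 video stream."""
    lines = [
        "v=0",
        f"o=- 0 0 IN IP4 {host}",
        "s=Example ingest",
        f"c=IN IP4 {host}",
        "t=0 0",
        f"m=video 0 RTP/AVP {payload_type}",
        f"a=rtpmap:{payload_type} H264/90000",
    ]
    return "\r\n".join(lines) + "\r\n"


def build_announce(url, cseq, sdp):
    """Return the bytes of an RTSP ANNOUNCE request carrying the SDP."""
    body = sdp.encode("ascii")
    head = (
        f"ANNOUNCE {url} RTSP/1.0\r\n"
        f"CSeq: {cseq}\r\n"
        "Content-Type: application/sdp\r\n"
        f"Content-Length: {len(body)}\r\n"
        "\r\n"
    )
    return head.encode("ascii") + body


request = build_announce("rtsp://example.com/live", 1, build_sdp("192.0.2.1"))
print(request.decode("ascii"))
```

After the control exchange (typically followed by `SETUP` and `RECORD` requests when publishing), the media itself travels separately as RTP packets.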
SRT (Secure Reliable Transport) is a protocol invented in 2012 and released in 2017 as a replacement for the traditional protocols, better suited for low latency transmission over noisy networks. It is used by many professional broadcasters and some internet services like Caffeine.tv.
The protocol makes use of UDP for the transmission of control and data packets and includes feedback for retransmissions under packet loss. The protocol syntax is based on a previously existing protocol for file transfer called UDT (UDP-based Data Transfer Protocol) and has different modes to enable use cases from reliable file transfer to low latency streaming.
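The retransmission feedback can be pictured with a small receiver-side sketch: the receiver tracks sequence numbers, and every gap it notices becomes a list of packets to request again. This is a simplified illustration and not the SRT wire format or its actual loss-list logic. The class name `LossDetector` is hypothetical and sequence number wrap-around is ignored.

```python
class LossDetector:
    """Tracks received sequence numbers and reports gaps to request again."""

    def __init__(self):
        self.next_expected = None
        self.missing = set()

    def on_packet(self, seq):
        """Register a packet and return the sequence numbers newly found missing."""
        if self.next_expected is None:
            self.next_expected = seq + 1
            return []
        if seq >= self.next_expected:
            # Everything between the expected number and this one was lost
            # (or reordered) and would be reported back to the sender.
            gap = list(range(self.next_expected, seq))
            self.missing.update(gap)
            self.next_expected = seq + 1
            return gap
        # A late or retransmitted packet fills a hole.
        self.missing.discard(seq)
        return []


detector = LossDetector()
for seq in (10, 11, 14, 12):
    print(seq, detector.on_packet(seq))
print(sorted(detector.missing))
```

In a low latency mode, the sender would only retransmit a missing packet while it can still arrive before its playout deadline; after that the packet is simply dropped.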
RIST (Reliable Internet Stream Transport) is a protocol similar to SRT in terms of features and motivations. It was developed to improve on the reliability against packet loss and the security provided by SRT, as well as to support higher bitrate streams.
This protocol is also transported over UDP although in this case it uses RTP instead of inventing a new protocol for that. It includes feedback for retransmissions under packet loss.
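Since RTP is the building block reused here (and in several other protocols in this post), this is a minimal sketch of the 12-byte fixed RTP header defined in RFC 3550. The function names are made up for the example, and header extensions, padding and CSRC lists are left out.

```python
import struct

RTP_HEADER = struct.Struct("!BBHII")  # 12 bytes, network byte order


def pack_rtp_header(payload_type, seq, timestamp, ssrc, marker=False):
    """Build the fixed RTP header (version 2, no padding, extension or CSRCs)."""
    byte0 = 2 << 6
    byte1 = (int(marker) << 7) | (payload_type & 0x7F)
    return RTP_HEADER.pack(
        byte0, byte1, seq & 0xFFFF, timestamp & 0xFFFFFFFF, ssrc & 0xFFFFFFFF
    )


def unpack_rtp_header(data):
    """Parse the fixed RTP header from the start of a packet."""
    byte0, byte1, seq, timestamp, ssrc = RTP_HEADER.unpack_from(data)
    return {
        "version": byte0 >> 6,
        "marker": bool(byte1 >> 7),
        "payload_type": byte1 & 0x7F,
        "seq": seq,
        "timestamp": timestamp,
        "ssrc": ssrc,
    }


header = pack_rtp_header(payload_type=96, seq=1, timestamp=90000, ssrc=0x1234, marker=True)
print(unpack_rtp_header(header))
```

The sequence number is what allows a receiver to detect losses and ask for retransmissions, and the timestamp is what allows it to reconstruct the media timing.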
It is a slightly better protocol that reuses some existing protocols, but so far it has had lower adoption than SRT.
WebRTC is a project that was designed for videoconferencing use cases but that has also seen some adoption for streaming applications. Its real-time nature makes it suitable for use cases where ultra low latency is more important than getting the highest possible quality, and the native support in browsers makes it very simple to integrate into applications. Different streaming services use it to different degrees, for example YouTube Studio or StreamYard.
The protocol doesn't define the signalling/control plane and that's left to each application, although recently a new specification (WHIP) has defined a very simple HTTP signalling protocol for basic ingestion use cases. The protocol does define the media plane, which is composed of different layers for connectivity (ICE protocol), encryption (DTLS and SRTP protocols) and the media itself (RTP and RTCP protocols). That makes it a complex protocol even if it is mostly a composition of protocols that have existed for many years. It includes feedback mechanisms for retransmissions and bandwidth estimation as well as many other extensions for FEC or buffer control.
The WebRTC protocol stack can also be used for playback if needed, although in many cases the applications still rely on traditional protocols for playback to prioritise reachability and quality over the lowest possible latency.
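To show how small the WHIP signalling is, the sketch below prepares the single HTTP request that starts an ingestion session: a `POST` with the SDP offer as body and `application/sdp` as content type, optionally with a bearer token. The endpoint URL, the token and the function name `build_whip_request` are placeholders, and the SDP offer itself would normally be generated by the WebRTC stack rather than written by hand.

```python
import urllib.request


def build_whip_request(endpoint, sdp_offer, token=None):
    """Prepare (without sending) the HTTP POST that carries the SDP offer."""
    headers = {"Content-Type": "application/sdp"}
    if token:
        headers["Authorization"] = f"Bearer {token}"
    return urllib.request.Request(
        endpoint,
        data=sdp_offer.encode("utf-8"),
        headers=headers,
        method="POST",
    )


# Placeholder values: a real offer comes from the WebRTC implementation.
request = build_whip_request(
    "https://ingest.example.com/whip/my-stream", "v=0\r\n", token="secret"
)
print(request.get_method(), request.full_url)
# urllib.request.urlopen(request) would send it. A successful response is
# "201 Created" with the SDP answer in the body and a Location header that
# identifies the session, so that it can later be ended with an HTTP DELETE.
```

Everything after that exchange (ICE, DTLS, SRTP, RTP and RTCP) is regular WebRTC media plane handled by the WebRTC stack.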
FTL (Faster Than Light) is a protocol designed to provide a very low latency replacement for RTMP. The protocol was designed for Microsoft's Mixer service and many applications and services implemented it, although it has been mostly unused since the shutdown of Mixer.
The protocol makes use of UDP and RTP to provide ultra low latency in a way similar to WebRTC but without the ICE and DTLS parts, and it includes a control protocol over TCP to establish and negotiate the media connection.
HTTP ingestion (HLS, DASH...)
HTTP ingestion consists of reusing for media ingestion the protocols that are mostly used for media delivery (HLS, DASH...). The most popular of these protocols is probably HLS (HTTP Live Streaming), which is supported for ingestion by Akamai and YouTube Live in addition to RTMP.
These protocols use HTTP as transport to send the media segments, each encoding a few seconds of audio and video, and also to send the playlists with the metadata and the references to those segments. They make use of the delivery and encryption mechanisms of HTTP and don't include any specific feedback mechanism to adapt the ingestion to the network conditions beyond controlling latencies and buffers.
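As an example of the playlist side, here is a minimal sketch that generates an HLS media playlist referencing a few segments. The segment names and the function name `build_media_playlist` are invented for the example, it assumes at least one segment, and uploading the playlist and segments to the ingestion endpoint with HTTP `PUT` or `POST` is left out.

```python
import math


def build_media_playlist(segments, media_sequence=0):
    """Build an HLS media playlist from a list of (uri, duration_seconds)."""
    # The target duration must be an integer not smaller than any segment.
    target = max(math.ceil(duration) for _, duration in segments)
    lines = [
        "#EXTM3U",
        "#EXT-X-VERSION:3",
        f"#EXT-X-TARGETDURATION:{target}",
        f"#EXT-X-MEDIA-SEQUENCE:{media_sequence}",
    ]
    for uri, duration in segments:
        lines.append(f"#EXTINF:{duration:.3f},")
        lines.append(uri)
    return "\n".join(lines) + "\n"


print(build_media_playlist([("seg10.ts", 2.0), ("seg11.ts", 1.96)], media_sequence=10))
```

For a live stream the client keeps uploading new segments and an updated playlist, so the minimum latency is bounded by the segment duration.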
Next generation protocols
RUSH (Reliable (unreliable) streaming protocol) is a bidirectional media protocol designed by Facebook as a replacement for RTMP with the goal of providing support for new audio and video codecs, extensibility in the form of new message types, and multi-track support. This protocol is only used by Facebook at this point, both for ingestion and for media distribution inside their network.
The RUSH protocol is based on two core ideas:
- Use QUIC as the transport protocol. The QUIC protocol, built on UDP, serves as the foundation for HTTP/3 and offers robust security measures in addition to affording granular control over packet priorities and reliability.
- Use a different QUIC stream for each video frame so that they can be prioritized independently in case of congestion based on the latency and the importance of each frame.
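The second idea can be sketched without any QUIC library: if every frame is an independent unit, the sender can decide per frame what to send first and what to drop when the network cannot keep up. The code below is only an illustration of that idea and not the scheduling algorithm of the RUSH specification. The priority table, the `Frame` fields and the `schedule` function are assumptions made for the example.

```python
import heapq
from typing import NamedTuple

# Assumed policy: audio first, then key frames, then delta frames.
PRIORITY = {"audio": 0, "key": 1, "delta": 2}


class Frame(NamedTuple):
    kind: str         # "audio", "key" or "delta"
    capture_ms: int   # capture time of the frame
    size: int         # encoded size in bytes


def schedule(frames, now_ms, max_age_ms, byte_budget):
    """Pick the frames to send now, most important first, within a byte budget."""
    heap = []
    for index, frame in enumerate(frames):
        if now_ms - frame.capture_ms > max_age_ms:
            continue  # too old to be useful: its stream would be abandoned
        # The index breaks ties so that frames themselves are never compared.
        heapq.heappush(heap, (PRIORITY[frame.kind], frame.capture_ms, index, frame))
    sent, used = [], 0
    while heap:
        frame = heapq.heappop(heap)[-1]
        if used + frame.size <= byte_budget:
            sent.append(frame)
            used += frame.size
    return sent


frames = [
    Frame("delta", 100, 500),
    Frame("key", 90, 3000),
    Frame("audio", 100, 100),
    Frame("delta", 0, 500),
]
for frame in schedule(frames, now_ms=120, max_age_ms=100, byte_budget=3300):
    print(frame)
```

With a single ordered byte stream, as in RTMP over TCP, a late frame blocks everything queued behind it. With one stream per frame, each frame can be prioritised or abandoned independently.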
The specification includes a control protocol and the media protocol both running on top of a single QUIC connection. The messages have a binary encoding and include fields similar to the ones in RTP.
WARP (Live media transport over QUIC) is also a bidirectional protocol, but in this case designed by Twitch as a replacement for RTMP for ingestion and also for existing playback protocols like HLS. Twitch is already using it for the distribution of media chunks in their backbone network.
The ideas behind this protocol are similar to the ones in RUSH, with a configurable latency vs quality trade-off. The main difference is that the unit of transmission is not a frame but a segment. A segment is a group of encoded frames packaged in an MP4 container. Another difference is that WARP is defined on top of WebTransport instead of just QUIC (WebTransport is a thin layer over QUIC intended for web browser support).
The specification includes a control protocol and the media protocol, both running on top of a single QUIC connection. All the messages (control and media) include a type-length encoded binary header that should allow any intermediaries in the distribution pipeline to prioritise and filter messages based on their content.
The protocol can also be used for playback, and the use of segments makes it very simple to do the conversion to other segment-based protocols like HLS if needed.
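The benefit of a type-length header can be shown with a small sketch: an intermediary can walk through a buffer of messages and route or filter them by type without parsing the payloads. The layout used here (a 32-bit total size followed by a 4-byte type, similar to an MP4 box) and the type names are assumptions for illustration and are not taken from the WARP specification.

```python
import struct

# Assumed layout: 32-bit total message size followed by a 4-byte type.
HEADER = struct.Struct("!I4s")


def encode_message(msg_type, payload):
    """Prefix the payload with its size (header included) and its type."""
    if len(msg_type) != 4:
        raise ValueError("message type must be exactly 4 bytes")
    return HEADER.pack(HEADER.size + len(payload), msg_type) + payload


def iter_messages(buffer):
    """Yield (type, payload) for every complete message in the buffer."""
    offset = 0
    while offset + HEADER.size <= len(buffer):
        size, msg_type = HEADER.unpack_from(buffer, offset)
        if size < HEADER.size or offset + size > len(buffer):
            break  # malformed or incomplete message
        yield msg_type, buffer[offset + HEADER.size:offset + size]
        offset += size


stream = encode_message(b"ctrl", b"{}") + encode_message(b"segm", b"\x00" * 10)
# An intermediary can keep only the media messages by looking at the header.
media_only = [payload for msg_type, payload in iter_messages(stream) if msg_type == b"segm"]
print(len(media_only), len(media_only[0]))
```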
After many years in which most ingestion has happened over RTMP, there are finally some alternatives getting enough traction. WebRTC and SRT were the first ones and have already been adopted for many use cases, while the new protocols designed over QUIC (RUSH, WARP or QUICR) are likely to be widely deployed in the coming years.
The use of one protocol over the others depends mostly on two factors:
- Support and interoperability. The most widely supported protocol is still RTMP, and some streaming equipment only supports that protocol or the newer SRT/RIST protocols. However, if we ignore the new QUIC protocols, most of these protocols are already supported in the widely used open-source OBS streaming client.
- Latency vs quality trade-off. Depending on the use case it is possible to select a protocol that is better suited for ultra low latency (like WebRTC) or, when that is not strictly required, another protocol that can provide the highest possible quality (like HLS).
As a final note, this is the result of the Wowza report on the protocols being used for ingestion: