HOWTO Use temporal scalability to adapt video bitrates

We have talked many times about video scalability (VP9 SVC, simulcast) and how it enables modern RTC infrastructures to provide high-quality multiparty experiences while keeping infrastructure costs low (re-encoding the video is more expensive and also degrades the media quality). This approach was popularised by Vidyo and can be seen these days in platforms like OpenTok and services like Hangouts/Meet.

Typically we talk about temporal and spatial scalability, although you can also have quality scalability. With spatial scalability you generate multiple versions of the video at different resolutions. With temporal scalability the sequence of frames is encoded so that some frames can be dropped in the server and the resulting frame sequence can still be decoded on the receiver side.

This is an example of a typical VP8 encoding with 2 temporal layers where the base layer (blue frames) doesn't depend on the higher layer (yellow frames):

This type of encoding (usually combined with spatial scalability too) is very useful in multiparty scenarios because it allows media servers to generate different bitrates for different participants without having to re-encode the video. For example, you could forward 30 fps to a participant on a high-quality DSL/fibre line while forwarding only 15 fps to a participant on 3G.
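With the usual dyadic layering, each extra temporal layer doubles the frame rate of the layers below it, so the server picks a target frame rate simply by choosing how many layers to forward. A tiny sketch of that arithmetic (the function name and the dyadic assumption are mine, not from any specific API):

```python
def fps_for_layers(full_fps: float, total_layers: int, kept_layers: int) -> float:
    """Frame rate obtained by forwarding only the lowest `kept_layers`
    of `total_layers` dyadic temporal layers (each layer doubles the rate)."""
    return full_fps / (2 ** (total_layers - kept_layers))

# A 30 fps stream with 2 temporal layers: dropping the top layer leaves 15 fps
print(fps_for_layers(30, 2, 1))  # 15.0
```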

Implementing this selective forwarding in a server is very easy: you just need to drop the frames belonging to the layers you don't want to forward. The server can check the temporal layer identifier (TID) in the VP8 payload descriptor (or the equivalent header for other codecs) to make the decision when receiving a packet.
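For VP8, the TID (and the Picture ID we will need below) live in the RTP payload descriptor defined in RFC 7741. A minimal parser sketch in Python, following the field layout from the RFC (error handling and the TL0PICIDX/KEYIDX values are omitted for brevity):

```python
def parse_vp8_tid_and_pid(payload: bytes):
    """Extract (TID, PictureID) from a VP8 RTP payload descriptor (RFC 7741).
    Either value is None if the corresponding field is not present."""
    i = 0
    first = payload[i]; i += 1
    tid = None
    picture_id = None
    if first & 0x80:                       # X bit: extension byte present
        ext = payload[i]; i += 1
        if ext & 0x80:                     # I bit: PictureID present
            pid = payload[i]; i += 1
            if pid & 0x80:                 # M bit: 15-bit PictureID
                picture_id = ((pid & 0x7F) << 8) | payload[i]; i += 1
            else:                          # 7-bit PictureID
                picture_id = pid & 0x7F
        if ext & 0x40:                     # L bit: skip TL0PICIDX byte
            i += 1
        if ext & 0x20 or ext & 0x10:       # T or K bit: TID/Y/KEYIDX byte
            tid = (payload[i] >> 6) & 0x03
            i += 1
    return tid, picture_id

# Descriptor with X, I (15-bit PictureID = 300) and T (TID = 2) set
print(parse_vp8_tid_and_pid(bytes([0x80, 0xA0, 0x81, 0x2C, 0x80])))  # (2, 300)
```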

But there is a problem with selective forwarding... RTP packets include two identifiers that are supposed to be sequential:
  • RTP sequence numbers (SN): the identifier of the RTP packet, used for reordering and retransmissions.
  • VP8 Picture IDs (PID): the identifier of the frame, used to establish dependencies between frames.

If you just drop some packets in the server, the receiver will see non-consecutive sequence numbers and will request many unneeded retransmissions; it will also see gaps in the Picture IDs and (in some cases) decide that the received sequence is not decodable, freezing the video.

Ok, not a big deal: let's rewrite those identifiers before forwarding the packets. In Figure 1 you can see how a media server can selectively forward a video stream with 3 temporal layers (TID = 0, 1, 2), dropping the frames belonging to the highest layer (TID = 2) while rewriting the SN and PID of the forwarded packets. In that example the identifiers of the fourth frame (the second blue one) have to be rewritten when the third frame is dropped because it belongs to layer 2. Easy peasy.

Figure 1. Translating identifiers when dropping layer 2 in the server

This was very easy... but not so fast. What happens when a packet is lost and you have to decide how to rewrite the sequence number of the fourth packet while you are still missing the third one? As you can see in Figure 2, you don't know whether the third packet belonged to layer 2, so how do you rewrite the fourth packet's identifiers? There are two options depending on whether the missing packet was in layer 2 or not, and we don't know which one it was.
Figure 2. Different possible decisions when a packet is lost

To be honest, I have discussed this point with many people (apparently different implementations use different approaches right now) and I'm still not completely sure what the right solution is. But I can explain the solution that, in my opinion, is the best one.

The basic idea is to assume the worst-case scenario when you receive a packet and are not sure about the sequence number you should use. You can always assume that the missing packets had to be forwarded; if it later turns out you were wrong, you will have sent some useless packets, but you will never have a broken sequence or a frozen video. This is the approach shown in Figure 3.

Figure 3. Proposal for how to address packet loss when rewriting the identifiers

As you can see in Figure 3, it is important to forward packet number 3 (the pink one) when the retransmission arrives, even though you are not forwarding layer 2. Otherwise the receiver will keep asking for retransmissions and won't be able to decode the video sequence until it has the information for PID=3 (pink) and can decide to ignore it and proceed to decode the frame with PID=4 (blue).
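Putting the worst-case policy together: when the server sees a gap in the incoming sequence numbers it reserves output slots for the missing packets (assuming they would have been forwarded), and when a retransmission later fills one of those slots it is forwarded even if it belongs to a dropped layer. A simplified single-stream sketch (class and method names are mine; no SN wrap-around handling):

```python
class LossAwareRewriter:
    """Worst-case rewriting policy for an SFU dropping temporal layers.
    On a sequence-number gap, reserve output slots for the missing
    packets; forward late retransmissions even from dropped layers so
    the receiver always sees a complete, decodable sequence."""

    def __init__(self, max_tid: int):
        self.max_tid = max_tid   # highest temporal layer we forward
        self.sn_offset = 0       # packets knowingly dropped so far
        self.next_in_sn = None   # next expected incoming SN
        self.reserved = set()    # incoming SNs we left room for

    def on_packet(self, sn: int, tid: int):
        """Return the rewritten output SN, or None if the packet is dropped."""
        if self.next_in_sn is not None and sn > self.next_in_sn:
            # Gap: assume the missing packets had to be forwarded (worst case)
            self.reserved.update(range(self.next_in_sn, sn))
        if self.next_in_sn is None or sn >= self.next_in_sn:
            self.next_in_sn = sn + 1
        if sn in self.reserved:
            # Retransmission filling a gap: forward it regardless of its
            # layer, so the receiver can complete the sequence and decode
            self.reserved.discard(sn)
            return sn - self.sn_offset
        if tid > self.max_tid:
            self.sn_offset += 1  # dropped on purpose: hide it from now on
            return None
        return sn - self.sn_offset


# The Figure 3 scenario: forwarding layers 0-1, packet 3 (TID=2) is lost
rw = LossAwareRewriter(max_tid=1)
print(rw.on_packet(1, 0))  # 1
print(rw.on_packet(2, 1))  # 2
print(rw.on_packet(4, 0))  # 4 -- gap at 3, so a slot is reserved for it
print(rw.on_packet(3, 2))  # 3 -- retransmission forwarded despite TID=2
```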

Hope this is useful. I'm very interested in receiving feedback about any problem you find with this approach, or any better solution you are using in your implementation right now.

You can follow me on Twitter if you are interested in Real-Time Communications.
