The role of Real Time Communications in the Metaverse
Recently many companies have been talking about something that they usually call the metaverse. There is no single clear definition of what the metaverse is but something like this sounds close enough: "A highly connected environment with lots of interactive players and complex simulation creating rich experiences, something more than a game but less than the real world" (What is the metaverse, and why is it worth so much money?)
Perhaps the biggest company speaking publicly about the metaverse is Facebook, which is betting hard on it and says it wants to move from being a "social company to a metaverse company". Many gaming companies are also working in that direction, or at least talking about it. In this blog post we can find a nice representation of the market around this trend.
With that context in mind, I started to wonder what the role of Real-Time Communications (RTC) is in this ecosystem or, more precisely, what changes or customizations we might need to make to our existing RTC solutions to provide the features and quality needed for this futuristic metaverse vision.
To frame the problem I started to think about a quadrant with two types of communications (face to face and with remote people) and two types of worlds where they can happen (the real and the virtual). Given this classification, the most interesting ones, or at least the ones most specific to the metaverse, are the face-to-face communications in the virtual world.
High-quality spatial audio
One obvious requirement is high-quality spatial audio. For an immersive experience in the metaverse, the audio has to be crystal clear, each person's voice needs to be heard coming from the place where they are supposed to be in the virtual space, and its volume needs to reflect the attenuation due to the distance between user avatars in the virtual world.
Many applications are already doing something similar, from simple apps like hubbub to fancy products like Google Starline. An introduction to spatial audio can be found in this interesting presentation from Dolby: Improving intelligibility with spatial audio
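As a rough illustration of the two effects described above (direction and distance attenuation), here is a minimal Python sketch. It assumes a 2D world, a simple inverse-distance law and constant-power stereo panning; real spatial audio engines typically use HRTFs instead. The function name and the `ref_distance` and `max_distance` parameters are hypothetical values chosen for illustration.

```python
import math

def spatial_gains(listener_pos, listener_yaw, source_pos,
                  ref_distance=1.0, max_distance=50.0):
    """Return (left_gain, right_gain) for a mono voice source.

    Positions are (x, y) in virtual-world units. listener_yaw is the
    direction the listener faces, in radians, counterclockwise from +x.
    All defaults are illustrative assumptions, not tuned values.
    """
    dx = source_pos[0] - listener_pos[0]
    dy = source_pos[1] - listener_pos[1]
    distance = math.hypot(dx, dy)
    if distance >= max_distance:
        return 0.0, 0.0  # too far away to be heard at all

    # Inverse-distance attenuation, clamped so that very close
    # sources never exceed unity gain.
    attenuation = ref_distance / max(distance, ref_distance)

    # Angle of the source relative to where the listener is facing
    # (positive means the source is to the listener's left).
    angle = math.atan2(dy, dx) - listener_yaw

    # pan is -1 for fully left and +1 for fully right.
    pan = -math.sin(angle)

    # Constant-power panning keeps perceived loudness stable
    # while the source moves between the two channels.
    theta = (pan + 1.0) * math.pi / 4.0
    return attenuation * math.cos(theta), attenuation * math.sin(theta)
```

For example, `spatial_gains((0, 0), 0.0, (0, 2))` places a speaker two units to the listener's left, so only the left channel carries the voice, at half gain.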
The voices of other participants also need to be adapted to the environment where you are placed in the virtual space to sound realistic. For example, if you are in a cave, the reverberation and echo need to be different from those in an open space.
Adaptation to the space and conditions could be applied on the receiver side, while voice tuning would probably be applied on the sender side.
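To make the receiver-side adaptation a bit more concrete, here is a minimal sketch that applies a different echo depending on the environment. It uses a single feedback comb filter, which is only a crude stand-in for real reverberation (production systems usually convolve with measured or simulated impulse responses). The `PRESETS` table and its values are hypothetical.

```python
# Hypothetical per-environment settings, for illustration only.
PRESETS = {
    "open_space": {"delay_ms": 0, "feedback": 0.0},
    "cave": {"delay_ms": 120, "feedback": 0.6},
}

def apply_environment(samples, environment, sample_rate=48000):
    """Apply a feedback comb filter: y[n] = x[n] + feedback * y[n - delay]."""
    preset = PRESETS[environment]
    delay = int(sample_rate * preset["delay_ms"] / 1000)
    if delay == 0 or preset["feedback"] == 0.0:
        return list(samples)  # dry signal, no echo

    out = []
    for n, x in enumerate(samples):
        # Feed back the already processed output from `delay` samples ago.
        echo = preset["feedback"] * out[n - delay] if n >= delay else 0.0
        out.append(x + echo)
    return out
```

Because this only depends on where the listener is, each receiver can pick the preset for its own surroundings without the sender knowing about it.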
Large scale routing infrastructure based on virtual locations
In the metaverse you communicate with the people around you, not only (or not at all) with people in predefined rooms. That means that the architecture for audio routing has the following particularities:
- Routing must be based on location instead of room identifiers.
- Routing must be able to scale to a very large number of participants (imagine somebody singing/shouting in a concert venue or thousands of people talking at the same time in that venue).
To address these requirements we need an architecture where participants connect to a voice server close to their physical location, to keep latency as low as possible, while audio routing is based on virtual location instead. Your server subscribes to the audio of users whose virtual location (their geographical position in the metaverse) is close to yours and, when it receives that audio from their servers, filters it (or even mixes it) based on the distance between the source and the destination of the audio.
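The subscription decision described above could be sketched as follows. This is a minimal, single-process illustration that assumes a 2D world and a uniform grid as the spatial index; the class name, cell size, hearing radius and stream limit are all hypothetical, and a real deployment would spread this state across many servers.

```python
import math
from collections import defaultdict

class VirtualLocationIndex:
    """Grid index from virtual positions to users (illustrative only)."""

    def __init__(self, cell_size=20.0):
        self.cell_size = cell_size
        self.positions = {}            # user_id -> (x, y)
        self.cells = defaultdict(set)  # (cx, cy) -> {user_id, ...}

    def _cell(self, pos):
        return (math.floor(pos[0] / self.cell_size),
                math.floor(pos[1] / self.cell_size))

    def update(self, user_id, pos):
        """Record a user's new virtual position."""
        old = self.positions.get(user_id)
        if old is not None:
            self.cells[self._cell(old)].discard(user_id)
        self.positions[user_id] = pos
        self.cells[self._cell(pos)].add(user_id)

    def audible_sources(self, listener_id, radius, max_streams):
        """Return the nearest users within `radius`, closest first."""
        lx, ly = self.positions[listener_id]
        cx, cy = self._cell((lx, ly))
        reach = math.ceil(radius / self.cell_size)
        candidates = []
        # Only the cells overlapping the hearing radius are inspected,
        # so the cost does not grow with the total number of users.
        for ix in range(cx - reach, cx + reach + 1):
            for iy in range(cy - reach, cy + reach + 1):
                for uid in self.cells.get((ix, iy), ()):
                    if uid == listener_id:
                        continue
                    px, py = self.positions[uid]
                    dist = math.hypot(px - lx, py - ly)
                    if dist <= radius:
                        candidates.append((dist, uid))
        candidates.sort()
        return [uid for _, uid in candidates[:max_streams]]
```

A voice server would call `audible_sources` whenever avatars move and subscribe to (or unsubscribe from) the corresponding streams. Capping the result with `max_streams` is one way to keep the concert-venue case bounded.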
In addition to distance, other parameters (like the presence of walls in between) need to be taken into account to decide which audio streams are most relevant for a user, so the voice servers will probably need access to more information from the 3D world servers than just the location.
Given the tight coupling required with the 3D environment and the low-latency overlay that is already needed, it is possible that we don't really need specific RTC servers, and that it is enough to add some new capabilities to the existing gaming servers so they can properly route and filter audio packets in addition to other 3D interactions.
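As an example of using more than just the location, the sketch below reduces the gain of a stream for every wall that lies between speaker and listener. It assumes walls are 2D line segments provided by the world server and that each wall attenuates by a fixed factor; both the representation and the `per_wall_gain` value are hypothetical simplifications.

```python
def _orientation(a, b, c):
    """Signed area test: > 0 if c is to the left of the line a -> b."""
    return (b[0] - a[0]) * (c[1] - a[1]) - (b[1] - a[1]) * (c[0] - a[0])

def segments_cross(p1, p2, q1, q2):
    """True if segment p1-p2 properly crosses segment q1-q2.

    Segments that only touch at an endpoint or are collinear
    are treated as not crossing.
    """
    d1 = _orientation(q1, q2, p1)
    d2 = _orientation(q1, q2, p2)
    d3 = _orientation(p1, p2, q1)
    d4 = _orientation(p1, p2, q2)
    return d1 * d2 < 0 and d3 * d4 < 0

def occlusion_gain(listener, source, walls, per_wall_gain=0.3):
    """Multiply the gain by per_wall_gain for each wall in the way."""
    blocked = sum(
        1 for start, end in walls
        if segments_cross(listener, source, start, end)
    )
    return per_wall_gain ** blocked
```

The resulting factor can be multiplied with the distance-based gain, or used to drop a stream entirely when it falls below some threshold.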
Video is secondary and replaced with the transmission of face features/expressions
My expectation is that traditional video transmission will become less relevant and even marginal. It would be mostly replaced with very detailed facial feature extraction on the sender side combined with very realistic 3D rendering of faces on the receiver side: something like super high-quality 3D animojis that will even be photorealistic for some use cases.
The technology for that is almost there: you can see demos of videoconferencing done this way from NVIDIA, and amazing digital humans in this Epic product.
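To get a feeling for how small such a stream could be, here is a minimal sketch of a per-frame payload. It assumes the sender extracts 52 expression coefficients in the range 0 to 1 plus a head rotation, and quantizes them into a fixed binary layout; the coefficient count, the layout and the function names are assumptions for illustration, not the format of any of the products mentioned above.

```python
import struct

NUM_COEFFS = 52  # assumed number of expression coefficients per frame

# Little-endian, no padding: uint16 sequence number,
# NUM_COEFFS uint8 coefficients, three int16 angles (yaw, pitch, roll).
FRAME_FORMAT = f"<H{NUM_COEFFS}B3h"

def encode_frame(seq, coeffs, rotation_deg):
    """Quantize one frame of face features into a compact payload."""
    quantized = [max(0, min(255, round(c * 255))) for c in coeffs]
    # Angles are stored in hundredths of a degree.
    angles = [max(-32768, min(32767, round(a * 100))) for a in rotation_deg]
    return struct.pack(FRAME_FORMAT, seq & 0xFFFF, *quantized, *angles)

def decode_frame(payload):
    """Inverse of encode_frame, up to quantization error."""
    fields = struct.unpack(FRAME_FORMAT, payload)
    seq = fields[0]
    coeffs = [v / 255 for v in fields[1:1 + NUM_COEFFS]]
    rotation = [v / 100 for v in fields[1 + NUM_COEFFS:]]
    return seq, coeffs, rotation

def bitrate_bps(fps=30):
    """Payload bitrate in bits per second, ignoring packet headers."""
    return struct.calcsize(FRAME_FORMAT) * 8 * fps
```

With this layout each frame is 60 bytes, so 30 frames per second is about 14.4 kbps before transport overhead, which is in the same range as a voice stream rather than a video one.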
Arbitrary data channels are already in place
A third dimension of RTC, apart from audio and video, is sometimes data (in the case of WebRTC, for example, that is provided by DataChannels). The metaverse, by definition, already needs a very low-latency, high-scale messaging overlay to distribute and synchronize the state of the world, so I don't expect any need for data capabilities in the RTC layers.
Some standard requirements like captioning for accessibility or even recording could also apply (hopefully we don't end up in a metaverse where everything is recorded), but they don't seem to pose different challenges than they do today.
It is clear that the metaverse vision is coming one way or another, and it looks like one of its critical components will be Real-Time Communications, specifically voice communications support. Our current RTC solutions need to evolve and go to the next level, especially in terms of audio manipulation (effects and adaptation to the 3D environment) and large-scale architectures, while it is possible that video communications will be less relevant in these metaverse use cases.
And as usual, feedback is more than welcome, either here or on Twitter.