On Friday 3 June, I published MPAI wants to do it again, an article describing the five use cases MPAI intends to address in its next Call for Technologies. Yesterday, I ran across How will AI power the metaverse? by Luca Sambucci. The article identifies some key technologies that happen to be among those planned for the next MPAI Call for Technologies.
Luca writes: “A digital world requires the presence of digital places, as in rooms or villas or grassy hills, to allow whoever is occupying them at that moment to move around, interact with the environment and carry out the various activities allowed by that particular place, whether it be a meeting room immersed in a mountainous landscape, a comet in deepest space or a reproduction of Minas Tirith.”
Indeed, MPAI defines an Environment as “A Physical or Virtual Space containing a Scene” and a Scene as “An Environment populated by humans and real objects or by avatars and virtual objects”. The next MPAI Call requests proposals for Environment Model and Environment Descriptors. The main requirement is that an independent third party should be able to reproduce the intended Environment from that Model and those Descriptors alone.
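To make that reproduction requirement concrete, here is a minimal sketch of what Environment Descriptors could look like on the wire. Every class, field, and file name below is a hypothetical illustration of the idea, not syntax from the MPAI Call:

```python
# Hypothetical sketch: none of these names come from the MPAI Call.
from dataclasses import dataclass, field, asdict
import json


@dataclass
class VirtualObject:
    object_id: str
    model_uri: str            # reference to a shared 3D asset
    position: tuple           # (x, y, z) in the Environment's coordinate frame
    rotation: tuple           # (yaw, pitch, roll) in degrees


@dataclass
class EnvironmentDescriptors:
    environment_id: str
    model_uri: str            # the Environment Model, e.g. the room geometry
    objects: list = field(default_factory=list)

    def to_json(self) -> str:
        """Serialise the Descriptors so a third party can rebuild the Scene."""
        return json.dumps(asdict(self), indent=2)


room = EnvironmentDescriptors(
    environment_id="meeting-room-01",
    model_uri="assets/mountain_lodge.glb",
    objects=[VirtualObject("table-1", "assets/table.glb", (0, 0, 0), (0, 0, 0))],
)
print(room.to_json())
```

The point of the sketch is the contract, not the format: anyone holding the Model and the Descriptors can reconstruct the same Scene.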
Luca then writes: “Although in the metaverse potentially nobody knows who you are, there will be situations – such as metaverse-hosted business meetings – where masquerading behind a nickname and a Salvador Dali mask may not be commonly accepted behaviour. In those environments it will be necessary, and useful, to be present not only with one’s real name but also with an avatar that looks as much like us as possible.”
Indeed, MPAI intends to call for technologies that allow a human to send an avatar model and Descriptors that enable an independent third party to create an avatar that faithfully reproduces the sending human.
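A minimal sketch of the split this implies, assuming the avatar model is sent once and the Descriptors are streamed afterwards; all names here are illustrative assumptions, not MPAI-specified:

```python
# Hypothetical split between a one-time avatar model and the streamed
# Descriptors that animate it; no field name comes from the MPAI Call.
from dataclasses import dataclass


@dataclass
class AvatarModel:
    """Sent once: a rigged mesh that looks like the participant."""
    avatar_id: str
    mesh_uri: str


@dataclass
class AvatarDescriptors:
    """Streamed continuously: compact parameters the receiver applies."""
    avatar_id: str
    timestamp_ms: int
    head_pose: tuple              # (yaw, pitch, roll) in degrees
    expression_weights: dict      # e.g. {"smile": 0.8, "blink": 0.1}


def apply_descriptors(model: AvatarModel, d: AvatarDescriptors) -> None:
    # A real client would drive the rig; this just shows the contract.
    assert d.avatar_id == model.avatar_id
    print(f"{model.avatar_id} at {d.timestamp_ms} ms -> {d.expression_weights}")
```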
Now Luca writes: “But the recognition won’t stop there. AI will also be able to copy our facial expressions onto the avatar, so that our smile is also the avatar’s smile, transferring more and more expressions – frowning, yawning, surprise, blinking, etc. – onto our digital twin, to make our transposition from the physical to the digital world as realistic as possible.”
Of course, this is what will be written in the Call. But there will be more. To enable satisfactory human-machine interaction, we need to detect a human’s Personal Status as accurately as possible, Personal Status being “The ensemble of information internal to a person, including Emotion, Cognitive State, and Attitude”. A digital representation of Personal Status also allows the creation of a synthetic Personal Status (the one an avatar is supposed to have) that is appropriately manifested on the avatar, as if the avatar were human.
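The quoted definition suggests a simple record with three components; the sketch below follows it, using illustrative label values that are my assumption, not a standard vocabulary:

```python
# Personal Status per the quoted definition (Emotion, Cognitive State,
# Attitude); the label values are illustrative, not a standard vocabulary.
from dataclasses import dataclass


@dataclass
class PersonalStatus:
    emotion: str          # e.g. "joy"
    cognitive_state: str  # e.g. "attentive"
    attitude: str         # e.g. "friendly"


def manifest_on_avatar(status: PersonalStatus) -> dict:
    """Map a (possibly synthetic) Personal Status onto avatar controls."""
    return {
        "facial_expression": status.emotion,
        "gaze_behaviour": status.cognitive_state,
        "body_posture": status.attitude,
    }


# A machine can synthesise a Personal Status for its avatar and manifest it
# exactly as it would a Personal Status detected from a human.
print(manifest_on_avatar(PersonalStatus("joy", "attentive", "friendly")))
```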
Again, Luca writes: “In a digital world, we need digital people. As we already know, artificial intelligence is now able to hold discussions, correctly interpreting input and producing appropriately correlated output, giving the impression of understanding what is being said and being able to reply back. This ability, achieved through large language models of which GPT-3 is one example among many, can be incorporated into the various digital agents that will populate the metaverse to produce highly realistic virtual assistants or companions. In online games these agents are called NPCs (Non-Playing Characters), i.e. elements that are usually graphically similar to user avatars but are there only to do a few simple tasks, such as starting a quest, handing out rewards, giving out info or doing something for cosmetic reasons (e.g. walking around).”
The next MPAI Call introduces the figure of the Virtual Secretary, defined as “an avatar not representing a human participant whose role is to: 1) make and visually share a summary of what other ABV avatars say, 2) receive comments, 3) process the vocal and textual comments using the avatars’ Personal Status showing in their Text, Speech, Face, and Gestures, 4) edit the summary accordingly, and 5) display the summary”. Sure, this is not the entertainment-prone avatar envisaged by Luca, but it is one that must have a deep understanding of what humans – and other avatars – say.
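The five steps in the definition read naturally as a loop. Below is a toy sketch of that loop, with trivial stand-ins for the summariser and the Personal Status handling; nothing here is an MPAI component:

```python
# Toy sketch of the Virtual Secretary's five-step role; summarisation and
# Personal Status handling are deliberately trivial stand-ins.
class VirtualSecretary:
    def __init__(self) -> None:
        self.summary = ""

    def summarize(self, utterances: list) -> None:
        # 1) make a summary of what the other avatars say
        self.summary = " ".join(utterances)

    def receive_comment(self, comment: str, personal_status: dict) -> None:
        # 2) receive a comment and 3) process it in the light of the
        # commenting avatar's Personal Status (Text, Speech, Face, Gestures)
        if personal_status.get("emotion") != "dismissive":
            # 4) edit the summary accordingly
            self.summary += f" [amended: {comment}]"

    def display(self) -> str:
        # 5) display the edited summary to the participants
        return self.summary


secretary = VirtualSecretary()
secretary.summarize(["We agree on the schedule.", "Budget is still open."])
secretary.receive_comment("Budget closes Friday.", {"emotion": "confident"})
print(secretary.display())
```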
Finally, Luca writes: “The idea here is to enable groups of people from different countries, each speaking a different language, to speak and understand each other in real-time.” The three Automatic Speech Translation Use Cases in the MPAI-MMC V1 standard already provide the basic technologies, but more is needed to improve the performance of the envisaged multilingual automatic speech translation.
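For reference, the classic speech-to-speech chain such use cases build on looks like the sketch below; every function body is a stand-in for a real model, not an MPAI interface:

```python
# The classic recognise -> translate -> synthesise chain behind automatic
# speech translation; each stage is a placeholder for a real model.
def recognise(speech: bytes, lang: str) -> str:
    """Automatic speech recognition (stand-in)."""
    return "hello everyone"


def translate(text: str, src: str, dst: str) -> str:
    """Text-to-text translation (stand-in)."""
    return "ciao a tutti"


def synthesise(text: str, lang: str) -> bytes:
    """Text-to-speech; ideally it preserves the speaker's voice features."""
    return text.encode()


def speech_to_speech(speech: bytes, src: str, dst: str) -> bytes:
    return synthesise(translate(recognise(speech, src), src, dst), dst)


print(speech_to_speech(b"\x00", "en", "it"))
```

Improving the envisaged multilingual translation means improving each stage and, crucially, carrying the speaker’s voice and Personal Status across the chain rather than losing them at the text step.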
Rome was not built in a day, and neither will the metaverse be. Starting from the bricks, as MPAI is doing, is the right way.