“Digital human” has recently become a trendy expression, and different meanings can be attached to it. MPAI defines it as “a digital object able to receive text/audio/video/commands (“Information”) and generate Information that is congruent with the received Information”.

MPAI has been developing several standards for digital humans and plans to extend them and develop more.

Let’s have an overview.

In Conversation with Emotion, a digital human perceives text or speech from a human, together with video of that human. It then generates text or speech congruent with the content and emotion of the perceived media, and displays itself as an avatar whose lips move in sync with its synthetic speech and according to the emotion embedded in it.
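To make the data flow concrete, here is a minimal Python sketch of the use case. All function and type names are placeholders chosen for illustration, not the interfaces defined by the MPAI standard:

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

# All names below are illustrative placeholders, not normative MPAI data formats.

@dataclass
class Emotion:
    label: str  # e.g. "happy", drawn from a standard list of emotions

@dataclass
class Reply:
    text: str
    emotion: Emotion

# Placeholder modules; real implementations would plug in here.
def speech_to_text(speech: bytes) -> str:
    return "<recognised text>"

def estimate_emotion(speech: Optional[bytes], video: bytes) -> Emotion:
    return Emotion("neutral")

def generate_reply(utterance: str, emotion: Emotion) -> Reply:
    return Reply(f"Reply to: {utterance}", emotion)

def synthesise_speech(text: str, emotion: Emotion) -> bytes:
    return b"<audio>"

def animate_avatar(speech: bytes, emotion: Emotion) -> List[bytes]:
    return [b"<frame>"]  # avatar video frames, lips in sync with the speech

def converse_with_emotion(text: Optional[str], speech: Optional[bytes],
                          video: bytes) -> Tuple[str, bytes, List[bytes]]:
    # 1. Recognise what the human said (speech takes precedence if present).
    utterance = speech_to_text(speech) if speech else (text or "")
    # 2. Estimate the human's emotional state from speech and video.
    emotion = estimate_emotion(speech, video)
    # 3. Generate a reply congruent with both content and emotion.
    reply = generate_reply(utterance, emotion)
    # 4. Synthesise emotional speech, then lip-sync the avatar to it.
    reply_speech = synthesise_speech(reply.text, reply.emotion)
    frames = animate_avatar(reply_speech, reply.emotion)
    return reply.text, reply_speech, frames
```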

In Multimodal Question Answering, a digital human perceives text or speech from a human asking a question about an object the human is holding, together with video of the human holding the object. In response it generates text or speech that answers the question and is congruent with the perceived media, including the emotional state of the human.

Adding an avatar whose lips move in sync with the generated speech could provide a more satisfactory rendering of the speech generated by the digital human.
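A minimal sketch of this flow, again with placeholder names rather than the normative MPAI interfaces, could look like this:

```python
from typing import Optional

# Illustrative placeholders; not the normative MPAI module interfaces.
def recognise_speech(speech: bytes) -> str:
    return "<question text>"

def identify_object(video: bytes) -> str:
    return "coffee mug"  # hypothetical recognised object

def detect_emotion(video: bytes) -> str:
    return "curious"     # hypothetical emotional state

def answer_question(question: str, obj: str, emotion: str) -> str:
    # The answer must be congruent with the object and the human's emotion.
    return f"About that {obj}: ... (phrased to match a '{emotion}' state)"

def multimodal_qa(text: Optional[str], speech: Optional[bytes],
                  video: bytes) -> str:
    question = recognise_speech(speech) if speech else (text or "")
    obj = identify_object(video)      # what the human is holding
    emotion = detect_emotion(video)   # the human's emotional state
    return answer_question(question, obj, emotion)
```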

In Automatic Speech Translation, a digital human is told to translate speech or text produced by a human into a specified language and, if the input is speech, whether or not to preserve the speech features of the input. The digital human then generates translated text if the input is text, or translated speech, with the input speech features preserved or not as requested, if the input is speech.

Adding an avatar whose lips move in sync with the generated speech and according to its embedded emotion could provide a more satisfactory rendering of the digital human’s speech.
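The translation flow could be sketched as follows; translate_text, recognise, extract_features and synthesise are illustrative stand-ins, not MPAI-defined modules:

```python
from dataclasses import dataclass
from typing import Optional, Union

# Illustrative sketch; names are assumptions, not MPAI-defined interfaces.

@dataclass
class SpeechFeatures:
    pitch: float  # stand-ins for the voice characteristics to preserve
    tempo: float

def translate_text(text: str, target_language: str) -> str:
    return f"<{text!r} in {target_language}>"

def recognise(speech: bytes) -> str:
    return "<recognised text>"

def extract_features(speech: bytes) -> SpeechFeatures:
    return SpeechFeatures(pitch=1.0, tempo=1.0)

def synthesise(text: str, features: Optional[SpeechFeatures]) -> bytes:
    return b"<translated audio>"

def translate(inp: Union[str, bytes], target_language: str,
              preserve_features: bool = False) -> Union[str, bytes]:
    if isinstance(inp, str):
        # Text in, translated text out.
        return translate_text(inp, target_language)
    # Speech in, translated speech out, optionally keeping the speaker's voice.
    text = translate_text(recognise(inp), target_language)
    features = extract_features(inp) if preserve_features else None
    return synthesise(text, features)
```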

In Emotion Enhanced Speech, a digital human is told to add an emotion to an emotion-less speech segment by giving it either:

  1. A model utterance: the digital human extracts the speech features of the model utterance and adds them to the emotion-less speech segment.
  2. An emotion taken from the MPAI standard list of emotions: the digital human combines the speech features proper of the selected emotion with those of the emotion-less speech and applies the result to the emotion-less speech.

In both cases an avatar can be animated by the emotion-enhanced speech.
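Both cases could be sketched as a single function that accepts either a model utterance or an emotion label; as before, the names are assumptions for illustration:

```python
from typing import Dict, List, Optional, Union

Features = Dict[str, Union[float, List[float]]]

# Illustrative placeholders; the real modules and formats are MPAI-defined.
def extract_speech_features(utterance: bytes) -> Features:
    return {"pitch_contour": [], "tempo": 1.0}

def features_for_emotion(emotion: str) -> Features:
    # Features characteristic of an emotion from the standard emotion list.
    return {"pitch_contour": [], "tempo": 1.2}

def combine_features(a: Features, b: Features) -> Features:
    # Crude stand-in for combining the two feature sets.
    return {**b, **a}

def apply_features(speech: bytes, features: Features) -> bytes:
    return b"<emotion-enhanced audio>"

def enhance_speech(emotionless: bytes,
                   model_utterance: Optional[bytes] = None,
                   emotion: Optional[str] = None) -> bytes:
    if model_utterance is not None:
        # Case 1: copy the speech features of the model utterance.
        features = extract_speech_features(model_utterance)
    elif emotion is not None:
        # Case 2: combine the features proper of the selected emotion
        # with those of the emotion-less input.
        features = combine_features(features_for_emotion(emotion),
                                    extract_speech_features(emotionless))
    else:
        raise ValueError("provide a model utterance or an emotion label")
    return apply_features(emotionless, features)
```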

MPAI has more digital human use cases under development:

  1. Human-CAV Interaction: A digital human (a face) speaks to a group of humans, gazing at the human it is responding to.

  2. Mixed-reality Collaborative Spaces: A digital human (a torso) utters the speech of a participant in a virtual videoconference while its torso moves in sync with the participant’s torso.

  3. Conversation about a Scene: A digital human (a face) converses with a human about the objects of a scene the human is part of, gazing at the human or at an object.