This is the public page of the Multimodal Conversation (MPAI-MMC) standard. See the MPAI-MMC homepage.

MPAI has developed Version 1, whose text can be downloaded, and has developed the Use Cases and Functional Requirements for Version 2.

MPAI-MMC: Version 1 – Version 2

Version 1: MPAI-MMC V1 enables human-machine conversation emulating human-human conversation in completeness and intensity using AI. The MPAI-MMC standard includes 5 Use Cases: Conversation with Emotion, Multimodal Question Answering, Unidirectional Speech Translation, Bidirectional Speech Translation and One-to-Many Unidirectional Speech Translation.

The figures below show the reference models of the MPAI-MMC Use Cases. Note that an Implementation is intended to run in the MPAI-specified AI Framework (MPAI-AIF).

Conversation with Emotion (CWE) enables a human to hold a conversation, using audio and video, with a machine impersonated by a synthetic voice and an animated face, both expressing emotion appropriate to the conversation with a human displaying an emotional state.
Figure 1 – Conversation with Emotion
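
As a rough, non-normative illustration of the CWE data flow, the workflow can be pictured as below. All module names, signatures and data types are hypothetical stand-ins, not the AIMs or formats specified by MPAI-MMC.

```python
from dataclasses import dataclass

@dataclass
class Emotion:
    """Simplified stand-in for the Emotion data: a label plus an intensity."""
    label: str        # e.g. "happy", "frustrated"
    intensity: float  # 0.0 .. 1.0

def recognize_speech(audio: bytes) -> str:
    """Speech recognition AIM (stub): user audio in, recognized text out."""
    ...

def extract_emotion(audio: bytes, video: bytes, text: str) -> Emotion:
    """Emotion extraction AIM (stub): fuses speech, face and text cues."""
    ...

def dialogue(text: str, user_emotion: Emotion) -> tuple[str, Emotion]:
    """Dialogue AIM (stub): produces the reply text and the machine's emotion."""
    ...

def synthesize(reply: str, emotion: Emotion) -> tuple[bytes, bytes]:
    """Speech synthesis and face animation AIMs (stub): emotional voice and animated face."""
    ...

def conversation_with_emotion(audio: bytes, video: bytes) -> tuple[bytes, bytes]:
    """End-to-end CWE flow: user audio/video in, synthetic voice and animated face out."""
    text = recognize_speech(audio)
    user_emotion = extract_emotion(audio, video, text)
    reply, machine_emotion = dialogue(text, user_emotion)
    return synthesize(reply, machine_emotion)
```
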
Multimodal Question Answering (MQA) enables a user to request, using speech, information concerning an object the user displays, and to receive the requested information from a machine via synthetic speech.
Figure 2 – Multimodal Question Answering
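
A similarly hedged sketch of the MQA flow (all names are hypothetical): the spoken question and the label of the recognized object are combined to produce a spoken answer.

```python
def recognize_speech(audio: bytes) -> str:
    """Speech recognition AIM (stub): spoken question in, text out."""
    ...

def identify_object(video: bytes) -> str:
    """Object identification AIM (stub): returns a label for the displayed object."""
    ...

def answer_question(question: str, object_label: str) -> str:
    """Question answering AIM (stub): answers the question about the identified object."""
    ...

def synthesize_speech(text: str) -> bytes:
    """Speech synthesis AIM (stub)."""
    ...

def multimodal_question_answering(audio: bytes, video: bytes) -> bytes:
    """MQA flow: spoken question about a displayed object in, synthetic speech out."""
    question = recognize_speech(audio)
    label = identify_object(video)
    return synthesize_speech(answer_question(question, label))
```
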
Unidirectional Speech Translation (UST) allows a user to select a language different from the one s/he uses and to get a spoken utterance translated into the desired language with a synthetic voice that optionally preserves the personal vocal traits of the spoken utterance.
Figure 3 – Unidirectional Speech Translation
Bidirectional Speech Translation (BST) allows a human to hold a dialogue with another human. Each speaks their own language, and the translated speech is rendered as synthetic speech that optionally preserves their personal vocal traits.
Figure 4 – Bidirectional Speech Translation
One-to-Many Speech Translation (MST) enables a human to select a number of languages and have their speech translated into the selected languages using synthetic speech that optionally preserves their personal vocal traits.
Figure 5 – One-to-Many Speech Translation
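
The three translation Use Cases (UST, BST, MST) share a recognize-translate-synthesize chain. The sketch below is an assumption-based illustration of that chain, not the specified AIM topology; the one-to-many case simply repeats the unidirectional chain for each selected language, and the bidirectional case applies the same chain in both directions of the dialogue.

```python
def recognize_speech(audio: bytes) -> str:
    """Speech recognition AIM (stub)."""
    ...

def translate_text(text: str, target_language: str) -> str:
    """Text translation AIM (stub)."""
    ...

def extract_voice_traits(audio: bytes) -> bytes:
    """Extracts the speaker's personal vocal traits (stub)."""
    ...

def synthesize(text: str, language: str, voice_traits: bytes | None) -> bytes:
    """Speech synthesis AIM (stub); voice_traits, if given, are preserved in the output."""
    ...

def translate_speech(audio: bytes, target_language: str, preserve_voice: bool = True) -> bytes:
    """Unidirectional Speech Translation: one utterance, one target language."""
    text = recognize_speech(audio)
    translated = translate_text(text, target_language)
    traits = extract_voice_traits(audio) if preserve_voice else None
    return synthesize(translated, target_language, traits)

def translate_one_to_many(audio: bytes, target_languages: list[str]) -> dict[str, bytes]:
    """One-to-Many Speech Translation: the same utterance rendered in several languages."""
    return {lang: translate_speech(audio, lang) for lang in target_languages}
```
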

MPAI-MMC: Version 1 – Version 2

The MPAI-MMC Technical Specification has been developed by the MMC Development Committee (MMC-DC), chaired by Miran Choi (ETRI); it has been approved and is available for download. MMC-DC is now developing the Reference Software, Conformance Testing and Performance Assessment Specifications, as well as MPAI-MMC Version 2.

MPAI-MMC V2 intends to specify technologies further enhancing the capability of a human to converse with a machine in a variety of application environments. It will contain the Human-CAV Interaction subsystem of Connected Autonomous Vehicles (MPAI-CAV), depicted below, and will specify technologies supporting 5 new Use Cases:

Personal Status Extraction: provides an estimate of the Personal Status (PS) – of a human or an avatar – conveyed by Text, Speech, Face, and Gesture. PS is the ensemble of information internal to a person, including Emotion, Cognitive State, and Attitude.
Figure 6 – Personal Status Extraction (PSE)
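
A minimal sketch of PSE as a fusion step follows. The per-modality analysers and the simplified PS representation are assumptions for illustration only, not the data formats under specification.

```python
from dataclasses import dataclass

@dataclass
class PersonalStatus:
    """Simplified stand-in for the Personal Status; the real format is under specification."""
    emotion: str
    cognitive_state: str
    attitude: str

def analyse_text(text: str) -> dict:
    """Text analysis AIM (stub)."""
    ...

def analyse_speech(audio: bytes) -> dict:
    """Speech analysis AIM (stub)."""
    ...

def analyse_face(video: bytes) -> dict:
    """Face analysis AIM (stub)."""
    ...

def analyse_gesture(video: bytes) -> dict:
    """Gesture analysis AIM (stub)."""
    ...

def fuse(estimates: list[dict]) -> PersonalStatus:
    """Fuses the per-modality estimates into one Personal Status (stub)."""
    ...

def extract_personal_status(text: str, audio: bytes, video: bytes) -> PersonalStatus:
    """PSE flow: Text, Speech, Face and Gesture in, fused Personal Status out."""
    return fuse([
        analyse_text(text),
        analyse_speech(audio),
        analyse_face(video),
        analyse_gesture(video),
    ])
```
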
Personal Status Display: generates an avatar from Text and PS that

  1. Utters speech with the intended PS
  2. Displays a face whose lips move in sync with the speech and shows the intended PS
  3. Makes gestures accompanying the text and showing the intended PS.
Figure 7 – Personal Status Display (PSD)
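
Conversely, PSD can be pictured as the inverse operation. The sketch below is again hypothetical (names and types are assumptions, not the specified interfaces): Text plus Personal Status in, synchronized speech, face and gestures out.

```python
from dataclasses import dataclass

@dataclass
class PersonalStatus:
    """Simplified stand-in for the Personal Status."""
    emotion: str
    cognitive_state: str
    attitude: str

@dataclass
class AvatarOutput:
    speech: bytes   # synthetic speech carrying the intended PS
    face: bytes     # face animation, lips in sync with the speech
    gesture: bytes  # accompanying gestures expressing the PS

def synthesize_speech(text: str, ps: PersonalStatus) -> bytes:
    """Speech synthesis AIM (stub)."""
    ...

def animate_face(speech: bytes, ps: PersonalStatus) -> bytes:
    """Face animation AIM (stub): lips in sync with the synthesized speech."""
    ...

def animate_gesture(text: str, ps: PersonalStatus) -> bytes:
    """Gesture animation AIM (stub)."""
    ...

def personal_status_display(text: str, ps: PersonalStatus) -> AvatarOutput:
    """PSD flow: utterance, lip-synced face and gestures, all with the intended PS."""
    speech = synthesize_speech(text, ps)
    return AvatarOutput(
        speech=speech,
        face=animate_face(speech, ps),
        gesture=animate_gesture(text, ps),
    )
```
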
Conversation About a Scene: a human holds a conversation with a machine about the objects in a scene. While conversing, the human points a finger to indicate their interest in a particular object. The machine uses Visual Scene Description to extract the Human Object and the Physical Object, PSE to understand the human’s PS, and PSD to respond showing its own PS.
Figure 8 – Conversation About a Scene (CAS)
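
A non-normative sketch of how the CAS pieces could fit together (hypothetical names): the Visual Scene Description yields the Human Object and the Physical Objects, the pointing direction selects the object of interest, and PSE/PSD close the conversational loop.

```python
from dataclasses import dataclass

@dataclass
class SceneObject:
    label: str
    position: tuple[float, float, float]

def describe_visual_scene(video: bytes) -> tuple[SceneObject, list[SceneObject]]:
    """Visual Scene Description AIM (stub): returns the Human Object and the Physical Objects."""
    ...

def estimate_pointing(video: bytes) -> tuple[float, float, float]:
    """Estimates the direction the human is pointing in (stub)."""
    ...

def pointed_object(human: SceneObject, pointing: tuple[float, float, float],
                   objects: list[SceneObject]) -> SceneObject:
    """Selects the Physical Object indicated by the human's pointing direction (stub)."""
    ...

def recognize_speech(audio: bytes) -> str: ...
def extract_personal_status(text: str, audio: bytes, video: bytes) -> str: ...
def dialogue_about(target: SceneObject, text: str, user_ps: str) -> tuple[str, str]: ...
def personal_status_display(text: str, ps: str) -> bytes: ...

def conversation_about_a_scene(audio: bytes, video: bytes) -> bytes:
    """CAS flow: the machine answers about the object the human points at, showing its own PS."""
    human, objects = describe_visual_scene(video)
    target = pointed_object(human, estimate_pointing(video), objects)
    text = recognize_speech(audio)
    user_ps = extract_personal_status(text, audio, video)   # PSE
    reply, machine_ps = dialogue_about(target, text, user_ps)
    return personal_status_display(reply, machine_ps)        # PSD
```
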
Human-Connected Autonomous Vehicle (CAV) Interaction: a group of humans converses with a CAV which understands their utterances and their PSs by means of the PSE and manifests itself as the output of a PSD.

The HCI should recognise humans by face and speech, both when they are outside approaching the CAV and when they are inside the cabin.

The HCI should create Audio-Visual Scene Descriptors to be able to deal with the individual humans.

Figure 9 – Human-CAV Interaction (HCI)
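
The two requirements above can be illustrated with a minimal, assumption-based sketch of per-human Audio-Visual Scene Descriptors kept by the HCI; the class and matching logic below are illustrative, not the specified data structures.

```python
from dataclasses import dataclass, field

@dataclass
class HumanDescriptor:
    """Per-human entry in the Audio-Visual Scene Descriptors kept by the HCI."""
    human_id: str
    face_descriptor: bytes
    speech_descriptor: bytes
    location: str  # "outside" (approaching the CAV) or "cabin"

def matches(known: HumanDescriptor, face: bytes, speech: bytes) -> bool:
    """Face and speech identification AIMs (stub)."""
    ...

@dataclass
class AudioVisualScene:
    humans: dict[str, HumanDescriptor] = field(default_factory=dict)

    def recognise(self, face: bytes, speech: bytes, location: str) -> str:
        """Matches face and speech against known humans, or registers a new one."""
        for human in self.humans.values():
            if matches(human, face, speech):
                human.location = location
                return human.human_id
        new_id = f"human-{len(self.humans)}"
        self.humans[new_id] = HumanDescriptor(new_id, face, speech, location)
        return new_id
```
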
Avatar-Based Videoconference is a system in which Virtual Twins of humans, embodied in speaking avatars whose voice and appearance closely resemble those of their Human Twins, are directed by their Human Twins. Each participant has a Client connected to the Server, which is optionally augmented by the Virtual Secretary.
Figure 10 – Avatar-Based Videoconference (ABV)
The Client receives the participant’s Avatar Model and spoken language preferences at the start of the session, and audio and video throughout the session.

The Client extracts visual and speech features for authentication and continuously generates Avatar Descriptions, using the participant’s PS to improve the accuracy of the participant’s description.

Figure 10 – Avatar-Based Videoconference (Client TX)
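
A hypothetical sketch of one step of the transmitting Client after session start: it streams authentication features together with Avatar Descriptors refined by the participant's PS. Names, signatures and the returned payload are assumptions, not the specified interfaces.

```python
def extract_auth_features(audio: bytes, video: bytes) -> bytes:
    """Visual and speech features sent to the Server for authentication (stub)."""
    ...

def extract_personal_status(audio: bytes, video: bytes) -> str:
    """PSE AIM (stub), reduced to a label here for brevity."""
    ...

def describe_avatar(video: bytes, personal_status: str) -> bytes:
    """Avatar Description AIM (stub); the PS refines the description of the participant."""
    ...

def client_tx_step(audio: bytes, video: bytes) -> dict:
    """One step of the transmitting Client: data sent to the Server for this time slice."""
    ps = extract_personal_status(audio, video)
    return {
        "auth_features": extract_auth_features(audio, video),
        "avatar_descriptors": describe_avatar(video, ps),
        "speech": audio,
    }
```
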
The Server is operated by a manager who distributes the room model and oversees authentication using participants’ face and speech descriptors.

The Server translates the utterances of the individual participants into speech in the languages selected by the participants.

Figure 10 – Avatar-Based Videoconference (Server)
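
A sketch of the Server's translation fan-out (hypothetical names, not the specified AIMs): each utterance is rendered in the language selected by each receiving participant, after the speaker has been authenticated against the registered descriptors.

```python
def authenticate(face_descriptor: bytes, speech_descriptor: bytes) -> bool:
    """Checks a participant against the face and speech descriptors registered at start (stub)."""
    ...

def translate_speech(audio: bytes, target_language: str) -> bytes:
    """Speech-to-speech translation AIM (stub), as in the V1 translation Use Cases."""
    ...

def distribute_utterance(speaker_audio: bytes,
                         language_preferences: dict[str, str]) -> dict[str, bytes]:
    """Returns, for every participant, the utterance in that participant's selected language."""
    return {
        participant: translate_speech(speaker_audio, language)
        for participant, language in language_preferences.items()
    }
```
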
The Virtual Secretary (VS) is a human-like speaking avatar, not representing a human, that takes notes of what is being said at the meeting, taking the avatars’ PS into account. Meeting avatars can make comments to the VS, answer its questions, etc. The VS manifests itself through a PSD showing its PS.
Figure 10 – Avatar-Based Videoconference (Virtual Secretary)
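
A minimal, hypothetical sketch of the Virtual Secretary's note-taking: each statement is recorded together with the speaking avatar's PS, and replies are rendered through a PSD. The classes and helpers below are assumptions for illustration only.

```python
from dataclasses import dataclass, field

@dataclass
class MeetingNote:
    avatar_id: str
    text: str
    personal_status: str  # the speaking avatar's PS at the time of the statement

def draft_reply(comment: str, notes: list[MeetingNote]) -> tuple[str, str]:
    """Drafts the VS's reply text and its own PS from the meeting notes (stub)."""
    ...

def personal_status_display(text: str, ps: str) -> bytes:
    """PSD AIM (stub): renders the VS as a speaking avatar showing its PS."""
    ...

@dataclass
class VirtualSecretary:
    notes: list[MeetingNote] = field(default_factory=list)

    def record(self, avatar_id: str, text: str, personal_status: str) -> None:
        """Takes note of what was said, keeping the speaker's PS."""
        self.notes.append(MeetingNote(avatar_id, text, personal_status))

    def respond(self, comment: str) -> bytes:
        """Answers a comment or question from a meeting avatar, rendered via the PSD."""
        reply, vs_ps = draft_reply(comment, self.notes)
        return personal_status_display(reply, vs_ps)
```
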
Each participant can arrange the avatars at the meeting according to their preferences. The participant can select a point of view, which may coincide with their avatar’s point of view, see the meeting participants where they have been placed, and hear them with the correct spatial localisation.
Figure 11 – Avatar-Based Videoconference (Client RX)
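
Finally, a hypothetical sketch of the receiving Client: the participant places the avatars, picks a point of view, and hears each avatar with the corresponding spatial localisation. All names and types are assumptions.

```python
from dataclasses import dataclass

@dataclass
class PlacedAvatar:
    """An avatar placed in the virtual meeting room according to the participant's preference."""
    avatar_id: str
    position: tuple[float, float, float]

def render_view(avatars: list[PlacedAvatar], viewpoint: tuple[float, float, float]) -> bytes:
    """Renders the meeting as seen from the selected point of view (stub)."""
    ...

def spatialise(speech: bytes, source: tuple[float, float, float],
               viewpoint: tuple[float, float, float]) -> bytes:
    """Renders an avatar's speech so that it is heard from the avatar's position (stub)."""
    ...

def client_rx_step(avatars: list[PlacedAvatar],
                   speech_by_avatar: dict[str, bytes],
                   viewpoint: tuple[float, float, float]) -> tuple[bytes, list[bytes]]:
    """One rendering step: video from the chosen viewpoint, spatially localised audio per avatar."""
    video = render_view(avatars, viewpoint)
    audio = [spatialise(speech_by_avatar[a.avatar_id], a.position, viewpoint)
             for a in avatars if a.avatar_id in speech_by_avatar]
    return video, audio
```
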

Read about MPAI-MMC V2:

  1. A 2-minute video (YouTube, non-YouTube) illustrating MPAI-MMC V2.
  2. The slides presented at the online meeting on 2022/07/12.
  3. The video recording (YouTube, non-YouTube) of that 12 July online presentation.
  4. Call for Technologies, Use Cases and Functional Requirements, and Framework Licence.

The MPAI Secretariat shall receive submissions in response to the MPAI-MMC V2 Call for Technologies by 2022/10/24T23:59 UTC.


If you wish to participate in this work you have the following options:

  1. Join MPAI
  2. Participate, by sending an email to the MPAI Secretariat, until the Functional Requirements of MMC-HCI are approved (after that, only MPAI members can participate).
  3. Keep an eye on this page.

Return to the MPAI-MMC page