This is the public page of the Multimodal Conversation (MPAI-MMC) standard, providing technical information. Also see the MPAI-MMC homepage.

MPAI has developed Version 1 and Version 2 of MPAI-MMC.


MPAI-MMC Version 2

Technical Specification: Multimodal Conversation (MPAI-MMC) V2 specifies technologies that further enhance, compared to V1, the capability of a human to converse with a machine in a variety of application environments. In particular, it extends the notion and the data format of Emotion to Personal Status, which additionally includes Cognitive State and Social Attitude. V2 applies Personal Status and other data types to support new use cases.
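As an informal illustration of how Personal Status generalises Emotion, the following Python sketch models the three Factors and their possible per-Modality breakdown. The class and field names are assumptions made here for readability; the normative data formats are those defined in the MPAI-MMC V2 Technical Specification.

```python
# Illustrative sketch only: a possible in-memory shape for Personal Status,
# not the normative data format defined by MPAI-MMC V2.
from dataclasses import dataclass, field
from typing import Dict

MODALITIES = ("Text", "Speech", "Face", "Gesture")

@dataclass
class Factor:
    """One Personal Status Factor value, e.g. an Emotion, with a confidence."""
    label: str              # e.g. "cheerful" (hypothetical label set)
    confidence: float = 1.0

@dataclass
class PersonalStatus:
    """Ensemble of the three Factors, optionally broken down per Modality."""
    emotion: Dict[str, Factor] = field(default_factory=dict)
    cognitive_state: Dict[str, Factor] = field(default_factory=dict)
    social_attitude: Dict[str, Factor] = field(default_factory=dict)

# Example: an Emotion estimate carried by the Speech Modality.
ps = PersonalStatus(emotion={"Speech": Factor("cheerful", 0.8)})
```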

Personal Status Extraction: provides an estimate of the Personal Status (PS) – of a human or an avatar – conveyed by a Modality (Text, Speech, Face, or Gesture). PS is the ensemble of Factors, i.e., information internal to a human or an avatar (Emotion, Cognitive State, and Social Attitude), extracted through the steps of Description Extraction and PS Interpretation.
Figure 1 – Personal Status Extraction (PSE)
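The flow of Figure 1 can be sketched, under strong simplifications, as three stages: Description Extraction per Modality, PS Interpretation per Modality, and fusion of the per-Modality estimates into one Personal Status. The placeholder functions and the majority-vote fusion below are illustrative assumptions, not the AI Modules defined by the standard.

```python
# Minimal sketch of the PSE flow; real systems use trained models in each stage.
from typing import Any, Dict

def extract_description(modality: str, data: Any) -> Dict[str, float]:
    """Description Extraction: turn raw Modality data into descriptors (placeholder)."""
    return {"descriptor": 0.0}

def interpret_ps(modality: str, descriptors: Dict[str, float]) -> Dict[str, str]:
    """PS Interpretation: map descriptors to Factor labels (placeholder)."""
    return {"Emotion": "neutral", "CognitiveState": "attentive", "SocialAttitude": "polite"}

def fuse(per_modality_ps: Dict[str, Dict[str, str]]) -> Dict[str, str]:
    """Combine per-Modality estimates into one PS: naive majority vote per Factor."""
    fused = {}
    for factor in ("Emotion", "CognitiveState", "SocialAttitude"):
        labels = [ps[factor] for ps in per_modality_ps.values()]
        fused[factor] = max(set(labels), key=labels.count)
    return fused

def personal_status_extraction(inputs: Dict[str, Any]) -> Dict[str, str]:
    """PSE: Description Extraction, then PS Interpretation, then fusion."""
    per_modality = {m: interpret_ps(m, extract_description(m, d)) for m, d in inputs.items()}
    return fuse(per_modality)
```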
An entity – a real or digital human – converses with a machine, possibly about physical objects in the environment. The machine captures and understands Speech, extracts the Personal Status conveyed by the Text, Speech, Face, and Gesture Modalities, and fuses these estimates into an overall Personal Status of the entity to achieve a better understanding of the context in which the entity converses. The machine is represented by a Portable Avatar.
Figure 2 – Conversation with Personal Status (MMC-CPS)
A human holds a conversation with a machine about objects around the human. While conversing, the human points a finger to indicate interest in a particular object. The machine uses Visual Scene Description to extract the Human Object and the Physical Object, uses PSE to understand the human’s PS, and uses Personal Status Display (PSD) to respond while showing its own PS.
Figure 3 – Conversation About a Scene (CAS)
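One technical ingredient of this use case is deciding which Physical Object a pointing gesture refers to. The sketch below resolves this with a simple ray-alignment test; the function names, coordinates, and geometry are assumptions made for illustration and are not part of the Visual Scene Description specification.

```python
# Toy deixis resolution: pick the object best aligned with the pointing direction.
import math
from typing import List, Tuple

Point = Tuple[float, float, float]

def direction(a: Point, b: Point) -> Point:
    """Unit vector from a to b."""
    d = tuple(bi - ai for ai, bi in zip(a, b))
    n = math.sqrt(sum(x * x for x in d)) or 1.0
    return tuple(x / n for x in d)

def pointed_object(hand: Point, fingertip: Point, objects: List[Tuple[str, Point]]) -> str:
    """Return the object whose direction from the hand is closest to the pointing ray."""
    ray = direction(hand, fingertip)
    best, best_cos = "", -2.0
    for name, pos in objects:
        cos = sum(r * o for r, o in zip(ray, direction(hand, pos)))
        if cos > best_cos:
            best, best_cos = name, cos
    return best

# The selected object, together with the human's PS from PSE, is then passed to the
# dialogue component, and the reply is rendered through a PSD.
print(pointed_object((0, 0, 0), (0.1, 0, 0), [("vase", (2, 0.1, 0)), ("lamp", (0, 2, 0))]))
```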
Humans converse with a Connected Autonomous Vehicle (CAV) whose Human-CAV Interaction (HCI) subsystem understands their utterances and their PSs by means of the PSE and manifests itself as the output of a PSD. The HCI also recognises humans by face and speech, both when they approach the CAV from outside and when they are inside the cabin. The figure also represents the communication of the Ego CAV HCI with Remote HCIs.
Figure 4 – Human-Connected Autonomous Vehicle (CAV) Interaction (HCI)
The Virtual Secretary (VS) is a human-like speaking avatar, not representing any human participant, that produces a summary of what is being said at the meeting, including the participants’ PSs. Participating avatars can make comments to the VS, answer questions, etc. The VS manifests itself through a PSD.
Figure 5 – Avatar-Based Videoconference (Virtual Secretary)
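A toy sketch of the kind of record a Virtual Secretary might accumulate follows: per-participant utterances annotated with their Personal Status, condensed into a short summary. The record layout and the summarisation step are assumptions made here for illustration only.

```python
# Hypothetical meeting log for a Virtual Secretary; real summarisation would use
# language understanding rather than simple concatenation.
from dataclasses import dataclass
from typing import List

@dataclass
class MeetingEntry:
    participant: str
    utterance: str
    personal_status: str   # e.g. a fused Emotion/Cognitive State/Social Attitude label

def summarise(entries: List[MeetingEntry]) -> str:
    """Placeholder summary: list who spoke, with which Personal Status."""
    return "\n".join(f"{e.participant} ({e.personal_status}): {e.utterance}" for e in entries)

log = [
    MeetingEntry("Avatar A", "I propose we extend the deadline.", "confident"),
    MeetingEntry("Avatar B", "I agree, with some reservations.", "hesitant"),
]
print(summarise(log))
```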

MPAI-MMC Version 1

Version 1: MPAI-MMC V1 enables human-machine conversation emulating human-human conversation in completeness and intensity using AI. The MPAI-MMC standard includes five Use Cases: Conversation with Emotion, Multimodal Question Answering, Unidirectional Speech Translation, Bidirectional Speech Translation, and One-to-Many Speech Translation.

The figures below show the reference models of the MPAI-MMC Use Cases. Note that an Implementation is expected to run in the MPAI-specified AI Framework (MPAI-AIF).

Conversation with Emotion (CWE) enables a human to hold an audio-visual conversation with a machine impersonated by a synthetic voice and an animated face, both expressing an emotion appropriate to the conversation with a human displaying an emotional state.
Figure 1 – Conversation with Emotion
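The CWE flow can be sketched as a loop of placeholder components: recognise the user's speech and emotion, generate a reply and a machine emotion fitting the context, and render them as emotional synthetic speech and face animation. The function names below are assumptions; the actual reference model decomposes this into AI Modules connected in an MPAI-AIF workflow.

```python
# Sketch of the CWE loop with stand-in components.
def recognise_speech(audio: bytes) -> str:
    return "hello"                      # placeholder speech recognition

def recognise_emotion(audio: bytes, video: bytes) -> str:
    return "cheerful"                   # placeholder audio-visual emotion recognition

def dialogue(text: str, user_emotion: str) -> tuple[str, str]:
    reply = f"You said: {text}"
    reply_emotion = "cheerful" if user_emotion == "cheerful" else "neutral"
    return reply, reply_emotion         # machine picks an emotion fitting the context

def synthesise(reply: str, emotion: str) -> tuple[bytes, bytes]:
    return b"speech", b"face"           # placeholder emotional TTS and face animation

def conversation_with_emotion(audio: bytes, video: bytes) -> tuple[bytes, bytes]:
    text = recognise_speech(audio)
    user_emotion = recognise_emotion(audio, video)
    reply, reply_emotion = dialogue(text, user_emotion)
    return synthesise(reply, reply_emotion)
```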
Multimodal Question Answering (MQA) enables a user to request information by speech about an object the user displays and to receive the requested information from a machine as synthetic speech.
Figure 2 – Multimodal Question Answering
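A minimal MQA sketch follows, combining a spoken question with recognition of the displayed object to produce a spoken answer. The component names are illustrative assumptions; the normative reference model is the one defined in MPAI-MMC V1.

```python
# Stand-in components for the MQA chain.
def recognise_speech(audio: bytes) -> str:
    return "what is this?"              # placeholder speech recognition

def recognise_object(image: bytes) -> str:
    return "ceramic vase"               # placeholder object recognition

def answer_question(question: str, obj: str) -> str:
    return f"This is a {obj}."          # placeholder question answering

def text_to_speech(text: str) -> bytes:
    return text.encode()                # placeholder speech synthesis

def multimodal_question_answering(audio: bytes, image: bytes) -> bytes:
    question = recognise_speech(audio)
    obj = recognise_object(image)
    return text_to_speech(answer_question(question, obj))
```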
Unidirectional Speech Translation (UST) allows a user to select a language different from the one they use and to get a spoken utterance translated into the desired language with a synthetic voice that optionally preserves the personal vocal traits of the original utterance.
Figure 3 – Unidirectional Speech Translation
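The UST chain can be sketched as speech recognition, translation, and synthesis in the selected language. Preserving vocal traits is modelled here as passing a speaker embedding to the synthesiser; this is an assumption about one possible realisation, not the specified mechanism.

```python
# Hedged UST sketch with placeholder components.
def recognise_speech(audio: bytes) -> str:
    return "good morning"                 # placeholder speech recognition

def translate(text: str, target_language: str) -> str:
    return f"[{target_language}] {text}"  # placeholder machine translation

def speaker_embedding(audio: bytes) -> list[float]:
    return [0.0] * 16                     # placeholder vocal-traits descriptor

def synthesise(text: str, embedding: list[float] | None = None) -> bytes:
    return text.encode()                  # placeholder TTS, optionally voice-preserving

def unidirectional_speech_translation(audio: bytes, target: str, keep_voice: bool = True) -> bytes:
    text = recognise_speech(audio)
    translated = translate(text, target)
    emb = speaker_embedding(audio) if keep_voice else None
    return synthesise(translated, emb)
```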
Bidirectional Speech Translation (BST) allows a human to hold a dialogue with another human. Both speak their own language, and their translated speech is rendered as synthetic speech that optionally preserves their personal vocal traits.
Figure 4 – Bidirectional Speech Translation
One-to-Many Speech Translation (MST) enables a human to select a number of languages and have their speech translated into the selected languages using synthetic speech that optionally preserves their personal vocal traits.
Figure 5 – One-to-Many Speech Translation
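Conceptually, MST applies the same recognise-translate-synthesise chain once per selected language. The short sketch below makes that explicit; as in the UST sketch above, the function bodies are placeholders and the voice-preservation mechanism is an assumption.

```python
# One-to-many sketch: one input utterance, one synthetic output per target language.
def recognise_speech(audio: bytes) -> str:
    return "good morning"               # placeholder speech recognition

def translate(text: str, target: str) -> str:
    return f"[{target}] {text}"         # placeholder machine translation

def synthesise(text: str, voice_of: bytes | None = None) -> bytes:
    return text.encode()                # placeholder TTS, optionally voice-preserving

def one_to_many_translation(audio: bytes, targets: list[str], keep_voice: bool = True) -> dict[str, bytes]:
    text = recognise_speech(audio)
    voice = audio if keep_voice else None
    return {t: synthesise(translate(text, t), voice) for t in targets}
```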

If you wish to participate in this work, you have the following options:

  1. Join MPAI
  2. Keep an eye on this page.

Return to the MPAI-MMC page