This is the public page of the Multimodal Conversation (MPAI-MMC) standard. See the MPAI-MMC homepage.

MPAI has developed Version 1, whose text can be downloaded, and has developed the Use Cases and Functional Requirements for Version 2.

MPAI-MMC: Version 1 – Version 2

Version 1: MPAI-MMC V1 enables human-machine conversation emulating human-human conversation in completeness and intensity using AI. The MPAI-MMC standard includes 5 Use Cases: Conversation with Emotion, Multimodal Question Answering, Unidirectional Speech Translation, Bidirectional Speech Translation and One-to-Many Unidirectional Speech Translation.

The figures below show the reference models of the MPAI-MMC Use Cases. Note that an Implementation is intended to run in the MPAI-specified AI Framework (MPAI-AIF).

Conversation with Emotion (CWE) enables a human to hold a conversation, using audio and video, with a machine impersonated by a synthetic voice and an animated face, both expressing emotion appropriate to the conversation with a human displaying an emotional state.
Figure 1 – Conversation with Emotion
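
As a rough, non-normative illustration of the CWE data flow, the workflow can be pictured as below. All module names, signatures and data types are hypothetical stand-ins, not the AIMs or formats specified by MPAI-MMC.

```python
from dataclasses import dataclass

@dataclass
class Emotion:
    """Simplified stand-in for the Emotion data: a label plus an intensity."""
    label: str        # e.g. "happy", "frustrated"
    intensity: float  # 0.0 .. 1.0

def recognize_speech(audio: bytes) -> str:
    """Speech recognition AIM (stub): user audio in, recognized text out."""
    ...

def extract_emotion(audio: bytes, video: bytes, text: str) -> Emotion:
    """Emotion extraction AIM (stub): fuses speech, face and text cues."""
    ...

def dialogue(text: str, user_emotion: Emotion) -> tuple[str, Emotion]:
    """Dialogue AIM (stub): produces the reply text and the machine's emotion."""
    ...

def synthesize(reply: str, emotion: Emotion) -> tuple[bytes, bytes]:
    """Speech synthesis and face animation AIMs (stub): emotional voice and animated face."""
    ...

def conversation_with_emotion(audio: bytes, video: bytes) -> tuple[bytes, bytes]:
    """End-to-end CWE flow: user audio/video in, synthetic voice and animated face out."""
    text = recognize_speech(audio)
    user_emotion = extract_emotion(audio, video, text)
    reply, machine_emotion = dialogue(text, user_emotion)
    return synthesize(reply, machine_emotion)
```
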
Multimodal Question Answering (MQA) enables a user to request, using speech, information concerning an object the user displays, and to receive the requested information from a machine via synthetic speech.
Figure 2 – Multimodal Question Answering
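
A similarly hedged sketch of the MQA flow (all names are hypothetical): the spoken question and the label of the recognized object are combined to produce a spoken answer.

```python
def recognize_speech(audio: bytes) -> str:
    """Speech recognition AIM (stub): spoken question in, text out."""
    ...

def identify_object(video: bytes) -> str:
    """Object identification AIM (stub): returns a label for the displayed object."""
    ...

def answer_question(question: str, object_label: str) -> str:
    """Question answering AIM (stub): answers the question about the identified object."""
    ...

def synthesize_speech(text: str) -> bytes:
    """Speech synthesis AIM (stub)."""
    ...

def multimodal_question_answering(audio: bytes, video: bytes) -> bytes:
    """MQA flow: spoken question about a displayed object in, synthetic speech out."""
    question = recognize_speech(audio)
    label = identify_object(video)
    return synthesize_speech(answer_question(question, label))
```
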
Unidirectional Speech Translation (UST) allows a user to select a language different from the one s/he uses and to get a spoken utterance translated into the desired language with a synthetic voice that optionally preserves the personal vocal traits of the spoken utterance.
Figure 3 – Unidirectional Speech Translation
Bidirectional Speech Translation (BST) allows a human to hold a dialogue with another human. Each speaks their own language, and the translated speech is rendered as synthetic speech that optionally preserves their personal vocal traits.
Figure 4 – Bidirectional Speech Translation
One-to-Many Speech Translation (MST) enables a human to select a number of languages and have their speech translated into the selected languages using synthetic speech that optionally preserves their personal vocal traits.
Figure 5 – One-to-Many Speech Translation
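
The three translation Use Cases (UST, BST, MST) share a recognize-translate-synthesize chain. The sketch below is an assumption-based illustration of that chain, not the specified AIM topology; the one-to-many case simply repeats the unidirectional chain for each selected language, and the bidirectional case applies the same chain in both directions of the dialogue.

```python
def recognize_speech(audio: bytes) -> str:
    """Speech recognition AIM (stub)."""
    ...

def translate_text(text: str, target_language: str) -> str:
    """Text translation AIM (stub)."""
    ...

def extract_voice_traits(audio: bytes) -> bytes:
    """Extracts the speaker's personal vocal traits (stub)."""
    ...

def synthesize(text: str, language: str, voice_traits: bytes | None) -> bytes:
    """Speech synthesis AIM (stub); voice_traits, if given, are preserved in the output."""
    ...

def translate_speech(audio: bytes, target_language: str, preserve_voice: bool = True) -> bytes:
    """Unidirectional Speech Translation: one utterance, one target language."""
    text = recognize_speech(audio)
    translated = translate_text(text, target_language)
    traits = extract_voice_traits(audio) if preserve_voice else None
    return synthesize(translated, target_language, traits)

def translate_one_to_many(audio: bytes, target_languages: list[str]) -> dict[str, bytes]:
    """One-to-Many Speech Translation: the same utterance rendered in several languages."""
    return {lang: translate_speech(audio, lang) for lang in target_languages}
```
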

MPAI-MMC: Version 1 – Version 2

The MPAI-MMC Technical Specification has been developed by the MMC Development Committee (MMC-DC), chaired by Miran Choi (ETRI); it has been approved and is available for download. MMC-DC is now developing the Reference Software, Conformance Testing and Performance Assessment Specifications, as well as MPAI-MMC Version 2.

MPAI-MMC V2 intends to specify technologies further enhancing the capability of a human to converse with a machine in a variety of application environments. It will contain the Human-CAV Interaction subsystem of Connected Autonomous Vehicles (MPAI-CAV), depicted below, and will specify technologies supporting 5 new Use Cases:

Personal Status Extraction: provides an estimate of the Personal Status (PS) – of a human or an avatar – conveyed by Text, Speech, Face, and Gesture. PS is the ensemble of information internal to a person, including Emotion, Cognitive State, and Attitude.
Figure 6 – Personal Status Extraction (PSE)
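
A minimal sketch of PSE as a fusion step follows. The per-modality analysers and the simplified PS representation are assumptions for illustration only, not the data formats under specification.

```python
from dataclasses import dataclass

@dataclass
class PersonalStatus:
    """Simplified stand-in for the Personal Status; the real format is under specification."""
    emotion: str
    cognitive_state: str
    attitude: str

def analyse_text(text: str) -> dict:
    """Text analysis AIM (stub)."""
    ...

def analyse_speech(audio: bytes) -> dict:
    """Speech analysis AIM (stub)."""
    ...

def analyse_face(video: bytes) -> dict:
    """Face analysis AIM (stub)."""
    ...

def analyse_gesture(video: bytes) -> dict:
    """Gesture analysis AIM (stub)."""
    ...

def fuse(estimates: list[dict]) -> PersonalStatus:
    """Fuses the per-modality estimates into one Personal Status (stub)."""
    ...

def extract_personal_status(text: str, audio: bytes, video: bytes) -> PersonalStatus:
    """PSE flow: Text, Speech, Face and Gesture in, fused Personal Status out."""
    return fuse([
        analyse_text(text),
        analyse_speech(audio),
        analyse_face(video),
        analyse_gesture(video),
    ])
```
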
Personal Status Display: generates an avatar from Text and PS that

  1. Utters speech with the intended PS
  2. Displays a face whose lips move in sync with the speech and shows the intended PS
  3. Makes gestures accompanying the text and showing the intended PS.
Figure 7 – Personal Status Display (PSD)
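
Conversely, PSD can be pictured as the inverse operation. The sketch below is again hypothetical (names and types are assumptions, not the specified interfaces): Text plus Personal Status in, synchronized speech, face and gestures out.

```python
from dataclasses import dataclass

@dataclass
class PersonalStatus:
    """Simplified stand-in for the Personal Status."""
    emotion: str
    cognitive_state: str
    attitude: str

@dataclass
class AvatarOutput:
    speech: bytes   # synthetic speech carrying the intended PS
    face: bytes     # face animation, lips in sync with the speech
    gesture: bytes  # accompanying gestures expressing the PS

def synthesize_speech(text: str, ps: PersonalStatus) -> bytes:
    """Speech synthesis AIM (stub)."""
    ...

def animate_face(speech: bytes, ps: PersonalStatus) -> bytes:
    """Face animation AIM (stub): lips in sync with the synthesized speech."""
    ...

def animate_gesture(text: str, ps: PersonalStatus) -> bytes:
    """Gesture animation AIM (stub)."""
    ...

def personal_status_display(text: str, ps: PersonalStatus) -> AvatarOutput:
    """PSD flow: utterance, lip-synced face and gestures, all with the intended PS."""
    speech = synthesize_speech(text, ps)
    return AvatarOutput(
        speech=speech,
        face=animate_face(speech, ps),
        gesture=animate_gesture(text, ps),
    )
```
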
Conversation About a Scene: a human holds a conversation with a machine about the objects in a scene. While conversing, the human points a finger to indicate their interest in a particular object. The machine uses Visual Scene Description to extract the Human Object and the Physical Object, PSE to understand the human’s PS, and PSD to respond showing its own PS.
Figure 8 – Conversation About a Scene (CAS)
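
A non-normative sketch of how the CAS pieces could fit together (hypothetical names): the Visual Scene Description yields the Human Object and the Physical Objects, the pointing direction selects the object of interest, and PSE/PSD close the conversational loop.

```python
from dataclasses import dataclass

@dataclass
class SceneObject:
    label: str
    position: tuple[float, float, float]

def describe_visual_scene(video: bytes) -> tuple[SceneObject, list[SceneObject]]:
    """Visual Scene Description AIM (stub): returns the Human Object and the Physical Objects."""
    ...

def estimate_pointing(video: bytes) -> tuple[float, float, float]:
    """Estimates the direction the human is pointing in (stub)."""
    ...

def pointed_object(human: SceneObject, pointing: tuple[float, float, float],
                   objects: list[SceneObject]) -> SceneObject:
    """Selects the Physical Object indicated by the human's pointing direction (stub)."""
    ...

def recognize_speech(audio: bytes) -> str: ...
def extract_personal_status(text: str, audio: bytes, video: bytes) -> str: ...
def dialogue_about(target: SceneObject, text: str, user_ps: str) -> tuple[str, str]: ...
def personal_status_display(text: str, ps: str) -> bytes: ...

def conversation_about_a_scene(audio: bytes, video: bytes) -> bytes:
    """CAS flow: the machine answers about the object the human points at, showing its own PS."""
    human, objects = describe_visual_scene(video)
    target = pointed_object(human, estimate_pointing(video), objects)
    text = recognize_speech(audio)
    user_ps = extract_personal_status(text, audio, video)   # PSE
    reply, machine_ps = dialogue_about(target, text, user_ps)
    return personal_status_display(reply, machine_ps)        # PSD
```
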
Human-Connected Autonomous Vehicle (CAV) Interaction: a group of humans converses with a CAV which understands their utterances and their PSs by means of the PSE and manifests itself as the output of a PSD.

The HCI should recognise humans by face and speech, both when they are outside approaching the CAV and when they are inside the cabin.

The HCI should create Audio-Visual Scene Descriptors to be able to deal with the individual humans.

Figure 9 – Human-CAV Interaction (HCI)
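
The two requirements above can be illustrated with a minimal, assumption-based sketch of per-human Audio-Visual Scene Descriptors kept by the HCI; the class and matching logic below are illustrative, not the specified data structures.

```python
from dataclasses import dataclass, field

@dataclass
class HumanDescriptor:
    """Per-human entry in the Audio-Visual Scene Descriptors kept by the HCI."""
    human_id: str
    face_descriptor: bytes
    speech_descriptor: bytes
    location: str  # "outside" (approaching the CAV) or "cabin"

def matches(known: HumanDescriptor, face: bytes, speech: bytes) -> bool:
    """Face and speech identification AIMs (stub)."""
    ...

@dataclass
class AudioVisualScene:
    humans: dict[str, HumanDescriptor] = field(default_factory=dict)

    def recognise(self, face: bytes, speech: bytes, location: str) -> str:
        """Matches face and speech against known humans, or registers a new one."""
        for human in self.humans.values():
            if matches(human, face, speech):
                human.location = location
                return human.human_id
        new_id = f"human-{len(self.humans)}"
        self.humans[new_id] = HumanDescriptor(new_id, face, speech, location)
        return new_id
```
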
Avatar-Based Videoconference is a system in which Virtual Twins of humans, embodied in speaking avatars whose voice and appearance closely resemble those of their Human Twins, are directed by their Human Twins. Each participant has a Client connected to the Server, which is optionally augmented by the Virtual Secretary.
Figure 10 – Avatar-Based Videoconference (ABV)
The Client receives the participant’s Avatar Model and spoken language preferences at the start of the session, and audio and video throughout the session.

The Client extracts visual and speech features for authentication and continuously generates Avatar Descriptions, using the participant’s PS to improve the accuracy of the participant’s description.

Figure 10 – Avatar-Based Videoconference (Client TX)
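
A hypothetical sketch of one step of the transmitting Client after session start: it streams authentication features together with Avatar Descriptors refined by the participant's PS. Names, signatures and the returned payload are assumptions, not the specified interfaces.

```python
def extract_auth_features(audio: bytes, video: bytes) -> bytes:
    """Visual and speech features sent to the Server for authentication (stub)."""
    ...

def extract_personal_status(audio: bytes, video: bytes) -> str:
    """PSE AIM (stub), reduced to a label here for brevity."""
    ...

def describe_avatar(video: bytes, personal_status: str) -> bytes:
    """Avatar Description AIM (stub); the PS refines the description of the participant."""
    ...

def client_tx_step(audio: bytes, video: bytes) -> dict:
    """One step of the transmitting Client: data sent to the Server for this time slice."""
    ps = extract_personal_status(audio, video)
    return {
        "auth_features": extract_auth_features(audio, video),
        "avatar_descriptors": describe_avatar(video, ps),
        "speech": audio,
    }
```
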
The Server is operated by a manager who distributes the room model and oversees authentication using participants’ face and speech descriptors.

The Server translates the utterances of the individual participants into speech in the languages selected by the participants.

Figure 10 – Avatar-Based Videoconference (Server)
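
A sketch of the Server's translation fan-out (hypothetical names, not the specified AIMs): each utterance is rendered in the language selected by each receiving participant, after the speaker has been authenticated against the registered descriptors.

```python
def authenticate(face_descriptor: bytes, speech_descriptor: bytes) -> bool:
    """Checks a participant against the face and speech descriptors registered at start (stub)."""
    ...

def translate_speech(audio: bytes, target_language: str) -> bytes:
    """Speech-to-speech translation AIM (stub), as in the V1 translation Use Cases."""
    ...

def distribute_utterance(speaker_audio: bytes,
                         language_preferences: dict[str, str]) -> dict[str, bytes]:
    """Returns, for every participant, the utterance in that participant's selected language."""
    return {
        participant: translate_speech(speaker_audio, language)
        for participant, language in language_preferences.items()
    }
```
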
The Virtual Secretary (VS) is a human-like speaking avatar, not representing a human, that takes notes of what is being said at the meeting, taking the avatars’ PS into account. Meeting avatars can make comments to the VS, answer its questions, etc. The VS manifests itself through a PSD showing its PS.
Figure 10 – Avatar-Based Videoconference (Virtual Secretary)
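
A minimal, hypothetical sketch of the Virtual Secretary's note-taking: each statement is recorded together with the speaking avatar's PS, and replies are rendered through a PSD. The classes and helpers below are assumptions for illustration only.

```python
from dataclasses import dataclass, field

@dataclass
class MeetingNote:
    avatar_id: str
    text: str
    personal_status: str  # the speaking avatar's PS at the time of the statement

def draft_reply(comment: str, notes: list[MeetingNote]) -> tuple[str, str]:
    """Drafts the VS's reply text and its own PS from the meeting notes (stub)."""
    ...

def personal_status_display(text: str, ps: str) -> bytes:
    """PSD AIM (stub): renders the VS as a speaking avatar showing its PS."""
    ...

@dataclass
class VirtualSecretary:
    notes: list[MeetingNote] = field(default_factory=list)

    def record(self, avatar_id: str, text: str, personal_status: str) -> None:
        """Takes note of what was said, keeping the speaker's PS."""
        self.notes.append(MeetingNote(avatar_id, text, personal_status))

    def respond(self, comment: str) -> bytes:
        """Answers a comment or question from a meeting avatar, rendered via the PSD."""
        reply, vs_ps = draft_reply(comment, self.notes)
        return personal_status_display(reply, vs_ps)
```
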
Each participant can arrange the avatars at the meeting according to their preferences. The participant can select a point of view, which may coincide with their avatar’s point of view, see the meeting participants where they have been placed, and hear them with the correct spatial localisation.
Figure 11 – Avatar-Based Videoconference (Client RX)
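
Finally, a hypothetical sketch of the receiving Client: the participant places the avatars, picks a point of view, and hears each avatar with the corresponding spatial localisation. All names and types are assumptions.

```python
from dataclasses import dataclass

@dataclass
class PlacedAvatar:
    """An avatar placed in the virtual meeting room according to the participant's preference."""
    avatar_id: str
    position: tuple[float, float, float]

def render_view(avatars: list[PlacedAvatar], viewpoint: tuple[float, float, float]) -> bytes:
    """Renders the meeting as seen from the selected point of view (stub)."""
    ...

def spatialise(speech: bytes, source: tuple[float, float, float],
               viewpoint: tuple[float, float, float]) -> bytes:
    """Renders an avatar's speech so that it is heard from the avatar's position (stub)."""
    ...

def client_rx_step(avatars: list[PlacedAvatar],
                   speech_by_avatar: dict[str, bytes],
                   viewpoint: tuple[float, float, float]) -> tuple[bytes, list[bytes]]:
    """One rendering step: video from the chosen viewpoint, spatially localised audio per avatar."""
    video = render_view(avatars, viewpoint)
    audio = [spatialise(speech_by_avatar[a.avatar_id], a.position, viewpoint)
             for a in avatars if a.avatar_id in speech_by_avatar]
    return video, audio
```
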

Read about MPAI-MMC V2:

  1. A 2-minute video (YouTube, non-YouTube) illustrating MPAI-MMC V2.
  2. The slides presented at the online meeting on 2022/07/12.
  3. The video recording (YouTube, non-YouTube) of that 12 July online presentation.
  4. Call for Technologies, Use Cases and Functional Requirements, and Framework Licence.

The MPAI Secretariat shall receive submissions in response to the MPAI-MMC V2 Call for Technologies by 2022/10/24T23:59 UTC.


If you wish to participate in this work you have the following options:

  1. Join MPAI
  2. Participate, by sending an email to the MPAI Secretariat, until the Functional Requirements of MMC-HCI are approved (after that, only MPAI members can participate).
  3. Keep an eye on this page.

Return to the MPAI-MMC page