This is the public page of the Multimodal Conversation (MPAI-MMC) standard. See the MPAI-MMC homepage.
MPAI has developed Version 1, whose text can be downloaded, and has developed Use Cases and Functional Requirements for Version 2.
MPAI-MMC: Version 1 – Version 2
Version 1: MPAI-MMC V1 uses AI to enable human-machine conversation that emulates human-human conversation in completeness and intensity. The MPAI-MMC standard includes 5 Use Cases: Conversation with Emotion, Multimodal Question Answering, Unidirectional Speech Translation, Bidirectional Speech Translation, and One-to-Many Unidirectional Speech Translation.
The figures below show the reference models of the MPAI-MMC Use Cases. Note that an Implementation is supposed to run in the MPAI-specified AI Framework (MPAI-AIF).
Multimodal Conversation (MPAI-MMC) V2 intends to specify technologies that further enhance the capability of a human to converse with a machine in a variety of application environments. V2 will specify technologies supporting 5 new Use Cases, illustrated by the figures below.
The MPAI-MMC Technical Specification has been developed by the MMC Development Committee (MMC-DC) chaired by Miran Choi (ETRI). It has been approved and is available for download. A Reference Software implementation is being developed.
MMC-DC is developing the Reference Software, Conformance Testing, and Performance Assessment Specifications. It is also developing MPAI-MMC Version 2, which will contain the Human-CAV Interaction subsystem of Connected Autonomous Vehicles (MPAI-CAV) depicted below.
Personal Status Extraction: provides an estimate of the Personal Status (PS) – of a human or an avatar – conveyed by Text, Speech, Face, and Gesture. PS is the ensemble of information internal to a person, including Emotion, Cognitive State, and Attitude.
Figure 6 – Personal Status Extraction (PSE)
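As a purely illustrative aid, the sketch below shows one possible way to represent the Personal Status described above and the interface of a PSE module. The class and function names, and the string-valued factors, are assumptions made for this example, not normative MPAI-MMC data types.

```python
# Hypothetical sketch of a Personal Status record and a PSE entry point.
# Names and field choices are illustrative, not normative MPAI-MMC types.
from dataclasses import dataclass
from typing import Optional


@dataclass
class PersonalStatus:
    """Ensemble of information internal to a person (or avatar)."""
    emotion: Optional[str] = None          # e.g. "happy", "annoyed"
    cognitive_state: Optional[str] = None  # e.g. "confused", "attentive"
    attitude: Optional[str] = None         # e.g. "polite", "assertive"


def extract_personal_status(text: str, speech: bytes,
                            face: bytes, gesture: bytes) -> PersonalStatus:
    """Stand-in for a PSE module: fuse per-modality estimates into one PS."""
    # A real implementation would run modality-specific analysers (text,
    # speech, face, gesture) and fuse their outputs; this stub just returns
    # a fixed placeholder estimate.
    return PersonalStatus(emotion="neutral",
                          cognitive_state="attentive",
                          attitude="cooperative")
```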
Personal Status Display: generates a speaking avatar from Text and PS that utters the Text and displays the PS in its face and gesture.
Figure 7 – Personal Status Display (PSD)
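Continuing the same illustrative sketch (and reusing the PersonalStatus class above), a PSD module can be pictured as the complementary mapping: from Text plus PS to synthetic speech and avatar animation parameters. All names below are assumptions for this example only.

```python
# Hypothetical PSD interface, reusing the PersonalStatus sketch above.
from dataclasses import dataclass, field


@dataclass
class AvatarOutput:
    """Speaking-avatar output produced from Text and a Personal Status."""
    speech_audio: bytes = b""  # synthesised speech for the input Text
    face_params: dict = field(default_factory=dict)     # facial animation showing the PS
    gesture_params: dict = field(default_factory=dict)  # gesture animation showing the PS


def display_personal_status(text: str, ps: PersonalStatus) -> AvatarOutput:
    """Stand-in for a PSD module: turn Text plus PS into a speaking avatar."""
    # A real implementation would drive text-to-speech and avatar animation
    # conditioned on the PS; this stub only shows the data flow.
    return AvatarOutput(speech_audio=b"",
                        face_params={"emotion": ps.emotion},
                        gesture_params={"attitude": ps.attitude})
```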
Conversation About a Scene: a human holds a conversation with a machine about objects in a scene. While conversing, the human points a finger to indicate their interest in a particular object. The machine uses Visual Scene Description to extract the Human Object and the Physical Objects, PSE to understand the human's PS, and PSD to respond while showing its own PS.
Figure 8 – Conversation About a Scene (CAS)
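A minimal end-to-end sketch of the CAS flow, under the same illustrative assumptions (every function and the stubbed scene content are invented for this example): the scene is described, the pointed-at object is selected, and the reply would then be voiced by a PSD showing the machine's own PS.

```python
# Hypothetical CAS pipeline; every function here is an illustrative stub,
# not an MPAI-specified AI Module.
def describe_visual_scene(video: bytes):
    """Stub for Visual Scene Description: returns the Human Object and the
    Physical Objects found in the scene."""
    human_object = {"id": "human-1", "pointing_at": "vase"}
    physical_objects = ["lamp", "vase", "book"]
    return human_object, physical_objects


def conversation_about_a_scene(video: bytes, speech: bytes, text: str) -> str:
    human, objects = describe_visual_scene(video)
    # The pointing gesture selects the object the human is interested in.
    target = human["pointing_at"] if human["pointing_at"] in objects else None
    # Here a PSE would estimate the human's PS from text/speech/face/gesture,
    # and a PSD would utter the reply while showing the machine's own PS.
    return f"You are pointing at the {target}. What would you like to know about it?"


print(conversation_about_a_scene(b"", b"", "What is that?"))
```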
Human-Connected Autonomous Vehicle (CAV) Interaction: a group of humans converses with a CAV, which understands their utterances and their PSs by means of the PSE and manifests itself as the output of a PSD. HCI should recognise humans by face and speech, both when they are outside and approaching the CAV and when they are inside the cabin. HCI should create Audio-Visual Scene Descriptors to be able to deal with the individual humans.
Figure 9 – Human-CAV Interaction (HCI)
Avatar-Based Videoconference is a system in which Human Twins direct their Virtual Twins: speaking avatars with a high level of similarity, in voice and appearance, to their Human Twins. Each participant has a client connected to the server, which is optionally augmented by the Virtual Secretary.
Figure 10 – Avatar-Based Videoconference (ABV)
The Client receives the participant's Avatar Model and spoken language preferences at the start, and audio and video throughout the session. The Client extracts visual and speech features for authentication and continuously generates Avatar Descriptions, using the participant's PS to improve the accuracy of the participant description.
Figure 10 – Avatar-Based Videoconference (Client TX)
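The Client TX behaviour can be pictured as two kinds of messages: a one-off session-start message and a continuous per-frame message. The message layout below is an assumption made for illustration only; it is not the MPAI-MMC data format.

```python
# Hypothetical Client TX messages; field names are illustrative only.
import json


def session_start_message(avatar_model: bytes, language: str) -> dict:
    """Sent once at the start: the participant's Avatar Model and language preference."""
    return {"type": "session-start",
            "language_preference": language,
            "avatar_model_size": len(avatar_model)}


def frame_message(speech: bytes, video: bytes, personal_status: dict) -> dict:
    """Sent throughout the session: authentication features and Avatar Descriptions."""
    return {"type": "frame",
            # Speech and visual features support authentication at the Server.
            "speech_features_size": len(speech),
            "visual_features_size": len(video),
            # Avatar Descriptions are refined using the participant's PS.
            "avatar_description": {"personal_status": personal_status}}


print(json.dumps(session_start_message(b"\x00" * 2048, "en"), indent=2))
print(json.dumps(frame_message(b"...", b"...", {"emotion": "neutral"}), indent=2))
```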
The Server is operated by a manager who distributes the room model and oversees authentication using participants' face and speech descriptors. The Server translates the utterances of the individual participants into speech in the languages selected by the participants.
Figure 10 – Avatar-Based Videoconference (Server)
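The translation step on the Server can be pictured as a simple fan-out: each utterance is translated into every other participant's selected language. The sketch below is illustrative; translate() is a placeholder for whatever speech translation technology an implementation uses.

```python
# Hypothetical Server-side translation fan-out; translate() is a placeholder
# for an actual speech/text translation module.
def translate(text: str, target_language: str) -> str:
    return f"[{target_language}] {text}"


def fan_out(utterance: str, speaker: str, language_preferences: dict) -> dict:
    """Translate one participant's utterance for every other participant."""
    return {listener: translate(utterance, lang)
            for listener, lang in language_preferences.items()
            if listener != speaker}


prefs = {"alice": "en", "bruno": "it", "chen": "ko"}
print(fan_out("Good morning, everyone.", speaker="alice", language_preferences=prefs))
```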
The Virtual Secretary (VS) is a human-like speaking avatar, not representing any human, that takes notes of what is being said at the meeting, taking the avatars' PSs into account. Meeting avatars can make comments to the VS, answer questions, etc. The VS manifests itself through a PSD showing its PS.
Figure 10 – Avatar-Based Videoconference (Virtual Secretary)
Each participant can arrange the avatars at the meeting according to their preferences. The participant can select a point of view, which may coincide with their avatar's point of view, see the meeting participants as they have been placed, and hear them with correct spatial localisation.
Figure 11 – Avatar-Based Videoconference (Client RX) |
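To make the Client RX behaviour concrete, the sketch below uses elementary 2-D geometry (an assumption made only for illustration) to show how a chosen point of view determines where each avatar's speech should be localised.

```python
# Hypothetical Client RX spatialisation sketch using simple 2-D geometry.
import math


def relative_position(listener_xy, listener_heading_deg, source_xy):
    """Azimuth (degrees) and distance of a speaking avatar relative to the
    participant's chosen point of view."""
    dx = source_xy[0] - listener_xy[0]
    dy = source_xy[1] - listener_xy[1]
    azimuth = (math.degrees(math.atan2(dy, dx)) - listener_heading_deg) % 360.0
    return azimuth, math.hypot(dx, dy)


# The participant arranges avatars around a virtual table and takes their own
# avatar's position and heading as the point of view.
seats = {"alice": (0.0, 0.0), "bruno": (1.0, 1.0), "chen": (-1.0, 1.0)}
viewpoint, heading = seats["alice"], 90.0
for name, seat in seats.items():
    if name != "alice":
        az, dist = relative_position(viewpoint, heading, seat)
        print(f"{name}: azimuth {az:.0f} deg, distance {dist:.2f}")
```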
Read about MPAI-MMC V2:
- 2 min video (YouTube) and video (non-YouTube) illustrating MPAI-MMC V2.
- slides presented at the online meeting on 2022/07/12.
- video recording (YouTube, non-YouTube) of the 12 July online presentation.
- Call for Technologies, Use Cases and Functional Requirements, and Framework Licence.
The MPAI Secretariat shall receive submissions in response to the MPAI-MMC V2 Call for Technologies by 2022/10/24T23:59 UTC.
If you wish to participate in this work you have the following options:
- Join MPAI
- Participate in the work until the Functional Requirements of MMC-HCI are approved (after that, only MPAI members can participate) by sending an email to the MPAI Secretariat.
- Keep an eye on this page.
Return to the MPAI-MMC page