About the Multimodal Conversation Standard

The Multimodal Conversation (MPAI‑MMC) V2.5 standard defines a comprehensive, interoperable framework for human‑machine and machine‑machine conversation systems that integrate text, speech, vision, and behavioural signals. MPAI‑MMC enables advanced conversational applications built from AI Modules (AIMs) that exchange standard Data Types and operate within the MPAI Artificial Intelligence Framework (MPAI‑AIF).

Key Use Cases

MPAI‑MMC V2.5 fully specifies a rich set of multimodal conversational applications:

  • Answer to Multimodal Question (MMC‑AMQ): Responds to queries combining text, speech, and visual input.
  • Conversation About a Scene (MMC‑CAS): Enables interactive dialogue about objects and environments using speech, gestures, and visual cues.
  • Conversation with Personal Status (MMC‑CPS): Extracts and expresses internal states (Personal Status) during interaction.
  • Conversation with Emotion (MMC‑CWE): Supports emotionally expressive audio‑visual dialogue with synthetic agents.
  • Human‑Connected Autonomous Vehicle Interaction (MMC‑HCI): Enables natural interaction between humans and autonomous vehicles using multimodal signals.
  • Multimodal Question Answering (MMC‑MQA): Answers queries about displayed objects and scenes.
  • Text and Speech Translation (MMC‑TST): Provides flexible multimodal translation with optional preservation of speech characteristics.
  • Virtual Meeting Secretary (MMC‑VMS): Summarises meetings, interprets participant signals, and supports interaction in virtual environments.
  • Personal Status Extraction (MMC‑PSE): Estimates internal states from text, speech, face, and gestures.

Powered by the MPAI AI Framework

MPAI‑MMC operates within the MPAI Artificial Intelligence Framework (MPAI‑AIF), which provides a standard Execution Environment. Its architecture is composed of components (AIMs) that can be implemented in a platform‑independent manner and dynamically configured and orchestrated.
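As an illustration of what "configured and orchestrated" can mean in practice, the following is a minimal sketch of modules wired together by their named inputs and outputs. It is not the MPAI‑AIF API: the `AIM` class, `execute_workflow`, and the two toy modules are hypothetical names introduced only for this example.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List


@dataclass
class AIM:
    """Hypothetical stand-in for an AI Module: named inputs, one named output."""
    name: str
    inputs: List[str]
    output: str
    run: Callable[..., Any]


def execute_workflow(aims: List[AIM], data: Dict[str, Any]) -> Dict[str, Any]:
    """Run every module as soon as all of its inputs are available."""
    results = dict(data)
    pending = list(aims)
    while pending:
        ready = [a for a in pending if all(i in results for i in a.inputs)]
        if not ready:
            # No module can run: some input is never produced.
            raise ValueError(f"unsatisfied inputs for: {[a.name for a in pending]}")
        for aim in ready:
            results[aim.output] = aim.run(*(results[i] for i in aim.inputs))
            pending.remove(aim)
    return results


# Toy modules (assumptions): real AIMs would wrap speech recognition, dialogue, etc.
recognise = AIM("SpeechRecognition", ["speech"], "text", lambda s: s.lower())
answer = AIM("DialogueProcessing", ["text"], "reply", lambda t: f"You said: {t}")

# Listing order does not matter; execution follows data availability.
out = execute_workflow([answer, recognise], {"speech": "HELLO"})
print(out["reply"])  # You said: hello
```

Because the modules only agree on the names and types of the data they exchange, either one could be swapped for another implementation without touching the rest of the workflow.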

Benefits for the Ecosystem

MPAI‑MMC enables a multi‑vendor, interoperable AI ecosystem:

  • Technology Providers: Offer standard-compliant AI components to a global market.
  • Developers & Integrators: Build applications using reusable, interoperable modules.
  • End Users: Access more powerful, transparent, and trustworthy AI applications.
  • Society: Benefits from reduced opacity of AI through modular, inspectable systems.

A New Paradigm for Conversational AI

MPAI‑MMC promotes a shift from monolithic AI systems to:

  • Composable AI architectures
  • Reusable multimodal components
  • Transparent and explainable workflows
  • Shared Data Types and reusable AIMs

MPAI‑MMC enables scalable innovation and component reusability in the multimodal conversation domain. Most AI Modules are reused across the MPAI‑MMC use cases, which supports efficiency, consistency, and rapid development.

Conclusion

MPAI‑MMC V2.5 delivers a complete, interoperable framework for building next-generation conversational systems that:

  • Understand and generate across modalities
  • Capture human behavioural signals
  • Operate in standard, secure, and composable environments

Technical Specification: Multimodal Conversation (MPAI-MMC) V2 specifies technologies that, compared to V1, further enhance the capability of a human to converse with a machine in a variety of application environments. In particular, it extends the notion and the data format of Emotion to Personal Status, which additionally includes Cognitive State and Social Attitude. V2 applies Personal Status and other data types to support new use cases.

Personal Status Extraction (PSE) provides an estimate of the Personal Status (PS) of a human or an avatar, as conveyed by a Modality (Text, Speech, Face, or Gesture). PS is the ensemble of Factors, i.e., information internal to a human or an avatar (Emotion, Cognitive State, and Social Attitude), extracted through the steps of Description Extraction and PS Interpretation.
Figure 1 – Personal Status Extraction (PSE)
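The two steps named above can be sketched for a single Modality. The following is a toy illustration for Text only, under stated assumptions: the function names, the descriptor features, and the Factor labels ("excited", "confused", "polite", and so on) are all invented for this example and are not the vocabulary or the data formats defined by MPAI-MMC.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class PersonalStatus:
    """The three Factors of Personal Status (labels here are illustrative)."""
    emotion: str
    cognitive_state: str
    social_attitude: str


def extract_text_description(text: str) -> Dict[str, int]:
    """Step 1, Description Extraction (toy): count a few surface cues."""
    words = [w.strip("!?.,") for w in text.lower().split()]
    return {
        "exclamations": text.count("!"),
        "questions": text.count("?"),
        "thanks": sum(w in {"thanks", "thank"} for w in words),
    }


def interpret_text_description(desc: Dict[str, int]) -> PersonalStatus:
    """Step 2, PS Interpretation (toy): map descriptors to the three Factors."""
    emotion = "excited" if desc["exclamations"] else "neutral"
    cognitive_state = "confused" if desc["questions"] else "attentive"
    social_attitude = "polite" if desc["thanks"] else "neutral"
    return PersonalStatus(emotion, cognitive_state, social_attitude)


ps = interpret_text_description(extract_text_description("Thanks! Which one is it?"))
print(ps)
```

Keeping the two steps separate means a descriptor extractor and an interpreter can be replaced independently, which is the point of splitting the process; a real implementation would use trained models rather than these hand-written rules.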
An entity – a real or digital human – converses with a machine, possibly about physical objects in the environment. The machine captures and understands Speech, extracts Personal Status from the Text, Speech, Face, and Gesture Modalities, and fuses the per-Modality estimates into an estimated Personal Status of the entity to achieve a better understanding of the context in which the entity converses. The machine is represented by a Portable Avatar.
Figure 2 – Conversation with Personal Status (MMC-CPS)
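The text above does not say how the per-Modality estimates are combined. As one possible illustration only, the sketch below uses a simple majority vote per Factor; this is a swapped-in technique chosen for brevity, not the method specified by MPAI-MMC, and the function name and labels are hypothetical.

```python
from collections import Counter
from typing import Dict, Mapping

FACTORS = ("emotion", "cognitive_state", "social_attitude")


def fuse_personal_status(per_modality: Mapping[str, Mapping[str, str]]) -> Dict[str, str]:
    """Majority vote per Factor over the Modalities that reported it.

    Ties are resolved in favour of the label seen first (Counter ordering).
    """
    fused: Dict[str, str] = {}
    for factor in FACTORS:
        votes = Counter(
            estimate[factor] for estimate in per_modality.values() if factor in estimate
        )
        if votes:
            fused[factor] = votes.most_common(1)[0][0]
    return fused


# Illustrative per-Modality estimates; a Modality may omit a Factor it cannot judge.
estimates = {
    "text": {"emotion": "neutral", "cognitive_state": "confused", "social_attitude": "polite"},
    "speech": {"emotion": "happy", "cognitive_state": "confused", "social_attitude": "polite"},
    "face": {"emotion": "happy", "cognitive_state": "attentive", "social_attitude": "polite"},
    "gesture": {"emotion": "happy"},
}

fused = fuse_personal_status(estimates)
print(fused)
```

A weighted scheme that trusts some Modalities more than others, or a learned fusion model, would fit the same interface.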
A human holds a conversation with a machine about objects around the human. While conversing, the human points a finger to indicate interest in a particular object. The machine uses Visual Scene Description to extract the Human Object and the Physical Object, uses PSE to understand the human’s PS, and uses Personal Status Display (PSD) to respond while showing its own PS.
Figure 3 – Conversation About a Scene (CAS)
Humans converse with a Connected Autonomous Vehicle (CAV), which understands their utterances and their PSs by means of PSE and manifests itself as the output of a PSD. The HCI also recognises humans by face and speech, both when they are outside approaching the CAV and when they are inside the cabin. The figure also represents the communication of the Ego CAV HCI with Remote HCIs.
Figure 4 – Human-Connected Autonomous Vehicle (CAV) Interaction (HCI)
The Virtual Secretary (VS) is a human-like speaking avatar that does not represent a human. It produces a summary of what is being said at the meeting, including the participants’ PSs. Participating avatars can make comments to the VS, answer questions, etc. The VS manifests itself through a PSD.
Figure 5 – Avatar-Based Videoconference (Virtual Secretary)

Version 1: MPAI-MMC V1 enables human-machine conversation emulating human-human conversation in completeness and intensity using AI. The MPAI-MMC standard includes five Use Cases: Conversation with Emotion, Multimodal Question Answering, Unidirectional Speech Translation, Bidirectional Speech Translation, and One-to-Many Unidirectional Speech Translation.

The figures below show the reference models of the MPAI-MMC Use Cases. Note that an Implementation is supposed to run in the MPAI-specified AI Framework (MPAI-AIF).

Conversation with Emotion (CWE) enables a human to hold an audio-visual conversation with a machine impersonated by a synthetic voice and an animated face, both expressing emotion appropriate to the conversation with a human who displays an emotional state.
Figure 1 – Conversation with Emotion
Multimodal Question Answering (MQA) enables a user to request information using speech concerning an object the user displays and to receive the requested information from a machine via synthetic speech.
Figure 2 – Multimodal Question Answering
Unidirectional Speech Translation (UST) allows a user to select a language different from the one they use and to get a spoken utterance translated into the desired language, rendered with a synthetic voice that optionally preserves the personal vocal traits of the spoken utterance.
Figure 3 – Unidirectional Speech Translation
Bidirectional Speech Translation (BST) allows a human to hold a dialogue with another human. Both speak their own language, and their translated speech is synthetic speech that optionally preserves their personal vocal traits.
Figure 4 – Bidirectional Speech Translation
One-to-Many Speech Translation (MST) enables a human to select a number of languages and have their speech translated into the selected languages using synthetic speech that optionally preserves their personal vocal traits.
Figure 5 – One-to-Many Speech Translation
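The relationship between the three speech translation Use Cases can be sketched as follows, assuming a recognise, translate, synthesise chain that the descriptions above do not spell out. Everything here is a hypothetical stand-in: text replaces audio, a two-entry lexicon replaces a translation module, and a string label replaces personal vocal traits.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Utterance:
    language: str
    text: str    # stands in for the audio signal
    voice: str   # stands in for the speaker's vocal traits


# Toy lexicon (assumption) standing in for a real translation module.
TOY_LEXICON = {
    ("en", "it"): {"hello": "ciao"},
    ("en", "fr"): {"hello": "bonjour"},
}


def translate_speech(utt: Utterance, target: str, preserve_voice: bool = False) -> Utterance:
    """Unidirectional case: one source utterance, one target language."""
    recognised = utt.text.lower()                        # speech recognition (stub)
    lexicon = TOY_LEXICON[(utt.language, target)]
    translated = " ".join(lexicon.get(w, w) for w in recognised.split())
    voice = utt.voice if preserve_voice else "synthetic-default"
    return Utterance(target, translated, voice)          # speech synthesis (stub)


def translate_to_many(utt: Utterance, targets: List[str],
                      preserve_voice: bool = False) -> Dict[str, Utterance]:
    """One-to-many case: fan the same utterance out to several languages."""
    return {t: translate_speech(utt, t, preserve_voice) for t in targets}


source = Utterance("en", "Hello", voice="alice")
many = translate_to_many(source, ["it", "fr"], preserve_voice=True)
print(many["it"].text, many["fr"].text)  # ciao bonjour
```

In this sketch the bidirectional case would simply call `translate_speech` once per speaker with source and target languages swapped, which would need lexicon entries for the reverse direction.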

If you wish to participate in this work, you have the following options:

  1. Join MPAI
  2. Keep an eye on this page.
