
1       AI Workflows

MMC-AMQ Answer to Multimodal Question
MMC-CAS Conversation About a Scene
MMC-CPS Conversation with Personal Status
MMC-CWE Conversation with Emotion
MMC-HCI Human-CAV Interaction
MMC-MQA Multimodal Question Answering
MMC-TST Text and Speech Translation
MMC-VMS Virtual Meeting Secretary

1.1      Answer to Multimodal Question

MMC-AMQ is a PAAI that provides a Text and/or Speech response to a Text or Speech question related to an image.

It is composed of the following collaborating PAAIs:

Automatic Speech Recognition Converts a Speech Object into a Text Object.
Text and Image Query Receives a Text Object and a Visual Object and produces a Text Object containing the response to the input Text or Speech Object.
Text-To-Speech Converts the Text Object into a Speech Object.

Figure 14 – Reference Model of MMC-AMQ

MMC-AMQ performs Interpretation-Reasoning Operations.
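
Functionally, MMC-AMQ is a short pipeline: the Speech question, if any, is transcribed, the transcription and the image go to Text and Image Query, and the textual answer is optionally synthesised as Speech. A minimal Python sketch, in which asr, text_and_image_query, and tts are hypothetical stand-ins for the three collaborating PAAIs, not normative interfaces:

  # Hypothetical stand-ins for the three collaborating PAAIs.
  def asr(speech_object: bytes) -> str:
      return "what is this object"              # placeholder transcription

  def text_and_image_query(text: str, image: bytes) -> str:
      return f"Answer to: {text}"               # placeholder response

  def tts(text: str) -> bytes:
      return text.encode()                      # placeholder waveform

  def answer_multimodal_question(question, image: bytes, want_speech: bool = False):
      """MMC-AMQ sketch: question is a Text Object (str) or a Speech Object (bytes)."""
      text = asr(question) if isinstance(question, (bytes, bytearray)) else question
      answer = text_and_image_query(text, image)    # Text Object of the response
      return tts(answer) if want_speech else answer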

1.2      Conversation About a Scene

MMC-CAS is a PAAI that engages in a conversation with an Entity about objects in a scene, e.g., a shop full of objects where the salesclerk is a Machine.

It is composed of the following collaborating PAAIs:

Visual Scene Description Provides the Visual Scene Descriptors.
Visual Object Identification Provides the ID of a Visual Object.
Automatic Speech Recognition Converts the input Speech Object into a Text Object.
Natural Language Understanding Refines the Text providing Refined Text and Meaning.
Personal Status Extraction Extracts the Entity’s Personal Status.
Entity Dialogue Processing Responds to the input from the Entity.
Personal Status Display Provides the Machine’s Portable Avatar.
Audio-Visual Scene Rendering Displays the Scene as seen by the Avatar and the Avatar from an Entity-selected Point of View.

Figure 15 – Reference Model of MMC-CAS

MMC-CAS performs Descriptors-Interpretation-Reasoning Operations.

1.3      Conversation with Emotion

MMC-CWE is a PAAI that converses with an Entity in natural language and shows itself as a speaking avatar displaying an Emotion congruent with the Emotion displayed by the Entity.

It is composed of the following collaborating PAAIs:

Automatic Speech Recognition Converts the input Speech Object into a Text Object.
Entity Speech Description Extracts the Entity’s Speech Descriptors.
Entity Face Description Extracts the Entity’s Face Descriptors.
Natural Language Understanding Provides Refined Text and Meaning.
PS-Speech Interpretation Provides the Entity’s Emotion in Speech.
PS-Face Interpretation Provides the Entity’s Emotion in Face.
PS-Text Interpretation Provides the Entity’s Emotion in Text.
Multimodal Emotion Fusion Provides a single Entity Emotion.
Entity Dialogue Processing Responds to the input from the Entity providing Machine Text and Machine Emotion.
Text-To-Speech Provides Machine Speech that conveys the Machine’s Emotion.
Video Lip Animation Displays the Avatar’s Face.

Figure 16 – Reference Model of MMC-CWE

MMC-CWE performs Descriptors-Interpretation-Reasoning Operations.
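
The step specific to this workflow is Multimodal Emotion Fusion, which reduces the Emotions extracted from Speech, Face, and Text to a single Entity Emotion. A minimal sketch of one possible fusion policy, confidence-weighted voting, which is an illustrative assumption rather than a mandated method:

  from collections import defaultdict

  def fuse_emotions(factor_emotions):
      """factor_emotions: (emotion_label, confidence) pairs, one per Factor
      (Speech, Face, Text). Returns the single fused Entity Emotion."""
      scores = defaultdict(float)
      for label, confidence in factor_emotions:
          scores[label] += confidence           # accumulate evidence per label
      return max(scores, key=scores.get)

  # Speech and Face agree on "joy"; Text disagrees with low confidence.
  assert fuse_emotions([("joy", 0.8), ("joy", 0.6), ("neutral", 0.3)]) == "joy"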

1.4      Conversation with Personal Status

MMC-CPS is a PAAI that converses with an Entity who indicates visual objects and displays a Personal Status, while itself displaying a Personal Status congruent with the Entity’s.

It is composed of the following collaborating PAAIs:

Visual Scene Description Describes the Visual Scene.
Speech Scene Description Describes the Speech Scene.
Visual Object Identification Identifies Visual Objects.
Automatic Speech Recognition Recognises the Entity’s Speech.
Natural Language Understanding Refines the Recognised Text and extracts the Meaning from it.
Personal Status Extraction Extracts the Personal Status.
Entity Dialogue Processing Responds to the Entity with a Personal Status congruent with that of the Entity.
Personal Status Display Produces the Personal Status.
Audio-Visual Scene Rendering Renders the Audio-Visual Scene as seen by the MMC-CPS PAAI.

MMC-CPS performs Descriptors-Interpretation-Reasoning Operations.

1.5      Human-CAV Interaction

Human-CAV Interaction (MMC-HCI) is a PAAI that:

  • Responds to human utterances expressed by text, speech, face, and gesture.
  • Executes requests that are in the scope of a human close to or inside a CAV:
    • Recognise the identity of the human owner or renter (face and speech).
    • Respond to humans’ commands and queries.
    • Converse with humans.
    • Manifest itself as an audio-visual entity.
    • Exchange information with the Autonomous Motion Subsystem in response to humans’ requests.
    • Communicate with a Process or Remote HCIs.

It is composed of the following collaborating PAAIs:

Audio-Visual Scene Description Produces Audio-Visual Scene Descriptors.
Automatic Speech Recognition Produces Recognised Text.
Visual Object Identification Provides the Instance IDs of indicated Visual Objects.
Natural Language Understanding Produces Refined Text and Meaning.
Speaker Identity Recognition Produces the human’s Speaker ID.
Personal Status Extraction Produces the human’s Personal Status.
Face Identity Recognition Produces the human’s Face ID.
Entity Dialogue Processing Produces the MMC-HCI PAAI’s Text Object and Personal Status, the AMS-HCI Messages, and the Ego-Remote HCI Messages.
Personal Status Display Produces the MMC-HCI PAAI’s Portable Avatar.
Audio-Visual Scene Rendering Produces the MMC-HCI PAAI’s Output Speech, Output Audio, and Output Visual.

Figure 17 – Reference Model of MMC-HCI

The following links analyse the AI Modules: Audio-Visual Scene Description, Audio-Visual Scene Rendering, Automatic Speech Recognition, Entity Dialogue Processing, Face Identity Recognition, Natural Language Understanding, Personal Status Display, Personal Status Extraction, Speaker Identity Recognition, and Visual Object Identification.

1.6      Multimodal Question Answering

MMC-MQA is a PAAI that implements the Text and Image Query PAAI of the Answer to Multimodal Question (MMC-AMQ) workflow with the following PAAIs:

Visual Object Identification Provides the ID of a Visual Object.
Automatic Speech Recognition Recognises the Text in the Speech.
Natural Language Understanding Refines the Recognised Text and extracts Meaning.
Question Analysis Module Provides the Intention.
Answer to Question Module Produces a response to the question.
Text-To-Speech Synthesises Speech from Text.

Figure 18 – Reference Model of MMC-MQA

1.7      Text and Speech Translation

MMC-TST is a PAAI that translates a Text Object or a Speech Object into a Text Object and/or a Speech Object in a different language. MMC-TST may optionally retain the features of the input Speech Object in the translated Speech Object.

It is composed of the following PAAIs:

Automatic Speech Recognition Extracts Text from Speech.
Text-to-Text Translation Translates Text in one language into Text in another language.
Entity Speech Description Extracts Speech Descriptors from Speech.
Text-to-Speech with Descriptors Synthesises Speech adding the Speech Descriptors.

Figure 19 – Reference Model of MMC-TST

1.8      Virtual Meeting Secretary

MMC-VMS is a PAAI that:

  • Describes the avatars attending a meeting in terms of Space-Time.
  • Interprets the Text provided and the Speech uttered by the avatars attending the meeting.
  • Extracts the avatars’ Personal Status.
  • Produces a Summary of what is being said.
  • Produces the Text and Personal Status of its responses.
  • Produces a Portable Avatar of itself.

It is composed of the following PAAIs:

Portable Avatar Demultiplexing Provides the Data required by Virtual Secretary’s AIMs.
Automatic Speech Recognition Provides Recognised Text.
Natural Language Understanding Extracts Meaning.
Personal Status Extraction Extracts Personal Status.
Summary Creation Module Produces and refines Summary using Edited Summary.
Entity Dialogue Processing Produces Text, Virtual Secretary Personal Status, and Edited Summary.
Personal Status Display Shows Virtual Secretary as Virtual Secretary Portable Avatar.

Figure 20 – Reference Model of MMC-VMS
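
The distinctive feature of this workflow is the loop between Summary Creation and Entity Dialogue Processing: the SCM drafts the Summary, the EDP returns an Edited Summary, and the SCM uses it when refining the next draft. A minimal orchestration sketch in which every AIM is a hypothetical stub:

  def demultiplex(portable_avatar): return portable_avatar     # Portable Avatar Demultiplexing
  def asr(speech): return "we should adopt option B"           # Automatic Speech Recognition
  def extract_personal_status(data): return "engaged"          # Personal Status Extraction
  def create_summary(text, status, edited=None):               # Summary Creation Module
      return edited or f"({status}) {text}"
  def dialogue_processing(text, summary):                      # Entity Dialogue Processing
      return f"Noted: {text}", summary                         # (response, Edited Summary)

  def virtual_meeting_secretary(portable_avatars):
      edited_summary = None
      for avatar in portable_avatars:           # one turn per attending avatar
          data = demultiplex(avatar)
          text = asr(data["speech"])
          status = extract_personal_status(data)
          draft = create_summary(text, status, edited_summary)
          response, edited_summary = dialogue_processing(text, draft)
      return edited_summary

  final_summary = virtual_meeting_secretary([{"speech": b"..."}])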

2       AI Modules

2.1      Automatic Speech Recognition

MMC-ASR is a PAAI that:

Receives
  Language Selector   Signals the language of the Speech.
  Auxiliary Text      Text that may be used to provide context information.
  Speech Object       The Speech to be recognised.
  Speaker ID          ID of the speaker uttering the Speech.
  Speech Overlap      Data Type providing information on speech overlap.
  Speaker Time        Time during which the Speech is to be recognised.
Produces
  Recognised Text     Also called Transcript.

The MMC-ASR PAAI can receive various types of data that may help it improve recognition:

  1. Just the Speech Data to be recognised.
  2. A Qualifier providing Speech-related information, such as the language of the Speech.
  3. Auxiliary Text providing the context of the Speech.
  4. A Speaker ID identifying the speaker, enabling speaker-dependent recognition.
  5. Speech Overlap telling MMC-ASR that there may be more than one speaker’s speech in all or part of the Speech Object.
  6. Speaker Time indicating when the Speech is to be recognised.

MMC-ASR performs Descriptors-Interpretation Level Operations.
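
The optional inputs constrain the recogniser rather than carry content to be transcribed. A sketch of the interface under these assumptions, with recognise_segment standing in for the actual recognition engine:

  from dataclasses import dataclass
  from typing import Optional

  @dataclass
  class ASRInputs:
      speech: bytes                         # Speech Object to be recognised
      language: Optional[str] = None        # Language Selector
      auxiliary_text: Optional[str] = None  # context used to bias recognition
      speaker_id: Optional[str] = None      # selects a speaker-dependent model
      overlap: Optional[list] = None        # spans known to contain overlapped speech
      speaker_time: Optional[tuple] = None  # (start, end) span to recognise

  def recognise_segment(segment, language, context):    # hypothetical engine
      return "recognised text"

  def mmc_asr(inputs: ASRInputs) -> str:
      start, end = inputs.speaker_time or (0, len(inputs.speech))
      segment = inputs.speech[start:end]    # restrict recognition to Speaker Time
      # A real implementation would also exploit speaker_id and overlap.
      return recognise_segment(segment, inputs.language, inputs.auxiliary_text)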

2.2      Entity Dialogue Processing

MMC-EDP is a PAAI that:

Receives
  Text Object          Text of the upstream entity to be processed.
  Object Instance ID   ID of an object in a scene.
  Personal Status      Personal Status of the upstream entity.
  Text Descriptors     Descriptors of the input Text Object.
  AV Scene Geometry    Geometry of the AV scene containing the object whose ID is provided.
  Speaker ID           ID of the speaker uttering the speech that contains the Text Object.
  Face ID              ID of the face of the speaker uttering the speech that contains the Text Object.
  Summary              A summary of the discussions being held in the environment.
Handles
  Text Object          From an upstream entity.
Recognises
  Identity             Of the upstream entity, using speech and/or face.
Takes into account
  Past Text Objects    And their spatial arrangement.
Produces
  Summary              Edited Summary based on the input data.
  Text Object          Of the Machine.
  Personal Status      Of the Machine.

The MMC-EDP PAAI can receive various types of data that may be used to help it provide a better or richer response:

  1. Text to which it responds with:
    1. Text that is a response to
      1. A finite set of questions.
      2. An indefinite set of questions to which a response is provided by a general-purpose or purpose-built Large Language Model (LLM) or Small Language Model (SLM).
    2. A Personal Status obtained by inferring, from the input Text, the internal state of the Entity that has generated the Text and by creating a fictitious Machine Personal Status that is congruent with the Personal Status of the Entity-provided Text.
  2. Entity ID helping to access previously stored PAAI Experiences with the same Entity.
  3. Meaning provided by a Natural Language Understanding (NLU) that produces Refined Text and Meaning from Recognised Text to help the MMC-EDP produce a better response.
  4. Personal Status provided by a Personal Status Extraction PAAI that can act on Text and, if available, on the other Factors: Speech, Face, and Gesture. As it does for Text, and depending on the capability of the communication system incorporating the MMC-EDP, the EDP can produce an extended Personal Status including Speech, Face, and Gesture.
  5. Audio-Visual Scene Geometry provided by an Audio-Visual Scene Description PAAI because the spatial context in which the information-providing Entity operates helps focus the MMC-EDP’s response.
  6. Audio-Visual Object IDs provided by an Audio Object Identification (CAE-AOI) or a Visual Object Identification (OSD-VOI) PAAI, because the identity of one or more Audio, Visual, or Audio-Visual Objects present in the spatial environment where the Entity operates helps focus the MMC-EDP’s response.

Therefore, an MMC-EDP could provide the following prompt to its LLM:

Please respond to the following Text provided by an Entity with the following Entity ID which is believed to hold the following Personal Status and is located in a scene populated by the following audio objects identified by their Audio Object IDs and visual objects identified by their Visual Object IDs that are located at the following Points of Views of the scene, respectively.

The data referred to in the prompt (Entity ID, Personal Status, Audio Object IDs, Visual Object IDs, and Points of View) are the input data of the EDP.
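
A sketch of how such a prompt could be assembled from the EDP inputs; the function and field names are illustrative, not normative:

  def build_edp_prompt(text, entity_id, personal_status,
                       audio_object_ids, visual_object_ids, points_of_view):
      objects = ", ".join(f"{vid} at Point of View {pov}"
                          for vid, pov in zip(visual_object_ids, points_of_view))
      return (f"Please respond to the following Text provided by Entity {entity_id}, "
              f"believed to hold Personal Status '{personal_status}', located in a "
              f"scene populated by audio objects {', '.join(audio_object_ids)} and "
              f"visual objects {objects}.\nText: {text}")

  prompt = build_edp_prompt("Where is the exit?", "E-042", "anxious",
                            ["A-1"], ["V-7", "V-9"], ["(2,0,5)", "(1,3,0)"])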

MMC-EDP performs Reasoning Level Operations.

2.3      Natural Language Understanding

MMC-NLU is a PAAI that:

Receives
  Text Object           Provided by the Entity.
  Recognised Text       Provided by an MMC-ASR.
  Instance ID           ID of an Object Instance.
  AV Scene Descriptors  Including the object whose ID is provided.
Refines
  Input Text            If coming from an MMC-ASR.
Extracts
  Meaning               From the Recognised Text or the Entity’s Text Object.
Produces
  Refined Text          Text that corrects/refines the Recognised Text.
  Meaning               Meaning of the input Text.

Instance ID and Audio-Visual Scene Descriptors may help the NLU reason about incorrectly recognised word(s) and identify the correct Meaning of the Text.
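
For instance, if the ASR output contains “battle” but the AV Scene Descriptors only list a “bottle”, the scene evidence can tip the correction. A toy sketch of this refinement step; the use of difflib string similarity is an illustrative assumption:

  import difflib

  def refine_text(recognised_text, scene_object_labels):
      """Replace words that nearly match a scene-object label with that label."""
      refined = []
      for word in recognised_text.split():
          match = difflib.get_close_matches(word, scene_object_labels, n=1, cutoff=0.8)
          refined.append(match[0] if match else word)
      return " ".join(refined)

  # "battle" is corrected to "bottle" because the scene contains a bottle.
  assert refine_text("hand me the battle", ["bottle", "glass"]) == "hand me the bottle"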

MMC-NLU performs Interpretation or Reasoning Level Operations.

2.4      Personal Status Extraction

MMC-PSE is a PAAI that includes a plurality of PAAIs producing the Personal Status of an Entity that produces text, utters speech, and displays face and gesture.

Receives
  Selector             Indicates whether each Factor is provided as Media or as Descriptors.
  Text                 Text data.
  Speech               Speech data.
  Face                 Face data.
  Gesture              Gesture data.
  Text Descriptors     Descriptors of the Text.
  Speech Descriptors   Descriptors of the Speech.
  Face Descriptors     Descriptors of the Face.
  Gesture Descriptors  Descriptors of the Gesture.
Produces
  Personal Status      The combined Personal Status or the Factor Personal Statuses.

MMC-PSE can be implemented with data processing or neural network technologies as:

  1. A single component receiving the four Factor data.
  2. Four components receiving and interpreting the four Factor data, and a multiplexer.
  3. Four components receiving and describing the four Factor data, four components interpreting the descriptors, and a multiplexer.
  4. A variety of combinations where each of the input data may be directly interpreted or first described and then interpreted.
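
A minimal sketch of option 2 above (four Factor components plus a multiplexer); the interpret_* functions are hypothetical per-Factor components, and the highest-confidence fusion policy is an assumption:

  def interpret_text(text): return ("joy", 0.4)          # Factor component stubs
  def interpret_speech(speech): return ("joy", 0.7)
  def interpret_face(face): return ("surprise", 0.5)
  def interpret_gesture(gesture): return ("joy", 0.6)

  def personal_status_extraction(text=None, speech=None, face=None, gesture=None):
      """Option 2: per-Factor interpretation followed by a multiplexer."""
      statuses = {}
      if text is not None: statuses["Text"] = interpret_text(text)
      if speech is not None: statuses["Speech"] = interpret_speech(speech)
      if face is not None: statuses["Face"] = interpret_face(face)
      if gesture is not None: statuses["Gesture"] = interpret_gesture(gesture)
      # Multiplexer: here, pick the highest-confidence Factor Personal Status.
      label, _ = max(statuses.values(), key=lambda s: s[1])
      return label, statuses

  fused, per_factor = personal_status_extraction(text="great!", speech=b"...")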

MMC-PSE performs Descriptors-Interpretation Level Operations.

2.5      Speaker Identity Recognition

MMC-SIR is a PAAI that produces an estimate of the ID of the Entity that uttered the speech by acting on a plurality of data sources:

Receives
  Speech Object          The Speech whose Speaker ID is requested.
  Auxiliary Text         Text related to the Speech.
  Speech Time            Time during which the Speaker ID is requested.
  Speech Overlap         Data signalling which parts of the Speech Data contain overlapping speech.
  Speech Scene Geometry  Disposition of the Speech Data in the scene containing the Speech whose speaker is to be identified.
Produces
  Speaker Identifier     ID of the speaker, referencing a Taxonomy.

Various technologies can be used to implement an MMC-SIR: Hidden Markov Models (HMM), Dynamic Time Warping (DTW), Neural Networks [3], Deep Feedforward and Recurrent Neural Networks, and End-to-End approaches.
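
Whatever the underlying model, a common pattern maps the Speech Object to an embedding and compares it with enrolled speaker embeddings referencing the Taxonomy. A minimal sketch of that pattern; the toy embeddings and the acceptance threshold are assumptions:

  import math

  def cosine(a, b):
      dot = sum(x * y for x, y in zip(a, b))
      norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
      return dot / norm

  def identify_speaker(speech_embedding, enrolled, threshold=0.7):
      """enrolled: {speaker_id: embedding}. Returns the best-matching Speaker ID,
      or None if no enrolled speaker is similar enough."""
      best_id, best_score = None, threshold
      for speaker_id, embedding in enrolled.items():
          score = cosine(speech_embedding, embedding)
          if score > best_score:
              best_id, best_score = speaker_id, score
      return best_id

  enrolled = {"spk-01": [1.0, 0.0, 0.2], "spk-02": [0.1, 1.0, 0.0]}
  assert identify_speaker([0.9, 0.1, 0.3], enrolled) == "spk-01"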

MMC-SIR may perform Descriptors-Interpretation Level Operations.

2.6      Summary Creation Module

MMC-SCM is a PAAI that:

Receives
  Entity ID        ID of the Entity of which a report is made.
  Text Object      Text Object whose Data is reported.
  Space-Time       The Entity’s space-time information.
  Personal Status  The Entity’s Personal Status.
  Summary          The Edited Summary revised by the MMC-EDP and sent back.
Produces
  Summary          The Summary produced and refined using the Edited Summary.

MMC-SCM:

  1. Can be implemented with a Neural Network.
  2. Unlike most summarisers:
    1. Adds information on the ID, Space-Time, and Personal Status of the Entity producing the Text.
    2. Receives a proposed revised Summary that it should review.

MMC-SCM performs Descriptors-Interpretation-Reasoning Level Operations.
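
A sketch of the enrichment and review loop that distinguishes the MMC-SCM from a plain summariser: each contribution is tagged with the Entity ID, Space-Time, and Personal Status before summarisation, and an Edited Summary returned by the EDP supersedes the current draft. The summarise stub and the tagging format are assumptions:

  def summarise(tagged_entries):            # hypothetical core summariser
      return " ".join(tagged_entries)

  def summary_creation(entries, edited_summary=None):
      """entries: dicts with entity_id, space_time, personal_status, text.
      An Edited Summary from the EDP, if present, supersedes the draft."""
      if edited_summary is not None:
          return edited_summary             # adopt the EDP-revised Summary
      tagged = [f"[{e['entity_id']} @ {e['space_time']}, {e['personal_status']}] "
                f"{e['text']}" for e in entries]
      return summarise(tagged)

  draft = summary_creation([{"entity_id": "AV-3", "space_time": "10:02/seat 4",
                             "personal_status": "calm", "text": "I propose option B."}])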

2.7      Text and Image Query

MMC-TIQ is a PAAI that:

Receives
  Text Object    The textual part of the query.
  Visual Object  The image part of the query.
Produces
  Text Object    The response to the Text and Image provided as input.

An MMC-TIQ can operate at various levels of complexity, e.g.:

  1. It interprets the text from a limited repertory of questions – e.g., What is this object? Which objects are there in this picture? – related to a specific image.
  2. It accepts generic questions on generic visual information.

An MMC-TIQ can be implemented as a Large Language Model.
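
A sketch contrasting the two levels: level 1 matches the question against a finite repertory of handlers, level 2 falls back to a general model. vlm_answer stands in for such an LLM back end and is an assumption:

  def vlm_answer(text, image):              # hypothetical general-purpose model
      return "a free-form answer about the image"

  REPERTORY = {                             # level 1: finite question repertory
      "what is this object": lambda image: "It is a chair.",
      "which objects are there in this picture": lambda image: "A chair and a lamp.",
  }

  def text_and_image_query(text, image):
      handler = REPERTORY.get(text.strip().lower().rstrip("?"))
      if handler is not None:
          return handler(image)             # level 1 answer
      return vlm_answer(text, image)        # level 2: generic questions

  assert text_and_image_query("What is this object?", b"...") == "It is a chair."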

MMC-TIQ performs Descriptors-Interpretation-Reasoning Level Operations.

2.8      Text and Speech Translation

MMC-TST is a PAAI composed of a set of collaborating PAAIs – MMC-ASR, MMC-TTT, MMC-ESD, and MMC-TSD – that:

Receives
  Selector              Chooses between Text and Speech output and, if Speech, whether to retain the input Speech features.
  Language Preferences  Signal the requested input and output languages.
  Text                  Text to be translated.
  Speech                Speech to be translated.
Performs some or all of
  Speech Conversion     Into Text, if the input is Speech.
  Text Translation      Into the target language.
  Descriptor Extraction From the Speech.
  Text Conversion       Into Speech, adding the input Speech’s features, if the output is Speech.
Produces
  Translated Text       Depending on the Selector.
  Translated Speech     Depending on the Selector.

MMC-TST performs Data Processing (MMC-TSD), Descriptors (MMC-ESD), and Interpretation (MMC-ASR) Level Operations.
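
A sketch of the Selector logic routing the four component PAAIs; asr, translate, extract_descriptors, and tts_with_descriptors are hypothetical stand-ins for MMC-ASR, MMC-TTT, MMC-ESD, and MMC-TSD:

  def asr(speech): return "hello"                                   # MMC-ASR stub
  def translate(text, src, dst): return f"[{dst}] {text}"           # MMC-TTT stub
  def extract_descriptors(speech): return {"pitch": "high"}         # MMC-ESD stub
  def tts_with_descriptors(text, desc=None): return text.encode()   # MMC-TSD stub

  def mmc_tst(text=None, speech=None, src="en", dst="it",
              want_speech=False, retain_features=False):
      if text is None and speech is not None:
          text = asr(speech)                         # Speech Conversion into Text
      translated = translate(text, src, dst)         # Text Translation
      if not want_speech:
          return translated                          # Translated Text
      desc = extract_descriptors(speech) if (retain_features and speech) else None
      return tts_with_descriptors(translated, desc)  # Translated Speech

  assert mmc_tst(speech=b"...") == "[it] hello"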

2.9      Text-To-Speech

MMC-TTS is a PAAI that:

Receives
  Text Object      To be converted to Speech.
  Personal Status  To be conveyed by the Synthesised Speech Object.
  Speech Model     To be used to synthesise the Speech.
Feeds
  The Text Object and Personal Status to the Speech Model.
Produces
  Synthesised Speech Object.

MMC-TTS can be implemented as a Neural Network. Common models are:

  1. WaveNet: generates raw audio waveforms.
  2. Tacotron Series: produces mel-spectrograms from text that are converted to speech by a vocoder (e.g., WaveNet), using an autoregressive approach.
  3. FastSpeech Series: produces mel-spectrograms with a non-autoregressive approach.

MMC-TTS performs Data Processing Level Operations.
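
The models above share a two-stage structure: an acoustic model maps the Text Object, conditioned on the Personal Status, to a mel-spectrogram, and a vocoder maps the spectrogram to a waveform. A minimal sketch with both stages as hypothetical stubs:

  def acoustic_model(text, personal_status):     # Tacotron/FastSpeech-like stage
      # Returns a mel-spectrogram (a list of 80-band frames), conditioned on
      # the Personal Status so that the synthesised voice conveys it.
      return [[0.0] * 80 for _ in range(len(text) * 5)]

  def vocoder(mel_frames):                       # WaveNet-like stage
      return bytes(len(mel_frames) * 256)        # placeholder waveform

  def mmc_tts(text_object, personal_status="neutral"):
      mel = acoustic_model(text_object, personal_status)
      return vocoder(mel)                        # Synthesised Speech Object

  speech = mmc_tts("Good morning", personal_status="cheerful")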

2.10     Text-to-Speech with Descriptors

MMC-TSD is a PAAI that:

Receives
  Text Object         To be synthesised with the colour of the input Speech Descriptors.
  Speech Descriptors  To be used to produce the synthetic Speech.
Produces
  Synthesised Speech Object  Having the Descriptors of the input Speech Object.

MMC-TSD adds the speech colour information provided by the Descriptors to the synthesised speech. If it is implemented as a neural network, it requires knowledge of the semantics of the Descriptors.

MMC-TSD performs Data Processing Level Operations.