This document is a working draft of Version 2 of Technical Specification: Multimodal Conversation (MPAI-MMC) published with a request for Community Comments. Comments should be sent to the MPAI Secretariat by 2023/09/25T23:59 UTC to enable MPAI to consider comments for potential inclusion in the final text of the Technical Specification planned to be approved for publication by the 36th General Assembly (2023/09/29).

 

 

WARNING

 

Use of the technologies described in this Technical Specification may infringe patents, copyrights or intellectual property rights of MPAI Members or non-members.

 

MPAI and its Members accept no responsibility whatsoever for damages or liability, direct or consequential, which may result from the use of this Technical Specification.

 

Readers are invited to review Annex 3 – Notices and Disclaimers.

 

 

 

 

1        Introduction (Informative) 7

2        Scope of Standard. 8

3        Terms and Definitions. 10

4        References. 12

4.1         Normative References. 12

4.2         Informative References. 12

5        Use Case Architectures. 13

5.1         Conversation with Personal Status (CPS) 13

5.1.1     Scope of Conversation with Personal Status. 13

5.1.2     Reference Architecture of Conversation with Personal Status. 13

5.1.3     I/O Data of Conversation with Personal Status. 14

5.1.4     Functions of AI Modules of Conversation with Personal Status. 14

5.1.5     I/O Data of AI Modules of Conversation with Personal Status. 15

5.1.6     JSON Metadata of Conversation with Personal Status. 15

5.2         Conversation with Emotion (CWE) 15

5.2.1     Scope of Conversation with Emotion. 15

5.2.2     Reference Architecture of Conversation with Emotion. 15

5.2.3     I/O Data of Conversation with Emotion. 16

5.2.4     Functions of AI Modules of Conversation with Emotion. 17

5.2.5     I/O Data of AI Modules of Conversation with Emotion. 17

5.2.6     JSON Metadata of Conversation with Emotion. 17

5.3         Multimodal Question Answering (MQA) 17

5.3.1     Scope of Multimodal Question Answering. 17

5.3.2     Reference Architecture of Multimodal Question Answering. 18

5.3.3     I/O Data of Multimodal Question Answering. 18

5.3.4     Functions of AI Modules of Multimodal Question Answering. 19

5.3.5     I/O Data of AI Modules of Multimodal Question Answering. 19

5.3.6     JSON Metadata of Multimodal Question Answering. 19

5.4         Conversation About a Scene (CAS) 19

5.4.1     Scope of Conversation About a Scene. 19

5.4.2     Reference Architecture of Conversation About a Scene. 20

5.4.3     I/O Data of Conversation About a Scene. 21

5.4.4     Functions of AI Modules of Conversation About a Scene. 21

5.4.5     I/O Data of AI Modules of Conversation About a Scene. 21

5.4.6     JSON Metadata of Conversation About a Scene. 22

5.5         Virtual Secretary for Videoconference (VSV) 22

5.5.1     Scope of Virtual Secretary for Videoconference. 22

5.5.2     Reference Architecture of Virtual Secretary for Videoconference. 22

5.5.3     I/O Data of Virtual Secretary for Videoconference. 24

5.5.4     Functions of AI Modules of Virtual Secretary for Videoconference. 24

5.5.5     I/O Data of AI Modules of Virtual Secretary for Videoconference. 24

5.5.6     JSON Metadata of Virtual Secretary for Videoconference. 25

5.6         Human-Connected Autonomous Vehicle (CAV) Interaction (HCI) 25

5.6.1     Scope of Human-CAV Interaction. 25

5.6.2     Reference Architecture of Human-CAV Interaction. 25

5.6.3     I/O Data of Human-CAV Interaction. 27

5.6.4     Functions of AI Modules of Human-CAV Interaction. 28

5.6.5     I/O Data of AI Modules of Human-CAV Interaction. 28

5.6.6     JSON Metadata of Human-CAV Interaction. 29

5.8         Unidirectional Speech Translation (UST) 29

5.8.1     Scope of Unidirectional Speech Translation. 29

5.8.2     Reference Architecture of Unidirectional Speech Translation. 29

5.8.3     I/O Data of Unidirectional Speech Translation. 30

5.8.4     Functions of AI Modules of Unidirectional Speech Translation. 30

5.8.5     I/O Data of AI Modules of Unidirectional Speech Translation. 31

5.8.6     JSON Metadata of Unidirectional Speech Translation. 31

5.9         Bidirectional Speech Translation (BST) 31

5.9.1     Scope of Bidirectional Speech Translation. 31

5.9.2     Reference Architecture of Bidirectional Speech Translation. 31

5.9.3     I/O Data of Bidirectional Speech Translation. 32

5.9.4     Functions of AI Modules of Bidirectional Speech Translation. 32

5.9.5     I/O Data of AI Modules of Bidirectional Speech Translation. 33

5.9.6     JSON Metadata of Bidirectional Speech Translation. 33

5.10       One-to-Many Speech Translation (MST) 33

5.10.1   Scope of One-to-Many Speech Translation. 33

5.10.2   Reference Architecture of One-to-Many Speech Translation. 33

5.10.3   I/O Data of One-to-Many Speech Translation. 34

5.10.4   Functions of AI Modules of One-to-Many Speech Translation. 34

5.10.5   I/O Data of AI Modules of One-to-Many Speech Translation. 34

5.10.6   JSON Metadata of One-to-Many Speech Translation. 35

6        Composite AI Modules. 35

6.1         Personal Status Extraction (PSE) 35

6.1.1     Scope of Personal Status Extraction. 35

6.1.2     Reference Architecture of Personal Status Extraction. 35

6.1.3     I/O Data of Personal Status Extraction. 36

6.1.4     Functions of AI Modules of Personal Status Extraction. 36

6.1.5     I/O Data of AI Modules of Personal Status Extraction. 37

6.1.6     JSON Metadata of Personal Status Extraction. 37

6.2         Personal Status Display (PSD) 37

6.2.1     Scope of Personal Status Display. 37

6.2.2     Reference Architecture of Personal Status Display. 37

6.2.3     I/O Data of Personal Status Display. 38

7        Data Formats. 38

7.1         Audio File. 40

7.2         Audio Scene Descriptors. 40

7.3         Cognitive State. 40

7.3.1     Syntax. 40

7.3.2     Semantics. 41

7.4         Emotion. 42

7.4.1     Syntax. 42

7.4.2     Semantics. 43

7.5         Face Descriptors. 44

7.6         Gesture Descriptors. 45

7.7         Instance Identifier 45

7.7.1     Syntax. 45

7.7.2     Semantics. 45

7.8         Intention. 46

7.8.1     Syntax. 46

7.8.2     Semantics. 46

7.9         Language identifier 47

7.10       Meaning. 47

7.10.1   Syntax. 47

7.10.2   Semantics. 48

7.11       Personal Status. 48

7.11.1   Factors and Modalities. 48

7.11.2   Personal Status Data. 49

7.14       Social Attitude. 52

7.14.1   Syntax. 52

7.14.2   Semantics. 52

7.15       Spatial Attitude. 57

7.16       Speech Descriptors. 57

7.17       Speech Features. 57

7.17.1   Syntax. 57

7.17.2   Semantics. 58

7.18       Text 59

7.19       Text Descriptors. 60

7.20       Video. 60

7.21       Video File. 60

7.22       Video of Faces KB Query Format 60

7.23       Visual Scene Descriptors. 60

Annex 1 – MPAI Basics. 61

1        General 61

2        Governance of the MPAI Ecosystem.. 61

3        AI Framework. 62

4        Audio-Visual Scene Description. 63

4.1         Audio Scene Descriptors. 63

4.2         Visual Scene Descriptors. 63

5        Avatar-Based Videoconference. 64

6        Connected Autonomous Vehicles. 64

Annex 2 – MPAI-wide terms and definitions. 67

Annex 3 – Notices and Disclaimers Concerning MPAI Standards (Informative) 70

Annex 4 – Patent declarations (Informative) 72

Annex 5 – Personal Status (Informative) 73

Annex 6 – AIW and AIM Metadata of MMC-CPS. 76

1        Metadata for MPAI-CPS AIW… 76

2        AIM metadata for CPS. 83

2.1         Visual Scene Description. 83

2.2         Audio Scene Description. 84

2.3         SpatialObjectIdentification. 85

2.4         SpeechRecognition. 86

2.5         Language Understanding. 87

2.6         PersonalStatusExtraction. 88

2.7         DialogueProcessing. 90

2.8         PersonalStatusDisplay. 91

Annex 7 – AIW and AIM Metadata of MMC-CWE.. 93

1        AIW metadata for CWE.. 93

2        AIM metadata. 99

2.1         SpeechRecognition. 99

2.2         Visual Scene Description. 100

2.3         Language Understanding. 101

2.4         PersonalStatusExtraction. 102

2.5         Dialogue Processing. 103

2.6         SpeechSynthesisEmotion. 105

2.7         Lips Animation. 106

Annex 8 – AIW and AIM Metadata of MMC-MQA.. 108

1        AIW metadata for MQA.. 108

2        AIM metadata. 113

2.1         VisualSceneDescription. 113

2.2         PhysicalObjectIdentification. 114

2.3         SpeechRecognition. 115

2.4         Language Understanding. 116

2.5         Question Analysis. 117

2.6         Question Answering. 118

2.7         SpeechSynthesisText 119

Annex 9 – AIW and AIM Metadata of MMC-CAS. 121

1        AIW metadata for MMC-CAS. 121

2        AIM metadata for MMC-CAS. 128

2.1    Visual Scene Description. 128

2.2    SpatialObjectIdentification. 129

2.3    SpeechRecognition. 131

2.4    LanguageUnderstanding. 131

2.5    PersonalStatusExtraction. 133

2.6    DialogueProcessing. 134

2.7    ScenePresentation. 135

2.8    PersonalStatusDisplay. 136

Annex 10 – AIW and AIM Metadata of CAV-HCI. 138

1        AIW metadata for HCI. 138

2        Metadata for HCI AIMs. 146

2.1    Audio Scene Description. 146

2.2    Visual Scene Description. 147

2.3    SpeechRecognition. 149

2.4    SpatialObjectIdentification. 150

2.5    LanguageUnderstanding. 151

2.6    SpeakerRecognition. 152

2.7    PersonalStatusExtraction. 153

2.8    FaceRecognition. 154

2.9    DialogueProcessing. 155

2.10  PersonalStatusDisplay. 156

Annex 11 – AIW and AIM Metadata of ARA-VSV.. 158

1        Metadata for VSV AIW… 158

2        AIM metadata for ARA-VSV. 164

2.1    SpeechRecognition. 164

2.2    AvatarDescriptorParsing. 165

2.3    LanguageUnderstanding. 166

2.4    PersonalStatusExtraction. 167

2.5    Summarisation. 169

2.6    DialogueProcessing. 170

2.7    PersonalStatusDisplay. 172

Annex 12 – AIW and AIM Metadata of MMC-UST. 174

1        AIW metadata for UST. 174

2        AIM metadata. 178

2.1         SpeechRecognition. 178

2.2         Translation. 178

2.3         Speech Feature Extraction. 180

2.4         Speech Synthesis. 180

Annex 13 – AIW and AIM Metadata of MMC-BST. 182

1        AIW metadata for BST. 182

2        AIM metadata. 187

2.1         SpeechRecognition. 187

2.2         Translation. 188

2.3         Speech Feature Extraction. 190

2.4         Speech Synthesis. 191

Annex 14 – AIW and AIM Metadata of MMC-MST. 193

1        AIW metadata for MST. 193

2        AIM metadata. 198

2.1         SpeechRecognition. 198

2.2         Translation. 199

2.3         Speech Feature Extraction. 200

2.4         Speech Synthesis. 200

Annex 15 – Metadata of MMC-PSE Composite AIM… 203

1        PersonalStatusExtraction. 203

1.1    PSTextDescription. 209

1.2    PSSpeechDescription. 209

1.3    PSFaceDescription. 210

1.4    PSBodyDescription. 211

1.5    PSTextInterpretation. 212

1.6    PSSpeechInterpretation. 213

1.7    PSFaceInterpretation. 214

1.8    PSBodyInterpretation. 215

1.9    PersonalStatusCombination. 216

Annex 16 – Communication Among AIM Implementors (Informative) 218

 

1          Introduction (Informative)

From the moment a human built the first machine, there was a need to “communicate” with it. As more complex machines were built, the need for more sophisticated communication methods arose. Today, as personal devices become more pervasive and the use of information and other online services becomes ubiquitous, human-machine communication often becomes more direct and even “personal”. In the past, humans communicated with more primitive machines by touch, but now the possibility of using speech and visual means enhances this trend.

 

The ability of Artificial Intelligence to learn from interactions with humans gives machines the ability to improve their “conversational” capabilities by better understanding the meaning of what humans type or say and by providing more pertinent responses. If properly trained, machines can also learn to understand additional or hidden meanings of a sentence by analysing a human’s text, speech, or gestures. Machines can also be made to develop and rely on “internal statuses” comparable to those driving the attitudes of conversing humans. Thus, they can provide responses – in text, speech, and gestures – that are more human-like and richer in content.

 

The mission of the international, unaffiliated, non-profit Moving Picture, Audio and Data Coding by Artificial Intelligence (MPAI) Standards Developing Organisation is to develop AI-enabled data coding standards. MPAI believes that its standards should enable humans to select machines whose internal operation they understand to some degree, rather than machines that are just “black boxes” resulting from unknown training with unknown data. Thus, an implemented MPAI standard breaks up monolithic AI applications, yielding a set of interacting components with identified data whose semantics is known, as far as possible.

 

This opportunity for individual humans also offers a positive impact on industry, as component developers can compete in providing components with standard interfaces that have improved performance compared to other implementations. This “Lego-type” approach to application development is made possible by the MPAI AI Framework standard [2], where “applications” (called AI Workflows – AIWs) are composed of AI Modules (called AIMs) executed in AI Frameworks (called AIFs). AIMs are defined by their functions and data, but not by their internal architecture, which may be based on AI or data processing technologies, and implemented in software, hardware, or hybrid technology. Annex 1 – MPAI Basics provides additional details on the MPAI standards ecosystem and MPAI standards relevant to this Technical Specification.
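
As a purely illustrative, non-normative sketch of this component-based approach (the class names, function names, and data labels below are hypothetical and do not reproduce the MPAI-AIF metadata), an AIW can be viewed as a set of AIMs with identified input and output data that an AIF executes in order:

```python
# Illustrative toy only: the classes and names below are hypothetical and do
# NOT reproduce the normative MPAI-AIF metadata; they merely show how AIMs
# with identified input/output data can be composed into an AIW.
from dataclasses import dataclass, field
from typing import Callable, Dict, List

@dataclass
class AIM:
    name: str
    inputs: List[str]                 # names of the data items the AIM receives
    outputs: List[str]                # names of the data items the AIM produces
    process: Callable[[Dict], Dict]   # implementation: AI-based or data processing

@dataclass
class AIW:
    aims: List[AIM] = field(default_factory=list)

    def run(self, data: Dict) -> Dict:
        # Execute the AIMs in order, passing named data items between them.
        for aim in self.aims:
            available = {k: data[k] for k in aim.inputs if k in data}
            data.update(aim.process(available))
        return data

# A two-AIM workflow: Speech Recognition followed by Language Understanding.
asr = AIM("SpeechRecognition", ["InputSpeech"], ["RecognisedText"],
          lambda d: {"RecognisedText": f"text({d['InputSpeech']})"})
lu = AIM("LanguageUnderstanding", ["RecognisedText"], ["Meaning", "RefinedText"],
         lambda d: {"Meaning": f"meaning({d['RecognisedText']})",
                    "RefinedText": d["RecognisedText"]})
print(AIW([asr, lu]).run({"InputSpeech": "hello"}))
```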

 

Technical Specification: Multimodal Conversation (MPAI-MMC) V2 provides the technologies supporting the implementation of a subset of the possibilities envisaged by this introduction. It is organised in Use Cases, such as Conversation with Emotion, Multimodal Question Answering, and Unidirectional Speech Translation, corresponding to AI Workflows. Each Use Case provides the functions, and the input/output data of the AIW and the AIM topology. Each AIM of the Use Case is specified in terms of functions and input/output data. A single chapter also collects all data formats referenced in the specification.

 

In this Introduction and in the following, Terms beginning with a capital letter are defined in Table 1 if they are specific to this Standard and in Table 45 if they are common to all MPAI Standards. The chapters and the Annexes are Normative unless they are labelled as Informative.

 

2          Scope of Standard

Multimodal Conversation (MPAI-MMC) specifies:

  1. The technologies required to analyse the text and/or the speech and other non-verbal components exchanged in human-machine and machine-machine conversation with the goal of emulating human-human conversation in completeness and intensity.
  2. Use Cases that apply the technologies, both from MPAI-MMC and other MPAI standards:
    • “Conversation with Personal Status” (CPS), enabling conversation and question answering with a machine able to extract the inner state of the entity it is conversing with and showing itself as a speaking digital human able to express a Personal Status. By adding components to or removing components from this general Use Case, five Use Cases are spawned:
      • “Conversation with Emotion” (CWE), supporting audio-visual conversation with a machine impersonated by a synthetic voice and an animated face.
      • “Multimodal Question Answering” (MQA), supporting requests for information about a displayed object.
      • “Conversation About a Scene” (CAS), where a human converses with a machine pointing at the objects scattered in a room and displaying Personal Status in their speech, face, and gestures while the machine responds displaying its Personal Status in speech, face, and gesture.
      • “Human-Connected Autonomous Vehicle Interaction” (HCI), where humans converse with a machine displaying Personal Status after having been properly identified by the machine with their speech and face in outdoor and indoor conditions while the machine responds displaying its Personal Status in speech, face, and gesture.
      • “Virtual Secretary for Videoconference” (VSV), where an avatar not representing a human in a virtual conference makes and displays a summary of what other avatars say, receives and interprets comments using the avatars’ utterances and Personal Statuses, and displays the edited summary.
    • Three Use Cases supporting conversational translation applications. In each Use Case, users can specify whether speech or text is used as input and, if it is speech, whether their speech features are preserved in the interpreted speech:
      • “Unidirectional Speech Translation” (UST).
      • “Bidirectional Speech Translation” (BST).
      • “One-to-Many Speech Translation” (MST).
  3. One Composite AIM that applies the technologies, both from MPAI-MMC and other MPAI standards: Personal Status Extraction analyses the Personal Status conveyed by Text, Speech, Face, and Gesture – of a real or digital human – and provides an estimate of the Personal Status.

 

Note that:

  1. Each Use Case normatively defines:
    • The Functions of the AIW implementing it and of the AIMs.
    • The Connections between and among the AIMs.
    • The Semantics and the Formats of the input and output data of the AIW and the AIMs.
  2. Each Composite AIM normatively defines:
    • The Functions of the Composite AIM and of its AIMs.
    • The Connections between and among the AIMs.
    • The Semantics and the Formats of the input and output data of the Composite AIM and the AIMs.

 

The word normatively implies that an Implementation claiming Conformance to:

  1. An AIW, shall:
    1. Perform the AIW function specified in the appropriate Section of Chapter 5.
    2. Use AIMs whose topology and connections conform with the AIW Architecture specified in the appropriate Section of Chapter 5.
    3. Use AIW and AIM input and output data having the formats specified in the appropriate Sections of Chapter 7.
  2. An AIM, shall:
    1. Perform the AIM function specified by the appropriate Section of Chapter 5.
    2. Receive and produce the data specified in the appropriate Section of Chapter 5.
    3. Receive as input and produce as output data having the format specified in Chapter 7.
  3. A Data Format, shall comply with the format specified in Chapter 7.

 

Users of this Technical Specification should note that:

  1. This Technical Specification defines Interoperability Levels but does not mandate any.
  2. Implementers decide the Interoperability Level their Implementation satisfies.
  3. Implementers can use the Reference Software of this Technical Specification to develop their Implementations.
  4. The Conformance Testing specification can be used to test the conformity of an Implementation to this Standard.
  5. Performance Assessors can assess the level of Performance of an Implementation based on the Performance Assessment specification of this Standard.
  6. Implementers and Users should consider Annex 3 – Notices and Disclaimers.

The current Version of MPAI-MMC has been developed by the MPAI Multimodal Conversation Development Committee (MM-DC). MPAI expects to produce future MPAI-MMC Versions extending the scope of the Use Cases and/or adding new Use Cases within the Multimodal Conversation scope.

 

3          Terms and Definitions

The terms used in this standard beginning with a capital letter have the meaning defined in Table 1.

 

Table 1 – Table of terms and definitions

 

Term Definition
Audio Digital representation of an analogue audio signal sampled at a frequency between 8-192 kHz with a number of bits/sample between 8 and 32, and non-linear and linear quantisation.
Audio Object Coded representation of Audio information with its metadata. An Audio Object can be a combination of Audio Objects.
Audio Scene The Audio Objects of an Environment with Object location metadata.
Audio-Visual Object Coded representation of Audio-Visual information with its metadata. An Audio-Visual Object can be a combination of Audio-Visual Objects.
Audio-Visual Scene (AV Scene) The Audio-Visual Objects of an Environment with Object location metadata.
Avatar An animated 3D object representing a real or fictitious person in a Virtual Space.
Avatar Model An inanimate avatar exposing interfaces to enable its animation.
Cognitive State An element of the internal status reflecting the way a human or avatar understands the Environment, such as “Confused”, “Dubious”, “Convinced”.
Colour (of speech) The timbre of an identifiable voice independent of a current Personal Status and language.
Connected Autonomous Vehicle A vehicle able to autonomously reach an assigned geographical position by:

1.      Understanding human utterances.

2.      Planning a route.

3.      Sensing and interpreting the Environment.

4.      Exchanging information with other CAVs.

5.      Acting on the CAV’s motion actuation subsystem.

Descriptor Coded representation of text, audio, speech, or visual feature.
Emotion The coded representation of the internal state resulting from the interaction of a human or avatar with the Environment or subsets of it, such as “Angry”, “Sad”, “Determined”.
Environment A Virtual Space containing a Scene.
Environment Model The static audio and visual components of the Environment, e.g., walls, table, and chairs.
Face The portion of a 2D or 3D digital representation corresponding to the face of a human.
Factor One of Emotion, Cognitive State and Attitude.
Grade The intensity of a Factor.
Identifier The label uniquely associated with a human or an avatar or an object.
Instance An element of a set of entities – Physical Objects, users etc. – belonging to some levels in a hierarchical classification (taxonomy).
Intention The result of analysis of the goal of an input question.
Manifestation The manner of showing the Personal Status, or a subset of it, in any one of Speech, Face, and Physical Gesture.
Meaning Information extracted from Text such as syntactic and semantic information, Personal Status, and other information, such as an Object Identifier.
Modality One of Text, Speech, Face, or Gesture.
Object Descriptor An individual attribute of the coded representation of an object in a Scene, including its Spatial Attitude.
Orientation The set of the 3 roll, pitch, yaw angles indicating the rotation around the principal axis (x) of an Object, its y axis having an angle of 90˚ counterclockwise (right-to-left) with the x axis and its z axis pointing up toward the viewer.
Personal Status The ensemble of information internal to a person, including Emotion, Cognitive State, and Attitude.
Physical Gesture A movement of the body or part of it, such as the head, arm, hand, and finger, often a complement to a vocal utterance.
Pitch The fundamental frequency of Speech. Pitch is the attribute that makes it possible to judge sounds as “higher” and “lower.”
Point of View The Spatial Attitude of a human or avatar looking at an Environment.
Position The 3 coordinates (x,y,z) of a representative point of an object in the Real and Virtual Space.
Refined Text The Text resulting from the analysis of the Text produced by Speech Recognition made by Language Understanding.
Scene A structured composition of Objects.
Scene Presentation The format used by an audio-visual renderer to render the Audio-Visual Scene internal to the machine from a selected Point of View.
Social Attitude An element of the internal status related to the way a human or avatar intends to position vis-à-vis the Environment or subsets of it, e.g., “Respectful”, “Confrontational”, “Soothing”.
Spatial Attitude Position and Orientation and their velocities and accelerations of a Human and Physical Object in a Real or Virtual Environment.
Spatial Attribute Position and Orientation and their velocities and accelerations of a Human and Physical Object in a Real or Virtual Environment.
Speech Digital representation of analogue speech sampled at a frequency between 8 kHz and 96 kHz with a number of bits/sample of 8, 16 and 24, and non-linear and linear quantisation.
Speech Features Aspects of a speech segment that enable its description and reproduction, e.g., degree of vocal tension, Pitch, etc., and that can be automatically recognised and extracted for speech synthesis or other related purposes.
Speech Rate The number of Speech Units per second.
Speech Unit Phoneme, syllable, or word as a segment of Speech.
Text A sequence of characters drawn from a finite alphabet.
Visual Object Coded representation of Visual information with its metadata. A Visual Object can be a combination of Visual Objects.
Vocal Gesture Utterance, such as cough, laugh, hesitation, etc. Lexical elements are excluded.

4          References

4.1        Normative References

This standard normatively references the following documents, both from MPAI and other standards organisations. MPAI standards are publicly available at https://mpai.community/standards/resources/.

  1. Technical Specification: MPAI Ecosystem Governance (MPAI-GME) V1.1; https://mpai.community/standards/mpai-gme/.
  2. Technical Specification: AI Framework (MPAI-AIF) V1; https://mpai.community/standards/mpai-aif/.
  3. Technical Specification: Avatar Representation and Animation (MPAI-ARA) V1; https://mpai.community/standards/mpai-ara/.
  4. Technical Specification: Context-based Audio Enhancement (MPAI-CAE) V2; https://mpai.community/standards/mpai-cae/.
  5. Technical Specification: Connected Autonomous Vehicle (MPAI-CAV) V2; https://mpai.community/standards/mpai-cav/.
  6. Technical Specification: Visual Object and Scene Description (MPAI-OSD) V2; https://mpai.community/standards/mpai-osd/.
  7. Khronos; Graphics Language Transmission Format (glTF); October 2021; https://registry.khronos.org/glTF/specs/2.0/glTF-2.0.html
  8. ISO 639; Codes for the Representation of Names of Languages – Part 1: Alpha-2 Code.
  9. ISO/IEC 10646; Information technology – Universal Coded Character Set.
  10. ITU-R; Long-form file format for the international exchange of audio programme materials with metadata; BS.2088-1 (10/2019) https://www.loc.gov/preservation/digital/formats/fdd/fdd000001.shtml.
  11. ISO/IEC 14496-10; Information technology – Coding of audio-visual objects – Part 10: Advanced Video Coding.
  12. ISO/IEC 14496-12; Information technology – Coding of audio-visual objects – Part 12: ISO base media file format.
  13. ISO/IEC 23008-2; Information technology – High efficiency coding and media delivery in heterogeneous environments – Part 2: High Efficiency Video Coding.
  14. ISO/IEC 23094-1; Information technology – General video coding – Part 1: Essential Video Coding.
  15. MPAI; The MPAI Statutes; https://mpai.community/statutes/.
  16. MPAI; The MPAI Patent Policy; https://mpai.community/about/the-mpai-patent-policy/.
  17. MPAI; Framework Licence of the Multimodal Conversation Technical Specification (MPAI-MMC) V1; https://mpai.community/standards/mpai-mmc/framework-licence/mpai-mmc-v1-framework-licence/.
  18. MPAI; Framework Licence of the Multimodal Conversation Technical Specification (MPAI-MMC) V2; https://mpai.community/standards/mpai-mmc/call-for-technologies/mpai-mmc-v2-call-for-technologies/.

4.2        Informative References

The references provided here are for information purposes only.

  1. Ekman, Paul (1999), “Basic Emotions”, in Dalgleish, T; Power, M (eds.), Handbook of Cognition and Emotion (PDF), Sussex, UK: John Wiley & Sons.
  2. Emotion Markup Language (EmotionML) 1.0; https://www.w3.org/TR/2010/WD-emotionml-20100729/diffmarked.html.
  3. Hobbs J.R., Gordon A.S. (2011) The Deep Lexical Semantics of Emotions. In: Ahmad K. (eds) Affective Computing and Sentiment Analysis. Text, Speech, and Language Technology, vol 45. Springer, Dordrecht, https://people.ict.usc.edu/~gordon/publications/EMOT08.PDF and https://www.researchgate.net/publication/227251103_The_Deep_Lexical_Semantics_of_Emotions.

 

5          Use Cases

5.1        Conversation with Personal Status (CPS)

5.1.1        Scope of Conversation with Personal Status

When humans have a conversation with other humans, they use speech and, in constrained cases, text. Their interlocutors perceive the speech and/or text supplemented by visual information from the face and gesture of the conversing human. Text, speech, face, and gesture may convey information about the internal state of the speaker that MPAI calls Personal Status. Therefore, the handling of Personal Status information in human-machine and, in the future, even machine-machine conversation is a key feature of a machine trying to understand what a speaker’s utterances mean: Personal Status recognition can improve understanding of the speaker’s utterance and help a machine produce better replies.

Conversation with Personal Status (MMC-CPS) is a general Use Case of an entity – a real or digital human – conversing and question answering with a machine. The machine captures and understands Speech, extracts Personal Status from the Text, Speech, Face, and Gesture Modalities, and fuses these estimates into an estimated Personal Status of the entity in order to achieve a better understanding of the context in which the entity utters Speech.

5.1.2        Reference Architecture of Conversation with Personal Status

Figure 1 gives the Conversation with Personal Status Reference Model including the input/output data, the AIMs, and the data exchanged between and among the AIMs.

 

Figure 1 – Reference Model of Conversation with Personal Status

 

The operation of the Conversation with Personal Status Use Case develops as follows:

  1. Input Selection is used to inform the machine whether the human employs Text or Speech in conversation with the machine.
  2. Visual Scene Description extracts the Scene Geometry, the Physical Objects and the Face and Body Descriptors of humans in the Scene.
  3. Audio Scene Description extracts the Scene Geometry, and the Speech Objects in the Scene.
  4. Spatial Object Identification assigns an Identifier to each Physical Object indicated by a human.
  5. Speech Recognition recognises Speech utterances.
  6. Language Understanding refines Text and extracts Meaning.
  7. Personal Status Extraction extracts a human’s Personal Status.
  8. Dialogue Processing produces the machine’s response and its Personal Status.
  9. Personal Status Display produces a speaking Avatar expressing Personal Status.
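
A non-normative Python sketch of the dataflow described above may help follow how the data items of Figure 1 and Table 4 travel between the AIMs; every function is a placeholder for the corresponding AIM and all returned values are hypothetical:

```python
# Non-normative dataflow sketch of Figure 1 / Table 4.  Each stub stands in for
# the corresponding AIM; return values are placeholders, not real data formats.

def visual_scene_description(video):
    return {"face_desc": "...", "body_desc": "...", "geometry": "...", "objects": ["cup"]}

def audio_scene_description(audio):
    return {"speech": "where can I put this cup?", "geometry": "..."}

def spatial_object_identification(body_desc, geometry, objects):
    return "ObjectID:cup"                                   # Physical Object ID

def speech_recognition(speech):
    return speech                                           # Recognised Text

def language_understanding(object_id, input_text, recognised_text, input_selection):
    text = recognised_text if input_selection == "Speech" else input_text
    return {"intent": f"ask-location({object_id})"}, text   # Meaning, Refined Text

def personal_status_extraction(body_desc, face_desc, meaning, speech):
    return {"emotion": "curious"}                           # Input Personal Status

def dialogue_processing(input_text, refined_text, personal_status, input_selection):
    return {"emotion": "helpful"}, "You can put it on the shelf."  # Machine PS, Machine Text

def personal_status_display(machine_text, machine_ps):
    return {"avatar": "...", "speech": machine_text, "text": machine_text}

# Wiring, following the numbered steps above (Speech selected as input):
scene = visual_scene_description("InputVideo")
audio = audio_scene_description("InputAudio")
obj_id = spatial_object_identification(scene["body_desc"], scene["geometry"], scene["objects"])
recognised = speech_recognition(audio["speech"])
meaning, refined = language_understanding(obj_id, None, recognised, "Speech")
ps = personal_status_extraction(scene["body_desc"], scene["face_desc"], meaning, audio["speech"])
machine_ps, machine_text = dialogue_processing(None, refined, ps, "Speech")
print(personal_status_display(machine_text, machine_ps))
```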

5.1.3        I/O Data of Conversation with Personal Status

Table 2 gives the input and output data of the Conversation with Personal Status Use Case:

 

Table 2 – I/O Data of Conversation with Personal Status

Input Comments
Input Text Text typed by the human as an additional information stream or as a replacement for the Speech.
Input Speech Speech of the human having a conversation with the machine.
Input Video Video of the Face of the human having a conversation with the machine.
Input Selection Data determining the use of Speech vs Text.
Output Comments
Machine Text Text of the Speech produced by the machine.
Machine Speech Synthetic Speech produced by the machine.
Machine Video Avatar representing the machine.
Input Selection Selection signalling use of Text or Speech.

5.1.4        Functions of AI Modules of Conversation with Personal Status

Table 3 provides the functions of the Conversation with Personal Status Use Case.

 

Table 3 – Functions of AI Modules of Conversation with Personal Status

AIM Function
Visual Scene Description Provides Visual Objects and their Spatial Attitudes.
Audio Scene Description Provides Speech Objects and their Spatial Attitudes.
Speech Recognition Recognises Speech
Language Understanding Refines Text and extracts Meaning
Personal Status Extraction Extracts Personal Status
Dialogue Processing 1.      Processes Refined Text and Personal Status

2.      Produces machine’s Text and Personal Status.

Personal Status Display 1.      Synthesises Machine Speech from Machine Text and Personal Status

2.      Synthesises Machine Avatar

5.1.5        I/O Data of AI Modules of Conversation with Personal Status

Table 4 provides the I/O Data of the AI Modules of the Conversation with Personal Status Use Case.

 

Table 4 – I/O Data of AI Modules of Conversation with Personal Status

AIM Receives Produces
Visual Scene Description Input Video 1.      Face Descriptors

2.      Body Descriptors

3.      Visual Scene Geometry

4.      Physical Objects

Audio Scene Description Input Audio 1.      Speech

2.      Audio Scene Geometry

Spatial Object Identification 1.      Body Descriptors

2.      Visual Scene Geometry

3.      Physical Objects

Physical Object ID
Speech Recognition Input Speech Recognised Text
Language Understanding 1.      Physical Object ID

2.      Input Text

3.      Recognised Text

4.      Input Selection

1.      Meaning

2.      Refined Text

Personal Status Extraction 1.      Body Descriptors

2.      Face Descriptors

3.      Meaning

4.      Speech

Input Personal Status
Dialogue Processing 1.      Input Text

2.      Refined Text

3.      Input Personal Status

4.      Input Selection

1.      Machine Personal Status

2.      Machine Text

Personal Status Display 1.      Machine Text

2.      Machine Personal Status

1.      Machine Avatar

2.      Machine Speech

3.      Machine Text

5.1.6        JSON Metadata of Conversation with Personal Status

Specified in Annex 6 – AIW and AIM Metadata of MMC-CPS.

5.2        Conversation with Emotion (CWE)

5.2.1        Scope of Conversation with Emotion

In the Conversation with Emotion (MMC-CWE) Use Case, a machine responds to a human’s textual and/or vocal utterance in a manner consistent with the human’s utterance and emotional state, as detected from the human’s text, speech, or face. The machine responds using text, synthetic speech, and a face whose lip movements are synchronised with the synthetic speech and the synthetic machine emotion.

5.2.2        Reference Architecture of Conversation with Emotion

Figure 2 gives the Conversation with Emotion Reference Model including the input/output data, the AIMs, and the data exchanged between and among the AIMs.

 

Figure 2 – Reference Model of Conversation with Emotion

 

The operation of Conversation with Emotion develops as follows:

  1. Input Selection is used to inform the machine whether the human employs Text or Speech in conversation with the machine.
  2. Speech is recognised by Speech Recognition.
  3. Visual Scene Description extracts Face Descriptors from the scene.
  4. Language Understanding produces Meaning and Refined Text.
  5. Personal Status Extraction extracts Emotion from Meaning, Input Speech, and Face Descriptors.
  6. Dialogue Processing produces a response as Output Text and Emotion.
  7. Speech Synthesis (Emotion) produces Output Speech from Text and Emotion.
  8. Lips Animation animates the lips of a Face drawn from the Video of Faces KB in a way that is consistent with the Output Speech and the Output Emotion.

5.2.3        I/O Data of Conversation with Emotion

The input and output data of the Conversation with Emotion Use Case are:

 

Table 5 – I/O Data of Conversation with Emotion 

 

Input Comments
Input Selection Data determining the use of Speech vs Text.
Input Text Text typed by the human as an additional information stream or as a replacement for the speech, depending on the value of Input Selection.
Input Speech Speech of the human having a conversation with the Machine.
Input Video Video of the Face of the human having a conversation with the Machine.
Output Comments
Machine Text Text of the Speech produced by the Machine.
Machine Speech Synthetic Speech with Emotion produced by the Machine.
Machine Video Video of a Face whose lip movements are synchronised with the Output Speech and the Machine Personal Status.

5.2.4        Functions of AI Modules of Conversation with Emotion

Table 6 provides the functions of the Conversation with Emotion AIMs.

 

Table 6 – Functions of AI Modules of Conversation with Emotion

AIM Function
Speech Recognition Recognises Speech
Language Understanding Refines Text and extracts Meaning
Personal Status Extraction Extracts Personal Status from Meaning, Speech, and Face.
Dialogue Processing 1.      Processes Refined Text and Personal Status

2.      Produces Machine Text and Personal Status.

Speech Synthesis (Emotion) Synthesises Machine Speech from Machine Text and Machine Personal Status.
Lips Animation Animates the lips of a Face drawn from the Video of Faces KB consistently with Machine Speech and Machine Personal Status.

5.2.5        I/O Data of AI Modules of Conversation with Emotion

Table 7 gives the I/O Data of the AI Modules of Conversation with Emotion.

 

Table 7 – AI Modules of Conversation with Emotion

AIM Receives Produces
Speech Recognition Input Speech Recognised Text
Language Understanding Recognised Text Meaning and Refined Text.
Personal Status Extraction 1.        Meaning

2.        Speech

3.        Face

Input Personal Status (Emotion only).
Dialogue Processing 1.        Meaning.

2.        Based on Input Selection

2.1.       Refined Text

2.2.       Input Text.

3.        Input Personal Status.

1.      Machine Personal Status

2.      Machine Text

Speech Synthesis (Emotion) 1.      Machine Text

2.      Machine Personal Status

Machine Speech.
Lips Animation 1.      Machine Personal Status

2.      Machine Speech

Video with animated lips of a Face drawn from the Video of Faces KB.

5.2.6        JSON Metadata of Conversation with Emotion

Specified in Annex 7 – AIW and AIM Metadata of MMC-CWE.

5.3        Multimodal Question Answering (MQA)

5.3.1        Scope of Multimodal Question Answering

In a Question Answering (QA) System, a machine provides answers to a user’s question presented in natural language. Multimodal Question Answering improves current QA systems, which can only deal with text or speech inputs, by offering the requesting human the ability to present speech or text together with images. For example, users might ask “Where can I buy this tool?” while showing a picture of the tool, even without showing their faces. In the Multimodal Question Answering (MMC-MQA) Use Case, a machine responds to a question expressed by a user in text or speech while showing an object. The machine’s response may use text and synthetic speech.

5.3.2        Reference Architecture of Multimodal Question Answering

Figure 3 gives the Multimodal Question Answering Reference Model including the input/output data, the AIMs, and the data exchanged between and among the AIMs.

 

Figure 3 – Reference Model of Multimodal Question Answering

 

The operation of Multimodal Question Answering develops in the following way:

  1. Input Selection is used to inform the machine whether the human employs Text or Speech to query the machine.
  2. Depending on the value of Input Selection, Language Understanding:
    • Extracts the Meaning of the question from Recognised Text and refines Recognised Text.
    • Extracts the Meaning of the question from Input Text.
  3. Visual Scene Description extracts the Physical Object.
  4. Object Identification identifies the Physical Object.
  5. Question Analysis determines the Intention of the question.
  6. Question Answering uses Intention and Meaning to produce the answer as Machine Text.
  7. Speech Synthesis (Text) produces the Output Speech from Machine Text.
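
The following non-normative sketch illustrates the Input Selection branch and the Question Analysis / Question Answering chain described above; the function bodies and data values are placeholders, not part of the specification:

```python
# Non-normative sketch: the function bodies and values are placeholders.

def language_understanding(input_selection, input_text, recognised_text):
    # Meaning comes from Recognised Text (Speech path) or from Input Text (Text path).
    source = recognised_text if input_selection == "Speech" else input_text
    meaning = {"predicate": "buy", "object": "this tool"}   # placeholder Meaning
    return meaning, source                                  # Meaning, Refined Text

def question_analysis(meaning):
    return {"topic": meaning["object"], "focus": "place of purchase"}  # placeholder Intention

def question_answering(refined_text, intention, meaning, object_id):
    # A real implementation would query a knowledge source; here the answer is canned.
    return f"You can buy the {object_id} at a hardware store."         # Machine Text

meaning, refined = language_understanding("Speech", None, "Where can I buy this tool?")
intention = question_analysis(meaning)
print(question_answering(refined, intention, meaning, object_id="cordless drill"))
```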

5.3.3        I/O Data of Multimodal Question Answering

The input and output data of the Multimodal Question Answering Use Case are:

 

Table 8 – I/O Data of Multimodal Question Answering

 

Input Comments
Input Selection Data determining the use of Speech or Text.
Input Text Text typed by the human as a replacement for Input Speech.
Input Speech Speech of the human asking a question to the Machine.
Input Video Video of the human showing an object held in hand.
Output Comments
Output Text The Text generated by the Machine in response to human inputs.
Output Speech The Speech generated by the Machine in response to human inputs.

5.3.4        Functions of AI Modules of Multimodal Question Answering

Table 9 provides the functions of the Multimodal Question Answering Use Case.

 

Table 9 – Functions of AI Modules of Multimodal Question Answering

AIM Function
Visual Scene Description Extracts the Physical Object in the Visual Scene.
Object Identification Identifies the Physical Object.
Speech Recognition Recognises Speech.
Language Understanding Extracts Meaning and refines Text from Recognised Text.
Question Analysis Extracts Intention from Text.
Question Answering Produces response of Machine to the query.
Speech Synthesis (Text) Synthesises Speech from Text.

5.3.5        I/O Data of AI Modules of Multimodal Question Answering

The AI Modules of Multimodal Question Answering are given in Table 10.

 

Table 10 – AI Modules of Multimodal Question Answering

AIM Receives Produces
Visual Scene Description Input Video Physical Object
Object Identification Physical Object Physical Object Identifier
Speech Recognition Input Speech Recognised Text
Language Understanding Input Text or Recognised Text (based on Input Selection) Refined Text

Meaning

Question Analysis Meaning Intention
Question Answering 1.      Input or Recognised Text (based on Input Selection)

2.      Intention

3.      Meaning

Machine Text
Speech Synthesis (Text) Machine Text Machine Speech

5.3.6        JSON Metadata of Multimodal Question Answering

Specified in Annex 8 – AIW and AIM Metadata of MMC-MQA.

5.4        Conversation About a Scene (CAS)

5.4.1        Scope of Conversation About a Scene

This Use Case addresses the case of a human holding a conversation with a Machine:

  1. The Machine sees and hears an Environment containing a speaking human and some scattered objects.
  2. The Machine recognises the human’s Speech and obtains the human’s Personal Status by capturing Speech, Face, and Gesture.
  3. The human converses with the Machine, indicating the object in the Environment s/he wishes to talk about or ask questions about using Speech, Face, and Gesture.
  4. The Machine understands which object the human is referring to and generates an avatar that:
    • Utters Speech conveying a synthetic Personal Status that is relevant to the human’s Personal Status as shown by his/her Speech, Face, and Gesture, and
    • Displays a face conveying a Personal Status that is relevant to the human’s Personal Status and to the response the Machine intends to make.
  5. The Machine displays the Scene Presentation corresponding to how it perceives the Environment from a human-selected Point of View. The objects in the scene are labelled with the Machine’s understanding of their semantics so that the human can understand how the Machine sees the Environment.

5.4.2        Reference Architecture of Conversation About a Scene

Figure 4 gives the Conversation About a Scene Reference Model including the input/output data, the AIMs, and the data exchanged between and among the AIMs.

 

Figure 4 – Reference Model of Conversation About a Scene

The Machine operates according to the following workflow:

  1. Visual Scene Description produces Body Descriptors, Visual Scene Geometry and Physical Objects from Input Video.
  2. Speech Recognition produces Recognised Text from Input Speech.
  3. Spatial Object Identification produces Physical Object ID from Physical Object and Body Descriptors.
  4. Language Understanding produces Meaning and Refined Text from Recognised Text and Physical Object ID.
  5. Personal Status Extraction produces Input Personal Status from Meaning, Input Speech, Face Descriptors, and Body Descriptors.
  6. Dialogue Processing produces Machine Text and Machine Personal Status from Input Personal Status, Meaning, and Refined Text.
  7. Personal Status Display produces Machine Text, Machine Speech, Machine Avatar from Machine Text, and Machine Personal Status.
  8. Scene Presentation uses the Visual Scene Descriptors to produce the Rendered Scene as seen from the user-selected Point of View. The rendering is constantly updated as the machine improves its understanding of the scene and its objects.
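
The following non-normative sketch illustrates the last step: the machine's current understanding of the scene is rendered from the human-selected Point of View with each object labelled by the machine's interpretation; the descriptor fields, labels, and positions are placeholder values:

```python
# Non-normative sketch: descriptor fields, labels, and positions are placeholders.

scene_descriptors = [
    {"object_id": "obj-1", "label": "coffee mug", "position": (1.2, 0.4, 0.8)},
    {"object_id": "obj-2", "label": "laptop",     "position": (0.3, 0.4, 1.5)},
]

def scene_presentation(descriptors, point_of_view):
    # A real renderer would project the 3D scene from the Point of View; this stub
    # only lists what would be shown, to make the "labelled objects" idea concrete.
    print(f"Rendered Scene from Point of View {point_of_view}:")
    for d in descriptors:
        print(f"  {d['label']} at position {d['position']}")

scene_presentation(scene_descriptors,
                   point_of_view={"position": (0.0, 1.6, 0.0), "orientation": (0.0, 0.0, 0.0)})
```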

5.4.3        I/O Data of Conversation About a Scene

Table 11 gives the input/output data of Conversation About a Scene.

 

Table 11 – I/O data of Conversation About a Scene

 

Input data From Comment
Input Video Camera Points to human and scene.
Input Speech Microphone Speech of human.
Point of View Human The point of view of the scene displayed by Scene Presentation.
Output data To Comments
Machine Speech Human Machine’s speech.
Machine Avatar Human Portion of Machine’s avatar (e.g., face).
Rendered Scene Human Reproduction of the scene perceived by Machine containing labelled objects as seen from the Point of View.

5.4.4        Functions of AI Modules of Conversation About a Scene

Table 12 provides the functions of the Conversation About a Scene Use Case.

 

Table 12 – Functions of AI Modules of Conversation About a Scene

AIM Functions
Visual Scene Description Provides Visual Objects and their Spatial Attitudes.
Spatial Object Identification Provides ID of a Physical Object.
Speech Recognition Recognises Speech.
Language Understanding Refines Text and extracts Meaning.
Personal Status Extraction Extracts Personal Status from Meaning, Speech, Body, and Face.
Dialogue Processing 1.      Processes Refined Text and Personal Status.

2.      Produces Machine’s Text and Personal Status.

Scene Presentation Renders the Visual Scene as perceived by the Machine from the Point of View selected by human.
Personal Status Display Provides Machine Speech and Machine Avatar from Machine Text and Machine Personal Status.

5.4.5        I/O Data of AI Modules of Conversation About a Scene

Table 13 gives the list of AIMs with their I/O Data.

 

Table 13 – AI Modules of Conversation About a Scene

 

AIM Receives Produces
Visual Scene Description Input Video 1.      Visual Scene Descriptors

2.      Body Descriptors

3.      Face Descriptors

4.      Visual Scene Geometry

5.      Physical Objects

Spatial Object Identification 1.      Body Object

2.      Physical Objects

3.      Visual Scene Geometry

Physical Object ID
Speech Recognition Input Speech Recognised Text
Language Understanding 1.      Recognised Text

2.      Physical Object ID

1.      Meaning

2.      Refined Text

Personal Status Extraction 1.      Body Object

2.      Face Object

3.      Input Speech

4.      Meaning

 Personal Status
Dialogue Processing 1.      Personal Status

2.      Meaning

3.      Refined Text

1.      Machine Personal Status

2.      Machine Text
Scene Presentation 1.      Visual Scene Descriptors

2.      Point of View

Rendered Scene
Personal Status Display 1.      Machine Text

2.      Machine Personal Status

1.      Machine Text

2.      Machine Speech

3.      Machine Avatar

5.4.6        JSON Metadata of Conversation About a Scene

Specified in Annex 9 – AIW and AIM Metadata of MMC-CAS.

5.5        Virtual Secretary for Videoconference (VSV)

5.5.1        Scope of Virtual Secretary for Videoconference

In a virtual videoconference, i.e., a videoconference whose participants are avatars realistically impersonating the human participants, a Virtual Secretary is tasked with:

  1. Listening to the Speech of each avatar.
  2. Monitoring their Personal Status.
  3. Drafting a Summary, in the meeting’s common language, using the avatars’ Personal Statuses and the Text obtained from the Speech Recognition AIM or received directly via Text input. The Summary is handled in two different ways:
    • Transferred to an external application so that participants can edit the Summary.
    • Displayed to avatars:
      • Avatars make Speech comments or Text comments (e.g., offline via chat).
      • The Virtual Secretary edits the Summary interpreting the avatars’ Text and Personal Statuses.

Chapter 5 of Annex 1 – MPAI Basics provides additional information on the Avatar-Based Videoconference Use Case.

5.5.2        Reference Architecture of Virtual Secretary for Videoconference

Figure 5 specifies the architecture of the Virtual Secretary AIW.

 

Figure 5 – Reference Model of the Virtual Secretary for Videoconference Use Case

The Virtual Secretary processes one avatar at a time according to the following workflow:

  1. Speech Recognition extracts Text from avatar Speech.
  2. Avatar Descriptors Parsing provides Body and Face Descriptors.
  3. Language Understanding:
    • Receives Recognised Text.
    • Produces:
      • Refined Text (of Recognised Text).
      • Meaning.
  4. Personal Status Extraction:
    • Receives Meaning, Speech, and Body and Face Descriptors.
    • Produces the Personal Status of the avatar it is interacting with.
  5. Summarisation:
    • Receives:
      • Refined Text
      • Personal Status
      • Meaning
    • Produces Summary using Personal Status and Text in the meeting’s common language.
    • Receives Edited Summary from Dialogue Processing.
  6. Dialogue Processing:
    • Receives:
      • Refined Text.
      • Text from an avatar (concerning Summary, via chat).
      • Personal Status.
    • Edits the Summary using avatars’ inputs.
    • Sends Edited Summary back to Summarisation.
    • Outputs VS Text concerning Summary and Personal Status of Virtual Secretary.
  7. Personal Status Display:
    • Receives Virtual Secretary’s Output Text and Personal Status.
    • Produces the Virtual Secretary’s:
      • Synthesised Speech.
      • Face and Body Descriptors.
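
The following non-normative sketch illustrates the Summarisation / Dialogue Processing loop described above, in which the Edited Summary is fed back to Summarisation; all function bodies and data values are placeholders:

```python
# Non-normative sketch: data values are placeholders.

def summarisation(refined_text, edited_summary=None):
    # Start a Summary from an avatar's utterance, or adopt the Edited Summary fed back.
    return edited_summary if edited_summary else f"Summary: {refined_text}"

def dialogue_processing(avatar_comment, personal_status, summary):
    # Edit the Summary using the avatar's comment and Personal Status, and produce
    # the Virtual Secretary's own Text and Personal Status.
    edited = f"{summary} [amended: {avatar_comment}]"
    vs_text = "I have updated the summary as requested."
    vs_personal_status = {"attitude": "cooperative"}
    return vs_personal_status, vs_text, edited

summary = summarisation("We agree to publish the draft for comments.")
for comment in ["please record the comment deadline", "note that one member objected"]:
    vs_ps, vs_text, edited = dialogue_processing(comment, {"emotion": "neutral"}, summary)
    summary = summarisation(None, edited_summary=edited)    # Edited Summary fed back
print(summary)
```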

5.5.3        I/O Data of Virtual Secretary for Videoconference

Table 14 gives the input/output data of Virtual Secretary for Videoconference.

 

Table 14 – I/O data of Virtual Secretary

 

Input data From Comment
Text (xN) Avatars Remarks on the summary, etc.
Speech (xN) Avatars Utterances of avatars
Avatar Descriptors (xN) Avatars Gestures of avatars
Output data To Comments
Machine Speech Avatars VS Speech to avatars
Machine Face Avatars VS Face to avatars
Machine Avatar Avatars VS Avatar to avatars
Summary Avatars Summary of avatars’ interventions

5.5.4        Functions of AI Modules of Virtual Secretary for Videoconference

Table 15 gives the functions of Virtual Secretary for Videoconference AIMs.

 

Table 15 – Functions of Virtual Secretary for Videoconference AI Modules

 

AIM Functions
Speech Recognition Recognises Speech
Avatar Descriptors Parsing Provides Face and Body Descriptors
Language Understanding 1.      Refines Recognised Text

2.      Extracts Meaning

Personal Status Extraction Extracts Personal Status
Summarisation Produces and refines Summary using Edited Summary
Dialogue Processing Produces Text and Personal Status
Personal Status Display Shows Virtual Secretary as speaking Avatar with Personal Status

5.5.5        I/O Data of AI Modules of Virtual Secretary for Videoconference

Table 16 gives the AI Modules of the Virtual Secretary depicted in Figure 5.

 

Table 16 – AI Modules of Virtual Secretary

 

AIM Receives Produces
Speech Recognition Speech Recognised Text
Avatar Descriptors Parsing Avatar Descriptors 1.      Face Descriptors

2.      Body Descriptors

Language Understanding Recognised Text 1.      Refined Text

2.      Meaning

Personal Status Extraction 1.      Meaning

2.      Speech

3.      Face Descriptors

4.      Body Descriptors

Personal Status
Summarisation 1.      Meaning

2.      Refined Text

3.      Edited Summary

Summary
Dialogue Processing 1.      Refined Text

2.      Personal Status

3.      Meaning

4.      Summary

1.      VS Personal Status

2.      VS Text

3.      Edited Summary

Personal Status Display 1.      VS Text

2.      VS Personal Status

1.      PSD’s Avatar Model

2.      VS Text

3.      VS Speech

4.      VS Avatar Descriptors

5.5.6        JSON Metadata of Virtual Secretary for Videoconference

Specified in Annex 11 – AIW and AIM Metadata of ARA-VSV.

5.6        Human-Connected Autonomous Vehicle (CAV) Interaction (HCI)

5.6.1        Scope of Human-CAV Interaction

A Connected Autonomous Vehicle (CAV) is a system able to execute a command to move itself based on 1) the capture of data sensed by a range of onboard sensors exploring the environment and 2) the analysis and interpretation of the data captured and transmitted by other sources in range, such as other CAVs, traffic lights, and roadside units. Chapter 6 of Annex 1 – MPAI Basics describes the four Subsystems of a CAV, among which Human-CAV Interaction (HCI) has the function of recognising the human owner or renter, responding to humans’ commands and queries, conversing with humans during the travel, and activating the Autonomous Motion Subsystem in response to humans’ requests. Inter HCI Information, HCI-AMS Commands, and AMS-HCI Response are indicated in Figure 6 but not specified.

5.6.2        Reference Architecture of Human-CAV Interaction

Figure 6 represents the Human-CAV Interaction (HCI) Reference Model.

 

Figure 6 – Human-CAV Interaction Reference Model

The operation of HCI involves the following functions:

  1. A group of humans approaches the CAV from outside:
    • The Audio Scene Description AIM creates the Audio Scene Description in the form of Audio (Speech) Objects corresponding to each speaking human in the Environment (close to the CAV).
    • The Visual Scene Description creates the Visual Scene Descriptors in the form of Body and Face Descriptors corresponding to each human in the Environment (close to the CAV).
    • The Speaker Recognition and Face Recognition AIMs authenticate the humans that the HCI is interacting with using Speech and Face Descriptors.
    • The Speech Recognition AIM recognises the speech of each human.
    • The Language Understanding AIM extracts Meaning and produces Refined Text.
    • The Personal Status Extraction AIM extracts the Personal Status of the humans.
    • The Dialogue Processing AIM validates the human Identities, produces the response and displays the HCI Personal Status, and issues commands to the Autonomous Motion Subsystem.
  2. A group of humans sits in the seats inside the CAV:
    • The Audio Scene Description AIM creates the Audio Scene Descriptions in the form of Audio (Speech) Objects corresponding to each speaking human in the cabin.
    • The Visual Scene Description creates the Visual Scene Descriptors in the form of Body and Face Descriptors corresponding to each human in the cabin, and Physical Objects.
    • The Speaker Recognition and Face Recognition AIMs identify the humans the HCI is interacting with using Speech and Face Descriptors.
    • The Speech Recognition AIM recognises the speech of each human.
    • The Language Understanding AIM extracts Meaning and produces Refined Text.
    • The Personal Status Extraction AIM extracts the Personal Status of the humans.
    • The Dialogue Processing AIM recognises the human Identities, produces the response, displays the HCI Personal Status, and issues commands to the Autonomous Motion Subsystem.
  3. The HCI interacts with the humans in the cabin in several ways:
    • By responding to commands/queries from one or more humans at the same time, e.g.:
      • Commands to go to a waypoint, park at a place, etc.
      • Commands with an effect in the cabin, e.g., turn off air conditioning, turn on the radio, call a person, open window or door, search for information etc.

    • By conversing with and responding to questions from one or more humans at the same time about travel-related issues (in-depth domain-specific conversation), e.g.:
      • Humans request information, e.g., time to destination, route conditions, weather at destination, etc.
      • CAV offers alternatives to humans, e.g., long but safe way, short but likely to have interruptions.
      • Humans ask questions about objects in the cabin.
    • By following the conversation on travel matters held by humans in the cabin. The initial conditions for this participation are that: 1) the passengers allow the HCI to do so, and 2) the processing is carried out inside the CAV.

Note: For completeness, Figure 6 includes the interaction of the HCI with the AMS (e.g., commands and responses regarding the selection of a Route by a human) and with remote HCIs. However, this document does not address the format in which these interactions are performed.

 

Note that:

  1. The Audio Scene Description provides all Speech Objects in the Audio Scene, removing all other audio sources.
  2. The Speaker Recognition and Speech Recognition AIMs support multiple Speech Objects as input. Each Speech Object has an identifier to enable the Speaker Recognition and Speech Recognition AIMs to provide Recognised Texts labelled with Speaker IDs. If the Face Recognition AIM provides Face IDs corresponding to the Speaker IDs, the Dialogue Processing AIM can correctly associate the Speaker IDs (and the corresponding Recognised Texts) with the Face IDs.
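
The following non-normative sketch illustrates the ID association described in this note; the mapping between Speaker IDs and Face IDs is assumed to be available to the Dialogue Processing AIM (for example from prior registration), which is an assumption of this sketch rather than a provision of the specification:

```python
# Non-normative sketch.  The Speaker-ID-to-Face-ID mapping is assumed to exist
# (hypothetical registration data); the specification does not define it here.

recognised_texts = {"speaker-1": "Take us downtown.", "speaker-2": "Turn on the radio."}
speaker_to_face = {"speaker-1": "face-A", "speaker-2": "face-B"}

def associate(recognised_texts, speaker_to_face):
    # Attach each Recognised Text, labelled with its Speaker ID, to the Face ID
    # of the same human, so Dialogue Processing can address each person correctly.
    return [{"speaker_id": s, "face_id": speaker_to_face.get(s), "text": t}
            for s, t in recognised_texts.items()]

for entry in associate(recognised_texts, speaker_to_face):
    print(entry)
```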

5.7.1        I/O Data of Human-CAV Interaction

Table 17 gives the input/output data of Human-CAV Interaction.

 

Table 17 – I/O data of Human-CAV Interaction

 

Input data From Comment
Audio (Indoor) Cabin Passengers User’s social life; Commands/interaction with CAV
Audio (Outdoor) Users in Environment User authentication; User command; User conversation
Input Text Cabin Passengers User’s social life; Commands/interaction with CAV
Video (Outdoor) Users in Environment Commands/interaction with CAV
LiDAR (Indoor) Cabin Passengers User’s social life; Commands/interaction with CAV
RADAR (Indoor) Cabin Passengers User’s social life; Commands/interaction with CAV
Video (Indoor) Cabin Passengers User’s social life; Commands/interaction with CAV
Inter HCI Info Remote HCI
AMS-HCI Response Motion Actuation Subsystem AMS Response about execution of HCI-AMS Command
Output data To Comments
Output Speech Cabin Passengers CAV’s response to passengers
Output Avatar Cabin Passengers Portion of CAV’s Avatar (e.g., head & face)
Output Text Cabin Passengers CAV’s response to passengers
Inter HCI Info Remote HCI
HCI-AMS Commands Motion Actuation Subsystem Command to AMS to actuate wheels, brakes, etc.

 

Note that this document does not specify Inter HCI Information, HCI-AMS Commands, and AMS-HCI Response.

5.7.2        Functions of AI Modules of Human-CAV Interaction

Table 18 gives the functions of all Human-CAV Interaction AIMs.

 

Table 18 – Functions of Human-CAV Interaction’s AI Modules

 

AIM Function
Audio Scene Description Produces the Audio Scene Descriptors using the Audio captured by the appropriate (indoor or outdoor) Microphone Array.
Visual Scene Description Produces the Visual Scene Descriptors using the visual information captured by the appropriate (indoor or outdoor) visual sensors.
Speech Recognition Converts speech into Text.
Physical Object Identification Provides the ID of the class of objects of which the Physical Object is an Instance.
Language Understanding Improves the Text from Speech Recognition by using context information (e.g., Instance ID of object).
Speaker Recognition Provides Speaker ID from Speech.
Personal Status Extraction Provides the Personal Status of human.
Face Recognition Provides Face ID from Face.
Dialogue Processing Provides:

1.      Text containing the response of the HCI to the human.

2.      Personal Status of HCI congruous with the Text produced by the HCI.

Personal Status Display Produces Speech, and Machine Face and Body.

5.7.3        I/O Data of AI Modules of Human-CAV Interaction

Table 19 gives the AI Modules of the Human-CAV Interaction depicted in Figure 6.

 

Table 19 – AI Modules of Human-CAV interaction

AIM Receives Produces
Audio Scene Description Receives: Environment Audio (outdoor); Environment Audio (indoor). Produces: Speech Objects.
Visual Scene Description Receives: Environment Video (outdoor); Environment Video (indoor). Produces: Face Objects; Physical Objects; Body Descriptors; Face Descriptors.
Speech Recognition Receives: Speech Object. Produces: Recognised Text.
Physical Object Identification Receives: Physical Object; Body Descriptors. Produces: Object ID.
Language Understanding Receives: Recognised Text; Personal Status; Object ID. Produces: Meaning; Personal Status; Refined Text.
Speaker Recognition Receives: Speech Descriptors. Produces: Speaker ID.
Personal Status Extraction Receives: Speech Object; Meaning; Face Descriptors; Body Descriptors. Produces: Personal Status.
Face Recognition Receives: Face Object. Produces: Face ID.
Dialogue Processing Receives: Speaker ID; Meaning; Refined Text; Personal Status; Face ID; AMS-HCI Response. Produces: HCI-AMS Commands; Output Text; Output Personal Status.
Personal Status Display Receives: Machine Text; Output Personal Status. Produces: Machine Avatar; Machine Text; Machine Speech.

5.7.4        JSON Metadata of Human-CAV Interaction

Specified in Annex 10 – .

5.8        Unidirectional Speech Translation (UST)

5.8.1        Scope of Unidirectional Speech Translation

The goal of the Unidirectional Speech Translation (MMC-UST) Use Case is to translate speech segments expressed in a source language into a target language or to produce the textual version of the translated speech. If the desired output is speech, the user can specify whether their speech features (voice colour, emotional charge, etc.) should be preserved in the translated speech.

 

The flow of control is from Input Speech or Input Text to Translated Text, and then to Output Speech and Output Text. Depending on the value of Input Selection:

  1. Input Text in Language A is translated into Translated Text in Language B and pronounced as Speech in Language B.
  2. The Speech features (voice colour, emotional charge, etc.) in Language A are preserved in Language B.

5.8.2        Reference Architecture of Unidirectional Speech Translation

Figure 7 describes the input/output data, the AIMs and the data exchanged between AIMs.

 

Figure 7 – Reference Model of Unidirectional Speech Translation (UST)

5.8.3        I/O Data of Unidirectional Speech Translation

The input and output data of the Unidirectional Speech Translation Use Case are:

 

Table 20 – I/O Data of Unidirectional Speech Translation

 

Input Comments
Input Selection Determines whether:

1.      The input will be in Text or Speech

2.      The Input Speech features are preserved in the Output Speech.

Requested Languages User-specified input Language (A) and output Language (B).
Input Speech Speech produced in Language A by a human desiring translation into language B.
Input Text Alternative textual source information to be translated into and pronounced in language B depending on the value of Input Selection.
Output Comments
Translated Speech Input Speech translated into language B preserving the Input Speech features in the Output Speech, depending on the value of Input Selection.
Translated Text Text of Input Speech or Input Text translated into language B, depending on the value of Input Selection.

5.8.4        Functions of AI Modules of Unidirectional Speech Translation

Table 21 gives the functions of Unidirectional Speech Translation AIMs.

 

Table 21 – Functions of Unidirectional Speech Translation AI Modules

AIM Functions
Speech Recognition Recognises Speech
Translation Translates Recognised Text
Speech Feature Extraction Extracts Speech Features
Speech Synthesis (Features) Synthesises Translated Text adding Speech Features

5.8.5        I/O Data of AI Modules of Unidirectional Speech Translation

The AI Modules of Unidirectional Speech Translation are given in Table 22.

 

Table 22 – AI Modules of Unidirectional Speech Translation

 

AIM Receives Produces
Speech Recognition Receives: Input Speech Segment. Produces: Recognised Text.
Translation Receives: Input Text or Recognised Text (based on Input Selection). Produces: Translated Text.
Speech Feature Extraction Receives: Input Speech. Produces: Speaker-specific Speech Features (e.g., tones, intonation, intensity, pitch, emotion, speed).
Speech Synthesis (Features) Receives: Translated Text; Speech Features (depending on Input Selection). Produces: Output Speech.

5.8.6        JSON Metadata of Unidirectional Speech Translation

Specified in Annex 12 – .

5.9        Bidirectional Speech Translation (BST)

5.9.1        Scope of Bidirectional Speech Translation

The goal of the Bidirectional Speech Translation (MMC-BST) Use Case is to support a conversation between two people, each speaking a different language. The machine translates each input speech segment into the selected language as speech or text. If the desired output is speech, users can specify whether their speech features (voice colour, emotional charge, etc.) should be preserved in the translated speech.

The flow of control (from Input Speech to Translated Text to Output Speech) is identical to that of the Unidirectional case. The difference is that, rather than one such flow, two flows are provided in two different channels – the first from language A to language B, and the second from language B to language A.

 

Depending on the value of Input Selection:

  1. Input Text in Language A is translated into Translated Text in Language B and pronounced as Speech in Language B.
  2. The Speech features (voice colour, emotional charge, etc.) in Language A are preserved in Language B.

 

The same applies for the Language-B-to-Language-A channel.

5.9.2        Reference Architecture of Bidirectional Speech Translation

Figure 8 depicts the AIMs and the data exchanged between AIMs.

 

Figure 8 – Reference Model of Bidirectional Speech Translation (BST)

5.9.3        I/O Data of Bidirectional Speech Translation

The input and output data of the Bidirectional Speech Translation Use Case are:

 

Table 23 – I/O Data of Bidirectional Speech Translation

 

Input Comments
Input Selection Determines whether the input will be Text or Speech.
Requested languages User-specified input language and output languages
Input Speech1 Speech by human1 desiring spoken translation in the specified language.
Input Text1 Alternative Input Text to be translated to the specified language.
Input Speech2 Speech by human2 desiring spoken translation in the specified language.
Input Text2 Alternative Input Text to be translated to the specified language.
Output Comments
Output Speech1 Translated Speech of Speaker 1.
Output Text1 Text of the translated Speech of Speaker 1.
Output Speech2 Translated Speech of Speaker 2.
Output Text2 Text of the translated Speech of Speaker 2.

5.9.4        Functions of AI Modules of Bidirectional Speech Translation

Table 24 gives the functions of Bidirectional Speech Translation AIMs.

 

Table 24 – Functions of Bidirectional Speech Translation AI Modules

AIM Functions
Speech Recognition Recognises Speech
Translation Translates Recognised Text
Speech Feature Extraction Extracts Speech Features
Speech Synthesis (Features) Synthesises Translated Text adding Speech Features

 

5.9.5        I/O Data of AI Modules of Bidirectional Speech Translation

Table 25 gives the I/O Data of the AI Modules.

 

Table 25 – AI Modules of Bidirectional Speech Translation

 

AIM Receives Produces
Speech Recognition Receives: Input Speech 1 Segment; Input Speech 2 Segment. Produces: Recognised Text 1; Recognised Text 2.
Translation Receives: Input Text 1 or Recognised Text 1; Input Text 2 or Recognised Text 2 (based on the value of Input Selection). Produces: Translated Text 1; Translated Text 2.
Speech Feature Extraction Receives: Input Speech 1; Input Speech 2. Produces: Speech Features 1; Speech Features 2.
Speech Synthesis (Features) Receives: Translated Text 1; Translated Text 2; Speech Features 1 and 2 (based on Input Selection). Produces: Translated Speech 1; Translated Speech 2.

5.9.6        JSON Metadata of Bidirectional Speech Translation

Specified in Annex 13 – .

5.10    One-to-Many Speech Translation (MST)

5.10.1    Scope of One-to-Many Speech Translation

The goal of the One-to-Many Speech Translation (MMC-MST) Use Case is to enable one person speaking his or her language to broadcast to two or more audience members, each listening and responding in a different language, with the translation presented as speech or text. If the desired output is speech, users can specify whether their speech features (voice colour, emotional charge, etc.) should be preserved in the translated speech.

 

The flow of control (from Recognised Text to Translated Text to Output Speech) is identical to that of the Unidirectional case. However, rather than one such flow, multiple paired flows are provided – the first pair from language A to language B and B to A; the second from A to C and C to A; and so on.

Depending on the value of Input Selection (text or speech):

  1. Input Text in Language A is translated into Translated Text in, and pronounced as Speech in, all Requested Languages.
  2. The Speech features (voice colour, emotional charge, etc.) in Language A are preserved in all Requested Languages.

5.10.2    Reference Architecture of One-to-Many Speech Translation

Figure 9 depicts the AIMs and the data exchanged between AIMs.

 

Figure 9 – Reference Model of One-to-Many Speech Translation (MST)

5.10.3    I/O Data of One-to-Many Speech Translation

The input and output data of the One-to-Many Speech Translation Use Case are:

 

Table 26 – I/O Data of One-to-Many Speech Translation

 

Input Comments
Input Selection Determines whether the input will be in Text or Speech.
Desired Languages User-specified input language and translated languages
Input Speech Speech produced by human desiring translation and interpretation in a specified set of languages.
Input Text Alternative textual source information.
Output Comments
Translated Speech Speech translated into the Requested Languages.
Translated Text Text translated into the Requested Languages.

5.10.4    Functions of AI Modules of One-to-Many Speech Translation

Table 27 gives the functions of One-to-Many Speech Translation AIMs.

 

Table 27 – Functions of One-to-Many Speech Translation AI Modules

AIM Functions
Speech Recognition Recognises Speech
Translation Translates Recognised Text
Speech Feature Extraction Extracts Speech Features
Speech Synthesis (Features) Synthesises Translated Text adding Speech Features

5.10.5    I/O Data of AI Modules of One-to-Many Speech Translation

Table 28 gives the I/O Data of the AI Modules.

 

Table 28 – AI Modules of One-to-Many Speech Translation

 

AIM Receives Produces
Speech Recognition Input Speech Segment Recognised Text
Speech Feature Extraction Input Speech Speaker-specific Speech Features.
Translation Text input Translated Texts in the Requested Languages.
Speech Synthesis (Features) Receives: Translated Texts; Speech Features (based on Input Selection). Produces: Speech Segments in the Desired Languages.

5.10.6    JSON Metadata of One-to-Many Speech Translation

Specified in Annex 14 – .

 

6          Composite AI Modules

AI Modules composed of multiple AI Modules are called Composite AIMs. They are used in several MPAI-MMC Use Cases. This chapter specifies the Personal Status Extraction (PSE) AIM using a format like the one adopted for Use Cases. Other Technical Specifications specify other Composite AIMs, such as [3], which specifies the Personal Status Display Composite AIM used in this Technical Specification.

6.1        Personal Status Extraction (PSE)

Personal Status Extraction (PSE) is a Composite AIM that extracts the Cognitive State, Emotion, and Social Attitude (called Factors) conveyed by each of Text, Speech, Face, and Gesture (called Modalities) and provides an estimate of the Personal Status, intended as a combination of Factors. The Personal Status Extraction Composite AIM is used in MPAI-MMC and other Use Cases as a replacement for the combination of AIMs depicted in Figure 10. Personal Status need not convey information on all Factors and all Modalities.

6.1.1        Scope of Personal Status Extraction

Personal Status Extraction produces the estimate of the Personal Status of a human or an avatar by analysing each Modality in three steps:

  1. Data Capture (e.g., characters and words, a digitised speech segment, the digital video containing the hand of a person, etc.).
  2. Descriptor Extraction (e.g., pitch and intonation of the speech segment, thumb of the hand raised, the right eye winking, etc.).
  3. Personal Status Interpretation (i.e., one of Emotion, Cognitive State, and Attitude).

 

An implementation may combine two or more of the AIMs implementing the steps.

6.1.2        Reference Architecture of Personal Status Extraction

Figure 10 depicts the Personal Status extraction process:

  1. Descriptors are extracted from Text, Speech, Face Object, and Body Object. Depending on the value of Selection, Descriptors can be provided by an AI Module upstream.
  2. Descriptors are interpreted and the specific indicators of the Personal Status in the Text, Speech, Face, and Gesture Modalities are derived.
  3. Personal Status is obtained by combining the estimates of different Modalities of the Personal Status.

 

Input Selection informs PSE whether a Modality or its Descriptors are used.

 

Figure 10 – Reference Model of Personal Status Extraction

Note that:

  1. A Modality can be input into the Personal Status Extraction Composite AIM either as a Modality or as Descriptors. In both cases, the Descriptors have the same syntax and semantics. Text Descriptors are equivalent to Meaning. Gesture Description extracts Gesture Descriptors from the Body Object. In the future, other Descriptors may be extracted from the Body Object.
  2. An Implementation can combine, e.g., the Gesture Description and PS-Gesture Interpretation AIMs into one AIM, and directly provide PS-Gesture from a Body Object without exposing PS-Gesture Descriptors.

6.1.3        I/O Data of Personal Status Extraction

Table 29 gives the input/output data of Personal Status Extraction.

 

Table 29 – I/O data of Personal Status Extraction

 

Input data From Comment
Input Selection An external signal  
Text Keyboard or Speech Recognition Text or recognised speech.
Text Descriptors An upstream AIM  
Speech Microphone Speech of human.
Speech Descriptors An upstream AIM  
Face Object Visual Scene Description The face of the human.
Face Descriptors An upstream AIM  
Body Object Visual Scene Description The upper part of the body.
Body Descriptors An upstream AIM  
Output data To Comments
Personal Status A downstream AIM For further processing

6.1.4        Functions of AI Modules of Personal Status Extraction

Table 30 gives functions of the AIMs.

 

Table 30 – AI Modules of Personal Status Extraction

 

AIM Function
Text Description Extracts the Descriptors of Text.
Speech Description Extracts the Descriptors of Speech.
Face Description Extracts the Descriptors of Face.
Gesture Description Extracts the Descriptors of Body.
PS-Text Interpretation Interprets the Personal Status Descriptors of Text.
PS-Speech Interpretation Interprets the Personal Status Descriptors of Speech.
PS-Face Interpretation Interprets the Personal Status Descriptors of Face.
PS-Gesture Interpretation Interprets the Personal Status Descriptors of Body.
Personal Status Combination Produces the Personal Status.

6.1.5        I/O Data of AI Modules of Personal Status Extraction

Table 31 gives the AI Modules of Personal Status Extraction with their input and output data.

 

Table 31 – AI Modules of Personal Status Extraction

 

AIM Receives Produces
Text Description Text Text Descriptors
Speech Description Speech Speech Descriptors
Face Description Face Object Face Descriptors
Gesture Description Body Object Gesture Descriptors
PS-Text Interpretation PS-Text Descriptors PS-Text
PS-Speech Interpretation PS-Speech Descriptors PS-Speech
PS-Face Interpretation PS-Face Descriptors PS-Face
PS-Gesture Interpretation PS-Gesture Descriptors PS-Gesture
Personal Status Combination PS-Text, PS-Speech, PS-Face, PS-Gesture Personal Status

6.1.6        JSON Metadata of Personal Status Extraction

Specified in Annex 15 – .

6.2        Personal Status Display (PSD)

6.2.1        Scope of Personal Status Display

A Personal Status Display (PSD) is a Composite AIM receiving Text and Personal Status and generating an avatar producing Text and uttering Speech with the intended Personal Status while the avatar’s Face and Gesture show the intended Personal Status. Instead of a ready-to-render avatar, the output can be provided as Compressed Avatar Descriptors. The Personal Status driving the avatar can be extracted from a human or can be synthetically generated by a machine as a result of its conversation with a human or another avatar. This Composite AIM is used in the Use Case figures of this document as a replacement for the combination of the AIMs depicted in Figure 11.

6.2.2        Reference Architecture of Personal Status Display

Figure 11 represents the AIMs required to implement Personal Status Display.

 

Figure 11 – Reference Model of Personal Status Display

The Personal Status Display operates as follows:

  1. Selection determines the type of avatar output – Machine Avatar or Avatar Descriptors.
  2. Text is passed as output and synthesised as Speech using the Personal Status provided by PS-Speech.
  3. Machine Speech and PS-Face are used to produce the Face Descriptors.
  4. PS-Gesture and Text are used for Body Descriptors using the Avatar Model.
  5. Avatar Description produces a complete set of Avatar Descriptors.
  6. Avatar Synthesis produces a ready-to-render Machine Avatar.

6.2.3        I/O Data of Personal Status Display

Table 32 gives the input/output data of Personal Status Display.

 

Table 32 – I/O data of Personal Status Display

 

Input data From Comment
Selection Switch PSD output type
Text Keyboard, Speech Recognition, Machine  
PS-Speech Personal Status Extractor or Machine  
Avatar Model From AIM/AIW or embedded  
PS-Face Personal Status Extractor or Machine  
PS-Gesture Personal Status Extractor or Machine  
Output data To Comments
Machine Text Human or Avatar (i.e., an AIM)  
Machine Speech Human or Avatar (i.e., an AIM)  
Compressed Descriptors AIM/AIW downstream  
Body Object Presentation Device Ready-to-render Avatar
Avatar Model As in input  

7          Data Formats

This Technical Specification specifies the Data Formats listed in Table 33. The reader is alerted that some data Formats are shared with the Context-based Audio Enhancement (MPAI-CAE) Standard [3]. At the current date, the specification of such data Formats is repeated verbatim in both Standards.

 

The first column gives the name of the data Format, the second the subsection where the data Format is specified and the third the Use Case(s) making use of it.

 

Table 33 – Data formats

 

Name of Data Format Subsection Use Case
Audio File 7.1 ABV, BST, CAS, CWE, HCI, MST, UST, VSV
Audio Scene Descriptors 7.2 ABV, HCI
Cognitive State 7.3 CAS, HCI, VSV
Emotion 7.4 ABV, CWE, HCI, VSV
Face Descriptors 7.5 ABV, CWE, HCI, VSV
Gesture Descriptors 7.6 ABV, CWE, HCI, VSV
Instance ID 7.7 HCI
Language Identifier 7.9 BST, MST, UST
Meaning 7.10 CAS, CWE, HCI
Personal Status 7.11 ABV, CAS, HCI
Physical Object Identifier (Instance Identifier) 7.7 CAS, MQA
Social Attitude 7.12 CAS, HCI
Spatial Attitude 7.13 CAS, HCI
Speech Descriptors 7.14 ABV, CWE, HCI, VSV
Speech Features 7.15 UST
Text 7.16 BST, CWE, MQA, MST, UST
Text Descriptors 7.17 ABV, CWE, HCI, VSV
Video 7.18 CWE
Video File 7.19 ARP
Video Of Faces KB Query Format 7.20 CWE
Visual Scene Descriptors 7.21 ABV, CAS, HCI

MPAI plans on creating a future specification that will contain all data Formats that are shared by more than one MPAI Standard.

7.1        Audio File

Audio data is packaged in a .wav file [10].

7.2        Audio Scene Descriptors

Audio Scene Descriptors are specified in MPAI-CAE V2 [3].

7.3        Cognitive State

Cognitive State is represented by the following Syntax and Semantics. Primary Cognitive State corresponds to General Adjectival and Secondary Cognitive State corresponds to Specific Adjectival in Table 34.

 

The Syntax and Semantics of Cognitive State are given by the following clauses.

7.3.1        Syntax

Cognitive State is represented by the following JSON schema:

{
  "$schema": "http://json-schema.org/draft-07/schema",
  "definitions": {
    "cogstateType": {
      "type": "object",
      "properties": {
        "cogstateDegree": {
          "enum": ["High", "Medium", "Low"]
        },
        "cogstateName": {
          "type": "number"
        },
        "cogstateSetName": {
          "type": "string"
        }
      }
    }
  },
  "type": "object",
  "properties": {
    "primary": {
      "$ref": "#/definitions/cogstateType"
    },
    "secondary": {
      "$ref": "#/definitions/cogstateType"
    }
  }
}

7.3.2        Semantics

Name Definition
cogstateType Specifies the Cognitive State that the input carries.
cogstateDegree Specifies the Degree of Cognitive State as one of “Low,” “Medium,” and “High.”
cogstateName Specifies the ID of a Cognitive State listed in Table 35.
cogstateSetName Specifies the name of the Cognitive State set which contains the Cognitive State. The Cognitive State set of Table 35 is used as a baseline, but other sets are possible.

 

Table 34 gives the standardised three-level Basic Cognitive State Label Set.

 

Table 34 – Basic Cognitive State Label Set

COGNITIVE CATEGORIES GENERAL ADJECTIVAL SPECIFIC ADJECTIVAL
AROUSAL aroused/excited/energetic cheerful, playful, lethargic, sleepy
ATTENTION attentive expectant/anticipating, thoughtful, distracted/absent-minded, vigilant, hopeful/optimistic
BELIEF credulous sceptical
INTEREST interested fascinated, curious, bored
SURPRISE surprised astounded, startled
UNDERSTANDING comprehending uncomprehending, bewildered/puzzled
 

Table 35 provides the semantics for each label in the GENERAL ADJECTIVAL and SPECIFIC ADJECTIVAL columns above.

 

Table 35 – Basic Cognitive State Semantics Set

ID Cognitive State Meaning
1 aroused/excited/energetic cognitive state of alertness and energy
2 astounded high degree of surprised
3 attentive cognitive state of paying attention
4 bewildered/puzzled high degree of incomprehension
5 bored not interested
6 cheerful energetic combined with and communicating happiness
7 comprehending cognitive state of successful application of mental models to a situation
8 credulous cognitive state of conformance to mental models of a situation
9 curious interest due to drive to know or understand
10 distracted/absent-minded not attentive to present situation due to competing thoughts
11 expectant/anticipating attentive to (expecting) future event or events
12 fascinated high degree of interest
13 interested cognitive state of attentiveness due to salience or appeal to emotions or drives
14 lethargic not aroused
15 playful energetic and communicating willingness to play
16 sceptical not credulous
17 sleepy not aroused due to need for sleep
18 surprised cognitive state due to violation of expectation
19 startled surprised by a sudden event or perception
20 surprised cognitive state due to violation of expectation
21 thoughtful attentive to thoughts
22 uncomprehending not comprehending
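
An informative example of a possible Cognitive State instance conforming to the Syntax of 7.3.1 follows. The values are illustrative only: the set name “MPAI-2.0” is an assumed placeholder and the IDs refer to Table 35 (3 = attentive, 9 = curious).

{
  "primary": {
    "cogstateDegree": "High",
    "cogstateName": 3,
    "cogstateSetName": "MPAI-2.0"
  },
  "secondary": {
    "cogstateDegree": "Medium",
    "cogstateName": 9,
    "cogstateSetName": "MPAI-2.0"
  }
}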

7.4        Emotion

The Syntax and Semantics of Emotion are given by the following clauses. Emotions are expressed vocally through combinations of prosody (pitch, rhythm, and volume variations); separable speech effects (such as degrees of voice tension, breathiness, etc.); and vocal gestures (laughs, sobs, etc.).

Emotion is represented by the following Syntax and Semantics. Primary Emotion corresponds to General Adjectival and Secondary Emotion corresponds to Specific Adjectival in Table 36.

7.4.1        Syntax

Human Emotion is represented by the following JSON schema:

{
  "$schema": "http://json-schema.org/draft-07/schema",
  "definitions": {
    "emotionType": {
      "type": "object",
      "properties": {
        "emotionDegree": {
          "enum": ["High", "Medium", "Low"]
        },
        "emotionName": {
          "type": "number"
        },
        "emotionSetName": {
          "type": "string"
        }
      }
    }
  },
  "type": "object",
  "properties": {
    "primary": {
      "$ref": "#/definitions/emotionType"
    },
    "secondary": {
      "$ref": "#/definitions/emotionType"
    }
  }
}

7.4.2        Semantics

Name Definition
emotionType Specifies the Emotion that the input carries.
emotionDegree Specifies the Degree of Emotion as one of “Low,” “Medium,” and “High.”
emotionName Specifies the ID of an Emotion listed in Table 37.
emotionSetName Specifies the name of the Emotion set which contains the Emotion. Emotion set of Table 37 is used as a baseline, but other sets are possible.

 

Table 36 gives the standardised three-level Basic Emotion Set partly based on Paul Ekman [19].

 

Table 36 – Basic Emotion Label Set

EMOTION CATEGORIES GENERAL ADJECTIVAL SPECIFIC ADJECTIVAL
ANGER angry furious, irritated, frustrated
CALMNESS calm peaceful/serene, resigned
DISGUST disgusted repulsed
FEAR fearful/scared terrified, anxious/uneasy
HAPPINESS happy joyful, content, delighted, amused
HURT hurt jealous, insulted/offended, resentful/disgruntled, bitter
PRIDE/SHAME proud ashamed, guilty/remorseful/sorry, embarrassed
RETROSPECTION nostalgic homesick
SADNESS sad lonely, grief-stricken, depressed/gloomy, disappointed

 

Table 37 provides the semantics for each label in the GENERAL ADJECTIVAL and SPECIFIC ADJECTIVAL columns above.

 

Table 37  – Basic Emotion Semantics Set

ID Emotion Meaning
1 amused positive emotion combined with interest (cognitive state)
2 angry emotion due to perception of physical or emotional damage or threat
3 anxious/uneasy low or medium degree of fear, often continuing rather than instant
4 ashamed emotion due to awareness of violating social or moral norms
5 bitter persistently angry due to disappointment or perception of hurt or injury
6 calm relatively lacking emotion
7 content medium or low degree of happiness, continuing rather than instant
8 delighted high degree of happiness, often combined with surprise
9 depressed/gloomy high degree of sadness, continuing rather than instant, combined with lethargy (see AROUSAL)
10 disappointed sadness due to failure of desired outcome
11 disgusted emotion due to urge to avoid, often due to unpleasant perception or disapproval
12 embarrassed shame due to consciousness of violation of social conventions
13 fearful/scared emotion due to anticipation of physical or emotional pain or other undesired event or events
14 frustrated angry due to failure of desired outcome
15 furious high degree of angry
16 grief-stricken sadness due to loss of an important social contact
17 happy positive emotion, often continuing rather than instant
18 homesick sad due to absence from home
19 hurt emotion due to perception that others have caused social pain or embarrassment
20 insulted/offended emotion due to perception that one has been improperly treated socially
21 irritated low or medium degree of angry
22 jealous emotion due to perception that others are more fortunate or successful
23 joyful high degree of happiness, often due to a specific event
24 repulsed high degree of disgusted
25 lonely sad due to insufficient social contact
26 mortified high degree of embarrassment
27 nostalgic emotion associated with pleasant memories, usually of long before
28 peaceful/serene calm combined with low degree of happiness
29 proud emotion due to perception of positive social standing
30 resentful/disgruntled emotion due to perception that one has been improperly treated
31 resigned calm due to acceptance of failure of desired outcome, often combined with low degree of sadness
32 sad negative emotion, often continuing rather than instant, often associated with a specific event
33 terrified high degree of fear
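
An informative example of a possible Emotion instance conforming to the Syntax of 7.4.1 follows. The values are illustrative only: the set name “MPAI-2.0” is an assumed placeholder and the IDs refer to Table 37 (17 = happy, 23 = joyful).

{
  "primary": {
    "emotionDegree": "Medium",
    "emotionName": 17,
    "emotionSetName": "MPAI-2.0"
  },
  "secondary": {
    "emotionDegree": "High",
    "emotionName": 23,
    "emotionSetName": "MPAI-2.0"
  }
}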

7.5        Face Descriptors

Face Descriptors as defined in Personal Status Extraction are specified in MPAI-ARA V1 [3].

7.6        Gesture Descriptors

Gesture Descriptors as defined in Personal Status Extraction are specified in MPAI-ARA V1 [3].

7.7        Instance Identifier

An Instance is an element of a set of entities – Physical Objects, users, etc. – belonging to some level in a hierarchical classification (taxonomy).

The Syntax and Semantics of Instance Identifier are given by the following clauses.

7.7.1        Syntax

{
  "$schema": "http://json-schema.org/draft-07/schema",
  "title": "InstanceIdentifier",
  "type": "object",
  "properties": {
    "InstanceLabel": {
      "type": "string"
    },
    "LabelConfidenceLevel": {
      "type": "number",
      "minimum": 0,
      "maximum": 1
    },
    "Classification": {
      "type": "array",
      "items": {
        "type": "string"
      }
    },
    "ClassificationConfidenceLevel": {
      "type": "number",
      "minimum": 0,
      "maximum": 1
    }
  },
  "required": [
    "InstanceLabel",
    "LabelConfidenceLevel",
    "Classification",
    "ClassificationConfidenceLevel"
  ]
}

7.7.2        Semantics

Name Definition
InstanceIdentifier Provides the identifier of the Instance.
InstanceLabel Describes the Instance identified by InstanceIdentifier.
LabelConfidenceLevel Indicates the confidence level of the association between InstanceLabel and the Instance.
Classification Describes the taxonomy inferred for the Instance.
ClassificationConfidenceLevel Indicates the confidence level of the association between Classification and the Instance.
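
An informative example of a possible Instance Identifier follows. The object, labels, and confidence levels are illustrative assumptions, not normative values.

{
  "InstanceLabel": "coffee mug",
  "LabelConfidenceLevel": 0.92,
  "Classification": ["physical object", "container", "mug"],
  "ClassificationConfidenceLevel": 0.87
}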

 

7.8        Intention

This subclause specifies the data format describing Intention, the output of the Question Analysis AIM. Intention consists of the following elements:

  • qtopic
  • qfocus
  • qLAT
  • qSAT
  • qdomain

7.8.1        Syntax

{
  "$schema": "http://json-schema.org/draft-07/schema",
  "definitions": {
    "Intention": {
      "type": "object",
      "properties": {
        "qtopic": { "type": "string" },
        "qfocus": { "type": "string" },
        "qLAT": { "type": "string" },
        "qSAT": { "type": "string" },
        "qdomain": { "type": "string" }
      }
    }
  },
  "type": "object",
  "properties": {
    "primary": { "$ref": "#/definitions/Intention" },
    "secondary": { "$ref": "#/definitions/Intention" }
  }
}

7.8.2        Semantics

Name Definition
Intention Provides an abstract of the Intention of the User’s question using the properties qtopic, qfocus, qLAT, qSAT, and qdomain.
qtopic Indicates the topic of the question, i.e., the object or event that the question is about. Example: the qtopic of “Who is the author of King Lear?” is “King Lear”.
qfocus Indicates the focus of the question, i.e., the part of the question that, if replaced by the answer, makes the question a stand-alone statement. Examples: what, where, who, which policy, which river, etc. For instance, in the question “Who is the president of the USA?” the word “Who” is the focus and is replaced by “Biden” in the answer “Biden is the president of the USA.”
qLAT Indicates the lexical answer type of the question.
qSAT Indicates the semantic answer type of the question. qSAT corresponds to the Named Entity type of the language analysis results.
qdomain Indicates the domain of the question, such as “science”, “weather”, “history”. Example: “Who is the third king of the Yi dynasty in Korea?” (qdomain: history)

 

The following example shows the question analysis result of the user’s question, “Who is the author of King Lear?” The question analysis result in the example shows that the domain of the question is “Literature,” the topic of the question is “King Lear”, and the focus of the question is “Who.”

 

{
  "intention": [
    {
      "qdomain": "Literature",
      "qtopic": "King Lear",
      "qfocus": "who",
      "qLAT": "author",
      "qSAT": "person"
    }
  ]
}

 

The following example shows the question analysis result for the question “How do you make Kimchi?” The result shows that the domain of the question is “Cooking”, the topic of the question is “Kimchi”, and the focus of the question is “how”.

 

{
  "intention": [
    {
      "qdomain": "Cooking",
      "qtopic": "Kimchi",
      "qfocus": "how",
      "qLAT": "cooking method",
      "qSAT": "method"
    }
  ]
}

7.9        Language Identifier

Language identifiers are specified by [8].

7.10    Meaning

This subclause specifies the data format describing Meaning, which is the result of natural language analysis. Meaning consists of the following elements:

  • POS_tagging
  • NE_tagging
  • Dependency_tagging
  • SRL_tagging

7.10.1    Syntax

{
  "$schema": "http://json-schema.org/draft-07/schema",
  "definitions": {
    "meaning": {
      "type": "object",
      "properties": {
        "POS_tagging": {
          "POS_tagging_set": { "type": "string" },
          "POS_tagging_result": { "type": "string" }
        },
        "NE_tagging": {
          "NE_tagging_set": { "type": "string" },
          "NE_tagging_result": { "type": "string" }
        },
        "dependency_tagging": {
          "dependency_tagging_set": { "type": "string" },
          "dependency_tagging_result": { "type": "string" }
        },
        "SRL_tagging": {
          "SRL_tagging_set": { "type": "string" },
          "SRL_tagging_result": { "type": "string" }
        }
      }
    }
  },
  "type": "object",
  "properties": {
    "primary": { "$ref": "#/definitions/meaning" },
    "secondary": { "$ref": "#/definitions/meaning" }
  }
}

7.10.2    Semantics

Name Definition
Meaning Provides an abstract description of the natural language analysis results.
POS_tagging Indicates POS tagging results including information on the POS tagging set and tagged results of the User question. POS: Part of Speech such as noun, verb, etc.
NE_tagging Indicates NE tagging results including information on the NE tagging set and tagged results of the User question. NE: Named Entity such as Person, Organisation, Fruit, etc.
dependency_tagging Indicates dependency tagging results including information on the dependency tagging set and tagged results of the User question. Dependency indicates the structure of the sentence such as subject, object, head of the relation, etc.
SRL_tagging Indicates SRL (Semantic Role Labelling) tagging results including information on the SRL tagging set and tagged results of the User question. SRL indicates the semantic structure of the sentence such as agent, location, patient role, etc.
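
An informative example of a possible Meaning instance for the question “Who is the author of King Lear?” follows. Since the Syntax carries the tag set names as strings, any suitable tag sets may be referenced; the set names used here (Universal Dependencies UPOS, OntoNotes 5.0, PropBank) and the tagged strings are illustrative assumptions only.

{
  "primary": {
    "POS_tagging": {
      "POS_tagging_set": "Universal Dependencies UPOS",
      "POS_tagging_result": "Who/PRON is/AUX the/DET author/NOUN of/ADP King/PROPN Lear/PROPN"
    },
    "NE_tagging": {
      "NE_tagging_set": "OntoNotes 5.0",
      "NE_tagging_result": "King Lear = WORK_OF_ART"
    },
    "dependency_tagging": {
      "dependency_tagging_set": "Universal Dependencies",
      "dependency_tagging_result": "nsubj(author, Who); cop(author, is); det(author, the); nmod(author, Lear); case(Lear, of); compound(Lear, King)"
    },
    "SRL_tagging": {
      "SRL_tagging_set": "PropBank",
      "SRL_tagging_result": "be.01: ARG1 = Who, ARG2 = the author of King Lear"
    }
  }
}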

7.11    Personal Status

7.11.1    Factors and Modalities

Personal Status is a data structure composed of three Personal Status Factors:

  1. Emotion (such as “angry” or “sad”).
  2. Cognitive State (such as “surprised” or “interested”).
  3. Social Attitude (such as “polite” or “arrogant”).

All these Factors can be expressed via several Personal Status Modalities: Text, Speech, Face, and Gestures. (Other Modalities, such as body posture, may be handled in future MPAI Versions.)

Within a given Modality, the Factors can be analysed and interpreted via various Descriptors. For example, when expressed via Speech, the elements may be expressed through combinations of such features as prosody (pitch, rhythm, and volume variations); separable speech effects (such as degrees of voice tension, breathiness, etc.); and vocal gestures (laughs, sobs, etc.).

Each of the three Emotion, Cognitive State, and Social Attitude Factors is represented by a standard set of labels and associated semantics. For each of these Factors, two tables are provided:

  • A Label Set Table containing descriptive labels relevant to the element type, organised in a three-level format (Categories, General Adjectival, Specific Adjectival).
  • A Label Semantics Table providing the semantics of each label.

These sets have been compiled in the interest of basic cooperation and coordination among AIM submitters and vendors, complemented by a procedure whereby AIM submitters may propose extended or alternate sets for their purposes.

An Implementer wishing to extend or replace a Label Set Table for one of the three Factors is requested to submit the proposed Label Set Table together with the corresponding Label Semantics Table. The submitted semantics should have a level of detail comparable to the semantics given in the current Label Semantics Table.

The appropriate MPAI Development Committee will examine the proposed extension or replacement. Only the adequacy of the proposed new tables in terms of clarity and completeness will be considered. In case the new tables are not clear or complete, a revision of the tables will be requested.

The accepted External Factor Set will be identified as proposed by the submitter, reviewed by the appropriate MPAI Committee, and posted to the MPAI web site.

The versioning system is based on a name – MPAI for MPAI-generated versions or “organisation name” for the proposing organisation – with a suffix m.n, where m indicates the version and n indicates the subversion.

7.11.2    Personal Status Data

  1. Timestamp type can either be:
    • Absolute time (A)
    • Relative time, i.e., time from the start of operation (R)
  2. Timestamp value is as in CAE V1.
  3. 18 values of Personal Status that include (see Table 38):
    • 6 cells for Emotion.
    • 6 cells for Cognitive State.
    • 6 cells for Social Attitude.

 

Table 38 – The table of (Factor, Modality) cells

    Modality
    Version Fused value Text Speech Face Gesture
Factor Emotion V.Emotion          
Cognitive State V.Cognitive          
Social Attitude V.Attitude          

 

  1. The 18 values in the cells are represented as a vector of 18 values, 6 for each Factor:
    • The first value is the Version of Emotion/Cognitive State/Social Attitude (VE/VC/VA) represented as two fields:
      • Field 1: 2 digits of the Version of the MMC standard (e.g., “12”, meaning version 1.2, is expressed as 2 bytes).
      • Field 2: The sequential number of the Factor dataset, expressed with 1 byte. Currently, there is one dataset, given the number 1; new submissions will receive sequential numbers starting from 2.
    • The second value is the current default fused value of the Modality.
    • Followed by the 4 values of the Modality.
      • The value of Text
      • The value of Speech
      • The value of Face
      • The value of Gesture
    • The list of possible values of a Modality are (values are in bytes):
      • Value 0: unable to compute for any reason, or error, or no discernable value.
      • Value 1 up to the largest number of Factor values in the relevant Label Semantics Table.

Therefore, a value of Personal Status is represented by the following table. Timestamp, Emotion, Cognitive State, Social Attitude and their Descriptors are present if the information is available.

 

Table 39 – The variables composing the Personal Status

Variable name Code
Timestamp Timestamp type
  Timestamp value
Emotion Emotion version
  Fused Emotion value
  Text Emotion value
  Speech Emotion value
  Face Emotion value
  Gesture Emotion value
Cognitive State Cognitive State version
  Fused Cognitive State value
  Text Cognitive State value
  Speech Cognitive State value
  Face Cognitive State value
  Gesture Cognitive State value
Social Attitude Social Attitude version
  Fused Social Attitude value
  Text Social Attitude value
  Speech Social Attitude value
  Face Social Attitude value
  Gesture Social Attitude value

{
  "$schema": "http://json-schema.org/draft-07/schema#",
  "title": "Personal Status",
  "type": "object",
  "properties": {
    "Timestamp": {
      "type": "object",
      "properties": {
        "Timestamp type": {
          "type": "string"
        },
        "Timestamp value": {
          "type": "string",
          "oneOf": [
            { "format": "date-time" },
            { "const": "0" }
          ]
        }
      },
      "required": ["Timestamp value"],
      "if": {
        "properties": { "Timestamp value": { "const": "0" } }
      },
      "then": {
        "properties": { "Timestamp type": { "type": "null" } }
      },
      "else": {
        "required": ["Timestamp type"]
      }
    },
    "emotion": {
      "type": "object",
      "properties": {
        "Fused emotion value": { "type": "number", "minimum": 0 },
        "Text emotion value": { "type": "number", "minimum": 0 },
        "Speech emotion value": { "type": "number", "minimum": 0 },
        "Face emotion value": { "type": "number", "minimum": 0 },
        "Gesture emotion value": { "type": "number", "minimum": 0 },
        "emotion version": {
          "type": "string",
          "pattern": "^[A-Za-z]+-\\d+\\.\\d+$"
        }
      },
      "anyOf": [
        { "required": ["emotion version", "Fused emotion value"] },
        { "required": ["emotion version", "Text emotion value"] },
        { "required": ["emotion version", "Speech emotion value"] },
        { "required": ["emotion version", "Face emotion value"] },
        { "required": ["emotion version", "Gesture emotion value"] }
      ]
    },
    "cogstate": {
      "type": "object",
      "properties": {
        "Fused cogstate value": { "type": "number", "minimum": 0 },
        "Text cogstate value": { "type": "number", "minimum": 0 },
        "Speech cogstate value": { "type": "number", "minimum": 0 },
        "Face cogstate value": { "type": "number", "minimum": 0 },
        "Gesture cogstate value": { "type": "number", "minimum": 0 },
        "cogstate version": {
          "type": "string",
          "pattern": "^[A-Za-z]+-\\d+\\.\\d+$"
        }
      },
      "anyOf": [
        { "required": ["cogstate version", "Fused cogstate value"] },
        { "required": ["cogstate version", "Text cogstate value"] },
        { "required": ["cogstate version", "Speech cogstate value"] },
        { "required": ["cogstate version", "Face cogstate value"] },
        { "required": ["cogstate version", "Gesture cogstate value"] }
      ]
    },
    "attitude": {
      "type": "object",
      "properties": {
        "Fused attitude value": { "type": "number", "minimum": 0 },
        "Text attitude value": { "type": "number", "minimum": 0 },
        "Speech attitude value": { "type": "number", "minimum": 0 },
        "Face attitude value": { "type": "number", "minimum": 0 },
        "Gesture attitude value": { "type": "number", "minimum": 0 },
        "attitude version": {
          "type": "string",
          "pattern": "^[A-Za-z]+-\\d+\\.\\d+$"
        }
      },
      "anyOf": [
        { "required": ["attitude version", "Fused attitude value"] },
        { "required": ["attitude version", "Text attitude value"] },
        { "required": ["attitude version", "Speech attitude value"] },
        { "required": ["attitude version", "Face attitude value"] },
        { "required": ["attitude version", "Gesture attitude value"] }
      ]
    }
  },
  "required": ["emotion", "cogstate", "attitude"]
}
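
An informative example of a possible Personal Status instance follows. The version strings, timestamp, and values are illustrative assumptions. The values refer to the baseline Label Semantics Tables (Emotion 17 = happy in Table 37, Cognitive State 13 = interested in Table 35, Social Attitude 76 = polite/courteous/respectful in Table 41), and 0 indicates that no value could be computed for that Modality.

{
  "Timestamp": {
    "Timestamp type": "A",
    "Timestamp value": "2023-09-25T10:15:30Z"
  },
  "emotion": {
    "emotion version": "MPAI-1.2",
    "Fused emotion value": 17,
    "Text emotion value": 17,
    "Speech emotion value": 17,
    "Face emotion value": 0,
    "Gesture emotion value": 0
  },
  "cogstate": {
    "cogstate version": "MPAI-1.2",
    "Fused cogstate value": 13,
    "Text cogstate value": 0,
    "Speech cogstate value": 13,
    "Face cogstate value": 13,
    "Gesture cogstate value": 0
  },
  "attitude": {
    "attitude version": "MPAI-1.2",
    "Fused attitude value": 76,
    "Text attitude value": 76,
    "Speech attitude value": 0,
    "Face attitude value": 0,
    "Gesture attitude value": 76
  }
}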

7.12    Social Attitude

Social Attitude is represented by the following Syntax and Semantics. Primary Social Attitude corresponds to General Adjectival and Secondary Social Attitude corresponds to Specific Adjectival in Table 40.

7.12.1    Syntax

{
  "$schema": "http://json-schema.org/draft-07/schema",
  "definitions": {
    "attitudeType": {
      "type": "object",
      "properties": {
        "attitudeDegree": {
          "enum": ["High", "Medium", "Low"]
        },
        "attitudeName": {
          "type": "number"
        },
        "attitudeSetName": {
          "type": "string"
        }
      }
    }
  },
  "type": "object",
  "properties": {
    "primary": {
      "$ref": "#/definitions/attitudeType"
    },
    "secondary": {
      "$ref": "#/definitions/attitudeType"
    }
  }
}

7.12.2    Semantics

Name Definition
attitudeType Specifies the Social Attitude that the input carries.
attitudeDegree Specifies the Degree of Social Attitude as one of “Low,” “Medium,” and “High.”
attitudeName Specifies the ID of a Social Attitude listed in  Table 41.
attitudeSetName Specifies the name of the Social Attitude set which contains the Social Attitude. Social Attitude set of Table 41 is used as a baseline, but other sets are possible.

 

Table 40 gives the standardised three-level Basic Social Attitude Set.

 

Table 40 – Basic Social Attitude Label Set

SOCIAL ATTITUDE CATEGORIES GENERAL ADJECTIVAL SPECIFIC ADJECTIVAL

ACCEPTANCE accepting

exclusive/cliquish

welcoming/inviting

friendly

unfriendly/hostile

AGREEMENT, DISAGREEMENT like-minded

argumentative/disputatious

sarcastic
AGGRESSION aggressive

peaceful

submissive

combative/belligerent

passive-aggressive

mocking

APPROVAL, DISAPPROVAL admiring/approving

disapproving

indifferent

awed

contemptuous

ACTIVITY, PASSIVITY assertive

passive

controlling

permissive/lenient

COOPERATION cooperative/agreeable

uncooperative

flexible

subversive/undermining

uncommunicative

stubborn

disagreeable

RESPONSIVENESS responsive/demonstrative

emotional/passionate

unresponsive/undemonstrative

unemotional/detached

enthusiastic

unenthusiastic

passionate

dispassionate

 

EMPATHY empathetic/caring

kind

uncaring/callous

sympathetic

merciful

merciless/ruthless

self-absorbed

selfish/self-serving

selfless/altruistic

generous

EXPECTATION optimistic

pessimistic

positive

sanguine

negative/defeatist

cynical

EXTROVERSION, INTROVERSION outgoing/extroverted

uninhibited/unreserved

sociable

approachable

DEPENDENCE dependent

independent

helpless
MOTIVATION motivated

apathetic/indifferent

 

inspired

excited/stimulated

discouraged/dejected

dismissive

OPENNESS, TRUST open

honest/sincere

reasonable

trusting

 

candid/frank

closed/distant

dishonest/deceitful

responsible/trustworthy/dependable

irresponsible

distrustful

 

PRAISING, CRITICISM laudatory

critical

congratulatory

flattering

belittling

RESENTMENT, FORGIVENESS forgiving

unforgiving/vindictive/spiteful/vengeful

understanding

petty

SELF-PROMOTION boastful

modest/humble/unassuming

 
SELF-ESTEEM conceited/vain

self-deprecating/self-effacing

smug
SOCIAL DOMINANCE, CONFIDENCE arrogant

confident

submissive

overconfident

forward/presumptuous

brazen

SEXUALITY seductive

lewd/bawdy/indecent

prudish/priggish

suggestive/risqué/naughty

 

SOCIAL RANK polite/courteous/respectful

rude/disrespectful

commanding/domineering

pompous/pretentious

obedient

rebellious/defiant

condescending/patronizing/snobbish

pedantic

unaffected

servile/obsequious

 

Table 41 provides the semantics for each label in the GENERAL ADJECTIVAL and SPECIFIC ADJECTIVAL columns above.

 

Table 41 – Basic Social Attitude Semantics Set

ID Social Attitude Meaning
1 accepting attitude communicating willingness to accept into relationship or group
2 admiring/approving attitude due to perception that others’ actions or results are valuable
3 aggressive tending to physically or metaphorically attack
4 apathetic/indifferent showing lack of interest
5 approachable sociable and not inspiring inhibition
6 argumentative tending to argue or dispute
7 arrogant emotion communicating social dominance
8 assertive taking active role in social situations
9 awed approval combined with incomprehension or fear
10 belittling criticising by understating victim’s achievements, personal attributes, etc.
11 boastful tending to praise or promote self
12 brazen high degree of forwardness/presumption
13 candid/frank open in linguistic communication
14 closed/distant not open
15 commanding/domineering tending to assert right to command
16 combative/belligerent high degree of aggression, often physical
17 communicative evincing willingness to communicate as needed
18 conceited/vain evincing undesirable degree of self-esteem
19 condescending/patronizing/snobbish disrespectfully asserting superior social status, experience, knowledge, or membership
20 confident attitude due to belief in own ability
21 congratulatory wishing well related to another’s success or good luck
22 contemptuous high degree of disapproval and perceived superiority
23 controlling undesirably assertive
24 cool repressing outward reaction, often to indicate confidence or dominance, especially when confronting aggression, panic, etc.
25 cooperative/agreeable communicating willingness to cooperate
26 critical attitude expressing disapproval
27 cynical habitually negative, reflecting disappointment or disillusionment
28 dependent evincing inability to function without aid
29 discouraged/dejected unmotivated because goals or rewards were not achieved
30 disagreeable not agreeable
31 disapproving not approving
32 dishonest/deceitful/insincere not honest
33 dismissive actively indicating lack of interest or motivation
34 distrustful not trusting
35 emotional/passionate high degree of responsiveness to emotions
36 empathetic/caring interested in or vicariously feeling others’ emotions
37 enthusiastic high degree of positive response, especially to specific occurrence
38 excited/stimulated attitude indicating cognitive and emotional arousal
39 exclusive/cliquish not welcoming into a social group
40 flattering praising with intent to influence, often insincere
41 flexible willing to adjust to changing circumstances or needs
42 forward/presumptuous not observing norms related to intimacy or rank
43 forgiving tending to forgive improper behaviour
44 friendly welcoming or inviting social contact
45 generous tending to give to others, materially or otherwise
46 guilty/remorseful/sorry regret due to consciousness of hurting or damaging others
47 helpless high degree of dependence
48 honest/sincere tending to communicate without deception
49 independent not dependent
50 indifferent neither approving nor disapproving
51 inhibited/reserved/introverted/withdrawn unable or unwilling to participate socially
52 inspired motivated by some person, event, etc.
53 irresponsible not responsible
54 kind tending to act as motivated by empathy or sympathy
55 laudatory praising
56 lewd/bawdy/indecent evoking sexual associations in ways beyond social norms
57 like-minded attitude expressing agreement
58 melodramatic high or excessive degree of responsiveness or demonstrativeness
59 merciful tending to avoid punishing others, often motivated by empathy or sympathy
60 merciless/ruthless not merciful
61 mocking communicating non-physical aggression, often by imitating a disapproved aspect of the victim
62 modest/humble/unassuming not boastful
63 motivated communicating goal-directed emotion and cognitive state
64 negative/defeatist expressing pessimism, often habitually
65 obedient evincing tendency to obey commands
66 open tending to communicate without inhibition
67 optimistic tending to expect positive events or results
68 outgoing/extroverted/uninhibited/unreserved not inhibited
69 passive not assertive
70 passive-aggressive covertly and non-physically aggressive
71 peaceful not aggressive
72 pedantic excessively displaying knowledge or academic status
73 permissive allowing activity that social norms might restrict
74 pessimistic tending to expect negative events or results
75 petty unforgiving concerning small matters
76 polite/courteous/respectful tending to respect social norms
77 pompous/pretentious excessively displaying social rank, often above actual status
78 positive expressing optimism, often habitually
79 prudish/priggish expressing disapproval of even minor social transgressions, especially related to sex
80 reasonable evincing willingness to resolve issues through reasoning
81 rebellious/defiant evincing unwillingness to obey
82 responsible/trustworthy/dependable evincing characteristics or behaviour that encourage trust
83 responsive/demonstrative tending to outwardly react to emotions and cognitive states, often as prompted by others
84 rude/disrespectful not polite or respectful
85 sanguine low degree of optimism, often expressed calmly
86 sarcastic communicating disagreement by pretending agreement in an obviously insincere manner
87 seductive communicating interest in sexual or related contact
88 self-absorbed not empathetic due to excessive interest in self
89 self-deprecating/self-effacing tending to criticize, or fail to praise or promote, self
90 selfish/self-serving not generous due to excessive interest in own benefit
91 selfless/altruistic tending to act for others’ benefit, sometimes exclusively
92 servile/obsequious excessively and demonstrably obedient
93 shy low degree of social inhibition
94 smug evincing undesirable degree of self-esteem related to perceived triumph
95 stubborn unwilling to change one’s mind or behaviour
96 sociable comfortable in social situations
97 submissive tending to submit to social dominance
98 subversive/undermining communicating intention to work against a victim’s goals
99 suggestive/risqué/naughty evoking sexual associations within social norms
100 supportive communicating willingness to support as needed
101 sympathetic empathetic related to others’ hurt or suffering
102 trusting tending to trust others
103 unaffected not pompous
104 uncaring/callous not empathetic or caring
105 uncommunicative not communicative
106 uncooperative not cooperative
107 understanding forgiving due to ability to understand motivations
108 unemotional/dispassionate/detached not emotional, even when emotion is expected
109 unenthusiastic not enthusiastic
110 unfriendly/hostile not friendly
111 unresponsive/undemonstrative not responsive or demonstrative
112 welcoming/inviting high degree of acceptance with emotional warmth
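
An informative example of a possible Social Attitude instance conforming to the Syntax of 7.12.1 follows. The values are illustrative only: the set name “MPAI-2.0” is an assumed placeholder and the IDs refer to Table 41 (76 = polite/courteous/respectful, 44 = friendly).

{
  "primary": {
    "attitudeDegree": "High",
    "attitudeName": 76,
    "attitudeSetName": "MPAI-2.0"
  },
  "secondary": {
    "attitudeDegree": "Medium",
    "attitudeName": 44,
    "attitudeSetName": "MPAI-2.0"
  }
}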

7.13    Spatial Attitude

Spatial Attitude is specified in MPAI-OSD V1 [5].

7.14    Speech Descriptors

Speech Descriptors act as Speech Features defined in Personal Status Extraction.

7.15    Speech Features

Speech Features are digitally represented as follows.

7.15.1    Syntax

{
  "$schema": "http://json-schema.org/draft-07/schema",
  "definitions": {
    "SpeechFeatures": {
      "type": "object",
      "properties": {
        "pitch": { "type": "real" },
        "tone": { "type": "ToneType" },
        "intonation": [
          {
            "type_p": "pitch",
            "type_s": "speed",
            "type_i": "intensity"
          }
        ],
        "intensity": { "type": "real" },
        "speed": { "type": "real" },
        "emotion": { "type": "EmotionType" },
        "NNSpeechFeatures": { "type": "vector of floating point" }
      }
    }
  },
  "type": "object",
  "properties": {
    "primary": { "$ref": "#/definitions/SpeechFeatures" },
    "secondary": { "$ref": "#/definitions/SpeechFeatures" }
  }
}

{
  "$schema": "http://json-schema.org/draft-07/schema",
  "definitions": {
    "ToneType": {
      "type": "object",
      "properties": {
        "toneName": { "type": "string" },
        "toneSetName": { "type": "string" }
      }
    }
  },
  "type": "object",
  "properties": {
    "primary": { "$ref": "#/definitions/ToneType" },
    "secondary": { "$ref": "#/definitions/ToneType" }
  }
}

 

7.15.2    Semantics

Name Definition
SpeechFeatures Indicates characteristic elements extracted from the input speech, specifically pitch, tone, intonation, intensity, speed, emotion, and NNSpeechFeatures.
NNSpeechFeatures Indicates neural-network-based characteristic elements extracted from the input speech by a neural network.
pitch Indicates the fundamental frequency of Speech expressed as a real number indicating frequency as Hz (Hertz).
tone Tone is a variation in the pitch of the voice while speaking expressed as human readable words as in Table 42.
ToneType Indicates the Tone that the input speech carries.
intonation A variation of the pitch, intensity and speed within a time period measured in seconds.
intensity Energy of Speech expressed as a real number indicating dBs (decibel).
speed Indicates the Speech Rate as a real number indicating specified linguistic units (e.g., Phonemes, Syllables, or Words) per second.
emotion Indicates the Emotion that the input speech carries.
EmotionType Indicates the Emotion that the input speech carries.
toneName Specifies the name of a Tone.
toneSetName Name of the Tone set which contains the Tone. Tone set is used as a baseline, but other sets are possible.

Note: The semantics of “tone” defines a basic set of elements characterising tone. Elements can be added to the basic set or new sets defined using the registration procedure defined for Emotion Sets (0).

 

Table 42 – Basic Tones

TONE CATEGORIES                          ADJECTIVAL        Semantics
FORMALITY                                formal            serious, official, polite
FORMALITY                                informal          everyday, relaxed, casual
ASSERTIVENESS                            assertive         certain about content
ASSERTIVENESS                            factual           neutral about content
ASSERTIVENESS                            hesitant          uncertain about content
REGISTER (per situation or use case)     conversational    appropriate to an informal speaking situation
REGISTER (per situation or use case)     directive         related to commands or requests for action
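 

As an informative illustration of the Syntax and Semantics above, the following Python fragment builds a hypothetical Speech Features instance and serialises it as JSON. The field values, the name of the Tone set, and the Emotion sub-structure are illustrative assumptions, not normative examples.

import json

# Informative sketch: a hypothetical Speech Features instance following the
# semantics of 7.15.2 (pitch in Hz, intensity in dB, speed in linguistic units
# per second) and a Tone taken from the Basic Tones of Table 42.
# The "emotion" sub-structure is an assumption, not the normative EmotionType.
speech_features = {
    "pitch": 118.5,                            # fundamental frequency (Hz)
    "intensity": 62.0,                         # energy (dB)
    "speed": 4.2,                              # e.g., Syllables per second
    "tone": {
        "toneName": "assertive",
        "toneSetName": "MPAI-MMC Basic Tones"  # assumed name of the baseline set
    },
    "emotion": {"emotionName": "joy"},         # illustrative placeholder
    "NNSpeechFeatures": [0.12, -0.53, 0.98]    # vector of floating point values
}

print(json.dumps(speech_features, indent=2))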

7.16    Text

The Format of Input Text, Output Text and Recognised Text is provided by ISO/IEC 10646; Information technology – Universal Coded Character Set [9].

7.17    Text Descriptors

Meaning, as defined in Personal Status Extraction, acts as Text Descriptors.

7.18    Video

Video satisfies the following specifications:

  1. Pixel shape: square
  2. Bit depth: 8 or 10 bits/pixel
  3. Aspect ratio: 4/3 or 16/9
  4. 640 < # of horizontal pixels < 1920
  5. 480 < # of vertical pixels < 1080
  6. Frame frequency 50-120 Hz
  7. Scanning: progressive
  8. Colorimetry: ITU-R BT709 or BT2020
  9. Colour format: RGB or YUV
  10. Compression:
    1. If compressed, compression according to one of the following standards: MPEG-4 AVC [10], MPEG-H HEVC [13], MPEG-5 EVC [14].
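
The constraints above can be checked mechanically. The following informative Python sketch tests a candidate video configuration against them; the dictionary keys are illustrative, and the pixel-count and frame-frequency ranges are interpreted as inclusive.

# Informative sketch: check a candidate video configuration against the
# constraints of 7.18. Dictionary keys are illustrative; ranges are treated
# as inclusive.
def video_conforms(cfg: dict) -> bool:
    return all([
        cfg.get("pixel_shape") == "square",
        cfg.get("bit_depth") in (8, 10),
        cfg.get("aspect_ratio") in ("4/3", "16/9"),
        640 <= cfg.get("width", 0) <= 1920,
        480 <= cfg.get("height", 0) <= 1080,
        50 <= cfg.get("frame_rate_hz", 0) <= 120,
        cfg.get("scanning") == "progressive",
        cfg.get("colorimetry") in ("BT709", "BT2020"),
        cfg.get("colour_format") in ("RGB", "YUV"),
        cfg.get("compression") in (None, "AVC", "HEVC", "EVC"),
    ])

print(video_conforms({
    "pixel_shape": "square", "bit_depth": 10, "aspect_ratio": "16/9",
    "width": 1920, "height": 1080, "frame_rate_hz": 60,
    "scanning": "progressive", "colorimetry": "BT709",
    "colour_format": "YUV", "compression": "HEVC"
}))   # prints True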

7.19    Video File

The Format of a Video File is the MP4 File Format [12].

7.20    Video of Faces KB Query Format

Data Specification: All faces in the Video of Faces KB shall be aligned.

Input: The Video of Faces KB is queried with an Emotion.

Output: The response is a Video File of a human face.
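
As an informative illustration, the query interface can be thought of as a lookup from an Emotion to the corresponding Video File of an aligned face. The sketch below models the KB as a simple mapping; the function name, entries, and file paths are hypothetical.

from typing import Optional

# Informative sketch: the Video of Faces KB modelled as a mapping from an
# Emotion name to the path of a Video File of an aligned human face.
# Entries and the function name are hypothetical.
VIDEO_OF_FACES_KB = {
    "joy": "faces/joy.mp4",
    "sadness": "faces/sadness.mp4",
    "anger": "faces/anger.mp4",
}

def query_video_of_faces_kb(emotion_name: str) -> Optional[str]:
    """Return the Video File of a face expressing the given Emotion, if any."""
    return VIDEO_OF_FACES_KB.get(emotion_name)

print(query_video_of_faces_kb("joy"))   # faces/joy.mp4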

7.21    Visual Scene Descriptors

Visual Scene Descriptors are specified in MPAI-OSD [5].

 

  • MPAI Basics

1          General

In recent years, Artificial Intelligence (AI) and related technologies have been introduced in a broad range of applications affecting the life of millions of people and are expected to do so much more in the future. As digital media standards have positively influenced industry and billions of people, so AI-based data coding standards are expected to have a similar positive impact. In addition, some AI technologies may carry inherent risks, e.g., in terms of bias toward some classes of users, making the need for standardisation more important and urgent than ever.

 

The above considerations have prompted the establishment of the international, unaffiliated, not-for-profit Moving Picture, Audio and Data Coding by Artificial Intelligence (MPAI) organisation with the mission to develop AI-enabled data coding standards to enable the development of AI-based products, applications, and services.

 

As a rule, MPAI standards include four documents: Technical Specification, Reference Software Specifications, Conformance Testing Specifications, and Performance Assessment Specifications.

The last – and new in standardisation – type of Specification includes standard operating procedures that enable users of MPAI Implementations to make informed decisions about their applicability based on the notion of Performance, defined as a set of attributes characterising a reliable and trustworthy implementation.

 

2          Governance of the MPAI Ecosystem

The technical foundations of the MPAI Ecosystem are currently provided by the following documents developed and maintained by MPAI:

  1. Technical Specification.
  2. Reference Software Specification.
  3. Conformance Testing.
  4. Performance Assessment.
  5. Technical Report

An MPAI Standard is a collection of a variable number of the 5 document types.

 

Figure 12 depicts the MPAI ecosystem operation for conforming MPAI implementations.

 

Figure 12 – The MPAI ecosystem operation

Technical Specification: Governance of the MPAI Ecosystem identifies the roles in the MPAI Ecosystem listed in Table 43:

 

Table 43 – Roles in the MPAI Ecosystem

MPAI Publishes Standards.

Establishes the not-for-profit MPAI Store.

Appoints Performance Assessors.

Implementers Submit Implementations to Performance Assessors.
Performance Assessors Inform Implementation submitters and the MPAI Store if Implementation Performance is acceptable.
Implementers Submit Implementations to the MPAI Store.
MPAI Store Assigns unique ImplementerIDs (IID) to Implementers in its capacity as ImplementerID Registration Authority (IIDRA)[1].

Verifies security and Tests Implementation Conformance.

Users Download Implementations and report their experience to MPAI.

 

3          AI Framework

In general, MPAI Application Standards are defined as aggregations – called AI Workflows (AIW) – of processing elements – called AI Modules (AIM) – executed in an AI Framework (AIF). MPAI defines Interoperability as the ability to replace an AIW or an AIM Implementation with a functionally equivalent Implementation.

 

Figure 13 depicts the MPAI-AIF Reference Model under which Implementations of MPAI Application Standards and user-defined MPAI-AIF Conforming applications operate [2].

 

Figure 13 – The AI Framework (AIF) Reference Model

MPAI Application Standards normatively specify the Syntax and Semantics of the input and output data and the Function of the AIW and the AIMs, and the Connections between and among the AIMs of an AIW.

 

An AIW is defined by its Function and input/output Data and by its AIM topology. Likewise, an AIM is defined by its Function and input/output Data. MPAI standards are silent on the technology used to implement the AIM, which may be based on AI or data processing, and implemented in software, hardware, or hybrid software and hardware technologies.
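
The following informative Python sketch illustrates this point: an AIM is characterised only by its Function and its input/output Data, so two functionally equivalent implementations can be exchanged within the same AIW. All names are illustrative and do not come from MPAI-AIF.

from typing import Callable, Dict, List

# Informative sketch: an AIM is modelled as a function from input data to
# output data; an AIW is a topology (here, a simple chain) of such AIMs.
AIM = Callable[[Dict[str, object]], Dict[str, object]]

def speech_recognition_a(data):        # one AIM implementation
    return {"RecognisedText": "hello world"}

def speech_recognition_b(data):        # a functionally equivalent implementation
    return {"RecognisedText": "hello world"}

def run_aiw(aims: List[AIM], inputs: Dict[str, object]) -> Dict[str, object]:
    """Execute a linear AIW by passing each AIM's outputs to the next AIM."""
    data = dict(inputs)
    for aim in aims:
        data.update(aim(data))
    return data

# Either implementation can be plugged into the same AIW without changing it.
print(run_aiw([speech_recognition_a], {"InputSpeech": b"..."}))
print(run_aiw([speech_recognition_b], {"InputSpeech": b"..."}))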

 

MPAI also defines 3 Interoperability Levels of an AIF that executes an AIW. Table 44 gives the characteristics of an AIW and its AIMs of a given Level:

 

Table 44 – MPAI Interoperability Levels

Level AIW AIMs
1 An implementation of a use case Implementations able to call the MPAI-AIF APIs.
2 An Implementation of an MPAI Use Case Implementations of the MPAI Use Case
3 An Implementation of an MPAI Use Case certified by a Performance Assessor Implementations of the MPAI Use Case certified by Performance Assessors

 

4          Audio-Visual Scene Description

The ability to describe (i.e., digitally represent) an audio-visual scene is a key requirement of several MPAI Technical Specifications and Use Cases. MPAI has developed Technical Specification: Context-based Audio Enhancement (MPAI-CAE) [4], which includes Audio Scene Descriptors, and uses a subset of Graphics Language Transmission Format (glTF) [7] to describe a visual scene.

4.1        Audio Scene Descriptors

Audio Scene Description is a Composite AI Module (AIM) specified by Technical Specification: Context-based Audio Enhancement (MPAI-CAE) [4]. The position of an Audio Object is defined by Azimuth, Elevation, Distance.
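
As an informative aid, the (Azimuth, Elevation, Distance) triple can be converted to Cartesian coordinates as sketched below; the angle units, reference axes, and sign conventions are assumptions and may differ from those specified in MPAI-CAE.

import math

# Informative sketch: convert an Audio Object position given as
# (Azimuth, Elevation, Distance) into Cartesian (x, y, z) coordinates.
# Assumptions: angles in degrees, azimuth measured in the horizontal plane,
# elevation measured from the horizontal plane, distance in metres.
def audio_object_position(azimuth_deg: float, elevation_deg: float, distance_m: float):
    az = math.radians(azimuth_deg)
    el = math.radians(elevation_deg)
    x = distance_m * math.cos(el) * math.cos(az)
    y = distance_m * math.cos(el) * math.sin(az)
    z = distance_m * math.sin(el)
    return x, y, z

# Example: an object 2 m away, 30 degrees to the side, 10 degrees above ear level.
print(audio_object_position(30.0, 10.0, 2.0))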

 

The Composite AIM and its composing AIMs are depicted in Figure 14.

 

Figure 14 – The Audio Scene Description Composite AIM

4.2        Visual Scene Descriptors

MPAI uses a subset of Graphics Language Transmission Format (glTF) [7] to describe a visual scene.

5          Avatar-Based Videoconference

Technical Report: Avatar-Based Videoconference (MPAI-ARA) specifies AIWs and AIMs of a Use Case where geographically distributed humans hold a videoconference represented by their avatars. Figure 15 depicts the components of the system supporting the conference of a group of humans participating through avatars that have the participants’ visual appearance and utter their real voices.

 

Figure 15 – Avatar-Based Videoconference end-to-end diagram

Figure 16 contains the reference architectures of the four AI Workflows constituting the Avatar-Based Videoconference: Client (Transmission side), Server, Virtual Secretary, and Client (Receiving side).

 

Figure 16 – The AIWs of Avatar-Based Videoconference

6          Connected Autonomous Vehicles

MPAI defines a Connected Autonomous Vehicle (CAV) as a physical system that:

  1. Converses with humans by understanding their utterances, e.g., a request to be taken to a destination.
  2. Acquires information with a variety of sensors on the physical environment where it is located or that it traverses, like the one depicted in Figure 17.
  3. Plans a Route enabling the CAV to reach the requested destination.
  4. Autonomously reaches the destination by:
    • Moving in the physical environment.
    • Building Digital Representations of the Environment.
    • Exchanging elements of such Representations with other CAVs and CAV-aware entities.
    • Making decisions about how to execute the Route.
    • Acting on the CAV motion actuation to implement the decisions.

 

Figure 17 – An environment of CAV operation

 

MPAI believes in the capability of standards to accelerate the creation of a global competitive CAV market and has published Technical Specification: Connected Autonomous Vehicle (MPAI-CAV) – Architecture that includes (see Figure 18):

  1. A CAV Reference Model broken down into four Subsystems.
  2. The Functions of each Subsystem.
  3. The Data exchanged between Subsystems.
  4. A breakdown of each Subsystem into Components, for which the following is specified:
    • The Functions of the Components.
    • The Data exchanged between Components.
    • The Topology of Components and their Connections.
  5. Subsequently, Functional Requirements of the Data exchanged.
  6. Eventually, standard technologies for the Data exchanged.

 

Figure 18 – The MPAI-CAV Subsystems with their Components

Subsystems are implemented as AI Workflows and Components as AI Modules according to Technical Specification: AI Framework (MPAI-AIF) [2].

 

 

 

 

  • MPAI-wide terms and definitions

The Terms used in this standard whose first letter is capitalised and that are not already included in Table 1 are defined in Table 45.

 

 

Table 45 – MPAI-wide Terms

Term Definition
Access Static or slowly changing data that are required by an application such as domain knowledge data, data models, etc.
AI Framework (AIF) The environment where AIWs are executed.
AI Module (AIM) A data processing element receiving AIM-specific Inputs and producing AIM-specific Outputs according to its Function. An AIM may be an aggregation of AIMs.
AI Workflow (AIW) A structured aggregation of AIMs implementing a Use Case receiving AIW-specific inputs and producing AIW-specific outputs according to the AIW Function.
Application Standard An MPAI Standard designed to enable a particular application domain.
Channel A connection between an output port of an AIM and an input port of an AIM. The term “connection” is also used as a synonym.
Communication The infrastructure that implements message passing between AIMs
Composite AIM An AIM aggregating more than one AIM.
Component One of the 7 AIF elements: Access, Communication, Controller, Internal Storage, Global Storage, Store, and User Agent
Conformance The attribute of an Implementation of being a correct technical Implementation of a Technical Specification.
Conformance Tester An entity Testing the Conformance of an Implementation.
Conformance Testing The normative document specifying the Means to Test the Conformance of an Implementation.
Conformance Testing Means Procedures, tools, data sets and/or data set characteristics to Test the Conformance of an Implementation.
Connection A channel connecting an output port of an AIM and an input port of an AIM.
Controller A Component that manages and controls the AIMs in the AIF, so that they execute in the correct order and at the time when they are needed
Data Format The standard digital representation of data.
Data Semantics The meaning of data.
Ecosystem The ensemble of actors making it possible for a User to execute an application composed of an AIF, one or more AIWs, each with one or more AIMs potentially sourced from independent implementers.
Explainability The ability to trace the output of an Implementation back to the inputs that have produced it.
Fairness The attribute of an Implementation whose extent of applicability can be assessed by making the training set and/or network open to testing for bias and unanticipated results.
Function The operations effected by an AIW or an AIM on input data.
Global Storage A Component to store data shared by AIMs.
Internal Storage A Component to store data of the individual AIMs.
Identifier A name that uniquely identifies an Implementation.
Implementation 1.      An embodiment of the MPAI-AIF Technical Specification, or

2.      An AIW or AIM of a particular Level (1-2-3) conforming with a Use Case of an MPAI Application Standard.

Implementer A legal entity implementing MPAI Technical Specifications.
ImplementerID (IID) A unique name assigned by the ImplementerID Registration Authority to an Implementer.
ImplementerID Registration Authority (IIDRA) The entity appointed by MPAI to assign ImplementerID’s to Implementers.
Interoperability The ability to functionally replace an AIW or an AIM with another Implementation having the same Interoperability Level.
Interoperability Level The attribute of an AIW and its AIMs to be executable in an AIF Implementation and to:

1.      Be proprietary (Level 1)

2.      Pass the Conformance Testing (Level 2) of an Application Standard

3.      Pass the Performance Testing (Level 3) of an Application Standard.

Knowledge Base Structured and/or unstructured information made accessible to AIMs via MPAI-specified interfaces
Message A sequence of Records transported by Communication through Channels.
Normativity The set of attributes of a technology or a set of technologies specified by the applicable parts of an MPAI standard.
Performance The attribute of an Implementation of being Reliable, Robust, Fair and Replicable.
Performance Assessment The normative document specifying the Means to Assess the Grade of Performance of an Implementation.
Performance Assessment Means Procedures, tools, data sets and/or data set characteristics to Assess the Performance of an Implementation.
Performance Assessor An entity Assessing the Performance of an Implementation.
Profile A particular subset of the technologies used in MPAI-AIF or an AIW of an Application Standard and, where applicable, the classes, other subsets, options and parameters relevant to that subset.
Record A data structure with a specified structure
Reference Model The AIMs and their Connections in an AIW.
Reference Software A technically correct software implementation of a Technical Specification containing source code, or source and compiled code.
Reliability The attribute of an Implementation that performs as specified by the Application Standard, profile, and version the Implementation refers to, e.g., within the application scope, stated limitations, and for the period of time specified by the Implementer.
Replicability The attribute of an Implementation whose Performance, as Assessed by a Performance Assessor, can be replicated, within an agreed level, by another Performance Assessor.
Robustness The attribute of an Implementation that copes with data outside of the stated application scope with an estimated degree of confidence.
Scope The domain of applicability of an MPAI Application Standard
Service Provider An entrepreneur who offers an Implementation as a service (e.g., a recommendation service) to Users.
Standard The ensemble of Technical Specification, Reference Software, Conformance Testing and Performance Assessment of an MPAI Application Standard.
Technical Specification (Framework) the normative specification of the AIF.

(Application) the normative specification of the set of AIWs belonging to an application domain along with the AIMs required to Implement the AIWs that includes:

1.      The formats of the Input/Output data of the AIWs implementing the Use Cases.

2.      The Connections of the AIMs of the AIW.

3.      The formats of the Input/Output data of the AIMs belonging to the AIW.

Testing Laboratory A laboratory accredited to Assess the Grade of  Performance of Implementations.
Time Base The protocol specifying how Components can access timing information
Topology The set of AIM Connections of an AIW.
Use Case A particular instance of the Application domain target of an Application Standard.
User A user of an Implementation.
User Agent The Component interfacing the user with an AIF through the Controller
Version A revision or extension of a Standard or of one of its elements.

 

 

 

 

  • Notices and Disclaimers Concerning MPAI Standards (Informative)

 

The notices and legal disclaimers given below shall be borne in mind when downloading and using approved MPAI Standards.

 

In the following, “Standard” means the collection of four MPAI-approved and published documents: “Technical Specification”, “Reference Software” and “Conformance Testing” and, where applicable, “Performance Testing”.

 

Life cycle of MPAI Standards

MPAI Standards are developed in accordance with the MPAI Statutes. An MPAI Standard may only be developed when a Framework Licence has been adopted. MPAI Standards are developed by especially established MPAI Development Committees who operate on the basis of consensus, as specified in Annex 1 of the MPAI Statutes. While the MPAI General Assembly and the Board of Directors administer the process of the said Annex 1, MPAI does not independently evaluate, test, or verify the accuracy of any of the information or the suitability of any of the technology choices made in its Standards.

 

MPAI Standards may be modified at any time by corrigenda or new editions. A new edition, however, may not necessarily replace an existing MPAI standard. Visit the web page to determine the status of any given published MPAI Standard.

 

Comments on MPAI Standards are welcome from any interested parties, whether MPAI members or not. Comments shall mandatorily include the name and the version of the MPAI Standard and, if applicable, the specific page or line the comment applies to. Comments should be sent to the MPAI Secretariat. Comments will be reviewed by the appropriate committee for their technical relevance. However, MPAI does not provide interpretation, consulting information, or advice on MPAI Standards. Interested parties are invited to join MPAI so that they can attend the relevant Development Committees.

 

Coverage and Applicability of MPAI Standards

MPAI makes no warranties or representations of any kind concerning its Standards, and expressly disclaims all warranties, expressed or implied, concerning any of its Standards, including but not limited to the warranties of merchantability, fitness for a particular purpose, non-infringement etc. MPAI Standards are supplied “AS IS”.

 

The existence of an MPAI Standard does not imply that there are no other ways to produce and distribute products and services in the scope of the Standard. Technical progress may render the technologies included in the MPAI Standard obsolete by the time the Standard is used, especially in a field as dynamic as AI. Therefore, those looking for standards in the Data Compression by Artificial Intelligence area should carefully assess the suitability of MPAI Standards for their needs.

 

IN NO EVENT SHALL MPAI BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO: THE NEED TO PROCURE SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE PUBLICATION, USE OF, OR RELIANCE UPON ANY STANDARD, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE AND REGARDLESS OF WHETHER SUCH DAMAGE WAS FORESEEABLE.

 

MPAI alerts users that practicing its Standards may infringe patents and other rights of third parties. Submitters of technologies to this standard have agreed to licence their Intellectual Property according to their respective Framework Licences.

 

Users of MPAI Standards should consider all applicable laws and regulations when using an MPAI Standard. The validity of Conformance Testing is strictly technical and refers to the correct implementation of the MPAI Standard. Moreover, positive Performance Assessment of an implementation applies exclusively in the context of the MPAI Governance and does not imply compliance with any regulatory requirements in the context of any jurisdiction. Therefore, it is the responsibility of the MPAI Standard implementer to observe or refer to the applicable regulatory requirements. By publishing an MPAI Standard, MPAI does not intend to promote actions that are not in compliance with applicable laws, and the Standard shall not be construed as doing so. In particular, users should evaluate MPAI Standards from the viewpoint of data privacy and data ownership in the context of their jurisdictions.

 

Implementers and users of MPAI Standards documents are responsible for determining and complying with all appropriate safety, security, environmental and health requirements, and all applicable laws and regulations.

 

Copyright

MPAI draft and approved standards, whether they are in the form of documents or as web pages or otherwise, are copyrighted by MPAI under Swiss and international copyright laws. MPAI Standards are made available and may be used for a wide variety of public and private uses, e.g., implementation, use and reference, in laws and regulations and standardisation. By making these documents available for these and other uses, however, MPAI does not waive any rights in copyright to its Standards. For inquiries regarding the copyright of MPAI standards, please contact the MPAI Secretariat.

 

The Reference Software of an MPAI Standard is released with the MPAI Modified Berkeley Software Distribution licence. However, implementers should be aware that the Reference Software of an MPAI Standard may reference some third-party software that may have a different licence.

 

 

 

 

  • Patent declarations (Informative)

 

The MPAI Multimodal Conversation (MPAI-MMC) Technical Specification has been developed according to the process outlined in the MPAI Statutes [15] and the MPAI Patent Policy [16].

The following entities have agreed to licence their standard essential patents reading on the MPAI Multimodal Conversation (MPAI-MMC) Technical Specification according to the MPAI-MMC Framework Licence [17]:

 

Table 46 – Companies having submitted a patent declaration (MPAI-MMC V1)

Entity Name Email address
ETRI Songwon Lee lsw84@etri.re.k
KLleon Jisu Kang jisu.kang@klleon.io
Speech Morphing, Inc. Fathy Yassa fathy@speechmorphing.com

 

Patent declarations concern Version 1. Declarations for Version 2 will be published after the corresponding requests for declarations have been made.

  • Personal Status (Informative)

The study of “personal status” – of emotion, cognitive states, attitudes, and other status factors that a person can express at a given time – is not new: many aspects have long been studied. Now, however, technological and scientific advances promise accelerating understanding. MPAI’s aim is to establish standards in various current and future use cases involving Personal Status – for instance, to enable computational systems to recognize users’ emotions and react to them most helpfully. Thus, the need arises to at least roughly characterize and survey Emotions, Cognitive States, and Attitudes.

 

To begin meeting this need, this document proposes definitions, listings, and semantic characterizations of these three factors. These proposals are indeed rough and subject to disagreement or revision on many levels. Accordingly, they can in fact be revised for particular use cases and as the relevant studies move ahead. Revision procedures are specified in the Conclusion below.

 

This Annex offers definitions and examples of each status factor, with brief discussion. Listings of labels and accompanying semantics per factor are given in Section 4.2.

 

Emotions are states of physiological arousal accompanied by changes in facial expressions, gestures, posture, or subjective feelings. Examples include joy, sadness, disgust, fear, and anger. Innate elements of emotions – there may be learned components as well – are controlled by the subcortical regions of the brain, including the amygdala, ventral striatum, and hypothalamus.

 

Sensations like pain, pleasure, taste, vision, hearing, and so on are likewise largely innate, but we’ll try to distinguish them from Emotions as such. Unlike Emotions, sensations will not be defined or listed here.

 

Cognitive states are the results of information processing: a cognitive system accepts input patterns – in humans, initially perceptual patterns, whether new or stored – and produces output patterns, which may include actions that can affect the world outside the system. To perform this processing, the system must recognize the input patterns, perhaps influenced by priming (“expectations”), and then associate them with other patterns, often in a sequence of steps or flow, until the output pattern is reached. The recognition, associations, and sequencing giving rise to Cognitive States may sometimes be innate; but in humans, they’re predominantly learned.

 

This high-level definition of cognition and Cognitive States could describe not only human or other biological information processing, but artificial processing as well – such as that carried out by self-driving vehicles, which must recognize other vehicles, signs and signals, etc., based on patterns conveyed by sensors, and, through processing, derive appropriate action patterns. Clearly, then, the definition is meant to exclude emotion, since the vehicles have none, and in fact probably lack sensations (“qualia”) of any sort, much less consciousness. In humans, however, the separation between emotion and cognition is much harder to make cleanly, since much information processing is at least partly driven by drives which are associated with emotions. Even so, it’s helpful to maintain the separation for analytical purposes; so this Annex will treat Cognitive States as those information processing states which even a system lacking emotions might be able to enter – the processing states that Star Trek’s “purely logical” Mr. Spock might be found in.

However, while observing the distinction between Emotions and Cognitive States as an analytical aid, we certainly recognize (1) borderline cases (like Curiosity, which does involve a drive to obtain new information, but might still be modelled by a system which pursued that goal in numerical terms without emotion, as Mr. Spock might do) and (2) hybrid or overlapping states in which both cognitive processing and emotion play parts (like Positive or Negative Surprise, in which a human is both surprised – as even Mr. Spock might be – but also emotionally pleased or displeased by the unexpected event or discovery).

 

Since we’re defining and listing Emotions and Cognitive States for the limited purposes of near-term human-machine interaction, we’ll avoid a wide range of human emotional and cognitive concerns. Again, we’re bypassing discussion of sensation or consciousness. Likewise, we’ll avoid concern with the emotional factors in human decision-making (related to issues of bias and free will); with abnormal psychology (related to psychosis, obsessive-compulsive disorder, amnesia, etc.); or with many more psychological areas.

 

So, for example, while we will currently be interested in the following states, among others (each clearly a Cognitive State, though some are also viewable as borderline, hybrid, or both):

 

  1. Interest: determination that certain percepts are relevant to goals
  2. Curiosity: bias toward seeking or attending to new percepts or information
  3. Confusion: disorderly information processing
  4. Certainty: conclusion that percepts or processing results are reliable (e.g., as basis for action)
  5. Attention: bias to process some percepts and not others; bias to direct processing through a certain sequence and not others

 

… we will for now avoid discussion of states like these:

  1. Amnesia: loss of long-term memory
  2. Psychosis: a cognitive disorder in which mental percepts are sometimes confused with objectively real ones
  3. Priming: cognitive bias to recognize or process percepts in a certain way
  4. Consciousness: reportable awareness, augmented by self-concept, self-history, awareness of being aware, etc.
  5. Subconscious processing: information processing without awareness or consciousness

 

A person’s attitudes are ways of relating to exterior elements – most often, to other humans, but also to situations, facts, etc. They’re ways of feeling or thinking about those elements, and/or ways of behaving toward them, prompted by the relevant Emotions and Cognitive States.

 

For MPAI’s purposes, Attitudes are of interest for analysis of relations within use cases between people, and/or between people and computational systems. How can a machine communicate a helpful Attitude – the hybrid combination of Emotion and Cognitive State that constitutes a desire to be useful? How can a machine recognize a resentful Attitude – perhaps arising from a user’s anger (Emotion) at her belief (Cognitive State) that she has been treated unfairly in a transaction?

 

The prompting or engendering of Attitudes by relevant Emotions and Cognitive States can be depicted in various ways, as in Figures 20 and 21 below; but, whatever the graphic description, for the purposes of MPAI’s standardization efforts, the focus will remain on the relational aspect of Attitudes, and especially on social relations.

 

Given that Emotions and Cognitive States themselves are difficult to describe precisely, we can’t expect definitive listings or semantic characterizations of the Attitudes that arise from them. Even so, we hope that those in Section 4.2 can prove useful in facilitating coordination among modules.

 

 

Figure 20 – Components of Attitude

Figure 21 – Process of the Behaviour from the Emotion and Attitudes

 

 

  • AIW and AIM Metadata of MMC-CPS

1          Metadata for MPAI-CPS AIW

{

“$schema”:”https://json-schema.org/draft/2020-12/schema”,

“$id”:”https://mpai.community/standards/resources/MPAI-AIF/V2/AIW-AIM-metadata.schema.json”,

“title”:”CPS AIF v2 AIW/AIM metadata”,

“Identifier”:{

“ImplementerID”:”/* String assigned by IIDRA */”,

“Specification”:{

“Standard”:”MPAI-MMC”,

“AIW”:”MMC-CPS”,

“AIM”:”MMC-CPS”,

“Version”:”2″

}

},

“APIProfile”:”Main”,

“Description”:” This AIF is used to enable a human to converse with a machine using Personal Status”,

“Types”:[

{

“Name”:”InputSelection_t”,

“Type”:”{Text_t | Speech_t}”

},

{

“Name”:”Text_t”,

“Type”:”{uint8[] | uint16[]}”

},

{

“Name”:”Speech_t”,

“Type”:”uint16[]”

},

{

“Name”:”Video_t”,

“Type”:” uint24[]”

},

{

“Name”:”3DGraphics_t”,

“Type”:”{uint8[]}”

}

],

“Ports”:[

{

“Name”:”InputText1″,

“Direction”:”InputOutput”,

“RecordType”:”Text_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”InputText2″,

“Direction”:”InputOutput”,

“RecordType”:”Text_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”InputVideo”,

“Direction”:”InputOutput”,

“RecordType”:”Video_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”InputText3″,

“Direction”:”InputOutput”,

“RecordType”:”Text_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”InputAudio”,

“Direction”:”InputOutput”,

“RecordType”:”Speech_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”InputSelection1″,

“Direction”:”InputOutput”,

“RecordType”:”InputSelection_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”InputSelection2″,

“Direction”:”InputOutput”,

“RecordType”:”InputSelection_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”MachineAvatar”,

“Direction”:”OutputInput”,

“RecordType”:”3DGraphics_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”MachineSpeech”,

“Direction”:”OutputInput”,

“RecordType”:”Speech_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”MachineText”,

“Direction”:”OutputInput”,

“RecordType”:”Text_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

}

],

“SubAIMs”:[

{

“Name”:”VisualSceneDescription”,

“Identifier”:{

“ImplementerID”:”/* String assigned by IIDRA */”,

“Specification”:{

“Standard”:”MPAI-MMC”,

“AIW”:”MMC-CPS”,

“AIM”:” VisualSceneDescription”,

“Version”:”2″

}

}

},

{

“Name”:”AudioSceneDescription”,

“Identifier”:{

“ImplementerID”:”/* String assigned by IIDRA */”,

“Specification”:{

“Standard”:”MPAI-MMC”,

“AIW”:”MMC-CPS”,

“AIM”:” AudioSceneDescription”,

“Version”:”2″

}

}

},

{

“Name”:”SpatialObjectIdentification”,

“Identifier”:{

“ImplementerID”:”/* String assigned by IIDRA */”,

“Specification”:{

“Standard”:”MPAI-MMC”,

“AIW”:”MMC-CPS”,

“AIM”:” SpatialObjectIdentification”,

“Version”:”2″

}

}

},

{

“Name”:”SpeechRecognition”,

“Identifier”:{

“ImplementerID”:”/* String assigned by IIDRA */”,

“Specification”:{

“Standard”:”MPAI-MMC”,

“AIW”:”MMC-CPS”,

“AIM”:”SpeechRecognition”,

“Version”:”2″

}

}

},

{

“Name”:”LanguageUnderstanding”,

“Identifier”:{

“ImplementerID”:”/* String assigned by IIDRA */”,

“Specification”:{

“Standard”:”MPAI-MMC”,

“AIW”:”MMC-CPS”,

“AIM”:”LanguageUnderstanding”,

“Version”:”2″

}

}

},

{

“Name”:”PersonalStatusExtraction”,

“Identifier”:{

“ImplementerID”:”/* String assigned by IIDRA */”,

“Specification”:{

“Standard”:”MPAI-MMC”,

“AIW”:”MMC-CPS”,

“AIM”:”PersonalStatusExtraction”,

“Version”:”2″

}

}

},

{

“Name”:”DialogueProcessing”,

“Identifier”:{

“ImplementerID”:”/* String assigned by IIDRA */”,

“Specification”:{

“Standard”:”MPAI-MMC”,

“AIW”:”MMC-CWE”,

“AIM”:”DialogueProcessing”,

“Version”:”2″

}

}

},

{

“Name”:”PersonalStatusDisplay”,

“Identifier”:{

“ImplementerID”:”/* String assigned by IIDRA */”,

“Specification”:{

“Standard”:”MPAI-MMC”,

“AIW”:”MMC-CPS”,

“AIM”:”PersonalStatusDisplay”,

“Version”:”2″

}

}

}

],

“Topology”:[

{

“Output”:{

“AIMName”:””,

“PortName”:”InputVideo”

},

“Input”:{

“AIMName”:”VisualSceneDescription”,

“PortName”:”InputVideo”

}

},

{

“Output”:{

“AIMName”:””,

“PortName”:”InputAudio”

},

“Input”:{

“AIMName”:”AudioSceneDescription”,

“PortName”:”InputAudio”

}

},

{

“Output”:{

“AIMName”:”VisualSceneDescription”,

“PortName”:”BodyDescriptors”

},

“Input”:{

“AIMName”:”SpatialObjectIdentification”,

“PortName”:”BodyDescriptors”

}

},

{

“Output”:{

“AIMName”:”VisualSceneDescription”,

“PortName”:”VisualSceneGeometry”

},

“Input”:{

“AIMName”:”SpatialObjectIdentification”,

“PortName”:”VisualSceneGeometry”

}

},

{

“Output”:{

“AIMName”:”VisualSceneDescription”,

“PortName”:”PhysicalObject”

},

“Input”:{

“AIMName”:”SpatialObjectIdentification”,

“PortName”:”PhysicalObject”

}

},

{

“Output”:{

“AIMName”:”SpatialObjectIdentification”,

“PortName”:”PhysicalObjectID”

},

“Input”:{

“AIMName”:”DialogueProcessing”,

“PortName”:”PhysicalObjectID”

}

},

{

“Output”:{

“AIMName”:””,

“PortName”:”InputText3″

},

“Input”:{

“AIMName”:”LanguageUnderstanding”,

“PortName”:”InputText3″

}

},

{

“Output”:{

“AIMName”:””,

“PortName”:”InputSpeech2″

},

“Input”:{

“AIMName”:”SpeechRecognition”,

“PortName”:”InputSpeech2″

}

},

{

“Output”:{

“AIMName”:”SpeechRecognition”,

“PortName”:”RecognisedText”

},

“Input”:{

“AIMName”:”LanguageUnderstanding”,

“PortName”:”RecognisedText”

}

},

{

“Output”:{

“AIMName”:””,

“PortName”:”InputSelection1″

},

“Input”:{

“AIMName”:”LanguageUnderstanding”,

“PortName”:”InputSelection1″

}

},

{

“Output”:{

“AIMName”:””,

“PortName”:”InputText2″

},

“Input”:{

“AIMName”:”PersonalStatusExtraction”,

“PortName”:”InputText2″

}

},

{

“Output”:{

“AIMName”:”VisualSceneDescription”,

“PortName”:”BodyDescriptors”

},

“Input”:{

“AIMName”:”PersonalStatusExtraction”,

“PortName”:”BodyDescriptors”

}

},

{

“Output”:{

“AIMName”:”VisualSceneDescription”,

“PortName”:”FaceDescriptors”

},

“Input”:{

“AIMName”:”PersonalStatusExtraction”,

“PortName”:”FaceDescriptors”

}

},

{

“Output”:{

“AIMName”:”LanguageUnderstanding”,

“PortName”:”Meaning”

},

“Input”:{

“AIMName”:”PersonalStatusExtraction”,

“PortName”:”Meaning”

}

},


{

“Output”:{

“AIMName”:”AudioSceneDescription”,

“PortName”:”InputSpeech1″

},

“Input”:{

“AIMName”:”PersonalStatusExtraction”,

“PortName”:”InputSpeech1″

}

},

{

“Output”:{

“AIMName”:””,

“PortName”:”InputText1″

},

“Input”:{

“AIMName”:”DialogueProcessing”,

“PortName”:”InputText1″

}

},

{

“Output”:{

“AIMName”:”PersonalStatusExtraction”,

“PortName”:”InputPersonalStatus”

},

“Input”:{

“AIMName”:”DialogueProcessing”,

“PortName”:”InputPersonalStatus”

}

},

{

“Output”:{

“AIMName”:”LanguageUnderstanding”,

“PortName”:”Meaning”

},

“Input”:{

“AIMName”:”DialogueProcessing”,

“PortName”:”Meaning”

}

},

{

“Output”:{

“AIMName”:”LanguageUnderstanding”,

“PortName”:”RefinedText”

},

“Input”:{

“AIMName”:”DialogueProcessing”,

“PortName”:”RefinedText”

}

},

{

“Output”:{

“AIMName”:”DialogueProcessing”,

“PortName”:”MachinePersonalStatus”

},

“Input”:{

“AIMName”:”PersonalStatusDisplay”,

“PortName”:”MachinePersonalStatus”

}

},

{

“Output”:{

“AIMName”:”DialogueProcessing”,

“PortName”:”MachineText”

},

“Input”:{

“AIMName”:”PersonalStatusDisplay”,

“PortName”:”MachineText”

}

},

{

“Output”:{

“AIMName”:”PersonalStatusDisplay”,

“PortName”:”MachineAvatar”

},

“Input”:{

“AIMName”:””,

“PortName”:”MachineAvatar”

}

},

{

“Output”:{

“AIMName”:”PersonalStatusDisplay”,

“PortName”:”MachineSpeech”

},

“Input”:{

“AIMName”:””,

“PortName”:”MachineSpeech”

}

},

{

“Output”:{

“AIMName”:”PersonalStatusDisplay”,

“PortName”:”MachineText”

},

“Input”:{

“AIMName”:””,

“PortName”:”MachineText”

}

}

],

“Implementations”:[

{

“BinaryName”:”mmccps.exe”,

“Architecture”:”x64″,

“OperatingSystem”:”Windows”,

“Version”:”v0.1″,

“Source”:”MPAIStore”,

“Destination”:””

}

],

“ResourcePolicies”:[

{

“Name”:”Memory”,

“Minimum”:”50000″,

“Maximum”:”100000″,

“Request”:”75000″

},

{

“Name”:”CPUNumber”,

“Minimum”:”1″,

“Maximum”:”2″,

“Request”:”1″

},

{

“Name”:”CPU:Class”,

“Minimum”:”Low”,

“Maximum”:”High”,

“Request”:”Medium”

},

{

“Name”:”GPU:CUDA:FrameBuffer”,

“Minimum”:”11GB_GDDR5X”,

“Maximum”:”8GB_GDDR6X”,

“Request”:”11GB_GDDR6″

},

{

“Name”:”GPU:CUDA:MemorySpeed”,

“Minimum”:”1.60GHz”,

“Maximum”:”1.77GHz”,

“Request”:”1.71GHz”

},

{

“Name”:”GPU:CUDA:Class”,

“Minimum”:”SM61″,

“Maximum”:”SM86″,

“Request”:”SM75″

},

{

“Name”:”GPU:Number”,

“Minimum”:”1″,

“Maximum”:”1″,

“Request”:”1″

}

],

“Documentation”:[

{

“Type”:”tutorial”,

“URI”:”https://mpai.community/standards/mpai-mmc/”

}

]

}
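
The Topology array above can be processed mechanically. The following informative Python sketch loads the AIW metadata (assumed here to be saved as mmc-cps-aiw.json) and lists its connections as producer-port/consumer-port pairs; an empty AIMName denotes an input or output port of the AIW itself.

import json

# Informative sketch: print the Topology of the AIW metadata as edges of the
# AIM graph. The file name is an assumption; an empty AIMName denotes a port
# of the AIW itself.
with open("mmc-cps-aiw.json", encoding="utf-8") as f:
    metadata = json.load(f)

for connection in metadata.get("Topology", []):
    src = connection["Output"]
    dst = connection["Input"]
    print("{}.{} -> {}.{}".format(
        src["AIMName"] or "AIW", src["PortName"],
        dst["AIMName"] or "AIW", dst["PortName"]))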

2          AIM metadata for CPS

2.1        Visual Scene Description

{

“Identifier”:{

“ImplementerID”:”/* String assigned by IIDRA */”,

“Specification”:{

“Name”:”MPAI-MMC”,

“AIW”:”MMC-CPS”,

“AIM”:”VisualSceneDescription”,

“Version”:”2″

},

“Description”:”This AIM implements the visual scene description function for MMC-CPS.”,

“Types”:[

{

“Name”:”Video_t”,

“Type”:”uint24[]”

},

{

“Name”:”VisualSceneGeometry_t”,

“Type”:”{uint8[]}”

},

{

“Name”:”FaceDescriptors_t”,

“Type”:”{uint8[]}”

},

 

{

“Name”:”BodyDescriptors_t”,

“Type”:”{uint8[]}”

},

{

“Name”:”PhysicalObject_t”,

“Type”:”{uint8[]}”

}

],

“Ports”:[

{

“Name”:”InputVideo”,

“Direction”:”InputOutput”,

“RecordType”:”Video_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”BodyDescriptors”,

“Direction”:”OutputInput”,

“RecordType”:”BodyDescriptors_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”FaceDescriptors”,

“Direction”:”OutputInput”,

“RecordType”:”FaceDescriptors_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”PhysicalObject”,

“Direction”:”OutputInput”,

“RecordType”:”PhysicalObject_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

}

],

“SubAIMs”:[

 

],

“Topology”:[

 

],

“Implementations”:[

 

],

“Documentation”:[

{

“Type”:”Tutorial”,

“URI”:”https://mpai.community/standards/mpai-mmc/”

}

]

}

}

2.2        Audio Scene Description

{

“Identifier”:{

“ImplementerID”:”/* String assigned by IIDRA */”,

“Specification”:{

“Name”:”MPAI-MMC”,

“AIW”:”MMC-CPS”,

“AIM”:”AudioSceneDescription”,

“Version”:”2″

},

“Description”:”This AIM implements the audio scene description function for MMC-CPS.”,

“Types”:[

{

“Name”:”Audio_t”,

“Type”:”uint16[]”

},

{

“Name”: “Array_Audio_t”,

“Type”: “Audio_t[]”

},

{

“Name”:”Speech_t”,

“Type”:”uint16[]”

}

],

“Ports”:[

{

“Name”:”InputAudio”,

“Direction”:”InputOutput”,

“RecordType”:”Array_Audio_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”Speech”,

“Direction”:”OutputInput”,

“RecordType”:”Speech_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

}

],

“SubAIMs”:[

 

],

“Topology”:[

 

],

“Implementations”:[

 

],

“Documentation”:[

{

“Type”:”Tutorial”,

“URI”:”https://mpai.community/standards/mpai-mmc/”

}

]

}

}

2.3        SpatialObjectIdentification

{

“Identifier”:{

“ImplementerID”:”/* String assigned by IIDRA */”,

“Specification”:{

“Name”:”MPAI-MMC”,

“AIW”:”MMC-CPS”,

“AIM”:”SpatialObjectIdentification”,

“Version”:”1″

},

“Description”:”This AIM identifies the Physical Object indicated by the finger of a human.”,

“Types”:[

{

“Name”:”BodyDescriptors_t”,

“Type”:”uint16[]”

},

{

“Name”:”VisualSceneGeometry_t”,

“Type”:”{uint8[]}”

},

{

“Name”:”PhysicalObject_t”,

“Type”:”{uint8[]}”

},

{

“Name”:”PhysicalObjectID_t”,

“Type”:”{string objectImageLabel; float32 confidenceLevel}”

}

],

“Ports”:[

{

“Name”:”BodyDescriptors”,

“Direction”:”InputOutput”,

“RecordType”:”BodyDescriptors_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”VisualSceneGeometry”,

“Direction”:”InputOutput”,

“RecordType”:”VisualSceneGeometry_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”PhysicalObjects”,

“Direction”:”InputOutput”,

“RecordType”:”PhysicalObject_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”PhysicalObjectID”,

“Direction”:”OutputInput”,

“RecordType”:”Instance_t[]”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

}

],

“SubAIMs”:[

 

],

“Topology”:[

 

],

“Implementations”:[

 

],

“Documentation”:[

{

“Type”:”Tutorial”,

“URI”:”https://mpai.community/standards/mpai-mmc/”

}

]

}

}

2.4        SpeechRecognition

{

“Identifier”:{

“ImplementerID”:”/* String assigned by IIDRA */”,

“Specification”:{

“Name”:”MPAI-MMC”,

“AIW”:”MMC-CPS”,

“AIM”:”SpeechRecognition”,

“Version”:”2″

},

“Description”:”This AIM implements the speech recognition function for MMC-CPS”,

“Types”:[

{

“Name”:”Speech_t”,

“Type”:”uint16[]”

},

{

“Name”:”Text_t”,

“Type”:”{uint8[] | uint16[]}”

}

],

“Ports”:[

{

“Name”:”Speech”,

“Direction”:”InputOutput”,

“RecordType”:”Speech_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”RecognisedText”,

“Direction”:”OutputInput”,

“RecordType”:”Text_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

}

],

“SubAIMs”:[

 

],

“Topology”:[

 

],

“Implementations”:[

 

],

“Documentation”:[

{

“Type”:”Tutorial”,

“URI”:”https://mpai.community/standards/mpai-mmc/”

}

]

}

}

2.5        Language Understanding

{

“Identifier”:{

“ImplementerID”:”/* String assigned by IIDRA */”,

“Specification”:{

“Name”:”MPAI-MMC”,

“AIW”:”MMC-CPS”,

“AIM”:”LanguageUnderstanding”,

“Version”:”2″

},

“Description”:”This AIM implements language understanding function for MMC-CPS.”,

“Types”:[

{

“Name”:”PhysicalObjectID_t”,

“Type”:”{string objectImageLabel; float32 confidenceLevel}”

},

{

“Name”:”Text_t”,

“Type”:”uint8[]”

},

{

“Name”:”Selection_t”,

“Type”:”{Text_t | Speech_t}”

},

{

“Name”:”Tagging_t”,

“Type”:”{string<256 set; string<256 result}”

},

{

“Name”:”Meaning_t”,

“Type”:”{Tagging_t POS_tagging; Tagging_t NE_tagging; Tagging_t dependency_tagging; Tagging_t SRL_tagging}”

}

],

“Ports”:[

{

“Name”:”PhysicalObjectID”,

“Direction”:”InputOutput”,

“RecordType”:”Instance_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”InputText3″,

“Direction”:”InputOutput”,

“RecordType”:”Text_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”RecognisedText”,

“Direction”:”InputOutput”,

“RecordType”:”Text_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”InputSelection1″,

“Direction”:”InputOutput”,

“RecordType”:”Selection_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”Meaning1″,

“Direction”:”OutputInput”,

“RecordType”:”Meaning_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”Meaning2″,

“Direction”:”OutputInput”,

“RecordType”:”Meaning_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”RefinedText”,

“Direction”:”OutputInput”,

“RecordType”:”Text_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

}

],

“SubAIMs”:[

 

],

“Topology”:[

 

],

“Implementations”:[

 

],

“Documentation”:[

{

“Type”:”Tutorial”,

“URI”:”https://mpai.community/standards/mpai-mmc/”

}

]

}

}

 

2.6        PersonalStatusExtraction

{

“Identifier”:{

“ImplementerID”:”/* String assigned by IIDRA */”,

“Specification”:{

“Name”:”MPAI-MMC”,

“AIW”:”MMC-CPS”,

“AIM”:”PersonalStatusExtraction”,

“Version”:”2″

},

“Description”:”This AIM extracts the combined Personal Status from Text, Speech, Face, and Gesture.”,

“Types”:[

{

“Name”:”Text_t”,

“Type”:”{uint8[] | uint16[]}”

},

{

“Name”:”BodyDescriptors_t”,

“Type”:”uint8[]”

},

{

“Name”:”FaceDescriptors_t”,

“Type”:”uint8[]”

},

{

“Name”:”Tagging_t”,

“Type”:”{string<256 set; string<256 result}”

},

{

“Name”:”Meaning_t”,

“Type”:”{Tagging_t POS_tagging; Tagging_t NE_tagging; Tagging_t dependency_tagging; Tagging_t SRL_tagging}”

},

{

“Name”:”Speech_t”,

“Type”:”{uint16[]}”

},

{

“Name”:”PersonalStatus_t”,

“Type”:”uint8[]”

}

],

“Ports”:[

{

“Name”:”InputText2″,

“Direction”:”InputOutput”,

“RecordType”:”Text_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”BodyDescriptors”,

“Direction”:”InputOutput”,

“RecordType”:”BodyDescriptors_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”FaceDescriptors”,

“Direction”:”InputOutput”,

“RecordType”:”FaceDescriptors_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”Meaning”,

“Direction”:”OutputInput”,

“RecordType”:”Meaning_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”InputSpeech1″,

“Direction”:”InputOutput”,

“RecordType”:”Speech_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”PersonalStatus”,

“Direction”:”OutputInput”,

“RecordType”:”PersonalStatus_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

}

],

“SubAIMs”:[

 

],

“Topology”:[

 

],

“Implementations”:[

 

],

“Documentation”:[

{

“Type”:”Tutorial”,

“URI”:”https://mpai.community/standards/mpai-mmc/”

}

]

}

}

2.7        DialogueProcessing

{

“Identifier”:{

“ImplementerID”:”/* String assigned by IIDRA */”,

“Specification”:{

“Name”:”MPAI-MMC”,

“AIW”:”MMC-CPS”,

“AIM”:”DialogueProcessing”,

“Version”:”2″

},

“Description”:”This AIM produces the Machine’s Text and Personal Status from the human’s Text and Personal Status.”,

“Types”:[

{

“Name”:”Text_t”,

“Type”:”{uint8[] | uint16[]}”

},

{

“Name”:”PersonalStatus_t”,

“Type”:”uint8[]”

},

{

“Name”:”Tagging_t”,

“Type”:”{string<256 set; string<256 result}”

},

{

“Name”:”Meaning_t”,

“Type”:”{Tagging_t POS_tagging; Tagging_t NE_tagging; Tagging_t dependency_tagging; Tagging_t SRL_tagging}”

},


{

“Name”:”Selection_t”,

“Type”:”{Text_t | Speech_t}”

}

],

“Ports”:[

{

“Name”:”InputText1″,

“Direction”:”InputOutput”,

“RecordType”:”Text_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”InputPersonalStatus”,

“Direction”:”InputOutput”,

“RecordType”:”PersonalStatus_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”Meaning”,

“Direction”:”InputOutput”,

“RecordType”:”Meaning_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”RefinedText”,

“Direction”:”InputOutput”,

“RecordType”:”Text_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”MachinePersonalStatus”,

“Direction”:”OutputInput”,

“RecordType”:”PersonalStatus_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”MachineText”,

“Direction”:”OutputInput”,

“RecordType”:”Text_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

}

],

“SubAIMs”:[

 

],

“Topology”:[

 

],

“Implementations”:[

 

],

“Documentation”:[

{

“Type”:”Tutorial”,

“URI”:”https://mpai.community/standards/mpai-mmc/”

}

]

}

}

2.8        PersonalStatusDisplay

{

“Identifier”:{

“ImplementerID”:”/* String assigned by IIDRA */”,

“Specification”:{

“Name”:”MPAI-MMC”,

“AIW”:”MMC-CPS”,

“AIM”:”PersonalStatusDisplay”,

“Version”:”2″

},

“Description”:”This AIM renders a speaking avatar from Machine Text and Machine Personal Status.”,

“Types”:[

{

“Name”:”PersonalStatus_t”,

“Type”:”{uint8[]}”

},

{

“Name”:”Text_t”,

“Type”:”{uint8[] | uint16[]}”

},

{

“Name”:”Speech_t”,

“Type”:”uint16[]”

},

{

“Name”:”3DGraphics_t”,

“Type”:”uint8[]”

}

],

“Ports”:[

{

“Name”:”MachinePersonalStatus”,

“Direction”:”InputOutput”,

“RecordType”:”PersonalStatus_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”MachineText”,

“Direction”:”InputOutput”,

“RecordType”:”Text_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”MachineAvatar”,

“Direction”:”OutputInput”,

“RecordType”:”3DGraphics_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”MachineSpeech”,

“Direction”:”OutputInput”,

“RecordType”:”Speech_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”MachineText”,

“Direction”:”OutputInput”,

“RecordType”:”Text_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

}

],

“SubAIMs”:[

 

],

“Topology”:[

 

],

“Implementations”:[

 

],

“Documentation”:[

{

“Type”:”Tutorial”,

“URI”:”https://mpai.community/standards/mpai-mmc/”

}

]

}

}

 

 

 

 

 

  • AIW and AIM Metadata of MMC-CWE

1          AIW metadata for CWE

{

“$schema”:”https://json-schema.org/draft/2020-12/schema”,

“$id”:”https://mpai.community/standards/resources/MPAI-AIF/V1/AIW-AIM-metadata.schema.json”,

“title”:”CWE AIF v1 AIW/AIM metadata”,

“Identifier”:{

“ImplementerID”:”/* String assigned by IIDRA */”,

“Specification”:{

“Standard”:”MPAI-MMC”,

“AIW”:”MMC-CWE”,

“AIM”:”MMC-CWE”,

“Version”:”2″

}

},

“APIProfile”:”Basic”,

“Description”:” This AIF is used to call the AIW of CWE”,

“Types”:[

{

“Name”:”Speech_t”,

“Type”:”uint16[]”

},

{

“Name”:”Text_t”,

“Type”:”{uint8[] | uint16[]}”

},

{

“Name”:”Video_t”,

“Type”:” uint24[]”

},

{

“Name”:”Selection_t”,

“Type”:”{Text_t | Speech_t}”

}

],

“Ports”:[

{

“Name”:”InputSpeech1″,

“Direction”:”InputOutput”,

“RecordType”:”Speech_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”InputText1″,

“Direction”:”InputOutput”,

“RecordType”:”Text_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”InputSpeech2″,

“Direction”:”InputOutput”,

“RecordType”:”Speech_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”InputVideo”,

“Direction”:”InputOutput”,

“RecordType”:”Video_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”InputSelection1″,

“Direction”:”InputOutput”,

“RecordType”:”Selection_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”MachineText1″,

“Direction”:”OutputInput”,

“RecordType”:”Text_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”InputSelection2″,

“Direction”:”InputOutput”,

“RecordType”:”Selection_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”MachineText”,

“Direction”:”OutputInput”,

“RecordType”:”Text_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”MachineSpeech”,

“Direction”:”OutputInput”,

“RecordType”:”Speech_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”MachineVideo”,

“Direction”:”OutputInput”,

“RecordType”:”Video_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

}

],

“SubAIMs”:[

{

“Name”:”SpeechRecognition”,

“Identifier”:{

“ImplementerID”:”/* String assigned by IIDRA */”,

“Specification”:{

“Standard”:”MPAI-MMC”,

“AIW”:”MMC-CWE”,

“AIM”:”SpeechRecognition”,

“Version”:”1″

}

}

},

{

“Name”:”VisualSceneDescription”,

“Identifier”:{

“ImplementerID”:”/* String assigned by IIDRA */”,

“Specification”:{

“Standard”:”MPAI-MMC”,

“AIW”:”MMC-CWE”,

“AIM”:”VisualSceneDescription”,

“Version”:”1″

}

}

},

{

“Name”:”LanguageUnderstanding”,

“Identifier”:{

“ImplementerID”:”/* String assigned by IIDRA */”,

“Specification”:{

“Standard”:”MPAI-MMC”,

“AIW”:”MMC-CWE”,

“AIM”:”LanguageUnderstanding”,

“Version”:”1″

}

}

},

{

“Name”:”PersonalStatusExtraction”,

“Identifier”:{

“ImplementerID”:”/* String assigned by IIDRA */”,

“Specification”:{

“Standard”:”MPAI-MMC”,

“AIW”:”MMC-CWE”,

“AIM”:”PersonalStatusExtraction”,

“Version”:”1″

}

}

},

{

“Name”:”DialogueProcessing”,

“Identifier”:{

“ImplementerID”:”/* String assigned by IIDRA */”,

“Specification”:{

“Standard”:”MPAI-MMC”,

“AIW”:”MMC-CWE”,

“AIM”:”DialogueProcessing”,

“Version”:”1″

}

}

},

{

“Name”:”SpeechSynthesisEmotion”,

“Identifier”:{

“ImplementerID”:”/* String assigned by IIDRA */”,

“Specification”:{

“Standard”:”MPAI-MMC”,

“AIW”:”MMC-CWE”,

“AIM”:”SpeechSynthesisEmotion”,

“Version”:”1″

}

}

},

{

“Name”:”LipsAnimation”,

“Identifier”:{

“ImplementerID”:”/* String assigned by IIDRA */”,

“Specification”:{

“Standard”:”MPAI-MMC”,

“AIW”:”MMC-CWE”,

“AIM”:”LipsAnimation”,

“Version”:”1″

}

}

}

],

“Topology”:[

{

“Output”:{

“AIMName”:””,

“PortName”:”InputSpeech1″

},

“Input”:{

“AIMName”:”SpeechRecognition”,

“PortName”:”InputSpeech1″

}

},

{

“Output”:{

“AIMName”:””,

“PortName”:”InputVideo”

},

“Input”:{

“AIMName”:”VisualSceneDescription”,

“PortName”:”InputVideo”

}

},

{

“Output”:{

“AIMName”:”SpeechRecognition”,

“PortName”:”RecognisedText”

},

“Input”:{

“AIMName”:”LanguageUnderstanding”,

“PortName”:”RecognisedText”

}

},

{

“Output”:{

“AIMName”:””,

“PortName”:”InputText1″

},

“Input”:{

“AIMName”:”LanguageUnderstanding”,

“PortName”:”InputText1″

}

},

{

“Output”:{

“AIMName”:”LanguageUnderstanding”,

“PortName”:”Meaning”

},

“Input”:{

“AIMName”:”PersonalStatusExtraction”,

“PortName”:”Meaning”

}

},

{

“Output”:{

“AIMName”:”LanguageUnderstanding”,

“PortName”:”RefinedText”

},

“Input”:{

“AIMName”:”PersonalStatusExtraction”,

“PortName”:”RefinedText”

}

},

{

“Output”:{

“AIMName”:””,

“PortName”:”InputSpeech2″

},

“Input”:{

“AIMName”:”PersonalStatusExtraction”,

“PortName”:”InputSpeech2″

}

},

{

“Output”:{

“AIMName”:”VisualSceneDescription”,

“PortName”:”FaceDescriptors”

},

“Input”:{

“AIMName”:”PersonalStatusExtraction”,

“PortName”:”FaceDescriptors”

}

},

{

“Output”:{

“AIMName”:””,

“PortName”:”InputSelection1″

},

“Input”:{

“AIMName”:”PersonalStatusExtraction”,

“PortName”:”InputSelection1″

}

},

{

“Output”:{

“AIMName”:””,

“PortName”:”InputText2″

},

“Input”:{

“AIMName”:”PersonalStatusExtraction”,

“PortName”:”InputText2″

}

},

{

“Output”:{

“AIMName”:”LanguageUnderstanding”,

“PortName”:”Meaning”

},

“Input”:{

“AIMName”:”DialogueProcessing”,

“PortName”:”Meaning”

}

},

{

“Output”:{

“AIMName”:”LanguageUnderstanding”,

“PortName”:”RefinedText”

},

“Input”:{

“AIMName”:”DialogueProcessing”,

“PortName”:”RefinedText”

}

},

{

“Output”:{

“AIMName”:”PersonalStatusExtraction”,

“PortName”:”InputPersonalStatus”

},

“Input”:{

“AIMName”:”DialogueProcessing”,

“PortName”:”InputPersonalStatus”

}

},

{

“Output”:{

“AIMName”:””,

“PortName”:”InputSelection2″

},

“Input”:{

“AIMName”:”DialogueProcessing”,

“PortName”:”InputSelection2″

}

},

{

“Output”:{

“AIMName”:””,

“PortName”:”InputText3″

},

“Input”:{

“AIMName”:”DialogueProcessing”,

“PortName”:”InputText3″

}

},

{

“Output”:{

“AIMName”:”DialogueProcessing”,

“PortName”:”MachineText1″

},

“Input”:{

“AIMName”:””,

“PortName”:”MachineText1″

}

},

{

“Output”:{

“AIMName”:”DialogueProcessing”,

“PortName”:”MachineText2″

},

“Input”:{

“AIMName”:”SpeechSynthesisEmotion”,

“PortName”:”MachineText2″

}

},

{

“Output”:{

“AIMName”:”SpeechSynthesisEmotion”,

“PortName”:”MachineSpeech1″

},

“Input”:{

“AIMName”:””,

“PortName”:”MachineSpeech1″

}

},

{

“Output”:{

“AIMName”:”SpeechSynthesisEmotion”,

“PortName”:”MachineSpeech2″

},

“Input”:{

“AIMName”:”LipsAnimation”,

“PortName”:”MachineSpeech2″

}

},

{

“Output”:{

“AIMName”:”DialogueProcessing”,

“PortName”:”MachinePersonalStatus”

},

“Input”:{

“AIMName”:”LipsAnimation”,

“PortName”:”MachinePersonalStatus”

}

},

{

“Output”:{

“AIMName”:”LipsAnimation”,

“PortName”:”MachineFace”

},

“Input”:{

“AIMName”:””,

“PortName”:”MachineFace”

}

}

],

“Implementations”:[

{

“BinaryName”:”mmccwe.exe”,

“Architecture”:”x64″,

“OperatingSystem”:”Windows”,

“Version”:”v0.1″,

“Source”:”AIMStorage”,

“Destination”:””

}

],

“ResourcePolicies”:[

{

“Name”:”Memory”,

“Minimum”:”50000″,

“Maximum”:”100000″,

“Request”:”75000″

},

{

“Name”:”CPUNumber”,

“Minimum”:”1″,

“Maximum”:”2″,

“Request”:”1″

},

{

“Name”:”CPU:Class”,

“Minimum”:”Low”,

“Maximum”:”High”,

“Request”:”Medium”

},

{

“Name”:”GPU:CUDA:FrameBuffer”,

“Minimum”:”11GB_GDDR5X”,

“Maximum”:”8GB_GDDR6X”,

“Request”:”11GB_GDDR6″

},

{

“Name”:”GPU:CUDA:MemorySpeed”,

“Minimum”:”1.60GHz”,

“Maximum”:”1.77GHz”,

“Request”:”1.71GHz”

},

{

“Name”:”GPU:CUDA:Class”,

“Minimum”:”SM61″,

“Maximum”:”SM86″,

“Request”:”SM75″

},

{

“Name”:”GPU:Number”,

“Minimum”:”1″,

“Maximum”:”1″,

“Request”:”1″

}

],

“Documentation”:[

{

“Type”:”tutorial”,

“URI”:”https://mpai.community/standards/mpai-mmc/”

}

]

}
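The Topology array above describes the directed connections between AIM ports, an empty AIMName denoting a port of the AIW itself. The following informative Python sketch shows how an implementation might turn such a Topology into a connection map; the file name mmc-cwe-aiw.json and the function names are illustrative and not part of this Technical Specification.

# Informative sketch: build a connection map from an AIW "Topology" array.
import json
from collections import defaultdict

def load_topology(path):
    with open(path, encoding="utf-8") as f:
        return json.load(f)["Topology"]

def connection_map(topology):
    """Map (source AIM, port) -> list of (destination AIM, port).
    An empty AIMName denotes a port of the AIW itself."""
    edges = defaultdict(list)
    for link in topology:
        src = (link["Output"]["AIMName"] or "<AIW>", link["Output"]["PortName"])
        dst = (link["Input"]["AIMName"] or "<AIW>", link["Input"]["PortName"])
        edges[src].append(dst)
    return edges

if __name__ == "__main__":
    for src, dsts in connection_map(load_topology("mmc-cwe-aiw.json")).items():
        print(src, "->", dsts)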

 

2          AIM metadata

2.1        SpeechRecognition

{

“Identifier”:{

“ImplementerID”:”/* String assigned by IIDRA */”,

“Specification”:{

“Name”:”MPAI-MMC”,

“AIW”:”MMC-CWE”,

“AIM”:”SpeechRecognition”,

“Version”:”2″

},

“Description”:”This AIM implements speech recognition function for MMC-CWE.”,

“Types”:[

{

“Name”:”Speech_t”,

“Type”:”uint16[]”

},

{

“Name”:”Text_t”,

“Type”:”{uint8[] | uint16[]}”

}

],

“Ports”:[

{

“Name”:”InputSpeech”,

“Direction”:”InputOutput”,

“RecordType”:”Speech_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”RecognisedText”,

“Direction”:”OutputInput”,

“RecordType”:”Text_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

}

],

“SubAIMs”:[

 

],

“Topology”:[

 

],

“Implementations”:[

 

],

“Documentation”:[

{

“Type”:”Tutorial”,

“URI”:”https://mpai.community/standards/mpai-mmc/”

}

]

}

}

2.2        Visual Scene Description

{

“Identifier”:{

“ImplementerID”:”/* String assigned by IIDRA */”,

“Specification”:{

“Name”:”MPAI-MMC”,

“AIW”:”MMC-CWE”,

“AIM”:”VisualSceneDescription”,

“Version”:”2″

},

“Description”:”This AIM describes the visual scene in MMC-CWE as Face Descriptors.”,

“Types”:[

{

“Name”:”Video_t”,

“Type”:”uint32[]”

},

{

“Name”:”FaceDescriptors_t”,

“Type”:”uint8[]”

}

],

“Ports”:[

{

“Name”:”InputVideo”,

“Direction”:”InputOutput”,

“RecordType”:”Video_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”FaceDescriptors”,

“Direction”:”OutputInput”,

“RecordType”:”FaceDescriptors_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

}

],

“SubAIMs”:[

 

],

“Topology”:[

 

],

“Implementations”:[

 

],

“Documentation”:[

{

“Type”:”Tutorial”,

“URI”:”https://mpai.community/standards/mpai-mmc/”

}

]

}

}

2.3        Language Understanding

{

“Identifier”:{

“ImplementerID”:”/* String assigned by IIDRA */”,

“Specification”:{

“Name”:”MPAI-MMC”,

“AIW”:”MMC-CWE”,

“AIM”:”LanguageUnderstanding”,

“Version”:”2″

},

“Description”:”This AIM implements language understanding function for MMC-CWE.”,

“Types”:[

{

“Name”:”Text_t”,

“Type”:”{uint8[] | uint16[]}”

},

{

“Name”:”Meaning_t”,

“Type”:”{Tagging_t POS_tagging; Tagging_t NE_tagging; Tagging_t dependency_tagging; Tagging_t SRL_tagging}”

}

],

“Ports”:[

{

“Name”:”RecognisedText”,

“Direction”:”InputOutput”,

“RecordType”:”Text_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”Meaning1″,

“Direction”:”OutputInput”,

“RecordType”:”Meaning_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”RefinedText”,

“Direction”:”OutputInput”,

“RecordType”:”Text_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”Meaning2″,

“Direction”:”OutputInput”,

“RecordType”:”Meaning_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

}

],

“SubAIMs”:[

 

],

“Topology”:[

 

],

“Implementations”:[

 

],

“Documentation”:[

{

“Type”:”Tutorial”,

“URI”:”https://mpai.community/standards/mpai-mmc/”

}

]

}

}

 

2.4        PersonalStatusExtraction

{

“Identifier”:{

“ImplementerID”:”/* String assigned by IIDRA */”,

“Specification”:{

“Name”:”MPAI-MMC”,

“AIW”:”MMC-CWE”,

“AIM”:”PersonalStatusExtraction”,

“Version”:”2″

},

“Description”:”This AIM extracts and combines Personal Status from Text, Speech, and Face.”,

“Types”:[

{

“Name”:”Text_t”,

“Type”:”{uint8[] | uint16[]}”

},

{

“Name”:”Tagging_t”,

“Type”:”{string<256 set; string<256 result}”

},

{

“Name”:”Meaning_t”,

“Type”:”{Tagging_t POS_tagging; Tagging_t NE_tagging; Tagging_t dependency_tagging; Tagging_t SRL_tagging}”

},

{

“Name”:”Speech_t”,

“Type”:”{uint16[]}”

},

{

“Name”:”FaceDescriptors_t”,

“Type”:”uint8[]”

},

{

“Name”:”Selection_t”,

“Type”:”{Text_t | Speech_t}”

},

{

“Name”:”PersonalStatus_t”,

“Type”:”uint8[]”

}

],

“Ports”:[

{

“Name”:”RefinedText2″,

“Direction”:”InputOutput”,

“RecordType”:”Text_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”Meaning2″,

“Direction”:”InputOutput”,

“RecordType”:”Meaning_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”InputSpeech2″,

“Direction”:”InputOutput”,

“RecordType”:”Speech_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”FaceDescriptors”,

“Direction”:”InputOutput”,

“RecordType”:”FaceDescriptors_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”InputSelection1″,

“Direction”:”OutputInput”,

“RecordType”:”Selection_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”InputText1″,

“Direction”:”InputOutput”,

“RecordType”:”Text_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”PersonalStatus”,

“Direction”:”OutputInput”,

“RecordType”:”PersonalStatus_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

}

],

“SubAIMs”:[

 

],

“Topology”:[

 

],

“Implementations”:[

 

],

“Documentation”:[

{

“Type”:”Tutorial”,

“URI”:”https://mpai.community/standards/mpai-mmc/”

}

]

}

}
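In the Ports arrays above, ports with Direction "InputOutput" receive data and ports with Direction "OutputInput" emit data. The following informative Python sketch partitions the Ports of an AIM record according to that convention; the function name is illustrative.

# Informative sketch: split an AIM record's Ports into inputs and outputs,
# following the Direction convention used in the metadata of this document.
def split_ports(aim_record):
    inputs = [p["Name"] for p in aim_record["Ports"] if p["Direction"] == "InputOutput"]
    outputs = [p["Name"] for p in aim_record["Ports"] if p["Direction"] == "OutputInput"]
    return inputs, outputs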

2.5        Dialogue Processing

{

“Identifier”:{

“ImplementerID”:”/* String assigned by IIDRA */”,

“Specification”:{

“Name”:”MPAI-MMC”,

“AIW”:”MMC-CWE”,

“AIM”:”DialogueProcessing”,

“Version”:”1″

},

“Description”:”This AIM implements Dialog Processing for MMC-CWE.”,

“Types”:[

{

“Name”:”Meaning_t”,

“Type”:”{Tagging_t POS_tagging; Tagging_t NE_tagging; Tagging_t dependency_tagging; Tagging_t SRL_tagging}”

},

{

“Name”:”Text_t”,

“Type”:”{uint8[] | uint16[]}”

},

{

“Name”:”PersonalStatus_t”,

“Type”:”uint8[]”

}

 

],

“Ports”:[

{

“Name”:”Meaning1″,

“Direction”:”InputOutput”,

“RecordType”:”Meaning_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”RefinedText1″,

“Direction”:”InputOutput”,

“RecordType”:”Text_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”InputPersonalStatus”,

“Direction”:”InputOutput”,

“RecordType”:”PersonalStatus_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”InputSelection2″,

“Direction”:”OutputInput”,

“RecordType”:”Selection_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”InputText2″,

“Direction”:”InputOutput”,

“RecordType”:”Text_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”MachineText1″,

“Direction”:”OutputInput”,

“RecordType”:”Text_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”MachineText2″,

“Direction”:”OutputInput”,

“RecordType”:”Text_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”MachinePersonalStatus1″,

“Direction”:”OutputInput”,

“RecordType”:”PersonalStatus_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”MachinePersonalStatus2″,

“Direction”:”OutputInput”,

“RecordType”:”PersonalStatus_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

}

],

“SubAIMs”:[

 

],

“Topology”:[

 

],

“Implementations”:[

 

],

“Documentation”:[

{

“Type”:”Tutorial”,

“URI”:”https://mpai.community/standards/mpai-mmc/”

}

]

}

2.6        SpeechSynthesisEmotion

{

“Identifier”:{

“ImplementerID”:”/* String assigned by IIDRA */”,

“Specification”:{

“Name”:”MPAI-MMC”,

“AIW”:”MMC-CWE”,

“AIM”:”SpeechSynthesisEmotion”,

“Version”:”2″

},

“Description”:”This AIM implements speech synthesis with emotion function for MMC-CWE.”,

“Types”:[

{

“Name”:”Text_t”,

“Type”:”{uint8[] | uint16[]}”

},

{

“Name”:”PersonalStatus_t”,

“Type”:”uint8[]”

},

{

“Name”:”Speech_t”,

“Type”:”uint16[]”

}

],

“Ports”:[

{

“Name”:”MachineText2″,

“Direction”:”InputOutput”,

“RecordType”:”Text_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”MachinePersonalStatus1″,

“Direction”:”InputOutput”,

“RecordType”:”PersonalStatus_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”MachineSpeech1″,

“Direction”:”OutputInput”,

“RecordType”:”Speech_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”MachineSpeech2″,

“Direction”:”OutputInput”,

“RecordType”:”Speech_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

}

],

“SubAIMs”:[

 

],

“Topology”:[

 

],

“Implementations”:[

 

],

“Documentation”:[

{

“Type”:”Tutorial”,

“URI”:”https://mpai.community/standards/mpai-mmc/”

}

]

}

 

2.7        Lips Animation

{

“Identifier”:{

“ImplementerID”:”/* String assigned by IIDRA */”,

“Specification”:{

“Name”:”MPAI-MMC”,

“AIW”:”MMC-CWE”,

“AIM”:”LipsAnimation”,

“Version”:”2″

},

“Description”:”This AIM implements lips animation function for MMC-CWE.”,

“Types”:[

{

“Name”:”Speech_t”,

“Type”:”uint16[]”

},

{

“Name”:”PersonalStatus_t”,

“Type”:”uint8[]”

},

{

“Name”:”Video_t”,

“Type”:”uint24[]”

}

],

“Ports”:[

{

“Name”:”MachineSpeech1″,

“Direction”:”InputOutput”,

“RecordType”:”Speech_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”MachinePersonalStatus2″,

“Direction”:”InputOutput”,

“RecordType”:”PersonalStatus_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”FaceKBVideo”,

“Direction”:”InputOutput”,

“RecordType”:”Video_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”MachineFace”,

“Direction”:”OutputInput”,

“RecordType”:”Video_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

}

],

“SubAIMs”:[

 

],

“Topology”:[

 

],

“Implementations”:[

 

],

“Documentation”:[

{

“Type”:”Tutorial”,

“URI”:”https://mpai.community/standards/mpai-mmc/”

}

]

}
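Each AIM record declares in Types the record types referenced by its Ports. The following informative Python sketch checks that every RecordType of a Port is declared in the same record; the function name is illustrative.

# Informative sketch: report Ports whose RecordType is not declared in Types.
def check_record_types(aim_record):
    declared = {t["Name"] for t in aim_record.get("Types", []) if isinstance(t, dict)}
    problems = []
    for port in aim_record.get("Ports", []):
        record_type = port["RecordType"].strip()
        if record_type not in declared:
            problems.append((port["Name"], record_type))
    return problems  # list of (port name, undeclared type) pairs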

 

 

 

  • AIW and AIM Metadata of MMC-MQA

1          AIW metadata for MQA

{

“$schema”:”https://json-schema.org/draft/2020-12/schema”,

“$id”:”https://mpai.community/standards/resources/MPAI-AIF/V1/AIW-AIM-metadata.schema.json”,

“title”:”MQA AIF v1 AIW/AIM metadata”,

“Identifier”:{

“ImplementerID”:”/* String assigned by IIDRA */”,

“Specification”:{

“Standard”:”MPAI-MMC”,

“AIW”:”MMC-MQA”,

“AIM”:”MMC-MQA”,

“Version”:”1″

}

},

“APIProfile”:”Basic”,

“Description”:”This AIF is used to execute the AIW of MQA.”,

“Types”:[

{

“Name”:”Selection_t”,

“Type”:”{Text_t | Speech_t}”

},

{

“Name”:”Text_t”,

“Type”:”{uint8[] | uint16[]}”

},

{

“Name”:”Video_t”,

“Type”:”uint24[]”

},

{

“Name”:”Speech_t”,

“Type”:”uint16[]”

},

{

“Name”:”PhysicalObject_t”,

“Type”:”uint8[]”

},

{

“Name”:”PhysicalObjectIdentifier_t”,

“Type”:”{string objectImageLabel; float32 confidenceLevel}”

},

{

“Name”:”Tagging_t”,

“Type”:”{string<256 set; string<256 result}”

},

{

“Name”:”Meaning_t”,

“Type”:”{Tagging_t POS_tagging; Tagging_t NE_tagging; Tagging_t dependency_tagging; Tagging_t SRL_tagging}”

},

{

“Name”:”Intention_t”,

“Type”:”{string<256 qtopic; string<256 qfocus; string<256 qLAT; string<256 qSAT; string<256 qdomain}”

}

],

“Ports”:[

{

“Name”:”InputSelection1″,

“Direction”:”InputOutput”,

“RecordType”:”Selection_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”InputText1″,

“Direction”:”InputOutput”,

“RecordType”:”Text_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”InputText2″,

“Direction”:”InputOutput”,

“RecordType”:”Text_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”InputSelection2″,

“Direction”:”InputOutput”,

“RecordType”:”Selection_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”InputVideo”,

“Direction”:”InputOutput”,

“RecordType”:”Video_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”InputSpeech”,

“Direction”:”InputOutput”,

“RecordType”:”Speech_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”MachineSpeech”,

“Direction”:”OutputInput”,

“RecordType”:”Speech_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”MachineText”,

“Direction”:”OutputInput”,

“RecordType”:”Text_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

}

],

“SubAIMs”:[

{

“Name”:”VisualSceneDescription”,

“Identifier”:{

“ImplementerID”:”/* String assigned by IIDRA */”,

“Specification”:{

“Standard”:”MPAI-MMC”,

“AIW”:”MMC-MQA”,

“AIM”:”VisualSceneDescription”,

“Version”:”1″

}

}

},

{

“Name”:”PhysicalObjectIdentification”,

“Identifier”:{

“ImplementerID”:”/* String assigned by IIDRA */”,

“Specification”:{

“Standard”:”MPAI-MMC”,

“AIW”:”MMC-MQA”,

“AIM”:”PhysicalObjectIdentification”,

“Version”:”2″

}

}

},

{

“Name”:”SpeechRecognition”,

“Identifier”:{

“ImplementerID”:”/* String assigned by IIDRA */”,

“Specification”:{

“Standard”:”MPAI-MMC”,

“AIW”:”MMC-MQA”,

“AIM”:”SpeechRecognition”,

“Version”:”2″

}

}

},

{

“Name”:”LanguageUnderstanding”,

“Identifier”:{

“ImplementerID”:”/* String assigned by IIDRA */”,

“Specification”:{

“Standard”:”MPAI-MMC”,

“AIW”:”MMC-MQA”,

“AIM”:”LanguageUnderstanding”,

“Version”:”2″

}

}

},

{

“Name”:”QuestionAnalysis”,

“Identifier”:{

“ImplementerID”:”/* String assigned by IIDRA */”,

“Specification”:{

“Standard”:”MPAI-MMC”,

“AIW”:”MMC-MQA”,

“AIM”:”QuestionAnalysis”,

“Version”:”2″

}

}

},

{

“Name”:”QuestionAnswering”,

“Identifier”:{

“ImplementerID”:”/* String assigned by IIDRA */”,

“Specification”:{

“Standard”:”MPAI-MMC”,

“AIW”:”MMC-MQA”,

“AIM”:”QuestionAnswering”,

“Version”:”1″

}

}

},

{

“Name”:”SpeechSynthesisText”,

“Identifier”:{

“ImplementerID”:”/* String assigned by IIDRA */”,

“Specification”:{

“Standard”:”MPAI-MMC”,

“AIW”:”MMC-MQA”,

“AIM”:”SpeechSynthesisText”,

“Version”:”1″

}

}

}

],

“Topology”:[

{

“Output”:{

“AIMName”:””,

“PortName”:”InputVideo”

},

“Input”:{

“AIMName”:”VisualSceneDescription”,

“PortName”:”InputVideo”

}

},

{

“Output”:{

“AIMName”:”VisualSceneDescription”,

“PortName”:”PhysicalObject”

},

“Input”:{

“AIMName”:”PhysicalObjectIdentification”,

“PortName”:”PhysicalObject”

}

},

{

“Output”:{

“AIMName”:””,

“PortName”:”InputSpeech”

},

“Input”:{

“AIMName”:”SpeechRecognition”,

“PortName”:”InputSpeech”

}

},

{

“Output”:{

“AIMName”:””,

“PortName”:”InputText2″

},

“Input”:{

“AIMName”:”LanguageUnderstanding”,

“PortName”:”InputText2″

}

},

{

“Output”:{

“AIMName”:””,

“PortName”:”InputSelection2″

},

“Input”:{

“AIMName”:”LanguageUnderstanding”,

“PortName”:”InputSelection2″

}

},

{

“Output”:{

“AIMName”:”PhysicalObjectIdentification”,

“PortName”:”PhysicalObjectIdentifier”

},

“Input”:{

“AIMName”:”LanguageUnderstanding”,

“PortName”:”PhysicalObjectIdentifier”

}

},

{

“Output”:{

“AIMName”:”SpeechRecognition”,

“PortName”:”RecognisedText”

},

“Input”:{

“AIMName”:”LanguageUnderstanding”,

“PortName”:”RecognisedText”

}

},

{

“Output”:{

“AIMName”:”LanguageUnderstanding”,

“PortName”:”Meaning”

},

“Input”:{

“AIMName”:”QuestionAnalysis”,

“PortName”:”Meaning”

}

},

{

“Output”:{

“AIMName”:””,

“PortName”:”InputSelection”

},

“Input”:{

“AIMName”:”QuestionAnswering”,

“PortName”:”InputSelection”

}

},

{

“Output”:{

“AIMName”:””,

“PortName”:”InputText1″

},

“Input”:{

“AIMName”:”QuestionAnswering”,

“PortName”:”InputText1″

}

},

{

“Output”:{

“AIMName”:”LanguageUnderstanding”,

“PortName”:”RefinedText”

},

“Input”:{

“AIMName”:”QuestionAnswering”,

“PortName”:”RefinedText”

}

},

{

“Output”:{

“AIMName”:”LanguageUnderstanding”,

“PortName”:”Meaning”

},

“Input”:{

“AIMName”:”QuestionAnswering”,

“PortName”:”Meaning”

}

},

{

“Output”:{

“AIMName”:”QuestionAnalysis”,

“PortName”:”Intention”

},

“Input”:{

“AIMName”:”QuestionAnswering”,

“PortName”:”Intention”

}

},

{

“Output”:{

“AIMName”:”QuestionAnswering”,

“PortName”:”MachineText1″

},

“Input”:{

“AIMName”:”SpeechSynthesisText”,

“PortName”:”MachineText1″

}

},

{

“Output”:{

“AIMName”:”QuestionAnswering”,

“PortName”:”MachineText2″

},

“Input”:{

“AIMName”:””,

“PortName”:”MachineText2″

}

}

],

“Implementations”:[

{

“BinaryName”:”mmcmqa.exe”,

“Architecture”:”x64″,

“OperatingSystem”:”Windows”,

“Version”:”v0.1″,

“Source”:”AIMStorage”,

“Destination”:””

}

],

“ResourcePolicies”:[

{

“Name”:”Memory”,

“Minimum”:”50000″,

“Maximum”:”100000″,

“Request”:”75000″

},

{

“Name”:”CPUNumber”,

“Minimum”:”1″,

“Maximum”:”2″,

“Request”:”1″

},

{

“Name”:”CPU:Class”,

“Minimum”:”Low”,

“Maximum”:”High”,

“Request”:”Medium”

},

{

“Name”:”GPU:CUDA:FrameBuffer”,

“Minimum”:”11GB_GDDR5X”,

“Maximum”:”8GB_GDDR6X”,

“Request”:”11GB_GDDR6″

},

{

“Name”:”GPU:CUDA:MemorySpeed”,

“Minimum”:”1.60GHz”,

“Maximum”:”1.77GHz”,

“Request”:”1.71GHz”

},

{

“Name”:”GPU:CUDA:Class”,

“Minimum”:”SM61″,

“Maximum”:”SM86″,

“Request”:”SM75″

},

{

“Name”:”GPU:Number”,

“Minimum”:”1″,

“Maximum”:”1″,

“Request”:”1″

}

],

“Documentation”:[

{

“Type”:”tutorial”,

“URI”:”https://mpai.community/standards/mpai-mmc/”

}

]

}

}
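The metadata above references a JSON Schema through $schema and $id. The following informative Python sketch validates an AIW metadata file against a simplified in-line stand-in for that schema, since the normative schema is the one published at the $id location and is not reproduced here; the third-party jsonschema package and the minimal schema are assumptions of the sketch.

# Informative sketch: validate AIW metadata against a minimal structural schema.
# Requires the third-party "jsonschema" package; the real schema is the one
# referenced by "$id" above.
import json
import jsonschema

MINIMAL_AIW_SCHEMA = {
    "type": "object",
    "required": ["Identifier", "Ports", "SubAIMs", "Topology"],
    "properties": {
        "Identifier": {"type": "object"},
        "Ports": {"type": "array"},
        "SubAIMs": {"type": "array"},
        "Topology": {"type": "array"},
    },
}

def validate_aiw(path):
    with open(path, encoding="utf-8") as f:
        metadata = json.load(f)
    jsonschema.validate(instance=metadata, schema=MINIMAL_AIW_SCHEMA)  # raises on failure
    return metadata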

2          AIM metadata

2.1        VisualSceneDescription

{

“Identifier”:{

“ImplementerID”:”/* String assigned by IIDRA */”,

“Specification”:{

“Name”:”MPAI-MMC”,

“AIW”:”MMC-MQA”,

“AIM”:”VisualSceneDescription”,

“Version”:”2″

},

“Description”:”This AIM describes the visual scene for MMC-MQA providing one physical object.”,

“Types”:[

{

“Name”:”Video_t”,

“Type”:”uint32[]”

},

{

“Name”:”PhysicalObject_t”,

“Type”:”uint8[]”

}

],

“Ports”:[

{

“Name”:”InputVideo”,

“Direction”:”InputOutput”,

“RecordType”:”Video_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”PhysicalObject”,

“Direction”:”OutputInput”,

“RecordType”:”PhysicalObject_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

}

],

“SubAIMs”:[

 

],

“Topology”:[

 

],

“Implementations”:[

 

],

“Documentation”:[

{

“Type”:”Tutorial”,

“URI”:”https://mpai.community/standards/mpai-mmc/”

}

]

}

2.2        PhysicalObjectIdentification

{

“Identifier”:{

“ImplementerID”:”/* String assigned by IIDRA */”,

“Specification”:{

“Name”:”MPAI-MMC”,

“AIW”:”MMC-MQA”,

“AIM”:”PhysicalObjectIdentification”,

“Version”:”2″

},

“Description”:”This AIM identifies a physical object.”,

“Types”:[

{

“Name”:”Video_t”,

“Type”:”uint32[]”

},

{

“Name”:”PhysicalObjectIdentifier_t”,

“Type”:”{string objectImageLabel; float32 confidenceLevel}”

}

],

“Ports”:[

{

“Name”:”InputVideo”,

“Direction”:”InputOutput”,

“RecordType”:”Video_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”PhysicalObjectIdentifier”,

“Direction”:”OutputInput”,

“RecordType”:”PhysicalObjectIdentifier_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

}

],

“SubAIMs”:[

 

],

“Topology”:[

 

],

“Implementations”:[

 

],

“Documentation”:[

{

“Type”:”Tutorial”,

“URI”:”https://mpai.community/standards/mpai-mmc/”

}

]

}

2.3        SpeechRecognition

{

“Identifier”:{

“ImplementerID”:”/* String assigned by IIDRA */”,

“Specification”:{

“Name”:”MPAI-MMC”,

“AIW”:”MMC-MQA”,

“AIM”:”SpeechRecognition”,

“Version”:”2″

},

“Description”:”This AIM implements speech recognition function for MMC-MQA that converts a speech object to text.”,

“Types”:[

{

“Name”:”Speech_t”,

“Type”:”uint16[]”

},

{

“Name”:”Text_t”,

“Type”:”{uint8[] | uint16[]}”

}

],

“Ports”:[

{

“Name”:”InputSpeech”,

“Direction”:”InputOutput”,

“RecordType”:”Speech_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”RecognisedText”,

“Direction”:”OutputInput”,

“RecordType”:”Text_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

}

],

“SubAIMs”:[

 

],

“Topology”:[

 

],

“Implementations”:[

 

],

“Documentation”:[

{

“Type”:”Tutorial”,

“URI”:”https://mpai.community/standards/mpai-mmc/”

}

]

}

2.4        Language Understanding

{

“Identifier”:{

“ImplementerID”:”/* String assigned by IIDRA */”,

“Specification”:{

“Name”:”MPAI-MMC”,

“AIW”:”MMC-MQA”,

“AIM”:”LanguageUnderstanding”,

“Version”:”2″

},

“Description”:”This AIM implements language understanding function for MMC-MQA.”,

“Types”:[

{

“Name”:”Text_t”,

“Type”:”{uint8[] | uint16[]}”

},

{

“Name”:”Selection_t”,

“Type”:”{Text_t | Speech_t}”

},

{

“Name”:”ObjectIdentifier_t”,

“Type”:”{string objectImageLabel; float32 confidenceLevel}”

},

{

“Name”:”Meaning_t”,

“Type”:”{Tagging_t POS_tagging; Tagging_t NE_tagging; Tagging_t dependency_tagging; Tagging_t SRL_tagging}”

}

],

“Ports”:[

{

“Name”:”InputText2″,

“Direction”:”InputOutput”,

“RecordType”:”Text_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”InputSelection2″,

“Direction”:”InputOutput”,

“RecordType”:”Selection_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”PhysicalObjectIdentifier”,

“Direction”:”InputOutput”,

“RecordType”:”ObjectIdentifier_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”RecognisedText”,

“Direction”:”InputOutput”,

“RecordType”:”Text_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”RefinedText”,

“Direction”:”OutputInput”,

“RecordType”:”Text_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”Meaning1″,

“Direction”:”OutputInput”,

“RecordType”:”Meaning_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”Meaning2″,

“Direction”:”OutputInput”,

“RecordType”:”Meaning_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

}

],

“SubAIMs”:[

 

],

“Topology”:[

 

],

“Implementations”:[

 

],

“Documentation”:[

{

“Type”:”tutorial”,

“URI”:”https://mpai.community/standards/mpai-mmc/”

}

]

}

}

 

2.5        Question Analysis

{

“Identifier”:{

“ImplementerID”:”/* String assigned by IIDRA */”,

“Specification”:{

“Name”:”MPAI-MMC”,

“AIW”:”MMC-MQA”,

“AIM”:”QuestionAnalysis”,

“Version”:”2″

},

“Description”:”This AIM implements the question analysis function for MMC-MQA.”,

“Types”:[

{

“Name”:”Meaning_t”,

“Type”:”{Tagging_t POS_tagging; Tagging_t NE_tagging; Tagging_t dependency_tagging; Tagging_t SRL_tagging}”

},

{

“Name”:”Intention_t”,

“Type”:”{string<256 qtopic; string<256 qfocus; string<256 qLAT; string<256 qSAT; string<256 qdomain}”

}

],

“Ports”:[

{

“Name”:”Meaning_2″,

“Direction”:”InputOutput”,

“RecordType”:”Meaning_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”Intention”,

“Direction”:”OutputInput”,

“RecordType”:”Intention_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

}

],

“SubAIMs”:[

 

],

“Topology”:[

 

],

“Implementations”:[

 

],

“Documentation”:[

{

“Type”:”tutorial”,

“URI”:”https://mpai.community/standards/mpai-mmc/”

}

]

}

}
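The Intention_t record produced by Question Analysis carries the five string fields declared above. The following informative Python sketch mirrors that record as a dataclass for use in tests; the class name is illustrative.

# Informative sketch: a Python mirror of the Intention_t record declared above.
from dataclasses import dataclass

@dataclass
class Intention:
    qtopic: str
    qfocus: str
    qLAT: str
    qSAT: str
    qdomain: str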

 

2.6        Question Answering

{

“Identifier”:{

“ImplementerID”:”/* String assigned by IIDRA */”,

“Specification”:{

“Name”:”MPAI-MMC”,

“AIW”:”MMC-MQA”,

“AIM”:”QuestionAnswering”,

“Version”:”2″

},

“Description”:”This AIM implements question answering function for MMC-MQA.”,

“Types”:[

{

“Name”:”Selection_t”,

“Type”:”{Text_t | Speech_t}”

},

{

“Name”:”Text_t”,

“Type”:”{uint8[] | uint16[]}”

},

{

“Name”:”Meaning_t”,

“Type”:”{Tagging_t POS_tagging; Tagging_t NE_tagging; Tagging_t dependency_tagging; Tagging_t SRL_tagging}”

},

{

“Name”:”Intention_t”,

“Type”:”{string<256 qtopic; string<256 qfocus; string<256 qLAT; string<256 qSAT; string<256 qdomain}”

}

],

“Ports”:[

{

“Name”:”InputSelection1″,

“Direction”:”InputOutput”,

“RecordType”:”Selection_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”InputText1″,

“Direction”:”InputOutput”,

“RecordType”:”Text_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”RefinedText”,

“Direction”:”InputOutput”,

“RecordType”:”Text_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”Meaning_1″,

“Direction”:”InputOutput”,

“RecordType”:”Meaning_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”Intention”,

“Direction”:”InputOutput”,

“RecordType”:”Intention_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”MachineText1″,

“Direction”:”OutputInput”,

“RecordType”:”Text_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”MachineText2″,

“Direction”:”OutputInput”,

“RecordType”:”Text_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

}

],

“SubAIMs”:[

 

],

“Topology”:[

 

],

“Implementations”:[

 

],

“Documentation”:[

{

“Type”:”Tutorial”,

“URI”:”https://mpai.community/standards/mpai-mmc/”

}

]

}

}

 

2.7        SpeechSynthesisText

{

“Identifier”:{

“ImplementerID”:”/* String assigned by IIDRA */”,

“Specification”:{

“Name”:”MPAI-MMC”,

“AIW”:”MMC-MQA”,

“AIM”:”SpeechSynthesisText”,

“Version”:”2″

},

“Description”:”This AIM implements speech synthesis function for MMC-MQA.”,

“Types”:[

{

“Name”:”Text_t”,

“Type”:”{uint8[] | uint16[]}”

},

{

“Name”:”Speech_t”,

“Type”:”uint16[]”

}

],

“Ports”:[

{

“Name”:”MachineText1″,

“Direction”:”InputOutput”,

“RecordType”:”Text_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”MachineSpeech”,

“Direction”:”OutputInput”,

“RecordType”:”Speech_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

}

],

“SubAIMs”:[

 

],

“Topology”:[

 

],

“Implementations”:[

 

],

“Documentation”:[

{

“Type”:”Tutorial”,

“URI”:”https://mpai.community/standards/mpai-mmc/”

}

]

}

}
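The Type strings used throughout these Types arrays follow a compact C-like notation, e.g. "uint16[]" or "{uint8[] | uint16[]}" for a choice of encodings. The following informative Python sketch merely splits such a declaration into its alternatives; it does not define a normative grammar.

# Informative sketch: list the alternatives of a compact Type declaration.
def type_alternatives(type_string):
    s = type_string.strip()
    if s.startswith("{") and s.endswith("}"):
        s = s[1:-1]
    return [alt.strip() for alt in s.split("|")]

# Example: type_alternatives("{uint8[] | uint16[]}") == ["uint8[]", "uint16[]"]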

 

 

 

 

 

 

 

  • AIW and AIM Metadata of MMC-CAS

1.        AIW metadata for MMC-CAS

{

“$schema”:”https://json-schema.org/draft/2020-12/schema”,

“$id”:”https://mpai.community/standards/resources/MPAI-AIF/V2/AIW-AIM-metadata.schema.json”,

“title”:”CAS AIF V2 AIW/AIM metadata”,

“Identifier”:{

“ImplementerID”:”/* String assigned by IIDRA */”,

“Specification”:{

“Standard”:”MPAI-MMC”,

“AIW”:”MMC-CAS”,

“AIM”:”MMC-CAS”,

“Version”:”2″

}

},

“APIProfile”:”Basic”,

“Description”:”This AIF is used to execute the AIW that enables a human to converse with a machine about objects in an environment.”,

“Types”:[

{

“Name”:”PointOfView_t”,

“Type”:”{float32[3] Position; float32[3] Orientation}”

},

{

“Name”:”Speech_t”,

“Type”:”uint16[]”

},

{

“Name”:”Video_t”,

“Type”:”uint24[]”

},

{

“Name”:”3DGraphics_t”,

“Type”:”{uint8[]}”

},

{

“Name”:”Text_t”,

“Type”:”{uint8[] | uint16[]}”

}

],

“Ports”:[

{

“Name”:”PointOfView”,

“Direction”:”InputOutput”,

“RecordType”:”PointOfView_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”InputSpeech1″,

“Direction”:”InputOutput”,

“RecordType”:”Speech_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”InputVideo”,

“Direction”:”InputOutput”,

“RecordType”:”Video_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”InputSpeech2″,

“Direction”:”InputOutput”,

“RecordType”:”Speech_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”RenderedScene”,

“Direction”:”OutputInput”,

“RecordType”:”3DGraphics_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”MachineText1″,

“Direction”:”OutputInput”,

“RecordType”:”Text_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”MachineAvatar”,

“Direction”:”OutputInput”,

“RecordType”:”3DGraphics_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”MachineSpeech”,

“Direction”:”OutputInput”,

“RecordType”:”Speech_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

}

],

“SubAIMs”:[

{

“Name”:”VisualSceneDescription”,

“Identifier”:{

“ImplementerID”:”/* String assigned by IIDRA */”,

“Specification”:{

“Standard”:”MPAI-MMC”,

“AIW”:”MMC-CAS”,

“AIM”:”VisualSceneDescription”,

“Version”:”1″

}

}

},

{

“Name”:”SpatialObjectIdentification”,

“Identifier”:{

“ImplementerID”:”/* String assigned by IIDRA */”,

“Specification”:{

“Standard”:”MPAI-MMC”,

“AIW”:”MMC-CAS”,

“AIM”:”SpatialObjectIdentification”,

“Version”:”2″

}

}

},

{

“Name”:”SpeechRecognition”,

“Identifier”:{

“ImplementerID”:”/* String assigned by IIDRA */”,

“Specification”:{

“Standard”:”MPAI-MMC”,

“AIW”:”MMC-CAS”,

“AIM”:”SpeechRecognition”,

“Version”:”2″

}

}

},

{

“Name”:”LanguageUnderstanding”,

“Identifier”:{

“ImplementerID”:”/* String assigned by IIDRA */”,

“Specification”:{

“Standard”:”MPAI-MMC”,

“AIW”:”MMC-CAS”,

“AIM”:”LanguageUnderstanding”,

“Version”:”2″

}

}

},

{

“Name”:”PersonalStatusExtraction”,

“Identifier”:{

“ImplementerID”:”/* String assigned by IIDRA */”,

“Specification”:{

“Standard”:”MPAI-MMC”,

“AIW”:”MMC-CAS”,

“AIM”:”PersonalStatusExtraction”,

“Version”:”2″

}

}

},

{

“Name”:”DialogueProcessing”,

“Identifier”:{

“ImplementerID”:”/* String assigned by IIDRA */”,

“Specification”:{

“Standard”:”MPAI-MMC”,

“AIW”:”MMC-CAS”,

“AIM”:”DialogueProcessing”,

“Version”:”2″

}

}

},

{

“Name”:”ScenePresentation”,

“Identifier”:{

“ImplementerID”:”/* String assigned by IIDRA */”,

“Specification”:{

“Standard”:”MPAI-MMC”,

“AIW”:”MMC-CAS”,

“AIM”:”ScenePresentation”,

“Version”:”2″

}

}

},

{

“Name”:”PersonalStatusDisplay”,

“Identifier”:{

“ImplementerID”:”/* String assigned by IIDRA */”,

“Specification”:{

“Standard”:”MPAI-MMC”,

“AIW”:”MMC-CAS”,

“AIM”:”PersonalStatusDisplay”,

“Version”:”2″

}

}

}

],

“Topology”:[

{

“Output”:{

“AIMName”:””,

“PortName”:”InputVideo”

},

“Input”:{

“AIMName”:”VisualSceneDescription”,

“PortName”:”InputVideo”

}

},

{

“Output”:{

“AIMName”:””,

“PortName”:”InputSpeech2″

},

“Input”:{

“AIMName”:”SpeechRecognition”,

“PortName”:”InputSpeech2″

}

},

{

“Output”:{

“AIMName”:”VisualSceneDescription”,

“PortName”:”BodyDescriptors2″

},

“Input”:{

“AIMName”:”SpatialObjectIdentification”,

“PortName”:”BodyDescriptors2″

}

},

{

“Output”:{

“AIMName”:”VisualSceneDescription”,

“PortName”:”VisualSceneGeometry”

},

“Input”:{

“AIMName”:”SpatialObjectIdentification”,

“PortName”:”VisualSceneGeometry”

}

},

{

“Output”:{

“AIMName”:”VisualSceneDescription”,

“PortName”:”PhysicalObject”

},

“Input”:{

“AIMName”:”SpatialObjectIdentification”,

“PortName”:”PhysicalObject”

}

},

{

“Output”:{

“AIMName”:”SpatialObjectIdentification”,

“PortName”:”ObjectID”

},

“Input”:{

“AIMName”:”LanguageUnderstanding”,

“PortName”:”ObjectID”

}

},

{

“Output”:{

“AIMName”:”SpeechRecognition”,

“PortName”:”RecognisedText”

},

“Input”:{

“AIMName”:”LanguageUnderstanding”,

“PortName”:”RecognisedText”

}

},

{

“Output”:{

“AIMName”:””,

“PortName”:”InputSpeech1″

},

“Input”:{

“AIMName”:”PersonalStatusExtraction”,

“PortName”:”InputSpeech1″

}

},

{

“Output”:{

“AIMName”:”VisualSceneDescription”,

“PortName”:”BodyDescriptors1″

},

“Input”:{

“AIMName”:”PersonalStatusExtraction”,

“PortName”:”BodyDescriptors1″

}

},

{

“Output”:{

“AIMName”:”VisualSceneDescription”,

“PortName”:”FaceDescriptors”

},

“Input”:{

“AIMName”:”PersonalStatusExtraction”,

“PortName”:”FaceDescriptors”

}

},

{

“Output”:{

“AIMName”:”LanguageUnderstanding”,

“PortName”:”Meaning1″

},

“Input”:{

“AIMName”:”PersonalStatusExtraction”,

“PortName”:”Meaning1″

}

},

{

“Output”:{

“AIMName”:”LanguageUnderstanding”,

“PortName”:”Meaning2″

},

“Input”:{

“AIMName”:”PersonalStatusExtraction”,

“PortName”:”Meaning2″

}

},

{

“Output”:{

“AIMName”:”LanguageUnderstanding”,

“PortName”:”RefinedText”

},

“Input”:{

“AIMName”:”PersonalStatusExtraction”,

“PortName”:”RefinedText”

}

},


{

“Output”:{

“AIMName”:”PersonalStatusExtraction”,

“PortName”:”InputPersonalStatus”

},

“Input”:{

“AIMName”:”DialogueProcessing”,

“PortName”:”InputPersonalStatus”

}

},

{

“Output”:{

“AIMName”:”LanguageUnderstanding”,

“PortName”:”Meaning2″

},

“Input”:{

“AIMName”:”DialogueProcessing”,

“PortName”:”Meaning2″

}

},

{

“Output”:{

“AIMName”:”LanguageUnderstanding”,

“PortName”:”RefinedText”

},

“Input”:{

“AIMName”:”DialogueProcessing”,

“PortName”:”RefinedText”

}

},

{

“Output”:{

“AIMName”:””,

“PortName”:”PointOfView”

},

“Input”:{

“AIMName”:”ScenePresentation”,

“PortName”:”PointOfView”

}

},

{

“Output”:{

“AIMName”:”DialogueProcessing”,

“PortName”:”MachineText”

},

“Input”:{

“AIMName”:””,

“PortName”:”MachineText”

}

},

{

“Output”:{

“AIMName”:”DialogueProcessing”,

“PortName”:”MachinePersonalStatus”

},

“Input”:{

“AIMName”:”PersonalStatusDisplay”,

“PortName”:”MachinePersonalStatus”

}

},

{

“Output”:{

“AIMName”:”DialogueProcessing”,

“PortName”:”MachineText”

},

“Input”:{

“AIMName”:”PersonalStatusDisplay”,

“PortName”:”MachineText”

}

},

{

“Output”:{

“AIMName”:”ScenePresentation”,

“PortName”:”RenderedScene”

},

“Input”:{

“AIMName”:””,

“PortName”:”RenderedScene”

}

},

{

“Output”:{

“AIMName”:”PersonalStatusDisplay”,

“PortName”:”MachineAvatar”

},

“Input”:{

“AIMName”:””,

“PortName”:”MachineAvatar”

}

},

{

“Output”:{

“AIMName”:”PersonalStatusDisplay”,

“PortName”:”MachineSpeech”

},

“Input”:{

“AIMName”:””,

“PortName”:”MachineSpeech”

}

}

],

“Implementations”:[

{

“BinaryName”:”mmccas.exe”,

“Architecture”:”x64″,

“OperatingSystem”:”Windows”,

“Version”:”v0.1″,

“Source”:”MPAIStore”,

“Destination”:””

}

],

“ResourcePolicies”:[

{

“Name”:”Memory”,

“Minimum”:”50000″,

“Maximum”:”100000″,

“Request”:”75000″

},

{

“Name”:”CPUNumber”,

“Minimum”:”1″,

“Maximum”:”2″,

“Request”:”1″

},

{

“Name”:”CPU:Class”,

“Minimum”:”Low”,

“Maximum”:”High”,

“Request”:”Medium”

},

{

“Name”:”GPU:CUDA:FrameBuffer”,

“Minimum”:”11GB_GDDR5X”,

“Maximum”:”8GB_GDDR6X”,

“Request”:”11GB_GDDR6″

},

{

“Name”:”GPU:CUDA:MemorySpeed”,

“Minimum”:”1.60GHz”,

“Maximum”:”1.77GHz”,

“Request”:”1.71GHz”

},

{

“Name”:”GPU:CUDA:Class”,

“Minimum”:”SM61″,

“Maximum”:”SM86″,

“Request”:”SM75″

},

{

“Name”:”GPU:Number”,

“Minimum”:”1″,

“Maximum”:”1″,

“Request”:”1″

}

],

“Documentation”:[

{

“Type”:”tutorial”,

“URI”:”https://mpai.community/standards/mpai-mmc/”

}

]

}
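Every non-empty AIMName used in the Topology above is expected to match the Name of a Sub-AIM. The following informative Python sketch lists Topology references that have no matching SubAIMs entry; the function name is illustrative.

# Informative sketch: report Topology references to AIMs not declared in SubAIMs.
def undeclared_aims(aiw_record):
    declared = {sub["Name"] for sub in aiw_record.get("SubAIMs", []) if isinstance(sub, dict)}
    missing = set()
    for link in aiw_record.get("Topology", []):
        for end in ("Output", "Input"):
            name = link[end]["AIMName"].strip()
            if name and name not in declared:  # empty AIMName denotes the AIW itself
                missing.add(name)
    return sorted(missing)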

2.        AIM metadata for MMC-CAS

2.1        Visual Scene Description

{

“Identifier”:{

“ImplementerID”:”/* String assigned by IIDRA */”,

“Specification”:{

“Name”:”MPAI-MMC”,

“AIW”:”MMC-CAS”,

“AIM”:”VisualSceneDescription”,

“Version”:”2″

},

“Description”:”This AIM implements the visual scene description function for MMC-CAS.”,

“Types”:[

{

“Name”:”Video_t”,

“Type”:”uint24[]”

},

{

“Name”:”VisualSceneDescriptors_t”,

“Type”:”uint8[]”

},

{

“Name”:”BodyDescriptors_t”,

“Type”:”uint8[]”

},

{

“Name”:”FaceDescriptors_t”,

“Type”:”uint8[]”

},

{

“Name”:”VisualSceneGeometry_t”,

“Type”:”uint8[]”

},

{

“Name”:”PhysicalObject_t”,

“Type”:”uint8[]”

}

],

“Ports”:[

{

“Name”:”InputVideo”,

“Direction”:”InputOutput”,

“RecordType”:”Video_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”VisualSceneDescriptors”,

“Direction”:”OutputInput”,

“RecordType”:”VisualSceneDescriptors_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”BodyDescriptors1″,

“Direction”:”OutputInput”,

“RecordType”:”BodyDescriptors_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”FaceDescriptors”,

“Direction”:”OutputInput”,

“RecordType”:”FaceDescriptors_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”BodyDescriptors2″,

“Direction”:”OutputInput”,

“RecordType”:”BodyDescriptors_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”VisualSceneGeometry”,

“Direction”:”OutputInput”,

“RecordType”:”VisualSceneGeometry_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”PhysicalObject”,

“Direction”:”OutputInput”,

“RecordType”:”PhysicalObject_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

}

],

“SubAIMs”:[

 

],

“Topology”:[

 

],

“Implementations”:[

 

],

“Documentation”:[

{

“Type”:”Tutorial”,

“URI”:”https://mpai.community/standards/mpai-mmc/”

}

]

}

}

2.2        SpatialObjectIdentification

{

“Identifier”:{

“ImplementerID”:”/* String assigned by IIDRA */”,

“Specification”:{

“Name”:”MPAI-MMC”,

“AIW”:”MMC-CAS”,

“AIM”:”SpatialObjectIdentification”,

“Version”:”2″

},

“Description”:”This AIM identifies the Physical Object indicated by a human’s finger.”,

“Types”:[

{

“Name”:”BodyDescriptors_t”,

“Type”:”uint8[]”

},

{

“Name”:”VisualSceneGeometry_t”,

“Type”:”{uint8[]}”

},

{

“Name”:”PhysicalObject_t”,

“Type”:”uint8[]”

},

{

“Name”:”PhysicalObjectID_t”,

“Type”:”{string objectImageLabel; float32 confidenceLevel}”

}

],

“Ports”:[

{

“Name”:”BodyDescriptors”,

“Direction”:”InputOutput”,

“RecordType”:”BodyDescriptors_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”VisualSceneGeometry”,

“Direction”:”InputOutput”,

“RecordType”:”VisualSceneGeometry_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”PhysicalObjects”,

“Direction”:”InputOutput”,

“RecordType”:”PhysicalObject_t[]”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”PhysicalObjectID”,

“Direction”:”OutputInput”,

“RecordType”:”PhysicalObjectID_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

}

],

“SubAIMs”:[

 

],

“Topology”:[

 

],

“Implementations”:[

 

],

“Documentation”:[

{

“Type”:”Tutorial”,

“URI”:”https://mpai.community/standards/mpai-mmc/”

}

]

}

}
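The PhysicalObjectID_t record declared above pairs an object label with a confidence level. The following informative Python sketch applies a confidence threshold to such a record; the threshold value is illustrative.

# Informative sketch: accept a PhysicalObjectID_t record only above a confidence threshold.
def accept_identification(physical_object_id, threshold=0.5):
    """physical_object_id: dict with 'objectImageLabel' and 'confidenceLevel'."""
    return physical_object_id["confidenceLevel"] >= threshold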

2.3        SpeechRecognition

{

“Identifier”:{

“ImplementerID”:”/* String assigned by IIDRA */”,

“Specification”:{

“Name”:”MPAI-MMC”,

“AIW”:”MMC-CAS”,

“AIM”:”SpeechRecognition”,

“Version”:”2″

},

“Description”:”This AIM implements the speech recognition function for MMC-CAS: it converts the user’s speech to text.”,

“Types”:[

{

“Name”:”Speech_t”,

“Type”:”uint16[]”

},

{

“Name”:”Text_t”,

“Type”:”{uint8[] | uint16[]}”

}

],

“Ports”:[

{

“Name”:”InputSpeech2″,

“Direction”:”InputOutput”,

“RecordType”:”Speech_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”RecognisedText”,

“Direction”:”OutputInput”,

“RecordType”:”Text_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

}

],

“SubAIMs”:[

 

],

“Topology”:[

 

],

“Implementations”:[

 

],

“Documentation”:[

{

“Type”:”Tutorial”,

“URI”:”https://mpai.community/standards/mpai-mmc/”

}

]

}

}

2.4        LanguageUnderstanding

{

“Identifier”:{

“ImplementerID”:”/* String assigned by IIDRA */”,

“Specification”:{

“Name”:”MPAI-MMC”,

“AIW”:”MMC-CAS”,

“AIM”:”LanguageUnderstanding”,

“Version”:”1″

},

“Description”:”This AIM extracts Meaning from the Recognised Text supplemented by the ID of the Physical Object, and refines the Recognised Text accordingly.”,

“Types”:[

{

“Name”:”PhysicalObject_t”,

“Type”:”{string objectImageLabel; float32 confidenceLevel}”

},

{

“Name”:”Text_t”,

“Type”:”{uint8[] | uint16[]}”

},

{

“Name”:”Tagging_t”,

“Type”:”{string<256 set; string<256 result}”

},

{

“Name”:”Meaning_t”,

“Type”:”{Tagging_t POS_tagging; Tagging_t NE_tagging; Tagging_t dependency_tagging; Tagging_t SRL_tagging}”

}

],

“Ports”:[

{

“Name”:”PhysicalObjectID”,

“Direction”:”InputOutput”,

“RecordType”:”PhysicalObjectID_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”RecognisedText”,

“Direction”:”InputOutput”,

“RecordType”:”Text_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”Meaning1″,

“Direction”:”OutputInput”,

“RecordType”:”Meaning_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”Meaning2″,

“Direction”:”OutputInput”,

“RecordType”:”Meaning_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”RefinedText”,

“Direction”:”OutputInput”,

“RecordType”:”Text_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

}

],

“SubAIMs”:[

 

],

“Topology”:[

 

],

“Implementations”:[

 

],

“Documentation”:[

{

“Type”:”Tutorial”,

“URI”:”https://mpai.community/standards/mpai-mmc/”

}

]

}

}

2.5        PersonalStatusExtraction

{

“Identifier”:{

“ImplementerID”:”/* String assigned by IIDRA */”,

“Specification”:{

“Name”:”MPAI-MMC”,

“AIW”:”MMC-CAS”,

“AIM”:”PersonalStatusExtraction”,

“Version”:”2″

},

“Description”:”This AIM extracts the combined Personal Status from Text, Speech, Face, and Gesture.”,

“Types”:[

{

“Name”:”Speech_t”,

“Type”:”{uint16[]}”

},

{

“Name”:”BodyDescriptors_t”,

“Type”:”uint8[]”

},

{

“Name”:”FaceDescriptors_t”,

“Type”:”uint8[]”

},

{

“Name”:”Tagging_t”,

“Type”:”{string<256 set; string<256 result}”

},

{

“Name”:”Meaning_t”,

“Type”:”{Tagging_t POS_tagging; Tagging_t NE_tagging; Tagging_t dependency_tagging; Tagging_t SRL_tagging}”

},

{

“Name”:”PersonalStatus_t”,

“Type”:”uint8[]”

}

],

“Ports”:[

{

“Name”:”InputSpeech”,

“Direction”:”InputOutput”,

“RecordType”:”Speech_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”BodyDescriptors1″,

“Direction”:”InputOutput”,

“RecordType”:”BodyDescriptors_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”FaceDescriptors”,

“Direction”:”InputOutput”,

“RecordType”:”FaceDescriptors_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”Meaning1″,

“Direction”:”InputOutput”,

“RecordType”:”Meaning_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”PersonalStatus”,

“Direction”:”OutputInput”,

“RecordType”:”PersonalStatus_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

}

],

“SubAIMs”:[

 

],

“Topology”:[

 

],

“Implementations”:[

 

],

“Documentation”:[

{

“Type”:”Tutorial”,

“URI”:”https://mpai.community/standards/mpai-mmc/”

}

]

}

}

2.6        DialogueProcessing

{

“Identifier”:{

“ImplementerID”:”/* String assigned by IIDRA */”,

“Specification”:{

“Name”:”MPAI-MMC”,

“AIW”:”MMC-CAS”,

“AIM”:”DialogueProcessing”,

“Version”:”1″

},

“Description”:”This AIM produces the Machine’s Text and Personal Status from the human’s Text and Personal Status.”,

“Types”:[

{

“Name”:”PersonalStatus_t”,

“Type”:”uint8[]”

},

{

“Name”:”Tagging_t”,

“Type”:”{string<256 set; string<256 result}”

},

{

“Name”:”Meaning_t”,

“Type”:”{Tagging_t POS_tagging; Tagging_t NE_tagging; Tagging_t dependency_tagging; Tagging_t SRL_tagging}”

},

{

“Name”:”Text_t”,

“Type”:”{uint8[] | uint16[]}”

}

],

“Ports”:[

{

“Name”:”PersonalStatus”,

“Direction”:”InputOutput”,

“RecordType”:”PersonalStatus_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”Meaning2″,

“Direction”:”InputOutput”,

“RecordType”:”Meaning_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”Text(LanguageUnderstanding)”,

“Direction”:”InputOutput”,

“RecordType”:”Text_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”MachinePersonalStatus”,

“Direction”:”OutputInput”,

“RecordType”:”PersonalStatus_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”MachineText”,

“Direction”:”OutputInput”,

“RecordType”:”Text_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

}

],

“SubAIMs”:[

 

],

“Topology”:[

 

],

“Implementations”:[

 

],

“Documentation”:[

{

“Type”:”Tutorial”,

“URI”:”https://mpai.community/standards/mpai-mmc/”

}

]

}

}

2.7        ScenePresentation

{

“Identifier”:{

“ImplementerID”:”/* String assigned by IIDRA */”,

“Specification”:{

“Name”:”MPAI-MMC”,

“AIW”:”MMC-CAS”,

“AIM”:”ScenePresentation”,

“Version”:”2″

},

“Description”:”This AIM renders the Visual Scene Descriptors produced by the Visual Scene Description AIM.”,

“Types”:[

{

“Name”:”PointOfView_t”,

“Type”:”{float32[6]}”

},

{

“Name”:”VisualSceneDescriptors_t”,

“Type”:”uint8[]”

},

{

“Name”:”3DGraphics_t”,

“Type”:”uint8[]”

}

],

“Ports”:[

{

“Name”:”PointOfView”,

“Direction”:”InputOutput”,

“RecordType”:”PointOfView_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”VisualSceneDescriptors”,

“Direction”:”InputOutput”,

“RecordType”:”VisualSceneDescriptors_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”RenderedScene”,

“Direction”:”OutputInput”,

“RecordType”:”3DGraphics_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

}

],

“SubAIMs”:[

 

],

“Topology”:[

 

],

“Implementations”:[

 

],

“Documentation”:[

{

“Type”:”Tutorial”,

“URI”:”https://mpai.community/standards/mpai-mmc/”

}

]

}

}

2.8        PersonalStatusDisplay

{

“Identifier”:{

“ImplementerID”:”/* String assigned by IIDRA */”,

“Specification”:{

“Name”:”MPAI-MMC”,

“AIW”:”MMC-CAS”,

“AIM”:”PersonalStatusDisplay”,

“Version”:”2″

},

“Description”:”This AIM renders a speaking avatar from text and Personal Status.”,

“Types”:[

{

“Name”:”PersonalStatus_t”,

“Type”:”uint8[]”

},

{

“Name”:”Text_t”,

“Type”:”{uint8[] | uint16[]}”

},

{

“Name”:”Speech_t”,

“Type”:”uint16[]”

},

{

“Name”:”3DGraphics_t”,

“Type”:”uint8[]”

}

],

“Ports”:[

{

“Name”:”MachinePersonalStatus”,

“Direction”:”InputOutput”,

“RecordType”:”PersonalStatus_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”MachineText”,

“Direction”:”InputOutput”,

“RecordType”:”Text_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”MachineAvatar”,

“Direction”:”OutputInput”,

“RecordType”:”3DGraphics_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”MachineSpeech”,

“Direction”:”OutputInput”,

“RecordType”:”Speech_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

}

],

“SubAIMs”:[

 

],

“Topology”:[

 

],

“Implementations”:[

 

],

“Documentation”:[

{

“Type”:”Tutorial”,

“URI”:”https://mpai.community/standards/mpai-mmc/”

}

]

}

}
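The following informative Python sketch indicates the minimal shape an implementation of the Personal Status Display AIM described above might take, with one method consuming the two input ports and returning the two output ports. The class and method names are illustrative; the normative APIs are those of MPAI-AIF.

# Informative sketch: minimal skeleton of a Personal Status Display implementation.
from dataclasses import dataclass

@dataclass
class PersonalStatusDisplayOutput:
    machine_avatar: bytes   # 3DGraphics_t
    machine_speech: bytes   # Speech_t

class PersonalStatusDisplay:
    def process(self, machine_text: str, machine_personal_status: bytes) -> PersonalStatusDisplayOutput:
        # A real implementation would synthesise speech from machine_text and
        # animate an avatar conveying machine_personal_status.
        raise NotImplementedError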

 

 

  • AIW and AIM Metadata of CAV-HCI

1.        AIW metadata for HCI

{

“$schema”:”https://json-schema.org/draft/2020-12/schema”,

“$id”:”https://mpai.community/standards/resources/MPAI-AIF/V2/AIW-AIM-metadata.schema.json”,

“title”:”HCI AIF V2 AIW/AIM metadata”,

“Identifier”:{

“ImplementerID”:”/* String assigned by IIDRA */”,

“Specification”:{

“Standard”:”MPAI-CAV”,

“AIW”:”CAV-HCI”,

“AIM”:”CAV-HCI”,

“Version”:”1″

}

},

“APIProfile”:”Secure”,

“Description”:”This AIF enables a human to converse with a CAV.”,

“Types”:[

{

“Name”: “Audio_t”,

“Type”: “uint16[]”

},

{

“Name”:”ArrayAudio_t”,

“Type”:”Audio_t[]”

},

{

“Name”:”VideoOutdoor_t”,

“Type”:”uint32[]”

},

{

“Name”:”LiDAR_t”,

“Type”:”uint24[]”

},

{

“Name”:”RADAR_t”,

“Type”:”uint24[]”

},

{

“Name”:”VideoIndoor_t”,

“Type”:”uint32[]”

},

{

“Name”:”Text_t”,

“Type”:”{uint8[] | uint16[]}”

},

{

“Name”:”3DGraphics_t”,

“Type”:”{uint8[]}”

},

{

“Name”: “Speech_t”,

“Type”: “uint16[]”

}

],

“Ports”:[

{

“Name”:”AudioIndoor”,

“Direction”:”InputOutput”,

“RecordType”:”Audio_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”AudioOutdoor”,

“Direction”:”InputOutput”,

“RecordType”:”ArrayAudio_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”VideoOutdoor”,

“Direction”:”InputOutput”,

“RecordType”:”VideoOutdoor_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”LiDARIndoor”,

“Direction”:”InputOutput”,

“RecordType”:”LiDAR_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”LiDAROutdoor”,

“Direction”:”InputOutput”,

“RecordType”:”LiDAR_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”VideoIndoor”,

“Direction”:”InputOutput”,

“RecordType”:”VideoIndoor_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”MachineText”,

“Direction”:”OutputInput”,

“RecordType”:”Text_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”MachineAvatar”,

“Direction”:”OutputInput”,

“RecordType”:”3DGraphics_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”MachineSpeech”,

“Direction”:”OutputInput”,

“RecordType”:”Speech_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

}

],

“SubAIMs”:[

{

“Name”:”AudioSceneDescription”,

“Identifier”:{

“ImplementerID”:”/* String assigned by IIDRA */”,

“Specification”:{

“Standard”:”MPAI-MMC”,

“AIW”:”MMC-HCI”,

“AIM”:”AudioSceneDescription”,

“Version”:”2″

}

}

},

{

“Name”:”VisualSceneDescription”,

“Identifier”:{

“ImplementerID”:”/* String assigned by IIDRA */”,

“Specification”:{

“Standard”:”MPAI-MMC”,

“AIW”:”MMC-HCI”,

“AIM”:”VisualSceneDescription”,

“Version”:”2″

}

}

},


{

“Name”:”SpatialObjectIdentification”,

“Identifier”:{

“ImplementerID”:”/* String assigned by IIDRA */”,

“Specification”:{

“Standard”:”MPAI-MMC”,

“AIW”:”MMC-HCI”,

“AIM”:”SpatialObjectIdentification”,

“Version”:”2″

}

}

},

{

“Name”:”LanguageUnderstanding”,

“Identifier”:{

“ImplementerID”:”/* String assigned by IIDRA */”,

“Specification”:{

“Standard”:”MPAI-MMC”,

“AIW”:”MMC-HCI”,

“AIM”:”LanguageUnderstanding”,

“Version”:”2″

}

}

},

{

“Name”:”SpeechRecognition”,

“Identifier”:{

“ImplementerID”:”/* String assigned by IIDRA */”,

“Specification”:{

“Standard”:”MPAI-MMC”,

“AIW”:”MMC-HCI”,

“AIM”:”SpeechRecognition”,

“Version”:”2″

}

}

},

{

“Name”:”SpeakerRecognition”,

“Identifier”:{

“ImplementerID”:”/* String assigned by IIDRA */”,

“Specification”:{

“Standard”:”MPAI-MMC”,

“AIW”:”MMC-HCI”,

“AIM”:”SpeakerRecognition”,

“Version”:”2″

}

}

},

{

“Name”:”PersonalStatusExtraction”,

“Identifier”:{

“ImplementerID”:”/* String assigned by IIDRA */”,

“Specification”:{

“Standard”:”MPAI-MMC”,

“AIW”:”MMC-HCI”,

“AIM”:”PersonalStatusExtraction”,

“Version”:”2″

}

}

},

{

“Name”:”FaceRecognition”,

“Identifier”:{

“ImplementerID”:”/* String assigned by IIDRA */”,

“Specification”:{

“Standard”:”MPAI-MMC”,

“AIW”:”MMC-HCI”,

“AIM”:”FaceRecognition”,

“Version”:”2″

}

}

},

{

“Name”:”DialogueProcessing”,

“Identifier”:{

“ImplementerID”:”/* String assigned by IIDRA */”,

“Specification”:{

“Standard”:”MPAI-MMC”,

“AIW”:”MMC-HCI”,

“AIM”:”DialogueProcessing”,

“Version”:”2″

}

}

},

{

“Name”:”PersonalStatusDisplay”,

“Identifier”:{

“ImplementerID”:”/* String assigned by IIDRA */”,

“Specification”:{

“Standard”:”MPAI-MMC”,

“AIW”:”MMC-HCI”,

“AIM”:”PersonalStatusDisplay”,

“Version”:”2″

}

}

}

],

“Topology”:[

{

“Output”:{

“AIMName”:””,

“PortName”:”AudioIndoor”

},

“Input”:{

“AIMName”:”AudioSceneDescription”,

“PortName”:”AudioIndoor”

}

},

{

“Output”:{

“AIMName”:”EnvironmentSensingSubsystem”,

“PortName”:”AudioOutdoor”

},

“Input”:{

“AIMName”:”AudioSceneDescription”,

“PortName”:”AudioOutdoor”

}

},

{

“Output”:{

“AIMName”:”EnvironmentSensingSubsystem”,

“PortName”:”VideoOutdoor”

},

“Input”:{

“AIMName”:”VisualSceneDescription”,

“PortName”:”VideoOutdoor”

}

},

{

“Output”:{

“AIMName”:””,

“PortName”:”LiDARIndoor”

},

“Input”:{

“AIMName”:”VisualSceneDescription”,

“PortName”:”LiDARIndoor”

}

},

{

“Output”:{

“AIMName”:””,

“PortName”:”RADARIndoor”

},

“Input”:{

“AIMName”:”VisualSceneDescription”,

“PortName”:”RADARIndoor”

}

},

{

“Output”:{

“AIMName”:””,

“PortName”:”VideoIndoor”

},

“Input”:{

“AIMName”:”VisualSceneDescription”,

“PortName”:”VideoIndoor”

}

},

{

“Output”:{

“AIMName”:”AudioSceneDescription”,

“PortName”:”InputSpeech2″

},

“Input”:{

“AIMName”:”SpeechRecognition”,

“PortName”:”InputSpeech2″

}

},

{

“Output”:{

“AIMName”:”VisualSceneDescription”,

“PortName”:”BodyDescriptors1″

},

“Input”:{

“AIMName”:”SpatialObjectIdentification”,

“PortName”:”BodyDescriptors1″

}

},

{

“Output”:{

“AIMName”:”VisualSceneDescription”,

“PortName”:”SceneGeometry”

},

“Input”:{

“AIMName”:”SpatialObjectIdentification”,

“PortName”:”SceneGeometry”

}

},

{

“Output”:{

“AIMName”:”VisualSceneDescription”,

“PortName”:”PhysicalObjectID”

},

“Input”:{

“AIMName”:”SpatialObjectIdentification”,

“PortName”:”PhysicalObjectID”

}

},

{

“Output”:{

“AIMName”:”SpeechRecognition”,

“PortName”:”RecognisedText”

},

“Input”:{

“AIMName”:”LanguageUnderstanding”,

“PortName”:”RecognisedText”

}

},

{

“Output”:{

“AIMName”:”SpatialObjectIdentification”,

“PortName”:”PhysicalObjectID”

},

“Input”:{

“AIMName”:”LanguageUnderstanding”,

“PortName”:”PhysicalObjectID”

}

},

{

“Output”:{

“AIMName”:”AudioSceneDescription”,

“PortName”:”SpeechObject”

},

“Input”:{

“AIMName”:”SpeakerRecognition”,

“PortName”:”SpeechObject”

}

},

{

“Output”:{

“AIMName”:”AudioSceneDescription”,

“PortName”:”InputSpeech”

},

“Input”:{

“AIMName”:”PersonalStatusExtraction”,

“PortName”:”InputSpeech”

}

},

{

“Output”:{

“AIMName”:”LanguageUnderstanding”,

“PortName”:”Meaning1″

},

“Input”:{

“AIMName”:”PersonalStatusExtraction”,

“PortName”:”Meaning1″

}

},

{

“Output”:{

“AIMName”:”VisualSceneDescription”,

“PortName”:”BodyDescriptors2″

},

“Input”:{

“AIMName”:”PersonalStatusExtraction”,

“PortName”:”BodyDescriptors2″

}

},

{

“Output”:{

“AIMName”:”VisualSceneDescription”,

“PortName”:”FaceDescriptors”

},

“Input”:{

“AIMName”:”PersonalStatusExtraction”,

“PortName”:”FaceDescriptors”

}

},

{

“Output”:{

“AIMName”:”VisualSceneDescription”,

“PortName”:”FaceObject”

},

“Input”:{

“AIMName”:”FaceRecognition”,

“PortName”:”FaceObject”

}

},

{

“Output”:{

“AIMName”:”SpeakerRecognition”,

“PortName”:”SpeakerID”

},

“Input”:{

“AIMName”:”DialogueProcessing”,

“PortName”:”SpeakerID”

}

},

{

“Output”:{

“AIMName”:”LanguageUnderstanding”,

“PortName”:”Meaning2″

},

“Input”:{

“AIMName”:”DialogueProcessing”,

“PortName”:”Meaning2″

}

},

{

“Output”:{

“AIMName”:”LanguageUnderstanding”,

“PortName”:”RefinedText”

},

“Input”:{

“AIMName”:”DialogueProcessing”,

“PortName”:”RefinedText”

}

},

{

“Output”:{

“AIMName”:”PersonalStatusExtraction”,

“PortName”:”PersonalStatus”

},

“Input”:{

“AIMName”:”DialogueProcessing”,

“PortName”:”PersonalStatus”

}

},

{

“Output”:{

“AIMName”:”FaceRecognition”,

“PortName”:”FaceID”

},

“Input”:{

“AIMName”:”DialogueProcessing”,

“PortName”:”FaceID”

}

},

{

“Output”:{

“AIMName”:”DialogueProcessing”,

“PortName”:”MachineText”

},

“Input”:{

“AIMName”:””,

“PortName”:”MachineText”

}

},

{

“Output”:{

“AIMName”:”DialogueProcessing”,

“PortName”:”MachinePersonalStatus”

},

“Input”:{

“AIMName”:”PersonalStatusDisplay”,

“PortName”:”MachinePersonalStatus”

}

},

{

“Output”:{

“AIMName”:”DialogueProcessing”,

“PortName”:”MachineText”

},

“Input”:{

“AIMName”:”PersonalStatusDisplay”,

“PortName”:”MachineText”

}

},

{

“Output”:{

“AIMName”:”PersonalStatusDisplay”,

“PortName”:”MachineAvatar”

},

“Input”:{

“AIMName”:””,

“PortName”:”MachineAvatar”

}

},

{

“Output”:{

“AIMName”:”PersonalStatusDisplay”,

“PortName”:”MachineSpeech”

},

“Input”:{

“AIMName”:””,

“PortName”:”MachineSpeech”

}

}

],

“Implementations”:[

{

“BinaryName”:”cas.exe”,

“Architecture”:”x64″,

“OperatingSystem”:”Windows”,

“Version”:”v0.1″,

“Source”:”MPAIStore”,

“Destination”:””

}

],

“ResourcePolicies”:[

{

“Name”:”Memory”,

“Minimum”:”50000″,

“Maximum”:”100000″,

“Request”:”75000″

},

{

“Name”:”CPUNumber”,

“Minimum”:”1″,

“Maximum”:”2″,

“Request”:”1″

},

{

“Name”:”CPU:Class”,

“Minimum”:”Low”,

“Maximum”:”High”,

“Request”:”Medium”

},

{

“Name”:”GPU:CUDA:FrameBuffer”,

“Minimum”:”11GB_GDDR5X”,

“Maximum”:”8GB_GDDR6X”,

“Request”:”11GB_GDDR6″

},

{

“Name”:”GPU:CUDA:MemorySpeed”,

“Minimum”:”1.60GHz”,

“Maximum”:”1.77GHz”,

“Request”:”1.71GHz”

},

{

“Name”:”GPU:CUDA:Class”,

“Minimum”:”SM61″,

“Maximum”:”SM86″,

“Request”:”SM75″

},

{

“Name”:”GPU:Number”,

“Minimum”:”1″,

“Maximum”:”1″,

“Request”:”1″

}

],

“Documentation”:[

{

“Type”:”tutorial”,

“URI”:”https://mpai.community/standards/mpai-mmc/”

}

]

}
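Note (informative): in the Topology array above, each entry couples the Output port of one AIM to the Input port of another; an empty AIMName denotes a port of the AIW itself, and both ends of an entry carry the same PortName. The Python sketch below shows one possible consistency check over such metadata. It is not part of this Technical Specification: the file name hci_aiw.json, the function names and the treatment of EnvironmentSensingSubsystem as an external source are illustrative assumptions.

import json

def check_topology(metadata, external_sources=()):
    """Check the Topology of an AIW metadata record for dangling references."""
    problems = []
    # AIM Names declared in SubAIMs; "" denotes a port of the AIW boundary itself,
    # and external_sources lists any other producers (e.g. another Subsystem) to accept.
    declared = {""} | set(external_sources)
    declared |= {sub.get("Name") for sub in metadata.get("SubAIMs", []) if isinstance(sub, dict)}
    for index, connection in enumerate(metadata.get("Topology", [])):
        if not isinstance(connection, dict):
            continue
        for end in ("Output", "Input"):
            aim = connection.get(end, {}).get("AIMName", "")
            if aim not in declared:
                problems.append(f"connection {index}: {end} references undeclared AIM '{aim}'")
        # In the metadata above, both ends of a connection carry the same PortName.
        if connection.get("Output", {}).get("PortName") != connection.get("Input", {}).get("PortName"):
            problems.append(f"connection {index}: PortName differs between Output and Input")
    return problems

if __name__ == "__main__":
    with open("hci_aiw.json", encoding="utf-8") as file:   # illustrative file name
        issues = check_topology(json.load(file), external_sources=("EnvironmentSensingSubsystem",))
    print("\n".join(issues) if issues else "Topology is consistent with the declared SubAIMs.")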

2.        Metadata for HCI AIMs

2.1        Audio Scene Description

{

“Identifier”:{

“ImplementerID”:”/* String assigned by IIDRA */”,

“Specification”:{

“Name”:”CAV”,

“AIW”:”HCI”,

“AIM”:”AudioSceneDescription”,

“Version”:”2″

},

“Description”:”This AIM implements the audio scene description function for CAV-HCI.”,

“Types”:[

{

“Name”: “Audio_t”,

“Type”: “uint16[]”

},

{

“Name”: “ArrayAudio_t”,

“Type”: “Audio_t[]”

},

{

"Name":"Speech_t",

“Type”:”uint16[]”

}

],

“Ports”:[

{

“Name”:”AudioIndoor”,

“Direction”:”InputOutput”,

“RecordType”:”ArrayAudio_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”AudioOutdoor”,

“Direction”:”InputOutput”,

“RecordType”:”ArrayAudio_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”SpeechObject”,

“Direction”:”OutputInput”,

“RecordType”:”Speech_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”InputSpeech1″,

“Direction”:”OutputInput”,

“RecordType”:”Speech_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”InputSpeech2″,

“Direction”:”OutputInput”,

“RecordType”:”Speech_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

}

],

“SubAIMs”:[

 

],

“Topology”:[

 

],

“Implementations”:[

 

],

“Documentation”:[

{

“Type”:”Tutorial”,

“URI”:”https://mpai.community/standards/mpai-mmc/”

}

]

}
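Note (informative): every entry of the Ports array carries the same six fields (Name, Direction, RecordType, Technology, Protocol, IsRemote); as the examples above suggest, Direction is InputOutput for ports that receive data and OutputInput for ports that emit it. Purely as an illustration, such an entry can be mirrored by a small record type; the Python sketch below is one possible mapping, not a normative API.

from dataclasses import dataclass

@dataclass
class Port:
    """One entry of the Ports array of an AIM metadata record."""
    name: str          # e.g. "AudioIndoor"
    direction: str     # "InputOutput" (receiving side) or "OutputInput" (emitting side)
    record_type: str   # a Type name declared in the Types array, e.g. "ArrayAudio_t"
    technology: str    # "Software" in the metadata above
    protocol: str      # empty when no specific protocol is required
    is_remote: bool

def port_from_metadata(entry: dict) -> Port:
    """Map one JSON object of the Ports array onto the record above."""
    return Port(
        name=entry["Name"],
        direction=entry["Direction"],
        record_type=entry["RecordType"],
        technology=entry["Technology"],
        protocol=entry["Protocol"],
        is_remote=entry["IsRemote"],
    )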

2.2        Visual Scene Description

{

“Identifier”:{

“ImplementerID”:”/* String assigned by IIDRA */”,

“Specification”:{

“Name”:”MPAI-MMC”,

“AIW”:”MMC-HCI”,

“AIM”:”VisualSceneDescription”,

“Version”:”2″

},

"Description":"This AIM implements the visual scene description function for MMC-HCI.",

“Types”:[

{

“Name”:”Video_t”,

“Type”:”uint32[]”

},

{

“Name”:”LiDAR_t”,

“Type”:”uint24[]”

},

{

“Name”:”RADAR_t”,

“Type”:”uint24[]”

},

{

“Name”:”BodyDescriptors_t”,

“Type”:”uint8[]”

},

{

“Name”:”VisualSceneGeometry_t”,

“Type”:”uint8[]”

},

{

“Name”:”PhysicalObject_t”,

“Type”:”uint8[]”

},

{

“Name”:”FaceDescriptors_t”,

“Type”:”uint8[]”

},

{

“Name”:”FaceObject_t”,

“Type”:”uint32[]”

}

],

“Ports”:[

{

“Name”:”VideoOutdoor”,

“Direction”:”InputOutput”,

“RecordType”:”Video_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”LiDARIndoor”,

“Direction”:”InputOutput”,

“RecordType”:”LiDAR_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”RADARIndoor”,

“Direction”:”InputOutput”,

“RecordType”:”RADAR_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”VideoIndoor”,

“Direction”:”InputOutput”,

“RecordType”:”Video_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”BodyDescriptors”,

“Direction”:”OutputInput”,

“RecordType”:”BodyDescriptors_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”VisualSceneGeometry”,

“Direction”:”OutputInput”,

“RecordType”:”VisualSceneGeometry_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”PhysicalObject”,

“Direction”:”OutputInput”,

"RecordType":"PhysicalObject_t",

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”FaceDescriptors”,

“Direction”:”OutputInput”,

“RecordType”:”FaceDescriptors_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”FaceObject”,

“Direction”:”OutputInput”,

"RecordType":"FaceObject_t",

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

}

],

“SubAIMs”:[

 

],

“Topology”:[

 

],

“Implementations”:[

 

],

“Documentation”:[

{

“Type”:”Tutorial”,

“URI”:”https://mpai.community/standards/mpai-cav/”

}

]

}

}

2.3        SpeechRecognition

{

“Identifier”:{

“ImplementerID”:”/* String assigned by IIDRA */”,

“Specification”:{

“Name”:”CAV”,

“AIW”:”HCI”,

“AIM”:”SpeechRecognition”,

“Version”:”1″

},

"Description":"This AIM implements the speech recognition function for CAV-HCI: it converts the user's speech to text.",

“Types”:[

{

“Name”:”Speech_t”,

“Type”:”uint16[]”

},

{

“Name”:”Text_t”,

“Type”:”{uint8[] | uint16[]}”

}

],

“Ports”:[

{

“Name”:”InputSpeech”,

“Direction”:”InputOutput”,

“RecordType”:”Speech_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”RecognisedText”,

“Direction”:”OutputInput”,

“RecordType”:”Text_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

}

],

“SubAIMs”:[

 

],

“Topology”:[

 

],

“Implementations”:[

 

],

“Documentation”:[

{

“Type”:”Tutorial”,

“URI”:”https://mpai.community/standards/mpai-cav/”

}

]

}

}

2.4        SpatialObjectIdentification

{

“Identifier”:{

“ImplementerID”:”/* String assigned by IIDRA */”,

“Specification”:{

“Name”:”MMC”,

“AIW”:”CAS”,

“AIM”:”SpatialObjectIdentification”,

“Version”:”1″

},

“Description”:”This AIM identifies the Physical Object indicated by a human’s finger.”,

“Types”:[

{

“Name”:”BodyDescriptors_t”,

“Type”:”uint16[]”

},

{

“Name”:”VisualSceneGeometry_t”,

“Type”:”{uint8[]}”

},

{

“Name”:”PhysicalObject_t”,

“Type”:”{uint8[]}”

},

{

“Name”:”PhysicalObjectID_t”,

“Type”:”{string objectImageLabel; float32 confidenceLevel}”

}

],

“Ports”:[

{

“Name”:”BodyDescriptors”,

“Direction”:”InputOutput”,

“RecordType”:”BodyDescriptors_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”SceneGeometry”,

“Direction”:”InputOutput”,

“RecordType”:”VisualSceneGeometry_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”PhysicalObjects”,

“Direction”:”InputOutput”,

“RecordType”:”PhysicalObject_t[]”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”PhysicalObjectID”,

“Direction”:”OutputInput”,

"RecordType":"PhysicalObjectID_t",

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

}

],

"SubAIMs":[

 

],

“Topology”:[

 

],

“Implementations”:[

 

],

“Documentation”:[

{

“Type”:”Tutorial”,

“URI”:”https://mpai.community/standards/mpai-cav/”

}

]

}

}
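Note (informative): PhysicalObjectID_t is the record that carries the identification result, i.e. a label for the indicated object and a confidence level. The following is an illustrative mirror of that record in Python; the example values are invented for illustration only.

from dataclasses import dataclass

@dataclass
class PhysicalObjectID:
    """Mirror of PhysicalObjectID_t: {string objectImageLabel; float32 confidenceLevel}."""
    object_image_label: str
    confidence_level: float

# Illustrative value only: the object indicated by the human's finger is identified as a bottle.
example = PhysicalObjectID(object_image_label="bottle", confidence_level=0.87)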

2.5        LanguageUnderstanding

{

“Identifier”:{

“ImplementerID”:”/* String assigned by IIDRA */”,

“Specification”:{

“Name”:”MPAI-MMC”,

“AIW”:”MMC-HCI”,

“AIM”:”LanguageUnderstanding”,

“Version”:”2″

},

“Description”:”This AIM extracts Meaning from Recognised Text supplemented by the ID of the Physical Object and improves Recognised Text supplemented by the ID of the Physical Object.”,

“Types”:[

{

“Name”:”PhysicalObject_t”,

“Type”:”uint8[]”

},

{

“Name”:”Text_t”,

“Type”:”{uint8[] | uint16[]}”

},

{

“Name”:”Tagging_t”,

“Type”:”{string<256 set; string<256 result}”

},

{

“Name”:”Meaning_t”,

“Type”:”{Tagging_t POS_tagging; Tagging_t NE_tagging; Tagging_t dependency_tagging; Tagging_t SRL_tagging}”

}

],

“Ports”:[

{

“Name”:”RecognisedText”,

“Direction”:”InputOutput”,

“RecordType”:”Text_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”PhysicalObjectID”,

“Direction”:”InputOutput”,

“RecordType”:”PhysicalObjectID_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”Meaning”,

“Direction”:”OutputInput”,

“RecordType”:”Meaning_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”RefinedText”,

“Direction”:”OutputInput”,

“RecordType”:”Text_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

}

],

“SubAIMs”:[

 

],

“Topology”:[

 

],

“Implementations”:[

 

],

“Documentation”:[

{

“Type”:”Tutorial”,

“URI”:”https://mpai.community/standards/mpai-mmc/”

}

]

}

}
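Note (informative): Meaning_t groups four Tagging_t records (part-of-speech, named-entity, dependency and semantic-role tagging), each holding the tag set used and the tagging result. The Python sketch below simply mirrors these two record types; the tag sets (UPOS, UD, PropBank) and the example sentence are illustrative assumptions.

from dataclasses import dataclass

@dataclass
class Tagging:
    """Tagging_t: the tag set used ('set') and the result produced with it (each < 256 characters)."""
    tag_set: str
    result: str

@dataclass
class Meaning:
    """Meaning_t: the four tagging records produced by Language Understanding."""
    pos_tagging: Tagging
    ne_tagging: Tagging
    dependency_tagging: Tagging
    srl_tagging: Tagging

# Illustrative instance for the sentence "take the bottle".
example = Meaning(
    pos_tagging=Tagging(tag_set="UPOS", result="take/VERB the/DET bottle/NOUN"),
    ne_tagging=Tagging(tag_set="none", result=""),
    dependency_tagging=Tagging(tag_set="UD", result="root(take) det(bottle,the) obj(take,bottle)"),
    srl_tagging=Tagging(tag_set="PropBank", result="take.01: ARG1=the bottle"),
)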

2.6        SpeakerRecognition

{

“Identifier”:{

“ImplementerID”:”/* String assigned by IIDRA */”,

“Specification”:{

“Name”:”MPAI-MMC”,

“AIW”:”MMC-HCI”,

“AIM”:”SpeakerRecognition”,

“Version”:”2″

},

“Description”:”This AIM implements the speaker recognition function for CAV-HCI: it identifies a speaker based on their speech.”,

“Types”:[

{

“Name”:”Speech_t”,

“Type”:”uint16[]”

},

{

“Name”:”SpeakerID_t”,

“Type”:”{uint8[]}”

}

],

“Ports”:[

{

“Name”:”SpeechObject”,

“Direction”:”InputOutput”,

“RecordType”:”Speech_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”SpeakerID”,

“Direction”:”OutputInput”,

“RecordType”:”SpeakerID_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

}

],

“SubAIMs”:[

 

],

“Topology”:[

 

],

“Implementations”:[

 

],

“Documentation”:[

{

“Type”:”Tutorial”,

“URI”:”https://mpai.community/standards/mpai-cav/”

}

]

}

}

2.7        PersonalStatusExtraction

{

“Identifier”:{

“ImplementerID”:”/* String assigned by IIDRA */”,

“Specification”:{

“Name”:”MPAI-MMC”,

“AIW”:”MMC-HCI”,

“AIM”:”PersonalStatusExtraction”,

“Version”:”2″

},

“Description”:”This AIM extracts the combined Personal Status from Text, Speech, Face, and Gesture.”,

“Types”:[

{

“Name”:”Speech_t”,

“Type”:”{uint16[]}”

},

{

“Name”:”Tagging_t”,

“Type”:”{string<256 set; string<256 result}”

},

{

“Name”:”Meaning_t”,

“Type”:”{Tagging_t POS_tagging; Tagging_t NE_tagging; Tagging_t dependency_tagging; Tagging_t SRL_tagging}”

},

{

“Name”:”FaceDescriptors_t”,

“Type”:”uint8[]”

},

{

“Name”:”BodyDescriptors_t”,

“Type”:”uint8[]”

},

{

“Name”:”PersonalStatus_t”,

“Type”:”uint8[]”

}

],

“Ports”:[

{

“Name”:”InputSpeech1″,

“Direction”:”InputOutput”,

“RecordType”:”Speech_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”Meaning1″,

“Direction”:”OutputInput”,

“RecordType”:”Meaning_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

"Name":"FaceDescriptors",

“Direction”:”InputOutput”,

“RecordType”:”FaceDescriptors_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”BodyDescriptors”,

“Direction”:”InputOutput”,

“RecordType”:”BodyDescriptors_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”PersonalStatus”,

“Direction”:”OutputInput”,

“RecordType”:”PersonalStatus_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

}

],

“SubAIMs”:[

 

],

“Topology”:[

 

],

“Implementations”:[

 

],

“Documentation”:[

{

“Type”:”Tutorial”,

“URI”:”https://mpai.community/standards/mpai-cav/”

}

]

}

}

2.8        FaceRecognition

{

“Identifier”:{

“ImplementerID”:”/* String assigned by IIDRA */”,

“Specification”:{

“Name”:”MPAI-MMC”,

“AIW”:”MMC-HCI”,

“AIM”:”FaceRecognition”,

“Version”:”2″

},

“Description”:”This AIM implements the human recognition function for CAV-HCI: it identifies a human based on their face.”,

“Types”:[

{

“Name”:”Face_t”,

“Type”:”uint32[]”

},

{

“Name”:”FaceID_t”,

“Type”:”{uint8[]}”

}

],

“Ports”:[

{

“Name”:”FaceObject”,

“Direction”:”InputOutput”,

“RecordType”:”Face_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”FaceID”,

“Direction”:”OutputInput”,

“RecordType”:”FaceID_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

}

],

“SubAIMs”:[

 

],

“Topology”:[

 

],

“Implementations”:[

 

],

“Documentation”:[

{

“Type”:”Tutorial”,

“URI”:”https://mpai.community/standards/mpai-cav/”

}

]

}

}

2.9        DialogueProcessing

{

“Identifier”:{

“ImplementerID”:”/* String assigned by IIDRA */”,

“Specification”:{

“Name”:”MPAI-MMC”,

“AIW”:”MMC-HCI”,

“AIM”:”DialogueProcessing”,

“Version”:”1″

},

“Description”:”This AIM produces the Machine’s Text and Personal Status from the human’s Text and Personal Status.”,

“Types”:[

{

“Name”:”Text_t”,

"Type":"{uint8[] | uint16[]}"

},

{

“Name”:”PersonalStatus_t”,

“Type”:”uint8[]”

}

],

“Ports”:[

{

“Name”:”SpeakerID”,

“Direction”:”InputOutput”,

“RecordType”:”SpeakerID_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”Meaning2″,

“Direction”:”OutputInput”,

“RecordType”:”Meaning_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”RefinedText”,

“Direction”:”OutputInput”,

“RecordType”:”Text_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”PersonalStatus”,

“Direction”:”InputOutput”,

“RecordType”:”PersonalStatus_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”FaceID”,

“Direction”:”InputOutput”,

“RecordType”:”FaceID_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”MachineText”,

“Direction”:”OutputInput”,

“RecordType”:”Text_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”MachinePersonalStatus”,

“Direction”:”OutputInput”,

“RecordType”:”PersonalStatus_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

}

],

“SubAIMs”:[

 

],

“Topology”:[

 

],

“Implementations”:[

 

],

“Documentation”:[

{

“Type”:”Tutorial”,

“URI”:”https://mpai.community/standards/mpai-mmc/”

}

]

}

}

2.10        PersonalStatusDisplay

{

“Identifier”:{

“ImplementerID”:”/* String assigned by IIDRA */”,

“Specification”:{

“Name”:”MPAI-MMC”,

“AIW”:”MMC-HCI”,

“AIM”:”PersonalStatusDisplay”,

“Version”:”2″

},

“Description”:”This AIM renders a speaking avatar from text and Personal Status.”,

“Types”:[

{

“Name”:”PersonalStatus_t”,

“Type”:”uint8[]”

},

{

“Name”:”Text_t”,

“Type”:”{uint8[] | uint16[]}”

},

{

“Name”:”3DGraphics_t”,

“Type”:”uint8[]”

},

{

“Name”:”Speech_t”,

“Type”:”uint16[]”

}

],

“Ports”:[

{

“Name”:”MachinePersonalStatus”,

“Direction”:”InputOutput”,

“RecordType”:”PersonalStatus_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”MachineText”,

“Direction”:”InputOutput”,

“RecordType”:”Text_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”MachineSpeech”,

“Direction”:”OutputInput”,

“RecordType”:”Speech_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”MachineAvatar”,

“Direction”:”OutputInput”,

“RecordType”:”3DGraphics_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

}

],

“SubAIMs”:[

 

],

“Topology”:[

 

],

“Implementations”:[

 

],

“Documentation”:[

{

“Type”:”Tutorial”,

“URI”:”https://mpai.community/standards/mpai-cav/”

}

]

}

}
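Note (informative): the Personal Status Display AIM consumes the Machine's Text and Personal Status and produces Machine Speech and a Machine Avatar. The hypothetical Python signature below only restates that port mapping; it is not an API defined by this Technical Specification, and an implementation is free to expose the AIM in any other way.

from typing import Tuple

def personal_status_display(machine_text: str,
                            machine_personal_status: bytes) -> Tuple[bytes, bytes]:
    """Hypothetical wrapper mirroring the ports of the PersonalStatusDisplay AIM.

    Inputs:  MachineText (Text_t), MachinePersonalStatus (PersonalStatus_t)
    Outputs: MachineSpeech (Speech_t), MachineAvatar (3DGraphics_t)
    """
    raise NotImplementedError("Implementation-specific: synthesise speech and animate the avatar.")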

 

 

  • AIW and AIM Metadata of ARA-VSV

1          Metadata for VSV AIW

{

“$schema”:”https://json-schema.org/draft/2020-12/schema”,

“$id”:”https://mpai.community/standards/resources/MPAI-AIF/V2/AIW-AIM-metadata.schema.json”,

“title”:”VSV AIF V2 AIW/AIM metadata”,

“Identifier”:{

“ImplementerID”:”/* String assigned by IIDRA */”,

“Specification”:{

“Standard”:”MPAI-MMC”,

“AIW”:”MMC-VSV”,

"AIM":"MMC-VSV",

“Version”:”2″

}

},

“APIProfile”:”Secure”,

"Description":"This AIF is used to produce the visual and vocal appearance of the Virtual Secretary and the Summary of the Avatar-Based Videoconference.",

“Types”:[

{

“Name”:”Text_t”,

“Type”:”{uint8[] | uint16[]}”

},

{

“Name”:”Speech_t”,

“Type”:”uint16[]”

},

{

“Name”:”AvatarDescriptors_t”,

“Type”:”uint8[]”

},

{

“Name”:”Summary_t”,

“Type”:”uint8[]”

},

{

“Name”:”AvatarModel_t”,

“Type”:”uint8[]”

}

],

“Ports”:[

{

“Name”:”InputText1″,

“Direction”:”InputOutput”,

“RecordType”:”Text_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”InputSpeech1″,

“Direction”:”InputOutput”,

“RecordType”:”Speech_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”InputText2″,

“Direction”:”InputOutput”,

“RecordType”:”Text_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”InputSpeech2″,

“Direction”:”InputOutput”,

“RecordType”:”Speech_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”AvatarDescriptors”,

“Direction”:”InputOutput”,

“RecordType”:”AvatarDescriptors_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”Summary”,

“Direction”:”OutputInput”,

“RecordType”:”Summary_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”AvatarModel”,

“Direction”:”OutputInput”,

“RecordType”:”AvatarModel_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”VSText”,

“Direction”:”OutputInput”,

"RecordType":"Text_t",

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”VSSpeech”,

“Direction”:”OutputInput”,

“RecordType”:”Speech_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”VSAvatarDescriptors”,

“Direction”:”OutputInput”,

“RecordType”:”AvatarDescriptors_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

}

],

“SubAIMs”:[

{

“Name”:”SpeechRecognition”,

“Identifier”:{

“ImplementerID”:”/* String assigned by IIDRA */”,

“Specification”:{

“Standard”:”MPAI-MMC”,

“AIW”:”MMC-VSV”,

“AIM”:”SpeechRecognition”,

“Version”:”1″

}

}

},

{

“Name”:”AvatarDescriptorsParsing”,

“Identifier”:{

“ImplementerID”:”/* String assigned by IIDRA */”,

“Specification”:{

“Standard”:”MPAI-MMC “,

“AIW”:”MMC-VSV”,

“AIM”:”AvatarDescriptorsParsing”,

“Version”:”2″

}

}

},

{

"Name":"LanguageUnderstanding",

“Identifier”:{

“ImplementerID”:”/* String assigned by IIDRA */”,

“Specification”:{

“Standard”:”MPAI-MMC”,

“AIW”:”MMC-VSV”,

“AIM”:”LanguageUnderstanding”,

“Version”:”2″

}

}

},

{

“Name”:”PersonalStatusExtraction”,

“Identifier”:{

“ImplementerID”:”/* String assigned by IIDRA */”,

“Specification”:{

“Standard”:”MPAI-MMC “,

“AIW”:”MMC-VSV”,

“AIM”:”PersonalStatusExtraction”,

“Version”:”2″

}

}

},

{

“Name”:”Summarisation”,

“Identifier”:{

“ImplementerID”:”/* String assigned by IIDRA */”,

“Specification”:{

“Standard”:”MPAI-MMC”,

“AIW”:”MMC-VSV”,

“AIM”:”Summarisation”,

“Version”:”2″

}

}

},

{

“Name”:”PersonalStatusDisplay”,

“Identifier”:{

“ImplementerID”:”/* String assigned by IIDRA */”,

“Specification”:{

“Standard”:”MPAI-MMC”,

“AIW”:”MMC-VSV”,

“AIM”:”PersonalStatusDisplay”,

“Version”:”2″

}

}

}

},

{

"Name":"DialogueProcessing",

"Identifier":{

"ImplementerID":"/* String assigned by IIDRA */",

"Specification":{

"Standard":"MPAI-MMC",

"AIW":"MMC-VSV",

"AIM":"DialogueProcessing",

"Version":"1"

}

}

}

],

“Topology”:[

{

“Output”:{

“AIMName”:””,

“PortName”:”InputSpeech1″

},

“Input”:{

“AIMName”:”SpeechRecognition”,

“PortName”:”InputSpeech1″

}

},

{

“Output”:{

“AIMName”:””,

“PortName”:”InputAvatarDescriptors”

},

“Input”:{

“AIMName”:”AvatarDescriptorsParsing”,

“PortName”:”InputAvatarDescriptors”

}

},

{

“Output”:{

“AIMName”:”SpeechRecognition”,

“PortName”:”RecognisedText”

},

“Input”:{

“AIMName”:”LanguageUnderstanding”,

“PortName”:”RecognisedText”

}

},

{

“Output”:{

“AIMName”:”LanguageUnderstanding”,

“PortName”:”Meaning2″

},

“Input”:{

“AIMName”:”PersonalStatusExtraction”,

“PortName”:”Meaning2″

}

},

{

“Output”:{

“AIMName”:””,

“PortName”:”InputSpeech2″

},

“Input”:{

“AIMName”:”PersonalStatusExtraction”,

“PortName”:”InputSpeech2″

}

},

{

“Output”:{

“AIMName”:”AvatarDescriptorsParsing”,

“PortName”:”BodyDescriptors”

},

“Input”:{

“AIMName”:”PersonalStatusExtraction”,

“PortName”:”BodyDescriptors”

}

},

{

“Output”:{

"AIMName":"AvatarDescriptorsParsing",

“PortName”:”FaceDescriptors”

},

“Input”:{

“AIMName”:”PersonalStatusExtraction”,

“PortName”:”FaceDescriptors”

}

},

{

"Output":{

“AIMName”:”LanguageUnderstanding”,

“PortName”:”Meaning2″

},

“Input”:{

“AIMName”:”Summarisation”,

“PortName”:”Meaning2″

}

},

{

"Output":{

"AIMName":"LanguageUnderstanding",

“PortName”:”RefinedText2″

},

“Input”:{

“AIMName”:”Summarisation”,

“PortName”:”RefinedText2″

}

},

{

"Output":{

"AIMName":"PersonalStatusExtraction",

“PortName”:”InputPersonalStatus2″

},

“Input”:{

“AIMName”:”Summarisation”,

“PortName”:”InputPersonalStatus2″

}

},

{

“Output”:{

“AIMName”:”LanguageUnderstanding”,

“PortName”:”RefinedText1″

},

“Input”:{

“AIMName”:”DialogueProcessing”,

“PortName”:”RefinedText1″

}

},

{

“Output”:{

“AIMName”:””,

“PortName”:”InputText1″

},

“Input”:{

“AIMName”:”DialogueProcessing”,

“PortName”:”InputText1″

}

},


{

“Output”:{

"AIMName":"LanguageUnderstanding",

“PortName”:”Meaning1″

},

“Input”:{

“AIMName”:”DialogueProcessing”,

“PortName”:”Meaning1″

}

},

{

“Output”:{

“AIMName”:”PersonalStatusExtraction”,

“PortName”:”InputPersonalStatus1″

},

“Input”:{

“AIMName”:”DialogueProcessing”,

“PortName”:”InputPersonalStatus1″

}

},

{

“Output”:{

“AIMName”:”DialogueProcessing”,

“PortName”:”EditedSummary”

},

“Input”:{

“AIMName”:”Summarisation”,

“PortName”:”EditedSummary”

}

},

{

“Output”:{

“AIMName”:”Summarisation”,

“PortName”:”Summary1″

},

“Input”:{

“AIMName”:”DialogueProcessing”,

“PortName”:”Summary1″

}

},

{

“Output”:{

“AIMName”:”DialogueProcessing”,

“PortName”:”Summary2″

},

“Input”:{

“AIMName”:””,

“PortName”:”Summary2″

}

},

{

“Output”:{

“AIMName”:”PersonalStatusDisplay”,

“PortName”:”VSAvatarModel”

},

“Input”:{

“AIMName”:””,

“PortName”:”VSAvatarModel”

}

},

{

“Output”:{

“AIMName”:”PersonalStatusDisplay”,

“PortName”:”VSText”

},

“Input”:{

“AIMName”:””,

“PortName”:”VSText”

}

},

{

“Output”:{

“AIMName”:”PersonalStatusDisplay”,

“PortName”:”VSSpeech”

},

“Input”:{

“AIMName”:””,

“PortName”:”VSSpeech”

}

},

{

“Output”:{

“AIMName”:”PersonalStatusDisplay”,

“PortName”:”VSAvatarDescriptors”

},

“Input”:{

“AIMName”:””,

“PortName”:”VSAvatarDescriptors”

}

}

],

“Implementations”:[

{

“BinaryName”:”vsv.exe”,

“Architecture”:”x64″,

“OperatingSystem”:”Windows”,

“Version”:”v0.1″,

“Source”:”MPAIStore”,

“Destination”:””

}

],

“ResourcePolicies”:[

{

“Name”:”Memory”,

“Minimum”:”50000″,

“Maximum”:”100000″,

“Request”:”75000″

},

{

“Name”:”CPUNumber”,

“Minimum”:”1″,

“Maximum”:”2″,

“Request”:”1″

},

{

“Name”:”CPU:Class”,

“Minimum”:”Low”,

“Maximum”:”High”,

“Request”:”Medium”

},

{

“Name”:”GPU:CUDA:FrameBuffer”,

“Minimum”:”11GB_GDDR5X”,

“Maximum”:”8GB_GDDR6X”,

“Request”:”11GB_GDDR6″

},

{

“Name”:”GPU:CUDA:MemorySpeed”,

“Minimum”:”1.60GHz”,

“Maximum”:”1.77GHz”,

“Request”:”1.71GHz”

},

{

“Name”:”GPU:CUDA:Class”,

“Minimum”:”SM61″,

“Maximum”:”SM86″,

“Request”:”SM75″

},

{

“Name”:”GPU:Number”,

“Minimum”:”1″,

“Maximum”:”1″,

“Request”:”1″

}

],

“Documentation”:[

{

“Type”:”tutorial”,

“URI”:”https://mpai.community/standards/mpai-mmc/”

}

]

}
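Note (informative): each ResourcePolicies entry bounds one resource with a Minimum, a Maximum and a Request value. The Python sketch below shows one way an AIF implementation might screen the numeric policies (Memory, CPUNumber, GPU:Number) against locally available resources; the function name and the available dictionary are illustrative assumptions, and non-numeric policies such as CPU:Class are deliberately skipped.

def satisfies_numeric_policies(policies, available):
    """Check the numeric ResourcePolicies of an AIW against locally available resources.

    'policies' is the ResourcePolicies array of the metadata; 'available' maps a policy
    Name to the amount the local AIF can offer.  Policies whose bounds are not numeric
    (e.g. CPU:Class, GPU:CUDA:Class) are skipped and would need their own ordering.
    """
    for policy in policies:
        name = policy.get("Name")
        if name not in available:
            continue
        try:
            minimum = float(policy["Minimum"])
        except (KeyError, ValueError):
            continue                       # non-numeric bound: skip it here
        if available[name] < minimum:
            return False                   # the local platform cannot satisfy the Minimum
    return True

# Illustrative use with two of the values listed above.
policies = [
    {"Name": "Memory", "Minimum": "50000", "Maximum": "100000", "Request": "75000"},
    {"Name": "CPUNumber", "Minimum": "1", "Maximum": "2", "Request": "1"},
]
print(satisfies_numeric_policies(policies, {"Memory": 80000, "CPUNumber": 2}))  # True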

2.        AIM metadata for ARA-VSV

2.1        SpeechRecognition

{

“Identifier”:{

“ImplementerID”:”/* String assigned by IIDRA */”,

“Specification”:{

“Name”:”MPAI-MMC”,

“AIW”:”MMC-VSV”,

“AIM”:”SpeechRecognition”,

“Version”:”1″

},

“Description”:”This AIM implements the speech recognition function for ARA-VSV: it converts the user’s speech to text.”,

“Types”:[

{

“Name”:”Speech_t”,

“Type”:”uint16[]”

},

{

“Name”:”Text_t”,

“Type”:”{uint8[] | uint16[]}”

}

],

“Ports”:[

{

“Name”:”InputSpeech1″,

“Direction”:”InputOutput”,

“RecordType”:”Speech_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”RecognisedText”,

“Direction”:”OutputInput”,

“RecordType”:”Text_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

}

],

“SubAIMs”:[

 

],

“Topology”:[

 

],

“Implementations”:[

 

],

“Documentation”:[

{

“Type”:”Tutorial”,

“URI”:”https://mpai.community/standards/mpai-mmc/”

}

]

}

}

2.2        AvatarDescriptorParsing

{

“Identifier”:{

“ImplementerID”:”/* String assigned by IIDRA */”,

“Specification”:{

“Name”:”MPAI-MMC”,

“AIW”:”MMC-VSV”,

“AIM”:”AvatarDescriptorParsing”,

“Version”:”2″

},

"Description":"This AIM implements the avatar descriptor parsing function for ARA-VSV: it parses the Avatar Descriptors into Body Descriptors and Face Descriptors.",

“Types”:[

{

“Name”:”AvatarDescriptors_t”,

“Type”:”uint8[]”

},

{

“Name”:”BodyDescriptors_t”,

“Type”:”{uint8[]}”

},

{

“Name”:”FaceDescriptors_t”,

“Type”:”{uint8[]}”

}

],

“Ports”:[

{

“Name”:”InputAvatarDescriptors”,

“Direction”:”InputOutput”,

“RecordType”:”AvatarDescriptors_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”BodyDescriptors”,

“Direction”:”OutputInput”,

“RecordType”:”BodyDescriptors_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”FaceDescriptors”,

“Direction”:”OutputInput”,

“RecordType”:”FaceDescriptors_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

}

],

“SubAIMs”:[

 

],

“Topology”:[

 

],

“Implementations”:[

 

],

“Documentation”:[

{

“Type”:”Tutorial”,

“URI”:”https://mpai.community/standards/mpai-ara/”

}

]

}

}

2.3        LanguageUnderstanding

{

“Identifier”:{

“ImplementerID”:”/* String assigned by IIDRA */”,

“Specification”:{

“Name”:”MPAI-MMC”,

“AIW”:”MMC-VSV”,

“AIM”:”LanguageUnderstanding”,

“Version”:”1″

},

"Description":"This AIM extracts Meaning from Recognised Text and improves Recognised Text.",

“Types”:[

{

“Name”:”Text_t”,

“Type”:”{uint8[] | uint16[]}”

},

{

“Name”:”Tagging_t”,

“Type”:”{string<256 set; string<256 result}”

},

{

“Name”:”Meaning_t”,

“Type”:”{Tagging_t POS_tagging; Tagging_t NE_tagging; Tagging_t dependency_tagging; Tagging_t SRL_tagging}”

}

],

“Ports”:[

{

“Name”:”RecognisedText”,

“Direction”:”InputOutput”,

“RecordType”:”Text_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

"Name":"InputText",

“Direction”:”InputOutput”,

“RecordType”:”Text_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”RefinedText1″,

“Direction”:”OutputInput”,

“RecordType”:”Text_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”Meaning1″,

“Direction”:”OutputInput”,

“RecordType”:”Meaning_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

}

],

“SubAIMs”:[

 

],

“Topology”:[

 

],

“Implementations”:[

 

],

“Documentation”:[

{

“Type”:”Tutorial”,

“URI”:”https://mpai.community/standards/mpai-mmc/”

}

]

}

}

2.4        PersonalStatusExtraction

{

“Identifier”:{

“ImplementerID”:”/* String assigned by IIDRA */”,

“Specification”:{

“Name”:”MPAI-MMC”,

“AIW”:”MMC-VSV”,

“AIM”:”PersonalStatusExtraction”,

“Version”:”2″

},

“Description”:”This AIM extracts the combined Personal Status from Text, Speech, Face, and Gesture.”,

“Types”:[

{

“Name”:”Speech_t”,

“Type”:”{uint16[]}”

},

{

“Name”:”BodyDescriptors_t”,

“Type”:”uint8[]”

},

{

“Name”:”FaceDescriptors_t”,

“Type”:”uint8[]”

},

{

“Name”:”Tagging_t”,

“Type”:”{string<256 set; string<256 result}”

},

{

“Name”:”Meaning_t”,

“Type”:”{Tagging_t POS_tagging; Tagging_t NE_tagging; Tagging_t dependency_tagging; Tagging_t SRL_tagging}”

},

{

“Name”:”PersonalStatus_t”,

“Type”:”uint8[]”

}

],

“Ports”:[

{

“Name”:”Meaning2″,

“Direction”:”InputOutput”,

“RecordType”:”Meaning_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”InputSpeech”,

“Direction”:”InputOutput”,

“RecordType”:”Speech_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”BodyDescriptors”,

“Direction”:”InputOutput”,

“RecordType”:”BodyDescriptors_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”FaceDescriptors”,

“Direction”:”InputOutput”,

“RecordType”:”FaceDescriptors_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”InputPersonalStatus1″,

“Direction”:”OutputInput”,

“RecordType”:”PersonalStatus_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”Meaning”,

“Direction”:”OutputInput”,

“RecordType”:”Meaning_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”RefinedText2″,

“Direction”:”OutputInput”,

“RecordType”:”Meaning_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”InputPersonalStatus2″,

“Direction”:”OutputInput”,

“RecordType”:”PersonalStatus_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

}

],

“SubAIMs”:[

 

],

“Topology”:[

 

],

“Implementations”:[

 

],

“Documentation”:[

{

“Type”:”Tutorial”,

“URI”:”https://mpai.community/standards/mpai-mmc/”

}

]

}

}

2.5        Summarisation

{

“Identifier”:{

“ImplementerID”:”/* String assigned by IIDRA */”,

“Specification”:{

“Name”:”ARA”,

“AIW”:”VSV”,

“AIM”:”Summarisation”,

“Version”:”2″

},

“Description”:”This AIM produces the Summary of the Videoconference.”,

“Types”:[

{

“Name”:”Meaning_t”,

"Type":"{uint8[]}"

},

{

“Name”:”Text_t”,

"Type":"{uint8[] | uint16[]}"

},

{

“Name”:”PersonalStatus_t”,

“Type”:”uint16[]”

},

{

“Name”:”Summary_t”,

“Type”:”uint8[]”

}

],

“Ports”:[

{

“Name”:”Meaning”,

“Direction”:”InputOutput”,

“RecordType”:”Meaning_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”TextLanguageUnderstanding”,

“Direction”:”InputOutput”,

“RecordType”:”Text_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”PersonalStatus”,

“Direction”:”InputOutput”,

“RecordType”:”PersonalStatus_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”EditedSummary”,

“Direction”:”InputOutput”,

“RecordType”:”Summary_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”Summary”,

“Direction”:”OutputInput”,

“RecordType”:”Summary_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

}

],

“SubAIMs”:[

 

],

“Topology”:[

 

],

“Implementations”:[

 

],

“Documentation”:[

{

“Type”:”Tutorial”,

“URI”:”https://mpai.community/standards/mpai-mmc/”

}

]

}

}

2.6        DialogueProcessing

{

“Identifier”:{

“ImplementerID”:”/* String assigned by IIDRA */”,

“Specification”:{

"Name":"MPAI-MMC",

"AIW":"MMC-VSV",

“AIM”:”DialogueProcessing”,

“Version”:”1″

},

“Description”:”This AIM produces the Machine’s Text and Personal Status from the human’s Text and Personal Status.”,

“Types”:[

{

“Name”:”Text_t”,

"Type":"{uint8[] | uint16[]}"

},

{

“Name”:”Meaning_t”,

"Type":"{uint8[]}"

},

{

“Name”:”PersonalStatus_t”,

“Type”:”uint16[]”

},

{

“Name”:”Summary_t”,

"Type":"{uint8[]}"

}

],

“Ports”:[

{

“Name”:”Text”,

“Direction”:”InputOutput”,

“RecordType”:”Text_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”TextLanguageUnderstanding”,

“Direction”:”InputOutput”,

“RecordType”:”Text_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”Meaning”,

“Direction”:”InputOutput”,

“RecordType”:”Meaning_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”PersonalStatus”,

“Direction”:”InputOutput”,

“RecordType”:”PersonalStatus_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”EditedSummary”,

“Direction”:”OutputInput”,

“RecordType”:”Summary_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

"Name":"Summary1",

“Direction”:”InputOutput”,

“RecordType”:”Summary_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

"Name":"Summary2",

“Direction”:”OutputInput”,

“RecordType”:”Summary_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”VSPersonalStatus”,

“Direction”:”OutputInput”,

“RecordType”:”PersonalStatus_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”VSText”,

“Direction”:”OutputInput”,

“RecordType”:”Text_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

}

],

“SubAIMs”:[

 

],

“Topology”:[

 

],

“Implementations”:[

 

],

“Documentation”:[

{

“Type”:”Tutorial”,

“URI”:”https://mpai.community/standards/mpai-mmc/”

}

]

}

}

2.7        PersonalStatusDisplay

{

“Identifier”:{

“ImplementerID”:”/* String assigned by IIDRA */”,

“Specification”:{

“Name”:”ARA”,

“AIW”:”VSV”,

“AIM”:”PersonalStatusDisplay”,

“Version”:”2″

},

“Description”:”This AIM outputs the Avatar Model and renders a speaking avatar from text and Personal Status.”,

“Types”:[

{

“Name”:”AvatarModel_t”,

“Type”:”{uint8[]}”

},

{

“Name”:”Text_t”,

“Type”:”{uint8[] | uint16[]}”

},

{

“Name”:”Speech_t”,

“Type”:”uint16[]”

},

{

“Name”:”3DGraphics_t”,

“Type”:”uint8[]”

}

],

“Ports”:[

{

“Name”:”VSPersonalStatus”,

“Direction”:”InputOutput”,

“RecordType”:”PersonalStatus_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”VSText”,

“Direction”:”InputOutput”,

“RecordType”:”Text_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”AvatarModel”,

“Direction”:”OutputInput”,

“RecordType”:”3DGraphics_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”VSText”,

“Direction”:”OutputInput”,

“RecordType”:”Text_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”VSSpeech”,

“Direction”:”OutputInput”,

“RecordType”:”Speech_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”AvatarDescriptors”,

“Direction”:”OutputInput”,

"RecordType":"AvatarDescriptors_t",

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

}

],

“SubAIMs”:[

 

],

“Topology”:[

 

],

“Implementations”:[

 

],

“Documentation”:[

{

“Type”:”Tutorial”,

“URI”:”https://mpai.community/standards/mpai-mmc/”

}

]

}

}

 

  • AIW and AIM Metadata of MMC-UST

1          AIW metadata for UST

{

“$schema”:”https://json-schema.org/draft/2020-12/schema”,

“$id”:”https://mpai.community/standards/resources/MPAI-AIF/V1/AIW-AIM-metadata.schema.json”,

“title”:”UST AIF v1 AIW/AIM metadata”,

“Identifier”:{

“ImplementerID”:”/* String assigned by IIDRA */”,

“Specification”:{

“Standard”:”MPAI-MMC”,

“AIW”:”MMC-UST”,

“AIM”:”MMC-UST”,

“Version”:”1″

}

},

“APIProfile”:”Main”,

"Description":"This AIF is used to call the AIW of UST.",

“Types”:[

{

“Name”:”Speech_t”,

“Type”:”uint16[]”

},

{

“Name”:”InputSelection_t”,

“Type”:”Speech_t | Text_t”

},

{

“Name”:”Text_t”,

“Type”:”{uint8[] | uint16[]}”

},

{

“Name”:”Language_t”,

“Type”:”{uint8[]}”

}

],

“Ports”:[

{

“Name”:”InputSelection”,

“Direction”:”InputOutput”,

“RecordType”:”InputSelection_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”RequestedLanguage”,

“Direction”:”InputOutput”,

“RecordType”:”uint8[5] Language_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”InputText”,

“Direction”:”InputOutput”,

“RecordType”:”Text_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”InputSpeech1″,

“Direction”:”InputOutput”,

“RecordType”:”Speech_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”InputSpeech2″,

“Direction”:”InputOutput”,

“RecordType”:”Speech_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”TranslatedText”,

“Direction”:”OutputInput”,

“RecordType”:”Text_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”TranslatedSpeech”,

“Direction”:”OutputInput”,

“RecordType”:”Speech_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

}

],

“SubAIMs”:[

{

"Name":"SpeechRecognition",

“Identifier”:{

“ImplementerID”:”/* String assigned by IIDRA */”,

“Specification”:{

“Standard”:”MPAI-MMC”,

“AIW”:”MMC-UST”,

"AIM":"SpeechRecognition",

“Version”:”1″

}

}

},

{

“Name”:”Translation”,

“Identifier”:{

“ImplementerID”:”/* String assigned by IIDRA */”,

“Specification”:{

“Standard”:”MPAI-MMC”,

“AIW”:”MMC-UST”,

“AIM”:”Translation”,

“Version”:”1″

}

}

},

{

“Name”:”SpeechFeatureExtraction”,

“Identifier”:{

“ImplementerID”:”/* String assigned by IIDRA */”,

“Specification”:{

“Standard”:”MPAI-MMC”,

“AIW”:”MMC-UST”,

“AIM”:”SpeechFeatureExtraction”,

“Version”:”1″

}

}

},

{

“Name”:”SpeechSynthesis”,

“Identifier”:{

“ImplementerID”:”/* String assigned by IIDRA */”,

“Specification”:{

“Standard”:”MPAI-MMC”,

“AIW”:”MMC-UST”,

“AIM”:”SpeechSynthesis”,

“Version”:”1″

}

}

}

],

“Topology”:[

{

“Output”:{

“AIMName”:””,

“PortName”:”RequestedLanguage”

},

“Input”:{

“AIMName”:”Translation”,

“PortName”:”RequestedLanguage”

}

},

{

“Output”:{

“AIMName”:””,

“PortName”:”InputText”

},

“Input”:{

“AIMName”:”Translation”,

"PortName":"InputText"

}

},

{

“Output”:{

“AIMName”:””,

“PortName”:”InputSpeech1″

},

“Input”:{

“AIMName”:”SpeechRecognition”,

“PortName”:”InputSpeech1″

}

},

{

“Output”:{

“AIMName”:””,

“PortName”:”InputSpeech2″

},

“Input”:{

“AIMName”:”SpeechFeatureExtraction”,

“PortName”:”InputSpeech2″

}

},

{

“Output”:{

“AIMName”:”SpeechSynthesis”,

“PortName”:”TranslatedSpeech”

},

“Input”:{

“AIMName”:””,

“PortName”:”TranslatedSpeech”

}

},

{

“Output”:{

“AIMName”:”SpeechFeatureExtraction”,

“PortName”:”SpeechFeatures”

},

“Input”:{

“AIMName”:”SpeechSynthesis”,

“PortName”:”SpeechFeatures”

}

},

{

“Output”:{

“AIMName”:”SpeechRecognition”,

“PortName”:”RecognizedText”

},

“Input”:{

“AIMName”:”Translation”,

“PortName”:”RecognizedText”

}

},

{

“Output”:{

“AIMName”:”Translation”,

“PortName”:”TranslatedText”

},

“Input”:{

“AIMName”:”SpeechSynthesis”,

“PortName”:”TranslatedText”

}

},

{

“Output”:{

“AIMName”:”Translation”,

“PortName”:”TranslatedText”

},

“Input”:{

“AIMName”:””,

“PortName”:”TranslatedText”

}

}

],

“Implementations”:[

{

“BinaryName”:”ust.exe”,

“Architecture”:”x64″,

“OperatingSystem”:”Windows”,

“Version”:”v0.1″,

“Source”:”AIMStorage”,

“Destination”:””

}

],

“ResourcePolicies”:[

{

“Name”:”Memory”,

“Minimum”:”50000″,

“Maximum”:”100000″,

“Request”:”75000″

},

{

“Name”:”CPUNumber”,

“Minimum”:”1″,

“Maximum”:”2″,

“Request”:”1″

},

{

“Name”:”CPU:Class”,

“Minimum”:”Low”,

“Maximum”:”High”,

“Request”:”Medium”

},

{

“Name”:”GPU:CUDA:FrameBuffer”,

“Minimum”:”11GB_GDDR5X”,

“Maximum”:”8GB_GDDR6X”,

“Request”:”11GB_GDDR6″

},

{

“Name”:”GPU:CUDA:MemorySpeed”,

“Minimum”:”1.60GHz”,

“Maximum”:”1.77GHz”,

“Request”:”1.71GHz”

},

{

“Name”:”GPU:CUDA:Class”,

“Minimum”:”SM61″,

“Maximum”:”SM86″,

“Request”:”SM75″

},

{

“Name”:”GPU:Number”,

“Minimum”:”1″,

“Maximum”:”1″,

“Request”:”1″

}

],

“Documentation”:[

{

“Type”:”tutorial”,

“URI”:”https://mpai.community/standards/mpai-mmc/”

}

]

}
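Note (informative): the Topology above chains the four UST AIMs: the input speech is recognised, the recognised (or directly supplied) text is translated into the requested language, and the translated text is synthesised using the speech features extracted from the input speech. The Python sketch below restates that data flow with hypothetical callables standing in for the AIMs; none of the function names are defined by this Technical Specification.

from typing import Callable

def run_ust(input_speech: bytes,
            requested_language: str,
            speech_recognition: Callable[[bytes], str],
            translation: Callable[[str, str], str],
            speech_feature_extraction: Callable[[bytes], dict],
            speech_synthesis: Callable[[str, dict], bytes]) -> tuple:
    """Follow the UST Topology; returns (TranslatedText, TranslatedSpeech)."""
    recognized_text = speech_recognition(input_speech)                   # InputSpeech1 -> RecognizedText
    translated_text = translation(recognized_text, requested_language)   # RecognizedText -> TranslatedText
    speech_features = speech_feature_extraction(input_speech)            # InputSpeech2 -> SpeechFeatures
    translated_speech = speech_synthesis(translated_text, speech_features)  # -> TranslatedSpeech
    return translated_text, translated_speech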

2          AIM metadata

2.1        SpeechRecognition

{

“Identifier”:{

“ImplementerID”:”/* String assigned by IIDRA */”,

“Specification”:{

“Name”:”MMC”,

“AIW”:”UST”,

“AIM”:”SpeechRecognition”,

“Version”:”1″

},

"Description":"This AIM implements the speech recognition function for MMC-UST: it converts the speech of a user utterance to text.",

“Types”:[

{

“Name”:”Speech_t”,

“Type”:”uint16[]”

},

{

“Name”:”Text_t”,

“Type”:”{uint8[] | uint16[]}”

}

],

“Ports”:[

{

“Name”:”InputSpeech1″,

“Direction”:”InputOutput”,

“RecordType”:”Speech_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”RecognizedText”,

“Direction”:”OutputInput”,

“RecordType”:”Text_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

}

],

"SubAIMs":[

 

],

“Topology”:[

 

],

“Implementations”:[

 

],

“Documentation”:[

{

“Type”:”Tutorial”,

“URI”:”https://mpai.community/standards/mpai-mmc/”

}

]

}

}

2.2        Translation

{

“Identifier”:{

“ImplementerID”:”/* String assigned by IIDRA */”,

“Specification”:{

“Name”:”MMC”,

“AIW”:”UST”,

“AIM”:”Translation”,

“Version”:”1″

},

“Description”:”This AIM implements translation function for MMC-UST.”,

“Types”:[

{

“Name”:”InputSelection_t”,

“Type”:”Speech_t | Text_t”

},

{

“Name”:”Text_t”,

“Type”:”{uint8[] | uint16[]}”

},

{

“Name”:”Language_t”,

“Type”:”{uint8[]}”

}

],

“Ports”:[

{

“Name”:”InputSelection”,

“Direction”:”InputOutput”,

“RecordType”:”InputSelection_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”RequestedLanguage”,

“Direction”:”InputOutput”,

“RecordType”:”uint8[5] Language_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”InputText”,

“Direction”:”InputOutput”,

“RecordType”:”Text_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”OutputText”,

“Direction”:”OutputInput”,

“RecordType”:”Text_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”TranslatedText”,

“Direction”:”OutputInput”,

“RecordType”:”Text_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

}

],

"SubAIMs":[

 

],

“Topology”:[

 

],

“Implementations”:[

 

],

“Documentation”:[

{

“Type”:”Tutorial”,

“URI”:”https://mpai.community/standards/mpai-mmc/”

}

]

}

}

2.3        Speech Feature Extraction

{

“Identifier”:{

“ImplementerID”:”/* String assigned by IIDRA */”,

“Specification”:{

“Name”:”MMC”,

“AIW”:”UST”,

"AIM":"SpeechFeatureExtraction",

“Version”:”1″

},

"Description":"This AIM implements the speech feature extraction function for MMC-UST.",

“Types”:[

{

“Name”:”Speech_t”,

“Type”:”uint16[]”

},

{

"Name":"SpeechFeatures_t",

“Type”:”{byte pitch; string<256 tone; string<256 intonation; string<256 intensity; string<256 speed; Emotion_t emotion; float32[] NNspeechFeatures}”

}

],

“Ports”:[

{

“Name”:”InputSpeech2″,

“Direction”:”InputOutput”,

“RecordType”:”Speech_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”SpeechFeatures”,

“Direction”:”OutputInput”,

“RecordType”:”SpeechFeatures_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

}

],

"SubAIMs":[

 

],

“Topology”:[

 

],

“Implementations”:[

 

],

“Documentation”:[

{

“Type”:”Tutorial”,

“URI”:”https://mpai.community/standards/mpai-mmc/”

}

]

}

}
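Note (informative): SpeechFeatures_t collects the prosodic features preserved across translation: pitch, tone, intonation, intensity, speed, the detected Emotion and an optional neural feature vector. The Python sketch below mirrors that record; Emotion is kept as an opaque string here because Emotion_t is specified elsewhere in this document.

from dataclasses import dataclass, field
from typing import List

@dataclass
class SpeechFeatures:
    """Mirror of SpeechFeatures_t as declared in the Types array above."""
    pitch: int                 # byte
    tone: str                  # string < 256 characters
    intonation: str
    intensity: str
    speed: str
    emotion: str               # stands in for Emotion_t, defined elsewhere in this specification
    nn_speech_features: List[float] = field(default_factory=list)  # float32[] NNspeechFeatures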

 

2.4        Speech Synthesis

{

“Identifier”:{

“ImplementerID”:”/* String assigned by IIDRA */”,

“Specification”:{

“Name”:”MMC”,

“AIW”:”UST”,

“AIM”:”SpeechSynthesis”,

“Version”:”1″

},

“Description”:”This AIM implements speech synthesis function for MMC-UST.”,

“Types”:[

{

“Name”:”Speech_t”,

“Type”:”uint16[]”

},

{

"Name":"SpeechFeatures_t",

“Type”:”{byte pitch; string<256 tone; string<256 intonation; string<256 intensity; string<256 speed; Emotion_t emotion; float32[] NNspeechFeatures}”

},

{

“Name”:”Text_t”,

“Type”:”{uint8[] | uint16[]}”

}

],

“Ports”:[

{

“Name”:”TranslatedText”,

“Direction”:”InputOutput”,

“RecordType”:”Text_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”SpeechFeatures”,

“Direction”:”InputOutput”,

“RecordType”:”SpeechFeatures_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”OutputSpeech”,

“Direction”:”OutputInput”,

“RecordType”:”Speech_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

}

],

"SubAIMs":[

 

],

“Topology”:[

 

],

“Implementations”:[

 

],

“Documentation”:[

{

“Type”:”tutorial”,

“URI”:”https://mpai.community/standards/mpai-mmc/”

}

]

}

}

 

 

 

 

  • AIW and AIM Metadata of MMC-BST

1          AIW metadata for BST

{

“$schema”:”https://json-schema.org/draft/2020-12/schema”,

“$id”:”https://mpai.community/standards/resources/MPAI-AIF/V1/AIW-AIM-metadata.schema.json”,

“title”:”BST AIF v1 AIW/AIM metadata”,

“Identifier”:{

“ImplementerID”:”/* String assigned by IIDRA */”,

“Specification”:{

“Standard”:”MPAI-MMC”,

“AIW”:”MMC-BST”,

“AIM”:”MMC-BST”,

“Version”:”1″

}

},

“APIProfile”:”Main”,

"Description":"This AIF is used to call the AIW of BST.",

“Types”:[

{

“Name”:”Speech_t”,

“Type”:”uint16[]”

},

{

“Name”:”InputSelection_t”,

“Type”:”Speech_t | Text_t”

},

{

“Name”:”Text_t”,

“Type”:”{uint8[] | uint16[]}”

},

{

“Name”:”Language_t”,

“Type”:”{uint8[]}”

}

],

“Ports”:[

{

“Name”:”InputSelection”,

“Direction”:”InputOutput”,

“RecordType”:”InputSelection_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”RequestedLanguage”,

“Direction”:”InputOutput”,

“RecordType”:”uint8[5] Language_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”InputText1″,

“Direction”:”InputOutput”,

“RecordType”:”Text_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”InputText2″,

“Direction”:”InputOutput”,

“RecordType”:”Text_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”InputSpeech1″,

“Direction”:”InputOutput”,

“RecordType”:”Speech_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”InputSpeech2″,

“Direction”:”InputOutput”,

“RecordType”:”Speech_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”InputSpeech3″,

“Direction”:”InputOutput”,

“RecordType”:”Speech_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”InputSpeech4″,

“Direction”:”InputOutput”,

“RecordType”:”Speech_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”TranslatedText1″,

“Direction”:”OutputInput”,

“RecordType”:”Text_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”TranslatedText2″,

“Direction”:”OutputInput”,

“RecordType”:”Text_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”TranslatedSpeech1″,

“Direction”:”OutputInput”,

“RecordType”:”Speech_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”TranslatedSpeech2″,

“Direction”:”OutputInput”,

“RecordType”:”Speech_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

}

],

“SubAIMs”:[

{

"Name":"SpeechRecognition",

“Identifier”:{

“ImplementerID”:”/* String assigned by IIDRA */”,

“Specification”:{

“Standard”:”MPAI-MMC”,

“AIW”:”MMC-BST”,

"AIM":"SpeechRecognition",

“Version”:”1″

}

}

},

{

“Name”:”Translation”,

“Identifier”:{

“ImplementerID”:”/* String assigned by IIDRA */”,

“Specification”:{

“Standard”:”MPAI-MMC”,

“AIW”:”MMC-BST”,

“AIM”:”Translation”,

“Version”:”1″

}

}

},

{

“Name”:”SpeechFeatureExtraction”,

“Identifier”:{

“ImplementerID”:”/* String assigned by IIDRA */”,

“Specification”:{

“Standard”:”MPAI-MMC”,

“AIW”:”MMC-BST”,

“AIM”:”SpeechFeatureExtraction”,

“Version”:”1″

}

}

},

{

“Name”:”SpeechSynthesis”,

“Identifier”:{

“ImplementerID”:”/* String assigned by IIDRA */”,

“Specification”:{

“Standard”:”MPAI-MMC”,

“AIW”:”MMC-BST”,

“AIM”:”SpeechSynthesis”,

“Version”:”1″

}

}

}

],

“Topology”:[

{

“Output”:{

“AIMName”:””,

"PortName":"RequestedLanguage"

},

“Input”:{

“AIMName”:”Translation”,

“PortName”:”RequestedLanguage”

}

},

{

“Output”:{

“AIMName”:””,

“PortName”:”InputText1″

},

“Input”:{

“AIMName”:”Translation”,

"PortName":"InputText1"

}

},

{

“Output”:{

“AIMName”:””,

“PortName”:”InputText2″

},

“Input”:{

“AIMName”:”Translation”,

“PortName”:”InputText2″

}

},

{

“Output”:{

“AIMName”:””,

“PortName”:”InputSpeech1″

},

“Input”:{

“AIMName”:”SpeechRecognition”,

“PortName”:”InputSpeech1″

}

},

{

“Output”:{

“AIMName”:””,

“PortName”:”InputSpeech2″

},

“Input”:{

“AIMName”:”SpeechRecognition”,

“PortName”:”InputSpeech2″

}

},

{

“Output”:{

“AIMName”:””,

“PortName”:”InputSpeech3″

},

“Input”:{

“AIMName”:”SpeechFeatureExtraction”,

“PortName”:”InputSpeech3″

}

},

{

“Output”:{

“AIMName”:””,

“PortName”:”InputSpeech4″

},

“Input”:{

“AIMName”:”SpeechFeatureExtraction”,

“PortName”:”InputSpeech4″

}

},

{

“Output”:{

“AIMName”:”SpeechSynthesis”,

“PortName”:”TranslatedSpeech1″

},

“Input”:{

“AIMName”:””,

“PortName”:”TranslatedSpeech1″

}

},

{

“Output”:{

“AIMName”:”SpeechSynthesis”,

“PortName”:”TranslatedSpeech2″

},

“Input”:{

“AIMName”:””,

“PortName”:”TranslatedSpeech2″

}

},

{

“Output”:{

“AIMName”:”SpeechFeatureExtraction”,

“PortName”:”SpeechFeatures1″

},

“Input”:{

“AIMName”:”SpeechSynthesis”,

“PortName”:”SpeechFeatures1″

}

},

{

“Output”:{

“AIMName”:”SpeechFeatureExtraction”,

“PortName”:”SpeechFeatures2″

},

“Input”:{

“AIMName”:”SpeechSynthesis”,

“PortName”:”SpeechFeatures2″

}

},

{

“Output”:{

“AIMName”:”SpeechRecognition”,

“PortName”:”RecognizedText1″

},

“Input”:{

“AIMName”:”Translation”,

“PortName”:”RecognizedText1″

}

},

{

“Output”:{

“AIMName”:”SpeechRecognition”,

“PortName”:”RecognizedText2″

},

“Input”:{

“AIMName”:”Translation”,

“PortName”:”RecognizedText2″

}

},

{

“Output”:{

“AIMName”:”Translation”,

“PortName”:”TranslatedText1″

},

“Input”:{

“AIMName”:””,

“PortName”:”TranslatedText1″

}

},

{

“Output”:{

“AIMName”:”Translation”,

“PortName”:”TranslatedText2″

},

“Input”:{

“AIMName”:””,

“PortName”:”TranslatedText2″

}

},

{

“Output”:{

“AIMName”:”Translation”,

“PortName”:”TranslatedText3″

},

“Input”:{

“AIMName”:”SpeechSynthesis”,

“PortName”:”TranslatedText3″

}

},

{

“Output”:{

“AIMName”:”Translation”,

“PortName”:”TranslatedText4″

},

“Input”:{

“AIMName”:”SpeechSynthesis”,

“PortName”:”TranslatedText4″

}

}

],

“Implementations”:[

{

“BinaryName”:”bst.exe”,

“Architecture”:”x64″,

“OperatingSystem”:”Windows”,

“Version”:”v0.1″,

“Source”:”AIMStorage”,

“Destination”:””

}

],

“ResourcePolicies”:[

{

“Name”:”Memory”,

“Minimum”:”50000″,

“Maximum”:”100000″,

“Request”:”75000″

},

{

“Name”:”CPUNumber”,

“Minimum”:”1″,

“Maximum”:”2″,

“Request”:”1″

},

{

“Name”:”CPU:Class”,

“Minimum”:”Low”,

“Maximum”:”High”,

“Request”:”Medium”

},

{

“Name”:”GPU:CUDA:FrameBuffer”,

“Minimum”:”11GB_GDDR5X”,

“Maximum”:”8GB_GDDR6X”,

“Request”:”11GB_GDDR6″

},

{

“Name”:”GPU:CUDA:MemorySpeed”,

“Minimum”:”1.60GHz”,

“Maximum”:”1.77GHz”,

“Request”:”1.71GHz”

},

{

“Name”:”GPU:CUDA:Class”,

“Minimum”:”SM61″,

“Maximum”:”SM86″,

“Request”:”SM75″

},

{

“Name”:”GPU:Number”,

“Minimum”:”1″,

“Maximum”:”1″,

“Request”:”1″

}

],

“Documentation”:[

{

“Type”:”tutorial”,

“URI”:”https://mpai.community/standards/mpai-mmc/”

}

]

}
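Note (informative): in the bilingual case the same four AIMs serve both translation directions, so the ports of the unidirectional chain are duplicated and distinguished by their numeric suffix. The Python dictionary below gives one plausible grouping of those port names per direction, consistent with the numbering used above; the Topology itself only lists the individual connections, so the pairing is an assumption.

# Port names used by the BST Topology, grouped per translation direction (informative).
BST_DIRECTIONS = {
    1: {
        "input_speech_asr": "InputSpeech1",
        "input_speech_features": "InputSpeech3",
        "recognized_text": "RecognizedText1",
        "translated_text": "TranslatedText1",      # to the AIW output
        "translated_text_tts": "TranslatedText3",  # to Speech Synthesis
        "speech_features": "SpeechFeatures1",
        "translated_speech": "TranslatedSpeech1",
    },
    2: {
        "input_speech_asr": "InputSpeech2",
        "input_speech_features": "InputSpeech4",
        "recognized_text": "RecognizedText2",
        "translated_text": "TranslatedText2",
        "translated_text_tts": "TranslatedText4",
        "speech_features": "SpeechFeatures2",
        "translated_speech": "TranslatedSpeech2",
    },
}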

 

2          AIM metadata

2.1        SpeechRecognition

{

“Identifier”:{

“ImplementerID”:”/* String assigned by IIDRA */”,

“Specification”:{

“Name”:”MMC”,

“AIW”:”BST”,

“AIM”:”SpeechRecognition”,

“Version”:”1″

},

"Description":"This AIM implements the speech recognition function for MMC-BST: it converts the speech of a user utterance to text.",

“Types”:[

{

“Name”:”Speech_t”,

“Type”:”uint16[]”

},

{

“Name”:”Text_t”,

“Type”:”{uint8[] | uint16[]}”

}

],

“Ports”:[

{

“Name”:”InputSpeech1″,

“Direction”:”InputOutput”,

“RecordType”:”Speech_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”InputSpeech2″,

“Direction”:”InputOutput”,

“RecordType”:”Speech_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”RecognizedText1″,

“Direction”:”OutputInput”,

“RecordType”:”Text_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”RecognizedText2″,

“Direction”:”OutputInput”,

“RecordType”:”Text_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

}

],

"SubAIMs":[

 

],

“Topology”:[

 

],

“Implementations”:[

 

],

“Documentation”:[

{

“Type”:”Tutorial”,

“URI”:”https://mpai.community/standards/mpai-mmc/”

}

]

}

}

2.2        Translation

{

“Identifier”:{

“ImplementerID”:”/* String assigned by IIDRA */”,

“Specification”:{

“Name”:”MMC”,

“AIW”:”BST”,

“AIM”:”Translation”,

“Version”:”1″

},

“Description”:”This AIM implements translation function for MMC-BST.”,

“Types”:[

{

“Name”:”InputSelection_t”,

“Type”:”Speech_t | Text_t”

},

{

“Name”:”Text_t”,

“Type”:”{uint8[] | uint16[]}”

},

{

“Name”:”Language_t”,

“Type”:”{uint8[]}”

}

],

“Ports”:[

{

“Name”:”InputSelection”,

“Direction”:”InputOutput”,

“RecordType”:”InputSelection_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”RequestedLanguages”,

“Direction”:”InputOutput”,

“RecordType”:”uint8[5] Language_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”InputText1″,

“Direction”:”InputOutput”,

“RecordType”:”Text_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”InputText2″,

“Direction”:”InputOutput”,

“RecordType”:”Text_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”TranslatedText1″,

“Direction”:”OutputInput”,

“RecordType”:”Text_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”TranslatedText2″,

“Direction”:”OutputInput”,

“RecordType”:”Text_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”TranslatedText3″,

“Direction”:”OutputInput”,

“RecordType”:”Text_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”TranslatedText4″,

“Direction”:”OutputInput”,

“RecordType”:”Text_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

}

],

"SubAIMs":[

 

],

“Topology”:[

 

],

“Implementations”:[

 

],

“Documentation”:[

{

“Type”:”Tutorial”,

“URI”:”https://mpai.community/standards/mpai-mmc/”

}

]

}

}

2.3        Speech Feature Extraction

{

“Identifier”:{

“ImplementerID”:”/* String assigned by IIDRA */”,

“Specification”:{

“Name”:”MMC”,

“AIW”:”BST”,

"AIM":"SpeechFeatureExtraction",

“Version”:”1″

},

"Description":"This AIM implements the speech feature extraction function for MMC-BST.",

“Types”:[

{

“Name”:”Speech_t”,

“Type”:”uint16[]”

},

{

"Name":"SpeechFeatures_t",

“Type”:”{byte pitch; string<256 tone; string<256 intonation; string<256 intensity; string<256 speed; Emotion_t emotion; float32[] NNspeechFeatures}”

}

],

“Ports”:[

{

“Name”:”InputSpeech3″,

“Direction”:”InputOutput”,

“RecordType”:”Speech_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”InputSpeech4″,

“Direction”:”InputOutput”,

“RecordType”:”Speech_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”SpeechFeatures1″,

“Direction”:”OutputInput”,

“RecordType”:”SpeechFeatures_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”SpeechFeatures2″,

“Direction”:”OutputInput”,

“RecordType”:”SpeechFeatures_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

}

],

"SubAIMs":[

 

],

“Topology”:[

 

],

“Implementations”:[

 

],

“Documentation”:[

{

“Type”:”Tutorial”,

“URI”:”https://mpai.community/standards/mpai-mmc/”

}

]

}

}

 

2.4        Speech Synthesis

{

“Identifier”:{

“ImplementerID”:”/* String assigned by IIDRA */”,

“Specification”:{

“Name”:”MMC”,

“AIW”:”BST”,

“AIM”:”SpeechSynthesis”,

“Version”:”1″

},

“Description”:”This AIM implements speech synthesis function for MMC-BST.”,

“Types”:[

{

“Name”:”Speech_t”,

“Type”:”uint16[]”

},

{

"Name":"SpeechFeatures_t",

“Type”:”{byte pitch; string<256 tone; string<256 intonation; string<256 intensity; string<256 speed; Emotion_t emotion; float32[] NNspeechFeatures}”

},

{

“Name”:”Text_t”,

“Type”:”{uint8[] | uint16[]}”

}

],

“Ports”:[

{

“Name”:”TranslatedText3″,

“Direction”:”InputOutput”,

“RecordType”:”Text_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”TranslatedText4″,

“Direction”:”InputOutput”,

“RecordType”:”Text_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”SpeechFeatures1″,

“Direction”:”InputOutput”,

“RecordType”:”SpeechFeatures_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”SpeechFeatures2″,

“Direction”:”InputOutput”,

“RecordType”:”SpeechFeatures_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”TranslatedSpeech1″,

“Direction”:”OutputInput”,

“RecordType”:”Speech_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”TranslatedSpeech2″,

“Direction”:”OutputInput”,

“RecordType”:”Speech_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

}

],

"SubAIMs":[

 

],

“Topology”:[

 

],

“Implementations”:[

 

],

“Documentation”:[

{

“Type”:”tutorial”,

“URI”:”https://mpai.community/standards/mpai-mmc/”

}

]

}

}

 

 

 

 

  • AIW and AIM Metadata of MMC-MST

1.        AIW metadata for MST

{

“$schema”:”https://json-schema.org/draft/2020-12/schema”,

“$id”:”https://mpai.community/standards/resources/MPAI-AIF/V1/AIW-AIM-metadata.schema.json”,

“title”:”MST AIF v1 AIW/AIM metadata”,

“Identifier”:{

“ImplementerID”:”/* String assigned by IIDRA */”,

“Specification”:{

“Standard”:”MPAI-MMC”,

“AIW”:”MMC-MST”,

“AIM”:”MMC-MST”,

“Version”:”1″

}

},

“APIProfile”:”Main”,

“Description”:”This AIF is used to call the AIW of MMC-MST.”,

“Types”:[

{

“Name”:”Speech_t”,

“Type”:”uint16[]”

},

{

“Name”:”InputSelection_t”,

“Type”:”Speech_t | Text_t”

},

{

“Name”:”Text_t”,

“Type”:”{uint8[] | uint16[]}”

},

{

“Name”:”Language_t”,

“Type”:”{uint8[]}”

}

],

“Ports”:[

{

“Name”:”InputSelection”,

“Direction”:”InputOutput”,

“RecordType”:”InputSelection_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”RequestedLanguage”,

“Direction”:”InputOutput”,

“RecordType”:”Language_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”InputText”,

“Direction”:”InputOutput”,

“RecordType”:”Text_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”InputSpeech1″,

“Direction”:”InputOutput”,

“RecordType”:”Speech_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”InputSpeech2″,

“Direction”:”InputOutput”,

“RecordType”:”Speech_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”OutputText1″,

“Direction”:”OutputInput”,

“RecordType”:”Text_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”OutputText2″,

“Direction”:”OutputInput”,

“RecordType”:”Text_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”OutputTextN”,

“Direction”:”OutputInput”,

“RecordType”:”Text_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”InterpretedSpeech1″,

“Direction”:”OutputInput”,

“RecordType”:”Speech_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”InterpretedSpeech2″,

“Direction”:”OutputInput”,

“RecordType”:”Speech_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”InterpretedSpeechN”,

“Direction”:”OutputInput”,

“RecordType”:”Speech_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

}

],

“SubAIMs”:[

{

“Name”:”SpeechRecognition”,

“Identifier”:{

“ImplementerID”:”/* String assigned by IIDRA */”,

“Specification”:{

“Standard”:”MPAI-MMC”,

“AIW”:”MMC-MST”,

“AIM”:”SpeechRecognition”,

“Version”:”1″

}

}

},

{

“Name”:”Translation”,

“Identifier”:{

“ImplementerID”:”/* String assigned by IIDRA */”,

“Specification”:{

“Standard”:”MPAI-MMC”,

“AIW”:”MMC-MST”,

“AIM”:”Translation”,

“Version”:”1″

}

}

},

{

“Name”:”SpeechFeatureExtraction”,

“Identifier”:{

“ImplementerID”:”/* String assigned by IIDRA */”,

“Specification”:{

“Standard”:”MPAI-MMC”,

“AIW”:”MMC-MST”,

“AIM”:”SpeechFeatureExtraction”,

“Version”:”1″

}

}

},

{

“Name”:”SpeechSynthesis”,

“Identifier”:{

“ImplementerID”:”/* String assigned by IIDRA */”,

“Specification”:{

“Standard”:”MPAI-MMC”,

“AIW”:”MMC-MST”,

“AIM”:”SpeechSynthesis”,

“Version”:”1″

}

}

}

],

“Topology”:[

{

“Output”:{

“AIMName”:””,

“PortName”:”RequestedLanguage”

},

“Input”:{

“AIMName”:”Translation”,

“PortName”:”RequestedLanguage”

}

},

{

“Output”:{

“AIMName”:””,

“PortName”:”InputText”

},

“Input”:{

“AIMName”:”Translation”,

“PortName”:”InputText”

}

},

{

“Output”:{

“AIMName”:””,

“PortName”:”InputSpeech1″

},

“Input”:{

“AIMName”:”SpeechRecognition”,

“PortName”:”InputSpeech1″

}

},

{

“Output”:{

“AIMName”:””,

“PortName”:”InputSpeech2″

},

“Input”:{

“AIMName”:”SpeechFeatureExtraction”,

“PortName”:”InputSpeech2″

}

},

{

“Output”:{

“AIMName”:”SpeechSynthesis”,

“PortName”:”InterpretedSpeech1″

},

“Input”:{

“AIMName”:””,

“PortName”:”InterpretedSpeech1″

}

},

{

“Output”:{

“AIMName”:”SpeechSynthesis”,

“PortName”:”InterpretedSpeech2″

},

“Input”:{

“AIMName”:””,

“PortName”:”InterpretedSpeech2″

}

},

{

“Output”:{

“AIMName”:”SpeechSynthesis”,

“PortName”:”InterpretedSpeechN”

},

“Input”:{

“AIMName”:””,

“PortName”:”InterpretedSpeechN”

}

},

{

“Output”:{

“AIMName”:”SpeechFeatureExtraction”,

“PortName”:”SpeechFeatures”

},

“Input”:{

“AIMName”:”SpeechSynthesis”,

“PortName”:”SpeechFeatures”

}

},

{

“Output”:{

“AIMName”:”SpeechRecognition”,

“PortName”:”RecognizedText”

},

“Input”:{

“AIMName”:”Translation”,

“PortName”:”RecognizedText”

}

},

{

“Output”:{

“AIMName”:”Translation”,

“PortName”:”TranslatedText1″

},

“Input”:{

“AIMName”:”SpeechSynthesis”,

“PortName”:”TranslatedText1″

}

},

{

“Output”:{

“AIMName”:”Translation”,

“PortName”:”TranslatedText2″

},

“Input”:{

“AIMName”:”SpeechSynthesis”,

“PortName”:”TranslatedText2″

}

},

{

“Output”:{

“AIMName”:”Translation”,

“PortName”:”TranslatedTextN”

},

“Input”:{

“AIMName”:”SpeechSynthesis”,

“PortName”:”TranslatedTextN”

}

},

{

“Output”:{

“AIMName”:”Translation”,

“PortName”:”OutputText1″

},

“Input”:{

“AIMName”:””,

“PortName”:”OutputText1″

}

},

{

“Output”:{

“AIMName”:”Translation”,

“PortName”:”OutputText2″

},

“Input”:{

“AIMName”:””,

“PortName”:”OutputText2″

}

},

{

“Output”:{

“AIMName”:”Translation”,

“PortName”:”OutputTextN”

},

“Input”:{

“AIMName”:””,

“PortName”:”OutputTextN”

}

}

],

“Implementations”:[

{

“BinaryName”:”mst.exe”,

“Architecture”:”x64″,

“OperatingSystem”:”Windows”,

“Version”:”v0.1″,

“Source”:”AIMStorage”,

“Destination”:””

}

],

“ResourcePolicies”:[

{

“Name”:”Memory”,

“Minimum”:”50000″,

“Maximum”:”100000″,

“Request”:”75000″

},

{

“Name”:”CPUNumber”,

“Minimum”:”1″,

“Maximum”:”2″,

“Request”:”1″

},

{

“Name”:”CPU:Class”,

“Minimum”:”Low”,

“Maximum”:”High”,

“Request”:”Medium”

},

{

“Name”:”GPU:CUDA:FrameBuffer”,

“Minimum”:”11GB_GDDR5X”,

“Maximum”:”8GB_GDDR6X”,

“Request”:”11GB_GDDR6″

},

{

“Name”:”GPU:CUDA:MemorySpeed”,

“Minimum”:”1.60GHz”,

“Maximum”:”1.77GHz”,

“Request”:”1.71GHz”

},

{

“Name”:”GPU:CUDA:Class”,

“Minimum”:”SM61″,

“Maximum”:”SM86″,

“Request”:”SM75″

},

{

“Name”:”GPU:Number”,

“Minimum”:”1″,

“Maximum”:”1″,

“Request”:”1″

}

],

“Documentation”:[

{

“Type”:”Tutorial”,

“URI”:”https://mpai.community/standards/mpai-mmc/”

}

]

}

2.        AIM metadata

2.1        SpeechRecognition

{

“Identifier”:{

“ImplementerID”:”/* String assigned by IIDRA */”,

“Specification”:{

“Name”:”MMC”,

“AIW”:”MST”,

“AIM”:”SpeechRecognition”,

“Version”:”1″

},

“Description”:”This AIM implements the speech recognition function for MMC-MST: it converts the user’s speech to text.”,

“Types”:[

{

“Name”:”Speech_t”,

“Type”:”uint16[]”

},

{

“Name”:”Text_t”,

“Type”:”{uint8[] | uint16[]}”

}

],

“Ports”:[

{

“Name”:”InputSpeech1″,

“Direction”:”InputOutput”,

“RecordType”:”Speech_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”RecognizedText”,

“Direction”:”OutputInput”,

“RecordType”:”Text_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

}

],

“SubAIMs”:[

 

],

“Topology”:[

 

],

“Implementations”:[

 

],

“Documentation”:[

{

“Type”:”Tutorial”,

“URI”:”https://mpai.community/standards/mpai-mmc/”

}

]

}

}

2.2        Translation

{

“Identifier”:{

“ImplementerID”:”/* String assigned by IIDRA */”,

“Specification”:{

“Name”:”MMC”,

“AIW”:”MST”,

“AIM”:”Translation”,

“Version”:”1″

},

“Description”:”This AIM implements the translation function for MMC-MST: it converts source language text to target language text.”,

“Types”:[

{

“Name”:”Speech_t”,

“Type”:”uint16[]”

},

{

“Name”:”Text_t”,

“Type”:”{uint8[] | uint16[]}”

}

],

“Ports”:[

{

“Name”:”RecognizedText”,

“Direction”:”InputOutput”,

“RecordType”:”Text_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”TranslatedText1″,

“Direction”:”OutputInput”,

“RecordType”:”Text_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

}

],

“SubAIMs”:[

 

],

“Topology”:[

 

],

“Implementations”:[

 

],

“Documentation”:[

{

“Type”:”Tutorial”,

“URI”:”https://mpai.community/standards/mpai-mmc/”

}

]

}

}

2.3        Speech Feature Extraction

{

“Identifier”:{

“ImplementerID”:”/* String assigned by IIDRA */”,

“Specification”:{

“Name”:”MMC”,

“AIW”:”MST”,

“AIM”:”SpeechFeatureExtraction”,

“Version”:”1″

},

“Description”:”This AIM implements the speech feature extraction function for MMC-MST: it extracts specified features from the user’s source language speech so that these can be used during speech synthesis of the target text.”,

“Types”:[

{

“Name”:”Speech_t”,

“Type”:”uint16[]”

},

{

“Name”:”SpeechFeatures_t”,

“Type”:”{byte pitch; string<256 tone; string<256 intonation; string<256 intensity; string<256 speed; Emotion_t emotion; float32[] NNspeechFeatures}”

}

],

“Ports”:[

{

“Name”:”InputSpeech2″,

“Direction”:”InputOutput”,

“RecordType”:”Speech_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”SpeechFeatures”,

“Direction”:”OutputInput”,

“RecordType”:”SpeechFeatures_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

}

],

“SubAIMs”:[

 

],

“Topology”:[

 

],

“Implementations”:[

 

],

“Documentation”:[

{

“Type”:”Tutorial”,

“URI”:”https://mpai.community/standards/mpai-mmc/”

}

]

}

}

 

2.4        Speech Synthesis

 

{

“Identifier”:{

“ImplementerID”:”/* String assigned by IIDRA */”,

“Specification”:{

“Name”:”MMC”,

“AIW”:”MST”,

“AIM”:”SpeechSynthesis”,

“Version”:”1″

},

“Description”:”This AIM implements the speech synthesis function for MMC-MST: it receives target language text and optionally speech features extracted from the source language speech and produces target language speech.”,

“Types”:[

{

“Name”:”Speech_t”,

“Type”:”uint16[]”

},

{

“Name”:”SpeechFeatures_t”,

“Type”:”{byte pitch; string<256 tone; string<256 intonation; string<256 intensity; string<256 speed; Emotion_t emotion; float32[] NNspeechFeatures}”

},

{

“Name”:”Text_t”,

“Type”:”{uint8[] | uint16[]}”

}

],

“Ports”:[

{

“Name”:”TranslatedText1″,

“Direction”:”InputOutput”,

“RecordType”:”Text_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”TranslatedText2″,

“Direction”:”InputOutput”,

“RecordType”:”Text_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”TranslatedTextN”,

“Direction”:”InputOutput”,

“RecordType”:”Text_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”OutputText1″,

“Direction”:”InputOutput”,

“RecordType”:”Text_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”OutputText2″,

“Direction”:”InputOutput”,

“RecordType”:”Text_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”OutputTextN”,

“Direction”:”InputOutput”,

“RecordType”:”Text_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”SpeechFeatures”,

“Direction”:”InputOutput”,

“RecordType”:”SpeechFeatures_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”InterpretedSpeech1″,

“Direction”:”OutputInput”,

“RecordType”:”Speech_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”InterpretedSpeech2″,

“Direction”:”OutputInput”,

“RecordType”:”Speech_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”InterpretedSpeechN”,

“Direction”:”OutputInput”,

“RecordType”:”Speech_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

}

],

“SubAIMs”:[

 

],

“Topology”:[

 

],

“Implementations”:[

 

],

“Documentation”:[

{

“Type”:”Tutorial”,

“URI”:”https://mpai.community/standards/mpai-mmc/”

}

]

}

}

 

 

  • Metadata of MMC-PSE Composite AIM

1.        PersonalStatusExtraction

{

“Identifier”:{

“ImplementerID”:”/* String assigned by IIDRA */”,

“Specification”:{

“Name”:”MPAI-MMC”,

“AIW”:””,

“AIM”:”PersonalStatusExtraction”,

“Version”:”2″

},

“Description”:”This AIM implements the Personal Status Extraction function.”,

“Types”:[

{

“Name”:”InputSelection_t”,

“Type”:”uint8[]”

},

{

“Name”:”Text_t”,

“Type”:”uint8[] | uint16[]”

},

{

“Name”:”PSTextDescriptors_t”,

“Type”:”uint8[]”

},

{

“Name”:”Speech_t”,

“Type”:”uint16[]”

},

{

“Name”:”PSSpeechDescriptors_t”,

“Type”:”uint8[]”

},

{

“Name”:”FaceObject_t”,

“Type”:”uint24[]”

},

{

“Name”:”PSFaceDescriptors_t”,

“Type”:”uint8[]”

},

{

“Name”:”BodyObject_t”,

“Type”:”uint[]”

},

{

“Name”:”PSGestureDescriptors_t”,

“Type”:”uint8[]”

}

],

“Ports”:[

{

“Name”:”InputSelection”,

“Direction”:”InputOutput”,

“RecordType”:”InputSelection_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”TextObject”,

“Direction”:”InputOutput”,

“RecordType”:”Text_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”TextDescriptors”,

“Direction”:”InputOutput”,

“RecordType”:”TextDescriptors_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”SpeechObject”,

“Direction”:”InputOutput”,

“RecordType”:”Speech_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”SpeechDescriptors”,

“Direction”:”InputOutput”,

“RecordType”:”SpeechDescriptors_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”FaceObject”,

“Direction”:”InputOutput”,

“RecordType”:”FaceObject_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”FaceDescriptors”,

“Direction”:”InputOutput”,

“RecordType”:”FaceDescriptors_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”BodyObject”,

“Direction”:”InputOutput”,

“RecordType”:”BodyObject_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”BodyDescriptors”,

“Direction”:”InputOutput”,

“RecordType”:”BodyDescriptors_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”PersonalStatus”,

“Direction”:”OutputInput”,

“RecordType”:”PersonalStatus_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

}

],

“SubAIMs”:[

{

“Name”:”PSTextDescription”,

“Identifier”:{

“ImplementerID”:”/* String assigned by IIDRA */”,

“Specification”:{

“Standard”:”MPAI-MMC”,

“AIW”:””,

“AIM”:”PSTextDescription”,

“Version”:”2″

}

}

},

{

“Name”:”PSSpeechDescription”,

“Identifier”:{

“ImplementerID”:”/* String assigned by IIDRA */”,

“Specification”:{

“Standard”:”MPAI-MMC”,

“AIW”:””,

“AIM”:”PSSpeechDescription”,

“Version”:”2″

}

}

},

{

“Name”:”PSFaceDescription”,

“Identifier”:{

“ImplementerID”:”/* String assigned by IIDRA */”,

“Specification”:{

“Standard”:”MPAI-MMC”,

“AIW”:””,

“AIM”:”PSFaceDescription”,

“Version”:”2″

}

}

},

{

“Name”:”PSGestureDescription”,

“Identifier”:{

“ImplementerID”:”/* String assigned by IIDRA */”,

“Specification”:{

“Standard”:”MPAI-MMC”,

“AIW”:””,

“AIM”:”PSGestureDescription”,

“Version”:”2″

}

}

},

{

“Name”:”PSTextInterpretation”,

“Identifier”:{

“ImplementerID”:”/* String assigned by IIDRA */”,

“Specification”:{

“Standard”:”MPAI-MMC”,

“AIW”:””,

“AIM”:”PSTextInterpretation”,

“Version”:”2″

}

}

},

{

“Name”:”PSSpeechInterpretation”,

“Identifier”:{

“ImplementerID”:”/* String assigned by IIDRA */”,

“Specification”:{

“Standard”:”MPAI-MMC”,

“AIW”:””,

“AIM”:”PSSpeechInterpretation”,

“Version”:”2″

}

}

},

{

“Name”:”PSFaceInterpretation”,

“Identifier”:{

“ImplementerID”:”/* String assigned by IIDRA */”,

“Specification”:{

“Standard”:”MPAI-MMC”,

“AIW”:””,

“AIM”:”PSFaceInterpretation”,

“Version”:”2″

}

}

},

{

“Name”:”PSGestureInterpretation”,

“Identifier”:{

“ImplementerID”:”/* String assigned by IIDRA */”,

“Specification”:{

“Standard”:”MPAI-MMC”,

“AIW”:””,

“AIM”:”PSGestureInterpretation”,

“Version”:”2″

}

}

},

{

“Name”:”PersonalStatusCombination”,

“Identifier”:{

“ImplementerID”:”/* String assigned by IIDRA */”,

“Specification”:{

“Standard”:”MPAI-MMC”,

“AIW”:””,

“AIM”:”PersonalStatusCombination”,

“Version”:”2″

}

}

}

],

“Topology”:[

{

“Output”:{

“AIMName”:””,

“PortName”:”InputSelection”

},

“Input”:{

“AIMName”:””,

“PortName”:”InputSelection”

}

},

{

“Output”:{

“AIMName”:””,

“PortName”:”TextObject”

},

“Input”:{

“AIMName”:”PSTextDescription”,

“PortName”:”TextObject”

}

},

{

“Output”:{

“AIMName”:””,

“PortName”:”SpeechObject”

},

“Input”:{

“AIMName”:”PSSpeechDescription”,

“PortName”:”SpeechObject”

}

},

{

“Output”:{

“AIMName”:””,

“PortName”:”FaceObject”

},

“Input”:{

“AIMName”:”PSFaceDescription”,

“PortName”:”FaceObject”

}

},

{

“Output”:{

“AIMName”:””,

“PortName”:”BodyObject”

},

“Input”:{

“AIMName”:”PSGestureDescription”,

“PortName”:”BodyObject”

}

},

{

“Output”:{

“AIMName”:”PSTextDescription”,

“PortName”:”PSTextDescriptors”

},

“Input”:{

“AIMName”:”PSTextInterpretation”,

“PortName”:”PSTextDescriptors”

}

},

{

“Output”:{

“AIMName”:””,

“PortName”:”TextDescriptors”

},

“Input”:{

“AIMName”:”PSTextInterpretation”,

“PortName”:”TextDescriptors”

}

},

{

“Output”:{

“AIMName”:”PSSpeechDescription”,

“PortName”:”PSSpeechDescriptors”

},

“Input”:{

“AIMName”:”PSSpeechInterpretation”,

“PortName”:”PSSpeechDescriptors”

}

},

{

“Output”:{

“AIMName”:””,

“PortName”:”SpeechDescriptors”

},

“Input”:{

“AIMName”:”PSSpeechInterpretation”,

“PortName”:”SpeechDescriptors”

}

},

{

“Output”:{

“AIMName”:”PSFaceDescription”,

“PortName”:”PSFaceDescriptors”

},

“Input”:{

“AIMName”:”PSFaceInterpretation”,

“PortName”:”PSFaceDescriptors”

}

},

{

“Output”:{

“AIMName”:””,

“PortName”:”FaceDescriptors”

},

“Input”:{

“AIMName”:”PSFaceInterpretation”,

“PortName”:”FaceDescriptors”

}

},

{

“Output”:{

“AIMName”:”PSGestureDescription”,

“PortName”:”PSGestureDescriptors”

},

“Input”:{

“AIMName”:”PSGestureInterpretation”,

“PortName”:”PSGestureDescriptors”

}

},

{

“Output”:{

“AIMName”:””,

“PortName”:”BodyDescriptors”

},

“Input”:{

“AIMName”:”PSGestureInterpretation”,

“PortName”:”BodyDescriptors”

}

},

{

“Output”:{

“AIMName”:”PSTextInterpretation”,

“PortName”:”PSText”

},

“Input”:{

“AIMName”:”PersonalStatusCombination”,

“PortName”:”PSText”

}

},

{

“Output”:{

“AIMName”:”PSSpeechInterpretation”,

“PortName”:”PSSpeech”

},

“Input”:{

“AIMName”:”PersonalStatusCombination”,

“PortName”:”PSSpeech”

}

},

{

“Output”:{

“AIMName”:”PSFaceInterpretation”,

“PortName”:”PSFace”

},

“Input”:{

“AIMName”:”PersonalStatusCombination”,

“PortName”:”PSFace”

}

},

{

“Output”:{

“AIMName”:”PSGestureInterpretation”,

“PortName”:”PSGesture”

},

“Input”:{

“AIMName”:”PersonalStatusCombination”,

“PortName”:”PSGesture”

}

},

{

“Output”:{

“AIMName”:”PersonalStatusCombination”,

“PortName”:”PersonalStatus”

},

“Input”:{

“AIMName”:””,

“PortName”:”PersonalStatus”

}

}

],

“Implementations”:[

 

],

“Documentation”:[

{

“Type”:”Tutorial”,

“URI”:”https://mpai.community/standards/mpai-mmc/”

}

]

}

}

 

1.1        PSTextDescription

{

“Identifier”:{

“ImplementerID”:”/* String assigned by IIDRA */”,

“Specification”:{

“Name”:”MPAI-MMC”,

“AIW”:””,

“AIM”:”PSTextDescription”,

“Version”:”2″

},

“Description”:”This AIM implements the text description for Personal Status.”,

“Types”:[

{

“Name”:”Text_t”,

“Type”:”{uint8[] | uint16[]}”

},

{

“Name”:”PSTextDescriptors_t”,

“Type”:”uint8[]”

}

],

“Ports”:[

{

“Name”:”TextObject”,

“Direction”:”InputOutput”,

“RecordType”:”Text_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”PSTextDescriptors”,

“Direction”:”OutputInput”,

“RecordType”:”PSTextDescriptors_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

}

],

“SubAIMs”:[

 

],

“Topology”:[

 

],

“Implementations”:[

 

],

“Documentation”:[

{

“Type”:”Tutorial”,

“URI”:”https://mpai.community/standards/mpai-mmc/”

}

]

}

}

 

1.2        PSSpeechDescription

{

“Identifier”:{

“ImplementerID”:”/* String assigned by IIDRA */”,

“Specification”:{

“Name”:”MPAI-MMC”,

“AIW”:””,

“AIM”:”PSSpeechDescription”,

“Version”:”2″

},

“Description”:”This AIM implements the Speech description for Personal Status.”,

“Types”:[

{

“Name”:”Speech_t”,

“Type”:”uint16[]”

},

{

“Name”:”PSSpeechDescriptors_t”,

“Type”:”uint8[]”

}

],

“Ports”:[

{

“Name”:”SpeechObject”,

“Direction”:”InputOutput”,

“RecordType”:”Speech_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”PSSpeechDescriptors”,

“Direction”:”OutputInput”,

“RecordType”:”PSSpeechDescriptors_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

}

],

“SubAIMs”:[

 

],

“Topology”:[

 

],

“Implementations”:[

 

],

“Documentation”:[

{

“Type”:”Tutorial”,

“URI”:”https://mpai.community/standards/mpai-mmc/”

}

]

}

}

 

1.3        PSFaceDescription

{

“Identifier”:{

“ImplementerID”:”/* String assigned by IIDRA */”,

“Specification”:{

“Name”:”MPAI-MMC”,

“AIW”:””,

“AIM”:”PSFaceDescription”,

“Version”:”2″

},

“Description”:”This AIM implements the Face description for Personal Status.”,

“Types”:[

{

“Name”:”Face_t”,

“Type”:”uint32[]”

},

{

“Name”:”PSFaceDescriptors_t”,

“Type”:”uint8[]”

}

],

“Ports”:[

{

“Name”:”FaceObject”,

“Direction”:”InputOutput”,

“RecordType”:”Face_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”PSFaceDescriptors”,

“Direction”:”OutputInput”,

“RecordType”:”PSFaceDescriptors_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

}

],

“SubAIMs”:[

 

],

“Topology”:[

 

],

“Implementations”:[

 

],

“Documentation”:[

{

“Type”:”Tutorial”,

“URI”:”https://mpai.community/standards/mpai-mmc/”

}

]

}

}

 

1.4        PSGestureDescription

{

“Identifier”:{

“ImplementerID”:”/* String assigned by IIDRA */”,

“Specification”:{

“Name”:”MPAI-MMC”,

“AIW”:””,

“AIM”:”PSGestureDescription”,

“Version”:”2″

},

“Description”:”This AIM implements the Gesture description for Personal Status.”,

“Types”:[

{

“Name”:”Body_t”,

“Type”:”uint8[]”

},

{

“Name”:”PSGestureDescriptors_t”,

“Type”:”uint8[]”

}

],

“Ports”:[

{

“Name”:”BodyObject”,

“Direction”:”InputOutput”,

“RecordType”:”Body_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”PSGestureDescriptors”,

“Direction”:”OutputInput”,

“RecordType”:”PSGestureDescriptors_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

}

],

“SubAIMs”:[

 

],

“Topology”:[

 

],

“Implementations”:[

 

],

“Documentation”:[

{

“Type”:”Tutorial”,

“URI”:”https://mpai.community/standards/mpai-mmc/”

}

]

}

}

 

1.5        PSTextInterpretation

{

“Identifier”:{

“ImplementerID”:”/* String assigned by IIDRA */”,

“Specification”:{

“Name”:”MPAI-MMC”,

“AIW”:””,

“AIM”:”PSTextInterpretation”,

“Version”:”2″

},

“Description”:”This AIM implements the Text Interpretation function for Personal Status.”,

“Types”:[

{

“Name”:”PSTextDescriptors_t”,

“Type”:”uint8[]”

},

{

“Name”:”TextDescriptors_t”,

“Type”:”uint8[]”

},

{

“Name”:”PSText_t”,

“Type”:”uint8[]”

}

],

“Ports”:[

{

“Name”:”PSTextDescriptors”,

“Direction”:”InputOutput”,

“RecordType”:”PSTextDescriptors_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”TextDescriptors”,

“Direction”:”InputOutput”,

“RecordType”:”TextDescriptors_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”PSText”,

“Direction”:”OutputInput”,

“RecordType”:”PSText_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

}

],

“SubAIMs”:[

 

],

“Topology”:[

 

],

“Implementations”:[

 

],

“Documentation”:[

{

“Type”:”Tutorial”,

“URI”:”https://mpai.community/standards/mpai-mmc/”

}

]

}

}

 

1.6        PSSpeechInterpretation

{

“Identifier”:{

“ImplementerID”:”/* String assigned by IIDRA */”,

“Specification”:{

“Name”:”MPAI-MMC”,

“AIW”:””,

“AIM”:”PSSpeechInterpretation”,

“Version”:”2″

},

“Description”:”This AIM implements the Speech Interpretation function for Personal Status.”,

“Types”:[

{

“Name”:”PSSpeechDescriptors_t”,

“Type”:”uint8[]”

},

{

“Name”:”SpeechDescriptors_t”,

“Type”:”uint8[]”

},

{

“Name”:”PSSpeech_t”,

“Type”:”uint8[]”

}

],

“Ports”:[

{

“Name”:”PSSpeechDescriptors”,

“Direction”:”InputOutput”,

“RecordType”:”PSSpeechDescriptors_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”SpeechDescriptors”,

“Direction”:”InputOutput”,

“RecordType”:”SpeechDescriptors_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”PSSpeech”,

“Direction”:”OutputInput”,

“RecordType”:”PSSpeech_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

}

],

“SubAIMs”:[

 

],

“Topology”:[

 

],

“Implementations”:[

 

],

“Documentation”:[

{

“Type”:”Tutorial”,

“URI”:”https://mpai.community/standards/mpai-mmc/”

}

]

}

}

 

1.7        PSFaceInterpretation

{

“Identifier”:{

“ImplementerID”:”/* String assigned by IIDRA */”,

“Specification”:{

“Name”:”MPAI-MMC”,

“AIW”:””,

“AIM”:”PSFaceInterpretation”,

“Version”:”2″

},

“Description”:”This AIM implements the Face Interpretation function for Personal Status.”,

“Types”:[

{

“Name”:”PSFaceDescriptors_t”,

“Type”:”uint8[]”

},

{

“Name”:”FaceDescriptors_t”,

“Type”:”uint8[]”

},

{

“Name”:”PSFace_t”,

“Type”:”uint8[]”

}

],

“Ports”:[

{

“Name”:”PSFaceDescriptors”,

“Direction”:”InputOutput”,

“RecordType”:”PSFaceDescriptors_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”FaceDescriptors”,

“Direction”:”InputOutput”,

“RecordType”:”FaceDescriptors_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”PSFace”,

“Direction”:”OutputInput”,

“RecordType”:”PSFace_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

}

],

“SubAIMs”:[

 

],

“Topology”:[

 

],

“Implementations”:[

 

],

“Documentation”:[

{

“Type”:”Tutorial”,

“URI”:”https://mpai.community/standards/mpai-mmc/”

}

]

}

}

 

1.8        PSGestureInterpretation

{

“Identifier”:{

“ImplementerID”:”/* String assigned by IIDRA */”,

“Specification”:{

“Name”:”MPAI-MMC”,

“AIW”:””,

“AIM”:”PSGestureInterpretation”,

“Version”:”2″

},

“Description”:”This AIM implements the Gesture Interpretation function for Personal Status.”,

“Types”:[

{

“Name”:”PSGestureDescriptors_t”,

“Type”:”uint8[]”

},

{

“Name”:”BodyDescriptors_t”,

“Type”:”uint8[]”

},

{

“Name”:”PSGesture_t”,

“Type”:”uint8[]”

}

],

“Ports”:[

{

“Name”:”PSGestureDescriptors”,

“Direction”:”InputOutput”,

“RecordType”:”PSGestureDescriptors_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”BodyDescriptors”,

“Direction”:”InputOutput”,

“RecordType”:”BodyDescriptors_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”PSGesture”,

“Direction”:”OutputInput”,

“RecordType”:”PSGesture_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

}

],

“SubAIMs”:[

 

],

“Topology”:[

 

],

“Implementations”:[

 

],

“Documentation”:[

{

“Type”:”Tutorial”,

“URI”:”https://mpai.community/standards/mpai-mmc/”

}

]

}

}

 

1.9        PersonalStatusCombination

{

“Identifier”:{

“ImplementerID”:”/* String assigned by IIDRA */”,

“Specification”:{

“Name”:”MPAI-MMC”,

“AIW”:””,

“AIM”:”PersonalStatusCombination”,

“Version”:”2″

},

“Description”:”This AIM implements the Personal Status Combination function.”,

“Types”:[

{

“Name”:”PSText_t”,

“Type”:”uint8[]”

},

{

“Name”:”PSSpeech_t”,

“Type”:”uint8[]”

},

{

“Name”:”PSFace_t”,

“Type”:”uint8[]”

},

{

“Name”:”PSGesture_t”,

“Type”:”uint8[]”

},

{

“Name”:”PersonalStatus_t”,

“Type”:”uint8[]”

}

],

“Ports”:[

{

“Name”:”PSText”,

“Direction”:”InputOutput”,

“RecordType”:”PSText_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”PSSpeech”,

“Direction”:”InputOutput”,

“RecordType”:”PSSpeech_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”PSFace”,

“Direction”:”InputOutput”,

“RecordType”:”PSFace_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”PSGesture”,

“Direction”:”InputOutput”,

“RecordType”:”PSGesture_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”PersonalStatus”,

“Direction”:”OutputInput”,

“RecordType”:”PersonalStatus_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

}

],

“SubAIMs”:[

 

],

“Topology”:[

 

],

“Implementations”:[

 

],

“Documentation”:[

{

“Type”:”Tutorial”,

“URI”:”https://mpai.community/standards/mpai-mmc/”

}

]

}

}

 

 

 

 

 

 

 

 

 

 

 

 

  • Communication Among AIM Implementors (Informative)

 

To the extent possible, AIM input and output data are specified so that the inner implementation of an AIM need not be known or considered by cooperating AIMs. In other words, so far as possible, cooperating AIMs are designed to interact as black boxes. However, AIMs based upon the neural network technology currently prevalent in AI systems will generally require closer cooperation – in effect, greater transparency. An AIM receiving neural input in the form of features (vectors) will require some assistance in processing them. The downstream AIM will need either

  • The neural network model used to train the upstream AIM, or
  • A precise specification of the syntax and semantics of the features,

so that the downstream AIM can handle the features received from the upstream AIM.

 

A core design principle of MPAI is modularity: AI Modules (AIMs) and their interfaces must be defined so that each AIM can be built by an independent implementor without compromising the function of the use case as a whole.

However, MPAI also recognizes that AIMs and their implementors may sometimes profit from communication and interchange of data and/or components. Such exchanges can be especially appropriate for AIMs featuring neural network components or comparable machine learning elements – an increasingly common and important situation in the design of cooperative artificial intelligence modules.

The Unidirectional Speech Translation workflow provides a good example. It is designed to enable Speech Features extracted from the input (source language) speech to be added to the Translated Speech, that is, to the target language or output speech. This addition can enable the spoken translation to express the original emotion, or to employ the original speaker’s voice quality so as to give the impression that he or she is pronouncing the translation. For these purposes, a Speech Feature Extraction AIM can extract relevant speech features from the input speech and pass them to the Speech Synthesis (Features) AIM. However, while the two AIMs can indeed be independently implemented, the downstream (receiving) AIM, in this case Speech Synthesis (Features), will need to process the received speech features appropriately. If Speech Feature Extraction employs neural network technology and passes the resulting features as vectors, then Speech Synthesis (Features) will need cooperation from Speech Feature Extraction: the downstream AIM will need either (1) the neural network model used to train the upstream AIM, or (2) a precise specification of the syntax and semantics of the features, so that it can handle the features received from the upstream AIM.

 

Comparable considerations apply to the Conversation with Emotion (CWE) use case and, more generally, to any AIMs that exchange neural information. In explicitly providing for such communication among artificial machine learning models and components, MPAI is not only recognising practical requirements for cooperation among such modules, but also acknowledging an analogy with communication among biological neural subsystems.

 

 

 

 

[1] At the time of publication of this Technical Specification, the MPAI Store was assigned as the IIDRA.