This document is a working draft of Version 2 of Technical Specification: Multimodal Conversation (MPAI-MMC) published with a request for Community Comments. Comments should be sent to the MPAI Secretariat by 2023/09/25T23:59 UTC to enable MPAI to consider comments for potential inclusion in the final text of the Technical Specification planned to be approved for publication by the 36th General Assembly (2023/09/29).

 

 

WARNING

 

Use of the technologies described in this Technical Specification may infringe patents, copyrights or intellectual property rights of MPAI Members or non-members.

 

MPAI and its Members accept no responsibility whatsoever for damages or liability, direct or consequential, which may result from the use of this Technical Specification.

 

Readers are invited to review Annex 3 – Notices and Disclaimers.

 

 

 

 

1        Introduction (Informative) 7

2        Scope of Standard. 8

3        Terms and Definitions. 10

4        References. 12

4.1         Normative References. 12

4.2         Informative References. 12

5        Use Case Architectures. 13

5.1         Conversation with Personal Status (CPS) 13

5.1.1     Scope of Conversation with Personal Status. 13

5.1.2     Reference Architecture of Conversation with Personal Status. 13

5.1.3     I/O Data of Conversation with Personal Status. 14

5.1.4     Functions of AI Modules of Conversation with Personal Status. 14

5.1.5     I/O Data of AI Modules of Conversation with Personal Status. 15

5.1.6     JSON Metadata of Conversation with Personal Status. 15

5.2         Conversation with Emotion (CWE) 15

5.2.1     Scope of Conversation with Emotion. 15

5.2.2     Reference Architecture of Conversation with Emotion. 15

5.2.3     I/O Data of Conversation with Emotion. 16

5.2.4     Functions of AI Modules of Conversation with Emotion. 17

5.2.5     I/O Data of AI Modules of Conversation with Emotion. 17

5.2.6     JSON Metadata of Conversation with Emotion. 17

5.3         Multimodal Question Answering (MQA) 17

5.3.1     Scope of Multimodal Question Answering. 17

5.3.2     Reference Architecture of Multimodal Question Answering. 18

5.3.3     I/O Data of Multimodal Question Answering. 18

5.3.4     Functions of AI Modules of Multimodal Question Answering. 19

5.3.5     I/O Data of AI Modules of Multimodal Question Answering. 19

5.3.6     JSON Metadata of Multimodal Question Answering. 19

5.4         Conversation About a Scene (CAS) 19

5.4.1     Scope of Conversation About a Scene. 19

5.4.2     Reference Architecture of Conversation About a Scene. 20

5.4.3     I/O Data of Conversation About a Scene. 21

5.4.4     Functions of AI Modules of Conversation About a Scene. 21

5.4.5     I/O Data of AI Modules of Conversation About a Scene. 21

5.4.6     JSON Metadata of Conversation About a Scene. 22

5.5         Virtual Secretary for Videoconference (VSV) 22

5.5.1     Scope of Virtual Secretary for Videoconference. 22

5.5.2     Reference Architecture of Virtual Secretary for Videoconference. 22

5.5.3     I/O Data of Virtual Secretary for Videoconference. 24

5.5.4     Functions of AI Modules of Virtual Secretary for Videoconference. 24

5.5.5     I/O Data of AI Modules of Virtual Secretary for Videoconference. 24

5.5.6     JSON Metadata of Virtual Secretary for Videoconference. 25

5.6         Human-Connected Autonomous Vehicle (CAV) Interaction (HCI) 25

5.6.1     Scope of Human-CAV Interaction. 25

5.6.2     Reference Architecture of Human-CAV Interaction. 25

5.6.3     I/O Data of Human-CAV Interaction. 27

5.6.4     Functions of AI Modules of Human-CAV Interaction. 28

5.6.5     I/O Data of AI Modules of Human-CAV Interaction. 28

5.6.6     JSON Metadata of Human-CAV Interaction. 29

5.8         Unidirectional Speech Translation (UST) 29

5.8.1     Scope of Unidirectional Speech Translation. 29

5.8.2     Reference Architecture of Unidirectional Speech Translation. 29

5.8.3     I/O Data of Unidirectional Speech Translation. 30

5.8.4     Functions of AI Modules of Unidirectional Speech Translation. 30

5.8.5     I/O Data of AI Modules of Unidirectional Speech Translation. 31

5.8.6     JSON Metadata of Unidirectional Speech Translation. 31

5.9         Bidirectional Speech Translation (BST) 31

5.9.1     Scope of Bidirectional Speech Translation. 31

5.9.2     Reference Architecture of Bidirectional Speech Translation. 31

5.9.3     I/O Data of Bidirectional Speech Translation. 32

5.9.4     Functions of AI Modules of Bidirectional Speech Translation. 32

5.9.5     I/O Data of AI Modules of Bidirectional Speech Translation. 33

5.9.6     JSON Metadata of Bidirectional Speech Translation. 33

5.10       One-to-Many Speech Translation (MST) 33

5.10.1   Scope of One-to-Many Speech Translation. 33

5.10.2   Reference Architecture of One-to-Many Speech Translation. 33

5.10.3   I/O Data of One-to-Many Speech Translation. 34

5.10.4   Functions of AI Modules of One-to-Many Speech Translation. 34

5.10.5   I/O Data of AI Modules of One-to-Many Speech Translation. 34

5.10.6   JSON Metadata of One-to-Many Speech Translation. 35

6        Composite AI Modules. 35

6.1         Personal Status Extraction (PSE) 35

6.1.1     Scope of Personal Status Extraction. 35

6.1.2     Reference Architecture of Personal Status Extraction. 35

6.1.3     I/O Data of Personal Status Extraction. 36

6.1.4     Functions of AI Modules of Personal Status Extraction. 36

6.1.5     I/O Data of AI Modules of Personal Status Extraction. 37

6.1.6     JSON Metadata of Personal Status Extraction. 37

6.2         Personal Status Display (PSD) 37

6.2.1     Scope of Personal Status Display. 37

6.2.2     Reference Architecture of Personal Status Display. 37

6.2.3     I/O Data of Personal Status Display. 38

7        Data Formats. 38

7.1         Audio File. 40

7.2         Audio Scene Descriptors. 40

7.3         Cognitive State. 40

7.3.1     Syntax. 40

7.3.2     Semantics. 41

7.4         Emotion. 42

7.4.1     Syntax. 42

7.4.2     Semantics. 43

7.5         Face Descriptors. 44

7.6         Gesture Descriptors. 45

7.7         Instance Identifier 45

7.7.1     Syntax. 45

7.7.2     Semantics. 45

7.8         Intention. 46

7.8.1     Syntax. 46

7.8.2     Semantics. 46

7.9         Language identifier 47

7.10       Meaning. 47

7.10.1   Syntax. 47

7.10.2   Semantics. 48

7.11       Personal Status. 48

7.11.1   Factors and Modalities. 48

7.11.2   Personal Status Data. 49

7.14       Social Attitude. 52

7.14.1   Syntax. 52

7.14.2   Semantics. 52

7.15       Spatial Attitude. 57

7.16       Speech Descriptors. 57

7.17       Speech Features. 57

7.17.1   Syntax. 57

7.17.2   Semantics. 58

7.18       Text 59

7.19       Text Descriptors. 60

7.20       Video. 60

7.21       Video File. 60

7.22       Video of Faces KB Query Format 60

7.23       Visual Scene Descriptors. 60

Annex 1 – MPAI Basics. 61

1        General 61

2        Governance of the MPAI Ecosystem.. 61

3        AI Framework. 62

4        Audio-Visual Scene Description. 63

4.1         Audio Scene Descriptors. 63

4.2         Visual Scene Descriptors. 63

5        Avatar-Based Videoconference. 64

6        Connected Autonomous Vehicles. 64

Annex 2 – MPAI-wide terms and definitions. 67

Annex 3 – Notices and Disclaimers Concerning MPAI Standards (Informative) 70

Annex 4 – Patent declarations (Informative) 72

Annex 5 – Personal Status (Informative) 73

Annex 6 – AIW and AIM Metadata of MMC-CPS. 76

1        Metadata for MPAI-CPS AIW… 76

2        AIM metadata for CPS. 83

2.1         Visual Scene Description. 83

2.2         Audio Scene Description. 84

2.3         SpatialObjectIdentification. 85

2.4         SpeechRecognition. 86

2.5         Language Understanding. 87

2.6         PersonalStatusExtraction. 88

2.7         DialogueProcessing. 90

2.8         PersonalStatusDisplay. 91

Annex 7 – AIW and AIM Metadata of MMC-CWE.. 93

1        AIW metadata for CWE.. 93

2        AIM metadata. 99

2.1         SpeechRecognition. 99

2.2         Visual Scene Description. 100

2.3         Language Understanding. 101

2.4         PersonalStatusExtraction. 102

2.5         Dialogue Processing. 103

2.6         SpeechSynthesisEmotion. 105

2.7         Lips Animation. 106

Annex 8 – AIW and AIM Metadata of MMC-MQA.. 108

1        AIW metadata for MQA.. 108

2        AIM metadata. 113

2.1         VisualSceneDescription. 113

2.2         PhysicalObjectIdentification. 114

2.3         SpeechRecognition. 115

2.4         Language Understanding. 116

2.5         Question Analysis. 117

2.6         Question Answering. 118

2.7         SpeechSynthesisText 119

Annex 9 – AIW and AIM Metadata of MMC-CAS. 121

1        AIW metadata for MMC-CAS. 121

2        AIM metadata for MMC-CAS. 128

2.1    Visual Scene Description. 128

2.2    SpatialObjectIdentification. 129

2.3    SpeechRecognition. 131

2.4    LanguageUnderstanding. 131

2.5    PersonalStatusExtraction. 133

2.6    DialogueProcessing. 134

2.7    ScenePresentation. 135

2.8    PersonalStatusDisplay. 136

Annex 10 – AIW and AIM Metadata of CAV-HCI. 138

1        AIW metadata for HCI. 138

2        Metadata for HCI AIMs. 146

2.1    Audio Scene Description. 146

2.2    Visual Scene Description. 147

2.3    SpeechRecognition. 149

2.4    SpatialObjectIdentification. 150

2.5    LanguageUnderstanding. 151

2.6    SpeakerRecognition. 152

2.7    PersonalStatusExtraction. 153

2.8    FaceRecognition. 154

2.9    DialogueProcessing. 155

2.10  PersonalStatusDisplay. 156

Annex 11 – AIW and AIM Metadata of ARA-VSV.. 158

1        Metadata for VSV AIW… 158

2        AIM metadata for ARA-VSV. 164

2.1    SpeechRecognition. 164

2.2    AvatarDescriptorParsing. 165

2.3    LanguageUnderstanding. 166

2.4    PersonalStatusExtraction. 167

2.5    Summarisation. 169

2.6    DialogueProcessing. 170

2.7    PersonalStatusDisplay. 172

Annex 12 – AIW and AIM Metadata of MMC-UST. 174

1        AIW metadata for UST. 174

2        AIM metadata. 178

2.1         SpeechRecognition. 178

2.2         Translation. 178

2.3         Speech Feature Extraction. 180

2.4         Speech Synthesis. 180

Annex 13 – AIW and AIM Metadata of MMC-BST. 182

1        AIW metadata for BST. 182

2        AIM metadata. 187

2.1         SpeechRecognition. 187

2.2         Translation. 188

2.3         Speech Feature Extraction. 190

2.4         Speech Synthesis. 191

Annex 14 – AIW and AIM Metadata of MMC-MST. 193

1        AIW metadata for MST. 193

2        AIM metadata. 198

2.1         SpeechRecognition. 198

2.2         Translation. 199

2.3         Speech Feature Extraction. 200

2.4         Speech Synthesis. 200

Annex 15 – Metadata of MMC-PSE Composite AIM… 203

1        PersonalStatusExtraction. 203

1.1    PSTextDescription. 209

1.2    PSSpeechDescription. 209

1.3    PSFaceDescription. 210

1.4    PSBodyDescription. 211

1.5    PSTextInterpretation. 212

1.6    PSSpeechInterpretation. 213

1.7    PSFaceInterpretation. 214

1.8    PSBodyInterpretation. 215

1.9    PersonalStatusCombination. 216

Annex 16 – Communication Among AIM Implementors (Informative) 218

 

1          Introduction (Informative)

From the moment a human built the first machine, there was a need to “communicate” with it. As more complex machines were built, the need for more sophisticated communication methods arose. Today, as personal devices become more pervasive and the use of information and other online services becomes ubiquitous, human-machine communication often becomes more direct and even “personal”. In the past, humans communicated with more primitive machines by touch, but now the possibility of using speech and visual means enhances this trend.

 

The ability of Artificial Intelligence to learn from interactions with humans gives machines the ability to improve their “conversational” capabilities by better understanding the meaning of what humans type or say and by providing more pertinent responses. If properly trained, machines can also learn to understand additional or hidden meanings of a sentence by analysing a human’s text, speech, or gestures. Machines can also be made to develop and rely on “internal statuses” comparable to those driving the attitudes of conversing humans. Thus, they can provide responses – in text, speech, and gestures – that are more human-like and richer in content.

 

The mission of the international, unaffiliated, non-profit Moving Picture, Audio and Data Coding by Artificial Intelligence (MPAI) Standards Developing Organisation is to develop AI-enabled data coding standards. MPAI believes that its standards should enable humans to select machines whose internal operation they understand to some degree, rather than machines that are just “black boxes” resulting from unknown training with unknown data. Thus, an implemented MPAI standard breaks up monolithic AI applications, yielding a set of interacting components with identified data whose semantics is known, as far as possible.

 

This opportunity for individual humans also offers a positive impact on industry, as component developers can compete in providing components with standard interfaces that have improved performance compared to other implementations. This “Lego-type” approach to application development is made possible by the MPAI AI Framework standard [2], where “applications” (called AI Workflows – AIWs) are composed of AI Modules (called AIMs) executed in AI Frameworks (called AIFs). AIMs are defined by their functions and data, but not by their internal architecture, which may be based on AI or data processing technologies, and implemented in software, hardware, or hybrid technology. Annex 1 – MPAI Basics provides additional details on the MPAI standards ecosystem and MPAI standards relevant to this Technical Specification.
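
As a purely illustrative, non-normative sketch of this component-based approach (the class names, function names, and data labels below are hypothetical and do not reproduce the MPAI-AIF metadata), an AIW can be viewed as a set of AIMs with identified input and output data that an AIF executes in order:

```python
# Illustrative toy only: the classes and names below are hypothetical and do
# NOT reproduce the normative MPAI-AIF metadata; they merely show how AIMs
# with identified input/output data can be composed into an AIW.
from dataclasses import dataclass, field
from typing import Callable, Dict, List

@dataclass
class AIM:
    name: str
    inputs: List[str]                 # names of the data items the AIM receives
    outputs: List[str]                # names of the data items the AIM produces
    process: Callable[[Dict], Dict]   # implementation: AI-based or data processing

@dataclass
class AIW:
    aims: List[AIM] = field(default_factory=list)

    def run(self, data: Dict) -> Dict:
        # Execute the AIMs in order, passing named data items between them.
        for aim in self.aims:
            available = {k: data[k] for k in aim.inputs if k in data}
            data.update(aim.process(available))
        return data

# A two-AIM workflow: Speech Recognition followed by Language Understanding.
asr = AIM("SpeechRecognition", ["InputSpeech"], ["RecognisedText"],
          lambda d: {"RecognisedText": f"text({d['InputSpeech']})"})
lu = AIM("LanguageUnderstanding", ["RecognisedText"], ["Meaning", "RefinedText"],
         lambda d: {"Meaning": f"meaning({d['RecognisedText']})",
                    "RefinedText": d["RecognisedText"]})
print(AIW([asr, lu]).run({"InputSpeech": "hello"}))
```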

 

Technical Specification: Multimodal Conversation (MPAI-MMC) V2 provides the technologies supporting the implementation of a subset of the possibilities envisaged by this introduction. It is organised in Use Cases, such as Conversation with Emotion, Multimodal Question Answering, and Unidirectional Speech Translation, corresponding to AI Workflows. Each Use Case provides the functions, and the input/output data of the AIW and the AIM topology. Each AIM of the Use Case is specified in terms of functions and input/output data. A single chapter also collects all data formats referenced in the specification.

 

In this Introduction and in the following, Terms beginning with a capital letter are defined in Table 1 if they are specific to this Standard and in Table 45 if they are common to all MPAI Standards. The chapters and the Annexes are Normative unless they are labelled as Informative.

 

2          Scope of Standard

Multimodal Conversation (MPAI-MMC) specifies:

  1. The technologies required to analyse the text and/or the speech and other non-verbal components exchanged in human-machine and machine-machine conversation with the goal of emulating human-human conversation in completeness and intensity.
  2. Use Cases that apply the technologies, both from MPAI-MMC and other MPAI standards:
    • “Conversation with Personal Status” (CPS), enabling conversation and question answering with a machine able to extract the inner state of the entity it is conversing with and showing itself as a speaking digital human able to express a Personal Status. By adding components to or removing components from this general Use Case, five Use Cases are spawned:
      • “Conversation with Emotion” (CWE), supporting audio-visual conversation with a machine impersonated by a synthetic voice and an animated face.
      • “Multimodal Question Answering” (MQA), supporting requests for information about a displayed object.
      • “Conversation About a Scene” (CAS), where a human converses with a machine pointing at the objects scattered in a room and displaying Personal Status in their speech, face, and gestures while the machine responds displaying its Personal Status in speech, face, and gesture.
      • “Human-Connected Autonomous Vehicle Interaction” (HCI), where humans converse with a machine displaying Personal Status after having been properly identified by the machine with their speech and face in outdoor and indoor conditions while the machine responds displaying its Personal Status in speech, face, and gesture.
      • “Virtual Secretary for Videoconference” (VSV), where an avatar not representing a human in a virtual conference makes and displays a summary of what other avatars say, receives and interprets comments using the avatars’ utterances and Personal Statuses, and displays the edited summary.
    • Three Use Cases supporting conversational translation applications. In each Use Case, users can specify whether speech or text is used as input and, if it is speech, whether their speech features are preserved in the interpreted speech:
      • “Unidirectional Speech Translation” (UST).
      • “Bidirectional Speech Translation” (BST).
      • “One-to-Many Speech Translation” (MST).
  3. One Composite AIM that applies the technologies, both from MPAI-MMC and other MPAI standards: Personal Status Extraction analyses the Personal Status conveyed by Text, Speech, Face, and Gesture – of a real or digital human – and provides an estimate of the Personal Status.

 

Note that:

  1. Each Use Case normatively defines:
    • The Functions of the AIW implementing it and of the AIMs.
    • The Connections between and among the AIMs.
    • The Semantics and the Formats of the input and output data of the AIW and the AIMs.
  2. Each Composite AIM normatively defines:
    • The Functions of the Composite AIM and of its AIMs.
    • The Connections between and among the AIMs.
    • The Semantics and the Formats of the input and output data of the Composite AIM and the AIMs.

 

The word normatively implies that an Implementation claiming Conformance to:

  1. An AIW, shall:
    1. Perform the AIW function specified in the appropriate Section of Chapter 5.
    2. Use AIMs whose topology and connections conform with the AIW Architecture specified in the appropriate Section of Chapter 5.
    3. Use AIW and AIM input and output data having the formats specified in the appropriate Sections of Chapter 7.
  2. An AIM, shall:
    1. Perform the AIM function specified by the appropriate Section of Chapter 5.
    2. Receive and produce the data specified in the appropriate Section of Chapter 5.
    3. Receive as input and produce as output data having the format specified in Chapter 7.
  3. A Data Format, shall comply with the format specified in Chapter 7.

 

Users of this Technical Specification should note that:

  1. This Technical Specification defines Interoperability Levels but does not mandate any.
  2. Implementers decide the Interoperability Level their Implementation satisfies.
  3. Implementers can use the Reference Software of this Technical Specification to develop their Implementations.
  4. The Conformance Testing specification can be used to test the conformity of an Implementation to this Standard.
  5. Performance Assessors can assess the level of Performance of an Implementation based on the Performance Assessment specification of this Standard.
  6. Implementers and Users should consider Annex 3 – Notices and Disclaimers.

The current Version of MPAI-MMC has been developed by the MPAI Multimodal Conversation Development Committee (MM-DC). MPAI expects to produce future MPAI-MMC Versions extending the scope of the Use Cases and/or adding new Use Cases within the Multimodal Conversation scope.

 

3          Terms and Definitions

The terms used in this standard beginning with a capital letter have the meaning defined in Table 1.

 

Table 1 – Table of terms and definitions

 

Term Definition
Audio Digital representation of an analogue audio signal sampled at a frequency between 8-192 kHz with a number of bits/sample between 8 and 32, and non-linear and linear quantisation.
Audio Object Coded representation of Audio information with its metadata. An Audio Object can be a combination of Audio Objects.
Audio Scene The Audio Objects of an Environment with Object location metadata.
Audio-Visual Object Coded representation of Audio-Visual information with its metadata. An Audio-Visual Object can be a combination of Audio-Visual Objects.
Audio-Visual Scene (AV Scene) The Audio-Visual Objects of an Environment with Object location metadata.
Avatar An animated 3D object representing a real or fictitious person in a Virtual Space.
Avatar Model An inanimate avatar exposing interfaces to enable its animation.
Cognitive State An element of the internal status reflecting the way a human or avatar understands the Environment, such as “Confused”, “Dubious”, “Convinced”.
Colour (of speech) The timbre of an identifiable voice independent of a current Personal Status and language.
Connected Autonomous Vehicle A vehicle able to autonomously reach an assigned geographical position by:

1.      Understanding human utterances.

2.      Planning a route.

3.      Sensing and interpreting the Environment.

4.      Exchanging information with other CAVs.

5.      Acting on the CAV’s motion actuation subsystem.

Descriptor Coded representation of text, audio, speech, or visual feature.
Emotion The coded representation of the internal state resulting from the interaction of a human or avatar with the Environment or subsets of it, such as “Angry”, “Sad”, “Determined”.
Environment A Virtual Space containing a Scene.
Environment Model The static audio and visual components of the Environment, e.g., walls, table, and chairs.
Face The portion of a 2D or 3D digital representation corresponding to the face of a human.
Factor One of Emotion, Cognitive State and Attitude.
Grade The intensity of a Factor.
Identifier The label uniquely associated with a human or an avatar or an object.
Instance An element of a set of entities – Physical Objects, users etc. – belonging to some levels in a hierarchical classification (taxonomy).
Intention The result of analysis of the goal of an input question.
Manifestation The manner of showing the Personal Status, or a subset of it, in any one of Speech, Face, and Physical Gesture.
Meaning Information extracted from Text such as syntactic and semantic information, Personal Status, and other information, such as an Object Identifier.
Modality One of Text, Speech, Face, or Gesture.
Object Descriptor An individual attribute of the coded representation of an object in a Scene, including its Spatial Attitude.
Orientation The set of the 3 roll, pitch, yaw angles indicating the rotation around the principal axis (x) of an Object, its y axis having an angle of 90˚ counterclockwise (right-to-left) with the x axis and its z axis pointing up toward the viewer.
Personal Status The ensemble of information internal to a person, including Emotion, Cognitive State, and Attitude.
Physical Gesture A movement of the body or part of it, such as the head, arm, hand, and finger, often a complement to a vocal utterance.
Pitch The fundamental frequency of Speech. Pitch is the attribute that makes it possible to judge sounds as “higher” and “lower.”
Point of View The Spatial Attitude of a human or avatar looking at an Environment.
Position The 3 coordinates (x,y,z) of a representative point of an object in the Real and Virtual Space.
Refined Text The Text resulting from the analysis of the Text produced by Speech Recognition made by Language Understanding.
Scene A structured composition of Objects.
Scene Presentation The format used by an audio-visual renderer to render the Audio-Visual Scene internal to the machine from a selected Point of View.
Social Attitude An element of the internal status related to the way a human or avatar intends to position vis-à-vis the Environment or subsets of it, e.g., “Respectful”, “Confrontational”, “Soothing”.
Spatial Attitude Position and Orientation and their velocities and accelerations of a Human and Physical Object in a Real or Virtual Environment.
Spatial Attribute Position and Orientation and their velocities and accelerations of a Human and Physical Object in a Real or Virtual Environment.
Speech Digital representation of analogue speech sampled at a frequency between 8 kHz and 96 kHz with a number of bits/sample of 8, 16 and 24, and non-linear and linear quantisation.
Speech Features Aspects of a speech segment that enable its description and reproduction, e.g., degree of vocal tension, Pitch, etc., and that can be automatically recognised and extracted for speech synthesis or other related purposes.
Speech Rate The number of Speech Units per second.
Speech Unit Phoneme, syllable, or word as a segment of Speech.
Text A sequence of characters drawn from a finite alphabet.
Visual Object Coded representation of Visual information with its metadata. A Visual Object can be a combination of Visual Objects.
Vocal Gesture Utterance, such as cough, laugh, hesitation, etc. Lexical elements are excluded.

4          References

4.1        Normative References

This standard normatively references the following documents, both from MPAI and other standards organisations. MPAI standards are publicly available at https://mpai.community/standards/resources/.

  1. Technical Specification: MPAI Ecosystem Governance (MPAI-GME) V1.1; https://mpai.community/standards/mpai-gme/.
  2. Technical Specification: AI Framework (MPAI-AIF) V1; https://mpai.community/standards/mpai-aif/.
  3. Technical Specification: Avatar Representation and Animation (MPAI-ARA) V1; https://mpai.community/standards/mpai-ara/.
  4. Technical Specification: Context-based Audio Enhancement (MPAI-CAE) V2; https://mpai.community/standards/mpai-cae/.
  5. Technical Specification: Connected Autonomous Vehicle (MPAI-CAV) V2; https://mpai.community/standards/mpai-cav/.
  6. Technical Specification: Visual Object and Scene Description (MPAI-OSD) V2; https://mpai.community/standards/mpai-osd/.
  7. Khronos; Graphics Language Transmission Format (glTF); October 2021; https://registry.khronos.org/glTF/specs/2.0/glTF-2.0.html
  8. ISO 639; Codes for the Representation of Names of Languages – Part 1: Alpha-2 Code.
  9. ISO/IEC 10646; Information technology – Universal Coded Character Set.
  10. ITU-R; Long-form file format for the international exchange of audio programme materials with metadata; BS.2088-1 (10/2019) https://www.loc.gov/preservation/digital/formats/fdd/fdd000001.shtml.
  11. ISO/IEC 14496-10; Information technology – Coding of audio-visual objects – Part 10: Advanced Video Coding.
  12. ISO/IEC 14496-12; Information technology – Coding of audio-visual objects – Part 12: ISO base media file format.
  13. ISO/IEC 23008-2; Information technology – High efficiency coding and media delivery in heterogeneous environments – Part 2: High Efficiency Video Coding.
  14. ISO/IEC 23094-1; Information technology – General video coding – Part 1: Essential Video Coding.
  15. MPAI; The MPAI Statutes; https://mpai.community/statutes/.
  16. MPAI; The MPAI Patent Policy; https://mpai.community/about/the-mpai-patent-policy/.
  17. MPAI; Framework Licence of the Multimodal Conversation Technical Specification (MPAI-MMC) V1; https://mpai.community/standards/mpai-mmc/framework-licence/mpai-mmc-v1-framework-licence/.
  18. MPAI; Framework Licence of the Multimodal Conversation Technical Specification (MPAI-MMC) V2; https://mpai.community/standards/mpai-mmc/call-for-technologies/mpai-mmc-v2-call-for-technologies/.

4.2        Informative References

The references provided here are for information purposes only.

  1. Ekman, Paul (1999), “Basic Emotions”, in Dalgleish, T; Power, M (eds.), Handbook of Cognition and Emotion (PDF), Sussex, UK: John Wiley & Sons.
  2. Emotion Markup Language (EmotionML) 1.0; https://www.w3.org/TR/2010/WD-emotionml-20100729/diffmarked.html.
  3. Hobbs J.R., Gordon A.S. (2011) The Deep Lexical Semantics of Emotions. In: Ahmad K. (eds) Affective Computing and Sentiment Analysis. Text, Speech, and Language Technology, vol 45. Springer, Dordrecht, https://people.ict.usc.edu/~gordon/publications/EMOT08.PDF and https://www.researchgate.net/publication/227251103_The_Deep_Lexical_Semantics_of_Emotions.

 

5          Use Cases

5.1        Conversation with Personal Status (CPS)

5.1.1        Scope of Conversation with Personal Status

When humans have a conversation with other humans, they use speech and, in constrained cases, text. Their interlocutors perceive the speech and/or text supplemented by visual information from the face and gesture of the conversing human. Text, speech, face, and gesture may convey information about the internal state of the speaker that MPAI calls Personal Status. Therefore, the handling of Personal Status information in human-machine and, in the future, even machine-machine conversation is a key feature of a machine trying to understand what a speaker’s utterances mean: Personal Status recognition can improve understanding of the speaker’s utterance and help a machine produce better replies.

Conversation with Personal Status (MMC-CPS) is a general Use Case of an entity – a real or digital human – conversing and question answering with a machine. The machine captures and understands Speech, extracts Personal Status from the Text, Speech, Face, and Gesture Modalities, and fuses these estimates into an estimated Personal Status of the entity in order to achieve a better understanding of the context in which the entity utters Speech.

5.1.2        Reference Architecture of Conversation with Personal Status

Figure 1 gives the Conversation with Personal Status Reference Model including the input/output data, the AIMs, and the data exchanged between and among the AIMs.

 

Figure 1 – Reference Model of Conversation with Personal Status

 

The operation of the Conversation with Personal Status Use Case develops as follows:

  1. Input Selection is used to inform the machine whether the human employs Text or Speech in conversation with the machine.
  2. Visual Scene Description extracts the Scene Geometry, the Physical Objects and the Face and Body Descriptors of humans in the Scene.
  3. Audio Scene Description extracts the Scene Geometry, and the Speech Objects in the Scene.
  4. Spatial Object Identification assigns an Identifier to each Physical Object indicated by a human.
  5. Speech Recognition recognises Speech utterances.
  6. Language Understanding refines Text and extracts Meaning.
  7. Personal Status Extraction extracts a human’s Personal Status.
  8. Dialogue Processing produces the machine’s response and its Personal Status.
  9. Personal Status Display produces a speaking Avatar expressing Personal Status.
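
A non-normative Python sketch of the dataflow described above may help follow how the data items of Figure 1 and Table 4 travel between the AIMs; every function is a placeholder for the corresponding AIM and all returned values are hypothetical:

```python
# Non-normative dataflow sketch of Figure 1 / Table 4.  Each stub stands in for
# the corresponding AIM; return values are placeholders, not real data formats.

def visual_scene_description(video):
    return {"face_desc": "...", "body_desc": "...", "geometry": "...", "objects": ["cup"]}

def audio_scene_description(audio):
    return {"speech": "where can I put this cup?", "geometry": "..."}

def spatial_object_identification(body_desc, geometry, objects):
    return "ObjectID:cup"                                   # Physical Object ID

def speech_recognition(speech):
    return speech                                           # Recognised Text

def language_understanding(object_id, input_text, recognised_text, input_selection):
    text = recognised_text if input_selection == "Speech" else input_text
    return {"intent": f"ask-location({object_id})"}, text   # Meaning, Refined Text

def personal_status_extraction(body_desc, face_desc, meaning, speech):
    return {"emotion": "curious"}                           # Input Personal Status

def dialogue_processing(input_text, refined_text, personal_status, input_selection):
    return {"emotion": "helpful"}, "You can put it on the shelf."  # Machine PS, Machine Text

def personal_status_display(machine_text, machine_ps):
    return {"avatar": "...", "speech": machine_text, "text": machine_text}

# Wiring, following the numbered steps above (Speech selected as input):
scene = visual_scene_description("InputVideo")
audio = audio_scene_description("InputAudio")
obj_id = spatial_object_identification(scene["body_desc"], scene["geometry"], scene["objects"])
recognised = speech_recognition(audio["speech"])
meaning, refined = language_understanding(obj_id, None, recognised, "Speech")
ps = personal_status_extraction(scene["body_desc"], scene["face_desc"], meaning, audio["speech"])
machine_ps, machine_text = dialogue_processing(None, refined, ps, "Speech")
print(personal_status_display(machine_text, machine_ps))
```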

5.1.3        I/O Data of Conversation with Personal Status

Table 2 gives the input and output data of the Conversation with Personal Status Use Case:

 

Table 2 – I/O Data of Conversation with Personal Status

Input Comments
Input Text Text typed by the human as an additional information stream or as a replacement for the Speech.
Input Speech Speech of the human having a conversation with the machine.
Input Video Video of the Face of the human having a conversation with the machine.
Input Selection Data determining the use of Speech vs Text.
Output Comments
Machine Text Text of the Speech produced by the machine.
Machine Speech Synthetic Speech produced by the machine.
Machine Video Avatar representing the machine.
Input Selection Selection signalling use of Text or Speech.

5.1.4        Functions of AI Modules of Conversation with Personal Status

Table 3 provides the functions of the Conversation with Personal Status Use Case.

 

Table 3 – Functions of AI Modules of Conversation with Personal Status

AIM Function
Visual Scene Description Provides Visual Objects and their Spatial Attitudes.
Audio Scene Description Provides Speech Objects and their Spatial Attitudes.
Speech Recognition Recognises Speech
Language Understanding Refines Text and extracts Meaning
Personal Status Extraction Extracts Personal Status
Dialogue Processing 1.      Processes Refined Text and Personal Status

2.      Produces machine’s Text and Personal Status.

Personal Status Display 1.      Synthesises Machine Speech from Machine Text and Personal Status

2.      Synthesises Machine Avatar

5.1.5        I/O Data of AI Modules of Conversation with Personal Status

Table 4 provides the I/O Data of the AI Modules of the Conversation with Personal Status Use Case.

 

Table 4 – I/O Data of AI Modules of Conversation with Personal Status

AIM Receives Produces
Visual Scene Description Input Video 1.      Face Descriptors

2.      Body Descriptors

3.      Visual Scene Geometry

4.      Physical Objects

Audio Scene Description Input Audio 1.      Speech

2.      Audio Scene Geometry

Spatial Object Identification 1.      Body Descriptors

2.      Visual Scene Geometry

3.      Physical Objects

Physical Object ID
Speech Recognition Input Speech Recognised Text
Language Understanding 1.      Physical Object ID

2.      Input Text

3.      Recognised Text

4.      Input Selection

1.      Meaning

2.      Refined Text

Personal Status Extraction 1.      Body Descriptors

2.      Face Descriptors

3.      Meaning

4.      Speech

Input Personal Status
Dialogue Processing 1.      Input Text

2.      Refined Text

3.      Input Personal Status

4.      Input Selection

1.      Machine Personal Status

2.      Machine Text

Personal Status Display 1.      Machine Text

2.      Machine Personal Status

1.      Machine Avatar

2.      Machine Speech

3.      Machine Text

5.1.6        JSON Metadata of Conversation with Personal Status

Specified in Annex 6 – AIW and AIM Metadata of MMC-CPS.

5.2        Conversation with Emotion (CWE)

5.2.1        Scope of Conversation with Emotion

In the Conversation with Emotion (MMC-CWE) Use Case, a machine responds to a human’s textual and/or vocal utterance in a manner consistent with the human’s utterance and emotional state, as detected from the human’s text, speech, or face. The machine responds using text, synthetic speech, and a face whose lip movements are synchronised with the synthetic speech and the synthetic machine emotion.

5.2.2        Reference Architecture of Conversation with Emotion

Figure 2 gives the Conversation with Emotion Reference Model including the input/output data, the AIMs, and the data exchanged between and among the AIMs.

 

Figure 2 – Reference Model of Conversation with Emotion

 

The operation of Conversation with Emotion develops as follows:

  1. Input Selection is used to inform the machine whether the human employs Text or Speech in conversation with the machine.
  2. Speech is recognised by Speech Recognition.
  3. Visual Scene Description extracts Face Descriptors from the scene.
  4. Language Understanding produces Meaning and Refined Text.
  5. Personal Status Extraction extracts Emotion from Meaning, Input Speech, and Face Descriptors.
  6. Dialogue Processing produces a response as Output Text and Emotion.
  7. Speech Synthesis (Emotion) produces Output Speech from Text and Emotion.
  8. Lips Animation animates the lips of a Face drawn from the Video of Faces KB in a way that is consistent with the Output Speech and the Output Emotion.

5.2.3        I/O Data of Conversation with Emotion

The input and output data of the Conversation with Emotion Use Case are:

 

Table 5 – I/O Data of Conversation with Emotion 

 

Input Comments
Input Selection Data determining the use of Speech vs Text.
Input Text Text typed by the human as an additional information stream or as a replacement for the speech, depending on the value of Input Selection.
Input Speech Speech of the human having a conversation with the Machine.
Input Video Video of the Face of the human having a conversation with the Machine.
Output Comments
Machine Text Text of the Speech produced by the Machine.
Machine Speech Synthetic Speech with Emotion produced by the Machine.
Machine Video Video of a Face whose lip movements are synchronised with the Output Speech and the Machine Personal Status.

5.2.4        Functions of AI Modules of Conversation with Emotion

Table 6 provides the functions of the Conversation with Emotion AIMs.

 

Table 6 – Functions of AI Modules of Conversation with Emotion

AIM Function
Speech Recognition Recognises Speech
Language Understanding Refines Text and extracts Meaning
Personal Status Extraction Extracts Personal Status from Meaning, Speech, and Face.
Dialogue Processing 1.      Processes Refined Text and Personal Status

2.      Produces Machine Text and Personal Status.

Speech Synthesis (Emotion) Synthesises Machine Speech from Machine Text and Machine Personal Status.
Lips Animation Animates the lips of a Face drawn from the Video of Faces KB consistently with Machine Speech and Machine Personal Status.

5.2.5        I/O Data of AI Modules of Conversation with Emotion

Table 7 gives the I/O Data of the AI Modules of Conversation with Emotion.

 

Table 7 – AI Modules of Conversation with Emotion

AIM Receives Produces
Speech Recognition Input Speech Recognised Text
Language Understanding Recognised Text Meaning and Refined Text.
Personal Status Extraction 1.        Meaning

2.        Speech

3.        Face

Input Personal Status (Emotion only).
Dialogue Processing 1.        Meaning.

2.        Based on Input Selection

2.1.       Refined Text

2.2.       Input Text.

3.        Input Personal Status.

1.      Machine Personal Status

2.      Machine Text

Speech Synthesis (Emotion) 1.      Machine Text

2.      Machine Personal Status

Machine Speech.
Lips Animation 1.      Machine Personal Status

2.      Machine Speech

Video with animated lips of a Face drawn from the Video of Faces KB.

5.2.6        JSON Metadata of Conversation with Emotion

Specified in Annex 7 – AIW and AIM Metadata of MMC-CWE.

5.3        Multimodal Question Answering (MQA)

5.3.1        Scope of Multimodal Question Answering

In a Question Answering (QA) System, a machine provides answers to a user’s question presented in natural language. Multimodal Question Answering improves current QA systems, which can only deal with text or speech inputs, by offering the requesting human the ability to present speech or text together with images. For example, users might ask “Where can I buy this tool?” while showing a picture of the tool, even without showing their faces. In the Multimodal Question Answering (MMC-MQA) Use Case, a machine responds to a question expressed by a user in text or speech while showing an object. The machine’s response may use text and synthetic speech.

5.3.2        Reference Architecture of Multimodal Question Answering

Figure 3 gives the Multimodal Question Answering Reference Model including the input/output data, the AIMs, and the data exchanged between and among the AIMs.

 

Figure 3 – Reference Model of Multimodal Question Answering

 

The operation of Multimodal Question Answering develops in the following way:

  1. Input Selection is used to inform the machine whether the human employs Text or Speech to query the machine.
  2. Depending on the value of Input Selection, Language Understanding:
    • Extracts the Meaning of the question from Recognised Text and refines Recognised Text.
    • Extracts the Meaning of the question from Input Text.
  3. Visual Scene Description extracts the Physical Object.
  4. Object Identification identifies the Physical Object.
  5. Question Analysis determines the Intention of the question.
  6. Question Answering uses Intention and Meaning to produce the answer as Machine Text.
  7. Speech Synthesis (Text) produces the Output Speech from Machine Text.
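
The following non-normative sketch illustrates the Input Selection branch and the Question Analysis / Question Answering chain described above; the function bodies and data values are placeholders, not part of the specification:

```python
# Non-normative sketch: the function bodies and values are placeholders.

def language_understanding(input_selection, input_text, recognised_text):
    # Meaning comes from Recognised Text (Speech path) or from Input Text (Text path).
    source = recognised_text if input_selection == "Speech" else input_text
    meaning = {"predicate": "buy", "object": "this tool"}   # placeholder Meaning
    return meaning, source                                  # Meaning, Refined Text

def question_analysis(meaning):
    return {"topic": meaning["object"], "focus": "place of purchase"}  # placeholder Intention

def question_answering(refined_text, intention, meaning, object_id):
    # A real implementation would query a knowledge source; here the answer is canned.
    return f"You can buy the {object_id} at a hardware store."         # Machine Text

meaning, refined = language_understanding("Speech", None, "Where can I buy this tool?")
intention = question_analysis(meaning)
print(question_answering(refined, intention, meaning, object_id="cordless drill"))
```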

5.3.3        I/O Data of Multimodal Question Answering

The input and output data of the Multimodal Question Answering Use Case are:

 

Table 8 – I/O Data of Multimodal Question Answering

 

Input Comments
Input Selection Data determining the use of Speech or Text.
Input Text Text typed by the human as a replacement for Input Speech.
Input Speech Speech of the human asking a question to the Machine.
Input Video Video of the human showing an object held in hand.
Output Comments
Output Text The Text generated by the Machine in response to human inputs.
Output Speech The Speech generated by the Machine in response to human inputs.

5.3.4        Functions of AI Modules of Multimodal Question Answering

Table 9 provides the functions of the Multimodal Question Answering Use Case.

 

Table 9 – Functions of AI Modules of Multimodal Question Answering

AIM Function
Visual Scene Description Extracts the Physical Object in the Visual Scene.
Object Identification Identifies the Physical Object.
Speech Recognition Recognises Speech.
Language Understanding Extracts Meaning and refines Text from Recognised Text.
Question Analysis Extracts Intention from Text.
Question Answering Produces response of Machine to the query.
Speech Synthesis (Text) Synthesises Speech from Text.

5.3.5        I/O Data of AI Modules of Multimodal Question Answering

The AI Modules of Multimodal Question Answering are given in Table 10.

 

Table 10 – AI Modules of Multimodal Question Answering

AIM Receives Produces
Visual Scene Description Input Video Physical Object
Object Identification Physical Object Physical Object Identifier
Speech Recognition Input Speech Recognised Text
Language Understanding Input Text or Recognised Text (based on Input Selection) Refined Text

Meaning

Question Analysis Meaning Intention
Question Answering 1.      Input or Recognised Text (based on Input Selection)

2.      Intention

3.      Meaning

Machine Text
Speech Synthesis (Text) Machine Text Machine Speech

5.3.6        JSON Metadata of Multimodal Question Answering

Specified in Annex 8 – AIW and AIM Metadata of MMC-MQA.

5.4        Conversation About a Scene (CAS)

5.4.1        Scope of Conversation About a Scene

This Use Case addresses the case of a human holding a conversation with a Machine:

  1. The Machine sees and hears an Environment containing a speaking human and some scattered objects.
  2. The Machine recognises the human’s Speech and obtains the human’s Personal Status by capturing Speech, Face, and Gesture.
  3. The human converses with the Machine, indicating the object in the Environment s/he wishes to talk about or ask questions about using Speech, Face, and Gesture.
  4. The Machine understands which object the human is referring to and generates an avatar that:
    • Utters Speech conveying a synthetic Personal Status that is relevant to the human’s Personal Status as shown by his/her Speech, Face, and Gesture, and
    • Displays a face conveying a Personal Status that is relevant to the human’s Personal Status and to the response the Machine intends to make.
  5. The Machine displays the Scene Presentation corresponding to how it perceives the Environment from a human-selected Point of View. The objects in the scene are labelled with the Machine’s understanding of their semantics so that the human can understand how the Machine sees the Environment.

5.4.2        Reference Architecture of Conversation About a Scene

Figure 4 gives the Conversation About a Scene Reference Model including the input/output data, the AIMs, and the data exchanged between and among the AIMs.

 

Figure 4 – Reference Model of Conversation About a Scene

The Machine operates according to the following workflow:

  1. Visual Scene Description produces Body Descriptors, Visual Scene Geometry and Physical Objects from Input Video.
  2. Speech Recognition produces Recognised Text from Input Speech.
  3. Spatial Object Identification produces Physical Object ID from Physical Object and Body Descriptors.
  4. Language Understanding produces Meaning and Refined Text from Recognised Text and Physical Object ID.
  5. Personal Status Extraction produces Input Personal Status from Meaning, Input Speech, Face Descriptors, and Body Descriptors.
  6. Dialogue Processing produces Machine Text and Machine Personal Status from Input Personal Status, Meaning, and Refined Text.
  7. Personal Status Display produces Machine Text, Machine Speech, Machine Avatar from Machine Text, and Machine Personal Status.
  8. Scene Presentation uses the Visual Scene Descriptors to produce the Rendered Scene as seen from the user-selected Point of View. The rendering is constantly updated as the machine improves its understanding of the scene and its objects.
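
The following non-normative sketch illustrates the last step: the machine's current understanding of the scene is rendered from the human-selected Point of View with each object labelled by the machine's interpretation; the descriptor fields, labels, and positions are placeholder values:

```python
# Non-normative sketch: descriptor fields, labels, and positions are placeholders.

scene_descriptors = [
    {"object_id": "obj-1", "label": "coffee mug", "position": (1.2, 0.4, 0.8)},
    {"object_id": "obj-2", "label": "laptop",     "position": (0.3, 0.4, 1.5)},
]

def scene_presentation(descriptors, point_of_view):
    # A real renderer would project the 3D scene from the Point of View; this stub
    # only lists what would be shown, to make the "labelled objects" idea concrete.
    print(f"Rendered Scene from Point of View {point_of_view}:")
    for d in descriptors:
        print(f"  {d['label']} at position {d['position']}")

scene_presentation(scene_descriptors,
                   point_of_view={"position": (0.0, 1.6, 0.0), "orientation": (0.0, 0.0, 0.0)})
```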

5.4.3        I/O Data of Conversation About a Scene

Table 11 gives the input/output data of Conversation About a Scene.

 

Table 11 – I/O data of Conversation About a Scene

 

Input data From Comment
Input Video Camera Points to human and scene.
Input Speech Microphone Speech of human.
Point of View Human The point of view of the scene displayed by Scene Presentation.
Output data To Comments
Machine Speech Human Machine’s speech.
Machine Avatar Human Portion of Machine’s avatar (e.g., face).
Rendered Scene Human Reproduction of the scene perceived by Machine containing labelled objects as seen from the Point of View.

5.4.4        Functions of AI Modules of Conversation About a Scene

Table 12 provides the functions of the Conversation About a Scene Use Case.

 

Table 12 – Functions of AI Modules of Conversation About a Scene

AIM Functions
Visual Scene Description Provides Visual Objects and their Spatial Attitudes.
Spatial Object Identification Provides ID of a Physical Object.
Speech Recognition Recognises Speech.
Language Understanding Refines Text and extracts Meaning.
Personal Status Extraction Extracts Personal Status from Meaning, Speech, Body, and Face.
Dialogue Processing 1.      Processes Refined Text and Personal Status.

2.      Produces Machine’s Text and Personal Status.

Scene Presentation Renders the Visual Scene as perceived by the Machine from the Point of View selected by human.
Personal Status Display Provides Machine Speech and Machine Avatar from Machine Text and Machine Personal Status.

5.4.5        I/O Data of AI Modules of Conversation About a Scene

Table 13 gives the list of AIMs with their I/O Data.

 

Table 13 – AI Modules of Conversation About a Scene

 

AIM Receives Produces
Visual Scene Description Input Video 1.      Visual Scene Descriptors

2.      Body Descriptors

3.      Face Descriptors

4.      Visual Scene Geometry

5.      Physical Objects

Spatial Object Identification 1.      Body Object

2.      Physical Objects

3.      Visual Scene Geometry

Physical Object ID
Speech Recognition Input Speech Recognised Text
Language Understanding 1.      Recognised Text

2.      Physical Object ID

1.      Meaning

2.      Refined Text

Personal Status Extraction 1.      Body Object

2.      Face Object

3.      Input Speech

4.      Meaning

 Personal Status
Dialogue Processing 1.      Personal Status

2.      Meaning

3.      Refined Text

1.      Machine Personal Status

2.      Machine Text
Scene Presentation 1.      Visual Scene Descriptors

2.      Point of View

Rendered Scene
Personal Status Display 1.      Machine Text

2.      Machine Personal Status

1.      Machine Text

2.      Machine Speech

3.      Machine Avatar

5.4.6        JSON Metadata of Conversation About a Scene

Specified in Annex 9 – AIW and AIM Metadata of MMC-CAS.

5.5        Virtual Secretary for Videoconference (VSV)

5.5.1        Scope of Virtual Secretary for Videoconference

In a virtual videoconference, i.e., a videoconference whose participants are avatars realistically impersonating the human participants, a Virtual Secretary is tasked with:

  1. Listening to the Speech of each avatar.
  2. Monitoring their Personal Status.
  3. Drafting a Summary, in the meeting’s common language, using the avatars’ Personal Statuses and the Text obtained from the Speech Recognition AIM or received directly via Text input. The Summary is handled in two different ways:
    • Transferred to an external application so that participants can edit the Summary.
    • Displayed to avatars:
      • Avatars make Speech comments or Text comments (e.g., offline via chat).
      • The Virtual Secretary edits the Summary interpreting the avatars’ Text and Personal Statuses.

Chapter 5 of Annex 1 – MPAI Basics provides additional information on the Avatar-Based Videoconference Use Case.

5.5.2        Reference Architecture of Virtual Secretary for Videoconference

Figure 5 specifies the architecture of the Virtual Secretary AIW.

 

Figure 5 – Reference Model of the Virtual Secretary for Videoconference Use Case

The Virtual Secretary processes one avatar at a time according to the following workflow:

  1. Speech Recognition extracts Text from avatar Speech.
  2. Avatar Descriptors Parsing provides Body and Face Descriptors.
  3. Language Understanding:
    • Receives Recognised Text.
    • Produces:
      • Refined Text (of Recognised Text).
      • Meaning.
  4. Personal Status Extraction:
    • Receives Meaning, Speech, and Body and Face Descriptors.
    • Produces the Personal Status of the avatar it is interacting with.
  5. Summarisation:
    • Receives:
      • Refined Text
      • Personal Status
      • Meaning
    • Produces Summary using Personal Status and Text in the meeting’s common language.
    • Receives Edited Summary from Dialogue Processing.
  6. Dialogue Processing:
    • Receives:
      • Refined Text.
      • Text from an avatar (concerning Summary, via chat).
      • Personal Status.
    • Edits the Summary using avatars’ inputs.
    • Sends Edited Summary back to Summarisation.
    • Outputs VS Text concerning Summary and Personal Status of Virtual Secretary.
  7. Personal Status Display:
    • Receives Virtual Secretary’s Output Text and Personal Status.
    • Produces the Virtual Secretary’s:
      • Synthesised Speech.
      • Face and Body Descriptors.
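
The following non-normative sketch illustrates the Summarisation / Dialogue Processing loop described above, in which the Edited Summary is fed back to Summarisation; all function bodies and data values are placeholders:

```python
# Non-normative sketch: data values are placeholders.

def summarisation(refined_text, edited_summary=None):
    # Start a Summary from an avatar's utterance, or adopt the Edited Summary fed back.
    return edited_summary if edited_summary else f"Summary: {refined_text}"

def dialogue_processing(avatar_comment, personal_status, summary):
    # Edit the Summary using the avatar's comment and Personal Status, and produce
    # the Virtual Secretary's own Text and Personal Status.
    edited = f"{summary} [amended: {avatar_comment}]"
    vs_text = "I have updated the summary as requested."
    vs_personal_status = {"attitude": "cooperative"}
    return vs_personal_status, vs_text, edited

summary = summarisation("We agree to publish the draft for comments.")
for comment in ["please record the comment deadline", "note that one member objected"]:
    vs_ps, vs_text, edited = dialogue_processing(comment, {"emotion": "neutral"}, summary)
    summary = summarisation(None, edited_summary=edited)    # Edited Summary fed back
print(summary)
```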

5.5.3        I/O Data of Virtual Secretary for Videoconference

Table 14 gives the input/output data of Virtual Secretary for Videoconference.

 

Table 14 – I/O data of Virtual Secretary

 

Input data From Comment
Text (xN) Avatars Remarks on the summary, etc.
Speech (xN) Avatars Utterances of avatars
Avatar Descriptors (xN) Avatars Gestures of avatars
Output data To Comments
Machine Speech Avatars VS Speech to avatars
Machine Face Avatars VS Face to avatars
Machine Avatar Avatars VS Avatar to avatars
Summary Avatars Summary of avatars’ interventions

5.5.4        Functions of AI Modules of Virtual Secretary for Videoconference

Table 15 gives the functions of Virtual Secretary for Videoconference AIMs.

 

Table 15 – Functions of Virtual Secretary for Videoconference AI Modules

 

AIM Functions
Speech Recognition Recognises Speech
Avatar Descriptors Parsing Provides Face and Body Descriptors
Language Understanding 1.      Refines Recognised Text

2.      Extracts Meaning

Personal Status Extraction Extracts Personal Status
Summarisation Produces and refines Summary using Edited Summary
Dialogue Processing Produces Text and Personal Status
Personal Status Display Shows Virtual Secretary as speaking Avatar with Personal Status

5.5.5        I/O Data of AI Modules of Virtual Secretary for Videoconference

Table 16 gives the AI Modules of the Virtual Secretary depicted in Figure 5.

 

Table 16 – AI Modules of Virtual Secretary

 

AIM Receives Produces
Speech Recognition Speech Recognised Text
Avatar Descriptors Parsing Avatar Descriptors 1.      Face Descriptors

2.      Body Descriptors

Language Understanding Recognised Text 1.      Refined Text

2.      Meaning

Personal Status Extraction 1.      Meaning

2.      Speech

3.      Face Descriptors

4.      Body Descriptors

Personal Status
Summarisation 1.      Meaning

2.      Refined Text

3.      Edited Summary

Summary
Dialogue Processing 1.      Refined Text

2.      Personal Status

3.      Meaning

4.      Summary

1.      VS Personal Status

2.      VS Text

3.      Edited Summary

Personal Status Display 1.      VS Text

2.      VS Personal Status

1.      PSD’s Avatar Model

2.      VS Text

3.      VS Speech

4.      VS Avatar Descriptors

5.5.6        JSON Metadata of Virtual Secretary for Videoconference

Specified in Annex 11 – AIW and AIM Metadata of ARA-VSV.

5.6        Human-Connected Autonomous Vehicle (CAV) Interaction (HCI)

5.6.1        Scope of Human-CAV Interaction

A Connected Autonomous Vehicle (CAV) is a system able to execute a command to move itself based on 1) the capture of data sensed by a range of onboard sensors exploring the environment and 2) the analysis and interpretation of the data captured and transmitted by other sources in range, such as other CAVs, traffic lights, and roadside units. Chapter 6 of Annex 1 – MPAI Basics describes the four Subsystems of a CAV, among which Human-CAV Interaction (HCI) has the function of recognising the human owner or renter, responding to humans’ commands and queries, conversing with humans during the travel, and activating the Autonomous Motion Subsystem in response to humans’ requests. Inter HCI Information, HCI-AMS Commands, and AMS-HCI Response are indicated in Figure 6 but not specified.

5.6.2        Reference Architecture of Human-CAV Interaction

Figure 6 represents the Human-CAV Interaction (HCI) Reference Model.

 

Figure 6 – Human-CAV Interaction Reference Model

The operation of HCI involves the following functions:

  1. A group of humans approaches the CAV from outside:
    • The Audio Scene Description AIM creates the Audio Scene Description in the form of Audio (Speech) Objects corresponding to each speaking human in the Environment (close to the CAV).
    • The Visual Scene Description creates the Visual Scene Descriptors in the form of Body and Face Descriptors corresponding to each human in the Environment (close to the CAV).
    • The Speaker Recognition and Face Recognition AIMs authenticate the humans that the HCI is interacting with using Speech and Face Descriptors.
    • The Speech Recognition AIM recognises the speech of each human.
    • The Language Understanding AIM extracts Meaning and produces Refined Text.
    • The Personal Status Extraction AIM extracts the Personal Status of the humans.
    • The Dialogue Processing AIM validates the human Identities, produces the response and displays the HCI Personal Status, and issues commands to the Autonomous Motion Subsystem.
  2. A group of humans sits in the seats inside the CAV:
    • The Audio Scene Description AIM creates the Audio Scene Descriptions in the form of Audio (Speech) Objects corresponding to each speaking human in the cabin.
    • The Visual Scene Description creates the Visual Scene Descriptors in the form of Body and Face Descriptors corresponding to each human in the cabin, and Physical Objects.
    • The Speaker Recognition and Face Recognition AIMs identify the humans the HCI is interacting with using Speech and Face Descriptors.
    • The Speech Recognition AIM recognises the speech of each human.
    • The Language Understanding AIM extracts Meaning and produces Refined Text.
    • The Personal Status Extraction AIM extracts the Personal Status of the humans.
    • The Dialogue Processing AIM recognises the human Identities, produces the response, displays the HCI Personal Status, and issues commands to the Autonomous Motion Subsystem.
  3. The HCI interacts with the humans in the cabin in several ways:
    • By responding to commands/queries from one or more humans at the same time, e.g.:
      • Commands to go to a waypoint, park at a place, etc.
      • Commands with an effect in the cabin, e.g., turn off air conditioning, turn on the radio, call a person, open window or door, search for information etc.

    • By conversing with and responding to questions from one or more humans at the same time about travel-related issues (in-depth domain-specific conversation), e.g.:
      • Humans request information, e.g., time to destination, route conditions, weather at destination, etc.
      • CAV offers alternatives to humans, e.g., long but safe way, short but likely to have interruptions.
      • Humans ask questions about objects in the cabin.
    • By following the conversation on travel matters held by humans in the cabin. The initial conditions for this participation are that: 1) the passengers allow the HCI to do so, and 2) the processing is carried out inside the CAV.

Note: For completeness, Figure 6 includes the interaction of the HCI with the AMS (e.g., commands and responses regarding the selection of a Route by a human) and with remote HCIs. However, this document does not address the format in which these interactions are performed.

 

Note that:

  1. The Audio Scene Description provides all Speech Objects in the Audio Scene, removing all other audio sources.
  2. The Speaker Recognition and Speech Recognition AIMs support multiple Speech Objects as input. Each Speech Object has an identifier to enable the Speaker Recognition and Speech Recognition AIMs to provide Recognised Texts labelled with Speaker IDs. If the Face Recognition AIM provides Face IDs corresponding to the Speaker IDs, the Dialogue Processing AIM can correctly associate the Speaker IDs (and the corresponding Recognised Texts) with the Face IDs.
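
The following non-normative sketch illustrates the ID association described in this note; the mapping between Speaker IDs and Face IDs is assumed to be available to the Dialogue Processing AIM (for example from prior registration), which is an assumption of this sketch rather than a provision of the specification:

```python
# Non-normative sketch.  The Speaker-ID-to-Face-ID mapping is assumed to exist
# (hypothetical registration data); the specification does not define it here.

recognised_texts = {"speaker-1": "Take us downtown.", "speaker-2": "Turn on the radio."}
speaker_to_face = {"speaker-1": "face-A", "speaker-2": "face-B"}

def associate(recognised_texts, speaker_to_face):
    # Attach each Recognised Text, labelled with its Speaker ID, to the Face ID
    # of the same human, so Dialogue Processing can address each person correctly.
    return [{"speaker_id": s, "face_id": speaker_to_face.get(s), "text": t}
            for s, t in recognised_texts.items()]

for entry in associate(recognised_texts, speaker_to_face):
    print(entry)
```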

5.7.1        I/O Data of Human-CAV Interaction

Table 17 gives the input/output data of Human-CAV Interaction.

 

Table 17 – I/O data of Human-CAV Interaction

 

Input data From Comment
Audio (Indoor) Cabin Passengers User’s social life; Commands/interaction with CAV
Audio (Outdoor) Users in Environment User authentication; User command; User conversation
Input Text Cabin Passengers User’s social life; Commands/interaction with CAV
Video (Outdoor) Users in Environment Commands/interaction with CAV
LiDAR (Indoor) Cabin Passengers User’s social life; Commands/interaction with CAV
RADAR (Indoor) Cabin Passengers User’s social life; Commands/interaction with CAV
Video (Indoor) Cabin Passengers User’s social life; Commands/interaction with CAV
Inter HCI Info Remote HCI
AMS-HCI Response Motion Actuation Subsystem AMS Response about execution of HCI-AMS Command
Output data To Comments
Output Speech Cabin Passengers CAV’s response to passengers
Output Avatar Cabin Passengers Portion of CAV’s Avatar (e.g., head & face)
Output Text Cabin Passengers CAV’s response to passengers
Inter HCI Info Remote HCI
HCI-AMS Commands Motion Actuation Subsystem Command to AMS to actuate wheels, brakes, etc.

 

Note that this document does not specify Inter HCI Information, HCI-AMS Commands, and AMS-HCI Response.

5.7.2        Functions of AI Modules of Human-CAV Interaction

Table 18 gives the functions of all Human-CAV Interaction AIMs.

 

Table 18 – Functions of Human-CAV Interaction’s AI Modules

 

AIM Function
Audio Scene Description Produces the Audio Scene Descriptors using the Audio captured by the appropriate (indoor or outdoor) Microphone Array.
Visual Scene Description Produces the Visual Scene Descriptors using the visual information captured by the appropriate (indoor or outdoor) visual sensors.
Speech Recognition Converts speech into Text.
Physical Object Identification Provides the ID of the class of objects of which the Physical Object is an Instance.
Language Understanding Improves the Text from Speech Recognition by using context information (e.g., Instance ID of object).
Speaker Recognition Provides Speaker ID from Speech.
Personal Status Extraction Provides the Personal Status of human.
Face Recognition Provides Face ID from Face.
Dialogue Processing Provides:

1.      Text containing the response of the HCI to the human.

2.      Personal Status of HCI congruous with the Text produced by the HCI.

Personal Status Display Produces Speech, and Machine Face and Body.

5.7.3        I/O Data of AI Modules of Human-CAV Interaction

Table 19 gives the AI Modules of the Human-CAV Interaction depicted in Figure 6.

 

Table 19 – AI Modules of Human-CAV interaction

AIM Receives Produces
Audio Scene Description Receives: Environment Audio (outdoor); Environment Audio (indoor). Produces: Speech Objects.
Visual Scene Description Receives: Environment Video (outdoor); Environment Video (indoor). Produces: Face Objects; Physical Objects; Body Descriptors; Face Descriptors.
Speech Recognition Receives: Speech Object. Produces: Recognised Text.
Physical Object Identification Receives: Physical Object; Body Descriptors. Produces: Object ID.
Language Understanding Receives: Recognised Text; Personal Status; Object ID. Produces: Meaning; Personal Status; Refined Text.
Speaker Recognition Receives: Speech Descriptors. Produces: Speaker ID.
Personal Status Extraction Receives: Speech Object; Meaning; Face Descriptors; Body Descriptors. Produces: Personal Status.
Face Recognition Receives: Face Object. Produces: Face ID.
Dialogue Processing Receives: Speaker ID; Meaning; Refined Text; Personal Status; Face ID; AMS-HCI Response. Produces: HCI-AMS Commands; Output Text; Output Personal Status.
Personal Status Display Receives: Machine Text; Output Personal Status. Produces: Machine Avatar; Machine Text; Machine Speech.

5.7.4        JSON Metadata of Human-CAV Interaction

Specified in Annex 10 – .

5.8        Unidirectional Speech Translation (UST)

5.8.1        Scope of Unidirectional Speech Translation

The goal of the Unidirectional Speech Translation (MMC-UST) Use Case is to translate speech segments expressed in a source language into a target language or to produce the textual version of the translated speech. If the desired output is speech, the user can specify whether their speech features (voice colour, emotional charge, etc.) should be preserved in the translated speech.

 

The flow of control is from Input Speech or Input Text to Translated Text, and then to Output Speech and Output Text. Depending on the value of Input Selection:

  1. Input Text in Language A is translated into Translated Text in Language B and pronounced as Speech in Language B.
  2. The Speech features (voice colour, emotional charge, etc.) in Language A are preserved in Language B.

5.8.2        Reference Architecture of Unidirectional Speech Translation

Figure 7 describes the input/output data, the AIMs and the data exchanged between AIMs.

 

Figure 7 – Reference Model of Unidirectional Speech Translation (UST)

5.8.3        I/O Data of Unidirectional Speech Translation

The input and output data of the Unidirectional Speech Translation Use Case are:

 

Table 20 – I/O Data of Unidirectional Speech Translation

 

Input Comments
Input Selection Determines whether:

1.      The input will be in Text or Speech

2.      The Input Speech features are preserved in the Output Speech.

Requested Languages User-specified input Language (A) and output Language (B).
Input Speech Speech produced in Language A by a human desiring translation into language B.
Input Text Alternative textual source information to be translated into and pronounced in language B depending on the value of Input Selection.
Output Comments
Translated Speech Input Speech translated into language B preserving the Input Speech features in the Output Speech, depending on the value of Input Selection.
Translated Text Text of Input Speech or Input Text translated into language B, depending on the value of Input Selection.

5.8.4        Functions of AI Modules of Unidirectional Speech Translation

Table 21 gives the functions of Unidirectional Speech Translation AIMs.

 

Table 21 – Functions of Unidirectional Speech Translation AI Modules

AIM Functions
Speech Recognition Recognises Speech
Translation Translates Recognised Text
Speech Feature Extraction Extracts Speech Features
Speech Synthesis (Features) Synthesises Translated Text adding Speech Features

5.8.5        I/O Data of AI Modules of Unidirectional Speech Translation

The AI Modules of Unidirectional Speech Translation are given in Table 22.

 

Table 22 – AI Modules of Unidirectional Speech Translation

 

AIM Receives Produces
Speech Recognition Receives: Input Speech Segment. Produces: Recognised Text.
Translation Receives: Input Text or Recognised Text (based on Input Selection). Produces: Translated Text.
Speech Feature Extraction Receives: Input Speech. Produces: Speaker-specific Speech Features (e.g., tones, intonation, intensity, pitch, emotion, speed).
Speech Synthesis (Features) Receives: Translated Text; Speech Features (depending on Input Selection). Produces: Output Speech.

5.8.6        JSON Metadata of Unidirectional Speech Translation

Specified in Annex 12 – .

5.9        Bidirectional Speech Translation (BST)

5.9.1        Scope of Bidirectional Speech Translation

The goal of the Bidirectional Speech Translation (MMC-BST) Use Case is to support a conversation between two people, each speaking a different language. The machine translates each input speech segment into the selected language as speech or text. If the desired output is speech, users can specify whether their speech features (voice colour, emotional charge, etc.) should be preserved in the translated speech.

The flow of control (from Input Speech to Translated Text to Output Speech) is identical to that of the Unidirectional case. The difference is that, rather than one such flow, two flows are provided in two different channels – the first from language A to language B, and the second from language B to language A.

 

Depending on the value of Input Selection:

  1. Input Text in Language A is translated into Translated Text in Language B and pronounced as Speech in Language B.
  2. The Speech features (voice colour, emotional charge, etc.) in Language A are preserved in Language B.

 

The same applies for the Language-B-to-Language-A channel.

5.9.2        Reference Architecture of Bidirectional Speech Translation

Figure 8 depicts the AIMs and the data exchanged between AIMs.

 

Figure 8 – Reference Model of Bidirectional Speech Translation (BST)

5.9.3        I/O Data of Bidirectional Speech Translation

The input and output data of the Bidirectional Speech Translation Use Case are:

 

Table 23 – I/O Data of Bidirectional Speech Translation

 

Input Comments
Input Selection Determines whether the input will be Text or Speech.
Requested languages User-specified input language and output languages
Input Speech1 Speech by human1 desiring spoken translation in the specified language.
Input Text1 Alternative Input Text to be translated to the specified language.
Input Speech2 Speech by human2 desiring spoken translation in the specified language.
Input Text2 Alternative Input Text to be translated to the specified language.
Output Comments
Output Speech1 Translated Speech of Speaker 1.
Output Text1 Text of the translated Speech of Speaker 1.
Output Speech2 Translated Speech of Speaker 2.
Output Text2 Text of the translated Speech of Speaker 2.

5.9.4        Functions of AI Modules of Bidirectional Speech Translation

Table 24 gives the functions of Bidirectional Speech Translation AIMs.

 

Table 24 – Functions of Bidirectional Speech Translation AI Modules

AIM Functions
Speech Recognition Recognises Speech
Translation Translates Recognised Text
Speech Feature Extraction Extracts Speech Features
Speech Synthesis (Features) Synthesises Translated Text adding Speech Features

 

5.9.5        I/O Data of AI Modules of Bidirectional Speech Translation

Table 25 gives the I/O Data of the AI Modules.

 

Table 25 – AI Modules of Bidirectional Speech Translation

 

AIM Receives Produces
Speech Recognition Receives: Input Speech 1 Segment; Input Speech 2 Segment. Produces: Recognised Text 1; Recognised Text 2.
Translation Receives: Input Text 1 or Recognised Text 1; Input Text 2 or Recognised Text 2 (based on the value of Input Selection). Produces: Translated Text 1; Translated Text 2.
Speech Feature Extraction Receives: Input Speech 1; Input Speech 2. Produces: Speech Features 1; Speech Features 2.
Speech Synthesis (Features) Receives: Translated Text 1; Translated Text 2; Speech Features 1 and 2 (based on Input Selection). Produces: Translated Speech 1; Translated Speech 2.

5.9.6        JSON Metadata of Bidirectional Speech Translation

Specified in Annex 13 – .

5.10    One-to-Many Speech Translation (MST)

5.10.1    Scope of One-to-Many Speech Translation

The goal of the One-to-Many Speech Translation (MMC-MST) Use Case is to enable one person speaking his or her language to broadcast to two or more audience members, each listening and responding in a different language, with the translation presented as speech or text. If the desired output is speech, users can specify whether their speech features (voice colour, emotional charge, etc.) should be preserved in the translated speech.

 

The flow of control (from Recognised Text to Translated Text to Output Speech) is identical to that of the Unidirectional case. However, rather than one such flow, multiple paired flows are provided – the first pair from language A to language B and B to A; the second from A to C and C to A; and so on.

Depending on the value of Input Selection (text or speech):

  1. Input Text in Language A is translated into Translated Text in, and pronounced as Speech in, all Requested Languages.
  2. The Speech features (voice colour, emotional charge, etc.) in Language A are preserved in all Requested Languages.

5.10.2    Reference Architecture of One-to-Many Speech Translation

Figure 9 depicts the AIMs and the data exchanged between AIMs.

 

Figure 9 – Reference Model of One-to-Many Speech Translation (MST)

5.10.3    I/O Data of One-to-Many Speech Translation

The input and output data of the One-to-Many Speech Translation Use Case are:

 

Table 26 – I/O Data of One-to-Many Speech Translation

 

Input Comments
Input Selection Determines whether the input will be in Text or Speech.
Desired Languages User-specified input language and translated languages
Input Speech Speech produced by human desiring translation and interpretation in a specified set of languages.
Input Text Alternative textual source information.
Output Comments
Translated Speech Speech translated into the Requested Languages.
Translated Text Text translated into the Requested Languages.

5.10.4    Functions of AI Modules of One-to-Many Speech Translation

Table 27 gives the functions of One-to-Many Speech Translation AIMs.

 

Table 27 – Functions of One-to-Many Speech Translation AI Modules

AIM Functions
Speech Recognition Recognises Speech
Translation Translates Recognised Text
Speech Feature Extraction Extracts Speech Features
Speech Synthesis (Features) Synthesises Translated Text adding Speech Features

5.10.5    I/O Data of AI Modules of One-to-Many Speech Translation

Table 28 gives the I/O Data of the AI Modules.

 

Table 28 – AI Modules of One-to-Many Speech Translation

 

AIM Receives Produces
Speech Recognition Input Speech Segment Recognised Text
Speech Feature Extraction Input Speech Speaker-specific Speech Features.
Translation Text input Translated Texts in the Requested Languages.
Speech Synthesis (Features) Receives: Translated Texts; Speech Features (based on Input Selection). Produces: Speech Segments in the Desired Languages.

5.10.6    JSON Metadata of One-to-Many Speech Translation

Specified in Annex 14 – .

 

6          Composite AI Modules

AI Modules composed of multiple AI Modules are called Composite AIMs. They are used in several MPAI-MMC Use Cases. This chapter specifies the Personal Status Extraction (PSE) AIM using a format like the one adopted for Use Cases. Other Technical Specifications specify other Composite AIMs, such as [3], which specifies the Personal Status Display Composite AIM used in this Technical Specification.

6.1        Personal Status Extraction (PSE)

Personal Status Extraction (PSE) is a Composite AIM that extracts the Cognitive State, Emotion, and Social Attitude (called Factors) conveyed by each of Text, Speech, Face, and Gesture (called Modalities) and provides an estimate of the Personal Status, intended as a combination of Factors. The Personal Status Extraction Composite AIM is used in MPAI-MMC and other Use Cases as a replacement for the combination of AIMs depicted in Figure 10. Personal Status need not convey information on all Factors and all Modalities.

6.1.1        Scope of Personal Status Extraction

Personal Status Extraction produces the estimate of the Personal Status of a human or an avatar by analysing each Modality in three steps:

  1. Data Capture (e.g., characters and words, a digitised speech segment, the digital video containing the hand of a person, etc.).
  2. Descriptor Extraction (e.g., pitch and intonation of the speech segment, thumb of the hand raised, the right eye winking, etc.).
  3. Personal Status Interpretation (i.e., one of Emotion, Cognitive State, and Attitude).

 

An implementation may combine two or more of the AIMs implementing the steps.

6.1.2        Reference Architecture of Personal Status Extraction

Figure 10 depicts the Personal Status extraction process:

  1. Descriptors are extracted from Text, Speech, Face Object, and Body Object. Depending on the value of Selection, Descriptors can be provided by an AI Module upstream.
  2. Descriptors are interpreted and the specific indicators of the Personal Status in the Text, Speech, Face, and Gesture Modalities are derived.
  3. Personal Status is obtained by combining the estimates of different Modalities of the Personal Status.

 

Input Selection informs PSE whether a Modality or its Descriptors are used.

 

Figure 10 – Reference Model of Personal Status Extraction

Note that:

  1. A Modality can be input into the Personal Status Extraction Composite AIM either as a Modality or as Descriptors. In both cases, the Descriptors have the same syntax and semantics. Text Descriptors are equivalent to Meaning. Gesture Description extracts Gesture Descriptors from the Body Object. In the future, other Descriptors may be extracted from the Body Object.
  2. An Implementation can combine, e.g., the Gesture Description and PS-Gesture Interpretation AIMs into one AIM, and directly provide PS-Gesture from a Body Object without exposing PS-Gesture Descriptors.

6.1.3        I/O Data of Personal Status Extraction

Table 29 gives the input/output data of Personal Status Extraction.

 

Table 29 – I/O data of Personal Status Extraction

 

Input data From Comment
Input Selection An external signal  
Text Keyboard or Speech Recognition Text or recognised speech.
Text Descriptors An upstream AIM  
Speech Microphone Speech of human.
Speech Descriptors An upstream AIM  
Face Object Visual Scene Description The face of the human.
Face Descriptors An upstream AIM  
Body Object Visual Scene Description The upper part of the body.
Body Descriptors An upstream AIM  
Output data To Comments
Personal Status A downstream AIM For further processing

6.1.4        Functions of AI Modules of Personal Status Extraction

Table 30 gives functions of the AIMs.

 

Table 30 – AI Modules of Personal Status Extraction

 

AIM Function
Text Description Extracts the Descriptors of Text.
Speech Description Extracts the Descriptors of Speech.
Face Description Extracts the Descriptors of Face.
Gesture Description Extracts the Descriptors of Body.
PS-Text Interpretation Interprets the Personal Status Descriptors of Text.
PS-Speech Interpretation Interprets the Personal Status Descriptors of Speech.
PS-Face Interpretation Interprets the Personal Status Descriptors of Face.
PS-Gesture Interpretation Interprets the Personal Status Descriptors of Body.
Personal Status Combination Produces the Personal Status.

6.1.5        I/O Data of AI Modules of Personal Status Extraction

Table 31 gives the AI Modules of Personal Status Extraction with their input and output data.

 

Table 31 – AI Modules of Personal Status Extraction

 

AIM Receives Produces
Text Description Text Text Descriptors
Speech Description Speech Speech Descriptors
Face Description Face Object Face Descriptors
Gesture Description Body Object Gesture Descriptors
PS-Text Interpretation PS-Text Descriptors PS-Text
PS-Speech Interpretation PS-Speech Descriptors PS-Speech
PS-Face Interpretation PS-Face Descriptors PS-Face
PS-Gesture Interpretation PS-Gesture Descriptors PS-Gesture
Personal Status Combination PS-Text, PS-Speech, PS-Face, PS-Gesture Personal Status

6.1.6        JSON Metadata of Personal Status Extraction

Specified in Annex 15 – .

6.2        Personal Status Display (PSD)

6.2.1        Scope of Personal Status Display

A Personal Status Display (PSD) is a Composite AIM receiving Text and Personal Status and generating an avatar producing Text and uttering Speech with the intended Personal Status while the avatar’s Face and Gesture show the intended Personal Status. Instead of a ready-to-render avatar, the output can be provided as Compressed Avatar Descriptors. The Personal Status driving the avatar can be extracted from a human or can be synthetically generated by a machine as a result of its conversation with a human or another avatar. This Composite AIM is used in the Use Case figures of this document as a replacement for the combination of the AIMs depicted in Figure 11.

6.2.2        Reference Architecture of Personal Status Display

Figure 11 represents the AIMs required to implement Personal Status Display.

 

Figure 11 – Reference Model of Personal Status Display

The Personal Status Display operates as follows:

  1. Selection determines the type of avatar output – Machine Avatar or Avatar Descriptors.
  2. Text is passed as output and synthesised as Speech using the Personal Status provided by PS-Speech.
  3. Machine Speech and PS-Face are used to produce the Face Descriptors.
  4. PS-Gesture and Text are used for Body Descriptors using the Avatar Model.
  5. Avatar Description produces a complete set of Avatar Descriptors.
  6. Avatar Synthesis produces a ready-to-render Machine Avatar.

6.2.3        I/O Data of Personal Status Display

Table 32 gives the input/output data of Personal Status Display.

 

Table 32 – I/O data of Personal Status Display

 

Input data From Comment
Selection Switch PSD output type
Text Keyboard, Speech Recognition, Machine  
PS-Speech Personal Status Extractor or Machine  
Avatar Model From AIM/AIW or embedded  
PS-Face Personal Status Extractor or Machine  
PS-Gesture Personal Status Extractor or Machine  
Output data To Comments
Machine Text Human or Avatar (i.e., an AIM)  
Machine Speech Human or Avatar (i.e., an AIM)  
Compressed Descriptors AIM/AIW downstream  
Body Object Presentation Device Ready-to-render Avatar
Avatar Model As in input  

7          Data Formats

This Technical Specification specifies the Data Formats listed in Table 33. The reader is alerted that some data Formats are shared with the Context-based Audio Enhancement (MPAI-CAE) Standard [3]. At the current date, the specification of such data Formats is repeated verbatim in both Standards.

 

The first column gives the name of the data Format, the second the subsection where the data Format is specified and the third the Use Case(s) making use of it.

 

Table 33 – Data formats

 

Name of Data Format Subsection Use Case
Audio File 7.1 ABV, BST, CAS, CWE, HCI, MST, UST, VSV
Audio Scene Descriptors 7.2 ABV, HCI
Cognitive State 7.3 CAS, HCI, VSV
Emotion 7.4 ABV, CWE, HCI, VSV
Face Descriptors 7.5 ABV, CWE, HCI, VSV
Gesture Descriptors 7.6 ABV, CWE, HCI, VSV
Instance ID 7.7 HCI
Language Identifier 7.9 BST, MST, UST
Meaning 7.10 CAS, CWE, HCI
Personal Status 7.11 ABV, CAS, HCI
Physical Object Identifier (Instance Identifier) 7.7 CAS, MQA
Social Attitude 7.12 CAS, HCI
Spatial Attitude 7.13 CAS, HCI
Speech Descriptors 7.14 ABV, CWE, HCI, VSV
Speech Features 7.15 UST
Text 7.16 BST, CWE, MQA, MST, UST
Text Descriptors 7.17 ABV, CWE, HCI, VSV
Video 7.18 CWE
Video File 7.19 ARP
Video Of Faces KB Query Format 7.20 CWE
Visual Scene Descriptors 7.21 ABV, CAS, HCI

MPAI plans on creating a future specification that will contain all data Formats that are shared by more than one MPAI Standard.

7.1        Audio File

Audio data is packaged in a .wav file [10].

7.2        Audio Scene Descriptors

Audio Scene Descriptors are specified in MPAI-CAE V2 [3].

7.3        Cognitive State

Cognitive State is represented by the following Syntax and Semantics. Primary Cognitive State corresponds to General Adjectival and Secondary Cognitive State corresponds to Specific Adjectival in Table 34.

 

The Syntax and Semantics of Cognitive State are given by the following clauses.

7.3.1        Syntax

Cognitive State is represented by the following JSON schema:

{
  "$schema": "http://json-schema.org/draft-07/schema",
  "definitions": {
    "cogstateType": {
      "type": "object",
      "properties": {
        "cogstateDegree": {
          "enum": ["High", "Medium", "Low"]
        },
        "cogstateName": {
          "type": "number"
        },
        "cogstateSetName": {
          "type": "string"
        }
      }
    }
  },
  "type": "object",
  "properties": {
    "primary": {
      "$ref": "#/definitions/cogstateType"
    },
    "secondary": {
      "$ref": "#/definitions/cogstateType"
    }
  }
}

7.3.2        Semantics

Name Definition
cogstateType Specifies the Cognitive State that the input carries.
cogstateDegree Specifies the Degree of Cognitive State as one of “Low,” “Medium,” and “High.”
cogstateName Specifies the ID of a Cognitive State listed in Table 35.
cogstateSetName Specifies the name of the Cognitive State set which contains the Cognitive State. The Cognitive State set of Table 35 is used as a baseline, but other sets are possible.

 

Table 34 gives the standardised three-level Basic Cognitive State Label Set.

 

Table 34 – Basic Cognitive State Label Set

COGNITIVE CATEGORIES GENERAL ADJECTIVAL SPECIFIC ADJECTIVAL
AROUSAL aroused/excited/energetic cheerful, playful, lethargic, sleepy
ATTENTION attentive expectant/anticipating, thoughtful, distracted/absent-minded, vigilant, hopeful/optimistic
BELIEF credulous sceptical
INTEREST interested fascinated, curious, bored
SURPRISE surprised astounded, startled
UNDERSTANDING comprehending uncomprehending, bewildered/puzzled
 

Table 35 provides the semantics for each label in the GENERAL ADJECTIVAL and SPECIFIC ADJECTIVAL columns above.

 

Table 35 – Basic Cognitive State Semantics Set

ID Cognitive State Meaning
1 aroused/excited/energetic cognitive state of alertness and energy
2 astounded high degree of surprised
3 attentive cognitive state of paying attention
4 bewildered/puzzled high degree of incomprehension
5 bored not interested
6 cheerful energetic combined with and communicating happiness
7 comprehending cognitive state of successful application of mental models to a situation
8 credulous cognitive state of conformance to mental models of a situation
9 curious interest due to drive to know or understand
10 distracted/absent-minded not attentive to present situation due to competing thoughts
11 expectant/anticipating attentive to (expecting) future event or events
12 fascinated high degree of interest
13 interested cognitive state of attentiveness due to salience or appeal to emotions or drives
14 lethargic not aroused
15 playful energetic and communicating willingness to play
16 sceptical not credulous
17 sleepy not aroused due to need for sleep
18 surprised cognitive state due to violation of expectation
19 startled surprised by a sudden event or perception
20 surprised cognitive state due to violation of expectation
21 thoughtful attentive to thoughts
22 uncomprehending not comprehending
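
An informative example of a possible Cognitive State instance conforming to the Syntax of 7.3.1 follows. The values are illustrative only: the set name “MPAI-2.0” is an assumed placeholder and the IDs refer to Table 35 (3 = attentive, 9 = curious).

{
  "primary": {
    "cogstateDegree": "High",
    "cogstateName": 3,
    "cogstateSetName": "MPAI-2.0"
  },
  "secondary": {
    "cogstateDegree": "Medium",
    "cogstateName": 9,
    "cogstateSetName": "MPAI-2.0"
  }
}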

7.4        Emotion

The Syntax and Semantics of Emotion are given by the following clauses. Emotions are expressed vocally through combinations of prosody (pitch, rhythm, and volume variations); separable speech effects (such as degrees of voice tension, breathiness, etc.); and vocal gestures (laughs, sobs, etc.).

Emotion is represented by the following Syntax and Semantics. Primary Emotion corresponds to General Adjectival and Secondary Emotion corresponds to Specific Adjectival in Table 36.

7.4.1        Syntax

Human Emotion is represented by the following JSON schema:

{
  "$schema": "http://json-schema.org/draft-07/schema",
  "definitions": {
    "emotionType": {
      "type": "object",
      "properties": {
        "emotionDegree": {
          "enum": ["High", "Medium", "Low"]
        },
        "emotionName": {
          "type": "number"
        },
        "emotionSetName": {
          "type": "string"
        }
      }
    }
  },
  "type": "object",
  "properties": {
    "primary": {
      "$ref": "#/definitions/emotionType"
    },
    "secondary": {
      "$ref": "#/definitions/emotionType"
    }
  }
}

7.4.2        Semantics

Name Definition
emotionType Specifies the Emotion that the input carries.
emotionDegree Specifies the Degree of Emotion as one of “Low,” “Medium,” and “High.”
emotionName Specifies the ID of an Emotion listed in Table 37.
emotionSetName Specifies the name of the Emotion set which contains the Emotion. Emotion set of Table 37 is used as a baseline, but other sets are possible.

 

Table 36 gives the standardised three-level Basic Emotion Set partly based on Paul Ekman [19].

 

Table 36 – Basic Emotion Label Set

EMOTION CATEGORIES GENERAL ADJECTIVAL SPECIFIC ADJECTIVAL
ANGER angry furious, irritated, frustrated
CALMNESS calm peaceful/serene, resigned
DISGUST disgusted repulsed
FEAR fearful/scared terrified, anxious/uneasy
HAPPINESS happy joyful, content, delighted, amused
HURT hurt jealous, insulted/offended, resentful/disgruntled, bitter
PRIDE/SHAME proud ashamed, guilty/remorseful/sorry, embarrassed
RETROSPECTION nostalgic homesick
SADNESS sad lonely, grief-stricken, depressed/gloomy, disappointed

 

Table 37 provides the semantics for each label in the GENERAL ADJECTIVAL and SPECIFIC ADJECTIVAL columns above.

 

Table 37  – Basic Emotion Semantics Set

ID Emotion Meaning
1 amused positive emotion combined with interest (cognitive state)
2 angry emotion due to perception of physical or emotional damage or threat
3 anxious/uneasy low or medium degree of fear, often continuing rather than instant
4 ashamed emotion due to awareness of violating social or moral norms
5 bitter persistently angry due to disappointment or perception of hurt or injury
6 calm relatively lacking emotion
7 content medium or low degree of happiness, continuing rather than instant
8 delighted high degree of happiness, often combined with surprise
9 depressed/gloomy high degree of sadness, continuing rather than instant, combined with lethargy (see AROUSAL)
10 disappointed sadness due to failure of desired outcome
11 disgusted emotion due to urge to avoid, often due to unpleasant perception or disapproval
12 embarrassed shame due to consciousness of violation of social conventions
13 fearful/scared emotion due to anticipation of physical or emotional pain or other undesired event or events
14 frustrated angry due to failure of desired outcome
15 furious high degree of angry
16 grief-stricken sadness due to loss of an important social contact
17 happy positive emotion, often continuing rather than instant
18 homesick sad due to absence from home
19 hurt emotion due to perception that others have caused social pain or embarrassment
20 insulted/offended emotion due to perception that one has been improperly treated socially
21 irritated low or medium degree of angry
22 jealous emotion due to perception that others are more fortunate or successful
23 joyful high degree of happiness, often due to a specific event
24 repulsed high degree of disgusted
25 lonely sad due to insufficient social contact
26 mortified high degree of embarrassment
27 nostalgic emotion associated with pleasant memories, usually of long before
28 peaceful/serene calm combined with low degree of happiness
29 proud emotion due to perception of positive social standing
30 resentful/disgruntled emotion due to perception that one has been improperly treated
31 resigned calm due to acceptance of failure of desired outcome, often combined with low degree of sadness
32 sad negative emotion, often continuing rather than instant, often associated with a specific event
33 terrified high degree of fear
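
An informative example of a possible Emotion instance conforming to the Syntax of 7.4.1 follows. The values are illustrative only: the set name “MPAI-2.0” is an assumed placeholder and the IDs refer to Table 37 (17 = happy, 23 = joyful).

{
  "primary": {
    "emotionDegree": "Medium",
    "emotionName": 17,
    "emotionSetName": "MPAI-2.0"
  },
  "secondary": {
    "emotionDegree": "High",
    "emotionName": 23,
    "emotionSetName": "MPAI-2.0"
  }
}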

7.5        Face Descriptors

Face Descriptors as defined in Personal Status Extraction are specified in MPAI-ARA V1 [3].

7.6        Gesture Descriptors

Gesture Descriptors as defined in Personal Status Extraction are specified in MPAI-ARA V1 [3].

7.7        Instance Identifier

An Instance is an element of a set of entities – Physical Objects, users, etc. – belonging to some level in a hierarchical classification (taxonomy).

The Syntax and Semantics of Instance Identifier are given by the following clauses.

7.7.1        Syntax

{
  "$schema": "http://json-schema.org/draft-07/schema",
  "title": "InstanceIdentifier",
  "type": "object",
  "properties": {
    "InstanceLabel": {
      "type": "string"
    },
    "LabelConfidenceLevel": {
      "type": "number",
      "minimum": 0,
      "maximum": 1
    },
    "Classification": {
      "type": "array",
      "items": {
        "type": "string"
      }
    },
    "ClassificationConfidenceLevel": {
      "type": "number",
      "minimum": 0,
      "maximum": 1
    }
  },
  "required": [
    "InstanceLabel",
    "LabelConfidenceLevel",
    "Classification",
    "ClassificationConfidenceLevel"
  ]
}

7.7.2        Semantics

Name Definition
InstanceIdentifier Provides the identifier of the Instance.
InstanceLabel Describes the Instance identified by InstanceIdentifier.
LabelConfidenceLevel Indicates the confidence level of the association between InstanceLabel and the Instance.
Classification Describes the taxonomy inferred for the Instance.
ClassificationConfidenceLevel Indicates the confidence level of the association between Classification and the Instance.
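
An informative example of a possible Instance Identifier follows. The object, labels, and confidence levels are illustrative assumptions, not normative values.

{
  "InstanceLabel": "coffee mug",
  "LabelConfidenceLevel": 0.92,
  "Classification": ["physical object", "container", "mug"],
  "ClassificationConfidenceLevel": 0.87
}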

 

7.8        Intention

This subclause specifies the data format describing Intention, the output of the Question Analysis AIM. Intention consists of the following elements:

  • qtopic
  • qfocus
  • qLAT
  • qSAT
  • qdomain

7.8.1        Syntax

{
  "$schema": "http://json-schema.org/draft-07/schema",
  "definitions": {
    "Intention": {
      "type": "object",
      "properties": {
        "qtopic": { "type": "string" },
        "qfocus": { "type": "string" },
        "qLAT": { "type": "string" },
        "qSAT": { "type": "string" },
        "qdomain": { "type": "string" }
      }
    }
  },
  "type": "object",
  "properties": {
    "primary": { "$ref": "#/definitions/Intention" },
    "secondary": { "$ref": "#/definitions/Intention" }
  }
}

7.8.2        Semantics

Name Definition
Intention Provides an abstract of the Intention of the User’s question using the properties qtopic, qfocus, qLAT, qSAT, and qdomain.
qtopic Indicates the topic of the question, i.e., the object or event that the question is about. Example: the qtopic of “Who is the author of King Lear?” is “King Lear”.
qfocus Indicates the focus of the question, i.e., the part of the question that, if replaced by the answer, makes the question a stand-alone statement. Examples: what, where, who, which policy, which river, etc. For instance, in the question “Who is the president of the USA?” the word “Who” is the focus and is replaced by “Biden” in the answer “Biden is the president of the USA.”
qLAT Indicates the lexical answer type of the question.
qSAT Indicates the semantic answer type of the question. qSAT corresponds to the Named Entity type of the language analysis results.
qdomain Indicates the domain of the question, such as “science”, “weather”, “history”. Example: “Who is the third king of the Yi dynasty in Korea?” (qdomain: history)

 

The following example shows the question analysis result of the user’s question, “Who is the author of King Lear?” The question analysis result in the example shows that the domain of the question is “Literature,” the topic of the question is “King Lear”, and the focus of the question is “Who.”

 

{
  "intention": [
    {
      "qdomain": "Literature",
      "qtopic": "King Lear",
      "qfocus": "who",
      "qLAT": "author",
      "qSAT": "person"
    }
  ]
}

 

The following example shows the question analysis result for the question “How do you make Kimchi?” The result shows that the domain of the question is “Cooking”, the topic of the question is “Kimchi”, and the focus of the question is “how”.

 

{
  "intention": [
    {
      "qdomain": "Cooking",
      "qtopic": "Kimchi",
      "qfocus": "how",
      "qLAT": "cooking method",
      "qSAT": "method"
    }
  ]
}

7.9        Language Identifier

Language identifiers are specified by [8].

7.10    Meaning

This subclause specifies the data format describing Meaning, which is the result of natural language analysis. Meaning consists of the following elements:

  • POS_tagging
  • NE_tagging
  • Dependency_tagging
  • SRL_tagging

7.10.1    Syntax

{
  "$schema": "http://json-schema.org/draft-07/schema",
  "definitions": {
    "meaning": {
      "type": "object",
      "properties": {
        "POS_tagging": {
          "POS_tagging_set": { "type": "string" },
          "POS_tagging_result": { "type": "string" }
        },
        "NE_tagging": {
          "NE_tagging_set": { "type": "string" },
          "NE_tagging_result": { "type": "string" }
        },
        "dependency_tagging": {
          "dependency_tagging_set": { "type": "string" },
          "dependency_tagging_result": { "type": "string" }
        },
        "SRL_tagging": {
          "SRL_tagging_set": { "type": "string" },
          "SRL_tagging_result": { "type": "string" }
        }
      }
    }
  },
  "type": "object",
  "properties": {
    "primary": { "$ref": "#/definitions/meaning" },
    "secondary": { "$ref": "#/definitions/meaning" }
  }
}

7.10.2    Semantics

Name Definition
Meaning Provides an abstract description of the natural language analysis results.
POS_tagging Indicates POS tagging results including information on the POS tagging set and tagged results of the User question. POS: Part of Speech such as noun, verb, etc.
NE_tagging Indicates NE tagging results including information on the NE tagging set and tagged results of the User question. NE: Named Entity such as Person, Organisation, Fruit, etc.
dependency_tagging Indicates dependency tagging results including information on the dependency tagging set and tagged results of the User question. Dependency indicates the structure of the sentence such as subject, object, head of the relation, etc.
SRL_tagging Indicates SRL (Semantic Role Labelling) tagging results including information on the SRL tagging set and tagged results of the User question. SRL indicates the semantic structure of the sentence such as agent, location, patient role, etc.
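
An informative example of a possible Meaning instance for the question “Who is the author of King Lear?” follows. Since the Syntax carries the tag set names as strings, any suitable tag sets may be referenced; the set names used here (Universal Dependencies UPOS, OntoNotes 5.0, PropBank) and the tagged strings are illustrative assumptions only.

{
  "primary": {
    "POS_tagging": {
      "POS_tagging_set": "Universal Dependencies UPOS",
      "POS_tagging_result": "Who/PRON is/AUX the/DET author/NOUN of/ADP King/PROPN Lear/PROPN"
    },
    "NE_tagging": {
      "NE_tagging_set": "OntoNotes 5.0",
      "NE_tagging_result": "King Lear = WORK_OF_ART"
    },
    "dependency_tagging": {
      "dependency_tagging_set": "Universal Dependencies",
      "dependency_tagging_result": "nsubj(author, Who); cop(author, is); det(author, the); nmod(author, Lear); case(Lear, of); compound(Lear, King)"
    },
    "SRL_tagging": {
      "SRL_tagging_set": "PropBank",
      "SRL_tagging_result": "be.01: ARG1 = Who, ARG2 = the author of King Lear"
    }
  }
}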

7.11    Personal Status

7.11.1    Factors and Modalities

Personal Status is a data structure composed of three Personal Status Factors:

  1. Emotion (such as “angry” or “sad”).
  2. Cognitive State (such as “surprised” or “interested”).
  3. Social Attitude (such as “polite” or “arrogant”).

All these Factors can be expressed via several Personal Status Modalities: Text, Speech, Face, and Gestures. (Other Modalities, such as body posture, may be handled in future MPAI Versions.)

Within a given Modality, the Factors can be analysed and interpreted via various Descriptors. For example, when expressed via Speech, the elements may be expressed through combinations of such features as prosody (pitch, rhythm, and volume variations); separable speech effects (such as degrees of voice tension, breathiness, etc.); and vocal gestures (laughs, sobs, etc.).

Each of the three Emotion, Cognitive State, and Social Attitude Factors is represented by a standard set of labels and associated semantics. For each of these Factors, two tables are provided:

  • A Label Set Table containing descriptive labels relevant to the element type, organised in a three-level format (Categories, General Adjectival, Specific Adjectival).
  • A Label Semantics Table providing the semantics of each label.

These sets have been compiled in the interest of basic cooperation and coordination among AIM submitters and vendors, complemented by a procedure whereby AIM submitters may propose extended or alternate sets for their purposes.

An Implementer wishing to extend or replace a Label Set Table for one of the three Factors is requested to submit the proposed Label Set Table together with the corresponding Label Semantics Table. The submitted semantics should have a level of detail comparable to the semantics given in the current Label Semantics Table.

The appropriate MPAI Development Committee will examine the proposed extension or replacement. Only the adequacy of the proposed new tables in terms of clarity and completeness will be considered. In case the new tables are not clear or complete, a revision of the tables will be requested.

The accepted External Factor Set will be identified as proposed by the submitter, reviewed by the appropriate MPAI Committee, and posted to the MPAI web site.

The versioning system is based on a name – MPAI for MPAI-generated versions or “organisation name” for the proposing organisation – with a suffix m.n, where m indicates the version and n indicates the subversion.

7.11.2    Personal Status Data

  1. Timestamp type can either be:
    • Absolute time (A)
    • Relative time, i.e., time from the start of operation (R)
  2. Timestamp value is as in CAE V1.
  3. 18 values of Personal Status that include (see Table 38):
    • 6 cells for Emotion.
    • 6 cells for Cognitive State.
    • 6 cells for Social Attitude.

 

Table 38 – The table of (Factor, Modality) cells

    Modality
    Version Fused value Text Speech Face Gesture
Factor Emotion V.Emotion          
Cognitive State V.Cognitive          
Social Attitude V.Attitude          

 

  1. The 18 values in the cells are represented as a vector of 18 values, 6 for each Factor:
    • The first value is the Version of Emotion/Cognitive State/Social Attitude (VE/VC/VA) represented as two fields:
      • Field 1: 2 digits of the Version of the MMC standard (e.g., “12”, meaning version 1.2, is expressed as 2 bytes).
      • Field 2: The sequential number of the Factor dataset, expressed with 1 byte. Currently, there is one dataset, given the number 1; new submissions will receive sequential numbers starting from 2.
    • The second value is the current default fused value of the Modality.
    • Followed by the 4 values of the Modality.
      • The value of Text
      • The value of Speech
      • The value of Face
      • The value of Gesture
    • The list of possible values of a Modality are (values are in bytes):
      • Value 0: unable to compute for any reason, or error, or no discernable value.
      • Value 1 up to the largest number of Factor values in the relevant Label Semantics Table.

Therefore, a value of Personal Status is represented by the following table. Timestamp, Emotion, Cognitive State, Social Attitude and their Descriptors are present if the information is available.

 

Table 39 – The variables composing the Personal Status

Variable name Code
Timestamp Timestamp type
  Timestamp value
Emotion Emotion version
  Fused Emotion value
  Text Emotion value
  Speech Emotion value
  Face Emotion value
  Gesture Emotion value
Cognitive State Cognitive State version
  Fused Cognitive State value
  Text Cognitive State value
  Speech Cognitive State value
  Face Cognitive State value
  Gesture Cognitive State value
Social Attitude Social Attitude version
  Fused Social Attitude value
  Text Social Attitude value
  Speech Social Attitude value
  Face Social Attitude value
  Gesture Social Attitude value

{
  "$schema": "http://json-schema.org/draft-07/schema#",
  "title": "Personal Status",
  "type": "object",
  "properties": {
    "Timestamp": {
      "type": "object",
      "properties": {
        "Timestamp type": {
          "type": "string"
        },
        "Timestamp value": {
          "type": "string",
          "oneOf": [
            { "format": "date-time" },
            { "const": "0" }
          ]
        }
      },
      "required": ["Timestamp value"],
      "if": {
        "properties": { "Timestamp value": { "const": "0" } }
      },
      "then": {
        "properties": { "Timestamp type": { "type": "null" } }
      },
      "else": {
        "required": ["Timestamp type"]
      }
    },
    "emotion": {
      "type": "object",
      "properties": {
        "Fused emotion value": { "type": "number", "minimum": 0 },
        "Text emotion value": { "type": "number", "minimum": 0 },
        "Speech emotion value": { "type": "number", "minimum": 0 },
        "Face emotion value": { "type": "number", "minimum": 0 },
        "Gesture emotion value": { "type": "number", "minimum": 0 },
        "emotion version": {
          "type": "string",
          "pattern": "^[A-Za-z]+-\\d+\\.\\d+$"
        }
      },
      "anyOf": [
        { "required": ["emotion version", "Fused emotion value"] },
        { "required": ["emotion version", "Text emotion value"] },
        { "required": ["emotion version", "Speech emotion value"] },
        { "required": ["emotion version", "Face emotion value"] },
        { "required": ["emotion version", "Gesture emotion value"] }
      ]
    },
    "cogstate": {
      "type": "object",
      "properties": {
        "Fused cogstate value": { "type": "number", "minimum": 0 },
        "Text cogstate value": { "type": "number", "minimum": 0 },
        "Speech cogstate value": { "type": "number", "minimum": 0 },
        "Face cogstate value": { "type": "number", "minimum": 0 },
        "Gesture cogstate value": { "type": "number", "minimum": 0 },
        "cogstate version": {
          "type": "string",
          "pattern": "^[A-Za-z]+-\\d+\\.\\d+$"
        }
      },
      "anyOf": [
        { "required": ["cogstate version", "Fused cogstate value"] },
        { "required": ["cogstate version", "Text cogstate value"] },
        { "required": ["cogstate version", "Speech cogstate value"] },
        { "required": ["cogstate version", "Face cogstate value"] },
        { "required": ["cogstate version", "Gesture cogstate value"] }
      ]
    },
    "attitude": {
      "type": "object",
      "properties": {
        "Fused attitude value": { "type": "number", "minimum": 0 },
        "Text attitude value": { "type": "number", "minimum": 0 },
        "Speech attitude value": { "type": "number", "minimum": 0 },
        "Face attitude value": { "type": "number", "minimum": 0 },
        "Gesture attitude value": { "type": "number", "minimum": 0 },
        "attitude version": {
          "type": "string",
          "pattern": "^[A-Za-z]+-\\d+\\.\\d+$"
        }
      },
      "anyOf": [
        { "required": ["attitude version", "Fused attitude value"] },
        { "required": ["attitude version", "Text attitude value"] },
        { "required": ["attitude version", "Speech attitude value"] },
        { "required": ["attitude version", "Face attitude value"] },
        { "required": ["attitude version", "Gesture attitude value"] }
      ]
    }
  },
  "required": ["emotion", "cogstate", "attitude"]
}
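
An informative example of a possible Personal Status instance follows. The version strings, timestamp, and values are illustrative assumptions. The values refer to the baseline Label Semantics Tables (Emotion 17 = happy in Table 37, Cognitive State 13 = interested in Table 35, Social Attitude 76 = polite/courteous/respectful in Table 41), and 0 indicates that no value could be computed for that Modality.

{
  "Timestamp": {
    "Timestamp type": "A",
    "Timestamp value": "2023-09-25T10:15:30Z"
  },
  "emotion": {
    "emotion version": "MPAI-1.2",
    "Fused emotion value": 17,
    "Text emotion value": 17,
    "Speech emotion value": 17,
    "Face emotion value": 0,
    "Gesture emotion value": 0
  },
  "cogstate": {
    "cogstate version": "MPAI-1.2",
    "Fused cogstate value": 13,
    "Text cogstate value": 0,
    "Speech cogstate value": 13,
    "Face cogstate value": 13,
    "Gesture cogstate value": 0
  },
  "attitude": {
    "attitude version": "MPAI-1.2",
    "Fused attitude value": 76,
    "Text attitude value": 76,
    "Speech attitude value": 0,
    "Face attitude value": 0,
    "Gesture attitude value": 76
  }
}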

7.12    Social Attitude

Social Attitude is represented by the following Syntax and Semantics. Primary Social Attitude corresponds to General Adjectival and Secondary Social Attitude corresponds to Specific Adjectival in Table 40.

7.12.1    Syntax

{
  "$schema": "http://json-schema.org/draft-07/schema",
  "definitions": {
    "attitudeType": {
      "type": "object",
      "properties": {
        "attitudeDegree": {
          "enum": ["High", "Medium", "Low"]
        },
        "attitudeName": {
          "type": "number"
        },
        "attitudeSetName": {
          "type": "string"
        }
      }
    }
  },
  "type": "object",
  "properties": {
    "primary": {
      "$ref": "#/definitions/attitudeType"
    },
    "secondary": {
      "$ref": "#/definitions/attitudeType"
    }
  }
}

7.12.2    Semantics

Name Definition
attitudeType Specifies the Social Attitude that the input carries.
attitudeDegree Specifies the Degree of Social Attitude as one of “Low,” “Medium,” and “High.”
attitudeName Specifies the ID of a Social Attitude listed in  Table 41.
attitudeSetName Specifies the name of the Social Attitude set which contains the Social Attitude. Social Attitude set of Table 41 is used as a baseline, but other sets are possible.

 

Table 40 gives the standardised three-level Basic Social Attitude Set.

 

Table 40 – Basic Social Attitude Label Set

SOCIAL ATTITUDE CATEGORIES GENERAL ADJECTIVAL SPECIFIC ADJECTIVAL

ACCEPTANCE accepting

exclusive/cliquish

welcoming/inviting

friendly

unfriendly/hostile

AGREEMENT, DISAGREEMENT like-minded

argumentative/disputatious

sarcastic
AGGRESSION aggressive

peaceful

submissive

combative/belligerent

passive-aggressive

mocking

APPROVAL, DISAPPROVAL admiring/approving

disapproving

indifferent

awed

contemptuous

ACTIVITY, PASSIVITY assertive

passive

controlling

permissive/lenient

COOPERATION cooperative/agreeable

uncooperative

flexible

subversive/undermining

uncommunicative

stubborn

disagreeable

RESPONSIVENESS responsive/demonstrative

emotional/passionate

unresponsive/undemonstrative

unemotional/detached

enthusiastic

unenthusiastic

passionate

dispassionate

 

EMPATHY empathetic/caring

kind

uncaring/callous

sympathetic

merciful

merciless/ruthless

self-absorbed

selfish/self-serving

selfless/altruistic

generous

EXPECTATION optimistic

pessimistic

positive

sanguine

negative/defeatist

cynical

EXTROVERSION, INTROVERSION outgoing/extroverted

uninhibited/unreserved

sociable

approachable

DEPENDENCE dependent

independent

helpless
MOTIVATION motivated

apathetic/indifferent

 

inspired

excited/stimulated

discouraged/dejected

dismissive

OPENNESS, TRUST open

honest/sincere

reasonable

trusting

 

candid/frank

closed/distant

dishonest/deceitful

responsible/trustworthy/dependable

irresponsible

distrustful

 

PRAISING, CRITICISM laudatory

critical

congratulatory

flattering

belittling

RESENTMENT, FORGIVENESS forgiving

unforgiving/vindictive/spiteful/vengeful

understanding

petty

SELF-PROMOTION boastful

modest/humble/unassuming

 
SELF-ESTEEM conceited/vain

self-deprecating/self-effacing

smug
SOCIAL DOMINANCE, CONFIDENCE arrogant

confident

submissive

overconfident

forward/presumptuous

brazen

SEXUALITY seductive

lewd/bawdy/indecent

prudish/priggish

suggestive/risqué/naughty

 

SOCIAL RANK polite/courteous/respectful

rude/disrespectful

commanding/domineering

pompous/pretentious

obedient

rebellious/defiant

condescending/patronizing/snobbish

pedantic

unaffected

servile/obsequious

 

Table 41 provides the semantics for each label in the GENERAL ADJECTIVAL and SPECIFIC ADJECTIVAL columns above.

 

Table 41 – Basic Social Attitude Semantics Set

ID Social Attitude Meaning
1 accepting attitude communicating willingness to accept into relationship or group
2 admiring/approving attitude due to perception that others’ actions or results are valuable
3 aggressive tending to physically or metaphorically attack
4 apathetic/indifferent showing lack of interest
5 approachable sociable and not inspiring inhibition
6 argumentative tending to argue or dispute
7 arrogant emotion communicating social dominance
8 assertive taking active role in social situations
9 awed approval combined with incomprehension or fear
10 belittling criticising by understating victim’s achievements, personal attributes, etc.
11 boastful tending to praise or promote self
12 brazen high degree of forwardness/presumption
13 candid/frank open in linguistic communication
14 closed/distant not open
15 commanding/domineering tending to assert right to command
16 combative/belligerent high degree of aggression, often physical
17 communicative evincing willingness to communicate as needed
18 conceited/vain evincing undesirable degree of self-esteem
19 condescending/patronizing/snobbish disrespectfully asserting superior social status, experience, knowledge, or membership
20 confident attitude due to belief in own ability
21 congratulatory wishing well related to another’s success or good luck
22 contemptuous high degree of disapproval and perceived superiority
23 controlling undesirably assertive
24 cool repressing outward reaction, often to indicate confidence or dominance, especially when confronting aggression, panic, etc.
25 cooperative/agreeable communicating willingness to cooperate
26 critical attitude expressing disapproval
27 cynical habitually negative, reflecting disappointment or disillusionment
28 dependent evincing inability to function without aid
29 discouraged/dejected unmotivated because goals or rewards were not achieved
30 disagreeable not agreeable
31 disapproving not approving
32 dishonest/deceitful/insincere not honest
33 dismissive actively indicating lack of interest or motivation
34 distrustful not trusting
35 emotional/passionate high degree of responsiveness to emotions
36 empathetic/caring interested in or vicariously feeling others’ emotions
37 enthusiastic high degree of positive response, especially to specific occurrence
38 excited/stimulated attitude indicating cognitive and emotional arousal
39 exclusive/cliquish not welcoming into a social group
40 flattering praising with intent to influence, often insincere
41 flexible willing to adjust to changing circumstances or needs
42 forward/presumptuous not observing norms related to intimacy or rank
43 forgiving tending to forgive improper behaviour
44 friendly welcoming or inviting social contact
45 generous tending to give to others, materially or otherwise
46 guilty/remorseful/sorry regret due to consciousness of hurting or damaging others
47 helpless high degree of dependence
48 honest/sincere tending to communicate without deception
49 independent not dependent
50 indifferent neither approving nor disapproving
51 inhibited/reserved/introverted/withdrawn unable or unwilling to participate socially
52 inspired motivated by some person, event, etc.
53 irresponsible not responsible
54 kind tending to act as motivated by empathy or sympathy
55 laudatory praising
56 lewd/bawdy/indecent evoking sexual associations in ways beyond social norms
57 like-minded attitude expressing agreement
58 melodramatic high or excessive degree of responsiveness or demonstrativeness
59 merciful tending to avoid punishing others, often motivated by empathy or sympathy
60 merciless/ruthless not merciful
61 mocking communicating non-physical aggression, often by imitating a disapproved aspect of the victim
62 modest/humble/unassuming not boastful
63 motivated communicating goal-directed emotion and cognitive state
64 negative/defeatist expressing pessimism, often habitually
65 obedient evincing tendency to obey commands
66 open tending to communicate without inhibition
67 optimistic tending to expect positive events or results
68 outgoing/extroverted/uninhibited/unreserved not inhibited
69 passive not assertive
70 passive-aggressive covertly and non-physically aggressive
71 peaceful not aggressive
72 pedantic excessively displaying knowledge or academic status
73 permissive allowing activity that social norms might restrict
74 pessimistic tending to expect negative events or results
75 petty unforgiving concerning small matters
76 polite/courteous/respectful tending to respect social norms
77 pompous/pretentious excessively displaying social rank, often above actual status
78 positive expressing optimism, often habitually
79 prudish/priggish expressing disapproval of even minor social transgressions, especially related to sex
80 reasonable evincing willingness to resolve issues through reasoning
81 rebellious/defiant evincing unwillingness to obey
82 responsible/trustworthy/dependable evincing characteristics or behaviour that encourage trust
83 responsive/demonstrative tending to outwardly react to emotions and cognitive states, often as prompted by others
84 rude/disrespectful not polite or respectful
85 sanguine low degree of optimism, often expressed calmly
86 sarcastic communicating disagreement by pretending agreement in an obviously insincere manner
87 seductive communicating interest in sexual or related contact
88 self-absorbed not empathetic due to excessive interest in self
89 self-deprecating/self-effacing tending to criticize, or fail to praise or promote, self
90 selfish/self-serving not generous due to excessive interest in own benefit
91 selfless/altruistic tending to act for others’ benefit, sometimes exclusively
92 servile/obsequious excessively and demonstrably obedient
93 shy low degree of social inhibition
94 smug evincing undesirable degree of self-esteem related to perceived triumph
95 stubborn unwilling to change one’s mind or behaviour
96 sociable comfortable in social situations
97 submissive tending to submit to social dominance
98 subversive/undermining communicating intention to work against a victim’s goals
99 suggestive/risqué/naughty evoking sexual associations within social norms
100 supportive communicating willingness to support as needed
101 sympathetic empathetic related to others’ hurt or suffering
102 trusting tending to trust others
103 unaffected not pompous
104 uncaring/callous not empathetic or caring
105 uncommunicative not communicative
106 uncooperative not cooperative
107 understanding forgiving due to ability to understand motivations
108 unemotional/dispassionate/detached not emotional, even when emotion is expected
109 unenthusiastic not enthusiastic
110 unfriendly/hostile not friendly
111 unresponsive/undemonstrative not responsive or demonstrative
112 welcoming/inviting high degree of acceptance with emotional warmth
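
An informative example of a possible Social Attitude instance conforming to the Syntax of 7.12.1 follows. The values are illustrative only: the set name “MPAI-2.0” is an assumed placeholder and the IDs refer to Table 41 (76 = polite/courteous/respectful, 44 = friendly).

{
  "primary": {
    "attitudeDegree": "High",
    "attitudeName": 76,
    "attitudeSetName": "MPAI-2.0"
  },
  "secondary": {
    "attitudeDegree": "Medium",
    "attitudeName": 44,
    "attitudeSetName": "MPAI-2.0"
  }
}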

7.13    Spatial Attitude

Spatial Attitude is specified in MPAI-OSD V1 [5].

7.14    Speech Descriptors

Speech Descriptors act as Speech Features defined in Personal Status Extraction.

7.15    Speech Features

Speech Features are digitally represented as follows.

7.15.1    Syntax

{
  "$schema": "http://json-schema.org/draft-07/schema",
  "definitions": {
    "SpeechFeatures": {
      "type": "object",
      "properties": {
        "pitch": { "type": "real" },
        "tone": { "type": "ToneType" },
        "intonation": [
          {
            "type_p": "pitch",
            "type_s": "speed",
            "type_i": "intensity"
          }
        ],
        "intensity": { "type": "real" },
        "speed": { "type": "real" },
        "emotion": { "type": "EmotionType" },
        "NNSpeechFeatures": { "type": "vector of floating point" }
      }
    }
  },
  "type": "object",
  "properties": {
    "primary": { "$ref": "#/definitions/SpeechFeatures" },
    "secondary": { "$ref": "#/definitions/SpeechFeatures" }
  }
}

{
  "$schema": "http://json-schema.org/draft-07/schema",
  "definitions": {
    "ToneType": {
      "type": "object",
      "properties": {
        "toneName": { "type": "string" },
        "toneSetName": { "type": "string" }
      }
    }
  },
  "type": "object",
  "properties": {
    "primary": { "$ref": "#/definitions/ToneType" },
    "secondary": { "$ref": "#/definitions/ToneType" }
  }
}

 

7.15.2    Semantics

Name Definition
SpeechFeatures Indicates characteristic elements extracted from the input speech, specifically pitch, tone, intonation, intensity, speed, emotion, and NNSpeechFeatures.
NNSpeechFeatures Indicates neural-network-based characteristic elements extracted from the input speech by a neural network.
pitch Indicates the fundamental frequency of Speech expressed as a real number indicating frequency as Hz (Hertz).
tone Tone is a variation in the pitch of the voice while speaking expressed as human readable words as in Table 42.
ToneType Indicates the Tone that the input speech carries.
intonation A variation of the pitch, intensity and speed within a time period measured in seconds.
intensity Energy of Speech expressed as a real number indicating dBs (decibel).
speed Indicates the Speech Rate as a real number indicating specified linguistic units (e.g., Phonemes, Syllables, or Words) per second.
emotion Indicates the Emotion that the input speech carries.
EmotionType Indicates the Emotion that the input speech carries.
toneName Specifies the name of a Tone.
toneSetName Name of the Tone set which contains the Tone. Tone set is used as a baseline, but other sets are possible.

Note: The semantics of “tone” defines a basic set of elements characterising tone. Elements can be added to the basic set or new sets defined using the registration procedure defined for Emotion Sets (0).

 

Table 42 – Basic Tones

TONE CATEGORIES                          ADJECTIVAL        Semantics
FORMALITY                                formal            serious, official, polite
FORMALITY                                informal          everyday, relaxed, casual
ASSERTIVENESS                            assertive         certain about content
ASSERTIVENESS                            factual           neutral about content
ASSERTIVENESS                            hesitant          uncertain about content
REGISTER (per situation or use case)     conversational    appropriate to an informal speaking situation
REGISTER (per situation or use case)     directive         related to commands or requests for action
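 

As an informative illustration of the Syntax and Semantics above, the following Python fragment builds a hypothetical Speech Features instance and serialises it as JSON. The field values, the name of the Tone set, and the Emotion sub-structure are illustrative assumptions, not normative examples.

import json

# Informative sketch: a hypothetical Speech Features instance following the
# semantics of 7.15.2 (pitch in Hz, intensity in dB, speed in linguistic units
# per second) and a Tone taken from the Basic Tones of Table 42.
# The "emotion" sub-structure is an assumption, not the normative EmotionType.
speech_features = {
    "pitch": 118.5,                            # fundamental frequency (Hz)
    "intensity": 62.0,                         # energy (dB)
    "speed": 4.2,                              # e.g., Syllables per second
    "tone": {
        "toneName": "assertive",
        "toneSetName": "MPAI-MMC Basic Tones"  # assumed name of the baseline set
    },
    "emotion": {"emotionName": "joy"},         # illustrative placeholder
    "NNSpeechFeatures": [0.12, -0.53, 0.98]    # vector of floating point values
}

print(json.dumps(speech_features, indent=2))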

7.16    Text

The Format of Input Text, Output Text and Recognised Text is provided by ISO/IEC 10646; Information technology – Universal Coded Character Set [9].

7.17    Text Descriptors

Meaning, as defined in Personal Status Extraction, acts as Text Descriptors.

7.18    Video

Video satisfies the following specifications:

  1. Pixel shape: square
  2. Bit depth: 8 or 10 bits/pixel
  3. Aspect ratio: 4/3 or 16/9
  4. 640 < # of horizontal pixels < 1920
  5. 480 < # of vertical pixels < 1080
  6. Frame frequency 50-120 Hz
  7. Scanning: progressive
  8. Colorimetry: ITU-R BT709 or BT2020
  9. Colour format: RGB or YUV
  10. Compression:
    1. If compressed, compression according to one of the following standards: MPEG-4 AVC [10], MPEG-H HEVC [13], MPEG-5 EVC [14].
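
The constraints above can be checked mechanically. The following informative Python sketch tests a candidate video configuration against them; the dictionary keys are illustrative, and the pixel-count and frame-frequency ranges are interpreted as inclusive.

# Informative sketch: check a candidate video configuration against the
# constraints of 7.18. Dictionary keys are illustrative; ranges are treated
# as inclusive.
def video_conforms(cfg: dict) -> bool:
    return all([
        cfg.get("pixel_shape") == "square",
        cfg.get("bit_depth") in (8, 10),
        cfg.get("aspect_ratio") in ("4/3", "16/9"),
        640 <= cfg.get("width", 0) <= 1920,
        480 <= cfg.get("height", 0) <= 1080,
        50 <= cfg.get("frame_rate_hz", 0) <= 120,
        cfg.get("scanning") == "progressive",
        cfg.get("colorimetry") in ("BT709", "BT2020"),
        cfg.get("colour_format") in ("RGB", "YUV"),
        cfg.get("compression") in (None, "AVC", "HEVC", "EVC"),
    ])

print(video_conforms({
    "pixel_shape": "square", "bit_depth": 10, "aspect_ratio": "16/9",
    "width": 1920, "height": 1080, "frame_rate_hz": 60,
    "scanning": "progressive", "colorimetry": "BT709",
    "colour_format": "YUV", "compression": "HEVC"
}))   # prints True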

7.19    Video File

The Format of a Video File is the MP4 File Format [12].

7.20    Video of Faces KB Query Format

Data Specification: All faces in the Video of Faces KB shall be aligned.

Input: The Video of Faces KB is queried with an Emotion.

Output: The response is a Video File of a human face.
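
As an informative illustration, the query interface can be thought of as a lookup from an Emotion to the corresponding Video File of an aligned face. The sketch below models the KB as a simple mapping; the function name, entries, and file paths are hypothetical.

from typing import Optional

# Informative sketch: the Video of Faces KB modelled as a mapping from an
# Emotion name to the path of a Video File of an aligned human face.
# Entries and the function name are hypothetical.
VIDEO_OF_FACES_KB = {
    "joy": "faces/joy.mp4",
    "sadness": "faces/sadness.mp4",
    "anger": "faces/anger.mp4",
}

def query_video_of_faces_kb(emotion_name: str) -> Optional[str]:
    """Return the Video File of a face expressing the given Emotion, if any."""
    return VIDEO_OF_FACES_KB.get(emotion_name)

print(query_video_of_faces_kb("joy"))   # faces/joy.mp4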

7.21    Visual Scene Descriptors

Visual Scene Descriptors are specified in MPAI-OSD [5].

 

  • MPAI Basics

1          General

In recent years, Artificial Intelligence (AI) and related technologies have been introduced in a broad range of applications affecting the life of millions of people and are expected to do so much more in the future. As digital media standards have positively influenced industry and billions of people, so AI-based data coding standards are expected to have a similar positive impact. In addition, some AI technologies may carry inherent risks, e.g., in terms of bias toward some classes of users, making the need for standardisation more important and urgent than ever.

 

The above considerations have prompted the establishment of the international, unaffiliated, not-for-profit Moving Picture, Audio and Data Coding by Artificial Intelligence (MPAI) organisation with the mission to develop AI-enabled data coding standards to enable the development of AI-based products, applications, and services.

 

As a rule, MPAI standards include four documents: Technical Specification, Reference Software Specifications, Conformance Testing Specifications, and Performance Assessment Specifications.

The last – and new in standardisation – type of Specification includes standard operating procedures that enable users of MPAI Implementations to make informed decisions about their applicability based on the notion of Performance, defined as a set of attributes characterising a reliable and trustworthy implementation.

 

2          Governance of the MPAI Ecosystem

The technical foundations of the MPAI Ecosystem are currently provided by the following documents developed and maintained by MPAI:

  1. Technical Specification.
  2. Reference Software Specification.
  3. Conformance Testing.
  4. Performance Assessment.
  5. Technical Report

An MPAI Standard is a collection of a variable number of the 5 document types.

 

Figure 12 depicts the MPAI ecosystem operation for conforming MPAI implementations.

 

Figure 12 – The MPAI ecosystem operation

Technical Specification: Governance of the MPAI Ecosystem identifies the roles in the MPAI Ecosystem listed in Table 43:

 

Table 43 – Roles in the MPAI Ecosystem

MPAI Publishes Standards.

Establishes the not-for-profit MPAI Store.

Appoints Performance Assessors.

Implementers Submit Implementations to Performance Assessors.
Performance Assessors Inform Implementation submitters and the MPAI Store if Implementation Performance is acceptable.
Implementers Submit Implementations to the MPAI Store.
MPAI Store Assigns unique ImplementerIDs (IID) to Implementers in its capacity as ImplementerID Registration Authority (IIDRA)[1].

Verifies security and Tests Implementation Conformance.

Users Download Implementations and report their experience to MPAI.

 

3          AI Framework

In general, MPAI Application Standards are defined as aggregations – called AI Workflows (AIW) – of processing elements – called AI Modules (AIM) – executed in an AI Framework (AIF). MPAI defines Interoperability as the ability to replace an AIW or an AIM Implementation with a functionally equivalent Implementation.

 

Figure 13 depicts the MPAI-AIF Reference Model under which Implementations of MPAI Application Standards and user-defined MPAI-AIF Conforming applications operate [2].

 

Figure 13 – The AI Framework (AIF) Reference Model

MPAI Application Standards normatively specify the Syntax and Semantics of the input and output data and the Function of the AIW and the AIMs, and the Connections between and among the AIMs of an AIW.

 

An AIW is defined by its Function and input/output Data and by its AIM topology. Likewise, an AIM is defined by its Function and input/output Data. MPAI standards are silent on the technology used to implement the AIM, which may be based on AI or data processing, and implemented in software, hardware, or hybrid software and hardware technologies.
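
The following informative Python sketch illustrates this point: an AIM is characterised only by its Function and its input/output Data, so two functionally equivalent implementations can be exchanged within the same AIW. All names are illustrative and do not come from MPAI-AIF.

from typing import Callable, Dict, List

# Informative sketch: an AIM is modelled as a function from input data to
# output data; an AIW is a topology (here, a simple chain) of such AIMs.
AIM = Callable[[Dict[str, object]], Dict[str, object]]

def speech_recognition_a(data):        # one AIM implementation
    return {"RecognisedText": "hello world"}

def speech_recognition_b(data):        # a functionally equivalent implementation
    return {"RecognisedText": "hello world"}

def run_aiw(aims: List[AIM], inputs: Dict[str, object]) -> Dict[str, object]:
    """Execute a linear AIW by passing each AIM's outputs to the next AIM."""
    data = dict(inputs)
    for aim in aims:
        data.update(aim(data))
    return data

# Either implementation can be plugged into the same AIW without changing it.
print(run_aiw([speech_recognition_a], {"InputSpeech": b"..."}))
print(run_aiw([speech_recognition_b], {"InputSpeech": b"..."}))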

 

MPAI also defines 3 Interoperability Levels of an AIF that executes an AIW. Table 44 gives the characteristics of an AIW and its AIMs of a given Level:

 

Table 44 – MPAI Interoperability Levels

Level AIW AIMs
1 An implementation of a use case Implementations able to call the MPAI-AIF APIs.
2 An Implementation of an MPAI Use Case Implementations of the MPAI Use Case
3 An Implementation of an MPAI Use Case certified by a Performance Assessor Implementations of the MPAI Use Case certified by Performance Assessors

 

4          Audio-Visual Scene Description

The ability to describe (i.e., digitally represent) an audio-visual scene is a key requirement of several MPAI Technical Specifications and Use Cases. MPAI has developed Technical Specification: Context-based Audio Enhancement (MPAI-CAE) [4], which includes Audio Scene Descriptors, and uses a subset of Graphics Language Transmission Format (glTF) [7] to describe a visual scene.

4.1        Audio Scene Descriptors

Audio Scene Description is a Composite AI Module (AIM) specified by Technical Specification: Context-based Audio Enhancement (MPAI-CAE) [4]. The position of an Audio Object is defined by Azimuth, Elevation, Distance.
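
As an informative aid, the (Azimuth, Elevation, Distance) triple can be converted to Cartesian coordinates as sketched below; the angle units, reference axes, and sign conventions are assumptions and may differ from those specified in MPAI-CAE.

import math

# Informative sketch: convert an Audio Object position given as
# (Azimuth, Elevation, Distance) into Cartesian (x, y, z) coordinates.
# Assumptions: angles in degrees, azimuth measured in the horizontal plane,
# elevation measured from the horizontal plane, distance in metres.
def audio_object_position(azimuth_deg: float, elevation_deg: float, distance_m: float):
    az = math.radians(azimuth_deg)
    el = math.radians(elevation_deg)
    x = distance_m * math.cos(el) * math.cos(az)
    y = distance_m * math.cos(el) * math.sin(az)
    z = distance_m * math.sin(el)
    return x, y, z

# Example: an object 2 m away, 30 degrees to the side, 10 degrees above ear level.
print(audio_object_position(30.0, 10.0, 2.0))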

 

The Composite AIM and its composing AIMs are depicted in Figure 14.

 

Figure 14 – The Audio Scene Description Composite AIM

4.2        Visual Scene Descriptors

MPAI uses a subset of Graphics Language Transmission Format (glTF) [7] to describe a visual scene.

5          Avatar-Based Videoconference

Technical Report: Avatar-Based Videoconference (MPAI-ARA) specifies AIWs and AIMs of a Use Case where geographically distributed humans hold a videoconference represented by their avatars. Figure 15 depicts the components of the system supporting the conference of a group of humans participating through avatars that have the participants’ visual appearance and utter their real voices.

 

Figure 15 – Avatar-Based Videoconference end-to-end diagram

Figure 16 contains the reference architectures of the four AI Workflows constituting the Avatar-Based Videoconference: Client (Transmission side), Server, Virtual Secretary, and Client (Receiving side).

 

Figure 16 – The AIWs of Avatar-Based Videoconference

6          Connected Autonomous Vehicles

MPAI defines a Connected Autonomous Vehicle (CAV) as a physical system that:

  1. Converses with humans by understanding their utterances, e.g., a request to be taken to a destination.
  2. Acquires information with a variety of sensors on the physical environment where it is located or that it traverses, like the one depicted in Figure 17.
  3. Plans a Route enabling the CAV to reach the requested destination.
  4. Autonomously reaches the destination by:
    • Moving in the physical environment.
    • Building Digital Representations of the Environment.
    • Exchanging elements of such Representations with other CAVs and CAV-aware entities.
    • Making decisions about how to execute the Route.
    • Acting on the CAV motion actuation to implement the decisions.

 

Figure 17 – An environment of CAV operation

 

MPAI believes in the capability of standards to accelerate the creation of a global competitive CAV market and has published Technical Specification: Connected Autonomous Vehicle (MPAI-CAV) – Architecture that includes (see Figure 18):

  1. A CAV Reference Model broken down into four Subsystems.
  2. The Functions of each Subsystem.
  3. The Data exchanged between Subsystems.
  4. A breakdown of each Subsystem into Components, for which the following is specified:
    • The Functions of the Components.
    • The Data exchanged between Components.
    • The Topology of Components and their Connections.
  5. Subsequently, Functional Requirements of the Data exchanged.
  6. Eventually, standard technologies for the Data exchanged.

 

Figure 18 – The MPAI-CAV Subsystems with their Components

Subsystems are implemented as AI Workflows and Components as AI Modules according to Technical Specification: AI Framework (MPAI-AIF) [2].

 

 

 

 

  • MPAI-wide terms and definitions

The Terms used in this standard whose first letter is capitalised and that are not already included in Table 1 are defined in Table 45.

 

 

Table 45 – MPAI-wide Terms

Term Definition
Access Static or slowly changing data that are required by an application such as domain knowledge data, data models, etc.
AI Framework (AIF) The environment where AIWs are executed.
AI Module (AIM) A data processing element receiving AIM-specific Inputs and producing AIM-specific Outputs according to its Function. An AIM may be an aggregation of AIMs.
AI Workflow (AIW) A structured aggregation of AIMs implementing a Use Case receiving AIW-specific inputs and producing AIW-specific outputs according to the AIW Function.
Application Standard An MPAI Standard designed to enable a particular application domain.
Channel A connection between an output port of an AIM and an input port of an AIM. The term “connection” is also used as a synonym.
Communication The infrastructure that implements message passing between AIMs
Composite AIM An AIM aggregating more than one AIM.
Component One of the 7 AIF elements: Access, Communication, Controller, Internal Storage, Global Storage, Store, and User Agent
Conformance The attribute of an Implementation of being a correct technical Implementation of a Technical Specification.
Conformance Tester An entity Testing the Conformance of an Implementation.
Conformance Testing The normative document specifying the Means to Test the Conformance of an Implementation.
Conformance Testing Means Procedures, tools, data sets and/or data set characteristics to Test the Conformance of an Implementation.
Connection A channel connecting an output port of an AIM and an input port of an AIM.
Controller A Component that manages and controls the AIMs in the AIF, so that they execute in the correct order and at the time when they are needed
Data Format The standard digital representation of data.
Data Semantics The meaning of data.
Ecosystem The ensemble of actors making it possible for a User to execute an application composed of an AIF, one or more AIWs, each with one or more AIMs potentially sourced from independent implementers.
Explainability The ability to trace the output of an Implementation back to the inputs that have produced it.
Fairness The attribute of an Implementation whose extent of applicability can be assessed by making the training set and/or network open to testing for bias and unanticipated results.
Function The operations effected by an AIW or an AIM on input data.
Global Storage A Component to store data shared by AIMs.
Internal Storage A Component to store data of the individual AIMs.
Identifier A name that uniquely identifies an Implementation.
Implementation 1.      An embodiment of the MPAI-AIF Technical Specification, or

2.      An AIW or AIM of a particular Level (1-2-3) conforming with a Use Case of an MPAI Application Standard.

Implementer A legal entity implementing MPAI Technical Specifications.
ImplementerID (IID) A unique name assigned by the ImplementerID Registration Authority to an Implementer.
ImplementerID Registration Authority (IIDRA) The entity appointed by MPAI to assign ImplementerID’s to Implementers.
Interoperability The ability to functionally replace an AIW or an AIM with another Implementation having the same Interoperability Level.
Interoperability Level The attribute of an AIW and its AIMs to be executable in an AIF Implementation and to:

1.      Be proprietary (Level 1)

2.      Pass the Conformance Testing (Level 2) of an Application Standard

3.      Pass the Performance Testing (Level 3) of an Application Standard.

Knowledge Base Structured and/or unstructured information made accessible to AIMs via MPAI-specified interfaces
Message A sequence of Records transported by Communication through Channels.
Normativity The set of attributes of a technology or a set of technologies specified by the applicable parts of an MPAI standard.
Performance The attribute of an Implementation of being Reliable, Robust, Fair and Replicable.
Performance Assessment The normative document specifying the Means to Assess the Grade of Performance of an Implementation.
Performance Assessment Means Procedures, tools, data sets and/or data set characteristics to Assess the Performance of an Implementation.
Performance Assessor An entity Assessing the Performance of an Implementation.
Profile A particular subset of the technologies used in MPAI-AIF or an AIW of an Application Standard and, where applicable, the classes, other subsets, options and parameters relevant to that subset.
Record A data structure with a specified structure
Reference Model The AIMs and their Connections in an AIW.
Reference Software A technically correct software implementation of a Technical Specification containing source code, or source and compiled code.
Reliability The attribute of an Implementation that performs as specified by the Application Standard, profile, and version the Implementation refers to, e.g., within the application scope, stated limitations, and for the period of time specified by the Implementer.
Replicability The attribute of an Implementation whose Performance, as Assessed by a Performance Assessor, can be replicated, within an agreed level, by another Performance Assessor.
Robustness The attribute of an Implementation that copes with data outside of the stated application scope with an estimated degree of confidence.
Scope The domain of applicability of an MPAI Application Standard
Service Provider An entrepreneur who offers an Implementation as a service (e.g., a recommendation service) to Users.
Standard The ensemble of Technical Specification, Reference Software, Conformance Testing and Performance Assessment of an MPAI Application Standard.
Technical Specification (Framework) the normative specification of the AIF.

(Application) the normative specification of the set of AIWs belonging to an application domain along with the AIMs required to Implement the AIWs that includes:

1.      The formats of the Input/Output data of the AIWs implementing the Use Cases.

2.      The Connections of the AIMs of the AIW.

3.      The formats of the Input/Output data of the AIMs belonging to the AIW.

Testing Laboratory A laboratory accredited to Assess the Grade of  Performance of Implementations.
Time Base The protocol specifying how Components can access timing information
Topology The set of AIM Connections of an AIW.
Use Case A particular instance of the Application domain target of an Application Standard.
User A user of an Implementation.
User Agent The Component interfacing the user with an AIF through the Controller
Version A revision or extension of a Standard or of one of its elements.

 

 

 

 

  • Notices and Disclaimers Concerning MPAI Standards (Informative)

 

The notices and legal disclaimers given below shall be borne in mind when downloading and using approved MPAI Standards.

 

In the following, “Standard” means the collection of four MPAI-approved and published documents: “Technical Specification”, “Reference Software” and “Conformance Testing” and, where applicable, “Performance Testing”.

 

Life cycle of MPAI Standards

MPAI Standards are developed in accordance with the MPAI Statutes. An MPAI Standard may only be developed when a Framework Licence has been adopted. MPAI Standards are developed by especially established MPAI Development Committees who operate on the basis of consensus, as specified in Annex 1 of the MPAI Statutes. While the MPAI General Assembly and the Board of Directors administer the process of the said Annex 1, MPAI does not independently evaluate, test, or verify the accuracy of any of the information or the suitability of any of the technology choices made in its Standards.

 

MPAI Standards may be modified at any time by corrigenda or new editions. A new edition, however, may not necessarily replace an existing MPAI standard. Visit the web page to determine the status of any given published MPAI Standard.

 

Comments on MPAI Standards are welcome from any interested parties, whether MPAI members or not. Comments shall mandatorily include the name and the version of the MPAI Standard and, if applicable, the specific page or line the comment applies to. Comments should be sent to the MPAI Secretariat. Comments will be reviewed by the appropriate committee for their technical relevance. However, MPAI does not provide interpretation, consulting information, or advice on MPAI Standards. Interested parties are invited to join MPAI so that they can attend the relevant Development Committees.

 

Coverage and Applicability of MPAI Standards

MPAI makes no warranties or representations of any kind concerning its Standards, and expressly disclaims all warranties, expressed or implied, concerning any of its Standards, including but not limited to the warranties of merchantability, fitness for a particular purpose, non-infringement etc. MPAI Standards are supplied “AS IS”.

 

The existence of an MPAI Standard does not imply that there are no other ways to produce and distribute products and services in the scope of the Standard. Technical progress may render the technologies included in the MPAI Standard obsolete by the time the Standard is used, especially in a field as dynamic as AI. Therefore, those looking for standards in the Data Compression by Artificial Intelligence area should carefully assess the suitability of MPAI Standards for their needs.

 

IN NO EVENT SHALL MPAI BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO: THE NEED TO PROCURE SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE PUBLICATION, USE OF, OR RELIANCE UPON ANY STANDARD, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE AND REGARDLESS OF WHETHER SUCH DAMAGE WAS FORESEEABLE.

 

MPAI alerts users that practicing its Standards may infringe patents and other rights of third parties. Submitters of technologies to this standard have agreed to licence their Intellectual Property according to their respective Framework Licences.

 

Users of MPAI Standards should consider all applicable laws and regulations when using an MPAI Standard. The validity of Conformance Testing is strictly technical and refers to the correct implementation of the MPAI Standard. Moreover, positive Performance Assessment of an implementation applies exclusively in the context of the MPAI Governance and does not imply compliance with any regulatory requirements in the context of any jurisdiction. Therefore, it is the responsibility of the MPAI Standard implementer to observe or refer to the applicable regulatory requirements. By publishing an MPAI Standard, MPAI does not intend to promote actions that are not in compliance with applicable laws, and the Standard shall not be construed as doing so. In particular, users should evaluate MPAI Standards from the viewpoint of data privacy and data ownership in the context of their jurisdictions.

 

Implementers and users of MPAI Standards documents are responsible for determining and complying with all appropriate safety, security, environmental and health requirements, and all applicable laws and regulations.

 

Copyright

MPAI draft and approved standards, whether they are in the form of documents or as web pages or otherwise, are copyrighted by MPAI under Swiss and international copyright laws. MPAI Standards are made available and may be used for a wide variety of public and private uses, e.g., implementation, use and reference, in laws and regulations and standardisation. By making these documents available for these and other uses, however, MPAI does not waive any rights in copyright to its Standards. For inquiries regarding the copyright of MPAI standards, please contact the MPAI Secretariat.

 

The Reference Software of an MPAI Standard is released with the MPAI Modified Berkeley Software Distribution licence. However, implementers should be aware that the Reference Software of an MPAI Standard may reference some third-party software that may have a different licence.

 

 

 

 

  • Patent declarations (Informative)

 

The MPAI Multimodal Conversation (MPAI-MMC) Technical Specification has been developed according to the process outlined in the MPAI Statutes [15] and the MPAI Patent Policy [16].

The following entities have agreed to licence their standard essential patents reading on the MPAI Multimodal Conversation (MPAI-MMC) Technical Specification according to the MPAI-MMC Framework Licence [17]:

 

Table 46 – Companies having submitted a patent declaration (MPAI-MMC V1)

Entity Name Email address
ETRI Songwon Lee lsw84@etri.re.k
KLleon Jisu Kang jisu.kang@klleon.io
Speech Morphing, Inc. Fathy Yassa fathy@speechmorphing.com

 

Patent declarations concern Version 1. Declarations for Version 2 will be published after the corresponding requests for declarations have been made.

  • Personal Status (Informative)

The study of “personal status” – of emotion, cognitive states, attitudes, and other status factors that a person can express at a given time – is not new: many aspects have long been studied. Now, however, technological and scientific advances promise accelerating understanding. MPAI’s aim is to establish standards in various current and future use cases involving Personal Status – for instance, to enable computational systems to recognize users’ emotions and react to them most helpfully. Thus, the need arises to at least roughly characterize and survey Emotions, Cognitive States, and Attitudes.

 

To begin meeting this need, this document proposes definitions, listings, and semantic characterizations of these three factors. These proposals are indeed rough and subject to disagreement or revision on many levels. Accordingly, they can in fact be revised for particular use cases and as the relevant studies move ahead. Revision procedures are specified in the Conclusion below.

 

This Annex offers definitions and examples of each status factor, with brief discussion. Listings of labels and accompanying semantics per factor are given in Section 4.2.

 

Emotions are states of physiological arousal accompanied by changes in facial expressions, gestures, posture, or subjective feelings. Examples include joy, sadness, disgust, fear, and anger. Innate elements of emotions – there may be learned components as well – are controlled by the subcortical regions of the brain, including the amygdala, ventral striatum, and hypothalamus.

 

Sensations like pain, pleasure, taste, vision, hearing, and so on are likewise largely innate, but we’ll try to distinguish them from Emotions as such. Unlike Emotions, sensations will not be defined or listed here.

 

Cognitive states are the results of information processing: a cognitive system accepts input patterns – in humans, initially perceptual patterns, whether new or stored – and produces output patterns, which may include actions that can affect the world outside the system. To perform this processing, the system must recognize the input patterns, perhaps influenced by priming (“expectations”), and then associate them with other patterns, often in a sequence of steps or flow, until the output pattern is reached. The recognition, associations, and sequencing giving rise to Cognitive States may sometimes be innate; but in humans, they’re predominantly learned.

 

This high-level definition of cognition and Cognitive States could describe not only human or other biological information processing, but artificial processing as well – such as that carried out by self-driving vehicles, which must recognize other vehicles, signs and signals, etc., based on patterns conveyed by sensors, and, through processing, derive appropriate action patterns. Clearly, then, the definition is meant to exclude emotion, since the vehicles have none, and in fact probably lack sensations (“qualia”) of any sort, much less consciousness. In humans, however, the separation between emotion and cognition is much harder to make cleanly, since much information processing is at least partly driven by drives which are associated with emotions. Even so, it’s helpful to maintain the separation for analytical purposes; so this Annex will treat Cognitive States as those information processing states which even a system lacking emotions might be able to enter – the processing states that Star Trek’s “purely logical” Mr. Spock might be found in.

However, while observing the distinction between Emotions and Cognitive States as an analytical aid, we certainly recognize (1) borderline cases (like Curiosity, which does involve a drive to obtain new information, but might still be modelled by a system which pursued that goal in numerical terms without emotion, as Mr. Spock might do) and (2) hybrid or overlapping states in which both cognitive processing and emotion play parts (like Positive or Negative Surprise, in which a human is both surprised – as even Mr. Spock might be – but also emotionally pleased or displeased by the unexpected event or discovery).

 

Since we’re defining and listing Emotions and Cognitive States for the limited purposes of near-term human-machine interaction, we’ll avoid a wide range of human emotional and cognitive concerns. Again, we’re bypassing discussion of sensation or consciousness. Likewise, we’ll avoid concern with the emotional factors in human decision-making (related to issues of bias and free will); with abnormal psychology (related to psychosis, obsessive-compulsive disorder, amnesia, etc.); or with many more psychological areas.

 

So, for example, while we will currently be interested in the following states, among others (each clearly a Cognitive State, though some are also viewable as borderline, hybrid, or both):

 

  1. Interest: determination that certain percepts are relevant to goals
  2. Curiosity: bias toward seeking or attending to new percepts or information
  3. Confusion: disorderly information processing
  4. Certainty: conclusion that percepts or processing results are reliable (e.g., as basis for action)
  5. Attention: bias to process some percepts and not others; bias to direct processing through a certain sequence and not others

 

… we will for now avoid discussion of states like these:

  1. Amnesia: loss of long-term memory
  2. Psychosis: a cognitive disorder in which mental percepts are sometimes confused with objectively real ones
  3. Priming: cognitive bias to recognize or process percepts in a certain way
  4. Consciousness: reportable awareness, augmented by self-concept, self-history, awareness of being aware, etc.
  5. Subconscious processing: information processing without awareness or consciousness

 

A person’s attitudes are ways of relating to exterior elements – most often, to other humans, but also to situations, facts, etc. They’re ways of feeling or thinking about those elements, and/or ways of behaving toward them, prompted by the relevant Emotions and Cognitive States.

 

For MPAI’s purposes, Attitudes are of interest for analysis of relations within use cases between people, and/or between people and computational systems. How can a machine communicate a helpful Attitude – the hybrid combination of Emotion and Cognitive State that constitutes a desire to be useful? How can a machine recognize a resentful Attitude – perhaps arising from a user’s anger (Emotion) at her belief (Cognitive State) that she has been treated unfairly in a transaction?

 

The prompting or engendering of Attitudes by relevant Emotions and Cognitive States can be depicted in various ways, as in Figures 20 and 21 below; but, whatever the graphic description, for the purposes of MPAI’s standardization efforts, the focus will remain on the relational aspect of Attitudes, and especially on social relations.

 

Given that Emotions and Cognitive States themselves are difficult to describe precisely, we can’t expect definitive listings or semantic characterizations of the Attitudes that arise from them. Even so, we hope that those in Section 4.2 can prove useful in facilitating coordination among modules.

 

 

Figure 20 – Components of Attitude

Figure 21 – Process of the Behaviour from the Emotion and Attitudes

 

 

  • AIW and AIM Metadata of MMC-CPS

1          Metadata for MPAI-CPS AIW

{

“$schema”:”https://json-schema.org/draft/2020-12/schema”,

“$id”:”https://mpai.community/standards/resources/MPAI-AIF/V2/AIW-AIM-metadata.schema.json”,

“title”:”CPS AIF v2 AIW/AIM metadata”,

“Identifier”:{

“ImplementerID”:”/* String assigned by IIDRA */”,

“Specification”:{

“Standard”:”MPAI-MMC”,

“AIW”:”MMC-CPS”,

“AIM”:”MMC-CPS”,

“Version”:”2″

}

},

“APIProfile”:”Main”,

“Description”:” This AIF is used to enable a human to converse with a machine using Personal Status”,

“Types”:[

{

“Name”:”InputSelection_t”,

“Type”:”{Text_t | Speech_t}”

},

{

“Name”:”Text_t”,

“Type”:”{uint8[] | uint16[]}”

},

{

“Name”:”Speech_t”,

“Type”:”uint16[]”

},

{

“Name”:”Video_t”,

“Type”:” uint24[]”

},

{

“Name”:”3DGraphics_t”,

“Type”:”{uint8[]}”

}

],

“Ports”:[

{

“Name”:”InputText1″,

“Direction”:”InputOutput”,

“RecordType”:”Text_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”InputText2″,

“Direction”:”InputOutput”,

“RecordType”:”Text_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”InputVideo”,

“Direction”:”InputOutput”,

“RecordType”:”Video_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”InputText3″,

“Direction”:”InputOutput”,

“RecordType”:”Text_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”InputAudio”,

“Direction”:”InputOutput”,

“RecordType”:”Speech_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”InputSelection1″,

“Direction”:”InputOutput”,

“RecordType”:”InputSelection_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”InputSelection2″,

“Direction”:”InputOutput”,

“RecordType”:”InputSelection_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”MachineAvatar”,

“Direction”:”OutputInput”,

“RecordType”:”3DGraphics_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”MachineSpeech”,

“Direction”:”OutputInput”,

“RecordType”:”Speech_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”MachineText”,

“Direction”:”OutputInput”,

“RecordType”:”Text_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

}

],

“SubAIMs”:[

{

“Name”:”VisualSceneDescription”,

“Identifier”:{

“ImplementerID”:”/* String assigned by IIDRA */”,

“Specification”:{

“Standard”:”MPAI-MMC”,

“AIW”:”MMC-CPS”,

“AIM”:” VisualSceneDescription”,

“Version”:”2″

}

}

},

{

“Name”:”AudioSceneDescription”,

“Identifier”:{

“ImplementerID”:”/* String assigned by IIDRA */”,

“Specification”:{

“Standard”:”MPAI-MMC”,

“AIW”:”MMC-CPS”,

“AIM”:” AudioSceneDescription”,

“Version”:”2″

}

}

},

{

“Name”:”SpatialObjectIdentification”,

“Identifier”:{

“ImplementerID”:”/* String assigned by IIDRA */”,

“Specification”:{

“Standard”:”MPAI-MMC”,

“AIW”:”MMC-CPS”,

“AIM”:” SpatialObjectIdentification”,

“Version”:”2″

}

}

},

{

“Name”:”SpeechRecognition”,

“Identifier”:{

“ImplementerID”:”/* String assigned by IIDRA */”,

“Specification”:{

“Standard”:”MPAI-MMC”,

“AIW”:”MMC-CPS”,

“AIM”:”SpeechRecognition”,

“Version”:”2″

}

}

},

{

“Name”:”LanguageUnderstanding”,

“Identifier”:{

“ImplementerID”:”/* String assigned by IIDRA */”,

“Specification”:{

“Standard”:”MPAI-MMC”,

“AIW”:”MMC-CPS”,

“AIM”:”LanguageUnderstanding”,

“Version”:”2″

}

}

},

{

“Name”:”PersonalStatusExtraction”,

“Identifier”:{

“ImplementerID”:”/* String assigned by IIDRA */”,

“Specification”:{

“Standard”:”MPAI-MMC”,

“AIW”:”MMC-CPS”,

“AIM”:”PersonalStatusExtraction”,

“Version”:”2″

}

}

},

{

“Name”:”DialogueProcessing”,

“Identifier”:{

“ImplementerID”:”/* String assigned by IIDRA */”,

“Specification”:{

“Standard”:”MPAI-MMC”,

“AIW”:”MMC-CWE”,

“AIM”:”DialogueProcessing”,

“Version”:”2″

}

}

},

{

“Name”:”PersonalStatusDisplay”,

“Identifier”:{

“ImplementerID”:”/* String assigned by IIDRA */”,

“Specification”:{

“Standard”:”MPAI-MMC”,

“AIW”:”MMC-CPS”,

“AIM”:”PersonalStatusDisplay”,

“Version”:”2″

}

}

}

],

“Topology”:[

{

“Output”:{

“AIMName”:””,

“PortName”:”InputVideo”

},

“Input”:{

“AIMName”:”VisualSceneDescription”,

“PortName”:”InputVideo”

}

},

{

“Output”:{

“AIMName”:””,

“PortName”:”InputAudio”

},

“Input”:{

“AIMName”:”AudioSceneDescription”,

“PortName”:”InputAudio”

}

},

{

“Output”:{

“AIMName”:”VisualSceneDescription”,

“PortName”:”BodyDescriptors”

},

“Input”:{

“AIMName”:”SpatialObjectIdentification”,

“PortName”:”BodyDescriptors”

}

},

{

“Output”:{

“AIMName”:”VisualSceneDescription”,

“PortName”:”VisualSceneGeometry”

},

“Input”:{

“AIMName”:”SpatialObjectIdentification”,

“PortName”:”VisualSceneGeometry”

}

},

{

“Output”:{

“AIMName”:”VisualSceneDescription”,

“PortName”:”PhysicalObject”

},

“Input”:{

“AIMName”:”SpatialObjectIdentification”,

“PortName”:”PhysicalObject”

}

},

{

“Output”:{

“AIMName”:”SpatialObjectIdentification”,

“PortName”:”PhysicalObjectID”

},

“Input”:{

“AIMName”:”DialogueProcessing”,

“PortName”:”PhysicalObjectID”

}

},

{

“Output”:{

“AIMName”:””,

“PortName”:”InputText3″

},

“Input”:{

“AIMName”:”LanguageUnderstanding”,

“PortName”:”InputText3″

}

},

{

“Output”:{

“AIMName”:””,

“PortName”:”InputSpeech2″

},

“Input”:{

“AIMName”:”SpeechRecognition”,

“PortName”:”InputSpeech2″

}

},

{

“Output”:{

“AIMName”:”SpeechRecognition”,

“PortName”:”RecognisedText”

},

“Input”:{

“AIMName”:”LanguageUnderstanding”,

“PortName”:”RecognisedText”

}

},

{

“Output”:{

“AIMName”:””,

“PortName”:”InputSelection1″

},

“Input”:{

“AIMName”:”LanguageUnderstanding”,

“PortName”:”InputSelection1″

}

},

{

“Output”:{

“AIMName”:””,

“PortName”:”InputText2″

},

“Input”:{

“AIMName”:”PersonalStatusExtraction”,

“PortName”:”InputText2″

}

},

{

“Output”:{

“AIMName”:”VisualSceneDescription”,

“PortName”:”BodyDescriptors”

},

“Input”:{

“AIMName”:”PersonalStatusExtraction”,

“PortName”:”BodyDescriptors”

}

},

{

“Output”:{

“AIMName”:”VisualSceneDescription”,

“PortName”:”FaceDescriptors”

},

“Input”:{

“AIMName”:”PersonalStatusExtraction”,

“PortName”:”FaceDescriptors”

}

},

{

“Output”:{

“AIMName”:”LanguageUnderstanding”,

“PortName”:”Meaning”

},

“Input”:{

“AIMName”:”PersonalStatusExtraction”,

“PortName”:”Meaning”

}

},


{

“Output”:{

“AIMName”:”AudioSceneDescription”,

“PortName”:”InputSpeech1″

},

“Input”:{

“AIMName”:”PersonalStatusExtraction”,

“PortName”:”InputSpeech1″

}

},

{

“Output”:{

“AIMName”:””,

“PortName”:”InputText1″

},

“Input”:{

“AIMName”:”DialogueProcessing”,

“PortName”:”InputText1″

}

},

{

“Output”:{

“AIMName”:”PersonalStatusExtraction”,

“PortName”:”InputPersonalStatus”

},

“Input”:{

“AIMName”:”DialogueProcessing”,

“PortName”:”InputPersonalStatus”

}

},

{

“Output”:{

“AIMName”:”LanguageUnderstanding”,

“PortName”:”Meaning”

},

“Input”:{

“AIMName”:”DialogueProcessing”,

“PortName”:”Meaning”

}

},

{

“Output”:{

“AIMName”:”LanguageUnderstanding”,

“PortName”:”RefinedText”

},

“Input”:{

“AIMName”:”DialogueProcessing”,

“PortName”:”RefinedText”

}

},

{

“Output”:{

“AIMName”:”DialogueProcessing”,

“PortName”:”MachinePersonalStatus”

},

“Input”:{

“AIMName”:”PersonalStatusDisplay”,

“PortName”:”MachinePersonalStatus”

}

},

{

“Output”:{

“AIMName”:”DialogueProcessing”,

“PortName”:”MachineText”

},

“Input”:{

“AIMName”:”PersonalStatusDisplay”,

“PortName”:”MachineText”

}

},

{

“Output”:{

“AIMName”:”PersonalStatusDisplay”,

“PortName”:”MachineAvatar”

},

“Input”:{

“AIMName”:””,

“PortName”:”MachineAvatar”

}

},

{

“Output”:{

“AIMName”:”PersonalStatusDisplay”,

“PortName”:”MachineSpeech”

},

“Input”:{

“AIMName”:””,

“PortName”:”MachineSpeech”

}

},

{

“Output”:{

“AIMName”:”PersonalStatusDisplay”,

“PortName”:”MachineText”

},

“Input”:{

“AIMName”:””,

“PortName”:”MachineText”

}

}

],

“Implementations”:[

{

“BinaryName”:”mmccps.exe”,

“Architecture”:”x64″,

“OperatingSystem”:”Windows”,

“Version”:”v0.1″,

“Source”:”MPAIStore”,

“Destination”:””

}

],

“ResourcePolicies”:[

{

“Name”:”Memory”,

“Minimum”:”50000″,

“Maximum”:”100000″,

“Request”:”75000″

},

{

“Name”:”CPUNumber”,

“Minimum”:”1″,

“Maximum”:”2″,

“Request”:”1″

},

{

“Name”:”CPU:Class”,

“Minimum”:”Low”,

“Maximum”:”High”,

“Request”:”Medium”

},

{

“Name”:”GPU:CUDA:FrameBuffer”,

“Minimum”:”11GB_GDDR5X”,

“Maximum”:”8GB_GDDR6X”,

“Request”:”11GB_GDDR6″

},

{

“Name”:”GPU:CUDA:MemorySpeed”,

“Minimum”:”1.60GHz”,

“Maximum”:”1.77GHz”,

“Request”:”1.71GHz”

},

{

“Name”:”GPU:CUDA:Class”,

“Minimum”:”SM61″,

“Maximum”:”SM86″,

“Request”:”SM75″

},

{

“Name”:”GPU:Number”,

“Minimum”:”1″,

“Maximum”:”1″,

“Request”:”1″

}

],

“Documentation”:[

{

“Type”:”tutorial”,

“URI”:”https://mpai.community/standards/mpai-mmc/”

}

]

}
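
The Topology array above can be processed mechanically. The following informative Python sketch loads the AIW metadata (assumed here to be saved as mmc-cps-aiw.json) and lists its connections as producer-port/consumer-port pairs; an empty AIMName denotes an input or output port of the AIW itself.

import json

# Informative sketch: print the Topology of the AIW metadata as edges of the
# AIM graph. The file name is an assumption; an empty AIMName denotes a port
# of the AIW itself.
with open("mmc-cps-aiw.json", encoding="utf-8") as f:
    metadata = json.load(f)

for connection in metadata.get("Topology", []):
    src = connection["Output"]
    dst = connection["Input"]
    print("{}.{} -> {}.{}".format(
        src["AIMName"] or "AIW", src["PortName"],
        dst["AIMName"] or "AIW", dst["PortName"]))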

2          AIM metadata for CPS

2.1        Visual Scene Description

{

“Identifier”:{

“ImplementerID”:”/* String assigned by IIDRA */”,

“Specification”:{

“Name”:”MPAI-MMC”,

“AIW”:”MMC-CPS”,

“AIM”:”VisualSceneDescription”,

“Version”:”2″

},

“Description”:”This AIM implements the visual scene description function for MMC-CPS.”,

“Types”:[

{

“Name”:”Video_t”,

“Type”:”uint24[]”

},

{

“Name”:”VisualSceneGeometry_t”,

“Type”:”{uint8[]}”

},

{

“Name”:”FaceDescriptors_t”,

“Type”:”{uint8[]}”

},

 

{

“Name”:”BodyDescriptors_t”,

“Type”:”{uint8[]}”

},

{

“Name”:”PhysicalObject_t”,

“Type”:”{uint8[]}”

}

],

“Ports”:[

{

“Name”:”InputVideo”,

“Direction”:”InputOutput”,

“RecordType”:”Video_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”BodyDescriptors”,

“Direction”:”OutputInput”,

“RecordType”:”BodyDescriptors_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”FaceDescriptors”,

“Direction”:”OutputInput”,

“RecordType”:”FaceDescriptors_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”PhysicalObject”,

“Direction”:”OutputInput”,

“RecordType”:”PhysicalObject_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

}

],

“SubAIMs”:[

 

],

“Topology”:[

 

],

“Implementations”:[

 

],

“Documentation”:[

{

“Type”:”Tutorial”,

“URI”:”https://mpai.community/standards/mpai-mmc/”

}

]

}

}

2.2        Audio Scene Description

{

“Identifier”:{

“ImplementerID”:”/* String assigned by IIDRA */”,

“Specification”:{

“Name”:”MPAI-MMC”,

“AIW”:”MMC-CPS”,

“AIM”:”AudioSceneDescription”,

“Version”:”2″

},

“Description”:”This AIM implements the audio scene description function for MMC-CPS.”,

“Types”:[

{

“Name”:”Audio_t”,

“Type”:”uint16[]”

},

{

“Name”: “Array_Audio_t”,

“Type”: “Audio_t[]”

},

{

“Name”:”Speech_t”,

“Type”:”uint16[]”

}

],

“Ports”:[

{

“Name”:”InputAudio”,

“Direction”:”InputOutput”,

“RecordType”:”Array_Audio_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”Speech”,

“Direction”:”OutputInput”,

“RecordType”:”Speech_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

}

],

“SubAIMs”:[

 

],

“Topology”:[

 

],

“Implementations”:[

 

],

“Documentation”:[

{

“Type”:”Tutorial”,

“URI”:”https://mpai.community/standards/mpai-mmc/”

}

]

}

}

2.3        SpatialObjectIdentification

{

“Identifier”:{

“ImplementerID”:”/* String assigned by IIDRA */”,

“Specification”:{

“Name”:”MPAI-MMC”,

“AIW”:”MMC-CPS”,

“AIM”:”SpatialObjectIdentification”,

“Version”:”1″

},

“Description”:”This AIM identifies the Physical Object indicated by the finger of a human.”,

“Types”:[

{

“Name”:”BodyDescriptors_t”,

“Type”:”uint16[]”

},

{

“Name”:”VisualSceneGeometry_t”,

“Type”:”{uint8[]}”

},

{

“Name”:”PhysicalObject_t”,

“Type”:”{uint8[]}”

},

{

“Name”:”PhysicalObjectID_t”,

“Type”:”{string objectImageLabel; float32 confidenceLevel}”

}

],

“Ports”:[

{

“Name”:”BodyDescriptors”,

“Direction”:”InputOutput”,

“RecordType”:”BodyDescriptors_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”VisualSceneGeometry”,

“Direction”:”InputOutput”,

“RecordType”:”VisualSceneGeometry_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”PhysicalObjects”,

“Direction”:”InputOutput”,

“RecordType”:”PhysicalObject_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”PhysicalObjectID”,

“Direction”:”OutputInput”,

“RecordType”:”Instance_t[]”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

}

],

“SubAIMs”:[

 

],

“Topology”:[

 

],

“Implementations”:[

 

],

“Documentation”:[

{

“Type”:”Tutorial”,

“URI”:”https://mpai.community/standards/mpai-mmc/”

}

]

}

}

2.4        SpeechRecognition

{

“Identifier”:{

“ImplementerID”:”/* String assigned by IIDRA */”,

“Specification”:{

“Name”:”MPAI-MMC”,

“AIW”:”MMC-CPS”,

“AIM”:”SpeechRecognition”,

“Version”:”2″

},

“Description”:”This AIM implements the speech recognition function for MMC-CPS”,

“Types”:[

{

“Name”:”Speech_t”,

“Type”:”uint16[]”

},

{

“Name”:”Text_t”,

“Type”:”{uint8[] | uint16[]}”

}

],

“Ports”:[

{

“Name”:”Speech”,

“Direction”:”InputOutput”,

“RecordType”:”Speech_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”RecognisedText”,

“Direction”:”OutputInput”,

“RecordType”:”Text_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

}

],

“SubAIMs”:[

 

],

“Topology”:[

 

],

“Implementations”:[

 

],

“Documentation”:[

{

“Type”:”Tutorial”,

“URI”:”https://mpai.community/standards/mpai-mmc/”

}

]

}

}

2.5        Language Understanding

{

“Identifier”:{

“ImplementerID”:”/* String assigned by IIDRA */”,

“Specification”:{

“Name”:”MPAI-MMC”,

“AIW”:”MMC-CPS”,

“AIM”:”LanguageUnderstanding”,

“Version”:”2″

},

“Description”:”This AIM implements language understanding function for MMC-CPS.”,

“Types”:[

{

“Name”:”PhysicalObjectID_t”,

“Type”:”{string objectImageLabel; float32 confidenceLevel}”

},

{

“Name”:”Text_t”,

“Type”:”uint8[]”

},

{

“Name”:”Selection_t”,

“Type”:”{Text_t | Speech_t}”

},

{

“Name”:”Tagging_t”,

“Type”:”{string<256 set; string<256 result}”

},

{

“Name”:”Meaning_t”,

“Type”:”{Tagging_t POS_tagging; Tagging_t NE_tagging; Tagging_t dependency_tagging; Tagging_t SRL_tagging}”

}

],

“Ports”:[

{

“Name”:”PhysicalObjectID”,

“Direction”:”InputOutput”,

“RecordType”:”Instance_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”InputText3″,

“Direction”:”InputOutput”,

“RecordType”:”Text_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”RecognisedText”,

“Direction”:”InputOutput”,

“RecordType”:”Text_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”InputSelection1″,

“Direction”:”InputOutput”,

“RecordType”:”Selection_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”Meaning1″,

“Direction”:”OutputInput”,

“RecordType”:”Meaning_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”Meaning2″,

“Direction”:”OutputInput”,

“RecordType”:”Meaning_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”RefinedText”,

“Direction”:”OutputInput”,

“RecordType”:”Text_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

}

],

“SubAIMs”:[

 

],

“Topology”:[

 

],

“Implementations”:[

 

],

“Documentation”:[

{

“Type”:”Tutorial”,

“URI”:”https://mpai.community/standards/mpai-mmc/”

}

]

}

}

 

2.6        PersonalStatusExtraction

{

“Identifier”:{

“ImplementerID”:”/* String assigned by IIDRA */”,

“Specification”:{

“Name”:”MPAI-MMC”,

“AIW”:”MMC-CPS”,

“AIM”:”PersonalStatusExtraction”,

“Version”:”2″

},

“Description”:”This AIM extracts the combined Personal Status from Text, Speech, Face, and Gesture.”,

“Types”:[

{

“Name”:”Text_t”,

“Type”:”{uint8[] | uint16[]}”

},

{

“Name”:”BodyDescriptors_t”,

“Type”:”uint8[]”

},

{

“Name”:”FaceDescriptors_t”,

“Type”:”uint8[]”

},

{

“Name”:”Tagging_t”,

“Type”:”{string<256 set; string<256 result}”

},

{

“Name”:”Meaning_t”,

“Type”:”{Tagging_t POS_tagging; Tagging_t NE_tagging; Tagging_t dependency_tagging; Tagging_t SRL_tagging}”

},

{

“Name”:”Speech_t”,

“Type”:”{uint16[]}”

},

{

“Name”:”PersonalStatus_t”,

“Type”:”uint8[]”

}

],

“Ports”:[

{

“Name”:”InputText2″,

“Direction”:”InputOutput”,

“RecordType”:”Text_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”BodyDescriptors”,

“Direction”:”InputOutput”,

“RecordType”:”BodyDescriptors_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”FaceDescriptors”,

“Direction”:”InputOutput”,

“RecordType”:”FaceDescriptors_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”Meaning”,

“Direction”:”OutputInput”,

“RecordType”:”Meaning_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”InputSpeech1″,

“Direction”:”InputOutput”,

“RecordType”:”Speech_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”PersonalStatus”,

“Direction”:”OutputInput”,

“RecordType”:”PersonalStatus_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

}

],

“SubAIMs”:[

 

],

“Topology”:[

 

],

“Implementations”:[

 

],

“Documentation”:[

{

“Type”:”Tutorial”,

“URI”:”https://mpai.community/standards/mpai-mmc/”

}

]

}

}

2.7        DialogueProcessing

{

“Identifier”:{

“ImplementerID”:”/* String assigned by IIDRA */”,

“Specification”:{

“Name”:”MPAI-MMC”,

“AIW”:”MMC-CPS”,

“AIM”:”DialogueProcessing”,

“Version”:”2″

},

“Description”:”This AIM produces the Machine’s Text and Personal Status from the human’s Text and Personal Status.”,

“Types”:[

{

“Name”:”Text_t”,

“Type”:”{uint8[] | uint16[]}”

},

{

“Name”:”PersonalStatus_t”,

“Type”:”uint8[]”

},

{

“Name”:”Tagging_t”,

“Type”:”{string<256 set; string<256 result}”

},

{

“Name”:”Meaning_t”,

“Type”:”{Tagging_t POS_tagging; Tagging_t NE_tagging; Tagging_t dependency_tagging; Tagging_t SRL_tagging}”

},


{

“Name”:”Selection_t”,

“Type”:”{Text_t | Speech_t}”

}

],

“Ports”:[

{

“Name”:”InputText1″,

“Direction”:”InputOutput”,

“RecordType”:”Text_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”InputPersonalStatus”,

“Direction”:”InputOutput”,

“RecordType”:”PersonalStatus_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”Meaning”,

“Direction”:”InputOutput”,

“RecordType”:”Meaning_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”RefinedText”,

“Direction”:”InputOutput”,

“RecordType”:”Text_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”MachinePersonalStatus”,

“Direction”:”OutputInput”,

“RecordType”:”PersonalStatus_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”MachineText”,

“Direction”:”OutputInput”,

“RecordType”:”Text_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

}

],

“SubAIMs”:[

 

],

“Topology”:[

 

],

“Implementations”:[

 

],

“Documentation”:[

{

“Type”:”Tutorial”,

“URI”:”https://mpai.community/standards/mpai-mmc/”

}

]

}

}

2.8        PersonalStatusDisplay

{

“Identifier”:{

“ImplementerID”:”/* String assigned by IIDRA */”,

“Specification”:{

“Name”:”MPAI-MMC”,

“AIW”:”MMC-CPS”,

“AIM”:”PersonalStatusDisplay”,

“Version”:”2″

},

“Description”:”This AIM renders a speaking avatar from Machine Text and Machine Personal Status.”,

“Types”:[

{

“Name”:”PersonalStatus_t”,

“Type”:”{uint8[]}”

},

{

“Name”:”Text_t”,

“Type”:”{uint8[] | uint16[]}”

},

{

“Name”:”Speech_t”,

“Type”:”uint16[]”

},

{

“Name”:”3DGraphics_t”,

“Type”:”uint8[]”

}

],

“Ports”:[

{

“Name”:”MachinePersonalStatus”,

“Direction”:”InputOutput”,

“RecordType”:”PersonalStatus_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”MachineText”,

“Direction”:”InputOutput”,

“RecordType”:”Text_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”MachineAvatar”,

“Direction”:”OutputInput”,

“RecordType”:”3DGraphics_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”MachineSpeech”,

“Direction”:”OutputInput”,

“RecordType”:”Speech_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”MachineText”,

“Direction”:”OutputInput”,

“RecordType”:”Text_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

}

],

“SubAIMs”:[

 

],

“Topology”:[

 

],

“Implementations”:[

 

],

“Documentation”:[

{

“Type”:”Tutorial”,

“URI”:”https://mpai.community/standards/mpai-mmc/”

}

]

}

}

 

 

 

 

 

  • AIW and AIM Metadata of MMC-CWE

1          AIW metadata for CWE

{

“$schema”:”https://json-schema.org/draft/2020-12/schema”,

“$id”:”https://mpai.community/standards/resources/MPAI-AIF/V1/AIW-AIM-metadata.schema.json”,

“title”:”CWE AIF v1 AIW/AIM metadata”,

“Identifier”:{

“ImplementerID”:”/* String assigned by IIDRA */”,

“Specification”:{

“Standard”:”MPAI-MMC”,

“AIW”:”MMC-CWE”,

“AIM”:”MMC-CWE”,

“Version”:”2″

}

},

“APIProfile”:”Basic”,

“Description”:” This AIF is used to call the AIW of CWE”,

“Types”:[

{

“Name”:”Speech_t”,

“Type”:”uint16[]”

},

{

“Name”:”Text_t”,

“Type”:”{uint8[] | uint16[]}”

},

{

“Name”:”Video_t”,

“Type”:” uint24[]”

},

{

“Name”:”Selection_t”,

“Type”:”{Text_t | Speech_t}”

}

],

“Ports”:[

{

“Name”:”InputSpeech1″,

“Direction”:”InputOutput”,

“RecordType”:”Speech_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”InputText1″,

“Direction”:”InputOutput”,

“RecordType”:”Text_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”InputSpeech2″,

“Direction”:”InputOutput”,

“RecordType”:”Speech_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”InputVideo”,

“Direction”:”InputOutput”,

“RecordType”:”Video_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”InputSelection1″,

“Direction”:”InputOutput”,

“RecordType”:”Selection_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”MachineText1″,

“Direction”:”OutputInput”,

“RecordType”:”Text_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”InputSelection2″,

“Direction”:”InputOutput”,

“RecordType”:”Selection_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”MachineText”,

“Direction”:”OutputInput”,

“RecordType”:”Text_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”MachineSpeech”,

“Direction”:”OutputInput”,

“RecordType”:”Speech_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”MachineVideo”,

“Direction”:”OutputInput”,

“RecordType”:”Video_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

}

],

“SubAIMs”:[

{

“Name”:”SpeechRecognition”,

“Identifier”:{

“ImplementerID”:”/* String assigned by IIDRA */”,

“Specification”:{

“Standard”:”MPAI-MMC”,

“AIW”:”MMC-CWE”,

“AIM”:”SpeechRecognition”,

“Version”:”1″

}

}

},

{

“Name”:”VisualSceneDescription”,

“Identifier”:{

“ImplementerID”:”/* String assigned by IIDRA */”,

“Specification”:{

“Standard”:”MPAI-MMC”,

“AIW”:”MMC-CWE”,

“AIM”:”VisualSceneDescription”,

“Version”:”1″

}

}

},

{

“Name”:”LanguageUnderstanding”,

“Identifier”:{

“ImplementerID”:”/* String assigned by IIDRA */”,

“Specification”:{

“Standard”:”MPAI-MMC”,

“AIW”:”MMC-CWE”,

“AIM”:”LanguageUnderstanding”,

“Version”:”1″

}

}

},

{

“Name”:”PersonalStatusExtraction”,

“Identifier”:{

“ImplementerID”:”/* String assigned by IIDRA */”,

“Specification”:{

“Standard”:”MPAI-MMC”,

“AIW”:”MMC-CWE”,

“AIM”:”PersonalStatusExtraction”,

“Version”:”1″

}

}

},

{

“Name”:”DialogueProcessing”,

“Identifier”:{

“ImplementerID”:”/* String assigned by IIDRA */”,

“Specification”:{

“Standard”:”MPAI-MMC”,

“AIW”:”MMC-CWE”,

“AIM”:”DialogueProcessing”,

“Version”:”1″

}

}

},

{

“Name”:”SpeechSynthesisEmotion”,

“Identifier”:{

“ImplementerID”:”/* String assigned by IIDRA */”,

“Specification”:{

“Standard”:”MPAI-MMC”,

“AIW”:”MMC-CWE”,

“AIM”:”SpeechSynthesisEmotion”,

“Version”:”1″

}

}

},

{

“Name”:”LipsAnimation”,

“Identifier”:{

“ImplementerID”:”/* String assigned by IIDRA */”,

“Specification”:{

“Standard”:”MPAI-MMC”,

“AIW”:”MMC-CWE”,

“AIM”:”LipsAnimation”,

“Version”:”1″

}

}

}

],

“Topology”:[

{

“Output”:{

“AIMName”:””,

“PortName”:”InputSpeech1″

},

“Input”:{

“AIMName”:”SpeechRecognition”,

“PortName”:”InputSpeech1″

}

},

{

“Output”:{

“AIMName”:””,

“PortName”:”InputVideo”

},

“Input”:{

“AIMName”:”VisualSceneDescription”,

“PortName”:”InputVideo”

}

},

{

“Output”:{

“AIMName”:”SpeechRecognition”,

“PortName”:”RecognisedText”

},

“Input”:{

“AIMName”:”LanguageUnderstanding”,

“PortName”:”RecognisedText”

}

},

{

“Output”:{

“AIMName”:””,

“PortName”:”InputText1″

},

“Input”:{

“AIMName”:”LanguageUnderstanding”,

“PortName”:”InputText1″

}

},

{

“Output”:{

“AIMName”:”LanguageUnderstanding”,

“PortName”:”Meaning”

},

“Input”:{

“AIMName”:”PersonalStatusExtraction”,

“PortName”:”Meaning”

}

},

{

“Output”:{

“AIMName”:”LanguageUnderstanding”,

“PortName”:”RefinedText”

},

“Input”:{

“AIMName”:”PersonalStatusExtraction”,

“PortName”:”RefinedText”

}

},

{

“Output”:{

“AIMName”:””,

“PortName”:”InputSpeech2″

},

“Input”:{

“AIMName”:”PersonalStatusExtraction”,

“PortName”:”InputSpeech2″

}

},

{

“Output”:{

“AIMName”:”VisualSceneDescription”,

“PortName”:”FaceDescriptors”

},

“Input”:{

“AIMName”:”PersonalStatusExtraction”,

“PortName”:”FaceDescriptors”

}

},

{

“Output”:{

“AIMName”:””,

“PortName”:”InputSelection1″

},

“Input”:{

“AIMName”:”PersonalStatusExtraction”,

“PortName”:”InputSelection1″

}

},

{

“Output”:{

“AIMName”:””,

“PortName”:”InputText2″

},

“Input”:{

“AIMName”:”PersonalStatusExtraction”,

“PortName”:”InputText2″

}

},

{

“Output”:{

“AIMName”:”LanguageUnderstanding”,

“PortName”:”Meaning”

},

“Input”:{

“AIMName”:”DialogueProcessing”,

“PortName”:”Meaning”

}

},

{

“Output”:{

“AIMName”:”LanguageUnderstanding”,

“PortName”:”RefinedText”

},

“Input”:{

“AIMName”:”DialogueProcessing”,

“PortName”:”RefinedText”

}

},

{

“Output”:{

“AIMName”:”PersonalStatusExtraction”,

“PortName”:”InputPersonalStatus”

},

“Input”:{

“AIMName”:”DialogueProcessing”,

“PortName”:”InputPersonalStatus”

}

},

{

“Output”:{

“AIMName”:””,

“PortName”:”InputSelection2″

},

“Input”:{

“AIMName”:”DialogueProcessing”,

“PortName”:”InputSelection2″

}

},

{

“Output”:{

“AIMName”:””,

“PortName”:”InputText3″

},

“Input”:{

“AIMName”:”DialogueProcessing”,

“PortName”:”InputText3″

}

},

{

“Output”:{

“AIMName”:”DialogueProcessing”,

“PortName”:”MachineText1″

},

“Input”:{

“AIMName”:””,

“PortName”:”MachineText1″

}

},

{

“Output”:{

“AIMName”:”DialogueProcessing”,

“PortName”:”MachineText2″

},

“Input”:{

“AIMName”:”SpeechSynthesisEmotion”,

“PortName”:”MachineText2″

}

},

{

“Output”:{

“AIMName”:”SpeechSynthesisEmotion”,

“PortName”:”MachineSpeech1″

},

“Input”:{

“AIMName”:””,

“PortName”:”MachineSpeech1″

}

},

{

“Output”:{

“AIMName”:”SpeechSynthesisEmotion”,

“PortName”:”MachineSpeech2″

},

“Input”:{

“AIMName”:”LipsAnimation”,

“PortName”:”MachineSpeech2″

}

},

{

“Output”:{

“AIMName”:”DialogueProcessing”,

“PortName”:”MachinePersonalStatus”

},

“Input”:{

“AIMName”:”LipsAnimation”,

“PortName”:”MachinePersonalStatus”

}

},

{

“Output”:{

“AIMName”:”LipsAnimation”,

“PortName”:”MachineFace”

},

“Input”:{

“AIMName”:””,

“PortName”:”MachineFace”

}

}

],

“Implementations”:[

{

“BinaryName”:”mmccwe.exe”,

“Architecture”:”x64″,

“OperatingSystem”:”Windows”,

“Version”:”v0.1″,

“Source”:”AIMStorage”,

“Destination”:””

}

],

“ResourcePolicies”:[

{

“Name”:”Memory”,

“Minimum”:”50000″,

“Maximum”:”100000″,

“Request”:”75000″

},

{

“Name”:”CPUNumber”,

“Minimum”:”1″,

“Maximum”:”2″,

“Request”:”1″

},

{

“Name”:”CPU:Class”,

“Minimum”:”Low”,

“Maximum”:”High”,

“Request”:”Medium”

},

{

“Name”:”GPU:CUDA:FrameBuffer”,

“Minimum”:”11GB_GDDR5X”,

“Maximum”:”8GB_GDDR6X”,

“Request”:”11GB_GDDR6″

},

{

“Name”:”GPU:CUDA:MemorySpeed”,

“Minimum”:”1.60GHz”,

“Maximum”:”1.77GHz”,

“Request”:”1.71GHz”

},

{

“Name”:”GPU:CUDA:Class”,

“Minimum”:”SM61″,

“Maximum”:”SM86″,

“Request”:”SM75″

},

{

“Name”:”GPU:Number”,

“Minimum”:”1″,

“Maximum”:”1″,

“Request”:”1″

}

],

“Documentation”:[

{

“Type”:”tutorial”,

“URI”:”https://mpai.community/standards/mpai-mmc/”

}

]

}
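The Topology array above describes the directed connections between AIM ports, an empty AIMName denoting a port of the AIW itself. The following informative Python sketch shows how an implementation might turn such a Topology into a connection map; the file name mmc-cwe-aiw.json and the function names are illustrative and not part of this Technical Specification.

# Informative sketch: build a connection map from an AIW "Topology" array.
import json
from collections import defaultdict

def load_topology(path):
    with open(path, encoding="utf-8") as f:
        return json.load(f)["Topology"]

def connection_map(topology):
    """Map (source AIM, port) -> list of (destination AIM, port).
    An empty AIMName denotes a port of the AIW itself."""
    edges = defaultdict(list)
    for link in topology:
        src = (link["Output"]["AIMName"] or "<AIW>", link["Output"]["PortName"])
        dst = (link["Input"]["AIMName"] or "<AIW>", link["Input"]["PortName"])
        edges[src].append(dst)
    return edges

if __name__ == "__main__":
    for src, dsts in connection_map(load_topology("mmc-cwe-aiw.json")).items():
        print(src, "->", dsts)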

 

2          AIM metadata

2.1        SpeechRecognition

{

“Identifier”:{

“ImplementerID”:”/* String assigned by IIDRA */”,

“Specification”:{

“Name”:”MPAI-MMC”,

“AIW”:”MMC-CWE”,

“AIM”:”SpeechRecognition”,

“Version”:”2″

},

“Description”:”This AIM implements speech recognition function for MMC-CWE.”,

“Types”:[

{

“Name”:”Speech_t”,

“Type”:”uint16[]”

},

{

“Name”:”Text_t”,

“Type”:”{uint8[] | uint16[]}”

}

],

“Ports”:[

{

“Name”:”InputSpeech”,

“Direction”:”InputOutput”,

“RecordType”:”Speech_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”RecognisedText”,

“Direction”:”OutputInput”,

“RecordType”:”Text_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

}

],

“SubAIMs”:[

 

],

“Topology”:[

 

],

“Implementations”:[

 

],

“Documentation”:[

{

“Type”:”Tutorial”,

“URI”:”https://mpai.community/standards/mpai-mmc/”

}

]

}

}

2.2        Visual Scene Description

{

“Identifier”:{

“ImplementerID”:”/* String assigned by IIDRA */”,

“Specification”:{

“Name”:”MPAI-MMC”,

“AIW”:”MMC-CWE”,

“AIM”:”VisualSceneDescription”,

“Version”:”2″

},

“Description”:”This AIM describes the visual scene in MMC-CWE as Face Descriptors.”,

“Types”:[

{

“Name”:”Video_t”,

“Type”:”uint32[]”

},

{

“Name”:”FaceDescriptors_t”,

“Type”:”uint8[]”

}

],

“Ports”:[

{

“Name”:”InputVideo”,

“Direction”:”InputOutput”,

“RecordType”:”Video_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”FaceDescriptors”,

“Direction”:”OutputInput”,

“RecordType”:”FaceDescriptors_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

}

],

“SubAIMs”:[

 

],

“Topology”:[

 

],

“Implementations”:[

 

],

“Documentation”:[

{

“Type”:”Tutorial”,

“URI”:”https://mpai.community/standards/mpai-mmc/”

}

]

}

}

2.3        Language Understanding

{

“Identifier”:{

“ImplementerID”:”/* String assigned by IIDRA */”,

“Specification”:{

“Name”:”MPAI-MMC”,

“AIW”:”MMC-CWE”,

“AIM”:”LanguageUnderstanding”,

“Version”:”2″

},

“Description”:”This AIM implements language understanding function for MMC-CWE.”,

“Types”:[

{

“Name”:”Text_t”,

“Type”:”{uint8[] | uint16[]}”

},

{

“Name”:”Meaning_t”,

“Type”:”{Tagging_t POS_tagging; Tagging_t NE_tagging; Tagging_t dependency_tagging; Tagging_t SRL_tagging}”

}

],

“Ports”:[

{

“Name”:”RecognisedText”,

“Direction”:”InputOutput”,

“RecordType”:”Text_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”Meaning1″,

“Direction”:”OutputInput”,

“RecordType”:”Meaning_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”RefinedText”,

“Direction”:”OutputInput”,

“RecordType”:”Text_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”Meaning2″,

“Direction”:”OutputInput”,

“RecordType”:”Meaning_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

}

],

“SubAIMs”:[

 

],

“Topology”:[

 

],

“Implementations”:[

 

],

“Documentation”:[

{

“Type”:”Tutorial”,

“URI”:”https://mpai.community/standards/mpai-mmc/”

}

]

}

}

 

2.4        PersonalStatusExtraction

{

“Identifier”:{

“ImplementerID”:”/* String assigned by IIDRA */”,

“Specification”:{

“Name”:”MPAI-MMC”,

“AIW”:”MMC-CWE”,

“AIM”:”PersonalStatusExtraction”,

“Version”:”2″

},

“Description”:”This AIM extracts and combines Personal Status from Text, Speech, and Face.”,

“Types”:[

{

“Name”:”Text_t”,

“Type”:”{uint8[] | uint16[]}”

},

{

“Name”:”Tagging_t”,

“Type”:”{string<256 set; string<256 result}”

},

{

“Name”:”Meaning_t”,

“Type”:”{Tagging_t POS_tagging; Tagging_t NE_tagging; Tagging_t dependency_tagging; Tagging_t SRL_tagging}”

},

{

“Name”:”Speech_t”,

“Type”:”{uint16[]}”

},

{

“Name”:”FaceDescriptors_t”,

“Type”:”uint8[]”

},

{

“Name”:”Selection_t”,

“Type”:”{Text_t | Speech_t}”

},

{

“Name”:”PersonalStatus_t”,

“Type”:”uint8[]”

}

],

“Ports”:[

{

“Name”:”RefinedText2″,

“Direction”:”InputOutput”,

“RecordType”:”Text_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”Meaning2″,

“Direction”:”InputOutput”,

“RecordType”:”Meaning_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”InputSpeech2″,

“Direction”:”InputOutput”,

“RecordType”:”Speech_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”FaceDescriptors”,

“Direction”:”InputOutput”,

“RecordType”:”FaceDescriptors_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”InputSelection1″,

“Direction”:”OutputInput”,

“RecordType”:”Selection_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”InputText1″,

“Direction”:”InputOutput”,

“RecordType”:”Text_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”PersonalStatus”,

“Direction”:”OutputInput”,

“RecordType”:”PersonalStatus_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

}

],

“SubAIMs”:[

 

],

“Topology”:[

 

],

“Implementations”:[

 

],

“Documentation”:[

{

“Type”:”Tutorial”,

“URI”:”https://mpai.community/standards/mpai-mmc/”

}

]

}

}
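In the Ports arrays above, ports with Direction "InputOutput" receive data and ports with Direction "OutputInput" emit data. The following informative Python sketch partitions the Ports of an AIM record according to that convention; the function name is illustrative.

# Informative sketch: split an AIM record's Ports into inputs and outputs,
# following the Direction convention used in the metadata of this document.
def split_ports(aim_record):
    inputs = [p["Name"] for p in aim_record["Ports"] if p["Direction"] == "InputOutput"]
    outputs = [p["Name"] for p in aim_record["Ports"] if p["Direction"] == "OutputInput"]
    return inputs, outputs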

2.5        Dialogue Processing

{

“Identifier”:{

“ImplementerID”:”/* String assigned by IIDRA */”,

“Specification”:{

“Name”:”MPAI-MMC”,

“AIW”:”MMC-CWE”,

“AIM”:”DialogueProcessing”,

“Version”:”1″

},

“Description”:”This AIM implements Dialog Processing for MMC-CWE.”,

“Types”:[

{

“Name”:”Meaning_t”,

“Type”:”{Tagging_t POS_tagging; Tagging_t NE_tagging; Tagging_t dependency_tagging; Tagging_t SRL_tagging}”

},

{

“Name”:”Text_t”,

“Type”:”{uint8[] | uint16[]}”

},

{

“Name”:”PersonalStatus_t”,

“Type”:”uint8[]”

}

 

],

“Ports”:[

{

“Name”:”Meaning1″,

“Direction”:”InputOutput”,

“RecordType”:”Meaning_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”RefinedText1″,

“Direction”:”InputOutput”,

“RecordType”:”Text_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”InputPersonalStatus”,

“Direction”:”InputOutput”,

“RecordType”:”PersonalStatus_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”InputSelection2″,

“Direction”:”OutputInput”,

“RecordType”:”Selection_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”InputText2″,

“Direction”:”InputOutput”,

“RecordType”:”Text_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”MachineText1″,

“Direction”:”OutputInput”,

“RecordType”:”Text_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”MachineText2″,

“Direction”:”OutputInput”,

“RecordType”:”Text_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”MachinePersonalStatus1″,

“Direction”:”OutputInput”,

“RecordType”:”PersonalStatus_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”MachinePersonalStatus2″,

“Direction”:”OutputInput”,

“RecordType”:”PersonalStatus_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

}

],

“SubAIMs”:[

 

],

“Topology”:[

 

],

“Implementations”:[

 

],

“Documentation”:[

{

“Type”:”Tutorial”,

“URI”:”https://mpai.community/standards/mpai-mmc/”

}

]

}

2.6        SpeechSynthesisEmotion

{

“Identifier”:{

“ImplementerID”:”/* String assigned by IIDRA */”,

“Specification”:{

“Name”:”MPAI-MMC”,

“AIW”:”MMC-CWE”,

“AIM”:”SpeechSynthesisEmotion”,

“Version”:”2″

},

“Description”:”This AIM implements speech synthesis with emotion function for MMC-CWE.”,

“Types”:[

{

“Name”:”Text_t”,

“Type”:”{uint8[] | uint16[]}”

},

{

“Name”:”PersonalStatus_t”,

“Type”:”uint8[]”

},

{

“Name”:”Speech_t”,

“Type”:”uint16[]”

}

],

“Ports”:[

{

“Name”:”MachineText2″,

“Direction”:”InputOutput”,

“RecordType”:”Text_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”MachinePersonalStatus1″,

“Direction”:”InputOutput”,

“RecordType”:”PersonalStatus_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”MachineSpeech1″,

“Direction”:”OutputInput”,

“RecordType”:”Speech_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”MachineSpeech2″,

“Direction”:”OutputInput”,

“RecordType”:”Speech_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

}

],

“SubAIMs”:[

 

],

“Topology”:[

 

],

“Implementations”:[

 

],

“Documentation”:[

{

“Type”:”Tutorial”,

“URI”:”https://mpai.community/standards/mpai-mmc/”

}

]

}

 

2.7        Lips Animation

{

“Identifier”:{

“ImplementerID”:”/* String assigned by IIDRA */”,

“Specification”:{

“Name”:”MPAI-MMC”,

“AIW”:”MMC-CWE”,

“AIM”:”LipsAnimation”,

“Version”:”2″

},

“Description”:”This AIM implements lips animation function for MMC-CWE.”,

“Types”:[

{

“Name”:”Speech_t”,

“Type”:”uint16[]”

},

{

“Name”:”PersonalStatus_t”,

“Type”:”uint8[]”

},

{

“Name”:”Video_t”,

“Type”:”uint24[]”

}

],

“Ports”:[

{

“Name”:”MachineSpeech1″,

“Direction”:”InputOutput”,

“RecordType”:”Speech_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”MachinePersonalStatus2″,

“Direction”:”InputOutput”,

“RecordType”:”PersonalStatus_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”FaceKBVideo”,

“Direction”:”InputOutput”,

“RecordType”:”Video_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”MachineFace”,

“Direction”:”OutputInput”,

“RecordType”:”Video_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

}

],

“SubAIMs”:[

 

],

“Topology”:[

 

],

“Implementations”:[

 

],

“Documentation”:[

{

“Type”:”Tutorial”,

“URI”:”https://mpai.community/standards/mpai-mmc/”

}

]

}
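Each AIM record declares in Types the record types referenced by its Ports. The following informative Python sketch checks that every RecordType of a Port is declared in the same record; the function name is illustrative.

# Informative sketch: report Ports whose RecordType is not declared in Types.
def check_record_types(aim_record):
    declared = {t["Name"] for t in aim_record.get("Types", []) if isinstance(t, dict)}
    problems = []
    for port in aim_record.get("Ports", []):
        record_type = port["RecordType"].strip()
        if record_type not in declared:
            problems.append((port["Name"], record_type))
    return problems  # list of (port name, undeclared type) pairs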

 

 

 

  • AIW and AIM Metadata of MMC-MQA

1          AIW metadata for MQA

{

“$schema”:”https://json-schema.org/draft/2020-12/schema”,

“$id”:”https://mpai.community/standards/resources/MPAI-AIF/V1/AIW-AIM-metadata.schema.json”,

“title”:”MQA AIF v1 AIW/AIM metadata”,

“Identifier”:{

“ImplementerID”:”/* String assigned by IIDRA */”,

“Specification”:{

“Standard”:”MPAI-MMC”,

“AIW”:”MMC-MQA”,

“AIM”:”MMC-MQA”,

“Version”:”1″

}

},

“APIProfile”:”Basic”,

“Description”:”This AIF is used to execute the AIW of MQA.”,

“Types”:[

{

“Name”:”Selection_t”,

“Type”:”{Text_t | Speech_t}”

},

{

“Name”:”Text_t”,

“Type”:”{uint8[] | uint16[]}”

},

{

“Name”:”Video_t”,

“Type”:”uint24[]”

},

{

“Name”:”Speech_t”,

“Type”:”uint16[]”

},

{

“Name”:”PhysicalObject_t”,

“Type”:”uint8[]”

},

{

“Name”:”PhysicalObjectIdentifier_t”,

“Type”:”{string objectImageLabel; float32 confidenceLevel}”

},

{

“Name”:”Tagging_t”,

“Type”:”{string<256 set; string<256 result}”

},

{

“Name”:”Meaning_t”,

“Type”:”{Tagging_t POS_tagging; Tagging_t NE_tagging; Tagging_t dependency_tagging; Tagging_t SRL_tagging}”

},

{

“Name”:”Intention_t”,

“Type”:”{string<256 qtopic; string<256 qfocus; string<256 qLAT; string<256 qSAT; string<256 qdomain}”

}

],

“Ports”:[

{

“Name”:”InputSelection1″,

“Direction”:”InputOutput”,

“RecordType”:”Selection_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”InputText1″,

“Direction”:”InputOutput”,

“RecordType”:”Text_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”InputText2″,

“Direction”:”InputOutput”,

“RecordType”:”Text_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”InputSelection2″,

“Direction”:”InputOutput”,

“RecordType”:”Selection_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”InputVideo”,

“Direction”:”InputOutput”,

“RecordType”:”Video_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”InputSpeech”,

“Direction”:”InputOutput”,

“RecordType”:”Speech_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”MachineSpeech”,

“Direction”:”OutputInput”,

“RecordType”:”Speech_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”MachineText”,

“Direction”:”OutputInput”,

“RecordType”:”Text_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

}

],

“SubAIMs”:[

{

“Name”:”VisualSceneDescription”,

“Identifier”:{

“ImplementerID”:”/* String assigned by IIDRA */”,

“Specification”:{

“Standard”:”MPAI-MMC”,

“AIW”:”MMC-MQA”,

“AIM”:”VisualSceneDescription”,

“Version”:”1″

}

}

},

{

“Name”:”PhysicalObjectIdentification”,

“Identifier”:{

“ImplementerID”:”/* String assigned by IIDRA */”,

“Specification”:{

“Standard”:”MPAI-MMC”,

“AIW”:”MMC-MQA”,

“AIM”:”PhysicalObjectIdentification”,

“Version”:”2″

}

}

},

{

“Name”:”SpeechRecognition”,

“Identifier”:{

“ImplementerID”:”/* String assigned by IIDRA */”,

“Specification”:{

“Standard”:”MPAI-MMC”,

“AIW”:”MMC-MQA”,

“AIM”:”SpeechRecognition”,

“Version”:”2″

}

}

},

{

“Name”:”LanguageUnderstanding”,

“Identifier”:{

“ImplementerID”:”/* String assigned by IIDRA */”,

“Specification”:{

“Standard”:”MPAI-MMC”,

“AIW”:”MMC-MQA”,

“AIM”:”LanguageUnderstanding”,

“Version”:”2″

}

}

},

{

“Name”:”QuestionAnalysis”,

“Identifier”:{

“ImplementerID”:”/* String assigned by IIDRA */”,

“Specification”:{

“Standard”:”MPAI-MMC”,

“AIW”:”MMC-MQA”,

“AIM”:”QuestionAnalysis”,

“Version”:”2″

}

}

},

{

“Name”:”QuestionAnswering”,

“Identifier”:{

“ImplementerID”:”/* String assigned by IIDRA */”,

“Specification”:{

“Standard”:”MPAI-MMC”,

“AIW”:”MMC-MQA”,

“AIM”:”QuestionAnswering”,

“Version”:”1″

}

}

},

{

“Name”:”SpeechSynthesisText”,

“Identifier”:{

“ImplementerID”:”/* String assigned by IIDRA */”,

“Specification”:{

“Standard”:”MPAI-MMC”,

“AIW”:”MMC-MQA”,

“AIM”:”SpeechSynthesisText”,

“Version”:”1″

}

}

}

],

“Topology”:[

{

“Output”:{

“AIMName”:””,

“PortName”:”InputVideo”

},

“Input”:{

“AIMName”:”VisualSceneDescription”,

“PortName”:”InputVideo”

}

},

{

“Output”:{

“AIMName”:”VisualSceneDescription”,

“PortName”:”PhysicalObject”

},

“Input”:{

“AIMName”:”PhysicalObjectIdentification”,

“PortName”:”PhysicalObject”

}

},

{

“Output”:{

“AIMName”:””,

“PortName”:”InputSpeech”

},

“Input”:{

“AIMName”:”SpeechRecognition”,

“PortName”:”InputSpeech”

}

},

{

“Output”:{

“AIMName”:””,

“PortName”:”InputText2″

},

“Input”:{

“AIMName”:”LanguageUnderstanding”,

“PortName”:”InputText2″

}

},

{

“Output”:{

“AIMName”:””,

“PortName”:”InputSelection2″

},

“Input”:{

“AIMName”:”LanguageUnderstanding”,

“PortName”:”InputSelection2″

}

},

{

“Output”:{

“AIMName”:”PhysicalObjectIdentification”,

“PortName”:”PhysicalObjectIdentifier”

},

“Input”:{

“AIMName”:”LanguageUnderstanding”,

“PortName”:”PhysicalObjectIdentifier”

}

},

{

“Output”:{

“AIMName”:”SpeechRecognition”,

“PortName”:”RecognisedText”

},

“Input”:{

“AIMName”:”LanguageUnderstanding”,

“PortName”:”RecognisedText”

}

},

{

“Output”:{

“AIMName”:”LanguageUnderstanding”,

“PortName”:”Meaning”

},

“Input”:{

“AIMName”:”QuestionAnalysis”,

“PortName”:”Meaning”

}

},

{

“Output”:{

“AIMName”:””,

“PortName”:”InputSelection”

},

“Input”:{

“AIMName”:”QuestionAnswering”,

“PortName”:”InputSelection”

}

},

{

“Output”:{

“AIMName”:””,

“PortName”:”InputText1″

},

“Input”:{

“AIMName”:”QuestionAnswering”,

“PortName”:”InputText1″

}

},

{

“Output”:{

“AIMName”:”LanguageUnderstanding”,

“PortName”:”RefinedText”

},

“Input”:{

“AIMName”:”QuestionAnswering”,

“PortName”:”RefinedText”

}

},

{

“Output”:{

“AIMName”:”LanguageUnderstanding”,

“PortName”:”Meaning”

},

“Input”:{

“AIMName”:”QuestionAnswering”,

“PortName”:”Meaning”

}

},

{

“Output”:{

“AIMName”:”QuestionAnalysis”,

“PortName”:”Intention”

},

“Input”:{

“AIMName”:”QuestionAnswering”,

“PortName”:”Intention”

}

},

{

“Output”:{

“AIMName”:”QuestionAnswering”,

“PortName”:”MachineText1″

},

“Input”:{

“AIMName”:”SpeechSynthesisText”,

“PortName”:”MachineText1″

}

},

{

“Output”:{

“AIMName”:”QuestionAnswering”,

“PortName”:”MachineText2″

},

“Input”:{

“AIMName”:””,

“PortName”:”MachineText2″

}

}

],

“Implementations”:[

{

“BinaryName”:”mmcmqa.exe”,

“Architecture”:”x64″,

“OperatingSystem”:”Windows”,

“Version”:”v0.1″,

“Source”:”AIMStorage”,

“Destination”:””

}

],

“ResourcePolicies”:[

{

“Name”:”Memory”,

“Minimum”:”50000″,

“Maximum”:”100000″,

“Request”:”75000″

},

{

“Name”:”CPUNumber”,

“Minimum”:”1″,

“Maximum”:”2″,

“Request”:”1″

},

{

“Name”:”CPU:Class”,

“Minimum”:”Low”,

“Maximum”:”High”,

“Request”:”Medium”

},

{

“Name”:”GPU:CUDA:FrameBuffer”,

“Minimum”:”11GB_GDDR5X”,

“Maximum”:”8GB_GDDR6X”,

“Request”:”11GB_GDDR6″

},

{

“Name”:”GPU:CUDA:MemorySpeed”,

“Minimum”:”1.60GHz”,

“Maximum”:”1.77GHz”,

“Request”:”1.71GHz”

},

{

“Name”:”GPU:CUDA:Class”,

“Minimum”:”SM61″,

“Maximum”:”SM86″,

“Request”:”SM75″

},

{

“Name”:”GPU:Number”,

“Minimum”:”1″,

“Maximum”:”1″,

“Request”:”1″

}

],

“Documentation”:[

{

“Type”:”tutorial”,

“URI”:”https://mpai.community/standards/mpai-mmc/”

}

]

}

}
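The metadata above references a JSON Schema through $schema and $id. The following informative Python sketch validates an AIW metadata file against a simplified in-line stand-in for that schema, since the normative schema is the one published at the $id location and is not reproduced here; the third-party jsonschema package and the minimal schema are assumptions of the sketch.

# Informative sketch: validate AIW metadata against a minimal structural schema.
# Requires the third-party "jsonschema" package; the real schema is the one
# referenced by "$id" above.
import json
import jsonschema

MINIMAL_AIW_SCHEMA = {
    "type": "object",
    "required": ["Identifier", "Ports", "SubAIMs", "Topology"],
    "properties": {
        "Identifier": {"type": "object"},
        "Ports": {"type": "array"},
        "SubAIMs": {"type": "array"},
        "Topology": {"type": "array"},
    },
}

def validate_aiw(path):
    with open(path, encoding="utf-8") as f:
        metadata = json.load(f)
    jsonschema.validate(instance=metadata, schema=MINIMAL_AIW_SCHEMA)  # raises on failure
    return metadata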

2          AIM metadata

2.1        VisualSceneDescription

{

“Identifier”:{

“ImplementerID”:”/* String assigned by IIDRA */”,

“Specification”:{

“Name”:”MPAI-MMC”,

“AIW”:”MMC-MQA”,

“AIM”:”VisualSceneDescription”,

“Version”:”2″

},

“Description”:”This AIM describes the visual scene for MMC-MQA providing one physical object.”,

“Types”:[

{

“Name”:”Video_t”,

“Type”:”uint32[]”

},

{

“Name”:”PhysicalObject_t”,

“Type”:”uint8[]”

}

],

“Ports”:[

{

“Name”:”InputVideo”,

“Direction”:”InputOutput”,

“RecordType”:”Video_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”PhysicalObject”,

“Direction”:”OutputInput”,

“RecordType”:”PhysicalObject_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

}

],

“SubAIMs”:[

 

],

“Topology”:[

 

],

“Implementations”:[

 

],

“Documentation”:[

{

“Type”:”Tutorial”,

“URI”:”https://mpai.community/standards/mpai-mmc/”

}

]

}

2.2        PhysicalObjectIdentification

{

“Identifier”:{

“ImplementerID”:”/* String assigned by IIDRA */”,

“Specification”:{

“Name”:”MPAI-MMC”,

“AIW”:”MMC-MQA”,

“AIM”:”PhysicalObjectIdentification”,

“Version”:”2″

},

“Description”:”This AIM identifies a physical object.”,

“Types”:[

{

“Name”:”Video_t”,

“Type”:”uint32[]”

},

{

“Name”:”PhysicalObjectIdentifier_t”,

“Type”:”{string objectImageLabel; float32 confidenceLevel}”

}

],

“Ports”:[

{

“Name”:”InputVideo”,

“Direction”:”InputOutput”,

“RecordType”:”Video_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”PhysicalObjectIdentifier”,

“Direction”:”OutputInput”,

“RecordType”:”PhysicalObjectIdentifier_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

}

],

“SubAIMs”:[

 

],

“Topology”:[

 

],

“Implementations”:[

 

],

“Documentation”:[

{

“Type”:”Tutorial”,

“URI”:”https://mpai.community/standards/mpai-mmc/”

}

]

}

2.3        SpeechRecognition

{

“Identifier”:{

“ImplementerID”:”/* String assigned by IIDRA */”,

“Specification”:{

“Name”:”MPAI-MMC”,

“AIW”:”MMC-MQA”,

“AIM”:”SpeechRecognition”,

“Version”:”2″

},

“Description”:”This AIM implements speech recognition function for MMC-MQA that converts a speech object to text.”,

“Types”:[

{

“Name”:”Speech_t”,

“Type”:”uint16[]”

},

{

“Name”:”Text_t”,

“Type”:”{uint8[] | uint16[]}”

}

],

“Ports”:[

{

“Name”:”InputSpeech”,

“Direction”:”InputOutput”,

“RecordType”:”Speech_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”RecognisedText”,

“Direction”:”OutputInput”,

“RecordType”:”Text_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

}

],

“SubAIMs”:[

 

],

“Topology”:[

 

],

“Implementations”:[

 

],

“Documentation”:[

{

“Type”:”Tutorial”,

“URI”:”https://mpai.community/standards/mpai-mmc/”

}

]

}

2.4        Language Understanding

{

“Identifier”:{

“ImplementerID”:”/* String assigned by IIDRA */”,

“Specification”:{

“Name”:”MPAI-MMC”,

“AIW”:”MMC-MQA”,

“AIM”:”LanguageUnderstanding”,

“Version”:”2″

},

“Description”:”This AIM implements language understanding function for MMC-MQA.”,

“Types”:[

{

“Name”:”Text_t”,

“Type”:”{uint8[] | uint16[]}”

},

{

“Name”:”Selection_t”,

“Type”:”{Text_t | Speech_t}”

},

{

“Name”:”ObjectIdentifier_t”,

“Type”:”{string objectImageLabel; float32 confidenceLevel}”

},

{

“Name”:”Meaning_t”,

“Type”:”{Tagging_t POS_tagging; Tagging_t NE_tagging; Tagging_t dependency_tagging; Tagging_t SRL_tagging}”

}

],

“Ports”:[

{

“Name”:”InputText2″,

“Direction”:”InputOutput”,

“RecordType”:”Text_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”InputSelection2″,

“Direction”:”InputOutput”,

“RecordType”:”Selection_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”PhysicalObjectIdentifier”,

“Direction”:”InputOutput”,

“RecordType”:”ObjectIdentifier_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”RecognisedText”,

“Direction”:”InputOutput”,

“RecordType”:”Text_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”RefinedText”,

“Direction”:”OutputInput”,

“RecordType”:”Text_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”Meaning1″,

“Direction”:”OutputInput”,

“RecordType”:”Meaning_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”Meaning2″,

“Direction”:”OutputInput”,

“RecordType”:”Meaning_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

}

],

“SubAIMs”:[

 

],

“Topology”:[

 

],

“Implementations”:[

 

],

“Documentation”:[

{

“Type”:”tutorial”,

“URI”:”https://mpai.community/standards/mpai-mmc/”

}

]

}

}

 

2.5        Question Analysis

{

“Identifier”:{

“ImplementerID”:”/* String assigned by IIDRA */”,

“Specification”:{

“Name”:”MPAI-MMC”,

“AIW”:”MMC-MQA”,

“AIM”:”QuestionAnalysis”,

“Version”:”2″

},

“Description”:”This AIM implements the question analysis function for MMC-MQA.”,

“Types”:[

{

“Name”:”Meaning_t”,

“Type”:”{Tagging_t POS_tagging; Tagging_t NE_tagging; Tagging_t dependency_tagging; Tagging_t SRL_tagging}”

},

{

“Name”:”Intention_t”,

“Type”:”{string<256 qtopic; string<256 qfocus; string<256 qLAT; string<256 qSAT; string<256 qdomain}”

}

],

“Ports”:[

{

“Name”:”Meaning_2″,

“Direction”:”InputOutput”,

“RecordType”:”Meaning_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”Intention”,

“Direction”:”OutputInput”,

“RecordType”:”Intention_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

}

],

“SubAIMs”:[

 

],

“Topology”:[

 

],

“Implementations”:[

 

],

“Documentation”:[

{

“Type”:”tutorial”,

“URI”:”https://mpai.community/standards/mpai-mmc/”

}

]

}

}
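The Intention_t record produced by Question Analysis carries the five string fields declared above. The following informative Python sketch mirrors that record as a dataclass for use in tests; the class name is illustrative.

# Informative sketch: a Python mirror of the Intention_t record declared above.
from dataclasses import dataclass

@dataclass
class Intention:
    qtopic: str
    qfocus: str
    qLAT: str
    qSAT: str
    qdomain: str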

 

2.6        Question Answering

{

“Identifier”:{

“ImplementerID”:”/* String assigned by IIDRA */”,

“Specification”:{

“Name”:”MPAI-MMC”,

“AIW”:”MMC-MQA”,

“AIM”:”QuestionAnswering”,

“Version”:”2″

},

“Description”:”This AIM implements question answering function for MMC-MQA.”,

“Types”:[

{

“Name”:”Selection_t”,

“Type”:”{Text_t | Speech_t}”

},

{

“Name”:”Text_t”,

“Type”:”{uint8[] | uint16[]}”

},

{

“Name”:”Meaning_t”,

“Type”:”{Tagging_t POS_tagging; Tagging_t NE_tagging; Tagging_t dependency_tagging; Tagging_t SRL_tagging}”

},

{

“Name”:”Intention_t”,

“Type”:”{string<256 qtopic; string<256 qfocus; string<256 qLAT; string<256 qSAT; string<256 qdomain}”

}

],

“Ports”:[

{

“Name”:”InputSelection1″,

“Direction”:”InputOutput”,

“RecordType”:”Selection_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”InputText1″,

“Direction”:”InputOutput”,

“RecordType”:”Text_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”RefinedText”,

“Direction”:”InputOutput”,

“RecordType”:”Text_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”Meaning_1″,

“Direction”:”InputOutput”,

“RecordType”:”Meaning_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”Intention”,

“Direction”:”InputOutput”,

“RecordType”:”Intention_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”MachineText1″,

“Direction”:”OutputInput”,

“RecordType”:”Text_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”MachineText2″,

“Direction”:”OutputInput”,

“RecordType”:”Text_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

}

],

“SubAIMs”:[

 

],

“Topology”:[

 

],

“Implementations”:[

 

],

“Documentation”:[

{

“Type”:”Tutorial”,

“URI”:”https://mpai.community/standards/mpai-mmc/”

}

]

}

}

 

2.7        SpeechSynthesisText

{

“Identifier”:{

“ImplementerID”:”/* String assigned by IIDRA */”,

“Specification”:{

“Name”:”MPAI-MMC”,

“AIW”:”MMC-MQA”,

“AIM”:”SpeechSynthesisText”,

“Version”:”2″

},

“Description”:”This AIM implements speech synthesis function for MMC-MQA.”,

“Types”:[

{

“Name”:”Text_t”,

“Type”:”{uint8[] | uint16[]}”

},

{

“Name”:”Speech_t”,

“Type”:”uint16[]”

}

],

“Ports”:[

{

“Name”:”MachineText1″,

“Direction”:”InputOutput”,

“RecordType”:”Text_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”MachineSpeech”,

“Direction”:”OutputInput”,

“RecordType”:”Speech_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

}

],

“SubAIMs”:[

 

],

“Topology”:[

 

],

“Implementations”:[

 

],

“Documentation”:[

{

“Type”:”Tutorial”,

“URI”:”https://mpai.community/standards/mpai-mmc/”

}

]

}

}
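The Type strings used throughout these Types arrays follow a compact C-like notation, e.g. "uint16[]" or "{uint8[] | uint16[]}" for a choice of encodings. The following informative Python sketch merely splits such a declaration into its alternatives; it does not define a normative grammar.

# Informative sketch: list the alternatives of a compact Type declaration.
def type_alternatives(type_string):
    s = type_string.strip()
    if s.startswith("{") and s.endswith("}"):
        s = s[1:-1]
    return [alt.strip() for alt in s.split("|")]

# Example: type_alternatives("{uint8[] | uint16[]}") == ["uint8[]", "uint16[]"]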

 

 

 

 

 

 

 

  • AIW and AIM Metadata of MMC-CAS

1.        AIW metadata for MMC-CAS

{

“$schema”:”https://json-schema.org/draft/2020-12/schema”,

“$id”:”https://mpai.community/standards/resources/MPAI-AIF/V2/AIW-AIM-metadata.schema.json”,

“title”:”CAS AIF V2 AIW/AIM metadata”,

“Identifier”:{

“ImplementerID”:”/* String assigned by IIDRA */”,

“Specification”:{

“Standard”:”MPAI-MMC”,

“AIW”:”MMC-CAS”,

“AIM”:”MMC-CAS”,

“Version”:”2″

}

},

“APIProfile”:”Basic”,

“Description”:”This AIF is used to execute the AIW that enables a human to converse with a machine about objects in an environment.”,

“Types”:[

{

“Name”:”PointOfView_t”,

“Type”:”{float32[3] Position; float32[3] Orientation}”

},

{

“Name”:”Speech_t”,

“Type”:”uint16[]”

},

{

“Name”:”Video_t”,

“Type”:”uint24[]”

},

{

“Name”:”3DGraphics_t”,

“Type”:”{uint8[]}”

},

{

“Name”:”Text_t”,

“Type”:”{uint8[] | uint16[]}”

}

],

“Ports”:[

{

“Name”:”PointOfView”,

“Direction”:”InputOutput”,

“RecordType”:”PointOfView_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”InputSpeech1″,

“Direction”:”InputOutput”,

“RecordType”:”Speech_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”InputVideo”,

“Direction”:”InputOutput”,

“RecordType”:”Video_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”InputSpeech2″,

“Direction”:”InputOutput”,

“RecordType”:”Speech_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”RenderedScene”,

“Direction”:”OutputInput”,

“RecordType”:”3DGraphics_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”MachineText1″,

“Direction”:”OutputInput”,

“RecordType”:”Text_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”MachineAvatar”,

“Direction”:”OutputInput”,

“RecordType”:”3DGraphics_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”MachineSpeech”,

“Direction”:”OutputInput”,

“RecordType”:”Speech_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

}

],

“SubAIMs”:[

{

“Name”:”VisualSceneDescription”,

“Identifier”:{

“ImplementerID”:”/* String assigned by IIDRA */”,

“Specification”:{

“Standard”:”MPAI-MMC”,

“AIW”:”MMC-CAS”,

“AIM”:”VisualSceneDescription”,

“Version”:”1″

}

}

},

{

“Name”:”SpatialObjectIdentification”,

“Identifier”:{

“ImplementerID”:”/* String assigned by IIDRA */”,

“Specification”:{

“Standard”:”MPAI-MMC”,

“AIW”:”MMC-CAS”,

“AIM”:”SpatialObjectIdentification”,

“Version”:”2″

}

}

},

{

“Name”:”SpeechRecognition”,

“Identifier”:{

“ImplementerID”:”/* String assigned by IIDRA */”,

“Specification”:{

“Standard”:”MPAI-MMC”,

“AIW”:”MMC-CAS”,

“AIM”:”SpeechRecognition”,

“Version”:”2″

}

}

},

{

“Name”:”LanguageUnderstanding”,

“Identifier”:{

“ImplementerID”:”/* String assigned by IIDRA */”,

“Specification”:{

“Standard”:”MPAI-MMC”,

“AIW”:”MMC-CAS”,

“AIM”:”LanguageUnderstanding”,

“Version”:”2″

}

}

},

{

“Name”:”PersonalStatusExtraction”,

“Identifier”:{

“ImplementerID”:”/* String assigned by IIDRA */”,

“Specification”:{

“Standard”:”MPAI-MMC”,

“AIW”:”MMC-CAS”,

“AIM”:”PersonalStatusExtraction”,

“Version”:”2″

}

}

},

{

“Name”:”DialogueProcessing”,

“Identifier”:{

“ImplementerID”:”/* String assigned by IIDRA */”,

“Specification”:{

“Standard”:”MPAI-MMC”,

“AIW”:”MMC-CAS”,

“AIM”:”DialogueProcessing”,

“Version”:”2″

}

}

},

{

“Name”:”ScenePresentation”,

“Identifier”:{

“ImplementerID”:”/* String assigned by IIDRA */”,

“Specification”:{

“Standard”:”MPAI-MMC”,

“AIW”:”MMC-CAS”,

“AIM”:”ScenePresentation”,

“Version”:”2″

}

}

},

{

“Name”:”PersonalStatusDisplay”,

“Identifier”:{

“ImplementerID”:”/* String assigned by IIDRA */”,

“Specification”:{

“Standard”:”MPAI-MMC”,

“AIW”:”MMC-CAS”,

“AIM”:”PersonalStatusDisplay”,

“Version”:”2″

}

}

}

],

“Topology”:[

{

“Output”:{

“AIMName”:””,

“PortName”:”InputVideo”

},

“Input”:{

“AIMName”:”VisualSceneDescription”,

“PortName”:”InputVideo”

}

},

{

“Output”:{

“AIMName”:””,

“PortName”:”InputSpeech2″

},

“Input”:{

“AIMName”:”SpeechRecognition”,

“PortName”:”InputSpeech2″

}

},

{

“Output”:{

“AIMName”:”VisualSceneDescription”,

“PortName”:”BodyDescriptors2″

},

“Input”:{

“AIMName”:”SpatialObjectIdentification”,

“PortName”:”BodyDescriptors2″

}

},

{

“Output”:{

“AIMName”:”VisualSceneDescription”,

“PortName”:”VisualSceneGeometry”

},

“Input”:{

“AIMName”:”SpatialObjectIdentification”,

“PortName”:”VisualSceneGeometry”

}

},

{

“Output”:{

“AIMName”:”VisualSceneDescription”,

“PortName”:”PhysicalObject”

},

“Input”:{

“AIMName”:”SpatialObjectIdentification”,

“PortName”:”PhysicalObject”

}

},

{

“Output”:{

“AIMName”:”SpatialObjectIdentification”,

“PortName”:”ObjectID”

},

“Input”:{

“AIMName”:”LanguageUnderstanding”,

“PortName”:”ObjectID”

}

},

{

“Output”:{

“AIMName”:”SpeechRecognition”,

“PortName”:”RecognisedText”

},

“Input”:{

“AIMName”:”LanguageUnderstanding”,

“PortName”:”RecognisedText”

}

},

{

“Output”:{

“AIMName”:””,

“PortName”:”InputSpeech1″

},

“Input”:{

“AIMName”:”PersonalStatusExtraction”,

“PortName”:”InputSpeech1″

}

},

{

“Output”:{

“AIMName”:”VisualSceneDescription”,

“PortName”:”BodyDescriptors1″

},

“Input”:{

“AIMName”:”PersonalStatusExtraction”,

“PortName”:”BodyDescriptors1″

}

},

{

“Output”:{

“AIMName”:”VisualSceneDescription”,

“PortName”:”FaceDescriptors”

},

“Input”:{

“AIMName”:”PersonalStatusExtraction”,

“PortName”:”FaceDescriptors”

}

},

{

“Output”:{

“AIMName”:”LanguageUnderstanding”,

“PortName”:”Meaning1″

},

“Input”:{

“AIMName”:”PersonalStatusExtraction”,

“PortName”:”Meaning1″

}

},

{

“Output”:{

“AIMName”:”LanguageUnderstanding”,

“PortName”:”Meaning2″

},

“Input”:{

“AIMName”:”PersonalStatusExtraction”,

“PortName”:”Meaning2″

}

},

{

“Output”:{

“AIMName”:”LanguageUnderstanding”,

“PortName”:”RefinedText”

},

“Input”:{

“AIMName”:”PersonalStatusExtraction”,

“PortName”:”RefinedText”

}

},


{

“Output”:{

“AIMName”:”PersonalStatusExtraction”,

“PortName”:”InputPersonalStatus”

},

“Input”:{

“AIMName”:”DialogueProcessing”,

“PortName”:”InputPersonalStatus”

}

},

{

“Output”:{

“AIMName”:”LanguageUnderstanding”,

“PortName”:”Meaning2″

},

“Input”:{

“AIMName”:”DialogueProcessing”,

“PortName”:”Meaning2″

}

},

{

“Output”:{

“AIMName”:”LanguageUnderstanding”,

“PortName”:”RefinedText”

},

“Input”:{

“AIMName”:”DialogueProcessing”,

“PortName”:”RefinedText”

}

},

{

“Output”:{

“AIMName”:””,

“PortName”:”PointOfView”

},

“Input”:{

“AIMName”:”ScenePresentation”,

“PortName”:”PointOfView”

}

},

{

“Output”:{

“AIMName”:”DialogueProcessing”,

“PortName”:”MachineText”

},

“Input”:{

“AIMName”:””,

“PortName”:”MachineText”

}

},

{

“Output”:{

“AIMName”:”DialogueProcessing”,

“PortName”:”MachinePersonalStatus”

},

“Input”:{

“AIMName”:”PersonalStatusDisplay”,

“PortName”:”MachinePersonalStatus”

}

},

{

“Output”:{

“AIMName”:”DialogueProcessing”,

“PortName”:”MachineText”

},

“Input”:{

“AIMName”:”PersonalStatusDisplay”,

“PortName”:”MachineText”

}

},

{

“Output”:{

“AIMName”:”ScenePresentation”,

“PortName”:”RenderedScene”

},

“Input”:{

“AIMName”:””,

“PortName”:”RenderedScene”

}

},

{

“Output”:{

“AIMName”:”PersonalStatusDisplay”,

“PortName”:”MachineAvatar”

},

“Input”:{

“AIMName”:””,

“PortName”:”MachineAvatar”

}

},

{

“Output”:{

“AIMName”:”PersonalStatusDisplay”,

“PortName”:”MachineSpeech”

},

“Input”:{

“AIMName”:””,

“PortName”:”MachineSpeech”

}

}

],

“Implementations”:[

{

“BinaryName”:”mmccas.exe”,

“Architecture”:”x64″,

“OperatingSystem”:”Windows”,

“Version”:”v0.1″,

“Source”:”MPAIStore”,

“Destination”:””

}

],

“ResourcePolicies”:[

{

“Name”:”Memory”,

“Minimum”:”50000″,

“Maximum”:”100000″,

“Request”:”75000″

},

{

“Name”:”CPUNumber”,

“Minimum”:”1″,

“Maximum”:”2″,

“Request”:”1″

},

{

“Name”:”CPU:Class”,

“Minimum”:”Low”,

“Maximum”:”High”,

“Request”:”Medium”

},

{

“Name”:”GPU:CUDA:FrameBuffer”,

“Minimum”:”11GB_GDDR5X”,

“Maximum”:”8GB_GDDR6X”,

“Request”:”11GB_GDDR6″

},

{

“Name”:”GPU:CUDA:MemorySpeed”,

“Minimum”:”1.60GHz”,

“Maximum”:”1.77GHz”,

“Request”:”1.71GHz”

},

{

“Name”:”GPU:CUDA:Class”,

“Minimum”:”SM61″,

“Maximum”:”SM86″,

“Request”:”SM75″

},

{

“Name”:”GPU:Number”,

“Minimum”:”1″,

“Maximum”:”1″,

“Request”:”1″

}

],

“Documentation”:[

{

“Type”:”tutorial”,

“URI”:”https://mpai.community/standards/mpai-mmc/”

}

]

}
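Every non-empty AIMName used in the Topology above is expected to match the Name of a Sub-AIM. The following informative Python sketch lists Topology references that have no matching SubAIMs entry; the function name is illustrative.

# Informative sketch: report Topology references to AIMs not declared in SubAIMs.
def undeclared_aims(aiw_record):
    declared = {sub["Name"] for sub in aiw_record.get("SubAIMs", []) if isinstance(sub, dict)}
    missing = set()
    for link in aiw_record.get("Topology", []):
        for end in ("Output", "Input"):
            name = link[end]["AIMName"].strip()
            if name and name not in declared:  # empty AIMName denotes the AIW itself
                missing.add(name)
    return sorted(missing)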

2.        AIM metadata for MMC-CAS

2.1        Visual Scene Description

{

“Identifier”:{

“ImplementerID”:”/* String assigned by IIDRA */”,

“Specification”:{

“Name”:”MPAI-MMC”,

“AIW”:”MMC-CAS”,

“AIM”:”VisualSceneDescription”,

“Version”:”2″

},

“Description”:”This AIM implements the visual scene description function for MMC-CAS.”,

“Types”:[

{

“Name”:”Video_t”,

“Type”:”uint24[]”

},

{

“Name”:”VisualSceneDescriptors_t”,

“Type”:”uint8[]”

},

{

“Name”:”BodyDescriptors_t”,

“Type”:”uint8[]”

},

{

“Name”:”FaceDescriptors_t”,

“Type”:”uint8[]”

},

{

“Name”:”VisualSceneGeometry_t”,

“Type”:”uint8[]”

},

{

“Name”:”PhysicalObject_t”,

“Type”:”uint8[]”

}

],

“Ports”:[

{

“Name”:”InputVideo”,

“Direction”:”InputOutput”,

“RecordType”:”Video_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”VisualSceneDescriptors”,

“Direction”:”OutputInput”,

“RecordType”:”VisualSceneDescriptors_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”BodyDescriptors1″,

“Direction”:”OutputInput”,

“RecordType”:”BodyDescriptors_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”FaceDescriptors”,

“Direction”:”OutputInput”,

“RecordType”:”FaceDescriptors_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”BodyDescriptors2″,

“Direction”:”OutputInput”,

“RecordType”:”BodyDescriptors_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”VisualSceneGeometry”,

“Direction”:”OutputInput”,

“RecordType”:”VisualSceneGeometry_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”PhysicalObject”,

“Direction”:”OutputInput”,

“RecordType”:”PhysicalObject_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

}

],

“SubAIMs”:[

 

],

“Topology”:[

 

],

“Implementations”:[

 

],

“Documentation”:[

{

“Type”:”Tutorial”,

“URI”:”https://mpai.community/standards/mpai-mmc/”

}

]

}

}

2.2        SpatialObjectIdentification

{

“Identifier”:{

“ImplementerID”:”/* String assigned by IIDRA */”,

“Specification”:{

“Name”:”MPAI-MMC”,

“AIW”:”MMC-CAS”,

“AIM”:”SpatialObjectIdentification”,

“Version”:”2″

},

“Description”:”This AIM identifies the Physical Object indicated by a human’s finger.”,

“Types”:[

{

“Name”:”BodyDescriptors_t”,

“Type”:”uint8[]”

},

{

“Name”:”VisualSceneGeometry_t”,

“Type”:”{uint8[]}”

},

{

“Name”:”PhysicalObject_t”,

“Type”:”uint8[]”

},

{

“Name”:”PhysicalObjectID_t”,

“Type”:”{string objectImageLabel; float32 confidenceLevel}”

}

],

“Ports”:[

{

“Name”:”BodyDescriptors”,

“Direction”:”InputOutput”,

“RecordType”:”BodyDescriptors_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”VisualSceneGeometry”,

“Direction”:”InputOutput”,

“RecordType”:”VisualSceneGeometry_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”PhysicalObjects”,

“Direction”:”InputOutput”,

“RecordType”:”PhysicalObject_t[]”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”PhysicalObjectID”,

“Direction”:”OutputInput”,

“RecordType”:”PhysicalObjectID_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

}

],

“SubAIMs”:[

 

],

“Topology”:[

 

],

“Implementations”:[

 

],

“Documentation”:[

{

“Type”:”Tutorial”,

“URI”:”https://mpai.community/standards/mpai-mmc/”

}

]

}

}
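The PhysicalObjectID_t record declared above pairs an object label with a confidence level. The following informative Python sketch applies a confidence threshold to such a record; the threshold value is illustrative.

# Informative sketch: accept a PhysicalObjectID_t record only above a confidence threshold.
def accept_identification(physical_object_id, threshold=0.5):
    """physical_object_id: dict with 'objectImageLabel' and 'confidenceLevel'."""
    return physical_object_id["confidenceLevel"] >= threshold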

2.3        SpeechRecognition

{

“Identifier”:{

“ImplementerID”:”/* String assigned by IIDRA */”,

“Specification”:{

“Name”:”MPAI-MMC”,

“AIW”:”MMC-CAS”,

“AIM”:”SpeechRecognition”,

“Version”:”2″

},

“Description”:”This AIM implements the speech recognition function for MMC-CAS: it converts the user’s speech to text.”,

“Types”:[

{

“Name”:”Speech_t”,

“Type”:”uint16[]”

},

{

“Name”:”Text_t”,

“Type”:”{uint8[] | uint16[]}”

}

],

“Ports”:[

{

“Name”:”InputSpeech2″,

“Direction”:”InputOutput”,

“RecordType”:”Speech_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”RecognisedText”,

“Direction”:”OutputInput”,

“RecordType”:”Text_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

}

],

“SubAIMs”:[

 

],

“Topology”:[

 

],

“Implementations”:[

 

],

“Documentation”:[

{

“Type”:”Tutorial”,

“URI”:”https://mpai.community/standards/mpai-mmc/”

}

]

}

}

2.4        LanguageUnderstanding

{

“Identifier”:{

“ImplementerID”:”/* String assigned by IIDRA */”,

“Specification”:{

“Name”:”MPAI-MMC”,

“AIW”:”MMC-CAS”,

“AIM”:”LanguageUnderstanding”,

“Version”:”1″

},

“Description”:”This AIM extracts Meaning from the Recognised Text supplemented by the ID of the Physical Object, and refines the Recognised Text accordingly.”,

“Types”:[

{

“Name”:”PhysicalObject_t”,

“Type”:”{string objectImageLabel; float32 confidenceLevel}”

},

{

“Name”:”Text_t”,

“Type”:”{uint8[] | uint16[]}”

},

{

“Name”:”Tagging_t”,

“Type”:”{string<256 set; string<256 result}”

},

{

“Name”:”Meaning_t”,

“Type”:”{Tagging_t POS_tagging; Tagging_t NE_tagging; Tagging_t dependency_tagging; Tagging_t SRL_tagging}”

}

],

“Ports”:[

{

“Name”:”PhysicalObjectID”,

“Direction”:”InputOutput”,

“RecordType”:”PhysicalObjectID_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”RecognisedText”,

“Direction”:”InputOutput”,

“RecordType”:”Text_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”Meaning1″,

“Direction”:”OutputInput”,

“RecordType”:”Meaning_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”Meaning2″,

“Direction”:”OutputInput”,

“RecordType”:”Meaning_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”RefinedText”,

“Direction”:”OutputInput”,

“RecordType”:”Text_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

}

],

“SubAIMs”:[

 

],

“Topology”:[

 

],

“Implementations”:[

 

],

“Documentation”:[

{

“Type”:”Tutorial”,

“URI”:”https://mpai.community/standards/mpai-mmc/”

}

]

}

}

2.5        PersonalStatusExtraction

{

“Identifier”:{

“ImplementerID”:”/* String assigned by IIDRA */”,

“Specification”:{

“Name”:”MPAI-MMC”,

“AIW”:”MMC-CAS”,

“AIM”:”PersonalStatusExtraction”,

“Version”:”2″

},

“Description”:”This AIM extracts the combined Personal Status from Text, Speech, Face, and Gesture.”,

“Types”:[

{

“Name”:”Speech_t”,

“Type”:”{uint16[]}”

},

{

“Name”:”BodyDescriptors_t”,

“Type”:”uint8[]”

},

{

“Name”:”FaceDescriptors_t”,

“Type”:”uint8[]”

},

{

“Name”:”Tagging_t”,

“Type”:”{string<256 set; string<256 result}”

},

{

“Name”:”Meaning_t”,

“Type”:”{Tagging_t POS_tagging; Tagging_t NE_tagging; Tagging_t dependency_tagging; Tagging_t SRL_tagging}”

},

{

“Name”:”PersonalStatus_t”,

“Type”:”uint8[]”

}

],

“Ports”:[

{

“Name”:”InputSpeech”,

“Direction”:”InputOutput”,

“RecordType”:”Speech_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”BodyDescriptors1″,

“Direction”:”InputOutput”,

“RecordType”:”BodyDescriptors_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”FaceDescriptors”,

“Direction”:”InputOutput”,

“RecordType”:”FaceDescriptors_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”Meaning1″,

“Direction”:”InputOutput”,

“RecordType”:”Meaning_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”PersonalStatus”,

“Direction”:”OutputInput”,

“RecordType”:”PersonalStatus_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

}

],

“SubAIMs”:[

 

],

“Topology”:[

 

],

“Implementations”:[

 

],

“Documentation”:[

{

“Type”:”Tutorial”,

“URI”:”https://mpai.community/standards/mpai-mmc/”

}

]

}

}

2.6        DialogueProcessing

{

“Identifier”:{

“ImplementerID”:”/* String assigned by IIDRA */”,

“Specification”:{

“Name”:”MPAI-MMC”,

“AIW”:”MMC-CAS”,

“AIM”:”DialogueProcessing”,

“Version”:”1″

},

“Description”:”This AIM produces the Machine’s Text and Personal Status from the human’s Text and Personal Status.”,

“Types”:[

{

“Name”:”PersonalStatus_t”,

“Type”:”uint8[]”

},

{

“Name”:”Tagging_t”,

“Type”:”{string<256 set; string<256 result}”

},

{

“Name”:”Meaning_t”,

“Type”:”{Tagging_t POS_tagging; Tagging_t NE_tagging; Tagging_t dependency_tagging; Tagging_t SRL_tagging}”

},

{

“Name”:”Text_t”,

“Type”:”{uint8[] | uint16[]}”

}

],

“Ports”:[

{

“Name”:”PersonalStatus”,

“Direction”:”InputOutput”,

“RecordType”:”PersonalStatus_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”Meaning2″,

“Direction”:”InputOutput”,

“RecordType”:”Meaning_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”Text(LanguageUnderstanding)”,

“Direction”:”InputOutput”,

“RecordType”:”Text_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”MachinePersonalStatus”,

“Direction”:”OutputInput”,

“RecordType”:”PersonalStatus_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”MachineText”,

“Direction”:”OutputInput”,

“RecordType”:”Text_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

}

],

“SubAIMs”:[

 

],

“Topology”:[

 

],

“Implementations”:[

 

],

“Documentation”:[

{

“Type”:”Tutorial”,

“URI”:”https://mpai.community/standards/mpai-mmc/”

}

]

}

}

2.7        ScenePresentation

{

“Identifier”:{

“ImplementerID”:”/* String assigned by IIDRA */”,

“Specification”:{

“Name”:”MPAI-MMC”,

“AIW”:”MMC-CAS”,

“AIM”:”ScenePresentation”,

“Version”:”2″

},

“Description”:”This AIM renders the Visual Scene Descriptors produced by the Visual Scene Description AIM.”,

“Types”:[

{

“Name”:”PointOfView_t”,

“Type”:”{float32[6]}”

},

{

“Name”:”VisualSceneDescriptors_t”,

“Type”:”uint8[]”

},

{

“Name”:”3DGraphics_t”,

“Type”:”uint8[]”

}

],

“Ports”:[

{

“Name”:”PointOfView”,

“Direction”:”InputOutput”,

“RecordType”:”PointOfView_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”VisualSceneDescriptors”,

“Direction”:”InputOutput”,

“RecordType”:”VisualSceneDescriptors_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”RenderedScene”,

“Direction”:”OutputInput”,

“RecordType”:”3DGraphics_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

}

],

“SubAIMs”:[

 

],

“Topology”:[

 

],

“Implementations”:[

 

],

“Documentation”:[

{

“Type”:”Tutorial”,

“URI”:”https://mpai.community/standards/mpai-mmc/”

}

]

}

}

2.8        PersonalStatusDisplay

{

“Identifier”:{

“ImplementerID”:”/* String assigned by IIDRA */”,

“Specification”:{

“Name”:”MPAI-MMC”,

“AIW”:”MMC-CAS”,

“AIM”:”PersonalStatusDisplay”,

“Version”:”2″

},

“Description”:”This AIM renders a speaking avatar from text and Personal Status.”,

“Types”:[

{

“Name”:”PersonalStatus_t”,

“Type”:”uint8[]”

},

{

“Name”:”Text_t”,

“Type”:”{uint8[] | uint16[]}”

},

{

“Name”:”Speech_t”,

“Type”:”uint16[]”

},

{

“Name”:”3DGraphics_t”,

“Type”:”uint8[]”

}

],

“Ports”:[

{

“Name”:”MachinePersonalStatus”,

“Direction”:”InputOutput”,

“RecordType”:”PersonalStatus_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”MachineText”,

“Direction”:”InputOutput”,

“RecordType”:”Text_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”MachineAvatar”,

“Direction”:”OutputInput”,

“RecordType”:”3DGraphics_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”MachineSpeech”,

“Direction”:”OutputInput”,

“RecordType”:”Speech_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

}

],

“SubAIMs”:[

 

],

“Topology”:[

 

],

“Implementations”:[

 

],

“Documentation”:[

{

“Type”:”Tutorial”,

“URI”:”https://mpai.community/standards/mpai-mmc/”

}

]

}

}
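The following informative Python sketch indicates the minimal shape an implementation of the Personal Status Display AIM described above might take, with one method consuming the two input ports and returning the two output ports. The class and method names are illustrative; the normative APIs are those of MPAI-AIF.

# Informative sketch: minimal skeleton of a Personal Status Display implementation.
from dataclasses import dataclass

@dataclass
class PersonalStatusDisplayOutput:
    machine_avatar: bytes   # 3DGraphics_t
    machine_speech: bytes   # Speech_t

class PersonalStatusDisplay:
    def process(self, machine_text: str, machine_personal_status: bytes) -> PersonalStatusDisplayOutput:
        # A real implementation would synthesise speech from machine_text and
        # animate an avatar conveying machine_personal_status.
        raise NotImplementedError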

 

 

  • AIW and AIM Metadata of CAV-HCI

1.        AIW metadata for HCI

{

“$schema”:”https://json-schema.org/draft/2020-12/schema”,

“$id”:”https://mpai.community/standards/resources/MPAI-AIF/V2/AIW-AIM-metadata.schema.json”,

“title”:”HCI AIF V2 AIW/AIM metadata”,

“Identifier”:{

“ImplementerID”:”/* String assigned by IIDRA */”,

“Specification”:{

“Standard”:”MPAI-CAV”,

“AIW”:”CAV-HCI”,

“AIM”:”CAV-HCI”,

“Version”:”1″

}

},

“APIProfile”:”Secure”,

“Description”:”This AIF enables a human to converse with a CAV.”,

“Types”:[

{

“Name”: “Audio_t”,

“Type”: “uint16[]”

},

{

“Name”:”ArrayAudio_t”,

“Type”:”Audio_t[]”

},

{

“Name”:”VideoOutdoor_t”,

“Type”:”uint32[]”

},

{

“Name”:”LiDAR_t”,

“Type”:”uint24[]”

},

{

“Name”:”RADAR_t”,

“Type”:”uint24[]”

},

{

“Name”:”VideoIndoor_t”,

“Type”:”uint32[]”

},

{

“Name”:”Text_t”,

“Type”:”{uint8[] | uint16[]}”

},

{

“Name”:”3DGraphics_t”,

“Type”:”{uint8[]}”

},

{

“Name”: “Speech_t”,

“Type”: “uint16[]”

}

],

“Ports”:[

{

“Name”:”AudioIndoor”,

“Direction”:”InputOutput”,

“RecordType”:”Audio_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”AudioOutdoor”,

“Direction”:”InputOutput”,

“RecordType”:”ArrayAudio_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”VideoOutdoor”,

“Direction”:”InputOutput”,

“RecordType”:”VideoOutdoor_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”LiDARIndoor”,

“Direction”:”InputOutput”,

“RecordType”:”LiDAR_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”LiDAROutdoor”,

“Direction”:”InputOutput”,

“RecordType”:”LiDAR_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”VideoIndoor”,

“Direction”:”InputOutput”,

“RecordType”:”VideoIndoor_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”MachineText”,

“Direction”:”OutputInput”,

“RecordType”:”Text_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”MachineAvatar”,

“Direction”:”OutputInput”,

“RecordType”:”3DGraphics_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”MachineSpeech”,

“Direction”:”OutputInput”,

“RecordType”:”Speech_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

}

],

“SubAIMs”:[

{

“Name”:”AudioSceneDescription”,

“Identifier”:{

“ImplementerID”:”/* String assigned by IIDRA */”,

“Specification”:{

“Standard”:”MPAI-MMC”,

“AIW”:”MMC-HCI”,

“AIM”:”AudioSceneDescription”,

“Version”:”2″

}

}

},

{

“Name”:”VisualSceneDescription”,

“Identifier”:{

“ImplementerID”:”/* String assigned by IIDRA */”,

“Specification”:{

“Standard”:”MPAI-MMC”,

“AIW”:”MMC-HCI”,

“AIM”:”VisualSceneDescription”,

“Version”:”2″

}

}

},


{

“Name”:”SpatialObjectIdentification”,

“Identifier”:{

“ImplementerID”:”/* String assigned by IIDRA */”,

“Specification”:{

“Standard”:”MPAI-MMC”,

“AIW”:”MMC-HCI”,

“AIM”:”SpatialObjectIdentification”,

“Version”:”2″

}

}

},

{

“Name”:”LanguageUnderstanding”,

“Identifier”:{

“ImplementerID”:”/* String assigned by IIDRA */”,

“Specification”:{

“Standard”:”MPAI-MMC”,

“AIW”:”MMC-HCI”,

“AIM”:”LanguageUnderstanding”,

“Version”:”2″

}

}

},

{

“Name”:”SpeechRecognition”,

“Identifier”:{

“ImplementerID”:”/* String assigned by IIDRA */”,

“Specification”:{

“Standard”:”MPAI-MMC”,

“AIW”:”MMC-HCI”,

“AIM”:”SpeechRecognition”,

“Version”:”2″

}

}

},

{

“Name”:”SpeakerRecognition”,

“Identifier”:{

“ImplementerID”:”/* String assigned by IIDRA */”,

“Specification”:{

“Standard”:”MPAI-MMC”,

“AIW”:”MMC-HCI”,

“AIM”:”SpeakerRecognition”,

“Version”:”2″

}

}

},

{

“Name”:”PersonalStatusExtraction”,

“Identifier”:{

“ImplementerID”:”/* String assigned by IIDRA */”,

“Specification”:{

“Standard”:”MPAI-MMC”,

“AIW”:”MMC-HCI”,

“AIM”:”PersonalStatusExtraction”,

“Version”:”2″

}

}

},

{

“Name”:”FaceRecognition”,

“Identifier”:{

“ImplementerID”:”/* String assigned by IIDRA */”,

“Specification”:{

“Standard”:”MPAI-MMC”,

“AIW”:”MMC-HCI”,

“AIM”:”FaceRecognition”,

“Version”:”2″

}

}

},

{

“Name”:”DialogueProcessing”,

“Identifier”:{

“ImplementerID”:”/* String assigned by IIDRA */”,

“Specification”:{

“Standard”:”MPAI-MMC”,

“AIW”:”MMC-HCI”,

“AIM”:”DialogueProcessing”,

“Version”:”2″

}

}

},

{

“Name”:”PersonalStatusDisplay”,

“Identifier”:{

“ImplementerID”:”/* String assigned by IIDRA */”,

“Specification”:{

“Standard”:”MPAI-MMC”,

“AIW”:”MMC-HCI”,

“AIM”:”PersonalStatusDisplay”,

“Version”:”2″

}

}

}

],

“Topology”:[

{

“Output”:{

“AIMName”:””,

“PortName”:”AudioIndoor”

},

“Input”:{

“AIMName”:”AudioSceneDescription”,

“PortName”:”AudioIndoor”

}

},

{

“Output”:{

“AIMName”:”EnvironmentSensingSubsystem”,

“PortName”:”AudioOutdoor”

},

“Input”:{

“AIMName”:”AudioSceneDescription”,

“PortName”:”AudioOutdoor”

}

},

{

“Output”:{

“AIMName”:”EnvironmentSensingSubsystem”,

“PortName”:”VideoOutdoor”

},

“Input”:{

“AIMName”:”VisualSceneDescription”,

“PortName”:”VideoOutdoor”

}

},

{

“Output”:{

“AIMName”:””,

“PortName”:”LiDARIndoor”

},

“Input”:{

“AIMName”:”VisualSceneDescription”,

“PortName”:”LiDARIndoor”

}

},

{

“Output”:{

“AIMName”:””,

“PortName”:”RADARIndoor”

},

“Input”:{

“AIMName”:”VisualSceneDescription”,

“PortName”:”RADARIndoor”

}

},

{

“Output”:{

“AIMName”:””,

“PortName”:”VideoIndoor”

},

“Input”:{

“AIMName”:”VisualSceneDescription”,

“PortName”:”VideoIndoor”

}

},

{

“Output”:{

“AIMName”:”AudioSceneDescription”,

“PortName”:”InputSpeech2″

},

“Input”:{

“AIMName”:”SpeechRecognition”,

“PortName”:”InputSpeech2″

}

},

{

“Output”:{

“AIMName”:”VisualSceneDescription”,

“PortName”:”BodyDescriptors1″

},

“Input”:{

“AIMName”:”SpatialObjectIdentification”,

“PortName”:”BodyDescriptors1″

}

},

{

“Output”:{

“AIMName”:”VisualSceneDescription”,

“PortName”:”SceneGeometry”

},

“Input”:{

“AIMName”:”SpatialObjectIdentification”,

“PortName”:”SceneGeometry”

}

},

{

“Output”:{

“AIMName”:”VisualSceneDescription”,

“PortName”:”PhysicalObjectID”

},

“Input”:{

“AIMName”:”SpatialObjectIdentification”,

“PortName”:”PhysicalObjectID”

}

},

{

“Output”:{

“AIMName”:”SpeechRecognition”,

“PortName”:”RecognisedText”

},

“Input”:{

“AIMName”:”LanguageUnderstanding”,

“PortName”:”RecognisedText”

}

},

{

“Output”:{

“AIMName”:”SpatialObjectIdentification”,

“PortName”:”PhysicalObjectID”

},

“Input”:{

“AIMName”:”LanguageUnderstanding”,

“PortName”:”PhysicalObjectID”

}

},

{

“Output”:{

“AIMName”:”AudioSceneDescription”,

“PortName”:”SpeechObject”

},

“Input”:{

“AIMName”:”SpeakerRecognition”,

“PortName”:”SpeechObject”

}

},

{

“Output”:{

“AIMName”:”AudioSceneDescription”,

“PortName”:”InputSpeech”

},

“Input”:{

“AIMName”:”PersonalStatusExtraction”,

“PortName”:”InputSpeech”

}

},

{

“Output”:{

“AIMName”:”LanguageUnderstanding”,

“PortName”:”Meaning1″

},

“Input”:{

“AIMName”:”PersonalStatusExtraction”,

“PortName”:”Meaning1″

}

},

{

“Output”:{

“AIMName”:”VisualSceneDescription”,

“PortName”:”BodyDescriptors2″

},

“Input”:{

“AIMName”:”PersonalStatusExtraction”,

“PortName”:”BodyDescriptors2″

}

},

{

“Output”:{

“AIMName”:”VisualSceneDescription”,

“PortName”:”FaceDescriptors”

},

“Input”:{

“AIMName”:”PersonalStatusExtraction”,

“PortName”:”FaceDescriptors”

}

},

{

“Output”:{

“AIMName”:”VisualSceneDescription”,

“PortName”:”FaceObject”

},

“Input”:{

“AIMName”:”FaceRecognition”,

“PortName”:”FaceObject”

}

},

{

“Output”:{

“AIMName”:”SpeakerRecognition”,

“PortName”:”SpeakerID”

},

“Input”:{

“AIMName”:”DialogueProcessing”,

“PortName”:”SpeakerID”

}

},

{

“Output”:{

“AIMName”:”LanguageUnderstanding”,

“PortName”:”Meaning2″

},

“Input”:{

“AIMName”:”DialogueProcessing”,

“PortName”:”Meaning2″

}

},

{

“Output”:{

“AIMName”:”LanguageUnderstanding”,

“PortName”:”RefinedText”

},

“Input”:{

“AIMName”:”DialogueProcessing”,

“PortName”:”RefinedText”

}

},

{

“Output”:{

“AIMName”:”PersonalStatusExtraction”,

“PortName”:”PersonalStatus”

},

“Input”:{

“AIMName”:”DialogueProcessing”,

“PortName”:”PersonalStatus”

}

},

{

“Output”:{

“AIMName”:”FaceRecognition”,

“PortName”:”FaceID”

},

“Input”:{

“AIMName”:”DialogueProcessing”,

“PortName”:”FaceID”

}

},

{

“Output”:{

“AIMName”:”DialogueProcessing”,

“PortName”:”MachineText”

},

“Input”:{

“AIMName”:””,

“PortName”:”MachineText”

}

},

{

“Output”:{

“AIMName”:”DialogueProcessing”,

“PortName”:”MachinePersonalStatus”

},

“Input”:{

“AIMName”:”PersonalStatusDisplay”,

“PortName”:”MachinePersonalStatus”

}

},

{

“Output”:{

“AIMName”:”DialogueProcessing”,

“PortName”:”MachineText”

},

“Input”:{

“AIMName”:”PersonalStatusDisplay”,

“PortName”:”MachineText”

}

},

{

“Output”:{

“AIMName”:”PersonalStatusDisplay”,

“PortName”:”MachineAvatar”

},

“Input”:{

“AIMName”:””,

“PortName”:”MachineAvatar”

}

},

{

“Output”:{

“AIMName”:”PersonalStatusDisplay”,

“PortName”:”MachineSpeech”

},

“Input”:{

“AIMName”:””,

“PortName”:”MachineSpeech”

}

}

],

“Implementations”:[

{

“BinaryName”:”cas.exe”,

“Architecture”:”x64″,

“OperatingSystem”:”Windows”,

“Version”:”v0.1″,

“Source”:”MPAIStore”,

“Destination”:””

}

],

“ResourcePolicies”:[

{

“Name”:”Memory”,

“Minimum”:”50000″,

“Maximum”:”100000″,

“Request”:”75000″

},

{

“Name”:”CPUNumber”,

“Minimum”:”1″,

“Maximum”:”2″,

“Request”:”1″

},

{

“Name”:”CPU:Class”,

“Minimum”:”Low”,

“Maximum”:”High”,

“Request”:”Medium”

},

{

“Name”:”GPU:CUDA:FrameBuffer”,

“Minimum”:”11GB_GDDR5X”,

“Maximum”:”8GB_GDDR6X”,

“Request”:”11GB_GDDR6″

},

{

“Name”:”GPU:CUDA:MemorySpeed”,

“Minimum”:”1.60GHz”,

“Maximum”:”1.77GHz”,

“Request”:”1.71GHz”

},

{

“Name”:”GPU:CUDA:Class”,

“Minimum”:”SM61″,

“Maximum”:”SM86″,

“Request”:”SM75″

},

{

“Name”:”GPU:Number”,

“Minimum”:”1″,

“Maximum”:”1″,

“Request”:”1″

}

],

“Documentation”:[

{

“Type”:”tutorial”,

“URI”:”https://mpai.community/standards/mpai-mmc/”

}

]

}
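Note (informative): in the Topology array above, each entry couples the Output port of one AIM to the Input port of another; an empty AIMName denotes a port of the AIW itself, and both ends of an entry carry the same PortName. The Python sketch below shows one possible consistency check over such metadata. It is not part of this Technical Specification: the file name hci_aiw.json, the function names and the treatment of EnvironmentSensingSubsystem as an external source are illustrative assumptions.

import json

def check_topology(metadata, external_sources=()):
    """Check the Topology of an AIW metadata record for dangling references."""
    problems = []
    # AIM Names declared in SubAIMs; "" denotes a port of the AIW boundary itself,
    # and external_sources lists any other producers (e.g. another Subsystem) to accept.
    declared = {""} | set(external_sources)
    declared |= {sub.get("Name") for sub in metadata.get("SubAIMs", []) if isinstance(sub, dict)}
    for index, connection in enumerate(metadata.get("Topology", [])):
        if not isinstance(connection, dict):
            continue
        for end in ("Output", "Input"):
            aim = connection.get(end, {}).get("AIMName", "")
            if aim not in declared:
                problems.append(f"connection {index}: {end} references undeclared AIM '{aim}'")
        # In the metadata above, both ends of a connection carry the same PortName.
        if connection.get("Output", {}).get("PortName") != connection.get("Input", {}).get("PortName"):
            problems.append(f"connection {index}: PortName differs between Output and Input")
    return problems

if __name__ == "__main__":
    with open("hci_aiw.json", encoding="utf-8") as file:   # illustrative file name
        issues = check_topology(json.load(file), external_sources=("EnvironmentSensingSubsystem",))
    print("\n".join(issues) if issues else "Topology is consistent with the declared SubAIMs.")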

2.        Metadata for HCI AIMs

2.1        Audio Scene Description

{

“Identifier”:{

“ImplementerID”:”/* String assigned by IIDRA */”,

“Specification”:{

“Name”:”CAV”,

“AIW”:”HCI”,

“AIM”:”AudioSceneDescription”,

“Version”:”2″

},

“Description”:”This AIM implements the audio scene description function for CAV-HCI.”,

“Types”:[

{

“Name”: “Audio_t”,

“Type”: “uint16[]”

},

{

“Name”: “ArrayAudio_t”,

“Type”: “Audio_t[]”

},

{

"Name":"Speech_t",

“Type”:”uint16[]”

}

],

“Ports”:[

{

“Name”:”AudioIndoor”,

“Direction”:”InputOutput”,

“RecordType”:”ArrayAudio_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”AudioOutdoor”,

“Direction”:”InputOutput”,

“RecordType”:”ArrayAudio_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”SpeechObject”,

“Direction”:”OutputInput”,

“RecordType”:”Speech_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”InputSpeech1″,

“Direction”:”OutputInput”,

“RecordType”:”Speech_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”InputSpeech2″,

“Direction”:”OutputInput”,

“RecordType”:”Speech_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

}

],

“SubAIMs”:[

 

],

“Topology”:[

 

],

“Implementations”:[

 

],

“Documentation”:[

{

“Type”:”Tutorial”,

“URI”:”https://mpai.community/standards/mpai-mmc/”

}

]

}
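Note (informative): every entry of the Ports array carries the same six fields (Name, Direction, RecordType, Technology, Protocol, IsRemote); as the examples above suggest, Direction is InputOutput for ports that receive data and OutputInput for ports that emit it. Purely as an illustration, such an entry can be mirrored by a small record type; the Python sketch below is one possible mapping, not a normative API.

from dataclasses import dataclass

@dataclass
class Port:
    """One entry of the Ports array of an AIM metadata record."""
    name: str          # e.g. "AudioIndoor"
    direction: str     # "InputOutput" (receiving side) or "OutputInput" (emitting side)
    record_type: str   # a Type name declared in the Types array, e.g. "ArrayAudio_t"
    technology: str    # "Software" in the metadata above
    protocol: str      # empty when no specific protocol is required
    is_remote: bool

def port_from_metadata(entry: dict) -> Port:
    """Map one JSON object of the Ports array onto the record above."""
    return Port(
        name=entry["Name"],
        direction=entry["Direction"],
        record_type=entry["RecordType"],
        technology=entry["Technology"],
        protocol=entry["Protocol"],
        is_remote=entry["IsRemote"],
    )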

2.2        Visual Scene Description

{

“Identifier”:{

“ImplementerID”:”/* String assigned by IIDRA */”,

“Specification”:{

“Name”:”MPAI-MMC”,

“AIW”:”MMC-HCI”,

“AIM”:”VisualSceneDescription”,

“Version”:”2″

},

"Description":"This AIM implements the visual scene description function for MMC-HCI.",

“Types”:[

{

“Name”:”Video_t”,

“Type”:”uint32[]”

},

{

“Name”:”LiDAR_t”,

“Type”:”uint24[]”

},

{

“Name”:”RADAR_t”,

“Type”:”uint24[]”

},

{

“Name”:”BodyDescriptors_t”,

“Type”:”uint8[]”

},

{

“Name”:”VisualSceneGeometry_t”,

“Type”:”uint8[]”

},

{

“Name”:”PhysicalObject_t”,

“Type”:”uint8[]”

},

{

“Name”:”FaceDescriptors_t”,

“Type”:”uint8[]”

},

{

“Name”:”FaceObject_t”,

“Type”:”uint32[]”

}

],

“Ports”:[

{

“Name”:”VideoOutdoor”,

“Direction”:”InputOutput”,

“RecordType”:”Video_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”LiDARIndoor”,

“Direction”:”InputOutput”,

“RecordType”:”LiDAR_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”RADARIndoor”,

“Direction”:”InputOutput”,

“RecordType”:”RADAR_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”VideoIndoor”,

“Direction”:”InputOutput”,

“RecordType”:”Video_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”BodyDescriptors”,

“Direction”:”OutputInput”,

“RecordType”:”BodyDescriptors_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”VisualSceneGeometry”,

“Direction”:”OutputInput”,

“RecordType”:”VisualSceneGeometry_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”PhysicalObject”,

“Direction”:”OutputInput”,

"RecordType":"PhysicalObject_t",

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”FaceDescriptors”,

“Direction”:”OutputInput”,

“RecordType”:”FaceDescriptors_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”FaceObject”,

“Direction”:”OutputInput”,

"RecordType":"FaceObject_t",

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

}

],

“SubAIMs”:[

 

],

“Topology”:[

 

],

“Implementations”:[

 

],

“Documentation”:[

{

“Type”:”Tutorial”,

“URI”:”https://mpai.community/standards/mpai-cav/”

}

]

}

}

2.3        SpeechRecognition

{

“Identifier”:{

“ImplementerID”:”/* String assigned by IIDRA */”,

“Specification”:{

“Name”:”CAV”,

“AIW”:”HCI”,

“AIM”:”SpeechRecognition”,

“Version”:”1″

},

"Description":"This AIM implements the speech recognition function for CAV-HCI: it converts the user's speech to text.",

“Types”:[

{

“Name”:”Speech_t”,

“Type”:”uint16[]”

},

{

“Name”:”Text_t”,

“Type”:”{uint8[] | uint16[]}”

}

],

“Ports”:[

{

“Name”:”InputSpeech”,

“Direction”:”InputOutput”,

“RecordType”:”Speech_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”RecognisedText”,

“Direction”:”OutputInput”,

“RecordType”:”Text_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

}

],

“SubAIMs”:[

 

],

“Topology”:[

 

],

“Implementations”:[

 

],

“Documentation”:[

{

“Type”:”Tutorial”,

“URI”:”https://mpai.community/standards/mpai-cav/”

}

]

}

}

2.4        SpatialObjectIdentification

{

“Identifier”:{

“ImplementerID”:”/* String assigned by IIDRA */”,

“Specification”:{

“Name”:”MMC”,

“AIW”:”CAS”,

“AIM”:”SpatialObjectIdentification”,

“Version”:”1″

},

“Description”:”This AIM identifies the Physical Object indicated by a human’s finger.”,

“Types”:[

{

“Name”:”BodyDescriptors_t”,

“Type”:”uint16[]”

},

{

“Name”:”VisualSceneGeometry_t”,

“Type”:”{uint8[]}”

},

{

“Name”:”PhysicalObject_t”,

“Type”:”{uint8[]}”

},

{

“Name”:”PhysicalObjectID_t”,

“Type”:”{string objectImageLabel; float32 confidenceLevel}”

}

],

“Ports”:[

{

“Name”:”BodyDescriptors”,

“Direction”:”InputOutput”,

“RecordType”:”BodyDescriptors_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”SceneGeometry”,

“Direction”:”InputOutput”,

“RecordType”:”VisualSceneGeometry_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”PhysicalObjects”,

“Direction”:”InputOutput”,

“RecordType”:”PhysicalObject_t[]”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”PhysicalObjectID”,

“Direction”:”OutputInput”,

"RecordType":"PhysicalObjectID_t",

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

}

],

"SubAIMs":[

 

],

“Topology”:[

 

],

“Implementations”:[

 

],

“Documentation”:[

{

“Type”:”Tutorial”,

“URI”:”https://mpai.community/standards/mpai-cav/”

}

]

}

}
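Note (informative): PhysicalObjectID_t is the record that carries the identification result, i.e. a label for the indicated object and a confidence level. The following is an illustrative mirror of that record in Python; the example values are invented for illustration only.

from dataclasses import dataclass

@dataclass
class PhysicalObjectID:
    """Mirror of PhysicalObjectID_t: {string objectImageLabel; float32 confidenceLevel}."""
    object_image_label: str
    confidence_level: float

# Illustrative value only: the object indicated by the human's finger is identified as a bottle.
example = PhysicalObjectID(object_image_label="bottle", confidence_level=0.87)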

2.5        LanguageUnderstanding

{

“Identifier”:{

“ImplementerID”:”/* String assigned by IIDRA */”,

“Specification”:{

“Name”:”MPAI-MMC”,

“AIW”:”MMC-HCI”,

“AIM”:”LanguageUnderstanding”,

“Version”:”2″

},

“Description”:”This AIM extracts Meaning from Recognised Text supplemented by the ID of the Physical Object and improves Recognised Text supplemented by the ID of the Physical Object.”,

“Types”:[

{

“Name”:”PhysicalObject_t”,

“Type”:”uint8[]”

},

{

“Name”:”Text_t”,

“Type”:”{uint8[] | uint16[]}”

},

{

“Name”:”Tagging_t”,

“Type”:”{string<256 set; string<256 result}”

},

{

“Name”:”Meaning_t”,

“Type”:”{Tagging_t POS_tagging; Tagging_t NE_tagging; Tagging_t dependency_tagging; Tagging_t SRL_tagging}”

}

],

“Ports”:[

{

“Name”:”RecognisedText”,

“Direction”:”InputOutput”,

“RecordType”:”Text_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”PhysicalObjectID”,

“Direction”:”InputOutput”,

“RecordType”:”PhysicalObjectID_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”Meaning”,

“Direction”:”OutputInput”,

“RecordType”:”Meaning_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”RefinedText”,

“Direction”:”OutputInput”,

“RecordType”:”Text_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

}

],

“SubAIMs”:[

 

],

“Topology”:[

 

],

“Implementations”:[

 

],

“Documentation”:[

{

“Type”:”Tutorial”,

“URI”:”https://mpai.community/standards/mpai-mmc/”

}

]

}

}
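Note (informative): Meaning_t groups four Tagging_t records (part-of-speech, named-entity, dependency and semantic-role tagging), each holding the tag set used and the tagging result. The Python sketch below simply mirrors these two record types; the tag sets (UPOS, UD, PropBank) and the example sentence are illustrative assumptions.

from dataclasses import dataclass

@dataclass
class Tagging:
    """Tagging_t: the tag set used ('set') and the result produced with it (each < 256 characters)."""
    tag_set: str
    result: str

@dataclass
class Meaning:
    """Meaning_t: the four tagging records produced by Language Understanding."""
    pos_tagging: Tagging
    ne_tagging: Tagging
    dependency_tagging: Tagging
    srl_tagging: Tagging

# Illustrative instance for the sentence "take the bottle".
example = Meaning(
    pos_tagging=Tagging(tag_set="UPOS", result="take/VERB the/DET bottle/NOUN"),
    ne_tagging=Tagging(tag_set="none", result=""),
    dependency_tagging=Tagging(tag_set="UD", result="root(take) det(bottle,the) obj(take,bottle)"),
    srl_tagging=Tagging(tag_set="PropBank", result="take.01: ARG1=the bottle"),
)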

2.6        SpeakerRecognition

{

“Identifier”:{

“ImplementerID”:”/* String assigned by IIDRA */”,

“Specification”:{

“Name”:”MPAI-MMC”,

“AIW”:”MMC-HCI”,

“AIM”:”SpeakerRecognition”,

“Version”:”2″

},

“Description”:”This AIM implements the speaker recognition function for CAV-HCI: it identifies a speaker based on their speech.”,

“Types”:[

{

“Name”:”Speech_t”,

“Type”:”uint16[]”

},

{

“Name”:”SpeakerID_t”,

“Type”:”{uint8[]}”

}

],

“Ports”:[

{

“Name”:”SpeechObject”,

“Direction”:”InputOutput”,

“RecordType”:”Speech_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”SpeakerID”,

“Direction”:”OutputInput”,

“RecordType”:”SpeakerID_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

}

],

“SubAIMs”:[

 

],

“Topology”:[

 

],

“Implementations”:[

 

],

“Documentation”:[

{

“Type”:”Tutorial”,

“URI”:”https://mpai.community/standards/mpai-cav/”

}

]

}

}

2.7        PersonalStatusExtraction

{

“Identifier”:{

“ImplementerID”:”/* String assigned by IIDRA */”,

“Specification”:{

“Name”:”MPAI-MMC”,

“AIW”:”MMC-HCI”,

“AIM”:”PersonalStatusExtraction”,

“Version”:”2″

},

“Description”:”This AIM extracts the combined Personal Status from Text, Speech, Face, and Gesture.”,

“Types”:[

{

“Name”:”Speech_t”,

“Type”:”{uint16[]}”

},

{

“Name”:”Tagging_t”,

“Type”:”{string<256 set; string<256 result}”

},

{

“Name”:”Meaning_t”,

“Type”:”{Tagging_t POS_tagging; Tagging_t NE_tagging; Tagging_t dependency_tagging; Tagging_t SRL_tagging}”

},

{

“Name”:”FaceDescriptors_t”,

“Type”:”uint8[]”

},

{

“Name”:”BodyDescriptors_t”,

“Type”:”uint8[]”

},

{

“Name”:”PersonalStatus_t”,

“Type”:”uint8[]”

}

],

“Ports”:[

{

“Name”:”InputSpeech1″,

“Direction”:”InputOutput”,

“RecordType”:”Speech_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”Meaning1″,

“Direction”:”OutputInput”,

“RecordType”:”Meaning_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

"Name":"FaceDescriptors",

“Direction”:”InputOutput”,

“RecordType”:”FaceDescriptors_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”BodyDescriptors”,

“Direction”:”InputOutput”,

“RecordType”:”BodyDescriptors_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”PersonalStatus”,

“Direction”:”OutputInput”,

“RecordType”:”PersonalStatus_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

}

],

“SubAIMs”:[

 

],

“Topology”:[

 

],

“Implementations”:[

 

],

“Documentation”:[

{

“Type”:”Tutorial”,

“URI”:”https://mpai.community/standards/mpai-cav/”

}

]

}

}

2.8        FaceRecognition

{

“Identifier”:{

“ImplementerID”:”/* String assigned by IIDRA */”,

“Specification”:{

“Name”:”MPAI-MMC”,

“AIW”:”MMC-HCI”,

“AIM”:”FaceRecognition”,

“Version”:”2″

},

“Description”:”This AIM implements the human recognition function for CAV-HCI: it identifies a human based on their face.”,

“Types”:[

{

“Name”:”Face_t”,

“Type”:”uint32[]”

},

{

“Name”:”FaceID_t”,

“Type”:”{uint8[]}”

}

],

“Ports”:[

{

“Name”:”FaceObject”,

“Direction”:”InputOutput”,

“RecordType”:”Face_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”FaceID”,

“Direction”:”OutputInput”,

“RecordType”:”FaceID_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

}

],

“SubAIMs”:[

 

],

“Topology”:[

 

],

“Implementations”:[

 

],

“Documentation”:[

{

“Type”:”Tutorial”,

“URI”:”https://mpai.community/standards/mpai-cav/”

}

]

}

}

2.9        DialogueProcessing

{

“Identifier”:{

“ImplementerID”:”/* String assigned by IIDRA */”,

“Specification”:{

“Name”:”MPAI-MMC”,

“AIW”:”MMC-HCI”,

“AIM”:”DialogueProcessing”,

“Version”:”1″

},

“Description”:”This AIM produces the Machine’s Text and Personal Status from the human’s Text and Personal Status.”,

“Types”:[

{

“Name”:”Text_t”,

"Type":"{uint8[] | uint16[]}"

},

{

“Name”:”PersonalStatus_t”,

“Type”:”uint8[]”

}

],

“Ports”:[

{

“Name”:”SpeakerID”,

“Direction”:”InputOutput”,

“RecordType”:”SpeakerID_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”Meaning2″,

“Direction”:”OutputInput”,

“RecordType”:”Meaning_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”RefinedText”,

“Direction”:”OutputInput”,

“RecordType”:”Text_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”PersonalStatus”,

“Direction”:”InputOutput”,

“RecordType”:”PersonalStatus_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”FaceID”,

“Direction”:”InputOutput”,

“RecordType”:”FaceID_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”MachineText”,

“Direction”:”OutputInput”,

“RecordType”:”Text_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”MachinePersonalStatus”,

“Direction”:”OutputInput”,

“RecordType”:”PersonalStatus_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

}

],

“SubAIMs”:[

 

],

“Topology”:[

 

],

“Implementations”:[

 

],

“Documentation”:[

{

“Type”:”Tutorial”,

“URI”:”https://mpai.community/standards/mpai-mmc/”

}

]

}

}

2.10        PersonalStatusDisplay

{

“Identifier”:{

“ImplementerID”:”/* String assigned by IIDRA */”,

“Specification”:{

“Name”:”MPAI-MMC”,

“AIW”:”MMC-HCI”,

“AIM”:”PersonalStatusDisplay”,

“Version”:”2″

},

“Description”:”This AIM renders a speaking avatar from text and Personal Status.”,

“Types”:[

{

“Name”:”PersonalStatus_t”,

“Type”:”uint8[]”

},

{

“Name”:”Text_t”,

“Type”:”{uint8[] | uint16[]}”

},

{

“Name”:”3DGraphics_t”,

“Type”:”uint8[]”

},

{

“Name”:”Speech_t”,

“Type”:”uint16[]”

}

],

“Ports”:[

{

“Name”:”MachinePersonalStatus”,

“Direction”:”InputOutput”,

“RecordType”:”PersonalStatus_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”MachineText”,

“Direction”:”InputOutput”,

“RecordType”:”Text_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”MachineSpeech”,

“Direction”:”OutputInput”,

“RecordType”:”Speech_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”MachineAvatar”,

“Direction”:”OutputInput”,

“RecordType”:”3DGraphics_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

}

],

“SubAIMs”:[

 

],

“Topology”:[

 

],

“Implementations”:[

 

],

“Documentation”:[

{

“Type”:”Tutorial”,

“URI”:”https://mpai.community/standards/mpai-cav/”

}

]

}

}
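Note (informative): the Personal Status Display AIM consumes the Machine's Text and Personal Status and produces Machine Speech and a Machine Avatar. The hypothetical Python signature below only restates that port mapping; it is not an API defined by this Technical Specification, and an implementation is free to expose the AIM in any other way.

from typing import Tuple

def personal_status_display(machine_text: str,
                            machine_personal_status: bytes) -> Tuple[bytes, bytes]:
    """Hypothetical wrapper mirroring the ports of the PersonalStatusDisplay AIM.

    Inputs:  MachineText (Text_t), MachinePersonalStatus (PersonalStatus_t)
    Outputs: MachineSpeech (Speech_t), MachineAvatar (3DGraphics_t)
    """
    raise NotImplementedError("Implementation-specific: synthesise speech and animate the avatar.")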

 

 

  • AIW and AIM Metadata of ARA-VSV

1          Metadata for VSV AIW

{

“$schema”:”https://json-schema.org/draft/2020-12/schema”,

“$id”:”https://mpai.community/standards/resources/MPAI-AIF/V2/AIW-AIM-metadata.schema.json”,

“title”:”VSV AIF V2 AIW/AIM metadata”,

“Identifier”:{

“ImplementerID”:”/* String assigned by IIDRA */”,

“Specification”:{

“Standard”:”MPAI-MMC”,

“AIW”:”MMC-VSV”,

"AIM":"MMC-VSV",

“Version”:”2″

}

},

“APIProfile”:”Secure”,

"Description":"This AIF is used to produce the visual and vocal appearance of the Virtual Secretary and the Summary of the Avatar-Based Videoconference.",

“Types”:[

{

“Name”:”Text_t”,

“Type”:”{uint8[] | uint16[]}”

},

{

“Name”:”Speech_t”,

“Type”:”uint16[]”

},

{

“Name”:”AvatarDescriptors_t”,

“Type”:”uint8[]”

},

{

“Name”:”Summary_t”,

“Type”:”uint8[]”

},

{

“Name”:”AvatarModel_t”,

“Type”:”uint8[]”

}

],

“Ports”:[

{

“Name”:”InputText1″,

“Direction”:”InputOutput”,

“RecordType”:”Text_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”InputSpeech1″,

“Direction”:”InputOutput”,

“RecordType”:”Speech_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”InputText2″,

“Direction”:”InputOutput”,

“RecordType”:”Text_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”InputSpeech2″,

“Direction”:”InputOutput”,

“RecordType”:”Speech_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”AvatarDescriptors”,

“Direction”:”InputOutput”,

“RecordType”:”AvatarDescriptors_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”Summary”,

“Direction”:”OutputInput”,

“RecordType”:”Summary_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”AvatarModel”,

“Direction”:”OutputInput”,

“RecordType”:”AvatarModel_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”VSText”,

“Direction”:”OutputInput”,

"RecordType":"Text_t",

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”VSSpeech”,

“Direction”:”OutputInput”,

“RecordType”:”Speech_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”VSAvatarDescriptors”,

“Direction”:”OutputInput”,

“RecordType”:”AvatarDescriptors_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

}

],

“SubAIMs”:[

{

“Name”:”SpeechRecognition”,

“Identifier”:{

“ImplementerID”:”/* String assigned by IIDRA */”,

“Specification”:{

“Standard”:”MPAI-MMC”,

“AIW”:”MMC-VSV”,

“AIM”:”SpeechRecognition”,

“Version”:”1″

}

}

},

{

“Name”:”AvatarDescriptorsParsing”,

“Identifier”:{

“ImplementerID”:”/* String assigned by IIDRA */”,

“Specification”:{

“Standard”:”MPAI-MMC “,

“AIW”:”MMC-VSV”,

“AIM”:”AvatarDescriptorsParsing”,

“Version”:”2″

}

}

},

{

"Name":"LanguageUnderstanding",

“Identifier”:{

“ImplementerID”:”/* String assigned by IIDRA */”,

“Specification”:{

“Standard”:”MPAI-MMC”,

“AIW”:”MMC-VSV”,

“AIM”:”LanguageUnderstanding”,

“Version”:”2″

}

}

},

{

“Name”:”PersonalStatusExtraction”,

“Identifier”:{

“ImplementerID”:”/* String assigned by IIDRA */”,

“Specification”:{

“Standard”:”MPAI-MMC “,

“AIW”:”MMC-VSV”,

“AIM”:”PersonalStatusExtraction”,

“Version”:”2″

}

}

},

{

“Name”:”Summarisation”,

“Identifier”:{

“ImplementerID”:”/* String assigned by IIDRA */”,

“Specification”:{

“Standard”:”MPAI-MMC”,

“AIW”:”MMC-VSV”,

“AIM”:”Summarisation”,

“Version”:”2″

}

}

},

{

“Name”:”PersonalStatusDisplay”,

“Identifier”:{

“ImplementerID”:”/* String assigned by IIDRA */”,

“Specification”:{

“Standard”:”MPAI-MMC”,

“AIW”:”MMC-VSV”,

“AIM”:”PersonalStatusDisplay”,

“Version”:”2″

}

}

}

},

{

"Name":"DialogueProcessing",

"Identifier":{

"ImplementerID":"/* String assigned by IIDRA */",

"Specification":{

"Standard":"MPAI-MMC",

"AIW":"MMC-VSV",

"AIM":"DialogueProcessing",

"Version":"1"

}

}

}

],

“Topology”:[

{

“Output”:{

“AIMName”:””,

“PortName”:”InputSpeech1″

},

“Input”:{

“AIMName”:”SpeechRecognition”,

“PortName”:”InputSpeech1″

}

},

{

“Output”:{

“AIMName”:””,

“PortName”:”InputAvatarDescriptors”

},

“Input”:{

“AIMName”:”AvatarDescriptorsParsing”,

“PortName”:”InputAvatarDescriptors”

}

},

{

“Output”:{

“AIMName”:”SpeechRecognition”,

“PortName”:”RecognisedText”

},

“Input”:{

“AIMName”:”LanguageUnderstanding”,

“PortName”:”RecognisedText”

}

},

{

“Output”:{

“AIMName”:”LanguageUnderstanding”,

“PortName”:”Meaning2″

},

“Input”:{

“AIMName”:”PersonalStatusExtraction”,

“PortName”:”Meaning2″

}

},

{

“Output”:{

“AIMName”:””,

“PortName”:”InputSpeech2″

},

“Input”:{

“AIMName”:”PersonalStatusExtraction”,

“PortName”:”InputSpeech2″

}

},

{

“Output”:{

“AIMName”:”AvatarDescriptorsParsing”,

“PortName”:”BodyDescriptors”

},

“Input”:{

“AIMName”:”PersonalStatusExtraction”,

“PortName”:”BodyDescriptors”

}

},

{

“Output”:{

"AIMName":"AvatarDescriptorsParsing",

“PortName”:”FaceDescriptors”

},

“Input”:{

“AIMName”:”PersonalStatusExtraction”,

“PortName”:”FaceDescriptors”

}

},

{

"Output":{

“AIMName”:”LanguageUnderstanding”,

“PortName”:”Meaning2″

},

“Input”:{

“AIMName”:”Summarisation”,

“PortName”:”Meaning2″

}

},

{

"Output":{

"AIMName":"LanguageUnderstanding",

“PortName”:”RefinedText2″

},

“Input”:{

“AIMName”:”Summarisation”,

“PortName”:”RefinedText2″

}

},

{

"Output":{

"AIMName":"PersonalStatusExtraction",

“PortName”:”InputPersonalStatus2″

},

“Input”:{

“AIMName”:”Summarisation”,

“PortName”:”InputPersonalStatus2″

}

},

{

“Output”:{

“AIMName”:”LanguageUnderstanding”,

“PortName”:”RefinedText1″

},

“Input”:{

“AIMName”:”DialogueProcessing”,

“PortName”:”RefinedText1″

}

},

{

“Output”:{

“AIMName”:””,

“PortName”:”InputText1″

},

“Input”:{

“AIMName”:”DialogueProcessing”,

“PortName”:”InputText1″

}

},


{

“Output”:{

"AIMName":"LanguageUnderstanding",

“PortName”:”Meaning1″

},

“Input”:{

“AIMName”:”DialogueProcessing”,

“PortName”:”Meaning1″

}

},

{

“Output”:{

“AIMName”:”PersonalStatusExtraction”,

“PortName”:”InputPersonalStatus1″

},

“Input”:{

“AIMName”:”DialogueProcessing”,

“PortName”:”InputPersonalStatus1″

}

},

{

“Output”:{

“AIMName”:”DialogueProcessing”,

“PortName”:”EditedSummary”

},

“Input”:{

“AIMName”:”Summarisation”,

“PortName”:”EditedSummary”

}

},

{

“Output”:{

“AIMName”:”Summarisation”,

“PortName”:”Summary1″

},

“Input”:{

“AIMName”:”DialogueProcessing”,

“PortName”:”Summary1″

}

},

{

“Output”:{

“AIMName”:”DialogueProcessing”,

“PortName”:”Summary2″

},

“Input”:{

“AIMName”:””,

“PortName”:”Summary2″

}

},

{

“Output”:{

“AIMName”:”PersonalStatusDisplay”,

“PortName”:”VSAvatarModel”

},

“Input”:{

“AIMName”:””,

“PortName”:”VSAvatarModel”

}

},

{

“Output”:{

“AIMName”:”PersonalStatusDisplay”,

“PortName”:”VSText”

},

“Input”:{

“AIMName”:””,

“PortName”:”VSText”

}

},

{

“Output”:{

“AIMName”:”PersonalStatusDisplay”,

“PortName”:”VSSpeech”

},

“Input”:{

“AIMName”:””,

“PortName”:”VSSpeech”

}

},

{

“Output”:{

“AIMName”:”PersonalStatusDisplay”,

“PortName”:”VSAvatarDescriptors”

},

“Input”:{

“AIMName”:””,

“PortName”:”VSAvatarDescriptors”

}

}

],

“Implementations”:[

{

“BinaryName”:”vsv.exe”,

“Architecture”:”x64″,

“OperatingSystem”:”Windows”,

“Version”:”v0.1″,

“Source”:”MPAIStore”,

“Destination”:””

}

],

“ResourcePolicies”:[

{

“Name”:”Memory”,

“Minimum”:”50000″,

“Maximum”:”100000″,

“Request”:”75000″

},

{

“Name”:”CPUNumber”,

“Minimum”:”1″,

“Maximum”:”2″,

“Request”:”1″

},

{

“Name”:”CPU:Class”,

“Minimum”:”Low”,

“Maximum”:”High”,

“Request”:”Medium”

},

{

“Name”:”GPU:CUDA:FrameBuffer”,

“Minimum”:”11GB_GDDR5X”,

“Maximum”:”8GB_GDDR6X”,

“Request”:”11GB_GDDR6″

},

{

“Name”:”GPU:CUDA:MemorySpeed”,

“Minimum”:”1.60GHz”,

“Maximum”:”1.77GHz”,

“Request”:”1.71GHz”

},

{

“Name”:”GPU:CUDA:Class”,

“Minimum”:”SM61″,

“Maximum”:”SM86″,

“Request”:”SM75″

},

{

“Name”:”GPU:Number”,

“Minimum”:”1″,

“Maximum”:”1″,

“Request”:”1″

}

],

“Documentation”:[

{

“Type”:”tutorial”,

“URI”:”https://mpai.community/standards/mpai-mmc/”

}

]

}
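Note (informative): each ResourcePolicies entry bounds one resource with a Minimum, a Maximum and a Request value. The Python sketch below shows one way an AIF implementation might screen the numeric policies (Memory, CPUNumber, GPU:Number) against locally available resources; the function name and the available dictionary are illustrative assumptions, and non-numeric policies such as CPU:Class are deliberately skipped.

def satisfies_numeric_policies(policies, available):
    """Check the numeric ResourcePolicies of an AIW against locally available resources.

    'policies' is the ResourcePolicies array of the metadata; 'available' maps a policy
    Name to the amount the local AIF can offer.  Policies whose bounds are not numeric
    (e.g. CPU:Class, GPU:CUDA:Class) are skipped and would need their own ordering.
    """
    for policy in policies:
        name = policy.get("Name")
        if name not in available:
            continue
        try:
            minimum = float(policy["Minimum"])
        except (KeyError, ValueError):
            continue                       # non-numeric bound: skip it here
        if available[name] < minimum:
            return False                   # the local platform cannot satisfy the Minimum
    return True

# Illustrative use with two of the values listed above.
policies = [
    {"Name": "Memory", "Minimum": "50000", "Maximum": "100000", "Request": "75000"},
    {"Name": "CPUNumber", "Minimum": "1", "Maximum": "2", "Request": "1"},
]
print(satisfies_numeric_policies(policies, {"Memory": 80000, "CPUNumber": 2}))  # True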

2.        AIM metadata for ARA-VSV

2.1        SpeechRecognition

{

“Identifier”:{

“ImplementerID”:”/* String assigned by IIDRA */”,

“Specification”:{

“Name”:”MPAI-MMC”,

“AIW”:”MMC-VSV”,

“AIM”:”SpeechRecognition”,

“Version”:”1″

},

“Description”:”This AIM implements the speech recognition function for ARA-VSV: it converts the user’s speech to text.”,

“Types”:[

{

“Name”:”Speech_t”,

“Type”:”uint16[]”

},

{

“Name”:”Text_t”,

“Type”:”{uint8[] | uint16[]}”

}

],

“Ports”:[

{

“Name”:”InputSpeech1″,

“Direction”:”InputOutput”,

“RecordType”:”Speech_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”RecognisedText”,

“Direction”:”OutputInput”,

“RecordType”:”Text_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

}

],

“SubAIMs”:[

 

],

“Topology”:[

 

],

“Implementations”:[

 

],

“Documentation”:[

{

“Type”:”Tutorial”,

“URI”:”https://mpai.community/standards/mpai-mmc/”

}

]

}

}

2.2        AvatarDescriptorParsing

{

“Identifier”:{

“ImplementerID”:”/* String assigned by IIDRA */”,

“Specification”:{

“Name”:”MPAI-MMC”,

“AIW”:”MMC-VSV”,

“AIM”:”AvatarDescriptorParsing”,

“Version”:”2″

},

"Description":"This AIM implements the avatar descriptor parsing function for ARA-VSV: it parses the Avatar Descriptors into Body Descriptors and Face Descriptors.",

“Types”:[

{

“Name”:”AvatarDescriptors_t”,

“Type”:”uint8[]”

},

{

“Name”:”BodyDescriptors_t”,

“Type”:”{uint8[]}”

},

{

“Name”:”FaceDescriptors_t”,

“Type”:”{uint8[]}”

}

],

“Ports”:[

{

“Name”:”InputAvatarDescriptors”,

“Direction”:”InputOutput”,

“RecordType”:”AvatarDescriptors_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”BodyDescriptors”,

“Direction”:”OutputInput”,

“RecordType”:”BodyDescriptors_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”FaceDescriptors”,

“Direction”:”OutputInput”,

“RecordType”:”FaceDescriptors_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

}

],

“SubAIMs”:[

 

],

“Topology”:[

 

],

“Implementations”:[

 

],

“Documentation”:[

{

“Type”:”Tutorial”,

“URI”:”https://mpai.community/standards/mpai-ara/”

}

]

}

}

2.3        LanguageUnderstanding

{

“Identifier”:{

“ImplementerID”:”/* String assigned by IIDRA */”,

“Specification”:{

“Name”:”MPAI-MMC”,

“AIW”:”MMC-VSV”,

“AIM”:”LanguageUnderstanding”,

“Version”:”1″

},

"Description":"This AIM extracts Meaning from Recognised Text and improves Recognised Text.",

“Types”:[

{

“Name”:”Text_t”,

“Type”:”{uint8[] | uint16[]}”

},

{

“Name”:”Tagging_t”,

“Type”:”{string<256 set; string<256 result}”

},

{

“Name”:”Meaning_t”,

“Type”:”{Tagging_t POS_tagging; Tagging_t NE_tagging; Tagging_t dependency_tagging; Tagging_t SRL_tagging}”

}

],

“Ports”:[

{

“Name”:”RecognisedText”,

“Direction”:”InputOutput”,

“RecordType”:”Text_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

"Name":"InputText",

“Direction”:”InputOutput”,

“RecordType”:”Text_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”RefinedText1″,

“Direction”:”OutputInput”,

“RecordType”:”Text_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”Meaning1″,

“Direction”:”OutputInput”,

“RecordType”:”Meaning_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

}

],

“SubAIMs”:[

 

],

“Topology”:[

 

],

“Implementations”:[

 

],

“Documentation”:[

{

“Type”:”Tutorial”,

“URI”:”https://mpai.community/standards/mpai-mmc/”

}

]

}

}

2.4        PersonalStatusExtraction

{

“Identifier”:{

“ImplementerID”:”/* String assigned by IIDRA */”,

“Specification”:{

“Name”:”MPAI-MMC”,

“AIW”:”MMC-VSV”,

“AIM”:”PersonalStatusExtraction”,

“Version”:”2″

},

“Description”:”This AIM extracts the combined Personal Status from Text, Speech, Face, and Gesture.”,

“Types”:[

{

“Name”:”Speech_t”,

“Type”:”{uint16[]}”

},

{

“Name”:”BodyDescriptors_t”,

“Type”:”uint8[]”

},

{

“Name”:”FaceDescriptors_t”,

“Type”:”uint8[]”

},

{

“Name”:”Tagging_t”,

“Type”:”{string<256 set; string<256 result}”

},

{

“Name”:”Meaning_t”,

“Type”:”{Tagging_t POS_tagging; Tagging_t NE_tagging; Tagging_t dependency_tagging; Tagging_t SRL_tagging}”

},

{

“Name”:”PersonalStatus_t”,

“Type”:”uint8[]”

}

],

“Ports”:[

{

“Name”:”Meaning2″,

“Direction”:”InputOutput”,

“RecordType”:”Meaning_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”InputSpeech”,

“Direction”:”InputOutput”,

“RecordType”:”Speech_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”BodyDescriptors”,

“Direction”:”InputOutput”,

“RecordType”:”BodyDescriptors_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”FaceDescriptors”,

“Direction”:”InputOutput”,

“RecordType”:”FaceDescriptors_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”InputPersonalStatus1″,

“Direction”:”OutputInput”,

“RecordType”:”PersonalStatus_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”Meaning”,

“Direction”:”OutputInput”,

“RecordType”:”Meaning_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”RefinedText2″,

“Direction”:”OutputInput”,

“RecordType”:”Meaning_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”InputPersonalStatus2″,

“Direction”:”OutputInput”,

“RecordType”:”PersonalStatus_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

}

],

“SubAIMs”:[

 

],

“Topology”:[

 

],

“Implementations”:[

 

],

“Documentation”:[

{

“Type”:”Tutorial”,

“URI”:”https://mpai.community/standards/mpai-mmc/”

}

]

}

}

2.5        Summarisation

{

“Identifier”:{

“ImplementerID”:”/* String assigned by IIDRA */”,

“Specification”:{

“Name”:”ARA”,

“AIW”:”VSV”,

“AIM”:”Summarisation”,

“Version”:”2″

},

“Description”:”This AIM produces the Summary of the Videoconference.”,

“Types”:[

{

“Name”:”Meaning_t”,

"Type":"{uint8[]}"

},

{

“Name”:”Text_t”,

"Type":"{uint8[] | uint16[]}"

},

{

“Name”:”PersonalStatus_t”,

“Type”:”uint16[]”

},

{

“Name”:”Summary_t”,

“Type”:”uint8[]”

}

],

“Ports”:[

{

“Name”:”Meaning”,

“Direction”:”InputOutput”,

“RecordType”:”Meaning_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”TextLanguageUnderstanding”,

“Direction”:”InputOutput”,

“RecordType”:”Text_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”PersonalStatus”,

“Direction”:”InputOutput”,

“RecordType”:”PersonalStatus_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”EditedSummary”,

“Direction”:”InputOutput”,

“RecordType”:”Summary_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”Summary”,

“Direction”:”OutputInput”,

“RecordType”:”Summary_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

}

],

“SubAIMs”:[

 

],

“Topology”:[

 

],

“Implementations”:[

 

],

“Documentation”:[

{

“Type”:”Tutorial”,

“URI”:”https://mpai.community/standards/mpai-mmc/”

}

]

}

}

2.6        DialogueProcessing

{

“Identifier”:{

“ImplementerID”:”/* String assigned by IIDRA */”,

“Specification”:{

"Name":"MPAI-MMC",

"AIW":"MMC-VSV",

“AIM”:”DialogueProcessing”,

“Version”:”1″

},

“Description”:”This AIM produces the Machine’s Text and Personal Status from the human’s Text and Personal Status.”,

“Types”:[

{

“Name”:”Text_t”,

"Type":"{uint8[] | uint16[]}"

},

{

“Name”:”Meaning_t”,

"Type":"{uint8[]}"

},

{

“Name”:”PersonalStatus_t”,

“Type”:”uint16[]”

},

{

“Name”:”Summary_t”,

"Type":"{uint8[]}"

}

],

“Ports”:[

{

“Name”:”Text”,

“Direction”:”InputOutput”,

“RecordType”:”Text_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”TextLanguageUnderstanding”,

“Direction”:”InputOutput”,

“RecordType”:”Text_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”Meaning”,

“Direction”:”InputOutput”,

“RecordType”:”Meaning_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”PersonalStatus”,

“Direction”:”InputOutput”,

“RecordType”:”PersonalStatus_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”EditedSummary”,

“Direction”:”OutputInput”,

“RecordType”:”Summary_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

"Name":"Summary1",

“Direction”:”InputOutput”,

“RecordType”:”Summary_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

"Name":"Summary2",

“Direction”:”OutputInput”,

“RecordType”:”Summary_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”VSPersonalStatus”,

“Direction”:”OutputInput”,

“RecordType”:”PersonalStatus_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”VSText”,

“Direction”:”OutputInput”,

“RecordType”:”Text_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

}

],

“SubAIMs”:[

 

],

“Topology”:[

 

],

“Implementations”:[

 

],

“Documentation”:[

{

“Type”:”Tutorial”,

“URI”:”https://mpai.community/standards/mpai-mmc/”

}

]

}

}

2.7        PersonalStatusDisplay

{

“Identifier”:{

“ImplementerID”:”/* String assigned by IIDRA */”,

“Specification”:{

“Name”:”ARA”,

“AIW”:”VSV”,

“AIM”:”PersonalStatusDisplay”,

“Version”:”2″

},

“Description”:”This AIM outputs the Avatar Model and renders a speaking avatar from text and Personal Status.”,

“Types”:[

{

“Name”:”AvatarModel_t”,

“Type”:”{uint8[]}”

},

{

“Name”:”Text_t”,

“Type”:”{uint8[] | uint16[]}”

},

{

“Name”:”Speech_t”,

“Type”:”uint16[]”

},

{

“Name”:”3DGraphics_t”,

“Type”:”uint8[]”

}

],

“Ports”:[

{

“Name”:”VSPersonalStatus”,

“Direction”:”InputOutput”,

“RecordType”:”PersonalStatus_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”VSText”,

“Direction”:”InputOutput”,

“RecordType”:”Text_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”AvatarModel”,

“Direction”:”OutputInput”,

“RecordType”:”3DGraphics_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”VSText”,

“Direction”:”OutputInput”,

“RecordType”:”Text_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”VSSpeech”,

“Direction”:”OutputInput”,

“RecordType”:”Speech_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”AvatarDescriptors”,

“Direction”:”OutputInput”,

"RecordType":"AvatarDescriptors_t",

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

}

],

“SubAIMs”:[

 

],

“Topology”:[

 

],

“Implementations”:[

 

],

“Documentation”:[

{

“Type”:”Tutorial”,

“URI”:”https://mpai.community/standards/mpai-mmc/”

}

]

}

}

 

  • AIW and AIM Metadata of MMC-UST

1          AIW metadata for UST

{

“$schema”:”https://json-schema.org/draft/2020-12/schema”,

“$id”:”https://mpai.community/standards/resources/MPAI-AIF/V1/AIW-AIM-metadata.schema.json”,

“title”:”UST AIF v1 AIW/AIM metadata”,

“Identifier”:{

“ImplementerID”:”/* String assigned by IIDRA */”,

“Specification”:{

“Standard”:”MPAI-MMC”,

“AIW”:”MMC-UST”,

“AIM”:”MMC-UST”,

“Version”:”1″

}

},

“APIProfile”:”Main”,

"Description":"This AIF is used to call the AIW of UST.",

“Types”:[

{

“Name”:”Speech_t”,

“Type”:”uint16[]”

},

{

“Name”:”InputSelection_t”,

“Type”:”Speech_t | Text_t”

},

{

“Name”:”Text_t”,

“Type”:”{uint8[] | uint16[]}”

},

{

“Name”:”Language_t”,

“Type”:”{uint8[]}”

}

],

“Ports”:[

{

“Name”:”InputSelection”,

“Direction”:”InputOutput”,

“RecordType”:”InputSelection_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”RequestedLanguage”,

“Direction”:”InputOutput”,

“RecordType”:”uint8[5] Language_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”InputText”,

“Direction”:”InputOutput”,

“RecordType”:”Text_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”InputSpeech1″,

“Direction”:”InputOutput”,

“RecordType”:”Speech_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”InputSpeech2″,

“Direction”:”InputOutput”,

“RecordType”:”Speech_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”TranslatedText”,

“Direction”:”OutputInput”,

“RecordType”:”Text_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”TranslatedSpeech”,

“Direction”:”OutputInput”,

“RecordType”:”Speech_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

}

],

“SubAIMs”:[

{

"Name":"SpeechRecognition",

“Identifier”:{

“ImplementerID”:”/* String assigned by IIDRA */”,

“Specification”:{

“Standard”:”MPAI-MMC”,

“AIW”:”MMC-UST”,

"AIM":"SpeechRecognition",

“Version”:”1″

}

}

},

{

“Name”:”Translation”,

“Identifier”:{

“ImplementerID”:”/* String assigned by IIDRA */”,

“Specification”:{

“Standard”:”MPAI-MMC”,

“AIW”:”MMC-UST”,

“AIM”:”Translation”,

“Version”:”1″

}

}

},

{

“Name”:”SpeechFeatureExtraction”,

“Identifier”:{

“ImplementerID”:”/* String assigned by IIDRA */”,

“Specification”:{

“Standard”:”MPAI-MMC”,

“AIW”:”MMC-UST”,

“AIM”:”SpeechFeatureExtraction”,

“Version”:”1″

}

}

},

{

“Name”:”SpeechSynthesis”,

“Identifier”:{

“ImplementerID”:”/* String assigned by IIDRA */”,

“Specification”:{

“Standard”:”MPAI-MMC”,

“AIW”:”MMC-UST”,

“AIM”:”SpeechSynthesis”,

“Version”:”1″

}

}

}

],

“Topology”:[

{

“Output”:{

“AIMName”:””,

“PortName”:”RequestedLanguage”

},

“Input”:{

“AIMName”:”Translation”,

“PortName”:”RequestedLanguage”

}

},

{

“Output”:{

“AIMName”:””,

“PortName”:”InputText”

},

“Input”:{

“AIMName”:”Translation”,

"PortName":"InputText"

}

},

{

“Output”:{

“AIMName”:””,

“PortName”:”InputSpeech1″

},

“Input”:{

“AIMName”:”SpeechRecognition”,

“PortName”:”InputSpeech1″

}

},

{

“Output”:{

“AIMName”:””,

“PortName”:”InputSpeech2″

},

“Input”:{

“AIMName”:”SpeechFeatureExtraction”,

“PortName”:”InputSpeech2″

}

},

{

“Output”:{

“AIMName”:”SpeechSynthesis”,

“PortName”:”TranslatedSpeech”

},

“Input”:{

“AIMName”:””,

“PortName”:”TranslatedSpeech”

}

},

{

“Output”:{

“AIMName”:”SpeechFeatureExtraction”,

“PortName”:”SpeechFeatures”

},

“Input”:{

“AIMName”:”SpeechSynthesis”,

“PortName”:”SpeechFeatures”

}

},

{

“Output”:{

“AIMName”:”SpeechRecognition”,

“PortName”:”RecognizedText”

},

“Input”:{

“AIMName”:”Translation”,

“PortName”:”RecognizedText”

}

},

{

“Output”:{

“AIMName”:”Translation”,

“PortName”:”TranslatedText”

},

“Input”:{

“AIMName”:”SpeechSynthesis”,

“PortName”:”TranslatedText”

}

},

{

“Output”:{

“AIMName”:”Translation”,

“PortName”:”TranslatedText”

},

“Input”:{

“AIMName”:””,

“PortName”:”TranslatedText”

}

}

],

“Implementations”:[

{

“BinaryName”:”ust.exe”,

“Architecture”:”x64″,

“OperatingSystem”:”Windows”,

“Version”:”v0.1″,

“Source”:”AIMStorage”,

“Destination”:””

}

],

“ResourcePolicies”:[

{

“Name”:”Memory”,

“Minimum”:”50000″,

“Maximum”:”100000″,

“Request”:”75000″

},

{

“Name”:”CPUNumber”,

“Minimum”:”1″,

“Maximum”:”2″,

“Request”:”1″

},

{

“Name”:”CPU:Class”,

“Minimum”:”Low”,

“Maximum”:”High”,

“Request”:”Medium”

},

{

“Name”:”GPU:CUDA:FrameBuffer”,

“Minimum”:”11GB_GDDR5X”,

“Maximum”:”8GB_GDDR6X”,

“Request”:”11GB_GDDR6″

},

{

“Name”:”GPU:CUDA:MemorySpeed”,

“Minimum”:”1.60GHz”,

“Maximum”:”1.77GHz”,

“Request”:”1.71GHz”

},

{

“Name”:”GPU:CUDA:Class”,

“Minimum”:”SM61″,

“Maximum”:”SM86″,

“Request”:”SM75″

},

{

“Name”:”GPU:Number”,

“Minimum”:”1″,

“Maximum”:”1″,

“Request”:”1″

}

],

“Documentation”:[

{

“Type”:”tutorial”,

“URI”:”https://mpai.community/standards/mpai-mmc/”

}

]

}
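Note (informative): the Topology above chains the four UST AIMs: the input speech is recognised, the recognised (or directly supplied) text is translated into the requested language, and the translated text is synthesised using the speech features extracted from the input speech. The Python sketch below restates that data flow with hypothetical callables standing in for the AIMs; none of the function names are defined by this Technical Specification.

from typing import Callable

def run_ust(input_speech: bytes,
            requested_language: str,
            speech_recognition: Callable[[bytes], str],
            translation: Callable[[str, str], str],
            speech_feature_extraction: Callable[[bytes], dict],
            speech_synthesis: Callable[[str, dict], bytes]) -> tuple:
    """Follow the UST Topology; returns (TranslatedText, TranslatedSpeech)."""
    recognized_text = speech_recognition(input_speech)                   # InputSpeech1 -> RecognizedText
    translated_text = translation(recognized_text, requested_language)   # RecognizedText -> TranslatedText
    speech_features = speech_feature_extraction(input_speech)            # InputSpeech2 -> SpeechFeatures
    translated_speech = speech_synthesis(translated_text, speech_features)  # -> TranslatedSpeech
    return translated_text, translated_speech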

2          AIM metadata

2.1        SpeechRecognition

{

“Identifier”:{

“ImplementerID”:”/* String assigned by IIDRA */”,

“Specification”:{

“Name”:”MMC”,

“AIW”:”UST”,

“AIM”:”SpeechRecognition”,

“Version”:”1″

},

"Description":"This AIM implements the speech recognition function for MMC-UST: it converts the speech of a user utterance to text.",

“Types”:[

{

“Name”:”Speech_t”,

“Type”:”uint16[]”

},

{

“Name”:”Text_t”,

“Type”:”{uint8[] | uint16[]}”

}

],

“Ports”:[

{

“Name”:”InputSpeech1″,

“Direction”:”InputOutput”,

“RecordType”:”Speech_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”RecognizedText”,

“Direction”:”OutputInput”,

“RecordType”:”Text_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

}

],

"SubAIMs":[

 

],

“Topology”:[

 

],

“Implementations”:[

 

],

“Documentation”:[

{

“Type”:”Tutorial”,

“URI”:”https://mpai.community/standards/mpai-mmc/”

}

]

}

}

2.2        Translation

{

“Identifier”:{

“ImplementerID”:”/* String assigned by IIDRA */”,

“Specification”:{

“Name”:”MMC”,

“AIW”:”UST”,

“AIM”:”Translation”,

“Version”:”1″

},

“Description”:”This AIM implements translation function for MMC-UST.”,

“Types”:[

{

“Name”:”InputSelection_t”,

“Type”:”Speech_t | Text_t”

},

{

“Name”:”Text_t”,

“Type”:”{uint8[] | uint16[]}”

},

{

“Name”:”Language_t”,

“Type”:”{uint8[]}”

}

],

“Ports”:[

{

“Name”:”InputSelection”,

“Direction”:”InputOutput”,

“RecordType”:”InputSelection_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”RequestedLanguage”,

“Direction”:”InputOutput”,

“RecordType”:”uint8[5] Language_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”InputText”,

“Direction”:”InputOutput”,

“RecordType”:”Text_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”OutputText”,

“Direction”:”OutputInput”,

“RecordType”:”Text_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”TranslatedText”,

“Direction”:”OutputInput”,

“RecordType”:”Text_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

}

],

"SubAIMs":[

 

],

“Topology”:[

 

],

“Implementations”:[

 

],

“Documentation”:[

{

“Type”:”Tutorial”,

“URI”:”https://mpai.community/standards/mpai-mmc/”

}

]

}

}

2.3        Speech Feature Extraction

{

“Identifier”:{

“ImplementerID”:”/* String assigned by IIDRA */”,

“Specification”:{

“Name”:”MMC”,

“AIW”:”UST”,

"AIM":"SpeechFeatureExtraction",

“Version”:”1″

},

"Description":"This AIM implements the speech feature extraction function for MMC-UST.",

“Types”:[

{

“Name”:”Speech_t”,

“Type”:”uint16[]”

},

{

"Name":"SpeechFeatures_t",

“Type”:”{byte pitch; string<256 tone; string<256 intonation; string<256 intensity; string<256 speed; Emotion_t emotion; float32[] NNspeechFeatures}”

}

],

“Ports”:[

{

“Name”:”InputSpeech2″,

“Direction”:”InputOutput”,

“RecordType”:”Speech_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”SpeechFeatures”,

“Direction”:”OutputInput”,

“RecordType”:”SpeechFeatures_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

}

],

"SubAIMs":[

 

],

“Topology”:[

 

],

“Implementations”:[

 

],

“Documentation”:[

{

“Type”:”Tutorial”,

“URI”:”https://mpai.community/standards/mpai-mmc/”

}

]

}

}
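Note (informative): SpeechFeatures_t collects the prosodic features preserved across translation: pitch, tone, intonation, intensity, speed, the detected Emotion and an optional neural feature vector. The Python sketch below mirrors that record; Emotion is kept as an opaque string here because Emotion_t is specified elsewhere in this document.

from dataclasses import dataclass, field
from typing import List

@dataclass
class SpeechFeatures:
    """Mirror of SpeechFeatures_t as declared in the Types array above."""
    pitch: int                 # byte
    tone: str                  # string < 256 characters
    intonation: str
    intensity: str
    speed: str
    emotion: str               # stands in for Emotion_t, defined elsewhere in this specification
    nn_speech_features: List[float] = field(default_factory=list)  # float32[] NNspeechFeatures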

 

2.4        Speech Synthesis

{

“Identifier”:{

“ImplementerID”:”/* String assigned by IIDRA */”,

“Specification”:{

“Name”:”MMC”,

“AIW”:”UST”,

“AIM”:”SpeechSynthesis”,

“Version”:”1″

},

“Description”:”This AIM implements speech synthesis function for MMC-UST.”,

“Types”:[

{

“Name”:”Speech_t”,

“Type”:”uint16[]”

},

{

"Name":"SpeechFeatures_t",

“Type”:”{byte pitch; string<256 tone; string<256 intonation; string<256 intensity; string<256 speed; Emotion_t emotion; float32[] NNspeechFeatures}”

},

{

“Name”:”Text_t”,

“Type”:”{uint8[] | uint16[]}”

}

],

“Ports”:[

{

“Name”:”TranslatedText”,

“Direction”:”InputOutput”,

“RecordType”:”Text_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”SpeechFeatures”,

“Direction”:”InputOutput”,

“RecordType”:”SpeechFeatures_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”OutputSpeech”,

“Direction”:”OutputInput”,

“RecordType”:”Speech_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

}

],

"SubAIMs":[

 

],

“Topology”:[

 

],

“Implementations”:[

 

],

“Documentation”:[

{

“Type”:”tutorial”,

“URI”:”https://mpai.community/standards/mpai-mmc/”

}

]

}

}

 

 

 

 

  • AIW and AIM Metadata of MMC-BST

1          AIW metadata for BST

{

“$schema”:”https://json-schema.org/draft/2020-12/schema”,

“$id”:”https://mpai.community/standards/resources/MPAI-AIF/V1/AIW-AIM-metadata.schema.json”,

“title”:”BST AIF v1 AIW/AIM metadata”,

“Identifier”:{

“ImplementerID”:”/* String assigned by IIDRA */”,

“Specification”:{

“Standard”:”MPAI-MMC”,

“AIW”:”MMC-BST”,

“AIM”:”MMC-BST”,

“Version”:”1″

}

},

“APIProfile”:”Main”,

"Description":"This AIF is used to call the AIW of BST.",

“Types”:[

{

“Name”:”Speech_t”,

“Type”:”uint16[]”

},

{

“Name”:”InputSelection_t”,

“Type”:”Speech_t | Text_t”

},

{

“Name”:”Text_t”,

“Type”:”{uint8[] | uint16[]}”

},

{

“Name”:”Language_t”,

“Type”:”{uint8[]}”

}

],

“Ports”:[

{

“Name”:”InputSelection”,

“Direction”:”InputOutput”,

“RecordType”:”InputSelection_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”RequestedLanguage”,

“Direction”:”InputOutput”,

“RecordType”:”uint8[5] Language_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”InputText1″,

“Direction”:”InputOutput”,

“RecordType”:”Text_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”InputText2″,

“Direction”:”InputOutput”,

“RecordType”:”Text_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”InputSpeech1″,

“Direction”:”InputOutput”,

“RecordType”:”Speech_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”InputSpeech2″,

“Direction”:”InputOutput”,

“RecordType”:”Speech_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”InputSpeech3″,

“Direction”:”InputOutput”,

“RecordType”:”Speech_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”InputSpeech4″,

“Direction”:”InputOutput”,

“RecordType”:”Speech_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”TranslatedText1″,

“Direction”:”OutputInput”,

“RecordType”:”Text_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”TranslatedText2″,

“Direction”:”OutputInput”,

“RecordType”:”Text_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”TranslatedSpeech1″,

“Direction”:”OutputInput”,

“RecordType”:”Speech_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”TranslatedSpeech2″,

“Direction”:”OutputInput”,

“RecordType”:”Speech_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

}

],

“SubAIMs”:[

{

"Name":"SpeechRecognition",

“Identifier”:{

“ImplementerID”:”/* String assigned by IIDRA */”,

“Specification”:{

“Standard”:”MPAI-MMC”,

“AIW”:”MMC-BST”,

"AIM":"SpeechRecognition",

“Version”:”1″

}

}

},

{

“Name”:”Translation”,

“Identifier”:{

“ImplementerID”:”/* String assigned by IIDRA */”,

“Specification”:{

“Standard”:”MPAI-MMC”,

“AIW”:”MMC-BST”,

“AIM”:”Translation”,

“Version”:”1″

}

}

},

{

“Name”:”SpeechFeatureExtraction”,

“Identifier”:{

“ImplementerID”:”/* String assigned by IIDRA */”,

“Specification”:{

“Standard”:”MPAI-MMC”,

“AIW”:”MMC-BST”,

“AIM”:”SpeechFeatureExtraction”,

“Version”:”1″

}

}

},

{

“Name”:”SpeechSynthesis”,

“Identifier”:{

“ImplementerID”:”/* String assigned by IIDRA */”,

“Specification”:{

“Standard”:”MPAI-MMC”,

“AIW”:”MMC-BST”,

“AIM”:”SpeechSynthesis”,

“Version”:”1″

}

}

}

],

“Topology”:[

{

“Output”:{

“AIMName”:””,

"PortName":"RequestedLanguage"

},

“Input”:{

“AIMName”:”Translation”,

“PortName”:”RequestedLanguage”

}

},

{

“Output”:{

“AIMName”:””,

“PortName”:”InputText1″

},

“Input”:{

“AIMName”:”Translation”,

"PortName":"InputText1"

}

},

{

“Output”:{

“AIMName”:””,

“PortName”:”InputText2″

},

“Input”:{

“AIMName”:”Translation”,

“PortName”:”InputText2″

}

},

{

“Output”:{

“AIMName”:””,

“PortName”:”InputSpeech1″

},

“Input”:{

“AIMName”:”SpeechRecognition”,

“PortName”:”InputSpeech1″

}

},

{

“Output”:{

“AIMName”:””,

“PortName”:”InputSpeech2″

},

“Input”:{

“AIMName”:”SpeechRecognition”,

“PortName”:”InputSpeech2″

}

},

{

“Output”:{

“AIMName”:””,

“PortName”:”InputSpeech3″

},

“Input”:{

“AIMName”:”SpeechFeatureExtraction”,

“PortName”:”InputSpeech3″

}

},

{

“Output”:{

“AIMName”:””,

“PortName”:”InputSpeech4″

},

“Input”:{

“AIMName”:”SpeechFeatureExtraction”,

“PortName”:”InputSpeech4″

}

},

{

“Output”:{

“AIMName”:”SpeechSynthesis”,

“PortName”:”TranslatedSpeech1″

},

“Input”:{

“AIMName”:””,

“PortName”:”TranslatedSpeech1″

}

},

{

“Output”:{

“AIMName”:”SpeechSynthesis”,

“PortName”:”TranslatedSpeech2″

},

“Input”:{

“AIMName”:””,

“PortName”:”TranslatedSpeech2″

}

},

{

“Output”:{

“AIMName”:”SpeechFeatureExtraction”,

“PortName”:”SpeechFeatures1″

},

“Input”:{

“AIMName”:”SpeechSynthesis”,

“PortName”:”SpeechFeatures1″

}

},

{

“Output”:{

“AIMName”:”SpeechFeatureExtraction”,

“PortName”:”SpeechFeatures2″

},

“Input”:{

“AIMName”:”SpeechSynthesis”,

“PortName”:”SpeechFeatures2″

}

},

{

“Output”:{

“AIMName”:”SpeechRecognition”,

“PortName”:”RecognizedText1″

},

“Input”:{

“AIMName”:”Translation”,

“PortName”:”RecognizedText1″

}

},

{

“Output”:{

“AIMName”:”SpeechRecognition”,

“PortName”:”RecognizedText2″

},

“Input”:{

“AIMName”:”Translation”,

“PortName”:”RecognizedText2″

}

},

{

“Output”:{

“AIMName”:”Translation”,

“PortName”:”TranslatedText1″

},

“Input”:{

“AIMName”:””,

“PortName”:”TranslatedText1″

}

},

{

“Output”:{

“AIMName”:”Translation”,

“PortName”:”TranslatedText2″

},

“Input”:{

“AIMName”:””,

“PortName”:”TranslatedText2″

}

},

{

“Output”:{

“AIMName”:”Translation”,

“PortName”:”TranslatedText3″

},

“Input”:{

“AIMName”:”SpeechSynthesis”,

“PortName”:”TranslatedText3″

}

},

{

“Output”:{

“AIMName”:”Translation”,

“PortName”:”TranslatedText4″

},

“Input”:{

“AIMName”:”SpeechSynthesis”,

“PortName”:”TranslatedText4″

}

}

],

“Implementations”:[

{

“BinaryName”:”bst.exe”,

“Architecture”:”x64″,

“OperatingSystem”:”Windows”,

“Version”:”v0.1″,

“Source”:”AIMStorage”,

“Destination”:””

}

],

“ResourcePolicies”:[

{

“Name”:”Memory”,

“Minimum”:”50000″,

“Maximum”:”100000″,

“Request”:”75000″

},

{

“Name”:”CPUNumber”,

“Minimum”:”1″,

“Maximum”:”2″,

“Request”:”1″

},

{

“Name”:”CPU:Class”,

“Minimum”:”Low”,

“Maximum”:”High”,

“Request”:”Medium”

},

{

“Name”:”GPU:CUDA:FrameBuffer”,

“Minimum”:”11GB_GDDR5X”,

“Maximum”:”8GB_GDDR6X”,

“Request”:”11GB_GDDR6″

},

{

“Name”:”GPU:CUDA:MemorySpeed”,

“Minimum”:”1.60GHz”,

“Maximum”:”1.77GHz”,

“Request”:”1.71GHz”

},

{

“Name”:”GPU:CUDA:Class”,

“Minimum”:”SM61″,

“Maximum”:”SM86″,

“Request”:”SM75″

},

{

“Name”:”GPU:Number”,

“Minimum”:”1″,

“Maximum”:”1″,

“Request”:”1″

}

],

“Documentation”:[

{

“Type”:”tutorial”,

“URI”:”https://mpai.community/standards/mpai-mmc/”

}

]

}
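Note (informative): in the bilingual case the same four AIMs serve both translation directions, so the ports of the unidirectional chain are duplicated and distinguished by their numeric suffix. The Python dictionary below gives one plausible grouping of those port names per direction, consistent with the numbering used above; the Topology itself only lists the individual connections, so the pairing is an assumption.

# Port names used by the BST Topology, grouped per translation direction (informative).
BST_DIRECTIONS = {
    1: {
        "input_speech_asr": "InputSpeech1",
        "input_speech_features": "InputSpeech3",
        "recognized_text": "RecognizedText1",
        "translated_text": "TranslatedText1",      # to the AIW output
        "translated_text_tts": "TranslatedText3",  # to Speech Synthesis
        "speech_features": "SpeechFeatures1",
        "translated_speech": "TranslatedSpeech1",
    },
    2: {
        "input_speech_asr": "InputSpeech2",
        "input_speech_features": "InputSpeech4",
        "recognized_text": "RecognizedText2",
        "translated_text": "TranslatedText2",
        "translated_text_tts": "TranslatedText4",
        "speech_features": "SpeechFeatures2",
        "translated_speech": "TranslatedSpeech2",
    },
}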

 

2          AIM metadata

2.1        SpeechRecognition

{

“Identifier”:{

“ImplementerID”:”/* String assigned by IIDRA */”,

“Specification”:{

“Name”:”MMC”,

“AIW”:”BST”,

“AIM”:”SpeechRecognition”,

“Version”:”1″

},

"Description":"This AIM implements the speech recognition function for MMC-BST: it converts the speech of a user utterance to text.",

“Types”:[

{

“Name”:”Speech_t”,

“Type”:”uint16[]”

},

{

“Name”:”Text_t”,

“Type”:”{uint8[] | uint16[]}”

}

],

“Ports”:[

{

“Name”:”InputSpeech1″,

“Direction”:”InputOutput”,

“RecordType”:”Speech_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”InputSpeech2″,

“Direction”:”InputOutput”,

“RecordType”:”Speech_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”RecognizedText1″,

“Direction”:”OutputInput”,

“RecordType”:”Text_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”RecognizedText2″,

“Direction”:”OutputInput”,

“RecordType”:”Text_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

}

],

"SubAIMs":[

 

],

“Topology”:[

 

],

“Implementations”:[

 

],

“Documentation”:[

{

“Type”:”Tutorial”,

“URI”:”https://mpai.community/standards/mpai-mmc/”

}

]

}

}

2.2        Translation

{

“Identifier”:{

“ImplementerID”:”/* String assigned by IIDRA */”,

“Specification”:{

“Name”:”MMC”,

“AIW”:”BST”,

“AIM”:”Translation”,

“Version”:”1″

},

“Description”:”This AIM implements translation function for MMC-BST.”,

“Types”:[

{

“Name”:”InputSelection_t”,

“Type”:”Speech_t | Text_t”

},

{

“Name”:”Text_t”,

“Type”:”{uint8[] | uint16[]}”

},

{

“Name”:”Language_t”,

“Type”:”{uint8[]}”

}

],

“Ports”:[

{

“Name”:”InputSelection”,

“Direction”:”InputOutput”,

“RecordType”:”InputSelection_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”RequestedLanguages”,

“Direction”:”InputOutput”,

“RecordType”:”uint8[5] Language_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”InputText1″,

“Direction”:”InputOutput”,

“RecordType”:”Text_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”InputText2″,

“Direction”:”InputOutput”,

“RecordType”:”Text_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”TranslatedText1″,

“Direction”:”OutputInput”,

“RecordType”:”Text_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”TranslatedText2″,

“Direction”:”OutputInput”,

“RecordType”:”Text_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”TranslatedText3″,

“Direction”:”OutputInput”,

“RecordType”:”Text_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”TranslatedText4″,

“Direction”:”OutputInput”,

“RecordType”:”Text_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

}

],

"SubAIMs":[

 

],

“Topology”:[

 

],

“Implementations”:[

 

],

“Documentation”:[

{

“Type”:”Tutorial”,

“URI”:”https://mpai.community/standards/mpai-mmc/”

}

]

}

}

2.3        Speech Feature Extraction

{

“Identifier”:{

“ImplementerID”:”/* String assigned by IIDRA */”,

“Specification”:{

“Name”:”MMC”,

“AIW”:”BST”,

"AIM":"SpeechFeatureExtraction",

“Version”:”1″

},

"Description":"This AIM implements the speech feature extraction function for MMC-BST.",

“Types”:[

{

“Name”:”Speech_t”,

“Type”:”uint16[]”

},

{

"Name":"SpeechFeatures_t",

“Type”:”{byte pitch; string<256 tone; string<256 intonation; string<256 intensity; string<256 speed; Emotion_t emotion; float32[] NNspeechFeatures}”

}

],

“Ports”:[

{

“Name”:”InputSpeech3″,

“Direction”:”InputOutput”,

“RecordType”:”Speech_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”InputSpeech4″,

“Direction”:”InputOutput”,

“RecordType”:”Speech_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”SpeechFeatures1″,

“Direction”:”OutputInput”,

“RecordType”:”SpeechFeatures_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”SpeechFeatures2″,

“Direction”:”OutputInput”,

“RecordType”:”SpeechFeatures_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

}

],

"SubAIMs":[

 

],

“Topology”:[

 

],

“Implementations”:[

 

],

“Documentation”:[

{

“Type”:”Tutorial”,

“URI”:”https://mpai.community/standards/mpai-mmc/”

}

]

}

}

 

2.4        Speech Synthesis

{

“Identifier”:{

“ImplementerID”:”/* String assigned by IIDRA */”,

“Specification”:{

“Name”:”MMC”,

“AIW”:”BST”,

“AIM”:”SpeechSynthesis”,

“Version”:”1″

},

“Description”:”This AIM implements speech synthesis function for MMC-BST.”,

“Types”:[

{

“Name”:”Speech_t”,

“Type”:”uint16[]”

},

{

"Name":"SpeechFeatures_t",

“Type”:”{byte pitch; string<256 tone; string<256 intonation; string<256 intensity; string<256 speed; Emotion_t emotion; float32[] NNspeechFeatures}”

},

{

“Name”:”Text_t”,

“Type”:”{uint8[] | uint16[]}”

}

],

“Ports”:[

{

“Name”:”TranslatedText3″,

“Direction”:”InputOutput”,

“RecordType”:”Text_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”TranslatedText4″,

“Direction”:”InputOutput”,

“RecordType”:”Text_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”SpeechFeatures1″,

“Direction”:”InputOutput”,

“RecordType”:”SpeechFeatures_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”SpeechFeatures2″,

“Direction”:”InputOutput”,

“RecordType”:”SpeechFeatures_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”TranslatedSpeech1″,

“Direction”:”OutputInput”,

“RecordType”:”Speech_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”TranslatedSpeech2″,

“Direction”:”OutputInput”,

“RecordType”:”Speech_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

}

],

"SubAIMs":[

 

],

“Topology”:[

 

],

“Implementations”:[

 

],

“Documentation”:[

{

“Type”:”tutorial”,

“URI”:”https://mpai.community/standards/mpai-mmc/”

}

]

}

}

 

 

 

 

  • AIW and AIM Metadata of MMC-MST

1.        AIW metadata for MST

{

“$schema”:”https://json-schema.org/draft/2020-12/schema”,

“$id”:”https://mpai.community/standards/resources/MPAI-AIF/V1/AIW-AIM-metadata.schema.json”,

“title”:”MST AIF v1 AIW/AIM metadata”,

“Identifier”:{

“ImplementerID”:”/* String assigned by IIDRA */”,

“Specification”:{

“Standard”:”MPAI-MMC”,

“AIW”:”MMC-MST”,

“AIM”:”MMC-MST”,

“Version”:”1″

}

},

“APIProfile”:”Main”,

“Description”:”This AIF is used to call the AIW of MMC-MST.”,

“Types”:[

{

“Name”:”Speech_t”,

“Type”:”uint16[]”

},

{

“Name”:”InputSelection_t”,

“Type”:”Speech_t | Text_t”

},

{

“Name”:”Text_t”,

“Type”:”{uint8[] | uint16[]}”

},

{

“Name”:”Language_t”,

“Type”:”{uint8[]}”

}

],

“Ports”:[

{

“Name”:”InputSelection”,

“Direction”:”InputOutput”,

“RecordType”:”InputSelection_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”RequestedLanguage”,

“Direction”:”InputOutput”,

“RecordType”:”Language_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”InputText”,

“Direction”:”InputOutput”,

“RecordType”:”Text_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”InputSpeech1″,

“Direction”:”InputOutput”,

“RecordType”:”Speech_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”InputSpeech2″,

“Direction”:”InputOutput”,

“RecordType”:”Speech_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”OutputText1″,

“Direction”:”OutputInput”,

“RecordType”:”Text_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”OutputText2″,

“Direction”:”OutputInput”,

“RecordType”:”Text_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”OutputTextN”,

“Direction”:”OutputInput”,

“RecordType”:”Text_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”InterpretedSpeech1″,

“Direction”:”OutputInput”,

“RecordType”:”Speech_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”InterpretedSpeech2″,

“Direction”:”OutputInput”,

“RecordType”:”Speech_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”InterpretedSpeechN”,

“Direction”:”OutputInput”,

“RecordType”:”Speech_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

}

],

“SubAIMs”:[

{

“Name”:”SpeechRecognition”,

“Identifier”:{

“ImplementerID”:”/* String assigned by IIDRA */”,

“Specification”:{

“Standard”:”MPAI-MMC”,

“AIW”:”MMC-MST”,

“AIM”:”SpeechRecognition”,

“Version”:”1″

}

}

},

{

“Name”:”Translation”,

“Identifier”:{

“ImplementerID”:”/* String assigned by IIDRA */”,

“Specification”:{

“Standard”:”MPAI-MMC”,

“AIW”:”MMC-MST”,

“AIM”:”Translation”,

“Version”:”1″

}

}

},

{

“Name”:”SpeechFeatureExtraction”,

“Identifier”:{

“ImplementerID”:”/* String assigned by IIDRA */”,

“Specification”:{

“Standard”:”MPAI-MMC”,

“AIW”:”MMC-MST”,

“AIM”:”SpeechFeatureExtraction”,

“Version”:”1″

}

}

},

{

“Name”:”SpeechSynthesis”,

“Identifier”:{

“ImplementerID”:”/* String assigned by IIDRA */”,

“Specification”:{

“Standard”:”MPAI-MMC”,

“AIW”:”MMC-MST”,

“AIM”:”SpeechSynthesis”,

“Version”:”1″

}

}

}

],

“Topology”:[

{

“Output”:{

“AIMName”:””,

“PortName”:”RequestedLanguage”

},

“Input”:{

“AIMName”:”Translation”,

“PortName”:”RequestedLanguage”

}

},

{

“Output”:{

“AIMName”:””,

“PortName”:”InputText”

},

“Input”:{

“AIMName”:”Translation”,

“PortName”:”InputText”

}

},

{

“Output”:{

“AIMName”:””,

“PortName”:”InputSpeech1″

},

“Input”:{

“AIMName”:”SpeechRecognition”,

“PortName”:”InputSpeech1″

}

},

{

“Output”:{

“AIMName”:””,

“PortName”:”InputSpeech2″

},

“Input”:{

“AIMName”:”SpeechFeatureExtraction”,

“PortName”:”InputSpeech2″

}

},

{

“Output”:{

“AIMName”:”SpeechSynthesis”,

“PortName”:”InterpretedSpeech1″

},

“Input”:{

“AIMName”:””,

“PortName”:”InterpretedSpeech1″

}

},

{

“Output”:{

“AIMName”:”SpeechSynthesis”,

“PortName”:”InterpretedSpeech2″

},

“Input”:{

“AIMName”:””,

“PortName”:”InterpretedSpeech2″

}

},

{

“Output”:{

“AIMName”:”SpeechSynthesis”,

“PortName”:”InterpretedSpeechN”

},

“Input”:{

“AIMName”:””,

“PortName”:”InterpretedSpeechN”

}

},

{

“Output”:{

“AIMName”:”SpeechFeatureExtraction”,

“PortName”:”SpeechFeatures”

},

“Input”:{

“AIMName”:”SpeechSynthesis”,

“PortName”:”SpeechFeatures”

}

},

{

“Output”:{

“AIMName”:”SpeechRecognition”,

“PortName”:”RecognizedText”

},

“Input”:{

“AIMName”:”Translation”,

“PortName”:”RecognizedText”

}

},

{

“Output”:{

“AIMName”:”Translation”,

“PortName”:”TranslatedText1″

},

“Input”:{

“AIMName”:”SpeechSynthesis”,

“PortName”:”TranslatedText1″

}

},

{

“Output”:{

“AIMName”:”Translation”,

“PortName”:”TranslatedText2″

},

“Input”:{

“AIMName”:”SpeechSynthesis”,

“PortName”:”TranslatedText2″

}

},

{

“Output”:{

“AIMName”:”Translation”,

“PortName”:”TranslatedTextN”

},

“Input”:{

“AIMName”:”SpeechSynthesis”,

“PortName”:”TranslatedTextN”

}

},

{

“Output”:{

“AIMName”:”Translation”,

“PortName”:”OutputText1″

},

“Input”:{

“AIMName”:””,

“PortName”:”OutputText1″

}

},

{

“Output”:{

“AIMName”:”Translation”,

“PortName”:”OutputText2″

},

“Input”:{

“AIMName”:””,

“PortName”:”OutputText2″

}

},

{

“Output”:{

“AIMName”:”Translation”,

“PortName”:”OutputTextN”

},

“Input”:{

“AIMName”:””,

“PortName”:”OutputTextN”

}

}

],

“Implementations”:[

{

“BinaryName”:”mst.exe”,

“Architecture”:”x64″,

“OperatingSystem”:”Windows”,

“Version”:”v0.1″,

“Source”:”AIMStorage”,

“Destination”:””

}

],

“ResourcePolicies”:[

{

“Name”:”Memory”,

“Minimum”:”50000″,

“Maximum”:”100000″,

“Request”:”75000″

},

{

“Name”:”CPUNumber”,

“Minimum”:”1″,

“Maximum”:”2″,

“Request”:”1″

},

{

“Name”:”CPU:Class”,

“Minimum”:”Low”,

“Maximum”:”High”,

“Request”:”Medium”

},

{

“Name”:”GPU:CUDA:FrameBuffer”,

“Minimum”:”11GB_GDDR5X”,

“Maximum”:”8GB_GDDR6X”,

“Request”:”11GB_GDDR6″

},

{

“Name”:”GPU:CUDA:MemorySpeed”,

“Minimum”:”1.60GHz”,

“Maximum”:”1.77GHz”,

“Request”:”1.71GHz”

},

{

“Name”:”GPU:CUDA:Class”,

“Minimum”:”SM61″,

“Maximum”:”SM86″,

“Request”:”SM75″

},

{

“Name”:”GPU:Number”,

“Minimum”:”1″,

“Maximum”:”1″,

“Request”:”1″

}

],

“Documentation”:[

{

“Type”:”Tutorial”,

“URI”:”https://mpai.community/standards/mpai-mmc/”

}

]

}

2.        AIM metadata

2.1        SpeechRecognition

{

“Identifier”:{

“ImplementerID”:”/* String assigned by IIDRA */”,

“Specification”:{

“Name”:”MMC”,

“AIW”:”MST”,

“AIM”:”SpeechRecognition”,

“Version”:”1″

},

“Description”:”This AIM implements the speech recognition function for MMC-MST: it converts the user’s speech to text.”,

“Types”:[

{

“Name”:”Speech_t”,

“Type”:”uint16[]”

},

{

“Name”:”Text_t”,

“Type”:”{uint8[] | uint16[]}”

}

],

“Ports”:[

{

“Name”:”InputSpeech1″,

“Direction”:”InputOutput”,

“RecordType”:”Speech_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”RecognizedText”,

“Direction”:”OutputInput”,

“RecordType”:”Text_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

}

],

“SubAIMs”:[

 

],

“Topology”:[

 

],

“Implementations”:[

 

],

“Documentation”:[

{

“Type”:”Tutorial”,

“URI”:”https://mpai.community/standards/mpai-mmc/”

}

]

}

}

2.2        Translation

{

“Identifier”:{

“ImplementerID”:”/* String assigned by IIDRA */”,

“Specification”:{

“Name”:”MMC”,

“AIW”:”MST”,

“AIM”:”Translation”,

“Version”:”1″

},

“Description”:”This AIM implements the translation function for MMC-MST: it converts source language text to target language text.”,

“Types”:[

{

“Name”:”Speech_t”,

“Type”:”uint16[]”

},

{

“Name”:”Text_t”,

“Type”:”{uint8[] | uint16[]}”

}

],

“Ports”:[

{

“Name”:”RecognizedText”,

“Direction”:”InputOutput”,

“RecordType”:”Text_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”TranslatedText1″,

“Direction”:”OutputInput”,

“RecordType”:”Text_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

}

],

“SubAIMs”:[

 

],

“Topology”:[

 

],

“Implementations”:[

 

],

“Documentation”:[

{

“Type”:”Tutorial”,

“URI”:”https://mpai.community/standards/mpai-mmc/”

}

]

}

}

2.3        Speech Feature Extraction

{

“Identifier”:{

“ImplementerID”:”/* String assigned by IIDRA */”,

“Specification”:{

“Name”:”MMC”,

“AIW”:”MST”,

“AIM”:”SpeechFeatureExtraction”,

“Version”:”1″

},

“Description”:”This AIM implements the speech feature extraction function for MMC-MST: it extracts specified features from the user’s source language speech so that these can be used during speech synthesis of the target text.”,

“Types”:[

{

“Name”:”Speech_t”,

“Type”:”uint16[]”

},

{

“Name”:”SpeechFeatures_t”,

“Type”:”{byte pitch; string<256 tone; string<256 intonation; string<256 intensity; string<256 speed; Emotion_t emotion; float32[] NNspeechFeatures}”

}

],

“Ports”:[

{

“Name”:”InputSpeech2″,

“Direction”:”InputOutput”,

“RecordType”:”Speech_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”SpeechFeatures”,

“Direction”:”OutputInput”,

“RecordType”:”SpeechFeatures_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

}

],

“SubAIMs”:[

 

],

“Topology”:[

 

],

“Implementations”:[

 

],

“Documentation”:[

{

“Type”:”Tutorial”,

“URI”:”https://mpai.community/standards/mpai-mmc/”

}

]

}

}

 

2.4        Speech Synthesis

 

{

“Identifier”:{

“ImplementerID”:”/* String assigned by IIDRA */”,

“Specification”:{

“Name”:”MMC”,

“AIW”:”MST”,

“AIM”:”SpeechSynthesis”,

“Version”:”1″

},

“Description”:”This AIM implements the speech synthesis function for MMC-MST: it receives target language text and optionally speech features extracted from the source language speech and produces target language speech.”,

“Types”:[

{

“Name”:”Speech_t”,

“Type”:”uint16[]”

},

{

“Name”:”SpeechFeatures_t”,

“Type”:”{byte pitch; string<256 tone; string<256 intonation; string<256 intensity; string<256 speed; Emotion_t emotion; float32[] NNspeechFeatures}”

},

{

“Name”:”Text_t”,

“Type”:”{uint8[] | uint16[]}”

}

],

“Ports”:[

{

“Name”:”TranslatedText1″,

“Direction”:”InputOutput”,

“RecordType”:”Text_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”TranslatedText2″,

“Direction”:”InputOutput”,

“RecordType”:”Text_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”TranslatedTextN”,

“Direction”:”InputOutput”,

“RecordType”:”Text_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”OutputText1″,

“Direction”:”InputOutput”,

“RecordType”:”Text_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”OutputText2″,

“Direction”:”InputOutput”,

“RecordType”:”Text_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”OutputTextN”,

“Direction”:”InputOutput”,

“RecordType”:”Text_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”SpeechFeatures”,

“Direction”:”InputOutput”,

“RecordType”:”SpeechFeatures_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”InterpretedSpeech1″,

“Direction”:”OutputInput”,

“RecordType”:”Speech_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”InterpretedSpeech2″,

“Direction”:”OutputInput”,

“RecordType”:”Speech_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”InterpretedSpeechN”,

“Direction”:”OutputInput”,

“RecordType”:”Speech_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

}

],

“SubAIMs”:[

 

],

“Topology”:[

 

],

“Implementations”:[

 

],

“Documentation”:[

{

“Type”:”Tutorial”,

“URI”:”https://mpai.community/standards/mpai-mmc/”

}

]

}

}

 

 

  • Metadata of MMC-PSE Composite AIM

1.        PersonalStatusExtraction

{

“Identifier”:{

“ImplementerID”:”/* String assigned by IIDRA */”,

“Specification”:{

“Name”:”MPAI-MMC”,

“AIW”:””,

“AIM”:”PersonalStatusExtraction”,

“Version”:”2″

},

“Description”:”This AIM implements the Personal Status Extraction function.”,

“Types”:[

{

“Name”:”InputSelection_t”,

“Type”:”uint8[]”

},

{

“Name”:”Text_t”,

“Type”:”uint8[] | uint16[]”

},

{

“Name”:”PSTextDescriptors_t”,

“Type”:”uint8[]”

},

{

“Name”:”Speech_t”,

“Type”:”uint16[]”

},

{

“Name”:”PSSpeechDescriptors_t”,

“Type”:”uint8[]”

},

{

“Name”:”FaceObject_t”,

“Type”:”uint24[]”

},

{

“Name”:”PSFaceDescriptors_t”,

“Type”:”uint8[]”

},

{

“Name”:”BodyObject_t”,

“Type”:”uint[]”

},

{

“Name”:”PSGestureDescriptors_t”,

“Type”:”uint8[]”

}

],

“Ports”:[

{

“Name”:”InputSelection”,

“Direction”:”InputOutput”,

“RecordType”:”InputSelection_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”TextObject”,

“Direction”:”InputOutput”,

“RecordType”:”Text_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”TextDescriptors”,

“Direction”:”InputOutput”,

“RecordType”:”TextDescriptors_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”SpeechObject”,

“Direction”:”InputOutput”,

“RecordType”:”Speech_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”SpeechDescriptors”,

“Direction”:”InputOutput”,

“RecordType”:”SpeechDescriptors_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”FaceObject”,

“Direction”:”InputOutput”,

“RecordType”:”FaceObject_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”FaceDescriptors”,

“Direction”:”InputOutput”,

“RecordType”:”FaceDescriptors_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”BodyObject”,

“Direction”:”InputOutput”,

“RecordType”:”BodyObject_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”BodyDescriptors”,

“Direction”:”InputOutput”,

“RecordType”:”BodyDescriptors_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”PersonalStatus”,

“Direction”:”OutputInput”,

“RecordType”:”PersonalStatus_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

}

],

“SubAIMs”:[

{

“Name”:”PSTextDescription”,

“Identifier”:{

“ImplementerID”:”/* String assigned by IIDRA */”,

“Specification”:{

“Standard”:”MPAI-MMC”,

“AIW”:””,

“AIM”:”PSTextDescription”,

“Version”:”2″

}

}

},

{

“Name”:”PSSpeechDescription”,

“Identifier”:{

“ImplementerID”:”/* String assigned by IIDRA */”,

“Specification”:{

“Standard”:”MPAI-MMC”,

“AIW”:””,

“AIM”:”PSSpeechDescription”,

“Version”:”2″

}

}

},

{

“Name”:”PSFaceDescription”,

“Identifier”:{

“ImplementerID”:”/* String assigned by IIDRA */”,

“Specification”:{

“Standard”:”MPAI-MMC”,

“AIW”:””,

“AIM”:”PSFaceDescription”,

“Version”:”2″

}

}

},

{

“Name”:”PSGestureDescription”,

“Identifier”:{

“ImplementerID”:”/* String assigned by IIDRA */”,

“Specification”:{

“Standard”:”MPAI-MMC”,

“AIW”:””,

“AIM”:”PSGestureDescription”,

“Version”:”2″

}

}

},

{

“Name”:”PSTextInterpretation”,

“Identifier”:{

“ImplementerID”:”/* String assigned by IIDRA */”,

“Specification”:{

“Standard”:”MPAI-MMC”,

“AIW”:””,

“AIM”:”PSTextInterpretation”,

“Version”:”2″

}

}

},

{

“Name”:”PSSpeechInterpretation”,

“Identifier”:{

“ImplementerID”:”/* String assigned by IIDRA */”,

“Specification”:{

“Standard”:”MPAI-MMC”,

“AIW”:””,

“AIM”:”PSSpeechInterpretation”,

“Version”:”2″

}

}

},

{

“Name”:”PSFaceInterpretation”,

“Identifier”:{

“ImplementerID”:”/* String assigned by IIDRA */”,

“Specification”:{

“Standard”:”MPAI-MMC”,

“AIW”:””,

“AIM”:”PSFaceInterpretation”,

“Version”:”2″

}

}

},

{

“Name”:”PSGestureInterpretation”,

“Identifier”:{

“ImplementerID”:”/* String assigned by IIDRA */”,

“Specification”:{

“Standard”:”MPAI-MMC”,

“AIW”:””,

“AIM”:”PSGestureInterpretation”,

“Version”:”2″

}

}

},

{

“Name”:”PersonalStatusCombination”,

“Identifier”:{

“ImplementerID”:”/* String assigned by IIDRA */”,

“Specification”:{

“Standard”:”MPAI-MMC”,

“AIW”:””,

“AIM”:”PersonalStatusCombination”,

“Version”:”2″

}

}

}

],

“Topology”:[

{

“Output”:{

“AIMName”:””,

“PortName”:”InputSelection”

},

“Input”:{

“AIMName”:””,

“PortName”:”InputSelection”

}

},

{

“Output”:{

“AIMName”:””,

“PortName”:”TextObject”

},

“Input”:{

“AIMName”:”PSTextDescription”,

“PortName”:”TextObject”

}

},

{

“Output”:{

“AIMName”:””,

“PortName”:”SpeechObject”

},

“Input”:{

“AIMName”:”PSSpeechDescription”,

“PortName”:”SpeechObject”

}

},

{

“Output”:{

“AIMName”:””,

“PortName”:”FaceObject”

},

“Input”:{

“AIMName”:”PSFaceDescription”,

“PortName”:”FaceObject”

}

},

{

“Output”:{

“AIMName”:””,

“PortName”:”BodyObject”

},

“Input”:{

“AIMName”:”PSGestureDescription”,

“PortName”:”BodyObject”

}

},

{

“Output”:{

“AIMName”:”PSTextDescription”,

“PortName”:”PSTextDescriptors”

},

“Input”:{

“AIMName”:”PSTextInterpretation”,

“PortName”:”PSTextDescriptors”

}

},

{

“Output”:{

“AIMName”:””,

“PortName”:”TextDescriptors”

},

“Input”:{

“AIMName”:”PSTextInterpretation”,

“PortName”:”TextDescriptors”

}

},

{

“Output”:{

“AIMName”:”PSSpeechDescription”,

“PortName”:”PSSpeechDescriptors”

},

“Input”:{

“AIMName”:”PSSpeechInterpretation”,

“PortName”:”PSSpeechDescriptors”

}

},

{

“Output”:{

“AIMName”:””,

“PortName”:”SpeechDescriptors”

},

“Input”:{

“AIMName”:”PSSpeechInterpretation”,

“PortName”:”SpeechDescriptors”

}

},

{

“Output”:{

“AIMName”:”PSFaceDescription”,

“PortName”:”PSFaceDescriptors”

},

“Input”:{

“AIMName”:”PSFaceInterpretation”,

“PortName”:”PSFaceDescriptors”

}

},

{

“Output”:{

“AIMName”:””,

“PortName”:”FaceDescriptors”

},

“Input”:{

“AIMName”:”PSFaceInterpretation”,

“PortName”:”FaceDescriptors”

}

},

{

“Output”:{

“AIMName”:”PSGestureDescription”,

“PortName”:”PSGestureDescriptors”

},

“Input”:{

“AIMName”:”PSGestureInterpretation”,

“PortName”:”PSGestureDescriptors”

}

},

{

“Output”:{

“AIMName”:””,

“PortName”:”BodyDescriptors”

},

“Input”:{

“AIMName”:”PSGestureInterpretation”,

“PortName”:”BodyDescriptors”

}

},

{

“Output”:{

“AIMName”:”PSTextInterpretation”,

“PortName”:”PSText”

},

“Input”:{

“AIMName”:”PersonalStatusCombination”,

“PortName”:”PSText”

}

},

{

“Output”:{

“AIMName”:”PSSpeechInterpretation”,

“PortName”:”PSSpeech”

},

“Input”:{

“AIMName”:”PersonalStatusCombination”,

“PortName”:”PSSpeech”

}

},

{

“Output”:{

“AIMName”:”PSFaceInterpretation”,

“PortName”:”PSFace”

},

“Input”:{

“AIMName”:”PersonalStatusCombination”,

“PortName”:”PSFace”

}

},

{

“Output”:{

“AIMName”:”PSGestureInterpretation”,

“PortName”:”PSGesture”

},

“Input”:{

“AIMName”:”PersonalStatusCombination”,

“PortName”:”PSGesture”

}

},

{

“Output”:{

“AIMName”:”PersonalStatusCombination”,

“PortName”:”PersonalStatus”

},

“Input”:{

“AIMName”:””,

“PortName”:”PersonalStatus”

}

}

],

“Implementations”:[

 

],

“Documentation”:[

{

“Type”:”Tutorial”,

“URI”:”https://mpai.community/standards/mpai-mmc/”

}

]

}

}

 

1.1        PSTextDescription

{

“Identifier”:{

“ImplementerID”:”/* String assigned by IIDRA */”,

“Specification”:{

“Name”:”MPAI-MMC”,

“AIW”:””,

“AIM”:”PSTextDescription”,

“Version”:”2″

},

“Description”:”This AIM implements the text description for Personal Status.”,

“Types”:[

{

“Name”:”Text_t”,

“Type”:”{uint8[] | uint16[]}”

},

{

“Name”:”PSTextDescriptors_t”,

“Type”:”uint8[]”

}

],

“Ports”:[

{

“Name”:”TextObject”,

“Direction”:”InputOutput”,

“RecordType”:”Text_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”PSTextDescriptors”,

“Direction”:”OutputInput”,

“RecordType”:”PSTextDescriptors_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

}

],

“SubAIMs”:[

 

],

“Topology”:[

 

],

“Implementations”:[

 

],

“Documentation”:[

{

“Type”:”Tutorial”,

“URI”:”https://mpai.community/standards/mpai-mmc/”

}

]

}

}

 

1.2        PSSpeechDescription

{

“Identifier”:{

“ImplementerID”:”/* String assigned by IIDRA */”,

“Specification”:{

“Name”:”MPAI-MMC”,

“AIW”:””,

“AIM”:”PSSpeechDescription”,

“Version”:”2″

},

“Description”:”This AIM implements the Speech description for Personal Status.”,

“Types”:[

{

“Name”:”Speech_t”,

“Type”:”uint16[]”

},

{

“Name”:”PSSpeechDescriptors_t”,

“Type”:”uint8[]”

}

],

“Ports”:[

{

“Name”:”SpeechObject”,

“Direction”:”InputOutput”,

“RecordType”:”Speech_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”PSSpeechDescriptors”,

“Direction”:”OutputInput”,

“RecordType”:”PSSpeechDescriptors_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

}

],

“SubAIMs”:[

 

],

“Topology”:[

 

],

“Implementations”:[

 

],

“Documentation”:[

{

“Type”:”Tutorial”,

“URI”:”https://mpai.community/standards/mpai-mmc/”

}

]

}

}

 

1.3        PSFaceDescription

{

“Identifier”:{

“ImplementerID”:”/* String assigned by IIDRA */”,

“Specification”:{

“Name”:”MPAI-MMC”,

“AIW”:””,

“AIM”:”PSFaceDescription”,

“Version”:”2″

},

“Description”:”This AIM implements the Face description for Personal Status.”,

“Types”:[

{

“Name”:”Face_t”,

“Type”:”uint32[]”

},

{

“Name”:”PSFaceDescriptors_t”,

“Type”:”uint8[]”

}

],

“Ports”:[

{

“Name”:”FaceObject”,

“Direction”:”InputOutput”,

“RecordType”:”Face_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”PSFaceDescriptors”,

“Direction”:”OutputInput”,

“RecordType”:”PSFaceDescriptors_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

}

],

“SubAIMs”:[

 

],

“Topology”:[

 

],

“Implementations”:[

 

],

“Documentation”:[

{

“Type”:”Tutorial”,

“URI”:”https://mpai.community/standards/mpai-mmc/”

}

]

}

}

 

1.4        PSGestureDescription

{

“Identifier”:{

“ImplementerID”:”/* String assigned by IIDRA */”,

“Specification”:{

“Name”:”MPAI-MMC”,

“AIW”:””,

“AIM”:”PSGestureDescription”,

“Version”:”2″

},

“Description”:”This AIM implements the Gesture description for Personal Status.”,

“Types”:[

{

“Name”:”Body_t”,

“Type”:”uint8[]”

},

{

“Name”:”PSGestureDescriptors_t”,

“Type”:”uint8[]”

}

],

“Ports”:[

{

“Name”:”BodyObject”,

“Direction”:”InputOutput”,

“RecordType”:”Body_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”PSGestureDescriptors”,

“Direction”:”OutputInput”,

“RecordType”:”PSGestureDescriptors_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

}

],

“SubAIMs”:[

 

],

“Topology”:[

 

],

“Implementations”:[

 

],

“Documentation”:[

{

“Type”:”Tutorial”,

“URI”:”https://mpai.community/standards/mpai-mmc/”

}

]

}

}

 

1.5        PSTextInterpretation

{

“Identifier”:{

“ImplementerID”:”/* String assigned by IIDRA */”,

“Specification”:{

“Name”:”MPAI-MMC”,

“AIW”:””,

“AIM”:”PSTextInterpretation”,

“Version”:”2″

},

“Description”:”This AIM implements the Text Interpretation function for Personal Status.”,

“Types”:[

{

“Name”:”PSTextDescriptors_t”,

“Type”:”uint8[]”

},

{

“Name”:”TextDescriptors_t”,

“Type”:”uint8[]”

},

{

“Name”:”PSText_t”,

“Type”:”uint8[]”

}

],

“Ports”:[

{

“Name”:”PSTextDescriptors”,

“Direction”:”InputOutput”,

“RecordType”:”PSTextDescriptors_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”TextDescriptors”,

“Direction”:”InputOutput”,

“RecordType”:”TextDescriptors_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”PSText”,

“Direction”:”OutputInput”,

“RecordType”:”PSText_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

}

],

“SubAIMs”:[

 

],

“Topology”:[

 

],

“Implementations”:[

 

],

“Documentation”:[

{

“Type”:”Tutorial”,

“URI”:”https://mpai.community/standards/mpai-mmc/”

}

]

}

}

 

1.6        PSSpeechInterpretation

{

“Identifier”:{

“ImplementerID”:”/* String assigned by IIDRA */”,

“Specification”:{

“Name”:”MPAI-MMC”,

“AIW”:””,

“AIM”:”PSSpeechInterpretation”,

“Version”:”2″

},

“Description”:”This AIM implements the Speech Interpretation function for Personal Status.”,

“Types”:[

{

“Name”:”PSSpeechDescriptors_t”,

“Type”:”uint8[]”

},

{

“Name”:”SpeechDescriptors_t”,

“Type”:”uint8[]”

},

{

“Name”:”PSSpeech_t”,

“Type”:”uint8[]”

}

],

“Ports”:[

{

“Name”:”PSSpeechDescriptors”,

“Direction”:”InputOutput”,

“RecordType”:”PSSpeechDescriptors_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”SpeechDescriptors”,

“Direction”:”InputOutput”,

“RecordType”:”SpeechDescriptors_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”PSSpeech”,

“Direction”:”OutputInput”,

“RecordType”:”PSSpeech_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

}

],

“SubAIMs”:[

 

],

“Topology”:[

 

],

“Implementations”:[

 

],

“Documentation”:[

{

“Type”:”Tutorial”,

“URI”:”https://mpai.community/standards/mpai-mmc/”

}

]

}

}

 

1.7        PSFaceInterpretation

{

“Identifier”:{

“ImplementerID”:”/* String assigned by IIDRA */”,

“Specification”:{

“Name”:”MPAI-MMC”,

“AIW”:””,

“AIM”:”PSFaceInterpretation”,

“Version”:”2″

},

“Description”:”This AIM implements the Face Interpretation function for Personal Status.”,

“Types”:[

{

“Name”:”PSFaceDescriptors_t”,

“Type”:”uint8[]”

},

{

“Name”:”FaceDescriptors_t”,

“Type”:”uint8[]”

},

{

“Name”:”PSFace_t”,

“Type”:”uint8[]”

}

],

“Ports”:[

{

“Name”:”PSFaceDescriptors”,

“Direction”:”InputOutput”,

“RecordType”:”PSFaceDescriptors_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”FaceDescriptors”,

“Direction”:”InputOutput”,

“RecordType”:”FaceDescriptors_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”PSFace”,

“Direction”:”OutputInput”,

“RecordType”:”PSFace_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

}

],

“SubAIMs”:[

 

],

“Topology”:[

 

],

“Implementations”:[

 

],

“Documentation”:[

{

“Type”:”Tutorial”,

“URI”:”https://mpai.community/standards/mpai-mmc/”

}

]

}

}

 

1.8        PSGestureInterpretation

{

“Identifier”:{

“ImplementerID”:”/* String assigned by IIDRA */”,

“Specification”:{

“Name”:”MPAI-MMC”,

“AIW”:””,

“AIM”:”PSGestureInterpretation”,

“Version”:”2″

},

“Description”:”This AIM implements the Gesture Interpretation function for Personal Status.”,

“Types”:[

{

“Name”:”PSGestureDescriptors_t”,

“Type”:”uint8[]”

},

{

“Name”:”BodyDescriptors_t”,

“Type”:”uint8[]”

},

{

“Name”:”PSGesture_t”,

“Type”:”uint8[]”

}

],

“Ports”:[

{

“Name”:”PSGestureDescriptors”,

“Direction”:”InputOutput”,

“RecordType”:”PSGestureDescriptors_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”BodyDescriptors”,

“Direction”:”InputOutput”,

“RecordType”:”BodyDescriptors_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”PSGesture”,

“Direction”:”OutputInput”,

“RecordType”:”PSGesture_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

}

],

“SubAIMs”:[

 

],

“Topology”:[

 

],

“Implementations”:[

 

],

“Documentation”:[

{

“Type”:”Tutorial”,

“URI”:”https://mpai.community/standards/mpai-mmc/”

}

]

}

}

 

1.9        PersonalStatusCombination

{

“Identifier”:{

“ImplementerID”:”/* String assigned by IIDRA */”,

“Specification”:{

“Name”:”MPAI-MMC”,

“AIW”:””,

“AIM”:”PersonalStatusCombination”,

“Version”:”2″

},

“Description”:”This AIM implements the Personal Status Combination function.”,

“Types”:[

{

“Name”:”PSText_t”,

“Type”:”uint8[]”

},

{

“Name”:”PSSpeech_t”,

“Type”:”uint8[]”

},

{

“Name”:”PSFace_t”,

“Type”:”uint8[]”

},

{

“Name”:”PSGesture_t”,

“Type”:”uint8[]”

},

{

“Name”:”PersonalStatus_t”,

“Type”:”uint8[]”

}

],

“Ports”:[

{

“Name”:”PSText”,

“Direction”:”InputOutput”,

“RecordType”:”PSText_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”PSSpeech”,

“Direction”:”InputOutput”,

“RecordType”:”PSSpeech_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”PSFace”,

“Direction”:”InputOutput”,

“RecordType”:”PSFace_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”PSGesture”,

“Direction”:”InputOutput”,

“RecordType”:”PSGesture_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

},

{

“Name”:”PersonalStatus”,

“Direction”:”OutputInput”,

“RecordType”:”PersonalStatus_t”,

“Technology”:”Software”,

“Protocol”:””,

“IsRemote”:false

}

],

“SubAIMs”:[

 

],

“Topology”:[

 

],

“Implementations”:[

 

],

“Documentation”:[

{

“Type”:”Tutorial”,

“URI”:”https://mpai.community/standards/mpai-mmc/”

}

]

}

}

 

 

 

 

 

 

 

 

 

 

 

 

  • Communication Among AIM Implementors (Informative)

 

To the extent possible, AIM input and output data are specified so that the inner implementation of an AIM need not be known or considered by cooperating AIMs. In other words, so far as possible, cooperating AIMs are designed to interact as black boxes. However, AIMs based upon the neural network technology currently prevalent in AI systems will generally require closer cooperation – in effect, greater transparency. An AIM receiving neural input in the form of features (vectors) will require some assistance in processing them. The downstream AIM will need either

  • The neural network model used to train the upstream AIM, or
  • A precise specification of the syntax and semantics of the features,

so that the downstream AIM can handle the features received from the upstream AIM.

 

A core design principle of MPAI is modularity: AI Modules (AIMs) and their interfaces must be defined so that each AIM can be built by an independent implementor without compromising the function of the use case as a whole.

However, MPAI also recognizes that AIMs and their implementors may sometimes profit from communication and interchange of data and/or components. Such exchanges can be especially appropriate for AIMs featuring neural network components or comparable machine learning elements – an increasingly common and important situation in the design of cooperative artificial intelligence modules.

The Unidirectional Speech Translation workflow provides a good example. It is designed to enable Speech Features extracted from the input (source language) speech to be added to the Translated Speech, that is, to the target language or output speech. This addition can enable the spoken translation to express the original emotion, or to employ the original speaker’s voice quality so as to give the impression that he or she is pronouncing the translation. For these purposes, a Speech Feature Extraction AIM can extract relevant speech features from the input speech and pass them to the Speech Synthesis (Features) AIM. However, while the two AIMs can indeed be independently implemented, the downstream (receiving) AIM, in this case Speech Synthesis (Features), will need to process the received speech features appropriately. If Speech Feature Extraction employs neural network technology and passes the resulting features as vectors, then Speech Synthesis (Features) will need cooperation from Speech Feature Extraction: the downstream AIM will need either (1) the neural network model used to train the upstream AIM, or (2) a precise specification of the syntax and semantics of the features, so that it can handle the features received from the upstream AIM.

 

Comparable considerations apply to the Conversation with Emotion (CWE) use case and, more generally, to any AIMs that exchange neural information. In explicitly providing for such communication among artificial machine learning models and components, MPAI is not only recognising practical requirements for cooperation among such modules, but also acknowledging an analogy with communication among biological neural subsystems.

 

 

 

 

[1] At the time of publication of this Technical Specification, the MPAI Store was assigned as the IIDRA.