Terms beginning with a capital letter have the meaning defined in Table 1. Terms beginning with a small letter have the meaning commonly defined for the context in which they are used. For instance, Table 1 defines Object and Scene but does not define object and scene.
A dash “-” preceding a Term in Table 1 means that the Term is combined with the nearest preceding Term that has no dash. The font determines the reading order:
- Normal font: the preceding Term without a dash is read before the dashed Term. For example, “Avatar” and “- Model” yield “Avatar Model.”
- Italic font: the preceding Term without a dash is read after the dashed Term. For example, “Avatar” and “- Portable” yield “Portable Avatar.”
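The reading rule above can be sketched in a few lines of Python. This is only an illustration: the function name `expand_term` and the `italic` flag are assumptions introduced here, and the caller must supply the flag because plain text does not carry font information.

```python
def expand_term(parent: str, dashed: str, italic: bool = False) -> str:
    """Combine a dashed Table 1 entry with the preceding undashed Term.

    `parent` is the nearest preceding Term without a dash.
    `dashed` is the entry as written, e.g. "- Model" or "– Portable".
    `italic` is a hypothetical flag standing in for the font of the entry.
    """
    # Drop the leading hyphen or en dash and any surrounding spaces.
    sub = dashed.lstrip("-–").strip()
    # Normal font: parent is read first. Italic font: parent is read last.
    return f"{sub} {parent}" if italic else f"{parent} {sub}"


print(expand_term("Avatar", "- Model"))                   # Avatar Model
print(expand_term("Avatar", "- Portable", italic=True))   # Portable Avatar
```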
The full set of Terms and Definitions relevant to all MPAI Technical Specifications, including MPAI-HMC, can be accessed online.
Table 1 – General MPAI-HMC terms
Term | Definition |
Attitude | |
– Social | The coded representation of the internal state related to the way a human or avatar intends to position itself vis-à-vis the Environment or subsets of it, e.g., “Respectful”, “Confrontational”, “Soothing”. |
– Spatial | Position and Orientation and their velocities and accelerations of an Object in a Real or Virtual Environment. |
Audio | Digital representation of an analogue audio signal sampled at a frequency between 8 kHz and 192 kHz with a number of bits/sample between 8 and 32, and non-linear and linear quantisation. Data with characteristics of Audio may be synthetically produced. |
Audio Block | A set of consecutive Audio samples. |
Audio Channel | A sequence of Audio Blocks. |
Avatar | An Object rendered to represent a Human or a Machine in a virtual space. |
– Model | An inanimate Avatar exposing animation interfaces. |
– Portable | A Data Type including Avatar ID, Time, Visual Environment, Spatial Attitude, Avatar Model, Body Descriptors, Face Descriptors, Language Preference, Speech Coding, Speech Data, Text, and Personal Status [8]. |
Body | A digital representation of a human body, head included, face excluded. |
Centre Point | The point of an Object selected to have Local Coordinates (0,0,0). |
Cognitive State | The coded representation of the internal state reflecting the way a human or avatar understands the Environment, such as “Confused”, “Dubious”, “Convinced”. |
Communication Item | An element generated by a Machine communicating with an Entity expressed with a Portable Avatar. |
Context | Information surrounding an Entity and providing additional insight into the information the Entity communicates. |
Coordinate System | A coordinate system where the position of a point is specified by three numbers. |
– Cartesian | A coordinate system where the three numbers are the signed distances from the point to three mutually perpendicular planes. |
– Spherical | A coordinate system where the three numbers are: the radial distance of the point from a fixed origin; the polar angle measured from a fixed zenith direction; and the azimuthal angle of its orthogonal projection on a reference plane. |
Culture | The collection of language and customs governing the way a human or a group of humans expresses their internal statuses. |
Data | Information in digital form. |
– Format | The standard digital representation of Data. |
– Type | An instance of Data with a specific Data Format. |
Descriptor | The Digital Representation of a feature of an Object. |
– Body | A Data Type including the digital representation of the features of the body of a real or digital human. |
– Face | A Data Type including the digital representation of a feature of the face of a real or digital human. |
– Speech | A Data Type including the digital representation of a feature of speech of a real or digital human, such as degree of vocal tension, pitch, etc. |
– Text | A Data Type including the digital representation of a feature of text. |
Digital Representation | Data corresponding to and representing a physical entity. |
Emotion | The coded representation of the internal state resulting from the interaction of a human or avatar with the Environment or subsets of it, such as “Angry”, “Sad”, “Determined”. |
Entity | A human in a real environment, or a Digital Human (a Digitised or a Virtual Human) in a Virtual Environment. |
Environment | A Virtual Space that may be null or may include an Audio-Visual Scene. |
Experience | The state of an Entity whose senses/sensors are continuously affected for a meaningful period. |
Face | A digital representation of a human face. |
Factor | One of Emotion, Cognitive State, and Attitude. |
Gesture | A movement of a Digital Human or part of it, such as the head, arm, hand, and finger, often a complement to a vocal utterance. |
Human | A human being in a real space. |
– Digital | A Digitised or a Virtual Human in a Virtual Space. |
– Digitised | An Object in a Virtual Space that has the appearance of a specific human when rendered. |
– Virtual | An Object in a Virtual Space created by a computer that has a human appearance when rendered but is not a Digitised Human. |
Identifier | The label uniquely associated with a human or an Object. |
Instance | An element of a set of entities – Objects, Digital Humans etc. – belonging to some levels in a hierarchical classification (taxonomy). |
– Audio | The instance of an Audio Object. |
– Visual | The instance of a Visual Object. |
Machine | An Implementation of MPAI-MMC. |
Meaning | Information extracted from Text such as syntactic and semantic information, Personal Status, and other information, such as an Object Identifier. |
Microphone Array | A microphone system that uses multiple microphones arranged in a specific pattern to capture audio in an audio space. |
– Geometry | A Data Type representing the spatial arrangement of the microphones in a Microphone Array. |
Modality | One of Text, Speech, Face, or Gesture. |
Object | A data structure that can be rendered to cause an Experience. |
– Audio | An Object described by Audio Descriptors. |
– Audio-Visual | An Object described by Audio-Visual Descriptors. |
– Body | A digital representation of the body of a Human or a Machine. |
– Descriptor | The digital representation of the feature of an Object. |
– Digital | A Digitised or a Virtual Object. |
– Digitised | The digital representation of a real object. |
– Face | The digital representation of the face of a Human or a Machine. |
– Speech | An Object described by Speech Descriptors. |
– Text | A string of Text. |
– Virtual | An Object not representing an object in the real environment. |
– Visual | An Object described by Visual Descriptors. |
Orientation | The 3 Euler angles of an Object in a Virtual Space. |
Personal Status | A Data Type including three Factors – Cognitive State, Emotion, and Social Attitude – conveyed by four Modalities – Text, Speech, Face, and Gesture – and providing standard extensible labels for the three Factors [6]. |
– Face | The Cognitive State, Emotion, and Social Attitude conveyed by a Face Object. |
– Gesture | The Cognitive State, Emotion, and Social Attitude conveyed by the Gesture of a Body Object. |
– Speech | The Cognitive State, Emotion, and Social Attitude conveyed by a Speech Object. |
– Text | The Cognitive State, Emotion, and Social Attitude conveyed by a Text Object. |
Portable Avatar | A Data Type representing an Avatar and its Context. |
Position | The coordinates of a representative point for an object in a Virtual Space with respect to a set of coordinate axes. |
Principal Axis | The x axis of an Object. |
Rendering | The process of instantiating a Virtual Space as a human-perceptible entity. |
Scene | A composition of Objects located according to a Scene Geometry. |
– Audio | A Scene composed of Audio Objects. |
– Audio-Visual | A Scene composed of Audio Objects, Visual Objects and co-located Audio-Visual Objects. |
– Multichannel | A data structure containing at least 2 time-aligned interleaved Audio Channels. |
– Visual | A Scene composed of Visual Objects. |
Scene Descriptors | The digital representation of a feature of a scene. |
– Audio | A Data Type including the digital representation of the audio features of a real or digital scene. |
– Audio-Visual | A Data Type combining the Audio and Visual Scene Descriptors. |
– Visual | A Data Type including the digital representation of the visual features of a real or digital scene. |
Scene Geometry | The digital representation of the Object arrangement of a Scene. |
– Audio | A Data Type describing the spatial arrangement of the Audio Objects of a Scene. |
– Audio-Visual | A Data Type describing the spatial arrangement of the Audio, Visual, and Audio-Visual Objects of a Scene. |
– Visual | A Data Type describing the spatial arrangement of the Visual Objects of a Scene. |
Selector | Input Data having the goal to set a parameter (e.g., use of Text vs Speech or Language Preference) or an operating mode of a Machine. |
Speech | Digital representation of analogue speech sampled at a frequency between 8 kHz and 96 kHz with a number of bits/sample of 8, 16 or 24, and non-linear and linear quantisation or compressed. Data with characteristics of Speech may be synthetically produced. |
Text | A sequence of characters represented according to [12]. |
– Recognised | The Text at the output of an Automatic Speech Recognition AIM. |
– Refined | The Text at the output of a Natural Language Understanding AIM. |
– Translated | The Text at the output of a Natural Language Translation AIM. |
Virtual Space | A space generated and maintained by a computing platform that can be rendered. |
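
The Cartesian and Spherical entries under Coordinate System in Table 1 describe the same point with different triples of numbers. The sketch below converts between the two under assumptions that Table 1 does not state: the zenith direction is taken as the z axis, the reference plane as the x-y plane, the azimuthal angle is measured from the x axis, and angles are in radians. The function names are illustrative only.

```python
import math


def spherical_to_cartesian(r: float, polar: float, azimuth: float) -> tuple:
    """Convert (radial distance, polar angle, azimuthal angle) to (x, y, z).

    Assumes zenith = +z and azimuth measured from +x in the x-y plane.
    """
    x = r * math.sin(polar) * math.cos(azimuth)
    y = r * math.sin(polar) * math.sin(azimuth)
    z = r * math.cos(polar)
    return (x, y, z)


def cartesian_to_spherical(x: float, y: float, z: float) -> tuple:
    """Convert (x, y, z) to (radial distance, polar angle, azimuthal angle)."""
    r = math.sqrt(x * x + y * y + z * z)
    if r == 0.0:
        # Angles are undefined at the origin; zero is returned by choice here.
        return (0.0, 0.0, 0.0)
    # Clamp guards against tiny floating-point overshoot outside [-1, 1].
    polar = math.acos(max(-1.0, min(1.0, z / r)))
    azimuth = math.atan2(y, x)
    return (r, polar, azimuth)


# Round trip: a point on the x axis at distance 2 from the origin.
point = spherical_to_cartesian(2.0, math.pi / 2, 0.0)
print(point)
print(cartesian_to_spherical(*point))
```

A different choice of zenith axis or angle unit changes the formulas, so an implementation should take these conventions from the applicable specification rather than from this sketch.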