1 Functions 2 Reference Model 3 Input/Output Data
4 Functions of AI Modules 5 Input/Output Data of AI Modules 6 AIW, AIMs, and JSON Metadata
7 Reference Software 8 Conformance Testing 9 Performance Assessment

1 Functions

The Communicating Entities in Context (HMC-CEC) AI Workflow enables Entities to communicate with other Entities, possibly in different Contexts, using Machines implementing HMC-CEC, where:

  1. Entity refers to one of:
    1. A human in an audio-visual scene, or a human represented as a Digitised Human in an Audio-Visual Scene.
    2. A Digital Human – representing a human (Digitised Human) or a Machine (Virtual Human) – in an Audio-Visual Scene rendered as an audio-visual scene.
  2. Context is information describing Attributes of an Entity, such as language and culture.

Note that a non-capitalised word denotes an object in the real world, while the same word capitalised denotes its digital representation in the Virtual World.

Depending on its real or virtual nature, an Entity communicates with another Entity by:

  1. Using the human’s body, speech, context, and the audio-visual scene the human is immersed in, or
  2. Rendering the Virtual Entity as a speaking humanoid in an audio-visual scene,

and by emitting Communication Items. A Communication Item is an implementation of Portable Avatar, a Data Type that includes Data related to an Avatar and its Context, so that a receiver can render the Avatar as intended by the sender.

HMC-CEC assumes that:

  1. Input/Output Audio and Input/Output Visual are Audio Object and Visual Object, respectively.
  2. The real space is digitally represented as an Audio-Visual Scene that includes the communicating human and may include other humans and generic objects.
  3. The Virtual Space contains a Digital Human and/or its Speech components and may include other Digital Humans and generic Objects in an Audio-Visual Scene.
  4. The Machine can:
    • Understand the semantics of the Communication Item at different layers of depth.
    • Produce a multimodal response expected to be congruent with the received information.
    • Render the response as a speaking Virtual Human in an Audio-Visual Scene.
    • Convert the Data produced by an Entity to Data whose semantics is compatible with the Context of another Entity.

Note 1: An AI Module is specified only by its Functions and Interfaces. Implementers are free to use their preferred technologies to achieve the expected AIM Functions while respecting the constraints of the interfaces.

Note 2: An implementation may subdivide a given AIM into more than one AIM, provided that the combined AIM interface conforms with the interfaces of the corresponding HMC-CEC AIM.

Note 3: An implementation may combine AIMs into one, provided that the resulting AIM exposes the interface of the combined HMC-CEC AIMs.

2 Reference Model

Figure 1 depicts the Reference Model of the Communicating Entities in Context (HMC-CEC) AIW, which includes AI Modules (AIM) per Technical Specification: AI Framework (MPAI-AIF) V2.1. Three out of the six AIMs in Figure 1 (Audio-Visual Scene Description, Entity and Context Understanding, and Personal Status Display) are Composite AIMs, i.e., they include interconnected AIMs.

Figure 1 – Communicating Entities in Context (HMC-CEC) AIW

Note that:

  1. The Input Selector enables the Entity to inform the Machine through the Entity and Context Understanding AIM about use of Text vs. Speech in the communication, Language Preferences, and Selected Language in translation.
  2. The Machine captures the information emitted by the Entity and its Context through Input Text, Input Audio and Input Visual. In Figure 1 Audio includes Speech.
  3. The Input Portable Avatar is the Communication Item emitted by a communicating Machine.
  4. The Audio-Visual Scene Descriptors are digital representations of a real audio-visual scene or a Virtual Audio-Visual Scene produced either by the Audio-Visual Scene Description AIM or the Audio-Visual Scene Integration and Description AIM.
  5. To facilitate identification, each AIM is labelled with three letters indicating the Technical Specification that specifies it, followed by a hyphen “-”, followed by three letters uniquely identifying the AIM within that Technical Specification. For instance, Portable Avatar Demultiplexing is indicated as PAF-PDX, where PAF refers to Technical Specification: Portable Avatar Format (MPAI-PAF) and PDX refers to the Portable Avatar Demultiplexing AIM specified by MPAI-PAF.
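The labelling rule in item 5 can be checked mechanically. The following is a minimal sketch, not part of the specification: the names `AimLabel` and `parse_aim_label` are hypothetical, and the restriction to upper-case letters is an assumption based on the labels shown in this chapter (the text only says "three letters").

```python
import re
from typing import NamedTuple

# Assumed pattern: three upper-case letters, a hyphen, three upper-case letters.
# The specification text only states "three letters"; upper case is inferred
# from examples such as PAF-PDX.
AIM_LABEL = re.compile(r"^([A-Z]{3})-([A-Z]{3})$")


class AimLabel(NamedTuple):
    """Hypothetical container for the two halves of an AIM label."""
    specification: str  # e.g. "PAF" for MPAI-PAF
    aim: str            # e.g. "PDX" for Portable Avatar Demultiplexing


def parse_aim_label(label: str) -> AimLabel:
    """Split a label such as 'PAF-PDX' into specification and AIM parts."""
    match = AIM_LABEL.match(label)
    if match is None:
        raise ValueError(f"not a valid AIM label: {label!r}")
    return AimLabel(*match.groups())


if __name__ == "__main__":
    print(parse_aim_label("PAF-PDX"))
```

A label that does not follow the pattern, such as `PAF_PDX`, raises `ValueError` in this sketch.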

3 Input/Output Data

Table 1 gives the Input/Output Data of the HMC-CEC AIW.

Table 1 – Input/Output Data of the HMC-CEC AIW

| Input | Description |
| --- | --- |
| Portable Avatar | A Communication Item emitted by the Entity communicating with the ego Entity. |
| Input Selector | Selector containing data specifying the media and the language used in the communication. |
| Input Text | Text Object generated by the communicating Entity as information additional to or in lieu of Speech Object. |
| Input Audio | The audio scene captured by the Machine. |
| Input Visual | The visual scene captured by the Machine. |

| Output | Description |
| --- | --- |
| Portable Avatar | The Communication Item produced by the Machine. |
| Output Speech | The speech corresponding to the Speech Object in the output Communication Item. |
| Output Audio | The audio corresponding to the Audio Object in the output Communication Item. |
| Output Visual | The visual corresponding to the Visual Object in the output Communication Item. |
| Output Text | The Text contained in a Communication Item or associated with Output Audio and Output Visual. |

4 Functions of AI Modules

Table 2 gives the functions of HMC-CEC AIMs.

Table 2 – Functions of AI Modules

| AIM | Functions |
| --- | --- |
| Audio-Visual Scene Integration and Description | Adds Avatar to Audio-Visual Scene in Portable Avatar providing Audio-Visual Scene Descriptors. |
| Audio-Visual Scene Description | Provides Audio-Visual Scene Descriptors. |
| Entity and Context Understanding | Understands the information emitted by the Entity and its Context. |
| Entity Dialogue Processing | Produces Text and Personal Status of Machine in response to inputs. |
| Text-to-Text Translation | Produces Machine Translated Text from Machine Text and Personal Status. |
| Personal Status Display | Produces Portable Avatar. |
| Audio-Visual Scene Rendering | Renders the content of the Portable Avatar. |

5 Input/Output Data of AI Modules

Table 3 gives the I/O Data of the AIMs of HMC-CEC. Note that an ID can either be specified as an Instance Identifier or refer to a generic identifier.

Table 3 – Input/Output Data of AI Modules

| AIM | Receives | Produces |
| --- | --- | --- |
| Audio-Visual Scene Integration and Description | Input Portable Avatar | Audio-Visual Scene Descriptors |
| Audio-Visual Scene Description | Input Audio; Input Visual | Audio-Visual Scene Descriptors |
| Entity and Context Understanding | Audio-Visual Scene Descriptors; Input Text; Input Selector | Audio-Visual Scene Geometry; Personal Status; Entity ID; Text; Meaning; Instance Identifier |
| Entity Dialogue Processing | Audio-Visual Scene Geometry; Personal Status; Entity ID; Text; Meaning; Instance Identifier | Machine Personal Status; Machine Avatar ID; Machine Text; Output Text |
| Text-to-Text Translation | Machine Text; Machine Personal Status | Machine Translated Text |
| Personal Status Display | Machine Personal Status; Machine Avatar ID; Machine Text | Output Portable Avatar |
| Audio-Visual Scene Rendering | Output Portable Avatar | Output Audio; Output Visual |
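The data flow implied by Table 3 can be sanity-checked by verifying that every datum an AIM receives is either an AIW input or is produced by some AIM. The sketch below is illustrative only: the assignment of items to the Receives and Produces columns is one reading of Table 3, and the names `AIW_INPUTS`, `AIMS`, `unresolved_inputs`, and `unconsumed_outputs` are hypothetical.

```python
from typing import Dict, List, Set, Tuple

# AIW-level inputs, using the names that appear in Table 3.
AIW_INPUTS: Set[str] = {
    "Input Portable Avatar", "Input Selector",
    "Input Text", "Input Audio", "Input Visual",
}

# AIM name -> (receives, produces); an assumed reading of Table 3.
AIMS: Dict[str, Tuple[Set[str], Set[str]]] = {
    "Audio-Visual Scene Integration and Description": (
        {"Input Portable Avatar"}, {"Audio-Visual Scene Descriptors"}),
    "Audio-Visual Scene Description": (
        {"Input Audio", "Input Visual"}, {"Audio-Visual Scene Descriptors"}),
    "Entity and Context Understanding": (
        {"Audio-Visual Scene Descriptors", "Input Text", "Input Selector"},
        {"Audio-Visual Scene Geometry", "Personal Status", "Entity ID",
         "Text", "Meaning", "Instance Identifier"}),
    "Entity Dialogue Processing": (
        {"Audio-Visual Scene Geometry", "Personal Status", "Entity ID",
         "Text", "Meaning", "Instance Identifier"},
        {"Machine Personal Status", "Machine Avatar ID",
         "Machine Text", "Output Text"}),
    "Text-to-Text Translation": (
        {"Machine Text", "Machine Personal Status"},
        {"Machine Translated Text"}),
    "Personal Status Display": (
        {"Machine Personal Status", "Machine Avatar ID", "Machine Text"},
        {"Output Portable Avatar"}),
    "Audio-Visual Scene Rendering": (
        {"Output Portable Avatar"}, {"Output Audio", "Output Visual"}),
}


def unresolved_inputs(aims, aiw_inputs) -> Dict[str, List[str]]:
    """Return, per AIM, the received data that nothing supplies."""
    available = set(aiw_inputs)
    for _, produces in aims.values():
        available |= produces
    return {name: sorted(receives - available)
            for name, (receives, _) in aims.items()
            if receives - available}


def unconsumed_outputs(aims) -> List[str]:
    """Return produced data that no AIM receives (candidate AIW outputs)."""
    produced = set().union(*(p for _, p in aims.values()))
    consumed = set().union(*(r for r, _ in aims.values()))
    return sorted(produced - consumed)


if __name__ == "__main__":
    print(unresolved_inputs(AIMS, AIW_INPUTS))
    print(unconsumed_outputs(AIMS))
```

Under this reading, no AIM has an unresolved input, and the data that no AIM consumes are candidates for the AIW outputs listed in Table 1.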

6 AIW, AIMs, and JSON Metadata

Table 4 – AIW, AIMs, and JSON Metadata

AIW AIMs/1 AIMs/2 AIMs/3 Name JSON
HMC-CEC HMC-SID Communicating Entities in Context X
HMC-SID AV Scene Integration and Description X
OSD-AVS Audio-Visual Scene Description X
CAE-ASD Audio Scene Description X
CAE-AAT Audio Analysis Transform X
CAE-ASL Audio Source Localisation X
CAE-ASE Audio Separation and Enhancement X
CAE-AST Audio Synthesis Transform X
CAE-AMX Audio Descriptors Multiplexing X
OSD-VSD Visual Scene Description X
OSD-AVA Audio-Visual Alignment X
HMC-ECU Entity And Context Understanding X
OSD-SDX Audio-Visual Scene Demultiplexing X
MMC-ASR Automatic Speech Recognition X
OSD-VOI Visual Object Identification X
OSD-VDI Visual Direction Identification X
OSD-VOE Visual Object Extraction X
OSD-VII Visual Instance Identification X
CAE-AOI Audio Object Identification X
MMC-NLU Natural Language Understanding X
MMC-PSE Personal Status Extraction X
MMC-ETD Entity Text Description X
MMC-ESD Entity Speech Description X
PAF-EFD Entity Face Description X
PAF-EBD Entity Body Description X
MMC-PTI PS-Text Interpretation X
MMC-PSI PS-Speech Interpretation X
PAF-PFI PS-Face Interpretation X
PAF-PGI PS-Gesture Interpretation X
MMC-PMX Personal Status Multiplexing X
MMC-TTT Text-to-Text Translation X
MMC-EDP Entity Dialogue Processing X
MMC-TTT Text-to-Text Translation X
PAF-PSD Personal Status Display X
MMC-TTS Text-to-Speech X
PAF-IFD Entity Face Description X
PAF-IBD Entity Body Description X
PAF-PMX Portable Avatar Multiplexing X
PAF-AVR Audio-Visual Scene Rendering X

7 Reference Software

8 Conformance Testing

Table 5 provides the Conformance Testing Method for HMC-CEC AIM.

If a schema contains references to other schemas, conformance of data for the primary schema implies that any data referencing a secondary schema shall also validate against the relevant schema, if present, and conform with the Qualifier, if present.

Table 5 – Conformance Testing Method for HMC-CEC AIM

| | Data | Conformance requirements |
| --- | --- | --- |
| Receives | Portable Avatar | Shall validate against Portable Avatar schema. Portable Avatar Data shall conform with respective Qualifiers. |
| | Input Selector | Shall validate against Selector schema. |
| | Input Text | Shall validate against Text Object schema. Text Data shall conform with Text Qualifier. |
| | Input Audio | Shall validate against Audio Object schema. Audio Data shall conform with Audio Qualifier. |
| | Input Visual | Shall validate against Visual Object schema. Visual Data shall conform with Visual Qualifier. |
| Produces | Portable Avatar | Shall validate against Portable Avatar schema. Portable Avatar Data shall conform with respective Qualifiers. |
| | Output Audio | Shall validate against Audio Object schema. Audio Data shall conform with Audio Qualifier. |
| | Output Visual | Shall validate against Visual Object schema. Visual Data shall conform with Visual Qualifier. |
| | Output Text | Shall validate against Text Object schema. Text Data shall conform with Text Qualifier. |
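The rule that data validating against a primary schema must also validate against any referenced secondary schema, when that data is present, can be illustrated with a toy validator. This is a sketch under stated assumptions and not the MPAI conformance procedure: the schema names, the field names (`text`, `qualifier`, `language`), and the function `validate` are all hypothetical, and a real test would use the published JSON schemas with a JSON Schema validator.

```python
from typing import Any, Dict, List, Union

# Hypothetical toy schemas (NOT the MPAI schemas). A field maps either to a
# Python type (required primitive) or to the name of a secondary schema
# (checked only if the field is present, mirroring "if present" in the text).
SCHEMAS: Dict[str, Dict[str, Union[type, str]]] = {
    "TextObject": {"text": str, "qualifier": "TextQualifier"},
    "TextQualifier": {"language": str},
}


def validate(data: Any, schema_name: str,
             schemas: Dict[str, Dict[str, Union[type, str]]] = SCHEMAS,
             path: str = "") -> List[str]:
    """Return a list of error strings; an empty list means conformance."""
    if not isinstance(data, dict):
        return [f"{path or schema_name}: expected an object"]
    errors: List[str] = []
    for field, expected in schemas[schema_name].items():
        where = f"{path}.{field}" if path else field
        if isinstance(expected, str):
            # Reference to a secondary schema: recurse only when present.
            if field in data:
                errors.extend(validate(data[field], expected, schemas, where))
        elif field not in data:
            errors.append(f"{where}: missing")
        elif not isinstance(data[field], expected):
            errors.append(f"{where}: expected {expected.__name__}")
    return errors


if __name__ == "__main__":
    sample = {"text": "hello", "qualifier": {"language": 3}}
    print(validate(sample, "TextObject"))
```

In this sketch an error in the nested data is reported against the primary object, which is the behaviour the conformance rule describes: the primary data does not conform unless the referenced data does.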

9 Performance Assessment
