MPAI-HMC V1.1 AIWs Communicating Entities in Context


This Chapter specifies Functions, Reference Model, and Input and Output Data of the Communicating Entities in Context (HMC-CEC) AI Workflow (AIW), and the Functions and the Input and Output Data of its AI Modules (AIM). Each Input and Output Data of the HMC-CEC AIW and its AIMs is linked to its online specification.

1 Functions	2 Reference Model	3 Input/Output Data
4 Functions of AI Modules	5 Input/Output Data of AI Modules	6 AIW, AIMs, and JSON Metadata

1 Functions

The Communicating Entities in Context AI Workflow enables Machines to communicate with Entities in different Contexts where:

Machine is software embedded in a device that implements the HMC-CEC specification.
Entity refers to one of:
1. A human in a real audio-visual scene.
2. A human in a real scene represented as a Digitised Human in an Audio-Visual Scene.
3. A Machine represented as a Virtual Human in an Audio-Visual Scene.
4. A Digital Human in an Audio-Visual Scene rendered as a real audio-visual scene.
Context is information describing the Attributes of an Entity, such as language and culture.
Digital Human is either a Digitised Human, i.e., the digital representation of a human, or a Virtual Human, i.e., a Machine that can be rendered for human perception as a humanoid.
A word beginning with a small letter represents an object in the real world. A word beginning with a capital letter represents an Object in the Virtual World.

Entities communicate in one of the following ways:

When communicating with humans:
1. Use the Entities’ body, speech, Context, and the audio-visual scene that the Entities are immersed in.
2. Use HMC-CEC-enabled Machines emitting Communication Items.
When communicating with Machines:
1. Render Entities as speaking humanoids in audio-visual scenes, as appropriate.
2. Emit Communication Items.

Communication Items are implementations of Portable Avatar, a Data Type providing information on an Avatar and its context to enable a receiver to render an Avatar as intended by the sender.

HMC-CEC assumes that:

Input Audio and Input Visual are Audio Object and Visual Object, respectively.
Output Audio and Output Visual convey audio and visual information rendered by the Audio-Visual Rendering AIM.
The real space is digitally represented as an Audio-Visual Scene that includes the communicating human and may include other humans and generic objects.
The Virtual Space contains a Digital Human and/or its Audio components and may include other Digital Humans and generic Objects in an Audio-Visual Scene.
The Machine can:
- Understand the semantics of the Communication Item at different layers of depth.
- Produce a multimodal response expected to be congruent with the received information.
- Render the response as a speaking Virtual Human in an Audio-Visual Scene.
- Convert the semantics of the information produced by an Entity to a form that is compatible with the Context of another Entity.

An AI Module is specified only by its Functions and Interfaces. Implementers are free to use their preferred technologies to achieve the Functions, provided that they respect the constraints of the Interfaces. An implementation may subdivide a given AI Module into more than one AI Module, provided that the combined AI Module exposes the interfaces of the corresponding AI Modules of the HMC-CEC Specification. An implementation may also combine AI Modules into one, provided that the resulting AI Module exposes the interfaces of the corresponding AI Modules of the HMC-CEC Specification.
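The combination rule above can be illustrated with a minimal sketch. It assumes one literal reading of "exposes the interfaces": the resulting AI Module must receive and produce at least every data item of the AI Modules it replaces. The `Interface` class and the `exposes` function are hypothetical names introduced here for illustration and are not part of the HMC-CEC Specification; the data names are those listed in Table 3.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Interface:
    """Hypothetical description of an AIM interface: data received and produced."""
    receives: frozenset
    produces: frozenset


def exposes(implemented: Interface, specified: list) -> bool:
    """Return True if `implemented` covers every input and output of `specified`.

    Assumption: "exposing the interfaces" means covering the union of the
    specified inputs and the union of the specified outputs.
    """
    required_in = frozenset().union(*(s.receives for s in specified))
    required_out = frozenset().union(*(s.produces for s in specified))
    return required_in <= implemented.receives and required_out <= implemented.produces


# Interfaces of two AIMs, using the data names of Table 3.
TTT = Interface(
    receives=frozenset({"Machine Text", "Machine Personal Status"}),
    produces=frozenset({"Machine Translated Text"}),
)
PSD = Interface(
    receives=frozenset({"Machine Personal Status", "Machine Avatar ID", "Machine Text"}),
    produces=frozenset({"Output Portable Avatar"}),
)

# A combined module that keeps all inputs and outputs passes the check.
merged = Interface(TTT.receives | PSD.receives, TTT.produces | PSD.produces)

# A combined module that drops "Machine Translated Text" does not.
incomplete = Interface(PSD.receives, PSD.produces)

if __name__ == "__main__":
    print(exposes(merged, [TTT, PSD]))
    print(exposes(incomplete, [TTT, PSD]))
```

Under this reading, the internal technology of the combined module is irrelevant to the check; only the declared inputs and outputs are compared.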

Usage Scenarios offer a collection of example applications enabled by HMC-CEC.

2 Reference Model

Figure 1 depicts the Reference Model of the Communicating Entities in Context (HMC-CEC) Use Case implemented as an AI Workflow (AIW) that includes AI Modules (AIM) per Technical Specification: AI Framework (MPAI-AIF). Three out of six AIMs in Figure 1 (Audio-Visual Scene Description, Entity Context Understanding, and Personal Status Display) are Composite AIMs, i.e., they include interconnected AIMs. An introduction to MPAI-AIF is provided here.

Figure 1 – Human-Machine Communication AIW

Note that:

Words beginning with a capital letter are defined in Definitions; words beginning with a small letter have the commonly understood meaning.
The Input Selector enables the Entity to inform the Machine through the Entity and Context Understanding AIM about use of Text vs. Speech in the communication, Language Preferences, and Selected Language in translation.
The Machine captures the information emitted by the Entity and its Context through Input Text, Input Speech, Input Audio and Input Visual.
The Input Portable Avatar is the Communication Item emitted by a communicating Machine.
The Audio-Visual Scene Descriptors are digital representations of a real audio-visual scene or a Virtual Audio-Visual Scene produced either by the Audio-Visual Scene Description AIM or the Audio-Visual Scene Integration and Description AIM.
To facilitate identification, each AIM is labelled with three letters indicating the Technical Specification that specifies it, followed by a hyphen “-”, followed by three letters uniquely identifying the AIM within that Technical Specification. For instance, Portable Avatar Demultiplexing is indicated as PAF-PDX, where PAF refers to Technical Specification: Portable Avatar Format (MPAI-PAF) and PDX refers to the Portable Avatar Demultiplexing AIM specified by MPAI-PAF.
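The labelling convention can be checked mechanically. The following is a minimal sketch, assuming that both parts of a label consist of exactly three capital letters, as in all the labels used in this Chapter; `AimLabel` and `parse_aim_label` are hypothetical names introduced for illustration only.

```python
import re
from typing import NamedTuple

# Three capital letters, a hyphen, three capital letters (e.g. "PAF-PDX").
# Assumption: labels use capital ASCII letters only.
AIM_LABEL = re.compile(r"([A-Z]{3})-([A-Z]{3})")


class AimLabel(NamedTuple):
    specification: str  # e.g. "PAF", referring to MPAI-PAF
    aim: str            # e.g. "PDX", the AIM within that specification


def parse_aim_label(label: str) -> AimLabel:
    """Split an AIM label into its specification and AIM parts."""
    match = AIM_LABEL.fullmatch(label)
    if match is None:
        raise ValueError(f"not a valid AIM label: {label!r}")
    return AimLabel(match.group(1), match.group(2))


if __name__ == "__main__":
    print(parse_aim_label("PAF-PDX"))  # AimLabel(specification='PAF', aim='PDX')
```

A label that does not follow the pattern, such as `PAF_PDX`, raises a `ValueError` rather than being silently accepted.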

3 Input/Output Data

Table 1 gives the Input/Output Data of the HMC-CEC AIW.

Table 1 – Input/Output Data of the HMC-CEC AIW

Input	Description
Portable Avatar	A Communication Item emitted by the Entity communicating with the ego Entity.
Input Selector	Selector containing data specifying the media and the language used in the communication.
Input Text	Text Object generated by the communicating Entity as information additional to or in lieu of Speech Object.
Input Audio	The audio scene captured by the Machine.
Input Visual	The visual scene captured by the Machine.
Output	Description
Portable Avatar	The Communication Item produced by the Machine.
Output Audio	The rendered audio corresponding to the Audio in the Communication Item.
Output Visual	The rendered visual corresponding to the Visual in the Communication Item.
Output Text	The Text contained in a Communication Item or associated with Output Audio and Output Visual.

4 Functions of AI Modules

Table 2 gives the functions of HMC-CEC AIMs.

Table 2 – Functions of AI Modules

AIM	Functions
Audio-Visual Scene Integration and Description	Adds Avatar to Audio-Visual Scene in Portable Avatar providing Audio-Visual Scene Descriptors.
Audio-Visual Scene Description	Provides Audio-Visual Scene Descriptors.
Entity Context Understanding	Understands the information emitted by the Entity and its Context.
Entity Dialogue Processing	Produces Text and Personal Status of Machine in response to inputs.
Text-to-Text Translation	Produces Machine Translated Text from Machine Text and Personal Status.
Personal Status Display	Produces Portable Avatar.
Audio-Visual Scene Rendering	Renders the content of the Portable Avatar.

5 Input/Output Data of AI Modules

Table 3 gives the I/O Data of the AIMs of HMC-CEC. Note that an ID can either be specified as an Instance Identifier or refer to a generic identifier.

Table 3 – Input/Output Data of AI Modules

AIM	Receives	Produces
Audio-Visual Scene Integration and Description	Input Portable Avatar	Audio-Visual Scene Descriptors
Audio-Visual Scene Description	Input Audio, Input Visual	Audio-Visual Scene Descriptors
Entity Context Understanding	Audio-Visual Scene Descriptors, Input Text, Input Selector	Audio-Visual Scene Geometry, Personal Status, Entity ID, Text, Meaning, Instance Identifier
Entity Dialogue Processing	Audio-Visual Scene Geometry, Personal Status, Entity ID, Text, Meaning, Instance Identifier	Machine Personal Status, Machine Avatar ID, Machine Text
Text-to-Text Translation	Machine Text, Machine Personal Status	Machine Translated Text
Personal Status Display	Machine Personal Status, Machine Avatar ID, Machine Text	Output Portable Avatar
Audio-Visual Scene Rendering	Output Portable Avatar	Output Text, Output Audio, Output Visual
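The data flow implied by Table 3 can be sketched as a dependency graph. The example below transcribes the table, checks that every data item an AIM receives is either an input to the AIW or produced by another AIM, and derives one possible processing order. The boundaries between data items in multi-item cells are a reading of the table, and the function names are hypothetical; this is not how an MPAI-AIF implementation schedules AIMs.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Data entering the AIW from outside, using the names found in Table 3.
AIW_INPUTS = {"Input Portable Avatar", "Input Selector", "Input Text",
              "Input Audio", "Input Visual"}

# Assumption: this is how the Entity Context Understanding outputs split into items.
ECU_OUTPUTS = {"Audio-Visual Scene Geometry", "Personal Status", "Entity ID",
               "Text", "Meaning", "Instance Identifier"}

# AIM name -> (data received, data produced), transcribed from Table 3.
AIMS = {
    "Audio-Visual Scene Integration and Description":
        ({"Input Portable Avatar"}, {"Audio-Visual Scene Descriptors"}),
    "Audio-Visual Scene Description":
        ({"Input Audio", "Input Visual"}, {"Audio-Visual Scene Descriptors"}),
    "Entity Context Understanding":
        ({"Audio-Visual Scene Descriptors", "Input Text", "Input Selector"},
         ECU_OUTPUTS),
    "Entity Dialogue Processing":
        (ECU_OUTPUTS,
         {"Machine Personal Status", "Machine Avatar ID", "Machine Text"}),
    "Text-to-Text Translation":
        ({"Machine Text", "Machine Personal Status"}, {"Machine Translated Text"}),
    "Personal Status Display":
        ({"Machine Personal Status", "Machine Avatar ID", "Machine Text"},
         {"Output Portable Avatar"}),
    "Audio-Visual Scene Rendering":
        ({"Output Portable Avatar"},
         {"Output Text", "Output Audio", "Output Visual"}),
}


def producers(aims):
    """Map each data item to the set of AIMs that produce it."""
    result = {}
    for name, (_, produced) in aims.items():
        for item in produced:
            result.setdefault(item, set()).add(name)
    return result


def unresolved_inputs(aims, aiw_inputs):
    """Return (AIM, data) pairs whose data has no source inside or outside the AIW."""
    made = producers(aims)
    return {(name, item)
            for name, (received, _) in aims.items()
            for item in received
            if item not in aiw_inputs and item not in made}


def execution_order(aims):
    """Return one order in which every AIM runs after the AIMs it depends on."""
    made = producers(aims)
    graph = {name: set().union(*(made.get(item, set()) for item in received))
             for name, (received, _) in aims.items()}
    return list(TopologicalSorter(graph).static_order())


if __name__ == "__main__":
    print(unresolved_inputs(AIMS, AIW_INPUTS))
    for name in execution_order(AIMS):
        print(name)
```

Several orders satisfy the dependencies, for example the two scene description AIMs may run in either order, so the printed order is only one valid choice.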

6 AIW, AIMs, and JSON Metadata

Table 4 – AIW, AIMs, and JSON Metadata

AIW	AIMs/1	AIMs/2	AIMs/3	Name	JSON
HMC-CEC				Communicating Entities in Context	X
	HMC-SID			AV Scene Integration and Description	X
	OSD-AVS			Audio-Visual Scene Description	X
		CAE-ASD		Audio Scene Description	X
			CAE-AAT	Audio Analysis Transform	X
			CAE-ASL	Audio Source Localisation	X
			CAE-ASE	Audio Separation and Enhancement	X
			CAE-AST	Audio Synthesis Transform	X
			CAE-AMX	Audio Descriptors Multiplexing	X
		OSD-VSD		Visual Scene Description	X
		OSD-AVA		Audio-Visual Alignment	X
	HMC-ECU			Entity And Context Understanding	X
		OSD-SDX		Audio-Visual Scene Demultiplexing	X
		MMC-ASR		Automatic Speech Recognition	X
		OSD-VOI		Visual Object Identification	X
			OSD-VDI	Visual Direction Identification	X
			OSD-VOE	Visual Object Extraction	X
			OSD-VII	Visual Instance Identification	X
		CAE-AOI		Audio Object Identification	X
		MMC-NLU		Natural Language Understanding	X
		MMC-PSE		Personal Status Extraction	X
			MMC-ITD	Entity Text Description	X
			MMC-ISD	Entity Speech Description	X
			PAF-IFD	Entity Face Description	X
			PAF-IBD	Entity Body Description	X
			MMC-PTI	PS-Text Interpretation	X
			MMC-PSI	PS-Speech Interpretation	X
			PAF-PFI	PS-Face Interpretation	X
			PAF-PGI	PS-Gesture Interpretation	X
			MMC-PMX	Personal Status Multiplexing	X
		MMC-TTT		Text-to-Text Translation	X
	MMC-EDP			Entity Dialogue Processing	X
	MMC-TTT			Text-to-Text Translation	X
	PAF-PSD			Personal Status Display	X
		MMC-TTS		Text-to-Speech	X
		PAF-IFD		Entity Face Description	X
		PAF-IBD		Entity Body Description	X
		PAF-PMX		Portable Avatar Multiplexing	X
	PAF-AVR			Audio-Visual Scene Rendering	X
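The column in which a label appears in Table 4 (AIW, AIMs/1, AIMs/2, AIMs/3) expresses nesting: a Composite AIM is followed by the AIMs it includes, one column to the right. The sketch below rebuilds that hierarchy from an excerpt of the table, assuming each row is given as a (depth, label, name) triple with depth 0 for the AIW column. `Node`, `build_tree`, and `basic_aims` are hypothetical names, and the code is not related to the JSON Metadata format linked from the table.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    label: str
    name: str
    children: list = field(default_factory=list)


# Excerpt of Table 4 as (depth, label, name); depth is the column index.
ROWS = [
    (0, "HMC-CEC", "Communicating Entities in Context"),
    (1, "OSD-AVS", "Audio-Visual Scene Description"),
    (2, "CAE-ASD", "Audio Scene Description"),
    (3, "CAE-AAT", "Audio Analysis Transform"),
    (3, "CAE-ASL", "Audio Source Localisation"),
    (2, "OSD-VSD", "Visual Scene Description"),
    (2, "OSD-AVA", "Audio-Visual Alignment"),
    (1, "MMC-EDP", "Entity Dialogue Processing"),
]


def build_tree(rows):
    """Build the AIW/AIM hierarchy from rows listed in table order."""
    root = None
    stack = []  # stack[d] holds the most recent node seen at depth d
    for depth, label, name in rows:
        node = Node(label, name)
        if depth == 0:
            root = node
        else:
            stack[depth - 1].children.append(node)  # attach to the nearest parent
        del stack[depth:]  # discard deeper or same-depth nodes from earlier rows
        stack.append(node)
    return root


def basic_aims(node):
    """Yield the nodes that include no other AIMs, in table order."""
    if not node.children:
        yield node
    for child in node.children:
        yield from basic_aims(child)


if __name__ == "__main__":
    tree = build_tree(ROWS)
    print([child.label for child in tree.children])
    print([aim.label for aim in basic_aims(tree)])
```

In this representation a Composite AIM is simply a node with children, so OSD-AVS and CAE-ASD are composite in the excerpt while MMC-EDP is not.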