MPAI-MMC V2.2 AIWs Conversation About a Scene


1 Functions	2 Reference Model	3 I/O Data
4 Functions of AI Modules	5 I/O Data of AI Modules	6 AIW, AIMs, and JSON Metadata
7 Reference Software	8 Conformance Testing	9 Performance Assessment

1 Functions

This Use Case addresses the case of a human holding a conversation with a Machine:

The human converses with the Machine, using Speech, Face, and Gesture to indicate the object in the Environment that s/he wishes to talk about or ask questions about.
The Machine:
- Sees and hears an Environment containing a speaking human and some scattered objects.
- Recognises the human’s Speech and obtains the human’s Personal Status by capturing Speech, Face, and Gesture.
- Understands which object the human is referring to and generates an avatar that:
  - Utters Speech conveying a synthetic Personal Status that is relevant to the human’s Personal Status as shown by his/her Speech, Face, and Gesture, and
  - Displays a face conveying a Personal Status that is relevant to the human’s Personal Status and to the response the Machine intends to make.
- Renders the Scene that it perceives from a human-selected Point of View. The objects in the scene are labelled with the Machine’s understanding of their semantics so that the human can understand how the Machine sees the Environment.

2 Reference Model

Figure 1 gives the Conversation About a Scene Reference Model including the input/output data, the AIMs, and the data exchanged between and among the AIMs.

Figure 1 – Reference Model of Conversation About a Scene (MMC-CAS) AIW

The Machine operates according to the following workflow:

Visual Scene Description produces Visual Scene Descriptors, Body Descriptors, Face Descriptors, Visual Scene Geometry, and Visual Objects from Input Visual.
Automatic Speech Recognition produces Recognised Text from Input Speech.
Visual Object Identification produces Visual Object Instance ID from Visual Objects, Body Descriptors, and Visual Scene Geometry.
Natural Language Understanding produces Meaning and Refined Text from Recognised Text and Visual Object Instance ID.
Personal Status Extraction produces Input Personal Status from Meaning, Input Speech, Face Descriptors, and Body Descriptors.
Entity Dialogue Processing produces Machine Text and Machine Personal Status from Input Personal Status, Meaning, and Refined Text.
Personal Status Display produces Machine Portable Avatar from Machine Text and Machine Personal Status.
Audio-Visual Scene Rendering renders the Audio-Visual Scene:
1. Described by the Visual Scene Descriptors.
2. Integrated with the Machine’s Portable Avatar, depending on the View Selector.
3. As seen from the human-selected Point of View.
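
The workflow above can be read as a chain of function calls in which each AIM consumes the outputs of earlier AIMs. The following Python sketch illustrates only that data flow. All function names, signatures, and returned values are hypothetical placeholders and are not part of the specification; a real AIM would run a model where the sketch returns a constant.

```python
from dataclasses import dataclass


@dataclass
class SceneDescription:
    """Hypothetical container for the outputs of Visual Scene Description."""
    body_descriptors: str
    face_descriptors: str
    scene_geometry: str
    visual_objects: list


# Each function stands in for one AIM. Bodies are placeholders, not real models.
def visual_scene_description(input_visual):
    return SceneDescription("body", "face", "geometry", ["cup", "book"])


def automatic_speech_recognition(input_speech):
    return "what is that"


def visual_object_identification(scene):
    # Placeholder: pretend the human indicates the first object.
    return f"{scene.visual_objects[0]}#1"


def natural_language_understanding(recognised_text, object_id):
    refined_text = recognised_text.replace("that", object_id)
    meaning = {"intent": "ask", "object": object_id}
    return meaning, refined_text


def personal_status_extraction(meaning, input_speech, face, body):
    return {"emotion": "curious"}


def entity_dialogue_processing(personal_status, meaning, refined_text):
    machine_text = f"You asked about {meaning['object']}."
    machine_personal_status = {"emotion": "helpful"}
    return machine_text, machine_personal_status


def personal_status_display(machine_text, machine_personal_status):
    return {"speech": machine_text, "status": machine_personal_status}


def audio_visual_scene_rendering(scene, avatar, view_selector, point_of_view):
    entities = list(scene.visual_objects) + ["human"]
    if view_selector:  # the avatar is rendered only if the human selects it
        entities.append("machine avatar")
    return {
        "output_speech": avatar["speech"],
        "output_visual": {"point_of_view": point_of_view, "entities": entities},
    }


def conversation_about_a_scene(input_visual, input_speech, view_selector, point_of_view):
    """Call the AIM stand-ins in the order given by the workflow."""
    scene = visual_scene_description(input_visual)
    recognised = automatic_speech_recognition(input_speech)
    object_id = visual_object_identification(scene)
    meaning, refined = natural_language_understanding(recognised, object_id)
    status = personal_status_extraction(
        meaning, input_speech, scene.face_descriptors, scene.body_descriptors
    )
    text, machine_status = entity_dialogue_processing(status, meaning, refined)
    avatar = personal_status_display(text, machine_status)
    return audio_visual_scene_rendering(scene, avatar, view_selector, point_of_view)
```

Calling `conversation_about_a_scene("frame", "audio", True, "front")` returns a dictionary with the placeholder Output Speech and Output Visual. Passing `False` as the View Selector leaves the avatar out of the rendered entities.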

3 I/O Data

Table 1 gives the input/output data of Conversation About a Scene.

Table 1 – I/O data of Conversation About a Scene

Input data	From	Description
View Selector	Human	Selects whether the Machine is rendered in the scene.
Input Visual	Camera	Points to human and scene.
Input Speech	Microphone	Speech of human.
Point of View	Human	The point of view of the Audio-Visual Scene displayed by Audio-Visual Scene Rendering.
Output data	To	Description
Output Visual	Human	Rendering of the Visual Scene as perceived by the Machine and seen from the Point of View, containing labelled objects, the human and, depending on View Selector, the Machine.
Output Speech	Human	Speech of Portable Avatar produced by Machine.

4 Functions of AI Modules

Table 2 provides the functions of the AI Modules of the Conversation About a Scene Use Case.

Table 2 – Functions of AI Modules of Conversation About a Scene

AIM	Functions
Visual Scene Description	1. Receives Input Visual 2. Provides Body Descriptors, Face Descriptors, Visual Objects, and Visual Scene Geometry.
Visual Object Identification	1. Receives Body Descriptors and non-human Visual Objects 2. Provides the Instance ID of the Visual Object indicated by the human.
Automatic Speech Recognition	1. Receives Input Speech 2. Provides Recognised Text.
Natural Language Understanding	1. Receives Visual Object Instance ID and Recognised Text 2. Provides Refined Text and Meaning.
Personal Status Extraction	1. Receives Input Speech, Body Descriptors, Face Descriptors, and Meaning. 2. Provides Personal Status.
Entity Dialogue Processing	1. Receives Refined Text, Meaning, and Personal Status. 2. Produces Machine Text and Machine Personal Status.
Personal Status Display	1. Receives Machine’s Personal Status and Text. 2. Provides Machine Portable Avatar.
Audio-Visual Scene Rendering	1. Receives the Descriptors of the Visual Scene perceived by Machine including the Portable Avatar of the Personal Status Display. 2. Renders the Audio-Visual Scene from the Point of View selected by human.

5 I/O Data of AI Modules

Table 3 gives the list of AIMs with their I/O Data.

Table 3 – AI Modules of Conversation About a Scene

AIM	Receives	Produces
Visual Scene Description	Input Visual	1. Visual Scene Descriptors 2. Body Descriptors 3. Face Descriptors 4. Visual Scene Geometry 5. Visual Objects
Visual Object Identification	1. Body Descriptors 2. Visual Objects 3. Visual Scene Geometry	1. Visual Object Instance Identifier
Automatic Speech Recognition	1. Input Speech	1. Recognised Text
Natural Language Understanding	1. Recognised Text 2. Visual Object Instance Identifier	1. Meaning 2. Refined Text
Personal Status Extraction	1. Body Descriptors 2. Face Descriptors 3. Input Speech 4. Meaning	1. Personal Status
Entity Dialogue Processing	1. Personal Status 2. Meaning 3. Visual Object Instance Identifier 4. Refined Text	1. Machine Text 2. Machine Personal Status
Personal Status Display	1. Machine Text 2. Machine Personal Status	1. Machine Portable Avatar
Audio-Visual Scene Rendering	1. Visual Scene Descriptors 2. Machine Portable Avatar 3. View Selector 4. Point of View	1. Output Speech 2. Output Visual
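
A table of this kind can be checked mechanically: every datum an AIM receives should be either an input of the AIW or the output of another AIM, and the AIMs should admit an execution order. The following Python sketch shows one way to do this with the standard-library `graphlib` module. The dictionary layout and function names are hypothetical. Where the table and the workflow of Section 2 use different names, the sketch follows the workflow (for example Input Personal Status, Machine Text, and Machine Portable Avatar as an input of Audio-Visual Scene Rendering); this is an assumption, not a statement of the specification.

```python
from graphlib import TopologicalSorter

# Inputs of the AIW as listed in Table 1.
AIW_INPUTS = {"Input Visual", "Input Speech", "View Selector", "Point of View"}

# AIM name -> (received data, produced data). Names are normalised (assumption).
AIMS = {
    "Visual Scene Description": (
        {"Input Visual"},
        {"Visual Scene Descriptors", "Body Descriptors", "Face Descriptors",
         "Visual Scene Geometry", "Visual Objects"},
    ),
    "Visual Object Identification": (
        {"Body Descriptors", "Visual Objects", "Visual Scene Geometry"},
        {"Visual Object Instance ID"},
    ),
    "Automatic Speech Recognition": ({"Input Speech"}, {"Recognised Text"}),
    "Natural Language Understanding": (
        {"Recognised Text", "Visual Object Instance ID"},
        {"Meaning", "Refined Text"},
    ),
    "Personal Status Extraction": (
        {"Body Descriptors", "Face Descriptors", "Input Speech", "Meaning"},
        {"Input Personal Status"},
    ),
    "Entity Dialogue Processing": (
        {"Input Personal Status", "Meaning", "Visual Object Instance ID", "Refined Text"},
        {"Machine Text", "Machine Personal Status"},
    ),
    "Personal Status Display": (
        {"Machine Text", "Machine Personal Status"},
        {"Machine Portable Avatar"},
    ),
    "Audio-Visual Scene Rendering": (
        {"Visual Scene Descriptors", "Machine Portable Avatar",
         "View Selector", "Point of View"},
        {"Output Speech", "Output Visual"},
    ),
}


def producers():
    """Map each produced datum to the AIM that produces it."""
    return {datum: aim for aim, (_, produced) in AIMS.items() for datum in produced}


def dangling_inputs():
    """Return (AIM, datum) pairs whose datum is neither an AIW input nor produced."""
    made = producers()
    return {
        (aim, datum)
        for aim, (received, _) in AIMS.items()
        for datum in received
        if datum not in made and datum not in AIW_INPUTS
    }


def execution_order():
    """Return one valid order in which the AIMs can run."""
    made = producers()
    graph = {
        aim: {made[datum] for datum in received if datum in made}
        for aim, (received, _) in AIMS.items()
    }
    return list(TopologicalSorter(graph).static_order())
```

With the names above, `dangling_inputs()` returns an empty set and `execution_order()` ends with Audio-Visual Scene Rendering, since every other AIM contributes directly or indirectly to its inputs. Removing an output from one entry, for example Machine Text, makes the corresponding consumer appear in `dangling_inputs()`.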

6 AIW, AIMs, and JSON Metadata

Table 4 provides the links to the AIW and AIM specifications and to the JSON syntaxes. AIMs/1 indicates that the column contains Composite AIMs and AIMs/2 indicates that the column contains their Basic AIMs.

Table 4 – AIW, AIMs, and JSON Metadata

AIW	AIMs/1	AIMs/2	Name	JSON
MMC-CAS			Conversation About a Scene	X
	OSD-VSD		Visual Scene Description	X
	OSD-VOI		Visual Object Identification	X
		OSD-VDI	Visual Direction Identification	X
		OSD-VOE	Visual Object Extraction	X
		OSD-VII	Visual Instance Identification	X
	MMC-ASR		Automatic Speech Recognition	X
	MMC-NLU		Natural Language Understanding	X
	MMC-PSE		Personal Status Extraction	X
		MMC-ETD	Entity Text Description	X
		MMC-ESD	Entity Speech Description	X
		PAF-EFD	Entity Face Description	X
		PAF-EBD	Entity Body Description	X
		MMC-PTI	PS-Text Interpretation	X
		MMC-PSI	PS-Speech Interpretation	X
		PAF-PFI	PS-Face Interpretation	X
		PAF-PGI	PS-Gesture Interpretation	X
		MMC-PMX	Personal Status Multiplexing	X
	MMC-EDP		Entity Dialogue Processing	X
	OSD-PSD		Personal Status Display	X
		MMC-TTS	Text-to-Speech	X
		PAF-EFD	Entity Face Description	X
		PAF-EBD	Entity Body Description	X
		PAF-PMX	Portable Avatar Multiplexing	X
	PAF-AVR		Audio-Visual Scene Rendering	X
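
The two-level structure of Table 4 can be held in a simple mapping from each AIM in the AIMs/1 column to its entries in the AIMs/2 column. The following Python sketch is an illustration only: the dictionary and function names are hypothetical, and the layout is not the MPAI JSON Metadata format, which is defined by the linked JSON files. It assumes that an AIMs/1 entry with no AIMs/2 rows is executed as a single module.

```python
# AIMs/1 entry -> its AIMs/2 entries, transcribed from Table 4.
COMPOSITION = {
    "OSD-VSD": [],
    "OSD-VOI": ["OSD-VDI", "OSD-VOE", "OSD-VII"],
    "MMC-ASR": [],
    "MMC-NLU": [],
    "MMC-PSE": ["MMC-ETD", "MMC-ESD", "PAF-EFD", "PAF-EBD", "MMC-PTI",
                "MMC-PSI", "PAF-PFI", "PAF-PGI", "MMC-PMX"],
    "MMC-EDP": [],
    "OSD-PSD": ["MMC-TTS", "PAF-EFD", "PAF-EBD", "PAF-PMX"],
    "PAF-AVR": [],
}


def leaf_aims(composition):
    """List each distinct module once: Basic AIMs, or the AIM itself if it has none."""
    leaves = []
    for aim, basics in composition.items():
        for leaf in (basics or [aim]):
            if leaf not in leaves:
                leaves.append(leaf)
    return leaves


def shared_basic_aims(composition):
    """Return the Basic AIMs that appear under more than one AIMs/1 entry."""
    owners = {}
    for aim, basics in composition.items():
        for basic in basics:
            owners.setdefault(basic, []).append(aim)
    return {basic: aims for basic, aims in owners.items() if len(aims) > 1}
```

In this transcription, `shared_basic_aims(COMPOSITION)` reports that PAF-EFD and PAF-EBD are listed under both MMC-PSE and OSD-PSD, so an implementation might reuse one module for both.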

7 Reference Software

8 Conformance Testing

9 Performance Assessment
