Conversation About a Scene

1 Functions	2 Reference Model	3 I/O Data
4 Functions of AI Modules	5 I/O Data of AI Modules	6 AIW, AIMs, and JSON Metadata

1 Scope of Conversation About a Scene

This Use Case addresses the case of a human holding a conversation with a Machine:

The human converses with the Machine, using Speech, Face, and Gesture to indicate the object in the Environment that s/he wishes to talk or ask questions about.
The Machine
- Sees and hears an Environment containing a speaking human and some scattered objects.
- Recognises the human’s Speech and obtains the human’s Personal Status by capturing Speech, Face, and Gesture.
- Understands which object the human is referring to and generates an avatar that:
  - Utters Speech conveying a synthetic Personal Status that is relevant to the human’s Personal Status as shown by his/her Speech, Face, and Gesture, and
  - Displays a face conveying a Personal Status that is relevant to the human’s Personal Status and to the response the Machine intends to make.
- Renders the Scene that it perceives from a human-selected Point of View. The objects in the scene are labelled with the Machine’s understanding of their semantics so that the human can understand how the Machine sees the Environment.

2 Reference Model

Figure 3 gives the Conversation About a Scene Reference Model including the input/output data, the AIMs, and the data exchanged between and among the AIMs.

Figure 3 – Reference Model of Conversation About a Scene

The Machine operates according to the following workflow:

Visual Scene Description produces Visual Scene Descriptors, Body Descriptors, Face Descriptors, Visual Scene Geometry, and Visual Objects from Input Visual.
Automatic Speech Recognition produces Recognised Text from Input Speech.
Visual Object Identification produces Visual Object Instance ID from Visual Objects, Body Descriptors, and Visual Scene Geometry.
Natural Language Understanding produces Meaning and Refined Text from Recognised Text and Visual Object Instance ID.
Personal Status Extraction produces Input Personal Status from Meaning, Input Speech, Face Descriptors, and Body Descriptors.
Entity Dialogue Processing produces Machine Text and Machine Personal Status from Input Personal Status, Meaning, and Refined Text.
Personal Status Display produces Machine Portable Avatar from Machine Text and Machine Personal Status.
Audio-Visual Scene Rendering renders the Scene as seen from the human-selected Point of View using the Visual Scene Descriptors. The rendering is updated continuously as the Machine improves its understanding of the Scene and its objects.
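The workflow above can be read as a chain of modules that exchange named data. The following is a minimal sketch of that reading, not an implementation of the AIMs: `stub_aim` and `run_workflow` are hypothetical helpers, each AIM is replaced by a placeholder that returns a tagged string, and the assignment of Face Descriptors and Visual Scene Descriptors to Visual Scene Description is an assumption taken from Table 3.

```python
from typing import Callable, Dict, List, Tuple

# (AIM name, input data names, output data names), in workflow order.
Step = Tuple[str, List[str], List[str]]

WORKFLOW: List[Step] = [
    ("Visual Scene Description", ["Input Visual"],
     ["Visual Scene Descriptors", "Body Descriptors", "Face Descriptors",
      "Visual Scene Geometry", "Visual Objects"]),
    ("Automatic Speech Recognition", ["Input Speech"], ["Recognised Text"]),
    ("Visual Object Identification",
     ["Visual Objects", "Body Descriptors", "Visual Scene Geometry"],
     ["Visual Object Instance ID"]),
    ("Natural Language Understanding",
     ["Recognised Text", "Visual Object Instance ID"],
     ["Meaning", "Refined Text"]),
    ("Personal Status Extraction",
     ["Meaning", "Input Speech", "Face Descriptors", "Body Descriptors"],
     ["Input Personal Status"]),
    ("Entity Dialogue Processing",
     ["Input Personal Status", "Meaning", "Refined Text"],
     ["Machine Text", "Machine Personal Status"]),
    ("Personal Status Display",
     ["Machine Text", "Machine Personal Status"],
     ["Machine Portable Avatar"]),
    ("Audio-Visual Scene Rendering",
     ["Visual Scene Descriptors", "Point of View"],
     ["Output Visual"]),
]


def stub_aim(name: str, outputs: List[str]) -> Callable[[Dict[str, object]], Dict[str, object]]:
    """Hypothetical placeholder AIM: emits one tagged string per output."""
    def run(inputs: Dict[str, object]) -> Dict[str, object]:
        return {out: f"<{out} from {name}>" for out in outputs}
    return run


def run_workflow(external: Dict[str, object]) -> Dict[str, object]:
    """Run the steps in order, passing data between AIMs by name."""
    data = dict(external)
    for name, inputs, outputs in WORKFLOW:
        missing = [i for i in inputs if i not in data]
        if missing:
            raise KeyError(f"{name} is missing inputs: {missing}")
        aim = stub_aim(name, outputs)
        data.update(aim({i: data[i] for i in inputs}))
    return data


result = run_workflow(
    {"Input Visual": "frame", "Input Speech": "audio", "Point of View": "front"}
)
print(result["Machine Portable Avatar"])
print(result["Output Visual"])
```

Passing data by name keeps the sketch close to the wording of the workflow: a step can run only when every input it names has already been supplied by the human, a sensor, or an earlier step.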

3 I/O Data

Table 1 gives the input/output data of Conversation About a Scene.

Table 1 – I/O data of Conversation About a Scene

Input data	From	Description
Input Visual	Camera	Points to human and scene.
Input Speech	Microphone	Speech of human.
Point of View	Human	The point of view from which Audio-Visual Scene Rendering displays the Scene.
Output data	To	Description
Output Visual	Human	Rendering of the Scene containing labelled objects as perceived by Machine and seen from the Point of View.
Machine Portable Avatar	Human	Portable Avatar produced by Machine.

4 Functions of AI Modules

Table 2 gives the functions of the AI Modules of the Conversation About a Scene Use Case.

Table 2 – Functions of AI Modules of Conversation About a Scene

AIM	Functions
Visual Scene Description	1. Receives Input Visual. 2. Provides Visual Scene Descriptors, Body Descriptors, Face Descriptors, Visual Objects, and Visual Scene Geometry.
Visual Object Identification	1. Receives Body Descriptors and non-human Visual Objects. 2. Provides the Instance ID of the Visual Object indicated by the human.
Automatic Speech Recognition	1. Receives Input Speech. 2. Provides Recognised Text.
Natural Language Understanding	1. Receives Visual Object Instance ID and Recognised Text. 2. Refines Text and extracts Meaning.
Personal Status Extraction	1. Receives Input Speech, Body Descriptors, Face Descriptors, and Meaning. 2. Provides Personal Status.
Entity Dialogue Processing	1. Receives Refined Text and Personal Status. 2. Produces Machine’s Text and Personal Status.
Audio-Visual Scene Rendering	1. Receives the Descriptors of the Visual Scene perceived by Machine. 2. Renders the Visual Scene from the Point of View selected by human.
Personal Status Display	1. Receives Machine’s Personal Status and Text. 2. Provides Machine Portable Avatar.

5 I/O Data of AI Modules

Table 3 gives the list of AIMs with their I/O Data.

Table 3 – I/O Data of AI Modules of Conversation About a Scene

AIM	Receives	Produces
Visual Scene Description	1. Input Visual	1. Visual Scene Descriptors 2. Body Descriptors 3. Face Descriptors 4. Visual Scene Geometry 5. Visual Objects
Visual Object Identification	1. Body Object 2. Visual Objects 3. Visual Scene Geometry	1. Visual Object Instance Identifier
Automatic Speech Recognition	1. Input Speech	1. Recognised Text
Natural Language Understanding	1. Recognised Text 2. Visual Object Instance Identifier	1. Meaning 2. Refined Text
Personal Status Extraction	1. Body Object 2. Face Object 3. Input Speech 4. Meaning	1. Personal Status
Entity Dialogue Processing	1. Personal Status 2. Meaning 3. Refined Text	1. Machine Text 2. Machine Personal Status
Audio-Visual Scene Rendering	1. Visual Scene Descriptors 2. Point of View	1. Output Visual
Personal Status Display	1. Machine Text 2. Machine Personal Status	1. Machine Portable Avatar
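
The Receives and Produces columns of Table 3 are enough to derive a valid execution order for the AIMs and to list the data that must come from outside the workflow. The sketch below illustrates this with Python's standard `graphlib` module (Python 3.9 or later). It rests on two assumptions that should be checked against the specification: the Body Object and Face Object inputs are taken to be the Body Descriptors and Face Descriptors produced by Visual Scene Description, and Entity Dialogue Processing is taken to produce Machine Text, as stated in the workflow of section 2. The dictionary name `AIMS` is illustrative only.

```python
from graphlib import TopologicalSorter

# AIM name -> (received data, produced data), following Table 3
# under the naming assumptions stated in the lead-in.
AIMS = {
    "Visual Scene Description": (
        ["Input Visual"],
        ["Visual Scene Descriptors", "Body Descriptors", "Face Descriptors",
         "Visual Scene Geometry", "Visual Objects"]),
    "Visual Object Identification": (
        ["Body Descriptors", "Visual Objects", "Visual Scene Geometry"],
        ["Visual Object Instance Identifier"]),
    "Automatic Speech Recognition": (["Input Speech"], ["Recognised Text"]),
    "Natural Language Understanding": (
        ["Recognised Text", "Visual Object Instance Identifier"],
        ["Meaning", "Refined Text"]),
    "Personal Status Extraction": (
        ["Body Descriptors", "Face Descriptors", "Input Speech", "Meaning"],
        ["Personal Status"]),
    "Entity Dialogue Processing": (
        ["Personal Status", "Meaning", "Refined Text"],
        ["Machine Text", "Machine Personal Status"]),
    "Audio-Visual Scene Rendering": (
        ["Visual Scene Descriptors", "Point of View"], ["Output Visual"]),
    "Personal Status Display": (
        ["Machine Text", "Machine Personal Status"],
        ["Machine Portable Avatar"]),
}

# Map each data item to the AIM that produces it.
producer = {d: aim for aim, (_, produces) in AIMS.items() for d in produces}

# For each AIM, the set of AIMs it depends on (its predecessors).
graph = {
    aim: {producer[d] for d in receives if d in producer}
    for aim, (receives, _) in AIMS.items()
}

# One valid execution order: every AIM appears after its predecessors.
order = list(TopologicalSorter(graph).static_order())

# Data that no AIM produces must be supplied from outside the workflow.
external_inputs = sorted(
    {d for receives, _ in AIMS.values() for d in receives if d not in producer}
)

print(order)
print(external_inputs)
```

Under these assumptions the external inputs are exactly the three input rows of Table 1. An input name that no AIM produces and that is missing from Table 1 would point to a naming mismatch in the tables.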

6 AIW, AIMs, and JSON Metadata

Table 4 – AIW, AIMs, and JSON Metadata
