1     Scope of Conversation About a Scene

2     Reference Architecture of Conversation About a Scene

3     I/O Data of Conversation About a Scene

4     Functions of AI Modules of Conversation About a Scene

5     I/O Data of AI Modules of Conversation About a Scene

6     JSON Metadata of Conversation About a Scene

1      Scope of Conversation About a Scene

This Use Case addresses the case of a human holding a conversation with a Machine:

  1. The human converses with the Machine, using Speech, Face, and Gesture to indicate the object in the Environment that s/he wishes to talk or ask questions about.
  2. The Machine
    • Sees and hears an Environment containing a speaking human and some scattered objects.
    • Recognises the human’s Speech and obtains the human’s Personal Status by capturing Speech, Face, and Gesture.
    • Understands which object the human is referring to and generates an avatar that:
      • Utters Speech conveying a synthetic Personal Status that is relevant to the human’s Personal Status as shown by his/her Speech, Face, and Gesture, and
      • Displays a face conveying a Personal Status that is relevant to the human’s Personal Status and to the response the Machine intends to make.
    • Renders the Scene that it perceives from a human-selected Point of View. The objects in the scene are labelled with the Machine’s understanding of their semantics so that the human can understand how the Machine sees the Environment.

2      Reference Architecture of Conversation About a Scene

Figure 3 gives the Conversation About a Scene Reference Model including the input/output data, the AIMs, and the data exchanged between and among the AIMs.

Figure 3 – Reference Model of Conversation About a Scene

The Machine operates according to the following workflow:

  1. Visual Scene Description produces Visual Scene Descriptors, Body Descriptors, Face Descriptors, Visual Scene Geometry, and Visual Objects from Input Visual.
  2. Automatic Speech Recognition produces Recognised Text from Input Speech.
  3. Visual Object Identification produces Visual Object Instance ID from Visual Objects, Body Descriptors, and Visual Scene Geometry.
  4. Natural Language Understanding produces Meaning and Refined Text from Recognised Text and Visual Object Instance ID.
  5. Personal Status Extraction produces Input Personal Status from Meaning, Input Speech, Face Descriptors, and Body Descriptors.
  6. Entity Dialogue Processing produces Machine Text and Machine Personal Status from Input Personal Status, Meaning, and Refined Text.
  7. Personal Status Display produces Machine Portable Avatar from Machine Text and Machine Personal Status.
  8. Audio-Visual Scene Rendering renders the Scene as seen from the user-selected Point of View using the Visual Scene Descriptors. The rendering is constantly updated as the Machine improves its understanding of the Scene and its objects.
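
The workflow above can be read as a dataflow graph: an AIM may run once every data item it consumes is available. The following Python sketch is illustrative only and is not part of the specification. It encodes the eight steps as input/output sets and derives one valid execution order with the standard-library `graphlib` module (Python 3.9 or later). The names `WORKFLOW`, `USE_CASE_INPUTS`, `dependencies`, `unresolved_inputs`, and `execution_order` are hypothetical. Two assumptions are made: the Visual Scene Description outputs follow Table 3, which also lists Face Descriptors and Visual Scene Descriptors, and the identifier consumed by Natural Language Understanding is taken to be the Visual Object Instance ID produced in step 3.

```python
from graphlib import TopologicalSorter

# AIM name -> (data consumed, data produced), following workflow steps 1-8.
# Assumption: Visual Scene Description outputs are taken from Table 3.
WORKFLOW = {
    "Visual Scene Description": (
        {"Input Visual"},
        {"Visual Scene Descriptors", "Body Descriptors", "Face Descriptors",
         "Visual Scene Geometry", "Visual Objects"},
    ),
    "Automatic Speech Recognition": ({"Input Speech"}, {"Recognised Text"}),
    "Visual Object Identification": (
        {"Visual Objects", "Body Descriptors", "Visual Scene Geometry"},
        {"Visual Object Instance ID"},
    ),
    "Natural Language Understanding": (
        {"Recognised Text", "Visual Object Instance ID"},
        {"Meaning", "Refined Text"},
    ),
    "Personal Status Extraction": (
        {"Meaning", "Input Speech", "Face Descriptors", "Body Descriptors"},
        {"Input Personal Status"},
    ),
    "Entity Dialogue Processing": (
        {"Input Personal Status", "Meaning", "Refined Text"},
        {"Machine Text", "Machine Personal Status"},
    ),
    "Personal Status Display": (
        {"Machine Text", "Machine Personal Status"},
        {"Machine Portable Avatar"},
    ),
    "Audio-Visual Scene Rendering": (
        {"Visual Scene Descriptors", "Point of View"},
        {"Output Visual"},
    ),
}

# Data supplied from outside the workflow (Table 1 inputs).
USE_CASE_INPUTS = {"Input Visual", "Input Speech", "Point of View"}


def dependencies(workflow):
    """Map each AIM to the set of AIMs that produce data it consumes."""
    producer = {item: aim for aim, (_, outs) in workflow.items() for item in outs}
    return {
        aim: {producer[item] for item in ins if item in producer}
        for aim, (ins, _) in workflow.items()
    }


def unresolved_inputs(workflow, external):
    """Return consumed data items that no AIM produces and no input supplies."""
    produced = set().union(*(outs for _, outs in workflow.values()))
    consumed = set().union(*(ins for ins, _ in workflow.values()))
    return consumed - produced - external


def execution_order(workflow):
    """Return one valid AIM ordering; graphlib raises CycleError on a cycle."""
    return list(TopologicalSorter(dependencies(workflow)).static_order())


if __name__ == "__main__":
    for position, aim in enumerate(execution_order(WORKFLOW), start=1):
        print(position, aim)
```

Several orderings are valid because the speech path and the visual path are independent until Natural Language Understanding; the sketch returns only one of them.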

3      I/O Data of Conversation About a Scene

Table 1 gives the input/output data of Conversation About a Scene.

Table 1 – I/O data of Conversation About a Scene

| Input data | From | Description |
|---|---|---|
| Input Visual | Camera | Points to human and scene. |
| Input Speech | Microphone | Speech of human. |
| Point of View | Human | The point of view of the scene displayed by Audio-Visual Scene Rendering. |

| Output data | To | Description |
|---|---|---|
| Output Visual | Human | Rendering of the Scene containing labelled objects as perceived by Machine and seen from the Point of View. |
| Machine Portable Avatar | Human | Portable Avatar produced by Machine. |

4      Functions of AI Modules of Conversation About a Scene

Table 2 provides the functions of the AI Modules of the Conversation About a Scene Use Case.

Table 2 – Functions of AI Modules of Conversation About a Scene

| AIM | Functions |
|---|---|
| Visual Scene Description | 1. Receives Input Visual. 2. Provides Visual Objects and Visual Scene Geometry. |
| Visual Object Identification | 1. Receives Body Descriptors and non-human Visual Objects. 2. Provides the Instance ID of the Visual Object indicated by the human. |
| Automatic Speech Recognition | 1. Receives Input Speech. 2. Provides Recognised Text. |
| Natural Language Understanding | 1. Receives Instance ID and Recognised Text. 2. Refines Text and extracts Meaning. |
| Personal Status Extraction | 1. Receives Input Speech, Body Descriptors, Face Descriptors, and Meaning. 2. Provides Personal Status. |
| Entity Dialogue Processing | 1. Receives Refined Text and Personal Status. 2. Produces Machine’s Text and Personal Status. |
| Audio-Visual Scene Rendering | 1. Receives the Descriptors of the Visual Scene perceived by Machine. 2. Renders the Visual Scene from the Point of View selected by human. |
| Personal Status Display | 1. Receives Machine’s Personal Status and Text. 2. Provides Machine Portable Avatar. |

5      I/O Data of AI Modules of Conversation About a Scene

Table 3 gives the list of AIMs with their I/O Data.

Table 3 – AI Modules of Conversation About a Scene

| AIM | Receives | Produces |
|---|---|---|
| Visual Scene Description | 1. Input Visual | 1. Visual Scene Descriptors; 2. Body Descriptors; 3. Face Descriptors; 4. Visual Scene Geometry; 5. Visual Objects |
| Visual Object Identification | 1. Body Object; 2. Visual Objects; 3. Visual Scene Geometry | 1. Visual Object Instance Identifier |
| Automatic Speech Recognition | 1. Input Speech | 1. Recognised Text |
| Natural Language Understanding | 1. Recognised Text; 2. Visual Object Instance Identifier | 1. Meaning; 2. Refined Text |
| Personal Status Extraction | 1. Body Object; 2. Face Object; 3. Input Speech; 4. Meaning | 1. Personal Status |
| Entity Dialogue Processing | 1. Personal Status; 2. Meaning; 3. Refined Text | 1. Machine Text; 2. Machine Personal Status |
| Audio-Visual Scene Rendering | 1. Visual Scene Descriptors; 2. Point of View | 1. Output Visual |
| Personal Status Display | 1. Machine Text; 2. Machine Personal Status | 1. Machine Portable Avatar |
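
The Receives and Produces columns of Table 3 amount to named ports: an AIM can fire once every item it receives is present. The Python sketch below illustrates this with a small data-driven executor. It is a sketch under stated assumptions, not an implementation of the specification: `run`, `stub_asr`, `stub_nlu`, and `AIMS` are hypothetical names, the two stub functions merely manipulate strings in place of real speech recognition and language understanding, and the Visual Object Instance Identifier is supplied directly rather than computed by Visual Object Identification.

```python
from typing import Callable, Dict, List, Tuple

# (AIM name, received data names, produced data names, function)
Aim = Tuple[str, List[str], List[str], Callable[..., tuple]]


def run(aims: List[Aim], external: Dict[str, object]) -> Dict[str, object]:
    """Fire each AIM once, as soon as all the data it receives is available."""
    data = dict(external)
    pending = list(aims)
    while pending:
        ready = [a for a in pending if all(name in data for name in a[1])]
        if not ready:
            missing = sorted({n for a in pending for n in a[1] if n not in data})
            raise ValueError(f"unsatisfied inputs: {missing}")
        for _name, ins, outs, fn in ready:
            # Each function returns a tuple aligned with its produced names.
            data.update(zip(outs, fn(*(data[n] for n in ins))))
        pending = [a for a in pending if a not in ready]
    return data


def stub_asr(speech: str) -> tuple:
    """Hypothetical stand-in: treats the 'speech' as text and lower-cases it."""
    return (speech.lower(),)


def stub_nlu(text: str, instance_id: str) -> tuple:
    """Hypothetical stand-in: resolves 'this' to the indicated object."""
    meaning = {"topic": instance_id}
    refined = text.replace("this", instance_id)
    return (meaning, refined)


# Listed out of order on purpose: the executor orders AIMs by data availability.
AIMS: List[Aim] = [
    ("Natural Language Understanding",
     ["Recognised Text", "Visual Object Instance Identifier"],
     ["Meaning", "Refined Text"], stub_nlu),
    ("Automatic Speech Recognition",
     ["Input Speech"], ["Recognised Text"], stub_asr),
]

if __name__ == "__main__":
    result = run(AIMS, {"Input Speech": "What is this?",
                        "Visual Object Instance Identifier": "cup-01"})
    print(result["Refined Text"])
```

Driving execution from data availability rather than from a fixed call sequence keeps the wiring in one place, so a mismatch between a produced name and a received name surfaces as an explicit error.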

6      Specification of Conversation About a Scene JSON Metadata and AIMs

Table 4 – AIMs and JSON Metadata

| AIW and AIMs | Name and AIW/AIM Specification | JSON |
|---|---|---|
| MMC-CAS | Conversation About a Scene | X |
| OSD-VSD | Visual Scene Description | X |
| OSD-VOI | Visual Object Identification | X |
| OSD-VDI | Visual Direction Identification | X |
| OSD-VOE | Visual Object Extraction | X |
| OSD-VII | Visual Instance Identification | X |
| MMC-ASR | Automatic Speech Recognition | X |
| MMC-NLU | Natural Language Understanding | X |
| MMC-PSE | Personal Status Extraction | X |
| MMC-ITD | Input Text Description | X |
| MMC-ISD | Input Speech Description | X |
| PAF-IFD | Input Face Description | X |
| PAF-IBD | Input Body Description | X |
| MMC-PTI | PS-Text Interpretation | X |
| MMC-PSI | PS-Speech Interpretation | X |
| PAF-PFI | PS-Face Interpretation | X |
| PAF-PGI | PS-Gesture Interpretation | X |
| MMC-PMX | Personal Status Multiplexing | X |
| PAF-AVR | Audio-Visual Scene Rendering | X |
| PAF-PSD | Personal Status Display | X |
| OSD-AVS | Audio-Visual Scene Description | X |
| MMC-TTS | Text-to-Speech | X |
| PAF-IFD | Input Face Description | X |
| PAF-IBD | Input Body Description | X |
| PAF-PMX | Portable Avatar Multiplexing | X |
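
The identifiers in Table 4 share a `PREFIX-ACRONYM` shape. As a small illustration, the Python sketch below splits them and groups the distinct acronyms by prefix. The helper name `group_by_prefix` is hypothetical, and reading the prefix as the standard that specifies the module is an assumption rather than something Table 4 states. Identifiers that appear more than once in the table (PAF-IFD and PAF-IBD) are counted once.

```python
from collections import defaultdict

# Identifiers copied from Table 4 in table order, repeats included.
TABLE_4_IDS = [
    "MMC-CAS", "OSD-VSD", "OSD-VOI", "OSD-VDI", "OSD-VOE", "OSD-VII",
    "MMC-ASR", "MMC-NLU", "MMC-PSE", "MMC-ITD", "MMC-ISD", "PAF-IFD",
    "PAF-IBD", "MMC-PTI", "MMC-PSI", "PAF-PFI", "PAF-PGI", "MMC-PMX",
    "PAF-AVR", "PAF-PSD", "OSD-AVS", "MMC-TTS", "PAF-IFD", "PAF-IBD",
    "PAF-PMX",
]


def group_by_prefix(identifiers):
    """Group distinct acronyms by identifier prefix, keeping table order."""
    groups = defaultdict(list)
    for identifier in identifiers:
        prefix, acronym = identifier.split("-", 1)
        if acronym not in groups[prefix]:
            groups[prefix].append(acronym)
    return dict(groups)


if __name__ == "__main__":
    for prefix, acronyms in group_by_prefix(TABLE_4_IDS).items():
        print(prefix, len(acronyms), ", ".join(acronyms))
```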