1 Scope of Conversation About a Scene
This Use Case addresses the case of a human holding a conversation with a Machine:
- The human converses with the Machine, using Speech, Face, and Gesture to indicate the object in the Environment that s/he wishes to talk or ask questions about.
- The Machine:
  - Sees and hears an Environment containing a speaking human and some scattered objects.
  - Recognises the human’s Speech and obtains the human’s Personal Status by capturing Speech, Face, and Gesture.
  - Understands which object the human is referring to and generates an avatar that:
    - Utters Speech conveying a synthetic Personal Status that is relevant to the human’s Personal Status as shown by his/her Speech, Face, and Gesture, and
    - Displays a face conveying a Personal Status that is relevant to the human’s Personal Status and to the response the Machine intends to make.
  - Renders the Scene that it perceives from a human-selected Point of View. The objects in the Scene are labelled with the Machine’s understanding of their semantics so that the human can understand how the Machine sees the Environment.
2 Reference Model
Figure 3 gives the Conversation About a Scene Reference Model including the input/output data, the AIMs, and the data exchanged between and among the AIMs.
Figure 3 – Reference Model of Conversation About a Scene
The Machine operates according to the following workflow:
- Visual Scene Description produces Visual Scene Descriptors, Body Descriptors, Face Descriptors, Visual Scene Geometry, and Visual Objects from Input Visual.
- Automatic Speech Recognition produces Recognised Text from Input Speech.
- Visual Object Identification produces Visual Object Instance ID from Visual Objects, Body Descriptors, and Visual Scene Geometry.
- Natural Language Understanding produces Meaning and Refined Text from Recognised Text and Visual Object Instance ID.
- Personal Status Extraction produces Input Personal Status from Meaning, Input Speech, Face Descriptors, and Body Descriptors.
- Entity Dialogue Processing produces Machine Text and Machine Personal Status from Input Personal Status, Meaning, and Refined Text.
- Personal Status Display produces Machine Portable Avatar from Machine Text and Machine Personal Status.
- Audio-Visual Scene Rendering renders the Scene as seen from the user-selected Point of View using the Visual Scene Descriptors. The rendering is continuously updated as the Machine improves its understanding of the Scene and its objects.
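The data flow of the workflow above can be traced with a minimal Python sketch. Every function here is a hypothetical stub named after an AIM, returning placeholder values; none of the signatures, types, or decision rules come from the specification. The sketch also assumes that Visual Scene Description supplies the Face Descriptors and Visual Scene Descriptors that later steps consume.

```python
from dataclasses import dataclass

# Hypothetical stand-ins for the AIMs; each stub returns placeholder values
# so that the hand-over of data between steps can be followed end to end.

@dataclass
class SceneDescription:
    body_descriptors: str
    face_descriptors: str
    scene_geometry: str
    visual_objects: list[str]
    scene_descriptors: str

def visual_scene_description(input_visual: str) -> SceneDescription:
    return SceneDescription(
        body_descriptors="pointing-right",
        face_descriptors="smiling",
        scene_geometry="geometry",
        visual_objects=["cup", "book"],
        scene_descriptors=f"scene({input_visual})",
    )

def automatic_speech_recognition(input_speech: str) -> str:
    return input_speech.lower()

def visual_object_identification(objects, body, geometry) -> str:
    # Placeholder rule: a right-pointing gesture selects the last object.
    return objects[-1] if body == "pointing-right" else objects[0]

def natural_language_understanding(recognised_text, object_id):
    refined_text = recognised_text.replace("that", object_id)
    meaning = {"topic": object_id}
    return meaning, refined_text

def personal_status_extraction(meaning, input_speech, face, body) -> str:
    return "curious" if face == "smiling" else "neutral"

def entity_dialogue_processing(personal_status, meaning, refined_text):
    machine_text = f"You asked about the {meaning['topic']}."
    machine_status = "friendly" if personal_status == "curious" else "neutral"
    return machine_text, machine_status

def personal_status_display(machine_text, machine_status) -> dict:
    return {"text": machine_text, "personal_status": machine_status}

def audio_visual_scene_rendering(scene_descriptors, point_of_view) -> str:
    return f"{scene_descriptors} from {point_of_view}"

def conversation_about_a_scene(input_visual, input_speech, point_of_view):
    scene = visual_scene_description(input_visual)
    recognised = automatic_speech_recognition(input_speech)
    object_id = visual_object_identification(
        scene.visual_objects, scene.body_descriptors, scene.scene_geometry)
    meaning, refined = natural_language_understanding(recognised, object_id)
    status = personal_status_extraction(
        meaning, input_speech, scene.face_descriptors, scene.body_descriptors)
    machine_text, machine_status = entity_dialogue_processing(
        status, meaning, refined)
    avatar = personal_status_display(machine_text, machine_status)
    output_visual = audio_visual_scene_rendering(
        scene.scene_descriptors, point_of_view)
    return avatar, output_visual
```

Calling `conversation_about_a_scene("frame0", "What is THAT", "front")` returns the two outputs of the Use Case, a stand-in for the Machine Portable Avatar and a stand-in for Output Visual. The rendering branch depends only on the scene description and the Point of View, so it could run independently of the dialogue branch.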
3 I/O Data
Table 1 gives the input/output data of Conversation About a Scene.
Table 1 – I/O data of Conversation About a Scene
Input data | From | Description |
Input Visual | Camera | Points to human and scene. |
Input Speech | Microphone | Speech of human. |
Point of View | Human | The point of view from which Audio-Visual Scene Rendering displays the Scene. |
Output data | To | Description |
Output Visual | Human | Rendering of the Scene containing labelled objects as perceived by Machine and seen from the Point of View. |
Machine Portable Avatar | Human | Portable Avatar produced by Machine. |
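As an illustration only, the I/O data of Table 1 could be mirrored by a pair of typed records. The class names, the use of raw bytes as payloads, and the position/orientation form of the Point of View are all assumptions made for this sketch; the section does not define any data format.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class PointOfView:
    # Assumed representation (not defined in this section).
    position: tuple[float, float, float]
    orientation: tuple[float, float, float]

@dataclass(frozen=True)
class WorkflowInput:
    input_visual: bytes          # from Camera
    input_speech: bytes          # from Microphone
    point_of_view: PointOfView   # from Human

@dataclass(frozen=True)
class WorkflowOutput:
    output_visual: bytes             # to Human: rendered, labelled Scene
    machine_portable_avatar: bytes   # to Human: avatar produced by Machine

# Example construction with placeholder payloads.
example_input = WorkflowInput(
    input_visual=b"",
    input_speech=b"",
    point_of_view=PointOfView((0.0, 1.6, 2.0), (0.0, 180.0, 0.0)),
)
```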
4 Functions of AI Modules
Table 2 gives the functions of the AI Modules of the Conversation About a Scene Use Case.
Table 2 – Functions of AI Modules of Conversation About a Scene
AIM | Functions |
Visual Scene Description | 1. Receives Input Visual 2. Provides Visual Objects and Visual Scene Geometry. |
Visual Object Identification | 1. Receives Body Descriptors and non-human Visual Objects 2. Provides the Instance ID of the Visual Object indicated by the human. |
Automatic Speech Recognition | 1. Receives Input Speech 2. Provides Recognised Text. |
Natural Language Understanding | 1. Receives Instance ID and Recognised Text 2. Refines Text and extracts Meaning. |
Personal Status Extraction | 1. Receives Input Speech, Body Descriptors, Face Descriptors, and Meaning. 2. Provides Personal Status. |
Entity Dialogue Processing | 1. Receives Refined Text and Personal Status. 2. Produces Machine’s Text and Personal Status. |
Audio-Visual Scene Rendering | 1. Receives the Descriptors of the Visual Scene perceived by Machine. 2. Renders the Visual Scene from the Point of View selected by human. |
Personal Status Display | 1. Receives Machine’s Personal Status and Text. 2. Provides Machine Portable Avatar. |
5 I/O Data of AI Modules
Table 3 gives the list of AIMs with their I/O Data.
Table 3 – AI Modules of Conversation About a Scene
AIM | Receives | Produces |
Visual Scene Description | 1. Input Visual | 1. Visual Scene Descriptors 2. Body Descriptors 3. Face Descriptors 4. Visual Scene Geometry 5. Visual Objects |
Visual Object Identification | 1. Body Object 2. Visual Objects 3. Visual Scene Geometry | 1. Visual Object Instance Identifier |
Automatic Speech Recognition | 1. Input Speech | 1. Recognised Text |
Natural Language Understanding | 1. Recognised Text 2. Visual Object Instance Identifier | 1. Meaning 2. Refined Text |
Personal Status Extraction | 1. Body Object 2. Face Object 3. Input Speech 4. Meaning | 1. Personal Status |
Entity Dialogue Processing | 1. Personal Status 2. Meaning 3. Refined Text | 1. Machine Text 2. Machine Personal Status |
Audio-Visual Scene Rendering | 1. Visual Scene Descriptors 2. Point of View | 1. Output Visual |
Personal Status Display | 1. Machine Text 2. Machine Personal Status | 1. Machine Portable Avatar |
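One way to check that the AIM connections in Table 3 are complete is to encode the table as a mapping and derive an execution order from it. The sketch below is not part of the specification. It makes two assumptions, both flagged in the code: "Body Object" and "Face Object" are treated as the Body Descriptors and Face Descriptors produced by Visual Scene Description, and Entity Dialogue Processing is taken to produce Machine Text as well, as the workflow in Section 2 states. It uses `graphlib` from the Python standard library (3.9 or later).

```python
from graphlib import TopologicalSorter

# Data supplied from outside the workflow (Table 1).
EXTERNAL_INPUTS = {"Input Visual", "Input Speech", "Point of View"}

# AIM name -> (receives, produces), transcribed from Table 3.
# Assumption: "Body Object"/"Face Object" are written as Body/Face Descriptors.
# Assumption: Entity Dialogue Processing also produces Machine Text.
AIMS = {
    "Visual Scene Description": (
        ["Input Visual"],
        ["Visual Scene Descriptors", "Body Descriptors", "Face Descriptors",
         "Visual Scene Geometry", "Visual Objects"]),
    "Visual Object Identification": (
        ["Body Descriptors", "Visual Objects", "Visual Scene Geometry"],
        ["Visual Object Instance Identifier"]),
    "Automatic Speech Recognition": (
        ["Input Speech"], ["Recognised Text"]),
    "Natural Language Understanding": (
        ["Recognised Text", "Visual Object Instance Identifier"],
        ["Meaning", "Refined Text"]),
    "Personal Status Extraction": (
        ["Body Descriptors", "Face Descriptors", "Input Speech", "Meaning"],
        ["Personal Status"]),
    "Entity Dialogue Processing": (
        ["Personal Status", "Meaning", "Refined Text"],
        ["Machine Text", "Machine Personal Status"]),
    "Audio-Visual Scene Rendering": (
        ["Visual Scene Descriptors", "Point of View"], ["Output Visual"]),
    "Personal Status Display": (
        ["Machine Text", "Machine Personal Status"],
        ["Machine Portable Avatar"]),
}

def execution_order(aims, external_inputs):
    """Return AIM names in an order where every input is already available."""
    made_by = {item: name
               for name, (_, produced) in aims.items()
               for item in produced}
    graph = {}
    for name, (received, _) in aims.items():
        missing = [item for item in received
                   if item not in external_inputs and item not in made_by]
        if missing:
            raise ValueError(f"{name} has unproduced inputs: {missing}")
        # Predecessors are the AIMs that produce this AIM's internal inputs.
        graph[name] = {made_by[item] for item in received
                       if item not in external_inputs}
    # static_order() raises graphlib.CycleError if the wiring is circular.
    return list(TopologicalSorter(graph).static_order())
```

`execution_order(AIMS, EXTERNAL_INPUTS)` returns all eight AIMs with each one listed after its producers. Removing Machine Text from the outputs of Entity Dialogue Processing makes the function raise a `ValueError` for Personal Status Display, which is how a gap in the table would surface.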
6 AIW, AIMs, and JSON Metadata
Table 4 – AIW, AIMs, and JSON Metadata