1 Scope of Conversation About a Scene
2 Reference Architecture of Conversation About a Scene
3 I/O Data of Conversation About a Scene
4 Functions of AI Modules of Conversation About a Scene
5 I/O Data of AI Modules of Conversation About a Scene
6 Specification of Conversation About a Scene JSON Metadata and AIMs
1 Scope of Conversation About a Scene
This Use Case addresses the case of a human holding a conversation with a Machine:
- The human converses with the Machine, using Speech, Face, and Gesture to indicate the object in the Environment that s/he wishes to talk about or ask questions about.
- The Machine:
  - Sees and hears an Environment containing a speaking human and some scattered objects.
  - Recognises the human’s Speech and obtains the human’s Personal Status by capturing Speech, Face, and Gesture.
  - Understands which object the human is referring to and generates an avatar that:
    - Utters Speech conveying a synthetic Personal Status that is relevant to the human’s Personal Status as shown by his/her Speech, Face, and Gesture, and
    - Displays a face conveying a Personal Status that is relevant to the human’s Personal Status and to the response the Machine intends to make.
  - Renders the Scene that it perceives from a human-selected Point of View. The objects in the Scene are labelled with the Machine’s understanding of their semantics so that the human can understand how the Machine sees the Environment.
2 Reference Architecture of Conversation About a Scene
Figure 3 gives the Conversation About a Scene Reference Model including the input/output data, the AIMs, and the data exchanged between and among the AIMs.
Figure 3 – Reference Model of Conversation About a Scene
The Machine operates according to the following workflow:
- Visual Scene Description produces Body Descriptors, Face Descriptors, Visual Scene Geometry, Visual Objects, and Visual Scene Descriptors from Input Visual.
- Automatic Speech Recognition produces Recognised Text from Input Speech.
- Visual Object Identification produces Visual Object Instance ID from Visual Objects, Body Descriptors, and Visual Scene Geometry.
- Natural Language Understanding produces Meaning and Refined Text from Recognised Text and Visual Object Instance ID.
- Personal Status Extraction produces Input Personal Status from Meaning, Input Speech, Face Descriptors, and Body Descriptors.
- Entity Dialogue Processing produces Machine Text and Machine Personal Status from Input Personal Status, Meaning, and Refined Text.
- Personal Status Display produces Machine Portable Avatar from Machine Text and Machine Personal Status.
- Audio-Visual Scene Rendering renders the Scene as seen from the user-selected Point of View using the Visual Scene Descriptors. The rendering is constantly updated as the Machine improves its understanding of the Scene and its objects.
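The workflow above can be sketched as a chain of function calls in which each AIM consumes the data produced by earlier ones. This is only an illustrative sketch: every function name, signature, and string payload below is hypothetical and is not taken from the specification; the stubs return placeholder strings where real AIMs would process audio and video.

```python
from dataclasses import dataclass


@dataclass
class SceneDescription:
    """Outputs of Visual Scene Description (placeholder payloads, assumed shape)."""
    body_descriptors: str
    face_descriptors: str
    scene_geometry: str
    visual_objects: list
    scene_descriptors: str


def visual_scene_description(input_visual):
    # Stub: a real AIM would analyse the camera input.
    return SceneDescription(
        body_descriptors=f"body({input_visual})",
        face_descriptors=f"face({input_visual})",
        scene_geometry=f"geometry({input_visual})",
        visual_objects=["cup", "book"],
        scene_descriptors=f"scene({input_visual})",
    )


def automatic_speech_recognition(input_speech):
    return f"text({input_speech})"


def visual_object_identification(visual_objects, body_descriptors, scene_geometry):
    # Stub: pretend the human is indicating the first object.
    return f"{visual_objects[0]}#0"


def natural_language_understanding(recognised_text, object_instance_id):
    meaning = f"meaning({recognised_text}, {object_instance_id})"
    refined_text = f"refined({recognised_text}, {object_instance_id})"
    return meaning, refined_text


def personal_status_extraction(meaning, input_speech, face_descriptors, body_descriptors):
    return "input-personal-status"


def entity_dialogue_processing(input_personal_status, meaning, refined_text):
    return f"reply-to({refined_text})", "machine-personal-status"


def personal_status_display(machine_text, machine_personal_status):
    return {"text": machine_text, "personal_status": machine_personal_status}


def audio_visual_scene_rendering(scene_descriptors, point_of_view):
    return f"render({scene_descriptors}, pov={point_of_view})"


def run_machine(input_visual, input_speech, point_of_view):
    """Wire the AIMs together in the order given by the workflow."""
    scene = visual_scene_description(input_visual)
    recognised_text = automatic_speech_recognition(input_speech)
    object_id = visual_object_identification(
        scene.visual_objects, scene.body_descriptors, scene.scene_geometry
    )
    meaning, refined_text = natural_language_understanding(recognised_text, object_id)
    input_ps = personal_status_extraction(
        meaning, input_speech, scene.face_descriptors, scene.body_descriptors
    )
    machine_text, machine_ps = entity_dialogue_processing(input_ps, meaning, refined_text)
    avatar = personal_status_display(machine_text, machine_ps)
    output_visual = audio_visual_scene_rendering(scene.scene_descriptors, point_of_view)
    return avatar, output_visual
```

Calling `run_machine("frame0", "audio0", "front")` returns the placeholder avatar and rendering. The point of the sketch is the wiring: the speech branch and the visual branch meet at Natural Language Understanding, while scene rendering depends only on the Visual Scene Descriptors and the Point of View.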
3 I/O Data of Conversation About a Scene
Table 1 gives the input/output data of Conversation About a Scene.
Table 1 – I/O data of Conversation About a Scene
Input data | From | Description |
Input Visual | Camera | Points to human and scene. |
Input Speech | Microphone | Speech of human. |
Point of View | Human | The point of view from which Audio-Visual Scene Rendering displays the scene. |
Output data | To | Description |
Output Visual | Human | Rendering of the Scene containing labelled objects as perceived by Machine and seen from the Point of View. |
Machine Portable Avatar | Human | Portable Avatar produced by Machine. |
4 Functions of AI Modules of Conversation About a Scene
Table 2 gives the functions of the AI Modules of the Conversation About a Scene Use Case.
Table 2 – Functions of AI Modules of Conversation About a Scene
AIM | Functions |
Visual Scene Description | 1. Receives Input Visual. 2. Provides Visual Objects and Visual Scene Geometry. |
Visual Object Identification | 1. Receives Body Descriptors and non-human Visual Objects. 2. Provides the Instance ID of the Visual Object indicated by the human. |
Automatic Speech Recognition | 1. Receives Input Speech. 2. Provides Recognised Text. |
Natural Language Understanding | 1. Receives Instance ID and Recognised Text. 2. Refines Text and extracts Meaning. |
Personal Status Extraction | 1. Receives Input Speech, Body Descriptors, Face Descriptors, and Meaning. 2. Provides Personal Status. |
Entity Dialogue Processing | 1. Receives Refined Text and Personal Status. 2. Produces Machine’s Text and Personal Status. |
Audio-Visual Scene Rendering | 1. Receives the Descriptors of the Visual Scene perceived by Machine. 2. Renders the Visual Scene from the Point of View selected by human. |
Personal Status Display | 1. Receives Machine’s Personal Status and Text. 2. Provides Machine Portable Avatar. |
5 I/O Data of AI Modules of Conversation About a Scene
Table 3 gives the list of AIMs with their I/O Data.
Table 3 – AI Modules of Conversation About a Scene
AIM | Receives | Produces |
Visual Scene Description | Input Visual | 1. Visual Scene Descriptors 2. Body Descriptors 3. Face Descriptors 4. Visual Scene Geometry 5. Visual Objects |
Visual Object Identification | 1. Body Object 2. Visual Objects 3. Visual Scene Geometry | 1. Visual Object Instance Identifier |
Automatic Speech Recognition | 1. Input Speech | 1. Recognised Text |
Natural Language Understanding | 1. Recognised Text 2. Visual Object Instance Identifier | 1. Meaning 2. Refined Text |
Personal Status Extraction | 1. Body Object 2. Face Object 3. Input Speech 4. Meaning | 1. Personal Status |
Entity Dialogue Processing | 1. Personal Status 2. Meaning 3. Refined Text | 1. Machine Text 2. Machine Personal Status |
Audio-Visual Scene Rendering | 1. Visual Scene Descriptors 2. Point of View | 1. Output Visual |
Personal Status Display | 1. Machine Text 2. Machine Personal Status | 1. Machine Portable Avatar |
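One way to sanity-check an AIM table of this kind is to treat it as a dataflow graph, verify that every datum an AIM receives is either an input of the Use Case or produced by another AIM, and then derive a valid execution order. The sketch below does this with Python's standard `graphlib` module (Python 3.9 or later). It rests on two assumptions: the body and face inputs are named "Body Descriptors" and "Face Descriptors" as in the workflow of Section 2, and Entity Dialogue Processing produces Machine Text as stated there. `AIMS`, `AIW_INPUTS`, and `execution_order` are hypothetical names introduced only for this illustration.

```python
from graphlib import TopologicalSorter

# AIM name -> (data received, data produced); names are assumptions, see lead-in.
AIMS = {
    "Visual Scene Description": (
        {"Input Visual"},
        {"Visual Scene Descriptors", "Body Descriptors", "Face Descriptors",
         "Visual Scene Geometry", "Visual Objects"},
    ),
    "Visual Object Identification": (
        {"Body Descriptors", "Visual Objects", "Visual Scene Geometry"},
        {"Visual Object Instance Identifier"},
    ),
    "Automatic Speech Recognition": ({"Input Speech"}, {"Recognised Text"}),
    "Natural Language Understanding": (
        {"Recognised Text", "Visual Object Instance Identifier"},
        {"Meaning", "Refined Text"},
    ),
    "Personal Status Extraction": (
        {"Body Descriptors", "Face Descriptors", "Input Speech", "Meaning"},
        {"Personal Status"},
    ),
    "Entity Dialogue Processing": (
        {"Personal Status", "Meaning", "Refined Text"},
        {"Machine Text", "Machine Personal Status"},
    ),
    "Audio-Visual Scene Rendering": (
        {"Visual Scene Descriptors", "Point of View"},
        {"Output Visual"},
    ),
    "Personal Status Display": (
        {"Machine Text", "Machine Personal Status"},
        {"Machine Portable Avatar"},
    ),
}

# Data supplied from outside the Machine (Table 1 inputs).
AIW_INPUTS = {"Input Visual", "Input Speech", "Point of View"}


def execution_order(aims, aiw_inputs):
    """Return one valid AIM execution order, or raise if a datum has no source."""
    # Map each datum to the AIM that produces it.
    producer = {datum: name for name, (_, outs) in aims.items() for datum in outs}
    graph = {}
    for name, (ins, _) in aims.items():
        missing = ins - aiw_inputs - set(producer)
        if missing:
            raise ValueError(f"{name}: no source for {sorted(missing)}")
        # Predecessors are the AIMs producing this AIM's non-external inputs.
        graph[name] = {producer[datum] for datum in ins if datum in producer}
    return list(TopologicalSorter(graph).static_order())
```

`execution_order(AIMS, AIW_INPUTS)` returns a list of the eight AIMs in which every producer precedes its consumers. Several orders are valid, because the speech and visual branches are independent until Natural Language Understanding. If a row names a datum that no AIM produces, the function raises `ValueError` naming the AIM and the unsourced datum.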
6 Specification of Conversation About a Scene JSON Metadata and AIMs
Table 4 – AIMs and JSON Metadata
AIW | AIM | Name and AIW/AIM Specification | JSON |
MMC-CAS | – | Conversation About a Scene | X |
– | OSD-VSD | Visual Scene Description | X |
– | OSD-VOI | Visual Object Identification | X |
– | OSD-VDI | Visual Direction Identification | X |
– | OSD-VOE | Visual Object Extraction | X |
– | OSD-VII | Visual Instance Identification | X |
– | MMC-ASR | Automatic Speech Recognition | X |
– | MMC-NLU | Natural Language Understanding | X |
– | MMC-PSE | Personal Status Extraction | X |
– | MMC-ITD | Input Text Description | X |
– | MMC-ISD | Input Speech Description | X |
– | PAF-IFD | Input Face Description | X |
– | PAF-IBD | Input Body Description | X |
– | MMC-PTI | PS-Text Interpretation | X |
– | MMC-PSI | PS-Speech Interpretation | X |
– | PAF-PFI | PS-Face Interpretation | X |
– | PAF-PGI | PS-Gesture Interpretation | X |
– | MMC-PMX | Personal Status Multiplexing | X |
– | PAF-AVR | Audio-Visual Scene Rendering | X |
– | PAF-PSD | Personal Status Display | X |
– | OSD-AVS | Audio-Visual Scene Description | X |
– | MMC-TTS | Text-to-Speech | X |
– | PAF-IFD | Input Face Description | X |
– | PAF-IBD | Input Body Description | X |
– | PAF-PMX | Portable Avatar Multiplexing | X |