
1     Functions 2     Reference Model 3     I/O Data
4     Functions of AI Modules 5     I/O Data of AI Modules 6     AIW, AIMs, and JSON Metadata
7     Reference Software 8     Conformance Testing 9     Performance Assessment

1      Functions

This Use Case addresses the case of a human holding a conversation with a Machine:

  1. The human converses with the Machine using Speech, Face, and Gesture, indicating the object in the Environment that s/he wishes to talk about or ask questions about.
  2. The Machine
    • Sees and hears an Environment containing a speaking human and some scattered objects.
    • Recognises the human’s Speech and obtains the human’s Personal Status by capturing Speech, Face, and Gesture.
    • Understands which object the human is referring to and generates an avatar that:
      • Utters Speech conveying a synthetic Personal Status that is relevant to the human’s Personal Status as shown by his/her Speech, Face, and Gesture, and
      • Displays a face conveying a Personal Status that is relevant to the human’s Personal Status and to the response the Machine intends to make.
    • Renders the Scene that it perceives from a human-selected Point of View. The objects in the scene are labelled with the Machine’s understanding of their semantics so that the human can understand how the Machine sees the Environment.

2      Reference Model

Figure 1 gives the Conversation About a Scene Reference Model, including the input/output data, the AIMs, and the data exchanged among the AIMs.

Figure 1 – Reference Model of Conversation About a Scene (MMC-CAS) AIW

The Machine operates according to the following workflow:

  1. Visual Scene Description produces Body Descriptors, Face Descriptors, Visual Scene Geometry, and Visual Objects from Input Visual.
  2. Automatic Speech Recognition produces Recognised Text from Input Speech.
  3. Visual Object Identification produces Visual Object Instance ID from Visual Objects, Body Descriptors, and Visual Scene Geometry.
  4. Natural Language Understanding produces Meaning and Refined Text from Recognised Text and Visual Object Instance ID.
  5. Personal Status Extraction produces Input Personal Status from Meaning, Input Speech, Face Descriptors, and Body Descriptors.
  6. Entity Dialogue Processing produces Machine Text and Machine Personal Status from Input Personal Status, Meaning, and Refined Text.
  7. Personal Status Display produces Machine Portable Avatar from Machine Text and Machine Personal Status.
  8. Audio-Visual Scene Rendering renders the Audio-Visual Scene:
    1. As described by the Visual Scene Descriptors.
    2. With the Machine’s Portable Avatar integrated, depending on View Selector.
    3. As seen from the human-selected Point of View.
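
The eight steps above form a dataflow in which each AIM can run once its inputs exist. The following is a minimal Python sketch of that ordering, assuming placeholder values in place of real AIM implementations. The names `WORKFLOW` and `run_workflow` are hypothetical. Face Descriptors and Visual Scene Descriptors are listed as outputs of Visual Scene Description on the basis of Table 3, and the inputs and outputs of step 8 are assumed from its sub-items and Table 1.

```python
from typing import Dict, List, Tuple

# (AIM, data it needs, data it produces), in the order of the workflow steps.
Step = Tuple[str, List[str], List[str]]

WORKFLOW: List[Step] = [
    ("Visual Scene Description", ["Input Visual"],
     ["Visual Scene Descriptors", "Body Descriptors", "Face Descriptors",
      "Visual Scene Geometry", "Visual Objects"]),  # first two outputs per Table 3
    ("Automatic Speech Recognition", ["Input Speech"], ["Recognised Text"]),
    ("Visual Object Identification",
     ["Visual Objects", "Body Descriptors", "Visual Scene Geometry"],
     ["Visual Object Instance ID"]),
    ("Natural Language Understanding",
     ["Recognised Text", "Visual Object Instance ID"],
     ["Meaning", "Refined Text"]),
    ("Personal Status Extraction",
     ["Meaning", "Input Speech", "Face Descriptors", "Body Descriptors"],
     ["Input Personal Status"]),
    ("Entity Dialogue Processing",
     ["Input Personal Status", "Meaning", "Refined Text"],
     ["Machine Text", "Machine Personal Status"]),
    ("Personal Status Display",
     ["Machine Text", "Machine Personal Status"],
     ["Machine Portable Avatar"]),
    ("Audio-Visual Scene Rendering",  # inputs assumed from step 8 and Table 1
     ["Visual Scene Descriptors", "Machine Portable Avatar",
      "View Selector", "Point of View"],
     ["Output Speech", "Output Visual"]),
]


def run_workflow(inputs: Dict[str, object]) -> Dict[str, object]:
    """Walk the steps in order, failing if a step lacks one of its inputs."""
    data = dict(inputs)
    for aim, needs, gives in WORKFLOW:
        missing = [name for name in needs if name not in data]
        if missing:
            raise KeyError(f"{aim} is missing {missing}")
        for name in gives:
            # Placeholder standing in for the real output of the AIM.
            data[name] = f"<{name} from {aim}>"
    return data


if __name__ == "__main__":
    result = run_workflow({"Input Visual": b"", "Input Speech": b"",
                           "View Selector": True, "Point of View": (0, 0, 0)})
    print(result["Output Speech"])
```

The sketch only checks that every step finds its inputs among the AIW inputs and the outputs of earlier steps; it says nothing about how an AIM computes its outputs.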

3      I/O Data

Table 1 gives the input/output data of Conversation About a Scene.

Table 1 – I/O data of Conversation About a Scene

| Input data | From | Description |
|---|---|---|
| View Selector | Human | Selects whether Machine is rendered in the scene. |
| Input Visual | Camera | Points to human and scene. |
| Input Speech | Microphone | Speech of human. |
| Point of View | Human | The point of view of the Audio-Visual Scene displayed by Audio-Visual Scene Rendering. |

| Output data | To | Description |
|---|---|---|
| Output Visual | Human | Rendering of the Visual Scene containing labelled objects, human, and Machine depending on View Selector as perceived by Machine and seen from the Point of View. |
| Output Speech | Human | Speech of Portable Avatar produced by Machine. |
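
As a reading aid, the data of Table 1 can be sketched as two record types. The class names, field names, and field types below (bytes for media, a three-number tuple for Point of View) are illustrative assumptions; this section does not define the data formats.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class CASInputs:
    """Input data of Table 1 (field types are placeholders)."""
    view_selector: bool                        # from Human: render the Machine in the scene?
    input_visual: bytes                        # from Camera: human and scene
    input_speech: bytes                        # from Microphone: speech of human
    point_of_view: Tuple[float, float, float]  # from Human: assumed to be a 3D position


@dataclass(frozen=True)
class CASOutputs:
    """Output data of Table 1 (field types are placeholders)."""
    output_visual: bytes  # to Human: rendered Visual Scene
    output_speech: bytes  # to Human: speech of the Portable Avatar


# Example values are arbitrary.
inputs = CASInputs(view_selector=True, input_visual=b"", input_speech=b"",
                   point_of_view=(0.0, 1.6, 2.0))
```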

4      Functions of AI Modules

Table 2 gives the functions of the AI Modules of Conversation About a Scene.

Table 2 – Functions of AI Modules of Conversation About a Scene

| AIM | Functions |
|---|---|
| Visual Scene Description | 1. Receives Input Visual. 2. Provides Visual Objects and Visual Scene Geometry. |
| Visual Object Identification | 1. Receives Body Descriptors and non-human Visual Objects. 2. Provides the Instance ID of the Visual Object indicated by the human. |
| Automatic Speech Recognition | 1. Receives Input Speech. 2. Provides Recognised Text. |
| Natural Language Understanding | 1. Receives Instance ID and Recognised Text. 2. Refines Text and extracts Meaning. |
| Personal Status Extraction | 1. Receives Input Speech, Body Descriptors, Face Descriptors, and Meaning. 2. Provides Personal Status. |
| Entity Dialogue Processing | 1. Receives Refined Text and Personal Status. 2. Produces Machine’s Text and Personal Status. |
| Personal Status Display | 1. Receives Machine’s Personal Status and Text. 2. Provides Machine Portable Avatar. |
| Audio-Visual Scene Rendering | 1. Receives the Descriptors of the Visual Scene perceived by Machine including the Portable Avatar of the Personal Status Display. 2. Renders the Audio-Visual Scene from the Point of View selected by human. |

5     I/O Data of AI Modules

Table 3 gives the list of AIMs with their I/O Data.

Table 3 – AI Modules of Conversation About a Scene

| AIM | Receives | Produces |
|---|---|---|
| Visual Scene Description | 1. Input Visual | 1. Visual Scene Descriptors 2. Body Descriptors 3. Face Descriptors 4. Visual Scene Geometry 5. Visual Objects |
| Visual Object Identification | 1. Body Object 2. Visual Objects 3. Visual Scene Geometry | 1. Visual Object Instance Identifier |
| Automatic Speech Recognition | 1. Input Speech | 1. Recognised Text |
| Natural Language Understanding | 1. Recognised Text 2. Visual Object Instance Identifier | 1. Meaning 2. Refined Text |
| Personal Status Extraction | 1. Body Object 2. Face Object 3. Input Speech 4. Meaning | 1. Personal Status |
| Entity Dialogue Processing | 1. Personal Status 2. Meaning 3. Visual Object Instance Identifier 4. Refined Text | 1. Machine Text 2. Machine Personal Status |
| Personal Status Display | 1. Machine Text 2. Machine Personal Status | 1. Machine Portable Avatar |
| Audio-Visual Scene Rendering | 1. Visual Scene Descriptors 2. Point of View | 1. Output Speech 2. Output Visual |
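
A table of this kind lends itself to a mechanical wiring check: every item an AIM receives should be either an input of the workflow or something another AIM produces. The sketch below applies that check to four rows of Table 3, with data names copied as printed. The names `AIMS` and `unresolved_inputs` are hypothetical, and the check compares names only, so it flags differing labels without deciding whether they denote the same data.

```python
from typing import Dict, List, Set, Tuple

# AIM -> (receives, produces), four rows of Table 3 with names as printed.
AIMS: Dict[str, Tuple[List[str], List[str]]] = {
    "Visual Scene Description": (
        ["Input Visual"],
        ["Visual Scene Descriptors", "Body Descriptors", "Face Descriptors",
         "Visual Scene Geometry", "Visual Objects"],
    ),
    "Visual Object Identification": (
        ["Body Object", "Visual Objects", "Visual Scene Geometry"],
        ["Visual Object Instance Identifier"],
    ),
    "Automatic Speech Recognition": (["Input Speech"], ["Recognised Text"]),
    "Natural Language Understanding": (
        ["Recognised Text", "Visual Object Instance Identifier"],
        ["Meaning", "Refined Text"],
    ),
}


def unresolved_inputs(
    aims: Dict[str, Tuple[List[str], List[str]]], external: Set[str]
) -> Dict[str, List[str]]:
    """Return, per AIM, the received items that nothing supplies by name."""
    available = set(external)
    for _, produces in aims.values():
        available.update(produces)
    report: Dict[str, List[str]] = {}
    for name, (receives, _) in aims.items():
        missing = [item for item in receives if item not in available]
        if missing:
            report[name] = missing
    return report


if __name__ == "__main__":
    print(unresolved_inputs(AIMS, {"Input Visual", "Input Speech"}))
    # → {'Visual Object Identification': ['Body Object']}
```

On these rows the check reports Body Object, which Visual Object Identification receives while Visual Scene Description is listed as producing Body Descriptors.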

6      AIW, AIMs, and JSON Metadata

Table 4 provides the links to the AIW and AIM specifications and to the JSON syntaxes. AIMs/1 indicates that the column contains Composite AIMs and AIMs/2 indicates that the column contains their Basic AIMs.

Table 4 – AIW, AIMs, and JSON Metadata

| AIW | AIMs/1 | AIMs/2 | Name | JSON |
|---|---|---|---|---|
| MMC-CAS | | | Conversation About a Scene | X |
| | OSD-VSD | | Visual Scene Description | X |
| | OSD-VOI | | Visual Object Identification | X |
| | | OSD-VDI | Visual Direction Identification | X |
| | | OSD-VOE | Visual Object Extraction | X |
| | | OSD-VII | Visual Instance Identification | X |
| | MMC-ASR | | Automatic Speech Recognition | X |
| | MMC-NLU | | Natural Language Understanding | X |
| | MMC-PSE | | Personal Status Extraction | X |
| | | MMC-ETD | Entity Text Description | X |
| | | MMC-ESD | Entity Speech Description | X |
| | | PAF-EFD | Entity Face Description | X |
| | | PAF-EBD | Entity Body Description | X |
| | | MMC-PTI | PS-Text Interpretation | X |
| | | MMC-PSI | PS-Speech Interpretation | X |
| | | PAF-PFI | PS-Face Interpretation | X |
| | | PAF-PGI | PS-Gesture Interpretation | X |
| | | MMC-PMX | Personal Status Multiplexing | X |
| | MMC-EDP | | Entity Dialogue Processing | X |
| | OSD-PSD | | Personal Status Display | X |
| | | MMC-TTS | Text-to-Speech | X |
| | | PAF-EFD | Entity Face Description | X |
| | | PAF-EBD | Entity Body Description | X |
| | | PAF-PMX | Portable Avatar Multiplexing | X |
| | PAF-AVR | | Audio-Visual Scene Rendering | X |
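
The Composite/Basic relationship in Table 4 can be sketched as a mapping from each Composite AIM to its Basic AIMs. The grouping below is an assumption: it takes each Composite AIM to be followed in the table by the Basic AIMs it contains. The names `COMPOSITE_AIMS` and `shared_basic_aims` are hypothetical. The helper lists Basic AIMs that appear under more than one Composite AIM, which is why some rows of Table 4 occur twice.

```python
from collections import Counter
from typing import Dict, List

# Composite AIM -> Basic AIMs, grouped by row order in Table 4 (assumed).
COMPOSITE_AIMS: Dict[str, List[str]] = {
    "OSD-VOI": ["OSD-VDI", "OSD-VOE", "OSD-VII"],
    "MMC-PSE": ["MMC-ETD", "MMC-ESD", "PAF-EFD", "PAF-EBD",
                "MMC-PTI", "MMC-PSI", "PAF-PFI", "PAF-PGI", "MMC-PMX"],
    "OSD-PSD": ["MMC-TTS", "PAF-EFD", "PAF-EBD", "PAF-PMX"],
}


def shared_basic_aims(composites: Dict[str, List[str]]) -> List[str]:
    """Return the Basic AIMs used by more than one Composite AIM, sorted."""
    counts = Counter(aim for parts in composites.values() for aim in set(parts))
    return sorted(aim for aim, n in counts.items() if n > 1)


if __name__ == "__main__":
    print(shared_basic_aims(COMPOSITE_AIMS))
    # → ['PAF-EBD', 'PAF-EFD']
```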

7     Reference Software

8     Conformance Testing

9     Performance Assessment

 
