1 Scope of Human-CAV Interaction Subsystem
2 Reference Architecture of Human-CAV Interaction Subsystem.
3 I/O Data of Human-CAV Interaction
4 Functions of AI Modules of Human-CAV Interaction
5 I/O Data of AI Modules of Human-CAV Interaction
6 AIW and AIM Specification and JSON Metadata
1 Scope of Human-CAV Interaction Subsystem
The MPAI Connected Autonomous Vehicle (CAV) – Architecture specifies the Reference Model of a Vehicle – called Connected Autonomous Vehicle (CAV) – able to reach a destination by understanding the environment using its own sensors, exchanging information with other CAVs, and actuating motion (see here for an introduction to MPAI-CAV and here for the full specification). The Human-CAV Interaction (HCI) Subsystem has the functions of recognising the human owner or renter, responding to humans’ commands and queries, conversing with humans during the travel, conversing with the Autonomous Motion Subsystem in response to humans’ requests, and communicating with the HCIs on board other CAVs.
2 Reference Architecture of Human-CAV Interaction Subsystem
Figure 1 represents the Human-CAV Interaction (HCI) Reference Model. This includes data such as Audio and Visual, Inter-HCI Information, HCI-AMS Message, and AMS-HCI Message that are not part of the specification but are used to define the full scope of the Human-CAV Interaction Subsystem.
Figure 1 – Human-CAV Interaction Reference Model
Note that it is assumed that Natural Language Understanding outputs a Text that is either the Refined Text or the Input Text, depending on which one is active. Meaning is always computed from the available Text – Refined or Input. Personal Status Extraction is unaware of the decisions made by Natural Language Understanding.
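The text-selection rule stated in the note above can be sketched as follows. This is a minimal illustration, not part of the specification; the function name and use of `None` for an inactive input are assumptions.

```python
from typing import Optional

def select_text(refined_text: Optional[str], input_text: Optional[str]) -> Optional[str]:
    """Illustrative sketch: return the Text that downstream processing uses.

    Per the note above, the Refined Text produced by Natural Language
    Understanding is used when it is active; otherwise the Input Text is used.
    An inactive input is modelled here as None (an assumption of this sketch).
    """
    return refined_text if refined_text is not None else input_text
```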
A group of humans approaches the CAV from outside, or sits in the seats inside the CAV. In either case:
- Audio Scene Description AIM creates the Audio Scene Description in the form of Audio (Speech) Objects corresponding to each speaking human in the Environment (close to the CAV) and Audio Scene Geometry.
- Visual Scene Description creates the Visual Scene Descriptors in the form of Descriptors of the Faces and the Bodies corresponding to each human in the Environment (close to the CAV) and Visual Scene Geometry.
- Automatic Speech Recognition recognises the speech of each human and produces Recognised Text.
- Audio-Visual Alignment produces the Audio-Visual Scene Geometry.
- Visual Object Identification produces Object ID from Visual Objects, Body Descriptors, and Visual Scene Geometry.
- Natural Language Understanding extracts Meaning and produces Refined Text from the Recognised Text of each Input Speech and Visual Object.
- The Speaker Identity Recognition and Face Identity Recognition AIMs authenticate the humans that the HCI is interacting with using Speech and Face Descriptors.
- The Personal Status Extraction AIM extracts the Personal Status of the humans.
- The Entity Dialogue Processing AIM validates the human Identities, produces the response, displays the HCI Personal Status, and issues commands to the Autonomous Motion Subsystem.
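The speech path through the steps above can be sketched as a simple chain of AIM calls. This is a hedged, simplified illustration: the data types and function bodies below are placeholders standing in for the MPAI data formats and AIM implementations, not the normative interfaces.

```python
from dataclasses import dataclass, field

# Placeholder data types standing in for the MPAI-defined formats (assumption).
@dataclass
class SpeechObject:
    object_id: str          # identifier of the Speech Object in the Audio Scene
    samples: bytes = b""    # audio payload (omitted in this sketch)

@dataclass
class HCIResponse:
    text: str
    personal_status: dict = field(default_factory=dict)

def automatic_speech_recognition(speech: SpeechObject) -> str:
    # Placeholder: a real AIM would run a speech recognition model here.
    return f"<recognised text for {speech.object_id}>"

def natural_language_understanding(recognised_text: str) -> tuple:
    # Returns (Refined Text, Meaning); both are placeholders in this sketch.
    return recognised_text, {"intent": "unknown"}

def entity_dialogue_processing(speaker_id: str, refined_text: str, meaning: dict) -> HCIResponse:
    # Placeholder: a real AIM would validate identities and compose a reply.
    return HCIResponse(text=f"Reply to {speaker_id}: {refined_text}")

def hci_speech_chain(speech: SpeechObject, speaker_id: str) -> HCIResponse:
    """Illustrative chaining of ASR -> NLU -> Entity Dialogue Processing."""
    recognised = automatic_speech_recognition(speech)
    refined, meaning = natural_language_understanding(recognised)
    return entity_dialogue_processing(speaker_id, refined, meaning)
```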
The HCI interacts with the humans in the cabin in several ways:
- By responding to commands/queries from one or more humans at the same time, e.g.:
- Commands to go to a waypoint, park at a place, etc.
- Commands with an effect in the cabin, e.g., turn off air conditioning, turn on the radio, call a person, open window or door, search for information etc.
- By conversing with and responding to questions from one or more humans at the same time about travel-related issues (in-depth domain-specific conversation), e.g.:
- Humans request information, e.g., time to destination, route conditions, weather at destination, etc.
- CAV offers alternatives to humans, e.g., long but safe way, short but likely to have interruptions.
- Humans ask questions about objects in the cabin.
- By following the conversation on travel matters held by humans in the cabin if 1) the passengers allow the HCI to do so, and 2) the processing is carried out inside the CAV.
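The command/query categories listed above imply a routing decision: motion-related requests become commands to the Autonomous Motion Subsystem, while cabin commands are actuated locally. A minimal keyword-based sketch follows; the keyword sets and return labels are assumptions of this illustration, not part of the specification.

```python
# Illustrative routing only: keywords below are assumptions for the sketch.
AMS_KEYWORDS = {"waypoint", "park", "destination", "route"}
CABIN_KEYWORDS = {"air conditioning", "radio", "window", "door", "call"}

def route_command(refined_text: str) -> str:
    """Classify a recognised command as AMS-bound, cabin-bound, or dialogue."""
    text = refined_text.lower()
    if any(keyword in text for keyword in AMS_KEYWORDS):
        return "HCI-AMS Command"
    if any(keyword in text for keyword in CABIN_KEYWORDS):
        return "Cabin actuation"
    return "Dialogue response"
```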
The Audio Scene Description AIM provides all the Speech Objects in the Audio Scene, removing all other audio sources. The Speaker Identity Recognition and Automatic Speech Recognition AIMs support multiple Speech Objects as input. Each Speech Object has an identifier to enable the Speaker Identity Recognition and Automatic Speech Recognition AIMs to provide labelled Speaker IDs and Recognised Texts. If the Face Identity Recognition AIM provides Face IDs corresponding to the Speaker IDs, the Entity Dialogue Processing AIM can correctly associate the Speaker IDs (and the corresponding Recognised Texts) with the Face IDs.
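The association described above – labelled Speaker IDs, Recognised Texts, and Face IDs joined on a shared object identifier – can be sketched as a simple dictionary join. The dictionary-based representation is an assumption of this illustration; the specification only requires that each Speech Object carry an identifier.

```python
def associate_identities(speaker_ids: dict,
                         recognised_texts: dict,
                         face_ids: dict) -> list:
    """Join labelled AIM outputs on the shared object identifier (sketch).

    speaker_ids:      object label -> Speaker ID
    recognised_texts: object label -> Recognised Text
    face_ids:         object label -> Face ID (may be missing for some labels)
    """
    return [
        {
            "label": label,
            "speaker_id": speaker_id,
            "text": recognised_texts.get(label),
            "face_id": face_ids.get(label),  # None if no matching Face ID
        }
        for label, speaker_id in speaker_ids.items()
    ]
```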
3 I/O Data of Human-CAV Interaction
Table 1 gives the input/output data of Human-CAV Interaction.
Note that communication with the Autonomous Motion Subsystem (AMS) and remote HCI Subsystem is not specified here.
Table 1 – I/O data of Human-CAV Interaction
Input data | From | Description |
Input Audio (Outdoor) | Environment Sensing Subsystem | User authentication; user commands; user conversation |
Input Audio (Indoor) | Cabin Passengers | User’s social life; commands/interaction with HCI |
Input Visual (Outdoor) | Environment Sensing Subsystem | Commands/interaction with HCI |
Input Visual (Indoor) | Cabin Passengers | User’s social life; commands/interaction with HCI |
AMS-HCI Message | Autonomous Motion Subsystem | Includes response to HCI-AMS Message |
Inter HCI Information | Remote HCI | HCI-to-HCI information |
Output data | To | Comments |
Inter HCI Information | Remote HCI | HCI-to-HCI information |
HCI-AMS Command | Autonomous Motion Subsystem | HCI-to-AMS information |
Machine Portable Avatar | Cabin Passengers | HCI’s avatar |
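Table 1 pairs each HCI-AMS Message with an AMS-HCI response, although the payload formats are not specified here. A generic message envelope can illustrate that pairing; the field names below are assumptions of this sketch, not the normative data formats.

```python
from dataclasses import dataclass
from typing import Any

# The AMS-HCI / HCI-AMS payload formats are not specified in this section;
# this envelope only illustrates the direction and pairing of the messages.
@dataclass
class Message:
    source: str        # e.g., "HCI" or "AMS" (assumed labels)
    destination: str
    payload: Any = None

def make_hci_ams_command(payload: Any) -> Message:
    """HCI-to-AMS information, per Table 1."""
    return Message(source="HCI", destination="AMS", payload=payload)

def make_ams_hci_response(payload: Any) -> Message:
    """Per Table 1, the AMS-HCI Message includes the response to an HCI-AMS Message."""
    return Message(source="AMS", destination="HCI", payload=payload)
```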
4 Functions of AI Modules of Human-CAV Interaction
Table 2 gives the functions of all Human-CAV Interaction AIMs.
Table 2 – Functions of Human-CAV Interaction’s AI Modules
AIM | Function |
Audio Scene Description | 1. Receives Input Audio captured by the appropriate (indoor or outdoor) Microphone Array. 2. Produces the Audio Scene Descriptors. |
Visual Scene Description | 1. Receives Input Visual captured by the appropriate (indoor or outdoor) visual sensors. 2. Produces the Visual Scene Descriptors. |
Automatic Speech Recognition | 1. Receives Input Speech from one of the humans. 2. Converts speech into Recognised Text. |
Audio-Visual Alignment | 1. Receives Audio and Visual Scene Geometries and Audio and Visual Objects. 2. Re-identifies the Audio and Visual Objects having the same Spatial Attitudes. |
Visual Object Identification | 1. Receives Body Descriptors, Visual Scene Geometry, and Visual Objects. 2. Provides the ID of the class of objects of which the Visual Object is an Instance. |
Natural Language Understanding | 1. Receives Recognised Text, Input Text, Visual Object Instance ID. 2. Produces Refined Text and Meaning. |
Speaker Identity Recognition | 1. Receives Speech Object. 2. Provides Speaker ID. |
Personal Status Extraction | 1. Receives Input Speech, Meaning, Body Descriptors, Face Descriptors. 2. Provides Input Personal Status of human. |
Face Identity Recognition | 1. Receives Face Object. 2. Provides Face ID. |
Entity Dialogue Processing | 1. Receives Speaker ID, Meaning, Refined Text, Input Personal Status, Face ID. 2. Provides Machine (HCI) Text and Personal Status. |
Personal Status Display | 1. Receives Machine Personal Status and Text. 2. Produces Machine Portable Avatar. |
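The Audio-Visual Alignment function in Table 2 – re-identifying Audio and Visual Objects having the same Spatial Attitudes – can be sketched as a nearest-neighbour match on position. Reducing a Spatial Attitude to a 3-D position and the distance threshold are both assumptions of this illustration.

```python
import math

def align_audio_visual(audio_objects: dict,
                       visual_objects: dict,
                       threshold: float = 0.5) -> list:
    """Sketch of Audio-Visual Alignment: pair each Audio Object with the
    closest Visual Object whose position lies within `threshold` metres.

    Both arguments map object IDs to (x, y, z) positions; the reduction of a
    Spatial Attitude to a position is an assumption of this sketch.
    """
    pairs = []
    for audio_id, audio_pos in audio_objects.items():
        best_id, best_dist = None, threshold
        for visual_id, visual_pos in visual_objects.items():
            dist = math.dist(audio_pos, visual_pos)
            if dist < best_dist:
                best_id, best_dist = visual_id, dist
        if best_id is not None:
            pairs.append((audio_id, best_id))
    return pairs
```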
5 I/O Data of AI Modules of Human-CAV Interaction
Table 3 gives the I/O data of the AI Modules of the Human-CAV Interaction depicted in Figure 1.
Table 3 – AI Modules of Human-CAV interaction
6 AIW and AIM Specification and JSON Metadata
Table 4 – AIW and AIM Specification and JSON Metadata
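To give a flavour of the JSON Metadata referenced above, the fragment below sketches what an AIW description for this Subsystem might look like. This is a hypothetical illustration only: the normative schema and field names are defined in the referenced MPAI specification, and the keys used here are assumptions.

```python
import json

# Hypothetical illustration only; the normative JSON Metadata schema is
# defined in the referenced MPAI specification. All field names are assumed.
aiw_metadata = {
    "Identifier": {
        "Name": "Human-CAV Interaction",
        "Version": "1.0",
    },
    "AIMs": [
        {"Name": "Audio Scene Description"},
        {"Name": "Visual Scene Description"},
        {"Name": "Automatic Speech Recognition"},
        {"Name": "Entity Dialogue Processing"},
        {"Name": "Personal Status Display"},
    ],
}

serialised = json.dumps(aiw_metadata, indent=2)
```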