1     Scope of Human-CAV Interaction Subsystem

2     Reference Architecture of Human-CAV Interaction Subsystem

3     I/O Data of Human-CAV Interaction

4     Functions of AI Modules of Human-CAV Interaction

5     I/O Data of AI Modules of Human-CAV Interaction

6     AIW and AIM Specification and JSON Metadata

1      Scope of Human-CAV Interaction Subsystem

The MPAI Connected Autonomous Vehicle (CAV) – Architecture specifies the Reference Model of a Vehicle – called Connected Autonomous Vehicle (CAV) – able to reach a destination by understanding the environment using its own sensors, exchanging information with other CAVs, and actuating motion (see here for an introduction to MPAI-CAV and here for the full specification). The Human-CAV Interaction (HCI) Subsystem has the function to recognise the human owner or renter, respond to humans' commands and queries, converse with humans during the travel, interact with the Autonomous Motion Subsystem in response to humans' requests, and communicate with the HCIs on board other CAVs.

2      Reference Architecture of Human-CAV Interaction Subsystem

Figure 1 represents the Human-CAV Interaction (HCI) Reference Model. It includes data – such as Audio and Visual, Inter-HCI Information, HCI-AMS Message, and AMS-HCI Message – that are not part of this specification but are shown to define the full scope of the Human-CAV Interaction Subsystem.

Figure 1 – Human-CAV Interaction Reference Model

Note that Natural Language Understanding is assumed to output as Refined Text either the refinement of the Recognised Text or the Input Text, depending on which one is active. Meaning is always computed from the available Text – Refined or Input. Personal Status Extraction is unaware of the choice made by Natural Language Understanding.
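
A minimal sketch of this selection rule follows; the function names and stub bodies are illustrative, not part of the specification.

def refine(text):
    # Stub: a real AIM would correct recognition errors here.
    return text

def extract_meaning(text):
    # Stub: a real AIM would produce a structured Meaning here.
    return {"utterance": text}

def natural_language_understanding(recognised_text, input_text):
    # The active Text is the Recognised Text when speech is present;
    # otherwise the Input Text is passed through.
    active_text = recognised_text if recognised_text is not None else input_text
    refined_text = refine(active_text)
    # Meaning is always computed on the available Text, Refined or Input.
    meaning = extract_meaning(refined_text)
    return refined_text, meaning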

A group of humans approaches the CAV from outside, or sits in the seats inside the CAV. Then:

  1. Audio Scene Description AIM creates the Audio Scene Description in the form of Audio (Speech) Objects corresponding to each speaking human in the Environment (close to the CAV) and Audio Scene Geometry.
  2. Visual Scene Description creates the Visual Scene Descriptors in the form of Descriptors of the Faces and the Bodies corresponding to each human in the Environment (close to the CAV) and Visual Scene Geometry.
  3. Automatic Speech Recognition recognises the speech of each human and produces Recognised Text.
  4. Audio-Visual Alignment produces the Audio-Visual Scene Geometry.
  5. Visual Object Identification produces Object ID from Visual Objects, Body Descriptors, and Visual Scene Geometry.
  6. Natural Language Understanding extracts the Meaning and produces the Refined Text from the Recognised Text of each Input Speech, also using the Visual Object Instance Identifier.
  7. The Speaker Identity Recognition and Face Identity Recognition AIMs authenticate, using the Speech and Face Descriptors, the humans the HCI is interacting with.
  8. The Personal Status Extraction AIM extracts the Personal Status of the humans.
  9. The Entity Dialogue Processing AIM validates the human Identities, produces the response and the HCI Personal Status, and issues commands to the Autonomous Motion Subsystem.
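
The dataflow of steps 1 to 9 can be summarised in a sketch. All function names, data shapes, and stub values below are assumptions made for illustration; they are not the normative AIM interfaces.

def hci_front_end(input_audio, input_visual):
    # Steps 1-2: scene description (stubbed as identity transforms).
    speech_objects = [{"id": i, "speech": a} for i, a in enumerate(input_audio)]
    audio_geometry = {o["id"]: (0.0, 0.0, 0.0) for o in speech_objects}
    faces = [{"id": i, "face": v} for i, v in enumerate(input_visual)]
    bodies = [{"id": i, "body": v} for i, v in enumerate(input_visual)]
    visual_geometry = {f["id"]: (0.0, 0.0, 0.0) for f in faces}

    # Step 3: one Recognised Text per Speech Object.
    recognised = {o["id"]: f"<transcript {o['id']}>" for o in speech_objects}

    # Step 4: merge the two geometries into the Audio-Visual Scene Geometry.
    av_geometry = {"audio": audio_geometry, "visual": visual_geometry}

    # Step 5: a class ID for each Visual Object.
    object_ids = {b["id"]: "object-class" for b in bodies}

    # Step 6: Meaning and Refined Text from each Recognised Text.
    refined = dict(recognised)
    meaning = {k: {"intent": "unknown"} for k in recognised}

    # Step 7: identity recognition from Speech and Face Descriptors.
    speaker_ids = {o["id"]: f"speaker-{o['id']}" for o in speech_objects}
    face_ids = {f["id"]: f"face-{f['id']}" for f in faces}

    # Step 8: Personal Status of each human.
    personal_status = {k: {"emotion": None} for k in speaker_ids}

    # Step 9: Entity Dialogue Processing consumes the results above.
    return entity_dialogue_processing(speaker_ids, face_ids, meaning,
                                      refined, personal_status)

def entity_dialogue_processing(speaker_ids, face_ids, meaning, refined, status):
    # Validates the identities and produces the HCI response (stubbed).
    return {"text": "response", "personal_status": {"emotion": "neutral"}}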

The HCI interacts with the humans in the cabin in several ways:

  1. By responding to commands/queries from one or more humans at the same time, e.g.:
    • Commands to go to a waypoint, park at a place, etc.
    • Commands with an effect in the cabin, e.g., turn off the air conditioning, turn on the radio, call a person, open a window or door, search for information, etc.
  2. By conversing with and responding to questions from one or more humans at the same time about travel-related issues (in-depth domain-specific conversation), e.g.:
    • Humans request information, e.g., time to destination, route conditions, weather at destination, etc.
    • CAV offers alternatives to humans, e.g., long but safe way, short but likely to have interruptions.
    • Humans ask questions about objects in the cabin.
  3. By following the conversation on travel matters held by humans in the cabin if 1) the passengers allow the HCI to do so, and 2) the processing is carried out inside the CAV.
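
How these three modes might be dispatched on the Meaning extracted from a request is sketched below; the intent labels and the routing logic are assumptions, not part of the specification.

def dispatch(meaning):
    # Route a request according to the three interaction modes above.
    intent = meaning.get("intent")
    if intent in ("go_to_waypoint", "park"):
        return "HCI-AMS command"        # mode 1: motion command for the AMS
    if intent in ("air_conditioning", "radio", "call", "window", "search"):
        return "cabin actuation"        # mode 1: effect inside the cabin
    if intent in ("time_to_destination", "route_conditions", "weather"):
        return "travel dialogue"        # mode 2: travel-related conversation
    return "open conversation"          # mode 3: only if the passengers allow it

print(dispatch({"intent": "park"}))     # prints: HCI-AMS command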

The Audio Scene Description AIM provides all the Speech Objects in the Audio Scene, removing all other audio sources. The Speaker Identity Recognition and Automatic Speech Recognition AIMs support multiple Speech Objects as input. Each Speech Object has an identifier to enable the Speaker Identity Recognition and Automatic Speech Recognition AIMs to provide labelled Speaker IDs and Recognised Texts. If the Face Identity Recognition AIM provides Face IDs corresponding to the Speaker IDs, the Entity Dialogue Processing AIM can correctly associate the Speaker IDs (and the corresponding Recognised Texts) with the Face IDs.
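
The association described above amounts to a join on the shared object identifiers. A minimal sketch, assuming every stream is keyed by the same per-human identifier assigned during scene description:

def associate(speaker_ids, recognised_texts, face_ids):
    # All three dicts are keyed by the Speech/Face Object identifier,
    # so the association is a simple lookup.
    return [{"speaker": speaker_ids[k],
             "text": recognised_texts[k],
             "face": face_ids.get(k)}   # None when no Face ID matches
            for k in speaker_ids]

print(associate({0: "spk-A"}, {0: "hello"}, {0: "face-A"}))
# prints: [{'speaker': 'spk-A', 'text': 'hello', 'face': 'face-A'}]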

3      I/O Data of Human-CAV Interaction

Table 1 gives the input/output data of Human-CAV Interaction.

Note that communication with the Autonomous Motion Subsystem (AMS) and remote HCI Subsystem is not specified here.

Table 1 – I/O data of Human-CAV Interaction

Input data | From | Description
Input Audio (Outdoor) | Environment Sensing Subsystem | User authentication; user commands; user conversation
Input Audio (Indoor) | Cabin Passengers | Users' social life; commands/interaction with the HCI
Input Visual (Outdoor) | Environment Sensing Subsystem | Commands/interaction with the HCI
Input Visual (Indoor) | Cabin Passengers | Users' social life; commands/interaction with the HCI
AMS-HCI Message | Autonomous Motion Subsystem | Includes the response to an HCI-AMS Message
Inter-HCI Information | Remote HCI | HCI-to-HCI information

Output data | To | Comments
Inter-HCI Information | Remote HCI | HCI-to-HCI information
HCI-AMS Message | Autonomous Motion Subsystem | HCI-to-AMS information
Machine Portable Avatar | Cabin Passengers | The HCI's avatar
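
Restated as typed containers, the rows of Table 1 might look as follows; the Python field names are illustrative renderings of the data names, not normative identifiers.

from dataclasses import dataclass
from typing import Any

@dataclass
class HCIInput:
    input_audio_outdoor: Any    # from the Environment Sensing Subsystem
    input_audio_indoor: Any     # from the Cabin Passengers
    input_visual_outdoor: Any   # from the Environment Sensing Subsystem
    input_visual_indoor: Any    # from the Cabin Passengers
    ams_hci_message: Any        # response to an HCI-AMS Message
    inter_hci_information: Any  # from a Remote HCI

@dataclass
class HCIOutput:
    inter_hci_information: Any    # to a Remote HCI
    hci_ams_message: Any          # to the Autonomous Motion Subsystem
    machine_portable_avatar: Any  # the HCI's avatar, to the Cabin Passengers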

4      Functions of AI Modules of Human-CAV Interaction

Table 2 gives the functions of all Human-CAV Interaction AIMs.

Table 2 – Functions of Human-CAV Interaction’s AI Modules

AIM | Function
Audio Scene Description | 1. Receives the Input Audio captured by the appropriate (indoor or outdoor) Microphone Array. 2. Produces the Audio Scene Descriptors.
Visual Scene Description | 1. Receives the Input Visual captured by the appropriate (indoor or outdoor) visual sensors. 2. Produces the Visual Scene Descriptors.
Automatic Speech Recognition | 1. Receives the Input Speech of one of the humans. 2. Converts the speech into Recognised Text.
Audio-Visual Alignment | 1. Receives the Audio and Visual Scene Geometries and the Audio and Visual Objects. 2. Re-identifies the Audio and Visual Objects having the same Spatial Attitudes.
Visual Object Identification | 1. Receives Body Descriptors, Visual Scene Geometry, and Visual Objects. 2. Provides the ID of the class of objects of which the Visual Object is an instance.
Natural Language Understanding | 1. Receives Recognised Text, Input Text, and the Visual Object Instance ID. 2. Produces Refined Text and Meaning.
Speaker Identity Recognition | 1. Receives a Speech Object. 2. Provides the Speaker ID.
Personal Status Extraction | 1. Receives Input Speech, Meaning, Body Descriptors, and Face Descriptors. 2. Provides the Input Personal Status of the human.
Face Identity Recognition | 1. Receives a Face Object. 2. Provides the Face ID.
Entity Dialogue Processing | 1. Receives Speaker ID, Meaning, Refined Text, Input Personal Status, and Face ID. 2. Provides the Machine (HCI) Text and Personal Status.
Personal Status Display | 1. Receives the Machine Personal Status and Text. 2. Produces the Machine Portable Avatar.
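
Every row of Table 2 follows the same receive/produce pattern, so a common interface can be sketched; this is an assumption for illustration, not the MPAI-AIF AIM interface.

from abc import ABC, abstractmethod
from typing import Any, Mapping

class AIM(ABC):
    # An AI Module: consumes named input data, produces named output data.
    @abstractmethod
    def process(self, inputs: Mapping[str, Any]) -> Mapping[str, Any]:
        ...

class AutomaticSpeechRecognition(AIM):
    def process(self, inputs):
        # Receives a Speech Object; converts it into Recognised Text (stub).
        return {"recognised_text": f"<transcript of {inputs['speech_object']}>"}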

5      I/O Data of AI Modules of Human-CAV Interaction

Table 3 gives the input and output data of the AI Modules of Human-CAV Interaction depicted in Figure 1.

Table 3 – AI Modules of Human-CAV interaction

AIM | Receives | Produces
Audio Scene Description | Input Audio (outdoor); Input Audio (indoor) | Speech Objects
Visual Scene Description | Input Visual (outdoor); Input Visual (indoor) | Face Object; Visual Object; Body Descriptors; Face Descriptors
Automatic Speech Recognition | Speech Object | Recognised Text
Audio-Visual Alignment | Audio Scene Geometry; Visual Scene Geometry | Participant ID
Visual Object Identification | Visual Object; Visual Scene Geometry; Body Descriptors | Visual Object Instance Identifier
Natural Language Understanding | Recognised Text; Participant ID; Visual Object Instance Identifier | Meaning; Refined Text; Participant ID
Speaker Identity Recognition | Speech Descriptors | Speaker ID
Personal Status Extraction | Input Speech; Meaning; Participant ID; Face Descriptors; Body Descriptors | Personal Status; Participant ID
Face Identity Recognition | Face Object | Face ID
Entity Dialogue Processing | Participant ID; Speaker ID; Meaning; Refined Text; Personal Status; Face ID | Output Text; Output Personal Status
Personal Status Display | Machine Text; Output Personal Status | Machine Portable Avatar
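
Two rows of Table 3 rendered as function stubs make the Receives/Produces columns concrete; the signatures and return values are illustrative only.

def speaker_identity_recognition(speech_descriptors):
    # Receives Speech Descriptors; produces a Speaker ID (stub).
    return "speaker-0"

def entity_dialogue_processing(participant_id, speaker_id, meaning,
                               refined_text, personal_status, face_id):
    # Receives the six inputs of its Table 3 row; produces Output Text
    # and Output Personal Status (stub).
    return "response text", {"emotion": "neutral"}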

6      AIW and AIM Specification and JSON Metadata

Table 4 – AIW and AIM Specification and JSON Metadata

AIW and AIMs | Name | JSON
MMC-HCI | Human-CAV Interaction | X
CAE-ASD | Audio Scene Description | X
CAE-AAT | Audio Analysis Transform | X
CAE-ASL | Audio Source Localisation | X
CAE-ASE | Audio Separation and Enhancement | X
CAE-AST | Audio Synthesis Transform | X
CAE-AMX | Audio Descriptor Multiplexing | X
OSD-VSD | Visual Scene Description | X
MMC-ASR | Automatic Speech Recognition | X
OSD-AVA | Audio-Visual Alignment | X
OSD-VOI | Visual Object Identification | X
OSD-VDI | Visual Direction Identification | X
OSD-VOE | Visual Object Extraction | X
OSD-VII | Visual Instance Identification | X
MMC-NLU | Natural Language Understanding | X
MMC-SIR | Speaker Identity Recognition | X
MMC-PSE | Personal Status Extraction | X
MMC-ITD | Input Text Description | X
MMC-ISD | Input Speech Description | X
PAF-IFD | Input Face Description | X
PAF-IBD | Input Body Description | X
MMC-PTI | PS-Text Interpretation | X
MMC-PSI | PS-Speech Interpretation | X
PAF-PFI | PS-Face Interpretation | X
PAF-PGI | PS-Gesture Interpretation | X
MMC-PMX | Personal Status Multiplexing | X
MMC-EDP | Entity Dialogue Processing | X
PAF-FIR | Face Identity Recognition | X
PAF-PSD | Personal Status Display | X
MMC-TTS | Text-to-Speech | X
PAF-IFD | Input Face Description | X
PAF-IBD | Input Body Description | X
PAF-PMX | Portable Avatar Multiplexing | X
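
Each "X" above indicates that JSON metadata is published for the corresponding AIW or AIM. The sketch below illustrates the kind of information such a record could carry; the field names are assumptions, not the normative MPAI-AIF metadata schema.

import json

aim_metadata = {
    "Identifier": {
        "Standard": "MPAI-MMC",        # assumed field: the governing standard
        "AIW": "MMC-HCI",              # assumed field: the enclosing workflow
        "AIM": "MMC-ASR",              # assumed field: this module
        "Version": "2",                # assumed field
    },
    "Description": "Automatic Speech Recognition",
    "Ports": {
        "Input": ["Speech Object"],    # mirrors the Receives column of Table 3
        "Output": ["Recognised Text"], # mirrors the Produces column of Table 3
    },
}

print(json.dumps(aim_metadata, indent=2))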