1     Functions

2     Reference Model 

3     I/O Data

4     Functions of AI Modules

5     I/O Data of AI Module

6     AIW, AIMs and JSON Metadata

7     AI Modules

8    Data Types

1      Functions of Human-CAV Interaction Subsystem

The MPAI Connected Autonomous Vehicle (CAV) – Architecture specifies the Reference Model of a Vehicle, called Connected Autonomous Vehicle (CAV), that is able to reach a destination by understanding the environment using its own sensors, exchanging information with other CAVs, and actuating motion. The Reference Model subdivides a CAV into four Subsystems. Annex 1 – MPAI Basics Chapter 6 introduces MPAI-CAV. Reference [5] provides the full specification.

The function of the Human-CAV Interaction (HCI) Subsystem is to recognise the human owner or renter, respond to humans’ commands and queries, converse with humans during travel, exchange information with the Autonomous Motion Subsystem in response to humans’ requests, and communicate with HCIs on board other CAVs.

2      Reference Model of Human-CAV Interaction Subsystem

Figure 1 represents the Human-CAV Interaction (HCI) Reference Model.

Natural Language Understanding is assumed to produce a Refined Text that is either the refined Recognised Text or the Input Text, depending on which one is active. Meaning is always computed from the available text, whether Refined or Input. Personal Status Extraction is unaware of the decisions made by Natural Language Understanding.

Figure 1 – Human-CAV Interaction Reference Model
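
The note on Refined Text above can be illustrated with a minimal Python sketch. It is not part of the specification: the function names `refine` and `select_text` are hypothetical, the refinement step is reduced to whitespace normalisation, and the rule that speech takes priority when both channels carry text is an assumption, since the text only says that the choice depends on which input is active.

```python
from typing import Optional


def refine(recognised_text: str) -> str:
    """Hypothetical stand-in for refinement: only normalises whitespace."""
    return " ".join(recognised_text.split())


def select_text(recognised_text: Optional[str], input_text: Optional[str]) -> str:
    """Return the text from which Meaning would be computed.

    Assumption (not stated in the specification): when both channels
    carry text, the Recognised Text takes priority over the Input Text.
    """
    if recognised_text and recognised_text.strip():
        return refine(recognised_text)
    if input_text and input_text.strip():
        return input_text
    raise ValueError("neither Recognised Text nor Input Text is available")
```

For example, `select_text(None, "park here")` returns the Input Text unchanged, while a non-empty Recognised Text is refined first.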

A group of humans approaches the CAV from outside or is sitting inside the CAV:

  1. Audio-Visual Scene Description produces Audio Scene Descriptors in the form of Audio (Speech) Objects corresponding to each speaking human in the Environment (outside or inside the CAV) and Visual Scene Descriptors in the form of Descriptors of Faces and Bodies. All non-Speech Objects are removed from or signaled in the Audio Scene.
  2. Automatic Speech Recognition recognises the speech of each human and produces Recognised Text. It supports multiple Speech Objects as input, each identified by its Spatial Attitude.
  3. Visual Object Identification produces Instance IDs of Visual Objects indicated by humans.
  4. Natural Language Understanding produces Refined Text and extracts Meaning from the Recognised Text of each Input Speech using the spatial information of Visual Object Identifiers.
  5. Speaker Identity Recognition and Face Identity Recognition identify the humans the HCI is interacting with. If the Face Identity Recognition AIM provides Face IDs corresponding to the Speaker IDs, the Entity Dialogue Processing AIM can correctly associate the Speaker IDs (and the corresponding Text) with the Face IDs.
  6. Personal Status Extraction extracts the Personal Status of the humans.
  7. Personal Status Display produces the Machine Portable Avatar conveying Machine Speech and Machine Personal Status.
  8. Audio-Visual Scene Rendering renders Audio-Visual information using the Machine Portable Avatar or the Autonomous Motion Subsystem’s Full Environment Representation, based on the Point of View provided by a human.
  9. Entity Dialogue Processing communicates with:
    1. The CAV’s Autonomous Motion Subsystem, e.g., to request:
      1. That the CAV moves to a destination
      2. Views of the AMS’s Full Environment Representation for passengers’ benefit
      3. To be informed about CAV’s internal situation
      4. Information from the AMS that may be relevant to passengers.
    2. The Autonomous Motion Subsystems of Remote CAVs.
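
The association of Speaker IDs with Face IDs mentioned in step 5 can be sketched as follows. This is only an illustration under stated assumptions: the specification does not say how the correspondence is established, so the sketch assumes that both identities are keyed by a shared Participant ID. The class names, field names, and the `associate` function are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Utterance:
    participant_id: str  # assumed key shared by speech and face data
    speaker_id: str      # as produced by Speaker Identity Recognition
    text: str            # text attributed to this speaker


@dataclass
class IdentifiedUtterance:
    speaker_id: str
    face_id: Optional[str]  # None when no Face ID is available
    text: str


def associate(utterances: List[Utterance],
              face_ids: Dict[str, str]) -> List[IdentifiedUtterance]:
    """Attach a Face ID to each utterance via the shared Participant ID."""
    return [
        IdentifiedUtterance(u.speaker_id, face_ids.get(u.participant_id), u.text)
        for u in utterances
    ]
```

An utterance whose participant has no Face ID keeps `face_id=None`, which reflects the conditional wording of step 5: the association is only possible when Face Identity Recognition provides a matching Face ID.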

The HCI interacts with the humans in the cabin in several ways:

  1. By responding to commands/queries from one or more humans at the same time, e.g.:
    • Commands to go to a waypoint, park at a place, etc.
    • Commands with an effect in the cabin, e.g., turn off air conditioning, turn on the radio, call a person, open window or door, search for information etc.
  2. By conversing with and responding to questions from one or more humans at the same time about travel-related issues (in-depth domain-specific conversation), e.g.:
    • Humans request information, e.g., time to destination, route conditions, weather at destination, etc.
    • CAV offers alternatives to humans, e.g., long but safe way, short but likely to have interruptions.
    • Humans ask questions about objects in the cabin.
  3. By following the conversation on travel matters held by humans in the cabin if 1) the passengers allow the HCI to do so, and 2) the processing is carried out inside the CAV.
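
The two conditions in item 3 must both hold before the HCI follows a conversation. A minimal sketch, with a hypothetical function name and parameters:

```python
def may_follow_conversation(passengers_allow: bool, processed_inside_cav: bool) -> bool:
    """Both conditions of item 3 are required: consent and on-board processing."""
    return passengers_allow and processed_inside_cav
```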

3      I/O Data of Human-CAV Interaction

Table 1 gives the input/output data of Human-CAV Interaction.

Table 1 – I/O data of Human-CAV Interaction

Input data | From | Comment
Input Audio | Environment, Passenger Cabin | User authentication, command/interaction with HCI, etc.
Input Text | User | Text complementing/replacing User input
Input Visual | Environment, Passenger Cabin | Environment perception, User authentication, command/interaction with HCI, etc.
AMS-HCI Message | AMS Subsystem | AMS response to HCI request.
Remote-Ego HCI Message | Remote HCI | Remote HCI to Ego HCI.
Output data | To | Comment
Output Audio | Cabin Passengers | HCI’s avatar Audio.
Output Visual | Cabin Passengers | HCI’s avatar Visual.
HCI-AMS Message | AMS Subsystem | HCI request to AMS, e.g., Route or Point of View.
Ego-Remote HCI Message | Remote HCI | Ego HCI to Remote HCI.
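
As an illustration of the HCI-AMS Message in Table 1, the following Python sketch models a request from the HCI to the AMS. The message format is not defined in this chapter, so every name here (`AMSRequest`, `HCIAMSMessage`, `route_request`, the payload keys) is a hypothetical placeholder. The four request kinds mirror the examples given for Entity Dialogue Processing in Section 2.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Any, Dict


class AMSRequest(Enum):
    """Request kinds taken from the examples in Section 2 (names are hypothetical)."""
    GO_TO_DESTINATION = "go_to_destination"
    ENVIRONMENT_VIEW = "environment_view"
    CAV_STATUS = "cav_status"
    PASSENGER_INFO = "passenger_info"


@dataclass
class HCIAMSMessage:
    """Illustrative container for an HCI request to the AMS."""
    request: AMSRequest
    payload: Dict[str, Any] = field(default_factory=dict)


def route_request(destination: str) -> HCIAMSMessage:
    """Build a request asking the AMS to move the CAV to a destination."""
    return HCIAMSMessage(AMSRequest.GO_TO_DESTINATION, {"destination": destination})
```

A call such as `route_request("Central Station")` yields a message whose `request` is `AMSRequest.GO_TO_DESTINATION`. A real implementation would follow the message format defined in the relevant MPAI data type specifications.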

4      Functions of AI Modules of Human-CAV Interaction

Table 2 gives the functions of all Human-CAV Interaction AIMs.

Table 2 – Functions of Human-CAV Interaction’s AI Modules

AIM | Function
Audio-Visual Scene Description | 1. Receives Input Audio from the appropriate (indoor or outdoor) Microphone Array and Input Visual. 2. Produces the Audio-Visual Scene Descriptors.
Automatic Speech Recognition | 1. Receives Speech Objects. 2. Produces Recognised Text.
Visual Object Identification | 1. Receives Visual Scene Descriptors. 2. Extracts Visual Objects. 3. Uses visual information from a human (e.g., pointing finger). 4. Provides Instance ID of an indicated Visual Object.
Natural Language Understanding | 1. Receives Recognised Text. 2. Uses context information (e.g., Instance ID of object). 3. Produces Natural Language Understanding Text (using either Refined or Input Text) and Meaning.
Speaker Identity Recognition | 1. Receives Speech Object. 2. Produces Speaker ID.
Personal Status Extraction | 1. Receives Speech Object, Meaning, Face Descriptors and Body Descriptors of a human with a Participant ID. 2. Produces the human’s Personal Status.
Face Identity Recognition | 1. Receives Face Object of a human with a Participant ID. 2. Produces Face ID.
Entity Dialogue Processing | 1. Receives Speaker ID, Face ID, AV Scene Descriptors, Meaning, Natural Language Understanding Text, Visual Object ID, and Personal Status. 2. Produces Machine (HCI) Text and Machine Personal Status.
Personal Status Display | 1. Receives Machine Text and Machine Personal Status. 2. Produces Machine’s Portable Avatar.
Audio-Visual Scene Rendering | 1. Receives AV Scene Descriptors or Portable Avatar. 2. Produces Output Text, Output Audio, and Output Visual.

5      I/O Data of AI Modules of Human-CAV Interaction

Table 3 gives the input and output data of the AI Modules of the Human-CAV Interaction depicted in Figure 1.

Table 3 – AI Modules of Human-CAV Interaction AIW

AIM | Input | Output
Audio-Visual Scene Description | Input Audio; Input Visual | AV Scene Descriptors
Automatic Speech Recognition | Speech Object | Recognised Text
Visual Object Identification | AV Scene Descriptors; Visual Objects | Visual Object Instance ID
Natural Language Understanding | Recognised Text; AV Scene Descriptors; Visual Object ID; Input Text | Natural Language Understanding Text; Meaning
Speaker Identity Recognition | Speech Object | Speaker ID
Personal Status Extraction | Meaning; Input Speech; Face Descriptors; Body Descriptors | Personal Status
Face Identity Recognition | Face Object | Face ID
Entity Dialogue Processing | Remote-Ego HCI Message; AMS-HCI Message; Speaker ID; Meaning; Natural Language Understanding Text; Personal Status; Face ID | Ego-Remote HCI Message; HCI-AMS Message; Machine Text; Machine Personal Status
Personal Status Display | Machine Personal Status; Machine Text | Machine Portable Avatar
Audio-Visual Scene Rendering | AV Scene Descriptors; Machine Portable Avatar; Point of View | Output Text; Output Audio; Output Visual
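
The data dependencies implied by Table 3 can be sketched as a directed graph from which a valid processing order follows. The sketch below is an informal reading of the table, not the normative AIW topology. It assumes that Speech Objects, Face Objects, and Face and Body Descriptors are all taken from the AV Scene Descriptors, as suggested by step 1 in Section 2. Inputs from outside the HCI (Input Text, Point of View, AMS and Remote HCI messages) are left out. The dictionary `DEPENDS_ON` and the function `execution_order` are hypothetical names, and `graphlib` requires Python 3.9 or later.

```python
from graphlib import TopologicalSorter
from typing import Dict, List, Set

AVSD = "Audio-Visual Scene Description"
ASR = "Automatic Speech Recognition"
VOI = "Visual Object Identification"
NLU = "Natural Language Understanding"
SIR = "Speaker Identity Recognition"
PSE = "Personal Status Extraction"
FIR = "Face Identity Recognition"
EDP = "Entity Dialogue Processing"
PSD = "Personal Status Display"
AVR = "Audio-Visual Scene Rendering"

# Each AIM maps to the AIMs whose outputs it consumes (an informal reading of Table 3).
DEPENDS_ON: Dict[str, Set[str]] = {
    AVSD: set(),                    # consumes only Input Audio and Input Visual
    ASR: {AVSD},                    # Speech Object
    VOI: {AVSD},                    # AV Scene Descriptors
    NLU: {ASR, VOI, AVSD},          # Recognised Text, Visual Object ID, AV Scene Descriptors
    SIR: {AVSD},                    # Speech Object
    FIR: {AVSD},                    # Face Object
    PSE: {NLU, AVSD},               # Meaning, speech, Face and Body Descriptors
    EDP: {SIR, FIR, NLU, PSE},      # Speaker ID, Face ID, Meaning, text, Personal Status
    PSD: {EDP},                     # Machine Text, Machine Personal Status
    AVR: {PSD, AVSD},               # Machine Portable Avatar, AV Scene Descriptors
}


def execution_order() -> List[str]:
    """Return one valid order in which the AIMs could be run."""
    return list(TopologicalSorter(DEPENDS_ON).static_order())
```

Under these assumptions, Audio-Visual Scene Description always comes first, and Entity Dialogue Processing, Personal Status Display, and Audio-Visual Scene Rendering come last in that order. The relative order of independent AIMs such as Speaker Identity Recognition and Face Identity Recognition is not fixed.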

6      AIW, AIMs and JSON Metadata

AIMs in italic are not final.

Table 4 – AIMs and JSON Metadata

AIW AIM Name JSON
MMC-HCI Human-CAV Interaction X
OSD-AVS Audio-Visual Scene Description X
CAE-ASD Audio Scene Description X
CAE-AAT Audio Analysis Transform X
CAE-ASL Audio Source Localisation X
CAE-ASE Audio Separation and Enhancement X
CAE-AST Audio Synthesis Transform X
CAE-ADM Audio Description Multiplexing X
OSD-VSD Visual Scene Description X
MMC-ASR Automatic Speech Recognition X
OSD-AVA Audio-Visual Alignment X
OSD-VOI Visual Object Identification X
OSD-VDI Visual Direction Identification X
OSD-VOE Visual Object Extraction X
OSD-VII Visual Instance Identification X
MMC-NLU Natural Language Understanding X
MMC-SIR Speaker Identity Recognition X
MMC-PSE Personal Status Extraction X
MMC-ITD Input Text Description X
MMC-ISD Input Speech Description X
PAF-IFD Input Face Description X
PAF-IBD Input Body Description X
MMC-PTI PS-Text Interpretation X
MMC-PSI PS-Speech Interpretation X
PAF-PFI PS-Face Interpretation X
PAF-PGI PS-Gesture Interpretation X
MMC-PMX Personal Status Multiplexing X
MMC-EDP Entity Dialogue Processing X
PAF-FIR Face Identity Recognition X
PAF-PSD Personal Status Display X
MMC-TTS Text-to-Speech X
PAF-IFD Input Face Description X
PAF-IBD Input Body Description X
PAF-PMX Portable Avatar Multiplexing X
PAF-AVR Audio-Visual Scene Rendering X

7      AI Modules

An initial version of the HCI AI Modules is available.

8      Data Types

An initial version of the HCI Data Types is available.