

1       Functions of Human-CAV Interaction Subsystem

2       Reference Model of Human-CAV Interaction Subsystem

3       I/O Data of Human-CAV Interaction

4       Functions of Human-CAV Interaction’s AI Modules

5       I/O Data of Human-CAV Interaction’s AI Modules

6       Data Types

1        Functions of Human-CAV Interaction Subsystem

The MPAI Connected Autonomous Vehicle (CAV) – Architecture specification defines the Reference Model of a vehicle, called a Connected Autonomous Vehicle (CAV), that can reach a destination by understanding the environment with its own sensors, exchanging information with other CAVs, and actuating motion. The Reference Model subdivides a CAV into four Subsystems. Annex 1 – MPAI Basics, Chapter 6, introduces MPAI-CAV. Reference [5] provides the full specification.

The Human-CAV Interaction (HCI) Subsystem recognises the human owner or renter, responds to humans’ commands and queries, converses with humans during the travel, exchanges information with the Autonomous Motion Subsystem (AMS) in response to humans’ requests, and communicates with HCIs on board other CAVs.

2        Reference Model of Human-CAV Interaction

A group of humans approaches the CAV from outside or is seated inside it. The HCI Subsystem then operates as follows:

  1. Audio-Visual Scene Description produces Audio Scene Descriptors in the form of Audio (Speech) Objects corresponding to each speaking human in the Environment (outside or inside the CAV) and Visual Scene Descriptors in the form of Descriptors of Faces and Bodies. Note that all non-Speech Objects are removed from the Audio Scene.
  2. Automatic Speech Recognition recognises the speech of each human and produces Recognised Text. It accepts multiple Speech Objects as input. Each Speech Object has an identifier so that Speaker Identity Recognition can provide labelled Recognised Texts.
  3. Visual Object Identification produces Identifiers of Visual Objects indicated by a human.
  4. Natural Language Understanding extracts Meaning and produces Refined Text from the Recognised Text of each Input Speech potentially using spatial information of Visual Object Identifiers.
  5. Speaker Identity Recognition and Face Identity Recognition authenticate the humans that the HCI is interacting with. If the Face Identity Recognition AIM provides Face IDs corresponding to the Speaker IDs, the Entity Dialogue Processing AIM can correctly associate the Speaker IDs (and the corresponding Recognised Texts) with the Face IDs.
  6. Personal Status Extraction extracts the Personal Status of the humans.
  7. Personal Status Display produces the ready-to-render Machine Portable Avatar [14] conveying Machine Speech and Machine Personal Status.
  8. Audio-Visual Scene Rendering renders either the information from the Machine Portable Avatar or the AMS’s Full Environment Representation, based on the Point of View provided by the human.

Figure 4 – Reference Model of the CAV-HCI Subsystem
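Steps 2 and 5 above describe how Recognised Texts are labelled with Speaker IDs and then associated with Face IDs. The following is a minimal sketch of that association, not the specified behaviour. It assumes that each AIM's results can be represented as plain dictionaries and that the correspondence between Speaker IDs and Face IDs is already available as a lookup. The names `LabelledUtterance`, `label_utterances`, and all identifiers in the example are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class LabelledUtterance:
    """One Recognised Text with the identities attached to it (hypothetical type)."""
    speech_object_id: str
    speaker_id: Optional[str]   # None if the speaker was not recognised
    face_id: Optional[str]      # None if no Face ID corresponds to the Speaker ID
    recognised_text: str


def label_utterances(
    recognised_texts: Dict[str, str],   # Speech Object identifier -> Recognised Text
    speaker_ids: Dict[str, str],        # Speech Object identifier -> Speaker ID
    face_for_speaker: Dict[str, str],   # Speaker ID -> Face ID (assumed lookup)
) -> List[LabelledUtterance]:
    """Join the per-Speech-Object results on the Speech Object identifier."""
    labelled = []
    for object_id, text in recognised_texts.items():
        speaker_id = speaker_ids.get(object_id)
        # The Face ID can only be attached when the speaker is known.
        face_id = face_for_speaker.get(speaker_id) if speaker_id is not None else None
        labelled.append(LabelledUtterance(object_id, speaker_id, face_id, text))
    return labelled


if __name__ == "__main__":
    for utterance in label_utterances(
        {"s1": "Take me home", "s2": "Open the window"},
        {"s1": "spk-A", "s2": "spk-B"},
        {"spk-A": "face-7"},
    ):
        print(utterance)
```

When no Face ID corresponds to a Speaker ID, the sketch keeps the utterance and leaves `face_id` empty, so that a downstream consumer can still use the text.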

3        I/O Data of Human-CAV Interaction

Table 4 gives the input/output data of the Human-CAV Interaction Subsystem.

Table 4 – I/O data of Human-CAV Interaction

Input data | From | Comment
Input Text | User | Text complementing/replacing User input.
Input Audio | Environment | User authentication, command/interaction with HCI, etc.
Input Visual | Environment, Passenger Cabin | Environment perception, User authentication, command/interaction with HCI, etc.
AMS-HCI Message | AMS Subsystem | AMS response to HCI request.
Remote-Ego HCI Message | Remote HCI | Remote HCI to Ego HCI.

Output data | To | Comment
Output Text | Ego HCI | Text complementing/replacing other media.
Output Audio | Cabin Passengers | HCI’s avatar Audio.
Output Visual | Cabin Passengers | HCI’s avatar Visual.
HCI-AMS Message | AMS Subsystem | HCI request to AMS, e.g., Route or Point of View.
Ego-Remote HCI Message | Remote HCI | Ego HCI to Remote HCI.
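Table 4 pairs an HCI-AMS Message (an HCI request such as Route or Point of View) with an AMS-HCI Message (the AMS response). The sketch below illustrates one way such a request/response pair could be matched. The message layout is not taken from the specification: the fields `request_id` and `payload`, the `RequestKind` values, and the function `match_response` are all assumptions made for illustration.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Any, Dict


class RequestKind(Enum):
    """The two request examples named in Table 4; the enum itself is hypothetical."""
    ROUTE = "Route"
    POINT_OF_VIEW = "Point of View"


@dataclass(frozen=True)
class HciAmsMessage:
    """HCI request to the AMS (assumed layout)."""
    request_id: int
    kind: RequestKind
    payload: Any


@dataclass(frozen=True)
class AmsHciMessage:
    """AMS response to an HCI request (assumed layout)."""
    request_id: int   # assumed to echo the identifier of the request it answers
    payload: Any


def match_response(
    pending: Dict[int, HciAmsMessage], response: AmsHciMessage
) -> HciAmsMessage:
    """Remove and return the pending request that this response answers."""
    try:
        return pending.pop(response.request_id)
    except KeyError:
        raise ValueError(
            f"no pending request with id {response.request_id}"
        ) from None


if __name__ == "__main__":
    pending = {1: HciAmsMessage(1, RequestKind.ROUTE, "home")}
    answered = match_response(pending, AmsHciMessage(1, ["waypoint-1", "waypoint-2"]))
    print(answered.kind.value, "request answered; still pending:", len(pending))
```

Keeping pending requests in a dictionary keyed by identifier lets the HCI reject a response that does not answer any outstanding request.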

4        Functions of Human-CAV Interaction’s AI Modules

Table 5 gives the functions of all Human-CAV Interaction Subsystem AIMs.

Table 5 – Functions of Human-CAV Interaction’s AI Modules

AIM | Function
Audio-Visual Scene Description | 1. Receives Input Audio (Microphone Array), Input Visual, and Input LiDAR captured by the appropriate (indoor or outdoor) sensors. 2. Produces the Audio-Visual Scene Descriptors.
Automatic Speech Recognition | 1. Receives Speech Objects. 2. Produces Recognised Text.
Visual Object Identification | 1. Receives Visual Scene Descriptors. 2. Uses Visual Objects and visual information from the human (finger pointing). 3. Provides the ID of the class of objects of which the indicated Visual Object is an Instance.
Natural Language Understanding | 1. Receives Recognised Text. 2. Uses context information (e.g., Instance ID of object). 3. Produces NLU Text (either Refined or Input) and Meaning.
Speaker Identity Recognition | 1. Receives Speech Object. 2. Produces Speaker ID.
Personal Status Extraction | 1. Receives Speech Object, Meaning, Refined Text, Face Descriptors, and Body Descriptors of a human with a Participant ID. 2. Produces the Personal Status of a human.
Face Identity Recognition | 1. Receives Face Object of a human with a Participant ID. 2. Produces Face ID.
Entity Dialogue Processing | 1. Receives Speaker ID and Face ID, AV Scene Descriptors, Meaning, Text from Natural Language Understanding, Visual Object ID, and Personal Status. 2. Produces Machine Text (HCI response) and Machine (HCI) Personal Status.
Personal Status Display | 1. Receives Machine Text and Machine Personal Status. 2. Produces Machine Portable Avatar.
Audio-Visual Scene Rendering | 1. Receives AV Scene Descriptors or Portable Avatar. 2. Produces Output Text, Output Audio, and Output Visual.

5        I/O Data of Human-CAV Interaction’s AI Modules

Table 6 gives the input/output data of the Human-CAV Interaction AIMs.

Table 6 – I/O Data of Human-CAV Interaction’s AI Modules

AIM | Input | Output
Audio-Visual Scene Description | Input Audio; Input Visual | AV Scene Descriptors
Automatic Speech Recognition | Speech Object | Recognised Text
Visual Object Identification | AV Scene Descriptors; Visual Objects | Visual Object Instance ID
Natural Language Understanding | Recognised Text; AV Scene Descriptors; Visual Object ID; Input Text | Natural Language Understanding Text; Meaning
Speaker Identity Recognition | Speech Object | Speaker ID
Personal Status Extraction | Meaning; Input Speech; Face Descriptors; Body Descriptors | Personal Status
Face Identity Recognition | Face Object | Face ID
Entity Dialogue Processing | Remote-Ego HCI Message; AMS-HCI Message; Speaker ID; Meaning; Natural Language Understanding Text; Personal Status; Face ID | Ego-Remote HCI Message; HCI-AMS Message; Machine Text; Machine Personal Status
Personal Status Display | Machine Personal Status; Machine Text | Machine Portable Avatar
Audio-Visual Scene Rendering | AV Scene Descriptors; Machine Portable Avatar; Point of View | Output Text; Output Audio; Output Visual
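The inputs and outputs in Table 6 imply an order in which the AIMs can run: an AIM can execute once every AIM producing one of its inputs has finished. The sketch below derives such an order with a topological sort. It rests on two assumptions that the table does not state explicitly: that Speech Object, Input Speech, Visual Objects, Face Object, Face Descriptors, and Body Descriptors are carried by the AV Scene Descriptors, and that "Visual Object Instance ID" and "Visual Object ID" name the same data. The inbound remote message is named as in Table 4. The names `AIMS`, `RESOLVES_TO`, `dependency_graph`, and `execution_order` are hypothetical.

```python
from graphlib import TopologicalSorter  # Python 3.9+

# (inputs, outputs) per AIM, following Table 6.
AIMS = {
    "Audio-Visual Scene Description": (
        {"Input Audio", "Input Visual"}, {"AV Scene Descriptors"}),
    "Automatic Speech Recognition": ({"Speech Object"}, {"Recognised Text"}),
    "Visual Object Identification": (
        {"AV Scene Descriptors", "Visual Objects"}, {"Visual Object Instance ID"}),
    "Natural Language Understanding": (
        {"Recognised Text", "AV Scene Descriptors", "Visual Object ID", "Input Text"},
        {"Natural Language Understanding Text", "Meaning"}),
    "Speaker Identity Recognition": ({"Speech Object"}, {"Speaker ID"}),
    "Personal Status Extraction": (
        {"Meaning", "Input Speech", "Face Descriptors", "Body Descriptors"},
        {"Personal Status"}),
    "Face Identity Recognition": ({"Face Object"}, {"Face ID"}),
    "Entity Dialogue Processing": (
        {"Remote-Ego HCI Message", "AMS-HCI Message", "Speaker ID", "Meaning",
         "Natural Language Understanding Text", "Personal Status", "Face ID"},
        {"Ego-Remote HCI Message", "HCI-AMS Message", "Machine Text",
         "Machine Personal Status"}),
    "Personal Status Display": (
        {"Machine Personal Status", "Machine Text"}, {"Machine Portable Avatar"}),
    "Audio-Visual Scene Rendering": (
        {"AV Scene Descriptors", "Machine Portable Avatar", "Point of View"},
        {"Output Text", "Output Audio", "Output Visual"}),
}

# ASSUMPTION: these names resolve to the data item on the right.
RESOLVES_TO = {
    "Speech Object": "AV Scene Descriptors",
    "Input Speech": "AV Scene Descriptors",
    "Visual Objects": "AV Scene Descriptors",
    "Face Object": "AV Scene Descriptors",
    "Face Descriptors": "AV Scene Descriptors",
    "Body Descriptors": "AV Scene Descriptors",
    "Visual Object Instance ID": "Visual Object ID",
}


def resolve(name):
    return RESOLVES_TO.get(name, name)


def dependency_graph(aims):
    """Map each AIM to the set of AIMs that produce one of its inputs."""
    produced_by = {
        resolve(out): aim for aim, (_, outs) in aims.items() for out in outs
    }
    graph = {}
    for aim, (ins, _) in aims.items():
        producers = {produced_by.get(resolve(item)) for item in ins}
        # Inputs with no producing AIM come from outside the Subsystem.
        graph[aim] = producers - {None, aim}
    return graph


def execution_order(aims):
    """Return one valid order in which the AIMs can be executed."""
    return list(TopologicalSorter(dependency_graph(aims)).static_order())


if __name__ == "__main__":
    for position, aim in enumerate(execution_order(AIMS), start=1):
        print(position, aim)
```

Several orders are valid because some AIMs, such as Speaker Identity Recognition and Face Identity Recognition, do not depend on each other and could run in parallel. `TopologicalSorter` would raise `graphlib.CycleError` if the table contained a circular dependency.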

6        Data Types

MPAI has already issued a Technical Specification of a subset of the CAV-HCI Data Types in [11]. This section draws heavily on that specification.

Data obtained from Audio, Visual, and LiDAR sensors are used by HCI. However, their specification is delegated to the Environment Sensing Subsystem.

An initial version of the HCI Data Types is available.