1 Functions
The Human-CAV Interaction (HCI) Subsystem has the function to recognise the human owner or renter, respond to humans' commands and queries, converse with humans, manifest itself as a perceptible entity, exchange information with the Autonomous Motion Subsystem in response to humans' requests, and communicate with the HCIs on board other CAVs.
The Human-CAV Interaction (MMC-HCI) AIM:
| Receives | Point of View | User's Point of View looking at the environment. |
| | Audio-Visual Scene Descriptors | Audio-Visual representation of the Environment. |
| | Audio Object | From the Environment. |
| | Text Object | From the User. |
| | Visual Object | From the Environment. |
| | AMS-HCI Message | AMS response to HCI request. |
| | Ego-Remote HCI Message | Remote HCI to Ego HCI message. |
| Produces | Text Object | HCI's Text. |
| | Speech Object | HCI's avatar Speech. |
| | Audio Object | HCI's avatar or FED Audio. |
| | Visual Object | HCI's avatar or FED Visual. |
| | AMS-HCI Message | HCI request to AMS, e.g., Route or Point of View. |
| | Ego-Remote HCI Message | Ego HCI to Remote HCI message. |
2 Reference Model
Figure 1 depicts the Reference Model of the Human-CAV Interaction (MMC-HCI) AIM.

Figure 1 – Reference Model of the Human-CAV Interaction (MMC-HCI) AIM
3 I/O Data
Table 1 specifies the Input and Output Data of the Human-CAV Interaction (MMC-HCI) AIM.
Table 1 – I/O Data of the Human-CAV Interaction (MMC-HCI) AIM
| Input data | Description |
| Point of View | Passenger's Point of View looking at the environment. |
| Audio-Visual Scene Descriptors | Audio-Visual representation of the environment. |
| Audio Object | User authentication, commands/interaction with the HCI, etc., and environment Audio. |
| Text Object | Text complementing/replacing User input. |
| Visual Object | Environment perception, User authentication, commands/interaction with the HCI, etc., and environment Visual. |
| AMS-HCI Message | AMS response to HCI request. |
| Ego-Remote HCI Message | Remote HCI to Ego HCI. |
| Output data | Description |
| Text Object | HCI’s output Text. |
| Speech Object | HCI's avatar Speech. |
| Audio Object | HCI’s avatar or FED Audio. |
| Visual Object | HCI’s avatar or FED Visual. |
| AMS-HCI Message | HCI request to AMS, e.g., Route or Point of View. |
| Ego-Remote HCI Message | Ego HCI to Remote HCI. |
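As a non-normative illustration, the I/O Data of Table 1 could be grouped into typed containers as in the Python sketch below; class names, field names, and types are assumptions introduced for illustration and are not part of this Technical Specification.

```python
# Illustrative sketch only: class and field names are assumptions,
# not defined by this Technical Specification.
from dataclasses import dataclass
from typing import Optional


@dataclass
class HCIInput:
    """Data received by the Human-CAV Interaction (MMC-HCI) AIM (Table 1)."""
    point_of_view: Optional[bytes] = None          # Passenger's Point of View on the environment
    av_scene_descriptors: Optional[bytes] = None   # Audio-Visual representation of the environment
    audio_object: Optional[bytes] = None           # User authentication, commands, environment Audio
    text_object: Optional[str] = None              # Text complementing/replacing User input
    visual_object: Optional[bytes] = None          # Environment perception, User authentication, etc.
    ams_hci_message: Optional[dict] = None         # AMS response to an HCI request
    ego_remote_hci_message: Optional[dict] = None  # Message from a Remote HCI


@dataclass
class HCIOutput:
    """Data produced by the Human-CAV Interaction (MMC-HCI) AIM (Table 1)."""
    text_object: Optional[str] = None              # HCI's output Text
    speech_object: Optional[bytes] = None          # HCI's avatar Speech
    audio_object: Optional[bytes] = None           # HCI's avatar or FED Audio
    visual_object: Optional[bytes] = None          # HCI's avatar or FED Visual
    ams_hci_message: Optional[dict] = None         # HCI request to AMS, e.g., Route or Point of View
    ego_remote_hci_message: Optional[dict] = None  # Message to a Remote HCI
```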
4 SubAIMs
4.1 Reference Model
Figure 2 depicts the Reference Model of the Human-CAV Interaction (MMC-HCI) Composite AIM.

Figure 2 – Reference Model of the Human-CAV Interaction (MMC-HCI) Composite AIM
4.2 Operation
The operation of the HCI subsystem is described by the following scenario, in which a group of humans approaches the CAV from outside or sits inside it (a sketch of the resulting data flow is given after this list):
- Audio-Visual Scene Description (AVS) produces:
  - Speech Scene Descriptors in the form of Speech Objects corresponding to each speaking human in the Environment (outside or inside the CAV).
  - Visual Scene Descriptors in the form of Descriptors of Faces and Bodies.
  - All non-Speech Objects are removed from the Speech Scene or signalled in the Audio Scene.
- Automatic Speech Recognition (ASR) recognises the speech of each human and produces Recognised Text; multiple Speech Objects are supported as input, each identified by its Spatial Attitude.
- Visual Object Identification (VOI) produces Instance IDs of Visual Objects indicated by humans.
- Natural Language Understanding (NLU) produces Refined Text and extracts Meaning from the Recognised Text of each Input Speech, using the spatial information of the Visual Object Identifiers. Refined Text is derived either from the Recognised Text produced by Automatic Speech Recognition or directly from the Input Text, depending on which is used; Meaning is always computed from the Recognised or Input Text, whichever is available.
- Speaker Identity Recognition (SIR) and Face Identity Recognition (FIR) identify the humans the HCI is interacting with. If FIR provides Face IDs corresponding to the Speaker IDs, the Entity Dialogue Processing AIM can correctly associate the Speaker IDs (and the corresponding Text) with the Face IDs.
- Personal Status Extraction (PSE) extracts the Personal Status of the humans.
- Entity Dialogue Processing (EDP)
  - Communicates with the Autonomous Motion Subsystem of:
    - The Ego CAV to request to:
      - Move the CAV to a destination.
      - View the Full Environment Descriptors for the passengers' benefit.
      - Be informed about the CAV's situation.
      - Receive relevant information for passengers.
    - A Remote CAV to exchange Environment Descriptors.
  - Produces the Machine Text and Machine Personal Status.
- Personal Status Display (PSD) produces the Machine Portable Avatar conveying Machine Speech, Machine Personal Status, and any other information that may be relevant for the Audio-Visual Scene Rendering AIM.
- Audio-Visual Scene Rendering (AVR) renders Audio and Visual information using the Machine Portable Avatar or the Autonomous Motion Subsystem's Full Environment Descriptors, based on the Point of View provided by the human.
- Entity Dialogue Processing (EDP):
  - Requests the AMS Subsystem to provide candidate Routes in response to a human requesting to be taken to a destination.
  - Responses from the AMS are processed by EDP and converted into multimodal messages understandable by the human.
  - Eventually, the human accepts the Route or further elaborates on the EDP response.
  - May receive messages from the Ego AMS or a Remote HCI that are processed and converted into multimodal messages understandable by the human.
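The following is a minimal, non-normative Python sketch of the data flow just described. The `aims` argument maps AIM names to caller-supplied callables; all function names, signatures, and return shapes are assumptions introduced only for illustration.

```python
# Minimal sketch of the HCI data flow described above; the `aims` argument maps
# AIM names to caller-supplied callables. All names and signatures are assumptions.

def hci_turn(aims, audio_in, visual_in, text_in, point_of_view):
    # Audio-Visual Scene Description: Speech Objects, Face/Body Descriptors, AV Scene Descriptors.
    speech_objects, faces, bodies, av_scene = aims["AVS"](audio_in, visual_in)

    # Automatic Speech Recognition for each Speech Object.
    recognised_texts = [aims["ASR"](s) for s in speech_objects]

    # Visual Object Identification of objects indicated by humans.
    instance_ids = aims["VOI"](av_scene)

    # Natural Language Understanding: Refined Text and Meaning.
    refined_texts, meanings = aims["NLU"](recognised_texts, text_in, instance_ids)

    # Speaker and Face Identity Recognition; EDP associates Speaker IDs with Face IDs.
    speaker_ids = aims["SIR"](speech_objects, av_scene)
    face_ids = aims["FIR"](faces, av_scene)

    # Personal Status Extraction for each identified human.
    personal_statuses = aims["PSE"](speech_objects, meanings, faces, bodies)

    # Entity Dialogue Processing: Machine Text and Machine Personal Status.
    machine_text, machine_status = aims["EDP"](
        speaker_ids, face_ids, av_scene, meanings, refined_texts,
        instance_ids, personal_statuses)

    # Personal Status Display: Machine Portable Avatar.
    portable_avatar = aims["PSD"](machine_text, machine_status)

    # Audio-Visual Scene Rendering from the human-selected Point of View.
    return aims["AVR"](av_scene, portable_avatar, point_of_view)
```

The sketch deliberately keeps the AIM implementations outside the orchestration function, mirroring the fact that the Composite AIM defines the topology of the AIMs rather than their internals.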
The HCI interacts with the humans in the cabin in several ways:
- By responding to commands/queries from one or more humans at the same time, e.g.:
  - Commands to go to a waypoint, park at a place, etc.
  - Commands with an effect in the cabin, e.g., turn off the air conditioning, turn on the radio, call a person, open a window or door, search for information, etc.
- By conversing with and responding to questions from one or more humans at the same time about travel-related issues (in-depth domain-specific conversation), e.g.:
  - Humans request information, e.g., time to destination, route conditions, weather at destination, etc.
  - The CAV offers alternatives to humans, e.g., a long but safe way vs. a short one likely to have interruptions.
  - Humans ask questions about objects in the cabin.
- By following the conversation on travel matters held by humans in the cabin if:
  - The passengers allow the HCI to do so, and
  - The processing is carried out inside the CAV.
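As a purely hypothetical illustration of the first two classes of interaction above, a cabin command and a travel query could be carried as Meaning-like structures handed to Entity Dialogue Processing; none of the field names or values below are defined by this Technical Specification.

```python
# Purely illustrative: field names and values are assumptions, not normative.
cabin_command = {
    "speaker_id": "passenger-02",      # from Speaker Identity Recognition
    "intent": "cabin.control",         # command with an effect in the cabin
    "action": "turn_off",
    "target": "air_conditioning",
    "confidence": 0.93,
}

travel_query = {
    "speaker_id": "passenger-01",
    "intent": "travel.information",    # in-depth domain-specific conversation
    "question": "time_to_destination",
}
```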
4.3 Functions of AI Modules
Table 2 gives the functions of the Human-CAV Interaction's AI Modules.
Table 2 – Functions of Human-CAV Interaction’s AI Modules
| AIM | Function |
| Audio-Visual Scene Description | 1. Receives Audio and Visual Objects from the appropriate Devices. 2. Produces Audio-Visual Scene Descriptors. |
| Automatic Speech Recognition | 1. Receives Speech Objects. 2. Produces Recognised Text. |
| Visual Object Identification | 1. Receives Visual Scenes Descriptors. 2. Provides Instance ID of indicated Visual Object. |
| Natural Language Understanding | 1. Receives Recognised Text. 2. Uses context information (e.g., Instance ID of object). 3. Produces Natural Language Understanding Text (using Refined or Input) and Meaning. |
| Speaker Identity Recognition | 1. Receives Speech Object of a human and Speech Scene Geometry. 2. Produces Speaker ID. |
| Personal Status Extraction | 1. Receives Speech Object, Meaning, Face Descriptors and Body Descriptors of a human with a Participant ID. 2. Produces the human’s Personal Status. |
| Face Identity Recognition | 1. Receives Face Object of a human and Visual Scene Geometry. 2. Produces Face ID. |
| Entity Dialogue Processing | 1. Receives Speaker ID, Face ID, AV Scene Descriptors, Meaning, Natural Language Understanding Text, Visual Object ID, and Personal Status; moreover, it receives AMS-HCI Messages and Ego-Remote HCI Messages. 2. Produces Machine (HCI) Text Object and Personal Status; moreover, it produces AMS-HCI Messages and Ego-Remote HCI Messages. |
| Personal Status Display | 1. Receives Machine Text Object and Machine Personal Status. 2. Produces Machine’s Portable Avatar. |
| Audio-Visual Scene Rendering | 1. Receives AV Scene Descriptors, Portable Avatar, and Point of View. 2. Produces Output Speech, Output Audio, and Output Visual. |
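The rows of Table 2 all follow the same "Receives / Produces" pattern, which the following hedged Python sketch expresses as a common interface; the Protocol name, method signature, and key names are assumptions, not normative.

```python
# Illustrative sketch of the common "Receives / Produces" pattern of Table 2;
# the Protocol name, method signature, and key names are assumptions.
from typing import Any, Mapping, Protocol


class AIModule(Protocol):
    def process(self, inputs: Mapping[str, Any]) -> Mapping[str, Any]:
        """Receive the AIM's input data and produce its output data."""
        ...


class AutomaticSpeechRecognition:
    """Example stub: receives Speech Objects, produces Recognised Text (Table 2)."""

    def process(self, inputs: Mapping[str, Any]) -> Mapping[str, Any]:
        speech_objects = inputs.get("SpeechObjects", [])
        # A real AIM would run a speech recogniser here; the stub returns empty strings.
        return {"RecognisedText": ["" for _ in speech_objects]}
```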
4.4 I/O Data of AI Modules
Table 3 gives the I/O Data of the AI Modules of the Human-CAV Interaction AIW depicted in Figure 2.
Table 3 – AI Modules of Human-CAV Interaction AIW
4.5 AIMs and JSON Metadata
Table 4 provides the links to the AIW and AIM specifications and to the JSON syntaxes. The AIM1 column contains Composite AIMs; the AIM2 column contains Basic and Composite AIMs.
Table 4 – AIMs and JSON Metadata
| AIM1 | AIM2 | Name | JSON |
| MMC-HCI | | Human-CAV Interaction | X |
| | OSD-AVS | Audio-Visual Scene Description | X |
| | MMC-ASR | Automatic Speech Recognition | X |
| | OSD-AVA | Audio-Visual Alignment | X |
| | OSD-VOI | Visual Object Identification | X |
| | MMC-NLU | Natural Language Understanding | X |
| | MMC-SIR | Speaker Identity Recognition | X |
| | MMC-PSE | Personal Status Extraction | X |
| | MMC-EDP | Entity Dialogue Processing | X |
| | PAF-FIR | Face Identity Recognition | X |
| | PAF-PSD | Personal Status Display | X |
| | PAF-AVR | Audio-Visual Scene Rendering | X |
5 JSON Metadata
https://schemas.mpai.community/MMC/V2.5/AIMs/HumanCAVInteraction.json
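The metadata file can be retrieved and inspected programmatically; the snippet below is a minimal sketch that uses only the Python standard library and assumes nothing about the JSON content beyond it being a JSON object.

```python
# Minimal sketch: fetch the MMC-HCI AIM metadata and list its top-level keys.
# Only standard-library modules are used; no particular JSON structure is assumed.
import json
from urllib.request import urlopen

URL = "https://schemas.mpai.community/MMC/V2.5/AIMs/HumanCAVInteraction.json"

with urlopen(URL) as response:
    metadata = json.load(response)

print(sorted(metadata.keys()))
```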
6 Profiles
No Profiles
7 Reference Software
8 Conformance Testing
9 Performance Assessment