| 1 Functions | 2 Reference Model | 3 I/O Data |
| 4 Functions of AI Modules | 5 I/O Data of AI Modules | 6 AIW, AIMs and JSON Metadata |
| 7 Reference Software | 8 Conformance Testing | 9 Performance Assessment |
1 Functions
The Human-CAV Interaction (HCI) Subsystem has the function to recognise the human owner or renter, respond to humans’ commands and queries, converse with humans, manifest itself as a perceptible entity, exchange information with the Autonomous Motion Subsystem in response to humans’ requests, and communicate with the HCIs on board other CAVs.
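For illustration only, these functions can be summarised by the following hedged Python interface. All names here are hypothetical placeholders of this sketch, not part of this Technical Specification.

```python
from abc import ABC, abstractmethod

class HCISubsystem(ABC):
    """Hypothetical interface mirroring the HCI functions listed above."""

    @abstractmethod
    def authenticate_human(self, audio, visual) -> str:
        """Recognise the human owner or renter; return an identity label."""

    @abstractmethod
    def handle_command(self, command: str) -> str:
        """Respond to a human's command or query."""

    @abstractmethod
    def converse(self, utterance: str) -> str:
        """Hold a conversation with a human, manifesting as a perceptible entity."""

    @abstractmethod
    def exchange_with_ams(self, message: dict) -> dict:
        """Exchange information with the Autonomous Motion Subsystem."""

    @abstractmethod
    def communicate_with_remote_hci(self, message: dict) -> None:
        """Communicate with the HCIs on board other CAVs."""
```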
2 Reference Model
Figure 1 represents the Human-CAV Interaction (HCI) Reference Model.

Figure 1 – Human-CAV Interaction Reference Model
The operation of the HCI Subsystem is described by the following scenario, in which a group of humans approaches the CAV from outside or is sitting inside the CAV (a non-normative code sketch of this flow is given after the list):
- Audio-Visual Scene Description (AVS) produces:
- Speech Scene Descriptors in the form of Speech Objects corresponding to each speaking human in the Environment (outside or inside the CAV).
- Visual Scene Descriptors in the form of Descriptors of Faces and Bodies.
- All non-Speech Objects are removed from the Speech Scene or signalled in the Audio Scene.
- Automatic Speech Recognition (ASR) recognises the speech of each human and produces Recognised Text; multiple input Speech Objects, each properly identified by its Spatial Attitude, are supported.
- Visual Object Identification (VOI) produces Instance IDs of Visual Objects indicated by humans.
- Natural Language Understanding (NLU) produces Refined Text and extracts Meaning from the Recognised Text of each Input Speech, using the spatial information of the Visual Object Identifiers. Refined Text is obtained by refining either the Recognised Text from Automatic Speech Recognition or the direct Input Text, whichever is in use; Meaning is computed from the same Recognised or Input Text.
- Speaker Identity Recognition (SIR) and Face Identity Recognition (FIR) identify the humans the HCI is interacting with. If FIR provides Face IDs corresponding to the Speaker IDs, the Entity Dialogue Processing AIM can correctly associate the Speaker IDs (and the corresponding Text) with the Face IDs.
- Personal Status Extraction (PSE) extracts the Personal Status of the humans.
- Entity Dialogue Processing (EDP):
  - Communicates with the Autonomous Motion Subsystem (AMS) of:
    - The Ego CAV to request to:
      - Move the CAV to a destination.
      - View the Full Environment Descriptors for the passengers’ benefit.
      - Be informed about the CAV’s situation.
      - Receive information relevant to passengers.
    - Remote CAVs to exchange Environment Descriptors.
  - Produces the Machine Text and Machine Personal Status.
- Personal Status Display (PSD) produces the Machine Portable Avatar conveying Machine Speech, Machine Personal Status, and any other information that may be relevant for the Audio-Visual Scene Rendering AIM.
- Audio-Visual Scene Rendering (AVR) renders Audio and Visual information using the Machine Portable Avatar or the Autonomous Motion Subsystem’s Full Environment Descriptors, based on the Point of View provided by the human.
- Entity Dialogue Processing (EDP) also handles Route negotiation:
  - It requests the AMS Subsystem to provide candidate Routes in response to a human asking to be taken to a destination.
  - Responses from the AMS are processed by EDP and converted into multimodal messages understandable by the human.
  - Eventually, the human accepts the Route or further elaborates on the EDP response.
  - EDP may also receive messages from the Ego AMS or a Remote HCI that are processed and converted into multimodal messages understandable by the human.
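The flow just described can be summarised, for illustration only, by the following Python sketch. The `aim` dictionary stands in for caller-supplied implementations of the AIMs; all names and the data layout are assumptions of this sketch, not the normative AIF workflow.

```python
from typing import Any, Callable, Dict, List, Optional

def hci_pipeline(aim: Dict[str, Callable[..., Any]],
                 input_audio: Any, input_visual: Any,
                 input_text: Optional[str] = None) -> List[Any]:
    """Hypothetical orchestration of the HCI AIMs described above."""
    # AVS: Speech Objects plus Face/Body Descriptors and Scene Geometries.
    scene = aim["AVS"](input_audio, input_visual)
    responses = []
    for speech in scene["speech_objects"]:
        # ASR on each Speech Object, unless direct Input Text replaces it.
        text = input_text or aim["ASR"](speech)
        # VOI: Instance ID of the Visual Object indicated by the human.
        instance_id = aim["VOI"](scene["visual_descriptors"])
        # NLU: Refined Text and Meaning, using the spatial context.
        refined_text, meaning = aim["NLU"](text, instance_id)
        # SIR/FIR: Speaker ID and Face ID, which EDP associates
        # via the corresponding Spatial Attitudes.
        speaker_id = aim["SIR"](speech, scene["speech_geometry"])
        face_id = aim["FIR"](scene["face_descriptors"], scene["visual_geometry"])
        # PSE: the human's Personal Status.
        status = aim["PSE"](speech, meaning,
                            scene["face_descriptors"], scene["body_descriptors"])
        # EDP: Machine Text and Machine Personal Status (possibly after
        # AMS-HCI and Ego-Remote HCI Message exchanges, not shown here).
        machine_text, machine_status = aim["EDP"](
            speaker_id, face_id, refined_text, meaning, status)
        # PSD packages a Portable Avatar; AVR renders the output.
        avatar = aim["PSD"](machine_text, machine_status)
        responses.append(aim["AVR"](avatar))
    return responses
```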
The HCI interacts with the humans in the cabin in several ways (a non-normative dispatching sketch follows the list):
- By responding to commands/queries from one or more humans at the same time, e.g.:
- Commands to go to a waypoint, park at a place, etc.
- Commands with an effect in the cabin, e.g., turn off the air conditioning, turn on the radio, call a person, open a window or door, search for information, etc.
- By conversing with and responding to questions from one or more humans at the same time about travel-related issues (in-depth domain-specific conversation), e.g.:
- Humans request information, e.g., time to destination, route conditions, weather at destination, etc.
- CAV offers alternatives to humans, e.g., long but safe way, short but likely to have interruptions.
- Humans ask questions about objects in the cabin.
- By following the conversation on travel matters held by humans in the cabin if:
  - The passengers allow the HCI to do so, and
  - The processing is carried out inside the CAV.
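As an informal illustration of these interaction modes, a request carrying a given Meaning might be dispatched as follows. The intent labels and the three callables are assumptions of this sketch, not part of the Technical Specification.

```python
from typing import Any, Callable, Dict

# Illustrative, non-normative list of cabin actions mentioned above.
CABIN_ACTIONS = {"air_conditioning", "radio", "phone_call", "window", "door", "search"}

def dispatch(meaning: Dict[str, Any],
             ams_request: Callable[[Dict[str, Any]], str],
             cabin_actuate: Callable[[str, Any], str],
             edp_answer: Callable[[Dict[str, Any]], str]) -> str:
    """Route a human request, represented by its Meaning, to one of the
    interaction modes listed above. The callables stand in for the AMS
    interface, cabin actuation, and EDP dialogue, respectively."""
    intent = meaning.get("intent", "")
    if intent in ("go_to_waypoint", "park"):
        # Commands affecting the CAV's motion become AMS-HCI Messages.
        return ams_request(meaning)
    if intent in CABIN_ACTIONS:
        # Commands with an effect in the cabin are executed locally.
        return cabin_actuate(intent, meaning.get("value"))
    # Everything else is treated as travel-related conversation.
    return edp_answer(meaning)
```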
3 I/O Data
Table 1 gives the input/output data of Human-CAV Interaction; an illustrative (non-normative) example of an AMS-HCI Message exchange follows the table. I/O Data to/from the Remote HCI and the Ego AMS are not part of this Technical Specification.
Table 1 – I/O data of Human-CAV Interaction
| Input data | From | Comment |
| Point of View | Passenger | Passenger’s Point of View looking at environment. |
| Audio-Visual Scene Descriptors | AMS Subsystem | Audio-Visual representation of the environment. |
| Input Audio | Environment, Passenger Cabin | User authentication, command/interaction with HCI, etc. and environment Audio. |
| Input Text | User | Text complementing/replacing User input. |
| Input Visual | Environment, Passenger Cabin | Environment perception, User authentication, command/interaction with HCI, etc. and environment Visual. |
| AMS-HCI Message | AMS Subsystem | AMS response to HCI request. |
| Ego-Remote HCI Message | Remote HCI | Remote HCI to Ego HCI. |
| Output data | To | Comment |
| Output Text | Cabin Passengers | HCI’s avatar Text. |
| Output Speech | Cabin Passengers | HCI’s avatar Speech. |
| Output Audio | Cabin Passengers | HCI’s avatar or Full Environment Descriptors (FED) Audio. |
| Output Visual | Cabin Passengers | HCI’s avatar or FED Visual. |
| AMS-HCI Message | AMS Subsystem | HCI request to AMS, e.g., Route or Point of View. |
| Ego-Remote HCI Message | Remote HCI | Ego HCI to Remote HCI. |
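For illustration only, an HCI request for a Route and a possible AMS response could look like the following Python dictionaries. All field names are invented for this example and do not reproduce the normative AMS-HCI Message syntax.

```python
# Hypothetical HCI-to-AMS request (field names are assumptions).
hci_to_ams = {
    "messageType": "AMS-HCI",
    "request": "Route",
    "destination": "Central Station",
}

# Hypothetical AMS-to-HCI response offering alternative Routes.
ams_to_hci = {
    "messageType": "AMS-HCI",
    "response": "CandidateRoutes",
    "routes": [
        {"id": 1, "duration_min": 25, "note": "long but safe"},
        {"id": 2, "duration_min": 15, "note": "short but likely to have interruptions"},
    ],
}
```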
4 Functions of AI Modules
Table 2 gives the functions of all Human-CAV Interaction AIMs; a hypothetical typed sketch of the Personal Status Display step follows the table.
Table 2 – Functions of Human-CAV Interaction’s AI Modules
| AIM | Function |
| Audio-Visual Scene Description | 1. Receives Audio and Visual Objects from the appropriate Devices. 2. Produces Audio-Visual Scene Descriptors. |
| Automatic Speech Recognition | 1. Receives Speech Objects. 2. Produces Recognised Text. |
| Visual Object Identification | 1. Receives Visual Scene Descriptors. 2. Provides Instance ID of the indicated Visual Object. |
| Natural Language Understanding | 1. Receives Recognised Text. 2. Uses context information (e.g., Instance ID of object). 3. Produces Refined Text (from the Recognised or Input Text) and Meaning. |
| Speaker Identity Recognition | 1. Receives Speech Object of a human and Speech Scene Geometry. 2. Produces Speaker ID. |
| Personal Status Extraction | 1. Receives Speech Object, Meaning, Face Descriptors and Body Descriptors of a human with a Participant ID. 2. Produces the human’s Personal Status. |
| Face Identity Recognition | 1. Receives Face Object of a human and Visual Scene Geometry. 2. Produces Face ID. |
| Entity Dialogue Processing | 1. Receives Speaker ID, Face ID, AV Scene Descriptors, Meaning, Refined Text, Visual Object ID, and Personal Status; moreover, it receives AMS-HCI Messages and Ego-Remote HCI Messages. 2. Produces Machine (HCI) Text Object and Personal Status; moreover, it produces AMS-HCI Messages and Ego-Remote HCI Messages. |
| Personal Status Display | 1. Receives Machine Text Object and Machine Personal Status. 2. Produces Machine’s Portable Avatar. |
| Audio-Visual Scene Rendering | 1. Receives AV Scene Descriptors, Portable Avatar, and Point of View. 2. Produces Output Speech, Output Audio, and Output Visual. |
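To make the interfaces above concrete, the Personal Status Display step might be typed as below. The dataclass fields are placeholders assumed for this sketch; the normative data types are defined by their JSON syntaxes.

```python
from dataclasses import dataclass, field

@dataclass
class PersonalStatus:
    """Hypothetical stand-in for the MPAI Personal Status data type."""
    emotion: str = "neutral"
    cognitive_state: str = "attentive"
    social_attitude: str = "polite"

@dataclass
class PortableAvatar:
    """Hypothetical stand-in for the Machine Portable Avatar produced by PSD."""
    machine_text: str
    machine_speech: bytes = b""
    personal_status: PersonalStatus = field(default_factory=PersonalStatus)

def personal_status_display(machine_text: str,
                            status: PersonalStatus) -> PortableAvatar:
    """PSD: packages Machine Text and Machine Personal Status into a
    Portable Avatar for the Audio-Visual Scene Rendering AIM."""
    return PortableAvatar(machine_text=machine_text, personal_status=status)
```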
5 I/O Data of AI Modules
Table 3 gives the I/O Data of the AI Modules of Human-CAV Interaction depicted in Figure 1.
Table 3 – AI Modules of Human-CAV Interaction AIW
6 AIW, AIMs and JSON Metadata
Table 4 provides the links to the AIW and AIM specifications and to the JSON syntaxes. AIMs/1 indicates that the column contains Composite AIMs, AIMs/2 indicates that the column contains Basic and Composite AIMs, and AIMs/3 indicates that the column only contains Basic AIMs.
Table 4 – AIMs and JSON Metadata
7 Reference Software
As a rule, MPAI provides Reference Software implementing the AIWs, released with the following disclaimers:
- The MPAI-MMC V2.5 Reference Software Implementation, if in source code, is released with the BSD-3-Clause licence.
- The purpose of this Reference Software is to provide a working Implementation of MPAI-MMC V2.5, not to provide a ready-to-use product.
- MPAI disclaims the suitability of the Software for any other purposes and does not guarantee that it is secure.
- Use of this Reference Software may require acceptance of licences from the respective copyright holders. Users shall verify that they have the right to use any third-party software required by this Reference Software.
Note that at this stage MPAI-MMC V2.5 does not include Reference Software.
8 Conformance Testing
An implementation of an AIW conforms with MPAI-MMC V2.5 if it accepts as input _and_ produces as output Data and/or Data Objects (the combination of Data of a Data Type and its Qualifier) conforming with those specified by MPAI-MMC V2.5.
The Conformance is expressed by one of the following two statements:
- “Data conforms with the relevant (Non-MPAI) standard” – for Data.
- “Data validates against the Data Type Schema” – for Data Object.
The latter statement implies that:
- Any Sub-Type of the Data conforms with the relevant Sub-Type specification of the applicable Qualifier.
- Any Content and Transport Format of the Data conform with the relevant Format specification of the applicable Qualifier.
- Any Attribute of the Data
- Conforms with the relevant (Non-MPAI) standard – for Data, or
- Validates against the Data Type Schema – for Data Object.
The method to Test the Conformance of an instance of Data or Data Object is specified in the Data Types chapter.
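As an informal illustration of the second statement, a Data Object can be checked with a standard JSON Schema validator as sketched below; the schema file name and the sample object are placeholders, not the normative MPAI-MMC schemas.

```python
import json
from jsonschema import validate, ValidationError  # pip install jsonschema

def validates_against(data_object: dict, schema_path: str) -> bool:
    """Return True if the Data Object validates against the Data Type
    Schema stored at schema_path (a placeholder file name here)."""
    with open(schema_path, encoding="utf-8") as f:
        schema = json.load(f)
    try:
        validate(instance=data_object, schema=schema)
        return True
    except ValidationError as err:
        print(f"Does not validate: {err.message}")
        return False

# Example use with a placeholder Data Object and schema file name:
# validates_against({"Text": "Take me to the station"}, "TextObject.schema.json")
```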
Table 5 provides the Conformance Testing Method for MMC-HCI AIM.
Table 5 – Conformance Testing Method for MMC-HCI AIM
| Receives | Input Audio | Shall validate against Audio Object Schema. Audio Data shall conform with Audio Qualifier. |
| | Input Text | Shall validate against Text Object Schema. Text Data shall conform with Text Qualifier. |
| | Input Visual | Shall validate against Visual Object Schema. Visual Data shall conform with Visual Qualifier. |
| | AMS-HCI Message | Shall validate against AMS-HCI Message Schema. |
| | Ego-Remote HCI Message | Shall validate against Ego-Remote HCI Message Schema. |
| Produces | Output Text | Shall validate against Text Object Schema. Text Data shall conform with Text Qualifier. |
| | Output Speech | Shall validate against Speech Object Schema. Speech Data shall conform with Speech Qualifier. |
| | Output Audio | Shall validate against Audio Object Schema. Audio Data shall conform with Audio Qualifier. |
| | Output Visual | Shall validate against Visual Object Schema. Visual Data shall conform with Visual Qualifier. |
| | AMS-HCI Message | Shall validate against AMS-HCI Message Schema. |
| | Ego-Remote HCI Message | Shall validate against Ego-Remote HCI Message Schema. |
9 Performance Assessment
Performance is an umbrella term used to describe a variety of attributes, some specific to the application domain the Implementation intends to address. Therefore, Performance Assessment Specifications provide methods and procedures to measure how well an AIW or an AIM performs its function. Performance Assessment of an Implementation includes methods and procedures for all or a subset of the following characteristics (a toy measurement sketch follows the list):
- Quality – for instance, how well a Face Identity Recognition AIM recognises faces, how precise or error-free are the changes in a Visual Scene detected by a Visual Change Detection AIM, or how satisfactory are the responses provided by an Answer to Multimodal Question AIW.
- Robustness – for instance, how robust is the operation of an Implementation with respect to duration of operation, load scaling, etc.
- Extensibility – for instance, the degree of confidence a user can have in an Implementation when it deals with data outside of its stated application scope.
- Bias – for instance, how dependent the inference is on specific features of the training data, as in Company Performance Prediction, where the accuracy of the prediction may change widely based on the size or the geographic position of a Company, or in face recognition for Television Media Analysis.
- Legality – for instance, in which jurisdictions the use of an AIM or an AIW complies with a regulation, e.g., the European AI Act.
- Ethics – for instance, the conformity of an AIM or AIW to a target ethical standard.
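As a toy, non-normative example of measuring the Quality and Bias characteristics for a Face Identity Recognition AIM, the sketch below computes accuracy per demographic group on invented data; a large gap between groups would indicate the kind of Bias described above.

```python
from collections import defaultdict

def per_group_accuracy(results):
    """Accuracy of Face ID predictions per group.

    `results` is a list of dicts with invented fields: "group",
    "predicted_id", "true_id". Field names are assumptions of this
    sketch, not part of the Technical Specification."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for r in results:
        total[r["group"]] += 1
        correct[r["group"]] += r["predicted_id"] == r["true_id"]
    return {g: correct[g] / total[g] for g in total}

# Toy data: accuracy differs widely between the two groups.
sample = [
    {"group": "A", "predicted_id": 1, "true_id": 1},
    {"group": "A", "predicted_id": 2, "true_id": 2},
    {"group": "B", "predicted_id": 3, "true_id": 4},
    {"group": "B", "predicted_id": 5, "true_id": 5},
]
print(per_group_accuracy(sample))  # {'A': 1.0, 'B': 0.5}
```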