
Definition

The Multimodal Input Harmonisation (MIH) Data Type:

  • Is produced by the Multimodal Input Harmonisation internal component of the Entity State Description (ESD) AIM.
  • Represents a time-bounded, entity-centred harmonisation of multimodal perceptual inputs.
  • Correlates visual objects, audio objects, and textual elements that are contemporaneous and refer to the same entity or entities.
  • Does not introduce semantic interpretation, inference, or state attribution.

MIH provides a precisely aligned evidential substrate for downstream linguistic, behavioural, and cross‑modal interpretation, and for subsequent Entity State construction.

Functional Requirements

MIH conveys the following main information elements:

  • Multimodal Correlation: Establishes explicit correspondences between Visual Objects, Audio Objects, and Text segments that coexist within a common temporal window.
  • Temporal Anchoring: Provides a harmonisation time reference to ensure that subsequent reasoning operates on co‑temporal evidence only.
  • Entity Referencing: Identifies which multimodal evidence items relate to the same logical entity (e.g., the User or another entity in the scene).
  • Modal Integrity: Preserves the original structure and semantics of Visual and Audio Scene Descriptors without duplication or modification.
  • Referential Transparency: Uses object‑or‑objectID constructs to ensure that all references are explicit, inspectable, and verifiable.
  • Interpretation Neutrality: Explicitly refrains from performing affective, intentional, or cognitive inference.
  • Reasoning Substrate: Serves as the mandatory input substrate for Linguistic–Paralinguistic Analysis, Behavioural/Expressive Analysis, and subsequent Entity State construction.
  • Auditability: Includes Data Exchange Metadata to support provenance, traceability, and confidence assessment.
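
The Referential Transparency requirement above rests on the object‑or‑objectID construct: every reference is either the full object carried inline or an identifier that can be resolved against an upstream Scene Descriptor. The following non‑normative sketch illustrates that pattern; all class names, field names, and the dict‑based scene model are illustrative assumptions, not the normative MIH syntax.

```python
# Non-normative sketch of the "object-or-objectID" construct.
# All names and structures here are illustrative assumptions.
from dataclasses import dataclass
from typing import Union

@dataclass(frozen=True)
class VisualObject:
    object_id: str
    bounding_box: tuple  # illustrative payload, not normative

# A reference is either the full object (inline) or its identifier.
VisualRef = Union[VisualObject, str]

def resolve(ref: VisualRef, scene: dict) -> VisualObject:
    """Resolve a reference to the full object, looking identifiers up
    in the upstream Visual Scene Descriptor (modelled here as a dict).
    A KeyError signals a reference that is not verifiable."""
    if isinstance(ref, VisualObject):
        return ref
    return scene[ref]

# Inline object and ID-based reference resolve to the same evidence item.
scene = {"vo:17": VisualObject("vo:17", (0, 0, 64, 64))}
assert resolve("vo:17", scene) is resolve(scene["vo:17"], scene)
```

Because either form resolves to the same object, downstream AIMs can inspect and verify every evidence reference without MIH duplicating Scene Descriptor content, which is what the Modal Integrity requirement demands.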

Syntax
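
The normative JSON syntax is specified separately and is not reproduced here. As a non‑normative illustration only, the fields defined under Semantics could serialise along the following lines; all values, the window representation inside HarmonisationTime, and the empty metadata objects are assumptions.

```json
{
  "Header": "MMC-MIH-V1.0",
  "MInstanceID": "mi:example-instance",
  "MIHID": "mih:0001",
  "HarmonisationTime": { "Start": "2025-01-01T12:00:00Z", "End": "2025-01-01T12:00:02Z" },
  "VisualContext": [ { "VisualObjectID": "vo:17" } ],
  "AudioContext": [ { "AudioObjectID": "ao:04" } ],
  "TextContext": [
    { "TextOrTextID": "hello there", "SpaceTime": { "Start": "2025-01-01T12:00:00Z", "End": "2025-01-01T12:00:01Z" } }
  ],
  "EntityContext": [
    {
      "EntityID": "user:01",
      "VisualRefs": [ "vo:17" ],
      "AudioRefs": [ "ao:04" ],
      "TextRefs": [ "hello there" ]
    }
  ],
  "DataXMData": {},
  "DescrMetadata": {}
}
```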

Semantics

  • Header: MIH header identifying the data type and version, formatted as MMC-MIH-Vx.y.
  • MInstanceID: Identifier of the M‑Instance producing the MIH data.
  • MIHID: Unique identifier of the Multimodal Input Harmonisation instance.
  • HarmonisationTime: Time reference identifying the temporal window within which multimodal evidence is harmonised.
  • VisualContext: Set of visual entities relevant to harmonisation. Each item SHALL be either a VisualObject or a VisualObjectID referencing a Visual Scene Descriptor produced upstream.
  • AudioContext: Set of audio entities relevant to harmonisation. Each item SHALL be either an AudioObject or an AudioObjectID referencing an Audio Scene Descriptor produced upstream.
  • TextContext: Set of textual elements derived from Automatic Speech Recognition (ASR). Each item binds a recognised text segment or a TextSegmentID to a temporal anchor.
  • TextContext.TextOrTextID: Either the recognised text string or an identifier referencing a text segment produced by an ASR AIM.
  • TextContext.SpaceTime: Temporal anchor indicating when the text segment was uttered.
  • EntityContext: Entity‑centric correspondence structure grouping visual, audio, and textual evidence that refers to the same logical entity.
  • EntityContext.EntityID: Identifier of the logical entity (e.g., User or other actor) to which the referenced evidence relates.
  • EntityContext.VisualRefs: Visual evidence associated with the entity. Each item SHALL be either a VisualObject or a VisualObjectID.
  • EntityContext.AudioRefs: Audio evidence associated with the entity. Each item SHALL be either an AudioObject or an AudioObjectID.
  • EntityContext.TextRefs: Textual evidence associated with the entity. Each item SHALL be either a recognised text segment or a TextSegmentID.
  • DataXMData: Data Exchange Metadata providing provenance, source AIM identification, confidence, legality, and rights information.
  • DescrMetadata: Human‑readable descriptive metadata associated with the MIH instance.
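
Taken together, HarmonisationTime and EntityContext imply that an MIH instance groups under an entity only the evidence whose temporal anchor falls within the harmonisation window. The sketch below illustrates that co‑temporality check; the (start, end) interval representation, function names, and the dict‑based grouping are assumptions, not the normative semantics.

```python
# Non-normative sketch: only evidence whose time anchor overlaps the
# HarmonisationTime window may be grouped under an EntityContext.
# Intervals are modelled as (start, end) pairs in seconds; this
# representation is an assumption, not the normative SpaceTime syntax.

def overlaps(window, anchor):
    """True if the evidence anchor intersects the harmonisation window."""
    return anchor[0] < window[1] and window[0] < anchor[1]

def harmonise(window, evidence):
    """Keep only co-temporal evidence items, grouped by entity.

    `evidence` is a list of (entity_id, anchor, item) triples.
    Returns {entity_id: [item, ...]}, a sketch of EntityContext grouping.
    """
    grouped = {}
    for entity_id, anchor, item in evidence:
        if overlaps(window, anchor):
            grouped.setdefault(entity_id, []).append(item)
    return grouped

mih = harmonise(
    window=(10.0, 12.0),
    evidence=[
        ("user:01", (10.2, 11.0), "vo:17"),  # inside the window: kept
        ("user:01", (10.5, 11.5), "ao:04"),  # inside the window: kept
        ("user:01", (15.0, 16.0), "vo:99"),  # outside the window: dropped
    ],
)
# mih == {"user:01": ["vo:17", "ao:04"]}
```

Note that the grouping itself carries no interpretation: consistent with the Interpretation Neutrality requirement, the result records only which co‑temporal evidence items refer to the same entity.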