Definition
The Multimodal Input Harmonisation (MIH) Data Type:
- Is produced by the Multimodal Input Harmonisation internal component of the Entity State Description (ESD) AIM.
- Represents a time-bounded, entity-centred harmonisation of multimodal perceptual inputs.
- Correlates visual objects, audio objects, and textual elements that are contemporaneous and refer to the same entity or entities.
- Does not introduce semantic interpretation, inference, or state attribution.
MIH provides a precisely aligned evidential substrate for downstream Linguistic–Paralinguistic Analysis, Behavioural/Expressive Analysis, cross‑modal interpretation, and Entity State construction.
Functional Requirements
MIH conveys the following main information elements:
| Function | Description |
|---|---|
| Multimodal Correlation | Establishes explicit correspondences between Visual Objects, Audio Objects, and Text segments that coexist within a common temporal window. |
| Temporal Anchoring | Provides a harmonisation time reference to ensure that subsequent reasoning operates on co‑temporal evidence only. |
| Entity Referencing | Identifies which multimodal evidence items relate to the same logical entity (e.g. the User or another entity in the scene). |
| Modal Integrity | Preserves the original structure and semantics of Visual and Audio Scene Descriptors without duplication or modification. |
| Referential Transparency | Uses object‑or‑objectID constructs to ensure that all references are explicit, inspectable, and verifiable. |
| Interpretation Neutrality | Explicitly refrains from performing affective, intentional, or cognitive inference. |
| Reasoning Substrate | Serves as the mandatory input substrate for Linguistic–Paralinguistic Analysis, Behavioural / Expressive Analysis, and subsequent Entity State construction. |
| Auditability | Includes Data Exchange Metadata to support provenance, traceability, and confidence assessment. |
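The Temporal Anchoring function can be illustrated with a short, non-normative sketch: evidence items are retained only if their time stamps fall inside the harmonisation window. The field names (`start`, `end`) and the use of plain numeric time stamps are assumptions for illustration, not part of the normative MIH syntax.

```python
from dataclasses import dataclass

@dataclass
class TimeWindow:
    # Assumed representation: seconds on a shared clock; the normative
    # MIH time reference format is defined elsewhere in the specification.
    start: float
    end: float

def co_temporal(evidence_times, window):
    """Keep only evidence whose time stamp lies within the harmonisation
    window, so that downstream reasoning operates on co-temporal evidence
    only (Temporal Anchoring)."""
    return [t for t in evidence_times if window.start <= t <= window.end]

window = TimeWindow(start=10.0, end=12.5)
print(co_temporal([9.8, 10.4, 11.9, 13.0], window))  # -> [10.4, 11.9]
```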
Syntax
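The normative JSON syntax is not reproduced here. The following non-normative sketch shows how an MIH instance could be laid out using the labels defined under Semantics; all identifier values, time formats, and the choice of ID-only references are invented for illustration.

```python
import re

# Illustrative MIH instance built from the Semantics labels.
# Every concrete value below is an invented example, not a normative one.
mih_instance = {
    "Header": "MMC-MIH-V1.0",              # format MMC-MIH-Vx.y
    "MInstanceID": "minstance-0001",       # M-Instance producing the data
    "MIHID": "mih-0001",                   # unique instance identifier
    "HarmonisationTime": {"start": "2024-01-01T12:00:00Z",
                          "end":   "2024-01-01T12:00:02Z"},
    # Object-or-objectID constructs: ID references are used here.
    "VisualContext": [{"VisualObjectID": "vo-01"}],
    "AudioContext":  [{"AudioObjectID": "ao-01"}],
    "TextContext":   [{"TextOrTextID": "hello there",
                       "SpaceTime": "2024-01-01T12:00:01Z"}],
    "EntityContext": [{"EntityID": "user-01",
                       "VisualRefs": ["vo-01"],
                       "AudioRefs":  ["ao-01"],
                       "TextRefs":   ["hello there"]}],
    "DataXMData": {},    # Data Exchange Metadata (provenance, confidence, ...)
    "DescrMetadata": {}, # human-readable descriptive metadata
}

# Sanity check: the Header follows the MMC-MIH-Vx.y pattern.
assert re.fullmatch(r"MMC-MIH-V\d+\.\d+", mih_instance["Header"])
```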
Semantics
| Label | Description |
|---|---|
| Header | MIH header identifying the data type and version, formatted as MMC-MIH-Vx.y. |
| MInstanceID | Identifier of the M‑Instance producing the MIH data. |
| MIHID | Unique identifier of the Multimodal Input Harmonisation instance. |
| HarmonisationTime | Time reference identifying the temporal window within which multimodal evidence is harmonised. |
| VisualContext | Set of visual entities relevant to harmonisation. Each item SHALL be either a VisualObject or a VisualObjectID referencing a Visual Scene Descriptor produced upstream. |
| AudioContext | Set of audio entities relevant to harmonisation. Each item SHALL be either an AudioObject or an AudioObjectID referencing an Audio Scene Descriptor produced upstream. |
| TextContext | Set of textual elements derived from Automatic Speech Recognition (ASR). Each item binds a recognised text segment or a TextSegmentID to a temporal anchor. |
| TextContext.TextOrTextID | Either the recognised text string or an identifier referencing a text segment produced by an ASR AIM. |
| TextContext.SpaceTime | Temporal anchor indicating when the text segment was uttered. |
| EntityContext | Entity‑centric correspondence structure grouping visual, audio, and textual evidence that refers to the same logical entity. |
| EntityContext.EntityID | Identifier of the logical entity (e.g. User or other actor) to which the referenced evidence relates. |
| EntityContext.VisualRefs | Visual evidence associated with the entity. Each item SHALL be either a VisualObject or a VisualObjectID. |
| EntityContext.AudioRefs | Audio evidence associated with the entity. Each item SHALL be either an AudioObject or an AudioObjectID. |
| EntityContext.TextRefs | Textual evidence associated with the entity. Each item SHALL be either a recognised text segment or a TextSegmentID. |
| DataXMData | Data Exchange Metadata providing provenance, source AIM identification, confidence, legality, and rights information. |
| DescrMetadata | Human‑readable descriptive metadata associated with the MIH instance. |
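The Referential Transparency requirement implies that every reference cited in an EntityContext entry can be resolved against the declared context sets. A non-normative consistency check is sketched below; field names follow the Semantics labels, while the dictionary layout and concrete ID values are assumptions made for illustration.

```python
def check_entity_refs(mih: dict) -> bool:
    """Return True iff every VisualRefs/AudioRefs/TextRefs item in each
    EntityContext entry resolves to an item declared in the corresponding
    context set (Referential Transparency)."""
    visual_ids = {v.get("VisualObjectID") for v in mih.get("VisualContext", [])}
    audio_ids = {a.get("AudioObjectID") for a in mih.get("AudioContext", [])}
    text_ids = {t.get("TextOrTextID") for t in mih.get("TextContext", [])}
    pools = (("VisualRefs", visual_ids),
             ("AudioRefs", audio_ids),
             ("TextRefs", text_ids))
    return all(ref in pool
               for entity in mih.get("EntityContext", [])
               for field, pool in pools
               for ref in entity.get(field, []))

# Minimal invented example: one entity citing one item per modality.
mih = {
    "VisualContext": [{"VisualObjectID": "vo-01"}],
    "AudioContext": [{"AudioObjectID": "ao-01"}],
    "TextContext": [{"TextOrTextID": "ts-01"}],
    "EntityContext": [{"EntityID": "user-01",
                       "VisualRefs": ["vo-01"],
                       "AudioRefs": ["ao-01"],
                       "TextRefs": ["ts-01"]}],
}
print(check_entity_refs(mih))  # -> True
```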