(Tentative)


1. Function

Context Capture AIM (PGM‑CXC)

1. Receives

  • Text Objects representing user utterances or written input.
  • Audio Objects representing speech and environmental sounds.
  • Visual Objects representing user gestures, facial expressions, and scene elements.
  • 3D Model Objects representing spatial geometry and environmental structures.
  • Context Capture Directives from A‑User Control guiding modality prioritisation and acquisition strategy.

2. Extracts, refines, and interprets multimodal signals to:

  • Disambiguate inputs by aligning text, audio, visual, and 3D model objects into a coherent multimodal frame.
  • Normalise descriptors into canonical formats, ensuring consistency across modalities and traceability of provenance.
  • Infer relationships among signals, such as linking gestures to utterances, or mapping audio cues to spatial anchors.
  • Resolve context by applying A‑User Control directives to prioritise modalities, filter noise, and highlight salient features.
  • Generate enriched semantics that capture intent‑aligned cues, such as urgency, confidence, and attentional focus (see the first sketch after this list).

3. Constructs a representation of the User environment and Entity State, including:

  • Scene grounding in spatio‑temporal coordinates.
  • User localisation (position, orientation, posture).
  • Semantic tagging of objects, actions, and environmental features.
  • Framing of cognitive, emotional, attentional, intentional, motivational, and temporal states (see the second sketch after this list).

4. Sends

  • An Entity State for multimodal prompt generation to Prompt Creation.
  • Audio and Visual Scene Descriptors for environment understanding and alignment to Spatial Reasoning.
  • A Context Capture Status in response to Context Capture Directives to A‑User Control.

5. Enables

  • Spatial Reasoning and Prompt Creation to operate with full awareness of the audio-visual environment, supporting perception‑aligned interactions and context‑aware orchestration of AIMs.
  • A‑User Control to stay informed about the implementation of Context Capture Directives.
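
To make point 2 of the list above concrete, the following is a minimal, non-normative sketch in Python of how heterogeneous inputs could be normalised into a single multimodal frame with provenance, and how a Context Capture Directive could prioritise modalities and filter noise. All names (MultimodalFrame, apply_directive, the modality labels) are assumptions made for illustration and are not defined by this Technical Specification.

    # Non-normative sketch: every name below is a hypothetical illustration,
    # not a data type defined by this Technical Specification.
    from dataclasses import dataclass, field
    from typing import Any

    @dataclass
    class MultimodalFrame:
        timestamp: float
        descriptors: list[dict[str, Any]] = field(default_factory=list)

        def add(self, modality: str, payload: Any, source_id: str) -> None:
            # Normalise each input into one canonical record, keeping provenance.
            self.descriptors.append(
                {"modality": modality, "payload": payload, "provenance": source_id}
            )

    def apply_directive(frame: MultimodalFrame, priority: list[str]) -> MultimodalFrame:
        # Keep only the prioritised modalities, ordered as the directive requests;
        # dropping the rest stands in for noise filtering.
        kept = [d for d in frame.descriptors if d["modality"] in priority]
        kept.sort(key=lambda d: priority.index(d["modality"]))
        return MultimodalFrame(frame.timestamp, kept)

    frame = MultimodalFrame(timestamp=1234.5)
    frame.add("text", "open the door", source_id="TextObject#1")
    frame.add("audio", b"raw-samples", source_id="AudioObject#7")
    filtered = apply_directive(frame, priority=["audio", "text"])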
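
A second non-normative sketch, for point 3 of the list above, shows one possible shape of the constructed representation of the User environment and Entity State. The field names are assumptions made for illustration; the normative formats are those referenced in the JSON Metadata section.

    # Hypothetical field names; the normative format is given by the JSON metadata
    # referenced in the JSON Metadata section below.
    entity_state_example = {
        "scene_grounding": {
            "frame_of_reference": "room-local",
            "timestamp": "2025-01-01T12:00:00Z",
        },
        "user_localisation": {
            "position_m": [1.2, 0.0, 3.4],
            "orientation_deg": 90.0,
            "posture": "seated",
        },
        "semantic_tags": ["pointing-gesture", "door", "door-opening"],
        "state_framing": {
            "cognitive": "focused",
            "emotional": "neutral",
            "attentional": "door",
            "intentional": "request-action",
            "motivational": "task-completion",
            "temporal": "immediate",
        },
    }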

2. Reference Model

Figure 1 gives the Context Capture (PGM-CXC) Reference Model.

Figure 1 – The Reference Model of the Context Capture (PGM-CXC) AIM

3. Input/Output Data

Table 1 – Input/Output Data of the Context Capture (PGM-CXC) AIM

Input Description
Text Object User input expressed in structured text form, including written or transcribed utterances.
Audio Object Captured audio signals from the scene, covering speech, environmental sounds, and paralinguistic cues.
3D Model Object Geometric and spatial data describing the environment, including structures, surfaces, and volumetric features.
Visual Object Visual signals from the scene, encompassing gestures, facial expressions, and environmental imagery.
Context Capture Directive Control instructions specifying modality prioritisation, acquisition parameters, or framing rules to guide M‑Location perceptual processing.
Output Description
Context A time‑stamped snapshot integrating multimodal inputs into an initial situational model of the environment and user posture.
Context Capture Status Scene‑level metadata describing user presence, environmental conditions, and confidence measures for contextual framing.
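
For illustration only, a Context Capture Directive could carry content along the following lines; every key name here is an assumption, not part of the specified format. Table 5 in the SubAIMs section describes how such instructions affect the individual SubAIMs.

    # Hypothetical directive payload; key names are illustrative assumptions only.
    context_capture_directive = {
        "modality_priority": ["audio", "visual", "text", "3dmodel"],
        "acquisition": {
            "audio_sample_rate_hz": 16000,
            "visual_frame_rate_fps": 15,
        },
        "framing_rules": {
            "region_of_interest": "user-upper-body",
            "suppress_sources": ["background-music"],
        },
    }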

4. SubAIMs

Figure 2 gives the Reference Model of the Context Capture (PGM‑CXC) Composite AIM.

Figure 2 – Reference Model of the Context Capture (PGM‑CXC) Composite AIM

PGM-CXC may include the following SubAIMs:

1. Audio Scene Description (ASD)

  • Is produced by parsing raw Audio Objects.
  • Serves to generate Audio Scene Descriptors (structured representation of ambient sounds, speech, and spatial audio sources).
  • Enables downstream modules (AOI, AVA) to work with semantically enriched audio data.

2. Visual Scene Description (VSD)

  • Is produced by parsing raw Visual Objects and 3D Model Objects.
  • Serves to generate Visual Scene Descriptors (structured representation of geometry, objects, and layout).
  • Enables VOI and AVA to reason over spatial and visual features.

3. Audio Object Identification (AOI)

  • Is produced by analysing Audio Scene Descriptors.
  • Serves to classify discrete Audio Object Types (speech segments, sound events, environmental cues).
  • Enables AVA and ESE to align audio semantics with visual and contextual data.

4. Visual Object Identification (VOI)

  • Is produced by analysing Visual Scene Descriptors.
  • Serves to classify discrete Visual Object Types (gestures, facial expressions, environmental objects).
  • Enables AVA and ESE to integrate visual semantics into context framing.

5. Audio‑Visual Alignment (AVA)

  • Is produced by combining Audio Scene Descriptors, Audio Object Types, Visual Object Types, Visual Scene Descriptors, and Context Capture Directive.
  • Serves to synchronise audio and visual streams into Aligned Audio Scene Descriptors and Aligned Visual Scene Descriptors.
  • Enables directive‑aware reasoning by generating Context Capture Status (metadata on synchronisation, directive compliance, anchoring).

6. Entity State Extraction (ESE)

  • Is produced by integrating Aligned Audio Scene Descriptors, Aligned Visual Scene Descriptors, Context Capture Directive, Context Capture Status, and Personal Status (via internal PSE).
  • Serves to generate the Entity State (engagement, focus, temporal cues, intent, motivation, personal status).
  • Enables downstream AIMs to adapt reasoning and expressive stance based on the User’s posture.

7. Context Capture Output (CCO)

  • Is produced by consolidating Entity State and Context Capture Status with aligned descriptors.
  • Serves as the final structured output of the CXC AIM.
  • Enables external AIMs (Prompt Creation, Domain Access, A‑User Control) to consume directive‑aware context frames.
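
The data flow through these SubAIMs can be summarised by the non-normative sketch below. The function names are hypothetical stand-ins for the SubAIMs, their bodies are trivial stubs, and only the order of calls and the routing of data reflect the descriptions above and Table 3.

    # Illustrative wiring of the SubAIM chain; all names are hypothetical stand-ins.
    def audio_scene_description(audio_obj):            return {"audio_scene": audio_obj}
    def visual_scene_description(visual_obj, model3d): return {"visual_scene": (visual_obj, model3d)}
    def audio_object_identification(asd):              return {"audio_object_types": []}
    def visual_object_identification(vsd):             return {"visual_object_types": []}

    def audio_visual_alignment(asd, aot, vsd, vot, directive):
        status = {"directive_compliance": True}
        return {"aligned_audio": asd}, {"aligned_visual": vsd}, status

    def entity_state_extraction(aligned_a, aligned_v, directive, status):
        return {"engagement": None, "focus": None, "intent": None}

    def context_capture_output(aligned_a, aligned_v, status, entity_state):
        return {"entity_state": entity_state, "aligned": (aligned_a, aligned_v), "status": status}

    def context_capture(audio_obj, visual_obj, model3d_obj, directive):
        asd = audio_scene_description(audio_obj)
        vsd = visual_scene_description(visual_obj, model3d_obj)
        aot = audio_object_identification(asd)
        vot = visual_object_identification(vsd)
        aligned_a, aligned_v, status = audio_visual_alignment(asd, aot, vsd, vot, directive)
        entity_state = entity_state_extraction(aligned_a, aligned_v, directive, status)
        context = context_capture_output(aligned_a, aligned_v, status, entity_state)
        return context, status   # Context to downstream AIMs, Status to A-User Control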

Table 2 gives the AIMs composing the Context Capture (PGM-CXC) Composite AIM.

Table 2 – AIMs composing the Context Capture (PGM‑CXC) Composite AIM

AIM Name JSON
PGM‑CXC Context Capture Link
PGM‑ASD Audio Scene Description Link
PGM‑VSD Visual Scene Description Link
PGM‑AOI Audio Object Identification Link
PGM‑VOI Visual Object Identification Link
PGM‑AVA Audio-Visual Alignment Link

Table 3 defines all input and output data involved in the PGM-CXC AIM.

Table 3 – Input and output data of the PGM‑CXC AIM

AIMs Input Output To
Audio Scene Description Audio Object Audio Scene Descriptors AOI, AVA
Visual Scene Description Visual Object, 3D Model Object Visual Scene Descriptors VOI, AVA
Audio Object Identification Audio Scene Descriptors Audio Object Types AVA
Visual Object Identification Visual Scene Descriptors Visual Object Types AVA
Audio‑Visual Alignment Audio Scene Descriptors, Audio Object Types, Visual Scene Descriptors, Visual Object Types, Context Capture Directive Aligned Audio Scene Descriptors, Aligned Visual Scene Descriptors, Context Capture Directive, Context Capture Status ESE
Entity State Extraction Aligned Audio Scene Descriptors, Aligned Visual Scene Descriptors, Context Capture Directive, Context Capture Status Entity State CCO
Context Capture Output Aligned Audio Scene Descriptors, Aligned Visual Scene Descriptors, Context Capture Status, Entity State Context PRC, ASR, VSR
Context Capture Status AUC

Table 4 – External and Internal Data Types identified in Context Capture AIM

Data Type Definition
TextObject Structured representation of user input expressed in written or transcribed text form.
AudioObject Captured audio signals from the scene, covering speech, environmental sounds, and paralinguistic cues.
VisualObject Visual signals from the scene, encompassing gestures, facial expressions, and environmental imagery.
3DModelObject Geometric and spatial data describing the environment, including structures, surfaces, and volumetric features.
AudioSceneDescriptors Structured description of environmental audio features and sources.
VisualSceneDescriptors Structured description of visual elements, geometry, and scene layout.
AudioObjectTypes Identified and classified discrete audio entities.
VisualObjectTypes Identified and classified discrete visual entities.
AlignedAVDescriptors Unified multimodal representation synchronising audio and visual streams.
PersonalStatus Extracted cognitive, emotional, and attentional states.
EntityState Formalised representation of the user’s overall cognitive, emotional, attentional, intentional, motivational, and temporal state.
Context Time‑stamped snapshot integrating multimodal inputs into situational model of environment and user posture.
ContextCaptureStatus Metadata describing user presence, environmental conditions, and confidence measures.
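
As an implementation-oriented illustration, a few of the data types in Table 4 could be declared as follows; the type names come from the table, while every field is an assumption made for this sketch.

    # Field choices are assumptions; only the type names are taken from Table 4.
    from typing import TypedDict

    class AudioSceneDescriptors(TypedDict):
        sources: list[str]                    # e.g. "speech", "ambient-hum"
        spatial_positions: list[list[float]]

    class VisualSceneDescriptors(TypedDict):
        objects: list[str]
        layout: dict[str, list[float]]

    class ContextCaptureStatus(TypedDict):
        user_present: bool
        environment: str
        confidence: float

    class Context(TypedDict):
        timestamp: str
        audio: AudioSceneDescriptors
        visual: VisualSceneDescriptors
        status: ContextCaptureStatus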

Table 5 – Effects of Context Capture Directive on SubAIMs

  Sub‑AIM Directive Effects
ASD Audio Scene Description Prioritise specific audio modalities (speech vs ambient), suppress irrelevant sources.
VSD Visual Scene Description Focus on directive‑relevant visual regions, adjust framing parameters.
AOI Audio Object Identification Bias identification toward directive‑relevant sound events.
VOI Visual Object Identification Bias detection toward directive‑relevant gestures or expressions.
AVA Audio‑Visual Alignment Synchronise streams with directive‑defined temporal focus or modality priority.
ESE Entity State Extraction Integrate directive constraints into final Entity State framing.
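
One possible reading of the "bias toward directive-relevant events" entries is a simple re-weighting of candidate classifications, as in the sketch below; the boost factor and function name are hypothetical.

    # Hypothetical re-weighting: scores of directive-relevant labels are boosted.
    def bias_identification(candidates, directive_relevant, boost=1.5):
        # candidates: {label: score}; directive_relevant: labels favoured by the directive.
        rescored = {
            label: score * (boost if label in directive_relevant else 1.0)
            for label, score in candidates.items()
        }
        return max(rescored, key=rescored.get)

    # A directive prioritising speech flips an otherwise ambiguous decision.
    print(bias_identification({"speech": 0.48, "music": 0.52}, {"speech"}))  # -> speech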

Table 6 – Contributions to Context Capture Status from SubAIMs

  Sub‑AIM Status Contributions
ASD Audio Scene Description Audio modality compliance, source inclusion/exclusion rationale.
VSD Visual Scene Description Visual modality compliance, framing override trace.
AOI Audio Object Identification Identification confidence, directive bias trace.
VOI Visual Object Identification Detection confidence, directive bias trace.
AVA Audio‑Visual Alignment Synchronisation compliance, override flags, anchoring metadata.
ESE Entity State Extraction Final directive compliance summary, integrated presence metadata.
CCO Context Capture Output Consolidates the above contributions into the CXC output data.
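
The CCO row can be read as an aggregation step. The sketch below shows one hypothetical way the per-SubAIM contributions of Table 6 could be consolidated into a single Context Capture Status record; all field names are assumptions.

    # Hypothetical consolidation of per-SubAIM contributions into one status record.
    def consolidate_status(contributions):
        # contributions: {subaim: {"compliance": bool, "confidence": float, ...}}
        confidences = [c["confidence"] for c in contributions.values() if "confidence" in c]
        return {
            "per_subaim": contributions,
            "overall_compliance": all(c.get("compliance", True) for c in contributions.values()),
            "min_confidence": min(confidences) if confidences else None,
        }

    status = consolidate_status({
        "ASD": {"compliance": True, "confidence": 0.9},
        "AVA": {"compliance": True, "confidence": 0.8, "override_flags": []},
    })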

5. JSON Metadata

https://schemas.mpai.community/PGM1/V1.0/AIMs/ContextCapture.json
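
As a usage note, the published metadata can be retrieved and inspected with the Python standard library alone; the sketch below assumes only that the document at the URL above is a JSON object and makes no assumption about its contents.

    # Retrieves and parses the published JSON metadata for the Context Capture AIM.
    import json
    from urllib.request import urlopen

    METADATA_URL = "https://schemas.mpai.community/PGM1/V1.0/AIMs/ContextCapture.json"

    with urlopen(METADATA_URL, timeout=10) as response:
        metadata = json.load(response)

    print(sorted(metadata))  # list the top-level keys of the metadata document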

6. Profiles

No Profiles.