(Tentative)
1. Function
2. Reference Model
3. Input/Output Data
4. SubAIMs
5. JSON Metadata
6. Profiles
1. Function
Context Capture AIM (PGM‑CXC)
1. Receives
- Text Objects representing user utterances or written input.
- Audio Objects representing speech and environmental sounds.
- Visual Objects representing user gestures, facial expressions, and scene elements.
- 3D Model Objects representing spatial geometry and environmental structures.
- Context Capture Directives from A‑User Control guiding modality prioritization and acquisition strategy.
2. Extracts, refines, and interprets multimodal signals to:
- Disambiguate inputs by aligning text, audio, visual, and 3D model objects into a coherent multimodal frame.
- Normalise descriptors into canonical formats, ensuring consistency across modalities and traceability of provenance.
- Infer relationships among signals, such as linking gestures to utterances, or mapping audio cues to spatial anchors.
- Resolve context by applying A‑User Control directives to prioritise modalities, filter noise, and highlight salient features.
- Generate enriched semantics that capture intent‑aligned cues, such as urgency, confidence, and attentional focus.
3. Constructs a representation of the User environment and Entity State, including:
- Scene grounding in spatio‑temporal coordinates.
- User localisation (position, orientation, posture).
- Semantic tagging of objects, actions, and environmental features.
- Framing of cognitive, emotional, attentional, intentional, motivational, and temporal states.
4. Sends
- An Entity State for multimodal prompt generation to Prompt Creation.
- Audio and Visual Scene Descriptors for environment understanding and alignment to Spatial Reasoning.
- A Context Capture Status in response to Context Capture Directives to A‑User Control.
5. Enables
- Spatial Reasoning and Prompt Creation to operate with full awareness of the audio-visual environment, supporting perception‑aligned interactions and context‑aware orchestration of AIMs.
- A‑User Control to stay informed about the implementation of Context Capture Directives.
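The receive–interpret–construct–send flow above can be sketched as a minimal data-flow stub. All class names, field names, and values below are illustrative assumptions, not normative MPAI data types:

```python
from dataclasses import dataclass

# Illustrative stand-ins for the normative MPAI data types (names assumed).
@dataclass
class EntityState:
    engagement: str
    focus: str
    intent: str

@dataclass
class ContextCaptureStatus:
    directives_applied: list
    confidence: float

def context_capture(text, audio, visual, model3d, directives):
    """Hypothetical sketch of PGM-CXC: fuse modality inputs under directives."""
    # 1. Align inputs into one multimodal frame (trivially keyed here).
    frame = {"text": text, "audio": audio, "visual": visual, "3d": model3d}
    # 2. Apply directives: keep only prioritised modalities, dropping noise.
    frame = {k: v for k, v in frame.items()
             if k in directives.get("priority", frame)}
    # 3. Construct the outputs sent downstream (values are placeholders).
    state = EntityState(engagement="high", focus="speaker", intent="query")
    status = ContextCaptureStatus(list(directives.get("priority", [])), 0.9)
    return state, status, frame
```

A caller such as A‑User Control would pass a directive like `{"priority": ["text", "audio"]}` and receive back the Entity State, the Context Capture Status, and the filtered multimodal frame.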
2. Reference Model
Figure 1 gives the Context Capture (PGM-CXC) Reference Model.

Figure 1 – The Reference Model of the Context Capture (PGM-CXC) AIM
3. Input/Output Data
Table 1 – Input and Output Data of the Context Capture (PGM-CXC) AIM
| Input | Description |
| Text Object | User input expressed in structured text form, including written or transcribed utterances. |
| Audio Object | Captured audio signals from the scene, covering speech, environmental sounds, and paralinguistic cues. |
| 3D Model Object | Geometric and spatial data describing the environment, including structures, surfaces, and volumetric features. |
| Visual Object | Visual signals from the scene, encompassing gestures, facial expressions, and environmental imagery. |
| Context Capture Directive | Control instructions specifying modality prioritization, acquisition parameters, or framing rules to guide M‑Location perceptual processing. |
| Output | Description |
| Context | A time‑stamped snapshot integrating multimodal inputs into an initial situational model of the environment and user posture. |
| Context Capture Status | Scene‑level metadata describing user presence, environmental conditions, and confidence measures for contextual framing. |
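The Context and Context Capture Status outputs in Table 1 might be serialised as in the sketch below. The field names and values are assumptions for illustration only and are not drawn from the normative ContextCapture.json schema:

```python
import json

# Hypothetical shape of a Context snapshot (time-stamped situational model)
# and its companion Context Capture Status; all fields are assumed.
context = {
    "timestamp": 0.0,  # time-stamp of the snapshot
    "scene": {"anchors": [], "objects": ["table", "speaker"]},
    "user": {
        "position": [0.0, 0.0, 0.0],
        "orientation": [0.0, 0.0, 0.0, 1.0],  # quaternion
        "posture": "seated",
    },
}
status = {
    "user_present": True,                 # user presence
    "environment": "indoor/quiet",        # environmental conditions
    "confidence": 0.87,                   # confidence of contextual framing
}
payload = json.dumps({"Context": context, "ContextCaptureStatus": status})
```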
4. SubAIMs
Figure 2 gives the Reference Model of the Context Capture (PGM‑CXC) Composite AIM.

Figure 2 – Reference Model of Context Capture (PGM‑CXC) Composite AIM
PGM-CXC may include the following SubAIMs:
1. Audio Scene Description (ASD)
- Is produced by parsing raw Audio Objects.
- Serves to generate Audio Scene Descriptors (structured representation of ambient sounds, speech, and spatial audio sources).
- Enables downstream modules (AOI, AVA) to work with semantically enriched audio data.
2. Visual Scene Description (VSD)
- Is produced by parsing raw Visual Objects and 3D Model Objects.
- Serves to generate Visual Scene Descriptors (structured representation of geometry, objects, and layout).
- Enables VOI and AVA to reason over spatial and visual features.
3. Audio Object Identification (AOI)
- Is produced by analysing Audio Scene Descriptors.
- Serves to classify discrete Audio Object Types (speech segments, sound events, environmental cues).
- Enables AVA and ESE to align audio semantics with visual and contextual data.
4. Visual Object Identification (VOI)
- Is produced by analysing Visual Scene Descriptors.
- Serves to classify discrete Visual Object Types (gestures, facial expressions, environmental objects).
- Enables AVA and ESE to integrate visual semantics into context framing.
5. Audio‑Visual Alignment (AVA)
- Is produced by combining Audio Scene Descriptors, Audio Object Types, Visual Object Types, Visual Scene Descriptors, and Context Capture Directive.
- Serves to synchronise audio and visual streams into Aligned Audio Scene Descriptors and Aligned Visual Scene Descriptors.
- Enables directive‑aware reasoning by generating Context Capture Status (metadata on synchronisation, directive compliance, anchoring).
6. Entity State Extraction (ESE)
- Is produced by integrating Aligned Audio Scene Descriptors, Aligned Visual Scene Descriptors, Context Capture Directive, Context Capture Status, and Personal Status (via internal PSE).
- Serves to generate the Entity State (engagement, focus, temporal cues, intent, motivation, personal status).
- Enables downstream AIMs to adapt reasoning and expressive stance based on the User’s posture.
7. Context Capture Output (CCO)
- Is produced by consolidating Entity State and Context Capture Status with aligned descriptors.
- Serves as the final structured output of the CXC AIM.
- Enables external AIMs (Prompt Creation, Domain Access, A‑User Control) to consume directive‑aware context frames.
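The SubAIM chain above (ASD/VSD → AOI/VOI → AVA → ESE → CCO) can be sketched as a function composition. Every function body below is a placeholder assumption; only the wiring between SubAIMs follows the description in the text:

```python
def asd(audio_object):                 # Audio Scene Description
    return {"sources": [audio_object]}

def vsd(visual_object, model3d):       # Visual Scene Description
    return {"objects": [visual_object], "geometry": model3d}

def aoi(audio_desc):                   # Audio Object Identification
    return ["speech" for _ in audio_desc["sources"]]

def voi(visual_desc):                  # Visual Object Identification
    return ["gesture" for _ in visual_desc["objects"]]

def ava(a_desc, a_types, v_desc, v_types, directive):  # Audio-Visual Alignment
    return {"aligned_audio": a_desc, "aligned_visual": v_desc,
            "status": {"directive": directive, "synced": True}}

def ese(aligned, directive):           # Entity State Extraction
    return {"engagement": "active", "directive": directive}

def cco(aligned, entity_state):        # Context Capture Output
    return {"Context": {**aligned, "entity_state": entity_state},
            "ContextCaptureStatus": aligned["status"]}

def pgm_cxc(audio, visual, model3d, directive):
    """Hypothetical end-to-end SubAIM pipeline of the Composite AIM."""
    a, v = asd(audio), vsd(visual, model3d)
    aligned = ava(a, aoi(a), v, voi(v), directive)
    return cco(aligned, ese(aligned, directive))
```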
Table 2 gives the AIMs composing the Context Capture (PGM-CXC) Composite AIM.
Table 2 – AIMs composing the Context Capture (PGM‑CXC) Composite AIM
| AIMs | Names | JSON |
| PGM‑CXC | Context Capture | Link |
| PGM‑ASD | Audio Scene Description | Link |
| PGM‑VSD | Visual Scene Description | Link |
| PGM‑AOI | Audio Object Identification | Link |
| PGM‑VOI | Visual Object Identification | Link |
| PGM‑AVA | Audio-Visual Alignment | Link |
Table 3 defines all input and output data involved in the PGM-CXC AIM.
Table 3 – Input and output data of the PGM‑CXC AIM
| AIMs | Input | Output | To |
| Audio Scene Description | Audio Object | Audio Scene Descriptors | AOI, AVA |
| Visual Scene Description | Visual Object, 3D Model Object | Visual Scene Descriptors | VOI, AVA |
| Audio Object Identification | Audio Scene Descriptors | Audio Object Types | AVA |
| Visual Object Identification | Visual Scene Descriptors | Visual Object Types | AVA |
| Audio‑Visual Alignment | Audio Scene Descriptors, Audio Object Types, Visual Scene Descriptors, Visual Object Types, Context Capture Directive | Aligned Audio Scene Descriptors, Aligned Visual Scene Descriptors, Context Capture Directive, Context Capture Status | ESE |
| Entity State Extraction | Aligned Audio Scene Descriptors, Aligned Visual Scene Descriptors, Context Capture Directive, Context Capture Status | Entity State | CCO |
| Context Capture Output | Aligned Audio Scene Descriptors, Aligned Visual Scene Descriptors, Context Capture Status, Entity State | Context | PRC, ASR, VSR |
| | | Context Capture Status | AUC |
Table 4 – External and Internal Data Types identified in Context Capture AIM
| Data Type | Definition |
| TextObject | Structured representation of user input expressed in written or transcribed text form. |
| AudioObject | Captured audio signals from the scene, covering speech, environmental sounds, and paralinguistic cues. |
| VisualObject | Visual signals from the scene, encompassing gestures, facial expressions, and environmental imagery. |
| 3DModelObject | Geometric and spatial data describing the environment, including structures, surfaces, and volumetric features. |
| AudioSceneDescriptors | Structured description of environmental audio features and sources. |
| VisualSceneDescriptors | Structured description of visual elements, geometry, and scene layout. |
| AudioObjectTypes | Identified and classified discrete audio entities. |
| VisualObjectTypes | Identified and classified discrete visual entities. |
| AlignedAVDescriptors | Unified multimodal representation synchronising audio and visual streams. |
| PersonalStatus | Extracted cognitive, emotional, and attentional states. |
| EntityState | Formalised representation of the user’s overall cognitive, emotional, attentional, and temporal state. |
| Context | Time‑stamped snapshot integrating multimodal inputs into situational model of environment and user posture. |
| ContextCaptureStatus | Metadata describing user presence, environmental conditions, and confidence measures. |
Table 5 – Effects of Context Capture Directive on SubAIMs
| Sub‑AIM | Name | Directive Effects |
| ASD | Audio Scene Description | Prioritise specific audio modalities (speech vs ambient), suppress irrelevant sources. |
| VSD | Visual Scene Description | Focus on directive‑relevant visual regions, adjust framing parameters. |
| AOI | Audio Object Identification | Bias identification toward directive‑relevant sound events. |
| VOI | Visual Object Identification | Bias detection toward directive‑relevant gestures or expressions. |
| AVA | Audio‑Visual Alignment | Synchronise streams with directive‑defined temporal focus or modality priority. |
| ESE | Entity State Extraction | Integrate directive constraints into final Entity State framing. |
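The per-SubAIM directive effects in Table 5 can be modelled as a dispatch table that maps one Context Capture Directive onto one effect per SubAIM. The directive fields and effect strings below are illustrative assumptions:

```python
# Hypothetical mapping of a Context Capture Directive onto per-SubAIM
# effects (Table 5); directive keys and effect phrasing are assumed.
DIRECTIVE_EFFECTS = {
    "ASD": lambda d: f"prioritise {d.get('audio_priority', 'all')} audio",
    "VSD": lambda d: f"focus region {d.get('region', 'full-frame')}",
    "AOI": lambda d: f"bias toward {d.get('sound_events', 'any')} events",
    "VOI": lambda d: f"bias toward {d.get('gestures', 'any')} gestures",
    "AVA": lambda d: f"sync with {d.get('temporal_focus', 'default')} focus",
    "ESE": lambda d: "integrate directive constraints into Entity State",
}

def apply_directive(directive):
    """Return the effect each SubAIM derives from one directive."""
    return {aim: effect(directive) for aim, effect in DIRECTIVE_EFFECTS.items()}
```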
Table 6 – Contributions to Context Capture Status from SubAIMs
| Sub‑AIM | Name | Status Contributions |
| ASD | Audio Scene Description | Audio modality compliance, source inclusion/exclusion rationale. |
| VSD | Visual Scene Description | Visual modality compliance, framing override trace. |
| AOI | Audio Object Identification | Identification confidence, directive bias trace. |
| VOI | Visual Object Identification | Detection confidence, directive bias trace. |
| AVA | Audio‑Visual Alignment | Synchronisation compliance, override flags, anchoring metadata. |
| ESE | Entity State Extraction | Final directive compliance summary, integrated presence metadata. |
| CCO | Context Capture Output | Consolidates all SubAIM contributions into the final Context Capture Status. |
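One way to picture the CCO consolidation in Table 6 is a merge of per-SubAIM status records into a single Context Capture Status. The record structure, the minimum-confidence rule, and the compliance flag are all assumptions for illustration:

```python
def consolidate_status(contributions):
    """Hypothetical CCO-style merge of per-SubAIM status contributions
    (Table 6) into one Context Capture Status record."""
    merged = {"per_subaim": contributions}
    # Overall confidence: weakest-link rule over reported confidences (assumed).
    confidences = [c["confidence"] for c in contributions.values()
                   if "confidence" in c]
    merged["confidence"] = min(confidences) if confidences else None
    # Directive compliance holds only if every SubAIM reports compliance.
    merged["compliant"] = all(c.get("compliant", True)
                              for c in contributions.values())
    return merged
```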
5. JSON Metadata
https://schemas.mpai.community/PGM1/V1.0/AIMs/ContextCapture.json
6. Profiles
No Profiles.