| Function | Reference Model | Input/Output Data |
|---|---|---|
| SubAIMs | JSON Metadata | Profiles |
| Reference Software | Conformance Testing | Performance Assessment |
1. Function
The Context Capture (PGM‑CXC) AIM is the A‑User’s active perceptual interface to the spatial environment. It collects, fuses, and structures multimodal contextual information – including audio, visual, spatial, and environmental signals – and supports runtime reorientation under explicit Human or AUC commands.
CXC provides Audio Scene Descriptors and Visual Scene Descriptors, which include object localisation, User gaze/gesture alignment, and the spatial layout information required for Goal Acquisition.
CXC may be directed by the Human through natural commands (e.g., “look at that corner”, “zoom there”, “follow that object”), which AUC translates into perceptual redirection operations. These operations adjust CXC’s capture configuration prior to any semantic interpretation.
CXC does not interpret goals or meaning; instead, it supplies perceptual context.
Specific functionalities
Multimodal Context Acquisition: The CXC AIM continuously captures audio, visual, spatial, and environmental signals from the surrounding environment.
Audio and Visual Scene Descriptor Generation: The CXC AIM generates Audio Scene Descriptors and Visual Scene Descriptors that describe object localisation, spatial layout, User gaze/gesture alignment, and other perceptual features of the environment.
Human‑Driven Perceptual Redirection: The CXC AIM supports perceptual redirection when instructed by the Human through AUC (e.g., reorienting viewpoint, changing focus, zooming, following a referenced object or region).
Runtime Capture Reconfiguration: The CXC AIM dynamically reconfigures its capture parameters (e.g., direction, focus, zoom, sampling region) in response to perceptual redirection commands issued by AUC.
Perceptual Grounding for Goal Acquisition: The CXC AIM provides perceptual descriptors that enable spatial grounding of Human expressions involving referenced objects or regions (e.g., “that corner”, “this object”, “over there”).
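The Human-driven redirection flow above (Human → AUC → CXC, adjusting capture before any semantic interpretation) can be sketched as follows. This is a minimal illustration: the `CaptureConfig` class, its fields, and the command dictionary keys are hypothetical, not part of the specification.

```python
from dataclasses import dataclass

@dataclass
class CaptureConfig:
    """Hypothetical runtime capture parameters of the CXC AIM."""
    direction_deg: float = 0.0   # viewpoint azimuth
    zoom: float = 1.0            # magnification factor
    focus_region: str = "full"   # sampling region or followed object

def apply_redirection(config: CaptureConfig, command: dict) -> CaptureConfig:
    """Apply a perceptual redirection operation issued by AUC.

    The adjustment happens prior to semantic interpretation: only the
    capture configuration changes, no goals or meaning are inferred.
    """
    op = command.get("operation")
    if op == "reorient":                     # "look at that corner"
        config.direction_deg = command["direction_deg"]
    elif op == "zoom":                       # "zoom there"
        config.zoom = command["factor"]
    elif op == "follow":                     # "follow that object"
        config.focus_region = command["object_id"]
    return config

# AUC translates "zoom there" into a redirection operation:
cfg = apply_redirection(CaptureConfig(), {"operation": "zoom", "factor": 2.5})
```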
2. Reference Model
Figure 1 gives the Context Capture (PGM-CXC) Reference Model.

Figure 1 – The Reference Model of the Context Capture (PGM-CXC) AIM
3. Input/Output Data
Table 1 – Context Capture (PGM-CXC) AIM
| Input | Description |
|---|---|
| Text Object | User input expressed in structured text form, including written or transcribed utterances. |
| Audio Object | Captured audio signals from the scene, covering speech, environmental sounds, and paralinguistic cues. |
| 3D Model Object | Geometric and spatial data describing the environment, including structures, surfaces, and volumetric features. |
| Visual Object | Visual signals from the scene, encompassing gestures, facial expressions, and environmental imagery. |
| Context Capture Directive | Control instructions specifying modality prioritisation, acquisition parameters, or framing rules to guide the perceptual processing of an M‑Location. |

| Output | Description |
|---|---|
| Context | A time‑stamped snapshot integrating multimodal inputs into an initial situational model of the environment and user posture. |
| Context Capture Status | Scene‑level metadata describing User presence, environmental conditions, and confidence measures for contextual framing. |
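The Context output described in Table 1 can be pictured as a time-stamped record aggregating the multimodal inputs. The following sketch is illustrative only; the class and field names are hypothetical and not defined by the specification.

```python
import time
from dataclasses import dataclass

@dataclass
class Context:
    """Hypothetical rendering of the Context output of Table 1:
    a time-stamped snapshot integrating multimodal inputs."""
    timestamp: float        # capture time of the snapshot
    audio_scene: dict       # audio-derived situational features
    visual_scene: dict      # visual/spatial situational features
    entity_state: dict      # initial model of the User's posture

def build_context(audio: dict, visual: dict, entity: dict) -> Context:
    """Assemble a snapshot from already-fused modality descriptors."""
    return Context(
        timestamp=time.time(),
        audio_scene=audio,
        visual_scene=visual,
        entity_state=entity,
    )
```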
4. SubAIMs (informative)
Figure 2 gives the informative Reference Model of the Context Capture (PGM‑CXC) Composite AIM.

Figure 2 – Reference Model of Context Capture (PGM‑CXC) Composite AIM
Figure 2 assumes that PGM-CXC includes the following SubAIMs:
1. Audio Scene Description (ASD)
- Parses raw Audio Objects.
- Produces Audio Scene Descriptors (structured representation of ambient sounds, speech, and spatial audio sources).
- Enables downstream modules (AOI, AVA) to work with semantically enriched audio data.
2. Visual Scene Description (VSD)
- Parses raw Visual Objects and 3D Model Objects.
- Produces Visual Scene Descriptors (structured representation of geometry and objects).
- Enables VOI and AVA to add detail to spatial and visual features.
3. Audio Object Identification (AOI)
- Analyses Audio Scene Descriptors.
- Classifies discrete Audio Object Types (speech segments, sound events, environmental cues).
- Enables AVA and Entity State Extraction (ESE) to align audio semantics with visual and contextual data.
4. Visual Object Identification (VOI)
- Analyses Visual Scene Descriptors.
- Classifies discrete Visual Object Types (gestures, facial expressions, environmental objects).
- Enables AVA and ESE to integrate visual semantics into context framing.
5. Audio‑Visual Alignment (AVA)
- Combines Audio Scene Descriptors, Audio Object Types, Visual Object Types, Visual Scene Descriptors, and Context Capture Directive.
- Synchronises audio and visual streams into Aligned Audio Scene Descriptors and Aligned Visual Scene Descriptors.
- Enables directive‑aware actions and reports Context Capture Status (metadata on synchronisation, directive compliance, anchoring).
6. Entity State Extraction (ESE)
- Integrates Aligned Audio Scene Descriptors, Aligned Visual Scene Descriptors, Context Capture Directive, Context Capture Status, and Personal Status (via the internal PSE).
- Produces the User’s initial Entity State, a highly structured Data Type that may be reduced to Personal Status.
- Enables downstream AIMs to adapt reasoning and expressive stance based on the User’s posture.
7. Context Capture Multiplexing (CCX)
- Consolidates the Entity State extracted from the User’s captured visual representation and the Context Capture Status with the aligned descriptors.
- Produces the final structured output of the CXC AIM.
- Enables external AIMs (Prompt Creation, Audio Spatial Reasoning, and Visual Spatial Reasoning) to consume directive‑aware context frames and A‑User Control to receive a report on the execution of the Directive.
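The SubAIM data flow listed above can be sketched as a simple pipeline. All SubAIM bodies below are illustrative stubs (placeholder dictionaries, not normative logic); only the wiring between them follows the flow described in this section.

```python
# Stub SubAIMs: each returns a labelled dict so the flow is traceable.
def asd(audio_obj):                      # Audio Scene Description
    return {"kind": "AudioSceneDescriptors"}

def vsd(visual_obj, model_obj):          # Visual Scene Description
    return {"kind": "VisualSceneDescriptors"}

def aoi(audio_desc):                     # Audio Object Identification
    return {"kind": "AudioObjectTypes"}

def voi(visual_desc):                    # Visual Object Identification
    return {"kind": "VisualObjectTypes"}

def ava(audio_desc, audio_types, visual_desc, visual_types, directive):
    # Audio-Visual Alignment: synchronises streams, reports Status.
    status = {"kind": "ContextCaptureStatus", "directive_compliance": True}
    return ({"kind": "AlignedAudioSceneDescriptors"},
            {"kind": "AlignedVisualSceneDescriptors"},
            status)

def ese(aligned_audio, aligned_visual, directive, status):
    return {"kind": "EntityState"}       # Entity State Extraction

def ccx(aligned_audio, aligned_visual, status, entity_state):
    return {"kind": "Context"}           # Context Capture Multiplexing

def context_capture_pipeline(audio_obj, visual_obj, model_obj, directive):
    """Wire the seven SubAIMs in the order described above."""
    audio_desc = asd(audio_obj)
    visual_desc = vsd(visual_obj, model_obj)
    audio_types = aoi(audio_desc)
    visual_types = voi(visual_desc)
    aligned_a, aligned_v, status = ava(
        audio_desc, audio_types, visual_desc, visual_types, directive)
    entity_state = ese(aligned_a, aligned_v, directive, status)
    context = ccx(aligned_a, aligned_v, status, entity_state)
    return context, status  # Context -> PRC/ASR/VSR; Status -> AUC
```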
Table 2 gives the AIMs composing the Context Capture (PGM-CXC) Composite AIM.
Table 2 – AIMs composing the Context Capture (PGM‑CXC) Composite AIM
| AIMs | Names | JSON |
|---|---|---|
| PGM‑CXC | Context Capture | Link |
| PGM‑ASD | Audio Scene Description | Link |
| PGM‑VSD | Visual Scene Description | Link |
| PGM‑AOI | Audio Object Identification | Link |
| PGM‑VOI | Visual Object Identification | Link |
| PGM‑AVA | Audio-Visual Alignment | Link |
| PGM‑ESE | Entity State Extraction | Link |
| PGM‑CCX | Context Capture Multiplexing | Link |
Table 3 defines all input and output data of the PGM-CXC AIM SubAIMs.
Table 3 – Input and output data of the PGM‑CXC AIM SubAIMs
| AIMs | Input | Output | To |
|---|---|---|---|
| Audio Scene Description | Audio Object | Audio Scene Descriptors | AOI, AVA |
| Visual Scene Description | Visual Object, 3D Model Object | Visual Scene Descriptors | VOI, AVA |
| Audio Object Identification | Audio Scene Descriptors | Audio Object Types | AVA |
| Visual Object Identification | Visual Scene Descriptors | Visual Object Types | AVA |
| Audio‑Visual Alignment | Audio Scene Descriptors, Audio Object Types, Visual Scene Descriptors, Visual Object Types, Context Capture Directive | Aligned Audio Scene Descriptors, Aligned Visual Scene Descriptors, Context Capture Directive, Context Capture Status | ESE |
| Entity State Extraction | Aligned Audio Scene Descriptors, Aligned Visual Scene Descriptors, Context Capture Directive, Context Capture Status | Entity State | CCX |
| Context Capture Multiplexing | Aligned Audio Scene Descriptors, Aligned Visual Scene Descriptors, Context Capture Status, Entity State | Context | PRC, ASR, VSR |
| | | Context Capture Status | AUC |
Table 4 – External and Internal Data Types identified in Context Capture AIM
| Data Type | Definition |
|---|---|
| TextObject | Structured representation of user input expressed in written or transcribed text form. |
| AudioObject | Captured audio signals from the scene, covering speech, environmental sounds, and paralinguistic cues. |
| VisualObject | Visual signals from the scene, encompassing gestures, facial expressions, and environmental imagery. |
| 3DModelObject | Geometric and spatial data describing the environment, including structures, surfaces, and volumetric features. |
| AudioSceneDescriptors | Structured description of environmental audio features and sources. |
| VisualSceneDescriptors | Structured description of visual elements, geometry, and scene layout. |
| 3DModelSceneDescriptors | Structured description of 3D Model elements, geometry, and scene layout. |
| IdentifiedAudioObjects | Identified and classified discrete audio entities. |
| IdentifiedVisualObjects | Identified and classified discrete visual entities. |
| Identified3DModelObjects | Identified and classified discrete 3D Model entities. |
| AlignedAVDescriptors | Unified multimodal representation synchronising audio and visual streams. |
| EntityState | Representation of the User’s overall cognitive, emotional, attentional, and temporal state. |
| Context | Time‑stamped snapshot integrating multimodal inputs into a situational model of the environment and User posture. |
| ContextCaptureDirective | Instruction provided by A-User Control. |
| ContextCaptureStatus | Metadata describing User presence, environmental conditions, and confidence measures. |
Table 5 maps CXC Inputs/Outputs to Unified Messages.
Table 5 – PGM-CXC Inputs/Outputs mapped to PGM-AUC Unified Messages
| CXC Data Name | Role | Origin / Destination | Unified Schema Mapping |
|---|---|---|---|
| Context Capture Directive | Input | A‑User Control | Directive → TargetAIM=CXC (AIMInstance); acquisition/framing parameters in Parameters and/or Constraints. |
| Text Object | Input | Captured stream | Aggregated into Context (PGM‑CXT) under InputChannels. |
| Audio Object | Input | Captured stream | Same as above; referenced in Context. |
| Visual Object | Input | Captured stream | Same as above; referenced in Context. |
| 3D Model Object | Input | Captured stream | Same as above; referenced in Context. |
| Context | Output | PRC, ASR, VSR | Status → Result (Context snapshot with AVSceneDescriptors + UserState); correlated via Envelope.CorrelationId. |
| Context Capture Status | Output | A‑User Control | Status → State, Progress, Summary, Result (scene-level metadata, confidence, presence). |
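The Directive-to-CXC mapping in Table 5 can be illustrated with a small helper that wraps acquisition parameters in a unified message envelope. The helper itself and the exact dictionary layout are hypothetical; only the field names (TargetAIM, Parameters, Constraints, Envelope.CorrelationId) mirror the table.

```python
import uuid
from typing import Optional

def make_cxc_directive(parameters: dict,
                       constraints: Optional[dict] = None) -> dict:
    """Build an illustrative unified message carrying a Context Capture
    Directive, per the Table 5 mapping: Directive -> TargetAIM=CXC, with
    acquisition/framing parameters in Parameters and/or Constraints."""
    return {
        "Envelope": {
            # Lets the Context/Status outputs be correlated back
            # to this Directive (Envelope.CorrelationId in Table 5).
            "CorrelationId": str(uuid.uuid4()),
        },
        "Directive": {
            "TargetAIM": "CXC",
            "Parameters": parameters,
            "Constraints": constraints or {},
        },
    }

# A Directive prioritising the visual modality and requesting a zoom:
msg = make_cxc_directive({"zoom": 2.0,
                          "modality_priority": ["visual", "audio"]})
```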
5. JSON Metadata
https://schemas.mpai.community/PGM1/V1.0/AIMs/ContextCapture.json
6. Profiles
No Profiles.