(Tentative)
| Function | Reference Model | Input/Output Data |
| SubAIMs | JSON Metadata | Profiles |
Function
The Context Capture AIM (PGM-CXC) receives one or more of the following modalities: Text, Audio, Visual, and 3D Model inputs. Its primary function is to synthesize these heterogeneous signals into Context, a structured semantic representation of the User and the Audio-Visual Scene the User is embedded in.
Internally, PGM-CXC may perform the following operations:
- Multimodal Parsing: Decomposes incoming signals into interpretable units, e.g., utterances, gestures, spatial anchors, and visual entities.
- Scene Grounding: Maps perceptual inputs to a unified spatial-temporal frame, identifying relationships between the User and surrounding elements.
- User Localization: Determines the User’s Position, Orientation, and Personal Status within the Audio-Visual Scene using audio-visual cues and 3D geometry.
- Semantic Tagging: Annotates Items, actions, and environmental features with semantic labels to support downstream reasoning.
- Context Framing: Assembles a coherent representation of the current interaction episode, including modality status, perceptual confidence, and temporal boundaries.
- Directive Emission: Optionally emits a Context Capture Directive to A-User Control, enabling dynamic modulation of capture parameters (e.g., modality prioritisation).
The resulting Context output serves as the perceptual backbone for AIM orchestration, enabling spatial reasoning, prompt generation, and expressive alignment to operate with full situational awareness.
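The internal operations above can be sketched as a minimal pipeline. Everything below is an illustrative assumption for exposition (the `Context` class, `capture_context` function, and all field names are hypothetical and not defined by this specification):

```python
from dataclasses import dataclass
import time

@dataclass
class Context:
    """Hypothetical structured, time-stamped Context representation."""
    timestamp: float        # time of the snapshot
    entities: list          # semantically tagged units (Semantic Tagging)
    user_pose: dict         # Position / Orientation (User Localization)
    modality_status: dict   # which modalities were present (Context Framing)
    confidence: float       # overall perceptual confidence

def capture_context(text=None, audio=None, visual=None, model_3d=None):
    """Synthesize heterogeneous modality inputs into a single Context."""
    inputs = {"text": text, "audio": audio, "visual": visual, "3d_model": model_3d}
    # Multimodal Parsing: keep only the modalities actually supplied.
    present = {k: v for k, v in inputs.items() if v is not None}
    # Semantic Tagging (stub): label each parsed unit with its modality.
    entities = [{"label": k, "value": v} for k, v in present.items()]
    # User Localization (stub): a fixed pose stands in for real estimation.
    pose = {"position": (0.0, 0.0, 0.0), "orientation": (0.0, 0.0, 0.0, 1.0)}
    # Context Framing: assemble the coherent, time-stamped representation.
    return Context(
        timestamp=time.time(),
        entities=entities,
        user_pose=pose,
        modality_status={k: (v is not None) for k, v in inputs.items()},
        confidence=1.0 if present else 0.0,
    )
```

For example, `capture_context(text="hello")` would yield a Context whose modality status marks audio, visual, and 3D model inputs as absent.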
Reference Model
Figure 3 gives the Context Capture (PGM-CXC) Reference Model.

Figure 3 – The Reference Model of the Context Capture (PGM-CXC) AIM
Input/Output Data
Table 10 – Context Capture (PGM-CXC) AIM
| Input | Description |
| Text Object | User input as text |
| Audio Object | The Audio component of the Scene where the User is embedded |
| 3D Model Object | The 3D Model component of the Scene where the User is embedded |
| Visual Object | The Visual component of the Scene where the User is embedded |
| Context Capture Directive | Instructions, such as modality prioritisation or context framing parameters, that guide M-Location perceptual acquisition. |
| Output | Description |
| Context | A structured and time-stamped snapshot representing the A-User's initial understanding of the environment and of the User's posture. |
| Context Capture Status | Provides scene-level context and User presence metadata. |
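For illustration only, a Context snapshot could be serialized as time-stamped JSON along the following lines. The key names here are assumptions chosen for readability; the normative structure is the schema referenced under JSON Metadata:

```python
import json

# Hypothetical Context snapshot; all keys are illustrative assumptions,
# not taken from the normative ContextCapture.json schema.
context_snapshot = {
    "timestamp": "2024-01-01T12:00:00Z",
    "user": {
        "position": [0.0, 0.0, 0.0],
        "orientation": [0.0, 0.0, 0.0, 1.0],
        "personalStatus": "neutral",
    },
    "scene": {
        "items": [{"id": "item-1", "label": "table"}],
        "modalityStatus": {
            "text": True, "audio": True, "visual": False, "3dModel": False
        },
    },
    "confidence": 0.87,
}

print(json.dumps(context_snapshot, indent=2))
```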
SubAIMs
No SubAIMs.
JSON Metadata
https://schemas.mpai.community/PGM1/V1.0/AIMs/ContextCapture.json
Profiles
No Profiles.