Function
Ref. Model
I/O Data
SubAIMs
JSON MData
Profiles
Ref. Software
Conformance
Performance
1 Functions
The Context Description (PGM‑CXT) AIM is the A‑User’s perceptual front end to the spatial environment, integrating in a single Composite AIM the active capture of a multimodal scene and its interpretative enrichment. It receives raw Audio and Visual Objects, captures and structures the scene without LLM involvement and without MCP interactions, and then applies modality‑specific analysis, cross‑modal alignment, and optional domain knowledge to produce Enhanced Audio Scene Descriptors and Enhanced Visual Scene Descriptors together with an interpreted description of the User State.
The PGM‑CXT AIM supports runtime reorientation under CXT Directives issued by A‑User Control, which may reflect Human Commands. A single CXT Directive carries a SessionID and CaptureIndex identifying the capture’s position within the current session, together with modality prioritisation and acquisition parameters for the capture stage and scope, depth, and policy constraints for the enhancement stage. The Audio Scene Descriptors (ASD) and Visual Scene Descriptors (VSD) derived internally within the Audio and Visual Scene Enhancement SubAIMs are not exposed as AIM outputs; the enhanced result is consolidated into a single set of Context Descriptors.
All aggregate exchanges with the surrounding AIMs cross the two boundary SubAIMs. On the input side the CXT‑AUS Interface demultiplexes the single CXT Directive, Domain Response, and Interaction History Response into their per‑modality (Audio, Visual, User) components and routes the Audio and Visual Objects to the corresponding enhancement SubAIMs. On the output side Audio‑Visual‑User Multiplexing recombines the per‑modality results into the single Context Descriptors, Domain Request, Interaction History Request, and CXT Status. Domain knowledge is obtained from the Domain Access AIM (Request out, Response in) and Interaction History is exchanged with A‑User Storage (Request out, Response in) through this same boundary pair.
| Receives | Audio Object | Audio signals from the scene including speech and environmental sounds. |
| Visual Object | Visual signals from the scene. | |
| CXT Directive | Control instructions from A‑User Control specifying modality prioritisation, acquisition parameters, framing rules, session identification, enhancement scope/depth/policy, domain policy, and A‑User Storage access instructions. | |
| Domain Response | Domain‑specific knowledge received from Domain Access. | |
| CXT IH Response | Prior session content read from A‑User Storage (prior descriptors, User State). | |
| Produces | Context Descriptors | Aggregated result combining Enhanced Audio and Visual Scene Descriptors and User State. |
| Domain Request | Request for domain‑specific knowledge sent to Domain Access. | |
| CXT IH Request | Request to A‑User Storage to read prior and write produced session content. | |
| CXT Status | Scene‑level metadata describing capture and enhancement outcomes, per-modality results, A‑User Storage and Domain operation outcomes, and confidence measures. |
2 Reference Model
Figure 1 depicts the Reference Model of the Context Description (PGM‑CXT) AIM.

Figure 1 – The Context Description (PGM‑CXT) AIM
3 I/O Data
Table 1 specifies the Input and Output Data of the Context Description (PGM‑CXT) AIM.
| Input | Description |
|---|---|
| Audio Object | Captured audio signals from the scene, covering speech, environmental sounds, and paralinguistic cues. |
| Visual Object | Visual signals from the scene, encompassing gestures, facial expressions, and environmental imagery. |
| CXT Directive | Control instructions from A‑User Control covering both capture (modality prioritisation, acquisition parameters, framing rules, session identification) and enhancement (scope, depth, policy constraints), together with domain policy and A‑User Storage access instructions. |
| Domain Response | Domain‑specific knowledge received from Domain Access. |
| CXT IH Response | Prior session content read from A‑User Storage (prior descriptors and User State). |
| Output | Description |
| Context Descriptors | Aggregated result combining Enhanced Audio Scene Descriptors, Enhanced Visual Scene Descriptors, and User State. |
| Domain Request | Request for domain‑specific knowledge sent to Domain Access. |
| CXT IH Request | Request to A‑User Storage to read prior and write produced session content. |
| CXT Status | Scene‑level metadata describing capture and enhancement outcomes, per-modality results, A‑User Storage and Domain Access operation outcomes, and confidence measures. |
Note – The Audio Scene Descriptors (ASD) and Visual Scene Descriptors (VSD) derived internally within the Audio and Visual Scene Enhancement SubAIMs are internal to the PGM‑CXT AIM and are not exposed as Input or Output Data.
4 SubAIMs (informative)
This section is informative. The decomposition into SubAIMs described below illustrates one conformant architecture for producing the normative outputs of PGM‑CXT. Implementations may adopt alternative internal structures provided they satisfy the conformance requirements of Section 8.
4.1 Reference Model
Figure 2 gives the Reference Model of the Context Description (PGM‑CXT) Composite AI Module. Two boundary SubAIMs isolate the composite from its surroundings. The Context Capture Demultiplexing demultiplexes each aggregate input — the CXT Directive, the Domain Response, and the CXT Interaction History Response — into its Audio, Visual, and User components, and routes the Audio and Visual Objects to the enhancement SubAIMs. The Context Capture Demultiplexing SubAIM performs the inverse, recombining the per‑modality results into the aggregate Context Descriptors, Domain Request, CXT Interaction History Request, and CXT Status. The central SubAIMs — Audio and Visual Scene Enhancement, Audio‑Visual Alignment, and User State Description — exchange data only with the two boundary SubAIMs and with each other; they never address A‑User Storage, Domain Access, or A‑User Control directly.

Figure 2 – Reference Model of the Context Description (PGM‑CXT) Composite AI Module
4.2 Operation
The Context Description AIM is activated by a CXT Directive issued by A‑User Control. The Context Description operation is carried out with the following steps:
- Reception and demultiplexing: the Context Capture Demultiplexing receives the CXT Directive, the Domain Response, and the CXT Interaction History Response (read from A‑User Storage), together with the Audio and Visual Objects, and demultiplexes them into the Audio, Visual, and User CXT Directives, Domain Responses, and CXT Interaction History Responses, routing each Object to its enhancement SubAIM.
- Capture and enhancement: the Audio Scene Enhancement (ASE) and Visual Scene Enhancement (VSE) SubAIMs process the Audio Objects and Visual Objects in parallel under their modality‑specific CXT Directives, internally deriving the Audio Scene Descriptors (ASD) and Visual Scene Descriptors (VSD) and producing the Enhanced Audio Scene Descriptors and Enhanced Visual Scene Descriptors, each with its CXT Status and, where required, an CXT Interaction History Request and a Domain Request.
- Audio‑Visual Alignment, producing the Audio‑Visual Scene Geometry from the Enhanced Audio Scene Descriptors and Enhanced Visual Scene Descriptors.
- Production of the User State by User State Description from the Enhanced Scene Descriptors, the Audio‑Visual Scene Geometry, and the User‑side directive, Domain, and CXT Interaction History Responses.
- Multiplexing: the Audio‑Visual‑User Multiplexing SubAIM recombines the Enhanced Audio Scene Descriptors, Enhanced Visual Scene Descriptors and the User State into the Context Descriptors (delivered to PRC), the per‑modality CXT Statuses into the composite CXT Status, the per‑modality Domain Requests into the Domain Request (to Domain Access), and the per‑modality CXT Interaction History Requests into the Interaction History Request (to A‑User Storage).
The reference model explicitly separates capture, modal enhancement, cross‑modal alignment, and user/entity interpretation, ensuring modularity, traceability, and reuse.
4.3 Functions of SubAIMs
Table 2 gives the functions of the Context Description (PGM‑CXT) SubAIMs.
| SubAIM | Function |
|---|---|
| Context Capture Demultiplexing | Demultiplexes the CXT Directive, Domain Response, and Interaction History Response into their Audio, Visual, and User components and routes the Audio and Visual Objects to the enhancement SubAIMs. |
| Audio Scene Enhancement | Derives the Audio Scene Descriptors from the Audio Object and enhances them, producing the Enhanced Audio Scene Descriptors. |
| Visual Scene Enhancement | Derives the Visual Scene Descriptors from the Visual Object and enhances them, producing the Enhanced Visual Scene Descriptors. |
| Audio‑Visual Alignment | Cross‑modal association between Audio Objects and Visual Objects referring to the same source or entity. Production of Audio‑Visual Scene Geometry expressing correspondence and spatial relations. |
| User State Description | Interpretation of enhanced descriptors and alignment evidence with respect to the User or other entities. Derivation of User‑centric evidence and state descriptions under the control of directives (User State). |
| Context Capture Multiplexing | Multiplexes the per‑modality results into the aggregate Context Descriptors (delivered to PRC), Domain Request, Interaction History Request, and CXT Status. |
4.4 I/O Data of SubAIMs
Table 3 gives the Input and Output Data of the Context Description (PGM‑CXT) SubAIMs.
4.5 AIMs and JSON Metadata
Table 4 provides the links to the AIM specifications and JSON schemas. AIM1 indicates the Composite AIM and AIM2 its SubAIMs.
| AIM1 | AIM2 | Name | JSON |
|---|---|---|---|
| PGM‑CXT | Context Description | X | |
| PGM‑CAI | Context Capture Demultiplexing | X | |
| PGM‑ASE | Audio Scene Enhancement | X | |
| PGM‑VSE | Visual Scene Enhancement | X | |
| OSD‑AVA | Audio‑Visual Alignment | X | |
| PGM‑USD | User State Description | X | |
| PGM‑AVU | Context Capture Multiplexing | X |
5 JSON Metadata
https://schemas.mpai.community/PGM1/V1.0/AIMs/ContextDescription.json
6 Profiles
No Profiles.
7 Reference Software
Not part of this specification.
8 Conformance Testing
Table 5 provides the Conformance Testing Method for the Context Description (PGM‑CXT) Composite AIM. Conformance Testing of the individual SubAIMs is given by the individual AIM specifications.
If a schema contains references to other schemas, conformance of data for the primary schema implies that any data referencing a secondary schema shall also validate against the relevant schema, if present, and conform with the Qualifier, if present.
| Receives | Audio Object | Shall validate against Audio Object schema. Audio Data shall conform with Audio Qualifier. |
| Visual Object | Shall validate against Visual Object schema. Visual Data shall conform with Visual Qualifier. | |
| CXT Directive | Shall validate against CXT Directive schema. | |
| Domain Response | Shall validate against Domain Response schema. | |
| CXT Interaction History Response | Shall validate against Interaction History schema. | |
| Produces | Context Descriptors | Shall validate against Context Descriptors schema. |
| Domain Request | Shall validate against Domain Request schema. | |
| CXT Interaction History Request | Shall validate against Interaction History schema. | |
| CXT Status | Shall validate against CXT Status schema. |
9 Performance Assessment
Not part of this specification.