Go to PGM-AUA V1.0 AI Modules

1 Functions

The Context Description (PGM‑CXT) AIM is the A‑User’s perceptual front end to the spatial environment, integrating in a single Composite AIM the active capture of a multimodal scene and its interpretative enrichment. It receives raw Audio and Visual Objects, captures and structures the scene without LLM involvement and without MCP interactions, and then applies modality‑specific analysis, cross‑modal alignment, and optional domain knowledge to produce Enhanced Audio Scene Descriptors and Enhanced Visual Scene Descriptors together with an interpreted description of the User State.

The PGM‑CXT AIM supports runtime reorientation under CXT Directives issued by A‑User Control, which may reflect Human Commands. A single CXT Directive carries a SessionID and CaptureIndex identifying the capture’s position within the current session, together with modality prioritisation and acquisition parameters for the capture stage and scope, depth, and policy constraints for the enhancement stage. The Audio Scene Descriptors (ASD) and Visual Scene Descriptors (VSD) derived internally within the Audio and Visual Scene Enhancement SubAIMs are not exposed as AIM outputs; the enhanced result is consolidated into a single set of Context Descriptors.
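As a rough illustration of the shape of a CXT Directive, the sketch below models it in Python. Only SessionID and CaptureIndex are named in the text above; every other field name (`modality_priority`, `acquisition`, `scope`, `depth`, `policy`), the default values, and the non-negativity check are hypothetical placeholders. The normative structure is the one defined by the CXT Directive JSON schema.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass(frozen=True)
class CaptureParameters:
    """Capture-stage part of the directive (field names are assumptions)."""
    modality_priority: tuple[str, ...] = ("Audio", "Visual")
    acquisition: dict[str, Any] = field(default_factory=dict)


@dataclass(frozen=True)
class EnhancementParameters:
    """Enhancement-stage part of the directive (field names are assumptions)."""
    scope: str = "scene"
    depth: int = 1
    policy: dict[str, Any] = field(default_factory=dict)


@dataclass(frozen=True)
class CXTDirective:
    """One directive covering both the capture and the enhancement stage."""
    session_id: str     # SessionID in the text
    capture_index: int  # CaptureIndex: position of the capture in the session
    capture: CaptureParameters = field(default_factory=CaptureParameters)
    enhancement: EnhancementParameters = field(default_factory=EnhancementParameters)

    def __post_init__(self) -> None:
        # Assumption: capture positions are counted from zero upwards.
        if self.capture_index < 0:
            raise ValueError("capture_index must be non-negative")
```

A directive for the first capture of a session could then be written as `CXTDirective(session_id="s1", capture_index=0)`, with both stages falling back to their placeholder defaults.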

All aggregate exchanges with the surrounding AIMs cross the two boundary SubAIMs. On the input side, the Context Capture Demultiplexing SubAIM demultiplexes the single CXT Directive, Domain Response, and CXT Interaction History Response into their per‑modality (Audio, Visual, User) components and routes the Audio and Visual Objects to the corresponding enhancement SubAIMs. On the output side, the Context Capture Multiplexing SubAIM recombines the per‑modality results into the single Context Descriptors, Domain Request, CXT Interaction History Request, and CXT Status. Domain knowledge is obtained from the Domain Access AIM (Request out, Response in) and Interaction History is exchanged with A‑User Storage (Request out, Response in) through this same boundary pair.
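The boundary behaviour described above can be sketched as a pair of inverse functions. This is a minimal illustration, not the specified data format: it assumes that an aggregate message keeps its modality-specific content under the hypothetical keys "Audio", "Visual", and "User", and that all other top-level keys (for example SessionID) are shared by every component and hold hashable values.

```python
from typing import Any

MODALITIES = ("Audio", "Visual", "User")


def demultiplex(aggregate: dict[str, Any]) -> dict[str, dict[str, Any]]:
    """Split one aggregate message into Audio, Visual, and User components.

    Shared top-level keys are copied into every component (an assumption).
    """
    shared = {k: v for k, v in aggregate.items() if k not in MODALITIES}
    return {m: {**shared, **aggregate.get(m, {})} for m in MODALITIES}


def multiplex(
    parts: dict[str, dict[str, Any]], shared_keys: tuple[str, ...]
) -> dict[str, Any]:
    """Recombine per-modality results into a single aggregate message."""
    aggregate: dict[str, Any] = {}
    for key in shared_keys:
        values = {parts[m][key] for m in MODALITIES if key in parts.get(m, {})}
        if len(values) > 1:
            raise ValueError(f"modalities disagree on shared key {key!r}")
        if values:
            aggregate[key] = values.pop()
    for m in MODALITIES:
        aggregate[m] = {
            k: v for k, v in parts.get(m, {}).items() if k not in shared_keys
        }
    return aggregate
```

Under these assumptions, multiplexing the output of `demultiplex` with the same shared keys returns the original aggregate, which mirrors the "inverse" relationship between the two boundary SubAIMs.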

| Receives | Audio Object | Audio signals from the scene, including speech and environmental sounds. |
| | Visual Object | Visual signals from the scene. |
| | CXT Directive | Control instructions from A‑User Control specifying modality prioritisation, acquisition parameters, framing rules, session identification, enhancement scope/depth/policy, domain policy, and A‑User Storage access instructions. |
| | Domain Response | Domain‑specific knowledge received from Domain Access. |
| | CXT IH Response | Prior session content read from A‑User Storage (prior descriptors, User State). |
| Produces | Context Descriptors | Aggregated result combining Enhanced Audio and Visual Scene Descriptors and User State. |
| | Domain Request | Request for domain‑specific knowledge sent to Domain Access. |
| | CXT IH Request | Request to A‑User Storage to read prior and write produced session content. |
| | CXT Status | Scene‑level metadata describing capture and enhancement outcomes, per‑modality results, A‑User Storage and Domain Access operation outcomes, and confidence measures. |

2 Reference Model

Figure 1 depicts the Reference Model of the Context Description (PGM‑CXT) AIM.


Figure 1 – The Context Description (PGM‑CXT) AIM

3 I/O Data

Table 1 specifies the Input and Output Data of the Context Description (PGM‑CXT) AIM.

Table 1 – I/O Data of the Context Description (PGM‑CXT) AIM

| Input | Description |
| Audio Object | Captured audio signals from the scene, covering speech, environmental sounds, and paralinguistic cues. |
| Visual Object | Visual signals from the scene, encompassing gestures, facial expressions, and environmental imagery. |
| CXT Directive | Control instructions from A‑User Control covering both capture (modality prioritisation, acquisition parameters, framing rules, session identification) and enhancement (scope, depth, policy constraints), together with domain policy and A‑User Storage access instructions. |
| Domain Response | Domain‑specific knowledge received from Domain Access. |
| CXT IH Response | Prior session content read from A‑User Storage (prior descriptors and User State). |
| Output | Description |
| Context Descriptors | Aggregated result combining Enhanced Audio Scene Descriptors, Enhanced Visual Scene Descriptors, and User State. |
| Domain Request | Request for domain‑specific knowledge sent to Domain Access. |
| CXT IH Request | Request to A‑User Storage to read prior and write produced session content. |
| CXT Status | Scene‑level metadata describing capture and enhancement outcomes, per‑modality results, A‑User Storage and Domain Access operation outcomes, and confidence measures. |

Note – The Audio Scene Descriptors (ASD) and Visual Scene Descriptors (VSD) derived internally within the Audio and Visual Scene Enhancement SubAIMs are internal to the PGM‑CXT AIM and are not exposed as Input or Output Data.

4 SubAIMs (informative)

This section is informative. The decomposition into SubAIMs described below illustrates one conformant architecture for producing the normative outputs of PGM‑CXT. Implementations may adopt alternative internal structures provided they satisfy the conformance requirements of Section 8.

4.1 Reference Model

Figure 2 gives the Reference Model of the Context Description (PGM‑CXT) Composite AI Module. Two boundary SubAIMs isolate the composite from its surroundings. The Context Capture Demultiplexing SubAIM demultiplexes each aggregate input (the CXT Directive, the Domain Response, and the CXT Interaction History Response) into its Audio, Visual, and User components, and routes the Audio and Visual Objects to the enhancement SubAIMs. The Context Capture Multiplexing SubAIM performs the inverse, recombining the per‑modality results into the aggregate Context Descriptors, Domain Request, CXT Interaction History Request, and CXT Status. The central SubAIMs (Audio and Visual Scene Enhancement, Audio‑Visual Alignment, and User State Description) exchange data only with the two boundary SubAIMs and with each other; they never address A‑User Storage, Domain Access, or A‑User Control directly.

 

Figure 2 – Reference Model of the Context Description (PGM‑CXT) Composite AI Module

4.2 Operation

The Context Description AIM is activated by a CXT Directive issued by A‑User Control and operates in the following steps:

  1. Reception and demultiplexing: the Context Capture Demultiplexing SubAIM receives the CXT Directive, the Domain Response, and the CXT Interaction History Response (read from A‑User Storage), together with the Audio and Visual Objects. It demultiplexes the three aggregate inputs into their Audio, Visual, and User components and routes each Object to its enhancement SubAIM.
  2. Capture and enhancement: the Audio Scene Enhancement (ASE) and Visual Scene Enhancement (VSE) SubAIMs process the Audio Objects and Visual Objects in parallel under their modality‑specific CXT Directives, internally deriving the Audio Scene Descriptors (ASD) and Visual Scene Descriptors (VSD) and producing the Enhanced Audio Scene Descriptors and Enhanced Visual Scene Descriptors, each with its CXT Status and, where required, a CXT Interaction History Request and a Domain Request.
  3. Audio‑Visual alignment: the Audio‑Visual Alignment SubAIM produces the Audio‑Visual Scene Geometry from the Enhanced Audio Scene Descriptors and Enhanced Visual Scene Descriptors.
  4. User State description: the User State Description SubAIM produces the User State from the Enhanced Audio and Visual Scene Descriptors, the Audio‑Visual Scene Geometry, and the User CXT Directive, User Domain Response, and User CXT Interaction History Response.
  5. Multiplexing: the Context Capture Multiplexing SubAIM recombines the Enhanced Audio Scene Descriptors, Enhanced Visual Scene Descriptors, and the User State into the Context Descriptors (delivered to PRC), the per‑modality CXT Statuses into the composite CXT Status, the per‑modality Domain Requests into the Domain Request (to Domain Access), and the per‑modality CXT Interaction History Requests into the CXT Interaction History Request (to A‑User Storage).
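The five steps above can be sketched as a single orchestration function. This is a simplified illustration under stated assumptions: the SubAIM behaviours are passed in as hypothetical callables, the output key names are invented for the example, step 2 runs sequentially although the text allows parallel processing, and the Domain, Interaction History, and CXT Status flows are omitted for brevity.

```python
from typing import Any, Callable

Data = dict[str, Any]


def run_context_description(
    audio_object: Data,
    visual_object: Data,
    directive: Data,
    enhance_audio: Callable[[Data, Data], Data],
    enhance_visual: Callable[[Data, Data], Data],
    align: Callable[[Data, Data], Data],
    describe_user: Callable[[Data, Data, Data, Data], Data],
) -> Data:
    # Step 1: demultiplex the aggregate directive into per-modality parts.
    # The "Audio"/"Visual"/"User" keys are assumptions, not schema fields.
    audio_directive = directive.get("Audio", {})
    visual_directive = directive.get("Visual", {})
    user_directive = directive.get("User", {})

    # Step 2: capture and enhancement per modality (sequential here).
    enhanced_audio = enhance_audio(audio_object, audio_directive)
    enhanced_visual = enhance_visual(visual_object, visual_directive)

    # Step 3: cross-modal alignment yields the Audio-Visual Scene Geometry.
    geometry = align(enhanced_audio, enhanced_visual)

    # Step 4: the User State is derived from descriptors plus geometry.
    user_state = describe_user(
        enhanced_audio, enhanced_visual, geometry, user_directive
    )

    # Step 5: multiplex the results into the Context Descriptors.
    # The geometry is consumed internally and is not part of the output.
    return {
        "EnhancedAudioSceneDescriptors": enhanced_audio,
        "EnhancedVisualSceneDescriptors": enhanced_visual,
        "UserState": user_state,
    }
```

Passing the SubAIMs as callables keeps the sketch independent of any particular implementation, in line with the informative status of this decomposition.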

The reference model explicitly separates capture, modal enhancement, cross‑modal alignment, and user/entity interpretation, ensuring modularity, traceability, and reuse.

4.3 Functions of SubAIMs

Table 2 gives the functions of the Context Description (PGM‑CXT) SubAIMs.

Table 2 – Functions of the Context Description (PGM‑CXT) SubAIMs

| SubAIM | Function |
| Context Capture Demultiplexing | Demultiplexes the CXT Directive, Domain Response, and Interaction History Response into their Audio, Visual, and User components and routes the Audio and Visual Objects to the enhancement SubAIMs. |
| Audio Scene Enhancement | Derives the Audio Scene Descriptors from the Audio Object and enhances them, producing the Enhanced Audio Scene Descriptors. |
| Visual Scene Enhancement | Derives the Visual Scene Descriptors from the Visual Object and enhances them, producing the Enhanced Visual Scene Descriptors. |
| Audio‑Visual Alignment | Cross‑modal association between Audio Objects and Visual Objects referring to the same source or entity. Production of Audio‑Visual Scene Geometry expressing correspondence and spatial relations. |
| User State Description | Interpretation of enhanced descriptors and alignment evidence with respect to the User or other entities. Derivation of User‑centric evidence and state descriptions under the control of directives (User State). |
| Context Capture Multiplexing | Multiplexes the per‑modality results into the aggregate Context Descriptors (delivered to PRC), Domain Request, Interaction History Request, and CXT Status. |

4.4 I/O Data of SubAIMs

Table 3 gives the Input and Output Data of the Context Description (PGM‑CXT) SubAIMs.

Table 3 – I/O Data of the Context Description (PGM‑CXT) SubAIMs

| SubAIM | Input | Output |
| Context Capture Demultiplexing | CXT Directive; Domain Response; CXT IH Response; Audio Object; Visual Object | Audio Object; Visual Object; Audio CXT Directive; Visual CXT Directive; User CXT Directive; Audio Domain Response; Visual Domain Response; User Domain Response; Audio CXT IH Response; Visual CXT IH Response; User CXT IH Response |
| Audio Scene Enhancement | Audio Object; Audio CXT Directive; Audio CXT IH Response; Audio Domain Response | Enhanced Audio Scene Descriptors; Audio CXT Status; Audio CXT IH Request; Audio Domain Request |
| Visual Scene Enhancement | Visual Object; Visual CXT Directive; Visual CXT IH Response; Visual Domain Response | Enhanced Visual Scene Descriptors; Visual CXT Status; Visual CXT IH Request; Visual Domain Request |
| Audio‑Visual Alignment | Enhanced Audio Scene Descriptors; Enhanced Visual Scene Descriptors | Audio‑Visual Scene Geometry |
| User State Description | Enhanced Audio Scene Descriptors; Enhanced Visual Scene Descriptors; Audio‑Visual Scene Geometry; User CXT Directive; User CXT IH Response; User Domain Response | User State; User CXT Status; User CXT IH Request; User Domain Request |
| Context Capture Multiplexing | Enhanced Audio Scene Descriptors; Audio CXT Status; Audio CXT IH Request; Audio Domain Request; Enhanced Visual Scene Descriptors; Visual CXT Status; Visual CXT IH Request; Visual Domain Request; User State; User CXT Status; User CXT IH Request; User Domain Request | Context Descriptors; Domain Request; CXT IH Request; CXT Status |
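One way to sanity-check the wiring in Table 3 is to treat it as a dataflow graph. The sketch below is an illustrative aid rather than part of the specification: it transcribes the inputs and outputs of each SubAIM (item names written uniformly with the "CXT IH" abbreviation and ASCII hyphens), verifies that every input is either an external input of the composite or the output of some SubAIM, and derives an execution order with the standard-library `graphlib` module. The function names are hypothetical.

```python
from graphlib import TopologicalSorter

EASD = "Enhanced Audio Scene Descriptors"
EVSD = "Enhanced Visual Scene Descriptors"
GEOMETRY = "Audio-Visual Scene Geometry"

# (inputs, outputs) per SubAIM, following Table 3.
SUBAIM_IO = {
    "Context Capture Demultiplexing": (
        {"CXT Directive", "Domain Response", "CXT IH Response",
         "Audio Object", "Visual Object"},
        {"Audio Object", "Visual Object",
         "Audio CXT Directive", "Visual CXT Directive", "User CXT Directive",
         "Audio Domain Response", "Visual Domain Response",
         "User Domain Response", "Audio CXT IH Response",
         "Visual CXT IH Response", "User CXT IH Response"},
    ),
    "Audio Scene Enhancement": (
        {"Audio Object", "Audio CXT Directive", "Audio CXT IH Response",
         "Audio Domain Response"},
        {EASD, "Audio CXT Status", "Audio CXT IH Request",
         "Audio Domain Request"},
    ),
    "Visual Scene Enhancement": (
        {"Visual Object", "Visual CXT Directive", "Visual CXT IH Response",
         "Visual Domain Response"},
        {EVSD, "Visual CXT Status", "Visual CXT IH Request",
         "Visual Domain Request"},
    ),
    "Audio-Visual Alignment": ({EASD, EVSD}, {GEOMETRY}),
    "User State Description": (
        {EASD, EVSD, GEOMETRY, "User CXT Directive", "User CXT IH Response",
         "User Domain Response"},
        {"User State", "User CXT Status", "User CXT IH Request",
         "User Domain Request"},
    ),
    "Context Capture Multiplexing": (
        {EASD, "Audio CXT Status", "Audio CXT IH Request",
         "Audio Domain Request", EVSD, "Visual CXT Status",
         "Visual CXT IH Request", "Visual Domain Request", "User State",
         "User CXT Status", "User CXT IH Request", "User Domain Request"},
        {"Context Descriptors", "Domain Request", "CXT IH Request",
         "CXT Status"},
    ),
}

# The composite's external inputs are exactly the demultiplexer's inputs.
EXTERNAL_INPUTS = SUBAIM_IO["Context Capture Demultiplexing"][0]


def unresolved_inputs(io, external):
    """Map each SubAIM to the inputs that nothing produces or supplies."""
    produced = set().union(*(outputs for _, outputs in io.values()))
    gaps = {name: ins - produced - external for name, (ins, _) in io.items()}
    return {name: gap for name, gap in gaps.items() if gap}


def execution_order(io):
    """Order SubAIMs so that each runs after the producers of its inputs."""
    producer = {item: name for name, (_, outs) in io.items() for item in outs}
    graph = {
        name: {producer[i] for i in ins
               if i in producer and producer[i] != name}
        for name, (ins, _) in io.items()
    }
    return list(TopologicalSorter(graph).static_order())
```

With the table as transcribed, `unresolved_inputs(SUBAIM_IO, EXTERNAL_INPUTS)` is empty, and `execution_order(SUBAIM_IO)` begins with Context Capture Demultiplexing and ends with Context Capture Multiplexing, consistent with the steps in Section 4.2.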

4.5 AIMs and JSON Metadata

Table 4 provides the links to the AIM specifications and JSON schemas. AIM1 indicates the Composite AIM and AIM2 its SubAIMs.

Table 4 – AIMs and JSON Metadata of the Context Description (PGM‑CXT)

| AIM1 | AIM2 | Name | JSON |
| PGM‑CXT | | Context Description | X |
| | PGM‑CAI | Context Capture Demultiplexing | X |
| | PGM‑ASE | Audio Scene Enhancement | X |
| | PGM‑VSE | Visual Scene Enhancement | X |
| | OSD‑AVA | Audio‑Visual Alignment | X |
| | PGM‑USD | User State Description | X |
| | PGM‑AVU | Context Capture Multiplexing | X |

5 JSON Metadata

https://schemas.mpai.community/PGM1/V1.0/AIMs/ContextDescription.json

6 Profiles

No Profiles.

7 Reference Software

Not part of this specification.

8 Conformance Testing

Table 5 provides the Conformance Testing Method for the Context Description (PGM‑CXT) Composite AIM. Conformance Testing of the individual SubAIMs is given by the individual AIM specifications.

If a schema contains references to other schemas, conformance of data for the primary schema implies that any data referencing a secondary schema shall also validate against the relevant schema, if present, and conform with the Qualifier, if present.
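The rule on referenced schemas amounts to recursive validation: data that conforms to a primary schema must also conform to every secondary schema it references. The sketch below illustrates that recursion only. The schemas, their field names (such as `Confidence`), and the `validate` function are hypothetical and heavily simplified; the function handles just `required`, `properties`, and `$ref`, which is far less than JSON Schema defines, and Qualifier conformance is not covered.

```python
from typing import Any

# Hypothetical stand-ins for the normative JSON schemas.
SCHEMAS: dict[str, dict[str, Any]] = {
    "ContextDescriptors": {
        "required": ["EnhancedAudioSceneDescriptors", "UserState"],
        "properties": {"UserState": {"$ref": "UserState"}},
    },
    "UserState": {"required": ["Confidence"]},
}


def validate(data: Any, schema_name: str, path: str = "$") -> list[str]:
    """Return a list of error strings; an empty list means the data conforms."""
    schema = SCHEMAS[schema_name]
    if not isinstance(data, dict):
        return [f"{path}: expected an object"]
    errors = [
        f"{path}: missing {key!r}"
        for key in schema.get("required", [])
        if key not in data
    ]
    for key, sub in schema.get("properties", {}).items():
        # "If present": a secondary schema applies only when the data is there.
        if key in data and "$ref" in sub:
            errors += validate(data[key], sub["$ref"], f"{path}.{key}")
    return errors
```

In this sketch an error inside a referenced object is reported against the primary data, so a Context Descriptors instance with an incomplete User State fails validation as a whole.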

Table 5 – Conformance Testing Method for the Context Description (PGM‑CXT) Composite AIM

| Receives | Audio Object | Shall validate against Audio Object schema. Audio Data shall conform with Audio Qualifier. |
| | Visual Object | Shall validate against Visual Object schema. Visual Data shall conform with Visual Qualifier. |
| | CXT Directive | Shall validate against CXT Directive schema. |
| | Domain Response | Shall validate against Domain Response schema. |
| | CXT Interaction History Response | Shall validate against Interaction History schema. |
| Produces | Context Descriptors | Shall validate against Context Descriptors schema. |
| | Domain Request | Shall validate against Domain Request schema. |
| | CXT Interaction History Request | Shall validate against Interaction History schema. |
| | CXT Status | Shall validate against CXT Status schema. |

9 Performance Assessment

Not part of this specification.
