PGM-AUA V1.0 AIMs - Context Description

Function
Ref. Model
I/O Data
SubAIMs
JSON MData
Profiles
Ref. Software
Conformance
Performance

1 Functions

The Context Description (PGM‑CXT) AIM is the A‑User’s perceptual front end to the spatial environment, integrating in a single Composite AIM the active capture of a multimodal scene and its interpretative enrichment. It receives raw Audio and Visual Objects, captures and structures the scene without LLM involvement and without MCP interactions, and then applies modality‑specific analysis, cross‑modal alignment, and optional domain knowledge to produce Enhanced Audio Scene Descriptors and Enhanced Visual Scene Descriptors together with an interpreted description of the User State.

The PGM‑CXT AIM supports runtime reorientation under CXT Directives issued by A‑User Control, which may reflect Human Commands. A single CXT Directive carries a SessionID and CaptureIndex identifying the capture’s spatial position within the current session. Also included are modality prioritisation and acquisition parameters for the scope of the capture, along with policy constraints for the enhancement stage. The Enhanced Audio Scene Descriptors (ASE) and Enhanced Visual Scene Descriptors (VSE) derived internally within the Audio and Visual Scene Enhancement SubAIMs are not exposed as AIM outputs; rather, the enhanced result is consolidated into a single set of Context Descriptors.

On the input side, the Context Description Demultiplexing AIM demultiplexes the three main inputs (CXT Directive, Domain Response, and Interaction History Response) into their modality components (Audio, Visual, User) and routes the Audio and Visual Objects to the corresponding enhancement SubAIMs. On the output side, Context Description Multiplexing recombines the results into Context Descriptors, Domain Request, Interaction History Request, and CXT Status. Domain knowledge is obtained from the Domain Access AIM (Request out, Response in) and Interaction History is exchanged with A‑User Storage (Request out, Response in).

Receives	Audio Object	Audio signals from the scene including speech and environmental sounds.
	Visual Object	Visual signals from the scene.
	CXT Directive	Control instructions from A‑User Control specifying modality prioritisation, acquisition parameters, framing rules, session identification, enhancement scope/depth/policy, domain policy, and A‑User Storage access instructions.
	Domain Response	Domain‑specific knowledge received from Domain Access.
	CXT IH Response	Prior session content read from A‑User Storage (prior descriptors, User State).
Produces	Context Descriptors	Aggregated result combining Enhanced Audio and Visual Scene Descriptors and User State.
	Domain Request	Request for domain‑specific knowledge sent to Domain Access.
	CXT IH Request	Request to A‑User Storage to read prior and write produced session content.
	CXT Status	Scene‑level metadata describing capture and enhancement outcomes, per-modality results, A‑User Storage and Domain operation outcomes, and confidence measures.

2 Reference Model

Figure 1 depicts the Reference Model of the Context Description (PGM‑CXT) AIM.

Figure 1 – The Context Description (PGM‑CXT) AIM

3 I/O Data

Table 1 specifies the Input and Output Data of the Context Description (PGM‑CXT) AIM.

Table 1 – I/O Data of the Context Description (PGM‑CXT) AIM

Input	Description
Audio Object	Audio signals captured from the scene, including speech, environmental sounds, and paralinguistic cues.
Visual Object	Visual signals from the scene, including gestures, facial expressions, and environmental imagery.
CXT Directive	Control instructions from A‑User Control including both capture (modality prioritisation, acquisition parameters, framing rules, session identification) and enhancement (scope, depth, policy constraints), together with domain policy and A‑User Storage access instructions.
Domain Response	Domain‑specific knowledge received from Domain Access.
CXT IH Response	Prior session content obtained from A‑User Storage (prior descriptors and User State).
Output	Description
Context Descriptors	Aggregated result combining Enhanced Audio Scene Descriptors, Enhanced Visual Scene Descriptors, and User State.
Domain Request	Request for domain‑specific knowledge that was previously sent to Domain Access.
CXT IH Request	Request to A‑User Storage to read and write content produced during the current or previous sessions.
CXT Status	Scene‑level metadata describing capture and enhancement outcomes.

Note – The Audio Scene Descriptors (ASD) and Visual Scene Descriptors (VSD) derived internally within the Audio and Visual Scene Enhancement SubAIMs are internal to the PGM‑CXT AIM. They are not exposed as Input or Output Data.

4 SubAIMs (informative)

This section is informative. The decomposition into SubAIMs described below illustrates one conformant architecture for producing the normative outputs of PGM‑CXT. Implementations may adopt alternative internal structures provided they satisfy the conformance requirements of Section 8.

4.1 Reference Model

Figure 2 gives the Reference Model of the Context Description (PGM‑CXT) Composite AI Module. Two boundary SubAIMs isolate the composite from its surroundings. The Context Description Demultiplexing demultiplexes each aggregate input — the CXT Directive, the Domain Response, and the CXT Interaction History Response — into its Audio, Visual, and User components, and routes the Audio and Visual Objects to the enhancement SubAIMs. The Context Description Multiplexing SubAIM performs the inverse, recombining the per‑modality results into the aggregate Context Descriptors, Domain Request, CXT Interaction History Request, and CXT Status. The central SubAIMs — Audio and Visual Scene Enhancement, Audio‑Visual Alignment, and User State Description — exchange data with the two boundary SubAIMs, with each other, and A‑User Storage, Domain Access, or A‑User Control.

Figure 2 – Reference Model of the Context Description (PGM‑CXT) Composite AI Module

4.2 Operation

The Context Description AIM is activated by a CXT Directive issued by A‑User Control. The Context Description operation is carried out with the following steps:

Reception and de-multiplexing: the Context Description Demultiplexing
1. Receives multiplexed
  1. CXT Directive,
  2. Domain Response,CXT Interaction History Response (read from A‑User Storage),
  3. Audio Object and Visual Object,
2. Demultiplexes them into the
  1. Audio, Visual, and User CXT Directives,
  2. Domain Responses
  3. CXT Interaction History Responses,
3. Routes the appropriate demultiplexed data to the appropriate enhancement SubAIM.
Enhancement: the Audio Scene Enhancement (PGM-ASE) and Visual Scene Enhancement (PGM-VSE) SubAIMs
1. Process the Audio Objects and Visual Objects in parallel under their modality‑specific CXT Directives,
2. Produce the
  1. Enhanced Audio Scene Descriptors with their CXT Status
  2. Enhanced Visual Scene Descriptors CXT Status
  3. CXT Interaction History Request
  4. Domain Request.
Alignment: Audio‑Visual Alignment, produces the Audio‑Visual Scene Geometry from the Enhanced Audio Scene Descriptors and Enhanced Visual Scene Descriptors.
Production of the User State by User State Description from the
1. Enhanced Scene Descriptors,
2. Audio‑Visual Scene Geometry produced by Audio‑Visual Alignement
3. the User‑side CXT Directives, Domain, and CXT Interaction History Responses.
Multiplexing of the Audio‑Visual‑User Multiplexing SubAIM by recombining
1. Context Descriptors (delivered to Prompt Creation) from
  1. Enhanced Audio Scene Descriptors,
  2. Enhanced Visual Scene Descriptors
  3. User State ,
2. Composite CXT Status from per‑modality CXT Statuses,
3. Domain Request (to Domain Access) from per‑modality Domain Requests,
4. Interaction History Request (to A‑User Storage) from the per‑modality CXT Interaction History Requests.

The reference model explicitly separates capture, modal enhancement, cross‑modal alignment, and user/entity interpretation by using specific Sub-AIMs thus ensuring modularity, traceability, and reuse.

4.3 Functions of SubAIMs

Table 2 gives the functions of the Context Description (PGM‑CXT) SubAIMs.

Table 2 – Functions of the Context Description (PGM‑CXT) SubAIMs

SubAIM	Function
Context Description Demultiplexing	Demultiplexes the CXT Directive, Domain Response, and Interaction History Response into their Audio, Visual, and User components and routes the Audio and Visual Objects to the enhancement SubAIMs.
Audio Scene Enhancement	Derives the Audio Scene Descriptors from the Audio Object and enhances the Descriptors, producing the Enhanced Audio Scene Descriptors.
Visual Scene Enhancement	Derives the Visual Scene Descriptors from the Visual Object and enhances the Descriptors, producing the Enhanced Visual Scene Descriptors.
Audio‑Visual Alignment	Performs a cross‑modal association between Audio Objects and Visual Objects referring to the same source or entity. The output Audio‑Visual Scene Geometry expresses correspondence and spatial relations.
User State Description	Interprets enhanced descriptors and alignment evidence with respect to the User or other entities and derives User‑centric evidence and state descriptions under the control of CXT Directives producing User State.
Context Description Multiplexing	Multiplexes the per‑modality results into the aggregated Context Descriptors (delivered to Prompt Creation), Domain Request, Interaction History Request, and CXT Status.

4.4 I/O Data of SubAIMs

Table 3 gives the Input and Output Data of the Context Description (PGM‑CXT) SubAIMs.

Table 3 – I/O Data of the Context Description (PGM‑CXT) SubAIMs

SubAIM	Input	Output
Context Description Demultiplexing	CXT Directive Domain Response CXT IH Response Audio Object Visual Object	Audio Object Visual Object Audio CXT Directive Visual CXT Directive User CXT Directive Audio Domain Response Visual Domain Response User Domain Response Audio CXT IH Response Visual CXT IH Response, User CXT IH Response
Audio Scene Enhancement	Audio Object Audio CXT Directive Audio CXT IH Response Audio Domain Response	Enhanced Audio Scene Descriptors Audio CXT Status Audio CXT IH Request Audio Domain Request
Visual Scene Enhancement	Visual Object Visual CXT Directive Visual CHT IH Response Visual Domain Response	Enhanced Visual Scene Descriptors Visual CXT Status Visual CXT IH Request Visual Domain Request
Audio‑Visual Alignment	Enhanced Audio Scene Descriptors Enhanced Visual Scene Descriptors	Audio‑Visual Scene Geometry
User State Description	Enhanced Audio Scene Descriptors Enhanced Visual Scene Descriptors Audio‑Visual Scene Geometry User CXT Directive User CXT IH Response User Domain Response	User State User CXT Status User CXT IH Request User Domain Request Enhanced Audio Scene Descriptors Enhanced Visual Scene Descriptors
Context Description Multiplexing	Enhanced Audio Scene Descriptors Audio CXT Status Audio CXT IH Request Audio Domain Request Enhanced Visual Scene Descriptors Visual CXT Status Visual CXT IH Request Visual Domain Request User State User CXT Status User CXT IH Request User Domain Request	Context Descriptors Domain Request CXT Interaction History Request CXT Status

4.5 AIMs and JSON Metadata

Table 4 provides the links to the AIM specifications and JSON schemas. AIM1 indicates the Composite AIM and AIM2 its SubAIMs.

Table 4 – AIMs and JSON Metadata of the Context Description (PGM‑CXT)

AIM1	AIM2	Name	JSON
PGM‑CXT		Context Description	X
	PGM‑CAI	Context Description Demultiplexing	X
	PGM‑ASE	Audio Scene Enhancement	X
	PGM‑VSE	Visual Scene Enhancement	X
	OSD‑AVA	Audio‑Visual Alignment	X
	PGM‑USD	User State Description	X
	PGM‑AVU	Context Description Multiplexing	X

5 JSON Metadata

https://schemas.mpai.community/PGM1/V1.0/AIMs/ContextDescription.json

6 Profiles

No Profiles.

7 Reference Software

Not part of this specification.

8 Conformance Testing

Table 5 provides the Conformance Testing Method for the Context Description (PGM‑CXT) Composite AIM. Conformance Testing of the individual SubAIMs is given by the individual AIM specifications.

If a schema contains references to other schemas, conformance of data for the primary schema implies that any data referencing a secondary schema shall also validate against the relevant schema, if present, and conform with the Qualifier, if present.

Table 5 – Conformance Testing Method for the Context Description (PGM‑CXT) Composite AIM

Receives	Audio Object	Shall validate against Audio Object schema. Audio Data shall conform with Audio Qualifier.
	Visual Object	Shall validate against Visual Object schema. Visual Data shall conform with Visual Qualifier.
	CXT Directive	Shall validate against CXT Directive schema.
	Domain Response	Shall validate against Domain Response schema.
	CXT Interaction History Response	Shall validate against Interaction History schema.
Produces	Context Descriptors	Shall validate against Context Descriptors schema.
	Domain Request	Shall validate against Domain Request schema.
	CXT Interaction History Request	Shall validate against Interaction History schema.
	CXT Status	Shall validate against CXT Status schema.

9 Performance Assessment