PGM-AUA V1.0 AIMs - Context Description

1 Functions

The Context Description (PGM‑CXT) AIM is the A‑User’s perceptual front end to the spatial environment, integrating in a single Composite AIM the active capture of a multimodal scene and its interpretative enrichment. It receives raw Audio and Visual Objects, captures and structures the scene without LLM involvement and without MCP interactions, and then applies modality‑specific analysis, cross‑modal alignment, and optional domain knowledge to produce Enhanced Audio Scene Descriptors and Enhanced Visual Scene Descriptors together with an interpreted description of the User State.

The PGM‑CXT AIM supports runtime reorientation under CXT Directives issued by A‑User Control, which may reflect Human Commands. A single CXT Directive carries a SessionID and CaptureIndex identifying the capture’s position within the current session, together with modality prioritisation and acquisition parameters for the capture stage and scope, depth, and policy constraints for the enhancement stage. The Audio Scene Descriptors (ASD) and Visual Scene Descriptors (VSD) derived internally within the Audio and Visual Scene Enhancement SubAIMs are not exposed as AIM outputs; the enhanced result is consolidated into a single set of Context Descriptors.
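As an illustration only, the content of a CXT Directive described above can be pictured as a small Python structure. The normative format is the CXT Directive JSON schema, which is not reproduced here. Apart from SessionID and CaptureIndex, every field name and default value below is a hypothetical placeholder, and incrementing CaptureIndex by one per capture is an assumption about how a capture's position in the session is tracked.

```python
from dataclasses import dataclass, field, replace
from typing import Any, Dict, List


@dataclass
class CaptureParams:
    # Hypothetical names for "modality prioritisation and acquisition parameters".
    modality_priority: List[str] = field(default_factory=lambda: ["Audio", "Visual"])
    acquisition: Dict[str, Any] = field(default_factory=dict)


@dataclass
class EnhancementParams:
    # Hypothetical names for "scope, depth, and policy constraints".
    scope: str = "scene"
    depth: int = 1
    policy: Dict[str, Any] = field(default_factory=dict)


@dataclass
class CXTDirective:
    session_id: str      # SessionID
    capture_index: int   # CaptureIndex: position of the capture within the session
    capture: CaptureParams = field(default_factory=CaptureParams)
    enhancement: EnhancementParams = field(default_factory=EnhancementParams)

    def next_capture(self) -> "CXTDirective":
        """Return a directive for the following capture of the same session.

        Assumption: consecutive captures differ by one in CaptureIndex.
        """
        return replace(self, capture_index=self.capture_index + 1)


directive = CXTDirective(session_id="session-001", capture_index=0)
follow_up = directive.next_capture()
```

Keeping the capture-stage and enhancement-stage parameters in separate groups mirrors the two stages that a single directive addresses.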

All aggregate exchanges with the surrounding AIMs cross the two boundary SubAIMs. On the input side, the Context Capture Demultiplexing SubAIM (PGM‑CAI) demultiplexes the aggregate CXT Directive, Domain Response, and CXT Interaction History Response into their per‑modality (Audio, Visual, User) components and routes the Audio and Visual Objects to the corresponding enhancement SubAIMs. On the output side, the Context Capture Multiplexing SubAIM (PGM‑AVU) recombines the per‑modality results into the aggregate Context Descriptors, Domain Request, CXT Interaction History Request, and CXT Status. Domain knowledge is obtained from the Domain Access AIM (Request out, Response in) and Interaction History is exchanged with A‑User Storage (Request out, Response in) through this same boundary pair.
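The boundary behaviour described above can be sketched as a pair of functions. This is a minimal sketch under stated assumptions, not the specified data format: it assumes that an aggregate message is a dictionary in which shared fields sit at the top level and per‑modality fields sit under a key named after the modality. The function names and the `priority` and `outcome` fields are hypothetical.

```python
from typing import Any, Dict

MODALITIES = ("Audio", "Visual", "User")


def demultiplex(aggregate: Dict[str, Any]) -> Dict[str, Dict[str, Any]]:
    """Split one aggregate message into Audio, Visual, and User components.

    Assumption: shared fields (for example SessionID) are copied into every
    component, and modality-specific fields are merged on top of them.
    """
    shared = {k: v for k, v in aggregate.items() if k not in MODALITIES}
    return {m: {**shared, **aggregate.get(m, {})} for m in MODALITIES}


def multiplex(parts: Dict[str, Dict[str, Any]]) -> Dict[str, Any]:
    """Recombine per-modality results into one aggregate message."""
    return {m: parts[m] for m in MODALITIES if m in parts}


directive = {
    "SessionID": "session-001",
    "CaptureIndex": 3,
    "Audio": {"priority": 1},
    "Visual": {"priority": 2},
}
per_modality = demultiplex(directive)
status = multiplex({m: {"outcome": "ok"} for m in MODALITIES})
```

Here `per_modality["User"]` carries only the shared session fields, because the example directive holds no User‑specific entries.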

Receives	Audio Object	Audio signals from the scene including speech and environmental sounds.
	Visual Object	Visual signals from the scene.
	CXT Directive	Control instructions from A‑User Control specifying modality prioritisation, acquisition parameters, framing rules, session identification, enhancement scope/depth/policy, domain policy, and A‑User Storage access instructions.
	Domain Response	Domain‑specific knowledge received from Domain Access.
	CXT IH Response	Prior session content read from A‑User Storage (prior descriptors, User State).
Produces	Context Descriptors	Aggregated result combining Enhanced Audio and Visual Scene Descriptors and User State.
	Domain Request	Request for domain‑specific knowledge sent to Domain Access.
	CXT IH Request	Request to A‑User Storage to read prior and write produced session content.
	CXT Status	Scene‑level metadata describing capture and enhancement outcomes, per-modality results, A‑User Storage and Domain Access operation outcomes, and confidence measures.

2 Reference Model

Figure 1 depicts the Reference Model of the Context Description (PGM‑CXT) AIM.

Figure 1 – The Context Description (PGM‑CXT) AIM

3 I/O Data

Table 1 specifies the Input and Output Data of the Context Description (PGM‑CXT) AIM.

Table 1 – I/O Data of the Context Description (PGM‑CXT) AIM

Input	Description
Audio Object	Captured audio signals from the scene, covering speech, environmental sounds, and paralinguistic cues.
Visual Object	Visual signals from the scene, encompassing gestures, facial expressions, and environmental imagery.
CXT Directive	Control instructions from A‑User Control covering both capture (modality prioritisation, acquisition parameters, framing rules, session identification) and enhancement (scope, depth, policy constraints), together with domain policy and A‑User Storage access instructions.
Domain Response	Domain‑specific knowledge received from Domain Access.
CXT IH Response	Prior session content read from A‑User Storage (prior descriptors and User State).
Output	Description
Context Descriptors	Aggregated result combining Enhanced Audio Scene Descriptors, Enhanced Visual Scene Descriptors, and User State.
Domain Request	Request for domain‑specific knowledge sent to Domain Access.
CXT IH Request	Request to A‑User Storage to read prior and write produced session content.
CXT Status	Scene‑level metadata describing capture and enhancement outcomes, per-modality results, A‑User Storage and Domain Access operation outcomes, and confidence measures.

Note – The Audio Scene Descriptors (ASD) and Visual Scene Descriptors (VSD) derived internally within the Audio and Visual Scene Enhancement SubAIMs are internal to the PGM‑CXT AIM and are not exposed as Input or Output Data.

4 SubAIMs (informative)

This section is informative. The decomposition into SubAIMs described below illustrates one conformant architecture for producing the normative outputs of PGM‑CXT. Implementations may adopt alternative internal structures provided they satisfy the conformance requirements of Section 8.

4.1 Reference Model

Figure 2 gives the Reference Model of the Context Description (PGM‑CXT) Composite AI Module. Two boundary SubAIMs isolate the composite from its surroundings. The Context Capture Demultiplexing SubAIM demultiplexes each aggregate input (the CXT Directive, the Domain Response, and the CXT Interaction History Response) into its Audio, Visual, and User components, and routes the Audio and Visual Objects to the enhancement SubAIMs. The Context Capture Multiplexing SubAIM performs the inverse, recombining the per‑modality results into the aggregate Context Descriptors, Domain Request, CXT Interaction History Request, and CXT Status. The central SubAIMs (Audio Scene Enhancement, Visual Scene Enhancement, Audio‑Visual Alignment, and User State Description) exchange data only with the two boundary SubAIMs and with each other; they never address A‑User Storage, Domain Access, or A‑User Control directly.

Figure 2 – Reference Model of the Context Description (PGM‑CXT) Composite AI Module

4.2 Operation

The Context Description AIM is activated by a CXT Directive issued by A‑User Control. The Context Description operation is carried out with the following steps:

Reception and demultiplexing: the Context Capture Demultiplexing SubAIM receives the CXT Directive, the Domain Response, and the CXT Interaction History Response (read from A‑User Storage), together with the Audio and Visual Objects, and demultiplexes them into the Audio, Visual, and User CXT Directives, Domain Responses, and CXT Interaction History Responses, routing each Object to its enhancement SubAIM.
Capture and enhancement: the Audio Scene Enhancement (ASE) and Visual Scene Enhancement (VSE) SubAIMs process the Audio Objects and Visual Objects in parallel under their modality‑specific CXT Directives, internally deriving the Audio Scene Descriptors (ASD) and Visual Scene Descriptors (VSD) and producing the Enhanced Audio Scene Descriptors and Enhanced Visual Scene Descriptors, each with its CXT Status and, where required, a CXT Interaction History Request and a Domain Request.
Audio‑Visual alignment: the Audio‑Visual Alignment SubAIM produces the Audio‑Visual Scene Geometry from the Enhanced Audio Scene Descriptors and Enhanced Visual Scene Descriptors.
User State production: the User State Description SubAIM produces the User State from the Enhanced Audio and Visual Scene Descriptors, the Audio‑Visual Scene Geometry, and the User CXT Directive, User Domain Response, and User CXT Interaction History Response.
Multiplexing: the Context Capture Multiplexing SubAIM recombines the Enhanced Audio Scene Descriptors, the Enhanced Visual Scene Descriptors, and the User State into the Context Descriptors (delivered to PRC), the per‑modality CXT Statuses into the composite CXT Status, the per‑modality Domain Requests into the Domain Request (to Domain Access), and the per‑modality CXT Interaction History Requests into the CXT Interaction History Request (to A‑User Storage).

The reference model explicitly separates capture, modal enhancement, cross‑modal alignment, and user/entity interpretation, ensuring modularity, traceability, and reuse.
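The order of the steps above can be sketched in Python as follows. This is an illustrative skeleton, not the reference implementation: every function is a hypothetical stand‑in that returns placeholder data, Domain and Interaction History traffic is omitted for brevity, and running the two enhancement stages on a thread pool is only one possible way to realise "in parallel".

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Any, Dict


def enhance(modality: str, obj: Any, directive: Dict[str, Any]) -> Dict[str, Any]:
    """Stand-in for ASE or VSE: derive descriptors internally, expose only enhanced ones."""
    internal = {"source": obj}  # plays the role of ASD or VSD, never returned on its own
    return {
        "descriptors": {"modality": modality, "enhanced": True, **internal},
        "status": {"modality": modality, "outcome": "ok"},
    }


def align(easd: Dict[str, Any], evsd: Dict[str, Any]) -> Dict[str, Any]:
    """Stand-in for Audio-Visual Alignment: pair one audio source with one visual source."""
    return {"pairs": [(easd["source"], evsd["source"])]}


def describe_user_state(easd, evsd, geometry, directive) -> Dict[str, Any]:
    """Stand-in for User State Description."""
    return {
        "state": {"evidence_count": len(geometry["pairs"])},
        "status": {"modality": "User", "outcome": "ok"},
    }


def run_context_description(audio_object: Any, visual_object: Any,
                            directive: Dict[str, Any]) -> Dict[str, Any]:
    # Step 1: demultiplexing (simplified: every modality receives the same directive).
    per_modality = {m: directive for m in ("Audio", "Visual", "User")}
    # Step 2: Audio and Visual Scene Enhancement run in parallel.
    with ThreadPoolExecutor(max_workers=2) as pool:
        audio_job = pool.submit(enhance, "Audio", audio_object, per_modality["Audio"])
        visual_job = pool.submit(enhance, "Visual", visual_object, per_modality["Visual"])
        audio, visual = audio_job.result(), visual_job.result()
    # Step 3: cross-modal alignment.
    geometry = align(audio["descriptors"], visual["descriptors"])
    # Step 4: User State production.
    user = describe_user_state(audio["descriptors"], visual["descriptors"],
                               geometry, per_modality["User"])
    # Step 5: multiplexing into the aggregate outputs.
    return {
        "ContextDescriptors": {
            "EnhancedAudioSceneDescriptors": audio["descriptors"],
            "EnhancedVisualSceneDescriptors": visual["descriptors"],
            "UserState": user["state"],
        },
        "CXTStatus": [audio["status"], visual["status"], user["status"]],
    }


result = run_context_description(
    "audio-object", "visual-object", {"SessionID": "session-001", "CaptureIndex": 0}
)
```

In this sketch the Audio‑Visual Scene Geometry is consumed by the User State step and does not appear in the aggregate output, matching the multiplexing step above.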

4.3 Functions of SubAIMs

Table 2 gives the functions of the Context Description (PGM‑CXT) SubAIMs.

Table 2 – Functions of the Context Description (PGM‑CXT) SubAIMs

SubAIM	Function
Context Capture Demultiplexing	Demultiplexes the CXT Directive, Domain Response, and Interaction History Response into their Audio, Visual, and User components and routes the Audio and Visual Objects to the enhancement SubAIMs.
Audio Scene Enhancement	Derives the Audio Scene Descriptors from the Audio Object and enhances them, producing the Enhanced Audio Scene Descriptors.
Visual Scene Enhancement	Derives the Visual Scene Descriptors from the Visual Object and enhances them, producing the Enhanced Visual Scene Descriptors.
Audio‑Visual Alignment	Cross‑modal association between Audio Objects and Visual Objects referring to the same source or entity. Production of Audio‑Visual Scene Geometry expressing correspondence and spatial relations.
User State Description	Interpretation of enhanced descriptors and alignment evidence with respect to the User or other entities. Derivation of User‑centric evidence and state descriptions under the control of directives (User State).
Context Capture Multiplexing	Multiplexes the per‑modality results into the aggregate Context Descriptors (delivered to PRC), Domain Request, Interaction History Request, and CXT Status.

4.4 I/O Data of SubAIMs

Table 3 gives the Input and Output Data of the Context Description (PGM‑CXT) SubAIMs.

Table 3 – I/O Data of the Context Description (PGM‑CXT) SubAIMs

SubAIM	Input	Output
Context Capture Demultiplexing	CXT Directive, Domain Response, CXT IH Response, Audio Object, Visual Object	Audio Object, Visual Object, Audio CXT Directive, Visual CXT Directive, User CXT Directive, Audio Domain Response, Visual Domain Response, User Domain Response, Audio CXT IH Response, Visual CXT IH Response, User CXT IH Response
Audio Scene Enhancement	Audio Object, Audio CXT Directive, Audio CXT IH Response, Audio Domain Response	Enhanced Audio Scene Descriptors, Audio CXT Status, Audio CXT IH Request, Audio Domain Request
Visual Scene Enhancement	Visual Object, Visual CXT Directive, Visual CXT IH Response, Visual Domain Response	Enhanced Visual Scene Descriptors, Visual CXT Status, Visual CXT IH Request, Visual Domain Request
Audio‑Visual Alignment	Enhanced Audio Scene Descriptors, Enhanced Visual Scene Descriptors	Audio‑Visual Scene Geometry
User State Description	Enhanced Audio Scene Descriptors, Enhanced Visual Scene Descriptors, Audio‑Visual Scene Geometry, User CXT Directive, User CXT IH Response, User Domain Response	User State, User CXT Status, User CXT IH Request, User Domain Request
Context Capture Multiplexing	Enhanced Audio Scene Descriptors, Audio CXT Status, Audio CXT IH Request, Audio Domain Request, Enhanced Visual Scene Descriptors, Visual CXT Status, Visual CXT IH Request, Visual Domain Request, User State, User CXT Status, User CXT IH Request, User Domain Request	Context Descriptors, Domain Request, CXT Interaction History Request, CXT Status
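One way to sanity‑check the wiring in Table 3 is to transcribe it and verify that every SubAIM input is either an input of the composite AIM or an output of another SubAIM. The sketch below does this with plain Python sets. The item names are transcribed from Table 3 with the Interaction History items written uniformly as "CXT IH"; the helper names are hypothetical, and the check is an informal aid rather than a conformance procedure.

```python
from typing import Dict, Set, Tuple

EASD = "Enhanced Audio Scene Descriptors"
EVSD = "Enhanced Visual Scene Descriptors"
AVSG = "Audio-Visual Scene Geometry"
IN_KINDS = ("CXT Directive", "CXT IH Response", "Domain Response")
OUT_KINDS = ("CXT Status", "CXT IH Request", "Domain Request")
EXTERNAL_INPUTS = {"CXT Directive", "Domain Response", "CXT IH Response",
                   "Audio Object", "Visual Object"}


def items(modality: str, kinds: Tuple[str, ...]) -> Set[str]:
    """Build per-modality item names such as 'Audio CXT Directive'."""
    return {f"{modality} {kind}" for kind in kinds}


# SubAIM name -> (inputs, outputs), transcribed from Table 3.
TABLE_3: Dict[str, Tuple[Set[str], Set[str]]] = {
    "Context Capture Demultiplexing": (
        set(EXTERNAL_INPUTS),
        {"Audio Object", "Visual Object"} | items("Audio", IN_KINDS)
        | items("Visual", IN_KINDS) | items("User", IN_KINDS),
    ),
    "Audio Scene Enhancement": (
        {"Audio Object"} | items("Audio", IN_KINDS), {EASD} | items("Audio", OUT_KINDS)),
    "Visual Scene Enhancement": (
        {"Visual Object"} | items("Visual", IN_KINDS), {EVSD} | items("Visual", OUT_KINDS)),
    "Audio-Visual Alignment": ({EASD, EVSD}, {AVSG}),
    "User State Description": (
        {EASD, EVSD, AVSG} | items("User", IN_KINDS),
        {"User State"} | items("User", OUT_KINDS)),
    "Context Capture Multiplexing": (
        {EASD, EVSD, "User State"} | items("Audio", OUT_KINDS)
        | items("Visual", OUT_KINDS) | items("User", OUT_KINDS),
        {"Context Descriptors", "Domain Request",
         "CXT Interaction History Request", "CXT Status"},
    ),
}


def unresolved_inputs(table, external: Set[str]) -> Dict[str, Set[str]]:
    """Map each SubAIM to the inputs that nothing else supplies."""
    problems = {}
    for name, (inputs, _) in table.items():
        produced = set().union(*(outs for other, (_, outs) in table.items() if other != name))
        missing = inputs - produced - external
        if missing:
            problems[name] = missing
    return problems


def unconsumed_outputs(table) -> Set[str]:
    """Outputs that no SubAIM takes as input, i.e. the outputs of the composite AIM."""
    consumed = set().union(*(inputs for inputs, _ in table.values()))
    produced = set().union(*(outputs for _, outputs in table.values()))
    return produced - consumed


problems = unresolved_inputs(TABLE_3, EXTERNAL_INPUTS)
composite_outputs = unconsumed_outputs(TABLE_3)
```

With the table as transcribed, `problems` is empty and `composite_outputs` contains exactly the four outputs of the Context Capture Multiplexing SubAIM, consistent with the statement that the central SubAIMs exchange data only with the boundary SubAIMs and with each other.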

4.5 AIMs and JSON Metadata

Table 4 provides the links to the AIM specifications and JSON schemas. AIM1 indicates the Composite AIM and AIM2 its SubAIMs.

Table 4 – AIMs and JSON Metadata of the Context Description (PGM‑CXT)

AIM1	AIM2	Name	JSON
PGM‑CXT		Context Description	X
	PGM‑CAI	Context Capture Demultiplexing	X
	PGM‑ASE	Audio Scene Enhancement	X
	PGM‑VSE	Visual Scene Enhancement	X
	OSD‑AVA	Audio‑Visual Alignment	X
	PGM‑USD	User State Description	X
	PGM‑AVU	Context Capture Multiplexing	X

5 JSON Metadata

https://schemas.mpai.community/PGM1/V1.0/AIMs/ContextDescription.json

6 Profiles

No Profiles.

7 Reference Software

Not part of this specification.

8 Conformance Testing

Table 5 provides the Conformance Testing Method for the Context Description (PGM‑CXT) Composite AIM. Conformance Testing of the individual SubAIMs is given by the individual AIM specifications.

If a schema contains references to other schemas, conformance of data for the primary schema implies that any data referencing a secondary schema shall also validate against the relevant schema, if present, and conform with the Qualifier, if present.
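The rule above, that data referencing a secondary schema must also validate against that schema if present, amounts to a recursive check. The following is a toy sketch of that recursion using only the Python standard library. The schema registry format and the field names are hypothetical and far simpler than the MPAI JSON schemas, the Qualifier check is omitted, and a real test harness would use a full JSON Schema validator instead.

```python
from typing import Any, Dict, List

# Toy schema registry (hypothetical): "required" lists mandatory keys and
# "refs" maps a key to the secondary schema its value must satisfy.
SCHEMAS: Dict[str, Dict[str, Any]] = {
    "ContextDescriptors": {"required": ["SessionID"], "refs": {"UserState": "UserState"}},
    "UserState": {"required": ["Confidence"], "refs": {}},
}


def validate(data: Dict[str, Any], schema_name: str, path: str = "") -> List[str]:
    """Return a list of error strings; an empty list means the data conforms."""
    schema = SCHEMAS[schema_name]
    errors = [f"{path}{key}: missing" for key in schema["required"] if key not in data]
    for key, ref in schema["refs"].items():
        if key in data:  # referenced data is checked only if present
            errors += validate(data[key], ref, f"{path}{key}.")
    return errors


conforming = validate({"SessionID": "s1", "UserState": {"Confidence": 0.9}}, "ContextDescriptors")
nested_error = validate({"SessionID": "s1", "UserState": {}}, "ContextDescriptors")
ref_absent = validate({"SessionID": "s1"}, "ContextDescriptors")
```

In this sketch a failure inside referenced data is reported against the primary data, so the primary data conforms only when all referenced data that is present conforms as well.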

Table 5 – Conformance Testing Method for the Context Description (PGM‑CXT) Composite AIM

Receives	Audio Object	Shall validate against Audio Object schema. Audio Data shall conform with Audio Qualifier.
	Visual Object	Shall validate against Visual Object schema. Visual Data shall conform with Visual Qualifier.
	CXT Directive	Shall validate against CXT Directive schema.
	Domain Response	Shall validate against Domain Response schema.
	CXT Interaction History Response	Shall validate against Interaction History schema.
Produces	Context Descriptors	Shall validate against Context Descriptors schema.
	Domain Request	Shall validate against Domain Request schema.
	CXT Interaction History Request	Shall validate against Interaction History schema.
	CXT Status	Shall validate against CXT Status schema.

9 Performance Assessment