Function
The Visual Scene Enhancement (PGM-VSE) AIM enriches the visual scene captured by Context Capture, in order to derive additional, possibly non‑perceptual visual properties relevant to spatial understanding, interaction, and A-User‑centric reasoning.
VSE operates exclusively on Visual Scene Descriptors (VSD0) and produces Enhanced Visual Scene Descriptors (VSD1), preserving the original perceptual semantics while augmenting them with derived and semantic information under A-User Directive control.
Reference Model
Figure 1 gives the Reference Model of the Visual Scene Enhancement (PGM-VSE) AIM.

Figure 1 – The Reference Model of the Visual Scene Enhancement (PGM-VSE) AIM
Input/Output Data
Table 1 gives the Input and Output Data of the Visual Scene Enhancement (PGM-VSE) AIM.
Table 1 – Input/Output Data of the Visual Scene Enhancement (PGM-VSE) AIM
| Input | Description |
|---|---|
| Visual Scene Descriptors | Perceptual description of the visual scene produced by Context Capture. |
| Visual CXE Directive | Control directives specifying scope, depth, or policy constraints for visual enhancement. |
| Visual CXE Request | Domain‑specific knowledge supporting visual interpretation and semantic classification. |

| Output | Description |
|---|---|
| Enhanced Visual Scene Descriptors | Visual Scene Descriptors augmented with derived and semantic visual properties produced by VSE. |
| Visual CXE Status | Status information describing the execution and outcome of Visual Scene Enhancement processing. |
| Visual CXE Response | Response to domain‑specific knowledge request. |
SubAIMs (Informative)
4.1 Reference Model
Figure 2 depicts the Reference Architecture of the Visual Scene Enhancement (PGM-VSE) AIM.

Figure 2 – Reference Model of Visual Scene Enhancement (PGM-VSE) Composite AIM
Table 2 specifies the Functions and Table 3 the I/O Data of the Visual Scene Enhancement (PGM-VSE) AIM’s SubAIMs.
4.2 Operation
The Visual Scene Enhancement AI Module progressively enriches an input visual scene description into Enhanced Visual Scene Descriptors through the combined action of its internal SubAIMs. The effective inputs are Visual Objects, their Spatial Attitudes, Domain Responses, and the visual component of the CXE Directive from A-User Control.
The Depth and Occlusion Estimation SubAIM enriches the input representation by introducing relative depth relationships among visual objects and occlusion conditions indicating whether objects partially or fully obscure each other.
This transformation converts the initial spatial description into an ordered representation in which objects are placed within a depth structure and visibility constraints are explicitly represented.
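The depth ordering and occlusion flags described above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the `VisualObject` fields (an image-plane bounding box plus a distance from the viewpoint), the function names, and the `"none"`/`"partial"`/`"full"` flag values are hypothetical and are not defined by this specification.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VisualObject:
    """Hypothetical object: image-plane box plus distance from the viewpoint."""
    name: str
    x0: float
    y0: float
    x1: float
    y1: float
    distance: float


def overlap_area(a, b):
    """Area of the intersection of two axis-aligned boxes (0.0 if disjoint)."""
    w = min(a.x1, b.x1) - max(a.x0, b.x0)
    h = min(a.y1, b.y1) - max(a.y0, b.y0)
    return max(w, 0.0) * max(h, 0.0)


def depth_and_occlusion(objects):
    """Return (relative depth rank per object, occlusion flag per object).

    Rank 0 is the nearest object. An object is flagged "full" only when a
    single strictly nearer box covers it entirely; coverage by the union of
    several nearer boxes is reported as "partial" in this simplified sketch.
    """
    ordered = sorted(objects, key=lambda o: o.distance)
    relative_depth = {o.name: rank for rank, o in enumerate(ordered)}
    occlusion = {}
    for target in objects:
        area = (target.x1 - target.x0) * (target.y1 - target.y0)
        flag = "none"
        for other in objects:
            if other.distance >= target.distance:
                continue  # only strictly nearer objects can occlude
            covered = overlap_area(other, target)
            if area > 0 and covered >= area:
                flag = "full"
                break
            if covered > 0:
                flag = "partial"
        occlusion[target.name] = flag
    return relative_depth, occlusion
```

A real implementation would likely work from depth maps or 3D geometry rather than bounding boxes; the sketch only shows how an ordered depth structure and explicit visibility constraints can be attached to each object.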
The Visual Object Identification SubAIM enhances the representation by assigning consistent identifiers to visual objects and resolving correspondence between input objects and internal representations. This step ensures that all subsequent enhancements are coherently attached to the same entities and can be referenced unambiguously across the enhanced description.
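One way to picture consistent identifier assignment is a small registry that matches each incoming object to the nearest known object, or issues a new identifier when no match exists. This is a sketch under stated assumptions: the `ObjectRegistry` class, the nearest-position matching rule, the `max_distance` threshold, and the `obj-N` identifier format are all hypothetical and not taken from this specification.

```python
import math
from itertools import count


class ObjectRegistry:
    """Hypothetical registry that keeps identifiers stable across updates."""

    def __init__(self, max_distance=0.5):
        self._next = count(1)
        self._known = {}  # identifier -> last known position
        self.max_distance = max_distance

    def resolve(self, position):
        """Return the identifier of the closest known object within
        max_distance, or register the position under a new identifier.

        Simplification: two objects observed at the same time may be matched
        to the same identifier; a real matcher would enforce uniqueness.
        """
        best_id, best_d = None, self.max_distance
        for oid, known in self._known.items():
            d = math.dist(position, known)
            if d <= best_d:
                best_id, best_d = oid, d
        if best_id is None:
            best_id = f"obj-{next(self._next)}"
        self._known[best_id] = position
        return best_id
```

Once every object carries a stable identifier, later stages can key their derived attributes on it instead of on list positions.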
The Affordance Inference SubAIM further enriches the representation by introducing affordance tags describing possible interactions with objects, interaction potential reflecting feasibility of interaction, and domain-related responses derived from domain requests.
This stage combines object properties (from parsing and identification), spatial constraints (from depth and occlusion), and domain information.
The resulting enhancement expresses what actions may be performed on objects and under what conditions these actions are feasible.
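The combination of object properties, spatial constraints, and domain information can be sketched as a rule lookup followed by a feasibility score. The rule table `AFFORDANCES`, the `reach` parameter, the linear distance falloff, and the halving for partial occlusion are illustrative assumptions only; the specification does not prescribe any of them.

```python
# Hypothetical rule table: object type -> affordance tags.
AFFORDANCES = {
    "cup": ["grasp", "lift"],
    "door": ["open", "close"],
    "button": ["press"],
}


def infer_affordances(obj_type, occlusion, distance, reach=1.5, domain_tags=()):
    """Return affordance tags and an interaction potential in [0, 1].

    occlusion is one of "none", "partial", "full" (assumed flag values).
    domain_tags stands in for tags supplied by a domain response.
    """
    tags = list(AFFORDANCES.get(obj_type, []))
    tags += [t for t in domain_tags if t not in tags]
    if not tags or occlusion == "full":
        potential = 0.0  # nothing to do, or the object cannot be reached
    else:
        potential = max(0.0, 1.0 - distance / reach)  # farther is less feasible
        if occlusion == "partial":
            potential *= 0.5
    return {"tags": tags, "potential": potential}
```

The tags answer "what actions may be performed", while the potential answers "under what conditions": the same cup yields the same tags whether it is near or far, but a lower potential when it is distant or partly hidden.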
The Visual Salience Mapping SubAIM integrates all previously derived information to produce a ranking of visual objects according to relevance and an identification of salient visual objects.
This integration considers perceptual prominence, spatial relationships (depth, occlusion), interaction potential, and domain context.
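The integration of these factors can be sketched as a weighted score per object, followed by sorting and thresholding. The dictionary keys, the equal weights in `DEFAULT_WEIGHTS`, the visibility mapping, and the `threshold` value are hypothetical choices made for the example, not values given by this specification.

```python
# Hypothetical equal weighting of the five factors.
DEFAULT_WEIGHTS = {
    "prominence": 0.2,
    "nearness": 0.2,
    "visibility": 0.2,
    "potential": 0.2,
    "domain": 0.2,
}

VISIBILITY = {"none": 1.0, "partial": 0.5, "full": 0.0}


def salience_score(obj, weights=None):
    """Weighted sum of prominence, depth, occlusion, interaction, and domain terms."""
    w = weights or DEFAULT_WEIGHTS
    nearness = 1.0 / (1 + obj["relative_depth"])  # rank 0 (nearest) scores 1.0
    return (
        w["prominence"] * obj["prominence"]
        + w["nearness"] * nearness
        + w["visibility"] * VISIBILITY[obj["occlusion"]]
        + w["potential"] * obj["potential"]
        + w["domain"] * obj["domain_relevance"]
    )


def rank_objects(objects, threshold=0.5):
    """Return (ids ranked by descending score, ids whose score meets the threshold)."""
    ranked = sorted(objects, key=salience_score, reverse=True)
    salient = [o["id"] for o in ranked if salience_score(o) >= threshold]
    return [o["id"] for o in ranked], salient
```

A directive could plausibly adjust the weights or the threshold, for example to favour interaction potential over perceptual prominence; that coupling is an assumption here.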
The Visual Output Construction SubAIM constructs the enhanced representation by combining the outputs of all preceding SubAIMs, aligning all derived attributes with the corresponding visual objects, and preserving consistency across depth, identity, interaction, and salience information.
The Enhanced Visual Scene Descriptors represent the scene not only in terms of what is present, but also in terms of what is visible, what can be interacted with, and what is relevant.
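The alignment step can be sketched as a merge keyed on object identifiers that leaves the original perceptual data untouched and reports any attribute that could not be attached. The function name `build_enhanced_vsd`, the `"perceptual"`/`"derived"` record layout, and the status fields are assumptions for illustration; the actual data formats are defined by the linked JSON syntaxes, not by this sketch.

```python
def build_enhanced_vsd(vsd0_objects, **enhancements):
    """Merge per-object enhancement maps into one record per object.

    vsd0_objects: list of dicts, each with an "id" key (assumed layout).
    enhancements: keyword arguments mapping an attribute name to a dict
                  of {object id: derived value}.
    Returns (enhanced records, status dict).
    """
    enhanced, missing = [], []
    for obj in vsd0_objects:
        # Copy the perceptual data so the input description is preserved as-is.
        record = {"perceptual": dict(obj), "derived": {}}
        for name, by_id in enhancements.items():
            if obj["id"] in by_id:
                record["derived"][name] = by_id[obj["id"]]
            else:
                missing.append((obj["id"], name))
        enhanced.append(record)
    status = {"complete": not missing, "missing": missing}
    return enhanced, status
```

Keeping perceptual and derived data in separate sub-records is one simple way to make the "preserve the original semantics while augmenting them" property easy to verify.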
4.3 Functions of AI Modules
Table 2 – Functions of the Visual Scene Enhancement (PGM-VSE) AIM’s SubAIMs
| SubAIM Specification | Purpose |
|---|---|
| Visual Descriptors Parsing | Structures raw Visual Scene Descriptors into explicit Visual Objects and spatial attributes without semantic interpretation. |
| Visual Motion & Proximity Analysis | Detects temporal and spatial dynamics of visual objects by tracking their evolution in space and time. |
| Depth and Occlusion Estimation | Computes relative depth relationships and occlusion conditions among visual objects. |
| Visual Object Identification | Assigns semantic object type labels to visual objects using classification models and optional domain knowledge. |
| Visual Salience Mapping | Determines the relative relevance of visual objects with respect to user interaction and context. |
| Visual Output Construction | Aggregates perceptual and enriched evidence into Enhanced Visual Scene Descriptors and emits execution status. |
4.4 I/O Data of AI Modules
Table 3 – I/O Data of the Visual Scene Enhancement (PGM-VSE) AIM’s SubAIMs
| SubAIM | Input Data | Output Data |
|---|---|---|
| Visual Descriptors Parsing | Visual Scene Descriptors; Visual SUD Directive | Visual Objects; Spatial Attitudes; Visual SUD Status |
| Depth and Occlusion Estimation | Visual Objects; Spatial Attitudes | Relative Depths; Occlusion Flags |
| Affordance Inference | | |
| Visual Object Identification | Visual Objects; Spatial Attitudes; Domain Response | Visual Object Type (e.g. human, vehicle, tool); Type Confidence |
| Visual Salience Mapping | Motion Flags; Proximity Class; Relative Depths; Occlusion Flags; Visual Object Type; Visual SUD Directive; Domain Response | Ranked Visual Objects; Filtered Salient Visual Objects |
| Visual Output Construction | Visual Objects; Spatial Attitudes; Motion Flags; Proximity Class; Relative Depths; Occlusion Flags; Visual Object Type; Salience Results | Enhanced Visual Scene Descriptors (Enhanced VSD); Visual SUD Status |
4.5 AIMs and JSON Metadata
Table 4 provides the links to the AIM specifications and to the JSON syntaxes. AIM1 indicates the Composite AIM and AIM2 its SubAIMs.
Table 4 – AIMs and JSON Metadata
| AIM1 | AIM2 | Names | JSON |
|---|---|---|---|
| PGM-VSR | | Visual Scene Enhancement | Link |
| | PGM-ADP | Visual Descriptors Parsing | Link |
| | PGM-VMP | Depth and Occlusion Estimation | Link |
| | PGM-DOE | Affordance Inference | Link |
| | OSD-VOI | Visual Object Identification | Link |
| | PGM-SMP | Visual Salience Mapping | Link |
| | PGM-VOC | Visual Output Construction | Link |
5. JSON Metadata
https://schemas.mpai.community/PGM1/V1.0/AIMs/VisualSceneEnhancement.json
6. Profiles
No Profiles