
1. Function

The Visual Scene Enhancement (PGM-VSE) AIM enriches the visual scene captured by Context Capture, in order to derive additional, possibly non‑perceptual visual properties relevant to spatial understanding, interaction, and A-User‑centric reasoning.

VSE operates exclusively on Visual Scene Descriptors (VSD0) and produces Enhanced Visual Scene Descriptors (VSD1), preserving the original perceptual semantics while augmenting them with derived and semantic information under A-User Directive control.

2. Reference Model

Figure 1 gives the Reference Model of the Visual Scene Enhancement (PGM-VSE) AIM.

Figure 1 – The Reference Model of the Visual Scene Enhancement (PGM-VSE) AIM

3. Input/Output Data

Table 1 gives the Input and Output Data of the Visual Scene Enhancement (PGM-VSE) AIM.

Table 1 – Input/Output Data of the Visual Scene Enhancement (PGM-VSE) AIM

Input | Description
Visual Scene Descriptors | Perceptual description of the visual scene produced by Context Capture.
Visual CXE Directive | Control directives specifying scope, depth, or policy constraints for visual enhancement.
Visual CXE Request | Domain‑specific knowledge supporting visual interpretation and semantic classification.

Output | Description
Enhanced Visual Scene Descriptors | Visual Scene Descriptors augmented with derived and semantic visual properties produced by VSE.
Visual CXE Status | Status information describing the execution and outcome of Visual Scene Enhancement processing.
Visual CXE Response | Response to domain‑specific knowledge request.

4. SubAIMs (Informative)

4.1 Reference Model

Figure 2 depicts the Reference Model of the Visual Scene Enhancement (PGM-VSE) Composite AIM.

Figure 2 – Reference Model of Visual Scene Enhancement (PGM-VSE) Composite AIM

Table 2 and Table 3 specify the Functions and I/O Data of the Visual Scene Enhancement (PGM-VSE) AIM’s SubAIMs.

4.2 Operation

The Visual Scene Enhancement AI Module progressively enriches an input visual scene description into Enhanced Visual Scene Descriptors through the combined action of its internal SubAIMs. The effective inputs are Visual Objects, their Spatial Attitudes, Domain Responses, and the visual component of the CXE Directive from A-User Control.

The Depth and Occlusion Estimation SubAIM enriches the input representation by introducing relative depth relationships among visual objects and occlusion conditions indicating whether objects partially or fully obscure each other.

This transformation converts the initial spatial description into an ordered representation in which objects are placed within a depth structure and visibility constraints are explicitly represented.
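The depth and occlusion enrichment can be sketched as follows. This is a minimal illustration under stated assumptions, not the normative algorithm: the `VisualObject` fields (`box`, `distance`), the function names, the flag values `"none"`, `"partial"`, and `"full"`, and the rule that an object counts as fully occluded when a single nearer bounding box covers its whole box are all hypothetical choices made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VisualObject:
    # Hypothetical fields; the specification does not define this layout.
    object_id: str
    box: tuple       # (x0, y0, x1, y1) bounding box in the image plane
    distance: float  # distance from the viewpoint


def box_area(box):
    """Area of an axis-aligned box, never negative."""
    return max(box[2] - box[0], 0.0) * max(box[3] - box[1], 0.0)


def overlap_area(a, b):
    """Area of the intersection of two axis-aligned boxes."""
    width = min(a[2], b[2]) - max(a[0], b[0])
    height = min(a[3], b[3]) - max(a[1], b[1])
    return max(width, 0.0) * max(height, 0.0)


def estimate_depth_and_occlusion(objects):
    """Return (relative depth rank per object, occlusion flag per object)."""
    # Nearest object gets rank 0; this is the "depth structure".
    ordered = sorted(objects, key=lambda o: o.distance)
    relative_depths = {o.object_id: rank for rank, o in enumerate(ordered)}

    occlusion_flags = {}
    for far in ordered:
        flag = "none"
        for near in ordered:
            if near.distance >= far.distance:
                break  # the remaining objects are not in front of `far`
            covered = overlap_area(near.box, far.box)
            if covered <= 0.0:
                continue
            if covered >= box_area(far.box):
                flag = "full"  # one nearer box hides the whole object
                break
            flag = "partial"
        occlusion_flags[far.object_id] = flag
    return relative_depths, occlusion_flags
```

Bounding-box overlap is only a coarse stand-in for visibility; an implementation working on masks or meshes would replace `overlap_area` while keeping the same two outputs.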

The Visual Object Identification SubAIM enhances the representation by assigning consistent identifiers to visual objects and resolving correspondence between input objects and internal representations. This step ensures that all subsequent enhancements are coherently attached to the same entities and can be referenced unambiguously across the enhanced description.
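One way to picture correspondence resolution is a small registry that maps each incoming object to an identifier it has already issued, or issues a new one. The sketch below is an assumption-laden illustration: the class name `IdentityRegistry`, the `obj-N` identifier format, and matching by nearest stored position within `max_distance` are hypothetical and are not taken from the specification.

```python
import itertools
import math


class IdentityRegistry:
    """Keeps identifiers stable across successive scene descriptions."""

    def __init__(self, max_distance=1.0):
        self.max_distance = max_distance     # hypothetical matching tolerance
        self._known = {}                     # identifier -> last known position
        self._counter = itertools.count(1)

    def resolve(self, position):
        """Return the identifier of the closest known object, or a new one."""
        best_id, best_dist = None, self.max_distance
        for object_id, known_position in self._known.items():
            dist = math.dist(position, known_position)
            if dist <= best_dist:
                best_id, best_dist = object_id, dist
        if best_id is None:
            best_id = f"obj-{next(self._counter)}"
        # Remember where the object was last seen for the next resolution.
        self._known[best_id] = position
        return best_id
```

This greedy matching can assign the same identifier to two nearby objects in one scene; a fuller implementation would solve the assignment jointly over all objects.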

The Affordance Inference SubAIM further enriches the representation by introducing affordance tags describing possible interactions with objects, interaction potential reflecting feasibility of interaction, and domain-related responses derived from domain requests.

This stage combines object properties (from parsing and identification), spatial constraints (from depth and occlusion), and domain information.

The resulting enhancement expresses what actions may be performed on objects and under what conditions these actions are feasible.
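A minimal sketch of this combination is given below. Everything in it is an assumption for illustration: the lookup table `AFFORDANCES_BY_TYPE`, the occlusion penalties, the `reach` parameter, and the product formula for interaction potential are hypothetical and do not come from the specification.

```python
# Hypothetical mapping from object type to affordance tags.
AFFORDANCES_BY_TYPE = {
    "cup": ("grasp", "lift"),
    "door": ("open", "close"),
}

# Hypothetical scaling of feasibility by occlusion condition.
OCCLUSION_PENALTY = {"none": 1.0, "partial": 0.5, "full": 0.0}


def infer_affordances(object_type, occlusion, distance, reach=1.0, domain_tags=()):
    """Return affordance tags and an interaction potential in [0, 1]."""
    # Object properties and domain information both contribute tags;
    # dict.fromkeys removes duplicates while keeping the order.
    tags = tuple(dict.fromkeys(
        AFFORDANCES_BY_TYPE.get(object_type, ()) + tuple(domain_tags)
    ))
    if not tags:
        return {"tags": (), "potential": 0.0}
    # Spatial constraints: full potential within reach, decaying beyond it.
    proximity = 1.0 if distance <= reach else reach / distance
    return {"tags": tags, "potential": proximity * OCCLUSION_PENALTY[occlusion]}
```

For example, `infer_affordances("cup", "partial", 2.0)` keeps the tags `("grasp", "lift")` but lowers the potential, which reflects the "under what conditions" part of the enhancement.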

The Visual Salience Mapping SubAIM integrates all previously derived information to produce a ranking of visual objects according to relevance and an identification of salient visual objects.

This integration considers perceptual prominence, spatial relationships (depth, occlusion), interaction potential, and domain context.
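The integration can be illustrated with a weighted score over the four factors, followed by ranking and thresholding. The weights, the feature names, the threshold, and the linear scoring rule below are assumptions chosen for the example; the specification does not prescribe how the factors are combined.

```python
# Hypothetical weights for the four factors named in the text.
WEIGHTS = {
    "prominence": 0.4,   # perceptual prominence
    "spatial": 0.2,      # depth and occlusion
    "interaction": 0.3,  # interaction potential
    "domain": 0.1,       # domain context
}


def salience_score(features):
    """Weighted sum of per-object features, each expected in [0, 1]."""
    return sum(weight * features.get(name, 0.0) for name, weight in WEIGHTS.items())


def map_salience(scene, threshold=0.5):
    """Return (all object ids ranked by score, ids at or above the threshold)."""
    scores = {object_id: salience_score(f) for object_id, f in scene.items()}
    # Sort by descending score; ties are broken by identifier for stability.
    ranked = sorted(scores, key=lambda object_id: (-scores[object_id], object_id))
    salient = [object_id for object_id in ranked if scores[object_id] >= threshold]
    return ranked, salient
```

The two return values correspond to the ranked visual objects and the filtered salient visual objects; a directive could, under this sketch, be modelled as a change to `WEIGHTS` or `threshold`.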

The Visual Output Construction SubAIM builds the enhanced representation by combining the outputs of all preceding SubAIMs, aligning all derived attributes with the corresponding visual objects, and preserving consistency across depth, identity, interaction, and salience information.

The Enhanced Visual Scene Descriptors represent the scene not only in terms of what is present, but also in terms of what is visible, what can be interacted with, and what is relevant.
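The alignment step can be sketched as a merge keyed by object identifier. The function name, the dictionary layout, and the `"complete"`/`"partial"` status values are hypothetical; the sketch only shows one way to attach derived attributes to the original objects without altering them and to report gaps in a status record.

```python
def construct_enhanced_vsd(visual_objects, **enhancements):
    """Merge per-object enhancement maps into copies of the input objects.

    visual_objects: {object_id: {attribute: value}} from the input description.
    enhancements:   one {object_id: value} map per derived attribute.
    """
    enhanced = {}
    missing = []
    for object_id, base in visual_objects.items():
        record = dict(base)  # copy, so the original description is preserved
        for name, per_object in enhancements.items():
            if object_id in per_object:
                record[name] = per_object[object_id]
            else:
                missing.append((object_id, name))
        enhanced[object_id] = record
    status = {"status": "partial" if missing else "complete", "missing": missing}
    return enhanced, status
```

A call such as `construct_enhanced_vsd(objects, relative_depth=depths, occlusion=flags, salience=scores)` would yield one record per object plus a status listing any object that lacks one of the derived attributes.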

4.3 Functions of AI Modules

Table 2 – Functions of Visual Scene Enhancement (PGM-VSE) AIM’s SubAIMs

SubAIM | Purpose
Visual Descriptors Parsing | Structures raw Visual Scene Descriptors into explicit Visual Objects and spatial attributes without semantic interpretation.
Visual Motion & Proximity Analysis | Detects temporal and spatial dynamics of visual objects by tracking their evolution in space and time.
Depth and Occlusion Estimation | Computes relative depth relationships and occlusion conditions among visual objects.
Visual Object Identification | Assigns semantic object type labels to visual objects using classification models and optional domain knowledge.
Visual Salience Mapping | Determines the relative relevance of visual objects with respect to user interaction and context.
Visual Output Construction | Aggregates perceptual and enriched evidence into Enhanced Visual Scene Descriptors and emits execution status.

4.4 I/O Data of AI Modules

Table 3 – I/O Data of Visual Scene Enhancement (PGM-VSE) AIM’s SubAIMs

SubAIM | Input Data | Output Data
Visual Descriptors Parsing | Visual Scene Descriptors; Visual SUD Directive | Visual Objects; Spatial Attitudes; Visual SUD Status
Depth and Occlusion Estimation | Visual Objects; Spatial Attitudes | Relative Depths; Occlusion Flags
Affordance Inference | |
Visual Object Identification | Visual Objects; Spatial Attitudes; Domain Response | Visual Object Type (e.g. human, vehicle, tool); Type Confidence
Visual Salience Mapping | Motion Flags; Proximity Class; Relative Depths; Occlusion Flags; Visual Object Type; Visual SUD Directive; Domain Response | Ranked Visual Objects; Filtered Salient Visual Objects
Visual Output Construction | Visual Objects; Spatial Attitudes; Motion Flags; Proximity Class; Relative Depths; Occlusion Flags; Visual Object Type; Salience Results | Enhanced Visual Scene Descriptors (Enhanced VSD); Visual SUD Status

4.5 AIMs and JSON Metadata

Table 4 provides links to the AIM specifications and to their JSON syntaxes. AIM1 indicates the Composite AIM and AIM2 its SubAIMs.

Table 4 – AIMs and JSON Metadata

AIM1 | AIM2 | Names | JSON
PGM-VSR | | Visual Scene Enhancement | Link
 | PGM-ADP | Visual Descriptors Parsing | Link
 | PGM-VMP | Depth and Occlusion Estimation | Link
 | PGM-DOE | Affordance Inference | Link
 | OSD-VOI | Visual Object Identification | Link
 | PGM-SMP | Visual Salience Mapping | Link
 | PGM-VOC | Visual Output Construction | Link

5. JSON Metadata

https://schemas.mpai.community/PGM1/V1.0/AIMs/VisualSceneEnhancement.json

6. Profiles

No Profiles