(Tentative)
| Function | Reference Model | Input/Output Data |
|---|---|---|
| SubAIMs | JSON Metadata | Profiles |
1. Function
The Visual Spatial Reasoning AIM (PGM‑VSR) operates under the Visual Action Directive (PGM‑VAD). It processes the Context produced by Context Capture, which includes User State and Visual Scene Descriptors (VSD0). It reads PGM‑VAD, configures its reasoning pipeline accordingly, then refines and aligns the Visual Scene Descriptors through an iterative loop with PGM‑DAC, described below. Note that User State is not explicitly referenced in the iterative loop of Table 1.
Table 1 – Iterative loop of Visual Scene Descriptors
| Phase | Inputs | Operation | Outputs | Recipient |
|---|---|---|---|---|
| Directive intake | PGM‑VAD (Visual Action Directive) | Conform the pipeline: select required spatial operations, constraints, and priorities. | Conformance plan (internal) | — |
| Initial refinement | VSD0 (from OSD‑VSD), conformance plan | Produce VSD1 aligned to the directive: descriptor parsing, normalisation, preliminary localisation. | VSD1 | PGM‑DAC |
| Domain enrichment | VSD1 | Apply domain‑specific knowledge, resolve ambiguities, add semantic attributes. | VSD2 | PGM‑VSR |
| Directive‑aligned reasoning | VSD2, conformance plan | Execute directive‑scoped spatial reasoning: refined localisation, depth/occlusion, affordance inference, salience mapping. | OSD‑VSD2 (final descriptors) | PGM‑PRC |
| Status reporting | Conformance plan, execution results | Summarise compliance, coverage, uncertainties, and residual constraints. | PGM‑VAS (Visual Action Status) | PGM‑AUC |
Processing details
- Directive conformance: VSR reads Visual Action Directive (PGM‑VAD), selects and orders spatial operations, sets thresholds and scope, and binds any domain constraints that must be respected downstream.
- Initial refinement: VSR transforms VSD0 into VSD1 under directive constraints, focusing on descriptor parsing, normalisation, and preliminary localisation.
- Domain enrichment: Domain Access (PGM‑DAC) augments VSD1 with domain knowledge, returning VSD2 for directive‑aligned reasoning.
- Spatial reasoning: VSR applies refined localisation, depth and occlusion estimation, affordance inference, and salience mapping, within the directive’s priorities and limits.
- Outputs:
  - OSD‑VSD2 (final refined Visual Scene Descriptors): sent to Prompt Creation (PGM‑PRC).
  - PGM‑VAS (Visual Action Status): sent to A‑User Control (PGM‑AUC), capturing compliance and residual gaps.
Iterative loop
- Feedback: If PGM‑AUC updates PGM‑VAD, VSR re‑conforms the pipeline and re‑runs refinement on the latest descriptors.
- Coherence: The loop keeps reasoning spatially coherent and domain‑aware, ensuring downstream AIMs operate with directive‑aligned visual context.
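The phase sequence and feedback loop of Table 1 can be summarised in code. The following TypeScript fragment is a minimal, non-normative sketch: every type name, function name, and stub body is an assumption; the normative behaviour is defined by the table and bullets above.

```typescript
// Minimal sketch of the Table 1 phase sequence. All names and stub bodies
// are hypothetical; the normative behaviour is defined by Table 1 above.
type VSD = { stage: "VSD0" | "VSD1" | "VSD2" | "OSD-VSD2"; data: unknown };
type Directive = { operations: string[]; constraints: string[]; priorities: string[] }; // PGM-VAD
type Plan = Directive;                                        // conformance plan (internal)
type Status = { compliant: boolean; residualGaps: string[] }; // PGM-VAS

const conformPipeline = (vad: Directive): Plan => ({ ...vad });                           // directive intake
const refine = (vsd0: VSD, _plan: Plan): VSD => ({ stage: "VSD1", data: vsd0.data });     // initial refinement
const enrichWithDomain = (vsd1: VSD): VSD => ({ stage: "VSD2", data: vsd1.data });        // PGM-DAC enrichment
const reason = (vsd2: VSD, _plan: Plan): VSD => ({ stage: "OSD-VSD2", data: vsd2.data }); // directive-aligned reasoning
const reportStatus = (_plan: Plan, _out: VSD): Status => ({ compliant: true, residualGaps: [] });

// One pass of the loop; when PGM-AUC updates PGM-VAD, the pass is re-run
// on the latest descriptors (the "Feedback" bullet above).
function runVSR(vad: Directive, vsd0: VSD): { descriptors: VSD; status: Status } {
  const plan = conformPipeline(vad);
  const out = reason(enrichWithDomain(refine(vsd0, plan)), plan);
  return { descriptors: out, status: reportStatus(plan, out) };
}
```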
2. Reference Model
Figure 1 gives the Reference Model of the Visual Spatial Reasoning (PGM-VSR) AIM.

Figure 1 – The Reference Model of the Visual Spatial Reasoning (PGM-VSR) AIM
3. Input/Output Data
Table 2 gives the Input and Output Data of PGM-VSR.
Table 2 – Input/Output Data of PGM-VSR
| Input | Description |
|---|---|
| Context | A structured and time-stamped snapshot representing the initial understanding that the A-User achieves of the environment and of the User posture. |
| Visual Scene Descriptors | A modification of the input Visual Scene Descriptors provided by the Domain Access AIM to help the interpretation of the Visual Scene by injecting constraints, priorities, and refinement logic. |
| Visual Action Directive | Visual-related actions and process sequences from PGM-AUC. |
| Output | Description |
| Visual Scene Descriptors | A structured, analytical representation of the Visual Scene with object geometry, 3D positions, depth, occlusion, and affordance data. It highlights salient objects, normalised positions, proximity, and interaction cues. |
| Visual Action Status | Visual spatial constraints and scene-anchoring information reported to PGM-AUC. |
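As a reading aid, the records of Table 2 can be pictured as TypeScript interfaces. This is a non-normative sketch: all field names are assumptions; the authoritative definitions are in the JSON Metadata referenced in Section 5.

```typescript
// Non-normative shapes for the Table 2 data; all field names are assumed.
interface Context {
  timestamp: string;          // time-stamped snapshot from Context Capture
  userState: unknown;         // initial understanding of the User posture
  environment: unknown;       // initial understanding of the environment
}
interface VisualActionDirective {     // from PGM-AUC
  actions: string[];
  processSequences: string[];
}
interface VisualSceneDescriptorsOut { // final refined descriptors (OSD-VSD2)
  objects: Array<{
    geometry: unknown;
    position: [number, number, number];               // normalised 3D position
    depth: "foreground" | "midground" | "background";
    occlusion: "visible" | "partial" | "hidden";
    affordances: string[];                            // interaction cues
    salient: boolean;                                 // highlighted if salient
    proximity: number;
  }>;
}
interface VisualActionStatus {        // reported to PGM-AUC
  spatialConstraints: unknown[];
  sceneAnchoring: unknown;
}
```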
4. SubAIMs
Figure 2 gives the Reference Model of the Visual Spatial Reasoning (PGM-VSR) Composite AIM.
Figure 2 – Reference Model of Visual Spatial Reasoning (PGM-VSR) Composite AIM
Table 3 specifies the Functions performed by the PGM-VSR AIM's SubAIMs in the current example partitioning into SubAIMs.
Table 3 – Functions performed by the PGM-VSR AIM's SubAIMs (example)
| Sub-AIM | Name | Function |
|---|---|---|
| VDP | Visual Descriptors Parsing | Decomposes VSD into structured components: Visual Objects (VIO) and Spatial Attitudes (OSA) |
| DOE | Depth and Occlusion Estimation | Computes relative depth and occlusion flags |
| AFI | Affordance Inference | Determines actionable properties and interaction potential |
| VSM | Visual Salience Mapping | Ranks objects by prominence and filters salient entities |
| VOC | Visual Output Construction | Builds output Visual Scene Descriptors from salient and full descriptors |
Table 4 gives the AIMs composing the Visual Spatial Reasoning (PGM-VSR) Composite AIM:
Table 4 – AIMs of the Visual Spatial Reasoning (PGM-VSR) Composite AIM
| AIMs | Names | JSON |
|---|---|---|
| PGM-VSR | Visual Spatial Reasoning | Link |
| PGM-VDP | Visual Descriptors Parsing | Link |
| PGM-DOE | Depth and Occlusion Estimation | Link |
| PGM-AFI | Affordance Inference | Link |
| PGM-VSM | Visual Salience Mapping | Link |
| PGM-VOC | Visual Output Construction | Link |
Table 5 gives the input and output data of the PGM-VSR AIM.
Table 5 – Input and output data of the PGM-VSR AIM
| AIMs | Input | Output | To |
|---|---|---|---|
| Visual Descriptors Parsing | Visual Scene Descriptors | Visual Objects, Spatial Attitudes | DOE, AFI, VSM, VOC |
| Depth and Occlusion Estimation | Visual Objects, Spatial Attitudes, Visual Action Directive | Relative Depths, Occlusion Flags, Visual Spatial Status | VSM, VOC |
| Affordance Inference | Visual Objects, Spatial Attitudes, Visual Action Directive | Affordance Tags, Interaction Potential, Visual Spatial Status | VSM, VOC |
| Visual Salience Mapping | Relative Depths, Occlusion Flags, Affordance Tags, Interaction Potential, Visual Action Directive | Relative Depths, Occlusion Flags, Ranked Visual Objects, Affordance Tags, Interaction Potential, Salient Visual Objects, Visual Spatial Status | VOC |
| Visual Output Construction | Relative Depths, Occlusion Flags, Ranked Visual Objects, Affordance Tags, Interaction Potential, Salient Visual Objects, Visual Spatial Status | Visual Scene Descriptors, Visual Spatial Status | — |
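The partitioning of Table 3 and the dataflow of Table 5 can be combined into a single wiring sketch. In the TypeScript below, the function names mirror the SubAIM names, while the signatures and stub bodies are assumptions made for illustration only.

```typescript
// Illustrative wiring of the Table 3 / Table 5 dataflow. Function names mirror
// the SubAIMs; signatures and stub bodies are assumptions, not normative.
type VSD = { objects: unknown[] };            // Visual Scene Descriptors
type VAD = unknown;                           // Visual Action Directive
type SpatialStatus = { compliant: boolean };  // Visual Spatial Status

// VDP: decompose VSD into Visual Objects and Spatial Attitudes
const visualDescriptorsParsing = (vsd: VSD) =>
  ({ visualObjects: vsd.objects, spatialAttitudes: [] as unknown[] });
type Parsed = ReturnType<typeof visualDescriptorsParsing>;

// DOE: relative depths and occlusion flags, plus a Visual Spatial Status
const depthAndOcclusionEstimation = (_p: Parsed, _vad: VAD) =>
  ({ relativeDepths: [] as string[], occlusionFlags: [] as string[],
     status: { compliant: true } as SpatialStatus });

// AFI: actionable properties and interaction potential
const affordanceInference = (_p: Parsed, _vad: VAD) =>
  ({ affordanceTags: [] as string[], interactionPotential: [] as number[],
     status: { compliant: true } as SpatialStatus });

// VSM: rank by prominence and filter salient entities
const visualSalienceMapping = (
  doe: ReturnType<typeof depthAndOcclusionEstimation>,
  afi: ReturnType<typeof affordanceInference>, _vad: VAD) =>
  ({ ...doe, ...afi, rankedVisualObjects: [] as unknown[],
     salientVisualObjects: [] as unknown[] });

// VOC: build the output Visual Scene Descriptors and final status
const visualOutputConstruction = (s: ReturnType<typeof visualSalienceMapping>) =>
  ({ visualSceneDescriptors: { objects: s.salientVisualObjects }, status: s.status });

// Composite AIM: VDP feeds DOE and AFI; both feed VSM; VSM feeds VOC.
function visualSpatialReasoning(vsd: VSD, vad: VAD) {
  const parsed = visualDescriptorsParsing(vsd);
  const doe = depthAndOcclusionEstimation(parsed, vad);
  const afi = affordanceInference(parsed, vad);
  return visualOutputConstruction(visualSalienceMapping(doe, afi, vad));
}
```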
Table 6 specifies the External and Internal Data Types of the Visual Spatial Reasoning AIM.
Table 6 – External and Internal Data Types identified in Visual Spatial Reasoning AIM
| Data Type | Definition |
|---|---|
| VisualSceneDescriptors | – As input: the structured visual data received for refinement. – As output: the final structured product of the Composite AIM, containing all spatialised and semantically enriched visual data. |
| UserPointOfView | Contained in the Basic Visual Scene Descriptors component. |
| VisualObjects | Structured list of Visual Objects extracted from the Visual Scene Descriptors. |
| SpatialAttitudes | Position and Orientation of each Visual Object, together with their first- and second-order derivatives. |
| DepthEstimates | Classification of each object’s relative depth (e.g., foreground, midground, background). |
| OcclusionFlags | Visibility classification of each object (e.g., fully visible, partially occluded, hidden). |
| AffordanceProfile | Actionable properties of visual objects (e.g., graspable, tappable, obstructive) and inferred interaction potential. |
| RankedVisualObjects | Ordered list of visual objects prioritised by perceptual salience and interaction relevance. |
| FilteredSalientObjects | Subset of Ranked Visual Objects selected for inclusion in the Visual Spatial Guide. |
| VisualSpatialDirective | Dynamic modifier provided by Domain Access AIM. Injects constraints, priorities, and refinement logic into reasoning Sub-AIMs. |
| VisualSpatialStatus | Structured status report from directive-aware Sub-AIMs. Includes constraint satisfaction, override flags, and anchoring metadata. |
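The data types of Table 6 can likewise be rendered as type definitions. In the sketch below, only the type names come from the table; the enumeration values and fields are assumptions derived from the definitions.

```typescript
// Sketch of the Table 6 data types; enumeration values and fields are assumed.
type DepthEstimate = "foreground" | "midground" | "background";       // relative depth classes
type OcclusionFlag = "fullyVisible" | "partiallyOccluded" | "hidden"; // visibility classes

interface VisualObject { id: string }
interface SpatialAttitude {
  position: [number, number, number];
  orientation: [number, number, number];
  velocity?: [number, number, number];       // first-order derivative
  acceleration?: [number, number, number];   // second-order derivative
}
interface AffordanceProfile {
  properties: Array<"graspable" | "tappable" | "obstructive">;
  interactionPotential: number;
}
type RankedVisualObjects = VisualObject[];           // ordered by salience and relevance
type FilteredSalientObjects = RankedVisualObjects;   // subset kept for the Visual Spatial Guide
interface VisualSpatialStatus {
  constraintsSatisfied: boolean;
  overrideFlags: string[];
  anchoringMetadata: unknown;
}
```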
Tables 7 and 8 describe the effects of the Visual Action Directive on the SubAIMs and the contributions of the SubAIMs to the Visual Action Status.
Table 7 – Effects of VisualActionDirective on PGM-VSR SubAIMs
| Sub-AIM | Name | Directive Effects |
|---|---|---|
| DOE | Depth & Occlusion Estimation | – Override depth thresholds (e.g., redefine “foreground” zone). – Prioritise occlusion types (e.g., favour fully visible objects). – Suppress or amplify depth cues based on User State (if the metaverse allows data sharing). |
| AFI | Affordance Inference | – Inject constraints on actionable properties (e.g., suppress obstructive objects). – Prioritise interaction-relevant affordances. – Refine affordance logic based on User Intent. |
| VSM | Visual Salience Mapping | – Bias salience scoring toward directive-relevant objects. – Filter out objects below directive-defined thresholds. – Align ranking logic with User attention or Cognitive State. |
| VOC | Visual Output Construction | – Select output format based on directive (e.g., guide vs full descriptor). – Apply framing constraints (e.g., include only directive-relevant objects). – Normalise output to User viewpoint or temporal focus. |
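As an example of the directive effects listed for VSM, the following sketch shows how salience scoring might be biased toward directive-relevant objects and filtered against a directive-defined threshold; the names and the scoring scheme are assumptions.

```typescript
// Illustrative VSM directive handling (Table 7): bias salience scores toward
// directive-relevant objects and drop objects below the directive threshold.
interface ScoredObject { id: string; salience: number; tags: string[] }
interface DirectiveParams { relevantTags: string[]; salienceThreshold: number; bias: number }

function applySalienceDirective(objects: ScoredObject[], d: DirectiveParams): ScoredObject[] {
  return objects
    .map(o => ({
      ...o,
      // Bias scoring toward directive-relevant objects
      salience: o.tags.some(t => d.relevantTags.includes(t)) ? o.salience + d.bias : o.salience,
    }))
    .filter(o => o.salience >= d.salienceThreshold) // filter below directive-defined threshold
    .sort((a, b) => b.salience - a.salience);       // ranking aligned with directive priorities
}
```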
Table 8 – Contributions to VisualActionStatus from SubAIMs
| Sub-AIM | Name | Status Contributions |
|---|---|---|
| DOE | Depth & Occlusion Estimation | – Flags for directive compliance (e.g., depth match) – Occlusion override trace – Object inclusion/exclusion rationale |
| AFI | Affordance Inference | – Affordance constraint match status – Interaction potential override trace – Confidence in affordance inference |
| VSM | Visual Salience Mapping | – Salience override applied – Filtered object count – Ranking logic traceability |
| VOC | Visual Output Construction | – Final directive compliance summary – Anchoring metadata – Output format trace (e.g., guide vs descriptor) |
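The contributions of Table 8 suggest a simple aggregation step in which the per-SubAIM reports are merged into the Visual Action Status delivered to PGM-AUC. The sketch below assumes a uniform report shape; all field names are illustrative.

```typescript
// Illustrative aggregation of per-SubAIM contributions (Table 8) into the
// Visual Action Status reported to PGM-AUC; the report shape is an assumption.
interface SubAimStatus {
  subAim: "DOE" | "AFI" | "VSM" | "VOC";
  compliant: boolean;       // directive compliance flag
  overrides: string[];      // override traces (depth, occlusion, salience, ...)
  rationale: string[];      // inclusion/exclusion and ranking traceability
}
interface VisualActionStatus {
  compliant: boolean;
  overrideTrace: string[];
  perSubAim: SubAimStatus[];
}

function aggregateStatus(reports: SubAimStatus[]): VisualActionStatus {
  return {
    compliant: reports.every(r => r.compliant),       // final directive compliance summary
    overrideTrace: reports.flatMap(r => r.overrides), // merged override traces
    perSubAim: reports,                               // full rationale retained for PGM-AUC
  };
}
```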
5. JSON Metadata
https://schemas.mpai.community/PGM1/V1.0/AIMs/VisualSpatialReasoning.json
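A client can, for instance, validate an AIM metadata instance against the published schema. The sketch below assumes the URL serves a standard JSON Schema document and uses the Ajv validator; it is not part of the specification.

```typescript
// Sketch: validate an AIM metadata instance against the published schema.
// Assumes the URL serves a standard JSON Schema document and that the Ajv
// validator is installed (npm install ajv).
import Ajv from "ajv";

const SCHEMA_URL =
  "https://schemas.mpai.community/PGM1/V1.0/AIMs/VisualSpatialReasoning.json";

async function validateMetadata(instance: unknown): Promise<boolean> {
  const schema = await (await fetch(SCHEMA_URL)).json();
  const ajv = new Ajv({ strict: false });  // tolerate vendor-specific keywords
  const validate = ajv.compile(schema);
  const ok = validate(instance);
  if (!ok) console.error(validate.errors); // report schema violations
  return ok;
}
```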
6. Profiles
No Profiles