(Tentative)


Function

The Visual Spatial Reasoning AIM (PGM-VSR) receives Visual Scene Descriptors, including object geometry, layout metadata, and motion cues. Its primary function is to extract, refine, and interpret these descriptors to construct a structured representation of the spatial configuration of objects and the surrounding environment.

Internally, PGM-VSR may perform the following operations:

  • Descriptor Parsing: Decomposes visual input into structured components such as object boundaries, spatial attitudes, and environmental markers.
  • Object Localisation: Computes the 3D positions, orientations, and anchoring constraints of visual entities within the scene.
  • Motion & Occlusion Analysis: Evaluates object movement, visibility, and occlusion relationships to infer interaction feasibility and perceptual relevance.
  • Environmental Layout Mapping: Synthesizes spatial relationships between objects, surfaces, and navigable regions to construct a coherent scene topology.
  • Salience Filtering: Identifies visually prominent or interaction-relevant entities based on motion, proximity, and framing cues.
  • Output Construction: Produces a Visual Spatial Output for Domain Access and a Visual Spatial Guide for Prompt Creation, encapsulating object layout, salience, and spatial constraints.

The resulting outputs enable Domain Access and Prompt Creation to operate with full awareness of the visual environment, supporting location-aware interaction, alignment with User perception, and context-sensitive coordination of AIMs.
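
As an informal illustration of the operations listed above, the following is a minimal, non-normative Python sketch. All class, field, and function names are assumptions of this example, not identifiers defined by this Technical Specification.

    # Non-normative sketch: names and structures are illustrative assumptions.
    from dataclasses import dataclass, field

    @dataclass
    class SceneObject:
        object_id: str
        geometry: dict              # e.g., bounding-volume parameters
        spatial_attitude: dict      # position and orientation
        motion: dict = field(default_factory=dict)
        occluded: bool = False
        salience: float = 0.0

    def parse_descriptors(descriptors):
        """Descriptor Parsing: decompose input into structured components."""
        return [SceneObject(d["id"], d.get("geometry", {}),
                            d.get("spatial_attitude", {}), d.get("motion", {}))
                for d in descriptors]

    def localise(objects, user_viewpoint):
        """Object Localisation: express positions relative to the User viewpoint."""
        for obj in objects:
            p = obj.spatial_attitude.get("position", (0.0, 0.0, 0.0))
            obj.spatial_attitude["user_relative_position"] = tuple(
                pc - vc for pc, vc in zip(p, user_viewpoint))
        return objects

    def filter_salient(objects, threshold=0.5):
        """Salience Filtering: keep prominent, interaction-relevant entities."""
        return [o for o in objects if o.salience >= threshold and not o.occluded]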

Reference Model

Figure 5 gives the Reference Model of the Visual Spatial Reasoning (PGM-VSR) AIM.

Figure 5 – The Reference Model of Visual Spatial Reasoning (PGM-VSR)

The functions performed by the PGM-VSR AIM can be classified as:

  1. Descriptor Parsing

– Ingest Visual Scene Descriptors

– Extract object geometry and spatial attitude

  2. Object Localisation

– Estimate 3D position of each object

– Normalise coordinates to user viewpoint

  3. Depth & Occlusion Estimation

– Compute relative depth

– Detect occluded or partially visible items

  4. Affordance Inference

– Determine actionable properties

– Classify interaction potential

  5. Salience & Relevance Mapping

– Rank objects by prominence and User focus

– Filter for guide inclusion

  6. Output Construction

– Build Visual Spatial Output (for DAC)

– Build Visual Spatial Guide (for PRC)

– Build Visual Action Status (for PGM-AUC)
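
The sequence below is a minimal, executable Python sketch of how these six stages could be chained. The depth-based salience heuristic and all identifiers are assumptions of the example, not part of this Technical Specification.

    # Non-normative sketch of the six-stage flow; identifiers are assumptions.
    def run_pgm_vsr(descriptors, user_viewpoint):
        objs = [dict(d) for d in descriptors]                 # 1. Descriptor Parsing
        for o in objs:                                        # 2. Object Localisation
            o["rel_pos"] = [p - v for p, v in zip(o["position"], user_viewpoint)]
        for o in objs:                                        # 3. Depth & Occlusion Estimation
            o["depth"] = o["rel_pos"][2]                      #    depth along the view axis
            o.setdefault("occluded", False)
        for o in objs:                                        # 4. Affordance Inference
            o["affordances"] = ["graspable"] if o.get("movable") else []
        ranked = sorted(objs, key=lambda o: o["depth"])       # 5. Salience Mapping (nearer = more salient)
        output = {"objects": objs}                            # 6. Output Construction
        guide = {"salient": [o for o in ranked if not o["occluded"]][:3]}
        return output, guide                                  #    (Status omitted in this sketch)

    output, guide = run_pgm_vsr(
        [{"id": "cup", "position": [1.0, 0.0, 2.0], "movable": True}],
        user_viewpoint=[0.0, 0.0, 0.0])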

Input/Output Data

Inputs

Context: A structured and time-stamped snapshot representing the initial understanding that the A-User achieves of the environment and of the User's posture.

Visual Spatial Directive: A dynamic modifier provided by the Domain Access AIM to guide the interpretation of the Visual Scene by injecting constraints, priorities, and refinement logic.

Visual Action Directive: Visual-related actions and process sequences received from PGM-AUC.

Outputs

Visual Spatial Output: A structured, analytical representation of the Visual Scene with object geometry, 3D positions, depth, occlusion, and affordance data.

Visual Spatial Guide: A filtered, User-relative subset of the scene. It highlights salient objects, normalised positions, proximity, and interaction cues to enrich PRC prompts with spatial grounding relevant to User focus.

Visual Action Status: Status of visual spatial constraints and scene anchoring, reported to PGM-AUC.
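
For illustration, these outputs could be carried by containers such as the following Python sketch; the field names are assumptions of the example and are not taken from the normative JSON schema referenced below.

    # Non-normative sketch: field names are illustrative assumptions.
    from dataclasses import dataclass

    @dataclass
    class VisualSpatialOutput:
        objects: list          # full object list with geometry
        depth: dict            # object_id -> relative depth value
        occlusion: dict        # object_id -> occlusion flag
        affordances: dict      # object_id -> affordance tags

    @dataclass
    class VisualSpatialGuide:
        salient_objects: list  # filtered, User-relative subset of the scene
        proximity: dict        # object_id -> proximity class (e.g., "near")
        interaction_cues: list # cues for spatial grounding of prompts

    @dataclass
    class VisualActionStatus:
        constraints: list      # visual spatial constraints
        anchoring: dict        # scene-anchoring information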

SubAIMs

Table 13 gives the functions – potentially implementable as AIMs – performed by the Visual Spatial Reasoning AIM.

Table 13 – Functions performed by Visual Spatial Reasoning AIM

1. Descriptor Parsing
   Inputs: Visual Scene Descriptors (geometry, spatial attitude, object metadata)
   Outputs: Parsed object list, object spatial attitudes
   Sent to: 2, 3, 4, 5, 6

2. Object Localisation
   Inputs: Parsed object geometry, spatial attitude
   Outputs: 3D object positions, viewpoint-normalised coordinates
   Sent to: 3, 5, 6

3. Depth & Occlusion Estimation
   Inputs: 3D positions, object layout
   Outputs: Relative depth values, occlusion flags
   Sent to: 5, 6

4. Affordance Inference
   Inputs: Object labels, geometry, spatial attitude
   Outputs: Affordance tags, interaction potential
   Sent to: 5, 6

5. Salience Mapping
   Inputs: Depth values, occlusion flags, affordance tags, viewpoint-normalised data
   Outputs: Ranked object list, filtered salient objects
   Sent to: 6

6. Output Construction
   Inputs: Ranked salient objects, full descriptors, depth, occlusion, affordance
   Outputs: Visual Spatial Output (→ Domain Access), Visual Spatial Guide (→ Prompt Creation), Visual Action Status (→ PGM-AUC)
   Sent to: Domain Access, Prompt Creation, PGM-AUC

Table 14 – Components of Visual Spatial Reasoning AIM’s outputs

Visual Spatial Output
   Composing Functions: Descriptor Parsing, Object Localisation, Depth & Occlusion Estimation, Affordance Inference.
   Included Data Elements: Full object list, 3D positions, depth values, occlusion flags, affordance tags, confidence scores.
   Destination/Purpose: Sent to Domain Access for spatial query formulation and scene refinement.

Visual Spatial Guide
   Composing Functions: Object Localisation, Depth & Occlusion Estimation, Affordance Inference, Salience & Relevance Mapping.
   Included Data Elements: Filtered salient objects, viewpoint-normalised positions, proximity classification, affordance tags, optional narrative cues.
   Destination/Purpose: Sent to Prompt Creation for enriching the PC-Prompt with User-relative spatial anchoring.
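
A possible shape for Output Construction, combining the per-function results of Tables 13 and 14, is sketched below in Python; the dictionary keys are assumptions of the example, not normative field names.

    # Non-normative sketch of Output Construction per Tables 13 and 14.
    def build_visual_spatial_output(objects, positions, depth, occlusion, affordances):
        """Full analytical representation, sent to Domain Access."""
        return {"objects": objects, "positions_3d": positions,
                "depth": depth, "occlusion": occlusion,
                "affordances": affordances}

    def build_visual_spatial_guide(positions, affordances, ranked_ids):
        """Filtered, User-relative subset, sent to Prompt Creation."""
        return {"salient_objects": ranked_ids,
                "user_relative_positions": {i: positions[i] for i in ranked_ids},
                "affordances": {i: affordances[i] for i in ranked_ids}}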

The following gives an analysis of the impact of the Visual Spatial Directive on the operation and outputs of the Visual Spatial Reasoning AIM:

Functions using the input

  1. Object Localisation
  2. Salience & Relevance Mapping

Directive Elements Used

  1. Viewpoint constraints
  2. Object focus hints
  3. Salience map
  4. Refinement instructions

Internal Effects

  1. Adjusts viewpoint normalisation logic in Object Localisation
  2. Overrides default salience ranking in Salience Mapping
  3. Filters or reorders object list based on DAC-specified focus
  4. Triggers reprocessing of descriptors if refinement is requested

Impact on Outputs

  1. Visual Spatial Output:

– Object positions may be recalculated

– Salience scores may be updated

– Object list may be reordered or filtered

  2. Visual Spatial Guide:

– Salient objects may shift

– Viewpoint-relative cues may be adjusted

– Narrative emphasis may change
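
The following Python sketch illustrates how such a Directive could act on the internal state; the directive keys ("viewpoint", "focus_hints", "salience_map") are assumptions of the example, not normative elements of the Visual Spatial Directive.

    # Non-normative sketch of Directive handling; directive keys are assumptions.
    def apply_directive(objects, salience, directive):
        if "viewpoint" in directive:                      # viewpoint constraints
            for o in objects:
                o["rel_pos"] = [p - v for p, v in
                                zip(o["position"], directive["viewpoint"])]
        for oid in directive.get("focus_hints", []):      # object focus hints
            salience[oid] = max(salience.get(oid, 0.0), 1.0)
        salience.update(directive.get("salience_map", {}))  # DAC-provided override
        focus = set(directive.get("focus_hints", []))
        if focus:                                         # reorder: focused objects first
            objects = sorted(objects, key=lambda o: o["id"] not in focus)
        return objects, salience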

JSON Metadata

https://schemas.mpai.community/PGM1/V1.0/AIMs/VisualSpatialReasoning.json
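
One possible way to check AIM metadata against this schema is sketched below in Python, assuming the URL serves a standard JSON Schema document and that the third-party jsonschema package is installed.

    # A minimal sketch, assuming the URL above serves a standard JSON Schema;
    # the empty instance is a placeholder, not a conformant example.
    import json, urllib.request
    import jsonschema  # third-party: pip install jsonschema

    SCHEMA_URL = ("https://schemas.mpai.community/PGM1/V1.0/"
                  "AIMs/VisualSpatialReasoning.json")

    with urllib.request.urlopen(SCHEMA_URL) as resp:
        schema = json.load(resp)

    instance = {}  # replace with actual PGM-VSR AIM metadata
    jsonschema.validate(instance=instance, schema=schema)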

Profiles

No Profiles