1     Definition 2     Functional Requirements 3     Syntax
4     Semantics 5    Conformance Testing 6     Performance Assessment

1      Definition

The Enhanced Visual Scene Descriptors (EVD) augment a previously defined Basic Visual Scene Descriptors (BVS) instance with additional semantic, perceptual, and interaction‑oriented information, thus enabling systems to move from descriptive scene representation to actionable and interaction‑aware scene understanding.

An Enhanced Visual Scene Descriptors instance establishes a link to a base BVS instance and provides enriched descriptors for selected visual scene items. These enhancements include properties such as depth, occlusion, salience, interaction potential, and multimodal affordances (visual, audio, and haptic).

2      Functional Requirements

The Enhanced Visual Scene Descriptors shall:

  • Provide a mechanism to extend a Basic Visual Scene Descriptors instance without duplicating its structure.
  • Reference a base BVS instance through a unique identifier (BaseBVSID).
  • Allow per‑item enrichment by linking enhanced descriptors to BVS items.
  • Support enrichment of scene elements with:
    • depth information,
    • occlusion state,
    • interaction potential,
    • salience,
    • affordances.
  • Support multimodal affordances, including:
    • visual affordances,
    • audio affordances,
    • haptic affordances.
  • Represent feasibility and constraints of interactions.
  • Provide confidence and compliance indicators for inferred affordances.
  • Allow optional inclusion of processing metadata.
  • Ensure that enhancements are:
    • consistent with the referenced BVS,
    • composable,
    • extensible.
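The referencing requirements above can be illustrated with a minimal Python sketch. It is not part of the specification. The labels `BaseBVSID`, `EnhancedVisualSceneDescriptors`, `EVDItemID`, and `BVSItemID` are taken from the Semantics section; the `BVSID` and `Items` keys of the base instance, the sample values, and the function name `check_consistency` are assumptions made only for illustration.

```python
def check_consistency(evd: dict, bvs: dict) -> list[str]:
    """Return a list of problems; an empty list means the EVD is consistent."""
    problems = []
    # The EVD must point at the BVS it extends (hypothetical "BVSID" key).
    if evd.get("BaseBVSID") != bvs.get("BVSID"):
        problems.append("BaseBVSID does not match the base BVS identifier")
    # Identifiers of the items present in the base BVS (hypothetical "Items" layout).
    known = {item["BVSItemID"] for item in bvs.get("Items", [])}
    seen = set()
    for item in evd.get("EnhancedVisualSceneDescriptors", []):
        evd_id = item.get("EVDItemID")
        if evd_id in seen:
            problems.append(f"duplicate EVDItemID: {evd_id}")
        seen.add(evd_id)
        # Enrichment is per item: each enhanced item must link to an existing BVS item.
        if item.get("BVSItemID") not in known:
            problems.append(
                f"{evd_id} references unknown BVSItemID {item.get('BVSItemID')}"
            )
    return problems


# Illustrative data only; identifiers and values are invented.
bvs = {"BVSID": "bvs-001", "Items": [{"BVSItemID": "obj-1"}, {"BVSItemID": "obj-2"}]}
evd = {
    "BaseBVSID": "bvs-001",
    "EnhancedVisualSceneDescriptors": [
        {"EVDItemID": "evd-1", "BVSItemID": "obj-1", "Depth": 2.5},
        {"EVDItemID": "evd-2", "BVSItemID": "obj-9"},
    ],
}
print(check_consistency(evd, bvs))
```

The enhanced instance carries only identifiers and the added properties, so the base BVS structure is referenced rather than copied.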

3      Syntax

https://schemas.mpai.community/OSD/V1.5/data/EnhancedVisualSceneDescriptors.json

4      Semantics

Label                                    Description
Header                                   Identifies the schema version using the pattern OSD-EVD-Vx.y.
MInstanceID                              Identifies the MPAI instance associated with the descriptors.
EnhancedVisualSceneDescriptorsID         Unique identifier of the enhanced descriptor instance.
BaseBVSID                                Identifier of the Basic Visual Scene Descriptors instance being extended.
EnhancedVisualSceneDescriptorsSpaceTime  Spatial and temporal scope of the enhanced descriptors.
EnhancedVisualSceneDescriptors           Array of enhanced descriptor items associated with BVS items.
EVDItemID                                Unique identifier of the enhanced descriptor item.
BVSItemID                                Identifier of the corresponding Basic Visual Scene item being enriched.
Depth                                    Depth information associated with the visual object or scene element.
OcclusionFlag                            Indicates whether the object is occluded.
InteractionPotential                     Describes the potential of the object to support interaction.
Salience                                 Indicates the perceptual prominence of the object.
Affordance                               Describes possible interactions associated with the object, expressed as one of visual, audio, or haptic affordances.
VisualAffordanceItem                     Describes interaction possibilities based on visual properties (e.g., graspable, pushable).
AudioAffordanceItem                      Describes interaction possibilities based on audio cues (e.g., notification, urgency signal).
HapticAffordanceItem                     Describes interaction possibilities based on tactile interaction (e.g., draggable, tappable).
Tag (Affordance)                         Identifies the type of affordance.
Feasible                                 Indicates whether the affordance can be executed.
Constraints                              Specifies conditions limiting the affordance.
ConstraintItem                           Describes a constraint affecting feasibility (e.g., occluded, safety_violation).
Severity                                 Indicates the severity of the constraint (info, warning, error).
Referent                                 Identifier of the entity to which the affordance applies.
AudioReferent                            Identifies the audio object and associated scene context.
ChannelPolicy                            Specifies applicable audio channels.
Confidence                               Degree of confidence in the affordance inference.
Compliance                               Indicates whether the affordance complies with applicable rules.
FallbackApplied                          Indicates whether a fallback action has been used.
FallbackTag                              Indicates the fallback affordance type.
DataXMD                                  Processing and exchange metadata associated with the descriptor.
DescrMetadata                            Additional descriptive metadata (free text, up to 2048 characters).
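To make the relationship between an enhanced item, its affordances, and their constraints more concrete, the following Python sketch models a small subset of the labels above as dataclasses. It is only an illustration. The nesting, the snake_case attribute names, the `modality` discriminator, the scalar types chosen for `Depth`, `Salience`, and `Confidence`, and all sample values are assumptions; the normative structure is the one defined by the JSON Schema referenced in the Syntax section.

```python
import json
from dataclasses import dataclass, field, asdict

SEVERITIES = ("info", "warning", "error")  # values listed for Severity


@dataclass
class ConstraintItem:
    tag: str                 # e.g. "occluded", "safety_violation"
    severity: str = "info"

    def __post_init__(self):
        # Reject values outside the three severities named in the table.
        if self.severity not in SEVERITIES:
            raise ValueError(f"unknown Severity: {self.severity}")


@dataclass
class Affordance:
    modality: str            # assumed discriminator: "visual", "audio" or "haptic"
    tag: str                 # e.g. "graspable", "tappable"
    feasible: bool
    confidence: float        # assumed to lie in [0, 1]
    constraints: list = field(default_factory=list)


@dataclass
class EVDItem:
    evd_item_id: str
    bvs_item_id: str
    depth: float             # assumed scalar; the unit is not stated here
    occlusion_flag: bool
    salience: float
    affordances: list = field(default_factory=list)


# Illustrative item: a partly hidden object that is still considered graspable.
item = EVDItem(
    evd_item_id="evd-1",
    bvs_item_id="obj-1",
    depth=2.5,
    occlusion_flag=True,
    salience=0.8,
    affordances=[
        Affordance(
            modality="visual",
            tag="graspable",
            feasible=True,
            confidence=0.7,
            constraints=[ConstraintItem(tag="occluded", severity="warning")],
        )
    ],
)
print(json.dumps(asdict(item), indent=2))
```

How `Feasible` relates to the `Severity` of its constraints is not stated in the table, so the sketch stores both values without deriving one from the other.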

5     Conformance Testing

A Data instance Conforms with Enhanced Visual Scene Descriptors (OSD-EVD) if:

  1. The Data validates against the Enhanced Visual Scene Descriptors’ JSON Schema.
  2. All Data in the Enhanced Visual Scene Descriptors’ JSON Schema:
    1. Have the specified type.
    2. Validate against their JSON Schemas.
    3. Conform with their Visual Data Qualifiers.
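A few of the checks implied by the conditions above can be sketched in Python for an Enhanced Visual Scene Descriptors instance, the subject of this specification. The sketch is a lightweight pre-check, not a substitute for validation against the published JSON Schema. The function name `precheck`, the choice of fields treated as required, their Python types, and the reading of `x` and `y` in `OSD-EVD-Vx.y` as decimal integers are all assumptions; only the header pattern and the 2048-character limit on `DescrMetadata` come from the Semantics section.

```python
import re

# Assumes x and y in "OSD-EVD-Vx.y" are decimal integers.
HEADER_RE = re.compile(r"OSD-EVD-V[0-9]+\.[0-9]+")

# Hypothetical subset of fields and their assumed Python types.
EXPECTED_TYPES = {
    "Header": str,
    "EnhancedVisualSceneDescriptorsID": str,
    "BaseBVSID": str,
    "EnhancedVisualSceneDescriptors": list,
}


def precheck(data: dict) -> list[str]:
    """Return a list of problems found; an empty list means the pre-check passed."""
    errors = []
    # Type check for the assumed required fields.
    for name, expected in EXPECTED_TYPES.items():
        if name not in data:
            errors.append(f"missing {name}")
        elif not isinstance(data[name], expected):
            errors.append(f"{name} is not of type {expected.__name__}")
    # Header must match the schema version pattern in full.
    header = data.get("Header")
    if isinstance(header, str) and not HEADER_RE.fullmatch(header):
        errors.append("Header does not match OSD-EVD-Vx.y")
    # DescrMetadata is free text of up to 2048 characters.
    descr = data.get("DescrMetadata", "")
    if isinstance(descr, str) and len(descr) > 2048:
        errors.append("DescrMetadata exceeds 2048 characters")
    return errors


# Illustrative instances; identifiers are invented.
good = {
    "Header": "OSD-EVD-V1.5",
    "EnhancedVisualSceneDescriptorsID": "evd-001",
    "BaseBVSID": "bvs-001",
    "EnhancedVisualSceneDescriptors": [],
}
bad = dict(good, Header="OSD-BVS-V1.5", DescrMetadata="x" * 3000)
print(precheck(good))
print(precheck(bad))
```

A complete conformance test would additionally run a JSON Schema validator against the schema published at the URL in the Syntax section and check the Visual Data Qualifiers, which this sketch does not cover.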

6     Performance Assessment