| 1 Definition | 2 Functional Requirements | 3 Syntax |
| 4 Semantics | 5 Conformance Testing | 6 Performance Assessment |
1 Definition
The Enhanced Visual Scene Descriptors (EVD) augment a previously defined Basic Visual Scene Descriptors (BVS) instance with additional semantic, perceptual, and interaction‑oriented information, thus enabling systems to move from descriptive scene representation to actionable and interaction‑aware scene understanding.
An Enhanced Visual Scene Descriptors instance establishes a link to a base BVS instance and provides enriched descriptors for selected visual scene items. These enhancements include properties such as depth, occlusion, salience, interaction potential, and multimodal affordances (visual, audio, and haptic).
2 Functional Requirements
The Enhanced Visual Scene Descriptors shall:
- Provide a mechanism to extend a Basic Visual Scene Descriptors instance without duplicating its structure.
- Reference a base BVS instance through a unique identifier (BaseBVSID).
- Allow per‑item enrichment by linking enhanced descriptors to BVS items.
- Support enrichment of scene elements with:
  - depth information,
  - occlusion state,
  - interaction potential,
  - salience,
  - affordances.
- Support multimodal affordances, including:
  - visual affordances,
  - audio affordances,
  - haptic affordances.
- Represent feasibility and constraints of interactions.
- Provide confidence and compliance indicators for inferred affordances.
- Allow optional inclusion of processing metadata.
- Ensure that enhancements are:
  - consistent with the referenced BVS,
  - composable,
  - extensible.
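The per-item linking and consistency requirements above can be illustrated with a short Python sketch. The dictionary layout, the identifier values, and the helper name `find_unlinked_items` are assumptions made for illustration only; the normative structure is the one defined by the JSON Schema.

```python
from typing import Iterable


def find_unlinked_items(evd: dict, bvs_item_ids: Iterable[str]) -> list[str]:
    """Return the EVDItemIDs whose BVSItemID is not found in the base BVS."""
    known = set(bvs_item_ids)
    return [
        item.get("EVDItemID", "<missing EVDItemID>")
        for item in evd.get("EnhancedVisualSceneDescriptors", [])
        if item.get("BVSItemID") not in known
    ]


# Hypothetical instance: identifiers and values are invented for illustration.
evd_example = {
    "BaseBVSID": "bvs-001",
    "EnhancedVisualSceneDescriptors": [
        {"EVDItemID": "evd-1", "BVSItemID": "obj-1", "OcclusionFlag": False},
        {"EVDItemID": "evd-2", "BVSItemID": "obj-9", "OcclusionFlag": True},
    ],
}

# Item IDs assumed to have been read from the referenced BVS instance.
unlinked = find_unlinked_items(evd_example, ["obj-1", "obj-2"])
print(unlinked)  # ['evd-2']
```

Because the enhanced items carry only a reference (`BVSItemID`) rather than a copy of the BVS item, the base structure is not duplicated, and a check like this is one way to detect enhancements that no longer match the referenced BVS.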
3 Syntax
https://schemas.mpai.community/OSD/V1.5/data/EnhancedVisualSceneDescriptors.json
4 Semantics
| Label | Description |
|---|---|
| Header | Identifies the schema version using the pattern OSD-EVD-Vx.y. |
| MInstanceID | Identifies the MPAI instance associated with the descriptors. |
| EnhancedVisualSceneDescriptorsID | Unique identifier of the enhanced descriptor instance. |
| BaseBVSID | Identifier of the Basic Visual Scene Descriptors instance being extended. |
| EnhancedVisualSceneDescriptorsSpaceTime | Spatial and temporal scope of the enhanced descriptors. |
| EnhancedVisualSceneDescriptors | Array of enhanced descriptor items associated with BVS items. |
| EVDItemID | Unique identifier of the enhanced descriptor item. |
| BVSItemID | Identifier of the corresponding Basic Visual Scene item being enriched. |
| Depth | Depth information associated with the visual object or scene element. |
| OcclusionFlag | Indicates whether the object is occluded. |
| InteractionPotential | Describes the potential of the object to support interaction. |
| Salience | Indicates the perceptual prominence of the object. |
| Affordance | Describes possible interactions associated with the object, expressed as one of visual, audio, or haptic affordances. |
| VisualAffordanceItem | Describes interaction possibilities based on visual properties (e.g., graspable, pushable). |
| AudioAffordanceItem | Describes interaction possibilities based on audio cues (e.g., notification, urgency signal). |
| HapticAffordanceItem | Describes interaction possibilities based on tactile interaction (e.g., draggable, tappable). |
| Tag (Affordance) | Identifies the type of affordance. |
| Feasible | Indicates whether the affordance can be executed. |
| Constraints | Specifies conditions limiting the affordance. |
| ConstraintItem | Describes a constraint affecting feasibility (e.g., occluded, safety_violation). |
| Severity | Indicates the severity of the constraint (info, warning, error). |
| Referent | Identifier of the entity to which the affordance applies. |
| AudioReferent | Identifies the audio object and associated scene context. |
| ChannelPolicy | Specifies applicable audio channels. |
| Confidence | Degree of confidence in the affordance inference. |
| Compliance | Indicates whether the affordance complies with applicable rules. |
| FallbackApplied | Indicates whether a fallback action has been used. |
| FallbackTag | Indicates the fallback affordance type. |
| DataXMD | Processing and exchange metadata associated with the descriptor. |
| DescrMetadata | Additional descriptive metadata (free text, up to 2048 characters). |
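To make the labels in the table more concrete, the following Python sketch builds one possible Enhanced Visual Scene Descriptors instance and checks its Header. The nesting of the fields, the value types (for example, numeric Depth, Salience and Confidence), the version string, and all identifier values are assumptions for illustration; only the label names and the example values `graspable`, `occluded` and `warning` come from the table, and the JSON Schema remains the normative reference.

```python
import json
import re

# Header pattern read from "OSD-EVD-Vx.y", taking x and y to be non-negative
# integers (an assumption; the JSON Schema defines the actual pattern).
HEADER_PATTERN = re.compile(r"^OSD-EVD-V\d+\.\d+$")

# Illustrative instance: nesting, value types and identifiers are assumed.
evd_instance = {
    "Header": "OSD-EVD-V1.5",
    "MInstanceID": "minstance-01",
    "EnhancedVisualSceneDescriptorsID": "evd-0001",
    "BaseBVSID": "bvs-0001",
    "EnhancedVisualSceneDescriptors": [
        {
            "EVDItemID": "evd-item-1",
            "BVSItemID": "bvs-item-7",
            "Depth": 1.8,
            "OcclusionFlag": True,
            "Salience": 0.6,
            "Affordance": {
                "VisualAffordanceItem": {
                    "Tag": "graspable",
                    "Feasible": False,
                    "Constraints": [
                        {"ConstraintItem": "occluded", "Severity": "warning"}
                    ],
                    "Referent": "bvs-item-7",
                    "Confidence": 0.72,
                    "Compliance": True,
                }
            },
        }
    ],
}

header_ok = HEADER_PATTERN.match(evd_instance["Header"]) is not None
first_item = evd_instance["EnhancedVisualSceneDescriptors"][0]
print(header_ok)
print(json.dumps(first_item["Affordance"], indent=2))
```

In this sketch the affordance is marked as not feasible because the object is occluded, which shows how Feasible, Constraints and Severity might work together.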
5 Conformance Testing
A Data instance Conforms with Enhanced Visual Scene Descriptors (OSD-EVD) if:
- The Data validates against the Enhanced Visual Scene Descriptors’ JSON Schema.
- All Data in the Enhanced Visual Scene Descriptors’ JSON Schema:
  - Have the specified type.
  - Validate against their JSON Schemas.
  - Conform with their Visual Data Qualifiers.
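The "have the specified type" condition can be approximated for an Enhanced Visual Scene Descriptors instance with a small Python pre-check. This is a sketch only: the set of fields treated as required and their expected types are assumptions inferred from the Semantics table, not taken from the JSON Schema, and the helper name `type_errors` is hypothetical. Full conformance still requires validation against the published schema.

```python
# Assumed required fields and types, inferred from the Semantics table for
# illustration; the JSON Schema is the normative source.
ASSUMED_TOP_LEVEL_TYPES = {
    "Header": str,
    "EnhancedVisualSceneDescriptorsID": str,
    "BaseBVSID": str,
    "EnhancedVisualSceneDescriptors": list,
}


def type_errors(instance: dict) -> list[str]:
    """Collect messages for missing fields or values of an unexpected type."""
    errors = []
    for name, expected in ASSUMED_TOP_LEVEL_TYPES.items():
        if name not in instance:
            errors.append(f"{name}: missing")
        elif not isinstance(instance[name], expected):
            errors.append(f"{name}: expected {expected.__name__}")
    return errors


# Hypothetical candidate with one deliberately wrong type.
candidate = {
    "Header": "OSD-EVD-V1.5",
    "EnhancedVisualSceneDescriptorsID": "evd-0001",
    "BaseBVSID": 42,  # should be a string under the assumptions above
    "EnhancedVisualSceneDescriptors": [],
}

errors = type_errors(candidate)
print(errors)  # ['BaseBVSID: expected str']
```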