1     Versions

V1.0

2     Functions

The Audio-Visual Scene Description Composite AIM (OSD-AVD):

  1. Receives the Audio-Visual Scene composed of:
    • Text
    • Audio Objects that are Speech Objects or generic Audio Objects whose source is assumed to be a point.
    • Visual Objects that are either Entities or generic Objects.
  2. Produces the Audio-Visual Scene Descriptors.

3      Reference Architecture

Figure 10 depicts the Reference Architecture.

 Figure 10 – Audio-Visual Scene Description

4      I/O Data

Table 5 specifies the Input and Output Data of the Audio-Visual Scene Description Composite AIM.

Table 5 – I/O Data of the Audio-Visual Scene Description Composite AIM

Input                             Description
Input Audio                       The audio scene captured by the Machine.
Input Visual                      The visual scene captured by the Machine.

Output                            Description
Audio-Visual Scene Descriptors    The Descriptors of all Audio, Visual, and Audio-Visual Objects.

5      SubAIMs

Audio Scene Description
Visual Scene Description
Audio-Visual Alignment
Audio-Visual Scene Multiplexing
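
The following is a minimal Python sketch of how the four SubAIMs listed above might be chained to turn Input Audio and Input Visual into Audio-Visual Scene Descriptors. It is an illustration only: the function names, the dictionary-based object representation, the position-matching alignment rule, and the wiring between SubAIMs are all assumptions, since the Reference Architecture figure is not reproduced in the text and the specification does not define a Python API.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Any

# Hypothetical object representation: each object is a dict with an "id"
# and a "position". The specification does not prescribe this layout.
SceneObject = dict[str, Any]


@dataclass
class AudioVisualSceneDescriptors:
    """Hypothetical container for the output of the Composite AIM."""
    audio: list[SceneObject] = field(default_factory=list)
    visual: list[SceneObject] = field(default_factory=list)
    audio_visual: list[SceneObject] = field(default_factory=list)


def audio_scene_description(input_audio: list[SceneObject]) -> list[SceneObject]:
    # Placeholder SubAIM: tags each incoming object as an audio object.
    return [{"kind": "audio", **obj} for obj in input_audio]


def visual_scene_description(input_visual: list[SceneObject]) -> list[SceneObject]:
    # Placeholder SubAIM: tags each incoming object as a visual object.
    return [{"kind": "visual", **obj} for obj in input_visual]


def audio_visual_alignment(
    audio: list[SceneObject], visual: list[SceneObject]
) -> list[SceneObject]:
    # Toy alignment rule (an assumption): an audio object and a visual
    # object are paired when they report the same position.
    return [
        {"kind": "audio_visual", "audio_id": a["id"], "visual_id": v["id"]}
        for a in audio
        for v in visual
        if a.get("position") == v.get("position")
    ]


def audio_visual_scene_multiplexing(
    audio: list[SceneObject],
    visual: list[SceneObject],
    aligned: list[SceneObject],
) -> AudioVisualSceneDescriptors:
    # Combines the three descriptor sets into a single output structure.
    return AudioVisualSceneDescriptors(audio=audio, visual=visual, audio_visual=aligned)


def audio_visual_scene_description(
    input_audio: list[SceneObject], input_visual: list[SceneObject]
) -> AudioVisualSceneDescriptors:
    # Assumed wiring: describe each modality, align them, then multiplex.
    audio = audio_scene_description(input_audio)
    visual = visual_scene_description(input_visual)
    aligned = audio_visual_alignment(audio, visual)
    return audio_visual_scene_multiplexing(audio, visual, aligned)
```

In this sketch, calling `audio_visual_scene_description` with one audio object and two visual objects, only one of which shares the audio object's position, would yield one Audio-Visual Object alongside the unchanged audio and visual lists.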

6      JSON Metadata

https://schemas.mpai.community/OSD/V1.0/AIMs/AudioVisualSceneDescription.json