1 Versions
V1.0
2 Functions
The Audio-Visual Scene Description Composite AIM (OSD-AVD):
- Receives the Audio-Visual Scene composed of:
  - Text
  - Audio Objects that are Speech Objects or generic Audio Objects whose source is assumed to be a point.
  - Visual Objects that are either Entities or generic Objects.
- Produces the Audio-Visual Scene Descriptors.
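As an illustration only, the scene content listed above could be modeled with a few Python types. All class and field names below are hypothetical and are not taken from the specification; the sketch only mirrors the distinctions made in the list (Speech versus generic Audio Objects with a point source, Entities versus generic Visual Objects).

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Tuple


class AudioKind(Enum):
    """Hypothetical tag for the two kinds of Audio Object named in the text."""
    SPEECH = "speech"
    GENERIC = "generic"


class VisualKind(Enum):
    """Hypothetical tag for the two kinds of Visual Object named in the text."""
    ENTITY = "entity"
    GENERIC = "generic"


@dataclass
class AudioObject:
    kind: AudioKind
    # The source is assumed to be a point, so a single position suffices.
    # The (x, y, z) representation is an assumption made for this sketch.
    position: Tuple[float, float, float]


@dataclass
class VisualObject:
    kind: VisualKind


@dataclass
class AudioVisualScene:
    """Illustrative container for the scene received by the Composite AIM."""
    text: str = ""
    audio_objects: List[AudioObject] = field(default_factory=list)
    visual_objects: List[VisualObject] = field(default_factory=list)

    def speech_objects(self) -> List[AudioObject]:
        """Return only the Audio Objects tagged as Speech Objects."""
        return [a for a in self.audio_objects if a.kind is AudioKind.SPEECH]
```

Tagging the kind with an enum rather than separate subclasses keeps the sketch short; a real implementation would follow the data types defined by the specification.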
3 Reference Architecture
Figure 10 depicts the Reference Architecture.
Figure 10 – Audio-Visual Scene Description
4 I/O Data
Table 5 specifies the Input and Output Data of the Audio-Visual Description.
Table 5 – I/O Data of the Audio-Visual Description Composite AIM
Input | Description |
Input Audio | The audio scene captured by Machine. |
Input Visual | The visual scene captured by Machine. |
Output | Description |
Audio-Visual Scene Descriptors | The Descriptors of all Audio, Visual, and Audio-Visual Objects. |
5 SubAIMs
- Audio Scene Description
- Visual Scene Description
- Audio-Visual Alignment
- Audio-Visual Scene Multiplexing
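One way the four SubAIMs might be chained is sketched below. The wiring is an assumption: the Reference Architecture figure is not reproduced in this text, so the order shown (audio and visual description first, then alignment, then multiplexing) is only a plausible reading of the SubAIM names. The function name, the parameter names, and the `Descriptors` type are hypothetical.

```python
from typing import Any, Callable, Dict

# Hypothetical stand-in for whatever descriptor format the SubAIMs exchange.
Descriptors = Dict[str, Any]


def describe_audio_visual_scene(
    input_audio: Any,
    input_visual: Any,
    audio_scene_description: Callable[[Any], Descriptors],
    visual_scene_description: Callable[[Any], Descriptors],
    audio_visual_alignment: Callable[[Descriptors, Descriptors], Descriptors],
    audio_visual_scene_multiplexing: Callable[
        [Descriptors, Descriptors, Descriptors], Descriptors
    ],
) -> Descriptors:
    """Run the four SubAIMs in an assumed order and return the combined result."""
    # Describe each modality independently (assumed to be the first stage).
    audio = audio_scene_description(input_audio)
    visual = visual_scene_description(input_visual)
    # Relate audio objects to visual objects (assumed to need both descriptions).
    alignment = audio_visual_alignment(audio, visual)
    # Combine everything into the Audio-Visual Scene Descriptors.
    return audio_visual_scene_multiplexing(audio, visual, alignment)
```

Passing the SubAIMs in as callables keeps the sketch independent of any particular implementation, so each stage can be replaced by a stub when testing the composition.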
6 JSON Metadata
https://schemas.mpai.community/OSD/V1.0/AIMs/AudioVisualSceneDescription.json
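The metadata file could be retrieved and parsed as sketched below, using only the Python standard library. The helper names are hypothetical, and nothing is assumed about the keys inside the JSON document beyond it being a JSON object at the top level.

```python
import json
import urllib.request
from typing import Any, Dict

# URL as given in the text above.
METADATA_URL = (
    "https://schemas.mpai.community/OSD/V1.0/AIMs/AudioVisualSceneDescription.json"
)


def parse_metadata(raw: str) -> Dict[str, Any]:
    """Parse JSON text and check that the top level is an object."""
    data = json.loads(raw)
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object at the top level")
    return data


def fetch_metadata(url: str = METADATA_URL, timeout: float = 10.0) -> Dict[str, Any]:
    """Download the metadata file and parse it (requires network access)."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return parse_metadata(response.read().decode("utf-8"))
```

Keeping parsing separate from downloading lets the parsing step be exercised offline with a local copy of the file.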