1 Functions
The Audio-Visual Scene Description (OSD-MSD) Composite AIM receives Audio-Visual Objects, the Descriptors of the Scene the Objects belong to, and their Space-Time information as inputs and produces the Descriptors of a Scene composed of Audio-Visual Objects and Scenes. The OSD-MSD AIM may also receive an Alert conveying information on potential anomalies in the input Audio-Visual Objects.
| | Data | Description |
|---|---|---|
| Receives | Space-Time | Of output Audio-Visual Scene Descriptors. |
| | Speech Objects | Individual Speech Objects. |
| | Audio Objects | Individual Audio Objects. |
| | Visual Objects | Individual Visual Objects. |
| | Audio-Visual Scene Descriptors | Of Scene to be augmented. |
| Augments | Audio-Visual Scene Descriptors | With the input Objects. |
| Produces | Audio-Visual Scene Descriptors | The augmented Audio-Visual Scene Descriptors. |
2 Reference Model
Figure 1 depicts the Reference Model of the Audio-Visual Scene Description (OSD-MSD) Composite AIM.

Figure 1 – The Audio-Visual Scene Description (OSD-MSD) Composite AIM
3 I/O Data
Table 1 gives the Input and Output Data of the Audio-Visual Scene Description (OSD-MSD) Composite AIM.
Table 1 – I/O Data of the Audio-Visual Scene Description (OSD-MSD) Composite AIM
| Input | Description |
|---|---|
| Space-Time | Space-Time information of output Audio-Visual Scene Descriptors. |
| Speech Objects | Speech Objects. |
| Audio Objects | Audio Objects. |
| Visual Objects | Visual Objects. |
| Audio-Visual Scene Descriptors | The Audio-Visual Descriptors of the Scene that is part of the target Audio-Visual Scene. |

| Output | Description |
|---|---|
| Audio-Visual Scene Descriptors | The Audio-Visual Descriptors of the Scene. |
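
The inputs and output of Table 1 can be pictured as plain data structures. The following is a minimal Python sketch under stated assumptions, not the Reference Software: the class names, the field names (`object_id`, `kind`, `position`, `time`), and the `augment` function are hypothetical, and the normative data formats are those defined by the JSON schemas.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

# All names below are illustrative assumptions, not part of MPAI-OSD.

@dataclass(frozen=True)
class SceneObject:
    object_id: str
    kind: str                             # "speech", "audio" or "visual"
    position: Tuple[float, float, float]  # simplified spatial attribute


@dataclass
class SceneDescriptors:
    time: float                           # simplified stand-in for Space-Time
    objects: List[SceneObject] = field(default_factory=list)


def augment(scene: SceneDescriptors,
            new_objects: List[SceneObject],
            time: float) -> SceneDescriptors:
    """Return new descriptors holding the input scene's objects plus new_objects.

    The input scene is left unmodified.
    """
    return SceneDescriptors(time=time,
                            objects=list(scene.objects) + list(new_objects))
```

Returning a new `SceneDescriptors` instead of mutating the input keeps the Scene to be augmented and the augmented Scene distinct, which mirrors the separate input and output rows of Table 1.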
4 SubAIMs
4.1 Functions of SubAIMs
Figure 2 depicts the Reference Model of the Audio-Visual Scene Description (OSD-MSD) Composite AIM.

Figure 2 – The Audio-Visual Scene Description (OSD-MSD) Composite AIM
Table 2 gives the functions of the Audio-Visual Scene Description SubAIMs.
Table 2 – Functions of the Audio-Visual Scene Description (OSD-MSD) SubAIMs
| SubAIM | Function |
|---|---|
| Speech Scene Description | Produces the Descriptors of a Scene composed of Speech Objects and Scenes. |
| Audio Scene Description | Produces the Descriptors of a Scene composed of Audio Objects and Scenes. |
| Visual Scene Description | Produces the Descriptors of a Scene composed of Visual Objects and Scenes. |
| Audio-Visual Alignment | Produces the Descriptors of an Audio-Visual Scene whose Objects have compatible Identifiers if they have the same Position. |
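
The Audio-Visual Alignment row of Table 2 can be illustrated with a small sketch. It assumes, purely for illustration, that "compatible Identifiers" means pairing an audio object with a visual object found at the same Position, and that "same Position" means equal within a tolerance. The function name `align_identifiers`, the `tolerance` parameter, and the dictionary layout are hypothetical and do not come from the specification.

```python
from typing import Dict, List, Tuple

Position = Tuple[float, float, float]


def align_identifiers(audio: Dict[str, Position],
                      visual: Dict[str, Position],
                      tolerance: float = 0.1) -> List[Tuple[str, str]]:
    """Pair each audio object with the first unpaired visual object whose
    position matches on every coordinate within `tolerance`.

    Returns a list of (audio_id, visual_id) pairs. Objects with no match
    are left out of the result.
    """
    pairs: List[Tuple[str, str]] = []
    used = set()
    for a_id, a_pos in audio.items():
        for v_id, v_pos in visual.items():
            if v_id in used:
                continue
            if all(abs(a - v) <= tolerance for a, v in zip(a_pos, v_pos)):
                pairs.append((a_id, v_id))
                used.add(v_id)
                break
    return pairs
```

This greedy first-match pairing is only one possible policy. An implementation could equally choose the nearest candidate or resolve conflicts globally.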
4.2 Operation
The OSD-MSD AIM receives input media and an Audio-Visual Scene. It produces an Audio-Visual Scene.
4.3 I/O Data of SubAIMs
Table 3 gives, for each SubAIM, the Input and Output Data of the Audio-Visual Scene Description.
Table 3 – I/O Data of the Audio-Visual Scene Description (OSD-MSD) SubAIMs
4.4 AIMs and JSON Metadata
Table 4 provides the links to the AIM specifications and JSON schemas. AIM1 indicates the Composite AIM and AIM2 its SubAIMs.
Table 4 – AIMs and JSON Metadata of the Audio-Visual Scene Description (OSD-MSD)
| AIM1 | AIM2 | Name | JSON |
|---|---|---|---|
| OSD-MSD | | Audio-Visual Scene Description | X |
| | OSD-SSD | Speech Scene Description | X |
| | OSD-ASD | Audio Scene Description | X |
| | OSD-VSD | Visual Scene Description | X |
| | OSD-AVA | Audio-Visual Alignment | X |
5 JSON Metadata
https://schemas.mpai.community/OSD/V1.5/AIMs/AudioVisualSceneDescription.json
6 Profiles
No Profiles.
7 Reference Software
7.1 Disclaimers
- This OSD-AVS Reference Software Implementation is released with the BSD-3-Clause licence.
- The purpose of this OSD-AVS Reference Software is to show a working Implementation of OSD-AVS, not to provide a ready-to-use product.
- MPAI disclaims the suitability of the Software for any other purposes and does not guarantee that it is secure.
- Use of this Reference Software may require acceptance of licences from the respective repositories. Users shall verify that they have the right to use any third-party software required by this Reference Software.
7.2 Guide to the OSD-AVS code
OSD-NSS arranges the aligned visual and speech objects into Audio-Visual Scene Descriptors.
This Reference Software for the OSD-MSD AI Module is intended for developers who are familiar with Python, Docker, and RabbitMQ.
The OSD-MSD Reference Software is found at the MPAI GitLab site. It contains:
- src: a folder with the Python code implementing the AIM.
- Dockerfile: a Docker file containing only the libraries required to build the Docker image and run the container.
- requirements.txt: dependencies installed in the Docker image.
7.3 Acknowledgements
This OSD-AVS Reference Software has been developed by the MPAI AI Framework Development Committee (AIF-DC).
8 Conformance Testing
Table 5 provides the Conformance Testing Method for the Audio-Visual Scene Description (OSD-MSD) Composite AIM. Conformance Testing of the individual SubAIMs of the OSD-MSD Composite AIM is given by the individual AIM specifications.
If a schema contains references to other schemas, conformance of data for the primary schema implies that any data referencing a secondary schema shall also validate against the relevant schema, if present, and conform with the Qualifier, if present.
Table 5 – Conformance Testing Method for the Audio-Visual Scene Description (OSD-MSD) Composite AIM
| | Data | Conformance requirement |
|---|---|---|
| Receives | Space-Time | Shall validate against Space-Time schema. |
| | Speech Objects | Shall validate against Speech Objects schema. Speech Data shall conform with Qualifier. |
| | Audio Objects | Shall validate against Audio Objects schema. Audio Data shall conform with Qualifier. |
| | Visual Objects | Shall validate against Visual Objects schema. Visual Data shall conform with Qualifier. |
| Produces | Audio-Visual Scene Descriptors | Shall validate against Audio-Visual Scene Descriptors schema. |
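
The rule that conformance to a primary schema implies conformance of nested data to the secondary schemas it references can be sketched as a recursive check. The code below is a toy stand-in using only the Python standard library. It is not a JSON Schema validator, it does not cover Qualifier conformance, and the schema names and keys in `SCHEMAS` are hypothetical. An actual conformance test would validate data against the published MPAI JSON schemas with a full JSON Schema validator.

```python
from typing import Any, Dict, List

# Toy "schemas": required keys, plus keys whose values must satisfy another
# toy schema. Hypothetical stand-ins for the real MPAI JSON schemas.
SCHEMAS: Dict[str, Dict[str, Any]] = {
    "SceneDescriptors": {
        "required": ["SpaceTime", "Objects"],
        "refs": {"SpaceTime": "SpaceTime"},
    },
    "SpaceTime": {"required": ["Time"], "refs": {}},
}


def conformance_errors(data: Any, schema_name: str, path: str = "$") -> List[str]:
    """Return a list of error messages; an empty list means the data conforms."""
    schema = SCHEMAS[schema_name]
    if not isinstance(data, dict):
        return [f"{path}: expected an object"]
    errors = [f"{path}: missing '{key}'"
              for key in schema["required"] if key not in data]
    # Recurse into every present field that references a secondary schema.
    for key, ref in schema["refs"].items():
        if key in data:
            errors += conformance_errors(data[key], ref, f"{path}.{key}")
    return errors
```

For example, `conformance_errors({"SpaceTime": {}, "Objects": []}, "SceneDescriptors")` reports the missing `Time` key inside the nested `SpaceTime` object, even though the top-level object has all of its own required keys.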
9 Performance Assessment
Not part of this specification.