| 1 Function | 2 Reference Model | 3 Input/Output Data | 
| 4 SubAIMs | 5 JSON Metadata | 6 Profiles | 
| 7 Reference Software | 8 Conformance Testing | 9 Performance Assessment | 
1 Functions
The Audio-Visual Alignment (OSD-AVA) AIM provides the Descriptors of an Audio-Visual Scene whose Audio Objects, Speech Objects, 3D Model Objects, and Visual Objects have compatible Identifiers if they have the same Position.
| Receives | Speech Scene Descriptors | Descriptors of potentially present Speech Scene. |
| | Audio Scene Descriptors | Descriptors of potentially present Audio Scene. |
| | Visual Scene Descriptors | Descriptors of potentially present Visual Scene. |
| | 3D Model Scene Descriptors | Descriptors of potentially present 3D Model Scene. |
| Aligns | Speech, Audio, and Visual Objects | Sharing the same Spatial Attitude. |
| Produces | Audio-Visual Scene Descriptors | Where Speech Objects, Audio Objects, 3D Model Objects, and Visual Objects have compatible Identifiers if they have the same Spatial Attitude. |
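The alignment rule in the table above can be illustrated with a minimal sketch. It is not the Reference Software: the record layout, the `align_objects` name, the `tolerance` parameter, and the `AV<n>` group labels are all assumptions made for illustration, and a plain position tuple stands in for the full Spatial Attitude.

```python
from typing import Dict, List, Tuple

# Hypothetical, simplified record: (object_id, media_type, position).
# Real Scene Descriptors carry a full Spatial Attitude; only a position
# tuple is used here for illustration.
Position = Tuple[float, float, float]
SceneObject = Tuple[str, str, Position]


def align_objects(objects: List[SceneObject],
                  tolerance: float = 0.1) -> Dict[str, str]:
    """Map each object ID to a shared group ID when positions coincide.

    Two objects are treated as sharing a position when every coordinate
    differs by at most `tolerance` (an assumed threshold).
    """
    groups: List[Tuple[Position, str]] = []  # (reference position, group ID)
    mapping: Dict[str, str] = {}
    for obj_id, _media_type, pos in objects:
        for centre, group_id in groups:
            if all(abs(a - b) <= tolerance for a, b in zip(pos, centre)):
                mapping[obj_id] = group_id  # joins an existing group
                break
        else:
            # No existing group is close enough: open a new one.
            group_id = f"AV{len(groups) + 1}"
            groups.append((pos, group_id))
            mapping[obj_id] = group_id
    return mapping


objects = [
    ("S1", "speech", (1.0, 0.0, 2.0)),
    ("V1", "visual", (1.05, 0.0, 2.0)),
    ("A1", "audio", (5.0, 0.0, 1.0)),
]
mapping = align_objects(objects)
```

In this sketch the Speech Object `S1` and the Visual Object `V1` receive the same group label because their positions fall within the tolerance, while `A1` is placed in a group of its own.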
2 Reference Model
Figure 1 specifies the Reference Model of the Audio-Visual Alignment (OSD-AVA) AIM.

Figure 1 – Reference Model of the Audio-Visual Alignment (OSD-AVA) AIM
3 Input/Output Data
Table 1 specifies the Input and Output Data of the Audio-Visual Alignment (OSD-AVA) AIM.
Table 1 – I/O Data of the Audio-Visual Alignment AIM
| Input | Description | 
| Speech Scene Descriptors | The IDs and the geometry of the Speech Objects of the Scene. | 
| Audio Scene Descriptors | The IDs and the geometry of the Audio Objects of the Scene. | 
| Visual Scene Descriptors | The IDs and the geometry of the Visual Objects of the Scene. | 
| 3D Model Scene Descriptors | The IDs and the geometry of the 3D Model Objects of the Scene. | 
| Output | Description | 
| Audio-Visual Scene Descriptors | The IDs and the geometry of the Audio, Speech, 3D Model, Visual and Audio-Visual Objects of the Scene. | 
4 SubAIMs
No SubAIMs.
5 JSON Metadata
https://schemas.mpai.community/OSD/V1.4/AIMs/AudioVisualAlignment.json
6 Profiles
No profiles.
7 Reference Software
7.1 Disclaimers
- This OSD-AVA Reference Software Implementation is released with the BSD-3-Clause licence.
- The purpose of this Reference Software is to show a working Implementation of OSD-AVA, not to provide a ready-to-use product.
- MPAI disclaims the suitability of this Reference Software for any other purposes and does not guarantee that it is secure.
- Use of this Reference Software may require acceptance of licences from the respective repositories. Users shall verify that they have the right to use any third-party software required by this Reference Software.
7.2 Guide to OSD-AVA code
OSD-AVA arranges the output Visual Objects and Speech Objects with the corresponding Time information: scene cuts/transitions and speakers’ turns. Each Object is bounded by two adjacent times from a list of unique times that are either 1) scene cuts/transitions or 2) starts and ends of speakers’ turns.
This Reference Software for the OSD-AVA AI Module is intended for developers who are familiar with Python, Docker, and RabbitMQ.
OSD-AVA computes the segments as the unique intervals obtained from the scene bounds and the speech segments, and outputs the corresponding Visual Objects and Speech Objects.
The OSD-AVA Reference Software is available at the MPAI GitLab site. It contains:
- src: a folder with the Python code implementing the AIM
- Dockerfile: a Docker file containing only the libraries required to build the Docker image and run the container
- requirements.txt: dependencies installed in the Docker image.
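The segment computation described in this guide can be sketched as follows. This is an illustrative reading of the text, not the code in the `src` folder: the function name `unique_intervals` and the input shapes (a list of scene cut times and a list of `(start, end)` speaker turns, in seconds) are assumptions.

```python
from typing import List, Tuple


def unique_intervals(scene_bounds: List[float],
                     speech_segments: List[Tuple[float, float]]
                     ) -> List[Tuple[float, float]]:
    """Return intervals bounded by adjacent unique times.

    The times are the scene cuts/transitions plus the start and end of
    every speaker turn; duplicates are merged before pairing.
    """
    times = set(scene_bounds)
    for start, end in speech_segments:
        times.update((start, end))
    ordered = sorted(times)
    # Pair each time with its successor: (t0, t1), (t1, t2), ...
    return list(zip(ordered, ordered[1:]))


segments = unique_intervals(
    scene_bounds=[0.0, 10.0, 20.0],
    speech_segments=[(2.0, 5.0), (10.0, 12.0)],
)
# segments == [(0.0, 2.0), (2.0, 5.0), (5.0, 10.0), (10.0, 12.0), (12.0, 20.0)]
```

The time 10.0 appears both as a scene cut and as the start of a speaker turn, so it contributes a single boundary.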
7.3 Acknowledgements
This version of the OSD-AVA Reference Software has been developed by the MPAI AI Framework Development Committee (AIF-DC).
8 Conformance Testing
Table 2 provides the Conformance Testing Method for the OSD-AVA AIM.
If a schema contains references to other schemas, conformance of data to the primary schema implies that any data referencing a secondary schema shall also validate against the relevant schema, if present, and conform with the Qualifier, if present.
Table 2 – Conformance Testing Method for OSD-AVA AIM
| Receives | Speech Scene Descriptors | Shall validate against Speech Scene Descriptors schema |
| | Audio Scene Descriptors | Shall validate against Audio Scene Descriptors schema |
| | Visual Scene Descriptors | Shall validate against Visual Scene Descriptors schema |
| | 3D Model Scene Descriptors | Shall validate against 3D Model Scene Descriptors schema |
| Produces | Audio-Visual Scene Descriptors | Shall validate against AV Scene Descriptors schema |
9 Performance Assessment
Performance Assessment of an OSD-AVA AIM Implementation shall be performed using a dataset of scenes containing Audio and/or Speech and Visual objects.
The Performance Assessment Report of an OSD-AVA AIM Implementation shall include:
- The Identifier of the OSD-AVA AIM whose Performance is being Assessed.
- The Identifier of the scene dataset used, which includes the identifiers of the aligned objects.
- The data type of the scenes: analogue or digital, with or without separated objects.
- The Performance of the OSD-AVA AIM being Assessed, expressed as the number of times it:
  - Correctly identifies as aligned objects that the dataset declares as aligned, divided by the total number of aligned objects (Truly aligned objects).
  - Incorrectly identifies as aligned objects that the dataset does not declare as aligned, divided by the total number of aligned objects (Falsely aligned objects).
  - Incorrectly identifies as non-aligned objects that the dataset declares as aligned, divided by the total number of aligned objects (Missed aligned objects).
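The three ratios above could be computed along the following lines. This is a sketch under stated assumptions: an alignment is represented as a pair of object identifiers, "Falsely aligned" is read as pairs the implementation reports but the dataset does not declare, and the names `alignment_rates`, `declared`, and `predicted` are hypothetical.

```python
from typing import Dict, Set, Tuple

# Assumed representation: one alignment = one pair of object identifiers.
Pair = Tuple[str, str]


def alignment_rates(declared: Set[Pair],
                    predicted: Set[Pair]) -> Dict[str, float]:
    """Compute the three ratios, each divided by the number of declared alignments."""
    if not declared:
        raise ValueError("the dataset declares no aligned objects")
    total = len(declared)
    return {
        # Reported as aligned and declared as aligned.
        "truly_aligned": len(predicted & declared) / total,
        # Reported as aligned but not declared as aligned (assumed reading).
        "falsely_aligned": len(predicted - declared) / total,
        # Declared as aligned but not reported as aligned.
        "missed_aligned": len(declared - predicted) / total,
    }


declared = {("S1", "V1"), ("S2", "V2"), ("A1", "V3"), ("S3", "V4")}
predicted = {("S1", "V1"), ("S2", "V2"), ("S3", "V9")}
rates = alignment_rates(declared, predicted)
# rates == {"truly_aligned": 0.5, "falsely_aligned": 0.25, "missed_aligned": 0.5}
```

Because all three ratios share the same denominator, the Truly aligned and Missed aligned values always sum to 1, whereas the Falsely aligned value can exceed 1 if an implementation reports many spurious alignments.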
 
