1 Functions
2 Reference Model
3 I/O Data
4 Functions of AI Modules
5 I/O Data of AI Modules
6 AIW, AIMs, and JSON Metadata
7 Reference Software
8 Conformance Testing
9 Performance Assessment

1 Functions

The Television Media Analysis (OSD-TMA) AI Workflow produces Audio-Visual Event Descriptors (OSD-AVE) composed of a set of Audio-Visual Scene Descriptors (AVS) that include

  1. Relevant Audio, Visual, or Audio-Visual Object
  2. IDs of speakers and IDs of faces, with their spatial positions
  3. Text from utterances

of a television program provided as input together with textual information that OSD-TMA may use to improve its performance.

OSD-TMA assumes that there is only one speaker at a time.

2 Reference Model

Figure 1 depicts the Reference Model of TV Media Analysis.

Figure 1 – Reference Model of OSD-TMA

3 Input/Output Data

Table 1 provides the input and output data of the TV Media Analysis Use Case:

Table 1 – I/O Data of Television Media Analysis

Input
– Audio-Video: Audio-Video to be analysed.
– Text Object: Text helping OSD-TMA AIMs to improve their performance.
Output
– Audio-Visual Event Descriptors: Resulting analysis of the Input Audio-Video-Text.

4 Functions of AI Modules

Table 2 provides the functions of the AI Modules of the TV Media Analysis Use Case. Note that processing proceeds asynchronously; e.g., the TV Splitter separates audio and video for the entire duration of the file and passes on the complete audio and video files.

Table 2 – Functions of AI Modules of Television Media Analysis

AIM Function
TelevisionSplitting Receives an Audio-Visual File composed of:
– An Audio-Video component.
– A Text component.
Produces:
– A Video file.
– An Audio file.
– A Text file.
When all the files covering the full duration of the video are ready, the TV Splitter informs the downstream AIMs.
VisualChangeDetection Receives the Video file.
Iteratively:
– Looks for a video frame whose scene has changed from the preceding scene (subject to a threshold).
– Assigns a video clip identifier to the video clip.
– Produces a set of images with StartTime and EndTime.
AudioSegmentation Receives the Audio file.
Iteratively, for each audio segment (from one change to the next):
– Detects whether speech is present.
– Assigns a speech segment ID and an anonymous Speaker ID (i.e., the identity is unknown) to the segment.
– Decides whether:
– The current speaker has stopped.
– A new speaker has started a speech segment.
– If a new speech segment has started:
– Assigns a new speech segment ID.
– Checks whether the speaker is new or already known in the session:
– If known, retains the existing anonymous Speaker ID.
– If new, assigns a new anonymous Speaker ID.
Produces a series of audio sequences, each of which contains:
– A speech segment.
– Start and end time.
– An anonymous Speaker ID.
– Overlap information.
Face Identity Recognition – Receives a Text file.
– Extracts semantic information from the Text file.
– Receives a set of images per video clip.
– For each image, identifies the bounding boxes.
– Extracts faces from the bounding boxes.
– Extracts the embeddings that represent a face.
– Compares the embeddings with those stored in the face recognition database.
– Associates the embeddings with a Face ID, taking into account the semantic information from the Text.
Speaker Identity Recognition – Receives a Text file.
– Extracts semantic information from the Text file.
– Receives a Speech Object and Speech Overlap information.
– Extracts the embeddings that represent the speech segment.
– Compares the embeddings with those stored in the speaker recognition database.
– Associates the embeddings with a Speaker ID, taking into account the semantic information from the Text.
Audio-Visual Alignment Receives:
– Face ID
– Bounding Box
– Face Time
– Speaker ID
– Speaker Time
Associates Speaker ID and Face ID.
Automatic Speech Recognition – Receives a Text file.
– Extracts semantic information from the Text file.
– Receives a Speech Object.
– Produces the transcription of the speech payload taking into account the semantic information from the Text.
– Attaches time stamps to specific portions of the transcription.
Audio-Visual Scene Description Receives:
– Bounding box coordinates, Face ID, and time stamps.
– Speaker ID and time stamps.
– Text and time stamps.
Reconciles Face ID and Speaker ID.
Produces Audio-Visual Scene Descriptors.
Audio-Visual Event Description Receives Audio-Visual Scene Descriptors.
Produces Audio-Visual Event Descriptors.

5 I/O Data of AI Modules

Table 3 provides the I/O Data of the AI Modules of the TV Media Analysis Use Case.

Table 3 – I/O Data of AI Modules of Television Media Analysis

TelevisionSplitting
Receives: AudioVideo, AuxiliaryText.
Produces: Audio, Video, AuxiliaryText.

VisualChangeDetection
Receives: Video.
Produces: Image.

AudioSegmentation
Receives: Audio.
Produces: SpeechObjects, SpeechOverlap.

Face Identity Recognition
Receives: Image, Time, AuxiliaryText.
Produces: VisualSceneDescriptors with: FaceID, FaceTime, BoundingBox.

Speaker Identity Recognition
Receives: SpeechObject, SpeakerTime.
Produces: SpeechSceneDescriptors with: SpeakerID, SpeakerTime.

Audio-Visual Alignment
Receives: SpeechSceneDescriptors, VisualSceneDescriptors.
Produces: AVSceneDescriptor with: AlignedFaceID, BoundingBox.

Automatic Speech Recognition
Receives: SpeechObject, SpeechOverlap, SpeakerTime, AuxiliaryText.
Produces: RecognisedText, SpeechObject, SpeakerTime.

Audio-Visual Scene Description
Receives: BoundingBox, AlignedFaceID, SceneTime, SpeakerID, RecognisedText.
Produces: AVSceneDescriptors.

Audio-Visual Event Description
Receives: AVSceneDescriptors.
Produces: AVEventDescriptors.
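
For intuition only, an AVSceneDescriptor carrying the data items listed above might be serialised along the following lines. This is an illustrative sketch, not the normative OSD-AVS JSON schema; all field names and values are assumptions based on the table:

    {
      "SceneTime": {"StartTime": "00:01:12.040", "EndTime": "00:01:18.320"},
      "AlignedFaceID": "face_03",
      "BoundingBox": [412, 96, 640, 360],
      "SpeakerID": "speaker_03",
      "RecognisedText": "Good evening and welcome to the programme."
    }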

6 AIW, AIMs, and JSON Metadata

Table 4 – AIW, AIMs, and JSON Metadata

AIW AIM Name JSON
OSD-TMA Television Media Analysis X
MMC-AUS Audio Segmentation X
OSD-AVA Audio-Visual Alignment X
OSD-AVE Audio-Visual Event Description X
OSD-AVS Audio-Visual Scene Description X
MMC-ASR Automatic Speech Recognition X
PAF-FIR Face Identity Recognition X
MMC-SIR Speaker Identity Recognition X
OSD-TVS Television Splitting X
OSD-VCD Visual Change Detection X

7 Reference Software

7.1 Disclaimers

  1. This OSD-TMA Reference Software Implementation is released with the BSD-3-Clause licence.
  2. The purpose of this Reference Software is to provide a working Implementation of OSD-TMA, not to provide a ready-to-use product.
  3. MPAI disclaims the suitability of the Software for any other purposes and does not guarantee that it is secure.
  4. Use of this Reference Software may require acceptance of licences from the respective repositories. Users shall verify that they have the right to use any third-party software required by this Reference Software.

7.2 Guide to the OSD-TMA code

The OSD-TMA Reference Software:

  1. Receives any input Audio-Visual file that can be demultiplexed by FFmpeg.
  2. Uses FFmpeg to extract:
    1. A WAV file (uncompressed audio).
    2. A video file.
  3. Produces descriptors of the input Audio-Visual file represented by Audio-Visual Event Descriptors.
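
As an illustration of step 2, the extraction might be performed with FFmpeg invocations of the following kind, here wrapped in Python. The file names, sample rate, and codec settings are assumptions, not necessarily those of the Reference Software:

    import subprocess

    # Extract the audio track as an uncompressed WAV file (16 kHz mono assumed).
    subprocess.run(["ffmpeg", "-i", "input.mp4", "-vn", "-acodec", "pcm_s16le",
                    "-ar", "16000", "-ac", "1", "audio.wav"], check=True)

    # Extract the video track without re-encoding it.
    subprocess.run(["ffmpeg", "-i", "input.mp4", "-an", "-c:v", "copy",
                    "video.mp4"], check=True)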

The current OSD-TMA implementation does not support Auxiliary Text.

The OSD-TMA uses the following software components:

  1. RabbitMQ is an MPAI-AIF function adapted for use in OSD-TMA; it carries the messages exchanged among the OSD-TMA AIMs.
  2. Portainer is a service that helps manage Docker containers, images, volumes, networks and stacks using a Graphical User Interface (GUI). Here it is mainly used to manage two containerized services: RabbitMQ and the Controller.
  3. Docker Compose builds and runs all OSD-TMA AIMs and the Controller module from a YAML file called compose.yml; Docker automates the deployment of all OSD-TMA AIMs so that they can run in different environments.
  4. The Controller starts, stops and removes all OSD-TMA AIMs automatically using Docker in Docker (DinD): it lets only one dockerized OSD-TMA AIM run at any time, which saves computing resources.
  5. Python code manages the MPAI-AIF operation, adapted for use in OSD-TMA.

The OSD-TMA software includes compose.yml and the code for the Controller; both can be found at https://experts.mpai.community/software/mpai-aif/osd_tma/orchestrator. compose.yml starts RabbitMQ, not Portainer. https://docs.portainer.io/start/install-ce describes how to install Portainer.

The OSD-TMA code is released with a demo: a recorded run and the saved JSON of the OSD-AVE output.

The OSD-TMA Reference Software has been developed by the MPAI AI Framework Development Committee (AIF-DC); in particular, Francesco Gallo (EURIX) and Mattia Bergagio (EURIX) developed this software.

The implementation includes the AI Modules of the following table. Some AIMs achieve high performance only if a GPU is used; however, the software can also operate without a GPU.

AIM Name Library Dataset
MMC-ASR Speech recognition whisper-timestamped
MMC-SIR Speaker identification SpeechBrain VoxCeleb1
MMC-AUS Audio segmentation pyannote.audio
OSD-AVE Event descriptors
OSD-AVS Scene descriptors
OSD-TVS TV splitting FFmpeg
OSD-VCD Visual change PySceneDetect
PAF-FIR Face recognition DeepFace Labeled Faces in the Wild
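
For orientation, visual change detection of the kind OSD-VCD performs can be invoked with PySceneDetect as in the following sketch; the file name and threshold are illustrative, not the settings of the Reference Software:

    from scenedetect import detect, ContentDetector

    # Split the video into clips at visual changes; each result is a
    # (start, end) pair of timecodes.
    scenes = detect("video.mp4", ContentDetector(threshold=27.0))
    for start, end in scenes:
        print(f"clip from {start.get_seconds():.2f}s to {end.get_seconds():.2f}s")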
RabbitMQ

RabbitMQ is an open-source message broker that implements AMQP (Advanced Message Queuing Protocol). It enables asynchronous communication among the modules, allows them to be decoupled and improves the scalability and resilience of the system.

RabbitMQ provides a number of key features:

  • Asynchronous messaging: RabbitMQ supports asynchronous messaging, which decouples senders from receivers and allows applications to communicate in a non-blocking manner.
  • Queue management: RabbitMQ implements message queuing, which helps distribute messages among consumers.
  • Reliability: RabbitMQ provides mechanisms to ensure message reliability, such as delivery acknowledgement and message persistence.
  • Flexible routing: RabbitMQ supports different types of exchanges for message routing, so messages can be routed according to various criteria.
  • Clustering: RabbitMQ supports clustering, which allows multiple RabbitMQ servers to be grouped into a single logical broker, thereby improving reliability and availability.

RabbitMQ is usable in a wide range of applications, from simple data transfer among processes to more complex messaging patterns such as publish/subscribe and work queues.

Example of queues: each module has its own queue, e.g., queue_module_controller for the Controller.
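
As a sketch, an AIM can consume its queue with pika, the Python client for RabbitMQ. The host and the processing logic are assumptions; queue_module_controller is the controller's queue described below:

    import pika

    # Connect to the RabbitMQ broker and declare the module's queue.
    connection = pika.BlockingConnection(pika.ConnectionParameters(host="localhost"))
    channel = connection.channel()
    channel.queue_declare(queue="queue_module_controller")

    # Handle each incoming message, then acknowledge it.
    def on_message(ch, method, properties, body):
        print("received:", body)
        ch.basic_ack(delivery_tag=method.delivery_tag)

    channel.basic_consume(queue="queue_module_controller",
                          on_message_callback=on_message)
    channel.start_consuming()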

Portainer

Portainer is an open-source user interface that streamlines the management of Docker containers. It allows you to manage Docker containers, images, stacks, networks and volumes from a web interface; no need for a command-line interface. You can use Portainer to perform tasks such as building images, adding containers or stacks, and monitoring resource usage.

Example of containers in one of our stacks: the RabbitMQ and Controller containers.

You can build and start all modules by running

docker compose --env-file .env -f compose.yml up

compose.yml is in the repository.
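
For orientation, the relevant part of such a compose.yml might look like this sketch. The service names controller and rabbitmq_dc appear in the commands in this guide; the image, build context, ports, and volumes are assumptions:

    services:
      rabbitmq_dc:
        image: rabbitmq:3-management
        ports:
          - "${MIDDLEWARE_EXTERNAL_PORT}:${MIDDLEWARE_PORT}"
      controller:
        build: ./controller
        env_file: .env
        volumes:
          - "${PATH_SHARED}:${AI_FW_DIR}"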

.env lists the environment variables below:

TAG=0.1

LOG_LEVEL: log level. Can be CRITICAL, ERROR, WARNING, INFO, DEBUG, or NOTSET

PATH_SHARED: path of storage on disk

AI_FW_DIR: path of storage in container

HUGGINGFACE_TOKEN: HuggingFace token. Used by MMC-AUS

GIT_NAME: Git username

GIT_TOKEN: Git token

MIDDLEWARE_USER: RabbitMQ username

MIDDLEWARE_PASSWORD: RabbitMQ password

MIDDLEWARE_PORT: RabbitMQ port

MIDDLEWARE_EXTERNAL_PORT: outer RabbitMQ port

MIDDLEWARE_VIRTUALHOST: RabbitMQ virtual host. Default: MIDDLEWARE_VIRTUALHOST=/

EXCHANGE_NAME: RabbitMQ exchange name
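
Assembled, a .env file might look like this sketch; all values are placeholders to replace with your own (5672 is RabbitMQ's default port):

    TAG=0.1
    LOG_LEVEL=INFO
    PATH_SHARED=/path/to/shared/storage
    AI_FW_DIR=/app/storage
    HUGGINGFACE_TOKEN=<your HuggingFace token>
    GIT_NAME=<your Git username>
    GIT_TOKEN=<your Git token>
    MIDDLEWARE_USER=<RabbitMQ username>
    MIDDLEWARE_PASSWORD=<RabbitMQ password>
    MIDDLEWARE_PORT=5672
    MIDDLEWARE_EXTERNAL_PORT=5672
    MIDDLEWARE_VIRTUALHOST=/
    EXCHANGE_NAME=<RabbitMQ exchange name>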

To start module x, run

docker compose --env-file .env -f compose.yml up controller rabbitmq_dc x

Send a suitable payload to the queue of module x, as sketched below for the controller.
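
A minimal sketch using pika; the connection parameters and payload fields are assumptions, while queue_module_controller is the controller's input queue described below:

    import json
    import pika

    # Connect to the RabbitMQ broker; credentials come from the .env variables.
    connection = pika.BlockingConnection(pika.ConnectionParameters(host="localhost"))
    channel = connection.channel()

    # Publish an illustrative JSON payload to the controller's queue.
    payload = {"file": "input.mp4"}  # payload fields are assumptions
    channel.basic_publish(exchange="",
                          routing_key="queue_module_controller",
                          body=json.dumps(payload))
    connection.close()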

Controller

The controller is the module that orchestrates the workflow. It sends a message to the right module for processing.

The controller receives the input message from queue queue_module_controller. Then, the controller sends a tweaked message to OSD-TVS. This message contains the input message. OSD-TVS sends its output message to the controller, which sends a tweaked output message to MMC-AUS.

Modules are called in this order:

  1. OSD-TVS
  2. MMC-AUS
  3. OSD-VCD
  4. MMC-SIR
  5. MMC-ASR
  6. PAF-FIR
  7. OSD-AVA
  8. OSD-AVS
  9. OSD-AVE
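
For intuition, this order could be captured in a lookup like the following sketch; the module identifiers and the helper are hypothetical, not the actual controller code:

    # Queue suffixes of the AIMs in workflow order (illustrative naming).
    PIPELINE = ["osd_tvs", "mmc_aus", "osd_vcd", "mmc_sir",
                "mmc_asr", "paf_fir", "osd_ava", "osd_avs", "osd_ave"]

    def next_module(current):
        """Return the module that follows `current`, or None after osd_ave."""
        i = PIPELINE.index(current)
        return PIPELINE[i + 1] if i + 1 < len(PIPELINE) else None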
Examples/README.md

examples/README.md details how to run the examples.

7.3 Acknowledgements

This version of the OSD-TMA Reference Software has been developed by the MPAI AI Framework Development Committee (AIF-DC).

8 Conformance Testing

Table 5 provides the Conformance Testing Method for the OSD-TMA AIW. Conformance Testing of the individual AIMs of the OSD-TMA Composite AIM is given by the individual AIM Specifications.

If a schema contains references to other schemas, conformance of data with the primary schema implies that any data referencing a secondary schema shall also validate against that schema, if present, and shall conform with the relevant Qualifier, if present.

Table 5 – Conformance Testing Method for OSD-TMA AIW

Receives
– Audio-Video: Shall validate against the Audio-Visual Object schema. Also, the Audio-Visual Data shall conform with the Audio-Visual Qualifier.
– Text Object: Shall validate against the Text Object schema. Also, the Text Data shall conform with the Text Qualifier.
Produces
– Audio-Visual Event Descriptors: Shall validate against the Audio-Visual Event Descriptors schema.
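
As a sketch, such schema validation can be performed with the Python jsonschema package; the file names are illustrative:

    import json
    from jsonschema import validate, ValidationError

    # Load the produced Audio-Visual Event Descriptors and their schema
    # (illustrative file names).
    with open("ave_descriptors.json") as f:
        instance = json.load(f)
    with open("ave_descriptors_schema.json") as f:
        schema = json.load(f)

    try:
        validate(instance=instance, schema=schema)
        print("Conforms to the Audio-Visual Event Descriptors schema.")
    except ValidationError as err:
        print("Does not conform:", err.message)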

9 Performance Assessment