1 Functions 2 Reference Model 3 I/O Data
4 Functions of AI Modules 5 I/O Data of AI Modules 6 AIW, AIMs, and JSON Metadata
7 Reference Software 8 Conformance Testing 9 Performance Assessment

1 Functions

The Television Media Analysis (OSD-TMA) AI Workflow produces Audio-Visual Event Descriptors (OSD-AVE) composed of a set of Audio-Visual Scene Descriptors (AVS) that include

  1. Relevant Audio, Visual, or Audio-Visual Object
  2. IDs of speakers and ID of faces with their spatial positions
  3. Text from utterances

of a television program provided as input together with textual information that OSD-TMA may use to improve its performance.

OSD-TMA assumes that there is only one speaker at a time.

2 Reference Model

Figure 1 depicts the Reference Model of TV Media Analysis (OSD-TMA).

Figure 1 – Reference Model of OSD-TMA

3 Input/Output Data

Table 1 provides the input and output data of the TV Media Analysis Use Case:

Table 1 – I/O Data of TV Media Analysis (OSD-TMA)

Input Descriptions
Audio-Video Audio-Video to be analysed.
Text Object Text helping OSD-TMA AIMs to improve their performance.
Output Descriptions
Audio-Visual Event Descriptors Resulting analysis of Input Audio-Video-Text.

4 Functions of AI Modules

Table 2 provides the functions of the TV Media Analysis Use Case. Note that processing proceeds asynchronously, e.g., TV Splitter separates audio and video for the entire duration of the file and passes the entire audio and video files.

Table 2 – Functions of AI Modules of TV Media Analysis (OSD-TMA)

AIM Function
TelevisionSplitting Receives Audio-Visual File composed of:
– An Audio-Video component.
– A Text component.
Produces
– Video file
– Audio file
– Text file
When all the files of the full duration of the video are ready, AV Splitter informs the following AIMs.
VisualChangeDetection Receives Video file.
Iteratively:
– Looks for a video frame that conveys a scene changed from the preceding scene (depends on threshold).
– Assigns a video clip identifier to the video clip.
– Produces a set of images with StartTime and EndTime.
AudioSegmentation Receives Audio file.
Iteratively:
– For each audio segment (from one change to the next):
– Becomes aware that there is speech.
– Assigns a speech segment ID and anonymous speaker ID (i.e., the identity is unknown) to the segment.
– Decides whether:
– The existing speaker has stopped.
– A new speaker has started a speech segment.
– If a speaker has started a speech:
– Assigns a new speech segment ID.
– Check whether the speaker is new or old in the session.
– If old retain old anonymous speaker ID.
– If new assign a new anonymous speaker ID.
– Produces a series of audio sequences each of which contains:
– A speech segment.
– Start and end time.
– Anonymous Speaker ID.
– Overlap information
Face Identity Recognition – Receives a Text file.
– Extracts semantic information from the Text file.
– Receives a set of images per video clip.
– For each image identifies the bounding boxes.
– Extracts faces from the bounding boxes.
– Extracts the embeddings that represent a face.
– Compares the embeddings with those stored in the face recognition data base.
– Associates the embeddings with a face ID taking into account the semantic information from the Text.
Speaker Identity Recognition 1. Receives a Text file.
2. Extracts semantic information from the Text file.
3. Receives a Speech Object and Speech Overlap information.
4. Extracts the embeddings that represent the speech segment.
5. Compares the embeddings with those stored in the speaker recognition data base.
6. Associates the embeddings with a Speaker ID taking into account the semantic information from the Text.
Audio-Visual Alignment Receives:
– Face ID
– Bounding Box
– Face Time
– Speaker ID
– Speaker Time
Associates Speaker ID and Face ID
Automatic Speech Recognition – Receives a Text file.
– Extracts semantic information from the Text file.
– Receives a Speech Object.
– Produces the transcription of the speech payload taking into account the semantic information from the Text.
– Attaches time stamps to specific portions of the transcription.
Audio-Visual Scene Description Receives
– Bounding box coordinates, Face ID, and time stamps
– Speaker ID and time stamps.
– Reconciles Face ID and Speaker ID.
– Text and time stamps
Produces Audio-Visual Scene Descriptors
Audio-Visual Event Description Receives Audio-Visual Scene Descriptors
Produces Audio-Visual Event Descriptors

5 I/O Data of AI Modules

Table 3 provides the I/O Data of the AI Modules of the TV Media Analysis Use Case.

Table 3 – I/O Data of AI Modules of TV Media Analysis (OSD-TMA)

AIM Receives Produces
TelevisionSplitting AudioVideo
– AuxiliaryText
Audio
Video
– AuxiliaryText
VisualChangeDetection Video Image
AudioSegmentation Speech SpeechObjects
SpeechOverlap
Face Identity Recognition Image
Time
– AuxiliaryText
VisualSceneDescriptors with:
– FaceID
– FaceTime
BoundingBox
Speaker Identity Recognition SpeechObject
– SpeakerTime
SpeechSceneDescriptors: with:
– Speaker ID
– SpeakerTime
Audio-Visual Alignment SpeechOverlap
SpeechObject
– SpeakerTime
– AuxiliaryText
– Recognised Text
SpeechObject
– SpeakerTime
Automatic Speech Recognition SpeechSceneDescriptors
VisualSceneDescriptors
AVSceneDescriptor with:
– AlignedFaceID
BoundingBox
Audio-Visual Scene Description BoundingBox
– AlignedFaceID
– SceneTime
– SpeakerID
– RecognisedText
AVSceneDescriptors
Audio-Visual Event Description AVSceneDescriptors: AVEventDescriptors

6 AIW, AIMs, and JSON Metadata

Table 4 – AIW, AIMs, and JSON Metadata of TV Media Analysis (OSD-TMA)

AIW AIM Name JSON
OSD-TMA Television Media Analysis X
MMC-AUS Audio Segmentation X
OSD-AVA Audio-Visual Alignment X
OSD-AVE Audio-Visual Event Description X
OSD-AVS Audio-Visual Scene Description X
MMC-ASR Automatic Speech Recognition X
PAF-FIR Face Identity Recognition X
MMC-SIR Speaker Identity Recognition X
OSD-TVS Television Splitting X
OSD-VCD Visual Change Detection X

7 Reference Software

7.1 Disclaimers

  1. This OSD-TMA Reference Software Implementation is released with the BSD-3-Clause licence.
  2. The purpose of this Reference Software is to provide a working Implementation of OSD-TMA, not to provide a ready-to-use product.
  3. MPAI disclaims the suitability of the Software for any other purposes and does not guarantee that it is secure.
  4. Use of this Reference Software may require acceptance of licences from the respective repositories. Users shall verify that they have the right to use any third-party software required by this Reference Software.

7.2 Guide to the OSD-TMA code

The OSD-TMA Reference Software:

  1. Receives any input Audio-Visual file that can be demultiplexed by FFMPEG.
  2. Uses FFMPEG to extract:
    1. A WAV file (uncompressed audio).
    2. A video file.
  3. Produces descriptors of the input Audio-Visual file represented by Audio-Visual Event Descriptors.

The current OSD-TMA implementation does not support Auxiliary Text.

The OSD-TMA uses the following software components:

  1. RabbitMQ is an MPAI-AIF function but is adapted for use in OSD-TMA. This service starts, stops and removes all OSD-TMA AIMs automatically using Docker in Docker (DinD): the controller lets only one dockerized OSD-TMA AIM run at any time, which saves computing resources. Docker is used to automate the deployment of all OSD-TMA AIMs so that they can run in different environments.
  2. Portainer is a service that helps manage Docker containers, images, volumes, networks and stacks using a Graphical User Interface (GUI). Here it is mainly used to manage two containerized services: RabbitMQ and Controller
  3. Docker Compose has a yml file to build and run all OSD-TMA AIMs and the Module for the Controller
  4. Controller builds the Docker images of all OSD-TMA AIMs and of the Controller by reading a YAML file called compose.yml. Docker Compose helps run RabbitMQ and Controller as Docker containers.
  5. Python code manages the MPAI-AIF operation but is adapted for use in OSD-TMA.

The OSD-TMA software includes compose.yml and controller. They can be found at https://experts.mpai.community/software/mpai-aif/osd_tma/orchestrator. compose.yml starts RabbitMQ, nor Portainer. https://docs.portainer.io/start/install-ce describes how to install Portainer.

The OSD-TMA code includes compose.yml and the code for the Controller.

compose.yml starts RabbitMQ, not Portainer. Portainer installation is described.

The OSD-TMA code is released with a demo. Recorded demo: The saved JSON of OSD-AVE.

The OSD-TMA Reference Software has been developed by the MPAI AI Framework Development Committee (AIF-DC), in particular, Francesco Gallo (EURIX) and Mattia Bergagio (EURIX) for developing this software.

The implementation includes the AI Modules of the following table. AIMs in red achieve high performance if a GPU to used. However, the software can also operate without GPU.

AIM Name Library Dataset
MMC-ASR Speech recognition whisper-timestamped Labeled Faces in the Wild
MMC-SIR Speaker identification SpeechBrain VoxCeleb1
MMC-AUS Audio segmentation pyannote.audio
OSD-AVE Event descriptors
OSD-AVS Scene descriptors
OSD-TVS TV splitting ffmpeg
OSD-VCD Visual change PySceneDetect
PAF-FIR Face recognition DeepFace
RabbitMQ

RabbitMQ is an open-source message broker that implements AMQP (Advanced Message Queuing Protocol). It enables asynchronous communication among the modules, allows them to be decoupled and improves the scalability and resilience of the system.

RabbitMQ provides a number of key features:

  • Asynchronous messaging: RabbitMQ supports asynchronous messaging, which decouples senders from receivers and allows applications to communicate in a non-blocking manner.
  • Queue management: RabbitMQ implements message queuing, which helps distribute messages among consumers.
  • Reliability: RabbitMQ provides mechanisms to ensure message reliability, such as delivery acknowledgement and message persistence.
  • Flexible routing: RabbitMQ supports different types of exchanges for message routing, so messages can be routed according to various criteria.
  • Clustering: RabbitMQ supports clustering, which allows multiple RabbitMQ servers to be grouped into a single logical broker, thereby improving reliability and availability. RabbitMQ is usable in a wide range of applications, from simple data transfer among processes to more complex messaging patterns such as publish/subscribe and work queues.

Example of queues:

Portainer

Portainer is an open-source user interface that streamlines the management of Docker containers. It allows you to manage Docker containers, images, stacks, networks and volumes from a web interface; no need for a command-line interface. You can use Portainer to perform tasks such as building images, adding containers or stacks, and monitoring resource usage.

Example of containers in one of our stacks:

You can build and start all modules by running

docker compose –env-file .env -f compose.yml up

compose.yml is in repo.

.env lists the environment variables below:

TAG=0.1

LOG_LEVEL: log level. Can be CRITICAL, ERROR, WARNING, INFO, DEBUG, or NOTSET

PATH_SHARED: path of storage on disk

AI_FW_DIR: path of storage in container

HUGGINGFACE_TOKEN: HuggingFace token. Used by MMC-AUS

GIT_NAME: Git username

GIT_TOKEN: Git token

MIDDLEWARE_USER: RabbitMQ username

MIDDLEWARE_PASSWORD: RabbitMQ password

MIDDLEWARE_PORT: RabbitMQ port

MIDDLEWARE_EXTERNAL_PORT: outer RabbitMQ port

MIDDLEWARE_VIRTUALHOST: RabbitMQ virtual host. Default: MIDDLEWARE_VIRTUALHOST=/

EXCHANGE_NAME: RabbitMQ exchange name

To start module x run

docker compose –env-file .env -f compose.yml up controller rabbitmq_dc x

Send a suitable payload to the queue of module x, as shown for the controller here:

Controller

The controller is the module that orchestrates the workflow. It sends a message to the right module for processing.

The controller receives the input message from queue queue_module_controller. Then, the controller sends a tweaked message to OSD-TVS. This message contains the input message. OSD-TVS sends its output message to the controller, which sends a tweaked output message to MMC-AUS.

Modules are called in this order:

  1. OSD-TVS
  2. MMC-AUS
  3. OSD-VCD
  4. MMC-SIR
  5. MMC-ASR
  6. PAF-FIR
  7. OSD-AVA
  8. OSD-AVS
  9. OSD-AVE.
Examples/README.md

examples/README.md details how to run examples are here.

7.3 Acknowledgements

This version of the OSD-TMA Reference Software has been developed by the MPAI AI Framework Development Committee (AIF-DC).

8 Conformance Testing

Table 2 provides the Conformance Testing Method for OSD-TMA AIW. Conformance Testing of the individual AIMs of the OSD-TMA Composite AIM are given by the individual AIM Specification.

If a schema contains references to other schemas, conformance of data for the primary schema implies that any data referencing a secondary schema shall also validate against the relevant schema, if present and conform with the Qualifier, if present.

Table 2 – Conformance Testing Method for OSD-TMA AIW

Receives Audio-Video Shall validate against Audio-Visual Object schema.
Also, Audio-Visual Data shall conform with Audio-Visual Qualifier.
Text Object Shall validate against Text Object schema.
Also, Text Data shall conform with Text Qualifier.
Produces Audio-Visual Event Descriptors Shall validate against Audio-Visual Event Descriptors schema.

9 Performance Assessment