1     Functions

2     Reference Model

3     I/O Data

4     Functions of AI Modules

5     I/O Data of AI Modules

6     AIWs, AIMs, and JSON Metadata

1      Functions

Television Media Analysis (OSD-TMA) analyses a video program provided as input and produces Audio-Visual Event Descriptors. These package a set of significant Audio-Visual Scene Descriptors that include Audio, Visual, or Audio-Visual scene changes, the IDs of speakers and faces with their spatial positions, and the text of the utterances.
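The packaging relationship described above can be pictured with a minimal sketch. The class and field names below are illustrative assumptions, not the normative OSD-TMA data formats; time values are assumed to be seconds from the start of the program.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class AVSceneDescriptor:
    """One significant scene (hypothetical layout, not the normative format)."""
    start_time: float                      # assumed: seconds from program start
    end_time: float
    change_type: str                       # "audio", "visual", or "audio-visual"
    speaker_id: Optional[str] = None
    face_id: Optional[str] = None
    bounding_box: Optional[Tuple[int, int, int, int]] = None  # x, y, width, height
    text: str = ""                         # text of the utterance, if any


@dataclass
class AVEventDescriptors:
    """Container that packages the Scene Descriptors of one program."""
    program_id: str
    scenes: List[AVSceneDescriptor] = field(default_factory=list)

    def add(self, scene: AVSceneDescriptor) -> None:
        # Keep the scenes ordered by start time after every insertion.
        self.scenes.append(scene)
        self.scenes.sort(key=lambda s: s.start_time)


event = AVEventDescriptors("example-program")
event.add(AVSceneDescriptor(12.0, 20.5, "audio-visual", "spk-2", "face-7",
                            (40, 30, 120, 120), "Good evening."))
event.add(AVSceneDescriptor(0.0, 12.0, "visual"))
start_times = [s.start_time for s in event.scenes]
```

In this sketch a scene without a recognised speaker or face simply leaves the corresponding fields empty.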

2      Reference Model

Figure 1 depicts the Reference Model of TV Media Analysis.

 

Figure 1 – Reference Model of OSD-TMA

3      I/O Data

Table 1 provides the input and output data of the TV Media Analysis Use Case:

Table 1 – I/O Data of Television Media Analysis

| Input | Description |
|---|---|
| Input Audio-Video-Text File | Audio-Video to be analysed with the help of the associated Text. |

| Output | Description |
|---|---|
| Audio-Visual Event Descriptors | Result of the analysis of the Input Audio-Video-Text File. |

4      Functions of AI Modules

Table 2 provides the functions of the AI Modules of the TV Media Analysis Use Case. Note that processing proceeds asynchronously: for example, Television Splitting separates audio and video for the entire duration of the file and then passes on the complete audio and video files.
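The whole-file hand-off mentioned in the note can be sketched as follows. This is only an illustration under stated assumptions: the three functions are stand-ins that return strings rather than media, and the use of a thread pool to start the downstream AIMs is a choice made for this example, not something the specification prescribes.

```python
from concurrent.futures import ThreadPoolExecutor


def television_splitting(av_text_file):
    """Stand-in splitter: returns the complete video, audio, and text 'files'."""
    return {
        "video": f"{av_text_file}.video",
        "audio": f"{av_text_file}.audio",
        "text": f"{av_text_file}.txt",
    }


def visual_change_detection(video_file):
    """Stand-in for the AIM that consumes the complete video file."""
    return f"images from {video_file}"


def audio_segmentation(audio_file):
    """Stand-in for the AIM that consumes the complete audio file."""
    return f"speech objects from {audio_file}"


def run(av_text_file):
    # The splitter finishes the whole file before anything downstream starts.
    files = television_splitting(av_text_file)
    # Once the complete files exist, the following AIMs can work independently.
    with ThreadPoolExecutor(max_workers=2) as pool:
        images = pool.submit(visual_change_detection, files["video"])
        speech = pool.submit(audio_segmentation, files["audio"])
        return images.result(), speech.result()


result = run("program")
```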

Table 2 – Functions of AI Modules of Television Media Analysis

AIM Function

Television Splitting
1. Receives an Audio-Visual File composed of:
   a. An Audio-Video component.
   b. A Text component.
2. Produces:
   a. Video file.
   b. Audio file.
   c. Text file.
3. When the files covering the full duration of the video are ready, Television Splitting informs the following AIMs.

Visual Change Detection
1. Receives Video file.
2. Iteratively:
   a. Looks for a video frame that conveys a scene that has changed with respect to the preceding scene (detection depends on a threshold).
   b. Assigns a video clip identifier to the video clip.
   c. Produces a set of images with StartTime and EndTime, each comprising:
      i. An image.
      ii. A time stamp.

Audio Segmentation
1. Receives Audio file.
2. Iteratively detects speaker changes.
   a. For each audio segment (from one change to the next):
      i. Detects that speech is present.
      ii. Assigns a speech segment ID and an anonymous speaker ID (i.e., the identity is unknown) to the segment.
      iii. Decides whether:
         1. The existing speaker has stopped.
         2. A new speaker has started a speech segment.
      iv. If a speaker has started a speech segment:
         1. Assigns a new speech segment ID.
         2. Checks whether the speaker is new or old in the session.
         3. If old, retains the old anonymous speaker ID.
         4. If new, assigns a new anonymous speaker ID.
   b. Produces a series of audio sequences, each of which contains:
      i. A speech segment.
      ii. Start and end time.
      iii. Anonymous Speaker ID.
      iv. Overlap information.

Face Identity Recognition
1. Receives a Text file.
2. Extracts semantic information from the Text file.
3. Receives a set of images per video clip.
4. For each image, identifies the bounding boxes.
5. Extracts faces from the bounding boxes.
6. Extracts the embeddings that represent a face.
7. Compares the embeddings with those stored in the face recognition database.
8. Associates the embeddings with a Face ID, taking into account the semantic information from the Text.

Speaker Identity Recognition
1. Receives a Text file.
2. Extracts semantic information from the Text file.
3. Receives a Speech Object and Speech Overlap information.
4. Extracts the embeddings that represent the speech segment.
5. Compares the embeddings with those stored in the speaker recognition database.
6. Associates the embeddings with a Speaker ID, taking into account the semantic information from the Text.

Audio-Visual Alignment
1. Receives:
   a. Face ID
   b. Bounding Box
   c. Face Time
   d. Speaker ID
   e. Speaker Time
2. Associates Speaker ID and Face ID.

Automatic Speech Recognition
1. Receives a Text file.
2. Extracts semantic information from the Text file.
3. Receives a Speech Object.
4. Produces the transcription of the speech payload, taking into account the semantic information from the Text.
5. Attaches time stamps to specific portions of the transcription.

Audio-Visual Scene Description
1. Receives:
   a. Bounding box coordinates, Face ID, and time stamps.
   b. Speaker ID and time stamps.
   c. Text and time stamps.
2. Reconciles Face ID and Speaker ID.
3. Produces Audio-Visual Scene Descriptors.

Audio-Visual Event Description
1. Receives Audio-Visual Scene Descriptors.
2. Produces Audio-Visual Event Descriptors.
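
Three of the steps in Table 2 can be illustrated with a minimal sketch: comparing an embedding with a stored database (Face and Speaker Identity Recognition), keeping or creating an anonymous speaker ID within a session (Audio Segmentation), and associating Speaker IDs with Face IDs (Audio-Visual Alignment). Everything here is an assumption made for illustration: the table does not prescribe cosine similarity, the 0.8 threshold, the `anon-N` naming, or largest temporal overlap as the alignment criterion, and the two-dimensional embeddings are toy values.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def match_identity(embedding, database, threshold=0.8):
    """Return the best-matching stored ID, or None if nothing reaches the threshold."""
    best_id, best_score = None, threshold
    for identity, stored in database.items():
        score = cosine(embedding, stored)
        if score >= best_score:
            best_id, best_score = identity, score
    return best_id


def assign_anonymous_speaker(embedding, session, threshold=0.8):
    """Reuse the ID of a speaker already heard in the session, else create a new one."""
    speaker_id = match_identity(embedding, session, threshold)
    if speaker_id is None:
        speaker_id = f"anon-{len(session) + 1}"   # hypothetical naming scheme
        session[speaker_id] = embedding
    return speaker_id


def overlap(a, b):
    """Length of the intersection of two (start, end) intervals."""
    return max(0.0, min(a[1], b[1]) - max(a[0], b[0]))


def align(faces, speakers):
    """Pair each (speaker_id, time) with the face whose time overlaps it most."""
    pairs = []
    for speaker_id, s_time in speakers:
        best = max(faces, key=lambda f: overlap(f[1], s_time), default=None)
        if best is not None and overlap(best[1], s_time) > 0:
            pairs.append((speaker_id, best[0]))
        else:
            pairs.append((speaker_id, None))      # no face visible while speaking
    return pairs


face_db = {"face-anna": [1.0, 0.0], "face-ben": [0.0, 1.0]}
face_id = match_identity([0.9, 0.1], face_db)

session = {}
ids = [assign_anonymous_speaker(e, session)
       for e in ([1.0, 0.0], [0.0, 1.0], [0.95, 0.05])]

faces = [("face-anna", (0.0, 10.0)), ("face-ben", (10.0, 18.0))]
speakers = [("spk-1", (1.0, 8.0)), ("spk-2", (11.0, 17.5)), ("spk-3", (20.0, 22.0))]
pairs = align(faces, speakers)
```

In this sketch the third embedding is close to the first one, so the speaker keeps the first anonymous ID, and a speaker segment with no overlapping face is left without a Face ID.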

5      I/O Data of AI Modules

Table 3 provides the I/O Data of the AI Modules of the TV Media Analysis Use Case.

Table 3 – I/O Data of AI Modules of Television Media Analysis

| AIM | Receives | Produces |
|---|---|---|
| Television Splitting | Audio-Video-Auxiliary Text | Audio; Video; AuxiliaryText |
| Visual Change Detection | Video | Image |
| Audio Segmentation | Speech | SpeechObjects; SpeechOverlap |
| Face Identity Recognition | Image; Time; AuxiliaryText | VisualSceneDescriptors with FaceID, FaceTime, BoundingBox |
| Speaker Identity Recognition | SpeechObject; SpeakerTime | SpeechSceneDescriptors with SpeakerID, SpeakerTime |
| Audio-Visual Alignment | SpeechSceneDescriptors; VisualSceneDescriptors | AVSceneDescriptor with AlignedFaceID, BoundingBox |
| Automatic Speech Recognition | SpeechOverlap; SpeechObject; SpeakerTime; AuxiliaryText | RecognisedText; SpeechObject; SpeakerTime |
| Audio-Visual Scene Description | BoundingBox; AlignedFaceID; SceneTime; SpeakerID; RecognisedText | AVSceneDescriptors |
| Audio-Visual Event Description | AVSceneDescriptors | AVEventDescriptors |
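
The data exchanged by the AIMs imply an execution order. The sketch below derives one such order with a topological sort; the dependency map is an assumption read from the AIM descriptions in Table 2 and the data names in Table 3, not a normative part of the specification. It requires Python 3.9 or later for `graphlib`.

```python
from graphlib import TopologicalSorter

# Each AIM maps to the AIMs whose output it needs (assumed reading of the tables).
DEPENDS_ON = {
    "Television Splitting": set(),
    "Visual Change Detection": {"Television Splitting"},
    "Audio Segmentation": {"Television Splitting"},
    "Face Identity Recognition": {"Visual Change Detection", "Television Splitting"},
    "Speaker Identity Recognition": {"Audio Segmentation", "Television Splitting"},
    "Automatic Speech Recognition": {"Audio Segmentation", "Television Splitting"},
    "Audio-Visual Alignment": {"Face Identity Recognition",
                               "Speaker Identity Recognition"},
    "Audio-Visual Scene Description": {"Audio-Visual Alignment",
                                       "Automatic Speech Recognition"},
    "Audio-Visual Event Description": {"Audio-Visual Scene Description"},
}

# static_order() yields every AIM after all the AIMs it depends on.
order = list(TopologicalSorter(DEPENDS_ON).static_order())
for position, aim in enumerate(order, start=1):
    print(position, aim)
```

Several valid orders exist because some AIMs are independent of each other (for example, Visual Change Detection and Audio Segmentation); any order returned respects the dependencies listed in the map.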

6      AIWs, AIMs, and JSON Metadata

Table 4 – AIWs, AIMs, and JSON Metadata

| AIW | AIM | Name | JSON |
|---|---|---|---|
| OSD-TMA | | Television Media Analysis | X |
| | OSD-TVS | Television Splitting | X |
| | OSD-VCD | Visual Change Detection | X |
| | MMC-AUS | Audio Segmentation | X |
| | PAF-FIR | Face Identity Recognition | X |
| | MMC-SIR | Speaker Identity Recognition | X |
| | OSD-AVA | Audio-Visual Alignment | X |
| | MMC-ASR | Automatic Speech Recognition | X |
| | OSD-AVE | Audio-Visual Event Description | X |
| | OSD-AVS | Audio-Visual Scene Description | X |