Context-based Audio Enhancement

Context-based Audio Enhancement (MPAI-CAE) improves the user experience for several audio-related applications including entertainment, communication, teleconferencing, gaming, post-production, restoration etc. in a variety of contexts such as in the home, in the car, on-the-go, in the studio etc.



Clarifications of Call for Use Cases and Functional Requirements

MPAI-5 has approved the MPAI-CAE Use Cases and Functional Requirements (N151) as an attachment to the Call for Technologies (N152). However, CAE-DC, the source of the document, has identified some issues that are worth a clarification. This clarification is posted on the MPAI web site and will be communicated directly to those who have informed the Secretariat of their intention to respond.

General issue

MPAI understands that the scope of N151 is very broad. Therefore it reiterates the point made in N152 that:

Completeness of a proposal for a Use Case is a merit because reviewers can assess how well the components are integrated. However, submissions will be judged on the merit of what is proposed. An excellent submission on a single technology may be considered instead of a submission that is complete but whose technologies perform less well.

Emotion-Enhanced Speech (Use case #1 in N151)

The Functional Requirements of the Use Case do not explicitly indicate the form in which speech without emotion and the Emotion enter the EES system. Possible modalities are:

  1. A speech file and a separate Emotion file where the sequence of Emotions carries time stamps
  2. An interleaved stream of speech and Emotions

MPAI welcomes proposals addressing these issues.

A Respondent’s decision not to answer this point will not influence the assessment of the rest of their submission.

Audio Recording Preservation (Use case #2 in N151)

It is not clear what information the Musicological classifier provides. MPAI therefore welcomes proposals for the semantics of the information conveyed in the Text output of the Musicological classifier.

A Respondent’s decision not to answer this point will not influence the assessment of the rest of their submission.

Enhanced Audioconference Experience (Use case #3 in N151)

The function of the Speech detection and separation AIM is described as “Separation of relevant speech vs non-speech signals”.

As the description is possibly misleading, we inform submitters that the correct description of the AIM should be “Separation of relevant speech vs other signals (including unwanted speech)”.

The assessment of submissions by Respondents who base their submission on the text in N151 will not be affected by this clarification.

References

  1. MPAI-CAE Use Cases & Functional Requirements; MPAI N151; https://mpai.community/standards/mpai-cae/#UCFR
  2. MPAI-CAE Call for Technologies, MPAI N152; https://mpai.community/standards/mpai-cae/#Technologies
  3. MPAI-CAE Framework Licence, MPAI N171; https://mpai.community/standards/mpai-cae/#Licence

Use Cases and Functional Requirements

This document is also available in MS Word format MPAI-CAE Use Cases and Functional Requirements

1       Introduction

2       The MPAI AI Framework (MPAI-AIF) 

3       Use Cases

3.1       Emotion-Enhanced Speech (EES)

3.2       Audio Recording Preservation (ARP)

3.3       Enhanced Audioconference Experience (EAE)

3.4       Audio-on-the-go (AOG)

4       Functional Requirements

4.1       Introduction

4.2       Emotion-Enhanced Speech

4.2.1       Reference architecture

4.2.2       AI Modules

4.2.3       I/O interfaces of AI Modules

4.2.4       Technologies and Functional Requirements

4.3       Audio Recording Preservation

4.3.1       Reference architecture

4.3.2       AI Modules

4.3.3       I/O interfaces of AI Modules

4.3.4       Technologies and Functional Requirements

4.3.5       Information about Audio enhancement performance

4.4       Enhanced Audioconference Experience

4.4.1       Reference architecture

4.4.2       AI Modules

4.4.3       I/O interfaces of AI Modules

4.4.4       Technologies and Functional Requirements

4.5       Audio-on-the-go

4.5.1       Reference architecture

4.5.2       AI Modules

4.5.3       I/O interfaces of AI Modules

4.5.4       Technologies and Functional Requirements

5       Potential common technologies

6       Terminology

7       References

1        Introduction

Moving Picture, Audio and Data Coding by Artificial Intelligence (MPAI) is an international association with the mission to develop AI-enabled data coding standards. Research has shown that data coding with AI-based technologies is more efficient than with existing technologies.

The MPAI approach to developing AI data coding standards is based on the definition of standard interfaces of AI Modules (AIM). AIMs operate on input and output data both having a standard format. AIMs can be combined and executed in an MPAI-specified AI-Framework according to the emerging standard MPAI-AIF. A Call for MPAI-AIF Technologies [2] with associated Use Cases and Functional Requirements [1] was issued on 2020/12/16 and is now closed.

While AIMs must expose standard interfaces to be able to operate in an MPAI AI Framework, their performance may differ depending on the technologies used to implement them. MPAI believes that competing developers striving to provide more performing proprietary and interoperable AIMs will promote horizontal markets of AI solutions that build on and further promote AI innovation.

This document is a collection of Use Cases and corresponding Functional Requirements in the MPAI Context-based Audio Enhancement (MPAI-CAE) application area. The Use Cases in this area help improve the audio user experience for several application spaces that include entertainment, communication, teleconferencing, gaming, post-production, restoration etc. in a variety of contexts such as in the home, in the car, on-the-go, in the studio etc.

Currently MPAI has identified four Use Cases falling in the Context-based Audio Enhancement application area:

  1. Emotion-Enhanced Speech (EES)
  2. Audio Recording Preservation (ARP)
  3. Enhanced Audioconference Experience (EAE)
  4. Audio-on-the-go (AOG)

This document is to be read in conjunction with the MPAI-CAE Call for Technologies (CfT) [4] as it provides the functional requirements of all technologies identified as required to implement the current MPAI-CAE Use Cases. Respondents to the MPAI-CAE CfT are requested to make sure that their responses are aligned with the functional requirements expressed in this document.

In the future MPAI may issue other Calls for Technologies falling in the scope of MPAI-CAE to support identified Use Cases. Currently these are:

  1. Efficient 3D sound
  2. (Serious) gaming
  3. Normalization of TV volume
  4. Automotive
  5. Audio mastering
  6. Speech communication
  7. Audio (post-)production

It should also be noted that some technologies identified in this document are the same as, similar to, or related to technologies required to implement some of the Use Cases of the companion document MPAI-MMC Use Cases and Functional Requirements [5]. Readers are advised that familiarity with that companion document is a prerequisite for a proper understanding of this document.

This document is structured in 7 chapters, including this Introduction.

 

Chapter 2 briefly introduces the AI Framework Reference Model and its six Components.
Chapter 3 briefly introduces the four Use Cases.
Chapter 4 presents the four MPAI-CAE Use Cases with the following structure:

1.     Reference architecture

2.     AI Modules

3.     I/O data of AI Modules

4.     Technologies and Functional Requirements

Chapter 5 identifies the technologies likely to be common across MPAI-CAE and MPAI-MMC, a companion standard project whose Call for Technologies is issued simultaneously with MPAI-CAE’s.
Chapter 6 gives a basic list of relevant terms and their definitions.
Chapter 7 gives relevant references.

For the reader’s convenience, Table 1 introduces the meaning of the acronyms used in this document.

Table 1 – MPAI-CAE acronyms

Acronym Meaning
AI Artificial Intelligence
AIF AI Framework
AIM AI Module
AOG Audio-on-the-go
ARP Audio Recording Preservation
CfT Call for Technologies
DP Data Processing
EAE Enhanced Audioconference Experience
EES Emotion-Enhanced Speech
KB Knowledge Base
ML Machine Learning

2        The MPAI AI Framework (MPAI-AIF)

Most MPAI applications considered so far can be implemented as a set of AIMs – AI, ML and even traditional DP-based units with standard interfaces assembled in suitable topologies to achieve the specific goal of an application and executed in an MPAI-defined AI Framework. MPAI is making all efforts to identify processing modules that are re-usable and upgradable without necessarily changing the inside logic. MPAI plans on completing the development of a 1st generation MPAI-AIF AI Framework in July 2021.

The MPAI-AIF Architecture is given by Figure 1.

Figure 1 – The MPAI-AIF Architecture

MPAI-AIF is made up of 6 Components:

  1. Management and Control manages and controls the AIMs, so that they execute in the correct order and at the time when they are needed.
  2. Execution is the environment in which combinations of AIMs operate. It receives external inputs and produces the requested outputs, both of which are Use Case specific, activates the AIMs, exposes interfaces with Management and Control and interfaces with Communication, Storage and Access.
  3. AI Modules (AIM) are the basic processing elements receiving processing-specific inputs and producing processing-specific outputs (a minimal interface sketch follows this list).
  4. Communication is the basic infrastructure used to connect possibly remote Components and AIMs. It can be implemented, e.g., by means of a service bus.
  5. Storage encompasses traditional storage and is used to, e.g., store the inputs and outputs of the individual AIMs, intermediary results, data from the AIM states and data shared by AIMs.
  6. Access represents the access to static or slowly changing data that are required by the application such as domain knowledge data, data models, etc.
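For illustration only, the following is a minimal Python sketch of how an AIM with standard input/output interfaces connected by Communication channels might be modelled. The class and method names (AIM, Channel, process, run_once) are assumptions made for this sketch; MPAI-AIF does not prescribe this API.

```python
# Minimal, hypothetical sketch of an AIM with standard I/O interfaces.
# Names (AIM, Channel, process, run_once) are illustrative, not MPAI-defined.
from abc import ABC, abstractmethod
from queue import Queue
from typing import Any, Dict


class Channel:
    """Communication link between AIMs or between an AIM and Management and Control."""
    def __init__(self) -> None:
        self._q: Queue = Queue()

    def put(self, item: Any) -> None:
        self._q.put(item)

    def get(self) -> Any:
        return self._q.get()


class AIM(ABC):
    """Basic processing element: receives processing-specific inputs, produces outputs."""
    def __init__(self, inputs: Dict[str, Channel], outputs: Dict[str, Channel]) -> None:
        self.inputs = inputs
        self.outputs = outputs

    @abstractmethod
    def process(self, data: Dict[str, Any]) -> Dict[str, Any]:
        """Transform one batch of named inputs into named outputs."""

    def run_once(self) -> None:
        # Collect one item from every input channel, process, and emit the results.
        data = {name: ch.get() for name, ch in self.inputs.items()}
        for name, value in self.process(data).items():
            self.outputs[name].put(value)
```

In this sketch, Management and Control would instantiate concrete AIM subclasses, wire their Channels according to the Use Case topology and call run_once as data becomes available.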

3        Use Cases

3.1       Emotion-Enhanced Speech (EES)

Speech carries information not only about the lexical content, but also about a variety of other aspects such as age, gender, signature, and emotional state of the speaker [3]. Speech synthesis is evolving towards supporting these aspects.

There are many cases where speech without emotion needs to be converted to speech carrying an emotion, possibly with grades of a particular emotion. This is the case, for instance, of a human-machine dialogue where the message conveyed by the machine is more effective if it carries an emotion properly related to the emotion detected in the human speaker.

The AI Modules identified in the Emotion-Enhanced Speech (EES) Use Case considered in this document will make it possible to create virtual agents communicating in a more natural way, and thus improve the quality of human-machine interaction, by making it closer to a human-human interaction [8].

EES’s ultimate goal is to help realise a user-friendly system control interface that lets users generate speech with various – continuous and real time – expressiveness control levels.

3.2       Audio Recording Preservation (ARP)

Preservation of audio assets recorded on a variety of media (vinyl, tapes, cassettes etc.) is an important activity for a variety of application domains, in particular cultural heritage, because preservation requires more than a “neutral” transfer of audio information from the analogue to the digital domain. For instance,

  1. It is necessary to recover and preserve context information, which is obviously, but not exclusively, audio.
  2. The recording of an acoustic event cannot be a neutral operation because the timbre quality and the plastic value of the recorded sound, which are of great importance in, for example, contemporary music, are influenced by the positioning of the microphones used during the recording.
  3. The recording reflects the processing carried out by the Tonmeister, i.e., the person who has a detailed theoretical and practical knowledge of all aspects of sound recording. Unlike a sound engineer, however, the Tonmeister must also be deeply trained in music: musicological and historic-critical competence is essential for the identification and correct cataloguing of the information contained in audio documents [9].
  4. As sound carriers are made of unstable base materials, they are particularly subject to damage caused by inadequate handling. Combining a technical and scientific background with historic-philological knowledge (an important element for the identification and correct cataloguing of the information contained in audio documents) becomes essential for preservation re-recording operations that go beyond mere A/D conversion. In the case of magnetic tapes, the carrier may hold important information: the tape can include multiple splices; it can be annotated (by the composer or by the technicians) and/or display several types of irregularities (e.g., corruptions of the carrier, tape of a different colour or chemical composition).

In this Audio Recording Preservation Use Case, currently restricted to magnetic tapes, audio is digitised and fed into a preservation system. The audio information is supplemented by the information coming from a video camera pointed at the head that reads the magnetic tape. The output of the restoration process is composed of:

  1. Preservation digital audio
  2. Preservation master file that contains, in addition to the preservation audio file, several other information types created by the preservation process.

The introduction of this use case in the field of active preservation of audio documents opens the way to responding effectively to the methodological questions of reliability of the recordings as documentary sources, while clarifying the concept of “historical faithfulness”.

The goal is to cover the whole “philologically informed” archival process of an audio document, from the active preservation of sound documents to the access to digitized files.

3.3       Enhanced Audioconference Experience (EAE)

Often, the user experience of a video/audio conference is far from satisfactory. Too much background noise or undesired sounds can lead to participants not understanding or even misunderstanding what other participants are saying, in addition to creating distraction.

By using AI-based adaptive noise-cancellation and sound enhancement, those kinds of noise can be virtually eliminated without using complex microphone systems that capture environment characteristics.

In this use case, the system receives microphone sound and microphone geometry information, which describes the number, positioning and configuration of the microphone or the array of microphones. Using this information, the system is able to detect and separate audioconference speech information from spurious sounds. It is to be noted that Microphone Physical information (i.e., frequency response and deviation of the microphone) might be added, but that would likely be overkill for this scenario. The resulting speech then undergoes Noise Cancellation. The resulting output is equalized based on the output device characteristics, fetched from an Output Device Acoustic Model Knowledge Base, which describes the frequency response of the selected output device. This way the speech can be equalized, removing any coloration from the output device and resulting in an optimally delivered sound experience.

3.4       Audio-on-the-go (AOG)

While biking in the middle of city traffic, the user should enjoy a satisfactory listening experience without losing contact with the acoustic surroundings.

The microphones available in earphones or earbuds capture the signals from the environment. The relevant environment sounds (e.g., the horn of a car) are selectively recognised and the sound rendition is adapted to the acoustic environment, providing an enhanced audio experience (e.g., by performing dynamic signal equalization) and improved battery life.

In this use case, the microphones capture the surrounding environment sound, together with geometry information (which describes the number, positioning and configuration of the microphone or the array of microphones).

The sounds are then categorized. The result is an array of sounds with their categorization.

Sounds not relevant for the user in the specific moment are trimmed out and the rest of the sound information undergoes dynamic signal equalization using User Hearing Profile information.

Finally, the resulting sound is delivered to the output via the most appropriate delivery method.

4        Functional Requirements

4.1       Introduction

The Functional Requirements developed in this document refer to the individual technologies identified as necessary to implement Use Cases belonging to the MPAI-CAE application area using AIMs operating in an MPAI-AIF AI Framework. The Functional Requirements developed adhere to the following guidelines:

  1. AIMs are defined to allow implementations by multiple technologies (AI, ML, DP)
  2. DP-based AIMs need interfaces, e.g., to a Knowledge Base. AI-based AIMs will typically require a learning process; however, support for this process is not included in the document. MPAI may develop further requirements covering that process in a future document.
  3. AIMs can be aggregated in larger AIMs. Some data flows of aggregated AIMs may no longer be exposed.
  4. AIMs may be influenced by the companion MPAI-MMC Use Cases and Functional Requirements [5] as the technologies needed by some MPAI-MMC AIMs share a significant number of functional requirements with those needed here.
  5. Current AIMs do not feed information back to AIMs upstream. Respondents to the MPAI-CAE Call for Technologies [4] are welcome to motivate the need for such feedback data flows and propose associated requirements.

The Functional Requirements described in the following sections are the result of a dedicated effort by MPAI experts over many meetings where different partitionings into AIMs have been proposed, discussed and revised. MPAI is aware that alternative partitionings or alternative I/O data to/from AIMs are possible. Those reading this document for the purpose of submitting a response to the MPAI-CAE Call for Technologies (N152) [4] are welcome to propose alternative partitionings or alternative I/O data in their submissions. In this case, however, they are required to justify their alternatives and determine the functional requirements of the relevant technologies. The evaluation team, of which proponents can, if they so wish, be members, will study the proposed alternative arrangement and may decide to accept all or part of the proposed new arrangement.

4.2       Emotion-Enhanced Speech

4.2.1      Reference architecture

This Use Case can be implemented as in Figure 2 and Figure 3. The two figures differ in the use of legacy DP technology vs AI technology:

  1. In Figure 2 the Speech analysis AIM is implemented with legacy Data Processing technologies.
  2. In Figure 3 the Speech analysis AIM is implemented as a neural network which incorporates the Emotion KB information.

Figure 2 – Emotion-enhanced speech (using external Knowledge Base)

Figure 3 – Emotion-enhanced speech (fully AI-based)

4.2.2      AI Modules

The AI Modules perform the functions described in Table 2.

Table 2 – AI Modules of Emotion-Enhanced Speech

AIM Function
Speech feature analyser Computes Speech features, queries the Emotion KB and obtains Emotion descriptors. Alternatively, Emotion descriptors are produced by an embedded neural network.
Emotion KB Exposes an interface that allows the Speech feature analyser to query a KB of speech features extracted from recordings of different speakers reading/reciting the same corpus of texts, with the standard set of emotions and without emotion, for different languages and genders.
Emotion inserter Inserts a particular emotional vocal timbre, e.g., anger, disgust, fear, happiness, sadness, and surprise into a neutral (emotion-less) synthesised voice. It also changes the strength of an emotion (from neutral speech) in a gradual fashion.

4.2.3      I/O interfaces of AI Modules

The I/O data of the Emotion Enhanced Speech AIMs are given in Table 3.

Table 3 – I/O data of Emotion-Enhanced Speech AIMs

AIM | Input Data | Output Data
Speech features analyser | Emotion-less speech, Emotion, Emotion descriptors | Emotion descriptors, Speech features
Emotion KB | Speech features | Emotion descriptors
Emotion inserter | Emotion-less speech, Emotion descriptors | Speech with Emotion, Emotion descriptors

4.2.4      Technologies and Functional Requirements

4.2.4.1     Digital Speech

Speech should be sampled at a frequency between 8 kHz and 96 kHz and digitally represented between 16 bits/sample and 24 bits/sample (both linear). The frequency of 22.05 kHz should be used for the purpose of a response to the MPAI-CAE Call for Technologies. Demonstrations of a proposed technology for other sampling frequencies are welcome.

 To Respondents

Respondents are invited to comment on these choices.
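For illustration only, the following is a minimal sketch of how digital speech with the parameters above (22.05 kHz sampling, 16 bits/sample linear PCM, mono) could be produced, using only numpy and the standard-library wave module. The file name and the test signal are placeholders, not part of the requirements.

```python
# Illustrative only: write a 16-bit linear PCM mono file at 22.05 kHz,
# the sampling frequency suggested for responses to the MPAI-CAE CfT.
import wave
import numpy as np

SAMPLE_RATE = 22050       # Hz, within the allowed 8-96 kHz range
BITS_PER_SAMPLE = 16      # linear PCM, within the 16-24 bits/sample range

t = np.arange(SAMPLE_RATE) / SAMPLE_RATE             # one second of samples
signal = 0.3 * np.sin(2 * np.pi * 220.0 * t)         # placeholder "speech" signal
pcm = (signal * (2 ** (BITS_PER_SAMPLE - 1) - 1)).astype(np.int16)

with wave.open("speech_22050_16bit.wav", "wb") as wf:
    wf.setnchannels(1)
    wf.setsampwidth(BITS_PER_SAMPLE // 8)
    wf.setframerate(SAMPLE_RATE)
    wf.writeframes(pcm.tobytes())
```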

4.2.4.2     Emotion

By Emotion we mean a digital attribute that indicates an emotion out of a finite set of Emotions.

In EES the input speech – natural or synthesised – does not contain emotion while the output speech is expected to contain the emotion expressed by the input Emotion.

The most basic Emotions are described by the set: “anger, disgust, fear, happiness, sadness, and surprise” [10], or “joy versus sadness, anger versus fear, trust versus disgust, and surprise versus anticipation” [11]. One of these sets can be taken as “universal” in the sense that they are common across all cultures. An Emotion may have different Grades [12,13].

 To Respondents

Respondents are requested to propose:

  1. A minimal set of Emotions whose semantics are shared across cultures.
  2. A set of Grades that can be associated to Emotions.
  3. A digital representation of Emotions and their Grades (starting from [14]).

Currently, the MPAI-CAE Call for Technologies does not envisage considering culture-specific Emotions. However, the proposed digital representation of Emotions and their Grades should either accommodate, or be extensible to accommodate, culture-specific Emotions.
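As a purely illustrative starting point (not a proposed standard and not EmotionML [14]), the sketch below shows one possible extensible digital representation of an Emotion with a Grade, a culture hook and optional time stamps. The category set, the 0-1 grade scale and all field names are assumptions of this sketch.

```python
# Hypothetical, extensible representation of an Emotion with a Grade.
# Category names and the 0-1 grade scale are assumptions, not MPAI-defined.
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class EmotionCategory(Enum):
    ANGER = "anger"
    DISGUST = "disgust"
    FEAR = "fear"
    HAPPINESS = "happiness"
    SADNESS = "sadness"
    SURPRISE = "surprise"
    NEUTRAL = "neutral"


@dataclass
class Emotion:
    category: EmotionCategory
    grade: float = 1.0                 # intensity in [0.0, 1.0]
    culture: Optional[str] = None      # hook for culture-specific extensions
    start_ms: Optional[int] = None     # optional time stamp for streamed Emotions
    end_ms: Optional[int] = None


# Example: a moderately happy segment between 0 and 1500 ms
e = Emotion(EmotionCategory.HAPPINESS, grade=0.6, start_ms=0, end_ms=1500)
```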

4.2.4.3     Emotion KB query format

To accomplish their task, speech processing applications utilize certain features of speech signals. General speech features are described in [15,16]. The extraction of these features from a speech signal is known as speech analysis. Extraction can be done in the time domain as well as in the frequency domain.

Time-domain features are related to the waveform analysis in the time domain. Analysing speech in the time domain often requires simple calculation and interpretation. Time-domain features can be used to measure the arousal level of emotions.

Time-domain features carry information about sequences of short-time prosody acoustic features (features estimated on a frame basis). Example features modified by the emotional states are given by short-time zero crossing rate, short-term speech energy and duration [19].

Frequency-domain features can be computed using (short-time) Fourier transform, wavelet transform, and other mathematical tools [24]. Frequency domain operation provides mechanisms to obtain some of the most useful parameters in speech analysis because the human cochlea performs a quasi-frequency analysis.

Initially, the time-domain signal is transformed into the frequency-domain, from which the features are extracted. Such features are highly associated with the human perception of speech. Hence, they have apparent acoustic characteristics. These features usually comprise formant frequency, linear prediction cepstral coefficient (LPCC), and Mel frequency cepstral coefficients (MFCC).

The frequency-domain features can carry information about:

  1. The Pitch signal (i.e., the glottal waveform) that depends on the tension of the vocal folds and the subglottal air pressure. Two parameters related to the pitch signal can be considered: pitch frequency and glottal air velocity. E.g., high velocity indicates a speech emotion like happiness, while low velocity occurs in harsher styles such as anger [25].
  2. The shape of the vocal tract that is modified by the emotional states. The formants (characterized by a centre frequency and a bandwidth) can be a representation of the vocal tract resonances. Other features relate to the number of harmonics due to the non-linear airflow in the vocal tract. E.g., in the emotional state of anger, the fast air flow causes additional excitation signals other than the pitch. Teager Energy Operator-based (TEO) features measure the harmonics and cross-harmonics in the spectrum [26].

An example of a feature modified by the emotional states is given by the Mel-frequency cepstrum (MFC) [27].
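For illustration, the sketch below extracts two of the short-time, time-domain features mentioned above (short-time zero-crossing rate and short-term energy) from a mono speech signal using only numpy. The frame and hop lengths are arbitrary choices of this sketch; frequency-domain features such as MFCC or LPCC would typically be computed with a dedicated speech library and are not shown.

```python
# Illustrative short-time feature extraction (zero-crossing rate and energy).
import numpy as np


def frame_signal(x: np.ndarray, frame_len: int, hop: int) -> np.ndarray:
    # Split the signal into overlapping frames (one frame per row).
    n_frames = 1 + max(0, (len(x) - frame_len) // hop)
    return np.stack([x[i * hop:i * hop + frame_len] for i in range(n_frames)])


def short_time_features(x: np.ndarray, sr: int, frame_ms: float = 25.0, hop_ms: float = 10.0):
    frames = frame_signal(x, int(sr * frame_ms / 1000), int(sr * hop_ms / 1000))
    energy = np.mean(frames ** 2, axis=1)                              # short-term energy
    zcr = np.mean(np.abs(np.diff(np.sign(frames), axis=1)) > 0, axis=1)  # zero-crossing rate
    return energy, zcr


sr = 22050
x = np.random.randn(sr)                 # stand-in for one second of speech
energy, zcr = short_time_features(x, sr)
print(energy.shape, zcr.shape)
```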

Today, there is a variety of speech datasets available (online). Often, they consist of conversational setups and contain overlaps in speech as well as noise, or they are poor in expressiveness. Some datasets offer emotionally rich content with a high quality, but in a limited amount [e.g., 19,20,21,22]. To be effective, an Emotion KB should contain a large and expressive speech dataset.

Emotion KB contains speech features extracted from the speech recordings of speakers reading/ reciting the same corpus of texts with an agreed set of emotions and without emotion, for a set of languages and for different genders (voice performances by professional actors in comparison with the author’s spontaneous speech) [28, 29].

Emotion KB is queried by providing a vector of speech features. Emotion KB responds by providing Emotion descriptors.

 To Respondents

Respondents are requested to propose an Emotion KB query format satisfying the following requirements:

  1. Accept as input:
    1. A vector of speech features capable of modelling:
      1. Non-extreme emotional states [17].
      2. Many emotional states with a natural-sounding voice [18].
    2. An Emotion.
  2. Provide as output a set of Emotion descriptors.

When assessing proposed Speech features, MPAI may resort to objective testing.

Note: An AI-based implementation may not need Emotion KB.
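For illustration only, the sketch below shows one possible shape of such a query/response exchange for a DP-based implementation. All field names, the lookup method and the descriptor keys are assumptions of this sketch; the actual format is what respondents propose.

```python
# Hypothetical query/response format for an Emotion KB (DP-based implementation).
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class EmotionKBQuery:
    speech_features: List[float]      # e.g. pitch, energy, MFCC statistics
    emotion: str                      # target Emotion, e.g. "happiness"
    language: str = "und"             # ISO 639-3 code, "und" = undetermined
    gender: str = "unspecified"


@dataclass
class EmotionKBResponse:
    # e.g. {"pitch_shift_ratio": 1.12, "energy_gain_db": 2.5, "speaking_rate": 0.95}
    emotion_descriptors: Dict[str, float] = field(default_factory=dict)


def query_emotion_kb(kb, query: EmotionKBQuery) -> EmotionKBResponse:
    """kb is any object exposing a lookup(); this is a placeholder, not a real API."""
    return EmotionKBResponse(emotion_descriptors=kb.lookup(query))
```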

4.2.4.4     Emotion descriptors

Emotion descriptors are features used to alter the prosodic characteristics, the pitch, and the formant frequencies and bandwidth of Digital speech.

Speech analysis can use different strategies to render the emotion depending on:

  1. The type of sentence (numbers of words, type of phonemes, etc.) to which an emotion is added
  2. The emotions added to the previous and next sentence.

Emotion descriptors can be obtained by querying an Emotion KB (in the case of Figure 2) or from the output of a neural network (in the case of Figure 3).

 To Respondents

Respondents should propose Emotion descriptors suitable to introduce Emotion into the specific emotion-less speech, resulting in speech that appears “natural” to the listener.

When assessing proposed Speech features, MPAI may resort to subjective testing.

4.3       Audio Recording Preservation

4.3.1      Reference architecture

This Use Case is implemented as in Figure 4 and Figure 5. The two figures differ in the use of legacy DP technology vs AI technology:

  1. In Figure 4 the Audio-video Analysis AIM is implemented with Data Processing Technologies.
  2. In Figure 5 the Audio-video Analysis AIM is implemented as a neural network which incorporates the Tape irregularity KB information.

Figure 4 – Tape Audio preservation (using external Knowledge Base)

Figure 5 – Tape Audio preservation (fully AI-based)

4.3.2      AI Modules

The AIMs required by this Use Case are described in Table 4.

Table 4 – AI Modules of Audio Recording Preservation

AIM Function
Audio enhancer Produces Preservation audio using an internal denoiser, aimed only at compensating for (a) the non-linear frequency response caused by imperfect historical recording equipment; (b) rumble, needle noise, or tape hiss caused by the imperfections introduced by aging (see 4.3.5).
Audio analyser Produces audio excerpts based on signals from Video analysis.
Video analyser Extracts images from Video, queries the Tape irregularity KB and provides Images and Irregularity IDs. Alternatively, an embedded neural network produces the Images.
Musicological classifier Produces relevant Images from the Digital video and Text describing the Images
Packager Produces a file containing: (1) Digital audio; (2) Input video; (3) Audio sync’d images and text
Tape irregularity KB Knowledge Base of visual (tape) and audio irregularities

4.3.3      I/O interfaces of AI Modules

The I/O data of the Audio Recording Preservation AIMs are given in Table 5.

Table 5 – I/O data of Audio Recording Preservation AIMs

AIM | Input Data | Output Data
Audio enhancer | Digital Audio | Preservation Audio
Audio analysis | Preservation Audio, Irregularity | Audio Excerpts
Video analysis | Digital Video, Tape irregularity KB response | Images, Tape irregularity KB query, Irregularity IDs
Musicological classifier | Audio Excerpts, Images, Irregularity IDs | Text, Images
Packager | Preservation Audio, Digital Video, Text, Images | Preservation Master
Tape irregularity KB | Query | Response

4.3.4      Technologies and Functional Requirements

4.3.4.1     Digital Audio

Digital Audio is sampled from an analogue source (e.g., magnetic tapes, 78 rpm phonographic discs) at a frequency in the 44.1-96 kHz range with at least 16 and at most 24 bits/sample [30].

To Respondents

Respondents are invited to comment on this choice.

4.3.4.2     Digital Video

Digital video has the following features.

  1. Pixel shape: square
  2. Bit depth: 8-10 bits/pixel
  3. Aspect ratio: 4/3 and 16/9
  4. 640 < # of horizontal pixels < 1920
  5. 480 < # of vertical pixels < 1080
  6. Frame frequency 50-120 Hz
  7. Scanning: progressive
  8. Colorimetry: ITU-R BT709 and BT2020
  9. Colour format: RGB and YUV
  10. Compression: uncompressed; if compressed AVC, HEVC

To Respondents

Respondents are invited to comment on these choices.

4.3.4.3     Digital Image

A Digital Image is

  1. An uncompressed video frame with time information or
  2. A JPEG-compressed video frame [32] with time information.

To Respondents

Respondents are invited to comment on this choice.

4.3.4.4     Tape irregularity KB query format

Tape irregularity KB contains features extracted from images of different tape irregularities [38].

The Irregularity KB is queried by giving a vector of Image features that describe [37]:

  1. Splices of
    1. Leader tape to magnetic tape
    2. Magnetic tape to magnetic tape
  2. Other irregularities such as brands on tape, ends of tape, ripples, damaged tapes, markings, dirt, shadows etc.

The Irregularity KB responds by providing the type of irregularity detected in the input Image.

To Respondents

Respondents are requested to propose a Tape irregularity KB query format satisfying the following requirements:

  1. A complete set of audio tape irregularities and Image features that characterise them.
  2. A response to a query shall indicate:
    1. Presence of irregularities or otherwise.
    2. Type of irregularity as output (if there are irregularities).

When assessing proposed Image features MPAI may resort to objective testing.
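For illustration only, the sketch below shows one possible shape of such a query/response exchange. The image feature vector, the irregularity taxonomy and all field names are assumptions of this sketch, not a proposed standard.

```python
# Hypothetical Tape irregularity KB query/response: a vector of Image features
# goes in, the detected irregularity type (if any) comes out.
from dataclasses import dataclass
from typing import List, Optional

# Assumed, non-normative taxonomy derived from the examples listed above.
IRREGULARITY_TYPES = [
    "splice_leader_to_magnetic",
    "splice_magnetic_to_magnetic",
    "brand_on_tape",
    "end_of_tape",
    "ripple",
    "damage",
    "marking",
    "dirt",
    "shadow",
]


@dataclass
class IrregularityQuery:
    image_features: List[float]   # features extracted from one video frame
    frame_time_ms: int            # position of the frame in the Digital Video


@dataclass
class IrregularityResponse:
    present: bool
    irregularity_type: Optional[str] = None   # one of IRREGULARITY_TYPES when present
    confidence: Optional[float] = None
```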

This CfT specifically addresses audio tape preservation. However, its scope may be extended if sufficient technologies covering other audio preservation instances are received. Any proposal for other audio preservation instances should be described with a level of detail comparable to this Use Case.

4.3.4.5     Text

Text should be encoded according to ISO/IEC 10646, Information technology – Universal Coded Character Set (UCS) to support most languages in use [39].

To Respondents

Respondents are invited to comment on this choice.

4.3.4.6     Packager

Packager takes Preservation Audio, Digital Video, Text and Images and produces the Preservation Master file.

To Respondents

Respondents should propose a file format capable of:

  1. Supporting queries for irregularities, showing all the images corresponding to a given irregularity (splices, carrier corruptions, etc.)
  2. Allowing the audio corresponding to a particular image to be listened to.
  3. Allowing the audio signal to be annotated (with text), to support the musicological analysis.
  4. Supporting queries on the annotations, returning the corresponding time (sec:ms:sample), the text, the audio signal excerpt and the image (if any).
  5. Supporting random access to a specified portion of the video and/or audio.

Preference will be given to formats that have already been standardised or are in wide use.
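For illustration only, the sketch below shows a hypothetical JSON-style manifest indicating how audio, video, images and text annotations could be cross-linked inside a Preservation Master so that the queries listed above (by irregularity, by annotation, by time) become possible. All file names and field names are assumptions of this sketch, not a proposed format.

```python
# Hypothetical manifest for a Preservation Master; illustrative only.
import json

preservation_master = {
    "preservation_audio": "audio/preservation_96k_24bit.wav",
    "digital_video": "video/tape_head_capture.mp4",
    "images": [
        {
            "file": "images/frame_000123.jpg",
            "time": {"sec": 12, "ms": 340, "sample": 1184640},
            "irregularity": "splice_magnetic_to_magnetic",
        }
    ],
    "annotations": [
        {
            "time": {"sec": 12, "ms": 340, "sample": 1184640},
            "text": "Splice; tempo change annotated on tape by the composer.",
            "image_ref": "images/frame_000123.jpg",
        }
    ],
}

with open("preservation_master.json", "w", encoding="utf-8") as f:
    json.dump(preservation_master, f, indent=2)
```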

4.3.5      Information about Audio enhancement performance

A fifty-year-long debate around the restoration of audio documents has been ongoing inside the archivists’ and musicologists’ communities [33].

The Preservation audio produced by Audio enhancement must fulfil the requirements of accuracy, reliability, and philological authenticity.

In [34] Schuller makes an accurate investigation of signal alterations classified in two categories:

  1. Intentional that includes recording, equalization, and noise reduction systems.
  2. Unintentional further divided into those caused by:
    1. The imperfection of the recording technique of the time, resulting in various distortions.
    2. Misalignment of the recording equipment, e.g., wrong speed, deviation from the ver­tical cutting angle in cylinders, or misalignment of the recording in magnetic tape.

The choice whether or not to compensate for these alterations reveals different restoration strategies: historical faithfulness can refer to the recording as it has been produced, precisely equalized for intentional recording equalizations, compensated for any errors caused by misaligned recording equipment (for example, wrong speed, deviation from the vertical cutting angle in cylinders, or misalignment of the recording on magnetic tape) and digitized using modern equipment to minimize replay distortions.

There is a certain margin of interpretation because historical acquaintance with the document comes into play alongside technical-scientific knowledge, for instance, to identify the equalization curves of magnetic tapes or to determine the rotation speed of a record. Most of the information provided is retrievable from the history of audio technology, while other information is experimentally inferable with a certain degree of accuracy.

The restoration must focus on compensating for the non-linear frequency response caused by imperfect historical recording equipment and for the rumble, needle noise, or tape hiss caused by the imperfections introduced by aging.

The restoration step can thus be carried out with a good degree of objectivity and represents an optimum level achievable by the original (analogue) recording equipment.

A legacy denoiser algorithm should [35,36]:

  1. Use little a priori information.
  2. Operate in real time.
  3. Be based on frequency-domain methods, such as various forms of non-causal Wiener filtering or spectral subtraction schemes.
  4. Include algorithms that incorporate knowledge of the human auditory system.
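For illustration, the sketch below implements a minimal spectral-subtraction denoiser of the kind referred to above: a frequency-domain method using little a priori information, here estimating the noise spectrum from a presumed speech-free interval at the start of the recording. Parameter values and the noise-estimation strategy are assumptions of this sketch.

```python
# Minimal spectral-subtraction denoiser; illustrative only.
import numpy as np
from scipy.signal import stft, istft


def spectral_subtraction(x: np.ndarray, sr: int, noise_seconds: float = 0.5,
                         nperseg: int = 1024, floor: float = 0.05) -> np.ndarray:
    _, _, X = stft(x, fs=sr, nperseg=nperseg)
    mag, phase = np.abs(X), np.angle(X)

    # Estimate the noise magnitude spectrum from the first noise_seconds of the recording.
    n_noise_frames = max(1, int(noise_seconds * sr / (nperseg // 2)))
    noise_mag = mag[:, :n_noise_frames].mean(axis=1, keepdims=True)

    # Subtract the noise estimate and apply a spectral floor to limit musical noise.
    clean_mag = np.maximum(mag - noise_mag, floor * noise_mag)
    _, y = istft(clean_mag * np.exp(1j * phase), fs=sr, nperseg=nperseg)
    return y[:len(x)]
```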

To Proponents

The CfT does not call for technologies for this AIM. However, respondents’ comments on the text above will be welcome.

4.4       Enhanced Audioconference Experience

4.4.1      Reference architecture

This Use Case is implemented as in Figure 6.

Figure 6 – Enhanced Audioconference Experience

4.4.2      AI Modules

The AIMs required by the Enhanced Audioconference Experience are given in Table 6

Table 6 – AIMs of Enhanced Audioconference Experience

AIM Function
Speech detection and separation Separates relevant Speech vs non-speech signals
Noise cancellation Removes noise in Speech signal
Output dynamic noise cancellation Reduces noise level based on Output Device Acoustic Model
Delivery Wraps De-noised Speech signal for Transport
Output Device Acoustic Model KB Contains the identifiers of all output devices by manufacturer and their calibration test results

4.4.3      I/O interfaces of AI Modules

The I/O data of Enhanced Audioconference Experience AIMs are given in Table 7.

Table 7 – I/O data of Enhanced Audioconference Experience AIMs

AIM | Input Data | Output Data
Speech detection and separation | Microphone Sound, Geometry Information | Digital Speech, Geometry Information
Noise cancellation | Digital Speech, Geometry Information | De-noised Speech
Output dynamic noise cancellation | De-noised Speech | Equalised Speech
Delivery | Equalised Speech, Transport info | Equalised Speech
Output Device Acoustic Model KB | Query | Response

4.4.4      Technologies and Functional Requirements

4.4.4.1     Digital Speech

Speech should be sampled at a frequency between 8 kHz and 96 kHz and the samples should be represented with at least 16 and at most 24 bits/sample (both linear).

To Respondents

Respondents are invited to comment on these two choices.

4.4.4.2     Microphone geometry information

Microphone geometry information is a descriptive representation of the relative positioning of one or multiple microphones. It describes physical characteristics of the microphones, such as type, positioning and angle, their relative position and the overall configuration, such as the Array Type. It allows a signal free of noise and distortion to be accurately reproduced and noise to be better separated from signal, as required for the proper working of the EAE AIMs. Formats to represent microphone geometry information are: MPEG-H 3D Audio [40] and platform (Android, Windows, Linux) specific JSON Descriptors API [41].

To Respondents

Respondents are requested to:

  1. Comment about MPAI’s choice of the two formats
  2. Express their preference between the two formats.
  3. Possibly suggest alternative solutions.
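For illustration only, the snippet below sketches a JSON-style microphone geometry descriptor loosely inspired by the platform-specific descriptors mentioned above. The field names, units and array-type values are assumptions of this sketch, not an existing format.

```python
# Hypothetical microphone geometry descriptor; illustrative only.
microphone_geometry = {
    "array_type": "linear",              # e.g. linear, circular, planar
    "microphones": [
        {"id": 0, "type": "omnidirectional",
         "position_mm": {"x": -40, "y": 0, "z": 0}, "angle_deg": 0},
        {"id": 1, "type": "omnidirectional",
         "position_mm": {"x": 40, "y": 0, "z": 0}, "angle_deg": 0},
    ],
}
```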

4.4.4.3     Output device acoustic model metadata KB query format

The Output device acoustic model KB contains a description of the output device acoustic model, such as frequency response and per-frequency attenuation.

The Output device acoustic model KB is queried by providing the unique ID of a device, if available, or by providing a means to identify the model or a unique reference to the output device being considered. The Output device acoustic model KB responds with information about the output device characteristics.

To Respondents

Respondents are requested to propose a query/response API satisfying the requirement that the API shall provide:

  1. Means to query the KB giving the device model as input to obtain the acoustic model.
  2. Adequate schemas to represent the Output device acoustic model using, if necessary, current representation schemes.
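For illustration only, the sketch below shows one possible shape of such a query/response API, with the acoustic model represented as (frequency, attenuation) pairs. All names and the representation scheme are assumptions of this sketch.

```python
# Hypothetical Output device acoustic model KB query/response; illustrative only.
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class AcousticModelQuery:
    device_id: Optional[str] = None       # unique ID, if available
    manufacturer: Optional[str] = None    # otherwise identify by make and model
    model: Optional[str] = None


@dataclass
class AcousticModelResponse:
    found: bool
    frequency_response: Optional[List[Tuple[float, float]]] = None  # (Hz, dB) pairs


def equalisation_gains(resp: AcousticModelResponse) -> List[Tuple[float, float]]:
    """Invert the per-frequency attenuation to remove the device colouration."""
    return [(hz, -db) for hz, db in (resp.frequency_response or [])]
```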

4.4.4.4     Delivery

Equalised Speech needs to be transported using a transport protocol most appropriate for the environment.

To Respondents

Proponents are requested to identify the transport protocols suitable for the EAE Use Case and propose an extensible way to signal which transport mechanism is intended to be used.

4.5       Audio-on-the-go

4.5.1      Reference architecture

This Use Case is implemented as in Figure 7 and in Figure 8. The two figures differ in the use of legacy DP technology vs AI technology:

  1. In Figure 7 the Environment sound separation and Environment sound processing AIMs are implemented using legacy Data Processing technology.
  2. In Figure 8 the Environment sound processing AIM is implemented as a neural network.

Figure 7 – Audio-on-the-go (using external Knowledge Base)

Figure 8 – Audio-on-the-go (full AI-based solution)

4.5.2      AI Modules

The AIMs of Audio-on-the-go are given by Table 8.

 Table 8 – AIMs of Audio-on-the-go

AIM Function
Environment sound separation Separates the individual sounds captured from the surrounding environment
Environment sound processing Determines which sounds are relevant to the user
Sound categorisation KB Contains audio features of the sounds in the KB
Dynamic signal equalization Dynamically equalises sound using information from User hearing profiles KB to produce the best possible quality output
Delivery Wraps equalised sound for Transport
User hearing profiles KB A dataset of hearing profiles of target users

4.5.3      I/O interfaces of AI Modules

The I/O data of the Audio-on-the-go AIMs are given in Table 9.

Table 9 – I/O data of Audio-on-the-go AIMs

AIM | Input Data | Output Data
Environment sound separation | Microphone Sound, Geometry info | Sound array
Environment sound processing | Sound array, Sound categorisation | Relevant sounds, Sound features
Dynamic signal equalization | Relevant sounds, User’s hearing profiles | Dynamically equalised sound, User ID
Delivery | Equalised Speech, Transport info | Equalised Speech
Sound categorisation KB | Sound features vector | Sound categorisation
User hearing profiles KB | Query | Response

4.5.4      Technologies and Functional Requirements

4.5.4.1     Digital Audio

Digital Audio is a stream of samples obtained by sampling audio at a frequency in the 44.1-96 kHz range with at least 16 and at most 24 bits/sample.

To Respondents

Proponents are invited to comment on this choice.

4.5.4.2     Microphone geometry information

Microphone geometry information is a descriptive representation of the relative positioning of one or multiple microphones. It describes physical characteristics of the microphones, such as type, positioning and angle, their relative position and the overall configuration, such as the Array Type. It allows a noise- and distortion-free signal to be accurately reproduced and noise to be better separated from signal, as required for the proper working of the AOG AIMs. Formats to represent microphone geometry information are: MPEG-H 3D Audio [40] and platform (Android, Windows, Linux) specific JSON Descriptors API [41].

To Respondents

Respondents are requested to:

  1. Express their preference between the two formats.
  2. Comment about MPAI’s choice of the two formats.
  3. Possibly suggest alternative solutions.

4.5.4.3     Sound array

The sounds identified in the Microphone sound are passed as an array of sounds represented as

  1. Sound samples.
  2. Encoding information (e.g., sampling frequency, bits/sample, compression method).
  3. Associated metadata.

To Respondents

Respondents are requested to propose:

  1. A format to package a set of environment sounds with appropriate metadata.
  2. An extensible identification of audio compression methods.
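For illustration only, the sketch below shows one possible packaging of a Sound array in which each separated environment sound carries its samples, encoding information and metadata. All field names and default values are assumptions of this sketch.

```python
# Hypothetical Sound array packaging; illustrative only.
from dataclasses import dataclass, field
from typing import Dict, List

import numpy as np


@dataclass
class EnvironmentSound:
    samples: np.ndarray                     # decoded samples, one channel
    sampling_rate_hz: int = 48000
    bits_per_sample: int = 16
    compression: str = "uncompressed"       # extensible identifier, e.g. "aac", "opus"
    metadata: Dict[str, str] = field(default_factory=dict)


@dataclass
class SoundArray:
    sounds: List[EnvironmentSound] = field(default_factory=list)


# Example: one captured sound labelled by the (later) categorisation step
sa = SoundArray([EnvironmentSound(samples=np.zeros(48000), metadata={"category": "car_horn"})])
```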

4.5.4.4     Sound categorisation KB query format

Sound categorisation KB contains audio features of the sounds in the KB. Sound categorisation KB is queried by providing a vector of Sound features. Sound categorisation KB responds by giving the category of the sound.

Sound features are extracted from samples of the individual sounds in the Sound array for the purpose of querying the Sound categorisation KB.

To Respondents

Respondents should propose a Sound categorisation KB query format satisfying the following requirements:

  1. Use an extensible set of Sound features that satisfy the following requirements:
    1. Be suitable for identifying a sound.
    2. Be suitable as input to query the Sound categorisation.
  2. Provide as output:
    1. The probability for the most relevant N categories.
    2. From which Sound categorisation KB this value has been derived.

When assessing proposed Sound features MPAI may resort to objective testing.
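For illustration only, the sketch below shows one possible shape of the response: the N most probable categories together with an identifier of the Sound categorisation KB that produced them. All names are assumptions of this sketch.

```python
# Hypothetical Sound categorisation KB response; illustrative only.
from dataclasses import dataclass
from typing import List


@dataclass
class SoundCategory:
    label: str          # e.g. "car_horn", "siren", "speech"
    probability: float


@dataclass
class SoundCategorisationResponse:
    categories: List[SoundCategory]   # most relevant N categories
    kb_id: str                        # which Sound categorisation KB produced the result


def top_n(response: SoundCategorisationResponse, n: int = 3) -> List[SoundCategory]:
    return sorted(response.categories, key=lambda c: c.probability, reverse=True)[:n]
```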

4.5.4.5     Sounds categorisation

Each vector in the sound array should be accompanied by an identifier of the category it belongs to.

To Respondents

Respondents should propose an extensible classification of all types of sound of interest [42]. Support of a set of sounds classified according to a proprietary scheme should also be provided.

4.5.4.6     User Hearing Profiles KB query format

User Hearing Profiles KB contains the hearing profile of a specific, properly identified user (e.g., identified via a UUID or a third-party identity provider).

User Hearing Profiles KB is queried by giving the User hearing profile ID as input. The User hearing profiles KB responds with the specific user hearing profile. The User hearing profile contains the hearing attenuation for a defined number of frequency bands, or any representation able to determine the unique individual sound perception ability [43]. There are currently at least two SDKs in this area: the MIMI SDK and the NURA SDK (both proprietary) [44].

To Respondents

Respondents should propose a query format satisfying the following requirements:

  1. Input: user identity, array of frequency values
  2. Output: the values of the user’s sound perception ability at those frequency values
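For illustration only, the sketch below shows one possible shape of such a query and response, plus how the returned values might drive per-band equalisation. All names and the dB convention are assumptions of this sketch.

```python
# Hypothetical User Hearing Profiles KB query/response; illustrative only.
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class HearingProfileQuery:
    user_id: str                 # e.g. a UUID or third-party identity token
    frequencies_hz: List[float]


@dataclass
class HearingProfileResponse:
    attenuation_db: Dict[float, float]   # frequency -> hearing attenuation for this user


def apply_profile_gain(band_gains_db: Dict[float, float],
                       profile: HearingProfileResponse) -> Dict[float, float]:
    """Boost each band by the user's attenuation at that frequency (illustrative)."""
    return {hz: g + profile.attenuation_db.get(hz, 0.0) for hz, g in band_gains_db.items()}
```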

4.5.4.7     Delivery

Equalised Speech needs to be transported using a transport protocol most appropriate for the environment.

To Respondents

Proponents are requested to identify the transport protocol suitable for the AOG Use Case and propose an extensible way to signal which transport mechanism is intended to be used.

5        Potential common technologies

Table 10 introduces the acronyms representing the MPAI-CAE and MPAI-MMC Use Cases.

Table 10 – Acronyms of MPAI-CAE and MPAI-MMC Use Cases

Acronym App. Area Use Case
EES MPAI-CAE Emotion-Enhanced Speech
ARP MPAI-CAE Audio Recording Preservation
EAE MPAI-CAE Enhanced Audioconference Experience
AOG MPAI-CAE Audio-on-the-go
CWE MPAI-MMC Conversation with emotion
MQA MPAI-MMC Multimodal Question Answering
PST MPAI-MMC Personalized Automatic Speech Translation

Table 11 gives all MPAI-CAE and MPAI-MMC technologies in alphabetical order.

Please note the following acronyms:

KB Knowledge Base
QF Query Format

Table 11 – Alphabetically ordered MPAI-CAE and MPAI-MMC technologies

Notes UC=Use case
UCFR=Use Cases and Functional Requirements document number
Section=Section of the above document
Technology=name of technology

 

UC UCFR Section Technology
EAE N151 4.4.4.4 Delivery
AOG N151 4.5.4.7 Delivery
CWE N153 4.2.4.9 Dialog KB query format
ARP N151 4.3.4.1 Digital Audio
AOG N151 4.5.4.1 Digital Audio
ARP N151 4.3.4.3 Digital Image
MQA N153 4.3.4.3 Digital Image
EES N151 4.2.4.1 Digital Speech
EAE N151 4.4.4.1 Digital Speech
CWE N153 4.2.4.2 Digital Speech
MQA N153 4.3.4.2 Digital Speech
PST N153 4.4.4.2 Digital Speech
ARP N151 4.3.4.2 Digital Video
CWE N153 4.2.4.3 Digital Video
EES N151 4.2.4.2 Emotion
CWE N153 4.2.4.4 Emotion
EES N151 4.2.4.4 Emotion descriptors
CWE N153 4.2.4.5 Emotion KB (speech) query format
CWE N153 4.2.4.6 Emotion KB (text) query format
CWE N153 4.2.4.7 Emotion KB (video) query format
EES N151 4.2.4.3 Emotion KB query format
MQA N153 4.3.4.4 Image KB query format
CWE N153 4.2.4.11 Input to face animation
CWE N153 4.2.4.10 Input to speech synthesis
MQA N153 4.3.4.7 Intention KB query format
PST N153 4.4.4.4 Language identification
CWE N153 4.2.4.8 Meaning
MQA N153 4.3.4.6 Meaning
EAE N151 4.4.4.2 Microphone geometry information
AOG N151 4.5.4.2 Microphone geometry information
MQA N153 4.3.4.5 Object identifier
MQA N153 4.3.4.8 Online dictionary query format
EAE N151 4.4.4.3 Output device acoustic model metadata KB query format
ARP N151 4.3.4.6 Packager
AOG N151 4.5.4.3 Sound array
AOG N151 4.5.4.4 Sound categorisation KB query format
AOG N151 4.5.4.5 Sounds categorisation
PST N153 4.4.4.3 Speech features
ARP N151 4.3.4.4 Tape irregularity KB query format
ARP N151 4.3.4.5 Text
CWE N153 4.2.4.1 Text
MQA N153 4.3.4.1 Text
PST N153 4.4.4.1 Text
PST N153 4.4.4.5 Translation results
AOG N151 4.5.4.6 User Hearing Profiles KB query format

The following technologies are shared or shareable across Use Cases:

  1. Delivery
  2. Digital speech
  3. Digital audio
  4. Digital image
  5. Digital video
  6. Emotion
  7. Meaning
  8. Microphone geometry information
  9. Text

Image features apply to different visual objects. The Speech features of all Use Cases are different.

However, respondents should consider the possibility of proposing a unified set of Speech features, e.g., as proposed in [45].

6        Terminology

Table 12 identifies and defines the terms used in the MPAI-CAE context.

Table 12 – MPAI-CAE terms

Term Definition
Access Static or slowly changing data that are required by an application such as domain knowledge data, data models, etc.
AI Framework (AIF) The environment where AIM-based workflows are executed
AI Module (AIM) The basic processing elements receiving processing specific inputs and producing processing specific outputs
Audio enhancement An AIM that produces Preservation audio using internal denoiser
Communication The infrastructure that connects the Components of an AIF
Data Processing (DP) A legacy technology that may be used to implement AIMs
Delivery An AIM that wraps data for transport
Digital Speech Digitised speech as specified by MPAI
Dynamic Signal Equalization An AIM that dynamically equalises the sound using information from the User hearing profiles KB
Emotion A digital attribute that indicates an emotion out of a finite set of Emotions
Emotion Descriptor A set of time-domain and frequency-domain features capable of rendering a particular emotion, starting from emotion-less digital speech
Emotion inserter A module to set time-domain and frequency-domain features of a neutral speech in order to insert a particular emotional intention.
Emotion KB A speech dataset rich in expressiveness
Emotion KB query format A dataset of time-domain and frequency-domain neutral speech features
Environment Sound Processing An AIM that determines which sounds are relevant for the user vs sounds which are not
Environment Sounds Recognition An AIM that recognises, separates and categorises sounds captured from the environment
Execution The environment in which AIM workflows are executed. It receives external inputs and produces the requested outputs both of which are application specific
Frequency-domain Features Properties (descriptors) of the signal with respect to frequency
Emotion Grade The intensity of an Emotion
Knowledge Base Structured and unstructured information made accessible to AIM (especially DP-based)
Management and Control Manages and controls the AIMs in the AIF, so that they execute in the correct order and at the time when they are needed
Musicological classifier Algorithm that sorts unlabelled images from Digital Video into (relevant) labelled categories of information, linking them with text describing the images.
Noise cancellation An AIM that removes noise in Speech signal
Output Device Acoustic Model KB A dataset of calibration test results for all output devices of a given manufacturer identified by their ID
Output dynamic noise cancellation An AIM that reduces noise level based on Output Device Acoustic Model
Packager An AIM that packages audio, video, images and text in a file
Relevant vs non-relevant sound KB A dataset of audio features of relevant sounds
Sound categorisation KB Contains audio features of the sounds in the KB
Speech analysis The AIM that extracts Emotion descriptors
Speech analysis The AIM that understands the emotion embedded in speech
Speech analysis The AIM that extracts the characteristics of the speaker (e.g., physiology and intention)
Speech and Emotion File Format A file format that contains Digital speech and time-stamped Emotions related to speech
Speech detection and separation AIM that separates relevant Speech vs non-speech signals
Speech Features Speech features used to extract Emotion descriptors
Storage Storage used to e.g., store the inputs and outputs of the individual AIMs, data from the AIM’s state and intermediary results, shared data among AIMs
Tape irregularity KB Dataset that includes examples of the different irregularities that may be present in the carrier (analogue tape, phonographic discs) considered
Text Characters drawn from a finite alphabet
Time-domain features Properties (descriptors) of the signal with respect to time
User hearing profiles KB A dataset of hearing profiles of target users

7        References

  1. MPAI-AIF Use Cases and Functional Requirements, N74; https://mpai.community/standards/mpai-aif/#Requirements
  2. MPAI-AIF Call for Technologies, N100; https://mpai.community/standards/mpai-aif/#Technologies
  3. MPAI-CAE Use Cases and Functional Requirements, N151; https://mpai.community/standards/mpai-cae/#UCFR
  4. MPAI-CAE Call for Technologies, N152; https://mpai.community/standards/mpai-cae/#Technologies
  5. MPAI-MMC Use Cases and Functional Requirements, N153; https://mpai.community/standards/mpai-mmc/#Requirements
  6. MPAI-MMC Call for Technologies, N154; https://mpai.community/standards/mpai-mmc/#Technologies
  7. Burkhardt and N. Campbell, “Emotional speech synthesis,” in The Oxford Handbook of Affective Computing. Oxford University Press New York, 2014, p. 286
  8. Noé Tits, A Methodology for Controlling the Emotional Expressiveness in Synthetic Speech – a Deep Learning approach, 8th International Conference on Affective Computing and Intelligent Interaction Workshops and Demos (ACIIW), September 2019, DOI: 10.1109/ACIIW.2019.8925241
  9. W. Adorno, Philosophy of New Music, University of Minnesota Press, Minneapolis, Minn, USA, 2006
  10. Ekman, P. (1999). Basic Emotions. In T. Dalgleish and T. Power (Eds.) The Handbook of Cognition and Emotion Pp. 45–60. Sussex, U.K.: John Wiley & Sons, Ltd.
  11. Plutchik R., Emotion: a psychoevolutionary synthesis, New York Harper and Row, 1980
  12. Russell, James (1980). “A circumplex model of affect”. Journal of Personality and Social Psychology. 39 (6): 1161–1178. doi:10.1037/h0077714
  13. Cahn, J. E., The Generation of Affect in Synthesized Speech, Journal of the American Voice I/O Society, 8, July 1990, p. 1-19
  14. https://www.w3.org/TR/2014/REC-emotionml-20140522/
  15. Cahn, J. E., The Generation of Affect in Synthesized Speech, Journal of the American Voice I/O Society, 8, July 1990, p. 1-19
  16. Burkhardt, F., & Sendlmeier, W. F., Verification of Acoustical Correlates of Emotional Speech using Formant-Synthesis, ISCA Workshop on Speech & Emotion, Northern Ireland 2000, p. 151-156.
  17. Scherer, K. R., Ladd, D. R., & Silverman, K., Vocal cues to speaker affect: Testing two models, Journal of the Acoustic Society of America, 76(5), 1984, p. 1346-1356
  18. Kasuya, H., Maekawa, K., & Kiritani, S., Joint Estimation of Voice Source and Vocal Tract Parameters as Applied to the Study of Voice Source Dynamics, ICPhS 99, p. 2505-2512
  19. R. Livingstone and F. A. Russo, “The ryerson audio-visual database of emotional speech and song (ravdess): A dynamic, multimodal set of facial and vocal expressions in north american english,” PLOS ONE, vol. 13, no. 5, pp. 1–35, 05 2018
  20. Cao, D. G. Cooper, M. K. Keutmann, R. C. Gur, A. Nenkova, and R. Verma, “Crema-d: Crowd-sourced emotional multimodal actors dataset,” IEEE transactions on affective computing, vol. 5, no. 4, pp. 377–390, 2014
  21. Banziger, M. Mortillaro, and K. R. Scherer, “Introducing the geneva multimodal expression corpus for experimental research on emotion perception.” Emotion, vol. 12, no. 5, p. 1161, 2012
  22. Burkhardt, A. Paeschke, M. Rolfes, W. F. Sendlmeier, and B. Weiss, “A database of german emotional speech,” in Ninth European Conference on Speech Communication and Technology, 2005
  23. Mozziconacci, S. J. L., Speech Variability and Emotion: Production and Perception, PhD Thesis, Technical University Eindhoven, 1998
  24. Burkhardt, F., & Sendlmeier, W. F., Verification of Acoustical Correlates of Emotional Speech using Formant-Synthesis, ISCA Workshop on Speech & Emotion, Northern Ireland 2000, p. 151-156.
  25. Cahn, J. E., The Generation of Affect in Synthesized Speech, Journal of the American Voice I/O Society, 8, July 1990, p. 1-19
  26. Hamed Beyramienanlou, Nasser Lotfivand, “An Efficient Teager Energy Operator-Based Automated QRS Complex Detection”, Journal of Healthcare Engineering, vol. 2018, Article ID 8360475, 11 pages, 2018. https://doi.org/10.1155/2018/8360475
  27. Davis S B. Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences. IEEE Trans. Acoust. Speech Signal Process. 1980, 28(4):65-74
  28. Giovanni Costantini, Iacopo Iaderola, Andrea Paoloni, Massimiliano Todisco. EMOVO Corpus: an Italian Emotional Speech Database. Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC’14), Reykjavik, Iceland, pp. 3501–3504, May 2014.
  29. Moataz El Ayadi, Mohamed S. Kamel, Fakhri Karray. Survey on speech emotion recognition: Features, classification schemes, and databases. Pattern Recognition, Elsevier, 44 (2011) 572–587.
  30. IASA-TC 05: Handling and Storage of Audio and Video Carriers. IASA Technical Committee (2014)
  31. Hamed Beyramienanlou, Nasser Lotfivand, “An Efficient Teager Energy Operator-Based Automated QRS Complex Detection”, Journal of Healthcare Engineering, vol. 2018, Article ID 8360475, 11 pages, 2018. https://doi.org/10.1155/2018/8360475
  32. ISO/IEC 10918-1:1994 Information Technology — Digital Compression And Coding Of Continuous-Tone Still Images: Requirements And Guidelines
  33. Federica Bressan and Sergio Canazza, A Systemic Approach to the Preservation of Audio Documents: Methodology and Software Tools, Journal of Electrical and Computer Engineering, 2013. https://doi.org/10.1155/2013/489515
  34. Boston, Safeguarding the Documentary Heritage. A Guide to Standards, Recommended Practices and Reference Literature Related to the Preservation of Documents of All Kinds, UNESCO, Paris, France, 1988.
  35. S. Canazza, The digital curation of ethnic music audio archives: from preservation to restoration, International Journal on Digital Libraries, 12(2-3):121–135, 2012
  36. S. J. Godsill and P. J. W. Rayner, Digital Audio Restoration – a statistical model-based approach (Berlin: Springer-Verlag 1998)
  37. Pretto, Niccolò; Fantozzi, Carlo; Micheloni, Edoardo; Burini, Valentina; Canazza Targon, Sergio. Computing Methodologies Supporting the Preservation of Electroacoustic Music from Analog Magnetic Tape. In Computer Music Journal, 2018, vol. 42 (4), pp.59-74
  38. Fantozzi, Carlo; Bressan, Federica; Pretto, Niccolò; Canazza, Sergio. Tape music archives: from preservation to access. pp.233-249. In International Journal On Digital Libraries, pp. 1432-5012 vol. 18 (3), 2017. DOI:10.1007/s00799-017-0208-8
  39. ISO/IEC 10646:2003 Information Technology — Universal Multiple-Octet Coded Character Set (UCS)
  40. https://www.iis.fraunhofer.de/en/ff/amm/broadcast-streaming/mpegh.html
  41. https://docs.microsoft.com/bs-cyrl-ba/azure/cognitive-services/speech-service/how-to-devices-microphone-array-configuration
  42. https://www.frontiersin.org/articles/10.3389/fpsyg.2018.01277/full
  43. https://help.nuraphone.com/hc/en-us/articles/360000324676-Your-Profile
  44. https://integrate.mimi.io/documentation/android/4.0.1/documentation
  45. Problem Agnostic Speech Encoder; https://github.com/santi-pdp/pase


Framework Licence

This document is also available in MS Word format MPAI-CAE Framework Licence

1        Coverage

MPAI has identified the application area called “Context-based Audio Enhancement” as relevant for MPAI standardisation because the use of context information can substantially improve the user experience of a variety of audio-related applications in the areas of entertainment, communication, teleconferencing, gaming, post-production, restoration and more, in a variety of scenarios such as the home, the car, on-the-go and the studio. Therefore, MPAI intends to develop a standard – to be called MPAI-CAE – that will provide standard technologies to implement the four Use Cases identified so far:

  1. Emotion-Enhanced Speech (EES)
  2. Audio Recording Preservation (ARP)
  3. Enhanced Audioconference Experience (EAE)
  4. Audio-on-the-go (AOG)

The MPAI Context-based Audio Enhancement (MPAI-CAE) standard will be defined in document Nxyz of Moving Picture, Audio and Data Coding by Artificial Intelligence (MPAI).

2        Definitions

Term: Definition
Data: Any digital representation of a real or computer-generated entity, such as moving pictures, audio, point cloud, computer graphics, sensor and actuator data. Data includes, but is not restricted to, media, manufacturing, automotive, health and generic data.
Development Rights: License to use MPAI-CAE Essential IPRs to develop Implementations
Enterprise: Any commercial entity that develops or implements the MPAI-CAE standard
Essential IPR: Any Proprietary Rights (such as patents) without which it is not possible on technical (but not commercial) grounds to make, sell, lease, otherwise dispose of, repair, use or operate Implementations without infringing those Proprietary Rights
Framework License: A document, developed in compliance with the generally accepted principles of competition law, which contains the conditions of use of the License without the values, e.g., currency, percent, dates etc.
Implementation: A hardware and/or software reification of the MPAI-CAE standard serving the needs of a professional or consumer user directly or through a service
Implementation Rights: License to reify the MPAI-CAE standard
License: This Framework License to which values, e.g., currency, percent, dates etc., related to a specific Intellectual Property will be added. In this Framework License, the word License will be used as singular. However, multiple Licenses from different IPR holders may be issued
Profile: A particular subset of the technologies that are used in the MPAI-CAE standard and, where applicable, the classes, subsets, options and parameters relevant to the subset

3        Conditions of use of the License

  1. The License will be in compliance with generally accepted principles of competition law and the MPAI Statutes
  2. The License will cover all of Licensor’s claims to Essential IPR practiced by a Licensee of the MPAI-CAE standard.
  3. The License will cover Development Rights and Implementation Rights
  4. The License for Development and Implementation Rights, to the extent it is developed and implemented only for the purpose of evaluation or demo solutions or for technical trials, will be free of charge
  5. The License will apply to a baseline MPAI-CAE profile and to other profiles containing additional technologies
  6. Access to Essential IPRs of the MPAI-CAE standard will be granted in a non-discriminatory fashion.
  7. The scope of the License will be subject to legal, bias, ethical and moral limitations
  8. Royalties will apply to Implementations that are based on the MPAI-CAE standard
  9. Royalties will apply on a worldwide basis
  10. Royalties will apply to any Implementation, with the exclusion of the type of implementations specified in clause 4
  11. An MPAI-CAE Implementation may use other IPR to extend the MPAI-CAE Implementation or to provide additional functionalities
  12. The License may be granted free of charge for particular uses if so decided by the licensors
  13. A license free of charge for limited time and a limited amount of forfeited royalties will be granted on request
  14. A preference will be expressed on the entity that should administer the patent pool of holders of Patents Essential to the MPAI-CAE standard
  15. The total cost of the Licenses issued by IPR holders will be in line with the total cost of the Licenses for similar technologies standardised in the context of Standard Development Organisations
  16. The total cost of the Licenses will take into account the value on the market of the AI Framework technology Standardised by MPAI.


Call for Technologies

This document is also available in MS Word format as MPAI-CAE Call for Technologies

1        Introduction

2        How to submit a response

3        Evaluation Criteria and Procedure

4        Expected development timeline

5        References

Annex A: Information Form

Annex B: Evaluation Sheet

Annex C: Requirements check list

Annex D: Technologies that may require specific testing

Annex E: Mandatory text in responses

1        Introduction

Moving Picture, Audio and Data Coding by Artificial Intelligence (MPAI) is an international non-profit organisation with the mission to develop standards for Artificial Intelligence (AI) enabled digital data coding and for technologies that facilitate integration of data coding components into ICT systems. With the mechanism of Framework Licences, MPAI seeks to attach clear IPR licensing frameworks to its standards.

MPAI has found that the application area called “Context-based Audio Enhancement” is particularly relevant for MPAI standardisation because using context information to act on the input audio content can substantially improve the user experience of a variety of audio-related applications that include entertainment, communication, teleconferencing, gaming, post-production, restoration etc. for a variety of contexts such as in the home, in the car, on-the-go, in the studio etc.

Therefore, MPAI intends to develop a standard – to be called MPAI-CAE – that will provide standard technologies to implement four Use Cases identified so far:

  1. Emotion-Enhanced Speech (EES)
  2. Audio Recording Preservation (ARP)
  3. Enhanced Audioconference Experience (EAE)
  4. Audio-on-the-go (AOG)

This document is a Call for Technologies (CfT) for technologies that:

  1. Satisfy the MPAI-CAE Functional Requirements (N151) [4] and
  2. Are released according to the MPAI-CAE Framework Licence (N171) [6], if selected by MPAI for inclusion in the MPAI-CAE standard.

The standard will be developed with the following guidelines:

  1. To satisfy the Functional Requirements (N151) [4], available online. In the future, MPAI may decide to extend MPAI-CAE to support other Use Cases.
  2. To use, where feasible and desirable, the same basic technologies required by the companion document MPAI-MMC Use Cases and Functional Requirements [7].
  3. To be suitable for implementation as AI Modules (AIM) conforming to the emerging MPAI AI Framework (MPAI-AIF) standard based on the responses to the Call for Technologies (N100) [2] satisfying the MPAI-AIF Functional Requirements (N74) [1].

MPAI has decided to base its application standards on the AIM and AIF notions whose functional requirements have been identified in [1] rather than follow the approach of defining end-to-end systems. It has done so for the following reasons (a non-normative interface sketch follows the list):

  1. AIMs allow the reduction of a large problem to a set of smaller problems.
  2. AIMs can be independently developed and made available to an open competitive market.
  3. An implementor can build a sophisticated and complex system with potentially limited know­ledge of all the tech­nologies required by the system.
  4. An MPAI system has an inherent explainability.
  5. MPAI systems allow for competitive comparisons of functionally equivalent AIMs.
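Purely as an illustration of the decomposition idea above, and not as a definition of the MPAI-AIF interfaces (which are specified elsewhere), the sketch below shows how independently developed AIMs with declared, named inputs and outputs could be chained; all class and method names are hypothetical.

```python
from abc import ABC, abstractmethod
from typing import Any, Dict, List


class AIModule(ABC):
    """Hypothetical AIM wrapper: only the named inputs/outputs are exposed."""

    inputs: List[str] = []   # names of the input interfaces this AIM consumes
    outputs: List[str] = []  # names of the output interfaces this AIM produces

    @abstractmethod
    def process(self, data: Dict[str, Any]) -> Dict[str, Any]:
        """Map named inputs to named outputs; the internals are non-normative."""


class Workflow:
    """Chains AIMs whose declared interfaces match, in the spirit of an AIF use case."""

    def __init__(self, modules: List[AIModule]) -> None:
        self.modules = modules

    def run(self, data: Dict[str, Any]) -> Dict[str, Any]:
        for module in self.modules:
            missing = [name for name in module.inputs if name not in data]
            if missing:
                raise ValueError(f"{type(module).__name__} is missing inputs: {missing}")
            data = {**data, **module.process(data)}
        return data
```

Because only the named interfaces are fixed, any AIM in such a chain could be swapped for a functionally equivalent one, which is what enables the competitive comparison mentioned in point 5.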

Respondents should be aware that:

  1. The AIMs that make up the MPAI-CAE Use Cases, the Use Cases themselves and the AIM internals will be non-normative.
  2. The input and output interfaces of the AIMs, whose requirements have been derived to support the Use Cases, will be normative.

Therefore, the scope of this Call for Technologies is restricted to technologies required to implement the input and output interfaces of the AIMs identified in N151 [4].

However, MPAI invites comments on any technology or architectural component identified in N151, specifically,

  1. Additions or removals of input/output signals to the identified AIMs with justification of the changes and identification of data formats required by the new input/output signals.
  2. Possible alternative partitioning of the AIMs implementing the example cases, providing:
    1. Arguments in support of the proposed partitioning
    2. Detailed specifications of the input and output data of the proposed new AIMs
  3. New Use Cases fully described as in N151.

All parties who believe they have relevant technologies satisfying all or most of the requirements of one or more than one Use Case described in N151 are invited to submit proposals for consideration by MPAI. MPAI membership is not a prerequisite for responding to this CfT. However, proponents should be aware that, if their proposal or part thereof is accepted for inclusion in the MPAI-CAE standard, they shall immediately join MPAI, or their accepted technologies will be discarded.

MPAI will select the most suitable technologies based on their technical merits for inclusion in MPAI-CAE. However, MPAI is not obligated, by virtue of this CfT, to select a particular technology or to select any technology if those submitted are found inadequate.

Submissions are due on 2021/04/12T23:59 UTC and should be sent to the MPAI secretariat (secretariat@mpai.community). The secretariat will acknowledge receipt of the submission via email. Submissions will be reviewed according to the schedule that the 7th MPAI General Assembly (MPAI-7) will define at its online meeting on 2021/04/14. For details on how submitters who are not MPAI members can attend the said review, please contact the MPAI secretariat (secretariat@mpai.community).

2        How to submit a response

Those planning to respond to this CfT:

  1. Are advised that online events will be held on 2021/02/24 and 2021/03/10 to present the MPAI-CAE CfT and respond to questions. Logistic information on these events will be posted on the MPAI web site.
  2. Are requested to communicate their intention to respond to this CfT with an initial version of the form of Annex A to the MPAI secretariat (secretariat@mpai.community) by 2021/03/16. A potential submitter making a communication using the said form is not required to actually make a submission. A submission will be accepted even if the submitter did not communicate their intention to submit a response by the said date.
  3. Are advised to visit regularly the https://mpai.community/how-to-join/calls-for-technologies/ web site where relevant information will be posted.

Responses to this MPAI-CAE CfT shall/may include:

Table 1 – Mandatory and optional elements of a response

Item Status
Detailed documentation describing the proposed technologies mandatory
The final version of Annex A mandatory
The text of Annex B duly filled out with the table indicating which requirements identified in MPAI N151 [4] are satisfied. If not all the requirements of a Use Case are satisfied, this should be explained. mandatory
Comments on the completeness and appropriateness of the MPAI-CAE requirements and any motivated suggestion to amend or extend those requirements. optional
A preliminary demonstration, with a detailed document describing it. optional
Any other additional relevant information that may help evaluate the submission, such as additional use cases. optional
The text of Annex E. mandatory

Respondents are invited to take advantage of the check list of Annex C before submitting their response and filling out Annex A.

Respondents are requested to present their submission (mandatory) at a teleconference meeting that will be properly announced to submitters by the MPAI Secretariat. If no presenter attends the meeting, the proposal will be discarded.

Respondents are advised that, upon acceptance by MPAI of their submission in whole or in part for further evaluation, MPAI will require that:

  • A working implementation, including source code, – for use in the development of the MPAI-CAE Reference Software and later publication as a standard by MPAI – be made available before the technology is accepted for inclusion in the MPAI-CAE standard. Software may be written in programming languages that can be compiled or interpreted and in hardware description languages.
  • The working implementation be suitable for operation in the MPAI AIF Framework (MPAI-AIF).
  • A non-MPAI member immediately join MPAI. If the non-MPAI member elects not to do so, their submission will be discarded. Direction on how to join MPAI can be found online.

Further information on MPAI can be obtained from the MPAI website.

3        Evaluation Criteria and Procedure

Proposals will be assessed using the following process:

  1. Evaluation panel is created from:
    1. All CAE-DC members attending.
    2. Non-MPAI members who are respondents.
    3. Non-respondent, non-MPAI-member experts invited in a consulting capacity.
  2. No one from 1.1.-1.2. will be denied membership in the Evaluation panel.
  3. Respondents present their proposals.
  4. Evaluation Panel members ask questions.
  5. If required, subjective and/or objective tests are carried out:
    1. Define required tests.
    2. Carry out the tests.
    3. Produce report.
  6. At least two reviewers will be appointed to review and report on specific points of a proposal, if required.
  7. Evaluation panel members fill out Annex B for each proposal.
  8. Respondents respond to evaluations.
  9. Proposal evaluation report is produced.

4        Expected development timeline

Timeline of the CfT, deadlines and response evaluation:

Table 2 – Dates and deadlines

Step Date
Call for Technologies 2021/02/17
CfT introduction conference call 1 2021/02/24T14:00 UTC
CfT introduction conference call 2 2021/03/10T15:00 UTC
Notification of intention to submit proposal 2021/03/16T23:59 UTC
Submission deadline 2021/04/12T23:59 UTC
Evaluation of responses will start 2021/04/14 (MPAI-7)

Evaluation to be carried out during 2-hour sessions according to the calendar agreed at MPAI-7.

5        References

  1. MPAI-AIF Use Cases & Functional Requirements, N74; https://mpai.community/standards/mpai-aif/
  2. MPAI-AIF Call for Technologies, N100; https://mpai.community/standards/mpai-aif/#Technologies
  3. MPAI-AIF Framework Licence, MPAI N171; https://mpai.community/standards/mpai-aif/#Licence
  4. MPAI-CAE Use Cases & Functional Requirements; MPAI N151; https://mpai.community/standards/mpai-cae/#UCFR
  5. MPAI-CAE Call for Technologies, MPAI N152; https://mpai.community/standards/mpai-cae/#Technologies
  6. MPAI-CAE Framework Licence, MPAI N171; https://mpai.community/standards/mpai-cae/#Licence
  7. MPAI-MMC Use Cases & Functional Requirements; MPAI N153; https://mpai.community/standards/mpai-mmc/#UCFR
  8. MPAI-MMC Call for Technologies, MPAI N154; https://mpai.community/standards/mpai-mmc/#Technologies
  9. MPAI-MMC Framework Licence, N173; https://mpai.community/standards/mpai-mmc/#Licence

Annex A: Information Form

This information form is to be filled in by a Respondent to the MPAI-CAE CfT

  1. Title of the proposal
  2. Organisation: company name, position, e-mail of contact person
  3. What are the main functionalities of your proposal?
  4. Does your proposal provide or describe a formal specification and APIs?
  5. Will you provide a demonstration to show how your proposal meets the evaluation criteria?

Annex B: Evaluation Sheet

NB: This evaluation sheet will be filled out by members of the Evaluation Team.

Proposal title:

Main Functionalities:

Response summary: (a few lines)

Comments on Relevance to the CfT (Requirements):

Comments on possible MPAI-CAE profiles[1]

Evaluation table:

Table 3 – Assessment of submission features

Note 1: The semantics of Submission features is provided by Table 4
Note 2: Evaluation elements indicate the elements used by the evaluator in assessing the submission
Note 3: Final Assessment indicates the ultimate assessment based on the Evaluation Elements

 

Submission features Evaluation elements Final Assessment
Completeness of description

Understandability

Extensibility

Use of Standard Technology

Efficiency

Test cases

Maturity of reference implementation

Relative complexity

Support of MPAI use cases

Support of non-MPAI use cases

Content of the criteria table cells:

Evaluation facts should mention:

  • Not supported / partially supported / fully supported.
  • What supported these facts: submission/presentation/demo.
  • The summary of the facts themselves, e.g., very good in one way, but weak in another.

Final assessment should mention:

  • Possibilities to improve or add to the proposal, e.g., any missing or weak features.
  • How sure the evaluators are, i.e., evidence shown, very likely, very hard to tell, etc.
  • Global evaluation (Not Applicable / – – / – / + / ++)

New Use Cases/Requirements Identified:

(please describe)

  •  Evaluation summary:
  •  Main strong points, qualitatively:
  •  Main weak points, qualitatively:
  • Overall evaluation: (0/1/2/3/4/5)

0: could not be evaluated

1: proposal is not relevant

2: proposal is relevant, but requires significantly more work

3: proposal is relevant, but with a few changes

4: proposal has some very good points, so it is a good candidate for standard

5: proposal is superior in its category, very strongly recommended for inclusion in standard

Additional remarks: (points of importance not covered above.)

The submission features in Table 3 are explained in the following Table 4.

Table 4 – Explanation of submission features

Submission features Criteria
Completeness of description Evaluators should

1.     Compare the list of requirements (Annex C of the CfT) with the submission.

2.     Check if respondents have described in sufficient detail to which part of the requirements their proposal refers.

NB1: Completeness of a proposal for a Use Case is a merit because reviewers can assess that the components are integrated.

NB2: Submissions will be judged for the merit of what is proposed. A submission on a single technology that is excellent may be considered instead of a submission that is complete but has a less performing technology.

Understandability Evaluators should identify items that are demonstrably unclear (inconsistencies, sentences with dubious meaning etc.)
Extensibility Evaluators should check if respondent has proposed extensions to the Use Cases.

NB: Extensibility is the capability of the proposed solution to support use cases that are not supported by current requirements.

Use of Standard Technology Evaluators should check if new technologies are proposed where widely adopted technologies exist. If this is the case, the merit of the new technology shall be proved.
Efficiency Evaluators should assess power consumption, computational speed, computational complexity.
Test cases Evaluators should report whether a proposal contains suggestions for testing the technologies proposed
Maturity of reference implementation Evaluators should assess the maturity of the proposal.

Note 1: Maturity is measured by the completeness, i.e., having all the necessary information and appropriate parts of the HW/SW implementation of the submission disclosed.

Note 2: If there are parts of the implementation that are not disclosed but demonstrated, they will be considered if and only if such components are replicable.

Relative complexity Evaluators should identify issues that would make it difficult to implement the proposal compared to the state of the art.
Support of MPAI-CAE use cases Evaluators should check how many use cases are supported in the submission.
Support of non MPAI-CAE use cases Evaluators should check whether the technologies proposed can demonstrably be used in other significantly different use cases.

Annex C: Requirements check list

Please note the following acronyms

KB Knowledge Base
QF Query Format

Table 5 – List of technologies identified in MPAI-CAE N151 [4]

Note: The numbers in the first column refer to the section numbers of N151 [4].

Technologies by Use Cases Response
Emotion-Enhanced Speech
4.2.4.1 Digital Speech Y/N
4.2.4.2 Emotion Y/N
4.2.4.3 Emotion KB query format Y/N
4.2.4.4 Emotion descriptors Y/N
Audio Recording Preservation
4.3.4.1 Digital Audio Y/N
4.3.4.2 Digital Video Y/N
4.3.4.3 Digital Image Y/N
4.3.4.4 Tape irregularity KB query format Y/N
4.3.4.5 Text Y/N
4.3.4.6 Packager Y/N
Enhanced Audioconference Experience
4.4.4.1 Digital Speech Y/N
4.4.4.2 Microphone geometry information Y/N
4.4.4.3 Output device acoustic model metadata KB query format Y/N
4.4.4.4 Delivery Y/N
Audio-on-the-go
4.5.4.1 Digital Audio Y/N
4.5.4.2 Microphone geometry information Y/N
4.5.4.3 Sound array Y/N
4.5.4.4 Sound categorisation KB query format Y/N
4.5.4.5 Sounds categorisation Y/N
4.5.4.6 User Hearing Profiles KB query format Y/N
4.5.4.7 Delivery Y/N

Respondents should consult the equivalent list in N154 [8] as some technologies are common or have a degree of similarity.

Annex D: Technologies that may require specific testing

Emotion-Enhanced Speech: Speech features
Emotion-Enhanced Speech: Emotion descriptors
Audio Recording Preservation: Image features

Additional technologies may be identified during the evaluation phase.

Annex E: Mandatory text in responses

A response to this MPAI-CAE CfT shall mandatorily include the following text:

<Company/Member> submits this technical document in response to MPAI Call for Technologies for MPAI project MPAI-CAE (N151).

<Company/Member> explicitly agrees to the steps of the MPAI standards development process defined in Annex 1 to the MPAI Statutes (N80), in particular <Company/Member> declares that <Company/Member> or its successors will make available the terms of the Licence related to its Essential Patents according to the Framework Licence of MPAI-CAE (N171), alone or jointly with other IPR holders, after the approval of the MPAI-CAE Technical Specification by the General Assembly and in no event after commercial implementations of the MPAI-CAE Technical Specification become available on the market.

In case the respondent is a non-MPAI member, the submission shall mandatorily include the following text:

If (a part of) this submission is identified for inclusion in a specification, <Company> understands that <Company> will be requested to immediately join MPAI and that, if <Company> elects not to join MPAI, this submission will be discarded.

Subsequent technical contributions shall mandatorily include this text:

<Member> submits this document to the MPAI-CAE Development Committee (CAE-DC) as a contribution to the development of the MPAI-CAE Technical Specification.

<Member> explicitly agrees to the steps of the MPAI standards development process defined in Annex 1 to the MPAI Statutes (N80), in particular <Member> declares that <Member> or its successors will make available the terms of the Licence related to its Essential Patents according to the Framework Licence of MPAI-CAE (N171), alone or jointly with other IPR holders, after the approval of the MPAI-CAE Technical Specification by the General Assembly and in no event after commercial implementations of the MPAI-CAE Technical Specification become available on the market.

[1] A Profile of a standard is a particular subset of the technologies that are used in the standard and, where applicable, the classes, subsets, options and parameters relevant to the subset.



Template for responses to the Call for Technologies

This document is also available in MS Word format Template for responses to the MPAI-CAE Call for Technologies

Abstract

This document is provided as a help to those who intend to submit responses to the MPAI-CAE Call for Technologies. Text in red (as in this sentence) provides guidance to submitters and should not be included in a submission. Text in green shall be mandatorily included in a submission. If a submission does not include the green text, the submission will be rejected.

If the submission is in multiple files, each file shall include the green statement.

Text in white is the text suggested to respondents for use in a submission.

1        Introduction

This document is submitted by <organisation name> (if an MPAI Member) and/or by <organ­is­ation name>, a <company, university etc.> registered in … (if a non-MPAI member) in response to the MPAI-CAE Call for Technol­ogies issued by Moving Picture, Audio and Data Coding by Artificial Intelligence (MPAI) on 2021/02/17 as MPAI document N152.

In the opinion of the submitter, this document proposes technologies that satisfy the requirements of MPAI document MPAI-CAE Use Cases & Functional Requirements issued by MPAI on 2021/02/17 as MPAI document N151.

Possible additions

This document also contains comments on the requirements as requested by N151.

This document also contains proposed technologies that satisfy additional requirements as allowed by N151.

<Company and/or Member> explicitly agrees to the steps of the MPAI standards development process defined in Annex 1 to the MPAI Statutes (N80), in particular <Company and/or Member> declares that <Company and/or Member> or its successors will make available the terms of the Licence related to its Essential Patents according to the MPAI-CAE Framework Licence (N171), alone or jointly with other IPR holders, after the approval of the MPAI-CAE Technical Specification by the MPAI General Assembly and in no event after commercial implementations of the MPAI-CAE Technical Specification become available on the market.

<Company and/or Member> acknowledges the following points:

  1. MPAI is not obligated, by virtue of this CfT, to select a particular technology or to select any technology if those submitted are found inadequate.
  2. MPAI may decide to use the same technology for functionalities also requested in the MPAI-MMC Call for Technologies (N154) and the associated Functional Requirements (N153).
  3. A representative of <Company and/or Member> shall present this submission at a CAE-DC meeting communicated by the MPAI Secretariat (mailto:secretariat@mpai.community). If no representative of <Company and/or Member> attends the meeting and presents the submission, this submission will be discarded.
  4. <Company and/or Member> shall make available a working implementation, including source code – for use in the development of the MPAI-CAE Reference Software and eventual publication by MPAI as a normative standard – before the technology submitted is accepted for the MPAI-CAE standard.
  5. The software submitted may be written in programming languages that can be compiled or interpreted and in hardware description languages, upon acceptance by MPAI for further evaluation of their submission in whole or in part.
  6. <Company> shall immediately join MPAI upon acceptance by MPAI for further evaluation of this submission in whole or in part.
  7. If <Company> does not join MPAI, this submission shall be discarded.

2        Information about the submission

This information corresponds to Annex A in N152. It is included here for the submitter’s convenience.

  1. Title of the proposal
  2. Organisation: company name, position, e-mail of contact person
  3. What are the main functionalities of your proposal?
  4. Does your proposal provide or describe a formal specification and APIs?
  5. Will you provide a demonstration to show how your proposal meets the evaluation criteria?

3        Comments on/extensions to requirements (if any)

 

4        Overview of Requirements supported by the submission

Please answer Y or N. Detail on the specific answers can be provided in the submission.

Technologies by Use Cases Response
Emotion-Enhanced Speech
4.2.4.1 Digital Speech Y/N
4.2.4.2 Emotion Y/N
4.2.4.3 Emotion KB query format Y/N
4.2.4.4 Emotion descriptors Y/N
Audio Recording Preservation
4.3.4.1 Digital Audio Y/N
4.3.4.2 Digital Video Y/N
4.3.4.3 Digital Image Y/N
4.3.4.4 Tape irregularity KB query format Y/N
4.3.4.5 Text Y/N
4.3.4.6 Packager Y/N
Enhanced Audioconference Experience
4.4.4.1 Digital Speech Y/N
4.4.4.2 Microphone geometry information Y/N
4.4.4.3 Output device acoustic model metadata KB query format Y/N
4.4.4.4 Delivery Y/N
Audio-on-the-go
4.5.4.1 Digital Audio Y/N
4.5.4.2 Microphone geometry information Y/N
4.5.4.3 Sound array Y/N
4.5.4.4 Sound categorisation KB query format Y/N
4.5.4.5 Sounds categorisation Y/N
4.5.4.6 User Hearing Profiles KB query format Y/N
4.5.4.7 Delivery Y/N

5        New Proposed requirements (if any)

1. Y/N
2. Y/N
3. Y/N

6        Detailed description of submission

6.1       Proposal chapter #1

6.2       Proposal chapter #2

….

7        Conclusions



MPAI Application Note #1 Rev. 1

Proponents: Michelangelo Guarise, Andrea Basso (VOLUMIO)

 Description: The overall user experience quality is highly dependent on the context in which audio is used, e.g.

  1. Entertainment audio can be consumed in the home, in the car, on public transport, on-the-go (e.g. while doing sports, running, biking) etc.
  2. Voice communications can take place in the office, in the car, at home, on-the-go etc.
  3. Audio and video conferencing can be done in the office, in the car, at home, on-the-go etc.
  4. (Serious) gaming can be done in the office, at home, on-the-go etc.
  5. Audio (post-)production is typically done in the studio
  6. Audio restoration is typically done in the studio

By using context information to act on the content with AI, it is possible to substantially improve the user experience.

Figure 1 represents how MPAI-CAE can reorganise its processing modules within an MPAI-AIF Framework to support different applications.

Figure 1 – Instances of MPAI-CAE

Comments: Currently, there are solutions that adapt the conditions in which the user experiences content or service for some of the contexts mentioned above. However, they tend to be vertical in nature, making it difficult to re-use possibly valuable AI-based components of the solutions for different applications.

MPAI-CAE aims to create a horizontal market of re-usable and possibly context-dependent components that expose standard interfaces. The market would become more receptive to innovation and hence more competitive. Industry and consumers alike will benefit from the MPAI-CAE standard.

Examples

The following examples describe how MPAI-CAE can make the difference.

  1. Enhanced audio experience in a conference call

Often, the user experience of a video/audio conference can be poor. Too much background noise or undesired sounds can prevent participants from understanding what other participants are saying. By using AI-based adaptive noise cancellation and sound enhancement, MPAI-CAE can virtually eliminate those kinds of noise without using complex microphone systems to capture the characteristics of the environment.
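As a non-normative illustration of one of the many possible techniques, the sketch below applies a minimal spectral-gating denoiser to a mono speech signal; it assumes, only for the example, that the opening half second is speech-free and can be used to estimate the noise floor.

```python
import numpy as np


def spectral_gate(speech: np.ndarray, sr: int, frame: int = 1024, hop: int = 512,
                  noise_seconds: float = 0.5, floor_gain: float = 0.1) -> np.ndarray:
    """Toy spectral-gating denoiser: attenuate frequency bins close to the noise floor."""
    window = np.hanning(frame)
    starts = range(0, len(speech) - frame, hop)
    spectra = [np.fft.rfft(window * speech[s:s + frame]) for s in starts]
    # Per-bin noise floor estimated from the opening segment (assumed speech-free).
    noise_frames = max(1, int(noise_seconds * sr / hop))
    noise_floor = np.mean(np.abs(spectra[:noise_frames]), axis=0)
    out = np.zeros(len(speech))
    norm = np.zeros(len(speech)) + 1e-12  # window-power accumulator for overlap-add
    for s, spec in zip(starts, spectra):
        gain = np.where(np.abs(spec) > 2.0 * noise_floor, 1.0, floor_gain)
        out[s:s + frame] += window * np.fft.irfft(gain * spec, n=frame)
        norm[s:s + frame] += window ** 2
    return out / norm
```

A production-grade EAE implementation would of course use learned speech/noise models rather than a fixed threshold; the sketch only shows where such a module would sit in the chain.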

  2. Pleasant and safe music listening while biking

While biking in the middle of city traffic, AI can process the signals from the environment captured by the microphones available in many earphones and earbuds (for active noise cancellation), adapt the sound rendition to the acoustic environment, provide an enhanced audio experience (e.g. by performing dynamic signal equalization), improve battery life and selectively recognise and let through relevant environment sounds (e.g. the horn of a car). The user enjoys a satisfactory listening experience without losing contact with the acoustic surroundings.
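In its simplest form, the adaptation to the acoustic environment described above could be a block-by-block gain rule driven by the ambient signal picked up by the earbud microphones. The sketch below is such a minimal rule; the target SNR and the hearing-safety cap are illustrative values, not MPAI-CAE requirements.

```python
import numpy as np


def ambient_adaptive_gain(music_block: np.ndarray, ambient_block: np.ndarray,
                          target_snr_db: float = 10.0, max_gain_db: float = 12.0) -> np.ndarray:
    """Scale a block of music so it stays target_snr_db above the ambient level,
    without ever changing the level by more than max_gain_db in either direction."""
    eps = 1e-12
    ambient_db = 20 * np.log10(np.sqrt(np.mean(ambient_block ** 2)) + eps)
    music_db = 20 * np.log10(np.sqrt(np.mean(music_block ** 2)) + eps)
    needed_db = np.clip((ambient_db + target_snr_db) - music_db, -max_gain_db, max_gain_db)
    return music_block * (10 ** (needed_db / 20))
```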

  3. Emotion-enhanced synthesized voice

Speech synthesis is constantly improving and finding many applications that are part of our daily life (e.g. intelligent assistants). In addition to improving the naturalness of the synthesized voice, MPAI-CAE can implement expressive models of primary emotions such as fear, happiness, sadness and anger.
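Purely to illustrate the idea of emotional colouring, the sketch below maps a few primary emotions to crude prosody offsets and applies them to a neutral utterance; the values are placeholders, not taken from any MPAI document or emotion knowledge base, and a real EES implementation would control pitch, rate and timbre independently.

```python
import numpy as np

# Illustrative prosody offsets per primary emotion (placeholder values only).
PROSODY = {
    "happiness": {"speed": 1.10, "gain_db": +2.0},
    "sadness":   {"speed": 0.90, "gain_db": -3.0},
    "anger":     {"speed": 1.05, "gain_db": +4.0},
    "fear":      {"speed": 1.15, "gain_db": 0.0},
}


def colour_speech(neutral: np.ndarray, emotion: str) -> np.ndarray:
    """Very crude emotional colouring of a neutral utterance: plain resampling
    (which changes pitch and speaking rate together) plus a level change."""
    p = PROSODY[emotion]
    new_len = int(len(neutral) / p["speed"])
    resampled = np.interp(np.linspace(0, len(neutral) - 1, new_len),
                          np.arange(len(neutral)), neutral)
    return resampled * 10 ** (p["gain_db"] / 20)
```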

  4. Efficient 3D sound

MPAI-CAE can reduce the number of channels (e.g. MPEG-H 3D Audio can support up to 64 loudspeaker channels and 128 codec core channels) in an automatic (unsupervised) way, e.g. by mapping a 9.1 layout to 5.1 or stereo (for radio broadcasting or DVD), while preserving the musical intent of the composer.
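For comparison with the automatic approach described above, the sketch below shows the conventional fixed alternative: an ITU-R BS.775-style passive downmix of 5.1 to stereo with -3 dB centre and surround weights. The channel order is an assumption of the example; a content-aware MPAI-CAE module would derive its coefficients from the material instead of using a fixed matrix.

```python
import numpy as np


def downmix_51_to_stereo(ch: np.ndarray) -> np.ndarray:
    """ch has shape (6, n_samples) in L, R, C, LFE, Ls, Rs order (assumed here)."""
    l, r, c, _lfe, ls, rs = ch          # LFE is conventionally dropped in a 2.0 downmix
    left = l + 0.707 * c + 0.707 * ls   # centre and surrounds folded in at -3 dB
    right = r + 0.707 * c + 0.707 * rs
    stereo = np.stack([left, right])
    return stereo / np.max(np.abs(stereo))  # simple peak normalisation
```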

  5. Speech/audio restoration

Audio restoration is often a time-consuming process that requires skilled audio engineers with specific experience in music and recording techniques to manually go over old audio tapes. MPAI-CAE can automatically remove anomalies from recordings through broadband denoising, declicking and decrackling, as well as removing buzzes and hums and performing spectrographic ‘retouching’ to remove discrete unwanted sounds.
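Of the restoration steps listed above, declicking is the easiest to illustrate. The sketch below flags outliers in the sample-to-sample difference and repairs them by interpolation; the threshold is arbitrary, and real restoration chains (and any eventual MPAI-CAE AIM) use far more sophisticated statistical models.

```python
import numpy as np


def declick(audio: np.ndarray, threshold: float = 6.0) -> np.ndarray:
    """Toy declicker: flag samples whose first difference is a statistical outlier
    and replace them by linear interpolation from the surrounding good samples."""
    diff = np.abs(np.diff(audio, prepend=audio[0]))
    clicks = diff > threshold * (np.median(diff) + 1e-12)
    good = ~clicks
    clean = audio.copy()
    clean[clicks] = np.interp(np.flatnonzero(clicks), np.flatnonzero(good), audio[good])
    return clean
```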

  6. Normalization of volume across channels/streams

Eighty-five years after TV was first introduced as a public service, TV viewers are still struggling to adapt to their needs the different average audio levels of different broadcasters and, within a programme, the different audio levels of different scenes.

MPAI-CAE can learn from the user’s reactions via the remote control, e.g. to a loud spot, and control the sound level accordingly.
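A minimal sketch of programme-level normalisation is shown below, using plain RMS as a stand-in for a proper BS.1770/EBU R128 loudness measurement; the -23 dB target mirrors the EBU R128 programme loudness target but is only illustrative here, as is the hard clipping used as a safety net.

```python
import numpy as np


def normalise_programme(audio: np.ndarray, target_db: float = -23.0) -> np.ndarray:
    """Bring the long-term level of a programme to a target level
    (plain RMS standing in for a standardised loudness measurement)."""
    rms_db = 20 * np.log10(np.sqrt(np.mean(audio ** 2)) + 1e-12)
    gain = 10 ** ((target_db - rms_db) / 20)
    return np.clip(audio * gain, -1.0, 1.0)
```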

  7. Automotive

Audio systems in cars have steadily improved in quality over the years and continue to be integrated into more critical applications. Today, a buyer takes it for granted that a car has a good automotive sound system. In addition, a car usually has at least one and sometimes two microphones to handle the voice-response system and the hands-free cell-phone capability. If the vehicle uses any noise cancellation, several other microphones are involved. MPAI-CAE can be used to improve the user experience and exploit the full quality of current audio systems by reducing the effects of the noisy automotive environment on the signals.

  8. Audio mastering

Audio mastering is still considered an ‘art’ and the prerogative of professional audio engineers. Normal users can upload an example track of their liking (possibly obtained from similar musical content); MPAI-CAE analyses it, extracts key features and generates, starting from the non-mastered track, a master track that ‘sounds like’ the example track. It is also possible to specify the desired style without an example, and the original track will be adjusted accordingly.
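One conceivable way to realise the ‘sounds like’ behaviour is a matching equaliser: compare the long-term band energies of the example track and the user’s track and derive corrective gains. The sketch below computes such gains (the band split and the ±6 dB cap are arbitrary choices of the example); applying them would additionally require a filter bank or FFT filtering, and a full mastering chain would also act on dynamics and stereo image.

```python
import numpy as np


def matching_eq_gains(track: np.ndarray, reference: np.ndarray,
                      n_bands: int = 16, max_db: float = 6.0) -> np.ndarray:
    """Per-band gains (dB) that move the track's average spectrum towards the
    reference track's average spectrum."""
    def band_energy_db(x: np.ndarray) -> np.ndarray:
        spectrum = np.abs(np.fft.rfft(x)) ** 2
        bands = np.array_split(spectrum, n_bands)
        return 10 * np.log10(np.array([b.mean() for b in bands]) + 1e-12)

    return np.clip(band_energy_db(reference) - band_energy_db(track), -max_db, max_db)
```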

Requirements:

The following is an initial set of MPAI-CAE functional requirements, to be further developed in the next few weeks. When the full set of requirements has been developed, the MPAI General Assembly will decide whether an MPAI-CAE standard should be developed. A purely illustrative sketch of the user-related data of requirement 2 is given after the list.

  1. The standard shall specify the following natural input signals
    1. Microphone signals
    2. Inertial measurement signals (Acceleration, Gyroscope, Compass, …)
    3. Vibration signals
    4. Environmental signals (Proximity, temperature, pressure, light, …)
    5. Environment properties (geometry, reverberation, reflectivity, …)
  2. The standard shall specify
    1. User settings (equalization, signal compression/expansion, volume, …)
    2. User profile (auditory profile, hearing aids, …)
  3. The standard shall support the retrieval of pre-computed environment models (audio scene, home automation scene, …)
  4. The standard shall reference the user authentication standards/methods required by the specific MPAI-CAE context
  5. The standard shall specify means to authenticate the components and pipelines of an MPAI-CAE instance
  6. The standard shall reference the methods used to encrypt the streams processed by MPAI-CAE and service-related metadata
  7. The standard shall specify the adaptation layer of MPAI-CAE streams to delivery protocols of common use (e.g. Bluetooth, Chromecast, DLNA, …)
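Purely as an illustration of the kind of data requirement 2 refers to, the sketch below shows hypothetical user settings and user profile records; every field name and value is invented for the example and carries no normative weight.

```python
# Hypothetical shape of the user-related data of requirement 2 (illustrative only).
user_settings = {
    "equalization": {"bands_hz": [60, 250, 1000, 4000, 12000],
                     "gains_db": [2.0, 0.0, -1.0, 1.5, 3.0]},
    "dynamics": {"compression_ratio": 2.0, "threshold_db": -18.0},
    "volume_db": -12.0,
}

user_profile = {
    "auditory_profile": {"audiogram_hz": [1000, 4000, 8000],
                         "left_loss_db": [5, 10, 20],
                         "right_loss_db": [5, 10, 25]},
    "hearing_aids": {"present": False},
}
```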

 Object of standard: Currently, three areas of standardization are identified:

  1. Context type interfaces: a first set of input and output signals, with corresponding syntax and semantics, for audio usage contexts considered of sufficient interest (e.g. audioconferencing and audio consumption on-the-go). They have the following features:
    1. Input and output signals are context-specific, but with a significant degree of commonality across contexts
    2. The operation of the framework is implementation-dependent, offering implementors a way to produce the set of output signals that best fits the usage context
  2. Processing component interfaces, with the following features:
    1. Interfaces of a set of updatable and extensible processing modules (both traditional and AI-based)
    2. Possibility to create processing pipelines and the associated control (including the needed side information) required to manage them
    3. The processing pipeline may be a combination of local and in-cloud processing
  3. Delivery protocol interfaces
    1. Interfaces of the processed audio signal to a variety of delivery protocols

Benefits: MPAI-CAE will bring benefits positively affecting:

  1. Technology providers need not develop full applications to put their technologies to good use. They can concentrate on improving the AI technologies that enhance the user experience. Further, their technologies can find a much broader use in application domains beyond those they are accustomed to dealing with.
  2. Equipment manufacturers and application vendors can draw on the set of technologies made available according to the MPAI-CAE standard by different competing sources, integrate them and satisfy their specific needs.
  3. Service providers can deliver complex optimizations and thus a superior user experience with minimal time to market, as the MPAI-CAE framework enables easy combination of third-party components from both a technical and a licensing perspective. Their services can deliver a high-quality, consistent user audio experience with minimal dependency on the source by selecting the optimal delivery method.
  4. End users enjoy a competitive market that provides constantly improved user experiences and controlled cost of AI-based audio endpoints.

Bottlenecks: the full potential of AI in MPAI-CAE would be unleashed by a market of AI-friendly processing units and by the introduction of the vast amount of available AI technologies into products and services.

 Social aspects: MPAI-CAE would free users from the dependency on the context in which they operate; make the content experience more personal; make the collective service experience less dependent on events affecting the individual participant and raise the level of past content to today’s expectations.

Success criteria: MPAI-CAE should create a competitive market of AI-based components exposing standard interfaces, of processing units available to manufacturers and of a variety of end-user devices, and trigger the implicit need felt by users to have the best experience whatever the context.