
Context-based Audio Enhancement – MPAI-CAE

Draft Call for Technologies

1        Introduction

Moving Picture, Audio and Data Coding by Artificial Intelligence (MPAI) is an international non-profit organisation with the mission to develop standards for Artificial Intelligence (AI) enabled digital data coding and for technologies that facilitate integration of data coding components into ICT systems. With the mechanism of Framework Licences, MPAI seeks to attach clear IPR licensing frameworks to its standards.

MPAI has found that the application area called “Context-based Audio Enhancement” is particularly relevant for MPAI standardisation because using context information to act on the input audio content can substantially improve the user experience of a variety of audio-related applications that include entertainment, communication, teleconferencing, gaming, post-production, restoration etc. for a variety of contexts such as in the home, in the car, on-the-go, in the studio etc.

Therefore, MPAI intends to develop a standard – to be called MPAI-CAE – that will provide standard technologies to implement the four Use Cases identified so far:

  1. Emotion-Enhanced Speech (EES)
  2. Audio Recording Preservation (ARP)
  3. Enhanced Audioconference Experience (EAE)
  4. Audio-on-the-go (AOG)

This document is a Call for Technologies (CfT) for technologies that

  1. satisfy the functional requirements of N131 [1]
  2. are released according to the Framework Licence of N1xy, available online, if selected by MPAI for inclusion in the MPAI-CAE standard.

The standard will be developed with the following guidelines

  1. To satisfy the Functional Requirements of N131 [1], available online. In the future, MPAI may decide to extend MPAI-CAE to support other Use Cases.
  2. To use, where feasible and desirable, the same basic technologies required by the companion document MPAI-MMC Use Cases and Functional Requirements [2].
  3. To be suitable for implementation as AI Modules (AIM) conforming to the emerging MPAI AI Framework (MPAI-AIF) standard. The MPAI-AIF Functional Requirements N74 [4] and the Call for Technologies (N100) [5] are available online.

Respondents should be aware that

  1. The Use Cases that make up MPAI-CAE and the AIM internals will be non-normative.
  2. The input and output interfaces of the AIMs, whose requirements have been derived to support the Use Cases, will be normative.

Therefore, the scope of this Call for Technologies is restricted to technologies required to implement the input and output interfaces of the AIMs identified in N131 [1].

However, MPAI invites comments on any technology or architectural component identified in N131, specifically

  1. Additions or removal of input/output signals to the identified AIMs with identification of data formats required by the new input/output signals
  2. Possible alternative partitioning of the AIMs implementing the example cases providing
    1. Arguments in support of the proposed partitioning
    2. Detailed specifications of the inputs and outputs of the proposed new AIMs
  3. New Use Cases fully described as in the final version of this document.

All parties who believe they have relevant technologies satisfying all or most of the requirements of one or more than one Use Case described in N131 are invited to submit proposals for consideration by MPAI. MPAI membership is not a prerequisite for responding to this CfT. However, proponents should be aware that, if their proposal or part thereof is accepted for inclusion in the MPAI-CAE standard, they shall immediately join MPAI, or their accepted technologies will be discarded.

MPAI will select the most suitable technologies based on their technical merits for inclusion in MPAI-CAE. However, MPAI is not obligated, by virtue of this CfT, to select a particular technology or to select any technology if those submitted are found inadequate.

Submissions are due on 2021/04/13T23:59 UTC and will be reviewed according to the schedule that the 7th MPAI General Assembly (MPAI-7) will define at its online meeting on 2021/04/15. For details on how submitters who are not MPAI members can attend the said review please contact the MPAI secretariat (secretariat@mpai.community).

2        How to submit a response

Those planning to respond to this CfT

  1. Are advised that online events will be held on 2021/02/24 and 2021/03/10 to present the MPAI-CAE CfT and respond to questions. Logistic information on these events will be posted on the MPAI web site
  2. Are requested to communicate their intention to respond to this CfT with an initial version of the form of Annex A to the MPAI secretariat (secretariat@mpai.community) by 2021/03/18. A potential submitter making a communication using the said form is not required to actually make a submission. Submission will be accepted even if the submitter did not communicate their intention to submit a response.

Responses to this MPAI-CAE CfT shall/may include:

Table 1 – Mandatory and optional elements of a response

Item Status
Detailed documentation describing the proposed technologies mandatory
The final version of Annex A mandatory
The text of Annex B duly filled out with the table indicating which requirements identified in MPAI N131 [1] are satisfied. If not all the requirements of a Use Case are satisfied, this should be explained. mandatory
Comments on the completeness and appropriateness of the MPAI-CAE requirements and any motivated suggestion to amend or extend those requirements. optional
A preliminary demonstration, with a detailed document describing it. optional
Any other additional relevant information that may help evaluate the submission, such as additional use cases. optional
The text of Annex E. mandatory

Respondents are invited to take advantage of the checklist of Annex C before submitting their response and filling out Annex B.

Responses shall be submitted to secretariat@mpai.community (MPAI secretariat) by 2021/04/13T23:59 UTC. The secretariat will acknowledge receipt of the submission via email.

Respondents are requested to present their submission (mandatory) at a properly announced MPAI meeting held by teleconference. If no presenter attends the meeting, the proposal will be discarded.

Respondents are advised that, upon acceptance by MPAI of their submission in whole or in part for further evaluation, MPAI will require that

  • A working implementation, including source code, – for use in the development of the MPAI-CAE Reference Software – be made available before the technology is accepted for the MPAI-CAE standard. Software may be written in programming languages that can be compiled or interpreted and in hardware description languages.
  • The working implementation be suitable for operation in the MPAI AI Framework (MPAI-AIF).
  • A non-MPAI member immediately join MPAI. If the non-MPAI member elects not to do so, their submission will be discarded. Direction on how to join MPAI can be found online.

Further information on MPAI can be obtained from the MPAI website.

3        Evaluation Criteria and Procedure

Proposals will be assessed using the following process

  1. Evaluation panel is created from
    1. All CAE-DC members attending
    2. Non-MPAI members who are respondents
    3. Non-respondent, non-MPAI-member experts invited in a consulting capacity
  2. No one from 1.-2.-3. will be denied membership in the Evaluation panel
  3. Respondents present their proposals
  4. Evaluation Panel members ask questions
  5. If required, subjective and/or objective tests are carried out:
    1. Define required tests
    2. Carry out the tests
    3. Produce report
  6. If required, at least two reviewers are appointed to review and report on specific points of a proposal
  7. Evaluation panel members fill out Annex B for each proposal
  8. Respondents respond to evaluations
  9. Proposal evaluation report is produced.

Expected development timeline

Timeline of the CfT, deadlines and response evaluation:

Table 2 – Dates and deadlines

Step Date
Call for Technologies 2021/02/17
CfT introduction conference call 1 2021/02/24T14:00 UTC
CfT introduction conference call 2 2021/03/10T15:00 UTC
Notification of intention to submit proposal 2021/03/18T23:59 UTC
Submission deadline 2021/04/13T23:59 UTC
Evaluation of responses 2021/04/15 (MPAI-7)

Evaluation to be carried out during 2-hour sessions according to the calendar agreed at MPAI-7.

4        References

  1. Draft MPAI-CAE Use Cases & Functional Requirements, MPAI N131
  2. Draft MPAI-MMC Use Cases & Functional Requirements, MPAI N133
  3. Draft MPAI-MMC Call for Technologies, MPAI N134
  4. MPAI-AIF Use Cases & Functional Requirements, MPAI N74; https://mpai.community/standards/mpai-aif/
  5. MPAI-AIF Call for Technologies, MPAI N100

Annex A: Information Form

This information form is to be filled in by a respondent to the MPAI-CAE CfT.

  1. Title of the proposal
  2. Organisation: company name, position, e-mail of contact person
  3. What are the main functionalities of your proposal?
  4. Does your proposal provide or describe a formal specification and APIs?
  5. Will you provide a demonstration to show how your proposal meets the evaluation criteria?

Annex B: Evaluation Sheet

Proposal title:

Main Functionalities:

Response summary: (a few lines)

Comments on Relevance to the CfT (Requirements):

Comments on possible MPAI-CAE profiles[1]

Evaluation table:

Table 1 – Assessment of submission features

Submission features Evaluation elements Final Assessment
Completeness of description

Understandability

Adaptability

Extensibility

Use of Standard Technology

Efficiency

Test cases

Maturity of reference implementation

Relative complexity

Support of MPAI use cases

Support of non-MPAI use cases

Content of the criteria table cells:

Evaluation facts should mention:

  • Not supported / partially supported / fully supported.
  • What supported these facts: submission/presentation/demo.
  • The summary of the facts themselves, e.g., very good in one way, but weak in another.

Final assessment should mention:

  • Possibilities of improving or adding to the proposal, e.g., any missing or weak features.
  • How sure the experts are, i.e., evidence shown, very likely, very hard to tell, etc.
  • Global evaluation (Not Applicable / –– / – / + / ++)

New Use Cases/Requirements Identified:

(please describe)

 Evaluation summary:

  • Main strong points, qualitatively:
  • Main weak points, qualitatively:
  • Overall evaluation: (0/1/2/3/4/5)

0: could not be evaluated

1: proposal is not relevant

2: proposal is relevant, but requires significantly more work

3: proposal is relevant, but with a few changes

4: proposal has some very good points, so it is a good candidate for the standard

5: proposal is superior in its category, very strongly recommended for inclusion in the standard

Additional remarks: (points of importance not covered above.)

The submission features in Table 1 are explained in the following Table 2.

Table 2 – Explanation of submission features

Submission features Criteria
Completeness of description Evaluators should

1.     Compare the list of requirements (Annex C of the CfT) with the submission.

2.     Check if respondents have described in sufficient detail the part of the architecture to which their proposal refers.

NB1: Completeness of a proposal for a Use Case is a merit because reviewers can assess that the components are integrated.

NB2: Submissions will be judged for the merit of what is proposed.

Understandability Evaluators should identify items that are demonstrably unclear (inconsistencies, sentences with dubious meaning etc.)
Adaptability Evaluators should check if the respondent specifies an execution environment with its scope of applicability.

NB: Adaptability is synonymous with portability to different computational frameworks.

Extensibility Evaluators should check if respondent has proposed extensions to the use cases

NB: Extensibility is the capability of the proposed solution to support use cases that are not supported by current requirements.

Use of standard Technology Evaluators should check if new technologies are proposed where widely adopted technologies exist. If this is the case, the merit of the new technology shall be proved.
Efficiency Evaluators should assess power consumption, computational speed, computational complexity, required TOPS
Test cases Evaluators should report whether a proposal contains suggestions for testing the technologies proposed
Maturity of reference implementation Evaluators should assess the maturity of the proposal.

NB1: Maturity is measured by completeness, i.e., whether the disclosed HW/SW implementation contains all the necessary and appropriate parts of the submitted proposal.

NB2: If there are parts of the implementation that are not disclosed but demonstrated, they will be considered if and only if such components are replicable.

Relative complexity Evaluators should identify issues that would make it difficult to implement the proposal compared to the state of the art
Support of MPAI use cases Evaluators should check how many use cases are supported in the submission
Support of non-MPAI use cases Evaluators should check whether the technologies proposed can demonstrably be used in other significantly different use cases.

Annex C: Requirements check list

The following check list has been derived from the Requirements of N131 [1].

Please note the following acronyms

KB Knowledge Base
QF Query Format

 

UC Technology Description
AOG Delivery Speech transport format
AOG Digital Audio PCM Audio 48-96 kHz/16-24 bit
AOG Microphone geometry information Description of microphone position
AOG Relevant vs non-relevant sound KB QF Provides relevant sound
AOG Sound array Vector of extracted sounds
AOG Sound categorisation KB QF Provides sound category
AOG Sounds categorisation Identifier of a type of sound
AOG User Hearing Profiles KB QF Provides profile of identified user
ARP Digital Audio PCM Audio 48-96 kHz/16-24 bit
ARP Digital Image A (un)compressed digital video frame
ARP Digital Video Digital Video
ARP Image Features Features characterising tape irregularities
ARP Packager Audio/Video/Images/Text Multiplexer
ARP Tape irregularity KB QF Provides image features
ARP Text Plain text
EAE Delivery Speech transport format
EAE Digital Speech PCM speech 22.05-96kHz/16-24 bit
EAE Microphone geometry information Description of microphone position
EAE Output device acoustic model metadata KB QF Provides output device metadata
EES Digital Speech PCM speech 22.05-96kHz/16-24 bit
EES Emotion Digital representation of emotion
EES Emotion descriptors Derivations of Speech features
EES Emotion KB QF Provides Emotion descriptors
EES Speech and Emotion File Format Multiplexed digital speech and emotion
EES Speech features Features associated to speech analysis

Respondents should consult the equivalent list in N133 [2].

Annex D – Technologies that may require specific testing

EES     Emotion descriptors

EES     Speech features

EES     Emotion KB Query Format

ARP    Image features

ARP    Tape irregularities KB Query Format

Annex E: Mandatory text in responses

A response to this MPAI-CAE CfT shall mandatorily include the following text:

<Company/Member> submits this technical document in response to MPAI Call for Technologies for MPAI project MPAI-XYZ (MPAI document Nijk).

<Company/Member> explicitly agrees to the steps of the MPAI standards development process defined in Annex 1 to the MPAI Statutes, in particular <Company/Member> declares that <Company/Member> or its successors will make available the terms of the Licence related to its Essential Patents according to the Framework Licence of MPAI-XYZ (MPAI document Nmnp), alone or jointly with other IPR holders after the approval of the MPAI-XYZ Technical Specification by the General Assembly and in no event after commercial implementations of the MPAI-XYZ Technical Specification become available on the market.

In case the respondent is a non-MPAI member, the submission shall mandatorily include the following text

If (a part of) this submission is identified for inclusion in a specification, <Company> understands that <Company> will be requested to immediately join MPAI and that, if <Company> elects not to join MPAI, this submission will be discarded.

Subsequent technical contributions shall mandatorily include this text:

<Member> submits this document to MPAI Development Committee XYZ as a contribution to the development of the MPAI-XYZ Technical Specification.

<Member> explicitly agrees to the steps of the MPAI standards development process defined in Annex 1 to the MPAI Statutes, in particular <Company> declares that <Company> or its successors will make available the terms of the Licence related to its Essential Patents according to the Framework Licence of MPAI-XYZ (MPAI document Nmnp), alone or jointly with other IPR holders after the approval of the MPAI-XYZ Technical Specification by the General Assembly and in no event after commercial implementations of the MPAI-XYZ Technical Specification become available on the market.

[1] A profile of a standard is a particular subset of the technologies that are used in the standard and, where applicable, the classes, subsets, options and parameters relevant to that subset.



MPAI-CAE – Context-based Audio Enhancement

Draft Use Cases and Functional Requirements

1        Introduction

Moving Picture, Audio and Data Coding by Artificial Intelligence (MPAI) is an international association with the mission to develop AI-enabled data coding standards. Research has shown that data coding with AI-based technologies is more efficient than with existing technologies.

The MPAI approach to developing AI data coding standards is based on the definition of standard interfaces of AI Modules (AIM). AIMs operate on input data having a standard format to provide output data having a standard format. AIMs can be combined and executed in an MPAI-specified AI-Framework called MPAI-AIF. A Call for MPAI-AIF Technologies [1] is currently open.

While AIMs must expose standard interfaces to be able to operate in an MPAI AI Framework, their performance may differ depending on the technologies used to implement them. MPAI believes that competing developers striving to provide better-performing proprietary and interoperable AIMs will promote horizontal markets of AI solutions that build on and further promote AI innovation.

This document is a collection of Use Cases and Functional Requirements for the MPAI Context-based Audio Enhancement (MPAI-CAE) application area. The Use Cases in the MPAI-CAE standard help improve the audio user experience for several applications including entertainment, communication, teleconferencing, gaming, post-production, restoration etc. in a variety of contexts such as in the home, in the car, on-the-go, in the studio etc. Currently MPAI has identified four Use Cases falling in the Context-based Audio Enhancement area:

  1. Emotion-Enhanced Speech (EES)
  2. Audio Recording Preservation (ARP)
  3. Enhanced Audioconference Experience (EAE)
  4. Audio-on-the-go (AOG)

This document is to be read in conjunction with the MPAI-CAE Call for Technologies (CfT) [2] as it provides the functional requirements of all the technologies that have been identified as required to implement the current MPAI-CAE Use Cases. Respondents to the MPAI-CAE CfT should make sure that their responses are aligned with the functional requirements expressed in this document.

In the future MPAI may issue other Calls for Technologies falling in the scope of MPAI-CAE to support identified Use Cases. Currently these are

  1. Efficient 3D sound
  2. (Serious) gaming
  3. Normalization of TV volume
  4. Automotive
  5. Audio mastering
  6. Speech communication
  7. Audio (post-)production

It should also be noted that some technologies identified in this document are the same as, similar to, or related to technologies required to implement some of the Use Cases of the companion document MPAI-MMC Use Cases and Functional Requirements [3]. Readers of this document are advised that familiarity with the content of the said companion document is a prerequisite for a proper understanding of this document.

This document is structured in 7 chapters, including this Introduction.

Chapter 2 briefly introduces the AI Framework Reference Model and its six Components
Chapter 3 briefly introduces the 4 Use Cases.
Chapter 4 presents the 4 MPAI-CAE Use Cases with the following structure

1.     Reference architecture

2.     AI Modules

3.     I/O data of AI Modules

4.     Technologies and Functional Requirements

Chapter 5 identifies the technologies likely to be common across MPAI-CAE and MPAI-MMC, a companion standard project whose Call for Technologies is issued simultaneously with MPAI-CAE’s.
Chapter 6 gives suggested references. Respondents are advised to become familiar with the references.
Chapter 7 gives a basic list of relevant terms and their definitions.

2        The MPAI AI Framework (MPAI-AIF)

Most MPAI applications considered so far can be implemented as a set of AIMs – AI, ML and even traditional Data Processing (DP)-based units with standard interfaces assembled in suitable topologies to achieve the specific goal of an application and executed in an MPAI-defined AI Framework. MPAI is making all efforts to identify processing modules that are re-usable and upgradable without necessarily changing the inside logic. MPAI plans on completing the development of a 1st generation AI Framework called MPAI-AIF in July 2021.

The MPAI-AIF Architecture is given by Figure 1.

Figure 1 – The MPAI-AIF Architecture

Where

  1. Management and Control manages and controls the AIMs, so that they execute in the correct order and at the time when they are needed.
  2. Execution is the environment in which combinations of AIMs operate. It receives external inputs and produces the requested outputs, both of which are application specific, interfacing with Management and Control and with Communication, Storage and Access.
  3. AI Modules (AIM) are the basic processing elements receiving processing specific inputs and producing processing specific outputs.
  4. Communication is required in several cases and can be implemented, e.g., by means of a service bus; it may be used to connect with remote parts of the framework.
  5. Storage encompasses traditional storage and is used to store, e.g., the inputs and outputs of the individual AIMs, data from the AIM’s state and intermediary results, and data shared among AIMs.
  6. Access represents the access to static or slowly changing data that are required by the application such as domain knowledge data, data models, etc.
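
For illustration only, the interaction between AIMs and Management and Control can be sketched as follows. Every class and method name below is a hypothetical placeholder; it does not reflect the actual MPAI-AIF APIs, which are the subject of the MPAI-AIF Call for Technologies.

  # Minimal, hypothetical sketch of an AIM-based workflow; not the MPAI-AIF API.
  from abc import ABC, abstractmethod
  from typing import Any, Dict, List, Tuple


  class AIM(ABC):
      """Basic processing element with standard-format inputs and outputs."""

      @abstractmethod
      def process(self, inputs: Dict[str, Any]) -> Dict[str, Any]:
          ...


  class ManagementAndControl:
      """Executes the AIMs of a workflow in the required order."""

      def __init__(self, workflow: List[Tuple[AIM, List[str], List[str]]]):
          # workflow: ordered list of (aim, input names, output names)
          self.workflow = workflow

      def run(self, external_inputs: Dict[str, Any]) -> Dict[str, Any]:
          data = dict(external_inputs)       # state held by the Execution environment
          for aim, in_names, out_names in self.workflow:
              outputs = aim.process({name: data[name] for name in in_names})
              data.update({name: outputs[name] for name in out_names})
          return data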

3        Use Cases

3.1       Emotion-Enhanced Speech

Speech carries information not only about the lexical content, but also about a variety of other aspects such as age, gender, signature, and emotional state of the speaker [2]. Speech synthesis is evolving towards supporting these aspects.

There are many cases where speech without emotion needs to be converted to speech carrying an emotion, possibly with grades of a particular emotion. This is the case, for instance, of a human-machine dialogue where the message conveyed by the machine is more effective if it carries an emotion properly related to the emotion detected in the human speaker.

The AI Modules identified in the Emotion-Enhanced Speech (EES) Use Case considered in this document will make it possible to create virtual agents communicating in a more natural way, and thus to improve the quality of human interaction with a machine, by making it closer to a human-human interaction [5].

The ultimate goal is to realise a user-friendly system control interface that lets users generate speech with various – continuous and real time – expressiveness control levels.

3.2       Audio Recording Preservation

Preservation of audio assets recorded on a variety of media (vinyl, tapes, cassettes etc.) is an important activity for a variety of application domains, in particular cultural heritage.

A totally neutral process in the analogue-to-digital (A/D) audio information transfer is not sufficient. It is necessary to recover and preserve context information, obviously, but not exclusively, audio. The recording of an acoustic event can never be a neutral operation because the timbre quality and the plastic value of the recorded sound, which are of great importance in, for example, contemporary music, are already influenced by the positioning of the microphones used during the recording, as well as by the processing carried out by the Tonmeister, i.e., the person who has a detailed theoretical and practical knowledge of all aspects of sound recording.

However, unlike a sound engineer, the Tonmeister must also be deeply trained in music: music­ological and historic-critical competence are essential for the identification and correct cataloguing of the information contained in audio documents [6].

As sound carriers are made of unstable base materials, they are more subject to damage caused by inadequate handling. The commingling of a technical and scientific formation with historic-philological knowledge (an important element for the identification and correct cataloguing of the information contained in audio documents) becomes essential for preservative re-recording operations, going beyond mere A/D transfer. In the case of magnetic tapes, the carrier may hold important information: the tape can include multiple splices; it can be annotated (by the composer or by the technicians) and/or display several types of irregularities (e.g., corruptions of the carrier, tape of different colour or chemical composition).

In this Audio Recording Preservation Use Case, audio is digitised and fed into a preservation system. The audio information is supplemented by the information coming from a video camera pointed at the head that reads the magnetic tape. The output of the restoration process is the preservation digital audio and a preservation master file that contains, next to the preservation audio file, several other information types created by the preservation process.

The introduction of this use case in the field of active preservation of audio documents opens the way to an effective answer to the methodological questions of reliability with respect to recordings as documentary sources, also clarifying the concept of “historical faithfulness”.

The goal is to cover the whole “philologically informed” archival process of an audio document, from the active preservation of sound documents to the access to digitized files.

3.3       Enhanced Audioconference Experience

Often, the user experience of a video/audio conference is far from satisfactory. Too much background noise or undesired sounds can prevent participants from understanding, or even lead them to misunderstand, what other participants are saying, in addition to creating distraction.

By using AI-based adaptive noise-cancellation and sound enhancement, those kinds of noise can be virtually eliminated without using complex microphone systems that capture environment characteristics.

In this use case, the goal is achieved by using a series of AIMs. The first AIM is fed with Microphone sound (which captures the conversation audio) and the corresponding geometry information (which describes the number, positioning and configuration of the microphone or the array of microphones). Microphone Physical information (frequency response and deviation of the microphone) might also be added, but that would likely be overkill for this scenario. The resulting output (Speech signal and Geometry information) is then fed to the Noise Cancellation AIM, which performs de-noising of the conversation. The resulting output is then equalised based on the output device characteristics, fetched from the Output Device Acoustic Model KB, which describes the frequency response of the selected output device. In this way the speech can be equalised, removing any coloration introduced by the output device, resulting in an optimally delivered sound experience.

3.4       Audio-on-the-go

While biking in the middle of city traffic, the user should enjoy a satisfactory listening experience without losing contact with the acoustic surroundings.

The microphones available in earphones and earbuds capture the signals from the environment, the relevant environment sounds (e.g., the horn of a car) are selectively recognised, and the sound rendition is adapted to the acoustic environment, providing an enhanced audio experience (e.g., by performing dynamic signal equalisation) and improved battery life.

In this use case, the goal is achieved by using a series of AIMs. The first AIM (Environmental Sound Recognition) is fed with Microphone sound, which captures the surrounding environment noise, together with the corresponding geometry information (which describes the number, positioning and configuration of the microphone or the array of microphones).

The sounds are then categorised following the prescriptions of a Sound Categorisation KB, resulting in a sound array and the corresponding categorisations. Sound samples might be compressed to allow cloud processing.

The Environmental Sound Processing AIM, after fetching a list of relevant sounds from a KB, removes the sounds that are not relevant to the user at that specific moment and feeds the remaining sounds to the next AIM, Dynamic Signal Equalization. This AIM fetches the User Hearing Profile from a KB and dynamically equalises the sound, taking into account the User’s specific hearing deviations.

Finally, the resulting sound is delivered to the output via the most appropriate delivery method.
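
The chain described above can be sketched, for illustration only, as the following composition of AIMs; every function below is a hypothetical stub and is not part of any MPAI specification.

  # Hypothetical sketch of the Audio-on-the-go AIM chain; all functions are stubs.
  def environment_sound_recognition(microphone_sound, geometry_info):
      return [microphone_sound], ["car_horn"]          # Sound array, Sound categorisation

  def environment_sound_processing(sound_array, categories):
      # keep only the sounds that are relevant to the user at this moment
      return [s for s, c in zip(sound_array, categories) if c == "car_horn"]

  def dynamic_signal_equalization(sounds, user_hearing_profile):
      return sounds                                    # equalisation omitted in this stub

  def delivery(equalised_sound):
      return equalised_sound                           # transport wrapping omitted

  def audio_on_the_go(microphone_sound, geometry_info, user_hearing_profile):
      sound_array, categories = environment_sound_recognition(microphone_sound, geometry_info)
      relevant = environment_sound_processing(sound_array, categories)
      return delivery(dynamic_signal_equalization(relevant, user_hearing_profile))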

4        Functional Requirements

4.1       Emotion-Enhanced Speech

4.1.1      Reference architecture

This Use Case is implemented as in Figure 2. The Speech analysis AIM can be implemented either as AI/ML or legacy DP modules. If this AIM is implemented as a neural network, access to Emotion KB may not be needed.

Figure 2 – Emotion-enhanced speech

4.1.2      AI Modules

The AI Modules of Figure 2 perform the functions described in Table 1.

Table 1 – AI Modules of Emotion-Enhanced Speech

AIM Function
Feature extraction Produces Speech features suitable for subsequent analysis
Speech features analysis Produces Emotion descriptors by querying the Emotion KB. Alternatively, Emotion descriptors are produced by an embedded neural network.
Emotion KB Allows Speech analysis to access features extracted from speech recordings of different speakers reading/reciting the same corpus of texts, with the standard set of emotions and without emotion, for different languages and genders.
Emotion inserter Inserts a particular emotional vocal timbre, e.g., anger, disgust, fear, happiness, sadness, and surprise into a neutral (emotion-less) synthesised voice. It also changes the strength of an emotion (from neutral speech) in a gradual fashion.

4.1.3      I/O interfaces of AI Modules

The I/O data of the Emotion Enhanced Speech AIMs are given in Table 2.

Table 2 – I/O data of Emotion-Enhanced Speech AIMs

AIM | Input Data | Output Data
Feature extraction | Emotion-less Digital Speech | Speech features
Speech features analysis | Speech features, Emotion, Emotion KB response | Emotion descriptors, Emotion KB query
Emotion KB | Query | Response
Emotion inserter | Emotion-less Digital Speech, Emotion descriptors | Speech with Emotion, Emotion descriptors

4.1.4      Technologies and Functional Requirements

4.1.4.1     Digital Speech

Emotion Enhanced Speech (EES) requires that speech be sampled at a frequency between 22.05 kHz and 96 kHz and digitally represented between 16 bits/sample and 24 bits/sample.

To Respondents

Respondents are invited to comment on these choices.

4.1.4.2     Emotion

By Emotion we mean an attribute that indicates an emotion out of a finite set of Emotions.

In EES the input speech – natural or synthesised – does not contain emotion while the output speech is expected to contain the emotion expressed by the input Emotion.

The most basic emotions are described by the set “anger, disgust, fear, happiness, sadness, and surprise” [7], or “joy versus sadness, anger versus fear, trust versus disgust, and surprise versus anticipation” [8]. One of these sets can be taken as “universal” in the sense that its emotions are common across all cultures. An Emotion may have different Grades [9,10].

To Respondents

Respondents are invited to propose

  1. A minimal set of Emotions whose semantics are shared across cultures
  2. A set of Grades that can be associated to Emotions
  3. A digital representation of Emotions and their Grades (starting from [11]).

Currently, culture-specific Emotions are not being considered. However, the proposed digital representation of Emotions and their Grades should either accommodate or be extensible to accommodate culture-specific Emotions.
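
Purely as an illustration of the kind of digital representation being requested (the label set, the grade scale and all field names below are assumptions, not MPAI choices), an Emotion and its Grade might be represented as follows.

  # Hypothetical digital representation of an Emotion and its Grade.
  import json
  from dataclasses import dataclass

  BASIC_EMOTIONS = {"anger", "disgust", "fear", "happiness", "sadness", "surprise"}

  @dataclass
  class Emotion:
      label: str          # one label from an assumed culture-independent set
      grade: float        # assumed intensity scale in [0.0, 1.0]; 0.0 = no emotion
      culture: str = ""   # extension hook for culture-specific Emotions

      def to_json(self) -> str:
          assert self.label in BASIC_EMOTIONS and 0.0 <= self.grade <= 1.0
          return json.dumps({"label": self.label, "grade": self.grade, "culture": self.culture})

  print(Emotion("happiness", 0.7).to_json())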

4.1.4.3     Speech features

To accomplish their task, speech processing applications utilize certain features of speech signals. General speech features are described in [12,13]. The extraction of these properties or features and how to obtain them from a speech signal is known as speech analysis. It can be done in the time domain as well as in the frequency domain. Analysing speech in the time domain often requires simple calculation and interpretation.

Time-domain features are related to the waveform analysis in the time domain. They can be used to measure the arousal level of emotions.

Time-domain features carry information about sequences of short-time prosody acoustic features (features estimated on a frame basis). Example features modified by the emotional states are given by short-time zero crossing rate, short-term speech energy and duration [16].

Frequency-domain features can be computed using (short-time) Fourier transform, wavelet transform, and other mathematical tools [21]. The frequency domain provides the mechanisms to obtain some of the most useful parameters in speech analysis because the human cochlea performs a quasi-frequency analysis.

Initially, the time-domain signal is transformed into the frequency-domain, from which the feature is extracted. Such features are highly associated with the human perception of speech. Hence, they have apparent acoustic characteristics. These features usually comprise formant frequency, linear prediction cepstral coefficient (LPCC), and Mel frequency cepstral coefficients (MFCC).

The frequency-domain features could carry information about:

  1. The Pitch signal (i.e., the glottal waveform) that depends on the tension of the vocal folds and the subglottal air pressure. Two parameters related to the pitch signal can be considered: pitch frequency and glottal air velocity. E.g., high velocity indicates a speech emotion like happiness, while low velocity is associated with harsher styles such as anger [22].
  2. The shape of the vocal tract that is modified by the emotional states. The formants (characterized by a centre frequency and a bandwidth) could be a representation of the vocal tract resonances. Features related to the number of harmonics are due to the non-linear airflow in the vocal tract. E.g., in the emotional state of anger, the fast air flow causes additional excitation signals other than the pitch. Teager Energy Operator-based (TEO) features measure the harmonics and cross-harmonics in the spectrum [23].

Example features modified by the emotional states are given by the Mel-frequency cepstrum (MFC) [24].
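
For illustration only, two of the time-domain features mentioned above (short-time zero-crossing rate and short-time energy) could be computed on a frame basis as in the following sketch; the frame and hop sizes are arbitrary example values.

  # Illustrative frame-based extraction of short-time zero-crossing rate and energy.
  import numpy as np

  def time_domain_features(speech: np.ndarray, frame: int = 1024, hop: int = 512):
      zcr, energy = [], []
      for start in range(0, len(speech) - frame + 1, hop):
          x = speech[start:start + frame]
          zcr.append(float(np.mean(np.abs(np.diff(np.sign(x))) > 0)))  # sign changes per sample
          energy.append(float(np.sum(x ** 2)) / frame)                 # mean short-time energy
      return np.array(zcr), np.array(energy)

  # Example on one second of synthetic audio sampled at 22.05 kHz
  zcr, energy = time_domain_features(np.random.randn(22050))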

To Respondents

Respondents are expected to propose Speech features that are capable of modelling

  1. non-extreme emotional states [14]
  2. many emotional states with a natural-sounding voice [15].

4.1.4.4     Emotion descriptors

The Emotion descriptors are a derivation of Speech features. They are used by the Emotion inserter to add the required emotion to the Digital speech.

By using frequency-domain and time-domain features a specific emotion can be added to a particular input Digital speech. Speech analysis can use different strategies to render the emotion depending on

  1. The type of sentence (number of words, type of phonemes, etc.) to which an emotion is added
  2. The emotions added to the previous and next sentence.

Emotion descriptors can be the output of a neural network or obtained by querying an Emotion KB.

To Respondents

Respondents should propose Emotion descriptors suitable for introducing Emotion into specific emotion-less speech, resulting in speech that sounds “natural” to the listener.

4.1.4.5     Emotion KB query format

As of today, there is a variety of speech datasets available (online). Often, they consist of conversational setups and contain overlaps in speech as well as noise, or they are poor in expressiveness. Some Datasets offer emotionally rich content with a high quality, but in a limited amount [e.g., 16,17,18,19]. To be effective an Emotion KB should contain a large and expressive speech dataset.

Emotion KB contains features extracted from the speech recordings of different female and male speakers reading/reciting the same corpus of texts with an agreed set of emotions and without emotion, for a set of languages and for different genders (voice performances by professional actors in comparison with the author’s spontaneous speech) [25, 26].

Emotion KB is queried by providing a set of speech features. Emotion KB responds by providing Emotion descriptors.

To Respondents

Respondents are requested to propose an Emotion KB query format satisfying the following requirements:

  1. Accept a list of the speech features identified in 4.1.4.3
  2. Provide as output a set of Emotion descriptors identified in 4.1.4.4
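
A hypothetical shape of such a query and of the corresponding response is sketched below; the field names, features and values are assumptions made only to illustrate the two requirements and are not a proposed format.

  # Hypothetical Emotion KB query/response; all names and values are illustrative.
  import json

  query = {
      "speech_features": {                       # requirement 1: a list of Speech features
          "pitch_frequency_hz": 210.0,
          "short_time_energy": 0.42,
          "mfcc": [12.1, -3.4, 5.6, 0.9],
      },
      "target_emotion": {"label": "happiness", "grade": 0.7},
      "language": "en",
      "gender": "female",
  }

  response = {
      "emotion_descriptors": [                   # requirement 2: a set of Emotion descriptors
          {"name": "f0_contour_scaling", "value": 1.15},
          {"name": "energy_gain_db", "value": 2.0},
          {"name": "speaking_rate_factor", "value": 1.05},
      ]
  }

  print(json.dumps({"query": query, "response": response}, indent=2))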

4.2       Audio Recording Preservation

4.2.1      Reference architecture

This Use Case is implemented as in Figure 3. The Audio-video Analysis AIM can be implemented either using AI or legacy technologies. If this AIM is implemented as a neural network, access to the Tape irregularity KB may not be required.

Figure 3 – Tape Audio preservation

4.2.2      AI Modules

The AIMs required by this Use Case are described in Table 3.

Table 3 – AI Modules of Audio Recording Preservation

AIM Function
Audio enhancement Produces Preservation audio using an internal denoiser, aimed only at compensating for (a) non-linear frequency response, caused by imperfect historical recording equipment; (b) rumble, needle noise, or tape hiss caused by the imperfections introduced by aging (see 4.2.5).
Audio-video analysis Produces images and audio excerpts querying the Tape irregularity KB. Alternatively, an embedded neural network produces images and audio excerpts.
Musicological classifier Produces relevant images from Digital Video and text describing images
Packager Produces file containing

1.     Digital audio

2.     Input video

3.     Audio sync’d images and text

Tape irregularity KB Knowledge Base of visual and audio irregularities

4.2.3      I/O interfaces of AI Modules

The I/O data of the Audio Recording Preservation AIMs are given in Table 4.

Table 4 – I/O data of Audio Recording Preservation AIMs

AIM | Input Data | Output Data
Audio enhancement | Digital Audio | Preservation Audio
Audio-video Analysis | Preservation Audio, Digital Video, Tape irregularity KB response | Audio Excerpts, Images, Tape irregularity KB query
Musicological classifier | Audio Excerpts, Images | Text, Images
Packager | Preservation Audio, Digital Video, Text, Images | Preservation Master
Tape irregularity KB | Query | Response

4.2.4      Technologies and Functional Requirements

4.2.4.1     Digital Audio

Digital Audio is audio sampled from an analogue source (e.g., magnetic tapes, 78rpm phonographic discs) at a frequency in the 48-96 kHz range with at least 16 and at most 24 bits/sample [27].

To Proponents

Proponents are invited to comment on this choice.

4.2.4.2     Digital Video

Digital video has the following features.

  1. Pixel shape: square
  2. Bit depth: 8-10 bits/pixel
  3. Aspect ratio: 4/3 and 16/9
  4. 640 < # of horizontal pixels < 1920
  5. 480 < # of vertical pixels < 1080
  6. Frame frequency 50-120 Hz
  7. Scanning: progressive
  8. Colorimetry: ITU-R BT709 and BT2020
  9. Colour format: RGB and YUV
  10. Compression: uncompressed, if compressed AVC, HEVC

To Proponents

Proponents are invited to comment on these choices.

4.2.4.3     Digital Image

A Digital Image is

  1. An uncompressed video frame with time information or
  2. A video frame compressed with JPEG [29] with time information.

To Proponents

Respondents are invited to comment on this choice.

4.2.4.4     Image Features

Image Features are used to describe [34]

  1. Splices of
    1. leader tape to magnetic tape
    2. magnetic tape to magnetic tape
  2. Other irregularities such as brands on tape, ends of tape, ripples, damaged tapes, markings, dirt, shadows etc.

To Proponents

Respondents are requested to propose

  1. a complete set of irregularities from audio tapes
  2. Image features that characterise them.

4.2.4.5     Tape irregularity KB query format

Tape irregularity KB contains features extracted from images of different tape irregularities [35].

The Irregularity KB is queried by giving the features of an Image. The Irregularity KB responds by providing the type of irregularity detected in the input Image.

To Respondents

Respondents are requested to propose a Tape irregularity KB query format satisfying the following requirements:

  1. Accept a list of the Image features identified in 4.2.4.4
  2. Respond with an indication of whether irregularities are present; if there are irregularities, provide the type of irregularity identified in 4.2.4.4 as output
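
A hypothetical shape of such a query and response is sketched below; the field names and the irregularity vocabulary are assumptions made only to illustrate the two requirements.

  # Hypothetical Tape irregularity KB query/response; all names are illustrative.
  query = {
      "image_features": [0.12, 0.87, 0.03, 0.55],   # features extracted from one frame
      "timestamp": "00:12:31.040",
  }

  response = {
      "irregularity_present": True,
      "irregularities": [
          {"type": "splice_leader_to_magnetic_tape", "confidence": 0.93},
          {"type": "ripple", "confidence": 0.21},
      ],
  }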

This CfT is specifically for preservation of audio tapes. However, its scope may be extended if sufficient technologies covering other audio preservation instances are received. Any proposal for other audio preservation instances should be described with a level of detail comparable to this Use Case.

4.2.4.6     Text

Text should be encoded according to ISO/IEC 10646, Information technology – Universal Coded Character Set (UCS) to support most languages in use [36].

To Respondents

Respondents are invited to comment on this choice.

4.2.4.7     Packager

Packager takes Preservation Audio, Digital Video, Text and Images and produces the Preservation Master file.

To Respondents

Respondents should propose a file format that can:

  1. Support queries for irregularities, showing all the images corresponding to that given irregularity (splices, carrier corruptions, etc.)
  2. Allow listening to the audio corresponding to a particular image.
  3. Allow the audio signal to be annotated (with text), to support musicological analysis
  4. Support queries on the annotations, returning the corresponding time (sec:ms:sample), the text, the audio signal excerpt and image (if any)
  5. Support random access to a specified portion of video and/or audio.

Preference will be given to formats that have already been standardised or are in wide use.
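
For illustration only, a manifest of the Preservation Master file could be organised as sketched below; all field names are assumptions, and an actual proposal would preferably reuse a format that is already standardised or in wide use.

  # Hypothetical Preservation Master manifest, sketching the five capabilities above.
  preservation_master = {
      "preservation_audio": "audio/preservation_96k_24bit.wav",
      "digital_video": "video/tape_head.mp4",
      "irregularities": [                           # 1: query by irregularity type
          {"type": "splice_magnetic_to_magnetic",
           "image": "images/frame_000312.jpg",
           "audio_time": "00:05:12.480"},           # 2: audio linked to the image
      ],
      "annotations": [                              # 3 and 4: text annotations with time
          {"time": "00:07:02.113", "sample": 40523456,
           "text": "Splice annotated on the tape box",
           "image": "images/frame_000544.jpg"},
      ],
      "index": {"audio_chunk_s": 10, "video_keyframe_s": 2},   # 5: random access
  }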

4.2.5      Information about Audio enhancement performance

A fifty-year-long debate around the restoration of audio documents has been ongoing inside the archivists’ and musicologists’ communities [30].

The Preservation audio produced by Audio enhancement must fulfil the requirements of accuracy, reliability, and philological authenticity.

In [31] Schuller makes an accurate investigation of signal alterations classified in two categories

  1. Intentional, which includes recording, equalization, and noise reduction systems
  2. Unintentional, further divided into two groups:
    1. those caused by the imperfection of the recording technique of the time, resulting in various distortions
    2. those caused by misalignment of the recording equipment, for example, wrong speed, deviation from the vertical cutting angle in cylinders, or misalignment of the recording in magnetic tape.

The choice whether or not to compensate for these alterations reveals different restoration strategies: historical faithfulness can refer to the recording as it has been produced, precisely equalized for intentional recording equalizations, compensated for any errors caused by misaligned recording equipment (for example, wrong speed, deviation from the vertical cutting angle in cylinders, or misalignment of the recording in magnetic tape) and digitized using modern equipment to minimize replay distortions.

There is a certain margin of interpretation because historical acquaintance with the document comes into play alongside technical-scientific knowledge, for instance, to identify the equalization curves of magnetic tapes or to determine the rotation speed of a record. Most of the information provided is retrievable from the history of audio technology, while other information is experimentally inferable with a certain degree of accuracy.

The restoration must focus on compensating for the non-linear frequency response caused by imperfect historical recording equipment, and for the rumble, needle noise, or tape hiss caused by the imperfections introduced by aging.

The restoration step can thus be carried out with a good degree of objectivity and represents an optimum level achievable by the original (analogue) recording equipment.

A legacy denoiser algorithm should [32,33]:

  1. use little a priori information
  2. operate in real time
  3. be based on frequency-domain methods, such as various forms of non-causal Wiener filtering or spectral subtraction schemes
  4. include algorithms that incorporate knowledge of the human auditory system.

To Proponents

The CfT does not call for the technologies used by this AIM. However, respondents’ comments will be welcome.

4.3       Enhanced Audioconference Experience

4.3.1      Reference architecture

This Use Case is implemented as in Figure 4.

Figure 4 – Enhanced Audioconference Experience

4.3.2      AI Modules

The AIMs required by the Enhanced Audioconference Experience are given in Table 5.

Table 5 – AIMs of Enhanced Audioconference Experience

AIM Function
Speech detection and separation Separates relevant Speech vs non-speech signals
Noise cancellation Removes noise in Speech signal
Output dynamic noise cancellation Reduces noise level based on Output Device Acoustic Model
Delivery Wraps De-noised Speech signal for Transport
Output Device Acoustic Model KB Contains calibration test results for all output devices of a given manufacturer identified by their ID

4.3.3      I/O interfaces of AI Modules

The I/O data of Enhanced Audioconference Experience AIMs are given in Table 6.

Table 6 – I/O data of Enhanced Audioconference Experience AIMs

AIM | Input Data | Output Data
Speech detection and separation | Microphone Sound, Geometry Information | Digital Speech, Geometry Information
Noise cancellation | Digital Speech, Geometry Information | De-noised Speech
Output dynamic noise cancellation | De-noised Speech | Equalised Speech
Delivery | Equalised Speech, Transport info | Equalised Speech
Output Device Acoustic Model KB | Query | Response

4.3.4      Technologies and Functional Requirements

4.3.4.1     Digital Speech

Enhanced Audioconference Experience (EAE) requires that speech be sampled at a frequency between 22.05 kHz and 96 kHz and that the samples be represented with at least 16 bits/sample and at most 24 bits/sample.

To Respondents

Respondents are invited to comment on these two choices.

4.3.4.2     Microphone geometry information

Microphone geometry information is a descriptive representation of one or multiple microphones, covering their physical characteristics such as type, positioning, angle, their relative position and the overall configuration (e.g., Array Type). It allows a signal free of noise and distortion to be accurately reproduced and noise to be better separated from signal, as required for the proper working of the EAE AIMs. Formats to represent microphone geometry information are: MPEG-H 3D Audio [37] and platform-specific (Android, Windows, Linux) JSON Descriptors API [38].
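
For illustration only, a generic microphone geometry description is sketched below; it is neither the MPEG-H 3D Audio syntax nor any platform JSON descriptor, and all field names are assumptions.

  # Hypothetical microphone geometry description; not MPEG-H 3D Audio or a platform API.
  microphone_geometry = {
      "array_type": "linear",
      "microphones": [
          {"id": 0, "type": "omnidirectional", "position_m": [0.00, 0.0, 0.0],
           "orientation_deg": {"azimuth": 0, "elevation": 0}},
          {"id": 1, "type": "omnidirectional", "position_m": [0.05, 0.0, 0.0],
           "orientation_deg": {"azimuth": 0, "elevation": 0}},
      ],
  }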

To Respondents

Respondents are requested to

  1. express their preference between the two formats
  2. comment about MPAI’s choice of the two formats
  3. possibly suggest alternative solutions.

4.3.4.3     Output device acoustic model metadata KB query format

The Output device acoustic model KB contains a description of the output device acoustic model, such as frequency response and per-frequency attenuation.

The Output device acoustic model KB is queried by providing the unique ID of the device, if available, or by other means to identify the model or a unique reference to the output device being considered. The Output device acoustic model KB responds with information about the output device characteristics.

To Respondents

Respondents are requested to propose a query/response API satisfying the following requirements. The API shall provide:

  1. Means to enquire about a specific device, model or family of models, if available.
  2. Adequate schemas to represent the Output device acoustic model using, if necessary, current representation schemes.
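
A hypothetical query and response are sketched below; the device identifier and field names are assumptions made only to illustrate the two requirements.

  # Hypothetical Output device acoustic model KB query/response; names are illustrative.
  query = {"device_id": "vendor-xyz-headset-42"}    # or a model / family of models

  response = {
      "device_id": "vendor-xyz-headset-42",
      "acoustic_model": {
          # per-frequency attenuation of the output device, in dB
          "frequency_response_db": {"125": -1.0, "1000": 0.0, "4000": 1.5, "8000": -2.0},
      },
  }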

4.3.4.4     Delivery

Equalised Speech needs to be transported using a transport protocol most appropriate for the environment.

To Respondents

Proponents are requested to identify the transport protocols suitable for the EAE Use Case and propose an extensible way to signal which transport mechanism is intended to be used.

4.4       Audio-on-the-go

4.4.1      Reference architecture

This Use Case is implemented as in Figure 5. Environment sound recognition and Environment sound processing AIMs can be implemented either using AI or legacy technology. If any of these AIMs are implemented as a neural network, access to the corresponding KB may not be needed.

Figure 5 – Audio-on-the-go

4.4.2      AI Modules

The AIMs of Audio-on-the-go are given by Table 7

Table 7 – AIMs of Audio-on-the-go

AIM Function
Environment Sounds Recognition Recognises, separates and categorises sounds captured from the surrounding environment
Environment Sound Processing Determines which sounds are relevant for the user vs sounds which are not
Dynamic Signal Equalization Dynamically equalises the sound using information from the User hearing profiles KB to produce the best possible quality output
Delivery Wraps equalised sound for Transport
Sound categorisation KB Contains audio features of the sounds in the KB
Relevant vs non-relevant sound KB Contains audio features of relevant sounds
User hearing profiles KB A dataset of hearing profiles of target users

4.4.3      I/O interfaces of AI Modules

The I/O data of Audio on the go AIMs are given by Table 8

Table 8 – I/O data of Audio-on-the-go AIMs

AIM | Input Data | Output Data
Environment Sounds Recognition | Microphone Sound, Geometry info | Sound array, Sound categorisation
Environment Sound Processing | Sound array, Sound categorisation | Sound relevant to user
Dynamic Signal Equalization | Sound relevant to user | Dynamically equalised sound
Delivery | Equalised Speech, Transport info | Equalised Speech
Sound categorisation KB | Query | Response
Relevant vs non-relevant sound KB | Query | Response
User hearing profiles KB | Query | Response

4.4.4      Technologies and Functional Requirements

4.4.4.1     Digital Audio

Digital Audio is a stream of samples obtained by sampling audio at a frequency in the 48-96 kHz range with at least 16 and at most 24 bits/sample.

To Respondents

Proponents are invited to comment on this choice.

4.4.4.2     Microphone geometry information

Microphone geometry information is a descriptive representation of one or multiple microphones, covering their physical characteristics such as type, positioning, angle, their relative position and the overall configuration (e.g., Array Type). It allows a signal free of noise and distortion to be accurately reproduced and noise to be better separated from signal, as required for the proper working of the AOG AIMs. Formats to represent microphone geometry information are: MPEG-H 3D Audio [37] and platform-specific (Android, Windows, Linux) JSON Descriptors API [38].

To Respondents

Respondents are requested to

  1. express their preference between the two formats
  2. comment about MPAI’s choice of the two formats
  3. possibly suggest alternative solutions.

4.4.4.3     Sound array

Respondents should propose a format to package a set of environment sounds, with the requirement of being able to include the sound samples, encoding information (e.g., sampling frequency, bits per sample, compression method), related metadata, and duration.

To Respondents

Respondents are requested to propose an extensible identification of audio compression methods.
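
For illustration only, a Sound array package could be organised as sketched below; the field names, the compression identifier and the metadata are assumptions, not a proposed format.

  # Hypothetical Sound array package; all names and values are illustrative.
  sound_array = {
      "sounds": [
          {
              "category": "car_horn",
              "encoding": {"sampling_rate_hz": 48000, "bits_per_sample": 16,
                           "compression": "none"},        # extensible identifier
              "duration_s": 1.2,
              "metadata": {"direction_deg": 45, "capture_time": "2021-02-17T10:15:02Z"},
              "samples": "<PCM or compressed payload>",
          }
      ]
  }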

4.4.4.4     Sounds categorisation

Sounds captured by the microphone should be categorised.

To Respondents

Respondents should propose an extensible classification of all types of sound of interest [39]. Support of a set of sounds classified according to a proprietary scheme should also be provided.

4.4.4.5     Sound categorisation KB query format

Sound categorisation KB contains audio features of the sounds in the KB.

Sound categorisation KB is queried by giving features extracted from the input sound as input. Sound categorisation KB responds by giving the category of the sound.

To Respondents

Respondents should propose an extensible set of features to be used to query the Sound categorisation KB and obtain the categories of the sounds, with the following requirements

  1. The confidence value for the most relevant N categories.
  2. The classification KB from which each category has been extracted
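
A hypothetical query and response meeting the two requirements are sketched below; the field names and category labels are assumptions only.

  # Hypothetical Sound categorisation KB query/response; names are illustrative.
  query = {"audio_features": [0.31, 0.08, 0.77, 0.12]}

  response = {
      "categories": [                                   # the most relevant N categories
          {"label": "car_horn", "confidence": 0.91},    # requirement 1: confidence value
          {"label": "siren", "confidence": 0.06},
      ],
      "classification_kb": "urban-sounds-v1",           # requirement 2: source KB
  }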

4.4.4.6     Relevant vs non-relevant sound KB query format

Relevant vs non-relevant sound KB contains audio features of the relevant sounds.

Relevant vs non-relevant sound KB is queried by giving a sound as input. Relevant vs non-relevant sound KB responds by giving the relevant sound.

To Respondents

Respondents should propose a query format capable of providing a Boolean value (relevant/non-relevant) or a probability level (e.g., 70% relevant).

4.4.4.7     User Hearing Profiles KB query format

User Hearing Profiles KB contains the user hearing profile for the properly identified (e.g. via a UUID or a third-party identity provider) specific user.

User Hearing Profiles KB is queried by giving the User hearing profile ID as input. User hearing profile KB responds with the specific user hearing profile. The User hearing profile contains the hearing attenuation for a defined number of frequency bands, or any representation able to determine the unique individual sound perception ability [40]. There are currently at least two SDKs addressing this: MIMI SDK and NURA SDK (both proprietary) [41].

To Respondents

Respondents should propose a format which can convey the unique individual sound perception ability, in one of the following ways

  1. The KB responds to a query with the values of the frequency perception of the user at a pre-defined set of frequency values
  2. The KB responds to a query specifying a frequency value with the value of the frequency perception of the user at that frequency.
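
A hypothetical query and response covering both options are sketched below; the identifiers, field names and values are assumptions only.

  # Hypothetical User Hearing Profiles KB query/response; names are illustrative.
  query_option_1 = {"user_id": "b0e1c2d3-0000-0000-0000-000000000000"}
  query_option_2 = {"user_id": "b0e1c2d3-0000-0000-0000-000000000000", "frequency_hz": 4000}

  response_option_1 = {        # option 1: attenuation at a pre-defined set of frequencies
      "hearing_attenuation_db": {"250": 5, "1000": 10, "4000": 25, "8000": 40}
  }
  response_option_2 = {        # option 2: attenuation at the queried frequency only
      "frequency_hz": 4000, "hearing_attenuation_db": 25
  }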

4.4.4.8     Delivery

Equalised Speech needs to be transported using a transport protocol most appropriate for the environment.

To Respondents

Proponents are requested to identify the transport protocol suitable for the AOG Use Case and propose an extensible way to signal which transport mechanism is intended to be used.

5        Potential common technologies

Table 9 introduces the acronyms representing the MPAI-CAE and MPAI-MMC Use Cases.

Table 9 – Acronyms of MPAI-CAE and MPAI-MMC Use Cases

Acronym App. Area Use Case
EES MPAI-CAE Emotion-Enhanced Speech
ARP MPAI-CAE Audio Recording Preservation
EAE MPAI-CAE Enhanced Audioconference Experience
AOG MPAI-CAE Audio-on-the-go
CWE MPAI-MMC Conversation with emotion
MQA MPAI-MMC Multimodal Question Answering
PST MPAI-MMC Personalized Automatic Speech Translation

Table 10 gives all MPAI-CAE and MPAI-MMC technologies in alphabetical order.

Please note the following acronyms

KB Knowledge Base
QF Query Format

Table 10 – Alphabetically ordered MPAI-CAE and MPAI-MMC technologies

UC Technology Description
AOG Delivery Speech transport format
EAE Delivery Speech transport format
AOG Digital Audio PCM Audio 48-96 kHz/16-24 bit
ARP Digital Audio PCM Audio 48-96 kHz/16-24 bit
ARP Digital Image A (un)compressed digital video frame
MQA Digital Image (un)compressed image
CWE Digital Speech PCM speech 22.05-96 kHz/16-24 bit
EAE Digital Speech PCM speech 22.05-96 kHz/16-24 bit
EES Digital Speech PCM speech 22.05-96 kHz/16-24 bit
MQA Digital Speech PCM speech 22.05-96 kHz/16-24 bit
PST Digital Speech PCM speech 22.05-96 kHz/16-24 bit
ARP Digital Video Digital Video
CWE Digital Video Digital Video
CWE Emotion Digital representation of emotion
EES Emotion Digital representation of emotion
EES Emotion descriptors Derivations of Speech features
CWE Emotion KB (speech) QF Provides emotion from speech features
CWE Emotion KB (text) QF Provides emotion from text features
CWE Emotion KB (video) QF Provides emotion from video features
EES Emotion KB QF Provides Emotion descriptors
ARP Image Features Features of tape irregularities Images
MQA Image features Features of object Images
MQA Image KB QF Provides object identifier
CWE Input to speech synthesis Plain text or concept
MQA Intention Information such as what, where, how
MQA Intention KB QF Provides Intention
PST Language identification Language identifier
CWE Meaning Information such as question, statement
MQA Meaning Information such as question, statement
AOG Microphone geometry information Description of microphone position
EAE Microphone geometry information Description of microphone position
MQA Object identifier Identifier of a physical object
MQA Online dictionary QF Provides paragraphs correlated with questions
EAE Output device acoustic model metadata KB QF Provides output device metadata
ARP Packager Audio/Video/Images/Text Multiplexer
AOG Relevant vs non-relevant sound KB QF Provides relevant sound
AOG Sound array Vector of extracted sounds
AOG Sound categorisation KB QF Provides sound category
AOG Sound categorisation Identifier of a type of sound
CWE Speech features Speech features containing emotion info
EES Speech features Features associated to speech analysis
PST Speech features Features of input speech
ARP Tape irregularity KB QF Provides image features
ARP Text Plain text
MQA Text Plain text
PST Text Plain text
CWE Text features Text features containing emotion info
AOG User Hearing Profiles KB QF Provides profile of identified user
CWE Video features Video features containing emotion info

The following technologies are potentially applicable to different Use Cases.

Table 11 – Technologies potentially shared by MPAI-CAE and MPAI-MMC

 

Function EES ARP EAE AOG CWE MQA PST
Delivery X X
Digital speech X X
Digital audio X X
Digital image X X
Digital video X X
Emotion X X
Image features X X
Meaning X X
Microphone geometry information X X
Speech features X X X
Text X X X X

The following technologies are shared or shareable across Use Cases:

  1. Delivery
  2. Digital speech
  3. Digital audio
  4. Digital image
  5. Digital video
  6. Emotion
  7. Meaning
  8. Microphone geometry information
  9. Text

Image features apply to different visual objects. Speech features are different for each Use Case.

However, respondents should consider the possibility of proposing a unified set of Speech features, as proposed in [42].

6        Terminology

Table 12 – MPAI-CAE terms

Term Definition
Access Static or slowly changing data that are required by an application such as domain knowledge data, data models, etc.
AI Framework (AIF) The environment where AIM-based workflows are executed
AI Module (AIM) The basic processing elements receiving processing specific inputs and producing processing specific outputs
Audio enhancement An AIM that produces Preservation audio using an internal denoiser
Communication The infrastructure that connects the Components of an AIF
Delivery An AIM that wraps data for transport
Digital Speech Digitised speech as specified by MPAI
Dynamic Signal Equalization An AIM that dynamically equalises the sound using information from the User hearing profiles KB
Emotion An attribute that indicates an emotion out of a finite set of Emotions
Emotion Descriptor A set of time-domain and frequency-domain features capable of rendering a particular emotion, starting from emotion-less digital speech
Emotion inserter An AIM that sets the time-domain and frequency-domain features of neutral speech in order to insert a particular emotional intention
Emotion KB A speech dataset rich in expressiveness
Emotion KB query format A dataset of time-domain and frequency-domain neutral speech features
Environment Sound Processing An AIM that determines which sounds are relevant for the user vs sounds which are not
Environment Sounds Recognition An AIM that recognises, separates and categorises sounds captured from the environment
Execution The environment in which AIM workflows are executed. It receives external inputs and produces the requested outputs both of which are application specific
Frequency-domain Features Properties (descriptors) of the signal with respect to frequency
Emotion Grade The intensity of an Emotion
Management and Control Manages and controls the AIMs in the AIF, so that they execute in the correct order and at the time when they are needed
Musicological classifier Algorithm that sorts unlabelled images from Digital Video into (relevant) labelled categories of information, linking them with text describing the images.
Noise cancellation An AIM that removes noise from the Speech signal
Output Device Acoustic Model KB A dataset of calibration test results for all output devices of a given manufacturer identified by their ID
Output dynamic noise cancellation An AIM that reduces noise level based on Output Device Acoustic Model
Packager An AIM that packages audio, video, images and text in a file
Relevant vs non-relevant sound KB A dataset of audio features of relevant sounds
Sound categorisation KB Contains audio features of the sounds in the KB
Speech analysis An AIM that, depending on the Use Case, extracts Emotion descriptors, understands the emotion embedded in speech, or extracts the characteristics of the speaker (e.g., physiology and intention)
Speech and Emotion File Format A file format that contains Digital speech and time-stamped Emotions related to speech
Speech detection and separation An AIM that separates relevant Speech from non-speech signals
Speech Features Speech features used to extract Emotion descriptors
Storage Storage used to e.g., store the inputs and outputs of the individual AIMs, data from the AIM’s state and intermediary results, shared data among AIMs
Tape irregularity KB Dataset that includes examples of the different irregularities that may be present in the carrier (analogue tape, phonographic discs) considered
Text Characters drawn from a finite alphabet
Time-domain features Properties (descriptors) of the signal with respect to time
User hearing profiles KB A dataset of hearing profiles of target users

7        References

  1. MPAI-AIF Call for Technologies; https://mpai.community/standards/mpai-aif/#Technologies
  2. MPAI-CAE Call for Technologies; N131
  3. MPAI-MMC Use Cases and Functional Requirements; N134
  4. F. Burkhardt and N. Campbell, “Emotional speech synthesis,” in The Oxford Handbook of Affective Computing, Oxford University Press, New York, 2014, p. 286
  5. Noé Tits, A Methodology for Controlling the Emotional Expressiveness in Synthetic Speech – a Deep Learning approach, 8th International Conference on Affective Computing and Intelligent Interaction Workshops and Demos (ACIIW), September 2019, DOI: 10.1109/ACIIW.2019.8925241
  6. T. W. Adorno, Philosophy of New Music, University of Minnesota Press, Minneapolis, Minn, USA, 2006
  7. Ekman, P. (1999). Basic Emotions. In T. Dalgleish and T. Power (Eds.) The Handbook of Cognition and Emotion Pp. 45–60. Sussex, U.K.: John Wiley & Sons, Ltd.
  8. Plutchik R., Emotion: a psychoevolutionary synthesis, New York Harper and Row, 1980
  9. Russell, James (1980). “A circumplex model of affect”. Journal of Personality and Social Psychology. 39 (6): 1161–1178. doi:10.1037/h0077714
  10. Cahn, J. E., The Generation of Affect in Synthesized Speech, Journal of the American Voice I/O Society, 8, July 1990, p. 1-19
  11. https://www.w3.org/TR/2014/REC-emotionml-20140522/
  12. Cahn, J. E., The Generation of Affect in Synthesized Speech, Journal of the American Voice I/O Society, 8, July 1990, p. 1-19
  13. Burkhardt, F., & Sendlmeier, W. F., Verification of Acoustical Correlates of Emotional Speech using Formant-Synthesis, ISCA Workshop on Speech & Emotion, Northern Ireland 2000, p. 151-156.
  14. Scherer, K. R., Ladd, D. R., & Silverman, K., Vocal cues to speaker affect: Testing two models, Journal of the Acoustic Society of America, 76(5), 1984, p. 1346-1356
  15. Kasuya, H., Maekawa, K., & Kiritani, S., Joint Estimation of Voice Source and Vocal Tract Parameters as Applied to the Study of Voice Source Dynamics, ICPhS 99, p. 2505-2512
  16. S. R. Livingstone and F. A. Russo, “The Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS): A dynamic, multimodal set of facial and vocal expressions in North American English,” PLOS ONE, vol. 13, no. 5, pp. 1–35, May 2018
  17. H. Cao, D. G. Cooper, M. K. Keutmann, R. C. Gur, A. Nenkova, and R. Verma, “CREMA-D: Crowd-sourced Emotional Multimodal Actors Dataset,” IEEE Transactions on Affective Computing, vol. 5, no. 4, pp. 377–390, 2014
  18. T. Bänziger, M. Mortillaro, and K. R. Scherer, “Introducing the Geneva Multimodal Expression Corpus for experimental research on emotion perception,” Emotion, vol. 12, no. 5, p. 1161, 2012
  19. F. Burkhardt, A. Paeschke, M. Rolfes, W. F. Sendlmeier, and B. Weiss, “A database of German emotional speech,” in Ninth European Conference on Speech Communication and Technology, 2005
  20. Mozziconacci, S. J. L., Speech Variability and Emotion: Production and Perception, PhD Thesis, Technical University Eindhoven, 1998
  21. Burkhardt, F., & Sendlmeier, W. F., Verification of Acoustical Correlates of Emotional Speech using Formant-Synthesis, ISCA Workshop on Speech & Emotion, Northern Ireland 2000, p. 151-156.
  22. Cahn, J. E., The Generation of Affect in Synthesized Speech, Journal of the American Voice I/O Society, 8, July 1990, p. 1-19
  23. Hamed Beyramienanlou, Nasser Lotfivand, “An Efficient Teager Energy Operator-Based Automated QRS Complex Detection”, Journal of Healthcare Engineering, vol. 2018, Article ID 8360475, 11 pages, 2018. https://doi.org/10.1155/2018/8360475
  24. Davis, S. B., & Mermelstein, P., Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences, IEEE Trans. Acoust. Speech Signal Process., 1980, 28(4): 357–366
  25. Giovanni Costantini, Iacopo Iaderola, Andrea Paoloni, Massimiliano Todisco, EMOVO Corpus: an Italian Emotional Speech Database, Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC’14), Reykjavik, Iceland, pp. 3501–3504, May 2014
  26. Moataz El Ayadi, Mohamed S. Kamel, Fakhri Karray, Survey on speech emotion recognition: Features, classification schemes, and databases, Pattern Recognition, Elsevier, 44 (2011) 572–587
  27. IASA-TC 05: Handling and Storage of Audio and Video Carriers. IASA Technical Committee (2014)
  28. Hamed Beyramienanlou, Nasser Lotfivand, “An Efficient Teager Energy Operator-Based Automated QRS Complex Detection”, Journal of Healthcare Engineering, vol. 2018, Article ID 8360475, 11 pages, 2018. https://doi.org/10.1155/2018/8360475
  29. ISO/IEC 10918-1:1994 Information Technology — Digital Compression And Coding Of Continuous-Tone Still Images: Requirements And Guidelines
  30. Federica Bressan and Sergio Canazza, A Systemic Approach to the Preservation of Audio Documents: Methodology and Software Tools, Journal of Electrical and Computer Engineering, 2013. https://doi.org/10.1155/2013/489515
  31. Boston, Safeguarding the Documentary Heritage. A Guide to Standards, Recommended Practices and Reference Literature Related to the Preservation of Documents of All Kinds, UNESCO, Paris, France, 1988.
  32. S. Canazza, The digital curation of ethnic music audio archives: from preservation to restoration, International Journal on Digital Libraries, 12(2-3):121–135, 2012
  33. S. J. Godsill and P. J. W. Rayner, Digital Audio Restoration – a statistical model-based approach, Berlin: Springer-Verlag, 1998
  34. Pretto, Niccolò; Fantozzi, Carlo; Micheloni, Edoardo; Burini, Valentina; Canazza Targon, Sergio. Computing Methodologies Supporting the Preservation of Electroacoustic Music from Analog Magnetic Tape. In Computer Music Journal, 2018, vol. 42 (4), pp.59-74
  35. Fantozzi, Carlo; Bressan, Federica; Pretto, Niccolò; Canazza, Sergio. Tape music archives: from preservation to access. pp.233-249. In International Journal On Digital Libraries, pp. 1432-5012 vol. 18 (3), 2017. DOI:10.1007/s00799-017-0208-8
  36. ISO/IEC 10646:2003 Information Technology — Universal Multiple-Octet Coded Character Set (UCS)
  37. https://www.iis.fraunhofer.de/en/ff/amm/broadcast-streaming/mpegh.html
  38. https://docs.microsoft.com/bs-cyrl-ba/azure/cognitive-services/speech-service/how-to-devices-microphone-array-configuration
  39. https://www.frontiersin.org/articles/10.3389/fpsyg.2018.01277/full
  40. https://help.nuraphone.com/hc/en-us/articles/360000324676-Your-Profile
  41. https://integrate.mimi.io/documentation/android/4.0.1/documentation
  42. Problem Agnostic Speech Encoder; https://github.com/santi-pdp/pase


MPAI Application Note #1 Rev. 1

Context-based Audio Enhancement (MPAI-CAE)

Proponents: Michelangelo Guarise, Andrea Basso (VOLUMIO)

 Description: The overall user experience quality is highly dependent on the context in which audio is used, e.g.

  1. Entertainment audio can be consumed in the home, in the car, on public transport, on-the-go (e.g. while doing sports, running, biking) etc.
  2. Voice communications can take place in the office, in the car, at home, on-the-go, etc.
  3. Audio and video conferencing can be done in the office, in the car, at home, on-the-go etc.
  4. (Serious) gaming can be done in the office, at home, on-the-go etc.
  5. Audio (post-)production is typically done in the studio
  6. Audio restoration is typically done in the studio

By using context information to act on the content using AI, it is possible substantially to improve the user experience.

Figure 1 represents how MPAI-CAE can reorganise its processing modules within an MPAI-AIF Framework to support different applications.

Figure 1 – Instances of MPAI-CAE

Comments: Currently, there are solutions that adapt the conditions in which the user experiences content or service for some of the contexts mentioned above. However, they tend to be vertical in nature, making it difficult to re-use possibly valuable AI-based components of the solutions for different applications.

MPAI-CAE aims to create a horizontal market of re-usable and possibly context-dependent components that expose standard interfaces. The market would become more receptive to innovation hence more competitive. Industry and consumers alike will benefit from the MPAI-CAE standard.

Examples

The following examples describe how MPAI-CAE can make the difference.

  1. Enhanced audio experience in a conference call

Often, the user experience of a video/audio conference can be marginal. Too much background noise or undesired sounds can prevent participants from understanding what other participants are saying. By using AI-based adaptive noise-cancellation and sound enhancement, MPAI-CAE can virtually eliminate those kinds of noise without using complex microphone systems to capture environment characteristics.

  2. Pleasant and safe music listening while biking

While biking in the middle of city traffic, AI can process the signals from the environment captured by the microphones available in many earphones and earbuds (for active noise cancellation), adapt the sound rendition to the acoustic environment, provide an enhanced audio experience (e.g. performing dynamic signal equalization), improve battery life and selectively recognize and allow relevant environment sounds (e.g. the horn of a car). The user enjoys a satisfactory listening experience without losing contact with the acoustic surroundings.

  3. Emotion enhanced synthesized voice

Speech synthesis is constantly improving and finding several applications that are part of our daily life (e.g. intelligent assistants). In addition to improving the ‘natural sounding’ of the voice, MPAI-CAE can implement expressive models of primary emotions such as fear, happiness, sadness, and anger.

  4. Efficient 3D sound

MPAI-CAE can reduce the number of channels (e.g. MPEG-H 3D Audio can support up to 64 loudspeaker channels and 128 codec core channels) in an automatic (unsupervised) way, e.g. by mapping a 9.1 layout to a 5.1 or stereo layout (radio broadcasting or DVD), while preserving the musical intent of the composer; a conventional static downmix is sketched below as the non-AI baseline.
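For comparison only, the following sketch shows a conventional static 5.1-to-stereo downmix with the usual -3 dB contributions for the centre and surround channels; the channel ordering is an assumption, and an unsupervised MPAI-CAE mapping would aim to improve on this kind of fixed matrix.

import numpy as np

def downmix_5_1_to_stereo(frames: np.ndarray) -> np.ndarray:
    # frames: shape (n_samples, 6), assumed channel order [L, R, C, LFE, Ls, Rs].
    # Returns stereo frames of shape (n_samples, 2); the LFE channel is discarded,
    # as is common in simple static downmixes.
    L, R, C, LFE, Ls, Rs = (frames[:, i] for i in range(6))
    k = 0.7071  # -3 dB
    left = L + k * C + k * Ls
    right = R + k * C + k * Rs
    return np.stack([left, right], axis=1)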

  5. Speech/audio restoration

Audio restoration is often a time-consuming process that requires skilled audio engineers with specific experience in music and recording techniques to manually go over old audio tapes. MPAI-CAE can automatically remove anomalies from recordings through broadband denoising, declicking and decrackling, as well as removing buzzes and hums and performing spectrographic ‘retouching’ for removal of discrete unwanted sounds.

  6. Normalization of volume across channels/streams

Eighty-five years after TV was first introduced as a public service, TV viewers are still struggling to adapt the different average audio levels of different broadcasters to their needs and, within a program, to cope with the different audio levels of different scenes.

MPAI-CAE can learn from the user’s reactions via the remote control, e.g. to a loud spot, and control the sound level accordingly.

  7. Automotive

Audio systems in cars have steadily improved in quality over the years and continue to be integrated into more critical applications. Today, a buyer takes it for granted that a car has a good automotive sound system. In addition, in a car there is usually at least one and sometimes two microphones to handle the voice-response system and the hands-free cell-phone capability. If the vehicle uses any noise cancellation, several other microphones are involved. MPAI-CAE can be used to improve the user experience and enable the full quality of current audio systems by reducing the effects of the noisy automotive environment on the signals.

  8. Audio mastering

Audio mastering is still considered an ‘art’ and the prerogative of pro audio engineers. Normal users can upload an example track of their liking (possibly obtained from similar musical content) and MPAI-CAE analyzes it, extracts key features and generates a master track that ‘sounds like’ the example track, starting from the non-mastered track. It is also possible to specify the desired style without an example, and the original track will be adjusted accordingly.

Requirements:

The following is an initial set of MPAI-CAE functional requirements to be further developed in the next few weeks. When the full set of requirements has been developed, the MPAI General Assembly will decide whether an MPAI-CAE standard should be developed.

  1. The standard shall specify the following natural input signals
    1. Microphone signals
    2. Inertial measurement signals (Acceleration, Gyroscope, Compass, …)
    3. Vibration signals
    4. Environmental signals (Proximity, temperature, pressure, light, …)
    5. Environment properties (geometry, reverberation, reflectivity, …)
  2. The standard shall specify the following user data (a possible representation is sketched after this list)
    1. User settings (equalization, signal compression/expansion, volume, …)
    2. User profile (auditory profile, hearing aids, …)
  3. The standard shall support the retrieval of pre-computed environment models (audio scene, home automation scene, …)
  4. The standard shall reference the user authentication standards/methods required by the specific MPAI-CAE context
  5. The standard shall specify means to authenticate the components and pipelines of an MPAI-CAE instance
  6. The standard shall reference the methods used to encrypt the streams processed by MPAI-CAE and service-related metadata
  7. The standard shall specify the adaptation layer of MPAI-CAE streams to delivery protocols of common use (e.g. Bluetooth, Chromecast, DLNA, …)
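As a purely illustrative sketch of how the user settings and user profile of requirement 2 might be represented (the field names, units and defaults below are assumptions, not proposals):

from dataclasses import dataclass, field
from typing import Dict, Optional

@dataclass
class UserSettings:
    # Per-band equalisation gains in dB, keyed by centre frequency in Hz.
    equalisation_db: Dict[float, float] = field(default_factory=dict)
    compression_ratio: Optional[float] = None  # signal compression/expansion
    volume: float = 1.0                        # linear gain, 1.0 = unity

@dataclass
class UserProfile:
    user_id: str                               # e.g. a UUID
    # Auditory profile: hearing attenuation in dB at a set of frequencies.
    auditory_profile_db: Dict[float, float] = field(default_factory=dict)
    hearing_aids: bool = False                 # whether hearing aids are in use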

 Object of standard: Currently, three areas of standardization are identified:

  1. Context type interfaces: a first set of input and output signals, with corresponding syntax and semantics, for audio usage contexts considered of sufficient interest (e.g. audioconferencing and audio consumption on-the-go). They have the following features
    1. Input and output signals are context specific, but with a significant degree of commonality across contexts
    2. The operation of the framework is implementation-dependent, offering implementors a way to produce the set of output signals that best fit the usage context
  2. Processing component interfaces: with the following features
    1. Interfaces of a set of updatable and extensible processing modules (both traditional and AI-based)
    2. Possibility to create processing pipelines and the associated control (including the needed side information) required to manage them
    3. The processing pipeline may be a combination of local and in-cloud processing
  3. Delivery protocol interfaces
    1. Interfaces of the processed audio signal to a variety of delivery protocols

Benefits: MPAI-CAE will bring benefits positively affecting

  1. Technology providers need not develop full applications to put their technologies to good use. They can concentrate on improving the AI technologies that enhance the user experience. Further, their technologies can find much broader use in application domains beyond those they are accustomed to dealing with.
  2. Equipment manufacturers and application vendors can tap into the set of technologies made available according to the MPAI-CAE standard by different competing sources, integrate them and satisfy their specific needs
  3. Service providers can deliver complex optimizations and thus a superior user experience with minimal time to market, as the MPAI-CAE framework enables easy combination of 3rd party components from both a technical and a licensing perspective. Their services can deliver a high-quality, consistent user audio experience with minimal dependency on the source by selecting the optimal delivery method
  4. End users enjoy a competitive market that provides constantly improved user experiences and controlled cost of AI-based audio endpoints.

Bottlenecks: the full potential of AI in MPAI-CAE would be unleashed by a market of AI-friendly processing units and by the introduction of the vast amount of available AI technologies into products and services.

 Social aspects: MPAI-CAE would free users from the dependency on the context in which they operate; make the content experience more personal; make the collective service experience less dependent on events affecting the individual participant and raise the level of past content to today’s expectations.

Success criteria: MPAI-CAE should create a competitive market of AI-based components exposing standard interfaces, of processing units available to manufacturers and of a variety of end-user devices, and trigger the implicit need felt by a user to have the best experience whatever the context.