Context-based Audio Enhancement – MPAI-CAE
Draft Call for Technologies
1 Introduction
Moving Picture, Audio and Data Coding by Artificial Intelligence (MPAI) is an international non-profit organisation with the mission to develop standards for Artificial Intelligence (AI) enabled digital data coding and for technologies that facilitate integration of data coding components into ICT systems. With the mechanism of Framework Licences, MPAI seeks to attach clear IPR licensing frameworks to its standards.
MPAI has found that the application area called “Context-based Audio Enhancement” is particularly relevant for MPAI standardisation because using context information to act on the input audio content can substantially improve the user experience of a variety of audio-related applications that include entertainment, communication, teleconferencing, gaming, post-production, restoration etc. for a variety of contexts such as in the home, in the car, on-the-go, in the studio etc.
Therefore, MPAI intends to develop a standard – to be called MPAI-CAE – that will provide standard technologies to implement the four Use Cases identified so far:
- Emotion-Enhanced Speech (EES)
- Audio Recording Preservation (ARP)
- Enhanced Audioconference Experience (EAC)
- Audio-on-the-go (AOG)
This document is a Call for Technologies (CfT) for technologies that:
- Satisfy the functional requirements of N131 [1], and
- Are released according to the Framework Licence of N1xy, available online, if selected by MPAI for inclusion in the MPAI-CAE standard.
The standard will be developed with the following guidelines:
- To satisfy the Functional Requirements of N131 [1], available online. In the future, MPAI may decide to extend MPAI-CAE to support other Use Cases.
- To use, where feasible and desirable, the same basic technologies required by the companion document MPAI-MMC Use Cases and Functional Requirements [2].
- To be suitable for implementation as AI Modules (AIM) conforming to the emerging MPAI AI Framework (MPAI-AIF) standard. The MPAI-AIF Functional Requirements (N74) [4] and Call for Technologies (N100) [5] are available online.
Respondents should be aware that:
- The Use Cases that make up MPAI-CAE and the internals of the AIMs will be non-normative
- The input and output interfaces of the AIMs, whose requirements have been derived to support the Use Cases, will be normative.
Therefore, the scope of this Call for Technologies is restricted to technologies required to implement the input and output interfaces of the AIMs identified in N131 [1].
However, MPAI invites comments on any technology or architectural component identified in N131, specifically:
- Additions or removal of input/output signals to the identified AIMs with identification of data formats required by the new input/output signals
- Possible alternative partitioning of the AIMs implementing the Use Cases, providing:
- Arguments in support of the proposed partitioning
- Detailed specifications of the inputs and outputs of the proposed new AIMs
- New Use Cases fully described as in the final version of this document.
All parties who believe they have relevant technologies satisfying all or most of the requirements of one or more Use Cases described in N131 are invited to submit proposals for consideration by MPAI. MPAI membership is not a prerequisite for responding to this CfT. However, proponents should be aware that, if their proposal or part thereof is accepted for inclusion in the MPAI-CAE standard, they shall immediately join MPAI, or their accepted technologies will be discarded.
MPAI will select the most suitable technologies based on their technical merits for inclusion in MPAI-CAE. However, MPAI is not obligated, by virtue of this CfT, to select a particular technology or to select any technology if those submitted are found inadequate.
Submissions are due on 2021/04/13T23:59 UTC and will be reviewed according to the schedule that the 7th MPAI General Assembly (MPAI-7) will define at its online meeting on 2021/04/15. For details on how submitters who are not MPAI members can attend the said review please contact the MPAI secretariat (secretariat@mpai.community).
2 How to submit a response
Those planning to respond to this CfT:
- Are advised that online events will be held on 2021/02/24 and 2021/03/10 to present the MPAI-CAE CfT and respond to questions. Logistic information on these events will be posted on the MPAI web site
- Are requested to communicate their intention to respond to this CfT, with an initial version of the form of Annex A, to the MPAI secretariat (secretariat@mpai.community) by 2021/03/18. A potential submitter making a communication using the said form is not required to actually make a submission. Submissions will be accepted even if the submitter did not communicate their intention to submit a response.
Responses to this MPAI-CAE CfT shall/may include:
Table 1 – Mandatory and optional elements of a response
Item | Status |
Detailed documentation describing the proposed technologies | mandatory |
The final version of Annex A | mandatory |
The text of Annex B duly filled out, with the table indicating which requirements identified in MPAI N131 [1] are satisfied. If not all the requirements of a Use Case are satisfied, this should be explained. | mandatory |
Comments on the completeness and appropriateness of the MPAI-CAE requirements and any motivated suggestion to amend or extend those requirements. | optional |
A preliminary demonstration, with a detailed document describing it. | optional |
Any other additional relevant information that may help evaluate the submission, such as additional use cases. | optional |
The text of Annex E. | mandatory |
Respondents are invited to take advantage of the check list of Annex C before submitting their response and filling out Annex B.
Responses shall be submitted to secretariat@mpai.community (MPAI secretariat) by 2021/04/13T23:59 UTC. The secretariat will acknowledge receipt of the submission via email.
Respondents are requested to present their submission (mandatory) at a properly announced MPAI meeting held by teleconference. If no presenter attends the meeting, the proposal will be discarded.
Respondents are advised that, upon acceptance by MPAI of their submission in whole or in part for further evaluation, MPAI will require that
- A working implementation, including source code (for use in the development of the MPAI-CAE Reference Software), be made available before the technology is accepted for the MPAI-CAE standard. Software may be written in programming languages that can be compiled or interpreted and in hardware description languages.
- The working implementation be suitable for operation in the MPAI AIF Framework (MPAI-AIF)
- A non-MPAI member immediately join MPAI. If the non-MPAI member elects not to do so, their submission will be discarded. Direction on how to join MPAI can be found online.
Further information on MPAI can be obtained from the MPAI website.
3 Evaluation Criteria and Procedure
Proposals will be assessed using the following process:
- An Evaluation Panel is created from:
- All CAE-DC members attending
- Non-MPAI members who are respondents
- Non-respondent, non-MPAI-member experts invited in a consulting capacity
- No one from the three categories above will be denied membership in the Evaluation Panel
- Respondents present their proposals
- Evaluation Panel members ask questions
- If required, subjective and/or objective tests are carried out:
- Define required tests
- Carry out the tests
- Produce report
- If required, at least two reviewers are appointed to review and report on specific points of a proposal
- Evaluation Panel members fill out Annex B for each proposal
- Respondents respond to evaluations
- Proposal evaluation report is produced.
Expected development timeline
Timeline of the CfT, deadlines and response evaluation:
Table 2 – Dates and deadlines
Step | Date |
Call for Technologies | 2021/02/17 |
CfT introduction conference call 1 | 2021/02/24T14:00 UTC |
CfT introduction conference call 2 | 2021/03/10T15:00 UTC |
Notification of intention to submit proposal | 2021/03/18T23:59 UTC |
Submission deadline | 2021/04/13T23:59 UTC |
Evaluation of responses | 2021/04/15 (MPAI-7) |
Evaluation to be carried out during 2-hour sessions according to the calendar agreed at MPAI-7
4 References
1. Draft MPAI-CAE Use Cases & Functional Requirements, MPAI N131
2. Draft MPAI-MMC Use Cases & Functional Requirements, MPAI N133
3. Draft MPAI-MMC Call for Technologies, MPAI N134
4. MPAI-AIF Use Cases & Functional Requirements, MPAI N74; https://mpai.community/standards/mpai-aif/
5. MPAI-AIF Call for Technologies, MPAI N100
Annex A: Information Form
This information form is to be filled in by a respondent to the MPAI-CAE CfT.
- Title of the proposal
- Organisation: company name, position, e-mail of contact person
- What are the main functionalities of your proposal?
- Does your proposal provide or describe a formal specification and APIs?
- Will you provide a demonstration to show how your proposal meets the evaluation criteria?
Annex B: Evaluation Sheet
Proposal title:
Main Functionalities:
Response summary: (a few lines)
Comments on Relevance to the CfT (Requirements):
Comments on possible MPAI-CAE profiles[1]:
Evaluation table:
Table 1 – Assessment of submission features
Submission features | Evaluation elements | Final Assessment |
Completeness of description | ||
Understandability | ||
Adaptability | ||
Extensibility | ||
Use of Standard Technology | ||
Efficiency | ||
Test cases | ||
Maturity of reference implementation | ||
Relative complexity | ||
Support of MPAI use cases | ||
Support of non-MPAI use cases |
Content of the criteria table cells:
Evaluation facts should mention:
- Not supported / partially supported / fully supported.
- What supported these facts: submission/presentation/demo.
- The summary of the facts themselves, e.g., very good in one way, but weak in another.
Final assessment should mention:
- Possibilities of improving or adding to the proposal, e.g., any missing or weak features.
- How sure the experts are, i.e., evidence shown, very likely, very hard to tell, etc.
- Global evaluation (Not Applicable / – – / – / + / ++)
New Use Cases/Requirements Identified:
(please describe)
Evaluation summary:
- Main strong points, qualitatively:
- Main weak points, qualitatively:
- Overall evaluation: (0/1/2/3/4/5)
0: could not be evaluated
1: proposal is not relevant
2: proposal is relevant, but requires significantly more work
3: proposal is relevant, but with a few changes
4: proposal has some very good points, so it is a good candidate for the standard
5: proposal is superior in its category, very strongly recommended for inclusion in the standard
Additional remarks: (points of importance not covered above.)
The submission features in Table 1 are explained in the following Table 2.
Table 2 – Explanation of submission features
Submission features | Criteria |
Completeness of description | Evaluators should: 1. Compare the list of requirements (Annex C of the CfT) with the submission. 2. Check if respondents have described in sufficient detail to what part of the architecture their proposal refers. NB1: Completeness of a proposal for a Use Case is a merit because reviewers can assess that the components are integrated. NB2: Submissions will be judged for the merit of what is proposed. |
Understandability | Evaluators should identify items that are demonstrably unclear (inconsistencies, sentences with dubious meaning etc.) |
Adaptability | Evaluators should check if the respondent specifies an execution environment with its scope of applicability. NB: Adaptability is synonymous with portability to different computational frameworks. |
Extensibility | Evaluators should check if the respondent has proposed extensions to the Use Cases. NB: Extensibility is the capability of the proposed solution to support use cases that are not supported by the current requirements. |
Use of standard Technology | Evaluators should check if new technologies are proposed where widely adopted technologies exist. If this is the case, the merit of the new technology shall be proved. |
Efficiency | Evaluators should assess power consumption, computational speed, computational complexity, required TOPS |
Test cases | Evaluators should report whether a proposal contains suggestions for testing the technologies proposed |
Maturity of reference implementation | Evaluators should assess the maturity of the proposal. NB1: Maturity is measured by completeness, i.e., by having all the necessary and appropriate parts of the HW/SW implementation disclosed with respect to the submitted proposal. NB2: If there are parts of the implementation that are not disclosed but demonstrated, they will be considered if and only if such components are replicable. |
Relative complexity | Evaluators should identify issues that would make it difficult to implement the proposal compared to the state of the art |
Support of MPAI use cases | Evaluators should check how many use cases are supported in the submission |
Support of non-MPAI use cases | Evaluators should check whether the technologies proposed can demonstrably be used in other significantly different use cases. |
Annex C: Requirements check list
This list has been derived from the Requirements of N131 [1].
Please note the following acronyms
KB | Knowledge Base |
QF | Query Format |
UC | Technology | Description |
AOG | Delivery | Speech transport format |
AOG | Digital Audio | PCM Audio 48-96 kHz/16-24 bit |
AOG | Microphone geometry information | Description of microphone position |
AOG | Relevant vs non-relevant sound KB QF | Provides relevant sound |
AOG | Sound array | Vector of extracted sounds |
AOG | Sound categorisation KB QF | Provides sound category |
AOG | Sounds categorisation | Identifier of a type of sound |
AOG | User Hearing Profiles KB QF | Provides profile of identified user |
ARP | Digital Audio | PCM Audio 48-96 kHz/16-24 bit |
ARP | Digital Image | A (un)compressed digital video frame |
ARP | Digital Video | Digital Video |
ARP | Image Features | Features characterising tape irregularities |
ARP | Packager | Audio/Video/Images/Text Multiplexer |
ARP | Tape irregularity KB QF | Provides image features |
ARP | Text | Plain text |
EAE | Delivery | Speech transport format |
EAE | Digital Speech | PCM speech 22.05-96kHz/16-24 bit |
EAE | Microphone geometry information | Description of microphone position |
EAE | Output device acoustic model metadata KB QF | Provides output device metadata |
EES | Digital Speech | PCM speech 22.05-96kHz/16-24 bit |
EES | Emotion | Digital representation of emotion |
EES | Emotion descriptors | Derivations of Speech features |
EES | Emotion KB QF | Provides Emotion descriptors |
EES | Speech and Emotion File Format | Multiplexed digital speech and emotion |
EES | Speech features | Features associated to speech analysis |
Respondents should consult the equivalent list in N133 [2].
Annex D: Technologies that may require specific testing
EES Emotion descriptors
EES Speech features
EES Emotion KB Query Format
ARP Image features
ARP Tape irregularities KB Query Format
Annex E: Mandatory text in responses
A response to this MPAI-CAE CfT shall mandatorily include the following text:
<Company/Member> submits this technical document in response to MPAI Call for Technologies for MPAI project MPAI-XYZ (MPAI document Nijk).
<Company/Member> explicitly agrees to the steps of the MPAI standards development process defined in Annex 1 to the MPAI Statutes, in particular <Company/Member> declares that <Company/Member> or its successors will make available the terms of the Licence related to its Essential Patents according to the Framework Licence of MPAI-XYZ (MPAI document Nmnp), alone or jointly with other IPR holders after the approval of the MPAI-XYZ Technical Specification by the General Assembly and in no event after commercial implementations of the MPAI-XYZ Technical Specification become available on the market.
In case the respondent is a non-MPAI member, the submission shall mandatorily include the following text
If (a part of) this submission is identified for inclusion in a specification, <Company> understands that <Company> will be requested to immediately join MPAI and that, if <Company> elects not to join MPAI, this submission will be discarded.
Subsequent technical contributions shall mandatorily include this text:
<Member> submits this document to MPAI Development Committee XYZ as a contribution to the development of the MPAI-XYZ Technical Specification.
<Member> explicitly agrees to the steps of the MPAI standards development process defined in Annex 1 to the MPAI Statutes, in particular <Member> declares that <Member> or its successors will make available the terms of the Licence related to its Essential Patents according to the Framework Licence of MPAI-XYZ (MPAI document Nmnp), alone or jointly with other IPR holders after the approval of the MPAI-XYZ Technical Specification by the General Assembly and in no event after commercial implementations of the MPAI-XYZ Technical Specification become available on the market.
[1] A Profile of a standard is a particular subset of the technologies that are used in the standard and, where applicable, the classes, subsets, options and parameters relevant to the subset.
MPAI-CAE – Context-based Audio Enhancement
Draft Use Cases and Functional Requirements
1 Introduction
Moving Picture, Audio and Data Coding by Artificial Intelligence (MPAI) is an international association with the mission to develop AI-enabled data coding standards. Research has shown that data coding with AI-based technologies is more efficient than with existing technologies.
The MPAI approach to developing AI data coding standards is based on the definition of standard interfaces of AI Modules (AIM). AIMs operate on input data having a standard format to provide output data having a standard format. AIMs can be combined and executed in an MPAI-specified AI-Framework called MPAI-AIF. A Call for MPAI-AIF Technologies [1] is currently open.
While AIMs must expose standard interfaces to be able to operate in an MPAI AI Framework, their performance may differ depending on the technologies used to implement them. MPAI believes that competing developers striving to provide better-performing proprietary and interoperable AIMs will promote horizontal markets of AI solutions that build on and further promote AI innovation.
This document is a collection of Use Cases and Functional Requirements for the MPAI Context-based Audio Enhancement (MPAI-CAE) application area. The Use Cases in the MPAI-CAE standard help improve the audio user experience for several applications including entertainment, communication, teleconferencing, gaming, post-production, restoration etc. in a variety of contexts such as in the home, in the car, on-the-go, in the studio etc. Currently MPAI has identified four Use Cases falling in the Context-based Audio Enhancement area:
- Emotion-Enhanced Speech (EES)
- Audio Recording Preservation (ARP)
- Enhanced Audioconference Experience (EAC)
- Audio-on-the-go (AOG)
This document is to be read in conjunction with the MPAI-CAE Call for Technologies (CfT) [2] as it provides the functional requirements of all the technologies that have been identified as required to implement the current MPAI-CAE Use Cases. Respondents to the MPAI-CAE CfT should make sure that their responses are aligned with the functional requirements expressed in this document.
In the future MPAI may issue other Calls for Technologies falling in the scope of MPAI-CAE to support identified Use Cases. Currently these are
- Efficient 3D sound
- (Serious) gaming
- Normalization of TV volume
- Automotive
- Audio mastering
- Speech communication
- Audio (post-)production
It should also be noted that some technologies identified in this document are the same as, similar to, or related to technologies required to implement some of the Use Cases of the companion document MPAI-MMC Use Cases and Functional Requirements [3]. Readers of this document are advised that familiarity with the content of the said companion document is a prerequisite for a proper understanding of this document.
This document is structured in 7 chapters, including this Introduction.
Chapter 2 | briefly introduces the AI Framework Reference Model and its six Components |
Chapter 3 | briefly introduces the 4 Use Cases. |
Chapter 4 | presents the 4 MPAI-CAE Use Cases with the following structure: 1. Reference architecture; 2. AI Modules; 3. I/O data of AI Modules; 4. Technologies and Functional Requirements |
Chapter 5 | identifies the technologies likely to be common across MPAI-CAE and MPAI-MMC, a companion standard project whose Call for Technologies is issued simultaneously with MPAI-CAE’s. |
Chapter 6 | gives suggested references. Respondents are advised to become familiar with the references |
Chapter 7 | gives a basic list of relevant terms and their definition |
2 The MPAI AI Framework (MPAI-AIF)
Most MPAI applications considered so far can be implemented as a set of AIMs – AI, ML and even traditional Data Processing (DP)-based units with standard interfaces assembled in suitable topologies to achieve the specific goal of an application and executed in an MPAI-defined AI Framework. MPAI is making all efforts to identify processing modules that are re-usable and upgradable without necessarily changing their internal logic. MPAI plans to complete the development of a 1st generation AI Framework called MPAI-AIF in July 2021.
The MPAI-AIF Architecture is given by Figure 1.
Figure 1 – The MPAI-AIF Architecture
Where
- Management and Control manages and controls the AIMs, so that they execute in the correct order and at the time when they are needed.
- Execution is the environment in which combinations of AIMs operate. It receives external inputs and produces the requested outputs, both of which are application specific, interfacing with Management and Control and with Communication, Storage and Access.
- AI Modules (AIM) are the basic processing elements receiving processing specific inputs and producing processing specific outputs.
- Communication is required in several cases and can be implemented, e.g., by means of a service bus; it may be used to connect with remote parts of the framework.
- Storage encompasses traditional storage and is used to, e.g., store the inputs and outputs of the individual AIMs, data from the AIMs’ states and intermediary results, and data shared among AIMs.
- Access represents the access to static or slowly changing data that are required by the application such as domain knowledge data, data models, etc.
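For illustration only, the sketch below shows how an AIM and a toy Management and Control loop could look in Python. The class names, the process() signature and the NoiseCancellation placeholder are assumptions made for this example; the normative MPAI-AIF interfaces are those defined by the MPAI-AIF standard project, not this sketch.
```python
# Purely illustrative sketch of an AIM and of a toy "Management and Control"
# loop. Class names and the process() signature are assumptions; the normative
# MPAI-AIF APIs are defined by the MPAI-AIF project.
from abc import ABC, abstractmethod
from typing import Any, Dict


class AIM(ABC):
    """An AI Module: maps named inputs in standard formats to named outputs."""

    @abstractmethod
    def process(self, inputs: Dict[str, Any]) -> Dict[str, Any]:
        ...


class NoiseCancellation(AIM):
    """Placeholder AIM: a real implementation would de-noise the speech."""

    def process(self, inputs: Dict[str, Any]) -> Dict[str, Any]:
        speech = inputs["digital_speech"]        # PCM samples
        _geometry = inputs.get("geometry_info")  # microphone geometry, unused here
        return {"denoised_speech": speech}       # pass-through placeholder


def run_workflow(aims, initial_inputs):
    """Toy Management and Control: runs AIMs in order, chaining their data."""
    data = dict(initial_inputs)
    for aim in aims:
        data.update(aim.process(data))
    return data
```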
3 Use Cases
3.1 Emotion-Enhanced Speech
Speech carries information not only about the lexical content, but also about a variety of other aspects such as age, gender, signature, and emotional state of the speaker [2]. Speech synthesis is evolving towards supporting these aspects.
There are many cases where a speech without emotion needs to be converted to a speech carrying an emotion, possibly with grades of a particular emotion. This is the case, for instance, of a human-machine dialogue where the message conveyed by the machine is more effective if it carries an emotion properly related to the emotion detected in the human speaker.
The AI Modules identified in the Emotion-Enhanced Speech (EES) Use Case considered in this document will make it possible to create virtual agents communicating in a more natural way, and thus to improve the quality of human interaction with a machine, by making it closer to a human-human interaction [5].
The ultimate goal is to realise a user-friendly system control interface that lets users generate speech with various – continuous and real time – expressiveness control levels.
3.2 Audio Recording Preservation
Preservation of audio assets recorded on a variety of media (vinyl, tapes, cassettes etc.) is an important activity for a variety of application domains, in particular cultural heritage.
A totally neutral process in the analogue-to-digital (A/D) transfer of the audio information is not sufficient. It is necessary to recover and preserve context information, which is obviously, but not exclusively, audio. The recording of an acoustic event can never be a neutral operation because the timbre quality and the plastic value of the recorded sound, which are of great importance in, for example, contemporary music, are already influenced by the positioning of the microphones used during the recording. In addition, the sound is shaped by the processing carried out by the Tonmeister, i.e., the person who has a detailed theoretical and practical knowledge of all aspects of sound recording.
However, unlike a sound engineer, the Tonmeister must also be deeply trained in music: musicological and historic-critical competence are essential for the identification and correct cataloguing of the information contained in audio documents [6].
As sound carriers are made of unstable base materials, they are more subject to damage caused by inadequate handling. The commingling of a technical and scientific formation with historic-philological knowledge (an important element for the identification and correct cataloguing of the information contained in audio documents) becomes essential for preservative re-recording operations, going beyond mere A/D transfer. In the case of magnetic tapes, the carrier may hold important information: the tape can include multiple splices; it can be annotated (by the composer or by the technicians) and/or display several types of irregularities (e.g., corruptions of the carrier, tape of different colour or chemical composition).
In this Audio Recording Preservation Use Case, audio is digitised and fed into a preservation system. The audio information is supplemented by the information coming from a video camera pointed at the head that reads the magnetic tape. The output of the restoration process is the preservation digital audio and a preservation master file that contains, in addition to the preservation audio file, several other information types created by the preservation process.
The introduction of this use case in the field of active preservation of audio documents opens the way to an effective answer to the methodological questions of reliability with respect to recordings as documentary sources, also clarifying the concept of “historical faithfulness”.
The goal is to cover the whole “philologically informed” archival process of an audio document, from the active preservation of sound documents to the access to digitized files.
3.3 Enhanced Audioconference Experience
Often, the user experience of a video/audio conference is far from satisfactory. Too much background noise or undesired sounds can lead participants not to understand, or even to misunderstand, what other participants are saying, in addition to creating distraction.
By using AI-based adaptive noise-cancellation and sound enhancement, those kinds of noise can be virtually eliminated without using complex microphone systems that capture environment characteristics.
In this use case, the goal is achieved by using a series of AIMs. The first AIM is fed with Microphone sound (which captures the conversation audio) and the corresponding geometry information (which describes the number, positioning and configuration of the microphone or array of microphones). Microphone physical information (frequency response and deviation of the microphone) might also be added, but that would likely be overkill for this scenario. The resulting output (Speech signal and Geometry information) is then fed to the Noise Cancellation AIM, which performs de-noising of the conversation. The resulting output is then equalized based on the output device characteristics, fetched from the Output Device Acoustic Model KB, which describes the frequency response of the selected output device. In this way the speech can be equalized, removing any coloration introduced by the output device and resulting in an optimally delivered sound experience.
3.4 Audio-on-the-go
While biking in the middle of city traffic, the user should enjoy a satisfactory listening experience without losing contact with the acoustic surroundings.
The microphones available in earphones and earbuds capture the signals from the environment, the relevant environment sounds (e.g., the horn of a car) are selectively recognised, and the sound rendition is adapted to the acoustic environment, providing an enhanced audio experience (e.g., performing dynamic signal equalization) and an improved battery life.
In this use case, the goal is achieved by using a series of AIMs. The first AIM (Environmental Sound Recognition) is fed with Microphone sound, which captures the surrounding environment noise, together with the corresponding geometry information (which describes the number, positioning and configuration of the microphone or array of microphones).
The sounds are then categorized following the prescriptions of a Sound Categorization KB, resulting in a sound array and its categorization. Sound samples may be compressed to allow cloud processing.
The Environmental Sound Processing AIM, after fetching a list of relevant sounds from a KB, will trim sounds that are not relevant for the user at the specific moment and feed the remaining ones to the next AIM, Dynamic Signal Equalization. This AIM fetches the User Hearing Profile from a KB and dynamically equalizes the sound taking into account the User’s specific hearing deviations.
Finally, the resulting sound is delivered to the output via the most appropriate delivery method.
4 Functional Requirements
4.1 Emotion-Enhanced Speech
4.1.1 Reference architecture
This Use Case is implemented as in Figure 2. The Speech analysis AIM can be implemented either as AI/ML or legacy DP modules. If this AIM is implemented as a neural network, access to Emotion KB may not be needed.
Figure 2 – Emotion-enhanced speech
4.1.2 AI Modules
The AI Modules of Figure 2 perform the functions described in Table 1.
Table 1 – AI Modules of Emotion-Enhanced Speech
AIM | Function |
Feature extraction | Produces Speech features suitable for subsequent analysis |
Speech features analysis | Produces Emotion descriptors by querying the Emotion KB. Alternatively, Emotion descriptors are produced by an embedded neural network. |
Emotion KB | Allows Speech analysis to access features extracted from speech recordings of different speakers reading/reciting the same corpus of texts, with the standard set of emotions and without emotion, for different languages and genders. |
Emotion inserter | Inserts a particular emotional vocal timbre, e.g., anger, disgust, fear, happiness, sadness, and surprise into a neutral (emotion-less) synthesised voice. It also changes the strength of an emotion (from neutral speech) in a gradual fashion. |
4.1.3 I/O interfaces of AI Modules
The I/O data of the Emotion Enhanced Speech AIMs are given in Table 2.
Table 2 – I/O data of Emotion-Enhanced Speech AIMs
AIM | Input Data | Output Data |
Feature extraction | Emotion-less Digital Speech | Speech features |
Speech features analysis | Speech features, Emotion, Emotion KB response | Emotion descriptors, Emotion KB query |
Emotion KB | Query | Response |
Emotion inserter | Emotion-less Digital Speech, Emotion descriptors | Speech with Emotion, Emotion descriptors |
4.1.4 Technologies and Functional Requirements
4.1.4.1 Digital Speech
Emotion Enhanced Speech (EES) requires that speech be sampled at a frequency between 22.05 kHz and 96 kHz and digitally represented between 16 bits/sample and 24 bits/sample.
To Respondents
Respondents are invited to comment on these choices.
4.1.4.2 Emotion
By Emotion we mean an attribute that indicates an emotion out of a finite set of Emotions.
In EES the input speech – natural or synthesised – does not contain emotion while the output speech is expected to contain the emotion expressed by the input Emotion.
The most basic emotions are described by the set: “anger, disgust, fear, happiness, sadness, and surprise” [7], or “joy versus sadness, anger versus fear, trust versus disgust, and surprise versus anticipation” [8]. One of these sets can be taken as “universal” in the sense that they are common across all cultures. An Emotion may have different Grades [9,10].
To Respondents
Respondents are invited to propose
- A minimal set of Emotions whose semantics are shared across cultures
- A set of Grades that can be associated to Emotions
- A digital representation of Emotions and their Grades (starting from [11]).
Currently, culture-specific Emotions are not being considered. However, the proposed digital representation of Emotions and their Grades should either accommodate or be extensible to accommodate culture-specific Emotions.
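As a purely illustrative aid, the sketch below shows one possible JSON-style digital representation of an Emotion with a Grade. The emotion set identifier, the grade scale and all field names are assumptions made for this example, not the representation that MPAI will standardise.
```python
# Illustrative only: one possible digital representation of an Emotion and its
# Grade. Field names, the emotion set and the grade scale are assumptions.
import json

emotion = {
    "emotionSet": "basic-6",      # hypothetical identifier of the emotion set in use
    "emotion": "happiness",       # e.g. anger, disgust, fear, happiness, sadness, surprise
    "grade": 0.7,                 # intensity in [0.0, 1.0]; 0.0 would mean neutral speech
    "cultureSpecific": None,      # placeholder for future culture-specific extensions
}

print(json.dumps(emotion, indent=2))
```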
4.1.4.3 Speech features
To accomplish their task, speech processing applications utilize certain features of speech signals. General speech features are described in [12,13]. The extraction of these properties or features and how to obtain them from a speech signal is known as speech analysis. It can be done in the time domain as well as in the frequency domain. Analysing speech in the time domain often requires simple calculation and interpretation.
Time-domain features are related to the waveform analysis in the time domain. They can be used to measure the arousal level of emotions.
Time-domain features carry information about sequences of short-time prosody acoustic features (features estimated on a frame basis). Example features modified by the emotional states are given by short-time zero crossing rate, short-term speech energy and duration [16].
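As an illustration of the time-domain features mentioned above, the minimal numpy sketch below computes the short-time zero-crossing rate and the short-term energy on a frame basis. The frame length and hop size are illustrative choices, not requirements.
```python
# Minimal sketch of two time-domain features (short-time zero-crossing rate
# and short-term energy), computed on a frame basis. Frame length and hop size
# are illustrative choices.
import numpy as np

def frame_signal(x, frame_len=1024, hop=512):
    """Split a 1-D signal into overlapping frames."""
    n_frames = 1 + max(0, (len(x) - frame_len) // hop)
    return np.stack([x[i * hop:i * hop + frame_len] for i in range(n_frames)])

def short_time_zcr(frames):
    """Fraction of sign changes per frame (zero-crossing rate)."""
    return np.mean(np.abs(np.diff(np.sign(frames), axis=1)) > 0, axis=1)

def short_term_energy(frames):
    """Mean squared amplitude per frame."""
    return np.mean(frames.astype(np.float64) ** 2, axis=1)

# Example on one second of synthetic 48 kHz audio
x = np.random.randn(48000)
frames = frame_signal(x)
zcr, energy = short_time_zcr(frames), short_term_energy(frames)
```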
Frequency-domain features can be computed using (short-time) Fourier transform, wavelet transform, and other mathematical tools [21]. The frequency domain provides the mechanisms to obtain some of the most useful parameters in speech analysis because the human cochlea performs a quasi-frequency analysis.
Initially, the time-domain signal is transformed into the frequency-domain, from which the feature is extracted. Such features are highly associated with the human perception of speech. Hence, they have apparent acoustic characteristics. These features usually comprise formant frequency, linear prediction cepstral coefficient (LPCC), and Mel frequency cepstral coefficients (MFCC).
The frequency-domain features could carry information about:
- The Pitch signal (i.e., the glottal waveform), which depends on the tension of the vocal folds and the subglottal air pressure. Two parameters related to the pitch signal can be considered: pitch frequency and glottal air velocity. E.g., high velocity indicates an emotion like happiness, while low velocity is found in harsher styles such as anger [22].
- The shape of the vocal tract, which is modified by the emotional states. The formants (characterized by a centre frequency and a bandwidth) can be a representation of the vocal tract resonances.
- Features related to the number of harmonics due to the non-linear airflow in the vocal tract. E.g., in the emotional state of anger, the fast air flow causes additional excitation signals other than the pitch. Teager Energy Operator-based (TEO) features measure the harmonics and cross-harmonics in the spectrum [23].
Example features modified by the emotional states are given by the Mel-frequency cepstrum (MFC) [24].
To Respondents
Respondents are expected to propose Speech features that are capable of modelling
- non-extreme emotional states [14]
- many emotional states with a natural-sounding voice [15].
4.1.4.4 Emotion descriptors
The Emotion descriptors are a derivation of Speech features. They are used by the Emotion inserter to add the required emotion to the Digital speech.
By using frequency-domain and time-domain features a specific emotion can be added to a particular input Digital speech. Speech analysis can use different strategies to render the emotion depending on
- The type of sentence (numbers of words, type of phonemes, etc.) to which an emotion is added
- The emotions added to the previous and next sentence.
Emotion descriptors can be the output of a neural network or obtained by querying an Emotion KB.
To Respondents
Respondents should propose Emotion descriptors suitable for introducing the given Emotion into the specific emotion-less speech, resulting in speech that appears “natural” to the listener.
4.1.4.5 Emotion KB query format
As of today, there is a variety of speech datasets available (online). Often, they consist of conversational setups and contain overlaps in speech as well as noise, or they are poor in expressiveness. Some Datasets offer emotionally rich content with a high quality, but in a limited amount [e.g., 16,17,18,19]. To be effective an Emotion KB should contain a large and expressive speech dataset.
Emotion KB contains features extracted from the speech recordings of different female and male speakers reading/reciting the same corpus of texts with an agreed set of emotions and without emotion, for a set of languages and for different genders (voice performances by professional actors in comparison with the author’s spontaneous speech) [25, 26].
Emotion KB is queried by providing a set of speech features. Emotion KB responds by providing Emotion descriptors.
To Respondents
Respondents are requested to propose an Emotion KB query format satisfying the following requirements:
- Accept a list of the speech features identified in 4.1.4.3
- Provide as output a set of the Emotion descriptors identified in 4.1.4.4
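For illustration only, the sketch below shows what a query/response pair of this kind could look like, expressed as Python dictionaries. All field names, feature names and descriptor names are assumptions made for this example and do not anticipate the format to be standardised.
```python
# Hypothetical Emotion KB query and response; every field name, feature name
# and descriptor name below is an illustrative assumption.
emotion_kb_query = {
    "language": "en",
    "gender": "female",
    "targetEmotion": {"emotion": "happiness", "grade": 0.7},
    "speechFeatures": {
        "pitchHz": [210.3, 215.1, 220.8],   # per-frame pitch estimates
        "energy": [0.012, 0.015, 0.014],    # short-term energy per frame
        "mfcc": [[12.1, -3.4, 5.6]],        # one MFCC vector per frame
    },
}

emotion_kb_response = {
    "emotionDescriptors": {
        "pitchScale": 1.15,       # raise mean pitch by 15%
        "energyScale": 1.10,      # increase energy by 10%
        "durationScale": 0.95,    # slightly faster speaking rate
    },
}
```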
4.2 Audio Recording Preservation
4.2.1 Reference architecture
This Use Case is implemented as in Figure 3. The Audio-video Analysis AIM can be implemented either using AI or legacy technologies. If this AIM is implemented as a neural network, access to the Tape irregularity KB may not be required.
Figure 3 – Tape Audio preservation
4.2.2 AI Modules
The AIMs required by this Use Case are described in Table 3.
Table 3 – AI Modules of Audio Recording Preservation
AIM | Function |
Audio enhancement | Produces Preservation audio using an internal denoiser, aimed only at compensating for (a) the non-linear frequency response caused by imperfect historical recording equipment and (b) rumble, needle noise, or tape hiss caused by the imperfections introduced by aging (see 4.2.5). |
Audio-video analysis | Produces images and audio excerpts by querying the Tape irregularity KB. Alternatively, an embedded neural network produces images and audio excerpts. |
Musicological classifier | Produces relevant images from Digital Video and text describing images |
Packager | Produces a file containing: 1. Digital audio; 2. Input video; 3. Audio-synchronised images and text |
Tape irregularity KB | Knowledge Base of visual and audio irregularities |
4.2.3 I/O interfaces of AI Modules
The I/O data of the Audio Recording Preservation AIMs are given in Table 4.
Table 4 – I/O data of Audio Recording Preservation AIMs
AIM | Input Data | Output Data |
Audio enhancement | Digital Audio | Preservation Audio |
Audio-video Analysis | Preservation Audio, Digital Video, Tape irregularity KB response | Audio Excerpts, Images, Tape irregularity KB query |
Musicological classifier | Audio Excerpts, Images | Text, Images |
Packager | Preservation Audio, Digital Video, Text, Images | Preservation Master |
Tape irregularity KB | Query | Response |
4.2.4 Technologies and Functional Requirements
4.2.4.1 Digital Audio
Digital Audio is audio sampled from an analogue source (e.g., magnetic tapes, 78 rpm phonographic discs) at a frequency in the 48-96 kHz range and represented with at least 16 and at most 24 bits/sample [27].
To Proponents
Proponents are invited to comment on this choice.
4.2.4.2 Digital Video
Digital video has the following features.
- Pixel shape: square
- Bit depth: 8-10 bits/pixel
- Aspect ratio: 4/3 and 16/9
- 640 < # of horizontal pixels < 1920
- 480 < # of vertical pixels < 1080
- Frame frequency 50-120 Hz
- Scanning: progressive
- Colorimetry: ITU-R BT709 and BT2020
- Colour format: RGB and YUV
- Compression: uncompressed, if compressed AVC, HEVC
To Proponents
Proponents are invited to comment on these choices.
4.2.4.3 Digital Image
A Digital Image is
- An uncompressed video frame with time information or
- A video frame compressed with JPEG [29] with time information.
To Proponents
Respondents are invited to comment on this choice.
4.2.4.4 Image Features
Image Features are used to describe [34]
- Splices of
- leader tape to magnetic tape
- magnetic tape to magnetic tape
- Other irregularities such as brands on tape, ends of tape, ripples, damaged tapes, markings, dirt, shadows etc.
To Proponents
Respondents are requested to propose
- a complete set of irregularities from audio tapes
- Image features that characterise them.
4.2.4.5 Tape irregularity KB query format
Tape irregularity KB contains features extracted from images of different tape irregularities [35].
The Irregularity KB is queried by giving the features of an Image. The Irregularity KB responds by providing the type of irregularity detected in the input Image.
To Respondents
Respondents are requested to propose a Tape irregularity KB query format satisfying the following requirements:
- Accept a list of the Image features identified in 4.2.4.4
- Respond with an indication of the presence or absence of irregularities. If there are irregularities, provide as output the type of irregularity identified in 4.2.4.4
This CfT is specifically for preservation of audio tapes. However, its scope may be extended if sufficient technologies covering other audio preservation instances are received. Any proposal for other audio preservation instances should be described with a level of detail comparable to this Use Case.
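For illustration only, a query/response of this kind could be structured as sketched below, here as Python dictionaries. The feature names, irregularity labels and field names are assumptions made for this example.
```python
# Hypothetical Tape irregularity KB query and response; feature names,
# irregularity labels and field names are illustrative assumptions.
irregularity_kb_query = {
    "imageFeatures": {
        "edgeDensity": 0.42,                    # density of detected edges in the frame
        "colourHistogram": [0.1, 0.3, 0.6],
        "textureDescriptor": [0.05, 0.12, 0.08],
    },
    "timecode": "00:12:31.040",                 # position of the frame in the video
}

irregularity_kb_response = {
    "irregularityDetected": True,
    "irregularityType": "splice-leader-to-magnetic",   # e.g. also "ripple", "marking"
    "confidence": 0.87,
}
```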
4.2.4.6 Text
Text should be encoded according to ISO/IEC 10646, Information technology – Universal Coded Character Set (UCS) to support most languages in use [36].
To Respondents
Respondents are invited to comment on this choice.
4.2.4.7 Packager
Packager takes Preservation Audio, Digital Video, Text and Images and produces the Preservation Master file.
To Respondents
Respondents should propose a file format that can:
- Support queries for irregularities, showing all the images corresponding to a given irregularity (splices, carrier corruptions, etc.)
- Allow listening to the audio corresponding to a particular image
- Allow the audio signal to be annotated (with text), to support the musicological analysis
- Support queries on the annotations, returning the corresponding time (sec:ms:sample), the text, the audio signal excerpt and the image (if any)
- Support random access to a specified portion of video and/or audio.
Preference will be given to formats that have already been standardised or are in wide use.
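As a purely illustrative aid, the sketch below outlines the kind of information a Preservation Master manifest could carry to satisfy the requirements above, expressed as a Python dictionary. The structure and all field names are assumptions made for this example; an already standardised or widely used container format would be preferred.
```python
# Hypothetical sketch of a Preservation Master manifest; the structure and
# field names are assumptions, not a proposed format.
preservation_master = {
    "preservationAudio": "audio/master_96k_24bit.wav",
    "digitalVideo": "video/tape_head_capture.mp4",
    "irregularities": [
        {
            "type": "splice-magnetic-to-magnetic",
            "image": "images/frame_018731.jpg",
            "audioTime": "00:12:31.040",     # alignment with the preservation audio
        },
    ],
    "annotations": [
        {
            "audioTime": "00:03:10.500",
            "text": "Example musicological note attached to this audio position.",
        },
    ],
}
```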
4.2.5 Information about Audio enhancement performance
A fifty-year-long debate around the restoration of audio documents has been ongoing inside the archivists’ and musicologists’ communities [30].
The Preservation audio produced by Audio enhancement must fulfil the requirements of accuracy, reliability, and philological authenticity.
In [31] Schuller makes an accurate investigation of signal alterations classified in two categories
- Intentional that includes recording, equalization, and noise reduction systems
- Unintentional further divided into two groups:
- those caused by the imperfection of the recording technique of the time, resulting in various distortions
- those caused by misalignment of the recording equipment, for example, wrong speed, deviation from the vertical cutting angle in cylinders, or misalignment of the recording in magnetic tape.
The choice whether or not to compensate for these alterations reveals different restoration strategies: historical faithfulness can refer to the recording as it was produced, precisely equalized for intentional recording equalizations, compensated for any errors caused by misaligned recording equipment (for example, wrong speed, deviation from the vertical cutting angle in cylinders, or misalignment of the recording in magnetic tape) and digitized using modern equipment to minimize replay distortions.
There is a certain margin of interpretation because historical acquaintance with the document is called into question alongside with technical-scientific knowledge, for instance, to identify the equalization curves of magnetic tapes or to determine the rotation speed of a record. Most of the information provided is retrievable from the history of audio technology, while other information is experimentally inferable with a certain degree of accuracy.
The restoration must focus on compensating for the non-linear frequency response caused by imperfect historical recording equipment and for the rumble, needle noise, or tape hiss caused by the imperfections introduced by aging.
The restoration step can thus be carried out with a good degree of objectivity and represents an optimum level achievable by the original (analogue) recording equipment.
A legacy denoiser algorithm should [32,33]:
- use little a priori information
- operate in real time
- be based on frequency-domain methods, such as various forms of non-causal Wiener filtering or spectral subtraction schemes
- include algorithms that incorporate knowledge of the human auditory system.
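As a minimal illustration of the frequency-domain approach mentioned above, the sketch below implements a basic spectral subtraction denoiser with numpy. The noise estimate (the first frames are assumed to be noise-only) and the over-subtraction and spectral-floor parameters are illustrative assumptions.
```python
# Minimal spectral-subtraction sketch; the noise estimate and the alpha/floor
# parameters are illustrative assumptions, not a recommended configuration.
import numpy as np

def spectral_subtraction(x, frame_len=1024, hop=512, noise_frames=10,
                         alpha=2.0, floor=0.02):
    window = np.hanning(frame_len)
    out = np.zeros(len(x))
    # Estimate the noise magnitude spectrum from the first (noise-only) frames
    noise_mag = np.mean(
        [np.abs(np.fft.rfft(window * x[i * hop:i * hop + frame_len]))
         for i in range(noise_frames)], axis=0)
    for i in range((len(x) - frame_len) // hop):
        start = i * hop
        frame = window * x[start:start + frame_len]
        spec = np.fft.rfft(frame)
        mag, phase = np.abs(spec), np.angle(spec)
        # Subtract the noise estimate, keeping a small spectral floor
        clean_mag = np.maximum(mag - alpha * noise_mag, floor * mag)
        out[start:start + frame_len] += np.fft.irfft(clean_mag * np.exp(1j * phase))
    return out
```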
To Proponents
The CfT does not include the technologies that are the object of this AIM. However, respondents’ comments will be welcome.
4.3 Enhanced Audioconference Experience
4.3.1 Reference architecture
This Use Case is implemented as in Figure 4.
Figure 4 – Enhanced Audioconference Experience
4.3.2 AI Modules
The AIMs required by the Enhanced Audioconference Experience are given in Table 5.
Table 5 – AIMs of Enhanced Audioconference Experience
AIM | Function |
Speech detection and separation | Separates relevant Speech vs non-speech signals |
Noise cancellation | Removes noise in Speech signal |
Output dynamic noise cancellation | Reduces noise level based on Output Device Acoustic Model |
Delivery | Wraps De-noised Speech signal for Transport |
Output Device Acoustic Model KB | Contains calibration test results for all output devices of a given manufacturer identified by their ID |
4.3.3 I/O interfaces of AI Modules
The I/O data of Enhanced Audioconference Experience AIMs are given in Table 6.
Table 6 – I/O data of Enhanced Audioconference Experience AIMs
AIM | Input Data | Output Data |
Speech detection and separation | Microphone Sound, Geometry Information | Digital Speech, Geometry Information |
Noise cancellation | Digital Speech, Geometry Information | De-noised Speech |
Output dynamic noise cancellation | De-noised Speech | Equalised Speech |
Delivery | Equalised Speech, Transport info | Equalised Speech |
Output Device Acoustic Model KB | Query | Response |
4.3.4 Technologies and Functional Requirements
4.3.4.1 Digital Speech
Enhanced Audioconference Experience (EAE) requires that speech be sampled at a frequency between 22.05 kHz and 96 kHz and that the samples be represented with at least 16 and at most 24 bits/sample.
To Respondents
Respondents are invited to comment on these two choices.
4.3.4.2 Microphone geometry information
Microphone geometry information is a descriptive representation of one or more microphones: their physical characteristics (such as type), their positioning, angle and relative positions, and the overall configuration (such as array type). It allows a signal free of noise and distortion to be accurately reproduced and noise to be better separated from the signal, as required for the proper working of the EAE AIMs. Formats to represent microphone geometry information are: MPEG-H 3D Audio [37] and platform-specific (Android, Windows, Linux) JSON Descriptors APIs [38].
To Respondents
Respondents are requested to:
- express their preference between the two formats
- comment about MPAI’s choice of the two formats
- possibly suggest alternative solutions.
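For illustration only, the sketch below shows the kind of information a microphone geometry descriptor could carry, expressed as a Python dictionary. It is neither the MPEG-H 3D Audio metadata nor any platform-specific JSON descriptor; all field names and units are assumptions made for this example.
```python
# Hypothetical microphone geometry descriptor; field names and units are
# illustrative assumptions, not an existing format.
microphone_geometry = {
    "arrayType": "linear",                     # e.g. single, linear, circular
    "microphones": [
        {"id": 0, "type": "omnidirectional",
         "positionMetres": {"x": 0.00, "y": 0.0, "z": 0.0},
         "orientationDegrees": {"azimuth": 0, "elevation": 0}},
        {"id": 1, "type": "omnidirectional",
         "positionMetres": {"x": 0.05, "y": 0.0, "z": 0.0},
         "orientationDegrees": {"azimuth": 0, "elevation": 0}},
    ],
}
```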
4.3.4.3 Output device acoustic model metadata KB query format
The Output device acoustic model KB contains a description of the output device acoustic model, such as frequency response and per-frequency attenuation.
The Output device acoustic model KB is queried by providing the unique ID of the device, if available, or by providing a means to identify the model or a unique reference to the output device being considered. The Output device acoustic model KB responds with information about the output device characteristics.
To Respondents
Respondents are requested to propose a query/response API satisfying the following requirements. The API shall provide:
- Means to enquire about a specific device, model or family of models, if available.
- Adequate schemas to represent the Output device acoustic model using, if necessary, current representation schemes.
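For illustration only, such a query/response could be structured as sketched below, here as Python dictionaries; the device identifiers, field names and values are assumptions made for this example.
```python
# Hypothetical Output device acoustic model KB query and response; device
# identifiers, field names and values are illustrative assumptions.
device_model_query = {
    "deviceId": "ACME-HS-200",                       # unique model ID, if available
    "fallback": {"manufacturer": "ACME", "family": "HS headsets"},
}

device_model_response = {
    "frequencyResponse": [                           # per-frequency attenuation in dB
        {"frequencyHz": 125,  "attenuationDb": -1.5},
        {"frequencyHz": 1000, "attenuationDb": 0.0},
        {"frequencyHz": 8000, "attenuationDb": -3.0},
    ],
}
```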
4.3.4.4 Delivery
Equalised Speech needs to be transported using a transport protocol most appropriate for the environment.
To Respondents
Proponents are requested to identify the transport protocols suitable for the EAE Use Case and propose an extensible way to signal which transport mechanism is intended to be used.
4.4 Audio-on-the-go
4.4.1 Reference architecture
This Use Case is implemented as in Figure 5. Environment sound recognition and Environment sound processing AIMs can be implemented either using AI or legacy technology. If any of these AIMs are implemented as a neural network, access to the corresponding KB may not be needed.
Figure 5 – Audio-on-the-go
4.4.2 AI Modules
The AIMs of Audio-on-the-go are given by Table 7
Table 7 – AIMs of Audio-on-the-go
AIM | Function |
Environment Sounds Recognition | Recognises, separates and categorises sounds captured from the surrounding environment |
Environment Sound Processing | Determines which sounds are relevant for the user vs sounds which are not |
Dynamic Signal Equalization | Dynamically equalises the sound using information from the User hearing profiles KB to produce the best possible quality output |
Delivery | Wraps equalised sound for Transport |
Sound categorisation KB | Contains audio features of the sounds in the KB |
Relevant vs non-relevant sound KB | Contains audio features of relevant sounds |
User hearing profiles KB | A dataset of hearing profiles of target users |
4.4.3 I/O interfaces of AI Modules
The I/O data of Audio on the go AIMs are given by Table 8
Table 8 – I/O data of Audio-on-the-go AIMs
AIM | Input Data | Output Data |
Environment Sounds Recognition | Microphone Sound, Geometry info | Sound array, Sound categorisation |
Environment Sound Processing | Sound array, Sound categorisation | Sound relevant to user |
Dynamic Signal Equalization | Sound relevant to user | Dynamically equalised sound |
Delivery | Equalised Speech, Transport info | Equalised Speech |
Sound categorisation KB | Query | Response |
Relevant vs non-relevant sound KB | Query | Response |
User hearing profiles KB | Query | Response |
4.4.4 Technologies and Functional Requirements
4.4.4.1 Digital Audio
Digital Audio is a stream of samples obtained by sampling audio at a frequency in the 48-96 kHz range with at least 16 and at most 24 bits/sample.
To Respondents
Respondents are invited to comment on this choice.
4.4.4.2 Microphone geometry information
Microphone geometry information is a descriptive representation of one or more microphones: their physical characteristics (such as type), their positioning, angle and relative positions, and the overall configuration (such as array type). It allows a signal free of noise and distortion to be accurately reproduced and noise to be better separated from the signal, as required for the proper working of the AOG AIMs. Formats to represent microphone geometry information are: MPEG-H 3D Audio [37] and platform-specific (Android, Windows, Linux) JSON Descriptors APIs [38].
To Respondents
Respondents are requested to:
- express their preference between the two formats
- comment about MPAI’s choice of the two formats
- possibly suggest alternative solutions.
4.4.4.3 Sound array
Respondents should propose a format to package a set of environment sounds. The format shall be able to include the sound samples, encoding information (e.g., sampling frequency, bits per sample, compression method), related metadata, and duration.
To Respondents
Respondents are requested to propose an extensible identification of audio compression methods.
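For illustration only, the sketch below shows one way such a package could be structured, expressed as a Python dictionary. The field names, the compression identifier and the metadata shown are assumptions made for this example.
```python
# Hypothetical Sound array package; field names, the compression identifier
# scheme and the metadata are illustrative assumptions.
sound_array = {
    "sounds": [
        {
            "category": "vehicle.horn",        # see Sounds categorisation (4.4.4.4)
            "samplingFrequencyHz": 48000,
            "bitsPerSample": 16,
            "compression": "none",             # extensible identifier of the method
            "durationMs": 850,
            "samples": "<PCM or compressed payload>",
        },
    ],
}
```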
4.4.4.4 Sounds categorisation
Sounds captured by the microphone should be categorised.
To Respondents
Respondents should propose an extensible classification of all types of sound of interest [39]. Support of a set of sounds classified according to a proprietary scheme should also be provided.
4.4.4.5 Sound categorisation KB query format
Sound categorisation KB contains audio features of the sounds in the KB.
Sound categorisation KB is queried by giving features extracted from the input sound as input. Sound categorisation KB responds by giving the category of the sound.
To Respondents
Respondents should propose an extensible set of features to be used to query the Sound categorisation KB and obtain the categories of the sounds, with the following requirements:
- The response provides the confidence value for the N most relevant categories.
- The response indicates from which classification KB the category has been extracted.
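For illustration only, a query/response meeting the two requirements above could look like the sketch below (Python dictionaries); feature names, category labels and the classification KB identifier are assumptions made for this example.
```python
# Hypothetical Sound categorisation KB query and response; feature names,
# category labels and the classification KB identifier are assumptions.
categorisation_query = {
    "audioFeatures": {"mfcc": [[11.2, -2.7, 4.9]], "spectralCentroidHz": 1820.0},
}

categorisation_response = {
    "classificationKB": "urban-sounds-v1",   # which classification KB was used
    "categories": [                          # the N most relevant categories
        {"label": "vehicle.horn", "confidence": 0.82},
        {"label": "siren", "confidence": 0.11},
    ],
}
```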
4.4.4.6 Relevant vs non-relevant sound KB query format
Relevant vs non-relevant sound KB contains audio features of the relevant sounds.
Relevant vs non-relevant sound KB is queried by giving a sound as input. Relevant vs non-relevant sound KB responds by giving the relevant sound.
To Respondents
Respondents should propose a query format capable of providing a Boolean value (relevant/non-relevant) or a probability level (e.g., 70% relevant).
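For illustration only, a possible query/response of this kind is sketched below as Python dictionaries; all field names are assumptions made for this example.
```python
# Hypothetical Relevant vs non-relevant sound KB query and response; either a
# Boolean or a probability may be returned, as suggested above.
relevance_query = {
    "category": "vehicle.horn",
    "audioFeatures": {"mfcc": [[11.2, -2.7, 4.9]]},
}

relevance_response = {"relevant": True, "probability": 0.70}
```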
4.4.4.7 User Hearing Profiles KB query format
User Hearing Profiles KB contains the user hearing profile for the properly identified (e.g. via a UUID or a third-party identity provider) specific user.
User Hearing Profiles KB is queried by giving the User hearing profile ID as input. User hearing profile KB responds with the specific user hearing profile. The User hearing profile contains the hearing attenuation for a defined set of frequency bands, or any other representation able to determine the unique individual sound perception ability [40]. There are currently at least two SDKs on the matter: the MIMI SDK and the NURA SDK (both proprietary) [41].
To Respondents
Respondents should propose a format which can convey the unique individual sound perception ability in one of the following ways:
- The KB responds to a query with the values of the frequency perception of the user at a pre-defined set of frequency values
- The KB responds to a query specifying a frequency value with the value of the frequency perception of the user at that frequency.
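For illustration only, the sketch below follows the first of the two options above (values at a pre-defined set of frequencies), expressed as Python dictionaries. The identifier, field names and attenuation values are assumptions made for this example.
```python
# Hypothetical User Hearing Profiles KB query and response; the identifier,
# field names and attenuation values are illustrative assumptions.
hearing_profile_query = {
    "userHearingProfileId": "8f14e45f-ceea-4e7a-9c5d-5c1f0a2b3c4d",   # e.g. a UUID
}

hearing_profile_response = {
    "userHearingProfileId": "8f14e45f-ceea-4e7a-9c5d-5c1f0a2b3c4d",
    "attenuationDbByFrequencyHz": {      # pre-defined set of frequencies
        "250": 5, "500": 5, "1000": 10, "2000": 15, "4000": 25, "8000": 35,
    },
}
```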
4.4.4.8 Delivery
Equalised Speech needs to be transported using a transport protocol most appropriate for the environment.
To Respondents
Proponents are requested to identify the transport protocol suitable for the AOG Use Case and propose an extensible way to signal which transport mechanism is intended to be used.
5 Potential common technologies
Table 9 introduces the acronyms representing the MPAI-CAE and MPAI-MMC Use Cases.
Table 9 – Acronyms of MPAI-CAE and MPAI-MMC Use Cases
Acronym | App. Area | Use Case |
EES | MPAI-CAE | Emotion-Enhanced Speech |
ARP | MPAI-CAE | Audio Recording Preservation |
EAE | MPAI-CAE | Enhanced Audioconference Experience |
AOG | MPAI-CAE | Audio-on-the-go |
CWE | MPAI-MMC | Conversation with emotion |
MQA | MPAI-MMC | Multimodal Question Answering |
PST | MPAI-MMC | Personalized Automatic Speech Translation |
Table 10 gives all MPAI-CAE and MPAI-MMC technologies in alphabetical order.
Please note the following acronyms
KB | Knowledge Base |
QF | Query Format |
Table 10 – Alphabetically ordered MPAI-CAE and MPAI-MMC technologies
UC | Technology | Description |
AOG | Delivery | Speech transport format |
EAE | Delivery | Speech transport format |
AOG | Digital Audio | PCM Audio 48-96 kHz/16-24 bit |
ARP | Digital Audio | PCM Audio 48-96 kHz/16-24 bit |
ARP | Digital Image | A (un)compressed digital video frame |
MQA | Digital Image | (un)compressed image |
CWE | Digital Speech | PCM speech 22.05-96kHz/16-24 bit |
EAE | Digital Speech | PCM speech 22.05-96kHz/16-24 bit |
EES | Digital Speech | PCM speech 22.05-96kHz/16-24 bit |
MQA | Digital Speech | PCM speech 22.05-96kHz/16-24 bit |
PST | Digital Speech | PCM speech 22.05-96kHz/16-24 bit |
ARP | Digital Video | Digital Video |
CWE | Digital Video | Digital Video |
CWE | Emotion | Digital representation of emotion |
EES | Emotion | Digital representation of emotion |
EES | Emotion descriptors | Derivations of Speech features |
CWE | Emotion KB (speech) QF | Provides emotion from speech features |
CWE | Emotion KB (text) QF | Provides emotion from text features |
CWE | Emotion KB (video) QF | Provides emotion from video features |
EES | Emotion KB QF | Provides Emotion descriptors |
ARP | Image Features | Features of tape irregularity images |
MQA | Image features | Features of object Images |
MQA | Image KB QF | Provides object identifier |
CWE | Input to speech synthesis | Plain text or concept |
MQA | Intention | Information such as what, where, how |
MQA | Intention KB QF | Provides Intention |
PST | Language identification | Language identifier |
CWE | Meaning | Information such as question, statement |
MQA | Meaning | Information such as question, statement |
AOG | Microphone geometry information | Description of microphone position |
EAE | Microphone geometry information | Description of microphone position |
MQA | Object identifier | Identifier of a physical object |
MQA | Online dictionary QF | Provides paragraphs correlated with questions |
EAE | Output device acoustic model metadata KB QF | Provides output device metadata |
ARP | Packager | Audio/Video/Images/Text Multiplexer |
AOG | Relevant vs non-relevant sound KB QF | Provides relevant sound |
AOG | Sound array | Vector of extracted sounds |
AOG | Sound categorisation KB QF | Provides sound category |
AOG | Sound categorisation | Identifier of a type of sound |
CWE | Speech features | Speech features containing emotion info |
EES | Speech features | Features associated with speech analysis |
PST | Speech features | Features of input speech |
ARP | Tape irregularity KB QF | Provides image features |
ARP | Text | Plain text |
MQA | Text | Plain text |
PST | Text | Plain text |
CWE | Text features | Text features containing emotion info |
AOG | User Hearing Profiles KB QF | Provides profile of identified user |
CWE | Video features | Video features containing emotion info |
The following technologies are potentially applicable to different Use Cases.
Table 11 – Technologies potentially shared by MPAI-CAE and MPAI-MMC
Function | EES | ARP | EAE | AOG | CWE | MQA | PST |
Delivery | | | X | X | | | |
Digital speech | X | | X | | X | X | X |
Digital audio | | X | | X | | | |
Digital image | | X | | | | X | |
Digital video | | X | | | X | | |
Emotion | X | | | | X | | |
Image features | | X | | | | X | |
Meaning | | | | | X | X | |
Microphone geometry information | | | X | X | | | |
Speech features | X | | | | X | | X |
Text | | X | | | | X | X |
The following technologies are shared or shareable across Use Cases:
- Delivery
- Digital speech
- Digital audio
- Digital image
- Digital video
- Emotion
- Meaning
- Microphone geometry information
- Text
Image features apply to different visual objects, and Speech features are different for all Use Cases.
However, respondents should consider the possibility of proposing a unified set of Speech features, as proposed in [42].
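For illustration only, a minimal hand-crafted unified feature set could be computed as sketched below; this is an assumption using generic spectral and prosodic descriptors (via the librosa library), not the learned encoder of [42]:

import numpy as np
import librosa

def unified_speech_features(path: str, sr: int = 16000) -> np.ndarray:
    # Illustrative unified feature vector shared across Use Cases:
    # MFCC means (spectral envelope) plus simple energy and voicing statistics.
    y, sr = librosa.load(path, sr=sr)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)
    rms = librosa.feature.rms(y=y)
    zcr = librosa.feature.zero_crossing_rate(y)
    return np.concatenate([mfcc.mean(axis=1),
                           [rms.mean(), rms.std(), zcr.mean()]])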
6 Terminology
Table 12 – MPAI-CAE terms
Term | Definition |
Access | Static or slowly changing data that are required by an application such as domain knowledge data, data models, etc. |
AI Framework (AIF) | The environment where AIM-based workflows are executed |
AI Module (AIM) | The basic processing elements receiving processing specific inputs and producing processing specific outputs |
Audio enhancement | An AIM that produces Preservation audio using an internal denoiser |
Communication | The infrastructure that connects the Components of an AIF |
Delivery | An AIM that wraps data for transport |
Digital Speech | Digitised speech as specified by MPAI |
Dynamic Signal Equalization | An AIM that dynamically equalises the sound using information from the User hearing profiles KB |
Emotion | An attribute that indicates an emotion out of a finite set of Emotions |
Emotion Descriptor | A set of time-domain and frequency-domain features capable of rendering a particular emotion, starting from emotion-less digital speech |
Emotion inserter | A module that sets time-domain and frequency-domain features of neutral speech in order to insert a particular emotional intention |
Emotion KB | A speech dataset rich in expressiveness |
Emotion KB query format | A dataset of time-domain and frequency-domain neutral speech features |
Environment Sound Processing | An AIM that determines which sounds are relevant for the user vs sounds which are not |
Environment Sounds Recognition | An AIM that recognises, separates and categorises sounds captured from the environment |
Execution | The environment in which AIM workflows are executed. It receives external inputs and produces the requested outputs both of which are application specific |
Frequency-domain Features | Properties (descriptors) of the signal with respect to frequency |
Emotion Grade | The intensity of an Emotion |
Management and Control | Manages and controls the AIMs in the AIF, so that they execute in the correct order and at the time when they are needed |
Musicological classifier | Algorithm that sorts unlabelled images from Digital Video into (relevant) labelled categories of information, linking them with text describing the images. |
Noise cancellation | An AIM that removes noise in Speech signal |
Output Device Acoustic Model KB | A dataset of calibration test results for all output devices of a given manufacturer identified by their ID |
Output dynamic noise cancellation | An AIM that reduces noise level based on Output Device Acoustic Model |
Packager | An AIM that packages audio, video, images and text in a file |
Relevant vs non-relevant sound KB | A dataset of audio features of relevant sounds |
Sound categorisation KB | A dataset of audio features of the categorised sounds |
Speech analysis | The AIM that extracts Emotion descriptors |
Speech analysis | The AIM that understands the emotion embedded in speech |
Speech analysis | The AIM that extracts the characteristics of the speaker (e.g., physiology and intention) |
Speech and Emotion File Format | A file format that contains Digital speech and time-stamped Emotions related to speech |
Speech detection and separation | AIM that separates relevant Speech vs non-speech signals |
Speech Features | Speech features used to extract Emotion descriptors |
Storage | Storage used to store, e.g., the inputs and outputs of the individual AIMs, data from the AIMs’ states and intermediary results, and data shared among AIMs |
Tape irregularity KB | Dataset that includes examples of the different irregularities that may be present in the carrier (analogue tape, phonographic discs) considered |
Text | Characters drawn from a finite alphabet |
Time-domain features | Properties (descriptors) of the signal with respect to time |
User hearing profiles KB | A dataset of hearing profiles of target users |
7 References
- MPAI-AIF Call for Technologies; https://mpai.community/standards/mpai-aif/#Technologies
- MPAI-CAE Call for Technologies; N131
- MPAI-MMC Use Cases and Functional Requirements; N134
- F. Burkhardt and N. Campbell, “Emotional speech synthesis,” in The Oxford Handbook of Affective Computing, Oxford University Press, New York, 2014, p. 286
- Noé Tits, A Methodology for Controlling the Emotional Expressiveness in Synthetic Speech – a Deep Learning approach, 8th International Conference on Affective Computing and Intelligent Interaction Workshops and Demos (ACIIW), September 2019, DOI: 10.1109/ACIIW.2019.8925241
- W. Adorno, Philosophy of New Music, University of Minnesota Press, Minneapolis, Minn, USA, 2006
- Ekman, P. (1999). Basic Emotions. In T. Dalgleish and T. Power (Eds.) The Handbook of Cognition and Emotion Pp. 45–60. Sussex, U.K.: John Wiley & Sons, Ltd.
- Plutchik R., Emotion: a psychoevolutionary synthesis, New York Harper and Row, 1980
- Russell, James (1980). “A circumplex model of affect”. Journal of Personality and Social Psychology. 39 (6): 1161–1178. doi:10.1037/h0077714
- Cahn, J. E., The Generation of Affect in Synthesized Speech, Journal of the American Voice I/O Society, 8, July 1990, p. 1-19
- https://www.w3.org/TR/2014/REC-emotionml-20140522/
- Cahn, J. E., The Generation of Affect in Synthesized Speech, Journal of the American Voice I/O Society, 8, July 1990, p. 1-19
- Burkhardt, F., & Sendlmeier, W. F., Verification of Acoustical Correlates of Emotional Speech using Formant-Synthesis, ISCA Workshop on Speech & Emotion, Northern Ireland 2000, p. 151-156.
- Scherer, K. R., Ladd, D. R., & Silverman, K., Vocal cues to speaker affect: Testing two models, Journal of the Acoustic Society of America, 76(5), 1984, p. 1346-1356
- Kasuya, H., Maekawa, K., & Kiritani, S., Joint Estimation of Voice Source and Vocal Tract Parameters as Applied to the Study of Voice Source Dynamics, ICPhS 99, p. 2505-2512
- S. R. Livingstone and F. A. Russo, “The Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS): A dynamic, multimodal set of facial and vocal expressions in North American English,” PLOS ONE, vol. 13, no. 5, pp. 1–35, 05 2018
- H. Cao, D. G. Cooper, M. K. Keutmann, R. C. Gur, A. Nenkova, and R. Verma, “CREMA-D: Crowd-sourced emotional multimodal actors dataset,” IEEE Transactions on Affective Computing, vol. 5, no. 4, pp. 377–390, 2014
- T. Bänziger, M. Mortillaro, and K. R. Scherer, “Introducing the Geneva Multimodal Expression Corpus for experimental research on emotion perception,” Emotion, vol. 12, no. 5, p. 1161, 2012
- F. Burkhardt, A. Paeschke, M. Rolfes, W. F. Sendlmeier, and B. Weiss, “A database of German emotional speech,” in Ninth European Conference on Speech Communication and Technology, 2005
- Mozziconacci, S. J. L., Speech Variability and Emotion: Production and Perception, PhD Thesis, Technical University Eindhoven, 1998
- Burkhardt, F., & Sendlmeier, W. F., Verification of Acoustical Correlates of Emotional Speech using Formant-Synthesis, ISCA Workshop on Speech & Emotion, Northern Ireland 2000, p. 151-156.
- Cahn, J. E., The Generation of Affect in Synthesized Speech, Journal of the American Voice I/O Society, 8, July 1990, p. 1-19
- Hamed Beyramienanlou, Nasser Lotfivand, “An Efficient Teager Energy Operator-Based Automated QRS Complex Detection”, Journal of Healthcare Engineering, vol. 2018, Article ID 8360475, 11 pages, 2018. https://doi.org/10.1155/2018/8360475
- Davis S B. Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences. IEEE Trans. Acoust. Speech Signal Process. 1980, 28(4):65-74
- Giovanni Costantini, Iacopo Iaderola, Andrea Paoloni, Massimiliano Todisco, EMOVO Corpus: an Italian Emotional Speech Database, Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC’14), Reykjavik, Iceland, pp. 3501–3504, May 2014
- Moataz El Ayadi, Mohamed S. Kamel, Fakhri Karray, Survey on speech emotion recognition: Features, classification schemes, and databases, Pattern Recognition Journal, Elsevier, 44 (2011) 572–587
- IASA-TC 05: Handling and Storage of Audio and Video Carriers. IASA Technical Committee (2014)
- Hamed Beyramienanlou, Nasser Lotfivand, “An Efficient Teager Energy Operator-Based Automated QRS Complex Detection”, Journal of Healthcare Engineering, vol. 2018, Article ID 8360475, 11 pages, 2018. https://doi.org/10.1155/2018/8360475
- ISO/IEC 10918-1:1994 Information Technology — Digital Compression And Coding Of Continuous-Tone Still Images: Requirements And Guidelines
- Federica Bressan and Sergio Canazza, A Systemic Approach to the Preservation of Audio Documents: Methodology and Software Tools, Journal of Electrical and Computer Engineering, 2013. https://doi.org/10.1155/2013/489515
- Boston, Safeguarding the Documentary Heritage. A Guide to Standards, Recommended Practices and Reference Literature Related to the Preservation of Documents of All Kinds, UNESCO, Paris, France, 1988.
- Canazza. The digital curation of ethnic music audio archives: from preservation to restoration. International Journal of Digital Libraries, 12(2-3):121–135, 2012
- J. Godsill and P.J.W. Rayner. Digital Audio Restoration – a statistical model-based approach (Berlin: Springer-Verlag 1998)
- Pretto, Niccolò; Fantozzi, Carlo; Micheloni, Edoardo; Burini, Valentina; Canazza Targon, Sergio. Computing Methodologies Supporting the Preservation of Electroacoustic Music from Analog Magnetic Tape. In Computer Music Journal, 2018, vol. 42 (4), pp.59-74
- Fantozzi, Carlo; Bressan, Federica; Pretto, Niccolò; Canazza, Sergio. Tape music archives: from preservation to access. pp.233-249. In International Journal On Digital Libraries, pp. 1432-5012 vol. 18 (3), 2017. DOI:10.1007/s00799-017-0208-8
- ISO/IEC 10646:2003 Information Technology — Universal Multiple-Octet Coded Character Set (UCS)
- https://www.iis.fraunhofer.de/en/ff/amm/broadcast-streaming/mpegh.html
- https://docs.microsoft.com/bs-cyrl-ba/azure/cognitive-services/speech-service/how-to-devices-microphone-array-configuration
- https://www.frontiersin.org/articles/10.3389/fpsyg.2018.01277/full
- https://help.nuraphone.com/hc/en-us/articles/360000324676-Your-Profile
- https://integrate.mimi.io/documentation/android/4.0.1/documentation
- Problem Agnostic Speech Encoder; https://github.com/santi-pdp/pase
MPAI Application Note #1 Rev. 1
Context-based Audio Enhancement (MPAI-CAE)
Proponents: Michelangelo Guarise, Andrea Basso (VOLUMIO)
Description: The overall user experience quality is highly dependent on the context in which audio is used, e.g.
- Entertainment audio can be consumed in the home, in the car, on public transport, on-the-go (e.g. while doing sports, running, biking) etc.
- Voice communications can take place in the office, in the car, at home, on-the-go etc.
- Audio and video conferencing can be done in the office, in the car, at home, on-the-go etc.
- (Serious) gaming can be done in the office, at home, on-the-go etc.
- Audio (post-)production is typically done in the studio
- Audio restoration is typically done in the studio
By using context information to act on the content using AI, it is possible to substantially improve the user experience.
Figure 1 represents how MPAI-CAE can reorganise its processing modules within an MPAI-AIF Framework to support different applications.
Figure 1 – Instances of MPAI-CAE
Comments: Currently, there are solutions that adapt the conditions in which the user experiences content or service for some of the contexts mentioned above. However, they tend to be vertical in nature, making it difficult to re-use possibly valuable AI-based components of the solutions for different applications.
MPAI-CAE aims to create a horizontal market of re-usable and possibly context-dependent components that expose standard interfaces. The market would become more receptive to innovation and hence more competitive. Industry and consumers alike will benefit from the MPAI-CAE standard.
Examples
The following examples describe how MPAI-CAE can make the difference.
- Enhanced audio experience in a conference call
Often, the user experience of a video/audio conference can be marginal. Too much background noise or undesired sounds can lead to participants not understanding what other participants are saying. By using AI-based adaptive noise-cancellation and sound enhancement, MPAI-CAE can virtually eliminate those kinds of noise without using complex microphone systems to capture environment characteristics.
- Pleasant and safe music listening while biking
While biking in the middle of city traffic, AI can process the signals from the environment captured by the microphones available in many earphones and earbuds (for active noise cancellation), adapt the sound rendition to the acoustic environment, provide an enhanced audio experience (e.g. performing dynamic signal equalization), improve battery life and selectively recognize and let through relevant environment sounds (e.g. the horn of a car). The user enjoys a satisfactory listening experience without losing contact with the acoustic surroundings.
- Emotion enhanced synthesized voice
Speech synthesis is constantly improving and finding several applications that are part of our daily life (e.g. intelligent assistants). In addition to improving how natural the synthesized voice sounds, MPAI-CAE can implement expressive models of primary emotions such as fear, happiness, sadness, and anger.
- Efficient 3D sound
MPAI-CAE can reduce the number of channels (e.g. MPEG-H 3D Audio can support up to 64 loudspeaker channels and 128 codec core channels) in an automatic (unsupervised) way, e.g. by mapping a 9.1 layout to a 5.1 or stereo layout (for radio broadcasting or DVD), while preserving the musical intent of the composer.
- Speech/audio restoration
Audio restoration is often a time-consuming process that requires skilled audio engineers with specific experience in music and recording techniques to manually go over old audio tapes. MPAI-CAE can automatically remove anomalies from recordings through broadband denoising, declicking and decrackling, as well as removing buzzes and hums and performing spectrographic ‘retouching’ for removal of discrete unwanted sounds.
- Normalization of volume across channels/streams
Eighty-five years after TV was first introduced as a public service, TV viewers are still struggling to adapt to their needs the different average audio levels of different broadcasters and, within a program, the different audio levels of the different scenes.
MPAI-CAE can learn from the user’s reactions via the remote control, e.g. to a loud advertising spot, and control the sound level accordingly.
- Automotive
Audio systems in cars have steadily improved in quality over the years and continue to be integrated into more critical applications. Today, a buyer takes it for granted that a car has a good automotive sound system. In addition, in a car there is usually at least one and sometimes two microphones to handle the voice-response system and the hands-free cell-phone capability. If the vehicle uses any noise cancellation, several other microphones are involved. MPAI-CAE can be used to improve the user experience and enable the full quality of current audio systems by reducing the effects of the noisy automotive environment on the signals.
- Audio mastering
Audio mastering is still considered an ‘art’ and the prerogative of pro audio engineers. Normal users can upload an example track of their liking (possibly obtained from similar musical content) and MPAI-CAE analyzes it, extracts key features and generates a master track that ‘sounds like’ the example track, starting from the non-mastered track. It is also possible to specify the desired style without an example, and the original track will be adjusted accordingly.
Requirements:
The following is an initial set of MPAI-CAE functional requirements, to be further developed in the next few weeks. When the full set of requirements has been developed, the MPAI General Assembly will decide whether an MPAI-CAE standard should be developed. A non-normative sketch of one possible representation of these data follows the list below.
- The standard shall specify the following natural input signals
- Microphone signals
- Inertial measurement signals (Acceleration, Gyroscope, Compass, …)
- Vibration signals
- Environmental signals (Proximity, temperature, pressure, light, …)
- Environment properties (geometry, reverberation, reflectivity, …)
- The standard shall specify
- User settings (equalization, signal compression/expansion, volume, …)
- User profile (auditory profile, hearing aids, …)
- The standard shall support the retrieval of pre-computed environment models (audio scene, home automation scene, …)
- The standard shall reference the user authentication standards/methods required by the specific MPAI-CAE context
- The standard shall specify means to authenticate the components and pipelines of an MPAI-CAE instance
- The standard shall reference the methods used to encrypt the streams processed by MPAI-CAE and service-related metadata
- The standard shall specify the adaptation layer of MPAI-CAE streams to delivery protocols of common use (e.g. Bluetooth, Chromecast, DLNA, …)
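The following non-normative sketch, with purely hypothetical names and units, shows one possible grouping of the natural input signals, user settings and user profile listed above:

from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple

@dataclass
class NaturalInputs:
    microphones: List[List[float]]                               # per-microphone sample buffers
    acceleration: Optional[Tuple[float, float, float]] = None    # inertial measurement, m/s^2
    gyroscope: Optional[Tuple[float, float, float]] = None       # rad/s
    compass: Optional[float] = None                              # heading, degrees
    vibration: Optional[List[float]] = None
    proximity: Optional[float] = None                            # environmental signals
    temperature: Optional[float] = None
    pressure: Optional[float] = None
    light: Optional[float] = None

@dataclass
class UserSettings:
    equalisation: Dict[float, float] = field(default_factory=dict)  # band (Hz) -> gain (dB)
    compression_ratio: float = 1.0
    volume: float = 1.0

@dataclass
class UserProfile:
    auditory_profile_id: Optional[str] = None   # link to a User Hearing Profiles KB entry
    hearing_aids: bool = False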
Object of standard: Currently, three areas of standardization are identified:
- Context type interfaces: a first set of input and output signals, with corresponding syntax and semantics, for audio usage contexts considered of sufficient interest (e.g. audioconferencing and audio consumption on-the-go). They have the following features
- Input and output signals are context specific, but with a significant degree of commonality across contexts
- The operation of the framework is implementation-dependent, offering implementors a way to produce the set of output signals that best fit the usage context
- Processing component interfaces, with the following features (a non-normative sketch follows this list)
- Interfaces of a set of updatable and extensible processing modules (both traditional and AI-based)
- Possibility to create processing pipelines and the associated control (including the needed side information) required to manage them
- The processing pipeline may be a combination of local and in-cloud processing
- Delivery protocol interfaces
- Interfaces of the processed audio signal to a variety of delivery protocols
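As a purely illustrative sketch of the processing component interfaces described above (hypothetical names, not a normative API), a processing module and a pipeline could look as follows:

from abc import ABC, abstractmethod
from typing import Any, Dict, List

class ProcessingModule(ABC):
    # A traditional or AI-based processing module; a pipeline may mix local and in-cloud modules.
    runs_in_cloud: bool = False

    @abstractmethod
    def process(self, inputs: Dict[str, Any], control: Dict[str, Any]) -> Dict[str, Any]:
        # Consume named input signals plus control/side information, produce named outputs.
        ...

class Pipeline:
    # Modules are executed in order, each feeding its outputs to the next one.
    def __init__(self, modules: List[ProcessingModule]):
        self.modules = modules

    def run(self, inputs: Dict[str, Any], control: Dict[str, Any]) -> Dict[str, Any]:
        data = inputs
        for module in self.modules:
            data = module.process(data, control)
        return data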
Benefits: MPAI-CAE will bring benefits to the following actors:
- Technology providers need not develop full applications to put their technologies to good use. They can concentrate on improving the AI technologies that enhance the user experience. Further, their technologies can find a much broader use in application domains beyond those they are accustomed to dealing with.
- Equipment manufacturers and application vendors can tap into the set of technologies made available according to the MPAI-CAE standard from different competing sources, integrate them and satisfy their specific needs
- Service providers can deliver complex optimizations and thus a superior user experience with minimal time to market, as the MPAI-CAE framework enables easy combination of third-party components from both a technical and a licensing perspective. Their services can deliver a high-quality, consistent user audio experience with minimal dependency on the source by selecting the optimal delivery method
- End users enjoy a competitive market that provides constantly improved user experiences and controlled cost of AI-based audio endpoints.
Bottlenecks: the full potential of AI in MPAI-CAE would be unleashed by a market of AI-friendly processing units and by the introduction of the vast amount of available AI technologies into products and services.
Social aspects: MPAI-CAE would free users from the dependency on the context in which they operate; make the content experience more personal; make the collective service experience less dependent on events affecting the individual participant and raise the level of past content to today’s expectations.
Success criteria: MPAI-CAE should create a competitive market of AI-based components exposing standard interfaces, of processing units available to manufacturers and of a variety of end-user devices, and should trigger the implicit need felt by users to have the best experience whatever the context.