Context-based Audio Enhancement (MPAI-CAE)
1 Introduction
Moving Picture, Audio and Data Coding by Artificial Intelligence (MPAI) is an international association with the mission to develop AI-enabled data coding standards. Research has shown that data coding with AI-based technologies is more efficient than with existing technologies.
The MPAI approach to developing AI data coding standards is based on the definition of standard interfaces of AI Modules (AIM). AIMs operate on input and output data both having a standard format. AIMs can be combined and executed in an MPAI-specified AI-Framework according to the emerging standard MPAI-AIF. A Call for MPAI-AIF Technologies [2] with associated Use Cases and Functional Requirements [1] was issued on 2020/12/16 and is now closed.
While AIMs must expose standard interfaces to be able to operate in an MPAI AI Framework, their performance may differ depending on the technologies used to implement them. MPAI believes that competing developers striving to provide higher-performance proprietary yet interoperable AIMs will promote horizontal markets of AI solutions that build on and further promote AI innovation.
This document is a collection of Use Cases and corresponding Functional Requirements in the MPAI Context-based Audio Enhancement (MPAI-CAE) application area. The Use Cases in this area help improve the audio user experience for several application spaces that include entertainment, communication, teleconferencing, gaming, post-production, restoration etc. in a variety of contexts such as in the home, in the car, on-the-go, in the studio etc.
Currently MPAI has identified four Use Cases falling in the Context-based Audio Enhancement application area:
- Emotion-Enhanced Speech (EES)
- Audio Recording Preservation (ARP)
- Enhanced Audioconference Experience (EAE)
- Audio-on-the-go (AOG)
This document is to be read in conjunction with the MPAI-CAE Call for Technologies (CfT) [3] as it provides the functional requirements of all technologies identified as required to implement the current MPAI-CAE Use Cases. Respondents to the MPAI-CAE CfT are requested to make sure that their responses are aligned with the functional requirements expressed in this document.
In the future MPAI may issue other Calls for Technologies falling in the scope of MPAI-CAE to support identified Use Cases. Currently these are:
- Efficient 3D sound
- (Serious) gaming
- Normalization of TV volume
- Automotive
- Audio mastering
- Speech communication
- Audio (post-)production
It should also be noted that some technologies identified in this document are the same as, similar to, or related to technologies required to implement some of the Use Cases of the companion document MPAI-MMC Use Cases and Functional Requirements [5]. Readers are advised that familiarity with the content of that companion document is a prerequisite for a proper understanding of this document.
This document is structured in 7 chapters, including this Introduction.
Chapter 2 | briefly introduces the AI Framework Reference Model and its six Components. |
Chapter 3 | briefly introduces the 4 Use Cases. |
Chapter 4 | presents the 4 MPAI-CAE Use Cases with the following structure: 1. Reference architecture, 2. AI Modules, 3. I/O data of AI Modules, 4. Technologies and Functional Requirements. |
Chapter 5 | identifies the technologies likely to be common across MPAI-CAE and MPAI-MMC, a companion standard project whose Call for Technologies is issued simultaneously with MPAI-CAE’s. |
Chapter 6 | gives a basic list of relevant terms and their definitions. |
Chapter 7 | gives relevant references. |
For the reader’s convenience, Table 1 introduces the meaning of the acronyms used in this document.
Table 1 – MPAI-CAE acronyms
Acronym | Meaning |
AI | Artificial Intelligence |
AIF | AI Framework |
AIM | AI Module |
AOG | Audio-on-the-go |
ARP | Audio Recording Preservation |
CfT | Call for Technologies |
DP | Data Processing |
EAE | Enhanced Audioconference Experience |
EES | Emotion-Enhanced Speech |
KB | Knowledge Base |
ML | Machine Learning |
2 The MPAI AI Framework (MPAI-AIF)
Most MPAI applications considered so far can be implemented as a set of AIMs – AI, ML and even traditional DP-based units with standard interfaces assembled in suitable topologies to achieve the specific goal of an application and executed in an MPAI-defined AI Framework. MPAI is making all efforts to identify processing modules that are re-usable and upgradable without necessarily changing their internal logic. MPAI plans on completing the development of a 1st generation MPAI-AIF AI Framework in July 2021.
The MPAI-AIF Architecture is given by Figure 1.
Figure 1 – The MPAI-AIF Architecture
MPAI-AIF is made up of 6 Components:
- Management and Control manages and controls the AIMs, so that they execute in the correct order and at the time when they are needed.
- Execution is the environment in which combinations of AIMs operate. It receives external inputs and produces the requested outputs, both of which are Use Case specific, activates the AIMs, exposes interfaces with Management and Control and interfaces with Communication, Storage and Access.
- AI Modules (AIM) are the basic processing elements receiving processing specific inputs and producing processing specific outputs.
- Communication is the basic infrastructure used to connect possibly remote Components and AIMs. It can be implemented, e.g., by means of a service bus.
- Storage encompasses traditional storage and is used to, e.g., store the inputs and outputs of the individual AIMs, data from the AIM states, intermediary results and data shared by AIMs.
- Access represents the access to static or slowly changing data that are required by the application such as domain knowledge data, data models, etc.
3 Use Cases
3.1 Emotion-Enhanced Speech (EES)
Speech carries information not only about the lexical content, but also about a variety of other aspects such as age, gender, signature, and emotional state of the speaker [3]. Speech synthesis is evolving towards supporting these aspects.
There are many cases where speech without emotion needs to be converted into speech carrying an emotion, possibly with grades of a particular emotion. This is the case, for instance, in a human-machine dialogue where the message conveyed by the machine is more effective if it carries an emotion properly related to the emotion detected in the human speaker.
The AI Modules identified in the Emotion-Enhanced Speech (EES) Use Case considered in this document will make it possible to create virtual agents communicating in a more natural way, and thus improve the quality of human-machine interaction, by making it closer to a human-human interaction [8].
EES’s ultimate goal is to help realise a user-friendly system control interface that lets users generate speech with various – continuous and real time – expressiveness control levels.
3.2 Audio Recording Preservation (ARP)
Preservation of audio assets recorded on a variety of media (vinyl, tapes, cassettes etc.) is an important activity for a variety of application domains, in particular cultural heritage, because preservation requires more than a “neutral” transfer of audio information from the analogue to the digital domain. For instance,
- It is necessary to recover and preserve context information, obviously, but not exclusively, audio.
- The recording of an acoustic event cannot be a neutral operation because the timbre quality and the plastic value of the recorded sound, which are of great importance in, for example, contemporary music, are influenced by the positioning of the microphones used during the recording.
- The processing is carried out by the Tonmeister, i.e., the person who has a detailed theoretical and practical knowledge of all aspects of sound recording. Unlike a sound engineer, however, the Tonmeister must also be deeply trained in music: musicological and historic-critical competence is essential for the identification and correct cataloguing of the information contained in audio documents [9].
- As sound carriers are made of unstable base materials, they are more subject to damage caused by inadequate handling. The combination of technical and scientific training with historic-philological knowledge (an important element for the identification and correct cataloguing of the information contained in audio documents) becomes essential for preservative re-recording operations, going beyond mere A/D conversion. In the case of magnetic tapes, the carrier may hold important information: the tape can include multiple splices; it can be annotated (by the composer or by the technicians) and/or display several types of irregularities (e.g., corruptions of the carrier, tape of different colour or chemical composition).
In this Audio Recording Preservation Use Case, currently restricted to magnetic tapes, audio is digitised and fed into a preservation system. The audio information is supplemented by the information coming from a video camera pointed at the head that reads the magnetic tape. The output of the restoration process consists of:
- Preservation digital audio
- Preservation master file that contains, in addition to the preservation audio file, several other information types created by the preservation process.
The introduction of this use case in the field of active preservation of audio documents opens the way to responding effectively to the methodological questions of reliability with respect to recordings as documentary sources, while clarifying the concept of “historical faithfulness”.
The goal is to cover the whole “philologically informed” archival process of an audio document, from the active preservation of sound documents to the access to digitized files.
3.3 Enhanced Audioconference Experience (EAE)
Often, the user experience of a video/audio conference is far from satisfactory. Too much background noise or undesired sounds can lead to participants not understanding, or even misunderstanding, what other participants are saying, in addition to creating distraction.
By using AI-based adaptive noise-cancellation and sound enhancement, those kinds of noise can be virtually eliminated without using complex microphone systems that capture environment characteristics.
In this use case, the system receives microphone sound and microphone geometry information, which describes the number, positioning and configuration of the microphone or the array of microphones. Using this information, the system is able to detect and separate the audioconference speech from spurious sounds. Note that Microphone physical information (i.e., frequency response and deviation of the microphone) might be added, but that would likely be overkill for this scenario. The resulting speech then undergoes Noise Cancellation. The resulting output is equalized based on the output device characteristics, fetched from an Output Device Acoustic Model Knowledge Base, which describes the frequency response of the selected output device. In this way the speech can be equalized, removing any coloration introduced by the output device and resulting in an optimally delivered sound experience.
3.4 Audio-on-the-go (AOG)
While biking in the middle of city traffic, the user should enjoy a satisfactory listening experience without losing contact with the acoustic surroundings.
The microphones available in earphones or earbuds capture the signals from the environment. The relevant environment sounds (e.g., the horn of a car) are selectively recognised and the sound rendition is adapted to the acoustic environment, providing an enhanced audio experience (e.g., performing dynamic signal equalization) and improved battery life.
In this use case, the Microphone sound captures the surrounding environment noise; it is accompanied by Geometry information, which describes the number, positioning and configuration of the microphone or the array of microphones.
The sounds are then categorized. The result is an array of sounds with their categorization.
Sounds not relevant for the user in the specific moment are trimmed out and the rest of the sound information undergoes dynamic signal equalization using User Hearing Profile information.
Finally, the resulting sound is delivered to the output via the most appropriate delivery method.
4 Functional Requirements
4.1 Introduction
The Functional Requirements developed in this document refer to the individual technologies identified as necessary to implement Use Cases belonging to the MPAI-CAE application area using AIMs operating in an MPAI-AIF AI Framework. The Functional Requirements developed adhere to the following guidelines:
- AIMs are defined to allow implementations by multiple technologies (AI, ML, DP)
- DP-based AIMs need interfaces such as to a Knowledge Base. AI-based AIMs will typically require a learning process; however, support for this process is not included in this document. MPAI may develop further requirements covering that process in a future document.
- AIMs can be aggregated in larger AIMs. Some data flows of aggregated AIMs may not necessarily be exposed any longer.
- AIM definitions may be influenced by the companion MPAI-MMC Use Cases and Functional Requirements [5], as some technologies needed by MPAI-MMC AIMs share a significant number of functional requirements with those needed here.
- Current AIMs do not feed information back to AIMs upstream. Respondents to the MPAI-CAE Call for Technologies [3] are welcome to motivate the need for such feed-back data flows and propose associated requirements.
The Functional Requirements described in the following sections are the result of a dedicated effort by MPAI experts over many meetings where different partitionings into AIMs have been proposed, discussed and revised. MPAI is aware that alternative partitionings or alternative I/O data to/from AIMs are possible. Those reading this document for the purpose of submitting a response to the MPAI-CAE Call for Technologies (N152) [2] are welcome to propose alternative partitionings or alternative I/O data in their submissions. In this case, however, they are required to justify their alternatives and determine the functional requirements of the relevant technologies. The evaluation team, of which proponents can, if they so wish, be members, will study the proposed alternative arrangement and may decide to accept all or part of it.
4.2 Emotion-Enhanced Speech
4.2.1 Reference architecture
This Use Case can be implemented as in Figure 2 and Figure 3. The two figures differ in the use of legacy DP technology vs AI technology:
- In Figure 2 the Speech analysis AIM is implemented with legacy Data Processing technologies.
- In Figure 3 the Speech analysis AIM is implemented as a neural network which incorporates the Emotion KB information.
Figure 2 – Emotion-enhanced speech (using external Knowledge Base)
Figure 3 – Emotion-enhanced speech (fully AI-based)
4.2.2 AI Modules
The AI Modules perform the functions described in Table 2.
Table 2 – AI Modules of Emotion-Enhanced Speech
AIM | Function |
Speech feature analyser | Computes Speech features, queries the Emotion KB and obtains Emotion descriptors. Alternatively, Emotion descriptors are produced by an embedded neural network. |
Emotion KB | Exposes an interface that allows the Speech feature analyser to query a KB of speech features extracted from recordings of different speakers reading/reciting the same corpus of texts, with the standard set of emotions and without emotion, for different languages and genders. |
Emotion inserter | Inserts a particular emotional vocal timbre, e.g., anger, disgust, fear, happiness, sadness, and surprise into a neutral (emotion-less) synthesised voice. It also changes the strength of an emotion (from neutral speech) in a gradual fashion. |
4.2.3 I/O interfaces of AI Modules
The I/O data of the Emotion Enhanced Speech AIMs are given in Table 3.
Table 3 – I/O data of Emotion-Enhanced Speech AIMs
AIM | Input Data | Output Data |
Speech features analyser | Emotion-less speech, Emotion, Emotion descriptors | Emotion descriptors, Speech features |
Emotion KB | Speech features | Emotion descriptors |
Emotion inserter | Emotion-less speech, Emotion descriptors | Speech with Emotion, Emotion descriptors |
4.2.4 Technologies and Functional Requirements
4.2.4.1 Digital Speech
Speech should be sampled at a frequency between 8 kHz and 96 kHz and digitally represented between 16 bits/sample and 24 bits/sample (both linear). The frequency of 22.05 kHz should be used for the purpose of a response to the MPAI-CAE Call for Technologies. Demonstrations of a proposed technology for other sampling frequencies are welcome.
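For illustration only, the following minimal Python sketch shows one way to bring a speech file to the 22.05 kHz, 16-bit linear working format suggested above. It assumes the soundfile and scipy packages; the file names are hypothetical.
```python
# Minimal sketch: conform a speech file to 22.05 kHz, 16-bit linear PCM.
# Assumes the "soundfile" and "scipy" packages; file names are hypothetical.
import soundfile as sf
from scipy.signal import resample_poly

TARGET_RATE = 22050  # sampling frequency suggested for CfT responses

speech, rate = sf.read("speech.wav")          # any rate in the 8-96 kHz range
if rate != TARGET_RATE:
    # Rational-factor polyphase resampling to the target rate.
    speech = resample_poly(speech, TARGET_RATE, rate)

sf.write("speech_22k05.wav", speech, TARGET_RATE, subtype="PCM_16")
```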
To Respondents
Respondents are invited to comment on these choices.
4.2.4.2 Emotion
By Emotion we mean a digital attribute that indicates an emotion out of a finite set of Emotions.
In EES the input speech – natural or synthesised – does not contain emotion while the output speech is expected to contain the emotion expressed by the input Emotion.
The most basic Emotions are described by the set: “anger, disgust, fear, happiness, sadness, and surprise” [10], or “joy versus sadness, anger versus fear, trust versus disgust, and surprise versus anticipation” [11]. One of these sets can be taken as “universal” in the sense that they are common across all cultures. An Emotion may have different Grades [12,13].
To Respondents
Respondents are requested to propose:
- A minimal set of Emotions whose semantics are shared across cultures.
- A set of Grades that can be associated to Emotions.
- A digital representation of Emotions and their Grades (starting from [14]).
Currently, the MPAI-CAE Call for Technologies does not envisage considering culture-specific Emotions. However, the proposed digital representation of Emotions and their Grades should either accommodate, or be extensible to accommodate, culture-specific Emotions.
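For illustration only, the following minimal sketch shows one possible, non-normative digital representation of an Emotion and its Grade, loosely inspired by EmotionML [14]. The emotion set, the grade scale and all field names are assumptions, not MPAI choices.
```python
# Non-normative sketch of an Emotion + Grade record (names are assumptions).
import json
from dataclasses import dataclass, asdict

BASIC_EMOTIONS = {"anger", "disgust", "fear", "happiness", "sadness", "surprise"}

@dataclass
class Emotion:
    name: str                   # one of BASIC_EMOTIONS, or a culture-specific label
    grade: float                # intensity in [0.0, 1.0] (assumed scale)
    culture: str = "universal"  # extension hook for culture-specific Emotions

    def to_json(self) -> str:
        assert self.name in BASIC_EMOTIONS or self.culture != "universal"
        assert 0.0 <= self.grade <= 1.0
        return json.dumps(asdict(self))

print(Emotion(name="happiness", grade=0.7).to_json())
```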
4.2.4.3 Emotion KB query format
To accomplish their task, speech processing applications utilize certain features of speech signals. General speech features are described in [15,16]. The extraction of these features from a speech signal is known as speech analysis. Extraction can be done in the time domain as well as in the frequency domain.
Time-domain features are related to the waveform analysis in the time domain. Analysing speech in the time domain often requires simple calculation and interpretation. Time-domain features can be used to measure the arousal level of emotions.
Time-domain features carry information about sequences of short-time prosody acoustic features (features estimated on a frame basis). Example features modified by the emotional states are given by short-time zero crossing rate, short-term speech energy and duration [19].
Frequency-domain features can be computed using (short-time) Fourier transform, wavelet transform, and other mathematical tools [24]. Frequency domain operation provides mechanisms to obtain some of the most useful parameters in speech analysis because the human cochlea performs a quasi-frequency analysis.
Initially, the time-domain signal is transformed into the frequency-domain, from which the features are extracted. Such features are highly associated with the human perception of speech. Hence, they have apparent acoustic characteristics. These features usually comprise formant frequency, linear prediction cepstral coefficient (LPCC), and Mel frequency cepstral coefficients (MFCC).
The frequency-domain features can carry information about:
- The Pitch signal (i.e., the glottal waveform), which depends on the tension of the vocal folds and the subglottal air pressure. Two parameters related to the pitch signal can be considered: pitch frequency and glottal air velocity. For example, high velocity indicates an emotion such as happiness, while low velocity occurs in harsher styles such as anger [25].
- The shape of the vocal tract, which is modified by the emotional states. The formants (characterized by a centre frequency and a bandwidth) can represent the vocal tract resonances.
- The number of harmonics due to the non-linear airflow in the vocal tract. For example, in the emotional state of anger, the fast air flow causes additional excitation signals other than the pitch. Teager Energy Operator-based (TEO) features measure the harmonics and cross-harmonics in the spectrum [26].
Examples of features modified by the emotional states are given by the Mel-frequency cepstrum (MFC) [27].
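As an illustration of the feature classes discussed above, the following sketch extracts a few time-domain and frequency-domain speech features with the librosa package. The specific features, parameters and the way they are aggregated into a vector are assumptions made for illustration, not a proposed Speech feature set.
```python
# Illustrative extraction of a few time- and frequency-domain speech features;
# assumes the "librosa" package and a hypothetical input file.
import numpy as np
import librosa

y, sr = librosa.load("speech.wav", sr=22050)

# Time-domain features (frame-based): arousal-related cues.
zcr = librosa.feature.zero_crossing_rate(y)          # short-time zero crossing rate
energy = librosa.feature.rms(y=y)                    # short-term energy

# Frequency-domain features: perception-related cues.
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)   # Mel frequency cepstral coefficients
f0 = librosa.yin(y, fmin=65.0, fmax=500.0, sr=sr)    # pitch (fundamental frequency) track

# An illustrative speech feature vector, e.g., usable as input to an Emotion KB query.
features = np.concatenate([zcr.mean(axis=1), energy.mean(axis=1),
                           mfcc.mean(axis=1), [np.nanmean(f0)]])
```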
Today, there is a variety of speech datasets available (online). Often, they consist of conversational setups and contain overlaps in speech as well as noise, or they are poor in expressiveness. Some datasets offer emotionally rich content with a high quality, but in a limited amount [e.g., 19,20,21,22]. To be effective, an Emotion KB should contain a large and expressive speech dataset.
Emotion KB contains speech features extracted from the speech recordings of speakers reading/ reciting the same corpus of texts with an agreed set of emotions and without emotion, for a set of languages and for different genders (voice performances by professional actors in comparison with the author’s spontaneous speech) [28, 29].
Emotion KB is queried by providing a vector of speech features. Emotion KB responds by providing Emotion descriptors.
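A non-normative sketch of what such a query/response exchange might look like is given below; all field names, feature names and descriptor names are assumptions.
```python
# Non-normative sketch of a possible Emotion KB query/response payload.
# Field names, feature names and descriptor names are assumptions.
query = {
    "language": "en",
    "gender": "female",
    "emotion": {"name": "happiness", "grade": 0.7},
    "speech_features": {
        "zcr_mean": 0.081,
        "energy_mean": 0.034,
        "mfcc_mean": [-310.2, 121.7, -12.4],   # truncated for brevity
        "f0_mean_hz": 212.5
    }
}

response = {
    "emotion_descriptors": {
        "f0_scale": 1.25,            # raise mean pitch by 25%
        "f0_range_scale": 1.4,       # widen pitch excursions
        "speaking_rate_scale": 1.1,
        "energy_scale": 1.15,
        "formant_shift_hz": 30.0
    }
}
```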
To Respondents
Respondents are requested to propose an Emotion KB query format satisfying the following requirements:
- Accept as input:
- A vector of speech features capable of modelling:
- Non-extreme emotional states [17].
- Many emotional states with a natural-sounding voice [18].
- An Emotion.
- Provide as output a set of Emotion descriptors.
When assessing proposed Speech features, MPAI may resort to objective testing.
Note: An AI-based implementation may not need Emotion KB.
4.2.4.4 Emotion descriptors
Emotion descriptors are features used to alter the prosodic characteristics, the pitch, and the formant frequencies and bandwidth of Digital speech.
Speech analysis can use different strategies to render the emotion depending on:
- The type of sentence (number of words, type of phonemes, etc.) to which an emotion is added.
- The emotions added to the previous and next sentence.
Emotion descriptors can be obtained by querying an Emotion KB (in the case of Figure 2) or from the output of a neural network (in the case of Figure 3).
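For illustration only, the sketch below shows how Emotion descriptors of the kind assumed earlier might drive a very simple prosody modification of emotion-less speech; a real Emotion inserter would be considerably more sophisticated. It assumes the numpy, librosa and soundfile packages and hypothetical file and descriptor names.
```python
# Minimal, illustrative prosody modification driven by Emotion descriptors.
# Descriptor names follow the assumed sketch above, not an MPAI-defined format.
import numpy as np
import librosa
import soundfile as sf

y, sr = librosa.load("neutral_speech.wav", sr=22050)
desc = {"f0_scale": 1.25, "speaking_rate_scale": 1.1, "energy_scale": 1.15}

n_steps = 12 * np.log2(desc["f0_scale"])            # pitch scale -> semitones
y = librosa.effects.pitch_shift(y, sr=sr, n_steps=n_steps)
y = librosa.effects.time_stretch(y, rate=desc["speaking_rate_scale"])
y = y * desc["energy_scale"]

sf.write("emotional_speech.wav", y, sr)
```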
To Respondents
Respondents should propose Emotion descriptors suitable for introducing Emotion into a specific emotion-less speech, resulting in speech that sounds “natural” to the listener.
When assessing proposed Speech features, MPAI may resort to subjective testing.
4.3 Audio Recording Preservation
4.3.1 Reference architecture
This Use Case is implemented as in Figure 4 and Figure 5. The two figures differ in the use of legacy DP technology vs AI technology:
- In Figure 4 the Audio-video Analysis AIM is implemented with Data Processing Technologies.
- In Figure 5 the Audio-video Analysis AIM is implemented as a neural network which incorporates the Tape irregularity KB information.
Figure 4 – Tape Audio preservation (using external Knowledge Base)
Figure 5 – Tape Audio preservation (fully AI-based)
4.3.2 AI Modules
The AIMs required by this Use Case are described in Table 4.
Table 4 – AI Modules of Audio Recording Preservation
AIM | Function |
Audio enhancer | Produces Preservation audio using an internal denoiser aimed only at compensating for (a) the non-linear frequency response caused by imperfect historical recording equipment; (b) rumble, needle noise, or tape hiss caused by the imperfections introduced by aging (see 4.3.5). |
Audio analyser | Produces audio excerpts based on signals from Video analysis. |
Video analyser | Extracts images from Video, queries the Tape irregularity KB and provides Images and Irregularity IDs. Alternatively, an embedded neural network produces the images. |
Musicological classifier | Produces relevant images from Digital video and Text describing the images |
Packager | Produces a file containing: 1. Digital audio, 2. Input video, 3. Audio-sync’d images and text |
Tape irregularity KB | Knowledge Base of visual (tape) and audio irregularities |
4.3.3 I/O interfaces of AI Modules
The I/O data of the Audio Recording Preservation AIMs are given in Table 5.
Table 5 – I/O data of Audio Recording Preservation AIMs
AIM | Input Data | Output Data |
Audio enhancer | Digital Audio | Preservation Audio |
Audio analysis | Preservation Audio, Irregularity | Audio Excerpts |
Video analysis | Digital Video, Tape irregularity KB response | Images, Tape irregularity KB query, Irregularity IDs |
Musicological classifier | Audio Excerpts, Images, Irregularity IDs | Text, Images |
Packager | Preservation Audio, Digital Video, Text, Images | Preservation Master |
Tape irregularity KB | Query | Response |
4.3.4 Technologies and Functional Requirements
4.3.4.1 Digital Audio
Digital Audio is audio sampled from an analogue source (e.g., magnetic tapes, 78 rpm phonographic discs) at a frequency in the 44.1-96 kHz range, with at least 16 and at most 24 bits/sample [30].
To Respondents
Respondents are invited to comment on this choice.
4.3.4.2 Digital Video
Digital video has the following features.
- Pixel shape: square
- Bit depth: 8-10 bits/pixel
- Aspect ratio: 4/3 and 16/9
- 640 ≤ number of horizontal pixels ≤ 1920
- 480 ≤ number of vertical pixels ≤ 1080
- Frame frequency 50-120 Hz
- Scanning: progressive
- Colorimetry: ITU-R BT709 and BT2020
- Colour format: RGB and YUV
- Compression: uncompressed; if compressed, AVC or HEVC
To Respondents
Respondents are invited to comment on these choices.
4.3.4.3 Digital Image
A Digital Image is
- An uncompressed video frame with time information or
- A JPEG-compressed video frame [32] with time information.
To Respondents
Respondents are invited to comment on this choice.
4.3.4.4 Tape irregularity KB query format
Tape irregularity KB contains features extracted from images of different tape irregularities [38].
The Irregularity KB is queried by giving a vector of Image features that describe [37]:
- Splices of
- Leader tape to magnetic tape
- Magnetic tape to magnetic tape
- Other irregularities such as brands on tape, ends of tape, ripples, damaged tapes, markings, dirt, shadows etc.
The Irregularity KB responds by providing the type of irregularity detected in the input Image.
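A non-normative sketch of a possible query/response exchange is given below; the field names and the irregularity identifiers are assumptions, not a proposed taxonomy.
```python
# Non-normative sketch of a possible Tape irregularity KB query/response.
# Field names and the irregularity taxonomy are assumptions, not MPAI choices.
query = {
    "image_id": "tape_0042_frame_001234",
    "timestamp": "00:12:34.040",
    "image_features": [0.12, 0.87, 0.03, 0.44]   # illustrative feature vector
}

response = {
    "irregularity_present": True,
    "irregularities": [
        {"id": "IRR-SPLICE-LEADER",
         "type": "splice: leader tape to magnetic tape",
         "confidence": 0.93},
        {"id": "IRR-MARKING", "type": "marking on tape", "confidence": 0.41}
    ]
}
```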
To Respondents
Respondents are requested to propose a Tape irregularity KB query format satisfying the following requirements:
- A complete set of audio tape irregularities and Image features that characterise them.
- A response to a query shall indicate:
- The presence or absence of irregularities.
- The type of irregularity (if irregularities are present).
When assessing proposed Image features MPAI may resort to objective testing.
This CfT specifically addresses audio tape preservation. However, its scope may be extended if sufficient technologies covering other audio preservation instances are received. Any proposal for other audio preservation instances should be described with a level of detail comparable to this Use Case.
4.3.4.5 Text
Text should be encoded according to ISO/IEC 10646, Information technology – Universal Coded Character Set (UCS) to support most languages in use [39].
To Respondents
Respondents are invited to comment on this choice.
4.3.4.6 Packager
Packager takes Preservation Audio, Digital Video, Text and Images and produces the Preservation Master file.
To Respondents
Respondents should propose a file format that can:
- Support queries for irregularities, showing all the images corresponding to a given irregularity (splices, carrier corruptions, etc.).
- Allow listening to the audio corresponding to a particular image.
- Allow the audio signal to be annotated (with text), to support musicological analysis.
- Support queries on the annotations, returning the corresponding time (sec:ms:sample), the text, the audio signal excerpt and the image (if any).
- Support random access to a specified portion of video and/or audio.
Preference will be given to formats that have already been standardised or are in wide use.
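For illustration only, the following non-normative sketch shows the kind of cross-references such a Preservation Master format would have to carry; it is a manifest-style outline, not a proposed file format, and all file names and field names are hypothetical.
```python
# Illustrative (non-normative) manifest of a Preservation Master, showing the
# cross-references the requested file format would have to support.
preservation_master = {
    "preservation_audio": "tape_0042_audio.wav",
    "digital_video": "tape_0042_head.mp4",
    "images": [
        {"file": "img_001.jpg", "time": "00:12:34.040",
         "irregularity": "splice: leader tape to magnetic tape"}
    ],
    "annotations": [
        {"time": "754:040:16633",   # sec:ms:sample style time reference
         "text": "splice before second movement; different tape stock",
         "audio_excerpt": "excerpt_0007.wav",
         "image": "img_001.jpg"}
    ]
}
```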
4.3.5 Information about Audio enhancement performance
A fifty-year-long debate around the restoration of audio documents has been ongoing inside the archivists’ and musicologists’ communities [33].
The Preservation audio produced by Audio enhancement must fulfil the requirements of accuracy, reliability, and philological authenticity.
In [34] Schuller presents an accurate investigation of signal alterations, classified into two categories:
- Intentional alterations, which include recording, equalization, and noise reduction systems.
- Unintentional alterations, further divided into those caused by:
- The imperfection of the recording technique of the time, resulting in various distortions.
- Misalignment of the recording equipment, e.g., wrong speed, deviation from the vertical cutting angle in cylinders, or misalignment of the recording in magnetic tape.
The choice of whether or not to compensate for these alterations reveals different restoration strategies: historical faithfulness can refer to the recording as it was produced, precisely equalized for intentional recording equalizations, compensated for any errors caused by misaligned recording equipment (for example, wrong speed, deviation from the vertical cutting angle in cylinders, or misalignment of the recording in magnetic tape) and digitized using modern equipment to minimize replay distortions.
There is a certain margin of interpretation, because historical knowledge of the document comes into play alongside technical-scientific knowledge, for instance, to identify the equalization curves of magnetic tapes or to determine the rotation speed of a record. Most of the information required is retrievable from the history of audio technology, while other information is experimentally inferable with a certain degree of accuracy.
The restoration must focus on compensating for the non-linear frequency response caused by imperfect historical recording equipment, and for rumble, needle noise, or tape hiss caused by the imperfections introduced by aging.
The restoration step can thus be carried out with a good degree of objectivity and represents an optimum level achievable by the original (analogue) recording equipment.
A legacy denoiser algorithm should [35,36]:
- Use little a priori information.
- Operate in real time.
- Be based on frequency-domain methods, such as various forms of non-causal Wiener filtering or spectral subtraction schemes (see the sketch after this list).
- Include algorithms that incorporate knowledge of the human auditory system.
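The following minimal sketch illustrates a spectral-subtraction denoiser of the kind described above. It is illustrative only: it assumes the numpy, scipy and soundfile packages, a mono working copy of the signal, and that the noise profile can be estimated from the first, speech-free frames of the recording.
```python
# Minimal spectral-subtraction sketch (illustrative only).
import numpy as np
import soundfile as sf
from scipy.signal import stft, istft

x, sr = sf.read("digitised_tape.wav")
if x.ndim > 1:
    x = x.mean(axis=1)            # work on a mono mix for the sketch

f, t, X = stft(x, fs=sr, nperseg=1024)

# Noise magnitude estimated from the first ~20 frames, assumed noise-only.
noise_mag = np.abs(X[:, :20]).mean(axis=1, keepdims=True)
mag = np.maximum(np.abs(X) - noise_mag, 0.05 * np.abs(X))   # subtract, with floor
X_clean = mag * np.exp(1j * np.angle(X))

_, x_clean = istft(X_clean, fs=sr, nperseg=1024)
sf.write("preservation_audio.wav", x_clean, sr)
```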
To Respondents
The CfT does not call for the technologies used by this AIM. However, respondents’ comments on the text above are welcome.
4.4 Enhanced Audioconference Experience
4.4.1 Reference architecture
This Use Case is implemented as in Figure 6.
Figure 6 – Enhanced Audioconference Experience
4.4.2 AI Modules
The AIMs required by the Enhanced Audioconference Experience are given in Table 6
Table 6 – AIMs of Enhanced Audioconference Experience
AIM | Function |
Speech detection and separation | Separates relevant Speech vs non-speech signals |
Noise cancellation | Removes noise in Speech signal |
Output dynamic noise cancellation | Reduces noise level based on Output Device Acoustic Model |
Delivery | Wraps De-noised Speech signal for Transport |
Output Device Acoustic Model KB | Contains the identifiers of the output devices of each manufacturer and their calibration test results |
4.4.3 I/O interfaces of AI Modules
The I/O data of Enhanced Audioconference Experience AIMs are given in Table 7.
Table 7 – I/O data of Enhanced Audioconference Experience AIMs
AIM | Input Data | Output Data |
Speech detection and separation | Microphone Sound, Geometry Information | Digital Speech, Geometry Information |
Noise cancellation | Digital Speech, Geometry Information | De-noised Speech |
Output dynamic noise cancellation | De-noised Speech | Equalised Speech |
Delivery | Equalised Speech, Transport info | Equalised Speech |
Output Device Acoustic Model KB | Query | Response |
4.4.4 Technologies and Functional Requirements
4.4.4.1 Digital Speech
Speech should be sampled at a frequency between 8 kHz and 96 kHz and the samples should be represented with at least 16 and at most 24 bits/sample (both linear).
To Respondents
Respondents are invited to comment on these two choices.
4.4.4.2 Microphone geometry information
Microphone geometry information is a descriptive representation of the relative positioning of one or more microphones. It describes the physical characteristics of the microphones, such as type, positioning and angle, their relative positions, and the overall configuration, such as Array Type. It allows a signal free of noise and distortion to be accurately reproduced, and noise to be better separated from signal, as required for the proper working of the EAE AIMs. Formats to represent microphone geometry information are MPEG-H 3D Audio [40] and the platform-specific (Android, Windows, Linux) JSON Descriptor APIs [41].
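For illustration only, the following non-normative sketch shows the kind of information a microphone geometry descriptor carries; the field names are assumptions and do not reproduce MPEG-H 3D Audio or any platform API.
```python
# Illustrative microphone geometry descriptor (non-normative; field names are
# assumptions and do not reproduce MPEG-H 3D Audio or any platform API).
microphone_geometry = {
    "array_type": "linear",            # e.g., linear, circular, ad-hoc
    "microphones": [
        {"id": 0, "type": "omnidirectional",
         "position_m": {"x": -0.05, "y": 0.0, "z": 0.0},
         "orientation_deg": {"azimuth": 0.0, "elevation": 0.0}},
        {"id": 1, "type": "omnidirectional",
         "position_m": {"x": 0.05, "y": 0.0, "z": 0.0},
         "orientation_deg": {"azimuth": 0.0, "elevation": 0.0}}
    ]
}
```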
To Respondents
Respondents are requested to:
- Comment about MPAI’s choice of the two formats
- Express their preference between the two formats.
- Possibly suggest alternative solutions.
4.4.4.3 Output device acoustic model metadata KB query format
The Output device acoustic model KB contains a description of the output device acoustic model, such as frequency response and per-frequency attenuation.
The Output device acoustic model KB is queried with the unique ID of a device, if available, or with a means to identify the model of, or a unique reference to, the output device being considered. The Output device acoustic model KB responds with information about the output device characteristics.
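A non-normative sketch of such a query/response exchange is given below; the device identifier scheme and field names are assumptions.
```python
# Non-normative sketch of an Output device acoustic model KB query/response.
query = {"device_id": "urn:example:headset:vendorX:model123"}   # hypothetical ID

response = {
    "device_id": "urn:example:headset:vendorX:model123",
    "frequency_response": [            # per-frequency attenuation in dB
        {"frequency_hz": 125,  "attenuation_db": -1.5},
        {"frequency_hz": 1000, "attenuation_db": 0.0},
        {"frequency_hz": 8000, "attenuation_db": -3.0}
    ]
}
```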
To Respondents
Respondents are requested to propose a query/response API satisfying the requirement that the API shall provide:
- Means to query the KB giving the device model as input to obtain the acoustic model.
- Adequate schemas to represent the Output device acoustic model using, if necessary, current representation schemes.
4.4.4.4 Delivery
Equalised Speech needs to be transported using a transport protocol most appropriate for the environment.
To Respondents
Proponents are requested to identify the transport protocols suitable for the EAE Use Case and propose an extensible way to signal which transport mechanism is intended to be used.
4.5 Audio-on-the-go
4.5.1 Reference architecture
This Use Case is implemented as in Figure 7 and in Figure 8. The two figures differ in the use of legacy DP technology vs AI technology:
- In Figure 7 Environment sound separation and Environment sound processing AIMs are implemented using legacy Data Processing technology.
- In Figure 8 the Environment sound processing AIM is implemented as a neural network.
Figure 7 – Audio-on-the-go (using external Knowledge Base)
Figure 8 – Audio-on-the-go (full AI-based solution)
4.5.2 AI Modules
The AIMs of Audio-on-the-go are given by Table 8.
Table 8 – AIMs of Audio-on-the-go
AIM | Function |
Environment sound separation | Separates the individual sounds captured from the surrounding environment |
Environment sound processing | Determines which sounds are relevant to the user |
Sound categorisation KB | Contains audio features of the sounds in the KB |
Dynamic signal equalization | Dynamically equalises sound using information from User hearing profiles KB to produce the best possible quality output |
Delivery | Wraps equalised sound for Transport |
User hearing profiles KB | A dataset of hearing profiles of target users |
4.5.3 I/O interfaces of AI Modules
The I/O data of Audio on the go AIMs are given by Table 9
Table 9 – I/O data of Audio-on-the-go AIMs
AIM | Input Data | Output Data |
Environment sound separation | Microphone Sound, Geometry info | Sound array |
Environment sound processing | Sound array, Sound categorisation | Relevant sounds, Sound features |
Dynamic signal equalization | Relevant sounds, User’s hearing profiles | Dynamically equalised sound, User ID |
Delivery | Equalised Speech, Transport info | Equalised Speech |
Sound categorisation KB | Sound features vector | Sound categorisation |
User hearing profiles KB | Query | Response |
4.5.4 Technologies and Functional Requirements
4.5.4.1 Digital Audio
Digital Audio is a stream of samples obtained by sampling audio at a frequency in the 44.1-96 kHz range with at least 16 and at most 24 bits/sample.
To Respondents
Proponents are invited to comment on this choice.
4.5.4.2 Microphone geometry information
Microphone geometry information is a descriptive representation of the relative positioning of one or more microphones. It describes the physical characteristics of the microphones, such as type, positioning and angle, their relative positions, and the overall configuration, such as Array Type. It allows a noise- and distortion-free signal to be accurately reproduced, and noise to be better separated from signal, as required for the proper working of the AOG AIMs. Formats to represent microphone geometry information are MPEG-H 3D Audio [40] and the platform-specific (Android, Windows, Linux) JSON Descriptor APIs [41].
To Respondents
Respondents are requested to:
- Express their preference between the two formats.
- Comment about MPAI’s choice of the two formats.
- Possibly suggest alternative solutions.
4.5.4.3 Sound array
The sounds identified in the Microphone sound are passed as an array of sounds, each represented by the following (see the sketch after this list):
- Sound samples.
- Encoding information (e.g., sampling frequency, bits/sample, compression method).
- Related metadata.
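A non-normative sketch of a possible Sound array entry follows; the field names and metadata are assumptions.
```python
# Illustrative Sound array entry (non-normative; field names are assumptions).
sound_array = [
    {
        "samples": "car_horn_0001.wav",               # or inline PCM samples
        "encoding": {"sampling_rate_hz": 48000, "bits_per_sample": 16,
                     "compression": "none"},
        "metadata": {"start_time_ms": 1250, "duration_ms": 800,
                     "direction_deg": 45.0}
    }
]
```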
To Respondents
Respondents are requested to propose:
- A format to package a set of environment sounds with appropriate metadata.
- An extensible identification of audio compression methods.
4.5.4.4 Sound categorisation KB query format
Sound categorisation KB contains audio features of the sounds in the KB. Sound categorisation KB is queried by providing a vector of Sound features. Sound categorisation KB responds by giving the category of the sound.
Sound features are extracted from samples of individual sounds in the Sound array for the purpose of querying the Sound categorisation KB.
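A non-normative sketch of a possible query/response exchange is given below; field names, category labels and the KB identifier are assumptions.
```python
# Non-normative sketch of a Sound categorisation KB query/response.
query = {"sound_features": [0.21, 0.05, 0.73, 0.11]}      # illustrative vector

response = {
    "categories": [                                       # N most relevant categories
        {"label": "car horn", "probability": 0.82},
        {"label": "siren", "probability": 0.09},
        {"label": "bicycle bell", "probability": 0.04}
    ],
    "kb_id": "urn:example:sound-categorisation-kb:v1"     # KB the values come from
}
```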
To Respondents
Respondents should propose a Sound categorisation query format satisfying the following requirements:
- Use an extensible set of Sound features that satisfy the following requirements:
- Be suitable for identifying a sound.
- Be suitable as input to query the Sound categorisation KB.
- Provide as output:
- The probability for the most relevant N categories.
- An indication of the Sound categorisation KB from which these values have been derived.
When assessing proposed Sound features MPAI may resort to objective testing.
4.5.4.5 Sounds categorisation
Each vector in the sound array should be accompanied by an identifier of the category it belongs to.
To Respondents
Respondents should propose an extensible classification of all types of sound of interest [42]. Support of a set of sounds classified according to a proprietary scheme should also be provided.
4.5.4.6 User Hearing Profiles KB query format
User Hearing Profiles KB contains the hearing profile of each specific user, who is properly identified (e.g., via a UUID or a third-party identity provider).
User Hearing Profiles KB is queried by giving the User hearing profile ID as input. The User hearing profiles KB responds with the specific user’s hearing profile. The User hearing profile contains the hearing attenuation for a defined number of frequency bands, or any other representation able to determine the unique individual sound perception ability [43]. There are currently at least two SDKs in this area: the MIMI SDK and the NURA SDK (both proprietary) [44].
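A non-normative sketch of a possible query/response exchange is given below; the user identifier, frequency grid and field names are assumptions.
```python
# Non-normative sketch of a User Hearing Profiles KB query/response.
query = {
    "user_id": "b1f5c2e4-0000-4000-8000-000000000000",   # e.g., a UUID
    "frequencies_hz": [250, 1000, 4000, 8000]
}

response = {
    "user_id": "b1f5c2e4-0000-4000-8000-000000000000",
    "hearing_attenuation_db": {"250": 5.0, "1000": 10.0,
                               "4000": 25.0, "8000": 40.0}
}
```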
To Respondents
Respondents should propose a query format that satisfies the following requirements:
- Input: user identity, array of frequency values
- Output: the values of the user’s sound perception ability at those frequency values
4.5.4.7 Delivery
Equalised Speech needs to be transported using a transport protocol most appropriate for the environment.
To Respondents
Proponents are requested to identify the transport protocol suitable for the AOG Use Case and propose an extensible way to signal which transport mechanism is intended to be used.
5 Potential common technologies
Table 10 introduces the acronyms representing the MPAI-CAE and MPAI-MMC Use Cases.
Table 10 – Acronyms of MPAI-CAE and MPAI-MMC Use Cases
Acronym | App. Area | Use Case |
EES | MPAI-CAE | Emotion-Enhanced Speech |
ARP | MPAI-CAE | Audio Recording Preservation |
EAE | MPAI-CAE | Enhanced Audioconference Experience |
AOG | MPAI-CAE | Audio-on-the-go |
CWE | MPAI-MMC | Conversation with emotion |
MQA | MPAI-MMC | Multimodal Question Answering |
PST | MPAI-MMC | Personalized Automatic Speech Translation |
Table 11 gives all MPAI-CAE and MPAI-MMC technologies in alphabetical order.
Please note the following acronyms:
KB | Knowledge Base |
QF | Query Format |
Table 11 – Alphabetically ordered MPAI-CAE and MPAI-MMC technologies
Notes: UC = Use Case; UCFR = Use Cases and Functional Requirements document number; Section = Section of the above document; Technology = name of technology.
UC | UCFR | Section | Technology |
EAE | N151 | 4.4.4.4 | Delivery |
AOG | N151 | 4.5.4.7 | Delivery |
CWE | N153 | 4.2.4.9 | Dialog KB query format |
ARP | N151 | 4.3.4.1 | Digital Audio |
AOG | N151 | 4.5.4.1 | Digital Audio |
ARP | N151 | 4.3.4.3 | Digital Image |
MQA | N153 | 4.3.4.3 | Digital Image |
EES | N151 | 4.2.4.1 | Digital Speech |
EAE | N151 | 4.4.4.1 | Digital Speech |
CWE | N153 | 4.2.4.2 | Digital Speech |
MQA | N153 | 4.3.4.2 | Digital Speech |
PST | N153 | 4.4.4.2 | Digital Speech |
ARP | N151 | 4.3.4.2 | Digital Video |
CWE | N153 | 4.2.4.3 | Digital Video |
EES | N151 | 4.2.4.2 | Emotion |
CWE | N153 | 4.2.4.4 | Emotion |
EES | N151 | 4.2.4.4 | Emotion descriptors |
CWE | N153 | 4.2.4.5 | Emotion KB (speech) query format |
CWE | N153 | 4.2.4.6 | Emotion KB (text) query format |
CWE | N153 | 4.2.4.7 | Emotion KB (video) query format |
EES | N151 | 4.2.4.3 | Emotion KB query format |
MQA | N153 | 4.3.4.4 | Image KB query format |
CWE | N153 | 4.2.4.11 | Input to face animation |
CWE | N153 | 4.2.4.10 | Input to speech synthesis |
MQA | N153 | 4.3.4.7 | Intention KB query format |
PST | N153 | 4.4.4.4 | Language identification |
CWE | N153 | 4.2.4.8 | Meaning |
MQA | N153 | 4.3.4.6 | Meaning |
EAE | N151 | 4.4.4.2 | Microphone geometry information |
AOG | N151 | 4.5.4.2 | Microphone geometry information |
MQA | N153 | 4.3.4.5 | Object identifier |
MQA | N153 | 4.3.4.8 | Online dictionary query format |
EAE | N151 | 4.4.4.3 | Output device acoustic model metadata KB query format |
ARP | N151 | 4.3.4.6 | Packager |
AOG | N151 | 4.5.4.3 | Sound array |
AOG | N151 | 4.5.4.4 | Sound categorisation KB query format |
AOG | N151 | 4.5.4.5 | Sounds categorisation |
PST | N153 | 4.4.4.3 | Speech features |
ARP | N151 | 4.3.4.4 | Tape irregularity KB query format |
ARP | N151 | 4.3.4.5 | Text |
CWE | N153 | 4.2.4.1 | Text |
MQA | N153 | 4.3.4.1 | Text |
PST | N153 | 4.4.4.1 | Text |
PST | N153 | 4.4.4.5 | Translation results |
AOG | N151 | 4.5.4.6 | User Hearing Profiles KB query format |
The following technologies are shared or shareable across Use Cases:
- Delivery
- Digital speech
- Digital audio
- Digital image
- Digital video
- Emotion
- Meaning
- Microphone geometry information
- Text
Image features apply to different visual objects. The Speech features of all Use Cases are different.
However, respondents should consider the possibility of proposing a unified set of Speech features, e.g., as proposed in [45].
6 Terminology
Table 12 identifies and defines the terms used in the MPAI-CAE context.
Table 12 – MPAI-CAE terms
Term | Definition |
Access | Static or slowly changing data that are required by an application such as domain knowledge data, data models, etc. |
AI Framework (AIF) | The environment where AIM-based workflows are executed |
AI Module (AIM) | The basic processing elements receiving processing specific inputs and producing processing specific outputs |
Audio enhancement | An AIM that produces Preservation audio using an internal denoiser |
Communication | The infrastructure that connects the Components of an AIF |
Data Processing (DP) | A legacy technology that may be used to implement AIMs |
Delivery | An AIM that wraps data for transport |
Digital Speech | Digitised speech as specified by MPAI |
Dynamic Signal Equalization | An AIM that dynamically equalises the sound using information from the User hearing profiles KB |
Emotion | A digital attribute that indicates an emotion out of a finite set of Emotions |
Emotion Descriptor | A set of time-domain and frequency-domain features capable of rendering a particular emotion, starting from emotion-less digital speech |
Emotion inserter | A module to set time-domain and frequency-domain features of a neutral speech in order to insert a particular emotional intention. |
Emotion KB | A speech dataset rich in expressiveness |
Emotion KB query format | A dataset of time-domain and frequency-domain neutral speech features |
Environment Sound Processing | An AIM that determines which sounds are relevant for the user vs sounds which are not |
Environment Sounds Recognition | An AIM that recognises, separates and categorises sounds captured from the environment |
Execution | The environment in which AIM workflows are executed. It receives external inputs and produces the requested outputs both of which are application specific |
Frequency-domain Features | Properties (descriptors) of the signal with respect to frequency |
Emotion Grade | The intensity of an Emotion |
Knowledge Base | Structured and unstructured information made accessible to AIMs (especially DP-based ones) |
Management and Control | Manages and controls the AIMs in the AIF, so that they execute in the correct order and at the time when they are needed |
Musicological classifier | Algorithm that sorts unlabelled images from Digital Video into (relevant) labelled categories of information, linking them with text describing the images. |
Noise cancellation | An AIM that removes noise in Speech signal |
Output Device Acoustic Model KB | A dataset of calibration test results for all output devices of a given manufacturer identified by their ID |
Output dynamic noise cancellation | An AIM that reduces noise level based on Output Device Acoustic Model |
Packager | An AIM that packages audio, video, images and text in a file |
Relevant vs non-relevant sound KB | A dataset of audio features of relevant sounds |
Sound categorisation KB | Contains audio features of the sounds in the KB |
Speech analysis | The AIM that extracts Emotion descriptors |
Speech analysis | The AIM that understands the emotion embedded in speech |
Speech analysis | The AIM that extracts the characteristics of the speaker (e.g., physiology and intention) |
Speech and Emotion File Format | A file format that contains Digital speech and time-stamped Emotions related to speech |
Speech detection and separation | AIM that separates relevant Speech vs non-speech signals |
Speech Features | Speech features used to extract Emotion descriptors |
Storage | Storage used to e.g., store the inputs and outputs of the individual AIMs, data from the AIM’s state and intermediary results, shared data among AIMs |
Tape irregularity KB | Dataset that includes examples of the different irregularities that may be present in the carrier (analogue tape, phonographic discs) considered |
Text | Characters drawn from a finite alphabet |
Time-domain features | Properties (descriptors) of the signal with respect to time |
User hearing profiles KB | A dataset of hearing profiles of target users |
7 References
- MPAI-AIF Use Cases and Functional Requirements, N74; https://mpai.community/standards/mpai-aif/#Requirements
- MPAI-AIF Call for Technologies, N100; https://mpai.community/standards/mpai-aif/#Technologies
- MPAI-CAE Use Cases and Functional Requirements, N151; https://mpai.community/standards/mpai-cae/#UCFR
- MPAI-CAE Call for Technologies, N152; https://mpai.community/standards/mpai-cae/#Technologies
- MPAI-MMC Use Cases and Functional Requirements, N153; https://mpai.community/standards/mpai-mmc/#Requirements
- MPAI-MMC Call for Technologies, N154; https://mpai.community/standards/mpai-mmc/#Technologies
- F. Burkhardt and N. Campbell, “Emotional speech synthesis,” in The Oxford Handbook of Affective Computing, Oxford University Press, New York, 2014, p. 286
- Noé Tits, A Methodology for Controlling the Emotional Expressiveness in Synthetic Speech – a Deep Learning approach, 8th International Conference on Affective Computing and Intelligent Interaction Workshops and Demos (ACIIW), September 2019, DOI: 10.1109/ACIIW.2019.8925241
- T. W. Adorno, Philosophy of New Music, University of Minnesota Press, Minneapolis, Minn, USA, 2006
- Ekman, P. (1999). Basic Emotions. In T. Dalgleish and T. Power (Eds.) The Handbook of Cognition and Emotion Pp. 45–60. Sussex, U.K.: John Wiley & Sons, Ltd.
- Plutchik R., Emotion: a psychoevolutionary synthesis, New York Harper and Row, 1980
- Russell, James (1980). “A circumplex model of affect”. Journal of Personality and Social Psychology. 39 (6): 1161–1178. doi:10.1037/h0077714
- Cahn, J. E., The Generation of Affect in Synthesized Speech, Journal of the American Voice I/O Society, 8, July 1990, p. 1-19
- https://www.w3.org/TR/2014/REC-emotionml-20140522/
- Cahn, J. E., The Generation of Affect in Synthesized Speech, Journal of the American Voice I/O Society, 8, July 1990, p. 1-19
- Burkhardt, F., & Sendlmeier, W. F., Verification of Acoustical Correlates of Emotional Speech using Formant-Synthesis, ISCA Workshop on Speech & Emotion, Northern Ireland 2000, p. 151-156.
- Scherer, K. R., Ladd, D. R., & Silverman, K., Vocal cues to speaker affect: Testing two models, Journal of the Acoustic Society of America, 76(5), 1984, p. 1346-1356
- Kasuya, H., Maekawa, K., & Kiritani, S., Joint Estimation of Voice Source and Vocal Tract Parameters as Applied to the Study of Voice Source Dynamics, ICPhS 99, p. 2505-2512
- S. R. Livingstone and F. A. Russo, “The Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS): A dynamic, multimodal set of facial and vocal expressions in North American English,” PLOS ONE, vol. 13, no. 5, pp. 1–35, 2018
- H. Cao, D. G. Cooper, M. K. Keutmann, R. C. Gur, A. Nenkova, and R. Verma, “CREMA-D: Crowd-sourced emotional multimodal actors dataset,” IEEE Transactions on Affective Computing, vol. 5, no. 4, pp. 377–390, 2014
- T. Bänziger, M. Mortillaro, and K. R. Scherer, “Introducing the Geneva Multimodal Expression Corpus for experimental research on emotion perception,” Emotion, vol. 12, no. 5, p. 1161, 2012
- F. Burkhardt, A. Paeschke, M. Rolfes, W. F. Sendlmeier, and B. Weiss, “A database of German emotional speech,” in Ninth European Conference on Speech Communication and Technology, 2005
- Mozziconacci, S. J. L., Speech Variability and Emotion: Production and Perception, PhD Thesis, Technical University Eindhoven, 1998
- Burkhardt, F., & Sendlmeier, W. F., Verification of Acoustical Correlates of Emotional Speech using Formant-Synthesis, ISCA Workshop on Speech & Emotion, Northern Ireland 2000, p. 151-156.
- Cahn, J. E., The Generation of Affect in Synthesized Speech, Journal of the American Voice I/O Society, 8, July 1990, p. 1-19
- Hamed Beyramienanlou, Nasser Lotfivand, “An Efficient Teager Energy Operator-Based Automated QRS Complex Detection”, Journal of Healthcare Engineering, vol. 2018, Article ID 8360475, 11 pages, 2018. https://doi.org/10.1155/2018/8360475
- Davis S B. Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences. IEEE Trans. Acoust. Speech Signal Process. 1980, 28(4):65-74
- Giovanni Costantini, Iacopo Iaderola, Andrea Paoloni, Massimiliano Todisco, EMOVO Corpus: an Italian Emotional Speech Database, Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC’14), Reykjavik, Iceland, pp. 3501–3504, May 2014
- Moataz El Ayadi, Mohamed S. Kamel, Fakhri Karray, Survey on speech emotion recognition: Features, classification schemes, and databases, Pattern Recognition, Elsevier, 44 (2011) 572–587
- IASA-TC 05: Handling and Storage of Audio and Video Carriers. IASA Technical Committee (2014)
- Hamed Beyramienanlou, Nasser Lotfivand, “An Efficient Teager Energy Operator-Based Automated QRS Complex Detection”, Journal of Healthcare Engineering, vol. 2018, Article ID 8360475, 11 pages, 2018. https://doi.org/10.1155/2018/8360475
- ISO/IEC 10918-1:1994 Information Technology — Digital Compression And Coding Of Continuous-Tone Still Images: Requirements And Guidelines
- Federica Bressan and Sergio Canazza, A Systemic Approach to the Preservation of Audio Documents: Methodology and Software Tools, Journal of Electrical and Computer Engineering, 2013. https://doi.org/10.1155/2013/489515
- Boston, Safeguarding the Documentary Heritage. A Guide to Standards, Recommended Practices and Reference Literature Related to the Preservation of Documents of All Kinds, UNESCO, Paris, France, 1988.
- Canazza. The digital curation of ethnic music audio archives: from preservation to restoration. International Journal of Digital Libraries, 12(2-3):121–135, 2012
- J. Godsill and P.J.W. Rayner. Digital Audio Restoration – a statistical model-based approach (Berlin: Springer-Verlag 1998)
- Pretto, Niccolò; Fantozzi, Carlo; Micheloni, Edoardo; Burini, Valentina; Canazza Targon, Sergio. Computing Methodologies Supporting the Preservation of Electroacoustic Music from Analog Magnetic Tape. In Computer Music Journal, 2018, vol. 42 (4), pp.59-74
- Fantozzi, Carlo; Bressan, Federica; Pretto, Niccolò; Canazza, Sergio. Tape music archives: from preservation to access. pp.233-249. In International Journal On Digital Libraries, pp. 1432-5012 vol. 18 (3), 2017. DOI:10.1007/s00799-017-0208-8
- ISO/IEC 10646:2003 Information Technology — Universal Multiple-Octet Coded Character Set (UCS)
- https://www.iis.fraunhofer.de/en/ff/amm/broadcast-streaming/mpegh.html
- https://docs.microsoft.com/bs-cyrl-ba/azure/cognitive-services/speech-service/how-to-devices-microphone-array-configuration
- https://www.frontiersin.org/articles/10.3389/fpsyg.2018.01277/full
- https://help.nuraphone.com/hc/en-us/articles/360000324676-Your-Profile
- https://integrate.mimi.io/documentation/android/4.0.1/documentation
- Problem Agnostic Speech Encoder; https://github.com/santi-pdp/pase