8.1 Predictive maintenance
8.2 Music production and artistic industries
8.2.1 Post-production
8.2.2 Audio effects
8.2.3 Assisted composition
8.3 Immersive audio experience
8.3.1 3D and immersive audio
8.3.2 Object-based audio
8.3.3 Binaural audio
8.3.4 Virtual reality
8.3.5 Rendering immersive audio
8.4 Audio preservation and preparing for the (AI) future
8.5 Possible risks to plan for: Audio AI needs high quality data

We are surrounded by sound. The perception of different sounds is an important part of the human experience and can trigger a wide range of reactions. Think of the human response to a dangerous sound such as a lion’s roar in the wild, to the noise of a fast truck coming our way on a busy road, or to the soothing tone of a gentle lullaby. Audio gives us an electrical representation of sounds, and sometimes we are tempted to think that modern sensors may handle sound better than humans. So far, however, they lack the ability to interpret sound the way humans do. AI algorithms can come to our rescue because they are powerful tools for recognising patterns in audio.

8.1 Predictive maintenance

Besides human speech, we can characterize a wide range of audio signals, including: (1) music; (2) environmental audio such as a car passing by, a door closing, a gunshot or a human scream; (3) machine audio originating from electronic, electrical, or mechanical machines. Both environmental and machine audio are the result of physical events of interest that can be interpreted for practical use. Did a security door latch properly when it was closed? Does the fan on a cooling unit in a nuclear power plant sound like it is about to malfunction? Does a person scream in terror? The ability to detect panic in a person’s voice or a cry for help could make the difference in an emergency. For these two classes of audio signals, AI could be integrated as a value-added feature in building technology solutions, particularly in the context of physical safety and security, e.g., by detecting and localising “interesting” events. Predictive maintenance in industrial settings is an example where AI can be impactful by augmenting existing sensing capabilities. For instance, AI could analyse a motor’s sound and predict a malfunction before it occurs by learning from subtle deviations in its noise signature. It could thus serve as an additional monitoring layer in early warning systems, which offer considerable value for industry by reducing downtime, lowering repair costs, and protecting human lives.
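The idea of learning a machine's normal noise signature and flagging deviations from it can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the synthetic "motor" frames, the naive DFT, the per-bin z-score and the alarm threshold are hypothetical stand-ins, and a deployed system would rely on real recordings and a trained model.

```python
import cmath
import math
import random
from statistics import mean, pstdev


def magnitude_spectrum(frame):
    """Magnitudes of DFT bins 0..N/2, computed naively (fine for short frames)."""
    n = len(frame)
    return [
        abs(sum(x * cmath.exp(-2j * math.pi * k * i / n) for i, x in enumerate(frame)))
        for k in range(n // 2 + 1)
    ]


def fit_baseline(healthy_frames):
    """Per-bin mean and standard deviation over frames of normal operation."""
    per_bin = list(zip(*(magnitude_spectrum(f) for f in healthy_frames)))
    return [mean(b) for b in per_bin], [pstdev(b) + 1e-9 for b in per_bin]


def anomaly_score(frame, baseline):
    """Largest per-bin deviation from the baseline, in standard deviations."""
    mu, sigma = baseline
    spectrum = magnitude_spectrum(frame)
    return max(abs(s - m) / sd for s, m, sd in zip(spectrum, mu, sigma))


def synthetic_motor_frame(rng, n=64, hum_bin=4, fault_bin=None):
    """Toy stand-in for a recording: a hum, light noise, optional fault tone."""
    frame = []
    for i in range(n):
        x = math.sin(2 * math.pi * hum_bin * i / n) + rng.gauss(0.0, 0.05)
        if fault_bin is not None:
            x += 0.3 * math.sin(2 * math.pi * fault_bin * i / n)
        frame.append(x)
    return frame


rng = random.Random(0)
baseline = fit_baseline([synthetic_motor_frame(rng) for _ in range(40)])
healthy_score = anomaly_score(synthetic_motor_frame(rng), baseline)
faulty_score = anomaly_score(synthetic_motor_frame(rng, fault_bin=11), baseline)

THRESHOLD = 10.0  # assumed alarm level; in practice tuned on real data
print("healthy frame raises alarm:", healthy_score > THRESHOLD)
print("faulty frame raises alarm:", faulty_score > THRESHOLD)
```

The baseline is fitted only on frames of normal operation, so the score measures how far a new frame departs from what the machine usually sounds like. No labelled fault examples are needed.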

There are audio AI use cases in healthcare as well. The human body generates sounds that carry clinically relevant information. AI could contribute to data-driven, real-time healthcare decisions, raising an alarm when a person requires immediate assistance, for example in elderly care or in hospitals.

8.2 Music production and artistic industries

In general, AI is having a transformative effect on a broad array of industries, including the music production and artistic industries, which are drawn to AI as an aid to the creative process. It is worth underscoring that AI does not replace humans; rather, it provides tools that render complex processes more intuitive and reduce the human time spent on tedious, uncreative tasks. For example, in the music recording, production, and mastering industry, there are at least three main areas where AI demonstrates its impact: assisted mastering, assisted mixing, and assisted composition. Today, one of the main goals of modern mastering is to make the listening experience consistent across all formats and platforms (from low-bit-rate MP3 files rendered by inexpensive earphones to high-resolution audio reproduced with sophisticated sound systems), with a wide range of loudness constraints. All these options make mastering extremely challenging and potentially costly. AI is proving to be a viable and egalitarian choice for many musicians. By analysing data and learning from previous tracks, AI-powered tools for assisted mastering enable musicians with a small budget to easily achieve professional-level results (albeit, ultimately, without the finesse of a human expert).


8.2.1 Post-production

With so much content being created for Over-The-Top (OTT) services such as Netflix and Amazon Prime, the number of audio files to work with in post-production is dramatically increasing. Facilities are therefore looking for ways to work faster and in a more cost-efficient manner when it comes to mixing audio material. AI tools can help engineers and audio teams make basic decisions and complete the more routine tasks, thereby saving valuable pre-mixing time and enabling humans to focus on the more complex and creative elements. For example, some mastering plugins contain built-in intelligence that analyses source material (such as guitars or vocals) and considers its placement in the context of the rest of the mix to suggest mixing decisions. By taking on much of the initial heavy lifting, such tools can be hugely beneficial for less experienced users.

In the commercial world, ML applications in products already exist: LANDR¹ [7], an automated audio mastering service that relies on AI to set parameters for digital audio processing and refinement; and Neutron 3, released by iZotope², an audio mixing tool that features a “track assistant” which utilizes AI to detect instruments and suggest fitting presets to the user. As a more direct application of AI to audio processing, iZotope also offers a utility for isolating dialogue in its audio restoration suite RX 9³.

8.2.2 Audio effects

Audio effects design for games and movies is another area where environmental and machine sounds are accurately recorded, catalogued, and applied. Procedural audio design offers a partial solution to this task: the process of sound recording is replaced by manually designed algorithms that generate such audio synthetically. Procedural audio design is an intermediate step in automating audio effect design, and more exciting developments can be expected from the combination of natural language processing and generative networks for AI-supported automatic sound design. A related development can be envisaged in which silent movies (e.g., Man with a Movie Camera by Dziga Vertov) are enhanced with AI-generated sound effects.

8.2.3 Assisted composition

Assisted composition is another area of music production that is quickly realizing the value of AI. More and more tools use deep learning algorithms to identify patterns in huge amounts of source material and then use the resulting insights to compose basic tunes and melodies.

8.3 Immersive audio experience

If we want to mimic and reproduce the auditory scenes we hear in real life, we utilise a set of techniques known as immersive audio. Immersive audio provides a “life-like” sound experience to end-users, different from the traditional stereo methods. This new audio experience envelops listeners and gives them the perception of being surrounded by different audio universes by simulating credible soundscapes. Disruptive AI innovations in recording, encoding, and transcoding between immersive audio formats have gained importance for a broad industry thanks to the ever-increasing capacity of communication networks. The holy grail of immersive audio has always been to create a sonic reality that supplants the listener’s real acoustic environment by providing an emulated or synthetic auditory reality that is indistinguishable from the listener’s actual reality. 3D and immersive sound, for a long time not at the forefront of multimedia applications, are now an essential part of immersive games, extended reality applications, audio-visual arts, teleconferencing applications, and advanced broadcast applications.

8.3.1 3D and immersive audio

Several key technologies for 3D and immersive audio have been proposed. These techniques can be broadly classified into synthetic and recorded 3D audio in terms of how content is created, and headphone-based and loudspeaker-based in terms of how audio is reproduced. Three approaches are especially relevant and form the basis for existing (e.g., MPEG-H 3D Audio [8]) and upcoming (e.g., MPEG-I) media coding standards: Higher-order Ambisonics (HOA), Object-Based Audio (OBA), and binaural synthesis. Audio signals in one of these representations can sometimes, but not always, be transcoded into the others. HOA involves the representation of the sound field in the spherical Fourier domain as a series of spherical harmonic functions. Apart from allowing straightforward operations such as the 3D rotation of a sound field, this representation offers a theoretical framework that makes it possible to synthesise physically (as opposed to perceptually) accurate sound fields generated by simple sources such as plane waves, point sources, or a combination thereof.

For a long time, HOA was constrained to synthetic 3D audio, where complex sound fields could be created through what is called Ambisonics panning. Although such an approach is beneficial in synthetic and virtual environments, real sound scenes, such as a live concert, are better captured by actual recordings. Special microphone arrays that can capture HOA are now commercially available and HOA recordings are becoming more commonplace. Such microphone arrays typically comprise pressure sensors on a rigid spherical baffle and require a pre-processing stage that converts the microphone signals (also known as the A-format representation) into a spherical harmonic decomposition (also known as the B-format representation). The B-format signals can then be decoded for playback over a loudspeaker rig. As such, HOA provides a large listening area and does not require tracking the listener position to reproduce an immersive audio field. HOA, by virtue of its capability to represent a sound field that is amenable to perfect reconstruction (limited by the maximum HOA order), acts as the basic format from which other approaches can be derived. For example, perceptual sound field reconstruction (PSR) signals can be derived from HOA signals. This important advantage resulted in HOA being selected as the scene-based format for MPEG-H 3D Audio.
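The lowest-order case of this representation is small enough to write out. The sketch below encodes a single plane-wave sample into first-order B-format and rotates the resulting sound field about the vertical axis. It assumes the traditional first-order convention (W scaled by 1/√2, azimuth measured counter-clockwise from the front); the function names are hypothetical, and higher orders add further spherical harmonic channels that are not shown.

```python
import math


def encode_first_order(sample, azimuth, elevation=0.0):
    """Encode one mono sample as first-order B-format (W, X, Y, Z).

    Angles are in radians. Assumes the traditional convention in which
    the omnidirectional W channel is scaled by 1/sqrt(2).
    """
    w = sample / math.sqrt(2.0)
    x = sample * math.cos(azimuth) * math.cos(elevation)
    y = sample * math.sin(azimuth) * math.cos(elevation)
    z = sample * math.sin(elevation)
    return (w, x, y, z)


def rotate_yaw(bformat, angle):
    """Rotate the whole sound field about the vertical axis by `angle` radians."""
    w, x, y, z = bformat
    return (
        w,                                          # omnidirectional: unchanged
        x * math.cos(angle) - y * math.sin(angle),
        x * math.sin(angle) + y * math.cos(angle),
        z,                                          # vertical component: unchanged
    )


source = encode_first_order(1.0, azimuth=math.radians(30), elevation=math.radians(10))
rotated = rotate_yaw(source, math.radians(45))
direct = encode_first_order(1.0, azimuth=math.radians(75), elevation=math.radians(10))
print(all(math.isclose(a, b, abs_tol=1e-12) for a, b in zip(rotated, direct)))
```

Rotating the encoded field by 45 degrees gives the same channels as encoding the source 45 degrees further round. This is why rotation is a cheap operation in this domain: it is a small matrix applied to the channels, with no need to revisit the individual sources.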

8.3.2 Object-based audio

OBA is more of a concept than a well-defined immersive audio approach. OBA involves the storage, transmission, and processing of audio sources as distinct audio objects, much as audio stems are used in audio production. The audio stems or objects can be positioned and repositioned to compose a 3D auditory scene, enhanced, faded, or eliminated if necessary, and embellished with reverberation. Such flexibility is essential in providing the listener with a fully personalised listening experience, one where the listener can redesign the reproduced acoustic scene within the design space that they are given.
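A minimal sketch can make the object idea concrete. It deliberately reduces the 3D scene to a stereo pan position and uses constant-power panning as a stand-in renderer. The `AudioObject` type, its fields and the renderer are hypothetical illustrations, not part of any OBA standard.

```python
from dataclasses import dataclass, replace
import math


@dataclass(frozen=True)
class AudioObject:
    """One audio source plus the metadata a renderer needs (hypothetical)."""
    name: str
    samples: tuple
    pan: float = 0.0   # -1.0 = hard left, +1.0 = hard right
    gain: float = 1.0  # 0.0 removes the object from the scene


def render_stereo(objects):
    """Mix objects to two channels using constant-power panning."""
    length = max((len(o.samples) for o in objects), default=0)
    left, right = [0.0] * length, [0.0] * length
    for obj in objects:
        angle = (obj.pan + 1.0) * math.pi / 4.0  # maps pan to 0..pi/2
        gain_l = obj.gain * math.cos(angle)
        gain_r = obj.gain * math.sin(angle)
        for i, s in enumerate(obj.samples):
            left[i] += gain_l * s
            right[i] += gain_r * s
    return left, right


voice = AudioObject("voice", (1.0, 0.5, -0.5), pan=-1.0)
guitar = AudioObject("guitar", (0.2, 0.2, 0.2), pan=0.0)

# The scene as authored, then a listener's personalised version:
# the voice moved to the right and the guitar removed.
original = render_stereo([voice, guitar])
personalised = render_stereo([replace(voice, pan=1.0), replace(guitar, gain=0.0)])
```

Because the audio and its metadata travel separately, personalisation only edits metadata and re-renders; the source audio itself is never modified.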

Despite its obvious advantages, OBA is not the first choice, at least today, for representing recorded sound fields. This is because OBA requires the audio sources to be available as separate audio objects, together with a definition of the reverberation characteristics of the intended acoustic scene in a representation that is either parametric or non-parametric. Extracting the audio objects and the reverberation characteristics from real recordings is not a trivial task and requires, among other things, source localization, source separation, dereverberation, and optionally the extraction of the geometry of the acoustic scene.

8.3.3 Binaural audio

Binaural audio involves the presentation of appropriate binaural cues to listeners over a pair of headphones so that they have the illusion of virtual sources in the 3D space surrounding them. The advent of mobile phones made binaural audio the de facto immersive audio approach for audio-on-the-go applications. The recent roll-out of spatial audio delivery services indicates the readiness of the market for such applications.

Binaural audio can be recorded using anthropomorphic microphones, also known as dummy head microphones, which comprise a manikin shaped like a human head with realistic ears and a microphone at the entrance of each ear. Dummy head microphones such as the Neumann KU-100 physically capture the binaural cues that are essential for the perception of sound sources in 3D. However, binaural recordings do not provide any means of interactivity, and listeners are presented with a high-quality immersive experience only if their head is stationary. When the listener moves their head, the auditory scene rotates with it, drastically reducing the realism and the immersion. This renders binaural audio recordings useless in interactive 3D audio applications unless a head-tracking mechanism is applied in conjunction with personalized head-related transfer function (HRTF) filters.

Binaural audio can also be re-synthesised using appropriate digital HRTF filters. These filters mimic the acoustic path from a predefined sound source to the ears of the listener. Each distinct sound source is processed with a pair of HRTF filters (one for the left ear and one for the right). Binaural audio synthesis should also respond to the movement of the listener’s head, which is typically achieved using hardware-based solutions called head trackers.
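In the time domain, filtering a source with a pair of HRTF filters amounts to convolving it with two impulse responses, one per ear. The sketch below shows that step only. The impulse responses are toy placeholders (a pure delay and attenuation for the far ear), not measured HRTF data, and the function names are hypothetical.

```python
def convolve(signal, impulse_response):
    """Direct-form linear convolution of two sample sequences."""
    out = [0.0] * (len(signal) + len(impulse_response) - 1)
    for i, s in enumerate(signal):
        for j, h in enumerate(impulse_response):
            out[i + j] += s * h
    return out


def binaural_render(mono, hrir_left, hrir_right):
    """Filter one mono source with a left/right impulse-response pair."""
    return convolve(mono, hrir_left), convolve(mono, hrir_right)


# Toy impulse responses for a source on the listener's left (assumed values):
# the left ear hears it immediately, the right ear later and quieter.
HRIR_LEFT = [1.0, 0.0, 0.0, 0.0]
HRIR_RIGHT = [0.0, 0.0, 0.0, 0.5]

click = [1.0, 0.0, 0.0]
left_ear, right_ear = binaural_render(click, HRIR_LEFT, HRIR_RIGHT)
```

With several sources, each is filtered with the pair for its own direction and the results are summed per ear. A head tracker would change which pair is selected as the listener turns.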

8.3.4 Virtual reality

AI can also solve previously unsolvable problems in immersive audio and greatly improve the end-user experience in games, Virtual Reality (VR), and six degrees of freedom (6DoF) navigable audio-visual content, which is applicable in many domains including entertainment, broadcast, gaming, and cultural heritage. The generation of appropriate room reverberation to improve auditory immersion is a problem that could benefit considerably from an AI-based approach, and more specifically from differentiable digital signal processing (DDSP), which combines elements of deep learning with DSP.

8.3.5 Rendering immersive audio

Often, the rendering capabilities available to end users dictate whether they can play back the available immersive audio content. In the early days of multichannel audio coding, this problem was addressed by designing coding algorithms (see for example [9]) that were backwards compatible, meaning, for example, that 5.1 multi-channel content could be downmixed and transmitted for reproduction over two channels only (i.e., the original 5.1 multichannel audio information would contain the transmitted stereo signal). When more complex representations such as HOA and binaural audio are considered, simple downmixing is not sufficient. Transcoding from HOA to binaural audio is possible and is widely used since such transcoding also provides distinct computational advantages [10]. Similarly, binaural content and synthetic HOA can be obtained from OBA representations. However, three key conversions are currently missing: binaural to OBA, binaural to HOA and, finally, HOA to OBA. Recent developments in data processing have resulted in algorithms for high-quality audio object extraction. There are also direction estimation, reverberation time estimation, and dereverberation methods that rely on AI-based approaches.

Such AI-based approaches could make it possible to create immersive audio content that can be repurposed, recomposed, and/or remixed, and could pave the way for expedited and flexible AI-based 3D audio content creation.
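For contrast with these harder conversions, the simple 5.1-to-stereo downmix mentioned earlier in this section fits in a few lines. The sketch uses a weight of 1/√2 (about −3 dB) for the centre and surround channels and drops the LFE channel. These are commonly used values, assumed here for illustration and not taken from the coding scheme in [9]; the channel labels and function name are also assumptions.

```python
import math

# Assumed weight for centre and surround channels (about -3 dB).
MIX_GAIN = 1.0 / math.sqrt(2.0)


def downmix_51_to_stereo(frame):
    """Fold one 5.1 sample frame down to a (left, right) pair.

    `frame` maps the channel labels L, R, C, LFE, Ls, Rs to sample values.
    The LFE channel is left out of the downmix in this sketch.
    """
    left = frame["L"] + MIX_GAIN * (frame["C"] + frame["Ls"])
    right = frame["R"] + MIX_GAIN * (frame["C"] + frame["Rs"])
    return left, right


# Dialogue carried only by the centre channel ends up equally in both outputs.
centre_only = {"L": 0.0, "R": 0.0, "C": 1.0, "LFE": 0.3, "Ls": 0.0, "Rs": 0.0}
stereo = downmix_51_to_stereo(centre_only)
```

A fixed weighted sum like this works because every 5.1 channel already corresponds to a loudspeaker feed. HOA and binaural signals do not have that property, which is why their conversions need decoding or source extraction instead.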

8.4 Audio preservation and preparing for the (AI) future

Preservation of audio assets recorded on a variety of media (vinyl, tapes, cassettes, etc.) is an important activity for a variety of application domains, in particular cultural heritage. Audio archives are an important part of this heritage, but they require considerable resources in terms of people, time, and funding, since preservation requires more than a “neutral” transfer of audio information from the analogue to the digital domain. In general, it is necessary to recover and preserve a lot of information in addition to the audio signals, e.g., annotations by the composer, by the technicians, etc. AI can drastically change the way we preserve, access, and add value to heritage, making its safeguarding sustainable.

The introduction of electronic and information technology into art presents new challenges for archives and for the preservation of multimedia interactive installations, an important part of contemporary art. New multimedia artworks have a complex nature that is leading to a radical upheaval in the practice of preservation. Their deep interconnection with technology takes its toll through the fast obsolescence of hardware and software, which may soon result in irreversible loss. Many installations exist only for a limited time inside an exhibition (often less than a month). A computational AI-based model for preserving new multimedia art forms could be a very interesting medium-term aim.

Because of its immaterial nature, music was one of the earliest types of art to explore the creative use of new technologies: new musical forms have assumed increasing artistic importance since the second half of the last century. In the medium term, AI could be used to design and control complex installations (networks of computers and software) by means of audio-over-IP.

8.5 Possible risks to plan for: Audio AI needs high quality data

To design robust, data-driven audio AI applications for a given audio scenario, high-quality data sets are needed to train the AI components. A good data set for supervised training must be large enough to cover the different circumstances that may occur. In addition, class imbalance should be minimal, that is, each class should contain a similar number of elements. Data sets for the preservation of audio recordings, for example, should be built from hundreds of thousands of documents obtained from several different archives.
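Class imbalance can be checked before any training starts. The sketch below counts the labels of a data set, reports the ratio between the largest and smallest class, and derives inverse-frequency weights that are sometimes used to compensate during training. The label names and counts are hypothetical, and the weighting scheme is one common choice, assumed here for illustration.

```python
from collections import Counter


def imbalance_ratio(labels):
    """Size of the largest class divided by the size of the smallest."""
    counts = Counter(labels)
    return max(counts.values()) / min(counts.values())


def inverse_frequency_weights(labels):
    """Per-class weights that would be 1.0 for every class if the set were balanced."""
    counts = Counter(labels)
    total = len(labels)
    return {cls: total / (len(counts) * n) for cls, n in counts.items()}


# Hypothetical label distribution for an environmental-audio data set.
labels = ["speech"] * 900 + ["music"] * 90 + ["alarm"] * 10

ratio = imbalance_ratio(labels)
weights = inverse_frequency_weights(labels)
print(f"imbalance ratio: {ratio:.0f}")
for cls, weight in sorted(weights.items()):
    print(f"{cls}: weight {weight:.2f}")
```

Weights of this kind only rebalance the loss. They do not add the variety that a rare class is missing, so collecting more examples of that class remains the preferable remedy.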

Some use cases require data sets that comprise audio, visual and textual content. For example, the sonic “inpainting” of a silent movie could benefit from visual analytics and codified knowledge that characterises typical sounding objects identified from the movie.

Other critical factors include well-defined performance metrics and testing procedures. MPAI addresses conformance and performance attributes and rules both at the level of performance specifications and at the level of the organization of its ecosystem governance. While only a few high-level examples are described in this section, MPAI delivers AI-based data coding standards looking at the full spectrum of applications, as will become more apparent in later chapters.

2 iZotope, https://www.izotope.com/en/products/neutron.html

3 iZotope, https://www.izotope.com/en/products/rx.html
