The Audio Scene Description Composite AIM is specified in the following six sections.

1        Functions of Audio Scene Description

2        Reference Model of Audio Scene Description

3        I/O Data of Audio Scene Description

4        Functions of AI Modules of Audio Scene Description

5        I/O Data of AI Modules of Audio Scene Description

6        Specification of Audio Scene Description AIMs and JSON Metadata

1       Functions of Audio Scene Description

Audio Scene Description (CAE-ASD):

  1. Receives the Audio Scene composed of:
    • Microphone Array Geometry.
    • Multichannel Audio, i.e., the output of the Microphone Array.
  2. Separates Audio Objects in the scene.
  3. Produces Audio Scene Descriptors.

2       Reference Model of Audio Scene Description

Figure 1 depicts the Reference Model of CAE-ASD.

Figure 1 – Reference Model of Audio Scene Description Composite AIM

3       I/O Data of Audio Scene Description

Table 1 gives the Input/Output data of Audio Scene Description.

Table 1 – I/O data of Audio Scene Description

Input data | Comment
Microphone Array Geometry | The description of the spatial microphone arrangement.
Multichannel Audio | The Audio output of the Microphone Array.
Output data | Comment
Audio Scene Descriptors | The Descriptors of the Audio Scene.
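The two inputs in Table 1 can be pictured as simple containers. The sketch below is illustrative only: the class names, the field names (`positions`, `channels`, `sample_rate_hz`) and the rule of one channel per microphone are assumptions made for this example, not the data formats defined by the specification.

```python
from dataclasses import dataclass


@dataclass
class MicrophoneArrayGeometry:
    # Hypothetical layout: one (x, y, z) position in metres per microphone.
    positions: list


@dataclass
class MultichannelAudio:
    # Hypothetical layout: one list of samples per microphone channel.
    channels: list
    sample_rate_hz: int


def check_inputs(geometry, audio):
    """Reject inputs whose channel count does not match the microphone count."""
    # Assumption: each microphone in the array contributes exactly one channel.
    if len(audio.channels) != len(geometry.positions):
        raise ValueError("one audio channel is expected per microphone")
    if len({len(ch) for ch in audio.channels}) > 1:
        raise ValueError("all channels must have the same number of samples")
    return True


geometry = MicrophoneArrayGeometry(positions=[(0.0, 0.0, 0.0), (0.05, 0.0, 0.0)])
audio = MultichannelAudio(channels=[[0.0, 0.1], [0.0, 0.2]], sample_rate_hz=48000)
inputs_ok = check_inputs(geometry, audio)
```

A consistency check of this kind would sit in front of the first AIM, so that a mismatch between geometry and audio is reported before any processing starts.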

4       Functions of AI Modules of Audio Scene Description

Table 2 gives the list of the AIMs with their functions.

Table 2 – AI Modules of Audio Scene Description

AIM Function
Audio Analysis Transform
  1. Receives Multichannel Audio from Microphone Array.
  2. Transforms Multichannel Audio into frequency bands via a Fast Fourier Transform (FFT). The subsequent operations are carried out in these discrete frequency bands. With this configuration, successive audio blocks must overlap by 50%. The output is a data structure of complex-valued audio samples in the frequency domain.
  3. Produces Transform Multichannel Audio.
Audio Source Localisation
  1. Receives Transform Multichannel Audio and Microphone Array Geometry.
  2. Detects the Audio Objects in the Audio Scene with their Spatial Attitudes.
  3. Produces the Spatial Attitudes of the Audio Objects.
Audio Separation and Enhancement
  1. Receives Microphone Array Geometry, Transform Multichannel Audio and Spatial Attitudes.
  2. Separates the Audio Objects by using their Spatial Attitudes.
  3. Produces Transform Enhanced Audio and Audio Scene Geometry.
Audio Synthesis Transform
  1. Receives Transform Enhanced Audio.
  2. Transforms the Transform Enhanced Audio into the time domain via an Inverse Fast Fourier Transform (IFFT).
  3. Produces Enhanced Audio.
Audio Descriptor Multiplexing
  1. Receives Enhanced Audio, Microphone Array Geometry, and Audio Scene Geometry.
  2. Multiplexes the Enhanced Audio and the Audio Scene Geometry.
  3. Produces Audio Scene Descriptors.
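The Audio Analysis Transform and Audio Synthesis Transform steps in Table 2 can be sketched as a block-wise FFT with 50% overlap followed by an IFFT with overlap-add. This is a minimal illustration under stated assumptions, not the normative procedure: the block size of 8, the square-root Hann window, the helper names and the test signals are all chosen for this example, and the specification text above only requires the FFT, the 50% overlap and the IFFT.

```python
import cmath
import math


def fft(x, inverse=False):
    """Recursive radix-2 FFT; len(x) must be a power of two (unscaled)."""
    n = len(x)
    if n == 1:
        return list(x)
    sign = 1 if inverse else -1
    even = fft(x[0::2], inverse)
    odd = fft(x[1::2], inverse)
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(sign * 2j * math.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out


def sqrt_hann(n):
    """Periodic square-root Hann window; its square sums to 1 at 50% overlap."""
    return [math.sqrt(0.5 - 0.5 * math.cos(2 * math.pi * i / n)) for i in range(n)]


def analysis(channel, block=8):
    """Split one channel into 50%-overlapping windowed blocks and FFT each."""
    hop = block // 2
    win = sqrt_hann(block)
    frames = []
    for start in range(0, len(channel) - block + 1, hop):
        seg = [channel[start + i] * win[i] for i in range(block)]
        frames.append(fft(seg))
    return frames


def synthesis(frames, block=8):
    """IFFT each block, window again, and overlap-add back to the time domain."""
    hop = block // 2
    win = sqrt_hann(block)
    out = [0.0] * (hop * (len(frames) - 1) + block)
    for m, spec in enumerate(frames):
        seg = fft(spec, inverse=True)
        for i in range(block):
            # Divide by the block length to undo the unscaled forward FFT.
            out[m * hop + i] += (seg[i].real / block) * win[i]
    return out


# Two hypothetical microphone channels of 32 samples each.
multichannel = [
    [math.sin(0.3 * n) for n in range(32)],
    [math.cos(0.2 * n) for n in range(32)],
]
transformed = [analysis(ch) for ch in multichannel]  # complex-valued blocks
restored = [synthesis(fr) for fr in transformed]     # back in the time domain
```

With this window choice, every sample covered by two overlapping blocks is recovered after synthesis. The first and last half block are covered by a single window only, so a practical implementation would pad the signal at both ends.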

5       I/O Data of AI Modules of Audio Scene Description

Table 3 – AI Modules of Audio Scene Description and their data

AIM | Input Data | Output Data
Audio Analysis Transform | Multichannel Audio | Transform Multichannel Audio
Audio Source Localisation | Transform Multichannel Audio; Microphone Array Geometry | Audio Spatial Attitudes
Audio Separation and Enhancement | Audio Spatial Attitudes; Transform Multichannel Audio; Microphone Array Geometry | Transform Enhanced Audio; Audio Scene Geometry
Audio Synthesis Transform | Transform Enhanced Audio | Enhanced Audio
Audio Descriptor Multiplexing | Enhanced Audio; Audio Scene Geometry; Microphone Array Geometry | Audio Scene Descriptors
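The data dependencies in Table 3 determine the order in which the AIMs can run. The sketch below transcribes the table into a dictionary and derives an execution order from it. The `AIMS` structure and the `execution_order` helper are hypothetical names introduced for this illustration, and the data names follow the table as written.

```python
# Inputs and outputs of each AIM, transcribed from Table 3.
AIMS = {
    "Audio Analysis Transform": (
        ["Multichannel Audio"],
        ["Transform Multichannel Audio"],
    ),
    "Audio Source Localisation": (
        ["Transform Multichannel Audio", "Microphone Array Geometry"],
        ["Audio Spatial Attitudes"],
    ),
    "Audio Separation and Enhancement": (
        ["Audio Spatial Attitudes", "Transform Multichannel Audio",
         "Microphone Array Geometry"],
        ["Transform Enhanced Audio", "Audio Scene Geometry"],
    ),
    "Audio Synthesis Transform": (
        ["Transform Enhanced Audio"],
        ["Enhanced Audio"],
    ),
    "Audio Descriptor Multiplexing": (
        ["Enhanced Audio", "Audio Scene Geometry", "Microphone Array Geometry"],
        ["Audio Scene Descriptors"],
    ),
}


def execution_order(aims, available):
    """Return AIM names in an order where every input is already available."""
    available = set(available)
    pending = dict(aims)
    order = []
    while pending:
        # An AIM is ready once all of its inputs have been produced or supplied.
        ready = [name for name, (ins, _) in pending.items()
                 if set(ins) <= available]
        if not ready:
            raise ValueError(f"unsatisfied inputs for: {sorted(pending)}")
        for name in ready:
            available.update(pending.pop(name)[1])
            order.append(name)
    return order


# The composite AIM is fed with the two inputs listed in Table 1.
order = execution_order(AIMS, ["Microphone Array Geometry", "Multichannel Audio"])
```

Starting from the two inputs of the composite AIM, the helper resolves the AIMs in the same sequence in which Table 3 lists them, and it raises an error if any AIM depends on data that nothing produces.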

6       Specification of Audio Scene Description AIMs and JSON Metadata

Table 4 – AIM and JSON Metadata

AIW | AIMs | Names | JSON
CAE-ASD | | Audio Scene Description | X
 | CAE-AAT | Audio Analysis Transform | X
 | CAE-ASL | Audio Source Localisation | X
 | CAE-ASE | Audio Separation and Enhancement | X
 | CAE-AST | Audio Synthesis Transform | X
 | CAE-AMX | Audio Descriptor Multiplexing | X