Audio Scene Description (CAE-ASD)

The Audio Scene Description Composite AIM is specified in the following six sections.

Audio Scene Description (CAE-ASD):

Receives the Audio Scene composed of:
- Microphone Array Geometry.
- Multichannel Audio, i.e., the output of the Microphone Array.
Separates Audio Objects in the scene.
Produces Audio Scene Descriptors.

Figure 1 depicts the Reference Model of CAE-ASD.

Figure 1 – Reference Model of Audio Scene Description Composite AIM

Table 1 gives the Input/Output data of Audio Scene Description.

Table 1 – I/O data of Audio Scene Description

Input data	Comment
Microphone Array Geometry	The description of the spatial microphone arrangement.
Multichannel Audio	The Audio output of the Microphone Array.
Output data	Comment
Audio Scene Descriptors	The Descriptors of the Audio Scene.

Table 2 gives the list of the AIMs with their functions.

Table 2 – AI Modules of Audio Scene Description

AIM	Function
Audio Analysis Transform	Receives Multichannel Audio from the Microphone Array. Transforms Multichannel Audio into frequency bands via a Fast Fourier Transform (FFT); the subsequent operations are carried out in discrete frequency bands. When this FFT-based configuration is used, a 50% overlap between successive audio blocks must be employed. The output is a data structure comprising complex-valued audio samples in the frequency domain. Produces Transform Multichannel Audio.
Audio Source Localisation	Receives Transform Multichannel Audio and Microphone Array Geometry. Detects the Audio Objects in the Audio Scene with their Spatial Attitudes. Produces the Spatial Attitudes of the Audio Objects.
Audio Separation and Enhancement	Receives Microphone Array Geometry, Transform Multichannel Audio and Spatial Attitudes. Separates the Audio Objects by using their Spatial Attitudes. Produces Transform Enhanced Audio and Audio Scene Geometry.
Audio Synthesis Transform	Receives Transform Enhanced Audio. Transforms the Transform Enhanced Audio into the time domain via an Inverse Fast Fourier Transform (IFFT). Produces Enhanced Audio.
Audio Descriptor Multiplexing	Receives Enhanced Audio, Microphone Array Geometry, and Audio Scene Geometry. Multiplexes the Enhanced Audio and the Audio Scene Geometry. Produces Audio Scene Descriptors.
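
The block-wise FFT with 50% overlap performed by the Audio Analysis Transform, and the IFFT performed by the Audio Synthesis Transform, can be illustrated with a minimal sketch. This is not the normative implementation: the block size of 8 samples, the periodic Hann window, the overlap-add reconstruction, and the function names `analysis_transform` and `synthesis_transform` are assumptions made for illustration only. The text above specifies only the FFT, the 50% overlap, and the complex-valued frequency-domain output.

```python
import cmath
import math


def fft(x):
    """Recursive radix-2 FFT. len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return [complex(x[0])]
    even = fft(x[0::2])
    odd = fft(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out


def ifft(spectrum):
    """Inverse FFT via conjugation: ifft(X) = conj(fft(conj(X))) / n."""
    n = len(spectrum)
    y = fft([v.conjugate() for v in spectrum])
    return [v.conjugate() / n for v in y]


def hann(n):
    """Periodic Hann window (assumed): copies shifted by n/2 sum to one."""
    return [0.5 - 0.5 * math.cos(2 * math.pi * i / n) for i in range(n)]


def analysis_transform(channels, block=8):
    """Multichannel Audio -> complex samples indexed [channel][block][bin]."""
    hop = block // 2  # 50% overlap between successive blocks
    win = hann(block)
    transformed = []
    for samples in channels:
        frames = []
        for start in range(0, len(samples) - block + 1, hop):
            frames.append(fft([samples[start + i] * win[i] for i in range(block)]))
        transformed.append(frames)
    return transformed


def synthesis_transform(frames, block=8):
    """One channel of frequency-domain blocks -> time-domain samples."""
    hop = block // 2
    out = [0.0] * (hop * (len(frames) - 1) + block)
    for b, spectrum in enumerate(frames):
        segment = ifft(spectrum)
        for i in range(block):
            out[b * hop + i] += segment[i].real  # overlap-add
    return out
```

With these assumptions, every sample covered by two overlapping blocks is reconstructed exactly (up to rounding) when no processing is applied between the two transforms. The first and last half-blocks are covered by a single window and are therefore attenuated.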

Table 3 – AIMs of Audio Scene Description and their data

AIM	Input Data	Output Data
Audio Analysis Transform	Multichannel Audio	Transform Multichannel Audio
Audio Source Localisation	Transform Multichannel Audio, Microphone Array Geometry	Audio Spatial Attitudes
Audio Separation and Enhancement	Audio Spatial Attitudes, Transform Multichannel Audio, Microphone Array Geometry	Transform Enhanced Audio, Audio Scene Geometry
Audio Synthesis Transform	Transform Enhanced Audio	Enhanced Audio
Audio Descriptor Multiplexing	Enhanced Audio, Audio Scene Geometry, Microphone Array Geometry	Audio Scene Descriptors
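
The data dependencies in Table 3 can be checked mechanically. The sketch below transcribes the table into a Python list and derives an order in which the AIMs can run, starting from the two inputs of the Composite AIM. The names `AIMS`, `COMPOSITE_INPUTS`, `COMPOSITE_OUTPUTS` and `execution_order` are hypothetical helpers introduced here, not part of the specification.

```python
# (AIM, input data, output data), transcribed from Table 3.
AIMS = [
    ("Audio Analysis Transform",
     ["Multichannel Audio"],
     ["Transform Multichannel Audio"]),
    ("Audio Source Localisation",
     ["Transform Multichannel Audio", "Microphone Array Geometry"],
     ["Audio Spatial Attitudes"]),
    ("Audio Separation and Enhancement",
     ["Audio Spatial Attitudes", "Transform Multichannel Audio",
      "Microphone Array Geometry"],
     ["Transform Enhanced Audio", "Audio Scene Geometry"]),
    ("Audio Synthesis Transform",
     ["Transform Enhanced Audio"],
     ["Enhanced Audio"]),
    ("Audio Descriptor Multiplexing",
     ["Enhanced Audio", "Audio Scene Geometry", "Microphone Array Geometry"],
     ["Audio Scene Descriptors"]),
]

# Inputs and outputs of the Composite AIM, from Table 1.
COMPOSITE_INPUTS = {"Microphone Array Geometry", "Multichannel Audio"}
COMPOSITE_OUTPUTS = {"Audio Scene Descriptors"}


def execution_order(aims, external_inputs):
    """Return (order, available data) such that each AIM runs only after
    all of its inputs have been produced or supplied externally."""
    available = set(external_inputs)
    pending = list(aims)
    order = []
    while pending:
        ready = [a for a in pending if set(a[1]) <= available]
        if not ready:
            missing = {d for a in pending for d in a[1]} - available
            raise ValueError(f"unsatisfied inputs: {sorted(missing)}")
        for name, _inputs, outputs in ready:
            order.append(name)
            available.update(outputs)
        pending = [a for a in pending if a not in ready]
    return order, available


order, available = execution_order(AIMS, COMPOSITE_INPUTS)
for step, name in enumerate(order, start=1):
    print(step, name)
```

Under this reading of Table 3, the five AIMs form a single chain in the order in which the table lists them, and the Audio Scene Descriptors become available only after the last AIM has run.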

Table 4 – AIM and JSON Metadata

AIW	AIMs		Names	JSON
CAE-ASD			Audio Scene Description	X
	–	CAE-AAT	Audio Analysis Transform	X
	–	CAE-ASL	Audio Source Localisation	X
	–	CAE-ASE	Audio Separation and Enhancement	X
	–	CAE-AST	Audio Synthesis Transform	X
	–	CAE-AMX	Audio Descriptor Multiplexing	X
