15.1 Emotion
15.2 Intention
15.3 Meaning
15.4 Speech features
15.5 Microphone array geometry
15.6 Audio scene geometry

15.1 Emotion

Emotion descriptors and examples are widely used in current MPAI standards. For example, the Emotion-Enhanced Speech Use Case of the MPAI-CAE standard handles emotion via two different modalities: users can supply model utterances demonstrating the desired emotion for a synthetic speech segment; or they can specify the desired emotion for that synthetic segment using a label supplied by MPAI as a data type with its own digital representation, e.g., “angry”.

Notably, MPAI is the first standards organisation to have standardised a numbered list of Emotions. (Note, however, that the list can be modified or replaced by implementers, as explained below.)

In MPAI, Emotions are defined by the following data set:

  1. EmotionType, a high-level category of Emotions within the mentioned list of Emotions, e.g., “FEAR.”

  2. EmotionDegree, one of the values “high,” “medium,” and “low.”

  3. EmotionSet, a data structure that specifies a set of EmotionTypes and EmotionNames proposed to augment or replace the standard MPAI set.

  4. EmotionName, the label of an emotion, whether general, e.g., “fearful/scared,” or more specific, e.g., “terrified.”

  5. EmotionSetName, the name of an EmotionSet data structure.

The Basic Emotion Set is a table that currently identifies 16 EmotionTypes (high-level emotion categories), e.g., “FEAR”, “HURT” and “APPROVAL, DISAPPROVAL”. Each EmotionType contains one or more general-level Emotions. For example, “FEAR” contains a single general Emotion, “fearful/scared”; “HURT” contains “hurt” and “jealous”; and the “APPROVAL, DISAPPROVAL” category contains “admiring/approving,” “disapproving,” and “indifferent.” EmotionTypes (categories) can also include more specific or subcategorised Emotions. For instance, the “FEAR” EmotionType includes “terrified” and “anxious/uneasy,” while the “APPROVAL, DISAPPROVAL” category includes “awed” and “contemptuous.” In total, the Basic Emotion Set lists 60 general or more specific Emotions.
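To make the data set concrete, the following Python sketch shows how an individual Emotion might be represented as a record combining the items above. The field names and the default EmotionSetName are assumptions for illustration only; the normative digital representation is the one defined in the MPAI standard.

```python
from dataclasses import dataclass

# Illustrative sketch only: the field names mirror the MPAI Emotion data set
# described above; the normative serialisation is defined by the standard.

@dataclass
class Emotion:
    emotion_type: str                 # high-level category, e.g. "FEAR"
    emotion_name: str                 # general or specific label, e.g. "terrified"
    emotion_degree: str               # one of "high", "medium", "low"
    emotion_set_name: str = "Basic"   # assumed label for the standard Basic Emotion Set

# A strongly frightened speaker could be annotated as:
fear = Emotion(emotion_type="FEAR", emotion_name="fearful/scared", emotion_degree="high")
print(fear)
```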

The Emotion data type is extensible in the sense that an implementer may submit a proposal that extends or replaces the Basic Emotion Set. The proposal is assessed by the Development Committee in charge and, if found consistent and approved, posted on the MPAI web site for use.

15.2 Intention

MPAI defines Intention as the result of the analysis of the goal of a question. The “intention” consists of the following elements: qtopic, qfocus, qLAT, qSAT and qdomain. These are exemplified by the question “Who is the author of King Lear?”: question analysis concludes that the domain of the question is “Literature,” its topic is “King Lear,” and its focus is “Who.” More precise definitions are:

  1. qtopic is the topic of the question, the object or event the question is about.

  2. qfocus is the focus of the question, i.e., the part of the question that, if replaced by the answer, makes the question a stand-alone statement. Examples: what, where, who, what policy, which river, etc.

  3. qLAT is the Lexical Answer Type of the question. For example, “author” is the qLAT in “Who is the author of King Lear?”

  4. qSAT is the Semantic Answer Type of the question. qSAT corresponds to the Named Entity type in the language analysis results. For example, “person” is the qSAT in “Who is the author of King Lear?”

  5. qdomain is the domain of the question such as “science”, “weather”, “history”.

The information in the Intention is used to find the answer that best matches the question’s topic, focus, answer types and domain, by measuring the reliability of the candidate answers extracted from sentences in the Knowledge Base.
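For the example question above, a Question Analysis module could produce an Intention along the following lines. This Python sketch is non-normative: the field names follow the five elements just listed, and the scoring function is a hypothetical illustration of how candidate answers from the Knowledge Base might be ranked against the Intention.

```python
from dataclasses import dataclass

# Non-normative sketch of the Intention elements listed above,
# filled in for "Who is the author of King Lear?".

@dataclass
class Intention:
    qtopic: str    # object or event the question is about
    qfocus: str    # part of the question replaced by the answer
    qlat: str      # Lexical Answer Type
    qsat: str      # Semantic Answer Type (Named Entity type)
    qdomain: str   # domain of the question

intention = Intention(qtopic="King Lear", qfocus="Who", qlat="author",
                      qsat="person", qdomain="Literature")

# Hypothetical reliability measure: a candidate answer extracted from the
# Knowledge Base scores higher the more Intention elements it matches.
def reliability(candidate: dict, intent: Intention) -> float:
    score = 0.0
    if candidate.get("ne_type") == intent.qsat:
        score += 0.5
    if intent.qtopic.lower() in candidate.get("sentence", "").lower():
        score += 0.3
    if intent.qlat.lower() in candidate.get("sentence", "").lower():
        score += 0.2
    return score

candidate = {"answer": "William Shakespeare", "ne_type": "person",
             "sentence": "William Shakespeare is the author of King Lear."}
print(reliability(candidate, intention))  # 1.0 for this fully matching candidate
```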

15.3 Meaning

MPAI defines Meaning as information – semantic, but also syntactic and structural – extracted from input data, i.e., Text, Speech, and Video. The “meaning” consists of POS_tagging, NE_tagging, Dependency_tagging and SRL_tagging, defined as follows:

  1. POS_tagging indicates the results of Part Of Speech (POS) tagging, e.g., noun, verb, etc., including information on the POS tagging set used and the tagged results of the question.

  2. NE_tagging indicates the results of Named Entity (NE) tagging, e.g., Person, Organisation, Fruit, etc., including information on the NE tagging set used and the tagged results of the question.

  3. Dependency_tagging indicates the results of dependency tagging, i.e., the structure of the sentence such as subject, object, head of the relation, etc., including information on the dependency tagging set used and the tagged results of the question.

  4. SRL_tagging indicates the results of Semantic Role Labelling (SRL), i.e., the semantic structure of the sentence such as agent, location, patient role, etc., including information on the SRL tagging set used and the tagged results of the question.

The semantic and structural information contained in the Meaning is used as features by other AIMs to determine the user’s intention (Question Analysis), the reply to the question (Question Answering) or how the dialog should continue (Dialog Processing).
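To make the four tagging layers concrete, the Python sketch below shows what a Meaning structure for the same example question might contain. It is non-normative: the tag sets and labels are assumed for illustration, since the standard only requires that the tagging set used be identified alongside the tagged results.

```python
# Illustrative, non-normative Meaning structure for
# "Who is the author of King Lear?". Tag-set names and labels are assumed.

meaning = {
    "POS_tagging": {
        "tag_set": "example-POS-set",
        "tags": [("Who", "PRON"), ("is", "VERB"), ("the", "DET"),
                 ("author", "NOUN"), ("of", "ADP"),
                 ("King", "PROPN"), ("Lear", "PROPN")],
    },
    "NE_tagging": {
        "tag_set": "example-NE-set",
        "tags": [("King Lear", "WORK_OF_ART")],
    },
    "Dependency_tagging": {
        "tag_set": "example-dependency-set",
        # (dependent, relation, head)
        "tags": [("Who", "subject", "is"), ("author", "complement", "is"),
                 ("King Lear", "modifier", "author")],
    },
    "SRL_tagging": {
        "tag_set": "example-SRL-set",
        # (argument, semantic role, predicate)
        "tags": [("Who", "agent", "is the author of"),
                 ("King Lear", "theme", "is the author of")],
    },
}
```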

15.4 Speech features

MPAI defines speech features as descriptive aspects of a speech segment. These include base speed, pitch, and volume; variations in pitch, intensity, and sub-segment duration (rhythm); vocal tension, degree of whisper or creakiness, and others. The features can be represented symbolically, e.g., indicating a certain intensity (volume) in decibels; or they can be represented via neural-network-based vectors (NNspeechFeatures). Either representation may be automatically recognised and extracted for use in, e.g., speech analysis or speech synthesis.

To describe some speech features more exactly:

  1. Pitch: the fundamental frequency of speech expressed in Hz.

  2. Intensity: the energy of speech expressed in dB.

  3. Speed: the speech rate expressed as a number indicating specified linguistic units (e.g., phonemes, syllables, or words) per second.

Speech features can be used for voice analysis, e.g., to assist recognition of vocally expressed emotion; or they can be used for voice synthesis, e.g., to lend a certain emotional charge to a synthetic voice. When speech features are passed between AIMs, the receiving module requires a precise specification of their format and, if the format exploits neural-network-based vectors, the corresponding pre-trained models or sufficient training data.
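As a rough illustration of the symbolic representation, the Python sketch below estimates the three features defined above from a mono speech segment: intensity as RMS energy in dB, pitch as the lag of the strongest autocorrelation peak within a plausible F0 range, and speed as linguistic units per second given a known unit count. It is a toy under these assumptions, not the feature-extraction method of any MPAI AIM.

```python
import numpy as np

# Toy, non-normative estimates of the three speech features defined above.
# Assumes a mono, voiced speech segment `x` sampled at `fs` Hz.

def intensity_db(x: np.ndarray) -> float:
    """Intensity as RMS energy expressed in dB (relative to full scale)."""
    rms = np.sqrt(np.mean(x ** 2))
    return 20.0 * np.log10(rms + 1e-12)

def pitch_hz(x: np.ndarray, fs: int, fmin: float = 60.0, fmax: float = 400.0) -> float:
    """Crude pitch estimate: strongest autocorrelation peak within [fmin, fmax]."""
    x = x - np.mean(x)
    ac = np.correlate(x, x, mode="full")[len(x) - 1:]
    lo, hi = int(fs / fmax), int(fs / fmin)
    lag = lo + int(np.argmax(ac[lo:hi]))
    return fs / lag

def speed_units_per_second(unit_count: int, duration_s: float) -> float:
    """Speech rate as linguistic units (phonemes, syllables or words) per second."""
    return unit_count / duration_s

# Example with a synthetic 150 Hz tone standing in for a short voiced frame:
fs = 16000
t = np.arange(2048) / fs
x = 0.1 * np.sin(2 * np.pi * 150 * t)
print(intensity_db(x), pitch_hz(x, fs), speed_units_per_second(12, 1.0))
```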

15.5 Microphone array geometry

A microphone array consists of several microphones placed on a platform and is used to record the environment from different positions. Such arrays are used in a variety of applications aimed at noise cancellation, source separation or source localisation, and the multichannel outputs of the array may be used in data-driven or ML-based AI applications. EAE (Enhanced Audioconference Experience) is a real-time use case that analyses the multichannel signals from a microphone array. Since one of the EAE inputs is multichannel audio, Microphone Array Geometry is a further input format that defines the properties of the input signals. With this definition, the AIMs can analyse signals coming from different types of microphone arrays.

Microphone arrays are described with the following features:

  1. Microphone Array Type defines the shape of the platform on which the microphones are placed. It can be spherical, circular, planar, linear or another format.

  2. Number of Microphones indicates how many microphones form the array.

  3. Microphone Object contains the properties of a specific microphone placed on the platform: its position in x, y, z coordinates with respect to the central reference position; its directivity pattern, set as omnidirectional, figure of eight, cardioid, supercardioid, hypercardioid or another; and its look direction, set as a vector in x, y, z coordinates.

  4. Microphone Array Look Direction is a reference vector represented in x, y, z coordinates.

  5. Microphone Array Scattering Type defines the scattering surface of the platform. Depending on the number of microphones to be placed, the platform may differ; microphones can be placed on a rigid or open surface in a compact form, and the surface affects the frequency components of the captured sound field. During the analysis, the surface type, i.e., rigid, open or another, must be taken into account by the AIMs.

  6. Microphone Array Filter URI is the address of the equalisation filter that defines the coefficients used to equalise the microphone outputs with respect to each other. This is needed because the manufacturing process varies between microphone types and manufacturers, so the same frequency characteristics cannot be obtained from different microphones even when the type and manufacturer are the same.

  7. Additionally, the format requires Sampling Rate, Sampling Type and Block Size, the latter being the number of samples in an audio block.

By using the Microphone Array Geometry information together with the microphone array audio, the AIMs defined for EAE can perform speech detection, speech separation and noise cancellation, adapting their runtime behaviour to the input definitions.
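The Python sketch below expresses the features listed above as a single data structure, here for a hypothetical four-microphone rigid circular array of 5 cm radius. Field names, the placeholder filter URI and the example values are assumptions for illustration; the normative format is the one specified by MPAI.

```python
import math

# Non-normative sketch of a Microphone Array Geometry description for a
# hypothetical 4-microphone rigid circular array of 5 cm radius.
# Field names and values are illustrative only.

radius = 0.05  # metres
microphone_array_geometry = {
    "MicrophoneArrayType": "circular",
    "NumberOfMicrophones": 4,
    "Microphones": [
        {
            # position relative to the central reference position (metres)
            "Position": [radius * math.cos(a), radius * math.sin(a), 0.0],
            "DirectivityPattern": "cardioid",
            # look direction as a unit vector pointing outwards
            "LookDirection": [math.cos(a), math.sin(a), 0.0],
        }
        for a in (0.0, math.pi / 2, math.pi, 3 * math.pi / 2)
    ],
    "MicrophoneArrayLookDirection": [1.0, 0.0, 0.0],
    "MicrophoneArrayScatteringType": "rigid",
    "MicrophoneArrayFilterURI": "https://example.org/filters/array-eq.json",  # placeholder
    "SamplingRate": 48000,     # Hz
    "SamplingType": "PCM16",   # assumed label
    "BlockSize": 1024,         # samples per audio block
}
print(microphone_array_geometry["NumberOfMicrophones"])
```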

15.6 Audio scene geometry

The EAE output contains the separated speech signals and the audio scene geometry, i.e., the information that needs to be sent to a receiver in order to correctly recreate the audio field intended by the transmitter.

EAE packages the separated speech signals with their spatial information, given in the audio scene geometry, for transmission through an audioconference application that may improve the speech source quality or enable immersive audio.

Therefore, the EAE output should include:

  1. Speech Count represents the number of speech objects detected during the current audio block.

  2. Speech Object contains the spatial information of a detected speech source. Each object is identified by a SpeechID; the single-channel identifier within the multichannel audio is the ChannelID. The spatial position of the speech object detected in the current block is represented by azimuth, elevation and distance.

  3. Block Information represents the current index and the starting and ending times of the block.
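As with the geometry input, the Audio Scene Geometry part of the output can be pictured as a simple per-block structure. The Python sketch below is non-normative; field names and units (degrees for azimuth and elevation, metres for distance, samples for the block) are assumptions mirroring the items above.

```python
# Illustrative, non-normative Audio Scene Geometry for one audio block in
# which two speech sources were detected. Field names and units are assumed.

fs, block_size, block_index = 48000, 1024, 128  # assumed values

audio_scene_geometry = {
    "SpeechCount": 2,
    "SpeechObjects": [
        {"SpeechID": 0, "ChannelID": 0,
         "Azimuth": 30.0, "Elevation": 0.0, "Distance": 1.2},
        {"SpeechID": 1, "ChannelID": 1,
         "Azimuth": -45.0, "Elevation": 10.0, "Distance": 2.0},
    ],
    "BlockInformation": {
        "Index": block_index,
        "StartTime": block_index * block_size / fs,        # seconds
        "EndTime": (block_index + 1) * block_size / fs,    # seconds
    },
}
print(audio_scene_geometry["SpeechCount"])
```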
