MPAI-TFA V1.5 Data Types Speech Qualifier

1 Definition

The Speech Qualifier is a set of Data providing additional information on Speech Data for potential use by a machine. It describes:

The structure of the signal (Formats)
The origin of the speech (Source)
The linguistic content (Metadata)
The perceptual characteristics (SpeechCharacteristics)
The systems interacting with the signal (Device)

The combination of Speech Data and Speech Qualifier is called a Speech Object and is specified by MPAI-OSD V1.5.

2 Functional Requirements

The Speech Qualifier allows the expression of the following Elements:

Sub-Types
Formats
- ContentFormats
- TransportFormats
Attributes
- Source
- Metadata
- SpeechCharacteristics
- Device
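
The element hierarchy listed above can be pictured as a nested structure. The following is a minimal sketch in Python, assuming hypothetical key names that simply mirror the Element names in the list; the normative names and nesting are those of the JSON schema referenced under Syntax, which this sketch does not reproduce. The helper `unknown_elements` is likewise illustrative only.

```python
# Hypothetical layout: keys mirror the Elements listed in this section,
# not necessarily the property names of the normative JSON schema.
ALLOWED_ELEMENTS = {
    "SubTypes": set(),  # reserved for future extensions
    "Formats": {"ContentFormats", "TransportFormats"},
    "Attributes": {"Source", "Metadata", "SpeechCharacteristics", "Device"},
}


def unknown_elements(qualifier: dict) -> list[str]:
    """Return dotted paths of entries that are not listed Elements."""
    unknown = []
    for group, members in qualifier.items():
        if group not in ALLOWED_ELEMENTS:
            unknown.append(group)
            continue
        if isinstance(members, dict):
            for name in members:
                if name not in ALLOWED_ELEMENTS[group]:
                    unknown.append(f"{group}.{name}")
    return unknown


example = {
    "Formats": {"ContentFormats": "PCM"},
    "Attributes": {"Source": "Real", "Loudness": -23.0},
}
print(unknown_elements(example))  # ['Attributes.Loudness']
```

An entry such as `Loudness` is reported because it is not among the listed Elements; in practice such an addition would go through the request process described below.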

Users needing additional entries in the Speech Qualifier or support of new Qualifiers should make a documented request to the MPAI Secretariat.
Requests will be considered by the appropriate MPAI committee.

3 Syntax

https://schemas.mpai.community/TFA/V1.5/data/SpeechQualifier.json

4 Semantics

4.1 Sub-Types

Reserved for future extensions.

4.2 Formats

4.2.1 ContentFormats

Defines the data arrangement used to represent speech signals.

- Raw Speech

Definition: the type of data arrangement used to digitally represent speech samples.

  - Sampling Frequency: a number expressed in kHz
  - Sample Precision: a number expressed in bits per sample

Typically represented using: PCM

- Speech Compression Formats

Definition: the type of data arrangement used to reduce the number of bits for speech.

  - G711A
  - G711μ
  - MP3 (ISO/IEC 11172-3:1993)
  - AAC (ISO/IEC 14496-3:2019)

Additional formats are defined in: SpeechContentFormats.json
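
The two Raw Speech parameters determine the bit rate of uncompressed speech, which is the figure a compression format reduces. The sketch below works through that arithmetic; the function name and the `channels` parameter are assumptions made for illustration and are not part of ContentFormats.

```python
def raw_bit_rate_kbps(sampling_frequency_khz: float,
                      sample_precision_bits: int,
                      channels: int = 1) -> float:
    """Bit rate of uncompressed speech in kbit/s.

    kHz multiplied by bits per sample yields kbit/s directly.
    `channels` is an assumption of this sketch (default: one channel).
    """
    if sampling_frequency_khz <= 0 or sample_precision_bits <= 0 or channels <= 0:
        raise ValueError("all parameters must be positive")
    return sampling_frequency_khz * sample_precision_bits * channels


print(raw_bit_rate_kbps(16, 16))  # 256
print(raw_bit_rate_kbps(8, 16))   # 128
```

For comparison, G.711 carries 8 kHz speech at 64 kbit/s, half the rate of the 8 kHz, 16-bit raw case computed above.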

4.2.2 TransportFormats

Defines how Speech Data is transported.

FileFormat: SpeechFileFormats
StreamFormat: SpeechStreamFormats

4.3 Attributes

4.3.1 Source

Defines the origin of the speech signal.

Real: speech produced by a human speaker
Synthetic: speech generated by a system (e.g. TTS)

4.3.2 Metadata

Provides descriptive information about the speech content.

Language:
- LanguageFormat
- LanguageCode
SpeakerIdentity
ContentDescription:
- TextObject
- EntityInternalStatus
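
As an illustration of how the Metadata elements nest, the sketch below builds one instance and serialises it to JSON. The class names, the use of plain strings for `SpeakerIdentity`, `TextObject` and `EntityInternalStatus`, and the sample values (including "ISO 639-1" as a LanguageFormat) are all assumptions made for this example; the normative types are those of the JSON schema.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class LanguageType:
    LanguageFormat: str  # assumed: names the scheme used by LanguageCode
    LanguageCode: str


@dataclass
class ContentDescriptionType:
    TextObject: Optional[str] = None            # assumed: plain text
    EntityInternalStatus: Optional[str] = None  # assumed: a short label


@dataclass
class MetadataType:
    Language: LanguageType
    SpeakerIdentity: Optional[str] = None
    ContentDescription: Optional[ContentDescriptionType] = None


metadata = MetadataType(
    Language=LanguageType(LanguageFormat="ISO 639-1", LanguageCode="en"),
    SpeakerIdentity="speaker-001",  # hypothetical identifier
    ContentDescription=ContentDescriptionType(TextObject="Good morning."),
)

# asdict() converts nested dataclasses into nested dictionaries.
print(json.dumps(asdict(metadata), indent=2))
```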

4.3.3 SpeechCharacteristics

Defines measurable and perceptual characteristics of speech signals.
These attributes provide additional information useful for speech processing,
analysis, and synthesis, without constraining implementation methods.

SpeakingRate
Definition: rate of speech delivery.
Typically expressed as words per second or syllables per second,
depending on the application.
PitchRange
Definition: range of variation of the fundamental frequency (F0).
Typically expressed in Hertz or semitones.
Energy
Definition: measure of signal intensity or loudness.
May be represented as RMS energy, peak level, or perceptual loudness (e.g. LUFS).
Prosody
Definition: expressive pattern of speech, including intonation,
rhythm, and stress.
- Neutral
- Expressive
- Emphatic
- Monotonic
- Other
Disfluencies
Definition: indicates presence of hesitations, repetitions,
fillers, or interruptions in speech.
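
Several of these characteristics reduce to simple arithmetic. The sketch below shows one way to compute a speaking rate, convert a pitch range from Hertz to semitones, and compute RMS energy. The function names and sample values are illustrative assumptions; this specification does not mandate any particular measurement method.

```python
import math


def speaking_rate(unit_count: int, duration_s: float) -> float:
    """Units (words or syllables) per second over a speech segment."""
    return unit_count / duration_s


def pitch_range_semitones(f0_min_hz: float, f0_max_hz: float) -> float:
    """Width of an F0 range in semitones (12 semitones per octave)."""
    return 12.0 * math.log2(f0_max_hz / f0_min_hz)


def rms_energy(samples: list[float]) -> float:
    """Root-mean-square amplitude of a block of samples."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))


print(speaking_rate(12, 4.0))                # 3.0
print(pitch_range_semitones(100.0, 200.0))   # 12.0
print(rms_energy([0.5, -0.5, 0.5, -0.5]))    # 0.5
```

A range from 100 Hz to 200 Hz spans one octave, hence 12 semitones; expressing the range in semitones makes it comparable across speakers with different average F0.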

4.3.4 Device

Defines the device used for capturing or rendering speech signals.

DeviceRole:
- Capture (microphones, sensor arrays)
- Render (speakers, headphones)
- Bidirectional
DeviceType:
- Microphone
- MicrophoneArray
- Speaker
- Headphones
- WearableMic
CaptureConfiguration:
- ChannelCount
- SamplingMode (Mono, Stereo, MultiChannel, Ambisonics)
RenderConfiguration:
- ChannelCount
- RenderingMode (Mono, Stereo, Multichannel, Binaural)
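
The enumerations above lend themselves to a simple consistency check. The following sketch, with the hypothetical helper `device_problems`, assumes that microphone-like types belong to the Capture role and speaker-like types to the Render role, as the parenthetical notes under DeviceRole suggest; this pairing is not stated as a rule in this specification. Mode values are spelled exactly as they are listed above.

```python
ROLES = {"Capture", "Render", "Bidirectional"}
CAPTURE_TYPES = {"Microphone", "MicrophoneArray", "WearableMic"}
RENDER_TYPES = {"Speaker", "Headphones"}
SAMPLING_MODES = {"Mono", "Stereo", "MultiChannel", "Ambisonics"}
RENDERING_MODES = {"Mono", "Stereo", "Multichannel", "Binaural"}


def device_problems(device: dict) -> list[str]:
    """Return human-readable problems found in a Device description."""
    problems = []
    role = device.get("DeviceRole")
    dtype = device.get("DeviceType")
    if role not in ROLES:
        problems.append(f"unknown DeviceRole: {role!r}")
    if dtype not in CAPTURE_TYPES | RENDER_TYPES:
        problems.append(f"unknown DeviceType: {dtype!r}")
    # Assumed pairing of roles and types (see lead-in).
    if role == "Capture" and dtype in RENDER_TYPES:
        problems.append(f"{dtype} is a render device but role is Capture")
    if role == "Render" and dtype in CAPTURE_TYPES:
        problems.append(f"{dtype} is a capture device but role is Render")
    sampling = device.get("CaptureConfiguration", {}).get("SamplingMode")
    if sampling is not None and sampling not in SAMPLING_MODES:
        problems.append(f"unknown SamplingMode: {sampling!r}")
    rendering = device.get("RenderConfiguration", {}).get("RenderingMode")
    if rendering is not None and rendering not in RENDERING_MODES:
        problems.append(f"unknown RenderingMode: {rendering!r}")
    return problems


ok_device = {
    "DeviceRole": "Capture",
    "DeviceType": "MicrophoneArray",
    "CaptureConfiguration": {"ChannelCount": 4, "SamplingMode": "Ambisonics"},
}
bad_device = {"DeviceRole": "Capture", "DeviceType": "Headphones"}

print(device_problems(ok_device))   # []
print(device_problems(bad_device))
```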
