| 1 Definition | 2 Functional Requirements | 3 Syntax | 4 Semantics |
1 Definition
The Speech Qualifier is a set of Data providing additional information on Speech Data for potential use by a machine. It describes:
- The structure of the signal (Formats)
- The origin of the speech (Source)
- The linguistic content (Metadata)
- The perceptual characteristics (SpeechCharacteristics)
- The systems interacting with the signal (Device)
The combination of Speech Data and Speech Qualifier is called a Speech Object and is specified by MPAI-OSD V1.5.
2 Functional Requirements
The Speech Qualifier allows the expression of the following Elements:
- Sub-Types
- Formats
  - ContentFormats
  - TransportFormats
- Attributes
  - Source
  - Metadata
  - SpeechCharacteristics
  - Device
Users needing additional entries in the Speech Qualifier or support of new Qualifiers should make a documented request to the MPAI Secretariat.
Requests will be considered by the appropriate MPAI committee.
3 Syntax
https://schemas.mpai.community/TFA/V1.5/data/SpeechQualifier.json
4 Semantics
4.1 Sub-Types
Reserved for future extensions.
4.2 Formats
4.2.1 ContentFormats
Defines the data arrangement used to represent speech signals.
- Raw Speech
  Definition: the type of data arrangement used to digitally represent speech samples.
  - Sampling Frequency: number, expressed in kHz
  - Sample Precision: number, expressed in bits per sample
  Typically represented using: PCM
- Speech Compression Formats
  Definition: the type of data arrangement used to reduce the number of bits needed to represent speech.
  - G711A
  - G711μ
  - MP3 (ISO/IEC 11172-3:1993)
  - AAC (ISO/IEC 14496-3:2019)
Additional formats are defined in: SpeechContentFormats.json
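To illustrate how the two Raw Speech parameters relate to data rate, the following minimal sketch computes the uncompressed bit rate of a PCM signal. The class name `RawSpeechFormat`, its field names, and the `channels` argument are hypothetical and are not taken from SpeechQualifier.json.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RawSpeechFormat:
    """Hypothetical holder for the two Raw Speech parameters (names are assumptions)."""

    sampling_frequency_khz: float  # Sampling Frequency, in kHz
    sample_precision_bits: int     # Sample Precision, in bits per sample

    def bit_rate_kbps(self, channels: int = 1) -> float:
        """Uncompressed PCM bit rate in kbit/s: kHz x bits per sample x channels."""
        if (
            self.sampling_frequency_khz <= 0
            or self.sample_precision_bits <= 0
            or channels <= 0
        ):
            raise ValueError("all parameters must be positive")
        return self.sampling_frequency_khz * self.sample_precision_bits * channels


# Two illustrative configurations (values chosen for the example only).
narrowband = RawSpeechFormat(sampling_frequency_khz=8.0, sample_precision_bits=16)
wideband = RawSpeechFormat(sampling_frequency_khz=16.0, sample_precision_bits=16)
```

Under these assumptions, `narrowband.bit_rate_kbps()` gives the mono PCM rate for 8 kHz, 16-bit speech, which is the baseline a Speech Compression Format would reduce.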
4.2.2 TransportFormats
Defines how Speech data is transported.
- FileFormat: SpeechFileFormats
- StreamFormat: SpeechStreamFormats
4.3 Attributes
4.3.1 Source
Defines the origin of the speech signal.
- Real: speech produced by a human speaker
- Synthetic: speech generated by a system (e.g. TTS)
4.3.2 Metadata
Provides descriptive information about the speech content.
- Language:
- LanguageFormat
- LanguageCode
- SpeakerIdentity
- ContentDescription:
- TextObject
- EntityInternalStatus
4.3.3 SpeechCharacteristics
Defines measurable and perceptual characteristics of speech signals.
These attributes provide additional information useful for speech processing,
analysis, and synthesis, without constraining implementation methods.
- SpeakingRate
  Definition: rate of speech delivery. Typically expressed as words per second or syllables per second, depending on the application.
- PitchRange
  Definition: range of variation of the fundamental frequency (F0). Typically expressed in Hertz or semitones.
- Energy
  Definition: measure of signal intensity or loudness. May be represented as RMS energy, peak level, or perceptual loudness (e.g. LUFS).
- Prosody
  Definition: expressive pattern of speech, including intonation, rhythm, and stress.
  - Neutral
  - Expressive
  - Emphatic
  - Monotonic
  - Other
- Disfluencies
  Definition: indicates presence of hesitations, repetitions, fillers, or interruptions in speech.
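The specification does not constrain how these characteristics are measured. As one possible reading of the units mentioned above, the sketch below computes a speaking rate in units per second, a pitch range in semitones from two F0 values in Hertz, and an RMS energy over a block of samples. All function and parameter names are hypothetical; F0 extraction and word or syllable counting are assumed to be done elsewhere.

```python
import math
from typing import Sequence


def speaking_rate(unit_count: int, duration_s: float) -> float:
    """Words or syllables per second, depending on what unit_count counts."""
    if duration_s <= 0:
        raise ValueError("duration must be positive")
    return unit_count / duration_s


def pitch_range_semitones(f0_min_hz: float, f0_max_hz: float) -> float:
    """Pitch range in semitones: 12 * log2(f0_max / f0_min)."""
    if f0_min_hz <= 0 or f0_max_hz <= 0:
        raise ValueError("F0 values must be positive")
    return 12.0 * math.log2(f0_max_hz / f0_min_hz)


def rms_energy(samples: Sequence[float]) -> float:
    """Root-mean-square level of a block of samples (linear scale)."""
    if not samples:
        raise ValueError("at least one sample is required")
    return math.sqrt(sum(s * s for s in samples) / len(samples))
```

For example, an F0 range from 100 Hz to 200 Hz spans one octave, so `pitch_range_semitones(100.0, 200.0)` yields 12 semitones.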
4.3.4 Device
Defines the device used for capturing or rendering speech signals.
- DeviceRole:
  - Capture (microphones, sensor arrays)
  - Render (speakers, headphones)
  - Bidirectional
- DeviceType:
  - Microphone
  - MicrophoneArray
  - Speaker
  - Headphones
  - WearableMic
- CaptureConfiguration:
  - ChannelCount
  - SamplingMode (Mono, Stereo, MultiChannel, Ambisonics)
- RenderConfiguration:
  - ChannelCount
  - RenderingMode (Mono, Stereo, Multichannel, Binaural)
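The Device elements above can be combined into a simple consistency check. The following is a sketch only: the class names, field names, and the rule that a Capture or Bidirectional device should carry a CaptureConfiguration (and likewise for Render) are assumptions made for illustration, not requirements stated by the specification or its JSON schema. The mode values are copied from the lists above.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional


class DeviceRole(Enum):
    CAPTURE = "Capture"
    RENDER = "Render"
    BIDIRECTIONAL = "Bidirectional"


# Mode values as listed under CaptureConfiguration and RenderConfiguration.
SAMPLING_MODES = {"Mono", "Stereo", "MultiChannel", "Ambisonics"}
RENDERING_MODES = {"Mono", "Stereo", "Multichannel", "Binaural"}


@dataclass
class ChannelConfiguration:
    channel_count: int  # ChannelCount
    mode: str           # SamplingMode or RenderingMode value


@dataclass
class Device:
    role: DeviceRole
    device_type: str
    capture: Optional[ChannelConfiguration] = None
    render: Optional[ChannelConfiguration] = None


def check_device(device: Device) -> List[str]:
    """Return a list of problems; an empty list means no issue under the assumed rules."""
    problems: List[str] = []
    captures = device.role in (DeviceRole.CAPTURE, DeviceRole.BIDIRECTIONAL)
    renders = device.role in (DeviceRole.RENDER, DeviceRole.BIDIRECTIONAL)

    # Assumed rule: the role determines which configuration must be present.
    if captures and device.capture is None:
        problems.append("missing CaptureConfiguration")
    if renders and device.render is None:
        problems.append("missing RenderConfiguration")

    # Mode values must come from the corresponding list.
    if device.capture is not None and device.capture.mode not in SAMPLING_MODES:
        problems.append(f"unknown SamplingMode: {device.capture.mode}")
    if device.render is not None and device.render.mode not in RENDERING_MODES:
        problems.append(f"unknown RenderingMode: {device.render.mode}")

    for cfg in (device.capture, device.render):
        if cfg is not None and cfg.channel_count < 1:
            problems.append("ChannelCount must be at least 1")
    return problems


# A headset that both captures (mono) and renders (binaural).
headset = Device(
    role=DeviceRole.BIDIRECTIONAL,
    device_type="Headphones",
    capture=ChannelConfiguration(channel_count=1, mode="Mono"),
    render=ChannelConfiguration(channel_count=2, mode="Binaural"),
)
```

Calling `check_device(headset)` returns an empty list, while a Render device declared with the `Ambisonics` mode would be reported, since that value appears only under SamplingMode.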