1 Definition 2 Functional Requirements 3 Syntax 4 Semantics

1     Definition

The Speech Qualifier is a set of Data providing additional information on Speech Data for potential use by a machine. It describes:

  • The structure of the signal (Formats)
  • The origin of the speech (Source)
  • The linguistic content (Metadata)
  • The perceptual characteristics (SpeechCharacteristics)
  • The systems interacting with the signal (Device)

The combination of Speech Data and Speech Qualifier is called Speech Object and is specified by MPAI-OSD V1.5.

2     Functional Requirements

The Speech Qualifier allows the expression of the following Elements:

  • Sub-Types
  • Formats
    • ContentFormats
    • TransportFormats
  • Attributes
    • Source
    • Metadata
    • SpeechCharacteristics
    • Device

Users needing additional entries in the Speech Qualifier or support of new Qualifiers should make a documented request to the MPAI Secretariat.
Requests will be considered by the appropriate MPAI committee.

3     Syntax


https://schemas.mpai.community/TFA/V1.5/data/SpeechQualifier.json
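
As an illustration only, the sketch below assembles a Speech Qualifier as a Python dictionary and serialises it to JSON. The key names reuse the Element names listed in Section 2, but the nesting, the "Format" key, and all example values are assumptions made for this sketch; the JSON Schema at the URL above remains the normative syntax.

```python
import json

# Hypothetical Speech Qualifier instance. The key names reuse the Element
# names of Section 2; the nesting and value encodings are assumptions and
# must be checked against SpeechQualifier.json before use.
speech_qualifier = {
    "Formats": {
        "ContentFormats": {
            "Format": "PCM",            # assumed label for Raw Speech
            "SamplingFrequency": 16,    # kHz
            "SamplePrecision": 16,      # bits per sample
        },
    },
    "Attributes": {
        "Source": "Real",
        "Metadata": {
            "Language": {
                "LanguageFormat": "ISO 639-1",  # assumed value
                "LanguageCode": "en",
            },
        },
        "Device": {
            "DeviceRole": "Capture",
            "DeviceType": "Microphone",
            "CaptureConfiguration": {"ChannelCount": 1, "SamplingMode": "Mono"},
        },
    },
}

# Serialise to JSON text and parse it back to confirm the round trip.
encoded = json.dumps(speech_qualifier, indent=2)
decoded = json.loads(encoded)
print(encoded)
```

A real implementation would validate such an instance against the published schema rather than rely on the structure assumed here.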

4     Semantics

4.1  Sub-Types

Reserved for future extensions.

4.2  Formats

4.2.1  ContentFormats

Defines the data arrangement used to represent speech signals.

    • Raw Speech

Definition: the type of data arrangement used to digitally represent speech samples.

      • Sampling Frequency: a number expressing the sampling frequency in kHz
      • Sample Precision: a number expressing the precision in bits per sample

Typically represented using: PCM

    • Speech Compression Formats

Definition: the type of data arrangement used to reduce the number of bits for speech.

      • G711A
      • G711μ
      • MP3 (ISO/IEC 11172-3:1993)
      • AAC (ISO/IEC 14496-3:2019)

Additional formats are defined in: SpeechContentFormats.json
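
The two Raw Speech parameters above determine the uncompressed bit rate of the signal. A minimal sketch of that arithmetic follows; the function name and the channel_count parameter are assumptions introduced for this example and are not part of the ContentFormats definition.

```python
def raw_speech_bit_rate(sampling_frequency_khz, sample_precision_bits, channel_count=1):
    """Return the uncompressed bit rate in bits per second.

    sampling_frequency_khz: Sampling Frequency, in kHz.
    sample_precision_bits:  Sample Precision, in bits per sample.
    channel_count:          hypothetical extra parameter, defaults to mono.
    """
    if sampling_frequency_khz <= 0 or sample_precision_bits <= 0 or channel_count <= 0:
        raise ValueError("all parameters must be positive")
    # kHz -> Hz, then multiply by bits per sample and by number of channels.
    return sampling_frequency_khz * 1000 * sample_precision_bits * channel_count


# 16 kHz, 16 bits per sample, mono PCM
wideband = raw_speech_bit_rate(16, 16)
# 8 kHz, 8 bits per sample: the 64 kbit/s rate used by G.711
telephone = raw_speech_bit_rate(8, 8)
print(wideband, telephone)
```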

4.2.2  TransportFormats

Defines how Speech Data is transported.

4.3  Attributes

4.3.1  Source

Defines the origin of the speech signal.

  • Real: speech produced by a human speaker
  • Synthetic: speech generated by a system (e.g. TTS)

4.3.2  Metadata

Provides descriptive information about the speech content.

  • Language:
    • LanguageFormat
    • LanguageCode
  • SpeakerIdentity
  • ContentDescription:
    • TextObject
    • EntityInternalStatus

4.3.3  SpeechCharacteristics

Defines measurable and perceptual characteristics of speech signals.
These attributes provide additional information useful for speech processing,
analysis, and synthesis, without constraining implementation methods.

  • SpeakingRate
    Definition: rate of speech delivery.
    Typically expressed as words per second or syllables per second,
    depending on the application.
  • PitchRange
    Definition: range of variation of the fundamental frequency (F0).
    Typically expressed in Hertz or semitones.
  • Energy
    Definition: measure of signal intensity or loudness.
    May be represented as RMS energy, peak level, or perceptual loudness (e.g. LUFS).
  • Prosody
    Definition: expressive pattern of speech, including intonation,
    rhythm, and stress.
    • Neutral
    • Expressive
    • Emphatic
    • Monotonic
    • Other
  • Disfluencies
    Definition: indicates presence of hesitations, repetitions,
    fillers, or interruptions in speech.
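
Several of the characteristics above reduce to simple arithmetic. The following sketch shows one possible way to compute them; the function names are hypothetical, samples are assumed to be floats normalised to the range -1.0 to 1.0, and perceptual loudness (LUFS) is left out because it requires the weighting filters of ITU-R BS.1770.

```python
import math


def speaking_rate(unit_count, duration_s):
    """Words (or syllables) per second over a stretch of speech."""
    if duration_s <= 0:
        raise ValueError("duration must be positive")
    return unit_count / duration_s


def pitch_range_semitones(f0_min_hz, f0_max_hz):
    """Convert an F0 range given in Hertz to semitones (12 per octave)."""
    if f0_min_hz <= 0 or f0_max_hz < f0_min_hz:
        raise ValueError("expected 0 < f0_min_hz <= f0_max_hz")
    return 12 * math.log2(f0_max_hz / f0_min_hz)


def rms_energy(samples):
    """Root-mean-square value of the samples."""
    if not samples:
        raise ValueError("no samples")
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def peak_level(samples):
    """Largest absolute sample value."""
    if not samples:
        raise ValueError("no samples")
    return max(abs(s) for s in samples)


# An F0 range of 100 Hz to 200 Hz spans one octave, i.e. 12 semitones.
print(pitch_range_semitones(100.0, 200.0))
```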

4.3.4  Device

Defines the device used for capturing or rendering speech signals.

  • DeviceRole:
    • Capture (microphones, sensor arrays)
    • Render (speakers, headphones)
    • Bidirectional
  • DeviceType:
    • Microphone
    • MicrophoneArray
    • Speaker
    • Headphones
    • WearableMic
  • CaptureConfiguration:
    • ChannelCount
    • SamplingMode (Mono, Stereo, MultiChannel, Ambisonics)
  • RenderConfiguration:
    • ChannelCount
    • RenderingMode (Mono, Stereo, MultiChannel, Binaural)
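
The Device structure above could be modelled as follows. This is a sketch under stated assumptions: the class and field names are invented for the example, and the rule that a Capture or Bidirectional device must carry a CaptureConfiguration (and a Render or Bidirectional device a RenderConfiguration) is an assumption, not something this specification states.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class DeviceRole(Enum):
    CAPTURE = "Capture"
    RENDER = "Render"
    BIDIRECTIONAL = "Bidirectional"


@dataclass
class CaptureConfiguration:
    channel_count: int
    sampling_mode: str  # e.g. "Mono" or "Stereo"


@dataclass
class RenderConfiguration:
    channel_count: int
    rendering_mode: str  # e.g. "Stereo" or "Binaural"


@dataclass
class Device:
    role: DeviceRole
    device_type: str  # e.g. "Microphone" or "Headphones"
    capture: Optional[CaptureConfiguration] = None
    render: Optional[RenderConfiguration] = None

    def __post_init__(self):
        # Assumed consistency rule: the configuration must match the role.
        needs_capture = self.role in (DeviceRole.CAPTURE, DeviceRole.BIDIRECTIONAL)
        needs_render = self.role in (DeviceRole.RENDER, DeviceRole.BIDIRECTIONAL)
        if needs_capture and self.capture is None:
            raise ValueError("capturing device without CaptureConfiguration")
        if needs_render and self.render is None:
            raise ValueError("rendering device without RenderConfiguration")


# A single-channel microphone used for capture only.
mic = Device(DeviceRole.CAPTURE, "Microphone",
             capture=CaptureConfiguration(channel_count=1, sampling_mode="Mono"))
print(mic)
```

Checking the configuration in the constructor keeps inconsistent descriptions from propagating; an implementation that treats both configurations as optional can drop the check.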