1     Definition 2     Functional Requirements 3     Syntax
4     Semantics 5    Conformance Testing 6     Performance Assessment

1      Definition

A Data Type representing various features of a Speech Segment, including speaker identity, prosody, and additional vocal elements such as tension, whispery quality, or creaky voice.

2      Functional Requirements

Speech Descriptors may include Neural Network Descriptors.

3      Syntax

https://schemas.mpai.community/MMC/V2.3/data/SpeechDescriptors.json

4      Semantics

| Label | Size | Description |
|---|---|---|
| Header | N1 Bytes | Speech Descriptors Header |
| – Standard – Speech Descriptors | 9 Bytes | The characters “MMC-SPD-V” |
| – Version | N2 Bytes | Major version – 1 or 2 characters |
| – Dot-separator | 1 Byte | The character “.” |
| – Subversion | N3 Bytes | Minor version – 1 or 2 characters |
| MInstanceID | N4 Bytes | ID of the Metaverse Instance. |
| SpeechDescriptorsID | N5 Bytes | ID of Speech Descriptors. |
| SpeechDescriptorsData | N7 Bytes | Data associated with Input Text. |
| SpeechFeatures | N8 Bytes | Indicates the characteristic elements extracted from the input speech: pitch, tone, intonation, intensity, speed, emotion, and NNSpeechFeatures. |
| NNSpeechFeatures | N9 Bytes | Indicates the characteristic elements extracted from the input speech by a Neural Network. |
| pitch | N10 Bytes | Indicates the fundamental frequency of Speech, expressed as a real number in Hz (Hertz). |
| tone | N11 Bytes | Tone is a variation in the pitch of the voice while speaking, expressed as human-readable words as in Table 48. |
| ToneType | N12 Bytes | Indicates the Tone that the input speech carries. |
| intonation | N13 Bytes | A variation of the pitch, intensity, and speed within a time period measured in seconds. |
| intensity | N14 Bytes | Energy of Speech, expressed as a real number in dB (decibels). |
| speed | N7 Bytes | Indicates the Speech Rate as a real number of specified linguistic units (e.g., Phonemes, Syllables, or Words) per second. |
| emotion | N15 Bytes | Indicates the Emotion that the input speech carries. |
| EmotionType | N16 Bytes | Indicates the Emotion that the input speech carries. |
| toneName | N17 Bytes | Specifies the name of a Tone. |
| toneSetName | N18 Bytes | Name of the Tone set that contains the Tone. The Tone set is used as a baseline, but other sets are possible. |
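
The Header layout in the table (the characters “MMC-SPD-V”, a major version of 1 or 2 characters, the character “.”, and a minor version of 1 or 2 characters) can be checked with a short pattern. The following is a minimal sketch, not part of the specification: the function name `parse_header` is hypothetical, and it assumes the version characters are decimal digits, which the table does not state explicitly.

```python
import re

# Pattern built from the Semantics table: the literal "MMC-SPD-V",
# a 1- or 2-character major version, ".", and a 1- or 2-character minor version.
# Assumption: the version characters are decimal digits.
HEADER_RE = re.compile(r"MMC-SPD-V(\d{1,2})\.(\d{1,2})")


def parse_header(header: str) -> tuple[int, int]:
    """Return (major, minor), or raise ValueError if the header is malformed."""
    match = HEADER_RE.fullmatch(header)
    if match is None:
        raise ValueError(f"not a Speech Descriptors header: {header!r}")
    return int(match.group(1)), int(match.group(2))


print(parse_header("MMC-SPD-V2.3"))  # (2, 3)
```

Using `fullmatch` rather than `match` rejects trailing characters, so a three-character subversion such as “MMC-SPD-V2.345” is reported as malformed.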

5     Conformance Testing

A Data instance Conforms with MPAI-MMC V2.3 Speech Descriptors (MMC-SPD) if:

  1. The Data validates against the Speech Descriptors’ JSON Schema.
  2. All Data in the Speech Descriptors’ JSON Schema:
    1. Have the specified type
    2. Validate against their JSON Schemas
    3. Conform with their Data Qualifiers if present.
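
The “have the specified type” check can be illustrated as follows. This is only a sketch under stated assumptions, not a conformance test: it does not replace validation against the published JSON Schema, the mapping `EXPECTED_TYPES` and the function `type_errors` are hypothetical, and the flat layout of the sample object is chosen for illustration and may not match the nesting defined by the schema. The numeric types follow the Semantics section, which describes pitch, intensity, and speed as real numbers.

```python
import json

# Hypothetical mapping from field names in the Semantics section to Python types.
# Assumption: pitch, intensity and speed are JSON numbers, and the tone names
# are JSON strings.
EXPECTED_TYPES = {
    "pitch": (int, float),
    "intensity": (int, float),
    "speed": (int, float),
    "toneName": str,
    "toneSetName": str,
}


def type_errors(features: dict) -> list[str]:
    """Return the names of fields present in `features` that have the wrong type."""
    errors = []
    for name, expected in EXPECTED_TYPES.items():
        if name not in features:
            continue  # whether a field is required is left to the JSON Schema
        value = features[name]
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, expected):
            errors.append(name)
    return errors


sample = json.loads('{"pitch": 182.5, "intensity": "loud", "speed": 4.2}')
print(type_errors(sample))  # ['intensity']
```

A full conformance test would additionally validate the instance and every nested Data object against its JSON Schema and check any Data Qualifiers that are present.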

6     Performance Assessment