1 Definition | 2 Functional Requirements | 3 Syntax |
4 Semantics | 5 Conformance Testing | 6 Performance Assessment |
1 Definition
A Data Type representing various features of a Speech Segment, including speaker identity, prosody, and additional vocal elements such as tension, whispery quality, or creaky voice.
2 Functional Requirements
Speech Descriptors may include Neural Network Descriptors.
3 Syntax
https://schemas.mpai.community/MMC/V2.3/data/SpeechDescriptors.json
4 Semantics
Label | Size | Description |
Header | N1 Bytes | Speech Descriptors Header |
– Standard – Speech Descriptors | 9 Bytes | The characters “MMC-SPD-V” |
– Version | N2 Bytes | Major version – 1 or 2 characters |
– Dot-separator | 1 Byte | The character “.” |
– Subversion | N3 Bytes | Minor version – 1 or 2 characters |
MInstanceID | N4 Bytes | ID of the Metaverse Instance. |
SpeechDescriptorsID | N5 Bytes | ID of Speech Descriptors. |
SpeechDescriptorsData | N7 Bytes | Data associated with Input Text. |
SpeechFeatures | N8 Bytes | Indicates characteristic elements extracted from the input speech, specifically pitch, tone, intonation, intensity, speed, emotion, and NNSpeechFeatures. |
NNSpeechFeatures | N9 Bytes | Indicates characteristic elements extracted from the input speech by a Neural Network. |
pitch | N10 Bytes | Indicates the fundamental frequency of Speech expressed as a real number in Hz (hertz). |
tone | N11 Bytes | Tone is a variation in the pitch of the voice while speaking, expressed as human-readable words as in Table 48. |
ToneType | N12 Bytes | Indicates the Tone that the input speech carries. |
intonation | N13 Bytes | A variation of the pitch, intensity and speed within a time period measured in seconds. |
intensity | N14 Bytes | Energy of Speech expressed as a real number in dB (decibels). |
speed | N7 Bytes | Indicates the Speech Rate as a real number indicating specified linguistic units (e.g., Phonemes, Syllables, or Words) per second. |
emotion | N15 Bytes | Indicates the Emotion that the input speech carries. |
EmotionType | N16 Bytes | Indicates the Emotion that the input speech carries. |
toneName | N17 Bytes | Specifies the name of a Tone. |
toneSetName | N18 Bytes | Name of the Tone set which contains the Tone. Tone set is used as a baseline, but other sets are possible. |
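The Header layout in the table above (the characters "MMC-SPD-V", a major version of 1 or 2 characters, the "." separator, and a minor version of 1 or 2 characters) can be illustrated with a minimal Python sketch. The function name `parse_header` is hypothetical, and the sketch assumes the version characters are decimal digits, which the table does not state explicitly.

```python
import re
from typing import Tuple

# Header layout taken from the Semantics table:
#   "MMC-SPD-V" + major version (1 or 2 characters) + "." + minor version (1 or 2 characters)
# Assumption: the version characters are decimal digits.
HEADER_RE = re.compile(r"MMC-SPD-V(\d{1,2})\.(\d{1,2})")


def parse_header(header: str) -> Tuple[int, int]:
    """Return (major, minor) for a Speech Descriptors Header, or raise ValueError."""
    match = HEADER_RE.fullmatch(header)
    if match is None:
        raise ValueError(f"not a Speech Descriptors header: {header!r}")
    return int(match.group(1)), int(match.group(2))
```

For example, `parse_header("MMC-SPD-V2.3")` yields the major and minor version as a pair of integers, while a string with a different prefix or a missing separator raises `ValueError`.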
5 Conformance Testing
A Data instance Conforms with MPAI-MMC V2.3 Speech Descriptors (MMC-SPD) if:
- The Data validates against the Speech Descriptors’ JSON Schema.
- All Data in the Speech Descriptors’ JSON Schema:
  - Have the specified type.
  - Validate against their JSON Schemas.
  - Conform with their Data Qualifiers, if present.
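As an illustration of the "specified type" condition only, the sketch below checks that the fields the Semantics table describes as real numbers (pitch, intensity, speed) actually hold finite real values. The flat dictionary layout and the function name `check_real_fields` are assumptions made for this example. The actual structure is defined by the published JSON Schema, and this check does not replace validation against that schema.

```python
import math
from numbers import Real

# Hypothetical flat layout: the keys mirror labels from the Semantics table,
# but the real nesting is defined by the published JSON Schema.
REAL_VALUED_FIELDS = ("pitch", "intensity", "speed")


def check_real_fields(features: dict) -> list:
    """Return a list of problems found; an empty list means these checks passed."""
    problems = []
    for key in REAL_VALUED_FIELDS:
        if key not in features:
            continue  # whether a field is required is left to the schema
        value = features[key]
        # bool is a subclass of int in Python, so it is rejected explicitly.
        if isinstance(value, bool) or not isinstance(value, Real):
            problems.append(f"{key}: expected a real number, got {type(value).__name__}")
        elif not math.isfinite(value):
            problems.append(f"{key}: must be finite")
    return problems
```

Returning a list of problems rather than raising on the first one makes it possible to report every type mismatch in a single pass.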
6 Performance Assessment