MPAI-MMC V2.5 AIMs Automatic Speech Recognition

Function Ref. Model I/O Data SubAIMs JSON MData Profiles Ref. Software Conformance Performance

1 Functions

The Automatic Speech Recognition (MMC-ASR) AIM extracts the text conveyed by an utterance and derives the paralinguistic features of the speech signal. The input speech may be accompanied by an auxiliary text, the identifier of the speaker, the Speech Overlap data type, and the time indicating the portion of the input speech that should be recognised.

In V2.5, MMC-ASR extends its output with Speech Descriptors, providing downstream AIMs with paralinguistic information — including prosody, vocal affect, and speaker characteristics — derived from the same speech signal used for recognition. This information has general value across MPAI applications and requires no additional input beyond what is already provided to the AIM.

Receives	Language Selector	Signalling the language of the speech.
	Auxiliary Text	Text that may be used to provide context information.
	Speech Object	Speech to be recognised.
	Speaker ID	ID of speaker uttering speech.
	Speech Overlap	Data type providing information on speech overlap.
	Speaker Time	Time during which the speech is to be recognised.
Produces	Recognised Text	Text transcript of the recognised speech.
	Speech Descriptors	Paralinguistic features of the recognised speech segment.

2 Reference Model

Figure 1 depicts the Reference Model of the Automatic Speech Recognition (MMC-ASR) AIM.

Automatic Speech Recognition MMC-ASR V2.5

Figure 1 – The Automatic Speech Recognition (MMC-ASR) AIM

3 Input/Output Data

Table 1 specifies the Input and Output Data of the Automatic Speech Recognition (MMC-ASR) AIM.

Table 1 – I/O Data of the Automatic Speech Recognition (MMC-ASR) AIM

Input	Description
Language Selector	Selects input language.
Auxiliary Text Object	Text Object with content related to Speech Object.
Speech Object	Speech Object emitted by Entity.
Speaker Identifier	Identity of Speaker.
Speech Overlap	Times and IDs of overlapping speech segments.
Speaker Time	Time during which Speech is recognised.
Output	Description
Recognised Text Object	Output of the Automatic Speech Recognition AIM — a Text Segment or just a string.
Speech Descriptors	Paralinguistic features of the speech segment including prosody, vocal affect, speaker characteristics, and optional neural network descriptors.

4 SubAIMs

No SubAIMs.

5 JSON Metadata

https://schemas.mpai.community/MMC/V2.5/AIMs/AutomaticSpeechRecognition.json

6 Profiles

No Profiles.

7 Reference Software

8 Conformance Testing

Table 2 provides the Conformance Testing Method for the MMC-ASR AIM.

If a schema contains references to other schemas, conformance of data for the primary schema implies that any data referencing a secondary schema shall also validate against the relevant schema, if present, and conform with the Qualifier, if present.

Table 2 – MMC-ASR AIM Conformance Testing

Input	Language Selector	Shall validate against the Language Selector part of the schema.
	Auxiliary Text	Shall validate against the Text Object schema. Text Data shall conform with the Text Qualifier.
	Speech Object	Shall validate against the Speech Object schema. Speech Data shall conform with the Speech Qualifier.
	Speaker ID	Shall validate against the Instance ID schema.
	Speech Overlap	Shall validate against the Speech Overlap schema.
	Speaker Time	Shall validate against the Time schema.
Output	Text Object	Shall validate against the Text Object schema. Text Data shall conform with the Text Qualifier, e.g. output language shall be that indicated by the Language Selector.
	Speech Descriptors	Shall validate against the Speech Descriptors schema.

9 Performance Assessment

Performance Assessment of an ASR Implementation (ASRI) can be performed for a language for which there is a dataset of speech segments of various durations with corresponding Transcription Text. An MMC-ASR AIM Performance Assessment Report shall be based on the following steps and specify the input dataset used.

For each Recognised Text produced by the ASRI being Assessed for Performance in response to a speech segment provided as input:

Compare the Recognised Text with the Transcription Text.
Compute the Word Error Rate (WER) defined as the sum of deletion, insertion, and substitution errors in the Recognised Text compared to the Transcription Text, divided by the total number of words in the Transcription Text.

Performance Assessment of an ASRI for a language in a Performance Assessment Report is defined as “The WER computed on all speech segments included in the reported dataset”.

Go To MPAI-MMC V2.5 AI Modules

Cookie	Duration	Description
cookielawinfo-checkbox-necessary	1 year	This cookie is set by GDPR Cookie Consent plugin. The cookies is used to store the user consent for the cookies in the category "Technical".
CookieLawInfoConsent	1 year	The cookie is set by the GDPR Cookie Consent plug-in and is used to store whether the user has consented to the use of cookies or not. It does not store any personal data.
viewed_cookie_policy	1 year	The cookie is set by the GDPR Cookie Consent plugin and is used to store whether or not user has consented to the use of cookies. It does not store any personal data.
_pk_id.6.08a8	13 months	Used to store a few details about the user such as the unique visitor ID
_pk_ses.6.08a8	30 minutes	Short lived cookies used to temporarily store data for the visit