1 Scope of Conversation with Emotion
2 Reference Architecture of Conversation with Emotion
3 I/O Data of Conversation with Emotion
4 Functions of AI Modules of Conversation with Emotion
5 I/O Data of AI Modules of Conversation with Emotion
6 Specification of Conversation with Emotion AIMs and JSON Metadata
1 Scope of Conversation with Emotion
In the Conversation with Emotion (MMC-CWE) Use Case, a machine responds to a human’s textual and/or vocal utterance in a manner consistent with the human’s utterance and emotional state, as detected from the human’s text, speech, or face. The machine responds using text, synthetic speech, and a face whose lip movements are synchronised with the synthetic speech and the synthetic machine emotion.
2 Reference Architecture of Conversation with Emotion
Figure 1 gives the Reference Model of Conversation with Emotion, including the input/output data, the AIMs, the AIM topology, and the data exchanged between and among the AIMs.
Figure 1 – Reference Model of Conversation with Emotion
The operation of Conversation with Emotion develops as follows (a non-normative code sketch of this topology follows the list):
- Automatic Speech Recognition produces Recognised Text from the Speech Object.
- Input Speech Description and PS-Speech Interpretation produce Emotion (Speech).
- Input Face Description and PS-Face Interpretation produce Emotion (Face).
- Natural Language Understanding refines Recognised Text and produces Meaning (i.e., Text Descriptors).
- PS-Text Interpretation produces Emotion (Text) from the Text Descriptors.
- The Multimodal Emotion Fusion AIM fuses Emotion (Text), Emotion (Speech), and Emotion (Face) into the Input Emotion.
- The Entity Dialogue Processing AIM produces Machine Text and Machine Emotion based on the Input Emotion and Meaning.
- The Text-to-Speech AIM produces Output Speech from Machine Text and Machine Emotion.
- The Video Lip Animation AIM animates the lips of a face drawn from the Video Faces KB consistently with the Output Speech and the Machine Emotion.
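The following Python sketch is a non-normative illustration of this topology. AIM implementations are injected as plain callables; all function and field names are assumptions introduced here for readability, and only the data flow mirrors the tables below.

```python
from dataclasses import dataclass
from typing import Any, Callable

@dataclass
class CWEPipeline:
    """Illustrative wiring of the MMC-CWE AIM topology (not normative)."""
    asr: Callable[[Any], Any]           # Speech Object -> Recognised Text
    speech_desc: Callable[[Any], Any]   # Speech Object -> Speech Descriptors
    face_desc: Callable[[Any], Any]     # Face Object -> Face Descriptors
    nlu: Callable[..., Any]             # Input Selector, Text Object, Recognised Text
                                        #   -> (Refined Text, Text Descriptors)
    ps_speech: Callable[[Any], Any]     # Speech Descriptors -> Emotion (Speech)
    ps_face: Callable[[Any], Any]       # Face Descriptors -> Emotion (Face)
    ps_text: Callable[[Any], Any]       # Text Descriptors -> Emotion (Text)
    fusion: Callable[..., Any]          # three Emotions -> Input Emotion
    dialogue: Callable[..., Any]        # -> (Machine Text, Machine Emotion)
    tts: Callable[..., Any]             # Machine Text, Machine Emotion -> Output Speech
    lip_anim: Callable[..., Any]        # Output Speech, Machine Emotion -> Face Object

    def run(self, input_selector, text_object, speech_object, face_object):
        recognised_text = self.asr(speech_object)
        emotion_speech = self.ps_speech(self.speech_desc(speech_object))
        emotion_face = self.ps_face(self.face_desc(face_object))
        refined_text, text_descriptors = self.nlu(
            input_selector, text_object, recognised_text)
        emotion_text = self.ps_text(text_descriptors)
        input_emotion = self.fusion(emotion_text, emotion_speech, emotion_face)
        machine_text, machine_emotion = self.dialogue(
            input_selector, text_descriptors, refined_text, text_object, input_emotion)
        output_speech = self.tts(machine_text, machine_emotion)
        output_face = self.lip_anim(output_speech, machine_emotion)
        return machine_text, output_speech, output_face
```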
3 I/O Data of Conversation with Emotion
The input and output data of the Conversation with Emotion Use Case are given in Table 1; a non-normative sketch of these data as types follows the table.
Table 1 – I/O Data of Conversation with Emotion
Input | Description
Input Selector | Data determining the use of Speech vs Text.
Text Object | Text typed by the human, either as an additional information stream or as a replacement of Speech, depending on the value of Input Selector.
Speech Object | Speech of the human having a conversation with the machine.
Face Object | Visual information of the face of the human having a conversation with the machine.
Output | Description
Text Object | Text of the reply produced by the machine.
Speech Object | Synthetic Speech produced by the machine.
Face Object | Video of a face whose lip movements are synchronised with the Output Speech and the Machine Emotion.
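As a non-normative illustration, the I/O data can be modelled as follows; the concrete representations of Speech Object and Face Object (shown here as raw bytes) and all type names are assumptions, not part of the specification.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional

class InputSelector(Enum):
    """Determines whether the human input is carried by Speech or Text."""
    SPEECH = "speech"
    TEXT = "text"

@dataclass
class CWEInput:
    input_selector: InputSelector
    text_object: Optional[str]      # typed text: extra stream or speech replacement
    speech_object: Optional[bytes]  # human speech; byte representation is an assumption
    face_object: Optional[bytes]    # video of the human face; representation is an assumption

@dataclass
class CWEOutput:
    text_object: str      # text of the machine's reply
    speech_object: bytes  # synthetic speech
    face_object: bytes    # face video, lips synchronised with speech and emotion
```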
4 Functions of AI Modules of Conversation with Emotion
Table 2 provides the functions of the Conversation with Emotion AIMs; a non-normative sketch of one possible Multimodal Emotion Fusion strategy follows the table.
Table 2 – Functions of AI Modules of Conversation with Emotion
AIM | Function
Automatic Speech Recognition | 1. Receives Speech Object. 2. Produces Recognised Text.
Input Speech Description | 1. Receives Speech Object. 2. Produces Speech Descriptors.
Input Face Description | 1. Receives Face Object. 2. Extracts Face Descriptors.
Natural Language Understanding | 1. Receives Input Selector, Text Object, Recognised Text. 2. Produces Meaning (i.e., Text Descriptors) and Refined Text.
PS-Speech Interpretation | 1. Receives Speech Descriptors. 2. Provides the Emotion of the Speech.
PS-Face Interpretation | 1. Receives Face Descriptors. 2. Provides the Emotion of the Face.
PS-Text Interpretation | 1. Receives Text Descriptors. 2. Provides the Emotion of the Text.
Multimodal Emotion Fusion | 1. Receives Emotion (Text), Emotion (Speech), Emotion (Face). 2. Provides the human’s Input Emotion by fusing Emotion (Text), Emotion (Speech), and Emotion (Face).
Entity Dialogue Processing | 1. Receives Refined Text, Meaning, and Input Emotion. 2. Analyses Meaning and Input Text or Refined Text, depending on the value of Input Selector. 3. Produces Machine Emotion and Machine Text.
Text-to-Speech | 1. Receives Machine Text and Machine Emotion. 2. Produces Output Speech.
Video Lip Animation | 1. Receives Machine Speech (Output Speech) and Machine Emotion. 2. Animates the lips of a face obtained by querying the Video Faces KB, consistently with the Machine Emotion. 3. Produces the Face Object with synchronised Speech Object (Machine Speech).
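The specification defines only the interface of Multimodal Emotion Fusion (three per-modality Emotions in, one Input Emotion out), not the fusion algorithm. The following sketch shows one possible, purely illustrative strategy: a majority vote over the modalities that produced an Emotion.

```python
from collections import Counter
from typing import Optional

def multimodal_emotion_fusion(emotion_text: Optional[str],
                              emotion_speech: Optional[str],
                              emotion_face: Optional[str]) -> Optional[str]:
    """Illustrative fusion: majority vote over the available modalities.

    The voting rule is an assumption made here for illustration; the
    standard only fixes the AIM's inputs and output.
    """
    votes = [e for e in (emotion_text, emotion_speech, emotion_face) if e is not None]
    if not votes:
        return None
    # Counter.most_common(1) returns [(label, count)] for the top label;
    # ties are broken by first occurrence (text, then speech, then face).
    return Counter(votes).most_common(1)[0][0]
```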
5 I/O Data of AI Modules of Conversation with Emotion
The AI Modules of Conversation with Emotion perform the Functions specified in Table 2. Table 3 gives the data each AIM receives and produces; a non-normative skeleton of the Entity Dialogue Processing AIM follows the table.
Table 3 – AI Modules of Conversation with Emotion
AIM | Receives | Produces
Automatic Speech Recognition | Speech Object | Recognised Text
Input Speech Description | Speech Object | Speech Descriptors
Input Face Description | Face Object | Face Descriptors
Natural Language Understanding | Input Selector, Text Object, Recognised Text | Refined Text, Text Descriptors
PS-Speech Interpretation | Speech Descriptors | Emotion (Speech)
PS-Face Interpretation | Face Descriptors | Emotion (Face)
PS-Text Interpretation | Text Descriptors | Emotion (Text)
Multimodal Emotion Fusion | Emotion (Text), Emotion (Speech), Emotion (Face) | Input Emotion
Entity Dialogue Processing | Text Descriptors; Refined Text or Input Text, depending on Input Selector; Input Emotion | Machine Text, Machine Emotion
Text-to-Speech | Machine Text, Machine Emotion | Output Speech
Video Lip Animation | Machine Speech, Machine Emotion | Output Visual
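The Entity Dialogue Processing row above is the only conditional one: the AIM analyses Refined Text or Input Text depending on the Input Selector. A minimal, non-normative skeleton follows; the reply logic is a placeholder and all names are assumptions.

```python
from enum import Enum

class InputSelector(Enum):
    """Hypothetical encoding of the Input Selector values."""
    SPEECH = "speech"
    TEXT = "text"

def entity_dialogue_processing(input_selector, text_descriptors,
                               refined_text, input_text, input_emotion):
    """Non-normative skeleton of Entity Dialogue Processing.

    Per Table 3, the AIM analyses Refined Text (speech path) or Input
    Text (text path) depending on the Input Selector, together with the
    Text Descriptors (Meaning) and the Input Emotion, and produces
    Machine Text and Machine Emotion.
    """
    text = refined_text if input_selector is InputSelector.SPEECH else input_text
    # text_descriptors (Meaning) would inform the analysis in a real AIM.
    # Placeholder reply: acknowledge the detected emotion (assumption).
    machine_text = f"I understand you feel {input_emotion} about: {text}"
    machine_emotion = "empathic"  # assumed label; Emotion sets are defined elsewhere
    return machine_text, machine_emotion
```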
6 Specification of Conversation with Emotion AIMs and JSON Metadata
Table 4 – AIMs and JSON Metadata
AIM | Name | JSON
MMC-CWE | Conversation with Emotion | X
– MMC-ASR | Automatic Speech Recognition | X
– MMC-ISD | Input Speech Description | X
– PAF-IFD | Input Face Description | X
– MMC-NLU | Natural Language Understanding | X
– MMC-PSI | PS-Speech Interpretation | X
– PAF-PFI | PS-Face Interpretation | X
– MMC-PTI | PS-Text Interpretation | X
– MMC-MEF | Multimodal Emotion Fusion | X
– MMC-EDP | Entity Dialogue Processing | X
– MMC-TTS | Text-to-Speech | X
– MMC-VLA | Video Lip Animation | X
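The X entries above stand for the JSON Metadata links of the published specification. As a purely hypothetical illustration of what such metadata might carry, the sketch below assembles an entry for MMC-MEF; all field names (Identifier, Ports, etc.) are assumptions and not the normative MPAI schema.

```python
import json

# Hypothetical AIM metadata for MMC-MEF (Multimodal Emotion Fusion).
# Field names are assumptions for illustration; the normative JSON
# schema is defined in the MPAI specifications.
mef_metadata = {
    "Identifier": {
        "Standard": "MPAI-MMC",
        "AIW": "MMC-CWE",
        "AIM": "MMC-MEF",
    },
    "Ports": {
        "Inputs": ["Emotion (Text)", "Emotion (Speech)", "Emotion (Face)"],
        "Outputs": ["Input Emotion"],
    },
}

print(json.dumps(mef_metadata, indent=2))
```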