1     Scope of Conversation with Emotion

2     Reference Architecture of Conversation with Emotion

3     I/O Data of Conversation with Emotion

4     Functions of AI Modules of Conversation with Emotion

5     I/O Data of AI Modules of Conversation with Emotion

6     Specification of Conversation with Emotion AIMs and JSON Metadata

1      Scope of Conversation with Emotion

In the Conversation with Emotion (MMC-CWE) Use Case, a machine responds to a human’s textual and/or vocal utterance in a manner consistent with the human’s utterance and emotional state, as detected from the human’s text, speech, or face. The machine responds using text, synthetic speech, and a face whose lip movements are synchronised with the synthetic speech and the synthetic machine emotion.

2      Reference Architecture of Conversation with Emotion

Figure 1 shows the Reference Model of Conversation with Emotion, including the input/output data, the AI Modules (AIMs), the AIM topology, and the data exchanged among the AIMs.

Figure 1 – Reference Model of Conversation with Emotion

The operation of Conversation with Emotion proceeds as follows:

  1. Automatic Speech Recognition produces Recognised Text.
  2. Input Speech Description and PS-Speech Interpretation produce Emotion (Speech).
  3. Input Face Description and PS-Face Interpretation produce Emotion (Face).
  4. Natural Language Understanding refines Recognised Text and produces Meaning.
  5. Input Text Description and PS-Text Interpretation produce Emotion (Text).
  6. The Multimodal Emotion Fusion AIM fuses all Emotions into the Fused Emotion.
  7. The Entity Dialogue Processing AIM produces a reply based on the Fused Emotion and Meaning.
  8. The Text-To-Speech (Emotion) AIM produces Output Speech from Text with Emotion.
  9. The Video Lip Animation AIM animates the lips of a Face drawn from the Video of Faces KB consistently with the Output Speech and the Output Emotion.
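The nine steps above can be sketched as a chain of function calls. This is a minimal illustration of the data flow only, not an implementation of the AIMs: every function name and every return value is a hypothetical stand-in, the majority-vote fusion and the emotion-mirroring reply are assumptions, Emotion (Speech) is taken from PS-Speech Interpretation as in Table 2, and the Text Descriptors are taken to be the Meaning produced by Natural Language Understanding.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class MachineReply:
    text: str    # Machine Text
    speech: str  # Output Speech
    face: str    # Face with animated lips


# Hypothetical stand-ins for the AIMs: each returns a labelled string so that
# only the flow of data between modules is illustrated.
def automatic_speech_recognition(speech):
    return f"recognised:{speech}"

def input_speech_description(speech):
    return f"speech-descriptors:{speech}"

def input_face_description(face):
    return f"face-descriptors:{face}"

def natural_language_understanding(recognised_text):
    refined_text = recognised_text.strip()
    return refined_text, f"meaning:{refined_text}"

def ps_speech_interpretation(descriptors):
    return "happy"    # dummy Emotion (Speech)

def ps_face_interpretation(descriptors):
    return "happy"    # dummy Emotion (Face)

def ps_text_interpretation(descriptors):
    return "neutral"  # dummy Emotion (Text)

def multimodal_emotion_fusion(*emotions):
    # Assumed fusion rule (not from the specification): majority vote.
    return Counter(emotions).most_common(1)[0][0]

def entity_dialogue_processing(refined_text, meaning, fused_emotion):
    # Assumed behaviour: the machine mirrors the fused human emotion.
    return f"reply to '{refined_text}'", fused_emotion

def text_to_speech(machine_text, machine_emotion):
    return f"speech[{machine_emotion}]:{machine_text}"

def video_lip_animation(output_speech, machine_emotion):
    return f"face[{machine_emotion}] lip-synced to <{output_speech}>"


def conversation_with_emotion(speech, face):
    recognised_text = automatic_speech_recognition(speech)                        # step 1
    emotion_speech = ps_speech_interpretation(input_speech_description(speech))   # step 2
    emotion_face = ps_face_interpretation(input_face_description(face))           # step 3
    refined_text, meaning = natural_language_understanding(recognised_text)       # step 4
    emotion_text = ps_text_interpretation(meaning)                                # step 5
    fused = multimodal_emotion_fusion(emotion_text, emotion_speech, emotion_face) # step 6
    machine_text, machine_emotion = entity_dialogue_processing(
        refined_text, meaning, fused)                                             # step 7
    output_speech = text_to_speech(machine_text, machine_emotion)                 # step 8
    output_face = video_lip_animation(output_speech, machine_emotion)             # step 9
    return MachineReply(machine_text, output_speech, output_face)
```

Steps 1 to 3 depend only on the inputs, so an actual deployment could run them in parallel; the sketch runs them sequentially for readability.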

3      I/O Data of Conversation with Emotion

The input and output data of the Conversation with Emotion Use Case are:

Table 1 – I/O Data of Conversation with Emotion

| Input | Description |
|---|---|
| Input Selector | Data determining the use of Speech vs Text. |
| Text Object | Text typed by the human as additional information stream or as a replacement of the speech depending on the value of Input Selector. |
| Speech Object | Speech of the human having a conversation with the machine. |
| Face Object | Visual information of the Face of the human having a conversation with the machine. |

| Output | Description |
|---|---|
| Text Object | Text of the Speech produced by the Machine. |
| Speech Object | Synthetic Speech produced by the Machine. |
| Face Object | Video of a Face whose lip movements are synchronised with the Output Speech and the synthetic machine emotion. |
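As an illustration, the I/O data of Table 1 could be carried in two small records. The class names, field names, field types, and the validation rule below are assumptions made for this sketch; the specification defines the data types themselves elsewhere.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class InputSelector(Enum):
    """Hypothetical encoding of the Input Selector (Speech vs Text)."""
    SPEECH = "speech"
    TEXT = "text"


@dataclass
class CWEInput:
    selector: InputSelector
    text: Optional[str] = None      # Text Object
    speech: Optional[bytes] = None  # Speech Object
    face: Optional[bytes] = None    # Face Object

    def validate(self):
        # Assumed rule: the modality chosen by the selector must be present.
        if self.selector is InputSelector.TEXT and not self.text:
            raise ValueError("Input Selector is TEXT but no Text Object was given")
        if self.selector is InputSelector.SPEECH and not self.speech:
            raise ValueError("Input Selector is SPEECH but no Speech Object was given")
        return self


@dataclass
class CWEOutput:
    text: str      # Text of the Speech produced by the Machine
    speech: bytes  # Synthetic Speech produced by the Machine
    face: bytes    # Video of a Face with synchronised lip movements
```

Text remains optional when the selector is SPEECH because Table 1 allows it to act as an additional information stream.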

4      Functions of AI Modules of Conversation with Emotion

Table 2 provides the functions of the Conversation with Emotion AIMs.

Table 2 – Functions of AI Modules of Conversation with Emotion

| AIM | Function |
|---|---|
| Automatic Speech Recognition | 1. Receives Speech Object. 2. Produces Recognised Text. |
| Input Speech Description | 1. Receives Speech Object. 2. Produces Speech Descriptors. |
| Input Face Description | 1. Receives Face Object. 2. Extracts Face Descriptors. |
| Natural Language Understanding | 1. Receives Input Selector, Text Object, Recognised Text. 2. Produces Meaning (i.e., Text Descriptors), Refined Text. |
| PS-Speech Interpretation | 1. Receives Speech Descriptors. 2. Provides the Emotion of the Speech. |
| PS-Face Interpretation | 1. Receives Face Descriptors. 2. Provides the Emotion of the Face. |
| PS-Text Interpretation | 1. Receives Text Descriptors. 2. Provides the Emotion of the Text. |
| Multimodal Emotion Fusion | 1. Receives Emotion (Text), Emotion (Speech), Emotion (Face). 2. Provides human’s Input Emotion by fusing Emotion (Text), Emotion (Speech), and Emotion (Face). |
| Entity Dialogue Processing | 1. Receives Refined Text, Meaning, Input Emotion. 2. Analyses Meaning and Input Text or Refined Text, depending on the value of Input Selector. 3. Produces Machine Emotion and Machine Text. |
| Text-to-Speech | 1. Receives Machine Text and Machine Emotion. 2. Produces Output Speech. |
| Video Lip Animation | 1. Receives Machine Speech and Machine Emotion. 2. Animates the lips of a video obtained by querying the Video Faces KB, using the Output Emotion. 3. Produces Face Object with synchronised Speech Object (Machine Object). |
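Table 2 states that Multimodal Emotion Fusion fuses the three Emotions but does not say how. The sketch below shows one possible rule, a weighted average of per-label scores, purely as an assumption: the score dictionaries, the weights, and the function name `fuse_emotions` are all hypothetical and not taken from the specification.

```python
def fuse_emotions(text, speech, face, weights=(1.0, 1.0, 1.0)):
    """Fuse three emotion score dictionaries into one label and its scores.

    Each argument maps an emotion label to a score in [0, 1]; a label missing
    from a source counts as 0. The weighting rule is an assumption.
    """
    sources = (text, speech, face)
    labels = set().union(*sources)  # every label seen in any modality
    total = sum(weights)
    fused = {
        label: sum(w * s.get(label, 0.0) for w, s in zip(weights, sources)) / total
        for label in labels
    }
    # The fused Input Emotion is the label with the highest averaged score.
    return max(fused, key=fused.get), fused


label, scores = fuse_emotions(
    {"happy": 0.6, "sad": 0.4},  # Emotion (Text)
    {"happy": 0.2, "sad": 0.8},  # Emotion (Speech)
    {"happy": 0.9, "sad": 0.1},  # Emotion (Face)
)
```

Raising the weight of one modality, for example when the Face is partly occluded and the Speech is clean, shifts the fused result towards that modality.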

5      I/O Data of AI Modules of Conversation with Emotion

The AI Modules of Conversation with Emotion receive and produce the data specified in Table 3.

Table 3 – I/O Data of AI Modules of Conversation with Emotion

| AIM | Receives | Produces |
|---|---|---|
| Automatic Speech Recognition | Speech Object | Recognised Text |
| Input Speech Description | Speech Object | Speech Descriptors |
| Input Face Description | Face Object | Face Descriptors |
| Natural Language Understanding | Recognised Text | Refined Text; Text Descriptors |
| PS-Speech Interpretation | Speech Descriptors | Emotion (Speech) |
| PS-Face Interpretation | Face Descriptors | Emotion (Face) |
| PS-Text Interpretation | Text Descriptors | Emotion (Text) |
| Multimodal Emotion Fusion | Emotion (Text); Emotion (Speech); Emotion (Face) | Input Emotion |
| Entity Dialogue Processing | 1. Text Descriptors; 2. Based on Input Selector: 2.1. Refined Text, 2.2. Input Text; 3. Input Emotion | 1. Machine Text; 2. Machine Emotion |
| Text-to-Speech | 1. Machine Text; 2. Machine Emotion | Output Speech |
| Video Lip Animation | 1. Machine Emotion; 2. Machine Speech | Output Visual |
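The AIM topology implied by Table 3 can be derived mechanically by matching each received data item to the AIM that produces it. The sketch below transcribes the table and uses the standard-library `graphlib` module (Python 3.9 or later) to compute one valid execution order. It assumes that the "Machine Speech" received by Video Lip Animation is the "Output Speech" produced by Text-to-Speech; data items that no AIM produces are treated as inputs to the use case.

```python
from graphlib import TopologicalSorter

# AIM -> (data received, data produced), transcribed from Table 3.
AIMS = {
    "Automatic Speech Recognition": (["Speech Object"], ["Recognised Text"]),
    "Input Speech Description": (["Speech Object"], ["Speech Descriptors"]),
    "Input Face Description": (["Face Object"], ["Face Descriptors"]),
    "Natural Language Understanding": (["Recognised Text"],
                                       ["Refined Text", "Text Descriptors"]),
    "PS-Speech Interpretation": (["Speech Descriptors"], ["Emotion (Speech)"]),
    "PS-Face Interpretation": (["Face Descriptors"], ["Emotion (Face)"]),
    "PS-Text Interpretation": (["Text Descriptors"], ["Emotion (Text)"]),
    "Multimodal Emotion Fusion": (["Emotion (Text)", "Emotion (Speech)", "Emotion (Face)"],
                                  ["Input Emotion"]),
    "Entity Dialogue Processing": (["Text Descriptors", "Refined Text", "Input Text",
                                    "Input Emotion"],
                                   ["Machine Text", "Machine Emotion"]),
    "Text-to-Speech": (["Machine Text", "Machine Emotion"], ["Output Speech"]),
    "Video Lip Animation": (["Machine Emotion", "Machine Speech"], ["Output Visual"]),
}

# Assumption: these two names denote the same data item.
ALIASES = {"Machine Speech": "Output Speech"}

# Data item -> the AIM that produces it.
PRODUCER = {data: aim for aim, (_, produced) in AIMS.items() for data in produced}


def dependencies():
    """Map each AIM to the set of AIMs whose output it consumes."""
    deps = {}
    for aim, (received, _) in AIMS.items():
        names = (ALIASES.get(d, d) for d in received)
        deps[aim] = {PRODUCER[d] for d in names if d in PRODUCER}
    return deps


# One valid execution order: every AIM appears after all of its producers.
ORDER = list(TopologicalSorter(dependencies()).static_order())
```

`TopologicalSorter` raises `CycleError` if the table ever describes a circular dependency, which makes the same sketch usable as a consistency check on the table.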

6     Specification of Conversation with Emotion AIMs and JSON Metadata

Table 4 – AIMs and JSON Metadata

| AIMs | Name | JSON |
|---|---|---|
| MMC-CWE | Conversation With Emotion | X |
| MMC-ASR | Automatic Speech Recognition | X |
| MMC-ISD | Input Speech Description | X |
| PAF-IFD | Input Face Description | X |
| MMC-NLU | Natural Language Understanding | X |
| MMC-PSI | PS-Speech Interpretation | X |
| PAF-PFI | PS-Face Interpretation | X |
| MMC-PTI | PS-Text Interpretation | X |
| MMC-MEF | Multimodal Emotion Fusion | X |
| MMC-EDP | Entity Dialogue Processing | X |
| MMC-TTS | Text-to-Speech | X |
| MMC-VLA | Video Lip Animation | X |
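The identifiers in Table 4 combine a prefix and an acronym separated by a hyphen. The sketch below groups the table entries by prefix and serialises the result as JSON. It is only an illustration of how the identifiers can be handled: the function name `by_prefix` and the output layout are assumptions, and the output is not the JSON metadata that Table 4 points to.

```python
import json

# Identifier -> name, transcribed from Table 4.
AIM_IDS = {
    "MMC-CWE": "Conversation With Emotion",
    "MMC-ASR": "Automatic Speech Recognition",
    "MMC-ISD": "Input Speech Description",
    "PAF-IFD": "Input Face Description",
    "MMC-NLU": "Natural Language Understanding",
    "MMC-PSI": "PS-Speech Interpretation",
    "PAF-PFI": "PS-Face Interpretation",
    "MMC-PTI": "PS-Text Interpretation",
    "MMC-MEF": "Multimodal Emotion Fusion",
    "MMC-EDP": "Entity Dialogue Processing",
    "MMC-TTS": "Text-to-Speech",
    "MMC-VLA": "Video Lip Animation",
}


def by_prefix(ids):
    """Group identifiers such as 'MMC-ASR' by the part before the hyphen."""
    groups = {}
    for identifier, name in ids.items():
        prefix, acronym = identifier.split("-", 1)
        groups.setdefault(prefix, {})[acronym] = name
    return groups


print(json.dumps(by_prefix(AIM_IDS), indent=2))
```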