Multi-Modal Conversation (MPAI-MMC)

MPAI Application Note #6

Proponent: Miran Choi (ETRI)

Description: Owing to recent advances of AI technologies, natural language processing started to be widely used in various applications. One of the useful applications is the conversational partner which provides the user with information, entertains, chats and answers questions through the speech interface. However, an application should include more than just a speech interface to provide a better service to the user. For example, emotion recognizer and gesture interpreter are needed for better multi-modal interfaces.

Multi-modal conversation (MPAI-MMC) aims to enable human-machine conversation that emulates human-human conversation in completeness and intensity by using AI.

The interaction of AI processing modules implied by a multi-modal conversation system would look approximately as presented in Figure 1, where one can see a language understanding module, a speech recognition module, image analysis module, a dialog processing module, and a speech synthesis module.

Figure 1 – Multi-Modal Conversation (emotion-focused)

Comments: The processing modules of the MPAI-MMC instance of Figure 1 would be operated in the MPAI-AIF framework.

Examples

The example of MMC is the conversation between a human user and a computer/robot as in the following list. The input from the user can be voice, text or image or combination of different inputs. Considering emotion of the human user, MMC will output responses in a text, speech, music depending on the user’s needs.

Chats: “I am bored. What should I do now?” – “You look tired. Why don’t you take a walk?”
Question Answering: “Who is the famous artist in Barcelona?” – “Do you mean Gaudi?”
Information Request: “What’s the weather today?” – “It is a little cloudy and cold.”
Action Request: “Play some classical music, please” – “OK. Do you like Brahms?”

Processing modules involved in MMC:

A preliminary list of processing modules is given below:

Fusion of multi-modal input information
Natural language understanding
Natural language generation
Speech recognition
Speech synthesis
Emotion recognition
Intention understanding
Image analysis
Knowledge fusion from different sources such as speech, facial expression, gestures, etc
Dialog processing
Question Answering
Machine Reading Comprehension (MRC)
Speech Synthesis

Requirements:

This is the initial functional requirements that will be developed in the full set in the Functional Requirements (FR) phase..

The standard shall specify the following natural input signals

Sound signals from microphone
Text from keyboard or keypad
Images from the camera

The standard shall specify a user profile format (e.g. gender, age, specific needs, etc.)
The standard shall support emotion-based dialog processing that uses emotion result from the emotion recognition as input and decides the replies based on the user’s intention as output.
The standard should provide means to carry emotion and user preferences in the speech synthesis processing module.
Processing modules should be agnostic to AI, ML or DP technology: it should be general enough to avoid limitations in terms of algorithmic structure, storage and communication and allow full interoperability with other processing modules.
The standard should provide support for the storage of, and access to:

Unprocessed data in speech, text or image form
Processed data in the form of annotations (semantic labelling). Such annotations can be produced as the result of primary analysis of the unprocessed data or come from external sources such as knowledge base.
meta-data (such as collection date and place; classification data)
Support for the structured data produced from the raw data.

The standard should also provide support for:

The combination into a general analysis workflow of a number of computational blocks that access processed, and possibly unprocessed, data such as input channels, and produce output as a sequence of vectors in a space of arbitrary dimension.
The possibility of defining and implementing a novel processing block from scratch in terms of either some source code or a proprietary binary codec
A number of pre-defined blocks that implement well-known analysis methods (such as NN-based methods).
The parallel and sequential combination of processing modules that comprise different services.
The real time processing for the conversation between the user and the robot/computer.

Object of standard: Interfaces of processing components utilized in multimodal communication.

Input interfaces: how to deal with inputs in different formats
Processing component interfaces: interfaces between a set of updatable and extensible processing modules
Delivery protocol interfaces: Interfaces of the processed data signal to a variety of delivery protocols
Framework: the glue keeping the pieces together => mapping to MPAI-AIF

Benefits:

Decisively improve communication between humans and machines and the user experience
Reuse of processing components for different applications
Create a horizontal market of multimodal conversational components
Make market more competitive

Bottlenecks:

Some processing units should be improved because end-to-end processing has lower performances compared to modular approaches. Therefore, the standard should be able to cover the traditional method as well as hybrid approaches.

Social aspects:

Enhanced user interfaces will provide accessibility for people with disabilities. MMC can also be used in care giving services for elderly and patients.

Success criteria:

How MMC can be extended to different services by combining several processing modules easily and easily.
The performance of multi-modality compared to uni-modality in the user interface.
Interconnection and integration among different processing modules

Cookie	Duration	Description
cookielawinfo-checkbox-necessary	1 year	This cookie is set by GDPR Cookie Consent plugin. The cookies is used to store the user consent for the cookies in the category "Technical".
CookieLawInfoConsent	1 year	The cookie is set by the GDPR Cookie Consent plug-in and is used to store whether the user has consented to the use of cookies or not. It does not store any personal data.
viewed_cookie_policy	1 year	The cookie is set by the GDPR Cookie Consent plugin and is used to store whether or not user has consented to the use of cookies. It does not store any personal data.
_pk_id.6.08a8	13 months	Used to store a few details about the user such as the unique visitor ID
_pk_ses.6.08a8	30 minutes	Short lived cookies used to temporarily store data for the visit

Application Note

Multi-Modal Conversation (MPAI-MMC)

MPAI Application Note #6

Notice