
1     Functions 2     Reference Model 3     I/O Data
4     Functions of AI Modules 5     I/O Data of AI Modules 6     AIW, AIMs, and JSON Metadata
7     Reference Software 8     Conformance Testing 9     Performance Assessment

1      Functions

In a Question Answering (QA) System, a machine provides answers to a user’s question presented in natural language. Multimodal Question Answering improves on current QA systems, which can only handle text or speech inputs, by letting the human present an image together with a spoken or typed question. For example, a user might ask “Where can I buy this tool?” while showing a picture of the tool, without needing to show their face. In the Multimodal Question Answering (MMC-MQA) Use Case, a machine responds to a question that a user expresses in text or speech while showing an object. The machine’s response may use text and synthetic speech.

2      Reference Model

Figure 1 gives the Multimodal Question Answering Reference Model including the input/output data, the AIMs, and the data exchanged between and among the AIMs.

 

Figure 1 – Reference Model of Multimodal Question Answering

The operation of Multimodal Question Answering proceeds as follows:

  1. Input Selector informs the machine whether the human uses Text or Speech to query it.
  2. Depending on the value of Input Selector, Natural Language Understanding either:
    • Extracts the Meaning of the question from the Recognised Text produced by Automatic Speech Recognition, and refines that text; or
    • Extracts the Meaning of the question from Input Text.
  3. Visual Object Identification identifies the Visual Object.
  4. Question Analysis Module determines the Intention of the question.
  5. Answer to Question Module uses Intention and Meaning to produce the answer as Machine Text.
  6. Text-to-Speech produces Machine Speech from Machine Text.
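
The six steps above can be sketched as a small driver function. This is only an illustration under stated assumptions: the `Aims` bundle, `MqaResult`, `run_mqa`, and the use of a boolean in place of Input Selector are hypothetical names, and each AIM is modelled as a plain callable whose inputs and outputs follow the AIM I/O table in Section 5. The text does not say which AIM consumes the identifier produced in step 3, so the sketch simply carries it along in the result.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple


@dataclass
class Aims:
    """Hypothetical bundle of AIM implementations, one callable per AIM."""
    visual_object_identification: Callable[[bytes], str]
    automatic_speech_recognition: Callable[[bytes], str]
    # Returns (refined_text, meaning).
    natural_language_understanding: Callable[[str], Tuple[str, str]]
    question_analysis: Callable[[str], str]
    # Receives (text, intention, meaning) and returns Machine Text.
    answer_to_question: Callable[[str, str, str], str]
    text_to_speech: Callable[[str], bytes]


@dataclass
class MqaResult:
    instance_identifier: str
    machine_text: str
    machine_speech: bytes


def run_mqa(
    aims: Aims,
    use_speech: bool,
    visual_object: bytes,
    text_object: Optional[str] = None,
    speech_object: Optional[bytes] = None,
) -> MqaResult:
    # Steps 1-2: the selector decides where the question text comes from.
    if use_speech:
        if speech_object is None:
            raise ValueError("Speech selected but no Speech Object given")
        text = aims.automatic_speech_recognition(speech_object)
    else:
        if text_object is None:
            raise ValueError("Text selected but no Text Object given")
        text = text_object
    _refined_text, meaning = aims.natural_language_understanding(text)

    # Step 3: identify the object shown by the human.
    instance_identifier = aims.visual_object_identification(visual_object)

    # Step 4: derive the Intention of the question from its Meaning.
    intention = aims.question_analysis(meaning)

    # Step 5: produce the answer as Machine Text.
    machine_text = aims.answer_to_question(text, intention, meaning)

    # Step 6: synthesise Machine Speech from Machine Text.
    machine_speech = aims.text_to_speech(machine_text)

    return MqaResult(instance_identifier, machine_text, machine_speech)
```

Passing the AIMs in as callables keeps the control flow separate from any particular implementation, so stubs can stand in for real modules when the flow itself is being examined.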

3      I/O Data

The input and output data of the Multimodal Question Answering Use Case are:

Table 1 – I/O Data of Multimodal Question Answering

| Input | Description |
| --- | --- |
| Text Object | Text typed by the human as a replacement for Input Speech. |
| Input Selector | Data determining the use of Speech or Text. |
| Visual Object | Video of the human showing an object held in hand. |
| Speech Object | Speech of the human asking the Machine a question. |

| Output | Description |
| --- | --- |
| Machine Text | The Text generated by the Machine in response to human input. |
| Machine Speech | The Speech generated by the Machine in response to human input. |

4      Functions of AI Modules of Multimodal Question Answering

Table 2 provides the functions of the AI Modules of the Multimodal Question Answering Use Case.

Table 2 – Functions of AI Modules of Multimodal Question Answering

| AIM | Function |
| --- | --- |
| Visual Object Identification | Identifies the Visual Object. |
| Automatic Speech Recognition | Recognises Speech. |
| Natural Language Understanding | Extracts Meaning and refines Text from Recognised Text. |
| Question Analysis Module | Extracts Intention from Text. |
| Answer to Question Module | Produces response of Machine to the query. |
| Text-to-Speech | Synthesises Speech from Text. |

5      I/O Data of AI Modules

The I/O data of the AI Modules of Multimodal Question Answering are given in Table 3.

Table 3 – AI Modules of Multimodal Question Answering

| AIM | Receives | Produces |
| --- | --- | --- |
| Visual Object Identification | Visual Object | Instance Identifier |
| Automatic Speech Recognition | Speech Object | Recognised Text |
| Natural Language Understanding | Text Object or Recognised Text | Refined Text; Meaning |
| Question Analysis Module | Meaning | Intention |
| Answer to Question Module | 1. Input Text or Recognised Text; 2. Intention; 3. Meaning | Machine Text |
| Text-to-Speech | Machine Text | Machine Speech |
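
The receives/produces relations in Table 3 can be checked mechanically. The following is a minimal sketch, not part of the specification: it encodes the table as a dictionary, verifies that every received data type is either a workflow input or the output of another AIM, and derives one valid execution order with the standard-library `graphlib.TopologicalSorter` (Python 3.9 or later). The names `AIM_IO`, `WORKFLOW_INPUTS`, `unresolved_inputs`, and `execution_order` are hypothetical, and the sketch assumes that the "Input Text" received by the Answer to Question Module is the Text Object.

```python
from graphlib import TopologicalSorter

# Workflow-level inputs that reach the AIMs (Input Selector is omitted
# because Table 3 does not list it as AIM input data).
WORKFLOW_INPUTS = {"Text Object", "Speech Object", "Visual Object"}

# AIM name -> (data received, data produced), transcribed from Table 3.
# Assumption: "Input Text" is treated as the Text Object.
AIM_IO = {
    "Visual Object Identification": ({"Visual Object"}, {"Instance Identifier"}),
    "Automatic Speech Recognition": ({"Speech Object"}, {"Recognised Text"}),
    "Natural Language Understanding": (
        {"Text Object", "Recognised Text"},
        {"Refined Text", "Meaning"},
    ),
    "Question Analysis Module": ({"Meaning"}, {"Intention"}),
    "Answer to Question Module": (
        {"Text Object", "Recognised Text", "Intention", "Meaning"},
        {"Machine Text"},
    ),
    "Text-to-Speech": ({"Machine Text"}, {"Machine Speech"}),
}


def producers(aim_io):
    """Map each produced data type to the AIM that produces it."""
    return {
        data: aim
        for aim, (_, produced) in aim_io.items()
        for data in produced
    }


def unresolved_inputs(aim_io, workflow_inputs):
    """Return (AIM, data) pairs whose data has no source in the workflow."""
    produced = producers(aim_io)
    return {
        (aim, data)
        for aim, (received, _) in aim_io.items()
        for data in received
        if data not in workflow_inputs and data not in produced
    }


def execution_order(aim_io):
    """Return one order in which every AIM runs after its upstream AIMs."""
    produced = producers(aim_io)
    graph = {
        aim: {produced[data] for data in received if data in produced}
        for aim, (received, _) in aim_io.items()
    }
    return list(TopologicalSorter(graph).static_order())
```

Because several orders are valid (Visual Object Identification has no upstream AIM, for instance), only relative positions are meaningful: Automatic Speech Recognition precedes Natural Language Understanding, which precedes Question Analysis Module, and so on.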

6      AIW, AIMs, and JSON Metadata

Table 4 provides the links to the AIW and AIM specifications and to the JSON syntaxes. AIMs/1 indicates that the column contains Composite AIMs and AIMs/2 indicates that the column contains their Basic AIMs.

Table 4 – AIW, AIMs, and JSON Metadata

| AIW | AIMs/1 | AIMs/2 | Name | JSON |
| --- | --- | --- | --- | --- |
| MMC-MQA | | | Multimodal Question Answering | X |
| | OSD-VOI | | Visual Object Identification | X |
| | | OSD-VDI | Visual Direction Identification | X |
| | | OSD-VOE | Visual Object Extraction | X |
| | | OSD-VII | Visual Instance Identification | X |
| | MMC-ASR | | Automatic Speech Recognition | X |
| | MMC-NLU | | Natural Language Understanding | X |
| | MMC-QAM | | Question Analysis Module | X |
| | MMC-AQM | | Answer to Question Module | X |
| | MMC-TTS | | Text-to-Speech | X |

7     Reference Software

8     Conformance Testing

| Input Data | Data Type | Input Conformance Testing Data |
| --- | --- | --- |
| Input Selector | Binary data | For Input Selector=0 and Input Selector=1. |
| Text Object | Unicode | All input Text files to be drawn from Text Files. |
| Speech Object | .wav | All input Speech files to be drawn from Audio Files. |
| Input Image | JPEG | All input Image files to be drawn from Images. |

| Output Data | Data Type | Output Conformance Testing Data |
| --- | --- | --- |
| Machine Text | Unicode | All Text files produced shall conform with Text files. |
| Machine Speech | .wav | All Speech files produced shall conform with Speech files. |
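
A conformance test harness might begin with basic format checks on the data types listed above. The following is a rough sketch only, not the normative conformance procedure: the function names are hypothetical, the Unicode check assumes UTF-8 encoding (the table says only "Unicode"), the WAV check relies on Python's standard `wave` module, and the JPEG check is a heuristic that looks only at the leading start-of-image marker bytes.

```python
import io
import wave


def is_valid_input_selector(value: int) -> bool:
    """Input Selector is binary data: only 0 and 1 are accepted."""
    return value in (0, 1)


def is_unicode_text(data: bytes, encoding: str = "utf-8") -> bool:
    """Check that the bytes decode cleanly (UTF-8 is an assumption)."""
    try:
        data.decode(encoding)
    except UnicodeDecodeError:
        return False
    return True


def is_wav(data: bytes) -> bool:
    """Check that the bytes parse as a WAV file readable by `wave`."""
    try:
        with wave.open(io.BytesIO(data), "rb") as reader:
            return reader.getnframes() >= 0
    except (wave.Error, EOFError):
        return False


def is_jpeg(data: bytes) -> bool:
    """Heuristic: JPEG data starts with the SOI marker FF D8, then FF."""
    return data.startswith(b"\xff\xd8\xff")
```

These checks only confirm the container format of each file. They say nothing about whether Machine Text or Machine Speech is a correct answer, which is a separate question from format conformance.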

9     Performance Assessment

 
