2022/04/22
MPAI alerts industry of upcoming MPAI-MMC V2 Call for Technologies
Multimodal Conversation is one of the first standards approved by MPAI (September 2021). It comprises five Use Cases, all sharing the use of AI to enable forms of human-machine conversation that emulate human-human conversation in completeness and intensity:
- Conversation with Emotion (CWE) enables a human to hold a conversation with a machine. The machine understands the human’s emotion as expressed by their speech and face, and responds with a synthetic voice and an animated face, both expressing emotion.
- Multimodal Question Answering (MQA) enables a human to ask questions about a displayed object.
- Three conversational translation Use Cases enable a human to obtain a translation of their speech that preserves the “colour” of their speech in the interpreted speech.
The Multimodal Conversation Development Committee has investigated the Multimodal Conversation side of several use cases developed by other MPAI groups and has selected three as candidate use cases for Version 2 of the MPAI-MMC standard:
- Conversation About a Scene (CAS): a human holds a conversation with a machine about objects in a scene of which the human is part. While conversing, the human points a finger to indicate their interest in a particular object.
- Human-Connected Autonomous Vehicle Interaction (HCI): a group of humans holds a conversation on a domain-specific subject (travel by car) with a Connected Autonomous Vehicle. The machine understands the utterances, the emotion in the speech and in the faces, and the expression in the gestures. The machine manifests itself as the head and shoulders of an avatar whose face and head convey emotions congruent with the speech it utters.
- Avatar-Based Videoconference (ABV). In this instance of Mixed-reality Collaborative Space (MCS), avatars represent humans participating in a videoconference. Avatars reproduce the movements of the torsos of human participants with a high degree of accuracy.
Several data formats for potential standardisation have been derived from the functions identified below. The table indicates which use cases need each function; a sketch of one possible descriptor format follows the table.
| Function | ABV | CAS | HCI |
| --- | --- | --- | --- |
| 1. Human selects: | | | |
| a. The Ambient in which the avatars operate (ABV). | X | | |
| b. The avatar model used by the machine to manifest itself (CAS, ABV, HCI). | X | X | X |
| c. The Colour (i.e., the speech features) the machine uses to utter speech. | X | X | X |
| 2. Machine locates the visual and speech components of human(s) in the visual and sound space. | X | X | X |
| 3. Machine separates: | | | |
| a. The visual components of the individual humans from the rest of the visual space (i.e., other visual objects and other visual humans). | X | X | X |
| b. The speech components of the individual speaking humans from the rest of the sound space (i.e., other sound objects). | X | X | |
| 4. The machine extracts descriptors of: | | | |
| a. Human face. | X | X | X |
| b. Physical gesture (i.e., head, arms, hands and fingers). | X | X | X |
| c. Human speech. | X | X | X |
| 5. The machine uses: | | | |
| a. Face descriptors to: | | | |
| i. Identify the human belonging to a group of a limited number of humans. | X | X | |
| ii. Extract the emotion of the face. | X | X | X |
| iii. Animate the face of an avatar. | X | X | X |
| b. Physical gesture descriptors to: | | | |
| i. Extract the Expression of the physical gestures. | X | X | X |
| ii. Interpret the sign language conveyed by the physical gesture descriptors. | X | X | |
| iii. Animate the torso of an avatar using physical gesture descriptors. | X | X | |
| c. Speech descriptors to: | | | |
| i. Identify a human belonging to a group composed of a limited number of humans. | X | X | |
| ii. Recognise speech (i.e., extract text). | X | X | X |
| iii. Extract the emotion in the speech. | X | X | X |
| 6. Machine holds a conversation with a human or an avatar: | | | |
| a. In the context of a specific domain. | X | | |
| b. About objects in the visual space. | X | | |
| by: | | | |
| a. Analysing and interpreting their Expression and Emotion. | | | |
| b. Uttering speech with Emotion, possibly spatially located on the lips of an avatar. | | | |
| c. Animating: | | | |
| i. Eyes, lips and facial muscles of an avatar to display an Emotion. | X | X | |
| ii. Lips in sync with an uttered speech. | X | X | X |
| d. Expressing/displaying a sequence of Emotions/Expressions that are congruent with: | | | |
| i. Text, Expressions and Emotion of the other party. | X | X | X |
| ii. The Machine’s response and its associated Expressions and Emotions. | X | X | |
| e. Gazing at the other party it is conversing with. | X | X | |
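As a purely illustrative aside, the sketch below shows what one such data format, a face or speech descriptor carrying Emotion information, might look like. All names, fields and value ranges are assumptions made for this example and are not taken from any MPAI specification.

```python
# Hypothetical descriptor records for functions 4-5 above. Field names, the
# emotion label set and the value ranges are illustrative assumptions only.
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class EmotionDescriptor:
    label: str        # e.g. "joy", "anger" (assumed label set)
    intensity: float  # assumed to be normalised to the range 0.0-1.0
    start_ms: int     # start of the time interval the emotion refers to
    end_ms: int       # end of the time interval


@dataclass
class FaceDescriptor:
    human_id: str            # identifier within a small, known group of humans
    landmarks: List[float]   # flattened 2D facial landmark coordinates
    emotions: List[EmotionDescriptor] = field(default_factory=list)


@dataclass
class SpeechDescriptor:
    human_id: str
    text: Optional[str]      # recognised speech (i.e., extracted text), if available
    emotions: List[EmotionDescriptor] = field(default_factory=list)


# Example: the face of participant "h1" expressing moderate joy for one second.
face = FaceDescriptor(
    human_id="h1",
    landmarks=[0.0] * 136,  # 68 (x, y) landmark points, placeholder values
    emotions=[EmotionDescriptor("joy", 0.6, 0, 1000)],
)
```

An actual interchange format would more likely be defined as a JSON or binary schema; the dataclass form is used here only to keep the example compact.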
MPAI-MMC V2 Functional Requirements are being finalised. This is the current plan for developing the new standard:
| Milestone | Date |
| --- | --- |
| Functional Requirements | 2022/02/23 |
| Commercial Requirements | 2022/06/15 |
| Call for Technologies | 2022/07/13 |
| Response to Call due | 2022/10/10 |
| Standard Development | 2022/10/12 |
| Technical Specification | 2023/02/08 |
Watermarking and AI
The term watermarking covers a family of methodological and application tools used to imperceptibly and persistently insert data into a content item. Watermarking serves different purposes, such as enabling an entity to claim ownership of a content item or enabling a device to use it.
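As a toy illustration of what inserting data imperceptibly and persistently can mean in practice, the sketch below embeds a payload into the least-significant bits of an 8-bit image. This is a generic textbook technique shown only to make the idea concrete; it is not a method defined or endorsed by MPAI.

```python
# Toy watermark: hide payload bits in the least-significant bits of an image.
# Each carrier pixel changes by at most 1, so the change is imperceptible.
import numpy as np


def embed(image: np.ndarray, payload_bits: list) -> np.ndarray:
    flat = image.flatten()                # flatten() returns a copy
    for i, bit in enumerate(payload_bits):
        flat[i] = (flat[i] & 0xFE) | bit  # overwrite the LSB with the payload bit
    return flat.reshape(image.shape)


def extract(image: np.ndarray, n_bits: int) -> list:
    return [int(v & 1) for v in image.flatten()[:n_bits]]


image = np.random.randint(0, 256, size=(64, 64), dtype=np.uint8)
payload = [1, 0, 1, 1, 0, 0, 1, 0]
marked = embed(image, payload)
assert extract(marked, len(payload)) == payload
```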
As a neural network is one type of content – and one that may be quite expensive to develop – is the watermarking notion applicable to neural networks? MPAI thinks it is and is working to develop requirements for a Neural Network Watermarking (NNW) standard called MPAI-NNW that will enable a watermarking technology provider to qualify their products. The standard will provide the means to measure, for a given size of the watermarking payload, the ability of:
- The watermark inserter to inject a payload without deteriorating the performance of the Neural Network. This item requires, for a given application domain:
  - A testing dataset to be used for the watermarked and unwatermarked neural networks.
  - An evaluation methodology to assess any change of performance induced by the watermark.
- The watermark detector to recognise the presence of the inserted watermark when applied to a watermarked network that has been modified (e.g., by transfer learning or pruning) or to any of the inferences of the modified model. This item requires, for a given application domain:
  - A list of the potential modification types expected to be applied to the watermarked neural network, as well as their ranges (e.g., random pruning at 25%).
  - Performance criteria for the watermark detector (e.g., relative numbers of missed detections and false alarms).
- The watermark decoder to successfully retrieve the payload when applied to a watermarked network that has been modified (e.g., by transfer learning or pruning) or to any of the inferences of the modified model. This item requires, for a given application domain:
  - A list of the potential modification types expected to be applied to the watermarked neural network, as well as their ranges (e.g., random pruning at 25%).
  - Performance criteria for the watermark decoder (e.g., 100% or (100-α)% recovery).
- The watermark inserter to inject a payload at a low computational cost, e.g., execution time on a given processing environment.
- The watermark detector/decoder to detect/decode a payload from a watermarked model, or from any of its inferences, at a low computational cost, e.g., execution time on a given processing environment (a sketch of how these measurements might be organised follows).
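A minimal sketch of how such measurements could be organised is given below. All of the function parameters (load_model, insert_watermark, detect_watermark, accuracy) and the list of modifications are hypothetical placeholders standing in for a technology provider’s actual tools and the agreed testing dataset; the flow illustrates the kind of evaluation the standard would enable rather than an MPAI-defined procedure.

```python
# Hypothetical evaluation harness for the measurements listed above.
import time


def evaluate_nnw(load_model, insert_watermark, detect_watermark, accuracy,
                 payload, test_set, modifications):
    baseline = load_model()

    # 1. Cost of insertion and its impact on performance over the testing dataset.
    t0 = time.perf_counter()
    marked = insert_watermark(baseline, payload)
    insertion_time = time.perf_counter() - t0
    performance_drop = accuracy(baseline, test_set) - accuracy(marked, test_set)

    # 2. Robustness of detection against the agreed modification types
    #    (e.g. random pruning at 25%, transfer learning).
    missed, false_alarms = 0, 0
    for modify in modifications:
        if not detect_watermark(modify(marked)):
            missed += 1               # watermark lost after the modification
        if detect_watermark(modify(baseline)):
            false_alarms += 1         # watermark "found" where none was inserted

    return {
        "insertion_time_s": insertion_time,
        "performance_drop": performance_drop,
        "missed_detections": missed,
        "false_alarms": false_alarms,
    }
```

The decoder would be exercised in the same way, replacing the detection check with a bit-by-bit comparison of the recovered payload against the inserted one (e.g., requiring 100% or (100-α)% recovery), and timing the detector/decoder calls would cover the computational-cost items.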
The work of developing requirements for MPAI-NNW is ongoing. Participation in the work is open to non-members. Contact the MPAI Secretariat if you wish to join the MPAI-NNW online meetings.
Activities in the next meeting cycle
| Group name | Apr 25-29 | May 02-06 | May 09-13 | May 16-20 | Time (UTC) |
| --- | --- | --- | --- | --- | --- |
| AI Framework | 25 | 2 | 9 | 16 | 15 |
| Governance of MPAI Ecosystem | 25 | 2 | 9 | 16 | 16 |
| Mixed-reality Collaborative Spaces | 25 | 2 | 9 | 13 | 17 |
| Multimodal Conversation | 26 | 3 | 10 | 14 | 14 |
| Neural Network Watermarking | 26 | 3 | 10 | 14 | 15 |
| Context-based Audio Enhancement | 26 | 3 | 10 | 14 | 16 |
| Connected Autonomous Vehicles | | 4 | 11 | 18 | 12 |
| AI-Enhanced Video Coding | 27 | | 11 | | 14 |
| AI-based End-to-End Video Coding | | | | 17 | 13 |
| | | 4 | | | 14 |
| Avatar Representation and Animation | 28 | 5 | 12 | | 13:30 |
| Server-based Predictive Multiplayer Gaming | 28 | 5 | 12 | | 14:30 |
| AIM Health | | 6 | | | |
| Communication | 28 | | 12 | | 15 |
| Industry and Standards | 29 | | 13 | | 16 |
| General Assembly (MPAI-19) | | | | 18 | 15 |