Moving Picture, Audio and Data Coding
by Artificial Intelligence


MPAI appoints MPAI Store, incorporated as Company Limited by Guarantee, as the MPAI Store in the MPAI Ecosystem

Geneva, Switzerland – 30 September 2022. Today the international, non-profit, unaffiliated Moving Picture, Audio and Data Coding by Artificial Intelligence (MPAI) standards developing organisation has concluded its 24th General Assembly (MPAI-24). Among the outcomes is the appointment of MPAI Store, a company limited by guarantee incorporated in Scotland, as the “MPAI Store” referenced by the Governance of the MPAI Ecosystem standard (MPAI-GME).

The tasks of the MPAI Store are critical for the operation of the MPAI Ecosystem. Some of these are:

  1. Operation on a cost-recovery basis.
  2. Registration Authority function, i.e., assignment of an ID to implementers.
  3. Testing of implementations of MPAI technical specifications submitted by implementers.
  4. Labelling of implementations based on the verified interoperability level.
  5. Distribution of implementations via high-availability ICT infrastructure.

MPAI-24 has reiterated the deadline extension for submitting responses to the Calls for Technologies on AI Framework, Multimodal Conversation, and Neural Network watermarking until the 24th of October. The link to all documents relevant to the Calls can be found on the MPAI website.

MPAI develops data coding standards for applications that have AI as the core enabling technology. Any legal entity supporting the MPAI mission may join MPAI, if able to contribute to the development of standards for the efficient use of data.

 

So far, MPAI has developed 5 standards (not italic in the list below), is currently engaged in extending 2 approved standards (underlined) and is developing another 10 standards (italic).

 

 

Name of standard Acronym Brief description
AI Framework MPAI-AIF Specifies an infrastructure enabling the execution of implementations and access to the MPAI Store.
Context-based Audio Enhancement MPAI-CAE Improves the user experience of audio-related applications in a variety of contexts.
Compression and Understanding of Industrial Data MPAI-CUI Predicts the company’s performance from governance, financial, and risk data.
Governance of the MPAI Ecosystem MPAI-GME Establishes the rules governing the submission of and access to interoperable implementations.
Multimodal Conversation MPAI-MMC Enables human-machine conversation emulating human-human conversation.
Avatar Representation and Animation MPAI-ARA Specifies descriptors of avatars impersonating real humans.
Connected Autonomous Vehicles MPAI-CAV Specifies components for Environment Sensing, Autonomous Motion, and Motion Actuation.
End-to-End Video Coding MPAI-EEV Explores the promising area of AI-based “end-to-end” video coding for longer-term applications.
AI-Enhanced Video Coding MPAI-EVC Improves existing video coding with AI tools for short-to-medium term applications.
Integrative Genomic/Sensor Analysis MPAI-GSA Compresses high-throughput experiments’ data combining genomic/proteomic and other data.
Mixed-reality Collaborative Spaces MPAI-MCS Supports collaboration of humans represented by avatars in virtual-reality spaces.
MPAI Metaverse Model MPAI-MMM Development of a reference model to guide the creation of Interoperable Metaverse Instances.
Neural Network Watermarking MPAI-NNW Measures the impact of adding ownership and licensing information to models and inferences.
Visual Object and Scene Description MPAI-OSD Describes objects and their attributes in a scene.
Server-based Predictive Multiplayer Gaming MPAI-SPG Trains a network to compensate data losses and detects false data in online multiplayer gaming.
XR Venues MPAI-XRV XR-enabled and AI-enhanced use cases where venues may be both real and virtual.

 

Please visit the MPAI website, contact the MPAI secretariat for specific information, subscribe to the MPAI Newsletter and follow MPAI on social media: LinkedIn, Twitter, Facebook, Instagram, and YouTube.

Most importantly: please join MPAI, share the fun, build the future.


Imperceptibility, Robustness, and Computational Cost in Neural Network Watermarking

Introduction

Research efforts, specific skills, training, and processing can cumulatively bring the development cost of a neural network anywhere from a few thousand to a few hundred thousand dollars. Therefore, the AI industry needs a technology to ensure the traceability and integrity not only of a neural network but also of the content generated by it (the so-called inference).

Faced with a similar problem, the digital content production and distribution industry has considered watermarking as a tool to insert a payload carrying data such as timestamping or owner ID information. If the inserted payload is imperceptible and persistent, it can be used to signal the ownership of a content item or the semantic modification of its content.

A role for MPAI?

MPAI has assessed that watermarking can also be used by the AI industry and intends to develop a standard to assess the performance of neural network watermarking technologies. Users with different applications in mind can be interested in neural network watermarking. For instance, the owner, i.e., the developer of a neural network, is interested in having their neural network protected by the “best” watermarking solution. The watermarking provider, i.e., the developer of the watermarking technology, is interested in evaluating the performance of their watermarking technology. In turn, the customer, i.e., the provider of an end product, needs the owner’s and watermarking provider’s solutions to offer a product or a service. Finally, the end-user buys or rents the product and uses it.

All these users are mainly interested in three neural network watermarking properties: imperceptibility, persistence, and computational complexity.

Neural network watermarking imperceptibility

One of the features that a user of a watermarking technology may be interested in is assessing the impact that the embedding of a watermark in a neural network has on the quality of the inference that the neural network provides.

MPAI has identified the following process to test imperceptibility:

  1. Select a pair of training and testing datasets and a set of M unwatermarked neural networks.
  2. Insert a watermark in each neural network with D different data payloads, yielding M x (D + 1) neural networks: M x D watermarked neural networks and M unwatermarked neural networks.
  3. Feed the M x (D + 1) neural networks with the testing dataset and measure the quality of the produced inference.
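
The standard will specify the actual datasets, networks, and quality metrics. Purely as an illustration of the bookkeeping implied by steps 1 to 3, here is a minimal Python sketch in which load_networks, embed_watermark and inference_quality are hypothetical tester-supplied functions, not part of any MPAI specification:

def imperceptibility_test(load_networks, embed_watermark, inference_quality,
                          testing_dataset, payloads):
    # Step 1: M unwatermarked neural networks and a testing dataset.
    originals = load_networks()
    quality = {"unwatermarked": [], "watermarked": []}
    for network in originals:
        # Step 3 for the unwatermarked network.
        quality["unwatermarked"].append(inference_quality(network, testing_dataset))
        for payload in payloads:
            # Step 2: one watermarked network per payload (D payloads in total).
            watermarked = embed_watermark(network, payload)
            # Step 3 for the watermarked network.
            quality["watermarked"].append(inference_quality(watermarked, testing_dataset))
    # M x (D + 1) quality figures in total; compare the two distributions.
    return quality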

Neural network watermarking persistence

One of the features that a user of a watermarking technology may be interested in is assessing the capability of the detector to ascertain the presence of the watermark, and the capability of the decoder to retrieve the payload, from a modified version of the neural network.

MPAI has identified the following process to test the capability of the detector to find the watermark in the neural network:

  1. Repeat step 1 above.
  2. Repeat step 2 above.
  3. Repeat step 3 above.
  4. Apply one of the modifications (to be specified by the standard) intended to alter the watermark. Each modification is characterised by a set of parameters that challenge the robustness of the watermark.
  5. Feed the M x (D + 1) modified neural networks to the detector and record the decision: “watermark present” or “watermark absent”.
  6. Mark the results as true positive, true negative, false positive (false alarm), or false negative (missed detection).

The process to test the capability of the decoder to retrieve the payload in the neural network requires similar steps as above where “presence and absence” is replaced by “distance between the retrieved payload and the original payload”.
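
As an illustration only, the sketch below tallies the detector decisions of steps 5 and 6 and, for the decoder variant, replaces the presence/absence decision with a distance between the original and retrieved payloads; modify, detect and decode stand for tester-supplied functions and are not defined by MPAI-NNW:

def robustness_test(networks, modify, detect, decode):
    # networks: list of (network, payload) pairs; payload is None if unwatermarked.
    counts = {"true_positive": 0, "true_negative": 0,
              "false_positive": 0, "false_negative": 0}
    payload_distances = []
    for network, payload in networks:
        attacked = modify(network)               # step 4: e.g. pruning, fine-tuning
        found = detect(attacked)                 # step 5: "watermark present"?
        if payload is not None:
            counts["true_positive" if found else "false_negative"] += 1
            retrieved = decode(attacked)         # decoder variant of the test
            errors = sum(a != b for a, b in zip(payload, retrieved))
            payload_distances.append(errors / len(payload))
        else:
            counts["false_positive" if found else "true_negative"] += 1
    return counts, payload_distances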

The computational cost

One of the features that a user of a watermarking technology may be interested in is evaluating the processing cost of a watermarking solution (in terms of computing resources and/or time).
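
The standard is expected to specify the testing environment and the quantities to be reported; the toy helper below merely illustrates one obvious measurement, the average wall-clock time of a hypothetical inserter, detector, or decoder call:

import time

def average_processing_time(operation, *args, runs=10):
    # operation is a hypothetical watermark inserter, detector, or decoder callable.
    start = time.perf_counter()
    for _ in range(runs):
        operation(*args)
    return (time.perf_counter() - start) / runs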

The MPAI Call for Technologies

The MPAI process is to develop Use Cases and Functional Requirements, issue Calls for Technologies, receive and assess responses to the Call, and develop a standard for assessing the performance of a neural network watermarking technology. The published document can be found here. The MPAI secretariat should receive responses by 2022/10/24.

 

 


Avatars and the MPAI-MMC V2 Call for Technologies

The goal of the MPAI Multimodal Conversation (MPAI-MMC) standard is to enable forms of human-machine conversation that emulate the human-human one in completeness and intensity. While this is clearly a long-term goal, MPAI is focusing on standards providing frameworks which break down – where possible – complex AI functions to facilitate the formation of a component market where solution aggregators can find AI Modules (called AIMs) to build AI Workflows (called AIWs) corresponding to standard use cases. The AI Framework standard (MPAI-AIF) is a key enabler of this plan.

In September 2021, MPAI approved Multimodal Conversation V1 with 5 use cases. The first one – Conversation with Emotion – assumes that a human converses with a machine that understands what the human says, extracts the human’s emotion from their speech and face, articulates a textual response with an attached emotion, and converts it into synthetic speech containing emotion and a video containing a face expressing the machine’s emotion whose lips are properly animated.

The second MPAI-MMC V1 use case was Multimodal Question Answering. Here a human asks a question to a machine about an object. The machine understands the question and the nature of the object and generates a text answer which is converted to synthetic speech.

The other use cases are about automatic speech translation and they are not relevant for this article.

In July 2022, MPAI issued a Call for Technologies with the goal to acquire the technologies needed to implement three more Multimodal Conversation use cases. One concerns the extension of the notion of “emotion” to “Personal Status”, an element of the internal state of a person which also contains cognitive status (what a human or a machine has understood about the context) and attitude (what is the stance the human or the machine intends to adopt in the context). Personal status is conveyed by text, speech, face, and gesture. See here for more details. Gesture is the second ambition of MPAI-MMC V2.

One use case of MPAI-MMC V2 is “Conversation about a Scene”, which can be described as follows:

A human converses with a machine indicating the object of their interest. The machine sees the scene and hears the human; extracts and understands the text from the human’s speech and the personal status in their speech, face, and gesture; understands the object intended by the human; produces a response (text) with its own personal status; and manifests itself as a speaking avatar.

Figure 1 depicts a subset of the technologies that MPAI needs in order to implement this use case.

Figure 1 – The audio-visual front end

These are the functions of the modules and the data provided:

  1. The Visual Scene Description module analyses the video signal, describes, and makes available the Gesture and the Physical Objects in the scene.
  2. The Object Description module provides the Physical Object Descriptors.
  3. The Gesture Description module provides the Gesture Descriptors.
  4. The Object Identification module uses both Physical Object Descriptors and Visual Scene-related Descriptors to understand which object in the scene the human points their finger at, select the appropriate set of Physical Object Descriptors, and provide the Object ID.
  5. The Gesture Descriptor Interpretation module uses the Gesture Descriptors to extract the Personal Status of Gesture.
  6. The chain of the Face Description and Face Descriptor Interpretation modules produces the Personal Status of Face.
  7. The Audio Scene Description module analyses the audio signal, describes, and makes available the Speech Object.
  8. The chain of the Speech Description and Speech Descriptor Interpretation modules produces the Personal Status of Speech.

After the “front end” part we have a “conversation and manifestation” part involving another set of technologies as described in Figure 2.

Figure 2 – Conversation and Manifestation

  1. The Text and Meaning Extraction module produces Text and Meaning.
  2. The Personal Status Fusion module integrates the three sources of Personal Status into the Personal Status.
  3. The Question and Dialogue Processing module processes Input Text, Meaning, Personal Status and Object ID and provides the Machine Output Text and Personal Status.
  4. The Personal Status Display module processes Machine Output Text and Personal Status and produces a speaking avatar uttering Machine Speech and showing an animated Machine Face and Machine Gesture.
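
To make the data flow through Figures 1 and 2 easier to follow, here is a purely illustrative Python sketch chaining the modules listed above; every aims.* function is a hypothetical stub standing in for an AIM, and the real interfaces are exactly what the MPAI-MMC V2 Call asks respondents to propose:

def conversation_about_a_scene(video, audio, aims):
    # Audio-visual front end (Figure 1).
    gesture, physical_objects = aims.visual_scene_description(video)
    object_descriptors = [aims.object_description(obj) for obj in physical_objects]
    gesture_descriptors = aims.gesture_description(gesture)
    object_id = aims.object_identification(object_descriptors, gesture_descriptors)
    ps_gesture = aims.gesture_descriptor_interpretation(gesture_descriptors)
    ps_face = aims.face_descriptor_interpretation(aims.face_description(video))
    speech = aims.audio_scene_description(audio)
    ps_speech = aims.speech_descriptor_interpretation(aims.speech_description(speech))

    # Conversation and manifestation (Figure 2).
    text, meaning = aims.text_and_meaning_extraction(speech)
    personal_status = aims.personal_status_fusion(ps_speech, ps_face, ps_gesture)
    output_text, output_ps = aims.question_and_dialogue_processing(
        text, meaning, personal_status, object_id)
    return aims.personal_status_display(output_text, output_ps)  # speaking avatar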

The MPAI-MMC V2 Call considers another use case – Avatar-Based Videoconference – that uses avatars in a different way.

Avatars representing geographically separated humans participate in a virtual conference. Each participant receives the other participants’ avatars, places them around a table, and participates in the videoconference embodied in their own avatar.

The system is composed of:

  1. Transmitter client: Extracts speech and face descriptors for authentication, creates avatar descriptors using Face & Gesture Descriptors, and Meaning, and sends the participant’s Avatar Model & Descriptors and Speech to the Server.
  2. Server: Authenticates participants; distributes Avatar Models & Descriptors and Speech of each participant.
  3. Virtual Secretary: Makes and displays a summary of the avatars’ utterances using their speech and Personal Status.
  4. Receiver client: Creates virtual videoconference scene, attaches speech to each avatar and lets participant view and/or navigate the virtual videoconference room.

Figure 3 gives a simplified one-figure description of the use case.

Figure 3 – The avatar-based videoconference use case

This is the sequence of operations:

  1. The Speaker Identification and Face Identification modules produce Speech and Face Descriptors that the Authentication module in the server uses to identify the participant.
  2. The Personal Status Extraction module produces the Personal Status.
  3. The Speech Recognition and Meaning Extraction modules produce the Meaning.
  4. The Face Description and Gesture Description modules produce the Face and Gesture Descriptors (for feature and motion).
  5. The Participant Description module uses Personal Status, Meaning, and Face and Gesture Descriptors to produce the Avatar Descriptors.
  6. The Avatar Animation module animates the individual participant’s Avatar Model using the Avatar Descriptors.
  7. The AV Scene Composition module places the participants’ avatars in their assigned places, attaches to each avatar its own speech and produces the Audio-Visual Scene that the participant can view and navigate.
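
The following sketch, again with invented data structures and stub functions, illustrates the receiver-side steps 6 and 7: animating each participant’s avatar model with its descriptors, placing it at its assigned seat, and attaching its speech. None of the names below come from the MPAI-MMC V2 requirements:

from dataclasses import dataclass

@dataclass
class ParticipantData:
    # Hypothetical container; the actual data formats are what the Call requests.
    avatar_model: object
    avatar_descriptors: object
    speech: object
    seat: int

def compose_videoconference_scene(participants, animate, place, attach_speech):
    # animate, place and attach_speech stand in for the Avatar Animation and
    # AV Scene Composition modules.
    scene = []
    for p in participants:
        avatar = animate(p.avatar_model, p.avatar_descriptors)   # step 6
        node = place(avatar, seat=p.seat)                        # step 7
        attach_speech(node, p.speech)
        scene.append(node)
    return scene   # the audio-visual scene the participant can view and navigate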

The MPAI-MMC V2 use cases require the following technologies:

  1. Audio Scene Description.
  2. Visual Scene Description.
  3. Speech Descriptors for:
    1. Speaker identification.
    2. Personal status extraction.
  4. Human Object Descriptors.
  5. Face Descriptors for:
    1. Face identification.
    2. Personal status extraction.
    3. Feature extraction (e.g., for avatar model)
    4. Motion extraction (e.g., to animate an avatar).
  6. Gesture Descriptors for:
    1. Personal Status extraction.
    2. Features (e.g., for avatar model)
    3. Motion (e.g., to animate an avatar).
  7. Personal Status.
  8. Avatar Model.
  9. Environment Model.
  10. Human’s virtual twin animation.
  11. Animated avatar manifesting a machine producing text and personal status.

The MPAI-MMC V2 standard is an opportunity for the industry to agree on a set of data formats so that a market of modules able to handle those formats can be created. The standard should be extensible, in the sense that, as new and better-performing technologies mature, they can be incorporated into the standard.

Please see:

  1. The 2 min video (YouTube and non-YouTube) illustrating MPAI-MMC V2.
  2. The slides presented at the online meeting on 2022/07/12.
  3. The video recording of the online presentation (YouTube, non-YouTube) made at that 12 July presentation.
  4. The Call for Technologies, Use Cases and Functional Requirements, Framework Licence, and Template for responses.

The MPAI 2022 Calls for Technologies – Part 3 (Neural Network Watermarking)

Research, personnel, training, and processing can bring the development costs of a neural network anywhere from a few thousand to a few hundred thousand dollars. Therefore, the AI industry needs a technology to ensure the traceability and integrity not only of a neural network, but also of the content generated by it (the so-called inference). The content industry, facing a similar problem, has used watermarking to imperceptibly and persistently insert a payload carrying, e.g., an owner ID or a timestamp, to signal the ownership of a content item. Watermarking can also be used by the AI industry.

The general requirements for using watermarking in neural networks are:

  • The techniques shall not affect the performance of the neural network.
  • The payload shall be recoverable even if the content was modified.

MPAI has classified the cases of watermarking use as follows:

  • Identification of actors (i.e., neural network owner, customer, and end-user).
  • Identification of the neural network model.
  • Detecting the modification of a neural network.

This classification, depicted in Figure 1, concerns the use of watermarking technologies in neural networks and is independent of the intended application.

Figure 1 – Classification of neural network watermarking uses

MPAI has identified the need for a standard – code name MPAI-NNW – enabling users to measure the performance of the following components of a watermarking technology:

  • The ability of a watermark inserter to inject a payload without deteriorating the performance of the Neural Network.
  • The ability of a watermark detector to ascertain the presence of the inserted watermark, and of a watermark decoder to retrieve its payload, when applied to:
    • A modified watermarked network (e.g., by transfer learning or pruning).
    • An inference of the modified model.
  • The computational cost (e.g., execution time) of a watermark inserter injecting a payload, and of a watermark detector/decoder detecting/decoding a payload from a watermarked model or from any of its inferences.

Figure 2 depicts the three watermarking components covered by MPAI-NNW.

Figure 2 – The three areas to be covered by MPAI-NNW

MPAI has issued a Call to acquire the technologies for use in the standard. The list below is a subset of the requests contained in the call:

  • Use cases
    • Comments on use cases.
  • Impact of the watermark on the performance
    • List of Tasks to be performed by the Neural Network (e.g., classification task, speech generation, video encoding, …).
    • Methods to measure the quality of the inference produced (e.g., precision, recall, subjective quality evaluation, PSNR, …).
  • Detection/Decoding capability
    • List of potential modifications that a watermark shall be robust against (e.g., pruning, fine-tuning, …); a sketch of one such modification is given after this list.
    • Parameters and ranges of proposed modifications.
    • Methods to evaluate the differences between the original and retrieved watermarks (e.g., Symbol Error Rate).
  • Processing cost
    • Specification of the testing environments.
    • Specification of the values characterizing the processing of Neural Networks.
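
For concreteness, here is one example of a parameterised modification of the kind the Call asks respondents to characterise: magnitude pruning of a weight tensor, with the pruning ratio as the parameter whose range challenges the robustness of the watermark. It is an illustrative sketch, not a modification mandated by MPAI-NNW:

import numpy as np

def magnitude_prune(weights, ratio):
    # Zero out the `ratio` fraction of weights with the smallest magnitude.
    flat = np.abs(weights).ravel()
    k = int(ratio * flat.size)
    if k == 0:
        return weights.copy()
    threshold = np.partition(flat, k - 1)[k - 1]
    pruned = weights.copy()
    pruned[np.abs(pruned) <= threshold] = 0.0
    return pruned

# A proposal would state the parameter range, e.g. pruning ratios in [0.1, 0.9].
layer = np.random.randn(256, 128)
for ratio in (0.1, 0.3, 0.5):
    print(ratio, float((magnitude_prune(layer, ratio) == 0.0).mean()))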

Below are a few useful links for those wishing to know more about the MPAI-NNW Call for Technologies and how to respond to it:

The MPAI secretariat shall receive the responses to the MPAI-NNW Call for Technologies by 24 October 2022.


Answering a few basic questions about MPAI

Q: What is the main objective of MPAI?

A: There are many languages in the world, but if we want to reach all interested people we have to use a universally recognised language. The same happens with data: the more people understand a data format, the more value the data has.

The definition of a universal language for music – the MP3 standard – led to a revolution whose extent many cannot appreciate because they did not know the world before it. The same goes for video, of which there were, up to the mid-90s, perhaps as many formats or sub-formats as there were countries in the world. Of course, today’s world of media would not exist if the crazy idea that each country is entitled to define its own format still prevailed.

MPAI aims to bring the same revolution to the world of data by specifying standard data formats that rely on AI. Data is increasingly processed with artificial intelligence techniques; therefore standards should take the peculiarity of artificial intelligence into account. As with the media, billions of citizens of the world will enjoy the benefits.

Q: Who can participate and contribute to the MPAI project?

A: MPAI is open to all legal entities who can contribute to the development of standards for data coding achieved, mainly but not exclusively, using artificial intelligence. An internal committee at MPAI evaluates membership applications based on a few simple criteria. MPAI, however, is also open to the participation of physical persons in the phases concerning the presentation of proposals and the definition of the functional requirements of a standard.

Q: How does the standardisation process in MPAI work, starting from the market needs to arrive at standards?

A: As I said, MPAI gives anyone the opportunity to submit proposals for standards and to contribute to the definition of their functional requirements. However, a different argument must be made for the definition of another important parameter of a standard – the commercial requirements – that is, when and under what conditions will I be able to use the standard? The answer to this question is not obvious, because there are MPEG standards whose usage licences are not well defined or have become available many years after the standard was approved. Defining commercial requirements is the responsibility of MPAI’s core members, but anyone can propose technologies in response to MPAI’s Calls for Technologies. If a non-member’s proposal is accepted, they must become a member. The development of a standard is therefore open to all MPAI members.

Q: The recently published Multimodal Conversation (MPAI-MMC) V2 Call for Technologies seems particularly interesting in the field of machine learning applications for conversation analytics, and especially sentiment analysis, because it introduces Emotion, Cognitive State and Attitude (Personal Status) to enhance the conversation between humans and machines represented by avatars displaying Personal Status. Could you briefly describe some possible use cases and objectives of this call?

A: Back in September 2021, one year after the founding of MPAI, three standards were published, one of which concerned human-machine conversation (Multimodal Conversation, MPAI-MMC). One use case concerns a machine that “understands” not only the speech but also the emotion contained in the voice and the face of the human the machine is talking to, and responds with a sentence and an avatar face, both displaying a pertinent emotion.

In this year’s call, the ambition is significantly greater, because the concept of “emotion” has been extended to “personal status”, which includes, in addition to emotion, the cognitive state and attitude of an entity that can be a human or a machine. Personal status can be extracted through a module called “personal status extraction”, which answers the question: how much do my text, my voice, my face and my gestures convey my degree of knowledge of the subject I am talking about and my attitude towards the interlocutor? The personal status of a machine can be manifested using a module that we call “personal status display”, able to synthesise a speaking avatar that utters a sentence expressing a given personal status.

With these two modules and a series of other technologies, MPAI intends to support three use cases: a human talking to a machine about the objects contained in a room, a group of humans talking with an autonomous vehicle and a virtual video conference in which the participants are represented by avatars seated around a table and express the characteristics and movements of the participants. In the last use case, a virtual secretary understands what the avatars say along with their personal status and converts everything into a summary.

Q: Where are we regarding the definition of data coding standards for AI? Do you think we will reach a sufficient level to allow the construction of an ecosystem of AI applications that can really exploit standardised data coding protocols?

A: So far, we have approved 5 standards. Four are “technical”: context-based audio enhancement, multimodal conversation, prediction of the probability of a company’s failure based on its data, and a standard environment for running AI applications. This last standard, MPAI-AIF, is at the basis of the other three, because MPAI standards do not define monolithic applications but component-based ones. An AI application is a workflow consisting of modules of which an MPAI application standard defines the functions and the interfaces, i.e., the data that passes through them. A user of the standard can build their own application workflow by putting together modules from different sources that can work together because they conform with the “standard”. It is the Lego approach adapted to component-based applications.

For this to be practically possible, governance of the ecosystem generated by MPAI standards is required. This is the goal of the fifth MPAI standard, MPAI-GME, which sets the rules to turn the idea of a user building a workflow and making it work from a nice dream into reality. A fundamental governance element is the MPAI Store, recently established as a non-profit organisation, from which a user can download not only complete applications – as they can do today from app stores – but also the components of the application they need, coming from diverse sources. If the scenario of composing an application from components ceases to be a dream, then those who develop MPAI modules do not necessarily have to be economic giants able to bear the development costs of monolithic AI applications; they may very well be small and medium-sized companies specialising in potentially much narrower areas, possibly better suited to innovation. In addition, the MPAI Store deals with module distribution.

This is the new world that is emerging, made possible by MPAI standards.

Join MPAI – Join the fun – Build the future!


The MPAI 2022 Calls for Technologies – Part 2 (Multimodal Conversation)

Processing and generation of natural language is an area where artificial intelligence is expected to make a difference compared to traditional technologies. Version 1 of the MPAI Multimodal Conversation standard (MPAI-MMC V1), specifically the Conversation with Emotion use case, has addressed this and related challenges: processing and generation not only of speech but also of the corresponding human face when both convey emotion.

The audio and video produced by a human conversing with the machine, represented by the blue box in Figure 1, are perceived by the machine, which then generates human-like speech and video in response.

Figure 1 – Multimodal conversation in MPAI-MMC V1

The system depicted in Figure 1 operates as follows (bold indicates module, underline indicates output, italic indicates input):

  1. Speech Recognition (Emotion) produces Recognised Text from Input Speech and the Emotion embedded in Recognised Text and in Input Speech.
  2. Video Analysis extracts the Emotion expressed in Input Video (human’s face).
  3. Emotion Fusion fuses the three Emotions into one (Fused Emotion).
  4. Language Understanding produces Text (Language Understanding) and Meaning from Recognised Text and Fused Emotion.
  5. Dialogue Processing generates pertinent Output Text and Output Emotion using Text, Meaning, and Fused Emotion.
  6. Speech Synthesis (Emotion) synthesises Output Speech from Output Text.
  7. Lips Animation generates Output Video displaying the Output Emotion with lips animated by Output Speech using Output Speech, Output Emotion and a Video drawn from the Video of Faces Knowledge Base.
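
The standard specifies the Emotion data format, not how Emotion Fusion works internally. Purely as a toy illustration of step 3, the sketch below fuses the three extracted Emotions by majority vote, falling back to the speech Emotion on a tie; the fusion rule itself is an assumption, not part of MPAI-MMC:

from collections import Counter

def fuse_emotions(text_emotion, speech_emotion, face_emotion):
    # Majority vote over the three Emotion sources; speech wins ties (assumed rule).
    votes = Counter([text_emotion, speech_emotion, face_emotion])
    emotion, count = votes.most_common(1)[0]
    return emotion if count > 1 else speech_emotion

print(fuse_emotions("angry", "angry", "sad"))   # -> angry
print(fuse_emotions("calm", "angry", "sad"))    # -> angry (tie, speech fallback)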

The MPAI-MMC V2 Call for Technologies, issued in July 2022, seeks four major classes of technologies enabling a significant extension of the scope of its use cases:

  1. The extension of the internal status of a human from Emotion, defined as the typically non-rational internal status of a human resulting from their interaction with the Environment, such as “Angry”, “Sad”, “Determined”, to two more internal statuses: Cognitive State, defined as the typically rational internal status of a human reflecting the way they understand the Environment, such as “Confused”, “Dubious”, “Convinced”, and Attitude, defined as the internal status of a human or avatar related to the way they intend to position themselves vis-à-vis the Environment, e.g., “Respectful”, “Confrontational”, “Soothing”. Personal Status is the combination of Emotion, Cognitive State and Attitude. These can be extracted not only from speech and face but also from text and gesture (intended as the combination of head, arms, hands and fingers).
  2. The extension of the direction suggested by the Conversation with Emotion use case, where the machine generates an Emotion pertinent not only to what it has heard (Input Speech) and seen (the human face in Input Video) but also to what the machine is going to say (Output Text). Therefore, Personal Status is not just extracted from a human but can also be generated by a machine.
  3. Solutions no longer targeted at a controlled environment but facing the challenges of the real world: enabling a machine to create the digital representation of an audio-visual scene composed of speaking humans in a real environment.
  4. Enabling one party to animate an avatar model using standard descriptors and a model generated by another party.

More information about Personal Status and its applications in Personal Status Extraction and Personal Status Display can be found in Personal Status in human-machine conversation.

With these technologies MPAI-MMC V2 will support three new use cases:

  1. Conversation about a scene. A human has a conversation with a machine about the objects in a room. The human uses gestures to indicate the objects of their interest. The machine uses a personal status extraction module to better understand the internal status of the human and produces responses that include text and personal status. The machine manifests itself via a personal status display module (see more here).
  2. Human-Connected Autonomous Vehicle (CAV) Interaction. A group of humans interact with a CAV to get on board, request to be taken to a particular venue and have a conversation with the CAV while travelling. The CAV uses a personal status extraction module to better understand the personal status of the humans and produces responses that include Text and Personal Status. The CAV manifests itself via a personal status display module (again, see more here).
  3. Avatar-Based Videoconference. (Groups of) humans from different geographical locations participate in a virtual conference represented by avatars animated by descriptors produced by their clients using face and gesture descriptors supported by speech analysis and personal status extraction. The server performs speech translation and distributes avatar models and descriptors. Each participant places the individual avatars animated by their descriptors around a virtual table with their speech. A virtual secretary creates an editable summary recognising the speech and extracting the personal status of each avatar.

Figure 2 represents the reference diagram of Conversation about a Scene.

Figure 2 – Conversation about a Scene in MPAI-MMC V2

The system depicted in Figure 2 operates as follows:

  1. Visual Scene Description creates a digital representation of the visual scene.
  2. Speech Recognition recognises the text uttered by the human.
  3. Object Description, Gesture Description and Object Identification provide the ObjectID.
  4. Personal Status Extraction provides the human’s current Personal Status.
  5. Language Understanding provides Text (Language Understanding) and Meaning.
  6. Question and Dialogue Processing generates Output Text and the Personal Status of each of Speech, Face and Gesture.
  7. Personal Status Display produces a speaking avatar animated by Output Text and the Personal Status of each of Speech, Face and Gesture.

The internal architecture of the Personal Status Display module is depicted in Figure 3.

Figure 3 – Architecture of the Personal Status Display Module

Those wishing to know more about the MPAI-MMC V2 Call for Technologies should review:

  1. The 2 min video (YouTube, non-YouTube) illustrating MPAI-MMC V2.
  2.  The slides presented at the online meeting on 2022/07/12.
  3. The video recording of the online presentation (YouTube, non-YouTube) made at that 12 July presentation.
  4. The Call for Technologies, Use Cases and Functional Requirements, Clarifications about MPAI-MMC V2 Call for Technologies data formats, Framework Licence, and Template for responses.

The MPAI Secretariat should receive the responses to the Call by 2022/10/10T23:59 UTC. Partial responses are welcome.


Making a standard is just the first step

A definition of standard is “the documented agreement by a group of people that certain things should be done in a certain way”. In its still short but intense life, MPAI has documented several such agreements about “certain things”. One of these things is emotion. MPAI has specified 59 words with attached semantics that identify a type of emotion and a mechanism to extend the words representing emotions. It has also defined the format of the financial and organisational data that feed a machine predicting the probability of default and the organisational adequacy of the company the financial data refers to.

However, the path from a documented agreement to real products, services and applications is not easy and not short. In the domain of software standards, a reference implementation is often required to assure those who were not part of the agreement that the agreement is sound. In all cases, you need an agreed procedure that allows a party to test an implementation and verify that it does indeed conform with the standard. In other words, if the standard can be compared to the law, conformance testing can be compared to the code of procedure.

MPAI application standards, such as Conversation with Emotion and Company Performance Prediction, are not monoliths. They are implemented as sets of modules (AI Modules – AIMs) connected in workflows (AI Workflows – AIWs) executed in an environment (AI Framework – AIF). MPAI specifies the functions and data formats of the AIMs and the function, data format and connections of their AIW. MPAI also specifies the APIs that an AIM and other AIF components call to execute an AIW. The goal is to enable a user to buy an AIF from one party, an AIW from another, and the AIMs also from different parties.

Another facet is that MPAI standards deal with a particular technology – Artificial Intelligence – that sets them apart from other technologies. In general, the value of a neural network resides in the “completeness” and “reliability” of the data used for the task that the neural network claims to be able to carry out. Users buy a network together with the assurance that the data used are OK. But are they? Can a user be safe if they rely on a neural network?

MPAI uses the word performance (of an implemented standard) to indicate all these issues and defines it as a set of attributes – Reliability (e.g., quality), Robustness (e.g., ability to operate outside its original domain), Replicability (e.g., tests done by one party can be replicated by another) and Fairness (e.g., dataset or the network are open to being tested for bias).

Since its early days, MPAI has recognised that the complex ecosystem composed of MPAI, its specifications, implementers of specifications, conformance testers, and performance assessors could hardly get its act together as a smoothly running ecosystem without help. Left to itself, MPAI would be unable to honour its promise of an open competitive market of components, implicit in the notion of interoperable AIMs nicely fitting into an AIW and running in an AIF.

MPAI has specified its solution in the Governance of the MPAI Ecosystem (MPAI-GME) standard that specifies the roles of the MPAI Ecosystem players:

  1. MPAI:
    1. Develops and publishes standards, i.e., the set of Technical Specification, Reference Software Specification, Conformance Testing Specification, and Performance Assessment Specification.
    2. Establishes the MPAI Store (see point 2.).
    3. Appoints Performance Assessors (see point 3.).
  2. The MPAI Store:
    1. Assigns identifiers to implementers of MPAI standards (required to identify an implementation).
    2. Receives implementations of MPAI standards submitted by implementers.
    3. Verifies that implementations are secure.
    4. Tests the conformance of implementations to the appropriate MPAI standard or to one of its use cases.
    5. Receives reports from Performance Assessors about the grade of performance of implementations.
    6. Labels implementations with their interoperability level and makes them available for download.
    7. Publishes reviews of implementations communicated by Users.
  3. Implementers of MPAI Technical Specifications:
    1. Obtain an implementer ID from the MPAI Store.
    2. Submit implementations to the MPAI Store for security verification and conformance testing prior to distribution.
    3. May submit implementations to Performance Assessors.
  4. Performance Assessors:
    1. Are appointed by MPAI for a particular domain of expertise.
    2. Assess implementations for performance, typically for a fee.
    3. Make the MPAI Store and implementer aware of their findings about an implementation.
  5. Users:
    1. Download implementations.
    2. May communicate reviews of downloaded implementations to the MPAI Store.

Figure 1 describes the operation of the MPAI Ecosystem.

Figure 1 – The MPAI Ecosystem

Another piece of the Governance of the MPAI Ecosystem has been put in place on the 9th of August by a group of individuals active in MPAI who have established the MPAI Store in Scotland as a Company Limited by Guarantee.


MPAI adds documents and clarification to its currently open three Calls for Technologies

Geneva, Switzerland – 23 August 2022. Today the international, non-profit, unaffiliated Moving Picture, Audio and Data Coding by Artificial Intelligence (MPAI) standards developing organisation has concluded its 23rd General Assembly (MPAI-23). Among the outcomes are three documents produced to facilitate the task of drafting a response to the currently open Calls for Technologies and one document that will facilitate the identification and positioning of the technologies defined in the Multimodal Conversation Use Cases and Functional Requirements V2. MPAI-23 has also decided to extend the deadline for submitting responses to the Calls until the 24th of October. The link to all documents relevant to the Calls can be found on the MPAI website.

MPAI-23 has also been informed that a group of individuals active in MPAI has decided to establish the MPAI Store, the entity envisaged by the Governance of the MPAI Ecosystem with the task to assign identifiers to implementers of MPAI standards, receive implementations of MPAI standards, verify their security, test their conformance to an MPAI standard or to one of its use cases, make implementations labelled with their interoperability level available for download, and publish reviews of implementation user experiences. The MPAI blog provides a description of the MPAI Store mission in the context of the Governance of the MPAI Ecosystem.

MPAI develops data coding standards for applications that have AI as the core enabling technology. Any legal entity supporting the MPAI mission may join MPAI, if able to contribute to the development of standards for the efficient use of data.

So far, MPAI has developed 5 standards (not italic in the list below), is currently engaged in extending 2 approved standards (underlined) and is developing another 10 standards (italic).

Name of standard Acronym Brief description
AI Framework MPAI-AIF Specifies an infrastructure enabling the execution of implementations and access to the MPAI Store.
Context-based Audio Enhancement MPAI-CAE Improves the user experience of audio-related applications in a variety of contexts.
Compression and Understanding of Industrial Data MPAI-CUI Predicts the company’s performance from governance, financial, and risk data.
Governance of the MPAI Ecosystem MPAI-GME Establishes the rules governing the submission of and access to interoperable implementations.
Multimodal Conversation MPAI-MMC Enables human-machine conversation emulating human-human conversation.
Avatar Representation and Animation MPAI-ARA Specifies descriptors of avatars impersonating real humans.
Connected Autonomous Vehicles MPAI-CAV Specifies components for Environment Sensing, Autonomous Motion, and Motion Actuation.
End-to-End Video Coding MPAI-EEV Explores the promising area of AI-based “end-to-end” video coding for longer-term applications.
AI-Enhanced Video Coding MPAI-EVC Improves existing video coding with AI tools for short-to-medium term applications.
Integrative Genomic/Sensor Analysis MPAI-GSA Compresses high-throughput experiments’ data combining genomic/proteomic and other data.
Mixed-reality Collaborative Spaces MPAI-MCS Supports collaboration of humans represented by avatars in virtual-reality spaces.
Neural Network Watermarking MPAI-NNW Measures the impact of adding ownership and licensing information to models and inferences.
Visual Object and Scene Description MPAI-OSD Describes objects and their attributes in a scene.
Server-based Predictive Multiplayer Gaming MPAI-SPG Trains a network to compensate data losses and detects false data in online multiplayer gaming.
XR Venues MPAI-XRV XR-enabled and AI-enhanced use cases where venues may be both real and virtual.

Please visit the MPAI website, contact the MPAI secretariat for specific information, subscribe to the MPAI Newsletter and follow MPAI on social media: LinkedIn, Twitter, Facebook, Instagram, and YouTube.

Most importantly: please join MPAI, share the fun, build the future.


The MPAI 2022 Calls for Technologies – Part 1 (AI Framework)

A foundational element of the MPAI architecture is the recognition that monolithic AI applications have characteristics that make them undesirable. For instance, they are single-use, i.e., it is hard to reuse the technologies used by one application in another application, and they are obscure, i.e., it is hard to understand why a machine has produced a certain output when subjected to a certain input. The first characteristic means that it is hard to build complex applications, because an implementer must possess know-how of all the features of the application; the second means that such applications often are “unexplainable”.

MPAI launched AI Framework (MPAI-AIF), its first official standardisation activity, in December 2020, less than 3 months after its establishment. AIF is a standard environment where it is possible to execute AI Workflows (AIWs) composed of AI Modules (AIMs). Both AIWs and AIMs are defined by their function and their interfaces. AIF is unconcerned with the technology used by an AIM but needs to know the topology of an AIW.

Ten months later (October 2021) the MPAI-AIF standard was approved. Its structure is represented in Figure 1.

Figure 1 – The MPAI-AIF Reference Model

MPAI’s AI Framework (MPAI-AIF) specifies the architecture, interfaces, protocols, and Application Programming Interfaces (API) of the AI Framework (AIF), an environment specially designed for execution of AI-based implementations, but also suitable for mixed AI and traditional data processing workflows.

The AIF, the AIW and the AIMs are represented by JSON Metadata. The User Agent and the AIMs call the Controller through a set of standard APIs. Likewise, the Controller calls standard APIs to interact with Communication (a service for inter-AIM communication), Global Storage (a service for AIMs to store data for access by other AIMs) and the MPAI Store (a service for downloading AIMs required by an AIW). Access represents access to application-specific data.

Through the JSON Metadata, an AIF with appropriate resources (specified in the AIF JSON Metadata) can execute an AIW requiring AIMs (specified in the AIW JSON Metadata) that can be downloaded from the MPAI Store.
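
The normative schemas are defined in the MPAI-AIF standard itself; the fragment below is only a hypothetical illustration of the idea, an AIW that names the AIMs it needs and their topology so that an AIF can fetch them from the MPAI Store. The field names are invented for this sketch and do not come from the MPAI-AIF specification:

import json

aiw_metadata = {                               # hypothetical, illustrative only
    "identifier": "example.org:ConversationWithEmotion:1.0",
    "aims": [
        {"name": "SpeechRecognitionEmotion"},
        {"name": "VideoAnalysis"},
        {"name": "EmotionFusion"},
        {"name": "DialogueProcessing"},
    ],
    "topology": [
        {"from": "SpeechRecognitionEmotion.Emotion", "to": "EmotionFusion.Input1"},
        {"from": "VideoAnalysis.Emotion", "to": "EmotionFusion.Input2"},
        {"from": "EmotionFusion.FusedEmotion", "to": "DialogueProcessing.Emotion"},
    ],
}
print(json.dumps(aiw_metadata, indent=2))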

The MPAI-AIF standard has the following main features:

  1. Independence of the Operating System.
  2. Modular component-based architecture with specified interfaces.
  3. Encapsulation of component interfaces to abstract them from the development environment.
  4. Interface with the MPAI Store enabling access to validated components.
  5. Components can be implemented as software, hardware or mixed hardware-software.
  6. Components execute in local and distributed Zero-Trust architectures, can interact with other implementations operating in proximity, and support Machine Learning functionalities.

The MPAI-AIF standard achieves much of the original MPAI vision because AI applications:

  1. Need not be monolithic but can be composed of independently developed modules with standard interfaces.
  2. Are more explainable.
  3. Can be found in an open market.

Feature #6 above is a requirement, but the standard does not provide practical means for an application developer to ensure that the execution of the AIW takes place in a secure environment. Version 2 of MPAI-AIF intends to provide exactly that. As MPAI-AIF V1 does not specify any trusted service that an implementer can rely on, MPAI-AIF V2 identifies specific trusted services supporting the implementation of a Trusted Zone meeting a set of functional requirements that enable AIF Components to access trusted services via APIs, such as:

  1. AIM Security Engine.
  2. Trusted AIM Model Services
  3. Attestation Service.
  4. Trusted Communication Service.
  5. Trusted AIM Storage Service
  6. Encryption Service.

Figure 2 represents the Reference Models of MPAI-AIF V2.

Figure 2 – Reference Models of MPAI-AIF V2

The AIF Components shall be able to call Trusted Services APIs after establishing the developer-specified security regime based on the following requirements:

  1. The AIF Components shall access high-level implementation-independent Trusted Services API to handle:
    1. Encryption Service.
    2. Attestation Service.
    3. Trusted Communication Service.
    4. Trusted AIM Storage Service including the following functionalities:
      1. AIM Storage Initialisation (secure and non-secure flash and RAM)
      2. AIM Storage Read/Write.
      3. AIM Storage release.
    5. Trusted AIM Model Services including the following functionalities:
      1. Secure and non-secure Machine Learning Model Storage.
      2. Machine Learning Model Update (i.e., full or partial update of the weights of the Model).
      3. Machine Learning Model Validation (i.e., verification that the model is the one that is expected to be used and that the appropriate rights have been acquired).
    6. AIM Security Engine including the following functionalities:
      1. Machine Learning Model Encryption.
      2. Machine Learning Model Signature.
      3. Machine Learning Model Watermarking.
  2. The AIF Components shall be easily integrated with the above Services.
  3. The AIF Trusted Services shall be able to use hardware and OS security features already existing in the hardware and software of the environment in which the AIF is implemented.
  4. Application developers shall be able to select the application’s security through either or both of:
    1. Level of security that includes a defined set of security features for each level, i.e., APIs are available to either select individual security services or to select one of the standard security levels available in the implementation.
    2. Developer-defined security, i.e., a combination of a developer-defined set of security features.
  5. The specification of the AIF V2 Metadata shall be an extension of the AIF V1 Metadata supporting security with either or both standardised levels and a developer-defined combination of security features.
  6. MPAI welcomes the submission of use cases and their respective threat models.
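
Since the Call is precisely about acquiring these APIs, nothing below is an existing MPAI interface. The sketch only illustrates, with an invented placeholder interface, how an AIM might combine some of the requested services: attest the environment, store an encrypted model, and validate it before use:

class TrustedServices:
    # Invented placeholder for the Trusted Services API an AIF V2 might expose.
    def attest(self) -> bool: ...
    def encrypt(self, blob: bytes) -> bytes: ...
    def store_model(self, name: str, blob: bytes) -> None: ...
    def load_model(self, name: str) -> bytes: ...
    def validate_model(self, blob: bytes) -> bool: ...

def deploy_model(services: TrustedServices, name: str, model_blob: bytes) -> bytes:
    if not services.attest():                        # Attestation Service
        raise RuntimeError("environment failed attestation")
    services.store_model(name, services.encrypt(model_blob))  # Encryption + Trusted AIM Storage
    stored = services.load_model(name)
    if not services.validate_model(stored):          # Machine Learning Model Validation
        raise RuntimeError("model validation failed")
    return stored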

MPAI has rigorously followed its standard development process in producing the Use Cases and Functional Requirements summarised in this post. MPAI has additionally produced the Commercial Requirements (Framework Licence) and the text of the Call for Technologies.

Below are a few useful links for those wishing to know more about the MPAI-AIF V2 Call for Technologies and how to respond to it:

  1. The “About MPAI-AIF” web page provides some general information about MPAI-AIF.
  2. The MPAI-AIF V1 standard can be downloaded from here.
  3. The 1 min 20 sec video (YouTube and non-YouTube) concisely illustrates the MPAI-AIF V2 Call for Technologies.
  4. The slides and the video recording of the online presentation (YouTube, non-YouTube) made on 11 July give a complete overview of MPAI-AIF V2.

The MPAI secretariat shall receive the responses to the MPAI-AIF V2 Call for Technologies by 10 October 2022 at 23:59 UTC. For any need, please contact the MPAI secretariat.

 


Personal Status in human-machine conversation

MPAI has a Development Committee in the area of human-machine conversation (MMC-DC). In September 2021, MMC-DC produced its first standard, titled Multimodal Conversation (MPAI-MMC). That standard provides a standard way to represent Emotion with the following syntax:

{
    "$schema": "http://json-schema.org/draft-07/schema",
    "definitions": {
        "emotionType": {
            "type": "object",
            "properties": {
                "emotionDegree": {
                    "enum": ["High", "Medium", "Low"]
                },
                "emotionName": {
                    "type": "number"
                },
                "emotionSetName": {
                    "type": "string"
                }
            }
        }
    },
    "type": "object",
    "properties": {
        "primary": {
            "$ref": "#/definitions/emotionType"
        },
        "secondary": {
            "$ref": "#/definitions/emotionType"
        }
    }
}

The semantics is given by:

Name Definition
emotionType Specifies the Emotion that the input carries.
emotionDegree Specifies the Degree of Emotion as one of “Low,” “Medium,” and “High.”
emotionName Specifies the ID of an Emotion listed in Table 2.
emotionSetName Specifies the name of the Emotion set which contains the Emotion. The Emotion set of Table 2 is used as a baseline, but other sets are possible.

Table 1 gives some examples of the MPAI standardised three-level Basic Emotion Set partly based on:

Table 1 – Basic Emotion Set

EMOTION CATEGORIES GENERAL ADJECTIVAL SPECIFIC ADJECTIVAL
ANGER angry furious
irritated
frustrated
APPROVAL, DISAPPROVAL admiring/approving
disapproving
indifferent
awed
contemptuous
AROUSAL aroused/excited/energetic cheerful
playful
lethargic
sleepy
ATTENTION attentive expectant/anticipating
thoughtful
distracted/absent-minded
vigilant
hopeful/optimistic
BELIEF credulous sceptical
CALMNESS calm peaceful/serene
resigned

The semantics of some elements in Table 1 is provided by Table 2.

Table 2 – Semantics of the Basic Emotion Set

ID Emotion Meaning
1 admiring/approving emotion due to perception that others’ actions or results are valuable
2 amused positive emotion combined with interest (cognitive)
3 anger emotion due to perception of physical or emotional damage or threat
4 anxious/uneasy low or medium degree of fear, often continuing rather than instant
5 aroused/excited/energetic cognitive state of alertness and energy
6 arrogant emotion communicating social dominance
7 astounded high degree of surprise
8 attentive cognitive state of paying attention
9 awed approval combined with incomprehension or fear
10 bewildered/puzzled high degree of incomprehension
11 bored not interested
12 calm relative lack of emotion
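
As a minimal sketch, the snippet below validates a sample Emotion instance against the schema shown earlier using the jsonschema package (assumed to be installed); the payload values are examples only, using emotionName 3 (“anger”) and 11 (“bored”) from Table 2 and “Basic” as a name for the baseline set:

from jsonschema import validate   # pip install jsonschema

emotion_schema = {
    "$schema": "http://json-schema.org/draft-07/schema",
    "definitions": {
        "emotionType": {
            "type": "object",
            "properties": {
                "emotionDegree": {"enum": ["High", "Medium", "Low"]},
                "emotionName": {"type": "number"},
                "emotionSetName": {"type": "string"},
            },
        }
    },
    "type": "object",
    "properties": {
        "primary": {"$ref": "#/definitions/emotionType"},
        "secondary": {"$ref": "#/definitions/emotionType"},
    },
}

sample = {
    "primary": {"emotionDegree": "High", "emotionName": 3, "emotionSetName": "Basic"},
    "secondary": {"emotionDegree": "Low", "emotionName": 11, "emotionSetName": "Basic"},
}
validate(instance=sample, schema=emotion_schema)   # raises ValidationError if invalid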

In July 2022, MPAI issued a Call for Technologies to extend the MPAI-MMC standard. One of the technologies requested is Personal Status, defined as “The ensemble of information internal to a person, including Emotion, Cognitive State, and Attitude”. The three components are defined as follows:

Attitude An element of the internal status related to the way a human or avatar intends to position themselves vis-à-vis the Environment or subsets of it, e.g., “Respectful”, “Confrontational”, “Soothing”.
Cognitive State An element of the internal status reflecting the way a human or avatar understands the Environment, such as “Confused”, “Dubious”, “Convinced”.
Emotion An element of the internal status resulting from the interaction of a human or avatar with the Environment or subsets of it, such as “Angry”, “Sad”, “Determined”.

The Personal Status is conveyed by one or more Modalities, currently, Text, Speech, Face and Gesture.

Respondents to the call are requested to propose the following:

  1. A Personal Status format capable of describing the evolution of Personal Status over time.
  2. A Fused Personal Status format supporting the requirements to:
    1. Include the Emotion, Cognitive State, and Attitude making up a Personal Status.
    2. Retain information on the measured values of the different factors in a Personal Status conveyed by the different Modalities.
    3. Describe the evolution of Personal Status over time.
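
What a response to these requests might look like is, of course, up to the respondents. Purely as a hypothetical illustration, the snippet below shows a Personal Status sample that reuses the emotionType pattern above for the Emotion factor, adds invented structures for Cognitive State and Attitude, and carries a timestamp so that the evolution over time can be described; none of the field names come from an MPAI specification:

import json

personal_status_sample = {            # hypothetical format, for illustration only
    "timestamp": "2022-10-01T12:00:00Z",
    "emotion": {"emotionDegree": "Medium", "emotionName": 3, "emotionSetName": "Basic"},
    "cognitiveState": {"degree": "High", "name": "Confused"},
    "attitude": {"degree": "Low", "name": "Respectful"},
    "modalities": ["Speech", "Face"],  # Modalities that conveyed the measurement
}
print(json.dumps(personal_status_sample, indent=2))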

A standard Personal Status can be used by standard components in human-machine conversation. One such component is Personal Status Extraction, depicted in Figure 2.

Figure 2 – Personal Status Extraction

Another component is Personal Status Display depicted in Figure 3.

Figure 3 – Personal Status Display