Moving Picture, Audio and Data Coding
by Artificial Intelligence

What is new in MPAI Multimodal Conversation

The MPAI project called Multimodal Conversation (MPAI-MMC), one of the earliest MPAI projects, has the ambitious goal of using AI to enable forms of conversation between humans and machines that emulate the conversation between humans in completeness and intensity. An important element in achieving this goal is leveraging all the modalities used by a human when talking to another human: speech, of course, but also text, face, and gesture.

In the Conversation with Emotion use case standardised in Version 1 (V1) of MPAI-MMC, the machine activates different modules, which MPAI calls AI Modules (AIMs), that produce data in response to the data generated by a human:

AI Module | Produces | What data | From what data
Speech Recognition (Emotion) | Extracts | Text, Speech emotion | Human speech
Language Understanding | Produces | Refined text | Recognised text
Language Understanding | Extracts | Meaning, Text emotion | Recognised text
Video Analysis | Extracts | Face emotion | Face Object
Emotion Fusion | Produces | Fused emotion | Text Emotion, Speech Emotion, Face Emotion
Dialogue Processing | Produces | Machine text, Machine emotion | Meaning, Refined Text, Fused Emotion
Speech Synthesis (Emotion) | Produces | Machine speech with Emotion | Text, Emotion
Lips Animation | Produces | Machine Face with Emotion | Speech, Emotion

This is graphically depicted in Figure 1 where the green blocks correspond to the AIMs.

Figure 1 – Conversation with Emotion (V1)
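To make the dataflow of the table concrete, the following minimal Python sketch wires the AIMs of the V1 workflow together. All function names, data types and return values are placeholders invented for illustration; they are not the normative MPAI AIM interfaces.

```python
# Illustrative sketch of the Conversation with Emotion (V1) dataflow.
# Each function is a placeholder for the corresponding AI Module (AIM).

def speech_recognition(human_speech):
    # Extracts recognised text and speech emotion from the human speech.
    return {"text": "hello", "speech_emotion": "neutral"}

def language_understanding(recognised_text):
    # Produces refined text and extracts meaning and text emotion.
    return {"refined_text": recognised_text, "meaning": "greeting", "text_emotion": "neutral"}

def video_analysis(face_object):
    # Extracts face emotion from the face object.
    return "happy"

def emotion_fusion(text_emotion, speech_emotion, face_emotion):
    # Fuses the three modality emotions into one fused emotion (trivial majority vote).
    emotions = [text_emotion, speech_emotion, face_emotion]
    return max(set(emotions), key=emotions.count)

def dialogue_processing(meaning, refined_text, fused_emotion):
    # Produces the machine text and the machine emotion.
    return {"machine_text": "Hello! How can I help?", "machine_emotion": fused_emotion}

def speech_synthesis(text, emotion):
    # Produces machine speech carrying the given emotion.
    return f"<speech emotion={emotion}>{text}</speech>"

def lips_animation(machine_speech, emotion):
    # Produces a machine face with emotion, lips in sync with the speech.
    return f"<face emotion={emotion} synced_to={machine_speech!r}>"

# Wiring the AIMs as in Figure 1.
asr = speech_recognition("raw audio")
nlu = language_understanding(asr["text"])
face_emotion = video_analysis("face object")
fused = emotion_fusion(nlu["text_emotion"], asr["speech_emotion"], face_emotion)
reply = dialogue_processing(nlu["meaning"], nlu["refined_text"], fused)
machine_speech = speech_synthesis(reply["machine_text"], reply["machine_emotion"])
machine_face = lips_animation(machine_speech, reply["machine_emotion"])
print(machine_speech, machine_face)
```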

Multimodal Conversation Version 2 (V2), for which a Call for Technologies is planned to be issued on 19 July 2022, intends to improve MPAI-MMC V1 by extending the notion of Emotion with the notion of Personal Status. This is the ensemble of personal information that includes Emotion, Cognitive State, and Attitude. The former two – Emotion and Cognitive State – result from the interaction with the environment, while the last – Attitude – is the stance that will be taken in new interactions based on the achieved Emotion and Cognitive State.

Figure 2 shows the composite AI Module introduced in MPAI-MMC V2: Personal Status Extraction (PSE). This contains specific AIMs that describe the individual text, speech, face, and gesture modalities and interpret the resulting descriptors. PSE plays a fundamental role in human-machine conversation, as we will see shortly.

Figure 2 – Personal Status Extraction
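As an illustration only, a Personal Status can be pictured as a record of the three factors, with a PSE-like function fusing per-modality estimates. The class and functions below are assumptions made for this sketch, not data formats taken from the MPAI-MMC V2 requirements.

```python
from dataclasses import dataclass

@dataclass
class PersonalStatus:
    # The three factors of Personal Status in MPAI-MMC V2.
    emotion: str          # e.g., "Angry", "Sad", "Determined"
    cognitive_state: str  # e.g., "Confused", "Dubious", "Convinced"
    attitude: str         # e.g., "Confrontational", "Respectful", "Soothing"

def fuse(estimates):
    # Trivial fusion of per-modality estimates: majority vote.
    return max(set(estimates), key=estimates.count)

def personal_status_extraction(text_ps, speech_ps, face_ps, gesture_ps):
    # Placeholder for the PSE composite AIM: in reality, modality-specific AIMs
    # first compute descriptors for text, speech, face and gesture, interpret
    # them, and the interpretations are combined into a single Personal Status.
    modalities = [text_ps, speech_ps, face_ps, gesture_ps]
    return PersonalStatus(
        emotion=fuse([m.emotion for m in modalities]),
        cognitive_state=fuse([m.cognitive_state for m in modalities]),
        attitude=fuse([m.attitude for m in modalities]),
    )

# Example: four per-modality estimates fused into one Personal Status.
estimate = PersonalStatus("Determined", "Convinced", "Respectful")
print(personal_status_extraction(estimate, estimate, estimate, estimate))
```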

A second fundamental component – Personal Status Display (PSD) – is depicted in Figure 3. Its role is to enable the machine to manifest itself to the party it is conversing with. The manifestation is driven by the words generated by the machine and by the Personal Status it intends to attach to its speech, face, and gesture.

Figure 3 – Personal Status Display

Is there a reason why the word “party” has been used in lieu of “human”? Yes, there is. The Personal Status Display can be used to manifest a machine to a human, but potentially also to another avatar. The same can be said of Personal Status Extraction, which can extract the Personal Status of a human but could do so for an avatar as well. MPAI-MMC V2 has examples of both.

Figure 4 shows how we can leverage the Personal Status Extraction and Personal Status Display AIMs to enhance the performance of Conversation with Emotion – pardon – Conversation with Personal Status.

Figure 4 – Conversation with Personal Status V2.0

In Figure 4, Speech Recognition extracts the text from speech. Language Understanding and Question and Dialogue Processing can do a better job because they have access to the Personal Status. Finally, the Personal Status Display is a re-usable component that generates a speaking avatar from text and the Personal Status conveyed by the speech, face, and gesture modalities.

Figure 4 assumes that the outside world provides clean speech, face, and gesture. Most often, unfortunately, this is not the case. There is rarely a single speech source and, even when there is just one, it is embedded in all sorts of surrounding sounds. The same can be said of face and gesture: there may be more than one person, and extracting the face, or the head, arms, hands, and fingers making up the gesture of a human, is anything but simple. Figure 5 introduces two critical components: Audio Scene Description (ASD) and Visual Scene Description (VSD).

Figure 5 – Conversation with Personal Status and Audio-Visual Scene Description

The task of Audio-Visual Scene Description (AVSD) can be described as “digitally describe a portion of the world with a level of clarity and precision achievable by a human”. Expressed in this form, the goal is on the one hand unattainable with today’s technology, because the description of “any” scene is too general a task; on the other hand, it may not be sufficient for some purposes, because the world can very often be described using sensors a human does not have.

The scope of Multimodal Conversation V2, however, is currently limited to 3 use cases:

  1. A human has a conversation with a machine about the objects in a room.
  2. A group of humans has a conversation with a Connected Autonomous Vehicle (CAV) outside and inside it (in the cabin).
  3. Groups of humans have a videoconference where humans are individually represented by avatars having a high similarity with the humans they represent.

VSD should provide a description of the visual scene composed of visual objects classified as human and generic objects. The human object should be decomposable into face, head, arm, hand, and finger objects and should carry position and velocity information. ASD should provide a description of the speech sources as audio objects with their position and velocity.
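A minimal sketch of what such scene descriptors could look like as data structures is given below; the classes and field names are illustrative assumptions, not the normative MPAI formats.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

Vector3 = Tuple[float, float, float]

@dataclass
class VisualObject:
    kind: str                  # "human", "face", "head", "arm", "hand", "finger" or "generic"
    position: Vector3
    velocity: Vector3
    parts: List["VisualObject"] = field(default_factory=list)  # decomposition of a human object

@dataclass
class AudioObject:
    # A speech source separated from the surrounding sounds.
    position: Vector3
    velocity: Vector3

@dataclass
class SceneDescription:
    visual_objects: List[VisualObject]
    audio_objects: List[AudioObject]

# Example: one human (with a face part) and one speech source at the same position.
scene = SceneDescription(
    visual_objects=[VisualObject("human", (1.0, 0.0, 2.0), (0.0, 0.0, 0.0),
                                 parts=[VisualObject("face", (1.0, 1.6, 2.0), (0.0, 0.0, 0.0))])],
    audio_objects=[AudioObject((1.0, 1.6, 2.0), (0.0, 0.0, 0.0))],
)
print(len(scene.visual_objects), len(scene.audio_objects))
```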

The first use case is well represented by Figure 6.

Figure 6 – Conversation About a Scene

The machine sees the human as a human object. The Object Identification AIM uses the Gesture Descriptors to understand where the human’s finger points. If there is an object at that position, the Object Identification AIM uses the Physical Object Descriptors to assign an ID to the object. The machine also feeds the Face Object and the Human Object into the Personal Status Extraction AIM to understand the human’s Emotion, Cognitive State, and Attitude, in order to enable the Question and Dialogue Processing AIM to fine-tune its answer.
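As a toy example of the kind of reasoning the Object Identification AIM performs, one naive approach is to cast a ray along the finger direction and select the described object closest to that ray. The sketch below is only illustrative; the inputs stand in for Gesture Descriptors and Physical Object Descriptors, whose actual formats are defined by MPAI.

```python
import math

def identify_pointed_object(finger_pos, finger_dir, objects):
    """Return the id of the object whose centre lies closest to the pointing ray.

    finger_pos, finger_dir: 3-D tuples standing in for Gesture Descriptors.
    objects: dict mapping an object id to its 3-D position (Physical Object Descriptors).
    """
    norm = math.sqrt(sum(c * c for c in finger_dir))
    direction = tuple(c / norm for c in finger_dir)
    best_id, best_dist = None, float("inf")
    for obj_id, pos in objects.items():
        offset = tuple(p - f for p, f in zip(pos, finger_pos))
        t = sum(o * d for o, d in zip(offset, direction))
        if t <= 0:
            continue                       # the object is behind the pointing direction
        closest = tuple(f + t * d for f, d in zip(finger_pos, direction))
        dist = math.dist(pos, closest)     # distance from the object to the ray
        if dist < best_dist:
            best_id, best_dist = obj_id, dist
    return best_id

objects = {"vase": (2.0, 1.0, 3.0), "lamp": (-1.0, 1.5, 2.0)}
print(identify_pointed_object((0.0, 1.2, 0.0), (0.66, -0.07, 1.0), objects))  # -> vase
```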

Is this all we have to say about Multimodal Conversation V2.0? Well, no, this is the beginning. So, stay tuned for more news or, better, attend the MPAI-MMC V2 online presentation on Tuesday 12 July 2022 at 14 UTC. Please register here to attend.


An introduction to MPAI Multimodal Conversation V2

The MPAI project called Multimodal Conversation (MPAI-MMC) has the ambitious goal of using AI to enable forms of human-machine conversation that emulate human-human conversation in completeness and intensity. This means that MMC will leverage all the modalities that a human uses when talking to another human: of course speech, but also text, face, and gesture.

In the Conversation with Emotion use case of MMC V1 the machine activates different modules (in italic) to produce data (underlined) in response to a human:

  1. Speech Recognition (Emotion) extracts text and speech emotion.
  2. Language Understanding produces refined text, and extracts meaning and text emotion.
  3. Video Analysis extracts face emotion.
  4. Emotion Fusion fuses the 3 emotions into fused emotion.
  5. Dialogue Processing produces machine text and machine emotion.
  6. Speech Synthesis (Emotion) produces speech with machine emotion.
  7. Lips Animation produces machine face (an avatar) with facial emotion and lips in sync with speech.

This is depicted in Figure 1.

Multimodal Conversation Version 2 (V2) intends to substantially improve MPAI-MMC V1 by adding Cognitive State and Attitude to Emotion. The combination of the three is called Personal Status, the ensemble of information internal to a person. Emotion and Cognitive State are the result of an interaction with the environment, while Attitude is the stance for new interactions.

Figure 1 shows one component – Personal Status Extraction (PSE) – identified for MPAI-MMC V2. PSE, a Composite AIM containing other specific AIMs that describe modalities and interpret descriptors, plays a fundamental role in human-machine conversation.

Figure 1 – Personal Status Extraction

A second fundamental component – Personal Status Display – is depicted in Figure 2.

Figure 2 – Personal Status Display

 


Functional requirements for 3 new standards published 

 Geneva, Switzerland – 22 June 2022. Today the international, non-profit, unaffiliated Moving Picture, Audio and Data Coding by Artificial Intelligence (MPAI) standards developing organisation has concluded its 21st General Assembly. Among the outcomes is the approval of three Use Cases and Functional Requirements documents for AI Framework V2, Multimodal Conversation V2 and Neural Network Watermarking V1.

This milestone is important because MPAI Principal Members intending to participate in the development of the standards can develop the Framework Licences of the three planned standards. The Framework Licence has been devised by MPAI to facilitate the practical availability of approved standards (see here for an example). It is a licence without critical data such as cost, dates, rates etc. MPAI is now drafting the Calls for Technologies for the 3 standards and plans to adopt and publish them on 2022/07/19, the 2nd anniversary of the launch of the MPAI project.

AI Framework (MPAI-AIF) V1 specifies an infrastructure enabling the execution of implementations and access to the MPAI Store. V2 will add security support to the framework and is the next step following today’s release of the MPAI-AIF V1 Reference Software.

Multimodal Conversation (MPAI-MMC) V1 enables human-machine conversation emulating human-human conversation. V2 will specify technologies supporting 5 new use cases:

  1. Personal Status Extraction: provides an estimate of the Personal Status (PS) – of a human or an avatar – conveyed by Text, Speech, Face, and Gesture. PS is the ensemble of information internal to a person, including Emotion, Cognitive State, and Attitude.
  2. Personal Status Display: generates an avatar from Text and PS that utters speech with the intended PS while the face and gesture show the intended PS.
  3. Conversation About a Scene: a human holds a conversation with a machine about objects in a scene. While conversing, the human points their fingers to indicate their interest in a particular object. The machine is helped by the understanding of the human’s PS.
  4. Human-Connected Autonomous Vehicle (CAV) Interaction: a group of humans converse with a CAV which understands the utterances and the PSs of the humans it converses with and manifests itself as the output of a Personal Status Display.
  5. Avatar-Based Videoconference: avatars representing humans with a high degree of accuracy participate in a videoconference. A virtual secretary (VS) represented as an avatar displaying PS creates an online summary of the meeting with a quality enhanced by the virtual secretary’s ability to understand the PS of the avatar it converses with.

Neural Network Watermarking (MPAI-NNW): will provide the means to measure, for a given size of the watermarking payload, the ability of 1) the watermark inserter to inject a payload without deteriorating the NN performance, 2) the watermark detector to recognise the presence and the watermark decoder to successfully retrieve the payload of the inserted watermark, 3) the watermark inserter to inject a payload and the watermark detector/decoder to detect/decode a payload from a watermarked model or from any of its inferences at a measured computational cost.

MPAI will hold four online presentations of the documents on the following dates:

Title | Acronym | Day of July | Time | Note
AI Framework V2 | MPAI-AIF | 11 | 15:00 UTC | Register
Multimodal Conversation V2 | MPAI-MMC | 07 | 14:00 UTC | Register
Multimodal Conversation V2 | MPAI-MMC | 12 | 14:00 UTC | Register
Neural Network Watermarking | MPAI-NNW | 12 | 15:00 UTC | Register

MPAI-MMC will be presented in two sessions because of the number and scope of the use cases and of the supporting technologies.

Those intending to attend a presentation event are invited to register at the link above.

MPAI develops data coding standards for applications that have AI as the core enabling technology. Any legal entity supporting the MPAI mission may join MPAI, if able to contribute to the development of standards for the efficient use of data.

So far, MPAI has developed 5 standards (normal font in the list below), is currently engaged in extending two approved standards (underlined), and is developing 9 other standards (italic).

Name of standard | Acronym | Brief description
AI Framework | MPAI-AIF | Specifies an infrastructure enabling the execution of implementations and access to the MPAI Store.
Context-based Audio Enhancement | MPAI-CAE | Improves the user experience of audio-related applications in a variety of contexts.
Compression and Understanding of Industrial Data | MPAI-CUI | Predicts the company performance from governance, financial, and risk data.
Governance of the MPAI Ecosystem | MPAI-GME | Establishes the rules governing the submission of and access to interoperable implementations.
Multimodal Conversation | MPAI-MMC | Enables human-machine conversation emulating human-human conversation.
Server-based Predictive Multiplayer Gaming | MPAI-SPG | Trains a network to compensate for data losses and detect false data in online multiplayer gaming.
AI-Enhanced Video Coding | MPAI-EVC | Improves existing video coding with AI tools for short-to-medium term applications.
End-to-End Video Coding | MPAI-EEV | Explores the promising area of AI-based “end-to-end” video coding for longer-term applications.
Connected Autonomous Vehicles | MPAI-CAV | Specifies components for Environment Sensing, Autonomous Motion, and Motion Actuation.
Avatar Representation and Animation | MPAI-ARA | Specifies descriptors of avatars impersonating real humans.
Neural Network Watermarking | MPAI-NNW | Measures the impact of adding ownership and licensing information to models and inferences.
Integrative Genomic/Sensor Analysis | MPAI-GSA | Compresses high-throughput experiment data combining genomic/proteomic and other data.
Mixed-reality Collaborative Spaces | MPAI-MCS | Supports collaboration of humans represented by avatars in virtual-reality spaces.
Visual Object and Scene Description | MPAI-OSD | Describes objects and their attributes in a scene.

Visit the MPAI website, contact the MPAI secretariat for specific information, subscribe to the MPAI Newsletter and follow MPAI on social media: LinkedIn, Twitter, Facebook, Instagram, and YouTube.

Most importantly: join MPAI, share the fun, build the future.

 

 


MPAI wants to do it again

On 30 September 2021, the first anniversary of its incorporation, MPAI approved Version 1 of its Multimodal Conversation standard (MPAI-MMC). The standard included 5 use cases: Conversation with Emotion, Multimodal Question Answering, and the Automatic Speech Translation use cases. Three months later, MPAI approved Version 1 of Context-based Audio Enhancement (MPAI-CAE). The standard included 4 use cases: Emotion-Enhanced Speech, Audio Recording Preservation, Speech Restoration System, and Enhanced Audioconference Experience.

A lot more has happened in MPAI beyond these two standards, even before the approval of the two standards, and now MPAI is ready to launch a new project that includes 5 use cases:

  1. Personal Status Extraction (PSE).
  2. Personal Status-driven Avatar (PSA).
  3. Conversation About a Scene (CAS).
  4. Human-CAV (Connected Autonomous Vehicle) Interaction (HCI).
  5. Avatar-Based Videoconference (ABV).

This article will give a brief introduction to the 5 use cases.

  1. Personal Status Extraction (PSE). Personal Status is a set of internal characteristics of a person, currently, Emotion, Cognitive State, and Attitude. Emotion and Cognitive State result from the interaction of a human with the Environment. Cognitive State is more rational (e.g., “Confused”, “Dubious”, “Convinced”). Emotion is less rational (e.g., “Angry”, “Sad”, “Determined”). Attitude is the stance that a human takes when s/he has reached an Emotion and Cognitive State (e.g., “Confrontational”, “Respectful”, “Soothing”). The PSE use case is about how Personal Status can be extracted from its Manifestations: Text, Speech, Face and Gesture.
  2. Personal Status-driven Avatar (PSA). In Conversation with Emotion (MPAI-MMC V1) a machine was represented by an avatar whose speech and face displayed an emotion congruent with the emotion displayed by the human the machine was conversing with. The PSA use case is about the interaction of a machine with humans in different use cases. The machine is represented by an avatar whose text, speech, face, and gesture display a Personal Status congruent with the Personal Status manifested by the human the machine is conversing with.
  3. Conversation About a Scene (CAS): A human and a machine converse about the objects in a room with little or no noise. The human uses a finger to indicate their interest in a particular object. The machine understands the Personal Status shown by the human in their speech, face, and gesture, e.g., the human’s satisfaction because the machine understands their question. The machine manifests itself as the head-and-shoulders of an avatar whose face and gesture (head) convey the machine’s Personal Status resulting from the conversation in a way that is congruent with the speech it utters.
  4. Human-CAV (Connected Autonomous Vehicle) Interaction (HCI): a group of humans converse with a Connected Autonomous Vehicle (CAV) on a domain-specific subject (travel by car). The conversation can be held both outside of the CAV when the CAV recognises the humans to let them into the CAV or inside when the humans are sitting in the cabin. The two Environments are assumed to be noisy. The machine understands the Speech, and the human’s Personal Status shown on their Text, Speech, Face, and Gesture. The machine appears as the head and shoulders of an avatar whose Text, Speech, Face, and Gesture (Head) convey a Personal Status congruent with the Speech it utters.
  5. Avatar-Based Videoconference (ABV). Avatars representing geographically distributed humans participate in a videoconference reproducing the movements of the upper part of the human participants (from the waist up) with a high degree of accuracy. Some locations may have more than one participant. A special participant in the Virtual Environment where the Videoconference is held can be the Virtual Secretary. This is an entity displayed as an avatar not representing a human participant whose role is to: 1) make and visually share a summary of what other avatars say; 2) receive comments on the summary; 3) process the vocal and textual comments taking into account the avatars’ Personal Status showing in their text, speech, face, and gesture; 4) edit the summary accordingly; and 5) display the summary. A human participant or the meeting manager composes the avatars’ meeting room and assigns each avatar’s position and speech as they see fit.

These use cases imply a wide range of technologies (more than 40). While the requirements for these technologies and the full description of the use cases are planned to be approved at the next General Assembly (22 June), MPAI is preparing the Framework Licence and the Call for Technologies. The latter two are planned to be approved at the following General Assembly on 19 July. MPAI gives respondents about 3 months to complete their submissions.

More information about the MPAI process and the Framework Licence is available on the MPAI website.


MPAI for affordable Artificial Intelligence

After a series of ups and downs lasting about sixty years, the set of technologies that go by the name of Artificial Intelligence (AI) has powerfully entered the design, production, and strategy of many companies. Although it would not be easy – and would perhaps be an ineffective use of time – to argue against those who claim that AI is neither Artificial nor Intelligent, the term is sufficiently useful and indicative to have found wide use in both discourse and practice.

To characterise AI, it is useful to compare it with the antecedent technology called Data Processing (DP). When handling a data source, a DP expert would study the characteristics of the data, e.g., the values or, better, the transformations of the data capable of extracting its most representative quantities. A good example is Digital Signal Processing (DSP), well represented by those agglomerates of sophisticated algorithms that go by the name of audio and video compression standards.

In all these cases we find wonderful examples of how human ingenuity was able to dig for years into enormous masses of data and discover, one by one, the peculiarities of audio and video signals in order to give them a more efficient representation, i.e., one that requires fewer bits to represent the same, or nearly the same, data.

AI presents itself as a radical alternative to what DP practitioners have done so far. Instead of employing humans to dig into the data to find hidden relationships, machines are trained to search for and find those hidden relationships. In other words, instead of training humans to find relationships, we train humans to train the machine to find those relationships.

The machines intended for this purpose consist of a network of variously connected nodes. Drawing on the obvious parallel with the brain, the nodes are called neurons and the network is therefore called a neural network. In the training phase, the machine is presented with many – maybe millions of – examples and, thanks to an internal logic, the connections are corrected backwards so that the next time – hopefully – the result is better tuned.
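As a toy illustration of this “backward correction” of connections (and of why each iteration has a computational cost), here is a minimal gradient-descent loop for a single weight; it is a didactic sketch, not representative of production-scale training.

```python
import random

# Toy training: learn y = 2*x with a single connection weight,
# corrected backwards after each example.
random.seed(0)
examples = [(x, 2.0 * x) for x in range(1, 6)]
w = random.random()          # the single "connection weight"
learning_rate = 0.01

for epoch in range(100):     # every pass over the examples costs computation
    for x, target in examples:
        prediction = w * x                  # forward pass (an "inference")
        error = prediction - target
        w -= learning_rate * error * x      # backward correction of the weight

print(f"learned weight: {w:.3f}")           # close to 2.0
```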

Intuitively, it could be said that the more complex the universe of data that the machine must “learn”, the more complex the network must be. This is not necessarily true, because the machine has been built to understand the internal relationships of the data, and what appears to us complex at first sight could be governed by a rule, or a set of relatively simple rules, underlying the data that the machine can “understand”.

Training a neural network can be expensive. The first cost element is the large amount of data needed to train the network; training can be supervised (a human tells the machine how well it fared) or unsupervised (the machine works this out by itself). The second cost element is the large amount of computation needed to change the weights, i.e., the importance of the connections between neurons, at each iteration. The third cost element is the access to the IT infrastructure needed to carry out the training. Finally, if the trained neural network is used to offer a service, there is the cost of accessing potentially significant computing resources every time the machine produces an inference, that is, processes data to provide an answer.

On 19 July 2020, the idea of establishing a non-profit organisation with the mission of developing standards for data coding using mainly AI techniques was launched. One hundred days later, the organisation was formed in Geneva under the name of MPAI – Moving Picture, Audio and Data Coding by Artificial Intelligence.

Why should we need an organisation for data coding standards using AI? The answer is simple and can be formulated as follows: MPEG standards – based on DP – have enormously accelerated, and actually promoted, the evolution and dissemination of audio-visual products, services, and applications. It is therefore reasonable to expect that MPAI standards – based on AI – will accelerate the evolution and diffusion of products, services, and applications for the data economy. Yes, because even audio-visual sources in the end produce – and for MPEG they always did – data.

One of the first objectives that MPAI set for itself was the pure and simple lowering of the development and operating costs of AI applications. How can a standard achieve this?

The answer starts from a bit far away, that is, from the human brain. We know that the human brain is made up of connected neurons. However, the connections of the approximately 100 billion neurons are not homogeneously distributed, because the brain is made up of many neuronal “aggregations” whose functions research in the field is gradually coming to understand. So, rather than neurons connecting with parts of the brain, we should talk about neurons that have many interconnections with other neurons within an aggregation, while it is the aggregation itself that passes the results of its processing to other aggregations. For example, the visual cortex – the part of the brain processing visual information, located in the occipital lobe and part of the visual pathway – has a layered structure with 6 interconnected layers, the 4th of which is further subdivided into 4 sublayers.

Whatever its motivations, one of the first standards approved by the MPAI General Assembly (in November 2021, 14 months after MPAI was established) was AI Framework (MPAI-AIF), a standard that specifies the architecture and constituent components of an environment able to implement AI systems consisting of AI Modules (AIM) organised in AI Workflows (AIW), as shown in Figure 1.

Figure 1 – Reference model of MPAI-AIF

The main requirements that have guided the development of the MPAI-AIF standard specifying this environment are:

  1. Independence from the operating system.
  2. Modularity of components.
  3. Interfaces that encapsulate components abstracted from the development environment.
  4. Wide range of implementation technologies: software (Microcontrollers to High-Performance Computing systems), hardware, and hardware-software.
  5. AIW execution in local and distributed Zero-Trust environments.
  6. AIF interaction with other AIFs operating in the vicinity (e.g., swarms of drones).
  7. Direct support for Machine Learning functions.
  8. Interface with MPAI Store to access validated components.

The Controller performs the following functions:

  1. Offers basic functionality, e.g., scheduling and communication between AIMs and other AIF components.
  2. Manages resources according to the instructions given by the user.
  3. Is linked to all AIM/AIW in a given AIF.
  4. Activates/suspends/resumes/deactivates AIWs based on user or other inputs.
  5. Exposes three APIs:
    1. AIM APIs allow AIM/AIW to communicate with it (register, communicate and access the rest of the AIF environment).
    2. User APIs allow user or other controllers to perform high-level tasks (e.g., turn the controller on/off, provide input to the AIW via the controller).
    3. Controller-to-controller APIs allow a controller to interact with another controller.
  6. Accesses the MPAI Store APIs to communicate to the Store.
  7. One or more AIWs can run locally or on multiple platforms.
  8. Communicates with other controllers running on separate agents, requiring one or more controllers in proximity to open remote ports.
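Purely as an illustrative sketch, the three API groups could be pictured as follows; the class and method names are assumptions, not the APIs normatively defined by MPAI-AIF.

```python
class Controller:
    """Illustrative skeleton of the three Controller API groups."""

    def __init__(self):
        self.aims = {}
        self.aiws = {}
        self.remote_controllers = []

    # AIM API: lets AIMs/AIWs register, communicate and reach the AIF environment.
    def register_aim(self, name, aim):
        self.aims[name] = aim

    def send_message(self, source, destination, payload):
        self.aims[destination].receive(payload)   # via a channel or an event

    # User API: lets the User Agent perform high-level tasks on AIWs.
    def start_aiw(self, name):
        self.aiws[name] = "running"

    def suspend_aiw(self, name):
        self.aiws[name] = "suspended"

    # Controller-to-Controller API: lets a Controller interact with another Controller.
    def connect_remote(self, other_controller):
        self.remote_controllers.append(other_controller)

controller = Controller()
controller.start_aiw("ConversationWithEmotion")
print(controller.aiws)
```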

Communication connects an output port of one AIM with an input port of another AIM using events or channels. It has the following characteristics:

  1. Activated jointly with the controller.
  2. Persistence is not required.
  3. Channels are Unicast – physical or logical.
  4. Messages have high or normal priority and are communicated via channels or events.

AI Module (AIM) receives data, performs a well-defined function and produces data. It has the following features:

  1. Communicates with other components via ports or events.
  2. Can incorporate other AIMs within it.
  3. Can register and log out dynamically.
  4. Can run locally or on different platforms, e.g., in the cloud or on swarms of drones, and communicate with a remote controller.

AI Workflow (AIW) is a structured aggregation of AIMs receiving and processing data according to a function determined by a use case and producing the required data.

Shared Storage stores data making it available to other AIMs.

AIM Storage stores the data of individual AIMs.

User Agent interfaces the user with an AIF via the controller.

Access offers access to static or slow-varying data that are required by the AIM, such as domain knowledge data, data models, etc.

MPAI Store stores and makes implementations available to users.
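Putting the pieces together, an AIW can be seen as a structured aggregation of AIMs, each receiving data, performing its function and producing data. The following sketch is only a conceptual illustration of that structure, not an MPAI-AIF implementation.

```python
class AIM:
    """An AI Module: receives data, performs a well-defined function, produces data."""
    def __init__(self, name, function):
        self.name = name
        self.function = function

    def process(self, data):
        return self.function(data)

class AIW:
    """An AI Workflow: a structured aggregation of AIMs (here, a simple chain)."""
    def __init__(self, aims):
        self.aims = aims

    def run(self, data):
        for aim in self.aims:
            data = aim.process(data)   # each AIM's output feeds the next AIM
        return data

# A toy two-AIM workflow: "recognise" then "understand".
workflow = AIW([
    AIM("SpeechRecognition", lambda audio: {"text": "hello"}),
    AIM("LanguageUnderstanding", lambda d: {"meaning": "greeting", **d}),
])
print(workflow.run(b"raw audio"))
```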

MPAI-AIF is an MPAI standard that can be freely downloaded from the MPAI website. An open-source implementation of MPAI-AIF will be available shortly.

MPAI-AIF is important because it lays the foundation on which other MPAI application standards can be implemented. So, it can be said that the description given above does not mark the conclusion of MPAI-AIF, but only the beginning. In fact, work is underway to provide MPAI-AIF with security support. The reference model is an extension of the model in Figure 1.

Figure 2 – Reference model of MPAI-AIF with security support

MPAI will shortly publish a Call for Technologies. In particular, the Call will request API proposals to access Trusted Services and Crypto Services.

We started by extolling the advantages of AI and complaining about the high costs of using the technology. How can MPAI-AIF lower costs and increase the benefits of AI? The answer lies in these expected developments:

  1. AIM implementers will be able to offer them to an open and competitive market.
  2. Application developers will be able to find the AIMs they need in the open and competitive market.
  3. Consumers will enjoy a wide selection of the best AI applications produced by competing application developers based on competing technologies.
  4. The demand for technologies enabling new and better-performing AIMs will fuel innovation.
  5. Society will be able to lift the veil of opacity behind which many of today’s monolithic AI-based applications hide.

MPAI develops data coding standards for applications that have AI as the core enabling technology. Any legal entity supporting the MPAI mission may join MPAI, if able to contribute to the development of standards for the efficient use of data.

Visit the MPAI website, contact the MPAI secretariat for specific information, subscribe to the MPAI Newsletter and follow MPAI on social media: LinkedIn, Twitter, Facebook, Instagram, and YouTube.

Most importantly: join MPAI – share the fun – build the future.


Making sure that AI is “good” AI

AI has generated easy enthusiasm but also fears. The narrative that has developed sees in the development of AI more a potentially dystopian, machine-ruled future than a tool potentially capable of improving the well-being of humanity.

Indeed, some AI technologies hold the potential to transform our society in a disruptive way. That possibility must be kept in check if we want to avoid potentially serious problems. Just think of video deep fakes, but also of the possibility that advanced language models such as GPT-3 generate ethically questionable outcomes.

One problem in this is that AI is a new technology and its limitations and problems are sometimes difficult to understand and evaluate. This is illustrated by the frequent identification of training bias or vulnerabilities which could have disastrous impacts in systems that are mission-critical or make sensitive decisions.

The MPAI Technical Specification called “Governance of the MPAI Ecosystem (MPAI-GME)” (see here) deals with these issues. To address a problem, however, you first need to identify it. MPAI does that by defining the Performance of an Implementation of an MPAI standard as a collection of 4 attributes:

  1. Reliability: the Implementation performs as specified by the standard, profile and version the Implementation refers to, e.g., within the application scope, stated limitations, and for the period of time specified by the Implementer.
  2. Robustness: the ability of the Implementation to cope with data outside of the stated application scope with an estimated degree of confidence.
  3. Replicability: the assessment made by an entity can be replicated, within an agreed level, by another entity.
  4. Fairness: the training set and/or network is open to testing for bias and unanticipated results so that the extent of system applicability can be assessed.

MPAI defines “Performance Assessors”, entities mandated to assess the extent to which an Implementation possesses the Performance attributes.

Who can be a Performance Assessor? A testing laboratory, a qualified company, and even an Implementer; in the latter case, however, an Implementer may not Assess the Performance of its own Implementations. A Performance Assessor is appointed for a particular domain and for an indefinite duration and may charge Implementers for its services. However, MPAI can revoke the appointment.

In making its assessments, an MPAI Assessor is guided by the Performance Assessment Specification (PAS), the fourth component of an MPAI Standard. A PAS specifies the characteristics of the procedure, the tools and the datasets used by an Assessor when assessing the Performance of an Implementation.

MPAI has developed the PAS of the Compression and Understanding of Industrial Data standard (MPAI-CUI). MPAI-CUI can predict the default and business discontinuity probability, and the adequacy of the organisational model, of a company over a given prediction horizon using governance and financial statement data and the assessment of cyber and seismic risk. Of course, the outlook of a company depends on more risks than cyber and seismic, but the standard in its current form takes only these risks into account.

The figure below gives the reference model of the standard.

Let’s see what the MPAI-CUI PAS actually says.

A Performance Assessor shall assess the Performance of an Implementation using a dataset satisfying the following requirements:

  1. The turnover of the companies used to create the dataset shall be between 1 M$ and 50 M$.
  2. The Financial Statements used shall cover 5 consecutive years.
  3. The last year of the Financial Statements and Governance data shall be the year the Performance is assessed.
  4. No Financial Statement, Governance data or risk data shall be missing.

and the assessment process shall be carried out in 3 steps, as follows:

  1. Compute the Default Probability for each company in the dataset that
    1. Includes geographic location and industry types.
    2. Does not include geographic location and industry types.
  2. Compute the Organisational Model Index for each company in the dataset that
    1. Includes geographic location and industry types.
    2. Does not include geographic location and industry types.
  3. Verify that the average
    1. Default Probabilities for 1.a. and 1.b. do not differ by more than 2%.
    2. Organisational Model Index for 2.a. and 2.b. does not differ by more than 2%.
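A hedged sketch of how an Assessor’s tooling might implement this verification is shown below; the prediction interface and the reading of “2%” as two percentage points are assumptions, and the normative procedure remains the one stated in the PAS.

```python
def average(values):
    return sum(values) / len(values)

def assess_performance(companies, predict_default, predict_omi):
    """companies: records with 5 consecutive years of financial, governance and risk data.
    predict_default / predict_omi: the Implementation under test, callable with and
    without geographic location and industry type (with_context=True/False)."""
    dp_with = average([predict_default(c, with_context=True) for c in companies])
    dp_without = average([predict_default(c, with_context=False) for c in companies])
    omi_with = average([predict_omi(c, with_context=True) for c in companies])
    omi_without = average([predict_omi(c, with_context=False) for c in companies])

    # Step 3: the averages shall not differ by more than 2% (read here as 2 percentage points).
    return abs(dp_with - dp_without) <= 0.02 and abs(omi_with - omi_without) <= 0.02
```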

The MPAI Store will use the result of the Performance Assessment to label an Implementation.

Although very specific to an application, the example provided in this article gives a sufficient indication that the governance of the MPAI ecosystem has been designed to provide a practical solution to a difficult problem that risks depriving humankind of a potentially good technology.

If only we can separate the wheat from the chaff.


Patent pool being formed for four MPAI standards

Geneva, Switzerland – 18 May 2022. Today the international, non-profit, unaffiliated Moving Picture, Audio and Data Coding by Artificial Intelligence (MPAI) standards developing organisation has concluded its 20th General Assembly. Among the outcomes is the communication that a patent pool will soon be established for four of its standards: AI Framework (MPAI-AIF), Context-Based Audio Enhancement (MPAI-CAE), Compression and Understanding of Industrial Data (MPAI-CUI) and Multimodal Conversation (MPAI-MMC).

The four standards have been developed based on the MPAI process:

  1. Between December 2020 and March 2021, MPAI issued 4 Calls for Technologies, each referring to two documents – Functional Requirements and Framework Licence.
  2. Between September and December 2021, the 4 standards were approved.
  3. In December 2021 and January 2022, the MPAI Secretariat requested its members to declare whether they believed they held patents essential to the four standards.
  4. In March 2022, the Secretariat issued a Call for Patent Pool Administrators on behalf of the identified patent holders.
  5. In May 2022, the result of the Call was communicated to patent holders.

According to the MPAI process, the patent holders will select the patent pool administrator with a majority of 2/3 of the patent holders’ votes. The licences shall have a total cost comparable with the total cost of similar technologies and be released not after products are on the market.

MPAI is developing the Calls for Technologies with associated Functional Requirements and Framework Licences for Version 2 of the MPAI-AIF and MPAI-MMC standards, planning to publish the Calls on 19 July 2022. The definition of the terms of the Framework Licences, a licence without critical data such as cost, dates, rates etc., is a prerogative of the MPAI Principal Members.

Version 2 will substantially extend the capabilities of Version 1 of the two standards by supporting three new use cases:

  1. Conversation About a Scene: a human holds a conversation with a machine about objects in a scene of which the human is part. While conversing, the human points their fingers to indicate their interest in a particular object.
  2. Human-Connected Autonomous Vehicle Interaction: a group of humans converse with a Connected Autonomous Vehicle (CAV) on a domain-specific subject (travel by car). The machine understands the utterances, the emotion in the speech and the expression in the faces and in the gestures of the humans it is conversing with, and manifests itself as the head and shoulders of an avatar whose face and head convey emotions congruent with the uttered speech.
  3. Avatar Videoconference: avatars participate in a videoconference reproducing the upper part of the human bodies they represent with a high degree of accuracy. A virtual secretary with a human-like appearance creates an online summary of the meeting. The quality of the summary is enhanced by the virtual secretary’s ability to detect the avatars’ emotions and expressions and to interact with avatars requesting changes to the summary. The quality of the interaction is enhanced by the virtual secretary’s ability to show emotions and expressions.

MPAI develops data coding standards for applications that have AI as the core enabling technology. Any legal entity supporting the MPAI mission may join MPAI, if able to contribute to the development of standards for the efficient use of data.

So far, MPAI has developed 5 standards (normal font in the list below), is currently engaged in extending two approved standards (underlined), and is developing 9 other standards (italic).

Name of standard | Acronym | Brief description
AI Framework | MPAI-AIF | Specifies an infrastructure enabling the execution of implementations and access to the MPAI Store. MPAI-AIF V2 is being prepared.
Context-based Audio Enhancement | MPAI-CAE | Improves the user experience of audio-related applications in a variety of contexts. MPAI-CAE V2 is being prepared.
Compression and Understanding of Industrial Data | MPAI-CUI | Predicts the company performance from governance, financial, and risk data.
Governance of the MPAI Ecosystem | MPAI-GME | Establishes the rules governing the submission of and access to interoperable implementations.
Multimodal Conversation | MPAI-MMC | Enables human-machine conversation emulating human-human conversation. MPAI-MMC V2 is being prepared.
Server-based Predictive Multiplayer Gaming | MPAI-SPG | Trains a network to compensate for data losses and detect false data in online multiplayer gaming.
AI-Enhanced Video Coding | MPAI-EVC | Improves existing video coding with AI tools for short-to-medium term applications.
End-to-End Video Coding | MPAI-EEV | Explores the promising area of AI-based “end-to-end” video coding for longer-term applications.
Connected Autonomous Vehicles | MPAI-CAV | Specifies components for Environment Sensing, Autonomous Motion, and Motion Actuation.
Avatar Representation and Animation | MPAI-ARA | Specifies descriptors of avatars impersonating real humans.
Neural Network Watermarking | MPAI-NNW | Measures the impact of adding ownership and licensing information to models and inferences.
Integrative Genomic/Sensor Analysis | MPAI-GSA | Compresses high-throughput experiment data combining genomic/proteomic and other data.
Mixed-reality Collaborative Spaces | MPAI-MCS | Supports collaboration of humans represented by avatars in virtual-reality spaces called Ambients.
Visual Object and Scene Description | MPAI-OSD | Describes objects and their attributes in a scene and the semantic description of the objects.

Visit the MPAI website, contact the MPAI secretariat for specific information, subscribe to the MPAI Newsletter and follow MPAI on social media: LinkedIn, Twitter, Facebook, Instagram, and YouTube.

Most importantly: join MPAI, share the fun, build the future.


The MPAI Framework Licence approach to Standard Essential Patent (SEP) licensing

In the business world, goods are delivered based on technical and commercial specifications. In the standards world, there are good reasons why the goods (the standards) of a Standards Developing Organisation (SDO) are not delivered according to the commercial requirements normally accepted in the business world. However, this is not a good reason for an SDO to settle for commercial requirements called “patent declarations” that simply bind their originators to license their SEPs at so-called Fair, Reasonable and Non-Discriminatory (FRAND) terms. This simply would not make sense in business, and it is the reason why FRAND terms have caused, and continue to cause, problems.

The Moving Picture, Audio and Data Coding by Artificial Intelligence (MPAI) SDO was established to develop data coding standards mostly using the Artificial Intelligence (AI) technology while offering, for each of its standards, a clear licensing framework to implementers.

This is how MPAI implements its process:

  1. A new standard may be proposed by anybody.
  2. Anybody may participate in the development of the Use Cases and Functional Requirements of a standard.
  3. MPAI Principal Members intending to participate in the development of a standard develop and approve, with 2/3 majority, the Framework Licence (FWL) for that standard. The FWL is a licence without values (dollars, percentages, rates, dates, etc.) containing a declaration that:
    1. The total cost of the licence will be in line with the total cost of the licenses for similar data coding technologies and will consider the market value of the specific standardised technology.
    2. The licence will be issued not after commercial implementations of the standard are made available on the market.
  4. During the development of the standard, MPAI members making technical contributions to the committee developing the standard declare that they will make their licences available according to the FWL. Non-members may participate in the development of the standard by becoming members.
  5. After the standard has been approved by the MPAI General Assembly:
    1. MPAI members who believe to be SEP holders express their preference on the patent pool administrator of the standard with a 2/3 majority.
    2. All Members declare they will get a licence for other members’ SEPs, if used, within one year after publication of the licensing terms by SEP holders.

The MPAI process ensures that

  1. The use cases and functional requirements of a standard are developed with participation of the eventual users, not just by MPAI members (i.e., the technology developers).
  2. Information about the eventual licence of a standard includes time (not after products are on the market) and cost (total cost of the licence in line with the total cost of the licenses for similar technologies).

Sure, this is not the same as a hard delivery date and a price tag in dollars – standards are a special type of goods, closely watched by antitrust authorities. But it is a long way from a promise that the cost will be “fair”, “reasonable” and “non-discriminatory”, with the timing left to heaven’s will.

You can find more information on the MPAI process and the FWL.


Virtual Secretary for Videoconference

As reported in a previous post, MPAI is busy finalising the “Use Cases and Functional Requirements” document of MPAI-MMC V2. One use case is Avatar-Based Videoconference (ABV), part of the Mixed-reality Collaborative Space (MCS) project supporting scenarios where geographically separated humans represented by avatars collaborate in virtual-reality spaces.

ABV refers to a virtual videoconference room equipped with a table and an appropriate number of chairs to be occupied by:

  1. Speaking virtual twins representing human participants displayed as the upper part of avatars resembling their real twins.
  2. Speaking human-like avatars not representing humans, e.g., a secretary taking notes of the meeting, answering questions, etc.

In line with the MPAI approach to standardisation, this article reports the currently defined functions, input/output data and AIM topology of the AI Workflow (AIW) of the Virtual Secretary, and the AI Modules (AIMs) with their input/output data. The information in this article is expected to change when it is published as an annex to the upcoming Call for Technologies.

The functions of the Virtual Secretary are:

  1. To collect and summarise the statements made by participating avatars.
  2. To display the summary for participants to see, read and comment on.
  3. To receive sentences/questions about its summary via Speech and Text.
  4. To monitor the avatars’ emotions in their speech and face, and expression in their gesture.
  5. To change the summary based on avatars’ text from speech, emotion from speech and face, and expression from gesture.
  6. To respond via speech and text, and display emotion in text, speech, and face.

The Virtual Secretary workflow in the AI Framework is depicted in Figure 1.

Figure 1 – Reference Model of Virtual Secretary

The operation of the workflow can be described as follows:

  1. The Virtual Secretary recognises the speech of the avatars.
  2. The Speech Recognition and Face Analysis AIMs extract the emotions from the avatars’ speech and faces.
  3. Emotion Fusion provides a single emotion based on the two emotions.
  4. Gesture Analysis extracts the gesture expression.
  5. Language Understanding uses the recognised text and the emotion in speech to provide the final version of the input text (LangUnd-Text) and the meaning of the sentence uttered by an avatar.
  6. Question Analysis uses the meaning to extract the intention of the sentence uttered by an avatar.
  7. Question and Dialogue Processing (QDP) receives LangUnd-Text and the text provided by a participant via chat and generates:
    1. The text to be used in the summary or to interact with other avatars.
    2. The emotion contained in the speech to be synthesised.
    3. The emotion to be displayed by the Virtual Secretary avatar’s face.
    4. The expression to be displayed by the Virtual Secretary’s avatar.
  8. Speech Synthesis (Emotion) uses QDP’s text and emotion and generates the Virtual Secretary’s synthetic speech with the appropriate embedded emotion.
  9. Face Synthesis (Emotion) uses the avatar’s synthetic speech and QDP’s face emotion to animate the face of the Virtual Secretary’s avatar.

The data types processed by the Virtual Secretary are:

Avatar Descriptors allow the animation of an Avatar Model based on the description of the movement of:

  1. Muscles of the face (e.g., eyes, lips).
  2. Head, arms, hands, and fingers.

Avatar Model allows the use of the avatar descriptors related to the model (without the lower part, i.e., from the waist down) to:

  1. Express one of the MPAI standardised emotions on the face of the avatar.
  2. Animate the lips of an avatar in a way that is congruent with the speech it utters, its associated emotion and the emotion it expresses on the face.
  3. Animate head, arms, hands, and fingers to express one of the Gestures to be standardised by MPAI, e.g., to indicate a particular person or object or the movements required by a sign language.
  4. Rotate the upper part of the avatar’s body, e.g., as needed when the avatar turns to watch the avatar next to it.

Emotion of a Face is represented by the MPAI standardised basic set of 59 static emotions and their semantics. To support the Virtual Secretary use case, MPAI needs new technology to represent a sequence of emotions each having a duration and a transition time. The dynamic emotion representation should allow for two different emotions to happen at the same time, possibly with different durations.
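One way such a dynamic representation could be structured is sketched below; it is an assumption made for illustration, not the technology MPAI will standardise.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class TimedEmotion:
    emotion: str       # one of the MPAI standardised emotions
    start: float       # seconds from the beginning of the sequence
    duration: float    # how long the emotion is held
    transition: float  # time to blend into this emotion

# Two emotions may be active at the same time, possibly with different durations.
sequence: List[TimedEmotion] = [
    TimedEmotion("surprised", start=0.0, duration=1.5, transition=0.2),
    TimedEmotion("happy",     start=1.0, duration=3.0, transition=0.5),
]

def active_emotions(sequence, t):
    # Returns the emotions active at time t (overlaps are allowed).
    return [e.emotion for e in sequence if e.start <= t < e.start + e.duration]

print(active_emotions(sequence, 1.2))   # ['surprised', 'happy']
```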

Face Descriptors allow the animation of a face expressing emotion, including at least eyes (to gaze at a particular avatar) and lips (animated in sync with the speech).

Intention is the result of analysis of the goal of an input question standardised in MPAI-MMC V1.

Meaning is information extracted from an input text and physical gesture expression such as question, statement, exclamation, expression of doubt, request, invitation.

Physical Gesture Descriptors represent the movement of head, arms, hands, and fingers suitable for:

  1. Recognition of sign language.
  2. Recognition of coded hand signs, e.g., to indicate a particular object in a scene.
  3. Representation of arbitrary head, arm, hand, and finger motion.
  4. Culture-dependent signs (e.g., mudra signs).

Spatial coordinates allow the representation of the position of an avatar, so that another avatar can gaze at its face when it has a conversation with it.

Speech Features allow a user to select a Virtual Secretary with a particular speech model.

Visual Scene Descriptors allow the representation of a visual scene in a virtual environment.

In July, MPAI plans to publish a Call for Technologies for MPAI-MMC V2. The Call will have two attachments: the first is the already referenced Use Cases and Functional Requirements document; the second is the Framework Licence that those responding to the Call shall accept in order to have their response considered.


Watermarking and AI

The term watermarking comprises a family of methodological and application tools used to insert data into a content item in a way that is as imperceptible and persistent as possible. Watermarking is used for different purposes, such as enabling an entity to claim ownership of a content item or enabling a device to use it.
As a neural network is a type of content – and one that may be quite expensive to develop – does it make sense to apply to neural networks the watermarking approach used for other content?
MPAI thinks it does and is working to develop requirements for a Neural Network Watermarking (NNW) standard called MPAI-NNW that will enable a watermarking technology provider to validate its products’ claims. The standard will provide the means to measure, for a given size of the watermarking payload, the ability of:

  • The watermark inserter to inject a payload without affecting the performance of the neural network. This item requires, for a given application domain:
    • A testing dataset to be used for the watermarked and unwatermarked neural network.
    • An evaluation methodology to assess any change of the performance induced by the watermark.
  • The watermark detector to recognise the presence of the inserted watermark when applied to a watermarked network that has been modified (e.g., by transfer learning or pruning) or to any of the inferences of the modified model. This item requires, for a given application domain:
    • A list of potential modification types expected to be applied to the watermarked neural network as well as of their ranges (e.g., random pruning at 25%).
    • Performance criteria for the watermark detector (e.g., relative numbers of missed detections and false alarms).
  • The watermark decoder to successfully retrieve the payload when applied to a watermarked network that has been modified (e.g., by transfer learning or pruning) or to any of the inferences of the modified model. This item requires, for a given application domain:
    • A list of potential modification types expected to be applied to the watermarked neural network as well as of their ranges (e.g., random pruning at 25%).
    • Performance criteria for the watermark decoder (e.g., 100% or (100-α)% recovery).
  • The watermark inserter to inject a payload at a low computational cost, e.g., execution time on a given processing environment.
  • The watermark detector/decoder to detect/decode a payload from a watermarked model or from any of its inferences, at a low computational cost, e.g., execution time on a given processing environment.
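A hedged sketch of the kind of evaluation harness such a standard would enable is shown below; the model, detector and decoder objects and all helper names are assumptions made for illustration, not part of MPAI-NNW.

```python
import time

def performance_delta(metric, model, watermarked_model, test_set):
    # Change of performance induced by the watermark, on the agreed testing dataset.
    return metric(model, test_set) - metric(watermarked_model, test_set)

def detection_rates(detector, modified_watermarked_models, unwatermarked_models):
    # Missed detections and false alarms of the watermark detector.
    missed = sum(not detector(m) for m in modified_watermarked_models) / len(modified_watermarked_models)
    false_alarms = sum(detector(m) for m in unwatermarked_models) / len(unwatermarked_models)
    return missed, false_alarms

def decoding_accuracy(decoder, modified_watermarked_models, payload):
    # Fraction of modified models (e.g., randomly pruned at 25%) whose payload is fully recovered.
    return sum(decoder(m) == payload for m in modified_watermarked_models) / len(modified_watermarked_models)

def computational_cost(fn, *args):
    # Execution time of the inserter, detector or decoder on the given processing environment.
    start = time.perf_counter()
    fn(*args)
    return time.perf_counter() - start
```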

You can read the MPAI-NNW Use cases & functional requirements WD 0.2.

The work of developing requirements for the MPAI-NNW standard is ongoing. In this phase of the work, participation is open to non-members. Contact the MPAI Secretariat if you wish to join the MPAI-NNW online meetings.