Moving Picture, Audio and Data Coding
by Artificial Intelligence

Archives: 2022-01-26

MPAI approves a new Technical Specification and a Technical Report

Geneva, Switzerland – 25 January 2023. Today the international, non-profit, unaffiliated Moving Picture, Audio and Data Coding by Artificial Intelligence (MPAI) standards developing organisation has concluded its 28th General Assembly (MPAI-28) approving the Neural Network Watermarking (MPAI-NNW) Technical Specification, the MPAI Metaverse Model (MPAI-MMM) Technical Report, and the 2023 program of work on the Metaverse.

MPAI-28 has approved for publication the following two documents:

  1. Neural Network Watermarking (MPAI-NNW). Draft Technical Specification providing methodologies to evaluate the performance of neural network-based watermarking solutions in terms of imperceptibility, robustness, and computational cost. Further information: YouTube video, Non-YouTube video, MPAI-NNW.
  2. MPAI Metaverse Model (MPAI-MMM). Draft Technical Report, a document outlining a set of desirable guidelines to accelerate the development of interoperable Metaverses. The online presentation of the draft version of this document is available at: YouTube video, Non-YouTube video, The MPAI Metaverse Model.

MPAI has also approved the 2023 program of work related to the MPAI Metaverse Model:

  1. Functionality Profiles referencing MMM functionalities, not technologies.
  2. Metaverse Instance Architecture with the functions and data types of the building blocks.
  3. Functional requirements of the identified data types.
  4. Table of Contents of the Common Metaverse Specifications.
  5. Initial Common Metaverse Specifications that includes MPAI Technologies.

MPAI is continuing its work plan comprising the following Technical Specifications:

  1. AI Framework (MPAI-AIF). Standard for a secure AIF environment executing AI Workflows (AIW) composed of AI Modules (AIM).
  2. Avatar Representation and Animation (MPAI-ARA). Standard for generation and animation of interoperable avatar models reproducing humans and expressing a Personal Status.
  3. Context-based Audio Enhancement (MPAI-CAE). Standard to describe an audio scene to support human interaction with autonomous vehicles and metaverse applications.
  4. Multimodal Conversation (MPAI-MMC). Standard for Personal Status generalising the notion of Emotion including Cognitive State and Social Attitude.

The MPAI work plan also includes exploratory activities, some of which are close to becoming standard or technical report projects:

  1. AI Health (MPAI-AIH). Targets an architecture where smartphones store users’ health data processed using AI and AI Models are updated using Federated Learning.
  2. Connected Autonomous Vehicles (MPAI-CAV). Targets the Human-CAV Interaction Environment Sensing, Autonomous Motion, and Motion Actuation subsystems implemented as AI Workflows.
  3. End-to-End Video Coding (MPAI-EEV). Extends the video coding frontiers using AI-based End-to-End Video coding.
  4. AI-Enhanced Video Coding (MPAI-EVC). Improves existing video coding with AI tools for short-to-medium term applications.
  5. Server-based Predictive Multiplayer Gaming (MPAI-SPG). Uses AI to train neural networks that help an online gaming server to compensate for data losses and detect false data.
  6. XR Venues (MPAI-XRV). Identifies common AI Modules used across various XR-enabled and AI-enhanced use cases where venues may be both real and virtual.

As we enter the year 2023, this is a good time for legal entities supporting the MPAI mission and able to contribute to the development of standards for the efficient use of data to join MPAI.

Please visit the MPAI website, contact the MPAI secretariat for specific information, subscribe to the MPAI Newsletter and follow MPAI on social media: LinkedIn, Twitter, Facebook, Instagram, and YouTube.


MPAI is offering its high-quality drone sequences to the video coding community

Fifteen months ago, MPAI started an investigation on AI-based End-to-End Video Coding, a new approach that is not based on traditional video coding architectures. Recently published results from the investigation show that Version 0.3 of the MPAI-EEV Reference Model has generally higher performance than the MPEG-HEVC video coding standard when applied to the MPAI set of high-quality drone video sequences.

MPAI is now offering its Unmanned Aerial Vehicle (UAV) sequence dataset for use by the video community in testing compression algorithms. The dataset contains various drone videos captured under different conditions, including environments, flight altitudes, and camera views. These video clips are selected from several categories of real-life objects in different scene object densities and lighting conditions, representing diverse scenarios in our daily life.

Compared to natural videos, UAV-captured videos are generally recorded by drone-mounted cameras in motion and at different viewpoints and altitudes. These features bring several new challenges, such as motion blur, scale changes and complex backgrounds. Heavy occlusion, non-rigid deformation and tiny object scales can pose great challenges for drone video compression.

Please request an invitation from the MPAI Secretariat and join one of the biweekly meetings of the MPAI-EEV group (starting from 1 February 2023). The MPAI-EEV group will showcase its fully neural network-based video codec model, which delivers superior performance on drone videos. The group is inclusive and is planning the future of video coding using end-to-end learning. Please feel free to participate and to leave your comments or suggestions with the MPAI-EEV group. We will discuss your contributions and our state of the art with the goal of advancing this exciting area of drone video coding.

Table 1 – Drone video test sequences

Sequence Name      Spatial Resolution  Frame Count  Frame Rate  Bit Depth  Scene Feature

Class A (source: VisDrone-SOT, TPAMI 2021)
BasketballGround   960×528             100          24          8          Outdoor
GrassLand          1344×752            100          24          8          Outdoor
Intersection       1360×752            100          24          8          Outdoor
NightMall          1920×1072           100          30          8          Outdoor
SoccerGround       1904×1056           100          30          8          Outdoor

Class B (source: VisDrone-MOT, TPAMI 2021)
Circle             1360×752            100          24          8          Outdoor
CrossBridge        2720×1520           100          30          8          Outdoor
Highway            1344×752            100          24          8          Outdoor

Class C (source: Corridor, IROS 2018)
Classroom          640×352             100          24          8          Indoor
Elevator           640×352             100          24          8          Indoor
Hall               640×352             100          24          8          Indoor

Class D (source: UAVDT S, ECCV 2018)
Campus             1024×528            100          24          8          Outdoor
RoadByTheSea       1024×528            100          24          8          Outdoor
Theater            1024×528            100          24          8          Outdoor

See https://mpai.community/standards/mpai-eev/about-mpai-eev/

Join MPAI – Share the fun – Build the future!


A look inside MPAI XR Venues

XR Venues is an MPAI project (MPAI-XRV) addressing use cases enabled by Extended Reality (XR) technologies – the combination of Augmented Reality (AR), Virtual Reality (VR) and Mixed Reality (MR) – and enhanced by Artificial Intelligence (AI) technologies. The word “venue” is used as a synonym for “real and virtual environments”.

The XRV group has identified some 10 use cases and made a detailed analysis of three of them: eSports Tournament, Live theatrical stage performance, and Experiential retail/shopping.

How did XRV become an MPAI project? MPAI responds to industry needs with a rigorous process that includes 8 phases starting from Interest Collection up to Technical Specification. The initial phase of the process:

  1. Starts with the submission of a proposal triggering the Interest Collection stage where the interest of people other than the proposers is sought.
  2. Continues with the Use Cases stage where applications of the proposal are studied.
  3. Concludes with the Functional Requirements stage where the AI Workflows implementing the developed use cases and their composing AI Modules are identified with their functions and data formats.

Let’s see how things are developing in the XR Venues project (MPAI-XRV), now at the Functional Requirements stage. We will describe the eSports Tournament use case. This consists of two teams of 3 to 6 players arranged on either side of a real world (RW) stage, each using a computer to compete within a real-time Massively Multiplayer Online game space.

Figure 1 – An eSports Tournament

The game space occurs in a virtual world (VW) populated by:

  1. Players represented by avatars each driven by role (e.g., magicians, warriors, soldiers, etc.), properties (e.g., costumes, physical form, physical features), and actions (e.g., casting spells, shooting, flying, jumping).
  2. Avatars representing other players, autonomous characters (e.g., dragon, monsters, various creatures), and environmental structures (e.g., terrain, mountains, bodies of water).

The game action is captured by multiple VW cameras and projected onto a RW immersive screen surrounding spectators and live streamed to remote spectators as a 2D video with all related sounds of the VW game space.

A shoutcaster calls the action as the game proceeds. The RW venue (XR Theatre) includes one or more immersive screens where the image of RW players, player stats or other information or imagery may also be displayed. The same may also be live streamed. The RW venue is augmented with lighting and special effects, music, and costumed performers.

Live stream viewers interact with one another and with commentators through live chats, Q&A sessions, etc. while RW spectators interact through shouting, waving and interactive devices (e.g., LED wands, smartphones). RW spectators’ features are extracted from data captured by camera and microphone or wireless data interface and interpreted.

Actions are generated from RW or remote audience behaviour and VW action data (e.g., spell casting, characters dying, bombs exploding).

At the end of the tournament, an award ceremony featuring the winning players on the RW stage is held with great fanfare.

eSports Tournament is a representative example of the XRV project where human participants are exposed to real and virtual environments that interact with one another. Figure 2 depicts the general model representing how data from a real or virtual environment are captured, processed, and interpreted to generate actions transformed into experiences that are delivered to another real or virtual environment.

Figure 2 – Environment A to Environment B Interactions

Irrespective of whether Environment A is real or virtual, Environment Capture captures signals and/or data from the environment, Feature Extraction extracts descriptors from data, and Feature Interpretation yields interpretations by analysing those descriptors. Action Generation generates actions by analysing interpretations, Experience Generation translates action into an experience, and Environment Rendering delivers the signals and/or data corresponding to the experience into Environment B, whether real or virtual. Of course, the same sequence of steps can occur in the right-to-left direction starting from Environment B.
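The six steps can be read as a chain of functions, each consuming the output of the previous one. The following is a minimal sketch under stated assumptions: the function names simply mirror the step names in the text, while the dictionary payloads, the `audio_level` input, and the 0.7 threshold are hypothetical and are not taken from any MPAI specification.

```python
from typing import Callable, List

# Each stage maps one dictionary to another. The payload layout is an
# illustrative assumption, not an MPAI-defined data format.
Stage = Callable[[dict], dict]


def environment_capture(env: dict) -> dict:
    # Capture signals and/or data from Environment A.
    return {"signals": env["audio_level"]}


def feature_extraction(data: dict) -> dict:
    # Extract a descriptor from the captured data.
    return {"loudness": data["signals"]}


def feature_interpretation(desc: dict) -> dict:
    # Interpret the descriptor (toy threshold chosen for illustration).
    return {"excited": desc["loudness"] > 0.7}


def action_generation(interp: dict) -> dict:
    # Generate an action from the interpretation.
    return {"action": "fireworks" if interp["excited"] else "idle"}


def experience_generation(action: dict) -> dict:
    # Translate the action into an experience.
    return {"experience": f"render:{action['action']}"}


def environment_rendering(exp: dict) -> dict:
    # Deliver the experience into Environment B.
    return {"environment_b": exp["experience"]}


PIPELINE: List[Stage] = [
    environment_capture,
    feature_extraction,
    feature_interpretation,
    action_generation,
    experience_generation,
    environment_rendering,
]


def run(env: dict) -> dict:
    """Pass data from Environment A through all six stages in order."""
    data = env
    for stage in PIPELINE:
        data = stage(data)
    return data


if __name__ == "__main__":
    print(run({"audio_level": 0.9}))
```

Running the same list of stages with Environment B as the input would model the right-to-left direction.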

A thorough analysis of the eSports Tournament use case has led the XRV group to develop the reference model depicted in Figure 3.

Figure 3 – Reference Model of eSports Tournament

The AI Modules on the left-hand side and in the middle of the reference model perform the Feature Extraction and Feature Interpretation functions identified in Figure 2. The data generated by them are:

  1. Player Status is the ensemble of information internal to the player, expressed by Emotion, Cognitive State, and Attitude estimated from Audio-Video-Controller-App of the individual players.
  2. Participants Status is the ensemble of information, expressed by Emotion, Cognitive State and Attitude of participants, estimated from the collective behaviour of Real World and on-line spectators in response to actions of a team, a player, or the game through audio, video, interactive controllers, and smartphone apps. Both data types are similar to the Personal Status developed in the context of Multimodal Conversation Version 2.
  3. Game State is estimated from Player location and Player Action (both in the VW), Game Score and Clock.
  4. Game Action Status is estimated from Game State, Player History, Team History, and Tournament Level.

The four data streams are variously combined by the three AI Modules on the right-hand side to generate effects in the RW and VW, and to orientate the cameras in the VW. These correspond to the Action Generation, Experience Generation and Environment Rendering of Figure 2.
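To make the four data streams more concrete, they can be pictured as simple records consumed by a downstream module. This is only a sketch: the field names, the `intensity` summary value, and the `choose_rw_effect` rule are hypothetical illustrations and do not come from the MPAI-XRV reference model.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass
class Status:
    """Shared shape assumed for Player Status and Participants Status."""
    emotion: str
    cognitive_state: str
    attitude: str


@dataclass
class GameState:
    """Estimated from Player location and Player Action, Game Score and Clock."""
    player_locations: Dict[str, Tuple[float, float, float]]
    player_actions: Dict[str, str]
    score: Tuple[int, int]
    clock_s: float


@dataclass
class GameActionStatus:
    """Hypothetical 0..1 summary of how intense the current game action is."""
    intensity: float


def choose_rw_effect(participants: Status, action: GameActionStatus) -> str:
    # Toy rule: combine two of the streams to pick a real-world lighting effect.
    if action.intensity > 0.8 and participants.emotion == "excited":
        return "strobe_lights"
    return "ambient_lights"


if __name__ == "__main__":
    crowd = Status(emotion="excited", cognitive_state="attentive", attitude="supportive")
    print(choose_rw_effect(crowd, GameActionStatus(intensity=0.9)))
```

Keeping each stream as its own record type reflects the idea that modules with agreed interfaces can be developed independently.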

The definition of interfaces between the AI Modules of Figure 3 will enable the independent development of those AI Modules with standard interfaces. An XR Theatre will be able to host a pre-existing game and produce an eSports Tournament supporting RW and VW audience interactivity. To the extent that the game possesses the required interfaces, the XR Theatre can also drive actions within the VW.

eSports has grown substantially in the last decade. Arena-sized eSports Tournaments of increasing complexity are now routine. An XRV Venue dedicated to eSports and enabled by AI can greatly enhance the participants’ experience with powerful multi-sensory, interactive and highly immersive media, while lowering the complexity of the system and the required human resources. Standardised AI Modules for an eSports XRV Venue enhance interoperability across different games and simplify experience design.


The MPAI Metaverse Model has been launched

MPAI posted the MPAI Metaverse Model (MMM) on 3 January 2023, calling for comments and contributions until 23 January, and organised two online presentations. You can see the recording of one presentation and the PowerPoint file:

YouTube Non-YouTube The MPAI Metaverse Model WD0.5

The MMM is a proposal for a method to develop Metaverse standards. It is based on experience honed during decades of digital media standardisation and seeks to accommodate the extreme heterogeneity of industries that all need a common technology complemented by industry specificities.

The MMM is not just a proposal of a method. It also includes a roadmap and implements the first steps of it. The steps of the roadmap are not intended to be implemented in a strict sequential way.

The table below indicates the steps. Steps 1 to 4 are ongoing and included in the MMM. Step 5 has started.

#  Step                                    Content
1  Terms and Definitions                   An interconnected and consistent set of terms and definitions.
2  Assumptions                             A set of assumptions guiding the development of metaverse standards, starting from: (1) collect functionalities; (2) develop the Common Metaverse Specifications (CMS); (3) establish industry-specific profiles based on CMS technologies.
3  Use Cases                               A set of 18 use cases with workflows used to develop metaverse functionalities.
4  External Services                       Potentially used by a metaverse instance to develop metaverse functionalities.
5  Functional Profiles                     Develop profiles that reference functionalities included in the MMM, not technologies.
6  Metaverse Architecture                  Develop a metaverse architecture with functional blocks and data exchanged between blocks.
7  Functional Requirements of Data Format  Develop functional requirements of the data formats exchanged between blocks.
8  CMS Table of Contents                   Identify and organise all technologies required to support the MMM functionalities.
9  MPAI standards                          Enter MPAI standards relevant to the metaverse into the CMS Table of Contents.

 


A bird’s eye view of the MPAI Metaverse Model

MPAI is pleased to announce that, after a full year of efforts, it has been able to publish the MPAI Metaverse Model, the master plan of a project designed to facilitate the establishment of standards promoting Metaverse Interoperability. Watch

YouTube video Non-YouTube video

The industry is showing a growing interest in the Metaverse that is expected to create new jobs, opportunities, and experiences with transformational impacts on virtually all sectors of human interaction.

Standards and Artificial Intelligence are widely recognised as two of the main drivers for the development of the Metaverse. MPAI – Moving Picture, Audio, and Data Coding by Artificial Intelligence – plays a role in both thanks to its status as an international, unaffiliated, non-profit organisation developing standards for AI-based data coding with clear Intellectual Property Rights licensing frameworks.

The MMM is a full-bodied document divided into 9 chapters.

  1. Introduction gives a high-level overview of the MMM and explains that it is published for community comments: MPAI posts the MMM, anybody can send comments and contributions to the MPAI Secretariat, and MPAI considers them and publishes the MMM in final form on 25 January.
  2. Definitions gives a comprehensive set of Metaverse-related terms and definitions.
  3. Assumptions details 16 assumptions that the proposed Metaverse standardisation process will adopt. Some of them are:
    1. the steps of the standardisation process.
    2. the availability of Common Metaverse Specifications (CMS).
    3. the eventual development of Metaverse Profiles.
    4. a definition of Metaverse Instance and Interoperability.
    5. the layered structure of a Metaverse Instance.
    6. the fact that Metaverse Instances already exist.
    7. the definition of Metaverse User.
  4. Use Cases collects a large number of application domains that will benefit from the use of the Metaverse and are analysed to derive Metaverse Functionalities. Examples are:
    1. Automotive,
    2. Education,
    3. Finance,
    4. Healthcare,
    5. Retail.
  5. External Services collects some of the services that a Metaverse Instance may require, either as platform-native or as externally provided services, and that are analysed to derive Metaverse Functionalities. Examples are:
    1. content creation
    2. marketplace
    3. crypto wallets.
  6. Functionalities is a major element of the MMM in its current form. It collects a large number of Functionalities that a Metaverse Instance may support depending on the Profile it adopts. It is organised in 9 areas, i.e.,
    1. Instance,
    2. Environment,
    3. Content Representation,
    4. Perception of the Universe by the Metaverse,
    5. Perception of the Metaverse by the Universe,
    6. User,
    7. Interaction,
    8. Information search,
    9. Economy support.
    Each area is organised in subareas: e.g., Instance is subdivided into
      1. Management
      2. Organisation
      3. Features
      4. Storage
      5. Process Management
      6. Security.
    Each subarea provides the Functionalities relevant to that subarea, e.g., Process Management includes the following Functionalities:
      1. Smart Contract
      2. Smart Contract Monitoring
      3. Smart Contract Interoperability.
  7. Technologies has the challenging task of verifying how well technologies match the requirements of the Functionalities. Currently, the following Technologies are analysed:
    1. Sensory information – namely, Audio, Visual, Touch, Olfaction, Gustation, and Brain signals.
    2. Data processing – how can we cope with the end of Moore’s Law and with the challenging requirements for distributed processing.
    3. User Devices – how Devices can cope with challenging motion-to-photon requirements.
    4. Network – the prospects of networks providing services satisfying high-level requirements, e.g., latency and bit error rate.
    5. Energy – the prospects of energy storage for portable devices and of energy consumption caused by thousands of Metaverse Instances and potentially billions of Devices.
  8. Governance identifies and analyses two areas:
    1. technical governance of the Metaverse System if the industry decides that this level of governance is in the common interest.
    2. governance by public authorities operating at a national or regional level.
  9. Profiles provides an initial roadmap from the publication of the MMM to the development of Profiles through the development of
    1. Metaverse Architecture
    2. Functional Requirements of Data types
    3. Common Metaverse Specification Table of Contents
    4. mapping of MPAI standard Technologies into the CMS
    5. inclusion of all required Technologies
    6. drafting of the mission of the Governance of the Metaverse System.

The MMM is a large integrated document. Comment on the MMM and join MPAI to make it happen!


Two MPAI documents published for community comments

Geneva, Switzerland – 21 December 2022. Today the international, non-profit, unaffiliated Moving Picture, Audio and Data Coding by Artificial Intelligence (MPAI) standards developing organisation has concluded its 27th General Assembly (MPAI-27) celebrating the adoption without modifications of three MPAI Technical Specifications as IEEE standards, and approving the publication of the MPAI Metaverse Model (MPAI-MMM) draft Technical Report and the Neural Network Watermarking (MPAI-NNW) draft Technical Specification for community comments.

The Institute of Electrical and Electronics Engineers (IEEE) Standards Association has adopted three MPAI Technical Specifications – AI Framework (MPAI-AIF), Context-based Audio Enhancement (MPAI-CAE), and Multimodal Conversation (MPAI-MMC) – as IEEE standards number 3301-2022, 3302-2022, and 3300-2022, respectively. The MPAI and IEEE versions are technically equivalent, and implementers of MPAI/IEEE standards can obtain an ImplementerID from the MPAI Store.

MPAI implements a rigorous process of standards development requiring publication of a draft Technical Specification or Technical Report with a request for community comments before final approval and publication. MPAI-27 approved the following two documents for the said preliminary publication:

  1. MPAI Metaverse Model (MPAI-MMM). Draft Technical Report, a document outlining a set of desirable guidelines to accelerate the development of interoperable Metaverses:
    1. A set of assumptions laid at the foundation of the Technical Report.
    2. Use cases based on Metaverse Instances and services provided to them.
    3. Application of the profile approach successfully adopted for digital media standards to Metaverse standards.
    4. An initial set of functionalities used by Metaverse Instances to facilitate the definition of profiles.
    5. Identification of the main technologies enabling the Metaverse.
    6. A roadmap to the definition of Metaverse Profiles.
    7. An initial list of governance and regulation issues likely to affect the design, deployment, operation, and interoperability of Metaverse Instances.

An online presentation of MPAI-MMM will be made on 2023/01/10

08:00 UTC: https://us06web.zoom.us/meeting/register/tZEtcuuurTsuHdcbXCAy-we7soWkIqK5a2MK

18:00 UTC: https://us06web.zoom.us/meeting/register/tZcocuqtrjkuGdz0_nQWhLIJMvSHbfAkqP39

The MPAI Metaverse Model is accessible online.

  2. Neural Network Watermarking (MPAI-NNW). Draft Technical Specification providing methodologies to evaluate the performance of neural network-based watermarking solutions in terms of:
    1. The watermarking solution imperceptibility defined as a measure of the potential impact of the watermark injection on the result of the inference created by the model.
    2. The watermarking solution robustness defined as the detector and decoder ability to retrieve the watermark when the watermarked model is subjected to modifications.
    3. The computational cost of the main operations performed in the end-to-end watermarking process.
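The three evaluation axes can be illustrated with toy measurements. The sketch below is not the MPAI-NNW methodology: it assumes, purely for illustration, that imperceptibility is approximated by the accuracy lost after watermark injection, that robustness is approximated by the bit error rate of the retrieved watermark, and that computational cost is approximated by wall-clock time. All function names are hypothetical.

```python
import time
from typing import Callable, Sequence, Tuple, TypeVar

T = TypeVar("T")


def accuracy(predictions: Sequence[int], labels: Sequence[int]) -> float:
    """Fraction of predictions that match the labels."""
    if len(predictions) != len(labels) or not labels:
        raise ValueError("predictions and labels must be non-empty and equal in length")
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)


def imperceptibility_gap(
    original_preds: Sequence[int],
    watermarked_preds: Sequence[int],
    labels: Sequence[int],
) -> float:
    """Accuracy lost by the watermarked model (assumed proxy for imperceptibility)."""
    return accuracy(original_preds, labels) - accuracy(watermarked_preds, labels)


def bit_error_rate(embedded: Sequence[int], retrieved: Sequence[int]) -> float:
    """Fraction of watermark bits retrieved incorrectly (assumed proxy for robustness)."""
    if len(embedded) != len(retrieved) or not embedded:
        raise ValueError("bit sequences must be non-empty and equal in length")
    return sum(a != b for a, b in zip(embedded, retrieved)) / len(embedded)


def timed(fn: Callable[[], T]) -> Tuple[T, float]:
    """Run fn and return its result with the elapsed wall-clock seconds."""
    start = time.perf_counter()
    result = fn()
    return result, time.perf_counter() - start


if __name__ == "__main__":
    labels = [0, 1, 1, 0]
    print(imperceptibility_gap([0, 1, 1, 0], [0, 1, 0, 0], labels))
    print(bit_error_rate([1, 0, 1, 1], [1, 0, 0, 1]))
```

In practice the retrieved bits would be measured after each modification applied to the watermarked model, giving one robustness figure per modification.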

The documents are accessible from the links above. Comments should be sent to the MPAI secretariat. Both documents are expected to be released in final form on 2023/01/25.

MPAI is continuing its work plan comprising the following Technical Specifications:

  1. AI Framework (MPAI-AIF). Standard for a secure AIF environment executing AI Workflows (AIW) composed of AI Modules (AIM).
  2. Avatar Representation and Animation (MPAI-ARA). Standard for generation and animation of interoperable avatar models reproducing humans and expressing a Personal Status.
  3. Context-based Audio Enhancement (MPAI-CAE). Standard to describe an audio scene to support human interaction with autonomous vehicles and metaverse applications.
  4. Multimodal Conversation (MPAI-MMC). Standard for Personal Status generalising the notion of Emotion including Cognitive State and Social Attitude.

The MPAI work plan also includes exploratory activities, some of which are close to becoming standard or technical report projects:

  1. AI Health (MPAI-AIH). Targets an architecture where smartphones store users’ health data processed using AI and AI Models are updated using Federated Learning.
  2. Connected Autonomous Vehicles (MPAI-CAV). Targets the Human-CAV Interaction Environment Sensing, Autonomous Motion, and Motion Actuation subsystems implemented as AI Workflows.
  3. End-to-End Video Coding (MPAI-EEV). Extends the video coding frontiers using AI-based End-to-End Video coding.
  4. AI-Enhanced Video Coding (MPAI-EVC). Improves existing video coding with AI tools for short-to-medium term applications.
  5. Server-based Predictive Multiplayer Gaming (MPAI-SPG). Uses AI to train neural networks that help an online gaming server to compensate for data losses and detect false data.
  6. XR Venues (MPAI-XRV). Identifies common AI Modules used across various XR-enabled and AI-enhanced use cases where venues may be both real and virtual.

As we enter the year 2023, it is a good opportunity for legal entities supporting the MPAI mission and able to contribute to the development of standards for the efficient use of data to join MPAI.

Please visit the MPAI website, contact the MPAI secretariat for specific information, subscribe to the MPAI Newsletter and follow MPAI on social media: LinkedIn, Twitter, Facebook, Instagram, and YouTube.



Three MPAI Technical Specifications are now IEEE Standards

MPAI is pleased to announce that Multimodal Conversation (MPAI-MMC), AI Framework (MPAI-AIF), and Context-based Audio Enhancement (MPAI-CAE) have been adopted without modification by the Institute of Electrical and Electronics Engineers (IEEE) Standards Association as IEEE Standards 3300-2022, 3301-2022, and 3302-2022, respectively.

The steps leading to this result involved granting IEEE the right to publish and distribute the text of IEEE Standards 3300-2022, 3301-2022, and 3302-2022 as derivative works of MPAI-MMC, MPAI-AIF, and MPAI-CAE. They also paved the way to the establishment of the MPAI Store, a non-profit organisation with the mandate to assign ImplementerIDs, unique identifiers for implementers of MPAI technical specifications, and to verify and distribute MPAI implementations.

Implementers owning an ImplementerID can uniquely identify their embodiments of AI Frameworks (AIF), AI Workflows (AIW), and AI Modules (AIM). Thanks to the ImplementerID syntax, the MPAI Store can verify that an implementation submitted for distribution has indeed been originated by the implementer identified by the ImplementerID.

On the same occasion, the IEEE approved the Project Authorization Request (PAR) establishing the P3303 Working Group (WG). The WG is in charge of managing the process eventually leading to the adoption of Compression and Understanding of Industrial Data (MPAI-CUI) without modification as IEEE P3303.

MPAI continues to look forward to a fruitful collaboration with IEEE on these and future MPAI standards.


MPAI calls for new members to support its standard development plans

Geneva, Switzerland – 23 November 2022. Today the international, non-profit, unaffiliated Moving Picture, Audio and Data Coding by Artificial Intelligence (MPAI) standards developing organisation has concluded its 26th General Assembly (MPAI-26). MPAI is calling for new members to support the development of its work program.

Planned for approval in the first months of 2023 are 5 standards and 1 technical report:

  1. AI Framework (MPAI-AIF). Standard for a secure AIF environment executing AI Workflows (AIW) composed of AI Modules (AIM).
  2. Avatar Representation and Animation (MPAI-ARA). Standard for generation and animation of interoperable avatar models reproducing humans and expressing a Personal Status.
  3. Context-based Audio Enhancement (MPAI-CAE). Standard to describe an audio scene to support human interaction with autonomous vehicles and metaverse applications.
  4. Multimodal Conversation (MPAI-MMC). Standard for Personal Status generalising the notion of Emotion including Cognitive State and Social Attitude.
  5. MPAI Metaverse Model (MPAI-MMM). Technical Report covering the design, deployment, operation, and interoperability of Metaverse Instances.
  6. Neural Network Watermarking (MPAI-NNW). Standard specifying methodologies to evaluate neural network-based watermarking solutions.

The MPAI work plan also includes exploratory activities, some of which are close to becoming standard or technical report projects:

  1. AI Health (MPAI-AIH). Targets an architecture where smartphones store users’ health data processed using AI and AI Models are updated using Federated Learning.
  2. Connected Autonomous Vehicles (MPAI-CAV). Targets the Human-CAV Interaction Environment Sensing, Autonomous Motion, and Motion Actuation subsystems implemented as AI Workflows.
  3. End-to-End Video Coding (MPAI-EEV). Extends the video coding frontiers using AI-based End-to-End Video coding.
  4. AI-Enhanced Video Coding (MPAI-EVC). Improves existing video coding with AI tools for short-to-medium term applications.
  5. Server-based Predictive Multiplayer Gaming (MPAI-SPG). Uses AI to train neural networks that help an online gaming server to compensate for data losses and detect false data.
  6. XR Venues (MPAI-XRV). Identifies common AI Modules used across various XR-enabled and AI-enhanced use cases where venues may be both real and virtual.

It is a good opportunity for legal entities supporting the MPAI mission and able to contribute to the development of standards for the efficient use of data to join MPAI now, also considering that membership is immediately active and will last until 2023/12/31.

Please visit the MPAI website, contact the MPAI secretariat for specific information, subscribe to the MPAI Newsletter and follow MPAI on social media: LinkedIn, Twitter, Facebook, Instagram, and YouTube.

Most importantly: please join MPAI, share the fun, build the future.


End-to-End Video Coding in MPAI

Introduction

During the past decade, Unmanned Aerial Vehicles (UAVs) have attracted increasing attention due to their flexible, extensive, and dynamic space-sensing capabilities. The volume of video captured by UAVs is growing exponentially, along with the increased bitrate generated by advances in the sensors mounted on UAVs, bringing new challenges for on-device UAV storage and air-ground data transmission. Most existing video compression schemes were designed for natural scenes without consideration of the specific texture and view characteristics of UAV videos. In the MPAI EEV project, we have contributed a detailed analysis of the current state of the field of UAV video coding. EEV then establishes a novel task for learned UAV video coding and constructs a comprehensive and systematic benchmark for that task, presents a thorough review of high-quality UAV video datasets and benchmarks, and contributes an extensive rate-distortion efficiency comparison of learned and conventional codecs. Finally, we discuss the challenges of encoding UAV videos. It is expected that the benchmark will accelerate research and development in video coding on drone platforms.

UAV Video Sequences

We collect a set of video sequences to build the UAV video coding benchmark. The contents are diverse in many aspects, including recording device type (various models of drone-mounted cameras), location (indoor and outdoor places), environment (traffic workload, urban and rural regions), objects (e.g., pedestrians and vehicles), and scene object density (sparse and crowded scenes). Table 1 provides a comprehensive summary of the prepared learned drone video coding benchmark for a better understanding of those videos.

Table 1: Video sequence characteristics of the proposed learned UAV video coding benchmark

Source                Sequence Name     Spatial Resolution  Frame Count  Frame Rate (fps)  Bit Depth  Scene Feature
Class A VisDrone-SOT  BasketballGround  960×528             100          24                8          Outdoor
Class A VisDrone-SOT  GrassLand         1344×752            100          24                8          Outdoor
Class A VisDrone-SOT  Intersection      1360×752            100          24                8          Outdoor
Class A VisDrone-SOT  NightMall         1920×1072           100          30                8          Outdoor
Class A VisDrone-SOT  SoccerGround      1904×1056           100          30                8          Outdoor
Class B VisDrone-MOT  Circle            1360×752            100          24                8          Outdoor
Class B VisDrone-MOT  CrossBridge       2720×1520           100          30                8          Outdoor
Class B VisDrone-MOT  Highway           1344×752            100          24                8          Outdoor
Class C Corridor      Classroom         640×352             100          24                8          Indoor
Class C Corridor      Elevator          640×352             100          24                8          Indoor
Class C Corridor      Hall              640×352             100          24                8          Indoor
Class D UAVDT S       Campus            1024×528            100          24                8          Outdoor
Class D UAVDT S       RoadByTheSea      1024×528            100          24                8          Outdoor
Class D UAVDT S       Theater           1024×528            100          24                8          Outdoor

The corresponding thumbnail of each video clip is depicted in Fig. 1 as supplementary information. There are 14 video clips from multiple UAV video dataset sources [1, 2, 3]. Their resolutions range from 2720 × 1520 down to 640 × 352, and their frame rates from 24 to 30 frames per second.

To comprehensively reveal the R-D efficiency of UAV video coding with both conventional and learned codecs, we encode the drone video sequences collected above using the HEVC reference software with the screen content coding (SCC) extension (HM-16.20-SCM-8.8) and the emerging learned video coding framework OpenDVC [4]. Moreover, the reference model of MPAI End-to-End Video (EEV) is also employed to compress the UAV videos. As such, the baseline coding results are based on three different codecs. Their schematic diagrams are shown in Fig. 1. The left panel represents the classical hybrid codec. The remaining two are the learned codecs, OpenDVC and EEV respectively. The EEV software is an enhanced version of the OpenDVC codec that incorporates more advanced modules such as motion compensation prediction improvement, two-stage residual modelling, and an in-loop restoration network.

Figure 1: Block diagram of different codecs. (a) Conventional hybrid codec HEVC. (b) OpenDVC. (c) MPAI EEV. Zoom in for better visualization.

Another important factor for learned codecs is train-and-test data consistency. It is widely accepted in the machine learning community that train and test data should be independent and identically distributed. However, both OpenDVC and EEV are trained on the natural video dataset vimeo-90k with mean-square-error (MSE) as the distortion metric. We employ the pre-trained weights of the learned codecs without fine-tuning them on drone video data to guarantee that the benchmark is general.

Evaluation

Since all drone videos in our proposed benchmark use the RGB color space, the quality assessment methods are also applied to the reconstruction in the RGB domain. For each frame, the peak signal-to-noise ratio (PSNR) is calculated for each component channel. The RGB-averaged value is then taken to indicate the picture quality. Regarding the bitrate, we calculate bits per pixel (BPP) using the binary files produced by the codecs. We report the coding efficiency of the different codecs using the Bjøntegaard delta bit rate (BD-rate) measurement.
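The two per-frame quantities described above can be sketched in a few lines of Python. This is a minimal illustration, not the EEV evaluation script: the function names are hypothetical, each channel is assumed to be a list of pixel rows, the peak value of 255 follows from the 8-bit depth of the sequences, and averaging the three per-channel PSNR values (rather than averaging the MSE first) is an assumption about what "RGB averaged" means.

```python
import math


def channel_psnr(ref, rec, peak=255.0):
    """PSNR in dB between two equally sized channels given as lists of rows."""
    squared_error = 0.0
    count = 0
    for ref_row, rec_row in zip(ref, rec):
        for a, b in zip(ref_row, rec_row):
            squared_error += (a - b) ** 2
            count += 1
    mse = squared_error / count
    if mse == 0:
        return math.inf  # identical channels
    return 10.0 * math.log10(peak * peak / mse)


def rgb_psnr(ref_rgb, rec_rgb, peak=255.0):
    """Mean of the R, G and B channel PSNRs (assumed averaging rule)."""
    values = [channel_psnr(r, c, peak) for r, c in zip(ref_rgb, rec_rgb)]
    return sum(values) / len(values)


def bits_per_pixel(bitstream_bytes, width, height, frames):
    """Bitrate in bits per pixel from the size of the coded binary file."""
    return 8.0 * bitstream_bytes / (width * height * frames)


if __name__ == "__main__":
    # Toy 2x2 frame: every reconstructed sample is off by one.
    ref = [[10, 20], [30, 40]]
    rec = [[11, 21], [31, 41]]
    print(f"RGB-PSNR: {rgb_psnr((ref, ref, ref), (rec, rec, rec)):.2f} dB")
    # A hypothetical 150 000-byte bitstream for 100 frames of 640x352.
    print(f"BPP: {bits_per_pixel(150_000, 640, 352, 100):.4f}")
```

How the per-frame values are pooled into one figure per sequence is not stated in the text, so the sketch stops at the frame level.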

Table 2: The BD-rate performance of different codecs (OpenDVC, EEV, and HM-16.20-SCM-8.8) on drone video compression. The distortion metric is RGB-PSNR. Negative values mean that EEV saves bitrate relative to the other codec.

Category              Sequence Name     BD-Rate, EEV vs OpenDVC  BD-Rate, EEV vs HEVC
Class A VisDrone-SOT  BasketballGround  -23.84%                  9.57%
Class A VisDrone-SOT  GrassLand         -16.42%                  -38.64%
Class A VisDrone-SOT  Intersection      -18.62%                  -28.52%
Class A VisDrone-SOT  NightMall         -21.94%                  -6.51%
Class A VisDrone-SOT  SoccerGround      -21.61%                  -10.76%
Class B VisDrone-MOT  Circle            -20.17%                  -25.67%
Class B VisDrone-MOT  CrossBridge       -23.96%                  26.66%
Class B VisDrone-MOT  Highway           -20.30%                  -12.57%
Class C Corridor      Classroom         -8.39%                   178.49%
Class C Corridor      Elevator          -19.47%                  109.54%
Class C Corridor      Hall              -15.37%                  58.66%
Class D UAVDT S       Campus            -26.94%                  -25.68%
Class D UAVDT S       RoadByTheSea      -20.98%                  -24.40%
Class D UAVDT S       Theater           -19.79%                  2.98%
Class A               Average           -20.49%                  -14.97%
Class B               Average           -21.48%                  -3.86%
Class C               Average           -14.41%                  115.56%
Class D               Average           -22.57%                  -15.70%
All                   Average           -19.84%                  15.23%

The PSNR-based R-D performance of the three codecs is summarized in Table 2. EEV achieves around 20% bit-rate reduction relative to the OpenDVC codec, which shows the promise of learned codecs and the improvement made by the EEV software.
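For readers unfamiliar with the BD-rate measurement, the following Python sketch shows the usual idea: fit log-rate as a cubic polynomial of PSNR for each codec, integrate both fits over the common PSNR range, and convert the average log-rate gap into a percentage. It is an illustration under stated assumptions rather than the script used for the benchmark: it assumes exactly four R-D points per codec (so the cubic passes through them), the function names are hypothetical, and the R-D points in the usage example are invented.

```python
import math


def fit_cubic(xs, ys):
    """Coefficients [c0, c1, c2, c3] of the cubic through four (x, y) points."""
    n = 4
    # Augmented Vandermonde system, solved by Gauss-Jordan elimination.
    a = [[x ** k for k in range(n)] + [y] for x, y in zip(xs, ys)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        for r in range(n):
            if r != col:
                factor = a[r][col] / a[col][col]
                a[r] = [v - factor * w for v, w in zip(a[r], a[col])]
    return [a[i][n] / a[i][i] for i in range(n)]


def integrate(coeffs, lo, hi):
    """Definite integral of the polynomial with the given coefficients."""
    def antiderivative(x):
        return sum(c * x ** (k + 1) / (k + 1) for k, c in enumerate(coeffs))
    return antiderivative(hi) - antiderivative(lo)


def bd_rate(rate_anchor, psnr_anchor, rate_test, psnr_test):
    """Average bitrate change (%) of the test codec at equal PSNR."""
    lo = max(min(psnr_anchor), min(psnr_test))
    hi = min(max(psnr_anchor), max(psnr_test))
    if hi <= lo:
        raise ValueError("the two R-D curves do not overlap in PSNR")
    # Shift PSNR by the common lower bound to keep the fit well conditioned.
    fit_anchor = fit_cubic([p - lo for p in psnr_anchor],
                           [math.log(r) for r in rate_anchor])
    fit_test = fit_cubic([p - lo for p in psnr_test],
                         [math.log(r) for r in rate_test])
    span = hi - lo
    avg_log_gap = (integrate(fit_test, 0.0, span)
                   - integrate(fit_anchor, 0.0, span)) / span
    return (math.exp(avg_log_gap) - 1.0) * 100.0


if __name__ == "__main__":
    # Invented R-D points: the test codec needs 20% fewer bits at each PSNR.
    psnr = [30.0, 33.0, 36.0, 39.0]
    anchor_bpp = [0.05, 0.10, 0.20, 0.40]
    test_bpp = [0.8 * r for r in anchor_bpp]
    print(f"BD-rate: {bd_rate(anchor_bpp, psnr, test_bpp, psnr):.2f}%")
```

In this toy case the test curve is the anchor curve scaled by 0.8 in rate, so the result is close to -20%. A negative BD-rate means the test codec needs fewer bits for the same quality, and a positive one means it needs more.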

When we directly compare the coding performance of EEV and HEVC, an obvious performance gap between the indoor and outdoor sequences can be observed. Averaged over all videos, the HEVC SCC codec outperforms the learned codec by 15.23% in BD-rate. For Class C, EEV is inferior to HEVC by a clear margin, especially for the Classroom and Elevator sequences. Such R-D statistics reveal that learned codecs are more sensitive to video content variations than conventional hybrid codecs when a natural-video-trained codec is applied directly to UAV video coding. In future research, this issue could be modeled as an out-of-distribution problem, and extensive knowledge could be borrowed from the machine learning community.
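The averages quoted above can be checked directly from the per-sequence figures. The short Python sketch below copies the EEV vs HEVC column of the BD-rate table and takes plain arithmetic means. Treating Classes A, B and D as the outdoor group follows the Scene Feature column of the sequence table, and the variable names are illustrative only.

```python
from statistics import mean

# BD-rate of EEV vs HEVC in percent, copied from the BD-rate table.
bd_rate_vs_hevc = {
    "Class A": [9.57, -38.64, -28.52, -6.51, -10.76],
    "Class B": [-25.67, 26.66, -12.57],
    "Class C": [178.49, 109.54, 58.66],  # the only indoor class
    "Class D": [-25.68, -24.40, 2.98],
}

# Per-class means.
class_means = {name: mean(values) for name, values in bd_rate_vs_hevc.items()}

# Mean over all 14 sequences (not the mean of the four class means).
overall_mean = mean(v for values in bd_rate_vs_hevc.values() for v in values)

# Mean over the 11 outdoor sequences only.
outdoor_mean = mean(
    v
    for name, values in bd_rate_vs_hevc.items()
    if name != "Class C"
    for v in values
)

if __name__ == "__main__":
    for name, value in class_means.items():
        print(f"{name}: {value:+.2f}%")
    print(f"All sequences: {overall_mean:+.2f}%")
    print(f"Outdoor only:  {outdoor_mean:+.2f}%")
```

The overall figure is the mean over the 14 sequences, so the three indoor clips dominate it. On the 11 outdoor sequences alone the mean is negative, that is, in favour of EEV.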

To further interpret the R-D efficiency of the different codecs, we plot their R-D curves in Fig. 2, selecting Campus and Highway for illustration. The blue-violet, peach-puff, and steel-blue curves denote the EEV, HEVC, and OpenDVC codecs respectively. The content characteristics of UAV videos and their distance from natural videos should be modeled and investigated in future research.

MPAI-EEV Working Mechanism

This work was accomplished in the MPAI-EEV coding project, an MPAI standard project seeking to compress video by exploiting AI-based data coding technologies. Within this workgroup, experts from around the globe gather every two weeks to review progress and plan new efforts. In its current phase, attendance at MPAI-EEV meetings is open to interested experts. Since its formal establishment in Nov. 2021, MPAI-EEV has released three major versions of its reference model. MPAI-EEV plans to be an asset for AI-based end-to-end video coding by continuing to contribute new developments to the field.

This work, contributed by MPAI-EEV, has constructed a solid baseline for compressing UAV videos and facilitates future research on related topics.

References

[1] Pengfei Zhu, Longyin Wen, Dawei Du, Xiao Bian, Heng Fan, Qinghua Hu, and Haibin Ling, “Detection and Tracking Meet Drones Challenge,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 44, no. 11, pp. 7380–7399, 2021.

[2] A. Kouris and C.S. Bouganis, “Learning to Fly by MySelf: A Self-Supervised CNN-based Approach for Autonomous Navigation,” in IEEE/RSJ International Conference on Intelligent Robots and Systems, 2018, pp. 5216–5223.

[3] Dawei Du, Yuankai Qi, Hongyang Yu, Yifan Yang, Kaiwen Duan, Guorong Li, Weigang Zhang, Qingming Huang, and Qi Tian, “The Unmanned Aerial Vehicle Benchmark: Object Detection and Tracking,” in European Conference on Computer Vision, 2018, pp. 370–386.

[4] Ren Yang, Luc Van Gool, and Radu Timofte, “OpenDVC: An Open Source Implementation of the DVC Video Compression Method,” arXiv preprint arXiv:2006.15862, 2020.

And now, whither MPAI?

After being established 25 months ago, adding 5 standards to its game bag, and submitting 4 of them to IEEE for adoption, what is the next MPAI challenge? The answer is that MPAI has more than one challenge in its viewfinder and that this post will report on the first of them, namely the next standards MPAI is working on.

AI Framework (MPAI-AIF). Version 1 (V1) of this standard specifies an environment where non-monolithic component-based AI applications are executed. The new (V2) standard is adding a set of APIs that enable an application developer to select a security level or implement a particular security solution.

Context-based Audio Enhancement (MPAI-CAE). One MPAI-CAE V1 use case is Enhanced Audioconference Experience, where the remote end can correctly recreate the sound sources at its end by using the Scene Description of the Audio at the transmitting side. The new (V2) standard is targeting more challenging environments than a room, such as a human outdoors talking to a vehicle, where the human's speech must be as clean as possible. Therefore, a more powerful audio scene description needs to be developed.

Multimodal Conversation (MPAI-MMC). One MPAI-MMC V1 use case is a machine talking to a human and extracting the human’s emotional state from their text, speech, and face to improve the quality of the conversation. The new (V2) standard is augmenting the scope of the understanding of the human internal state by introducing Personal Status combining emotion, cognitive state and social attitude. MPAI-MMC applies it to three new use cases: Conversation about a Scene, Virtual Secretary, and Human-Connected Autonomous Vehicles Interaction (which uses the MPAI-CAE V2 technology).

Avatar Representation and Animation (MPAI-ARA). The new (V1) MPAI-ARA standard addresses several areas where the appearance of a human is mapped to an avatar model. One example is the Avatar-Based Videoconference Use Case, where the appearance of an avatar is expected to faithfully reproduce a human participant. Another is a machine conversing with humans that displays itself as an avatar showing a Personal Status consistent with the conversation.

Neural Network Watermarking (MPAI-NNW). The new (V1) MPAI-NNW standard specifies methodologies to evaluate neural network watermarking technologies in the following areas:

  • The impact on the performance of a watermarked neural network (and its inference).
  • The ability of the detector/decoder to detect/decode a payload when the watermarked neural network has been modified.
  • The computational cost of injecting, detecting, or decoding a payload in the watermark.

MPAI Metaverse Model (MPAI-MMM). The new (V1) MPAI-MMM Technical Report identifies, organises, defines, and exemplifies functionalities generally considered useful to the metaverse without assuming that a specific metaverse implementation supports any of them.

All these are ambitious targets but the work is supported by the submissions received in response to the relevant Calls for Technologies and MPAI’s internal expertise.

This is the first of the current MPAI objectives. It is good enough to convince you to join MPAI now. Read all the 7 good reasons in the MPAI blog.