
Leonardo Chiariglione
2025-11-19

Audio Spatial Reasoning: The Sound-Aware Interpreter

Autonomous User (A-User) is an autonomous agent able to move and interact (converse, etc.) with another User in a metaverse. It is a “conversation partner in a metaverse interaction” with the User, itself an A-User or an H-User directly controlled by a human. The figure shows a diagram of the A-User while the User generates audio-visual streams of information and possibly text as well.

We have already presented the system diagram of the Autonomous User (A-User), an autonomous agent able to move and interact (walk, converse, do things, etc.) with another User in a metaverse. The latter User may be an A-User or be under the direct control of a human and is thus called a Human-User (H-User). The A-User acts as a “conversation partner in a metaverse interaction” with the User.

This is the third in a sequence of posts that aim to illustrate the architecture of an A-User in more depth and to provide an easy entry point for those who wish to respond to the MPAI Call for Technology on Autonomous User Architecture. The first two dealt with the control exercised by the A-User Control AI Module over the other components of the A-User and with how the A-User captures the external metaverse environment using the Context Capture AI Module.

Audio Spatial Reasoning is the A-User’s acoustic intelligence module – the one that listens, localises, and interprets sound not just as data, but as data with a spatially anchored meaning. Its role is not just “hearing”; it is also “understanding” where sound is coming from, how relevant it is, and what it implies in the context of the User’s intent in the environment.

When the A-User system receives a Context snapshot from Context Capture – including audio streams with a position and orientation and a description of the User’s emotional state (called User State) – Audio Spatial Reasoning starts an analysis of the directionality, proximity, and semantic importance of incoming sounds. The conclusion is something like “That voice is coming from the left, with a tone of urgency, and its orientation is directed at the A-User.”

All this is represented with an extension of the Audio Scene Descriptors describing:

Which audio sources are relevant
Where they are located in 3D space
How close or far they are
Whether they’re foreground (e.g., a question) or background (e.g., ambient chatter)
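As a rough illustration, the four items above could be carried in a record like the one below. This is a minimal Python sketch, not the MPAI data format: the class name `AudioSourceDescriptor`, its fields, the helper `relevant_sources`, and the 0.5 relevance threshold are all assumptions made for this example.

```python
from dataclasses import dataclass
from typing import Tuple
import math


@dataclass
class AudioSourceDescriptor:
    """Hypothetical record for one audio source (not an MPAI data type)."""
    source_id: str
    position: Tuple[float, float, float]  # metres, relative to the A-User
    relevance: float                      # 0.0 (ignore) .. 1.0 (critical)
    foreground: bool                      # True for e.g. a question

    def distance(self) -> float:
        """Euclidean distance of the source from the A-User at the origin."""
        return math.sqrt(sum(c * c for c in self.position))


def relevant_sources(sources, threshold=0.5):
    """Keep sources at or above the assumed threshold, nearest first."""
    kept = [s for s in sources if s.relevance >= threshold]
    return sorted(kept, key=lambda s: s.distance())


scene = [
    AudioSourceDescriptor("voice-1", (-3.0, 0.0, 4.0), 0.9, True),
    AudioSourceDescriptor("chatter", (6.0, 0.0, 8.0), 0.2, False),
    AudioSourceDescriptor("alarm", (0.0, 0.0, 2.0), 0.7, True),
]
nearest_first = [s.source_id for s in relevant_sources(scene)]
```

Here the ambient chatter is dropped and the two remaining sources are ordered by proximity; a real implementation would of course derive relevance rather than receive it as a number.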

These extended descriptors act as a guide that is sent to Prompt Creation and Domain Access. Let’s see what happens with the former. The extended Audio Scene Descriptors are fused with the User’s spoken or written input and the current User State. The result is a PC-Prompt – a rich query enriched with text expressing the multimodal information collected so far – that is passed to Basic Knowledge for reasoning.
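The fusion step can be pictured as plain text assembly. The sketch below is only one possible layout: the function name `build_pc_prompt`, its parameters, and the wording of the prompt are assumptions, since the post does not define the PC-Prompt format.

```python
def build_pc_prompt(user_text, user_state, audio_descriptions):
    """Fuse the three inputs into one natural-language prompt (assumed layout)."""
    lines = ["Audio scene:"]
    lines += [f"- {d}" for d in audio_descriptions]
    lines.append(f"User state: {user_state}")
    lines.append(f"User says: {user_text}")
    return "\n".join(lines)


pc_prompt = build_pc_prompt(
    user_text="Can you help me?",
    user_state="anxious",
    audio_descriptions=["voice from the left, urgent tone, directed at the A-User"],
)
```

The point of the sketch is that everything reaching Basic Knowledge is text, so the audio findings have to be verbalised before they are merged with what the User said.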

In Domain Access, the Audio Scene Descriptors are further processed and integrated with domain-specific information. The response, called Audio Spatial Directive, includes domain-specific logic, scene priors, and task constraints. For example, if the scene is a medical simulation, Domain Access might tell Audio Spatial Reasoning that “only sounds from authorised personnel should be considered”. This feedback helps Audio Spatial Reasoning refine its interpretation – filtering out irrelevant sounds, boosting priority for critical ones, and aligning its spatial model with the current domain expectations.
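To make the medical-simulation example concrete, here is a small Python sketch of how such a directive might be applied. The `Sound` class, the role labels, the `apply_directive` function, and the boost factor of 2.0 are hypothetical; the post only states that irrelevant sounds are filtered out and critical ones are given higher priority.

```python
from dataclasses import dataclass


@dataclass
class Sound:
    speaker_role: str   # e.g. "surgeon", "visitor" (hypothetical labels)
    priority: float     # priority currently assigned to the sound


def apply_directive(sounds, authorised_roles, critical_roles, boost=2.0):
    """Drop sounds from unauthorised roles and boost critical ones (assumed rule)."""
    refined = []
    for s in sounds:
        if s.speaker_role not in authorised_roles:
            continue  # filtered out as irrelevant in this domain
        factor = boost if s.speaker_role in critical_roles else 1.0
        refined.append(Sound(s.speaker_role, s.priority * factor))
    return refined


sounds = [Sound("surgeon", 0.4), Sound("visitor", 0.9), Sound("nurse", 0.3)]
refined = apply_directive(sounds, {"surgeon", "nurse"}, {"surgeon"})
```

Note that the loud visitor disappears even though its initial priority was the highest: the directive, not the raw signal, decides what counts.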

Therefore, we can call Audio Spatial Reasoning the A-User’s auditory guide. It knows where sounds are coming from, what they mean, and how they should influence the A-User’s behaviour. The A-User responds to a sound with spatial awareness, contextual sensitivity, and domain consistency.

There are still about two months to the deadline of 2026/01/21, by which responses to the Call must reach the MPAI Secretariat (secretariat@mpai.community) without exception.

To know more, register to attend the online presentations of the Call on 17 November at 9 UTC (https://tinyurl.com/y4antb8a) and 16 UTC (https://tinyurl.com/yc6wehdv) – Today or tomorrow depending on where you are.


Leonardo Chiariglione
2025-11-13

Context Capture: The A-User’s First Glimpse of the World

The sequence of posts – of which this is the second – that illustrates the architecture of an A-User in more depth provides an easy entry point for those who wish to respond to the MPAI Call for Technology on Autonomous User Architecture. The first post dealt with the A-User Control, the AI Module (AIM) that controls the other AIMs of the A-User and is possibly controlled by a human.

Context Capture is the A-User’s sensory front-end – the AIM that opens up perception by scanning the environment and assembling a structured snapshot of what’s out there in the moment. It is the first AI Module (AIM) in the loop providing the data and setting the stage for everything that follows. When A-User Control decides it’s time to engage, it prompts Context Capture to focus on a specific M-Location – the zone where the User is active, rendering its Avatar.

What Context Capture produces is called Context – a time-stamped, multimodal snapshot that represents the A-User’s initial understanding of the scene. But this isn’t just raw data. Context is composed of two key ingredients: Audio-Visual Scene Descriptors and User State.
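A snapshot of this kind might be held in a structure like the following. This is a sketch under stated assumptions: the class `Context` and its field names are illustrative and are not taken from an MPAI specification.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class Context:
    """Hypothetical container mirroring the two ingredients named in the text."""
    m_location: str
    scene_descriptors: dict   # audio-visual description of the zone
    user_state: dict          # first estimate of the User's state
    timestamp: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc)
    )


context = Context(
    m_location="lobby",
    scene_descriptors={"objects": ["chair"], "sound_sources": ["voice"]},
    user_state={"attention": "focused", "emotion": "curious"},
)
```

The timestamp is filled in at creation, which matches the idea of a snapshot that is only valid for the moment it was taken.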

The Audio-Visual Scene Descriptors are like a spatial sketch of the environment. They describe what’s visible and audible: objects, surfaces, lighting, motion, sound sources, and spatial layout. They provide the A-User with a sense of “what’s here” and “where things are.” But they’re not perfect. These descriptors are often shallow – they capture geometry and presence but not meaning. A chair might be detected as a rectangular mesh with four legs, but Context Capture doesn’t know if it’s meant to be sat on, moved, or ignored.

That’s where Spatial Reasoning comes in. Spatial Reasoning is the AIM that takes this raw spatial sketch and starts asking the deeper questions:

“Which object is the User referring to?”
“Is that sound coming from a relevant source?”
“Does this object afford interaction, or is it just background?”

It analyses the Context and produces two critical outputs:

Spatial Output: a refined map of spatial relationships, referent resolutions, and interaction constraints.
Spatial Guide: a set of cues that enrich the user’s input — highlighting which objects or sounds are relevant, how close they are, and how they might be used.

These outputs are sent downstream to Domain Access and Prompt Creation. The former refines the spatial understanding of the scene. The latter enriches the A-User’s query when it formulates the prompt to the Basic Knowledge (LLM).

Then there’s User State – a snapshot of the User’s cognitive, emotional, and attentional posture. Is the User focused, distracted, curious, frustrated? Context Capture reads facial expressions, gaze direction, posture, and vocal tone to infer a baseline state. But again, it’s just a starting point. User behaviour may be nuanced, and initial readings can be incomplete, noisy or ambiguous. That’s why User State Refinement exists – to track changes over time, infer deeper intent, and guide the alignment of the A-User’s expressive behaviour done by Personality Alignment.

In short, Context Capture is the A-User’s first glimpse of the world – a fast, structured perception layer that’s good enough to get started, but not good enough to finish the job. It’s the launchpad for deeper reasoning, richer modulation, and more expressive interaction. Without it, the A-User would be blind. With it, the system becomes situationally aware, emotionally attuned, and ready to reason – but only if the rest of the AIMs do their part.

Responses to the Call must reach the MPAI Secretariat (secretariat@mpai.community) by 2026/01/21.

To know more, register to attend the online presentations of the Call on 17 November at 9 UTC (https://tinyurl.com/y4antb8a) and 16 UTC (https://tinyurl.com/yc6wehdv).


Leonardo Chiariglione
2025-11-08

A-User Control: The Autonomous Agent’s Brain

We have already presented the system diagram of the Autonomous User (A-User), an autonomous agent able to move and interact (converse, etc.) with another User in a metaverse. The latter User may also be an A-User or may be under the direct control of a human and is thus called a Human-User (H-User). The A-User acts as a “conversation partner in a metaverse interaction” with the User.

This is the first of a planned sequence of posts intended to illustrate the architecture of an A-User in more depth and to provide an easy entry point for those who wish to respond to the MPAI Call for Technology on Autonomous User Architecture.

A-User Control is the general commander of the A-User system, making sure the Avatar behaves like a coherent digital entity aware of the rights it can exercise in an instance of the MPAI Metaverse Model – Technologies (MMM-TEC) standard. The command is actuated by various signals exchanged with the AI Modules (AIMs) composing the Autonomous User.

At its core, A-User Control decides what the A-User should do, which AIM should do it, and how it should do it – all while respecting the Rights granted to the A-User and the Rules defined by the M-Instance. A-User Control either executes an Action directly or delegates it to another Process in the metaverse to carry it out.
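The decision described above can be reduced to a few checks. The Python sketch below is an assumption-laden toy: the function `decide`, the shape of `rights` and `rules`, and the action name `Some-Other-Action` are invented for illustration, while MM-Add, MM-Move, and MM-Capture are names used elsewhere in these posts.

```python
def decide(action, rights, rules, can_do_directly):
    """Return how a requested Action is handled under assumed Rights and Rules."""
    if action not in rights:
        return "refuse: no Right held"
    if action in rules.get("forbidden", set()):
        return "refuse: M-Instance Rule"
    if action in can_do_directly:
        return "execute"
    return "delegate to another Process"


rights = {"MM-Add", "MM-Move", "MM-Capture"}
rules = {"forbidden": {"MM-Move"}}          # hypothetical M-Instance Rule
outcomes = {
    a: decide(a, rights, rules, can_do_directly={"MM-Add"})
    for a in ["MM-Add", "MM-Move", "MM-Capture", "Some-Other-Action"]
}
```

Rights are checked before Rules here, but that ordering is a design choice of the sketch, not something the post prescribes.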

A-User Control is not just about triggering actions. A-User Control also manages the operation of its AIMs, for instance A-User Rendering, which can turn text produced by the Basic Knowledge (LLM) and the Personal Status selected by Personality Alignment into a speaking and gesturing Avatar. A-User Control sends shaping commands to A-User Rendering, ensuring the Avatar’s behaviour aligns with metaverse-generated cues and contextual constraints.

A-User Control is not independent of human influence. The human, i.e., the A-User “owner”, can override, adjust, or steer its behaviour. This makes A-User Control a hybrid system: autonomous by design, but open to human modulation when needed.

The control begins when A-User Control triggers Context Capture to perceive the current M-Location – the spatial zone of the metaverse where the User is active. That snapshot, called Context, includes spatial descriptors and a readout of the human’s cognitive and emotional posture called User State. From there, the two Spatial Reasoning components – Audio and Visual – use Context to analyse the scene and send outputs to Domain Access and Prompt Creation, which enrich the User’s input and guide the A-User’s understanding.

As reasoning flows through Basic Knowledge, Domain Access, and User State Refinement, A-User Control ensures that every action, rendering, and modulation is aligned with the A-User’s operational logic.

In summary, the A-User Control is the executive function of the A-User: part orchestrator, part gatekeeper, part interpreter. It’s the reason the Avatar doesn’t just speak – it does so while being aware of the Context – both the spatial and User components – with purpose, permission, and precision.

Stay tuned for more introductions into the world of the Autonomous User Architecture.

Responses to the Call must reach the MPAI Secretariat (secretariat@mpai.community) by 2026/01/21.

To know more, register to attend the online presentations of the Call on 17 November at 9 UTC (https://tinyurl.com/y4antb8a) and 16 UTC (https://tinyurl.com/yc6wehdv).


Leonardo Chiariglione
2025-11-02

A new MPAI standard project for Autonomous Users in metaverse

The concept of virtual reality is now well established, with the metaverse concept as an important variant. Accordingly, MPAI has established a related standard, the MPAI Metaverse Model – Technologies (MMM-TEC) standard. However, standards for the contents of an MPAI metaverse instance (M-Instance) are still in progress. This document introduces the current status of these efforts and invites participation.

The contents include Processes representing entities with agency, called Users, and other entities lacking agency – essentially, various things populating an M-Instance – called Items.

Some Users represent humans. These may be directly operated by humans (and are called H-Users), or may have a high degree of operational autonomy (and are called A-Users, or informally, agents). Both types may be rendered as avatars called Personae.

The MMM-TEC standard specifies technologies enabling Users to perform various Actions on Items (things) in an M-Instance. For example, Users may sense data from the real world or may move Items in the M-Instance, possibly in combination with other Processes. However, MMM-TEC does not yet specify how an A-User decides to perform an Action.

Thus MPAI is developing a new standard covering such decisions: what does an A-User do when deciding to do something to achieve a Goal in an M-Instance? MPAI has assembled numerous relevant technologies, but more are needed. Therefore, the 61st MPAI General Assembly (MPAI-61) has published the Call for Technologies: Pursuing Goals in metaverse (MPAI-PGM) – Autonomous User Architecture (AUA). The Call requests interested parties – irrespective of their membership in MPAI – to submit responses that may enable MPAI to develop a robust A-User Architecture standard attractive to implementers and users.

The planned standard’s scope is as follows: PGM-AUA will specify functions and interfaces by which an A-User interacts with another User, either an A-User or an H-User. (Again, the term “User” means “conversational partner in the metaverse”, whether autonomous or driven by a human.) A-Users can capture text and audio-visual information originated by, or surrounding, the User; extract the User State, i.e., snapshots of the User’s cognitive, emotional, and interactional states; produce an appropriate multimodal response, rendered as a speaking Avatar; and move appropriately in the relevant virtual space.

One possible way to model an A-User’s interactions with other Users might be to train a very powerful unitary Large Language Model, able to use spatial and media information. However, because such a model would be unwieldy and difficult to manage, MPAI instead assumes the use of a relatively simple Large Language Model with basic language and reasoning capabilities. Spatial, audio-visual, and User description information will be passed to and from this Basic Model in natural language.

To handle this integration, MPAI proposes the MPAI AI Framework (MPAI-AIF) standard. This standard provides the necessary infrastructure to define a foundation for an A-User to which the necessary technologies can be added. MPAI-AIF enables specification of an AI Workflow (AIW) composed of AI Modules (AIMs). In this case, these can jointly represent an A-User in a manner that is modular, i.e., able to swap or update modules independently from other modules; transparent, i.e., able to perform clear roles and expose well-defined interfaces; and extensible, i.e., able to add or replace specific competences as needed.
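The modularity claim can be illustrated with a toy workflow in which one module is replaced without touching the others. This is not the MPAI-AIF API: the `Workflow` class, its `register` and `run` methods, and the module names are assumptions made purely for this sketch.

```python
class Workflow:
    """Toy AI Workflow: an ordered set of named, replaceable modules."""

    def __init__(self):
        self.modules = {}   # name -> callable taking and returning a dict

    def register(self, name, module):
        # Re-registering an existing name swaps the module in place.
        self.modules[name] = module

    def run(self, data):
        for module in self.modules.values():
            data = module(data)
        return data


aiw = Workflow()
aiw.register("context_capture", lambda d: {**d, "context": "snapshot"})
aiw.register(
    "prompt_creation",
    lambda d: {**d, "prompt": f"{d['context']} + {d['text']}"},
)
first = aiw.run({"text": "hello"})

# Swap one module; the other is untouched and keeps its position.
aiw.register("context_capture", lambda d: {**d, "context": "richer snapshot"})
second = aiw.run({"text": "hello"})
```

Because each module only sees a dictionary in and a dictionary out, the interface stays the same when the implementation behind a name changes.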

The following figure represents a tentative diagram of the A-User architecture.

Figure 1 – The reference model of the Autonomous User Architecture

The model represents a largely autonomous A-User’s (“agent’s”) interactions with another User (A-User or H-User) at a given instant. It would thus be invoked repeatedly for extended interactions.

At a high level, we see an executive element (A-User Control), which can receive as input a human command or the response to some Action, and which delivers as output its status in response to the relevant command; any related action; and any request that it may itself deliver.

NOTE: While an A-User is defined as a relatively autonomous Process, a human may take over or modify its operation via the A-User Control.

More formally, the executive element, the A-User Control AIM, drives A-User operation by controlling how it interacts with the environment and performs Actions and Process Actions based on the Rights it holds and the M-Instance Rules. It does so by:

Performing or requesting another Process to perform an Action.
Controlling the operation of AIMs, in particular A-User Rendering.

The responsible human may take over or modify the operation of the A-User Control by exercising Human Commands. Figure 2 summarises the input and output data of the A-User Control AIM.

Figure 2 – Simplified view of the Reference Model of A-User Control

A Human Command received from a human will generate a Human Command Status in response. A Process Action Request to a Process – that may include another User – will generate a Process Action Response. Various types of Commands (called Directives) to the Autonomous User AI Modules (AIM) will generate responses (called Statuses). The Figure singles out the A-User Rendering Directives issued to the A-User Rendering AIM. This will generate a response typically including a Speaking Avatar that the A-User Control AIM will MM-Add or MM-Move in the metaverse. The complete Reference Model of A-User Control can be found here.

The Context Capture AIM, prompted by the A-User Control, perceives a particular location of the M-Instance – called M-Location – where the User, i.e., the A-User’s conversation partner, has MM-Added its Avatar. In the metaverse, the A-User perceives by issuing an MM-Capture Process Action Request. The multimodal data captured is processed and the result is called Context – a time-stamped snapshot of the M-Location – composed of:

Audio and Visual Scene Descriptors describing the spatial content.
Entity State, describing the User’s cognitive, emotional, and attentional posture.

Thus, Context represents the initial A-User’s understanding of the User and the M-Location where it is embedded.

The Spatial Reasoning AIM – composed of two AIMs, Audio Spatial Reasoning and Visual Spatial Reasoning – analyses Context and sends an enhanced version of the Audio and Visual Scene Descriptors, containing audio source relevance, directionality, and proximity (Audio) and object relevance, proximity, referent resolutions, and affordance (Visual) to

The Domain Access AIM, to seek additional domain-specific information. Domain Access responds with further enhanced Audio and Visual Scene Descriptors, and
The Prompt Creation AIM, which sends to Basic Knowledge, a basic LLM, the PC-Prompt integrating:
1. User Text and Entity State (from Context Capture).
2. Enhanced Audio and Visual Scene Descriptors (from Spatial Reasoning).

This is depicted in Figure 3.

Figure 3 – Basic Knowledge receives PC-Prompt from Prompt Creation

The Initial Response to PC-Prompt is sent by Basic Knowledge to Domain Access that

Processes the Audio and Visual Scene Descriptors and the Initial Response by accessing domain-specific models, ontologies, or M-Instance services to retrieve:
1. Scene-specific object roles (e.g., “this is a surgical tool”)
2. Task-specific constraints (e.g., “only authorised Users may interact”)
3. Semantic affordances (e.g., “this object can be grasped”)
Produces and sends four flows:
1. Enhanced Audio and Visual Scene Descriptors to Spatial Reasoning to enhance its scene understanding.
2. User Context Guide to User State Refinement to enable it to update the User’s Entity State.
3. Personality Context Guide to Personality Alignment.
4. DA-Prompt, a new prompt to Basic Knowledge including initial reasoning and spatial semantics.
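The four flows can be sketched as a single function that returns one item per destination. Everything concrete below is an assumption for illustration: the function `domain_access`, the dictionary keys, the shape of `domain`, and the use of a "tone" entry as the Personality Context Guide.

```python
def domain_access(scene_descriptors, initial_response, domain):
    """Return the four flows as a dict keyed by (assumed) destination names."""
    enhanced = dict(scene_descriptors, roles=domain["roles"])
    return {
        "spatial_reasoning": enhanced,
        "user_state_refinement": {"user_context_guide": domain["constraints"]},
        "personality_alignment": {"personality_context_guide": domain["tone"]},
        "basic_knowledge": f"DA-Prompt: {initial_response} | roles={domain['roles']}",
    }


domain = {
    "roles": {"scalpel": "surgical tool"},
    "constraints": "only authorised Users may interact",
    "tone": "calm",   # hypothetical content of the Personality Context Guide
}
flows = domain_access({"objects": ["scalpel"]}, "initial answer", domain)
```

The same domain lookup feeds all four outputs, which is why Domain Access sits at the centre of Figure 4.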

Figure 4 – Domain Access serves Spatial Reasoning, Basic Knowledge, User State Refinement, and Personality Alignment

Basic Knowledge produces and sends an Enhanced Response to the User State Refinement AIM.

User State Refinement refines its understanding of User State using the User Context Guide, produces and sends:

UR-Prompt to Basic Knowledge.
Expressive State Guide to Personality Alignment providing A-User with the means to adopt a Personality that is congruent with the User’s Entity State.

Basic Knowledge produces and sends a Refined Response to Personality Alignment.

This is depicted in Figure 5.

Figure 5 – User State Refinement feeds Personality Alignment

Personality Alignment

Selects a Personality based on the Refined Response and the Expressive State Guide, conveying a variety of elements such as Expressivity (e.g., Tone, Tempo, Face, Gesture), Behavioural Traits (e.g., verbosity, humour, emotion), Type of role (e.g., assistant, mentor, negotiator, entertainer), etc.
Formulates and sends
1. An A-User Entity State reflecting the Personality to A-User Rendering.
2. A PA-Prompt to Basic Knowledge reflecting the intended speech modulation, face and gesture, and synchronisation cues across modalities.

Basic Knowledge sends a Final Response that conveys semantic content, contextual integration, expressive framing, and personality coherence.
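Taken together, Basic Knowledge is queried four times, each prompt building on the previous response. The sketch below traces that chain with a stand-in for the LLM; the functions `basic_knowledge` and `run_chain` and the bracketed string format are assumptions, while the prompt and response names are those used in the text.

```python
STAGES = [
    ("PC-Prompt", "Initial Response"),
    ("DA-Prompt", "Enhanced Response"),
    ("UR-Prompt", "Refined Response"),
    ("PA-Prompt", "Final Response"),
]


def basic_knowledge(prompt):
    """Stand-in for the LLM: wraps the prompt it was given."""
    return f"answer({prompt})"


def run_chain(user_text):
    """Feed each response into the next prompt and record the exchange."""
    trace, carried = [], user_text
    for prompt_name, response_name in STAGES:
        prompt = f"{prompt_name}[{carried}]"
        carried = basic_knowledge(prompt)
        trace.append((prompt_name, response_name))
    return carried, trace


final_response, trace = run_chain("hello")
```

In the real architecture each prompt is produced by a different AIM (Prompt Creation, Domain Access, User State Refinement, Personality Alignment); the loop here only shows the ordering.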

This is depicted in Figure 6.

Figure 6 – Personality Alignment feeds A-User Rendering

A-User Rendering uses the Final Response, the A-User Entity State, and the A-User Control Command from A-User Control to synthesise and shape a speaking Avatar that is returned to A-User Control. This is depicted in Figure 7.

Figure 7 – The result of the Autonomous User processing is fed to A-User Control

Extended Call for Technologies

The complexity of the MMM-TEC model has prompted MPAI to extend its usual practice for Calls for Technologies. In addition to the usual Call for Technologies, Use Cases and Functional Requirements, Framework Licence, and Template for Responses, the Call also refers to a Tentative Technical Specification, a document drafted as if it were an actual Technical Specification. Respondents to the Call are free to comment on, change, or extend the Tentative Technical Specification or to make any other proposals judged relevant to the Call.

Anyone, irrespective of MPAI membership status, may respond to the Call. Responses shall reach the MPAI Secretariat by 2026/01/21T23:59.

Appropriate MPAI working groups will thoroughly review the Responses and retain those deemed appropriate for the future PGM-AUA standard. MPAI may select suitable technologies from those submitted in Responses, but is not obligated to select any proposal. Respondents will be encouraged to join MPAI. If Respondents whose Responses are accepted in full or in part do not join MPAI, MPAI will discontinue consideration of their proposed technologies.


Leonardo Chiariglione
2025-10-29

MPAI calls for technologies supporting metaverse-based Agentic AI

Geneva, Switzerland – 29th October 2025. MPAI – Moving Picture, Audio and Data Coding by Artificial Intelligence – the international, non-profit, unaffiliated organisation developing AI-based data coding standards – has concluded its 61st General Assembly (MPAI-61) approving the publication of a Call for Autonomous User Architecture Technologies.

With this Call for Technologies, formally “Pursuing Goals in metaverse (MPAI-PGM) – Autonomous User Architecture (PGM-AUA)”, MPAI is aiming at a standard enabling Autonomous Users to perform activities such as moving around and conversing with other Users. These are processes representing humans in a metaverse conforming with the MPAI Metaverse Model Technologies standard (MMM-TEC). They can either operate with a high degree of autonomy (A-Users) or be directly controlled by humans (H-Users).

PGM-AUA will rely on the friendly MMM-TEC environment and many relevant technologies already available in the 16 approved MPAI standards. However, the ambitious PGM-AUA goal requires many new technologies that the Call is designed to secure.

The text of the Call and the associated documents are available. Responses are due to the MPAI Secretariat by 2026/01/21T23:59.

MPAI-61 has also approved the new versions of standards previously posted for Community Comments:

MPAI is continuing the development of its work plan that involves the following activities:

AI Framework (MPAI-AIF): developing a new MPAI-AIF specification that facilitates the creation of new workflows using available AIMs.
AI for Health (MPAI-AIH): developing the specification of a system receiving and processing licensed AI Health Data and enabling clients to improve health processing models via federated learning.
Context-based Audio Enhancement (CAE-DC): developing the Audio Six Degrees of Freedom (CAE-6DF) and Audio Object Scene Rendering (CAE-AOR) specifications.
Connected Autonomous Vehicle (MPAI-CAV): investigating extensions of the current CAV-TEC specification.
Compression and Understanding of Industrial Data (MPAI-CUI): developing the Company Performance Prediction V2.0 specification.
End-to-End Video Coding (MPAI-EEV): exploring the potential of AI-based End-to-End Video coding.
AI-Enhanced Video Coding (MPAI-EVC): finalising the Up-sampling Filter for Video applications (EVC-UFV) standard.
Governance of the MPAI Ecosystem (MPAI-GME): operating the MPAI Ecosystem per the MPAI-GME Specification.
Human and Machine Communication (MPAI-HMC): developing reference software and performance assessment.
Multimodal Conversation (MPAI-MMC): discussing the conversational part of the PGM-AUA Call for Technologies.
MPAI Metaverse Model (MPAI-MMM): developing support for security in the MMM-TEC specs.
Neural Network Watermarking (MPAI-NNW): reviewing the responses to the Call on Neural Network Traceability Technologies.
Object and Scene Description (MPAI-OSD): discussing the spatial part of the PGM-AUA Call for Technologies.
Portable Avatar Format (MPAI-PAF): discussing the rendering part of the PGM-AUA Call for Technologies.
AI Module Profiles (MPAI-PRF): extending the scope of the current version of AI Module Profiles.
Server-based Predictive Multiplayer Gaming (MPAI-SPG): exploring new standard opportunities in the domain.
Data Types, Formats, and Attributes (MPAI-TFA): extending the standard to data types used by MPAI standards (e.g., automotive, health, and metaverse).
XR Venues (MPAI-XRV): developing the standard for improved development and execution of Live Theatrical Performances.

Legal entities and representatives of academic departments supporting the MPAI mission and able to contribute to the development of standards for the efficient use of data can become MPAI members.

Please visit the MPAI website, contact the MPAI Secretariat for specific information, subscribe to the MPAI Newsletter and follow MPAI on social media: LinkedIn, Twitter, Facebook, Instagram, and YouTube.


Leonardo Chiariglione
2025-10-04

Celebrating the first five years of MPAI

For organisations that count their existence in decades or centuries, reaching a mere five years may not seem much to celebrate. But there are years and years – even days and days – as in “one day as a lion or a hundred years as a sheep”.

The last five years were not the years of a sheep but the days of a lion.

We started with the idea of an organisation dedicated to standards for AI-based data coding because we thought that standards would bring benefits to a domain mostly alien to it. Not like some standards that look more like legal tools designed to oppress users but standards offering fair opportunities to all parties in the chain extending from innovators to end users.

An ambitious organisation like MPAI could not operate like four friends in a bar. The MPAI operation rules were developed and are now enshrined in the MPAI Patent Policy. The ambitions of MPAI were further enhanced by the definition of the MPAI Ecosystem, extending from MPAI to implementers, integrators, and end users, with the introduction of a new actor called MPAI Store, now incorporated in Scotland as a company limited by guarantee. There is a standard – Governance of the MPAI Ecosystem (MPAI-GME) – setting the rules of operation of the Ecosystem.

The idea of a mission was there but what about implementing it? We acted as lions and posited that opaque monolithic AI should become component-based AI. Now a large share of our standards are based on the AI Framework (MPAI-AIF) standard, specifying an environment where AI Workflows composed of AI Modules can be initialised, dynamically configured, and controlled. MPAI-AIF also provided a stimulus to adoption of JSON Schema as a “language” to represent data types, AI Modules, and AI Workflows in MPAI standards. Today there is virtually no MPAI standard that does not use that language.

Having laid down the technical foundations, we started the buildings. One was designed to host the quite representative area of human and machine conversation extending beyond the “word” to cover other sometimes ethereal but information-carrying sensations and feelings. The standard called Multimodal Conversation (MPAI-MMC) is the first attempt at digitally representing this ethereal information with the Personal Status data type and Human-Machine Communication (MPAI-HMC) is an excellent example of its application.

Another investigation stream since the early MPAI days is audio sitting at the MPAI table as “Context-based Audio Enhancement” leading to the Context-based Audio Enhancement – Use Cases (MPAI-CAE) standard. Finally, with Compression and Understanding of Industrial Data (MPAI-CUI), MPAI demonstrated that data from so far unexplored domains like finance could benefit from standards.

Just one year after its establishment, MPAI could claim success by publishing its first three standards: MPAI-CUI, MPAI-GME, and MPAI-MMC and, by the end of 2021, another two: MPAI-AIF and MPAI-CAE.

Since its early days, MPAI was convinced that standards should have as much visibility as possible. For this reason, it established a successful cooperation with the Institute of Electrical and Electronics Engineers (IEEE) Standards Association (SA). Today, starting from three standards in 2022, nine MPAI standards have been adopted by IEEE without modifications and three more are in the pipeline.

The creation of MPAI Development Committees and Working Groups and their activity continued unrelenting. The use of watermarking and then fingerprinting to trace the use of neural networks led to the development of Neural Network Watermarking – Traceability (NNW-NNT). Connected Autonomous Vehicles was started in late 2020 and is now a standard with the name Connected Autonomous Vehicle – Technologies (CAV-TEC). MPAI was probably the first to engage in activities leading to a metaverse standard and now it can claim to have a solid candidate to lead the move to interoperable metaverses with MPAI Metaverse Model – Technologies (MMM-TEC). Since its early days, MPAI worked on online gaming, producing the Server-based Predictive Multiplayer Gaming – Mitigation of Data Loss Effects (SPG-MDL) standard where a set of AI Modules predicts the game state of an online multiplayer game.

MPAI abhors the attitude of other standards bodies who develop unnecessarily “siloed” standards where technologies are treated exclusively from the point of view of the domain of that standard without considering similar technologies in other domains. Object and Scene Description (MPAI-OSD) and Portable Avatar Format (MPAI-PAF) do specify AI Workflows specific to their domains but their AI Modules and Data Types were specified for wide reuse in many other MPAI standards. This attitude is not confined to these two standards as the same can be said of MPAI-CAE and MPAI-MMC.

Atypical – but no less important – standards are AI Module Profiles (MPAI-PRF) establishing a machine-readable description to identify AI Module Profiles and Data Types, Formats, and Attributes (MPAI-TFA) providing a standard way to add information about data for processing by a machine.

Last comes a standard that embodies probably the very first activity – AI for video. AI-Enhanced Video Coding – Up-sampling Filter for Video applications (EVC-UFV) offers an AI super-resolution filter vastly superior to currently used filters.

Five years ago, MPAI was very bold in targeting standards for AI, then just a nice technology to talk about. In five years, however, AI is all over the place and much talked about. What will the future offer for MPAI?

Some answers are clear:

With its impressive portfolio of 15 standards, there will be much maintenance and enhancement work to do.
Two new standards are being developed and should be completed in a short time: AI for Health – Health Secure Platform and XR Venues – Live Theatrical Performance.
One project – End-to-End Video Coding – has still to go through the Call for Technologies phase.
A Call for Technologies is open, and responses are expected: Neural Network Watermarking – Technologies.
A new Call for Technologies on Pursuing Goals in the metaverse is being prepared. This will require the development of a significant number of “behaviours” on top of a “baseline” Small Language Model.
Development of reference implementations to enhance the value and attractiveness of existing standards.

AI continues its lightning speed of development and MPAI will continue watching and identifying standardisation opportunities in different domains.

Long live MPAI!


Leonardo Chiariglione
2025-10-02

MPAI celebrates five years of pioneering AI standards

Geneva, Switzerland – 30th September 2025. MPAI – Moving Picture, Audio and Data Coding by Artificial Intelligence – the international, non-profit, unaffiliated organisation developing AI-based data coding standards – has celebrated its fifth anniversary at its 60th General Assembly (MPAI-60).

Established five years ago on 30 September 2020, MPAI has created the organisation, given itself rigorous working procedures, developed 15 standards and two technical reports, obtained adoption of eight of its standards without modification by the IEEE Standards Association, and is setting its sights on the next challenges, targeting both extensions and new standards.

In line with its mission of AI-based data coding, MPAI standards cover execution of AI applications, audio enhancement, connected autonomous vehicles, finance, human and machine conversation, metaverse, objects and scenes, avatars, and many others.

MPAI-60 has approved final publication of new versions of existing standards:

and is publishing the following standards for Community Comments

MPAI is continuing the development of its work plan that involves the following activities:

AI Framework (MPAI-AIF): developing a new MPAI-AIF specification that facilitates the creation of new workflows using available AIMs.
AI for Health (MPAI-AIH): developing the specification of a system receiving and processing licensed AI Health Data and enabling clients to improve health processing models via federated learning.
Context-based Audio Enhancement (CAE-DC): developing the Audio Six Degrees of Freedom (CAE-6DF) and Audio Object Scene Rendering (CAE-AOR) specifications.
Connected Autonomous Vehicle (MPAI-CAV): investigating extensions of the current CAV-TEC specification.
Compression and Understanding of Industrial Data (MPAI-CUI): developing the Company Performance Prediction V2.0 specification.
End-to-End Video Coding (MPAI-EEV): exploring the potential of AI-based End-to-End Video coding.
AI-Enhanced Video Coding (MPAI-EVC): refining the Up-sampling Filter for Video applications (EVC-UFV) standard.
Governance of the MPAI Ecosystem (MPAI-GME): working on version 2.0 of the Specification.
Human and Machine Communication (MPAI-HMC): developing reference software and performance assessment.
Multimodal Conversation (MPAI-MMC): developing the notion of Perceptive and Agentive AI (PAAI) capable of handling more complex questions.
MPAI Metaverse Model (MPAI-MMM): extending the capabilities of the MMM-TEC specs to support more applications.
Neural Network Watermarking (MPAI-NNW): issuing a Call on Neural Network Traceability Technologies.
Object and Scene Description (MPAI-OSD): extending the capabilities of the MPAI-OSD V1.3 to support more applications.
Portable Avatar Format (MPAI-PAF): extending the capabilities of the MPAI-PAF V1.4 to support more applications.
AI Module Profiles (MPAI-PRF): extending the scope of the current version of AI Module Profiles.
Server-based Predictive Multiplayer Gaming (MPAI-SPG): exploring new standard opportunities in the domain.
Data Types, Formats, and Attributes (MPAI-TFA): extending the standard to data types used by MPAI standards (e.g., automotive, health, and metaverse).
XR Venues (MPAI-XRV): developing the standard for improved development and execution of Live Theatrical Performances.

Legal entities and representatives of academic departments supporting the MPAI mission and able to contribute to the development of standards for the efficient use of data can become MPAI members.


Leonardo Chiariglione
2025-08-21

Exploring the innovations of the MMM-TEC V2.1 standard

The MPAI Metaverse Model (MPAI-MMM) – Technologies (MMM-TEC) specification is based on an innovative approach. As in the real world (Universe) we have animate and inanimate things, in an MPAI Metaverse (M-Instance) we have Processes and Items. Processes can animate Items (things) in the metaverse but can also act as a bridge between metaverse and universe. For convenience, MMM-TEC defines four classes of Processes: Apps, Devices, Services, and Users.

Probably the most interesting one is the User, defined as the “representative” of a human, where representation means that the human is responsible for what their Users do in the metaverse. The representation can be very strict, when the human drives everything one of their Users does, or very loose, when the User is a fully autonomous agent (still under the human’s responsibility). As the User is a Process, it cannot be “perceived” except through what it does, but it can render itself in a perceptible form, called a Persona, that may visually appear as a humanoid. A human can have more than one User, and a User can be rendered with more than one Persona.

Humans can do interesting things in the world, but what interesting things can they do in the metaverse? MMM-TEC answers this question by offering a range of 28 basic Actions, called Process Actions. An important one is Register. By Registering, a human gets the Rights to import (via the UM-Send Action) and deploy (via the Execute Action) Users and render (by e.g., MM-Adding) Personae. UM-Send means sending things from the universe to the metaverse and MM-Add means placing an Avatar and then possibly animating it (MM-Animate) with a stream or rendering it (MU-Actuate) in the universe.

Universe and metaverse are connected, but they should be mutually “protected”. One example of what this means is that data from the universe cannot simply be imported into the metaverse: it is first captured (UM-Capture), then identified (Identify) – i.e., converted into an Item – and finally acted upon, e.g., used to animate an avatar. Also, a User is not entitled to do just anything anywhere in the metaverse because its operation is governed by three basic notions: Rights, expressing the fact that a User (in general, a Process) may perform a certain Process Action; Rules, expressing the fact that a Process may, may not, or must perform a Process Action; and P-Capabilities, expressing the fact that the Process can perform certain Process Actions.

What if a Process wants to perform a Process Action, has the Rights to perform it, and its performance complies with the Rules, but it cannot, i.e., it does not know how to perform it? MMM-TEC makes use of a philosophy of language notion called Speech Act that is expressed by an individual and contains both information and action. For instance, User MU-Actuates Persona At M-Location At U-Location With Spatial Attitude will mean that the User renders at U-Location in the universe with a certain Position and Orientation the Persona that is placed at an M-Location in the Metaverse. If the User can – i.e., it has the P-Capabilities to – MU-Actuate the Persona, for instance because it is connected to the universe via an appropriate device, and may, i.e., it has the Rights to MU-Actuate, and the planned Process Action complies with the Rules, then the Process Action is performed. However, if the User does not have the necessary P-Capabilities or does not have the Rights to MU-Actuate the Persona, it can ask an Import-Export Service to do this on its behalf. Possibly, the Service will request that a Transaction be made in order to perform the requested Process Action.

As a last point, we should describe how MMM-TEC represents Rights and Rules. MMM-TEC states that Rights are, in general, a collection of Process Actions that the Process can perform. Each of them is preceded by Internal, Acquired, or Granted to indicate if the Rights were obtained at the time of Registration, were acquired (e.g., by a Transaction), or are Granted (and then possibly withdrawn) by another Process. Similarly, Rules are expressed by Process Actions each of which is preceded by May, May not, or Must.
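The interplay of Rights, Rules, and P-Capabilities described above can be sketched in a few lines of Python. This is only an illustration, not the MMM-TEC data format: the class names, the `decide` function, and the choice to treat an action without a Rule as allowed are all assumptions made for the sketch.

```python
from dataclasses import dataclass, field
from enum import Enum


class RightOrigin(Enum):
    INTERNAL = "Internal"   # obtained at the time of Registration
    ACQUIRED = "Acquired"   # e.g. obtained through a Transaction
    GRANTED = "Granted"     # granted (and possibly withdrawn) by another Process


class Modality(Enum):
    MAY = "May"
    MAY_NOT = "May not"
    MUST = "Must"


@dataclass
class Process:
    name: str
    rights: dict = field(default_factory=dict)         # action name -> RightOrigin
    p_capabilities: set = field(default_factory=set)   # action names it can perform


def decide(process, action, rules):
    """Return 'perform', 'delegate' or 'refuse' for a planned Process Action.

    `rules` maps an action name to a Modality. An action with no Rule is
    treated as allowed, which is an assumption of this sketch.
    """
    if rules.get(action, Modality.MAY) is Modality.MAY_NOT:
        return "refuse"
    has_rights = action in process.rights
    can_perform = action in process.p_capabilities
    if has_rights and can_perform:
        return "perform"
    # Missing Rights or P-Capabilities: ask a Service to act on the Process's behalf.
    return "delegate"


rules = {"MU-Actuate": Modality.MAY, "MM-Add": Modality.MAY_NOT}
connected_user = Process("User-A", {"MU-Actuate": RightOrigin.INTERNAL}, {"MU-Actuate"})
isolated_user = Process("User-B", {"MU-Actuate": RightOrigin.ACQUIRED}, set())

print(decide(connected_user, "MU-Actuate", rules))
print(decide(isolated_user, "MU-Actuate", rules))
print(decide(connected_user, "MM-Add", rules))
```

In this sketch the second User holds the Rights but lacks the P-Capabilities, so the request would go to a Service such as the Import-Export Service mentioned above, which may in turn ask for a Transaction.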

We could add many more details to give a complete description of the MMM-TEC potential. You can directly access the standard here, but now we want to address some of the innovations introduced by MMM-TEC V2.1.

The first is the set of new capabilities provided by the Property Change Process Action. We said that we can MM-Add a Persona and then MM-Animate it. But what if we are preparing a theatre performance and we do not want “to be seen” while rehearsing? Property Change can set the Perceptibility Status of an Item but can also change:

The properties of a visual Item in terms of its size, mass, material (i.e., to signal that the object is material or immaterial), gravity (is subject to gravity or not), and texture map.
The audio characteristics of an object: Reflectivity, Reverberation, Diffusion, and Absorption.
The properties of a light source: Type (Point, Directional, Spotlight, Area), colour, and intensity of the light source.
The properties of an audio source: Diffuseness, Directional Patterns, Shape, and Size.
The Personal Status (i.e., emotion) of an avatar.

Another important set of functionalities is provided by significant extensions of how a Process in the metaverse can affect the universe. MMM-TEC V2.1 allows a User to MU-Actuate at a U-Location an Item MM-Added at an M-Location. How can this Process Action be performed? We assume that the M-Instance is connected to a special Device that can perform the following in the universe:

Pick an existing object.
Drive a 3D printer that produces the analogue version of the Item.
Render a 2D or a 3D media object.

MMM-TEC V2.1 calls R-Item any physical object in the universe, including the object produced by a 3D printer and the 2D or 3D media object produced. It also defines the following additional Process Actions:

MU-Add an R-Item: to place an R-Item (a physical object) somewhere in the universe with a Spatial Attitude.
MU-Animate an R-Item: to animate, e.g., a robot, with a stream.
MU-Move an R-Item from a U-Location to another U-Location along a Trajectory.

MMM-TEC is rigorous in defining how Process Actions can be performed in an M-Instance, but what about the universe? Do we want Processes to perform actions in the universe in an uncontrolled way?

The answer is clear: the M-Instance does not control the Universe through some supernatural force but through Devices whose operation is conditional on the Rights and P-Capabilities held by the Device to perform the desired Process Actions in the universe. The Process Actions beginning with “MU-” include the Rights of a Device to act on the universe.

V2.1 adds several new use cases to the long list of V2.0. One of these is called “Emergency in Industrial Metaverse”:

An M-Location includes the Digital Twin of a real factory (R-Factory) where the regular operation is separated from emergency operation described by the use case.
An “emergency” User in the Digital Twin (V-Factory):
1. Has the Rights to actuate and animate an “emergency” robot in the R-Factory.
2. Can be rendered as a Persona having the appearance of the corresponding robot.
In case of an emergency, the User:
1. Activates an alarm in the R-Factory.
2. Actuates its “emergency” robot (Analogue Twin) in the R-Factory.
3. Animates the robot to solve the problem.
4. Renders its Persona so that humans can see what is happening in the R-Factory.
When the emergency is resolved, the robot is moved to its repository.
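The sequence of this use case can be written down as a small Python sketch. It is purely illustrative: the function name, the step strings, and the mapping of "actuate", "animate", and "move" to the MU-Actuate, MU-Animate, and MU-Move Process Actions are assumptions, and the specification does not prescribe this code.

```python
# Rights the "emergency" User needs before it may act on the R-Factory
# (assumed mapping to Process Action names).
REQUIRED_RIGHTS = ("MU-Actuate", "MU-Animate")


def run_emergency(user_rights):
    """Return the ordered steps of the use case, or raise if Rights are missing."""
    missing = [right for right in REQUIRED_RIGHTS if right not in user_rights]
    if missing:
        raise PermissionError("missing Rights: " + ", ".join(missing))
    return [
        "activate alarm in R-Factory",
        "MU-Actuate emergency robot (Analogue Twin) in R-Factory",
        "MU-Animate robot to solve the problem",
        "render Persona so that humans can follow events",
        "MU-Move robot to its repository once the emergency is resolved",
    ]


for step in run_emergency({"MU-Actuate", "MU-Animate"}):
    print(step)
```

Checking the Rights before any step runs mirrors the point made earlier that Devices act on the universe only if the required Rights and P-Capabilities are held.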

You are invited to register to attend the online presentation on 12 September at 15 UTC and to provide your comments to the MPAI Secretariat by 2025/09/28 T23:59 UTC.


Leonardo Chiariglione
2025-08-20

MPAI publishes MPAI Metaverse Model – Technologies V2.1 standard with extended functionalities

Geneva, Switzerland – 20th August 2025. MPAI – Moving Picture, Audio and Data Coding by Artificial Intelligence – the international, non-profit, unaffiliated organisation developing AI-based data coding standards – has concluded its 59th General Assembly (MPAI-59) approving the publication of the MPAI Metaverse Model – Technologies V2.1 with a request for Community Comments.

The earlier 2.0 Version of Technical Specification: MPAI Metaverse Model (MMM) – Technologies (MMM-TEC) already supported digital twinning of real-world environments and their blending with MMM-TEC-specified virtual environments. The new MMM-TEC V2.1 supports “analogue twinning” of virtual- with real-world environments opening attractive industrial metaverse applications. This is achieved by introducing new “Process Actions” (speech acts of an MMM-TEC process sent to another process) and the notion of R-Item (real object) that can be MU-Added (placed at a U-Location, a location in the real world), MU-Moved (moved from a U-Location to another U-Location along a Trajectory), and MU-Animated (animated) in sync with a Persona (the rendering of a Process as an avatar) in the metaverse.

Among the several other innovations included in MMM-TEC V2.1, we mention Change Property, a Process Action whereby a Process changes – if it holds the Rights – the place where an object is located; its properties such as perceptibility, size, mass, gravity, and texture; audio properties such as reflectivity, reverberation, diffusion and absorption; an audio or light source; and the emotional state of an avatar.

MPAI standards are best described as a web of interconnected specifications. The new technologies needed by MMM-TEC are partly specified by Object and Scene Description (MPAI-OSD), Portable Avatar Format (MPAI-PAF), and Data Types, Formats, and Attributes (MPAI-TFA). They are now at versions V1.4, V1.5, and V1.4, respectively.

The MMM-TEC V2.1 standard on 12 September at 15 UTC (link).
The MPAI-OSD V1.4 and MPAI-PAF V1.5 standards on 12 September at 10 UTC (link).
The MPAI-TFA V1.4 standard on Wednesday 17 September at 15 UTC (link).
The MPAI-GME V2.0 standard on Friday 26 September at 14 UTC (link).

MPAI is continuing the development of its work plan that involves the following activities:

AI Framework (MPAI-AIF): developing a new MPAI-AIF specification that facilitates the creation of new workflows using available AIMs.
AI for Health (MPAI-AIH): developing the specification of a system receiving and processing licensed AI Health Data and enabling clients to improve health processing models via federated learning.
Context-based Audio Enhancement (CAE-DC): developing the Audio Six Degrees of Freedom (CAE-6DF) and Audio Object Scene Rendering (CAE-AOR) specifications.
Connected Autonomous Vehicle (MPAI-CAV): investigating extensions of the current CAV-TEC specification.
Compression and Understanding of Industrial Data (MPAI-CUI): developing the Company Performance Prediction V2.0 specification.
End-to-End Video Coding (MPAI-EEV): exploring the potential of AI-based End-to-End Video coding.
AI-Enhanced Video Coding (MPAI-EVC): refining the Up-sampling Filter for Video applications (EVC-UFV) standard.
Governance of the MPAI Ecosystem (MPAI-GME): working on version 2.0 of the Specification.
Human and Machine Communication (MPAI-HMC): developing reference software and performance assessment.
Multimodal Conversation (MPAI-MMC): developing the notion of Perceptive and Agentive AI (PAAI) capable of handling more complex questions.
MPAI Metaverse Model (MPAI-MMM): extending the capabilities of the MMM-TEC specs to support more applications.
Neural Network Watermarking (MPAI-NNW): issuing a Call on Neural Network Traceability Technologies.
Object and Scene Description (MPAI-OSD): extending the capabilities of the MPAI-OSD V1.3 to support more applications.
Portable Avatar Format (MPAI-PAF): extending the capabilities of the MPAI-PAF V1.4 to support more applications.
AI Module Profiles (MPAI-PRF): extending the scope of the current version of AI Module Profiles.
Server-based Predictive Multiplayer Gaming (MPAI-SPG): exploring new standard opportunities in the domain.
Data Types, Formats, and Attributes (MPAI-TFA): extending the standard to data types used by MPAI standards (e.g., automotive, health, and metaverse).
XR Venues (MPAI-XRV): developing the standard for improved development and execution of Live Theatrical Performances.

Legal entities and representatives of academic departments supporting the MPAI mission and able to contribute to the development of standards for the efficient use of data can become MPAI members.


Leonardo Chiariglione
2025-07-14

Exploring the Up-sampling Filter for Video applications (EVC-UFV) standard

MPAI has approved Technical Specification: AI-Enhanced Video Coding (MPAI-EVC) – Up-sampling Filter for Video applications (EVC-UFV).

The standard includes a general procedure to design video up-sampling filters based on super resolution techniques and a method to reduce the complexity of the designed filters without significant performance loss. The standard also provides the parameters of specific filters for standard definition to high definition and high definition to ultra-high definition, for the complexity-reduced and original cases.

The standard will be presented online on 23 July at 13 UTC. Register here to attend the presentation.

The standard is not in final form. It is published with a request for Community Comments according to MPAI procedures. Comments should be sent to the MPAI Secretariat by 2025/08/18 T23:59 UTC.

A method typically used in video coding is to down-sample the input video frame to half its resolution before encoding. This reduces the computational cost but requires an up-sampling filter to recover the original video resolution in the decoded video while reducing the loss in visual quality as much as possible. Currently used filters are bicubic and Lanczos.

Figure 1 – Up-sampling Filters for Video application (EVC-UFV)
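For readers unfamiliar with the traditional filters that EVC-UFV is compared against, the following is a minimal pure-Python sketch of Lanczos interpolation applied to a one-dimensional signal. It is not part of the standard: the function names, the window size `a=3`, the border clamping, and the co-sited sample grid are choices made for this illustration, and a real video up-sampler would apply the filter separably to rows and columns.

```python
import math


def lanczos_kernel(x, a=3):
    """Lanczos window: sinc(x) * sinc(x / a) for |x| < a, zero elsewhere."""
    if x == 0:
        return 1.0
    if abs(x) >= a:
        return 0.0
    px = math.pi * x
    return a * math.sin(px) * math.sin(px / a) / (px * px)


def upsample_2x(samples, a=3):
    """Double the number of samples of a 1D signal with Lanczos interpolation."""
    n = len(samples)
    out = []
    for j in range(2 * n):
        pos = j / 2.0                      # output position in input coordinates
        first = math.floor(pos) - a + 1    # first input tap inside the window
        acc = 0.0
        weight_sum = 0.0
        for i in range(first, first + 2 * a):
            w = lanczos_kernel(pos - i, a)
            acc += w * samples[min(max(i, 0), n - 1)]  # clamp at the borders
            weight_sum += w
        out.append(acc / weight_sum)       # normalise so flat areas stay flat
    return out


print(upsample_2x([0.0, 1.0, 2.0, 3.0]))
```

Even-numbered outputs reproduce the input samples, while odd-numbered ones are interpolated from up to six neighbours. A learned super-resolution filter replaces this fixed kernel with a trained network.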

In the last few years, Artificial Intelligence (AI), Machine Learning (ML), and especially Deep Learning (DL) techniques, have demonstrated their capability to enhance the performance of various image and video processing tasks. MPAI has performed an investigation to assess how video coding performance could be improved by replacing traditional coding blocks with deep-learning ones. The outcome of this study has shown that deep-learning based up-sampling filters significantly improve the performance of existing video codecs.

MPAI issued a Call for Technologies for up-sampling filters for video applications in October 2024. This was followed by an intense phase of development that enabled MPAI to approve Technical Specification: AI-Enhanced Video Coding (MPAI-EVC) – Up-sampling Filter for Video application (EVC-UFV) V1.0 with a request for Community Comments at its 58th General Assembly (MPAI-58).

The EVC-UFV standard enables efficient, low-complexity up-sampling filters applied to video with bit depths of 8 and 10 bits per component, in the standard YCbCr colour space with 4:2:0 sub-sampling, encoded with a variety of encoding technologies using different encoding features such as random access and low delay.

As depicted in Figure 2, the filter is a Densely Residual Laplacian Super-Resolution network (DRLN), offering a novel deep-learning approach.

Figure 2 – Densely Residual Laplacian Super-Resolution network (DRLN).

The complexity of the filter is reduced in two steps. First, a drastic simplification of the deep-learning structure that reduces the number of blocks provides a much lighter network while keeping performance similar to that of the baseline DRLN. This is achieved by identifying the DRLN’s principal components and understanding the impact of each component on the output video frame quality, memory size, and computational costs.

As shown in Figure 2, the main component of the DRLN architecture is a Residual Block, which is composed of Densely Residual Laplacian Modules (DRLMs) and a convolutional layer. Each DRLM contains three Residual Units, as well as one compression unit and one Laplacian attention unit (a set of convolutional layers with a square filter size and dilation that is greater than or equal to the filter size). Each Residual Unit consists of two convolutional layers and two ReLU layers. All DRLMs in each Residual Block and all Residual Units in each DRLM are densely connected. The Laplacian attention unit consists of three convolutional layers with filter size 3×3 and dilation (a technique for expanding a convolutional kernel by inserting holes or gaps between its elements) equal to 3, 5, and 7. All convolutional layers in the network, except the Laplacian ones, have filter size 3×3 with dilation equal to 1. Throughout the network, the number of feature maps (the outputs of convolutional layers) is 64.
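To make the effect of dilation concrete, the short Python sketch below computes how many pixels a dilated kernel spans, using the usual relation span = k + (k − 1)(d − 1) for kernel size k and dilation d. The function name is made up for this illustration. The dilation values are the ones quoted above.

```python
def effective_kernel_size(kernel_size, dilation):
    """Number of pixels spanned along one axis by a dilated kernel."""
    return kernel_size + (kernel_size - 1) * (dilation - 1)


# Dilation 1 is the ordinary 3x3 convolution; 3, 5 and 7 are the
# dilations of the Laplacian attention unit.
for dilation in (1, 3, 5, 7):
    span = effective_kernel_size(3, dilation)
    print(f"3x3 kernel, dilation {dilation}: spans {span}x{span} pixels")
```

The three attention layers therefore look at 7×7, 11×11, and 15×15 neighbourhoods while each still uses only nine weights per input and output channel pair.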

Based on this structural analysis, reducing the number of main Residual Blocks, adding more DRLMs, and reducing the complexity of the Residual Unit and the number of hidden convolutional layers and feature maps drastically accelerates execution and reduces memory usage, but does not substantially affect the network’s visual quality performance.

Figure 3 depicts the resulting EVC-UFV Up-sampling Filter.

Figure 3 – Structure of the EVC-UFV Up-sampling Filter

The parameters of the original and complexity-reduced network are given in Table 1.

Table 1 – Parameters of the original and the complexity-reduced network

	Original	Final
Residual Blocks	6	2
DRLMs per Residual Block	3	6
Residual Units per DRLM	3	3
Hidden Convolutional Layers per Residual Unit	2	1
Input Feature Maps	64	32
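A rough feel for what the figures in Table 1 imply can be obtained with a back-of-the-envelope count. The Python sketch below is an estimate under stated assumptions, not a figure from the standard: it takes the per-DRLM count of three as the number of Residual Units, treats every hidden layer as a 3×3 convolution with as many input as output feature maps, and ignores biases, compression units, attention units, and all other layers. The function name is hypothetical.

```python
def residual_unit_conv_weights(blocks, drlms_per_block, units_per_drlm,
                               hidden_layers_per_unit, feature_maps,
                               kernel_size=3):
    """Return (number of Residual Units, conv weights inside them)."""
    units = blocks * drlms_per_block * units_per_drlm
    # One k x k convolution mapping C feature maps to C feature maps.
    weights_per_layer = kernel_size * kernel_size * feature_maps * feature_maps
    return units, units * hidden_layers_per_unit * weights_per_layer


original_units, original_weights = residual_unit_conv_weights(6, 3, 3, 2, 64)
final_units, final_weights = residual_unit_conv_weights(2, 6, 3, 1, 32)

print(f"Residual Units: {original_units} -> {final_units}")
print(f"Conv weights in Residual Units: {original_weights} -> {final_weights}")
print(f"Reduction factor: {original_weights / final_weights:.1f}")
```

Under these simplifying assumptions the Residual Units of the simplified network hold about one twelfth of the convolution weights of the original, mostly because halving the feature maps divides the weights of each layer by four.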

Second, the complexity of the simplified network is reduced by 40% by pruning its parameters and weights, with a loss in performance of less than 1% in BD-rate. This is achieved by first using the well-known DeepGraph technique, modified to work with deep-learning based up-sampling filters, to understand the dependencies among the layers of the different components of the simplified network. This facilitates grouping components sharing a common pruning approach that can be applied without introducing dimensional inconsistencies among the inputs and outputs of the layers.

Verification Tests of the technology have been performed on:

Standard sequences	CatRobot, FoodMarket4, ParkRunning3.
Bits/sample	8 and 10 bit-depth per component.
Colour space	YCbCr with 4:2:0 sub sampling.
Encoding technologies	AVC, HEVC, and VVC.
Encoding settings	Random Access and Low Delay at QPs 22, 27, 32, 37, 42, 47.
Up-sampling	SD to HD and HD to UHD.
Metrics	BD-Rate, BD-PSNR, and BD-VMAF.
Deep-learning structure	Same for all QPs.

Results show an impressive improvement for all coding technologies and encoding options, and for all three objective metrics, when compared with the currently used traditional bicubic interpolation. The results of Table 2 have been obtained for the low-delay coding mode.

Table 2 – Performance of the EVC-UFV Up-sampling Filter

	AVC	HEVC	VVC
SD to HD (using own trained filter)	14.4%	12.2%	13.8%
HD to UHD (using own trained filter)	5.6%	6%	6.5%
SD to HD (using HD to UHD filter)	14%	11.6%	11.4%

All results are obtained with the 40% pruned network.
