February 2021 - MPAI community

Leonardo Chiariglione
2025-11-23

Visual Spatial Reasoning: The Vision Aware Interpreter

Autonomous User (A-User) is an autonomous agent able to move and interact (converse, etc.) with another User in a metaverse. It is a “conversation partner in a metaverse interaction” with the User, itself an A-User or an H-User directly controlled by a human. The figure shows a diagram of the A-User while the User generates audio-visual streams of information and possibly text as well.

This is the fourth of a sequence of posts aiming to illustrate the architecture of an A-User and provide an easy entry point for those who wish to respond to the MPAI Call for Technology on Autonomous User Architecture. The first three dealt with 1) the Control performed by the A-User Control AI Module on the other components of the A-User, 2) how the A-User captures the external metaverse environment using the Context Capture AI Module, and 3) listens, localises, and interprets sound not just as data, but as data having a meaning and a spatial anchor.

When the A-User acts in a metaverse space, sound doesn’t tell the whole story. The visual scene – objects, zones, gestures, occlusions – is the canvas where situational meaning unfolds. That’s where Visual Spatial Reasoning comes in: it’s the interpreter that makes sense of what the Autonomous User sees, not just what it hears.

Visual Spatial Reasoning can be considered as the visual analyst embedded in the “brain” of the Autonomous User. It doesn’t just list objects; it understands their geometry, relationships, and salience. A chair isn’t just “a chair” – it’s occupied, near a table, partially occluded, or the focus of attention. By enriching raw descriptors into structured semantics, Visual Spatial Reasoning transforms objects made of pixels into actionable targets.

This is what it does

Scene Structuring: Takes and organises raw visual descriptors into coherent spatial maps.
Semantic Enrichment: Adds meaning – classifying objects, inferring affordances, and ranking salience.
Directed Alignment: Filters and prioritises based on the A-User Controller’s intent, ensuring relevance.
Traceability: Every refinement step is auditable, to trace back why, “that object in the corner” became “the salient target for interaction.”

Why It Matters

Without Visual Spatial Reasoning, the metaverse would be a flat stage of unprocessed visuals. With it, visual scenes become interpretable narratives. It’s the difference between “there are three objects in the room” and “the User is focused on the screen, while another entity gestures toward the door.”

Of course, Visual Spatial Reasoning does not replace vision. It bridges the gap between raw descriptors and effective interaction, ensuring that the A‑User can observe, interpret, and act with precision and intent.

If Audio Spatial Reasoning is the metaverse’s “sound‑aware interpreter,” then Visual Spatial Reasoning is its “sight‑aware analyst” that starts by seeing objects and eventually can understand their role, their relevance, and their story in the scene.

No Comments InAll posts

Leonardo Chiariglione
2025-11-19

Audio Spatial Reasoning: The Sound-Aware Interpreter

Autonomous User (A-User) is an autonomous agent able to move and interact (converse, etc.) with another User in a metaverse. It is a “conversation partner in a metaverse interaction” with the User, itself an A-User or and H-User directly controlled by a human. The figure shows a diagram of the A-User while the User generates audio-visual streams of information and possibly text as well.

We have already presented the system diagram of the Autonomous User (A-User), an autonomous agent able to move and interact (walk, converse, do things, etc.) with another User in a metaverse. The latter User may be an A-User or be under the direct control of a human and is thus called a Human-User (H-User). The A-User acts as a “conversation partner in a metaverse interaction” with the User.

This is the third of a sequence of aiming at illustrating more in depth the architecture of an A-User and provide an easy entry point for those who wish to respond to the MPAI Call for Technology on Autonomous User Architecture. The first two dealt with the Control performed by the A-User Control AI Module on the other components of the A-User and how the A-User captures the external metaverse environment using the Context Capture AI Module.

Audio Spatial Reasoning is the A-User’s acoustic intelligence module – the one that listens, localises, and interprets sound not just as data, but as data having a spatially anchored meaning. Therefore, Its role is not just about “hearing”, it is also about “understanding” where sound is coming from, how relevant it is, and what it implies in the context of the User’s intent in the environment.

When the A-User system receives a Context snapshot from Context Capture – including audio streams with a position and orientation and a description of the User’s emotional state (called User State) – Audio Spatial Reasoning start an analysis of directionality, proximity, and semantic importance of incoming sounds. The conclusion is something like “That voice is coming from the left, with a tone of urgence, and its orientation is directed at the A-User.”

All this is represented with an extension of the Audio Scene Descriptors describing:

Which audio sources are relevant
Where they are located in 3D space
How close or far they are
Whether they’re foreground (e.g., a question) or background (e.g., ambient chatter)

This guide is sent to Prompt Creation and Domain Access. Let’s see what happens with the former. The extended Audio Scene Descriptors are fused with the User’s spoken or written input and the current User State. The result is a PC-Prompt – a rich query enriched with text expressing the multimodal information collected so far – that is passed to Basic Knowledge for reasoning.

The Audio Scene Descriptors are further processed and integrated with domain-specific information. The response is called Audio Spatial Directive that includes domain-specific logic, scene priors, and task constraints. For example, if the scene is a medical simulation, Domain Access might tell Audio Spatial Reasoning that “only sounds from authorised personnel should be considered”. This feedback helps Audio Spatial Reasoning refine its interpretation – filtering out irrelevant sounds, boosting priority for critical ones, and aligning its spatial model with the current domain expectations.

Therefore, we can call Audio Spatial Reasoning as the A-User’s auditory guide. It knows where sounds are coming from, what they mean, and how they should influence the A-User’s behaviour. The A-User responds to a sound with spatial awareness, contextual sensitivity, and domain consistency.

There are still about two mounts to the deadline of 2025/01/19 when responses Call must reach the MPAI Secretariat (secretariat@mpai.community) without exception.

To know more, register to attend the online presentations of the Call on 17 November at 9 UTC (https://tinyurl.com/y4antb8a) and 16 UTC (https://tinyurl.com/yc6wehdv) – Today or tomorrow depending on where you are.

No Comments InAll posts

Leonardo Chiariglione
2025-11-13

Context Capture: The A-User’s First Glimpse of the World

Autonomous User (A-User) is an autonomous agent able to move and interact (converse, etc.) with another User in a metaverse. It is a “conversation partner in a metaverse interaction” with the User, itself an A-User or and H-User directly controlled by a human. The figure shows a diagram of the A-User while the User generates audio-visual streams of information and possibly text as well.

The sequence of posts – of which this is the second – that illustrates more in depth the architecture of an A-User provides as an easy entry point for those who wish to respond to the MPAI Call for Technology on Autonomous User Architecture. The first post dealt with the A-User Control, the AI-Module (AIM) that controls the other AIM of the A-User and is possibly controlled by a human.

Context Capture is the A-User’s sensory front-end – the AIM that opens up perception by scanning the environment and assembling a structured snapshot of what’s out there in the moment. It is the first AI Module (AIM) in the loop providing the data and setting the stage for everything that follows. When A-User Control decides it’s time to engage, it prompts Context Capture to focus on a specific M-Location – the zone where the User is active, rendering its Avatar.

What Context Capture produces is called Context – a time-stamped, multimodal snapshot that represents the A-User’s initial understanding of the scene. But this isn’t just raw data. Context is composed of two key ingredients: Audio-Visual Scene Descriptors and User State.

The Audio-Visual Scene Descriptors are like a spatial sketch of the environment. They describe what’s visible and audible: objects, surfaces, lighting, motion, sound sources, and spatial layout. They provide the A-User with a sense of “what’s here” and “where things are.” But they’re not perfect. These descriptors are often shallow – they capture geometry and presence but not meaning. A chair might be detected as a rectangular mesh with four legs, but Context Capture doesn’t know if it’s meant to be sat on, moved, or ignored.

That’s where Spatial Reasoning comes in. Spatial Reasoning is the AIM that takes this raw spatial sketch and starts asking the deeper questions:

“Which object is the User referring to?”
“Is that sound coming from a relevant source?”
“Does this object afford interaction, or is it just background?”

It analyses the Context and produces two critical outputs:

Spatial Output: a refined map of spatial relationships, referent resolutions, and interaction constraints.
Spatial Guide: a set of cues that enrich the user’s input — highlighting which objects or sounds are relevant, how close they are, and how they might be used.

These outputs are sent downstream to Domain Access and Prompt Creation. The former refines the spatial understanding of the scene. The latter enriches the A-User’s query when it formulates the prompt to the Basic Knowledge (LLM).

Then there’s User State – a snapshot of the User’s cognitive, emotional, and attentional posture. Is the User focused, distracted, curious, frustrated? Context Capture reads facial expressions, gaze direction, posture, and vocal tone to infer a baseline state. But again, it’s just a starting point. User behaviour may be nuanced, and initial readings can be incomplete, noisy or ambiguous. That’s why User State Refinement exists – to track changes over time, infer deeper intent, and guide the alignment of the A-User’s expressive behaviour done by Personality Alignment.

In short, Context Capture is the A-User’s first glimpse of the world – a fast, structured perception layer that’s good enough to get started, but not good enough to finish the job. It’s the launchpad for deeper reasoning, richer modulation, and more expressive interaction. Without it, the A-User would be blind. With it, the system becomes situationally aware, emotionally attuned, and ready to reason – but only if the rest of the AIMs do their part.

Responses to the Call must reach the MPAI Secretariat (secretariat@mpai.community) by 2025/01/21.

To know more, register to attend the online presentations of the Call on 17 November at 9 UTC (https://tinyurl.com/y4antb8a) and 16 UTC (https://tinyurl.com/yc6wehdv).

No Comments InAll posts

Leonardo Chiariglione
2025-11-08

A-User Control: The Autonomous Agent’s Brain

We have already presented the system diagram of the Autonomous User (A-User), an autonomous agent able to move and interact (converse, etc.) with another User in a metaverse. The latter User may also be an A-User or may be under the direct control of a human and is thus called a Human-User (H-User). The A-User acts as a “conversation partner in a metaverse interaction” with the User.

This is the first of a planned sequence of posts having the goal to illustrate more in depth the architecture of an A-User and provide an easy entry point for those who wish to respond to the MPAI Call for Technology on Autonomous User Architecture.

A-User Control is the general commander of the A-User system making sure the Avatar behaves like a coherent digital entity aware of the rights it can exercise in an instance of the MPAI Metaverse Model – Architecture (MMM-TEC) standard. The command is actuated by various signals exchanged with the AI-Modules (AIM) composing the Autonomous User.

At its core, A-User Control decides what the A-User should do, which AIM should do it, and how it should do it – all while respecting the Rights granted to the A-User and the Rules defined by the M-Instance. A-User Control either executes an Action directly or delegates it to another Process in the metaverse to carry it out.

A-User Control is not just about triggering actions. A-User Control also manages the operation of its AIMs, for instance A-User Rendering, which can turn text produced by the Basic Knowledge (LLM) and the Personal Status selected by Personality Alignment into a speaking and gesturing Avatar. A-User Control sends shaping commands to A-User Rendering, ensuring the Avatar’s behaviour aligns with metaverse-generated cues and contextual constraints.

A-User Control is not independent of human influence. The human, i.e., the A-User “owner”, can override, adjust, or steer its behaviour. This makes A-User Control a hybrid system: autonomous by design, but open to human modulation when needed.

The control begins when A-User Control triggers Context Capture to perceive the current M-Location — the spatial zone of the metaverse where the User is active. That snapshot, called Context, includes spatial descriptors and a readout of the human’s cognitive and emotional posture called User State. From there, the two Spatial Reasoning components – Audio and Visual – use Context to analyse the scene and sending outputs to Domain Access and Prompt Creation, which enrich the User’s input and guide the A-User’s understanding.

As reasoning flows through Basic Knowledge, Domain Access, and User State Refinement, A-User Control ensures that every action, rendering, and modulation is aligned with the A-User’s operational logic.

In summary, the A-User Control is the executive function of the A-User: part orchestrator, part gatekeeper, part interpreter. It’s the reason the Avatar doesn’t just speak – it does so while being aware of the Context – both the spatial and User components – with purpose, permission, and precision.

Stay tuned for more introductions into the world of the Autonomous User Architecture.

Responses to the Call must reach the MPAI Secretariat (secretariat@mpai.community) by 2025/01/21.

To know more, register to attend the online presentations of the Call on 17 November at 9 UTC (https://tinyurl.com/y4antb8a) and 16 UTC (https://tinyurl.com/yc6wehdv).

No Comments InAll posts

Leonardo Chiariglione
2025-11-02

A new MPAI standard project for Autonomous Users in a metaverse

The concept of virtual reality is now well established, with the metaverse concept as an important variant. Accordingly, MPAI has established a related standard, the MPAI Metaverse Model – Technologies (MMM-TEC) standard. However, standards for the contents of an MPAI metaverse instance (M-Instance) are still in progress. This document introduces the current status of these efforts and invites participation.

The contents include Processes representing entities with agency, called Users, and other entities lacking agency – essentially, various things populating an M-Instance – called Items.

Some Users represent humans. These may be directly operated by humans (and are called H-Users), or may have a high degree of operational autonomy (and are called A-Users, or informally, agents). Both types may be rendered as avatars called Personae.

The MMM-TEC standard specifies technologies enabling Users to perform various Actions on Items (things) in an M-Instance. For example, Users may sense data from the real world or may move Items in the M-Instance, possibly in combination with other Processes. However, MMM-TEC does not yet specify how an A-User decides to perform an Action.

Thus MPAI is developing a new standard covering such decisions: what does an A-User do when deciding to do something to achieve a Goal in an M-Instance? MPAI has assembled numerous relevant technologies, but more are needed. Therefore, the 61^st MPAI General Assembly (MPAI-61) has published the Call for Technologies Pursuing Goals in metaverse (MPAI-PGM) – Autonomous User Architecture (AUA). The Call requests interested parties – irrespective of their membership in MPAI – to submit responses that may enable MPAI to develop a robust A-User Architecture standard attractive to implementers and users.

The planned standard’s scope is as follows: PGM-AUA will specify functions and interfaces by which an A-User interacts with another User, either an A-User or an H-User. (Again, the term “User” means “conversational partner in the metaverse”, whether autonomous or driven by a human.) A-Users can capture text and audio-visual information originated by, or surrounding, the User; extract the User State, i.e., snapshots of the User’s cognitive, emotional, and interactional states; produce an appropriate multimodal response, rendered as a speaking Avatar; and move appropriately in the relevant virtual space.

One possible way to model an A-User’s interactions with other Users might be to train a very powerful unitary Large Language Model, able to use spatial and media information. However, because such a model would be unwieldy and difficult to manage, MPAI instead assumes the use of a relatively simple Large Language Model with basic language and reasoning capabilities. Spatial, audio-visual, and User description information will be passed to and from this Basic Model in natural language.

To handle this integration, MPAI proposes the MPAI AI Framework (MPAI-AIF) standard. This standard provides the necessary infrastructure to define a foundation for an A-User to which the necessary technologies can be added. MPAI-AIF enables specification of an AI Workflow (AIW) composed of AI Modules (AIMs). In this case, these can jointly represent an A-User in a manner that is modular, i.e., able to swap or update modules independently from other modules; transparent, i.e., able to perform clear roles and expose well-defined interfaces; and extensible, i.e., able to add or replace specific competences as needed.

The following figure represents a tentative diagram of the A-User architecture.

The model represents a largely autonomous A-User’s (“agent’s”) interactions with another User (A-User or H-User) at a given instant. It would thus be invoked repeatedly for extended interactions.

At a high level, we see an executive element (A-User Control), which can receive as input a human command or the response to some Action, and which delivers as output its status in response to the relevant command; any related action; and any request that it may itself deliver.

NOTE: While an A-User is defined as a relatively autonomous Process, a human may take over or modify its operation via the A-User Control.

More formally, the executive A-User Control AIM (AI Module) drives A-User operation by controlling how it interacts with the environment and performs Actions and Process Actions by:

Performing an Action or requesting another Process to perform one.
Controlling the operation of AIMs, in particular A-User Rendering.

Here is a full accounting of the input and output, separated from the full diagram to remove distractions:

This input-output diagram summarizes in two lines the Responses from, and Commands to, the A-User’s six current AIMs, or AI Modules, which jointly enable the A-User’s actions. They represent perception and reasoning about what is perceived; knowledge and processing about the current domain, e.g., surgery or a particular game; and composition of an appropriate response, including the A-User’s simulated emotion, cognitive state, and social attitudes, alignment with the agent’s simulated personality, and rendering of the resulting response’s visual and audio aspects. Most of these modules can consult the A-User’s Large Language Model, the Basic Knowledge (LLM).

Keeping this overview in mind, we can survey the individual modules in greater detail.

Context Capture

The Context Capture AIM, prompted by the A-User Control, perceives a particular location of the M-Instance – called an M-Location – where the A-User’s conversation partner, here designated as User, is rendering its Avatar. The result of the capture is called the Context, a time-stamped structured snapshot representing the A-User’s initial understanding of the M-Location. Context is composed of

Audio-Visual Scene Descriptors describing the spatial content.
User State, describing the User’s cognitive, emotional, and attentional posture within the environment.

Spatial Reasoning

The Spatial Reasoning AIM analyses the Context and:

Sends Audio and Visual Spatial Output, i.e., spatial relationships, referent resolutions, and interaction constraints, to the Domain Access AIM seeking additional domain-specific information.
Sends Audio and Visual Spatial Guides, i.e., audio source relevance, directionality, and proximity (Audio) and object relevance, orientation, proximity, and affordance (Visual) to the Prompt Creation AIM. The objective is to enrich the A-User’s spoken or written input with additional information about A-User before sending a prompt created by Prompt Creation (a PC-Prompt) to Basic Knowledge, a basic LLM. PC-Prompt includes
1. User Text and User State (from Context Capture).
2. Audio and Visual Spatial Guide (from Spatial Reasoning).

Spatial Reasoning is specified as two separate AIMs, one for the Audio and the other for the Visual component. 3D Graphics inputs are also handled by the Visual component.

Domain Access

The Basic Knowledge LLM sends to the Domain Access module an Initial Response containing a direct response to the PC-Prompt, along with general reasoning based on foundational LLM capabilities.

The Domain Access module then

Processes and responds to two flows:
1. Spatial Output (audio and visual, from Spatial Reasoning):
  1. Accesses domain-specific models, ontologies, or M-Instance services.
  2. Returns Audio and Visual Spatial Directive to inject contextual background, scene-specific logic, and task relevance into the reasoning loop of Spatial Reasoning to improve the fidelity of its spatial interpretation.
2. Initial Response (from the Basic Knowledge LLM):
  1. Accesses domain-specific models, ontologies, or M-Instance services to retrieve:
    1. Scene-specific object roles (e.g., “this is a surgical tool”)
    2. Task-specific constraints (e.g., “only authorised Users may interact”)
    3. Semantic affordances (e.g., “this object can be grasped”)
  2. Returns to Basic Knowledge LLM a DA-Prompt that includes initial reasoning, spatial semantics, domain overlays, and User/task constraints.
Performs further interchanges:
1. Sends Refined Context Guide (to User State Refinement):
  1. Includes a structured object with:
    1. Updated User descriptors
    2. Scene salience and relevance
    3. Interaction history and inferred goals

User State Refinement

Uses Refined Context Guide from the Domain Access module to update User State and to generate a UR-Prompt for the Basic Knowledge LLM, reflecting the refined understanding.

Domain Access

The Basic Knowledge LLM sends to the Domain Access module an Initial Response containing a direct response to the PC-Prompt, along with general reasoning based on foundational LLM capabilities.

The Domain Access module then

Processes and responds to two flows:
1. Spatial Output (audio and visual, from Spatial Reasoning):
  1. Accesses domain-specific models, ontologies, or M-Instance services.
  2. Returns Audio and Visual Spatial Directive to inject contextual background, scene-specific logic, and task relevance into the reasoning loop of Spatial Reasoning to improve the fidelity of its spatial interpretation.
2. Initial Response (from the Basic Knowledge LLM):
  1. Accesses domain-specific models, ontologies, or M-Instance services to retrieve:
    1. Scene-specific object roles (e.g., “this is a surgical tool”)
    2. Task-specific constraints (e.g., “only authorised Users may interact”)
    3. Semantic affordances (e.g., “this object can be grasped”)
  2. Returns to Basic Knowledge LLM a DA-Prompt that includes initial reasoning, spatial semantics, domain overlays, and User/task constraints.
Performs further interchanges:
1. Sends Refined Context Guide (to User State Refinement):
  1. Includes a structured object with:
    1. Updated User descriptors
    2. Scene salience and relevance
    3. Interaction history and inferred goals

User State Refinement

Uses Refined Context Guide from the Domain Access module to update User State and to generate a UR-Prompt for the Basic Knowledge LLM, reflecting the refined understanding.

Basic Knowledge produces and sends to Personality Alignment a Refined Response.

Personality Alignment

Receives (1) a Personality-based Refined Response and (2) an Expressive State Guide, both conveying a variety of elements such as:
1. Expressivity, e.g.:
  1. Tone, e.g., formal, casual, empathetic, assertive
  2. Tempo, e.g., fast, slow, rhythmic
  3. Gesture style, e.g., expansive, restrained, animated
  4. Facial dynamics, e.g., smile frequency, gaze behaviour, eyebrow movement
  5. Etc.
2. Behavioural Traits, e.g.:
  1. Verbosity level
  2. Use of metaphors or humour
  3. Degree of emotional expressiveness
3. Type of role: assistant, mentor, negotiator, entertainer, etc.
Formulates and sends
1. An A-User Personal Status (using the MPAI-specified Personal Status structure, containing specification of emotion, cognitive state, and social attitudes) reflecting the Personality to A-User Rendering.
2. A PA-Prompt to Basic Knowledge reflecting:
  1. Speech modulation instructions (e.g., pitch, emphasis)
  2. Facial expression timing and intensity
3. Gesture choreography
4. Synchronisation cues across modalities.

A-User Rendering

The Basic Knowledge LLM sends a Final Response to the A-User Rendering module that conveys semantic content, contextual integration, expressive framing, and personality consistency.

As seen below, the A-User Rendering module (1) uses the Final Response and A-User Personal Status to synthesise a speaking Avatar and (2) employs an A-User Control Command from A-User Control to refine the speaking Avatar.

Extended Call for Technologies

The complexity of the MMM-TEC model has prompted MPAI to extend its usual practice for Calls for Technologies. In addition to the usual Call for Technologies, Use Cases and Functional Requirements, Framework Licence, and Template for Responses, the Call also refers to a Tentative Technical Specification, a document drafted as if it were an actual Technical Specification. Respondents to the Call are free to comment on, change, or extend the Tentative Technical Specification or to make any other proposals judged relevant to the Call.

Anyone, irrespective of MPAI membership status, may respond to the Call. Responses shall reach the MPAI Secretariat by 2026/01/21T23:59.

Appropriate MPAI working groups will thoroughly review the Responses and retain those deemed appropriate for the future PGM-AUA standard. MPAI may select suitable technologies from those submitted in Responses, but is not obligated to select any proposal. Respondents will be encouraged to join MPAI. If Respondents whose Responses are accepted in full or in part do not join MPAI, MPAI will discontinue consideration of their proposed technologies.

MPAI is organizing two online presentations with similar content on 2025/11/17. To attend register:

here for the 9 UTC presentation.
here for the 16 UTC presentation.

No Comments InAll posts

Leonardo Chiariglione
2025-10-29

MPAI calls for technologies supporting metaverse-based Agentic AI

Geneva, Switzerland – 29^th October 2025. MPAI – Moving Picture, Audio and Data Coding by Artificial Intelligence – the international, non-profit, unaffiliated organisation developing AI-based data coding standards – has concluded its 61^st General Assembly (MPAI-61) approving the publication of a Call for Autonomous User Architecture Technologies.

With this Call for Technologies, formally “Pursuing Goals in metaverse (MPAI-PGM) – Autonomous User Architecture (PGM-AUA)”, MPAI is aiming at a standard enabling Autonomous Users to perform activities such as moving around and conversing with other Users. These are processes representing humans in a metaverse conforming with the MPAI Metaverse Model Technologies standard (MMM-TEC). They can either operate with a high degree of autonomy (A-Users) or be directly controlled by humans (H-Users).

PGM-AUA will rely on the friendly MMM-TEC environment and many relevant technologies already available in the 16 approved MPAI standards. However, the ambitious PGM-AUA goal requires many new technologies that the Call is designed to secure.

The text of the call and associated document is available. Responses are due to the MPAI Secretariat by 2025/01/21T23:59.

MPAI-61 has also approved the new versions of standards previously posted for Community Comments:

MPAI is continuing the development of its work plan that involves the following activities:

AI Framework (MPAI-AIF): developing a new MPAI-AIF specification that facilitates the creation of new workflows using available AIMs.
AI for Health (MPAI-AIH): developing the specification of a system receiving and processing licenses AI Health Data and enabling clients to improve health processing models via federated learning.
Context-based Audio Enhancement (CAE-DC): developing the Audio Six Degrees of Freedom (CAE-6DF) and Audio Object Scene Rendering (CAE-AOR) specifications.
Connected Autonomous Vehicle (MPAI-CAV): investigating extensions of the current CAV-TEC specification.
Compression and Understanding of Industrial Data (MPAI-CUI): developing the Company Performance Prediction V2.0 specification.
End-to-End Video Coding (MPAI-EEV): exploring the potential of AI-based End-to-End Video coding.
AI-Enhanced Video Coding (MPAI-EVC): finalising the Up-sampling Filter for Video applications (EVC-UFV) standard.
Governance of the MPAI Ecosystem (MPAI-GME): operating the MPAI Ecosystem per the MPAI-GME Specification.
Human and Machine Communication (MPAI-HMC): developing reference software and performance assessment.
Multimodal Conversation (MPAI-MMC): discussing the conversational part of the PGM-AUA Call for Technologies.
MPAI Metaverse Model (MPAI-MMM): developing support for security in the MMM-TEC specs.
Neural Network Watermarking (MPAI-NNW): Reviewing the responses to the Call on Neural Network Traceability Technologies.
Object and Scene Description (MPAI-OSD): discussing the spatial part of the PGM-AUA Call for Technologies.
Portable Avatar Format (MPAI-PAF): discussing the rendering part of the PGM-AUA Call for Technologies.
AI Module Profiles (MPAI-PRF): extending the scope of the current version of AI Module Profiles.
Server-based Predictive Multiplayer Gaming (MPAI-SPG): exploring new standard opportunities in the domain.
Data Types, Formats, and Attributes (MPAI-TFA) extending the standard to data types used by MPAI standards (e.g., automotive, health, and metaverse).
XR Venues (MPAI-XRV): developing the standard for improved development and execution of Live Theatrical Performances.

Legal entities and representatives of academic departments supporting the MPAI mission and able to contribute to the development of standards for the efficient use of data can become MPAI members.

Please visit the MPAI website, contact the MPAI Secretariat for specific information, subscribe to the MPAI Newsletter and follow MPAI on social media: LinkedIn, Twitter, Facebook, Instagram, and YouTube.

No Comments InAll posts

Leonardo Chiariglione
2025-10-04

Celebrating the first five years of MPAI

Where there are organisations counting years of existence in decades or centuries, there should not be much to celebrate for an organisation that only reaches as few as five years of existence. But there are years and years – even days and days – like in one day as a lion or a hundred years as a sheep.

The last five were not the years of a sheep but as one day as a lion.

We started with the idea of an organisation dedicated to standards for AI-based data coding because we thought that standards would bring benefits to a domain mostly alien to it. Not like some standards that look more like legal tools designed to oppress users but standards offering fair opportunities to all parties in the chain extending from innovators to end users.

An ambitious organisation like MPAI could not operate like four friends in a bar. The MPAI operation rules were developed and are now enshrined in the MPAI Patent Policy. The ambitions of MPAI were further enhanced by the definition of the MPAI Ecosystem extending from MPAI to implementers, integrators, and end users with the introduction of a new actor called MPAI Store, now incorporated in Scotland as a company limited by guarantee. There is a standard – Governance of the MPAI Ecosystem (MPAI-GME) setting the rules of operation of the Ecosystem.

The idea of a mission was there but what about implementing it? We acted as lions and posited that opaque monolithic AI should become component-based AI. Now a large share of our standards are based on the AI Framework (MPAI-AIF) standard, specifying an environment where AI Workflows composed of AI Modules can be initialised, dynamically configured, and controlled. MPAI-AIF also provided a stimulus to adoption of JSON Schema as a “language” to represent data types, AI Modules, and AI Workflows in MPAI standards. Today there is virtually no MPAI standard that does not use that language.

Having laid down the technical foundations, we started the buildings. One was designed to host the quite representative area of human and machine conversation extending beyond the “word” to cover other sometimes ethereal but information-carrying sensations and feelings. The standard called Multimodal Conversation (MPAI-MMC) is the first attempt at digitally representing this ethereal information with the Personal Status data type and Human-Machine Communication (MPAI-HMC) is an excellent example of its application.

Another investigation stream since the early MPAI days is audio sitting at the MPAI table as “Context-based Audio Enhancement” leading to the Context-based Audio Enhancement – Use Cases (MPAI-CAE) standard. Finally, with Compression and Understanding of Industrial Data (MPAI-CUI), MPAI demonstrated that data from so far unexplored domains like finance could benefit from standards.

Just one year after its establishment, MPAI could claim success by publishing its first three standards: MPAI-CUI, MPAI-GME, and MPAI-MMC and, by the end of 2021, another two: MPAI-AIF and MPAI-CAE.

Since its early days, MPAI was convinced that standards should have as much visibility as possible. For this reason, it established a successful cooperation with the Institute of Electric and Electronic Engineers (IEEE) – Standard Association (SA). Today, starting from three standards in 2022, nine MPAI standards have been adopted by IEEE without modifications and three more are in the pipeline.

The creation of MPAI Development Committees and Working Groups and their activity continued unrelenting. The use of watermarking and then fingerprinting to trace the use of neural networks let to the development of Neural Network Watermarking – Traceability (NNW-NNT). Connected Autonomous Vehicles was started in late 2020 and is now a standard with the name Connected Autonomous Vehicle – Technologies (CAV-TEC). MPAI was probably the first to engage in activities leading to a metaverse standard and now it can claim to have a solid candidate to lead the move to interoperable metaverses with MPAI Metaverse Model – Technologies (MMM-TEC). Since its early days, MPAI worked on online gaming, producing the Server-based Predictive multiplayer Gaming – Mitigation of Data Loss Effects (SPG-MDL) standard where a set of AI Modules predicts the game state of an online multiplayer game.

MPAI abhors the attitude of other standards bodies who develop unnecessarily “siloed” standards where technologies are treated exclusively from the point of view of the domain of that standard without considering similar technologies in other domains. Object and Scene Description (MPAI-OSD) and Portable Avatar Format (MPAI-PAF) do specify AI Workflows specific to their domains but their AI Modules and Data Types were specified for wide reuse in many other MPAI standards. This attitude is not confined to these two standards as the same can be said of MPAI-CAE and MPAI-MMC.

Atypical – but no less important – standards are AI Module Profiles (MPAI-PRF) establishing a machine-readable description to identify AI Module Profiles and Data Types, Formats, and Attributes (MPAI-TFA) providing a standard way to add information about data for processing by a machine.

Last comes a standard that embodies probably the very first activity – AI for video. AI-Enhanced Video Coding – Up-sampling Filter for Video applications (EVC-UFV) offers an AI super-resolution filter vastly superior to currently used filters.

Five years ago, MPAI was very bold in targeting standards for AI, then just a nice technology to talk about. In five years, however, AI is all over the place and much talked about. What will the future offer for MPAI?

Some answers are clear:

With its impressive portfolio of 15 standards, there will be much maintenance and enhancement work to do.
Two new standards are being developed and should be completed in a short time: AI for Health – Health Secure Platform and XR Venues – Live Theatrical Performance.
One project – End-to-End Video coding has still to go through the Call for Technologies phase
A Call for Technologies is open, and responses are expected: Neural Network Watermarking – Technologies.
A new Call for Technologies on Pursuing Goals in the metaverse is being prepared. This will require the development of a significant number of “behaviours” on top of a “baseline” Small Language Model.
Development of reference implementations to enhance the value and attractiveness of existing standards.

AI continues its lightning speed of development and MPAI will continue watching and identifying standardisation opportunities in different domains.

Long live MPAI!

No Comments InAll posts

Leonardo Chiariglione
2025-10-02

MPAI celebrates five years of pioneering AI standards

Geneva, Switzerland – 30^th September 2025. MPAI – Moving Picture, Audio and Data Coding by Artificial Intelligence – the international, non-profit, unaffiliated organisation developing AI-based data coding standards – has celebrated its fifth anniversary at its 60^th General Assembly (MPAI-60).

Established 5 years ago on 30 September 2020, MPAI has created the organisation, given itself rigorous procedures of work, developed 15 standards and two technical reports, obtained adoption of eight of its standards without modification by IEEE Standards Association, and is setting sights on next challenges targeting both extensions and new standards.

In line with its mission of AI-based data coding, MPAI standards cover execution of AI applications, audio enhancement, connected autonomous vehicles, finance, human and machine conversation, metaverse, objects and scenes, avatars, and many others.

MPAI-60 has approved final publication of new versions of existing standards:

and is publishing the following standards for Community Comments

MPAI is continuing the development of its work plan that involves the following activities:

AI Framework (MPAI-AIF): developing a new MPAI-AIF specification that facilitates the creation of new workflows using available AIMs.
AI for Health (MPAI-AIH): developing the specification of a system receiving and processing licenses AI Health Data and enabling clients to improve health processing models via federated learning.
Context-based Audio Enhancement (CAE-DC): developing the Audio Six Degrees of Freedom (CAE-6DF) and Audio Object Scene Rendering (CAE-AOR) specifications.
Connected Autonomous Vehicle (MPAI-CAV): investigating extensions of the current CAV-TEC specification.
Compression and Understanding of Industrial Data (MPAI-CUI): developing the Company Performance Prediction V2.0 specification.
End-to-End Video Coding (MPAI-EEV): exploring the potential of AI-based End-to-End Video coding.
AI-Enhanced Video Coding (MPAI-EVC): refining the Up-sampling Filter for Video applications (EVC-UFV) standard.
Governance of the MPAI Ecosystem (MPAI-GME): working on version 2.0 of the Specification.
Human and Machine Communication (MPAI-HMC): developing reference software and performance assessment.
Multimodal Conversation (MPAI-MMC): Developing the notion of Perceptive and Agentive AI (PAAI) capable of handling more complex questions.
MPAI Metaverse Model (MPAI-MMM): extending the capabilities of the MMM-TEC specs to support more applications.
Neural Network Watermarking (MPAI-NNW): Issuing a Call on Neural Network Traceability Technologies.
Object and Scene Description (MPAI-OSD): extending the capabilities of the MPAI-OSD V1.3 to support more applications.
Portable Avatar Format (MPAI-PAF): extending the capabilities of the MPAI-PAF V1.4 to support more applications.
AI Module Profiles (MPAI-PRF): extending the scope of the current version of AI Module Profiles.
Server-based Predictive Multiplayer Gaming (MPAI-SPG): exploring new standard opportunities in the domain.
Data Types, Formats, and Attributes (MPAI-TFA) extending the standard to data types used by MPAI standards (e.g., automotive, health, and metaverse).
XR Venues (MPAI-XRV): developing the standard for improved development and execution of Live Theatrical Performances.

Legal entities and representatives of academic departments supporting the MPAI mission and able to contribute to the development of standards for the efficient use of data can become MPAI members.

No Comments InAll posts

Leonardo Chiariglione
2025-08-21

Exploring the innovations of the MMM-TEC V2.1 standard

The MPAI Metaverse Model (MPAI-MMM) – Technologies (MMM-TEC) specification is based on an innovative approach. As in the real world (Universe) we have animate and inanimate things, in an MPAI Metaverse (M-Instance) we have Processes and Items. Processes can animate Items (things) in the metaverse but can also act as a bridge between metaverse and universe. For convenience, MMM-TEC defines four classes of Processes: Apps, Devices, Services, and Users.

Probably, the most interesting one is the User, defined as the “representative” of a human where representation means that the human is responsible for what their Users do in the metaverse. The representation function can be very strict because the human drives everything one of their User does or very loose because the User is a fully autonomous agent (still under the human’s responsibility). As the User is a Process, it cannot be “perceived” except from what it does, but it can render itself in a perceptible form, called Persona that may visually appear as a humanoid. A human can have more than one User, and a User can be rendered with more than one Persona.

Humans can do interesting things in the world, but what interesting things can they do in the metaverse? MMM-TEC answers this question by offering a range of 28 basic Actions, called Process Actions. An important one is Register. By Registering, a human gets the Rights to import (via the UM-Send Action) and deploy (via the Execute Action) Users and render (by e.g., MM-Adding) Personae. UM-Send means sending things from the universe to the metaverse and MM-Add means placing an Avatar and then possibly animating it (MM-Animate) with a stream or rendering it (MU-Actuate) in the universe.

Universe and metaverse are connected, but they should be mutually “protected”. One example of what this means is data from the universe cannot be simply imported into the metaverse, but is first captured (UM-Capture), then identified (Identify) – i.e., converted into an Item – and finallu acted upon, e.g., used to animate an avatar. Also, a User is not entitled to do just anything anywhere in the metaverse because its operation is governed by three basic notions: Rights, expressing the fact that a User (in general, a Process) may perform a certain Process Action; Rules, expressing the fact a Process may, may not, or must perform a Process Action; and P-Capabilities expressing that the Process can perform certain Process Actions.

What if a Process wants to perform a Process Action, has the Rights to perform it, and its performance complies with the Rules, but it cannot, i.e., it does not know how to perform it? MMM-TEC makes use of a philosophy of language notion called Speech Act that is expressed by an individual and contains both information and action. For instance, User MU-Actuates Persona At M-Location At U-Location With Spatial Attitude will mean that the User renders at U-Location in the universe with a certain Position and Orientation the Persona that is placed at an M-Location in the Metaverse. If the User can – i.e., it has the P-Capabilities to – MU-Actuate the Persona, for instance because it is connected to the universe via an appropriate device, and may, i.e., it has the Rights to MU-Actuate, and the planned Process Action complies with the Rules, then the Process Action is performed. However, if the User does not have the necessary P-Capabilities or does not have the Rights to MU-Actuate the Persona, it can ask an Import-Export Service to do this on its behalf. Possibly, the Service will request that a Transaction be made in order to perform the requested Process Action.

As a last point, we should describe how MMM-TEC represents Rights and Rules. MMM-TEC states that Rights are, in general, a collection of Process Actions that the Process can perform. Each of them is preceded by Internal, Acquired, or Granted to indicate if the Rights were obtained at the time of Registration, were acquired (e.g., by a Transaction), or are Granted (and then possibly withdrawn) by another Process. Similarly, Rules are expressed by Process Actions each of which is preceded by May, May not, or Must.

We could add many more details to give a complete description of the MMM-TEC potential. You can directly access the standard here, but now we want to address some of the innovations introduced by MMM-TEC V2.1.

The first is the set of new capabilities provided by the Property Change Process Action. We said that we can MM-Add a Persona and then MM-Animate it. But what if we are preparing a theatre performance and we do not want “to be seen” while rehearsing? Property Change can set the Perceptibility Status of an Item but can also change:

The properties of a visual Item in terms of its size, mass, material (i.e., to signal that the object is material or immaterial), gravity (is subject to gravity or not), and texture map.
The audio characteristics of an object: Reflectivity, Reverberation, Diffusion, and Absorption.
The properties of a light source: Type (Point, Directional, Spotlight, Area), colour, and intensity of the light source.
The properties of an audio source: Diffuseness, Directional Patterns, Shape, and Size.
The Personal Status (i.e., emotion) of an avatar.

Another important set of functionalities is provided by significant extensions of how a Process in the metaverse can affect the universe. MMM-TEC V2.1 allows a User to MU-Actuate at a U-Location an Item MM-Added at an M-Location. How can this Process Action be performed? We assume that the M-Instance is connected to a special Device that can perform the following in the universe:

Pick an existing object.
Drive a 3d printer that produces the analogue version of the Item.
Render a 2D or a 3D media object.

MMM-TEC V2.1 calls R-Item any physical object in the universe, including the object produced by a 3D printer and the 2D or 3D media object produced. It also defines the following additional Process Actions:

MU-Add an R-Item: to place an R-Item (a physical object) somewhere in the universe with a Spatial Attitude.
MU-Animate an R-Item: to animate, e.g., a robot, with a stream.
MU-Move an R-Item from a U-Location to another U-Location along a Trajectory.

MMM-TEC is rigorous in defining how Process Actions can be performed in an M-Instance, but what about the universe? Do we want Processes to perform actions in the universe in an uncontrolled way?

The answer is clear: the M-Instance does not control the Universe through some supernatural force but through Devices whose operation is conditional on the Rights and P-Capabilities held by the Device to perform the desired Process Actions in the universe. The Process Actions beginning with “MU-” include the Rights of a Device to act on the universe.

V2.1 adds several new use cases to the long list of V2.0. One of these is called “Emergency in Industrial Metaverse”:

An M-Location includes the Digital Twin of a real factory (R-Factory) where the regular operation is separated from emergency operation described by the use case.
An “emergency” User in the Digital Twin (V-Factory):
1. Has the Rights to actuate and animate an “emergency” robot in the R-Factory.
2. Can be rendered as a Persona having the appearance of the corresponding robot.
In case of an emergency, the User:
1. Activates an alarm in the R-Factory.
2. Actuates its “emergency” robot (Analogue Twin) in the R-Factory.
3. Animates the robot to solve the problem.
4. Renders its Persona so that humans can see what is happening in the R-Factory.
When the emergency is resolved, the robot is moved to its repository.

You are invited to register to attend the online presentation on 12 September at 15 UTC and provide your comments to the MPAI Secretariat by 2025/09/28 T23:59 UTC

No Comments InAll posts

Leonardo Chiariglione
2025-08-20

MPAI publishes MPAI Metaverse Model – Technologies V2.1 standard with extended functionalities

Geneva, Switzerland – 20^th August 2025. MPAI – Moving Picture, Audio and Data Coding by Artificial Intelligence – the international, non-profit, unaffiliated organisation developing AI-based data coding standards – has concluded its 59^th General Assembly (MPAI-59) approving the publication of the MPAI Metaverse Model – Technologies V2.1 with a request for Community Comments.

The earlier 2.0 Version of Technical Specification: MPAI Metaverse Model (MMM) – Technologies (MMM-TEC) already supported digital twinning of real-world environments and their blending with MMM-TEC-specified virtual environments. The new MMM-TEC V2.1 supports “analogue twinning” of virtual- with real-world environments opening attractive industrial metaverse applications. This is achieved by introducing new “Process Actions” (speech acts of an MMM-TEC process sent to another process) and the notion of R-Item (real object) that can be MU-Added (placed at a U-Location, a location in the real world), MU-Moved (moved from a U-Location to another U-Location along a Trajectory), and MU-Animated (animated) in sync with a Persona (the rendering of a Process as an avatar) in the metaverse.

Among the several other innovations included in MMM-TEC V2.1, we mention Change Property, a Process Action whereby a Process changes – if it holds the Rights – the place where and object is located; its properties such as perceptibility, size, mass, gravity, and texture; audio properties such as reflectivity, reverberation, diffusion and absorption; an audio or light source; and the emotional state of an avatar.

MPAI standards are best described as a web of interconnected specifications. The new technologies needed by MMM-TEC are partly specified by Object and Scene Descriptors (MPAI-OSD), Portable Avatar Format (MPAI-PAF), and Data Types, Formats and Attributes (MPAI-TFA). They are now at versions V1.4, V1.5, and V1.4, respectively.

The MMM-TEC1 V2.1 standard on 12 September at 15 UTC (link).
The MPAI-OSD V1.4 and MPAI-PAF V1.5 standards on 12 September at 10 UTC (link).
The MPAI-TFA V1.4 standard on Wednesday 17 September at 15 UTC (link)
The MPAI-GME V2.0 standard on Friday 26 September at 14 UTC (link).

MPAI is continuing the development of its work plan that involves the following activities:

AI Framework (MPAI-AIF): developing a new MPAI-AIF specification that facilitates the creation of new workflows using available AIMs.
AI for Health (MPAI-AIH): developing the specification of a system receiving and processing licenses AI Health Data and enabling clients to improve health processing models via federated learning.
Context-based Audio Enhancement (CAE-DC): developing the Audio Six Degrees of Freedom (CAE-6DF) and Audio Object Scene Rendering (CAE-AOR) specifications.
Connected Autonomous Vehicle (MPAI-CAV): investigating extensions of the current CAV-TEC specification.
Compression and Understanding of Industrial Data (MPAI-CUI): developing the Company Performance Prediction V2.0 specification.
End-to-End Video Coding (MPAI-EEV): exploring the potential of AI-based End-to-End Video coding.
AI-Enhanced Video Coding (MPAI-EVC): refining the Up-sampling Filter for Video applications (EVC-UFV) standard.
Governance of the MPAI Ecosystem (MPAI-GME): working on version 2.0 of the Specification.
Human and Machine Communication (MPAI-HMC): developing reference software and performance assessment.
Multimodal Conversation (MPAI-MMC): Developing the notion of Perceptive and Agentive AI (PAAI) capable of handling more complex questions.
MPAI Metaverse Model (MPAI-MMM): extending the capabilities of the MMM-TEC specs to support more applications.
Neural Network Watermarking (MPAI-NNW): Issuing a Call on Neural Network Traceability Technologies.
Object and Scene Description (MPAI-OSD): extending the capabilities of the MPAI-OSD V1.3 to support more applications.
Portable Avatar Format (MPAI-PAF): extending the capabilities of the MPAI-PAF V1.4 to support more applications.
AI Module Profiles (MPAI-PRF): extending the scope of the current version of AI Module Profiles.
Server-based Predictive Multiplayer Gaming (MPAI-SPG): exploring new standard opportunities in the domain.
Data Types, Formats, and Attributes (MPAI-TFA) extending the standard to data types used by MPAI standards (e.g., automotive, health, and metaverse).
XR Venues (MPAI-XRV): developing the standard for improved development and execution of Live Theatrical Performances.

Legal entities and representatives of academic departments supporting the MPAI mission and able to contribute to the development of standards for the efficient use of data can become MPAI members.

No Comments InAll posts

Cookie	Duration	Description
cookielawinfo-checkbox-necessary	1 year	This cookie is set by GDPR Cookie Consent plugin. The cookies is used to store the user consent for the cookies in the category "Technical".
CookieLawInfoConsent	1 year	The cookie is set by the GDPR Cookie Consent plug-in and is used to store whether the user has consented to the use of cookies or not. It does not store any personal data.
viewed_cookie_policy	1 year	The cookie is set by the GDPR Cookie Consent plugin and is used to store whether or not user has consented to the use of cookies. It does not store any personal data.
_pk_id.6.08a8	13 months	Used to store a few details about the user such as the unique visitor ID
_pk_ses.6.08a8	30 minutes	Short lived cookies used to temporarily store data for the visit

Archives: 2021-02-17

Notice