March 2023 - MPAI community

Leonardo Chiariglione
2025-12-03

Domain Access: The Specialist Brain Plug-in for the Autonomous User

While the Basic Knowledge module is a generalist language model that “knows a bit of everything”, Domain Access is the expert with a broad range of knowledge that enables the Autonomous User to tap into domain-specific intelligence for deeper understanding.

We have already presented the system diagram of the Autonomous User (A-User), an autonomous agent able to move and interact (walk, converse, do things, etc.) with another User in a metaverse. The latter User may be an A-User or be under the direct control of a human and is thus called a Human-User (H-User). The A-User acts as a “conversation partner in a metaverse interaction” with the User.

This is the sixth of a sequence of posts aiming to illustrate the architecture of an A-User and provide an easy entry point for those who wish to respond to the MPAI Call for Technology on Autonomous User Architecture. The first five dealt with 1) the Control performed by the A-User Control AI Module on the other components of the A-User; 2) how the A-User captures the external metaverse environment using the Context Capture AI Module; 3) how it listens, localises, and interprets sound not just as data, but as data having a spatially anchored meaning; 4) how it makes sense of what it sees by understanding objects’ geometry, relationships, and salience; and 5) how it takes raw sensory input and the User State and turns them into a well‑formed prompt that Basic Knowledge can actually understand and respond to.

The Basic Knowledge module is a generalist language model that “knows a bit of everything.” In contrast, Domain Access is the expert layer that enables the Autonomous User (A-User) to tap into domain-specific intelligence for deeper understanding of user utterances and their context.

How Domain Access Works

Receives Initial Response: Domain Access starts from the Initial Response, i.e., Basic Knowledge’s response to the prompt generated by Prompt Creation.
Converts to DA-Input: Because natural language is not the most convenient form for processing, the response is converted into a JSON object called DA-Input for structured processing.
Gets domain knowledge: Pulls in domain vocabulary such as jargon and technical terms.
Creates the next prompt using this specialised knowledge:
- Injects rules and constraints (e.g., standards, legal compliance).
- Adds reasoning patterns (e.g., diagnostic flows, contractual logic).

All enrichment happens in the JSON domain. The result is the DA-Prompt Plan, a domain-aware JSON structure that is converted into a natural language prompt, called DA-Prompt, and resubmitted to the knowledge/response pipeline.
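The steps above can be sketched in Python. This is only an illustration under stated assumptions, not the PGM-AUA specification: the field names (`text`, `vocabulary`, `rules`, `reasoning`), the contents of `COOKING_DOMAIN`, and all function names are hypothetical and chosen for this example.

```python
import json

# Hypothetical domain module: the contents are illustrative only.
COOKING_DOMAIN = {
    "vocabulary": {"saute": "fry quickly in a little hot fat"},
    "rules": ["Mention food-safety temperatures when meat is involved."],
    "reasoning": ["Check available ingredients before proposing a recipe."],
}

def to_da_input(initial_response: str) -> dict:
    """Wrap the natural language Initial Response in a JSON-style object."""
    return {"text": initial_response}

def enrich(da_input: dict, domain: dict) -> dict:
    """Build a DA-Prompt Plan: add vocabulary, rules and reasoning patterns."""
    words = da_input["text"].lower().split()
    plan = dict(da_input)
    # Keep only the domain terms that actually occur in the response.
    plan["vocabulary"] = {t: d for t, d in domain["vocabulary"].items() if t in words}
    plan["rules"] = list(domain["rules"])
    plan["reasoning"] = list(domain["reasoning"])
    return plan

def to_da_prompt(plan: dict) -> str:
    """Convert the DA-Prompt Plan back into natural language (the DA-Prompt)."""
    parts = [f"Earlier answer: {plan['text']}"]
    for term, definition in plan["vocabulary"].items():
        parts.append(f"Term '{term}' means: {definition}.")
    parts += [f"Rule: {r}" for r in plan["rules"]]
    parts += [f"Reasoning step: {s}" for s in plan["reasoning"]]
    return "\n".join(parts)

plan = enrich(to_da_input("You could saute the onions first"), COOKING_DOMAIN)
da_prompt = to_da_prompt(plan)
print(json.dumps(plan, indent=2))
print(da_prompt)
```

Because the domain knowledge lives in a separate dictionary, a different domain could be addressed by passing another dictionary to `enrich` without changing the rest of the code.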

Why Domain Access Matters

Without Domain Access, the A-User is like a clever intern: knowledgeable but lacking depth and experience. With Domain Access, it becomes an experienced professional that can:

Deliver accurate, context-aware answers.
Avoid hallucinations by grounding responses in domain rules.
Address different application domains by swapping or adding domain modules without rebuilding the entire A-User.

What you can take away about Domain Access

Get Initial Response from Basic Knowledge.
Convert to DA-Input (JSON).
Enrich with Domain Context:
- Pull in domain vocabulary.
- Inject rules and constraints.
- Add reasoning patterns.
Create DA-Prompt Plan (domain-aware structure).
Translate to DA-Prompt (natural language).
Query Basic Knowledge language model.
Strong points
- Deliver accurate, context-sensitive answers.
- Avoid hallucinations via domain grounding.
- Adapt across different domains by swapping modules.


Leonardo Chiariglione
2025-11-30

Prompt Creation: Where Words Meet Context

The Prompt Creation module is the storyteller and translator in the Autonomous User’s “brain”. It takes raw sensory input – audio and visual spatial data of Context (such as objects in a scene with their position, orientation and velocity) and the Entity State (rich description of the A‑User’s understanding of the “internal state” of the User) – and turns it into a well‑formed prompt that Basic Knowledge can actually understand and respond to.

This is the fifth of a sequence of posts aiming to illustrate in more depth the architecture of an A-User and provide an easy entry point for those who wish to respond to the MPAI Call for Technology on Autonomous User Architecture. The first four posts dealt with 1) how the A-User Control AI Module controls the other components of the A-User; 2) how the A-User captures the external metaverse environment using the Context Capture AI Module; 3) how it listens, localises, and interprets sound not just as data, but as data having a spatially anchored meaning; and 4) how it makes sense of what the Autonomous User sees by understanding objects’ geometry, relationships, and salience.

Prompt Creation is the storyteller and translator in the Autonomous User’s “brain.” It takes raw sensory input – audio and visual spatial data of Context and User State – and turns it into a well‑formed prompt that Basic Knowledge can sensibly understand and respond to.

The audio and visual components of Spatial Reasoning provide the information on things around the User such as “who’s in the room,” “what’s being said,” “what objects are present,” and “what’s the User doing”. Context Capture provides Entity State as a rich description of the A‑User’s understanding of the “internal state” of the User – which may be biologically real, if the User represents a human, or simulated, if the User represents an agent. The task of Prompt Creation is to synthesise these sources of information into a PC‑Prompt Plan. This plan starts from what the User said, adds intent (e.g., “User wants help” or “User is asking a question”), includes the context around the User (e.g., “User is in a virtual kitchen”), and embeds User State (e.g., “User seems confused”).

This information – conveniently represented as a JSON object – is converted into natural language and passed to Basic Knowledge, which produces a natural language response called the Initial Response – initial because there are more processing elements in the A‑User pipeline that will refine and improve the answer before it is rendered in the metaverse.

Prompt Creation gives the AI a sense of narrative, so the A-User can:

Ask the right clarifying question.
Respond with relevance to the situation.
Adapt to the environment and User mood.
Maintain continuity across interactions.

If the User says: “Can you help me cook?”

Spatial Reasoning notes the User is in a virtual kitchen with utensils and ingredients.
Entity State suggests the User looks uncertain.
Prompt Creation combines these into: “User is asking for cooking help, is in a kitchen, seems unsure.”

Basic Knowledge’s Initial Response to this prompt is then passed to Domain Access, which may elaborate a new prompt enriched with domain-specific information (in this case “cooking”, when Basic Knowledge is not well informed about cooking).
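The cooking example can be sketched as follows. This is a minimal illustration: the keys of the plan (`utterance`, `intent`, `context`, `user_state`) and the function names are assumptions made for this example, not names taken from the MPAI specification.

```python
import json

def build_pc_prompt_plan(utterance, intent, context, user_state):
    """Collect the four ingredients named in the text into one JSON-style plan."""
    return {
        "utterance": utterance,
        "intent": intent,
        "context": context,
        "user_state": user_state,
    }

def plan_to_prompt(plan):
    """Render the plan as a natural language prompt for Basic Knowledge."""
    return (
        f'User said: "{plan["utterance"]}". '
        f'Intent: {plan["intent"]}. '
        f'Context: {plan["context"]}. '
        f'User State: {plan["user_state"]}.'
    )

plan = build_pc_prompt_plan(
    utterance="Can you help me cook?",
    intent="User is asking for cooking help",
    context="User is in a virtual kitchen with utensils and ingredients",
    user_state="User seems unsure",
)
pc_prompt = plan_to_prompt(plan)
print(json.dumps(plan, indent=2))
print(pc_prompt)
```

Keeping the plan as a structured object until the last step means each ingredient can be inspected or replaced before the natural language prompt is produced.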

Prompt Creation turns raw multimodal input and spatial information into meaningful prompts so the AI can think, speak, and act with purpose. It is the scriptwriter that ensures the A‑User’s dialogue is not only coherent but also contextually aware, emotionally attuned, and situationally precise.

What you can take away about Prompt Creation

Translates user speech into Language Model understandable prompts
Synthesises spatial data and User State
Detects User intent (e.g., help request, question)
Embeds environmental context (e.g., virtual kitchen)
Captures emotional cues (e.g., confusion, excitement)
Builds a structured PC-Prompt Plan as a JSON object to facilitate prompt creation
Converts PC-Prompt Plan into a natural language prompt
Passes the prompt to Basic Knowledge for response generation
Bridges perception and cognition for purposeful Language Model action


Leonardo Chiariglione
2025-11-26

MPAI publishes AI-based video up-sampling filter standard with online demo of standard performance

Geneva, Switzerland – 26th November 2025. MPAI – Moving Picture, Audio and Data Coding by Artificial Intelligence – the international, non-profit, unaffiliated organisation developing AI-based data coding standards – has concluded its 62nd General Assembly (MPAI-62), publishing the Up-sampling Filter for Video applications standard.

Technical Specification: AI-Enhanced Video Coding (MPAI-EVC) – Up-sampling Filter for Video applications (EVC-UFV) V1.0 provides two standard methodologies 1) to design AI-based Super-resolution up-sampling filters for video applications and 2) to reduce the complexity of the designed filters without substantially affecting their performance. The parameters provided in EVC-UFV standard may be used to test the filter performance. Alternatively, an application can be used to submit an image and receive an up-sampled version of the image.

After publishing the Autonomous User Architecture Call for Technologies, MPAI has extended its Tentative Technical Specification: Pursuing Goal in metaverse (MPAI-PGM) – Autonomous User Architecture (PGM-AUA) originally attached to the Call. This addendum is a concrete example of the standard that MPAI seeks to develop with the PGM-AUA Call. Respondents to the Call are encouraged to read, comment on, change, or extend this document in their responses. Alternatively, they can submit their responses with a content unrelated to this document but relevant to the Call. The video recording of the online presentation of the Call is available. Deadline for submissions is 19 January 2026.

MPAI is continuing the development of its work plan that involves the following activities:

AI Framework (MPAI-AIF): extending the MPAI-AIF specification to enable a client to access a remote MPAI-AIF Controller and an AI Module to communicate data to another AIM with associated metadata.
AI for Health (AIH-HSP): developing the specification of a system receiving and processing licensed AI Health Data and enabling clients to improve health processing models via federated learning.
Context-based Audio Enhancement (CAE-USC): developing the Audio Six Degrees of Freedom (CAE-6DF) and the Audio Object Rendering (CAE-AOR) specifications.
Connected Autonomous Vehicle (CAV-TEC): developing a new version of the flagship specification CAV-TEC with security support.
Compression and Understanding of Industrial Data (CUI-CPP): developing the Company Performance Prediction V2.0 specification.
End-to-End Video Coding (MPAI-EEV): exploring the potential of AI-based End-to-End Video coding in compressing video sequences.
AI-Enhanced Video Coding (MPAI-EVC): exploring use of AI to enhance the video codec performance.
Governance of the MPAI Ecosystem (MPAI-GME): operating the MPAI Ecosystem per the MPAI-GME Specification.
Human and Machine Communication (MPAI-HMC): exploring the use of AI in human-to-machine and machine-to-machine communication.
Multimodal Conversation (MPAI-MMC): exploring the impact of the PGM-AUA Call for Technologies on human-to-machine and machine-to-machine
MPAI Metaverse Model (MMM-TEC): developing security-protected protocols in the MMM-TEC specification.
Neural Network Watermarking (NNW-TEC): developing the new Neural Network Watermarking (MPAI-NNW) – Technologies (NNW-TEC) including assessments of Neural Network Traceability Technologies.
Object and Scene Description (MPAI-OSD): discussing the impact of MPAI standards planned or under development on MPAI-OSD V1.4.
Portable Avatar Format (MPAI-PAF): discussing the impact of MPAI standards planned or under development on MPAI-PAF V1.5.
AI Module Profiles (MPAI-PRF): extending the scope of the current version of AI Module Profiles.
Server-based Predictive Multiplayer Gaming (MPAI-SPG): exploring new standard opportunities in the domain.
Data Types, Formats, and Attributes (MPAI-TFA): extending the standard to data types used by MPAI standards that are planned or under development.
XR Venues (XRV-LTP): developing the standard for improved execution of Live Theatrical Performances using AI.

Legal entities and representatives of academic departments supporting the MPAI mission and able to contribute to the development of standards for the efficient use of data can become MPAI members. New members joining before 31st December 2025 have their membership extended until 31st December 2026.

Please visit the MPAI website, contact the MPAI Secretariat for specific information, subscribe to the MPAI Newsletter and follow MPAI on social media: LinkedIn, Twitter, Facebook, Instagram, and YouTube.


Leonardo Chiariglione
2025-11-23

Visual Spatial Reasoning: The Vision Aware Interpreter

Autonomous User (A-User) is an autonomous agent able to move and interact (converse, etc.) with another User in a metaverse. It is a “conversation partner in a metaverse interaction” with the User, itself an A-User or an H-User directly controlled by a human. The figure shows a diagram of the A-User while the User generates audio-visual streams of information and possibly text as well.

This is the fourth of a sequence of posts aiming to illustrate the architecture of an A-User and provide an easy entry point for those who wish to respond to the MPAI Call for Technology on Autonomous User Architecture. The first three dealt with 1) the Control performed by the A-User Control AI Module on the other components of the A-User, 2) how the A-User captures the external metaverse environment using the Context Capture AI Module, and 3) how it listens, localises, and interprets sound not just as data, but as data having a meaning and a spatial anchor.

When the A-User acts in a metaverse space, sound doesn’t tell the whole story. The visual scene – objects, zones, gestures, occlusions – is the canvas where situational meaning unfolds. That’s where Visual Spatial Reasoning comes in: it’s the interpreter that makes sense of what the Autonomous User sees, not just what it hears.

Visual Spatial Reasoning can be considered as the visual analyst embedded in the “brain” of the Autonomous User. It doesn’t just list objects; it understands their geometry, relationships, and salience. A chair isn’t just “a chair” – it’s occupied, near a table, partially occluded, or the focus of attention. By enriching raw descriptors into structured semantics, Visual Spatial Reasoning transforms objects made of pixels into actionable targets.

This is what it does

Scene Structuring: Takes and organises raw visual descriptors into coherent spatial maps.
Semantic Enrichment: Adds meaning – classifying objects, inferring affordances, and ranking salience.
Directed Alignment: Filters and prioritises based on the A-User Controller’s intent, ensuring relevance.
Traceability: Every refinement step is auditable, to trace back why, “that object in the corner” became “the salient target for interaction.”
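The four functions listed above can be illustrated with a toy sketch. Everything here is an assumption made for illustration: the `VisualObject` fields, the `AFFORDANCES` table, and the salience formula are hypothetical and are not part of any MPAI specification.

```python
from dataclasses import dataclass, field

# Hypothetical affordance table; a real system would infer affordances.
AFFORDANCES = {"chair": "sit", "screen": "look", "door": "open"}

@dataclass
class VisualObject:
    label: str
    distance: float            # metres from the User (assumed unit)
    gazed_at: bool = False
    affordance: str = "none"
    salience: float = 0.0
    trace: list = field(default_factory=list)  # audit trail of refinement steps

def enrich(obj: VisualObject) -> VisualObject:
    """Semantic Enrichment: attach an affordance and a toy salience score."""
    obj.affordance = AFFORDANCES.get(obj.label, "none")
    obj.trace.append(f"affordance={obj.affordance}")
    # Toy salience: nearer objects and gazed-at objects score higher.
    obj.salience = 1.0 / (1.0 + obj.distance) + (1.0 if obj.gazed_at else 0.0)
    obj.trace.append(f"salience={obj.salience:.2f}")
    return obj

def align(objects, wanted_affordance):
    """Directed Alignment: keep objects matching the intent, most salient first."""
    kept = [o for o in objects if o.affordance == wanted_affordance]
    for o in kept:
        o.trace.append(f"kept for intent '{wanted_affordance}'")
    return sorted(kept, key=lambda o: o.salience, reverse=True)

scene = [
    VisualObject("chair", 1.0),
    VisualObject("chair", 3.0, gazed_at=True),
    VisualObject("door", 4.0),
]
targets = align([enrich(o) for o in scene], "sit")
for t in targets:
    print(t.label, t.distance, round(t.salience, 2), t.trace)
```

In this sketch the farther chair ranks first because the User is looking at it, and its `trace` list records each step that led to that ranking.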

Why It Matters

Without Visual Spatial Reasoning, the metaverse would be a flat stage of unprocessed visuals. With it, visual scenes become interpretable narratives. It’s the difference between “there are three objects in the room” and “the User is focused on the screen, while another entity gestures toward the door.”

Of course, Visual Spatial Reasoning does not replace vision. It bridges the gap between raw descriptors and effective interaction, ensuring that the A‑User can observe, interpret, and act with precision and intent.

If Audio Spatial Reasoning is the metaverse’s “sound‑aware interpreter,” then Visual Spatial Reasoning is its “sight‑aware analyst” that starts by seeing objects and eventually can understand their role, their relevance, and their story in the scene.


Leonardo Chiariglione
2025-11-19

Audio Spatial Reasoning: The Sound-Aware Interpreter

Autonomous User (A-User) is an autonomous agent able to move and interact (converse, etc.) with another User in a metaverse. It is a “conversation partner in a metaverse interaction” with the User, itself an A-User or an H-User directly controlled by a human. The figure shows a diagram of the A-User while the User generates audio-visual streams of information and possibly text as well.

This is the third of a sequence of posts aiming to illustrate in more depth the architecture of an A-User and provide an easy entry point for those who wish to respond to the MPAI Call for Technology on Autonomous User Architecture. The first two dealt with the Control performed by the A-User Control AI Module on the other components of the A-User and how the A-User captures the external metaverse environment using the Context Capture AI Module.

Audio Spatial Reasoning is the A-User’s acoustic intelligence module – the one that listens, localises, and interprets sound not just as data, but as data having a spatially anchored meaning. Its role is therefore not just “hearing” but also “understanding” where sound is coming from, how relevant it is, and what it implies in the context of the User’s intent in the environment.

When the A-User system receives a Context snapshot from Context Capture – including audio streams with a position and orientation and a description of the User’s emotional state (called User State) – Audio Spatial Reasoning starts an analysis of the directionality, proximity, and semantic importance of incoming sounds. The conclusion is something like “That voice is coming from the left, with a tone of urgency, and its orientation is directed at the A-User.”

All this is represented with an extension of the Audio Scene Descriptors describing:

Which audio sources are relevant
Where they are located in 3D space
How close or far they are
Whether they’re foreground (e.g., a question) or background (e.g., ambient chatter)
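The kind of information listed above can be sketched with a little geometry. The coordinate convention (x to the listener's right, y in front), the 2-metre `near_threshold`, and all field names are assumptions made for this example, not part of the Audio Scene Descriptors format.

```python
import math
from dataclasses import dataclass

@dataclass
class AudioSource:
    name: str
    x: float   # metres to the listener's right (assumed convention)
    y: float   # metres in front of the listener (assumed convention)
    is_speech_to_listener: bool

def describe(source: AudioSource, near_threshold: float = 2.0) -> dict:
    """Derive direction, distance, proximity and layer for one source."""
    distance = math.hypot(source.x, source.y)
    # Azimuth in degrees: 0 = straight ahead, negative = left, positive = right.
    azimuth = math.degrees(math.atan2(source.x, source.y))
    return {
        "name": source.name,
        "distance_m": round(distance, 2),
        "azimuth_deg": round(azimuth, 1),
        "proximity": "near" if distance <= near_threshold else "far",
        "layer": "foreground" if source.is_speech_to_listener else "background",
    }

sources = [
    AudioSource("voice", x=-1.0, y=1.0, is_speech_to_listener=True),
    AudioSource("chatter", x=3.0, y=4.0, is_speech_to_listener=False),
]
descriptors = [describe(s) for s in sources]
for d in descriptors:
    print(d)
```

Here the voice comes out as a near, foreground source 45 degrees to the left, while the chatter is a far, background source.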

This guide is sent to Prompt Creation and Domain Access. Let’s see what happens with the former. The extended Audio Scene Descriptors are fused with the User’s spoken or written input and the current User State. The result is a PC-Prompt – a rich query enriched with text expressing the multimodal information collected so far – that is passed to Basic Knowledge for reasoning.

The Audio Scene Descriptors are further processed and integrated with domain-specific information. The response is called Audio Spatial Directive that includes domain-specific logic, scene priors, and task constraints. For example, if the scene is a medical simulation, Domain Access might tell Audio Spatial Reasoning that “only sounds from authorised personnel should be considered”. This feedback helps Audio Spatial Reasoning refine its interpretation – filtering out irrelevant sounds, boosting priority for critical ones, and aligning its spatial model with the current domain expectations.
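The medical simulation example can be sketched as a filter over the descriptors. The structure of the directive (`allowed_speakers`, `critical_speakers`) and the `priority` field are hypothetical, chosen only to illustrate filtering and boosting; the actual Audio Spatial Directive format is not described in this post.

```python
def apply_directive(descriptors, directive):
    """Drop sources the directive excludes and boost those it marks as critical."""
    allowed = directive["allowed_speakers"]
    critical = directive.get("critical_speakers", set())
    refined = []
    for d in descriptors:
        if d["speaker"] not in allowed:
            continue  # filtered out as irrelevant for this domain
        boost = 1 if d["speaker"] in critical else 0
        refined.append({**d, "priority": d["priority"] + boost})
    # Highest priority first.
    return sorted(refined, key=lambda d: d["priority"], reverse=True)

descriptors = [
    {"speaker": "visitor", "priority": 2},
    {"speaker": "nurse", "priority": 1},
    {"speaker": "surgeon", "priority": 1},
]
directive = {
    "allowed_speakers": {"nurse", "surgeon"},   # "authorised personnel"
    "critical_speakers": {"surgeon"},
}
refined = apply_directive(descriptors, directive)
print(refined)
```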

We can therefore call Audio Spatial Reasoning the A-User’s auditory guide. It knows where sounds are coming from, what they mean, and how they should influence the A-User’s behaviour. The A-User responds to a sound with spatial awareness, contextual sensitivity, and domain consistency.

There are still about two months to the deadline of 2026/01/19, when responses to the Call must reach the MPAI Secretariat (secretariat@mpai.community) without exception.

To know more, register to attend the online presentations of the Call on 17 November at 9 UTC (https://tinyurl.com/y4antb8a) and 16 UTC (https://tinyurl.com/yc6wehdv) – Today or tomorrow depending on where you are.


Leonardo Chiariglione
2025-11-13

Context Capture: The A-User’s First Glimpse of the World

Autonomous User (A-User) is an autonomous agent able to move and interact (converse, etc.) with another User in a metaverse. It is a “conversation partner in a metaverse interaction” with the User, itself an A-User or an H-User directly controlled by a human. The figure shows a diagram of the A-User while the User generates audio-visual streams of information and possibly text as well.

The sequence of posts – of which this is the second – that illustrates in more depth the architecture of an A-User provides an easy entry point for those who wish to respond to the MPAI Call for Technology on Autonomous User Architecture. The first post dealt with the A-User Control, the AI-Module (AIM) that controls the other AIMs of the A-User and is possibly controlled by a human.

Context Capture is the A-User’s sensory front-end – the AIM that opens up perception by scanning the environment and assembling a structured snapshot of what’s out there in the moment. It is the first AI Module (AIM) in the loop providing the data and setting the stage for everything that follows. When A-User Control decides it’s time to engage, it prompts Context Capture to focus on a specific M-Location – the zone where the User is active, rendering its Avatar.

What Context Capture produces is called Context – a time-stamped, multimodal snapshot that represents the A-User’s initial understanding of the scene. But this isn’t just raw data. Context is composed of two key ingredients: Audio-Visual Scene Descriptors and User State.
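As a rough illustration, Context could be modelled as a time-stamped record holding the two ingredients. The class and field names below (`SceneDescriptor`, `attention`, `emotion`, `m_location`, and so on) are assumptions made for this sketch; the actual data formats are defined by the MPAI specifications, not by this code.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class SceneDescriptor:
    kind: str         # "visual" or "audio"
    label: str
    position: tuple   # (x, y, z) within the M-Location; units are assumed

@dataclass
class UserState:
    attention: str
    emotion: str

@dataclass
class Context:
    m_location: str
    descriptors: list
    user_state: UserState
    # The snapshot is stamped with the UTC time at which it is created.
    timestamp: datetime = field(default_factory=lambda: datetime.now(timezone.utc))

context = Context(
    m_location="kitchen-01",
    descriptors=[
        SceneDescriptor("visual", "chair", (1.0, 0.0, 2.0)),
        SceneDescriptor("audio", "voice", (0.0, 1.6, 1.0)),
    ],
    user_state=UserState(attention="focused", emotion="curious"),
)
print(context)
```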

The Audio-Visual Scene Descriptors are like a spatial sketch of the environment. They describe what’s visible and audible: objects, surfaces, lighting, motion, sound sources, and spatial layout. They provide the A-User with a sense of “what’s here” and “where things are.” But they’re not perfect. These descriptors are often shallow – they capture geometry and presence but not meaning. A chair might be detected as a rectangular mesh with four legs, but Context Capture doesn’t know if it’s meant to be sat on, moved, or ignored.

That’s where Spatial Reasoning comes in. Spatial Reasoning is the AIM that takes this raw spatial sketch and starts asking the deeper questions:

“Which object is the User referring to?”
“Is that sound coming from a relevant source?”
“Does this object afford interaction, or is it just background?”

It analyses the Context and produces two critical outputs:

Spatial Output: a refined map of spatial relationships, referent resolutions, and interaction constraints.
Spatial Guide: a set of cues that enrich the user’s input — highlighting which objects or sounds are relevant, how close they are, and how they might be used.

These outputs are sent downstream to Domain Access and Prompt Creation. The former refines the spatial understanding of the scene. The latter enriches the A-User’s query when it formulates the prompt to the Basic Knowledge (LLM).

Then there’s User State – a snapshot of the User’s cognitive, emotional, and attentional posture. Is the User focused, distracted, curious, frustrated? Context Capture reads facial expressions, gaze direction, posture, and vocal tone to infer a baseline state. But again, it’s just a starting point. User behaviour may be nuanced, and initial readings can be incomplete, noisy or ambiguous. That’s why User State Refinement exists – to track changes over time, infer deeper intent, and guide the alignment of the A-User’s expressive behaviour done by Personality Alignment.

In short, Context Capture is the A-User’s first glimpse of the world – a fast, structured perception layer that’s good enough to get started, but not good enough to finish the job. It’s the launchpad for deeper reasoning, richer modulation, and more expressive interaction. Without it, the A-User would be blind. With it, the system becomes situationally aware, emotionally attuned, and ready to reason – but only if the rest of the AIMs do their part.

Responses to the Call must reach the MPAI Secretariat (secretariat@mpai.community) by 2026/01/21.

To know more, register to attend the online presentations of the Call on 17 November at 9 UTC (https://tinyurl.com/y4antb8a) and 16 UTC (https://tinyurl.com/yc6wehdv).


Leonardo Chiariglione
2025-11-08

A-User Control: The Autonomous Agent’s Brain

We have already presented the system diagram of the Autonomous User (A-User), an autonomous agent able to move and interact (converse, etc.) with another User in a metaverse. The latter User may also be an A-User or may be under the direct control of a human and is thus called a Human-User (H-User). The A-User acts as a “conversation partner in a metaverse interaction” with the User.

This is the first of a planned sequence of posts having the goal to illustrate more in depth the architecture of an A-User and provide an easy entry point for those who wish to respond to the MPAI Call for Technology on Autonomous User Architecture.

A-User Control is the general commander of the A-User system, making sure the Avatar behaves like a coherent digital entity aware of the rights it can exercise in an instance of the MPAI Metaverse Model – Technologies (MMM-TEC) standard. The command is actuated by various signals exchanged with the AI-Modules (AIM) composing the Autonomous User.

At its core, A-User Control decides what the A-User should do, which AIM should do it, and how it should do it – all while respecting the Rights granted to the A-User and the Rules defined by the M-Instance. A-User Control either executes an Action directly or delegates it to another Process in the metaverse to carry it out.
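The decision described above can be sketched as a simple check. This is a toy model under stated assumptions: representing Rights as a set of Action names, Rules as a set of forbidden Actions, and the returned strings are all choices made for this example, not the way MMM-TEC represents Rights and Rules.

```python
def decide(action, rights, rules, can_execute_locally):
    """Return how A-User Control would handle a requested Action in this toy model."""
    if action not in rights:
        return "refused: no Right held"
    if action in rules["forbidden"]:
        return "refused: forbidden by M-Instance Rules"
    if action in can_execute_locally:
        return "execute directly"
    return "delegate to another Process"

rights = {"MM-Move", "MM-Add", "MM-Capture"}   # Rights granted to the A-User
rules = {"forbidden": {"MM-Add"}}              # Rules of this M-Instance
local = {"MM-Move"}                            # Actions it can perform itself

for action in ("MM-Move", "MM-Capture", "MM-Add", "MM-Delete"):
    print(action, "->", decide(action, rights, rules, local))
```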

A-User Control is not just about triggering actions. A-User Control also manages the operation of its AIMs, for instance A-User Rendering, which can turn text produced by the Basic Knowledge (LLM) and the Personal Status selected by Personality Alignment into a speaking and gesturing Avatar. A-User Control sends shaping commands to A-User Rendering, ensuring the Avatar’s behaviour aligns with metaverse-generated cues and contextual constraints.

A-User Control is not independent of human influence. The human, i.e., the A-User “owner”, can override, adjust, or steer its behaviour. This makes A-User Control a hybrid system: autonomous by design, but open to human modulation when needed.

The control begins when A-User Control triggers Context Capture to perceive the current M-Location – the spatial zone of the metaverse where the User is active. That snapshot, called Context, includes spatial descriptors and a readout of the human’s cognitive and emotional posture called User State. From there, the two Spatial Reasoning components – Audio and Visual – use Context to analyse the scene and send outputs to Domain Access and Prompt Creation, which enrich the User’s input and guide the A-User’s understanding.

As reasoning flows through Basic Knowledge, Domain Access, and User State Refinement, A-User Control ensures that every action, rendering, and modulation is aligned with the A-User’s operational logic.

In summary, the A-User Control is the executive function of the A-User: part orchestrator, part gatekeeper, part interpreter. It’s the reason the Avatar doesn’t just speak – it does so while being aware of the Context – both the spatial and User components – with purpose, permission, and precision.

Stay tuned for more introductions into the world of the Autonomous User Architecture.

Responses to the Call must reach the MPAI Secretariat (secretariat@mpai.community) by 2026/01/21.

To know more, register to attend the online presentations of the Call on 17 November at 9 UTC (https://tinyurl.com/y4antb8a) and 16 UTC (https://tinyurl.com/yc6wehdv).


Leonardo Chiariglione
2025-11-02

A new MPAI standard project for Autonomous Users in metaverse

The concept of virtual reality is now well established, with the metaverse concept as an important variant. Accordingly, MPAI has established a related standard, the MPAI Metaverse Model – Technologies (MMM-TEC) standard. However, standards for the contents of an MPAI metaverse instance (M-Instance) are still in progress. This document introduces the current status of these efforts and invites participation.

The contents include Processes representing entities with agency, called Users, and other entities lacking agency – essentially, various things populating an M-Instance – called Items.

Some Users represent humans. These may be directly operated by humans (and are called H-Users), or may have a high degree of operational autonomy (and are called A-Users, or informally, agents). Both types may be rendered as avatars called Personae.

The MMM-TEC standard specifies technologies enabling Users to perform various Actions on Items (things) in an M-Instance. For example, Users may sense data from the real world or may move Items in the M-Instance, possibly in combination with other Processes. However, MMM-TEC does not yet specify how an A-User decides to perform an Action.

Thus MPAI is developing a new standard covering such decisions: what does an A-User do when deciding to do something to achieve a Goal in an M-Instance? MPAI has assembled numerous relevant technologies, but more are needed. Therefore, the 61st MPAI General Assembly (MPAI-61) has published the Call for Technologies Pursuing Goals in metaverse (MPAI-PGM) – Autonomous User Architecture (AUA). The Call requests interested parties – irrespective of their membership in MPAI – to submit responses that may enable MPAI to develop a robust A-User Architecture standard attractive to implementers and users.

The planned standard’s scope is as follows: PGM-AUA will specify functions and interfaces by which an A-User interacts with another User, either an A-User or an H-User. (Again, the term “User” means “conversational partner in the metaverse”, whether autonomous or driven by a human.) A-Users can capture text and audio-visual information originated by, or surrounding, the User; extract the User State, i.e., snapshots of the User’s cognitive, emotional, and interactional states; produce an appropriate multimodal response, rendered as a speaking Avatar; and move appropriately in the relevant virtual space.

One possible way to model an A-User’s interactions with other Users might be to train a very powerful unitary Large Language Model, able to use spatial and media information. However, because such a model would be unwieldy and difficult to manage, MPAI instead assumes the use of a relatively simple Large Language Model with basic language and reasoning capabilities. Spatial, audio-visual, and User description information will be passed to and from this Basic Model in natural language.

To handle this integration, MPAI proposes the MPAI AI Framework (MPAI-AIF) standard. This standard provides the necessary infrastructure to define a foundation for an A-User to which the necessary technologies can be added. MPAI-AIF enables specification of an AI Workflow (AIW) composed of AI Modules (AIMs). In this case, these can jointly represent an A-User in a manner that is modular, i.e., able to swap or update modules independently from other modules; transparent, i.e., able to perform clear roles and expose well-defined interfaces; and extensible, i.e., able to add or replace specific competences as needed.
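The modularity property can be illustrated with a toy workflow in which one module is swapped without touching the others. This is not the MPAI-AIF API: the `Workflow` class, the modelling of an AIM as a function on a dictionary, and the module functions are all assumptions made for this sketch.

```python
from typing import Callable, Dict, List

# An AIM is modelled here as a plain function from a data dict to a data dict.
AIM = Callable[[dict], dict]

class Workflow:
    """Toy AI Workflow: runs named modules in the order they were first added."""

    def __init__(self) -> None:
        self.modules: Dict[str, AIM] = {}
        self.order: List[str] = []

    def add(self, name: str, aim: AIM) -> None:
        # Re-adding an existing name swaps the module without touching the others.
        if name not in self.modules:
            self.order.append(name)
        self.modules[name] = aim

    def run(self, data: dict) -> dict:
        for name in self.order:
            data = self.modules[name](data)
        return data

def prompt_creation(data: dict) -> dict:
    return {**data, "prompt": f"User said: {data['utterance']}"}

def basic_knowledge_v1(data: dict) -> dict:
    return {**data, "response": f"[v1] reply to '{data['prompt']}'"}

def basic_knowledge_v2(data: dict) -> dict:
    return {**data, "response": f"[v2] reply to '{data['prompt']}'"}

aiw = Workflow()
aiw.add("prompt_creation", prompt_creation)
aiw.add("basic_knowledge", basic_knowledge_v1)
first = aiw.run({"utterance": "hello"})

aiw.add("basic_knowledge", basic_knowledge_v2)  # swap one module only
second = aiw.run({"utterance": "hello"})
print(first["response"])
print(second["response"])
```

Because every module exposes the same interface, replacing `basic_knowledge` changes the response while the prompt produced upstream stays the same.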

The following figure represents a tentative diagram of the A-User architecture.

Figure 1 – The reference model of the Autonomous User Architecture

The model represents a largely autonomous A-User’s (“agent’s”) interactions with another User (A-User or H-User) at a given instant. It would thus be invoked repeatedly for extended interactions.

At a high level, we see an executive element (A-User Control), which can receive as input a human command or the response to some Action, and which delivers as output its status in response to the relevant command; any related action; and any request that it may itself deliver.

NOTE: While an A-User is defined as a relatively autonomous Process, a human may take over or modify its operation via the A-User Control.

More formally, the A-User Control AIM, the executive element, drives A-User operation by controlling how the A-User interacts with the environment and performs Actions and Process Actions based on the Rights it holds and the M-Instance Rules. It does so by:

Performing or requesting another Process to perform an Action.
Controlling the operation of AIMs, in particular A-User Rendering.

The responsible human may take over or modify the operation of the A-User Control by exercising Human Commands. Figure 2 summarises the input and output data of the A-User Control AIM.

Figure 2 – Simplified view of the Reference Model of A-User Control

A Human Command received from a human will generate a Human Command Status in response. A Process Action Request to a Process – that may include another User – will generate a Process Action Response. Various types of Commands (called Directives) to the Autonomous User AI Modules (AIMs) will generate responses (called Statuses). The Figure singles out the A-User Rendering Directives issued to the A-User Rendering AIM. This will generate a response typically including a Speaking Avatar that the A-User Control AIM will MM-Add or MM-Move in the metaverse. The complete Reference Model of A-User Control can be found here.
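The request and response pairings in the paragraph above can be summarised in a small lookup. This is only a sketch: the `Request` enum, the `RESPONSE_FOR` table, and the `expected_response` function are hypothetical names, and only the pairing labels come from the text.

```python
from enum import Enum


class Request(Enum):
    # Kinds of message handled by A-User Control, as named in the text.
    HUMAN_COMMAND = "Human Command"
    PROCESS_ACTION_REQUEST = "Process Action Request"
    DIRECTIVE = "Directive"


# Each kind of request is answered by a matching kind of response.
RESPONSE_FOR = {
    Request.HUMAN_COMMAND: "Human Command Status",
    Request.PROCESS_ACTION_REQUEST: "Process Action Response",
    Request.DIRECTIVE: "Status",
}


def expected_response(request: Request) -> str:
    """Return the kind of response that a given kind of request generates."""
    return RESPONSE_FOR[request]


rendering_reply = expected_response(Request.DIRECTIVE)
```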

The Context Capture AIM, prompted by the A-User Control, perceives a particular location of the M-Instance – called M-Location – where the User, i.e., the A-User’s conversation partner, has MM-Added its Avatar. In the metaverse, the A-User perceives by issuing an MM-Capture Process Action Request. The multimodal data captured is processed and the result is called Context – a time-stamped snapshot of the M-Location – composed of:

Audio and Visual Scene Descriptors describing the spatial content.
Entity State, describing the User’s cognitive, emotional, and attentional posture.

Thus, Context represents the initial A-User’s understanding of the User and the M-Location where it is embedded.
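One way to picture Context as a data structure is sketched below. The class and field names are assumptions chosen to mirror the description; the actual data types would be defined by the specification, not by this example.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import List


@dataclass
class SceneDescriptors:
    # Free-text stand-ins for the Audio and Visual Scene Descriptors.
    audio: List[str]
    visual: List[str]


@dataclass
class EntityState:
    # The User's cognitive, emotional, and attentional posture.
    cognitive: str
    emotional: str
    attentional: str


@dataclass
class Context:
    """Time-stamped snapshot of an M-Location (illustrative fields only)."""
    m_location: str
    scene: SceneDescriptors
    entity_state: EntityState
    timestamp: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc)
    )


ctx = Context(
    m_location="plaza-01",  # hypothetical M-Location identifier
    scene=SceneDescriptors(
        audio=["speech from the left"],
        visual=["avatar near the fountain"],
    ),
    entity_state=EntityState(
        cognitive="curious", emotional="calm", attentional="focused"
    ),
)
```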

The Spatial Reasoning AIM – composed of two AIMs, Audio Spatial Reasoning and Visual Spatial Reasoning – analyses Context and sends an enhanced version of the Audio and Visual Scene Descriptors, containing audio source relevance, directionality, and proximity (Audio) and object relevance, proximity, referent resolutions, and affordance (Visual), to:

The Domain Access AIM, to seek additional domain-specific information. Domain Access responds with further enhanced Audio and Visual Scene Descriptors.
The Prompt Creation AIM, which sends to Basic Knowledge, a basic LLM, the PC-Prompt integrating:
1. User Text and Entity State (from Context Capture).
2. Enhanced Audio and Visual Scene Descriptors (from Spatial Reasoning).

This is depicted in Figure 3.

Figure 3 – Basic Knowledge receives PC-Prompt from Prompt Creation
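Because all information reaches Basic Knowledge in natural language, Prompt Creation can be thought of as a function that folds its inputs into a single text. The sketch below assumes a hypothetical `create_pc_prompt` function and an invented prompt layout; the source does not prescribe a format for the PC-Prompt.

```python
from typing import Dict, List


def create_pc_prompt(
    user_text: str,
    entity_state: Dict[str, str],
    audio_descriptors: List[str],
    visual_descriptors: List[str],
) -> str:
    """Fold the Prompt Creation inputs into one natural-language prompt."""
    # Sort the Entity State keys so the prompt is deterministic.
    state = ", ".join(f"{k}: {v}" for k, v in sorted(entity_state.items()))
    lines = [
        f"User said: {user_text}",
        f"User state: {state}",
        "Audio scene: " + "; ".join(audio_descriptors),
        "Visual scene: " + "; ".join(visual_descriptors),
    ]
    return "\n".join(lines)


pc_prompt = create_pc_prompt(
    user_text="What is this?",
    entity_state={"emotional": "calm", "attentional": "focused"},
    audio_descriptors=["voice ahead, 2 m"],
    visual_descriptors=["red box on table", "door to the left"],
)
```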

The Initial Response to PC-Prompt is sent by Basic Knowledge to Domain Access, which:

Processes the Audio and Visual Scene Descriptors and the Initial Response by accessing domain-specific models, ontologies, or M-Instance services to retrieve:
1. Scene-specific object roles (e.g., “this is a surgical tool”)
2. Task-specific constraints (e.g., “only authorised Users may interact”)
3. Semantic affordances (e.g., “this object can be grasped”)
Produces and sends four flows:
1. Enhanced Audio and Visual Scene Descriptors to Spatial Reasoning to enhance its scene understanding.
2. User Context Guide to User State Refinement to enable it to update the User’s Entity State.
3. Personality Context Guide to Personality Alignment.
4. DA-Prompt, a new prompt to Basic Knowledge including initial reasoning and spatial semantics.

Figure 4 – Domain Access serves Spatial Reasoning, Basic Knowledge, User State Refinement, and Personality Alignment
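The four output flows of Domain Access can be sketched as a single function returning a record with four fields. Everything here is illustrative: the `SURGERY_DOMAIN` lookup table, the `DomainAccessOutput` record, and the wording of the two guides are assumptions, since the source does not define their content or format.

```python
from dataclasses import dataclass
from typing import Dict, List

# Hypothetical domain knowledge base: object label -> domain facts.
SURGERY_DOMAIN: Dict[str, Dict[str, str]] = {
    "scalpel": {
        "role": "surgical tool",
        "constraint": "only authorised Users may interact",
        "affordance": "can be grasped",
    }
}


@dataclass
class DomainAccessOutput:
    enhanced_descriptors: List[str]  # to Spatial Reasoning
    user_context_guide: str          # to User State Refinement
    personality_context_guide: str   # to Personality Alignment
    da_prompt: str                   # to Basic Knowledge


def domain_access(
    descriptors: List[str],
    initial_response: str,
    domain: Dict[str, Dict[str, str]],
) -> DomainAccessOutput:
    enhanced: List[str] = []
    constraints: List[str] = []
    for label in descriptors:
        facts = domain.get(label)
        if facts is None:
            enhanced.append(label)  # nothing known: pass through unchanged
            continue
        enhanced.append(f"{label} [{facts['role']}; {facts['affordance']}]")
        constraints.append(f"{label}: {facts['constraint']}")
    constraint_text = "; ".join(constraints) or "none"
    return DomainAccessOutput(
        enhanced_descriptors=enhanced,
        user_context_guide=f"Constraints relevant to the User: {constraint_text}",
        personality_context_guide=f"Constraints relevant to the A-User role: {constraint_text}",
        da_prompt=(
            f"Initial reasoning: {initial_response}\n"
            f"Spatial semantics: {', '.join(enhanced)}"
        ),
    )


out = domain_access(
    ["scalpel", "chair"], "The User points at a tool.", SURGERY_DOMAIN
)
```

A dictionary lookup stands in for the domain-specific models, ontologies, or M-Instance services mentioned above; a real implementation would query those instead.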

Basic Knowledge produces and sends an Enhanced Response to the User State Refinement AIM.

User State Refinement refines its understanding of the User State using the User Context Guide, then produces and sends:

UR-Prompt to Basic Knowledge.
Expressive State Guide to Personality Alignment providing A-User with the means to adopt a Personality that is congruent with the User’s Entity State.

Basic Knowledge produces and sends a Refined Response to Personality Alignment.

This is depicted in Figure 5.

Figure 5 – User State Refinement feeds Personality Alignment

Personality Alignment

Selects a Personality based on the Refined Response and the Expressive State Guide. The Personality conveys a variety of elements such as Expressivity (e.g., Tone, Tempo, Face, Gesture), Behavioural Traits (e.g., verbosity, humour, emotion), Type of role (e.g., assistant, mentor, negotiator, entertainer), etc.
Formulates and sends:
1. An A-User Entity State reflecting the Personality to A-User Rendering.
2. A PA-Prompt to Basic Knowledge reflecting the intended speech modulation, face and gesture, and synchronisation cues across modalities.

Basic Knowledge sends a Final Response that conveys semantic content, contextual integration, expressive framing, and personality coherence.

This is depicted in Figure 6.

Figure 6 – Personality Alignment feeds A-User Rendering
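Taken together, the steps so far amount to four successive exchanges with Basic Knowledge, each prompt being built from the previous response. The sketch below shows only that control flow. The `fake_basic_knowledge` and `fake_refine` functions are placeholders invented for the example; in the architecture, the refinement step is performed by Domain Access, User State Refinement, and Personality Alignment in turn.

```python
from typing import Callable, List, Tuple

# Prompt and response names for the four exchanges, as used in the text.
ROUNDS: List[Tuple[str, str]] = [
    ("PC-Prompt", "Initial Response"),
    ("DA-Prompt", "Enhanced Response"),
    ("UR-Prompt", "Refined Response"),
    ("PA-Prompt", "Final Response"),
]


def run_rounds(
    pc_prompt: str,
    basic_knowledge: Callable[[str], str],
    refine: Callable[[str, str], str],
) -> List[Tuple[str, str, str]]:
    """Run the four exchanges and return (prompt name, response name, response)."""
    trace: List[Tuple[str, str, str]] = []
    prompt = pc_prompt
    for i, (prompt_name, response_name) in enumerate(ROUNDS):
        response = basic_knowledge(prompt)
        trace.append((prompt_name, response_name, response))
        if i + 1 < len(ROUNDS):
            # The next AIM turns this response into the next prompt.
            prompt = refine(ROUNDS[i + 1][0], response)
    return trace


def fake_basic_knowledge(prompt: str) -> str:
    # Placeholder for the LLM call.
    return f"answer to [{prompt}]"


def fake_refine(next_prompt_name: str, response: str) -> str:
    # Placeholder for Domain Access, User State Refinement,
    # or Personality Alignment building the next prompt.
    return f"{next_prompt_name}<{response}>"


trace = run_rounds("hello", fake_basic_knowledge, fake_refine)
final_response = trace[-1][2]
```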

A-User Rendering uses the Final Response, the A-User Entity State, and the A-User Control Command from A-User Control to synthesise and shape a speaking Avatar contained in the A-User Control. This is depicted in Figure 7.

Figure 7 – The result of the Autonomous User processing is fed to A-User Control

Extended Call for Technologies

The complexity of the MMM-TEC model has prompted MPAI to extend its usual practice for Calls for Technologies. In addition to the usual Call for Technologies, Use Cases and Functional Requirements, Framework Licence, and Template for Responses, the Call also refers to a Tentative Technical Specification, a document drafted as if it were an actual Technical Specification. Respondents to the Call are free to comment on, change, or extend the Tentative Technical Specification or to make any other proposals judged relevant to the Call.

Anyone, irrespective of MPAI membership status, may respond to the Call. Responses shall reach the MPAI Secretariat by 2026/01/21T23:59.

Appropriate MPAI working groups will thoroughly review the Responses and retain those deemed appropriate for the future PGM-AUA standard. MPAI may select suitable technologies from those submitted in Responses, but is not obligated to select any proposal. Respondents will be encouraged to join MPAI. If Respondents whose Responses are accepted in full or in part do not join MPAI, MPAI will discontinue consideration of their proposed technologies.


Leonardo Chiariglione
2025-10-29

MPAI calls for technologies supporting metaverse-based Agentic AI

Geneva, Switzerland – 29th October 2025. MPAI – Moving Picture, Audio and Data Coding by Artificial Intelligence – the international, non-profit, unaffiliated organisation developing AI-based data coding standards – has concluded its 61st General Assembly (MPAI-61) approving the publication of a Call for Autonomous User Architecture Technologies.

With this Call for Technologies, formally “Pursuing Goals in metaverse (MPAI-PGM) – Autonomous User Architecture (PGM-AUA)”, MPAI is aiming at a standard enabling Autonomous Users to perform activities such as moving around and conversing with other Users. These are processes representing humans in a metaverse conforming with the MPAI Metaverse Model Technologies standard (MMM-TEC). They can either operate with a high degree of autonomy (A-Users) or be directly controlled by humans (H-Users).

PGM-AUA will rely on the friendly MMM-TEC environment and many relevant technologies already available in the 16 approved MPAI standards. However, the ambitious PGM-AUA goal requires many new technologies that the Call is designed to secure.

The text of the Call and the associated documents are available. Responses are due to the MPAI Secretariat by 2026/01/21T23:59.

MPAI-61 has also approved the new versions of standards previously posted for Community Comments:

MPAI is continuing the development of its work plan that involves the following activities:

AI Framework (MPAI-AIF): developing a new MPAI-AIF specification that facilitates the creation of new workflows using available AIMs.
AI for Health (MPAI-AIH): developing the specification of a system receiving and processing licensed AI Health Data and enabling clients to improve health processing models via federated learning.
Context-based Audio Enhancement (CAE-DC): developing the Audio Six Degrees of Freedom (CAE-6DF) and Audio Object Scene Rendering (CAE-AOR) specifications.
Connected Autonomous Vehicle (MPAI-CAV): investigating extensions of the current CAV-TEC specification.
Compression and Understanding of Industrial Data (MPAI-CUI): developing the Company Performance Prediction V2.0 specification.
End-to-End Video Coding (MPAI-EEV): exploring the potential of AI-based End-to-End Video coding.
AI-Enhanced Video Coding (MPAI-EVC): finalising the Up-sampling Filter for Video applications (EVC-UFV) standard.
Governance of the MPAI Ecosystem (MPAI-GME): operating the MPAI Ecosystem per the MPAI-GME Specification.
Human and Machine Communication (MPAI-HMC): developing reference software and performance assessment.
Multimodal Conversation (MPAI-MMC): discussing the conversational part of the PGM-AUA Call for Technologies.
MPAI Metaverse Model (MPAI-MMM): developing support for security in the MMM-TEC specs.
Neural Network Watermarking (MPAI-NNW): reviewing the responses to the Call on Neural Network Traceability Technologies.
Object and Scene Description (MPAI-OSD): discussing the spatial part of the PGM-AUA Call for Technologies.
Portable Avatar Format (MPAI-PAF): discussing the rendering part of the PGM-AUA Call for Technologies.
AI Module Profiles (MPAI-PRF): extending the scope of the current version of AI Module Profiles.
Server-based Predictive Multiplayer Gaming (MPAI-SPG): exploring new standard opportunities in the domain.
Data Types, Formats, and Attributes (MPAI-TFA): extending the standard to data types used by MPAI standards (e.g., automotive, health, and metaverse).
XR Venues (MPAI-XRV): developing the standard for improved development and execution of Live Theatrical Performances.

Legal entities and representatives of academic departments supporting the MPAI mission and able to contribute to the development of standards for the efficient use of data can become MPAI members.


Leonardo Chiariglione
2025-10-04

Celebrating the first five years of MPAI

When some organisations count their years of existence in decades or centuries, an organisation reaching just five years might seem to have little to celebrate. But there are years and years, and even days and days, as in the saying that it is better to live one day as a lion than a hundred years as a sheep.

The last five years were not the years of a sheep but the days of a lion.

We started with the idea of an organisation dedicated to standards for AI-based data coding because we thought that standards would bring benefits to a domain mostly alien to it. Not like some standards that look more like legal tools designed to oppress users but standards offering fair opportunities to all parties in the chain extending from innovators to end users.

An ambitious organisation like MPAI could not operate like four friends in a bar. The MPAI operation rules were developed and are now enshrined in the MPAI Patent Policy. The ambitions of MPAI were further enhanced by the definition of the MPAI Ecosystem extending from MPAI to implementers, integrators, and end users with the introduction of a new actor called MPAI Store, now incorporated in Scotland as a company limited by guarantee. There is a standard – Governance of the MPAI Ecosystem (MPAI-GME) setting the rules of operation of the Ecosystem.

The idea of a mission was there but what about implementing it? We acted as lions and posited that opaque monolithic AI should become component-based AI. Now a large share of our standards are based on the AI Framework (MPAI-AIF) standard, specifying an environment where AI Workflows composed of AI Modules can be initialised, dynamically configured, and controlled. MPAI-AIF also provided a stimulus to adoption of JSON Schema as a “language” to represent data types, AI Modules, and AI Workflows in MPAI standards. Today there is virtually no MPAI standard that does not use that language.

Having laid down the technical foundations, we started the buildings. One was designed to host the quite representative area of human and machine conversation extending beyond the “word” to cover other sometimes ethereal but information-carrying sensations and feelings. The standard called Multimodal Conversation (MPAI-MMC) is the first attempt at digitally representing this ethereal information with the Personal Status data type and Human-Machine Communication (MPAI-HMC) is an excellent example of its application.

Another investigation stream since the early MPAI days is audio sitting at the MPAI table as “Context-based Audio Enhancement” leading to the Context-based Audio Enhancement – Use Cases (MPAI-CAE) standard. Finally, with Compression and Understanding of Industrial Data (MPAI-CUI), MPAI demonstrated that data from so far unexplored domains like finance could benefit from standards.

Just one year after its establishment, MPAI could claim success by publishing its first three standards: MPAI-CUI, MPAI-GME, and MPAI-MMC and, by the end of 2021, another two: MPAI-AIF and MPAI-CAE.

Since its early days, MPAI was convinced that standards should have as much visibility as possible. For this reason, it established a successful cooperation with the Institute of Electrical and Electronics Engineers (IEEE) Standards Association (SA). Today, starting from three standards in 2022, nine MPAI standards have been adopted by IEEE without modifications and three more are in the pipeline.

The creation of MPAI Development Committees and Working Groups and their activity continued unrelenting. The use of watermarking and then fingerprinting to trace the use of neural networks led to the development of Neural Network Watermarking – Traceability (NNW-NNT). Connected Autonomous Vehicles was started in late 2020 and is now a standard with the name Connected Autonomous Vehicle – Technologies (CAV-TEC). MPAI was probably the first to engage in activities leading to a metaverse standard and now it can claim to have a solid candidate to lead the move to interoperable metaverses with MPAI Metaverse Model – Technologies (MMM-TEC). Since its early days, MPAI worked on online gaming, producing the Server-based Predictive Multiplayer Gaming – Mitigation of Data Loss Effects (SPG-MDL) standard where a set of AI Modules predicts the game state of an online multiplayer game.

MPAI abhors the attitude of other standards bodies who develop unnecessarily “siloed” standards where technologies are treated exclusively from the point of view of the domain of that standard without considering similar technologies in other domains. Object and Scene Description (MPAI-OSD) and Portable Avatar Format (MPAI-PAF) do specify AI Workflows specific to their domains but their AI Modules and Data Types were specified for wide reuse in many other MPAI standards. This attitude is not confined to these two standards as the same can be said of MPAI-CAE and MPAI-MMC.

Atypical – but no less important – standards are AI Module Profiles (MPAI-PRF) establishing a machine-readable description to identify AI Module Profiles and Data Types, Formats, and Attributes (MPAI-TFA) providing a standard way to add information about data for processing by a machine.

Last comes a standard that embodies probably the very first activity – AI for video. AI-Enhanced Video Coding – Up-sampling Filter for Video applications (EVC-UFV) offers an AI super-resolution filter vastly superior to currently used filters.

Five years ago, MPAI was very bold in targeting standards for AI, then just a nice technology to talk about. In five years, however, AI is all over the place and much talked about. What will the future offer for MPAI?

Some answers are clear:

With its impressive portfolio of 15 standards, there will be much maintenance and enhancement work to do.
Two new standards are being developed and should be completed in a short time: AI for Health – Health Secure Platform and XR Venues – Live Theatrical Performance.
One project – End-to-End Video Coding – has still to go through the Call for Technologies phase.
A Call for Technologies is open, and responses are expected: Neural Network Watermarking – Technologies.
A new Call for Technologies on Pursuing Goals in the metaverse is being prepared. This will require the development of a significant number of “behaviours” on top of a “baseline” Small Language Model.
Development of reference implementations to enhance the value and attractiveness of existing standards.

AI continues its lightning speed of development and MPAI will continue watching and identifying standardisation opportunities in different domains.

Long live MPAI!
