A new MPAI standard project for Autonomous Users in a metaverse
The concept of virtual reality is now well established, and the metaverse is an important variant of it. Accordingly, MPAI has developed a related standard, the MPAI Metaverse Model – Technologies (MMM-TEC) standard. However, standards for the contents of an MPAI metaverse instance (M-Instance) are still in progress. This document introduces the current status of these efforts and invites participation.
The contents include Processes representing entities with agency, called Users, and other entities lacking agency – essentially, various things populating an M-Instance – called Items.
Some Users represent humans. These may be directly operated by humans (and are called H-Users), or may have a high degree of operational autonomy (and are called A-Users, or informally, agents). Both types may be rendered as avatars called Personae.
The MMM-TEC standard specifies technologies enabling Users to perform various Actions on Items (things) in an M-Instance. For example, Users may sense data from the real world or may move Items in the M-Instance, possibly in combination with other Processes. However, MMM-TEC does not yet specify how an A-User decides to perform an Action.
Thus MPAI is developing a new standard covering such decisions: how does an A-User decide what to do to achieve a Goal in an M-Instance? MPAI has assembled numerous relevant technologies, but more are needed. Therefore, the 61st MPAI General Assembly (MPAI-61) has published the Call for Technologies Pursuing Goals in metaverse (MPAI-PGM) – Autonomous User Architecture (AUA). The Call requests interested parties – irrespective of their membership in MPAI – to submit responses that may enable MPAI to develop a robust A-User Architecture standard attractive to implementers and users.
The planned standard’s scope is as follows: PGM-AUA will specify functions and interfaces by which an A-User interacts with another User, either an A-User or an H-User. (Again, the term “User” means “conversational partner in the metaverse”, whether autonomous or driven by a human.) A-Users can capture text and audio-visual information originated by, or surrounding, the User; extract the User State, i.e., snapshots of the User’s cognitive, emotional, and interactional states; produce an appropriate multimodal response, rendered as a speaking Avatar; and move appropriately in the relevant virtual space.
One possible way to model an A-User’s interactions with other Users might be to train a very powerful unitary Large Language Model, able to use spatial and media information. However, because such a model would be unwieldy and difficult to manage, MPAI instead assumes the use of a relatively simple Large Language Model with basic language and reasoning capabilities. Spatial, audio-visual, and User description information will be passed to and from this Basic Model in natural language.
To handle this integration, MPAI proposes its MPAI AI Framework (MPAI-AIF) standard, which provides the infrastructure needed to define a foundation for an A-User to which the required technologies can be added. MPAI-AIF enables specification of an AI Workflow (AIW) composed of AI Modules (AIMs). In this case, these can jointly represent an A-User in a manner that is modular, i.e., able to swap or update modules independently of other modules; transparent, i.e., able to perform clear roles and expose well-defined interfaces; and extensible, i.e., able to add or replace specific competences as needed.
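To make the AIW/AIM idea concrete, here is a minimal, non-normative sketch in Python of a workflow chaining swappable modules. The module names, interface, and data passed between modules are assumptions made for illustration; MPAI-AIF defines AIW and AIM interfaces at a more abstract level.

```python
# Minimal sketch of an AI Workflow (AIW) chaining AI Modules (AIMs).
# The Python interface and module behaviour are illustrative assumptions,
# not part of MPAI-AIF.
from typing import Protocol


class AIM(Protocol):
    """An AI Module: a component with a clear role and a well-defined interface."""
    def process(self, data: dict) -> dict: ...


class ContextCapture:
    def process(self, data: dict) -> dict:
        # Would capture audio-visual scene descriptors and User State here.
        return {**data, "context": {"scene": "stub", "user_state": "stub"}}


class PromptCreation:
    def process(self, data: dict) -> dict:
        # Would turn the captured Context into a natural-language PC-Prompt.
        return {**data, "pc_prompt": f"Describe a response for {data['context']}"}


class AIW:
    """An AI Workflow: an ordered, swappable chain of AIMs."""
    def __init__(self, modules: list[AIM]):
        self.modules = modules

    def run(self, data: dict) -> dict:
        for module in self.modules:  # each AIM performs one clear role
            data = module.process(data)
        return data


# Swapping or extending competences means editing this list only.
a_user = AIW([ContextCapture(), PromptCreation()])
print(a_user.run({"input": "Hello"}))
```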
The following figure represents a tentative diagram of the A-User architecture.

The model represents a largely autonomous A-User’s (“agent’s”) interactions with another User (A-User or H-User) at a given instant. It would thus be invoked repeatedly for extended interactions.
At a high level, we see an executive element (A-User Control), which can receive as input a human command or the response to some Action, and which delivers as output its status in response to the relevant command, any related Action, and any request that it may itself issue.
NOTE: While an A-User is defined as a relatively autonomous Process, a human may take over or modify its operation via the A-User Control.
More formally, the executive A-User Control AIM (AI Module) drives the A-User's operation, controlling how it interacts with the environment and performs Actions and Process Actions, by (a minimal sketch follows the list):
- Performing an Action or requesting another Process to perform one.
- Controlling the operation of AIMs, in particular A-User Rendering.
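The sketch below illustrates this control behaviour under the assumption that human commands and Action responses arrive as simple messages; the message shapes and field names are invented for illustration, not taken from the Tentative Technical Specification.

```python
# Illustrative control step for the A-User Control AIM: it consumes a human
# command or an Action response, reports a status, and may perform or request
# an Action itself. Names and structures are assumptions for this sketch.
from dataclasses import dataclass
from typing import Optional


@dataclass
class ControlInput:
    kind: str                 # "human_command" or "action_response"
    payload: str


@dataclass
class ControlOutput:
    status: str               # status reported for the relevant command
    action: Optional[str]     # an Action performed or requested, if any
    request: Optional[str]    # a request the Control itself issues, if any


def a_user_control_step(msg: ControlInput) -> ControlOutput:
    if msg.kind == "human_command":
        # A human may take over or modify operation via A-User Control.
        return ControlOutput(status="accepted", action=None,
                             request=f"render: {msg.payload}")
    # Otherwise react to the outcome of a previously requested Action.
    return ControlOutput(status="action completed", action=msg.payload,
                         request=None)


print(a_user_control_step(ControlInput("human_command", "greet the User")))
```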
Here is a full accounting of the inputs and outputs, separated from the full diagram to remove distractions:

This input-output diagram summarizes in two lines the Responses from, and Commands to, the A-User's six current AIMs (AI Modules), which jointly enable the A-User's actions. They cover perception and reasoning about what is perceived; knowledge of, and processing related to, the current domain, e.g., surgery or a particular game; and composition of an appropriate response, including the A-User's simulated emotion, cognitive state, and social attitudes, alignment with the agent's simulated personality, and rendering of the visual and audio aspects of the resulting response. Most of these modules can consult the A-User's Large Language Model, the Basic Knowledge LLM.
Keeping this overview in mind, we can survey the individual modules in greater detail.
Context Capture
The Context Capture AIM, prompted by the A-User Control, perceives a particular location of the M-Instance – called an M-Location – where the A-User's conversation partner, here designated as the User, is rendering its Avatar. The result of the capture is called the Context, a time-stamped, structured snapshot representing the A-User's initial understanding of the M-Location (a minimal data-structure sketch follows the list). The Context is composed of:
- Audio-Visual Scene Descriptors describing the spatial content.
- User State, describing the User’s cognitive, emotional, and attentional posture within the environment.
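The following sketch suggests one possible shape for such a Context. All field names are assumptions made for illustration; the normative data types are those defined in MMM-TEC and the Tentative Technical Specification.

```python
# Illustrative shape of a Context: a time-stamped snapshot pairing
# Audio-Visual Scene Descriptors with the observed User State.
# Field names are assumptions, not normative definitions.
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class AVSceneDescriptors:
    audio_objects: list[str]   # e.g., ["speech from User", "ambient music"]
    visual_objects: list[str]  # e.g., ["User Avatar", "table", "scalpel"]


@dataclass
class UserState:
    cognitive: str             # e.g., "focused on the task"
    emotional: str             # e.g., "mildly anxious"
    attentional: str           # e.g., "looking at the A-User"


@dataclass
class Context:
    m_location: str
    timestamp: str = field(default_factory=lambda:
                           datetime.now(timezone.utc).isoformat())
    scene: AVSceneDescriptors = field(default_factory=lambda:
                                      AVSceneDescriptors([], []))
    user_state: UserState = field(default_factory=lambda:
                                  UserState("", "", ""))


ctx = Context(m_location="operating-theatre-01")
print(ctx.timestamp, ctx.user_state)
```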
Spatial Reasoning
The Spatial Reasoning AIM analyses the Context and:
- Sends Audio and Visual Spatial Output, i.e., spatial relationships, referent resolutions, and interaction constraints, to the Domain Access AIM seeking additional domain-specific information.
- Sends Audio and Visual Spatial Guides, i.e., audio source relevance, directionality, and proximity (Audio) and object relevance, orientation, proximity, and affordance (Visual), to the Prompt Creation AIM. The objective is to enrich the spoken or written input received by the A-User with this additional spatial information before the prompt created by Prompt Creation (the PC-Prompt) is sent to Basic Knowledge, a basic LLM. The PC-Prompt includes:
  - User Text and User State (from Context Capture).
  - Audio and Visual Spatial Guide (from Spatial Reasoning).
Spatial Reasoning is specified as two separate AIMs, one for the Audio and the other for the Visual component. 3D Graphics inputs are also handled by the Visual component.
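Because all spatial and User information reaches the Basic Knowledge LLM as natural language, Prompt Creation essentially serialises these inputs into text. The sketch below shows one way this could look; the prompt wording and field names are invented for illustration and are not prescribed by PGM-AUA.

```python
# Illustrative assembly of a PC-Prompt from User Text, User State, and the
# Audio/Visual Spatial Guides. The wording is an assumption; PGM-AUA only
# requires that this information be conveyed in natural language.
def build_pc_prompt(user_text: str, user_state: dict,
                    audio_guide: str, visual_guide: str) -> str:
    return (
        f"The User said: \"{user_text}\".\n"
        f"User State: cognitive={user_state['cognitive']}, "
        f"emotional={user_state['emotional']}, "
        f"attentional={user_state['attentional']}.\n"
        f"Audio Spatial Guide: {audio_guide}\n"
        f"Visual Spatial Guide: {visual_guide}\n"
        "Propose an appropriate response for the A-User."
    )


print(build_pc_prompt(
    "Can you hand me that?",
    {"cognitive": "focused", "emotional": "calm", "attentional": "gazing left"},
    "User's speech comes from 1 m to the left",
    "a graspable tool lies on the table near the User"))
```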

Domain Access
The Basic Knowledge LLM sends to the Domain Access module an Initial Response containing a direct response to the PC-Prompt, along with general reasoning based on foundational LLM capabilities.
The Domain Access module then (see the sketch after this list):
- Processes and responds to two flows:
  - Spatial Output (audio and visual, from Spatial Reasoning):
    - Accesses domain-specific models, ontologies, or M-Instance services.
    - Returns an Audio and Visual Spatial Directive to inject contextual background, scene-specific logic, and task relevance into the reasoning loop of Spatial Reasoning, improving the fidelity of its spatial interpretation.
  - Initial Response (from the Basic Knowledge LLM):
    - Accesses domain-specific models, ontologies, or M-Instance services to retrieve:
      - Scene-specific object roles (e.g., “this is a surgical tool”)
      - Task-specific constraints (e.g., “only authorised Users may interact”)
      - Semantic affordances (e.g., “this object can be grasped”)
    - Returns to the Basic Knowledge LLM a DA-Prompt that includes initial reasoning, spatial semantics, domain overlays, and User/task constraints.
- Performs further interchanges:
  - Sends a Refined Context Guide to User State Refinement, a structured object with:
    - Updated User descriptors
    - Scene salience and relevance
    - Interaction history and inferred goals
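The sketch below suggests how Domain Access might combine the Initial Response with domain lookups to build a DA-Prompt and a Refined Context Guide. The in-memory lookup table stands in for real domain ontologies or M-Instance services, and the prompt wording and field names are assumptions made for illustration.

```python
# Illustrative Domain Access step: enrich the Initial Response with
# scene-specific roles, constraints, and affordances, then return a DA-Prompt
# for the Basic Knowledge LLM and a Refined Context Guide for User State
# Refinement. The dictionary below is a stand-in for real domain services.
DOMAIN_ONTOLOGY = {
    "scalpel": {"role": "surgical tool",
                "constraint": "only authorised Users may interact",
                "affordance": "can be grasped"},
}


def domain_access(initial_response: str, visible_objects: list[str]) -> dict:
    overlays = [f"{name}: {info['role']}; {info['constraint']}; {info['affordance']}"
                for name, info in DOMAIN_ONTOLOGY.items()
                if name in visible_objects]
    da_prompt = (f"Initial reasoning: {initial_response}\n"
                 f"Domain overlays: {'; '.join(overlays) or 'none'}\n"
                 "Refine the response taking these constraints into account.")
    refined_context_guide = {
        "updated_user_descriptors": {"role": "trainee"},       # assumed example
        "scene_salience": overlays,
        "inferred_goals": ["complete the current task step"],  # assumed example
    }
    return {"da_prompt": da_prompt, "refined_context_guide": refined_context_guide}


print(domain_access("The User asks for the nearby tool.", ["scalpel", "table"]))
```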
User State Refinement
The User State Refinement AIM uses the Refined Context Guide from the Domain Access module to update the User State and to generate a UR-Prompt for the Basic Knowledge LLM, reflecting the refined understanding.
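A corresponding sketch for User State Refinement, again with invented field names and prompt wording:

```python
# Illustrative User State Refinement: merge the Refined Context Guide into the
# current User State and express the refined understanding as a UR-Prompt.
def refine_user_state(user_state: dict,
                      refined_context_guide: dict) -> tuple[dict, str]:
    updated = {**user_state,
               **refined_context_guide.get("updated_user_descriptors", {})}
    ur_prompt = (f"Refined User State: {updated}. "
                 f"Inferred goals: {refined_context_guide.get('inferred_goals', [])}. "
                 "Adjust the response accordingly.")
    return updated, ur_prompt


state, prompt = refine_user_state(
    {"cognitive": "focused", "emotional": "calm"},
    {"updated_user_descriptors": {"role": "trainee"},
     "inferred_goals": ["complete the current task step"]})
print(prompt)
```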

Basic Knowledge then produces a Refined Response and sends it to Personality Alignment.

Personality Alignment
- Receives (1) a Personality-based Refined Response and (2) an Expressive State Guide, both conveying a variety of elements such as:
  - Expressivity, e.g.:
    - Tone, e.g., formal, casual, empathetic, assertive
    - Tempo, e.g., fast, slow, rhythmic
    - Gesture style, e.g., expansive, restrained, animated
    - Facial dynamics, e.g., smile frequency, gaze behaviour, eyebrow movement
    - Etc.
  - Behavioural Traits, e.g.:
    - Verbosity level
    - Use of metaphors or humour
    - Degree of emotional expressiveness
    - Type of role: assistant, mentor, negotiator, entertainer, etc.
- Formulates and sends:
  - An A-User Personal Status (using the MPAI-specified Personal Status structure, containing specification of emotion, cognitive state, and social attitudes) reflecting the Personality to A-User Rendering.
  - A PA-Prompt to Basic Knowledge reflecting:
    - Speech modulation instructions (e.g., pitch, emphasis)
    - Facial expression timing and intensity
    - Gesture choreography
    - Synchronisation cues across modalities.
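A sketch of this Personality Alignment step follows. The mapping from the Expressive State Guide to a Personal Status, and the PA-Prompt wording, are assumptions made for illustration; the normative Personal Status structure is the one specified by MPAI.

```python
# Illustrative Personality Alignment step: derive an A-User Personal Status
# (emotion, cognitive state, social attitude) and a PA-Prompt from a Refined
# Response and an Expressive State Guide. Field names and mapping are assumed.
def personality_alignment(refined_response: str, guide: dict) -> dict:
    personal_status = {
        "emotion": "warm" if guide.get("tone") == "empathetic" else "neutral",
        "cognitive_state": "attentive",
        "social_attitude": guide.get("role", "assistant"),
    }
    pa_prompt = (
        f"Rephrase the response below in a {guide.get('tone', 'neutral')} tone, "
        f"with {guide.get('verbosity', 'moderate')} verbosity and "
        f"{guide.get('gesture_style', 'restrained')} gestures.\n"
        f"Response: {refined_response}")
    return {"a_user_personal_status": personal_status, "pa_prompt": pa_prompt}


print(personality_alignment(
    "The tool you need is on the table to your left.",
    {"tone": "empathetic", "verbosity": "concise",
     "gesture_style": "restrained", "role": "mentor"}))
```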
A-User Rendering
The Basic Knowledge LLM sends a Final Response to the A-User Rendering module that conveys semantic content, contextual integration, expressive framing, and personality consistency.

As seen below, the A-User Rendering module (1) uses the Final Response and A-User Personal Status to synthesise a speaking Avatar and (2) employs an A-User Control Command from A-User Control to refine the speaking Avatar.

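As a final, non-normative sketch, the step could be modelled as combining the Final Response, the A-User Personal Status, and any A-User Control Command into rendering parameters for the speaking Avatar. Parameter names are assumptions, and actual avatar synthesis (speech, face, gesture) is outside the scope of the sketch.

```python
# Illustrative A-User Rendering step: merge the Final Response, the A-User
# Personal Status, and an optional A-User Control Command into rendering
# parameters for a speaking Avatar. All parameter names are assumed.
def render_a_user(final_response: str, personal_status: dict,
                  control_command: str = "") -> dict:
    render_params = {
        "speech_text": final_response,
        "voice_emotion": personal_status.get("emotion", "neutral"),
        "facial_expression": personal_status.get("emotion", "neutral"),
        "gesture_attitude": personal_status.get("social_attitude", "assistant"),
    }
    if control_command:  # A-User Control may refine the rendering
        render_params["override"] = control_command
    return render_params


print(render_a_user(
    "The tool you need is on the table to your left.",
    {"emotion": "warm", "cognitive_state": "attentive",
     "social_attitude": "mentor"},
    control_command="lower speech volume"))
```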
Extended Call for Technologies
The complexity of the MMM-TEC model has prompted MPAI to extend its usual practice for Calls for Technologies. In addition to the usual Call for Technologies, Use Cases and Functional Requirements, Framework Licence, and Template for Responses, the Call also refers to a Tentative Technical Specification, a document drafted as if it were an actual Technical Specification. Respondents to the Call are free to comment on, change, or extend the Tentative Technical Specification or to make any other proposals judged relevant to the Call.
Anyone, irrespective of MPAI membership status, may respond to the Call. Responses shall reach the MPAI Secretariat by 2026/01/21T23:59.
The relevant MPAI working groups will thoroughly review the Responses and retain those deemed suitable for the future PGM-AUA standard. MPAI may select suitable technologies from those submitted in Responses but is not obligated to select any proposal. Respondents will be encouraged to join MPAI. If Respondents whose Responses are accepted in full or in part do not join MPAI, MPAI will discontinue consideration of their proposed technologies.
MPAI is organizing two online presentations with similar content on 2025/11/17. To attend, register:
