Moving Picture, Audio and Data Coding
by Artificial Intelligence


A-User Formation: Building the A-User

If Personality Alignment gives the A-User its style, A-User Formation AIM gives the A-User its body and its voice, the avatar and the speech for the A-User Control to embed in the metaverse. The A-User stops being an abstract brain controlling various types of processing and becomes a visible, interactive entity. It’s not just about projecting a face on a bot; it’s about creating a coherent representation that matches the personality, the context, and the expressive cues.

We have already presented the system diagram of the Autonomous User (A-User), an autonomous agent able to move and interact (walk, converse, do things, etc.) with another User in a metaverse. The latter User may be an A-User or be under the direct control of a human and is thus called a Human-User (H-User). The A-User acts as a “conversation partner in a metaverse interaction” with the User.

This is the tenth and last of a sequence of posts aiming to illustrate more in depth the architecture of an A-User and provide an easy entry point for those who wish to respond to the MPAI Call for Technology on Autonomous User Architecture. The first nine dealt with 1) the Control performed by the A-User Control AI Module on the other components of the A-User; 2) how the A-User captures the external metaverse environment using the Context Capture AI Module; 3) listens, localises, and interprets sound not just as data, but as data having a spatially anchored meaning; 4) makes sense of what the Autonomous User sees by understanding objects’ geometry, relationships, and salience; 5) takes raw sensory input and the User State and turns them into a well‑formed prompt that Basic Knowledge can actually understand and respond to; 6) taps into domain-specific intelligence for deeper understanding of user utterances and operational context; 7) the core language model of the Autonomous User – the “knows-a-bit-of-everything” brain, the first responder in a sequence of four prompts; 8) converting a “blurry photo” of the User in the environment taken at the onset of the process into a focused picture; and 9) providing not only a generic bot but a character with intent, tone, and flair – not only a matter of what the avatar utters but how its words land, how the avatar moves, and how the whole interaction feels.

A-User Formation AIM gives the A-User a body and a voice, the result of a chain of AI Modules composing the A-User pipeline that enables a perceptible and coherent representation matching the personality, the context, and the expressive cues.

The inputs driving A-User Formation are:

  • A-User Entity Status: The personality blueprint from Personality Alignment (tone, gestures, behavioural traits).
  • Final Response: personality-tuned content from Basic Knowledge – what the avatar will utter.
  • A-User Control Command: Directives for rendering and positioning in the metaverse (e.g., MM-Add, MM-Move).
  • Rendering Parameters: Synchronisation cues for speech, facial expressions, and gestures.

What comes out of the box: Speaking Avatar and Formation Status

  • A multimodal representation of the A-User (Speaking Avatar) that talks, moves, and reacts in sync with the A-User’s intent – the best expression the A-User can give of itself in the circumstances.
  • Structured report on the processing that led to the result.
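
To make the data flow concrete, here is a minimal Python sketch of how an A-User Formation step might assemble its two outputs from the four inputs listed above. All class, function, and field names (SpeakingAvatar, form_a_user, sync_cues, etc.) are illustrative assumptions, not normative PGM-AUA interfaces.

  from dataclasses import dataclass, field

  @dataclass
  class SpeakingAvatar:
      # Illustrative container for the rendered, talking avatar.
      speech_text: str          # from the Final Response
      gesture_track: list       # timed gestures shaped by the personality blueprint
      face_track: list          # timed facial expressions
      placement: dict           # e.g., {"action": "MM-Add", "m_location": ...}

  @dataclass
  class FormationStatus:
      # Structured report on the processing that produced the avatar.
      steps: list = field(default_factory=list)

  def form_a_user(entity_status, final_response, control_command, rendering_parameters):
      """Hypothetical Formation step: fuse personality, content, control directives,
      and synchronisation cues into a Speaking Avatar plus a Formation Status."""
      status = FormationStatus()
      cues = rendering_parameters.get("sync_cues", [])
      # The personality blueprint drives gesture and facial-expression selection.
      gestures = [{"t": c["t"], "gesture": entity_status.get("gesture_style", "neutral")} for c in cues]
      faces = [{"t": c["t"], "expression": entity_status.get("tone", "neutral")} for c in cues]
      status.steps.append("aligned gestures and expressions with the personality blueprint")
      # The control command decides where and how the avatar appears in the metaverse.
      placement = {"action": control_command.get("action", "MM-Add"),
                   "m_location": control_command.get("m_location")}
      status.steps.append("applied control command " + placement["action"])
      return SpeakingAvatar(final_response, gestures, faces, placement), status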

What Makes A-User Formation Special?

It’s the last mile of the pipeline – the point where all upstream intelligence (context, reasoning, User’s Entity Status estimation, personality) becomes visible and interactive. A-User Formation ensures:

  • Expressive Coherence: Speech, gestures, and facial cues match the chosen personality.
  • Contextual Fit: Avatar appearance and behaviour align with domain norms (e.g., formal in a medical setting, casual in a social lounge).
  • Technical Precision: Synchronisation across Personal Status modalities for natural and consistent interaction.

Key Points to Take Away about A-User Formation

  1. Purpose: Turns the A-User’s personality and reasoning into a visible and audible interactive avatar.
  2. Inputs: Personality-aligned final response, control commands, and rendering parameters.
  3. Outputs: Speaking avatar, formation status.
  4. Goal: Deliver a coherent, expressive, and context-aware representation that feels natural and engaging, reflecting how the User was perceived at the onset and progressively understood along the pipeline.

Personality Alignment: The Style Engine of A-User

Personality Alignment is where an A-User interacting with a User embedded in a metaverse environment stops being a generic bot and starts acting like a character with intent, tone, and flair. It’s not just a matter of what it utters – it’s about how those words land, how the avatar moves, and how the whole interaction feels.

We have already presented the system diagram of the Autonomous User (A-User), an autonomous agent able to move and interact (walk, converse, do things, etc.) with another User in a metaverse. The latter User may be an A-User or be under the direct control of a human and is thus called a Human-User (H-User). The A-User acts as a “conversation partner in a metaverse interaction” with the User.

This is the ninth of a sequence of posts aiming to illustrate more in depth the architecture of an A-User and provide an easy entry point for those who wish to respond to the MPAI Call for Technology on Autonomous User Architecture. The first eight dealt with 1) the Control performed by the A-User Control AI Module on the other components of the A-User; 2) how the A-User captures the external metaverse environment using the Context Capture AI Module; 3) listens, localises, and interprets sound not just as data, but as data having a spatially anchored meaning; 4) makes sense of what the Autonomous User sees by understanding objects’ geometry, relationships, and salience; 5) takes raw sensory input and the User State and turns them into a well‑formed prompt that Basic Knowledge can actually understand and respond to; 6) taps into domain-specific intelligence for deeper understanding of user utterances and operational context; 7) the core language model of the Autonomous User – the “knows-a-bit-of-everything” brain, the first responder in a sequence of four prompts; and 8) converting a “blurry photo” of the User in the environment taken at the onset of the process into a focused picture.

The figure is an extract from the A-User Architecture Reference Model representing Domain Access, which generates two streams of data related to the User and its environment, and the two recipient AI Modules: User State Refinement and Personality Alignment.

This is possible because the A-User receives the right inputs driving the Alignment of the A-User Personality with the refined User’s Entity State:

  • Personality Context Guide: Domain-specific hints from Domain Access (e.g., “medical setting → professional tone”).
  • Expressive State Guide: Emotional and attentional posture of the User (e.g., stressed → calming personality).
  • Refined Response: Text from Basic Knowledge in response to the User State Refinement prompt.
  • Personality Alignment Directive: Commands to tweak or override the personality profile (e.g., “switch to negotiator mode”) from the A-User Control AI Module (AIM).

A smart integration of these inputs enables the A-User to deliver the following outputs:

  • A-User Entity State: The complete internal state of the A-User’s synthetic personality (tone, gestures, behavioural traits).
  • PA-Prompt: New prompt formulation including the final A-User personality (so the words sound right).
  • Personality Alignment Status: A structured report of personality and expressive alignment to the A-User Control AIM.

Here are some examples of personality profiles that Personality Alignment could use or blend:

  • Mentor Mode: Calm tone, structured answers, moderate gestures, empathy cues.
  • Entertainer Mode: Upbeat tone, humour, wide gestures, animated expressions.
  • Negotiator Mode: Firm tone, controlled gestures, strategic phrasing.
  • Assistant Mode: Neutral tone, minimal gestures, clarity-first responses.
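
As a purely illustrative sketch, the four profiles above could be encoded as trait dictionaries that Personality Alignment blends with weights; the trait names and the 0-to-1 scales are assumptions, not part of any MPAI data type.

  # Hypothetical personality profiles; traits on a 0.0-1.0 scale.
  PROFILES = {
      "mentor":      {"tone_energy": 0.3, "gesture_width": 0.5, "humour": 0.2, "empathy": 0.9},
      "entertainer": {"tone_energy": 0.9, "gesture_width": 0.9, "humour": 0.9, "empathy": 0.5},
      "negotiator":  {"tone_energy": 0.5, "gesture_width": 0.3, "humour": 0.1, "empathy": 0.4},
      "assistant":   {"tone_energy": 0.4, "gesture_width": 0.2, "humour": 0.1, "empathy": 0.5},
  }

  def blend_profiles(weights):
      """Blend profiles, e.g. {'mentor': 0.7, 'entertainer': 0.3}, into one trait
      dictionary that could seed the A-User Entity State."""
      total = sum(weights.values())
      blended = {}
      for name, w in weights.items():
          for trait, value in PROFILES[name].items():
              blended[trait] = blended.get(trait, 0.0) + value * (w / total)
      return blended

  # Example: a calming but lightly engaging personality for a stressed User.
  print(blend_profiles({"mentor": 0.7, "entertainer": 0.3}))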

Key Points to Take Away about Personality Alignment

  • Purpose: Makes A-User’s delivery context-aware and emotionally tuned.
  • Inputs: Domain context, user emotional state, refined semantic response, and directives.
  • Outputs: Personality blueprint (Entity Status), PA-Prompt for expressive rendering, and alignment status.
  • Profiles: For example, Mentor, Entertainer, Negotiator, Assistant – each with tone, gesture style, and behavioural traits.
  • Goal: Coherent, adaptive interaction that feels natural and persuasive in the metaverse.

User State Refinement: Turning a Snapshot into a Full Profile

User State Refinement starts from a “blurry photo” of the User in the context (the initial User State) taken by Context Capture, which includes a location, an activity, an initial intent, and maybe a few emotional hints, and adds to that “blurry photo” all the information about the User that the workflow has been able to collect.

We have already presented the system diagram of the Autonomous User (A-User), an autonomous agent able to move and interact (walk, converse, do things, etc.) with another User in a metaverse. The latter User may be an A-User or be under the direct control of a human and is thus called a Human-User (H-User). The A-User acts as a “conversation partner in a metaverse interaction” with the User.

This is the eighth of a sequence of posts aiming to illustrate more in depth the architecture of an A-User and provide an easy entry point for those who wish to respond to the MPAI Call for Technology on Autonomous User Architecture. The first seven dealt with 1) the Control performed by the A-User Control AI Module on the other components of the A-User; 2) how the A-User captures the external metaverse environment using the Context Capture AI Module; 3) listens, localises, and interprets sound not just as data, but as data having a spatially anchored meaning; 4) makes sense of what the Autonomous User sees by understanding objects’ geometry, relationships, and salience; 5) takes raw sensory input and the User State and turns them into a well‑formed prompt that Basic Knowledge can actually understand and respond to; 6) taps into domain-specific intelligence for deeper understanding of user utterances and operational context; and 7) the core language model of the Autonomous User – the “knows-a-bit-of-everything” brain, the first responder in a sequence of four prompts.

When the A-User begins interacting, it starts with a basic User State captured by Context Capture – location, activity, initial intent, and perhaps a few emotional hints. This initial state is useful, but it’s like a blurry photo: the A-User knows that somebody is there, but not the details that matter for nuanced interaction.

As the session unfolds, the A-User learns much more thanks to Prompt Creation, Spatial Reasoning, and Domain Access. Suddenly, the A-User understands not just what the User said, but what it meant, the context it operates in, and the reasoning patterns relevant to the domain. This new knowledge is integrated with the initial state so that subsequent steps – especially Personality Alignment and Basic Knowledge – are based on an appropriate understanding of the User State.

Why Update the User State?

Personality Alignment is where the A-User adapts tone, style, and interaction strategy. If it only relies on the first guess of the User State, it risks taking an incongruent attitude – formal when casual is needed, directive when supportive is expected. If the User State can be updated, the A-User knows more about:

  • The environment, incorporating jargon, compliance rules, and reasoning patterns.
  • The User’s internal state, and can thus adjust responses to confusion, urgency, or confidence.

The Refinement Process

  1. Start with Context Snapshot: capture environment, speech, gestures, and basic emotional cues.
  2. Inject Domain Intelligence from Domain Access: technical vocabulary, rules, structured reasoning.
  3. Merge New Observations: emotional shifts, spatial changes, updated intent.
  4. Validate Consistency: ensure module coherence for reliable downstream use.
  5. Feed Forward: pass the refined state to Personality Alignment and sharper prompts to Basic Knowledge.
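
A minimal sketch of these five steps in Python follows; the dictionary keys and the function signature are assumptions chosen only to mirror the list above.

  def refine_user_state(initial_state, domain_guide, new_observations):
      """Hypothetical User State Refinement: enrich the 'blurry photo' taken by
      Context Capture with domain intelligence and fresh observations."""
      refined = dict(initial_state)                    # 1. start with the Context snapshot
      refined["domain"] = {                            # 2. inject domain intelligence
          "vocabulary": domain_guide.get("vocabulary", []),
          "rules": domain_guide.get("rules", []),
      }
      refined.update(new_observations)                 # 3. merge emotional/spatial/intent updates
      if refined.get("intent") is None:                # 4. validate consistency before reuse
          raise ValueError("a refined User State must carry an intent")
      return refined                                   # 5. feed forward to Personality Alignment

  refined = refine_user_state(
      {"location": "virtual kitchen", "activity": "cooking",
       "intent": "ask for help", "emotion": "uncertain"},
      {"vocabulary": ["julienne", "deglaze"], "rules": ["food-safety first"]},
      {"emotion": "frustrated", "gaze_target": "stove"},
  )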

Key points to take away about User State Refinement

  1. Capture initial User State from Context: location, activity, intent, basic emotions.
  2. Initial state = blurry photo: useful but lacks detail.
  3. A-User learns what was said, meant, and domain reasoning patterns.
  4. Merge new insights with original state for accuracy by
    • Injecting domain intelligence
    • Merging emotional/spatial updates
  5. Outputs: better prompt for Basic Knowledge and more complete User State.
  6. Goal: dynamic, nuanced understanding powering adaptive interaction.

Basic Knowledge: The Generalist Engine Getting Sharper with Every Prompt

Basic Knowledge is the core language model of the Autonomous User – the “knows-a-bit-of-everything” brain. It provides the first response to a prompt, but the Autonomous User doesn’t fire off just one answer: it produces four of them in a progressive refinement loop, delivering smarter and more context-aware responses with every refined prompt.

We have already presented the system diagram of the Autonomous User (A-User), an autonomous agent able to move and interact (walk, converse, do things, etc.) with another User in a metaverse. The latter User may be an A-User or be under the direct control of a human and is thus called a Human-User (H-User). The A-User acts as a “conversation partner in a metaverse interaction” with the User.

This is the seventh of a sequence of posts aiming to illustrate more in depth the architecture of an A-User and provide an easy entry point for those who wish to respond to the MPAI Call for Technology on Autonomous User Architecture. The first six dealt with 1) the Control performed by the A-User Control AI Module on the other components of the A-User; 2) how the A-User captures the external metaverse environment using the Context Capture AI Module; 3) listens, localises, and interprets sound not just as data, but as data having a spatially anchored meaning; 4) makes sense of what the Autonomous User sees by understanding objects’ geometry, relationships, and salience; 5) takes raw sensory input and the Entity State and turns them into a well‑formed prompt that Basic Knowledge can actually understand and respond to; and 6) taps into domain-specific intelligence for deeper understanding of user utterances and operational context.

The Journey of a Prompt

  1. Starts Simple: The first prompt from Prompt Creation is a rough draft because the A-User has only a superficial knowledge of the Context and User intent.
  2. Domain Access adds expert seasoning: jargon, compliance rules, reasoning patterns. The prompt becomes richer and sharper.
  3. User State Refinement injects dynamic knowledge about the User – refined emotions, more focused goals, better spatial context – so the prompt feels more attuned to what the User feels and wants.
  4. Personality Alignment Tells A-User how to Behave: it ensures that the appropriate A-User’s style and mood drive the next prompt.
  5. Final Prompt Delivery: when Basic Knowledge receives the last prompt (from Personality Alignment) the final touches have been added.
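
The loop can be pictured as repeated calls to the same generalist model with increasingly refined prompts. The sketch below is an assumption about the control flow only: basic_knowledge() stands in for whatever language model backs the Basic Knowledge AIM, and the three refiners are passed in as plain functions.

  def basic_knowledge(prompt):
      # Stand-in for the generalist language model behind Basic Knowledge.
      return "response to: " + prompt

  def refinement_loop(pc_prompt, domain_access, user_state_refinement, personality_alignment):
      """Hypothetical progressive refinement: one initial answer plus three refined ones."""
      response = basic_knowledge(pc_prompt)            # 1. rough draft from the Prompt Creation prompt
      for refine in (domain_access,                    # 2. add jargon, rules, reasoning patterns
                     user_state_refinement,            # 3. add refined emotions, goals, spatial context
                     personality_alignment):           # 4. add the A-User's style and mood
          response = basic_knowledge(refine(response)) # each refined prompt gets a new answer
      return response                                  # 5. final, personality-tuned response

  final = refinement_loop(
      "User asks for cooking help in a virtual kitchen",
      lambda r: r + " [use cooking vocabulary and food-safety rules]",
      lambda r: r + " [the User seems uncertain; goal: a simple recipe]",
      lambda r: r + " [deliver in a calm Mentor tone]",
  )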

This sequence of prompts eventually provides:

  • Better responses: Each prompt reduces ambiguity.
  • Domain grounding: Avoids hallucinations by embedding rules and expert logic.
  • Personalisation: Adapts A-User’s tone and content to User State.
  • Scalability: Works across domains without retraining.

Basic Knowledge starts as a generalist, but thanks to refined prompts, it ends up delivering expert-level, context-aware, and User-sensitive responses. It starts from a rough sketch and, by iterating with specialist information sources, it provides a final response that includes all the information extracted or produced in the workflow.

Key points to take away about Basic Knowledge

  1. Is the core language model of the A-User – generalist brain.
  2. Works in four progressive refinement steps.
  3. Starts with Prompt Creation prompt, a rough draft with limited context.
  4. Domain Access adds jargon, compliance rules, reasoning patterns.
  5. User State Refinement injects emotions, focused goals, spatial context.
  6. Personality Alignment ensures style and mood match User State.
  7. Each refinement step reduces ambiguity and improves accuracy.
  8. Benefits: better responses, domain grounding, personalisation, scalability.
  9. Basic Knowledge starts as a generalist but ends up delivering an expert-level, User-sensitive, and A-User aware response.

Domain Access: The Specialist Brain Plug-in for the Autonomous User

While the Basic Knowledge module is a generalist language model that “knows a bit of everything”, Domain Access is the expert layer that enables the Autonomous User to tap into domain-specific intelligence for deeper understanding.

We have already presented the system diagram of the Autonomous User (A-User), an autonomous agent able to move and interact (walk, converse, do things, etc.) with another User in a metaverse. The latter User may be an A-User or be under the direct control of a human and is thus called a Human-User (H-User). The A-User acts as a “conversation partner in a metaverse interaction” with the User.

This is the sixth of a sequence of posts aiming to illustrate the architecture of an A-User and provide an easy entry point for those who wish to respond to the MPAI Call for Technology on Autonomous User Architecture. The first five dealt with 1) the Control performed by the A-User Control AI Module on the other components of the A-User; 2) how the A-User captures the external metaverse environment using the Context Capture AI Module; 3) listens, localises, and interprets sound not just as data, but as data having a spatially anchored meaning; 4) makes sense of what the Autonomous User sees by understanding objects’ geometry, relationships, and salience; and 5) takes raw sensory input and the User State and turns them into a well‑formed prompt that Basic Knowledge can actually understand and respond to.

The Basic Knowledge module is a generalist language model that “knows a bit of everything.” In contrast, Domain Access is the expert layer that enables the Autonomous User (A-User) to tap into domain-specific intelligence for deeper understanding of user utterances and their context.

How Domain Access Works

  • Receives Initial Response: Domain Access starts with the response of Basic Knowledge, the generalist model’s response to the prompt generated by Prompt Creation.
  • Converts to DA-Input: As natural language is not the best format for further processing, the response is converted into a JSON object called DA-Input for structured processing.
  • Gets domain knowledge by pulling in domain vocabulary, such as jargon and technical terms.
  • Creates the next prompt by using this specialised knowledge:
    • Injects rules and constraints (e.g., standards, legal compliance).
    • Adds reasoning patterns (e.g., diagnostic flows, contractual logic).

All enrichment happens in the JSON domain, and so does the resulting DA-Prompt Plan – a domain-aware structure ready to be converted into natural language (the DA-Prompt) and resubmitted to the knowledge/response pipeline.
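
A sketch of that JSON round trip follows; the DA-Input keys and the wording of the generated DA-Prompt are invented for illustration, not taken from the PGM-AUA document.

  def to_da_input(initial_response):
      # Wrap the generalist Initial Response into a structured DA-Input object.
      return {"initial_response": initial_response, "entities": [], "open_questions": []}

  def to_da_prompt_plan(da_input, domain):
      # DA-Prompt Plan = DA-Input enriched with vocabulary, constraints, reasoning patterns.
      plan = dict(da_input)
      plan["vocabulary"] = domain["vocabulary"]
      plan["constraints"] = domain["rules"]
      plan["reasoning"] = domain["patterns"]
      return plan

  def to_da_prompt(plan):
      # Convert the JSON DA-Prompt Plan back into a natural-language DA-Prompt.
      return ("Refine the following answer using the terms " + ", ".join(plan["vocabulary"]) +
              ", respecting " + "; ".join(plan["constraints"]) +
              ", and reasoning as follows: " + "; ".join(plan["reasoning"]) +
              ". Answer to refine: " + plan["initial_response"])

  plan = to_da_prompt_plan(
      to_da_input("You could start by chopping the vegetables."),
      {"vocabulary": ["julienne", "mise en place"],
       "rules": ["keep instructions beginner-safe"],
       "patterns": ["prepare, then cook, then plate"]},
  )
  print(to_da_prompt(plan))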

Why Domain Access Matters

Without Domain Access, the A-User is like a clever intern: knowledgeable but lacking depth and experience. With Domain Access, it becomes an experienced professional that can:

  • Deliver accurate, context-aware answers.
  • Avoid hallucinations by grounding responses in domain rules.
  • Address different application domains by swapping or adding domain modules without rebuilding the entire A-User.

What you can take away about Domain Access

  • Get Initial Response from Basic Knowledge.
  • Convert to DA-Input (JSON).
  • Enrich with Domain Context:
    • Pull in domain vocabulary.
    • Inject rules and constraints.
    • Add reasoning patterns.
  • Create DA-Prompt Plan (domain-aware structure).
  • Translate to DA-Prompt (natural language).
  • Query Basic Knowledge language model.
  • Strong points
    • Deliver accurate, context-sensitive answers.
    • Avoid hallucinations via domain grounding.
    • Adapt across different domains by swapping modules.

Prompt Creation: Where Words Meet Context

The Prompt Creation module is the storyteller and translator in the Autonomous User’s “brain”. It takes raw sensory input – audio and visual spatial data of the Context (such as objects in a scene with their position, orientation, and velocity) and the Entity State (a rich description of the A‑User’s understanding of the “internal state” of the User) – and turns it into a well‑formed prompt that Basic Knowledge can actually understand and respond to.


We have already presented the system diagram of the Autonomous User (A-User), an autonomous agent able to move and interact (walk, converse, do things, etc.) with another User in a metaverse. The latter User may be an A-User or be under the direct control of a human and is thus called a Human-User (H-User). The A-User acts as a “conversation partner in a metaverse interaction” with the User.

This is the fifth of a sequence of posts aiming at illustrating more in depth the architecture of an A-User and provide an easy entry point for those who wish to respond to the MPAI Call for Technology on Autonomous User Architecture. The first four posts dealt with 1) how the A-User Control AI Module controls the other components of the A-User; 2) how the A-User captures the external metaverse environment using the Context Capture AI Module; 3) how it listens, localises, and interprets sound not just as data, but as data having a spatially anchored meaning; and 4) how it makes sense of what the Autonomous User sees by understanding objects’ geometry, relationships, and salience.

Prompt Creation is the storyteller and translator in the Autonomous User’s “brain.” It takes raw sensory input – audio and visual spatial data of Context and User State – and turns it into a well‑formed prompt that Basic Knowledge can sensibly understand and respond to.

The audio and visual components of Spatial Reasoning provide the information on things around the User such as “who’s in the room,” “what’s being said,” “what objects are present,” and “what’s the User doing”. Context Capture provides the Entity State as a rich description of the A‑User’s understanding of the “internal state” of the User – which may be the representation of a biologically real person, if the User represents a human, or a simulated one, when the User represents an agent. The task of Prompt Creation is to synthesise these sources of information into a PC‑Prompt Plan. This plan starts from what the User said, adds intent (e.g., “User wants help” or “User is asking a question”), includes the context around the User (e.g., “User is in a virtual kitchen”), and embeds the User State (e.g., “User seems confused”).

This information – conveniently represented as a JSON object – is converted into natural language and passed to Basic Knowledge, which produces a natural language response called the Initial Response: initial because there are more processing elements in the A‑User pipeline that will refine and improve the answer before it is rendered in the metaverse.

Prompt Creation gives the AI a sense of narrative, so the A-User can:

  • Ask the right clarifying question.
  • Respond with relevance to the situation.
  • Adapt to the environment and User mood.
  • Maintain continuity across interactions.

If the User says: “Can you help me cook?”

  • Spatial Reasoning notes the User is in a virtual kitchen with utensils and ingredients.
  • Entity State suggests the User looks uncertain.
  • Prompt Creation combines these into: “User is asking for cooking help, is in a kitchen, seems unsure.”
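
Using this kitchen example, a PC-Prompt Plan and its conversion into a natural-language PC-Prompt might look like the sketch below; the JSON keys and the sentence template are assumptions made for illustration.

  pc_prompt_plan = {
      "utterance": "Can you help me cook?",
      "intent": "help request",
      "context": "User is in a virtual kitchen with utensils and ingredients",
      "user_state": "uncertain",
  }

  def to_pc_prompt(plan):
      # Turn the structured PC-Prompt Plan into a prompt Basic Knowledge can answer.
      return ("The User said: \"" + plan["utterance"] + "\". "
              "Intent: " + plan["intent"] + ". Context: " + plan["context"] + ". "
              "The User appears " + plan["user_state"] + ". "
              "Respond helpfully and adapt to this situation.")

  print(to_pc_prompt(pc_prompt_plan))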

The prompt is sent to Basic Knowledge, and the Initial Response it produces is then passed to Domain Access, which may elaborate a new prompt enriched with domain-specific information (in this case “cooking”, when Basic Knowledge is not well informed about cooking).

Prompt Creation turns raw multimodal input and spatial information into meaningful prompts so the AI can think, speak, and act with purpose. It is the scriptwriter that ensures the A‑User’s dialogue is not only coherent but also contextually aware, emotionally attuned, and situationally precise.

What you can take away about Prompt Creation

  • Translates user speech into Language Model understandable prompts
  • Synthesises spatial data and User State
  • Detects User intent (e.g., help request, question)
  • Embeds environmental context (e.g., virtual kitchen)
  • Captures emotional cues (e.g., confusion, excitement)
  • Builds a structured PC-Prompt Plan as a JSON object to facilitate prompt creation
  • Converts PC-Prompt Plan into a natural language prompt
  • Passes the prompt to Basic Knowledge for response generation
  • Bridges perception and cognition for purposeful Language Model action

MPAI publishes AI-based video up-sampling filter standard with online demo of standard performance

Geneva, Switzerland – 26th November 2025. MPAI – Moving Picture, Audio and Data Coding by Artificial Intelligence – the international, non-profit, unaffiliated organisation developing AI-based data coding standards – has concluded its 62nd General Assembly (MPAI-62) publishing the Up-sampling Filter for Video applications standard.

Technical Specification: AI-Enhanced Video Coding (MPAI-EVC) – Up-sampling Filter for Video applications (EVC-UFV) V1.0 provides two standard methodologies: 1) to design AI-based super-resolution up-sampling filters for video applications and 2) to reduce the complexity of the designed filters without substantially affecting their performance. The parameters provided in the EVC-UFV standard may be used to test the filter performance. Alternatively, an application can be used to submit an image and receive an up-sampled version of the image.
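
The normative filter-design and complexity-reduction methodologies are defined in the EVC-UFV text itself; purely as a generic illustration of the kind of network such a methodology produces (not the standard’s filter), an AI-based 2x up-sampling stage can be as compact as a few convolutions followed by a pixel shuffle, for example in PyTorch:

  import torch
  import torch.nn as nn

  class TinyUpsampler(nn.Module):
      """Generic ESPCN-style 2x super-resolution block, shown only as an example;
      it is not the EVC-UFV filter."""
      def __init__(self, channels=3, features=32, scale=2):
          super().__init__()
          self.body = nn.Sequential(
              nn.Conv2d(channels, features, kernel_size=3, padding=1),
              nn.ReLU(inplace=True),
              nn.Conv2d(features, channels * scale * scale, kernel_size=3, padding=1),
              nn.PixelShuffle(scale),   # rearranges channels into a scale-times larger image
          )

      def forward(self, x):
          return self.body(x)

  # One 1080p frame up-sampled to 2160p (random data, for shape checking only).
  frame = torch.rand(1, 3, 1080, 1920)
  print(TinyUpsampler()(frame).shape)   # torch.Size([1, 3, 2160, 3840])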

After publishing the Autonomous User Architecture Call for Technologies, MPAI has extended its Tentative Technical Specification: Pursuing Goal in metaverse (MPAI-PGM) – Autonomous User Architecture (PGM-AUA) originally attached to the Call. This addendum is a concrete example of the standard that MPAI seeks to develop with the PGM-AUA Call. Respondents to the Call are encouraged to read, comment on, change, or extend this document in their responses. Alternatively, they can submit responses with content unrelated to this document but relevant to the Call. The video recording of the online presentation of the Call is available. The deadline for submissions is 19 January 2026.

MPAI is continuing the development of its work plan that involves the following activities:

  1. AI Framework (MPAI-AIF): extending the MPAI-AIF specification to enable a client to access a remote MPAI-AIF Controller and an AI Module to communicate data to another AIM with associated metadata.
  2. AI for Health (AIH-HSP): developing the specification of a system receiving and processing licensed AI Health Data and enabling clients to improve health processing models via federated learning.
  3. Context-based Audio Enhancement (CAE-USC): developing the Audio Six Degrees of Freedom (CAE-6DF) and the Audio Object Rendering (CAE-AOR) specifications.
  4. Connected Autonomous Vehicle (CAV-TEC): developing a new version of the flagship specification CAV-TEC with security support.
  5. Compression and Understanding of Industrial Data (CUI-CPP): developing the Company Performance Prediction V2.0 specification.
  6. End-to-End Video Coding (MPAI-EEV): exploring the potential of AI-based End-to-End Video coding in compressing video sequences.
  7. AI-Enhanced Video Coding (MPAI-EVC): exploring use of AI to enhance the video codec performance.
  8. Governance of the MPAI Ecosystem (MPAI-GME): operating the MPAI Ecosystem per the MPAI-GME Specification.
  9. Human and Machine Communication (MPAI-HMC): exploring the use of AI in human-to-machine and machine-to-machine communication.
  10. Multimodal Conversation (MPAI-MMC): exploring the impact of the PGM-AUA Call for Technologies on human-to-machine and machine-to-machine conversation.
  11. MPAI Metaverse Model (MMM-TEC): developing security-protected protocols in the MMM-TEC specification.
  12. Neural Network Watermarking (NNW-TEC): Developing the new Neural Network Watermarking (MPAI-NNW) – Technologies (NNW-TEC) including assessments of Neural Network Traceability Technologies.
  13. Object and Scene Description (MPAI-OSD): discussing the impact of MPAI standards planned or under development on MPAI-OSD V1.4.
  14. Portable Avatar Format (MPAI-PAF): discussing the impact of MPAI standards planned or under development on MPAI-PAF V1.5.
  15. AI Module Profiles (MPAI-PRF): extending the scope of the current version of AI Module Profiles.
  16. Server-based Predictive Multiplayer Gaming (MPAI-SPG): exploring new standard opportunities in the domain.
  17. Data Types, Formats, and Attributes (MPAI-TFA): extending the standard to data types used by MPAI standards that are planned or under development.
  18. XR Venues (XRV-LTP): developing the standard for improved execution of Live Theatrical Performances using AI.

Legal entities and representatives of academic departments supporting the MPAI mission and able to contribute to the development of standards for the efficient use of data can become MPAI members. New members joining before 31st December 2025 have their membership extended until 31st December 2026.

Please visit the MPAI website, contact the MPAI Secretariat for specific information, subscribe to the MPAI Newsletter and follow MPAI on social media: LinkedIn, Twitter, Facebook, Instagram, and YouTube.

 


Visual Spatial Reasoning: The Vision-Aware Interpreter

Autonomous User (A-User) is an autonomous agent able to move and interact (converse, etc.) with another User in a metaverse. It is a “conversation partner in a metaverse interaction” with the User, itself an A-User or an H-User directly controlled by a human. The figure shows a diagram of the A-User while the User generates audio-visual streams of information and possibly text as well.

This is the fourth of a sequence of posts aiming to illustrate the architecture of an A-User and provide an easy entry point for those who wish to respond to the MPAI Call for Technology on Autonomous User Architecture. The first three dealt with 1) the Control performed by the A-User Control AI Module on the other components of the A-User, 2) how the A-User captures the external metaverse environment using the Context Capture AI Module, and 3) listens, localises, and interprets sound not just as data, but as data having a meaning and a spatial anchor.

When the A-User acts in a metaverse space, sound doesn’t tell the whole story. The visual scene – objects, zones, gestures, occlusions – is the canvas where situational meaning unfolds. That’s where Visual Spatial Reasoning comes in: it’s the interpreter that makes sense of what the Autonomous User sees, not just what it hears.

Visual Spatial Reasoning can be considered as the visual analyst embedded in the “brain” of the Autonomous User. It doesn’t just list objects; it understands their geometry, relationships, and salience. A chair isn’t just “a chair” – it’s occupied, near a table, partially occluded, or the focus of attention. By enriching raw descriptors into structured semantics, Visual Spatial Reasoning transforms objects made of pixels into actionable targets.

This is what it does

  • Scene Structuring: Takes and organises raw visual descriptors into coherent spatial maps.
  • Semantic Enrichment: Adds meaning – classifying objects, inferring affordances, and ranking salience.
  • Directed Alignment: Filters and prioritises based on the A-User Controller’s intent, ensuring relevance.
  • Traceability: Every refinement step is auditable, so it is possible to trace back why “that object in the corner” became “the salient target for interaction.”
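
A toy sketch of that enrichment pass follows; the descriptor fields, the salience heuristic, and the audit trail format are invented for illustration.

  def enrich_visual_descriptors(raw_objects, controller_intent):
      """Hypothetical Visual Spatial Reasoning: score salience and keep an audit
      trail for each raw visual descriptor, honouring the Controller's intent."""
      enriched = []
      for obj in raw_objects:
          salience, trail = 0.0, []
          if obj.get("gazed_at"):                       # attention raises salience
              salience += 0.5
              trail.append("User gaze on object")
          if obj["label"] in controller_intent.get("relevant_labels", []):
              salience += 0.4                           # Directed Alignment with the Controller's intent
              trail.append("matches Controller intent")
          if obj.get("occluded"):
              salience -= 0.2
              trail.append("partially occluded")
          enriched.append({**obj, "salience": salience, "audit_trail": trail})
      return sorted(enriched, key=lambda o: o["salience"], reverse=True)

  scene = enrich_visual_descriptors(
      [{"label": "screen", "gazed_at": True}, {"label": "chair", "occluded": True}],
      {"relevant_labels": ["screen"]},
  )
  print(scene[0]["label"], scene[0]["audit_trail"])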

Why It Matters

Without Visual Spatial Reasoning, the metaverse would be a flat stage of unprocessed visuals. With it, visual scenes become interpretable narratives. It’s the difference between “there are three objects in the room” and “the User is focused on the screen, while another entity gestures toward the door.”

Of course, Visual Spatial Reasoning does not replace vision. It bridges the gap between raw descriptors and effective interaction, ensuring that the A‑User can observe, interpret, and act with precision and intent.

If Audio Spatial Reasoning is the metaverse’s “sound‑aware interpreter,” then Visual Spatial Reasoning is its “sight‑aware analyst” that starts by seeing objects and eventually can understand their role, their relevance, and their story in the scene.


Audio Spatial Reasoning: The Sound-Aware Interpreter

Autonomous User (A-User) is an autonomous agent able to move and interact (converse, etc.) with another User in a metaverse. It is a “conversation partner in a metaverse interaction” with the User, itself an A-User or an H-User directly controlled by a human. The figure shows a diagram of the A-User while the User generates audio-visual streams of information and possibly text as well.

We have already presented the system diagram of the Autonomous User (A-User), an autonomous agent able to move and interact (walk, converse, do things, etc.) with another User in a metaverse. The latter User may be an A-User or be under the direct control of a human and is thus called a Human-User (H-User). The A-User acts as a “conversation partner in a metaverse interaction” with the User.

This is the third of a sequence of posts aiming to illustrate more in depth the architecture of an A-User and provide an easy entry point for those who wish to respond to the MPAI Call for Technology on Autonomous User Architecture. The first two dealt with the Control performed by the A-User Control AI Module on the other components of the A-User and how the A-User captures the external metaverse environment using the Context Capture AI Module.

Audio Spatial Reasoning is the A-User’s acoustic intelligence module – the one that listens, localises, and interprets sound not just as data, but as data having a spatially anchored meaning. Therefore, its role is not just about “hearing”: it is also about “understanding” where sound is coming from, how relevant it is, and what it implies in the context of the User’s intent in the environment.

When the A-User system receives a Context snapshot from Context Capture – including audio streams with a position and orientation and a description of the User’s emotional state (called User State) – Audio Spatial Reasoning starts an analysis of directionality, proximity, and semantic importance of incoming sounds. The conclusion is something like “That voice is coming from the left, with a tone of urgency, and its orientation is directed at the A-User.”

All this is represented with an extension of the Audio Scene Descriptors describing:

  • Which audio sources are relevant
  • Where they are located in 3D space
  • How close or far they are
  • Whether they’re foreground (e.g., a question) or background (e.g., ambient chatter)
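
One possible shape for such an extended descriptor is sketched below; the field names are assumptions, while the normative Audio Scene Descriptors are defined in the MPAI specifications.

  from dataclasses import dataclass

  @dataclass
  class ExtendedAudioSource:
      # One entry of a hypothetical extended Audio Scene Descriptor list.
      source_id: str
      position: tuple        # (x, y, z) in the metaverse coordinate system
      distance: float        # metres from the A-User
      relevance: float       # 0.0 (ignore) .. 1.0 (critical)
      foreground: bool       # True: e.g. a question; False: e.g. ambient chatter

  sources = [
      ExtendedAudioSource("voice-01", (-1.2, 0.0, 0.8), 1.4, 0.9, True),
      ExtendedAudioSource("hvac-07", (4.0, 2.5, 0.0), 4.7, 0.1, False),
  ]
  foreground = [s for s in sources if s.foreground and s.relevance > 0.5]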

This guide is sent to Prompt Creation and Domain Access. Let’s see what happens with the former. The extended Audio Scene Descriptors are fused with the User’s spoken or written input and the current User State. The result is a PC-Prompt – a query enriched with text expressing the multimodal information collected so far – that is passed to Basic Knowledge for reasoning.

The Audio Scene Descriptors are further processed and integrated with domain-specific information. The response, called the Audio Spatial Directive, includes domain-specific logic, scene priors, and task constraints. For example, if the scene is a medical simulation, Domain Access might tell Audio Spatial Reasoning that “only sounds from authorised personnel should be considered”. This feedback helps Audio Spatial Reasoning refine its interpretation – filtering out irrelevant sounds, boosting priority for critical ones, and aligning its spatial model with the current domain expectations.
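
In code terms, applying such a directive is essentially a filtering and re-weighting pass over the audio sources; the directive keys below (allowed_ids, priority_boost) are hypothetical.

  def apply_directive(sources, directive):
      """Filter and re-weight audio sources according to a hypothetical
      Audio Spatial Directive (e.g. 'only authorised personnel' in a medical scene)."""
      kept = []
      for s in sources:
          if directive.get("allowed_ids") and s["source_id"] not in directive["allowed_ids"]:
              continue                                  # domain rules exclude this source
          boost = directive.get("priority_boost", {}).get(s["source_id"], 0.0)
          kept.append({**s, "relevance": min(1.0, s["relevance"] + boost)})
      return kept

  filtered = apply_directive(
      [{"source_id": "voice-01", "relevance": 0.9}, {"source_id": "chatter-02", "relevance": 0.3}],
      {"allowed_ids": ["voice-01"], "priority_boost": {"voice-01": 0.05}},
  )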

Therefore, we can call Audio Spatial Reasoning the A-User’s auditory guide. It knows where sounds are coming from, what they mean, and how they should influence the A-User’s behaviour. The A-User responds to a sound with spatial awareness, contextual sensitivity, and domain consistency.

There are still about two months to the deadline of 2026/01/19, when responses to the Call must reach the MPAI Secretariat (secretariat@mpai.community) without exception.

To know more, register to attend the online presentations of the Call on 17 November at 9 UTC (https://tinyurl.com/y4antb8a) and 16 UTC (https://tinyurl.com/yc6wehdv) – Today or tomorrow depending on where you are.


Context Capture: The A-User’s First Glimpse of the World

Autonomous User (A-User) is an autonomous agent able to move and interact (converse, etc.) with another User in a metaverse. It is a “conversation partner in a metaverse interaction” with the User, itself an A-User or an H-User directly controlled by a human. The figure shows a diagram of the A-User while the User generates audio-visual streams of information and possibly text as well.

The sequence of posts – of which this is the second – that illustrates more in depth the architecture of an A-User provides an easy entry point for those who wish to respond to the MPAI Call for Technology on Autonomous User Architecture. The first post dealt with the A-User Control, the AI Module (AIM) that controls the other AIMs of the A-User and is possibly controlled by a human.

Context Capture is the A-User’s sensory front-end – the AIM that opens up perception by scanning the environment and assembling a structured snapshot of what’s out there in the moment. It is the first AI Module (AIM) in the loop providing the data and setting the stage for everything that follows. When A-User Control decides it’s time to engage, it prompts Context Capture to focus on a specific M-Location – the zone where the User is active, rendering its Avatar.

What Context Capture produces is called Context – a time-stamped, multimodal snapshot that represents the A-User’s initial understanding of the scene. But this isn’t just raw data. Context is composed of two key ingredients: Audio-Visual Scene Descriptors and User State.

The Audio-Visual Scene Descriptors are like a spatial sketch of the environment. They describe what’s visible and audible: objects, surfaces, lighting, motion, sound sources, and spatial layout. They provide the A-User with a sense of “what’s here” and “where things are.” But they’re not perfect. These descriptors are often shallow – they capture geometry and presence but not meaning. A chair might be detected as a rectangular mesh with four legs, but Context Capture doesn’t know if it’s meant to be sat on, moved, or ignored.

That’s where Spatial Reasoning comes in. Spatial Reasoning is the AIM that takes this raw spatial sketch and starts asking the deeper questions:

  • “Which object is the User referring to?”
  • “Is that sound coming from a relevant source?”
  • “Does this object afford interaction, or is it just background?”

It analyses the Context and produces two critical outputs:

  • Spatial Output: a refined map of spatial relationships, referent resolutions, and interaction constraints.
  • Spatial Guide: a set of cues that enrich the user’s input — highlighting which objects or sounds are relevant, how close they are, and how they might be used.
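
To make the two outputs tangible, here is a minimal sketch of a Spatial Reasoning pass over a Context snapshot; the dictionary keys are assumptions, not the normative Context, Spatial Output, or Spatial Guide formats.

  def spatial_reasoning(context):
      """Hypothetical pass over the Context produced by Context Capture."""
      objects = context["audio_visual_scene_descriptors"]["objects"]
      spatial_output = {                                # refined spatial relationships and constraints
          "referents": {o["id"]: o["position"] for o in objects},
          "constraints": [o["id"] for o in objects if o.get("blocking")],
      }
      spatial_guide = {                                 # cues that enrich the User's input
          "relevant": [o["id"] for o in objects if o.get("near_user")],
          "distances": {o["id"]: o["distance"] for o in objects},
      }
      return spatial_output, spatial_guide

  context = {
      "timestamp": "2025-11-17T09:00:00Z",
      "audio_visual_scene_descriptors": {
          "objects": [{"id": "chair-1", "position": (1.0, 0.0, 2.0),
                       "distance": 2.2, "near_user": True}],
      },
      "user_state": {"attention": "focused", "emotion": "curious"},
  }
  spatial_output, spatial_guide = spatial_reasoning(context)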

These outputs are sent downstream to Domain Access and Prompt Creation. The former refines the spatial understanding of the scene. The latter enriches the A-User’s query when it formulates the prompt to the Basic Knowledge (LLM).

Then there’s User State – a snapshot of the User’s cognitive, emotional, and attentional posture. Is the User focused, distracted, curious, frustrated? Context Capture reads facial expressions, gaze direction, posture, and vocal tone to infer a baseline state. But again, it’s just a starting point. User behaviour may be nuanced, and initial readings can be incomplete, noisy or ambiguous. That’s why User State Refinement exists – to track changes over time, infer deeper intent, and guide the alignment of the A-User’s expressive behaviour done by Personality Alignment.

In short, Context Capture is the A-User’s first glimpse of the world – a fast, structured perception layer that’s good enough to get started, but not good enough to finish the job. It’s the launchpad for deeper reasoning, richer modulation, and more expressive interaction. Without it, the A-User would be blind. With it, the system becomes situationally aware, emotionally attuned, and ready to reason – but only if the rest of the AIMs do their part.

Responses to the Call must reach the MPAI Secretariat (secretariat@mpai.community) by 2026/01/21.

To know more, register to attend the online presentations of the Call on 17 November at 9 UTC (https://tinyurl.com/y4antb8a) and 16 UTC (https://tinyurl.com/yc6wehdv).