(Informative)
1 Introduction
An Autonomous User (A‑User) according to this PGM‑AUA standard is a User as defined by, and endowed with the capabilities specified in, Technical Specification: MPAI Metaverse Model (MPAI-MMM) – Technologies (MMM‑TEC), implemented with an AI Workflow as specified by Technical Specification: AI Framework (MPAI‑AIF).
Operating within an M‑Instance, the A‑User receives commands from the human responsible for it and uses these commands to guide its behaviour and decision‑making. It perceives its surroundings by capturing Perceptible Objects – Text Objects, Audio Objects, 3D Model Objects, and Visual Objects – from the Audio‑Visual Scene shared with the User it interacts with, whether that User is another Autonomous User or a Human User (H‑User) directly controlled by a person, as well as any other relevant objects present in the environment.
The A‑User processes the captured information through its sequence of AI Modules, progressively interpreting the context, determining intent, refining its understanding of the interacting User, and shaping its own planned behaviour. Based on the outcome of this processing, the A‑User may generate a Speaking Avatar rendered in the M‑Instance and may perform Actions or issue Process Action Requests consistent with its goals, its Personality, and the Rights it holds. In turn, the A‑User also receives the corresponding Process Action Responses produced by the M‑Instance, allowing it to incorporate the effects of its previous actions into its evolving internal state and to operate coherently within the ongoing interaction.
Figure 1 gives the Reference Model of the AI Workflow implementing the Autonomous User.

Figure 1 – Reference Model of Autonomous User Architecture (PGM-AUA)
The Autonomous User Architecture is composed of ten AI Modules. Per Technical Specification: AI Framework (MPAI-AIF), a conforming PGM-AUA implementation may have fewer or more AIMs, provided the PGM-AUA interfaces are preserved. Terms in bold represent AIMs and terms in italic represent Data Types.
The Autonomous User Control AIM (PGM-AUC) acts as the interface between the responsible human and the remaining nine AIMs. It receives Human Commands, issues Directives to the nine AIMs, receives their Status responses, and responds to the human with Human Command Status responses.
The Context Capture (PGM-CXC) AIM perceives the pertinent portion of the M-Instance and passes it to downstream AIMs. The core AIM is the Basic Knowledge (PGM-BKN) AIM, a Large Language Model whose level of capability is proper to the PGM-AUA implementation. The task of the PGM-AUA is to provide the Speech to be uttered by the PGM-AUA Persona in the M-Instance through a series of refinements in which the LLM deepens its understanding of the Text uttered by the User by accessing the appropriate spatial and User Entity State information via Model Context Protocol (MCP) interfaces. The A-User Formation (PGM-AUF) AIM uses the Final Response provided by the PGM-BKN and the A-User’s Entity State produced by the Personality Alignment (PGM-PAL) AIM to create the Speaking Avatar.
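The four-stage refinement described above can be sketched as a chain of transformations over the LLM response. This is a minimal illustration only; every function and field name below is invented, and the normative data types are defined in the individual AIM specifications.

```python
# Illustrative sketch of the PGM-AUA refinement chain; all names are
# hypothetical and chosen only to mirror the prose above.

def prompt_creation(context: dict) -> dict:
    """First query: turn the captured Context into an Initial Response."""
    return {"stage": "initial", "text": context["user_text"]}

def domain_access(response: dict) -> dict:
    """Second query: add domain-specific grounding -> Enhanced Response."""
    return {**response, "stage": "enhanced", "domain": "generic"}

def user_state_refinement(response: dict) -> dict:
    """Third query: resolve the User's short-term intent -> Refined Response."""
    return {**response, "stage": "refined", "intent": "greet"}

def personality_alignment(response: dict) -> dict:
    """Fourth query: apply the selected Personality -> Final Response."""
    return {**response, "stage": "final", "tone": "friendly"}

def run_workflow(context: dict) -> dict:
    out = prompt_creation(context)
    for stage in (domain_access, user_state_refinement, personality_alignment):
        out = stage(out)
    return out

final = run_workflow({"user_text": "Hello"})
print(final["stage"])  # final
```

Each stage keeps the earlier keys and adds its own, mirroring how each LLM query builds on the response produced by the previous one.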
The rest of this chapter provides a general overview of the PGM-AUA operation.
2 A-User Control
The A-User Control (PGM-AUC) AIM drives the A-User operation by:
- Interacting with the M-Instance environment through Actions and Process Actions based on the Rights it holds and the M-Instance Rules.
- Controlling the operation of all AIMs.
The human responsible for the A-User may take over or modify the A-User Control operation by issuing Human Commands. Figure 2 represents some input and output data of the A-User Control AIM, where the A-User Formation Directive issued to the A-User Formation (PGM-AUF) AIM has been specifically singled out.

Figure 2 – Simplified view of the Reference Model of A-User Control with A-User Formation
A Human Command received by the A-User Control AIM results in a Human Command Status response. A Process Action Request to an M-Instance Process (which may be another User) will generate a Process Action Response. Various types of commands (called Directives) sent by the A-User Control AIM to other PGM-AUA AIMs will generate responses (called Statuses). The A-User Formation Directive sent to the A-User Formation AIM will generate a Status response that will typically include a Speaking Avatar that the A-User Control AIM may MM-Add, MM-Animate, or MM-Move in the M-Instance. The specification of the A-User Control AIM is provided here.
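The command/response pairing described above (Human Command → Human Command Status, Directive → Status) can be modelled along the following lines; the field names are hypothetical, standing in for the normative data types of the A-User Control AIM specification.

```python
from dataclasses import dataclass, field

# Hypothetical message shapes; the normative Directive and Status data
# types are defined in the A-User Control AIM specification, not here.

@dataclass
class Directive:
    target_aim: str          # e.g. "PGM-AUF"
    operation: str           # e.g. "form-avatar"
    arguments: dict = field(default_factory=dict)

@dataclass
class Status:
    source_aim: str
    success: bool
    payload: dict = field(default_factory=dict)  # e.g. the Speaking Avatar

def dispatch(directive: Directive) -> Status:
    # A stub standing in for the real AIM-to-AIM transport.
    return Status(source_aim=directive.target_aim, success=True,
                  payload={"echo": directive.operation})

status = dispatch(Directive("PGM-AUF", "form-avatar"))
assert status.success and status.source_aim == "PGM-AUF"
```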
3 The Front End and Spatial Reasoning
The Context Capture AIM, prompted by the A-User Control AIM, perceives a particular location of the M-Instance – called Location – where the User in the M-Instance – i.e., the A-User’s conversation partner – has MM-Added its Avatar with a Point of View. In the M-Instance, the A-User perceives the environment by issuing MM-Capture Process Action Requests. In effect, these Requests may direct the A-User’s attention in various perceptual modes (see Process Actions).

Figure 3 – The Context Capture (PGM-CXC) AIM
Once captured, the multimodal data is processed; the result, called the Context, is a time-stamped snapshot of the Location composed of:
- Audio Scene Descriptors and Visual Scene Descriptors describing the spatial content.
- Entity State, describing the User’s cognitive, emotional, and attentional posture.
Thus, the Context represents the A-User’s initial understanding of the User and of the Location where it is embedded. The specification of the Context Capture AIM is provided here.
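A time-stamped Context snapshot of this kind might be represented as follows; the structure and field names are illustrative only, mirroring the two components listed above.

```python
import time
from dataclasses import dataclass, field

@dataclass
class Context:
    # Hypothetical structure mirroring the two components named above.
    timestamp: float
    audio_scene: dict = field(default_factory=dict)   # Audio Scene Descriptors
    visual_scene: dict = field(default_factory=dict)  # Visual Scene Descriptors
    entity_state: dict = field(default_factory=dict)  # cognitive/emotional/attentional posture

def capture_context(location: dict) -> Context:
    """Stub for the MM-Capture round trip described in the text."""
    return Context(
        timestamp=time.time(),
        audio_scene={"sources": location.get("audio", [])},
        visual_scene={"objects": location.get("visual", [])},
        entity_state={"attention": "speaker"},
    )

ctx = capture_context({"audio": ["speech"], "visual": ["avatar"]})
```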
The Context data reaches the Audio Spatial Reasoning (PGM-ASR) and Visual Spatial Reasoning (PGM-VSR) AIMs. They analyse the Context and send enhanced versions of the Audio Scene Descriptors and Visual Scene Descriptors to the Domain Access and Prompt Creation AIMs. The Audio Spatial Reasoning AIM may add audio source relevance, directionality, and proximity, while the Visual Spatial Reasoning AIM may add visual object relevance, proximity, referent resolution, and affordances.
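A minimal sketch of such an enhancement pass is given below; the scoring logic and field names are placeholders invented for illustration, not part of the Spatial Reasoning AIM specifications.

```python
# Hypothetical enrichment mirroring the Audio Spatial Reasoning role
# described above; the relevance heuristic is a placeholder.

def enhance_audio(descriptors: list, listener: tuple) -> list:
    out = []
    for d in descriptors:
        dx = d["position"][0] - listener[0]
        dy = d["position"][1] - listener[1]
        dist = (dx * dx + dy * dy) ** 0.5
        out.append({**d,
                    "proximity": dist,
                    "relevance": 1.0 / (1.0 + dist),   # nearer -> more relevant
                    "direction": (dx, dy)})
    return out

enhanced = enhance_audio([{"id": "s1", "position": (3.0, 4.0)}], (0.0, 0.0))
print(enhanced[0]["proximity"])  # 5.0
```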
The specifications of the Audio Spatial Reasoning and Visual Spatial Reasoning AIMs are provided here and here, respectively.
One stream of Audio Scene Descriptors and Visual Scene Descriptors reaches the Domain Access (PGM-DAC) AIM, where their spatial information is integrated with domain-specific information to provide further enhanced Audio Scene Descriptors and Visual Scene Descriptors as depicted in Figure 4. The specification of the Domain Access AIM is provided here.

Figure 4 – Domain Access further enhances Scene Descriptors
4 The first LLM query
The Context stream and a stream of enhanced Audio Scene Descriptors and Visual Scene Descriptors reach the Prompt Creation (PGM-PRC) AIM. As depicted in Figure 5, the goal of this AIM is to produce the PC-Prompt by integrating the Text information included in or derived from the Context (e.g., by recognising a Speech segment uttered by the User) with the additional information derived from the User Entity State (from Context Capture) as enhanced by the Audio Scene Descriptors and Visual Scene Descriptors (from Spatial Reasoning).

Figure 5 – Prompt Creation produces PC-Prompt
The workflow of the Prompt Creation and Basic Knowledge (PGM-BKN) AIM interaction unfolds as follows:
- The Prompt Creation AIM injects a PC‑Prompt, composed of structured and schema‑validated operations, into the Basic Knowledge AIM.
- The LLM parses the PC‑Prompt, selects the matching Prompt Creation‑exposed tool(s), and sends an MCP call with validated arguments to the MCP Server.
- The MCP Server runs the Prompt Creation‑registered tool handler(s) and returns a structured result of the tool execution.
- The LLM assembles the Initial Response (including tool result) and returns it to Prompt Creation.
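The four steps above can be sketched end to end. The tool names, payload shapes, and server stub below are all hypothetical; the actual tool surface is defined by each AIM's MCP registration.

```python
# Sketch of the Prompt Creation <-> Basic Knowledge <-> MCP Server round
# trip; all tool names and payload shapes are invented for illustration.

REGISTERED_TOOLS = {
    "pc.recognise_speech": lambda args: {"text": args["audio_ref"] + ":decoded"},
}

def mcp_call(tool: str, args: dict) -> dict:
    """Stands in for the MCP Server running the registered tool handler."""
    return REGISTERED_TOOLS[tool](args)

def basic_knowledge(pc_prompt: dict) -> dict:
    """Stands in for the LLM: parse the PC-Prompt, select the matching
    tool, call MCP, and assemble the Initial Response around the result."""
    op = pc_prompt["operations"][0]
    result = mcp_call(op["tool"], op["arguments"])
    return {"type": "initial_response", "tool_result": result}

pc_prompt = {"operations": [
    {"tool": "pc.recognise_speech", "arguments": {"audio_ref": "seg-1"}}]}
initial = basic_knowledge(pc_prompt)
print(initial["tool_result"]["text"])  # seg-1:decoded
```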
The Prompt Creation AIM is now able to inform the A-User Control AIM of the outcome of the LLM interrogation. The specification of the Prompt Creation AIM is provided here.
5 The second LLM query
As depicted in Figure 6, Domain Access is now able to perform its domain‑specific analysis using the same MCP‑structured representation of the LLM output. It formulates a structured, schema‑validated DA‑Prompt that expresses the domain‑specific operations the LLM must execute next. Through MCP, Domain Access sends the DA‑Prompt to Basic Knowledge, which processes it by invoking the Domain Access‑registered tool definitions. Basic Knowledge then returns an Enhanced Response – an MCP‑structured result that reflects the execution of Domain Access‑specific reasoning.
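"Schema-validated" can be illustrated with a minimal check applied before the DA-Prompt is sent. A real implementation would use a full JSON Schema validator; the schema below is invented, not the normative DA-Prompt definition.

```python
# Minimal hand-rolled validation standing in for JSON Schema; the schema
# itself is hypothetical, not the normative DA-Prompt definition.

DA_PROMPT_SCHEMA = {
    "required": ["domain", "operations"],
    "operation_required": ["tool", "arguments"],
}

def validate_da_prompt(prompt: dict) -> list:
    errors = [f"missing field: {k}" for k in DA_PROMPT_SCHEMA["required"]
              if k not in prompt]
    for i, op in enumerate(prompt.get("operations", [])):
        errors += [f"operation {i} missing: {k}"
                   for k in DA_PROMPT_SCHEMA["operation_required"] if k not in op]
    return errors

ok = {"domain": "retail", "operations": [{"tool": "da.ground", "arguments": {}}]}
bad = {"operations": [{"tool": "da.ground"}]}
print(validate_da_prompt(ok))   # []
print(validate_da_prompt(bad))  # ['missing field: domain', 'operation 0 missing: arguments']
```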

Figure 6 – Domain Access and its data
In addition to the Enhanced Response, Domain Access issues two guides:
- A User CXT Guide addressed to User State Refinement (PGM-USR) and conveying domain‑grounded cues (e.g., disambiguated semantics, role assignments, constraints, and context carried over from previous moments) that User State Refinement combines with its historical user model to refine the User Entity State.
- A Personality CXT Guide addressed to Personality Alignment (PGM-PAL) and conveying expressive and interactional cues (e.g., tone, pacing, gesture salience, and role‑style hints) that Personality Alignment will use with other data to select and shape the Personality used downstream.
6 The third LLM query
The User State Refinement AIM receives the same Enhanced Response as Domain Access, giving it access to the enriched, MCP‑structured interpretation produced by Basic Knowledge in response to the DA‑Prompt. Together with its own accumulated information about the User’s recent goals, behaviour, and interaction flow, User State Refinement uses this Enhanced Response to establish a clearer picture of the User’s short‑term intentions and situational posture.
While Domain Access focuses on domain semantics and spatial grounding, User State Refinement focuses on the User, analysing how the current moment aligns with what the User has been doing so far. To complete this interpretation, User State Refinement issues a UR‑Prompt to Basic Knowledge through MCP. This UR‑Prompt formulates a set of user‑centred operations – such as validation of inferred intent, clarification of ambiguous signals, or resolution of short‑term behavioural hypotheses – that the LLM must execute to support the refinement of the User Entity State.
As depicted in Figure 7, Basic Knowledge uses MCP to execute the User State Refinement‑registered tool definitions referenced in the UR‑Prompt and returns a Refined Response. This Refined Response resolves residual ambiguities and provides a more coherent representation of the User’s immediate goals, preferences, uncertainties, and behavioural cues. It gives User State Refinement the final piece of evidence needed to synthesise a stable, short‑term understanding of the User.
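Each querying AIM invokes only the tool definitions it has itself registered (Domain Access‑registered, User State Refinement‑registered, and so on). A namespaced registry sketch, with all tool names invented, illustrates this separation:

```python
# Hypothetical per-AIM tool registry: Basic Knowledge may only invoke
# tools registered by the AIM that issued the current prompt.

REGISTRY = {
    "PGM-DAC": {"da.ground": lambda a: {"grounded": True}},
    "PGM-USR": {"ur.validate_intent": lambda a: {"intent": a["hypothesis"],
                                                 "confirmed": True}},
}

def invoke(issuer: str, tool: str, args: dict) -> dict:
    tools = REGISTRY.get(issuer, {})
    if tool not in tools:
        raise PermissionError(f"{tool} is not registered by {issuer}")
    return tools[tool](args)

refined = invoke("PGM-USR", "ur.validate_intent", {"hypothesis": "ask-direction"})
assert refined["confirmed"]
try:
    invoke("PGM-USR", "da.ground", {})     # wrong issuer -> rejected
except PermissionError:
    pass
```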

Figure 7 – The actors of the third query
With this Refined Response, User State Refinement completes its stateful reasoning process and assembles the Expressive State Guide (ESG) – a structured data type that captures the refined short‑term user context. The ESG is then provided to Personality Alignment (PGM-PAL), which uses it, together with the Personality CXT Guide from Domain Access, to select and shape the Personality that will drive the final expressive output of the A‑User.
7 The fourth LLM query
Personality Alignment (PGM-PAL) receives the same Refined Response from Basic Knowledge as User State Refinement, giving it access to a consolidated understanding of the User’s short‑term goals, semantic interpretation, and behavioural cues. Together with the Expressive State Guide (ESG) received from User State Refinement and the Personality CXT Guide received from Domain Access, Personality Alignment interprets how the User is likely to respond, engage, and interact in the immediate context.
Using this combined information, Personality Alignment determines the Personality that the A‑User should adopt for the current interaction. This Personality describes expressive, behavioural, and multimodal characteristics such as tone, tempo, verbal style, facial animation cues, gesture profile, behavioural traits, and the interactional role to be assumed. Once selected, Personality Alignment encodes its expressive and behavioural prescriptions in the form of a PA‑Prompt, a structured MCP prompt that instructs Basic Knowledge on how to shape the next A‑User speech and visual appearance so that it conforms to the chosen Personality.
As depicted in Figure 8, Personality Alignment sends this PA‑Prompt to Basic Knowledge through MCP. The LLM interprets the PA‑Prompt, invokes the PAL‑registered tool definitions, and executes the corresponding expressive and behavioural constraints as part of its reasoning process. It then returns a Final Response, structured in the same MCP format used throughout the workflow. This Final Response is delivered both to Personality Alignment – which uses it to verify that the intended expressive constraints were correctly applied – and to A‑User Formation (PGM-AUF), which uses it alongside the A‑User Entity State produced by PAL and the directives from A‑User Control to synthesise the final multimodal Avatar output.
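The verification step, in which Personality Alignment checks that its expressive constraints survived into the Final Response, can be sketched as follows; the constraint vocabulary is invented for illustration.

```python
# Hypothetical check that the Final Response honours the PA-Prompt
# constraints; "tone" and "tempo" are invented constraint names.

def verify_constraints(pa_prompt: dict, final_response: dict) -> list:
    """Return the names of constraints the Final Response failed to apply."""
    applied = final_response.get("expressive", {})
    return [k for k, v in pa_prompt["constraints"].items()
            if applied.get(k) != v]

pa_prompt = {"constraints": {"tone": "friendly", "tempo": "slow"}}
final_ok = {"expressive": {"tone": "friendly", "tempo": "slow"}}
final_bad = {"expressive": {"tone": "brisk", "tempo": "slow"}}
print(verify_constraints(pa_prompt, final_ok))   # []
print(verify_constraints(pa_prompt, final_bad))  # ['tone']
```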

Figure 8 – The actors of the fourth query
In this way, the fourth and last LLM query ensures that the A‑User’s produced response is not only semantically coherent and contextually grounded but also aligned with the selected expressive Personality, enabling consistent and traceable rendering of the A‑User’s behaviour.
Figure 9 depicts the operation of A‑User Formation (PGM-AUF). It receives the Final Response from Basic Knowledge, the A‑User Entity State produced by Personality Alignment, and the current directive from A‑User Control. AUF combines these three inputs to synthesise the multimodal behaviour of the A‑User. Drawing on the expressive and behavioural cues encoded in the A‑User Entity State, AUF coordinates speech, facial expression, gesture, and timing so that the resulting output matches the Personality selected upstream. The semantic content and expressive framing contained in the Final Response guide the generation of the A‑User’s utterance, while the directive from A‑User Control ensures compliance with procedural constraints and action requirements. The resulting Speaking Avatar is returned to A‑User Control as part of the AIM’s Status, completing the A‑User workflow for the current interaction step.
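AUF's three-way combination can be sketched as below; all structures and field names are hypothetical, standing in for the normative Final Response, A-User Entity State, and Directive data types.

```python
# Hypothetical synthesis of the Speaking Avatar from the three inputs
# named above: Final Response, A-User Entity State, and AUC directive.

def form_avatar(final_response: dict, entity_state: dict, directive: dict) -> dict:
    if not directive.get("render", False):    # procedural constraint from AUC
        return {"rendered": False}
    return {
        "rendered": True,
        "utterance": final_response["text"],          # semantic content
        "tone": entity_state.get("tone", "neutral"),  # expressive cues
        "gesture": entity_state.get("gesture", "idle"),
    }

avatar = form_avatar({"text": "Welcome!"},
                     {"tone": "friendly", "gesture": "wave"},
                     {"render": True})
print(avatar["utterance"])  # Welcome!
```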

Figure 9 – The result of the A-User processing is rendered as its Persona