(Tentative)
Contents: 1 Function; 2 Reference Model; 3 Input/Output Data; 4 Functions of AI Modules; 5 Input/output Data of AI Modules; 6 AIW, AIMs, and JSON Metadata
1 Function
The A-User Architecture is represented by an AI Workflow that:
- May receive a Command from a human.
- Captures the Text Objects, Audio Objects, 3D Model Objects, and Visual Objects of an Audio-Visual Scene in an M-Instance that includes one User, either Autonomous (A-User) or Human (H-User), as a result of an Action performed by the A-User or requested by a Human Command.
- Produces an Action or a Process Action Request that may reference the A-User’s Persona, i.e., the speaking Avatar generated by the A-User in response to the input data.
- Receives Process Action Responses to the Process Action Requests made by the A-User.
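A non-normative sketch of these input and output Data as TypeScript types follows; all type and field names are illustrative assumptions, not the normative JSON syntaxes referenced in Table 6.

```typescript
// Non-normative sketch of the A-User AIW's top-level input/output Data.
// All type and field names are illustrative assumptions.
interface HumanCommand { text: string; timestamp: string; }

interface TextObject { text: string; }            // User input as text
interface AudioObject { streamUri: string; }      // Audio component of the Scene
interface Model3DObject { assetUri: string; }     // 3D Model component of the Scene
interface VisualObject { streamUri: string; }     // Visual component of the Scene

interface Action {
  actionType: string;                             // e.g., utter speech, move Persona
  personaRef?: string;                            // optional reference to the A-User's Persona
}

interface ProcessActionRequest { targetProcess: string; action: Action; }
interface ProcessActionResponse { requestId: string; outcome: string; }
```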
2 Reference Model
Figure 1 gives the Reference Model of the AI Workflow implementing the Autonomous User; it is an initial diagram of the A-User architecture.

Let’s walk through this model.
The A-User Control AIM drives the A-User’s operation, controlling how it interacts with the environment and how it performs Actions and Process Actions based on the Rights it holds and on the M-Instance Rules, by:
- Performing an Action or requesting another Process to perform it.
- Controlling the operation of the AIMs, in particular A-User Rendering.
The human may take over or modify the operation of A-User Control.
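As an illustration of this gating role, the following non-normative sketch shows how an Action might be performed, delegated, or denied depending on the Rights held and the M-Instance Rules; `RightsSet`, `MInstanceRules`, and the decision policy are assumptions of this sketch.

```typescript
// Non-normative sketch: A-User Control gates each Action on Rights and M-Instance Rules.
type Decision = "perform" | "request" | "deny";

interface MInstanceRules {
  allows(actionType: string): boolean; // do the M-Instance Rules permit this Action?
}

interface RightsSet {
  has(actionType: string): boolean;    // does the A-User hold the Right to this Action?
}

function decideAction(actionType: string, rights: RightsSet, rules: MInstanceRules): Decision {
  if (!rules.allows(actionType)) return "deny"; // assumed: Rules prevail over Rights
  if (rights.has(actionType)) return "perform"; // perform the Action directly
  return "request";                             // else issue a Process Action Request
}
```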

The Context Capture AIM, prompted by A-User Control, perceives a particular location of the M-Instance, called M-Location, where the User, i.e., the A-User’s conversation partner, renders its Avatar. The result of the capture is called Context, a time-stamped structured snapshot representing the A-User’s initial understanding of the M-Location. Context is composed of:
- Audio-Visual Scene Descriptors, describing the spatial content.
- User State, describing the User’s cognitive, emotional, and attentional posture within the environment.
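A non-normative sketch of a possible Context structure follows; field names are assumptions and do not anticipate the normative JSON Schema.

```typescript
// Non-normative sketch of Context; field names are assumptions.
interface AudioVisualSceneDescriptors {
  audioSources: { id: string; position: [number, number, number] }[];
  visualObjects: { id: string; position: [number, number, number]; label?: string }[];
}

interface UserState {
  cognitive: string;     // e.g., "focused"
  emotional: string;     // e.g., "calm"
  attentional: string;   // e.g., "gazing at object-12"
}

interface Context {
  timestamp: string;     // Context is a time-stamped snapshot
  mLocation: string;     // the perceived M-Location
  sceneDescriptors: AudioVisualSceneDescriptors;
  userState: UserState;
}
```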
The Spatial Reasoning AIM analyses Context and sends:
- Audio and Visual Spatial Output, i.e., spatial relationships, referent resolutions, and interaction constraints, to the Domain Access AIM, seeking additional domain-specific information.
- Audio and Visual Spatial Guides, i.e., audio source relevance, directionality, and proximity (Audio) and object relevance, orientation, proximity, and affordance (Visual), to the Prompt Creation AIM, which enriches the User’s spoken or written input with additional User information before sending a PC-Prompt to Basic Knowledge, a basic LLM. The PC-Prompt includes (see the sketch after this list):
  - User Text and User State (from Context Capture).
  - Audio and Visual Spatial Guides (from Spatial Reasoning).
Note that Spatial Reasoning is specified as two separate AIMs, Audio Spatial Reasoning and Visual Spatial Reasoning.
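The following non-normative sketch illustrates possible shapes of the Spatial Guides and of the PC-Prompt, and how Prompt Creation might flatten them into text for Basic Knowledge; all names are assumptions.

```typescript
// Non-normative sketch of the Spatial Guides and the PC-Prompt; all names are assumptions.
interface AudioSpatialGuide {
  sourceRelevance: Record<string, number>; // per audio source, 0..1
  directionality: string;                  // e.g., "front-left"
  proximity: number;                       // e.g., metres
}

interface VisualSpatialGuide {
  objectRelevance: Record<string, number>;
  orientation: string;
  proximity: number;
  affordances: string[];                   // e.g., ["can be grasped"]
}

interface PCPrompt {
  userText: string;                                                         // from Context Capture
  userState: { cognitive: string; emotional: string; attentional: string }; // from Context Capture
  audioGuide: AudioSpatialGuide;                                            // from Audio Spatial Reasoning
  visualGuide: VisualSpatialGuide;                                          // from Visual Spatial Reasoning
}

// Prompt Creation flattens the structured fields into a natural-language prompt text.
function toPromptText(p: PCPrompt): string {
  return [
    p.userText,
    `[user state] ${JSON.stringify(p.userState)}`,
    `[audio guide] ${JSON.stringify(p.audioGuide)}`,
    `[visual guide] ${JSON.stringify(p.visualGuide)}`,
  ].join("\n");
}
```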

Basic Knowledge sends Domain Access an Initial Response containing a core answer as a direct response to the PC-Prompt and general reasoning based on foundational LLM capabilities.
Domain Access:
- Processes and responds to two flows:
  - Audio and Visual Spatial Output:
    - Accesses domain-specific models, ontologies, or M-Instance services.
    - Returns an Audio and Visual Spatial Directive that injects contextual priors, scene-specific logic, and task relevance into the reasoning loop of Spatial Reasoning, improving the fidelity of its spatial interpretation.
  - Initial Response:
    - Accesses domain-specific models, ontologies, or M-Instance services to retrieve:
      - Scene-specific object roles (e.g., “this is a surgical tool”).
      - Task-specific constraints (e.g., “only authorised Users may interact”).
      - Semantic affordances (e.g., “this object can be grasped”).
    - Returns to Basic Knowledge a DA-Prompt that includes initial reasoning, spatial semantics, domain overlays, and User/task constraints.
- In addition to the Audio and Visual Spatial Directive and the DA-Prompt, produces a Refined Context Guide that:
  - Includes a structured object with:
    - Updated User descriptors.
    - Scene salience and relevance.
    - Interaction history and inferred goals.
  - Enables User State Refinement to update the User State and to generate a UR-Prompt that reflects the refined understanding.
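A non-normative sketch of the three Domain Access outputs described above (Spatial Directive, DA-Prompt, and Refined Context Guide) follows; all field names are assumptions.

```typescript
// Non-normative sketch of the Domain Access outputs; all field names are assumptions.
interface SpatialDirective {                  // returned to Spatial Reasoning
  contextualPriors: string[];                 // scene-specific logic
  taskRelevance: Record<string, number>;      // per object or audio source
}

interface DAPrompt {                          // returned to Basic Knowledge
  initialReasoning: string;
  spatialSemantics: string[];
  domainOverlays: string[];                   // e.g., "this is a surgical tool"
  userTaskConstraints: string[];              // e.g., "only authorised Users may interact"
}

interface RefinedContextGuide {               // sent to User State Refinement
  updatedUserDescriptors: Record<string, string>;
  sceneSalience: Record<string, number>;      // salience and relevance per scene item
  interactionHistory: string[];
  inferredGoals: string[];
}
```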

Basic Knowledge produces and sends Enhanced Response to the User State Refinement AIM.
User State Refinement refines its understanding of the User State and sends:
- A UR-Prompt to Basic Knowledge.
- An Expressive State Guide to the Personality Alignment AIM, i.e., a structured representation of the A-User’s current User State that informs Personality Alignment how to adopt an A-User Personality that is emotionally effective and contextually appropriate.
Basic Knowledge produces and sends to Personality Alignment a Refined Response.
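A non-normative sketch of the Expressive State Guide follows; the fields beyond the User State itself (e.g., a suggested register) are assumptions of this sketch.

```typescript
// Non-normative sketch of the Expressive State Guide sent to Personality Alignment.
interface ExpressiveStateGuide {
  userState: {
    cognitive: string;          // e.g., "attentive"
    emotional: string;          // e.g., "mildly frustrated"
    attentional: string;        // e.g., "focused on the A-User's Avatar"
  };
  suggestedRegister?: string;   // assumption: e.g., "empathetic", a hint for Personality selection
  confidence?: number;          // assumption: 0..1, how firm the refined understanding is
}
```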

Personality Alignment:
- Selects a Personality based on the Refined Response and the Expressive State Guide, conveying a variety of elements such as:
  - Expressivity, e.g.:
    - Tone, e.g., formal, casual, empathetic, assertive.
    - Tempo, e.g., fast, slow, rhythmic.
    - Gesture style, e.g., expansive, restrained, animated.
    - Facial dynamics, e.g., smile frequency, gaze behaviour, eyebrow movement.
  - Behavioural Traits, e.g.:
    - Verbosity level.
    - Use of metaphors or humour.
    - Degree of emotional expressiveness.
    - Type of role: assistant, mentor, negotiator, entertainer, etc.
- Formulates and sends (see the sketch after this list):
  - An A-User Personal Status (using the MPAI-specified Personal Status) reflecting the Personality, to A-User Rendering.
  - A PA-Prompt to Basic Knowledge reflecting:
    - Speech modulation instructions (e.g., pitch, emphasis).
    - Facial expression timing and intensity.
    - Gesture choreography.
    - Synchronisation cues across modalities.
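The two outputs just listed might be represented as follows; this is a non-normative sketch whose field names are assumptions, and whose Personal Status fields only loosely mirror the MPAI-specified Personal Status.

```typescript
// Non-normative sketch of the two Personality Alignment outputs.
interface AUserPersonalStatus {     // conveyed to A-User Rendering
  emotion: string;                  // e.g., "empathetic"
  cognitiveState: string;           // e.g., "confident"
  socialAttitude: string;           // e.g., "supportive"
}

interface PAPrompt {                // conveyed to Basic Knowledge
  speechModulation: { pitch: string; emphasis: string[] };              // e.g., pitch, emphasis
  facialExpressions: { expression: string; startMs: number; intensity: number }[];
  gestureChoreography: string[];    // ordered gesture cues
  syncCues: string[];               // synchronisation cues across modalities
}
```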
Basic Knowledge sends a Final Response that conveys semantic content, contextual integration, expressive framing, and personality consistency.

A-User Rendering uses the Final Response and the A-User Personal Status to synthesise a speaking Avatar, and the Command received from A-User Control to shape the speaking Avatar.
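A non-normative signature for this rendering step, with all names assumed, could be:

```typescript
// Non-normative signature of the rendering step; all names are assumptions.
interface PersonalStatus { emotion: string; cognitiveState: string; socialAttitude: string; }
interface SpeakingAvatar { audioStream: string; visualStream: string; } // synthesised streams

declare function renderAvatar(
  finalResponse: string,                    // from Basic Knowledge
  personalStatus: PersonalStatus,           // from Personality Alignment
  controlCommand?: Record<string, unknown>  // from A-User Control, shaping the Avatar
): SpeakingAvatar;
```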

With the exception of Basic Knowledge, the A-User AIMs are not required to have language and reasoning capabilities. Prompt Creation, Domain Access, User State Refinement, and Personality Alignment convert their Data to/from text according to the JSON Schemas whose names are given in Table 2. Information propagates through Basic Knowledge from the AIMs in the first column of Table 2 to the right, e.g., from Prompt Creation to Basic Knowledge, and back from the right to the left, e.g., from Basic Knowledge to Domain Access.
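The resulting staged flow (summarised in Table 2 below) can be sketched as follows; this is a non-normative illustration in which `llm` stands for Basic Knowledge and each callback stands for the text-producing side of the corresponding AIM.

```typescript
// Non-normative sketch of the staged prompt/response flow through Basic Knowledge.
type LLM = (promptText: string) => string; // stands for the Basic Knowledge AIM

function aUserTurn(
  llm: LLM,
  pcPrompt: string,                                          // built by Prompt Creation
  domainAccess: (initialResponse: string) => string,         // builds the DA-Prompt
  stateRefinement: (enhancedResponse: string) => string,     // builds the UR-Prompt
  personalityAlignment: (refinedResponse: string) => string  // builds the PA-Prompt
): string {
  const initialResponse = llm(pcPrompt);                          // PC-Prompt -> Initial Response
  const enhancedResponse = llm(domainAccess(initialResponse));    // DA-Prompt -> Enhanced Response
  const refinedResponse = llm(stateRefinement(enhancedResponse)); // UR-Prompt -> Refined Response
  return llm(personalityAlignment(refinedResponse));              // PA-Prompt -> Final Response
}
```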
Table 2 – The flow of Prompts and Responses through the A-User’s Basic Knowledge
| AIM | JSON Data | Text Data | AIM |
| Prompt Creation | PC-Prompt Plan | PC-Prompt | Basic Knowledge |
| Domain Access | DA Input | Initial Response | |
| | DA-Prompt Plan | DA-Prompt | |
| User State Refinement | UR Input | Enhanced Response | |
| | UR-Prompt Plan | UR-Prompt | |
| Personality Alignment | PA Input | Refined Response | |
| | PA-Prompt Plan | PA-Prompt | |
| A-User Output | | Final Response | |
3 Input/Output Data
Table 3 gives the Input/Output Data of the Autonomous User AIW.
Table 3 – Input/output data of the Autonomous User
| Input | Description |
| Human Command | A command from the responsible human taking over or complementing control of the A-User. |
| Process Action Response | Generated by an M-Instance Process in response to the A-User’s Process Action Request. |
| Text Object | User input as text. |
| Audio Object | The Audio component of the Scene where the User is embedded. |
| 3D Model Object | The 3D Model component of the Scene where the User is embedded. |
| Visual Object | The Visual component of the Scene where the User is embedded. |
| Output | Description |
| Human Command Status | The status of the A-User’s execution of a Human Command. |
| Action | Action performed by A-User. |
| Process Action Request | A-User’s Process Action Request. |
4 Functions of AI Modules
Table 4 gives the functions performed by PGM-AUA AIMs.
Table 4 – Functions of PGM-AUA AIMs
| Acronym | Name | Definition |
| PGM-AUC | A-User Control | Governs the operational lifecycle of the A-User through its AIMs and orchestrates its interaction with both the M-Instance and the human User; performs Actions and Process Action Requests, such as uttering speech or moving its Persona (Avatar), consequent to its interactions with the User. |
| PGM-CXT | Context Capture | Captures at least one of Text, Audio, 3D Model, and Visual, and produces Context, a representation of the User and the environment where the User is located. |
| PGM-ASR | Audio Spatial Reasoning | Transforms raw Audio Scene Descriptors and Audio cues into semantic outputs sent to Prompt Creation (PRC), which uses them to enhance User Text, and to Domain Access (DAC), to seek additional information. |
| PGM-VSR | Visual Spatial Reasoning | Transforms raw Visual Scene Descriptors, gesture vectors, and gaze cues into semantic outputs sent to Prompt Creation (PRC), which uses them to enhance User Text, and to Domain Access (DAC), to seek additional information. |
| PGM-PRC | Prompt Creation | Transforms the semantic inputs received from Context Capture (CXT), from Audio and Visual Spatial Reasoning (SPR), and, indirectly, from Domain Access (DAC) as responses provided to SPR, into natural-language prompts (PC-Prompts) to Basic Knowledge. |
| PGM-BKN | Basic Knowledge | A language model – not necessarily general-purpose – that receives the enriched texts from Prompt Creation (PRC), Domain Access (DAC), User State Refinement (USR), and Personality Alignment (PAL) and converts them into responses that the various AIMs use to gradually produce the Final Response. |
| PGM-DAC | Domain Access | Performs the following main functions: – Interprets the Spatial Outputs from SPR and any User-related semantic inputs (from User). – Selects and activates domain-specific behaviours to deal with the specific input from SPR and BKN. – Produces semantically enhanced outputs to SPR and BKN. |
| PGM-USR | User State Refinement | Modulates the Enhanced Response from BKN into a User State and Context-aware UR-Prompt, which is then sent to BKN. |
| PGM-PAL | Personality Alignment | Modulates the Refined Response into an A-User Personality Profile-aware PA-Prompt, which is then sent to BKN. |
| PGM-AUR | A-User Rendering | Receives the Final Response from BKN, A-User Personal Status from Personality Alignment (PAL), and Command from A-User Control and renders the A-User as a speaking Avatar. |
5 Input/output Data of AI Modules
Table 5 provides acronyms, names, and links to the specifications of the AI Modules composing the PGM-AUA AIW and of their input/output data. The current specification is tentative and is expected to evolve based on the Responses to the Call for Technologies.
Table 5 – Input/output Data of AI Modules
6 AIW, AIMs, and JSON Metadata
Table 6 provides the links to the AIW and AIM specifications and to the JSON syntaxes.
Table 6 – AIW, AIMs, and JSON Metadata
| AIW | AIMs | Name | JSON |
| PGM-AUA | | Autonomous User | X |
| | PGM-AUC | A-User Control | X |
| | PGM-CXT | Context Capture | X |
| | PGM-ASR | Audio Spatial Reasoning | X |
| | PGM-VSR | Visual Spatial Reasoning | X |
| | PGM-PRC | Prompt Creation | X |
| | PGM-BKN | Basic Knowledge | X |
| | PGM-DAC | Domain Access | X |
| | PGM-USR | User State Refinement | X |
| | PGM-PAL | Personality Alignment | X |
| | PGM-AUR | A-User Rendering | X |