Technical Specification: Portable Avatar Format (MPAI-PAF) V1.3, together with other MPAI Technical Specifications, provides technologies enabling the implementation of the Avatar-Based Videoconference Use Case, a form of videoconference held in Virtual Environments populated by Avatars that reproduce the visual appearance of the humans they represent and utter their voices.
MPAI-PAF V1.3 assumes that implementations will be based on Technical Specification: AI Framework (MPAI-AIF) V2.1.
Table 1 lists the AIWs specified by MPAI-PAF V1.3. Click a listed AIW to access its dedicated page, which includes its Functions, Reference Model, I/O Data, Functions of its AIMs, I/O Data of its AIMs, and a table providing links to the AIW, its AIMs, and the related JSON metadata.
All previously specified MPAI-PAF AI Workflows are superseded by those specified by V1.3 but may still be used if the version is explicitly mentioned.
Table 1 – AI Workflows specified by MPAI-PAF V1.3

| Acronym | Names and Specifications of AI Workflows | JSON |
| --- | --- | --- |
| PAF-CTX | Videoconference Client Transmitter | X |
| MMC-VMS | Virtual Meeting Secretary | X |
| PAF-AVS | Avatar Videoconference Server | X |
| PAF-CRX | Videoconference Client Receiver | X |
Figure 1 depicts the system composed of four types of subsystems specified as AI Workflows.
Figure 1 – Avatar-Based Videoconference end-to-end diagram
The components of the PAF-ABV (Avatar-Based Videoconference) system are:
- Participant: a human joining an ABV either individually or as a member of a group of humans in the same physical space.
- Audio-Visual Scene: a Virtual Audio-Visual Environment, described by Audio-Visual Scene Descriptors, equipped with Visual Objects, such as a Table and an appropriate number of Chairs, and with Audio Objects.
- Portable Avatar: a data set specified by MPAI-PAF that includes the data representing a human participant (see the payload sketch after this list).
- Videoconference Client Transmitter:
    - At the beginning of the conference:
        - Receives from Participants and sends to the Server Portable Avatars containing the Avatar Models and Language Selectors.
        - Sends to the Server Speech Object and Face Object for Authentication.
    - Continuously sends to the Server Portable Avatars containing Avatar Descriptors and Speech.
- Avatar Videoconference Server:
    - At the beginning of the conference:
        - Selects the Audio-Visual Scene Descriptors, e.g., a Meeting Room.
        - Equips the Room with Objects, i.e., a Table and Chairs.
        - Places Avatar Models around the Table, each with a given Spatial Attitude (see the placement sketch after this list).
        - Distributes Portable Avatars containing the Avatar Models, their Speech Objects and Spatial Attitudes, and the Audio-Visual Scene Descriptors to all Receiving Clients.
        - Authenticates Speech and Face Objects and assigns IDs to Avatars.
        - Sets the common conference language.
    - Continuously:
        - Translates the Speech of Participants according to their Language Selectors.
        - Sends Portable Avatars containing the Avatar Descriptors, Speech, and Spatial Attitudes of the Participants and of the Virtual Meeting Secretary to all Receiving Clients and to the Virtual Meeting Secretary.
- Virtual Meeting Secretary: an Avatar not corresponding to any Participant that continuously (see the workflow sketch after this list):
    - Uses the common meeting language.
    - Understands the Text Objects and Speech Objects of all Avatars and extracts their Personal Statuses.
    - Drafts a Summary of its understanding of the Avatars' Text Objects, Speech Objects, and Personal Statuses.
    - Displays the Summary either:
        - Outside of the Virtual Environment, for Participants to read and edit directly, or
        - In the Visual Space, for Avatars to comment on, e.g., via Text Objects.
    - Refines the Summary.
    - Sends its Portable Avatar containing its Avatar Descriptors to the Server.
- Videoconference Client Receiver:
    - At the beginning of the conference:
        - Receives the Audio-Visual Scene Descriptors and Portable Avatars containing Avatar Models with their Spatial Attitudes.
    - Continuously:
        - Receives Portable Avatars with Avatar Descriptors and Speech.
        - Produces Visual Scene Descriptors and Audio Scene Descriptors.
        - Renders the Audio-Visual Scene by spatially adding each Avatar's Speech Object at the Spatial Attitude of that Avatar's Mouth (see the rendering sketch after this list). Rendering may be done from a Point of View selected by the Participant, possibly different from the Position assigned to their Avatar in the Visual Scene, using a device of their choice (Head Mounted Display or 2D display/earpads) to experience the Audio-Visual Scene.
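The normative Portable Avatar format is the one defined by MPAI-PAF itself; the Python sketch below only illustrates, with assumed field names (participant_id, avatar_model, language_selector, avatar_descriptors, speech, spatial_attitude), the two kinds of payload a Videoconference Client Transmitter could assemble: one sent once at the beginning of the conference and one sent continuously.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

# Illustrative field names only; the normative Portable Avatar schema is the one
# specified by MPAI-PAF, not this sketch.

@dataclass
class SpatialAttitude:
    position: Tuple[float, float, float]     # metres in the Virtual Environment
    orientation: Tuple[float, float, float]  # Euler angles in radians

@dataclass
class PortableAvatar:
    participant_id: str
    avatar_model: Optional[bytes] = None        # full Avatar Model, sent once at start
    language_selector: Optional[str] = None     # e.g. "en-GB", sent once at start
    avatar_descriptors: Optional[bytes] = None  # per-update body/face descriptors
    speech: Optional[bytes] = None              # coded Speech Object
    spatial_attitude: Optional[SpatialAttitude] = None  # assigned by the Server

def start_of_conference_payload(pid: str, model: bytes, lang: str) -> PortableAvatar:
    """What the Transmitter sends once: Avatar Model and Language Selector."""
    return PortableAvatar(participant_id=pid, avatar_model=model, language_selector=lang)

def continuous_payload(pid: str, descriptors: bytes, speech: bytes) -> PortableAvatar:
    """What the Transmitter sends continuously: Avatar Descriptors and Speech."""
    return PortableAvatar(participant_id=pid, avatar_descriptors=descriptors, speech=speech)
```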
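MPAI-PAF does not prescribe how the Server chooses Spatial Attitudes; the sketch below shows just one plausible placement policy, seating N Avatar Models evenly around a round Table centred at the origin and orienting each towards the Table centre. The coordinate and yaw conventions are assumptions of this sketch.

```python
import math
from typing import List, Tuple

Vec3 = Tuple[float, float, float]

def place_avatars_around_table(n_avatars: int,
                               table_radius: float = 1.5,
                               seat_clearance: float = 0.6) -> List[Tuple[Vec3, Vec3]]:
    """Return one (position, orientation) pair per Avatar, evenly spaced around a
    round Table centred at the origin and facing its centre. Orientation is given
    as (pitch, yaw, roll) in radians, with yaw = 0 assumed to face the +z axis."""
    seats = []
    r = table_radius + seat_clearance
    for i in range(n_avatars):
        angle = 2.0 * math.pi * i / max(n_avatars, 1)
        x, z = r * math.cos(angle), r * math.sin(angle)
        yaw = math.atan2(-x, -z)  # heading from the seat towards the Table centre
        seats.append(((x, 0.0, z), (0.0, yaw, 0.0)))
    return seats
```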
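The Virtual Meeting Secretary is specified as an AI Workflow (MMC-VMS); the helpers in the sketch below are trivial placeholders for its AIMs (speech/text understanding, Personal Status extraction, summarisation) and only illustrate the shape of one pass of its continuous loop, not any MPAI-specified processing.

```python
from typing import Dict, List, Tuple

# Every helper below is a placeholder standing in for one or more AIMs of the
# MMC-VMS workflow; none of them is an MPAI-specified function.

def understand(portable_avatar: Dict) -> str:
    """Placeholder for understanding one Avatar's Text/Speech Objects."""
    return portable_avatar.get("text", "")

def extract_personal_status(utterance: str) -> str:
    """Placeholder for Personal Status extraction."""
    return "neutral"

def draft_summary(summary: str, utterances: List[str]) -> str:
    """Placeholder summariser: append whatever was understood to the running Summary."""
    new_points = "; ".join(u for u in utterances if u)
    return (summary + " " + new_points).strip() if new_points else summary

def secretary_step(summary: str,
                   portable_avatars: List[Dict],
                   comments: List[str]) -> Tuple[str, Dict]:
    """One pass of the continuous loop: understand, extract Personal Statuses,
    draft and refine the Summary, then emit the Secretary's own Portable Avatar."""
    utterances = [understand(pa) for pa in portable_avatars]
    _statuses = [extract_personal_status(u) for u in utterances]
    summary = draft_summary(summary, utterances)
    summary = draft_summary(summary, comments)  # refinement from Participants' or Avatars' comments
    secretary_portable_avatar = {"avatar_descriptors": b"", "summary": summary}
    return summary, secretary_portable_avatar
```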
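Rendering is left to the Client; the sketch below only illustrates the idea of attaching each Avatar's Speech Object at the Position of that Avatar's Mouth and deriving simple stereo gains from the Participant's selected Point of View. A real Receiver would use a proper spatial-audio renderer (e.g., HRTF-based) driven by the Audio Scene Descriptors; the 1/r attenuation and sine panning here are assumptions of this sketch.

```python
import math
from typing import List, Tuple

Vec3 = Tuple[float, float, float]

def stereo_gains_for_sources(mouth_positions: List[Vec3],
                             point_of_view: Vec3,
                             view_yaw: float) -> List[Tuple[float, float]]:
    """For each Avatar's Mouth position, return (left_gain, right_gain) as heard from
    the Participant's selected Point of View. Simple 1/r attenuation and sine panning;
    illustrative only, not the MPAI-specified rendering."""
    gains = []
    for sx, sy, sz in mouth_positions:
        dx, dz = sx - point_of_view[0], sz - point_of_view[2]
        distance = max(math.hypot(dx, dz), 0.1)   # avoid division by zero at the source
        azimuth = math.atan2(dx, dz) - view_yaw   # source angle relative to the view direction
        attenuation = 1.0 / distance
        pan = 0.5 * (1.0 + math.sin(azimuth))     # 0 = hard left, 1 = hard right
        gains.append((attenuation * (1.0 - pan), attenuation * pan))
    return gains
```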
Each component of the Avatar-Based Videoconference Use Case is implemented as an AI Workflow (AIW) composed of AI Modules (AIMs). The specification of each AIW includes the following elements:
| # | Element | Description |
| --- | --- | --- |
| 1 | Functions of the AIW | The functions performed by the AIW implementing the Use Case. |
| 2 | Reference Model of the AIW | The topology of the AIMs in the AIW. |
| 3 | Input and Output Data of the AIW | The Input and Output Data of the AIW. |
| 4 | Functions of the AIMs | The functions performed by the AIMs. |
| 5 | Input and Output Data of the AIMs | The Input and Output Data of the AIMs. |
| 6 | AIW, AIMs, and JSON Metadata | Links to the summary specifications on the web of the AIMs and to the corresponding JSON Metadata [2]. |