This is the public page of Mixed-reality Collaborative Spaces (MPAI-MCS), an MPAI standard project developing technologies for scenarios where geographically separated humans, represented by avatars, collaborate in virtual-reality spaces where:

  1. Virtual Twins of humans – embodied in speaking avatars having a high level of similarity, in terms of voice and appearance, with their Human Twins – are directed by their Human Twins to achieve an agreed goal.
  2. Human-like speaking avatars that do not represent a human, possibly without a visual appearance, e.g., a secretary that takes notes of the meeting, answers questions, etc.

The space where the collaboration takes place is called Environment. It can be anything from a fictitious space to a replica of a real space.

MPAI is currently investigating the Use Case called Avatar-Based Videoconference, where each participant is represented by an avatar sitting at a table. The avatars faithfully reproduce the participants' speech, faces, and gestures. This is achieved by using the emotion extracted from the speech, face, and gestures of the participants.

MPAI-MCS seeks to define standard formats for the Environment and for the Avatar so that, by owning an MCS client, a participant can:

  1. Distribute their own avatars reproducing their activity and speech to other participants in the virtual conference.
  2. Assemble the videoconference room using the received avatars and participate in it.

The end-to-end block diagram of the Avatar-Based Videoconference Use Case is given in the figure below, where:

  • Each participant sends:
    • to the server (at start): language preferences, avatar model, and speech and face descriptors for authentication.
    • to the server and the virtual secretary (during the conference): avatar descriptors, and speech and text.
  • The server sends each participant:
    • (at start): the Environment description and the avatar models.
    • (during the conference): participant ID, speech and text in the requested language, and avatar descriptors.
  • The virtual secretary sends each participant its own avatar model (at start), and avatar descriptors, speech, and text (during the conference).
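The message flows above can be sketched as data structures. This is a minimal illustration only: MPAI-MCS has not yet defined the standard formats, so every class, field, and type name below is an assumption made for clarity.

```python
from dataclasses import dataclass

# All names below are illustrative only; MPAI-MCS has not yet
# standardized these message formats.

@dataclass
class StartupToServer:
    """Sent once by each participant when joining."""
    language_preferences: list[str]   # e.g. ["en", "it"]
    avatar_model: bytes               # the participant's avatar model
    speech_descriptors: bytes         # for speech-based authentication
    face_descriptors: bytes           # for face-based authentication

@dataclass
class InConferenceToServer:
    """Sent continuously during the conference."""
    participant_id: str
    avatar_descriptors: bytes         # drive the avatar's animation
    speech: bytes                     # speech signal, sent as is
    text: str = ""                    # optional text

@dataclass
class StartupFromServer:
    """Sent by the server to each participant at the start."""
    environment_description: bytes
    avatar_models: dict[str, bytes]   # participant ID -> avatar model

@dataclass
class InConferenceFromServer:
    """Forwarded by the server during the conference."""
    participant_id: str
    speech: bytes                     # in the requested language
    text: str
    avatar_descriptors: bytes

msg = InConferenceToServer("p1", b"\x00", b"\x01", "hello")
```

The split between a one-time startup exchange and a continuous in-conference stream mirrors the two bullets above for each sender.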

Figure 1 – Reference Model of the Avatar-Based Videoconference

The figures below describe the internals of the four system components with a particular partitioning of functionality: transmitting client (Figure 2), server (Figure 3), virtual secretary (Figure 4), and receiving client (Figure 5). Different partitions can be obtained by moving internal components from one system component (the blue blocks in the figure above) to another.

At the start of the meeting the client sends the language preference and the avatar descriptors to the server. While the conference is on, the client continuously generates audio and visual scene descriptors, the former providing the individual speech sources and their locations, the latter the individual humans in the room and their locations. Part of the visual descriptors are used to enable face-based participant authentication and part to generate the avatar descriptors. Part of the speech descriptors are used to enable speech-based participant authentication and part to provide additional information to Avatar Description to refine the avatar descriptors. The participant's speech is sent to the server as is.
Figure 2 – The Avatar-Based Videoconference client (transmitter)
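The transmitting client's per-frame pipeline described above can be sketched as follows. The helper functions and descriptor fields are invented for illustration (the actual MPAI-MCS descriptor formats are still to be standardized), and the scene-description steps are stubbed:

```python
# Illustrative sketch of the transmitting client's pipeline.
# All helper names and descriptor fields are assumptions; the
# MPAI-MCS descriptors are not yet defined.

def describe_audio_scene(audio_frame: bytes) -> list[dict]:
    """Stub: locate individual speech sources and their positions."""
    return [{"position": (0.0, 1.0), "voiceprint": b"vp"}]

def describe_visual_scene(video_frame: bytes) -> list[dict]:
    """Stub: locate individual humans and their positions."""
    return [{"position": (0.0, 1.0), "face": b"face", "body": b"body"}]

def process_capture(audio_frame: bytes, video_frame: bytes) -> dict:
    speech_sources = describe_audio_scene(audio_frame)
    humans = describe_visual_scene(video_frame)
    return {
        # part of the visual/speech descriptors supports authentication
        "face_auth": [h["face"] for h in humans],
        "speech_auth": [s["voiceprint"] for s in speech_sources],
        # part of the visual descriptors generates the avatar
        # descriptors, refined by information from the speech
        "avatar_descriptors": [h["body"] for h in humans],
        # the participant's speech is sent to the server as is
        "speech": audio_frame,
    }

out = process_capture(b"audio", b"video")
```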
The server performs the functions of:

  1. Distributing the Environment Model to participants.
  2. Authenticating participants using face and speech descriptors.
  3. Uniquely associating speech sources and avatar descriptors.
  4. Forwarding the received and processed information to participants.
Figure 3 – The Avatar-Based Videoconference server
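The four server functions can be sketched as a small class. This is one possible reading of the list above, with authentication and translation stubbed out and all interface names assumed, since MPAI-MCS defines no such interfaces yet:

```python
# Illustrative sketch of the server's four functions; every name is
# an assumption, and authentication/translation are stubs.

def translate(speech: bytes, language: str) -> bytes:
    return speech  # stub: a real server would translate speech/text

class ConferenceServer:
    def __init__(self, environment_model: bytes):
        self.environment_model = environment_model
        self.participants: dict[str, dict] = {}

    def authenticate(self, face_desc: bytes, speech_desc: bytes) -> bool:
        return True  # stub for function 2

    def join(self, pid: str, face_desc: bytes, speech_desc: bytes,
             language: str) -> bytes:
        # Function 2: authenticate using face and speech descriptors.
        if not self.authenticate(face_desc, speech_desc):
            raise PermissionError(pid)
        self.participants[pid] = {"language": language}
        # Function 1: distribute the Environment Model.
        return self.environment_model

    def relay(self, pid: str, speech: bytes, avatar_desc: bytes) -> list:
        # Function 3: speech and avatar descriptors carry the same
        # participant ID, keeping them uniquely associated.
        # Function 4: forward to every other participant, with speech
        # in each one's requested language.
        return [(other, pid, translate(speech, info["language"]), avatar_desc)
                for other, info in self.participants.items() if other != pid]

server = ConferenceServer(b"env")
server.join("a", b"", b"", "en")
server.join("b", b"", b"", "it")
msgs = server.relay("a", b"hi", b"desc")
```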

The virtual secretary produces a summary of the utterances of the avatars, integrated with its understanding of their emotions. The summary can then be forwarded to an external application where participants can edit it.

In a more sophisticated setup, avatars can interact with the virtual secretary via speech and text. The virtual secretary edits the summary taking into account the avatars' utterances and their emotions.

Figure 4 – The Virtual Secretary of the Avatar-Based Videoconference
The participant:

  1. Places at positions of their liking the avatars generated by the clients, with the associated speech.
  2. Selects the point from which to see and hear the videoconference (not necessarily the position of their own avatar).
  3. Participates in the videoconference.
Figure 5 – An MCS client (receiver)
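The receiving client's scene assembly can be sketched as below. Positions, types, and method names are illustrative assumptions, not part of any MPAI-MCS specification:

```python
# Illustrative sketch of the receiving client; all names are
# assumptions made for this example.

class ReceivingClient:
    def __init__(self):
        self.scene: dict[str, dict] = {}
        self.viewpoint = (0.0, 0.0, 0.0)

    def place_avatar(self, pid: str, avatar_model: bytes,
                     position: tuple) -> None:
        # Step 1: place each received avatar, with its associated
        # speech source, at a position of the participant's liking.
        self.scene[pid] = {"model": avatar_model, "position": position}

    def set_viewpoint(self, position: tuple) -> None:
        # Step 2: the point from which to see/hear the conference
        # need not be the position of the participant's own avatar.
        self.viewpoint = position

client = ReceivingClient()
client.place_avatar("p1", b"model", (1.0, 0.0, 2.0))
client.set_viewpoint((0.0, 1.5, 0.0))
```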

This use case is part of the MPAI-MMC Use Cases and Functional Requirements WD1.4. MPAI intends to issue a Call for Technologies on 9 July 2022. Anybody may respond to the Call. If a proposed technology is accepted, the proponent is requested to join MPAI.


MPAI-MCS is at the level of Use Cases and Functional Requirements. If you wish to participate in this work, you have the following options:

  1. Join MPAI.
  2. Participate, until the MPAI-MCS Functional Requirements are approved (after that, only MPAI members can participate), by sending an email to the MPAI Secretariat.
  3. Keep an eye on this page.

Return to the MPAI-MCS page