6.1 Introduction
6.2 Facial attributes estimation
6.3 Synthesising humans visually
6.4 Applications & potential dangers

6.1       Introduction

Actions and facial expressions play an important role in communication between humans and machines. Even when the words are identical, the information they convey changes with the accompanying actions and facial expressions. Natural human-machine communication therefore requires machines to produce appropriate changes in visual behavior and facial expression. Moreover, people converse more comfortably with a counterpart that feels human rather than mechanical, which means a machine should be able to reproduce facial expressions the way humans do.

There is a need for an AI standard that facilitates the implementation of AI technology from multiple interoperable modules. Addressing the latest AI technology for the visual part of such systems is one of the goals of MPAI.

6.2       Facial attributes estimation

For a human generator to create a face and body that follow changes in human attributes, the system must first understand those attributes. This can be done in two ways: explicitly and implicitly.

Explicit Way

  • Key Point Detection is a method of extracting the key points of a person from a video in which that person appears, e.g., facial landmark detection or 3D angle detection. In general, the key points are extracted with an AI classifier, which requires images of people together with annotated key-point information. Based on this human supervision, the AI module learns the person’s attribute information (a minimal training sketch follows this list).
  • Facial Emotion Detection. Humans can generally read emotions from the facial expressions of the other party. People can therefore label a person’s emotion in a video by looking at that person’s expressions and behavior. Training an AI classifier on such labeled videos yields a classifier that detects the person’s emotion. Again, the AI module learns human emotions explicitly, based on human supervision.
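
A minimal sketch of the supervised key-point pipeline described above, using PyTorch. It assumes a dataset of face images paired with annotated (x, y) landmarks; the 68-landmark count and the network architecture are illustrative assumptions, not part of any standard.

```python
# Supervised landmark regression: learn annotated key points from images.
import torch
import torch.nn as nn
from torch.utils.data import DataLoader

NUM_LANDMARKS = 68  # assumed annotation scheme (e.g., 300-W style)

class LandmarkNet(nn.Module):
    """CNN that regresses 2 * NUM_LANDMARKS coordinates from a face crop."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.head = nn.Linear(64, NUM_LANDMARKS * 2)

    def forward(self, x):
        return self.head(self.features(x).flatten(1))

def train(model, dataset, epochs=10):
    """Fit the model to human-annotated landmarks (the explicit supervision)."""
    loader = DataLoader(dataset, batch_size=32, shuffle=True)
    opt = torch.optim.Adam(model.parameters(), lr=1e-4)
    loss_fn = nn.MSELoss()
    for _ in range(epochs):
        for images, landmarks in loader:  # landmarks: (B, 68, 2) human labels
            pred = model(images).view(-1, NUM_LANDMARKS, 2)
            loss = loss_fn(pred, landmarks)
            opt.zero_grad()
            loss.backward()
            opt.step()
```

The same structure carries over to facial emotion detection: replace the regression head with a classification head over emotion labels and the MSE loss with a cross-entropy loss.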

Implicit Way

  • Key Point Detection. Besides learning key points from human supervision, the AI module can learn them indirectly. By observing the movements of many people, it can extract key points that generalize across people. The key points the AI module discovers differ from the landmarks that humans annotate, but because the module extracts key points that it can itself exploit more easily, they often yield higher performance when synthesizing a person.
  • Latent Space in Human Generator. Creating a person requires a generator. The generator most widely used for synthesizing virtual humans is the Generative Adversarial Network (GAN), in which a generator and a discriminator are trained competitively. The discriminator judges whether an image produced by the generator is a real image or a generated one, and the generator is trained to deceive the discriminator (see the training sketch after this list). In this process, the generator finds a low-dimensional latent space that can express the images to be created. Because this latent space implicitly encodes information about the image, it contains attribute information about the person as the AI model represents it, analogous to the attributes a human would ascribe to that person. Generators are discussed in more detail in the next section.
  • Voice Information. Voice information is needed to render the visual part of a human convincingly. When a person speaks, the mouth shape must stay in sync with the speech; actions and expressions also change with what is being said. It is therefore necessary to extract the mouth-shape and emotion information contained in human speech and make it usable downstream. Because the visible part is generated from voice information, this calls for a multi-modal AI module.
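
The adversarial training described above can be sketched in a few lines of PyTorch. The fully connected architectures and the 64 x 64 image size are deliberately simplified placeholders; the point is the two-step game in which the discriminator learns to separate real from generated images while the generator learns a low-dimensional latent space that fools it.

```python
# Minimal GAN training step: generator vs. discriminator.
import torch
import torch.nn as nn

LATENT_DIM = 128  # the low-dimensional latent space discussed above

G = nn.Sequential(  # generator: latent vector -> flattened 64x64 RGB image
    nn.Linear(LATENT_DIM, 1024), nn.ReLU(),
    nn.Linear(1024, 64 * 64 * 3), nn.Tanh(),
)
D = nn.Sequential(  # discriminator: flattened image -> real/fake logit
    nn.Linear(64 * 64 * 3, 1024), nn.LeakyReLU(0.2),
    nn.Linear(1024, 1),
)

opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCEWithLogitsLoss()

def train_step(real_images):           # real_images: (B, 64*64*3)
    b = real_images.size(0)
    z = torch.randn(b, LATENT_DIM)     # sample a point in the latent space
    fake_images = G(z)

    # 1) Discriminator: judge real images as real, generated ones as fake.
    d_loss = (bce(D(real_images), torch.ones(b, 1))
              + bce(D(fake_images.detach()), torch.zeros(b, 1)))
    opt_d.zero_grad()
    d_loss.backward()
    opt_d.step()

    # 2) Generator: deceive the discriminator into predicting "real".
    g_loss = bce(D(fake_images), torch.ones(b, 1))
    opt_g.zero_grad()
    g_loss.backward()
    opt_g.step()
```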

6.3       Synthesising humans visually

As explained above, a generator is needed to synthesize a virtual human visually. The generative adversarial network (GAN) is the most widely used approach, and the most popular GAN variant for this purpose is StyleGAN.

StyleGAN has the advantage of creating high-quality images of 1024 x 1024 pixels. It is trained so that the attributes of the generated image can be adjusted by disentangling the low-dimensional latent space. Suppose we use a non-disentangled latent space when we want to change one characteristic of the generated image (e.g., face angle). In that case, other characteristics (e.g., facial expression) change simultaneously with the one we intend to change, which makes synthesizing humans difficult. If we instead use the disentangled latent space of StyleGAN, we can synthesize humans by changing only the characteristic we want to change.
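The following sketch illustrates this kind of latent-space editing. The stand-in `generator` and the `pose_direction` vector are hypothetical placeholders: in practice the generator would be a pretrained StyleGAN and the direction would be found, for example, by fitting a linear attribute classifier in the latent space.

```python
# Editing one attribute by walking along a direction in a disentangled
# latent space: only the targeted attribute (e.g., face angle) changes.
import torch

# Hypothetical stand-ins for a pretrained StyleGAN generator and a
# learned attribute direction.
generator = torch.nn.Linear(512, 64 * 64 * 3)   # latent code -> image
w = torch.randn(1, 512)                         # latent code of one face
pose_direction = torch.randn(1, 512)            # assumed "face angle" axis
pose_direction = pose_direction / pose_direction.norm()

for alpha in (-3.0, 0.0, 3.0):                  # walk along the direction
    edited = w + alpha * pose_direction
    image = generator(edited)                   # same face, different angle
```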

In addition to GANs, a technology called NeRF (Neural Radiance Fields) has recently been attracting attention. A GAN generates images in two-dimensional space, whereas NeRF creates images in three-dimensional space. Since the human face and body are three-dimensional, generating them in a two-dimensional space has inherent limits; because NeRF represents a person in three-dimensional space, more accurate human synthesis is possible. NeRF casts a camera ray from the observer’s viewpoint through each image pixel to be generated. The pixel’s color (r, g, b) is computed by accumulating the color (r, g, b) and volume density of each point the camera ray passes through inside the object.
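A NumPy sketch of that volume-rendering accumulation for a single pixel follows. The `field` function is a hypothetical stand-in for a trained NeRF network that returns color and volume density at sampled 3D points; the sampling scheme is simplified.

```python
# Volume rendering along one camera ray: the pixel color is the
# density-weighted sum of (r, g, b) samples along the ray.
import numpy as np

def render_pixel(origin, direction, field, n_samples=64, near=0.1, far=4.0):
    """Integrate color along one camera ray from `near` to `far`."""
    t = np.linspace(near, far, n_samples)        # sample depths
    points = origin + t[:, None] * direction     # 3D points on the ray
    rgb, sigma = field(points, direction)        # color + volume density

    delta = np.diff(t)                           # spacing between samples
    delta = np.append(delta, 1e10)               # last interval: unbounded
    alpha = 1.0 - np.exp(-sigma * delta)         # opacity per sample
    trans = np.cumprod(np.concatenate(([1.0], (1.0 - alpha)[:-1])))
    weights = trans * alpha                      # contribution of each sample
    return (weights[:, None] * rgb).sum(axis=0)  # final (r, g, b)

# Toy field: constant grey color, uniform density everywhere.
toy_field = lambda pts, d: (np.full((len(pts), 3), 0.5), np.ones(len(pts)))
print(render_pixel(np.zeros(3), np.array([0.0, 0.0, 1.0]), toy_field))
```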

GANs and NeRF use human attributes differently. A GAN controls the generated image by moving through the low-dimensional latent space derived during training, so the user creates the desired human image by manipulating the human attributes as understood by the deep-learning model. NeRF, by contrast, generates an image from three-dimensional coordinates and viewing angles, so the desired human image is created from explicit human attributes.

6.4       Applications & potential dangers

Applications

Virtual humans can be used in many ways. The application currently receiving the most attention is marketing: because a virtual human has fewer temporal and spatial constraints than a real person, those constraints need not be considered when shooting an advertisement or producing content. Next are virtual chatbots. Human-computer interfaces that currently use only voice and text, such as kiosks or navigation systems, can add visual information thanks to virtual humans; this increases friendliness and can further improve a company’s image. Virtual humans are also directly relevant to creating characters in the metaverse, which has recently been in the spotlight: even within that virtual world, virtual humans are required for activities, and AI technology will be used in many ways to create them.

Potential Dangers

People may lose their jobs as virtual humans replace them. Virtual humans can replace influencers who earn income from marketing shoots, as well as occupations such as announcers and café staff. In addition, a virtual human imitating a real famous person may be placed in negative content such as pornography or violent imagery, damaging that person’s image. Such technology can also be abused to manipulate public opinion.
