Visual Object and Scene Description (MPAI-OSD)
2.1 Audio Tape Irregularity Detection (MPAI-CAE)
2.2 Identify object in a human’s hand (MQA)
2.3 Detecting emotion and meaning in human face (CWE)
2.4 Visual objects and scene for Connected Autonomous Vehicles (CAV)
2.5 Avatar-based videoconference (MCS)
2.6 Tracking video game player’s movements
2.7 Correct Posture
2.8 Integrative genomic/video experiments (animals)
1 Introduction
Visual object and scene description (MPAI-OSD) is an MPAI project at the Use Case stage. It collects Use Cases that share the goal of describing visual objects and, in some cases, locating them in space.
By scene description we mean the description of the objects in a scene and of their attributes, together with the semantic description of those objects.
AIMs in the MPAI-OSD area have already been requested in Conversation with Emotion and Multimodal Conversation. However, no specific responses have been received.
New use cases requiring new AIMs that fall under the MPAI-OSD scope are constantly being identified.
2 Description of Use Cases
2.1 Audio Tape Irregularity Detection (MPAI-CAE)
This belongs to the family of generic object description.
MPAI is using this component in the MPAI-CAE Use Case Audio Recording Preservation.
It is designed to:
- Receive the video signal of a camera pointing to the magnetic reading head of a traditional audio tape.
- Detect the images that show irregularities on the tape.
- If an image shows an irregularity, provide as output (see the sketch after this list):
- The image
- The type of irregularity
- The time code
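For illustration only, a minimal sketch of the output record such a detector could produce; the type and field names below are assumptions, not normative MPAI-CAE definitions:

```python
from dataclasses import dataclass
from enum import Enum, auto

class IrregularityType(Enum):
    """Illustrative irregularity classes; the normative set is defined by MPAI-CAE."""
    SPLICE = auto()
    DAMAGE = auto()
    OTHER = auto()

@dataclass
class IrregularityEvent:
    image: bytes                         # the image showing the irregularity
    irregularity_type: IrregularityType  # the type of irregularity
    time_code: str                       # e.g. "HH:MM:SS:FF" position on the tape
```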
2.2 Identify object in a human’s hand (MQA)
This belongs to the family of generic object description.
MPAI is using this component in the MPAI-MMC Use Case Multimodal Question Answering.
It is designed to:
- Receive the picture of an object.
- Recognise the type of object.
- Provide the object identifier as output (a hypothetical interface is sketched below).
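A hypothetical interface for this AIM, with illustrative names only (the placeholder body stands in for an actual image-recognition model):

```python
from dataclasses import dataclass

@dataclass
class ObjectIdentifier:
    label: str         # identifier of the recognised object type, e.g. "bottle"
    confidence: float  # recognition confidence in [0, 1]

def identify_object(picture: bytes) -> ObjectIdentifier:
    """Recognise the type of object shown in the picture and return its identifier.

    Placeholder body: a real AIM would run an image-recognition model here.
    """
    return ObjectIdentifier(label="unknown", confidence=0.0)
```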
2.3 Detecting emotion and meaning in human face (CWE)
This belongs to the family of human description.
MPAI is using this component in the MPAI-MMC Conversation with Emotion Use Case, which is designed to:
- Receive a video of the face of a human.
- Identify the type and intensity of the emotion in the face of the human.
- Provide as output:
- The type of emotion out of a finite set of codified emotions.
- The intensity (grade) of the emotion.
- The time stamp that the type and intensity of the emotion refer to (see the sketch below).
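A minimal sketch of this output, assuming an illustrative (non-normative) set of codified emotions:

```python
from dataclasses import dataclass
from enum import Enum

class Emotion(Enum):
    """Illustrative subset; MPAI-MMC defines the normative set of codified emotions."""
    HAPPY = "happy"
    SAD = "sad"
    ANGRY = "angry"
    NEUTRAL = "neutral"

@dataclass
class FaceEmotion:
    emotion: Emotion   # type of emotion, from the codified set
    intensity: float   # grade of the emotion, e.g. normalised to [0, 1]
    time_stamp: float  # time (s) in the input video that the estimate refers to
```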
2.4 Visual objects and scene for Connected Autonomous Vehicles (CAV)
This Use Case contains many elements belonging to the description of humans, animals, vehicles, road signs and traffic lights. A large variety of sensing devices is used:
- 2D and 3D cameras
- Lidar
- Radar
- Ultrasound
together with other sources of information such as the odometer and GNSS, to create a Basic World Representation (BWR). A CAV exchanges its BWRs with other CAVs in range and produces a refined Full World Representation (FWR), as sketched below.
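A minimal sketch of how BWRs might be combined into an FWR, with hypothetical object and representation types; real fusion would align reference frames and merge duplicate detections, which is only indicated here:

```python
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class SceneObject:
    kind: str                             # e.g. "human", "vehicle", "road sign", "traffic light"
    position: Tuple[float, float, float]  # position in a common reference frame
    confidence: float

@dataclass
class BasicWorldRepresentation:
    source_cav: str                       # identifier of the CAV that produced this BWR
    objects: List[SceneObject] = field(default_factory=list)

def build_fwr(own: BasicWorldRepresentation,
              received: List[BasicWorldRepresentation]) -> List[SceneObject]:
    """Build a (naive) Full World Representation from the CAV's own BWR and the
    BWRs received from CAVs in range, by simply concatenating the objects."""
    fused = list(own.objects)
    for bwr in received:
        fused.extend(bwr.objects)
    return fused
```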
In the MPAI-CAV Human-CAV Interaction Use Case, the CAV needs to (see the sketch after this list):
- Detect the emotion of a passenger to be able to have better conversations or provide better responses to queries.
- Locate passengers in the compartment so that the avatar representing the CAV can gaze at them in a more natural way.
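For illustration only, a sketch of how a located passenger position could be turned into a gaze direction for the avatar representing the CAV; names, data types and reference frames are assumptions:

```python
from dataclasses import dataclass
from typing import Tuple

@dataclass
class PassengerStatus:
    seat_position: Tuple[float, float, float]  # passenger location in the cabin frame
    emotion: str                               # codified emotion label, e.g. "happy"

def gaze_direction(avatar_position: Tuple[float, float, float],
                   passenger: PassengerStatus) -> Tuple[float, float, float]:
    """Unit vector from the avatar towards the passenger, so the avatar can gaze at them."""
    delta = [p - a for p, a in zip(passenger.seat_position, avatar_position)]
    norm = sum(c * c for c in delta) ** 0.5 or 1.0
    return tuple(c / norm for c in delta)
```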
2.5 Avatar-based videoconference (MCS)
Geographically distributed users can send their data to a virtual space and create local 3D audio-visual spaces in which they see a virtual meeting populated with avatars whose faces and heads are animated, and which they can navigate without moving their own avatar.
- MPAI-MCS Local Avatar Videoconference Use Case, where a participant in a videoconference is represented by an avatar whose torso is faithfully represented.
- MPAI-MCS Virtual eLearning Use Case, where teacher and students are represented by avatars and can interact with 3D Audio-Visual Objects: they can enter, navigate and act on the 3D audio-visual objects by doing the following (see the sketch after this list):
- Define a portion of the object, either manually or automatically
- Count objects per unit volume
- Detect structures in (a portion of) the 3D AV object
- Combine objects
- Call an anomaly detector on a portion with an anomaly criterion
- Follow a link to another portion of the object
- 3D print (portions of) the 3D AV object
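The operations above could be collected in an interface like the following sketch; the method names and signatures are illustrative assumptions, not part of any MPAI specification:

```python
from typing import List, Protocol

class AudioVisualObject3D(Protocol):
    """Hypothetical interface mirroring the operations listed above."""

    def define_portion(self, selection, automatic: bool = False) -> "AudioVisualObject3D": ...
    def count_objects_per_unit_volume(self) -> float: ...
    def detect_structures(self) -> List[str]: ...
    def combine(self, other: "AudioVisualObject3D") -> "AudioVisualObject3D": ...
    def detect_anomalies(self, criterion) -> List["AudioVisualObject3D"]: ...
    def follow_link(self, link_id: str) -> "AudioVisualObject3D": ...
    def export_for_3d_print(self, path: str) -> None: ...
```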
2.6 Tracking video game player’s movements
This Use Case belongs to the family of human description.
It is a system designed to automatically understand the game player’s physical movements in a video game. The features of the movements are:
- The human object is largely static, and only hand/arm and finger movements are detected.
- The types of movements are limited in number.
- The system should understand the movements quickly and accurately.
The system is designed to (see the sketch after this list):
- Receive a video.
- Compute descriptors of the human.
- Understand the intention expressed by the movements from the descriptors.
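A sketch of this processing chain with placeholder bodies; a real AIM would use pose estimation and a classifier trained on the limited set of movement types, and all names below are assumptions:

```python
from typing import List, Sequence

def compute_descriptors(frames: Sequence[bytes]) -> List[List[float]]:
    """Extract per-frame descriptors of the player's hand/arm and finger positions.

    Placeholder: a real AIM would run a pose-estimation model on each frame.
    """
    return [[] for _ in frames]

def understand_intention(descriptors: List[List[float]]) -> str:
    """Map the descriptor sequence to one of a small, fixed set of game commands.

    Placeholder decision; a real system would use a trained classifier.
    """
    return "no_action" if not any(descriptors) else "unrecognised"
```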
2.7 Correct Posture
This Use Case belongs to the family of human description.
It is a system designed to advise the user on how to correct their pose.
The main features of this Use Case are:
- The human using the application walks in a restricted environment.
- Very specific types of movements must be detected with high accuracy.
- The detected movements are compared with reference movements.
The system is designed to:
- Receive a video.
- Compute descriptors of the human.
- Compare the descriptors with reference descriptors.
- Provide suggestions about movement corrections (see the sketch below).
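A minimal sketch of the comparison step, assuming descriptors are per-joint values; the threshold and the wording of the suggestions are illustrative assumptions:

```python
from typing import List

def posture_deviation(descriptors: List[float], reference: List[float]) -> List[float]:
    """Per-joint difference between the observed pose descriptors and the reference pose."""
    return [d - r for d, r in zip(descriptors, reference)]

def suggest_corrections(deviation: List[float], threshold: float = 0.1) -> List[str]:
    """Return a textual suggestion for every joint whose deviation exceeds the threshold."""
    return [f"adjust joint {i} by {-dev:+.2f}"
            for i, dev in enumerate(deviation) if abs(dev) > threshold]
```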
2.8 Integrative genomic/video experiments (animals)
This belongs to the family of animal description (however, see later).
MPAI is using this component in several MPAI-GSA Use Cases.
It is a system designed to
- Receive a sequence of images containing laboratory animals with specified genomic data, whose effects on behavioural patterns are to be assessed.
- Compute behavioural patterns of living organisms, e.g., measures of the parameters of animal activity, over the whole scene and/or in specified Regions of Interest (ROIs), such as (see the sketch after this list):
- (Average) velocity.
- Time spent.
- Time spent near walls.
- Turning speed.
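A sketch of how such activity parameters could be computed from a per-frame (x, y) track of one animal; the arena geometry, wall margin and units are assumptions made for illustration:

```python
import math
from typing import List, Tuple

def activity_parameters(track: List[Tuple[float, float]],
                        fps: float,
                        arena: Tuple[float, float],
                        wall_margin: float = 0.05) -> dict:
    """Compute simple activity measures from an animal's (x, y) trajectory.

    track: per-frame positions; fps: frames per second; arena: (width, height).
    """
    dt = 1.0 / fps
    speeds, near_wall_frames, turn_rates = [], 0, []
    for i in range(1, len(track)):
        (x0, y0), (x1, y1) = track[i - 1], track[i]
        speeds.append(math.hypot(x1 - x0, y1 - y0) / dt)
        # Count the frame as "near a wall" if the distance to the closest wall
        # is below a fraction of the smaller arena dimension.
        if min(x1, arena[0] - x1, y1, arena[1] - y1) < wall_margin * min(arena):
            near_wall_frames += 1
    for i in range(2, len(track)):
        a = math.atan2(track[i - 1][1] - track[i - 2][1], track[i - 1][0] - track[i - 2][0])
        b = math.atan2(track[i][1] - track[i - 1][1], track[i][0] - track[i - 1][0])
        # Wrapped heading change per unit time.
        turn_rates.append(abs(math.atan2(math.sin(b - a), math.cos(b - a))) / dt)
    return {
        "average_velocity": sum(speeds) / len(speeds) if speeds else 0.0,
        "time_near_walls": near_wall_frames * dt,
        "average_turning_speed": sum(turn_rates) / len(turn_rates) if turn_rates else 0.0,
    }
```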
There is another component that belongs to the family of plant description.