Visual Object and Scene Description (OSD)

Proponents: Leonardo Chiariglione (CEDEO), Luigi Troiano (Kebula)

Description: This use case addresses the “object and scene description” component of several use cases considered in MPAI’s use case document rev2.0 (N46). “Object and scene description” is used to indicate a description (language) of objects and their attributes, and the semantic description of the individual objects in a scene.

Comments:

The Examples section of this use case shows that proprietary solutions can address the needs of the examples. However, proprietary solutions have the following disadvantages:

  1. Lack of interoperability, i.e. proprietary solutions create silos
  2. Adoption of a single technology as a black box
  3. Closed applications without access to the enabling technology
  4. Delegation of enabling technology innovation to the technology provider (lock-in).

On the other hand, a standard representation of the objects in a scene and of the scene itself allows for:

  1. Interoperable applications, i.e. data can be moved from one domain to another
  2. Possibility to select the enabling technology, with the only constraint of preserving the interfaces
  3. Possibility to create open applications where the enabling technology can be replaced while safeguarding the interoperability of data
  4. Possibility for MPAI, by decision of its members, to extend an existing standard or to create a new one when independently achieved progress of technology requires it.

Examples:

  1. Multiplayer online gaming

In the Hide and seek example of the “Distributed multiplayer online gaming in Next Generation Games” use case (ME.MP-09):

  1. Player A
    1. Points their smartphone at a scene populated by persons and other objects
    2. Sends the description of the scene and its objects to a server
    3. Sends commands to the server to hide a synthetic person among the real persons
  2. The server
    1. Understands the scene description received
    2. Adds the animated synthetic person to the scene
    3. Distributes the composite scene to all players
  3. All players but player A seek the synthetic person in the crowd.

The enabling technologies are:

  1. Description of the persons in the scene and their movements
  2. Description of other objects
  3. Description of the position of all objects in the scene
  4. Transmission of all descriptors to the server
  5. Transmission of commands to animate a synthetic person (player trying to hide)
  6. Animation of a new object (person trying to hide)
  7. Creation of a new scene description containing
    1. The objects received
    2. The new animated object
    3. Any other object that the logic of the game may require
  8. Transmission of new scene description to all players
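
As a rough illustration, the following Python sketch models a minimal scene description exchanged between player A and the server, together with the server-side composition step. All names (VisualObject, SceneDescription, compose_scene) are hypothetical; the actual syntax and semantics of the descriptors would be defined by the standard.

    from dataclasses import dataclass, field
    from typing import List, Tuple

    # Hypothetical descriptor structures; the actual MPAI-OSD syntax
    # and semantics would be defined by the standard.

    @dataclass
    class VisualObject:
        object_id: str
        object_type: str                      # e.g. "person", "ball"
        position: Tuple[float, float, float]  # position in scene coordinates
        descriptors: List[float]              # low-level visual descriptors
        is_synthetic: bool = False

    @dataclass
    class SceneDescription:
        scene_id: str
        objects: List[VisualObject] = field(default_factory=list)

    def compose_scene(received: SceneDescription,
                      synthetic_person: VisualObject) -> SceneDescription:
        """Server-side step: add the animated synthetic person to the
        scene received from player A; the composite scene is then
        distributed to all players."""
        composite = SceneDescription(scene_id=received.scene_id,
                                     objects=list(received.objects))
        composite.objects.append(synthetic_person)
        return composite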

In summary:

  1. Visual objects and scene description
  2. Animation of objects described by 1.

  2. AI-assisted driving

In the “AI-assisted driving” (TP.MP-01) use case, a car computes the descriptors of the scene and displays appropriate messages to the driver. When considered appropriate, the car communicates the scene descriptors to the neighbouring cars.

At first glance, the technologies needed in usage example #1 are the same as those needed in this usage example, as can be seen from the workflow below:

  1. The camera of the car captures the scene of a human crossing the road
  2. The car (or the camera) computes the visual descriptors, the position in space, and the speed and direction of movement
  3. The car assesses the urgency and displays appropriate information about the human to the driver and, if needed, communicates it to nearby vehicles
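
A minimal sketch of step 3, assuming the descriptors of step 2 carry the car-relative position and velocity of the human; the time-to-collision heuristic and the 2-second threshold are illustrative assumptions, not part of any existing standard.

    import math
    from dataclasses import dataclass

    @dataclass
    class MovingObject:
        position: tuple  # (x, y) in metres, car-relative coordinates
        velocity: tuple  # (vx, vy) in m/s

    def time_to_collision(obj: MovingObject) -> float:
        """Crude closing-time estimate: distance divided by the closing
        speed; returns infinity if the object is not approaching."""
        distance = math.hypot(obj.position[0], obj.position[1])
        # Closing speed = component of the velocity pointing at the car.
        closing = -(obj.position[0] * obj.velocity[0] +
                    obj.position[1] * obj.velocity[1]) / max(distance, 1e-6)
        return distance / closing if closing > 0 else math.inf

    def assess_urgency(obj: MovingObject, ttc_threshold: float = 2.0) -> str:
        # The 2-second threshold is an arbitrary illustrative value.
        if time_to_collision(obj) < ttc_threshold:
            return "warn_driver_and_nearby_vehicles"
        return "monitor"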

  3. Vision-to-sound transformation

In the Vision-to-Sound Transformation (HC.AV-01) use case, a 3D Visual space is transformed into a 3D Audio space. One approach to this problem is based on a full description of the visual scene and its objects, followed by a standard visual-to-audio conversion.

The requirements of this usage example are not yet known: are the scene and its objects static or dynamic?

In either case, however, the potential Visual Object and Scene Description standard could be used.
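
A minimal sketch of one conceivable visual-to-audio conversion, assuming each described object carries a listener-centred 3D position that is mapped to azimuth, elevation and gain for a spatial audio renderer; the coordinate convention and the inverse-distance gain law are illustrative assumptions.

    import math

    def object_to_audio_source(position):
        """Map a visual object's 3D position (x, y, z in metres,
        listener-centred, z pointing forward) to parameters usable by
        a spatial audio renderer, e.g. for HRTF selection."""
        x, y, z = position
        distance = math.sqrt(x * x + y * y + z * z)
        azimuth = math.degrees(math.atan2(x, z))               # left/right
        elevation = math.degrees(math.asin(y / max(distance, 1e-6)))
        gain = 1.0 / max(distance, 1.0)  # simple inverse-distance gain
        return {"azimuth": azimuth, "elevation": elevation, "gain": gain}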

  4. Tracking video game player’s movements

In the “Tracking game player’s movements” (ME.MP-12) use case, the client sends descriptions of the game player’s movements to a server. The server decodes the game player’s move intention from the descriptors.

The technology in this usage example is akin to that of the preceding ones. However, in this case:

  1. The human object is largely static; only hand, arm and finger movements are detected
  2. The types of human object movements are more restricted
  3. High accuracy is required
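
A sketch of what such restricted, high-accuracy movement descriptors might look like, assuming per-frame 3D hand keypoints; the landmark names and the pinch threshold are hypothetical.

    from dataclasses import dataclass
    from typing import Dict, List, Tuple

    @dataclass
    class HandFrame:
        timestamp_ms: int
        # Hypothetical landmark names, e.g. "thumb_tip", "index_tip".
        keypoints: Dict[str, Tuple[float, float, float]]

    def detect_pinch(frames: List[HandFrame], threshold: float = 0.02) -> bool:
        """Example gesture decoder: declare a pinch when thumb tip and
        index tip come within `threshold` metres in any frame."""
        for f in frames:
            tx, ty, tz = f.keypoints["thumb_tip"]
            ix, iy, iz = f.keypoints["index_tip"]
            if ((tx - ix) ** 2 + (ty - iy) ** 2 + (tz - iz) ** 2) ** 0.5 < threshold:
                return True
        return False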

  5. Correct Posture

In the “Correct Posture” (HC.MP-02) use case, an AI application analyses the video of a person to suggest how the person should correct their posture.

The technology in this usage example is similar to the preceding one in terms of movement representation accuracy. However, in this case:

  1. The human object walks in a restricted environment
  2. Very specific types of movement must be detected with high accuracy
  3. Scene description is typically not required.
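
A sketch of how posture analysis might use such movement descriptors, assuming 2D body keypoints from which joint angles are computed; the keypoint names and the 160-degree threshold are illustrative.

    import math

    def joint_angle(a, b, c):
        """Angle at joint b (degrees) formed by 2D keypoints a-b-c,
        e.g. shoulder-hip-knee for back inclination."""
        v1 = (a[0] - b[0], a[1] - b[1])
        v2 = (c[0] - b[0], c[1] - b[1])
        cos = ((v1[0] * v2[0] + v1[1] * v2[1]) /
               (math.hypot(*v1) * math.hypot(*v2)))
        return math.degrees(math.acos(max(-1.0, min(1.0, cos))))

    def check_back(shoulder, hip, knee, min_deg=160.0):
        """Suggest a correction when the shoulder-hip-knee angle drops
        below a threshold; 160 degrees is an arbitrary example value."""
        if joint_angle(shoulder, hip, knee) < min_deg:
            return "suggest_straightening_back"
        return "ok"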

  6. Person matching

In the “AI-services for next generation TV” (ME.MP-11) use case, descriptors of the video program and associated metadata reach the set top box together with the actual program.

When the user wants to know more about an object:

  1. The user points the remote control at the object on the screen
  2. Set top box
    1. Computes object descriptors
    2. Compares the computed descriptors with those in the stream
    3. When a match is found, executes as per metadata
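
A sketch of the comparison step, assuming descriptors are fixed-length feature vectors compared by Euclidean distance; the threshold and the nearest-neighbour rule are illustrative assumptions.

    import math
    from typing import Dict, List, Optional

    def match_object(query: List[float],
                     stream_descriptors: Dict[str, List[float]],
                     max_distance: float = 0.5) -> Optional[str]:
        """Compare the descriptors computed by the set top box with
        those carried in the stream; return the id of the matched
        object (whose metadata is then executed) or None."""
        best_id, best_dist = None, math.inf
        for obj_id, desc in stream_descriptors.items():
            dist = math.sqrt(sum((q - d) ** 2 for q, d in zip(query, desc)))
            if dist < best_dist:
                best_id, best_dist = obj_id, dist
        return best_id if best_dist <= max_distance else None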

This usage example differs from the preceding ones because the scene and its objects are 2D, while in the other usage examples the scene and the objects can be considered as 3D.

There has been, and continues to be, significant research on 2D scene descriptors for searching objects in a database of TV programs.

  7. Integrative genomic/video experiments

In the “Integrative analysis of multi-source genomic/sensor experiments” (ST.OD-06) use case, an AI-based application assesses genomic data and their effects on living organisms. An example is the study of the movements of Zebrafish, a (sub-)tropical fish that is widely used in this kind of test. Currently used programs provide: speed, average speed, acceleration, time spent in ROI, trajectories, identification of the same animal in different videos, turning speed, time near walls and more [1].
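
A sketch showing how two of the listed high-level descriptors (average speed and time spent in ROI) could be computed from a per-frame trajectory; the function and field names are hypothetical.

    import math
    from typing import List, Tuple

    Point = Tuple[float, float]  # (x, y) position in millimetres

    def trajectory_metrics(points: List[Point], fps: float,
                           roi: Tuple[float, float, float, float]) -> dict:
        """Compute average speed and time spent in a rectangular ROI
        (x_min, y_min, x_max, y_max) from a per-frame trajectory."""
        dt = 1.0 / fps
        speeds = [math.dist(points[i], points[i + 1]) / dt
                  for i in range(len(points) - 1)]
        x0, y0, x1, y1 = roi
        in_roi = sum(1 for x, y in points if x0 <= x <= x1 and y0 <= y <= y1)
        return {
            "average_speed": sum(speeds) / len(speeds) if speeds else 0.0,
            "time_in_roi_s": in_roi * dt,
        }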

  8. Audio Recording Preservation

In this use case, features of images taken from the video of an audio tape passing in front of the magnetic head are used to search a Knowledge Base of magnetic tape irregularities.

  9. Conversation with emotion

In this use case, a human has a conversation with a machine. The machine takes pictures of the human, extracts features and queries a Knowledge Base of features of known emotions.
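
A sketch of the query step, assuming the Knowledge Base maps emotion labels to reference feature vectors and the nearest reference wins; this nearest-neighbour rule is an illustrative assumption.

    from typing import Dict, List

    def classify_emotion(face_features: List[float],
                         emotion_kb: Dict[str, List[float]]) -> str:
        """Return the emotion label whose reference feature vector is
        closest to the features extracted from the human's picture."""
        def sq_dist(a: List[float], b: List[float]) -> float:
            return sum((x - y) ** 2 for x, y in zip(a, b))
        return min(emotion_kb,
                   key=lambda label: sq_dist(face_features, emotion_kb[label]))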

  10. Multimodal Question Answering

In this use case, a human interrogates a machine using a picture of an object. The machine extracts features of the image and queries a Knowledge Base of features of objects.

These usage examples can be satisfied by a standard that simply defines the syntax and semantics of the outputs of AIMs (such as speed, average speed, acceleration and time spent in ROI), or by one that digs into the low-level descriptors used to compute those high-level descriptors.
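
A sketch of what such a standardised AIM output might look like when serialised as JSON; field names, units and values are illustrative, not normative.

    import json

    # Hypothetical AIM output conforming to a standard that defines only
    # the syntax and semantics of high-level descriptors.
    aim_output = {
        "object_id": "zebrafish_03",
        "descriptors": {
            "speed_mm_s": 41.2,
            "average_speed_mm_s": 37.8,
            "acceleration_mm_s2": 3.1,
            "time_in_roi_s": 12.4,
        },
    }
    print(json.dumps(aim_output, indent=2))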

Requirements:

This is an initial set of requirements:

  1. The standard shall support the description of human objects to enable
    1. Animation of human models
    2. Interpretation of application-specific gestures
  2. The standard shall enable the description of scenes where
    1. Individual objects can be manipulated
    2. The scene can be updated

Object of standard:

  1. Low-level descriptors of selected objects
  2. Application-dependent high-level descriptors
  3. Description of dynamic scenes

Based on the examples above, MPAI-OSD could target several standards covering specific applications:

Potential standard            Examples
Person movement description   Multiplayer online gaming; AI-assisted driving; Correct Posture; Tracking game player’s movements
Scene description             Multiplayer online gaming; AI-assisted driving; Integrative genomic/video experiments
Generic object description    Multiplayer online gaming; AI-assisted driving
Person identification         Person matching
Animal movement description   Integrative genomic/video experiments

Benefits: The standard would stimulate competition among developers of neural networks that can best convert objects into the required standard set of descriptors.

Bottlenecks: Many applications would benefit from the availability of independently sourced object description solutions that expose the MPAI interfaces. Decoupling the enabling technologies from the applications would benefit both sides.

Social aspects: The impact of the standard can already be gauged from the many examples of highly realistic synthetic faces.

Success criteria: The number and success of applications that use the standard.