Draft Use Cases and Functional Requirements

MPAI-GSA Use Cases and Functional Requirements work programme

1        Introduction

Moving Picture, Audio and Data Coding by Artificial Intelligence (MPAI) is an international association with the mission to develop AI-enabled data coding standards. Research has shown that data coding with AI-based technologies is more efficient than with existing technologies.

The MPAI approach to developing AI data coding standards is based on the definition of standard interfaces of AI Modules (AIM). AIMs operate on input data and provide output data both of which have a standard format. AIMs can be combined and executed in an MPAI-specified AI-Framework called MPAI-AIF. A Call for MPAI-AIF Technologies [2] against functional requirements [1] is currently open.

While AIMs must expose standard interfaces to be able to operate in an MPAI AI Framework, their performance may differ depending on the technologies used to implement them. MPAI believes that competing developers striving to provide better-performing, proprietary and interoperable AIMs will promote horizontal markets of AI solutions that build on and further promote AI innovation.

The MPAI standardisation model is currently hard to implement because in many cases the data used do not have a well-defined format or unambiguous semantics. This document lays down a plan to achieve the desired standardisation. It does that by introducing four representative Use Cases that use AIMs to understand and compress the results of high-throughput experiments combining genomic/proteomic data with other data, for instance from video, motion, location, weather and medical sensors. These Use Cases are used to derive the AI Modules, their input/output data types, and the type of data format standardisation required to achieve the goal.

The Use Cases are:

  1. Integrative analysis of ‘omics datasets
  2. Genomics and phenotypic/spatial data
  3. Genomics and behaviour
  4. Smart Farming

This document is to be read in conjunction with the MPAI-GSA Call for Technologies (CfT) [3] as it provides the functional requirements of all the technologies that have been identified as required to implement the current MPAI-GSA Use Cases. Respondents to the MPAI-GSA CfT should make sure that their responses are aligned with the functional requirements expressed in this document.

This document is structured in 8 chapters, including this Introduction.

Chapter 2 briefly introduces the AI Framework Reference Model and its six Components.
Chapter 3 briefly introduces the 4 Use Cases.
Chapter 4 presents the 4 MPAI-GSA Use Cases with the following structure:

1.     Reference architecture
2.     AI Modules
3.     I/O data of AI Modules
4.     Technologies and Functional Requirements

Chapter 5 analyses the data formats identified by the Use Cases.
Chapter 6 outlines a possible solution.
Chapter 7 gives suggested references.
Chapter 8 gives a basic list of relevant terms and their definitions.

2        The MPAI AI Framework (MPAI-AIF)

Most MPAI applications considered so far can be implemented as a set of AIMs – AI, ML and even traditional Data Processing (DP)-based units with standard interfaces assembled in suitable topologies to achieve the specific goal of an application and executed in an MPAI-defined AI Framework. MPAI is making all efforts to identify processing modules that are re-usable and upgradable without necessarily changing their internal logic. MPAI plans on completing the development of a 1st generation AI Framework called MPAI-AIF in July 2021.

The MPAI-AIF Architecture is given by Figure 1.

 

Figure 1 – The MPAI-AIF Architecture

Where

  1. Management and Control manages and controls the AIMs, so that they execute in the correct order and at the time when they are needed.
  2. Execution is the environment in which combinations of AIMs operate. It receives external inputs and produces the requested outputs, both of which are application-specific, and interfaces with Management and Control and with Communication, Storage and Access.
  3. AI Modules (AIM) are the basic processing elements receiving processing-specific inputs and producing processing-specific outputs.
  4. Communication is required in several cases and can be implemented, e.g., by means of a service bus; it may be used to connect with remote parts of the framework.
  5. Storage encompasses traditional storage and is used, e.g., to store the inputs and outputs of the individual AIMs, data from the AIMs’ states and intermediary results, and data shared among AIMs.
  6. Access represents the access to static or slowly changing data that are required by the application, such as domain knowledge data, data models, etc.
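Since the Components above are specified only at the interface level, a minimal sketch may help fix ideas. The following Python fragment is purely illustrative: the names (Port, AIM, process, ManagementAndControl) are assumptions of this document, not part of the MPAI-AIF specification.

```python
# Minimal sketch of an AIM with standard I/O interfaces and a controller
# executing AIMs in order. All names are illustrative assumptions.
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Any, Dict, List

@dataclass
class Port:
    """A typed input or output of an AIM."""
    name: str
    data_format: str  # e.g. "FASTQ", "VCF", "Tabular/JSON"

class AIM(ABC):
    """Basic processing element: standard interfaces, proprietary internals."""
    inputs: Dict[str, Port]
    outputs: Dict[str, Port]

    @abstractmethod
    def process(self, data: Dict[str, Any]) -> Dict[str, Any]:
        """Map input data (keyed by port name) to output data."""

class ManagementAndControl:
    """Executes AIMs in the correct order at the time they are needed."""
    def __init__(self, aims: List[AIM]):
        self.aims = aims  # assume a pre-computed execution order

    def run(self, external_inputs: Dict[str, Any]) -> Dict[str, Any]:
        data = dict(external_inputs)
        for aim in self.aims:
            data.update(aim.process({k: data[k] for k in aim.inputs}))
        return data
```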

In MPAI-GSA data can be of three types:

  • Primary, i.e., the original unprocessed high-throughput content (such as DNA sequencing or video data)
  • Secondary, i.e., the results of the pre-processing of primary data (such as gene expression estimates or features extracted from video) – applications will typically use these as input rather than primary data
  • Metadata specifying additional information about the biological sample or experiment (such as sample content, cell types and barcodes, collection time and place).

The API provides uniform access to data; in particular, it standardises the definition of the semantics of the different data sources.
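By way of illustration, the sketch below shows how the three data types and a uniform access API might be modelled; every name and field here is a hypothetical assumption, not a normative definition.

```python
# Illustrative sketch of uniform access to the three MPAI-GSA data types.
from enum import Enum
from typing import Dict

class DataType(Enum):
    PRIMARY = "primary"      # e.g. raw DNA-sequencing reads, raw video
    SECONDARY = "secondary"  # e.g. gene expression estimates, video features
    METADATA = "metadata"    # e.g. sample content, barcodes, time and place

class DataSource:
    """Wraps a concrete source and exposes its semantics uniformly."""
    def __init__(self, data_type: DataType, data_format: str,
                 semantics: Dict[str, str]):
        self.data_type = data_type      # one of the three MPAI-GSA types
        self.data_format = data_format  # e.g. "FASTQ", "VCF", "Tabular/JSON"
        self.semantics = semantics      # standardised meaning of each field

sources = {
    "reads": DataSource(DataType.PRIMARY, "FASTQ",
                        {"record": "sequencing read"}),
    "variants": DataSource(DataType.SECONDARY, "VCF",
                           {"record": "genomic variant"}),
    "sample": DataSource(DataType.METADATA, "Tabular/JSON",
                         {"collection_time": "ISO 8601 timestamp"}),
}
```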

Figure 2 is an alternative view of the MPAI AI Framework showing the different roles of the 3 types of data.

Figure 2 – The MPAI-AIF Architecture highlighting 3 data types

Implementing genomic workflows that integrate different data sources, whose combined processing produces the desired result, relies on the availability of standard, machine-actionable data formats.

3        Use Cases

Integrative Genomic/Sensor Analysis uses AI to understand and compress the results of high-throughput experiments combining genomic/proteomic and other data – for instance from video, motion, location, weather, medical sensors.

So far, the following application areas, ranging from personalised medicine to smart farming, have been considered.

3.1       Integrative analysis of ‘omics datasets

In one possible realisation of this use case, one would like to correlate a list of genomic variants present in humans and having a known effect on health (metadata) with the variants present in a specific individual (secondary data). Such variants are derived from sequencing data for the individual (primary data) on which some variant calling workflow has been applied. Additional information derived from transcriptomics (RNA-sequencing, secondary data) might be taken into account. The list of variants could potentially be used to arrive at a personalised therapy.

Notably, there is an increasing number of companies doing just that as their core business. Their products differ by: the choice of the primary processing workflow (how to call variants from the sequencing data for the individual); the choice of the machine learning analysis (how to establish the clinical importance of the variants found); and the choice of metadata (which databases of variants with known clinical effect to use).

3.2       Genomics and phenotypic/spatial data

As an example we take single-cell RNA sequencing. The primary data source is RNA-sequencing performed at the same time on a large number (typically hundreds of thousands) of different cells: while bulk RNA sequencing mixes together RNAs coming from several thousands of different cells, in single-cell RNA sequencing the RNAs coming from each cell are separately barcoded, and hence distinguishable. The DNA barcodes for each cell would be metadata here. Cells can then be clustered together according to the expression patterns present in the secondary data (vectors of expression values for all the species of RNA present in the cell) and, if sufficient metadata and spatial information is present, clusters of expression patterns can be associated with different types/lineages of cells; the technique is typically used to study tissue differentiation. A number of complex algorithms exist to perform primary analysis (statistical uncertainty in single-cell RNA-sequencing is much bigger than in bulk RNA-sequencing) and, in particular, secondary AI-based clustering/analysis. Again, expressing those algorithms in terms of MPAI-GSA would make them much easier to describe and much more comparable. External commercial providers might provide researchers with modules to do all or part of the machine learning analysis.

3.3       Genomics and behaviour

In a typical application of this use case, one would like to correlate animal behaviour (typically of lab mice) with their genetic profile (as in the case of knock-down mice). Another application might be correlating genetic variants with the reaction to drug administration (typically encountered in neurobiology), possibly monitored in real-time with functional MRI scans. Hence primary data would be video data from cameras tracking the animal and/or data from an MRI scanner; secondary data would be processed video data in the form of primitives describing the animal’s movement, well-being, activity, weight, etc.; and metadata would be a description of the genetic background of the animal (for instance, the name of the gene which has been deactivated) or a timeline with the list and amount of drugs which have been administered to the animal. Again, there are several companies providing software tools to perform some or all of such analysis tasks – they might be easily reformulated in terms of MPAI-GSA applications.

3.4       Smart Farming

During the past few years, there has been an increasing interest in data-rich techniques to optimise livestock and crop production (so-called “smart farming”). The range of techniques is constantly expanding, but the main idea is to combine molecular techniques (mainly high-throughput sequencing and derived protocols, such as RNA-sequencing, ChIP-sequencing, HiC, etc., and mass-spectrometry, as per the ‘omics use case above) with monitoring by images (growth rate under different conditions, sensor data, satellite-based imaging) for both livestock species and crops. This use case can thus be seen as a combination of the ‘omics and the genomics-and-behaviour use cases. Primary sources would be genomic data and images; secondary data would be vectors of values for a number of genomic tags and features (growth rate, weight, height) extracted from images; metadata would be information about environmental conditions, spatial position, etc. A growing number of companies are offering services in this area; again, the possibility of deploying those services as MPAI-GSA applications would open up a large arena where academic or commercial providers would be able to meet the needs of a number of customers in a well-defined way.

4        Functional Requirements

4.1       Integrative analysis of ‘omics datasets

4.1.1      Reference architecture

Figure 3 – An example of Integrative analysis of ‘omics datasets

4.1.2      AI Modules

Table 1 – AI Modules of Integrative analysis of ‘omics datasets

AIM | Function
Determine regulation | Derive a regulation model from expression data and genomic functional annotation
Determine significant variants | Identify significant variants from sequencing data and genomic variants, using the regulation model and functional annotation
Determine relevant variants | Select, among the significant variants, those with known clinical significance
Determine actionable variants | Match relevant variants with variant-targeting drugs to derive a personalised therapy

4.1.3      I/O interfaces of AI Modules

Table 2 – I/O data of AIMs

AIM | Input Data | Output Data
Determine regulation | Sample metadata; RNA-sequencing (P); Expression (S); Genomic functional annotation | Regulation model; Genomic functional annotation
Determine significant variants | DNA-sequencing (P); Genomic variants (S); Sample metadata; Regulation model; Genomic functional annotation | Significant variants
Determine relevant variants | Significant variants; Variants with known clinical significance | Relevant variants
Determine actionable variants | Relevant variants; Variant-targeting drugs | Personalised therapy
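To make the topology of Table 2 concrete, here is a hedged Python sketch in which each AIM is represented by a placeholder function; the names and signatures are derived from Table 2 and are illustrative assumptions, not normative interfaces.

```python
# Hedged sketch of the Table 2 topology. Each function stands for an AIM;
# bodies are placeholders, not real implementations.
def determine_regulation(sample_metadata, rna_seq, expression, annotation):
    """AIM 1: derive a regulation model (and refined annotation)."""
    return {}, annotation                      # placeholder outputs

def determine_significant_variants(dna_seq, variants, sample_metadata,
                                   regulation_model, annotation):
    """AIM 2: select significant variants (VCF-like)."""
    return []                                  # placeholder output

def determine_relevant_variants(significant, known_clinical):
    """AIM 3: keep variants with known clinical significance."""
    return [v for v in significant if v in known_clinical]

def determine_actionable_variants(relevant, drugs):
    """AIM 4: match variants with drugs into a personalised therapy."""
    return {"therapy": [drugs.get(v) for v in relevant]}

def omics_pipeline(inp):
    """Chain the four AIMs as in the reference architecture."""
    model, annotation = determine_regulation(
        inp["sample_metadata"], inp["rna_seq"],
        inp["expression"], inp["annotation"])
    significant = determine_significant_variants(
        inp["dna_seq"], inp["variants"], inp["sample_metadata"],
        model, annotation)
    relevant = determine_relevant_variants(
        significant, inp["known_clinical_variants"])
    return determine_actionable_variants(relevant, inp["variant_drugs"])
```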

4.1.4      Technologies and Functional Requirements

Table 3 – Data types and formats

Data type | Format
DNA-sequencing (P) | FASTQ/SAM
Expression (S) | Tabular/Matrix
Genomic functional annotation | GTF/GFF
Genomic variants (S) | VCF
Personalised therapy | Tabular/JSON/Ontology
Regulation model | Tabular/JSON/Ontology
Relevant variants | VCF
RNA-sequencing (P) | FASTQ/SAM
Sample metadata | Tabular/JSON/Ontology
Significant variants | VCF
Variants with known clinical significance | VCF
Variant-targeting drugs | Tabular/JSON/Ontology

4.2       Genomics and phenotypic/spatial data

4.2.1      Reference architecture

Figure 4 – An example of Genomic and Phenotypic/spatial data

4.2.2      AI Modules

Table 4 – AI Modules of Genomics and phenotypic/spatial data

AIM Function
   
   
   
   

4.2.3      I/O interfaces of AI Modules

Table 5 – I/O data of Genomics and phenotypic/spatial data AIMs

AIM Input Data Output Data
     
     
     
     

4.2.4      Technologies and Functional Requirements

Table 6 – Data types and formats

Data type Format
   
   
   
   
   
   
   
   
   
   
   
   

4.3       Genomics and behaviour

4.3.1      Reference architecture

Figure 5 – An example of Genomics and Behaviour

4.3.2      AI Modules

Table 7 – AI Modules of Genomics and behaviour

AIM Function
   
   
   

4.3.3      I/O interfaces of AI Modules

Table 8 – I/O data of Genomics and behaviour AIMs

AIM Input Data Output Data
     
     
     
     

4.3.4      Technologies and Functional Requirements

Table 9 – Data types and formats

Data type Format
   
   
   
   
   
   
   
   
   
   
   
   

 

4.4       Smart Farming

4.4.1      Reference architecture

Figure 6 – An example of Smart Farming

4.4.2      AI Modules

Table 10 – AI Modules of Smart Farming

AIM Function
   
   
   

4.4.3      I/O interfaces of AI Modules

Table 11 – I/O data of Smart Farming AIMs

AIM Input Data Output Data
     
     
     
     

4.4.4      Technologies and Functional Requirements

Table 12 – Data types and formats

Data type Format
   
   
   
   
   
   
   
   
   
   
   
   

5        Data formats

Broadly speaking, the data formats identified by the use cases fall under three categories:

  1. Genomic/sequencing/proteomic data
  2. Video/audio/sensor data
  3. Metadata and other weakly structured data. Examples would be drug databases, pathway/metabolic/growth models, behavioural annotations, information about samples and experiments, and the inputs/outputs of secondary analyses themselves. Such information is often presented in tabular format, but without a defined way of associating the rows/columns with their semantics (see, e.g., differential regulation for RNA-sequencing experiments).

In the following sections such categories are analysed in more detail.

5.1       Genomic/sequencing/proteomic data

Data type | Format | Identified solution
Sequencing reads | FASTA | MPEG-G parts 1/2
Genomic references | FASTA | MPEG-G parts 1/2
Genomic assemblies | FASTA | MPEG-G parts 1/2
Sequencing reads | FASTQ | MPEG-G parts 1/2
Aligned data | SAM | MPEG-G parts 1/2
Genomic functional annotations | GFF/GTF | MPEG-G part 6
Genomic variants | VCF | MPEG-G part 6
Genomic tracks | BigWig | MPEG-G part 6
Genomic assemblies | Graph formats | MPEG-G part 6
Genomic contacts | (Sparse) matrix formats | MPEG-G part 6
Expression data | Tabular | MPEG-G part 6

5.1.1      Genomic assemblies, assembly graph

Usage domain | Assembly graphs; graph-like genome references
Semantics | Express a string graph (a set of sequences which are partially overlapping), such as NCBI’s ASN
Requirements | Ability to represent and query a string graph, either standalone or as a combination of formats
Possible solutions | Standardise ASN (not ideal); FASTA for the edges combined with a tabular representation of the nodes
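As an illustration of the second option, a string graph could be represented as FASTA-style sequences for the edges plus a table of overlaps. The sketch below (with assumed field names, not a proposed standard) shows the idea and a trivial query.

```python
# Minimal sketch of "FASTA for edges + tabular nodes" for a string graph.
from dataclasses import dataclass
from typing import Dict, List, Tuple

@dataclass
class StringGraph:
    sequences: Dict[str, str]             # edge id -> sequence (from FASTA)
    overlaps: List[Tuple[str, str, int]]  # (edge_a, edge_b, overlap length)

    def successors(self, edge_id: str) -> List[str]:
        """Query: which sequences overlap the end of `edge_id`?"""
        return [b for a, b, _ in self.overlaps if a == edge_id]

g = StringGraph(
    sequences={"e1": "ACGTACGT", "e2": "ACGTTTTT"},
    overlaps=[("e1", "e2", 4)],  # e1's last 4 bases overlap e2's first 4
)
assert g.successors("e1") == ["e2"]
```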

5.1.2      Proteomic/spatial proteomic data

Data type  
Usage domain
Semantics
Requirements
Possible solutions

5.1.3      Smart farming data

Usage domain
Semantics
Requirements
Possible solutions

5.2       Video/audio/sensor data

5.2.1      Audio/video data

Data type | Format | Identified solution
Experiment recording | Audio/video formats | MPEG video/audio formats; common with MPAI-CAE
Association between events and AV streams | Subtitle-like formats | MPEG video/audio file formats; common with MPAI-CAE?

5.2.2      Location/satellite

Usage domain | Experiment recording
Semantics | Coordinates on the surface of the Earth and additional collection metadata
Requirements | Ability to represent the point where the experiment is carried out with an accuracy adequate to the application (which might vary from the lab to smart farming)
Possible solutions | Tabular
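For instance, a tabular solution might look like the following sketch; the column names are illustrative assumptions only.

```python
# Illustrative tabular record for an experiment's location.
import csv, io

rows = [
    {"experiment_id": "exp-001", "latitude": 55.9533, "longitude": -3.1883,
     "accuracy_m": 0.5, "collected_at": "2021-03-01T09:30:00Z"},
]
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=rows[0].keys())
writer.writeheader()
writer.writerows(rows)
print(buf.getvalue())
```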

 

5.2.3      MRI-like data

Usage domain | Data from (functional, …) MRI experiments
Semantics | 3D or 4D images, together with experimental metadata
Requirements | Ability to represent voxel-based spatial imaging information, possibly with time courses
Possible solutions | An existing imaging format (e.g. DICOM, as used in PACS systems) plus an MPAI-defined metadata schema

5.3       Metadata/weakly structured data

5.3.1      Metadata

Usage domain | All use cases
Semantics | Metadata about the collection of experimental data
Requirements | Ability to describe:
• Sample
• Collector
• Collection date and place
• Collection or generation experimental methodology
• Generating experiment
• Relations of the sample with its generating experiment (time series, hierarchical sub-category)
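A JSON-style record meeting these requirements might look as follows; every field name is an assumption of this note, not a proposed schema.

```python
# Hypothetical instantiation of the sample/experiment metadata requirements.
sample_metadata = {
    "sample": {"id": "S-0042", "content": "liver tissue"},
    "collector": "J. Doe, example institute",
    "collection": {"date": "2021-02-15", "place": "lab A"},
    "methodology": "bulk RNA-sequencing, poly-A selection",
    "generating_experiment": "EXP-7",
    "relations": {"type": "time series", "position": 3},
}
```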

5.3.2      Models (metabolic, behavioural)

 

Usage domain | All use cases
Semantics | A model generated out of experimental data and describing relations between samples and/or other biological concepts
Requirements | Ability to describe:
• Scope of the model
• Relations between the different components of the model (clusters, sets, graphs, conditions)
• Relations between model components and time

5.3.3      Audio/video events

Usage domain | Video/audio/sensor
Semantics | Describing features extracted from 2D/3D/4D video/audio/sensor data
Requirements | Ability to describe:
• The nature/ontology of the event
• Spatial/temporal characteristics of the event (ROI, duration)
• Placement of the event within 2D/3D/4D video/audio/sensor streams
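A hedged sketch of an event descriptor satisfying these requirements follows; the ontology reference and field names are placeholders, not a defined vocabulary.

```python
# Hypothetical audio/video event descriptor: ontology reference,
# spatio-temporal extent, and placement within the stream.
av_event = {
    "ontology": {"scheme": "example-behaviour-ontology", "term": "grooming"},
    "roi": {"x": 120, "y": 80, "width": 64, "height": 64},  # spatial extent
    "time": {"start_s": 12.4, "duration_s": 3.1},           # temporal extent
    "stream": {"uri": "experiment-17/camera-2.mp4", "track": 0},
}
```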

5.3.4      Secondary inputs/outputs of AIMs

Usage domain | All use cases
Semantics | Describing secondary inputs, or outputs, of AIMs in terms of components and ontologies
Requirements | Ability to describe:
• The inputs/outputs in terms of their components (spatial/temporal dimensions, combination of channels)
• The ontology of each component/channel

Partially in common with MPAI-AIF?

6        Possible solution

All such categories of data can be represented as a tree-like data structure (which could be expressed in a JSON-like format) combined with an ontology expressing the nature of the nodes of the tree.

For instance, in the case of the outputs of an AIM expressing differential regulation estimated from an RNA-sequencing experiment and other data, the representation might be something like:

  • For each time point:
    • Time
    • For each sample:
      • For each feature in the payload:
        • Name
        • Unit of measurement
        • Ontology
      • Sample name
      • Sample collection time
      • Sample collection place
      • More information about the sample (collector, etc.)
      • Category of sample
      • For each gene:
        • Estimated expression value
      • For each couple of sample sets:
        • Set of samples 1
        • Set of samples 2
        • For each result of the experiment:
          • Name
          • Unit of measurement
          • Ontology
        • For each DR gene:
          • Estimated log-fold change
          • Estimated FDR/p-value for the fold-change

The meta-information about the data structure (the Name, Unit of measurement and Ontology descriptors above) might be stored separately or embedded in the data structure itself. Given that information, it would be possible to query such data structures.
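By way of example, the sketch below instantiates such a tree in Python/JSON form, embeds the meta-information, and shows a toy query; the structure, field names and ontology term are illustrative assumptions.

```python
# Sketch of the proposed tree-like structure with embedded meta-information
# (Name / Unit of measurement / Ontology), plus a toy query over it.
experiment = {
    "schema": [  # meta-information about the per-gene payload
        {"name": "expression", "unit": "TPM",
         "ontology": "example:gene-expression"},  # placeholder term
    ],
    "time_points": [{
        "time": "0h",
        "samples": [{
            "name": "S1", "collection_time": "2021-02-15T09:00Z",
            "collection_place": "lab A", "category": "control",
            "genes": {"TP53": {"expression": 12.3},
                      "BRCA1": {"expression": 0.7}},
        }],
    }],
}

def query(tree, time, sample, gene, feature):
    """Resolve a value by walking the tree; semantics come from `schema`."""
    for tp in tree["time_points"]:
        if tp["time"] == time:
            for s in tp["samples"]:
                if s["name"] == sample:
                    return s["genes"][gene][feature]

assert query(experiment, "0h", "S1", "TP53", "expression") == 12.3
```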

This suggests that defining an association between each example and an adequate meta-data schema might be sufficient to provide a satisfactory solution.

7        References

  1. MPAI-AIF Use Cases and Functional Requirements
  2. MPAI-AIF Call for Technologies
  3. MPAI-GSA Call for Technologies

8        Terms and definitions

 



MPAI Application Note #2

Integrative genomic/sensor analysis (MPAI-GSA)

Proponent: Paolo Ribeca (BioSS/James Hutton)

Description: Most experiments in quantitative genomics consist of a setup whereby a small amount of metadata – observable clinical score or outcome, desirable traits, observed behaviour – is correlated with, or modelled from, a set of data-rich sources. Such sources can be:

  1. Biological experiments – typically sequencing or proteomics/metabolomics data
  2. Sensor data – coming from images, movement trackers, etc.

All these data-rich sources share the following properties:

  1. They produce very large amounts of “primary” data as output
  2. They need “primary”, experiment-dependent analysis in order to project the primary data (1) onto a single point in a “secondary”, processed space with a high dimensionality – typically a vector of thousands of values
  3. The resulting vectors, one for each experiment, are then fed to some machine or statistical learning framework, which correlates such high-dimensional data with the low-dimensional metadata available for the experiment. The typical purpose is to either model the high-dimensional data in order to produce a mechanistic explanation for the metadata, or to produce a predictor for the metadata out of the high-dimensional data.
  4. Although that is not typically necessary, in some circumstances it might be useful for the statistical or machine learning algorithm to be able to go back to the primary data (1), in order to extract more detailed information than what is available as a summary in the processed high-dimensional vectors produced in (2).

Providing a uniform framework to:

  1. Represent the results of such complex, data-rich, experiments, and
  2. Specify the way the input data is processed by the statistical or machine learning stage

would be extremely beneficial.
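The shape of such a framework can be suggested with a minimal Python sketch; all names are illustrative, and the feature extraction shown is a stand-in, not a real primary analysis.

```python
# Toy sketch of the setup described above: primary data is projected by a
# primary analysis onto a high-dimensional secondary vector, which a
# learning stage then correlates with the low-dimensional metadata.
def primary_analysis(raw_reads):
    """Stand-in, experiment-dependent projection of primary data."""
    return [float(len(r)) for r in raw_reads]  # placeholder feature vector

def learning_stage(secondary_vectors, metadata):
    """Placeholder for the machine/statistical learning framework."""
    ...

experiments = [
    (["ACGT", "ACG"], {"outcome": "responder"}),      # (primary, metadata)
    (["AC", "ACGTA"], {"outcome": "non-responder"}),
]
vectors = [primary_analysis(reads) for reads, _ in experiments]
learning_stage(vectors, [meta for _, meta in experiments])
```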

Comments: Although the structure above is common to a number of experimental setups, it is conceptual and never made explicit. Each “primary” data source can consist of heterogeneous information represented in a variety of formats, especially when genomics experiments are considered, and the same source of information is usually represented in different ways depending on the analysis stage – primary or secondary. That results in ad-hoc data processing workflows: two experiments combining different sets of sources will require two different workflows, each able to process a specific combination of input/output formats. Typically, such workflows will also be laid out as a sequence of completely separated stages of analysis, which makes it very difficult for the machine or statistical learning stage to go back to primary data when that would be necessary.

MPAI-GSA aims to create an explicit, general and reusable framework to express as many different types of complex integrative experiments as possible. That would provide (I) a compressed, optimized and space-efficient way of storing large integrative experiments, but also (II) the possibility of specifying the AI-based analysis of such data (and, possibly, primary analysis too) in terms of a sequence of pre-defined standardized algorithms. Such computational blocks might be partly general and prior-art (such as standard statistical algorithms to perform dimensional reduction) and partly novel and problem-oriented, possibly provided by commercial partners. That would create a healthy arena whereby free and commercial methods could be combined in a number of application-specific “processing apps”, thus generating a market and fostering innovation. A large number of actors would ultimately benefit from the MPAI-GSA standard – researchers performing complex experiments, companies providing medical and commercial services based on data-rich quantitative technologies, and the final users who would use instances of the computational framework as deployed “apps”.

Examples

The following examples describe typical uses of the MPAI-GSA framework.

  1. Medical genomics – sequencing and variant-calling workflows

In this use case, one would like to correlate a list of genomic variants present in humans and having a known effect on health (metadata) with the variants present in a specific individual (secondary data). Such variants are derived from sequencing data for the individual (primary data) on which some variant calling workflow has been applied. Notably, there is an increasing number of companies doing just that as their core business. Their products differ by: the choice of the primary processing workflow (how to call variants from the sequencing data for the individual); the choice of the machine learning analysis (how to establish the clinical importance of the variants found); and the choice of metadata (which databases of variants with known clinical effect to use). It would be easy to re-deploy their workflows as MPAI-GSA applications.

  2. Integrative analysis of ‘omics datasets

In this use case, one would like to correlate some macroscopic variable observed during a biological process (e.g. the reaction to a drug or a vaccine – metadata) with changes in tens of thousands of cell markers (gene expression estimated from RNA; amount of proteins present in the cell – secondary data) measured through a combination of different high-throughput quantitative biological experiments (primary data – for instance, RNA-sequencing, ChIP-sequencing, mass spectrometry). This is a typical application in research environments (medical, veterinary and agricultural). Both primary and secondary analysis are performed with a variety of methods depending on the institution and the provider of bioinformatics services. Reformulating such methods in terms of MPAI-GSA would help reproducibility and standardisation immensely. It would also provide researchers with a compact way to store their heterogeneous data.

  3. Single-cell RNA-sequencing

Similar to the previous one, but in this case at least one of the primary data sources is RNA-sequencing performed at the same time on a large number (typically hundreds of thousands) of different cells – while bulk RNA sequencing mixes together RNAs coming from several thousands of different cells, in single-cell RNA sequencing the RNAs coming from each cell are separately barcoded, and hence distinguishable. The DNA barcodes for each cell would be metadata here. Cells can then be clustered together according to the expression patterns present in the secondary data (vectors of expression values for all the species of RNA present in the cell) and, if sufficient metadata is present, clusters of expression patterns can be associated with different types/lineages of cells – the technique is typically used to study tissue differentiation. A number of complex algorithms exist to perform primary analysis (statistical uncertainty in single-cell RNA-sequencing is much bigger than in bulk RNA-sequencing) and, in particular, secondary AI-based clustering/analysis. Again, expressing those algorithms in terms of MPAI-GSA would make them much easier to describe and much more comparable. External commercial providers might provide researchers with clever modules to do all or part of the machine learning analysis.

  4. Experiments correlating genomics with animal behaviour

In this use case, one wants to correlate animal behaviour (typically of lab mice) with their genetic profile (case of knock-down mice) or the previous administration of drugs (typically encountered in neurobiology). Hence primary data would be video data from cameras tracking the animal; secondary data would be processed video data in the form of primitives describing the animal’s movement, well-being, activity, weight, etc.; and metadata would be a description of the genetic background of the animal (for instance, the name of the gene which has been deactivated) or a timeline with the list and amount of drugs which have been administered to the animal. Again, there are several companies providing software tools to perform some or all of such analysis tasks – they might be easily reformulated in terms of MPAI-GSA applications.

  5. Spatial metabolomics

One of the most data-intensive biological protocols nowadays is spatial proteomics, whereby in-situ mass-spec/metabolomics techniques are applied to “pixels”/“voxels” of a 2D/3D biological sample in order to obtain proteomics data at different locations in the sample, typically with sub-cellular resolution. This information can also be correlated with pictures/tomograms of the sample, to obtain phenotypical information about the nature of the pixel/voxel. The combined results are typically analysed with AI-based techniques. So primary data would be unprocessed metabolomics data and images, secondary data would be processed metabolomics data and cellular features extracted from the images, and metadata would be information about the sample (source, original placement within the body, etc.). Currently the processing of spatial metabolomics data is done through complex pipelines, typically in the cloud – having these as MPAI-GSA applications would be beneficial to both the researchers and potential providers of computing services.

  6. Smart farming

During the past few years, there has been an increasing interest in data-rich techniques to optimise livestock and crop production (so called “smart farming”). The range of techniques is constantly expanding, but the main ideas are to combine molecular techniques (mainly high-throughput sequencing and derived protocols, such as RNA-sequencing, ChIP-sequencing, HiC, etc.; and mass-spectrometry – as per the ‘omics case at point 2) and monitoring by images (growth rate under different conditions, sensor data, satellite-based imaging) for both livestock species and crops. So this use case can be seen as a combination of cases 2 and 4. Primary sources would be genomic data and images; secondary data would be vectors of values for a number of genomic tags and features (growth rate, weight, height) extracted from images; metadata would be information about environmental conditions, spatial position, etc. A growing number of companies are offering services in this area – again, having the possibility of deploying them as MPAI-GSA applications would open up a large arena where academic or commercial providers would be able to meet the needs of a number of customers in a well-defined way.

 Requirements:

MPAI-GSA should provide support for the storage of, and access to:

  • Unprocessed genomic data from the most common sources (reference sequences, short and long sequencing reads)
  • Processed genomic data in the form of annotations (genomic models, variants, signal tracks, expression data). Such annotations can be produced as the result of primary analysis of the unprocessed data or come from external sources
  • Video data both unprocessed and processed (extracted features, location, movement tracking)
  • Sensor data both unprocessed (such as GPS position tracking, series describing temperature/humidity/weather conditions, general input of multi-channel sensors) and processed.
  • Experiment meta-data (such as collection date and place; classification in terms of a number of user-selected categories for which a discrete or continuous value is available)
  • The semantic description of the ontology of all the considered sources.

MPAI-GSA should also provide support for:

  • The combination into a general analysis workflow of a number of computational blocks that access processed, and possibly unprocessed, data as input channels, and produce output as a sequence of vectors in a space of arbitrary dimension. Combination would be done in terms of nodes (processing blocks and adaptors [blocks that return as output a subset of the input channels, nodes that replicate the input as output several times]) and a connection graph (see the sketch after this list)
  • The possibility of defining and implementing a novel processing block from scratch in terms of either some source code or a proprietary binary codec
  • A number of pre-defined blocks that implement well-known analysis methods (such as PCA, MDS, CA, SVD, NN-based methods, etc.).
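A workflow of this kind might be declared as in the following sketch, where the node kinds, methods and edge list are assumed names used only for illustration.

```python
# Hypothetical declaration of an analysis workflow: processing blocks,
# adaptors, and a directed connection graph between them.
workflow = {
    "nodes": {
        "select":   {"kind": "adaptor", "keep_channels": ["expression"]},
        "pca":      {"kind": "block",   "method": "PCA", "n_components": 10},
        "tee":      {"kind": "adaptor", "replicate": 2},
        "classify": {"kind": "block",   "method": "NN", "layers": [128, 2]},
    },
    "edges": [               # connection graph (source -> destination)
        ("select", "pca"),
        ("pca", "tee"),
        ("tee", "classify"),
    ],
}
```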

Object of standard: A high-level, schematic description of the standard can be found in the following figure:

Figure 1 A reference diagram for MPAI-GSA

Currently, three areas of standardization are identified:

  1. Interface to define the machine/statistical learning in terms of basic algorithmic blocks:
    1. Ability to define basic computational blocks that take a (sub)set of inputs and produce a corresponding set of outputs (the cardinalities of the sets can be different). More in detail:
      1. The way of defining a block will be mandated by the standard (in particular, a unique ID might be issued to the implementer by a centralised authority)
      2. A basic computational block will need to define and export its computational requirements in terms of CPU and RAM
      3. Some blocks might specify their preferred execution model, for instance some specific cloud computation platform
      4. Some pre-defined blocks might be provided by the standard, for instance the implementation of a few well-known methods to perform dimensional reduction (PCA, MDS, etc.).
    2. Ability to create processing pipelines in terms of basic algorithmic blocks and to define the associated control flow required to perform the full computation
    3. A standardised output format
    4. Whenever possible, interoperability with established technologies (e.g. TensorFlow).
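For instance, a block definition along the lines of points 1.1 and 1.2 above might export a manifest like the following sketch; every field name here is an assumption, not part of any MPAI specification.

```python
# Hypothetical manifest exported by a basic computational block: issued ID,
# CPU/RAM requirements, preferred execution model, and typed I/O.
block_manifest = {
    "id": "org.example.dimred.pca#1.0",  # unique ID, e.g. issued centrally
    "requirements": {"cpu_cores": 4, "ram_gb": 8},
    "preferred_execution": "local",      # or a specific cloud platform
    "inputs":  [{"name": "matrix", "semantics": "samples x features"}],
    "outputs": [{"name": "components", "semantics": "samples x n_components"}],
}
```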

The components for this area will very likely be provided by MPAI-AIF. MPAI-AIF proposes a general, standardised framework that can be used to specify computational workflows based on machine learning in a number of scenarios. It will provide core functionality to several of the MPAI standard proposals currently under consideration.

  2. Interface for the machine/statistical learning to access processed (secondary) data: A first set of input and output signals, with corresponding syntax and semantics, to process secondary data (i.e., the results of processing primary data) with methods based on machine learning. They have the following shared properties:
    1. Irrespective of their source (genomic or sensor) all the inputs to the AI processor are expressed as vectors of values of different number types
    2. Information about the semantics of each input (which source produced which input, and which biological entity/feature each input corresponds to) is standardised and provided through metadata
    3. In order to provide fast access and search capabilities, an index for the metadata is provided.
  3. Interface for the machine/statistical learning to access unprocessed (primary) data with the following features:
    1. Definition of the semantics of the primary data accessible for a large number of data types (especially genomic – see ISO/IEC 23092 part 6 for an indicative list)
    2. Information on the methods used to process the primary data into secondary with a clearly defined semantics is standardised and provided through metadata
    3. Possibly, ability to define basic computational blocks for primary analysis similar to those at (1) and their combination into complex computations, in order to re-process primary data whenever needed by (2)
    4. The processing pipeline may be a combination of local and in-cloud processing.
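Taken together, points 2 and 3 suggest an access layer along the following lines; the class and method names are assumptions made here for illustration.

```python
# Sketch of the secondary-data interface: every input is a numeric vector,
# and a metadata index maps feature semantics to vector positions.
from typing import Dict, List

class SecondaryData:
    def __init__(self, vectors: Dict[str, List[float]],
                 index: Dict[str, Dict[str, int]]):
        self.vectors = vectors  # source name -> vector of values
        self.index = index      # source name -> {feature name -> position}

    def value(self, source: str, feature: str) -> float:
        """Fast, semantics-aware lookup through the metadata index."""
        return self.vectors[source][self.index[source][feature]]

data = SecondaryData(
    vectors={"rna_seq": [12.3, 0.7]},
    index={"rna_seq": {"TP53": 0, "BRCA1": 1}},
)
assert data.value("rna_seq", "TP53") == 12.3
```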

On the other hand, we would not like to re-standardise the representation of primary data! For instance, part 6 of the MPEG-G format (ISO/IEC 23092) already standardises metadata for most of the techniques employed in the field of genomics under the unifying concept of genomic annotations. Part 3 already offers a clear API through which primary sequencing data can be accessed. In order to avoid effort replication, MPAI-GSA might represent genomic data as MPEG-G encoded data. Similar possibilities might be considered for video primary sources.

Benefits: MPAI-GSA will offer a number of advantages to a number of actors:

  1. Technology providers and researchers will have available a robust, tested framework within which to develop modular applications that are easy to define, deploy and modify. They will no longer need to spend time and resources on implementing access to a number of basic data formats or the mechanics of a unified access to heterogeneous resources – they will be able to focus fully on the development of their machine learning methods. The methods will be clearly defined in terms of computational modules, some of which can be provided by commercial third parties. This will drastically improve reproducibility, which is an increasing problem with the current biological research based on big data. Offering a robust, well-defined framework will also lower the amount of resources needed to enter the market and help smaller actors to generate and implement competitive ideas
  2. Equipment manufacturers and application vendors can choose from the set of technologies made available through MPAI-GSA standard by different sources, integrate them and satisfy their specific needs
  3. Service providers will have available a growing number of MPAI-GSA applications able to solve different categories of data analysis problems. As all such applications are clearly expressed in terms of a reproducible standard rather than being developed and hosted opaquely on some closed corporate computing infrastructure, comparing the different applications and offering different options to customers will become much simpler
  4. End users will enjoy a thriving, competitive market that provides an ever-growing number of quality solutions by an ever-growing number of actors. The general availability of such a powerful technology will hopefully make widespread applications that today require research computational equipment and personnel, such as clinically oriented genetic/genomic analysis.

Bottlenecks: In order to fully exploit the potential of MPAI-GSA, one will need widespread availability of computing power, in particular for applications comprising steps whereby primary data is processed into secondary data. That would typically require the ability to perform some of those computational steps in the cloud, as few users have access to enough resources. AI-friendly processing units able to implement and speed up secondary analysis would also help, perhaps allowing MPAI-GSA applications based only on secondary analysis to run on commodity devices such as mobile phones.

 Social aspects: Genomic applications partially based on the phone might facilitate social uses of the technology (such as receiving and exploring the results of genetic tests, or establishing genealogies).

Success criteria: Data-rich applications are the future of a number of disciplines, in particular life sciences, personalised medicine and the one-health approach – whereby humans, livestock, farming and ultimately the whole terrestrial ecosystem are seen as an integrated system, with each part influencing the rest. At the moment, however, creating analysis workflows able to exploit the data is a painstaking ad-hoc process which requires sizeable investments in technology and development. Most of the time the effort cannot be reused, as by definition the applications are problem-specific. MPAI-GSA will be successful if it can create a framework facilitating the development of modular, reusable analysis workflows that on one hand trivialise data storage and on the other hand streamline the creation of complex methods. Its success will be defined by its ability to attract a number of actors – researchers, commercial providers of computational solutions and analytical services, end users. The thriving ecosystem of applications thus generated will be a necessary ingredient to transparently integrate data-rich technologies for the life sciences into common practice, widespread appliances, and everyday life.