Here is a diagram of my end-to-end vision for a data platform, which I presented at the Bioinformatics Strategy Meeting Europe (London, 12/07/2016).
Data feeders, such as data capture devices, field sensors, IoT devices and genomic sequencers, generate data that goes into a Working Data Store.
The Working Data Store (on premises) could use technologies such as Hadoop. Ideally data sets should be identified through a DOI-like system. Data owners/authors should be identified through ORCID. The Working Data Store would feed a Reference Data Store.
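To make the identification idea concrete, here is a minimal sketch of a data set record that carries a DOI-like identifier and the owner's ORCID iD. The `DatasetRecord` class, its field names and the `10.99999/...` identifier are hypothetical and only illustrate the idea. The check-digit routine follows the ISO 7064 MOD 11-2 scheme that ORCID publishes for its iDs, and the iD used in the example is the sample one from ORCID's own documentation.

```python
from dataclasses import dataclass


def orcid_checksum_ok(orcid: str) -> bool:
    """Return True if the last character matches the ISO 7064 MOD 11-2 check digit."""
    chars = orcid.replace("-", "")
    body, check = chars[:15], chars[15:]
    if len(chars) != 16 or not (body.isascii() and body.isdigit()):
        return False
    total = 0
    for ch in body:
        total = (total + int(ch)) * 2
    result = (12 - total % 11) % 11
    expected = "X" if result == 10 else str(result)
    return check.upper() == expected


@dataclass(frozen=True)
class DatasetRecord:
    """Hypothetical metadata record for a data set in the Working Data Store."""

    dataset_id: str   # DOI-like identifier, assumed to have the form "prefix/suffix"
    owner_orcid: str  # ORCID iD of the data owner/author
    title: str

    def __post_init__(self) -> None:
        prefix, sep, suffix = self.dataset_id.partition("/")
        if not (prefix and sep and suffix):
            raise ValueError(f"not a DOI-like identifier: {self.dataset_id!r}")
        if not orcid_checksum_ok(self.owner_orcid):
            raise ValueError(f"invalid ORCID iD: {self.owner_orcid!r}")


# Example with a made-up identifier and ORCID's documented sample iD.
record = DatasetRecord(
    dataset_id="10.99999/example.dataset.1",
    owner_orcid="0000-0002-1825-0097",
    title="Example sequencing run",
)
```

Validating identifiers when a record is created, rather than later, keeps malformed ownership information out of the store before it feeds the Reference Data Store.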
Electronic Lab Notebook systems and Lab Information Management Systems would also generate data, feeding the Working Data Store and potentially the Reference Data Store.
The Reference Data Store (on premises) has all the characteristics associated with the management of active data sets.
Other significant data sources are the systems of reference in the organisation (ECM, ERP, CRM, etc.), as well as cloud-based data sources.
A federated data repository interface is the mechanism for gaining access to data across the different repositories (Working and Reference Data Stores, Systems of Reference and cloud-based data sources).
The interface would also offer access to the on-site compute facilities, as well as to cloud-based compute facilities (Amazon, Azure, etc.).
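The data-access side of such a federated interface could be sketched roughly as follows. This is only an illustration under assumptions: the `Repository` protocol, the `InMemoryRepository` stand-in, the `FederatedInterface` class and the `ds-...` identifiers are all hypothetical names, and a real implementation would wrap the actual stores (and the compute facilities, which are not covered here).

```python
from typing import Dict, List, Protocol


class Repository(Protocol):
    """What the federated interface assumes every repository can do."""

    def list_datasets(self) -> List[str]: ...

    def fetch(self, dataset_id: str) -> bytes: ...


class InMemoryRepository:
    """Stand-in for a real store (Working, Reference, cloud-based, etc.)."""

    def __init__(self, datasets: Dict[str, bytes]) -> None:
        self._datasets = dict(datasets)

    def list_datasets(self) -> List[str]:
        return sorted(self._datasets)

    def fetch(self, dataset_id: str) -> bytes:
        return self._datasets[dataset_id]


class FederatedInterface:
    """Single entry point that routes requests to the registered repositories."""

    def __init__(self) -> None:
        self._repositories: Dict[str, Repository] = {}

    def register(self, name: str, repository: Repository) -> None:
        if name in self._repositories:
            raise ValueError(f"repository already registered: {name}")
        self._repositories[name] = repository

    def catalogue(self) -> Dict[str, str]:
        """Map each data set id to the first registered repository holding it."""
        found: Dict[str, str] = {}
        for name, repository in self._repositories.items():
            for dataset_id in repository.list_datasets():
                found.setdefault(dataset_id, name)
        return found

    def fetch(self, dataset_id: str) -> bytes:
        """Return the data set from whichever repository holds it."""
        for repository in self._repositories.values():
            if dataset_id in repository.list_datasets():
                return repository.fetch(dataset_id)
        raise KeyError(dataset_id)


# Example: two repositories behind one interface.
federation = FederatedInterface()
federation.register("working", InMemoryRepository({"ds-001": b"raw reads"}))
federation.register("reference", InMemoryRepository({"ds-002": b"curated set"}))
data = federation.fetch("ds-002")
```

Callers only ever talk to `federation`, so analytics tools and the Web APIs do not need to know which store holds a given data set.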
Through a set of Web APIs, the interface would expose the data as needed to a web platform for publication.
On-site analytics and visualisation tools (e.g. R, MATLAB, Galaxy) would access the data through the same interface.
Cloud-based analytics and visualisation tools (such as semantic/knowledge-based languages with rich, predefined functions and models for analysis across different domains) would also interact with this data through the federated interface. In addition, these would link to cloud-based knowledge repositories directly. This category also includes simpler tools that are available more readily across the enterprise – e.g. Power BI.
Some of the challenges regarding data analytics are:
- Availability of tools across the organisation:
  - skills needed, and learning curve;
  - scalability and performance.
- Data availability across the organisation:
  - how fit for purpose the data is, and whether it is granular enough;
  - compliance requirements;
  - data quality.
- Interfacing the tools with the data (particularly relevant as some of the cloud tools are in their infancy).
This article and the diagrams/images included, by Florentin Albu, are licensed under a Creative Commons Attribution 4.0 International License.