2. Delivery Plan
2.5 Infrastructure: Data generation, integration and analysis
Empowering the environmental research community to exploit the opportunities provided by innovation in omics requires the provision of experts and enablers: access to an expert skill base, together with training and tools to support independent capability. Defining a one-size-fits-all strategy that addresses the infrastructural requirements for environmental omics is challenging, given several variables: (a) the rate of technological advance; (b) access to appropriate local infrastructure; (c) the maturity of different analytical platforms; (d) the diversity of applications; and (e) the expertise level of researchers and research communities.
New platforms appear and technologies are retired at an unprecedented rate. This raises two issues: the continued capital investment needed to provide researchers with access to state-of-the-art technologies, and the dissemination of best practice for the use of these new technologies. The institutional provision of omics infrastructure is extremely variable, with some researchers having access through dedicated local facilities whilst others rely on provision by regional or national centres. The area of genomics has matured rapidly, whilst other disciplines, such as metabolomics and proteomics, have not experienced the same developmental trajectory, leaving expertise and equipment at a premium. The diversity of applications mirrors the strength and breadth of environmental research, with projects ranging from the transcriptomic analysis of organisms from extreme environments to the characterisation of DNA from 8,000-year-old sediments. eDNA, dietary DNA and aDNA all require specialist facilities, robust sample protocols and rigorous data analysis pipelines. Expertise in both the physical preparation of samples and the analysis of the resulting data is extremely inconsistent, with some of the best environmental researchers having world-leading expertise in their fields whilst being inexperienced in the application of omics to them. It is essential that the environmental community develops an infrastructure network to share best practice and facilitate access to specialist facilities and expertise.
Addressing this heterogeneity in availability represents the key strategic infrastructure requirement of ‘Environmental Omics’. Five approaches are key to delivering enhanced research outputs:
a) provision of national centres of excellence with state-of-the-art platforms for data generation, analytical pipelines (established and new) and an expert knowledge base;
b) delivery of training (informatics and wet-lab) to expand the knowledge/skill base within the environmental community;
c) development of tools to support community empowerment and democratisation of omics;
d) access to informatic capability and capacity to support analytical requirements;
e) promotion of collaboration to share expertise and support interdisciplinarity.
However, any infrastructural investment must build on the established national capability, providing support for environment-specific resources and developing fields, whilst exploiting regional or national centres and commercial providers where appropriate.
Access to DNA/RNA sequencing capacity is not a current limitation to research aspirations. However, selecting the appropriate technology (platforms and methodologies) to deliver specific research objectives is often complex and requires access to expertise. Specialist facilities, an experienced user base and physical infrastructure are required for the preparation of specific sample types, e.g. single cell analysis or preparation of archaeological samples. It is important that these specialist infrastructures be linked to expert users, but available to the wider community. Furthermore, neglected areas in environmental science, such as environmental metabolomics and proteomics, need significant support to assist further development and exploitation. These techniques are, however, heavily used in the BBSRC and medical communities. UKRI is ideally placed to facilitate greater co-ordination and exchange of these facilities between the different research communities, with a dynamic distributed infrastructure in which a hub coordinates access to a range of expertise across the spectrum of science disciplines.
Key to realising the full potential of Omics is data
Each technological advance allows us to acquire more data at lower cost. The increase in genetic sequence data capture has exceeded Moore’s law for the past 14 years (https://www.genome.gov/sequencingcostsdata/). The opportunities this provides for environmental research are far-reaching; however, the challenge posed by handling the data is also significant. Careful consideration must be given to the infrastructure used for storage, policies and mechanisms for data sharing, analytical tools, and the integration of omics data with the full spectrum of environmental metadata. In isolation, omics data has significant value, but that value is significantly enhanced when it is combined with the full spectrum of related data, ranging from land use acquired by remote sensing through to detailed phenotypic measurements. In 1999, the astrophysics community identified the direct cost to research caused by having no coordinated data structure for their observations at 333 FTE / annum, and thus justified the development of the platform that now allows all astronomical data to be accessed through a single portal (ALADIN / SIMBAD portals). The recent explosion of environmental omics data places our community in a parallel situation: although publication should ensure data disclosure, raw data or metadata are often not included or not digitally accessible (this therefore does not represent 5* open data). The recent announcement of the ‘Constructing a Digital Environment’ Strategic Priorities Fund programme provides an ideal platform on which to integrate an omics data layer into a multi-dimensional representation of the natural environment, enabling monitoring, analysis, modelling and visualisation across spatial and temporal scales.
Therefore, investment in the long-term infrastructure to support the integration of environmental omics data into the ‘Digital Environment’, together with the related ecological and geophysical data, is a priority.
Big Data Infrastructure
Access to high performance computing (HPC) is essential to support environmental omics data analysis. The heterogeneity of data types, the need for specialist analytical pipelines and the dynamic nature of bioinformatic software development pose significant challenges for classical centrally managed HPC infrastructures. No one hardware configuration will support all types of omics analysis, with genome assemblies demanding terabytes of directly accessible RAM whilst other informatic processes can efficiently use parallel processing clusters. The emergence of specialist processor architectures dedicated to specific information tasks may assist a restricted suite of applications but will not address the heterogeneity of applications. The exponential increase in the size of reference data sets (the bacterial metagenome resource held at NCBI now exceeds 2 petabytes) presents challenges for researchers who wish to interrogate these repositories. The integration of omics data with complementary environmental data requires standardisation and ontologies, whilst the need for transparent reproducibility of information analysis requires innovation and new approaches. However, many of these computational issues are not limited to ‘Environmental Omics’ data, and solutions should therefore be addressed through the development of an integrated UKRI e-infrastructure that forms part of the UKRI infrastructure roadmap.
A range of e-infrastructure solutions for bioinformatics outside the classical HPC system have been developed. Systems such as the Genomics Virtual Laboratory (the infrastructure implemented by MRC CLIMB) and CyVerse exploit a core open cluster architecture, providing users with the ability to customise and scale their computing requirements. These systems can be applied at a local institutional level or as distributed networks, and can be easily transferred to large data centres. The NERC Omics Community successfully trialled an equivalent system (EOS cloud), demonstrating the viability of this type of computing architecture to deliver for the breadth of its community. New approaches to delivering HPC capability that have been customised to support the requirements of individual researchers provide exciting developments. One large advantage of these implementations is the ability to record and share the complete configuration used to analyse a specific dataset, promoting reproducibility of these complex pipelines.
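As an illustration of what recording and sharing the complete configuration of an analysis might involve, the following is a minimal sketch in Python. It is not drawn from any of the systems named above: the function names (`sha256_of`, `build_manifest`, `write_manifest`) and the manifest layout are assumptions made for this example. The sketch captures checksums of the input files, the tool versions and parameters reported by the user, and basic details of the computing environment, and writes them to a JSON file that can be shared alongside the results.

```python
import hashlib
import json
import platform
import sys
from pathlib import Path


def sha256_of(path):
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(input_paths, tools, parameters):
    """Collect the details needed to re-run an analysis into one dictionary.

    `tools` maps a tool name to the version string reported by the user and
    `parameters` holds the pipeline settings. Both are supplied by the caller
    and are recorded as given, without verification (an assumption of this
    sketch).
    """
    return {
        "inputs": [
            {"name": Path(p).name, "sha256": sha256_of(p)}
            for p in sorted(str(p) for p in input_paths)
        ],
        "tools": dict(sorted(tools.items())),
        "parameters": parameters,
        "python": sys.version.split()[0],
        "platform": platform.platform(),
    }


def write_manifest(manifest, out_path):
    """Write the manifest as human-readable JSON with stable key order."""
    with open(out_path, "w", encoding="utf-8") as handle:
        json.dump(manifest, handle, indent=2, sort_keys=True)
```

A researcher could call `build_manifest` once per analysis and deposit the resulting JSON file with the output data; a collaborator can then confirm from the checksums that they hold the same inputs before attempting to reproduce the pipeline.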
It is essential that we drive forward towards 5* open data: this will require data to be not only accessible but also linkable. The development of Open Data Cubes in environmental science provides a vehicle for anchoring omics data within the physical environment, providing the temporal and spatial reference needed to integrate the data with remote sensing and wider earth observation information, e.g. land use or weather patterns. However, there is an increasing need for conserved vocabularies and controlled ontologies (e.g. those of the Genomic Standards Consortium) if we are to ensure that environmental omics data is truly linkable and discoverable.
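The requirement for spatial and temporal anchoring together with a controlled vocabulary can be sketched as a simple metadata check. The example below is illustrative only: the `SampleRecord` fields, the `validate` function and the three-term `ALLOWED_ENVIRONMENTS` vocabulary are hypothetical and are not taken from any published standard. A real implementation would draw its fields and terms from a community-agreed ontology.

```python
from dataclasses import dataclass
from datetime import date

# Hypothetical controlled vocabulary, for illustration only; a real
# deployment would take its terms from a community-agreed ontology.
ALLOWED_ENVIRONMENTS = {"soil", "freshwater", "marine sediment"}


@dataclass(frozen=True)
class SampleRecord:
    sample_id: str
    latitude: float    # decimal degrees, north positive
    longitude: float   # decimal degrees, east positive
    collected: str     # ISO 8601 calendar date, e.g. "2019-06-30"
    environment: str   # must be a term from the controlled vocabulary


def validate(record):
    """Return a list of problems; an empty list means the record is linkable."""
    problems = []
    if not -90.0 <= record.latitude <= 90.0:
        problems.append("latitude out of range")
    if not -180.0 <= record.longitude <= 180.0:
        problems.append("longitude out of range")
    try:
        date.fromisoformat(record.collected)
    except ValueError:
        problems.append("collection date is not an ISO 8601 date")
    if record.environment not in ALLOWED_ENVIRONMENTS:
        problems.append("environment term not in controlled vocabulary")
    return problems
```

Checking records in this way at the point of deposition means that every omics sample carries a location, a date and a shared environment term, which are the minimum needed to place it alongside remote sensing or weather data in a common spatial and temporal frame.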