Evolving Data-driven science: the unprecedented coherence of Big Data, HPC, and informatics, and crossing the next chasms
As we approach the AGU Centenary, we celebrate the successes of data-driven science whilst looking anxiously at the future, with consideration of the hardware, software, workflows and interconnectedness that need further attention.
The co-location of scientific datasets with HPC/cloud compute has demonstrably supercharged our research productivity. Over time we have questioned whether to “bring data to the compute” or “bring compute to the data”, and have considered and reconsidered the benefits, weaknesses and challenges of each, both technically and socially. The gap between how large-volume data and long-tail data are managed is steadily closing, and the standards for interoperability and connectivity between scientific fields have been slowly maturing. In many cases transdisciplinary science is now a reality.
However, computing technology is no longer advancing according to Moore’s law (and its equivalents) and is evolving in unexpected ways. For some major computational software codes, these technology changes are forcing us to reconsider development strategy: how to transition existing code so that it both addresses the need for improved scientific capability and, at the same time, can adjust more readily to changes in the underlying technical infrastructure. In doing so, some old assumptions about data precision and reproducibility are being reconsidered. Quantum computing is now on the horizon, which will mean further consideration of software and data access mechanisms.
Currently, for data management, despite the apparent value and opportunity, the demands for high-quality datasets that can feed new data-driven methods are testing the funding/business case and overall value proposition for celebrated open data and its FAIRness. Powerful new technologies such as AI and deep learning have a voracious appetite for big data and much stronger (and underappreciated) requirements around data quality, information management, connectivity and persistence. These new technologies are evolving at the same time as ubiquitous IoT, fog computing and blockchain pipelines have emerged, creating even more complexity and, potentially, hypercoherence issues.
In this talk I will discuss the journey so far in data-intensive computational science, and consider the chasms we have yet to cross.
Earth sensing: from ice to the Internet of Things
The evolution of technology has led to improvements in our ability to use sensors for earth science research. Radio communications have improved in terms of range and power use. Miniaturisation means we now use 32-bit processors with embedded memory, storage and interfaces. Sensor technology makes it simpler to integrate devices such as accelerometers, compasses, and gas and biosensors. Programming languages have developed so that it has become easier to create software for these systems. This, combined with the power of the processors, has made research into advanced algorithms and communications feasible. The term “environmental sensor networks” describes these advanced systems, which are designed specifically to take sensor measurements in the natural environment.
Through a decade of research into sensor networks, deployed mainly on glaciers, many areas of this still-emerging technology have been explored. From deploying the first subglacial sensor probes with custom electronics and protocols, we learnt how to tune systems to harsh environments and manage energy. More recently, installing sensor systems in the mountains of Scotland has shown how standards now allow complete internet and web integration.
This talk will discuss the technologies used in a range of recent deployments in Scotland and Iceland, focussed on creating new data streams for cryospheric and climate change research.
Cynthia Chandler, Woods Hole Oceanographic Institution
The scientific research endeavor requires data, and in some cases massive amounts of complex and highly diverse data. From experimental design, through data acquisition and analysis, hypothesis testing, and finally drawing conclusions, data collection and proper stewardship are critical to science. Even a single experiment conducted by a single researcher will produce data to test the working hypothesis. The types of complex science questions being tackled today often require large, diverse, multi-disciplinary teams of researchers who must be prepared to exchange their data.
This 2016 AGU Leptoukh Lecture comprises a series of vignettes that illustrate a brief history of data stewardship: where we have come from, how and why we have arrived where we are today, and where we are headed with respect to data management. The specific focus will be on management of marine ecosystem research data and will include observations on the drivers, challenges, strategies, and solutions that have evolved over time. The lessons learned should be applicable to other disciplines and the hope is that many will recognize parallels in their chosen domain.
From historical shipboard logbooks to the high-volume, digital, quality-controlled ocean science data sets created by today’s researchers, there have been enormous changes in the way ocean data are collected and reported. Rapid change in data management practices is being driven by new data exchange requirements, by modern expectations for machine-interoperable exchange, and by the desire to achieve research transparency. Advances in technology and cultural shifts contribute to the changing conditions through which data managers and informatics specialists must navigate.
The unique challenges associated with collecting and managing environmental data, complicated by the onset of the big data era, make this a fascinating time to be responsible for data. It seems there are data everywhere, being collected by everyone, for all sorts of reasons, and people have recognized the value of access to data. Properly managed and documented data, freely available to all, hold enormous potential for reuse beyond the original reason for collection.
Dawn Wright, Environmental Systems Research Institute
Director of Models and Data at the UK National Centre for Atmospheric Science, Professor of Weather and Climate Computing at the University of Reading, and the Director of the STFC Centre for Environmental Data Archival (CEDA).
The grand challenges of climate science will stress our informatics infrastructure severely in the next decade. Our drive for ever greater simulation resolution/complexity/length/repetition, coupled with new remote and in-situ sensing platforms, presents us with problems in computation, data handling, and information management, to name but three. These problems are compounded by the background trends: Moore’s Law is no longer doing us any favours, since computing is getting harder to exploit as we have to bite the parallelism bullet, and Kryder’s Law (if it ever existed) isn’t going to help us store the data volumes we can see ahead. The variety of data, the rate at which it arrives, and the complexity of the tools we need and use all strain our ability to cope. The solutions, as ever, will revolve around more and better software, but “more” and “better” will require some attention.
In this talk we discuss how these issues have played out in the context of CMIP5, and might be expected to play out in CMIP6 and successors. Although the CMIPs will provide the thread, we will digress into modelling per se, regional climate modelling (CORDEX), observations from space (Obs4MIPs and friends), climate services (as they might play out in Europe), and the dependency of progress on how we manage people in our institutions. It will be seen that most of the issues we discuss apply to the wider environmental sciences, if not science in general. They all have implications for the need for both sustained infrastructure and ongoing research into environmental informatics.
Simon Cox is a Senior Principal Research Scientist at CSIRO. He trained as a geophysicist, with a PhD in experimental rock mechanics from Columbia (Lamont-Doherty) following degrees in geological sciences at Cambridge and Imperial College London. He came to Australia for a post-doc with CSIRO, and then spent four years teaching at Monash University in Melbourne, where he first began using GIS. Returning to CSIRO in Perth in 1994 to work on information management for the Australian Geodynamics CRC, he moved its focus for reporting onto the emerging World Wide Web, deploying a web-mapping system for Australian geology and geophysics in 1995. The challenge of maintaining the AGCRC website led to metadata-based systems, and to Simon’s engagement with the standards community when he joined the Dublin Core Advisory Council.
Work on XML-based standards for mineral exploration data led on to the foundation of the GeoSciML project in collaboration with a number of geological surveys. An interest in tying these into broader interoperability systems led to engagement with the Open Geospatial Consortium (OGC), where he co-edited the Geography Markup Language (GML) v2 and v3. In OGC he developed Observations and Measurements as a common language for in situ, ex situ and remote sensing; it went on to become an ISO standard and forms the basis for operational systems in fields as diverse as air traffic, water data transfer and environmental monitoring. In 2009–10 he spent a year as a senior fellow at the EC Joint Research Centre in Italy, working on the integration of GEOSS and INSPIRE. He has served on the councils of the IUGS Commission for Geoscience Information and the International Association for Mathematical Geosciences. In 2006 he was awarded OGC’s highest honor, the Gardels medal. He has been a member of AGU since 1982. Simon is currently based in CSIRO Land and Water in Melbourne, working on a variety of projects across environmental informatics and spatial data systems.