LEAP-Pangeo Architecture
Contents
LEAP-Pangeo Architecture¶
LEAP-Pangeo is a cloud-based data and computing platform that will be used to support research, education, and knowledge transfer within the LEAP program.
Design Principles¶
In the proposal, we committed to building this in a way that enables the tools and infrastructure to be reused and remixed. So The challenge for LEAP Pangeo is to deploy an “enterprise quality” platform built entirely out of open-source tools, and to make this platform as reusable and useful for the broader climate science community as possible. We committed to following the following design principles:
Open source
Modular system: built out of smaller, standalone pieces which interoperate through clearly documented interfaces / standards
Agile development on GitHub
Following industry-standard best practices for continuous deployment, testing, etc.
Resuse of existing technologies and contribution to “upstream” open source projects on which LEAP-Pangeo depends (rather than development of new stuff just for the sake of it). This is a key part of our sustainability plan.
Design and Architecture¶
There are four primary components to LEAP-Pangeo.
The Data Library¶
The data library will provide analysis-ready, cloud-optimized data for all aspects of LEAP. The data library is directly inspired by the IRI Data Library mentioned above; however, LEAP-Pangeo data will be hosted in the cloud, for maximum impact, accessibility, and interoperability.
The contents of the data library will evolve dynamically based on the needs of the project. Examples of data that may become part of the library are
NOAA OISST sea-surface temperature data, used in workshops and classes to illustrate the fundamentals of geospatial data science.
High-resolution climate model simulations from the NCAR “EarthWorks” project, used by LEAP researchers to develop machine-learning parameterizations of climate processes like cloud and ocean eddies.
Machine-learning “challenge datasets,” published by the LEAP Team and accessible to the world, to help broading participation by ML researchers into climate science.
Easily accessible syntheses of climate projections from CMIP6 data, produced by the LEAP team, for use by industry partners for business strategy and decision making.
Data Storage Service¶
The underlying technology for the LEAP Data catalog will be cloud object storage (e.g. Amazon S3), which enables high throughput concurrent access to many simultaneous users over the public internet. Cloud Object Storage is the most performant, cost-effective, and simple way to serve such large volumes of data.
Initially, the LEAP data will be stored in Google Cloud Storage, in the same cloud region as the JupyterHub. Going forward, we will work with NCAR to obtain an Open Storage Network pod which allows data to be accessible from both Google Cloud and NCAR’s computing system.
Pangeo Forge¶
A central tool for the population and maintenance of the LEAP-Pangeo data catalog is Pangeo Forge. Pangeo Forge is an open source tool for data Extraction, Transformation, and Loading (ETL). The goal of Pangeo Forge is to make it easy to extract data from traditional data repositories and deposit in cloud object storage in analysis-ready, cloud-optimized (ARCO) format.
Pangeo Forge works by allowing domain scientists to define “recipes” that describe data transformation pipelines. These recipes are stored in GitHub repositories. Continuous integration monitors GitHub and automatically executes the data pipelines when needed. The use of distributed, cloud-based processing allows very large volumes of data to be processed quickly.
Pangeo Forge is a new project, funded by the NSF EarthCube program. LEAP-Pangeo will provide a high-impact use case for Pangeo Forge, and Pangeo Forge will empower and enhance LEAP research. This synergistic relationship with be mutually beneficial to two NSF-sponsored projects. Using Pangeo Forge effectively will require LEAP scientists and data engineers to engage with the open-source development process around Pangeo Forge and related technologies.
Catalog¶
A STAC data catalog be used to enumerate all LEAP-Pangeo datasets and provide this information to the public. The catalog will store all relevant metadata about LEAP datasets following established metadata standards (e.g. CF Conventions). It will also provide direct links to raw data in cloud object storage.
The catalog will facilitate several different modes of access:
Searching, crawling, and opening datasets from within notebooks or scripts
“Crawling” by search indexes or other machine-to-machine interfaces
A pretty web front-end interface for interactive public browsing
The Radiant Earth MLHub is a great reference for how we imagine the LEAP data catalog will eventually look.
The Hub¶
Jupyter Notebook / Lab has emerged as the standard tool for doing interactive data science. Jupyter supports combining rich text, code, and generated outputs (e.g. figures) into a single document, creating a way to communicate and share complete data-science research project
JupyterHub is a multi-user Jupyter Notebook / Lab environment that runs on a server. JupyterHub provides a gateway to highly customized software environments backed by dedicated computing with specified resources (CPU, RAM, GPU, etc.) Running in the cloud, JupyterHub can scale up to accommodate any number of simultaneous users with no degradation in performance. JupyterHub environments can support basically every existing programming language. We anticipate that LEAP users will primarily use Python, R, and Julia programming languages. In addition to Jupyter Notebook / Lab, JupyterHub also supports launching R Studio.
The Pangeo project already provides curated Docker images with full-featured Python software environments for environmental data science. These environments will be the starting point for LEAP environments. They may be augmented as LEAP evolves with more specific software as needed by research projects.
Use management and access control for the Hub are described in Users and Categories. We use GitHub for identity management, in order to make it easy to include participants from any partner institution..
The Knowledge Graph¶
LEAP “outputs” will be of four main types:
Datasets (covered above)
Papers - traditional scientific publications
Project Code - the code behind the papers, used to actually generate the scientific results
Trained ML Models - models that can be used directly for inference by others
Educational Modules - used for teaching
All of these object must be tracked and cataloged in a uniform way. The Code Policy and Data Policy will help set these standards.
By tracking the linked relationships between datasets, papers, code, models, and educational , we will generate a “knowledge graph”. This graph will reveal the dynamic, evolving state of the outputs of LEAP research and the relationships between different elements of the project. By also tracking participations (i.e. humans), we will build a novel and inspiring track record of LEAP’s impacts through the project lifetime.
This is the most open-ended aspect of our infrastructure. Organizing and displaying this information effectively is a challenging problem in information architecture and systems design.