LEAP-Pangeo Implementation Plan
The different elements of the project can be implemented in parallel and gradually connected together.
Roles
Decision Needed
An open question for LEAP-Pangeo is whether to develop and maintain our infrastructure through subcontracts or through Columbia employees whom we hire. The roles below are described generically, in terms of the expertise each one requires.
Data Engineering
The LEAP-Pangeo Architecture, particularly the Data Library, will require expertise in modern data engineering, including the following areas:
Geospatial data formats and metadata standards, including modern cloud-optimized formats such as Parquet and Zarr
Geospatial data catalogs and APIs
Cloud object storage
Cloud automation and data pipelines
Distributed computing frameworks for data science (e.g. Dask, Prefect, Apache Beam)
GitHub workflows
Continuous integration and agile development
Track record of contribution to multi-stakeholder open-source software
Possible contractors who fit this role:
DevOps for Cloud Hub
Developing and operating the LEAP-Pangeo JupyterHub will require the following expertise:
Strong experience with Docker and containerization of workflows
Deploying cloud-native applications (particularly JupyterHub) using Kubernetes and Helm
Continuous deployment using GitHub workflows
Monitoring and optimizing cloud costs in multi-user JupyterHub environments
Building machine-learning environments for Python and R users with tools such as Conda, conda-forge, and repo2docker
Continuous integration and agile development
Track record of contribution to multi-stakeholder open-source software
Possible contractors who fit this role:
Full-Stack Web Development
Developing the LEAP Knowledge Graph, including the library of papers, open-source code, and machine-learning models, will require a mix of skills commonly referred to as “full-stack web development”:
Front-end web development using HTML, CSS, and modern JavaScript frameworks (e.g. React, Vue)
Development and deployment of REST API endpoints for backend services
Consumption of data from third-party APIs (e.g. GitHub API)
Familiarity with Jupyter Notebook format
Continuous integration and agile development
Track record of contribution to multi-stakeholder open-source software
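As a rough illustration of what "familiarity with Jupyter Notebook format" involves, the sketch below reads a notebook as plain JSON and extracts a few fields that a knowledge-graph entry might record. The inline notebook, the `summarize_notebook` helper, and the choice of fields are all hypothetical; nothing here describes an actual LEAP Knowledge Graph schema.

```python
import json

# Hypothetical notebook content, inlined so the sketch is self-contained.
# Real notebooks are .ipynb files with the same top-level JSON layout.
NOTEBOOK_JSON = """
{
  "nbformat": 4,
  "nbformat_minor": 5,
  "metadata": {"kernelspec": {"name": "python3", "display_name": "Python 3"}},
  "cells": [
    {"cell_type": "markdown", "metadata": {}, "source": ["# Example analysis"]},
    {"cell_type": "code", "metadata": {}, "execution_count": null,
     "outputs": [], "source": ["print('hello')"]}
  ]
}
"""


def summarize_notebook(raw: str) -> dict:
    """Return the format version, kernel name, and per-type cell counts."""
    nb = json.loads(raw)
    counts = {}
    for cell in nb.get("cells", []):
        kind = cell.get("cell_type", "unknown")
        counts[kind] = counts.get(kind, 0) + 1
    kernel = nb.get("metadata", {}).get("kernelspec", {}).get("name")
    return {"nbformat": nb.get("nbformat"), "kernel": kernel, "cell_counts": counts}


summary = summarize_notebook(NOTEBOOK_JSON)
print(summary)
```

A production service would more likely use a dedicated notebook-parsing library with schema validation; plain `json` is used here only to keep the sketch dependency-free.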
Education and Training
Decision Needed
What is the scope of LEAP-Pangeo training? How much should we expect trainees to learn? What is the intersection with other educational activities, including for-credit courses?
Training participants in using LEAP-Pangeo will require expertise in research computing pedagogy and state-of-the-art knowledge of best practices in scientific computing, machine learning, and cloud computing.
Contractors vs. Employees
| | Pros | Cons |
|---|---|---|
| Employees | Longer-term commitment to project. Better integration with on-campus activities. | Slow hiring. Recruiting challenges. Uncertainty they can deliver needed results. |
| Contractors | Can spin up rapidly. Proven track records. Connection to broader ecosystems. Don’t have to deal with hiring. Access to top technical talent. | Potentially less integrated into project. |
Timeline
What follows is a possible timeline for implementation.
Fall 2021
Activities
📍 Deploy generic Pangeo JupyterHub on Google Cloud using supported credits.
📍 Provide basic end-user documentation for using the Hub (comparable to Pangeo Cloud docs).
Milestones
✅ LEAP members log into the hub and run their first code against existing cloud data.
Spring 2022
Activities
📍 Conduct data survey to assess data needs of research, education, and outreach activities.
📍 Data engineers work with researchers on data ingestion.
📍 Refine Hub environment based on initial feedback.
Milestones
✅ LEAP researchers ingest first datasets into cloud data library.
✅ LEAP seminar uses LEAP-Pangeo data science environments for teaching.
Summer 2022
Activities
📍 Launch initial data catalog.
📍 Begin training program.
Milestones
✅ Hold the first LEAP-Pangeo training for participants.
✅ LEAP REU interns successfully use LEAP-Pangeo for projects.
✅ First LEAP publications are added to the knowledge graph, along with supporting data and code.