Infrastructure

LEAP-Pangeo Data Catalog

Catalog Address

catalog.leap.columbia.edu

Management Repo

https://github.com/leap-stc/data-management

Maintained in collaboration with

CarbonPlan

For more explanation about the catalog, and its role in the overall vision of LEAP, see LEAP-Pangeo Architecture. See Ingesting Datasets into Cloud Storage for details on how to ingest data and link it into the catalog.

LEAP-Pangeo JupyterHub

Our team has a cloud-based JupyterHub. For information on who can access the hub, and with which privileges, please refer to Membership Tiers.

Hub Address

https://leap.2i2c.cloud/

Hub Location

Google Cloud us-central1

Hub Operator

2i2c

Hub Configuration

https://github.com/2i2c-org/infrastructure/tree/master/config/clusters/leap

This document goes over the primary technical details of the JupyterHub.

  • For a quick tutorial on basic usage, please check out our Getting Started tutorial.

  • To get an in-depth overview of the LEAP Pangeo Architecture and how the JupyterHub fits into it, please see the Architecture page.

Server

Managing Servers

You can start and stop your server (and even open multiple named servers) from the Hub Control Panel. To reach the Hub Control Panel, navigate to https://leap.2i2c.cloud/hub/home in your browser or select File > Hub Control Panel in the JupyterLab interface.

Your User Directory

When you open your hub, you can navigate to the “File Browser” and see all the files in your User Directory.

Your User Directory behaves very similarly to a filesystem on your computer. If you save a file from a notebook, you will see it appear in the File Browser (you might have to wait a few seconds or press refresh), and you can use a terminal to navigate the directory as you would on a UNIX machine.

Note

Every user sees '/home/jovyan' as the root of their User Directory. This differs from many HPC accounts, where your home directory is named after your username, but the functionality is similar. These are your own files, and they cannot be seen or modified by other users (except admins).

The primary purpose of this directory is to store small files, such as GitHub repositories and other code.

Warning

To accommodate the expanding LEAP community, the Data and Computation Team has instituted a storage quota on individual user directories (/home/jovyan). Your home directory is intended only for notebooks, analysis scripts, and small datasets (< 1 GB). It is not an appropriate place to store large datasets. Unlike the cloud buckets, these directories use underlying storage with a rigid limit: if a single user fills up the space, the Hub crashes for everyone. We recommend that users stay below 25 GB, and we enforce a hard limit of 50 GB. Users who persistently violate the limit may temporarily have their cloud access reduced.

To check how much space you are using in your home directory, open a terminal window on the hub and run du -h --max-depth=1 ~/ | sort -h.
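
If you would rather check this from a notebook, the following is a minimal sketch of a similar per-folder summary in Python. The function names directory_sizes and format_gb are made up for this illustration. Note that it sums apparent file sizes, so the numbers can differ somewhat from what du reports.

```python
import os
from pathlib import Path


def directory_sizes(root: Path) -> dict[str, int]:
    """Return the total bytes used by each top-level entry under root."""
    sizes: dict[str, int] = {}
    for entry in root.iterdir():
        if entry.is_symlink():
            continue  # do not follow links out of the directory
        if entry.is_file():
            sizes[entry.name] = entry.stat().st_size
        elif entry.is_dir():
            total = 0
            for dirpath, _dirnames, filenames in os.walk(entry):
                for name in filenames:
                    path = Path(dirpath) / name
                    if path.is_symlink():
                        continue
                    try:
                        total += path.stat().st_size
                    except OSError:
                        pass  # file vanished or is unreadable; skip it
            sizes[entry.name] = total
    return sizes


def format_gb(num_bytes: int) -> str:
    """Format a byte count in decimal gigabytes."""
    return f"{num_bytes / 1e9:.2f} GB"
```

For example, calling directory_sizes(Path.home()) and printing the entries sorted by size shows which folders contribute most towards the 25 GB recommendation.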

If you want to save larger files for your work, use our LEAP-Pangeo Cloud Storage Buckets and consult our Hub Data Guide. See the FAQs for guidance on reducing storage.

The Software Environment

The software environment you encounter on the Hub is based on Docker images, which you can also run on other machines (like your laptop or an HPC cluster) for better reproducibility.

Upon startup you can choose between:

  • A list of preselected images

  • The option of passing a custom docker image via the "Other..." option.

Preselected Images

LEAP-Pangeo uses several full-featured, up-to-date Python environments maintained by Pangeo. You can read all about them at the following URL:

There are separate images for PyTorch and TensorFlow, which are available in a drop-down panel when starting up your server. The Hub uses a specific version of each image, which can be found here.

For example, at the time of writing, the version of pangeo-notebook is 2022.05.10. A complete list of all packages installed in this environment is located at:

Attention

We regularly update the version of the images provided in the drop-down menu.

To ensure full reproducibility, you should save the full reference of the image you worked with (it is stored in the environment variable JUPYTER_IMAGE_SPEC) alongside your work. You could, for example, print it in the first cell of a notebook:

import os

print(os.environ["JUPYTER_IMAGE_SPEC"])

You can then use that string with the custom images to reproduce your work with exactly the same environment.
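
If you prefer to keep this information in a file next to your results rather than in notebook output, the idea could be sketched as follows. The helper names and the JSON layout are assumptions made for this illustration, not an established LEAP convention; only the JUPYTER_IMAGE_SPEC variable comes from the Hub.

```python
import json
import os
from datetime import datetime, timezone


def environment_record(env=os.environ) -> dict:
    """Collect the image reference and a UTC timestamp."""
    return {
        # JUPYTER_IMAGE_SPEC is set on the Hub; elsewhere it may be missing.
        "jupyter_image_spec": env.get("JUPYTER_IMAGE_SPEC", "unknown"),
        "recorded_at": datetime.now(timezone.utc).isoformat(),
    }


def save_environment_record(path: str, env=os.environ) -> dict:
    """Write the record to a JSON file and return it."""
    record = environment_record(env)
    with open(path, "w", encoding="utf-8") as f:
        json.dump(record, f, indent=2)
    return record
```

Calling save_environment_record("environment.json") at the top of a notebook leaves a small file that you can commit together with your code.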

Custom Images

If you select the Image > Other... option during server login, you can paste an arbitrary reference of the form docker_registry/organization/image_name:image_version. As an example, you can get the 2023.05.08 version of the Pangeo TensorFlow notebook by pasting quay.io/pangeo/ml-notebook:2023.05.08.
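
To make the four parts of such a reference explicit, here is a small sketch that splits a string of exactly this form. The names ImageReference and parse_image_reference are hypothetical, and the parser is deliberately simplified: Docker references in general can take other shapes (for example digests or deeper paths), which this sketch rejects.

```python
from typing import NamedTuple


class ImageReference(NamedTuple):
    registry: str
    organization: str
    name: str
    version: str


def parse_image_reference(reference: str) -> ImageReference:
    """Split 'registry/organization/name:version' into its four parts."""
    path, sep, version = reference.strip().rpartition(":")
    if not sep or not version or "/" in version:
        raise ValueError(f"missing ':image_version' in {reference!r}")
    parts = path.split("/")
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"expected registry/organization/name in {reference!r}")
    return ImageReference(parts[0], parts[1], parts[2], version)
```

A check like this can catch a missing version tag before you paste the reference into the "Other..." field.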

If you want to build your own docker image for your project, take a look at this template and the instructions to learn how to use repo2docker to set up CI workflows to automatically build docker images from your repository.

Installing additional packages

You can install additional packages using pip and conda. However, these will disappear when your server shuts down.

For a more permanent solution, we recommend building project-specific Docker images and using those as custom images.

LEAP-Pangeo Cloud Storage Buckets

LEAP-Pangeo provides users with the following cloud buckets to store data. Your server is automatically authenticated to read from any of these buckets, but write access might differ (see below). See Access to LEAP-Pangeo resources without the JupyterHub for details on how to access buckets from ‘outside’ the JupyterHub.

  • gs://leap-scratch/ - Temporary storage; files are deleted after 7 days. Use this bucket for testing and storing large intermediate results.

  • gs://leap-persistent/ - Persistent Storage. Use this bucket for storing results you want to share with other members.

  • gs://leap-persistent-ro/ - Persistent Storage with read-only access for most users. To upload data to this bucket, use the method described below.

Files stored in any of these buckets can be accessed by every LEAP member, so be conscious of how you use them.

  • Do not put sensitive information (passwords, keys, personal data) into these buckets!

  • When writing to buckets, only ever write to your personal folder! Your personal folder is a combination of the bucket name and your GitHub username (e.g. gs://leap-persistent/funky-user/).
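
One way to follow the personal-folder rule consistently is to build every output path through a single helper. The sketch below is only an illustration: the function name personal_path and the example file names are made up, and the helper does nothing more than string handling.

```python
def personal_path(bucket: str, github_username: str, *parts: str) -> str:
    """Build a gs:// path inside a user's personal folder in a bucket."""
    bucket_name = bucket.removeprefix("gs://").strip("/")
    user = github_username.strip("/")
    if not bucket_name or not user:
        raise ValueError("both a bucket name and a GitHub username are required")
    # Drop empty segments so stray slashes do not produce '//' in the path.
    extra = [p.strip("/") for p in parts if p.strip("/")]
    return "gs://" + "/".join([bucket_name, user, *extra])
```

For instance, personal_path("gs://leap-persistent/", "funky-user", "my_project", "output.zarr") returns gs://leap-persistent/funky-user/my_project/output.zarr, which could then be handed to whatever tool you use for writing (for example xarray's to_zarr, assuming xarray and gcsfs are available in your image).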

Compute

🚧

Access to LEAP-Pangeo resources without the JupyterHub

Temporary Token

You can generate a temporary (1 hour) token with read/write access. First, install the Google Cloud SDK:

mamba install google-cloud-sdk

Then generate the token, which is valid for 1 hour and allows you to upload data to the cloud:

gcloud auth print-access-token

This will print a temporary token in the terminal. You can, for example, copy it to your clipboard.
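
How you then use the token depends on your tooling. As a rough sketch, assuming the gcsfs and google-auth packages are installed on your machine and that you have stored the token in an environment variable (the name LEAP_GCS_TOKEN is made up for this example), you could open the buckets from Python like this:

```python
import os


def read_access_token(env=os.environ, variable: str = "LEAP_GCS_TOKEN") -> str:
    """Read the temporary token from an environment variable.

    The variable name is an assumption of this sketch, not a LEAP convention.
    """
    token = env.get(variable, "").strip()
    if not token:
        raise RuntimeError(f"no token found in ${variable}")
    return token


def open_bucket_filesystem(token: str):
    """Return a gcsfs filesystem authenticated with the temporary token."""
    # Imported lazily so the helper above works without these packages.
    import gcsfs
    from google.oauth2.credentials import Credentials

    return gcsfs.GCSFileSystem(token=Credentials(token))
```

With fs = open_bucket_filesystem(read_access_token()), a call such as fs.ls("leap-scratch") should list the bucket contents until the token expires after one hour.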

Persistent Access via Google Groups

We manage access rights through Google Groups. Please contact the Data and Computation Team to get added to the appropriate group (a Gmail address is required for this).

Service Account

If you want more permanent access to resources, e.g. using a service account as part of a repository's CI, please reach out to the Data and Computation Team to discuss options.

Analysis-Ready Cloud-Optimized (ARCO) Data

Below you can find some examples of ARCO data formats.

Zarr

zarr

Pangeo-Forge

You can find more information about Pangeo-Forge here.