LEAP-Pangeo JupyterHub

Our team has a cloud-based JupyterHub. For information on who can access the hub and with which privileges, please refer to Users and Categories.

Hub Address

https://leap.2i2c.cloud/

Hub Location

Google Cloud us-central1

Hub Operator

2i2c

Hub Configuration

https://github.com/2i2c-org/infrastructure/tree/master/config/clusters/leap

Getting Started

To get started using the hub, check out this video by James Munroe from 2i2c explaining the architecture.

Getting Help

For questions about how to use the Hub, please use the LEAP-Pangeo discussion forum:

Office Hours

We also offer in-person and virtual Office Hours on Thursdays for questions about LEAP-Pangeo. You can reserve an appointment here.

Hub Usage

This is a rough-and-ready guide to using the Hub. This documentation will be expanded as we learn and evolve. Feel free to edit it yourself if you have suggestions for improvement!

Logging In

  1. 👀 Navigate to https://leap.2i2c.cloud/ and click the big orange button that says “Log in to continue”

  2. 🔐 You will be prompted to authorize a GitHub application. Say “yes” to everything. Note you must belong to the appropriate GitHub team in order to access the hub. See Users and Categories for more information.

  3. 📠 You will be redirected to a screen with the following options.

[Image: server options screen]

Note: Depending on your membership you might see additional options (e.g. for GPU machines)

You have to make 3 choices here:

  • The machine type (Choose between “CPU only” or “GPU” if available) ⚠️The GPU images should be used only when needed to accelerate model training.

  • The software environment (“Image”). Find more info in the Software Environment Section below.

  • The node share. These are shared resources, and you should try to use the smallest node share you need. You can easily start up a new server with a larger share if you find your work is limited by CPU/RAM.

  4. 🕥 Wait for your server to start up. It can take up to a few minutes.

Using JupyterLab

After your server fires up, you will be dropped into a JupyterLab environment.

If you are new to JupyterLab, you might want to peruse the user guide.

Shutting Down Your Server

Your server will shut down automatically after a period of inactivity. However, if you know you are done working, it’s best to shut it down directly. To shut it down, go to https://leap.2i2c.cloud/hub/home and click the big red button that says “Stop My Server”

[Image: "Stop My Server" button]

You can also navigate to this page from JupyterLab by clicking the File menu and going to Hub Control Panel.

The Software Environment

The software environment you encounter on the Hub is based upon docker images which you can run on other machines (like your laptop or an HPC cluster) for better reproducibility.

Upon start up you can choose between

  • A list of preselected images

  • The option of passing a custom docker image via the "Other..." option.

Preselected Images

LEAP-Pangeo uses several full-featured, up-to-date Python environments maintained by Pangeo. You can read all about them at the following URL:

There are separate images for pytorch and tensorflow which are available in a drop-down panel when starting up your server. The Hub contains a specific version of the image which can be found here.

For example, at the time of writing, the version of pangeo-notebook is 2022.05.10. A complete list of all packages installed in this environment is located at:

Attention

We regularly update the version of the images provided in the drop-down menu.

To ensure full reproducibility, you should save the full info of the image you worked with (stored in the environment variable JUPYTER_IMAGE_SPEC) alongside your work. You can then pass that string as a custom image to reproduce your work in exactly the same environment.
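
For example, you can read and print this string from a notebook or script running on the hub; a minimal sketch:

import os

# the exact image reference your server is running; save this string alongside your results
print(os.environ["JUPYTER_IMAGE_SPEC"])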

Custom Images

If you select the Image > Other... option during server startup, you can paste an arbitrary reference in the form docker_registry/organization/image_name:image_version. As an example, we can get the 2023.05.08 version of the Pangeo tensorflow notebook by pasting quay.io/pangeo/ml-notebook:2023.05.08.

If you want to build your own docker image for your project, take a look at this template and the instructions to learn how to use repo2docker to set up CI workflows to automatically build docker images from your repository.

Installing additional packages

You can install additional packages using pip and conda. However, these will disappear when your server shuts down.

For a more permanent solution we recommend building project specific dockerfiles and using those as custom images.

Files and Data

Data and files work differently in the cloud. To help onboard you to this new way of working, we have written a guide to Files and Data in the Cloud:

We recommend you read this thoroughly, especially the part about Git and GitHub.

Warning

Please do not store large files in your user directory /home/jovyan. Your home directory is intended only for notebooks, analysis scripts, and small datasets (< 1 GB). It is not an appropriate place to store large datasets.

LEAP-Pangeo Buckets

LEAP-Pangeo provides users three cloud buckets to store data

  • gs://leap-scratch/ - Temporary Storage deleted after 7 days. Use this bucket for testing and storing large intermediate results. More info

  • gs://leap-persistent - Persistent Storage. Use this bucket for storing results you want to share with other members.

  • gs://leap-persistent-ro - Persistent Storage with read-only access for most users. To upload data to this bucket, you need to use the method described below.

Files stored on each of these buckets can be accessed by any LEAP member, so be conscious of how you use them.

  • Do not put sensitive information (passwords, keys, personal data) into these buckets!

  • When writing to buckets, only ever write to your personal folder! Your personal folder is a combination of the bucket name and your GitHub username (e.g. gs://leap-persistent/funky-user/).

Inspecting contents of the bucket

We recommend using gcsfs or fsspec, which provide a filesystem-like interface for Python.

You can e.g. list the contents of your personal folder with

import gcsfs
fs = gcsfs.GCSFileSystem() # equivalent to fsspec.filesystem('gs')
fs.ls('leap-persistent/funky-user')

Basic writing to and reading from cloud buckets

We do not recommend uploading large files (e.g. netcdf) directly to the bucket. Instead we recommend writing data in ARCO (Analysis-Ready Cloud-Optimized) formats like zarr (for n-dimensional arrays) and parquet (for tabular data) (read more here about why we recommend ARCO formats).

If you work with xarray Datasets, switching the storage format is as easy as swapping out a single line when reading/writing data:

Xarray provides a method to stream results of a computation to zarr

ds = ...
ds_processed = ds.mean(...).resample(...)
user_path = "gs://leap-scratch/funky-user" # 👀 make sure to prepend `gs://` to the path or xarray will interpret this as a local path
store_name = "processed_store.zarr"
ds_processed.to_zarr(f'{user_path}/{store_name}')

This will write a zarr store to the scratch bucket.

You can read it back into an xarray dataset with this snippet:

import xarray as xr
ds = xr.open_dataset('gs://leap-scratch/funky-user/processed_store.zarr', engine='zarr', chunks={})

… and you can give this to any other registered LEAP user and they can load it exactly like you can!

You can also write other files directly to the bucket by using fsspec.open, similar to the Python built-in open:

import fsspec

with fsspec.open('gs://leap-scratch/funky-user/test.txt', mode='w') as f:
    f.write('hello world')

Deleting from cloud buckets

Warning

Depending on which cloud bucket you are working in, make sure to double-check which files you are deleting by inspecting the contents, and only work in a subdirectory with your username (e.g. gs://<leap-bucket>/<your-username>/some/project/structure).

You can remove single files by using a gcsfs/fsspec filesystem as above:

import gcsfs
fs = gcsfs.GCSFileSystem() # equivalent to fsspec.filesystem('gs')
fs.rm('leap-persistent/funky-user/file_to_delete.nc')

If you want to remove zarr stores (which are an ‘exploded’ data format, and thus represented by a folder structure) you have to recursively delete the store.

Warning

The warning from above is even more important here! Make sure that the folder you are deleting does not contain any data you do not want to delete!

fs.rm('leap-scratch/funky-user/processed_store.zarr', recursive=True)

I have a dataset and want to work with it on the hub. How do I upload it?

If you would like to add a new dataset to the LEAP Data Library, please first raise an issue here. This enables us to track detailed information about proposed datasets and have an open discussion about how to upload it to the cloud.

We distinguish between two primary types of data to upload: “Original” and “Published” data.

  • Published Data has been published and archived in a publicly accessible location (e.g. a data repository like zenodo or figshare). We do not recommend uploading this data to the cloud directly; instead, use Pangeo Forge to transform and upload it to the cloud. This ensures that the data is stored in an ARCO format and can be easily accessed by other LEAP members.

  • Original Data is any dataset that is produced by researchers at LEAP and has not been published yet. The main use case for this data is to share it with other LEAP members and collaborate on it. For original data we support direct upload to the cloud. Be aware that original data can change rapidly as the data producer iterates on their code. We encourage all datasets to be archived and published before they are used in scientific publications.

Transform and Upload published data to an ARCO format (with Pangeo Forge)

Coming Soon

Upload medium sized original data from your local machine

For medium-sized datasets that can be uploaded within an hour, you can use a temporary access token generated on the JupyterHub to upload data to the cloud.

  • Set up a new environment on your local machine (e.g. laptop)

mamba create --name leap_pange_transfer python=3.9 google-auth gcsfs jupyterlab xarray zarr dask # add any other dependencies (e.g. netcdf4) that you need to read your data
  • Activate the environment

conda activate leap_pange_transfer

and set up a Jupyter notebook (or a pure Python script) that loads your data into as few xarray datasets as possible. For instance, if you have one dataset that consists of many files split in time, set the notebook up to read all the files into a single xarray dataset, and then try to write out a small part of that dataset to a zarr store (see the sketch below).
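
A minimal sketch of such a notebook might look like the following (the file pattern and time subset are hypothetical; adapt them to your data):

import xarray as xr

# hypothetical example: many netcdf files split in time, combined into one lazy dataset
ds = xr.open_mfdataset("data/my_model_output_*.nc", combine="by_coords")

# write a small subset to a local zarr store to check that everything works
ds.isel(time=slice(0, 10)).to_zarr("test_subset.zarr", mode="w")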

  • Now start up a LEAP-Pangeo server and open a terminal. Install the Google Cloud SDK using mamba

mamba install google-cloud-sdk

Now you can generate a temporary token (valid for 1 hour) that allows you to upload data to the cloud.

gcloud auth print-access-token

Copy the resulting token into a plain text file token.txt in a convenient location on your local machine.

  • Now start a JupyterLab notebook and paste the following code into a cell:

import gcsfs
import xarray as xr
from google.oauth2.credentials import Credentials

# read the access token from the file you created earlier
with open("path/to/your/token.txt") as f:
    access_token = f.read().strip()

# set up an authenticated GCS filesystem using the token
credentials = Credentials(access_token)
fs = gcsfs.GCSFileSystem(token=credentials)

Make sure to replace the path/to/your/token.txt with the actual path to your token file.

Try to write a small dataset to the cloud:

ds = xr.DataArray([1]).to_dataset(name='test')
mapper = fs.get_mapper('gs://leap-scratch/<your_username>/test_offsite_upload.zarr') # use the authenticated filesystem from above
ds.to_zarr(mapper)

Replace <your_username> with your actual username on the hub.

  • Make sure that you can read the test dataset from within the hub (go back to Basic writing to and reading from cloud buckets).
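
For example, a quick check from a notebook on the hub, mirroring the earlier read example (replace <your_username> as before):

import xarray as xr

# read the test store back from within the hub to confirm the upload worked
ds_test = xr.open_dataset('gs://leap-scratch/<your_username>/test_offsite_upload.zarr', engine='zarr', chunks={})
print(ds_test)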

  • Now the last step is to paste the code to load your actual dataset into the notebook and use .to_zarr to upload it.

Make sure to give the store a meaningful name, and raise an issue in the data-management repo to get the dataset added to the LEAP Data Library.

Make sure to use a different bucket than leap-scratch, since files there are deleted after 7 days! For more info refer to the available storage buckets.

Uploading large original data from an HPC system (no browser access on the system available)

A common scenario is the following: a researcher or student has run a simulation on a High Performance Computing (HPC) system at their institution, but now wants to work collaboratively on the analysis or train a machine learning model with this data. For this they need to upload it to the cloud storage.

The following steps will guide you through authenticating and uploading data to the cloud, but they might have to be slightly modified depending on the actual setup of the user's HPC system.

Conversion Script/Notebook

In most cases you do not just want to upload the data in its current form (e.g. many netcdf files).

Instead, we will load the data into an xarray.Dataset and then write that Dataset object directly to a zarr store in the cloud. For this you need a Python environment with xarray, gcsfs, and zarr installed (you might need additional dependencies for your particular use case).

  1. Spend some time setting up a Python script/Jupyter notebook on the HPC system that opens your files and combines them into one or more xarray.Datasets (combine as many files as sensible into a single dataset). Make sure that your data is loaded lazily and that the underlying .data is a dask array.

  2. Check your dataset:

    • Check that the metadata is correct.

    • Check that all the variables/dimensions are in the dataset

    • Check the dask chunk size. A general rule is to aim for around 100 MB per chunk, but the optimal chunk size and structure depend heavily on the later use case (see the sketch after this list).

  3. Try to write out a subset of the data locally by calling the .to_zarr method on the dataset.
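
A rough sketch of these checks (the file pattern, variable name, and chunk sizes are hypothetical; adjust them to your dataset):

import xarray as xr

# combine the files lazily into a single dataset (hypothetical file pattern)
ds = xr.open_mfdataset("output/run_*.nc", combine="by_coords")

print(ds)                          # check metadata, variables, and dimensions
print(ds["air_temperature"].data)  # hypothetical variable name; this should be a dask array, not a numpy array
# rechunk if needed, aiming for roughly 100 MB per chunk (the sizes here are illustrative only)
ds = ds.chunk({"time": 120})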

Once that works we can move on to the authentication.

Upload Prerequisites

Before we are able to set up authentication we need to make sure our HPC and local computer (required) are set up correctly.

Some of the steps below are executed on your "local" computer (e.g. laptop) and others on your "remote" computer (e.g. HPC); pay attention to which machine each step runs on.

  1. SSH into the HPC

  2. Check that you have an internet connection with ping www.google.com

  3. Request no browser authentication:

    gcloud auth application-default login --scopes=https://www.googleapis.com/auth/devstorage.read_write,https://www.googleapis.com/auth/iam.test --no-browser
    

    🚨 It is very important to include the --scopes= argument for security reasons. Do not run this command without it!

  4. Follow the onscreen prompt and paste the command into a terminal on your local machine.

  5. This will open a browser window. Authenticate with the gmail account that was added to the google group.

  6. Go back to the terminal and follow the onscreen instructions. Copy the text from the command line and paste the command in the open dialog on the remote machine.

  7. Make sure to note the path to the auth json! It will be something like .../.config/gcloud/....json.

Now you have everything you need to authenticate.

Let's verify that you can write a small dummy dataset to the cloud. In your notebook/script, run the following (make sure to replace the filename and your username as instructed):

import xarray as xr
import gcsfs
import json

# 🚨 make sure to enter the `.json` file from step 7
with open("your_auth_file.json") as f:
    token = json.load(f)

# set up an authenticated filesystem with the token
fs = gcsfs.GCSFileSystem(token=token)

# test write a small dummy xarray dataset to zarr
ds = xr.DataArray([1, 4, 6]).to_dataset(name='data')
mapper = fs.get_mapper("gs://leap-persistent/<username>/testing/demo_write_from_remote.zarr") # 🚨 enter your leap (github) username here
ds.to_zarr(mapper)

Now you can repeat the same steps, but replace the dummy dataset with your full dataset from above and leave your Python code running until the upload has finished. Depending on the internet connection speed and the size of the full dataset, this can take a while. Once the upload completes, your dataset should be available to all LEAP members 🎉🚀

If you want to see a progress bar, you can wrap the call to .to_zarr with a dask progress bar

from dask.diagnostics import ProgressBar
with ProgressBar():
    ds.to_zarr(mapper)

Once the data has been uploaded, make sure to erase the .../.config/gcloud/....json file from step 7, and ask to be removed from the Google Group.

Dask

To help you scale up calculations using a cluster, the Hub is configured with Dask Gateway. For a quick guide on how to start a Dask Cluster, consult this page from the Pangeo docs:
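
As a minimal sketch (assuming the hub's default Dask Gateway configuration), starting and scaling a cluster looks roughly like this:

from dask_gateway import Gateway

gateway = Gateway()              # connect to the hub's Dask Gateway
cluster = gateway.new_cluster()  # request a new cluster
cluster.scale(4)                 # ask for 4 workers; adjust to your needs
client = cluster.get_client()    # use this cluster for subsequent dask computations
# when you are done, release the workers with cluster.shutdown()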

GPUs

Tier2 and Tier3 members (see Users and Categories) have access to a ‘Large’ Server instance with GPU. Currently the GPUs are Nvidia T4 models. To check what GPU is available on your server you can use nvidia-smi in the terminal window. You should get output similar to this:

   nvidia-smi
   
   +-----------------------------------------------------------------------------+
   | NVIDIA-SMI 510.47.03    Driver Version: 510.47.03    CUDA Version: 11.6     |
   |-------------------------------+----------------------+----------------------+
   | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
   | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
   |                               |                      |               MIG M. |
   |===============================+======================+======================|
   |   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
   | N/A   41C    P8    11W /  70W |      0MiB / 15360MiB |      0%      Default |
   |                               |                      |                  N/A |
   +-------------------------------+----------------------+----------------------+
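
If you are using one of the ML images (e.g. the pytorch image), you can also confirm that the GPU is visible from Python; a minimal sketch:

import torch

# check that PyTorch can see the GPU
print(torch.cuda.is_available())
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))  # e.g. "Tesla T4"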