Infrastructure¶
LEAP-Pangeo Data Catalog¶
Catalog Address |
|
Management Repo |
|
Maintained in collaboration with |
For more explanation about the catalog, and its role in the overall vision of LEAP, see LEAP-Pangeo Architecture. See Ingesting Datasets into Cloud Storage for details on how to ingest data and link it into the catalog.
LEAP-Pangeo JupyterHub¶
Our team has a cloud-based JupyterHub. For information on who can access the hub and with which privileges, please refer to Membership Tiers.
Hub Address |
|
Hub Location |
|
Hub Operator |
|
Hub Configuration |
https://github.com/2i2c-org/infrastructure/tree/master/config/clusters/leap |
This document goes over the primary technical details of the JupyterHub.
For a quick tutorial on basic usage, please check out our Getting Started tutorial.
To get an in-depth overview of the LEAP Pangeo Architecture and how the JupyterHub fits into it, please see the Architecture page.
Server¶
Managing Servers¶
You can start and stop your server (and even open multiple named servers) from the Hub Control Panel. You can get to the Hub Control Panel by navigating to https://leap.2i2c.cloud/hub/home in your browser or by selecting File > Hub Control Panel in the JupyterLab interface.
Your User Directory¶
When you open your hub, you can navigate to the “File Browser” and see all the files in your User Directory.
Your User Directory behaves much like a filesystem on your computer. If you save a file from a notebook, you will see it appear in the File Browser (you might have to wait a few seconds or press refresh), and you can use a terminal to navigate the directory as you would on a UNIX machine:
Note
As shown in the picture above, every user will see '/home/jovyan' as their root directory. This is different from many HPC accounts, where your home directory points to a directory named after your username, but the functionality is similar. These are your own files and they cannot be seen or modified by other users (except admins).
The primary purpose of this directory is to store small files, like GitHub repositories and other code.
Warning
To accommodate the expanding LEAP community, the data and compute team has instituted a storage quota on individual user directories (/home/jovyan). Your home directory is intended only for notebooks, analysis scripts, and small datasets (< 1 GB). It is not an appropriate place to store large datasets. Unlike the cloud buckets, these directories use an underlying storage with a rigid limit. If a single user fills up the space, the Hub crashes for everyone. We recommend users use less than 25 GB and enforce a hard limit of 50 GB. Users who persistently violate the limit may temporarily get reduced cloud access.
To check how much space you are using in your home directory, open a terminal window on the hub and run:
du -h --max-depth=1 ~/ | sort -h
If you want to save larger files for your work, use our LEAP-Pangeo Cloud Storage Buckets and consult our Hub Data Guide. See the FAQs for guidance on reducing storage.
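The same check can also be scripted from a notebook. The following is a minimal sketch, not an official LEAP tool: the function names `directory_size_bytes` and `quota_status` are hypothetical, and the thresholds simply restate the 25 GB recommendation and 50 GB hard limit quoted above, interpreted here as decimal gigabytes (an assumption).

```python
import os

GB = 10**9  # assumption: 1 GB is counted as 10^9 bytes
RECOMMENDED_LIMIT_GB = 25  # recommendation quoted in the warning above
HARD_LIMIT_GB = 50  # hard limit quoted in the warning above


def directory_size_bytes(root):
    """Sum the sizes of all regular files below `root`, skipping symlinks."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                if not os.path.islink(path):
                    total += os.path.getsize(path)
            except OSError:
                # Files can vanish or be unreadable while walking; skip them.
                continue
    return total


def quota_status(size_bytes):
    """Classify a size as 'ok', 'above recommended', or 'above hard limit'."""
    size_gb = size_bytes / GB
    if size_gb >= HARD_LIMIT_GB:
        return "above hard limit"
    if size_gb >= RECOMMENDED_LIMIT_GB:
        return "above recommended"
    return "ok"


# Example usage on the hub (walking a large home directory can take a while):
# used = directory_size_bytes(os.path.expanduser("~"))
# print(f"{used / GB:.2f} GB used: {quota_status(used)}")
```

Note that this sums file sizes, whereas du reports allocated disk blocks, so the two numbers can differ slightly.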
The Software Environment¶
The software environment you encounter on the Hub is based upon docker images which you can run on other machines (like your laptop or an HPC cluster) for better reproducibility.
Upon startup you can choose between:
A list of preselected images
A custom Docker image, passed via the "Other..." option
Preselected Images¶
LEAP-Pangeo uses several full-featured, up-to-date Python environments maintained by Pangeo. You can read all about them at the following URL:
There are separate images for pytorch and tensorflow which are available in a drop-down panel when starting up your server. The Hub contains a specific version of the image which can be found here.
For example, at the time of writing, the version of pangeo-notebook is 2022.05.10.
A complete list of all packages installed in this environment is located at:
Attention
We regularly update the version of the images provided in the drop-down menu.
To ensure full reproducibility, you should save the full info of the image you worked with (this is stored in the environment variable JUPYTER_IMAGE_SPEC) with your work. You could, for example, print the following in the first cell of a notebook:
import os
print(os.environ["JUPYTER_IMAGE_SPEC"])
You can then use that string with the custom image option to reproduce your work with exactly the same environment.
Custom Images¶
If you select the Image > Other... option during server login, you can paste an arbitrary reference in the form docker_registry/organization/image_name:image_version. As an example, we can get the 2023.05.08 version of the pangeo tensorflow notebook by pasting quay.io/pangeo/ml-notebook:2023.05.08.
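To make the reference format concrete, here is a small sketch that splits a reference of that form into its parts. The names `ImageReference` and `parse_image_reference` are hypothetical, and the parser deliberately handles only the simple registry/organization/name:version shape described here, not the full range of Docker reference syntax (digests, registries with ports, and so on).

```python
from typing import NamedTuple


class ImageReference(NamedTuple):
    registry: str
    organization: str
    name: str
    version: str


def parse_image_reference(reference: str) -> ImageReference:
    """Split 'registry/organization/name:version' into its four parts."""
    path, sep, version = reference.rpartition(":")
    if not sep or not version or "/" in version:
        raise ValueError(f"missing ':image_version' in {reference!r}")
    parts = path.split("/")
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"expected registry/organization/name in {reference!r}")
    return ImageReference(*parts, version)


# Example with the reference quoted in the text:
# parse_image_reference("quay.io/pangeo/ml-notebook:2023.05.08").version
```

A check like this can catch a missing version tag before you paste a reference into the hub, which matters for reproducibility because an untagged reference does not pin a specific image.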
If you want to build your own docker image for your project, take a look at this template and the instructions to learn how to use repo2docker to set up CI workflows to automatically build docker images from your repository.
Installing additional packages¶
You can install additional packages using pip and conda.
However, these will disappear when your server shuts down.
For a more permanent solution, we recommend building project-specific Dockerfiles and using those as custom images.
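Before a server shuts down, it can be useful to record which packages are currently installed, for example as a starting point for a project-specific image. The sketch below is one possible way to do that with the standard library only. The function names `installed_packages` and `write_requirements` are hypothetical, and the output only approximates what a tool such as pip freeze produces.

```python
from importlib import metadata


def installed_packages():
    """Return sorted 'name==version' strings for the installed distributions."""
    pins = set()
    for dist in metadata.distributions():
        try:
            name = dist.metadata.get("Name")
            version = dist.version
        except Exception:
            # Skip distributions whose metadata cannot be read.
            continue
        if name and version:
            pins.add(f"{name}=={version}")
    return sorted(pins, key=str.lower)


def write_requirements(path):
    """Write one 'name==version' line per package and return the line count."""
    lines = installed_packages()
    with open(path, "w") as f:
        f.write("\n".join(lines) + "\n")
    return len(lines)


# Example: write_requirements("requirements-snapshot.txt")
```

Packages that were installed with conda and are not published under the same name on PyPI may not be installable from such a list with pip, so treat the file as a record rather than a guaranteed recipe.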
LEAP-Pangeo Cloud Storage Buckets¶
LEAP-Pangeo provides users with the following cloud buckets to store data. Your server is automatically authenticated to read from any of these buckets, but write access might differ (see below). See Access to LEAP-Pangeo resources without the JupyterHub for details on how to access buckets from ‘outside’ the JupyterHub.
gs://leap-scratch/ - Temporary storage, deleted after 7 days. Use this bucket for testing and storing large intermediate results.
gs://leap-persistent/ - Persistent storage. Use this bucket for storing results you want to share with other members.
gs://leap-persistent-ro/ - Persistent storage with read-only access for most users. To upload data to this bucket, you need to use the method described below.
Files stored in each of these buckets can be accessed by any LEAP member, so be conscious of how you use them.
Do not put sensitive information (passwords, keys, personal data) into these buckets!
When writing to buckets, only ever write to your personal folder! Your personal folder is a combination of the bucket name and your GitHub username (e.g. 'gs://leap-persistent/funky-user/').
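The personal-folder convention can be encoded in a small helper so that output paths are never assembled by hand. This is a sketch under stated assumptions: `personal_folder` and `is_inside_personal_folder` are hypothetical names, and the username 'funky-user' is the placeholder from the example above.

```python
def personal_folder(bucket, github_username):
    """Return 'gs://<bucket>/<github_username>/' for a bucket name or gs:// URL."""
    prefix = "gs://"
    bucket_name = bucket[len(prefix):] if bucket.startswith(prefix) else bucket
    bucket_name = bucket_name.strip("/")
    username = github_username.strip().strip("/")
    if not bucket_name or "/" in bucket_name:
        raise ValueError(f"not a plain bucket name: {bucket!r}")
    if not username or "/" in username:
        raise ValueError(f"not a plain username: {github_username!r}")
    return f"{prefix}{bucket_name}/{username}/"


def is_inside_personal_folder(target, bucket, github_username):
    """Check that a target path lies inside the user's personal folder."""
    return target.startswith(personal_folder(bucket, github_username))


# Example: build an output path and verify it before writing anything.
# target = personal_folder("leap-scratch", "funky-user") + "test/output.zarr"
# assert is_inside_personal_folder(target, "leap-scratch", "funky-user")
```

Because the returned folder always ends with a slash, a path belonging to a different user whose name merely starts with the same characters is not mistaken for your own folder.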
Compute¶
🚧
Access to LEAP-Pangeo resources without the JupyterHub¶
Temporary Token¶
You can generate a temporary (1 hour) token with read/write access as follows:
Start up a LEAP-Pangeo server and open a terminal. Install the Google Cloud SDK using mamba:
mamba install google-cloud-sdk
Now you can generate a temporary token (valid for 1 hour) that allows you to upload data to the cloud.
gcloud auth print-access-token
This will print a temporary token in the terminal. You can, for example, copy it to your clipboard.
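On the machine outside the hub, the copied token then has to be handed to whatever client reads or writes the bucket. The sketch below shows one possible way to do this from Python and makes several assumptions not stated above: that the token was pasted into a local file (here the hypothetical token.txt), and that the third-party packages gcsfs and google-auth are installed. `read_token` and `make_filesystem` are illustrative names.

```python
from pathlib import Path


def read_token(path):
    """Read an access token from a text file and strip surrounding whitespace."""
    token = Path(path).read_text().strip()
    if not token:
        raise ValueError(f"{path} does not contain a token")
    return token


def make_filesystem(access_token):
    """Build a gcsfs filesystem authenticated with a short-lived access token."""
    # Imported lazily so that reading the token works without these packages.
    import gcsfs
    from google.oauth2.credentials import Credentials

    return gcsfs.GCSFileSystem(token=Credentials(access_token))


# Example (requires network access and a valid, unexpired token):
# fs = make_filesystem(read_token("token.txt"))
# print(fs.ls("leap-scratch"))
```

Since the token expires after one hour, long-running transfers may need a fresh token. Keep the token file out of version control and delete it when you are done.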
Persistent Access via Google Groups¶
We manage access rights through Google Groups. Please contact the Data and Computation Team to get added to the appropriate group (a Gmail address is required for this).
Service Account¶
If you want more permanent access to resources, e.g. as part of a repository's CI using a service account, please reach out to the Data and Computation Team to discuss options.
Analysis-Ready Cloud-Optimized (ARCO) Data¶
Below you can find some examples of ARCO data formats