Working outside the LEAP JupyterHub#
While we primarily recommend working with LEAP data from our shared compute platform, we understand that many users have good reasons for sticking with the setups they already have. Here we describe solutions to common issues encountered when interfacing with LEAP data externally, i.e. not from the JupyterHub.
Authentication#
If you want to access the LEAP cloud buckets from outside the pre-authenticated JupyterHub, you need to be authenticated.
Generating Auth Config Files#
Unless a given cloud bucket allows anonymous access or is pre-authenticated within your environment, you will need to authenticate with a key/secret pair. The LEAP-Pangeo owned buckets are pre-authenticated on the Hub, but working with them externally requires configuration. Note: If you are accessing non-public cloud data from the JupyterHub (not published by LEAP), you will also have to follow this process.
Always Handle credentials with care!
Always handle secrets with care. Do not store them in plain text that is visible to others (e.g. in a notebook cell that is pushed to a public GitHub repository). See Handling Secrets for more instructions on how to keep secrets safe.
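If you do need to pass a key/secret pair directly in code, one way to keep it out of your notebooks is to read it from environment variables. A minimal sketch (the variable names and the s3fs usage below are illustrative only, not a LEAP convention):

```python
import os

import s3fs

# Read the key/secret from environment variables instead of hard-coding them
# (the variable names below are arbitrary examples).
key = os.environ["MY_BUCKET_ACCESS_KEY_ID"]
secret = os.environ["MY_BUCKET_SECRET_ACCESS_KEY"]

# Pass them to an fsspec implementation, e.g. s3fs for S3-compatible storage.
fs = s3fs.S3FileSystem(key=key, secret=secret)
```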
We recommend storing your secrets in one of the following configuration files (which are used in the examples below to read and write data):
Fsspec supports named AWS profiles in a credentials file. You can generate one via the aws CLI (installed on the Hub by default):
aws configure --profile <pick_a_name>
Pick a sensible name for your profile, particularly if you are working with multiple profiles and buckets.
The file ~/.aws/credentials then contains your key/secret, similar to this:
[<the_profile_name_you_picked>]
aws_access_key_id = ***
aws_secret_access_key = ***
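Fsspec-based tools can then use this profile by name. A minimal sketch with s3fs (the bucket name below is a placeholder, and this assumes the bucket is reachable via the S3 protocol):

```python
import s3fs

# Use the named profile from ~/.aws/credentials
fs = s3fs.S3FileSystem(profile="<the_profile_name_you_picked>")

# List the contents of a bucket you have access to (placeholder name)
print(fs.ls("<bucket-name>"))
```

The same profile keyword can usually be passed via `storage_options` when opening data with higher-level libraries built on fsspec.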
Rclone has its own configuration file format where you can specify the key and secret (and many other settings) in a similar fashion (note that the keys lack the aws_ prefix, though!).
We recommend setting up the config file by hand (show its default location with rclone config file) to look something like this:
[<remote_name>]
... # other values
access_key_id = XXX
secret_access_key = XXX
You can have multiple ‘remotes’ in this file for different cloud buckets.
For the OSN Pod, use this remote definition:
[osn]
type = s3
provider = Ceph
endpoint = https://nyu1.osn.mghpcc.org
access_key_id = XXX
secret_access_key = XXX
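From Python, the equivalent access goes through s3fs/fsspec by pointing at the same endpoint. A rough sketch (the bucket name is a placeholder; for publicly readable OSN buckets, passing anon=True instead of a key/secret may be enough):

```python
import s3fs

# Connect to the OSN Pod's S3-compatible (Ceph) endpoint
fs = s3fs.S3FileSystem(
    key="XXX",     # better: read these from a config file or environment variable
    secret="XXX",
    client_kwargs={"endpoint_url": "https://nyu1.osn.mghpcc.org"},
)

# List a bucket on the OSN Pod (placeholder name)
print(fs.ls("<bucket-name>"))
```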
Warning
Ideally we want to store these secrets in only one central location. The natural place for this would be an AWS CLI profile, which can also be used with fsspec. There are, however, multiple open issues (here) around this feature in rclone, and so far we have not succeeded in using AWS profiles with rclone. According to those issues, the credentials part of an AWS profile can only be made to work if we define one config file per remote and use the ‘default’ profile, which presumably breaks compatibility with fsspec, and also does not work at all right now. So for the moment we have to keep the credentials in two separate spots 🤷♂️. Please apply proper caution to each config file that stores secrets in plain text!
Note
You can find more great documentation, specifically on how to use OSN resources, in this section of the HyTEST Docs.
Temporary Token#
Note
This does not actually generate credentials that work outside the JupyterHub. Using the Google Groups is the recommended way of accessing the cloud buckets from outside the Hub.
You can generate a temporary (1 hour) token with read/write access as follows:
Start up a LEAP-Pangeo server, open a terminal, and install the Google Cloud SDK using mamba:
mamba install google-cloud-sdk
Now you can generate a temporary token (valid for 1 hour) that allows you to upload data to the cloud.
gcloud auth print-access-token
This will print a temporary token in the terminal, which you can, for example, copy to your clipboard.
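If you want to use such a token from Python, one possible pattern is to wrap it in a Google credentials object and hand it to gcsfs. This is only a sketch (and subject to the caveat in the note above); the bucket name is a placeholder:

```python
import gcsfs
from google.oauth2.credentials import Credentials

# Paste the output of `gcloud auth print-access-token` here (expires after ~1 hour)
access_token = "ya29...."  # placeholder

# Wrap the raw access token in a credentials object and pass it to gcsfs
fs = gcsfs.GCSFileSystem(token=Credentials(token=access_token))

# List a bucket you have access to (placeholder name)
print(fs.ls("<bucket-name>"))
```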
Service Account#
If you want more permanent access to resources, e.g. for a repository's CI using a service account, please reach out to the Data and Computation Team to discuss options.
Moving Data#
🚧
You can move directories from a local computer to cloud storage with rclone (make sure you are properly authenticated!):
rclone copy path/to/local/dir/ <remote_name>:<bucket-name>/funky-user/some-directory
You can also move data between cloud buckets using rclone:
rclone copy \
  <remote_name_a>:<bucket-name>/funky-user/some-directory \
  <remote_name_b>:<bucket-name>/funky-user/some-directory
Copying single files
To copy single files with rclone, use the copyto command, or copy the containing folder and use the --include or --exclude flags to select the file to copy.
Note
Copying with rclone will stream the data from the source to your computer and back to the target, and thus transfer speed is likely limited by the internet connection of your local machine.