Data Tools

There are many tools available to interact with cloud object storage. We currently have basic operations documented for three tools:

  • rclone, which provides a command-line interface to many different storage backends. Rclone is highly versatile and suits almost all use cases; details on authentication and usage are covered in the Rclone section below.

  • GCloud SDK. You can interact with Google Cloud Storage directly using the Google Cloud SDK and its command-line interface. Please consult the Install Instructions for more guidance; a short sketch of common commands follows this list.

  • fsspec (and its submodules gcsfs and s3fs) provides filesystem-like access to local, remote, and embedded file systems from within a Python session. Fsspec is also used by xarray under the hood and so integrates easily with normal coding workflows (and tools like Dask).

    • fsspec can be used to transfer small amounts of data to Google Cloud Storage or OSN. If you have more than a few hundred MB, it is worth using rclone instead.
    • fsspec should be installed by default in the Base Pangeo Notebook environment on the JupyterHub. If you are working locally, you can install it with pip or conda/mamba, e.g. pip install fsspec.
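
As a minimal sketch of the GCloud SDK route (assuming you have already authenticated with gcloud auth login; the bucket paths follow the leap-scratch/username placeholder convention used elsewhere on this page):

# List the contents of your user directory in a LEAP bucket
gcloud storage ls gs://leap-scratch/username/

# Copy a local file up to GCS
gcloud storage cp data.csv gs://leap-scratch/username/data.csv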

Tool Selection

small: KBs to hundreds of MB

medium: 1 GB to <100 GB

large: >100 GB

Move data from User Directory to GCS

  • If the data volume is:
    • small: fsspec/gcsfs
    • medium: rclone

Move data from JupyterHub to OSN

  • If the data volume is:
    • small: rclone
    • medium: rclone

Move data from Laptop/HPC to OSN

  • If the data volume is:
    • small: rclone
    • medium: rclone
    • large: rclone

Rclone

Rclone is an open-source command line tool for moving and syncing data. It can be very useful for moving data into or out of LEAP cloud buckets. There is a bit of a learning curve, but it has extensive docs. Please reach out to us if you run into issues.

Usage

Rclone can be installed on the hub by typing in the Terminal:

mamba install rclone -y
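
You can confirm the install worked by printing the version:

rclone version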

Once installed, you can set up "remotes", which are named storage locations with credentials. You can view and modify the rclone config file to add remotes; rclone will show you where this file is located when you run rclone config file. Please consult our Authentication guide for instructions on how to set up various remotes in rclone. For example, remotes for the LEAP OSN buckets and GCS look like this:

[leap-pubs]
type = s3
provider = Ceph
access_key_id = <CONTACT DATA AND COMPUTE TEAM>
secret_access_key = <CONTACT DATA AND COMPUTE TEAM>
endpoint = https://nyu1.osn.mghpcc.org
no_check_bucket = true

[leap-inbox]
type = s3
provider = Ceph
access_key_id = <CONTACT DATA AND COMPUTE TEAM>
secret_access_key = <CONTACT DATA AND COMPUTE TEAM>
endpoint = https://nyu1.osn.mghpcc.org
no_check_bucket = true

[leap-gcs]
type = google cloud storage
object_acl = bucketOwnerFullControl
location = us
env_auth = true

Details for the LEAP OSN pods are here.

You can find more great documentation, specifically on how to use OSN resources, in this section of the HyTEST Docs.
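
Once your remotes are configured, you can confirm that rclone sees them by listing the names of all configured remotes:

rclone listremotes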

Listing directory contents

Rclone has filesystem-like ls syntax for cloud storage. You can list the files in a bucket/directory with:

rclone ls leap-inbox:leap-pangeo-inbox/example-dataset
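
A couple of related subcommands are often useful here; a sketch using the same example paths:

# List only the top-level directories of the bucket
rclone lsd leap-inbox:leap-pangeo-inbox

# Report the total size and object count of a directory
rclone size leap-inbox:leap-pangeo-inbox/example-dataset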

Moving Data

Rclone can copy data from your local computer, from an HPC system, or between cloud buckets.

From Local/HPC to OSN:

rclone copy path/to/local_or_hpc/dir/ leap-inbox:leap-pangeo-inbox/dataset_name/dataset_input_files

Between Cloud Buckets:

rclone copy \
<remote_name_a>:<bucket-name>/dataset-name/dataset-files \
<remote_name_b>:<bucket-name>/dataset-name/dataset-files

From Local/HPC to GCS:

rclone copy /path/to/local_or_hpc/data/ \
  leap-gcs:leap-persistent/username/my_dataset
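
For larger transfers it can help to watch progress and increase parallelism. A sketch, where the transfer count of 8 is just an illustrative value:

rclone copy /path/to/local_or_hpc/data/ \
  leap-gcs:leap-persistent/username/my_dataset \
  --progress --transfers 8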

The rclone syntax for copying single files is slightly different from that for multiple files. To copy a single file you can use the copyto command. Another option is to copy the containing folder and use the --include or --exclude flags to select the file to copy, as shown below.
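
A sketch of both options (file and bucket paths are placeholders):

# Option 1: copy a single file with copyto
rclone copyto path/to/file.nc leap-gcs:leap-persistent/username/my_dataset/file.nc

# Option 2: copy the containing folder, filtered to one file
rclone copy path/to/ leap-gcs:leap-persistent/username/my_dataset/ --include "file.nc"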

Note

Transfer speed is likely limited by the internet connection of your local machine.

fsspec/gcsfs

fsspec provides a unified Python interface to local and remote filesystems.

  • gcsfs is the backend for Google Cloud Storage (GCS).
  • s3fs is the backend for S3-compatible systems, such as the OSN pod.

fsspec integrates directly with scientific Python libraries like xarray, zarr, and dask, which makes it the most seamless option for small transfers or direct read/write access from code.
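
For example, a dataset stored as Zarr on GCS can be opened directly with xarray; the store path below is hypothetical:

import xarray as xr

# fsspec/gcsfs resolve the gs:// URL under the hood
ds = xr.open_zarr("gs://leap-persistent/username/my_dataset.zarr")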

Installation

fsspec and gcsfs should already be available in the Base Pangeo Notebook environment on the JupyterHub. If working locally, install with pip or conda/mamba:

pip install fsspec gcsfs s3fs
# or
mamba install fsspec gcsfs s3fs -c conda-forge

Usage

Writing to GCS:

import fsspec

fs = fsspec.filesystem("gcs")  # uses gcsfs under the hood
with fs.open("gs://leap-scratch/username/test.txt", "w") as f:
    f.write("hello world")

Listing contents in GCS: fs.ls("gs://leap-scratch/username/")
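
Reading a file back works the same way; this sketch assumes the test.txt written above:

with fs.open("gs://leap-scratch/username/test.txt", "r") as f:
    print(f.read())  # "hello world"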

Writing to OSN (via S3 interface):

import fsspec

fs = fsspec.filesystem(
    "s3",
    key="<ACCESS_KEY>",
    secret="<SECRET_KEY>",
    client_kwargs={"endpoint_url": "https://nyu1.osn.mghpcc.org"},
)

# Write a small test file to an OSN bucket ("leap-pangeo-inbox" and "username"
# follow the placeholder conventions used above)
with fs.open("leap-pangeo-inbox/username/test.txt", "w") as f:
    f.write("hello world")

Moving Data

Fsspec is great for copying smaller files from your JupyterHub home directory (/home/jovyan/...) into a LEAP GCS bucket (gs://leap-scratch/... or gs://leap-persistent/...).

Copy a single file:

import fsspec
import shutil

# Local file in your JupyterHub home directory
local_path = "/home/jovyan/my_project/data.csv"

# Remote file path in GCS (your user subdirectory)
remote_path = "gs://leap-scratch/your-username/data.csv"

# Create a GCS filesystem object
fs = fsspec.filesystem("gcs")

# Open the source file and the destination file, then copy
with open(local_path, "rb") as src, fs.open(remote_path, "wb") as dst:
    shutil.copyfileobj(src, dst)

Copy an entire directory (e.g., project folder):

import fsspec
import shutil
import os

fs = fsspec.filesystem("gcs")

local_dir = "/home/jovyan/my_project/"
remote_dir = "gs://leap-persistent/your-username/my_project/"

for root, _, files in os.walk(local_dir):
    for fname in files:
        local_file = os.path.join(root, fname)
        rel_path = os.path.relpath(local_file, local_dir)
        remote_file = remote_dir + rel_path

        with open(local_file, "rb") as src, fs.open(remote_file, "wb") as dst:
            shutil.copyfileobj(src, dst)
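
As a shorter alternative, fsspec filesystems also expose a put method that covers both cases; a sketch using the same paths as above:

import fsspec

fs = fsspec.filesystem("gcs")

# Single file
fs.put("/home/jovyan/my_project/data.csv", "gs://leap-scratch/your-username/data.csv")

# Entire directory
fs.put("/home/jovyan/my_project/", "gs://leap-persistent/your-username/my_project/", recursive=True)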