Guide for Data and Computation Team Members

This is a short write-up to facilitate the onboarding of new members of the Data and Computation Team and to describe regular maintenance tasks.

Onboarding

Checklist for new members

  • Ask to be added to the Data and Computation Github team

  • Ask to be added to the @data-and-compute Slack user group

  • Subscribe to Relevant Slack channels

  • Consider enabling notifications for Relevant github repos

  • Make a PR to the _config.yaml file here to add a picture and your personal details to the webpage.

  • Get access to the Grafana Dashboard

  • Request access to a service account to monitor Google Dataflow and Storage from the Google Cloud Console by raising an issue here

    • Instructions for admin:

      • Go to the Google Cloud Console > IAM > Grant Access

      • Add the following permissions:

        • Dataflow Admin

        • Storage Admin

        • Logs Viewer

        • Monitoring Viewer

        • Compute Viewer

Relevant Slack channels

  • data-and-computation-team: Private channel for internal discussions, please contact the Manager for Data and Computing to be added.

  • leap-pangeo: Community wide channel where LEAP-Pangeo users can ask questions. Members of the team should join the channel and regularly engage with issues raised.

Relevant github repos

Regular Maintenance

This section documents common tasks performed as part of the Data and Computation Team’s duties.

Monitor User Directory Usage

The user directories of all members are on a shared volume of fixed size. We pay for the total size of the volume no matter if it is used or not, so we should strive to keep usage to a minimum. Our current policy is that <50GB per user is acceptable. Unfortunately we do not have a technical way to implement per user quotas, so we need to regularly check two things:
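
As a rough illustration of the 50GB policy, the sketch below lists directories that exceed a size limit. It assumes you have a shell with read access to the shared volume; the base path passed to the function is a placeholder, and report_large_dirs is a hypothetical helper name, not an existing tool.

```shell
# Sketch: list user directories larger than a limit (default 50 GB, counted as GiB).
# The base directory is an assumption; pass the path where user directories are mounted.
report_large_dirs() {
  local base_dir="$1" limit_gb="${2:-50}"
  local limit_kb=$(( limit_gb * 1024 * 1024 ))
  local dir size_kb
  for dir in "$base_dir"/*/; do
    [ -d "$dir" ] || continue                      # skip if the glob matched nothing
    size_kb="$(du -sk "$dir" 2>/dev/null | cut -f1)"
    if [ "${size_kb:-0}" -gt "$limit_kb" ]; then
      # Print the directory and its size rounded down to whole GB.
      printf '%s\t%d GB\n' "${dir%/}" $(( size_kb / 1024 / 1024 ))
    fi
  done
}
```

Example call, with a placeholder path: report_large_dirs /path/to/user/directories 50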

Regularly Updating the Software Environment

We aim to provide users with up-to-date default software environments. This currently requires changing the pangeo-docker-images tag manually.

  • Make sure you are subscribed to release notifications on pangeo-docker-images to receive GitHub notifications about new releases

  • To bump the version, submit a PR to this config file in the 2i2c infrastructure repo. In that PR you need to change the image tag for all image choices, see an example here.

  • To send emails, a token file is set up as the OAUTH_GMAIL_CREDENTIALS GitHub secret in the member management repo. Every so often (around biweekly) the action will require re-authentication, generating a new token that should replace the existing secret. This makes use of OAUTH_GMAIL_CLIENT_SECRET, which never needs to change. To update the OAUTH_GMAIL_CREDENTIALS secret, run the generate_emails_token(github_action_mode=False) function from utils.py locally, which will direct you to a confirmation screen. Log in to the leap.pangeo@gmail.com account and authorize access. You will need a copy of the CLIENT_SECRETS file on your personal machine, which can be retrieved from the Google Cloud Console using the above Pangeo support email account.
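
For the image tag bump described in the list above, a minimal sketch of the search-and-replace step might look like this. It assumes that image references in the config file have the form pangeo/<image-name>:<tag>; the real file layout may differ, bump_image_tag is a hypothetical helper, and the tags in the example call are placeholders.

```shell
# Sketch: replace the tag on every "pangeo/<image-name>:<tag>" reference in a file.
# Assumes that form for image references; always review the diff before opening a PR.
bump_image_tag() {
  local file="$1" old_tag="$2" new_tag="$3"
  local old_re tmp
  # Escape dots so the old tag is matched literally by sed.
  old_re="$(printf '%s' "$old_tag" | sed 's/\./\\./g')"
  tmp="$(mktemp)"
  # Rewrite into a temp file, then copy back to keep the original file's permissions.
  sed "s|\(pangeo/[A-Za-z0-9_-]*\):${old_re}|\1:${new_tag}|g" "$file" > "$tmp" \
    && cat "$tmp" > "$file"
  rm -f "$tmp"
}
```

Example call with placeholder values: bump_image_tag path/to/config.yaml 2024.01.15 2024.03.22. References to images outside the pangeo/ namespace are left untouched.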

Offboarding members

  • Delete the personal dct-team-<first_name> service account in IAM (needs admin privileges).

Admin Tasks

This part of the guide is reserved for team members with admin access to the 'leap-stc' github organization!

Renewing Personal fine grained access token for LEAP member management

In order to automate member sign up by adding GitHub users from a private Google Sheet to the appropriate GitHub teams (via this gh action), the GitHub action needs the appropriate privileges to add/remove members from teams. We are currently handling this by providing a personal access token as the "ORG_TOKEN" secret. The person creating the token will usually be the Manager for Data and Computation.

Note

Ideally we want to remove the dependency on a single user account here, but for now this is the only way I have found this to work properly. Maybe there is a way to establish a ‘dummy’ user?

Steps

  • Make sure you have access to set secrets on the private member_management repo

  • Go to the personal account “Settings>Developer Settings” tab. From there navigate to “Personal Access Token>Fine-Grained tokens”

  • If present, click on “LEAP member management token”; otherwise create a new token with that name (the actual name is optional here, but make sure to name it in a memorable way), and authenticate.

  • Generate or regenerate the token

    • The required permissions are “Read and Write access to members” and “Read Access to actions and metadata”

    • Set the expiration to a full year (the current limit set on the org level)

  • Make sure to copy the token (leave the page open until the next step is completed, since you will have to recreate the token once the page is closed!)

  • Go to the member_management repo and navigate to “Settings > Secrets and Variables > Actions” and open the “ORG_TOKEN” to edit

  • Paste the above token from the clipboard and save.

  • Run the Member Add Action and confirm that it is successful

  • Close the token page and you are done!

Handover Checklist for Admins

The following is a list of tasks that should be done by any new hire in the Data and Computation Manager position to ensure smooth operations.

Moving Data between buckets using bare VMs

In general you need some form of compute to move data between different object store locations, but be aware that the data will always be streamed to and from that location over the internet, so a fast connection is key for fast transfers. There are a variety of ways to move data, with perhaps the easiest being to run fsspec or rclone on your local computer, but speed is likely limited by your local internet connection. For certain tasks (e.g. moving data to admin-only publishing buckets on the m2lines OSN Pod) it is recommended to use rclone on a VM.

Tip

These instructions should be easy to adapt to VM instances on other clouds, and can likely be automated to a much larger degree, but this is what has worked so far. Ultimately this approach is a somewhat manual implementation of the concept of skyplane, which sadly does not seem to be actively maintained anymore. At the time of writing we were able to achieve ~700MB/s transfer speeds with a single VM following the instructions below.

Manual spinup of cloud VMs for bulk data transfer

Following these instructions requires permissions on the LEAP Google Cloud Account. Contact an admin if you run into permission issues.

Warning

Using VMs this way does not automatically delete instances. Make sure to do that when your transfer is done.

  • Navigate to the Google Cloud Console and from there to “Compute Engine” and “VM instances”

  • Click on “Create Instance”

  • Configure your VM instance (this is an example config that worked well in the past, you can modify as needed). If not specified below leave all settings on the default.

    • Choose a memorable name like “boatymccloneface”

    • Use a region that is close to your storage (for LEAP buckets this is 'us-central1') and leave the zone on 'Any'

    • Choose an 'E2' machine type preset (here e2-standard-8)

    • In “OS and Storage” select the latest “Ubuntu” version as Operating System and “Balanced persistent disk” as Boot disk type.

      • Set the size to 20GB

    • Under “Observability” enable “Install Ops Agent …”

    • (Only needed when source location is on GCS) Under “Security” change “Access scopes” to “Set access for each API”, and set “Storage” to “Read Only”.

    • (Optional cost saving) Under “Advanced” select 'VM provisioning model: Spot'. This means the instance can shut down at any time, and you will have to rerun these steps to pick up the transfer. If you want the job to be guaranteed to finish, choose “On Demand”, but be aware that this will come at a higher cost.

    • Optional but highly Recommended: Under “Advanced” enable “Set a time limit for the VM”, and limit it to the number of hours you expect the transfer to take. You can choose to either stop or delete the VM under “On VM termination”. If you choose stop you will keep incurring costs for the storage volume, so unless you expect to restart the instance, choose delete here.

  • Click on “Create”

  • You should now be able to see your instance in the list under “VM Instances”. Click the SSH button to tunnel into the VM

  • Install rclone with sudo -v ; curl https://rclone.org/install.sh | sudo bash

  • Start a tmux session with tmux new (cheatsheet for tmux)

  • Set the config via env variables one by one. The exact details might depend on your source/target storage. See the rclone docs for more details. This example copies from the LEAP gcs buckets to the OSN pod

    export RCLONE_CONFIG_SOURCE_TYPE=gcs
    export RCLONE_CONFIG_SOURCE_ENV_AUTH=true
    export RCLONE_CONFIG_TARGET_TYPE=s3
    export RCLONE_CONFIG_TARGET_PROVIDER=Ceph
    export RCLONE_CONFIG_TARGET_ENDPOINT=https://nyu1.osn.mghpcc.org
    export RCLONE_CONFIG_TARGET_ACCESS_KEY_ID=XXX
    export RCLONE_CONFIG_TARGET_SECRET_ACCESS_KEY=XXX
    
  • Run the transfer! rclone sync --fast-list --s3-chunk-size 128M --s3-upload-concurrency 128 --transfers 128 --checkers 256 -P source:leap-persistent/some/prefix/ target:osn-bucket-name/another/prefix

    • Choosing sync here enables you to restart a transfer if it failed (e.g. due to a spot instance being shut down, or the transfer taking longer than expected).

    • The additional flags passed here have worked well for past transfers, but they might be tuned for better performance in various scenarios.

  • Watch the transfer progress or work on something else ☕️

  • You might get disconnected from the SSH browser window after a while (this is why we run the process within tmux!). Simply click on SSH again and run tmux ls. Pick whatever session you want to re-attach. Then run tmux attach -d -t <session id> to re-attach it to a new tmux instance and release it from the old one.

  • Important. DO NOT SKIP!: When your transfer is finished, go back to Google Cloud Console and from there to “Compute Engine” and “VM instances” and click the three dots to the right of your instance, and delete it. If you forget about this LEAP will keep paying for the instance!
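
The console steps above can also be approximated from the command line. The following is a sketch, assuming the gcloud CLI is installed and authenticated against the LEAP project. The function names, the default zone (us-central1-a), and the Ubuntu image family are assumptions made for illustration; access scopes and the Ops Agent are left at their defaults and are not covered here.

```shell
# Sketch: create and delete a transfer VM with the gcloud CLI instead of the console.
# Zone and image family defaults are assumptions; check them against your project.
create_transfer_vm() {
  local name="$1" zone="${2:-us-central1-a}" max_hours="${3:-8}"
  local image_family="${IMAGE_FAMILY:-ubuntu-2404-lts-amd64}"  # assumed family name
  gcloud compute instances create "$name" \
    --zone="$zone" \
    --machine-type=e2-standard-8 \
    --image-family="$image_family" \
    --image-project=ubuntu-os-cloud \
    --boot-disk-size=20GB \
    --boot-disk-type=pd-balanced \
    --provisioning-model=SPOT \
    --instance-termination-action=DELETE \
    --max-run-duration="${max_hours}h"   # VM is deleted after this many hours
}

# Delete the VM once the transfer is done so it stops incurring costs.
delete_transfer_vm() {
  local name="$1" zone="${2:-us-central1-a}"
  gcloud compute instances delete "$name" --zone="$zone" --quiet
}
```

Example: create_transfer_vm boatymccloneface us-central1-a 4, and later delete_transfer_vm boatymccloneface. Running gcloud compute instances list afterwards is a quick way to confirm that nothing is left behind.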

Non-Technical Admin Tasks

This section describes admin tasks that are necessary for the maintenance of LEAP-Pangeo components (including collaborative efforts led by M²LInES) which require appropriate permissions, but no coding (everything can be achieved on one of several websites).

Member Sign Up Troubleshooting

Ideally members should be signed on automatically and well ahead of any event (see Member Sign Up Procedure for an overview of our member sign up mechanics). Despite best efforts, situations arise where either staff or the organizers/instructors of an event need to quickly sign on new members, or troubleshoot why certain users do not have access to the JupyterHub. Follow the steps in this Flowchart to quickly resolve any issues in such cases.

A+G: Check if a user is in the Member Spreadsheet

Request read access or inquire with LEAP staff to confirm that the users who are having trouble are listed in the Member Data Spreadsheet, and make sure the sheet includes their GitHub username. Always make sure that the GitHub name exists (search on GitHub) and does not contain extra characters like "@".
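
A small sketch of these username checks is shown below. The function names are hypothetical. The format check follows GitHub's username rules (letters, digits, and single hyphens, no leading or trailing hyphen, at most 39 characters), and the existence check assumes curl is available and queries the public GitHub REST API, which is rate limited for unauthenticated requests.

```shell
# Sketch: clean up and sanity-check a GitHub username copied from a spreadsheet.

# Strip surrounding whitespace and a leading "@".
normalize_github_username() {
  local name="$1"
  name="${name#"${name%%[![:space:]]*}"}"   # trim leading whitespace
  name="${name%"${name##*[![:space:]]}"}"   # trim trailing whitespace
  name="${name#@}"                          # drop a leading "@"
  printf '%s\n' "$name"
}

# Return 0 if the name has a valid GitHub username format.
is_plausible_github_username() {
  [ "${#1}" -le 39 ] || return 1
  printf '%s' "$1" | grep -Eq '^[A-Za-z0-9]+(-[A-Za-z0-9]+)*$'
}

# Return 0 if the public GitHub API knows the user (needs network access).
github_user_exists() {
  local status
  status="$(curl -s -o /dev/null -w '%{http_code}' "https://api.github.com/users/$1")"
  [ "$status" = "200" ]
}
```

A typical sequence would be to normalize the spreadsheet entry first, then run the format check, and only then query the API.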

B: Add Users to the Member Spreadsheet

If this is not time sensitive, make sure that the user has completed the membership application (more details here) and inquire with LEAP staff about the status of membership (adding users might take a while, thus always sign up users well in advance). In a time-sensitive situation, contact LEAP staff to expedite adding the new users to the Member Spreadsheet.

C+D+F: Checking Github Team Membership and invite status

Repeat the following steps for all relevant Github Teams (you can find links to the team page here):

  • C: Navigate to the search bar at the top that says “Find a member…” and enter the github username of the user. If the name shows up the user is part of this github team

  • D: Right next to the search bar is a button that says “… pending members”. Click on that button and scroll down the list. If the username is in that list the member has received an invite, but not yet accepted it.

  • F: Accepting the invite has to be done by the user! Point them to our FAQs for instructions on how to accept the invite.
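
Steps C and D can also be checked from a terminal. This is a sketch that assumes the GitHub CLI (gh) is installed and authenticated with sufficient permissions on the 'leap-stc' organization. team_membership_state is a hypothetical helper, and the team slug in the example call is a placeholder.

```shell
# Sketch: report whether a user is an active or pending member of a GitHub team.
# Prints "active", "pending", or "none". Note that "none" is also printed when the
# API call fails for other reasons (e.g. missing permissions), so treat it with care.
team_membership_state() {
  local org="$1" team_slug="$2" username="$3" state
  if state="$(gh api "orgs/${org}/teams/${team_slug}/memberships/${username}" --jq '.state' 2>/dev/null)"; then
    printf '%s\n' "$state"
  else
    printf 'none\n'
  fi
}
```

Example call with placeholders: team_membership_state leap-stc <team-slug> <github-username>. A result of "pending" corresponds to case D, where the user still has to accept the invite.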

E: Manually rerunning github sign up action

Attention

You need to have maintainer access to the private leap-stc/member_management repo in order to follow these steps. If you cannot follow these steps, please ask one of the GitHub organization admins to add you to an appropriate team (e.g. bootcamp-instructor) that has access.

  • Navigate to leap-stc/member_management

  • In the top navigation bar, click on “Actions”

  • On the left click on “Parse Member Sheet and Add New Members”

  • In the upper right corner, click on “Run Workflow”, and again on “Run Workflow” in the pop up window.

  • After a short while you should see a yellow circle in the main window, indicating that the github action is in progress. Wait until the action is finished (usually 3-5 minutes).

  • If the circle turns green, you are done. If the circle turns red, try to run it one more time to rule out any random issues. If the error persists, reach out to a member of the Data and Computation Team.
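
The same rerun can be sketched with the GitHub CLI, assuming gh is installed and authenticated with access to the private repo. The function name and the SIGNUP_POLL_DELAY variable are hypothetical; the workflow name is taken from the steps above.

```shell
# Sketch: trigger the member sign up workflow and show the most recent run.
rerun_member_signup() {
  local repo="leap-stc/member_management"
  local workflow="Parse Member Sheet and Add New Members"
  gh workflow run "$workflow" --repo "$repo" || return 1
  # Give GitHub a moment to register the run before listing it.
  sleep "${SIGNUP_POLL_DELAY:-10}"
  gh run list --repo "$repo" --workflow "$workflow" --limit 1
}
```

The listed run shows its status, which plays the role of the yellow, green, or red circle in the web interface.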

H: Clearing Browser Cache etc

If users are part of a GitHub team and still have trouble signing on, instruct them to clear their browser cache or try a different browser. This step is unlikely to be necessary.

If none of this works

Note

This is a very unlikely scenario and has very rarely happened!

Follow the flowchart and reach out to either the Data and Computation Team or 2i2c support.

M²LInES OSN pod administration

All administrative tasks pertaining to the M²LInES OSN bucket are handled via the Coldfront Portal. Please log in with one of your affiliated organizations and make sure to use the same one each time in case you have multiple affiliations.

The important units of division on the OSN pod are projects and buckets. Each project can have multiple buckets, and you can give others access as guests and admins on a per project basis. Buckets are how the actual storage space is organized, and each bucket will have access credentials and a storage quota, both of which might need actions from an admin from time to time.

Check Bucket attributes

To check individual buckets’ attributes log into the Coldfront Portal, click on the relevant project, navigate to the “Allocations” section, find the bucket name in the “Information” column and click on the folder symbol in the “Actions” column.

Scroll to the “Allocation Attributes” section. You can see all relevant values here.

  • OSN Anonymous Access: If True, the data is public and no credentials are needed to read it (writing still requires credentials).

  • OSN Bucket Quota (TB): Shows the currently allocated size. This is the maximum size, not what is actually used!

  • OSN RO/RW Bucket Access/Secret Key: Credentials for read-only (RO) and read-write (RW) access to the bucket. See Credentials for more details.

Share bucket credentials

Attention

Some buckets are not meant to be accessible for write by users! Please always refer to OSN Pod Organization and only give access to project specific buckets and the 'leap-pangeo-inbox' bucket to non-admins.

  • Navigate to the specific bucket whose credentials you want to share (see above for detailed steps)

  • Copy the relevant Access and Secret keys (either RO or RW depending on the desired use) and share them with the relevant users e.g. by pasting them into a password manager and sharing an authenticated link.

Increasing Storage Quota

If any of the buckets needs more storage space, follow these steps:

  • Log into Coldfront

  • Navigate to the project that contains the bucket (we currently separate projects for M²LInES, LEAP, and the LEAP-Pangeo Data ingestion)

  • Scroll to “Allocations” and find your bucket in the “Information” column. Click on the folder icon in the corresponding “Actions” column.

  • In the top right click on “Request Change” and scroll down to “Allocation Attributes”. Enter the desired new size in TB in the “Request New Value” column and the “OSN Bucket Quota (TB)” row and enter a short justification (required).

  • Click the “Submit” button.

  • You should see a green box with “Allocation change request successfully submitted. ” at the top of the next page.

  • Wait for email confirmation of the change.

Provision a new Bucket

  • Log into Coldfront

  • Navigate to an existing Project (or create a new one; see below), scroll to “Allocations” and click on “Request Resource Allocation”

  • In the following dialog choose “OSN Bucket (Storage)”, write a short justification that states whether you want the bucket to have anonymous (public) access, and choose a size in TB.

  • Click “Submit”

  • Wait for email confirmation.

Create a new Project

Note

You need PI status on the pod to create new projects. Reach out to the M²LInES admin to discuss this if you do not have access yet.

  • Log into Coldfront

  • Click on the “Projects” link on the homepage

  • Click “Add a Project”

  • Choose a title, write a short description of the project, and optionally choose a field of science.

  • Then click “Save”

  • Wait for email confirmation.

OSN Support

For any questions or functionality not mentioned above, please refer to the OSN documentation or reach out to OSN support ('help@osn.mghpcc.org').

Google Cloud Account Maintenance

Attention

To follow these steps you need to have the appropriate privileges on the LEAP cloud account/organization. Reach out to an organization owner if you need access.

All of the following steps are done from the Google Cloud Console.

Billing

  • Click on the ‘burger’ symbol on the top-left, and navigate to Billing or search for “Billing” in the bar on the top.