Data Ingestion Tutorial

New to cloud data ingestion? You’re not alone! Transferring large volumes of climate data is no easy task and can feel like a tall order. Never fear! This tutorial is aimed squarely at beginners and is designed to provide the why, how, and what of data ingestion.

First of all, data ingestion refers to the general process of moving data from wherever it currently lives into an analysis-ready, cloud-optimized (ARCO) format.
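
For example, an ARCO dataset stored as a Zarr store in cloud object storage can be opened lazily with xarray: only the metadata is read up front, and data chunks are fetched on demand. Here is a minimal sketch, assuming a hypothetical bucket path and that the zarr and gcsfs libraries are installed:

```python
import xarray as xr

# Hypothetical path to an ARCO Zarr store on cloud object storage;
# opening it reads only the metadata, not the data itself.
ds = xr.open_dataset(
    "gs://leap-persistent/example-user/example-dataset.zarr",
    engine="zarr",
    chunks={},  # load lazily with dask; chunks are fetched on demand
)
print(ds)  # inspect dimensions, coordinates, and variables
```

This is the payoff of ingestion: anyone can inspect and subset the dataset in seconds without downloading it first.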

Requirements and Desiderata

We have additional requirements for data ingestion to make the process sustainable and scalable:

  • The process needs to be reproducible, e.g. when we want to re-ingest data into a different storage location.

  • Separation of concerns: The person who knows the dataset (the ‘data expert’) is in a unique position to encode their knowledge about the dataset into the recipe, but they should not be concerned with the details of how the recipe is executed or where the data is ultimately stored. That is the responsibility of the Data and Computation Team.

The way we achieve this is to base our ingestion on Pangeo Forge recipes. For clearer organization, the recipe for each dataset should reside in its own repository under the leap-stc GitHub organization. Each of these repositories is called a ‘feedstock’ and contains additional metadata files (you can read more in the Pangeo Forge docs).
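
To give a feel for what a recipe looks like, here is a minimal sketch using the Apache Beam-based pangeo-forge-recipes API. The source URL, the ‘time’ keys, and the store name are hypothetical placeholders, not a real LEAP dataset:

```python
import apache_beam as beam
from pangeo_forge_recipes.patterns import ConcatDim, FilePattern
from pangeo_forge_recipes.transforms import (
    OpenURLWithFSSpec,
    OpenWithXarray,
    StoreToZarr,
)

# Hypothetical source: one netCDF file per year.
def make_url(time: str) -> str:
    return f"https://data.example.com/climate/{time}.nc"

# The 'data expert' encodes how the source files map onto the
# concatenation dimension.
pattern = FilePattern(make_url, ConcatDim("time", ["2000", "2001", "2002"]))

# The recipe itself: open each file and combine all of them
# into a single Zarr store.
recipe = (
    beam.Create(pattern.items())
    | OpenURLWithFSSpec()
    | OpenWithXarray(file_type=pattern.file_type)
    | StoreToZarr(
        store_name="example-dataset.zarr",
        combine_dims=pattern.combine_dim_keys,
    )
)
```

Note that the recipe says nothing about which storage bucket the Zarr store ends up in or which compute backend runs the pipeline; those are configured separately at execution time, which is exactly the separation of concerns described above.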

How to get new data ingested

To start ingesting a dataset follow these steps:

  1. Let the LEAP community and the Data and Computation Team know about the new dataset. We gather all ingestion requests in our ‘leap-stc/data_management’ issue tracker. Check the existing issues with the tag ‘dataset’ to see if somebody else has already requested this particular dataset. If that is not the case, you can add a new dataset_request. Making these requests in a central location enables others to see which datasets are currently being ingested and what their status is.

  2. Use our feedstock template to create a feedstock repository, following the instructions in its README to get started.

  3. If issues arise, please reach out to the Data and Computation Team.

Note

This does not currently provide a solution for datasets that you have produced yourself (e.g. as part of a publication). We are working on formalizing a workflow for this type of data. Please reach out to the Data and Computation Team if you have data that you would like to publish. See Types of Data Used at LEAP for more.