from climsim_utils.data_utils import *

Instantiating the class#

The example below saves training data in both .h5 and .npy format; adjust the flags if you only need one format. Also set input_abbrev to match the input data files you will use. We expanded the original '.mli.' input files to include additional features, such as information from previous steps, and '.mlexpand.' is simply the arbitrary name we used for the expanded input files.

Currently the training script assumes the training set is in .h5 format and the validation set is in .npy format. It is fine to keep only save_h5=True in the block below when generating training data.
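For reference, here is a minimal sketch of how the two formats might be consumed downstream. The file names match the ones produced later in this notebook, but the reading pattern is an assumption for illustration, not the actual training-script code.

# Hypothetical read-back sketch (not the actual training script):
# the training split is read from .h5, the validation split from .npy.
import h5py
import numpy as np

with h5py.File('train_input.h5', 'r') as f:
    key = list(f.keys())[0]               # inspect the file rather than assuming a dataset name
    train_input = f[key][:]
val_input = np.load('val_input.npy')      # .npy loads directly into a numpy array
print(train_input.shape, val_input.shape)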

grid_path = '/global/u2/z/zeyuanhu/nvidia_codes/Climsim_private/grid_info/ClimSim_low-res_grid-info.nc'
norm_path = '/global/u2/z/zeyuanhu/nvidia_codes/Climsim_private/preprocessing/normalizations/'

grid_info = xr.open_dataset(grid_path)
# No naming issue here: these normalization-related files are just placeholders, since we set normalize=False in data_utils.
input_mean = xr.open_dataset(norm_path + 'inputs/input_mean_v5_pervar.nc')
input_max = xr.open_dataset(norm_path + 'inputs/input_max_v5_pervar.nc')
input_min = xr.open_dataset(norm_path + 'inputs/input_min_v5_pervar.nc')
output_scale = xr.open_dataset(norm_path + 'outputs/output_scale_std_lowerthred_v5.nc')

data = data_utils(grid_info = grid_info, 
                  input_mean = input_mean, 
                  input_max = input_max, 
                  input_min = input_min, 
                  output_scale = output_scale,
                  input_abbrev = 'mlexpand',
                  output_abbrev = 'mlo',
                  normalize=False,
                  save_h5=True,
                  save_npy=True
                  )
# set data path
data.data_path = '/global/homes/z/zeyuanhu/scratch/hugging/E3SM-MMF_ne4/train/'

# set inputs and outputs to the V2 rh subset ('rh' means RH replaces specific humidity in the input features)
data.set_to_v2_rh_vars()
data.input_vars
['state_t',
 'state_rh',
 'state_q0002',
 'state_q0003',
 'state_u',
 'state_v',
 'pbuf_ozone',
 'pbuf_CH4',
 'pbuf_N2O',
 'state_ps',
 'pbuf_SOLIN',
 'pbuf_LHFLX',
 'pbuf_SHFLX',
 'pbuf_TAUX',
 'pbuf_TAUY',
 'pbuf_COSZRS',
 'cam_in_ALDIF',
 'cam_in_ALDIR',
 'cam_in_ASDIF',
 'cam_in_ASDIR',
 'cam_in_LWUP',
 'cam_in_ICEFRAC',
 'cam_in_LANDFRAC',
 'cam_in_OCNFRAC',
 'cam_in_SNOWHICE',
 'cam_in_SNOWHLAND']

Create training data#

Below is an example of creating the training data by aggregating the 7-year ClimSim simulation data. A subsampling stride of 1000 is used here as an example; in our actual work we used stride_sample=1. We could not fit the full 7 years of data into memory without subsampling. If that is also the case for you, process only a subset of the data at a time by adjusting the regexps in the set_regexps method. We saved 14 separate input .h5 files: for each year we saved two files, setting start_idx=0 or 1, giving a folder such as v2_full with 14 subfolders named '11', '12', '21', '22', …, '71', '72', each containing a train_input.h5 and a train_target.h5 (a sketch of this per-year loop follows the example below). How you split the saved training data does not affect training: the training script reads in all the samples and randomly draws across all of them to form each batch.

# set regular expressions for selecting training data
data.set_regexps(data_split = 'train', 
                regexps = ['E3SM-MMF.mlexpand.000[1234567]-*-*-*.nc', # years 1 through 7
                        'E3SM-MMF.mlexpand.0008-01-*-*.nc']) # first month of year 8
# set temporal subsampling
data.set_stride_sample(data_split = 'train', stride_sample = 1000)
# create list of files to extract data from
data.set_filelist(data_split = 'train', start_idx=0)
# save numpy files of training data
data.save_as_npy(data_split = 'train', save_path = '/global/homes/z/zeyuanhu/scratch/hugging/E3SM-MMF_ne4/preprocessing/v2_example/')
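As referenced above, here is a hedged sketch of how the 14 training files could be produced one year at a time with start_idx set to 0 or 1. The v2_full save path and the subfolder naming (year digit followed by 1 or 2) are illustrative assumptions; the method calls mirror the ones used in this notebook.

# Sketch only: write one train_input.h5/train_target.h5 pair per (year, start_idx)
# combination into subfolders '11', '12', ..., '71', '72'. Assumes the subfolders
# exist and that one year of data fits in memory with stride_sample=1.
base_save = '/global/homes/z/zeyuanhu/scratch/hugging/E3SM-MMF_ne4/preprocessing/v2_full/'
for year in range(1, 8):                    # training years 1 through 7
    for start_idx in (0, 1):
        data.set_regexps(data_split='train',
                         regexps=[f'E3SM-MMF.mlexpand.000{year}-*-*-*.nc'])
        data.set_stride_sample(data_split='train', stride_sample=1)
        data.set_filelist(data_split='train', start_idx=start_idx)
        data.save_as_npy(data_split='train',
                         save_path=f'{base_save}{year}{start_idx + 1}/')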

Create validation data#

# set regular expressions for selecting validation data
data.set_regexps(data_split = 'val',
                 regexps = ['E3SM-MMF.mlexpand.0008-0[23456789]-*-*.nc', # months 2 through 9 of year 8
                            'E3SM-MMF.mlexpand.0008-1[012]-*-*.nc', # months 10 through 12 of year 8
                            'E3SM-MMF.mlexpand.0009-01-*-*.nc']) # first month of year 9
# set temporal subsampling
# data.set_stride_sample(data_split = 'val', stride_sample = 7)
data.set_stride_sample(data_split = 'val', stride_sample = 700)
# create list of files to extract data from
data.set_filelist(data_split = 'val')
# save numpy files of validation data
data.save_as_npy(data_split = 'val', save_path = '/global/homes/z/zeyuanhu/scratch/hugging/E3SM-MMF_ne4/preprocessing/v2_example/')

Create test data#

data.data_path = '/global/homes/z/zeyuanhu/scratch/hugging/E3SM-MMF_ne4/_test/'

data.set_to_v2_rh_vars()  # keep the same V2 rh variable subset for the test split

# set regular expressions for selecting test data
data.set_regexps(data_split = 'test',
                 regexps = ['E3SM-MMF.mlexpand.0009-0[3456789]-*-*.nc', 
                            'E3SM-MMF.mlexpand.0009-1[012]-*-*.nc',
                            'E3SM-MMF.mlexpand.0010-*-*-*.nc',
                            'E3SM-MMF.mlexpand.0011-0[12]-*-*.nc'])
# set temporal subsampling
# data.set_stride_sample(data_split = 'test', stride_sample = 7)
data.set_stride_sample(data_split = 'test', stride_sample = 700)
# create list of files to extract data from
data.set_filelist(data_split = 'test')
# save numpy files of test data
data.save_as_npy(data_split = 'test', save_path = '/global/homes/z/zeyuanhu/scratch/hugging/E3SM-MMF_ne4/preprocessing/v2_example/')
!ls /global/homes/z/zeyuanhu/scratch/hugging/E3SM-MMF_ne4/preprocessing/v2_example/
test_input.h5	test_target.npy  train_target.h5   val_input.npy
test_input.npy	train_input.h5	 train_target.npy  val_target.h5
test_target.h5	train_input.npy  val_input.h5	   val_target.npy
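As an optional last step, the sketch below (an addition for illustration, not part of the original workflow) checks that the .h5 and .npy copies of each split hold arrays of the same shape.

# Sanity-check sketch: compare the .h5 and .npy versions of each saved split.
import h5py
import numpy as np

save_dir = '/global/homes/z/zeyuanhu/scratch/hugging/E3SM-MMF_ne4/preprocessing/v2_example/'
for split in ('train', 'val', 'test'):
    for kind in ('input', 'target'):
        with h5py.File(f'{save_dir}{split}_{kind}.h5', 'r') as f:
            key = list(f.keys())[0]   # take the first dataset rather than assuming its name
            h5_shape = f[key].shape
        npy_shape = np.load(f'{save_dir}{split}_{kind}.npy', mmap_mode='r').shape
        print(split, kind, h5_shape, npy_shape)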