Loading and Working with the Dataset

This notebook is based on an original notebook by Minh Phan (UW Varanasi intern 2023). It describes how to read and work with the Indian Ocean dataset.

The dataset contains chlorophyll concentrations together with the atmospheric and oceanographic fields used to force the machine learning models. It is stored as a single Zarr store.

Note

cmocean may not be installed in your environment. If the import step fails, uncomment the first two lines of the cell below and run it to pip install the package. You can uncomment both lines at once by highlighting them and pressing Ctrl-/.

# %%capture
# %pip install cmocean
import xarray as xr
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt # static plotting
import holoviews as hv # interactive plotting backend
import hvplot.xarray # adds the .hvplot accessor to xarray objects
import cmocean # oceanography colormaps
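
As an alternative to waiting for the import to fail, a cell can first check whether the package is importable. The following is a minimal sketch using only the standard library; the helper name `is_installed` is made up for illustration and is not part of the original notebook.

```python
import importlib.util


def is_installed(package_name: str) -> bool:
    """Return True if a top-level package can be found in this environment."""
    return importlib.util.find_spec(package_name) is not None


# Hypothetical usage: report a missing package before the import cell runs.
if not is_installed("cmocean"):
    print("cmocean is missing; run `%pip install cmocean` in a notebook cell")
```

The check only looks the package up and does not import it, so it is cheap to run at the top of a notebook.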

Read data#

xarray can open Zarr stores directly. The variables are loaded lazily as dask arrays, so no data is read into memory until it is needed.

ds = xr.open_zarr("~/shared/mind_the_chl_gap/IO.zarr")

The dataset representation can be viewed below. Clicking on Data variables displays the full list of variables.

ds
<xarray.Dataset> Size: 66GB
Dimensions:                       (time: 16071, lat: 177, lon: 241)
Coordinates:
  * lat                           (lat) float32 708B 32.0 31.75 ... -11.75 -12.0
  * lon                           (lon) float32 964B 42.0 42.25 ... 101.8 102.0
  * time                          (time) datetime64[ns] 129kB 1979-01-01 ... ...
Data variables: (12/27)
    CHL                           (time, lat, lon) float32 3GB dask.array<chunksize=(100, 177, 241), meta=np.ndarray>
    CHL_cmes-cloud                (time, lat, lon) uint8 686MB dask.array<chunksize=(100, 177, 241), meta=np.ndarray>
    CHL_cmes-gapfree              (time, lat, lon) float32 3GB dask.array<chunksize=(100, 177, 241), meta=np.ndarray>
    CHL_cmes-land                 (lat, lon) uint8 43kB dask.array<chunksize=(177, 241), meta=np.ndarray>
    CHL_cmes-level3               (time, lat, lon) float32 3GB dask.array<chunksize=(100, 177, 241), meta=np.ndarray>
    CHL_cmes_flags-gapfree        (time, lat, lon) float32 3GB dask.array<chunksize=(100, 177, 241), meta=np.ndarray>
    ...                            ...
    ug_curr                       (time, lat, lon) float32 3GB dask.array<chunksize=(100, 177, 241), meta=np.ndarray>
    v_curr                        (time, lat, lon) float32 3GB dask.array<chunksize=(100, 177, 241), meta=np.ndarray>
    v_wind                        (time, lat, lon) float32 3GB dask.array<chunksize=(100, 177, 241), meta=np.ndarray>
    vg_curr                       (time, lat, lon) float32 3GB dask.array<chunksize=(100, 177, 241), meta=np.ndarray>
    wind_dir                      (time, lat, lon) float32 3GB dask.array<chunksize=(100, 177, 241), meta=np.ndarray>
    wind_speed                    (time, lat, lon) float32 3GB dask.array<chunksize=(100, 177, 241), meta=np.ndarray>
Attributes: (12/92)
    Conventions:                     CF-1.8, ACDD-1.3
    DPM_reference:                   GC-UD-ACRI-PUG
    IODD_reference:                  GC-UD-ACRI-PUG
    acknowledgement:                 The Licensees will ensure that original ...
    citation:                        The Licensees will ensure that original ...
    cmems_product_id:                OCEANCOLOUR_GLO_BGC_L3_MY_009_103
    ...                              ...
    time_coverage_end:               2024-04-18T02:58:23Z
    time_coverage_resolution:        P1D
    time_coverage_start:             2024-04-16T21:12:05Z
    title:                           cmems_obs-oc_glo_bgc-plankton_my_l3-mult...
    westernmost_longitude:           -180.0
    westernmost_valid_longitude:     -180.0
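
The text representation above shows only 12 of the 27 variables. The same information can also be read programmatically. The sketch below uses a small synthetic dataset as a stand-in, so the variable names and sizes are assumptions for illustration and not the contents of IO.zarr; the same attributes (`data_vars`, `sizes`, `dtype`) apply to the real `ds`.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Synthetic stand-in for the real dataset: 3 days on a 4 x 5 grid.
time = pd.date_range("1979-01-01", periods=3, freq="D")
lat = np.linspace(32.0, 31.25, 4)  # descending, like the real lat coordinate
lon = np.linspace(42.0, 43.0, 5)
shape = (time.size, lat.size, lon.size)

toy = xr.Dataset(
    {
        "CHL": (("time", "lat", "lon"), np.zeros(shape, dtype="float32")),
        "wind_speed": (("time", "lat", "lon"), np.ones(shape, dtype="float32")),
    },
    coords={"time": time, "lat": lat, "lon": lon},
)

variable_names = list(toy.data_vars)  # full list of variable names
dim_sizes = dict(toy.sizes)           # dimension name -> length
chl_dtype = toy["CHL"].dtype          # storage type of one variable

print(variable_names)
print(dim_sizes)
print(chl_dtype)
```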

We can slice data by the dimensions (latitude, longitude, time) and data variables.

# slice by latitude
# lat is stored from north to south (32 to -12), so the slice runs from high to low
ds.sel(lat=slice(0, -12))
<xarray.Dataset> Size: 18GB
Dimensions:                       (time: 16071, lat: 49, lon: 241)
Coordinates:
  * lat                           (lat) float32 196B 0.0 -0.25 ... -11.75 -12.0
  * lon                           (lon) float32 964B 42.0 42.25 ... 101.8 102.0
  * time                          (time) datetime64[ns] 129kB 1979-01-01 ... ...
Data variables: (12/27)
    CHL                           (time, lat, lon) float32 759MB dask.array<chunksize=(100, 49, 241), meta=np.ndarray>
    CHL_cmes-cloud                (time, lat, lon) uint8 190MB dask.array<chunksize=(100, 49, 241), meta=np.ndarray>
    CHL_cmes-gapfree              (time, lat, lon) float32 759MB dask.array<chunksize=(100, 49, 241), meta=np.ndarray>
    CHL_cmes-land                 (lat, lon) uint8 12kB dask.array<chunksize=(49, 241), meta=np.ndarray>
    CHL_cmes-level3               (time, lat, lon) float32 759MB dask.array<chunksize=(100, 49, 241), meta=np.ndarray>
    CHL_cmes_flags-gapfree        (time, lat, lon) float32 759MB dask.array<chunksize=(100, 49, 241), meta=np.ndarray>
    ...                            ...
    ug_curr                       (time, lat, lon) float32 759MB dask.array<chunksize=(100, 49, 241), meta=np.ndarray>
    v_curr                        (time, lat, lon) float32 759MB dask.array<chunksize=(100, 49, 241), meta=np.ndarray>
    v_wind                        (time, lat, lon) float32 759MB dask.array<chunksize=(100, 49, 241), meta=np.ndarray>
    vg_curr                       (time, lat, lon) float32 759MB dask.array<chunksize=(100, 49, 241), meta=np.ndarray>
    wind_dir                      (time, lat, lon) float32 759MB dask.array<chunksize=(100, 49, 241), meta=np.ndarray>
    wind_speed                    (time, lat, lon) float32 759MB dask.array<chunksize=(100, 49, 241), meta=np.ndarray>
Attributes: (12/92)
    Conventions:                     CF-1.8, ACDD-1.3
    DPM_reference:                   GC-UD-ACRI-PUG
    IODD_reference:                  GC-UD-ACRI-PUG
    acknowledgement:                 The Licensees will ensure that original ...
    citation:                        The Licensees will ensure that original ...
    cmems_product_id:                OCEANCOLOUR_GLO_BGC_L3_MY_009_103
    ...                              ...
    time_coverage_end:               2024-04-18T02:58:23Z
    time_coverage_resolution:        P1D
    time_coverage_start:             2024-04-16T21:12:05Z
    title:                           cmems_obs-oc_glo_bgc-plankton_my_l3-mult...
    westernmost_longitude:           -180.0
    westernmost_valid_longitude:     -180.0
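
The reversed range matters because label-based slices follow the order of the coordinate. The sketch below rebuilds a latitude axis with the same values as the one shown above (32.0 down to -12.0 in steps of 0.25) on a synthetic array and compares the two slice directions; `sortby` is shown as one possible way to get an ascending axis if that is preferred.

```python
import numpy as np
import xarray as xr

# Descending latitude axis: 32.0, 31.75, ..., -12.0 (177 values).
lat = np.linspace(32.0, -12.0, 177)
da = xr.DataArray(np.arange(lat.size), coords={"lat": lat}, dims="lat")

# Slice in the same direction as the coordinate: returns 0.0 down to -12.0.
north_to_south = da.sel(lat=slice(0, -12))

# Slice against the direction of the coordinate: returns an empty selection.
south_to_north = da.sel(lat=slice(-12, 0))

# After sorting the axis in ascending order, the "natural" slice works.
ascending = da.sortby("lat").sel(lat=slice(-12, 0))

print(north_to_south.sizes["lat"], south_to_north.sizes["lat"], ascending.sizes["lat"])
```

An empty result from a latitude slice is therefore usually a sign that the bounds are in the wrong order, not that the data is missing.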
# slice by longitude
ds.sel(lon=slice(42, 45))
<xarray.Dataset> Size: 4GB
Dimensions:                       (time: 16071, lat: 177, lon: 13)
Coordinates:
  * lat                           (lat) float32 708B 32.0 31.75 ... -11.75 -12.0
  * lon                           (lon) float32 52B 42.0 42.25 ... 44.75 45.0
  * time                          (time) datetime64[ns] 129kB 1979-01-01 ... ...
Data variables: (12/27)
    CHL                           (time, lat, lon) float32 148MB dask.array<chunksize=(100, 177, 13), meta=np.ndarray>
    CHL_cmes-cloud                (time, lat, lon) uint8 37MB dask.array<chunksize=(100, 177, 13), meta=np.ndarray>
    CHL_cmes-gapfree              (time, lat, lon) float32 148MB dask.array<chunksize=(100, 177, 13), meta=np.ndarray>
    CHL_cmes-land                 (lat, lon) uint8 2kB dask.array<chunksize=(177, 13), meta=np.ndarray>
    CHL_cmes-level3               (time, lat, lon) float32 148MB dask.array<chunksize=(100, 177, 13), meta=np.ndarray>
    CHL_cmes_flags-gapfree        (time, lat, lon) float32 148MB dask.array<chunksize=(100, 177, 13), meta=np.ndarray>
    ...                            ...
    ug_curr                       (time, lat, lon) float32 148MB dask.array<chunksize=(100, 177, 13), meta=np.ndarray>
    v_curr                        (time, lat, lon) float32 148MB dask.array<chunksize=(100, 177, 13), meta=np.ndarray>
    v_wind                        (time, lat, lon) float32 148MB dask.array<chunksize=(100, 177, 13), meta=np.ndarray>
    vg_curr                       (time, lat, lon) float32 148MB dask.array<chunksize=(100, 177, 13), meta=np.ndarray>
    wind_dir                      (time, lat, lon) float32 148MB dask.array<chunksize=(100, 177, 13), meta=np.ndarray>
    wind_speed                    (time, lat, lon) float32 148MB dask.array<chunksize=(100, 177, 13), meta=np.ndarray>
Attributes: (12/92)
    Conventions:                     CF-1.8, ACDD-1.3
    DPM_reference:                   GC-UD-ACRI-PUG
    IODD_reference:                  GC-UD-ACRI-PUG
    acknowledgement:                 The Licensees will ensure that original ...
    citation:                        The Licensees will ensure that original ...
    cmems_product_id:                OCEANCOLOUR_GLO_BGC_L3_MY_009_103
    ...                              ...
    time_coverage_end:               2024-04-18T02:58:23Z
    time_coverage_resolution:        P1D
    time_coverage_start:             2024-04-16T21:12:05Z
    title:                           cmems_obs-oc_glo_bgc-plankton_my_l3-mult...
    westernmost_longitude:           -180.0
    westernmost_valid_longitude:     -180.0
# slice by time
ds.sel(time=slice('1998', '1999'))
<xarray.Dataset> Size: 3GB
Dimensions:                       (time: 730, lat: 177, lon: 241)
Coordinates:
  * lat                           (lat) float32 708B 32.0 31.75 ... -11.75 -12.0
  * lon                           (lon) float32 964B 42.0 42.25 ... 101.8 102.0
  * time                          (time) datetime64[ns] 6kB 1998-01-01 ... 19...
Data variables: (12/27)
    CHL                           (time, lat, lon) float32 125MB dask.array<chunksize=(60, 177, 241), meta=np.ndarray>
    CHL_cmes-cloud                (time, lat, lon) uint8 31MB dask.array<chunksize=(60, 177, 241), meta=np.ndarray>
    CHL_cmes-gapfree              (time, lat, lon) float32 125MB dask.array<chunksize=(60, 177, 241), meta=np.ndarray>
    CHL_cmes-land                 (lat, lon) uint8 43kB dask.array<chunksize=(177, 241), meta=np.ndarray>
    CHL_cmes-level3               (time, lat, lon) float32 125MB dask.array<chunksize=(60, 177, 241), meta=np.ndarray>
    CHL_cmes_flags-gapfree        (time, lat, lon) float32 125MB dask.array<chunksize=(60, 177, 241), meta=np.ndarray>
    ...                            ...
    ug_curr                       (time, lat, lon) float32 125MB dask.array<chunksize=(60, 177, 241), meta=np.ndarray>
    v_curr                        (time, lat, lon) float32 125MB dask.array<chunksize=(60, 177, 241), meta=np.ndarray>
    v_wind                        (time, lat, lon) float32 125MB dask.array<chunksize=(60, 177, 241), meta=np.ndarray>
    vg_curr                       (time, lat, lon) float32 125MB dask.array<chunksize=(60, 177, 241), meta=np.ndarray>
    wind_dir                      (time, lat, lon) float32 125MB dask.array<chunksize=(60, 177, 241), meta=np.ndarray>
    wind_speed                    (time, lat, lon) float32 125MB dask.array<chunksize=(60, 177, 241), meta=np.ndarray>
Attributes: (12/92)
    Conventions:                     CF-1.8, ACDD-1.3
    DPM_reference:                   GC-UD-ACRI-PUG
    IODD_reference:                  GC-UD-ACRI-PUG
    acknowledgement:                 The Licensees will ensure that original ...
    citation:                        The Licensees will ensure that original ...
    cmems_product_id:                OCEANCOLOUR_GLO_BGC_L3_MY_009_103
    ...                              ...
    time_coverage_end:               2024-04-18T02:58:23Z
    time_coverage_resolution:        P1D
    time_coverage_start:             2024-04-16T21:12:05Z
    title:                           cmems_obs-oc_glo_bgc-plankton_my_l3-mult...
    westernmost_longitude:           -180.0
    westernmost_valid_longitude:     -180.0
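
Time slices accept partial date strings, and both ends of the slice are inclusive: `slice('1998', '1999')` covers 1 January 1998 through 31 December 1999, which is why the result above has 730 time steps. A small sketch on a synthetic daily time axis (the array values are placeholders):

```python
import numpy as np
import pandas as pd
import xarray as xr

# Synthetic daily axis spanning four years.
time = pd.date_range("1997-01-01", "2000-12-31", freq="D")
da = xr.DataArray(np.arange(time.size), coords={"time": time}, dims="time")

# Year strings expand to the whole year at both ends of the slice.
two_years = da.sel(time=slice("1998", "1999"))

# Month strings work the same way: this selects all of June 1998.
one_month = da.sel(time=slice("1998-06", "1998-06"))

first_day = pd.Timestamp(two_years["time"].values[0])
last_day = pd.Timestamp(two_years["time"].values[-1])
print(two_years.sizes["time"], one_month.sizes["time"], first_day, last_day)
```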
# slice by variable
ds[['u_curr', 'u_wind']]
<xarray.Dataset> Size: 5GB
Dimensions:  (time: 16071, lat: 177, lon: 241)
Coordinates:
  * lat      (lat) float32 708B 32.0 31.75 31.5 31.25 ... -11.5 -11.75 -12.0
  * lon      (lon) float32 964B 42.0 42.25 42.5 42.75 ... 101.5 101.8 102.0
  * time     (time) datetime64[ns] 129kB 1979-01-01 1979-01-02 ... 2022-12-31
Data variables:
    u_curr   (time, lat, lon) float32 3GB dask.array<chunksize=(100, 177, 241), meta=np.ndarray>
    u_wind   (time, lat, lon) float32 3GB dask.array<chunksize=(100, 177, 241), meta=np.ndarray>
Attributes: (12/92)
    Conventions:                     CF-1.8, ACDD-1.3
    DPM_reference:                   GC-UD-ACRI-PUG
    IODD_reference:                  GC-UD-ACRI-PUG
    acknowledgement:                 The Licensees will ensure that original ...
    citation:                        The Licensees will ensure that original ...
    cmems_product_id:                OCEANCOLOUR_GLO_BGC_L3_MY_009_103
    ...                              ...
    time_coverage_end:               2024-04-18T02:58:23Z
    time_coverage_resolution:        P1D
    time_coverage_start:             2024-04-16T21:12:05Z
    title:                           cmems_obs-oc_glo_bgc-plankton_my_l3-mult...
    westernmost_longitude:           -180.0
    westernmost_valid_longitude:     -180.0
# combine multiple slicing options all at once
ds[['u_curr', 'u_wind']].sel(time=slice('1998', '1999'), 
                             lat=slice(0, -12), 
                             lon=slice(42, 45))
<xarray.Dataset> Size: 4MB
Dimensions:  (time: 730, lat: 49, lon: 13)
Coordinates:
  * lat      (lat) float32 196B 0.0 -0.25 -0.5 -0.75 ... -11.5 -11.75 -12.0
  * lon      (lon) float32 52B 42.0 42.25 42.5 42.75 ... 44.25 44.5 44.75 45.0
  * time     (time) datetime64[ns] 6kB 1998-01-01 1998-01-02 ... 1999-12-31
Data variables:
    u_curr   (time, lat, lon) float32 2MB dask.array<chunksize=(60, 49, 13), meta=np.ndarray>
    u_wind   (time, lat, lon) float32 2MB dask.array<chunksize=(60, 49, 13), meta=np.ndarray>
Attributes: (12/92)
    Conventions:                     CF-1.8, ACDD-1.3
    DPM_reference:                   GC-UD-ACRI-PUG
    IODD_reference:                  GC-UD-ACRI-PUG
    acknowledgement:                 The Licensees will ensure that original ...
    citation:                        The Licensees will ensure that original ...
    cmems_product_id:                OCEANCOLOUR_GLO_BGC_L3_MY_009_103
    ...                              ...
    time_coverage_end:               2024-04-18T02:58:23Z
    time_coverage_resolution:        P1D
    time_coverage_start:             2024-04-16T21:12:05Z
    title:                           cmems_obs-oc_glo_bgc-plankton_my_l3-mult...
    westernmost_longitude:           -180.0
    westernmost_valid_longitude:     -180.0
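
The sizes reported in these outputs follow directly from the array shape and the data type: the number of elements times the bytes per element (4 for float32, 1 for uint8). The sketch below reproduces that arithmetic with the shapes shown above; the helper name `array_nbytes` is made up for illustration.

```python
from math import prod


def array_nbytes(shape, itemsize):
    """Uncompressed size in bytes of an array with this shape and element size."""
    return prod(shape) * itemsize


# One float32 variable on the full grid (time, lat, lon).
full_float32 = array_nbytes((16071, 177, 241), 4)

# One uint8 variable on the full grid.
full_uint8 = array_nbytes((16071, 177, 241), 1)

# One float32 variable after the combined time/lat/lon selection.
subset_float32 = array_nbytes((730, 49, 13), 4)

print(f"{full_float32 / 1e9:.2f} GB, {full_uint8 / 1e6:.0f} MB, {subset_float32 / 1e6:.2f} MB")
```

Estimating the size this way before calling `.load()` or `.values` is a quick check that a selection will fit in memory.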

We can also plot the data directly from a selection, for example a heatmap from a 2D array or a line chart from a 1D array. This is useful for inspecting the data on the go.

# make sure that the array you slice for a heatmap visualization is a 2D array
heatmap_arr = ds['wind_speed'].sel(time='2000-01-02')
heatmap_arr
<xarray.DataArray 'wind_speed' (lat: 177, lon: 241)> Size: 171kB
dask.array<getitem, shape=(177, 241), dtype=float32, chunksize=(177, 241), chunktype=numpy.ndarray>
Coordinates:
  * lat      (lat) float32 708B 32.0 31.75 31.5 31.25 ... -11.5 -11.75 -12.0
  * lon      (lon) float32 964B 42.0 42.25 42.5 42.75 ... 101.5 101.8 102.0
    time     datetime64[ns] 8B 2000-01-02
Attributes:
    long_name:  10 metre absolute speed
    units:      m s**-1
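
Selecting a single date drops the time dimension, which is what makes the result 2D. Selecting a range of dates keeps it, and the array then has to be reduced (for example with a mean over time) before it can be drawn as a heatmap. A sketch on a synthetic cube, with made-up sizes:

```python
import numpy as np
import pandas as pd
import xarray as xr

# Synthetic (time, lat, lon) cube: 4 days on a 5 x 6 grid.
time = pd.date_range("2000-01-01", periods=4, freq="D")
lat = np.linspace(1.0, 0.0, 5)
lon = np.linspace(42.0, 43.25, 6)
cube = xr.DataArray(
    np.zeros((time.size, lat.size, lon.size)),
    coords={"time": time, "lat": lat, "lon": lon},
    dims=("time", "lat", "lon"),
    name="wind_speed",
)

# A single label removes the time dimension: the result is (lat, lon).
one_day = cube.sel(time="2000-01-02")

# A slice keeps the time dimension: the result is still 3D.
two_days = cube.sel(time=slice("2000-01-02", "2000-01-03"))

# Reducing over time brings the range selection back to 2D.
two_day_mean = two_days.mean(dim="time")

print(one_day.dims, two_days.dims, two_day_mean.dims)
```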
heatmap_arr.plot.imshow()
<matplotlib.image.AxesImage at 0x7f70744d4cd0>
../_images/6fb4c46abe3acf949292c78ada560a6c5a8f8a3001d960c79717e6866173de31.png
# contour map with no filling
heatmap_arr.plot.contour()
<matplotlib.contour.QuadContourSet at 0x7f7074532e90>
../_images/3f44e8ed3229accfb4adee6f3a5db872cb6e9707b8f05fcdb05d2dd5d3e5996d.png
# contour map with color filling
heatmap_arr.plot.contourf()
<matplotlib.contour.QuadContourSet at 0x7f7074385890>
../_images/f98241d663dc389b0f39e3db25ef8a6302285fd8e8f59a9c90c084b556e19686.png
# A 3D surface plot
heatmap_arr.plot.surface()
<mpl_toolkits.mplot3d.art3d.Poly3DCollection at 0x7f7074498350>
../_images/e5bc49318af66c01ea8a375ef24704afa75acee481afc86c5c360e2e7676e909.png
# We can create interactive plots with hvplot
heatmap_arr.hvplot().options(cmap='bgy', width=600, height=500)

Line plots#

This is the daily wind speed averaged over the whole region, for 2007 through 2009.

ds['wind_speed'].sel(time=slice('2007', '2009')).mean(dim=['lat', 'lon']).plot(figsize=(10, 5))
[<matplotlib.lines.Line2D at 0x7f70741ea450>]
../_images/61d2241510ef7dd23544fd65bb402312112ef46f71682c8fa861ed249f0cef8b.png
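
The mean over lat and lon above gives every grid cell the same weight. On a regular latitude-longitude grid the cells cover less area away from the equator, so an area-weighted mean is sometimes preferred. The sketch below compares the two on synthetic data, using cos(latitude) as an approximate area weight; whether the difference matters for this region is left as an open question, and the random values are placeholders only.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Synthetic field on the same grid layout as the dataset.
lat = np.linspace(32.0, -12.0, 177)
lon = np.linspace(42.0, 102.0, 241)
time = pd.date_range("2007-01-01", periods=5, freq="D")
rng = np.random.default_rng(0)
wind = xr.DataArray(
    rng.uniform(0.0, 10.0, size=(time.size, lat.size, lon.size)),
    coords={"time": time, "lat": lat, "lon": lon},
    dims=("time", "lat", "lon"),
    name="wind_speed",
)

# Unweighted spatial mean, as in the cell above: one value per day.
simple_mean = wind.mean(dim=["lat", "lon"])

# Approximate area weights: cell area scales with cos(latitude).
weights = np.cos(np.deg2rad(wind["lat"]))
weights.name = "weights"
weighted_mean = wind.weighted(weights).mean(dim=["lat", "lon"])

# Sanity check: a constant field has the same mean under any weighting.
constant_check = (xr.ones_like(wind) * 3.0).weighted(weights).mean(dim=["lat", "lon"])

print(simple_mean.dims, weighted_mean.dims)
```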

We can pass extra parameters to customize the graph; additional arguments are forwarded to the underlying matplotlib plot() function.

ds['air_temp'].sel(time=slice('2007', '2009')).mean(dim=['lat', 'lon']).plot.line('r-o', figsize=(10,5), markersize=1)
[<matplotlib.lines.Line2D at 0x7f70741566d0>]
../_images/a48ceb47a2d1c9bbb6b127b6dfd16eb37a8cac1dc0094fa13343e7cf4c1752a4.png

Histogram#

# plt.gca() returns the current Axes, creating one if none exists yet
ax = plt.gca()
ds['wind_dir'].plot.hist(ax=ax)
ax.set_xlabel('10 metre wind direction (degrees east)')
ax.set_ylabel('frequency')
ax.set_title('Daily average wind direction distribution over covered area (1979-2022)')
Text(0.5, 1.0, 'Daily average wind direction distribution over covered area (1979-2022)')
../_images/9a2db4e8bf956af0ec5e843d51da6c416a33f9e46914e7b47bc0ac7cef78fd9e.png
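
To our understanding, the histogram plot flattens the whole array and ignores missing values before binning. The counts behind the bars can be sketched with NumPy alone, which also makes it easy to choose the bin edges explicitly. The data below is random and only stands in for the wind direction variable; the 30-degree bins are an arbitrary choice for illustration.

```python
import numpy as np

# Synthetic (time, lat, lon) wind directions in degrees, with one missing cell.
rng = np.random.default_rng(42)
wind_dir = rng.uniform(0.0, 360.0, size=(10, 4, 5))
wind_dir[0, 0, 0] = np.nan

# Flatten to 1D and drop missing values before binning.
flat = wind_dir.ravel()
valid = flat[~np.isnan(flat)]

# Twelve 30-degree bins covering the full circle.
bin_edges = np.arange(0, 361, 30)
counts, edges = np.histogram(valid, bins=bin_edges)

print(counts)
```

Because the real variable holds every grid cell for every day, computing its histogram reads the full array, so it can take a while.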

Resampling#

With xarray#

We can resample (aggregate) the data in time. Resampling can take a long while, especially if the dataset is big and the resampling interval is short.

ds_resampled = ds['CHL_cmes-gapfree'].resample(time='1ME').mean()
ds_resampled
<xarray.DataArray 'CHL_cmes-gapfree' (time: 528, lat: 177, lon: 241)> Size: 90MB
dask.array<transpose, shape=(528, 177, 241), dtype=float32, chunksize=(6, 177, 241), chunktype=numpy.ndarray>
Coordinates:
  * lat      (lat) float32 708B 32.0 31.75 31.5 31.25 ... -11.5 -11.75 -12.0
  * lon      (lon) float32 964B 42.0 42.25 42.5 42.75 ... 101.5 101.8 102.0
  * time     (time) datetime64[ns] 4kB 1979-01-31 1979-02-28 ... 2022-12-31
Attributes: (12/101)
    Conventions:                     CF-1.8, ACDD-1.3
    DPM_reference:                   GC-UD-ACRI-PUG
    IODD_reference:                  GC-UD-ACRI-PUG
    acknowledgement:                 The Licensees will ensure that original ...
    ancillary_variables:             flags CHL_uncertainty
    citation:                        The Licensees will ensure that original ...
    ...                              ...
    type:                            surface
    units:                           milligram m-3
    valid_max:                       1000.0
    valid_min:                       0.0
    westernmost_longitude:           -180.0
    westernmost_valid_longitude:     -180.0
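
The monthly resampling can be tried on a small synthetic array, where the result is easy to check by hand. The sketch below assumes nothing about the real chlorophyll values: each day simply stores its own index, so the January mean must be 15. The `'ME'` (month end) alias is fairly recent in pandas; the fallback to the older `'M'` alias is an assumption about how older versions report an unknown alias.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Two years of daily data on a 2 x 3 grid; each day holds its own index.
time = pd.date_range("1998-01-01", "1999-12-31", freq="D")
values = np.arange(time.size, dtype="float64")
daily = xr.DataArray(
    values[:, None, None] * np.ones((1, 2, 3)),
    coords={"time": time, "lat": [0.25, 0.0], "lon": [42.0, 42.25, 42.5]},
    dims=("time", "lat", "lon"),
    name="CHL",
)

# Aggregate the daily steps into monthly means.
try:
    monthly = daily.resample(time="1ME").mean()
except ValueError:
    # Older pandas versions only know the 'M' alias for month end.
    monthly = daily.resample(time="1M").mean()

january_mean = float(monthly.isel(time=0, lat=0, lon=0))
print(daily.sizes["time"], monthly.sizes["time"], january_mean)
```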

We can see that after resampling, the time dimension shrinks from 16071 daily steps to 528 monthly steps.

# spatial mean of the monthly-resampled data
CHL_month = ds_resampled.mean(dim=['lat', 'lon']).hvplot(label='monthly resampling').options(color='red')
CHL_month
# spatial mean of the original daily data
CHL_day = ds['CHL_cmes-gapfree'].mean(dim=['lat', 'lon']).hvplot(label='daily resampling').options(color='blue')
CHL_day
# overlay the two curves on one set of axes
(CHL_day * CHL_month).options(title='Monthly vs Daily resampling of chlorophyll-a levels', xlabel='year')