This tutorial shows how to detrend data - a commonly used preprocessing step in climate data analysis - using CDAT.

The CDAT software was developed by LLNL. This tutorial was written by Charles Doutriaux. This work was performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under Contract DE-AC52-07NA27344.

In [1]:

```
from __future__ import print_function
import cdms2
import MV2
import genutil
import cdutil
import vcs
import os
import requests
```

The following NetCDF file contains Near-Surface Air Temperature data in degrees Kelvin (K) over North America. It is downloaded to the directory where this notebook is stored.

In [2]:

```
filename = 'tas_Amon_IPSL-CM5A-LR_1pctCO2_r1i1p1_185001-198912.nc'
if not os.path.exists(filename):
    r = requests.get("https://cdat.llnl.gov/cdat/sample_data/notebooks/{}".format(filename), stream=True)
    with open(filename, "wb") as f:
        for chunk in r.iter_content(chunk_size=1024):
            if chunk:  # filter out keep-alive chunks
                f.write(chunk)
```

The following two lines of code open the file just downloaded to your local computer (`filename`), extract the Near-Surface Air Temperature data (`tas`), and assign it to the variable `data`.

In [3]:

```
f = cdms2.open(filename)
data = f("tas")
```

Let's take a quick look at the data using the `.info()` method. Since this file is a subset of the globe, the data cover 13 latitude bands and 16 longitude values. There are 1680 different time values, measured as "days since 1850-01-01 00:00:00". The range of time values is the difference between the last value (51084.5) and the first value (15.5), which equals 51069 days. If we divide this range by the number of values in the dataset (51069/1680), we get an average duration of 30.398 days per data point, which tells us that we are working with monthly data.

In [4]:

```
data.info()
```
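As a quick sanity check, the monthly-spacing arithmetic above can be reproduced in plain Python (the numbers are taken from the `data.info()` output):

```python
# Time axis endpoints reported by data.info() ("days since 1850-01-01")
first, last = 15.5, 51084.5
n_values = 1680

# Average spacing between consecutive samples
avg_spacing = (last - first) / n_values
print(round(avg_spacing, 3))  # 30.398 days -> monthly data
```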

To demonstrate `genutil`'s capabilities and highlight a scientific point about the order of operations, let's mask (delete) some data and compute the time series.

We'll mask all data values greater than `data.max()-2`, i.e. all values within 2 degrees Kelvin (K) of the maximum temperature.

In [5]:

```
data_masked = MV2.masked_greater(data, data.max()-2)
```
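The `MV2.masked_greater` call behaves like its NumPy counterpart. A minimal sketch with `numpy.ma` on a synthetic temperature array (not the tutorial's dataset):

```python
import numpy as np

# Synthetic "temperature" values in K
temps = np.array([280.0, 290.0, 299.5, 300.0, 285.0])

# Mask every value within 2 K of the maximum (here, values > 298.0)
masked = np.ma.masked_greater(temps, temps.max() - 2)

print(masked.mask.sum())  # 2 values masked
print(masked.count())     # 3 values remain
```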

Next we'll average the masked data over its spatial dimensions to create a time series spanning all 1680 time values in the data file.

In [6]:

```
data_masked_ts = genutil.averager(data_masked, axis='xy')
```
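`genutil.averager` with `axis='xy'` collapses the longitude and latitude dimensions into a single value per time step. A rough `numpy.ma` sketch of the same idea on a synthetic masked 3-D field - note this is an unweighted mean, whereas `genutil.averager` applies grid-cell area weights by default:

```python
import numpy as np

# Synthetic masked field shaped (time, lat, lon)
rng = np.random.default_rng(0)
field = np.ma.masked_greater(rng.normal(280.0, 5.0, size=(4, 3, 2)), 288.0)

# Collapse the spatial dimensions into one axis and average;
# masked points are simply excluded from each time step's mean.
ts = field.reshape(field.shape[0], -1).mean(axis=1)
print(ts.shape)  # (4,)
```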

Let's take a look at the time series. Note the trend and the annual cycle.

In [7]:

```
x = vcs.init(bg=True, geometry=(800,600))
line = vcs.create1d()
line.markersize = .5
x.plot(data_masked_ts, line)
```

Out[7]:

First we will remove the annual cycle from these monthly data. Because the data are masked (i.e., some values are missing), the order of operations matters, as we will demonstrate.

In [8]:

```
data_departures = cdutil.times.ANNUALCYCLE.departures(data_masked) # extract the departures of the original data.
data_departures_ts = genutil.averager(data_departures, axis='xy') # create time series of the data departures.
data_ts_departures = cdutil.times.ANNUALCYCLE.departures(data_masked_ts) # extract the departures of the original data_ts time series data.
```
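`cdutil.times.ANNUALCYCLE.departures` subtracts each calendar month's climatological mean from the series. A minimal NumPy sketch of the same idea on a synthetic monthly series (whole years assumed):

```python
import numpy as np

# Synthetic monthly series: 10 years of a seasonal cycle plus a trend
months = np.arange(120)
series = 280.0 + 5.0 * np.sin(2 * np.pi * months / 12) + 0.01 * months

# Climatological mean for each of the 12 calendar months
by_month = series.reshape(-1, 12)      # (years, 12)
climatology = by_month.mean(axis=0)    # (12,)

# Departures: subtract each calendar month's climatology
departures = (by_month - climatology).ravel()
print(departures.shape)  # (120,)
```

By construction the departures average to zero within each calendar month, which removes the seasonal cycle while leaving the trend intact.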

Notice how much more clearly the trend shows up now!

In [9]:

```
x.clear()
x.plot(data_departures_ts)
```

Out[9]:

Please note the importance of the order of operations when missing data are present. The two time series differ slightly, especially in the later years, where data are masked because temperatures cross the masking threshold (within 2 K of the maximum temperature).

In [10]:

```
x.clear()
x.plot(data_departures_ts - data_ts_departures)
```

Out[10]:
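Why masking makes the order matter can be seen in a small NumPy example: averaging spatially and then taking departures is not the same as taking departures and then averaging when different grid points lose different months (synthetic data, not the tutorial's field):

```python
import numpy as np

# Two grid points over 24 months; mask one point during the last 6 months
data = np.ma.arange(48, dtype=float).reshape(24, 2)
data[18:, 1] = np.ma.masked

def departures(ts):
    # Subtract each calendar month's climatological mean (whole years assumed)
    shp = ts.shape
    by_month = ts.reshape(-1, 12, *shp[1:])
    return (by_month - by_month.mean(axis=0)).reshape(shp)

# Order A: spatial average first, then departures
a = departures(data.mean(axis=1))
# Order B: departures first, then spatial average
b = departures(data).mean(axis=1)

print(np.ma.allclose(a, b))  # False: the months with masking diverge
```

The two orders agree exactly for the fully sampled early months and diverge only where masking changes which values enter each average.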

First we will compute the trend over `time`. Note that the time axis can be at any index; `genutil` will determine it automatically.

After the computation we lose the time axis. Also note the units: since the time axis units are `days since XXX`, the coeff/trend is in `K/day`.

In [11]:

```
coeff, intercept = genutil.statistics.linearregression(data_departures, axis="t")
print("Shapes: coeff {}, intercept {}".format(coeff.shape, intercept.shape))
# Let's do the same for the time series
coeff_ts, intercept_ts = genutil.statistics.linearregression(data_departures_ts, axis="t")
```
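`genutil.statistics.linearregression` returns the slope and intercept of a least-squares fit along the chosen axis. A sketch of the same fit using `numpy.polyfit` on a synthetic daily series (so the slope comes out in K/day, matching the units note above):

```python
import numpy as np

# Synthetic series: a 0.002 K/day trend plus noise, sampled daily
days = np.arange(0.0, 3650.0)
rng = np.random.default_rng(1)
temps = 285.0 + 0.002 * days + rng.normal(0.0, 0.5, days.size)

# Degree-1 fit returns (slope, intercept); slope is in K/day
slope, intercept = np.polyfit(days, temps, 1)
print(round(slope, 4))  # close to 0.002
```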

Now let's actually remove the trend.

In [12]:

```
times = MV2.array(data_masked.getTime()[:])
times.setAxis(0, data_masked.getTime())
# since time is not necessarily on index 0 we need to use the grower function
# we use data as the first argument to ensure the same order
tmp, full_times = genutil.grower(data_departures, times)
# same for the coefficient
tmp, coeff = genutil.grower(data_masked, coeff)
# same for intercept
tmp, intercept = genutil.grower(data_masked, intercept)
data_departures_detrend = data_departures - full_times * coeff - intercept
data_departures_ts_detrend = data_departures_ts - times * coeff_ts - intercept_ts
x.clear()
x.plot(data_departures_ts_detrend)
```

Out[12]:
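Removing the trend amounts to subtracting the fitted line `slope * time + intercept` from the departures, as the cell above does. A self-contained NumPy check that the residual series really has (near-)zero trend:

```python
import numpy as np

days = np.arange(0.0, 3650.0)
rng = np.random.default_rng(2)
series = 0.002 * days + rng.normal(0.0, 0.5, days.size)

# Fit and subtract the linear trend
slope, intercept = np.polyfit(days, series, 1)
detrended = series - (slope * days + intercept)

# The detrended series should itself have ~zero slope,
# since least-squares residuals are orthogonal to the predictor
residual_slope, _ = np.polyfit(days, detrended, 1)
print(abs(residual_slope) < 1e-8)  # True
```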

Here, again, the order matters.

In [13]:

```
data_departures_detrend_ts = genutil.averager(data_departures_detrend, axis='xy')
x.clear()
x.plot(data_departures_detrend_ts - data_departures_ts_detrend)
```

Out[13]: