This tutorial shows how to use CDAT to remove the climatological mean annual cycle and detrend data - a common procedure applied prior to detailed climate data analysis of monthly anomalies.
The data considered in this notebook are monthly-mean surface air temperatures gridded over the United States and spanning the years 1850-1990. The original dataset is complete, but it is artificially modified in this notebook by "masking" some values, yielding an incomplete dataset with some values "missing" (as is often encountered in analysis of climate data). The analysis procedure involves three major steps:

1. Removing the climatological mean annual cycle from the data.
2. Spatially averaging over the region to produce a single time series.
3. Removing the long-term trend (detrending).
When there are missing values in the dataset (as in the sample calculations below), the final detrended time-series will depend on the order in which these steps are executed. Here we examine options for detrending the data, and we show that slightly different results are generated depending on the order in which operations are performed. More sophisticated treatments (not discussed here) involving appropriately weighting samples and collections of samples should be considered for datasets that only sparsely cover the time and space domains.
To download this Jupyter Notebook, right click on the link and choose "Download Linked File As..." or "Save Link as...". Remember where you saved the downloaded file, which should have an .ipynb extension. (You'll need to launch the Jupyter notebook or JupyterLab instance from the location where you store the notebook file.)
from __future__ import print_function
import cdms2
import MV2
import genutil
import cdutil
import vcs
import os
import requests
import numpy
The following CMIP5 NetCDF file contains Near-Surface Air Temperature data in kelvin (K) over North America. It is downloaded to the directory where this notebook is stored.
filename = 'tas_Amon_IPSL-CM5A-LR_1pctCO2_r1i1p1_185001-198912.nc'
if not os.path.exists(filename):
    r = requests.get("https://cdat.llnl.gov/cdat/sample_data/notebooks/{}".format(filename), stream=True)
    with open(filename, "wb") as f:
        for chunk in r.iter_content(chunk_size=1024):
            if chunk:  # filter out keep-alive chunks
                f.write(chunk)
The following two lines of code open the file just downloaded to your local computer (filename), extract data for the Near-Surface Air Temperature (tas), and assign it to the variable data.
f = cdms2.open(filename)
data = f("tas")
The following line of code uses the .info() method to allow us to take a quick look at the structure of the temperature data stored in the data variable.
There are 1680 different time values, measured as "days since 1850-01-01 00:00:00". The range of the time values is the difference between the last value (51084.5) and the first value (15.5), which equals 51069 days. If we divide this range (51069) by the number of intervals in the dataset (1680-1), we get 51069/(1680-1) = 30.416 days, which is the average interval between data points. This tells us that we are working with monthly data.
The data cover 13 latitude bands and 16 longitude values over the United States (latitudes ~25.6°N to ~48.3°N and longitudes -123.75 to -67.5 degrees east).
data.info()
*** Description of Slab tas ***
id: tas
shape: (1680, 13, 16)
filename:
missing_value: 1e+20
comments:
grid_name: <None>
grid_type: generic
time_statistic:
long_name: Near-Surface Air Temperature
units: K
tileIndex: None
original_name: t2m
associated_files: baseURL: http://cmip-pcmdi.llnl.gov/CMIP5/dataLocation gridspecFile: gridspec_atmos_fx_IPSL-CM5A-LR_1pctCO2_r0i0p0.nc areacella: areacella_fx_IPSL-CM5A-LR_1pctCO2_r0i0p0.nc
coordinates: height
standard_name: air_temperature
cell_methods: time: mean (interval: 30 minutes)
cell_measures: area: areacella
history: 2011-03-07T11:45:34Z altered by CMOR: Treated scalar dimension: 'height'. 2011-03-07T11:45:34Z altered by CMOR: replaced missing value flag (9.96921e+36) with standard missing value (1e+20). 2011-03-07T11:45:34Z altered by CMOR: Inverted axis: lat.
Grid has Python id 0x16496e150.
Gridtype: generic
Grid shape: (13, 16)
Order: yx
** Dimension 1 **
   id: time
   Designated a time axis.
   units: days since 1850-01-01 00:00:00
   Length: 1680
   First: 15.5
   Last: 51084.5
   Other axis attributes:
      axis: T
      calendar: noleap
      realtopology: linear
      long_name: time
      standard_name: time
   Python id: 0x164be8550
** Dimension 2 **
   id: lat
   Designated a latitude axis.
   units: degrees_north
   Length: 13
   First: 25.578947067260742
   Last: 48.31578826904297
   Other axis attributes:
      axis: Y
      realtopology: linear
      long_name: latitude
      standard_name: latitude
   Python id: 0x106aaa150
** Dimension 3 **
   id: lon
   Designated a longitude axis.
   units: degrees_east
   Length: 16
   First: -123.75
   Last: -67.5
   Other axis attributes:
      axis: X
      modulo: 360.0
      topology: circular
      realtopology: circular
      long_name: longitude
      standard_name: longitude
   Python id: 0x164986410
*** End of description for tas ***
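As a quick arithmetic check, the average spacing of the time axis can be computed directly from the variable. A minimal sketch (reusing the data variable read above):

time_axis = data.getTime()            # time axis, in "days since 1850-01-01 00:00:00"
t = time_axis[:]                      # the 1680 time values as a numpy array
print((t[-1] - t[0]) / (len(t) - 1))  # ~30.416 days per step, i.e., monthly data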
First, to get a feel for the data, let's spatially average the data over the entire domain to create a time series of the mean temperature for the region. In creating this time series, the averager function will take the temperature data for the entire region and spatially average it to yield a single temperature value as a function of time (i.e., the latitude and longitude dimensions are removed by this action, as shown with the .shape method).
data_ts = genutil.averager(data, axis='xy', weights=['weighted','weighted'], combinewts=1)
data_ts.shape
/Users/davis278/miniconda3/envs/cdat821/lib/python3.7/site-packages/genutil/averager.py:627: FutureWarning: elementwise comparison failed; returning scalar instead, but in the future will perform elementwise comparison
  if wgt in ['equal', 'unweighted']:
/Users/davis278/miniconda3/envs/cdat821/lib/python3.7/site-packages/genutil/averager.py:665: FutureWarning: elementwise comparison failed; returning scalar instead, but in the future will perform elementwise comparison
  if dim_wt in ['equal', 'unweighted']:
(1680,)
In the cell above, we have specified that averaging over the longitude and latitude dimensions (x and y) should be performed by weighting by grid-cell area. Note that the "combinewts" option should also be included for correct area-weighting.
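For intuition, here is a rough hand check of the first value of this weighted spatial mean. It approximates the grid-cell area weights by the cosine of latitude (the weights genutil derives from the grid bounds will differ slightly, so expect close but not exact agreement):

lats = data.getLatitude()[:]                               # the 13 latitude values
coslat = numpy.cos(numpy.radians(lats))                    # approximate area weights
wts = numpy.repeat(coslat[:, numpy.newaxis], 16, axis=1)   # same weight for every longitude
approx = numpy.ma.average(data[0].asma(), weights=wts)     # weighted mean for the first month
print(approx, data_ts[0])                                  # should be close to genutil's value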
In the next few lines of code, let's plot this time series.
x = vcs.init(bg=True, geometry=(1200,900))
line = vcs.create1d()
line.markersize = .5
x.plot(data_ts, line)
<vcs.displayplot.Dp at 0x157912830>
The figure above shows that the surface temperature averaged over the U.S. is characterized by a pronounced annual cycle and a long-term trend with some year-to-year variability. The goal of the remainder of this analysis is to remove both the climatological mean annual cycle and the long-term trend. This procedure leads to a filtered time series representing monthly temperature anomalies characterizing variability that cannot be explained by annual cycle forcing or any long-term changes in climate forcing.
But first, let's look at a numerical example, which illustrates that the order in which you perform operations can make a difference when a dataset is incomplete (i.e., when a dataset includes "masked" or "missing" data).
If data are missing from a dataset, the order of operations can matter. The following is a numerical example of averaging values in a few different ways.
Let's say we have the following dataset, which is a function of X and Y:
| | Y1 | Y2 | Y3 | Y4 |
|---|---|---|---|---|
| X1 | 3 | 4 | - | 7 |
| X2 | - | 5 | - | - |
| X3 | 1 | 2 | 5 | 5 |
| X4 | - | - | 6 | 4 |
Creating this dataset using a numpy array and using 999 where no values exist yields:
import numpy
array = numpy.array([[3,4,999,7],[999,5,999,999],[1,2,5,5],[999,999,6,4.]])
array
array([[  3.,   4., 999.,   7.],
       [999.,   5., 999., 999.],
       [  1.,   2.,   5.,   5.],
       [999., 999.,   6.,   4.]])
Masking the 999 values leads to the following:
masked = numpy.ma.masked_equal(array, 999.)
masked
masked_array(
  data=[[3.0, 4.0, --, 7.0],
        [--, 5.0, --, --],
        [1.0, 2.0, 5.0, 5.0],
        [--, --, 6.0, 4.0]],
  mask=[[False, False,  True, False],
        [ True, False,  True,  True],
        [False, False, False, False],
        [ True,  True, False, False]],
  fill_value=999.0)
If we average over Y first, then average over X, we get (4.667 + 5.000 + 3.250 + 5.000) / 4 = 4.479
| | Y1 | Y2 | Y3 | Y4 | Average |
|---|---|---|---|---|---|
| X1 | 3 | 4 | - | 7 | 4.667 |
| X2 | - | 5 | - | - | 5.000 |
| X3 | 1 | 2 | 5 | 5 | 3.250 |
| X4 | - | - | 6 | 4 | 5.000 |
| Average | | | | | 4.479 |
Verifying this with code gives:
a = numpy.ma.average(numpy.ma.average(masked, axis=-1))
print ("Y, X:", "{0:.3f}".format(a))
Y, X: 4.479
If we average over X first, then over Y, we get (2.000 + 3.667 + 5.500 + 5.333) / 4 = 4.125
| | Y1 | Y2 | Y3 | Y4 | Average |
|---|---|---|---|---|---|
| X1 | 3 | 4 | - | 7 | |
| X2 | - | 5 | - | - | |
| X3 | 1 | 2 | 5 | 5 | |
| X4 | - | - | 6 | 4 | |
| Average | 2.000 | 3.667 | 5.500 | 5.333 | 4.125 |
Verifying this with code yields:
b = numpy.ma.average(numpy.ma.average(masked, axis=0))
print ("X, Y:", "{0:.3f}".format(b))
X, Y: 4.125
Finally, if we average using all 16 cells at once (but, of course, exclude those with missing data), we get (3 + 4 + 7 + 5 + 1 + 2 + 5 + 5 + 6 + 4) / 10 = 4.200
c = numpy.ma.average(numpy.ma.average(masked))
print ("All:", "{0:.3f}".format(c))
All: 4.200
We get three different overall means (4.479, 4.125, or 4.200) depending on our processing choices (specifically the order of operations). (Note that with appropriate weighting, which is not done here, a consistent mean can be obtained, independent of the order of operations. For the first example (averaging over Y, then X), since the total number of non-missing values in the dataset is 10, the proper weighting would be: 4.667 * 3/10 + 5.000 * 1/10 + 3.250 * 4/10 + 5.000 * 2/10 = 4.200.)
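To verify that arithmetic in code, here is a minimal sketch that weights each row mean from the first example by the number of non-missing values it contains (reusing the masked array defined above):

row_means = numpy.ma.average(masked, axis=-1)               # [4.667, 5.000, 3.250, 5.000]
row_counts = masked.count(axis=-1)                          # [3, 1, 4, 2] non-missing values per row
weighted = numpy.ma.average(row_means, weights=row_counts)  # weight each row mean by its count
print("{0:.3f}".format(weighted))                           # 4.200, matching the all-cells average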
When additional processing steps are involved, ordering can affect results in more complex ways, as in the next example.
Before detrending a time series, it is often best to filter it by removing the climatological mean annual cycle; in fact, this may be a requirement for particular types of analyses. In the case of a time series that does not span a whole (integral) number of years, an accurate determination of the trend of interest may require removal of the climatological mean annual cycle. To see why, consider a temperature time series starting in January and ending in July a year and a half later (i.e., not an integral number of years). Over Northern Hemisphere continents with a large annual cycle, the ending temperature would be much higher than the beginning temperature simply due to the usual seasonal changes in temperature. A linear fit to the time series would result in a trend that does not reflect any real change in climate, just the particulars of the time period treated. To avoid such artificial trends, one should first remove the climatological annual cycle.
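To make this concrete, here is a small synthetic sketch (not drawn from the dataset used in this notebook): 19 monthly values of a pure annual cycle running from January of one year through July of the next year, with no underlying trend. A straight-line fit still yields a nonzero slope, driven entirely by the seasonal cycle.

months = numpy.arange(19)                                          # Jan of year 1 through Jul of year 2
annual_cycle = -10.0 * numpy.cos(2.0 * numpy.pi * months / 12.0)   # coldest in January, warmest in July
slope, intercept = numpy.polyfit(months, annual_cycle, 1)          # least-squares straight-line fit
print("{0:.3f} degrees per month".format(slope))                   # nonzero "trend" from the seasonal cycle alone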
The surface temperature data considered earlier have no missing values, but more often than not observational datasets are incomplete. For purposes of illustration, we therefore will first modify the original surface temperature dataset by designating certain values as "missing". Specifically, we'll treat as "missing" (i.e., mask) all data values that are within 7 degrees of the maximum temperature (i.e., greater than data.max()-7) and store the result in datamskd.
datamskd = MV2.masked_greater(data, data.max()-7)
We now consider the order in which to carry out the two operations needed to produce a regional-mean time series with the climatological mean annual cycle removed:

1. At each grid cell, remove the climatological mean annual cycle to produce monthly anomalies.
2. Spatially average over the region to produce a single regional-mean time series.
We examine the two ordering options, with steps performed in the order shown above and then in the reverse order.
Under the first ordering (Processing Option 1), at each location (grid cell) we will first remove the climatological mean annual cycle to produce monthly mean departures, or anomalies, relative to the climatological annual cycle. Then we will spatially average the anomalies to produce a time series of regional-mean anomalies.
In the next line of code, ANNUALCYCLE.departures calculates the average monthly temperature value for each of the 12 months of the year over the complete time period, for each location in the input data file, and determines the departure of each temperature (at each time and location) from this average (i.e., "climatological") monthly value.
For example, once an average January value for the entire dataset has been calculated, that climatological January value is subtracted from each individual January value in the dataset to yield a series of January departures, one for each year in the dataset. Since there are 1680 months in the dataset, there are 1680/12 = 140 years of data, and therefore 140 January departures. Since there are likewise 140 Februaries, 140 Marches, etc., there are 140 departures x 12 months = 1680 departures for each location in the dataset, as the .shape method shows (1680 departure values by 13 latitude bands by 16 longitude values).
datamskd_departures = cdutil.times.ANNUALCYCLE.departures(datamskd) # extract the departures of the masked data.
datamskd_departures.shape
(1680, 13, 16)
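For intuition, here is a minimal hand check of this calculation at a single, arbitrarily chosen grid cell (index [0, 0], a hypothetical choice for illustration): the January 1850 anomaly should equal the January 1850 value minus the mean of all 140 Januaries at that cell. cdutil handles the time bookkeeping internally; this sketch only mirrors the idea and should agree closely.

januaries = datamskd[0::12, 0, 0]           # every 12th value, i.e., all 140 Januaries at one cell
jan_climatology = MV2.average(januaries)    # climatological mean January temperature at that cell
print(datamskd[0, 0, 0] - jan_climatology)  # January 1850 anomaly, computed by hand
print(datamskd_departures[0, 0, 0])         # should agree closely with cdutil's value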
Now that we have a time series of monthly departures at each grid cell, we can spatially average them over the entire domain to obtain a single area-weighted time-series for the regional-mean monthly anomalies:
datamskd_departures_ts = genutil.averager(datamskd_departures, axis='xy', weights=['weighted','weighted'], combinewts=1) # create time series of the masked data departures.
Using the .shape method below verifies that the resulting spatially averaged data no longer have latitude and longitude information.
datamskd_departures_ts.shape
(1680,)
Let's plot the resulting time series of the departures (i.e. the result of removing the annual cycle before averaging spatially).
x = vcs.init(bg=True, geometry=(1200,900))
x.clear()
x.plot(datamskd_departures_ts)
<vcs.displayplot.Dp at 0x164a08bb0>
Notice how, with the annual cycle removed, it is easier to see the trend and which months are anomalously warm or cold (compared to the climatological mean temperature).
It should be noted that with the order of operations executed under this option, we cannot expect the mean of the anomalies for a given month of the year, taken over all years, to be exactly zero. In the case considered here, for example, the climatological monthly mean anomalies are:
# Format the monthly mean anomaly data to show 5 digits after the decimal point.
numpy.set_printoptions(precision=5, formatter={'float': '{: 0.5f}'.format})
print(cdutil.times.ANNUALCYCLE.climatology(datamskd_departures_ts))
[ 0.00000 -0.00000 0.00000 -0.00000 0.01006 0.05874 0.09536 0.09576 0.01505 -0.00000 0.00000 0.00000]
Although for any individual cell the anomalies do sum to zero for each month of the year, when the anomalies are spatially averaged and there are missing values this cannot be guaranteed. This is because, when there are missing values, the result of averaging data sequentially over two dimensions can depend on the order in which the averaging is done, as demonstrated earlier. This means that under Processing Option 1, additional care must be taken in calculating the anomalies. Although applying appropriate weighting during the averaging procedures (over time and over space) can remedy the problem, an easier solution is to remove the mean value (over all years for a given month of the year) from that month's temperature anomalies (datamskd_departures_ts) to obtain anomalies (datamskd_departures_ts_corrected) with climatological monthly means of zero:
datamskd_departures_ts_corrected = cdutil.times.ANNUALCYCLE.departures(datamskd_departures_ts)
datamskd_departures_ts_corrected.shape
(1680,)
print(cdutil.times.ANNUALCYCLE.climatology(datamskd_departures_ts_corrected))
[ 0.00000 0.00000 -0.00000 0.00000 -0.00000 0.00000 0.00000 0.00000 -0.00000 0.00000 0.00000 -0.00000]
The time series mean should now be zero (within the limits of machine precision):
print(numpy.mean(datamskd_departures_ts_corrected))
#print("{0:.5f}".format(numpy.mean(datamskd_departures_ts_corrected)))
-1.3534147347811433e-16
Numpy calculates an unweighted mean by default, whereas the cdutil averager function calculates a weighted mean by default.
In this case, both the weighted and unweighted means are essentially zero (within the limits of machine precision, as mentioned above), but the following lines of code confirm that numpy calculates an unweighted mean (which is the same as cdutil.averager with the weights set to "unweighted"):
print(cdutil.averager(datamskd_departures_ts_corrected, weights='unweighted'))
-1.3534147347811433e-16
and cdutil calculates a weighted mean by default:
print(cdutil.averager(datamskd_departures_ts_corrected, weights='weighted'))
-1.4238664607012576e-16
print(cdutil.averager(datamskd_departures_ts_corrected))
-1.4238664607012576e-16
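The small difference between the two values reflects cdutil's default time weighting, which for a monthly time axis with bounds is proportional to the length of each month. A rough hand check, assuming the time axis of the corrected series carries monthly bounds (an assumption about this particular dataset):

bounds = datamskd_departures_ts_corrected.getTime().getBounds()
month_lengths = bounds[:, 1] - bounds[:, 0]    # days per month in the noleap calendar
weighted_by_hand = numpy.ma.average(datamskd_departures_ts_corrected.asma(), weights=month_lengths)
print(weighted_by_hand)                        # should be essentially the same near-zero value as above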
In the plot below, the mean shown in the upper left corner is cdutil's weighted mean.
x.clear()
x.plot(datamskd_departures_ts_corrected)
<vcs.displayplot.Dp at 0x164a08ad0>
Now let's reverse the order of performing the operations (Processing Option 2): we first calculate the spatially-averaged time series and then remove the annual cycle.
We calculate the time series characterizing area-mean temperature for the masked dataset by spatially averaging the temperature values over all latitude and longitude locations to give a single regional-mean temperature time series. (Again, the .shape method shows we are looking at 1680 values with no latitude or longitude dimensions, as expected.)
datamskd_ts = genutil.averager(datamskd, axis='xy', weights=['weighted','weighted'], combinewts=1)
datamskd_ts.shape
(1680,)
Let's take a look at this time series. Note that the trend and the annual cycle are similar but not identical to what we saw with the unmasked data. In particular, the maximum value of the spatial mean (considering all times) is lower than before (300.556 K now compared to 301.658 K for the unmasked data). This is because temperatures at individual grid cells that are within 7 K of the maximum temperature have been eliminated from consideration ("masked").
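A quick check of the two maxima quoted above:

print(data_ts.max())       # ~301.658 K, regional-mean series from the unmasked data
print(datamskd_ts.max())   # ~300.556 K, regional-mean series from the masked data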
x = vcs.init(bg=True, geometry=(1200,900))
line = vcs.create1d()
line.markersize = .5
x.plot(datamskd_ts, line)
<vcs.displayplot.Dp at 0x168d56130>
Next we'll remove the annual cycle using the ANNUALCYCLE.departures method, as we did in executing Option 1 above. Again, the method calculates an average temperature value for each month of the year and determines the departure of the temperature at each time value from the corresponding average monthly temperature, effectively removing the annual cycle from the data.
datamskd_ts_departures = cdutil.times.ANNUALCYCLE.departures(datamskd_ts)
datamskd_ts_departures.shape
(1680,)
print(cdutil.times.ANNUALCYCLE.climatology(datamskd_ts_departures)) # All 12 annual cycle values should now be 0
[ 0.00000 0.00000 0.00000 0.00000 0.00000 -0.00000 -0.00000 0.00000 -0.00000 -0.00000 -0.00000 0.00000]
The time series mean should be zero (within the limits of machine precision):
print(numpy.mean(datamskd_ts_departures)) # unweighted mean
#print("{0:.5f}".format(numpy.mean(datamskd_ts_departures)))
4.0602442043434294e-15
print(cdutil.averager(datamskd_ts_departures)) # weighted mean
#print("{0:.5f}".format(cdutil.averager(datamskd_ts_departures)))
4.2715993821037725e-15
x.clear()
x.plot(datamskd_ts_departures)