Using NetCDF4 Compression with CDMS

CDMS2 writes out data using the NetCDF library.

NetCDF4 allows for file compression. A good blog post about NetCDF4 and compression explains:

"The netCDF-4 libraries inherit the capability for data compression from the HDF5 storage layer underneath the netCDF-4 interface. Linking a program that uses netCDF to a netCDF-4 library allows the program to read compressed data without changing a single line of the program source code."

and

"Also, we're only dealing with lossless compression"

This Notebook shows how to control NetCDF4 compression (shuffling/deflating) capabilities via cdms2.

© The CDAT software was developed by LLNL. This tutorial was written by Charles Doutriaux. This work was performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under Contract DE-AC52-07NA27344.

Preparing The Notebook

The easiest way to look at the contents of a NetCDF file is to use ncdump. The NCINFO class below lets us call it in a single line from Python, for Notebook clarity.

We also prepare some random data.

In [1]:
from __future__ import print_function
import subprocess
import shlex
import numpy
import os
import io
import time

# Get file size
def size_it(filename):
    statinfo = os.stat(filename)
    return statinfo.st_size

# Write and return time
def dump(data,filename="example.nc"):
    start = time.time()
    f = cdms2.open(filename,"w")
    f.write(data,id="data")
    f.close()
    return time.time()-start,size_it(filename)

class HTML(object):
    def __init__(self,html):
        self.html = html
    def _repr_html_(self):
        return self.html


# Nice html output for ncdump
class NCINFO(object):
    def __init__(self, filename, variable=None, options=""):
        self.filename = filename
        self.variable = variable
        self.options = options
    def _repr_html_(self):
        out = self.nc_info()
        lines = []
        for l in out.split("\n"):
            for kw in ["chunk","deflate","classic","netcdf4","netcdf-4"]:
                if l.lower().find(kw)>-1:
                    l = "<b>{0}</b>".format(l)
            lines.append(l.replace("\t","&emsp;&emsp;"))
        return "{0}".format("<br>".join(lines))
    def nc_info(self):
        """Calls ncdump on the file.
        Can pass a variable or optional ncdump arguments.
        Default call: `ncdump -hs filename`"""
        ncdumpOptions = "-hs {options}".format(options=self.options)
        if self.variable is not None:
            # Leading space keeps -v separate from any preceding options
            ncdumpOptions += " -v {variable}".format(variable=self.variable)
        cmd = "ncdump {options} {file}".format(options=ncdumpOptions, file=self.filename)
        # universal_newlines=True returns text rather than bytes on Python 3
        p = subprocess.Popen(shlex.split(cmd), stdout=subprocess.PIPE,
                             stderr=subprocess.PIPE, universal_newlines=True)
        o, e = p.communicate()
        lines = ["Runnning {0}".format(cmd),
                 "-------",
                 o,
                 "-------",
                 "File Size {0} bytes".format(size_it(self.filename))]
        return "\n".join(lines) + "\n"
        
import requests
def download(fnm):
    r = requests.get("https://uvcdat.llnl.gov/cdat/sample_data/%s" % fnm,stream=True)
    with open(fnm,"wb") as f:
        for chunk in r.iter_content(chunk_size=1024):
            if chunk:  # filter out keep-alive new chunks
                f.write(chunk)

download("clt.nc")
data = numpy.random.random((120,180,360))
# Random data do not compress well at all, so switch to 0/1 values
# (numpy.float was removed from recent NumPy releases; float64 is equivalent)
data = numpy.greater(data,.5).astype(numpy.float64)
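
The remark that random data do not compress well can be checked without cdms2. The sketch below is only an illustration, assuming that zlib's deflate behaves similarly to the deflate filter used by NetCDF4/HDF5; the helper name deflate_ratio and the sample size are made up for this example.

```python
import random
import struct
import zlib

def deflate_ratio(values, level=1):
    """Pack floats as 8-byte doubles and return compressed/original size."""
    raw = struct.pack("<%dd" % len(values), *values)
    return len(zlib.compress(raw, level)) / float(len(raw))

rng = random.Random(0)  # fixed seed so the run is repeatable
noisy = [rng.random() for _ in range(50000)]
# Same thresholding idea as numpy.greater(data, .5) above
binary = [1.0 if v > 0.5 else 0.0 for v in noisy]

noisy_ratio = deflate_ratio(noisy)
binary_ratio = deflate_ratio(binary)
print("random doubles: %.2f, 0/1 doubles: %.2f" % (noisy_ratio, binary_ratio))
```

The 0/1 series should come out far smaller than the raw random series, which is why the notebook uses it for the size comparisons.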

Default Settings

By default, cdms2 writes data in NetCDF4 classic format with no shuffling and a deflate level of 1.

To query the NetCDF settings used when writing data, use the following commands:

In [2]:
import cdms2
print("NetCDF4? ",cdms2.getNetcdf4Flag())
print("NetCDF Classic?",cdms2.getNetcdfClassicFlag())
print("NetCDF4 Shuffling",cdms2.getNetcdfShuffleFlag())
print("NetCDF4 Deflate?",cdms2.getNetcdfDeflateFlag())
print("NetCDF4 Deflate Level?",cdms2.getNetcdfDeflateLevelFlag())
NetCDF4?  1
NetCDF Classic? 1
NetCDF4 Shuffling 0
NetCDF4 Deflate? 1
NetCDF4 Deflate Level? 1

These values are read at the time you open the file for writing.

In the rendered notebook, the ncdump lines that mention chunking, deflate, or the NetCDF format are shown in bold.

In [3]:
dump(data)
NCINFO("example.nc")
/export/reshel3/anaconda52/envs/cdms2/lib/python2.7/site-packages/cdms2/dataset.py:2173: Warning: Files are written with compression and no shuffling
You can query different values of compression using the functions:
cdms2.getNetcdfShuffleFlag() returning 1 if shuffling is enabled, 0 otherwise
cdms2.getNetcdfDeflateFlag() returning 1 if deflate is used, 0 otherwise
cdms2.getNetcdfDeflateLevelFlag() returning the level of compression for the deflate method

If you want to turn that off or set different values of compression use the functions:
value = 0
cdms2.setNetcdfShuffleFlag(value) ## where value is either 0 or 1
cdms2.setNetcdfDeflateFlag(value) ## where value is either 0 or 1
cdms2.setNetcdfDeflateLevelFlag(value) ## where value is a integer between 0 and 9 included

To produce NetCDF3 Classic files use:
cdms2.useNetCDF3()
To Force NetCDF4 output with classic format and no compressing use:
cdms2.setNetcdf4Flag(1)
NetCDF4 file with no shuffling or deflate and noclassic will be open for parallel i/o
  "for parallel i/o", Warning)
Out[3]:
Runnning ncdump -hs example.nc
-------
netcdf example {
dimensions:
  axis_0 = 120 ;
  axis_1 = 180 ;
  axis_2 = 360 ;
variables:
  double axis_0(axis_0) ;
    axis_0:_Storage = "chunked" ;
    axis_0:_ChunkSizes = 120 ;
    axis_0:_DeflateLevel = 1 ;
    axis_0:_Endianness = "little" ;
  double axis_1(axis_1) ;
    axis_1:_Storage = "chunked" ;
    axis_1:_ChunkSizes = 180 ;
    axis_1:_DeflateLevel = 1 ;
    axis_1:_Endianness = "little" ;
  double axis_2(axis_2) ;
    axis_2:_Storage = "chunked" ;
    axis_2:_ChunkSizes = 360 ;
    axis_2:_DeflateLevel = 1 ;
    axis_2:_Endianness = "little" ;
  double data(axis_0, axis_1, axis_2) ;
    data :missing_value = 1.e+20 ;
    data :_FillValue = 1.e+20 ;
    data:_Storage = "chunked" ;
    data:_ChunkSizes = 40, 60, 120 ;
    data:_DeflateLevel = 1 ;
    data:_Endianness = "little" ;

// global attributes:
    :Conventions = "CF-1.0" ;
    :_NCProperties = "version=2,netcdf=4.6.2,hdf5=1.10.4" ;
    :_SuperblockVersion = 0 ;
    :_IsNetcdf4 = 1 ;
    :_Format = "netCDF-4 classic model" ;
}

-------
File Size 4144654 bytes
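
To check the compression settings programmatically rather than by eye, the special attributes printed by `ncdump -hs` can be pulled out with a regular expression. This is a minimal sketch that works on an excerpt of the header above; the function name special_attributes and the pattern are assumptions made for this example, not part of cdms2 or ncdump.

```python
import re

# Excerpt of the `ncdump -hs` header shown above.
HEADER = '''
  double data(axis_0, axis_1, axis_2) ;
    data :missing_value = 1.e+20 ;
    data :_FillValue = 1.e+20 ;
    data:_Storage = "chunked" ;
    data:_ChunkSizes = 40, 60, 120 ;
    data:_DeflateLevel = 1 ;
    data:_Endianness = "little" ;
'''

# Matches lines of the form `<variable>:_<Name> = <value> ;`
ATTRIBUTE = re.compile(r'^\s*(\w+)\s*:(_\w+)\s*=\s*(.*?)\s*;\s*$')

def special_attributes(header_text):
    """Return {variable: {attribute: value}} for underscore-prefixed attributes."""
    found = {}
    for line in header_text.splitlines():
        match = ATTRIBUTE.match(line)
        if match:
            variable, name, value = match.groups()
            found.setdefault(variable, {})[name] = value.strip('"')
    return found

attrs = special_attributes(HEADER)
print(attrs["data"].get("_DeflateLevel"))
```

In practice the text returned by NCINFO.nc_info() could be passed in place of HEADER.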

Turning Off Compression

We can turn compression off by running:

In [4]:
value = 0
cdms2.setNetcdfShuffleFlag(value) ## where value is either 0 or 1
cdms2.setNetcdfDeflateFlag(value) ## where value is either 0 or 1
cdms2.setNetcdfDeflateLevelFlag(value) ## where value is an integer between 0 and 9 inclusive
dump(data)
NCINFO("example.nc")
Out[4]:
Runnning ncdump -hs example.nc
-------
netcdf example {
dimensions:
  axis_0 = 120 ;
  axis_1 = 180 ;
  axis_2 = 360 ;
variables:
  double axis_0(axis_0) ;
    axis_0:_Storage = "contiguous" ;
    axis_0:_Endianness = "little" ;
  double axis_1(axis_1) ;
    axis_1:_Storage = "contiguous" ;
    axis_1:_Endianness = "little" ;
  double axis_2(axis_2) ;
    axis_2:_Storage = "contiguous" ;
    axis_2:_Endianness = "little" ;
  double data(axis_0, axis_1, axis_2) ;
    data :missing_value = 1.e+20 ;
    data :_FillValue = 1.e+20 ;
    data:_Storage = "contiguous" ;
    data:_Endianness = "little" ;

// global attributes:
    :Conventions = "CF-1.0" ;
    :_NCProperties = "version=2,netcdf=4.6.2,hdf5=1.10.4" ;
    :_SuperblockVersion = 0 ;
    :_IsNetcdf4 = 1 ;
    :_Format = "netCDF-4 classic model" ;
}

-------
File Size 62222745 bytes
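
Because these flags stay in effect for every file opened afterwards, it can be handy to change them only for the duration of one write. The context manager below is a hypothetical helper, not part of cdms2: it assumes the module exposes matching getNetcdf<Name>Flag / setNetcdf<Name>Flag pairs (as the functions used in this notebook suggest) and that each setter only changes its own flag.

```python
from contextlib import contextmanager

@contextmanager
def netcdf_flags(module, flags):
    """Temporarily apply write flags, e.g. {"Shuffle": 1, "DeflateLevel": 5}.

    `module` is expected to provide getNetcdf<Name>Flag and
    setNetcdf<Name>Flag functions for every name in `flags`.
    """
    # Remember the current values so they can be restored afterwards
    saved = dict((name, getattr(module, "getNetcdf%sFlag" % name)())
                 for name in flags)
    try:
        for name, value in flags.items():
            getattr(module, "setNetcdf%sFlag" % name)(value)
        yield module
    finally:
        for name, value in saved.items():
            getattr(module, "setNetcdf%sFlag" % name)(value)

# Intended use (not run here, requires cdms2 and the dump() helper above):
# with netcdf_flags(cdms2, {"Shuffle": 0, "Deflate": 0, "DeflateLevel": 0}):
#     dump(data)
```

The flags are restored even if the write raises, since the restore step sits in the finally clause.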

Pure NetCDF3

NetCDF3 output is enabled by setting all of these options, including the NetCDF4 flag, to 0 (as the warning above shows). One can also use the single command cdms2.useNetcdf3():

In [5]:
cdms2.useNetcdf3()
# For versions earlier than 2.12.2017.10.25, set each flag to 0 instead:
value = 0
cdms2.setNetcdfShuffleFlag(value) ## where value is either 0 or 1
cdms2.setNetcdfDeflateFlag(value) ## where value is either 0 or 1
cdms2.setNetcdfDeflateLevelFlag(value) ## where value is an integer between 0 and 9 inclusive
cdms2.setNetcdf4Flag(0)
dump(data)
NCINFO("example.nc")
Out[5]:
Runnning ncdump -hs example.nc
-------
netcdf example {
dimensions:
  axis_0 = 120 ;
  axis_1 = 180 ;
  axis_2 = 360 ;
variables:
  double axis_0(axis_0) ;
  double axis_1(axis_1) ;
  double axis_2(axis_2) ;
  double data(axis_0, axis_1, axis_2) ;
    data :missing_value = 1.e+20 ;
    data :_FillValue = 1.e+20 ;

// global attributes:
    :Conventions = "CF-1.0" ;
    :_Format = "64-bit offset" ;
}

-------
File Size 62213640 bytes

NetCDF4 Non-Classic

We can also turn off the classic option for NetCDF4:

In [6]:
cdms2.setNetcdf4Flag(1)
cdms2.setNetcdfClassicFlag(0)
dump(data)
NCINFO("example.nc")
Out[6]:
Runnning ncdump -hs example.nc
-------
netcdf example {
dimensions:
  axis_0 = 120 ;
  axis_1 = 180 ;
  axis_2 = 360 ;
variables:
  double axis_0(axis_0) ;
    axis_0:_Storage = "contiguous" ;
    axis_0:_Endianness = "little" ;
  double axis_1(axis_1) ;
    axis_1:_Storage = "contiguous" ;
    axis_1:_Endianness = "little" ;
  double axis_2(axis_2) ;
    axis_2:_Storage = "contiguous" ;
    axis_2:_Endianness = "little" ;
  double data(axis_0, axis_1, axis_2) ;
    data :missing_value = 1.e+20 ;
    data :_FillValue = 1.e+20 ;
    data:_Storage = "contiguous" ;
    data:_Endianness = "little" ;

// global attributes:
    :Conventions = "CF-1.0" ;
    :_NCProperties = "version=2,netcdf=4.6.2,hdf5=1.10.4" ;
    :_SuperblockVersion = 0 ;
    :_IsNetcdf4 = 1 ;
    :_Format = "netCDF-4" ;
}

-------
File Size 62222611 bytes

Using Shuffling

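Shuffling can be turned on or off. Here it is turned on while deflate is still off, so the variables become chunked but the file does not get smaller: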

In [7]:
cdms2.setNetcdf4Flag(1)
cdms2.setNetcdfClassicFlag(0)
cdms2.setNetcdfShuffleFlag(1)
dump(data)
NCINFO("example.nc")
Out[7]:
Runnning ncdump -hs example.nc
-------
netcdf example {
dimensions:
  axis_0 = 120 ;
  axis_1 = 180 ;
  axis_2 = 360 ;
variables:
  double axis_0(axis_0) ;
    axis_0:_Storage = "chunked" ;
    axis_0:_ChunkSizes = 120 ;
    axis_0:_Shuffle = "true" ;
    axis_0:_Endianness = "little" ;
  double axis_1(axis_1) ;
    axis_1:_Storage = "chunked" ;
    axis_1:_ChunkSizes = 180 ;
    axis_1:_Shuffle = "true" ;
    axis_1:_Endianness = "little" ;
  double axis_2(axis_2) ;
    axis_2:_Storage = "chunked" ;
    axis_2:_ChunkSizes = 360 ;
    axis_2:_Shuffle = "true" ;
    axis_2:_Endianness = "little" ;
  double data(axis_0, axis_1, axis_2) ;
    data :missing_value = 1.e+20 ;
    data :_FillValue = 1.e+20 ;
    data:_Storage = "chunked" ;
    data:_ChunkSizes = 40, 60, 120 ;
    data:_Shuffle = "true" ;
    data:_Endianness = "little" ;

// global attributes:
    :Conventions = "CF-1.0" ;
    :_NCProperties = "version=2,netcdf=4.6.2,hdf5=1.10.4" ;
    :_SuperblockVersion = 0 ;
    :_IsNetcdf4 = 1 ;
    :_Format = "netCDF-4" ;
}

-------
File Size 62231677 bytes
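
Shuffling on its own does not shrink the file; it only reorders bytes so that a later deflate pass may find longer runs. The sketch below illustrates the idea of a byte shuffle with the standard library, assuming it mirrors what the HDF5 shuffle filter does (byte 0 of every value first, then byte 1, and so on). The function names and the made-up smooth series are for illustration only.

```python
import struct
import zlib

def shuffle(raw, itemsize):
    """Group byte 0 of every item, then byte 1, and so on."""
    return b"".join(raw[i::itemsize] for i in range(itemsize))

def unshuffle(shuffled, itemsize):
    """Invert shuffle(): interleave the byte planes back into items."""
    count = len(shuffled) // itemsize
    planes = [shuffled[i * count:(i + 1) * count] for i in range(itemsize)]
    return b"".join(bytes(item) for item in zip(*planes))

values = [280.0 + 0.01 * i for i in range(20000)]  # smooth, made-up series
raw = struct.pack("<%dd" % len(values), *values)
shuffled = shuffle(raw, 8)

print("deflate only:     ", len(zlib.compress(raw, 1)))
print("shuffle + deflate:", len(zlib.compress(shuffled, 1)))
```

Whether shuffling helps depends on the data, so it is worth comparing both settings on a representative file.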

Controlling Deflate Level

We can choose the deflate level; higher levels can take longer to write.

In [8]:
cdms2.setNetcdfShuffleFlag(0)
cdms2.setNetcdfDeflateFlag(1)
cdms2.setNetcdfDeflateLevelFlag(5)
dump(data)
NCINFO("example.nc")
Out[8]:
Runnning ncdump -hs example.nc
-------
netcdf example {
dimensions:
  axis_0 = 120 ;
  axis_1 = 180 ;
  axis_2 = 360 ;
variables:
  double axis_0(axis_0) ;
    axis_0:_Storage = "chunked" ;
    axis_0:_ChunkSizes = 120 ;
    axis_0:_DeflateLevel = 5 ;
    axis_0:_Endianness = "little" ;
  double axis_1(axis_1) ;
    axis_1:_Storage = "chunked" ;
    axis_1:_ChunkSizes = 180 ;
    axis_1:_DeflateLevel = 5 ;
    axis_1:_Endianness = "little" ;
  double axis_2(axis_2) ;
    axis_2:_Storage = "chunked" ;
    axis_2:_ChunkSizes = 360 ;
    axis_2:_DeflateLevel = 5 ;
    axis_2:_Endianness = "little" ;
  double data(axis_0, axis_1, axis_2) ;
    data :missing_value = 1.e+20 ;
    data :_FillValue = 1.e+20 ;
    data:_Storage = "chunked" ;
    data:_ChunkSizes = 40, 60, 120 ;
    data:_DeflateLevel = 5 ;
    data:_Endianness = "little" ;

// global attributes:
    :Conventions = "CF-1.0" ;
    :_NCProperties = "version=2,netcdf=4.6.2,hdf5=1.10.4" ;
    :_SuperblockVersion = 0 ;
    :_IsNetcdf4 = 1 ;
    :_Format = "netCDF-4" ;
}

-------
File Size 2772789 bytes
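
The trade-off between deflate level, time, and size can be explored outside cdms2 as well. This is a rough sketch using zlib directly, assuming its levels behave like the NetCDF4 deflate levels; the payload is made up and timings will vary from machine to machine.

```python
import time
import zlib

# Made-up payload: a repetitive but not trivial byte pattern.
payload = b"".join(("%d,%d;" % (i % 97, i % 13)).encode("ascii")
                   for i in range(200000))

results = {}
for level in (0, 1, 5, 9):
    start = time.time()
    packed = zlib.compress(payload, level)
    results[level] = (time.time() - start, len(packed))
    assert zlib.decompress(packed) == payload  # lossless at every level
    print("level %d: %.4f s, %d bytes" % (level, results[level][0], results[level][1]))
```

Level 0 stores the data without compressing it, which corresponds to the "deflate off" case earlier in this notebook.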

Summarizing All Options

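Let's try with a real-life example: the clt variable from the clt.nc sample file downloaded earlier. Each cell of the resulting table shows the write time in seconds followed by the file size in bytes.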

In [9]:
f=cdms2.open("clt.nc")
clt = f("clt")

html = "<table border='2'><tr><th>Deflate Level</th><th>NC3</th><th>NC4 Classic no shuffle</th><th>NC4 Classic shuffled</th><th>NC4 no shuffle</th><th>NC4 shuffled</th></tr>"

def addCell():
    t,s = dump(clt)
    return "<td align='center'>{:.2f}/{:d}</td>".format(t,s)

def nc4s():
    out = ""
    for classic in [1,0]:
        cdms2.setNetcdfClassicFlag(classic)
        for shuffle in [0,1]:
            cdms2.setNetcdfShuffleFlag(shuffle)
            out+=addCell()
    out+="</tr>"
    return out

# NetCDF3
html+="<tr><td align='center'>0</td>"
cdms2.useNetcdf3()
cdms2.setNetcdf4Flag(0)
html+=addCell()
cdms2.setNetcdf4Flag(1)
html+=nc4s()
cdms2.setNetcdfDeflateFlag(1)
for i in range(1,10):
    cdms2.setNetcdfDeflateLevelFlag(i)
    html += "<tr><td align='center'>{0}</td><td align='center'>N/A</td>".format(i)
    html += nc4s()
html+="<caption>Time To Write NetCDF File and size for various NC4 settings</caption></table>"
HTML(html)
Out[9]:
Deflate Level  NC3           NC4 Classic no shuffle  NC4 Classic shuffled  NC4 no shuffle  NC4 shuffled
0              0.06/1625482  0.01/1625323            0.01/1633052          0.01/1625482    0.01/1633197
1              N/A           0.12/1201105            0.09/1227739          0.12/1201250    0.09/1227943
2              N/A           0.12/1200471            0.10/1223895          0.12/1200616    0.09/1224099
3              N/A           0.13/1200371            0.09/1220275          0.12/1200516    0.09/1220479
4              N/A           0.14/1206352            0.10/1218159          0.14/1206497    0.10/1218363
5              N/A           0.15/1206092            0.11/1215330          0.14/1206237    0.12/1215534
6              N/A           0.15/1205961            0.13/1213353          0.15/1206106    0.12/1213557
7              N/A           0.14/1205905            0.12/1212713          0.14/1206050    0.12/1212917
8              N/A           0.14/1205888            0.16/1211808          0.14/1206033    0.16/1212012
9              N/A           0.14/1205888            0.19/1211449          0.14/1206033    0.19/1211653

Time to write the NetCDF file (seconds) / file size (bytes) for various NC4 settings
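
To read the table more easily, the sizes can be compared against the NetCDF3 baseline. The snippet below is a small worked example that copies the "NC4 Classic no shuffle" sizes and the NC3 size from the table above; the variable names are made up for this illustration.

```python
# Sizes in bytes copied from the table above.
nc3_size = 1625482
classic_no_shuffle = {
    1: 1201105, 2: 1200471, 3: 1200371, 4: 1206352, 5: 1206092,
    6: 1205961, 7: 1205905, 8: 1205888, 9: 1205888,
}

# Deflate level that produced the smallest file in this column
best_level = min(classic_no_shuffle, key=classic_no_shuffle.get)
saving = 1.0 - classic_no_shuffle[best_level] / float(nc3_size)
print("smallest file at deflate level %d, %.1f%% smaller than NC3"
      % (best_level, 100 * saving))
```

For this file the smallest size in that column is reached at level 3 rather than level 9, so a higher deflate level does not necessarily give a smaller file.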