#acl AdminGroup:read,write,delete,revert EditorGroup:read,write,delete,revert All:read
#format wiki
#language en
#pragma section-numbers off

You can read and write arbitrary binary data formats with the [[http://docs.python.org/library/struct.html|struct]] module. For scientific data sets, however, it is usually a better choice to use a standard format such as NetCDF. NetCDF introduces an abstraction layer which allows machine-independent storage and retrieval of data. In addition, the data can carry their documentation (metadata) in the same file.

= Modules for scientific data formats =

||Format \ module ||[[http://docs.scipy.org/doc/scipy/reference/io.html|scipy.io]] ||[[http://pysclint.sourceforge.net/pyhdf/|pyhdf]] ||[[http://www.gdal.org/|GDAL]] ||[[http://www.pyngl.ucar.edu/|PyNIO]] ||[[http://www.pytables.org/moin|PyTables]] ||
||NetCDF ||read/write ||read ||read/write ||read/write || ||
||HDF4 || ||read/write ||read/write ||read/write || ||
||HDF5 || || ||read/write || ||read/write ||
||GRIB || || ||read ||read/write || ||

{{{scipy.io}}} offers a convenient interface for reading NetCDF, the most common scientific data format. pyhdf is a simple interface to HDF4; if you do not need other formats it may be a good choice. GDAL carries some overhead because it maps all formats onto a single abstract data model, but it supports the largest number of satellite data formats. PyNIO is an excellent choice for meteorological and climatological datasets, and since PyNGL/PyNIO recently became open source it can be highly recommended. PyTables is an option for HDF5 if GDAL is not sufficient.

== NetCDF ==

 * [[http://www.unidata.ucar.edu/software/netcdf/docs/netcdf.html|The NetCDF Users' Guide]]
 * [[http://cf-pcmdi.llnl.gov/documents/cf-conventions|NetCDF Climate and Forecast (CF) Metadata Conventions]]

You can use the {{{scipy.io}}} routines to read NetCDF data:

 * Open a file with {{{fid=scipy.io.netcdf_file(filename,'r')}}}
 * Print the variable names with {{{fid.variables.keys()}}}
 * Read (copy) a variable into an array with {{{a=fid.variables[NAME][:].copy()}}}
 * Close the file with {{{fid.close()}}}

A short sketch that combines these steps is given below, after the Attributes section.

=== Attributes ===

After opening a NetCDF file you can examine a variable NAME (a string) through its dimensions and attributes, e.g. {{{var.dimensions}}} or {{{var.units}}}:

{{{#!python
var=fid.variables[NAME]
print var.dimensions    # tuple with the names of the variable's dimensions
print var.units         # an individual NetCDF attribute
print var._attributes   # dictionary with all attributes of the variable
}}}
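Putting the steps above together, the following is a minimal sketch of reading one variable from a NetCDF file. The file name {{{data.nc}}} and the variable name {{{temperature}}} are only placeholders; replace them with your own file and with one of the names printed by {{{fid.variables.keys()}}}.

{{{#!python
# Minimal sketch: read one variable and its attributes with scipy.io.
# 'data.nc' and 'temperature' are placeholders for your own file and variable names.
import scipy.io as io

fid = io.netcdf_file('data.nc', 'r')
print fid.variables.keys()       # which variables does the file contain?

var = fid.variables['temperature']
print var.dimensions             # e.g. ('time', 'lat', 'lon')
print var.units                  # an individual NetCDF attribute

data = var[:].copy()             # copy the values into an array
fid.close()
}}}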
=== NetCDF Data Examples ===

 * ftp://ftp.ifremer.fr/ifremer/cersat/products/gridded/psi-concentration/data/ sea ice concentration
 * http://www.esrl.noaa.gov/psd/data/reanalysis/reanalysis.shtml NCEP/NCAR atmospheric reanalysis
 * http://www.argo.ucsd.edu/index.html Argo

== HDF ==

The [[http://hdf.ncsa.uiuc.edu/index.html|HDF format]], developed by NCSA, is NASA's standard file format. HDF is self-describing: files can contain binary data together with their metadata and allow direct access to parts of the file without first parsing the entire contents. The HDF versions 4 and 5 are not compatible with each other. Different modules are available for reading and writing HDF files.

=== HDF4: pyhdf ===

[[http://pysclint.sourceforge.net/pyhdf/|pyhdf]] is a simple Python interface to the NCSA HDF4 library. The following example demonstrates how to read [[http://eosweb.larc.nasa.gov/PRODOCS/misr/level3/download_data.html|level-3 data]] from the [[http://www-misr.jpl.nasa.gov/|Multi-angle Imaging SpectroRadiometer (MISR)]] on the [[http://terra.nasa.gov/|Terra satellite]]:

{{{#!python
from scipy import array
from pylab import imshow,colorbar,title,savefig,cm
from pyhdf.SD import SD

f=SD('MISR_AM1_CGLS_MAY_2007_F04_0025.hdf')          # open the HDF file object
print f.datasets().keys()                            # names of the included datasets
data=array(f.select('NDVI average').get())           # read one dataset into a SciPy array
data[data<0]=0                                       # negative values mark bad or missing data
imshow(data,interpolation='nearest',cmap=cm.YlGn)
colorbar(shrink=0.5)
title('Normalized Difference Vegetation Index')
}}}

{{{SD(...)}}} opens the HDF file object. {{{f.datasets().keys()}}} prints the names of the included datasets; from this list one can identify the name of the desired parameter. {{{f.select('NDVI average').get()}}} reads the data into a SciPy array, and the negative (bad and missing) values are then set to zero before plotting.

=== HDF5 ===

[[http://www.pytables.org/moin|PyTables]] is a package for managing hierarchical datasets, designed to cope efficiently and easily with extremely large amounts of data.

= Exercise 1 =

Download the NCEP/NCAR reanalysis monthly mean surface air temperature values and examine the variables and their attributes: [[attachment:air.mon.mean.nc]]

 * What are their dimensions?
 * What is the meaning of the dimensions?
 * What are the units?
 * What is the scaling?
 * How are the missing values marked?
 * Select the grid cell for the position of Hamburg and plot the temperature time series (one possible approach is sketched at the end of this page).

= Exercise 2 =

Write scripts (or try to understand the code below)

 * to download all [[http://www.argo.ucsd.edu/index.html|Argo data]] in the Atlantic for the last month, and
 * to plot the pressure against the temperature for only those profiles with a surface temperature below -1 deg C.

{{{#!python
#download_argo.py
import os,os.path
from ftplib import FTP

tmp_dir='/home/lars/data/tmp/'
ftp_adr='ftp.ifremer.fr'
ftp_dir='/ifremer/argo/geo/atlantic_ocean/2009/11/'

ftp=FTP(ftp_adr)        # connect to host, default port
ftp.login()             # user anonymous, passwd anonymous@
ftp.cwd(ftp_dir)
file_list=ftp.nlst()    # list the files in the FTP directory

for f in file_list:
    urlfile='ftp://'+ftp_adr+ftp_dir+f
    localfile=tmp_dir+f
    if not(os.path.exists(localfile)):
        print 'Getting file '+f
        # We use curl instead of ftp.retrbinary for the download
        os.system("curl -s -k -o "+localfile+" "+urlfile)
    else:
        print 'file '+f+' exists'
ftp.close()
}}}

{{{#!python
#Plot Argo profiles with surface temperature below 1 deg C
import scipy.io as io
import glob
from pylab import *

tmp_dir='/home/lars/data/tmp/'
file_liste=glob.glob(tmp_dir+'*.nc')

D={}    # empty dictionary to store the selected profiles
for f in file_liste:
    print f
    # Open the NetCDF data file
    fid=io.netcdf_file(f,'r')
    #print fid.variables.keys()
    # Make a copy of the content
    lat=fid.variables['LATITUDE'][:].copy()
    lon=fid.variables['LONGITUDE'][:].copy()
    T=fid.variables['TEMP'][:].copy()
    P=fid.variables['PRES'][:].copy()
    T[T>=99999]=nan    # set the fill value 99999.0 to "Not a Number"
    P[P>=99999]=nan
    (nr_profs,Z)=T.shape    # number of profiles and number of levels
    for i in range(nr_profs):
        if T[i,0]<1.0:
            # Select only those profiles with a surface temperature below 1 deg C
            # and store them in the dictionary with the position as key
            D[(lat[i],lon[i])]=(T[i,:],P[i,:])
    # Close the data file
    fid.close()

# Plot the selected profiles
figure()
for k in D.keys():
    print k
    plot(D[k][0][:],D[k][1][:])
axis([-2,3,2000,0])    # reversed pressure axis: the surface is at the top
xlabel('T')
ylabel('P')
show()
}}}

{{attachment:Argo_plot.png}}
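As a hint for Exercise 1, here is a minimal sketch of one possible approach. It assumes that the temperature variable is called {{{air}}}, that the coordinate variables are called {{{lat}}} and {{{lon}}}, that the dimension order is (time, lat, lon), and that packed files carry {{{scale_factor}}} and {{{add_offset}}} attributes, as is common for NCEP reanalysis files; check the names and attributes of your copy of {{{air.mon.mean.nc}}} first.

{{{#!python
# Sketch for Exercise 1: temperature time series of the grid cell nearest to Hamburg.
# Variable names ('air', 'lat', 'lon') and the dimension order (time, lat, lon) are
# assumptions; verify them with fid.variables.keys() and var.dimensions.
import scipy.io as io
from numpy import argmin, abs
from pylab import plot, xlabel, ylabel, title, show

fid = io.netcdf_file('air.mon.mean.nc', 'r')
print fid.variables.keys()

lat = fid.variables['lat'][:].copy()
lon = fid.variables['lon'][:].copy()
air = fid.variables['air']
print air.dimensions, air.units

data = air[:].copy().astype('float64')
# Older NCEP files store packed short integers; unpack them if the attributes are present
if hasattr(air, 'scale_factor'):
    data = data*air.scale_factor + air.add_offset

# Grid cell nearest to Hamburg (about 53.55 N, 10.0 E; NCEP longitudes run from 0 to 360)
i = argmin(abs(lat - 53.55))
j = argmin(abs(lon - 10.0))

plot(data[:, i, j])
xlabel('months since the start of the record')
ylabel('surface air temperature')
title('NCEP/NCAR reanalysis, grid cell nearest to Hamburg')
show()
fid.close()
}}}

If the file defines a {{{missing_value}}} attribute, those values should be set to NaN in the same way as in the Argo example above.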