Advanced Python for Environmental Scientists: Glossary

Key Points

Introduction
  • Jupyter Lab is a system for interactive notebooks that can run Python code; these notebooks can run either on your own computer or on a remote computer.

  • Python can scale to large datasets with the Xarray library.

  • Python can parallelise computation with Dask or Numba.

  • The NetCDF format is useful for large data structures because it is self-documenting and handles multiple dimensions.

  • The Zarr format is useful for cloud storage because it chunks data, so we don’t need to transfer the whole file.

  • Intake catalogues make working with multi-file datasets easier; a short example follows.
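
A minimal sketch, assuming a hypothetical catalogue.yml and the intake-xarray plugin; the entry name is illustrative:

```python
import intake

# Hypothetical catalogue file describing a multi-file dataset
cat = intake.open_catalog("catalogue.yml")
print(list(cat))                               # names of the datasets in the catalogue
ds = cat["sea_surface_temperature"].to_dask()  # lazy xarray Dataset backed by Dask
```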

Dataset Parallelism
  • GNU Parallel can apply the same command to every file in a dataset.

  • GNU Parallel works on the command line and doesn’t require Python, but it can run multiple copies of a Python script.

  • It is often the simplest way to apply parallelism.

  • It requires a problem that runs independently across a set of files or a range of parameters.

  • Without invoking a more complex job scheduler, GNU Parallel only works on a single computer.

  • By default GNU Parallel will use every CPU core available to it; a usage sketch follows this list.
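
For example, a minimal sketch in the shell, where process_one_file.py is a hypothetical script that takes one filename as its argument:

```bash
# Run one copy of the script per NetCDF file; by default GNU Parallel
# keeps every CPU core busy
ls data/*.nc | parallel python process_one_file.py {}

# Limit to 4 simultaneous jobs with -j
ls data/*.nc | parallel -j 4 python process_one_file.py {}
```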

Parallelisation with Numpy and Numba
  • We can measure how long a Jupyter cell takes to run with the %%time or %%timeit magics.

  • We can use a profiler to measure how long each line of code in a function takes.

  • We should measure performance before attempting to optimise code, and target our optimisations at the parts that take longest.

  • Numpy can apply operations to whole arrays; this is faster than using for loops.

  • Numba can replace some Numpy operations with just-in-time compiled code that is even faster.

  • One way Numba achieves higher performance is by using the vectorisation (SIMD) extensions of some CPUs, which process multiple pieces of data in one instruction.

  • Numba ufuncs let us write arbitrary functions for Numba to compile and apply elementwise; see the sketch below.
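
A minimal sketch of a Numba ufunc; the function and array names are illustrative:

```python
import numpy as np
from numba import vectorize

# Compile a scalar function into a ufunc that runs elementwise over arrays
@vectorize(["float64(float64, float64)"])
def rel_diff(a, b):
    return abs(a - b) / (abs(a) + abs(b))

x = np.random.rand(1_000_000)
y = np.random.rand(1_000_000)
result = rel_diff(x, y)  # executes as compiled code over the whole arrays
```

In a notebook, putting %%timeit at the top of a cell makes it easy to compare this against a pure-Python loop.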

Working with data in Xarray
  • Xarray can load NetCDF files.

  • We can address dimensions by their name using the .dimensionname, ['dimensionname'] or .sel(dimensionname=value) syntax.

  • With lazy loading, data is only loaded into memory when it is requested.

  • We can apply mathematical operations to the whole (or part) of the array; this is more efficient than using a for loop.

  • We can also apply custom functions to operate on the whole or part of the array.

  • We can plot data from Xarray, which invokes matplotlib under the hood.

  • Hvplot can plot interactive graphs.

  • Xarray has many useful built-in operations, such as resampling, coarsening and grouping; see the example below.
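
A minimal sketch, assuming a hypothetical file data.nc with a temperature variable and time, lat and lon dimensions:

```python
import xarray as xr

ds = xr.open_dataset("data.nc")               # lazy: values are read on demand
jan = ds["temperature"].sel(time="2020-01")   # select along a named dimension
anomaly = ds["temperature"] - ds["temperature"].mean(dim="time")   # whole-array maths
ds["temperature"].mean(dim=["lat", "lon"]).plot()  # a time series, drawn via matplotlib
```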

Plotting Geospatial Data with Cartopy
  • Cartopy can plot data on maps.

  • Cartopy can use Xarray DataArrays.

  • We can apply different projections to the map.

  • We can add gridlines and country/region boundaries; a sketch follows.
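
A minimal sketch of a map with a projection, gridlines and country boundaries; "da" is a hypothetical 2D DataArray:

```python
import matplotlib.pyplot as plt
import cartopy.crs as ccrs
import cartopy.feature as cfeature

ax = plt.axes(projection=ccrs.Robinson())  # pick a map projection
ax.coastlines()
ax.add_feature(cfeature.BORDERS)           # country boundaries
ax.gridlines()
# A 2D DataArray on a lat/lon grid plots straight onto the map:
# da.plot(ax=ax, transform=ccrs.PlateCarree())
plt.show()
```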

Parallelising with Dask
  • Dask is a parallel computing framework for Python.

  • Dask creates a task graph of all the steps in the operation we request.

  • Dask can use your local computer, an HPC cluster, a Kubernetes cluster or a remote system over SSH.

  • We can monitor Dask’s processing with its dashboard.

  • Xarray can use Dask to parallelise some of its operations.

  • Delayed tasks let us lazily evaluate functions, only causing them to execute when the final result is requested.

  • Futures start a task immediately and return a Future object that stands in for the result until the computation completes; see the example below.
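
A minimal sketch of delayed tasks and futures on a local Dask cluster; the function is illustrative:

```python
import dask
from dask.distributed import Client

client = Client()                   # local cluster
print(client.dashboard_link)        # URL of the monitoring dashboard

@dask.delayed
def double(x):
    return 2 * x

# Calling a delayed function only builds the task graph; nothing runs yet
total = dask.delayed(sum)([double(i) for i in range(10)])
print(total.compute())              # the graph executes now; prints 90

future = client.submit(double, 21)  # a future starts running immediately
print(future.result())              # block until the result arrives; prints 42

client.close()
```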

Storing and Accessing Data in Parallelism Friendly Formats
  • We can process faster in parallel if we can read or write data in parallel too.

  • Data storage is many times slower than accessing our computer’s memory.

  • Object stores are one way to store data; they are accessible over the web (HTTP), allow replication of data, and can scale to very large quantities of data.

  • Zarr is an object-store-friendly file format intended for storing large array data.

  • Zarr stores data in chunks, and software such as Xarray can read just the chunks it needs instead of the whole file.

  • Xarray can read Zarr files; see the example below.
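
A minimal sketch, assuming a hypothetical NetCDF file data.nc with a time dimension:

```python
import xarray as xr

# Convert a NetCDF file into a chunked Zarr store
ds = xr.open_dataset("data.nc")
ds.chunk({"time": 100}).to_zarr("data.zarr")

# Reading back: only the chunks that are actually used get fetched
ds2 = xr.open_zarr("data.zarr")
```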

GPUs
  • GPUs (Graphics Processing Units) have large numbers of very simple processing cores and are suited to some parallel tasks, such as machine learning and array operations.

  • Many laptops and desktops won’t have very powerful GPUs; instead we’ll want to use HPC or cloud systems to access a GPU.

  • Google’s Colab provides free access to GPUs with a Jupyter notebook interface.

  • Numba can use GPUs with minor modifications to the code; see the sketch after this list.

  • NVIDIA have drop-in replacements for Pandas, Numpy and scikit-learn that are GPU accelerated.
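
A minimal sketch of a Numba CUDA kernel, assuming an NVIDIA GPU and the CUDA toolkit are available; the kernel and array names are illustrative:

```python
import numpy as np
from numba import cuda

@cuda.jit
def add_one(arr):
    i = cuda.grid(1)           # this thread’s global index
    if i < arr.size:           # guard against threads past the end of the array
        arr[i] += 1.0

data = np.zeros(1024, dtype=np.float32)
d_data = cuda.to_device(data)  # copy the array to GPU memory
add_one[4, 256](d_data)        # launch 4 blocks of 256 threads each
result = d_data.copy_to_host() # copy the result back to the CPU
```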

Glossary

The glossary would go here, formatted as:

{:auto_ids}
key word 1
:   explanation 1

key word 2
:   explanation 2

({:auto_ids} is needed at the start so that Jekyll will automatically generate a unique ID for each item to allow other pages to hyperlink to specific glossary entries.) This renders as:

key word 1
explanation 1
key word 2
explanation 2