Introduction
|
Jupyter Lab is a system for interactive notebooks that can run Python code; the notebooks can run either on your own computer or on a remote one.
Python can scale to using large datasets with the Xarray library.
Python can parallelise computation with Dask or Numba.
NetCDF format is useful for large data structures as it is self-documenting and handles multiple dimensions.
Zarr format is useful for cloud storage as it chunks data so we don’t need to transfer the whole file.
Intake catalogues make dealing with multifile datasets easier.
|
Dataset Parallelism
|
GNU Parallel can apply the same command to every file in a dataset
GNU Parallel works on the command line and doesn’t require Python, but it can run multiple copies of a Python script
It is often the simplest way to apply parallelism
It requires a problem that works independently across a set of files or a range of parameters
Without invoking a more complex job scheduler, GNU Parallel only works on a single computer
By default GNU Parallel will use every CPU core available to it
|
Parallelisation with Numpy and Numba
|
We can measure how long a Jupyter cell takes to run with %%time or %%timeit magics.
We can use a profiler to measure how long each line of code in a function takes.
We should measure performance before attempting to optimise code and target our optimisations at the things which take longest.
Numpy can perform operations on whole arrays, which is faster than using for loops.
Numba can replace some Numpy operations with just-in-time compilation that is even faster.
One way Numba achieves higher performance is by using the vectorisation (SIMD) extensions of some CPUs, which process multiple pieces of data in one instruction.
Numba ufuncs let us write arbitrary functions for Numba to compile.
|
Working with data in Xarray
|
Xarray can load NetCDF files
We can address dimensions by their name using the .dimensionname , ['dimensionname'] or .sel(dimensionname=...) syntax.
With lazy loading data is only loaded into memory when it is requested
We can apply mathematical operations to the whole (or part) of the array; this is more efficient than using a for loop.
We can also apply custom functions to operate on the whole or part of the array.
We can plot data from Xarray, which invokes matplotlib behind the scenes.
Hvplot can plot interactive graphs.
Xarray has many useful built in operations it can perform such as resampling, coarsening and grouping.
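The named-dimension and whole-array ideas above can be sketched with a small synthetic dataset (the variable names and values here are invented for illustration; a real workflow would start from `xr.open_dataset("file.nc")`).

```python
import numpy as np
import xarray as xr

# A small synthetic temperature field with named dimensions.
temps = xr.DataArray(
    np.random.rand(4, 3, 2) * 30,
    dims=["time", "lat", "lon"],
    coords={"time": [0, 1, 2, 3],
            "lat": [10.0, 20.0, 30.0],
            "lon": [100.0, 110.0]},
    name="temperature",
)

# Address dimensions by name rather than by position.
first_step = temps.sel(time=0)
subset = temps.sel(lat=slice(10.0, 20.0))

# Whole-array mathematics: no explicit for loop.
in_fahrenheit = temps * 9 / 5 + 32

# Built-in reductions over a named dimension.
time_mean = temps.mean(dim="time")
```

Calling `.plot()` on any of these DataArrays produces a matplotlib figure with the axes labelled from the dimension names.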
|
Plotting Geospatial Data with Cartopy
|
Cartopy can plot data on maps
Cartopy can use Xarray DataArrays
We can apply different projections to the map
We can add gridlines and country/region boundaries
|
Parallelising with Dask
|
Dask is a parallel computing framework for Python
Dask creates a task graph of all the steps in the operation we request
Dask can use your local computer, an HPC cluster, Kubernetes cluster or a remote system over SSH
We can monitor Dask’s processing with its dashboard
Xarray can use Dask to parallelise some of its operations
Delayed tasks let us lazily evaluate functions, only causing them to execute when the final result is requested
Futures start a task immediately but return a Future object until the computation is completed
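A minimal sketch of delayed tasks and the task graph, assuming Dask is installed; the `load` and `process` functions are stand-ins invented for the example.

```python
from dask import delayed

@delayed
def load(i):
    # Stand-in for reading one file of a dataset.
    return list(range(i))

@delayed
def process(chunk):
    # Stand-in for a per-file calculation.
    return sum(chunk)

# Building the graph is instant: nothing has executed yet.
totals = [process(load(i)) for i in range(5)]
grand_total = delayed(sum)(totals)

# .compute() walks the task graph, running independent tasks in parallel.
result = grand_total.compute()
```

The same graph can be sent to an HPC, Kubernetes or SSH cluster just by connecting a `dask.distributed` Client first; the code building the graph does not change.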
|
Storing and Accessing Data in Parallelism Friendly Formats
|
We can process faster in parallel if we can read or write data in parallel too
Data storage is many times slower than accessing our computer’s memory
Object stores are one way to store data; they are accessible over the web (HTTP), allow replication of data, and can scale to very large quantities of data.
Zarr is an object store friendly file format intended for storing large array data.
Zarr files are stored in chunks and software such as Xarray can just read the chunks that it needs instead of the whole file.
Xarray can be used to read in Zarr files
|
GPUs
|
GPUs (Graphics Processing Units) have large numbers of very simple processing cores and are suited to some parallel tasks, such as machine learning and array operations
Many laptops and desktops won’t have very powerful GPUs, instead we’ll want to use HPC or Cloud systems to access a GPU.
Google’s Colab provides free access to GPUs with a Jupyter notebooks interface.
Numba can use GPUs with minor modifications to the code.
NVIDIA has drop-in replacements for Pandas, Numpy and scikit-learn that are GPU accelerated.
|