Dataset Parallelism

Overview

Teaching: 15 min
Exercises: 10 min

Questions

How do we apply the same command to every file or parameter in a dataset?

Objectives

Use GNU Parallel to apply the same command to every file or parameter in a dataset

Dataset Parallelism with GNU Parallel

GNU Parallel is a powerful command that lets us run any command in parallel. To do this effectively we need what is often called an “embarrassingly parallel” problem: one where a dataset can be split into several parts and each part can be processed independently and simultaneously. Such problems often occur when a dataset is split across multiple files or there are multiple parameters to process.

Basic use of GNU Parallel

In the Unix shell we could loop over a dataset one item at a time by using a for loop and the ls command together.

for file in $(ls) ; do
    echo $file
done

We can ask GNU Parallel to perform the same task, and several of the echo commands will run simultaneously. The {1} after echo is replaced by whatever comes after :::, in this case each item in the output of the ls command.

parallel echo {1} ::: $(ls)

We could also use a set of values instead of ls:

parallel echo {1} ::: 1 2 3 4 5 6 7 8
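As a rough mental model of the {1} substitution (not a description of how GNU Parallel is implemented internally), the Python sketch below builds the list of commands that would be run for a template and a list of arguments. The function name build_commands is hypothetical and exists only for this illustration.

```python
def build_commands(template, arguments):
    """Return one command string per argument, replacing every {1} with it."""
    return [template.replace("{1}", str(argument)) for argument in arguments]


# Mirrors: parallel echo {1} ::: 1 2 3 4 5 6 7 8
commands = build_commands("echo {1}", range(1, 9))
for command in commands:
    print(command)
```

The sketch only lists the commands. GNU Parallel additionally runs them, several at a time.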

Just running echo commands isn’t very useful, but we could use parallel to invoke a Python script too. The serial example to process a series of NetCDF files would be:

for file in $(ls *.nc) ; do
    python myscript.py $file
done

And with parallel it would be:

parallel python myscript.py {1} ::: $(ls *.nc)

Citing Software

It is good practice to cite the software we use in research. GNU Parallel is particularly vocal about this and will remind us to cite it. Running parallel --citation shows all of the information we need to cite it in a publication and also prevents further reminders.

Working with multiple arguments

The {1} can be used multiple times if we want the same argument to be repeated. If for example the script required an input and output file name and the output was the input file with .out on the end, then we could do the following:

parallel python myscript-2.py {1} {1}.out ::: $(ls *.nc)

Using a list of files stored in a file

Passing commands or lists of arguments works for many use cases, but sometimes we want to use a list of files stored in a text file. For this we use the :::: separator (note four colons, not three) followed by the file name. Each line in the file is used as one argument.

ls *.nc | grep "^ABC" > files.txt
parallel python myscript-2.py {1} {1}.out :::: files.txt

More complex arguments

Parallel can also take two (or more) sets of arguments: the first becomes {1}, the second {2} and so on. Each input list must be introduced by its own :::. Parallel then runs the command once for every combination of the arguments.

parallel echo "hello {1} {2}" ::: 1 2 3 ::: a b c

We can also mix the ::: and :::: notations so that some arguments come from files and others from lists. For example, if we had a list of NetCDF files in files.txt and wanted to analyse two of the variables, we could use:

parallel process.py --variable={1} {2} ::: temp sal :::: files.txt

{1} will be replaced by temp or sal, while {2} will be replaced by each file name in turn. Parallel will run process.py for both variables on every file.
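The set of combinations can be mirrored in Python with itertools.product. This is only a sketch of which commands get generated: the file names below are hypothetical stand-ins for the contents of files.txt, and the order in which GNU Parallel finishes the jobs may differ.

```python
from itertools import product

variables = ["temp", "sal"]
files = ["a.nc", "b.nc", "c.nc"]  # hypothetical contents of files.txt

# One command per (variable, file) combination, like ::: temp sal :::: files.txt
commands = [
    f"process.py --variable={variable} {filename}"
    for variable, filename in product(variables, files)
]
for command in commands:
    print(command)
```

Two variables and three files give six commands in total.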

Pairing arguments

Sometimes we don’t want to run every value in one list with every value in the other, but instead want to pair them up. Using :::+ instead of ::: links a list to the one before it, for example:

parallel echo "hello {1} {2}" ::: 1 2 3 :::+ a b c

which produces:

hello 1 a
hello 2 b
hello 3 c
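For lists of equal length, pairing with :::+ is comparable to Python's zip, whereas separate ::: lists are comparable to itertools.product. A minimal sketch of the pairing for the command echo "hello {1} {2}", assuming both lists have the same number of items:

```python
numbers = [1, 2, 3]
letters = ["a", "b", "c"]

# zip takes the first item of each list, then the second, and so on
paired = [f"hello {number} {letter}" for number, letter in zip(numbers, letters)]
for line in paired:
    print(line)
```

Three pairs produce three jobs, rather than the nine that every combination would give.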

Job Control

By default Parallel runs one job per processing core on the system. Sometimes, especially on shared systems, this isn’t what we want. On some HPC systems we might only be allocated a few cores while the system has many more, and Parallel will try to use them all. Depending on how the system is configured, this will either run several processes on each core we’ve been allocated or exceed our allocation. We can limit how many jobs Parallel runs at once with the --max-procs option, for example --max-procs 4 for at most four simultaneous jobs.
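To choose a sensible limit it helps to know how many cores we have actually been given. The Python sketch below compares the number of cores on the machine with the number the current process is allowed to run on. It assumes the system exposes the allocation through the CPU affinity mask, which depends on how the scheduler is configured. The variable names are illustrative only.

```python
import os

# os.cpu_count() can return None if the count cannot be determined
total_cores = os.cpu_count() or 1

# os.sched_getaffinity is only available on some platforms (for example Linux),
# so fall back to the total core count elsewhere.
if hasattr(os, "sched_getaffinity"):
    allocated_cores = len(os.sched_getaffinity(0))
else:
    allocated_cores = total_cores

print(f"cores on the system: {total_cores}")
print(f"cores this process may use: {allocated_cores}")
print(f"possible setting: --max-procs {allocated_cores}")
```

If the two numbers differ, the smaller one is usually the better starting point for --max-procs.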

Logging

In more complex jobs it can be useful to have a log of which jobs ran, when they started and how long they took. This is set with the --joblog option to Parallel and is followed by a file name. For example:

parallel --joblog=jobs.log echo {1} ::: 1 2 3 4 5 6 7 8 9 10

After Parallel has finished we can look at the contents of the file jobs.log and see the output:

Seq     Host    Starttime       JobRuntime      Send    Receive Exitval Signal  Command
1       :       1711502183.024       0.002      0       2       0       0       echo 1
2       :       1711502183.025       0.003      0       2       0       0       echo 2
3       :       1711502183.026       0.003      0       2       0       0       echo 3
4       :       1711502183.028       0.002      0       2       0       0       echo 4
5       :       1711502183.029       0.003      0       2       0       0       echo 5
6       :       1711502183.030       0.003      0       2       0       0       echo 6
7       :       1711502183.032       0.003      0       2       0       0       echo 7
8       :       1711502183.034       0.004      0       2       0       0       echo 8
9       :       1711502183.036       0.002      0       2       0       0       echo 9
10      :       1711502183.037       0.003      0       3       0       0       echo 10
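A job log like this can also be summarised with a short script. The sketch below assumes the log is tab-separated with the header row shown above. It uses an inline copy of the first three jobs, with Seq values 1 to 3 assumed, so that it runs on its own. In practice we would pass open("jobs.log") instead. The function name summarise is hypothetical.

```python
import csv
import io

# Inline sample so the sketch is self-contained; Seq values are assumed.
JOBLOG = (
    "Seq\tHost\tStarttime\tJobRuntime\tSend\tReceive\tExitval\tSignal\tCommand\n"
    "1\t:\t1711502183.024\t0.002\t0\t2\t0\t0\techo 1\n"
    "2\t:\t1711502183.025\t0.003\t0\t2\t0\t0\techo 2\n"
    "3\t:\t1711502183.026\t0.003\t0\t2\t0\t0\techo 3\n"
)


def summarise(joblog_file):
    """Count jobs and failures, and compare summed runtime with elapsed time."""
    rows = list(csv.DictReader(joblog_file, delimiter="\t"))
    starts = [float(row["Starttime"]) for row in rows]
    ends = [float(row["Starttime"]) + float(row["JobRuntime"]) for row in rows]
    return {
        "jobs": len(rows),
        "failed": sum(1 for row in rows if row["Exitval"] != "0"),
        "total_runtime": sum(float(row["JobRuntime"]) for row in rows),
        "wall_time": max(ends) - min(starts),
    }


summary = summarise(io.StringIO(JOBLOG))
print(summary)
```

When the summed runtime is larger than the elapsed wall time, some of the jobs must have overlapped.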

Timing the speed up with Parallel

There is a script included with the example dataset called plot_tempanomaly.py. This script will plot a map of the temperature anomaly data from our GISS dataset. It takes three arguments: the name of the NetCDF file to use, a start year (specified with --start) and an end year (specified with --end). It will create a PNG file for each month that it processes.

For example, to run this for the year 2000 we would run:

python plot_tempanomaly.py gistemp1200-21c.nc --start 2000 --end 2001

We can time how long a command takes by prefixing it with the time command. This reports three numbers:

real: how long the whole command took to run, as elapsed (wall-clock) time.

user: how much processor time the command used in user mode. This is typically time spent in our code and the libraries it calls.

sys: how much processor time the command used in system (kernel) mode. This is typically time the operating system spent working on the command's behalf, for example handling requests to the disk, screen or network. Time spent simply waiting for a device to respond counts towards real but not towards user or sys. The sum of the sys and user times can exceed the real time when multiple processor cores are used.
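The difference between elapsed time and processor time can also be seen from inside Python. In this sketch time.perf_counter plays the role of real and time.process_time gives the combined user and sys time of the current process. The sleep duration and loop size are arbitrary values chosen for illustration.

```python
import time

start_real = time.perf_counter()  # elapsed time, comparable to "real"
start_cpu = time.process_time()   # user + sys processor time of this process

time.sleep(0.2)                              # waiting: uses almost no processor time
total = sum(i * i for i in range(200_000))   # computing: uses processor time

real = time.perf_counter() - start_real
cpu = time.process_time() - start_cpu
print(f"real: {real:.3f} s, cpu (user + sys): {cpu:.3f} s")
```

Because of the sleep, the elapsed time here is noticeably larger than the processor time.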

Run this for the years 2000 to 2023 as a serial job with the command:

time for year in $(seq 2000 2023) ; do python plot_tempanomaly.py gistemp1200-21c.nc --start $year --end $((year+1)) ; done

Now repeat the command using Parallel:

time parallel python plot_tempanomaly.py gistemp1200-21c.nc --start {1} --end {2} ::: $(seq 2000 2023) :::+ $(seq 2001 2024)

Note that if you are using parallel from outside of Jupyter Lab, running parallel deactivates your conda/mamba environment. The easiest solution is to create a wrapper shell script that runs the python command. Type the following into your favourite text editor and save it as plot_tempanomaly.sh:

#!/bin/bash
# Usage: plot_tempanomaly.sh NETCDF_FILE START_YEAR END_YEAR
python plot_tempanomaly.py "$1" --start "$2" --end "$3"

Then run the wrapper script with Parallel:

time parallel bash plot_tempanomaly.sh gistemp1200-21c.nc {1} {2} ::: $(seq 2000 2023) :::+ $(seq 2001 2024)

Compare the runtimes of the parallel and serial versions. Try adding the --joblog option and examining how many jobs launched at once. How many jobs did Parallel launch simultaneously? How much faster was the parallel version than the serial version? Try adding the --max-procs option, setting it to 2, 4 or 8, and compare the run times.
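One way to express how much faster the parallel version was is the speedup (serial time divided by parallel time) and the efficiency (speedup divided by the number of cores used). A minimal sketch follows. The timings in it are placeholders, not measured results, and should be replaced with the real values reported by time.

```python
def speedup(serial_seconds, parallel_seconds):
    """How many times faster the parallel run was than the serial run."""
    return serial_seconds / parallel_seconds


def efficiency(serial_seconds, parallel_seconds, cores):
    """Speedup per core: 1.0 would mean perfect scaling."""
    return speedup(serial_seconds, parallel_seconds) / cores


# Placeholder timings, NOT measurements: substitute your own "real" times.
serial_real = 120.0
parallel_real = 40.0
cores_used = 4

print(f"speedup: {speedup(serial_real, parallel_real):.2f}x")
print(f"efficiency: {efficiency(serial_real, parallel_real, cores_used):.2f}")
```

An efficiency well below 1.0 suggests that start-up costs or disk access are limiting the benefit of adding more cores.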

Bonus Challenge: Make a Movie

We now have 288 PNG images covering the time period of our dataset. A useful way to view these would be as a video. There are a number of programs you can use to convert them into a video file; one such program is FFmpeg. You might need to install FFmpeg via Conda/Mamba. Look up in the FFmpeg documentation how to make your images into a video. Create the video, download it to your computer (you can’t play it in Jupyter Lab) and play it.

Solution

ffmpeg -framerate 25 -pattern_type glob -i "*.png" -c:v libx264 output.mp4

Key Points

GNU Parallel can apply the same command to every file in a dataset

GNU Parallel works on the command line and doesn’t require Python, but it can run multiple copies of a Python script

It is often the simplest way to apply parallelism

It requires a problem that works independently across a set of files or a range of parameters

Without invoking a more complex job scheduler, GNU Parallel only works on a single computer

By default GNU Parallel will use every CPU core available to it

Advanced Python for Environmental Scientists
