Introduction

Overview

Teaching: 15 min
Exercises: 0 min
Questions
  • What is the prerequisite knowledge for this workshop?

Objectives
  • Recall the commands covered in the Introduction to Unix Shell workshop.

  • Recall Nelle’s pipeline.

Recap of basic shell commands

This workshop assumes you are familiar with the material in the Introduction to Unix Shell Software Carpentry lesson. Below is a quick recap of that lesson.

What is the shell

Command History

Files

Files and Directories

Text Editors

Pipes and Filters

Loops

Shell Scripts

Find and Grep

Nelle’s Pipeline

The Introduction to Unix Shell lesson was built around the story of Nelle Nemo, a marine biologist who had just returned from a six-month survey of the North Pacific Gyre, where she had been sampling gelatinous marine life in the Great Pacific Garbage Patch. She had 1520 samples in all and needed to:

  1. Run each sample through an assay machine that will measure the relative abundance of 300 different proteins. The machine’s output for a single sample is a file with one line for each protein.
  2. Calculate statistics for each of the proteins separately using a program her supervisor wrote called goostats.
  3. Write up results. Her supervisor would really like her to do this by the end of the month so that her paper can appear in an upcoming special issue of Aquatic Goo Letters.

It takes about half an hour for the assay machine to process each sample. The good news is that it only takes two minutes to set each one up. Since her lab has eight assay machines that she can use in parallel, this step will “only” take about two weeks.

The bad news is that if she had to run goostats by hand using a GUI, she would have to select a file using an open file dialog 1520 times. At 30 seconds per sample, the whole process would take more than 12 hours (and that’s assuming the best-case scenario where she is ready to select the next file as soon as the previous sample analysis has finished).
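The 12-hour estimate can be checked with shell integer arithmetic. This is a minimal sketch using the figures quoted above; the variable names are illustrative and not part of the lesson.

```shell
# Figures from the text: 1520 samples, 30 seconds of file selection per sample.
samples=1520
seconds_per_sample=30

total_seconds=$(( samples * seconds_per_sample ))

# Shell arithmetic is integer-only, so split the total into whole hours
# and the minutes left over.
whole_hours=$(( total_seconds / 3600 ))
leftover_minutes=$(( (total_seconds % 3600) / 60 ))

echo "${total_seconds} seconds = ${whole_hours} h ${leftover_minutes} min"
```

This prints 45600 seconds = 12 h 40 min, which agrees with “more than 12 hours”.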

A shell script for Nelle’s pipeline

During the Introduction to Unix Shell lesson we developed a shell script to help run Nelle’s pipeline. It loops over the data files given on the command line, prints each file name, and runs goostats on that file.

This script is saved as do-stats.sh and contains the following code:

# Run goostats on every data file named on the command line.
for datafile in "$@"
do
    echo "$datafile"
    bash goostats "$datafile" "stats-$datafile"
done

This is run with the command:

bash do-stats.sh NENE*[AB].txt

The shell expands the pattern to every file whose name starts with “NENE” and ends in “A.txt” or “B.txt”, and passes those names to the script.
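To see which files the pattern selects and how "$@" hands them to the loop, the following sketch works in a scratch directory. The file names and the show_args function are made up for illustration, and goostats itself is not run.

```shell
# Work in a throwaway directory so no real data is touched.
scratch=$(mktemp -d)
cd "$scratch" || exit 1

# Hypothetical file names: two match the pattern, two do not.
touch NENE01729A.txt NENE01729B.txt NENE01751Z.txt notes.txt

# Stand-in for do-stats.sh: loop over the arguments the same way the script
# does, but only print each file name instead of calling goostats.
show_args() {
    for datafile in "$@"
    do
        echo "$datafile"
    done
}

# The shell expands the pattern before show_args runs, so the function
# receives the matching file names as separate arguments.
matched=$(show_args NENE*[AB].txt)
echo "$matched"
```

Only NENE01729A.txt and NENE01729B.txt are printed. NENE01751Z.txt is skipped because the character before “.txt” must be “A” or “B”, and notes.txt does not start with “NENE”.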

Nelle’s new challenge

Nelle now has a new dataset to process that is much bigger than the previous one, and it is taking a long time to process on her laptop. A colleague has suggested she use her group’s Linux server instead. To do so, she will also need to transfer her data to the server.

She would also like to do the following:

Let’s help Nelle to make these modifications to her pipeline.

Key Points

  • Familiarity with the basic shell commands covered in the introductory lesson is assumed.

  • Nelle’s pipeline is available as a shell script which runs the goostats program on an entire dataset.