Introduction
Overview
Teaching: 15 min
Exercises: 0 min
Questions
What is the prerequisite knowledge for this workshop?
Objectives
Recall the commands covered in the Introduction to Unix shell Workshop.
Recall Nelle’s pipeline.
Recap of basic shell commands
This workshop assumes you are familiar with the material in the Introduction to Unix Shell Software Carpentry lesson. Below is a quick recap of that lesson.
What is the shell
- “A shell is a program whose primary purpose is to read commands and run other programs.”
- “The shell’s main advantages are its high action-to-keystroke ratio, its support for automating repetitive tasks, and its capacity to access networked machines.”
- “The shell’s main disadvantages are its primarily textual nature and how cryptic its commands and operation can be.”
Command History
- “Use the up-arrow key to scroll up through previous commands to edit and repeat them.”
- “Use Ctrl-R to search through the previously entered commands.”
- “Use history to display recent commands, and !number to repeat a command by number.”
Files
- “cd path changes the current working directory.”
- “ls path prints a listing of a specific file or directory; ls on its own lists the current working directory.”
- “pwd prints the user’s current working directory.”
- “/ on its own is the root directory of the whole file system.”
- “A relative path specifies a location starting from the current location.”
- “An absolute path specifies a location from the root of the file system.”
- “Directory names in a path are separated with / on Unix, but \ on Windows.”
- “.. means ‘the directory above the current one’; . on its own means ‘the current directory’.”
- “Most files’ names are something.extension. The extension isn’t required, and doesn’t guarantee anything, but is normally used to indicate the type of data in the file.”
Files and Directories
- “cp old new copies a file.”
- “mkdir path creates a new directory.”
- “mv old new moves (renames) a file or directory.”
- “rm path removes (deletes) a file.”
- “* matches zero or more characters in a filename, so *.txt matches all files ending in .txt.”
- “? matches any single character in a filename, so ?.txt matches a.txt but not any.txt.”
Text Editors
- “Nano is a simple text editor available on most Unix systems.”
- “Use of the Control key may be described in many ways, including Ctrl-X, Control-X, and ^X.”
- “Depending on the type of work you do, you may need a more powerful text editor than Nano.”
Pipes and Filters
- “cat displays the contents of its inputs.”
- “head displays the first 10 lines of its input.”
- “tail displays the last 10 lines of its input.”
- “sort sorts its inputs.”
- “wc counts lines, words, and characters in its inputs.”
- “command > file redirects a command’s output to a file.”
- “first | second is a pipeline: the output of the first command is used as the input to the second.”
- “The best way to use the shell is to use pipes to combine simple single-purpose programs (filters).”
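As a refresher, a few of these filters can be chained together. The pipeline below, using illustrative filenames, lists the three .txt files with the fewest lines:
$ wc -l NENE*.txt | sort -n | head -n 3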
Loops
- “A for loop repeats commands once for every thing in a list.”
- “Every for loop needs a variable to refer to the thing it is currently operating on.”
- “Use $name to expand a variable (i.e., get its value). ${name} can also be used.”
- “Do not use spaces, quotes, or wildcard characters such as ‘*’ or ‘?’ in filenames, as it complicates variable expansion.”
- “Give files consistent names that are easy to match with wildcard patterns to make it easy to select them for looping.”
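As a reminder, a minimal loop in this style looks like the following (the filename pattern is illustrative):
for filename in NENE*.txt
do
    echo "$filename"
done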
Shell Scripts
- “Save commands in files (usually called shell scripts) for re-use.”
- “bash filename runs the commands saved in a file.”
- “bash -x filename runs a script in debug mode.”
- “$@ refers to all of a shell script’s command-line arguments.”
- “$1, $2, etc., refer to the first command-line argument, the second command-line argument, etc.”
- “Place variables in quotes if the values might have spaces in them.”
- “Letting users decide what files to process is more flexible and more consistent with built-in Unix commands.”
Find and Grep
- “find finds files with specific properties that match patterns.”
- “grep selects lines in files that match patterns.”
- “--help is a flag supported by many bash commands, and programs that can be run from within Bash, to display more information on how to use these commands or programs.”
- “man command displays the manual page for a given command.”
- “$(command) inserts a command’s output in place.”
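As a refresher, these can be combined with command substitution (the search string and file pattern here are illustrative):
$ wc -l $(find . -name "*.txt")        # count lines in every .txt file below the current directory
$ grep "NENE" $(find . -name "*.txt")  # select lines containing NENE in those same files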
Nelle’s Pipeline
The Introduction to Unix Shell lesson was built around the story of Nelle Nemo, a marine biologist, who had just returned from a six-month survey of the North Pacific Gyre, where she had been sampling gelatinous marine life in the Great Pacific Garbage Patch. She had 1520 samples in all and needed to:
- Run each sample through an assay machine that will measure the relative abundance of 300 different proteins. The machine’s output for a single sample is a file with one line for each protein.
- Calculate statistics for each of the proteins separately, using a program her supervisor wrote called goostats.
- Write up results. Her supervisor would really like her to do this by the end of the month so that her paper can appear in an upcoming special issue of Aquatic Goo Letters.
It takes about half an hour for the assay machine to process each sample. The good news is that it only takes two minutes to set each one up. Since her lab has eight assay machines that she can use in parallel, this step will “only” take about two weeks.
The bad news is that if she had to run goostats by hand using a GUI, she would have to select a file using an open file dialog 1520 times. At 30 seconds per sample, the whole process would take more than 12 hours (and that’s assuming the best-case scenario where she is ready to select the next file as soon as the previous sample analysis has finished).
A shell script for Nelle’s pipeline
During the Introduction to Unix shell lesson we developed a shell script to help run Nelle’s pipeline. This did the following:
- Looped through a list of files
- Displayed the filename it was currently processing on screen
- Ran the goostats script on the file, creating an output file with a name starting with stats-
This script is saved as do-stats.sh and contains the following code:
for datafile in "$@"
do
echo "$datafile"
bash goostats "$datafile" "stats-$datafile"
done
This is run with the command:
bash do-stats.sh NENE*[AB].txt
This runs the script on all files whose names start with “NENE” and end in “A.txt” or “B.txt”.
Nelle’s new challenge
Nelle now has a new dataset to process that is much bigger than the previous one, and it is taking a long time to process on her laptop. A colleague has suggested she use her group’s Linux server instead. To use this she will need to transfer her data to the server too.
She would also like to do the following:
- Allow the students in her group to help with the analysis; they will need permission to access the files.
- Keep track of how much disk space her analysis is using since her research group is billed based on how much disk space they use.
- Archive her data and send this to a colleague and to an archive service.
- Improve the way her script handles errors and have it print some helpful error messages when data isn’t formatted correctly.
- Make her code configurable depending on some attributes of the computer processing it.
- She has just been given another dataset processed with a new assay machine. This data is in a slightly different format and her processing code needs to be adapted for it.
Let’s help Nelle to make these modifications to her pipeline.
Key Points
Familiarity with the basic shell commands covered in the introductory lesson is assumed.
Nelle’s pipeline is available as a shell script which runs the goostats program on an entire dataset.
Manual Pages
Overview
Teaching: 15 min
Exercises: 10 min
Questions
How do I use man pages?
Objectives
Use man to display the manual page for a given command.
Explain how to read the synopsis of a given command while using man.
Search for specific options or flags in the manual page for a given command.
We can get help for any Unix command with the man (short for manual) command. For example, here is the command to look up information on cp:
$ man cp
The output displayed is referred to as the “man page”.
Most man pages contain much more information than can fit in one terminal screen.
To help facilitate reading, the man command uses a “pager” to move and search through the information screenful by screenful. The most common pager is called less (detailed information on it is available via man less). less is typically the default pager for Unix systems, and other tools may use it for output paging as well.
When less displays a colon ‘:’, we can press the space bar to get the next page, the letter ‘h’ to get help, or the letter ‘q’ to quit.
man’s output is typically complete but concise, as it is designed to be used as a reference rather than a tutorial.
Most man pages are divided into sections:
- NAME: gives the name of the command and a brief description
- SYNOPSIS: how to run the command, including optional and mandatory parameters. (We will explain the syntax later.)
- DESCRIPTION: a fuller description than the synopsis, including a description of all the options to the command. This section may also include example usage or details about how the command works.
- EXAMPLES: self-explanatory.
- SEE ALSO: list other commands that we might find useful or other sources of information that might help us.
Other sections we might see include AUTHOR, REPORTING BUGS, COPYRIGHT, HISTORY, (known) BUGS, and COMPATIBILITY.
How to Read the Synopsis
Here is the synopsis for the cp command on Ubuntu Linux:
SYNOPSIS
cp [OPTION]... [-T] SOURCE DEST
cp [OPTION]... SOURCE... DIRECTORY
cp [OPTION]... -t DIRECTORY SOURCE...
This tells the reader that there are three ways to use the command. Let’s look at the first usage:
cp [OPTION]... [-T] SOURCE DEST
[OPTION] means the cp command can be followed by one or more optional flags. We can tell they’re optional because of the square brackets, and we can tell that one or more are welcome because of the ellipsis (…).
For example, the fact that [-T] is in square brackets, but after the ellipsis, means that it’s optional, but if it’s used, it must come after all the other options.
SOURCE refers to the source file or directory, and DEST to the destination file or directory. Their precise meanings are explained at the top of the DESCRIPTION section.
The other two usage examples can be read in similar ways.
Note that to use the last one, the -t option is mandatory (because it isn’t shown in square brackets).
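For instance, all three forms might be exercised like this (a sketch: the file and directory names are made up, and -v is just one optional flag among many):
$ cp -v notes.txt backup.txt      # first form: copy SOURCE to DEST
$ cp -v a.txt b.txt results/      # second form: copy several SOURCEs into DIRECTORY
$ cp -v -t results/ a.txt b.txt   # third form: -t names the target DIRECTORY up front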
The DESCRIPTION section starts with a few paragraphs explaining the command and its use, then expands on the possible options one by one:
The following options are available:
-a Same as -pPR options. Preserves structure and attributes of
files but not directory structure.
-f If the destination file cannot be opened, remove it and create
a new file, without prompting for confirmation regardless of
its permissions. (The -f option overrides any previous -n
option.)
The target file is not unlinked before the copy. Thus, any
existing access rights will be retained.
... ...
Finding Help on Specific Options
If we want to skip ahead to the option we’re interested in, we can search for it using the slash key ‘/’. (This isn’t part of the man command: it’s a feature of less.)
For example, to find out about -t, we can type /-t and press return. After that, we can use the ‘n’ key to navigate to the next match until we find the detailed information we need:
-t, --target-directory=DIRECTORY
copy all SOURCE arguments into DIRECTORY
This means that this option has the short form -t and the long form --target-directory, and that it takes an argument. Its meaning is to copy all the SOURCE arguments into DIRECTORY. Thus, we can give the destination explicitly instead of relying on having to place the directory at the end.
Limitations of Man Pages
Man pages can be useful for a quick confirmation of how to run a command, but they are not famous for being readable. If you can’t find what you need in the man page, or you can’t understand what you’ve found, try entering “unix command copy file” into your favorite search engine: it will often produce more helpful results.
You May Also Enjoy…
The explainshell.com site does a great job of breaking complex Unix commands into parts and explaining what each does. Sadly, it doesn’t work in reverse…
Looking up information in a man page
Open the man page for the ssh command by running man ssh.
- Find out what the -q option does.
- What is the ~/.ssh/ directory used for?
Solution
- Enables quiet mode, suppressing most error messages and warnings.
- It is the default location to store user-specific configuration data.
Looking up extra options to ls
One of Nelle’s colleagues always uses the -lhtr options when they run ls. Nelle asks them what this does and they can’t quite remember, but it always gives the format they prefer. Look up in the man page for ls what these options do.
Solution
-l lists in long format, which includes the last modification time, ownership, permissions and size.
-h enables “human readable” file size suffixes such as K, M and G.
-t sorts by time.
-r reverses the order, so the newest files come last.
Key Points
man command displays the manual page for a given command.
[OPTION]... means the given command can be followed by one or more optional flags.
Flags specified after the ellipsis are still optional but must come after all other flags.
While inside the manual page, use / followed by your pattern to do interactive searching.
Working Remotely
Overview
Teaching: 45 min
Exercises: 30 min
Questions
How do I use ssh and scp?
Objectives
Learn what SSH is
Learn what an SSH key is
Generate your own SSH key pair
Learn how to use your SSH key
Learn how to work remotely using ssh and scp
Add your SSH key to a remote server
Let’s take a closer look at what happens when we use the shell on a desktop or laptop computer. The first step is to log in so that the operating system knows who we are and what we’re allowed to do. We do this by typing our username and password; the operating system checks those values against its records, and if they match, runs a shell for us.
As we type commands, the 1’s and 0’s that represent the characters we’re typing are sent from the keyboard to the shell. The shell displays those characters on the screen to represent what we type, and then, if what we typed was a command, the shell executes it and displays its output (if any).
What if we want to run some commands on another machine, such as the server in the basement that manages our database of experimental results? To do this, we have to first log in to that machine. We call this a remote login.
In order for us to be able to login, the remote computer must be running a remote login server and we will run a client program that can talk to that server. The client program passes our login credentials to the remote login server and, if we are allowed to login, that server then runs a shell for us on the remote computer.
Once our local client is connected to the remote server, everything we type into the client is passed on, by the server, to the shell running on the remote computer. That remote shell runs those commands on our behalf, just as a local shell would, then sends back output, via the server, to our client, for our computer to display.
SSH History
Back in the day, when everyone trusted each other and knew every chip in their computer by its first name, people didn’t encrypt anything except the most sensitive information when sending it over a network. The two programs used for running a shell (usually, back then, the Bourne Shell, sh) on a remote machine, or copying files to it, were named rsh and rcp, respectively: think (r)emote sh and cp.
However, anyone could watch the unencrypted network traffic, which meant that villains could steal usernames and passwords, and use them for all manner of nefarious purposes.
The SSH protocol was invented to prevent this (or at least slow it down). It uses several sophisticated, and heavily tested, encryption protocols to ensure that outsiders can’t see what’s in the messages going back and forth between different computers.
The remote login server which accepts connections from client programs is known as the SSH daemon, or sshd. The client program we use to login remotely is the secure shell, or ssh: think (s)ecure sh. The ssh login client has a companion program called scp, think (s)ecure cp, which allows us to copy files to or from a remote computer using the same kind of encrypted connection.
A remote login using ssh
To make a remote login, we issue the command ssh username@computer, which tries to make a connection to the SSH daemon running on the remote computer we have specified. After we log in, we can use the remote shell to work with the remote computer’s files and directories. Typing exit or Control-D terminates the remote shell, and the local client program, and returns us to our previous shell.
In the example below, Nelle connects to a computer called neptune.aquatic.edu. The remote machine’s command prompt is neptune> instead of just $. To make it clearer which machine is doing what, we’ll indent the commands sent to the remote machine, and their output.
$ pwd
/users/nelle
$ ssh nelle@neptune.aquatic.edu
Password: ********
neptune> hostname
neptune
neptune> pwd
/home/nelle
neptune> ls -F
bin/ fish.txt deep_sea/ rocks.cfg
neptune> exit
$ pwd
/users/nelle
Logging into a remote system
Open a connection to a remote system you have access to.
The first time you connect to a remote computer you will see a message saying that the authenticity of the host can’t be established. This is normal: you’ve never connected to that computer before, so there is no record of the key fingerprint which identifies it. If you receive this message on a subsequent connection then it is a sign that the remote computer has changed (most likely the OS was reinstalled, but the system could have been hacked), or (much less likely) that somebody is interfering with the encryption of your connection. To accept the fingerprint of the remote system you must type “yes”.
Differences between remote and local system
Open a second terminal window on your local computer.
What differences do you see?
Are the prompts the same?
Run the ls command and see if the output style looks the same.
Solution
You might find that the prompt shows different information, and if it displays a host (computer) name then this should be different. This is very important for making sure you know which system you are issuing commands on when in the shell. You might also find the colours are different, especially when running the ls command.
Copying files to, and from a remote machine using scp
To copy a file, we specify the source and destination paths, either of which may include computer names. If we leave out a computer name, scp assumes we mean the machine we’re running on.
Using our web browser, let’s download some of Nelle’s data, which is stored in a zip file that accompanies this lesson, from https://noc-oi.github.io/data/north-pacific-gyre.zip.
Then we can copy it to a remote server with scp:
scp ~/Downloads/north-pacific-gyre.zip nelle@backupserver:backups/north-pacific-gyre-2012-07-03.zip
Password: ********
north-pacific-gyre.zip 100% 40KB 554.5KB/s 00:00
Note the colon ‘:’ separating the hostname of the server and the pathname of the file we are copying to. It is this character that informs scp that the source or target of the copy is on the remote machine, and the reason it is needed can be explained as follows:
In the same way that the default directory into which we are placed when running a shell on a remote machine is our home directory on that machine, the default target for a remote copy is also the home directory.
This means that
scp ~/Downloads/north-pacific-gyre.zip nelle@backupserver:
would copy north-pacific-gyre.zip into our home directory on backupserver. However, if we left out the colon that informs scp of the remote machine, we would still have a valid command:
scp ~/Downloads/north-pacific-gyre.zip nelle@backupserver
but now we have merely created a file called nelle@backupserver on our local machine, just as we would have done with cp:
cp ~/Downloads/north-pacific-gyre.zip nelle@backupserver
Copying a whole directory between remote machines uses the same syntax as the cp command: we just use the -r option to signal that we want the copy to be recursive. For example, this command copies all of our results from the backup server to our laptop:
scp -r nelle@backupserver:backups ./backups
Password: ********
results-2011-09-18.dat 100% 7 1.0 MB/s 00:00
results-2011-10-04.dat 100% 9 1.0 MB/s 00:00
results-2011-10-28.dat 100% 8 1.0 MB/s 00:00
results-2011-11-11.dat 100% 9 1.0 MB/s 00:00
Choose the right command
Which of the following would you use to copy a directory called data, and all the files and subdirectories contained within it, to the /data directory on a remote computer called datastore.aquatic.edu?
1. scp data nelle@datastore.aquatic.edu
2. cp -r data nelle@datastore.aquatic.edu:
3. scp -r data nelle@datastore.aquatic.edu:/data
4. scp data nelle@datastore.aquatic.edu:
Solution
3 is the correct answer.
1 does not have the -r option to copy all subdirectories, and is missing the ‘:’ needed to specify the path on the remote computer. It will create a file called nelle@datastore.aquatic.edu on the local computer.
2 uses the cp command instead of scp, so it will only copy files on the local computer.
4 is missing the -r option to copy the subdirectories, and doesn’t specify /data as the destination path.
Copy Nelle’s data to your SSH server
Download https://noc-oi.github.io/data/north-pacific-gyre.zip to your computer using your web browser. This will typically place the file north-pacific-gyre.zip in your Downloads folder. Using scp, copy the file to a server you have SSH access to.
Solution
scp ~/Downloads/north-pacific-gyre.zip myuser@myserver:
Running commands on a remote machine using ssh
Here’s one more thing the ssh client program can do for us. Suppose we want to check whether we have already created the file north-pacific-gyre.zip on the backup server. Instead of logging in and then typing ls, we could do this to list all the zip files:
ssh nelle@backupserver "ls *.zip"
Password: ********
north-pacific-gyre.zip
2012-07-04.zip
Here, ssh takes the argument after our remote username and passes it to the shell on the remote computer. (We have to put quotes around it to make it look like a single argument.) Since those arguments form a legal command, the remote shell runs ls *.zip for us and sends the output back to our local shell for display.
SSH Keys
Typing our password over and over again is annoying, especially if the commands we want to run remotely are in a loop. To remove the need to do this, we can create an SSH key to tell the remote machine that it should always trust us.
SSH keys come in pairs, a public key that gets shared with services like GitHub, and a private key that is stored only on your computer. If the keys match, you’re granted access.
The cryptography behind SSH keys ensures that no one can reverse engineer your private key from the public one.
The first step in using SSH authorization is to generate your own key pair.
You might already have an SSH key pair on your machine. You can check whether one exists by moving to your .ssh directory and listing the contents.
$ cd ~/.ssh
$ ls
If you see id_rsa.pub, you already have a key pair and don’t need to create a new one.
If you don’t see id_rsa.pub, use the following command to generate a new key pair. Make sure to replace your@email.com with your own email address.
$ ssh-keygen -t rsa -C "your@email.com"
When asked where to save the new key, hit enter to accept the default location.
Generating public/private rsa key pair.
Enter file in which to save the key (/Users/username/.ssh/id_rsa):
You will then be asked to provide an optional passphrase. This can be used to make your key even more secure, but if your goal is to avoid typing your password every time, you can skip it by hitting enter twice.
Enter passphrase (empty for no passphrase):
Enter same passphrase again:
When the key generation is complete, you should see the following confirmation:
Your identification has been saved in /Users/nelle/.ssh/id_rsa.
Your public key has been saved in /Users/nelle/.ssh/id_rsa.pub.
The key fingerprint is:
01:0f:f4:3b:ca:85:d6:17:a1:7d:f0:68:9d:f0:a2:db nelle@aquatic.edu
The key's randomart image is:
+--[ RSA 2048]----+
| |
| |
| . E + |
| . o = . |
| . S = o |
| o.O . o |
| o .+ . |
| . o+.. |
| .+=o |
+-----------------+
The random art image is an alternate way to match keys but we won’t be needing this.
Now you need to place a copy of your public key on any servers you would like to connect to using SSH keys, instead of logging in with a username and password.
Display the contents of your new public key file with cat:
$ cat ~/.ssh/id_rsa.pub
ssh-rsa AAAAB3NzaC1yc2EAAAABIwAAAQEA879BJGYlPTLIuc9/R5MYiN4yc/YiCLcdBpSdzgK9Dt0Bkfe3rSz5cPm4wmehdE7GkVFXrBJ2YHqPLuM1yx1AUxIebpwlIl9f/aUHOts9eVnVh4NztPy0iSU/Sv0b2ODQQvcy2vYcujlorscl8JjAgfWsO3W4iGEe6QwBpVomcME8IU35v5VbylM9ORQa6wvZMVrPECBvwItTY8cPWH3MGZiK/74eHbSLKA4PY3gM4GHI450Nie16yggEg2aTQfWA1rry9JYWEoHS9pJ1dnLqZU3k/8OWgqJrilwSoC5rGjgp93iu0H8T6+mEHGRQe84Nk1y5lESSWIbn6P636Bl3uQ== nelle@aquatic.edu
Copy the contents of the output.
Login to the remote server with your username and password.
$ ssh nelle@neptune.aquatic.edu
Password: ********
Paste the content you copied at the end of ~/.ssh/authorized_keys.
neptune> nano ~/.ssh/authorized_keys
After appending the content, log out of the remote machine and try logging in again. If you set up your SSH key correctly you won’t need to type your password.
neptune> exit
$ ssh nelle@neptune.aquatic.edu
Create an SSH key
Create an SSH key with the ssh-keygen command. Don’t add a passphrase to it at this stage; we’ll do that next.
Add (or change) a key’s passphrase
Add a passphrase to your key with the command ssh-keygen -p.
Authorising SSH keys
The example of copying our public key to a remote machine, so that it can then be used when we next SSH into that machine, assumed that we already had a ~/.ssh/ directory there. Whilst a remote server may support the use of SSH to login, your home directory on it may not contain a .ssh directory by default.
We have already seen that we can use SSH to run commands on remote machines, so we can ensure that everything is set up as required before we place the copy of our public key on a remote machine.
The long way to do this is to copy the contents of our SSH public key into the file .ssh/authorized_keys on the remote machine. The authorized_keys file can contain multiple keys, one on each line; each key typically represents a different computer we might connect from.
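Putting together the pieces we already know, that manual approach can be done in one line (a sketch, reusing the beagle server from the example below; the mkdir -p just ensures the .ssh directory exists first):
$ cat ~/.ssh/id_rsa.pub | ssh nelle@beagle "mkdir -p ~/.ssh && cat >> ~/.ssh/authorized_keys"
Password: ********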
SSH provides a convenient command to copy the key to a server, called ssh-copy-id. This will read our SSH public key and append it to the ~/.ssh/authorized_keys file on a remote machine, creating the file if it does not exist.
Firstly, let’s check if we have a .ssh/ directory on another remote machine, beagle:
$ ssh nelle@beagle "ls -ld ~/.ssh"
Password: ********
ls: cannot access /home/nelle/.ssh: No such file or directory
Oh dear, we don’t have a .ssh directory! We could create the directory ourselves and check that it’s there, but let’s have ssh-copy-id do the hard work for us:
$ ssh-copy-id nelle@beagle
Password: ********
Number of key(s) added: 1
Now try logging into the machine, with: "ssh 'nelle@beagle'"
and check to make sure that only the key(s) you wanted were added.
Let’s try logging in again; this time there should be no password prompt, and there should be a .ssh directory.
$ ssh nelle@beagle "ls -ld ~/.ssh"
drwxr----- 2 nelle nelle 512 Jan 01 09:09 /home/nelle/.ssh
Setup an SSH key for yourself
- Install it on a remote server using the ssh-copy-id command.
- Verify it works by running SSH and checking you aren’t prompted for a password.
- Verify you have a ~/.ssh/authorized_keys file, and display its contents using the cat command. How many keys does it contain?
Note: some systems are configured not to allow SSH keys, or to require a password AND an SSH key for extra security. Some also don’t allow password logins and will have another mechanism to load a key before you login for the first time; this is often via a web portal.
Key Points
SSH is a secure alternative to username/password authorization
SSH keys are generated in public/private pairs. Your public key can be shared with others; the private key stays only on your machine.
The ssh and scp utilities are secure alternatives to logging into, and copying files to/from, remote machines.
Working with archive files
Overview
Teaching: 20 min
Exercises: 30 min
Questions
How do I extract and compress archive files?
What is the difference between a tar and a zip file?
Objectives
Understand what archive files are.
Extract a zip archive.
Extract a tar archive.
Create a zip archive.
Create a tar archive.
Compress a tar archive.
Archive Files
There are many times when it is useful and/or convenient to store multiple files and directory structures inside a single file. Probably the three most common use cases are storing a set of files, copying them, and compressing them.
Archiving
We might want to store a collection of files for long-term preservation in a way that somebody else can easily obtain them with the download of a single file. Common platforms for this include the Zenodo archiving service or GitHub (through its releases feature).
File transfer
When copying a file to another computer, whether via email, a file transfer protocol such as SCP, or even a memory stick/card, it can often be convenient to copy just a single file rather than a whole set of files. This is especially true when there is a complex directory structure that goes with it.
Compression
Most archive formats at least have the option (some do it by default) to compress the data in the archive so that it takes less disk space and transfers faster over a network. These are usually “lossless” formats, which work by removing redundant data in files while allowing the data to be completely recreated, without losing anything, when uncompressed. This is in contrast to “lossy” formats such as JPEG (for images), MP3 (for audio) and MPEG (for video), which reduce file size by removing some information that is unlikely to be perceived by people. As a result, lossless compression works well on previously uncompressed data such as text, CSV or code files and raw images, but not very well on previously compressed files such as JPEG or PNG images, MP3 audio or MPEG video. Compression rates of 20-50% for text files are not uncommon, which can be a significant saving when storing or transferring data. Compression can also be very helpful when emailing files, as most email systems have a size limit of between 8 and 20 megabytes.
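As a rough illustration of lossless compression, gzip can be tried on one of Nelle’s text files (a sketch; the -k option, which keeps the original file, needs a reasonably recent gzip):
$ gzip -k NENE01729A.txt
$ ls -lh NENE01729A.txt NENE01729A.txt.gz   # compare original and compressed sizes
Decompressing the .gz file with gunzip would recreate the original byte for byte.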
Zip files
One of the most popular archive formats is the ZIP format, which is both a compression and an archiving format. ZIP files are more common in the Windows world than the Unix world, as they didn’t always support Unix file permission and ownership information.
Extracting a Zip file
In the last episode we copied a ZIP file called north-pacific-gyre.zip to a remote server using the scp command. We can now use the unzip command to list and extract the contents of that ZIP file.
First let’s connect to the remote system using SSH and check the file is there.
ssh nelle@backupserver
ls -lh north-pacific-gyre.zip
We should see our north-pacific-gyre.zip file listed, along with some metadata about it including who owns it, its size and its creation date/time.
-rw-rw-r-- 1 nelle nelle 41K Mar 17 18:33 north-pacific-gyre.zip
Now that we are sure the file is available to us, let’s have a look at what is inside it using the unzip -l command.
unzip -l north-pacific-gyre.zip
Archive: north-pacific-gyre.zip
Length Date Time Name
--------- ---------- ----- ----
0 2025-03-17 18:27 north-pacific-gyre/
0 2025-03-17 18:27 north-pacific-gyre/2012-07-03/
4400 2017-02-16 15:58 north-pacific-gyre/2012-07-03/NENE01729B.txt
4391 2017-02-16 15:58 north-pacific-gyre/2012-07-03/NENE02040A.txt
4406 2017-02-16 15:58 north-pacific-gyre/2012-07-03/NENE01729A.txt
4371 2017-02-16 15:58 north-pacific-gyre/2012-07-03/NENE01736A.txt
4393 2017-02-16 15:58 north-pacific-gyre/2012-07-03/NENE02043B.txt
4389 2017-02-16 15:58 north-pacific-gyre/2012-07-03/NENE01978B.txt
4401 2017-02-16 15:58 north-pacific-gyre/2012-07-03/NENE01812A.txt
3517 2017-02-16 15:58 north-pacific-gyre/2012-07-03/NENE02018B.txt
4381 2017-02-16 15:58 north-pacific-gyre/2012-07-03/NENE01978A.txt
4381 2017-02-16 15:58 north-pacific-gyre/2012-07-03/NENE02040Z.txt
4386 2017-02-16 15:58 north-pacific-gyre/2012-07-03/NENE02043A.txt
4375 2017-02-16 15:58 north-pacific-gyre/2012-07-03/NENE01843B.txt
4411 2017-02-16 15:58 north-pacific-gyre/2012-07-03/NENE01751A.txt
4395 2017-02-16 15:58 north-pacific-gyre/2012-07-03/NENE01843A.txt
4372 2017-02-16 15:58 north-pacific-gyre/2012-07-03/NENE01971Z.txt
4367 2017-02-16 15:58 north-pacific-gyre/2012-07-03/NENE02040B.txt
4409 2017-02-16 15:58 north-pacific-gyre/2012-07-03/NENE01751B.txt
219 2025-03-17 18:33 north-pacific-gyre/goostats
345 2025-03-17 18:33 north-pacific-gyre/goodiff
92 2025-03-17 18:33 north-pacific-gyre/do-stats.sh
--------- -------
74401 22 files
This shows that all the files in the archive are inside a directory called north-pacific-gyre. Inside this we have a subdirectory called 2012-07-03 with some text files in it, and in the main directory we have the goostats, goodiff and do-stats.sh programs/shell scripts.
Let’s go ahead and extract this ZIP file. To do that, we simply run the unzip command with the ZIP file name as an argument.
unzip north-pacific-gyre.zip
If we run the ls command after this, we should see that a north-pacific-gyre directory now exists. If we cd into it and run ls again, there should be a 2012-07-03 subdirectory and the goostats, goodiff and do-stats.sh programs.
ls
cd north-pacific-gyre
ls
Creating a ZIP file
We can create our own ZIP archives using the zip command. This takes the name of the ZIP file to create, followed by a list of filenames. We previously found that Nelle’s files are named after the machine used to process her samples: this is the “A” or “B” at the end of the filename, before the “.txt” extension. There are a few samples with a “Z” in the name, where it was not known which machine processed them. Let’s create a new archive in the north-pacific-gyre/2012-07-03 directory which contains just the files ending in “A” or “B”.
cd 2012-07-03
zip goodfiles.zip NENE*[AB].txt
This will display the names of all the files added to our new ZIP file and the amount of compression (deflation) that is applied.
adding: NENE01729A.txt (deflated 51%)
adding: NENE01729B.txt (deflated 51%)
adding: NENE01736A.txt (deflated 51%)
adding: NENE01751A.txt (deflated 51%)
adding: NENE01751B.txt (deflated 51%)
adding: NENE01812A.txt (deflated 51%)
adding: NENE01843A.txt (deflated 51%)
adding: NENE01843B.txt (deflated 51%)
adding: NENE01978A.txt (deflated 51%)
adding: NENE01978B.txt (deflated 51%)
adding: NENE02018B.txt (deflated 51%)
adding: NENE02040A.txt (deflated 51%)
adding: NENE02040B.txt (deflated 51%)
adding: NENE02043A.txt (deflated 51%)
adding: NENE02043B.txt (deflated 51%)
We can now verify that these files were added to our ZIP by running unzip -l on it:
unzip -l goodfiles.zip
Archive: goodfiles.zip
Length Date Time Name
--------- ---------- ----- ----
4406 2017-02-16 15:58 NENE01729A.txt
4400 2017-02-16 15:58 NENE01729B.txt
4371 2017-02-16 15:58 NENE01736A.txt
4411 2017-02-16 15:58 NENE01751A.txt
4409 2017-02-16 15:58 NENE01751B.txt
4401 2017-02-16 15:58 NENE01812A.txt
4395 2017-02-16 15:58 NENE01843A.txt
4375 2017-02-16 15:58 NENE01843B.txt
4381 2017-02-16 15:58 NENE01978A.txt
4389 2017-02-16 15:58 NENE01978B.txt
3517 2017-02-16 15:58 NENE02018B.txt
4391 2017-02-16 15:58 NENE02040A.txt
4367 2017-02-16 15:58 NENE02040B.txt
4386 2017-02-16 15:58 NENE02043A.txt
4393 2017-02-16 15:58 NENE02043B.txt
--------- -------
64992 15 files
Exercises
Adding and removing files to/from an existing ZIP
If the zip command is run multiple times it will update the files already in the ZIP, and any new files that have been specified will be added. If you run a zip command multiple times, you will notice that on the second/subsequent runs it says “updating” instead of “adding” next to the files which already exist in the zip file.
- Run the zip command from above (zip goodfiles.zip NENE*[AB].txt).
- Repeat the command but change the file name list to NENE*.txt. What is displayed when the files ending in “Z” are added? How does this differ from the other files?
- Verify the files ending in “Z” are now present using unzip -l.
- Look at the zip man page. How can you now delete the files ending in “Z” without rebuilding the entire ZIP file?
- Try the option you just found and verify the result with unzip -l.
Solution
zip goodfiles.zip NENE*[AB].txt
zip goodfiles.zip NENE*.txt      # should show "updating" next to A/B files and "adding" next to the Z files
unzip -l goodfiles.zip           # verify the addition
zip -d goodfiles.zip NENE*Z.txt  # -d deletes files from the zip
unzip -l goodfiles.zip           # verify the deletion
Varying the compression level
You can select the amount/speed of compression applied to a ZIP file when creating it. More compression should result in a smaller file, but the compression (and decompression) will take longer.
Open the zip man page and find the option which controls the compression speed.
Try compressing the text files in the 2012-07-03 directory with the different options. What difference does it make to the level of compression?
Find out how long it takes to compress at each level by prefixing the zip command with the time command, which measures how long a command takes to run (this gives three numbers; you want the “real” number). As the files are small there will be a lot of variation, so try it a few times.
time zip goodfiles.zip NENE*[AB].txt
Solution
The -0 to -9 options vary the compression level/speed: -0 gives no compression, -9 gives the most, and -6 is the default. File sizes vary between 66KB with no compression, 35KB at level 1 and 34KB beyond level 4. Times (on the author’s laptop) vary between 2 and 3 seconds at level 0 and 4-7 seconds at level 9.
The full command will be:
time zip -9 goodfiles.zip NENE*[AB].txt
Tar archives
An alternative to ZIP archives is the Tar archive. Tar stands for “tape archive” and was originally used to prepare a set of files to be written sequentially onto a magnetic tape for storage/backup purposes. The tar command originates in the Unix world and natively supports Unix file ownership/group information, symbolic links and permissions.
Creating a Tar file
Tar files are created and extracted with the tar command. To create one we use the -c or --create option. The name of the tar file is then specified with the -f or --file option. As with zip, we end the command with the list of files to place inside the archive. For example, to make a tar archive of our “good” files from Nelle’s dataset inside the 2012-07-03 directory we can run:
tar --create --file goodfiles.tar NENE*[AB].txt
or
tar -c -f goodfiles.tar NENE*[AB].txt
or for an even more compact version (note that “f” must come last in the combined options, since the file name follows it):
tar -cf goodfiles.tar NENE*[AB].txt
Unlike zip, tar does not give any output to confirm what it has done unless we add the -v or --verbose option. This will list the name of every file added to our archive.
tar -cvf goodfiles.tar NENE*[AB].txt
Listing the contents of a Tar file
We can list the contents of a Tar file by using the -t or --list option. We must still use the -f or --file option to specify the name of the tar file we are working with.
tar -tf goodfiles.tar
If we want an ls -l style output then we can add the -v or --verbose option:
tar -tvf goodfiles.tar
Compressing Tar files
Unlike the zip command, the tar command does not use any compression by default; instead, tar files can be compressed by another program. Common choices for this are gzip or bzip2, both of which compress a single file and append a .gz or .bz2 extension to the name. So you will often see tar files with an extension of .tar.gz (sometimes shortened to .tgz) or .tar.bz2. Modern versions of tar make this easier for us though: they can take an extra -z (or --gzip) option for gzip, or a -j (or --bzip2) option for bzip2.
tar -cvjf goodfiles.tar.bz2 NENE*[AB].txt
Extracting Tar files
We can extract the contents of a tar file with the -x or --extract option. As before, this needs to be combined with -f or --file to specify the file name, and optionally -v or --verbose if we want a list of the file names we are extracting. With older versions of tar we needed to add -z or -j to extract a compressed archive, but newer versions will automatically detect the compression and do the decompression for us.
tar -xvf goodfiles.tar.bz2
Comparing Bzip2, Gzip and Zip compression
Compress the entire north-pacific-gyre directory using the following formats:
- Uncompressed Tar
- Gzip compressed Tar
- Bzip2 compressed Tar
- Zip
Make sure you delete any zip/tar files from inside the 2012-07-03 directory first.
You will need to find an extra option for zip to make it recursively archive the directories of north-pacific-gyre; look in the man page to find this.
Compare the file size for each archiving method. Which is smallest? Which is largest?
Solution
cd ~/   # get back to the home directory; north-pacific-gyre should be a subdirectory of this
rm north-pacific-gyre/2012-07-03/*.zip north-pacific-gyre/2012-07-03/*.tar*   # remove any old files
tar -cvf north-pacific-gyre.tar north-pacific-gyre
tar -cvzf north-pacific-gyre.tar.gz north-pacific-gyre
tar -cvjf north-pacific-gyre.tar.bz2 north-pacific-gyre
zip -9 -r north-pacific-gyre.zip north-pacific-gyre
ls -lh north-pacific-gyre*.*
- north-pacific-gyre.tar - 90K
- north-pacific-gyre.tar.bz2 - 31K
- north-pacific-gyre.tar.gz - 36K
- north-pacific-gyre.zip - 41K
Key Points
Archive files are files which contain one or more other files. They are a convenient way to store or transfer multiple files and directory structures inside a single file.
Zip archives are (usually) compressed by default.
Zip files can be extracted with the unzip command.
Zip files can be created with the zip command.
Tar archives are not compressed by default.
Tar archives can be compressed using the gzip or bzip2 utilities.
Tar files can be extracted with the tar -xf option.
Modern versions of tar will automatically uncompress gzipped or bzipped archives.
Tar files can be created with the tar -cf option.
Transferring Files
Overview
Teaching: 25 min
Exercises: 10 min
Questions
How do I use Wget, cURL and rsync to transfer files?
Objectives
To know different ways to interact with remote files
There are ways to interact with remote files other than scp.
Wget
Wget is a simple tool developed for the GNU Project that downloads files with the HTTP, HTTPS and FTP protocols. It is widely used by Unix-like users and is available with most Linux distributions.
To download this lesson (located at https://noc-oi.github.io/shell-extras/04-file-transfer/index.html) from the web via HTTP we can simply type:
$ wget https://noc-oi.github.io/shell-extras/04-file-transfer/index.html
--2021-05-29 02:12:18-- https://noc-oi.github.io/shell-extras/04-file-transfer/index.html
Resolving noc-oi.github.io (noc-oi.github.io)... 185.199.111.153, 185.199.110.153, 185.199.109.153, ...
Connecting to noc-oi.github.io (noc-oi.github.io)|185.199.111.153|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 22467 (22K) [text/html]
Saving to: ‘index.html’
index.html 100%[===================>] 21.94K --.-KB/s in 0.003s
2021-05-29 02:12:19 (6.35 MB/s) - ‘index.html’ saved [22467/22467]
Alternatively, you can add more options, which are in the form:
wget -r -np -D domain_name target_URL
where -r means recursively crawl to other files and directories, -np means avoid crawling to parent directories, and -D means to target only the following domain name.
For our URL it would be:
$ wget -r -np -D noc-oi.github.io https://noc-oi.github.io/shell-extras
To restrict retrieval to particular extension(s), we can use the -A option followed by a comma-separated list:
wget -r -np -D noc-oi.github.io -A html https://noc-oi.github.io/shell-extras/04-file-transfer/index.html
We can also clone a webpage with its local dependencies:
$ wget -mkq target_URL
We could also clone the entire website:
$ wget -mkq -np -D domain_name domain_name_URL
and add the -nH option if we do not want a subdirectory created for the website’s content, e.g.:
$ wget -mkq -np -nH -D example.com http://example.com
where:
-m is for mirroring with time stamping, infinite recursion depth, and preservation of FTP directory settings
-k converts links to make them suitable for local viewing
-q suppresses the output to the screen
The above command can also clone the contents of one domain to another if we are using ssh or sshfs to access a webserver.
Please refer to the man page by typing man wget in the shell for more information.
cURL
Alternatively, we can use cURL. It supports a much larger range of protocols, including common mail-based protocols like POP3 and SMTP.
To download this lesson (located at https://noc-oi.github.io/shell-extras/04-file-transfer/index.html) from the web via HTTP we can simply type:
$ curl -o index.html https://noc-oi.github.io/shell-extras/04-file-transfer/index.html
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 14005 100 14005 0 0 35170 0 --:--:-- --:--:-- --:--:-- 105k
This input to curl is in the form:
curl -o filename_for_local_machine target_URL
where the -o option says to write the output to a file instead of stdout (the screen), filename_for_local_machine is any file name you choose for saving to the local machine, and target_URL is the URL where the file is on the web.
Removing the -o option and following the syntax curl target_URL outputs the contents of the URL to the screen.
If we wanted to enhance the functionality, we could use information from the pipes and filters section (lesson 4 of the Unix shell lesson). For example, we could type
curl https://noc-oi.github.io/shell-extras/04-file-transfer/index.html | grep curl
which would tell us that this URL does indeed contain the string curl. We could make the output cleaner by limiting the output of curl to just the file contents using the -s option (e.g. curl -s https://noc-oi.github.io/shell-extras/04-file-transfer/index.html | grep curl). If we wanted only the text, and not the HTML tags, in our output, we could use an HTML-to-text parser such as html2text.
$ curl -s https://noc-oi.github.io/shell-extras/04-file-transfer/index.html | html2text | grep curl
With wget, we can obtain the same results by typing:
$ wget -q -D noc-oi.github.io -O /dev/stdout https://noc-oi.github.io/shell-extras/04-file-transfer/index.html | html2text | grep curl
wget offers more functionality natively than curl for retrieving entire directories: we could use wget to first retrieve an entire directory, and then run html2text and grep to find a particular string. cURL is limited to retrieving one or more specified URLs, and cannot recursively crawl a directory. The situation can be improved by combining it with other Unix tools, but it is generally not regarded as being as good as wget for this purpose.
Please refer to the man pages by typing man wget, man curl, and man html2text in the shell for more information.
Continuing a stopped download
Start a download of a large file (e.g. https://www.zenodo.org/record/5307070/files/S-only-10000x.tar.gz, a file from a dataset of simulated wastewater sequencing data for SARS-CoV-2) using Wget, and stop the download before it has finished by pressing the ‘Ctrl’ and ‘C’ keys together. This will leave a partially downloaded file on your computer.
Open the Wget man page by running man wget and find the option to continue a partial download. Resume your download with this option.
Solution
The -c or --continue option will tell Wget to resume a partial download.
wget -c https://www.zenodo.org/record/5307070/files/S-only-10000x.tar.gz
Download an additional dataset for Nelle
Nelle has another dataset to process, from July 4th 2012. It is located online at https://noc-oi.github.io/shell-extras/data/north-pacific-gyre-2012-07-04.zip. Login to a remote system over SSH and download this file there using either Wget or cURL. Then extract the data from this file using the unzip command.
Solution
ssh nelle@neptune.aquatic.edu
wget https://noc-oi.github.io/shell-extras/data/north-pacific-gyre-2012-07-04.zip
unzip north-pacific-gyre-2012-07-04.zip
Rsync
Rsync is a utility for synchronising directories between computers (and on the same computer). It can use the SSH protocol for copying to a remote computer, but also has its own (less commonly used) file transfer protocol.
Rsync Syntax
The -a option to rsync specifies that we want to use “archive” mode, which will set several other options to make the copy mirror the names and permissions of the source directory. The -v option enables verbose mode to tell us more about what is being copied. To use rsync over ssh we need to specify “-e ssh”. Finally we give the source and destination directories, just like the cp or scp command.
rsync -a -v -e ssh 04-file-transfer nelle@neptune.aquatic.edu:
Why use rsync instead of scp?
Rsync only transfers files (or parts of files) if they don’t exist in the destination directory. This means that if a transfer is stopped for any reason, when you resume it won’t copy things that were already copied. Scp does not do this and will start the transfer again. If you are copying large files that take many days/hours there is a chance your transfer might be interrupted at some point and you don’t want to have to repeat what you’ve already done when resuming it.
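To see this in practice, the same rsync command can simply be re-run after an interruption; only the data that didn’t make it across the first time is sent again (the server name is reused from the earlier example):
$ rsync -a -v -e ssh north-pacific-gyre nelle@neptune.aquatic.edu:
# ... interrupt the transfer part-way through with Ctrl-C ...
$ rsync -a -v -e ssh north-pacific-gyre nelle@neptune.aquatic.edu:
# the second run only sends the files (or parts of files) still missing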
Key Points
wget is the default tool, available in most Linux distributions, for downloading files from web and FTP servers.
curl is another utility for downloading remote webpages. It defaults to outputting the result to the screen, and this can be piped to other programs.
rsync is a utility for transferring files. It can use the SSH protocol and is useful for mirroring complicated directory structures from one computer to another.
Disk Usage
Overview
Teaching: 10 min
Exercises: 10 min
Questions
How do we find out how big a directory is on the command line?
How do we find out how much space is left on a disk from the command line?
Objectives
Understand that the du command tells us how much disk space a directory uses.
Understand that the df command tells us how much space is free on a particular disk.
Measuring Disk Usage
How big is that directory?
Now that Nelle has two datasets (one for 2012-07-03 and one for 2012-07-04) on her computer, she is wondering how much disk space they are using. The du command is useful here, as it tells us how much disk space is used by an entire directory, all its subdirectories and all the files they contain. If we run it in the directory above north-pacific-gyre, then the command
du north-pacific-gyre
will tell us how big the entire north-pacific-gyre directory is (by default, in 1024-byte kilobyte units).
Reading raw numbers becomes difficult when we get into the range of megabytes (millions of bytes), and certainly when it is gigabytes or more. Fortunately du has a “human readable” option, which uses units of kilobytes/megabytes/gigabytes/terabytes etc. with the K/M/G/T suffixes. If we repeat our command with the -h option then we will get these suffixes.
du -h north-pacific-gyre
Marketing Kilobytes vs Kilobytes
Traditionally a kilobyte was defined as 1024 bytes (2 to the power 10), a megabyte as 1024 kilobytes, a gigabyte as 1024 megabytes, and so on. But this is often approximated to 1000 bytes in a kilobyte, etc. At smaller scales the differences are quite small, but they multiply with each order of magnitude. The large power-of-2 units are known as kibi/mebi/gibi/tebibytes, abbreviated KiB/MiB/GiB/TiB, and the power-of-10 versions as KB/MB/GB/TB.
The list below shows how these numbers compare as we move up the scale:
- 1,024B = 1 KiB = 1.024 KB
- 1,048,576B = 1 MiB = 1.049 MB
- 1,073,741,824B = 1 GiB = 1.074 GB
- 1,099,511,627,776B = 1 TiB = 1.1 TB
As we can see, by the time we get into the terabyte range there is almost a 10% discrepancy between the number of bytes in a terabyte and a tebibyte. When you are selling storage, being able to claim that you have a 1.1TB disk instead of a 1TB disk can be quite a marketing advantage; this has led to the term “Marketing Mega(Giga|Tera)bytes”. The du command defaults to using 1024-byte kilobytes (kibibytes), but if we want 1000-byte kilobytes then we can add the option --si.
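For example, on systems with GNU du we can compare the two conventions directly (the output will depend on your data):
$ du -sh north-pacific-gyre      # human readable, powers of 1024
$ du -s --si north-pacific-gyre  # human readable, powers of 1000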
Explore the -s option to du
Try out the -s option to du. Find out what it does from the man or help page. When/why might this option be useful?
Solution
This option shows a summary of how much disk space is used by the entire directory, without telling us any information about each subdirectory. This can be useful when we don’t want all the information about the subdirectories and just want the total; when there are a lot of subdirectories it also produces much shorter output.
How much disk space do we have?
The du command is great for telling us how much space we’ve used in a given directory, but it doesn’t tell us how much free space we have. For that we have another command called df, which is short for “disk free”. With no arguments this will tell us how much free space we have on every disk mounted on the system. Like du, it has a -h option for human readable formats.
df -h
On a lot of shared systems, such as High Performance Computing systems, it is common for each user to receive a quota for their home directory (and possibly some other directories). This limits how much they can use, even if there is plenty more space on the disk. Running df on such a system will return how much space is free on the entire disk, not for the current user. On many systems the quota command will tell you how much space is left in your disk quota. The quota command defaults to displaying disk usage in a unit of “blocks”; these are usually 1KB each. Like the df and du commands there is a human readable option, but this time it is -s, not -h.
quota -s
Key Points
The du command tells us how much disk space a directory is using.
The -h option to du gives us human readable units such as K, M and G.
The df command tells us how much space is in use on a disk.
The df command can also take a -h option for human readable units.
On some shared systems the quota command tells us how much space is left in our disk allocation.
Permissions
Overview
Teaching: 25 min
Exercises: 15 min
Questions
How do file and directory permissions work?
Objectives
Understand what file/directory permissions are.
View permissions.
Change permissions.
Understand how file/directory permissions differ in Windows.
Unix controls who can read, modify, and execute files using permissions. We’ll also discuss Windows permissions later; the concepts are similar, but the implementation differs.
Let’s start with a normal user called Nelle.
She has a unique user name (e.g., nelle) and a user ID (a unique number, like 1404).
Why Integer IDs?
Why integers for IDs? Again, the answer goes back to the early 1970s. Character strings like alan.turing are of varying length, and comparing one to another takes many instructions. Integers, on the other hand, use a fairly small amount of storage (typically four bytes), and can be compared with a single instruction. To make operations fast and simple, programmers often keep track of things internally using integers, then use a lookup table of some kind to translate those integers into user-friendly text for presentation. Of course, programmers being programmers, they will often skip the user-friendly string part and just use the integers, in the same way that someone working in a lab might talk about Experiment 28 instead of “the chronotypical alpha-response trials on anacondas”.
Users can belong to any number of groups, each of which has a unique group name and numeric group ID. The list of who’s in what group is usually stored in the file /etc/group. To see all the groups on a Unix system, you can run:
cat /etc/group
Now let’s look at files and directories. Every file and directory on a Unix computer belongs to one owner and one group. Along with each file’s content, the operating system stores the numeric IDs of the user and group that own it.
The user-and-group model means that for each file every user on the system falls into one of three categories:
- Owner - the owner of the file (User, u)
- Group - someone in the file’s group (g)
- Others - everyone else (o)
For each of these three categories, the computer keeps track of whether people in that category can read the file, write to the file, or execute the file (i.e., run it if it is a program).
For example, if a file had the following set of permissions:
| | user | group | all |
|---|---|---|---|
| read (r) | yes | yes | no |
| write (w) | yes | no | no |
| execute (x) | no | no | no |
it would mean that:
- the file’s owner can read and write it, but not run it;
- other people in the file’s group can read it, but not modify it or run it; and
- everybody else can do nothing with it at all.
Let’s look at this model in action. First, let’s download some sample data to test and check permissions with. Please run:
# to download the data
wget https://noc-oi.github.io/shell-extras/data/labs.zip
# to unzip the data
unzip labs.zip
If we cd
into the labs
directory and run ls -F
,
it puts a *
at the end of setup
’s name.
This is its way of telling us that setup
is executable,
i.e.,
that it’s (probably) something the computer can run.
cd labs
ls -F
final.grd safety.txt setup* waiver.txt
Necessary But Not Sufficient
The fact that something is marked as executable doesn’t actually mean it contains a program of some kind. We could easily mark this HTML file as executable using the commands that are introduced below. Depending on the operating system we’re using, trying to “run” it will either fail (because it doesn’t contain instructions the computer recognizes) or cause the operating system to open the file with whatever application usually handles it (such as a web browser).
Now let’s run the command ls -l
:
ls -l
-rwxrwxrwx 1 nelle bio 4215 2010-07-23 20:04 final.grd
-rw-rw-r-- 1 nelle bio 1158 2010-07-11 08:22 safety.txt
-rwxr-xr-x 1 nelle bio 31988 2010-07-23 20:04 setup
-rw-rw-r-- 1 nelle bio 2312 2010-07-11 08:23 waiver.txt
The -l
flag tells ls
to give us a long-form listing.
It’s a lot of information, so let’s go through the columns in turn.
On the right side, we have the files’ names. Next to them, moving left, are the times and dates they were last modified. Backup systems and other tools use this information in a variety of ways, but you can use it to tell when you (or anyone else with permission) last changed a file.
Next to the modification time is the file’s size in bytes
and the names of the user and group that own it
(in this case, nelle
and bio
respectively).
We’ll skip over the second column for now
(the one showing 1
for each file)
because it’s the first column that we care about most.
This shows the file’s permissions, i.e., who can read, write, or execute it.
Let’s have a closer look at one of those permission strings:
-rwxr-xr-x
.
The first character tells us what type of thing this is:
‘-’ means it’s a regular file,
while ‘d’ means it’s a directory,
and other characters mean more esoteric things.
The next three characters tell us what permissions the file’s owner has.
Here, the owner can read, write, and execute the file: rwx
.
The middle triplet shows us the group’s permissions.
If the permission is turned off, we see a dash, so r-x
means “read and execute, but not write”.
The final triplet shows us what everyone who isn’t the file’s owner, or in the file’s group, can do.
In this case, it’s ‘r-x’ again, so everyone on the system can look at the file’s contents and run it.
To change permissions, we use the chmod
command
(whose name stands for “change mode”).
Here’s a long-form listing showing the permissions on the final grades
in the course Nelle is teaching:
ls -l final.grd
-rwxrwxrwx 1 nelle bio 4215 2010-08-29 22:30 final.grd
Whoops: everyone in the world can read it—and what’s worse, modify it! (They could also try to run the grades file as a program, which would almost certainly not work.)
The command to change the owner’s permissions to rw-
is:
chmod u=rw final.grd
The ‘u’ signals that we’re changing the privileges
of the user (i.e., the file’s owner),
and rw
is the new set of permissions.
A quick ls -l
shows us that it worked,
because the owner’s permissions are now set to read and write:
ls -l final.grd
-rw-rwxrwx 1 nelle bio 4215 2010-08-30 08:19 final.grd
Let’s run chmod
again to give the group read-only permission:
chmod g=r final.grd
ls -l final.grd
-rw-r--rw- 1 nelle bio 4215 2010-08-30 08:19 final.grd
And finally, let’s give “others” (everyone on the system who isn’t the file’s owner or in its group) no permissions at all:
chmod o= final.grd
ls -l final.grd
-rw-r----- 1 nelle bio 4215 2010-08-30 08:20 final.grd
Here, the ‘o’ signals that we’re changing permissions for “others”, and since there’s nothing on the right of the “=”, “others”’s new permissions are empty.
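Several such clauses can also be combined in a single chmod call, separated by commas. For example, all three of the changes above could have been made in one go:
chmod u=rw,g=r,o= final.grd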
Alternatively, you can also use numeric notation:
chmod 640 final.grd # Equivalent to rw-r-----
This sets:
- Owner: read (
r
) and write (w
). - Group: read (
r
) only. - Others: No permissions.
This is the meaning of the numbers (each digit is the sum of read = 4, write = 2, and execute = 1):
Number | Meaning |
---|---|
7 | read, write, and execute |
6 | read and write |
5 | read and execute |
4 | read only |
3 | write and execute |
2 | write only |
1 | execute only |
0 | none |
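For example, to give the owner everything (4+2+1 = 7), the group read and execute (4+1 = 5), and others nothing (0), we could run:
chmod 750 setup
A subsequent ls -l setup should then show something like:
-rwxr-x--- 1 nelle bio 31988 2010-07-23 20:04 setup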
We can search by permissions, too.
Here, for example, we can use -type f -perm -u=x
to find files
that the user can execute:
find . -type f -perm -u=x
./setup
Before we go any further,
let’s run ls -a -l
to get a long-form listing that includes directory entries that are normally hidden:
ls -a -l
drwxr-xr-x 1 nelle bio 0 2010-08-14 09:55 .
drwxr-xr-x 1 nelle bio 8192 2010-08-27 23:11 ..
-rw-r----- 1 nelle bio 4215 2010-08-30 08:20 final.grd
-rw-rw-r-- 1 nelle bio 1158 2010-07-11 08:22 safety.txt
-rwxr-xr-x 1 nelle bio 31988 2010-07-23 20:04 setup
-rw-rw-r-- 1 nelle bio 2312 2010-07-11 08:23 waiver.txt
The permissions for .
and ..
(this directory and its parent) start with a ‘d’.
But look at the rest of their permissions:
the ‘x’ means that “execute” is turned on.
What does that mean?
A directory isn’t a program—how can we “run” it?
In fact, ‘x’ means something different for directories.
It gives someone the right to traverse the directory, but not to look at its contents.
The distinction is subtle, so let’s have a look at an example.
Nelle’s home directory has three subdirectories called taiti
, vanuatu
, and tonga
:
- 📁 nelle
  - 📁 taiti r-x
    - 📁 notes
  - 📁 vanuatu r--
    - 📁 notes
  - 📁 tonga --x
    - 📁 notes
Each of these has a subdirectory in turn called notes
,
and those sub-subdirectories contain various files.
If a user’s permissions on taiti
are ‘r-x’,
then if she tries to see the contents of taiti
and taiti/notes
using ls
,
the computer lets her see both.
If her permissions on vanuatu
are just ‘r--’,
then she can list the names of the things inside vanuatu,
but she cannot get into vanuatu/notes
or read anything inside it, because descending into a directory requires the ‘x’ permission.
But if her permissions on tonga
are only ‘--x’,
she cannot see what’s in the tonga
directory:
ls tonga
will tell her she doesn’t have permission to view its contents.
If she tries to look in tonga/notes
, though, the computer will let her do that.
She’s allowed to go through tonga
, but not to look at what’s there.
This trick gives people a way to make some of their directories visible to the world as a whole
without opening up everything else.
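A short demonstration of the difference, run from Nelle’s home directory (the file names under notes are hypothetical):
ls tonga
ls: cannot open directory 'tonga': Permission denied
ls tonga/notes
day1.txt  day2.txt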
Shebang
A shebang is the #! syntax used on the first line of a script to tell Unix/Linux operating systems which interpreter should execute it. For shell scripts, we can use two different approaches,
#!/bin/bash
or,
#!/usr/bin/env bash
at the top of the script. The second approach is more portable and recommended.
For instance, check the file_info.sh
script in the code
directory.
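If you don’t have that file to hand, a minimal sketch of what such a script might contain (the real file_info.sh may differ) is:
#!/usr/bin/env bash
# Print some basic information about the file named by the first argument
echo "Name:  $1"
echo "Lines: $(wc -l < "$1")"
echo "First line:"
head -n 1 "$1"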
First, after creating or downloading the script, we need to make it executable using the chmod
command.
chmod u+x file_info.sh
The u+x
option gives the user (the file’s owner) permission to execute the script.
Then we can run the script using the following command:
./file_info.sh example.txt
A shebang is necessary if we want to run the script directly, without explicitly telling Unix what the interpreter is.
We can still run a script without a shebang by invoking the interpreter ourselves,
e.g., bash file_info.sh example.txt
. If we try to run the script directly but execute permission has not been given, it will not run at all (we get a “Permission denied” error).
Add shebang and execute permission to Nelle’s scripts
Nelle has three scripts in the north-pacific-gyre directory called do-stats.sh, goostats and goodiff.
Edit each of these using nano (or your text editor of choice) and add #!/bin/bash at the start. Give the owner and group execute permission on these scripts too.
Previously we ran do-stats.sh by using the command:
bash ./do-stats.sh <filename>
How can we run the script now?
Solution
nano do-stats.sh
nano goostats
nano goodiff
chmod ug+x do-stats.sh
chmod ug+x goo*
This means we can now start the script without the
bash
command. For example:
./do-stats.sh <filename>
User Groups and Members
In Linux, users can belong to one or more groups, which help define access control for files and directories.
Each user has a primary group and can be part of additional groups. Groups are defined in the /etc/group
file.
To list a user’s group memberships, run:
groups username
For example, running groups nelle
might return:
nelle : nelle developers data-science
This means that Nelle belongs to the developers
and data-science
groups in addition to her own user group.
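The id command reports the same memberships together with the numeric IDs behind them (the group numbers below are illustrative):
id nelle
uid=1404(nelle) gid=1404(nelle) groups=1404(nelle),1010(developers),1011(data-science)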
To view all groups on the system, use:
cat /etc/group
Adding a user to a group requires administrative privileges. The command to add a user to a group is:
sudo usermod -aG group_name username
For example, to add nelle
to the sysadmin
group:
sudo usermod -aG sysadmin nelle
To verify the change, log out and log back in, then rerun groups
.
To remove a user from a group:
sudo gpasswd -d username group_name
For example:
sudo gpasswd -d nelle developers
Access Control Lists (ACLs)
You can also use Access Control Lists (ACLs) to grant permissions to specific users or groups. It gives you finer-grained control and flexibility over file permissions. However, it is more complex to administer and understand on small systems (If you have a large computer system, nothing is easy to administer or understand.) For example, you could give the Mummy permission to append data to a file without giving him permission to read or delete it, and give Frankenstein permission to delete a file without being able to see what it contains.
Some modern variants of Unix support ACLs as well as the older read-write-execute permissions, but hardly anyone uses them.
Example of ACLs commands in UNIX
- Set ACL for a specific user:
setfacl -m u:alex:rwx filename
- Set ACL for a group:
setfacl -m g:groupname:r filename
- View ACLs on a file:
getfacl filename
- Remove an ACL entry:
setfacl -x u:username filename
What about Windows?
In Windows, permissions are defined by access control lists.
As in Unix, each file and directory has an owner and a group. However, Windows permissions are more complex than Unix permissions.
To view and change permissions in Windows:
- File Explorer: Right-click on a file, select
Properties
, and go to theSecurity
tab. - Command Line: Use the
icacls
command.
Challenge
If ls -l myfile.php returns the following details:
-rwxr-xr-- 1 caro zoo 2312 2014-10-25 18:30 myfile.php
Which of the following statements is true?
- caro (the owner) can read, write, and execute myfile.php
- caro (the owner) cannot write to myfile.php
- members of caro (a group) can read, write, and execute myfile.php
- members of zoo (a group) cannot execute myfile.php
Key Points
Correct permissions are critical for the security of a system.
File permissions describe who and what can read, write, modify, and access a file.
Use ls -l to view the permissions for a specific file.
Use chmod to change permissions on a file or directory.
Use chmod 777 carefully, as it grants all permissions to everyone.
Access Control Lists (ACLs) provide finer-grained permission control.
Processes and Job Control
Overview
Teaching: 20 min
Exercises: 10 min
Questions
How do I keep track of the processes running on my machine?
Can I run more than one program/script from within a shell?
Objectives
Learn how to use ps to get information about the state of processes
Learn how to control, i.e., stop/pause/background/foreground, processes
With her two days’ worth of data, Nelle now has over 1500 files to process and this is going to take her a while. She would like to monitor what is running and to be able to work on some other things while this runs in the background.
Job Control
We’ll now take a look at how to control programs once they’re running. This is called job control.
When we talk about controlling programs, what we really mean is
controlling processes. A process is just a program
that’s in memory and executing. Some of the processes on your computer
are yours: they’re running programs you explicitly asked for, like
Nelle’s do-stats
script. Many others belong to the operating system that
manages your computer for you, or, if you’re on a shared machine, to other
users.
The ps
command
You can use the ps
command to list processes, just as you use ls
to list files and directories.
Behaviour of the ps command
The ps command has a swathe of option flags that control its behaviour and, what’s more, the sets of flags and default behaviour vary across different platforms.
A bare invocation of ps only shows you basic information about your active processes.
Beyond that, this is a command for which it is worth reading the man page.
$ ps
PID TTY TIME CMD
12767 pts/0 00:00:00 bash
15283 pts/0 00:00:00 ps
At the time you ran the ps
command, you had
two active processes, your (bash
) shell and the (ps
) command
you had invoked in it.
Chances are that you were aware of that information, without needing to run a command to tell you it, so let’s try and put some flesh on that bare bones information.
$ ps -f
UID PID PPID C STIME TTY TIME CMD
nelle 12396 25397 0 14:28 pts/0 00:00:00 ps -f
nelle 25397 25396 0 12:49 pts/0 00:01:39 bash
In case you haven’t had time to do a man ps
yet, be aware that
the -f
flag doesn’t stand for “flesh on the bones” but for
“Do full-format listing”, although even then, there are “fuller”
versions of the ps
output.
What is ps
telling us?
Every process has a unique process id (PID). Remember, this is a property of the process, not of the program that process is executing: if you are running three instances of your browser at once, each will have its own process ID.
The third column in this listing, PPID, shows the ID of each process’s parent. Every process on a computer is spawned by another, which is its parent (except, of course, for the bootstrap process that runs automatically when the computer starts up).
Clearly, the ps -f
that was run is a child process of the (bash
)
shell it was invoked in.
Column 1 shows the username of the user the processes are being run by. This is the username the computer uses when checking permissions: each process is allowed to access exactly the same things as the user running it, no more, no less.
Column 5, STIME, shows when the process started running, whilst Column 7, TIME, shows how much processor time the process has used, and Column 8, CMD, shows what program the process is executing.
Column 6, TTY, shows
the ID of the terminal this process is running in. Once upon a time,
this really would have been a terminal connected to a central timeshared
computer. It isn’t as important these days, except that if a process is
a system service, such as a network monitor, ps
will display a
question mark for its terminal, since it doesn’t actually have one.
The fourth column, C, is an indication of the perCentage of processor utilization.
Your version of ps
may
show more or fewer columns, or may show them in a different order, but
the same information is generally available everywhere, and the column
headers are generally consistent.
Stopping, pausing, resuming, and backgrounding processes
The shell provides several commands for stopping, pausing, and resuming
processes. To see them in action, let’s run our do-stats.sh
script on our
latest data files. After a few minutes go by, we realize that this is
going to take a while to finish. Being impatient, we kill the process by
pressing the Control and C keys at the same time. This stops the
currently-executing program right away. Any results it had calculated, but not
written to disk, are lost.
cd north-pacific-gyre
./do-stats.sh 2012-07-03/NENE*txt
...a few minutes pass...
^C
Let’s run that same command again, with an ampersand &
at the end of
the line to tell the shell we want it to run in the
background:
./do-stats.sh 2012-07-03/NENE*txt &
When we do this, the shell launches the program as before, but instead of
connecting our keyboard to the program’s standard
input, the shell keeps hold of it. The program’s output
will still appear on the screen, which, depending on how frequent it is, can be quite
annoying or quite useful. This means the shell can give us a fresh command
prompt, and start running other commands, right away. For example, we can run
the ls
command in the 2012-07-03
directory to see how many of the results
(.stats) files have been created.
ls 2012-07-03
Now let’s run the jobs
command, which tells us what processes are currently
running in the background:
$ jobs
[1]+ Running ./do-stats.sh 2012-07-03/NENE*txt &
Since we’re about to go and get coffee, we might as well use the
foreground command, fg
, to bring our background job into the
foreground:
$ fg
...a few minutes pass...
When do-stats.sh
finishes running, the shell will give us a fresh prompt as
usual. If we had several jobs running in the background, we could
control which one we brought to the foreground using fg %1
, fg %2
,
and so on. The IDs are not the process IDs. Instead, they are the job
IDs displayed by the jobs
command.
The shell gives us one more tool for job control: if a process is
already running in the foreground, Control-Z will pause it and return
control to the shell. We can then use fg
to resume it in the
foreground, or bg
to resume it as a background job. For example, let’s
run do-stats.sh
again, and then type Control-Z. The shell immediately
tells us that our program has been stopped, and gives us its job number:
./do-stats.sh 2012-07-03/NENE*txt
^Z
[1]+ Stopped ./do-stats.sh 2012-07-03/NENE*txt
If we type bg %1
, the shell starts the process running again, but in
the background. We can check that it’s running using jobs
, and kill it
while it’s still in the background using kill
and the job number. This
has the same effect as bringing it to the foreground and then typing
Control-C:
$ bg %1
$ jobs
[1]+ Running ./do-stats.sh 2012-07-03/NENE*txt &
$ kill %1
Job control was important when users only had one terminal window at a time. It’s less important now: if we want to run another program, it’s easy enough to open another window and run it there. However, these ideas and tools are making a comeback, as they’re often the easiest way to run and control programs on remote computers elsewhere on the network. This lesson’s ssh episode has more to say about that.
Killing by process ID
The kill
command can also take a process ID that we can discover using the
ps
command. Let’s launch our do-stats.sh
process again, background it
and discover its process ID.
./do-stats.sh 2012-07-03/NENE*.txt &
ps -f
UID PID PPID C STIME TTY TIME CMD
nelle 1551177 15724 0 Mar14 pts/11 00:00:01 /bin/bash
nelle 1982690 1551177 0 00:28 pts/11 00:00:00 bash ./do-stats.sh 2012-07-03/NENE01729A.txt 2012-07-03/NENE01729B.txt 2012-07-03/NENE01736A.txt 2012-07-03/NENE01751A.txt
nelle 1982694 1982690 0 00:28 pts/11 00:00:00 bash goostats 2012-07-03/NENE01729A.txt 2012-07-03/NENE01729A.txt.stats
nelle 1982698 1982694 0 00:28 pts/11 00:00:00 sleep 2
nelle 1982702 1551177 0 00:28 pts/11 00:00:00 ps -f
We can see several processes in the list now; the first one
(bash ./do-stats.sh
) is the one we started from the command line. We can tell
this because its PPID is that of the bash
process that is giving us the command
prompt. The third one is the goostats
process launched by do-stats.sh
and the fourth is a sleep
process (which just waits the specified number of
seconds) that is used by goostats
. If we want to stop the whole pipeline
then we want to kill the do-stats.sh
process, which in the above output
has the PID 1982690. We can give this PID to the kill
command.
kill 1982690
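By default, kill politely asks the process to terminate. If a process ignores that request, a stronger signal that cannot be ignored can be sent with kill -9; use this as a last resort, since the process gets no chance to clean up:
kill -9 1982690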
Killing by process name
Instead of looking up the process ID, we can also kill a process by its name
using the killall
command. As the name suggests, this will match all processes
with the specified name. In our case killall do-stats.sh
will stop our
pipeline. We need to be careful with killall
, running killall bash
would
kill every bash
process on our system (at least those owned by the current user);
if we had several terminals open, they would all be killed by this.
Tmux: A more advanced way to background processes
A more advanced way to background processes is to use the tmux
command.
This allows us to detach completely from the process, keep it running in the
background and capture its output in a way that we can get back to later.
Unlike using & or bg, it doesn’t put any output onto our terminal while we
are detached. If we are running a process on a remote system using SSH, it
can also keep running after we log out and close our SSH connection. This
does not happen when simply backgrounding a process: the SSH session is a
parent of the bash process which launched the command, so when we exit the
SSH session all of its child processes are (eventually) killed.
Launching Tmux
To launch a tmux
session simply type:
tmux
Note that tmux
is not always installed on every system and you might need to
install it or ask your system administrator to do so.
When tmux
launches it will clear the screen and a bar with the name of the
current process, the username, hostname, time and date will appear at the bottom.
We can now start a process such as our pipeline inside tmux
:
./do-stats.sh 2012-07-03/NENE*.txt
We can now disconnect from our tmux
session by pressing the
Control and B keys and then pressing the d key (for detach). Our process is
left running inside tmux
but none of its output is displayed on the screen.
Reconnecting Tmux
To reconnect to tmux
we need to run it with the attach command:
tmux attach
Exiting Tmux
To end a tmux
session we use the exit
command to close down tmux
completely.
Screen - An alternative to Tmux
GNU Screen is a similar program to tmux
that has the same basic functionality. Some older systems might have screen
installed instead of tmux
. The basic
keys are the same, except you press Control and A instead of Control and B. The
command to reattach is screen -r
instead of tmux attach
.
Top - A more advanced way to monitor processes
The ps
command is very useful for monitoring what processes are running,
but sometimes we have situations that are rapidly changing and by the time
we’ve typed ps
the process we are interested in has exited. The top
command is like running ps
continuously. It also supports sorting the
process list by a number of attributes including how much CPU time or memory a
process is using. This can be very useful when trying to find which process(es)
are using the most resources. To launch top
, simply run the command:
top
Which field we are sorting by can be controlled by pressing the >
or <
keys. To exit, press the “q” key.
Htop - a nicer version of top
A newer and easier to use version of top
is called htop
. This uses the
“F” keys on your keyboard to operate it and is a bit more intuitive than top
.
htop
is not always installed by default and you might need to install it or
ask your system administrator to install it.
Find the PID of tmux
Start a tmux session and then detach from it by pressing Control-B and then pressing d. Now use ps to find out the PID of your tmux. You will need some extra options you haven’t used before. Hint: the correct option might return a lot of processes; you can filter them by piping the output to the grep tmux command.
Solution
ps -A | grep tmux
or
ps aux | grep tmux
Monitor Nelle’s Pipeline with (h)top
Inside a tmux session, start running Nelle’s pipeline with the do-stats.sh script on the bigger 2012-07-04 dataset. Detach from the tmux session and use top or htop to monitor the progress. Is goostats using much CPU time? Would you expect it to be? Have a look at the contents of the goostats program (it is a shell script): what is taking most of the time? How much CPU time should that operation use?
Solution
goostats isn’t really doing very much: it spends most of its time sleeping, which uses very little CPU. So it won’t be showing much CPU time and won’t be at the top of top’s list of processes.
Monitoring other processes with (h)top or ps
If you have SSH access to some kind of shared server, connect to it and run top, htop or ps aux there, and have a look at which processes are using the most CPU or memory resources. What is the process name? Which user does that process belong to?
Key Points
When we talk of ‘job control’, we really mean ‘process control’
A running process can be stopped, paused, and/or made to run in the background
A process can be started so as to immediately run in the background
Paused or backgrounded processes can be brought back into the foreground
Process information can be inspected with ps
The top (or htop) command shows a live view of the resources used by each process.
The tmux command allows us to leave a command running and disconnect from the system by pressing Ctrl+B then d.
The tmux session can be reattached with tmux attach.
Aliases and Bash Customization
Overview
Teaching: 10 min
Exercises: 0 min
Questions
How do I customize my bash environment?
Objectives
Create aliases.
Add customizations to the .bashrc and .bash_profile files.
Change the prompt in a bash environment.
Bash allows us to customize our environments to fill our own particular needs.
Aliases
Sometimes we need to use long commands that have to be typed over and
over again. Fortunately, the alias
command allows us to create
shortcuts for these long commands.
As an example, let’s create aliases for going up one, two, or three directories.
alias up='cd ..'
alias upup='cd ../..'
alias upupup='cd ../../..'
Let’s try these commands out.
cd /usr/local/bin
upup
pwd
/usr
We can also remove a shortcut with unalias
.
unalias upupup
If we create one of these aliases in a bash session, it will only last until the end of that session. Fortunately, bash allows us to specify customizations that will work whenever we begin a new bash session.
Bash customization files
Bash environments can be customized by adding commands to the
.bashrc
, .bash_profile
, and .bash_logout
files in our home
directory. The .bashrc
file is executed whenever entering
interactive non-login shells whereas .bash_profile
is executed for
login shells. If the .bash_logout
file exists, then it will be run
after exiting a shell session.
Let’s add the above commands to our .bashrc
file.
Be careful to append to .bashrc
with >>
(append), rather than a single >
, which would overwrite the file.
echo "alias up='cd ..'" >> ~/.bashrc
tail -n 1 ~/.bashrc
alias up='cd ..'
We can execute the commands in .bashrc
using source
, so this creates the
alias up
which we can then use in directory /usr/local/bin
:
source ~/.bashrc
cd /usr/local/bin
up
pwd
/usr/local
Having to add customizations to two files can be cumbersome. If we
would like to always use the customizations in our .bashrc
file,
then we can add the following lines to our .bash_profile
file.
if [ -f $HOME/.bashrc ]; then
source $HOME/.bashrc
fi
Customizing your prompt
We can also customize our bash prompt by setting the PS1
system
variable. To set our prompt to be $
, we can run the command
export PS1="$ "
To set the prompt to $
for all bash sessions, add this line to the
end of .bashrc
.
Further [bash prompt customizations](https://www.howtogeek.com/307701/how-to-customize-and-colorize-your-bash-prompt)
are possible. To have our prompt be username@hostname[directory]:
,
we would set
export PS1="\u@\h[\W]: "
where \u
represents username, \h
represents hostname, and \W
represents the current directory.
Key Points
Aliases are used to create shortcuts or abbreviations
The .bashrc and .bash_profile files allow us to customize our bash environment.
The PS1 system variable can be changed to customize your bash prompt.
Shell Variables
Overview
Teaching: 10 min
Exercises: 0 min
Questions
How are variables set and accessed in the Unix shell?
Objectives
Understand how variables are implemented in the shell
Explain how the shell uses the
PATH
variable to search for executables
Read the value of an existing variable
Create new variables and change their values
The shell is just a program, and like other programs, it has variables. Those variables control its execution, so by changing their values you can change how the shell and other programs behave.
Let’s start by running the command set
and looking at some of the variables
in a typical shell session:
$ set
COMPUTERNAME=TURING
HOME=/home/nelle
HOMEDRIVE=C:
HOSTNAME=TURING
HOSTTYPE=i686
NUMBER_OF_PROCESSORS=4
OS=Windows_NT
PATH=/Users/nelle/bin:/usr/local/git/bin:/usr/bin:/bin:/usr/sbin:/sbin:/usr/local/bin
PWD=/home/nelle
UID=1000
USERNAME=nelle
...
As you can see, there are quite a few — in fact, four or five times more
than what’s shown here. And yes, using set
to show things might seem a
little strange, even for Unix, but if you don’t give it any arguments,
it might as well show you things you could set.
Every variable has a name.
By convention, variables that are always present are given upper-case names.
All shell variables’ values are strings, even those (like UID
) that look
like numbers. It’s up to programs to convert these strings to other types
when necessary. For example, if a program wanted to find out how many
processors the computer had, it would convert the value of the
NUMBER_OF_PROCESSORS
variable from a string to an integer.
Similarly, some variables (like PATH
) store lists of values.
In this case, the convention is to use a colon ‘:’ as a separator.
If a program wants the individual elements of such a list, it’s the
program’s responsibility to split the variable’s string value into pieces.
The PATH
Variable
Let’s have a closer look at that PATH
variable.
Its value defines the shell’s
search path,
i.e., the list of directories that the shell looks in for runnable programs
when you type in a program name without specifying what directory it is in.
For example,
when we type a command like analyze
,
the shell needs to decide whether to run ./analyze
or /bin/analyze
.
The rule it uses is simple:
the shell checks each directory in the PATH
variable in turn,
looking for a program with the requested name in that directory.
As soon as it finds a match, it stops searching and runs the program.
To show how this works,
here are the components of PATH
listed one per line:
/Users/nelle/bin
/usr/local/git/bin
/usr/bin
/bin
/usr/sbin
/sbin
/usr/local/bin
On our computer,
there are actually three programs called analyze
in three different directories:
/bin/analyze
,
/usr/local/bin/analyze
,
and /users/nelle/analyze
.
Since the shell searches the directories in the order they’re listed in PATH
,
it finds /bin/analyze
first and runs that.
Notice that it will never find the program /users/nelle/analyze
unless we type in the full path to the program,
since the directory /users/nelle
isn’t in PATH
.
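On a system laid out as above, we could confirm which copy the shell will run using the type built-in (or the which command):
$ type analyze
analyze is /bin/analyze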
Showing the Value of a Variable
Let’s show the value of the variable HOME
:
$ echo HOME
HOME
That just prints “HOME”, which isn’t what we wanted (though it is what we actually asked for). Let’s try this instead:
$ echo $HOME
/home/nelle
The dollar sign tells the shell that we want the value of the variable
rather than its name.
This works just like wildcards:
the shell does the replacement before running the program we’ve asked for.
Thanks to this expansion, what we actually run is echo /home/nelle
,
which displays the right thing.
Creating and Changing Variables
Creating a variable is easy—we just assign a value to a name using “=”:
$ SECRET_IDENTITY=Dracula
$ echo $SECRET_IDENTITY
Dracula
To change the value, just assign a new one:
$ SECRET_IDENTITY=Camilla
$ echo $SECRET_IDENTITY
Camilla
If we want to set some variables automatically every time we run a shell,
we can put commands to do this in a file called .bashrc
in our home
directory. (The .
character at the front prevents ls
from listing this file
unless we specifically ask it to using -a
:
we normally don’t want to worry about it.
The “rc” at the end is an abbreviation for “run control”,
which meant something really important decades ago,
and is now just a convention everyone follows without understanding why.)
For example,
here are three lines in /home/nelle/.bashrc
:
export SECRET_IDENTITY=Dracula
export TEMP_DIR=/tmp
export BACKUP_DIR=$TEMP_DIR/backup
These three lines create the variables SECRET_IDENTITY
,
TEMP_DIR
,
and BACKUP_DIR
,
and export them so that any programs the shell runs can see them as well.
Notice that BACKUP_DIR
’s definition relies on the value of TEMP_DIR
,
so that if we change where we put temporary files,
our backups will be relocated automatically.
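To see what export actually does, compare a plain assignment with an exported one: a variable that is not exported is invisible to child processes, such as a freshly started bash:
$ SECRET_IDENTITY=Dracula
$ bash -c 'echo [$SECRET_IDENTITY]'
[]
$ export SECRET_IDENTITY
$ bash -c 'echo [$SECRET_IDENTITY]'
[Dracula]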
While we’re here,
it’s also common to use the alias
command to create shortcuts for things we
frequently type. For example, we can define the alias backup
to run
/bin/zback
with a specific set of arguments:
alias backup='/bin/zback -v --nostir -R 20000 $HOME $BACKUP_DIR'
As you can see, aliases can save us a lot of typing, and hence a lot of typing mistakes. You can find interesting suggestions for other aliases and other bash tricks by searching for “sample bashrc” in your favorite search engine.
Key Points
Shell variables are by default treated as strings
The
PATH
variable defines the shell’s search path
Variables are assigned using “
=
” and recalled using the variable’s name prefixed by “$
”
Command Substitution
Overview
Teaching: 10 min
Exercises: 5 min
Questions
How can we substitute the output of a command into a command line?
Objectives
Understand the need for flexibility regarding arguments
Generate the values of the arguments on the fly using command substitution
Understand the difference between pipes/redirection and the command substitution operator
Introduction
In the Loops topic we saw how to improve productivity by letting the computer do the repetitive work. Often, this involves doing the same thing to a whole set of files, e.g.:
$ cd data/pdb
$ mkdir sorted
$ for file in *cyclo*.pdb; do
> sort $file > sorted/sorted-$file
> done
In this example, the shell generates for us the list of things to loop
over, using the wildcard mechanism we saw in the Pipes and Filters topic.
This results in the *cyclo*.pdb
being replaced with
cyclobutane.pdb cyclohexanol.pdb cyclopropane.pdb ethylcyclohexane.pdb
before the loop starts.
Another example is a so-called parameter sweep, where you run the same program a number of times with different arguments. Here is a fictitious example:
$ for cutoff in 0.001 0.01 0.05; do
> run_classifier.sh --input ALL-data.txt --pvalue $cutoff --output results-$cutoff.txt
> done
In the second example, the things to loop over: "0.001 0.01 0.05"
are spelled out by you.
Looping over the words in a string
In the previous example you can make your code neater and self-documenting by putting the cutoff values in a separate string:
$ cutoffs="0.001 0.01 0.05"
$ for cutoff in $cutoffs; do
>   run_classifier.sh --input ALL-data.txt --pvalue $cutoff --output results-$cutoff.txt
> done
This works because, just as with the filename wildcards,
$cutoffs
is replaced with0.001 0.01 0.05
before the loop starts.
However, you don’t always know in advance what you have to loop
over. It could well be that it is not a simple file name pattern (in
which case you can use wildcards), or that it is not a small, known set
of values (in which case you can write them out explicitly as was done
in the second example). It would, therefore, be nice if you could loop
over filenames or over words contained in a file. Suppose that file
cohort2010.txt
contains the filenames over which to iterate; then it would
be nice to be able to say something like:
# (imaginary syntax)
$ for file in [INSERT THE CONTENTS OF cohort2010.txt HERE]
> do
> run_classifier.sh --input $file --pvalue -0.05 --output $file.results
> done
Command substitution
This would be more general, more flexible and more tractable than
relying on the wildcard mechanism. What we need, therefore, is a
mechanism that actually replaces everything between [
and ]
with the
desired names of input files, just before the loop starts. Thankfully,
this mechanism exists, and it is called the command substitution operator
(previously written using the backtick operator). It looks much like the
previous snippet:
# (actual syntax)
$ for file in $(cat cohort2010.txt)
> do
> run_classifier.sh --input $file --pvalue -0.05 --output $file.results
> done
It works simply as follows: everything between the $(
and the )
is
executed as a Unix command, and the command’s standard output replaces
everything from $(
up to and including )
, just before the loop
starts. For convenience, newlines in the command’s output are
replaced with simple spaces.
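Command substitution is not limited to loops: it can be used anywhere on a command line. For instance, the following prints a count of the files in the current directory (the exact number will depend on where you run it):
$ echo "There are $(ls | wc -l) files in $(pwd)"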
Backtick operator
In legacy code, you may see the same construct but with a
different syntax. It starts and ends with backticks, `
(not to be
confused with the single quote '
!). The backticks work exactly the
same as the command substitution done by $(
and )
. However, its use
is discouraged as backticks cannot be nested.
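Nesting works as you would expect with $( and ). For example, this prints the name (not the full path) of the current directory:
$ echo $(basename $(pwd))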
Example
OK. Recall from the Pipes and Filters topic that cat
prints the contents
of its argument (a filename) to standard output. So, if the contents of
file cohort2010.txt
look like
patient1033130.txt
patient1048338.txt
patient7448262.txt
.
.
.
patient1820757.txt
then the construct
$ for file in $(cat cohort2010.txt)
> do
> ...
> done
will be expanded to
$ for file in patient1033130.txt patient1048338.txt patient7448262.txt ... patient1820757.txt
> do
> ...
> done
(notice the convenience of newlines having been replaced with simple spaces).
This example uses $(cat somefilename)
to supply arguments to the
for variable in ... do ... done
-construct, but any output from any
command, or even pipeline, can also be used. For example, if
cohort2010.txt
contains a few thousand patients but you just want to
try the first two for a test run, you can use the head
command to just
get the first few lines of its argument, like so:
$ for file in $(cat cohort2010.txt | head -n 2)
> do
> ...
> done
which will expand to
$ for file in patient1033130.txt patient1048338.txt
> do
> ...
> done
simply because cat cohort2010.txt | head -n 2
produces
patient1033130.txt patient1048338.txt
after the command substitution.
Everything between the $(
and )
is executed verbatim by the shell, so
also the -n 2
argument to the head
command works as expected.
Important
Recall from the Loops and the Shell Scripts topics that Unix uses whitespace to separate commands, options (flags) and parameters/arguments. For the same reason it is essential that the command (or pipeline) inside the command substitution produces clean output: single-word output works best within single commands, and whitespace- or newline-separated words work best for lists over which to iterate in loops.
Generating filenames based on a timestamp
It can be useful to create the filename ‘on the fly’. For instance, if some program called
qualitycontrol
is run periodically (or unpredictably) it may be necessary to supply the time stamp as an argument to keep all the output files apart, along the following lines:
qualitycontrol --inputdir /data/incoming/ --output qcresults-[INSERT TIMESTAMP HERE].txt
Getting
[INSERT TIMESTAMP HERE]
to work is a job for the command substitution operator. The Unix command you need here is the date
command, which provides you with the current date and time (try it). In its current form, its output is less useful for generating filenames because it contains whitespace (which, as we know by now, should preferably be avoided in filenames). You can tweak date’s format in great detail, for instance to get rid of whitespace:
$ date +"%Y-%m-%d_%T"
(Try it).
Write the command that will copy a file of your choice to a new file whose name contains the time stamp. Test it by executing the command a few times, waiting a few seconds between invocations (use the arrow-up key to avoid having to retype the command)
Solution
cp file file.$(date +"%Y-%m-%d_%T")
Juggling filename extensions
When running an analysis program with a certain input file, it is often required that the output has the same name as the input, but with a different filename extension, e.g.
$ run_classifier.sh --input patient1048338.txt --pvalue -0.05 --output patient1048338.results
A good trick here is to use the Unix
basename
command. It takes a string (typically a filename), and strips off the given extension (if it is part of the input string). Example:
$ basename patient1048338.txt .txt
gives
patient1048338
Write a loop that uses the command substitution operator and the
basename
command to sort each of the*.pdb
files into a corresponding*.sorted
file. That is, make the loop do the following:
$ sort ammonia.pdb > ammonia.sorted
but for each of the
.pdb
-files.Solution
for file in *.pdb; do sort $file > $(basename $file .pdb).sorted; done
Closing remarks
The command substitution operator provides us with a
powerful new piece of ‘plumbing’ that allows us to connect “small
pieces, loosely together” to keep with the Unix philosophy. It is remotely
similar to the |
operator in the sense that it connects two
programs. But there is also a clear difference: |
connects the
standard output of one command to the standard input of another command,
whereas $(command)
is substituted ‘in-place’ into the shell script, and
always provides parameters, options, and arguments to other commands.
Key Points
We can substitute the output of a command into a command line using the $(command) syntax.
We can loop through sets of values in a “parameter sweep”.
For loops can take a single variable with space separated arguments and treat each as a separate item to iterate over.
Streams
Overview
Teaching: 15 min
Exercises: 5 min
Questions
What are the standard output streams?
How can I redirect them?
Objectives
Understand the difference between STDERR and STDOUT.
Split STDOUT and STDERR output with 2> and 1> redirects.
Use the
tee
command to redirect to a file and the screen.
There are three standard input/output streams created when you run a Unix command. These can be thought of as channels for data flowing to and from your command. The three streams are standard input (STDIN), standard output (STDOUT) and standard error (STDERR).
To explore these streams, we’re going to use Nelle’s do-stats.sh script from earlier. Therefore, return to the north-pacific-gyre
directory.
STDIN
STDIN is the stream by which the program you are running is provided with its input data. Unix automatically connects this to your terminal keyboard. For example, when you are entering your password or passphrase when ssh-ing into a remote server, you are using STDIN. The stream ID of STDIN is 0.
STDOUT vs STDERR
STDOUT and STDERR are both connected to your terminal screen. STDOUT is the stream used by the program you are running for its output data and STDERR is the stream for any error messages or diagnostics such as log messages.
For example,
bash do-stats.sh 2012-07-03/NENE01812A.txt
results in
2012-07-03/NENE01812A.txt
being printed to the terminal. This is because we used echo
in do-stats.sh
- by default, echo writes to the standard output stream.
If we make a typo and run
bash do-stat.sh 2012-07-03/NENE01812A.txt
we get
bash: do-stat.sh: No such file or directory
printed to the terminal. This is actually using the error stream, but because STDOUT and STDERR are both automatically displayed by your terminal, it might not immediately be obvious that these two streams are different.
However, they can be separated. This means that you can stop error messages and warnings being mixed in with your output.
2>
and 1>
redirects
Let’s remind ourselves of how to redirect the output from a command to a text file (as we saw in the Introduction to Unix Shell workshop), using the >
symbol.
bash do-stats.sh 2012-07-03/NENE01812A.txt > output.txt
Now there is nothing printed to the screen, because our output is being redirected to a file named output.txt
.
cat output.txt
2012-07-03/NENE01812A.txt
Let’s repeat this, but, this time, use our command with a typo from before that we know will generate an error.
bash do-stat.sh 2012-07-03/NENE01812A.txt > output.txt
bash: do-stat.sh: No such file or directory
In this case, the error is still printed to the terminal, because >
, by default, redirects STDOUT, not STDERR.
However, >
can be used to redirect STDERR, or both STDOUT and STDERR. Putting a number in front of the >
controls which stream it redirects. 1
is the stream ID of STDOUT, so 1>
is the same as >
.
bash do-stats.sh 2012-07-03/NENE01812A.txt 1> output.txt
To redirect STDERR, use a stream ID of 2, i.e. 2>
.
bash do-stat.sh 2012-07-03/NENE01812A.txt 2> error.txt
cat error.txt
bash: do-stat.sh: No such file or directory
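If we do not care about the error messages at all, a common idiom is to redirect STDERR to the special file /dev/null, which discards everything written to it:
bash do-stat.sh 2012-07-03/NENE01812A.txt 2> /dev/null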
It is also possible to redirect both the output and error streams at once. In order to see this, let’s try running do-stats.sh
on a file that we know does not exist.
bash do-stats.sh 2012-07-03/NENE01812C.txt
2012-07-03/NENE01812C.txt
head: cannot open '2012-07-03/NENE01812C.txt' for reading: No such file or directory
If you want both streams redirected to the same file you can do:
bash do-stats.sh 2012-07-03/NENE01812C.txt > output-and-error.txt 2>&1
cat output-and-error.txt
2012-07-03/NENE01812C.txt
head: cannot open '2012-07-03/NENE01812C.txt' for reading: No such file or directory
This redirects STDOUT to a text file as before, then 2>&1
redirects STDERR to wherever STDOUT is being redirected to.
Redirecting STDOUT and STDERR to different files
Our initial aim was to redirect STDERR and STDOUT to two different files, in order to separate any warnings or diagnostic messages from our program output. Can you work out how Nelle can do this when running
bash do-stats.sh 2012-07-03/NENE01812C.txt
?Solution
bash do-stats.sh 2012-07-03/NENE01812C.txt > output.txt 2> error.txt
cat output.txt
2012-07-03/NENE01812C.txt
cat error.txt
head: cannot open '2012-07-03/NENE01812C.txt' for reading: No such file or directory
The tee
command
The Unix command tee
duplicates STDOUT and sends the second copy to a file.
Consider the input/output stream model we’ve already discussed as a system of pipes.
tee
sensibly splits the flow of information, allowing one copy to be written
to disk and leaving one copy available for a subsequent command in the chain.
bash do-stats.sh 2012-07-03/NENE*.txt | tee output.txt
2012-07-03/NENE01729A.txt
2012-07-03/NENE01729B.txt
2012-07-03/NENE01736A.txt
2012-07-03/NENE01751A.txt
2012-07-03/NENE01751B.txt
2012-07-03/NENE01812A.txt
2012-07-03/NENE01843A.txt
2012-07-03/NENE01843B.txt
2012-07-03/NENE01971Z.txt
2012-07-03/NENE01978A.txt
2012-07-03/NENE01978B.txt
2012-07-03/NENE02018B.txt
2012-07-03/NENE02040A.txt
2012-07-03/NENE02040B.txt
2012-07-03/NENE02040Z.txt
2012-07-03/NENE02043A.txt
2012-07-03/NENE02043B.txt
cat output.txt
2012-07-03/NENE01729A.txt
2012-07-03/NENE01729B.txt
2012-07-03/NENE01736A.txt
2012-07-03/NENE01751A.txt
2012-07-03/NENE01751B.txt
2012-07-03/NENE01812A.txt
2012-07-03/NENE01843A.txt
2012-07-03/NENE01843B.txt
2012-07-03/NENE01971Z.txt
2012-07-03/NENE01978A.txt
2012-07-03/NENE01978B.txt
2012-07-03/NENE02018B.txt
2012-07-03/NENE02040A.txt
2012-07-03/NENE02040B.txt
2012-07-03/NENE02040Z.txt
2012-07-03/NENE02043A.txt
2012-07-03/NENE02043B.txt
Where might this be useful? For instance, you can use this to both passively log and actively monitor a compilation or a data processing step.
Because tee
preserves STDOUT, it also lets you recover output that would otherwise scroll past
the limited buffer of your shell’s window.
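Because tee passes its input through to STDOUT, it can also sit in the middle of a longer pipeline. For example, we can save the list of processed files and count them in one go:
bash do-stats.sh 2012-07-03/NENE*.txt | tee output.txt | wc -l
17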
Key Points
STDERR can be redirected by using 2>.
STDOUT can be redirected by using 1> or >.
tee can be used to duplicate STDOUT and send the second copy to a file.
AWK
Overview
Teaching: 20 min
Exercises: 5 min
Questions
How to use AWK for text processing?
Objectives
Explain why AWK is useful and when it is better than pipes
Show a basic usage similar to the cat command
Introduce the field separator parameter
Use regular expressions to perform different instructions
Introduce BEGIN and END keywords
Use the if-then structure to change behaviour for the same matching regex
Introduce the array data structure
Use the for loop to cycle through an array
AWK is a tool for manipulating and filtering complex data. It stands for Aho, Weinberger, and Kernighan, the designers of the program. This chapter requires an understanding of the previous shell lessons and of basic programming concepts.
Let’s start. The example.txt
file for this exercise is available under the data
directory.
If we need to count the number of lines in a file, we can use the previously shown command for word counting wc
.
wc -l example.txt
As you probably remember, -l is an option that asks for the number of lines only.
However, wc counts the number of newline characters in the file: if the last line does not end with a newline (i.e. there is no empty line at the end of the file), the result is going to be the actual number of lines minus one.
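You can see this for yourself with a small test file whose last line lacks a trailing newline (the file name is just for illustration):
printf "line one\nline two" > no-newline.txt
wc -l no-newline.txt
1 no-newline.txt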
A workaround is to use Awk. Awk is a command line program that takes as input a set of instructions and one or more files. The instructions are executed on each line of the input file(s).
The instructions are enclosed in single quotes or they can be read from a file.
Example:
awk '{print $0}' example.txt
This command has the same output as cat
: it prints each line from the example.txt
file.
The structure of the instruction is the following:
- curly braces surround the set of instructions
- print is the instruction that sends its arguments to the terminal
- $0 is a variable, it means “the content of the current line”
As you can see, the file contains a table.
Awk automatically splits the processed line at whitespace: in our case, it has knowledge of the different columns in the table.
Each column value for the current line is stored into a variable: $1 for the first column, $2 for the second and so on.
So, if we like to print only the second column from the table, we execute
awk '{print $2}' example.txt
We can also print more than one value, or add text (e.g. “chr”) to the printed line:
awk '{print "chr",$2,$4}' example.txt
The comma puts a space between the printed values. Strings of text should be enclosed in double quotes. In this case we are printing the text “chr”, the second and the fourth column for each row in the table.
So, $0 is the whole line, $1 the first field, $2 the second and so on. What if we want to print the last column, but we don’t know its number? Maybe it is a huge table, or maybe different lines have a different number of columns.
Awk helps us thanks to the variable NF. NF stores the number of fields (our columns) in the row. Let’s see for our table:
awk '{print NF}' example.txt
We can see that some lines contain 6 fields while others contain 7 of them. Since NF is the number of the last field, $NF contains its value.
awk '{print "This line has",NF,"columns. The last one contains",$NF}' example.txt
Field separator
Out there we have different file formats: our data may be comma separated (csv), tab separated (tsv), by a semicolon or by any other character.
To specify the field separator, we should provide it at the command line like:
awk -F "," '{print $2}' example2.txt
In this case, we are printing the second field in each line, using comma as separator. Please notice that the character space is now part of the field value, since it is no longer the separator.
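The same flag works for any separator. For a tab-separated file (the file name here is hypothetical), the separator is written as "\t":
awk -F "\t" '{print $1,$3}' samples.tsv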
Matching lines
Maybe we would like to perform different instructions on different lines.
Awk allows you to specify a matching pattern, like the command grep does.
Let’s look at the file content
awk '{print $0}' methane.pdb
It looks like an abridged PDB file. If we would like to print only lines starting with the word “ATOM”, we type:
awk '/^ATOM/ {print $0}' methane.pdb
In this case, we specify the pattern before the instructions: only lines starting with the text “ATOM”. As you remember, ^ means “at the beginning of the line”.
We can specify more than one pattern:
awk '/^ATOM/ {print $7,$8,$9} /^HEADER/ {print $NF}' methane.pdb
In this case, we are printing the spatial coordinates of each atom, and the last field of the header line.
Key Points
awk can be used to manipulate and filter data, e.g. adding text or printing specific columns
NF is a variable that stores the number of fields in the current line
The field separator can be specified with the -F option; the default is whitespace
Matching patterns can be specified with the /PATTERN/ syntax, e.g. /^ATOM/