Working with archive files
Overview
Teaching: 20 min
Exercises: 30 min
Questions
How do I extract and compress archive files?
What is the difference between a tar and zip file?
Objectives
What are archive files?
How to extract a zip archive.
How to extract a tar archive.
How to create a zip archive.
How to create a tar archive.
How to compress a tar archive.
Archive Files
There are many times when it is useful and/or convenient to store multiple files and directory structures inside a single file. Probably the three most common use cases for this are when we want to store a set of files, to copy them or to compress them.
Archiving
We might want to store a collection of files for long term preservation in a way that somebody else can easily obtain them with the download of a single file. Common platforms for this include the Zenodo archiving service or GitHub (through its releases feature).
File transfer
When copying files to another computer, whether via email, a file transfer protocol such as SCP or even a memory stick/card, it is often convenient to copy just a single file rather than a whole set of files. This is especially true when there is a complex directory structure that goes with them.
Compression
Most archive formats at least have the option (some do it by default) to compress the data in the archive so that it takes less disk space and transfers faster over a network. These compression formats are usually “lossless” formats, which work by removing redundant data in files while allowing the data to be completely recreated when uncompressed, without losing anything. This is in contrast to “lossy” formats such as JPEG (for images), MP3 (for audio) and MPEG (for video), which remove some information that is unlikely to be perceived by people in order to reduce the file size. As a result, lossless compression systems work well when compressing previously uncompressed data such as text, CSV or code files or raw images, but they do not work very well on previously compressed files such as JPEG or PNG images, MP3 audio or MPEG video. Compression rates of 20-50% for text files are not uncommon, which can be a significant saving when storing or transferring data. Compression can also be very helpful when emailing files, as most email systems have a size limit of between 8 and 20 megabytes.
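The following is a minimal sketch of what “lossless” means in practice, using gzip on two throwaway files. The file names and contents are invented for this demonstration and are not part of Nelle's dataset. It compresses a very repetitive text file and a file of random bytes of the same size, then checks that the decompressed text is byte-for-byte identical to the original.

```shell
#!/bin/sh
# Sketch only: readings.csv and random.bin are made-up demo files.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Build a highly repetitive text file (1000 identical lines, about 42 KB).
i=0
while [ "$i" -lt 1000 ]; do
    echo "sample,depth,temperature,salinity,reading"
    i=$((i + 1))
done > readings.csv

# Build a file of random bytes of the same size; random data has
# almost no redundancy for a lossless compressor to remove.
head -c 42000 /dev/urandom > random.bin

# -c writes the compressed data to standard output, leaving the original in place.
gzip -c readings.csv > readings.csv.gz
gzip -c random.bin > random.bin.gz

# Compare the sizes in bytes.
wc -c readings.csv readings.csv.gz random.bin random.bin.gz

# Decompress to a new file and check that nothing was lost.
gzip -dc readings.csv.gz > restored.csv
cmp readings.csv restored.csv && echo "identical after round trip"
```

The repetitive text should shrink dramatically while the random file stays roughly the same size, which is the same effect seen when trying to compress JPEG or MP3 files.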
Zip files
One of the most popular archive formats is the ZIP format, which as the name suggests is both a compression and archiving format. ZIP files are more common in the Windows world than the Unix world, as the format didn’t always support Unix file permission and ownership information.
Extracting a Zip file
In the last episode we copied a ZIP file called north-pacific-gyre.zip to a remote server using the scp command. We can now use the unzip command to list and extract the contents of that ZIP file.
First let’s connect to the remote system using SSH and check the file is there.
ssh nelle@backupserver
ls -lh north-pacific-gyre.zip
We should see our north-pacific-gyre.zip file listed along with some metadata about it, including who owns it, its size and its last modification date/time.
-rw-rw-r-- 1 nelle nelle 41K Mar 17 18:33 north-pacific-gyre.zip
Now that we are sure the file is available to us let’s have a look at what is inside it using the unzip -l command.
unzip -l north-pacific-gyre.zip
Archive: north-pacific-gyre.zip
Length Date Time Name
--------- ---------- ----- ----
0 2025-03-17 18:27 north-pacific-gyre/
0 2025-03-17 18:27 north-pacific-gyre/2012-07-03/
4400 2017-02-16 15:58 north-pacific-gyre/2012-07-03/NENE01729B.txt
4391 2017-02-16 15:58 north-pacific-gyre/2012-07-03/NENE02040A.txt
4406 2017-02-16 15:58 north-pacific-gyre/2012-07-03/NENE01729A.txt
4371 2017-02-16 15:58 north-pacific-gyre/2012-07-03/NENE01736A.txt
4393 2017-02-16 15:58 north-pacific-gyre/2012-07-03/NENE02043B.txt
4389 2017-02-16 15:58 north-pacific-gyre/2012-07-03/NENE01978B.txt
4401 2017-02-16 15:58 north-pacific-gyre/2012-07-03/NENE01812A.txt
3517 2017-02-16 15:58 north-pacific-gyre/2012-07-03/NENE02018B.txt
4381 2017-02-16 15:58 north-pacific-gyre/2012-07-03/NENE01978A.txt
4381 2017-02-16 15:58 north-pacific-gyre/2012-07-03/NENE02040Z.txt
4386 2017-02-16 15:58 north-pacific-gyre/2012-07-03/NENE02043A.txt
4375 2017-02-16 15:58 north-pacific-gyre/2012-07-03/NENE01843B.txt
4411 2017-02-16 15:58 north-pacific-gyre/2012-07-03/NENE01751A.txt
4395 2017-02-16 15:58 north-pacific-gyre/2012-07-03/NENE01843A.txt
4372 2017-02-16 15:58 north-pacific-gyre/2012-07-03/NENE01971Z.txt
4367 2017-02-16 15:58 north-pacific-gyre/2012-07-03/NENE02040B.txt
4409 2017-02-16 15:58 north-pacific-gyre/2012-07-03/NENE01751B.txt
219 2025-03-17 18:33 north-pacific-gyre/goostats
345 2025-03-17 18:33 north-pacific-gyre/goodiff
92 2025-03-17 18:33 north-pacific-gyre/do-stats.sh
--------- -------
74401 22 files
This shows that all the files in the archive are inside a directory called north-pacific-gyre. Inside this we have a subdirectory called 2012-07-03 with some text files in it and in the main directory we have the goostats, goodiff and do-stats.sh programs/shell scripts.
Let’s go ahead and extract this ZIP file. To do that we simply run the unzip command with the ZIP file name as an argument.
unzip north-pacific-gyre.zip
If we run the ls command after this we should see that a north-pacific-gyre directory now exists, and if we cd into this and run ls again there should be a 2012-07-03 subdirectory and the goostats, goodiff and do-stats.sh programs.
ls
cd north-pacific-gyre
ls
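As a self-contained way to practise the list-then-extract workflow without touching Nelle's data, the sketch below builds a tiny archive and unpacks it into a separate directory. The demo directory and its files are invented for illustration. The -d option of unzip (extract into the named directory) and the -q (quiet) option are not used elsewhere in this episode, and the script skips itself if zip and unzip are not both installed.

```shell
#!/bin/sh
# Sketch only: the "demo" directory and its contents are hypothetical.
if command -v zip >/dev/null 2>&1 && command -v unzip >/dev/null 2>&1; then
    workdir=$(mktemp -d)
    cd "$workdir" || exit 1

    # Build a tiny archive to practise on.
    mkdir -p demo/data
    echo "first reading" > demo/data/a.txt
    echo "#!/bin/sh" > demo/run.sh
    zip -q -r demo.zip demo

    # Look before extracting: list the contents without unpacking anything.
    unzip -l demo.zip

    # Extract into a separate directory so that nothing in the
    # current directory can be overwritten.
    unzip -q demo.zip -d extracted
    ls -R extracted
else
    echo "zip and unzip are not both installed; skipping the demo"
fi
```

Extracting into a fresh directory is a cautious habit when an archive's contents are not all inside a single top-level directory.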
Creating a ZIP file
We can create our own ZIP archives using the zip command. This takes the name of the ZIP file to create followed by a list of filenames. We previously found that Nelle’s files are named after the machine that was used to process her samples. This is the “A” or “B” at the end of the filename before the “.txt” extension. There are a few samples which have a “Z” in the name where it was not known which machine processed them. Let’s create a new archive in the north-pacific-gyre/2012-07-03 directory which contains just the files ending in “A” or “B”.
cd 2012-07-03
zip goodfiles.zip NENE*[AB].txt
This will display the names of all the files added to our new ZIP file and the amount of compression (deflation) that is applied.
adding: NENE01729A.txt (deflated 51%)
adding: NENE01729B.txt (deflated 51%)
adding: NENE01736A.txt (deflated 51%)
adding: NENE01751A.txt (deflated 51%)
adding: NENE01751B.txt (deflated 51%)
adding: NENE01812A.txt (deflated 51%)
adding: NENE01843A.txt (deflated 51%)
adding: NENE01843B.txt (deflated 51%)
adding: NENE01978A.txt (deflated 51%)
adding: NENE01978B.txt (deflated 51%)
adding: NENE02018B.txt (deflated 51%)
adding: NENE02040A.txt (deflated 51%)
adding: NENE02040B.txt (deflated 51%)
adding: NENE02043A.txt (deflated 51%)
adding: NENE02043B.txt (deflated 51%)
We can now verify that these files were added to our ZIP by running unzip -l on it:
unzip -l goodfiles.zip
Archive: goodfiles.zip
Length Date Time Name
--------- ---------- ----- ----
4406 2017-02-16 15:58 NENE01729A.txt
4400 2017-02-16 15:58 NENE01729B.txt
4371 2017-02-16 15:58 NENE01736A.txt
4411 2017-02-16 15:58 NENE01751A.txt
4409 2017-02-16 15:58 NENE01751B.txt
4401 2017-02-16 15:58 NENE01812A.txt
4395 2017-02-16 15:58 NENE01843A.txt
4375 2017-02-16 15:58 NENE01843B.txt
4381 2017-02-16 15:58 NENE01978A.txt
4389 2017-02-16 15:58 NENE01978B.txt
3517 2017-02-16 15:58 NENE02018B.txt
4391 2017-02-16 15:58 NENE02040A.txt
4367 2017-02-16 15:58 NENE02040B.txt
4386 2017-02-16 15:58 NENE02043A.txt
4393 2017-02-16 15:58 NENE02043B.txt
--------- -------
64992 15 files
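The NENE*[AB].txt pattern used above is expanded by the shell before zip ever runs, so zip only receives the final list of names. A quick way to preview which files a pattern selects is to hand it to echo first. The sketch below does this in a scratch directory with empty stand-in files that borrow four names from the listing above.

```shell
#!/bin/sh
# Sketch only: empty stand-in files in a temporary directory.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Two "good" files and two files from the unknown machine.
touch NENE01729A.txt NENE01729B.txt NENE01971Z.txt NENE02040Z.txt

# [AB] matches exactly one character, which must be either A or B,
# so the Z files are left out.
matched=$(echo NENE*[AB].txt)
echo "$matched"
# prints: NENE01729A.txt NENE01729B.txt
```

Previewing a pattern this way costs nothing and avoids accidentally archiving more (or fewer) files than intended.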
Exercises
Adding and removing files to/from an existing ZIP
If the zip command is run multiple times it will update the files in the ZIP and if new files have been specified they will be added. If you run a zip command multiple times then you will notice on the second/subsequent runs it will say “updating” instead of “adding” next to the files which already exist in the zip file.
- Run the zip command from above (zip goodfiles.zip NENE*[AB].txt).
- Repeat the command but change the file name list to NENE*.txt. What is displayed when the files ending in “Z” are added? How does this differ from the other files?
- Verify the files ending in “Z” are now present using unzip -l.
- Look at the zip man page, how can you now delete the files ending in “Z” without rebuilding the entire ZIP file?
- Try the option you just found and verify the result with unzip -l.
Solution
zip goodfiles.zip NENE*[AB].txt
zip goodfiles.zip NENE*.txt # should show "updating" next to A/B files and "adding" next to the Z files
unzip -l goodfiles.zip # verify the addition
zip -d goodfiles.zip NENE*Z.txt # -d deletes files from the zip
unzip -l goodfiles.zip # verify the deletion
Varying the compression level
You can select the amount/speed of compression applied to a ZIP file when creating it. More compression should result in a smaller file but the compression (and decompression) will take longer.
Open the zip manpage and find the option which controls the compression speed.
Try compressing the text files in the 2012-07-03 directory with the different options. What difference does it make to the level of compression?
Find out how long it takes to compress at each level by prefixing the zip command with the time command to measure the time the command takes to run (this gives three numbers, you want the “real” number). As the files are small there will be a lot of variation, so try it a few times.
time zip goodfiles.zip NENE*[AB].txt
Solution
The -0 to -9 options vary the compression level/speed. -0 will give no compression, -9 will give the most and -6 is the default. File sizes vary between 66KB with no compression, 35KB at level 1 and 34KB beyond level 4. Times (on the author’s laptop) vary between 2 and 3 seconds at level 0 to 4-7 seconds at level 9.
The full command will be:
time zip -9 goodfiles.zip NENE*[AB].txt
Tar archives
An alternative to the ZIP archive is the Tar archive. Tar stands for “tape archive” and was originally used to prepare a set of files to be written sequentially onto a magnetic tape for storage/backup purposes. The tar command originates in the Unix world and natively supports Unix file ownership/group information, symbolic links and permissions.
Creating a Tar file
Tar files are created and extracted with the tar command. To create one we use the -c or --create option. The name of the tar file is then specified with the -f or --file option. As with zip, we end the command with the list of files to place inside the archive. For example, to make a tar archive of our “good” files from Nelle’s dataset inside the 2012-07-03 directory we can run:
tar --create --file goodfiles.tar NENE*[AB].txt
or
tar -c -f goodfiles.tar NENE*[AB].txt
or for an even more compact version (note the “f” must be the last letter in the group of options, because the archive name has to come directly after it):
tar -cf goodfiles.tar NENE*[AB].txt
Unlike zip, tar does not give any output to confirm what it has done unless we add the -v or --verbose option. This will list the name of every file added to our archive.
tar -cvf goodfiles.tar NENE*[AB].txt
Listing the contents of a Tar file
We can list the contents of a Tar file by using the -t or --list option. We must still use the -f or --file option to specify the name of the tar file we are working with.
tar -tf goodfiles.tar
If we want an ls -l style output then we can add the -v or --verbose option:
tar -tvf goodfiles.tar
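To try the create and list steps without risking Nelle's real files, the following sketch repeats them in a temporary directory. The three files use names from the dataset, but their contents are placeholders invented for this example.

```shell
#!/bin/sh
# Sketch only: placeholder files in a temporary directory.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

echo "machine A data" > NENE01729A.txt
echo "machine B data" > NENE01729B.txt
echo "unknown machine" > NENE01971Z.txt

# Create the archive from the A/B files only; "f" comes last in the
# option group because the archive name must follow it directly.
tar -cf goodfiles.tar NENE*[AB].txt

# List the names stored in the archive, one per line.
tar -tf goodfiles.tar

# Capture the listing so that a script could check it.
listing=$(tar -tf goodfiles.tar)
```

Listing the archive straight after creating it is a cheap way to confirm that the pattern picked up the intended files and left the “Z” file out.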
Compressing Tar files
Unlike the zip command, the tar command does not use any compression by default. Instead, tar files can be compressed by another program. Common choices for this are gzip or bzip2, both of which compress a single file and append a .gz or .bz2 extension on the end. So you will often see tar files with an extension of .tar.gz (sometimes shortened to .tgz) or .tar.bz2. Modern versions of tar make this easier for us though: they can take an extra -z (or --gzip) option for gzip or -j (or --bzip2) option for bzip2.
tar -cvjf goodfiles.tar.bz2 NENE*[AB].txt
Extracting Tar files
We can extract the contents of a tar file with the -x or --extract option. As before this needs to be combined with -f or --file to specify the file name and optionally -v or --verbose if we want a list of the file names we are extracting. In older versions of tar we needed to add -z or -j to extract a compressed archive, but newer versions will automatically detect this and do the decompression for us.
tar -xvf goodfiles.tar.bz2
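The sketch below puts the compress and extract steps together in a scratch directory. It uses gzip (-z) rather than bzip2, purely because gzip is more likely to be installed everywhere, and it passes -z explicitly on extraction so that it also works with older versions of tar. The source and restored directory names and the file contents are invented for the example, and the -C option (change to the named directory before extracting) is not otherwise covered in this episode.

```shell
#!/bin/sh
# Sketch only: "source" and "restored" are hypothetical directory names.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
mkdir source restored

# A repetitive text file that should compress well.
i=0
while [ "$i" -lt 500 ]; do
    echo "depth,temperature,salinity"
    i=$((i + 1))
done > source/readings.txt

# The same directory archived without and with gzip compression.
tar -cf plain.tar source
tar -czf compressed.tar.gz source
wc -c plain.tar compressed.tar.gz

# Extract into a different directory and confirm the file is unchanged.
tar -xzf compressed.tar.gz -C restored
cmp source/readings.txt restored/source/readings.txt && echo "extracted copy matches the original"
```

Extracting with -C into an empty directory keeps the restored copy apart from the original, which makes the comparison meaningful.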
Comparing Bzip2, Gzip and Zip compression
Compress the entire north-pacific-gyre directory using the following formats:
- Uncompressed Tar
- Gzip compressed Tar
- Bzip2 compressed Tar
- Zip
Make sure you delete any zip/tar files from inside the 2012-07-03 directory first.
You will need to find an extra option for zip to make it recursively archive the directories of north-pacific-gyre; look in the man page to find this.
Compare the file size for each archiving method. Which is smallest? Which is largest?
Solution
cd ~/ # get back to the home directory, north-pacific-gyre should be a subdirectory of this
rm north-pacific-gyre/2012-07-03/*.zip north-pacific-gyre/2012-07-03/*.tar* # remove any old files
tar -cvf north-pacific-gyre.tar north-pacific-gyre
tar -cvzf north-pacific-gyre.tar.gz north-pacific-gyre
tar -cvjf north-pacific-gyre.tar.bz2 north-pacific-gyre
zip -9 -r north-pacific-gyre.zip north-pacific-gyre
ls -lh north-pacific-gyre*.*
- north-pacific-gyre.tar - 90K
- north-pacific-gyre.tar.bz2 - 31K
- north-pacific-gyre.tar.gz - 36K
- north-pacific-gyre.zip - 41K
Key Points
Archive files are files which contain one or more other files. They are a convenient way to store or transfer multiple files and directory structures inside a single file.
Zip archives are (usually) compressed by default.
Zip files can be extracted with the unzip command.
Zip files can be created with the zip command.
Tar archives are not compressed by default.
Tar archives can be compressed using the gzip or bzip2 utilities.
Tar files can be extracted with the tar -xf option.
Modern versions of tar will automatically uncompress gzipped or bzipped archives.
Tar files can be created with the tar -cf option.