Analyzing Data from Multiple Files
Last updated on 2025-05-21 | Edit this page
Estimated time: 30 minutes
Overview
Questions
- “How can I do the same operations on many different files?”
Objectives
- “Use a library function to get a list of filenames that match a wildcard pattern.”
- “Write a
for
loop to process multiple files.”
When working with real datasets, we may not always have everything in
a single file. We need a way to get a list of all the files in our
data
directory whose names start with wave_
and end with .csv
. The following library will help us to
achieve this:
The glob
library contains a function, also called
glob
, that finds files and directories whose names match a
pattern. We provide those patterns as strings: the character
*
matches zero or more characters, while ?
matches any one character. We can use this to get the names of all the
CSV files in the current directory:
OUTPUT
['waves_10s.csv', 'waves_00s.csv', 'waves_80s.csv', 'waves_90s.csv']
These files show waveheight data from some years in the decades
indicated in the filename. As this example shows,
glob.glob
’s result is a list of files in arbitrary order.
This means we can loop over it to do something with each filename in
turn. In our case, the “something” we want to do is generate a set of
plots for each file in our waveheight dataset.
If we want to start by analyzing just the first three files in
alphabetical order, we can use the sorted
built-in function
to generate a new sorted list from the glob.glob
output:
PYTHON
import glob
import numpy
import matplotlib.pyplot as plt
filenames = sorted(glob.glob('waves_*.csv'))
filenames = filenames[0:3]
for filename in filenames:
print(filename)
data = numpy.loadtxt(fname=filename, delimiter=',')
number_of_rows = data.shape[0]
number_of_years = number_of_rows//12
# need to reshape the data for plotting
data = numpy.reshape(data[:,2], [number_of_years,12])
fig = plt.figure(figsize=(10.0, 3.0))
axes1 = fig.add_subplot(1, 3, 1)
axes2 = fig.add_subplot(1, 3, 2)
axes3 = fig.add_subplot(1, 3, 3)
axes1.set_ylabel('average')
axes1.plot(numpy.mean(data, axis=0))
axes2.set_ylabel('max')
axes2.plot(numpy.max(data, axis=0))
axes3.set_ylabel('min')
axes3.plot(numpy.min(data, axis=0))
fig.tight_layout()
plt.show()
OUTPUT
waves_00.csv
OUTPUT
waves_10s.csv
OUTPUT
waves_80s.csv
Looking at these three decades, we can see similarities - with the average value representing a smooth climate with clear annual cycle. There are differences year on year, and the minimum and maximum data tend to be more variable than the mean.
Different types of division
You might have noticed that we calculated the number of years when reshaping the data, rather than using 10 years for each decade. This was useful here because the file containing data from the 2010s only has data from 6 years, but each one of these years did contain data for every month. If any year had missing data from any month, we would have needed to have done some additional preprocessing.
When we divided by 12 to get the number of years, we used
//
- you might have expected us just to use /
.
Python creates a float as a result from a division operation using
/
, even if both inputs are integers:
OUTPUT
float
However, if you use a float value as an array index, you get an error
(even if it refers to a whole number). //
is Python’s
integer division operator - meaning that the result will always be an
integer:
OUTPUT
int
If the result would normally result in a floating-point number, the actual result will instead be an integer rounded towards minus infinity:
OUTPUT
4.75
4
Sometimes, plots can help us spot patterns in data, or problems with data.
Let’s load waves_90s.csv
and reshape it so that the 3rd
column (the average signifncant wave height for each month) is converted
into 12 columns for each year.
PYTHON
data = numpy.loadtxt(fname = "waves_90s.csv", delimiter=',')
reshaped_data = numpy.reshape(data[:,2], [10,12])
print(data.shape)
print(reshaped_data.shape)
OUTPUT
(120, 3)
(10, 12)
If we try and take the mean for the entire year, we’ll see that there must be NaNs:
OUTPUT
np.float64(nan)
If we plot the reshaped data, we would see white squares where there are NaNs in the data. Let’s do the shaping operation again but this time in a more generic way where we don’t need to know how many years are represented in the dataset.
PYTHON
number_of_rows = data.shape[0]
# assume that every 12 rows represents one year of data and there's no missing data
number_of_years = number_of_rows//12
data = numpy.reshape(data[:,2], [number_of_years,12])
plt.imshow(data)
plt.show()
We can clearly see that there must have been some problem with the collection of data in the first 6 months of 1994. Apart from that, the rest of the data looks relatively sensible, with low wave heights in the summer months of the decade, and higher wave heights in the winter months of the decade. This means that we have no reason not to trust the rest of the dataset.
There are other ways that we could have determined where the missing
data is. The Numpy function isnan
will tell us whether any
given value is NaN; if we give it an ndarray
, it will
return an ndarray
of the same shape with boolean values
(True
or False
) showing if the value at that
index was a NaN. argwhere
will return the indices for an
ndarray
of booleans where the value is
True
.
OUTPUT
array([[4, 0],
[4, 1],
[4, 2],
[4, 3],
[4, 4],
[4, 5]])
There are almost always multiple ways to achieve the same result in programming, and often the choice as to which method to use comes down to personal preference. Some people might prefer to look at patterns in plots, while others may be happy to look at summary statistics and aggregate values from tabular data to come to the same conclusion.
Plotting Differences
Plot the difference between the average monthly waveheights in the 1980s and 1990s. Can you do this using glob, without specifying the filenames explicitly in the code?
PYTHON
import glob
import numpy
import matplotlib.pyplot as plt
filenames = sorted(glob.glob('waves_*.csv'))
data0 = numpy.loadtxt(fname=filenames[2], delimiter=',')
data1 = numpy.loadtxt(fname=filenames[3], delimiter=',')
number_of_rows0 = data0.shape[0]
number_of_years0 = number_of_rows0//12
data0 = numpy.reshape(data0[:,2], [number_of_years0,12])
number_of_rows1 = data1.shape[0]
number_of_years1 = number_of_rows1//12
data1 = numpy.reshape(data1[:,2], [number_of_years1,12])
fig = plt.figure(figsize=(10.0, 3.0))
plt.ylabel('Difference in average')
plt.plot(numpy.mean(data0, axis=0) - numpy.mean(data1, axis=0))
fig.tight_layout()
plt.show()
Congratulations! We’ve investigated the wave data and spotted some trends, and identified bad / missing data. Now that we understand what is in the model data set, and trust that it is realistic, we can use it to make practical decisions about things like boat operations
Key Points
- “Use
glob.glob(pattern)
to create a list of files whose names match a pattern.” - “Use
*
in a pattern to match zero or more characters, and?
to match any single character.”