Analyzing Data from Multiple Files

Overview

Teaching: 20 min
Exercises: 10 min

Questions

How can I do the same operations on many different files?

Objectives

Use a library function to get a list of filenames that match a wildcard pattern.

Write a for loop to process multiple files.

When working with real datasets, we may not always have everything in a single file. We need a way to get a list of all the files in our data directory whose names start with wave_ and end with .csv. The following library will help us to achieve this:

import glob

The glob library contains a function, also called glob, that finds files and directories whose names match a pattern. We provide those patterns as strings: the character * matches zero or more characters, while ? matches any one character. We can use this to get the names of all the CSV files in the current directory:

print(glob.glob('waves_*.csv'))

['waves_10s.csv', 'waves_00s.csv', 'waves_80s.csv', 'waves_90s.csv']

These files show waveheight data from some years in the decades indicated in the filename. As this example shows, glob.glob’s result is a list of files in arbitrary order. This means we can loop over it to do something with each filename in turn. In our case, the “something” we want to do is generate a set of plots for each file in our waveheight dataset.

If we want to start by analyzing just the first three files in alphabetical order, we can use the sorted built-in function to generate a new sorted list from the glob.glob output:

import glob
import numpy
import matplotlib.pyplot as plt

filenames = sorted(glob.glob('waves_*.csv'))
filenames = filenames[0:3]

for filename in filenames:
    print(filename)
    
    data = numpy.loadtxt(fname=filename, delimiter=',')
    number_of_rows = data.shape[0]
    number_of_years = number_of_rows//12

    # need to reshape the data for plotting
    data = numpy.reshape(data[:,2], [number_of_years,12])
    
    fig = plt.figure(figsize=(10.0, 3.0))
        
    axes1 = fig.add_subplot(1, 3, 1)
    axes2 = fig.add_subplot(1, 3, 2)
    axes3 = fig.add_subplot(1, 3, 3)

    axes1.set_ylabel('average')
    axes1.plot(numpy.mean(data, axis=0))

    axes2.set_ylabel('max')
    axes2.plot(numpy.max(data, axis=0))

    axes3.set_ylabel('min')
    axes3.plot(numpy.min(data, axis=0))

    fig.tight_layout()
    plt.show()

waves_00.csv

Output from the first iteration of the for loop. Three line graphs showing the average,
maximum and minimum waveheight in the 2000s.

waves_10s.csv

Output from the second iteration of the for loop. Three line graphs showing the average,
maximum and minimum waveheight in the 2010s.

waves_80s.csv

Output from the third iteration of the for loop. Three line graphs showing the average,
maximum and minimum waveheight in the 1980s.

Looking at these three decades, we can see similarities - with the average value representing a smooth climate with clear annual cycle. There are differences year on year, and the minimum and maximum data tend to be more variable than the mean.

Different types of division

You might have noticed that we calculated the number of years when reshaping the data, rather than using 10 years for each decade. This was useful here because the file containing data from the 2010s only has data from 6 years, but each one of these years did contain data for every month. If any year had missing data from any month, we would have needed to have done some additional preprocessing.

When we divided by 12 to get the number of years, we used // - you might have expected us just to use /. Python creates a float as a result from a division operation using /, even if both inputs are integers:
type(10/2)
float
However, if you use a float value as an array index, you get an error (even if it refers to a whole number). // is Python’s integer division operator - meaning that the result will always be an integer:
type(10//2)
int
If the result would normally result in a floating-point number, the actual result will instead be an integer rounded towards minus infinity:
print(19/4)
print(19//4)
4.75
4

Sometimes, plots can help us spot patterns in data, or problems with data.

Let’s load waves_90s.csv and reshape it so that the 3rd column (the average signifncant wave height for each month) is converted into 12 columns for each year.

data = numpy.loadtxt(fname = "waves_90s.csv", delimiter=',')
reshaped_data = numpy.reshape(data[:,2], [10,12])
print(data.shape)
print(reshaped_data.shape)

(120, 3)
(10, 12)

If we try and take the mean for the entire year, we’ll see that there must be NaNs:

numpy.mean(reshaped_data)

np.float64(nan)

If we plot the reshaped data, we would see white squares where there are NaNs in the data. Let’s do the shaping operation again but this time in a more generic way where we don’t need to know how many years are represented in the dataset.

number_of_rows = reshaped_data.shape[0]
# assume that every 12 rows represents one year of data and there's no missing data
number_of_years = number_of_rows//12
data = numpy.reshape(data[:,2], [number_of_years,12])
plt.imshow(data)
plt.show()

1990s data highlighting NaNs

We can clearly see that there must have been some problem with the collection of data in the first 6 months of 1994. Apart from that, the rest of the data looks relatively sensible, with low wave heights in the summer months of the decade, and higher wave heights in the winter months of the decade. This means that we have no reason not to trust the rest of the dataset.

There are other ways that we could have determined where the missing data is. The Numpy function isnan will tell us whether any given value is NaN; if we give it an ndarray, it will return an ndarray of the same shape with boolean values (True or False) showing if the value at that index was a NaN. argwhere will return the indices for an ndarray of booleans where the value is True.

numpy.argwhere(numpy.isnan(data))

array([[4, 0],
       [4, 1],
       [4, 2],
       [4, 3],
       [4, 4],
       [4, 5]])

There are almost always multiple ways to achieve the same result in programming, and often the choice as to which method to use comes down to personal preference. Some people might prefer to look at patterns in plots, while others may be happy to look at summary statistics and aggregate values from tabular data to come to the same conclusion.

Plotting Differences

Plot the difference between the average monthly waveheights in the 1980s and 1990s. Can you do this using glob, without specifying the filenames explicitly in the code?

Solution

import glob
import numpy
import matplotlib.pyplot as plt

filenames = sorted(glob.glob('waves_*.csv'))

data0 = numpy.loadtxt(fname=filenames[2], delimiter=',')
data1 = numpy.loadtxt(fname=filenames[3], delimiter=',')

number_of_rows0 = data0.shape[0]
number_of_years0 = number_of_rows0//12 
data0 = numpy.reshape(data0[:,2], [number_of_years0,12])

number_of_rows1 = data1.shape[0]
number_of_years1 = number_of_rows1//12 
data1 = numpy.reshape(data1[:,2], [number_of_years1,12])

fig = plt.figure(figsize=(10.0, 3.0))

plt.ylabel('Difference in average')
plt.plot(numpy.mean(data0, axis=0) - numpy.mean(data1, axis=0))

fig.tight_layout()
plt.show()

Congratulations! We’ve investigated the wave data and spotted some trends, and identified bad / missing data. Now that we understand what is in the model data set, and trust that it is realistic, we can use it to make practical decisions about things like boat operations

Key Points

Use glob.glob(pattern) to create a list of files whose names match a pattern.

Use * in a pattern to match zero or more characters, and ? to match any single character.

previous episode

Programming with Python

next episode

Analyzing Data from Multiple Files

Overview

Different types of division

Plotting Differences

Solution

Key Points

previous episode

next episode