Loading and Analyzing Argo Float Data
Last updated on 2025-10-14
Overview
Questions
- “How can I process tabular data files in Python?”
Objectives
- “Explain what a library is and what libraries are used for.”
- “Import a Python library and use the functions it contains.”
- “Read tabular data from a file into a program.”
- “Select individual values and subsections from data.”
- “Perform operations on arrays of data.”
Words are useful, but what’s more useful are the sentences and stories we build with them. Similarly, while a lot of powerful, general tools are built into Python, specialized tools built up from these basic units live in libraries that can be called upon when needed.
Loading data into Python
To begin processing the Argo data, we need to load it into Python. We can do that using a library called NumPy, which stands for Numerical Python. In general, you should use this library when you want to do fancy things with lots of numbers, especially if you have matrices or arrays. To tell Python that we’d like to start using NumPy, we need to import it:
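The import itself is a single line, the same `import numpy` that appears in the exercise solution later in this episode. As a minimal sketch, the version check on the last line is only there to show that the library is now available:

```python
# Make the NumPy library available under the name "numpy".
import numpy

# After the import, everything NumPy provides is reached through the "numpy." prefix.
print(numpy.__version__)
```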
Importing a library is like getting a piece of lab equipment out of a storage locker and setting it up on the bench. Libraries provide additional functionality to the basic Python package, much like a new piece of equipment adds functionality to a lab space. Just like in the lab, importing too many libraries can sometimes complicate and slow down your programs - so we only import what we need for each program.
Before we load any data it can be helpful to tell NumPy not to print
all the lines in our data since some of our data is quite big and we
probably don’t want to see every line of it. NumPy includes
a function called set_printoptions which we can use to tell
NumPy how many lines of our data to show.
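The exact settings used in the lesson are not shown here, so the following is only a sketch. It assumes the `threshold` and `edgeitems` parameters of `numpy.set_printoptions`, with values picked purely for illustration, and uses a stand-in array rather than the Argo data:

```python
import numpy

# Assumed settings (not taken from the lesson): summarise any array with more
# than 10 elements, showing only 3 items at the start and end of each dimension.
numpy.set_printoptions(threshold=10, edgeitems=3)

# A stand-in array of 100 numbers; the real Argo data is loaded later.
numbers = numpy.arange(100)
summary = str(numbers)

# The middle of the array is replaced by "..." when it is printed.
print(summary)
```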
Functions, Parameters and Return Values
- In the last episode we looked at using the print and type functions, which are built into Python.
- We “call” a function by writing its name followed by a (, then we can give the values of any parameters that the function might need. If there is more than one of these we separate each of them with a comma. Finally we write a closing ) to end the function call.
- Parameters have to be given in the order the function expects them. Alternatively we can put a name in front of each parameter followed by an = sign and the parameter value or the name of the variable we are sending.
- Functions can also send data back to the code which called them; this is known as “returning” data from a function.
- We can save this return data into a variable to use it again later. If we don’t save it into a variable then its value is displayed on the screen.
- When we import a library like NumPy, more functions become available to us.
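To make the difference between ordered and named parameters concrete, here is a small sketch using Python's built-in `int` function. The function and the values are chosen for illustration and are not part of the lesson's data:

```python
# Positional parameters: values are matched to parameters by their order.
# Here '101' is the text to convert and 2 is the number base (binary).
from_position = int('101', 2)

# Named parameter: the name "base" says which parameter the value 2 is for.
from_keyword = int('101', base=2)

# Both calls return the same value, which we saved in variables to use again.
print(from_position, from_keyword)
```

Because the return values were saved into variables, nothing is displayed until we call `print` ourselves.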
Once we’ve imported the NumPy library, we can ask it to
read our data file for us:
But this gives us a FileNotFoundError because we don’t have a file
called argo_data.csv yet.
OUTPUT
---------------------------------------------------------------------------
FileNotFoundError Traceback (most recent call last)
Cell In[3], line 1
----> 1 numpy.loadtxt(fname='argo_data.csv', delimiter=',', skiprows=1)
...
This file is available from https://raw.githubusercontent.com/NOC-OI/python-for-future-oceanographers/refs/heads/main/data/argo_data.csv.
We can download this using the external command wget.
This is not part of Python and we can tell Jupyter to run it by starting
the cell with an !.
PYTHON
!wget https://raw.githubusercontent.com/NOC-OI/python-for-future-oceanographers/refs/heads/main/data/argo_data.csv
Or we can change the filename to the full web address and NumPy will get the file from the Internet for us.
PYTHON
numpy.loadtxt(fname='https://raw.githubusercontent.com/NOC-OI/python-for-future-oceanographers/refs/heads/main/data/argo_data.csv', delimiter=',', skiprows=1)
OUTPUT
array([[0.0000000e+00, 3.5025002e+01, 2.8898001e+01, 3.0000000e+00],
[1.0000000e+00, 3.5026001e+01, 2.8898001e+01, 4.0000000e+00],
[2.0000000e+00, 3.5026001e+01, 2.8896000e+01, 5.0000000e+00],
...,
[1.0500000e+02, 3.4988998e+01, 3.7710000e+00, 1.9380000e+03],
[1.0600000e+02, 3.4987999e+01, 3.7340000e+00, 1.9630000e+03],
[1.0700000e+02, 3.4987999e+01, 3.6930000e+00, 1.9890000e+03]])
The expression numpy.loadtxt(...) is a function call that asks Python
to run the function loadtxt, which belongs to the numpy library.
Dot notation is used throughout Python to refer to something that
belongs to an object: object_name.property gives the value of one of
the object’s properties (also called attributes), and
object_name.method() calls one of the object’s methods.
As an example, John Smith is the John that belongs to the Smith
family. We could use the dot notation to write his name
smith.john, just as loadtxt is a function that
belongs to the numpy library.
Our call to numpy.loadtxt uses three parameters. The first two are the
name of the file we want to read (fname) and the delimiter that
separates values on a line (delimiter). These both need to be character
strings (or strings for short), so we put them in quotes. The third,
skiprows=1, tells NumPy to skip the first row, which contains the
column titles.
Since we haven’t told it to do anything else with the function’s
output, the notebook displays it.
In this case, that output is the data we just loaded. By default, only a
few rows and columns are shown (with ... to omit elements
when displaying big arrays). Note that, to save space when displaying
NumPy arrays, Python does not show us trailing zeros, so
1.0 is displayed as 1. (a one followed by a bare decimal point).
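If you want to try `numpy.loadtxt` without downloading anything, the sketch below reads the same kind of text from memory instead of from a file. It relies on `loadtxt` also accepting an open file-like object. The three data rows are the first three readings shown above, but the column titles in the header row are made up for this example:

```python
import io
import numpy

# Stand-in for argo_data.csv: a header row (these column titles are invented)
# followed by the first three readings shown in the output above.
csv_text = """seq,salinity,temperature,pressure
0,35.025002,28.898001,3
1,35.026001,28.898001,4
2,35.026001,28.896,5
"""

# fname is the source to read, delimiter is the character between values,
# and skiprows=1 skips the header row, which cannot be converted to numbers.
sample = numpy.loadtxt(fname=io.StringIO(csv_text), delimiter=',', skiprows=1)
print(sample)
```

Leaving out `skiprows=1` here would make NumPy try to turn the column titles into numbers and fail with an error.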
Our call to numpy.loadtxt read our file but didn’t save
the data in memory. To do that, we need to assign the array to a
variable. In a similar manner to how we assign a single value to a
variable, we can also assign an array of values to a variable using the
same syntax. Let’s re-run numpy.loadtxt and save the
returned data:
This statement doesn’t produce any output because we’ve assigned the
output to the variable data. If we want to check that the
data have been loaded, we can print the variable’s value:
OUTPUT
[[0.0000000e+00 3.5025002e+01 2.8898001e+01 3.0000000e+00]
[1.0000000e+00 3.5026001e+01 2.8898001e+01 4.0000000e+00]
[2.0000000e+00 3.5026001e+01 2.8896000e+01 5.0000000e+00]
...
[1.0500000e+02 3.4988998e+01 3.7710000e+00 1.9380000e+03]
[1.0600000e+02 3.4987999e+01 3.7340000e+00 1.9630000e+03]
[1.0700000e+02 3.4987999e+01 3.6930000e+00 1.9890000e+03]]
Now that the data are in memory, we can manipulate them. First, let’s
ask what type of thing
data refers to:
OUTPUT
<class 'numpy.ndarray'>
The output tells us that data currently refers to a
NumPy array, the functionality for which is provided by the NumPy
library. These data correspond to Argo float data. Each row represents
one reading and the columns are the different data values.
Data Type
A NumPy array contains one or more elements of the same type. The
type function will only tell you that a variable is a NumPy
array but won’t tell you the type of thing inside the array. We can
also ask the array itself for the type of the data it contains.
OUTPUT
float64
This tells us that the NumPy array’s elements are floating-point numbers.
With the following command, we can see the array’s shape:
OUTPUT
(108, 4)
The output tells us that the data array variable
contains 108 rows and 4 columns (sequence number, conductivity/salinity,
temperature and pressure/depth).
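The three checks described above can be sketched together on a small stand-in array. The values are the first five readings from the output earlier in this episode, typed in by hand so the example runs without the data file. The name `sample` is just for this illustration:

```python
import numpy

# First five readings (sequence number, salinity, temperature, pressure).
sample = numpy.array([
    [0.0, 35.025002, 28.898001, 3.0],
    [1.0, 35.026001, 28.898001, 4.0],
    [2.0, 35.026001, 28.896, 5.0],
    [3.0, 35.025002, 28.893, 6.0],
    [4.0, 35.025002, 28.892, 7.0],
])

print(type(sample))   # what kind of thing the variable refers to
print(sample.dtype)   # the type of the elements inside the array
print(sample.shape)   # (number of rows, number of columns)
```

Note that `dtype` and `shape` have no parentheses after them: they are values stored with the array, not functions we call.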
If we want to get a single number from the array, we must provide an index in square brackets after the variable name, just as we do in math when referring to an element of a matrix. Our data has two dimensions, so we will need to use two indices to refer to one specific value:
OUTPUT
first value in data: 28.898001
OUTPUT
middle value in data: 9.876
The expression data[53, 2] accesses the element at the
54th row and 3rd column, not the 53rd row and 2nd column as you might
think. Programming languages like Fortran, MATLAB and R start counting
at 1 because that’s what human beings have done for thousands of years.
Languages in the C family (including C++, Java, Perl, and Python) count
from 0 because it represents an offset from the first value in the array
(the second value is offset by one index from the first value). This is
closer to the way that computers represent arrays (if you are interested
in the historical reasons behind counting indices from zero, you can
read Mike
Hoye’s blog post). As a result, if we have an M×N array in Python,
its indices go from 0 to M-1 on the first axis and 0 to N-1 on the
second. It takes a bit of getting used to, but one way to remember the
rule is that the index is how many steps we have to take from the start
to get the item we want.
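Counting from zero is easier to see on a small array. The sketch below uses the first five readings shown earlier as a stand-in for the full dataset, so the valid row indices are 0 to 4 and the valid column indices are 0 to 3:

```python
import numpy

# First five readings (sequence number, salinity, temperature, pressure).
sample = numpy.array([
    [0.0, 35.025002, 28.898001, 3.0],
    [1.0, 35.026001, 28.898001, 4.0],
    [2.0, 35.026001, 28.896, 5.0],
    [3.0, 35.025002, 28.893, 6.0],
    [4.0, 35.025002, 28.892, 7.0],
])

# Row 0 is the first row and column 2 is the third column (temperature).
first_temperature = sample[0, 2]

# With 5 rows and 4 columns, the last element is at [4, 3], not [5, 4].
last_pressure = sample[4, 3]

print('first temperature:', first_temperature)
print('last pressure:', last_pressure)

# Asking for row 5 is one step too far and raises an IndexError.
try:
    sample[5, 0]
except IndexError as error:
    print('IndexError:', error)
```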
In the Corner
What may also surprise you is that when Python displays an array, it
shows the element with index [0, 0] in the upper left
corner rather than the lower left. This is consistent with the way
mathematicians draw matrices but different from Cartesian
coordinates. The indices are (row, column) instead of (column, row) for
the same reason, which can be confusing when plotting data.
Explore the data
If you haven’t already, download the data we have been using with the
wget command:
PYTHON
!wget https://raw.githubusercontent.com/NOC-OI/python-for-future-oceanographers/refs/heads/main/data/argo_data.csv
You should then see a file called argo_data.csv appear
in the file manager on the left hand side of your screen. Click on this
file and open it.
What values do columns 1, 2 and 3 represent?
Now load the data using NumPy and write some Python code to read from the data. What is the temperature on the last row of the data?
Column 1 is salinity, column 2 is temperature and column 3 is pressure.
We can find the final temperature value on row 107, column 2 (counting from zero).
PYTHON
import numpy
data = numpy.loadtxt(fname="argo_data.csv", delimiter=',', skiprows=1)
# There are 108 rows in the data, so row number 107 is the last one because we count from 0.
print(data[107,2])
The temperature value on the last row is 3.693 degrees Celsius.
Slicing data
An index like [53, 2] selects a single element of an
array, but we can select whole sections as well. For example, we can
select the Argo data for the first five readings like this:
OUTPUT
[[ 0. 35.025002 28.898001 3. ]
[ 1. 35.026001 28.898001 4. ]
[ 2. 35.026001 28.896 5. ]
[ 3. 35.025002 28.893 6. ]
[ 4. 35.025002 28.892 7. ]]
The slice 0:5 means,
“Start at index 0 and go up to, but not including, index 5”. Again, the
up-to-but-not-including takes a bit of getting used to, but the rule is
that the difference between the upper and lower bounds is the number of
values in the slice.
We don’t have to start slices at 0:
OUTPUT
[[35.027 28.896 8. ]
[35.025002 28.902 9. ]
[35.026001 28.900999 10. ]
[35.027 28.907 16. ]
[35.549999 28.858999 26. ]]
We also don’t have to include the upper and lower bounds on the slice. If we don’t include the lower bound, Python uses 0 by default; if we don’t include the upper, the slice runs to the end of the axis; and if we don’t include either (i.e., if we use ‘:’ on its own), the slice includes everything:
The above example selects rows 0 through 4 and columns 1 through to the end of the array (which gives us the salinity, temperature and depth).
OUTPUT
data from first five readings is:
[[35.025002 28.898001 3. ]
[35.026001 28.898001 4. ]
[35.026001 28.896 5. ]
[35.025002 28.893 6. ]
[35.025002 28.892 7. ]]
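The slicing rules above can be tried out on a small stand-in array. This sketch again uses the first five readings typed in by hand, and the variable names are chosen just for this example:

```python
import numpy

# First five readings (sequence number, salinity, temperature, pressure).
sample = numpy.array([
    [0.0, 35.025002, 28.898001, 3.0],
    [1.0, 35.026001, 28.898001, 4.0],
    [2.0, 35.026001, 28.896, 5.0],
    [3.0, 35.025002, 28.893, 6.0],
    [4.0, 35.025002, 28.892, 7.0],
])

# Rows 0 and 1 (index 2 is not included), all columns.
first_two = sample[0:2, :]

# All rows, columns 1 to the end: this drops the sequence number.
no_sequence = sample[:, 1:]

# All rows, column 2 only: a one-dimensional array of temperatures.
temperatures = sample[:, 2]

print(first_two.shape)
print(no_sequence.shape)
print(temperatures)
```

The number of rows in `first_two` is 2 - 0 = 2, which matches the rule that the difference between the bounds is the number of values in the slice.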
Slicing Strings
A section of an array is called a slice. We can take slices of character strings as well:
PYTHON
element = 'oxygen'
print('first three characters:', element[0:3])
print('last three characters:', element[3:6])
OUTPUT
first three characters: oxy
last three characters: gen
What is the value of element[:4]? What about
element[4:]? Or element[:]?
OUTPUT
oxyg
en
oxygen
Not All Functions Have Input
Generally, a function uses inputs to produce outputs. However, some
functions produce outputs without needing any input. These functions
don’t need any parameters, so we just write () after the
function name.
For example, checking the current time doesn’t require any input.
OUTPUT
Sat Mar 26 13:07:33 2016
We still need parentheses (()) to tell Python to go and
do something for us.
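The output above has the format produced by `time.ctime` from Python's standard `time` library, so a sketch of such a call, assuming that is the function intended, might look like this:

```python
import time

# ctime is called with empty parentheses: it needs no input from us
# and returns the current date and time as a string.
current_time = time.ctime()
print(current_time)
```

The text printed will differ every time the cell is run, because it reports the moment at which the function was called.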
Loading data with ArgoPy
Instead of passing around spreadsheets or CSV files of data, all of
the data recorded by Argo floats is sent to a Data Assembly Centre
(DAC). After some checks of the data have been made it is sent to a
Global Data Assembly Centre (GDAC). There are two of these, one in the
USA and one in France, but they both hold a copy of all of the Argo data
ever received. To make accessing the data easy from Python, a special
library called argopy has been developed. This can load
data directly from one of the GDACs and turn it into a NumPy array. This
saves us having to search through the GDAC, picking the data we want and
downloading it to a file on our computer.
The argopy library has a lot of different features, but
we want to use DataFetcher, which gets data
from a GDAC. Calling DataFetcher gives us back something
called an object, which has more functions we can call. One of these is
called profile and that gets an individual profile given a
float number and a profile number. The data we’ve been using came from
profile 12 of float number 6902746.
If we run the profile function with the float number and profile
number we get back a datafetcher.erddap object.
OUTPUT
<datafetcher.erddap>
Name: Ifremer erddap Argo data fetcher for floats
API: https://erddap.ifremer.fr/erddap/
Domain: phy;WMO6902746
Performances: cache=False, parallel=False
User mode: standard
Dataset: phy
This doesn’t contain much useful data, although it does tell us which
GDAC supplied the data. To get the actual data we need to call yet
another function that the datafetcher.erddap object
provides, called to_xarray. This gets the data ready for
processing using another library called Xarray, which works well with
NumPy data and is very good at working with really big datasets.
Now we get a lot more information including a list of what data
variables this float has. To get one of those we add its name to the end
of the command; for example, to get temperature we add
.TEMP.
Now we have something that looks like real data. There is one
last thing, however: the type of this data is xarray.DataArray, not
numpy.ndarray. To do that final conversion we add
.values on the end (note that there are no brackets on this,
as this is a variable name, not a function).
Let’s capture this into a variable called temp_data and
check its type.
Now we have a NumPy array with our temperature data.
OUTPUT
numpy.ndarray
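Putting the steps of this section together, the whole chain can be sketched as below. It follows the same `argopy.DataFetcher().profile(...).to_xarray().TEMP.values` pattern used in the exercise solution later in this episode. Wrapping it in a function named `fetch_temperature` is a choice made for this sketch, so that nothing is downloaded until the function is called:

```python
def fetch_temperature(float_number, profile_number):
    """Return one profile's temperatures as a NumPy array.

    Calling this needs the argopy library and an internet connection.
    """
    # Imported inside the function so that defining it works even
    # where argopy has not been installed yet.
    import argopy

    # Step 1: ask the GDAC for one profile of one float.
    fetcher = argopy.DataFetcher().profile(float_number, profile_number)
    # Step 2: download the data as an Xarray dataset.
    dataset = fetcher.to_xarray()
    # Step 3: pick the temperature variable and convert it to a NumPy array.
    return dataset.TEMP.values


# Example call (it downloads data, so it is left commented out here):
# temp_data = fetch_temperature(6902746, 12)
# print(type(temp_data))
```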
This should be the same as the 3rd (2nd if you count from zero!) column of our earlier data. Let’s do a basic check of this by comparing the mean values.
PYTHON
print(temp_data.mean())
print(data[:,2].mean())
OUTPUT
13.058639
13.058638888888888
The values are slightly different because they were rounded a little when they were saved into the CSV file.
Analyzing data
NumPy has several useful functions that take an array as input to
perform operations on its values. If we want to find the average of all
our Argo float data, for example, we can ask NumPy to compute
data’s mean value:
OUTPUT
219.47419444212963
mean is a function
that takes an array as an argument. Given that our array
contains the sequence numbers and three different data variables, taking
the mean of the whole array doesn’t really make much sense.
We can use slicing to calculate the mean temperature from our dive:
OUTPUT
13.058638888888888
Let’s use two other NumPy functions to get some descriptive values about the temperature range.
PYTHON
maxval = numpy.max(data[:,2])
minval = numpy.min(data[:,2])
print('Max temperature:', maxval)
print('Min temperature:', minval)
Here we’ve assigned the return value from
numpy.max(data[:,2]) to the variable maxval
and the value from numpy.min(data[:,2]) to
minval. Note that we used maxval, rather than
just max - it’s not good practice to use variable names
that are the same as Python
keywords or function names.
OUTPUT
Max temperature: 28.907
Min temperature: 3.693
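To see why slicing matters before computing statistics, here is a sketch on a small stand-in array built from the first five readings shown earlier. The results therefore differ from the full-dataset values above:

```python
import numpy

# First five readings (sequence number, salinity, temperature, pressure).
sample = numpy.array([
    [0.0, 35.025002, 28.898001, 3.0],
    [1.0, 35.026001, 28.898001, 4.0],
    [2.0, 35.026001, 28.896, 5.0],
    [3.0, 35.025002, 28.893, 6.0],
    [4.0, 35.025002, 28.892, 7.0],
])

# The mean of everything mixes four unrelated quantities together.
overall_mean = numpy.mean(sample)

# Selecting column 2 first gives statistics for temperature only.
temperature = sample[:, 2]
mean_temp = numpy.mean(temperature)
max_temp = numpy.max(temperature)
min_temp = numpy.min(temperature)

print('Mean of whole array:', overall_mean)
print('Mean temperature:', mean_temp)
print('Max temperature:', max_temp)
print('Min temperature:', min_temp)
```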
Getting help on functions
How did we know what functions NumPy has and how to use them? If you
are working in IPython or in a Jupyter Notebook, there is an easy way to
find out. If you type the name of something followed by a dot, then you
can use tab completion
(e.g. type numpy. and then press Tab) to see a
list of all functions and attributes that you can use. After selecting
one, you can also add a question mark (e.g. numpy.abs?),
and IPython will return an explanation of the method! This is the same
as doing help(numpy.abs).
Find the temperature range for an Arctic float
The float 5906983 has been deployed in the Arctic by NOC for the Met Office. You can see a map of where it’s been at https://fleetmonitoring.euro-argo.eu/float/5906983.
Adapt the code above to load profile number 33 from float 5906983. Calculate its minimum, maximum, mean and median temperature.
We haven’t calculated the median before; search on the internet or look at the NumPy documentation (https://numpy.org/devdocs/reference/routines.statistics.html) to find out how to calculate it.
PYTHON
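import argopy
import numpy
# Fetch profile 33 of float 5906983 and keep only the temperatures as a NumPy array.
temperatures = argopy.DataFetcher().profile(5906983, 33).to_xarray().TEMP.values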
maxval = numpy.max(temperatures)
minval = numpy.min(temperatures)
meanval = numpy.mean(temperatures)
medianval = numpy.median(temperatures)
print('Max temperature:', maxval)
print('Min temperature:', minval)
print('Mean Temperature:', meanval)
print('Median Temperature:', medianval)
OUTPUT
Max temperature: 7.463699817657471
Min temperature: -0.6674000024795532
Mean Temperature: 2.9974963312872487
Median Temperature: 3.9305999279022217
- “Import a library into a program using import libraryname.”
- “Use the numpy library to work with arrays in Python.”
- “The expression array.shape gives the shape of an array.”
- “Use array[x, y] to select a single element from a 2D array.”
- “Array indices start at 0, not 1.”
- “Use low:high to specify a slice that includes the indices from low to high-1.”
- “Use numpy.mean(array), numpy.max(array), and numpy.min(array) to calculate simple statistics.”
- “The argopy library can load Argo float data over the internet from the GDAC.”