This lesson is in the early stages of development (Alpha version)

Introduction to the Web and Online APIs

HTTP

Overview

Teaching: 15 min
Exercises: 5 min
Questions
  • What are protocols and ports?

  • What are HTTP and HTTPS?

  • What are requests and responses? How can we look at them?

Objectives
  • Understand the meaning of the terms protocol and port.

  • Understand what HTTP and HTTPS are, and how they relate to the Web and other aspects of the modern Internet.

  • Be able to use curl to make requests and view responses.

Since it was first introduced to the world in 1991, the World Wide Web has gone from the toy of computer scientists and particle physicists to a dominant part of everyday life for billions of people. At its core, the initial World Wide Web concept brought together three key ideas:

  1. The use of HTML (Hypertext Markup Language) documents which could contain hyperlinks to other documents (or different parts of the same document). These could reference documents located on any web server in the world.
  2. That every file on the world wide web would have a unique URL (Uniform Resource Locator).
  3. The Hypertext Transfer Protocol (HTTP) that is used to transfer data from the web server to the requesting client.

It has gradually consumed many services that were previously separate online services, or not available on the Internet at all.

Since the mid-2000s, the Web has increasingly been used to go beyond this traditional model of serving HTML to browsers. The same HTTP protocol which once served static HTML pages and images is now used to send dynamic content generated on the fly for consumption by other computer programs.

These Application Programming Interfaces (APIs) provide incredible amounts of structured data, as well as the ability to control things that may previously have required specialist proprietary software or even hardware. The data available via web APIs is particularly useful for data scientists: many data are now only made available via these APIs, and even in cases where data are made available in other formats, using an API is frequently more convenient.

To make effective use of web APIs, we need to understand a little more about how the Web works than a typical Web user might. This lesson will focus on clients—computers and software applications that make requests to other computers or applications, and receive information in response. Computers and applications that respond to such requests are referred to as servers.

Protocols and ports

You may (or may not) have wondered how it is that different web browsers, written independently by different companies and running on different operating systems, are able to talk to the same web servers using the same addresses, and get the same web pages back. This is because all web browsers implement the HyperText Transfer Protocol, or HTTP.

A protocol is nothing more than a system of rules that allows communication between computers (or other devices). Much like a (human) language, it defines rules and syntax that, when followed by all parties, allow information to be transmitted from one device to another. Other examples of protocols you may be familiar with include the Secure Shell SSH, the File Transfer Protocol FTP, and the Simple Mail Transfer Protocol SMTP. Wikipedia has a long list of protocols that are (or once were) in common usage. HTTPS is a protocol closely related to HTTP; it follows many of the same conventions as HTTP, particularly in the way client and server code is written, but includes additional encryption to ensure that untrusted third parties can’t read or modify data in transit.

Given the large number of protocols in existence, computers need a way to identify which protocol a particular network connection is using, in particular on devices that have many different servers running. This is done by another set of protocols, which the above protocols build on top of: the Transmission Control Protocol (TCP) and the User Datagram Protocol (UDP). The difference between these isn’t important today; the important fact is that both protocols define port numbers (or ports) that are used to identify which server should handle a particular connection.

A server application must register a particular port to listen for connections on, and then all connections with that port number will be directed to that application. Ports are numbered 1–65,535, with ports up to 1,023 being “system ports” that on Unix-like systems require root access to listen to. Many protocols have standard ports that are used by convention—for example, HTTP uses port 80 by default, and HTTPS port 443. However, there is nothing stopping any protocol being used on any port.
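The idea of registering a port can be sketched with Python’s standard socket module. In this sketch we bind to port 0, which asks the operating system to pick any free port for us; a real server would instead pass a fixed, well-known number such as 80 or 443.

```python
import socket

# A server claims a port by binding a socket to it and listening.
# Binding to port 0 asks the OS to pick any free port; a real server
# would pass a fixed number such as 80 (HTTP) or 443 (HTTPS).
server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server.bind(("localhost", 0))
server.listen()

# The OS reports which port it assigned; connections to this port
# will now be directed to this application.
port = server.getsockname()[1]
print("Listening on port", port)

server.close()
```

Only one application can bind a given port at a time, which is why a second server started on the same port will fail with an "address already in use" error.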

You may have noticed that web addresses sometimes include a colon and a number after the server name; this indicates to the browser which port to connect on, in cases where you don’t want to connect to the default port (80 or 443). For example, Jupyter notebooks are frequently served at http://localhost:8888; this indicates that your browser should make an HTTP connection to your own local machine, on port 8888. Since only one application can listen to a port at a time, sometimes Jupyter finds it can’t listen on port 8888, and so will reserve port 8889 or 8890 instead.

URLs

A URL (also sometimes known as a URI, or Uniform Resource Identifier) consists of two or three parts: the protocol followed by ://, the server name or IP address, and optionally the path to the resource we wish to access. For example, the URL http://carpentries.org means we want to access the default location on the server carpentries.org using the HTTP protocol. The URL https://carpentries.org/contact/ means we want to access the contact location on the carpentries.org server using the secure HTTPS protocol.
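Python’s standard library can split a URL into these parts for us. A quick sketch using urllib.parse:

```python
from urllib.parse import urlsplit

# Split a URL into the parts described above.
parts = urlsplit("https://carpentries.org/contact/")
print(parts.scheme)   # the protocol: https
print(parts.netloc)   # the server name: carpentries.org
print(parts.path)     # the path to the resource: /contact/
```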

Requests and responses

The two main objects in HTTP are the request and the response. Each HTTP connection is initiated by sending a request, and is replied to with a response. Both the request and the response have a header, which defines metadata about what is requested or what is included in the response, and both can also have a body containing data. To look at these in more detail, we can use the curl command. Specifically, to see the request headers, we can use curl -v followed by the URL we wish to request.

$ curl -v http://carpentries.org
*   Trying 13.32.168.28...
* TCP_NODELAY set
* Connected to carpentries.org (13.32.168.28) port 80 (#0)
> GET / HTTP/1.1
> Host: carpentries.org
> User-Agent: curl/7.64.1
> Accept: */*
>
< HTTP/1.1 301 Moved Permanently
< Server: CloudFront
< Date: Sat, 13 Mar 2021 01:10:22 GMT
< Content-Type: text/html
< Content-Length: 183
< Connection: keep-alive
< Location: https://carpentries.org/
< X-Cache: Redirect from cloudfront
< Via: 1.1 f25763791d7f1173b560742bb9507145.cloudfront.net (CloudFront)
< X-Amz-Cf-Pop: LHR62-C5
< X-Amz-Cf-Id: JJLCGx6qUOpaid_ArD0kph8QddidHgWnKoi72yNn0Jazmla8H5mUGg==
<
<html>
<head><title>301 Moved Permanently</title></head>
<body bgcolor="white">
<center><h1>301 Moved Permanently</h1></center>
<hr><center>CloudFront</center>
</body>
</html>
* Connection #0 to host carpentries.org left intact
* Closing connection 0

Lines starting > here are request headers, and lines starting < are response headers. Following this is the body (the section from <html> to </html>), which in this case is a short web page.

In this case, after identifying what type of request this is (a GET request), the location to look for (/), and the HTTP version, we include three headers: the first states the domain name we are looking to contact (in case one server is serving multiple domain names, as is quite common), the second identifies what software we’re using to connect (as some servers will adjust the content depending on, for example, which browser you connect with), and the third tells the server what we’re looking for—in this case we will accept whatever the server has to offer.

The server then responds with a status code, followed by a lot of metadata. In this case, the status code 301 indicates that the site is no longer at the location we tried, so the metadata includes where to look instead. This is followed by a short web page explaining the same thing. Most browsers will see the 301 and automatically redirect to the correct location, so you never see this page.

Let’s see what happens when we follow the redirect. Web pages can be quite long, so for now let’s ignore the body and look only at the headers.

$ curl -v https://carpentries.org > /dev/null

In this case, because we’re connecting via HTTPS, curl gives a lot more debugging information about the secure connection. After this we see similar request headers (although this time we’re using HTTP/2), and then the response headers start with HTTP/2 200; the status code 200 indicates that this was a successful request, with the body providing what we asked for.

HTTP status codes are three digits long, and almost always begin with 2, 3, 4, or 5. Status codes beginning 2xx indicate that the request was successfully received, understood, and accepted; 3xx indicates a redirect of some kind; 4xx indicates an error caused by the client (for example the famous 404 Not found where the client has requested a resource that does not exist on the server), and 5xx indicates an error on the server side.
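Because the class of a status code is given by its first digit, a small Python sketch can categorise codes simply by integer-dividing by 100:

```python
# The first digit of an HTTP status code identifies its class,
# so integer division by 100 maps a code to its category.
categories = {2: "success", 3: "redirect", 4: "client error", 5: "server error"}

for code in (200, 301, 404, 500):
    print(code, "->", categories[code // 100])
```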

It’s rarely necessary to inspect the request, so if you’re interested in the headers, it’s more convenient to use curl -I to just show the response headers.

$ curl -I https://carpentries.org
HTTP/2 200
content-type: text/html
content-length: 55036
date: Sat, 13 Mar 2021 01:32:50 GMT
last-modified: Sat, 13 Mar 2021 01:26:59 GMT
etag: "f16c8eaddc88e035134aa23e0f8a94ba"
server: AmazonS3
x-cache: Hit from cloudfront
via: 1.1 a25f829e86f504a329e71fa3f4d21485.cloudfront.net (CloudFront)
x-amz-cf-pop: LHR62-C5
x-amz-cf-id: WGyZEdVLxTFbdQ3eKX2rdnPWO0214DDcQi8TA5UpObYt2CgHjCUz7g==
age: 87

Noteworthy here is the first header content-type: text/html; this indicates that the response body is an HTML document (also known as a web page). HTML, the HyperText Markup Language, is the language that all web pages are written in; while we won’t write any today, we will look a little more at how to read it (and get your code to read it) in a later episode.

HyperText?

Both HTTP and HTML refer to HyperText. This was a popular buzzword in the 1990s, and refers to the Web’s ability to include not only text, but also cross-references in the form of links (hypertext links, or hyperlinks) to other documents stored elsewhere, which the user can immediately access.

While this seems entirely obvious and second-nature today, it was revolutionary when it was first introduced, hence the name appearing prominently in technologies that supported it.

Another website

Pick a web page you’ve visited recently and take a look at its response headers with curl -I. How do they differ from the https://carpentries.org/ headers we looked at above? What parts are similar?

Key Points

  • A protocol is a standard for communicating data across a network. A port is a number to identify which program should process a network connection.

  • HTTP is the protocol originally designed for requesting and receiving Web pages, but now also used as the basis for a variety of APIs. HTTPS is the encrypted version of HTTP.

  • Every page on the world wide web is identified with a URL or Uniform Resource Locator.

  • A request is how you tell a server what you want to see. A response will either give you what you asked for, or tell you why the server can’t do that. Both requests and responses have a header, and optionally a body.

  • We can make requests and receive responses, as well as see their headers, using curl.


What do APIs look like?

Overview

Teaching: 30 min
Exercises: 10 min
Questions
  • How can requests be made of web APIs?

  • How can responses from web APIs arrive?

  • How can requests to web APIs be authenticated?

Objectives
  • Be able to make requests to web APIs using curl using endpoints, query parameters, and JSON data.

  • Be able to identify responses in plain text and JSON.

  • Be able to authenticate to web APIs with passwords and authentication tokens.

We’ve done a lot of talking about the technologies that will let us interact with APIs so far. Let’s now start putting this into practice and query an API.

$ curl http://numbersapi.com/42
42 is the number of laws of cricket.

Numbers API provides facts about numbers. By putting the number of interest into the address, we tell Numbers API which number to give a fact about. By adding other keywords to the address, we can refine the domain that we’re asking for information in; for example, for specifically mathematical trivia, we can add /math.

$ curl http://numbersapi.com/42/math
42 is a perfect score on the USA Math Olympiad (USAMO) and International Mathematical Olympiad (IMO).

Numbers API is not an especially sophisticated API. In particular, it only offers a single endpoint (specifically, /), and each response to a query is a single string, provided as plain text.

We can think of an API as being similar to a package or library in a programming language, but one that is usable from almost any programming language. In these terms, an endpoint is equivalent to a function; Numbers API provides a single function, /, which gives information about numbers. The response is the return value of the function, and in this case is a single string. This maps well onto HTTP, as the response body of a request is a string of either characters or of bytes. (Byte strings don’t translate well between languages, so are usually avoided, except for specific portable formats such as images.)

However, many useful functions need to return something other than character strings. For example, you might want to return a list, or an array, or a set of related data. Let’s look at another example of a web API and see how this can be handled. Newton is a web API for advanced mathematics. One thing it can do is factorization:

$ curl https://newton.vercel.app/api/v2/factor/x^2-1
{"operation":"factor","expression":"x^2-1","result":"(x - 1) (x + 1)"}

Two things have changed. Firstly, now instead of /, we are specifying that we want to use the factor endpoint provided by the v2 version of the API. This is a very common way of structuring APIs: firstly a version, and then one or more levels of endpoints to specify what function you would like the API to perform.

Secondly, rather than a plain text response, we get a data structure. This is still encoded as plain text (because HTTP can’t natively transmit much else), but we can’t use the text directly; instead, we need to parse it first. The syntax used here is the most common format for modern web APIs, and is called JSON (pronounced like the name “Jason”; short for JavaScript Object Notation). (You may also encounter older or more old-fashioned APIs that instead use XML, the eXtensible Markup Language.) We can see that this response includes three names, or keys ("operation", "expression", and "result"), and three associated values ("factor", "x^2-1", and "(x - 1) (x + 1)", respectively).
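We will work with JSON from Python in later episodes, but as a quick sketch, Python’s built-in json module can parse this response body into a data structure that we can index by key:

```python
import json

# The response body from Newton, as received over HTTP (a plain string).
body = '{"operation":"factor","expression":"x^2-1","result":"(x - 1) (x + 1)"}'

# Parse the JSON text into a Python data structure.
parsed = json.loads(body)
print(parsed["result"])
```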

factor is not the only thing that Newton can do. Let’s try a different endpoint, for integration.

$ curl https://newton.vercel.app/api/v2/integrate/x^2-1
{"operation":"integrate","expression":"x^2-1","result":"1/3 x^3 - x"}

In this case Newton correctly tells us that the "result" of this integration is "1/3 x^3 - x".

The endpoints an API offers, and what format it will give its responses in, will generally be listed in the API’s documentation. Newton’s documentation for example can be found on GitHub.

More math

Read through Newton’s documentation. Try one or more of the other endpoints that we haven’t tried. Check that the results match what you would expect.

Try using a different input function than x^2-1. Again, check that the answers give what you expect.

Errors (or not)

Try using the simplify endpoint for Newton to simplify the expression 0^(-1) (i.e. 1 divided by 0).

Use curl -i to see both the headers and the response. Do these match what you expect?

(You will need to enclose the URL in quotes.)

Solution

The response code for this request is 200 (OK), but the "result" indicates that an error occurred.

This is not uncommon; not all APIs will use the HTTP status code to indicate an error condition. Some will even give you an HTML web page describing an error condition when usually you would expect a non-HTML response. It’s good to check this behaviour for each API that you use, so that you can guard for it in your software.

Authentication and identification

Many web APIs restrict access to registered users or applications. This may be because they are used to control things that are specific to a particular user account, because different people have different privilege levels and so different endpoints available, or simply because the API provider wants to collect statistics on how the API is being used.

Various ways exist for developers to authenticate to an API, including:

  • a username and password, via HTTP basic authentication;

  • an API key or token that identifies the calling user or application, often generated via a framework such as OAuth.

For everything other than HTTP authentication, there are also a variety of ways to present the credential to the server, such as:

  • as a query parameter in the URL;

  • in a request header;

  • in the body of the request.

One important fact about HTTP is that it is stateless: each request is treated entirely separately, with no memory from one request to the next. This means that you must present your authentication credentials with every request you make to the API. (This is in contrast to other protocols like SSH or FTP, where you authenticate once at the start of a session, and then subsequent messages can be sent back and forth without the need for re-authentication.)

For example, NASA offers an API that exposes much of the data that they make public. They require an API key to identify you, but don’t require any authentication beyond this.

Let’s try working with the NASA API now. To do this, first we need to generate our API key by providing our details at the API home page. Once that is done, NASA sends a copy to the email address you provide. Let’s use Astronomy Picture of the Day (APOD) as an example of an API query to try. This shows us that NASA expects the API key to be encoded as a query parameter.

$ curl -i https://api.nasa.gov/planetary/apod?api_key=ejgThfasPCRf4kTd39ar55Aqhxv8cwKBdVOyZ9Rr
HTTP/1.1 200 OK
Date: Mon, 15 Mar 2021 00:08:34 GMT
Content-Type: application/json
Content-Length: 1135
Connection: keep-alive
Vary: Accept-Encoding
X-RateLimit-Limit: 2000
X-RateLimit-Remaining: 1998
Access-Control-Allow-Origin: *
Age: 0
Via: http/1.1 api-umbrella (ApacheTrafficServer [cMsSf ])
X-Cache: MISS
Strict-Transport-Security: max-age=31536000; preload

{"copyright":"Mia St\u00e5lnacke","date":"2021-03-14","explanation":"It appeared, momentarily, like a 50-km tall banded flag.  In mid-March of 2015, an energetic Coronal Mass Ejection directed toward a clear magnetic channel to Earth led to one of the more intense geomagnetic storms of recent years. A visual result was wide spread auroras being seen over many countries near Earth's magnetic poles.  Captured over Kiruna, Sweden, the image features an unusually straight auroral curtain with the green color emitted low in the Earth's atmosphere, and red many kilometers higher up. It is unclear where the rare purple aurora originates, but it might involve an unusual blue aurora at an even lower altitude than the green, seen superposed with a much higher red.  Now past Solar Minimum, colorful nights of auroras over Earth are likely to increase.   Follow APOD: Through the Free NASA App","hdurl":"https://apod.nasa.gov/apod/image/2103/AuroraFlag_Stalnacke_6677.jpg","media_type":"image","service_version":"v1","title":"A Flag Shaped Aurora over Sweden","url":"https://apod.nasa.gov/apod/image/2103/AuroraFlag_Stalnacke_960.jpg"}

We can see that this API gives us JSON output including links to two versions of the picture of the day, and then metadata about the picture including its title, description, and copyright. The headers also give us some information about our API usage: our rate limit is 2000 requests per day, and we have 1998 of these remaining (probably because the malware scanner on my email server tested the link first to make sure it wasn’t malicious).

With all of these ways to provide identification and authentication information, we don’t have time to cover each possibility exhaustively. For the vast majority of APIs, there will be good developer documentation with examples of how to use the token or other identifier that they provide to connect to their service.

More complicated queries

Thus far we have queried APIs where any parameters are included as part of the effective “filename” on the server. For example, in http://numbersapi.com/42, the 42 is a parameter to the API, but at first glance it could equally well be an endpoint.

Many APIs make this distinction more clear, by accepting arguments in a query string. This is a sequence of name=value pairs, separated from each other by &s, and separated from the endpoint by a ?.
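Rather than pasting query strings together by hand, we can build them programmatically. A sketch using Python’s urllib.parse (the api_key value here is a placeholder, not a real key):

```python
from urllib.parse import urlencode

# Encode name=value pairs into a query string; urlencode also takes
# care of escaping characters that are not safe to put in a URL.
params = {"date": "2005-04-01", "api_key": "PLACEHOLDER_KEY"}
query_string = urlencode(params)
print(query_string)

# Attach the query string to the endpoint with a ?
url = "https://api.nasa.gov/planetary/apod?" + query_string
```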

Using quotes with Curl

When we put an & into a web address for curl, we need to put the address inside quotes. If we don’t, then our shell will interpret the & as meaning we should run the preceding command in the background instead of passing it as a parameter to curl. This effectively truncates the address to everything up to the first &.

We have already seen one example of this—we used it to provide our API key to NASA’s APOD endpoint. The APOD endpoint also accepts other parameters, for example, to select the date or dates for which the picture is returned.

$ curl -i "https://api.nasa.gov/planetary/apod?date=2005-04-01&api_key=ejgThfasPCRf4kTd39ar55Aqhxv8cwKBdVOyZ9Rr"
HTTP/1.1 200 OK
Date: Mon, 15 Mar 2021 00:31:45 GMT
Content-Type: application/json
Content-Length: 965
Connection: keep-alive
X-RateLimit-Limit: 2000
X-RateLimit-Remaining: 1996
Access-Control-Allow-Origin: *
Age: 0
Via: http/1.1 api-umbrella (ApacheTrafficServer [cMsSf ])
X-Cache: MISS
Strict-Transport-Security: max-age=31536000; preload

{"copyright":"Ellen Roper","date":"2005-04-01","explanation":"Can you help discover water on Mars?  Finding water on different regions on Mars has implications for understanding its complex geologic history, the possible existence of past life and the sustenance of potential future astronauts.  Many space missions have taken photographs of the surface of the red planet, and some of them might show a subtle clue pointing to water on Mars that has been missed.  By close inspection of images, following curiosity, applying scientific principles, applying knowledge about features on the Martian surface, and applying principles of planetary geology, such clues might be brought to light.  In the meantime, happy April Fool's Day from the folks at APOD!","hdurl":"https://apod.nasa.gov/apod/image/0504/WaterOnMars2_gcc_big.jpg","media_type":"image","service_version":"v1","title":"Water On Mars","url":"https://apod.nasa.gov/apod/image/0504/WaterOnMars2_gcc.jpg"}

One benefit of being able to construct queries in this way is that the query is more self-descriptive—for unfamiliar APIs, keyword arguments are significantly easier to read than positional ones.

One other way to provide parameters, in particular when they are more complex data structures than can be easily represented in a small string, is to use JSON in the body of the request. Since constructing JSON by hand is tedious, we will defer such APIs to the next section.

NASA aerial imagery

Look through NASA’s API documentation. Use the Earth API to retrieve an aerial image of your current location.

Try first using curl without any flags. What message do you get from curl? Why might this be?

Now try inspecting the headers for the request using curl -I, and look at the Content-Type. Does this match your suspicion as to the reason for curl’s message?

Finally, follow curl’s advice to save the output to a file. Open the resulting file and see if it matches what you expected.

Key Points

  • Interact with web APIs by sending requests to an endpoint representing a function of interest. Parameters can be encoded into the request, or attached as e.g. JSON.

  • Responses are typically plain text or JSON, but could be anything.

  • Most APIs require some form of authentication. This can be by username and password, or via a token.

  • Which choices a given API makes for each of these will be described in the API’s documentation.


dicts

Overview

Teaching: 12 min
Exercises: 8 min
Questions
  • What is a Python dict?

  • How do I use a dict?

Objectives
  • Understand what a dict is.

  • Be able to create, modify, and use dicts in Python.

In the previous episode we saw that some APIs will return data formatted as JSON, including names (or keys) and values associated with them.

Since we would ultimately like to work with data from these APIs in Python, it would be nice if Python had a data structure that behaved similarly. In the Software Carpentry introduction to Python, we learned about lists, which are ordered collections of things, indexed by their position in the ordering. What we would like here is also a collection, but one whose elements are indexed by an arbitrary key of our choice rather than by position.

In fact, Python has such a collection built into it; it is called a dict (short for dictionary). Let’s construct one now, to hold data from the Mayo Clinic about caffeine levels in various beverages.

caffeine_mg_per_serving = {'coffee': 96, 'tea': 47, 'cola': 24, 'energy drink': 29}

We see here that the dict is created within curly braces {}, and contains keys and corresponding values separated by a :, with successive pairs being separated by a , like in a list.

Again, similarly to a list, we can access elements of the dict with square brackets []. For example, to get the number of mg of caffeine per serving of coffee, we could use the following:

print("Coffee has", caffeine_mg_per_serving['coffee'], "mg of caffeine per serving")
Coffee has 96 mg of caffeine per serving

We can also replace elements in the same way that we can for a list. For instance, you may have spotted that the value for 'cola' is incorrect. Let’s fix that now.

caffeine_mg_per_serving['cola'] = 22
print(caffeine_mg_per_serving)
{'coffee': 96, 'tea': 47, 'cola': 22, 'energy drink': 29}

One thing that we can’t do with lists is create new elements by indexing with []. But dicts do let us do that:

caffeine_mg_per_serving['green tea'] = 28
print(caffeine_mg_per_serving)
{'coffee': 96, 'tea': 47, 'cola': 22, 'energy drink': 29, 'green tea': 28}

Ordering

Python dicts historically were not ordered—you would not be guaranteed to get back results in the same order that you put them in. In more recent versions of Python (3.7 and later), dicts preserve the order in which entries are added, so 'green tea', having been added most recently, appears at the end.

Missing values

dicts will throw an error, though, if we try to access values for keys that we have not added previously.

print(caffeine_mg_per_serving['guarana'])
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-2-dff37d2ef7d1> in <module>
----> 1 caffeine_mg_per_serving['guarana']

KeyError: 'guarana'

To write more robust code, we might like to check whether we can use a particular key before trying to access it. In a list, this is simple, as we can check whether a particular index is less than the length of the list. With a dict, we instead use the in keyword to check whether a particular key is in the dict:

'coffee' in caffeine_mg_per_serving
True

Alternatively, if we want to get an element of the dict and use a default value if the key isn’t found, we can use the .get() method:

print(caffeine_mg_per_serving.get("coffee", 0))
print(caffeine_mg_per_serving.get("hot chocolate", 0))
96
0

(If you don’t specify the default value, then Python uses None for keys that are not found.)

Looping

Now, a particularly useful thing to do with a list is to loop over it. What happens when we loop over a dict?

for item in caffeine_mg_per_serving:
    print(item)
coffee
tea
cola
energy drink
green tea

Looping (or otherwise iterating) over a dict in fact loops over its keys. This matches with what the in keyword does—it would be strange for the two to look at different aspects of the dict. But sometimes we may want to use the values as well as the keys in a loop. We could index back into the dict via the key, but that is repetitive. We can instead use the .items() method of the dict:

for drink, quantity in caffeine_mg_per_serving.items():
    print(drink.capitalize(), "contains", quantity, "mg of caffeine per serving")
Coffee contains 96 mg of caffeine per serving
Tea contains 47 mg of caffeine per serving
Cola contains 22 mg of caffeine per serving
Energy drink contains 29 mg of caffeine per serving
Green tea contains 28 mg of caffeine per serving

What’s in a key?

In this episode, we have used strings as keys, as this is what we’re most likely to see when working with JSON. This is not a Python restriction, however. We can use any “hashable” type as a dict key; this includes strings, numbers, and tuples, among other immutable types. Most notably, this excludes lists and dicts (which are mutable).
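A short sketch of this restriction: a tuple works as a dict key, while a list raises a TypeError.

```python
# Immutable (hashable) types such as tuples can be dict keys...
by_tuple = {(2021, 3): "March 2021"}
print(by_tuple[(2021, 3)])

# ...but mutable types such as lists cannot.
try:
    {[2021, 3]: "March 2021"}
    list_key_allowed = True
except TypeError:
    list_key_allowed = False
print("Lists allowed as keys:", list_key_allowed)
```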

dicts of functions

What will the following code do?

import numpy as np

operations = {
    'min': np.min,
    'max': np.max
}

def process(array, operation):
    return operations[operation](array)

print(process([1, 4, 7, 2, -3], 'min'))

When might this kind of behaviour be useful?

Try adjusting the example so that 'mean' and 'std' also work as you might expect.

Solution

This will look up the function stored under the given key in the operations dictionary and call it on the array, so the example prints -3. This could be useful when you want to allow the user to decide what functionality is desired at run-time, perhaps in a configuration file. Perhaps a choice of inversion algorithms or fitting functions could be offered.

To add other functions, the operations dict could be adjusted as:

operations = {
    'min': np.min,
    'max': np.max,
    'mean': np.mean,
    'std': np.std
}

Nested dicts

It is worth noting that the values in a dict can be of any type (this is not true for the keys). One notable case is that values can themselves be dicts:

nutrition_values = {'energy': {'units': 'kCal/100g',
                               'values': {'white bread': 273,
                                          'almonds': 512}},
                    'caffeine': {'units': 'mg per serving',
                                 'values': caffeine_mg_per_serving}}

It is then possible to access data using multiple square bracket expressions:

print("Caffeine content of coffee:", nutrition_values['caffeine']['values']['coffee'])
print("Units:", nutrition_values['caffeine']['units'])
Caffeine content of coffee: 96 
Units: mg per serving

Key Points

  • A dict is a collection of key-value pairs.

  • Create a dict with the syntax {key1: value1, key2: value2, ...}.

  • Get and set elements of a dict with square brackets: my_dict[key1] = new_value1.


Requests

Overview

Teaching: 40 min
Exercises: 20 min
Questions
  • How can I send HTTP requests to a web server from Python?

  • How to interact with web services that require authentication?

  • What are the data formats that are used in HTTP messages?

Objectives
  • Use the Python requests library for GET and POST requests

  • Understand how to deal with common authentication mechanisms.

  • Understand what else the requests library can do for you.

So far, we have been interacting with web APIs by using curl to send HTTP requests and then inspecting the responses at the command line. This is very useful for running quick checks that we are able to access the API, and debugging if we’re not. However, to integrate web APIs into our software and analyses, we’d like to be able to make requests of web APIs from within Python, and work with the results.

In principle we could make subprocess calls to curl, and capture and parse the results, but this would be very cumbersome. Fortunately, other people thought the same thing and have made libraries available to help. Basic functionality for making and processing requests is built into the Python standard library, but it is far more common to use a package called requests, which is available from PyPI.

First off, let’s check that we have requests installed.

$ python -c "import requests"

If you do not see any message, then requests is already installed. If, on the other hand, you see a message like

Traceback (most recent call last):
  File "<string>", line 1, in <module>
ModuleNotFoundError: No module named 'requests'

then install requests using pip:

$ pip install requests

Recap: Requests, Responses and JSON

As a reminder, communication with web APIs is done through the HTTP protocol, and happens through messages, which are of two kinds: requests and responses.

A request is composed of a start line, a number of headers and an optional body.

Practically, a request needs to specify one of the HTTP verbs and a URL in the start line and an optional payload (the body).

A response is composed of a status line, a number of headers and an optional body.

The data to be transferred with the body of a request needs to be represented in some way. “Unstructured” text representations are used, e.g., to transmit CSV data. A popular text-based format for transmitting structured data is the JavaScript Object Notation (JSON) format. The Python standard library includes the json module, for serialisation (i.e. representing Python objects as JSON strings):

import json
data = dict(a=1, b=dict(c=(2,3,4)))
representation = json.dumps(data)
representation
'{"a": 1, "b": {"c": [2, 3, 4]}}'

And for parsing (i.e. recovering Python objects from their JSON string representation):

data_reparsed = json.loads(representation)
data_reparsed
{'a': 1, 'b': {'c': [2, 3, 4]}}

You can see that for dicts containing strings, integers, and lists, at least, the JSON representation looks very similar to the Python representation. The two are not always directly interchangeable, however.
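For instance, JSON has no tuple type and requires object keys to be strings, so a round trip through JSON can silently change a Python object:

```python
import json

# Tuples become lists, and non-string keys become strings
original = {1: (2, 3)}
roundtrip = json.loads(json.dumps(original))
print(roundtrip)  # {'1': [2, 3]}
```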

The Python requests library can parse JSON and serialise the objects, so that you don’t have to deal with this aspect on your own.

Another text-based format that is used with APIs is the eXtensible Markup Language (XML), which is much more complex to deal with than JSON. Facilities for dealing with the XML format are in the xml.etree.ElementTree module of the standard library.
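As a minimal sketch (with an invented document), parsing XML with xml.etree.ElementTree looks like:

```python
import xml.etree.ElementTree as ET

# A small, made-up XML document
document = ET.fromstring(
    "<catalog><item id='1'>bread</item><item id='2'>almonds</item></catalog>")

# Find the first matching child element and read its text and attributes
first = document.find("item")
print(first.text)       # bread
print(first.get("id"))  # 1

# Iterate over all matching elements
print([item.text for item in document.findall("item")])  # ['bread', 'almonds']
```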

Another markup language widely used in HTTP message bodies is the HyperText Markup Language, HTML.

HTTP verbs

Up until now we have exclusively used GET requests, to retrieve information from a server. In fact, the HTTP protocol has a number of such verbs, each associated with an operation falling into one of four categories: Create, Read, Update, or Delete (sometimes called the CRUD categories). The most common verbs are:

  • POST: create a new resource.

  • GET: read (retrieve) a resource, without modifying it.

  • PUT and PATCH: update an existing resource, in full or in part.

  • DELETE: delete a resource.

In this lesson we will focus on GET and POST requests only.

A GET request example

Let’s take the first example we looked at earlier, now with the Python requests library:

import requests
response = requests.get("http://carpentries.org")

requests gives us access to both the headers and the body of the response. Looking at the headers first, we can check what type of data is in the body. As this is the URL of a website, we expect the response to contain a web page:

response.headers["Content-Type"]
text/html

Our expectations are confirmed. We can also check the Content-Length header to see how much data we expect to find in the body:

response.headers["Content-Length"]
29741

And, as expected, the length of the body of the response is the same:

len(response.text)
29741

We can look at the content of the body:

response.text

Another GET request example

APIs, like other pieces of code, need documentation. We’ve already seen some examples of API documentation, such as NASA’s API documentation.

One popular way of creating API documentation is by generating it from the API specification (essentially a means of providing metadata for the API). One of the specification languages you are likely to hear about is called OpenAPI, a specification language for HTTP APIs. This is a machine readable format, meaning that a lot of tooling has been developed around it for tasks like the generation of API documentation.

The documentation that can be generated from an OpenAPI description can be interactive, even allowing you to test API endpoints without ever leaving the documentation page. We’ll see an example of this in the exercise below, as well as another example of a GET request.

EDS Citation API using the documentation

Look at BODC’s EDS citation API. (See this page for background information.) Can you find out the total citation count for the Polar Data Centre (PDC), without leaving the page?

Solution

There are multiple ways to do this. One is to use the /centre endpoint. You could also use the /centre/{centre_name} endpoint, where centre_name is Polar Data Centre (PDC). To interact with the documentation on the page, click the Try it out button, enter any desired parameters, then click the Execute button.

EDS Citation API using the requests library

Can you now make exactly the same request but using the requests library, rather than just using the interactive documentation?

Solution

For example:

import requests
response = requests.get("https://www.bodc.ac.uk/eds-citation/centre")
response.json()

GET with parameters

As we have seen when talking about curl, some endpoints accept parameters in GET requests. Using Python’s requests library, the call to NASA’s APOD endpoint that we previously made

$ curl -i "https://api.nasa.gov/planetary/apod?date=2005-04-01&api_key=<your-api-key>"

can be expressed in a more human-friendly format:

response = requests.get(url="https://api.nasa.gov/planetary/apod",
                        params={"date":"2005-04-01",
                                "api_key":"<your-api-key>"})

using a dictionary to contain all the arguments.
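Under the hood, requests URL-encodes the params dict into exactly the kind of query string we wrote by hand for curl. One way to see this without sending anything over the network is to build the URL with a PreparedRequest (the api_key value below is a placeholder):

```python
from requests.models import PreparedRequest

# Build only the URL, without sending a request
req = PreparedRequest()
req.prepare_url("https://api.nasa.gov/planetary/apod",
                params={"date": "2005-04-01",
                        "api_key": "DEMO_KEY"})
print(req.url)  # https://api.nasa.gov/planetary/apod?date=2005-04-01&api_key=DEMO_KEY
```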

Get a list of GitHub repositories

The CDT-AIMLAC GitHub organisation (cdt-aimlac) has a number of repositories. Using the official API documentation of GitHub, can you list their names, ordered by last-updated time in ascending order? (Look at the examples in the documentation!)

Solution

The URL to use is https://api.github.com/orgs/cdt-aimlac/repos. In addition to that, we need to use the parameter sort with value updated and the parameter direction with value asc.

response = requests.get(url="https://api.github.com/orgs/cdt-aimlac/repos",
                        params={'sort':'updated',
                                'direction':'asc'})
response
<Response [200]> 

Once we verify that there are no errors, we can extract the data, which is available via the json() method:

for repo in response.json():
    print(repo["name"], ':', repo["updated_at"])
testing_exercise : 2020-04-28T13:56:42Z
docker-introduction-2021 : 2021-01-26T19:20:19Z
grid : 2021-03-10T11:59:09Z
training-cloud-vm : 2021-03-23T13:43:03Z
ccintro-2021 : 2021-09-21T13:57:35Z
git-novice : 2021-11-24T10:21:58Z
docker-introduction-2022 : 2022-01-24T17:31:39Z
blogs : 2022-09-07T15:56:33Z
ccintro-2022 : 2022-09-15T15:51:29Z
aber-pubs : 2022-11-23T13:41:57Z
agile_snails_coding_challenge : 2022-11-23T15:45:05Z
team_7564616d_models : 2022-11-23T15:45:42Z
coding-challenge-2022_23-task1 : 2023-02-08T17:02:07Z
pl_curves : 2023-03-29T22:38:10Z
ccintro-2023 : 2023-09-18T13:58:06Z
marketintro-2023 : 2023-11-16T17:12:00Z

Another GET request with parameters example - Open-Meteo API

As an additional example of using requests to connect to an API rather than a plain web site we’ll use the Open-Meteo API. (This is free for non-commercial use and does not require an API key.)

Looking at the documentation for this API (specifically, the API URL under the API Response section), we can build a URL to access a temperature forecast for the next three days for NOC Southampton. This URL has four parameters (latitude, longitude, variable we are accessing and number of days we’re interested in).

As we saw in the previous episode, with curl from the command line, we would have to use the following command

curl "https://api.open-meteo.com/v1/forecast?latitude=50.89&longitude=-1.39&hourly=temperature_2m&forecast_days=3"

building the parameter string explicitly. This is also the syntax that is used in a browser address bar:

"protocol://host/resource/path?parname1=value1&parname2=value2..."

However, using the requests library allows us to use a nicer syntax:

response = requests.get(url="https://api.open-meteo.com/v1/forecast",
                        params={"latitude": "50.89",
                                "longitude": "-1.39",
                                "hourly": "temperature_2m",
                                "forecast_days": "3"})
response
<Response [200]>

As we saw previously, the code 200 means “success”. To make sure the response contains what we expect, let’s quickly print its headers (which behave like a dictionary):

for key, value in response.headers.items():
    print((key, value))
('Date', 'Wed, 23 Apr 2025 12:57:34 GMT')
('Content-Type', 'application/json; charset=utf-8')
('Transfer-Encoding', 'chunked')
('Connection', 'keep-alive')
('Content-Encoding', 'deflate')

As expected, the Content-Type is application/json. We can now look at the body of the response:

response.text[:100]
'{"latitude":50.86,"longitude":-1.3800001,"generationtime_ms":0.024437904357910156,"utc_offset_second'

As mentioned, the requests library can parse this JSON representation and return a more convenient Python object, through which we can access the inner data:

data = response.json()
data["hourly"]["temperature_2m"]

Another location

Refer back to the API reference. Can you produce a fortnight’s forecast of precipitation probability at NOC Liverpool?

Solution

We query the Open-Meteo API using something similar to the following:

response = requests.get(url="https://api.open-meteo.com/v1/forecast",
                        params={"latitude": "53.40",
                                "longitude": "-2.97",
                                "hourly": "precipitation_probability",
                                "forecast_days": "14"})

Authentication and POST

As mentioned above, thus far we have only used GET requests. GET requests are intended to be used for retrieving data, without modifying any state—effectively, “look, but don’t touch”. To modify state, other HTTP verbs should be used instead. Most commonly used for this purpose in web APIs are POST requests.

As such, we’ll switch to using the GitHub API to look at how POST requests can be used.

This will require a GitHub Personal Access Token. If you don’t already have one, then the instructions in the Setup walk through how to obtain one.

Take care with access tokens!

This access token identifies your individual user account, rather than just the application you’re developing, so anyone with this token can impersonate you and manage your account. Be very sure not to commit this (or any other personal access token) to a public repository, or any repository that might be made public in the future, as it will very rapidly be discovered and used against you.

The most common mistake here is committing tokens for a cloud service. This has allowed unscrupulous individuals to take over cloud computing services and spend hundreds of thousands of pounds on activities such as mining cryptocurrency.

To send POST requests, we can use the function requests.post.

For this example, we are going to post a comment on an issue on GitHub. Issues on GitHub are a simple way to keep track of bugs, and a great way to manage focused discussions on the code.

In order to do so, we need to authenticate. We will now create an object of the HTTPBasicAuth class provided by requests, and pass it to requests.post.

First of all, let’s load the GitHub access token:

with open("github-access-token.txt", "r") as file:
    ghtoken = file.read().strip()

Let’s then create the HTTPBasicAuth object:

from requests.auth import HTTPBasicAuth
auth = HTTPBasicAuth("your-github-username", ghtoken)

We will now create the body of the comment, as a JSON string:

import json
body = json.dumps({"body": "Another test comment"})

Finally, we will post the comment on GitHub and make sure we get a success code:

response = requests.post(url="https://api.github.com/repos/mmesiti/web-novice-test-repo/issues/1/comments",
                         data=body,
                         auth=auth)
response
<Response [201]>

The code 201 is the typical success response for a POST request, signaling that the creation of a resource has been successful. We can go to the issue page and check that our new comment is there.

Curl and POST

curl can also be used for POST requests, which can be useful for shell-based workflows. One needs to use the --data option.

What have I asked you?

The request that generated a given response object can be retrieved as response.request. Can you see the headers of that request? And what about the body of the message? What is the type of the request object?

Solution

To print the headers:

print(response.request.headers)
{'User-Agent': 'python-requests/2.22.0', 'Accept-Encoding': 'gzip, deflate', 'Accept': '*/*', 'Connection': 'keep-alive'}

The body of the request is accessible with

response.request.body
'{"body": "Another test comment"}'

And the type is PreparedRequest:

type(response.request)
requests.models.PreparedRequest

For better control, one could in principle create a Request object beforehand, call the prepare method on it to obtain a PreparedRequest, and then send it through a Session object.
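A minimal sketch of that flow, using a Session to prepare the request and inspecting it before sending (the URL and credentials below are placeholders; nothing is sent unless the final line is uncommented):

```python
import requests
from requests.auth import HTTPBasicAuth

session = requests.Session()

# Build a Request object explicitly (placeholder URL and credentials)
req = requests.Request("POST",
                       "https://api.github.com/repos/your-user/your-repo/issues/1/comments",
                       data='{"body": "A test comment"}',
                       auth=HTTPBasicAuth("your-user", "your-token"))

# prepare_request applies session settings and authentication,
# giving us a PreparedRequest we can inspect before sending
prepared = session.prepare_request(req)
print(prepared.method)                      # POST
print("Authorization" in prepared.headers)  # True

# To actually send it:
# response = session.send(prepared)
```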

Forgot the key

What error code do we get if we just forget to add the auth? How do the headers of the request change?

Solution

r = requests.post(url="https://api.github.com/repos/mmesiti/web-novice-test-repo/issues/1/comments",
                  data=body)
r
<Response [401]>

The request headers are:

('User-Agent', 'python-requests/2.25.1')
('Accept-Encoding', 'gzip, deflate')
('Accept', '*/*')
('Connection', 'keep-alive')
('Content-Length', '26')

Most notably, the “Authorization” header is missing.

Authentication is a vast topic. The requests library implements a number of authentication mechanisms that you can use. To handle authentication for multiple requests, one could also use a Session object from the requests library (see Advanced Usage).

Key Points

  • GET requests are used to read data from a particular resource.

  • POST requests are used to write data to a particular resource.

  • GET and POST methods may require some form of authentication (POST usually does).

  • The Python requests library offers various ways to deal with authentication.

  • curl can be used instead for shell-based workflows and debugging purposes.


Elements of Web Scraping with BeautifulSoup

Overview

Teaching: 25 min
Exercises: 15 min
Questions
  • How can I obtain data in a programmatic way from the web without an API?

Objectives
  • Have an idea about how to navigate the HTML element tree with Beautiful Soup and extract relevant information.

Sometimes, the data we are looking for is not available from an API, but it is available on web pages that we can view with our browser. As an example task, in this episode we are going to use the Beautiful Soup Python package for web scraping to find all the relevant information about Software Carpentry lessons.

Exploring HTML code in the browser

Navigate to The Software Carpentry Lessons. The page we see has been rendered by the browser from the HTML, CSS (Cascading Style Sheets) and JavaScript code that is available or linked in the page in some way.

In many browsers (for example, Chrome, Chromium, and Firefox), we can look at the HTML source code of the page we are viewing with the CTRL+u shortcut (alternatively, you can right click on the page and choose “View Source” from the context menu).

Things to notice:

Another way to explore the HTML code is to use the Developer Tools. In most browsers (Chrome, Chromium, and Firefox), you can use the CTRL+Shift+I key combination to open the Developer Tools (alternatively, find the right option in your browser menu).

Developer Tools in Safari

In Safari on macOS, the Developer Tools are hidden by default. To enable them, open the Preferences window, go to the Advanced tab, and enable the “Show Develop menu in menu bar” option.

With the Developer Tools open, press CTRL+Shift+C (or click on the mouse pointer icon in the top left of the window), then hover with the mouse over the elements in the rendered page to view their properties. If you click on one of them, the relevant part of the HTML code will be shown to you.

By using these techniques, we can understand how to locate the elements that we want when using Beautiful Soup later on.

Relevant HTML tags for this lesson

There are a number of tags that may be interesting in general, but specifically for what follows, we need to notice:

Scraping the page with Beautiful Soup

From the BeautifulSoup documentation:

Beautiful Soup is a Python library for pulling data out of HTML and XML files. It works with your favorite parser to provide idiomatic ways of navigating, searching, and modifying the parse tree. It commonly saves programmers hours or days of work.

First of all, let’s verify that we have BeautifulSoup installed:

$ python -c "import bs4"

If there is no output, then we are all set. If instead you see something along the lines of

Traceback (most recent call last):
  File "<string>", line 1, in <module>
ModuleNotFoundError: No module named 'bs4'

Then you have to install the package. One way of doing that is via pip, with

$ pip install beautifulsoup4

Once we are sure that BeautifulSoup is available, we can import the necessary libraries in Python and use requests to GET the Software Carpentry website content:

import requests
from bs4 import BeautifulSoup

response = requests.get("https://software-carpentry.org/lessons/")
response
<Response [200]>

So, the request was successful. The HTML of the web page is in the text member of the response. We can pass that directly to the BeautifulSoup constructor, obtaining a soup object that we still need to navigate:

soup = BeautifulSoup(markup=response.text,
                     features="html.parser")

Looking at the HTML code, we see that just above the first table there is the text “Core Lessons in English” inside an <h2> tag (code reindented for clarity):

...
<h2 id=core-lessons-in-english>Core Lessons in English</h2>
<div class="table-striped overflow-x-auto">
    <table>
        <thead>
            <tr>
                <th>Lesson</th>
                <th>Site</th>
                <th>Repository</th>
                <th>Reference</th>
                <th>Instructor Notes</th>
                <th>Maintainers</th>
            </tr>
        </thead>
        <tbody>
            <tr>
                <td>The Unix Shell</td>
                <td style=text-align:center><a href=https://swcarpentry.github.io/shell-novice /><i
                        class="fas fa-window-maximize"></i></a></td>
                <td style=text-align:center><a href=https://github.com/swcarpentry/shell-novice><i
                            class="fab fa-github"></i></a></td>
                <td style=text-align:center><a href=https://swcarpentry.github.io/shell-novice/reference><i
                            class="fas fa-eye"></i></a></td>
                <td style=text-align:center><a
                        href=https://swcarpentry.github.io/shell-novice/instructor/instructor-notes><i
                            class="fas fa-plus"></i></a></td>
                <td>Jacob Deppen, Benson Muite</td>
            </tr>
            <tr>
                <td>Version control with Git</td>
                <td style=text-align:center><a href=https://swcarpentry.github.io/git-novice><i
                            class="fas fa-window-maximize"></i></a></td>
                <td style=text-align:center><a href=https://github.com/swcarpentry/git-novice><i
                            class="fab fa-github"></i></a></td>
                <td style=text-align:center><a href=https://swcarpentry.github.io/git-novice/reference><i
                            class="fas fa-eye"></i></a></td>
                <td style=text-align:center><a
                        href=https://swcarpentry.github.io/git-novice/instructor/instructor-notes><i
                            class="fas fa-plus"></i></a></td>
                <td>Erin Graham, Katherine Koziar, Martino Sorbaro</td>
            </tr>
            <tr>
                <td>Programming with Python</td>
                <td style=text-align:center><a href=https://swcarpentry.github.io/python-novice-inflammation><i
                            class="fas fa-window-maximize"></i></a></td>
                <td style=text-align:center><a href=https://github.com/swcarpentry/python-novice-inflammation><i
                            class="fab fa-github"></i></a></td>
                <td style=text-align:center><a
                        href=https://swcarpentry.github.io/python-novice-inflammation/reference><i
                            class="fas fa-eye"></i></a></td>
                <td style=text-align:center><a
                        href=https://swcarpentry.github.io/python-novice-inflammation/instructor/instructor-notes><i
                            class="fas fa-plus"></i></a></td>
                <td>Indraneel Chakraborty, Toan Phung, Alberto Villagran</td>
            </tr>
            <tr>
                <td>Plotting and programming with Python</td>
                <td style=text-align:center><a href=https://swcarpentry.github.io/python-novice-gapminder><i
                            class="fas fa-window-maximize"></i></a></td>
                <td style=text-align:center><a href=https://github.com/swcarpentry/python-novice-gapminder><i
                            class="fab fa-github"></i></a></td>
                <td style=text-align:center><a href=https://swcarpentry.github.io/python-novice-gapminder/reference><i
                            class="fas fa-eye"></i></a></td>
                <td style=text-align:center><a
                        href=https://swcarpentry.github.io/python-novice-gapminder/instructor/instructor-notes><i
                            class="fas fa-plus"></i></a></td>
                <td>Allen Lee, Sourav Singh, Olav Vahtras</td>
            </tr>
            <tr>
                <td>Programming with R</td>
                <td style=text-align:center><a href=https://swcarpentry.github.io/r-novice-inflammation /><i
                        class="fas fa-window-maximize"></i></a></td>
                <td style=text-align:center><a href=https://github.com/swcarpentry/r-novice-inflammation><i
                            class="fab fa-github"></i></a></td>
                <td style=text-align:center><a href=https://swcarpentry.github.io/r-novice-inflammation/reference><i
                            class="fas fa-eye"></i></a></td>
                <td style=text-align:center><a
                        href=https://swcarpentry.github.io/r-novice-inflammation/instructor/instructor-notes><i
                            class="fas fa-plus"></i></a></td>
                <td>Rohit Goswami, Hugo Gruson, Isaac Jennings</td>
            </tr>
            <tr>
                <td>R for Reproducible Scientific Analysis</td>
                <td style=text-align:center><a href=https://swcarpentry.github.io/r-novice-gapminder><i
                            class="fas fa-window-maximize"></i></a></td>
                <td style=text-align:center><a href=https://github.com/swcarpentry/r-novice-gapminder><i
                            class="fab fa-github"></i></a></td>
                <td style=text-align:center><a href=https://swcarpentry.github.io/r-novice-gapminder/reference><i
                            class="fas fa-eye"></i></a></td>
                <td style=text-align:center><a
                        href=https://swcarpentry.github.io/r-novice-gapminder/instructor/instructor-notes><i
                            class="fas fa-plus"></i></a></td>
                <td>Matthieu Bruneaux, Sehrish Kanwal, Naupaka Zimmerman</td>
            </tr>
        </tbody>
    </table>
</div>
...

We can then look for the table by finding the HTML element that contains that text, using the string keyword argument:

(soup.find(string="Core Lessons in English"))
'Core Lessons in English'
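Note that find(string=...) returns the matching text node itself (a NavigableString), not the tag that encloses it. This is easy to see on a tiny, made-up document:

```python
from bs4 import BeautifulSoup

# An invented document with the same shape: a heading followed by a table
html = "<div><h2>Title</h2><table><tr><td>cell</td></tr></table></div>"
soup = BeautifulSoup(html, "html.parser")

# find(string=...) returns the text node, not its enclosing tag
text = soup.find(string="Title")
print(type(text).__name__)  # NavigableString
print(text.parent.name)     # h2
```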

By using the find method on a BeautifulSoup object, we search all of its descendants and obtain other objects that we can search in the same way as the original one. But how do we get the parent element? We can use the find_parents() method, which returns a list of objects representing the ancestors of the given element in the tree, starting from its immediate parent and ending with the element at the root of the tree (soup in this case). The second parent in the list is the one that also contains the table we are interested in:

(soup
 .find(string = "Core Lessons in English")
 .find_parents()[1])
<div class="prose h2-wrap max-w-none">
    <p>A Software Carpentry workshop is taught by at least one trained and badged Instructor. Over the course of the
        workshop, Instructors teach our three core topics: the Unix shell, version control with Git, and a programming
        language (Python or R). Curricula for these lessons in English and Spanish (select lessons only) are below.</p>
    <p>You may also enjoy <a href="https://datacarpentry.org/lessons">Data Carpentry’s lessons</a> (which focus on data
        organisation, cleanup, analysis, and visualisation) and <a href="https://librarycarpentry.org/lessons">Library
            Carpentry’s lessons</a> (which apply concepts of software development and data science to library contexts).
    </p>
    <p>Please <a href="https://carpentries.org/contact">contact us</a> with any general questions.</p>
    <h2 id="core-lessons-in-english">Core Lessons in English</h2>
    <div class="table-striped overflow-x-auto">
        <table>
            <thead>
                <tr>
                    <th>Lesson</th>
                    <th>Site</th>
                    <th>Repository</th>
                    <th>Reference</th>
                    <th>Instructor Notes</th>
                    <th>Maintainers</th>
                </tr>
            </thead>
            <tbody>
                <tr>
                    <td>The Unix Shell</td>
                    <td style="text-align:center"><a href="https://swcarpentry.github.io/shell-novice/"><i
                                class="fas fa-window-maximize"></i></a></td>
                    <td style="text-align:center"><a href="https://github.com/swcarpentry/shell-novice"><i
                                class="fab fa-github"></i></a></td>
                    <td style="text-align:center"><a href="https://swcarpentry.github.io/shell-novice/reference"><i
                                class="fas fa-eye"></i></a></td>
                    <td style="text-align:center"><a
                            href="https://swcarpentry.github.io/shell-novice/instructor/instructor-notes"><i
                                class="fas fa-plus"></i></a></td>
                    <td>Jacob Deppen, Benson Muite</td>
                </tr>
                <tr>
                    <td>Version control with Git</td>
                    <td style="text-align:center"><a href="https://swcarpentry.github.io/git-novice"><i
                                class="fas fa-window-maximize"></i></a></td>
                    <td style="text-align:center"><a href="https://github.com/swcarpentry/git-novice"><i
                                class="fab fa-github"></i></a></td>
                    <td style="text-align:center"><a href="https://swcarpentry.github.io/git-novice/reference"><i
                                class="fas fa-eye"></i></a></td>
                    <td style="text-align:center"><a
                            href="https://swcarpentry.github.io/git-novice/instructor/instructor-notes"><i
                                class="fas fa-plus"></i></a></td>
                    <td>Erin Graham, Katherine Koziar, Martino Sorbaro</td>
                </tr>
                <tr>
                    <td>Programming with Python</td>
                    <td style="text-align:center"><a href="https://swcarpentry.github.io/python-novice-inflammation"><i
                                class="fas fa-window-maximize"></i></a></td>
                    <td style="text-align:center"><a href="https://github.com/swcarpentry/python-novice-inflammation"><i
                                class="fab fa-github"></i></a></td>
                    <td style="text-align:center"><a
                            href="https://swcarpentry.github.io/python-novice-inflammation/reference"><i
                                class="fas fa-eye"></i></a></td>
                    <td style="text-align:center"><a
                            href="https://swcarpentry.github.io/python-novice-inflammation/instructor/instructor-notes"><i
                                class="fas fa-plus"></i></a></td>
                    <td>Indraneel Chakraborty, Toan Phung, Alberto Villagran</td>
                </tr>
                <tr>
                    <td>Plotting and programming with Python</td>
                    <td style="text-align:center"><a href="https://swcarpentry.github.io/python-novice-gapminder"><i
                                class="fas fa-window-maximize"></i></a></td>
                    <td style="text-align:center"><a href="https://github.com/swcarpentry/python-novice-gapminder"><i
                                class="fab fa-github"></i></a></td>
                    <td style="text-align:center"><a
                            href="https://swcarpentry.github.io/python-novice-gapminder/reference"><i
                                class="fas fa-eye"></i></a></td>
                    <td style="text-align:center"><a
                            href="https://swcarpentry.github.io/python-novice-gapminder/instructor/instructor-notes"><i
                                class="fas fa-plus"></i></a></td>
                    <td>Allen Lee, Sourav Singh, Olav Vahtras</td>
                </tr>
                <tr>
                    <td>Programming with R</td>
                    <td style="text-align:center"><a href="https://swcarpentry.github.io/r-novice-inflammation/"><i
                                class="fas fa-window-maximize"></i></a></td>
                    <td style="text-align:center"><a href="https://github.com/swcarpentry/r-novice-inflammation"><i
                                class="fab fa-github"></i></a></td>
                    <td style="text-align:center"><a
                            href="https://swcarpentry.github.io/r-novice-inflammation/reference"><i
                                class="fas fa-eye"></i></a></td>
                    <td style="text-align:center"><a
                            href="https://swcarpentry.github.io/r-novice-inflammation/instructor/instructor-notes"><i
                                class="fas fa-plus"></i></a></td>
                    <td>Rohit Goswami, Hugo Gruson, Isaac Jennings</td>
                </tr>
                <tr>
                    <td>R for Reproducible Scientific Analysis</td>
                    <td style="text-align:center"><a href="https://swcarpentry.github.io/r-novice-gapminder"><i
                                class="fas fa-window-maximize"></i></a></td>
                    <td style="text-align:center"><a href="https://github.com/swcarpentry/r-novice-gapminder"><i
                                class="fab fa-github"></i></a></td>
                    <td style="text-align:center"><a
                            href="https://swcarpentry.github.io/r-novice-gapminder/reference"><i
                                class="fas fa-eye"></i></a></td>
                    <td style="text-align:center"><a
                            href="https://swcarpentry.github.io/r-novice-gapminder/instructor/instructor-notes"><i
                                class="fas fa-plus"></i></a></td>
                    <td>Matthieu Bruneaux, Sehrish Kanwal, Naupaka Zimmerman</td>
                </tr>
            </tbody>
        </table>
    </div>
    <h2 id="core-lessons-in-spanish">Core Lessons in Spanish</h2>
    <div class="table-striped overflow-x-auto">
        <table>
            <thead>
                <tr>
                    <th>Lección</th>
                    <th>Sitio web</th>
                    <th>Repositorio</th>
                    <th>Referencias</th>
                    <th>Notas para Instructoras/es</th>
                    <th>Reponsable(s) del mantenimiento</th>
                </tr>
            </thead>
            <tbody>
                <tr>
                    <td>La Terminal de Unix</td>
                    <td style="text-align:center"><a href="https://swcarpentry.github.io/shell-novice-es"><i
                                class="fas fa-window-maximize"></i></a></td>
                    <td style="text-align:center"><a href="https://github.com/swcarpentry/shell-novice-es"><i
                                class="fab fa-github"></i></a></td>
                    <td style="text-align:center"><a href="https://swcarpentry.github.io/shell-novice-es/reference"><i
                                class="fas fa-eye"></i></a></td>
                    <td style="text-align:center"><a
                            href="https://swcarpentry.github.io/shell-novice-es/instructor/instructor-notes"><i
                                class="fas fa-plus"></i></a></td>
                    <td>Verónica Jiménez, Clara Llebot, Heladia Salgado</td>
                </tr>
                <tr>
                    <td>Control de versiones con Git</td>
                    <td style="text-align:center"><a href="https://swcarpentry.github.io/shell-novice-es"><i
                                class="fas fa-window-maximize"></i></a></td>
                    <td style="text-align:center"><a href="https://github.com/swcarpentry/git-novice-es"><i
                                class="fab fa-github"></i></a></td>
                    <td style="text-align:center"><a href="https://swcarpentry.github.io/git-novice-es/reference"><i
                                class="fas fa-eye"></i></a></td>
                    <td style="text-align:center"><a
                            href="https://swcarpentry.github.io/git-novice-es/instructor/instructor-notes"><i
                                class="fas fa-plus"></i></a></td>
                    <td>Jean-Paul Courneya, Clara Llebot, Mariana Patricia Gomez Nicolas</td>
                </tr>
                <tr>
                    <td>R para Análisis Científicos Reproducibles</td>
                    <td style="text-align:center"><a href="https://swcarpentry.github.io/r-novice-gapminder-es"><i
                                class="fas fa-window-maximize"></i></a></td>
                    <td style="text-align:center"><a href="https://github.com/swcarpentry/r-novice-gapminder-es"><i
                                class="fab fa-github"></i></a></td>
                    <td style="text-align:center"><a
                            href="https://swcarpentry.github.io/r-novice-gapminder-es/reference"><i
                                class="fas fa-eye"></i></a></td>
                    <td style="text-align:center"><a
                            href="https://swcarpentry.github.io/r-novice-gapminder-es/instructor/instructor-notes"><i
                                class="fas fa-plus"></i></a></td>
                    <td>Verónica Jiménez, Heladia Salgado, Nelly Sélem</td>
                </tr>
            </tbody>
        </table>
    </div>
    <h2 id="additional-lessons">Additional Lessons</h2>
    <p>These lessons are not part of the core Software Carpentry curriculum but can be offered as supplementary lessons.
        Please <a href="https://carpentries.org/contact">contact us</a> for more information.</p>
    <div class="table-striped overflow-x-auto">
        <table>
            <thead>
                <tr>
                    <th>Lesson</th>
                    <th>Site</th>
                    <th>Repository</th>
                    <th>Reference</th>
                    <th>Instructor Notes</th>
                    <th>Maintainers</th>
                </tr>
            </thead>
            <tbody>
                <tr>
                    <td>Automation and Make</td>
                    <td style="text-align:center"><a href="https://swcarpentry.github.io/make-novice"><i
                                class="fas fa-window-maximize"></i></a></td>
                    <td style="text-align:center"><a href="https://github.com/swcarpentry/make-novice"><i
                                class="fab fa-github"></i></a></td>
                    <td style="text-align:center"><a href="https://swcarpentry.github.io/make-novice/reference"><i
                                class="fas fa-eye"></i></a></td>
                    <td style="text-align:center"><a
                            href="https://swcarpentry.github.io/make-novice/instructor/instructor-notes"><i
                                class="fas fa-plus"></i></a></td>
                    <td>Gerard Capes</td>
                </tr>
                <tr>
                    <td>Programming with MATLAB</td>
                    <td style="text-align:center"><a href="https://swcarpentry.github.io/matlab-novice-inflammation"><i
                                class="fas fa-window-maximize"></i></a></td>
                    <td style="text-align:center"><a href="https://github.com/swcarpentry/matlab-novice-inflammation"><i
                                class="fab fa-github"></i></a></td>
                    <td style="text-align:center"><a
                            href="https://swcarpentry.github.io/matlab-novice-inflammation/reference"><i
                                class="fas fa-eye"></i></a></td>
                    <td style="text-align:center"><a
                            href="https://swcarpentry.github.io/matlab-novice-inflammation/instructor/instructor-notes"><i
                                class="fas fa-plus"></i></a></td>
                    <td>Daniel Cummins, Padem dhar Dwivedi</td>
                </tr>
                <tr>
                    <td>Using Databases and SQL</td>
                    <td style="text-align:center"><a href="https://swcarpentry.github.io/sql-novice-survey"><i
                                class="fas fa-window-maximize"></i></a></td>
                    <td style="text-align:center"><a href="https://github.com/swcarpentry/sql-novice-survey"><i
                                class="fab fa-github"></i></a></td>
                    <td style="text-align:center"><a href="https://swcarpentry.github.io/sql-novice-survey/reference"><i
                                class="fas fa-eye"></i></a></td>
                    <td style="text-align:center"><a
                            href="https://swcarpentry.github.io/sql-novice-survey/instructor/instructor-notes"><i
                                class="fas fa-plus"></i></a></td>
                    <td>Henry Senyondo</td>
                </tr>
            </tbody>
        </table>
    </div>
</div>

It seems we are on the right track. Now let’s focus on the first table element:

(soup
 .find(string = "Core Lessons in English")
 .find_parents()[1]
 .find("table"))
<table>
    <thead>
        <tr>
            <th>Lesson</th>
            <th>Site</th>
            <th>Repository</th>
            <th>Reference</th>
            <th>Instructor Notes</th>
            <th>Maintainers</th>
        </tr>
    </thead>
    <tbody>
        <tr>
            <td>The Unix Shell</td>
            <td style="text-align:center"><a href="https://swcarpentry.github.io/shell-novice/"><i
                        class="fas fa-window-maximize"></i></a></td>
            <td style="text-align:center"><a href="https://github.com/swcarpentry/shell-novice"><i
                        class="fab fa-github"></i></a></td>
            <td style="text-align:center"><a href="https://swcarpentry.github.io/shell-novice/reference"><i
                        class="fas fa-eye"></i></a></td>
            <td style="text-align:center"><a
                    href="https://swcarpentry.github.io/shell-novice/instructor/instructor-notes"><i
                        class="fas fa-plus"></i></a></td>
            <td>Jacob Deppen, Benson Muite</td>
        </tr>
        <tr>
            <td>Version control with Git</td>
            <td style="text-align:center"><a href="https://swcarpentry.github.io/git-novice"><i
                        class="fas fa-window-maximize"></i></a></td>
            <td style="text-align:center"><a href="https://github.com/swcarpentry/git-novice"><i
                        class="fab fa-github"></i></a></td>
            <td style="text-align:center"><a href="https://swcarpentry.github.io/git-novice/reference"><i
                        class="fas fa-eye"></i></a></td>
            <td style="text-align:center"><a
                    href="https://swcarpentry.github.io/git-novice/instructor/instructor-notes"><i
                        class="fas fa-plus"></i></a></td>
            <td>Erin Graham, Katherine Koziar, Martino Sorbaro</td>
        </tr>
        <tr>
            <td>Programming with Python</td>
            <td style="text-align:center"><a href="https://swcarpentry.github.io/python-novice-inflammation"><i
                        class="fas fa-window-maximize"></i></a></td>
            <td style="text-align:center"><a href="https://github.com/swcarpentry/python-novice-inflammation"><i
                        class="fab fa-github"></i></a></td>
            <td style="text-align:center"><a
                    href="https://swcarpentry.github.io/python-novice-inflammation/reference"><i
                        class="fas fa-eye"></i></a></td>
            <td style="text-align:center"><a
                    href="https://swcarpentry.github.io/python-novice-inflammation/instructor/instructor-notes"><i
                        class="fas fa-plus"></i></a></td>
            <td>Indraneel Chakraborty, Toan Phung, Alberto Villagran</td>
        </tr>
        <tr>
            <td>Plotting and programming with Python</td>
            <td style="text-align:center"><a href="https://swcarpentry.github.io/python-novice-gapminder"><i
                        class="fas fa-window-maximize"></i></a></td>
            <td style="text-align:center"><a href="https://github.com/swcarpentry/python-novice-gapminder"><i
                        class="fab fa-github"></i></a></td>
            <td style="text-align:center"><a href="https://swcarpentry.github.io/python-novice-gapminder/reference"><i
                        class="fas fa-eye"></i></a></td>
            <td style="text-align:center"><a
                    href="https://swcarpentry.github.io/python-novice-gapminder/instructor/instructor-notes"><i
                        class="fas fa-plus"></i></a></td>
            <td>Allen Lee, Sourav Singh, Olav Vahtras</td>
        </tr>
        <tr>
            <td>Programming with R</td>
            <td style="text-align:center"><a href="https://swcarpentry.github.io/r-novice-inflammation/"><i
                        class="fas fa-window-maximize"></i></a></td>
            <td style="text-align:center"><a href="https://github.com/swcarpentry/r-novice-inflammation"><i
                        class="fab fa-github"></i></a></td>
            <td style="text-align:center"><a href="https://swcarpentry.github.io/r-novice-inflammation/reference"><i
                        class="fas fa-eye"></i></a></td>
            <td style="text-align:center"><a
                    href="https://swcarpentry.github.io/r-novice-inflammation/instructor/instructor-notes"><i
                        class="fas fa-plus"></i></a></td>
            <td>Rohit Goswami, Hugo Gruson, Isaac Jennings</td>
        </tr>
        <tr>
            <td>R for Reproducible Scientific Analysis</td>
            <td style="text-align:center"><a href="https://swcarpentry.github.io/r-novice-gapminder"><i
                        class="fas fa-window-maximize"></i></a></td>
            <td style="text-align:center"><a href="https://github.com/swcarpentry/r-novice-gapminder"><i
                        class="fab fa-github"></i></a></td>
            <td style="text-align:center"><a href="https://swcarpentry.github.io/r-novice-gapminder/reference"><i
                        class="fas fa-eye"></i></a></td>
            <td style="text-align:center"><a
                    href="https://swcarpentry.github.io/r-novice-gapminder/instructor/instructor-notes"><i
                        class="fas fa-plus"></i></a></td>
            <td>Matthieu Bruneaux, Sehrish Kanwal, Naupaka Zimmerman</td>
        </tr>
    </tbody>
</table>

Now we can get a list of row elements with

rows = (soup
 .find(string = "Core Lessons in English")
 .find_parents()[1]
 .find("table")
 .find_all("tr"))
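As a self-contained illustration of this navigation chain, here is the same pattern run against a tiny made-up document (the markup below is invented for the example):

```python
from bs4 import BeautifulSoup

# Sketch: find(string=...) returns the text node itself; find_parents()
# then walks outward, so [0] is the <h2> and [1] the enclosing <div>.
html = """
<div>
  <h2>Core Lessons in English</h2>
  <table><tr><td>The Unix Shell</td></tr></table>
</div>
"""
soup = BeautifulSoup(html, "html.parser")
table = (soup
         .find(string="Core Lessons in English")
         .find_parents()[1]
         .find("table"))
print(len(table.find_all("tr")))
```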

Let’s focus now on the second element (the first contains the column headings):

rows[1]
<tr>
    <td>The Unix Shell</td>
    <td style="text-align:center"><a href="https://swcarpentry.github.io/shell-novice/"><i
                class="fas fa-window-maximize"></i></a></td>
    <td style="text-align:center"><a href="https://github.com/swcarpentry/shell-novice"><i
                class="fab fa-github"></i></a></td>
    <td style="text-align:center"><a href="https://swcarpentry.github.io/shell-novice/reference"><i
                class="fas fa-eye"></i></a></td>
    <td style="text-align:center"><a href="https://swcarpentry.github.io/shell-novice/instructor/instructor-notes"><i
                class="fas fa-plus"></i></a></td>
    <td>Jacob Deppen, Benson Muite</td>
</tr>

We can now split the row into six table data elements:

td0, td1, td2, td3, td4, td5 = rows[1].find_all("td")

If we want the link to the lesson page, we can look at the <a> tag in td1, and specifically at its href attribute:

link = td1.find("a")["href"]
link
'https://swcarpentry.github.io/shell-novice/'

We can get a list of maintainer names from the text content of td5. Splitting on commas alone would leave stray whitespace around the names, so we also strip each one:

maintainers = [name.strip() for name in td5.text.split(",")]

print(maintainers)
['Jacob Deppen', 'Benson Muite']
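Putting the link extraction and the name splitting together, here is a self-contained sketch on a made-up two-cell row (the markup is invented for the example):

```python
from bs4 import BeautifulSoup

# Sketch: pull an href attribute and a cleaned-up list of names
# out of the cells of a single table row.
row_html = """
<tr>
  <td><a href="https://swcarpentry.github.io/shell-novice/">site</a></td>
  <td>Jacob Deppen, Benson Muite</td>
</tr>
"""
row = BeautifulSoup(row_html, "html.parser")
td_link, td_names = row.find_all("td")
link = td_link.find("a")["href"]
maintainers = [name.strip() for name in td_names.text.split(",")]
print(link)         # https://swcarpentry.github.io/shell-novice/
print(maintainers)  # ['Jacob Deppen', 'Benson Muite']
```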

A more direct way

Can we look directly for table elements in the soup? How would you do that? Would that work?

Solution

We can check how many table elements are in the soup with

len(soup.find_all("table"))

We gather that there are three tables in the soup. find_all returns a list of them, so we can index into the list to access the one we want. For example:

soup.find_all("table")[1]

can be used to access the second table.
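The indexing works because find_all returns matches in document order. A self-contained sketch with three minimal tables (ids invented for the example):

```python
from bs4 import BeautifulSoup

# Sketch: find_all returns a list in document order, so tables[1]
# is the second <table> on the page.
html = "<table id='en'></table><table id='es'></table><table id='extra'></table>"
soup = BeautifulSoup(html, "html.parser")
tables = soup.find_all("table")
print(len(tables))      # 3
print(tables[1]["id"])  # es
```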

List the Lessons

Create a list of all the lessons, reporting for each one:

  • lesson name
  • link
  • names of maintainers

Solution

rows = soup.find("table").find_all("tr")
# Remove the first row, which only contains the column headings
rows.pop(0)

def process_row(row):
    td0, td1, _, _, _, td5 = row.find_all("td")
    link = td1.find("a")["href"]
    lesson = td0.text
    maintainers = [name.strip() for name in td5.text.split(",")]
    return dict(
        lesson=lesson,
        link=link,
        maintainers=maintainers,
    )

lessons = []
for row in rows:
    lessons.append(process_row(row))
print(lessons)

Additional material

Beautiful Soup is a rich library with many powerful features that we cannot cover here.

A close look at the official documentation is worth the time for anyone seriously interested in web scraping.

Scraping the locations for tide gauge stations into a Pandas dataframe

Look at the locations for tide gauge stations. How would you extract these data as a Pandas dataframe?

Solution

import requests
import pandas
from bs4 import BeautifulSoup

# From the URL displayed in the browser's address bar
response = requests.get("https://psmsl.org/data/obtaining/")

soup = BeautifulSoup(response.text, "html.parser")

tables = soup.find_all("table")

Then pandas can parse the HTML of these tables directly into a dataframe:

df = pandas.read_html(str(tables))[0]

We now have the station location data inside a pandas dataframe, ready for processing, graphing, etc.
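To see what read_html does without fetching anything over the network, here is a self-contained sketch on an invented table (the station rows below are made up for illustration; recent pandas versions prefer a file-like object over a literal HTML string, hence the StringIO wrapper):

```python
from io import StringIO
import pandas

# pandas.read_html parses every <table> in the HTML it is given and
# returns a list of DataFrames; the first <tr> of <th> cells becomes
# the column headings.
html = """
<table>
  <tr><th>Station</th><th>Lat</th><th>Lon</th></tr>
  <tr><td>Brest</td><td>48.38</td><td>-4.50</td></tr>
  <tr><td>Cascais</td><td>38.69</td><td>-9.42</td></tr>
</table>
"""
df = pandas.read_html(StringIO(html))[0]
print(df.shape)          # (2, 3)
print(list(df.columns))  # ['Station', 'Lat', 'Lon']
```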

JavaScript code, the DOM and Selenium

The JavaScript code running on a page can actively change the structure of the HTML document. For some web pages this is a crucial part of the rendering process: in some of those cases the JavaScript code must be run to download the data you are looking for from another URL, and to populate the web page with that data and any additional elements of the page design.

In those cases, using requests and BeautifulSoup might not be enough (as requests gets the HTML without running the JavaScript code on the page), but you can use the Selenium WebDriver to load the page in a fully-fledged browser and automate the interaction with it.

Key Points

  • A BeautifulSoup object can be navigated in many ways:

  • Use find to look for the first element that matches the given criteria in a subtree

  • Use find_all to obtain a list of the elements that match the given criteria in a subtree

  • Use find_parents to get the list of ancestors of a given element