Web APIs with Python

Questions

Have you ever needed to get some data from somewhere else on the web?

Objectives

Understand what a web server and an API are, and why you might need to talk to one.
Learn the basics of the Requests Python library.
Get some lightweight recommendations on saving data once you move on to more serious data downloads.

Requests

Requests is a Python library for making HTTP requests to web servers. It provides a clean interface, is one of the go-to tools for this task, and handles the raw data download from simple web servers.

First, let’s take a tour of the Requests webpage. Below, we embed the Requests website into a Jupyter notebook, but you might want to open it in another browser tab: https://requests.readthedocs.io/en/latest/

# Embed the requests homepage
from IPython.display import IFrame
requests_documentation_url = "https://requests.readthedocs.io/en/latest/"
IFrame(requests_documentation_url, '100%', '30%')

Retrieve data from API

An API (Application Programming Interface) defines how computer programs communicate with each other. We use Requests to connect to the API of a web server and tell it what we want, and the server sends the result back to us. This is called the request-response cycle.

We can find a list of some free APIs (available without authentication) at https://apipheny.io/free-api/#apis-without-key . These APIs can be used for developing and testing our code.

Let’s make a request to the Cat Fact API. If we go to https://catfact.ninja/, it gives us the definitions:

GET /fact is the API endpoint.
GET is the type of request we make and
/fact is the path.

You can even test this in your web browser: https://catfact.ninja/fact

Using the Requests library, we do this with get().

# Import
import requests

# URL
url = 'https://catfact.ninja/fact'

# Make a request
response = requests.get(url)
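
The URL used above is the base address of the API joined with the path from the endpoint definition. As a minimal sketch using only the standard library (the variable names are chosen here for illustration), the definition GET /fact can be split into its two parts and turned into the full URL:

```python
from urllib.parse import urljoin

# Base address of the API and the endpoint definition from its documentation
base_url = "https://catfact.ninja/"
endpoint = "GET /fact"

# Split the definition into the request type and the path
method, path = endpoint.split()

# Join the base address and the path into the full URL
full_url = urljoin(base_url, path)

print(method, full_url)
```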

The requests.Response object tells us what the server said. We can access the raw response body, as bytes, using the content attribute.

response_content = response.content

# Display
display(response_content)

The response content is in JSON format, and Requests provides the json() method, which decodes it and returns the corresponding data as Python objects. This is roughly equivalent to calling json.loads() on the response text.

response_json = response.json()

# Display
display(response_json)
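
To see what this decoding step does, here is a minimal sketch using only the standard library json module. The byte string is a made-up stand-in for response.content, not actual output from the API.

```python
import json

# Made-up raw response body, as bytes (a stand-in for response.content)
raw_content = b'{"fact": "Cats sleep for much of the day.", "length": 31}'

# Decode the JSON text into Python objects (here, a dictionary)
decoded = json.loads(raw_content)

print(type(decoded).__name__)
print(decoded["fact"])
```

After decoding, the data can be handled like any other Python dictionary or list.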

(Note that, normally, we could study the API documentation to check the response format beforehand. However, manual inspection and trial and error are often needed, as was the case here.)

API which requires parameters

Next, let’s examine another API, which accepts parameters that specify the information requested. In particular, we will request a list of Finnish universities from http://universities.hipolabs.com using the /search endpoint and a parameter country with value Finland, like this: http://universities.hipolabs.com/search?country=Finland .

# URL
url = 'http://universities.hipolabs.com/search?country=Finland'

# Make a request
response = requests.get(url)

# Decode JSON
response_json = response.json()

# Display
display(response_json[:2])

URLs containing parameters can always be constructed manually by appending the ? character to the path and then listing the parameters as key=value pairs (as above), separated by the & character when there is more than one.

However, Requests allows us to provide the parameters as a dictionary of strings, using the params keyword argument to get(). This is easier to read and less error-prone.

# URL
url = 'http://universities.hipolabs.com/search'

# Make the parameter dictionary
parameters = {'country' : 'Finland'}

# Get response
response = requests.get(url, params=parameters)

# Decode JSON
response_json = response.json()

# Display
display(response_json[:2])
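
To see what the params argument roughly amounts to, the query string can be built with urlencode from the standard library's urllib.parse module. This is a sketch of the idea, not Requests' internal implementation. The second parameter, name, is an assumption added only to show how several pairs are joined with &.

```python
from urllib.parse import urlencode

# Endpoint URL without any parameters
base_url = "http://universities.hipolabs.com/search"

# 'country' comes from the example above; 'name' is assumed for illustration
parameters = {"country": "Finland", "name": "Aalto"}

# Encode the dictionary as key=value pairs joined with '&'
query_string = urlencode(parameters)

# Append the query string to the URL after a '?'
full_url = f"{base_url}?{query_string}"

print(full_url)
```

Besides joining the pairs, urlencode also escapes characters such as spaces, which is one reason why building URLs by hand is error-prone.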

Exercises 1

Exercise WebAPIs-1: Request different activity suggestions from the Bored API

Go to the documentation page of the Bored API. The Bored API is an open API which can be used to randomly generate activity suggestions.

Let’s examine the first sample query on the page, http://www.boredapi.com/api/activity/ , which has the following sample JSON response:

{
    "activity": "Learn Express.js",
    "accessibility": 0.25,
    "type": "education",
    "participants": 1,
    "price": 0.1,
    "link": "https://expressjs.com/",
    "key": "3943506"
} 

Let’s replicate the query and see if we can get another random suggestion.

# Import module
import requests

# URL of the activity API endpoint
url = "http://www.boredapi.com/api/activity/"

# Send the request using the get() function
response = requests.get(url)

# Show the JSON content of the response
display(response.json())

Next, let’s try to narrow down the suggestions by adding some parameters:

type
participants

All possible parameter values are presented at the bottom of the Bored API documentation page. See also the relevant parts of the Requests documentation.

# Define some parameters
params = {
    'type' : 'education',
    'participants' : 1,
}

# Send the request using get() with parameters
response = requests.get(url, params=params)

# Show the JSON content of the response
display("Response")
display(response.json())

Let’s narrow the request further with more parameters

price range
accessibility range

(All possible parameter values are again presented at the bottom of the document page.)

# Define some parameters
params = {
    'type' : 'social',
    'participants' : 2,
    'minprice' : 0,
    'maxprice' : 1000,
}

# Send the request using get() with parameters
response = requests.get(url, params=params)

# Show the JSON content of the response
display(response.json())
display("")

Exercises 2

Exercise WebAPIs-2: Examine request and response headers

Request headers are similar to request parameters but usually define meta information regarding, e.g., content encoding (gzip, utf-8) or user identification (user-agent/user ID/etc., password/access token/etc.).

Let’s first make a request.

# Import modules
import requests

# URL of the activity API endpoint
url = "http://www.boredapi.com/api/activity/"

# Make the request using the get() function
response = requests.get(url)

We can access the headers of the original request

display("Request headers")
display(dict(response.request.headers))

We can also access the headers of the response

display("Response headers")
display(dict(response.headers))

In many cases, the default headers

{'User-Agent': 'python-requests/2.28.1',
 'Accept-Encoding': 'gzip, deflate, br',
 'Accept': '*/*',
 'Connection': 'keep-alive'}

added automatically by Requests are sufficient. However, similarly to parameters, we can pass custom headers to the get() function using the headers keyword argument.

This is useful when, for example, the API has restricted access and requires a user ID and/or password as a part of the headers.

{'User-Agent': 'python-requests/2.28.1',
 'Accept-Encoding': 'gzip, deflate, br',
 'Accept': '*/*',
 'Connection': 'keep-alive',
 'example-user-id' : 'example-password'}

For examples of APIs using this type of authentication, see

Imgur API

For more on authentication, see also Requests documentation.
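
Passing custom headers can be sketched as follows. The header name and value are the placeholders from the example above, not real credentials. To keep the sketch independent of network access, the request is only prepared through a session (which merges in the default headers) and is not sent.

```python
import requests

# URL of the activity API endpoint
url = "http://www.boredapi.com/api/activity/"

# Placeholder custom header from the example above (not a real credential)
custom_headers = {"example-user-id": "example-password"}

# Prepare the request through a session so that the default headers
# are merged with the custom ones; nothing is sent at this point
session = requests.Session()
request = requests.Request("GET", url, headers=custom_headers)
prepared = session.prepare_request(request)

# Show the headers that would be sent
print(dict(prepared.headers))

# To actually send the request with the custom headers, one would call:
# response = requests.get(url, headers=custom_headers)
```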

Exercises 3

Exercise WebAPIs-3: Scrape links from a webpage (Advanced)

Let’s use Requests to get the HTML source code of www.example.com, examine it, and use the Beautiful Soup library to extract links from it. Note: This requires the extra bs4 Python package to be installed, which was not in our initial requirements. Consider this a demo.

# Import module
import requests

# Define webpage to scrape
url = "http://www.example.com/"

# Make a request for the URL
response = requests.get(url)

# Examine the response
display(response.content)

# Looks like HTML :) Let's access it using the text attribute
html = response.text

print(html)

# Import beautiful soup module
from bs4 import BeautifulSoup

# Create soup
soup = BeautifulSoup(html, 'html.parser')

# Extract page title from the HTML
print(f"Found title: {soup.title.text}")

# Extract links (hrefs) from the HTML
for link in soup.find_all('a'):
    print(f"Found link: {link.get('href')}")

# Extract all text from the HTML
print(f"Found text: {soup.get_text()}")    
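
If the bs4 package is not available, the link extraction step can also be sketched with the standard library's html.parser module. This swaps in a different technique from Beautiful Soup and is far less convenient for real scraping. The LinkCollector class and the HTML snippet below are made up for illustration, with the snippet standing in for response.text.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag
        if tag == "a":
            for name, value in attrs:
                if name == "href":
                    self.links.append(value)


# Made-up HTML snippet standing in for response.text
sample_html = (
    "<html><head><title>Demo</title></head><body>"
    '<p><a href="https://www.iana.org/domains/example">More information...</a></p>'
    "</body></html>"
)

# Feed the HTML to the parser and print the collected links
collector = LinkCollector()
collector.feed(sample_html)

for link in collector.links:
    print(f"Found link: {link}")
```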

After exercises: Saving retrieved data to disk

Usually, we want to save the retrieved data to disk for later use. For example, we might collect data for one year and later analyze it for a longitudinal study.

To save the retrieved JSON objects to disk, it is practical to use the JSONLINES file format. A JSONLINES file contains a single valid JSON object on each line. This is preferable to saving each object as its own file, since we generally don’t want to end up with an excessive number of individual files (say, hundreds of thousands or millions).

For example, let’s retrieve three cat facts and save them to a JSONLINES file using the jsonlines library.

# Import
import requests
import jsonlines
import time

# URL
url = 'https://catfact.ninja/fact'

# Make three requests in a loop and append each response JSON object to the file
for i in range(3):

    # Logging
    print(f"Make request {i}")

    # Make a request
    response = requests.get(url)
    
    # Decode to JSON
    response_json = response.json()
                
    # Open a jsonlines writer in 'append' mode 
    with jsonlines.open('catfacts.jsonl', mode='a') as writer:

        # Write
        writer.write(response_json)
        
    # Sleep for one second between requests
    time.sleep(1)

We can then read the objects from the disk using the same library.

# Open a jsonlines reader
with jsonlines.open('catfacts.jsonl', mode='r') as reader:
    
    # Read and display
    for obj in reader:
        display(obj)
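
The JSONLINES format itself is simple enough to sketch with only the standard library, which may help clarify what the jsonlines library does for us. The records and the file name below are made up for illustration, and the file is written to a temporary directory.

```python
import json
import os
import tempfile

# Made-up records standing in for decoded API responses
records = [
    {"fact": "First made-up fact", "length": 18},
    {"fact": "Second made-up fact", "length": 19},
]

# Write to a temporary directory so that no existing file is touched
path = os.path.join(tempfile.mkdtemp(), "example.jsonl")

# Append each object to the file as one line of JSON
for record in records:
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")

# Read the file back, decoding one JSON object per line
with open(path, encoding="utf-8") as f:
    loaded = [json.loads(line) for line in f if line.strip()]

print(loaded)
```

Because each line is a complete JSON object, new results can be appended at any time without rewriting the file.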

Wrap-up

Keypoints

Requests is a common tool for making HTTP requests from Python.
Web APIs often require some trial and error, but actually getting the data is usually not that difficult.
Storing all the data and processing it well can be a much larger issue.