Web APIs with Python

Questions

  • Have you ever needed to get some data from somewhere else on the web?

Objectives

  • Understand what a web server and an API are, and why you might need to talk to one.

  • Basics of the requests Python library

  • Some lightweight recommendations on saving data when you move on to more serious data downloads.

Requests

Requests is a Python library that makes requests to web servers. It provides a nice interface and is one of the go-to tools for the job. It handles the raw data download from simple web servers.

First, let’s take a tour of the Requests webpage. Below, we embed the Requests website into a Jupyter notebook, but you might want to open it in another browser tab: https://requests.readthedocs.io/en/latest/

# Embed the requests homepage
from IPython.display import IFrame
requests_documentation_url = "https://requests.readthedocs.io/en/latest/"
IFrame(requests_documentation_url, '100%', '30%')

Retrieve data from API

An API (Application Programming Interface) is the definition of the way computer programs communicate with each other. We use Requests to connect to the API of a web server and tell it what we want, and the server sends the result back to us. This is called the request-response cycle.

We can find a list of some free APIs (available without authentication) at https://apipheny.io/free-api/#apis-without-key . These APIs can be used for developing and testing our code.

Let’s make a request to the Cat Fact API. If we go to https://catfact.ninja/, it gives us the definitions:

  • GET /fact is the API endpoint.

  • GET is the type of request we make and

  • /fact is the path.

You can even test this in your web browser: https://catfact.ninja/fact

Using the Requests library, we do this with get().

# Import
import requests

# URL
url = 'https://catfact.ninja/fact'

# Make a request
response = requests.get(url)

The requests.Response object tells us what the server said. We can access the response content using content.

response_content = response.content

# Display
display(response_content)
b'{"fact":"Smuggling a cat out of ancient Egypt was punishable by death. Phoenician traders eventually succeeded in smuggling felines, which they sold to rich people in Athens and other important cities.","length":192}'

The response content is in JSON format, and Requests gives us the json() method, which decodes it and returns the corresponding data as Python objects. This is roughly equivalent to calling json.loads() on the response text.

response_json = response.json()

# Display
display(response_json)
{'fact': 'Smuggling a cat out of ancient Egypt was punishable by death. Phoenician traders eventually succeeded in smuggling felines, which they sold to rich people in Athens and other important cities.',
 'length': 192}
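As a rough illustration of what json() does, the sketch below decodes a bytes payload by hand with the standard json module. The payload is a shortened, made-up stand-in for a response body, not real API output.

```python
import json

# Hypothetical payload shaped like a Cat Fact response (not real API output)
raw_content = b'{"fact": "Cats sleep a lot.", "length": 17}'

# bytes -> str: roughly what response.text gives us
text = raw_content.decode("utf-8")

# str -> Python objects: roughly what response.json() gives us
decoded = json.loads(text)

print(type(decoded))
print(decoded["fact"])
```

The result is an ordinary Python dictionary, so its values can be accessed by key as usual.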

(Note that, normally, we could study the API documentation to check the response format beforehand. However, manual inspection and trial and error are often needed, as they were here.)
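When experimenting with an unfamiliar API, it can help to fail loudly when something goes wrong. The following is a minimal sketch of a helper that does so. The function name fetch_json and the 10-second default timeout are assumptions made for illustration and are not part of Requests itself.

```python
import requests


def fetch_json(url, params=None, timeout=10):
    """Request `url` and return the decoded JSON body.

    Raises requests.HTTPError for 4xx/5xx responses and
    requests.Timeout if the server does not respond in time.
    """
    # Give up if the server does not respond within `timeout` seconds
    response = requests.get(url, params=params, timeout=timeout)

    # Turn HTTP error status codes (e.g. 404 or 500) into exceptions
    response.raise_for_status()

    # Decode the JSON body into Python objects
    return response.json()
```

For example, fetch_json('https://catfact.ninja/fact') would be expected to return a dictionary like the one displayed earlier, provided the server is reachable.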

API which requires parameters

Next, let’s examine another API, one that accepts parameters to specify the information requested. In particular, we will request a list of Finnish universities from http://universities.hipolabs.com using the /search endpoint and a parameter country with value Finland, like this: http://universities.hipolabs.com/search?country=Finland .

# URL
url = 'http://universities.hipolabs.com/search?country=Finland'

# Make a request
response = requests.get(url)

# Decode JSON
response_json = response.json()

# Display
display(response_json[:2])
[{'name': 'Abo Akademi University',
  'country': 'Finland',
  'state-province': None,
  'alpha_two_code': 'FI',
  'web_pages': ['http://www.abo.fi/'],
  'domains': ['abo.fi']},
 {'name': 'Central Ostrobothnia University of Applied Sciences',
  'country': 'Finland',
  'state-province': None,
  'alpha_two_code': 'FI',
  'web_pages': ['http://www.cou.fi/'],
  'domains': ['cou.fi']}]

URLs containing parameters can always be constructed manually: the query string starts after the ? character, and the parameter key=value pairs are listed there, separated by the & character.
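As a small sketch of manual construction, the standard library function urlencode can build the query string. The second parameter, name, and its value are assumptions added only to show the & separator. Check the API documentation for the parameters it actually supports.

```python
from urllib.parse import urlencode

base_url = 'http://universities.hipolabs.com/search'

# 'name' is a hypothetical second parameter, used to illustrate the & separator
query = urlencode({'country': 'Finland', 'name': 'Aalto'})

# The query string is attached to the base URL after a ? character
full_url = f"{base_url}?{query}"

print(full_url)
```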

However, Requests allows us to provide the parameters as a dictionary of strings, using the params keyword argument to get(). This is easier to read and less error-prone.

# URL
url = 'http://universities.hipolabs.com/search'

# Make the parameter dictionary
parameters = {'country' : 'Finland'}

# Get response
response = requests.get(url, params=parameters)

# Decode JSON
response_json = response.json()

# Display
display(response_json[:2])
[{'name': 'Abo Akademi University',
  'country': 'Finland',
  'state-province': None,
  'alpha_two_code': 'FI',
  'web_pages': ['http://www.abo.fi/'],
  'domains': ['abo.fi']},
 {'name': 'Central Ostrobothnia University of Applied Sciences',
  'country': 'Finland',
  'state-province': None,
  'alpha_two_code': 'FI',
  'web_pages': ['http://www.cou.fi/'],
  'domains': ['cou.fi']}]
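To check what URL Requests builds from a parameter dictionary, one option is to prepare a request without sending it. This is a sketch using the Request object and its prepare() method. No network access is needed.

```python
import requests

# Same endpoint and parameter as in the example above
url = 'http://universities.hipolabs.com/search'
parameters = {'country': 'Finland'}

# Build the request without sending it
prepared = requests.Request('GET', url, params=parameters).prepare()

# The dictionary has been encoded into the query string
print(prepared.url)
```

The printed URL matches the one we wrote by hand earlier, which shows that both approaches produce the same request.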

Exercises 1

Exercise WebAPIs-1: Request different activity suggestions from the Bored API

Go to the documentation page of the Bored API. The Bored API is an open API which can be used to randomly generate activity suggestions.

Let’s examine the first sample query on the page http://www.boredapi.com/api/activity/ with a sample JSON response

{
    "activity": "Learn Express.js",
    "accessibility": 0.25,
    "type": "education",
    "participants": 1,
    "price": 0.1,
    "link": "https://expressjs.com/",
    "key": "3943506"
} 

Let’s replicate the query and see if we can get another random suggestion.

# Import module
import requests

# URL of the activity API end point
url = "http://www.boredapi.com/api/activity/"

# Send the request using the get() function
response = requests.get(url)
# Show the JSON content of the response
display(response.json())
{'activity': 'Make a simple musical instrument',
 'type': 'music',
 'participants': 1,
 'price': 0.4,
 'link': '',
 'key': '7091374',
 'accessibility': 0.25}

Next, let’s try to narrow down the suggestions by adding some parameters

  • type

  • participants

All possible parameter values are presented at the bottom of the Bored API documentation page. See also the relevant parts of the Requests documentation.

# Define some parameters
params = {
    'type' : 'education',
    'participants' : 1,
}

# Send the request using get() with parameters
response = requests.get(url, params=params)
# Show the JSON content of the response
display("Response")
display(response.json())
'Response'
{'activity': 'Learn to greet someone in a new language',
 'type': 'education',
 'participants': 1,
 'price': 0.1,
 'link': '',
 'key': '4704256',
 'accessibility': 0.2}

Let’s narrow the request further with more parameters

  • price range

  • accessibility range

(All possible parameter values are again presented at the bottom of the documentation page.)

# Define some parameters
params = {
    'type' : 'social',
    'participants' : 2,
    'minprice' : 0,
    'maxprice' : 1000,
}

# Send the request using get() with parameters
response = requests.get(url, params=params)
# Show the JSON content of the response
display(response.json())
display("")
{'activity': 'Catch up with a friend over a lunch date',
 'type': 'social',
 'participants': 2,
 'price': 0.2,
 'link': '',
 'key': '5590133',
 'accessibility': 0.15}
''

Exercises 2

Exercise WebAPIs-2: Examine request and response headers

Request headers are similar to request parameters but usually define meta information regarding, e.g., content encoding (gzip, utf-8) or user identification (user-agent/user ID/etc., password/access token/etc.).

Let’s first make a request.

# Import modules
import requests

# URL of the activity API end point
url = "http://www.boredapi.com/api/activity/"

# Make the request using the get() function
response = requests.get(url)

We can access the headers of the original request

display("Request headers")
display(dict(response.request.headers))
'Request headers'
{'User-Agent': 'python-requests/2.31.0',
 'Accept-Encoding': 'gzip, deflate',
 'Accept': '*/*',
 'Connection': 'keep-alive'}

We can also access the headers of the response

display("Response headers")
display(dict(response.headers))
'Response headers'
{'Server': 'Cowboy',
 'Report-To': '{"group":"heroku-nel","max_age":3600,"endpoints":[{"url":"https://nel.heroku.com/reports?ts=1702435708&sid=67ff5de4-ad2b-4112-9289-cf96be89efed&s=1CptPsLjgkBBAVu%2B78M4NT8Niemr58rn9E7XiWbXAYA%3D"}]}',
 'Reporting-Endpoints': 'heroku-nel=https://nel.heroku.com/reports?ts=1702435708&sid=67ff5de4-ad2b-4112-9289-cf96be89efed&s=1CptPsLjgkBBAVu%2B78M4NT8Niemr58rn9E7XiWbXAYA%3D',
 'Nel': '{"report_to":"heroku-nel","max_age":3600,"success_fraction":0.005,"failure_fraction":0.05,"response_headers":["Via"]}',
 'Connection': 'keep-alive',
 'X-Powered-By': 'Express',
 'Access-Control-Allow-Origin': '*',
 'Access-Control-Allow-Headers': 'Origin, X-Requested-With, Content-Type, Accept',
 'Content-Type': 'application/json; charset=utf-8',
 'Content-Length': '137',
 'Etag': 'W/"89-q3mDLnm/woJGRdOtqBBQJs2FGGE"',
 'Date': 'Wed, 13 Dec 2023 02:48:28 GMT',
 'Via': '1.1 vegur'}

In many cases, the default headers

{'User-Agent': 'python-requests/2.28.1',
 'Accept-Encoding': 'gzip, deflate, br',
 'Accept': '*/*',
 'Connection': 'keep-alive'}

added automatically by Requests are sufficient. However, similarly to parameters, we can pass custom headers to the get function as an argument.

This is useful when, for example, the API has restricted access and requires a user ID and/or password as a part of the headers.

{'User-Agent': 'python-requests/2.28.1',
 'Accept-Encoding': 'gzip, deflate, br',
 'Accept': '*/*',
 'Connection': 'keep-alive',
 'example-user-id' : 'example-password'}

For examples of APIs using this type of authentication, see

For more on authentication, see also the Requests documentation.
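As a sketch of how custom headers combine with the defaults, the code below prepares a request through a Session without sending it. The header name and value are the placeholders from the example above. A real API defines its own header names, so treat them as assumptions.

```python
import requests

url = "http://www.boredapi.com/api/activity/"

# Placeholder header from the example above; real APIs define their own names
custom_headers = {'example-user-id': 'example-password'}

# A Session holds the default headers that Requests adds automatically
session = requests.Session()

# Prepare (but do not send) a request that carries the custom header
request = requests.Request('GET', url, headers=custom_headers)
prepared = session.prepare_request(request)

# The default headers and the custom header are merged
print(dict(prepared.headers))
```

When actually sending the request, the same dictionary can be passed directly, as in requests.get(url, headers=custom_headers).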

Exercises 3

# Import module
import requests

# Define webpage to scrape
url = "http://www.example.com/"

# Make a request for the URL
response = requests.get(url)

# Examine the response
display(response.content)
b'<!doctype html>\n<html>\n<head>\n    <title>Example Domain</title>\n\n    <meta charset="utf-8" />\n    <meta http-equiv="Content-type" content="text/html; charset=utf-8" />\n    <meta name="viewport" content="width=device-width, initial-scale=1" />\n    <style type="text/css">\n    body {\n        background-color: #f0f0f2;\n        margin: 0;\n        padding: 0;\n        font-family: -apple-system, system-ui, BlinkMacSystemFont, "Segoe UI", "Open Sans", "Helvetica Neue", Helvetica, Arial, sans-serif;\n        \n    }\n    div {\n        width: 600px;\n        margin: 5em auto;\n        padding: 2em;\n        background-color: #fdfdff;\n        border-radius: 0.5em;\n        box-shadow: 2px 3px 7px 2px rgba(0,0,0,0.02);\n    }\n    a:link, a:visited {\n        color: #38488f;\n        text-decoration: none;\n    }\n    @media (max-width: 700px) {\n        div {\n            margin: 0 auto;\n            width: auto;\n        }\n    }\n    </style>    \n</head>\n\n<body>\n<div>\n    <h1>Example Domain</h1>\n    <p>This domain is for use in illustrative examples in documents. You may use this\n    domain in literature without prior coordination or asking for permission.</p>\n    <p><a href="https://www.iana.org/domains/example">More information...</a></p>\n</div>\n</body>\n</html>\n'
# Looks like HTML :) Let's access it using the text attribute
html = response.text

print(html)
<!doctype html>
<html>
<head>
    <title>Example Domain</title>

    <meta charset="utf-8" />
    <meta http-equiv="Content-type" content="text/html; charset=utf-8" />
    <meta name="viewport" content="width=device-width, initial-scale=1" />
    <style type="text/css">
    body {
        background-color: #f0f0f2;
        margin: 0;
        padding: 0;
        font-family: -apple-system, system-ui, BlinkMacSystemFont, "Segoe UI", "Open Sans", "Helvetica Neue", Helvetica, Arial, sans-serif;
        
    }
    div {
        width: 600px;
        margin: 5em auto;
        padding: 2em;
        background-color: #fdfdff;
        border-radius: 0.5em;
        box-shadow: 2px 3px 7px 2px rgba(0,0,0,0.02);
    }
    a:link, a:visited {
        color: #38488f;
        text-decoration: none;
    }
    @media (max-width: 700px) {
        div {
            margin: 0 auto;
            width: auto;
        }
    }
    </style>    
</head>

<body>
<div>
    <h1>Example Domain</h1>
    <p>This domain is for use in illustrative examples in documents. You may use this
    domain in literature without prior coordination or asking for permission.</p>
    <p><a href="https://www.iana.org/domains/example">More information...</a></p>
</div>
</body>
</html>
# Import beautiful soup module
from bs4 import BeautifulSoup

# Create soup
soup = BeautifulSoup(html, 'html.parser')
# Extract page title from the HTML
print(f"Found title: {soup.title.text}")
Found title: Example Domain
# Extract links (hrefs) from the HTML
for link in soup.find_all('a'):
    print(f"Found link: {link.get('href')}")
Found link: https://www.iana.org/domains/example
# Extract all text from the HTML
print(f"Found text: {soup.get_text()}")    
Found text: 


Example Domain







Example Domain
This domain is for use in illustrative examples in documents. You may use this
    domain in literature without prior coordination or asking for permission.
More information...

After exercises: Saving retrieved data to disk

Usually, we want to save the retrieved data to disk for later use. For example, we might collect data for one year and later analyze it for a longitudinal study.

To save the retrieved JSON objects to disk, it is practical to use the JSONLINES file format. The JSONLINES format contains a single valid JSON object on each line. This is preferable to saving each object as its own file since we don’t, in general, want to end up with excessive amounts of individual files (say, hundreds of thousands or millions).

For example, let’s retrieve three cat facts and save them to a JSONLINES file using the jsonlines library.

# Import
import requests
import jsonlines
import time

# URL
url = 'https://catfact.ninja/fact'

# Make three requests in a loop and append each response JSON object to the file
for i in range(3):

    # Logging
    print(f"Make request {i}")

    # Make a request
    response = requests.get(url)
    
    # Decode to JSON
    response_json = response.json()
                
    # Open a jsonlines writer in 'append' mode 
    with jsonlines.open('catfacts.jsonl', mode='a') as writer:

        # Write
        writer.write(response_json)
        
    # Sleep for one second between requests
    time.sleep(1)
Make request 0
Make request 1
Make request 2

We can then read the objects from the disk using the same library.

# Open a jsonlines reader
with jsonlines.open('catfacts.jsonl', mode='r') as reader:
    
    # Read and display
    for obj in reader:
        display(obj)
{'fact': 'During the Middle Ages, cats were associated with withcraft, and on St. John’s Day, people all over Europe would stuff them into sacks and toss the cats into bonfires. On holy days, people celebrated by tossing cats from church towers.',
 'length': 235}
{'fact': "The cat's front paw has 5 toes, but the back paws have 4. Some cats are born with as many as 7 front toes and extra back toes (polydactl).",
 'length': 138}
{'fact': "Many cats cannot properly digest cow's milk. Milk and milk products give them diarrhea.",
 'length': 87}
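Because the JSONLINES format is just one JSON object per line, it can also be written and read with the standard json module alone. The following is a minimal sketch under that assumption. The records and the file name are made up for illustration, and the file is created in a temporary directory so that no existing data is touched.

```python
import json
import os
import tempfile

# Made-up records standing in for responses collected from an API
records = [
    {"id": 1, "fact": "Example fact one."},
    {"id": 2, "fact": "Example fact two."},
]

# Write to a temporary directory so the sketch does not touch existing files
path = os.path.join(tempfile.mkdtemp(), "example.jsonl")

# Append one JSON object per line
with open(path, mode="a", encoding="utf-8") as f:
    for record in records:
        f.write(json.dumps(record) + "\n")

# Read the file back, decoding each non-empty line separately
with open(path, mode="r", encoding="utf-8") as f:
    loaded = [json.loads(line) for line in f if line.strip()]

print(loaded == records)
```

The jsonlines library used above wraps the same idea in a more convenient interface.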

Wrap-up

Keypoints

  • Requests is a common tool

  • Web APIs may often require some trial and error, but actually getting data is usually not that difficult

  • Storing all the data and processing it well can be a much larger issue.