HTTP Requests

How to download a file programmatically

import requests
import os

folder_name = 'ebert_reviews'
if not os.path.exists(folder_name):
    os.makedirs(folder_name)

url = 'https://2342342.cloudfront.net/....'
response = requests.get(url)

with open(os.path.join(folder_name,
                        url.split('/')[-1]), mode='wb') as file:
    file.write(response.content)
    
os.listdir(folder_name)


Downloading images

The following works most of the time, but it is prone to errors, such as silently saving a corrupted or incomplete file.

import requests
# assumes url, folder_name, and filename are already defined
r = requests.get(url)
with open(folder_name + '/' + filename, 'wb') as f:
    f.write(r.content)

This type of error is why the requests library maintainers recommend accessing the response body as bytes for non-text requests, like images, using the Pillow library (the maintained fork of PIL, the Python Imaging Library) together with BytesIO from the standard library's io module. For example, to create an image from the binary data returned by a request:

import requests
from PIL import Image
from io import BytesIO
r = requests.get(url)
i = Image.open(BytesIO(r.content))

Though you may still encounter a similar file error, the code above will at least fail loudly with an error message, at which point we can manually download the problematic images.
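To make that failure mode concrete, here is a small sketch that wraps the Image.open call so invalid bytes are caught instead of crashing the whole run (the helper name bytes_to_image is ours, purely for illustration):

```python
from io import BytesIO
from PIL import Image

def bytes_to_image(data):
    """Parse response bytes as an image; return None if they aren't a valid image."""
    try:
        return Image.open(BytesIO(data))
    except OSError:  # Pillow's UnidentifiedImageError is a subclass of OSError
        return None

# An HTML error page served instead of an image is detected, not silently saved
assert bytes_to_image(b'<html>404 Not Found</html>') is None
```

In a download loop, a None result is the signal to record the URL for manual follow-up.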

Example:

Let's gather the last piece of data for the Roger Ebert review word clouds now: the movie poster image files. Let's also keep each image's URL to add to the master DataFrame later.

Though we're going to use a loop to minimize repetition, here's how the major parts inside that loop will work, in order:

  1. We're going to query the MediaWiki API using wptools to get a movie poster URL via each page object's image attribute.

  2. Using that URL, we'll programmatically download that image into a folder called bestofrt_posters.

This one's a doozy, so take your time. If you get stuck, the solution is presented in full Jupyter Notebook form on the next page.

The Jupyter Notebook below contains template code that:

  • Contains title_list, which is a list of all of the Wikipedia page titles for each movie in the Rotten Tomatoes Top 100 Movies of All Time list. This list is in the same order as the Top 100.

  • Creates an empty list, df_list, to which dictionaries will be appended. This list of dictionaries will eventually be converted to a pandas DataFrame (this is the most efficient way of building a DataFrame row by row).

  • Creates an empty folder, bestofrt_posters, to store the downloaded movie poster image files.

  • Creates an empty dictionary, image_errors, to keep track of movie poster image URLs that don't work.

  • Loops through the Wikipedia page titles in title_list and:

    • Stores the ranking of that movie in the Top 100 list based on its position in title_list. Ranking is needed so we can join this DataFrame with the master DataFrame later. We can't join on title because the titles of the Rotten Tomatoes pages and the Wikipedia pages differ.

    • Uses try and except blocks to attempt to query MediaWiki for a movie poster image URL and to attempt to download that image. If the attempt fails and an error is encountered, the offending movie is documented in image_errors.

    • Appends a dictionary with ranking, title, and poster_url as the keys and the extracted values for each as the values to df_list.

  • Inspects the images that caused errors and downloads the correct image individually (either via another URL in the image attribute's list or a URL from Google Images).

  • Creates a DataFrame called df by converting df_list using the pd.DataFrame constructor.
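Putting the template's pieces together, the loop's overall shape can be sketched as follows. The wptools query and the image download are replaced here by a hypothetical get_poster_url stand-in (the real notebook queries the MediaWiki API and writes each file into bestofrt_posters); the shortened three-title list is also just for illustration:

```python
import pandas as pd

# Hypothetical stand-in for the real wptools/MediaWiki query and image download
def get_poster_url(title):
    return 'https://upload.example.org/' + title + '.jpg'  # placeholder URL

title_list = ['The_Wizard_of_Oz_(1939_film)', 'Citizen_Kane', 'The_Third_Man']

df_list = []       # dictionaries to convert to a DataFrame at the end
image_errors = {}  # maps 'ranking_title' to the exception that was raised

for ranking, title in enumerate(title_list, start=1):
    try:
        poster_url = get_poster_url(title)
        # ... download the poster image into the bestofrt_posters folder here ...
        df_list.append({'ranking': ranking,
                        'title': title,
                        'poster_url': poster_url})
    except Exception as e:
        image_errors['{}_{}'.format(ranking, title)] = e

df = pd.DataFrame(df_list, columns=['ranking', 'title', 'poster_url'])
```

Building df_list first and converting it once at the end is much faster than appending rows to a DataFrame inside the loop, which copies the entire DataFrame on every iteration.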
