JulienBeaulieu
  • Introduction
  • Sciences
    • Math
      • Probability
        • Bayes Rule
        • Binomial distribution
        • Conditional Probability
      • Statistics
        • Descriptive Statistics
        • Inferential Statistics
          • Normal Distributions
          • Sampling Distributions
          • Confidence Intervals
          • Hypothesis Testing
          • AB Testing
        • Simple Linear Regression
        • Multiple Linear Regression
          • Statistical learning course
          • Model Assumptions And How To Address Each
        • Logistic Regression
      • Calculus
        • The big picture of Calculus
          • Derivatives
          • 2nd derivatives
          • The exponential e^x
        • Calculus
        • Gradient
      • Linear Algebra
        • Matrices
          • Matrix Multiplication
          • Inverses and Transpose and permutations
        • Vector Space and subspaces
        • Orthogonality
          • Orthogonal Sets
          • Projections
          • Least Squares
        • Gaussian Elimination
    • Programming
      • Command Line
      • Git & GitHub
      • Latex
      • Linear Algebra
        • Element-wise operations, Multiplication Transpose
      • Encodings and Character Sets
      • Uncategorized
      • Navigating Your Working Directory and File I/O
      • Python
        • Problem Solving
        • Strings
        • Lists & Dictionaries
        • Storing Data
        • HTTP Requests
      • SQL
        • Basic Statements
        • Entity Relationship Diagram
      • Jupyter Notebooks
      • Data Analysis
        • Data Visualization
          • Data Viz Cheat Sheet
          • Explanatory Analysis
          • Univariate Exploration of Data
            • Bar Chart
            • Pie Charts
            • Histograms
            • Kernel Density Estimation
            • Figures, Axes, and Subplots
            • Choosing a Plot for Discrete Data
            • Scales and Transformations (Log)
          • Bivariate Exploration of Data
            • Scatterplots
            • Overplotting, Transparency, and Jitter
            • Heatmaps
            • Violin & Box Plots
            • Categorical Variable Analysis
            • Faceting
            • Line Plots
            • Adapted Bar Charts
            • Q-Q, Swarm, Rug, Strip, Stacked, and Rigeline Plots
          • Multivariate Exploration of Data
            • Non-Positional Encodings for Third Variables
            • Color Palettes
            • Faceting for Multivariate Data
            • Plot and Correlation Matrices
            • Other Adaptations of Bivariate PLots
            • Feature Engineering for Data Viz
        • Python - Cheat Sheet
    • Machine Learning
      • Courses
        • Practical Deep learning for coders
          • Convolutional Neural Networks
            • Image Restauration
            • U-net
          • Lesson 1
          • Lesson 2
          • Lesson 3
          • Lesson 4 NLP, Collaborative filtering, Embeddings
          • Lesson 5 - Backprop, Accelerated SGD
          • Tabular data
        • Fast.ai - Intro to ML
          • Neural Nets
          • Business Applications
          • Class 1 & 2 - Random Forests
          • Lessons 3 & 4
      • Unsupervised Learning
        • Dimensionality Reduction
          • Independant Component Analysis
          • Random Projection
          • Principal Component Analysis
        • K-Means
        • Hierarchical Clustering
        • DBSCAN
        • Gaussian Mixture Model Clustering
        • Cluster Validation
      • Preprocessing
      • Machine Learning Overview
        • Confusion Matrix
      • Linear Regression
        • Feature Scaling and Normalization
        • Regularization
        • Polynomial Regression
        • Error functions
      • Decision Trees
      • Support Vector Machines
      • Training and Tuning
      • Model Evaluation Metrics
      • NLP
      • Neural Networks
        • Perceptron Algorithm
        • Multilayer Perceptron
        • Neural Network Architecture
        • Gradient Descent
        • Backpropagation
        • Training Neural Networks
  • Business
    • Analytics
      • KPIs for a Website
  • Books
    • Statistics
      • Practice Statistics for Data Science
        • Exploring Binary and Categorical Data
        • Data and Sampling Distributions
        • Statistical Experiments and Significance Testing
        • Regression and Prediction
        • Classification
        • Correlation
    • Pragmatic Thinking and Learning
      • Untitled
    • A Mind For Numbers: How to Excel at Math and Science
      • Focused and diffuse mode
      • Procrastination
      • Working memory and long term memory
        • Chunking
      • Importance of sleeping
      • Q&A with Terrence Sejnowski
      • Illusions of competence
      • Seeing the bigger picture
        • The value of a Library of Chunks
        • Overlearning
Powered by GitBook
On this page
  • How to download a file programmatically
  • Downloading images

Was this helpful?

  1. Sciences
  2. Programming
  3. Python

HTTP Requests

How to download a file programmatically

import requests
import os

folder_name = 'ebert_reviews'
if not os.path.exists(folder_name):
    os.makedirs(folder_name)

url = 'https://2342342.cloudfront.net/....'
response = requests.get(url)

with open(os.path.join(folder_name,
                        url.split('/')[-1]), mode='wb') as file:
    file.write(response.content)
    
os.listdir(folder_name)

# output     

Downloading images

The following would work most of the time but i prone to errors.

import requests
r = requests.get(url)
with open(folder_name + '/' + filename, 'wb') as f:
        f.write(r.content)
import requests
from PIL import Image
from io import BytesIO
r = requests.get(url)
i = Image.open(BytesIO(r.content))

Though you may still encounter a similar file error, this code above will at least warn us with an error message, at which point we can manually download the problematic images.

Example:

Let's gather the last piece of data for the Roger Ebert review word clouds now: the movie poster image files. Let's also keep each image's URL to add to the master DataFrame later.

Though we're going to use a loop to minimize repetition, here's how the major parts inside that loop will work, in order:

  1. We're going to query the MediaWiki API using wptools to get a movie poster URL via each page object's image attribute.

  2. Using that URL, we'll programmatically download that image into a folder called bestofrt_posters.

This one's a doozy, so take your time. If you get stuck, the solution is presented in full Jupyter Notebook-form on the next page.

The Jupyter Notebook below contains template code that:

  • Contains title_list, which is a list of all of the Wikipedia page titles for each movie in the Rotten Tomatoes Top 100 Movies of All Time list. This list is in the same order as the Top 100.

  • Creates an empty folder, bestofrt_posters, to store the downloaded movie poster image files.

  • Creates an empty dictionary, image_errors, to fill to keep track of movie poster image URLs that don't work.

  • Loops through the Wikipedia page titles in title_list and:

    • Stores the ranking of that movie in the Top 100 list based on its position in title_list. Ranking is needed so we can join this DataFrame with the master DataFrame later. We can't join on title because the titles of the Rotten Tomatoes pages and the Wikipedia pages differ.

    • Appends a dictionary with ranking, title, and poster_url as the keys and the extracted values for each as the values to df_list.

  • Inspects the images that caused errors and downloads the correct image individually (either via another URL in the image attribute's list or a URL from Google Images)

PreviousStoring DataNextSQL

Last updated 5 years ago

Was this helpful?

This type of error is why the requests library maintainers using the library (short for Pillow) and BytesIO from the io library for non-text requests, like images. They recommend that you access the response body as bytes, for non-text requests. For example, to create an image from binary data returned by a request:

Creates an empty list, df_list, to which dictionaries will be appended. This list of dictionaries will eventually be converted to a pandas DataFrame (this is the ).

Uses to attempt to query MediaWiki for a movie poster image URL and to attempt to download that image. If the attempt fails and an error is encountered, the offending movie is documented in image_errors.

Creates a DataFrame called df by converting df_list using the .

recommend
PIL
most efficient way of building a DataFrame row by row
try and except blocks
pd.DataFrame constructor
290KB
download img from wikipedia store in dataframe.html
Download images programatically - store in folder