There’s a lot of interesting statistical work you can do by scraping websites, whether it’s reading tables from Wikipedia, grabbing and aggregating climate change tables from government websites, or quantifying the exponential growth of covid-19 articles on Medium. Many of these pages will have easy-to-use methods to download the data in a developer-friendly format, for example CSV or JSON. For others, you’ll have to rely on scraping web pages, an ancient, dark art. In this article, I’ll introduce you to some of the tools of the trade to do so in Python.
Identifying targets with developer tools
Both Chrome and Firefox have developer tools (shortcut: Ctrl+Shift+I). Among other things, these tools allow you to:
- Identify where a particular UI element is created in the page’s HTML
These can help you identify which part of a page you need to scrape. Let’s say, for instance, that we want to find out how the top article of a major news website changes as a function of time during a time of crisis (as of now, covid-19). Perhaps we’ll want to retroactively analyze the sentiment behind the headline, its focus (health, business, etc.), the frequency of updates, or how these headlines co-vary with numbers released by different governments.
Here I’ll use the New York Times as the example webpage. What we’d like is to grab, every half-hour, the title of the biggest headline. With developer tools, I can easily use the Picker tool (Ctrl+Shift+C) to highlight the biggest headline on the page.
Here I find that the biggest headline is a span with a class name of balancedHeadline. Let’s look at how we can grab this information with Selenium.
Selenium: remote control browsers
You might be tempted to simply use urllib to grab the data and parse the resulting HTML yourself. That can work for static pages, but Selenium is the better tool when the content you’re after:
- is dynamically loaded
- is hidden behind authentication layers
In Python, we can easily install Selenium with:
pip install selenium
Then we’ll need to install a WebDriver — this will allow your favorite web browser to be manipulated via Python. Here I installed the Firefox WebDriver on Ubuntu, following the instructions on the Selenium website: download geckodriver and extract the binary to /usr/bin. You can check that the installation worked by running a tiny bit of Python:
from selenium import webdriver

driver = webdriver.Firefox()
driver.get('https://nytimes.com/')
Running this code opens Firefox and navigates to the website. Notice the unusual appearance of the browser: the URL bar is a different color than usual, and there’s a robot icon next to the address.
We can grab the information in at least a couple of ways:
- We could execute JavaScript in the page, using document.querySelectorAll("span.balancedHeadline") to grab the headline
- We could read the DOM to find the right span
Let’s use the second method:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Firefox()
driver.get('https://nytimes.com/')

# Wait up to 10 seconds for the headline span to appear in the DOM
element = WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.CLASS_NAME, "balancedHeadline"))
)
print(element.text)

driver.quit()
Running headless and automating
Although the explicit window is useful for debugging, if you want to run this script every half-hour, you’ll want to run it in a headless browser. A headless browser doesn’t display a window, so it won’t pop up over your other windows and capture your inputs. In addition, we’ll want to save the data to disk, which we can easily accomplish like so:
import datetime

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.firefox.options import Options

options = Options()
options.headless = True

driver = webdriver.Firefox(options=options)
driver.get('https://nytimes.com/')

element = WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.CLASS_NAME, "balancedHeadline"))
)

# Append a timestamped line to the log
with open('/home/pmin/Documents/headless-browser/top_news.txt', 'a') as f:
    f.write('"' + str(datetime.datetime.now()) + '","' + element.text + '"\n')

driver.quit()
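One caveat: the hand-rolled quoting above produces a malformed line if a headline itself contains a double quote. The standard csv module handles escaping for you. Here’s a sketch of a drop-in replacement for the write; the `append_headline` helper is my own name, not part of Selenium:

```python
import csv
import datetime

def append_headline(path, headline, now=None):
    """Append a (timestamp, headline) row, letting csv handle quoting."""
    now = now or datetime.datetime.now()
    # newline='' is the csv module's recommended way to open files
    with open(path, 'a', newline='') as f:
        csv.writer(f, quoting=csv.QUOTE_ALL).writerow([now.isoformat(), headline])

append_headline('top_news.txt', 'A "Quoted" Headline')
```

Reading the file back with csv.reader then round-trips cleanly, even for headlines containing quotes or commas.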
All that’s left now is to set up a cron job to run this script every half-hour. First, we’ll need a shell script. We could run Python directly, but I installed Selenium inside a dedicated conda environment called headless-browser, so my shell script activates this environment first, like so:
#!/bin/bash
source /home/pmin/miniconda3/etc/profile.d/conda.sh && \
conda activate headless-browser && \
python /home/pmin/Documents/headless-browser/grab_headline.py
I saved this file as grab_headline.sh and made it executable with chmod +x. Then, with crontab -e, I added an entry that looks like this:
# m h dom mon dow command
*/30 * * * * /home/pmin/Documents/headless-browser/grab_headline.sh >> /var/log/grab-headline.log 2>&1
You can use an online crontab calculator to figure out what incantations to give cron for different schedules. The entry above also captures any output or errors into a log file. If you don’t want to leave your computer on all the time, you can run cron jobs inside a cloud service instead.
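Once the log accumulates, the analyses mentioned at the start — frequency of updates, how long each headline stays up — reduce to reading the CSV back. A sketch, assuming the two-column format the script writes; the sample data and the `headline_spans` helper are my own illustration:

```python
import csv
import datetime
import io

# Synthetic sample of what the log might look like after a few runs.
LOG = '''\
"2020-03-01 12:00:00.000000","Headline A"
"2020-03-01 12:30:00.000000","Headline A"
"2020-03-01 13:00:00.000000","Headline B"
'''

def headline_spans(lines):
    """Collapse consecutive identical headlines into (headline, first_seen) pairs."""
    spans = []
    for ts, headline in csv.reader(lines):
        if not spans or spans[-1][0] != headline:
            spans.append((headline, datetime.datetime.fromisoformat(ts)))
    return spans

spans = headline_spans(io.StringIO(LOG))
print(spans)
```

On real data you’d open the log file instead of the io.StringIO stand-in; the time between consecutive first_seen values then gives you how long each headline survived.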
A final tweak we’ll need to deal with is authentication. Sometimes the website we’re trying to scrape requires us to be logged in. It’s possible for the headless browser to use the same cookies and credentials as the regular browser you use. To do so, you need to locate the default profile in Firefox: look inside ~/.mozilla/firefox for a directory whose name ends with .default. In my case, I found a directory called lasd74g9.default. Putting everything together, here’s my final script:
import datetime

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.firefox.options import Options

options = Options()
options.headless = True

# Reuse the cookies and credentials from the default Firefox profile
profile = webdriver.FirefoxProfile(
    profile_directory='/home/pmin/.mozilla/firefox/lasd74g9.default')

driver = webdriver.Firefox(firefox_profile=profile, options=options)
driver.get('https://nytimes.com/')

element = WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.CLASS_NAME, "balancedHeadline"))
)

with open('/home/pmin/Documents/headless-browser/top_news.txt', 'a') as f:
    f.write('"' + str(datetime.datetime.now()) + '","' + element.text + '"\n')

driver.quit()
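The profile name above is specific to my machine. Rather than hard-coding it, you can discover it at runtime with the standard library; `find_default_profile` is a hypothetical helper of my own, assuming the usual ~/.mozilla/firefox layout:

```python
import glob
import os

def find_default_profile(base='~/.mozilla/firefox'):
    """Return the first directory under `base` whose name ends in '.default'."""
    matches = glob.glob(os.path.join(os.path.expanduser(base), '*.default'))
    return matches[0] if matches else None
```

You could then pass the result straight to webdriver.FirefoxProfile, keeping the script portable across machines (and across Firefox reinstalls, which regenerate the random prefix).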