Scraping webpages with developer tools and Selenium

There are a lot of interesting statistics you can compute by scraping websites, whether it’s reading tables from Wikipedia, grabbing and aggregating climate change tables from government websites, or quantifying the exponential growth of covid-19 articles on Medium. Many of these pages have easy-to-use methods to download the data in a developer-friendly format, for example CSV or JSON. For others, you’ll have to rely on scraping web pages, an ancient, dark art. In this article, I’ll introduce you to some of the tools of the trade in Python.

Identifying targets with developer tools

Both Chrome and Firefox have developer tools (shortcut: Ctrl+Shift+I). These tools allow you to:

  • Visualize the HTML, JavaScript, and CSS behind a webpage
  • See the traffic between your browser and the internet, including late HTTP requests triggered by JavaScript
  • Execute custom JavaScript that can read and modify the DOM of a webpage
  • Identify where a particular UI element is created in the page’s HTML

These can help you identify which part of a page you need to scrape. Let’s say, for instance, that we want to find out how the top article of a major news website changes over time during a crisis (as of now, covid-19). Perhaps we’ll want to retroactively run an analysis of the sentiment behind the headline, its focus (health, business, etc.), the frequency of updates, or how these headlines co-vary with numbers released by different governments.

Here I’ll use the New York Times as the example webpage. What we’d like is to grab, every 15 minutes, the text of the biggest headline. With developer tools, I can use the picker tool (Ctrl+Shift+C) to highlight the biggest headline on the page.

Here I find that the biggest headline is a span with a class name of balancedHeadline. Let’s look at how we can grab this information with Selenium.

Selenium: remote-controlled browsers

The method I’ll showcase here uses Selenium: we use a headless browser (that is, an instance of Firefox or Chrome that runs “invisibly”) to navigate to the website we care about. Once on the webpage, we can inject JavaScript, simulate clicks, or manipulate the DOM to get to the information we need. This is a rather heavyweight solution compared to, say, using urllib to grab the data and parsing it with BeautifulSoup (a sketch of that lighter approach follows the list below). However, the big advantage is that the headless browser actually runs the JavaScript on the page and loads all the extra assets: it’s literally like a person navigating to the website. Sometimes, it can be the only solution, especially if the information we need:

  • is dynamically loaded
  • is hidden behind authentication layers
  • requires emulating clicks and interacting with JavaScript
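
For comparison, here’s a minimal sketch of the lightweight urllib-and-BeautifulSoup approach, using the span.balancedHeadline selector we identified above. On a JavaScript-heavy page like this one, the span often won’t be in the static HTML at all, which is exactly why we reach for Selenium:

from urllib.request import urlopen

from bs4 import BeautifulSoup

# Fetch the raw HTML; no JavaScript runs and no extra assets load.
html = urlopen('https://nytimes.com/').read()
soup = BeautifulSoup(html, 'html.parser')

# The headline span may simply not exist in the static HTML.
span = soup.find('span', class_='balancedHeadline')
print(span.get_text() if span else 'not found: rendered by JavaScript')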

In Python, we can easily install Selenium with:

pip install selenium

Then we’ll need to install a WebDriver, which allows your favorite web browser to be manipulated via Python. Here I installed the Firefox WebDriver, geckodriver, on Ubuntu, following the instructions on the Selenium website: download the geckodriver release and extract the binary to /usr/bin. At the time of writing, that looked something like this (the version number is an example; grab the latest release):
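
wget https://github.com/mozilla/geckodriver/releases/download/v0.26.0/geckodriver-v0.26.0-linux64.tar.gz
tar -xzf geckodriver-v0.26.0-linux64.tar.gz
sudo mv geckodriver /usr/bin/

You can check that the installation worked by running a tiny bit of Python: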

from selenium import webdriver

driver = webdriver.Firefox()
driver.get('https://nytimes.com/')

Running this code opens up Firefox and navigates to the website. Notice the unusual appearance of the browser: the URL bar is a different color than usual, and there’s a robot icon next to the address.

A remote-controlled version of Firefox

We can grab the information in at least a couple of ways:

  • We could inject some JavaScript into the page and use document.querySelectorAll("span.balancedHeadline") to grab the headline (see the sketch after this list)
  • We could read the DOM to find the right span
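
For the first approach, here’s a minimal sketch using Selenium’s execute_script, which runs JavaScript inside the page and hands the result back to Python (the selector is the one we found earlier with the picker):

from selenium import webdriver

driver = webdriver.Firefox()
driver.get('https://nytimes.com/')

# Run JavaScript in the page; the return value comes back to Python.
headline = driver.execute_script(
    'var el = document.querySelector("span.balancedHeadline");'
    ' return el ? el.textContent : null;'
)
print(headline)
driver.quit()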

Let’s use the second method instead, waiting for the span to appear in the DOM (the headline is rendered by JavaScript, so it may not be present as soon as the page loads):

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Firefox()
driver.get('https://nytimes.com/')
element = WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.CLASS_NAME, "balancedHeadline"))
)
print(element.text)
driver.quit()

Running headless and automating

Although the explicit window is useful for debugging, if you want to run this script every 15 minutes, you’ll want to run it in a headless browser. A headless browser doesn’t display a window, so it won’t pop up over your other windows and capture your inputs. In addition, we’ll want to save the data to disk, which we can accomplish like so:

import datetime

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.firefox.options import Options

options = Options()
options.headless = True

driver = webdriver.Firefox(options=options)
driver.get('https://nytimes.com/')
element = WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.CLASS_NAME, "balancedHeadline"))
)
with open('/home/pmin/Documents/headless-browser/top_news.txt', 'a') as f:
    f.write('"' + str(datetime.datetime.now()) + '","' + element.text + '"\n')
driver.quit()

Cron job

All that’s left now is to set up a cron job to run this script every 15 minutes. First, we’ll need a shell script. We could call python directly, but I installed Selenium inside a dedicated conda environment called headless-browser, so my shell script activates this environment first, like so:

#!/bin/bash
source /home/pmin/miniconda3/etc/profile.d/conda.sh && \
conda activate headless-browser && \
python /home/pmin/Documents/headless-browser/grab_headline.py

I saved this file as /home/pmin/Documents/headless-browser/grab_headline.sh.
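
Don’t forget to make the script executable, so that cron is able to run it:

chmod +x /home/pmin/Documents/headless-browser/grab_headline.sh

Running crontab -e, I added an entry that looks like this: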

# m h  dom mon dow   command
*/15 * * * * /home/pmin/Documents/headless-browser/grab_headline.sh >> /var/log/grab-headline.log 2>&1

You can use a site like crontab.guru to figure out what incantations to give cron for different schedules. The redirection at the end of the entry captures any output or errors into a log file (make sure the user running the cron job can write to that location). If you don’t want to leave your computer on all the time, you can run the cron job inside a cloud service instead.

Authentication

A final tweak we need to deal with is authentication. Sometimes the website we’re trying to scrape requires us to be logged in. The headless browser can use the same cookies and credentials as the regular browser you use. To do so, you need to locate the name of the default profile in Firefox. Look for a directory that ends with .default:

ls ~/.mozilla/firefox/
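
If you’d rather not hard-code the profile name, here’s a small sketch that finds it automatically (it assumes there’s exactly one directory matching *.default):

import glob

# Assumes a single default profile; adjust the pattern if you have several.
profile_dir = glob.glob('/home/pmin/.mozilla/firefox/*.default')[0]
print(profile_dir)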

In my case, I found a directory called lasd74g9.default. Putting everything together, here’s my final script:

import datetime

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.firefox.options import Options

options = Options()
options.headless = True

profile = webdriver.FirefoxProfile(profile_directory='/home/pmin/.mozilla/firefox/lasd74g9.default')
driver = webdriver.Firefox(firefox_profile=profile, options=options)
driver.get('https://nytimes.com/')
element = WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.CLASS_NAME, "balancedHeadline"))
)
with open('/home/pmin/Documents/headless-browser/top_news.txt', 'a') as f:
    f.write('"' + str(datetime.datetime.now()) + '","' + element.text + '"\n')
driver.quit()
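
One caveat: the hand-rolled CSV line above will produce a malformed row if a headline ever contains a double quote. The standard library’s csv module handles quoting for you; you could swap the with block in the script for something like this:

import csv

# csv.writer takes care of quoting and escaping inside headlines.
with open('/home/pmin/Documents/headless-browser/top_news.txt', 'a', newline='') as f:
    csv.writer(f).writerow([datetime.datetime.now(), element.text])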
