There are a lot of interesting statistics you can gather by scraping websites, whether it’s reading tables from Wikipedia, grabbing and aggregating climate change tables from government websites, or quantifying the exponential growth of covid-19 articles on Medium. Many of these pages offer easy-to-use methods to download the data in a developer-friendly format, for example CSV or JSON. For others, you’ll have to rely on scraping web pages, an ancient, dark art. In this article, I’ll introduce you to some of the tools of the trade to do so in Python.
Identifying targets with developer tools
Both Chrome and Firefox have developer tools (shortcut: Ctrl+Shift+I). These tools allow you to:
- Visualize the HTML, JavaScript, and CSS behind a webpage
- See the traffic between your browser and the internet, including late HTTP requests triggered by JavaScript
- Execute custom JavaScript that can read and modify the DOM of a webpage
- Identify where a particular UI element is created in the page’s HTML
These can help you identify which part of a page you need to scrape. Let’s say, for instance, that we want to find out how the top article of a major news website changes over time during a crisis (as of now, covid-19). Perhaps we’ll want to retroactively run an analysis of the sentiment behind the headline, its focus (health, business, etc.), the frequency of updates, or of how these headlines co-vary with numbers released by different governments.
Here I’ll use the New York Times as the example webpage. What we’d like is to grab, every half-hour, the text of the biggest headline. With developer tools, I can easily use the Picker tool (Ctrl+Shift+C) to highlight the biggest headline on the page.

Here I find that the biggest headline is a span with a class name of balancedHeadline. Let’s look at how we can grab this information with Selenium.
Selenium: remote control browsers
The method I’ll showcase here uses Selenium: we use a headless browser (that is, an instance of Firefox or Chrome that’s running “invisibly”) that navigates to the website we care about. Once on the webpage, we can inject JavaScript, simulate clicks, or manipulate the DOM to get to the information we need. This is a rather heavyweight solution compared to, say, using urllib to grab the data and parsing it with BeautifulSoup (sketched after the list below). However, the big advantage we get is that the headless browser actually runs the JavaScript on the page and loads all the extra assets – it’s literally like a person navigating to the website. Sometimes, it can be the only solution, especially if the information we need:
- is dynamically loaded
- is hidden behind authentication layers
- requires emulating clicks and interacting with JavaScript
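For contrast, here’s what that lighter-weight approach looks like. This is a minimal sketch – it assumes the content you want is served in the initial HTML, which is exactly what fails on dynamic pages:
from urllib.request import urlopen
from bs4 import BeautifulSoup

# Fetch the raw HTML; any JavaScript on the page never runs
html = urlopen('https://nytimes.com/').read()
soup = BeautifulSoup(html, 'html.parser')
print(soup.title.text)
If the element you’re after is rendered by JavaScript, it simply won’t appear in the HTML this returns, and you’ll need the Selenium approach below.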
In Python, we can easily install Selenium with:
pip install selenium
Then we’ll need to install a WebDriver – this will allow your favorite web browser to be manipulated via Python. Here I installed the Firefox webdriver on Ubuntu. I followed the instructions on the Selenium website, downloading geckodriver and extracting the binary to /usr/bin. You can check that the installation worked by running a tiny bit of Python:
from selenium import webdriver
driver = webdriver.Firefox()
driver.get('https://nytimes.com/')
Running this code opens up Firefox and navigates to the website. Notice the unusual appearance of the browser: the URL bar is a different color than usual, and there’s a robot icon next to the address.

We can grab the information in at least a couple of ways:
- We could inject some JavaScript into the page and use document.querySelectorAll("span.balancedHeadline") to grab the headline (see the sketch below)
- We could read the DOM to find the right span
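For reference, a minimal sketch of the first method – it assumes the driver from the snippet above is still open on the page:
# Run JavaScript in the page and return the first matching headline's text
headline = driver.execute_script(
    'var el = document.querySelector("span.balancedHeadline");'
    'return el ? el.textContent : null;'
)
print(headline)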
Let’s use the second method:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Firefox()
driver.get('https://nytimes.com/')

# Wait up to 10 seconds for the headline element to appear in the DOM
element = WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.CLASS_NAME, "balancedHeadline"))
)
print(element.text)
driver.quit()
Running headless and automating
Although the visible browser window is useful for debugging, if you want to run this script every half-hour, you’ll want to run it in a headless browser. A headless browser doesn’t display a window, so it won’t pop up over your other windows and capture your inputs. In addition, we’ll want to save the data to disk, which we can easily accomplish like so:
import datetime

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.firefox.options import Options

# Run Firefox without a visible window
options = Options()
options.headless = True
driver = webdriver.Firefox(options=options)
driver.get('https://nytimes.com/')

element = WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.CLASS_NAME, "balancedHeadline"))
)

# Append a timestamped, quoted CSV row to the log file
with open('/home/pmin/Documents/headless-browser/top_news.txt', 'a') as f:
    f.write('"' + str(datetime.datetime.now()) + '","' + element.text + '"\n')
driver.quit()
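Each line of the file is a quoted CSV row, so the data is easy to read back for analysis later. A minimal sketch, using Python’s built-in csv module and the same path as above:
import csv

with open('/home/pmin/Documents/headless-browser/top_news.txt', newline='') as f:
    for timestamp, headline in csv.reader(f):
        print(timestamp, headline)
One caveat: a headline containing a double quote would break the hand-rolled quoting in the script above; switching the write side to csv.writer would handle the escaping.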
Cron job
All that’s left now is to set up a cron job to run this script every half-hour. First, we’ll need a shell file. We could run Python directly, but I installed Selenium inside a special conda environment called headless-browser, so my shell file activates this environment first, like so:
#!/bin/bash
source /home/pmin/miniconda3/etc/profile.d/conda.sh && \
conda activate headless-browser && \
python /home/pmin/Documents/headless-browser/grab_headline.py
I saved this file as /home/pmin/Documents/headless-browser/grab_headline.sh and made it executable with chmod +x – cron won’t run a script it can’t execute. Running crontab -e, I added an entry that looks like this:
# m h dom mon dow command
*/30 * * * * /home/pmin/Documents/headless-browser/grab_headline.sh >> /var/log/grab-headline.log 2>&1
You can use an online crontab editor to figure out what incantations to give to cron for different schedules. The entry above also captures any output and errors into a log file. If you don’t want to leave your computer on all the time to do this, you can run cron jobs inside of a cloud service instead.
Authentication
A final tweak we’ll need to deal with is authentication. Sometimes the website we’re trying to scrape requires us to be logged in. The headless browser can use the same cookies and credentials as your regular browser. To do so, you need to locate the name of the default profile in Firefox. Look for a directory that ends in .default with:
ls ~/.mozilla/firefox/
In my case, I found a directory called lasd74g9.default.
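If you’d rather not hard-code the profile name, you can look it up from Python instead. A sketch, assuming a standard Firefox install under ~/.mozilla/firefox:
import glob
import os

# Profile directory names vary per install but end in .default
profile_dirs = glob.glob(os.path.expanduser('~/.mozilla/firefox/*.default'))
print(profile_dirs)
Putting everything together, here’s my final script: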
import datetime

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.firefox.options import Options

options = Options()
options.headless = True

# Reuse the default Firefox profile so we're logged in with its cookies
profile = webdriver.FirefoxProfile(profile_directory='/home/pmin/.mozilla/firefox/lasd74g9.default')
driver = webdriver.Firefox(firefox_profile=profile, options=options)
driver.get('https://nytimes.com/')

element = WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.CLASS_NAME, "balancedHeadline"))
)

# Append a timestamped, quoted CSV row to the log file
with open('/home/pmin/Documents/headless-browser/grim_news.txt', 'a') as f:
    f.write('"' + str(datetime.datetime.now()) + '","' + element.text + '"\n')
driver.quit()