⭐️For an overview of scraping techniques in Workbench, see this Medium article.
Advanced scraping requires two steps:
- First, use HTML Scraper to scrape one or several web pages as HTML documents. These documents contain all the data displayed on each page.
- Then “extract” specific data from each HTML document, using HTML to Table.
This two-step process lets you experiment with different extraction methods without re-downloading the HTML each time.
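The benefit of this separation can be sketched in Python: once the raw HTML is saved, you can try as many extraction rules as you like without another download. (The HTML snippet and the extraction patterns below are hypothetical, for illustration only.)

```python
import re

# Step 1 (done once): scrape the page and keep the raw HTML.
# Here a saved string stands in for a real download.
saved_html = """
<html><body>
  <h1>Report 42</h1>
  <span class="date">2021-03-01</span>
</body></html>
"""

# Step 2 (repeatable): experiment with different extraction rules
# against the saved HTML -- no re-download needed.
def extract(pattern, html):
    match = re.search(pattern, html)
    return match.group(1) if match else None

title = extract(r"<h1>(.*?)</h1>", saved_html)          # "Report 42"
date = extract(r'<span class="date">(.*?)</span>', saved_html)  # "2021-03-01"
```

If an extraction rule turns out to be wrong, only Step 2 is repeated; Step 1's download is untouched.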
⭐️If the information you need is already displayed in a table on the page, Scrape Table may be the best tool.
How to use HTML Scraper
1. Choose one of the following options to load URLs:
- Paste a single URL, with the option to add page numbers
- Paste a list of URLs
- Load a column of URLs created by a previous step
2. If needed, add the page numbers corresponding to the pages you want to scrape.
- Select the 'Series of numbered pages' option
- Enter the pages to scrape
Note that the first page number is zero by default, not one. Like many sites (but not all), oversight.gov counts pages from zero, so the URL of the second page ends with page=1.
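The numbered URLs this option generates can be sketched as follows (the base URL and path are illustrative, not the exact oversight.gov address):

```python
# Hypothetical base URL; {} is replaced by the page number.
base = "https://www.oversight.gov/reports?page={}"

# Pages count from zero: the site's "page 1" is page=0 in the URL.
urls = [base.format(n) for n in range(3)]
# urls[1] is the second page of results and ends with page=1
```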
3. Press Scrape.
Workbench will download the HTML (this can take a few minutes) and produce a table with one row per page scraped. Each scrape is saved as a version. The table contains the following columns:
- URL: The reference URL
- Date: date and time of the most recent attempt to scrape, successful or not.
- Status: 200 means that the page was scraped successfully. Can't connect means that no HTML was found at the URL: either the URL is incorrect, or the webpage is offline.
- HTML: The entire HTML of the webpage is contained in the cell. It can be filtered, searched and edited using other modules.
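A minimal Python sketch of what one scraped row contains, using the standard library's urllib (Workbench's actual implementation may differ; the function name and row layout here are assumptions for illustration):

```python
from datetime import datetime, timezone
from urllib.error import URLError
from urllib.request import urlopen

def scrape(url):
    """Return one row of the output table: URL, date, status, HTML."""
    date = datetime.now(timezone.utc).isoformat()
    try:
        with urlopen(url, timeout=30) as resp:
            return {"url": url, "date": date,
                    "status": str(resp.status), "html": resp.read().decode()}
    except OSError:  # covers URLError, DNS failures, timeouts
        # No HTML found at the URL: bad address, or the site is offline.
        return {"url": url, "date": date,
                "status": "Can't connect", "html": None}
```

A failed scrape still produces a row, with its own date and status, which is why the Date column records every attempt, successful or not.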
Automatic update and version control
Once data is loaded, you can set the module to automatically check if new data is available and update the workflow. All previously loaded versions will still be accessible. Learn more about data version control.
The next step is extracting content from the HTML column using the step HTML to Table.