⭐️ For an overview of scraping techniques in Workbench, see this Medium article.
HTML to Table extracts specific content from HTML documents and returns it as table columns.
When added to a workflow, HTML to Table expects the HTML text to already be stored in a column called “html.” If it doesn’t find this column, the data source list will appear; in that case, add the HTML scraper.
Learn more about the HTML scraper here.
Extract specific information from the HTML using XPath
⭐️Here’s a tutorial on XPath Extraction
- Add the XPath Extractor
- Select the XPath option to choose the data you want to extract.
- Specify what content from the page should go in each column of your new table. Each column is defined by an “XPath selector,” a short piece of code written in a language designed for selecting parts of web pages. See below for how to get XPath selectors.
- Name each column of content
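The idea behind these steps, one selector per named column, can be sketched outside Workbench. The example below is only an illustration, not Workbench’s implementation: it assumes Python’s standard-library xml.etree.ElementTree, which needs well-formed markup and understands only a subset of XPath. The sample page, the column names, the extract_table helper, and the choice to pad short columns with None are all hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed page standing in for the contents of the "html" column.
PAGE = """
<html><body>
  <div class="story"><h2>First headline</h2><a href="/one">Read more</a></div>
  <div class="story"><h2>Second headline</h2><a href="/two">Read more</a></div>
</body></html>
"""

# Column name -> selector, mirroring the "name each column" step.
# ElementTree paths are relative to the root element, hence the leading ".".
COLUMNS = {
    "headline": ".//div/h2",
    "link_text": ".//div/a",
}


def extract_table(html_text, columns):
    """Return a list of rows (dicts), one key per named column."""
    root = ET.fromstring(html_text.strip())
    # Collect the text of every element matched by each column's selector.
    values = {
        name: [(el.text or "").strip() for el in root.findall(path)]
        for name, path in columns.items()
    }
    # Assumption of this sketch: pad shorter columns with None so rows line up.
    n_rows = max((len(v) for v in values.values()), default=0)
    return [
        {name: (vals[i] if i < len(vals) else None) for name, vals in values.items()}
        for i in range(n_rows)
    ]


table = extract_table(PAGE, COLUMNS)
for row in table:
    print(row)
```

Each selector fills one column, and the matches are lined up by position to form rows. How Workbench itself handles columns of unequal length is not covered here.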
How to get XPath selectors
XPath is a language for selecting elements within HTML documents. It’s called a “path” because it identifies a particular element by specifying its parent element, then that parent’s parent, and so on up to the root of the HTML tree.
- We recommend using the SelectorGadget browser extension. It is a point-and-click interface for visually selecting elements on the page and generating XPath selectors. Learn more in this tutorial.
- Get the text of all links: //a
- Get the URL of all links: //a/@href
- Get all elements matching a particular CSS class: //*[contains(@class,'foo')]
- Get a div matching a particular id: //div[@id="foo"]
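To see what selectors like these return, here is a small sketch under stated assumptions. It uses Python’s standard-library xml.etree.ElementTree as a stand-in, which supports only part of XPath: attribute selection in the path and contains() are not available, so those two cases are approximated in plain Python. The sample page is hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample page.
PAGE = """
<html><body>
  <div id="foo" class="foo bar"><a href="/one">One</a></div>
  <div class="baz"><a href="/two">Two</a></div>
</body></html>
"""
root = ET.fromstring(PAGE.strip())

# Text of all links (the "//a" case).
link_text = [a.text for a in root.findall(".//a")]

# URL of all links (the "//a/@href" case). ElementTree cannot select
# attributes inside the path, so read the attribute from each element.
link_urls = [a.get("href") for a in root.findall(".//a")]

# Elements whose class attribute contains "foo". contains() is not part of
# ElementTree's XPath subset, so the substring test is done in Python.
foo_class = [el.tag for el in root.iter() if "foo" in el.get("class", "")]

# The div with id "foo"; attribute equality is supported directly.
foo_div = root.findall(".//div[@id='foo']")

print(link_text, link_urls, foo_class, len(foo_div))
```

A full XPath engine would accept the selectors exactly as written in the list above; the workarounds here exist only because the standard library implements a subset.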
XPaths can be combined with slashes to find nested elements. For example:
- All link text inside divs with the foo class: //div[contains(@class,'foo')]//a
- The third li inside the second div on the page: (//div)[2]/li[3]
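The two nested lookups described above can be sketched as follows. This is a rough illustration, assuming Python’s standard-library xml.etree.ElementTree, where the class test and the “second div on the page” step are done in Python because the library’s XPath subset does not cover them. The sample page is hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample page.
PAGE = """
<html><body>
  <div class="foo"><p><a href="/a">Inside foo</a></p></div>
  <div class="bar">
    <li>first</li><li>second</li><li>third</li>
    <a href="/b">Outside foo</a>
  </div>
</body></html>
"""
root = ET.fromstring(PAGE.strip())

# All link text inside divs with the foo class:
# first pick the matching divs, then search their descendants for links.
foo_divs = [d for d in root.iter("div") if "foo" in d.get("class", "")]
foo_links = [a.text for d in foo_divs for a in d.findall(".//a")]

# The third li inside the second div on the page:
# findall returns divs in document order, so index 1 is the second div,
# and index 2 of its li children is the third li.
divs = root.findall(".//div")
third_li = divs[1].findall("li")[2].text

print(foo_links, third_li)
```

Note that XPath positions start at 1 while Python list indexes start at 0, which is why the second div is divs[1].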