XPath Extractor finds and extracts data from HTML documents that have previously been scraped from webpages.

If the information you need is already displayed in a table on the page, Scrape Table may be the best tool.  

When added to a workflow, XPath Extractor expects the HTML text to already be stored in a column called “html.” If it doesn’t find this column, it will suggest adding an HTML Scraper step.

Scrape the HTML from webpages

  1. Add the HTML Scraper step.
  2. Paste the URL(s) of the pages containing the content you want to scrape into Scrape HTML.
  3. Press “Scrape.”

Extract specific information from the HTML

  1. Add the XPath Extractor step.
  2. Set the XPath selector for the data you want to extract (see the section below on how to get XPath selectors).
  3. You can store data in as many columns as you need. Each column requires an XPath selector and a name for the column that will store the data.

How to get XPath selectors

XPath is a language for selecting elements within HTML documents. It’s called a “path” because it identifies a particular element by specifying its parent element, then that parent’s parent, and so on, all the way to the root of the HTML tree.
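
To make the “path” idea concrete, here is a minimal Python sketch that walks from one link up to the root of a tiny page and prints the resulting chain of elements. The sample page and the helper name path_to are made up for illustration, and the snippet uses Python's built-in xml.etree.ElementTree rather than anything inside XPath Extractor itself.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample page (ElementTree needs valid XML).
PAGE = "<html><body><div><p>Hello <a href='/x'>link</a></p></div></body></html>"
root = ET.fromstring(PAGE)

# ElementTree elements do not know their parent, so build a child -> parent lookup.
parent_of = {child: parent for parent in root.iter() for child in parent}

def path_to(element):
    """Return the chain of tag names from the root down to the element."""
    tags = []
    while element is not None:
        tags.append(element.tag)
        element = parent_of.get(element)  # None once we pass the root
    return "/" + "/".join(reversed(tags))

link = root.find(".//a")
print(path_to(link))  # /html/body/div/p/a
```

The printed string is itself a valid XPath: each slash steps from a parent element down to one of its children.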

  • We recommend using the SelectorGadget browser extension. It is a point-and-click interface for visually selecting elements on the page and generating XPath selectors.
  • Here’s a tutorial on how to use your web browser’s “inspector” to find the XPath for any element on the page.

Useful Xpaths

  • Get the text of all links: //a
  • Get the URL of all links: //a/@href
  • Get all elements matching a particular CSS class: //*[contains(@class,'foo')]
  • Get a div matching a particular id: //div[@id="foo"]
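
If you want to try selectors like these outside of a workflow, the following Python sketch runs rough equivalents against a small, made-up page. It relies on Python's built-in xml.etree.ElementTree, which supports only a subset of XPath: it has no contains() function and cannot select attributes with /@href, so those two cases are emulated in plain Python. The sample page and variable names are assumptions for illustration only.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample page (ElementTree cannot parse sloppy HTML).
PAGE = """<html><body>
  <div id="foo" class="foo bar">
    <a href="/one">One</a>
    <a href="/two">Two</a>
  </div>
  <div class="other"><a href="/three">Three</a></div>
</body></html>"""

root = ET.fromstring(PAGE)

# //a : the text of all links
link_texts = [a.text for a in root.findall(".//a")]

# //a/@href : ElementTree cannot select attributes, so read them off each element
link_urls = [a.get("href") for a in root.findall(".//a")]

# //div[@id="foo"] : a div matching a particular id
foo_divs = root.findall(".//div[@id='foo']")

# //*[contains(@class,'foo')] : contains() is unsupported, so filter in Python
foo_class = [el for el in root.iter() if "foo" in el.get("class", "")]

print(link_texts)                    # ['One', 'Two', 'Three']
print(link_urls)                     # ['/one', '/two', '/three']
print([d.tag for d in foo_divs])     # ['div']
print([el.tag for el in foo_class])  # ['div']
```

A full XPath engine would accept the selectors exactly as written in the list above; the Python filtering here is only a stand-in for the parts the standard library leaves out.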

XPaths can be combined with slashes to find nested elements. For example:

  • All link text inside divs with the foo class: //div[contains(@class,'foo')]//a
  • The third li inside the second div on the page: (//div)[2]/li[3] (XPath positions start at 1, not 0)
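
The nested and positional selectors can be sketched the same way in Python. This example again uses a made-up page and the built-in xml.etree.ElementTree, which only handles a subset of XPath, so the class test is an exact match rather than contains(), and “the second div on the page” is picked with a Python list index instead of (//div)[2].

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample page for illustration.
PAGE = """<html><body>
  <div class="foo"><p><a href="/a">Inside foo</a></p></div>
  <div class="bar"><li>first</li><li>second</li><li>third</li></div>
  <p><a href="/b">Outside foo</a></p>
</body></html>"""

root = ET.fromstring(PAGE)

# Links nested at any depth inside a div whose class is exactly "foo".
nested_links = [a.text for a in root.findall(".//div[@class='foo']//a")]

# Second div on the page: Python lists count from 0, so index 1.
second_div = root.findall(".//div")[1]

# Third li directly inside that div: XPath positions count from 1, so li[3].
third_li = second_div.find("li[3]")

print(nested_links)   # ['Inside foo']
print(third_li.text)  # third
```

Note the two counting conventions side by side: the Python index [1] and the XPath predicate [2] both mean “second,” which is an easy place to make an off-by-one mistake.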

For more, here’s an interactive XPath tester where you can experiment on your page. Here’s a handy XPath primer that includes a glossary and a quick reference to the syntax.
