XPath Extractor finds and extracts data from HTML documents that have previously been scraped from webpages.
If the information you need is already displayed in a table on the page, Scrape Table may be the best tool.
When added to a workflow, XPath Extractor expects HTML text to already be stored in a column called “html”. If it doesn’t find this column, it will suggest adding an HTML Scraper step.
Scrape the HTML from webpages
- Add the HTML Scraper step
- Paste in the URL(s) of the pages containing the content you want to scrape
- Press “Scrape”
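The scraping step can be sketched in Python to show the shape of its result: one row per page, with the page source stored in an “html” column, which is exactly what XPath Extractor expects. The `fetch` and `scrape` helpers below are illustrative, not the tool’s actual internals:

```python
# Sketch of what scraping produces: one row per URL, with the page
# source stored in a column named "html". fetch() and scrape() are
# illustrative helpers, not the tool's internals.
from urllib.request import urlopen

def fetch(url, timeout=10):
    """Download a page and return its HTML as text."""
    with urlopen(url, timeout=timeout) as resp:
        charset = resp.headers.get_content_charset() or "utf-8"
        return resp.read().decode(charset)

def scrape(urls):
    """Build a table (list of rows) with an 'html' column per page."""
    return [{"url": url, "html": fetch(url)} for url in urls]

# Example (requires network access):
# table = scrape(["https://example.com"])
```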
Extract specific information from the HTML
- Add the XPath Extractor step
- Set the XPath selector to the data you want to extract (see below for how to get an XPath)
- You can store data in as many columns as you need. Each requires an XPath selector and a name for the column the data will be stored in.
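The extraction step amounts to running each configured XPath over the stored HTML and collecting the matches into a named column. A minimal sketch using Python’s standard library (note that `xml.etree.ElementTree` supports only a subset of XPath and needs relative paths like `.//a` rather than `//a`; the row and column names here are invented for illustration):

```python
import xml.etree.ElementTree as ET

# One row as left by the scraper: page source in the "html" column.
row = {"html": "<html><body><a href='/a'>First</a><a href='/b'>Second</a></body></html>"}

# Each output column pairs an XPath selector with the column name to store it in.
columns = {"link_text": ".//a"}

tree = ET.fromstring(row["html"])
extracted = {name: [el.text for el in tree.findall(xpath)]
             for name, xpath in columns.items()}
# extracted == {"link_text": ["First", "Second"]}
```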
How to get XPath selectors
XPath is a language for selecting elements within HTML documents. It’s called a “path” because it identifies a particular element by naming its parent element, then that element’s parent, and so on, all the way up to the root of the HTML tree.
- We recommend the SelectorGadget browser extension: a point-and-click interface for visually selecting elements on the page and generating XPath selectors.
- Here’s a tutorial on how to use your web browser’s “inspector” to find the XPath for any element on the page.
Some common XPath selectors:
- Get the text of all links: //a
- Get the URL of all links: //a/@href
- Get all elements matching a particular CSS class: //*[contains(@class,'foo')]
- Get a div matching a particular id: //div[@id='foo']
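These selectors can be tried directly in Python. This sketch assumes the third-party lxml library (`pip install lxml`), which supports full XPath 1.0; the sample HTML is invented for illustration:

```python
from lxml import html  # third-party: pip install lxml

doc = html.fromstring(
    '<html><body>'
    '<div id="foo" class="foo bar">'
    '<a href="https://example.com/one">One</a>'
    '<a href="https://example.com/two">Two</a>'
    '</div>'
    '</body></html>'
)

link_text = doc.xpath("//a/text()")                   # ['One', 'Two']
link_urls = doc.xpath("//a/@href")                    # both href values
foo_class = doc.xpath("//*[contains(@class,'foo')]")  # elements whose class contains "foo"
foo_div = doc.xpath("//div[@id='foo']")               # the div with id "foo"
```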
Xpaths can be combined with slashes to find nested elements. For example:
- All link text inside divs with the foo class: //div[contains(@class,'foo')]//a
- The third li inside the second div on the page: (//div)[2]//li[3]
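Again assuming the third-party lxml library, the combined selectors behave like this (sample HTML invented for illustration; `(//div)[2]` means the second div in the whole document, whereas a bare `//div[2]` would mean “second div child of its parent”):

```python
from lxml import html  # third-party: pip install lxml

doc = html.fromstring(
    '<html><body>'
    '<div class="foo"><a href="/x">X</a></div>'
    '<div><ul><li>1</li><li>2</li><li>3</li></ul></div>'
    '</body></html>'
)

# All link text inside divs with the "foo" class:
foo_links = doc.xpath("//div[contains(@class,'foo')]//a/text()")  # ['X']

# The third li inside the second div on the page:
third_li = doc.xpath("(//div)[2]//li[3]/text()")                  # ['3']
```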