Download and extract data from HTML/XML pages as well as JSON APIs (with examples)
1: Print all URLs found by a Google search
xidel https://www.google.com/search?q=test --extract "//a/extract(@href, 'url[?]q=([^&]+)&', 1)[. != '']"
Motivation:
This command allows us to extract all URLs found on a Google search results page. This can be useful for various purposes such as collecting specific data from those URLs or analyzing the search results.
Explanation:
xidel is the command used to run Xidel.
https://www.google.com/search?q=test is the URL of the Google search query, where “test” is the search term.
--extract specifies an extraction expression to apply to the fetched HTML page.
//a/extract(@href, 'url[?]q=([^&]+)&', 1)[. != ''] is the XPath expression that pulls the target URLs out of the search results: for every “a” element it applies the regular expression to the “href” attribute, keeps the first capture group (the destination URL behind Google’s redirect), and discards empty results for links that do not match.
Example Output:
https://example1.com
https://example2.com
https://example3.com
...
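A possible variation, sketched here, wraps the same expression in the standard XPath function distinct-values() to drop duplicate results and redirects the output to a file with the shell (the file name urls.txt is an arbitrary choice):
xidel https://www.google.com/search?q=test --extract "distinct-values(//a/extract(@href, 'url[?]q=([^&]+)&', 1)[. != ''])" > urls.txt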
2: Print the title of all pages found by a Google search and download them
xidel https://www.google.com/search?q=test --follow "//a/extract(@href, 'url[?]q=([^&]+)&', 1)[. != '']" --extract //title --download '{$host}/'
Motivation:
This command allows us to extract the titles of all web pages found by a Google search and download them for further analysis or offline viewing.
Explanation:
--follow instructs Xidel to follow the URLs extracted from the Google search results.
--extract //title extracts the “title” element from each of the followed pages.
--download '{$host}/' tells Xidel to save each followed page into a directory named after its host (the trailing slash marks the target as a directory).
{$host} is a placeholder that gets replaced with the host name of the page currently being downloaded.
Example Output:
Downloading: https://example1.com
Downloading: https://example2.com
Downloading: https://example3.com
...
Downloaded files:
example1.com/...
example2.com/...
example3.com/...
...
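To see which host each title belongs to, one sketch is to combine the title with the $host variable via the standard XPath concat() function, assuming $host refers to the page currently being processed, as it does in the --download example above (the single quotes keep the shell from expanding $host):
xidel https://www.google.com/search?q=test --follow "//a/extract(@href, 'url[?]q=([^&]+)&', 1)[. != '']" --extract 'concat($host, ": ", //title)'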
3: Follow all links on a page and print the titles, with XPath
xidel https://example.org --follow //a --extract //title
Motivation:
This command allows us to follow all links on a given web page and extract the titles of the linked pages. This can be useful for analyzing the structure and content of a website.
Explanation:
--follow //a instructs Xidel to follow all “a” elements on the page, i.e., follow the links.
--extract //title extracts the “title” element from each of the followed pages.
Example Output:
Title 1
Title 2
Title 3
...
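The XPath expression passed to --follow can be narrowed like any other XPath expression. For example, this sketch only follows absolute HTTPS links, using the standard starts-with() function:
xidel https://example.org --follow "//a[starts-with(@href, 'https://')]" --extract //title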
4: Follow all links on a page and print the titles, with CSS selectors
xidel https://example.org --follow "css('a')" --css title
Motivation:
This command accomplishes the same task as the previous example, but uses CSS selectors instead of XPath. CSS selectors are often more concise for simple element selection and may be more familiar to web developers.
Explanation:
--follow "css('a')"
tells Xidel to follow all “a” elements on the page using CSS selectors.--css title
specifies to extract the “title” element from each of the followed pages using a CSS selector.
Example Output:
Title 1
Title 2
Title 3
...
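The css() function can also be used inside --extract instead of the dedicated --css option, which makes it easy to mix CSS selectors with XPath in a single expression. A sketch equivalent to the command above:
xidel https://example.org --follow "css('a')" --extract "css('title')"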
5: Follow all links on a page and print the titles, with pattern matching
xidel https://example.org --follow "<a>{.}</a>*" --extract "<title>{.}</title>"
Motivation:
This command accomplishes the same task as the previous examples, but uses pattern matching instead of XPath or CSS selectors. Pattern matching can be useful for extracting structured information from HTML.
Explanation:
--follow "<a>{.}</a>*"
instructs Xidel to follow all elements that match the specified pattern. Here, it follows all elements that start with “a” and end with any text content.--extract "<title>{.}</title>"
specifies to extract the text content of the followed elements, which is used to construct a title element.
Example Output:
Title 1
Title 2
Title 3
...
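A pattern capture can also be stored in a named variable, as in the Stack Overflow example below, which labels each result in the output. A sketch, where the variable name page_title is an arbitrary choice:
xidel https://example.org --follow "<a>{.}</a>*" --extract "<title>{page_title:=.}</title>"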
6: Read the pattern from example.xml
xidel path/to/example.xml --extract "<x><foo>ood</foo><bar>{.}</bar></x>"
Motivation:
This command allows us to extract specific elements from an XML file, based on a given pattern. It ensures that the XML file contains the required elements, failing otherwise.
Explanation:
path/to/example.xml is the path to the XML file to be processed.
<x><foo>ood</foo><bar>{.}</bar></x> is the pattern describing the required structure: the file must contain an “x” element with a “foo” child whose text matches “ood” and a “bar” child, whose content is extracted via {.}.
Example Output:
<bar>Some value</bar>
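A minimal input file that satisfies this pattern could look as follows (hypothetical contents for path/to/example.xml); the content of “bar” is what ends up in the output:
<x>
  <foo>ood</foo>
  <bar>Some value</bar>
</x>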
7: Print the newest Stack Overflow questions with title and URL, using pattern matching on their RSS feed
xidel http://stackoverflow.com/feeds --extract "<entry><title>{title:=.}</title><link>{uri:=@href}</link></entry>+"
Motivation:
This command allows us to extract the titles and URLs of the newest Stack Overflow questions from their RSS feed. It can be useful for staying updated with the latest questions or collecting specific information from those questions.
Explanation:
http://stackoverflow.com/feeds is the URL of the Stack Overflow feed.
--extract "<entry><title>{title:=.}</title><link>{uri:=@href}</link></entry>+" is the pattern used for extraction: for every “entry” element (the trailing + requires at least one match) it stores the question title in the variable title and the “href” attribute of the “link” element in the variable uri.
Example Output:
Title: Question 1
URL: https://stackoverflow.com/questions/123456/question-1
Title: Question 2
URL: https://stackoverflow.com/questions/789012/question-2
Title: Question 3
URL: https://stackoverflow.com/questions/345678/question-3
...
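The same pattern idea can be combined with --follow to visit each question page instead of only reading the feed. A sketch, assuming the feed’s “link” elements carry absolute URLs in their “href” attributes:
xidel http://stackoverflow.com/feeds --follow "<entry><link>{@href}</link></entry>+" --extract //title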
8: Check for unread Reddit mail (web scraping that combines CSS selectors, XPath, JSONiq, and automatic form evaluation)
xidel https://reddit.com --follow "form(css('form.login-form')[1], {'user': '$your_username', 'passwd': '$your_password'})" --extract "css('#mail')/@title"
Motivation:
This command demonstrates more advanced usage of Xidel by combining various techniques such as following links, interacting with forms, and extracting specific elements using CSS selectors.
Explanation:
https://reddit.com is the URL of the Reddit homepage.
--follow "form(css('form.login-form')[1], {'user': '$your_username', 'passwd': '$your_password'})" instructs Xidel to fill in and submit the first login form on the page, selecting the form with a CSS selector and passing the username and password as a JSONiq-style object of form fields; Xidel then follows the page returned by the submission.
--extract "css('#mail')/@title" extracts the “title” attribute of the element with ID “mail” on the resulting page, using a CSS selector.
Example Output:
Unread Mail: 3
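$your_username and $your_password in the command above are ordinary shell variables, so one way to run it is to define them first (placeholder values shown; in practice they would come from a prompt or a secrets store rather than being typed in plain text):
your_username='my_reddit_user'
your_password='my_reddit_password'
xidel https://reddit.com --follow "form(css('form.login-form')[1], {'user': '$your_username', 'passwd': '$your_password'})" --extract "css('#mail')/@title"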