Download and extract data from HTML/XML pages as well as JSON APIs (with examples)

1: Print all URLs found by a Google search

xidel 'https://www.google.com/search?q=test' --extract "//a/extract(@href, 'url[?]q=([^&]+)&', 1)[. != '']"

Motivation:

This command allows us to extract all URLs found on a Google search results page. This can be useful for various purposes such as collecting specific data from those URLs or analyzing the search results.

Explanation:

  • xidel is the command used to run Xidel.
  • https://www.google.com/search?q=test is the URL of the Google search query, where “test” is the search term.
  • --extract specifies the expression that Xidel evaluates against the downloaded page.
  • //a/extract(@href, 'url[?]q=([^&]+)&', 1)[. != ''] is the XPath expression that pulls the result URLs out of the page. For every “a” element, extract() applies the regular expression to the “href” attribute and returns capture group 1, the text between “url?q=” and the next “&”. The predicate [. != ''] discards the empty strings returned for links that do not match.
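
To see what the regular expression captures without calling Google, the same pattern can be tried on a single link with sed. This is only a sketch: the href values below are made-up samples in the shape the expression expects, and sed stands in for Xidel's extract() function.

```shell
# Sample href in the shape the regex expects (hypothetical value).
href='/url?q=https://example.com/page&sa=U&ved=abc'
regex='s/.*url[?]q=([^&]+)&.*/\1/p'

# Keep only capture group 1: the text between "url?q=" and the next "&".
target=$(printf '%s\n' "$href" | sed -nE "$regex")
printf '%s\n' "$target"    # prints https://example.com/page

# A link without "url?q=" yields nothing, which corresponds to the empty
# results that the [. != ''] predicate removes in the Xidel expression.
other=$(printf '%s\n' '/preferences?hl=en' | sed -nE "$regex")
printf 'non-matching result: "%s"\n' "$other"
```

If Google changes the format of its result links, the regular expression has to be adjusted accordingly.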

Example Output:

https://example1.com
https://example2.com
https://example3.com
...

2: Print the title of all pages found by a Google search and download them

xidel 'https://www.google.com/search?q=test' --follow "//a/extract(@href, 'url[?]q=([^&]+)&', 1)[. != '']" --extract //title --download '{$host}/'

Motivation:

This command allows us to extract the titles of all web pages found by a Google search and download them for further analysis or offline viewing.

Explanation:

  • --follow instructs Xidel to follow the URLs extracted from the Google search results.
  • --extract //title specifies to extract the “title” element from each of the followed pages.
  • --download '{$host}/' tells Xidel to save each followed page to disk. {$host} is a placeholder that gets replaced with the host name of the page being downloaded, so the path '{$host}/' names a directory per host rather than a file named after the title.
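
As a rough illustration of the value that {$host} stands for, the sketch below derives the host part of a URL with plain shell parameter expansion. It does not call Xidel, the URL is a made-up example, and Xidel's own handling of ports or unusual URLs may differ.

```shell
# Hypothetical URL of a followed page.
url='https://example1.com/some/page.html'

rest=${url#*://}     # drop the scheme: example1.com/some/page.html
host=${rest%%/*}     # drop the path:   example1.com

# The download path '{$host}/' would then correspond to this directory.
printf '%s/\n' "$host"    # prints example1.com/
```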

Example Output:

Downloading: https://example1.com
Downloading: https://example2.com
Downloading: https://example3.com
...

Downloaded files:
example1_com.html
example2_com.html
example3_com.html
...

3: Follow all links on a page and print the titles, with XPath

xidel https://example.org --follow //a --extract //title

Motivation:

This command allows us to follow all links on a given web page and extract the titles of the linked pages. This can be useful for analyzing the structure and content of a website.

Explanation:

  • --follow //a instructs Xidel to follow all “a” elements on the page, i.e., follow the links.
  • --extract //title specifies to extract the “title” element from each of the followed pages.

Example Output:

Title 1
Title 2
Title 3
...

4: Follow all links on a page and print the titles, with CSS selectors

xidel https://example.org --follow "css('a')" --css title

Motivation:

This command accomplishes the same task as the previous example, but uses CSS selectors instead of XPath. CSS selectors are often preferred by those who are more familiar with CSS.

Explanation:

  • --follow "css('a')" tells Xidel to follow all “a” elements on the page using CSS selectors.
  • --css title specifies to extract the “title” element from each of the followed pages using a CSS selector.

Example Output:

Title 1
Title 2
Title 3
...

5: Follow all links on a page and print the titles, with pattern matching

xidel https://example.org --follow "<a>{.}</a>*" --extract "<title>{.}</title>"

Motivation:

This command accomplishes the same task as the previous examples, but uses pattern matching instead of XPath or CSS selectors. Pattern matching can be useful for extracting structured information from HTML.

Explanation:

  • --follow "<a>{.}</a>*" instructs Xidel to follow every element matched by the pattern. <a>...</a> matches an “a” element, {.} selects the matched element so that its link is followed, and the trailing * repeats the match for any number of “a” elements.
  • --extract "<title>{.}</title>" matches the “title” element of each followed page, and {.} outputs its text content.
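
The XPath, CSS, and pattern forms can be compared offline on a small local file. The sketch below assumes Xidel is installed and does nothing otherwise. The HTML content and file name are invented for the demonstration, and the expressions select the “href” attributes directly instead of following the links, so no network access is needed.

```shell
tmp=$(mktemp -d)
cat > "$tmp/page.html" <<'EOF'
<html><head><title>Demo</title></head><body>
<a href="https://example.org/one">One</a>
<a href="https://example.org/two">Two</a>
</body></html>
EOF

if command -v xidel >/dev/null 2>&1; then
  # -s (--silent) suppresses status messages so only the results are printed.
  by_xpath=$(xidel -s "$tmp/page.html" --extract '//a/@href')
  by_css=$(xidel -s "$tmp/page.html" --extract "css('a')/@href")
  by_pattern=$(xidel -s "$tmp/page.html" --extract '<a>{@href}</a>*')
  printf 'XPath:\n%s\nCSS:\n%s\nPattern:\n%s\n' "$by_xpath" "$by_css" "$by_pattern"
else
  echo 'xidel is not installed; skipping the comparison' >&2
fi
rm -rf "$tmp"
```

All three variants are expected to list the same two link targets, which makes the choice between them mostly a matter of which syntax reads best for the page at hand.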

Example Output:

Title 1
Title 2
Title 3
...

6: Extract data from example.xml with a pattern (fails if the expected elements are missing)

xidel path/to/example.xml --extract "<x><foo>ood</foo><bar>{.}</bar></x>"

Motivation:

This command allows us to extract specific elements from an XML file, based on a given pattern. It ensures that the XML file contains the required elements, failing otherwise.

Explanation:

  • path/to/example.xml is the path to the XML file to be processed.
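  • <x><foo>ood</foo><bar>{.}</bar></x> is the pattern that describes the expected structure of the XML file: an “x” element containing a “foo” element with the text “ood”, followed by a “bar” element. {.} outputs the content of “bar”, and Xidel fails if the file does not match the pattern.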
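
A minimal way to try this pattern is to create a matching file first. The sketch below assumes Xidel is installed and skips the call otherwise. The file content is a made-up sample chosen to fit the pattern.

```shell
tmp=$(mktemp -d)

# Sample document (hypothetical) with the structure the pattern expects.
printf '%s\n' '<x><foo>ood</foo><bar>Some value</bar></x>' > "$tmp/example.xml"

pattern='<x><foo>ood</foo><bar>{.}</bar></x>'

if command -v xidel >/dev/null 2>&1; then
  # {.} in the pattern selects the content of the matched "bar" element.
  value=$(xidel -s "$tmp/example.xml" --extract "$pattern")
  printf '%s\n' "$value"
else
  echo 'xidel is not installed; skipping the pattern check' >&2
fi
rm -rf "$tmp"
```

Single quotes around the pattern keep the shell from touching the braces and angle brackets.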

Example Output:

Some value

7: Print all newest Stack Overflow questions with title and URL using pattern matching on their RSS feed

xidel http://stackoverflow.com/feeds --extract "<entry><title>{title:=.}</title><link>{uri:=@href}</link></entry>+"

Motivation:

This command allows us to extract the titles and URLs of the newest Stack Overflow questions from their RSS feed. It can be useful for staying updated with the latest questions or collecting specific information from those questions.

Explanation:

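  • http://stackoverflow.com/feeds is the URL of the Stack Overflow feed of newest questions.
  • --extract "<entry><title>{title:=.}</title><link>{uri:=@href}</link></entry>+" specifies the pattern for extracting the desired information. For each “entry” element in the feed, {title:=.} stores the text of the “title” element in the variable title, and {uri:=@href} stores the “href” attribute of the “link” element in the variable uri. The trailing + requires at least one matching entry.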

Example Output:

title: Question 1
uri: https://stackoverflow.com/questions/123456/question-1

title: Question 2
uri: https://stackoverflow.com/questions/789012/question-2

title: Question 3
uri: https://stackoverflow.com/questions/345678/question-3

...

8: Check for unread Reddit mail: web scraping that combines CSS, XPath, JSONiq, and automatic form evaluation

xidel https://reddit.com --follow "form(css('form.login-form')[1], {'user': '$your_username', 'passwd': '$your_password'})" --extract "css('#mail')/@title"

Motivation:

This command demonstrates more advanced usage of Xidel by combining various techniques such as following links, interacting with forms, and extracting specific elements using CSS selectors.

Explanation:

  • https://reddit.com is the URL of the Reddit homepage.
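  • --follow "form(css('form.login-form')[1], {'user': '$your_username', 'passwd': '$your_password'})" instructs Xidel to fill in and submit the login form. css('form.login-form')[1] selects the first form with the class “login-form”, and the second argument supplies the values for its “user” and “passwd” fields. Because the whole expression is in double quotes, the shell replaces $your_username and $your_password with the values of those shell variables before Xidel runs.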
  • --extract "css('#mail')/@title" specifies to extract the “title” attribute of the element with the ID “mail” using a CSS selector.
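
The quoting in the --follow argument is easy to misread, so the sketch below prints the string that Xidel would actually receive. It does not contact Reddit or call Xidel, and the credentials are placeholder values chosen for the demonstration.

```shell
# Hypothetical credentials; in practice these would be set elsewhere.
your_username='alice'
your_password='s3cret'

# Inside double quotes the shell expands the variables even though they
# sit between single quotes, which are ordinary characters here.
follow_expr="form(css('form.login-form')[1], {'user': '$your_username', 'passwd': '$your_password'})"

printf '%s\n' "$follow_expr"
# prints form(css('form.login-form')[1], {'user': 'alice', 'passwd': 's3cret'})
```

A value that itself contains a single quote would end the string early inside the expression, so such values need extra care.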

Example Output:

Unread Mail: 3
