XPath is a query language for selecting nodes from an XML or HTML document. In web scraping, it is used to pull content out of a document by writing selectors that point at exactly the nodes we want. XPath was defined by the World Wide Web Consortium (W3C). CSS selectors are an alternative, but they are rarely used for complicated paths. CSS selectors match mainly on attributes such as id and class, while XPath selectors traverse the whole DOM with a top-down approach and also work with nested elements, so XPath is often preferred for scraping web content. XPath can also select elements based on their text content.
Types of XPath:
Based on the location of the node in the HTML document, XPath expressions are classified into two types:
• ABSOLUTE XPATH If the expression starts at the root node, i.e. with / (single slash), it is an absolute path.
• RELATIVE XPATH If the expression starts at the node that we have selected, or with // (double slash), it is a relative path. Consider the following example:
Below are the different types of nodes in an XPath tree:
• Element node: represents an HTML element, i.e. an HTML tag.
• Attribute node: represents an attribute of an element node, e.g. the "href" attribute in the example.
• Comment node: represents a comment in the HTML document, which is written as <!-- ... -->.
• Text node: represents the text enclosed in an element node.
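To make the four node types concrete, here is a minimal sketch using Python's standard-library xml.etree.ElementTree. The fragment and its URL are invented for illustration and are not the article's original example. It assumes Python 3.8 or later, because ElementTree drops comments unless the tree builder is told to keep them.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment, invented for this sketch.
FRAGMENT = """<body>
  <!-- navigation starts here -->
  <p>Read the <a href="https://example.com/docs">docs</a></p>
</body>"""

# By default ElementTree discards comments; this builder keeps them.
parser = ET.XMLParser(target=ET.TreeBuilder(insert_comments=True))
root = ET.fromstring(FRAGMENT, parser=parser)

link = root.find("p/a")
element_name = link.tag             # element node: the <a> tag
attribute_value = link.get("href")  # attribute node: href on <a>
text_content = link.text            # text node enclosed in <a>

# Comment nodes are stored as special elements whose tag is ET.Comment.
comments = [el.text.strip() for el in root.iter() if el.tag is ET.Comment]

print(element_name, attribute_value, text_content, comments)
```

ElementTree exposes attributes and text as properties of an element rather than as separate nodes, so this only approximates the XPath data model.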
Now suppose we want the title of the HTML page. The XPath expression would be: /html/head/title. Instead of giving the full node path from the root of the tree, we can also write //title, which searches the whole tree, starting from the root (//), and selects only the nodes whose name is title.
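The two ways of reaching the title can be sketched with Python's standard-library xml.etree.ElementTree. The page below is a made-up stand-in, not the article's original example. ElementTree implements only a subset of XPath and always evaluates paths relative to an element, so /html/head/title becomes head/title evaluated on the html element, and //title becomes .//title.

```python
import xml.etree.ElementTree as ET

# Hypothetical page, invented for this sketch.
PAGE = """<html>
  <head><title>My page</title></head>
  <body>
    <p>This is the first paragraph.</p>
  </body>
</html>"""

root = ET.fromstring(PAGE)  # root is the <html> element

# Counterpart of /html/head/title: walk down step by step from <html>.
by_full_path = root.find("head/title")

# Counterpart of //title: search every level below the root.
anywhere = root.findall(".//title")

print(by_full_path.text)
print([t.text for t in anywhere])
```

Both lookups reach the same title element. The full path breaks as soon as the page layout changes, while the // form keeps working as long as a title element exists somewhere.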
|/html||Selects the node named html, which is under the root of the HTML document.|
|/html/head||Selects the node named head, which is inside the html node.|
|//title||Selects all the title nodes from the HTML document.|
|//h2/a||Selects all the a nodes which are directly under an h2 node.|
|//comment()||Selects only comment nodes.|
|//node()||Selects any kind of node in the tree.|
|//text()||Selects only the text nodes, e.g. "This is the first paragraph".|
|//*||Selects all element nodes; comment and text nodes are not included.|
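Some rows of the table can be tried out with Python's standard-library xml.etree.ElementTree, as in the rough sketch below. The page and its URL are invented for illustration. ElementTree supports only part of XPath: paths are written relative to an element (.//x instead of //x), and node tests such as text(), comment() and node() are not available, so text is read from the element's .text property instead.

```python
import xml.etree.ElementTree as ET

# Hypothetical page, invented for this sketch.
PAGE = """<html>
  <head><title>My page</title></head>
  <body>
    <h2>Welcome to my <a href="https://example.com">page</a></h2>
    <p>This is the first paragraph.</p>
  </body>
</html>"""

root = ET.fromstring(PAGE)

# //h2/a: <a> elements that are direct children of an <h2>.
links = root.findall(".//h2/a")

# //*: every element, in document order. Evaluated from <html>, the
# ElementTree form .//* returns the descendants but not <html> itself.
all_tags = [el.tag for el in root.findall(".//*")]

# //text() has no ElementTree counterpart; read .text on each element.
paragraph_texts = [p.text for p in root.findall(".//p")]

print([a.get("href") for a in links])
print(all_tags)
print(paragraph_texts)
```

A library with full XPath 1.0 support would accept the expressions exactly as they appear in the table.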
We can also combine name tests and node tests in a single expression. For example:
//p/text() This expression selects the text nodes inside p elements. In the above example, it selects "This is the first paragraph." Consider this HTML document:
Suppose we want to select only the first li node from the above example. The expression is: //li[position() = 1]. The part enclosed in square brackets is called a predicate, and it filters the node set returned by //li. It checks each node's position using the position() function, which returns the position of the current node among the li children of its parent. We can abbreviate the expression above to: //li[1]. Both XPath expressions above would select the following element:
|//li[position() mod 2 = 0]||Selects the li elements at even positions.|
|//li[a]||Selects the li elements that contain an a element.|
|//li[a or h2]||Selects the li elements that contain either an a or an h2 element.|
|//li[ a [ text() = "link" ] ]||Selects the li elements that contain an a element whose text is "link". Can also be written as //li[ a/text()="link" ]|
|//li[last()]||Selects each li element that is the last li child of its parent. To select only the last li element in the whole document, use (//li)[last()].|
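Several of these predicates can be tried with Python's standard-library xml.etree.ElementTree, as a sketch under stated assumptions. The list below and its id attributes are invented for illustration. ElementTree understands positional predicates, last(), child-existence tests and child-text tests, but not mod or the or operator, so the even-position case is emulated with list slicing instead of a predicate.

```python
import xml.etree.ElementTree as ET

# Hypothetical list, invented for this sketch.
LIST = """<ul>
  <li id="first"><a href="/one">link</a></li>
  <li id="second">plain item</li>
  <li id="third"><h2>heading item</h2></li>
  <li id="fourth"><a href="/four">other</a></li>
</ul>"""

root = ET.fromstring(LIST)


def ids(items):
    """Return the id attribute of each element, for easy comparison."""
    return [item.get("id") for item in items]


first = ids(root.findall(".//li[1]"))             # //li[1]
last = ids(root.findall(".//li[last()]"))         # //li[last()]
with_link = ids(root.findall(".//li[a]"))         # //li[a]
link_text = ids(root.findall(".//li[a='link']"))  # //li[a/text()="link"]

# No mod operator in ElementTree: emulate the even-position predicate
# by slicing. Index 1 is position 2, because XPath counts from 1.
even = ids(root.findall("li")[1::2])

print(first, last, with_link, link_text, even)
```

Note that XPath positions start at 1, while Python list indexes start at 0.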
We can also combine multiple XPath expressions into a single expression using the union operator (|). It merges the results of two different XPath expressions into one node set.
For example, we can select all the a and h2 elements in the document above: //a | //h2
Now, consider this HTML document:
Wrap up: XPath is a powerful and vast topic, and this is just an introduction to its basic concepts. If you want to learn more about it, please refer to the link below:
• https://www.w3schools.com/xml/xpath_intro.asp
You can also try out expressions in an online XPath tester (https://extendsclass.com/xpath-tester.html).