JavaScript in Browser vs ASObjC XPath

@ShaneStanley, in an ASUL post you provided an excellent example of how to use ASObjC with an XPath to web-scrape. While I was able to quickly determine the JavaScript querySelectorAll() CSS selector to get what I needed, I can’t even begin to determine the XPath that I could use with ASObjC.

I’m wondering if you (or anyone) could offer any help?

Here’s the web page URL:
WebMD Article List

For the purposes of this example, I need to get the list of articles on this page:

0:"How Type 2 Leads to Heart Disease"
1:"Prediabetes: A Wake-Up Call"
2:"The Right Foods to Fuel Exercise"
3:"How Trackers and Tools Can Help You"
4:"Gum Disease and Diabetes"
5:"High-Fiber Superfoods"

Here is my JavaScript:

var linkElemList = document.querySelectorAll('table.articles a.sub-header')
var linkList = [];
var len = linkElemList.length;

for (var i = 0; i < len; i++) {
  linkList.push(linkElemList[i].innerText);
}

linkList;
/*
0:"How Type 2 Leads to Heart Disease"
1:"Prediabetes: A Wake-Up Call"
2:"The Right Foods to Fuel Exercise"
3:"How Trackers and Tools Can Help You"
4:"Gum Disease and Diabetes"
5:"High-Fiber Superfoods"
*/
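
As an aside, the index loop above can be shortened with Array.from, which accepts any array-like value (a NodeList included) plus a mapping function. This is only a sketch: the linkElemList object below is a hypothetical stand-in for the result of querySelectorAll(), so that it also runs outside a browser.

```javascript
// Hypothetical stand-in for the NodeList returned by
// document.querySelectorAll('table.articles a.sub-header').
const linkElemList = {
  length: 3,
  0: { innerText: "How Type 2 Leads to Heart Disease" },
  1: { innerText: "Prediabetes: A Wake-Up Call" },
  2: { innerText: "The Right Foods to Fuel Exercise" }
};

// Array.from takes an array-like value and an optional mapping function,
// so the explicit for loop collapses into a single expression.
const linkList = Array.from(linkElemList, (elem) => elem.innerText);

console.log(linkList);
```

In the browser console, the stand-in object would simply be replaced by the real querySelectorAll() call.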

It was very easy for me to determine the CSS selector for:
document.querySelectorAll('table.articles a.sub-header')

But after an hour or so of searching and testing, I cannot determine the equivalent XPath for this CSS:
table.articles a.sub-header

which basically says: return a list of HTML elements where the element
<table width="100%" class="articles" cellpadding="0" cellspacing="0">

has a child element of
<a href="http://click.messages.webmd.com/ . . ." class="sub-header">Displayed text</a>

So a table tag with a class of “articles” followed by an anchor tag with a class of “sub-header”.

Of course you can inspect the HTML of the above URL, but here is a screenshot that may help understand it:

Thanks for any help/suggestions.

Sure. Let’s assume there are no other links with a class of sub-header. So:

use AppleScript version "2.4" -- Yosemite (10.10) or later
use framework "Foundation"
use scripting additions

-- load page
set pageURL to current application's |NSURL|'s URLWithString:"http://view.messages.webmd.com/?qs=4948f2bdf0d5e9c3c724c89294d973f5fbfea38e20e24cf5ab1ffbdc852fa462e171e6924fa5bf151573952100e75fa0c327a9d6147c25ca6d747170a22decb8b9c40b8e1681d2b3d66c6f7bd46a48e7"
set {pageHTML, theError} to current application's NSData's dataWithContentsOfURL:pageURL options:0 |error|:(reference)
if pageHTML = missing value then error (theError's localizedDescription() as text)
-- make XML
set {theXMLDoc, theError} to current application's NSXMLDocument's alloc()'s initWithData:pageHTML options:(current application's NSXMLDocumentTidyHTML) |error|:(reference)
if theXMLDoc = missing value then error (theError's localizedDescription() as text)
-- parse for info
set {theMatches, theError} to (theXMLDoc's nodesForXPath:"//a[@class = 'sub-header']" |error|:(reference))
if theMatches = missing value then error (theError's localizedDescription() as text)
return (theMatches's valueForKey:"stringValue") as list 
--> {"How Type 2 Leads to Heart Disease", "Prediabetes: A Wake-Up Call", "The Right Foods to Fuel Exercise", "How Trackers and Tools Can Help You", "Gum Disease and Diabetes", "High-Fiber Superfoods"}

Pretty simple. Starting with // means it will search anywhere in the hierarchy, so it looks for all a elements with a class attribute of sub-header.

If there’s a possibility of other sub-headers, you can specify more of the path. For example, if you want to make sure it is within the correct table, you could do this:

set {theMatches, theError} to (theXMLDoc's nodesForXPath:"//table[@class = 'articles']/tr/td/a[@class = 'sub-header']" |error|:(reference))

Or more simply:

set {theMatches, theError} to (theXMLDoc's nodesForXPath:"//table[@class = 'articles']//a[@class = 'sub-header']" |error|:(reference))

XPath looks daunting at first, but I find I can do most of what I want knowing a few key things: what // does, how to filter elements by attribute, and how to extract attributes (“attribute::att-name”).

Here’s that link to an introduction to XPath again:

https://www.w3schools.com/xml/xpath_intro.asp


OK, thanks. That works. Thanks also for the detailed explanation of using multiple tags:
set {theMatches, theError} to (theXMLDoc's nodesForXPath:"//table[@class = 'articles']//a[@class = 'sub-header']" |error|:(reference))

That is the one that matches my JavaScript:
document.querySelectorAll('table.articles a.sub-header')

Let me make sure I understand the XPath syntax in this case:
//<TagName>[@<AttributeName> = 'value1'] // <AnyChildTag>[@<AttributeName> = 'value2']

which is directly analogous to the CSS Selector for the querySelectorAll() function.
Correct?

[Correct my error per Shane’s comment below]
Of course we have to escape the /, so they become //
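
To make the correspondence concrete, here is a rough JavaScript sketch of how the two notations line up. The cssToXPath function is a hypothetical helper written for this illustration, not a browser or library API, and it assumes the selector consists only of tag or tag.class steps joined by a space or by ">".

```javascript
// cssToXPath is a hypothetical helper, not a browser or library API.
// It handles only steps of the form tag or tag.class, joined by a space
// (CSS descendant combinator) or by ">" (CSS child combinator).
function cssToXPath(selector) {
  // Put spaces around ">" so it becomes its own token, then split on whitespace.
  const tokens = selector.replace(/\s*>\s*/g, " > ").trim().split(/\s+/);
  let xpath = "";
  let axis = "//"; // the first step may sit anywhere in the document
  for (const token of tokens) {
    if (token === ">") {
      axis = "/"; // child combinator maps to a single slash
      continue;
    }
    const [tag, cls] = token.split(".");
    xpath += axis + tag + (cls ? `[@class = '${cls}']` : "");
    axis = "//"; // a plain space maps to a double slash
  }
  return xpath;
}

console.log(cssToXPath("table.articles a.sub-header"));
// //table[@class = 'articles']//a[@class = 'sub-header']
console.log(cssToXPath("table.articles > tr > td > a.sub-header"));
// //table[@class = 'articles']/tr/td/a[@class = 'sub-header']
```

One caveat on the analogy: the CSS form .articles matches an element whose class list merely includes that name, whereas @class = 'articles' requires the whole attribute value to be exactly that string. For the table and links quoted above, which each carry a single class, the two behave the same.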

So, one key follow-up, if you don’t mind.

Once I have a result, like theMatches, which contains a list of items, with each item being an HTML element, how do I extract an attribute from that item/element?
For example, I also need the value of the href attribute.

In case it makes any difference, in this use case I want the href and innerText (as separate keys/items) for each match.

It’s not really escaping. A single / means a direct parent-child relationship, whereas // means an ancestor-descendant relationship.
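
The difference can be sketched in plain JavaScript on a toy tree. The object below is a made-up stand-in for the table > tr > td > a nesting, and the two functions are hypothetical names for what / and // select; this is not how an XPath engine is actually implemented.

```javascript
// A tiny stand-in for the page structure: table > tr > td > a.
const table = {
  name: "table",
  children: [
    { name: "tr", children: [
      { name: "td", children: [
        { name: "a", children: [] }
      ] }
    ] }
  ]
};

// "/" : only elements exactly one level below the context node.
function childrenNamed(node, name) {
  return node.children.filter((child) => child.name === name);
}

// "//" : elements at any depth below the context node.
function descendantsNamed(node, name) {
  const found = [];
  for (const child of node.children) {
    if (child.name === name) found.push(child);
    found.push(...descendantsNamed(child, name));
  }
  return found;
}

console.log(childrenNamed(table, "a").length);    // 0, like table/a
console.log(descendantsNamed(table, "a").length); // 1, like table//a
```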

The result is an NSArray of NSXMLNodes, so you can use the attributeForName: method, which in turn returns another NSXMLNode, of which you want the stringValue property. So:

[...]
set {theMatches, theError} to (theXMLDoc's nodesForXPath:"//a[@class = 'sub-header']" |error|:(reference))
if theMatches = missing value then error (theError's localizedDescription() as text)
set theResult to {}
repeat with aMatch in theMatches
	set end of theResult to {(aMatch's attributeForName:"href")'s stringValue() as text, aMatch's stringValue() as text}
end repeat
return theResult

Thanks again, Shane.

This is now a great example of using ASObjC to select XML nodes by XPath, and then get the desired data (attributes) from the results.

I’m curious about how you develop and test the XPath that you want to use.
I don’t see a good way of doing that in AppleScript using SD7, but maybe I’m missing something.
Any suggestions?

So far, I have found it easiest to inspect, develop, and test XPaths and querySelector() calls using the Chrome JavaScript Console, opened for the web page of interest. For those interested, open the web page in Chrome, then press ⌘⌥I to open Chrome Dev Tools.

Here is the equivalent JavaScript to the ASObjC XPath:

var xPath = "//table[@class = 'articles']//a[@class = 'sub-header']"
var nodes = document.evaluate(xPath, document, null, XPathResult.ANY_TYPE, null)
var dataList = [];
var result = nodes.iterateNext();

while (result) {
  dataList.push(
    {
      href: result.getAttribute("href"),
      text: result.innerText
    })
  result = nodes.iterateNext();
}
JSON.stringify(dataList, undefined, 4);

Output in the JavaScript Console looks like this:

"[
    {
        "href": "http://click.messages.webmd.com/?qs=5afd155d787cd0c8868b988a2e43dd283a029ffb5e412d1fd85ceb950bb3dbaa7b3fe4584d64a5809b651abcf623bd5795452a44e974731f7a5c63d909a0e721",
        "text": "How Type 2 Leads to Heart Disease"
    },
    {
        "href": "http://click.messages.webmd.com/?qs=df0db0e3cff3a9670d7a4ecaa1b7fe6979df4de5382bd2df6e333644c5a56615b3e2d6ecf7f806fb448adee30bf93576b3e1f092d728b1afa90e052e98907feb",
        "text": "Prediabetes: A Wake-Up Call"
    },
    {
        "href": "http://click.messages.webmd.com/?qs=df0db0e3cff3a967450050c955396941006708c4b17b34dd11f3c5393a18c5a0f79981f0f7851f2d8326e73779ef08976342d676ef348be26590e1dbc475a9eb",
        "text": "The Right Foods to Fuel Exercise"
    },
    {
        "href": "http://click.messages.webmd.com/?qs=df0db0e3cff3a96764e8cbfd4fa11748b5868c4637827a09fe9383de3daa8cc73c3ad979598e41645e7fa2600f8f34d8eb0391d186dcb0267068e08fcfded771",
        "text": "How Trackers and Tools Can Help You"
    },
    {
        "href": "http://click.messages.webmd.com/?qs=c94752a4d0ba47236b7aa932a1e242fd30accde93b4c59f36c727ca96b84fd07a060cb0abdbb806591de176a39203c22311d07c26eb8ed5f08f0cc1ce0e5628c",
        "text": "Gum Disease and Diabetes"
    },
    {
        "href": "http://click.messages.webmd.com/?qs=c94752a4d0ba47236c5decf8a628bbfe3d28f4d00b845e21b70344400438d38e9e9548971c4309e78479a07f80f5e500524271b0c72f1677f90d14c3b4fde2d5",
        "text": "High-Fiber Superfoods"
    }
]"

Simple trial and error. If I don’t get close after a few tries, I’ll sometimes make a local copy of the XML to speed it up. It’s not like it’s a complex language – just a bit different.

On that point I’d have to disagree. I find the combination of ASObjC and XPath very complex, and depending on the web page, it can be a challenge to identify the proper tags and attributes to use to identify the target element.

That’s why I like using the Chrome JavaScript Console. If a statement fails, or gives the wrong answer, it is easy to just recall the prior execution, make a change, and re-execute. Also the Console auto-complete is awesome! Makes trial-and-error go very fast.

Whereas with ASObjC, you have to rerun the entire script.

So my process is to use the Chrome Console to identify, develop, and test, then, once I have an XPath that works, implement it in ASObjC.

Each to his own. The more tools, the merrier.

Shane, how can I set the pageHTML from normal AppleScript text instead of getting it from a URL?

Looking at the data in SD7, pageHTML seems to be some type of binary data.

I’ll still be looking for an anchor tag (<a href=. . .>) but if that tag is not found, then I will want just the plain text.

So here is an example with the anchor tag:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head><title></title></head>
<body>
<a href="https://kapeli.com/dash" style="font-size: 16px; font-weight: bold;">
Dash for OS X - Code Snippet Manager, API Doc Browser, - Kapeli
</a>
</body>
</html>

from this I want the href and text:
https://kapeli.com/dash
Dash for OS X - Code Snippet Manager, API Doc Browser, - Kapeli

and here is an example without the anchor tag:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head><title></title></head>
<body>
<span style="color: rgb(51, 51, 51); font-family: 'Open Sans', sans-serif; font-size: 16px; font-variant-ligatures: normal; font-variant-caps: normal; font-variant-east-asian: normal; font-variant-position: normal; letter-spacing: normal; text-indent: 0px; text-transform: none; white-space: normal; widows: 1; word-spacing: 0px; -webkit-text-stroke-width: 0px; float: none;">
API Documentation Browser
</span>
</body></html>

I think I just want the html innerText. In this specific case, that would be:
API Documentation Browser

Thanks again for your help.

When I get my project done, I’ll try to put together a mini user’s guide for this process.

It is data — UTF-8-encoded text as data. So:

use AppleScript version "2.4" -- Yosemite (10.10) or later
use framework "Foundation"
use framework "AppKit" 
use scripting additions

set theHTML to "<?xml version=\"1.0\" encoding=\"UTF-8\"?>
<!DOCTYPE html PUBLIC \"-//W3C//DTD XHTML 1.0 Transitional//EN\" \"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd\">
<html xmlns=\"http://www.w3.org/1999/xhtml\">
<head><title></title></head>
<body>
<a href=\"https://kapeli.com/dash\" style=\"font-size: 16px; font-weight: bold;\">
Dash for OS X - Code Snippet Manager, API Doc Browser, - Kapeli
</a>
</body>
</html>"

set theHTML to current application's NSString's stringWithString:theHTML
set pageHTML to theHTML's dataUsingEncoding:(current application's NSUTF8StringEncoding)

set {theXMLDoc, theError} to current application's NSXMLDocument's alloc()'s initWithData:pageHTML options:(current application's NSXMLDocumentTidyHTML) |error|:(reference)
if theXMLDoc = missing value then error (theError's localizedDescription() as text)
-- parse for info
set {theMatches, theError} to (theXMLDoc's nodesForXPath:"//a" |error|:(reference))
if theMatches = missing value then error (theError's localizedDescription() as text)
set theResult to {}
repeat with aMatch in theMatches
	set end of theResult to {(aMatch's attributeForName:"href")'s stringValue() as text, aMatch's stringValue() as text}
end repeat
return theResult

And for the example without the anchor tag, like this:

use AppleScript version "2.4" -- Yosemite (10.10) or later
use framework "Foundation"
use framework "AppKit"
use scripting additions

set theHTML to "<?xml version=\"1.0\" encoding=\"UTF-8\"?>
<!DOCTYPE html PUBLIC \"-//W3C//DTD XHTML 1.0 Transitional//EN\" \"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd\">
<html xmlns=\"http://www.w3.org/1999/xhtml\">
<head><title></title></head>
<body>
<span style=\"color: rgb(51, 51, 51); font-family: 'Open Sans', sans-serif; font-size: 16px; font-variant-ligatures: normal; font-variant-caps: normal; font-variant-east-asian: normal; font-variant-position: normal; letter-spacing: normal; text-indent: 0px; text-transform: none; white-space: normal; widows: 1; word-spacing: 0px; -webkit-text-stroke-width: 0px; float: none;\">
API Documentation Browser
</span>
</body></html>"

set theHTML to current application's NSString's stringWithString:theHTML
set pageHTML to theHTML's dataUsingEncoding:(current application's NSUTF8StringEncoding)

set {theXMLDoc, theError} to current application's NSXMLDocument's alloc()'s initWithData:pageHTML options:(current application's NSXMLDocumentTidyHTML) |error|:(reference)
if theXMLDoc = missing value then error (theError's localizedDescription() as text)
-- parse for info
set {theMatches, theError} to (theXMLDoc's nodesForXPath:"//span" |error|:(reference))
if theMatches = missing value then error (theError's localizedDescription() as text)
set theResult to {}
repeat with aMatch in theMatches
	set end of theResult to aMatch's stringValue() as text
end repeat
return theResult
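
For anyone who wants to see the "text as UTF-8 data" idea outside ASObjC, here is a small JavaScript sketch using the standard TextEncoder and TextDecoder objects. The sample string is an assumption chosen to include one non-ASCII character; it is not taken from the scripts above.

```javascript
// Sample text with one non-ASCII character (é, written as an escape).
const text = "Caf\u00e9";

// TextEncoder always produces UTF-8 and returns a Uint8Array of bytes.
const data = new TextEncoder().encode(text);
console.log(text.length); // 4 characters
console.log(data.length); // 5 bytes, because é takes two bytes in UTF-8

// Decoding the bytes as UTF-8 gives back the original text.
const roundTrip = new TextDecoder("utf-8").decode(data);
console.log(roundTrip === text); // true
```

This mirrors the dataUsingEncoding: step: the string and the data hold the same content, and only the representation changes.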

Thanks, Shane. That was exactly what I needed.