As an alternative to using curl in a shell script, I’m trying to write an ASObjC script based on @ShaneStanley’s ASUL script.
The objective is to get the web page HTML without opening a browser.
It seems to work OK down to the last step: Convert NSXMLDocument to normal AppleScript text. Here I get an error. How can I do this?
If there is a better solution, I’m open to all suggestions.
If you see any issues with my script, please advise.
TIA for all help.
###ASObjC Script
(*
PURPOSE: Get Web Page HTML using ASObjC
(as an alternate to curl)
REF: Script posted by @ShaneStanley to ASUL, 2017-03-31
https://lists.apple.com/archives/applescript-users/2017/Mar/msg00421.html
*)
use AppleScript version "2.4" -- Yosemite (10.10) or later
use framework "Foundation"
use scripting additions
--- SET URL OF WEB PAGE ---
set pageURLStr to "https://forum.keyboardmaestro.com/"
set pageURLnsStr to current application's NSString's stringWithString:pageURLStr
set nsPageURL to current application's NSURL's URLWithString:pageURLnsStr
--- GET WEB PAGE HTML ---
set {nsPageHTML, theError} to current application's NSData's dataWithContentsOfURL:nsPageURL options:0 |error|:(reference)
if nsPageHTML = missing value then error (theError's localizedDescription() as text)
-- convert to XML
set {theXMLDoc, theError} to current application's NSXMLDocument's alloc()'s initWithData:nsPageHTML options:(current application's NSXMLDocumentTidyHTML) |error|:(reference)
if theXMLDoc = missing value then error (theError's localizedDescription() as text)
--- CONVERT TO NORMAL TEXT ---
## FAILS with Can’t make «class ocid» id «data optr00000000508CFD5EF17F0000» into type text.
set pageHTMLStr to theXMLDoc as text ##FAILS
Thanks Shane. That solves it, and is very helpful.
If I’m using the two methods correctly, it looks like they both return the same result.
Do these return the HTML as formatted in the original web page?
Here’s my script updated using the solution provided by @ShaneStanley:
###Final Script (as an example)
(*
PURPOSE: Get Web Page HTML using ASObjC
(as an alternate to curl)
REF: Script posted by @ShaneStanley to ASUL, 2017-03-31
https://lists.apple.com/archives/applescript-users/2017/Mar/msg00421.html
*)
use AppleScript version "2.4" -- Yosemite (10.10) or later
use framework "Foundation"
use scripting additions
--- SET URL OF WEB PAGE ---
set pageURLStr to "http://forum.latenightsw.com/t/welcome-to-the-late-night-software-support-forum/8"
set pageURLnsStr to current application's NSString's stringWithString:pageURLStr
set nsPageURL to current application's NSURL's URLWithString:pageURLnsStr
--- GET WEB PAGE HTML ---
set {nsPageHTML, theError} to current application's NSData's dataWithContentsOfURL:nsPageURL options:0 |error|:(reference)
if nsPageHTML = missing value then error (theError's localizedDescription() as text)
-- convert to XML
set {theXMLDoc, theError} to current application's NSXMLDocument's alloc()'s initWithData:nsPageHTML options:(current application's NSXMLDocumentTidyHTML) |error|:(reference)
if theXMLDoc = missing value then error (theError's localizedDescription() as text)
--- SEARCHING & EXTRACTING INFO FROM WEB PAGE ---
-- • As an alternate to JavaScript in the Browser, use ASObjC XML methods
-- • For an example, see https://lists.apple.com/archives/applescript-users/2017/Mar/msg00421.html
---------------------------------------------------
-- CONVERT TO NORMAL TEXT
-- • There are several options to choose from
-- SEE: Writing XML From NSXML Objects
-- https://developer.apple.com/library/content/documentation/Cocoa/Conceptual/NSXML_Concepts/Articles/WritingXML.html
--------------------------------------------------
-- 1. SIMPLE
set htmlSimpleStr to theXMLDoc's XMLString() as text
-- 2. TIDY
set htmlTidyStr to (theXMLDoc's XMLStringWithOptions:(current application's NSXMLDocumentTidyHTML)) as text
-- (I don't see any difference between #1 and #2; NSXMLDocumentTidyHTML is an input option for reading HTML, so it probably has no effect on output)
-- 3. PRETTY PRINT (produces a very readable XML/HTML output)
set htmlPPStr to (theXMLDoc's XMLStringWithOptions:(current application's NSXMLNodePrettyPrint)) as text
set the clipboard to htmlPPStr
OK, well I’m done. I hope you find this helpful.
Please feel free to add any enhancements.
I realise I was looking too much at your code, and not enough at your objective. In fact, making an XML document is the long way around – the actual data you have is what you want, converted to a string.
You can convert data to a string using NSString’s -initWithData:encoding: method, but it’s a bit tricky here because you can’t be sure of the encoding. You can try UTF-8, and drop back to something else if it fails, but that’s a bit messy.
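As a rough sketch of that try-UTF-8-then-fall-back approach: the handler name `stringFromData:` is made up for illustration, and the choice of Windows Latin 1 as the fallback is only an assumption, not something the page itself tells you.

```applescript
use AppleScript version "2.4" -- Yosemite (10.10) or later
use framework "Foundation"
use scripting additions

-- Hypothetical handler: decode NSData as UTF-8, falling back to Windows Latin 1 (an assumed fallback)
on stringFromData:theData
	-- initWithData:encoding: returns missing value if the bytes are not valid in that encoding
	set theString to current application's NSString's alloc()'s initWithData:theData encoding:(current application's NSUTF8StringEncoding)
	if theString = missing value then
		set theString to current application's NSString's alloc()'s initWithData:theData encoding:(current application's NSWindowsCP1252StringEncoding)
	end if
	if theString = missing value then error "Could not decode the data"
	return theString as text
end stringFromData:
```

It would be called as `my stringFromData:nsPageHTML`. The messy part is that a wrong fallback can still "succeed" and give you garbled characters.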
But as of macOS 10.10 there’s a method that will guess the encoding. It’s a bit confusing – it looks like a way of finding the encoding, but in fact does the conversion at the same time – but it will do what you want.
So something like this:
use AppleScript version "2.4" -- Yosemite (10.10) or later
use framework "Foundation"
use scripting additions
set pageURLStr to "http://forum.latenightsw.com/t/welcome-to-the-late-night-software-support-forum/8"
set nsPageURL to current application's |NSURL|'s URLWithString:pageURLStr
set {nsPageHTML, theError} to current application's NSData's dataWithContentsOfURL:nsPageURL options:0 |error|:(reference)
if nsPageHTML = missing value then error (theError's localizedDescription() as text)
set encodingOptions to current application's NSDictionary's dictionaryWithObject:false forKey:(current application's NSStringEncodingDetectionAllowLossyKey)
set {theEncoding, theString} to current application's NSString's stringEncodingForData:(nsPageHTML) encodingOptions:encodingOptions convertedString:(reference) usedLossyConversion:(missing value)
if theEncoding = 0 then error "Unknown encoding"
set the clipboard to theString as text
This raises the question of what you want to do with the source. If you want to parse it, then it’s probably better to go back to the XML document method, and use its tools to do the parsing.
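For instance, here is a minimal sketch of using the XML document's XPath support to pull out every link. The sample HTML is invented so the snippet runs without a network connection; in practice you would pass the downloaded data instead.

```applescript
use AppleScript version "2.4" -- Yosemite (10.10) or later
use framework "Foundation"
use scripting additions

-- Sample HTML standing in for downloaded page data (hypothetical content)
set sampleHTML to "<html><body><a href=\"https://example.com/a\">A</a><a href=\"https://example.com/b\">B</a></body></html>"
set htmlData to (current application's NSString's stringWithString:sampleHTML)'s dataUsingEncoding:(current application's NSUTF8StringEncoding)

-- build the XML document, letting it tidy the HTML on the way in
set {theXMLDoc, theError} to current application's NSXMLDocument's alloc()'s initWithData:htmlData options:(current application's NSXMLDocumentTidyHTML) |error|:(reference)
if theXMLDoc = missing value then error (theError's localizedDescription() as text)

-- XPath query for the href attribute of every <a> element
set {theNodes, theError} to theXMLDoc's nodesForXPath:"//a/@href" |error|:(reference)
if theNodes = missing value then error (theError's localizedDescription() as text)

-- convert the attribute nodes to an AppleScript list of strings
set linkList to (theNodes's valueForKey:"stringValue") as list
```

The same pattern should work for other queries by changing the XPath string.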
I’ve been working on this on and off for months trying to get it to work with Keyboard Maestro. KM was choking on the AppleScriptObjC somewhere, but for whatever reason it’s working now.
------------------------------------------------------------------------------
# Auth: Christopher Stone { Heavy Lifting by Shane Stanley }
# dCre: 2017/02/26 18:30 CST
# dMod: 2017/04/24 19:32 CDT
# Appl: AppleScriptObjC
# Task: Create WebArchives for a list of remote URLs (applet version).
# Libs: None
# Osax: None
# Tags: @Applescript, @Script, @ASObjC, @Create, @Webarchives, @List, @URLs, @EXIF, @Tags
------------------------------------------------------------------------------
use AppleScript version "2.4"
use framework "Foundation"
use framework "WebKit"
use scripting additions
------------------------------------------------------------------------------
property theSender : missing value
property thePath : missing value
property loadDone : true
------------------------------------------------------------------------------
--» User Setting
set destinationFolderPath to POSIX path of ((path to downloads folder as text) & "WebArchive Downloads:")
------------------------------------------------------------------------------
its createDirectoryAtPath:destinationFolderPath
set urlList to getUrlList() -- see handler for url list.
repeat with remoteURL in urlList
set fileName to (its cngStr:"/$" intoString:"" inString:remoteURL)
set fileName to (its cngStr:"^.+/" intoString:"" inString:fileName)
set fileName to (its cngStr:"(\\.\\w+)?$" intoString:".webarchive" inString:fileName)
set pageDestPath to destinationFolderPath & fileName
(its archivePage:remoteURL toPath:pageDestPath sender:me)
repeat
if loadDone then
exit repeat
else
delay 0.25
end if
end repeat
end repeat
------------------------------------------------------------------------------
--» HANDLERS
------------------------------------------------------------------------------
on archivePage:thePageURL toPath:aPath sender:mySender
set my loadDone to false
set my theSender to mySender # Store main script so we can call back
set my thePath to aPath # Store path for use later
my performSelectorOnMainThread:"loadURL:" withObject:thePageURL waitUntilDone:false
end archivePage:toPath:sender:
------------------------------------------------------------------------------
on cngStr:findString intoString:replaceString inString:dataString
set anNSString to current application's NSString's stringWithString:dataString
set dataString to (anNSString's ¬
stringByReplacingOccurrencesOfString:findString withString:replaceString ¬
options:(current application's NSRegularExpressionSearch) range:{0, length of dataString}) as text
end cngStr:intoString:inString:
------------------------------------------------------------------------------
on createDirectoryAtPath:thePath
set {theResult, theError} to current application's NSFileManager's defaultManager()'s createDirectoryAtPath:thePath withIntermediateDirectories:true attributes:(missing value) |error|:(reference)
if not (theResult as boolean) then
set errorMsg to theError's localizedDescription() as text
error errorMsg
end if
end createDirectoryAtPath:
------------------------------------------------------------------------------
on getKMVar(varName)
tell application "Keyboard Maestro Engine"
return getvariable varName
end tell
end getKMVar
------------------------------------------------------------------------------
# Called when the job's done
on jobDone:theMessage
display notification theMessage
end jobDone:
------------------------------------------------------------------------------
on loadURL:thePageURL
# Stuff to be done on main thread
# Make a WebView
set theView to current application's WebView's alloc()'s initWithFrame:{origin:{x:0, y:0}, |size|:{width:100, height:100}}
# Tell it to call delegate methods on me
theView's setFrameLoadDelegate:me
# Load the page
theView's setMainFrameURL:thePageURL
end loadURL:
------------------------------------------------------------------------------
# Called when our WebView loads a frame
on WebView:aWebView didFinishLoadForFrame:webFrame
# The main frame is our interest
if webFrame = aWebView's mainFrame() then
# Get the text of the page
set theText to (webFrame's DOMDocument()'s documentElement()'s outerText())
# Search it
# Get the data and write it to file
set theArchiveData to webFrame's dataSource()'s webArchive()'s |data|()
set x to theArchiveData's writeToFile:thePath atomically:true
# Tell our script it's all done
set my loadDone to true
theSender's jobDone:"The webarchive was saved"
end if
end WebView:didFinishLoadForFrame:
------------------------------------------------------------------------------
# Called if there's a problem
on WebView:WebView didFailLoadWithError:theError forFrame:webFrame
# Got an error, bail
WebView's stopLoading:me
set my loadDone to true
theSender's jobDone:"The webarchive was not saved"
end WebView:didFailLoadWithError:forFrame:
------------------------------------------------------------------------------
on getUrlList()
set urlList to paragraphs 2 thru -2 of "
http://www.sno.phy.queensu.ca/~phil/exiftool/TagNames/JPEG.html
http://www.sno.phy.queensu.ca/~phil/exiftool/TagNames/EXIF.html
http://www.sno.phy.queensu.ca/~phil/exiftool/TagNames/IPTC.html
"
end getUrlList
------------------------------------------------------------------------------
Chris, I was hoping that you would chime in here with your more complete solution. Your script goes to the next level of downloading everything (not just the web page source) to a web archive file/folder.
I would guess a tool like yours would normally be run from the Apple Scripts menu, FastScripts, or KM. Any ideas why it chokes on SD?
Shane, thanks for taking another look at my script/objective.
But I’m wondering, rather than having either the user or the script “guess” at the encoding, why not just stick with the XML process?
Seems like either way we have the same number of steps, and the XML method gives us more options:
###XML Method
set {theXMLDoc, theError} to current application's NSXMLDocument's alloc()'s initWithData:nsPageHTML options:(current application's NSXMLDocumentTidyHTML) |error|:(reference)
if theXMLDoc = missing value then error (theError's localizedDescription() as text)
set htmlPPStr to (theXMLDoc's XMLStringWithOptions:(current application's NSXMLNodePrettyPrint)) as text
###Guess Encoding Method
set encodingOptions to current application's NSDictionary's dictionaryWithObject:false forKey:(current application's NSStringEncodingDetectionAllowLossyKey)
set {theEncoding, theString} to current application's NSString's stringEncodingForData:(nsPageHTML) encodingOptions:encodingOptions convertedString:(reference) usedLossyConversion:(missing value)
if theEncoding = 0 then error "Unknown encoding"
The good news is we have more options on how to achieve the same objective.
In addition to my stated objective, I also wanted to show the various methods available. My script is more of an example than a complete script.
It’s a bit less efficient. It also depends what you’re after – the XML document method may correct some of the HTML, for example, or remove whitespace, whereas the other method returns a 100% faithful rendition.
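One way to check this for a given page, as a sketch only: run the same data through both routes and compare the results. The sloppy sample markup is invented, and UTF-8 is used for the direct route only because the sample was encoded that way in the script itself.

```applescript
use AppleScript version "2.4" -- Yosemite (10.10) or later
use framework "Foundation"
use scripting additions

-- Deliberately sloppy sample markup (hypothetical), standing in for downloaded page data
set sampleHTML to "<p>Hello   <b>world</p>"
set htmlData to (current application's NSString's stringWithString:sampleHTML)'s dataUsingEncoding:(current application's NSUTF8StringEncoding)

-- Route 1: through NSXMLDocument, which may repair and reformat the markup
set {theXMLDoc, theError} to current application's NSXMLDocument's alloc()'s initWithData:htmlData options:(current application's NSXMLDocumentTidyHTML) |error|:(reference)
if theXMLDoc = missing value then error (theError's localizedDescription() as text)
set xmlRouteStr to theXMLDoc's XMLString() as text

-- Route 2: decode the bytes directly (the encoding is known here because we chose it above)
set directStr to (current application's NSString's alloc()'s initWithData:htmlData encoding:(current application's NSUTF8StringEncoding)) as text

-- true only if the XML round trip left the markup untouched
set isUnchanged to (xmlRouteStr = directStr)
```

If `isUnchanged` comes back false, comparing the two strings shows what the XML route altered.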
Thanks, I’ve got a lot to look at now. I can say it’s already helpful. Most of these solutions seem to be designed to allow you to go to any web page and download. I have very specific web pages that never change their format but only change parts of their content. I may need to simply get links or look for key phrases or specific strings. So for my purposes some of these are overkill, but now I have several alternatives. I’m mostly hoping for the speed and reliability you don’t get from getting text out of a browser. (I had asked about this a few years ago and got a few suggestions, but none were as reliable as opening the page in Safari and parsing the source text.)
I have one script in particular that sends Safari to over 1000 web pages to get about 20 specific strings from each page to build an internal table. It’s looking at team pages for the largest AYSO soccer region in the country, collecting a variety of information for further processing. The first run gets all the basic information that doesn’t change during the season, then it runs on a weekly basis, updating scores, rosters, etc. that may change.
My first look tells me this method will turn that multi-hour weekly process into one that will take a few minutes.
So I’ll be starting with that script. The season teams are formed in July and the season starts in September, and I can practice with last year’s data, so I’m in good shape. (I may use this to clone last year’s data so I have a local copy I can test against, if it’s not ready to go by the end of June.)
You can use a slightly more complex method, and it allows you to set the user agent. Here’s an example, with some other stuff thrown in:
use AppleScript version "2.4" -- Yosemite (10.10) or later
use framework "Foundation"
use scripting additions
set URLString to "http://forum.latenightsw.com/t/welcome-to-the-late-night-software-support-forum/8"
set theURL to current application's |NSURL|'s URLWithString:URLString
-- make a URL request and set custom header
set theRequest to current application's NSMutableURLRequest's requestWithURL:theURL
theRequest's setValue:"Mozilla/5.0 (iPhone; U; CPU iPhone OS 3_0 like Mac OS X; en-us) AppleWebKit/528.18 (KHTML, like Gecko) Version/4.0 Mobile/7A341" forHTTPHeaderField:"User-Agent"
-- send request and wait for reply
set {theData, theResponse, theError} to current application's NSURLConnection's sendSynchronousRequest:theRequest returningResponse:(reference) |error|:(reference)
-- if missing value, there was a problem
if theData = missing value then error (theError's localizedDescription() as text)
set theCode to theResponse's statusCode()
if theCode is not 200 then error (current application's NSHTTPURLResponse's localizedStringForStatusCode:theCode) as text
-- get mime type
set theMimeType to theResponse's MIMEType()
if theMimeType's hasPrefix:"text" then
-- it's a string
set theEnc to theResponse's textEncodingName() -- IANA string, so no easy way to convert to usable value
set encodingOptions to current application's NSDictionary's dictionaryWithObject:false forKey:(current application's NSStringEncodingDetectionAllowLossyKey)
set {theEncoding, theString} to current application's NSString's stringEncodingForData:(theData) encodingOptions:encodingOptions convertedString:(reference) usedLossyConversion:(missing value)
if theEncoding = 0 then error "Unknown encoding"
return theString
end if