Use either:
set pageHTMLStr to theXMLDoc's XMLString() as text
or the variant where you can set various options, for example:
set pageHTMLStr to (theXMLDoc's XMLStringWithOptions:(current application's NSXMLDocumentTidyHTML)) as text
Thanks Shane. That solves it, and is very helpful.
If I’m using the two methods correctly, it looks like they both return the same result.
Do these return the HTML as formatted in the original web page?
However, I found this option which is very nice: NSXMLNodePrettyPrint
set htmlPPStr to (theXMLDoc's XMLStringWithOptions:(current application's NSXMLNodePrettyPrint)) as text
which produces a very readable HTML output.
Any other options you think I should look at?
Look at them all – then try what you think suits. It’s not like there’s one true format.
Following this very carefully…
Here’s my script updated using the solution provided by @ShaneStanley:
###Final Script (as an example)
(*
PURPOSE: Get Web Page HTML using ASObjC
(as an alternate to curl)
REF: Script posted by @ShaneStanley to ASUL, 2017-03-31
https://lists.apple.com/archives/applescript-users/2017/Mar/msg00421.html
*)
use AppleScript version "2.4" -- Yosemite (10.10) or later
use framework "Foundation"
use scripting additions
--- SET URL OF WEB PAGE ---
set pageURLStr to "http://forum.latenightsw.com/t/welcome-to-the-late-night-software-support-forum/8"
set pageURLnsStr to current application's NSString's stringWithString:pageURLStr
set nsPageURL to current application's NSURL's URLWithString:pageURLnsStr
--- GET WEB PAGE HTML ---
set {nsPageHTML, theError} to current application's NSData's dataWithContentsOfURL:nsPageURL options:0 |error|:(reference)
if nsPageHTML = missing value then error (theError's localizedDescription() as text)
-- convert to XML
set {theXMLDoc, theError} to current application's NSXMLDocument's alloc()'s initWithData:nsPageHTML options:(current application's NSXMLDocumentTidyHTML) |error|:(reference)
if theXMLDoc = missing value then error (theError's localizedDescription() as text)
--- SEARCHING & EXTRACTING INFO FROM WEB PAGE ---
-- • As an alternate to JavaScript in the Browser, use ASObjC XML methods
-- • For an example, see https://lists.apple.com/archives/applescript-users/2017/Mar/msg00421.html
---------------------------------------------------
-- CONVERT TO NORMAL TEXT
-- • There are several options to choose from
-- SEE: Writing XML From NSXML Objects
-- https://developer.apple.com/library/content/documentation/Cocoa/Conceptual/NSXML_Concepts/Articles/WritingXML.html
--------------------------------------------------
-- 1. SIMPLE
set htmlSimpleStr to theXMLDoc's XMLString() as text
-- 2. TIDY
set htmlTidyStr to (theXMLDoc's XMLStringWithOptions:(current application's NSXMLDocumentTidyHTML)) as text
-- (I don't see any difference between #1 and #2)
-- 3. PRETTY PRINT (produces a very readable XML/HTML output)
set htmlPPStr to (theXMLDoc's XMLStringWithOptions:(current application's NSXMLNodePrettyPrint)) as text
set the clipboard to htmlPPStr
OK, well I’m done. I hope you find this helpful.
Please feel free to add any enhancements.
I realise I was looking too much at your code, and not enough at your objective. In fact, making an XML document is the long way around – the actual data you have is what you want, converted to a string.
You can convert data to a string using NSString’s -initWithData:encoding: method, but it’s a bit tricky here because you can’t be sure of the encoding. You can try UTF-8, and fall back to something else if it fails, but that’s a bit messy.
But as of macOS 10.10 there’s a method that will guess the encoding. It’s a bit confusing – it looks like a way of finding the encoding, but in fact does the conversion at the same time – but it will do what you want.
So something like this:
use AppleScript version "2.4" -- Yosemite (10.10) or later
use framework "Foundation"
use scripting additions
set pageURLStr to "http://forum.latenightsw.com/t/welcome-to-the-late-night-software-support-forum/8"
set nsPageURL to current application's |NSURL|'s URLWithString:pageURLStr
set {nsPageHTML, theError} to current application's NSData's dataWithContentsOfURL:nsPageURL options:0 |error|:(reference)
if nsPageHTML = missing value then error (theError's localizedDescription() as text)
set encodingOptions to current application's NSDictionary's dictionaryWithObject:false forKey:(current application's NSStringEncodingDetectionAllowLossyKey)
set {theEncoding, theString} to current application's NSString's stringEncodingForData:(nsPageHTML) encodingOptions:encodingOptions convertedString:(reference) usedLossyConversion:(missing value)
if theEncoding = 0 then error "Unknown encoding"
set the clipboard to theString as text
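For comparison, here is a minimal sketch of the "try UTF-8, then fall back" approach mentioned earlier. The handler name stringFromData: and the choice of ISO Latin 1 as the fallback are assumptions for illustration, not part of the original script; the sample data is built locally so the sketch does not depend on a network request.

```applescript
use AppleScript version "2.4" -- Yosemite (10.10) or later
use framework "Foundation"
use scripting additions

-- Sample data (hypothetical): Latin-1 bytes that are not valid UTF-8
set sampleString to current application's NSString's stringWithString:"<p>café</p>"
set sampleData to sampleString's dataUsingEncoding:(current application's NSISOLatin1StringEncoding)

set pageSource to my stringFromData:sampleData

-- Hypothetical helper: decode as UTF-8, falling back to ISO Latin 1
on stringFromData:theData
	set theString to current application's NSString's alloc()'s initWithData:theData encoding:(current application's NSUTF8StringEncoding)
	if theString = missing value then
		-- UTF-8 decoding failed; the fallback encoding here is an assumption
		set theString to current application's NSString's alloc()'s initWithData:theData encoding:(current application's NSISOLatin1StringEncoding)
	end if
	if theString = missing value then error "Could not decode the data"
	return theString as text
end stringFromData:
```

The weakness is that the fallback is a guess: ISO Latin 1 accepts any byte sequence, so a page in some other encoding would decode without error but produce the wrong characters.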
This raises the question of what you want to do with the source. If you want to parse it, then it’s probably better to go back to the XML document method, and use its tools to do the parsing.
Hey Folks,
I’ve been working on this on and off for months trying to get it to work with Keyboard Maestro. KM was choking on the AppleScriptObjC somewhere, but for whatever reason it’s working now.
Keyboard Maestro Macro — Create Web Archives (Download Web Page) from a List of URLs
The version below runs as an applet (provided in the zip file), and it will run equally well from FastScripts or Keyboard Maestro.
Script Debugger chokes on it though…
Many thanks to Shane for examples he provided publicly and for the direct help he gave me with this project.
-Chris
WebArchive Downloader.zip (66.3 KB)
------------------------------------------------------------------------------
# Auth: Christopher Stone { Heavy Lifting by Shane Stanley }
# dCre: 2017/02/26 18:30 CST
# dMod: 2017/04/24 19:32 CDT
# Appl: AppleScriptObjC
# Task: Create WebArchives for a list of remote URLs (applet version).
# Libs: None
# Osax: None
# Tags: @Applescript, @Script, @ASObjC, @Create, @Webarchives, @List, @URLs, @EXIF, @Tags
------------------------------------------------------------------------------
use AppleScript version "2.4"
use framework "Foundation"
use framework "WebKit"
use scripting additions
------------------------------------------------------------------------------
property theSender : missing value
property thePath : missing value
property loadDone : true
------------------------------------------------------------------------------
--» User Setting
set destinationFolderPath to POSIX path of ((path to downloads folder as text) & "WebArchive Downloads:")
------------------------------------------------------------------------------
its createDirectoryAtPath:destinationFolderPath
set urlList to getUrlList() -- see handler for url list.
repeat with remoteURL in urlList
set fileName to (its cngStr:"/$" intoString:"" inString:remoteURL)
set fileName to (its cngStr:"^.+/" intoString:"" inString:fileName)
set fileName to (its cngStr:"(\\.\\w+)?$" intoString:".webarchive" inString:fileName)
set pageDestPath to destinationFolderPath & fileName
(its archivePage:remoteURL toPath:pageDestPath sender:me)
repeat
if loadDone then
exit repeat
else
delay 0.25
end if
end repeat
end repeat
------------------------------------------------------------------------------
--» HANDLERS
------------------------------------------------------------------------------
on archivePage:thePageURL toPath:aPath sender:mySender
set my loadDone to false
set my theSender to mySender # Store main script so we can call back
set my thePath to aPath # Store path for use later
my performSelectorOnMainThread:"loadURL:" withObject:thePageURL waitUntilDone:false
end archivePage:toPath:sender:
------------------------------------------------------------------------------
on cngStr:findString intoString:replaceString inString:dataString
set anNSString to current application's NSString's stringWithString:dataString
set dataString to (anNSString's ¬
stringByReplacingOccurrencesOfString:findString withString:replaceString ¬
options:(current application's NSRegularExpressionSearch) range:{0, length of dataString}) as text
end cngStr:intoString:inString:
------------------------------------------------------------------------------
on createDirectoryAtPath:thePath
set {theResult, theError} to current application's NSFileManager's defaultManager()'s createDirectoryAtPath:thePath withIntermediateDirectories:true attributes:(missing value) |error|:(reference)
if not (theResult as boolean) then
set errorMsg to theError's localizedDescription() as text
error errorMsg
end if
end createDirectoryAtPath:
------------------------------------------------------------------------------
on getKMVar(varName)
tell application "Keyboard Maestro Engine"
return getvariable varName
end tell
end getKMVar
------------------------------------------------------------------------------
# Called when the job's done
on jobDone:theMessage
display notification theMessage
end jobDone:
------------------------------------------------------------------------------
on loadURL:thePageURL
# Stuff to be done on main thread
# Make a WebView
set theView to current application's WebView's alloc()'s initWithFrame:{origin:{x:0, y:0}, |size|:{width:100, height:100}}
# Tell it to call delegate methods on me
theView's setFrameLoadDelegate:me
# Load the page
theView's setMainFrameURL:thePageURL
end loadURL:
------------------------------------------------------------------------------
# Called when our WebView loads a frame
on WebView:aWebView didFinishLoadForFrame:webFrame
# The main frame is our interest
if webFrame = aWebView's mainFrame() then
# Get the text of the page
set theText to (webFrame's DOMDocument()'s documentElement()'s outerText())
# Search it
# Get the data and write it to file
set theArchiveData to webFrame's dataSource()'s webArchive()'s |data|()
set x to theArchiveData's writeToFile:thePath atomically:true
# Tell our script it's all done
set my loadDone to true
theSender's jobDone:"The webarchive was saved"
end if
end WebView:didFinishLoadForFrame:
------------------------------------------------------------------------------
# Called if there's a problem
on WebView:aWebView didFailLoadWithError:theError forFrame:webFrame
# Got an error, bail
aWebView's stopLoading:me
set my loadDone to true
theSender's jobDone:"The webarchive was not saved"
end WebView:didFailLoadWithError:forFrame:
------------------------------------------------------------------------------
on getUrlList()
set urlList to paragraphs 2 thru -2 of "
http://www.sno.phy.queensu.ca/~phil/exiftool/TagNames/JPEG.html
http://www.sno.phy.queensu.ca/~phil/exiftool/TagNames/EXIF.html
http://www.sno.phy.queensu.ca/~phil/exiftool/TagNames/IPTC.html
"
end getUrlList
------------------------------------------------------------------------------
Great script/tool!
Chris, I was hoping that you would chime in here with your more complete solution. Your script goes to the next level of downloading everything (not just the web page source) to a web archive file/folder.
I would guess a tool like yours would normally be run from the Apple Scripts menu, FastScripts, or KM. Any ideas why it chokes on SD?
Shane, thanks for taking another look at my script/objective.
But I’m wondering, rather than either the user or the script “guess” at the encoding, why not just stick with the XML process?
Seems like either way we have the same number of steps, and the XML method gives us more options:
###XML Method
set {theXMLDoc, theError} to current application's NSXMLDocument's alloc()'s initWithData:nsPageHTML options:(current application's NSXMLDocumentTidyHTML) |error|:(reference)
if theXMLDoc = missing value then error (theError's localizedDescription() as text)
set htmlPPStr to (theXMLDoc's XMLStringWithOptions:(current application's NSXMLNodePrettyPrint)) as text
###Guess Encoding Method
set encodingOptions to current application's NSDictionary's dictionaryWithObject:false forKey:(current application's NSStringEncodingDetectionAllowLossyKey)
set {theEncoding, theString} to current application's NSString's stringEncodingForData:(nsPageHTML) encodingOptions:encodingOptions convertedString:(reference) usedLossyConversion:(missing value)
if theEncoding = 0 then error "Unknown encoding"
The good news is we have more options on how to achieve the same objective.
In addition to my stated objective, I also wanted to show the various methods available. My script is more of an example than a complete script.
It’s a bit less efficient. It also depends what you’re after – the XML document method may correct some of the HTML, for example, or remove whitespace, whereas the other method returns a 100% faithful rendition.
It’s because of how part of it requires the main thread. SD can use performSelectorOnMainThread:::, but there comes a point where that’s not enough.
Thanks, I’ve got a lot to look at now. I can say it’s already helpful. Most of these solutions seem to be designed to allow you to go to any web page and download. I have very specific web pages that never change their format but only change parts of their content. I may need to simply get links or look for key phrases or specific strings. So for my purposes some of these are overkill, but now I have several alternatives. I’m mostly hoping for the speed and reliability you don’t get from getting text out of a browser. (I had asked about this a few years ago and got a few suggestions, but none were as reliable as opening the page in Safari and parsing the source text.)
I have one script in particular that sends Safari to over 1000 web pages to get about 20 specific strings from each page to build an internal table. It’s looking at team pages for the largest AYSO soccer region in the country, collecting a variety of information for further processing. The first run gets all the basic information that doesn’t change during the season, then it runs on a weekly basis, updating scores, rosters, etc. that may change.
My first look tells me this method will turn that multi-hour weekly process into one that will take a few minutes.
So I’ll be starting with that script. The season teams are formed in July and the season starts in September, and I can practice with last year’s data, so I’m in good shape. (I may use this to clone last year’s data so I have a local copy I can test against, if it’s not ready to go by the end of June.)
That sounds like a perfect opportunity to use XML document parsing, and probably XPath searching.
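As a rough illustration of XPath searching on an NSXMLDocument, here is a minimal sketch that pulls the link targets out of a page. The sample HTML and the XPath expression are assumptions for illustration; a real page would be loaded with initWithData: as in the earlier scripts, and the XPath would need to match that page's structure.

```applescript
use AppleScript version "2.4" -- Yosemite (10.10) or later
use framework "Foundation"
use scripting additions

-- Sample HTML standing in for a downloaded page (hypothetical content)
set sampleHTML to "<html><body><a href=\"https://example.com/a\">Team A</a><a href=\"https://example.com/b\">Team B</a></body></html>"

-- Build the XML document, letting NSXMLDocument tidy the HTML
set {theXMLDoc, theError} to current application's NSXMLDocument's alloc()'s initWithXMLString:sampleHTML options:(current application's NSXMLDocumentTidyHTML) |error|:(reference)
if theXMLDoc = missing value then error (theError's localizedDescription() as text)

-- Find the href attribute of every link in the document
set {theNodes, theError} to theXMLDoc's nodesForXPath:"//a/@href" |error|:(reference)
if theNodes = missing value then error (theError's localizedDescription() as text)

-- Convert the matching nodes to an AppleScript list of strings
set linkList to (theNodes's valueForKey:"stringValue") as list
```

For pages whose layout never changes, a more specific path (for example, one that targets a particular table or element id) should cut down on post-processing in AppleScript.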
Hey Shane,
Do you know if this method will follow redirects?
-Chris
Hey Shane,
Also — do you know how it identifies itself to the server when it makes the request?
As WebKit?
-Chris
I’m guessing so, but I don’t see any documentation either way. So you tell us
Sorry, no idea.
You can use a slightly more complex method, and it allows you to set the user agent. Here’s an example, with some other stuff thrown in:
use AppleScript version "2.4" -- Yosemite (10.10) or later
use framework "Foundation"
use scripting additions
set URLString to "http://forum.latenightsw.com/t/welcome-to-the-late-night-software-support-forum/8"
set theURL to current application's |NSURL|'s URLWithString:URLString
-- make a URL request and set custom header
set theRequest to current application's NSMutableURLRequest's requestWithURL:theURL
theRequest's setValue:"Mozilla/5.0 (iPhone; U; CPU iPhone OS 3_0 like Mac OS X; en-us) AppleWebKit/528.18 (KHTML, like Gecko) Version/4.0 Mobile/7A341" forHTTPHeaderField:"User-Agent"
-- send request and wait for reply
set {theData, theResponse, theError} to current application's NSURLConnection's sendSynchronousRequest:theRequest returningResponse:(reference) |error|:(reference)
-- if missing value, there was a problem
if theData = missing value then error (theError's localizedDescription() as text)
set theCode to theResponse's statusCode()
if theCode is not 200 then error (current application's NSHTTPURLResponse's localizedStringForStatusCode:theCode) as text
-- get mime type
set theMimeType to theResponse's MIMEType()
if theMimeType's hasPrefix:"text" then
-- it's a string
set theEnc to theResponse's textEncodingName() -- IANA string, so no easy way to convert to usable value
set encodingOptions to current application's NSDictionary's dictionaryWithObject:false forKey:(current application's NSStringEncodingDetectionAllowLossyKey)
set {theEncoding, theString} to current application's NSString's stringEncodingForData:(theData) encodingOptions:encodingOptions convertedString:(reference) usedLossyConversion:(missing value)
if theEncoding = 0 then error "Unknown encoding"
return theString
end if
Hey @ShaneStanley,
Do you know off the top of your head if this code works on recent versions of macOS?
I seem to remember someone saying the web frameworks had changed and caused problems here and there.
Presently I can only test up to Mojave…
TIA.
-Chris
It seems to be OK here under 13.1 if you change the URL to https: (it won’t do simple http:).
Cool. Many thanks for checking!