Need Help Getting Web Page via ASObjC

foundation
asobjc

(Jim Underwood) #1

As an alternative to using curl in a shell script, I’m trying to write an ASObjC script based on @ShaneStanley’s ASUL script.

The objective is to get the web page HTML without opening a browser.

It seems to work OK down to the last step: Convert NSXMLDocument to normal AppleScript text. Here I get an error. How can I do this?

If there is a better solution, I’m open to all suggestions.
If you see any issues with my script, please advise.

TIA for all help.

### ASObjC Script

(*
  PURPOSE: Get Web Page HTML using ASObjC
              (as an alternate to curl)
              
  REF:  Script posted by @ShaneStanley to ASUL, 2017-03-31
        https://lists.apple.com/archives/applescript-users/2017/Mar/msg00421.html 
*)
use AppleScript version "2.4" -- Yosemite (10.10) or later
use framework "Foundation"
use scripting additions

--- SET URL OF WEB PAGE ---

set pageURLStr to "https://forum.keyboardmaestro.com/"
set pageURLnsStr to current application's NSString's stringWithString:pageURLStr

set nsPageURL to current application's NSURL's URLWithString:pageURLnsStr

--- GET WEB PAGE HTML ---

set {nsPageHTML, theError} to current application's NSData's dataWithContentsOfURL:nsPageURL options:0 |error|:(reference)
if nsPageHTML = missing value then error (theError's localizedDescription() as text)

-- convert to XML

set {theXMLDoc, theError} to current application's NSXMLDocument's alloc()'s initWithData:nsPageHTML options:(current application's NSXMLDocumentTidyHTML) |error|:(reference)
if theXMLDoc = missing value then error (theError's localizedDescription() as text)

--- CONVERT TO NORMAL TEXT ---

## FAILS with Can’t make «class ocid» id «data optr00000000508CFD5EF17F0000» into type text.
set pageHTMLStr to theXMLDoc as text ##FAILS


(Shane Stanley) #2

Use either:

set pageHTMLStr to theXMLDoc's XMLString() as text

or the variant where you can set various options, for example:

set pageHTMLStr to (theXMLDoc's XMLStringWithOptions:(current application's NSXMLDocumentTidyHTML)) as text

(Jim Underwood) #3

Thanks Shane. That solves it, and is very helpful.

If I’m using the two methods correctly, it looks like they both return the same result.
Do these return the HTML as formatted in the original web page?

However, I found this option which is very nice: NSXMLNodePrettyPrint

set htmlPPStr to (theXMLDoc's XMLStringWithOptions:(current application's NSXMLNodePrettyPrint)) as text

which produces a very readable HTML output.

Any other options you think I should look at?


(Shane Stanley) #4

Look at them all – then try what you think suits. It’s not like there’s one true format.


(Ed Stockly) #5

Following this very carefully…


(Jim Underwood) #6

Here’s my script updated using the solution provided by @ShaneStanley:

### Final Script (as an example)

(*
  PURPOSE: Get Web Page HTML using ASObjC
              (as an alternate to curl)
              
  REF:  Script posted by @ShaneStanley to ASUL, 2017-03-31
        https://lists.apple.com/archives/applescript-users/2017/Mar/msg00421.html 
*)
use AppleScript version "2.4" -- Yosemite (10.10) or later
use framework "Foundation"
use scripting additions

--- SET URL OF WEB PAGE ---

set pageURLStr to "http://forum.latenightsw.com/t/welcome-to-the-late-night-software-support-forum/8"
set pageURLnsStr to current application's NSString's stringWithString:pageURLStr

set nsPageURL to current application's NSURL's URLWithString:pageURLnsStr

--- GET WEB PAGE HTML ---

set {nsPageHTML, theError} to current application's NSData's dataWithContentsOfURL:nsPageURL options:0 |error|:(reference)
if nsPageHTML = missing value then error (theError's localizedDescription() as text)

-- convert to XML

set {theXMLDoc, theError} to current application's NSXMLDocument's alloc()'s initWithData:nsPageHTML options:(current application's NSXMLDocumentTidyHTML) |error|:(reference)
if theXMLDoc = missing value then error (theError's localizedDescription() as text)

--- SEARCHING & EXTRACTING INFO FROM WEB PAGE ---
--    • As an alternate to JavaScript in the Browser, use ASObjC XML methods
--    • For an example, see https://lists.apple.com/archives/applescript-users/2017/Mar/msg00421.html 

---------------------------------------------------
-- CONVERT TO NORMAL TEXT
--  • There are several options to choose from
--     SEE: Writing XML From NSXML Objects
--          https://developer.apple.com/library/content/documentation/Cocoa/Conceptual/NSXML_Concepts/Articles/WritingXML.html
--------------------------------------------------

-- 1. SIMPLE
set htmlSimpleStr to theXMLDoc's XMLString() as text

-- 2. TIDY
set htmlTidyStr to (theXMLDoc's XMLStringWithOptions:(current application's NSXMLDocumentTidyHTML)) as text

-- (I don't see any difference between #1 and #2)

-- 3. PRETTY PRINT (produces a very readable XML/HTML output)
set htmlPPStr to (theXMLDoc's XMLStringWithOptions:(current application's NSXMLNodePrettyPrint)) as text

set the clipboard to htmlPPStr

OK, well I’m done. I hope you find this helpful.
Please feel free to add any enhancements.


(Shane Stanley) #7

I realise I was looking too much at your code, and not enough at your objective. In fact, making an XML document is the long way around – the actual data you have is what you want, converted to a string.

You can convert data to a string using NSString’s -initWithData:encoding: method (-dataUsingEncoding: goes the other way, from string to data), but it’s a bit tricky here because you can’t be sure of the encoding. You can try UTF-8, and drop back to something else if it fails, but that’s a bit messy.
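As a rough illustration of the "try UTF-8, then drop back" approach, here is a minimal sketch. The handler name `textFromData:` and the choice of ISO Latin-1 as the fallback are assumptions made for this example, not something from the thread, and the sample data merely stands in for downloaded page data.

```applescript
use AppleScript version "2.4" -- Yosemite (10.10) or later
use framework "Foundation"
use scripting additions

-- Hypothetical helper: decode NSData as UTF-8, falling back to ISO Latin-1 (an assumed fallback)
on textFromData:theData
	set theString to current application's NSString's alloc()'s initWithData:theData encoding:(current application's NSUTF8StringEncoding)
	if theString = missing value then
		-- Not valid UTF-8, so try the fallback encoding
		set theString to current application's NSString's alloc()'s initWithData:theData encoding:(current application's NSISOLatin1StringEncoding)
	end if
	if theString = missing value then error "Could not decode the data"
	return theString as text
end textFromData:

-- Sample data standing in for the downloaded page data
set sampleData to (current application's NSString's stringWithString:"<p>café</p>")'s dataUsingEncoding:(current application's NSUTF8StringEncoding)
set pageText to my textFromData:sampleData
```

The messy part is that a wrong fallback can still "succeed" and quietly produce the wrong characters, so the fallback has to suit the pages being fetched.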

But as of macOS 10.10 there’s a method that will guess the encoding. It’s a bit confusing – it looks like a way of finding the encoding, but in fact does the conversion at the same time – but it will do what you want.

So something like this:

use AppleScript version "2.4" -- Yosemite (10.10) or later
use framework "Foundation"
use scripting additions

set pageURLStr to "http://forum.latenightsw.com/t/welcome-to-the-late-night-software-support-forum/8"
set nsPageURL to current application's |NSURL|'s URLWithString:pageURLStr

set {nsPageHTML, theError} to current application's NSData's dataWithContentsOfURL:nsPageURL options:0 |error|:(reference)
if nsPageHTML = missing value then error (theError's localizedDescription() as text)

set encodingOptions to current application's NSDictionary's dictionaryWithObject:false forKey:(current application's NSStringEncodingDetectionAllowLossyKey)
set {theEncoding, theString} to current application's NSString's stringEncodingForData:(nsPageHTML) encodingOptions:encodingOptions convertedString:(reference) usedLossyConversion:(missing value)
if theEncoding = 0 then error "Unknown encoding"
set the clipboard to theString as text

This raises the question of what you want to do with the source. If you want to parse it, then it’s probably better to go back to the XML document method, and use its tools to do the parsing.


(Christopher Stone) #8

Hey Folks,

I’ve been working on this on and off for months trying to get it to work with Keyboard Maestro. KM was choking on the AppleScriptObjC somewhere, but for whatever reason it’s working now.

Keyboard Maestro Macro — Create Web Archives (Download Web Page) from a List of URLs

The version below runs as an applet (provided in the zip file), and it will run equally well from FastScripts or Keyboard Maestro.

Script Debugger chokes on it though…

Many thanks to Shane for examples he provided publicly and for the direct help he gave me with this project.

-Chris


WebArchive Downloader.zip (66.3 KB)

------------------------------------------------------------------------------
# Auth: Christopher Stone { Heavy Lifting by Shane Stanley }
# dCre: 2017/02/26 18:30 CST
# dMod: 2017/04/24 19:32 CDT
# Appl: AppleScriptObjC
# Task: Create WebArchives for a list of remote URLs (applet version).
# Libs: None
# Osax: None
# Tags: @Applescript, @Script, @ASObjC, @Create, @Webarchives, @List, @URLs, @EXIF, @Tags
------------------------------------------------------------------------------
use AppleScript version "2.4"
use framework "Foundation"
use framework "WebKit"
use scripting additions
------------------------------------------------------------------------------
property theSender : missing value
property thePath : missing value
property loadDone : true
------------------------------------------------------------------------------

--» User Setting
set destinationFolderPath to POSIX path of ((path to downloads folder as text) & "WebArchive Downloads:")

------------------------------------------------------------------------------

its createDirectoryAtPath:destinationFolderPath
set urlList to getUrlList() -- see handler for url list.

repeat with remoteURL in urlList
   set fileName to (its cngStr:"/$" intoString:"" inString:remoteURL)
   set fileName to (its cngStr:"^.+/" intoString:"" inString:fileName)
   set fileName to (its cngStr:"(\\.\\w+)?$" intoString:".webarchive" inString:fileName)
   
   set pageDestPath to destinationFolderPath & fileName
   (its archivePage:remoteURL toPath:pageDestPath sender:me)
   repeat
      if loadDone then
         exit repeat
      else
         delay 0.25
      end if
   end repeat
end repeat

------------------------------------------------------------------------------
--» HANDLERS
------------------------------------------------------------------------------
on archivePage:thePageURL toPath:aPath sender:mySender
   set my loadDone to false
   set my theSender to mySender # Store main script so we can call back
   set my thePath to aPath # Store path for use later
   my performSelectorOnMainThread:"loadURL:" withObject:thePageURL waitUntilDone:false
end archivePage:toPath:sender:
------------------------------------------------------------------------------
on cngStr:findString intoString:replaceString inString:dataString
   set anNSString to current application's NSString's stringWithString:dataString
   set dataString to (anNSString's ¬
      stringByReplacingOccurrencesOfString:findString withString:replaceString ¬
         options:(current application's NSRegularExpressionSearch) range:{0, anNSString's |length|()}) as text
end cngStr:intoString:inString:
------------------------------------------------------------------------------
on createDirectoryAtPath:thePath
   set {theResult, theError} to current application's NSFileManager's defaultManager()'s createDirectoryAtPath:thePath withIntermediateDirectories:true attributes:(missing value) |error|:(reference)
   if not (theResult as boolean) then
      set errorMsg to theError's localizedDescription() as text
      error errorMsg
   end if
end createDirectoryAtPath:
------------------------------------------------------------------------------
on getKMVar(varName)
   tell application "Keyboard Maestro Engine"
      return getvariable varName
   end tell
end getKMVar
------------------------------------------------------------------------------
# Called when the job's done
on jobDone:theMessage
   display notification theMessage
end jobDone:
------------------------------------------------------------------------------
on loadURL:thePageURL
   # Stuff to be done on main thread
   # Make a WebView
   set theView to current application's WebView's alloc()'s initWithFrame:{origin:{x:0, y:0}, |size|:{width:100, height:100}}
   # Tell it call delegate methods on me
   theView's setFrameLoadDelegate:me
   # Load the page
   theView's setMainFrameURL:thePageURL
end loadURL:
------------------------------------------------------------------------------
# Called when our WebView loads a frame
on WebView:aWebView didFinishLoadForFrame:webFrame
   # The main frame is our interest
   if webFrame = aWebView's mainFrame() then
      # Get the text of the page
      set theText to (webFrame's DOMDocument()'s documentElement()'s outerText())
      # Search it
      # Get the data and write it to file
      set theArchiveData to webFrame's dataSource()'s webArchive()'s |data|()
      set x to theArchiveData's writeToFile:thePath atomically:true
      # Tell our script it's all done
      set my loadDone to true
      theSender's jobDone:"The webarchive was saved"
   end if
end WebView:didFinishLoadForFrame:
------------------------------------------------------------------------------
# Called if there's a problem
on WebView:WebView didFailLoadWithError:theError forFrame:webFrame
   # Got an error, bail
   WebView's stopLoading:me
   set my loadDone to true
   theSender's jobDone:"The webarchive was not saved"
end WebView:didFailLoadWithError:forFrame:
------------------------------------------------------------------------------
on getUrlList()
   set urlList to paragraphs 2 thru -2 of "
http://www.sno.phy.queensu.ca/~phil/exiftool/TagNames/JPEG.html
http://www.sno.phy.queensu.ca/~phil/exiftool/TagNames/EXIF.html
http://www.sno.phy.queensu.ca/~phil/exiftool/TagNames/IPTC.html
"
end getUrlList
------------------------------------------------------------------------------

(Jim Underwood) #9

Great script/tool! 👍

Chris, I was hoping that you would chime in here with your more complete solution. Your script goes to the next level of downloading everything (not just the web page source) to a web archive file/folder.

I would guess a tool like yours would normally be run from the Apple Scripts menu, FastScripts, or KM. Any ideas why it chokes on SD?


(Jim Underwood) #10

Shane, thanks for taking another look at my script/objective.

But I’m wondering, rather than having either the user or the script “guess” at the encoding, why not just stick with the XML process?

Seems like either way we have the same number of steps, and the XML method gives us more options:

### XML Method

set {theXMLDoc, theError} to current application's NSXMLDocument's alloc()'s initWithData:nsPageHTML options:(current application's NSXMLDocumentTidyHTML) |error|:(reference)
if theXMLDoc = missing value then error (theError's localizedDescription() as text)

set htmlPPStr to (theXMLDoc's XMLStringWithOptions:(current application's NSXMLNodePrettyPrint)) as text

### Guess Encoding Method

set encodingOptions to current application's NSDictionary's dictionaryWithObject:false forKey:(current application's NSStringEncodingDetectionAllowLossyKey)
set {theEncoding, theString} to current application's NSString's stringEncodingForData:(nsPageHTML) encodingOptions:encodingOptions convertedString:(reference) usedLossyConversion:(missing value)
if theEncoding = 0 then error "Unknown encoding"

The good news is we have more options on how to achieve the same objective.

In addition to my stated objective, I also wanted to show the various methods available. My script is more of an example than a complete script.


(Shane Stanley) #11

The XML method is a bit less efficient. It also depends what you’re after – the XML document method may correct some of the HTML, for example, or remove whitespace, whereas the other method returns a 100% faithful rendition.
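To see the difference on a small scale, here is a sketch that runs both routes on the same data. The sample HTML is made up for this example, and because it is known to be UTF-8 the direct route uses a plain UTF-8 decode instead of the encoding-guessing method.

```applescript
use AppleScript version "2.4" -- Yosemite (10.10) or later
use framework "Foundation"
use scripting additions

-- Made-up sample with an unclosed <p> and irregular spacing, standing in for downloaded data
set sampleHTML to "<html><body><p>Hello   world" & linefeed & "</body></html>"
set theData to (current application's NSString's stringWithString:sampleHTML)'s dataUsingEncoding:(current application's NSUTF8StringEncoding)

-- Direct decode: the sample is known to be UTF-8, so no guessing is needed here
set faithfulStr to (current application's NSString's alloc()'s initWithData:theData encoding:(current application's NSUTF8StringEncoding)) as text

-- XML route: tidy the HTML on the way in, then serialize the document
set {theXMLDoc, theError} to current application's NSXMLDocument's alloc()'s initWithData:theData options:(current application's NSXMLDocumentTidyHTML) |error|:(reference)
if theXMLDoc = missing value then error (theError's localizedDescription() as text)
set tidiedStr to theXMLDoc's XMLString() as text

-- Compare each result with the original source, character for character
considering case
	set directIsIdentical to (faithfulStr = sampleHTML)
	set tidiedIsIdentical to (tidiedStr = sampleHTML)
end considering
```

The direct decode should match the source exactly, while the tidied string would normally differ, since the parser repairs and re-serializes the markup.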


(Shane Stanley) #12

Script Debugger chokes on it because part of it requires the main thread. SD can use performSelectorOnMainThread:::, but there comes a point where that’s not enough.
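A script can at least detect which situation it is in before attempting main-thread-only work. This is a minimal sketch using NSThread's isMainThread; the variable names and messages are illustrative only, and what to do in the background-thread case is left open.

```applescript
use AppleScript version "2.4" -- Yosemite (10.10) or later
use framework "Foundation"
use scripting additions

-- Ask Foundation whether this code is currently running on the main thread
set onMainThread to (current application's NSThread's isMainThread()) as boolean
if onMainThread then
	set threadNote to "Running on the main thread"
else
	set threadNote to "Running on a background thread; main-thread-only code may misbehave"
end if
```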


(Ed Stockly) #13

Thanks, I’ve got a lot to look at now. I can say it’s already helpful. Most of these solutions seem to be designed to allow you to go to any web page and download. I have very specific web pages that never change their format but only change parts of their content. I may need to simply get links or look for key phrases or specific strings. So for my purposes some of these are overkill, but now I have several alternatives. I’m mostly hoping for the speed and reliability you don’t get from getting text out of a browser. (I had asked about this a few years ago and got a few suggestions, but none were as reliable as opening the page in Safari and parsing the source text.)

I have one script in particular that sends Safari to over 1000 web pages to get about 20 specific strings from each page to build an internal table. It’s looking at team pages for the largest AYSO soccer region in the country, collecting a variety of information for further processing. The first run gets all the basic information that doesn’t change during the season, then it runs on a weekly basis, updating scores, rosters, etc. that may change.

My first look tells me this method will turn that multi-hour weekly process into one that will take a few minutes.

So I’ll be starting with that script. The season teams are formed in July and the season starts in September, and I can practice with last year’s data so I’m in good shape. (I may use this to clone last year’s data so I have a local copy I can test against, if it’s not ready to go by the end of June.)


(Shane Stanley) #14

That sounds like a perfect opportunity to use XML document parsing, and probably XPath searching.
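A minimal sketch of what that could look like, assuming the page data has already been downloaded. The sample HTML, the class name `score`, and the XPath expressions are made up for this example; real pages would need expressions matched to their own markup.

```applescript
use AppleScript version "2.4" -- Yosemite (10.10) or later
use framework "Foundation"
use scripting additions

-- Made-up sample HTML standing in for a downloaded team page
set sampleHTML to "<html><body><p class=\"score\">3-1</p><a href=\"/team/1\">Team 1</a><a href=\"/team/2\">Team 2</a></body></html>"
set theData to (current application's NSString's stringWithString:sampleHTML)'s dataUsingEncoding:(current application's NSUTF8StringEncoding)

-- Parse the HTML into an XML document, tidying it on the way in
set {theXMLDoc, theError} to current application's NSXMLDocument's alloc()'s initWithData:theData options:(current application's NSXMLDocumentTidyHTML) |error|:(reference)
if theXMLDoc = missing value then error (theError's localizedDescription() as text)

-- Collect the href attribute of every <a> element
set {theNodes, theError} to theXMLDoc's nodesForXPath:"//a/@href" |error|:(reference)
if theNodes = missing value then error (theError's localizedDescription() as text)
set linkList to (theNodes's valueForKey:"stringValue") as list

-- Collect the text of every <p> element with the (hypothetical) class "score"
set {theNodes, theError} to theXMLDoc's nodesForXPath:"//p[@class='score']" |error|:(reference)
if theNodes = missing value then error (theError's localizedDescription() as text)
set scoreList to (theNodes's valueForKey:"stringValue") as list
```

For pages whose layout never changes, a handful of fixed XPath expressions like these may be all that is needed, with no browser involved.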


(Christopher Stone) #15

Hey Shane,

Do you know if this method will follow redirects?

-Chris


(Christopher Stone) #16

Hey Shane,

Also — do you know how it identifies itself to the server when it makes the request?

As WebKit?

-Chris


(Shane Stanley) #17

On redirects: I’m guessing so, but I don’t see any documentation either way. So you tell us 🙂

On how it identifies itself: sorry, no idea.


(Shane Stanley) #18

You can use a slightly more complex method, and it allows you to set the user agent. Here’s an example, with some other stuff thrown in:

use AppleScript version "2.4" -- Yosemite (10.10) or later
use framework "Foundation"
use scripting additions

set URLString to "http://forum.latenightsw.com/t/welcome-to-the-late-night-software-support-forum/8"
set theURL to current application's |NSURL|'s URLWithString:URLString
-- make a URL request and set custom header
set theRequest to current application's NSMutableURLRequest's requestWithURL:theURL
theRequest's setValue:"Mozilla/5.0 (iPhone; U; CPU iPhone OS 3_0 like Mac OS X; en-us) AppleWebKit/528.18 (KHTML, like Gecko) Version/4.0 Mobile/7A341" forHTTPHeaderField:"User-Agent"
-- send request and wait for reply
set {theData, theResponse, theError} to current application's NSURLConnection's sendSynchronousRequest:theRequest returningResponse:(reference) |error|:(reference)
-- if missing value, there was a problem
if theData = missing value then error (theError's localizedDescription() as text)
set theCode to theResponse's statusCode()
if theCode is not 200 then error (current application's NSHTTPURLResponse's localizedStringForStatusCode:theCode) as text
-- get mime type
set theMimeType to theResponse's MIMEType()
if theMimeType's hasPrefix:"text" then
	-- it's a string
	set theEnc to theResponse's textEncodingName() -- IANA string, so no easy way to convert to usable value
	set encodingOptions to current application's NSDictionary's dictionaryWithObject:false forKey:(current application's NSStringEncodingDetectionAllowLossyKey)
	set {theEncoding, theString} to current application's NSString's stringEncodingForData:(theData) encodingOptions:encodingOptions convertedString:(reference) usedLossyConversion:(missing value)
	if theEncoding = 0 then error "Unknown encoding"
	return theString
end if