Downloading text from a website using ASObj-C

I wrote a script to test downloading text from a website, and I thought some people might be interested in seeing it. Given a URL, the script downloads whatever the URL points to. However, I only implemented enough code to decode downloaded text; for my purposes I only needed to verify that the download worked. But if you want to see the text of a web page, this can do it. The CNN.com page was 155,648 bytes when I tried it last night, so if you run the script in Script Debugger, most of the text will be truncated in the debugger display, but the full text will be in the file.

I added .html to the file name so that double-clicking the downloaded file opens it in the web browser. For the CNN example, the text shows up in the browser but not the images.

I added some comments and also an example of how to get the MIME type, the text encoding and the URL.

use AppleScript version "2.4"
use scripting additions
use framework "Foundation"

set TheURLStr to "https://www.cnn.com"
set TheURL to (current application's |NSURL|'s URLWithString:TheURLStr)

-- Creates and returns an initialized URL request with the specified value.
-- NSURLRequestReloadIgnoringLocalCacheData causes the data for the URL to be loaded from the originating source. No existing cache data is used.
set TheRequest to current application's NSURLRequest's requestWithURL:TheURL cachePolicy:(current application's NSURLRequestReloadIgnoringLocalCacheData) timeoutInterval:10

-- sendSynchronousRequest:returningResponse:error: performs a synchronous download of the specified URL request.
-- It returns "missing value" if a connection could not be created or an error occurs.
-- Item 1 of the returned list contains the requested data, and item 2 contains the status of the request and other information about the downloaded data
set TheResult to current application's NSURLConnection's sendSynchronousRequest:TheRequest returningResponse:(reference) |error|:(missing value)

-- Returns the MIME type of the downloaded data
set MIMEType to MIMEType of item 2 of TheResult --> (NSString) "text/html"

-- Returns the data encoding of the returned data
set TextEncoding to textEncodingName of item 2 of TheResult --> (NSString) "utf-8"

-- Returns the URL the data was downloaded from
set URLValue to |URL| of item 2 of TheResult --> (NSURL) https://www.cnn.com/

set TheDownloadedData to item 1 of TheResult --> too big to show actual NSData in script
set HTTPURLResponse to item 2 of TheResult --> 
(* <NSHTTPURLResponse: 0x608000837da0> { URL: https://www.cnn.com/ } { status code: 200, headers {
    "Accept-Ranges" = bytes;
    "Access-Control-Allow-Origin" = "*";
    Age = 69;
    "Cache-Control" = "max-age=60";
    Connection = "keep-alive";
    "Content-Encoding" = gzip;
    "Content-Length" = 33977;
    "Content-Type" = "text/html; charset=utf-8";
    Date = "Mon, 15 Jan 2018 21:10:41 GMT";
    "Fastly-Debug-Digest" = 46be59e687681f2cbdc5286ab50024ed035dc360065b1aec7ce355bf418daeb9;
    "Set-Cookie" = "countryCode=US; Domain=.cnn.com; Path=/, geoData=surfside|CA|90743|US|NA; Domain=.cnn.com; Path=/";
    Vary = "Accept-Encoding, Fastly-SSL, Fastly-SSL";
    Via = "1.1 varnish, 1.1 varnish";
    "X-Cache" = "HIT, HIT";
    "X-Cache-Hits" = "2, 1";
    "X-Served-By" = "cache-iad2143-IAD, cache-lax8634-LAX";
    "X-Timer" = "S1516050641.061185,VS0,VE1";
    "content-security-policy" = "default-src 'self' blob: https://*.cnn.com:* http://*.cnn.com:* *.cnn.io:* *.cnn.net:* *.turner.com:* *.turner.io:* *.ugdturner.com:* courageousstudio.com *.vgtf.net:*; script-src 'unsafe-eval' 'unsafe-inline' 'self' *; style-src 'unsafe-inline' 'self' blob: *; child-src 'self' blob: *; frame-src 'self' *; object-src 'self' *; img-src 'self' data: blob: *; media-src 'self' data: blob: *; font-src 'self' data: *; connect-src 'self' *; frame-ancestors 'self' *.cnn.com:* *.turner.com:* courageousstudio.com;";
    "x-content-type-options" = nosniff;
    "x-servedByHost" = "::ffff:172.17.71.21";
    "x-xss-protection" = "1; mode=block";
} }
*)


-- NSString can be used to decode UTF-8 encoded NSData into text
set TheText to current application's NSString's alloc()'s initWithData:(TheDownloadedData) encoding:(current application's NSUTF8StringEncoding) --> too big to show the actual text in script


-- Use NSString to write the text to the disk
set FileName to current application's NSString's stringWithString:"downloaded text.html"
set HomeDirPath to current application's NSHomeDirectory() -- Get the path to the user's home folder

-- Get the path to write the text to
set DesktopPath to HomeDirPath's stringByAppendingPathComponent:"Desktop" -- Get the path to the Desktop folder
set ThePath to DesktopPath's stringByAppendingPathComponent:FileName -- Get the path to the file

-- If Successful = true the write did not get any errors, otherwise an error occurred
set Successful to TheText's writeToFile:ThePath atomically:no encoding:(current application's NSUTF8StringEncoding) |error|:(missing value)

Bill

Thanks for sharing that, Bill. One minor thing: you could also just save the data directly to disk, without bothering with the conversion to a string — the file would be the same.

Wouldn’t that mean that when I read the data back from the file I would still have to decode the NSData before I used it? My thinking was text in a file is easier to work with.

Bill

No. At the moment you’re getting the raw data, which is UTF-8 data, converting it to a string, then telling NSString to write it as UTF-8. The way NSString does that is by converting the string back to raw data, and writing that. You only need to go via a string if you want to change encodings.
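A minimal sketch of that point, not part of the original script: it writes the same bytes both ways and compares the resulting files. To keep it self-contained, the "downloaded" data is a stand-in built from a sample string, and the file names in the temporary folder are hypothetical.

```applescript
use AppleScript version "2.4"
use scripting additions
use framework "Foundation"

-- Stand-in for the downloaded data: UTF-8 bytes made from a sample string (assumption for illustration)
set SampleText to current application's NSString's stringWithString:"<html><body>café</body></html>"
set TheDownloadedData to SampleText's dataUsingEncoding:(current application's NSUTF8StringEncoding)

-- Hypothetical file names in the temporary folder
set TempDir to current application's NSString's stringWithString:(current application's NSTemporaryDirectory())
set DataPath to TempDir's stringByAppendingPathComponent:"direct.html"
set TextPath to TempDir's stringByAppendingPathComponent:"via string.html"

-- Route 1: write the raw data straight to disk
set WroteData to TheDownloadedData's writeToFile:DataPath atomically:true

-- Route 2: decode to a string, then have NSString write it back out as UTF-8
set TheText to current application's NSString's alloc()'s initWithData:TheDownloadedData encoding:(current application's NSUTF8StringEncoding)
set WroteText to TheText's writeToFile:TextPath atomically:true encoding:(current application's NSUTF8StringEncoding) |error|:(missing value)

-- Read both files back and compare them byte for byte
set DataFromDirect to current application's NSData's dataWithContentsOfFile:DataPath
set DataFromText to current application's NSData's dataWithContentsOfFile:TextPath
set FilesMatch to (DataFromDirect's isEqualToData:DataFromText) as boolean
```

In the download script, route 1 would be applied to item 1 of TheResult. Going via a string only matters when the file should end up in a different encoding than the one that was downloaded.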

Thanks, I didn’t know that.

Bill

Bill, thanks for sharing.

@BillKopp or @ShaneStanley: What are the differences, or the advantages, of your script vs. this one, which I’m pretty sure I got from a post by Shane:

on getUrlSource:urlStr
  set theURL to current application's class "NSURL"'s URLWithString:urlStr
  set theData to current application's NSData's dataWithContentsOfURL:theURL
  set theString to current application's NSString's alloc()'s initWithData:theData encoding:(current application's NSUTF8StringEncoding)
  set theString to theString as text
  return theString
end getUrlSource:

Are there use cases where one script would be preferred over the other?

One of the differences is that I added other things, like how to find the MIME type, text encoding, etc. I also added code to always save to the desktop of whichever user runs the script. Setting aside my comments and the other helpful things I showed, I also added a line to convert the NSData to text before saving it to disk, which Shane later told me didn’t make any difference, since the text is still written out as the same raw data, so that was pointless.

Also, Shane used NSData’s dataWithContentsOfURL:, which replaces both NSURLRequest’s requestWithURL:cachePolicy:timeoutInterval: and NSURLConnection’s sendSynchronousRequest:returningResponse:error: with one line.

I think Shane’s is smaller, more concise and easier to follow. But I broke all the steps down into the smallest pieces I can make them. This might help someone figure out how to do something similar but where NSData’s dataWithContentsOfURL would not work.
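For comparison, here is a hedged sketch of the one-call route that still reports why a load failed, using NSData's dataWithContentsOfURL:options:error: with a reference for the error. It is not from either script above. The file URL deliberately points at a path assumed not to exist, so the failure branch can be seen without a network connection; an https URL would be passed the same way.

```applescript
use AppleScript version "2.4"
use scripting additions
use framework "Foundation"

-- Hypothetical URL, assumed not to exist, so the load fails on purpose
set theURL to current application's NSURL's fileURLWithPath:"/no/such/file.html"

-- With (reference) for the error, the call returns a list: {data or missing value, error or missing value}
set {theData, theError} to current application's NSData's dataWithContentsOfURL:theURL options:0 |error|:(reference)

if theData is missing value then
    -- The load failed; the NSError describes why
    set theMessage to (theError's localizedDescription()) as text
else
    set theMessage to "Read " & ((theData's |length|()) as integer) & " bytes"
end if
```

This gives an error message but still no access to the response, so things like the status code or MIME type need the longer request-based version.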

Also, when I am scripting I prefer to break things down into more steps so I can tell what is happening in the script in more granular detail. It’s harder to find the error when multiple things are combined in one step. It is also harder to update a script later when many things are combined together: an update can force the scripter to break one powerful command, which took the place of many lines of code, back into separate steps in order to get at different aspects of the same data or to do something the original script did not.

In Shane’s case he knows this stuff really, really, really well and combining functions into a single command is not as much of a problem as it is for me. Since I am less experienced I use more lines with commands that do less so I can tell better what the script is doing when it runs.

Basically, if I have a script that doesn’t work, it’s easier to fix one with 15 lines than one with 4 lines in it.

I assume Shane has additional reasons why his way is better. I’m just too new at this to say what they are.

Bill

They’re both doing much the same under the hood. But Bill’s is something you can adapt when what you’re downloading is not text, or when you need to check something else in the response.
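As an illustration of checking something else in the response, here is a small sketch that tests the HTTP status code. The handler name statusIsOK is hypothetical, and the two responses are built by hand purely so the example runs without a network connection; in Bill's script the response would be item 2 of TheResult.

```applescript
use AppleScript version "2.4"
use scripting additions
use framework "Foundation"

-- Hypothetical helper: true only for an HTTP response with a 2xx status code
on statusIsOK(theResponse)
    if theResponse is missing value then return false
    -- Only NSHTTPURLResponse has a status code (e.g. file: URLs give a plain NSURLResponse)
    if not ((theResponse's isKindOfClass:(current application's NSHTTPURLResponse)) as boolean) then return false
    set theStatus to (theResponse's statusCode()) as integer
    return (theStatus ≥ 200 and theStatus ≤ 299)
end statusIsOK

-- Hand-built responses standing in for what a real request would return (assumption for illustration)
set theURL to current application's NSURL's URLWithString:"https://www.cnn.com"
set GoodResponse to current application's NSHTTPURLResponse's alloc()'s initWithURL:theURL statusCode:200 HTTPVersion:"HTTP/1.1" headerFields:(missing value)
set BadResponse to current application's NSHTTPURLResponse's alloc()'s initWithURL:theURL statusCode:404 HTTPVersion:"HTTP/1.1" headerFields:(missing value)

set GoodIsOK to my statusIsOK(GoodResponse)
set BadIsOK to my statusIsOK(BadResponse)
set MissingIsOK to my statusIsOK(missing value)
```

A check like this matters because a server can return a body, such as an error page, along with a 404 or 500 status, and the data alone would not show that.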

@ShaneStanley and @BillKopp, hey guys, sorry to revive this old topic, but I have a very relevant question:

Is there a way to get all of the HTML code from a dynamic page, where some of the page is dynamically created/changed when displayed in the browser?

Here’s an example:

The HTML for this part is NOT returned by ASObjC above:

Of course you will not see this unless you have purchased the book.

I need to extract the Purchase Date, and other data.

So, is my only option to display the page in a browser, and use JavaScript to get the data?

I suspect this relies on cookies. While it may be do-able in theory, in practice you’re going to have to use a browser.