List of Links on Webpage


(cobbap) #1

I’m trying to automate a webscrape download of PDFs, and can’t see to figure out an easy way to scrape all of the links from a page loaded into the browser using AppleScript. I would then iterate through the list of PDF links to automate downloading each document. What’s I’m stuck on is an easy way to scrape the list of links using AppleScript. So, for example, if I run the following to load a webpage, is there an easy way to put every link from that page into an AppleScript variable. I’m indifferent on using Safari or Chrome.

use theRoutines : script "Routines"
use AppleScript version "2.4" -- Yosemite (10.10) or later
use scripting additions
global browserChoice
property browserChoice : "Chrome"
set thePage to "http://www.apple.com"

set docText to webPageLoad(thePage)

on webPageLoad(urlLink)
	set pageText to ""
	if browserChoice is "Chrome" then
		tell application "Google Chrome" to tell window 1 to tell active tab
			set the URL to urlLink
			set isLoading to true
			repeat until isLoading is false
				set isLoading to loading
				delay 0.25
			end repeat
			set pageText to execute javascript "document.body.innerText"
		end tell
	else
		tell application "Safari"
			set the URL of the front document to urlLink
		end tell
		delay 0.5
		set pageLoaded to false
		tell script "Routines" to set pageLoaded to page_loaded(20)
		if pageLoaded is false then return
		delay 0.5
		tell application "Safari" to set pageText to text of document 1
	end if
	return pageText
end webPageLoad

(Shane Stanley) #2

It’s easier to avoid a browser altogether. For example:

use AppleScript version "2.4" -- Yosemite (10.10) or later
use framework "Foundation"
use scripting additions

-- load page
set pageURL to current application's |NSURL|'s URLWithString:"https://www.apple.com"
set {pageHTML, theError} to current application's NSData's dataWithContentsOfURL:pageURL options:0 |error|:(reference)
if pageHTML = missing value then error (theError's localizedDescription() as text)
-- make XML
set {theXMLDoc, theError} to current application's NSXMLDocument's alloc()'s initWithData:pageHTML options:(current application's NSXMLDocumentTidyHTML) |error|:(reference)
if theXMLDoc = missing value then error (theError's localizedDescription() as text)
-- parse for href attributes
set {linkAtts, theError} to (theXMLDoc's nodesForXPath:"//*[@href]/attribute::href" |error|:(reference))
if linkAtts = missing value then error (theError's localizedDescription() as text)
-- filter links to suit, for example
set thePred to current application's NSPredicate's predicateWithFormat:"stringValue BEGINSWITH %@ AND stringValue ENDSWITH[c] %@" argumentArray:{"http", "/"}
-- extract as list of strings
set theLinks to ((linkAtts's filteredArrayUsingPredicate:thePred)'s valueForKey:"stringValue") as list

You need to change the predicate to suit — for PDFs you change the "/" to ".pdf".

If the site isn’t using https, you will need to go to the Resources tab and turn on Allow arbitrary loads for the ATS setting.


(Ed Stockly) #3

FYI there are a few issues with your script.
Smart quotes don’t work.
No need to declare a variable as a property and a global. Properties are global
The Routines script does something? But I (we?) don’t have it.
The Safari part would never work without the routines.
Both versions are getting plain text, you need source to get the URLs.

I asked for the exact same thing a while back and got the script below (I’m afraid I don’t remember the source, but may have been Shane? )

I’m not sure how to get the source text from Chrome, so I’m only doing Safari

use AppleScript version "2.4"
use framework "Foundation"
use scripting additions

set urlLink to "http://www.apple.com"

tell application "Safari"
	set the URL of the front document to urlLink
	delay 2
	set pageText to source of document 1
end tell


my findURLsIn:pageText

on findURLsIn:theString
	set anNSString to current application's NSString's stringWithString:theString
	set theNSDataDetector to current application's NSDataDetector's dataDetectorWithTypes:(current application's NSTextCheckingTypeLink) |error|:(missing value)
	set theURLsNSArray to theNSDataDetector's matchesInString:theString options:0 range:{location:0, |length|:anNSString's |length|()}
	return (theURLsNSArray's valueForKeyPath:"URL.absoluteString") as list
end findURLsIn:

(cobbap) #4

Shane - it’s not a https site - can you clarify … Resources tab of what app?


(Shane Stanley) #5

The Resource tab of your .app document:


(Jim Underwood) #6

I prefer to use JavaScript in Browser for webscraping. It is very easy to develop and test using the JavaScript console of either Chrome or Safari.

Here’s a JavaScript that will return all links that contain “pdf”, ignoring case:

(function myMain() {
var pdfElem = document.querySelectorAll('a[href*="pdf" i]')
var pdfList = Array.from(pdfElem).map(function(x) {return x.href;});
return pdfList.join("\n");
})();

You gave us a source page of www.apple.com, but I don’t find any PDFs on that page.
So using this page:
https://support.apple.com/en_US/manuals/macnotebooks

I ran this script, which returns an AppleScript list of URLs of the PDFs on that page.
The script assumes the web page is already loaded, and does NOT download the PDF list. You said you already knew how to do that, so to keep it simple, it just returns the PDF list (6 items).

Questions?

use AppleScript version "2.5" -- El Capitan (10.11) or later
use scripting additions

set jsStr to "
(function myMain() {
//debugger;
var pdfElem = document.querySelectorAll('a[href*=\"pdf\" i]');
var pdfList = Array.from(pdfElem).map(function(x) {return x.href;});
return pdfList.join(\"\\n\");
})();
"

set pdfList to paragraphs of my doJavaScript("Chrome", jsStr)

--~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
on doJavaScript(pBrowserName, pScriptStr) -- @JavaScript @Web @Browser   @KMLib
  --–––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––
  (*  VER: 1.1    2017-11-25
      PURPOSE:  Run JavaScript in the Designated Browser (Safari OR Chrome)
      PARAMETERS:
        • pBrowserName    |  text  | Name of brower in which to run JavaScript
            • IF this parameter is empty "", then the FrontMost App will 
                be determined and used, IF it is a browser (else an error)
        • pScriptStr      | text  | JavaScript text to be run
        
      RETURNS:  Results of the JavaScript
      
      AUTHOR:  JMichaelTX
    ## REQUIRES:  Either Google Chrome or Safari
  ----------------------------------------------------------------
  *)
  
  local jsScriptResults, frontAppName
  
  if (pBrowserName = "") then -- get FrontMost App Name
    tell application "System Events"
      set pBrowserName to item 1 of (get name of processes whose frontmost is true)
    end tell
  end if
  
  try
    
    set jsScriptResults to "TBD"
    
    if pBrowserName = "Safari" then
      tell application "Safari"
        set jsScriptResults to do JavaScript pScriptStr in front document
      end tell
      
    else if pBrowserName contains "Chrome" then
      tell application "Google Chrome"
        set jsScriptResults to execute (active tab of window 1) javascript pScriptStr
      end tell
      
    else
      error ("ERROR:  Invalid Broswer Name: " & pBrowserName ¬
        & return & "Script MUST BE run with the FrontMost app as either \"Safari\" OR \"Chrome\"")
    end if
    
  on error e
    error "Error in handler doJavaScript()" & return & return & e
    
  end try
  
  return jsScriptResults
  
end doJavaScript
--~~~~~~~~~~~~~~~~ END of Handler doJavaScript ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~




(cobbap) #7

Thanks all for the input.

Shane - I was using SD6. Downloaded SD7 and tried your code with the Allow arbitrary loads checked. Modified both string predicates as needed – the XML doc links were relative – so instead of starting with “http” I needed it to say start with “/docs/” and change the endswith to “.pdf” as suggested.

I also added the necessary code to iterate though the source URL webpages and then to compile the links into one array without any duplicates. Script ran through about 4000 source page URLs in about 8 minutes and gave me 7300 PDF links. Amazing.

What I don’t like is that I don’t understand one bit how to write the code or what it is doing. Is that AppleScriptObjC or is there a resource I can read up on to better understand that code?

Also - I thought I had an easy way to save each of the PDF links to a local folder, but now I’m not seeing an easy way to do so. What would you suggest to automate saving a PDF http link to a local folder ?


(cobbap) #8

Ed - thanks for the input - I was able to get this code to work as well. The browserless option though seems to be much quicker.

I’m pretty much a novice at coding, but I can give you what works for me to get source text from either a safari or chrome page. This code also eliminates the need for the delay 2 step, as it checks the docreadystate of the browser to ensure the page has completed loaded.

For safari - as you’ve noted source of document 1 works inside a tell Safari block.

For chrome - I use execute javascript “document.getElementsByTagName(‘html’)[0].innerHTML”

I call it as a function:

getBrowserSource("Safari")
getBrowserSource("Chrome")

on getBrowserSource(theBrowser)
	if theBrowser contains "Safari" then
		set the_state to missing value
		tell application "Safari"
			repeat until the_state is "complete"
				set the_state to (do JavaScript "document.readyState" in document 1)
				log "busy"
				delay 0.2
			end repeat
			log "complete"
			set pageSource to the source of document 1
		end tell
	else if theBrowser contains "Chrome" then
		set the_state to missing value
		tell application "Google Chrome"
			tell active tab of window 1
				repeat until the_state is "complete"
					set the_state to (execute javascript "document.readyState")
					log "busy"
					delay 0.2
				end repeat
				log "complete"
				set pageSource to execute javascript "document.getElementsByTagName('html')[0].innerHTML"
			end tell
		end tell
	end if
	return pageSource
end getBrowserSource

(Ed Stockly) #9

Yes, used the direct URL for one task that would take hours using a browser, and took minutes (that was just getting source text, not scraping URLS).

There are two reasons I would use the browser. First, if a user were navigating and wanted to pull all the URLs from the current page they could run a script from the script menu.

Second, some web pages build on demand based in part on what browser and other factors (OS) are detected, and some pages won’t even open without a browser. (We have an internal database search site that simply returns the search page is you send a url directly).

So we have both options!


(Shane Stanley) #10

Yes it is, with a touch of XPath doing much of the hard work.

Do you mean the PDFs themselves? If so, probably something like this:

use AppleScript version "2.4" -- Yosemite (10.10) or later
use framework "Foundation"
use scripting additions

-- load page
set pageURL to current application's |NSURL|'s URLWithString:"https://www.apple.com"
set {pageHTML, theError} to current application's NSData's dataWithContentsOfURL:pageURL options:0 |error|:(reference)
if pageHTML = missing value then error (theError's localizedDescription() as text)
-- make XML
set {theXMLDoc, theError} to current application's NSXMLDocument's alloc()'s initWithData:pageHTML options:(current application's NSXMLDocumentTidyHTML) |error|:(reference)
if theXMLDoc = missing value then error (theError's localizedDescription() as text)
-- parse for href attributes
set {linkAtts, theError} to (theXMLDoc's nodesForXPath:"//*[@href]/attribute::href" |error|:(reference))
if linkAtts = missing value then error (theError's localizedDescription() as text)
-- filter links to suit, for example
set thePred to current application's NSPredicate's predicateWithFormat:"stringValue BEGINSWITH %@ AND stringValue ENDSWITH[c] %@" argumentArray:{"/docs/", ".pdf"}
-- extract as list of strings
set theLinks to ((linkAtts's filteredArrayUsingPredicate:thePred)'s valueForKey:"stringValue")
-- define where to save and loop through
set basePath to POSIX path of (path to desktop)
repeat with aLink in theLinks
	-- make full URL
	set aURL to (current application's |NSURL|'s URLWithString:aLink relativeToURL:pageURL)
	-- get name for file
	set pdfName to aLink's lastPathComponent() as text
	-- download the PDF as data
	set {pdfData, theError} to (current application's NSData's dataWithContentsOfURL:aURL options:0 |error|:(reference))
	if pdfData = missing value then error (theError's localizedDescription() as text)
	-- save it to a file
	(pdfData's writeToFile:(basePath & pdfName) atomically:true)
end repeat

It probably needs code to check for duplicate file names, too.


(Jim Underwood) #11

Shane, that’s a very nice, complete solution. Thanks for sharing.

I think I understand most of the script, except for this statement:

set thePred to current application's NSPredicate's predicateWithFormat:"stringValue BEGINSWITH %@ AND stringValue ENDSWITH[c] %@" argumentArray:{"/docs/", ".pdf"}

Would you might explaining all of the functions and parameters, so I can make intelligent changes in the future?


(Shane Stanley) #12

A predicate is a bit like a whose clause in AppleScript. For convenience, you can use %@ as a placeholder for values, which are then included in the argumentArray parameter. The [c] means ignoring case.

Predicates are powerful, but also a bit complex. But this gives a detailed run-down, with the usual caveat of its being written for Objective-C users:

https://developer.apple.com/library/content/documentation/Cocoa/Conceptual/Predicates/AdditionalChapters/Introduction.html

In particular, this page gives good summary of predicate string syntax:

https://developer.apple.com/library/content/documentation/Cocoa/Conceptual/Predicates/Articles/pSyntax.html