Get clickable links from a web page

I am looking for a way to automate a tedious process.

We have a number of evergreen web pages with numerous links. On a regular basis we need to check that every link is still taking users to the correct location.

Right now we do this by opening each page and clicking on every link.

What I’m hoping for is a script that will look at a page (either in Safari or, preferably, by reading the URL directly), extract all the links, and then check where each link leads.

We’d then compare that result to the expected result and flag any that don’t match.

Right now it’s the first step, extracting the clickable links, that I need help with. (That seems like the simplest part.)

Here is one solution among others:

use framework "Foundation"
use scripting additions

set thePage to "https://www.apple.com"
-- Fetch the page source as a UTF-8 string (missing value if the fetch fails)
set theURL to current application's |NSURL|'s URLWithString:thePage
set theSource to current application's NSString's stringWithContentsOfURL:theURL encoding:(current application's NSUTF8StringEncoding) |error|:(missing value)

-- Scan the source for links; this matches URLs written out in full,
-- so relative hrefs in the HTML are generally not picked up
set dataDetector to current application's NSDataDetector's dataDetectorWithTypes:(current application's NSTextCheckingTypeLink) |error|:(missing value)
set linkArray to dataDetector's matchesInString:theSource options:0 range:{location:0, |length|:theSource's |length|()}
-- Convert each match to its URL string and return an AppleScript list
return (linkArray's valueForKeyPath:"URL.absoluteString") as list

Thanks, Jonas, that’s just what I needed!

@ionah
I added a searchTag parameter to your nice script; it returns a list, so it also works when more than one URL matches.

use framework "Foundation"
use scripting additions

set thePage to "https://www.apple.com"
its searchFor:"drama" inURL:thePage

on searchFor:searchTag inURL:URLString
	set theURL to current application's |NSURL|'s URLWithString:URLString
	set theSource to current application's NSString's stringWithContentsOfURL:theURL encoding:(current application's NSUTF8StringEncoding) |error|:(missing value)
	
	set dataDetector to current application's NSDataDetector's dataDetectorWithTypes:(current application's NSTextCheckingTypeLink) |error|:(missing value)
	set linkArray to dataDetector's matchesInString:theSource options:0 range:{location:0, |length|:theSource's |length|()}
	set URLList to (linkArray's valueForKeyPath:"URL.absoluteString") as list
	
	set resultList to {}
	repeat with anItem in URLList
		if searchTag is in anItem then
			set end of resultList to (contents of anItem)
		end if
	end repeat
	return resultList
end searchFor:inURL:

Or you could use NSPredicate to filter the list and NSSet to clean duplicates:

use framework "Foundation"
use scripting additions

my linksFrom:"https://www.apple.com" withTag:"drama"

on linksFrom:thePage withTag:theTag
	set theURL to current application's |NSURL|'s URLWithString:thePage
	set theSource to current application's NSString's stringWithContentsOfURL:theURL encoding:(current application's NSUTF8StringEncoding) |error|:(missing value)
	
	set dataDetector to current application's NSDataDetector's dataDetectorWithTypes:(current application's NSTextCheckingTypeLink) |error|:(missing value)
	set linkArray to dataDetector's matchesInString:theSource options:0 range:{location:0, |length|:theSource's |length|()}
	set linkArray to (linkArray's valueForKeyPath:"URL.absoluteString")
	
	set thePred to current application's NSPredicate's predicateWithFormat:"self contains[cd] %@" argumentArray:{theTag}
	set linkArray to linkArray's filteredArrayUsingPredicate:thePred
	
	set linkArray to current application's NSSet's setWithArray:linkArray
	return linkArray's allObjects() as list
end linksFrom:withTag:
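
One caveat with NSSet: it does not preserve order, so the links can come back in a different order than they appear on the page. If the order matters, NSOrderedSet is one alternative. A minimal sketch, using made-up example.com URLs in place of the filtered linkArray:

```applescript
use framework "Foundation"
use scripting additions

-- Hypothetical sample data standing in for the filtered linkArray
set linkArray to current application's NSArray's arrayWithArray:{"https://example.com/a", "https://example.com/b", "https://example.com/a"}

-- NSOrderedSet drops duplicates but keeps the first occurrence of each item in place
set uniqueLinks to (current application's NSOrderedSet's orderedSetWithArray:linkArray)'s array()
return uniqueLinks as list
-- {"https://example.com/a", "https://example.com/b"}
```

In the handler above, that would mean swapping the NSSet line for the NSOrderedSet one and returning its array() instead of allObjects().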

@ionah
That is nice, I have a lot of NSPredicate scripts on my old computer.
Forgot how useful it is.

I try to follow Shane Stanley’s ASObjC Style Guide :wink:
Your script below is split into three helper handlers, with a fourth as the main one.

use framework "Foundation"
use scripting additions

its linksFrom:"https://www.apple.com" withTag:"drama"

on linksFrom:URLString withTag:theTag
	set theSource to its URLWithString:URLString
	set linkArray to detectorWithLink(theSource)
	return (its filterArray:linkArray predicateWithFormat:"self contains[cd] %@" withArguments:{theTag}) as list
end linksFrom:withTag:

on URLWithString:URLString
	set theURL to current application's |NSURL|'s URLWithString:URLString
	set {theContents, theError} to current application's NSString's stringWithContentsOfURL:theURL encoding:(current application's NSUTF8StringEncoding) |error|:(reference)
	if theContents is missing value then error (theError's localizedDescription() as string)
	return theContents
end URLWithString:

on detectorWithLink(theSource)
	set dataDetector to current application's NSDataDetector's dataDetectorWithTypes:(current application's NSTextCheckingTypeLink) |error|:(missing value)
	set theDetector to dataDetector's matchesInString:theSource options:0 range:{location:0, |length|:theSource's |length|()}
	set theDetector to (theDetector's valueForKeyPath:"URL.absoluteString")
	return theDetector
end detectorWithLink

on filterArray:anArray predicateWithFormat:formatString withArguments:argumentList
	set thePredicate to current application's NSPredicate's predicateWithFormat:formatString argumentArray:argumentList
	set anArray to anArray's filteredArrayUsingPredicate:thePredicate
	set anArray to current application's NSSet's setWithArray:anArray
	return anArray's allObjects()
end filterArray:predicateWithFormat:withArguments:

I want to do something that’s related. I want to scrape all JPEGs on a page, and copy each to a folder from which I can do some post processing. I’ll study these solutions (unless someone has something closer to my need).


Just use “.jpg” as a tag with either of the last two solutions and it should work.

its linksFrom:"https://www.apple.com" withTag:".jpg"
its linksFrom:"https://www.apple.com" withTag:".jpeg"
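
For the "copy each to a folder" part, something along these lines might work. This is only a sketch: the handler name saveLinks:toFolder: and the folder path are made up for illustration, files are named after the last path component of each URL (so two images with the same name would overwrite each other), and each download blocks until it finishes.

```applescript
use framework "Foundation"
use scripting additions

-- Hypothetical handler: downloads each URL string in linkList into folderPath
-- and returns the links that could not be saved.
on saveLinks:linkList toFolder:folderPath
	set fileManager to current application's NSFileManager's defaultManager()
	set folderPath to (current application's NSString's stringWithString:folderPath)'s stringByExpandingTildeInPath()
	-- Create the destination folder if it does not exist yet
	fileManager's createDirectoryAtPath:folderPath withIntermediateDirectories:true attributes:(missing value) |error|:(missing value)
	
	set failedLinks to {}
	repeat with aLink in linkList
		set linkText to contents of aLink
		set didSave to false
		try
			set theURL to current application's |NSURL|'s URLWithString:linkText
			if theURL is not missing value then
				-- Synchronous download; returns missing value on failure
				set theData to current application's NSData's dataWithContentsOfURL:theURL
				if theData is not missing value then
					set destPath to folderPath's stringByAppendingPathComponent:(theURL's lastPathComponent())
					set didSave to (theData's writeToFile:destPath atomically:true) as boolean
				end if
			end if
		end try
		if not didSave then set end of failedLinks to linkText
	end repeat
	return failedLinks
end saveLinks:toFolder:
```

With the linksFrom:withTag: handler from the earlier script in the same file, a call could look like `its saveLinks:(its linksFrom:"https://www.apple.com" withTag:".jpg") toFolder:"~/Desktop/Scraped JPEGs"`, where the folder name is just an example.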

Keep in mind that you’re out of luck if the content you want is generated dynamically rather than present in the static page source.

It’s still possible, but you’ll have to use a “virtual browser” to get the source. Something on the order of:

If you’re using a web browser to view the page in question you’re probably better off using JavaScript in the browser.
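
A rough sketch of that approach in Safari follows. It assumes the page is already loaded in Safari's front document and that "Allow JavaScript from Apple Events" is enabled in Safari's developer settings; the variable names are only for illustration.

```applescript
-- Ask the page itself for its links, so dynamically generated ones are included
tell application "Safari"
	-- document.links holds every <a> and <area> element that has an href;
	-- the hrefs are joined with newlines so they come back as a single string
	set linkText to do JavaScript "Array.from(document.links, function (a) { return a.href; }).join('\\n');" in document 1
end tell

-- Split the newline-separated string back into an AppleScript list
set linkList to paragraphs of linkText
return linkList
```

Because the browser resolves each href, relative links come back as full URLs.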