Extracting strings using NSRegularExpression

asobjc
foundation
how-to

(Mark Alldritt) #1

Here’s a snippet of code demonstrating how to use NSRegularExpression to extract phone numbers from a string.

--
--	Created by: Mark Alldritt
--	Created on: 2018-01-05
--
--	Copyright (c) 2018 Late Night Software Ltd.
--	All Rights Reserved
--

use AppleScript version "2.4" -- Yosemite (10.10) or later
use scripting additions
use framework "Foundation"

-- classes, constants, and enums used
property NSRegularExpressionCaseInsensitive : a reference to 1
property NSRegularExpression : a reference to current application's NSRegularExpression
property NSNotFound : a reference to 9.22337203685477E+18 + 5807 -- see http://latenightsw.com/high-sierra-applescriptobjc-bugs/

--	Let's look for US phone numbers of the form 000-0000, (000) 000-0000, 000-000-0000, (000)-000-0000
set usPhoneNumberPattern to "\\(?(\\d{3})?\\)?\\s*-?\\s*(\\d{3})\\s*-?\\s*(\\d{4})"

set theSample to "333-1234, 250-888-8888, (123) 350-1234, (456)-350-1234"
set theRegEx to NSRegularExpression's regularExpressionWithPattern:usPhoneNumberPattern options:NSRegularExpressionCaseInsensitive |error|:(missing value)
set theMatches to theRegEx's matchesInString:theSample options:0 range:[0, theSample's length]
set thePhoneNumbers to {}

repeat with aMatch in theMatches
	--	Get the matched range of text
	set wholeRange to (aMatch's rangeAtIndex:0) as record
	set thePhoneNumber to text ((wholeRange's location) + 1) thru ((wholeRange's location) + (wholeRange's |length|)) of theSample
	
	--	Get the groups of the regular expression match
	set numRanges to aMatch's numberOfRanges as integer
	set parts to {"000", "000", "0000"}
	
	repeat with rangeIndex from 1 to numRanges - 1
		set partRange to (aMatch's rangeAtIndex:rangeIndex) as record
		if partRange's location is not NSNotFound then ¬
			set item rangeIndex of parts to text ((partRange's location) + 1) thru ((partRange's location) + (partRange's |length|)) of theSample
	end repeat
	
	--	Collect the results
	set end of thePhoneNumbers to {|phoneNumber|:thePhoneNumber, parts:parts}
end repeat

thePhoneNumbers
--> {
--		{phoneNumber:"333-1234", parts:{"000", "333", "1234"}},
--		{phoneNumber:"250-888-8888", parts:{"250", "888", "8888"}},
--		{phoneNumber:"(123) 350-1234", parts:{"123", "350", "1234"}},
--		{phoneNumber:"(456)-350-1234", parts:{"456", "350", "1234"}}
--	}

(Nigel Garvey) #2

Hi Mark.

Your script’s assuming there’s a one-to-one equivalence between characters in an AS string and locations in an NSString. There probably is when the AS string only contains US phone numbers, but there may not be otherwise. Ideally, the text matching should all be done in ASObjC and the results coerced to AS text as obtained.

use AppleScript version "2.4" -- Yosemite (10.10) or later
use scripting additions
use framework "Foundation"

-- classes, constants, and enums used
property NSRegularExpressionCaseInsensitive : a reference to 1
property NSRegularExpression : a reference to current application's NSRegularExpression
-- property NSNotFound : a reference to 9.22337203685477E+18 + 5807 -- Not needed if the |length|'s checked for 0 instead.

--	Let's look for US phone numbers of the form (000) 000-0000, 000-000-0000, (000)-000-0000
set usPhoneNumberPattern to "\\(?(\\d{3})?\\)?\\s*-?\\s*(\\d{3})\\s*-?\\s*(\\d{4})"

set theSample to "333-1234, 250-888-8888, (123) 350-1234, (456)-350-1234"
set theNSStringSample to current application's NSString's stringWithString:theSample
set theRegEx to NSRegularExpression's regularExpressionWithPattern:usPhoneNumberPattern options:NSRegularExpressionCaseInsensitive |error|:(missing value)
set theMatches to theRegEx's matchesInString:theNSStringSample options:0 range:{0, theNSStringSample's |length|()}
set thePhoneNumbers to {}

repeat with aMatch in theMatches
	--	Get the matched range of text
	set wholeRange to (aMatch's rangeAtIndex:0) as record
	set thePhoneNumber to (theNSStringSample's substringWithRange:wholeRange) as text
	
	--	Get the groups of the regular expression match
	set numRanges to aMatch's numberOfRanges as integer
	set parts to {"000", "000", "0000"}
	
	repeat with rangeIndex from 1 to numRanges - 1
		set partRange to (aMatch's rangeAtIndex:rangeIndex) as record
		if partRange's |length| > 0 then ¬
			set item rangeIndex of parts to (theNSStringSample's substringWithRange:partRange) as text
	end repeat
	
	--	Collect the results
	set end of thePhoneNumbers to {|phoneNumber|:thePhoneNumber, parts:parts}
end repeat

thePhoneNumbers
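
The mismatch described at the top of this post can be made visible with a sample that contains a character outside the Basic Multilingual Plane. The following is only a minimal sketch, not part of either script above: the emoji, the trailing " (home)" text, and the variable names are assumptions chosen for illustration. AppleScript is expected to count the emoji as one character while NSString counts it as two UTF-16 code units, so offsets taken from an NSRange no longer line up with AppleScript's text ranges.

```applescript
use AppleScript version "2.4" -- Yosemite (10.10) or later
use scripting additions
use framework "Foundation"

-- Illustrative sample: one emoji (U+1F600) ahead of the phone number
set theSample to (character id 128512) & " 250-888-8888 (home)"
set theNSStringSample to current application's NSString's stringWithString:theSample

set asLength to length of theSample -- AppleScript counts characters
set nsLength to theNSStringSample's |length|() as integer -- NSString counts UTF-16 code units

-- Locate the number in NSString terms
set theRange to (theNSStringSample's rangeOfString:"250-888-8888") as record

-- Extracting through NSString keeps the offsets consistent
set viaNSString to (theNSStringSample's substringWithRange:theRange) as text

-- Reusing the same offsets on the AppleScript text lands in the wrong place
set viaASText to text ((theRange's location) + 1) thru ((theRange's location) + (theRange's |length|)) of theSample

{asLength, nsLength, viaNSString, viaASText}
```

If the two lengths differ, the last two items of the result should differ as well, which is the case the substringWithRange: approach avoids.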

(Jim Underwood) #3

Mark, thanks for sharing. It’s always great to see another example of using RegEx with ASObjC.

As @NigelGarvey pointed out, this can be a complicated issue. Over the years it has been much discussed by the RegEx community. Here is one example from StackOverflow.com:

A comprehensive regex for phone number validation

A Google search will reveal many others.


(Nigel Garvey) #4

If you just want the phone numbers without the individual parts, it’s also possible to use NSDataDetector. It even finds my non-US ones! But I don’t know how clever it is universally.

set theSample to "333-1234, 250-888-8888, (123) 350-1234, (456)-350-1234"
set theNSStringSample to current application's NSString's stringWithString:theSample
set theDetector to current application's NSDataDetector's dataDetectorWithTypes:(current application's NSTextCheckingTypePhoneNumber) |error|:(missing value)
set theMatches to theDetector's matchesInString:theNSStringSample options:0 range:{0, theNSStringSample's |length|()}

(Shane Stanley) #5

And ours:

set theSample to "333-1234, 250-888-8888, (123) 350-1234, (456)-350-1234, 0427 123 456, +61 427 123 456, 03 9123 1234, 9123 1234, +61 3 9123 1234"

(Jim Underwood) #6

Thanks, Nigel. That seems to return an NS object. How do we get a standard AS list?

Also, I don’t think these are valid phone numbers:
333-1234 – missing area code
(456)-350-1234 – a dash should not follow a closing parenthesis. It should be nothing or a space.


(Nigel Garvey) #7

Hi Jim. Sorry I wasn’t clear.

NSDataDetector is a relative of NSRegularExpression. Their matchesInString:options:range: methods both return an array of NSTextCheckingResult (or an empty array when there are no matches). So the two lines containing ‘theDetector’ in post #4 could simply replace the two containing ‘theRegEx’ in post #2. In practice, of course, the line defining the regex pattern then becomes redundant, as does the code for extracting the ‘parts’ of the numbers, since no ranges are returned for them.

use AppleScript version "2.4" -- Yosemite (10.10) or later
use scripting additions
use framework "Foundation"

-- classes, constants, and enums used
property NSRegularExpressionCaseInsensitive : a reference to 1
property NSRegularExpression : a reference to current application's NSRegularExpression

set theSample to "333-1234, 250-888-8888, (123) 350-1234, (456)-350-1234"
set theNSStringSample to current application's NSString's stringWithString:theSample
set theDetector to current application's NSDataDetector's dataDetectorWithTypes:(current application's NSTextCheckingTypePhoneNumber) |error|:(missing value)
set theMatches to theDetector's matchesInString:theNSStringSample options:0 range:{0, theNSStringSample's |length|()}
set thePhoneNumbers to {}

repeat with aMatch in theMatches
	--	Get the matched range of text
	set wholeRange to (aMatch's rangeAtIndex:0) as record -- or aMatch's range() as record
	set thePhoneNumber to (theNSStringSample's substringWithRange:wholeRange) as text
	
	--	Collect the results
	set end of thePhoneNumbers to {|phoneNumber|:thePhoneNumber}
end repeat

thePhoneNumbers

I’m afraid I can accept no responsibility for that. :wink:
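
As a possible shortcut to the repeat loop in this post, each NSTextCheckingResult produced by a phone-number detector also carries a phoneNumber property, and key-value coding can collect it from every match at once. This is a sketch under that assumption rather than anything tested in the thread; the shortened sample string is illustrative.

```applescript
use AppleScript version "2.4" -- Yosemite (10.10) or later
use scripting additions
use framework "Foundation"

set theSample to "250-888-8888, (123) 350-1234"
set theNSStringSample to current application's NSString's stringWithString:theSample
set theDetector to current application's NSDataDetector's dataDetectorWithTypes:(current application's NSTextCheckingTypePhoneNumber) |error|:(missing value)
set theMatches to theDetector's matchesInString:theNSStringSample options:0 range:{0, theNSStringSample's |length|()}

-- Ask every NSTextCheckingResult in the array for its phoneNumber property,
-- then coerce the resulting NSArray of NSStrings to an AppleScript list of text
set thePhoneNumbers to (theMatches's valueForKey:"phoneNumber") as list
```

The loop version is still the one to use when the matched ranges themselves are needed, since valueForKey: returns only the strings.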


(Jim Underwood) #8

I’m a bit confused about the use case for this script.

It seems to accept as valid phone numbers those that I would call incomplete at best. For example:

333-1234
9123 1234

AFAIK, at least 10 digits are required everywhere in the world today. Certainly that is true in the US.

So I can think of three main use cases for phone numbers:

  1. Determine if a proposed number is valid
  2. If a number is valid, return it in a standard format
  3. If a number is valid, return it as a pure stream of numbers

Maybe I’m missing something here, but none of these scripts seems to do any of them.
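
For reference, the three use cases listed above could be sketched with the same NSRegularExpression class. Everything here is an assumption made for illustration: the handler name checkPhoneNumber is hypothetical, "valid" is taken to mean exactly ten digits in one of the North American layouts used in this thread, and "(000) 000-0000" is taken as the standard format. It is not a general-purpose validator.

```applescript
use AppleScript version "2.4" -- Yosemite (10.10) or later
use scripting additions
use framework "Foundation"

-- Hypothetical handler: checks one candidate string and returns a record
on checkPhoneNumber(theCandidate)
	set theNSString to current application's NSString's stringWithString:theCandidate
	set fullRange to {0, theNSString's |length|()}
	-- Anchored pattern: area code (optionally parenthesised), exchange, line number.
	-- Note that it does not insist on balanced parentheses.
	set thePattern to "^\\(?(\\d{3})\\)?[\\s-]*(\\d{3})[\\s-]*(\\d{4})$"
	set theRegEx to current application's NSRegularExpression's regularExpressionWithPattern:thePattern options:0 |error|:(missing value)
	
	-- 1. Valid only if the whole string matches the pattern
	set isValid to ((theRegEx's numberOfMatchesInString:theNSString options:0 range:fullRange) as integer) is 1
	if not isValid then return {|isValid|:false, |formatted|:missing value, |digits|:missing value}
	
	-- 2. Standard format, rebuilt from the three capture groups
	set theFormatted to (theRegEx's stringByReplacingMatchesInString:theNSString options:0 range:fullRange withTemplate:"($1) $2-$3") as text
	-- 3. Pure stream of digits
	set theDigits to (theRegEx's stringByReplacingMatchesInString:theNSString options:0 range:fullRange withTemplate:"$1$2$3") as text
	
	return {|isValid|:true, |formatted|:theFormatted, |digits|:theDigits}
end checkPhoneNumber

checkPhoneNumber("(123) 350-1234")
--> {isValid:true, formatted:"(123) 350-1234", digits:"1233501234"}
```

A seven-digit candidate such as "333-1234" fails the anchored pattern and comes back with isValid set to false.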


(Shane Stanley) #9

Not here. Only for calls to other states (and even then, not between all states) or mobile calls.


(Phil Stokes) #10

Nor here. 9 digits for land line numbers, unless you’re calling overseas.


(Nigel Garvey) #11

Nor here. An area code isn’t required between land-line numbers on the same area exchange, but can be used without confusing the system.

My understanding of the purpose of Mark’s script is that it’s a “how to” demonstrating the use of NSRegularExpression — say, to extract US/Canadian-format phone numbers from a text and return them along with breakdowns of their parts. It’s not intended to do anything else or to be used directly for anything other than educational purposes.


(Mark Alldritt) #12

Precisely. Right after the improvements began to flow I realized I picked a poor example. Still, I think the information that emerged is helpful.


(Jim Underwood) #13

Mark, thanks again for sharing an example of how to use RegEx with ASObjC.

May I suggest that in the future you make the purpose of the script clear in the opening description. From what you posted, it looked to me like a tool for getting phone numbers via RegEx, and that colored all of my subsequent thinking and responses. :smile:

In fact, I’d suggest that you even edit your OP here to make it clear, so that others, or even me months later, who come across this script will understand its purpose.