Returning every line containing "x"

asobjc

(Ed Stockly) #1

If I have a variable with a lot of text, what’s the best (fastest, most reliable) way to get every paragraph that contains a given substring?

I know this can be done via the shell and reg ex, but I’m wondering if there’s not an ASObjC solution. (I’m also looking at Has’ libraries and others).

Suggestions?


(Shane Stanley) #2

ASObjC is unlikely to be the fastest offering because a search returns an array of ranges, and you have to loop through extracting the text for each range. As in all AppleScript, repeat loops tend to slow things down. That said, it should be perfectly reliable, and does give you an easy way to ensure the search string doesn’t contain any characters that have special meaning in search patterns.

use AppleScript version "2.4" -- Yosemite (10.10) or later
use framework "Foundation"
use scripting additions

set stringToSearch to "one
two
blah
Blah
more blah here
blaah"
set findString to "blah"
my findParsContaining:findString inString:stringToSearch matchCase:false

on findParsContaining:findString inString:stringToSearch matchCase:caseFlag
	set stringToSearch to current application's NSString's stringWithString:stringToSearch
	-- escape pattern, in case it contains characters that have special meaning in patterns
	set escString to current application's NSRegularExpression's escapedPatternForString:findString
	if caseFlag then
		set theOptions to current application's NSRegularExpressionAnchorsMatchLines
	else
		set theOptions to (current application's NSRegularExpressionAnchorsMatchLines) + (get current application's NSRegularExpressionCaseInsensitive)
	end if
	set theRegex to current application's NSRegularExpression's regularExpressionWithPattern:("^.*" & escString & ".*$") options:theOptions |error|:(missing value)
	set theMatches to theRegex's matchesInString:stringToSearch options:0 range:{0, stringToSearch's |length|()}
	set theResults to {}
	repeat with aMatch in theMatches
		set end of theResults to (stringToSearch's substringWithRange:(aMatch's range())) as text
	end repeat
	return theResults
end findParsContaining:inString:matchCase:

(Jim Underwood) #3

Have you considered Satimage.osax?
It has a very powerful, very fast RegEx engine.

I’m pretty sure Chris (@ccstone) would tell you the same.


(Christopher Stone) #4

Hey Ed,

For really big data?

The Satimage.osax?   :sunglasses:

Very fast and handles very big data.

Barring that then sed for speed and simplicity of use.

The caveats are the time overhead of shelling-out, and the limitations on the size of data passed to a do shell script statement.

set textVar to "
abaft
abaisance
abaiser
abaissed
abalienate
abalienation
abalone
Abama
abampere
abandon
abandonable
abandoned
abandonedly
abandonee
abandoner
abandonment
Abanic – Unicode added for testing ˆÒÚ¨
Abantes – Unicode added for testing •
abaptiston
Abarambo
"

set shCMD to "sed -En '/an/p' <<< " & quoted form of textVar
set foundLines to do shell script shCMD

This will return the current command-size limit for your system:

```applescript
do shell script "sysctl kern.argmax"
```

On macOS Sierra 10.12.3 that would be:

kern.argmax: 262144

Bigger data can be handled by running the command on a file instead of a variable.
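
A minimal sketch of that file-based variant, assuming the text has already been written to a temporary file. The sample words and the `tmpfile` and `found` names are only illustrative and are not part of the script above. Because sed reads the file itself, the text never travels on the command line, so the kern.argmax limit does not apply to it:

```shell
# Illustrative only: create a temporary file holding a few sample lines.
tmpfile=$(mktemp)
printf '%s\n' abaft abandon abalone Abantes > "$tmpfile"

# sed reads the file directly, so the text is not part of the command line.
found=$(sed -n '/an/p' "$tmpfile")
printf '%s\n' "$found"

# Clean up the temporary file.
rm -f "$tmpfile"
```

From AppleScript, the same sed command could be run with do shell script, passing the quoted form of the file’s POSIX path instead of the text itself.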




If you want to use AppleScriptObjC, then changing the text (deleting the lines you don’t want) should be significantly faster than finding the text (collecting the lines you do want one match at a time).

This scary-looking regular expression removes lines that DON’T contain the given pattern:

(?m)^(?>(?:(?!an)).)*$\\R?

The relevant part is this: (?!an) — for NOT “an” — and that of course can be a different literal string or a more complex regular expression.

------------------------------------------------------------------------------
use framework "Foundation"
use scripting additions
------------------------------------------------------------------------------

set textVar to "
abaft
abaisance
abaiser
abaissed
abalienate
abalienation
abalone
Abama
abampere
abandon
abandonable
abandoned
abandonedly
abandonee
abandoner
abandonment
Abanic – Unicode added for testing ˆÒÚ¨
Abantes – Unicode added for testing •
abaptiston
Abarambo
"

# Remove lines not containing “an”
set newStr to its cngStr:"(?m)^(?>(?:(?!an)).)*$\\R?" intoString:"" inString:textVar
# Strip vertical whitespace from the bottom of the data:
set newStr to its cngStr:"\\s+\\Z" intoString:"" inString:newStr

------------------------------------------------------------------------------
--» HANDLERS
------------------------------------------------------------------------------
on cngStr:findString intoString:replaceString inString:dataString
	set anNSString to current application's NSString's stringWithString:dataString
	set dataString to (anNSString's stringByReplacingOccurrencesOfString:findString withString:replaceString ¬
		options:(current application's NSRegularExpressionSearch) range:{0, anNSString's |length|()}) as text
end cngStr:intoString:inString:
------------------------------------------------------------------------------

-Chris


(Shane Stanley) #5

Shelling out also carries potential problems with normalization of some Unicode characters. Very rare, but IMO the fact that it can happen makes it a poor choice, given there are alternatives.

Yes, changing the text rather than finding it is a much faster approach. However…

The \R metacharacter was only introduced into ICU with ICU 55. I’m not sure when that was released or incorporated into macOS, but it wasn’t there in 10.10. (Your code was the first time I’ve seen it, which is why I looked it up. Maybe someone running 10.11 can check.)

That’s going to fail with some Unicode characters. Change “length of dataString” to “anNSString’s |length|()”.


(Christopher Stone) #6

gsed 4.4 will almost certainly play more nicely with Unicode than macOS’ 12-year-old stock BSD sed and is easily installed with MacPorts or Homebrew.

Like most tools it isn’t suited to every job, but it’s very good at what it does and is easy to use.

Perl is the obvious jump from sed and has very significant Unicode support, but I was trying to stay in the lightweight class.

The \\R can be changed to [\\n\\r] if necessary.

The length change is done, in the original script. (I got that handler from you, btw. :)

Thanks.

-Chris


(Shane Stanley) #7

I know – and you weren’t the only one. So I keep jumping on it every time I see it… :wink:


(Shane Stanley) #8

FWIW, the issue is related more to normalization required by some apps (I’m looking at you, Adobe Illustrator).


(Ed Stockly) #9

Is that the version posted here?


(Christopher Stone) #10

Yes – and the forum is making me have at least 20 characters to answer your question.