If I have a variable containing a lot of text, what’s the best (fastest, most reliable) way to get every paragraph that contains a given substring?
I know this can be done via the shell and regex, but I’m wondering whether there’s an ASObjC solution. (I’m also looking at Has’ libraries and others.)
ASObjC is unlikely to be the fastest offering because a search returns an array of ranges, and you have to loop through extracting the text for each range. As in all AppleScript, repeat loops tend to slow things down. That said, it should be perfectly reliable, and does give you an easy way to ensure the search string doesn’t contain any characters that have special meaning in search patterns.
```applescript
use AppleScript version "2.4" -- Yosemite (10.10) or later
use framework "Foundation"
use scripting additions

set stringToSearch to "one
two
blah
Blah
more blah here
blaah"
set findString to "blah"
my findParsContaining:findString inString:stringToSearch matchCase:false

on findParsContaining:findString inString:stringToSearch matchCase:caseFlag
    set stringToSearch to current application's NSString's stringWithString:stringToSearch
    -- escape pattern, in case it contains characters that have special meaning in patterns
    set escString to current application's NSRegularExpression's escapedPatternForString:findString
    if caseFlag then
        set theOptions to current application's NSRegularExpressionAnchorsMatchLines
    else
        set theOptions to (current application's NSRegularExpressionAnchorsMatchLines) + (get current application's NSRegularExpressionCaseInsensitive)
    end if
    set theRegex to current application's NSRegularExpression's regularExpressionWithPattern:("^.*" & escString & ".*$") options:theOptions |error|:(missing value)
    set theMatches to theRegex's matchesInString:stringToSearch options:0 range:{0, stringToSearch's |length|()}
    set theResults to {}
    repeat with aMatch in theMatches
        set end of theResults to (stringToSearch's substringWithRange:(aMatch's range())) as text
    end repeat
    return theResults
end findParsContaining:inString:matchCase:
```
Barring that, I’d go with sed for speed and simplicity of use.
The caveats are the time overhead of shelling-out, and the limitations on the size of data passed to a do shell script statement.
```applescript
set textVar to "
abaft
abaisance
abaiser
abaissed
abalienate
abalienation
abalone
Abama
abampere
abandon
abandonable
abandoned
abandonedly
abandonee
abandoner
abandonment
Abanic – Unicode added for testing ˆÒÚ¨
Abantes – Unicode added for testing •
abaptiston
Abarambo
"
set shCMD to "sed -En '/an/p' <<< " & quoted form of textVar
set foundLines to do shell script shCMD
```
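The sed call above treats its search text as a regular expression, so a search string containing characters such as `.` or `*` would need escaping first. One alternative that is not from the thread is `grep -F`, which matches fixed strings. Below is a minimal sketch of the shell side only, assuming made-up sample data; the variable names `needle`, `text`, and `found` are hypothetical.

```shell
# Hypothetical sample data and search string (not from the thread).
needle='a.c'
text='abc
a.c
xA.Cx
a*c'

# -F treats the needle as a fixed string, so "." is not a wildcard.
# -i ignores case; -e protects a needle that happens to start with "-".
found=$(printf '%s\n' "$text" | grep -F -i -e "$needle")
printf '%s\n' "$found"
```

Only the lines `a.c` and `xA.Cx` are printed; `abc` and `a*c` are skipped because the dot is matched literally. From AppleScript, the needle would still need to go through `quoted form of` when the command string is built.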
This will return the current command-size limit for your system:
```applescript
do shell script "sysctl kern.argmax"
```
On macOS Sierra 10.12.3 that would be:
kern.argmax: 262144
Bigger data can be handled by running the command on a file instead of a variable.
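As a rough sketch of the file-based variant, the same sed filter used in the script above can read from a file instead of receiving the text as part of the command. The temporary file and the four sample words here are assumptions for illustration; in practice the file would be whatever holds your data.

```shell
# Hypothetical temp file standing in for data too large to pass on the command line.
tmpfile=$(mktemp "${TMPDIR:-/tmp}/pars.XXXXXX")
printf '%s\n' abaft abandon abalone Abantes > "$tmpfile"

# Same filter as before, but sed reads the file, so the command itself stays short.
found_lines=$(sed -En '/an/p' "$tmpfile")
printf '%s\n' "$found_lines"

# Clean up the temporary file.
rm -f "$tmpfile"
```

This prints `abandon` and `Abantes`. When calling it from AppleScript, only the file path goes into the `do shell script` string (built with `quoted form of` the POSIX path), so the `kern.argmax` limit applies to the path, not to the data.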
If you want to use AppleScriptObjC, then changing the text (deleting the lines that don’t match in a single replace) should be significantly faster than finding the matches and extracting them one at a time.
This scary-looking regular expression removes lines that DON’T contain the given pattern (the backslash is doubled because it is written here as an AppleScript string literal):
(?m)^(?>(?:(?!an)).)*$\\R?
The relevant part is this: (?!an) — for NOT “an” — and that of course can be a different literal string or a more complex regular expression.
```applescript
------------------------------------------------------------------------------
use framework "Foundation"
use scripting additions
------------------------------------------------------------------------------
set textVar to "
abaft
abaisance
abaiser
abaissed
abalienate
abalienation
abalone
Abama
abampere
abandon
abandonable
abandoned
abandonedly
abandonee
abandoner
abandonment
Abanic – Unicode added for testing ˆÒÚ¨
Abantes – Unicode added for testing •
abaptiston
Abarambo
"
# Remove lines not containing “an”
set newStr to its cngStr:"(?m)^(?>(?:(?!an)).)*$\\R?" intoString:"" inString:textVar
# Strip vertical whitespace from the bottom of the data:
set newStr to its cngStr:"\\s+\\Z" intoString:"" inString:newStr
------------------------------------------------------------------------------
--» HANDLERS
------------------------------------------------------------------------------
on cngStr:findString intoString:replaceString inString:dataString
    set anNSString to current application's NSString's stringWithString:dataString
    set dataString to (anNSString's stringByReplacingOccurrencesOfString:findString withString:replaceString ¬
        options:(current application's NSRegularExpressionSearch) range:{0, anNSString's |length|()}) as text
end cngStr:intoString:inString:
------------------------------------------------------------------------------
```
With do shell script there are also potential problems with normalization of some Unicode characters. They’re very rare, but IMO the fact that it can happen makes it a poor choice, given there are alternatives.
Yes, that’s a much faster approach. However…
The \R metacharacter was only introduced into ICU with ICU 55. I’m not sure when that was released or incorporated into macOS, but it wasn’t there in 10.10. (Your code was the first time I’ve seen it, which is why I looked it up. Maybe someone running 10.11 can check.)
That’s going to fail with some Unicode characters. Change “length of dataString” to “anNSString’s |length|()”.