Is there an easy way to remove whitespace and leading/trailing empty lines from an attributed string, like “stringByTrimmingCharactersInSet:(current application’s NSCharacterSet’s whitespaceAndNewlineCharacterSet)” does for strings?
Or, as a more general question: how do I find and replace text in attributed strings?
There are no simple methods. You generally need to get the string() of the attributed string, search that, then use the ranges to adjust a mutable version of the attributed string. So generally you’re probably going to use NSRegularExpression’s matchesInString:options:range:, followed by deleteCharactersInRange: or replaceCharactersInRange:withString:.
… and using a reversed repeat, of course.
A very important point…
Thanks Nigel and Shane,
I feared it wouldn’t be that simple.
I had written a little code snippet that should split a multi-line styled text block into styled paragraphs, each one cleaned:
set checkString to (rtfTxBlock's |string|()) as string
set parList to every paragraph of checkString
--
set attrParList to {}
set rangeStart to 0
set i to 0
repeat with thePar in parList
set i to i + 1
set thePar to thePar as text
set theLen to (length of thePar)
--
set cleanPar to my cleanWhiteSpace(thePar)
set cleanLen to (length of cleanPar)
if not cleanLen = theLen then
set startOff to (offset of cleanPar in thePar) - 1
if startOff > 1 then
set rangeStart to rangeStart + startOff
end if
set theLen to cleanLen
end if
--
set searchRange to {location:rangeStart, |length|:theLen}
set attrPar to (rtfTxBlock's RTFFromRange:({location:rangeStart, |length|:theLen}) documentAttributes:(missing value))
set end of attrParList to attrPar
--
if theLen = 0 then
set zStep to 1
else
set zStep to 2
end if
set rangeStart to rangeStart + theLen + zStep
end repeat
This somehow works. I get a list of data items; each item can be saved to disk and looks as expected. But when I try to address those list items for further handling, I get an error message:
"-[NSConcreteMutableData string]: unrecognized selector sent to instance 0x600005b76880"
How do I convert the data returned by “RTFFromRange” in the code above to an attributed string?
To answer my own question:
set {attrPar, theError} to (current application's NSAttributedString's alloc()'s initWithData:attrPar options:(missing value) documentAttributes:(missing value) |error|:(reference))
will do the job.
Shame on me for not finding it earlier.
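As a possible shortcut, assuming only the attributed substring is needed and not the RTF data itself, NSAttributedString's attributedSubstringFromRange: returns an attributed string directly, which would avoid the round trip through RTF data. A minimal sketch, using a hypothetical stand-in for rtfTxBlock:

```applescript
use AppleScript version "2.4" -- OS X 10.10 or later
use framework "Foundation"
use scripting additions

-- Hypothetical stand-in for rtfTxBlock (a plain attributed string built for illustration).
set rtfTxBlock to current application's NSAttributedString's alloc()'s initWithString:("First paragraph" & linefeed & "Second paragraph")
-- Take the first 15 characters, attributes included, as a new NSAttributedString.
set attrPar to rtfTxBlock's attributedSubstringFromRange:({location:0, |length|:15})
-- The result responds to |string|(), unlike the NSData returned by RTFFromRange.
return attrPar's |string|() as text
```

The range still has to be expressed in NSString terms, so it is safest to derive it from the attributed string's own |string|() rather than from an AppleScript text.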
“cleanWhiteSpace(someString)” uses “whitespaceAndNewlineCharacterSet”:
set checkString to (rtfTxBlock's |string|()) as string
--
set cleanString to my cleanWhiteSpace(checkString)
set zStart to (offset of cleanString in checkString) - 1
set theLen to (length of cleanString)
set someRTF to (rtfTxBlock's RTFFromRange:({location:zStart, |length|:theLen}) documentAttributes:(missing value))
set {rtfTxBlock, theError} to (NSAttributedString's alloc()'s initWithData:someRTF options:(missing value) documentAttributes:(missing value) |error|:(reference))
set checkString to rtfTxBlock's |string|() as string
--
set parList to every paragraph of checkString
--
set attrParList to {}
set rangeStart to 0
repeat with thePar in parList
set thePar to thePar as text
set theLen to (length of thePar)
--
set searchRange to {location:rangeStart, |length|:theLen}
try
set attrPar to (rtfTxBlock's RTFFromRange:({location:rangeStart, |length|:theLen}) documentAttributes:(missing value))
set {attrPar, theError} to (NSAttributedString's alloc()'s initWithData:attrPar options:(missing value) documentAttributes:(missing value) |error|:(reference))
-- second check
set cleanPar to my cleanWhiteSpace(thePar)
set cleanLen to (length of cleanPar)
if not cleanLen = theLen then
set finalLen to cleanLen
set tmpOff to (offset of cleanPar in thePar) - 1
if tmpOff > 0 then
set attrPar to (attrPar's RTFFromRange:({location:tmpOff, |length|:cleanLen}) documentAttributes:(missing value))
set {attrPar, theError} to (NSAttributedString's alloc()'s initWithData:attrPar options:(missing value) documentAttributes:(missing value) |error|:(reference))
else
if cleanLen = 0 then
set attrPar to (attrPar's RTFFromRange:({location:0, |length|:0}) documentAttributes:(missing value))
else
set attrPar to (attrPar's RTFFromRange:({location:0, |length|:cleanLen}) documentAttributes:(missing value))
end if
set {attrPar, theError} to (NSAttributedString's alloc()'s initWithData:attrPar options:(missing value) documentAttributes:(missing value) |error|:(reference))
end if
else
set finalLen to theLen
end if
--
if not finalLen = 0 then
set end of attrParList to attrPar
end if
on error errMsg
log errMsg
end try
--
set rangeStart to rangeStart + theLen + 1
end repeat
log "---------------------------"
repeat with attrPar in attrParList
set attrPar to attrPar's |string|() as text
log attrPar
end repeat
Hi Andreas.
I think Shane had something like this in mind:
-- Assuming that rtfTxBlock's your NSAttributedString, get its string().
set checkString to rtfTxBlock's |string|()
-- Use NSRegularExpression to find the ranges of white-space substrings to cut.
set leadingAndTrailingWhiteSpaceRegex to current application's class "NSRegularExpression"'s regularExpressionWithPattern:("(?m)^\\s++|\\h++$") options:(0) |error|:(missing value)
set rangesToCut to (leadingAndTrailingWhiteSpaceRegex's matchesInString:(checkString) options:(0) range:({0, checkString's |length|()}))'s valueForKey:("range")
-- Get a mutable version of the NSAttributedString.
set rtfTxBlock to rtfTxBlock's mutableCopy()
-- Delete the found white-space substrings from it, working from last to first to preserve the ranges of substrings not yet deleted.
repeat with i from (count rangesToCut) to 1 by -1
tell rtfTxBlock to deleteCharactersInRange:(item i of rangesToCut)
end repeat
-- (Check the result.)
-- return rtfTxBlock's |string|() as text
It’s generally best not to equate lengths of AppleScript texts with |length|()s of equivalent NSStrings and NSAttributedStrings. They’re often the same, but not always!
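The difference can be checked with a small sketch like the following. The sample text is a hypothetical one containing a character outside the Basic Multilingual Plane; NSString's |length|() counts UTF-16 code units, while AppleScript counts characters its own way, so the two numbers need not agree:

```applescript
use AppleScript version "2.4" -- OS X 10.10 or later
use framework "Foundation"
use scripting additions

-- Hypothetical sample: "a", an emoji (U+1F600) and "b".
set sampleText to "a" & (character id 128512) & "b"
-- AppleScript's own count of the characters in the text.
set asLength to length of sampleText
-- NSString's count of UTF-16 code units (the emoji needs a surrogate pair).
set nsLength to (current application's NSString's stringWithString:(sampleText))'s |length|()
return {asLength, nsLength}
```

Any range passed to an NSAttributedString method is measured in the second kind of unit, which is why offsets computed from AppleScript text can drift.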
Wow! Thanks Nigel!
This works like a charm.
To be honest I’m not very good with RegEx since there was no need before.
The script I posted has a double usage for my app.
I need cleaned text blocks (which don’t contain empty lines and whitespace) and a list of non-empty paragraphs without any whitespace. The final ‘rtfParts’ then will be used to be split into their attributes to be converted to XML flavours or HTML like text.
So I finally can combine both codes into a simple code snippet for my usage.
Again many thanks.
P.S.
Can you recommend any resources for learning RegEx? For example, how would I remove multiple spaces between words?
Pretty much, with perhaps a less fancy pattern. The metacharacter \h was introduced in ICU 55, and I’m not sure when that was introduced in Cocoa, so if the code has to run under older versions of macOS it might need changing.
I learned originally from Regular-Expressions.info. But while it covers several different “flavours” of regex, I don’t recall it mentioning ICU regex, which is what Apple’s “Foundation” regular expressions use. (The flavours used by the various shell script commands are something else again!) Still, you can learn the basics at the above site if it suits you and make adjustments as necessary.
Thanks for the warning! The code seems to work on my El Capitan system.
Yes, I had a vague memory of 10.11 being the version that ICU 55 was introduced in. Anyway, it introduced \R, \V, \v, \H, \h, and named capture back-references.
Here’s a different approach to the same problem. It’s probably a fraction slower, but it may be preferable where a literal search is all that’s required.
property NSRegularExpressionSearch : a reference to 1024
[...]
set rtfTxBlock to rtfTxBlock's mutableCopy()
set checkString to rtfTxBlock's |string|()
repeat
set theRange to checkString's rangeOfString:"(?m)^\\s++|\\h++$" options:NSRegularExpressionSearch
if |length| of theRange = 0 then exit repeat
rtfTxBlock's deleteCharactersInRange:theRange
end repeat
!!!
I was pretty sure it wouldn’t work when I read it! But apparently an NSAttributedString’s string property is “the current backing store of the attributed string object”, so presumably it’s mutable in an NSMutableAttributedString and is updated each time characters are deleted from the attributed string.
The NSMutableAttributedString class explicitly has a mutableString property. According to the documentation: “The receiver tracks changes to this string and keeps its attribute mappings up to date.” So you could delete the characters from checkString instead and rtfTxBlock would be kept up-to-date. Looking good so far:
set rtfTxBlock to rtfTxBlock's mutableCopy()
set checkString to rtfTxBlock's mutableString()
tell checkString to replaceOccurrencesOfString:("(?m)^\\s++|\\h++$") withString:("") options:(current application's NSRegularExpressionSearch) range:({0, its |length|()})
-- (Check the result.)
return rtfTxBlock --'s |string|() as text
With this version of the regex, the script will also reduce any run of multiple spaces between visible characters to a single space (this doesn’t apply to non-break spaces, tabs, line endings, or ordinary spaces accompanied by any of these), at the same time as tidying up the paragraphs:
"(?m)^\\s++|\\h++$|(?<=\\S\\u0020)\\u0020++(?=\\S)"
If you like, you can replace each instance of “\\u0020” with a literal space. I’ve only used the longer form here to make it obvious.
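To see what the extended pattern does without involving styled text, it can be tried on a plain NSMutableString first. This is only a sketch: the sample text is hypothetical, and it assumes a system whose ICU version supports \h (see the notes on ICU 55 above):

```applescript
use AppleScript version "2.4" -- OS X 10.10 or later
use framework "Foundation"
use scripting additions

-- Hypothetical sample: leading and trailing white space, an empty line, and doubled spaces.
set sampleText to "  Hello   world  " & linefeed & linefeed & tab & "Second    line " & linefeed
set testString to current application's NSMutableString's stringWithString:(sampleText)
set thePattern to "(?m)^\\s++|\\h++$|(?<=\\S\\u0020)\\u0020++(?=\\S)"
-- Delete every match in place, exactly as in the mutableString() script.
tell testString to replaceOccurrencesOfString:(thePattern) withString:("") options:(current application's NSRegularExpressionSearch) range:({0, its |length|()})
return testString as text
```

Comparing the returned text with the sample shows which runs each alternative of the pattern removed.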
Yes, that’s considerably faster, too.
This is awesome!!!
Should be moved to a prominent place in the forum.
Many thanks Nigel.
It will take me some time to learn RegEx well enough to write an expression like that.
Thanks for the link to the site. Seems like a good resource.
Yes. It’s become rather complex visually. Basically, it matches any text which is either:
- A run of one or more white space characters of any type, starting at the beginning of a line (ie. paragraph). The white space can include line endings, so it’s everything from the beginning of a line up to the next non-space character (or to the end of the text if sooner), including empty lines and lines containing only white space.
OR: - A run of horizontal white space characters at the end of a line. Horizontal white space doesn’t include line endings, so a match only occurs if one or more horizontal spaces are followed by a line ending. The line ending itself isn’t included in the match.
OR: - A run of literal space characters (ie. character id 32) which follow a non-space character and a literal space and which are followed by a non-space character. It would be possible to include non-break spaces and/or tabs too if preferred.
In the scripts, searches start at the beginning of the text and, if a match is found, resume from the next character after the match, so the three possibilities above don’t interfere with each other.
The regex “flavour” is ICU.
(?m) : a flag which makes ^ and $ match the beginnings and ends of lines instead of just the beginning and end of the text.
^\\s++ : one or more white space characters at the beginning of a line.
| : OR
\\h++$ : one or more horizontal white space characters at the end of a line.
| : OR
(?<=\\S\\u0020) : a “look behind” specifying that the matched text must be immediately preceded by a non-space character and a literal space in the source text.
\\u0020++ : one or more literal (character id 32) space characters.
(?=\\S) : a “look ahead” specifying that the matched text must be immediately followed by a non-space character in the source text.
It’s probably worth pointing out that Nigel has given what might be termed the ideal pattern, but you often don’t require that level of skill, especially as it sounds like you’re dealing with relatively short strings. Things like Nigel’s use of ++ instead of + increase efficiency, but aren’t actually necessary in this case. And similarly, especially when you’re starting out, it can be easier to perform multiple searches, one for each case you’re dealing with, rather than trying to do it all in one.
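The "one search per case" idea might look something like this. It is a sketch only: the handler name tidyAttributedString is hypothetical, the three patterns are simplified stand-ins for Nigel's single pattern (plain + instead of ++, and [ \t] instead of \h, so non-break spaces at line ends are left alone), and the handler is assumed to receive an NSAttributedString:

```applescript
use AppleScript version "2.4" -- OS X 10.10 or later
use framework "Foundation"
use scripting additions

-- Hypothetical handler: returns a tidied mutable copy of an NSAttributedString.
on tidyAttributedString(attrString)
	set mutableAttr to attrString's mutableCopy()
	set theString to mutableAttr's mutableString()
	set regexSearch to current application's NSRegularExpressionSearch
	-- One simple pattern per case: leading white space and empty lines,
	-- trailing spaces and tabs, then extra spaces between words.
	repeat with thePattern in {"(?m)^\\s+", "(?m)[ \\t]+$", "(?<=\\S ) +(?=\\S)"}
		tell theString to replaceOccurrencesOfString:(contents of thePattern) withString:("") options:(regexSearch) range:({0, its |length|()})
	end repeat
	return mutableAttr
end tidyAttributedString

-- Usage with a hypothetical sample string.
set attrString to current application's NSAttributedString's alloc()'s initWithString:("  Hello   world  ")
set tidied to my tidyAttributedString(attrString)
return tidied's |string|() as text
```

Each pass is easy to test and change on its own, at the cost of scanning the text three times, which hardly matters for short strings.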
Nigel, Shane,
Thanks again, and shame on me for reacting so late to your replies.
This weekend I finally found a bit of uninterrupted time to work on my app and these regex things again.
I followed Shane’s advice to keep things simple.
There was one thing I didn’t understand:
I have a pattern: "<font *.* color=.+>+\\b"
This was meant to cover things like "<font name=Helvetica color=#ff00ff>", "<font size=14 color=#ff00ff>", "<font color=#ff00ff>", etc.
In an older script I did a ‘pre-flight’ with a shell script and ‘egrep’ which worked fine.
Now I have tried to bring that into ASObjC with a regular expression:
set vPattern to "<font *.* color=.+>+\\b"
set vRegex to current application's class "NSRegularExpression"'s regularExpressionWithPattern:(vPattern) options:(0) |error|:(missing value)
set theRanges to (vRegex's matchesInString:(theString) options:(0) range:({0, theString's |length|()}))'s valueForKey:("range")
This works but only returns the ranges for "<font color=#ff00ff>" and similar.
What am I missing?
Unless you tell it otherwise, . in NSRegularExpression matches any character except line breaks, and .* and .+ are greedy, so they can run on past the > that closes the tag. Try a pattern like this: "<font *[^>]* color=[^>]+>".
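The suggested pattern can be tried on a small sample along these lines. The sample markup is hypothetical; the pattern is the one given above, and [^>] keeps each match inside a single tag:

```applescript
use AppleScript version "2.4" -- OS X 10.10 or later
use framework "Foundation"
use scripting additions

-- Hypothetical sample markup with three kinds of font tag.
set sampleText to "<font name=Helvetica color=#ff00ff>A</font> <font size=14 color=#00ff00>B</font> <font color=#0000ff>C</font>"
set theString to current application's NSString's stringWithString:(sampleText)
set vRegex to current application's NSRegularExpression's regularExpressionWithPattern:("<font *[^>]* color=[^>]+>") options:(0) |error|:(missing value)
set theMatches to (vRegex's matchesInString:(theString) options:(0) range:({0, theString's |length|()})) as list
-- Collect the matched text of each opening tag for inspection.
set foundTags to {}
repeat with aMatch in theMatches
	set end of foundTags to (theString's substringWithRange:(aMatch's rangeAtIndex:0)) as text
end repeat
return foundTags
```

Returning the matched substrings rather than the bare ranges makes it easier to see whether every tag variant was picked up.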