Clean up Attributed String

Is there an easy way to remove whitespace and leading/trailing empty lines from an attributed string, like “stringByTrimmingCharactersInSet:(current application's NSCharacterSet's whitespaceAndNewlineCharacterSet())” does for strings?
Or, as a more general question: how do you find and replace text in attributed strings?

There are no simple methods. You generally need to get the string() of the attributed string, search that, then use the resulting ranges to adjust a mutable version of the attributed string. So you’re probably going to use NSRegularExpression’s matchesInString:options:range:, followed by deleteCharactersInRange: or replaceCharactersInRange:withString:.

… and using a reversed repeat, of course. :slight_smile:

A very important point… :smile:

Thanks Nigel and Shane,

I feared it wouldn’t be that simple.
I had written a little code snippet that should split a multi-line styled text block into styled paragraphs, each one cleaned.

set checkString to (rtfTxBlock's |string|()) as string
set parList to every paragraph of checkString
--
set attrParList to {}
set rangeStart to 0
set i to 0
repeat with thePar in parList
	set i to i + 1
	set thePar to thePar as text
	set theLen to (length of thePar)
	--
	set cleanPar to my cleanWhiteSpace(thePar)
	set cleanLen to (length of cleanPar)
	if not cleanLen = theLen then
		set startOff to (offset of cleanPar in thePar) - 1
		if startOff > 1 then
			set rangeStart to rangeStart + startOff
		end if
		set theLen to cleanLen
	end if
	--
	set searchRange to {location:rangeStart, |length|:theLen}
	set attrPar to (rtfTxBlock's RTFFromRange:({location:rangeStart, |length|:theLen}) documentAttributes:(missing value))
	set end of attrParList to attrPar
	--
	if theLen = 0 then
		set zStep to 1
	else
		set zStep to 2
	end if
	set rangeStart to rangeStart + theLen + zStep
end repeat

This works fine, more or less. I get a list of data items, each of which can be saved to disk and looks as expected. But when I try to address those list items for further handling, I get an error message:
"-[NSConcreteMutableData string]: unrecognized selector sent to instance 0x600005b76880"

How do I convert the data returned by “RTFFromRange” in the code above to an attributed string?

To answer my own question:

	set {attrPar, theError} to (current application's NSAttributedString's alloc()'s initWithData:attrPar options:(missing value) documentAttributes:(missing value) |error|:(reference))

will do the job.
Shame on me for not finding it earlier.
“cleanWhiteSpace(someString)” uses “whitespaceAndNewlineCharacterSet”.

set checkString to (rtfTxBlock's |string|()) as string
--
set cleanString to my cleanWhiteSpace(checkString)
set zStart to (offset of cleanString in checkString) - 1
set theLen to (length of cleanString)

set someRTF to (rtfTxBlock's RTFFromRange:({location:zStart, |length|:theLen}) documentAttributes:(missing value))
set {rtfTxBlock, theError} to (current application's NSAttributedString's alloc()'s initWithData:someRTF options:(missing value) documentAttributes:(missing value) |error|:(reference))
set checkString to rtfTxBlock's |string|() as string
--
set parList to every paragraph of checkString
--
set attrParList to {}
set rangeStart to 0
repeat with thePar in parList
	set thePar to thePar as text
	set theLen to (length of thePar)
	--
	set searchRange to {location:rangeStart, |length|:theLen}
	try
		set attrPar to (rtfTxBlock's RTFFromRange:({location:rangeStart, |length|:theLen}) documentAttributes:(missing value))
		set {attrPar, theError} to (current application's NSAttributedString's alloc()'s initWithData:attrPar options:(missing value) documentAttributes:(missing value) |error|:(reference))
		-- second check
		set cleanPar to my cleanWhiteSpace(thePar)
		set cleanLen to (length of cleanPar)
		if not cleanLen = theLen then
			set finalLen to cleanLen
			set tmpOff to (offset of cleanPar in thePar) - 1
			if tmpOff > 0 then
				set attrPar to (attrPar's RTFFromRange:({location:tmpOff, |length|:cleanLen}) documentAttributes:(missing value))
				set {attrPar, theError} to (current application's NSAttributedString's alloc()'s initWithData:attrPar options:(missing value) documentAttributes:(missing value) |error|:(reference))
			else
				if cleanLen = 0 then
					set attrPar to (attrPar's RTFFromRange:({location:0, |length|:0}) documentAttributes:(missing value))
				else
					set attrPar to (attrPar's RTFFromRange:({location:0, |length|:cleanLen}) documentAttributes:(missing value))
				end if
				set {attrPar, theError} to (current application's NSAttributedString's alloc()'s initWithData:attrPar options:(missing value) documentAttributes:(missing value) |error|:(reference))
			end if
		else
			set finalLen to theLen
		end if
		--
		if not finalLen = 0 then
			set end of attrParList to attrPar
		end if
	on error errMsg
		log errMsg
	end try
	--
	set rangeStart to rangeStart + theLen + 1
end repeat
log "---------------------------"
repeat with attrPar in attrParList
	set attrPar to attrPar's |string|() as text
	log attrPar
end repeat

Hi Andreas.

I think Shane had something like this in mind:

-- Assuming that rtfTxBlock's your NSAttributedString, get its string().
set checkString to rtfTxBlock's |string|()
-- Use NSRegularExpression to find the ranges of white-space substrings to cut.
set leadingAndTrailingWhiteSpaceRegex to current application's class "NSRegularExpression"'s regularExpressionWithPattern:("(?m)^\\s++|\\h++$") options:(0) |error|:(missing value)
set rangesToCut to (leadingAndTrailingWhiteSpaceRegex's matchesInString:(checkString) options:(0) range:({0, checkString's |length|()}))'s valueForKey:("range")
-- Get a mutable version of the NSAttributedString.
set rtfTxBlock to rtfTxBlock's mutableCopy()
-- Delete the found white-space substrings from it, working from last to first to preserve the ranges of substrings not yet deleted.
repeat with i from (count rangesToCut) to 1 by -1
	tell rtfTxBlock to deleteCharactersInRange:(item i of rangesToCut)
end repeat

-- (Check the result.)
-- return rtfTxBlock's |string|() as text

It’s generally best not to equate lengths of AppleScript texts with |length|()s of equivalent NSStrings and NSAttributedStrings. They’re often the same, but not always!
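
As a rough illustration of that warning, here is a minimal sketch that counts the same text both ways. The sample text (a letter followed by an emoji) is my own assumption, chosen because the emoji needs two UTF-16 code units in an NSString; the variable names are hypothetical.

```applescript
use AppleScript version "2.4" -- Yosemite (10.10) or later
use framework "Foundation"
use scripting additions

-- Hypothetical sample: one ordinary letter plus one emoji.
set sampleText to "a😀"

-- AppleScript's own character count for the text.
set asLength to length of sampleText

-- NSString's length counts UTF-16 code units, so the emoji counts as two.
set nsString to current application's NSString's stringWithString:sampleText
set nsLength to (nsString's |length|()) as integer

-- Compare the two counts; they need not agree.
set lengthsDiffer to (asLength is not nsLength)
```

If the counts differ, a range worked out from AppleScript offsets will point at the wrong characters in the attributed string, which is why the scripts in this thread take their ranges from the NSString side.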

Wow! Thanks Nigel!

This works like a charm.
To be honest, I’m not very good with RegEx since I never needed it before.
The script I posted serves a dual purpose in my app.
I need cleaned text blocks (which don’t contain empty lines or surplus whitespace) and a list of non-empty paragraphs without any surrounding whitespace. The final ‘rtfParts’ will then be split into their attributes to be converted to XML flavours or HTML-like text.

So I can finally combine both scripts into one simple code snippet for my usage.
Again many thanks.

P.S.
Can you recommend any resource for learning RegEx? For example, how would I remove multiple spaces between words?

Pretty much, with perhaps a less fancy pattern :wink:. The metacharacter \h was introduced in ICU 55, and I’m not sure when that was introduced in Cocoa, so if the code has to run under older versions of macOS it might need changing.

I learned originally from Regular-Expressions.info. But while it covers several different “flavours” of regex, I don’t recall it mentioning ICU regex, which is what Apple’s “Foundation” regular expressions use. (The flavours used by the various shell script commands are something else again!) Still, you can learn the basics at the above site if it suits you and make adjustments as necessary.

Thanks for the warning! The code seems to work on my El Capitan system.

Yes, had a vague memory of 10.11 being the version that ICU 55 was introduced in. Anyway, it introduced \R, \V, \v, \H, \h, and named capture back-references.

Here’s a different approach to the same problem. It’s probably a fraction slower, but it may be preferable where a literal search is all that’s required.

property NSRegularExpressionSearch : a reference to 1024

[...]

set rtfTxBlock to rtfTxBlock's mutableCopy()
set checkString to rtfTxBlock's |string|()
repeat
	set theRange to checkString's rangeOfString:"(?m)^\\s++|\\h++$" options:NSRegularExpressionSearch
	if |length| of theRange = 0 then exit repeat
	rtfTxBlock's deleteCharactersInRange:theRange
end repeat

!!!

I was pretty sure it wouldn’t work when I read it! But apparently an NSAttributedString’s string property is “the current backing store of the attributed string object”, so presumably it’s mutable in an NSMutableAttributedString and is updated each time characters are deleted from the attributed string.

The NSMutableAttributedString class explicitly has a mutableString property. According to the documentation: “The receiver tracks changes to this string and keeps its attribute mappings up to date.” So you could delete the characters from checkString instead and rtfTxBlock would be kept up-to-date. Looking good so far:

set rtfTxBlock to rtfTxBlock's mutableCopy()
set checkString to rtfTxBlock's mutableString()
tell checkString to replaceOccurrencesOfString:("(?m)^\\s++|\\h++$") withString:("") options:(current application's NSRegularExpressionSearch) range:({0, its |length|()})

-- (Check the result.)
return rtfTxBlock --'s |string|() as text

With this version of the regex, the script will also reduce any run of multiple spaces between visible characters to a single space (this doesn’t apply to non-break spaces, tabs, line endings, or ordinary spaces accompanied by any of these), at the same time as tidying up the paragraphs:

"(?m)^\\s++|\\h++$|(?<=\\S\\u0020)\\u0020++(?=\\S)"

If you like, you can replace each instance of “\\u0020” with a literal space. I’ve only used the longer form here to make it obvious.
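
For anyone who wants to try the longer pattern, here is a minimal sketch that drops it into the mutableString version of the script. The sample text and the plain initWithString: attributed string are my own stand-ins for rtfTxBlock, and \h needs ICU 55 or later (reportedly macOS 10.11 and up, as discussed above).

```applescript
use AppleScript version "2.4" -- Yosemite (10.10) or later
use framework "Foundation"
use scripting additions

-- Hypothetical sample standing in for rtfTxBlock: padded lines, an empty line and runs of spaces.
set sampleText to "  First   line  " & linefeed & linefeed & tab & "Second    line" & tab
set rtfTxBlock to current application's NSAttributedString's alloc()'s initWithString:sampleText

-- The combined pattern from the post above.
set thePattern to "(?m)^\\s++|\\h++$|(?<=\\S\\u0020)\\u0020++(?=\\S)"

-- Edit the mutable attributed string through its mutableString.
set rtfTxBlock to rtfTxBlock's mutableCopy()
set checkString to rtfTxBlock's mutableString()
tell checkString to replaceOccurrencesOfString:thePattern withString:"" options:(current application's NSRegularExpressionSearch) range:({0, its |length|()})

-- (Check the result.)
set cleanText to rtfTxBlock's |string|() as text
```

With this sample, cleanText should come out as the two lines "First line" and "Second line", with the empty line and the surplus spaces gone.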

Yes, that’s considerably faster, too.

This is awesome!!!
Should be moved to a prominent place in the forum.
Many thanks Nigel.

To figure out how to write an expression like that it will take me some time to learn RegEx better.
Thanks for the link to the site. Seems like a good resource.

Yes. It’s become rather complex visually. :slight_smile: Basically, it matches any text which is either:

  1. A run of one or more white space characters of any type, starting at the beginning of a line (i.e. paragraph). The white space can include line endings, so it’s everything from the beginning of a line up to the next non-space character (or to the end of the text if sooner), including empty lines and lines containing only white space.
    OR:
  2. A run of horizontal white space characters at the end of a line. Horizontal white space doesn’t include line endings, so a match only occurs if one or more horizontal spaces are followed by a line ending. The line ending itself isn’t included in the match.
    OR:
  3. A run of literal space characters (i.e. character id 32) which follow a non-space character and a literal space and which are followed by a non-space character. It would be possible to include non-break spaces and/or tabs too if preferred.

In the scripts, searches start at the beginning of the text and, if a match is found, resume from the next character after the match, so the three possibilities above don’t interfere with each other.

The regex “flavour” is ICU.

(?m) : a flag which makes ^ and $ match the beginnings and ends of lines instead of just the beginning and end of the text.
^\\s++ : one or more white space characters at the beginning of a line.
| : OR
\\h++$ : one or more horizontal white space characters at the end of a line.
| : OR
(?<=\\S\\u0020) : a “look behind” specifying that the matched text must be immediately preceded by a non-space character and a literal space in the source text.
\\u0020++ : one or more literal (character id 32) space characters.
(?=\\S) : a “look ahead” specifying that the matched text must be immediately followed by a non-space character in the source text.
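
To see which part of the pattern catches what, a small sketch like the following lists the location and length of every match in a short sample. The sample string and the variable names are hypothetical; the pattern is the one broken down above.

```applescript
use AppleScript version "2.4" -- Yosemite (10.10) or later
use framework "Foundation"
use scripting additions

-- Hypothetical sample: two leading spaces, a double space between words, a trailing space, then a second line.
set sampleText to current application's NSString's stringWithString:("  One  two " & linefeed & "three")
set thePattern to "(?m)^\\s++|\\h++$|(?<=\\S\\u0020)\\u0020++(?=\\S)"
set theRegex to current application's NSRegularExpression's regularExpressionWithPattern:thePattern options:0 |error|:(missing value)
set theMatches to theRegex's matchesInString:sampleText options:0 range:{0, sampleText's |length|()}

-- Collect {location, length} for each match, in the order found.
set matchInfo to {}
repeat with aMatch in theMatches
	set theRange to (aMatch's range())
	set end of matchInfo to {theRange's location, theRange's |length|}
end repeat
```

For this sample I would expect three matches: the two leading spaces (first alternative), the second of the two spaces between "One" and "two" (third alternative), and the space before the line ending (second alternative).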

It’s probably worth pointing out that Nigel has given what might be termed the ideal pattern, but you often don’t require that level of skill — especially as it sounds like you’re dealing with relatively short strings. Things like Nigel’s use of ++ instead of + increase efficiency, but aren’t actually necessary in this case. And similarly, especially when you’re starting out, it can be easier to perform multiple searches, one for each case you’re dealing with, rather than trying to do it all in one.
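
A minimal sketch of that one-search-per-case idea, using plainer patterns of my own choosing on a plain NSMutableString. The sample text, the three patterns and the variable names are all assumptions for illustration; they are simpler than the pattern above and do not treat every kind of white space identically.

```applescript
use AppleScript version "2.4" -- Yosemite (10.10) or later
use framework "Foundation"
use scripting additions

-- Hypothetical sample text.
set sampleText to "  First   line  " & linefeed & linefeed & tab & "Second    line" & tab
set workString to current application's NSMutableString's stringWithString:sampleText
set regexSearch to current application's NSRegularExpressionSearch

-- 1. Remove spaces and tabs at the ends of lines.
workString's replaceOccurrencesOfString:"(?m)[ \\t]+$" withString:"" options:regexSearch range:{0, workString's |length|()}

-- 2. Remove white space at the starts of lines (this also swallows empty lines).
workString's replaceOccurrencesOfString:"(?m)^\\s+" withString:"" options:regexSearch range:{0, workString's |length|()}

-- 3. Replace each run of two or more spaces with a single space.
workString's replaceOccurrencesOfString:" {2,}" withString:" " options:regexSearch range:{0, workString's |length|()}

-- (Check the result.)
set cleanText to workString as text
```

Each step can be tested on its own, which makes it easier to spot which pattern misbehaves while you are still learning the syntax.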

Nigel, Shane,

Thanks again, and shame on me for reacting so late to your replies.
This weekend I finally found a bit of uninterrupted time to work on my app and on this regex business again.

I followed Shane’s advice to keep things simple.
There was one thing I didn’t understand:
I have a pattern: "<font *.* color=.+>+\\b"
This was meant to cover things like "<font name=Helvetica color=#ff00ff>", "<font size=14 color=#ff00ff>", "<font color=#ff00ff>", etc.
In an older script I did a ‘pre-flight’ with a shell script and ‘egrep’ which worked fine.

Now I’ve tried to bring that into ASObjC with a regular expression:

set vPattern to "<font *.* color=.+>+\\b"
set vRegex to current application's class "NSRegularExpression"'s regularExpressionWithPattern:(vPattern) options:(0) |error|:(missing value)
set theRanges to (vRegex's matchesInString:(theString) options:(0) range:({0, theString's |length|()}))'s valueForKey:("range")

This works, but it only returns the ranges for "<font color=#ff00ff>" and similar.
What am I missing?

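Unless you tell it otherwise, . in NSRegularExpression matches any character except a line break, and quantifiers like .* and .+ are greedy, so they can run straight past a > towards the last possible match on the line. Try a pattern that can’t cross the end of a tag, like this: "<font *[^>]* color=[^>]+>".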
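
Here is a small sketch for trying that pattern out. The sample string with three font tags on one line is my own test data, and the variable names other than vPattern and vRegex are hypothetical.

```applescript
use AppleScript version "2.4" -- Yosemite (10.10) or later
use framework "Foundation"
use scripting additions

-- Hypothetical sample with three differently shaped font tags on one line.
set theString to current application's NSString's stringWithString:"<font name=Helvetica color=#ff00ff>A</font> <font size=14 color=#ff00ff>B</font> <font color=#ff00ff>C</font>"

-- [^>] cannot match a closing angle bracket, so no match can run past the end of its tag.
set vPattern to "<font *[^>]* color=[^>]+>"
set vRegex to current application's NSRegularExpression's regularExpressionWithPattern:vPattern options:0 |error|:(missing value)
set theMatches to vRegex's matchesInString:theString options:0 range:{0, theString's |length|()}

-- Collect the matched tags as AppleScript text.
set foundTags to {}
repeat with aMatch in theMatches
	set end of foundTags to (theString's substringWithRange:(aMatch's range())) as text
end repeat
```

With this sample, foundTags should hold the three opening font tags as separate items, and the closing </font> tags are not matched.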