Work around "fake" umlauts

Hi, first off I’m an AppleScriptObjC newbie and have no idea of text encodings.

A user asked on another forum whether DEVONthink search hits could be accessed via AppleScript. Unfortunately they can’t (although that has been planned for at least four years).

I thought that would be very useful and started UI scripting, but via the UI it’s only possible to get one record’s hits at a time, meh. The idea of accessing more than one record stuck with me, so I started translating DEVONthink’s search syntax into regex (well aware that it will probably cause issues with long PDFs). That’s working surprisingly well, especially after I changed the part that gets a hit’s context from regex to substringWithRange. So I now have a nicely working script that takes a DEVONthink query, translates it into regex, searches all selected records and writes the hits plus some context into a markdown record.

Shortly before I was ready to let other users test the script, I realized how useful it would be to have a link from each hit in the script output to the matched text in the record, which would let you see the whole context with one click. That’s possible by appending ?search= plus the percent-escaped text to the record’s reference URL.

That’s where the problem starts. There’s something “strange” going on with (at least) German umlauts: after some links didn’t work, I discovered by accident that the text I had manually copied for debugging didn’t actually contain an umlaut ä but something else that only looks like one.

As I can’t describe what’s going on, I hope the demo script can show it:

use AppleScript version "2.7"
use framework "Foundation"
use scripting additions

set theString to current application's NSString's stringWithString:("ä") -- correct umlaut "ä" 
set theString_Length to (theString's |length|) as integer --> 1

-- alphanumericCharacterSet
set theCharacterset to (current application's NSCharacterSet's alphanumericCharacterSet())
set theString_tested to my testMembership(theString, theCharacterset) --> {true}
set theString_encoded to my encodeString(theString, theCharacterset) --> "%C3%A4"

-- decomposableCharacterSet
set theCharacterset to (current application's NSCharacterSet's decomposableCharacterSet())
set theString_tested to my testMembership(theString, theCharacterset) --> {true}
set theString_encoded to my encodeString(theString, theCharacterset) --> "%C3%A4"

-- nonBaseCharacterSet
set theCharacterset to (current application's NSCharacterSet's nonBaseCharacterSet())
set theString_tested to my testMembership(theString, theCharacterset) --> {false}
set theString_encoded to my encodeString(theString, theCharacterset) --> "%C3%A4"

---------------------------------------------------------------------------------------------------

set theString to current application's NSString's stringWithString:("ä") -- something that looks like "ä"
set theString_Length to (theString's |length|) as integer --> 2

-- alphanumericCharacterSet
set theCharacterset to (current application's NSCharacterSet's alphanumericCharacterSet())
set theString_tested to my testMembership(theString, theCharacterset) --> {true, true}
set theString_encoded to my encodeString(theString, theCharacterset) --> "a%CC%88"

-- decomposableCharacterSet
set theCharacterset to (current application's NSCharacterSet's decomposableCharacterSet())
set theString_tested to my testMembership(theString, theCharacterset) --> {false, false}
set theString_encoded to my encodeString(theString, theCharacterset) --> "%61%CC%88"

-- nonBaseCharacterSet
set theCharacterset to (current application's NSCharacterSet's nonBaseCharacterSet())
set theString_tested to my testMembership(theString, theCharacterset) --> {false, true}
set theString_encoded to my encodeString(theString, theCharacterset) --> "%61%CC%88"


on encodeString(theString, theCharacterset)
	try
		set theString_encoded to (theString's stringByAddingPercentEncodingWithAllowedCharacters:theCharacterset) as string
		return theString_encoded -- explicit return instead of relying on the implicit result
	on error error_message number error_number
		activate
		display alert "Error: Handler \"encodeString\"" message error_message as warning
		error number -128
	end try
end encodeString

on testMembership(theString, theCharacterset)
	try
		set theMembership to {}
		set theString_Length to (theString's |length|) as integer
		repeat with i from 0 to (theString_Length - 1)
			set thisCharacter to (theString's characterAtIndex:i)
			set end of theMembership to (theCharacterset's characterIsMember:thisCharacter) as boolean
		end repeat
		return theMembership
	on error error_message number error_number
		activate
		display alert "Error: Handler \"testMembership\"" message error_message as warning
		error number -128
	end try
end testMembership


I already tried everything I could think of to make the links work reliably. Here’s what I found:

If example text spätestens abc copied from a PDF contains a “fake” umlaut, then the link won’t work:

  • if only spaces are encoded (spätestens%20abc)
  • if everything is encoded (spa%CC%88testens%20abc)

If I replace the “fake” umlaut with an actual umlaut ä, then the link works:

  • if only spaces are encoded (spätestens%20abc)
  • if everything is encoded (sp%C3%A4testens%20abc)

That’s so strange to me: I just copied this text, and then it doesn’t match itself?

I’ve no idea what to do, as

  • it doesn’t seem to be possible to replace such “fake” umlauts via regex (at least not for me)
  • it’s not possible to use wildcards in a link (even if I still wouldn’t know how to replace)
  • it seems NSCharacterSets don’t help here
  • it seems it’s not an encoding issue as everything I tried reported UTF-8
  • it wasn’t possible for me to convert the string via iconv CLI (maybe because it’s already UTF-8)

Any pointers on what’s going on and how to work around it?

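Characters like ä can be stored in two ways in Unicode: as a single precomposed code point (U+00E4), or as two code points, the base letter a followed by a combining diaeresis (U+0308). The characters in the PDF are probably using the separate, decomposed form, which is why the string has length 2 and encodes as a%CC%88.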

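NSString has methods for converting between the two: you want precomposedStringWithCanonicalMapping or precomposedStringWithCompatibilityMapping. The canonical mapping (NFC) only composes such sequences; the compatibility mapping (NFKC) additionally replaces compatibility characters such as ligatures, so it can change more of the text.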
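A minimal sketch of that conversion, assuming the decomposed form is built by hand from "a" plus U+0308 so the example doesn’t depend on how the character was pasted. The variable names are made up for illustration.

```applescript
use AppleScript version "2.7"
use framework "Foundation"
use scripting additions

-- Build the decomposed form explicitly: "a" followed by U+0308 (combining diaeresis)
set decomposed to current application's NSString's stringWithString:("a" & (character id 776))
set decomposedLength to (decomposed's |length|()) as integer --> 2

-- Compose it into the single code point U+00E4 (NFC)
set precomposed to decomposed's precomposedStringWithCanonicalMapping()
set precomposedLength to (precomposed's |length|()) as integer --> 1

-- isEqualToString: compares literally, so only the normalized string equals a real "ä"
set beforeMatches to (decomposed's isEqualToString:(character id 228)) as boolean --> false
set afterMatches to (precomposed's isEqualToString:(character id 228)) as boolean --> true
```

Running the text copied from the PDF through the same call before building the link should give the form that matches.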

I’d also point out that the character set you pass to stringByAddingPercentEncodingWithAllowedCharacters: should be something like URLPathAllowedCharacterSet or URLQueryAllowedCharacterSet, or one of the related character sets. Since the text goes after ?search=, URLQueryAllowedCharacterSet is the natural choice here.
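Putting both suggestions together, a link builder might look roughly like this. It’s only a sketch: makeSearchLink is a hypothetical handler name, and the reference URL is a placeholder rather than a real record.

```applescript
use AppleScript version "2.7"
use framework "Foundation"
use scripting additions

-- Placeholder reference URL; the hit text contains a decomposed "ä" ("a" + U+0308)
set theLink to my makeSearchLink("x-devonthink-item://EXAMPLE-UUID", "spa" & (character id 776) & "testens abc")
--> "x-devonthink-item://EXAMPLE-UUID?search=sp%C3%A4testens%20abc"

-- Hypothetical handler: normalize the hit text, percent-encode it, append it as ?search=
on makeSearchLink(theReferenceURL, theHitText)
	set theString to current application's NSString's stringWithString:theHitText
	-- Fold "a" + combining diaeresis into a single "ä" before encoding
	set theString to theString's precomposedStringWithCanonicalMapping()
	set theCharacterset to current application's NSCharacterSet's URLQueryAllowedCharacterSet()
	set theString_encoded to (theString's stringByAddingPercentEncodingWithAllowedCharacters:theCharacterset) as text
	return theReferenceURL & "?search=" & theString_encoded
end makeSearchLink
```

Note that URLQueryAllowedCharacterSet leaves characters such as & and = unescaped, so hit text containing them may need extra handling.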

Damn, I recently read about that but didn’t recall it when I spent hours trying to find a solution. Using precomposedStringWithCanonicalMapping solved it, thank you very much!

I have another question, though I don’t know whether it’s something I should think about at all.

Currently the script does over 100(!) conversions from either AppleScript string -> NSString or NSString -> AppleScript string to process a simple two word query.

I got into this mess because I’ve used ASObjC only where I couldn’t use vanilla AppleScript. It would take a long time to change this (bloody beginner), but I would do it if it speeds things up or is recommended for other reasons. Is this something I should care about at all?

Yes, repeated conversion back and forth adds overhead, especially for larger values. But whether you bother changing things depends a bit on whether time is an issue with your script.

Never mind, I just tested conversions with a repeat loop and it’s blazing fast. Thank you, much appreciated!
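For anyone who wants to run a similar check, one way to time the round trip is sketched below. This is not the script used above; the iteration count and sample text are arbitrary assumptions, and results will depend on the machine and the string size.

```applescript
use AppleScript version "2.7"
use framework "Foundation"
use scripting additions

set theText to "spätestens abc" -- arbitrary sample text
set theCount to 10000 -- arbitrary number of round trips

set startDate to current application's NSDate's |date|()
repeat theCount times
	-- AppleScript text -> NSString -> AppleScript text
	set theNSString to (current application's NSString's stringWithString:theText)
	set theRoundTrip to theNSString as text
end repeat
-- timeIntervalSinceNow is negative for a date in the past, so flip the sign
set elapsedSeconds to -((startDate's timeIntervalSinceNow()) as real)
```

Dividing elapsedSeconds by theCount gives a rough per-conversion cost to compare against how long the rest of the script takes.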