Hi, first off: I’m an AppleScriptObjC newbie and know next to nothing about text encodings.
A user asked on another forum whether DEVONthink search hits could be accessed via AppleScript. Unfortunately they can’t (although that has been planned for at least four years).
I thought that would be very useful and started UI scripting, but via the UI it’s only possible to get one record’s hits at a time, meh. The idea of accessing more than one record stuck with me, so I started translating DEVONthink’s search syntax into regex (well aware that it will probably cause issues with long PDFs). That’s working surprisingly well, especially after I changed the part that gets a hit’s context from regex to substringWithRange. So I now have a nicely working script that takes a DEVONthink query, translates it into regex, searches all selected records and writes the hits plus some context into a Markdown record.
Shortly before I was ready to let other users test the script, I realized how useful it would be to have a link from a record’s hit in the script output to the matched text in the record, so the whole context can be seen with one click. That’s possible by appending ?search= plus the escaped text to the record’s reference URL.
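A minimal sketch of how such a link could be assembled. The handler name linkForHit and the reference URL are hypothetical placeholders (a real reference URL would come from the DEVONthink record), and URLQueryAllowedCharacterSet is just one plausible choice for the escaping:

```applescript
use AppleScript version "2.7"
use framework "Foundation"
use scripting additions

-- Placeholder reference URL, for illustration only
set theReferenceURL to "x-devonthink-item://00000000-0000-0000-0000-000000000000"
set theLink to my linkForHit(theReferenceURL, "some matched text")
--> "x-devonthink-item://00000000-0000-0000-0000-000000000000?search=some%20matched%20text"

-- Hypothetical helper: percent-encode the hit text and append it as ?search=
on linkForHit(theReferenceURL, theHitText)
	set theString to current application's NSString's stringWithString:theHitText
	set theAllowed to current application's NSCharacterSet's URLQueryAllowedCharacterSet()
	set theEscaped to (theString's stringByAddingPercentEncodingWithAllowedCharacters:theAllowed) as text
	return theReferenceURL & "?search=" & theEscaped
end linkForHit
```

Note that URLQueryAllowedCharacterSet leaves characters such as & and = unescaped, so hit text containing them may need a stricter character set.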
That’s where the problem starts. There’s something “strange” going on, at least with German umlauts: after some links didn’t work, I discovered by accident that the text I had manually copied for debugging didn’t actually contain an umlaut ä but something else that merely looks like one.
As I can’t describe what’s going on, I hope the demo script can:
use AppleScript version "2.7"
use framework "Foundation"
use scripting additions
set theString to current application's NSString's stringWithString:("ä") -- correct umlaut "ä"
set theString_Length to (theString's |length|) as integer --> 1
-- alphanumericCharacterSet
set theCharacterset to (current application's NSCharacterSet's alphanumericCharacterSet())
set theString_tested to my testMembership(theString, theCharacterset) --> {true}
set theString_encoded to my encodeString(theString, theCharacterset) --> "%C3%A4"
-- decomposableCharacterSet
set theCharacterset to (current application's NSCharacterSet's decomposableCharacterSet())
set theString_tested to my testMembership(theString, theCharacterset) --> {true}
set theString_encoded to my encodeString(theString, theCharacterset) --> "%C3%A4"
-- nonBaseCharacterSet
set theCharacterset to (current application's NSCharacterSet's nonBaseCharacterSet())
set theString_tested to my testMembership(theString, theCharacterset) --> {false}
set theString_encoded to my encodeString(theString, theCharacterset) --> "%C3%A4"
---------------------------------------------------------------------------------------------------
set theString to current application's NSString's stringWithString:("ä") -- something that looks like "ä"
set theString_Length to (theString's |length|) as integer --> 2
-- alphanumericCharacterSet
set theCharacterset to (current application's NSCharacterSet's alphanumericCharacterSet())
set theString_tested to my testMembership(theString, theCharacterset) --> {true, true}
set theString_encoded to my encodeString(theString, theCharacterset) --> "a%CC%88"
-- decomposableCharacterSet
set theCharacterset to (current application's NSCharacterSet's decomposableCharacterSet())
set theString_tested to my testMembership(theString, theCharacterset) --> {false, false}
set theString_encoded to my encodeString(theString, theCharacterset) --> "%61%CC%88"
-- nonBaseCharacterSet
set theCharacterset to (current application's NSCharacterSet's nonBaseCharacterSet())
set theString_tested to my testMembership(theString, theCharacterset) --> {false, true}
set theString_encoded to my encodeString(theString, theCharacterset) --> "%61%CC%88"
-- Percent-encodes theString, leaving characters in theCharacterset unescaped
on encodeString(theString, theCharacterset)
	try
		set theString_encoded to (theString's stringByAddingPercentEncodingWithAllowedCharacters:theCharacterset) as string
		return theString_encoded
	on error error_message number error_number
		activate
		display alert "Error: Handler \"encodeString\"" message error_message as warning
		error number -128
	end try
end encodeString
-- Returns one boolean per UTF-16 code unit: is it a member of theCharacterset?
on testMembership(theString, theCharacterset)
	try
		set theMembership to {}
		set theString_Length to (theString's |length|) as integer
		repeat with i from 0 to (theString_Length - 1)
			set thisCharacter to (theString's characterAtIndex:i)
			set end of theMembership to (theCharacterset's characterIsMember:thisCharacter) as boolean
		end repeat
		return theMembership
	on error error_message number error_number
		activate
		display alert "Error: Handler \"testMembership\"" message error_message as warning
		error number -128
	end try
end testMembership
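The a%CC%88 result hints at what the second string might contain: %CC%88 is the UTF-8 encoding of U+0308 (combining diaeresis), so the look-alike is presumably a plain a followed by that combining mark, while %C3%A4 is the single precomposed character U+00E4. A minimal sketch that makes this visible by listing UTF-16 code units; codeUnits is a hypothetical helper, and both forms are built explicitly so the result does not depend on how the text was pasted:

```applescript
use AppleScript version "2.7"
use framework "Foundation"
use scripting additions

-- Precomposed form (one code unit) and decomposed form (base letter plus combining mark)
set composed to (current application's NSString's stringWithString:"ä")'s precomposedStringWithCanonicalMapping()
set decomposed to composed's decomposedStringWithCanonicalMapping()

set composedUnits to my codeUnits(composed) --> {228}, i.e. U+00E4
set decomposedUnits to my codeUnits(decomposed) --> {97, 776}, i.e. "a" plus U+0308

-- Hypothetical helper: list the UTF-16 code units of an NSString as integers
on codeUnits(theString)
	set theUnits to {}
	repeat with i from 0 to ((theString's |length|()) as integer) - 1
		set end of theUnits to (theString's characterAtIndex:i) as integer
	end repeat
	return theUnits
end codeUnits
```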
I’ve already tried everything I could think of to make the links work reliably. Here’s what I found:
If the example text spätestens abc, copied from a PDF, contains a “fake” umlaut, the link doesn’t work:
- if only spaces are encoded (spätestens%20abc)
- if everything is encoded (spa%CC%88testens%20abc)
If I replace the “fake” umlaut with an actual umlaut ä, the link works:
- if only spaces are encoded (spätestens%20abc)
- if everything is encoded (sp%C3%A4testens%20abc)
That’s so strange to me: I just copied this text, and then it doesn’t match itself?
I have no idea what to do, as
- it doesn’t seem to be possible to replace such “fake” umlauts via regex (at least not for me)
- it’s not possible to use wildcards in a link (and even then I still wouldn’t know what to replace)
- NSCharacterSets don’t seem to help here
- it doesn’t seem to be an encoding issue, as everything I tried reported UTF-8
- I couldn’t convert the string with the iconv CLI (maybe because it’s already UTF-8)
Any pointers on what’s going on and how to work around it?
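One thing that might be worth trying, since the links do work once the “fake” umlaut is replaced by a real one: normalize the copied text to its precomposed form before escaping it. A minimal sketch using NSString’s precomposedStringWithCanonicalMapping; escapeComposed is a hypothetical helper, and whether DEVONthink then matches every hit is an assumption that has not been verified here:

```applescript
use AppleScript version "2.7"
use framework "Foundation"
use scripting additions

-- Build a decomposed sample ("a" followed by U+0308) regardless of how this text was pasted
set theSample to ((current application's NSString's stringWithString:"spätestens abc")'s decomposedStringWithCanonicalMapping()) as text
set theEscaped to my escapeComposed(theSample) --> "sp%C3%A4testens%20abc"

-- Hypothetical helper: normalize to the precomposed form, then percent-encode
on escapeComposed(theText)
	set theString to current application's NSString's stringWithString:theText
	set theComposed to theString's precomposedStringWithCanonicalMapping()
	set theAllowed to current application's NSCharacterSet's URLQueryAllowedCharacterSet()
	return (theComposed's stringByAddingPercentEncodingWithAllowedCharacters:theAllowed) as text
end escapeComposed
```

The same normalization could be applied to the text before the regex search, so that both sides of the comparison use one form.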