How to Detect Invisible Unicode Characters in AppleScript?

martin.z · August 26, 2021, 12:55pm

A command (not from reading a file) in my AppleScript gets a string like this:

ERR-The book Gen cannot be found.

There are many signs that are not properly displayed here (in the editor preview):

When I set the clipboard to the string and convert it into unicode character values, it shows as this:

E\x00R\x00R\x00-\x00T\x00h\x00e\x00 \x00b\x00o\x00o\x00k\x00 \x00\x1c G\x00e\x00n\x00\x1d \x00c\x00a\x00n\x00n\x00o\x00t\x00 \x00b\x00e\x00 \x00f\x00o\x00u\x00n\x00d\x00.\x00

In other words, the string contains many \x00 (NUL) values.
When I paste it into wwwDOTregextesterDOTcom, I can search for it. Searching for `\u0000` also finds it.

However, when I use

if myStr contains "\\u0000" then 
--Edit: Someone pointed out to me that this is looking for string '\u0000' 
--         but not the unicode character, 
--but I don't know how to look for the unicode character \u0000

or

if myStr contains "\\x00" then

It does not work. It cannot detect either “\u0000” or “\x00”.
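A short sketch of why the two tests above fail: AppleScript string literals do not interpret `\u` or `\x` escapes, so `"\\u0000"` is just six ordinary characters (a backslash, `u`, and four zeros). The variable names below are illustrative only.

```applescript
-- "\\u0000" is a backslash followed by "u0000", not a NUL character
set literalText to "\\u0000"
set literalLength to count of literalText -- six characters
set literalCodes to id of literalText -- code points of "\", "u", "0", "0", "0", "0"

-- A real NUL character has to be built from its code point
set nul to character id 0
set nulLength to count of nul -- one character
```

Because `contains` compares actual characters, a test against `literalText` only matches a string that literally contains the text `\u0000`.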

The result in Script Debugger:
(On the left, all results show only “E”. The string is cut off at the first “\x00”).

Someone pointed out to me that this happens because UTF-16 is being parsed as UTF-8. But I don’t know how to fix it.

Any help appreciated!

martin.z · August 26, 2021, 12:59pm

(Sorry, new users can only include one media item per post.)

Search result in wwwDOTregextesterDOTcom:

martin.z · August 26, 2021, 1:00pm

Search result in wwwDOTregextesterDOTcom with “\u0000”:

martin.z · August 26, 2021, 2:30pm

I have solved it by using a handler:

on decodeCharacterHexString(theCharacters)
    copy theCharacters to {theIdentifyingCharacter, theMultiplierCharacter, theRemainderCharacter}
    set theHexList to "123456789ABCDEF"
    if theMultiplierCharacter is in "ABCDEF" then
        set theMultiplierAmount to offset of theMultiplierCharacter in theHexList
    else
        set theMultiplierAmount to theMultiplierCharacter as integer
    end if
    if theRemainderCharacter is in "ABCDEF" then
        set theRemainderAmount to offset of theRemainderCharacter in theHexList
    else
        set theRemainderAmount to theRemainderCharacter as integer
    end if
    set theASCIINumber to (theMultiplierAmount * 16) + theRemainderAmount
    return (ASCII character theASCIINumber)
end decodeCharacterHexString

I can then call it:

if myStr contains (decodeCharacterHexString("%00")) then

martin.z · August 26, 2021, 2:54pm

An interesting problem:

If I add

use AppleScript version "2.4"

I will encounter this error:

Changing it to a different version won’t help.

use AppleScript version "2.7"

If I remove it, I won’t have the error and the script seems to be working just fine.
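The error text is not preserved in this thread, so the following is only a likely explanation, not a confirmed diagnosis. Once a script contains any `use` statement, scripting additions are no longer loaded automatically, and both `ASCII character` and `offset` (used in the handler above) are Standard Additions commands. Assuming that is the cause, adding `use scripting additions` should make them available again:

```applescript
use AppleScript version "2.4"
use scripting additions -- required for Standard Additions commands once any "use" statement is present

-- ASCII character is deprecated but still provided by Standard Additions
set nul to ASCII character 0
set nulCode to id of nul
```

This would also explain why changing the version number makes no difference while removing the `use` line does.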

ccstone · August 26, 2021, 10:27pm

Hey @martin.z,

When posting problems like this please try to post the actual test string.

Sometimes you can do that using the preformatted-text button in the forum editor, sometimes you need to provide a zipped example file.

Testing beats guessing every time.

-Chris

martin.z · August 26, 2021, 11:09pm

Thank you very much, Chris.

I did post the actual string in the OP. The forum automatically filters those "\x00"s.
From (in the preview):

to

ERR-The book Gen cannot be found.

The only way to preserve them is to first convert the string into escaped Unicode character values:

E\x00R\x00R\x00-\x00T\x00h\x00e\x00 \x00b\x00o\x00o\x00k\x00 \x00\x1c G\x00e\x00n\x00\x1d \x00c\x00a\x00n\x00n\x00o\x00t\x00 \x00b\x00e\x00 \x00f\x00o\x00u\x00n\x00d\x00.\x00

which I also posted in the OP.

Thank you very much for your tips about character id elsewhere. I was able to replace the handler, which relies on the deprecated ASCII character command, with a simple character id 0.

This is the right way:

if myStr contains character id 0 then
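As a follow-up sketch, the same `character id 0` value can be used to strip the NUL characters once they are detected. The sample string below is a hypothetical stand-in built from code points, not the original data, and the approach only recovers plain ASCII text: a character such as a curly quote, whose UTF-16 bytes are both non-zero, would still come out as two wrong characters.

```applescript
-- Hypothetical sample: "ERR" with a NUL after every letter
set myStr to string id {69, 0, 82, 0, 82, 0}
set nul to character id 0
set cleaned to myStr

if myStr contains nul then
	-- Split on NUL, then rejoin the pieces with an empty delimiter
	set savedDelims to AppleScript's text item delimiters
	set AppleScript's text item delimiters to nul
	set parts to text items of myStr
	set AppleScript's text item delimiters to ""
	set cleaned to parts as text
	set AppleScript's text item delimiters to savedDelims
end if
```

A more robust fix would be to decode the incoming data as UTF-16 at the point where it is received, but how to do that depends on the command that produces the string, which the thread does not show.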

ccstone · August 27, 2021, 8:24pm

Right.

So a zipped RTF file or AppleScript would probably be the best way to post a testable string.

-Chris
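Another way to share such a string as plain text, sketched here under the assumption that the forum leaves ordinary digits and braces alone, is to post its list of code points. The `id` property turns text into integers and `string id` turns them back. The sample is illustrative only.

```applescript
-- Illustrative sample, not the original poster's full string
set sample to string id {69, 0, 82, 0, 82, 0}

-- For text longer than one character, id returns a list of integers
set codePoints to id of sample

-- Anyone can rebuild the exact string from the posted list
set rebuilt to string id codePoints
```

Note that `id` of a single-character string returns a plain integer rather than a list.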

martin.z · August 27, 2021, 9:37pm

That’s a good idea. Never thought of that option before. Will do next time.

Thanks!