Wondering if this really works

Shane or Mark,

I ran into a strange “data ushr” type of data («data ushr3500») when working with NSString’s characterAtIndex:. I couldn’t find anything about this type or how to convert it. The Xcode docs said characterAtIndex: returns a 64-bit unsigned integer, since my Mac is 64-bit. I figured the first 9.22337203685478E+18 values are positive, so I could use AppleScript’s integer value to represent the Unicode ID numbers, since there are only 1,114,112 possible characters in Unicode.

Therefore, given an NSString named TheNSString:
The unicode number of the character at position P should be (TheNSString’s characterAtIndex:P) as integer
The unicode character at position P should be character id ((TheNSString’s characterAtIndex:P) as integer)

I couldn’t find anything, anywhere, on how to do the conversion, and this seems like such a simple way that I keep thinking I’m missing something. So my question is: does this really work in general?

As an example this returns “e”:

use framework "AppKit"
property NSString : class "NSString"

set P to 1 -- zero-based index, so this is the second character
set TheNSString to (NSString's stringWithString:"help me")
character id ((TheNSString's characterAtIndex:P) as integer)

while this returns 101 and character id 101 = “e”

use framework "AppKit"
property NSString : class "NSString"

set P to 1 -- zero-based index, so this is the second character
set TheNSString to (NSString's stringWithString:"help me")
(TheNSString's characterAtIndex:P) as integer

Bill

The characterAtIndex: method returns a unichar. That’s a 16-bit value, but it only matches what AppleScript calls a character if the character is one that can be encoded in a single 16-bit value. So for example this:

use framework "Foundation"
property NSString : class "NSString"

set P to 1 -- zero-based index of the second UTF-16 unit, here the first half of the emoji
set TheNSString to (NSString's stringWithString:"h😳elp me")
(TheNSString's characterAtIndex:P) as integer

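returns something different from getting the AppleScript id of the emoji character: you get the first 16-bit unit of the emoji’s surrogate pair (0xD83D), not 128563, which is the id AppleScript reports for the emoji.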
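
A minimal sketch of that mismatch, assuming the same Foundation setup as above (the variable names are only illustrative). AppleScript treats the emoji as one character with one id, while NSString measures its length in UTF-16 units and needs two of them for the emoji, so the NSString count comes out one higher than the AppleScript count:

```applescript
use AppleScript version "2.4"
use framework "Foundation"

set theText to "h😳elp me"
set theNSString to current application's NSString's stringWithString:theText

-- AppleScript's view: the emoji is a single character with a single id
set asCount to count of characters of theText
set emojiID to id of character 2 of theText -- 128563, i.e. U+1F633

-- NSString's view: length is counted in UTF-16 code units,
-- and the emoji occupies two of them (indexes 1 and 2)
set nsCount to theNSString's |length|()

{asCount, emojiID, nsCount}
```

After the emoji, every NSString index is shifted by one relative to the AppleScript character position, which is why mixing the two kinds of index goes wrong.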

That’s why most Cocoa string handling is based around strings, rather than characters.

In short, characterAtIndex: works when it works, and doesn’t when it doesn’t…

And ushr presumably comes from the fact that it stores an unsigned short.

I figured it wouldn’t be that easy. It hadn’t sunk into my brain yet that Unicode numbers could have that many bits. The Unicode code for 😳 is 11111011000110011 in binary, which requires 18 bits for a signed number, or 17 if unsigned. That sounds like a definite problem for a variable with 16 bits. I guess it’s back to the old problem: either the method can’t handle a character with that many bits, or the 16-bit variable ASObj-C returns the answer in isn’t big enough. I take it they haven’t updated these older Objective-C methods for bigger values, or the scripting bridge is not up to date as far as bit capacity goes.
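
For reference, here is a rough sketch of how UTF-16 fits a 17-bit code point such as 128563 into two 16-bit units (a surrogate pair). This is the standard UTF-16 arithmetic written in plain AppleScript; surrogatePairFor is a made-up helper name, not a system handler, and it assumes the code point is above 65535.

```applescript
-- Split a code point above 65535 (0xFFFF) into its UTF-16 surrogate pair.
-- surrogatePairFor is a hypothetical helper name, not a system handler.
on surrogatePairFor(codePoint)
	set v to codePoint - 65536 -- subtract 0x10000, leaving at most 20 bits
	set highUnit to 55296 + (v div 1024) -- 0xD800 + top 10 bits
	set lowUnit to 56320 + (v mod 1024) -- 0xDC00 + bottom 10 bits
	return {highUnit, lowUnit}
end surrogatePairFor

surrogatePairFor(128563) -- {55357, 56883}, i.e. 0xD83D 0xDE33
```

So characterAtIndex: is not truncating anything: it hands back one unit of the pair per index, and neither unit is a character on its own.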

In the end, the only way I could use my technique safely is to check all the characters to be worked on before converting them, and return an error if a character is out of range. That doesn’t sound very elegant.
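
A sketch of what that check might look like, assuming the “as integer” coercion from my first post behaves as described there. codePointAt is a hypothetical helper name; it refuses any unit in the surrogate range 0xD800 through 0xDFFF (55296 through 57343), since such a unit is only half of a character.

```applescript
use AppleScript version "2.4"
use framework "Foundation"

-- codePointAt is a hypothetical helper: it returns the code unit at a
-- zero-based index, or raises an error if that unit is half of a
-- surrogate pair and so is not a character by itself.
on codePointAt(theNSString, P)
	-- assumption: the «data ushr…» result coerces to integer as described above
	set codeUnit to (theNSString's characterAtIndex:P) as integer
	if codeUnit ≥ 55296 and codeUnit ≤ 57343 then
		error "Code unit at index " & P & " is part of a surrogate pair."
	end if
	return codeUnit
end codePointAt

set theNSString to current application's NSString's stringWithString:"help me"
character id (codePointAt(theNSString, 1)) -- "e"
```

This only guards against surrogate pairs; it does nothing about characters built from several code points.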

Thanks for pointing out my oversight : )

Bill

It’s not that simple – an AppleScript character may actually be a composed character sequence or grapheme cluster, which means it can consist of more than one value. See:

https://developer.apple.com/library/content/documentation/Cocoa/Conceptual/Strings/Articles/stringsClusters.html

I had heard about the surrogate pairs UTF-16 uses for some code points before, but it was a while ago and I never really thought much about them. That was the part that hadn’t sunk in yet (2 × 16 > 16). The material in your link about Hangul, jamo and Indic-influenced writing systems was something I’d never heard of. By the time I finished the part on grapheme clusters, I’d come to realize just how sophisticated this Unicode stuff is. One thing I wasn’t sure of after reading the text: are there some code points that use more than 2 UTF-16 units? It seemed like the text I read was suggesting that.

Bill

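Not for a single code point, as far as I know: UTF-16 uses either one or two 16-bit units per code point. But a composed character sequence can contain several code points, so what AppleScript treats as one character can span more than two units.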
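
To make the distinction concrete: UTF-16 encodes every code point, up to the maximum of 0x10FFFF, in one or two 16-bit units, but a grapheme cluster can be built from several code points. A small sketch, using an arbitrarily chosen example of a base letter plus two combining marks (the variable names are illustrative), that lets you compare the two counts on your own machine:

```applescript
use AppleScript version "2.4"
use framework "Foundation"

-- "e" (101) followed by two combining marks: acute accent (769, U+0301)
-- and dot below (803, U+0323). Each is a separate code point that fits
-- in one UTF-16 unit, but together they display as one character.
set theText to string id {101, 769, 803}
set theNSString to current application's NSString's stringWithString:theText

set unitCount to theNSString's |length|() -- UTF-16 code units
set asCount to count of characters of theText -- AppleScript characters
{unitCount, asCount}
```

If the two numbers differ, that is the composed character sequence effect described in the Apple document linked above, not a code point needing a third unit.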