Shane or Mark,
I ran into a strange “data ushr” type of data («data ushr3500») when working with NSString’s characterAtIndex. I couldn’t find anything about this type or how to convert it. The Xcode docs said characterAtIndex returns a 64-bit unsigned integer since my Mac is 64-bit. I figured the first 9.22337203685478E+18 numbers are positive, so I could use an AppleScript integer to represent the unicode ID numbers, since there are only 1,114,112 possible characters in unicode.
Therefore, given an NSString named TheNSString:
The unicode number of the character at position P should be (TheNSString’s characterAtIndex:P) as integer
The unicode character at position P should be character id ((TheNSString’s characterAtIndex:P) as integer)
I couldn’t find anything anywhere on how to do the conversion, and it seems like such a simple approach that I keep thinking I’m missing something. So my question is: does this really work in general?
As an example, this returns “e”:
use framework "AppKit"
property NSString : class "NSString"
set P to 1 -- returns the second character
set TheNSString to (NSString's stringWithString:"help me")
character id ((TheNSString's characterAtIndex:P) as integer)
while this returns 101, and character id 101 is “e”:
use framework "AppKit"
property NSString : class "NSString"
set P to 1 -- returns the unicode number of the second character
set TheNSString to (NSString's stringWithString:"help me")
(TheNSString's characterAtIndex:P) as integer
Bill
The characterAtIndex: method returns a unichar. That’s a 16-bit value, but it only matches what AppleScript calls a character if the character is one that can be encoded in a single 16-bit value. So, for example, this:
use framework "Foundation"
property NSString : class "NSString"
set P to 1 -- returns the unicode number of the second character
set TheNSString to (NSString's stringWithString:"h😳elp me")
(TheNSString's characterAtIndex:P) as integer
returns something different from getting the AppleScript id of the emoji character.
That’s why most Cocoa string handling is based around strings, rather than characters.
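For example, if you need the whole character at a particular index as text, a string-based version might look something like this (just a rough sketch using rangeOfComposedCharacterSequenceAtIndex: and substringWithRange:, not something I’ve tested exhaustively):
use framework "Foundation"
property NSString : class "NSString"
set P to 1 -- index of the second 16-bit unit
set TheNSString to (NSString's stringWithString:"h😳elp me")
-- get the range of the whole composed character sequence covering that index
set theRange to (TheNSString's rangeOfComposedCharacterSequenceAtIndex:P)
-- pull it out as a substring instead of a single 16-bit unichar
set theCharacter to ((TheNSString's substringWithRange:theRange) as text)
theCharacter --> "😳"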
In short, characterAtIndex: works when it works, and doesn’t when it doesn’t…
And ushr presumably comes from the fact that it stores an unsigned short.
I figured it wouldn’t be that easy. It hadn’t sunk into my brain yet that unicode numbers could need that many bits. The unicode code for 😳 is 11111011000110011, which requires 18 bits as a signed number, or 17 if unsigned. That sounds like a definite problem for a variable with 16 bits. I guess it’s back to the old problem: either the method can’t handle a character with that many bits, or the sixteen-bit limit on the value ASObjC returns the answer in isn’t big enough. I take it these older Objective-C methods haven’t been updated for bigger values, or the scripting bridge isn’t up to date as far as bit capacity goes.
In the end, the only way I could use my technique safely is to check all the characters to be worked on before converting, and return an error if a character is out of range. That doesn’t sound very elegant.
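Something like this rough, untested sketch is what I have in mind: loop over the AppleScript characters first, and throw an error for any one the NSString has to store in more than one 16-bit unit (the variable names and error wording are just placeholders):
use framework "Foundation"
property NSString : class "NSString"
set theText to "h😳elp me"
repeat with P from 1 to (count characters of theText)
	set thisChar to character P of theText
	-- a character that takes more than one 16-bit unit can't be fetched with characterAtIndex: as a single value
	if (((NSString's stringWithString:thisChar)'s |length|()) as integer) > 1 then
		error "Character " & P & " (" & thisChar & ") is out of range for characterAtIndex:"
	end if
end repeat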
Thanks for pointing out my oversight : )
Bill
It’s not that simple – an AppleScript character may actually be a composed character sequence or grapheme cluster, which means it can consist of more than one value. See:
https://developer.apple.com/library/content/documentation/Cocoa/Conceptual/Strings/Articles/stringsClusters.html
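A quick, rough way to see the mismatch is to count the same text both ways; the counts in the comments assume the emoji takes one AppleScript character but two 16-bit units:
use framework "Foundation"
property NSString : class "NSString"
set theText to "h😳elp me"
set asCount to (count characters of theText) -- 8: AppleScript sees the emoji as one character
set nsCount to ((NSString's stringWithString:theText)'s |length|()) as integer -- 9: NSString counts 16-bit units
{asCount, nsCount} --> {8, 9}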
I had heard about the surrogate pairs of UTF-16 units used for some code points before, but it was a while ago and I never really thought much about them. That was the part that hadn’t sunk in yet (2 X 16 > 16). The stuff I read from your link about Hangul, jamo, Indic-influenced writing systems, … was something I’d never heard of. By the time I finished the grapheme clusters I’d come to realize just how sophisticated this unicode stuff is. One thing I wasn’t sure of after reading the text: are there some code points that use more than 2 UTF-16 units? The text seemed to suggest that.
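Just to convince myself about the 2 X 16 part, I tried putting the two 16-bit halves back together with what I understand to be the standard surrogate arithmetic; this is an untested sketch using the “h😳elp me” string from your example, and it assumes the two values coerce to integers the same way the ones in my first script did:
use framework "Foundation"
property NSString : class "NSString"
set TheNSString to (NSString's stringWithString:"h😳elp me")
set highUnit to ((TheNSString's characterAtIndex:1) as integer) -- first half of the pair
set lowUnit to ((TheNSString's characterAtIndex:2) as integer) -- second half of the pair
-- standard UTF-16 surrogate math: 0x10000 + (high - 0xD800) * 0x400 + (low - 0xDC00)
set codePoint to 65536 + (highUnit - 55296) * 1024 + (lowUnit - 56320)
character id codePoint --> "😳"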
Bill
I believe so, but I’m honestly not sure.