Converting characters to unicode

rileypb · July 4, 2025, 9:58pm

I can’t seem to find anything more recent than this 10-year-old non-answer: Getting unicode (or ascii) number for a character

What I was trying was:

To decide what number is the ordinal of (ch - unicode character): (- ({ch}) -).
To decide what unicode character is the head of (T - a text): (- {T}->0 -).

when play begins:
    let C be the head of "abc"; 
    let N be the ordinal of C; 
    say "Character [C] -> [N]";

However, “head of” returns an empty string. How should this be written?

Zed · July 4, 2025, 10:28pm

Sneak preview of my soon-to-be-classic Character Utilities extension:

lab is room.

Include (-
Constant TEXT_TY_Storage_Flags = BLK_FLAG_MULTIPLE + BLK_FLAG_WORD;
-) replacing "TEXT_TY_Storage_Flags".

To decide what unicode character is utf-32 (n - a number): (- {n} -).

include (-
[ unicodeValue s n x cp1 p1 dsize;
if (n < 0) return 0;
cp1 = s-->0; p1 = TEXT_TY_Temporarily_Transmute(s);
dsize = BlkValueLBCapacity(s);
! Read only when n is in range, and always untransmute before returning.
x = 0;
if (n < dsize) x = BlkValueRead(s,n);
TEXT_TY_Untransmute(s, p1, cp1);
return x;
];
-)

To decide what number is an/-- ord of/-- a/an/-- (T - a text): (- unicodeValue({T},0) -).

[ equivalent to c cast as a number ]
To decide what number is an/-- ord of/-- a/an/-- (C - a unicode character): (- {C} -).

[ named version for application -- can't use inline I6]
To decide what number is wrapped ord (C - a unicode character) (this is char-to-num): decide on ord of C.

To decide what unicode character is a/-- char (N - a number) of/for a/an/-- (T - a text): (- unicodeValue({T},({N}-1)) -).

To decide what unicode character is a/-- char of/for a/an/-- (T - a text): (- unicodeValue({T},0) -).

To decide what list of numbers is an/-- ord of/-- a/an/-- (L - a list of unicode characters):
  decide on char-to-num applied to L;

[ equivalent to n cast as a unicode character ]
To decide what unicode character is a/-- char of/for/-- a/an/-- (N - a number): (- {N} -).

[ named version for application -- can't use inline I6]
To decide what unicode character is wrapped char (N - a number) (this is num-to-char):
  decide on char of N.

To decide what list of unicode characters is char/chars of/-- a/an/-- (L - a list of numbers):
  decide on num-to-char applied to L.

To decide what list of unicode characters is chars of/-- a/an/-- (T - a text):
    let result be a list of unicode characters;
    let len be number of characters in T;
    repeat with i running from 1 to len begin;
      add char for character number i in T to result;
    end repeat;
    decide on result;

To say (L - a list of sayable values) join/joined by/with/-- a/an/-- (sep - a sayable value):
  let len be the number of entries in L;
  repeat with i running from 1 to len - 1 begin;
    say entry i in L;
    say sep;
  end repeat;
  say entry len in L;

when play begins:
  let n be 128512;
  let uc be utf-32 n;
  say uc;
  let unicodetext be "[uc]";
  say " text: [unicodetext] ";
  let t be "A";
  now uc is char of t;
  if ord uc is 65, say "ASCII 4EVAH.";

rileypb · July 4, 2025, 10:50pm

I stole what I needed and ran. YOINK!

rileypb · July 4, 2025, 11:05pm

Unfortunately, it doesn’t solve my underlying problem. I was using if-else statements to decode letters into ASCII, and I thought that might have been causing performance issues. Turns out the issue is that Inform is just dreadfully slow iterating over all the unpunctuated words in a text.

Zed · July 4, 2025, 11:51pm

oh, yes, that’s very true, too. Try this:

lab is room.

To repeat with/for (loopvar - nonexisting text variable) running/-- through/in words in/of (T - text) begin -- end loop:
    (- {-my:1} = TEXT_TY_BlobAccess({-by-reference:T}, WORD_BLOB);
for ( {-my:2} = 1 : {-my:2} <= {-my:1} : {-my:2}++ )
if (BlkValueCopy({-by-reference:loopvar}, TEXT_TY_GetBlob({-new:text}, {-by-reference:T}, {-my:2}, WORD_BLOB)))
-)

when play begins:
let t be "Once upon a midnight dreary, while I pondered, weak and weary,
Over many a quaint and curious volume of forgotten lore—
    While I nodded, nearly napping, suddenly there came a tapping,
As of some one gently rapping, rapping at my chamber door.
'[']Tis some visitor,' I muttered, 'tapping at my chamber door—
            Only this and nothing more.'

    Ah, distinctly I remember it was in the bleak December;
And each separate dying ember wrought its ghost upon the floor.
    Eagerly I wished the morrow;—vainly I had sought to borrow
    From my books surcease of sorrow—sorrow for the lost Lenore—
For the rare and radiant maiden whom the angels name Lenore—
            Nameless here for evermore.

    And the silken, sad, uncertain rustling of each purple curtain
Thrilled me—filled me with fantastic terrors never felt before;
    So that now, to still the beating of my heart, I stood repeating
    '[']Tis some visitor entreating entrance at my chamber door—
Some late visitor entreating entrance at my chamber door;—
            This it is and nothing more.'

    Presently my soul grew stronger; hesitating then no longer,
'Sir,' said I, 'or Madam, truly your forgiveness I implore;
    But the fact is I was napping, and so gently you came rapping,
    And so faintly you came tapping, tapping at my chamber door,
That I scarce was sure I heard you'—here I opened wide the door;—
            Darkness there and nothing more.

    Deep into that darkness peering, long I stood there wondering, fearing,
Doubting, dreaming dreams no mortal ever dared to dream before;
    But the silence was unbroken, and the stillness gave no token,
    And the only word there spoken was the whispered word, 'Lenore?'
This I whispered, and an echo murmured back the word, 'Lenore!'—
            Merely this and nothing more.";
repeat for w in words of t begin;
  say w;
  say " / ";
end repeat;

rileypb · July 5, 2025, 12:17am

Hmm, I don’t see a speedup – I think the problem is repeatedly calling GetBlob, isn’t it? At least that’s my guess looking at:

To decide what text is word number (N - a number) in (T - text)
	(documented at ph_wordnum):
	(- TEXT_TY_GetBlob({-new:text}, {-by-reference:T}, {N}, WORD_BLOB) -).

Seems like that would split the text into words each time you called it, which would give you a running time of O(n^2) rather than the O(n) you'd get if the words were split once (n = number of words, assuming equal-length words).

rileypb · July 5, 2025, 12:41am

Well I got a reasonably good work-around. I’m caching the list of words when play begins, since I can tolerate a 2 second lag there.
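
A minimal sketch of what that kind of caching might look like, assuming the text lives in a global variable. The names "story passage" and "token cache" are made up for illustration and are not from my actual project. Building the cache still pays the slow per-word lookup once at startup; later loops run through the stored list instead.

```inform7
Lab is a room.

[ "story passage" and "token cache" are hypothetical names used only for this sketch. ]
The story passage is a text that varies.
The story passage is "Once upon a midnight dreary, while I pondered, weak and weary".

The token cache is a list of texts that varies.

[ Pay the slow word-by-word lookup once, at startup. ]
When play begins:
	let total be the number of words in the story passage;
	repeat with i running from 1 to total begin;
		add word number i in the story passage to the token cache;
	end repeat.

[ Later loops walk the cached list rather than re-splitting the text. ]
Instead of jumping:
	repeat with w running through the token cache begin;
		say "[w] / ";
	end repeat.
```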

Zed · July 5, 2025, 8:05am

Sorry. I thought I was remembering I had a fast solution to that, but I must have been thinking of this, which does a reasonable job of looping through characters in a text.

So elaborating on that (this seems to work but it’s not really adequately tested, but it’s late…)

Include (-
[ textWipe t i;
for (i = 0 : BlkValueRead(t,i) : i++ ) BlkValueWrite(t,i,0);
];
-)

To repeat with/for (loopvar - nonexisting text variable) running/-- through/in words in/of (t - text) begin -- end loop: (-
  {-my:0} = {t}-->0;
  @push {-my:0};
  {-my:1} = TEXT_TY_Temporarily_Transmute({-by-reference:t});
  @push {-my:1};
  for ( {-my:1} = BlkValueLBCapacity({-by-reference:t}) - 1 : {-my:1} >= 0 : {-my:1}-- ) if (BlkValueRead({-by-reference:t}, {-my:1})) { {-my:1}++; break; }
  for ({-my:0} = 0, {-my:2} = BlkValueRead({-by-reference:t}, 0), {-my:3} = 0 : ({-my:0} <= {-my:1}) : {-my:2} = BlkValueRead({-by-reference:t}, ++{-my:0}))
    if ({-my:2} && ({-my:2} ~= 32)) { BlkValueWrite({-by-reference:loopvar}, {-my:3}++ , {-my:2}); }
    else for ( : {-my:3} : {-my:3} = 0, textWipe({-by-reference:loopvar})) {-block}
   @pull {-my:1};
   @pull {-my:0};
   TEXT_TY_Untransmute({-by-reference:t}, {-my:1}, {-my:0});
-)

n.b.: this won’t work with text literals, e.g.,

repeat for w in words of "moby dick" [...] [this no work]

but this works:

let mb be "moby dick";
repeat for w in words of mb [...]

zarf · July 5, 2025, 2:18pm

That’s a good argument for doing the split in your text editor and putting it in your source code as a list of texts.
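
Roughly, and only as a sketch: the list below is a hand-split fragment, and "stanza list" is a hypothetical name. With the words already stored as list entries, the game does no text splitting at run time.

```inform7
Lab is a room.

[ "stanza list" is a hypothetical name; the entries were split by hand in a text editor. ]
The stanza list is a list of texts that varies.
The stanza list is {"Once", "upon", "a", "midnight", "dreary"}.

When play begins:
	repeat with w running through the stanza list begin;
		say "[w] / ";
	end repeat.
```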

rileypb · July 5, 2025, 2:30pm

Ack. You’re so right.

rileypb · July 5, 2025, 5:52pm

Very fast!

rileypb · July 6, 2025, 2:24am

This solves my problem without having to split the multiple, long, in-flux texts into lists of texts. Thanks! I’ll give you a big shoutout in my IFComp entry.