Unicode File IO extension

You rang? Unicode File IO extension. Ugly as heck, thrown together as a proof of concept, but seems to work. In fact, it worked with the previous version of JSON but for this small change at the end of JSON_Stringify_String.

            default:
#iftrue (WORDSIZE==4);
        ! Glulx: write the character to the current stream as a Unicode value
        @streamunichar ch;
#ifnot;
        ! Z-machine: print it as an ordinary character
        print (char) ch;
#endif;
        }
    }
    print "~";
    TEXT_TY_Untransmute(str, p, cp);
];

And if you’re not a coward, it even worked with unicode values above 65535 with this:

Use UTF-32;

Use UTF-32 translates as
(- Constant UTF_32; -).

Include (-
#ifdef UTF_32;
Constant TEXT_TY_Storage_Flags = BLK_FLAG_MULTIPLE + BLK_FLAG_WORD;
#ifnot;
Constant TEXT_TY_Storage_Flags = BLK_FLAG_MULTIPLE + BLK_FLAG_16_BIT;
#endif;
Constant Large_Unicode_Tables;

{-segment:UnicodeData.i6t}
{-segment:Char.i6t}

-) instead of "Character Set" in "Text.i6t".

For values of “not a coward” that may mean “willing to do something reckless and ill-advised that’s likely to bite you in the butt at some point”.


@Zed Maybe I’m not understanding what your Unicode File IO extension is doing, but I’m getting unexpected results.
I write unicode chars to a file, and the results seem to be incorrect and full of NULs (ASCII 0, shown as ^@ in the output of cat -e).

"Unicode IO"

Include Unicode File IO by Zed Lopez.

Unicode lab is a room.

File of stdzed is called "stdzed".
File of unized is called "unized".

When play begins:
	write "ascii: hello - unicode: ➥♘√" to file of stdzed;
	write "ascii: hello - unicode: ➥♘√" to unicode file of unized;

Results:

$ cat -e stdzed.glkdata
* //C537B53E-6E34-4680-B565-8C1C2EBC78CE// stdzed$
ascii: hello - unicode: ???

$ cat -e unized.glkdata
^@^@^@*^@^@^@ ^@^@^@/^@^@^@/^@^@^@C^@^@^@5^@^@^@3^@^@^@7^@^@^@B^@^@^@5^@^@^@3^@^@^@E^@^@^@-^@^@^@6^@^@^@E^@^@^@3^@^@^@4^@^@^@-^@^@^@4^@^@^@6^@^@^@8^@^@^@0^@^@^@-^@^@^@B^@^@^@5^@^@^@6^@^@^@5^@^@^@-^@^@^@8^@^@^@C^@^@^@1^@^@^@C^@^@^@2^@^@^@E^@^@^@B^@^@^@C^@^@^@7^@^@^@8^@^@^@C^@^@^@E^@^@^@/^@^@^@/^@^@^@ ^@^@^@u^@^@^@n^@^@^@i^@^@^@z^@^@^@e^@^@^@d^@^@^@$
^@^@^@a^@^@^@s^@^@^@c^@^@^@i^@^@^@i^@^@^@:^@^@^@ ^@^@^@h^@^@^@e^@^@^@l^@^@^@l^@^@^@o^@^@^@ ^@^@^@-^@^@^@ ^@^@^@u^@^@^@n^@^@^@i^@^@^@c^@^@^@o^@^@^@d^@^@^@e^@^@^@:^@^@^@ ^@^@'�^@^@&X^@^@"^Z$

That looks like it’s writing in UTF-32 rather than UTF-8.
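If I’m reading the dump right, the tail bears that out: ➥ is U+27A5, ♘ is U+2658, and √ is U+221A, and written as big-endian 32-bit words those are exactly the bytes at the end of the second file:

  • ➥ → 00 00 27 A5: two NULs, then ' (0x27) and a non-ASCII byte (0xA5), i.e. the ^@^@'� run
  • ♘ → 00 00 26 58: shown as ^@^@&X
  • √ → 00 00 22 1A: shown as ^@^@"^Z (0x1A is a control character)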

Yeah, the Glk spec says about the *_uni functions:

Since these functions deal in arrays of 32-bit words, they can be said to use the UTF-32 character encoding form. (But not the UTF-32 character encoding scheme – that’s a stream of bytes which must be interpreted in big-endian or little-endian mode. Glk Unicode functions operate on long integers, not bytes.)

  • glk_stream_open_file_uni (word stream), text mode: UTF-8
  • glk_stream_open_file_uni (word stream), binary mode: four-byte (big-endian) integers

At a quick glance it does look like the extension is opening in text mode, but I guess it’s actually doing it in binary mode. I couldn’t tell what was wrong though. Either that or the Glk library is non-conformant?
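For what it’s worth, here’s a minimal C sketch straight against the Glk API (not the extension’s actual I6; write_sample and the reuse of “unized” are just for illustration) showing where the text/binary choice gets made. Per 0.7.5, the text-mode stream should come out as UTF-8, while the binary-mode one gives exactly the four-byte words above.

#include "glk.h"

/* Sketch only: open the same data file as a unicode stream in text mode
   vs. binary mode and write one character to each. */
void write_sample(void)
{
    frefid_t fref;
    strid_t str;

    /* Text mode: per Glk 0.7.5, a unicode file stream is written as UTF-8. */
    fref = glk_fileref_create_by_name(fileusage_Data | fileusage_TextMode, "unized", 0);
    str = glk_stream_open_file_uni(fref, filemode_Write, 0);
    glk_put_char_stream_uni(str, 0x27A5);   /* the arrow: bytes E2 9E A5 */
    glk_stream_close(str, NULL);
    glk_fileref_destroy(fref);

    /* Binary mode: each character is stored as a four-byte big-endian word. */
    fref = glk_fileref_create_by_name(fileusage_Data | fileusage_BinaryMode, "unized", 0);
    str = glk_stream_open_file_uni(fref, filemode_Write, 0);
    glk_put_char_stream_uni(str, 0x27A5);   /* the arrow: bytes 00 00 27 A5 */
    glk_stream_close(str, NULL);
    glk_fileref_destroy(fref);
}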

It seems it’s the Glk lib from the IDE. In Quixe, I get a different result: files are stored as an array of bytes in localStorage, and I can see the bytes corresponding to UTF-8. Now I have to decode that on the JS side.


The Glk spec (section 5.6.4) also says

Previous versions of this spec said, of glk_stream_open_file_uni(): “In text mode, the file is written and read in a platform-dependent way, which may or may not handle all Unicode characters.” This left open the possibility of other native text-file formats, as well as richer formats such as RTF or HTML. Richer formats do not seem to have ever been used; and at this point, UTF-8 is widespread enough for us to mandate it.

So I guess not everything has caught up.

(It’s also the case that the passage at the top of section 2 on Encoding that @CrocMiam quotes was sufficiently convincing that I didn’t realize there was a UTF-8 option!)

Why not just replace the whole template within an extension, with lots of Include (- ... -) sections? Replacing .i6t templates in .materials/I6T definitely works, but for distribution purposes I’ve thought it’s not worth trying to do: people already know how to install extensions.

Oops, looks like my code was at fault: FileIOPrintLineUni was always using @streamunichar (on Glulx), which always writes 4 bytes. I’m working on a not-twenty-minute-proof-of-concept revision…

Edited: no, actually it looks like PrintLine just does output to the display and isn’t relevant. At any rate, I’ve now pushed a laughably untested version 2 of Unicode File IO that lets you set unicode or not on a per-file basis, and so uses the normal file phrases.

Oh, it’s now 6M62 only. I’m stuck on 6L38 (French extension). I’m gonna have to keep the “old” version 1 around 🙂


New version pushed, mostly different in having some documentation, which I’ll quote:


I don’t think the spec before 0.7.5 said that text-mode unicode files should be UTF-32; I think it was mostly up to the platform. @zarf could confirm, though.

You’re right; I’ll correct it. Glk 0.7.4 spec, section 5.6.3 has:

In text mode, the file is written and read in a platform-dependent way, which may or may not handle all Unicode characters. (A text-mode file created with glk_stream_open_file_uni() may have the same format as a text-mode file created with glk_stream_open_file(); or it may use a more Unicode-friendly format.)

I hadn’t looked closely enough and had basically just checked for the absence of this, which is in 0.7.5 section 5.6.3.

Though 0.7.5 still has:

When writing in text mode, character data is written in an encoding appropriate to the platform; this may be Latin-1 or some other format. Newlines may be converted to other line break sequences. Unicode values may be stored exactly, approximated, or abbreviated, depending on what the platform’s text files support.

And both of them have:

The whole exercise left me having a crisis of faith as to whether 0.7.5 really did dictate UTF-8 for unicode text after all. But I think it does.

The “Why not UTF-8” paragraph is old. I should have updated that, sorry.


Next version I’ll leave out everything about the Glk version beyond noting that 0.7.5 calls for UTF-8 for unicode text and 0.7.4 left it implementation dependent, and then just discuss the Glk implementations and the applications in terms of whether they use UTF-8 or not. (As well as noting that this is mostly a concern for opening a file in a different application and can be basically ignored if you’re not switching amongst different applications looking at the same file.)

OK, now it says:


…and now I think I even have UTF-8 text unicode files working in glkterm and glktermw!

Edited: added links to repos, but for entertainment purposes only… I’m not claiming these are close to ready for real use.