Unicode File IO extension

You rang? Unicode File IO extension. Ugly as heck, thrown together as a proof of concept, but seems to work. In fact, it worked with the previous version of JSON but for this small change at the end of JSON_Stringify_String.

            default:
#iftrue (WORDSIZE==4);
        ! Glulx: write the character to the current stream as a Unicode value
        @streamunichar ch;
#ifnot;
        ! Z-machine: print it as an ordinary character
        print (char) ch;
#endif;
        }
    }
    print "~";
    TEXT_TY_Untransmute(str, p, cp);
];

And if you’re not a coward, it even worked with unicode values above 65535 with this:

Use UTF-32;

Use UTF-32 translates as
(- Constant UTF_32; -).

Include (-
#ifdef UTF_32;
Constant TEXT_TY_Storage_Flags = BLK_FLAG_MULTIPLE + BLK_FLAG_WORD;
#ifnot;
Constant TEXT_TY_Storage_Flags = BLK_FLAG_MULTIPLE + BLK_FLAG_16_BIT;
#endif;
Constant Large_Unicode_Tables;

{-segment:UnicodeData.i6t}
{-segment:Char.i6t}

-) instead of "Character Set" in "Text.i6t".

For values of “not a coward” that may mean “willing to do something reckless and ill-advised that’s likely to bite you in the butt at some point”.


@Zed Maybe I’m not understanding what your Unicode File IO extension is doing, but I’m getting unexpected results.
I write unicode chars to a file, and the results seem to be incorrect and full of NULs (ASCII 0, shown as ^@ in the output of cat -e).

"Unicode IO"

Include Unicode File IO by Zed Lopez.

Unicode lab is a room.

File of stdzed is called "stdzed".
File of unized is called "unized".

When play begins:
	write "ascii: hello - unicode: ➥♘√" to file of stdzed;
	write "ascii: hello - unicode: ➥♘√" to unicode file of unized;

Results:

$ cat -e stdzed.glkdata
* //C537B53E-6E34-4680-B565-8C1C2EBC78CE// stdzed$
ascii: hello - unicode: ???

$ cat -e unized.glkdata
^@^@^@*^@^@^@ ^@^@^@/^@^@^@/^@^@^@C^@^@^@5^@^@^@3^@^@^@7^@^@^@B^@^@^@5^@^@^@3^@^@^@E^@^@^@-^@^@^@6^@^@^@E^@^@^@3^@^@^@4^@^@^@-^@^@^@4^@^@^@6^@^@^@8^@^@^@0^@^@^@-^@^@^@B^@^@^@5^@^@^@6^@^@^@5^@^@^@-^@^@^@8^@^@^@C^@^@^@1^@^@^@C^@^@^@2^@^@^@E^@^@^@B^@^@^@C^@^@^@7^@^@^@8^@^@^@C^@^@^@E^@^@^@/^@^@^@/^@^@^@ ^@^@^@u^@^@^@n^@^@^@i^@^@^@z^@^@^@e^@^@^@d^@^@^@$
^@^@^@a^@^@^@s^@^@^@c^@^@^@i^@^@^@i^@^@^@:^@^@^@ ^@^@^@h^@^@^@e^@^@^@l^@^@^@l^@^@^@o^@^@^@ ^@^@^@-^@^@^@ ^@^@^@u^@^@^@n^@^@^@i^@^@^@c^@^@^@o^@^@^@d^@^@^@e^@^@^@:^@^@^@ ^@^@'�^@^@&X^@^@"^Z$

That looks like it’s writing in UTF-32 rather than UTF-8.
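If I’m reading the dump right, the tail bears that out: ➥ is U+27A5, ♘ is U+2658, and √ is U+221A, and written as big-endian 32-bit words those are exactly the bytes at the end of the second file:

  • ➥ → 00 00 27 A5: two NULs, then ' (0x27) and a non-ASCII byte (0xA5), i.e. the ^@^@'� run
  • ♘ → 00 00 26 58: shown as ^@^@&X
  • √ → 00 00 22 1A: shown as ^@^@"^Z (0x1A is a control character)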

Yeah, the Glk spec says about the *_uni functions:

Since these functions deal in arrays of 32-bit words, they can be said to use the UTF-32 character encoding form. (But not the UTF-32 character encoding scheme – that’s a stream of bytes which must be interpreted in big-endian or little-endian mode. Glk Unicode functions operate on long integers, not bytes.)

  • glk_stream_open_file_uni (word stream), text mode: UTF-8
  • glk_stream_open_file_uni (word stream), binary mode: four-byte (big-endian) integers

At a quick glance it does look like the extension is opening in text mode, but I guess it’s actually doing it in binary mode. I couldn’t tell what was wrong though. Either that or the Glk library is non-conformant?
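For what it’s worth, here’s a minimal C sketch straight against the Glk API (not the extension’s actual I6; write_sample and the reuse of “unized” are just for illustration) showing where the text/binary choice gets made. Per 0.7.5, the text-mode stream should come out as UTF-8, while the binary-mode one gives exactly the four-byte words above.

#include "glk.h"

/* Sketch only: open the same data file as a unicode stream in text mode
   vs. binary mode and write one character to each. */
void write_sample(void)
{
    frefid_t fref;
    strid_t str;

    /* Text mode: per Glk 0.7.5, a unicode file stream is written as UTF-8. */
    fref = glk_fileref_create_by_name(fileusage_Data | fileusage_TextMode, "unized", 0);
    str = glk_stream_open_file_uni(fref, filemode_Write, 0);
    glk_put_char_stream_uni(str, 0x27A5);   /* the arrow: bytes E2 9E A5 */
    glk_stream_close(str, NULL);
    glk_fileref_destroy(fref);

    /* Binary mode: each character is stored as a four-byte big-endian word. */
    fref = glk_fileref_create_by_name(fileusage_Data | fileusage_BinaryMode, "unized", 0);
    str = glk_stream_open_file_uni(fref, filemode_Write, 0);
    glk_put_char_stream_uni(str, 0x27A5);   /* the arrow: bytes 00 00 27 A5 */
    glk_stream_close(str, NULL);
    glk_fileref_destroy(fref);
}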

It seems it’s the Glk lib from the IDE. In Quixe, I get a different result: files are stored as an array of bytes in localStorage, and I can see the bytes corresponding to UTF-8. Now I have to decode that on the JS side.


The Glk spec (section 5.6.4) also says

Previous versions of this spec said, of glk_stream_open_file_uni(): “In text mode, the file is written and read in a platform-dependent way, which may or may not handle all Unicode characters.” This left open the possibility of other native text-file formats, as well as richer formats such as RTF or HTML. Richer formats do not seem to have ever been used; and at this point, UTF-8 is widespread enough for us to mandate it.

So I guess not everything has caught up.

(It’s also the case that the passage at the top of section 2 on Encoding that @CrocMiam quotes was sufficiently convincing that I didn’t realize there was a UTF-8 option!)

Why not just replace the whole template within an extension, with lots of Include (- ... -) sections? Replacing .i6t templates in .materials/I6T definitely works, but for distribution purposes I’ve thought it’s not worth trying to do: people already know how to install extensions.

Oops, looks like my code was at fault: FileIOPrintLineUni was always using @streamunichar (on Glulx), which always writes 4 bytes. I’m working on a not-twenty-minute-proof-of-concept revision…

Edited: no, actually it looks like PrintLine just does output to the display and isn’t relevant. At any rate, I’ve now pushed a laughably untested version 2 of Unicode File IO that lets you set unicode or not on a per-file basis, and so uses the normal file phrases.

Oh, it’s now 6M62 only. I’m stuck on 6L38 (French extension). I’m gonna have to keep the “old” version 1 around 🙂


New version pushed, mostly different in having some documentation, which I’ll quote:


I don’t think the spec before 0.7.5 said that text-mode unicode files should be UTF-32; I think it was mostly up to the platform. @zarf could confirm, though.

You’re right; I’ll correct it. Glk 0.7.4 spec, section 5.6.3 has:

In text mode, the file is written and read in a platform-dependent way, which may or may not handle all Unicode characters. (A text-mode file created with glk_stream_open_file_uni() may have the same format as a text-mode file created with glk_stream_open_file(); or it may use a more Unicode-friendly format.)

I hadn’t looked closely enough and had basically just checked for the absence of this, which is in 0.7.5 section 5.6.3.

Though 0.7.5 still has:

When writing in text mode, character data is written in an encoding appropriate to the platform; this may be Latin-1 or some other format. Newlines may be converted to other line break sequences. Unicode values may be stored exactly, approximated, or abbreviated, depending on what the platform’s text files support.

And both of them have:

The whole exercise left me having a crisis of faith as to whether 0.7.5 really did dictate UTF-8 for unicode text after all. But I think it does.

The “Why not UTF-8” paragraph is old. I should have updated that, sorry.


Next version I’ll leave out everything about the Glk version beyond noting that 0.7.5 calls for UTF-8 for unicode text and 0.7.4 left it implementation dependent, and then just discuss the Glk implementations and the applications in terms of whether they use UTF-8 or not. (As well as noting that this is mostly a concern for opening a file in a different application and can be basically ignored if you’re not switching amongst different applications looking at the same file.)

OK, now it says:


…and now I think I even have UTF-8 text unicode files working in glkterm and glktermw!

Edited: added links to repos, but for entertainment purposes only… I’m not claiming these are close to ready for real use.