Glk external file encoding

zarf · February 8, 2013, 9:06pm

I was reading through the manual, as one does, and I encountered this:

“(22.14. Exchanging files with other programs): An Inform file is currently a Unix text file (with 10 as the line division character), encoded as ASCII Latin-1. (We would like to use Unicode at some point in the future, but the Glk and Glulx layers are still not fully converted to Unicode.)”

The layers are up to date these days, so this requirement could be reconsidered.

The state of things is that when you open an external file stream, you can use a byte or word stream, and a text or binary fileref flag; thus four possibilities. However, none of them is guaranteed to be UTF-8.

byte stream, text file (I7’s current default for external files): Latin-1 character data may be translated to a platform-native character encoding in the file.
byte stream, binary file: Latin-1 character data written directly as bytes.
word stream, text file: Could be UTF-8 or a platform-native encoding.
word stream, binary file: Unicode character values written as big-endian four-byte integers.

I7’s current behavior is conservative, so the right answer may be to leave it as-is. (Although it’s not absolutely safe unless you declare the external files as “binary”, cf. 22.11.)

It would be safe to switch over to word/binary files, but then exchanging them with other programs would become a nuisance. (It will always be possible to write Python/Perl/whatever to read streams of four-byte integers, but I don’t want to put people in the position of needing to.)
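
For what it's worth, such a reader would be short. A minimal Python sketch, assuming the file holds nothing but big-endian four-byte code points with no header; the function name is made up for illustration and is not part of any Glk library:

```python
import struct


def decode_word_binary(data: bytes) -> str:
    """Decode a buffer of big-endian four-byte integers into a string.

    Assumes each 4-byte group is one Unicode code point, as in a Glk
    word stream written to a binary-mode file. Hypothetical helper.
    """
    if len(data) % 4 != 0:
        raise ValueError("length is not a multiple of four bytes")
    count = len(data) // 4
    # ">" selects big-endian; "I" is an unsigned 32-bit integer.
    code_points = struct.unpack(f">{count}I", data)
    return "".join(chr(cp) for cp in code_points)


# "H", "i", and U+03B1 (Greek small alpha) as four-byte words.
sample = bytes([0, 0, 0, 0x48, 0, 0, 0, 0x69, 0, 0, 0x03, 0xB1])
text = decode_word_binary(sample)
utf8_bytes = text.encode("utf-8")  # re-encode for other tools
```

For well-formed text, Python's built-in `data.decode("utf-32-be")` should give the same result; the explicit version just makes the layout visible.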

Possibly the right answer is to tighten the Glk spec, and say that word/text files must be UTF-8. I left it flexible because I was writing in the 20th century, when exchanging text files between Mac and Windows (never mind Unix) seemed like a necessarily hazardous operation. (I haven’t reviewed existing Glk libraries to see what they do. I don’t think we’re lucky enough to be in an “interpreters already do that” situation, but it’s worth checking.)

Thoughts?

RealNC · February 9, 2013, 10:27am

The usual way is to use a byte stream with UTF-8 encoding. This provides full support for all languages but still makes it compatible with character string processing routines (strcpy() and such) so there’s no need to modify those in tools that work with those files. Also, it’s easy to convert UTF-8 to another encoding when loading the file for display, as most APIs for that work with NUL-terminated byte sequences for the UTF-8 source.

Btw, what’s a “word stream”?

ChrisC · February 9, 2013, 12:16pm

That would be “word” in the computer science sense.

RealNC · February 9, 2013, 1:43pm

I know that of course, but since Zarf mentioned “word stream, text file”, I assumed he meant something else. A text file is a stream of character bytes, not words.

zarf · February 9, 2013, 7:50pm

You can open a Glk stream with bytes or (Glulx four-byte) words. You can also open a Glk file in text or binary mode. Thus, there are four possibilities, as I enumerated in my comment above.

There is no notion of “encoding” in the Glk API. I guess I could add one, but it’s not what I’m currently considering; I would rather tighten the specification of the current API than extend the API.

bcressey · February 9, 2013, 8:51pm

Speaking for garglk, if (and only if) a Glk file stream is opened in text mode with Glulx four-byte words, the I/O is assumed (forced) to be UTF-8.

DavidK · February 16, 2013, 9:05pm

Windows Glk currently writes Unicode text files as UCS-2, but I think it would be a good idea to change the specification to require UTF-8.

zarf · February 17, 2013, 4:04am

Okay, that’s more or less a consensus. I will put together a spec update post.

zarf · March 7, 2013, 11:37pm

Related question: when writing a byte stream in text mode, the spec currently says: “Unicode values may be stored exactly, approximated, or abbreviated, depending on what the platform’s text files support.”

Is this worth keeping, or should we mandate Latin-1? Do Windows interpreters make an effort to store DOS-style line breaks when writing in text mode?

We could go all the way and say that there is no difference between text and binary mode when writing byte streams – they should behave identically.

(For word streams, i.e. glk_stream_open_file_uni(), the difference will be that text streams are UTF-8; binary streams are sequences of four-byte integers.)

zarf · March 11, 2013, 2:40am

I have posted the following proposed update to the Glk spec:

github.com/erkyrath/glk-dev/wik … ec-changes

strid_t glk_stream_open_file(frefid_t fileref, glui32 fmode, glui32 rock);

…When writing in binary mode, byte values are written directly to the file. (Writing calls such as glk_put_char_stream() are defined in terms of Latin-1 characters, so the binary file can be presumed to use Latin-1. Newlines will remain as 0x0A bytes.) Unicode values (characters greater than 255) cannot be written to the file. If you try, they will be stored as 0x3F (“?”) characters.

When writing in text mode, character data is written in an encoding appropriate to the platform; this may be Latin-1 or some other format. Newlines may be converted to other line break sequences. Unicode values may be stored exactly, approximated, or abbreviated, depending on what the platform’s text files support.

strid_t glk_stream_open_file_uni(frefid_t fileref, glui32 fmode, glui32 rock);

This works just like glk_stream_open_file(), except that in binary mode, characters are written and read as four-byte (big-endian) values. This allows you to write and read any Unicode character.

In text mode, the file is written and read using the UTF-8 Unicode encoding. Files should be written without a byte-ordering mark. This ensures that text-mode files created by glk_stream_open_file() and glk_stream_open_file_uni() will be identical if only ASCII characters (32-127) are written.

Previous versions of this spec said, of glk_stream_open_file_uni(): “In text mode, the file is written and read in a platform-dependent way, which may or may not handle all Unicode characters.” This left open the possibility of other native text-file formats, as well as richer formats such as RTF or HTML. Richer formats do not seem to have ever been used; and at this point, UTF-8 is widespread enough for us to mandate it.

To summarize:

glk_stream_open_file (byte stream), text mode: platform native text
glk_stream_open_file (byte stream), binary mode: Latin-1
glk_stream_open_file_uni (word stream), text mode: UTF-8
glk_stream_open_file_uni (word stream), binary mode: four-byte (big-endian) integers
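
The three well-defined rows of that summary can be modeled in a few lines of Python. This is only a sketch of the byte layouts described in the proposal, not of any Glk library; the function names are invented for illustration, and the platform-native row is left out because it is unspecified by definition.

```python
def byte_binary(text: str) -> bytes:
    """Byte stream, binary mode: Latin-1, with values above 255 stored as '?'."""
    return bytes(ord(ch) if ord(ch) <= 0xFF else 0x3F for ch in text)


def word_text(text: str) -> bytes:
    """Word stream, text mode: UTF-8 with no byte-ordering mark."""
    return text.encode("utf-8")  # Python's "utf-8" codec never adds a BOM


def word_binary(text: str) -> bytes:
    """Word stream, binary mode: one big-endian four-byte integer per character."""
    return b"".join(ord(ch).to_bytes(4, "big") for ch in text)
```

For pure ASCII input, `byte_binary` and `word_text` produce the same bytes. They diverge as soon as a character above 127 appears: é (U+00E9) is the single byte E9 in the first and the two bytes C3 A9 in the second.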

See also the following notes about filerefs. This section is already in the spec, but I repeat it here as background:

fileusage_BinaryMode: The file contents will be stored exactly as they are written, and read back in the same way. The resulting file may not be viewable on platform-native text file viewers.
fileusage_TextMode: The file contents will be transformed to a platform-native text file as they are written out. Newlines may be converted to linefeeds or linefeed-plus-carriage-return combinations; Latin-1 characters may be converted to native character codes. When reading a file in text mode, native line breaks will be converted back to newline (0x0A) characters, and native character codes may be converted to Latin-1 or UTF-8. Line breaks will always be converted; other conversions are more questionable. If you write out a file in text mode, and then read it back in text mode, high-bit characters (128 to 255) may be transformed or lost.
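
The one conversion this section promises on reading, native line breaks back to 0x0A, might look like the following in Python. A sketch only: it assumes the native break is either CR LF or a bare CR, and the helper name is made up for illustration.

```python
def normalize_line_breaks(data: bytes) -> bytes:
    """Convert CR LF and bare CR line breaks to a single LF (0x0A)."""
    # Replace CR LF first so that it does not turn into two newlines.
    return data.replace(b"\r\n", b"\n").replace(b"\r", b"\n")
```

Character-code conversion is deliberately not attempted here, since the text above calls those conversions "more questionable" and leaves them to the platform.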

zarf · March 11, 2013, 2:46am

The unit test for this change is: eblong.com/zarf/glulx/extbinaryfile.ulx

I’ve pushed a cheapglk change which implements this. (To github, not yet released.) My other interpreters have not yet caught up.

djfletch · March 16, 2013, 1:35pm

I’m a bit late to this thread, but I wanted to say that at one point I experimented with writing transcripts as HTML. It worked out OK, although I never got round to releasing anything.

So at some point I might lobby for an exception to the UTF-8 requirement, along the lines of: if the fileref is created by a prompt, where the user can choose a file type, then the library is allowed to write what the user asks for.

UTF-8 could be required to be one of the choices offered. You might also restrict this behaviour to fileusage_Transcript only.

I don’t know if I’ll ever manage to resurrect my implementation, so it may not be worth worrying about for now.

zarf · March 16, 2013, 7:12pm

That is a fair point. I had HTML transcript output in the back of my mind when I originally designed the spec, but nobody ever went that direction.

zarf · March 20, 2013, 7:40pm

I’ve added a couple of lines:

…and notes to this effect on glk_stream_open_file_uni() and glk_fileref_create_by_prompt().