Glk external file encoding

I was reading through the manual, as one does, and I encountered this:

“(22.14. Exchanging files with other programs): An Inform file is currently a Unix text file (with 10 as the line division character), encoded as ASCII Latin-1. (We would like to use Unicode at some point in the future, but the Glk and Glulx layers are still not fully converted to Unicode.)”

The layers are up to date these days, so this requirement could be reconsidered.

The state of things is that when you open an external file stream, you can use a byte or word stream, and a text or binary fileref flag; thus four possibilities. However, none of them is guaranteed to be UTF-8.

  • byte stream, text file (I7’s current default for external files): Latin-1 character data may be translated to a platform-native character encoding in the file.
  • byte stream, binary file: Latin-1 character data written directly as bytes.
  • word stream, text file: Could be UTF-8 or a platform-native encoding.
  • word stream, binary file: Unicode character values written as big-endian four-byte integers.

I7’s current behavior is conservative, so the right answer may be to leave it as-is. (It’s not absolutely safe, though, unless you declare the external files as “binary”; cf. 22.11.)

It would be safe to switch over to word/binary files, but then exchanging them with other programs would become a nuisance. (It will always be possible to write Python/Perl/whatever to read streams of four-byte integers, but I don’t want to put people in the position of needing to.)
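
To give a sense of how much work that would be, here is a minimal Python sketch of such a reader. It assumes the file contains nothing but big-endian four-byte character values, as described for word/binary files above. The function name and the sample bytes are made up for illustration.

```python
import struct

def decode_glk_words(data: bytes) -> str:
    """Decode a buffer of big-endian four-byte integers into a string."""
    if len(data) % 4 != 0:
        raise ValueError("length is not a multiple of four bytes")
    count = len(data) // 4
    # ">" selects big-endian, "I" is an unsigned 32-bit integer.
    values = struct.unpack(f">{count}I", data)
    return "".join(chr(v) for v in values)

# Hypothetical file contents: 'H', 'i', U+03A9 (Greek capital omega).
sample = bytes([0, 0, 0, 0x48, 0, 0, 0, 0x69, 0, 0, 0x03, 0xA9])
# ascii() keeps the output printable whatever the terminal encoding is.
print(ascii(decode_glk_words(sample)))
```

Python's built-in `utf-32-be` codec would do the same job via `data.decode("utf-32-be")`; the explicit version just makes the file layout visible.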

Possibly the right answer is to tighten the Glk spec, and say that word/text files must be UTF-8. I left it flexible because I was writing in the 20th century, when exchanging text files between Mac and Windows (never mind Unix) seemed like a necessarily hazardous operation. (I haven’t reviewed existing Glk libraries to see what they do. I don’t think we’re lucky enough to be in an “interpreters already do that” situation, but it’s worth checking.)

Thoughts?

The usual way is to use a byte stream with UTF-8 encoding. This provides full support for all languages while keeping the data compatible with character string processing routines (strcpy() and such), so there’s no need to modify those in tools that work with these files. Also, it’s easy to convert UTF-8 to another encoding when loading the file for display, as most APIs for that work with NUL-terminated byte sequences for the UTF-8 source.
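
A small Python sketch of the properties I mean, using a made-up sample string. It only checks general facts about UTF-8, nothing Glk-specific.

```python
# Made-up sample: accented Latin letters, a Greek letter, two newlines.
text = "na\u00efve caf\u00e9\n\u03a9 line two\n"
encoded = text.encode("utf-8")

# No byte is zero unless the text itself contains U+0000, so C-style
# NUL-terminated string routines will not cut the data short.
assert 0 not in encoded

# Every byte of a multi-byte sequence is 0x80 or above, so such a
# sequence can never be mistaken for an ASCII character like newline.
assert all(b >= 0x80 for b in "\u03a9".encode("utf-8"))

# ASCII characters keep their single-byte values, so byte-level tools
# can split on newline without decoding first.
lines = encoded.split(b"\n")
print(len(lines))

# Converting for display is a single decode step.
print(ascii(lines[0].decode("utf-8")))
```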

Btw, what’s a “word stream?”

That would be “word” in the computer science sense.

I know that of course, but since Zarf mentioned “word stream, text file”, I assumed he meant something else. A text file is a stream of character bytes, not words.

You can open a Glk stream with bytes or (Glulx four-byte) words. You can also open a Glk file in text or binary mode. Thus, there are four possibilities, as I enumerated in my comment above.

There is no notion of “encoding” in the Glk API. I guess I could add one, but it’s not what I’m currently considering; I would rather tighten the specification of the current API than extend the API.

Speaking for garglk, if (and only if) a Glk file stream is opened in text mode with Glulx four-byte words, the I/O is assumed (forced) to be UTF-8.

Windows Glk currently writes Unicode text files as UCS-2, but I think it would be a good idea to change the specification to require UTF-8.

Okay, that’s more or less a consensus. I will put together a spec update post.

Related question: when writing a byte stream in text mode, the spec currently says: “Unicode values may be stored exactly, approximated, or abbreviated, depending on what the platform’s text files support.”

Is this worth keeping, or should we mandate Latin-1? Do Windows interpreters make an effort to store DOS-style line breaks when writing in text mode?

We could go all the way and say that there is no difference between text and binary mode when writing byte streams – they should behave identically.

(For word streams, i.e. glk_stream_open_file_uni(), the difference will be that text streams are UTF-8; binary streams are sequences of four-byte integers.)
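
To make the four cases concrete, here is a rough Python model of what would end up in the file under this idea. It is a sketch of the suggestion in this post, not of the final spec wording. The function and parameter names are invented, and replacing characters above 0xFF with '?' on a byte stream is an assumption of the sketch.

```python
def proposed_file_bytes(text: str, unicode_stream: bool, text_mode: bool) -> bytes:
    """Model the bytes written for each stream/fileref combination."""
    if unicode_stream:
        if text_mode:
            # Word stream, text file: UTF-8.
            return text.encode("utf-8")
        # Word stream, binary file: big-endian four-byte integers.
        return text.encode("utf-32-be")
    # Byte stream: one Latin-1 byte per character, the same in text and
    # binary mode. Assumption: anything above 0xFF is written as '?'.
    return text.encode("latin-1", errors="replace")

for uni in (False, True):
    for txt in (True, False):
        print(uni, txt, proposed_file_bytes("caf\u00e9", uni, txt))
```

Only the word/text case produces a file that other programs will read as Unicode text without extra work, which is the point of requiring UTF-8 there.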

I have posted the following proposed update to the Glk spec:

github.com/erkyrath/glk-dev/wik … ec-changes

See also the following notes about filerefs. This section is already in the spec, but I repeat it here as background:

The unit test for this change is: eblong.com/zarf/glulx/extbinaryfile.ulx

I’ve pushed a cheapglk change which implements this. (To github, not yet released.) My other interpreters have not yet caught up.

I’m a bit late to this thread, but I wanted to say that at one point I experimented with writing transcripts as HTML. It worked out OK, although I never got round to releasing anything.

So at some point I might lobby for an exception to the UTF-8 requirement, along the lines of: if the fileref is created by a prompt, where the user can choose a file type, then the library is allowed to write what the user asks for.

UTF-8 could be required to be one of the choices offered. You might also restrict this behaviour to fileusage_Transcript only.

I don’t know if I’ll ever manage to resurrect my implementation, so it may not be worth worrying about for now.

That is a fair point. I had HTML transcript output in the back of my mind when I originally designed the spec, but nobody ever went that direction.

I’ve added a couple of lines:

…and notes to this effect on glk_stream_open_file_uni() and glk_fileref_create_by_prompt().