I was reading through the manual, as one does, and I encountered this:
“(22.14. Exchanging files with other programs): An Inform file is currently a Unix text file (with 10 as the line division character), encoded as ASCII Latin-1. (We would like to use Unicode at some point in the future, but the Glk and Glulx layers are still not fully converted to Unicode.)”
The layers are up to date these days, so this requirement could be reconsidered.
The state of things is this: when you open an external file stream, you choose a byte or word stream, and a text or binary fileref flag; thus four possibilities. However, none of the four is guaranteed to be UTF-8.
- byte stream, text file (I7's current default for external files): Latin-1 character data, which may be translated to a platform-native character encoding in the file.
- byte stream, binary file: Latin-1 character data written directly as bytes.
- word stream, text file: could be UTF-8 or a platform-native encoding.
- word stream, binary file: Unicode character values written as big-endian four-byte integers.
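To make the two well-defined cases concrete, here is what the binary variants produce for a short Latin-1 string (a Python sketch, not tied to any particular Glk library; the two text-mode cases are deliberately left platform-dependent by the spec, so they aren't shown):

```python
s = "café"

# byte stream, binary file: Latin-1 character data written directly as bytes
print(s.encode("latin-1").hex())    # 636166e9

# word stream, binary file: one big-endian four-byte integer per code point
# (Python's utf-32-be codec happens to produce exactly this layout)
print(s.encode("utf-32-be").hex())  # 000000630000006100000066000000e9
```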
I7’s current behavior is conservative, so the right answer may be to leave it as-is. (Although it’s not absolutely safe unless you declare the external files as “binary”; cf. 22.11.)
It would be safe to switch over to word/binary files, but then exchanging them with other programs would become a nuisance. (It will always be possible to write Python/Perl/whatever to read streams of four-byte integers, but I don’t want to put people in the position of needing to.)
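For what it's worth, the reading side really is short in any modern language. A sketch in Python (the function name and error handling are my own, not part of any spec):

```python
import struct

def read_word_binary(data: bytes) -> str:
    """Decode a word/binary Glk file: big-endian four-byte code points."""
    if len(data) % 4:
        raise ValueError("length is not a multiple of 4 bytes")
    count = len(data) // 4
    # ">NI" = N big-endian unsigned 32-bit integers
    return "".join(chr(cp) for cp in struct.unpack(f">{count}I", data))

# "café" stored as four big-endian words
print(read_word_binary(bytes.fromhex("000000630000006100000066000000e9")))  # café
```

But a nuisance is still a nuisance; most text-handling tools won't do this out of the box.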
Possibly the right answer is to tighten the Glk spec, and say that word/text files must be UTF-8. I left it flexible because I was writing in the 20th century, when exchanging text files between Mac and Windows (never mind Unix) seemed like a necessarily hazardous operation. (I haven’t reviewed existing Glk libraries to see what they do. I don’t think we’re lucky enough to be in an “interpreters already do that” situation, but it’s worth checking.)
The usual way is to use a byte stream with UTF-8 encoding. That supports every language, while staying compatible with C character-string routines (strcpy() and the like), since UTF-8 text contains no embedded NUL bytes; tools that work with these files don’t need those routines modified. It’s also easy to convert UTF-8 to another encoding when loading a file for display, as most conversion APIs accept a NUL-terminated byte sequence as the UTF-8 source.
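To make the strcpy() point concrete: UTF-8 never emits a zero byte except for U+0000 itself, so NUL-terminated string handling survives intact even for non-Latin text. A quick check (Python sketch; the sample string is arbitrary):

```python
text = "naïve / ελληνικά / 日本語"
encoded = text.encode("utf-8")

assert 0 not in encoded                  # no embedded NUL bytes: safe for strcpy()
assert encoded.decode("utf-8") == text   # round-trips losslessly
print(len(text), "chars ->", len(encoded), "bytes")
```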
Related question: when writing a byte stream in text mode, the spec currently says: “Unicode values may be stored exactly, approximated, or abbreviated, depending on what the platform’s text files support.”
Is this worth keeping, or should we mandate Latin-1? Do Windows interpreters make an effort to store DOS-style line breaks when writing in text mode?
We could go all the way and say that there is no difference between text and binary mode when writing byte streams – they should behave identically.
(For word streams, i.e. glk_stream_open_file_uni(), the difference will be that text streams are UTF-8; binary streams are sequences of four-byte integers.)
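Under that proposal, the same character written to a word stream would serialize in the two modes like this (sketch; Python's utf-32-be codec stands in for the four-byte-integer format):

```python
s = "π"  # U+03C0

print(s.encode("utf-8").hex())      # cf80: text mode, two bytes
print(s.encode("utf-32-be").hex())  # 000003c0: binary mode, one four-byte word
```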
I’m a bit late to this thread, but I wanted to say that at one point I experimented with writing transcripts as HTML. It worked out OK, although I never got round to releasing anything.
So at some point I might lobby for an exception to the UTF-8 requirement, along the lines of: if the fileref is created by a prompt, where the user can choose a file type, then the library is allowed to write what the user asks for.
UTF-8 could be required to be one of the choices offered. You might also restrict this behaviour to fileusage_Transcript only.
I don’t know if I’ll ever manage to resurrect my implementation, so it may not be worth worrying about for now.