Z-machine text encoding - wasteful?

It seems to me that there is never any point in using Z-characters 0–5 in alphabet A1 or 0–4 in A2, in version 3+ of the Z-machine. This would mean that eleven positions in the alphabet table are wasted. That’s 11.5% of the whole alphabet table.

Seems crazy wasteful. But the people designing the Z-machine were very competent people. Am I missing something here?

I believe that, in theory, it was possible to shift permanently from alphabet A0 to A1 or A2 by using two shift characters in a row. So from A0, Z-character 4 twice in a row would put you permanently in A1, and Z-character 5 twice in a row would put you permanently in A2. From A1, a Z-character 5 would put you back into A0; from A2, a Z-character 4 would put you back into A0. Abbreviations would still occupy Z-characters 1 to 3 in all alphabets.

I think at least some of Infocom’s interpreters do this, but their compiler never actually compiled text that made use of it.


As I understand it, the “shift lock” characters were part of the original (v1) Z-machine design. They dropped the idea in v3, but didn’t go back to redesign the rest of the encoding.


I’m pretty sure early versions of the 1.1 standard proposal tried to put the V3+ shiftlock feature into the spec, but it was removed for (as I recall) a mixture of reasons: it wasn’t totally clear that it was an intentional feature, and it was unlikely ever to be used (since text is such a fundamental feature that any game relying on shiftlock would fail on terps that hadn’t been updated).

The feature is also mentioned in the Infocom ZIP docs I posted about, although it’s written confusingly enough that I didn’t know what it was on about until I’d re-read it quite a few times.

The first 6 values are universal over all character sets. 0 means space. 1, 2, or 3 means to use one of the special words (see below). 4 and 5 are shift characters. Each permanently or temporarily changes the character set to one of the other two:

            New Character Set (P=perm, T=temp)
Old C.S.      4     5
0            1T    2T
1            1P    0P
2            0P    2P
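To make the mechanics concrete, here’s a minimal decoder sketch of that state machine (Python; the SHIFT transition map is transcribed from the table above, while the function names, the v3-style alphabet rows, and the example are my own invention):

    # Sketch of the shift/shift-lock state machine from the table above.
    # The SHIFT map is a direct transcription; everything else is mine.
    A0 = "abcdefghijklmnopqrstuvwxyz"
    A1 = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    A2 = "@\n0123456789.,!?_#'\"/\\-:()"  # v3-style row; '@' stands in for the escape entry
    ALPHABETS = [A0, A1, A2]

    # (old character set, shift char) -> (new character set, permanent?)
    SHIFT = {
        (0, 4): (1, False), (0, 5): (2, False),  # from A0: temporary shifts
        (1, 4): (1, True),  (1, 5): (0, True),   # from A1: lock A1 / back to A0
        (2, 4): (0, True),  (2, 5): (2, True),   # from A2: back to A0 / lock A2
    }

    def decode(zchars):
        locked = current = 0
        out = []
        for z in zchars:
            if z == 0:
                out.append(" ")
                current = locked
            elif z in (4, 5):
                current, permanent = SHIFT[(current, z)]
                if permanent:
                    locked = current
            elif z >= 6:
                out.append(ALPHABETS[current][z - 6])
                current = locked
            # 1-3 (abbreviations) omitted in this sketch
        return "".join(out)

    # "4 4" locks into A1, so a whole run of capitals costs two shifts total:
    print(decode([4, 4, 13, 10, 17, 17, 20]))  # -> HELLO

It also makes the original poster’s point visible: once shifts are temporary-only (as in v3+), sequences like “4 0” or “4 4” never encode anything you couldn’t encode more cheaply, so those slots go unused.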

I wonder why they didn’t use Huffman coding. Was it not well known at that time? (Even though it was already nearly 30 years old?)

I don’t know enough about Huffman coding, or old microcomputers, to really be able to answer that one (although there is mention of Huffman coding for image files), but at a guess I’d say either memory or speed concerns. Or both.

I asked a question at the Retrocomputing Stack Exchange:

Don’t have a stackexchange account so I can’t comment there – but I will point out that the argument that variable-width codes are too expensive to decode seems mighty suspect. GIF87 required variable-width LZW codes, and that didn’t slow its adoption any, even on very low-oomph hardware.

The trouble with Huffman coding is that over any large input corpus it loses efficiency: it can’t adapt to local variations in the data statistics, and that leaves a lot of money on the table. You could imagine using multiple Huffman tables, each optimized for, say, the text associated with some related rooms or objects. But that’s a lot of complexity, and the Z-machine designers may not have wanted it, and/or might have put significant weight on the time to compile a game, not just play it.

Or they might just not have known about Huffman or other then-advanced compression approaches. A lot of people in the industry back then had no formal computer science training, because that academic discipline didn’t even exist before the seventies.

This was MIT, not industry, though. Or people who had just left MIT.

I suspect the problem with Huffman was that it requires an encoding table in memory at all times. It’s read-only but you can’t swap it. Always-on storage space was a critical resource when porting to the smaller 8-bit platforms.

Hm. I’m pretty skeptical of that argument: the representation of a Huffman code doesn’t need to take much more space than the list of symbols. And the decode loop can be absolutely trivial as long as you don’t need megabytes-per-second throughput.

So I got curious about this and decided to experiment with doing brain-dead Huffman encoding on English text. I lifted some logic from libjpeg and ran it on a text file I had handy (Project Gutenberg’s Lovecraft corpus, actually, which is about 3MB). Using a single Huffman table optimized for the whole input, I ended up with output of just about 4.5 bits per input character.
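For anyone who wants to reproduce this, the experiment boils down to something like the following (a from-scratch Python sketch, not the libjpeg-derived code I actually used; the corpus file name is a placeholder):

    # Build one Huffman code over the whole corpus and report the average
    # code length in bits per input character.
    import heapq
    from collections import Counter

    text = open("corpus.txt", encoding="utf-8").read()  # placeholder file name
    freq = Counter(text)

    # Standard Huffman construction: repeatedly merge the two lightest
    # nodes. Each heap entry is (weight, tiebreak, {symbol: code length}).
    heap = [(n, i, {ch: 0}) for i, (ch, n) in enumerate(freq.items())]
    heapq.heapify(heap)
    tiebreak = len(heap)
    while len(heap) > 1:
        w1, _, d1 = heapq.heappop(heap)
        w2, _, d2 = heapq.heappop(heap)
        merged = {ch: depth + 1 for ch, depth in {**d1, **d2}.items()}
        heapq.heappush(heap, (w1 + w2, tiebreak, merged))
        tiebreak += 1

    _, _, code_lengths = heap[0]
    total_bits = sum(freq[ch] * code_lengths[ch] for ch in freq)
    print("distinct characters:", len(freq))
    print("bits per character: %.2f" % (total_bits / len(text)))

Run on a few megabytes of English prose, this should land in the same ballpark as the 4.5 bits/character figure above.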

This is pretty disappointing compared to the typically cited result that the entropy of English text is around 0.6 to 1.3 bits per character. I think what it’s showing is that you have to exploit intercharacter correlation if you want to get near that number with just Huffman coding as the backend. Obviously, you could work on that and other refinements if you were really motivated to take this path. But what I’m suspecting now is that the Z-machine architects may have tried this, gotten results in the same ballpark as I just did, and concluded that it wasn’t worth the trouble compared to getting ~5.3 bits/character (three 5-bit Z-characters packed into each 16-bit word) with their trivial scheme.
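As a cheap way to put a number on that intercharacter correlation, here’s a back-of-envelope sketch (my own, same placeholder file name) computing the first-order conditional entropy, i.e. the best you could do with a separate code table for each preceding character:

    # First-order conditional entropy H(next | prev): the floor for any
    # scheme that picks a code table based on the previous character.
    import math
    from collections import Counter

    text = open("corpus.txt", encoding="utf-8").read()  # placeholder file name
    pairs = Counter(zip(text, text[1:]))  # bigram counts
    prev = Counter(text[:-1])             # context counts

    total = sum(pairs.values())
    h = 0.0
    for (a, b), n in pairs.items():
        p_pair = n / total       # P(prev = a, next = b)
        p_cond = n / prev[a]     # P(next = b | prev = a)
        h -= p_pair * math.log2(p_cond)
    print("order-1 conditional entropy: %.2f bits/char" % h)

On English prose this typically comes out somewhere around 3–3.5 bits/character: clearly better than a single table, but still a long way from Shannon’s 0.6–1.3, which needs much longer contexts.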


So what was the table size?

…Glulx uses Huffman encoding, because I wanted something easy and keeping game data in memory is no longer a problem. Advent.inf winds up using just under 1k of ROM for the table, and about 45k for the compressed string data. That’s from 70k of plaintext.

1k of always-in memory would be a noticeable hit in 1982, I think.

That aside, I guess my compression ratio (45k from 70k, or about 5.1 bits per character) is not significantly better than Z-machine encoding. On the up side, it doesn’t blow up for non-ASCII text.


And for a final comment: it may be worth looking at the Infocom ZipTest.z3 that I posted (https://eblong.com/infocom/#ziptest) and seeing if it has shift-lock tests.

Hm, well, there were 97 distinct characters in the input corpus. So a minimal definition of the encoding (a la JPEG DHT markers) would require 16 bytes (for a list of counts of codes of different lengths, assuming you concur with JPEG’s implementation decision to limit Huffman codes to 16 bits) plus 97 bytes for the code-order list of the input characters, plus maybe a couple bytes for counts and suchlike overhead.

A simple decoder requires additional tables that can be derived from these (sketched below):

  • list of the maximum code value for each code length (so 32 bytes);
  • starting offset in the code-order list of the symbols for each code length (so 16 bytes, or maybe 32 if you want to allow for more than 256 symbols)

(JPEG decoders typically build some additional derived tables for more speed, but you could dispense with that if you only need decoding at human reading speed, even on 70s micros.)
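To show how little code the decode side actually needs, here’s my own Python rendering of the canonical-Huffman scheme just described (the counts-per-length and code-order symbol arrays are the stored representation; mincode/maxcode/offset are the derived tables; function names and the toy example are mine):

    # Sketch of a JPEG-DHT-style canonical Huffman decoder. Inputs are
    # counts[l] = number of codes of length l+1, and symbols = the
    # characters in code order. The derived tables are built once.

    def build_decode_tables(counts):
        mincode, maxcode, offset = [], [], []
        code, index = 0, 0
        for length in range(1, len(counts) + 1):
            n = counts[length - 1]
            mincode.append(code)
            maxcode.append(code + n - 1 if n else -1)  # -1: no codes of this length
            offset.append(index)
            code = (code + n) << 1  # canonical rule: next length starts here
            index += n
        return mincode, maxcode, offset

    def decode_one(bits, pos, tables, symbols):
        """Read one symbol starting at bit position pos; return (symbol, new pos)."""
        mincode, maxcode, offset = tables
        code = 0
        for length in range(1, len(mincode) + 1):
            code = (code << 1) | bits[pos]
            pos += 1
            if maxcode[length - 1] >= 0 and code <= maxcode[length - 1]:
                return symbols[offset[length - 1] + (code - mincode[length - 1])], pos
        raise ValueError("invalid code")

    # Toy example: code lengths {a:1, b:2, c:3, d:3} -> counts by length [1, 1, 2]
    tables = build_decode_tables([1, 1, 2])
    symbols = "abcd"
    bits = [0, 1, 0, 1, 1, 0, 1, 1, 1]  # "a", "b", "c", "d"
    pos, out = 0, []
    while pos < len(bits):
        sym, pos = decode_one(bits, pos, tables, symbols)
        out.append(sym)
    print("".join(out))  # -> abcd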

If I were trying to do this in the confines of a Z-machine-like setting, I’d likely insist that the compiler create these additional tables, just because the extra code required to compute them from the minimal representation would take more than 64 bytes. Also, if you did that, I think you could skip storing the list-of-counts array, so that the net hit is that much less. So while I didn’t spend time on specifying an exact table representation, I’d expect it to fall short of 200 bytes for this example.

It might well be that the JPEG spec (written in the late 80s) reflects knowledge about how to implement Huffman decoding efficiently that didn’t exist ten years before, so maybe if somebody had been trying to do this in the 70s they’d have come to unnecessarily pessimistic conclusions about how big the table needed to be. But it doesn’t look to me like the table really needs to be much bigger than what it’d take to have an explicit representation of a Z-machine alphabet.