A 32-bit Å-machine

I’m sort of surprised it uses the high bits for the tag: I’m used to seeing the low bits used and then you can just arithmetic-shift right to sign-extend. But I guess you could shift up and then arithmetic-shift down, if that feels cleaner to you?

Edit: whoops, I didn’t see the second question. Again, you could maybe shift up and arithmetic-shift down, then see if it’s still the same? If not, it has overflowed? Not sure if there’s another trick for that but it’s been a while.

1 Like

Ooh, interesting thought! I need to figure out what JavaScript actually guarantees about its integer types; I know they’re usually 32-bit signed (once you do something that stops them from being floats, like a bit shift), but I don’t know how much of that is specified and how much is convention. If it’s guaranteed, then shifting left and shifting right sounds perfect.

Oh, wait, I think the trick for overflow is to add a 1 in the 29th bit, converting 0000b and 1111b to 0001b and 0000b, and then do an unsigned compare to make sure it’s < 2. So add 0x10000000 and check if the result is unsigned less than 0x20000000.

And yeah, IIRC it’s specified that JS’s integers are 32-bit signed once you convert, so the default >> is arithmetic shift. Hmm. But then I’m not sure if it has an unsigned compare, so that trick might not work in JS.

Another question to people who have worked on compilers and such. Have you ever seen a line number (for tracing and error messages and such) larger than 216?

The existing spec categorizes its constants as BYTE (8-bit), INDEX (14-bit), WORD (16-bit), or LONG (32-bit). I’m now trying to split WORD into WORD (either 16-bit or 32-bit depending on architecture) and SHORT (always 16-bit). And I suspect that the line number for a tracepoint can be a SHORT.

In C++ sign extension is a two step bit shift trick. If bit 27 is the sign bit then you want to shift left 4 moving it to bit 31, the sign bit, then shift back right 4 moving the number back. The right shift fills in bits 28-31 with the sign bit. If your top 4 bits are 0 then number is not negative, if it is all 1’s then it is negative. Shifting back to the right will fill bits 28-31 with the sign, either 0 or 1.

For overflow detection a 32 bit signed integer can fit into 28 bits only if the top 4 bits are all 0s or all 1s. Any other pattern indicates overflow.

2 Likes

Ah. JS does have an unsigned/zero-fill right-shift though.

const overflow28 = x => (x+0x10000000 >>> 28) > 1

(given the operator precedence, you don’t need the parentheses there but I think it reads better. you could add another set around the addition but I’ll often use spacing to indicate that instead, when the operator precedence works. YMMV)

2 Likes

Yeah, that happens sometimes in the wider software world. Usually only in files that aren’t directly edited by humans, such as:

  • Output of a compiler that turns the entire input program into a single file in another language, such as Inform emitting Inform 6 code.
  • Huge auto-generated tables representing e.g. Unicode data.
  • Many separate source files concatenated into a single file for build/distribution, e.g., the SQLite amalgamation.

But regardless of why the files are so large, non-IF compilers are generally expected to handle them somewhat gracefully.

There are also occasionally actual human-edited source files which approach or cross 216 lines. An open source example I’ve personally had to deal with in the past is this one, currently sitting at 64594 lines (very slightly less than 216), but I think it was even larger in the past.

I don’t know if that’s at all relevant for Dialog, though.

2 Likes

Good to know! Dialog internally stores line numbers in a 24-bit unsigned variable[1], so I’ll bump the Å-machine’s equivalent up to 32 bits. That way the compiler will break down before the interpreter will.


  1. It’s a uint32_t where the remaining 8 bits are the file number. ↩︎

This one-room I7 game I’m looking at is over 81000 lines of I6 code. The I6 compiler very definitely has to deal with that.

2 Likes

In the same file?

I7 always generates a single I6 source file and then compiles that to a game.

1 Like

Okay! Here’s my initial sketch. Please comment if you see any issues, have any questions, or otherwise want to give feedback; I don’t have a lot of experience designing VMs on my own. (I recommend highlighting sections and hitting “quote” as you go, because this is long.)

Literals

These are the values that don’t need heap storage: they fit in a single word. On the 16-bit Å-machine, they used variable-width tags to give each type a different amount of the available space: integers needed a full half of it, while characters only needed one byte.

000vvvvv vvvvvvvv	object (v > 0)
001vvvvv vvvvvvvv	dictionary word (v < $1E00)
00111110 vvvvvvvv	character
00111111 00000000	[] (nil, the empty list)
00111111 00111111	sentinel for unused memory
00111111 vvvvvvvv	(reserved) (all other v)
01vvvvvv vvvvvvvv	unsigned integer

On a 32-bit machine, there’s plenty of space to go around. I propose switching to consistent four-bit tags: every literal type uses four bits for the tag. One of those bits marks it as a literal; the other three determine the type.

0000 object
0001 dictionary word
0010 character
0011 special constant
0100 signed integer
0101 (reserved)
0110 (reserved)
0111 (reserved)
0000vvvv vvvvvvvv vvvvvvvv vvvvvvvv		object (v > 0)
0001vvvv vvvvvvvv vvvvvvvv vvvvvvvv		dictionary word
00100000 000vvvvv vvvvvvvv vvvvvvvv		character
0010vvvv vvvvvvvv vvvvvvvv vvvvvvvv		(reserved) (v > $10FFFF)
00110000 00000000 00000000 00000000		[] (nil, the empty list)
00111111 00111111 00111111 00111111		sentinel for unused memory
0011vvvv vvvvvvvv vvvvvvvv vvvvvvvv		(reserved) (all other v)
0100vvvv vvvvvvvv vvvvvvvv vvvvvvvv		signed integer
0101vvvv vvvvvvvv vvvvvvvv vvvvvvvv		(reserved)
0110vvvv vvvvvvvv vvvvvvvv vvvvvvvv		(reserved)
0111vvvv vvvvvvvv vvvvvvvv vvvvvvvv		(reserved)

Key points:

  • This allows storing a full Unicode codepoint in the “character” type
  • The sentinel value for unused memory ($3F3F) remains, and is still part of the “special constants” section
  • On the 16-bit machine, the integer range (unsigned 14-bit) was chosen so it could represent all possible four-digit numbers. This new range (signed 28-bit) can represent all possible eight-digit numbers, while also handling signed math.
  • There’s a big chunk of the literal space remaining for future expansion, in case there’s some new type of literal we want to implement (maybe for Dialog, maybe for some other language)
  • There’s still no object 0, so “all zeroes” works as a null value that doesn’t represent anything

Live values

These are values living in registers or on the heap. Previously, these used three-bit tags:

0vvvvvvv vvvvvvvv	literal (v > 0)
100vvvvv vvvvvvvv	reference (word index into the heap)
101vvvvv vvvvvvvv	(reserved)
110vvvvv vvvvvvvv	pair (cons cell; word index into the heap)
111vvvvv vvvvvvvv	extended dictionary word (word index into the heap)

Like with literals, I propose switching to four-bit tags. There are plenty of bits available, and using four-bit tags consistently across data types can make them easier to generalize about.

0vvv	literal
1000	reference
1001	(reserved)
1010	pair
1011	extdict
11**	(reserved)
0vvvvvvv vvvvvvvv vvvvvvvv vvvvvvvv		literal (v > 0)
1000vvvv vvvvvvvv vvvvvvvv vvvvvvvv		reference (word index into the heap)
1001vvvv vvvvvvvv vvvvvvvv vvvvvvvv		(reserved)
1010vvvv vvvvvvvv vvvvvvvv vvvvvvvv		pair (cons cell; word index into the heap)
1011vvvv vvvvvvvv vvvvvvvv vvvvvvvv		extended dictionary word (word index into the heap)
11**vvvv vvvvvvvv vvvvvvvv vvvvvvvv		(reserved)

I kept the one (reserved) tag in between “reference” and “pair” for consistency with the 16-bit machine. If some new heap type is implemented on the 16-bit machine, it can take that slot on both versions. If some new heap type is implemented on the 32-bit machine only (like floats), it can use one of the new tags at the end.

(See below in the serialized data section for some other options. We could also give pairs and extdicts the tags 1100 and 1110, so that their prefixes look the same as before in hex.)

Persistent variables

These are values living in global and object variables, which need to remain valid no matter how the heap grows and shrinks. Previously, these used one-bit tags:

00000000 00000000	unset/null
0vvvvvvv vvvvvvvv	literal (v > 0)
1vvvvvvv vvvvvvvv	word index into the long-term heap

And…in this case I don’t really see a need to change that. It’s either a literal or it’s not. We could switch to four-bit tags for consistency—we certainly don’t need 31 bits’ worth of long-term storage space—but I don’t know what would ever get stored in a variable beyond “literal” and “pointer to non-literal”.

00000000 00000000 00000000 00000000		unset/null
0vvvvvvv vvvvvvvv vvvvvvvv vvvvvvvv		literal (v > 0)
1vvvvvvv vvvvvvvv vvvvvvvv vvvvvvvv		word index into the long-term heap

Serialized data

When data is stored on the main heap, it uses the live value encoding above. But sometimes, arbitrary data needs to be serialized into a sequence of words that can be stashed on the aux heap or put into long-term storage. That’s what this is for.

Previously this could either be analyzed as variable-length tags or three-bit tags:

00000000 00000000	end of stream
0vvvvvvv vvvvvvvv	literal (v > 0)
10000000 00000000	unbound variable
10000001 00000000	extdict (two elements) follows
10vvvvvv vvvvvvvv	reserved (all other v)
110vvvvv vvvvvvvv	proper list with v elements follows
111vvvvv vvvvvvvv	improper list with v elements follows

And like with the other data types, I think switching to four-bit tags for consistency will make implementation easier without sacrificing anything important.

0vvv	literal
1000	fixed-length data type
1001	(reserved)
1010	proper list
1011	improper list
11**	(reserved)
00000000 00000000 00000000 00000000		end of stream
0vvvvvvv vvvvvvvv vvvvvvvv vvvvvvvv		literal (v > 0)
10000000 00000000 00000000 00000000		unbound variable
10000001 00000000 00000000 00000000		extdict (two elements) follows
1000vvvv vvvvvvvv vvvvvvvv vvvvvvvv		(reserved) (all other v)
1001vvvv vvvvvvvv vvvvvvvv vvvvvvvv		(reserved)
1010vvvv vvvvvvvv vvvvvvvv vvvvvvvv		proper list with v elements follows
1011vvvv vvvvvvvv vvvvvvvv vvvvvvvv		improper list with v elements follows
11**vvvv vvvvvvvv vvvvvvvv vvvvvvvv		(reserved)

The placement of the (reserved) sections is somewhat arbitrary. I wanted to be similar to the existing encoding, but we could also (e.g.) use 1100 and 1110 for proper and improper lists, which would make their prefixes look the same as before in hexadecimal. I don’t expect to ever run into a list whose length needs all 24 bits, so the details aren’t vitally important here.

Instruction operands

Many of these types don’t really need to change.

BYTE (small numeric constant)

No change.

VBYTE (small object number)

No change. Less important without the 16-bit space constraints, but the first 255 objects still get referenced a lot.

WORD (large numeric constant)

Increase from 16 bits to 32 bits. I considered changing this type to SHORT and making it always 16 bits, but the only place it’s ever used is for line numbers when tracing, and it sounds like those do exceed 16 bits sometimes.

VWORD (literal)

Increase from 16 bits to 32 bits, because the range of literals increased to 32 bits.

RAW (numeric constant, or register holding a numeric constant)

Currently, the Å-machine has $40 (64) global registers and $40 (64) env slots. We could increase this limit, but…is there any reason to? Unlike on Z-machine, the Å-machine’s registers aren’t generally used for global variables; they’re only used for routine arguments, temporaries (local variables), and a couple special purposes. Arguments are limited by the Dialog compiler, so temporaries are the only ones you can run out of—and I’ve never seen that happen without a compiler bug.

So I don’t see a real need to increase the number of registers. I’m happy to be persuaded, though!

Previously:

0vvvvvvv vvvvvvvv	number 0000-7FFF
10vvvvvv			value of register v (00-3F)
11vvvvvv			value of env slot v (00-3F)

Now:

0vvvvvvv vvvvvvvv vvvvvvvv vvvvvvvv		number 0000 0000 - 7FFF FFFF
10vvvvvv								value of register v (00-3F)
11vvvvvv								value of env slot v (00-3F)

VALUE (literal, or register holding a literal or heap reference)

Same as RAW. Same rationale applies. Whatever we decide for one should apply to the other.

Previously:

0vvvvvvv vvvvvvvv	literal (v > 0)
10vvvvvv			value of register v (00-3F)
11vvvvvv			value of env slot v (00-3F)

Now:

0vvvvvvv vvvvvvvv vvvvvvvv vvvvvvvv		literal (v > 0)
10vvvvvv								value of register v (00-3F)
11vvvvvv								value of env slot v (00-3F)

DEST (where to store a result)

As with the last two, this just depends on the number of registers. My current proposal is no change.

Previously and now:

00vvvvvv		store into register v (00-3F)
01vvvvvv		store into env slot v (00-3F)
10vvvvvv		unify with register v (00-3F)
11vvvvvv		unify with env slot v (00-3F)

INDEX (medium-sized numeric constant)

This type currently has a range of 0 to $3FFF, for a total of 16,384 possible values. It’s used for variable and property numbers (so this caps the number of global and per-object variables), style classes (so this caps the number of style classes for divs, spans, and body styles), and wordmaps (an optimization trick for looking up objects from dictionary words).

Are these caps worth increasing? It would cost us basically nothing to increase, because the first 192 values can be stored in one byte, so the vast majority of projects would see no change at all.

Previously:

0vvvvvvv			Index v (00-7F)
10vvvvvv			Index v+80 (80-BF)
11vvvvvv vvvvvvvv	Index v (0000-3FFF)

Now, it could stay the same, or it could be:

0vvvvvvv					Index v (00-7F)
10vvvvvv					Index v+80 (80-BF)
11vvvvvv vvvvvvvv vvvvvvvv	Index v (00 0000 - 3F FFFF)

Or:

0vvvvvvv							Index v (00-7F)
10vvvvvv							Index v+80 (80-BF)
11vvvvvv vvvvvvvv vvvvvvvv vvvvvvvv	Index v (0000 0000 - 3FFF FFFF)

I’m ambivalent on this one. I don’t think the decision we make here will affect much of anything.

CODE (byte address in the CODE chunk to jump to)

This one’s easy. It can encode either absolute or relative jumps; we increase the range of absolute jumps but not relative ones.

Previously:

00000000					fail
00vvvvvv					+v bytes (01 - 3F) from the beginning of the operand
01vvvvvv vvvvvvvv			+v bytes (signed -2000 - 1FFF) from " " " " "
1vvvvvvv vvvvvvvv vvvvvvvv	absolute address v (00 0000 - 7F FFFF)

Now:

00000000								fail
00vvvvvv								+v bytes (01 - 3F) from the beginning of the operand
01vvvvvv vvvvvvvv						+v bytes (signed -2000 - 1FFF) from " " " " "
1vvvvvvv vvvvvvvv vvvvvvvv vvvvvvvv		absolute address v (0000 0000 - 7FFF FFFF)

The fact that we’ve increased the size of various other operands might mean that more jumps need to be encoded in absolute form instead of relative forms. But I don’t expect that to make any significant difference. Jumps within a routine will usually be in relative form, jumps between routines will usually be in absolute form (the Å-machine has no built-in call stack or routine call opcodes), and that won’t change.

STRING (shifted byte pointer in the WRIT chunk)

Like with CODEs, STRINGs can be encoded in three different ways. And like with CODEs, I propose increasing the size of the largest one while leaving the smaller two untouched. “S” is a byte value stored in the header; currently, Dialog always stores 0.

Previously:

0vvvvvvv					address v << 1
10vvvvvv vvvvvvvv			address v << S
11vvvvvv vvvvvvvv vvvvvvvv	address v << S

Now:

0vvvvvvv								address v << 1
10vvvvvv vvvvvvvv						address v << S
11vvvvvv vvvvvvvv vvvvvvvv vvvvvvvv		address v << S

This one I’m torn on, for a few reasons:

  • Is it worth changing at all, when a larger S value would already allow for more string space?
  • Is it worth shifting long pointers by S, when a 30-bit value on its own can already address so much space?
  • Is it worth keeping the weird anomaly where tiny pointers use 1 instead of S as their shift value?

But I think doing it this way, keeping it as consistent as possible with the 16-bit version, will make it easier to implement. It probably won’t be useful to have an S-value higher than 1, but most strings already fit in a short pointer, even with an S-value of zero.

2 Likes

Data structures

There are various other data structures in a story file that aren’t values or operands. These I’m much less confident in than the plain data types.

String-decoding tree

Strings are encoded with a Huffman-esque binary tree.

Previously, with a one-byte character set, that tree was stored as an array of up to $7F (127) two-byte nodes. The first byte indicated what to do when reading a 0; the second byte indicated what to do when reading a 1.

$00-5E: Print character ($20 plus x) and return to the first node.
$5F: Read N bits from the string, MSB first, where N is the number of bits necessary to identify all the “excess” characters (characters $A0 and above) and all the dictionary words. Print that character or dictionary word.
$60-7F: Print character ($20 plus x) and return to the first node.
$80: Stop.
$81-FF: Go to tree node (x minus $80), advance to the next bit of the string, and repeat this process.

(This is why character $7F was reserved, by the way. It can’t be printed because it’s needed as an escape character.)

Now, I propose storing the tree as an array of two-word (eight-byte) nodes (unlimited number). The first word indicates what to do when reading a 0; the second word indicates what to do when reading a 1. As before, I’m using four-bit tags.

00000000 00000000 00000000 00000000		stop
0001vvvv vvvvvvvv vvvvvvvv vvvvvvvv		print dictionary word v
00100000 000vvvvv vvvvvvvv vvvvvvvv		print character v
1000vvvv vvvvvvvv vvvvvvvv vvvvvvvv		go to node at word address v

Concerns about this:

  • Like before, dictionary words can be included in the compression. Unlike before, they need to be stored in the Huffman tree to take advantage of this. Is that a problem? Should we replace the “dictionary word” type with “pop N bits from the compressed string and use that to index into the dictionary”?
  • This could all be done with three bytes for each branch instead of four, but scanning through an array four bytes at a time could be faster than three bytes at a time. Is either of these (space or time) a concern at all?
  • There’s no way the Huffman tree will contain every Unicode character. We could also have a table of relevant characters, and work with indices into that table rather than raw Unicode values.

Word ending decoder

This weird little block of data encodes removable word endings. If a word is not found in the dictionary, it will be passed through this block. The goal is to split the word into a recognized prefix and recognized suffix.

Previously, this was an array of up to 256 instruction bytes. Start with the whole word in the prefix and nothing in the suffix.

$00: Nothing worked. Abort the whole process.
$01: Check if the prefix is a recognized dictionary word. If so, succeed with this prefix and suffix. If not, continue.
$XX $YY: Is the last character of the prefix $XX? If so, move it from the prefix to the suffix, then jump to byte $YY of the decoder. If not, continue.

Now…I’m honestly not sure how to implement this. It’s a fiddly detail that’s very seldom actually used. We could have a table of relevant Unicode characters and work with indices into that table rather than raw codepoints.

Word-to-object maps

For each (determine object) block (a construction that maps objects to dictionary words), the compiler does its best to statically analyze what words can be emitted and turn them into “wordmaps”: mappings from dictionary words to one of three things: “too many objects” (like “the”), one specific object, or a list of possible objects. If a word doesn’t appear in the wordmap at all, it maps to no objects.

Previously, this was an array of maps (number capped by the INDEX type), where each map was an array of entries, and each entry was two words: a dict/char literal, and a result value.

Those result values:

00000000 00000000	wildcard (matches too many objects to list, like "the")
vvvvvvvv vvvvvvvv	pointer to payload (v < $E000)
111vvvvv vvvvvvvv	single object v

Each “payload” was an array of bytes later in the chunk:

00000000			end
vvvvvvvv			object v (short form, v < $E0)
111vvvvv vvvvvvvv	object v (long form)

Now, I’m proposing keeping this basically the same, but switching the results to use four-bit tags:

00000000 00000000 00000000 000000000	wildcard (matches too many objects to list, like "the")
0000vvvv vvvvvvvv vvvvvvvv vvvvvvvv		single object v
1000vvvv vvvvvvvv vvvvvvvv vvvvvvvv		pointer to payload

(Why not use one-bit tags? Because the number of objects is already capped by the literal format. But we could let the payload pointers be up to 31 bits long. It wouldn’t hurt anything.)

The payload format would also be basically the same, but support larger object numbers:

00000000								end
0vvvvvvv								object v (short form, up to $7F)
10vvvvvv vvvvvvvv						object v (long form, up to $3FFF)
1100vvvv vvvvvvvv vvvvvvvv vvvvvvvv		object v (full form)

Miscellaneous

There are some other places where the spec currently uses a BYTE[2] that should be changed to either a WORD or a SHORT. That’s mostly busywork but I’ll get a proposal together once the meat of the matter is figured out.

But there is one other thing I’d like to include in this draft. Opcode $FF is currently undefined (it’s the high-bit counterpart to TRACEPOINT, used for (trace on) purposes).

I’d like to formally reserve it for 32-bit purposes (call it EXT32). Unlike all other opcodes, its number and type of operands can vary; but 16-bit systems will never have to deal with that. If they see a $FF byte in the instruction stream, they should crash immediately instead of trying to skip past it (which requires knowing what operands it wants).

What will this instruction do? No idea! So far, none of this proposal has needed new instructions. But reserving it now as a future-proofing measure means we won’t have to worry about conflicts with the 16-bit spec.

I don’t know if it makes sense to fundamentally diverge from the 16-bit Å-machine here, but for what it’s worth, in a new IF VM without compatibility constraints I’d argue for doing Huffman encoding very differently, so I have to get on the soap box for a minute. Explicitly storing a binary tree is textbook in that Huffman coding is often taught that way, but textbooks have done everyone a disservice here. Actually writing down the tree anywhere other than a blackboard is unnecessarily awkward and inefficient.

A classic reference on this subject is “On the Implementation of Minimum Redundancy Prefix Codes” by Moffat and Turpin. It’s annoying to find a PDF online, here’s one that works right now. In brief:

  • There’s many different trees that give the same compression ratio, so specifying a Huffman tree has unnecessary degrees of freedom (i.e., takes more space to encode than necessary).
  • What actually matters is the alphabet and how many bits you use for the symbols. The codes per symbol fall out for free, you just have to pick one convention for how to assign codes and stick with it. That’s called a canonical Huffman code.
  • The algorithms for encoding and decoding canonical Huffman codes aren’t any more complicated than the tree-based ones (different, but IMO simpler). And the questions about how to store tree nodes somewhat compactly disappear – you never store those nodes at all.
  • On the decoder side, walking the tree one bit at a time is unnecessarily slow. Canonical Huffman codes give you a lot of options to optimize this. Maybe it doesn’t matter, but fact is that Glulxe and Quixe have a lot of complicated code to try and squeeze out four bits at a time of the Glulx version of the explicit Huffman tree.

Another (lesser) issue with classic Huffman coding is that it will sometimes produce ridiculously long codes for rare symbols. If there’s n different symbols being used, then Huffman can give you codes up to n bits long, depending on the frequencies of each symbol.

The standard solution is imposing a fixed upper limit on code length. Either the limit has to be large enough that the full alphabet can be encoded (e.g., 12 bits for 8-bit symbols), or there’s an escape symbol that means “read the next N bits and use them verbatim” (in this case: interpret as dictionary word of unicode character). Computing optimal codes in this setting requires a different algorithm, but the decoder is not affected (except that it may want to exploit the length limit to simplify optimizations).

3 Likes

i understand exactly none of this but it’s been fascinating to follow. i guess my own unenlightened question would be more big-picture:

a 32-bit A-machine doesn’t really help me (negative numbers in dialog, OTOH would be super) because i prefer to make games that can be downloaded as independent stand-alone files. right now, that can only be done using the C64 a-machine interpreter or, i guess, a file that a user can open in their browser.

i guess i would rather see the a-machine expanded to other platforms, i.e. a “bocfel” for the a-machine that could be plugged into other existing interpreters, rather than a new-improved A-machine with features that i won’t need (again, except maybe negative numbers). i don’t actually know if this is feasible or how hard it would be to do.

but i’m probably in the minority. many great dialog games have been released to be played inside the browser and they work great so maybe i’m just a ludite.

It’s definitely doable! It just hasn’t been done yet, because, well…nobody has wanted to. There’s a program that lets you play .aastory files in the terminal, for example, and we use that for running test cases. But its output is so limited, most people prefer to use the web interpreter instead.

(And if you’re using the web interpreter, you might as well bundle it up with your .aastory, because uploading .aastory files to a web site is a hassle. So that’s what most people currently do.)

But the specs are out there, and also reference implementations in both JavaScript and 6502 assembly, so in theory there’s nothing stopping anyone from porting that JavaScript code to C or C++ and plugging it into Gargoyle. The hardest part would be bridging the incompatibilities between the Å-machine’s output model and Glk’s. But apart from that, it’s just a tedious weekend’s work to translate all the code.

1 Like

This is very valuable feedback, thank you! In light of this (which Linus clearly knew, given the escape handling), I imagine the reason the Å-machine currently uses an explicit tree is for Commodore 64 compatibility. That way the C64 doesn’t have to do extra processing at startup (building the canonical tree).

In which case, there’s no reason at all to do that in the 32-bit version. I’m going to read more on these canonical Huffman algorithms and think about better ways to store this. (I also considered just storing strings uncompressed as UTF-8, but given that over half the data output by the compiler tends to be text, compressing it at all seems worthwhile.)

1 Like

In the Huffman encoders and decoders I’ve written, I prefer an adaptive algorithm. In an adaptive algorithm your maximum code length can vary. It largely depends on if the code will be doing a pass on the input to determine the frequency counts and dropping out symbols that will not end up in any output strings. I can share C code of my encoder but it is written mostly in x86 assembler.

Is it though? Wouldn’t gzip, Brotli or whatever is the best/most common today do a better job that we ever could, if really compression is needed?

Probably, but I think the priority here should be compression that does an adequate job and is straightforward to implement.

I recall a bit of annoyance a couple years ago regarding Glulx—when the Glulx spec was written, it was assumed that libc would always be available for every interpreter, so Glulx exposes some obscure libc functions that few IF authors ever need. Except then someone wanted to write an interpreter in Rust, which doesn’t have all of libc easily and safely available, and these obscure functions meant it needed a hefty external library that would almost never be used in order to conform to the spec.

So now I’m hesitant to say “just use gzip, it will always be available” in a VM spec. It’s not like file size is vitally important here; it sounds like canonical Huffman coding is more-or-less standardized, works just fine (albeit not top-of-the-line), and is easy to describe in the spec itself if someone wants to implement it by hand.

It might be that just UTF-8 on its own is enough compression for this purpose, and even canonical Huffman is overkill. I should see how much text a large Dialog game emits in raw UTF-8. But it’s enough text that I definitely don’t want to use uncompressed UTF-32.