Next steps for Inform 6 compiler

LZ77 decompression routine is not CPU intensive. All the heavies are in the compression stage.

HOW MUCH WOOD CAN A WOODCHUCK CHUCK IF A WOODCHUCK CAN CHUCK WOOD

HOW MUCH WOOD CAN A (-11,4)CHUCK (-6,6) IF …

The problem is if you want to build custom made routines in order to avoid STR2SZ conversion. Something like this

[ PZ n;
switch (n) {
0: return "HOW MUCH ";
1: return “WOOD”;
2: return " CAN A ";
3: return PZ(1);
4: return “CHUCK”;
etc

If you want to print WOODCHUCK, you can do print PZ(1),PZ(4); for example. These can be used as entries for built in string substitution, the ones with ampersand. Note, in actual implementation, you’d skip case 3, because you’d just print PZ(1) directly.

The tricky part is where entries are overlapping. Should you do “WOOD”, " WOOD", or “WOODCHUCK” ? This isn’t a problem in pure LZ77 implementation due to buffer window. Doing it with routines will, by necessity, require careful chopping of the text.

Ok, I get the idea. You assign some special escape sequence in strings which is used to reference these string components, and when printing a string you instead print to a buffer and print from the buffer while checking for the escape sequence?

Yes. That’s right.

With internal Abbreviation/ lookup table you do ampersand-number, with numbers 0-63.

With LZ77 you do marker-#charskip-len. You don’t need separate table because the text entry is on the preceeding text stream. This is usually done by “rolling window” buffer, most commonly 64KB.
The key takeaway is that each LZ77 marker is a potential Abbreviation lookup table entry.

But this means you have to uncompress the preceding x KB to have access to the string you will need to insert, or you can only refer to substrings in the current string?

Yes. The former. Remember that LZ77 is designed as stream, so you need to fill in the buffer. But to do that, you need the previous buffer as well, all the way to the beginning.

You can have multiple partitions, each with its own beginning, but you really need to start from beginning in each case.

LZ78 deals with look up tables, and avoids such issues, I believe.

Edit:
For 8 bit computers with limited storage, I wouldn’t implement pure LZ77 decompression. Instead, I would implement LZ77 compressor to output custom Routine as described above. That way, decompression is fast and dynamic, but that’s because it acts as lookup tables.

I make a few assumptions here:

  1. That text compression are most valuable when you’re trying to fit games to 8-bit scene. For other formats you have 512kB (Z8) or 2MB (Glulx) and I don’t know of any game that have used this much text.
  2. The Z-machine standard is “fixed”. Meaning that no decompressing is done by the interpreter. Otherwise the use for compressing with LZ77 would be unusable because there exists no interpreter that could decompress the text.
  3. LZ77 with rolling buffer is not feasable because there is not enough memory available to allocate a buffer of usable size.

Routines as you suggest sounds to me to introduce quite a bit overhead, but I don’t know how much. Remember that ZASCII use 5 bits for each character and bunch them together in 2 bytes that contains 3 characters so each string are optimally of a length dividable by 3. When I run extraction of abbreviations of Z3-files the last abbreviations found (abbrev 80 - 96) are often small saves, usally under 75 bytes each. The escape-char to insert an abbreviation is quite small and if you replace an abbreviation with a CONSTANT or a GLOBAL the savings are often lost because the escape for these costs more and the replacement abbreviations that are found can’t compensate for this.

I’m having a hard time seeing how a compress/decompress without z-machine support and a rolling buffer could save much more space than the current system with abbreviations does. 96 abbreviations are a bit limiting but could be extended by using CONSTANTs and/or GLOBALs for common phrases. See the work that was done with Mini-Zork II over in this thread.

Am I missing something obvious?

2 Likes

The key takeaway is that each LZ77 marker is a potential Abbreviation lookup table entry.

You can use internal Abbreviation, which is quite efficient, and LZ77 compression method is near optimal in finding text snippets to compress.

If more compression is desired than what Abbreviation is available, then use your own look up table system.

Have you tried using GLOBALs for big saves and Abbreviation for small ones?

The manual snippet suggests that Routine saves 2 bytes per entry as compared to static array. I haven’t checked that, however.

These are examples in ZILF, but…

This was done alot in Mini-Zork II. If you look at the bottom of GLOBALS.ZIL you’ll see how it’s done there.

Note that if you can you should use GLOBALs (saves one byte extra) instead of CONSTANTs. The downside is that GLOBALs are a limited resource.

There is also a discussion here and here for doing this automatically for the ZILF compiler.

2 Likes

I think you mean 2 GB for Glulx.

Yes, of course…

HOW MUCH WOOD CAN A WOODCHUCK CHUCK IF A WOODCHUCK CAN CHUCK WOOD

What if I split the stream of LZ77 into table and text?

TABLE=" WOODCHUCK A CAN "
TEXT="HOW MUCH(0,5)(12,5)A(0,11)(5,6)IF(10,3)(0,11)(13,4)(5,6)(1,4)"

TABLE=" A WOODCHUCK CAN "
TEXT="HOW MUCH(2,5)(12,5)(1,12)(7,6)IF(0,17)(7,6)(3,4)"

The difference between this and traditional LZ77 is that it’s no longer streaming with rolling buffer, thus must load all table entries into one long string all at once. Like LZ77, there’s no waste of storage, and you still have the flexibility of keeping short clips in actual text.

Notice also the possibility of optimization by rearranging the text of the table, although process will surely take longer than LZ77 compression algorithm.

The question then is what is the maximum length of Inform strings( I think 256, as opposed to unlimited), and can I access it per individual characters, instead of only as a whole.

Those of you who are knowledgeable about algorithms, has this been done before? If so, by who? Any reference?

Pretty much any simple scheme for text compression has been tried before. I don’t know if this one has a name. It might turn out to be impractical as the offset/length pairs could just be too large.

Making any other compression scheme work with the Z-machine is a big problem, though (even ignoring speed and code space). The useful space is “high strings”, which are not directly addressable. You can print them to a buffer and then decompress the buffer, but you end up losing quite a few bits doing so. Other considerations are that you need a scheme which works with short strings that have many repeats across them; this rules out many LZ77 and LZ78 based algorithms. The so-called semi-adaptive dictionary coders (of which yours is one, as is the Z-machine’s abbreviation algorithm) are actually quite good for this.

2 Likes

The built-in abbreviation algorithm in the z-machine uses 2 zchars for indicate a place in the text that should be replaced with an abbreviation. 1st zchar is the token (1, 2 or 3) and the 2nd points to the abbreviation - 3x32 = 96 possible abbreviations.

Your scheme have the advantage that the abbreviations can use a variable starting point and a variable length, but at the cost that you’ll need more zchars for addressing the abbreviation.

If one assumes that one uses 3 zchars for each abbreviation. 1st is the token, 2nd & 3rd are 10 bits for address and length. If you split the 10 bits, for example, 7+3, the TABLE could be 2^7=128 zchars long and the length 0-7 zchars. This is a bit restrictive so I guess you would want at least one more zchar for address+length.

If I use your example and calculate a bit:

“HOW MUCH WOOD CAN A WOODCHUCK CHUCK IF A WOODCHUCK CAN CHUCK WOOD”
is 65 zchars long.

If you use your scheme the TABLE " A WOODCHUCK CAN " is 17 zchars and the TEXT “HOW MUCH(2,5)(12,5)(1,12)(7,6)IF(0,17)(7,6)(3,4)” would occupy 31 zchars with in the 3 zchars/abbrev and 38 zchars in the 4 zchars/abbrev. 48 or 55 zchars total.

I don’t know what a good abbreviation finder would find in this string. But lets assume it identifies " WOOD" (1) and "CHUCK " (2).
The string would be: “HOW MUCH(1) CAN A(1)(2)(2)IF A(1)(2)CAN (2)(1)” = 38 zchars + 11 zchars for the abbreviations = 49 zchars total.

Even if you could implement the scheme I can’t see that you would gain that much space instead of using the buildin one. For games where this is of most interest, to keep Z3 under 128kB you rarely exhaust the current 96 abbreviations anyhow.

2 Likes

That’s great calculation. I’ve come to the same conclusion as you. This algorithm is for the addition of the internal one. Its utility is best applied to works with high text to code ratio, for example, War and Peace – The CYOA. That’s certainly niche.

Didn’t someone did Heart of Ice? Something like that would be perfect.

It was used as a benchmark-file when we worked out improvements for calculating abbreviations.

1 Like

I’ve been struggling a little to normalise my new-lines across the logic; whilst a string literal is the equivalent of print_ret which appends a new-line, other methods do not do this, and I realised that this was was making the process so difficult, particularly where you want to print multiple paragraphs.

If you have some logic that might bail out early with a print_ret, you’ll get an automatic new-line; but then your truthy logic may go on to do multiple print calls and bits of logic and then you need to remember to print a new-line to normalise the output for all logic paths. There are instances where you don’t want to include the new-line in the string, because the same string might be used on different logic paths / routines where a normal print or a print_ret might be used… It gets confusing and difficult to track down errant / duplicate new-lines.

Please can we have a function that prints, and then includes a new-line but does not return. Yes, I could write such a thing, but it being in the compiler is going to be much tighter for code-gen and it’ll help a few less people go mad swatting new-lines.

The usual way to do this is

print (string) stuff, "^";

I’m not sure what an alternative would look like. We could follow the pattern of print_ret and add a new statement:

print_nl (string) stuff;

But is this clearer? It doesn’t save any typing. The “_nl” is at the beginning, which means it’s harder to see and is maybe inconsistent with the logic of what you actually want.

(Okay, it saves two characters of typing…)

2 Likes

Does the underscore need to be there? Other languages just use “println” for print followed by newline command.

I had assumed print_ret as “print then return”, i.e. 2 commands collapsed into one, hence the underscore.

I’m not thinking about how it’s spelled at this point, but whether it improves the language.

I think it will help those with prior programming experience. The thinking process mimics other computer programming languages such as Pascal and Java. Also some BASIC dialects.

Those without programming experience will migrate toward Inform7 anyway. But I think that those with experience will gravitate toward Inform6. So, yes, IMHO, this will help.