Next steps for Inform 6 compiler

I make a few assumptions here:

  1. That text compression are most valuable when you’re trying to fit games to 8-bit scene. For other formats you have 512kB (Z8) or 2MB (Glulx) and I don’t know of any game that have used this much text.
  2. The Z-machine standard is “fixed”. Meaning that no decompressing is done by the interpreter. Otherwise the use for compressing with LZ77 would be unusable because there exists no interpreter that could decompress the text.
  3. LZ77 with rolling buffer is not feasable because there is not enough memory available to allocate a buffer of usable size.

Routines as you suggest sounds to me to introduce quite a bit overhead, but I don’t know how much. Remember that ZASCII use 5 bits for each character and bunch them together in 2 bytes that contains 3 characters so each string are optimally of a length dividable by 3. When I run extraction of abbreviations of Z3-files the last abbreviations found (abbrev 80 - 96) are often small saves, usally under 75 bytes each. The escape-char to insert an abbreviation is quite small and if you replace an abbreviation with a CONSTANT or a GLOBAL the savings are often lost because the escape for these costs more and the replacement abbreviations that are found can’t compensate for this.

I’m having a hard time seeing how a compress/decompress without z-machine support and a rolling buffer could save much more space than the current system with abbreviations does. 96 abbreviations are a bit limiting but could be extended by using CONSTANTs and/or GLOBALs for common phrases. See the work that was done with Mini-Zork II over in this thread.

Am I missing something obvious?

2 Likes

The key takeaway is that each LZ77 marker is a potential Abbreviation lookup table entry.

You can use internal Abbreviation, which is quite efficient, and LZ77 compression method is near optimal in finding text snippets to compress.

If more compression is desired than what Abbreviation is available, then use your own look up table system.

Have you tried using GLOBALs for big saves and Abbreviation for small ones?

The manual snippet suggests that Routine saves 2 bytes per entry as compared to static array. I haven’t checked that, however.

These are examples in ZILF, but…

This was done alot in Mini-Zork II. If you look at the bottom of GLOBALS.ZIL you’ll see how it’s done there.

Note that if you can you should use GLOBALs (saves one byte extra) instead of CONSTANTs. The downside is that GLOBALs are a limited resource.

There is also a discussion here and here for doing this automatically for the ZILF compiler.

2 Likes

I think you mean 2 GB for Glulx.

Yes, of course…

HOW MUCH WOOD CAN A WOODCHUCK CHUCK IF A WOODCHUCK CAN CHUCK WOOD

What if I split the stream of LZ77 into table and text?

TABLE=" WOODCHUCK A CAN "
TEXT="HOW MUCH(0,5)(12,5)A(0,11)(5,6)IF(10,3)(0,11)(13,4)(5,6)(1,4)"

TABLE=" A WOODCHUCK CAN "
TEXT="HOW MUCH(2,5)(12,5)(1,12)(7,6)IF(0,17)(7,6)(3,4)"

The difference between this and traditional LZ77 is that it’s no longer streaming with rolling buffer, thus must load all table entries into one long string all at once. Like LZ77, there’s no waste of storage, and you still have the flexibility of keeping short clips in actual text.

Notice also the possibility of optimization by rearranging the text of the table, although process will surely take longer than LZ77 compression algorithm.

The question then is what is the maximum length of Inform strings( I think 256, as opposed to unlimited), and can I access it per individual characters, instead of only as a whole.

Those of you who are knowledgeable about algorithms, has this been done before? If so, by who? Any reference?

Pretty much any simple scheme for text compression has been tried before. I don’t know if this one has a name. It might turn out to be impractical as the offset/length pairs could just be too large.

Making any other compression scheme work with the Z-machine is a big problem, though (even ignoring speed and code space). The useful space is “high strings”, which are not directly addressable. You can print them to a buffer and then decompress the buffer, but you end up losing quite a few bits doing so. Other considerations are that you need a scheme which works with short strings that have many repeats across them; this rules out many LZ77 and LZ78 based algorithms. The so-called semi-adaptive dictionary coders (of which yours is one, as is the Z-machine’s abbreviation algorithm) are actually quite good for this.

2 Likes

The built-in abbreviation algorithm in the z-machine uses 2 zchars for indicate a place in the text that should be replaced with an abbreviation. 1st zchar is the token (1, 2 or 3) and the 2nd points to the abbreviation - 3x32 = 96 possible abbreviations.

Your scheme have the advantage that the abbreviations can use a variable starting point and a variable length, but at the cost that you’ll need more zchars for addressing the abbreviation.

If one assumes that one uses 3 zchars for each abbreviation. 1st is the token, 2nd & 3rd are 10 bits for address and length. If you split the 10 bits, for example, 7+3, the TABLE could be 2^7=128 zchars long and the length 0-7 zchars. This is a bit restrictive so I guess you would want at least one more zchar for address+length.

If I use your example and calculate a bit:

“HOW MUCH WOOD CAN A WOODCHUCK CHUCK IF A WOODCHUCK CAN CHUCK WOOD”
is 65 zchars long.

If you use your scheme the TABLE " A WOODCHUCK CAN " is 17 zchars and the TEXT “HOW MUCH(2,5)(12,5)(1,12)(7,6)IF(0,17)(7,6)(3,4)” would occupy 31 zchars with in the 3 zchars/abbrev and 38 zchars in the 4 zchars/abbrev. 48 or 55 zchars total.

I don’t know what a good abbreviation finder would find in this string. But lets assume it identifies " WOOD" (1) and "CHUCK " (2).
The string would be: “HOW MUCH(1) CAN A(1)(2)(2)IF A(1)(2)CAN (2)(1)” = 38 zchars + 11 zchars for the abbreviations = 49 zchars total.

Even if you could implement the scheme I can’t see that you would gain that much space instead of using the buildin one. For games where this is of most interest, to keep Z3 under 128kB you rarely exhaust the current 96 abbreviations anyhow.

2 Likes

That’s great calculation. I’ve come to the same conclusion as you. This algorithm is for the addition of the internal one. Its utility is best applied to works with high text to code ratio, for example, War and Peace – The CYOA. That’s certainly niche.

Didn’t someone did Heart of Ice? Something like that would be perfect.

It was used as a benchmark-file when we worked out improvements for calculating abbreviations.

1 Like

I’ve been struggling a little to normalise my new-lines across the logic; whilst a string literal is the equivalent of print_ret which appends a new-line, other methods do not do this, and I realised that this was was making the process so difficult, particularly where you want to print multiple paragraphs.

If you have some logic that might bail out early with a print_ret, you’ll get an automatic new-line; but then your truthy logic may go on to do multiple print calls and bits of logic and then you need to remember to print a new-line to normalise the output for all logic paths. There are instances where you don’t want to include the new-line in the string, because the same string might be used on different logic paths / routines where a normal print or a print_ret might be used… It gets confusing and difficult to track down errant / duplicate new-lines.

Please can we have a function that prints, and then includes a new-line but does not return. Yes, I could write such a thing, but it being in the compiler is going to be much tighter for code-gen and it’ll help a few less people go mad swatting new-lines.

The usual way to do this is

print (string) stuff, "^";

I’m not sure what an alternative would look like. We could follow the pattern of print_ret and add a new statement:

print_nl (string) stuff;

But is this clearer? It doesn’t save any typing. The “_nl” is at the beginning, which means it’s harder to see and is maybe inconsistent with the logic of what you actually want.

(Okay, it saves two characters of typing…)

2 Likes

Does the underscore need to be there? Other languages just use “println” for print followed by newline command.

I had assumed print_ret as “print then return”, i.e. 2 commands collapsed into one, hence the underscore.

I’m not thinking about how it’s spelled at this point, but whether it improves the language.

I think it will help those with prior programming experience. The thinking process mimics other computer programming languages such as Pascal and Java. Also some BASIC dialects.

Those without programming experience will migrate toward Inform7 anyway. But I think that those with experience will gravitate toward Inform6. So, yes, IMHO, this will help.

In this instance, I don’t see any need for a change. It works and the syntax is clear.

So, you’re saying the solution would be

[ println x
  print (string) x, "^";
]

I suppose that works. It’d be nice if the inform manual has a section on “migrating from other languages” and includes such routines for convenience. At least, I know C language has extensions written for it.

1 Like

What Andrew said.

Another option to do this with the existing compiler is:

print (string) stuff;
new_line;

This also saves a few bytes in Z-code, not sure about Glulx.

That also works. But it’s not a function. Also, if the request is born from the desire to make multiple paragraphs of text, then maybe the correct answer lies on p.28 on DM4, which deals with block of text across multiple lines.

if a line finishes
with a ^ (new-line) character, then no space is added.

If you want to print a new line without returning, you just do what Zarf said. You can have any number of "^" within one print statement and print "^^" to print a blank line between paragraphs.

I know that the range of print commands is a bit bewildering for a newcomer, but adding a new one is not going to help. On the contrary, it will make it worse.

1 Like