Highly optimized abbreviations computed efficiently

I can’t see any obvious candidates. The PERIOD-CR is a borderline case. As I count it, there are 65 ".|" in the files (4 in DESCs). Maybe it would be picked up as an abbreviation and save 65*(4-2)-4 = 126 z-chars. This would replace an abbreviation that saves around 50, for a net gain of about 75 z-chars. From that we would need to subtract what is currently saved by the GLOBAL. It would only be a couple of bytes either way. Not worth the trouble, I say…
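For reference, the naive savings estimate used in that calculation is just occurrences times the z-chars saved per use, minus the cost of storing the abbreviated text itself once. A quick Python sketch with the numbers from above (the function name is mine):

    # Naive z-char savings for one abbreviation candidate: a reference costs
    # 2 z-chars, and the abbreviated text must still be stored once.
    def naive_savings(occurrences, zchars):
        return occurrences * (zchars - 2) - zchars

    print(naive_savings(65, 4))   # 65*(4-2) - 4 = 126 z-chars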

Do note the cost of the alternatives here. We’re talking about a string consisting of a dot and a newline character.

If we don’t have an abbreviation for this specific string, it takes up four five-bit codes, using up 4 bytes.

If we have an abbreviation, it takes up two five-bit codes, using up 2 bytes. This also means it uses one slot in the abbreviation table, so some other string that would benefit from an abbreviation won’t get one.

If we use a global, each reference to that global uses up 1 byte, due to the way Z-code is encoded.

If we use this string in say 50 places, the savings are big enough to matter.
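To make those z-char counts concrete, here is a minimal sketch of how a string is costed. This is a simplification, not a full Z-machine encoder: both '.' and the newline live in the A2 alphabet, so each needs a shift z-char plus the character code.

    # Simplified z-char counter: lowercase letters and space are 1 z-char;
    # uppercase letters and A2 characters (digits, punctuation, newline) need
    # a shift first, so 2 z-chars; anything else falls back to the 10-bit
    # ZSCII escape, roughly 4 z-chars.
    def zchar_count(text):
        a2 = '\n0123456789.,!?_#\'"/\\-:()'
        count = 0
        for ch in text:
            if ch == ' ' or ch.islower():
                count += 1
            elif ch.isupper() or ch in a2:
                count += 2
            else:
                count += 4
        return count

    print(zchar_count('.\n'))   # 4 z-chars, vs. 2 for an abbreviation reference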

I think globals will cost a little more than 1 byte?

If you have a string like:
“You can’t do that.|Try something different instead.”

It will be coded in ZAP as:

PRINTR "You can't do that.|Try something different instead."

opcode + address + length of string, I think.

If we use a global for “.|” it will compile to:

PRINTR "You can't do that"
PRINT PERIOD-CR
PRINTR "Try something different instead."

3 opcodes + 3 addresses + length of strings - 4 z-chars.

The calculation becomes different if the PERIOD-CR is at the end of the string.

Then it’s a question of whether adding or removing those 4 z-chars makes the string fit neatly into its 2-byte words, or spill over into an extra one.
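The padding follows from z-chars being packed three to a 2-byte word; a quick sketch:

    import math

    # A string's byte cost only changes when its z-char count crosses a
    # multiple of three, because z-chars are packed 3 per 2-byte word.
    def string_bytes(zchars):
        return 2 * math.ceil(zchars / 3)

    print(string_bytes(10), string_bytes(10 - 4))   # 8 vs 4: removing 4 z-chars saves 4 bytes
    print(string_bytes(9), string_bytes(9 - 4))     # 6 vs 4: here it saves only 2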

I think it’s hard to know beforehand how much you will save or lose, especially in a small game like Mini-Zork II where the abbreviations get a bit exhausted at the end and only save between 40-50 z-chars, so there is no great abbreviation waiting in the wings to replace the global.

I’m a bit tired, so the math can be very off. If so, I apologize in advance…

Oh, I was only talking about the case where a certain string is printed in many places, not where it’s part of longer strings. When replying to player actions, it’s pretty common to end a message with printing an object name or a pronoun and then a dot and a newline, and that’s exactly the reason this global was created (see Torbjörn’s post).

I’m not sure what I’m arguing here. :grinning:

I certainly think the globals are helpful, and removing the PERIOD-CR would probably add a few bytes, but there aren’t that many good globals/abbreviations left. Any new global will cannibalize the abbreviations, and the net gain won’t be that much.

1 Like

By improving the parse (how the text is broken into abbreviations and literals), I have gotten my previous best of 222,984 for heart-of-ice down to 222,672 (same abbreviation file; I never matched Henrik’s 222,588, but it is likely to improve that set as well). It turns out an optimal parse given a set of abbreviations is tractable and an algorithm was known in 1973*. With an optimal parse the order of abbreviations does not matter.

Attached is a ZILF patch; be aware that this has not been well-tested at all, even by my own cowboy-coding standards.

optimalparse.txt (4.1 KB)

  • R. A. Wagner, “Common phrases and minimum-space text storage”, Commun. ACM 16 (3) (1973), pp. 148-152
5 Likes

This is good!

If I apply your patch then, as I understand it, Zilf uses the best combination of abbreviations (from the full set) for each individual string. It improves the result, and I have updated my list with a couple of examples below. Heart of Ice is now down to 222,248 bytes.

Updated list (third column is with patch applied)
'   Zork II:
'      No Abbrev.   103,462
'      Zilf 0.9      90,368
'      Zilf beta     89,454
'      Sort once     88,384  88,336
'      Keep sorted   88,392  88,348
'
'   Mini-Zork II, Release 14:
'      No Abbrev.    60,316
'      Zilf 0.9      54,234
'      Zilf beta     53,906
'      Sort once     53,522  53,498
'      Keep sorted   53,560  53,488
'
'   Heart of Ice:
'      No Abbrev.   283,184
'      Zilf 0.9     232,408
'      Zilf beta
'      Sort once    222,992
'      Keep sorted  222,588 222,248
'
'   Trinity:
'      No Abbrev.   281,580
'      Zilf 0.9     257,408
'      Zilf beta    256,908
'      Sort once    255,232
'      Keep sorted  255,308 

This one is 222,208 with old-style parse, 221,920 with optimal parse. Algorithm is rather slow and brute-forcish, 40s on my machine. Savings values in the file are the “naive” savings values rather than the true ones.

heartice_freq.zap.opt.221920.txt (6.8 KB)

This one was even more brute force, trying every possible abbreviation rather than just branching nodes. It took 65 minutes, so obviously not very practical, but the differences might provide some ideas. Size is 222,152 with standard parse, 221,868 with optimal parse.

heartice_freq.zap.opt.221868.txt (6.8 KB)

The differences are minimal. The only ones I can see are " towards" instead of " towards " and " against" instead of "ain". The order is also a bit different, but that shouldn’t matter with “OptimalParse”, should it?

I’m gonna run my algorithm tonight to compare, but my algorithms are only experimental. They don’t use a Suffix Tree/Array and are very slow for files this big (65-75 min). I also have a 60-character limit on abbreviations, which means they don’t find the “Crashing through a thicket…” string that I’m gonna remove anyway.

I had a look at the Zapf code, but it has a problem with repeating phrases: "aaaaaaaaaaaaaaaaaaaa" is identified as 10x"aaaaaaaaaa", which I haven’t found a way to patch yet.

(There is also a scanned PDF of the Wagner paper here, if anyone is interested…)

There are some claims in the literature that for English text, disallowing abbreviations with leading spaces or trailing spaces (but not both) improves compression (presumably by reducing overlap). I had tried this with earlier algorithms with terrible results, but possibly it will work given my current algorithm (there is a rough code sketch after the list):

  1. Calculate abbreviations, score by naive savings

  2. Put them in a max-heap.

  3. Remove top of heap abbreviation, add to the set of best abbreviations and parse entire corpus.

  4. Compute change in savings (using optimal parse) and declare that to be the new score for the current abbreviation.

  5. If the new score is less than the current score for the top-of-heap, remove the current abbreviation from the best_abbreviations set and throw it back on the heap.

  6. Repeat from step 3 until enough abbreviations are found or the heap is empty.
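Here is a rough Python sketch of those steps. All names are mine, and corpus_savings() stands for “optimally parse the whole corpus with the current best set and return the total z-chars saved”; this is not code from either actual implementation.

    import heapq

    # Sketch of steps 1-6 above (hypothetical names). candidates maps each
    # abbreviation string to its naive savings score.
    def pick_abbreviations(candidates, corpus_savings, max_abbrevs=96):
        # heapq is a min-heap, so scores are negated to get max-heap behaviour.
        heap = [(-score, abbrev) for abbrev, score in candidates.items()]
        heapq.heapify(heap)
        best, baseline = [], 0
        while heap and len(best) < max_abbrevs:
            neg_score, abbrev = heapq.heappop(heap)
            best.append(abbrev)
            total = corpus_savings(best)            # step 3: reparse the corpus
            true_score = total - baseline           # step 4: change in savings
            if heap and true_score < -heap[0][0]:   # step 5: worse than next candidate?
                best.pop()
                heapq.heappush(heap, (-true_score, abbrev))
            else:
                baseline = total
        return best                                 # step 6: loop until done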

1 Like

I finally got around to implementing your new algorithm. I’ve added a little twist on step 5.

5.1 Remove all abbreviations that have a score lower than the latest delta-savings and throw them back on the heap.

This doesn’t add much time to the algorithm (<5 s on Heart of Ice), and the result matches your brute-force method. My timings below are without using the Boyer–Moore–Horspool algorithm, and the indexing part is generally not very streamlined. I’m also using the “MoreComplexDataStructures 1.9.1” NuGet package for the max-heap, and I don’t know how it performs. Anyway, what I mean to say is that the times with and without throwing back low scorers (step 5.1) only differ marginally.
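In terms of the sketch earlier in the thread, the 5.1 twist would look roughly like this (again hypothetical names; best_scores is assumed to record the score each accepted abbreviation was admitted with):

    import heapq

    # Step 5.1 (sketch): after computing the latest delta-savings, evict every
    # already-accepted abbreviation whose recorded score is lower than that
    # delta and push it back on the heap for re-evaluation.
    def evict_low_scorers(best, best_scores, heap, latest_delta):
        for abbrev in [a for a in best if best_scores[a] < latest_delta]:
            best.remove(abbrev)
            heapq.heappush(heap, (-best_scores[abbrev], abbrev))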

Results
'   Zork II:        Zilf 0.9* Optimal Parse**
'      No Abbrev.   103,462
'      Zilf 0.9      90,368
'      Zilf beta     89,454
'      Sort once     88,384   88,336
'      Keep sorted   88,392   88,348
'      Max heap      88,278   88,252 58 s
'
'   Mini-Zork II, Release 14:
'      No Abbrev.    60,316
'      Zilf 0.9      54,234
'      Zilf beta     53,906
'      Sort once     53,522   53,498
'      Keep sorted   53,560   53,488
'      Max heap      53,506   53,428 27 s
'
'   Heart of Ice:
'      No Abbrev.   283,184
'      Zilf 0.9     232,408
'      Zilf beta
'      Sort once    222,992
'      Keep sorted  222,588  222,248
'      Max heap     222,148  221,880 183 s
'
'   Trinity:
'      No Abbrev.   281,580
'      Zilf 0.9     257,408
'      Zilf beta    256,908
'      Sort once    255,232
'      Keep sorted  255,308 
'      Max heap     255,024  254,988 103 s

*  Abbreviations are sorted from longest to shortest.
** Zilf 0.9 patched with "Optimal Parse".

Just curious: Could any of this be useful outside of the Z-machine, say, for reducing the size of Glulx games?

It’s certainly applicable to Glulx – the Glulx compiler accepts an abbreviation list in the same way as the Z-code compiler. But the efficiency math is different, because Glulx doesn’t use Z-machine text compression. It uses a Huffman system, which does some global optimization already.

Also: nobody is trying to cram Glulx games down to the point where they fit on a C64. (I haven’t measured, but I expect it’s hopeless even for the most minimal I6 game.) Therefore, there’s no use case for that level of extreme byte-shaving. Knocking a few kilobytes off the game file doesn’t get you any new platforms.

2 Likes

I think I tried your step 5.1 and found it didn’t help for me. I might try it again having done some further refinements. The reason the suffix tree method didn’t match the brute force method turned out to be that " towards" isn’t a branching node in the suffix tree – that is, every time " towards" occurs, the next character is a space. So I added a special case for abbreviation candidates which end in a space and contain alphabetic characters – if the same candidate without the trailing space is not in the initial set of candidates, it is added to that set.
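For what it’s worth, that special case might look something like this (my sketch with hypothetical names, not the actual code):

    # For every candidate that ends in a space and contains letters, also
    # consider the same candidate with the trailing space removed, if it is
    # not already in the initial candidate set.
    def add_trimmed_candidates(candidates):
        for cand in list(candidates):
            if cand.endswith(' ') and any(c.isalpha() for c in cand):
                trimmed = cand.rstrip(' ')
                if trimmed and trimmed not in candidates:
                    candidates.add(trimmed)
        return candidates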

I’m currently at 221,664 for heart-of-ice, 88,236 for zork2 (not quite the same algorithm), and 53,232 for minizork2 (27c8a163; I will try the release version later). The algorithm difference probably has to do with padding on z5 vs z3. I will post the new version of my code to the Gitlab repository soonish. I’m out of interesting ideas so I’ll probably leave the “abbreviation golf” for now.

Some papers suggest a ~53% compression ratio for English text with this kind of compression, but most don’t give a dictionary size, and I think one uses a dictionary of 256 entries. Compression ratios here are about 73%, which is not bad, seeing as the Z-machine Standards document suggests a 10% total reduction with text making up 75% of the file, which works out to about an 87% text compression ratio.
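For the record, the ~87% figure falls out of the Standards document’s numbers like this:

    # Implied text compression ratio: a 10% total reduction when text makes
    # up 75% of the file means 0.75 * (1 - ratio) = 0.10.
    text_fraction, total_reduction = 0.75, 0.10
    text_ratio = 1 - total_reduction / text_fraction
    print(round(text_ratio, 3))   # 0.867, i.e. roughly 87%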

2 Likes

Thank you for the discussion and the advances :slight_smile:

If I were to try to catch up with my own script, the optimal parse algorithm is the main improvement, right? And what does it do: give a better evaluation of the space saved, and let you choose better abbreviations, especially when a few good candidates overlap each other?

Your algorithms seem structurally pretty close to mine, so hopefully it won’t be too hard to implement this in mine.
Is Henrik’s idea to generalize the tiebreaker to all strings within 5% of each other still in your algorithms?

Thank you for your work, I’m excited to maybe have more space available for Release 3 of “Tristam Island” :smiley:

If I were to try to catch up with my own script, the optimal parse algorithm is the main improvement, right?

Yes. But it changes a lot; all the stuff about fully-overlapping abbreviations and sorting is no longer relevant.

Is Henrik’s idea to generalize the tiebreaker to all strings within 5% of each other still in your algorithms?

I never implemented that; might be worth a try.

I had a hard time remembering what this was, but I once used a percentage threshold to determine which overlappers to use. This is no longer relevant.

As I see it there have been two “breakthroughs” from your script.

  1. Being greedy isn’t always the best strategy. This led to looking for overlappers and sorting the result from longest to shortest.
  2. Optimal Parse.

My layman’s understanding of the difference between “traditional parse” and “optimal parse” is that in the traditional parse you apply the abbreviations to all the text in order. In the optimal parse you instead look at each individual text, identify all possible abbreviations from the set of 96, and then work out which combination gives the best savings for that single text before continuing to the next one.
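Roughly, for a single string that comes down to a small dynamic program over the string positions. A sketch (hypothetical names, and a simplified cost model where every literal character is one z-char, ignoring shift characters):

    # cost[i] = cheapest number of z-chars needed to encode text[:i], given a
    # fixed set of abbreviations; an abbreviation reference costs 2 z-chars.
    def optimal_parse_cost(text, abbreviations):
        cost = [0] + [float('inf')] * len(text)
        for i in range(1, len(text) + 1):
            # Option 1: encode text[i-1] as a literal character.
            cost[i] = cost[i - 1] + 1
            # Option 2: end an abbreviation at position i.
            for abbrev in abbreviations:
                if text.endswith(abbrev, 0, i):
                    cost[i] = min(cost[i], cost[i - len(abbrev)] + 2)
        return cost[-1]

Because every abbreviation is tried at every position, the order of the abbreviation list makes no difference to the result.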

This leads, of course, to a lot of searching in strings! I don’t know Python very well; how is it on performance with strings? Running my program on Heart of Ice went from 70 minutes to 3 minutes by changing one line of code for string comparison, switching from the .NET Framework default for .IndexOf to StringComparison.Ordinal.

The best result is achieved if the compiler also uses “optimal parse” when applying the abbreviations, hence the patch to Zapf. I don’t know how Inform does it, but I doubt that it uses “optimal parse”. If the compiler doesn’t have “optimal parse”, you get a slightly better result if you sort the abbreviations from longest to shortest; with “optimal parse” the order of the abbreviations doesn’t matter.

If you have an extract of all the text in “Tristam Island” I can run it to see how much there is to save. I’m happy to show you my code (VB.NET, pretty similar in syntax to Python) if you’re interested, but it is still an experimental work in progress and not yet fit for public release.

Not great. Although if you find the right native-code library, it might handle most of the hard parts.

The strings in the ZAP files come from four different sources.

  1. From the game.zap there are the PRINTI and PRINTR opcodes followed by a text string.
  2. In the game_str.zap there are all the .GSTR.
  3. In the game_data.zap there are the .STRL.

My question is: is there a way to know which of these text strings end up in dynamic memory and which end up in static high memory for z4+ games?

The .STRL are in dynamic memory. The rest is in high memory. The .GSTRs are aligned to word boundaries, the .STRLs are not aligned, and the opcodes are not aligned but the routines they are contained in are aligned to word boundaries.

Note the abbreviation strings are conventionally in dynamic memory, although they are word addresses so they could even move to high memory (though this would be a bad idea if your interpreter does swapping and isn’t smart about it). If you have very large repeated text (like Heart of Ice) and you’re trying to compress things, you’re likely best off putting those long repeated texts into strings (GLOBAL if you have the global space, otherwise CONSTANT) before trying abbreviations.