Highly optimized abbreviations computed efficiently

You shaved 600 bytes off with that? Wow, good job! I’ll definitely look into it.
I guess the greedy approach isn’t the best one, even with the tiebreaker rule. The search continues!! :smiley:

Very good. I spent way too much time playing with this abbreviations stuff. I got it down to one build of the suffix tree (I’m now using the SA-IS suffix array algorithm with some custom routines on top to do the traversal) and some extra scoring logic, then I tried improving the abbreviations. What mine does is

  1. Identify twice as many winners as there are abbreviation slots, using the old logic. Any abbreviation eliminated because it completely contains an earlier winner is saved.

  2. Go through each of the abbreviations that have been saved, in order of their original score, and re-score the entire set with the long abbreviation placed above the first abbreviation it fully contains. If the score for the top 96 in the re-scored set is higher, use the abbreviation; otherwise, discard it.

  3. If any abbreviations in the second half of the set of 192 have a higher score than those in the set of 96, move them up.

This is all very ad-hoc, alas. I’ve looked to see if there are any good algorithms for this, but I’ve only found some which assume abbreviations can themselves be abbreviated, and a handwavy reference to the problem being NP-complete (whether in the number of abbreviations or the text size, I don’t know).

My best for Heart of Ice is 223,332 in about 7.6 s, and 88,566 for Zork 2 in 1.8 s (on my iMac). Almost all the time is in the re-scoring.

I’ve put my code up on Gitlab: Matthew T. Russotto / ZilAbbrs · GitLab

4 Likes

I’ve made a small modification of the algorithm. In the one below it’s only point 6 that is new.

1.   Collect all text strings from the zap-files in a vector (of strings). In C++ this
     becomes string1 string2 ... separated by 0 in an array of characters.

2.   Split all strings into unique keys of length 2 or longer.

3.   Calculate cost in z-chars for each key.

4.1. Count occurrences of each key in text.

4.2. Keep track of key with highest savings.

4.3. Save best candidate.

4.4. Remove best candidate from text.

4.5. Remove all keys with savings < LOW_SCORE_CUTOFF.

4.6. Repeat from 4.1 until we have up to 192 candidates in the best_candidates list.
     There could be fewer, if we run out of valid keys with savings > LOW_SCORE_CUTOFF.

5.   Find all candidates that overlap keys in best_candidates.

6.   Sort best_candidates and order top 96 candidates from longest to shortest.
        1-96 : Longest to shortest
        97-  : Descending savings value

7.   Calculate combined savings for 96 top keys in best_candidates.

8.1. Pick an overlapper and place it before the longest key it overlaps in new_best_candidates.

8.2. Calculate combined savings for 96 top keys in new_best_candidates. 

8.3. If new_score <= old_score, discard the overlapper and repeat from 8.1.

8.4. Use new_best_candidates as best_candidates.

8.5. Repeat from 8.1 until there are no more overlappers.

9.   Print 96 best_candidates.
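The greedy core (steps 4.1–4.6) can be sketched in C++ like this. The assumptions here are mine, not the post’s: each character costs one z-char and an abbreviation reference costs two, so a key of length L occurring N times saves roughly N*(L-2) z-chars; the cutoff value and the key-length cap of 10 are placeholders, and a real implementation would use a suffix tree/array instead of this brute-force scan.

```cpp
#include <cstddef>
#include <string>
#include <vector>

const int LOW_SCORE_CUTOFF = 1;  // step 4.5's cutoff (value assumed here)

// Count non-overlapping occurrences of key in text.
int countOccurrences(const std::string& text, const std::string& key) {
    int n = 0;
    for (std::size_t pos = text.find(key); pos != std::string::npos;
         pos = text.find(key, pos + key.size()))
        ++n;
    return n;
}

// Steps 4.1-4.6: repeatedly pick the key with the highest naive savings,
// remove it from the text, and stop at the candidate limit or when no
// key clears the cutoff.
std::vector<std::string> greedyCandidates(std::string text, std::size_t limit) {
    std::vector<std::string> best;
    while (best.size() < limit) {
        std::string bestKey;
        int bestSave = LOW_SCORE_CUTOFF;
        for (std::size_t i = 0; i < text.size(); ++i)
            for (std::size_t len = 2; len <= 10 && i + len <= text.size(); ++len) {
                std::string key = text.substr(i, len);       // step 2: keys of length >= 2
                if (key.find('\0') != std::string::npos) continue;
                int save = countOccurrences(text, key) * ((int)len - 2); // 4.1: naive savings
                if (save > bestSave) { bestSave = save; bestKey = key; } // 4.2: track best
            }
        if (bestKey.empty()) break;                          // 4.6: ran out of valid keys
        best.push_back(bestKey);                             // 4.3: save best candidate
        for (std::size_t pos; (pos = text.find(bestKey)) != std::string::npos; )
            text.replace(pos, bestKey.size(), 1, '\0');      // 4.4: remove from text
    }
    return best;
}
```

Overwriting each winner with a 0 separator (step 4.4) is what keeps later counts from re-counting already-abbreviated text.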

This is to stop it from, for example, picking “receptacl” instead of “receptacle.” because “e.” higher up has lowered the latter’s savings score.

Zork 2 is now down to 88,440 bytes and Heart of Ice is down to 223,004 bytes with this modification.

I made a small improvement to the above algorithm. In stage 8.1, instead of placing the overlapper immediately before the key it overlaps, I place it directly after the key that is one z-char longer, thereby maintaining the descending order of length for the top 96. This made a small gain for Zork II and saved 10 more bytes (down to 88,434 bytes). But when I tried it with Mini-Zork II the reverse happened; see the table below.

Zork II
Method  Org. size  Saved z-chars  Calc. saved bytes  Actual saved bytes  Diff  Game size
1         103,462         22,426             14,951              15,018    67     88,444
2         103,462         22,491             14,994              15,028    34     88,434

Mini-Zork II, Release 13
Method  Org. size  Saved z-chars  Calc. saved bytes  Actual saved bytes  Diff  Game size
1          60,258         10,269              6,846               6,842    -4     53,416
2          60,258         10,305              6,870               6,794   -76     53,464

This is probably due to how many “wasted” z-chars the abbreviations reclaim or introduce.

As I understand it, each string in the game always occupies an even number of bytes (2, 4, 6, 8, …). Every two-byte word contains 3 z-chars. If you have a string of 13 z-chars there will be two “wasted” slots in the last word. If you then abbreviate something in this string that brings it down to a length divisible by 3, you will reclaim these two slots. The other way around, you will “waste” slots if you have a string divisible by 3 that the abbreviation changes to a length that is not divisible by 3.
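The padding arithmetic can be written down directly; a small sketch with helper names of my own:

```cpp
// Z-characters are packed three to a 2-byte word, and every string is
// padded out to a whole number of words.
int wasted(int zchars) { return (3 - zchars % 3) % 3; }        // unused slots in the last word
int stringBytes(int zchars) { return 2 * ((zchars + 2) / 3); } // words * 2, rounded up
```

So shortening a 13-z-char string by a single z-char reclaims the two wasted slots and saves a whole word (2 bytes), while shortening a 12-z-char string by two z-chars (to 10) saves no bytes at all.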

Normally this would even out, I think, but you could maybe introduce a bonus/penalty for well/badly behaved abbreviations. This would have to be done on every string the abbreviation is applied to and summed up for each key. The cost is increased/decreased by a value between -2 and +2 for each time the abbreviation is used.

This, just to squeeze out the last bytes…

Am I thinking about this right, or am I missing something?

It’s definitely possible to determine the exact savings of a set of abbreviations, taking into account the wasted Z-characters, but you can’t really do it for a single abbreviation because other abbreviations in the same strings affect it.

I made some small changes to the algorithm:

  1. Overlappers are sorted from shortest to longest before they are tested against the top 96 to see if they qualify.
  2. Re-sort the bottom half (candidates 97–) of the BestCandidatesList.
  3. Overlappers that don’t qualify for a top spot don’t get discarded right away. Instead they are placed last in the list and only discarded if, during recalculation, they get a savings score below 0.
The result:
'   Zork II:
'      No Abbrev.   103,462
'      Zilf 0.9      90,368
'      Zilf beta     89,454
'      Sort once     88,384
'      Keep sorted   88,392
'
'   Mini-Zork II, Release 13:
'      No Abbrev.    60,258
'      Zilf 0.9      54,170
'      Zilf beta     53,846
'      Sort once     53,404
'      Keep sorted   53,468
'
'   Heart of Ice:
'      No Abbrev.   283,184
'      Zilf 0.9     232,408
'      Zilf beta
'      Sort once    222,992
'      Keep sorted  222,588
'
'   Trinity:
'      No Abbrev.   281,580
'      Zilf 0.9     257,408
'      Zilf beta    256,908
'      Sort once    255,232
'      Keep sorted  255,308 

It’s irritating that there’s a difference between whether the top 96 of the BestCandidatesList is only sorted once, before the overlappers are tested, or re-sorted after each overlapper is tested.

I don’t think this difference is only due to “wasted” z-chars; these should even out. I’m leaning more towards it being interference from partial overlappers that changes the result depending on which order the list is sorted in.

Mini-Zork II has a fair amount of globals for common texts. At what point - if any - does that start to work against the abbreviation finder? E.g. I could add a global for "You can’t ", but I think that’s a text that the better abbreviation finders detect on their own, even if ZILF 0.9 doesn’t.

The shortest one is <GLOBAL PERIOD-CR ".|"> which it uses for things like <TELL "You can't move the " D ,PRSO ,PERIOD-CR>, i.e. where it really can’t be made part of a longer string.

I think you’re about there. The GLOBALs aren’t as much of a savings as an abbreviation. I don’t have the math here right now, but the biggest reason is that the global only replaces strings in print instructions (.PRINTI, .PRINTR in ZAP).

I identified a couple of strings that I thought would save about 76 bytes, but the net gain was only 6 bytes because the abbreviations were less effective.

EDIT: I don’t think this would be a problem in bigger games. There the limit of 96 abbreviations is the hindrance and there’s plenty of room for GLOBALs.

Can you see if any of the current ones does more harm than good?

I can’t see any obvious ones. The PERIOD-CR is a borderline case. As I count it, there are 65 “.|” in the files (4 in DESCs). Maybe it would be picked up as an abbreviation and save 65*(4-2)-4 = 126 z-chars. This would replace an abbreviation that saves around 50, for a net gain of about 75 z-chars. From this we would need to subtract what is currently saved by the GLOBAL. It would only be a couple of bytes either way. Not worth the trouble, I say…

Do note the cost of the alternatives here. We’re talking about a string consisting of a dot and a newline character.

If we don’t have an abbreviation for this specific string, it takes up four five-bit codes, using up 4 bytes.

If we have an abbreviation, it takes up two five-bit codes, using up 2 bytes. This also means it uses one slot in the abbreviation table, so some other string we would benefit from creating an abbreviation for won’t get an abbreviation.

If we use a global, each reference to that global uses up 1 byte, due to the way Z-code is encoded.

If we use this string in say 50 places, the savings are big enough to matter.
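Taking the per-use figures from this post at face value (they are estimates, not exact Z-machine encoding sizes), the three options compare like this for a string printed in many places:

```cpp
// Byte cost of printing ".|" in `uses` places, per the estimates above.
// The abbreviation also consumes one of the 96 table slots, and the
// global additionally stores the string itself once.
int rawBytes(int uses)    { return 4 * uses; } // four 5-bit codes per use
int abbrevBytes(int uses) { return 2 * uses; } // two 5-bit codes per use
int globalBytes(int uses) { return 1 * uses; } // one byte per reference
```

At 50 uses that is 200 vs. 100 vs. 50 bytes.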

I think globals will cost a little more than 1 byte?

If you have a string like:
“You can’t do that.|Try something different instead.”

It will code in zap as:

PRINTR "You can't do that.|Try something different instead."

opcode + address + length of string, I think.

If we use a global for “.|” it will compile to:

PRINTI "You can't do that"
PRINT PERIOD-CR
PRINTR "Try something different instead."

3 opcodes + 3 addresses + length of strings - 4 z-chars.

The calculation becomes different if the PERIOD-CR is at the end of the string.

Then there’s the issue of whether the +/- 4 z-chars makes the string fit neatly into the 2-byte words, before or after the change.

I think it’s hard to know beforehand how much you will save or lose, especially in a small game like Mini-Zork II where the abbreviations get a bit exhausted at the end and only save between 40–50 z-chars, so there is no great abbreviation waiting in the wings to replace the global.

I’m a bit tired, so the math can be very off. If so, I apologize beforehand…

Oh, I was only talking about the case where a certain string is printed in many places, not where it’s part of longer strings. When replying to player actions, it’s pretty common to end a message with printing an object name or a pronoun and then a dot and a newline, and that’s exactly the reason this global was created (see Torbjörn’s post).

I’m not sure what I’m arguing here. :grinning:

I certainly think the globals are helpful, and removing the PERIOD-CR will probably add a few bytes, but there aren’t that many good globals/abbreviations left. Any new global will cannibalize the abbreviations and the net gain won’t be much.

1 Like

By improving the parse (how the text is broken into abbreviations and literals), I have gotten my previous best of 222,984 for Heart of Ice down to 222,672 (same abbreviation file; I never matched Henrik’s 222,588, but it is likely to improve that set as well). It turns out an optimal parse given a set of abbreviations is tractable, and an algorithm was known in 1973*. With an optimal parse the order of abbreviations does not matter.
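The optimal parse is a short dynamic program in the spirit of the Wagner paper: for every prefix of a string, take the cheaper of ending with a literal character or with any abbreviation that matches there. A sketch, under a simplifying assumption of my own that every literal costs 1 z-char and an abbreviation reference costs 2 (real costs depend on alphabets and shift characters):

```cpp
#include <algorithm>
#include <climits>
#include <string>
#include <vector>

// dp[i] = minimum z-chars needed to encode the prefix s[0..i).
int minZchars(const std::string& s, const std::vector<std::string>& abbrevs) {
    const int n = (int)s.size();
    std::vector<int> dp(n + 1, INT_MAX);
    dp[0] = 0;
    for (int i = 1; i <= n; ++i) {
        dp[i] = dp[i - 1] + 1;                 // encode s[i-1] as a literal
        for (const auto& a : abbrevs) {        // or end an abbreviation at i
            int len = (int)a.size();
            if (len <= i && s.compare(i - len, len, a) == 0)
                dp[i] = std::min(dp[i], dp[i - len] + 2);
        }
    }
    return dp[n];
}
```

With abbreviations “aa” and “abb”, the text “aabb” parses greedily as “aa”+“b”+“b” for 4 z-chars, but the optimal parse “a”+“abb” costs 3 — which is why, with an optimal parse, the order of the abbreviations stops mattering.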

Attached is a ZILF patch; be aware that this has not been well-tested at all, even by my own cowboy-coding standards.

optimalparse.txt (4.1 KB)

  • R. A. Wagner, “Common phrases and minimum-space text storage,” Commun. ACM 16(3), 1973, pp. 148–152
5 Likes

This is good!

If I apply your patch then, as I understand it, Zilf uses the best combination of abbreviations (from the full set) for each individual string. It improves the result, and I’ve updated my list with a couple of examples below. Heart of Ice is now down to 222,248 bytes.

Updated list (third column is with patch applied)
'   Zork II:
'      No Abbrev.   103,462
'      Zilf 0.9      90,368
'      Zilf beta     89,454
'      Sort once     88,384  88,336
'      Keep sorted   88,392  88,348
'
'   Mini-Zork II, Release 14:
'      No Abbrev.    60,316
'      Zilf 0.9      54,234
'      Zilf beta     53,906
'      Sort once     53,522  53,498
'      Keep sorted   53,560  53,488
'
'   Heart of Ice:
'      No Abbrev.   283,184
'      Zilf 0.9     232,408
'      Zilf beta
'      Sort once    222,992
'      Keep sorted  222,588 222,248
'
'   Trinity:
'      No Abbrev.   281,580
'      Zilf 0.9     257,408
'      Zilf beta    256,908
'      Sort once    255,232
'      Keep sorted  255,308 

This one is 222,208 with the old-style parse, 221,920 with the optimal parse. The algorithm is rather slow and brute-forcish, 40 s on my machine. Savings values in the file are the “naive” savings values rather than the true ones.

heartice_freq.zap.opt.221920.txt (6.8 KB)

This one was even more brute force, trying every possible abbreviation rather than just branching nodes. It took 65 minutes, so obviously not very practical, but the differences might provide some ideas. Size is 222,152 with standard parse, 221,868 with optimal parse.

heartice_freq.zap.opt.221868.txt (6.8 KB)

The differences are minimal. The only ones I can see are “ towards” instead of “ towards ” and “ against” instead of “ain”. The order is also a bit different, but that shouldn’t matter with “OptimalParse”, should it?

I’m gonna run my algorithms tonight to compare, but they are only experimental. They don’t use a suffix tree/array and are very slow for files this big (65–75 min). I have a limit of 60 characters for abbreviations, which keeps them from finding the “Crashing through a thicket…” string that I’m gonna remove.

I had a look at the Zapf code, but it has a problem with repeating phrases: “aaaaaaaaaaaaaaaaaaaa” is identified as 10x“aaaaaaaaaa”, which I haven’t found a way to patch yet.

(There is also a scanned PDF of the Wagner paper here, if anyone is interested…)

There are some claims in the literature that for English text, disallowing abbreviations with leading spaces or trailing spaces (but not both) improves compression (presumably by reducing overlap). I had tried this with earlier algorithms, with terrible results, but possibly it will work given my current algorithm:

  1. Calculate abbreviations, score by naive savings

  2. Put them in a max-heap.

  3. Remove top of heap abbreviation, add to the set of best abbreviations and parse entire corpus.

  4. Compute change in savings (using optimal parse) and declare that to be the new score for the current abbreviation.

  5. If the new score is less than the current score for the top-of-heap, remove the current abbreviation from the best_abbreviations set and throw it back on the heap.

  6. Repeat from step 3 until enough abbreviations are found or the heap is empty.
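The loop above can be sketched with a standard max-heap; `trueGain` is my stand-in for “parse the entire corpus with the candidate added to the current best set and measure the change in savings” (step 4), since the real optimal-parse scorer is a program of its own:

```cpp
#include <cstddef>
#include <functional>
#include <queue>
#include <string>
#include <vector>

struct Cand { int score; std::string abbr; };
struct ByScore {                                  // max-heap on score
    bool operator()(const Cand& a, const Cand& b) const { return a.score < b.score; }
};

// Lazy greedy selection: only the current top candidate ever pays for
// the expensive re-scoring (steps 3-6 above).
std::vector<Cand> pickAbbrevs(
    std::vector<Cand> scored, std::size_t wanted,
    const std::function<int(const std::string&, const std::vector<Cand>&)>& trueGain) {
    std::priority_queue<Cand, std::vector<Cand>, ByScore>
        heap(ByScore{}, std::move(scored));       // steps 1-2: naive scores
    std::vector<Cand> best;
    while (best.size() < wanted && !heap.empty()) {
        Cand top = heap.top(); heap.pop();        // step 3: tentative winner
        top.score = trueGain(top.abbr, best);     // step 4: true delta-savings
        if (!heap.empty() && top.score < heap.top().score)
            heap.push(top);                       // step 5: demote, retry later
        else
            best.push_back(top);                  // step 6: accept
    }
    return best;
}
```

Because the heap is keyed on naive scores, a candidate whose true gain falls below the next naive score is simply thrown back and reconsidered later with its corrected score.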

1 Like

I finally got around to implementing your new algorithm. I’ve added a little twist to step 5.

5.1 Remove all abbreviations that have a score lower than the latest delta-savings and throw them back on the heap.

This doesn’t add much time to the algorithm (<5 s on Heart of Ice) and the result matches your brute-force method. My timings below are without using the Boyer–Moore–Horspool algorithm and generally not very streamlined in the indexing part. I’m also using the “MoreComplexDataStructures 1.9.1” NuGet package for the max-heap and I don’t know how it performs. Anyway, what I mean to say is that the times with and without throwing back low-scorers (step 5.1) only differ marginally.

Results
'   Zork II:        Zilf 0.9* Optimal Parse**
'      No Abbrev.   103,462
'      Zilf 0.9      90,368
'      Zilf beta     89,454
'      Sort once     88,384   88,336
'      Keep sorted   88,392   88,348
'      Max heap      88,278   88,252 58 s
'
'   Mini-Zork II, Release 14:
'      No Abbrev.    60,316
'      Zilf 0.9      54,234
'      Zilf beta     53,906
'      Sort once     53,522   53,498
'      Keep sorted   53,560   53,488
'      Max heap      53,506   53,428 27 s
'
'   Heart of Ice:
'      No Abbrev.   283,184
'      Zilf 0.9     232,408
'      Zilf beta
'      Sort once    222,992
'      Keep sorted  222,588  222,248
'      Max heap     222,148  221,880 183 s
'
'   Trinity:
'      No Abbrev.   281,580
'      Zilf 0.9     257,408
'      Zilf beta    256,908
'      Sort once    255,232
'      Keep sorted  255,308 
'      Max heap     255,024  254,988 103 s

*  Abbreviations are sorted from longest to shortest.
** Zilf 0.9 patched with "Optimal Parse".