Highly optimized abbreviations computed efficiently

Finding that earns you 54 bytes.

zork2_freq.zap.89368.txt (6.0 KB)

4 Likes

So, up to a >1% improvement in binary size over the ZILF default.

Are the gains noticeably different on wordier games? https://github.com/daelsepara/heart-of-ice , for example?

heart-of-ice no abbreviations is 283184 (too big)

Zapf abbreviations 232408 in 195s

My abbreviations 228472 in 9.4s

Similar small but noticeable gain, much faster.

heartice_freq.zap.228472.txt (6.1 KB)

4 Likes

I’ve implemented this approach in C# for ZAPF 0.10: https://foss.heptapod.net/zilf/zilf/-/merge_requests/15

On Zork II, it gets within a couple bytes of Matthew’s results in 4.3 seconds (on a single thread). On Heart of Ice, it gets identical results, but takes about 30 seconds.

There’s probably more optimization to be done on this code – parallelizing the DFS, for example, which should be straightforward – but this is a huge step forward already.

7 Likes

Just wanted to mention that I updated my script yesterday, implementing @russotto’s tiebreaking scheme and fixing the counting bug. I can reach 89400 bytes for Zork 2, and I don’t know if it’s because of my algorithm or because of my crude scheme to get all the gametext (which misses multi-line prints), but at this point I don’t think I’ll attempt to figure it out.

Anyway, if anyone wants a slower and not-quite-optimal Python script to do this on Inform 6 gametext, you know where to find it :slight_smile: And thanks again @russotto @heasm66 @vaporware for the discussion!

3 Likes

Over in the thread about Mini-Zork II I noticed that the abbreviation files that @russotto saved ~300 bytes more than the other. Why?

One difference was that the beginning of his files looked like this:

        .FSTR FSTR?1," in the "		; 35x, saved 202
        .FSTR FSTR?2,"n the "		; 38x, saved 146
        .FSTR FSTR?3,"to the "		; 47x, saved 228
        .FSTR FSTR?4,"o the "		; 13x, saved 46
        .FSTR FSTR?5," of the "		; 38x, saved 220
        .FSTR FSTR?6,"s the "		; 51x, saved 198
        .FSTR FSTR?7," the "		; 154x, saved 457

and the other like this:

        .FSTR FSTR?1," the "		; 376x, saved 1123
        .FSTR FSTR?2,"The "		; 243x, saved 724
        .FSTR FSTR?3," is "		; 198x, saved 392
        .FSTR FSTR?4,"You "		; 117x, saved 346
        .FSTR FSTR?5," you"		; 174x, saved 344

The glaring difference is that the first file waits a while to define the clear winner " the " and list a couple of phrases that have the winner as a part. This is a good strategy because if " the " is the first abbreviation it ‘ruins’ tho other ones and they will never get used!

I wrote a little test (in VB.NET, hope that doesn’t bug too many!) where I try to implement this logic. The basic structure of the code is largely based on the Python-script by @mulehollandaise and doesn’t use suffix-tree and is therefore a bit slower (30 s on my computer for Mini-Zork II).

  1. Identify the first winner.
  2. Pick a threshold based on this winner. I have experimented with different values, but for Mini-Zork II 8% gave the best result.
  3. Instead of using the winner pick the longest abbreviation that clears the threshold and have the winner inside it.
  4. Recalculate index.
  5. Pick a new winner
  6. repeat from 3 until 96 abbreviations are found.

Below is the code if anyone wants to use it or improve it. For Visual Studio you need a Form with a Button and a multiline Textbox and then paste just paste the code in the Form’s class.

Beware! The code isn’t very polished yet.

Form1.vb.txt (11.8 KB)

2 Likes

With my new abbreviations, 34 seconds to generate, I get Zork II down to 89,028 bytes.

zork2_freq.xzap.txt (7.2 KB)

EDIT: Noticed that I used Mini-Zork II. Gonna regenerate one for Zork II instead.

2 Likes

A little more benchmarks, now with the correct file for Zork II.

Zork II         88,738 bytes   50 s to generate
Heart of Ice   223,984 bytes   6 minutes to generate

The time to generate is highly dependent on what cut-off is used for “worthless” abbreviations. For these times a cut-off of about 2% of highest score is used.

heartice_freq.xzap.txt (7.5 KB) zork2_freq.xzap.txt (7.2 KB)

1 Like

You shaved 600 bytes off with that? Wow good job! I’ll definitely look into it.
I guess the greedy approach isnt the best one, even with the tiebreaker rule. The search continues!! :smiley:

Very good. I spent way too much time playing with this abbreviations stuff. I got it down to one build of the suffix tree (I’m now using the SA-IS suffix array algorithm with some custom routines on top to do the traversal) and some extra scoring logic, then I tried improving the abbreviations. What mine does is

  1. Identify twice the number of winners as there are abbreviations, using the old logic. Any abbreviation eliminated due to completely containing an earlier winner is saved.

  2. Go through each of the abbreviations that have been saved, in order of their original score, and re-score the entire set based on the long abbreviation being above the first abbreviation it fully contains. If the score for the top 96 in the re-scored set is higher, use the abbreviation; otherwise, discard it.

  3. If there are any abbreviations in the second half of the set of 192 that have a higher score than those in the set of 96, move them up.

This is all very ad-hoc, alas. I’ve looked to see if there are any good algorithms for this, but I’ve only found some which assume abbreviations can be abbreviated, and a handwavy reference to the problem being NP-complete (whether in the number of abbrs or the text size, I don’t know)

My best for heart-of-ice is 223332 in about 7.6s, and 88566 for Zork2 in 1.8s (on my iMac) Almost all the time is in the re-scoring.

I’ve put my code up on Gitlab: Matthew T. Russotto / ZilAbbrs · GitLab

4 Likes

I’ve made a small modification of the algorithm. In the one below it’s only point 6 that is new.

1.   Collect all text strings from the zap-files in a vector(of strings). In C++ this
     becomes string1 string2 ... seperated by 0 in an array of characters.

2.   Split all strings into unquie keys of length 2 or longer.

3.   Calculate cost i z-chars for each key.

4.1. Count occurrences of each key in text.

4.2. Keep track of key with highest savings.

4.3. Save best candidate.

4.4. Remove best candidate from text

4.5. Remove all keys with savings < LOW_SCORE_CUTOFF.

4.6. Repeat from 4.1 until we have up to 192 candidates in the best_candidates list.
     There could be fewer, if we run out of valid keys with savings > LOW_SCORE_CUTOFF.

5.   Find all candidates that overlaps keys in best_candidates.

6.   Sort best_candidates and order top 96 candidates from longest to shortest.
        1-96 : Longest to shortest
        97-  : Descending savings value

7.   Calculate combined savings for 96 top keys in best_candidates.

8.1. Pick overlapper and place it before longest key it overlapps in new_best_candidates.

8.2. Calculate combined savings for 96 top keys in new_best_candidates. 

8.3. If new_score <= old_score, discard the overlapper and repeat from 7.1.

8.4. Use new_best_candidates as best_candidates.

8.5. Repeat from 8.1 until all there are no more overlappers.

9.   Print 96 best_candidates.

This is to stop, for example, don’t pick “receptacl” instead of “receptacle.” beacause “e.” higher up have lowered the latters savingscore.

Zork 2 is now down to 88,440 bytes and Heart of Ice is down to 223,004 bytes with this modification.

I made a small improvement to the above algorithm. In stage 8.1 I plave the overlapper before the overlapping key direcly after the key that is one z-char longer. Therby maintaining the descending order of length for the top 96. This made a small gain for Zork II and saved 10 bytes more (down to 88,434 bytes). But when I tried it with Mini-Zork II the reverse happened, see table below.

Zork II
                 Saved   Calculated  Actual              
Method Org. size z-chars saved bytes saved bytes Diff Game size
1        103,462  22,426      14,951      15,018   67    88,444
2        103,462  22,491      14,994      15,028   34    88,434

Mini-Zork II, Release 13
Method Org. size z-chars saved bytes saved bytes Diff Game size
1         60,258  10,269       6,846       6,842   -4    53,416   
2         60,258  10,305       6,870       6,794  -76    53,464

This is probably because how many “wasted” z-chars the abbreviations regains/introduces.

As I understand it each string in the game always occupy an even number of bytes (2, 4, 6, 8, …). Every two-bytes, word, contains 3 z-chars. If you have a string of 13 z-chars there will be two “wasted” slots in the last word. If you then abbreviate something in this string that brings it down to some length dividable by 3 you will reclaim or gain these two slots. The other way around you will “waste” slots if you have a string that is dividable by 3 that the abbreviation changes to some length that is not dividable by 3.

Normaly this would even out, I think, but you could maybe introduce bonus/punishment for good/bad behaving abbreviation. This would have to be done on every string the abbreviation is applied to and summed up for each key. The cost is increased/decreased by a value between -2 to +2 for each time this abbreviation is used.

This, just to squeeze out the last bytes…

Am I thinking this right, or?

It’s definitely possible to determine the exact savings of a set of abbreviations, taking into account the wasted Z-characters, but you can’t really do it for a single abbreviation because other abbreviations in the same strings affect it.

I made some small changes to the algorithm:

  1. Overlappers are sorted from shortest to longest before they are tested against top 96 to see if they qualify.
  2. Resort the bottom half (candidates 97-) of the BestCandidatesList.
  3. Overlappers that not qualify to a top spot don’t get discarded right away. Instead they are played last in the list and only discarded if they during recalculation get a savings score below 0.
The result:
'   Zork II:
'      No Abbrev.   103,462
'      Zilf 0.9      90.368
'      Zilf beta     89,454
'      Sort once     88,384
'      Keep sorted   88,392
'
'   Mini-Zork II, Release 13:
'      No Abbrev.    60,258
'      Zilf 0.9      54,170
'      Zilf beta     53,846
'      Sort once     53,404
'      Keep sorted   53,468
'
'   Heart of Ice:
'      No Abbrev.   283,184
'      Zilf 0.9     232,408
'      Zilf beta
'      Sort once    222,992
'      Keep sorted  222,588
'
'   Trinity:
'      No Abbrev.   281,580
'      Zilf 0.9     257,408
'      Zilf beta    256,908
'      Sort once    255,232
'      Keep sorted  255,308 

It’s irritating that there’s a difference between if the top 96 of the BestCandidatesList is only sorted once, before overlappers are tested, or resorted between each overlapper is tested.

I don’t think this difference is only due to “wasted” z-chars, these should even out. I’m leaning more towards that it’s interference from partial overlappers that changes the result depending on wich order the list is sorted.

Mini-Zork II has a fair amount of globals for common texts. At what point - if any - does that start to work against the abbreviation finder? E.g. I could add a global for "You can’t ", but I think that’s a text that the better abbreviation finders detect on their own, even if ZILF 0.9 doesn’t.

The shortest one is <GLOBAL PERIOD-CR ".|"> which it uses for things like <TELL "You can't move the " D ,PRSO ,PERIOD-CR>, i.e. where it really can’t be made part of a longer string.

I think you’re about there. The GLOBALs aren’t as much savings as an abbreviation. I don’t have the math here right now, but the biggest one is that the global only replaces strings in (.PRINTI, .PRINTR in ZAP).

I identified a couple of strings that i thought would save about 76 bytes, but the net gain was only 6 bytes because the abbreviations were less effective.

EDIT: I don’t think this would be a problem in bigger games. There the limit of 96 abbreviations are the hindrance and there’s plenty of room for GLOBALs.

Can you see if any of the current ones does more harm than good?

I can’t see any obvious. The PERIOD-CR is a border case. As I count it, there are 65 ".|" in the files (4 in DESCs). Maybe it would be be picked up as a abbreviation and save 65*(4-2)-4=126 z-chars. This would replace an abbreviation that saves around 50 for a net gain of about 75 z-chars. From this we would need to subtract what is currently saved by the GLOBAL. It would only be a couple of bytes either way. Not worth the trouble, I say…

Do note the cost of the alternatives here. We’re talking about a string consisting of a dot and a newline character.

If we don’t have an abbreviation for this specific string, it takes up four five-bit codes, using up 4 bytes.

If we have an abbreviation, it takes up two five-bit codes, using up 2 bytes. This also means it uses one slot in the abbreviation table, so some other string we would benefit from creating an abbreviation for won’t get an abbreviation.

If we use a global, each reference to that global uses up 1 byte, due to the way Z-code is encoded.

If we use this string in say 50 places, the savings are big enough to matter.

I think globals will cost a little more than 1 byte?

If you have a string like:
“You can’t do that.|Try something different instead.”

It will code in zap as:

PRINTR "You can't do that.|Try something different instead."

opcode+ address + length of string, I think.

If we use a global for “.|” it will compile to:

PRINTR "You can't do that"
PRINT PERIOD-CR
PRINTR "Try something different instead."

3 opcodes + 3 addresses + length of strings - 4 z-chars.

The calculation becomes different if the PERIOD-CR is in the end of the string.

Then it’s the issue if +/- of the 4 z-chars makes the string fit neatly in the 2-byte words before or after.

I think it’s hard to beforehand now how much you will save or lose especially in a small game like mini-zork II where the abbreviations get a bit exhausted at the end and only saves betwen 40-50 z-chars so there is not great abbreviation waiting in the wing to replace the global.

I’m a bit tired so the math can be very off. If so I’m apologize beforehand…