Highly optimized abbreviations computed efficiently

Very useful. There still is a small issue that I had already mentioned in this forum about the characters encoding of the gametext.txt file which should be the same as the one chosen for the game. If you have accented characters in your dictionary, they are displayed wrong. A detail.

This is great!

I intend to modify my program so it accepts gametext input and generates inform-style abbreviation output. Now I can use your code as template, very useful.

1 Like

I have not tried to do the “optimal parse” for abbreviation-finding. Inform’s abbreviation finding routine, try_abbreviations_from(), is very simple. If someone feels like improving that, go for it!

1 Like

I just submitted a pull request to the Inform 6 compiler that applies Wagner’s algorithm given a set of abbreviations. It seems to work on “Tristam Island” (the gamefile is fine, the output of the compiler with -f is good), and it saved me 1200 bytes. I nearly fell off my chair :smiley:

I hope I didn’t mess anything up and I hope everyone gets to enjoy it in their I6 compiler very soon! :slight_smile:

8 Likes

For anyone interested, the code has just been merged into the I6 compiler :slight_smile:

10 Likes

Thanks again!

Here’s what I see when compiling Advent.inf to .z5 with a list of abbreviations generated by Inform -u. (This is not the perfect abbreviation list, but the point is to show the effect of this compiler update.)

No abbreviations:

 69279 characters used in text       51470 bytes compressed (rate 0.742)
135252 bytes used in Z-machine

Abbreviations on, using Inform 6.35 (before this change):

 69561 characters used in text       43980 bytes compressed (rate 0.632)
127804 bytes used in Z-machine

Abbreviations on, with this change:

 69561 characters used in text       43876 bytes compressed (rate 0.630)
127680 bytes used in Z-machine

It’s a small improvement, but still welcome.

(The “characters used” line includes the Abbreviation directives, is why that changes. “Bytes used” is the important stat.)

5 Likes

On constraint systems, every single byte counts! Thank you so much Hugo!

1 Like

I just saved 1.1k only with recompiling the latest Inform 6 compiler trunk! Thanks @mulehollandaise and @zarf!

5 Likes

I finally added support for Inform6 to ZAbbrevMaker. It requires a 'gametext.txt"-file generated with Inform 6.35 or later.

Source code, instructions and precompiled executables for the most common platforms are here.

3 Likes

Version 0.4 of ZAbbrevMaker. Bugfixes in this version:

  • Handle path-seperator machine independent (\ on windows, / on other).
  • Autodetect text encoding and if that fails switch to force encoding.
  • Added lines tagged with H and W in the “gametext.txt” to the calculation of abbreviations.

Here are some benchmark tests that I did for this version, PunyInform:

Hibernated 1 DC, release 8
109.204 without abbreviations
100.660 64 abbreviations pre-defined in PunyInform
 99.812 64 abbreviations generated by Inform6 (-u)
 94.364	96 abbreviations generated by ZAbbrevMaker 0.3
 94.116	96 abbreviations generated by ZAbbrevMaker 0.4

The Job, release 5
 46.996 without abbreviations
 44.272 64 abbreviations pre-defined in PunyInform
 43.936 64 abbreviations generated by Inform6 (-u)
 43.032 96 abbreviations generated by ZAbbrevMaker 0.3
 42.904 96 abbreviations generated by ZAbbrevMaker 0.4

Craverly Heights (PunyInform)
 58.436 without abbreviations
 54.528 64 abbreviations pre-defined in PunyInform
 52.456 64 abbreviations generated by Inform6 (-u)
 50.644 96 abbreviations generated by ZAbbrevMaker 0.3
 50.616 96 abbreviations generated by ZAbbrevMaker 0.4
5 Likes

I have updated ZAbbrevMaker with a new ,experimental, function that lets you generate tailor-made alphabets for the game and calculate the abbreviations based on this alphabet.
This only works for z5+ games where custom alphabets are allowed and it is obviously more useful if the game is in another language than english, especially a language with lots of accented characters.

Benchmark testfiles in Inform6, PunyInform and ZIL
Hibernated 1 DC, release 8 (PunyInform, Z5)
109.204 without abbreviations
100.660 64 abbreviations pre-defined in PunyInform
 99.812 64 abbreviations generated by Inform6 (-u)
 94.364	96 abbreviations generated by ZAbbrevMaker 0.3
 94.116	96 abbreviations generated by ZAbbrevMaker 0.4
 93.296 ZAbbrevMaker 0.5 with custom alphabet

The Job, release 5 (PunyInform, Z5)
 46.996 without abbreviations
 44.272 64 abbreviations pre-defined in PunyInform
 43.936 64 abbreviations generated by Inform6 (-u)
 43.032 96 abbreviations generated by ZAbbrevMaker 0.3
 42.904 96 abbreviations generated by ZAbbrevMaker 0.4
 42.736 ZAbbrevMaker 0.5 with custom alphabet

Craverly Heights (PunyInform, Z5)
 58.436 without abbreviations
 54.528 64 abbreviations pre-defined in PunyInform
 52.456 64 abbreviations generated by Inform6 (-u)
 50.644 96 abbreviations generated by ZAbbrevMaker 0.3
 50.616 96 abbreviations generated by ZAbbrevMaker 0.4
 50.216 ZAbbrevMaker 0.5 with custom alphabet
 
Craverly Heights (Inform 6.12.5, Z5)
 94.612 without abbreviations
 87.854 64 abbreviations generated by Inform6 (-u)
 85.708 96 abbreviations generated by ZAbbrevMaker 0.5
 85.144 ZAbbrevMaker 0.5 with custom alphabet
 
Beyond Zork (ZIL, Z5)
278.808 without abbreviations
262.190 ZILCH abbreviations (Infocom original)
258.440 Zilf 0.9 generated abbreviations
257.584 Zilf in development generated abbreviations
256.120 ZAbbrevMaker 0.5 applied with Zilf in development
254.956 ZAbbrevMaker 0.5 with custom alphabet

Craverly Heights (ZIL, Z5)
 55.832 without abbreviations
 48.760 Zilf 0.9 generated abbreviations
 48.092 Zilf in development generated abbreviations
 47.512 ZAbbrevMaker 0.5 applied with Zilf in development
 47.164 ZAbbrevMaker 0.5 with custom alphabet

Zork 285 (ZIL, Z5)
 45.080 without abbreviations
 39.628 Zilf 0.9 generated abbreviations
 39.572 Zilf in development generated abbreviations
 38.524 ZAbbrevMaker 0.5 applied with Zilf in development
 38.292 ZAbbrevMaker 0.5 with custom alphabet

Heart of Ice (ZIL, Z5)
283.184 without abbreviations
232.408 Zilf 0.9 generated abbreviations
228.144 Zilf in development generated abbreviations
221.700 ZAbbrevMaker 0.5 applied with Zilf in development
221.184 ZAbbrevMaker 0.5 with custom alphabet

Zork II (ZIL, Z3)
103.462 without abbreviations
 90.368 Zilf 0.9 generated abbreviations
 89.454 Zilf in development generated abbreviations
 88.198 ZAbbrevMaker 0.5 applied with Zilf in development

Mini-Zork II release 14 (ZIL, Z3)
 60.316 without abbreviations
 54.234 Zilf 0.9 generated abbreviations
 53.906 Zilf in development generated abbreviations
 53.420 ZAbbrevMaker 0.5 applied with Zilf in development

Trinity (ZIL, Z4)
281.580 without abbreviations
262.065 ZILCH abbreviations (Infocom original)
257.408 Zilf 0.9 generated abbreviations
256.908 Zilf in development generated abbreviations
254.812 ZAbbrevMaker 0.5 applied with Zilf in development

Note that if you use custom alphabets in Inform you must specify the new alphabet as early as possible because the compiler only starts coding text with the new alphabet at the point it is included but the interpreter will apply the the specified alphabet on the whole text in the game.

3 Likes

If I export the gametext.txt-file with the new options, I get these types of textlines:

I:info
G:game text
V:veneer text
L:lowmem string
A:abbreviation
D:dict word
O:object name
S:symbol
X:infix
H:game text inline in opcode
W:veneer text inline in opcode

I’m not entirely sure what each of these is but currently when I calculate the abbreviations I use G, V, L, O, H & W. By chance I discovered today that I also should include S.

Are there anymore that should be included (or some that should be excluded)?

That looks right.

Dict words can’t use abbreviations. Abbreviations can’t recursively use abbreviations (according to the Z-spec). Infix is debugging code which will be omitted from a released game, so you wouldn’t want to include it in abbreviation calculations. Info text doesn’t appear in the game file at all.

1 Like

I have updated ZAbbrevMaker to version 0.6. Fixes are:

  • Include symbols (S:) from gametext.txt when caclulation abbreviations.
  • Calculate abbreviations on all candidates. Earlier versions used a cut-off that excluded candidates that saved less than 20 bytes (this for speed). This affects the result, especially for very small files.
  • Inform does not allow abreviations longer than 63 characters. Added a warning and limits the abbreviations to max 63 characters when using the -i switch.
Benchmark testfiles in Inform6, PunyInform and ZIL
Hibernated 1 DC, release 8 (PunyInform v2.6, Z5)
109.204 without abbreviations
100.660 64 abbreviations pre-defined in PunyInform
 99.812 64 abbreviations generated by Inform6 (-u)
 94.364	96 abbreviations generated by ZAbbrevMaker 0.3
 94.116	96 abbreviations generated by ZAbbrevMaker 0.4
 93.296 ZAbbrevMaker 0.5 with custom alphabet
 93.372 96 abbreviations with ZAbbrevMaker 0.6 & Inform 6.36b

The Job, release 5 (PunyInform v3.0, Z5)
 46.996 without abbreviations
 44.272 64 abbreviations pre-defined in PunyInform
 43.936 64 abbreviations generated by Inform6 (-u)
 43.032 96 abbreviations generated by ZAbbrevMaker 0.3
 42.904 96 abbreviations generated by ZAbbrevMaker 0.4
 42.736 ZAbbrevMaker 0.5 with custom alphabet
 42.660 96 abbreviations with ZAbbrevMaker 0.6 & Inform 6.36b

Craverly Heights (PunyInform v2.6, Z5)
 58.436 without abbreviations
 54.528 64 abbreviations pre-defined in PunyInform
 52.456 64 abbreviations generated by Inform6 (-u)
 50.644 96 abbreviations generated by ZAbbrevMaker 0.3
 50.616 96 abbreviations generated by ZAbbrevMaker 0.4
 50.216 ZAbbrevMaker 0.5 with custom alphabet
 50.124 96 abbreviations with ZAbbrevMaker 0.6 & Inform 6.36b
 50.076 96 abbreviations with ZAbbrevMaker 0.6 (-l 63) & Inform 6.36b

Craverly Heights (Inform 6.12.5, Z5)
 94.612 without abbreviations
 87.854 64 abbreviations generated by Inform6 (-u)
 85.708 96 abbreviations generated by ZAbbrevMaker 0.5
 85.144 ZAbbrevMaker 0.5 with custom alphabet
 85.372 96 abbreviations with ZAbbrevMaker 0.6 & Inform 6.36b
 85.344 96 abbreviations with ZAbbrevMaker 0.6 (-l 63) & Inform 6.36b

Nada (PunyInform v3.2, Z3)
 25.230 without abbreviations (-v3s)
 24.098 64 pre-defined abbreviations. Inform 6.35 (-v3es)
 24.094 64 pre-defined abbreviations. Inform 6.36b (-v3es)
 24.032 96 abbrevs with ZAbbrevMaker 0.6. Inform 6.35 (-v3es)
 24.022 96 abbrevs with ZAbbrevMaker 0.6. Inform 6.36b (-v3es)
 24.026 96 abbrevs with ZAbbrevMaker 0.6 (-l 63). Inform 6.35 (-v3es)
 24.016 96 abbrevs with ZAbbrevMaker 0.6 (-l 63). Inform 6.36b (-v3es)

Beyond Zork (ZIL, Z5)
278.808 without abbreviations
262.190 ZILCH abbreviations (Infocom original)
258.440 Zilf 0.9 generated abbreviations
257.584 Zilf in development generated abbreviations
256.120 ZAbbrevMaker 0.5 applied with Zilf in development
254.956 ZAbbrevMaker 0.5 with custom alphabet
256.120 ZAbbrevMaker 0.6 applied with Zilf in development

Craverly Heights (ZIL, Z5)
 55.832 without abbreviations
 48.760 Zilf 0.9 generated abbreviations
 48.092 Zilf in development generated abbreviations
 47.512 ZAbbrevMaker 0.5 applied with Zilf in development
 47.164 ZAbbrevMaker 0.5 with custom alphabet
 47.508 ZAbbrevMaker 0.6 applied with Zilf in development
 47.436 ZAbbrevMaker 0.6 with -l 100 and optimal parse in Zilf

Zork 285 (ZIL, Z5)
 45.080 without abbreviations
 39.628 Zilf 0.9 generated abbreviations
 39.572 Zilf in development generated abbreviations
 38.524 ZAbbrevMaker 0.5 applied with Zilf in development
 38.292 ZAbbrevMaker 0.5 with custom alphabet
 38.540 ZAbbrevMaker 0.6 applied with Zilf in development
 38.452 ZAbbrevMaker 0.6 with -l 100 and optimal parse in Zilf
 
Heart of Ice (ZIL, Z5)
283.184 without abbreviations
232.408 Zilf 0.9 generated abbreviations
228.144 Zilf in development generated abbreviations
221.700 ZAbbrevMaker 0.5 applied with Zilf in development
221.184 ZAbbrevMaker 0.5 with custom alphabet
221.780 ZAbbrevMaker 0.6 applied with Zilf in development
221.700 ZAbbrevMaker 0.6 with -l 40 and optimal parse in Zilf

Zork II (ZIL, Z3)
103.462 without abbreviations
 90.368 Zilf 0.9 generated abbreviations
 89.454 Zilf in development generated abbreviations
 88.198 ZAbbrevMaker 0.5 applied with Zilf in development
 88.188 ZAbbrevMaker 0.6 applied with Zilf in development

Mini-Zork II release 15 (ZIL, Z3)
 60.316 without abbreviations
 54.234 Zilf 0.9 generated abbreviations
 53.906 Zilf in development generated abbreviations
 53.420 ZAbbrevMaker 0.5 applied with Zilf in development
 53.414 ZAbbrevMaker 0.6 applied with Zilf in development
 53.414 ZAbbrevMaker 0.6 with -l 100 and optimal parse in Zilf

Trinity (ZIL, Z4)
281.580 without abbreviations
262.065 ZILCH abbreviations (Infocom original)
257.408 Zilf 0.9 generated abbreviations
256.908 Zilf in development generated abbreviations
254.812 ZAbbrevMaker 0.5 applied with Zilf in development
254.812 ZAbbrevMaker 0.6 applied with Zilf in development
5 Likes

There is a new updated version of ZAbbrevMaker (version 0.9). It’s mainly done because the earlier precompiled binaries needs deprecated libraries (at least the Linux version, running on Ubuntu 23.04). The earlier versions where compiled with Visual Studio 2019 and NET5.0.

The new versions are compiled with Visual Studio 2022 and NET7.0. I also did some code cleanup and tweaks with the compile parameters, there is no change in the logic and it produces the same output. The result are files that is half the size but funnily enough runs in more than twice the speed (NET7.0 better than NET5.0?).

The files (and the source) are on GitHub, if you are only interested in a precompiled file, they can also be found on my Google Drive.

EDIT: A bit OT, but I found this short (not!) article that explains the improvements I get with NET7.0. It contains some information about compile property settings I’m eager to try out.

6 Likes

I’ve read this thread over a few times now.

I thought the hard part was finding the suitable abbreviations in the first place, and applying them in the compiler is pretty simple. I also thought Wagner’s optimal parse was necessary for the construction of the abbreviations list, not also for its application?

If you already store your “best” (ie longest) abbreviations first, I’d think it was enough to, for each string being encoded, iterate over each character in the string and if any abbreviations start with that letter, search the potential abbreviations (in descending length order) for the first match.

(I was thinking I’d avoided a lot of hard work by just making Zabbrev figure out the tough bits for me)

But seeing as it looks like Wagner’s algorithm was added to Inform several years ago… I’m missing something.

Thanks,

-Dave

Not quite. Imagine that we have abbreviations “special” and “es”. The word “especially” should be encoded as “especially”, not “especially”.

3 Likes

Ah, that’s a much clearer example than what I was fumbling to synthesize.

This link from above also describes the issue (the original pull request) Optimal parsing of abbreviations by hlabrand · Pull Request #108 · DavidKinder/Inform6 · GitHub

It’s funny how similar abbreviation finding and general sliding window compression algorithms are. Some LZW code I write nearly 30 years ago had a similar issue where an eager match sometimes resulted in poorer compression.

I’m glad you saw this thread Daniel, I was worried the suggestions I’d made yesterday were steering you in the wrong direction. My approach does work, but it’s probably a few percent off optimal in larger datasets.

-Dave

Oh no, it was very helpful! I’m deliberately trying to implement the simplest possible version of the system now, because once something is in place, it’s easier for other people to improve on. Now we have data structures for abbreviation data, and an algorithm to apply them. Neither one is ideal, but someone can now write new code to fill those data structures based on a file of optimization data (for example), or a new non-greedy algorithm for better application.

Turns out I’d done this implicitly on accident. I’ll study the papers and do this properly.

Although I’ve studied a lot of the papers and watched part of that MIT lecture and am still having a hard time finding a concise description.

One of the papers mentioned is Wagner 1973, “Common Phrase and Minimum-Space Text Storage”. That seems most relevant for “what do you put in the compiler once something has already figured out the best abbreviations”. But there’s also a lot of discussion about suffix trees, with the canonical BANANA example, which seem more geared for finding the abbreviations in the first place? To make things even more confusing, that effort stems back to another paper from 1973 by Weiner, not Wagner.

The algorithm was improved in 1976, and then somebody else improved it even further in 1995.

Can anybody point me to some reasonable pseudocode demonstrating at least the parts from Wagner 1973? That’s the part I need to apply the abbreviations. My current solution is greedy and looks for the longest match from the current character position, which as pointed out above is often suboptimal.

Thanks,

-Dave