Very useful. There still is a small issue that I had already mentioned in this forum about the characters encoding of the gametext.txt file which should be the same as the one chosen for the game. If you have accented characters in your dictionary, they are displayed wrong. A detail.
This is great!
I intend to modify my program so it accepts gametext input and generates inform-style abbreviation output. Now I can use your code as template, very useful.
I have not tried to do the “optimal parse” for abbreviation-finding. Inform’s abbreviation finding routine, try_abbreviations_from(), is very simple. If someone feels like improving that, go for it!
I just submitted a pull request to the Inform 6 compiler that applies Wagner’s algorithm given a set of abbreviations. It seems to work on “Tristam Island” (the gamefile is fine, the output of the compiler with -f is good), and it saved me 1200 bytes. I nearly fell off my chair 
I hope I didn’t mess anything up and I hope everyone gets to enjoy it in their I6 compiler very soon! 
For anyone interested, the code has just been merged into the I6 compiler 
Thanks again!
Here’s what I see when compiling Advent.inf to .z5 with a list of abbreviations generated by Inform -u. (This is not the perfect abbreviation list, but the point is to show the effect of this compiler update.)
No abbreviations:
69279 characters used in text 51470 bytes compressed (rate 0.742)
135252 bytes used in Z-machine
Abbreviations on, using Inform 6.35 (before this change):
69561 characters used in text 43980 bytes compressed (rate 0.632)
127804 bytes used in Z-machine
Abbreviations on, with this change:
69561 characters used in text 43876 bytes compressed (rate 0.630)
127680 bytes used in Z-machine
It’s a small improvement, but still welcome.
(The “characters used” line includes the Abbreviation directives, is why that changes. “Bytes used” is the important stat.)
On constraint systems, every single byte counts! Thank you so much Hugo!
I just saved 1.1k only with recompiling the latest Inform 6 compiler trunk! Thanks @mulehollandaise and @zarf!
I finally added support for Inform6 to ZAbbrevMaker. It requires a 'gametext.txt"-file generated with Inform 6.35 or later.
Source code, instructions and precompiled executables for the most common platforms are here.
Version 0.4 of ZAbbrevMaker. Bugfixes in this version:
- Handle path-seperator machine independent (\ on windows, / on other).
- Autodetect text encoding and if that fails switch to force encoding.
- Added lines tagged with H and W in the “gametext.txt” to the calculation of abbreviations.
Here are some benchmark tests that I did for this version, PunyInform:
Hibernated 1 DC, release 8
109.204 without abbreviations
100.660 64 abbreviations pre-defined in PunyInform
99.812 64 abbreviations generated by Inform6 (-u)
94.364 96 abbreviations generated by ZAbbrevMaker 0.3
94.116 96 abbreviations generated by ZAbbrevMaker 0.4
The Job, release 5
46.996 without abbreviations
44.272 64 abbreviations pre-defined in PunyInform
43.936 64 abbreviations generated by Inform6 (-u)
43.032 96 abbreviations generated by ZAbbrevMaker 0.3
42.904 96 abbreviations generated by ZAbbrevMaker 0.4
Craverly Heights (PunyInform)
58.436 without abbreviations
54.528 64 abbreviations pre-defined in PunyInform
52.456 64 abbreviations generated by Inform6 (-u)
50.644 96 abbreviations generated by ZAbbrevMaker 0.3
50.616 96 abbreviations generated by ZAbbrevMaker 0.4
I have updated ZAbbrevMaker with a new ,experimental, function that lets you generate tailor-made alphabets for the game and calculate the abbreviations based on this alphabet.
This only works for z5+ games where custom alphabets are allowed and it is obviously more useful if the game is in another language than english, especially a language with lots of accented characters.
Benchmark testfiles in Inform6, PunyInform and ZIL
Hibernated 1 DC, release 8 (PunyInform, Z5)
109.204 without abbreviations
100.660 64 abbreviations pre-defined in PunyInform
99.812 64 abbreviations generated by Inform6 (-u)
94.364 96 abbreviations generated by ZAbbrevMaker 0.3
94.116 96 abbreviations generated by ZAbbrevMaker 0.4
93.296 ZAbbrevMaker 0.5 with custom alphabet
The Job, release 5 (PunyInform, Z5)
46.996 without abbreviations
44.272 64 abbreviations pre-defined in PunyInform
43.936 64 abbreviations generated by Inform6 (-u)
43.032 96 abbreviations generated by ZAbbrevMaker 0.3
42.904 96 abbreviations generated by ZAbbrevMaker 0.4
42.736 ZAbbrevMaker 0.5 with custom alphabet
Craverly Heights (PunyInform, Z5)
58.436 without abbreviations
54.528 64 abbreviations pre-defined in PunyInform
52.456 64 abbreviations generated by Inform6 (-u)
50.644 96 abbreviations generated by ZAbbrevMaker 0.3
50.616 96 abbreviations generated by ZAbbrevMaker 0.4
50.216 ZAbbrevMaker 0.5 with custom alphabet
Craverly Heights (Inform 6.12.5, Z5)
94.612 without abbreviations
87.854 64 abbreviations generated by Inform6 (-u)
85.708 96 abbreviations generated by ZAbbrevMaker 0.5
85.144 ZAbbrevMaker 0.5 with custom alphabet
Beyond Zork (ZIL, Z5)
278.808 without abbreviations
262.190 ZILCH abbreviations (Infocom original)
258.440 Zilf 0.9 generated abbreviations
257.584 Zilf in development generated abbreviations
256.120 ZAbbrevMaker 0.5 applied with Zilf in development
254.956 ZAbbrevMaker 0.5 with custom alphabet
Craverly Heights (ZIL, Z5)
55.832 without abbreviations
48.760 Zilf 0.9 generated abbreviations
48.092 Zilf in development generated abbreviations
47.512 ZAbbrevMaker 0.5 applied with Zilf in development
47.164 ZAbbrevMaker 0.5 with custom alphabet
Zork 285 (ZIL, Z5)
45.080 without abbreviations
39.628 Zilf 0.9 generated abbreviations
39.572 Zilf in development generated abbreviations
38.524 ZAbbrevMaker 0.5 applied with Zilf in development
38.292 ZAbbrevMaker 0.5 with custom alphabet
Heart of Ice (ZIL, Z5)
283.184 without abbreviations
232.408 Zilf 0.9 generated abbreviations
228.144 Zilf in development generated abbreviations
221.700 ZAbbrevMaker 0.5 applied with Zilf in development
221.184 ZAbbrevMaker 0.5 with custom alphabet
Zork II (ZIL, Z3)
103.462 without abbreviations
90.368 Zilf 0.9 generated abbreviations
89.454 Zilf in development generated abbreviations
88.198 ZAbbrevMaker 0.5 applied with Zilf in development
Mini-Zork II release 14 (ZIL, Z3)
60.316 without abbreviations
54.234 Zilf 0.9 generated abbreviations
53.906 Zilf in development generated abbreviations
53.420 ZAbbrevMaker 0.5 applied with Zilf in development
Trinity (ZIL, Z4)
281.580 without abbreviations
262.065 ZILCH abbreviations (Infocom original)
257.408 Zilf 0.9 generated abbreviations
256.908 Zilf in development generated abbreviations
254.812 ZAbbrevMaker 0.5 applied with Zilf in development
Note that if you use custom alphabets in Inform you must specify the new alphabet as early as possible because the compiler only starts coding text with the new alphabet at the point it is included but the interpreter will apply the the specified alphabet on the whole text in the game.
If I export the gametext.txt-file with the new options, I get these types of textlines:
I:info
G:game text
V:veneer text
L:lowmem string
A:abbreviation
D:dict word
O:object name
S:symbol
X:infix
H:game text inline in opcode
W:veneer text inline in opcode
I’m not entirely sure what each of these is but currently when I calculate the abbreviations I use G, V, L, O, H & W. By chance I discovered today that I also should include S.
Are there anymore that should be included (or some that should be excluded)?
That looks right.
Dict words can’t use abbreviations. Abbreviations can’t recursively use abbreviations (according to the Z-spec). Infix is debugging code which will be omitted from a released game, so you wouldn’t want to include it in abbreviation calculations. Info text doesn’t appear in the game file at all.
I have updated ZAbbrevMaker to version 0.6. Fixes are:
- Include symbols (S:) from gametext.txt when caclulation abbreviations.
- Calculate abbreviations on all candidates. Earlier versions used a cut-off that excluded candidates that saved less than 20 bytes (this for speed). This affects the result, especially for very small files.
- Inform does not allow abreviations longer than 63 characters. Added a warning and limits the abbreviations to max 63 characters when using the -i switch.
Benchmark testfiles in Inform6, PunyInform and ZIL
Hibernated 1 DC, release 8 (PunyInform v2.6, Z5)
109.204 without abbreviations
100.660 64 abbreviations pre-defined in PunyInform
99.812 64 abbreviations generated by Inform6 (-u)
94.364 96 abbreviations generated by ZAbbrevMaker 0.3
94.116 96 abbreviations generated by ZAbbrevMaker 0.4
93.296 ZAbbrevMaker 0.5 with custom alphabet
93.372 96 abbreviations with ZAbbrevMaker 0.6 & Inform 6.36b
The Job, release 5 (PunyInform v3.0, Z5)
46.996 without abbreviations
44.272 64 abbreviations pre-defined in PunyInform
43.936 64 abbreviations generated by Inform6 (-u)
43.032 96 abbreviations generated by ZAbbrevMaker 0.3
42.904 96 abbreviations generated by ZAbbrevMaker 0.4
42.736 ZAbbrevMaker 0.5 with custom alphabet
42.660 96 abbreviations with ZAbbrevMaker 0.6 & Inform 6.36b
Craverly Heights (PunyInform v2.6, Z5)
58.436 without abbreviations
54.528 64 abbreviations pre-defined in PunyInform
52.456 64 abbreviations generated by Inform6 (-u)
50.644 96 abbreviations generated by ZAbbrevMaker 0.3
50.616 96 abbreviations generated by ZAbbrevMaker 0.4
50.216 ZAbbrevMaker 0.5 with custom alphabet
50.124 96 abbreviations with ZAbbrevMaker 0.6 & Inform 6.36b
50.076 96 abbreviations with ZAbbrevMaker 0.6 (-l 63) & Inform 6.36b
Craverly Heights (Inform 6.12.5, Z5)
94.612 without abbreviations
87.854 64 abbreviations generated by Inform6 (-u)
85.708 96 abbreviations generated by ZAbbrevMaker 0.5
85.144 ZAbbrevMaker 0.5 with custom alphabet
85.372 96 abbreviations with ZAbbrevMaker 0.6 & Inform 6.36b
85.344 96 abbreviations with ZAbbrevMaker 0.6 (-l 63) & Inform 6.36b
Nada (PunyInform v3.2, Z3)
25.230 without abbreviations (-v3s)
24.098 64 pre-defined abbreviations. Inform 6.35 (-v3es)
24.094 64 pre-defined abbreviations. Inform 6.36b (-v3es)
24.032 96 abbrevs with ZAbbrevMaker 0.6. Inform 6.35 (-v3es)
24.022 96 abbrevs with ZAbbrevMaker 0.6. Inform 6.36b (-v3es)
24.026 96 abbrevs with ZAbbrevMaker 0.6 (-l 63). Inform 6.35 (-v3es)
24.016 96 abbrevs with ZAbbrevMaker 0.6 (-l 63). Inform 6.36b (-v3es)
Beyond Zork (ZIL, Z5)
278.808 without abbreviations
262.190 ZILCH abbreviations (Infocom original)
258.440 Zilf 0.9 generated abbreviations
257.584 Zilf in development generated abbreviations
256.120 ZAbbrevMaker 0.5 applied with Zilf in development
254.956 ZAbbrevMaker 0.5 with custom alphabet
256.120 ZAbbrevMaker 0.6 applied with Zilf in development
Craverly Heights (ZIL, Z5)
55.832 without abbreviations
48.760 Zilf 0.9 generated abbreviations
48.092 Zilf in development generated abbreviations
47.512 ZAbbrevMaker 0.5 applied with Zilf in development
47.164 ZAbbrevMaker 0.5 with custom alphabet
47.508 ZAbbrevMaker 0.6 applied with Zilf in development
47.436 ZAbbrevMaker 0.6 with -l 100 and optimal parse in Zilf
Zork 285 (ZIL, Z5)
45.080 without abbreviations
39.628 Zilf 0.9 generated abbreviations
39.572 Zilf in development generated abbreviations
38.524 ZAbbrevMaker 0.5 applied with Zilf in development
38.292 ZAbbrevMaker 0.5 with custom alphabet
38.540 ZAbbrevMaker 0.6 applied with Zilf in development
38.452 ZAbbrevMaker 0.6 with -l 100 and optimal parse in Zilf
Heart of Ice (ZIL, Z5)
283.184 without abbreviations
232.408 Zilf 0.9 generated abbreviations
228.144 Zilf in development generated abbreviations
221.700 ZAbbrevMaker 0.5 applied with Zilf in development
221.184 ZAbbrevMaker 0.5 with custom alphabet
221.780 ZAbbrevMaker 0.6 applied with Zilf in development
221.700 ZAbbrevMaker 0.6 with -l 40 and optimal parse in Zilf
Zork II (ZIL, Z3)
103.462 without abbreviations
90.368 Zilf 0.9 generated abbreviations
89.454 Zilf in development generated abbreviations
88.198 ZAbbrevMaker 0.5 applied with Zilf in development
88.188 ZAbbrevMaker 0.6 applied with Zilf in development
Mini-Zork II release 15 (ZIL, Z3)
60.316 without abbreviations
54.234 Zilf 0.9 generated abbreviations
53.906 Zilf in development generated abbreviations
53.420 ZAbbrevMaker 0.5 applied with Zilf in development
53.414 ZAbbrevMaker 0.6 applied with Zilf in development
53.414 ZAbbrevMaker 0.6 with -l 100 and optimal parse in Zilf
Trinity (ZIL, Z4)
281.580 without abbreviations
262.065 ZILCH abbreviations (Infocom original)
257.408 Zilf 0.9 generated abbreviations
256.908 Zilf in development generated abbreviations
254.812 ZAbbrevMaker 0.5 applied with Zilf in development
254.812 ZAbbrevMaker 0.6 applied with Zilf in development
There is a new updated version of ZAbbrevMaker (version 0.9). It’s mainly done because the earlier precompiled binaries needs deprecated libraries (at least the Linux version, running on Ubuntu 23.04). The earlier versions where compiled with Visual Studio 2019 and NET5.0.
The new versions are compiled with Visual Studio 2022 and NET7.0. I also did some code cleanup and tweaks with the compile parameters, there is no change in the logic and it produces the same output. The result are files that is half the size but funnily enough runs in more than twice the speed (NET7.0 better than NET5.0?).
The files (and the source) are on GitHub, if you are only interested in a precompiled file, they can also be found on my Google Drive.
EDIT: A bit OT, but I found this short (not!) article that explains the improvements I get with NET7.0. It contains some information about compile property settings I’m eager to try out.
I’ve read this thread over a few times now.
I thought the hard part was finding the suitable abbreviations in the first place, and applying them in the compiler is pretty simple. I also thought Wagner’s optimal parse was necessary for the construction of the abbreviations list, not also for its application?
If you already store your “best” (ie longest) abbreviations first, I’d think it was enough to, for each string being encoded, iterate over each character in the string and if any abbreviations start with that letter, search the potential abbreviations (in descending length order) for the first match.
(I was thinking I’d avoided a lot of hard work by just making Zabbrev figure out the tough bits for me)
But seeing as it looks like Wagner’s algorithm was added to Inform several years ago… I’m missing something.
Thanks,
-Dave
Not quite. Imagine that we have abbreviations “special” and “es”. The word “especially” should be encoded as “especially”, not “especially”.
Ah, that’s a much clearer example than what I was fumbling to synthesize.
This link from above also describes the issue (the original pull request) Optimal parsing of abbreviations by hlabrand · Pull Request #108 · DavidKinder/Inform6 · GitHub
It’s funny how similar abbreviation finding and general sliding window compression algorithms are. Some LZW code I write nearly 30 years ago had a similar issue where an eager match sometimes resulted in poorer compression.
I’m glad you saw this thread Daniel, I was worried the suggestions I’d made yesterday were steering you in the wrong direction. My approach does work, but it’s probably a few percent off optimal in larger datasets.
-Dave
Oh no, it was very helpful! I’m deliberately trying to implement the simplest possible version of the system now, because once something is in place, it’s easier for other people to improve on. Now we have data structures for abbreviation data, and an algorithm to apply them. Neither one is ideal, but someone can now write new code to fill those data structures based on a file of optimization data (for example), or a new non-greedy algorithm for better application.
Turns out I’d done this implicitly on accident. I’ll study the papers and do this properly.
Although I’ve studied a lot of the papers and watched part of that MIT lecture and am still having a hard time finding a concise description.
One of the papers mentioned is Wagner 1973, “Common Phrase and Minimum-Space Text Storage”. That seems most relevant for “what do you put in the compiler once something has already figured out the best abbreviations”. But there’s also a lot of discussion about suffix trees, with the canonical BANANA example, which seem more geared for finding the abbreviations in the first place? To make things even more confusing, that effort stems back to another paper from 1973 by Weiner, not Wagner.
The algorithm was improved in 1976, and then somebody else improved it even further in 1995.
Can anybody point me to some reasonable pseudocode demonstrating at least the parts from Wagner 1973? That’s the part I need to apply the abbreviations. My current solution is greedy and looks for the longest match from the current character position, which as pointed out above is often suboptimal.
Thanks,
-Dave