Over in the thread about Mini-Zork II I noticed that the abbreviation files that @russotto saved ~300 bytes more than the other. Why?
One difference was that the beginning of his files looked like this:
.FSTR FSTR?1," in the " ; 35x, saved 202
.FSTR FSTR?2,"n the " ; 38x, saved 146
.FSTR FSTR?3,"to the " ; 47x, saved 228
.FSTR FSTR?4,"o the " ; 13x, saved 46
.FSTR FSTR?5," of the " ; 38x, saved 220
.FSTR FSTR?6,"s the " ; 51x, saved 198
.FSTR FSTR?7," the " ; 154x, saved 457
and the other like this:
.FSTR FSTR?1," the " ; 376x, saved 1123
.FSTR FSTR?2,"The " ; 243x, saved 724
.FSTR FSTR?3," is " ; 198x, saved 392
.FSTR FSTR?4,"You " ; 117x, saved 346
.FSTR FSTR?5," you" ; 174x, saved 344
The glaring difference is that the first file waits a while to define the clear winner " the "
and list a couple of phrases that have the winner as a part. This is a good strategy because if " the "
is the first abbreviation it ‘ruins’ tho other ones and they will never get used!
I wrote a little test (in VB.NET, hope that doesn’t bug too many!) where I try to implement this logic. The basic structure of the code is largely based on the Python-script by @mulehollandaise and doesn’t use suffix-tree and is therefore a bit slower (30 s on my computer for Mini-Zork II).
- Identify the first winner.
- Pick a threshold based on this winner. I have experimented with different values, but for Mini-Zork II 8% gave the best result.
- Instead of using the winner pick the longest abbreviation that clears the threshold and have the winner inside it.
- Recalculate index.
- Pick a new winner
- repeat from 3 until 96 abbreviations are found.
Below is the code if anyone wants to use it or improve it. For Visual Studio you need a Form with a Button and a multiline Textbox and then paste just paste the code in the Form’s class.
Beware! The code isn’t very polished yet.
Form1.vb.txt (11.8 KB)