Over on another thread I wrote a fairly cryptic remark about suffix trees and abbreviations. I’ll try to expand on that a little less cryptically here.
First, the results. I got Zork2 down to 89450 bytes with an abbreviation-finding algorithm that takes under 3.1 seconds to run (including I/O, but not including ZILF) on one core of a 3.5 GHz Core i5. This is native C++, not managed C#, but that’s not where most of the difference comes from.
zork2_freq.zap.txt (6.0 KB)
I started by hacking ZAPF to spit out the strings it uses to generate abbreviations. Then I found a library with a useful generalized suffix tree: SeqAn (not SeqAn3, which seemed less useful for my purposes). Suffix trees are ridiculously efficient data structures for string search and related problems; they probably don’t get used nearly enough. For instance, it’s possible to index a set of strings in time linear in the total length of the strings, and then find all occurrences of any given substring in that set in time linear in the length of the substring plus the number of occurrences. They’re not especially space-efficient, but there are variants that use less space or trade off time for space.
(nb: a “generalized” suffix tree indexes a set of strings; a regular suffix tree indexes only a single string; I used only the generalized variant)
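To make the “index once, query cheaply” idea concrete, here’s a small sketch of a generalized suffix *array* — one of the more space-efficient cousins mentioned above. This isn’t the SeqAn code I used, and the naive sort makes construction slower than a real suffix tree; it’s just meant to show the query interface: index a set of strings, then find every occurrence of a pattern with two binary searches.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Sketch of a generalized suffix array over a set of strings.
// Each entry is (string id, start offset) of one suffix; sorting them
// lexicographically puts all suffixes sharing a prefix in one contiguous run.
struct SuffixIndex {
    std::vector<std::string> texts;
    std::vector<std::pair<int, int>> suffixes; // (string id, offset)

    explicit SuffixIndex(std::vector<std::string> t) : texts(std::move(t)) {
        for (int id = 0; id < (int)texts.size(); ++id)
            for (int off = 0; off < (int)texts[id].size(); ++off)
                suffixes.push_back({id, off});
        std::sort(suffixes.begin(), suffixes.end(),
                  [&](const auto& a, const auto& b) {
                      return texts[a.first].compare(a.second, std::string::npos,
                                 texts[b.first], b.second, std::string::npos) < 0;
                  });
    }

    // All (string id, offset) positions where `pat` occurs, found by
    // binary-searching for the run of suffixes that start with `pat`.
    std::vector<std::pair<int, int>> find(const std::string& pat) const {
        auto lo = std::lower_bound(suffixes.begin(), suffixes.end(), pat,
            [&](const std::pair<int, int>& a, const std::string& p) {
                return texts[a.first].compare(a.second, p.size(), p) < 0;
            });
        auto hi = std::upper_bound(lo, suffixes.end(), pat,
            [&](const std::string& p, const std::pair<int, int>& a) {
                return texts[a.first].compare(a.second, p.size(), p) > 0;
            });
        return {lo, hi};
    }
};
```

A suffix tree gets the same answers without the binary search, and in time linear in the pattern length plus the number of hits.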
So, once I had a suffix tree library, I implemented the algorithm I described before.
Step 1: Load the strings into an array (a SeqAn “StringSet”, to be exact).
Step 2: Index the array into a suffix tree.
Step 3: Traverse the suffix tree in depth-first preorder (that is, visit parents before children).
Step 3.1: Calculate the cost of the string leading up to the current node. This is just the cost of the parent node plus the cost of the edge, which is the number of Z-characters in the encoding of the edge. We will count each Z-character exactly once.
Step 3.2: The suffix tree library gives me the number of occurrences of the string leading up to the current node.
Step 3.3: Calculate savings as count * (cost - 2) - cost. (This is equivalent to ZAPF’s calculation, though the “new” algorithms in the other thread seem to get ‘count’ too high.)
Step 3.4: Keep track of the node with the most savings
Step 4: The node with the most savings is our new abbreviation. Output it.
Step 5: Get the list of occurrences of the abbreviation in the StringSet.
Step 6: Sort that list
Step 7: Build a new list of strings by going through the old list and splitting each string around any occurrence of the abbreviation.
Step 8: Repeat from Step 2 until 96 abbreviations have been generated.
Step 9: Profit.
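The loop above can be sketched in plain C++ without SeqAn. This stand-in replaces the suffix-tree traversal of steps 2–4 with a brute-force enumeration of substrings (far slower, but the same scoring), and it uses a simplified cost model — one Z-character per character — where the real encoder charges more for characters outside the current alphabet. Names and the cost assumption are mine, not from the original code.

```cpp
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Steps 2-4 (naive version): count every substring's occurrences and score
// it with the savings formula count * (cost - 2) - cost, where cost is
// approximated as one Z-character per character.
std::string bestAbbreviation(const std::vector<std::string>& strings) {
    std::map<std::string, long> counts;
    for (const auto& s : strings)
        for (size_t i = 0; i < s.size(); ++i)
            for (size_t len = 2; i + len <= s.size(); ++len)
                ++counts[s.substr(i, len)];
    std::string best;
    long bestSavings = 0;
    for (const auto& [sub, count] : counts) {
        long cost = static_cast<long>(sub.size()); // simplified cost model
        long savings = count * (cost - 2) - cost;
        if (savings > bestSavings) {
            bestSavings = savings;
            best = sub;
        }
    }
    return best;
}

// Steps 5-7: split every string around occurrences of the chosen
// abbreviation, so later passes only see the leftover fragments.
std::vector<std::string> splitAround(const std::vector<std::string>& strings,
                                     const std::string& abbrev) {
    std::vector<std::string> out;
    for (const auto& s : strings) {
        size_t pos = 0, hit;
        while ((hit = s.find(abbrev, pos)) != std::string::npos) {
            if (hit > pos) out.push_back(s.substr(pos, hit - pos));
            pos = hit + abbrev.size();
        }
        if (pos < s.size()) out.push_back(s.substr(pos));
    }
    return out;
}
```

Step 8 is then just a loop: take `bestAbbreviation`, emit it, `splitAround` it, and repeat until 96 abbreviations are found (or no candidate has positive savings).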
As I mentioned, this wasn’t parallelized. You could parallelize the DFS (though I don’t think SeqAn supports that) and the building of the new string list, but it’s not necessary.