Highly optimized abbreviations computed efficiently

Over on another thread I wrote a fairly cryptic remark about suffix trees and abbreviations. I’ll try to expand on that a little less cryptically here.

First, the results. I got Zork2 down to 89450 bytes with an abbreviation-finding algorithm that takes under 3.1 seconds to run (including I/O, but not including ZILF) on one core of a 3.5Ghz Core i5. This is native C++, not managed C#, but that’s not most of the difference.

zork2_freq.zap.txt (6.0 KB)

I started by hacking ZAPF to spit out the strings it uses to generate abbreviations. Then I found a library which had a useful generalized suffix tree; this was Seqan (not Seqan3, which seemed less useful for my purposes). Suffix trees are ridiculously efficient data structures for string search and related problems; they probably don’t get used nearly enough. For instance, it’s possible to index a set of strings in time linear to the total length of the strings, and then find all occurrences of any given substring in that set in time linear in the length of the substring plus the number of occurrences. They’re not all that space-efficient but there are variants which are more space-efficient or trade off time for space.

(nb: a “generalized” suffix tree indexes a set of strings; a regular suffix tree indexes only a single string; I used only the generalized variant)

So, once I had a suffix tree library, I implemented the algorithm I described before.

Step 1: Load the strings into an array (a Seqan “StringSet” to be exact)

Step 2: Index the array into a suffix tree.

Step 3: Traverse the suffix tree in depth-first preorder (that is, visit parents before children).

Step 3.1: Calculate the cost of the string leading up to the current node. This is just the cost of the parent node plus the cost of the edge, which is the number of Z-characters in the encoding of the edge. We will count each Z-character exactly once.

Step 3.2: The suffix tree library gives me the number of occurrences of the string leading up to the current node.

Step 3.3: Calculate savings as count * (cost -2) - cost (this is equivalent to ZAPFs calculation, though the “new” algorithms in the other thread seem to get ‘count’ too high)

Step 3.4: Keep track of the node with the most savings

Step 4: The node with the most savings is our new abbreviation. Output it.

Step 5: Get the list of occurrences of the abbreviation in the StringSet.

Step 6: Sort that list

Step 7: Calculate a new list of strings by going through the old list and splitting the old string around any abbreviated text

Step 8: Repeat from step 2 96 times.

Step 9: Profit.

As I mentioned, this wasn’t parallelized. You could parallelize the DFS (though I don’t think SeqAn supports that) and the building of the new string list, but it’s not necessary.

12 Likes

Nice! I’ll see if I can get this working in C# using https://github.com/gmamaladze/trienet.

3 Likes

Awesome work!!

2 Likes

Honestly, it’s hard for me to gauge the savings. What was the original file size?

There are a lot of numbers over in this thread but a better algorithm for abbreviations could save about 1k extra. The OP:s algorithm seems to primarely be much faster.

(I don’t know how abbreviations work in Inform6 but shouldn’t this be something interesting to include into the I6 compiler also?)

There’s a feature in the I6 compiler to compute abbreviations. It’s quite primitive; it hasn’t been optimized to the level that people are talking about now.

However, I’m not sure it’s worth upgrading. Because this task is so slow, you always run it as a separate step, not as part of the regular compile. So I think it would make more sense to have a general abbreviator tool – read a list of strings, output an abbreviation list in ZIL/I6/whatever format.

4 Likes

I thought this result was better than all previous results (by a whopping two bytes) and also has the advantage of being fast. Am I missing something?

You are right, Sir. I stand corrected. :grin:

So, @Eleas, although a lot of numbers have been thrown around these might be the most relevant for comparison:

Zork 2 without abbreviations                103.462 bytes
Zork 2 with the ZILCH abbrev-file            92.192 bytes
Zork 2 with the ZILF abbrev-file             90.368 bytes
Zork 2 with Russotto suffix tree magic       89.450 bytes

Thank you. That’s pretty cool! Am I correct in recalling something along the lines of the original source not actually being compacted in this way?

No, the “ZILCH abbrev-file” line in the table above represents Infocom’s method.

And the imps also used tricks outside of this system to reuse chunks of text outside of these 96 official abbreviation slots.

1 Like

I tried a Python implementation of this. (Not because Python is fast, but because it’s easy to whack code out. I wanted to have an algorithm before I reimplemented it in something quick.)

I’m using this library: https://pypi.org/project/suffix-tree/

I’m not clear about what’s going on, though. The library assigns each node a “path”, but this doesn’t seem to be the concatenation of edges from the root down to the node.

If I compute the cost/count/savings based on the “path”, I think I get the right results. E.g, for the very short test file (five lines):

Your sword has begun to glow very brightly.
Your sword is glowing with a faint blue glow.
Your sword is no longer glowing.
You are in a large circular room whose high ceiling is lost in gloom. Eight identical passages leave the room.
A loud whirring sound comes from all around, and you feel sort of disoriented in here.

I get:

'Your sword ' (count 3, savings 16)
' glow' (count 4, savings 7)
' in ' (count 3, savings 2)
'ing ' (count 3, savings 2)
' room' (count 2, savings 1)

But I’m not actually doing the business of adding up cost or strings from the root down. I’m just using a list that I find in the node. So I’m not sure this is correct.

If anyone want to dig deeper, or like me, understand suffix trees. I watched a M.I.T. lecture about strings (suffix trees start at about 41 minutes in).

1 Like

It appears the ‘path’ of a node in the python suffix tree library contains the string leading to that node. The problem with using this rather than adding up costs from the root down is it increases algorithmic complexity; your cost calculation now, instead of running over each character in the set of strings, now runs over each character in the set of possible abbreviations, making the traversal go from O(N) to something larger (probably not O(N^2), but something like O(Nn) where “small n” is the average string size in the set, but I was never great at the theory stuff).

If you approximate cost by string length, you don’t need path at all; I believe you can use ‘node.string_depth()’ or just ‘len(node)’. But since string characters can have a cost of 1,2, or 3 Z-characters you do need the actual characters for a complete implementation. The path for just the edge leading to the current node seems to be node.edge().

You also don’t need len(tree.find_all(abbrev)); that number is node.C if I’m reading the library code right, though you may need to call self.root.compute_C before the traverse. This also is an algorithmic complexity issue.

With more review, it appears the python suffix-tree library may be not very useful. It appears to not be linear in build time (it’s easy to make subtle errors which cause that), and the first 1000 strings in Zork 2 take over 10 seconds for tree-building on my machine. All 1782 take over 70 seconds.

My sense was that 100% of the Python run time was spent on the initial tree construction. The find_all() calls were trivial. But, as I said, I wasn’t doing this for speed – I was trying to get a set of answers to compare future implementations to.

I tried an implementation using the root-down path, but the results were clearly non-optimal. I may have messed up the logic, however.

Here is my code. It requires SeqAn2 for the suffix tree. It expects as input a text file which has all the strings separated by nulls, between the startMark and endMark.

abbr.cpp.txt (9.3 KB)

This version implements a trick to avoid computing costs for all the strings. Instead it uses the fact that the cost must be between 1 and 3 times the length of the string to generate a set of candidates, then computes the true costs for those candidates. It’s faster, probably because of better cache locality. This is serious overengineering for this problem.

abbrapprox.cpp.txt (10.6 KB)

Thanks, that’s helpful.

I understand my confusion earlier. I didn’t understand the way the Python data structure was put together. It makes sense now, and my original code was correct (but not efficient). (My version also treats all characters as equal, ignoring Z-character encoding. But that aside, I’m getting the same results as you.)

89422

z2str.txt (6.0 KB)

This is by preferring the abbreviation with the fewest repeats when the savings is the same.

3 Likes

Hint for people trying to follow along at home… to build Matthew’s code on a Mac, you need to install the seqan library:

brew install brewsci/bio/seqan@2

Then this will work:

c++ -std=c++14 -o abbrev abbrev.cpp