Highly optimized abbreviations computed efficiently

russotto · December 20, 2020, 10:29pm

Over on another thread I wrote a fairly cryptic remark about suffix trees and abbreviations. I’ll try to expand on that a little less cryptically here.

First, the results. I got Zork2 down to 89450 bytes with an abbreviation-finding algorithm that takes under 3.1 seconds to run (including I/O, but not including ZILF) on one core of a 3.5Ghz Core i5. This is native C++, not managed C#, but that’s not most of the difference.

zork2_freq.zap.txt (6.0 KB)

I started by hacking ZAPF to spit out the strings it uses to generate abbreviations. Then I found a library which had a useful generalized suffix tree; this was Seqan (not Seqan3, which seemed less useful for my purposes). Suffix trees are ridiculously efficient data structures for string search and related problems; they probably don’t get used nearly enough. For instance, it’s possible to index a set of strings in time linear to the total length of the strings, and then find all occurrences of any given substring in that set in time linear in the length of the substring plus the number of occurrences. They’re not all that space-efficient but there are variants which are more space-efficient or trade off time for space.

(nb: a “generalized” suffix tree indexes a set of strings; a regular suffix tree indexes only a single string; I used only the generalized variant)

So, once I had a suffix tree library, I implemented the algorithm I described before.

Step 1: Load the strings into an array (a Seqan “StringSet” to be exact)

Step 2: Index the array into a suffix tree.

Step 3: Traverse the suffix tree in depth-first preorder (that is, visit parents before children).

Step 3.1: Calculate the cost of the string leading up to the current node. This is just the cost of the parent node plus the cost of the edge, which is the number of Z-characters in the encoding of the edge. We will count each Z-character exactly once.

Step 3.2: The suffix tree library gives me the number of occurrences of the string leading up to the current node.

Step 3.3: Calculate savings as count * (cost -2) - cost (this is equivalent to ZAPFs calculation, though the “new” algorithms in the other thread seem to get ‘count’ too high)

Step 3.4: Keep track of the node with the most savings

Step 4: The node with the most savings is our new abbreviation. Output it.

Step 5: Get the list of occurrences of the abbreviation in the StringSet.

Step 6: Sort that list

Step 7: Calculate a new list of strings by going through the old list and splitting the old string around any abbreviated text

Step 8: Repeat from step 2 96 times.

Step 9: Profit.

As I mentioned, this wasn’t parallelized. You could parallelize the DFS (though I don’t think SeqAn supports that) and the building of the new string list, but it’s not necessary.

vaporware · December 21, 2020, 5:46am

Nice! I’ll see if I can get this working in C# using https://github.com/gmamaladze/trienet.

mulehollandaise · December 21, 2020, 3:31pm

Awesome work!!

Eleas · December 21, 2020, 4:05pm

Honestly, it’s hard for me to gauge the savings. What was the original file size?

heasm66 · December 21, 2020, 4:20pm

There are a lot of numbers over in this thread but a better algorithm for abbreviations could save about 1k extra. The OP:s algorithm seems to primarely be much faster.

(I don’t know how abbreviations work in Inform6 but shouldn’t this be something interesting to include into the I6 compiler also?)

zarf · December 21, 2020, 4:25pm

There’s a feature in the I6 compiler to compute abbreviations. It’s quite primitive; it hasn’t been optimized to the level that people are talking about now.

However, I’m not sure it’s worth upgrading. Because this task is so slow, you always run it as a separate step, not as part of the regular compile. So I think it would make more sense to have a general abbreviator tool – read a list of strings, output an abbreviation list in ZIL/I6/whatever format.

jcompton · December 21, 2020, 5:02pm

I thought this result was better than all previous results (by a whopping two bytes) and also has the advantage of being fast. Am I missing something?

heasm66 · December 21, 2020, 5:05pm

You are right, Sir. I stand corrected.

jcompton · December 21, 2020, 5:17pm

So, @Eleas, although a lot of numbers have been thrown around these might be the most relevant for comparison:

Zork 2 without abbreviations                103.462 bytes
Zork 2 with the ZILCH abbrev-file            92.192 bytes
Zork 2 with the ZILF abbrev-file             90.368 bytes
Zork 2 with Russotto suffix tree magic       89.450 bytes

Eleas · December 21, 2020, 8:20pm

Thank you. That’s pretty cool! Am I correct in recalling something along the lines of the original source not actually being compacted in this way?

jcompton · December 21, 2020, 8:34pm

No, the “ZILCH abbrev-file” line in the table above represents Infocom’s method.

And the imps also used tricks outside of this system to reuse chunks of text outside of these 96 official abbreviation slots.

zarf · December 21, 2020, 8:38pm

I tried a Python implementation of this. (Not because Python is fast, but because it’s easy to whack code out. I wanted to have an algorithm before I reimplemented it in something quick.)

I’m using this library: https://pypi.org/project/suffix-tree/

I’m not clear about what’s going on, though. The library assigns each node a “path”, but this doesn’t seem to be the concatenation of edges from the root down to the node.

If I compute the cost/count/savings based on the “path”, I think I get the right results. E.g, for the very short test file (five lines):

Your sword has begun to glow very brightly.
Your sword is glowing with a faint blue glow.
Your sword is no longer glowing.
You are in a large circular room whose high ceiling is lost in gloom. Eight identical passages leave the room.
A loud whirring sound comes from all around, and you feel sort of disoriented in here.

I get:

'Your sword ' (count 3, savings 16)
' glow' (count 4, savings 7)
' in ' (count 3, savings 2)
'ing ' (count 3, savings 2)
' room' (count 2, savings 1)

But I’m not actually doing the business of adding up cost or strings from the root down. I’m just using a list that I find in the node. So I’m not sure this is correct.

zarf · December 21, 2020, 8:41pm

gist.github.com

https://gist.github.com/erkyrath/0e09cc8a2bf16c1238fd87d85158ef7f

abbrev.py

# This uses the suffix-tree library (https://pypi.org/project/suffix-tree/).
#   pip3 install suffix-tree

import sys
from suffix_tree import Tree
from suffix_tree.node import Internal, Leaf
from suffix_tree.util import UniqueEndChar

if len(sys.argv) <= 1:
    print('usage: work.py game.txt')

This file has been truncated. show original

heasm66 · December 21, 2020, 10:28pm

If anyone want to dig deeper, or like me, understand suffix trees. I watched a M.I.T. lecture about strings (suffix trees start at about 41 minutes in).

russotto · December 22, 2020, 12:49am

It appears the ‘path’ of a node in the python suffix tree library contains the string leading to that node. The problem with using this rather than adding up costs from the root down is it increases algorithmic complexity; your cost calculation now, instead of running over each character in the set of strings, now runs over each character in the set of possible abbreviations, making the traversal go from O(N) to something larger (probably not O(N^2), but something like O(Nn) where “small n” is the average string size in the set, but I was never great at the theory stuff).

If you approximate cost by string length, you don’t need path at all; I believe you can use ‘node.string_depth()’ or just ‘len(node)’. But since string characters can have a cost of 1,2, or 3 Z-characters you do need the actual characters for a complete implementation. The path for just the edge leading to the current node seems to be node.edge().

You also don’t need len(tree.find_all(abbrev)); that number is node.C if I’m reading the library code right, though you may need to call self.root.compute_C before the traverse. This also is an algorithmic complexity issue.

With more review, it appears the python suffix-tree library may be not very useful. It appears to not be linear in build time (it’s easy to make subtle errors which cause that), and the first 1000 strings in Zork 2 take over 10 seconds for tree-building on my machine. All 1782 take over 70 seconds.

zarf · December 22, 2020, 2:17am

My sense was that 100% of the Python run time was spent on the initial tree construction. The find_all() calls were trivial. But, as I said, I wasn’t doing this for speed – I was trying to get a set of answers to compare future implementations to.

I tried an implementation using the root-down path, but the results were clearly non-optimal. I may have messed up the logic, however.

russotto · December 22, 2020, 3:51am

Here is my code. It requires SeqAn2 for the suffix tree. It expects as input a text file which has all the strings separated by nulls, between the startMark and endMark.

abbr.cpp.txt (9.3 KB)

This version implements a trick to avoid computing costs for all the strings. Instead it uses the fact that the cost must be between 1 and 3 times the length of the string to generate a set of candidates, then computes the true costs for those candidates. It’s faster, probably because of better cache locality. This is serious overengineering for this problem.

abbrapprox.cpp.txt (10.6 KB)

zarf · December 23, 2020, 1:12am

Thanks, that’s helpful.

I understand my confusion earlier. I didn’t understand the way the Python data structure was put together. It makes sense now, and my original code was correct (but not efficient). (My version also treats all characters as equal, ignoring Z-character encoding. But that aside, I’m getting the same results as you.)

russotto · December 23, 2020, 1:35am

89422

z2str.txt (6.0 KB)

This is by preferring the abbreviation with the fewest repeats when the savings is the same.

zarf · December 23, 2020, 1:51am

Hint for people trying to follow along at home… to build Matthew’s code on a Mac, you need to install the seqan library:

brew install brewsci/bio/seqan@2

Then this will work:

c++ -std=c++14 -o abbrev abbrev.cpp