ZAbbrev has been ported from C# to C

I tried to get zabbrev working locally - for some reason, autoreconf doesn’t work correctly on my Mac, so I punted and created a dummy config.h and compiled it by hand.

It seems to work, but the input format of the text isn’t obvious to me. The Inform version seems to have letter prefixes on every line. My source language isn’t Inform or Zap.

Is there a version of input that is basically just raw text?

Thanks,

-Dave

EDIT: Just noticed the documentation at the top of gametext.txt, derp:

I: [I:info, G:game text, V:veneer text, L:lowmem string, A:abbreviation, D:dict word, O:object name, S:symbol, X:infix]
I: [H:game text inline in opcode, W:veneer text inline in opcode]

Well, it ran, but I can’t figure out where it put the output? It seems like it modifies the input gametext file?

Okay, it looks like you’re supposed to redirect the output. I tried ZAP (format 2) and the example earlier in this thread seemed a bit garbled near the end.

Is this the C version or the C#? Because I don’t think the C version is fully functional (It’s not the one I wrote so I won’t try to offer any support).

It’s the C version, which is why I posted in this thread, hoping the author would see this.

The output is usable for my purposes. My compiler produces something close enough to Inform-style game text that can serve as a usable input, and I modified my input routines to deal with the

Abbreviate “Your example” ! stats

output format it generates.

Thanks,

-Dave

Sorry; I don’t follow the forums but I wanted to share the the C port has moved to a new home at Public Git Repositories - zabbrev.git/summary . At the same time, there’s an update to improve abbreviation generation but it’s only half the matter; the other half depends on updates to the C# version. A patch is attached for the C# version.

patch.txt (6.0 KB)

Both programs need updating to address this because they each relied on an implementation-specific detail of their respective heap (priority queue) data structures to break ties, and those implementation details differ.

ZAbbrev’s algorithm uses a max-heap (priority queue) to pick abbreviations. Patterns are inserted with a priority equal to their savings score. The algorithm repeatedly pops the highest-priority item off the heap. The question is: when two patterns have the exact same savings score, which one comes out first?

The original C# version would return 0 in PatternComparer. That return 0 means: “These two items are equivalent; I have no preference for which comes first.” When the comparator returns 0, the heap is free to return either one. The order you actually get depends on the internal memory layout of the heap - which in turn depends on the exact sequence of insertions, removals, and the heap’s internal array rebalancing algorithm.

.NET’s PriorityQueue uses a specific internal binary heap implementation (a quaternary heap, actually). Given the same sequence of enqueue/dequeue operations, it will always produce the same internal array layout, so the same tie-breaking “winner” comes out each time. But this order is an artifact of .NET’s specific heap implementation - it’s not specified behavior. Microsoft’s documentation makes no guarantee about which equal-priority element comes out first. Even if I were to march .NET’s quaternary heap perfectly in C, a .NET update to the C# version could make them diverge again. I would rather not chase copying undefined behavior.

The C port uses a standard binary max-heap (array-based, with sift-up/sift-down). This is a perfectly correct max-heap but its internal layout differs from .NET’s quaternary heap. When two items have savings, the C heap might put “stone” on top while the .NET heap puts “appear” on top. Both are valid - neither heap is “wrong.”

The problem is that the algorithm itself was underspecified - it didn’t define what to do when savings are equal. Both implementations filled in that gap with “whatever my heap happens to do,” and their heaps happened to do different things.

But that’s not enough to explain why the lists were so different. The reason for that is that once even a single abbreviation differs, it changes the scores of all subsequent abbreviations (because the optimal parse depends on which abbreviations have already been selected), so the divergence cascades through the remaining picks and you end up with a very different list.

The fix is easy: change the priority key from int to (int savings, string key) and add string.CompareOrdinal as the tie-breaker. This makes the output provably deterministic - defined entirely by the data, not by the runtime. It also makes the output reproducible across languages, which is important for the C port.

This change doesn’t affect correctness or quality of the abbreviations. When two patterns have equal savings, it doesn’t matter which one you pick - they’re equally good. The tie-breaker just ensures you always pick the same one, which then avoids the cascading series of differences from even one single abbreviation being different.

The C version already implements this change. According to my testing if the C# version applies this patch then the two programs generate identical abbreviations.

4 Likes

Seems reasonable to make it deterministic at this pretty low level. I’ll put it in the todo for next version.

2 Likes

Just for anyone curious about the performance difference between C# and C version:

For my game (Inform 6, source 427k, z5 220k), on a MacBook Pro M2 Max using -x3 optimizing:

  • C# original: 10 min 45sec
  • C: 8min 31sec

It’s always impressive how fast modern VMs have become!

Thanks to both of the authors!

3 Likes

As a side note I think it’s pretty amazing that often the AOT (Ahead-of-Time, in practice compiled to target machine) version is slower than the version running IL. As I understand it, it is because the JIT-compiler identifies bottlenecks during runtime and applies optimization to them.

3 Likes

This is complete speculation, but would it be possible for the two choices to have different overall results because choosing ‘stone’ allows subsequent abbreviations {set-N}, while choosing ‘appear’ allows subsequent abbreviations {set-P}, and one of them is more efficient than the other? Or is this formally impossible?

If it is possible, the optimal choice for tied abbreviations would be to try both, and see which is better in the long run.

1 Like

Short answer, yes.

The problem does not stop there though. Unfortunately there is nothing that says that always beeing greedy produces the absolut best result in the end either, but greedy is on average good.

I have not found any better without testing all available combinations, an impossibility.

After the first result is found, zabbrev tries a few other combinations (these are the most time consuming phases) and often finds a few extra bytes. But I have not found the “optimal” way to do this. This and to properly minimize the wasted space at the end of each strings because word alignment are open for optimization for the cunning.

All that said, I can’t see any harm in breaking ties in a deterministic way.

2 Likes

I believe picking the optimal abbreviations is NP-hard (a variation of the knapsack problem), so heuristics like this are the best that can be done in practice.

4 Likes