Sorry; I don’t follow the forums but I wanted to share the the C port has moved to a new home at Public Git Repositories - zabbrev.git/summary . At the same time, there’s an update to improve abbreviation generation but it’s only half the matter; the other half depends on updates to the C# version. A patch is attached for the C# version.
patch.txt (6.0 KB)
Both programs need updating to address this because they each relied on an implementation-specific detail of their respective heap (priority queue) data structures to break ties, and those implementation details differ.
ZAbbrev’s algorithm uses a max-heap (priority queue) to pick abbreviations. Patterns are inserted with a priority equal to their savings score. The algorithm repeatedly pops the highest-priority item off the heap. The question is: when two patterns have the exact same savings score, which one comes out first?
The original C# version would return 0 in PatternComparer. That return 0 means: “These two items are equivalent; I have no preference for which comes first.” When the comparator returns 0, the heap is free to return either one. The order you actually get depends on the internal memory layout of the heap - which in turn depends on the exact sequence of insertions, removals, and the heap’s internal array rebalancing algorithm.
.NET’s PriorityQueue uses a specific internal binary heap implementation (a quaternary heap, actually). Given the same sequence of enqueue/dequeue operations, it will always produce the same internal array layout, so the same tie-breaking “winner” comes out each time. But this order is an artifact of .NET’s specific heap implementation - it’s not specified behavior. Microsoft’s documentation makes no guarantee about which equal-priority element comes out first. Even if I were to march .NET’s quaternary heap perfectly in C, a .NET update to the C# version could make them diverge again. I would rather not chase copying undefined behavior.
The C port uses a standard binary max-heap (array-based, with sift-up/sift-down). This is a perfectly correct max-heap but its internal layout differs from .NET’s quaternary heap. When two items have savings, the C heap might put “stone” on top while the .NET heap puts “appear” on top. Both are valid - neither heap is “wrong.”
The problem is that the algorithm itself was underspecified - it didn’t define what to do when savings are equal. Both implementations filled in that gap with “whatever my heap happens to do,” and their heaps happened to do different things.
But that’s not enough to explain why the lists were so different. The reason for that is that once even a single abbreviation differs, it changes the scores of all subsequent abbreviations (because the optimal parse depends on which abbreviations have already been selected), so the divergence cascades through the remaining picks and you end up with a very different list.
The fix is easy: change the priority key from int to (int savings, string key) and add string.CompareOrdinal as the tie-breaker. This makes the output provably deterministic - defined entirely by the data, not by the runtime. It also makes the output reproducible across languages, which is important for the C port.
This change doesn’t affect correctness or quality of the abbreviations. When two patterns have equal savings, it doesn’t matter which one you pick - they’re equally good. The tie-breaker just ensures you always pick the same one, which then avoids the cascading series of differences from even one single abbreviation being different.
The C version already implements this change. According to my testing if the C# version applies this patch then the two programs generate identical abbreviations.