Highly optimized abbreviations computed efficiently

Conveniently, I bookmarked that bit of the Inform 6 source for Dialog purposes. Pretty standard dynamic programming: make an array of “minimum pentets needed to store the text from this point onward” (abbreviations_optimal_parse_scores), make a separate array of “which abbreviation (or none) you should apply at this point to get that minimum value” (abbreviations_optimal_parse_schedule), then start at the end of the text and fill in the arrays character by character.

1 Like

Here’s the relevant part of the ZILF string encoder:

https://foss.heptapod.net/zilf/zilf/-/blob/branch/default/src/Zilf.Common/StringEncoding/StringEncoder.cs#L103-167

3 Likes

This is the relevant part from zabbrev:

                // Generate an array, length of this textline, with lists that of each possible
                // abbreviation from current collection of abbreviations.
                // List is null = no possible abbreviations at this position.
                int abbreviationsCount = abbreviations.Count;                   // Loop hoisting: moving condition outside
                List<int>[] possibleAbbrevsArray = new List<int>[textLength];
                for (int abbrevNo = 0; abbrevNo < abbreviationsCount; abbrevNo++)
                {
                    List<int> abbrevOccurrences = abbreviations[abbrevNo].Occurrences[line];
                    if (abbrevOccurrences is not null)
                    {
                        int abbrevOccurrencesCount = abbrevOccurrences.Count;   // Loop hoisting: moving condition outside
                        for (int abbrevPosition = 0; abbrevPosition < abbrevOccurrencesCount; abbrevPosition++)
                        {
                            possibleAbbrevsArray[abbrevOccurrences[abbrevPosition]] ??= new List<int>(1); // initiate list with 1 slot (most common). Curses max out at 4.
                            possibleAbbrevsArray[abbrevOccurrences[abbrevPosition]].Add(abbrevNo);
                        }
                    }
                }
                // Iterate reverse over this textline and apply the abbreviation for each position that is
                // most effective (optimal parse).
                var minimalCostFromHere = gameTextLine.minimalCostFromHere;   // Use local copy to hopefully avoid boundary check
                var pickedAbbreviations = gameTextLine.pickedAbbreviations;   // Use local copy to hopefully avoid boundary check
                Array.Fill(pickedAbbreviations, -1);                          // -1 for "no abbreviation"
                for (int index = textLength - 1; index >= 0; index--)
                {
                    minimalCostFromHere[index] = minimalCostFromHere[index + 1] + ZcharCost(gameTextLine.textAsCharArray[index]);
                    if (possibleAbbrevsArray[index] is not null)
                    {
                        foreach (int abbrevNo in possibleAbbrevsArray[index])
                        {
                            int costWithPattern = ABBREVIATION_REF_COST + minimalCostFromHere[index + abbreviations[abbrevNo].Length];

                            // If using the abbreviation saves z-chars, then use it
                            if (costWithPattern < minimalCostFromHere[index])
                            {
                                pickedAbbreviations[index] = abbrevNo;
                                minimalCostFromHere[index] = costWithPattern;
                            }

                            // 1. If cost for using the z-char is equal to the use of the abbreviation, then use the z-char.
                            // 2. If two abbreviations equals the same minimal cost, use the one that represents most
                            //    z-chars (usually the longest).
                            // 3. If the two abbreviations also have the same number of z-chars, use the one that's last defined.
                            if (!(pickedAbbreviations[index] == -1) && costWithPattern == minimalCostFromHere[index] && abbreviations[abbrevNo].Cost >= abbreviations[gameTextLine.pickedAbbreviations[index]].Cost)
                            {
                                pickedAbbreviations[index] = abbrevNo;
                                minimalCostFromHere[index] = costWithPattern;
                            }

                        }
                    }
                }
  1. Loop through the string and find every possible abbreviation for each position in the string.
  2. Loop backwards through the string and calculate the cost from the current position to the end of the string for each possible abbreviation at that position. Choose the abbreviation the yields the lowest cost.

All the talk about suffix tree/array is about finding the best abbreviations in a fast way and is not relevant when applying an already choosen set of abbreviations.

4 Likes

Henrik, thank you for that two line summary at the end, that really helps.

My naive approach was just, any time I found one or more abbreviations, pick the longest one, which is just a greedy algorithm. It’s not terrible, and many times will get lucky and find the best answer anyway.

I was thinking I needed to build a suffix tree for every candidate string, and then use that tree to find the optimal parse of abbreviations.

Am I correct that the only time this really matters is when there are overlaps, meaning, more than one abbreviation could be applied to a particular substring?

-Dave

yes

1 Like

And it’s not worth optimizing for the case where there are no overlaps, because code such as yours already avoids extra work?

In other words, if there is at most one abbreviation match per location in the string, and the next abbreviation match is no closer than the length of the current abbreviation, then the walking backwards step won’t have to do anything special because “each possible abbreviation at that position” is at most one.

-Dave

I also realized that the 2D table (really it’s just a hash table with chaining) I came up with to identify which possible abbreviations could start here was already a good chunk of the work.

But instead of immediately applying the longest abbreviation and moving onward, I need to just remember that work for later.

Since it’s just an array access though, step 1 is already done before I’ve started, so I can just skip to step 2.

(Well I still need to make sure the rest of the abbreviation matches to thin the list of potential candidates)

Henrik, a few questions.

What’s the purpose of the last tests (the parts numbered 1/2/3)? It seems like it wants to replace an abbreviation with one that has an equivalent cost? I ended up leaving that logic out of my version. I pick whatever abbreviation has the lowest cost (and is better than the current minimum cost).

About the Wagner 1973 paper – your two line description was more useful to me than the entire five page paper. I’ve been through it several times and still can’t crack it. Is there a particular part of it that describes why you want to scan the string in reverse? Intuitively I see how that can help but I didn’t glean any of that from the paper itself.

(And surprising nobody, switching from a greedy parse to an optimal parse did save a few dozen bytes out of Cloak of Darkness!)

Thanks,

-Dave

That’s just how dynamic programming tends to work: you have the fewest options to consider at the end of the string, so it’s easiest to do that first.

At the last character, for example, there’s only one option: print that character. The second-to-last character has only two options: use a two-character abbreviation, or print that character and proceed to the last one. The third-to-last has three options: three-char abbreviation, two-char abbreviation and proceed to the final char, or print that character and proceed to the second-last. And so on.

And since we started from the end, we can now calculate the proper cost for “at character X, print a Y-character abbreviation and proceed from there”: it’s two pentets plus the smallest cost we found at position X+Y.

2 Likes

Well, there’s no fundamental reason you have to go right-to-left, but if you go left-to-right, your code now has to be able to answer the question “which abbreviations could end at this position?” instead. So rather than indexing your abbreviations in traditional lexicographic order, you’d have to index them by their last character and then loop over them backwards to check whether they fit the string. So you can go either way, but the algorithm is easier to think about (and probably implement) if the operations inside the outer loop go left-to-right, which means the outer loop itself goes right-to-left.

3 Likes

These are only relevant for zabbrev when it is picking candidates for abbreviations and not when applying an already choosen set. For example, it isn’t correct to upscore a candidate when ordinary zchars give the same saving (rule 1).

Zabbrev spends around 98% of execution time in these loops so there are a few optimization efforts here that may make the code harder to read…

1 Like

FWIW if you’re worried about performance of heap allocations inside leaf functions, in C# stackalloc is similar to C alloca and can help in those situations.

For example, in my string encoding code I do two allocas based on the length of the string. One of them is uint32_t’s, and I use that to store up to 4 potential matches (7 bits each) and a three bit “stack count”), and the other is a uint16_t cost array (with one extra element at the end initialized to zero).

Summary
	// If we have abbreviations, do an optimal parse (Wagner, 1973) for them.
	// The paper is cryptic but Henrik Åsman explained it pretty simply:
	// 1. Loop through the string and find every possible abbreviation for each position in the string.
	// 2. Loop backwards through the string and calculate the cost from the current position to the end of the string 
	//    for each possible abbreviation at that position. Choose the abbreviation the yields the lowest cost.
	uint32_t *abbrevs = nullptr;
	if (!forDict && abbreviation_count) {
		int totalCount = 0;
		abbrevs = (uint32_t*) alloca(srcSize * 4);
		for (size_t i=0; i<srcSize-1; i++) {
			abbrevs[i] = 0;
			uint8_t j = abbreviations_lut[src[i]];
			while (j != 255) {
				if (!strncmp(abbreviations[j],src+i,abbreviation_lengths[j]) && (abbrevs[i] & 7) != 4)
					++totalCount, abbrevs[i] = ((abbrevs[i] & ~0x7U) << 7) | (j << 3) | ((abbrevs[i]&7)+1);
				j = abbreviations_next[j];
			}
		}
		abbrevs[srcSize-1] = 0;
		if (totalCount) {
			// Cost array has one extra slot, always zero, to simplify logic.
			uint16_t *costsFromHere = (uint16_t*)alloca((srcSize+1)*2);
			costsFromHere[srcSize] = 0;
			for (size_t i=srcSize; i--;) {
				costsFromHere[i] = s_ZCost[src[i]] + costsFromHere[i+1];
				if (abbrevs[i]) {
					int thisCount = abbrevs[i] & 7, best = -1;
					uint32_t a = abbrevs[i] >> 3;
					while (thisCount--) {
						uint16_t costWithPattern = 2 + costsFromHere[i + abbreviation_lengths[a & 127]];
						if (costWithPattern < costsFromHere[i]) {
							best = a & 127;
							costsFromHere[i] = costWithPattern;
						}
						a>>=7;
					}
					if (best == -1)
						abbrevs[i] = 0;
					else // store length to save extra lookup and ensure value isn't zero
						abbrevs[i] = best | (abbreviation_lengths[best] << 10);
				}
			}
		}
		else
			abbrevs = nullptr;
	}
	for (size_t i=0; i<srcSize && (!destSize || offset < destSize);) {
		if (abbrevs && abbrevs[i]) {
			storeCode(((abbrevs[i] >> 5)+1) & 31);
			storeCode(abbrevs[i] & 31);
			i += abbrevs[i]>>10;
		}
		else {
			uint8_t code = s_EncodedCharacters[src[i]];
			if (code == 255) {
				storeCode(5);
				storeCode(6);
				storeCode(src[i]>>5);
				storeCode(src[i]&31);
			}
			else {
				if (code > 31)
					storeCode(code >> 5);
				storeCode(code & 31);
			}
			++i;
		}
	}
1 Like