[I6] Question about string literals and their storage

I thought I remembered reading that if the same string literal appears in two places in an I6 program, both occurrences would evaluate to the same value: the address where the Z-encoded string is held in ROM. Is that right? I’m asking because the output of this code suggests otherwise:

Global abc_str = "abc";

[ Main x ;
    print "abc_str = ", abc_str, " / (string) abc_str = ", (string) abc_str, "^";
    x = abc_str;
    print "x = ", x, " / (string) x = ", (string) x, "^";
    print "abc literal = ", (string) "abc", "^";
    x = "abc";
    print "new x = ", x, " / (string) new x = ", (string) x, "^";
];

This produces (with comments added after the fact):

abc_str = 769 / (string) abc_str = abc ! 769 is memory address for Z-encoded "abc"
x = 769 / (string) x = abc ! x holds same address
abc literal = abc
new x = 771 / (string) new x = abc ! 771 is memory address for second and separate Z-encoded "abc"?
[Hit any key to exit.]

Am I misinterpreting what the output means? If so, why is new x different than x in the output?

I’m pretty sure packed strings are not deduplicated in Inform 6. If you want both uses of the string to point to the same memory, you need to define a constant for it.
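A minimal sketch of the constant approach, using the same bare-Main setup as the original post. The name ABC_STR is made up for illustration.

```inform6
! ABC_STR is a hypothetical name. The literal is written once, so only
! one copy of the string is compiled.
Constant ABC_STR = "abc";

[ Main x y;
    x = ABC_STR;          ! packed address of the single "abc"
    y = ABC_STR;          ! same constant, so the same address
    if (x == y) print "x and y point at the same string^";
    else        print "x and y point at different strings^";
    print (string) x, " / ", (string) y, "^";
];
```

Since both assignments load the same constant, the comparison should take the first branch. Writing "abc" out a second time anywhere else in the source would still produce a separate copy.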

Jesse is right.

Additionally, default string values which aren’t needed are still stored in memory.

Well, that solves that mystery. (And thank you both!) Is it just not worth it to do this automatically because it doesn’t save enough memory?

I did a quick experiment, grabbing a couple Z-machine games compiled by Inform 6, disassembling them with TXD, and checking for duplicate strings:

  • 129 KB Z5:
    • Out of 1549 strings, 203 (13.1%) are extra copies.
    • Removing them would save 1759 characters or ~1160 bytes (0.88% of file size).
  • 279 KB Z8:
    • Out of 2795 strings, 215 (7.6%) are extra copies.
    • Removing them would save 7844 characters or ~5177 bytes (1.82% of file size).

I get the impression that ZILCH and ZILF do some elimination of duplicate strings, e.g. this:

      (NORTH SORRY "There's no exit on that side of the park.")
      (EAST SORRY "There's no exit on that side of the park.")
      (SOUTH SORRY "There's no exit on that side of the park.")
      (WEST SORRY "There's no exit on that side of the park.")
      (NW SORRY "There's no exit on that side of the park.")

seems to create only one copy of the “There’s no exit …” string. But when the same string later appears in a TELL, the compiler seems to have created a separate copy for it.

The Witness seems to have been written with the assumption that duplicate strings would be eliminated, because there is code like:

"Out of the blue, " "two" " of the coroner's men run"
" in to the office with a stretcher and carry Linder's body out. "
"One of them shouts to you, \"">)
		     (T <TELL
"Out of the blue, " "one" " of the coroner's men run"
"s up to you and says, \"We just removed the body. ">)>

But I don’t think it did…?

@vaporware, that’s actually higher memory savings than I would have guessed. Out of curiosity, on which games did you do the tests?

Some parts of the Inform library rely on the compiler not deduplicating strings. In particular, the list_together property groups items based on the identity of its string value. You can define two classes which group separately with the same list_together string. If the compiler merged that string, it would change the meaning of the source code.
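To make the identity point concrete, here is a stripped-down sketch that doesn't include the library. It declares list_together by hand, and the equality test in Main only stands in for the library's grouping check (an assumption about the mechanism, based on the description above). The object names and the FOOD_LT constant are invented.

```inform6
! Sketch only: no library is included, so list_together is declared by
! hand and the comparisons below stand in for the library's grouping test.
Property list_together;

Constant FOOD_LT = "foodstuffs";                 ! hypothetical shared constant

Object fish   with list_together FOOD_LT;
Object onion  with list_together FOOD_LT;
Object cake   with list_together "foodstuffs";   ! separate literal
Object shrimp with list_together "foodstuffs";   ! another separate literal

[ Main;
    ! Same constant, same address: these two would be grouped together.
    if (fish.list_together == onion.list_together)
        print "fish and onion share a group^";
    ! Two literals, two addresses: same text, but different groups.
    if (cake.list_together ~= shrimp.list_together)
        print "cake and shrimp fall into separate groups^";
];
```

With the current compiler, both conditions should hold. If the compiler merged identical strings, the second comparison would fail and the cake and shrimp would be grouped together.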

So, when this has come up in the past, we’ve said that string deduplication was an unacceptable change to the language semantics. Even though it’s just a corner of the I6 library.

Now, it would be possible to do string merging in a more limited way: only apply it to strings in print statements (whose addresses are not exposed). I don’t think this idea has come up before. Moderate headache to implement, mind you.

@zarf, thank you for pointing to the wider context here – it’s illuminating, as always. I wouldn’t have guessed that there would be a dependency of that type. On review, though, I see that the discussion on DM4 p. 203 [https://inform-fiction.org/manual/html/s27.html] explains that, to get around this aspect of string definition when using list_together, it’s necessary to use a numeric or string constant.

Aside from backwards compatibility (which is important), is there any reason why it wouldn’t be desirable to group objects of different classes that use the same list_together string (meaning the contents of the string literal, not its memory address)? Interestingly, the discussion of the example solution for DM4 Ex 69 [https://inform-fiction.org/manual/html/sa6.html#ans69] notes that it takes special pains to avoid giving list_together an identical value in both classes, by defining inline routines for the GoldCoin and SilverCoin classes.

Perhaps string deduplication (if pursued) could be a compiler option, with this impact on list_together pointed out in the usage notes? (And perhaps also the impact on the evaluation of equality tests between strings?) Would doing it that way simplify a potential implementation?

The tests were on Dr Ego and the egg of Man-Toomba (release 3 / serial 400410) and Lost Pig (release 2 / serial 080406).

That would be a breaking change, but it’d be a change for the better, IMO.

The DM4 cautions that a list_together string “must either be given in a class definition or else as a constant … the actual text should only be written out in one place in the source code”, because if you write it out more than once, you’ll get inventory listings with identically-named groups:

You are carrying:
    a porcelain bird (bronzed)
    three foodstuffs:
        a scarlet fish
        some lembas water
        an onion
    three foodstuffs:
        a handful of wedding cake
        a shrimp cocktail
        a bottle of shrimp mix (mostly full)

In the rare case where an author wants that to happen, they can get the same effect by setting list_together to a (duplicated) routine that prints the string.


That’s because the Z-machine has two different types of strings: “referenced strings” are stored in their own section of ROM and referenced by address, and “immediate strings” are stored within the code of the routine itself. In other words, <TELL "a string literal"> assembles to a @print instruction, followed by several bytes of literal text before the next instruction.

So as I understand it, ZIL deduplicates referenced strings (any strings which are treated as values, assigned to variables, etc) but doesn’t deduplicate immediate strings (literal text embedded in a <TELL>).

(Now, I don’t think there’s actually any theoretical reason why referenced strings need to be in their own section of ROM; that’s just the easiest way for the compiler to work. But if everything is aligned in the right way, I believe it’s possible to take a pointer into the middle of a routine, pointing at an immediate string, and pass it around as a reference. Why anyone would ever actually want to do this is left as an exercise for the reader.)

Oh, heh. When I said Inform could deduplicate “only strings in print statements”, I forgot that the Z-machine compiled them inline in the routine! That makes it harder.

(Glulx has no immediate strings.)

Inform compiles immediate strings in print statements using @print (PRINTI), but only below a certain size (32 characters). Beyond that, it uses @print_paddr (PRINT), which indeed creates a string that theoretically could be deduplicated and whose address is never exposed.
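In I6 source terms, the two kinds of string look roughly like this. Which opcode the compiler picks for the first print is taken from the description above and not checked against a disassembly. The texts are arbitrary.

```inform6
[ Main s;
    ! Short literal in a print statement: per the description above, expected
    ! to be assembled inline with @print (not verified by disassembly).
    print "Short text.^";

    ! Literal used as a value: stored as a packed string, and s holds
    ! its packed address.
    s = "A string used as a value.^";
    print (string) s;      ! printed through the packed address (@print_paddr)
];
```

Only the second form exposes an address to the program, which is why merging copies of the first kind would not be visible to source code.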

Alignment would be a problem – only a random subset of inline strings could be addressed that way – but there’s also the offset to think about. In V6 (and V7), strings and routines are in a different address space, so routines would have to be stored after strings in memory, and strings inside routines toward the end of the routine space might be unaddressable.

Yeah, this would require a lot of finagling. For example, inserting @nops as needed so that @print instructions come right before appropriate alignment boundaries.

Compared to that, keeping the address spaces lined up seems fairly trivial. If I remember right, that’s what Inform does by default when compiling to Z6: it sets the string and routine address offsets to the same value, as long as there’s sufficient space, to make pointer manipulations simpler.

Supposing that one did want to de-duplicate strings but avoid the thorns of packed address alignment – would it be possible to make use of @print_addr to target them? As far as I can tell, that opcode will happily print other packed text besides dictionary word entries, and it can specify individual byte addresses as the starting point.

The implicit suggestion would be to create a zone in static memory to hold short strings, and for the compiler to use the @print_addr opcode instead of the @print opcode (by default or by switch). If I’m not mistaken, such an area would be similar in nature to the static arrays area that was added for 6.34.
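For reference, I6 source can already reach @print_addr through print (address), which is normally used for dictionary words. A minimal Z-code-only sketch follows; the word is arbitrary, and it does not show the proposed short-string zone, which is hypothetical.

```inform6
[ Main w;
    w = 'lantern';            ! byte address of the dictionary entry
    print (address) w, "^";   ! prints the Z-encoded text at that byte address
];
```

The operand is an ordinary byte address, so no packed-address alignment is involved. The catch is that the text has to sit in memory that byte addresses can reach.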

If you’re going to move the strings out of the routine, you may as well just make them packed strings and use @print_paddr. I think the idea was to reuse the string constants that had already been compiled into routines with @print.