Infocom game word counts

Somebody asked about this on Twitter, so I wrote a quick script to count words.

This counts output words only – printable strings found in the string segment of memory and embedded in the code segment. It does not count game vocabulary (words typed by the player). I used txd and wc -w.

amfv: 38148
ballyhoo: 18720
beyond-zork: 30071
border-zone: 33749
bureaucracy: 34473
cutthroats: 16540
deadline: 19382
enchanter: 20742
hhgg: 18965
hollywood-hijinx: 16947
infidel: 16620
lgop: 25078
lurking-horror: 23165
moonmist: 15544
nord-and-bert: 25708
planetfall: 23198
plundered-hearts: 20748
seastalker: 16558
sherlock: 31294
sorcerer: 17193
spellbreaker: 21762
starcross: 14637
stationfall: 20803
suspect: 20431
suspended: 16811
trinity: 31781
wishbringer: 18529
witness: 17293
zork-1: 14214
zork-2: 15760
zork-3: 14360

How do those numbers compare to some of the top free games made since the ’90s?

Sorted by word count:

zork-1 14214
zork-3 14360
starcross 14637
moonmist 15544
zork-2 15760
cutthroats 16540
seastalker 16558
infidel 16620
suspended 16811
hollywood-hijinx 16947
sorcerer 17193
witness 17293
wishbringer 18529
ballyhoo 18720
hhgg 18965
deadline 19382
suspect 20431
enchanter 20742
plundered-hearts 20748
stationfall 20803
spellbreaker 21762
lurking-horror 23165
planetfall 23198
lgop 25078
nord-and-bert 25708
beyond-zork 30071
sherlock 31294
trinity 31781
border-zone 33749
bureaucracy 34473
amfv 38148
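For anyone who wants to reproduce the sorted list from the "name: count" lines, here is a minimal Python sketch. The function names `parse_counts` and `sorted_by_count` are made up for illustration, and the sample holds just three entries from the list.

```python
def parse_counts(text):
    """Parse lines of the form 'name: count' into a dict."""
    counts = {}
    for line in text.splitlines():
        name, sep, value = line.partition(":")
        if sep and value.strip().isdigit():
            counts[name.strip()] = int(value)
    return counts


def sorted_by_count(counts):
    """Return (name, count) pairs in ascending order of count."""
    return sorted(counts.items(), key=lambda item: item[1])


# Three entries copied from the list, just to show the shape of the input.
sample = """zork-1: 14214
trinity: 31781
starcross: 14637"""

counts = parse_counts(sample)
for name, count in sorted_by_count(counts):
    print(name, count)
print("total:", sum(counts.values()))  # total: 60632
```

Feeding it the full list gives the ordering shown, and the same `sum` call gives the overall total.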

Blue Lacuna has 365,000 words, including Inform 7 source code. So maybe around 250,000 words of prose?

Heliopause is about 7000 words; Hadean Lands about 75000. That’s printable text from my code, excluding vocabulary and the Standard Rules. (So not directly comparable to the above.)

(Note that a word count of the HL source code comes to 240000. So only about 30% of it is printable text.)

That adds up to only around 700K words in total, which seems pretty small to me, considering the average novel is a little over 100K words. (I think our published CoG-label games add up to a little over 5M words of source.)

This measures quantity of text in the source code. That’s only vaguely correlated with “length of a typical playthrough as a player would see it.”

The scripts for Blood and Laurels run to more than 150K words; these contain dialogue and scene transitions plus some markup, but they’re not mostly code, and do not include library text. A single playthrough might be more like 8000-10000 words.

I’d be interested to check this on my own works, but I didn’t quite follow the explanation of how this works. Is the script available anywhere?

It was a cheap hack. The command line looked like:

txd -w 0 zork-1.dat | perl -ne 'if (/"(.*)"/) { print $1, "\n"; }' | wc -w

txd is an old Z-machine decompiler tool – compile the ztools package from ifarchive.org/indexes/if-archive … tools.html .
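If Perl is not handy, the same filter-and-count step can be sketched in Python. This is an approximation of the one-liner above, not the script actually used; the file name `count_quoted.py` and the function name are hypothetical. It keeps the same greedy pattern, so on a line with several quoted strings it captures everything from the first double quote to the last.

```python
import re
import sys

# Same pattern as the Perl snippet: greedy, so it captures from the first
# double quote on a line to the last one.
QUOTED = re.compile(r'"(.*)"')


def count_quoted_words(lines):
    """Count whitespace-separated words inside quoted text, line by line."""
    total = 0
    for line in lines:
        match = QUOTED.search(line)
        if match:
            total += len(match.group(1).split())
    return total


if __name__ == "__main__":
    # e.g.  txd -w 0 zork-1.dat | python count_quoted.py
    print(count_quoted_words(sys.stdin))
```

Splitting on whitespace is meant to mirror what `wc -w` does with the Perl output.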

This will not work for Glulx, obviously. I don’t remember which Glulx decompiler is currently good. For counting HL and Heliopause, I used a different script which is too embarrassing to describe.

Now that I think about this, what I did with the two Textfyre games was to take full transcripts and do a lexical analysis. Seems like that’s a better gauge of word count, no?
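A transcript count could look roughly like the sketch below. It is not the analysis that was run on the Textfyre games; it assumes player commands sit on their own lines starting with ">" and counts everything else as game output. The function name and the sample transcript are invented.

```python
def count_transcript_words(transcript, prompt=">"):
    """Count words of game output in a transcript, skipping player commands.

    Assumes each player command sits on its own line starting with `prompt`.
    """
    total = 0
    for line in transcript.splitlines():
        if line.lstrip().startswith(prompt):
            continue  # Player input, not author-written text.
        total += len(line.split())
    return total


# Invented transcript, for illustration only.
sample = """The Cellar
It is dark and damp here.
> go north
You walk into the hallway."""

print(count_transcript_words(sample))  # 13
```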

Seeing the depth of implementation is also interesting, though. You don’t see many words on any individual playthrough of Aisle, but there’s a lot of writing in the game. Maybe categorize them as the “run-through word count” and “full word count”?

The minimal run-through, the run-through that the author expects (with smart exploration), or the run-through that the player really does (with stupid mistakes and repetitive wandering around)? It’s extremely subjective.

I’m interested in how much work the author did, anyhow.

Our games do a random run and measure median/average, which usually tells us what we want to know. (This works because we don’t usually include puzzles.)
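The random-run idea can be sketched as follows. This is only an illustration of the approach, not the actual CoG tooling: the `STORY` graph, its word counts, and the function names are all hypothetical, and the graph is assumed to have no loops so that every run ends.

```python
import random
import statistics

# Hypothetical branching story: node -> (words shown, possible next nodes).
STORY = {
    "start": (120, ["left", "right"]),
    "left": (300, ["end"]),
    "right": (80, ["detour", "end"]),
    "detour": (450, ["end"]),
    "end": (60, []),
}


def random_run(story, start="start", rng=random):
    """Follow random choices from start until a node with no exits."""
    node, words = start, 0
    while True:
        shown, exits = story[node]
        words += shown
        if not exits:
            return words
        node = rng.choice(exits)


def run_stats(story, runs=1000, seed=0):
    """Return (mean, median) words seen over many random runs."""
    rng = random.Random(seed)
    totals = [random_run(story, rng=rng) for _ in range(runs)]
    return statistics.mean(totals), statistics.median(totals)


mean_words, median_words = run_stats(STORY)
total_words = sum(words for words, _ in STORY.values())
print(f"source: {total_words}, mean run: {mean_words:.0f}, median run: {median_words:.0f}")
```

A game with puzzles would need something smarter than uniform random choices, which is why this only works for choice-based stories.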

The average word count of the source of a CoG-label game is 146K and the median is 122K. The average run-through is 29K and the median run-through is 24K.

Players thus usually see about 20% of the game on a given run through.
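The 20% figure follows directly from the numbers quoted above; a quick check (variable names are just for illustration):

```python
mean_source, median_source = 146_000, 122_000
mean_run, median_run = 29_000, 24_000

print(f"mean:   {mean_run / mean_source:.1%}")      # 19.9%
print(f"median: {median_run / median_source:.1%}")  # 19.7%
```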

Oh, okay. Not so helpful for me then - essentially everything I do is in Glulx. Thanks anyway!

Scroll Thief is currently at 27666 words, by a rough count. (This doesn’t include text from any extensions, including the Standard Rules, and treats text substitutions as blanks. So it’s the number of words I’ve actually written directly, in quoted text, in my source.)

[rant=How I measured this]I used a pair of Python scripts I had lying around. It’s kind of messy but gets the job done.

The first one strips everything within brackets, including comments and text substitutions, and can deal with Inform’s nested comments properly (which my regex approach could not).

import sys

ifname = sys.argv[1]
ofname = sys.argv[2]

comment = 0

with open(ifname, 'r', encoding='utf-8') as infile:
	with open(ofname, 'w', encoding='utf-8') as outfile:
		for inline in infile:
			outline = ""
			# Drop the newline here; it is added back when the line is written.
			for c in inline.rstrip('\n'):
				if c == '[':
					comment += 1
				elif c == ']':
					comment = max(comment - 1, 0) # A stray ']' must not push the depth negative.
				elif comment == 0: # Only add the character if we're not within a comment.
					outline += c
			if outline != '':
				outfile.write(outline + '\n')

The second one removes anything that’s not literal quoted text.

import sys

ifname = sys.argv[1]
ofname = sys.argv[2]

quote = False

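with open(ifname, 'r', encoding='utf-8') as infile:
	with open(ofname, 'w', encoding='utf-8') as outfile:
		for inline in infile:
			outline = ""
			for c in inline:
				if c == '"':
					quote = not quote # Toggle on every double quote; the state carries across lines.
					if quote:
						outline += ' ' # Separate this string from the previous one.
				elif quote and c != '\n': # Keep only characters inside quotes.
					outline += c
			if outline != '':
				outfile.write(outline + '\n')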

I took the source.txt from my Release folder and ran it through both of these, then ran the output through a word counter.[/rant]
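For reference, the two passes and the final word count can be folded into a single function. This is a sketch of the same idea rather than the scripts as posted; the function name `quoted_words` and the sample source line are invented. Like the two-pass version, it drops bracketed text first, so a text substitution leaves a gap rather than a word.

```python
def quoted_words(source):
    """Count words in double-quoted text, ignoring anything in [brackets]."""
    depth = 0         # Bracket nesting depth (comments and substitutions).
    in_quote = False
    kept = []
    for c in source:
        if c == '[':
            depth += 1
        elif c == ']':
            depth = max(depth - 1, 0)  # Ignore a stray closing bracket.
        elif depth > 0:
            continue  # Inside brackets: skip everything, quotes included.
        elif c == '"':
            in_quote = not in_quote
            kept.append(' ')  # Keep adjacent strings from running together.
        elif in_quote:
            kept.append(c)
    return len(''.join(kept).split())


# Invented Inform-style line: one quoted string with a substitution,
# followed by a comment that itself contains a quoted word.
sample = 'The Lab is a room. "You see [the noun] here." [A "note" to self.]'
print(quoted_words(sample))  # 3
```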

There’s a Glulx decompiler online at http://toastball.net/glulx-strings/. It works for Z-machine and TADS too. You can copy-paste the output to a word processor to count the words or save it as a text file and run zarf’s Perl snippet on it.

Ah-ha! Thank you both.