Somebody asked this on Twitter. I did a quick script to count words.
This counts output words only – printable strings found in the string segment of memory and embedded in the code segment. It does not count game vocabulary (words typed by the player). I used txd and wc -w.
Heliopause is about 7000 words; Hadean Lands about 75000. That’s printable text from my code, excluding vocabulary and the Standard Rules. (So not directly comparable to the above.)
(Note that a word count of the HL source code is 240000, so only about 30% of it is printable text.)
That adds up to only around 700K words in total, which seems pretty small to me, considering the average novel is a little over 100K words. (I think our published CoG-label games add up to a little over 5M words of source.)
The scripts for Blood and Laurels run to >150K words; these contain dialogue and scene transitions plus some markup, but they’re not mostly code, and do not include library text. A single playthrough might be more like 8000-10000 words.
This will not work for Glulx, obviously. I don’t remember which Glulx decompiler is currently good. For counting HL and Heliopause, I used a different script which is too embarrassing to describe.
Now that I think about this, what I did with the two Textfyre games was to take full transcripts and do a lexical analysis. Seems like that’s a better gauge of word count, no?
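For anyone who wants to try the transcript approach, here is a rough Python sketch. It assumes player commands are the lines starting with `>` (the usual prompt convention, but an assumption here) and skips them, so only game output is counted. The function name and the sample transcript are made up for illustration; this is not the analysis that was actually run on the Textfyre games.

```python
import re

def count_transcript_words(transcript: str, prompt: str = ">") -> int:
    """Count words of game output in a transcript, skipping command lines."""
    total = 0
    for line in transcript.splitlines():
        # Assumption: a line starting with the prompt is player input.
        if line.lstrip().startswith(prompt):
            continue
        # A "word" here is a run of letters, digits, or apostrophes.
        total += len(re.findall(r"[A-Za-z0-9']+", line))
    return total

# Invented sample transcript, not taken from any real game.
sample = """Lobby
You are in a small lobby.

> open door
The door swings open.
"""
print(count_transcript_words(sample))
```

Whether the room name and status-line text should count is a judgment call; this sketch counts everything that isn't a command.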
Seeing the depth of implementation is also interesting, though. You don’t see many words on any individual playthrough of Aisle, but there’s a lot of writing in the game. Maybe categorize them as the “run-through word count” and “full word count”?
Which run-through, though? The minimal one, the one that the author expects (with smart exploration), or the one that the player really does (with stupid mistakes and repetitive wandering around)? It’s extremely subjective.
I’m interested in how much work the author did, anyhow.
Our games do a random run and measure median/average, which usually tells us what we want to know. (This works because we don’t usually include puzzles.)
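To make the random-run idea concrete, here is a minimal Python sketch. The game graph, its node names, and its word counts are all invented, and picking uniformly at each branch is an assumption; the real tooling may weight choices differently. It only works this simply because there are no puzzles to get stuck on.

```python
import random
import statistics

# Hypothetical branching game: each node maps to (word count, next nodes).
# An empty list of next nodes marks an ending.
GAME = {
    "start": (120, ["left", "right"]),
    "left": (300, ["end"]),
    "right": (80, ["detour", "end"]),
    "detour": (450, ["end"]),
    "end": (60, []),
}

def random_run(game, start="start", rng=random):
    """Play once, choosing uniformly at each branch; return words seen."""
    node, words = start, 0
    while True:
        count, choices = game[node]
        words += count
        if not choices:
            return words
        node = rng.choice(choices)

def run_stats(game, runs=1000, seed=0):
    """Return (mean, median) words seen over many random runs."""
    rng = random.Random(seed)
    totals = [random_run(game, rng=rng) for _ in range(runs)]
    return statistics.mean(totals), statistics.median(totals)

mean_words, median_words = run_stats(GAME)
print(f"mean {mean_words:.0f}, median {median_words:.0f}")
```

Comparing the mean and the median gives a quick hint about skew: a few very long branches pull the mean up without moving the median much.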
The average word count of the source of a CoG-label game is 146K and the median is 122K. The average run-through is 29K and the median run-through is 24K.
Players thus usually see about 20% of the game on a given run-through.
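The 20% figure is just the run-through count divided by the source count. A quick check in Python, using only the numbers quoted above (variable names are mine):

```python
# Figures quoted above, in thousands of words.
source_mean, source_median = 146, 122
run_mean, run_median = 29, 24

mean_share = run_mean / source_mean        # about 0.199
median_share = run_median / source_median  # about 0.197

print(f"mean:   {mean_share:.1%}")
print(f"median: {median_share:.1%}")
```

Both ratios land just under 20%, so the estimate holds whichever average you pick.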
Scroll Thief is currently at 27666 words, by a rough count. (This doesn’t include text from any extensions, including the Standard Rules, and treats text substitutions as blanks. So it’s the number of words I’ve actually written directly, in quoted text, in my source.)
[rant=How I measured this]I used a pair of Python scripts I had lying around. They’re kind of messy but get the job done.
The first one strips everything within brackets, including comments and text substitutions, and can deal with Inform’s nested comments properly (which my regex approach could not).
import sys

ifname = sys.argv[1]
ofname = sys.argv[2]
comment = 0  # Current bracket nesting depth; carries across lines.
with open(ifname, 'r', encoding='utf-8') as infile:
    with open(ofname, 'w', encoding='utf-8') as outfile:
        for inline in infile:
            outline = ""
            # Drop the trailing newline so fully bracketed lines come out empty.
            for c in inline.rstrip('\n'):
                if c == '[':
                    comment += 1
                elif c == ']':
                    comment -= 1
                elif comment == 0:  # Only add the character if we're not within a comment.
                    outline += c
            if outline != '':
                outfile.write(outline + '\n')
The second one removes anything that’s not literal quoted text.
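import sys

ifname = sys.argv[1]
ofname = sys.argv[2]
quote = False  # Whether we're inside a quoted string; carries across lines.
with open(ifname, 'r', encoding='utf-8') as infile:
    with open(ofname, 'w', encoding='utf-8') as outfile:
        for inline in infile:
            outline = ""
            # Drop the trailing newline so lines with no quoted text come out empty.
            for c in inline.rstrip('\n'):
                if c == '"':
                    quote = not quote
                    if quote:
                        # Opening quote: add a space so adjacent strings don't run together.
                        outline += ' '
                elif quote:
                    outline += c
            if outline != '':
                outfile.write(outline + '\n')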
I took the source.txt from my Release folder and ran it through both of these, then ran the output through a word counter.[/rant]
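For reference, the whole pipeline (strip bracketed text, keep quoted text, count words) can be folded into one pass. This is a sketch of the same idea, not the scripts I actually ran: the function name and the sample source are made up, and unlike the first script it clamps the bracket depth at zero so a stray `]` can't swallow the rest of the file.

```python
def quoted_word_count(source: str) -> int:
    """Count words inside double quotes, ignoring anything in [brackets]."""
    depth = 0         # Nesting depth of [ ... ]; comments can nest.
    in_quote = False  # Whether we're inside a quoted string.
    kept = []
    for c in source:
        if c == '[':
            depth += 1
        elif c == ']':
            depth = max(depth - 1, 0)  # Tolerate a stray closing bracket.
        elif depth > 0:
            continue  # Inside a comment or text substitution: drop it.
        elif c == '"':
            in_quote = not in_quote
            kept.append(' ')  # Keep neighbouring strings apart.
        elif in_quote:
            kept.append(c)
    return len(''.join(kept).split())

# Invented sample in the style of Inform 7 source.
sample = '''[A comment [with a nested one] here.]
The Lab is a room. "A cluttered lab. [if unvisited]Dust everywhere.[end if]"
Instead of jumping: say "You hop in place."
'''
print(quoted_word_count(sample))
```

As with the two-script version, a substitution in the middle of a word simply vanishes, so the count is rough rather than exact.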
There’s a Glulx decompiler online at http://toastball.net/glulx-strings/. It works for Z-machine and TADS too. You can copy-paste the output to a word processor to count the words or save it as a text file and run zarf’s Perl snippet on it.
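If you'd rather not paste into a word processor, a few lines of Python will count the words in the saved text file. This is a stand-in along the lines of `wc -w`, not the Perl snippet mentioned above; the function name is my own, and the file is assumed to be UTF-8.

```python
def count_words(path: str) -> int:
    """Count whitespace-separated words in a text file, like `wc -w`."""
    total = 0
    # errors="replace" keeps the count going if the dump contains odd bytes.
    with open(path, encoding="utf-8", errors="replace") as f:
        for line in f:
            total += len(line.split())
    return total

# Usage (the filename is just an example):
#     print(count_words("strings.txt"))
```

Keep in mind the dump may include strings that never reach the player, so treat the result as an upper bound.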
I thought about redoing the word counts based on the ZIL source code, now that we have it. But it turns out ZIL uses double-quoted strings for a bunch of reasons and it’s not trivial to sort out which ones are printable text.