On Quantifying Puzzle Difficulty

As I read over feedback from my entry in this year’s IF Comp, I was struck by the disparity in the “difficulty” players reported. Of course, challenge is subjective and every player will rank puzzles according to their own experience and tastes, but it feels like there should be some mechanism to more appropriately quantify difficulty. I’d like to say, with confidence: “This puzzle ranks 5/10 on the difficulty scale”, or something to that effect.

Note that I’m not talking about puzzle “fairness”. Although it makes sense that puzzles leaning toward the “Cruel” side of Zarf’s spectrum would be inherently more difficult, I don’t think the reverse is true: a challenging puzzle may also be fair. I see these as related but distinct things.

So, I have a general concept for estimating difficulty which I’d like to make real: quantifying the difficulty of each granular step required to complete a puzzle, according to a set of rules, then summing these values into a “puzzle difficulty” aggregate, and perhaps even a “game difficulty” measurement. It builds on the foundation of the Puzzle Dependency Map (or “graph”, or “chart”, depending on your preference).

A little bit about Puzzle Dependency Maps

There’s already some online discussion about mapping puzzle dependencies. This practice offers a lot of benefit to the game design discipline (even these forums have a relatively recent thread on the topic), so I won’t dive deep into it here. For discussion purposes, the following is one I produced for MaCK:

Some quick notes about my personal approach to these:

  • My tastes lean left to right, rather than top down, because that’s how I read the page and how I transcribe my ideas. It doesn’t align with the pattern of scrolling a window, but… that’s my approach.

  • I also like to include the outcomes for each action. It helps remind me what the action actually does to advance the game.

  • My dependency maps are not walkthroughs; they don’t include every command the player must type to complete a goal. For example, picking up something which was just revealed is assumed.

The things I’m considering adding to my map nodes are the difficulty boxes at the bottom left (the upper left circle, “9” in this case, is the step number and can be safely ignored for this discussion):

These are, from left to right:

  • The “difficulty” of the given, discrete action.
  • The aggregate difficulty of all prior steps which are considered part of the given puzzle.
  • The aggregate difficulty of all prior steps since the start of the game.

The end goal would be to apply some interpretation against the aggregated points at any given node in the game chain to determine difficulty.
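To make the bookkeeping concrete, here’s a minimal sketch (in Python) of how those three numbers could fall out of a dependency map. To be clear, this is purely illustrative: the node names, puzzle labels, and difficulty values are invented, and I’m assuming “all prior steps” means the unique set of ancestor nodes in the map:

```python
# Each node: (step difficulty, puzzle label, prerequisite node ids).
# All names and values here are invented for illustration.
nodes = {
    "find_key":     (1, "locked door", ()),
    "unlock_door":  (2, "locked door", ("find_key",)),
    "read_note":    (1, "cipher", ()),
    "solve_cipher": (3, "cipher", ("read_note", "unlock_door")),
}

def ancestors(node_id):
    """Every step that must be completed before this one (as a unique set)."""
    seen, stack = set(), [node_id]
    while stack:
        for dep in nodes[stack.pop()][2]:
            if dep not in seen:
                seen.add(dep)
                stack.append(dep)
    return seen

def step_difficulty(node_id):
    return nodes[node_id][0]

def puzzle_aggregate(node_id):
    """This step plus all prior steps belonging to the same puzzle."""
    label = nodes[node_id][1]
    return step_difficulty(node_id) + sum(
        step_difficulty(a) for a in ancestors(node_id) if nodes[a][1] == label)

def game_aggregate(node_id):
    """This step plus all prior steps since the start of the game."""
    return step_difficulty(node_id) + sum(
        step_difficulty(a) for a in ancestors(node_id))

print(puzzle_aggregate("solve_cipher"))  # 4: read_note + solve_cipher
print(game_aggregate("solve_cipher"))    # 7: all four steps
```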

Which brings me to the actual point of this post, where I’d really like to poll for ideas: how do we objectively measure the difficulty of a given step for an average human?

Here are some initial thoughts on quantifying the difficulty of a specific type of action, “Discovery by examination”:

To reveal the object, something else must be examined which is…

  • Clearly Called Out (CCO): Requires examination of something clearly highlighted in the text, usually as a separate paragraph. (+1 difficulty pt)
    note: I did consider quantifying CCO at 0pts, but feel like a non-zero value makes more sense when quantifying chains of actions.

  • Hidden In Plain Sight (HIPS): Requires examination of something mentioned inline within the description, or minimally obscured. (+2)

  • Obscured In Plain Sight (OIPS): Is described in a way which is significantly misleading or obscured. This typically reflects that the PC didn’t recognize what it was at first glance, and the player must see deeper than the character. (+3)

  • In cases where nested examination is required, the values for each step are summed together (see the sketch below).
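As a quick sanity check on the arithmetic, here’s a minimal sketch of that chain-summing in Python; the point values come from the list above, while the function and the example chain are invented for illustration:

```python
# Point values from the list above.
POINTS = {
    "CCO": 1,   # Clearly Called Out
    "HIPS": 2,  # Hidden In Plain Sight
    "OIPS": 3,  # Obscured In Plain Sight
}

def score_chain(steps):
    """Sum the difficulty of each examination step in a nested chain."""
    return sum(POINTS[step] for step in steps)

# An object embedded in the room description (HIPS), containing
# something clearly called out on examination (CCO):
print(score_chain(["HIPS", "CCO"]))  # 3
```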

Here’s a set of scenarios to test this initial chart against. The player’s goal in the following is to “discover the red gem”.

Examples of the above rules applied

1. “The sun shines through a little window, illuminating a clean breakfast nook. Four chairs are clustered around the small table. A clock ticks on the wall.

On the table is a red gem.”

Commentary: The gem is clearly called out separately from the room description (CCO, 1pt). It’s obviously a thing to focus on. Not much of a puzzle at all.
Total difficulty: 1 pt

2. “The sun shines through a little window, illuminating a clean breakfast nook. Four chairs are clustered around the small table supporting a red gem. A clock ticks on the wall.”

Commentary: The gem’s presence is embedded in the room description, so the player will need to pick it out from the rest of the room description (HIPS, 2pts).
Total difficulty: 2 pts

3. “The sun shines through a little window, illuminating a clean breakfast nook. Four chairs are clustered around the small table. A clock ticks on the wall.

On the table is a cube.

> x cube
The cube is adorned with a round, red gem.”

Commentary: The cube is clearly called out separately from the room description (CCO, 1pt); examining it reveals the gem (CCO, 1pt).
Total difficulty: 2 pts

4. “The sun shines through a little window, illuminating a clean breakfast nook. Four chairs are clustered around the small table with a cube on it. A clock ticks on the wall.

> x cube
The cube is adorned with a round, red gem.”

Commentary: The cube’s presence is embedded in the room description, so the player will need to pick it out from the rest of the room description before examining it (HIPS, 2pts) to reveal the gem (CCO, 1pt).
Total difficulty: 3 pts

5. “The sun shines through a little window, illuminating a clean breakfast nook. Four chairs are clustered around the small table. A clock ticks on the wall.

> x table
There’s a cube sitting on it.

> x cube
The cube is adorned with a round, red gem.”

Commentary: Here, the cube is not initially described at all. The table is embedded in the room description (HIPS, 2pts); after examining that, the cube is clearly called out (CCO, 1pt), and after examining that, so is the gem (CCO, 1pt).
Total difficulty: 4 pts

6. “The sun shines through a little window, illuminating a clean breakfast nook. Four chairs are clustered around the small table. A clock ticks on the wall.

> x table
There’s a cube sitting on it.

> x cube
The cube is ornately painted with shades of earth, and decorated with precious stones.”

Commentary: Similar to the previous: the table is described inline (HIPS, 2pts) and the cube is directly described on examination (CCO, 1pt); but the gem is described within the cube’s description, in a way that is nominally obscured as decorative stones (HIPS, 2pts).
Total difficulty: 5 pts

7. “The sun shines through a little window, illuminating a clean breakfast nook. Four chairs are clustered around the small table. A clock ticks on the wall.

> x table
There’s a cube sitting on it.

> x cube
It’s dirty, formed of earthy clay; misshapen and covered in lumps.

> x lumps
As you examine them, you find one of the lumps isn’t part of the cube at all. It’s covered in dirt, but you brush that away, revealing a round, red gem.”

Commentary: Again, the table is described inline (HIPS, 2pts) and the cube is directly described (CCO, 1pt). Now, though, the cube’s description is misleading: it’s not “gem adjacent” or at all obvious that examining the lumps would reveal a gem (OIPS, 3pts).
Total difficulty: 6 pts

So I am curious:

  • What are your thoughts on this approach in general, on the above attempt to quantify difficulty for this type of action, and on other types of actions?

  • Has anyone applied similar thinking to assessing difficulty, and is there a methodology published some place which I’ve missed?

  • Finally, there may be room for quantifying a “negative difficulty” for certain types of mitigation strategies, such as hinting.

I know that all of the above is a can of worms, but I am interested in hearing from the group.

Thanks!

Jim

8 Likes

I have played a lot of puzzle games, and my answer is: this question is useless.

No puzzle is written for an average human. Puzzles are written for an audience that is familiar with some kinds of puzzles. (Maybe not the one you’re writing.) And there are a lot of kinds of puzzles, so you can’t even narrow it down to “puzzle fiends” vs “typical folks”.

Moreover: puzzles are always trying to skate the boundary of “using ideas you’re familiar with” vs “…but there’s a twist”. This is even more intensely subjective. It entirely depends on what twists you’ve seen before.

10 Likes

Even if there were room for a legal fiction like a “reasonable player”, it would be a long way from the notion of an “objectively measured difficulty” rating for puzzles.

2 Likes

Off the top of my head, I’d say there are two types of puzzles.

There are the ones where the solution is difficult to find: you don’t know what to do to solve them. These tend to require lateral leaps; I think it was Dan Fabulich who said the best puzzles (of this sort), like the best jokes, are obvious in hindsight. Taking “the gold nugget can’t be carried through the stairs” and “this magic word teleports you between two rooms” and combining them into “I can use the magic word to get the gold nugget past the stairs” is this kind of puzzle.

And then there are the ones where the solution is difficult to execute: you know what you have to do, and the hard part is making that happen. The door is locked with a key, the key is in one of 50 rooms, you have to search through all the rooms to find it. These puzzles are easier to design (since you can quantify exactly how difficult they should be!), but less satisfying to play.
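(To put a number on that: if the key is equally likely to be in any of the 50 rooms and the player searches them in a fixed order, they check (50 + 1) / 2 = 25.5 rooms on average, and all 50 in the worst case.)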

This isn’t a hard, objective binary; to some people, Sokoban is the first sort, to others, it’s the second. Maybe the hard part of a maze is figuring out that you can drop items in the rooms to map them (first type), maybe the hard part is doing that mapping (second type). But I think only the second can really be measured in any sort of objective way.

10 Likes

You don’t. The same puzzle can have wildly different difficulties for different people. Those different people may even be me on different days, even given the same level of sleep/hunger/stress/etc.

And even if you could accurately measure an average, almost nobody is actually average in every way: consider the 1950s US fighter-cockpit design story. Curse of dimensionality.

8 Likes

So this does sound like you’re saying “subjectivity” can’t be objectively measured. And while that’s true, you can come close.

My thoughts here are that an objective measurement, based on objective criteria, provides a general baseline of difficulty. Of course it makes sense that it would mean different things to different players. But it should provide a clear warning signal that a given puzzle is truly too difficult, or too easy, for a given audience.

Moreover: puzzles are always trying to skate the boundary of “using ideas you’re familiar with” vs “…but there’s a twist”.

This is an interesting observation, as it implies the thing which makes the “twist” can’t be measured. I’d be interested in examining some “twist” examples to see if we could find some common attributes to measure.

1 Like

Even your example of “Clearly Called Out” is subjective and contextual: look up “locus of attention” and some of the early computer UI research. Even if, for instance, an interface changes the mouse cursor from a pointer to an hourglass or an X or something to indicate a status change, people can often fail to see it, even in their central focal zone, even when they’re looking straight at it, if their locus of attention is the thing behind the mouse cursor and not the cursor itself.

3 Likes

Whether or not such objective criteria or objective measurement thereof exist is what’s being argued about, though.

Is your argument that a 6 might mean “moderately difficult” for one player, “very hard” for another, and “very easy” for a third, but still be an objective measurement because it reflects supposed underlying objective features of the puzzle?

But this (at least to me) seems to assume that all 6s, say, are uniform for a particular player or group of players—yet there seems to be no good reason to assume that all possible combinations of puzzle features that sum to six points are interchangeable in this manner; indeed, one instance of your HIPS could be found to be much more obscure than another by the same player, but they would both be rated at +2 points.

The various means devised to create things which look like conventional mazes but aren’t? (e.g. the puzzle in one of the Enchanter trilogy games where you alter room connections, or the wizard’s maze in Acheton)

I don’t know this story; what is it?

4 Likes

(Wanted to edit this into my post above, but kept getting an error about the “requested URL or resource”)

Even granting the above assumptions, how do you determine where a ‘6’ sits for a given audience?

Intuition? That might be wrong.

Aiming for an audience of which you are a member? Limiting, and it might bias you to assuming that the audience will find a particular puzzle easy or hard because you do.

Playtesting? You could also have testers cardinally rate how difficult they found each puzzle, making this a priori puzzle rating look a bit redundant—either you have data directly indicating how difficult each tester found each puzzle, and thus don’t need this rating mechanism, or you are basing this off testing comments which might not correlate as neatly with the granular per puzzle approach outlined here.

EDIT: Split into paragraphs for clarity.

EDIT 2: Slight wording change.

Even worse, I can easily imagine two puzzles and two players such that one puzzle is easy for Bob but hard for Alice, while the other puzzle is easy for Alice but hard for Bob. So not only can you not assign objective numerical scores, you can’t even assign objective rankings in a way that consistently tells you which puzzles will be easier or harder than which other puzzles.

2 Likes

It is! In the same way that bench pressing 100 lbs is “moderately difficult” for one person, “very hard” for another, and “very easy” for a third. Or an 80 degree temperature would be too hot for one, too cold for another, and just right for a third. (This reminds me of “Goldilocks and the Three Bears”, by the way.)

They’re not the same, sure. Weight and temperature are purely objective measurements, easily quantified, and not nearly as messy as puzzle difficulty. But there are whole disciplines around measuring subjectivity at scale in other domains and quantifying it. I suspect there’s some unexplored thinking from which we could derive basic, less flexible principles.

In the above example, I distinguish “object mentions” embedded in the room description from standalone paragraphs. I’ve seen this reflected in recurring feedback, but it also just feels true to me. It requires a bit of discipline, of basic parser-game experience, to take notice of such things as important.

We shouldn’t assume any of this is valid; nor should we simply assume it isn’t. We can theorize, though, and potentially produce an experimental model. That’s the point of this post: to explore the possibilities.

Erm… I realize “a given audience” was the phrase I used, but that wasn’t really my point. Although we could say “an audience of newbies” or “an audience of puzzle fiends”, what we would really be doing here is establishing a measurement independent of the audience. I suspect that such an established measurement, perhaps satisfying no one, would end up generating insights, similar to:

“Games with a measurement below X tend to rank less in Parser Comps than those in such-and-such range.”

But that’s a suspicion. Certainly you can choose audiences for which such measurements mean nothing. The audience of blue-eyed members would probably be evenly distributed across the range of player experiences with such a ranked game.

2 Likes

I think the main thing people here are getting at is that there are so many different types of puzzles, outside of “find the object and use it”. For instance:

  • wordplay
  • cryptography
  • tower of Hanoi
  • maze
  • relationships
  • lights out
  • NPC movement
  • object movement
  • state-tracking

All of these appeal to different players, and one cannot definitively be called harder than another, nor, I think, can they be quantifiably compared.

5 Likes

Wouldn’t you find it odd, though, if one person found one set of 100 lbs weights easy to bench press and a different set hard to bench press, if you were considering all sets of weights of 100 lbs as interchangeable and lacking separate identity?

Will debating the marginal utility of completing the Zork II baseball puzzle help puzzle design, though?

More seriously, I don’t particularly see what this approach—feel free to clarify!—has to do with measuring subjectivity.

You even disavowed calibrating the scale for a given audience, which would seem to me to cross out the slightest possibility of subjectivity in your system.

I was assuming you meant the audience for the specific game or genre—the audience for Infocom games when they came out is different from the audience for Quilled games when they were current is different from the audience for, say, Suveh Nux.

I don’t really think the idea of generic audiences of puzzle fiends and newbies makes sense at all, because of that difference in audiences.

Imagine a whole genre growing up around the kind of nested examination puzzle you see in Lime Ergot or Toby’s Nose. Imagine that the audience of puzzle fiends relative to this genre wouldn’t get out of bed for anything less than a 7 on your scale.

Then put them in front of a game of a comparable difficulty structure according to your scale but constructed along the lines of being an EXAMINE-less puzzle game that gets its difficulty in other ways.

My intuition suggests that their hypothetical performance would be different between game styles, but your “measurement independent of audience” would presumably suggest the opposite—that all puzzles of X rating are the same and that if you find any such puzzle easy, you find them all easy.

1 Like

One thing I think you can do (with a bunch of caveats) is establish a difficulty ordering between different puzzle instances. Maybe not an objective, numerical rating, but at least “this wordplay puzzle is harder than that one”, because it requires two leaps of logic, or you need to strain against pronunciation a bit more. In technical terms you might get a partial ordering.

Even then it’s a bit moth-eaten. It doesn’t give you a baseline absent actual playtesting. It doesn’t necessarily give you a slope of difficulty. It doesn’t necessarily let you compare different styles of puzzles, even if they are closely related. And it can be subject to you making an erroneous assumption on how to solve a puzzle.

But, you know, better than nothing.
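For what it’s worth, a toy sketch of what that partial ordering might look like in Python; the puzzle names and the hand-judged “harder than” pairs are invented, and the point is just that transitivity buys you some comparisons while leaving other pairs incomparable:

```python
# Hand-judged "a is harder than b" pairs; puzzle names are invented examples.
HARDER = {
    ("homophone_maze", "anagram_gate"),
    ("anagram_gate", "simple_rhyme"),
}

def strictly_harder(a, b):
    """True if some chain of judgments makes a harder than b (transitivity)."""
    stack, seen = [a], set()
    while stack:
        x = stack.pop()
        for harder, easier in HARDER:
            if harder == x and easier not in seen:
                if easier == b:
                    return True
                seen.add(easier)
                stack.append(easier)
    return False

def comparable(a, b):
    """In a partial order, some pairs simply aren't ranked against each other."""
    return strictly_harder(a, b) or strictly_harder(b, a)

print(strictly_harder("homophone_maze", "simple_rhyme"))  # True, via the chain
print(comparable("anagram_gate", "riddle_door"))          # False: no judgment links them
```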

6 Likes

I was going to bring this up, too—I did a puzzle hunt once where the team running the hunt tried to quantify each puzzle in terms of “lightbulbs” (how much of a leap of logic does it require?) vs. “hammers” (how much work is it?). My puzzle hunt team actually uses this concept a lot now when we design puzzles, but just the “is this puzzle predominantly lightbulb or predominantly hammer?” part. The actual “quantifying lateral thinking and raw effort on a scale of 1–5 each” part we don’t use, and I think that’s at least in part due to how tough it is to quantify this kind of thing in the first place.

6 Likes

Aha! So I did tap into something useful there!

An aside on this: in Miss Gosling’s Last Case, I tried to make one of the more difficult puzzles have both a lightbulb solution and a hammer solution: the tea garden. You need to figure out the colors of different plants, but you’re colorblind, so they all look the same to you. The trick (and this one is lightbulb-only) is to look at them through different colors of glass. You can either drag the colored glass around the garden to examine each plant individually (hammer), or go up to the sitting room where a window overlooks the garden and see them all at once (lightbulb).

In the end, basically everyone did the hammer solution. Which surprised me—I thought the leap of logic to look through colored glass would be much harder than the leap of logic to go to the window overlooking the garden!

3 Likes

In the 1950s, the US air force tried to design an ideal cockpit to fit the “average person,” using extensive measurements of average leg length, average arm length, average torso height, and so on, based on the assumption that most people would be reasonably close to fitting these dimensions. But when someone decided to actually interrogate this assumption, they found that of the > 4000 people measured for the project, not a single one was within the “average range” for all 10 of the measurements being considered.

8 Likes

Even this objective information is fairly useless for a game designer though. If I’m designing a physical puzzle, I can’t make an object that’s of “average difficulty” to move by picking an average weight, or by picking a weight that would be moderately difficult to move for an “average person”. People come in too many sizes, and the same weight will be too heavy for many of them while being too light for many others. I’d have to tailor the weight to the individual player if I wanted it to “work” reliably.

If I make a level 6 puzzle I’ll have the same problem: a significant number of players will find it impossible, and a significant number of players will find it trivial. Puzzles are worse than weights though, because solving or not solving a puzzle depends on quirks of the player’s conceptualization versus the author’s, which the author probably isn’t in a position to assess. It’s like trying to measure which size of bowling ball is the hardest to use: it depends on your size and the shape of your hand.

I also question whether situating the solution at the end of a long chain of fairly obvious actions actually produces a harder puzzle. That seems like saying Harry Potter is a harder book than something else just because it’s longer.

2 Likes

I really appreciate what you’re trying to do here. It would be fantastic if we had a useful system for calculating puzzle difficulty, and a way to combine puzzle difficulties into game difficulty.

To anyone who claims the question is meaningless, I want to suggest thinking of the easiest puzzle they remember, e.g. the shower puzzle in 9:05, and comparing it to something difficult, like one of the alternate endings in The Dreamhold. Clearly, one is easier than the other in some sense, otherwise you wouldn’t understand the question. Nobody is saying that everyone solves the shower puzzle faster than a difficult alternate ending in The Dreamhold, but generally many people do.

Yeah, difficulty is subjective. We’d have to aim for some sort of average, just like e.g. Elo ranking in chess hides a bunch of nuance (opening theory, reading, strategic control, piece value estimation, endgame, etc.) in one average number. You can find cases where one specific lower-Elo player consistently beats one specific higher-Elo player because of their differing strength profiles. But Elo is still highly useful!

I’m a big fan of quantifying things in ways that are predictive, because that means (a) the estimation is objective, and (b) at scale, it can be verified against reality. Something like “percentage of players that solve the puzzle within 1 hour of finding it, without hints” or “average time spent with the puzzle before solving it without hints” would be difficult to know with certainty, but highly verifiable if someone took the time to study it.

In lieu of study, we’d like automated ways of measuring it based on play (this is what happens with online Elo ranking in chess) but text adventures don’t have the infrastructure set up for that.
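To illustrate, here’s a minimal sketch of the standard Elo update applied that way, treating each attempt as a match between a player and a puzzle (a solve scores 1 for the player, a give-up scores 0). The K-factor of 32 and the 400-point scale are the usual chess conventions, not anything calibrated for IF:

```python
def expected_score(rating_a, rating_b):
    """Probability that A beats B under the standard Elo model."""
    return 1 / (1 + 10 ** ((rating_b - rating_a) / 400))

def update(player, puzzle, solved, k=32):
    """Return new (player, puzzle) ratings after one attempt."""
    exp = expected_score(player, puzzle)
    score = 1.0 if solved else 0.0
    return player + k * (score - exp), puzzle + k * (exp - score)

# A 1500-rated player fails a 1500-rated puzzle: the puzzle's rating rises.
print(update(1500, 1500, solved=False))  # (1484.0, 1516.0)
```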

It could of course be estimated mentally. I think your suggestion is practically a formalised technique to do that, but I’m not sure it captures the necessary nuance. I have some ideas of how to apply systems safety techniques to this as well – I’ll think more about it. Great thread!

4 Likes