On Quantifying Puzzle Difficulty

I’ve heard about the “no average pilot” bit… and the math kind of predicts that outcome… For a normal distribution on one variable, 68% are within 1 standard deviation of the mean. Multiplying a probability of 0.68 by itself across 10 independent variables comes out to about 2%… Granted, in real life, not all variables are normally distributed and not all variables are independent, but the point is it doesn’t take very many variables to make the intersection of the majorities become very small or even empty.
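A quick way to check that arithmetic, as a minimal sketch: it assumes every variable is normally distributed and fully independent (which, as noted, real measurements are not), and the function names are made up for illustration.

```python
import math


def fraction_within(k_sigma: float) -> float:
    """Probability that a normal variable falls within k_sigma standard deviations of its mean."""
    return math.erf(k_sigma / math.sqrt(2))


def fraction_average_on_all(n_variables: int, k_sigma: float = 1.0) -> float:
    """Fraction of people who are 'average' on every one of n independent normal variables."""
    return fraction_within(k_sigma) ** n_variables


if __name__ == "__main__":
    for n in (1, 3, 10):
        # Print the share of the population that is within 1 SD on all n variables.
        print(n, f"{fraction_average_on_all(n):.1%}")
```

With 10 variables the result lands at roughly 2%, matching the back-of-the-envelope 0.68 multiplied by itself ten times.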

As for puzzle difficulty… it’s hard to quantify difficulty even within a very specific class of puzzles… I’ve been a member of the Twisty Puzzles Forum for over 15 years, and in that time, I don’t think any consensus has ever been reached on how to quantify the difficulty of a twisty puzzle, much less any of the other classes of puzzles that frequently come up in the non-twisty puzzles subforum there… There is a naive notion that a twisty puzzle, or more generally any combinatorial puzzle, increases in difficulty with its number of permutations or its God’s Number (the largest number of moves that an optimal solution needs, taken over every possible scrambled state), but there are counterexamples to that (e.g. the n×n×n class of cubic puzzles increases in permutations and God’s Number as n increases, but it’s generally accepted that difficulty stops increasing after the 7×7×7 and that higher-order cubes only become more tedious to solve, not actually harder, to the point that the record-breaking cubes being built are more interesting as feats of engineering than as puzzles to solve)… On a similar note, what makes the height-64 Towers of Hanoi impossible while the height-5 version is practically trivial is that the height-64 version would take too long for any human to execute, while the height-5 version can be speed-solved in seconds. Their solutions are essentially identical, differing only in how many moves the general solution has to run for.
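To make the Hanoi point concrete, here is a small sketch of the standard recursive solution. The peg labels and function names are my own, and the one-move-per-second pace is just an assumption for scale.

```python
def hanoi_moves(n, source="A", target="C", spare="B"):
    """Yield (from_peg, to_peg) moves that transfer n discs from source to target."""
    if n == 0:
        return
    # Move the n-1 smaller discs out of the way, move the largest, then restack.
    yield from hanoi_moves(n - 1, source, spare, target)
    yield (source, target)
    yield from hanoi_moves(n - 1, spare, target, source)


def move_count(n):
    """Number of moves in the optimal solution for n discs: 2**n - 1."""
    return 2**n - 1


if __name__ == "__main__":
    print(len(list(hanoi_moves(5))))  # the same procedure, run for height 5
    seconds_per_year = 365.25 * 24 * 3600
    # Assumed pace: one move per second, with no breaks.
    print(move_count(64) / seconds_per_year, "years for height 64")
```

The procedure is identical at every height; only the run length changes, and at height 64 it comes to hundreds of billions of years at that pace.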

And of course, there is a puzzle’s computational complexity, but that’s less a measure of how hard a puzzle is for a human to solve and more about how much time or space a computer needs to find a solution given a known algorithm.

3 Likes

An a posteriori measure of puzzle difficulty, whether one thinks that is valuable or not, is different in kind to the sort of a priori measure which is the object of this thread, though.

Elo ranking, for example, is relative to the body of players and the organisation maintaining the rankings—USCF Elo != FIDE Elo != Lichess Elo—and ranking changes over time.

An equivalent system of ranking for puzzles would likely lack the “measurement independent of audience” quality sought by the OP, given that it would not be invariant over time and across changes in audience composition.

EDIT: corrected typo of by to be

2 Likes

I agree on this. A measurement of how difficult players perceive puzzles to be has to depend on two things: the puzzle and the player. It cannot be measured “independent of audience”.

But to steelman Jim’s argument, I think they meant “relative to a large audience of different experience levels and strength profiles” in contrast to “individual difficulty estimations for each specific player.”

2 Likes

As far as I can tell, though, Jim’s arguing not for a measure of how difficult players perceive puzzles to be but rather a measure of how difficult puzzles are, in some sort of puzzle Platonism.

But the analogy to 100 lbs weights pushes against this—a weight being 100 lbs one day and 90 lbs the next would conventionally be seen as highly peculiar.

I read the analogy as arguing that puzzle difficulty ratings must not depend on individual difficulty estimations because (according to the analogy) individual difficulty estimates only tell us how hard individuals find puzzles rated X in the same way such estimates for weights tell us how hard individuals find lifting X weights instead of measuring weights in terms of the difficulty people have lifting them.

The notion of measuring difficulty qua an idealised average human, and the way in which the measurement scheme proposed in the opening post seems to view all puzzles of the same measured difficulty as interchangeable, both point towards my interpretation.

I may very well be wrong, but if you’re correct as to what Jim meant, then I have to say I find it clear as mud.

EDIT: I suppose you could read him writing

as saying he doesn’t assume puzzle interchangeability as I think he does, but I struggle to put that interpretation on it when it’s in a post alongside an analogy that (to my mind) either depends on such an assumption being true or amounts to saying “puzzle difficulty is much like the difficulty of lifting weights, in that it’s not like that at all”.

EDIT 2: In the OP, he sets out that his system by

which, to me, suggests that he’s quantifying difficulty as an objective measurable property and not something subjective whose variance between players can be measured and averaged to create an estimate of puzzle difficulty.

He desires

which I read as applying once again to puzzles themselves, not player performance.

I read

as referring to basic principles in terms of evaluating puzzles, rather than player performance in puzzles.

Indeed, the title of the thread points towards an objective quantification of puzzle difficulty in itself, rather than as an estimate derived from player performance or even player evaluation.

1 Like

I think one of the tricky things in IF puzzle design is that if there’s a “hammer” option and it’s near-immediately obvious that it will work, a lot of people will just get tunnel vision and not think to look for anything else.

Not quite the same thing, since there’s not really a faster alternative, just one that lets you have more individual “aha” moments along the way, but I thought the phone hacking in Winter-Over was so hammer-y that people would do it in bits and pieces spread out amid other investigative activities or maybe not in their first playthrough at all, and I was very wrong about that.

But of course this is kind of a tangent.

8 Likes

That’s not true, I think. A puzzle’s solution might be as simple as PUT DIAMOND ON NECKLACE, but if the player missed one clue and screws it up while experimenting and tries PUT EMERALD ON NECKLACE, she might have irreversibly locked herself out of victory without knowing.

6 Likes

This is a fair observation. An aggregation of points may not be a valid approach. Perhaps an average, or some other algorithm. But I do think there is some approach which produces a meaningful metric…

Edit: Updated a typo.

1 Like

Now that sounds interesting! I’d be curious to see your thoughts on that.

Of course, reality and perception of reality are always different. Measuring the former means nothing if it isn’t interpreted through the latter. But as subjective as “difficulty” is, it isn’t random. If it were, a theoretical charting of all individual experiences in a group would fill the graph as white noise. Instead, opinions clump together, indicating something which could be, I propose, estimated.

My premise, my suspicion, is this: when you break puzzles down to their individual, constituent pieces, you find characteristics which can be immutably quantified, independent of an individual’s perception. I envision this as a sort of objective “friction” inherent in every observation+action pairing the player makes. Some observation+action pairs have more friction than others, and players experience this grouping of turn-friction differently, but it can serve as an objective measurement to benchmark against.

1 Like

No. I’m saying this is a theoretical discussion, a starting point to establishing a model for measurement. Postulating something isn’t the same as assuming it’s true or false.

2 Likes

I think Jim accidentally started this thread off on the wrong foot by proposing concrete ways to measure difficulty, which conflated the general idea of modeling difficulty with a specific proposal for a model.

It sounds to me like the criticisms in this thread make the error of forgetting what a model is. All models are wrong, but some can still be useful. Jim has proposed one model for puzzle difficulty. Like any other model, it is wrong. I’m confident Jim knows this too. The way to refute this model is not to think of a case in which the model makes the wrong prediction, but to present an alternative model that is more useful.

How do we measure the usefulness of a puzzle difficulty model? That would require the connection back to reality I suggested in an earlier reply: the model eventually has to be predictive in some sense, i.e. estimate the time it takes a player to solve the puzzle, or the fraction of players that solve the puzzle without hints, or some other concrete, measurable outcome.

Then we can compare models in terms of how large the error is in their predictions. I’m convinced the model suggested by Jim would have huge errors. But I also think those errors would be smaller than completely random guessing, which is what everyone else has proposed so far by rejecting Jim’s model without proposing an alternative. The challenge then is coming up with an alternative model that has even smaller errors.
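Comparing models by prediction error could look something like the sketch below. Everything in it is hypothetical: the numbers are made up purely to show the mechanics (they are not measurements of any real game), mean absolute error is just one possible error measure, and a constant 50% guess stands in for the “know nothing” baseline.

```python
from statistics import mean


def mean_absolute_error(predicted, observed):
    """Average absolute gap between a model's predictions and what was observed."""
    if len(predicted) != len(observed):
        raise ValueError("predicted and observed must have the same length")
    return mean(abs(p - o) for p, o in zip(predicted, observed))


# Made-up illustration data: fraction of players solving each of four
# puzzles without hints. These are NOT real measurements.
observed = [0.9, 0.6, 0.3, 0.1]
model_a = [0.8, 0.5, 0.5, 0.2]   # a hypothetical difficulty model's predictions
baseline = [0.5, 0.5, 0.5, 0.5]  # stand-in for guessing without any model

if __name__ == "__main__":
    print("model A error: ", mean_absolute_error(model_a, observed))
    print("baseline error:", mean_absolute_error(baseline, observed))
```

A model is worth keeping if its error beats the baseline, and a rival model wins by getting the error lower still.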

See also Paul Meehl’s disturbing little book: Arithmetic Models: Better Than You Think

4 Likes

I have enjoyed reading the comments in this thread, and have been swayed by the argument that “puzzle difficulty” is not a thing that can be measured or quantified in a meaningful way. The trick for the author is to create differentiated puzzles that can be enjoyed by the widest variety of players. Good puzzle design means providing entertaining and helpful responses to incorrect solutions as well as correct solutions, and a progressive system of in-game gentle nudges toward a solution. In other words: a successful puzzle generates the impression that all players are solving a challenging puzzle, even if they may not be solving the same puzzle.

But kqr also makes a valid point: if we limit the discussion to classic forms of parser-based puzzlery, there are some common elements to puzzles that were considered “difficult but still satisfying” by a variety of people.

  • Situations requiring careful planning because of time or resource constraints.
  • Situations which require the acquired knowledge from multiple failed attempts.
  • Situations which require players to do things which the game has already trained them not to do (walking into a dark space without a light source, sacrificing a one-use valuable item).
  • Complex “meta” puzzles which require synthesis of information from the entire game.

This list above omits classical puzzles which are regarded as “difficult” but “not fun” for a variety of players, which include

  • puzzles which are obvious but tedious
  • guess the verb
  • puzzles which seem illogical to most players even after the solution is revealed.

5 Likes

Yes. I concur. I may have put the cart before the horse by describing the proposed example model. I did not consider that our ability to model difficulty would be a point of contention; only the approach.

I’m not sure I have time to develop the idea further at the moment, but to keep things simple let’s use the darkness puzzle in Zork. The controls we have to avoid being eaten by a grue are things like (a) not moving from darkness into darkness, and (b) making light.

In order to apply (a), players would need to be informed of where there is darkness before going there. They can get this feedback by dying and undoing, but adding further feedback paths would make the difficulty lower. We can also think of reasons a player might go from darkness to darkness despite being given the appropriate feedback:

  • Darkness status might change throughout the game. This would naturally increase darkness puzzle difficulty.
  • There might be someone chasing them, forcing them to move quickly. This is an unrelated element that would increase the difficulty of the darkness puzzle!
  • The player might not realise they can undo out of death, and think they have to restore an old save. Thus hinting at the undo command would decrease puzzle difficulty.
  • The player could have collected feedback the expensive way (dying and restarting) but recorded it incorrectly on a paper map. Providing an accurate digital map would make the puzzle easier.
  • Having fewer areas that are dark would obviously make it easier to avoid going from dark to dark, thus decreasing puzzle difficulty.

This was the uninteresting case of avoiding darkness. Turning to (b), making light. Why would a player be unable to do this?

  • They may not have found anything that provides light.
    • Adding more things that provide light would decrease puzzle difficulty.
    • Adding more hints about where things that provide light are would decrease puzzle difficulty.
    • Moving things that provide light into the path the player goes on their way to darkness would decrease puzzle difficulty.
    • Making automatic the action to pick up light-providing items would decrease puzzle difficulty.
  • They may not know that what they have found is a thing that provides light.
    • Adding hints to the description decreases difficulty.
    • Forcing the player to traverse through darkness after finding the thing would show them, decreasing difficulty.
  • They may have found and subsequently dropped the thing that provides light.
    • Relaxing inventory limits (so the player has less reason to drop things) makes the darkness puzzle easier.
    • Giving the player information that there is more darkness in the future makes the puzzle easier.
  • They may not understand how to operate things that provide light.
    • Having things automatically light would decrease puzzle difficulty.
    • Adding instructions on the light-thing on how to operate it would decrease puzzle difficulty.
  • They may not have sufficient fuel for the thing that provides light.
    • Providing more fuel decreases puzzle difficulty. (Or, equivalently, making the fuel burn more slowly.)
    • Allowing the player to use other things they find in the game as fuel would decrease puzzle difficulty.

We could also think of other ways of making the darkness puzzle easier, e.g. by giving the player a probability of escaping from the grue, having items that make it possible to kill the grue, or allowing more movements in the dark before the grue tracks them down, etc. (Of course, this might make other puzzles harder, as it allows the player to spend longer than they should in darkness!)

STPA (System-Theoretic Process Analysis) is a formal method by which safety engineers answer questions like “What are reasons the pilot might not disengage MCAS even though it’s causing their 737 MAX 8 to nosedive into the ocean?” and we could hypothetically use STPA to come up with answers to “What are reasons the player might not perform the puzzle solution even though it’s preventing them from continuing to play the game?” Counting the answers might be one measure of puzzle difficulty. If one puzzle makes us think of 13 reasons the player might fail, and another only 3 reasons, then maybe (as a model, to remind everyone) we can think of the former puzzle as harder than the latter.

The idea is to ask questions like,

  • What are reasons the player might do the wrong thing, even though they have the correct information? (E.g. errors in the player’s mental model of the game, inability to find the right verb, complex logic puzzles.)
  • What are reasons the player might not even get the correct information? (E.g. insufficient hinting, not having explored sufficiently to discover the answer, the player previously having done something that locks them out of the necessary information.)
  • What are reasons the player might do the right thing, but the game still behaves as if they did not? (Bugs, unfulfilled prerequisites, lack of in-game resources.)

but the full STPA method is more structured, of course. There is a free handbook which is quite good, but doesn’t reflect the latest developments in the method.

One could probably come up with a set of semi-standardised questions tailored for parser games and then judge puzzle difficulty by how many of those questions indicate valid reasons to not complete a puzzle.

For example

  • “Can the player do something that locks them out of getting the necessary items?” is about the Zarfian cruelty of a specific puzzle, which is clearly something that would increase puzzle difficulty, so that would count as +1 in a difficulty model.
  • “Does the player need a non-standard verb to complete the puzzle?” would also be a +1 in a difficulty model.
  • “Are the required items found on the main path from starting location to the puzzle?” would be a -1 in a difficulty model.
  • “Are the required items mentioned in the descriptions of the rooms they are in?” would similarly be a -1 in a difficulty model. (Some puzzles require items that are discovered only by searching other things, for example.)
  • “Does the puzzle involve an untypical use of an item?” would be a +1.

You can argue that a cruel puzzle should not count as the same +1 difficulty as using a non-standard verb, and I don’t really care about the specific weights assigned to these factors. I think it’s more important to find a set of broadly applicable factors and then we can debate specific weights later.
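To show how such a checklist could be tallied, here is a minimal sketch. The questions and the +1/-1 weights are the ones listed above; the dictionary keys, the function name, and the sample puzzle are hypothetical, and the weights are placeholders meant to be debated, not a settled model.

```python
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class Factor:
    question: str
    weight: int  # placeholder weight, open to debate


# Keys are hypothetical short names for the example questions above.
FACTORS = {
    "can_lock_out": Factor(
        "Can the player do something that locks them out of getting the necessary items?", +1),
    "nonstandard_verb": Factor(
        "Does the player need a non-standard verb to complete the puzzle?", +1),
    "items_on_main_path": Factor(
        "Are the required items found on the main path from starting location to the puzzle?", -1),
    "items_in_room_descriptions": Factor(
        "Are the required items mentioned in the descriptions of the rooms they are in?", -1),
    "untypical_item_use": Factor(
        "Does the puzzle involve an untypical use of an item?", +1),
}


def difficulty_score(answers: Mapping[str, bool], factors=FACTORS) -> int:
    """Sum the weights of every question answered 'yes'; unanswered questions count as 'no'."""
    unknown = set(answers) - set(factors)
    if unknown:
        raise KeyError(f"unknown factors: {sorted(unknown)}")
    return sum(f.weight for key, f in factors.items() if answers.get(key, False))


if __name__ == "__main__":
    # A made-up puzzle: cruel, needs an odd verb, but the items lie on the main path.
    sample = {"can_lock_out": True, "nonstandard_verb": True, "items_on_main_path": True}
    print(difficulty_score(sample))
```

Keeping the weights in one table means the later debate about how much each factor should count only has to touch that table, not the tallying.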

7 Likes

I feel like you’re saying two things at once here, but I want to be clear they are two separate questions:

  1. Is it ever sensible to say one puzzle is harder than another?
  2. Is puzzle difficulty (however measured) a useful guide to game design?

The answer can be “Yes” to the first question and “No” to the second question without any contradiction occurring, but it sounds like you think they are linked.

I’m still curious if any of the puzzle difficulty deniers would claim that e.g. the shower puzzle in 9:05 is just as difficult as e.g. one of the alternative ending puzzles in Dreamhold. If not, you are admitting that puzzles can have varying difficulty. That’s a rough mental measurement you just performed in your head. Why not talk about how we can convert that rough mental measurement into something that can be performed on paper?

3 Likes

Yes, and I’m convinced that there does NOT exist ANY model in which putting a single number on the difficulty of a puzzle (or a single step of a puzzle) provides enough information or precision to be useful.

It’s too much of a multifaceted and contextual problem: the time it takes to solve a puzzle can vary by well over an order of magnitude for the same person based on seemingly trivial things about what they happen to be thinking about or focusing on at the moment.

There’s plenty of useful stuff to say about puzzle difficulty and design, but it’s a wide-ranging craft discussion, not a quantitative model.

For example, I have watched two different people streaming The Roottrees Are Dead who have read the following passage out loud, gone back over it several times, and finally moved on, puzzled, without ever having suspected that there was any connection between the words “herring” and “red.”

You find information on a website about strange historical facts.

Aloysius Herring Roottree, aka “Red” was an appointed judge and politician who earned his nickname ironically because he was the ambassador to Ireland for only about six days.

He served what some people contend is the shortest term in history. A few days after he was appointed, he was heard disparaging his position and the Irish people.

He then tried to run for state senate. He claimed to be related to the Roottree Candy Company in a newspaper ad but it came out that the last name was merely a coincidence and he was defeated in a landslide.

Both of them immediately saw far more subtle things in other passages. Both of them came back to it later, saw it within seconds, and laughed at themselves for missing it earlier.

So while you probably could put broad relative difficulties on “how subtle is this information?” that we could all mostly agree on, it’s useless as a measure of how difficult any particular person is going to find it in the moment, or how long it takes them to see it at any particular time.

2 Likes

Here’s an example of how it might be useful: it would allow beginners to list games by puzzle difficulty, and if they are choosing between e.g. Glowgrass and So Far, they would see that Glowgrass has easier puzzles.

This is a real question beginners ask when selecting games. It is not helpful to tell them “Man, it’s all subjective. Difficulty varies so much you might have an easier time with So Far than with Glowgrass!” because it’s not true. For most players (I almost want to say “all players” but just to be careful I’m not going to) Glowgrass would be the easier game in terms of puzzles. It would be super useful if we had a model that captured this.[1]

Will we come up with a model that allows us to accurately predict the time a specific player will spend with a specific puzzle in Glowgrass? No. But that’s not necessary for the model to be useful.

[1]: We do sort of have models that capture this, but they are locked up in each and every one of our heads, so what I’m really saying is that it would be useful to get that model down on paper, so we have a shared basis from which to discuss further.

5 Likes

I agree with @kqr and @Doug_Egan that it’s better to have questions to ask about the puzzles rather than a quantitative measure. It’s sort of like software best practices. Creating a complex system is hard and people have come up with a lot of heuristics to guide them, but it’s not an exact science. (Saying this question is useless is probably going a bit far though.)

Some more questions that might be useful… How many steps in the puzzle? Is the goal of the puzzle communicated? Are intermediary sub goals communicated? How well is it hinted? Is there feedback if you do the wrong thing, and if you finish the puzzle? Does it require external knowledge not provided in the game? Does it require knowing about conventions for the game genre that a new player wouldn’t know? Etc etc etc

I think some people will be rude and dismiss a game if they find it too hard (even or maybe especially people who claim to like puzzles). I wouldn’t read too much into the reviews.

6 Likes

Whether or not puzzle difficulty is measurable or modelable, I don’t think this is a good premise because—as far as I can see—it treats different sources of puzzle difficulty as interchangeable.

Surely postulates carry implicit assumptions?

No? I wouldn’t claim the shower puzzle is easier or harder than the Dreamhold puzzle either—puzzle difficulty is not a property of the puzzle.

Firstly, this thread is about the usage of measuring puzzle difficulty for the purposes of game design—a measure might be adequate for the purpose you have outlined but inadequate for the purposes of this thread.

Secondly, is “you could do a detailed and painstaking analysis of each puzzle in a game to get a result which could probably be crudely approximated by simply asking people to rate how difficult they found the game as a whole” a very convincing argument for a method?

3 Likes

Why not both? I agree it’s much easier to have people rate the difficulty of the game as a whole – once a sufficient number of people have played the game. (And to be clear, this rating is not the “crude approximation” – the rating is the ground truth. The detailed and painstaking analysis is what would produce a crude estimate of the rating!)

But if we are also able to break this rating down into contributing factors, then people who design games can estimate what the rating would be for their game even before it is played by many players, and that could give them an idea of whether they would like to make it easier or more difficult.

Sure, they could wing that judgment, but the author is not always a great judge of puzzle difficulty in their own game. They could rely on the opinions of a few beta testers, but at least I would like to have an idea of what it is I’m throwing at the beta testers before doing so.

Aside from the practical questions, I admit I also have an academic curiosity in this matter. I would like to be able to decompose and understand puzzles better in their constituent parts, for the pleasure of doing so on its own. (Even if the puzzle experience cannot be totally understood unless it is fully assembled in a specific context.) Not everyone shares this pleasure, and that’s fine.

4 Likes