Growing number of comp entries and reviewer selection bias

(I was inspired to post this by a Twitter thread – unsure what the etiquette is around whether I should link it.)

Lately there’s been some discussion about whether IFComp is too big, or whether it’s even possible for IFComp to be too big. The IFComp 2020 announcement blog post mentioned that 41% of survey respondents thought “the more games the merrier”, and 72% of respondents thought there weren’t too many games. Personally, though, I am a little concerned about it.

One worry I have is that, the more entries there are, the harder it’ll be to fairly compare their scores.

Let’s imagine two hypothetical IFComps: one where there’s ten entries, and one where there’s a thousand entries. In IFComp Ten, it’s safe to assume that most of the games are being played by most of the reviewers, so if you average everything together, you’ll get a roughly consistent rating scale. In IFComp Thousand, though, no one will be able to play all the games! So what people will do instead is scroll through the list of games until they find some whose titles/covers/blurbs look interesting, and play and rate those. (A few people might use the random shuffle and rate whatever comes up, but why would you bother playing a game that you don’t think you’ll enjoy?)

This creates a problem because different blurbs appeal to different people. Like, if one game’s blurb is “Join this teenager’s journey of self-discovery in a fifteen-minute work of interactive poetry”, and another game’s blurb is “Can you solve all two hours of parser-based math puzzles inspired by the Intel 80286 Programmer’s Reference Manual?”, these games will get rated by two almost entirely separate groups of people with very different ideas of how to score games. So the average score any given game gets will dramatically differ depending on which subgroup of judges its blurb/cover/etc manages to attract. You could imagine platform-based effects too, like maybe Windows users give higher numeric scores on average than Mac users.
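
To make the worry concrete, here is a minimal sketch with made-up numbers. It assumes two hypothetical judge groups that differ only in how generous their personal scale is, and that each group rates only the game whose blurb attracts it. The group names, offsets, and quality values are all assumptions for illustration, not IFComp data.

```python
from statistics import mean

# Hypothetical: both games have the same underlying quality on a 1-10 scale.
TRUE_QUALITY = {"poetry_game": 7.0, "math_puzzler": 7.0}

# Hypothetical: each group shifts every score by a fixed personal offset.
GROUP_OFFSET = {"lenient_judges": +1.0, "strict_judges": -1.0}

# Which group each blurb attracts (assumed to be completely separate audiences).
AUDIENCE = {"poetry_game": "lenient_judges", "math_puzzler": "strict_judges"}


def average_score(game, judges_per_game=20):
    """Average score a game receives when only its own audience rates it."""
    offset = GROUP_OFFSET[AUDIENCE[game]]
    scores = [TRUE_QUALITY[game] + offset for _ in range(judges_per_game)]
    return mean(scores)


averages = {game: average_score(game) for game in TRUE_QUALITY}
print(averages)
```

Under these assumptions the two equally good games end up two points apart, purely because of who showed up to rate them.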

It would be unfortunate, I think, for the comp to depend so heavily on “metagaming” like this. Does anyone know if this is currently the case, or if it’s likely to become the case if the comp gets even bigger? Heck, should I package up my next hypertext game as a Windows-only executable to increase its average review score?

5 Likes

i feel like this is perhaps resting on a few preconceptions about the competition that aren’t necessarily true. from my observations, and somewhat counter to the concerns:

  • there’s very little correlation between genre and score, and it’s therefore difficult to effectively pander through game theory or betting odds to what might be the highest-value focus.
  • there’s virtually no commercial or even career benefit to winning or even placing well in the comp. neither sales, nor visibility, nor XYZZY awards, not even long-term critical reputation, bear any correlation to top decile finishes.
  • different delivery systems than web don’t produce any significant change in scores, but it does reduce viewership, so it’s unlikely people would want to metagame in a way that guarantees few eyeballs even if they win.
  • people don’t ignore games they dislike at all. if they did, we’d see a whole lot less 1 votes. every voter can still deliver votes on every single game, and there’s absolutely no reason that publishing an 80286 manual or a 15-minute poetry game would result in anything better than a middle-of-the-pack finish for that reason. people aren’t voting in a vacuum, they’re not assigning scores based on absolute merit, they want their favorites to win.

if one wanted to exploit anything about the competition, it’s very unlikely to be the single most difficult part, which is spending a year plus designing, producing, and testing a game which almost by definition can’t even pay for sweat equity if it becomes orders of magnitude more successful than any other game in the comp.

8 Likes

While I get that there are different demographics here as to who is more likely to play a game based on its genre, blurb or platform accessibility, why would you assume they’re going to mark wildly differently based on that demographic? I’d assume there’d be harsher and more lenient markers in both groups.

4 Likes

Those are fair objections! I should clarify that I don’t think anyone is currently trying to game the comp by microtargeting blurbs, and that IFComp Thousand is an exaggerated scenario to illustrate the effect more clearly. (In IFComp Thousand, people can’t negatively rate every game but their favorites because pressing the 1 key that many times would induce a repetitive stress injury!)

3 Likes

Re: the effect of different people scoring different games, with little or less overlap, I have thought about this myself, but I assume people who’ve studied statistics at a tertiary level can probably clear this up (and I think there are a lot of them around here!). The upshot of my thinking on it is that, for instance, the meaning of me giving a 7 is always specific to me. But the meaning of person B giving a 7 is always specific to them, too. So as long as there are enough voters spread around the games (i.e. a large enough sample size) even if they’re playing fewer games in common, it doesn’t really affect things. Chance has a large part in whether one person plays all (or any… or many) games they’re likely to have given any particular score to, because even if I seek out and play things I think I’ll like, there’s no guarantee I will turn out to like them having played them, only a possible initial bias. I expect factors like these apply to every single person judging, and therefore they work out overall if there are enough people. That’s my lay thoughts on it. Though I do think the overall experience of the comp must be changing as, incrementally, judges have fewer games played in common by sheer weight of entries. The chance that you will not even have played the game that ends up winning is increasing, which may make you feel weird. I mean it felt a little weird to me when it started happening.
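
The "enough voters" intuition can be checked with a small simulation. This is only a sketch under stated assumptions: every judge has a personal offset (their own idea of what a 7 means), every judge picks games uniformly at random rather than by blurb, and the quality range, offsets, and noise levels are invented. Scores are not clamped to the 1-10 scale, to keep the code short.

```python
import random
from statistics import mean


def mean_abs_error(num_judges, num_games=100, games_per_judge=5, seed=0):
    """Average gap between each game's observed mean score and its true quality."""
    rng = random.Random(seed)
    quality = [rng.uniform(3, 8) for _ in range(num_games)]
    scores = [[] for _ in range(num_games)]
    for _ in range(num_judges):
        # Hypothetical: this judge's personal shift of the whole scale.
        personal_offset = rng.gauss(0, 1.5)
        # Assumption: games are picked uniformly at random, not by blurb.
        for game in rng.sample(range(num_games), games_per_judge):
            noise = rng.gauss(0, 1.0)  # taste and mood on the day
            scores[game].append(quality[game] + personal_offset + noise)
    # Games nobody happened to play are skipped.
    errors = [abs(mean(s) - q) for s, q in zip(scores, quality) if s]
    return mean(errors)


few_judges_error = mean_abs_error(num_judges=50)
many_judges_error = mean_abs_error(num_judges=2000)
print(few_judges_error, many_judges_error)
```

With random picking, personal scales wash out as the number of judges grows. The sketch says nothing about the case where judges choose games by blurb, which is the scenario the original post is worried about.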

-Wade

5 Likes

right, a fairer and more concise way to phrase it is that i think in most foreseeable futures, one still stands enormously more to gain from good-faith “metagaming,” or market research in other terms, than they do from manipulation in service of efficiency.

2 Likes

Even if there aren’t differences in overall grading strictness between groups, I think it’d still be possible for blurb targeting to have significant effects on how well a game places.

For example, if a blurb is a little vague about the contents of a game or who its target audience is, that could cause it to place lower just because more people outside the target audience are playing it. (If a game is only played by 10 people who really like it, it’ll place higher than a game that’s played by 10 people who really like it and 50 people who are kinda lukewarm.) In IFComp especially, where there are lots of competing genres and philosophies – parsers! choices! games should always be lighthearted! games should always be deep and literary! games should have hard puzzles! the player should never get stuck! – that effect could be pretty large.
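
As a rough worked example of the parenthetical above: only the head counts (10 and 50) come from the post, while the scores of 9 for "really like it" and 6 for "kinda lukewarm" are assumed values on IFComp's 1-10 scale.

```python
from statistics import mean

# Hypothetical scores; only the head counts (10 and 50) come from the post.
FAN_SCORE = 9
LUKEWARM_SCORE = 6

niche_only = [FAN_SCORE] * 10
fans_plus_lukewarm = [FAN_SCORE] * 10 + [LUKEWARM_SCORE] * 50

niche_average = mean(niche_only)
broad_average = mean(fans_plus_lukewarm)
print(niche_average, broad_average)
```

With these assumed scores the narrowly targeted game averages 9 and the broadly played one averages 6.5, even though the same ten fans liked both equally.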

Again, I have no evidence this is actually happening and I don’t think anyone’s deliberately doing this. But it might cause results to be weird someday, or create mildly unfortunate long-term trends like authors writing blurbs that are less literary/evocative and more boring/utilitarian.

4 Likes

Voting doesn’t happen in a vacuum. You’re hearing about games from other people, and especially as the competition goes on, people start playing games they’re hearing a positive buzz about. Given that I don’t think anyone cares too much about whether a game places 40th or 50th, and much more about whether a game places 1st or 11th, I think this should go a long way to alleviating your concerns. It’s probably the highest-placed games that have been played by the greatest variety of people.

10 Likes

I think this second example is where IFComp 1000 goes wrong, if it does.

I agree with Victor that if a math puzzle gets high scores from math lovers and interactive poetry gets high scores from the poetry lovers, it’s likely that “enough” people will post their reviews online, so people will notice that both games are getting buzz, and “enough” people will cross over and try high-buzz games, such that at least the Top 10 results will be basically cromulent.

But if a game is “misblurbed” such that it seems to appeal to poetry lovers but it’s “really” a math puzzle, the math lovers won’t even try it, and the poetry lovers will review it poorly. As a result, the game won’t get buzz in the first place.

I think this happened to Six Silver Bullets in 2018, which included a “content warning” that said:

Violence, foul language, extensive ruminations on death and free will. Numerous controversial mechanics: randomized combat, arbitrary death, players are encouraged not to savescum or undo actions

In fact, while the game included randomness, it was designed to be a fair puzzle, where you try to figure out what isn’t random. It ranked 31st in IFComp 2018, but it was subsequently nominated for two XYZZY Awards (Setting, Individual PC).

The blurb turned off the people who would like it, and appealed to people who didn’t, which prevented the game from developing buzz in time for the competition.

7 Likes

I often think the voting system of IMDb is flawed in that it invites users to give a score for a movie without guidance as to what it means. Some movies I’d like to rate 9 out of 10 or 10 out of 10 for cinematography or for artistry or for acting, but I still rate them low because I found the movie too pretentious and not half as clever or profound as it was claiming to be (The Lovely Bones).

Other films don’t try at all to be profound and yet I’d give them a 10 out of 10 simply because I was entertained (The Goonies, Die Hard 1, Robocop for example).

Other films have artistry and entertainment scores that are largely equivalent, such as Vertigo, Rear Window, Shawshank, etc.

The point is here that some people who are judging think that this is an entertainment competition. Others may think it’s an artistry competition.

Games probably have to appeal to both biases in order to finish top. I’m not sure that serves the exceptionally artistic but perhaps a bit sad or the exceptionally entertaining but perhaps a bit badly written games well.

I’m still confused if this is primarily a creative writing competition or a puzzle box creation competition?

Judges need to know what they should be judging on. Art or entertainment?

I’d rate suicide stories or tales of depression 1 out of 10 all day if rating on entertainment. But if rating on art, maybe could be an 8 or a 9. So if 80% of active judges are rating on entertainment you have a balance issue, and vice versa.

2 Likes

There are all kinds of possible problems, where for one reason or another IF Comp results seriously distort some underlying measure of quality. But are these actual problems? If we had strong evidence that a particular type of game was always winning, and another always losing, even though an equally good case for the merits of both could be made… then, yeah, maybe there would be a problem. But I’m not seeing this at all. What I’m seeing is that top placing IF Comp entries tend to be incredibly diverse. So, take the 2019 top 10: there are choice and parser games, comedy and tragedy, puzzle and non-puzzle, almost-no-story and story-heavy, and so on.

5 Likes

(Not sure if this is on the edge of this topic, and if I should create a new topic, since this post is about how to bring down the growing number of comp entries voluntarily.)

If IFComp becomes sufficiently large, I guess the least played games will end up getting less attention than Spring Thing games. But at the moment IFComp games get much more attention. E.g. on IFDB the 20 Spring Thing 2019 games have a total of 104 ratings, whereas the 20 most rated IFComp games of 2019 have 316 ratings. Add to this that all the IFComp 2019 games have so far generated 750 ratings, and it shows, at least on IFDB, that IFComp games get a lot more attention.
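
For a per-game view of those IFDB figures, a quick calculation using only the counts quoted above (the variable names are just for illustration):

```python
# Figures quoted in the post above (IFDB rating counts).
spring_thing_ratings, spring_thing_games = 104, 20
ifcomp_top_ratings, ifcomp_top_games = 316, 20

spring_thing_per_game = spring_thing_ratings / spring_thing_games
ifcomp_top_per_game = ifcomp_top_ratings / ifcomp_top_games
ratio = ifcomp_top_ratings / spring_thing_ratings
print(spring_thing_per_game, ifcomp_top_per_game, round(ratio, 2))
```

That works out to about 5.2 ratings per Spring Thing game against 15.8 for each of the 20 most rated IFComp games, roughly a threefold difference.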

I like Spring Thing a lot but if only a fraction(?) of the IF community is into it(?), perhaps it would be better to have an “IF Spring Comp” and an “IF Fall Comp”. They could have different rules regarding the 2 hour rule, so that the spring comp still encourages long games.

As an author, I prefer IFComp, even if I wrote a long game, because of all the feedback, not least the statistics, which show how many gave this and that score. Then you can see how many people actually played your game. If Spring Thing became more like IFComp, I think it would become more popular and thus many “long game”-authors would probably move their games to the spring comp, which would be good for both competitions.

6 Likes

I think supporting Spring Thing with more attention is one of the most important things the community could do. I’ve been very bad at this myself too, alas.

13 Likes

Yes. 😛

The thing with IFComp is that it is fundamentally a popularity contest. The game that wins is the one that is the most liked by the most people. It doesn’t matter why they like it.

6 Likes

I think it is good that people can use their own system for voting. I don’t think it makes sense to tell people to rate something like art or to rate something like entertainment. If I don’t like a game, I don’t find it entertaining and I don’t find it to be great art either. If either I regard it as good art or as good entertainment (or both), I would rate it higher.

6 Likes

Rating systems are always unfair. I just hope the IFComp would have enough volunteers to manage a thousand entries.

3 Likes

These discussions always look weird to me because the arguments assume that the judges are frail creatures who cannot apply the kind of advanced reasoning that the participants bring to the issue.

Judges know that both types of games are here, and I’m willing to believe that they take that into account when assigning ratings.

And the overall results will reflect the preferences of the larger group of people. That sounds like direct democracy.

The following concern seems weird to me:

I’m reading that as “a game with strong niche appeal might place poorly if it’s considered by a broader audience,” which does not sound like a bad thing.

I am okay with this. It means that developers are encouraged to create entries that appeal to the largest number of people, which seems more likely to grow the audience of people who notice Interactive Fiction.

Why are people uncomfortable with the idea that the IFComp results may not reflect their personal preferences? It’s okay to like games that don’t win, and it’s okay to disagree with the final results.

8 Likes

Spring Thing seems to be getting slightly more attention each year. For instance, The Missing Ring ended up getting a lot of XYZZY nominations last year.

I’ve thought that if I ever win IFComp I’d probably put all of my future ‘big’ games in Spring Thing instead. (But given the ever-increasing number and quality of IFComp games, that’s fairly unlikely!)

3 Likes

yes, i think if there’s a Big Problem here that’s probably it – at present, IFComp is so important, it more or less blocks off about a quarter of the entire year as largely pointless for any non-IFComp release. to me it seems like the clear answer there is diversification of both voters and entrants into smaller and more experimental competitions, but that’s not something one can force either.

2 Likes

I think the fact that, with 102 games in this competition, “anyone off the street can play 5% of the games and that counts as having discharged their duty” is a lot more of an issue. There’s a lot of talk about what the writers are doing wrong or right, whether or not blurbs are important, what the judges are playing, and what everyone’s preference is, and all of that would be addressed a lot better if we required people to actually play and judge a significant number of games. When I first entered the competition in 2013 there were 35 games; 5 games out of that is a lot more significant than the pathetic 5 games out of 102 that’s the bare minimum we are requiring judges to do.
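
To put numbers on the 5-of-35 versus 5-of-102 comparison, here is a small sketch. The coverage figure follows directly from the post. The overlap figure is an added assumption: it treats each judge as picking their 5 games uniformly at random, which real judges do not do, so read it as a rough baseline only.

```python
def share_played(games_played, total_games):
    """Fraction of the field a judge covers by playing the minimum."""
    return games_played / total_games


def expected_shared_games(games_played, total_games):
    """Expected games two judges have in common, assuming uniform random picks."""
    return games_played * games_played / total_games


for total in (35, 102):
    print(
        total,
        round(share_played(5, total), 3),
        round(expected_shared_games(5, total), 3),
    )
```

Under that random-pick assumption, two minimum-effort judges in a 35-game comp share about 0.71 games on average, while in a 102-game comp they share about 0.25.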

1 Like