Outlier votes

I don’t recall whether I’ve brought up this topic often or only occasionally, but it keeps bugging me how the IF Comp rankings are skewed by people who give out what are obviously bad faith votes. The Absence of Miriam Lane, average score 8.17, received no ratings below a 4… except for one person who gave it a 1. A Long Way to the Nearest Star received no ratings below a 6… except for one person who gave it a 1.

Does it matter for the final rankings? It does. Without that single voter giving the game a 1, A Long Way to the Nearest Star would certainly have won the IF Comp, and The Absence of Miriam Lane might have won it. Now I don’t claim that these games would have been more worthy winners than The Grown-Up Detective Agency – the two games I played were both very good, and I haven’t played the other one! But it just bugs me that one bad-faith judge can have so much influence over the final rankings, trumping the carefully calibrated votes of a dozen other judges.

And of course these ratings are bad-faith. It is impossible to believe in good faith that these pieces, which are at the very least highly competent, deserve the worst possible rating.

(I’m also kind of incensed about the four people giving 1’s, 2’s and 3’s to According to Cain, which you may or may not enjoy, but which is obviously an extremely well-crafted parser game that cannot conceivably merit such low grades. But here there are apparently four people who disagree with me, so perhaps I’m the crazy one?)

Isn’t there some kind of formula for determining the final scores that just leaves out these outlier votes? It seems a lot fairer and more fun for the authors.

16 Likes

Maybe some people “use all the numbers”, as recommended on the IFComp website, but just don’t use them in the recommended way 🙂

3 Likes

So the IFComp team has in the past screened for bad faith ballots. I’m not sure if that filter was applied this year, but I’m sure it’s a complicated decision to make. If someone has a more complex ballot, with a lot of normal ratings interspersed with a few inexplicable ones? I mean sometimes people are inexplicable. I know I had a negative reaction to elements of an entry this year that didn’t have anything to do with any objective criteria of craftedness.

Ultimately, IFComp score is what a random coterie of internet people felt about an entry, for all the petty and ridiculous reasons that make us who we are. I don’t know how you fix that at an institutional level. That’s why, even though the pageantry of competition is fun, I’m not sure it’s healthy to take the competition aspect too literally. There were a lot of great entries that didn’t make the top ten, c’est la vie, they’re still there to be recognized and valued.

12 Likes

My question is, how do you define bad faith? If it’s impossible to give a particular game a 1 in good faith, how about a 2? Or a 3? Would removing the 1 from the scale and rating from 2-10 fix the problem?

You could just knock off the single highest and single lowest rating for each game, which might help with this situation, but then you have the same problem if two people gave 1s (or 10s).
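To make the "knock off the highest and lowest" idea concrete, here is a minimal sketch of a trimmed mean. The ballots are invented for illustration, not real IFComp data, and `trimmed_mean` is a hypothetical helper, not anything the comp actually uses.

```python
from statistics import mean

def trimmed_mean(scores, k=1):
    """Average after dropping the k lowest and k highest scores."""
    if len(scores) <= 2 * k:
        raise ValueError("not enough scores to trim")
    kept = sorted(scores)[k:len(scores) - k]
    return mean(kept)

# Hypothetical ballots for one game (assumption, not real data).
one_outlier = [1, 6, 7, 8, 8, 9, 9, 10]
two_outliers = [1, 1, 6, 7, 8, 8, 9, 9, 10]

# With a single 1, trimming removes it and the average recovers.
print(mean(one_outlier), trimmed_mean(one_outlier))

# With two 1s, only one is trimmed, so the other still drags the score down.
print(mean(two_outliers), trimmed_mean(two_outliers))
```

Raising `k` trims more from each end, but it also throws away more sincere votes, which is the trade-off the post is pointing at.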

This is mostly a rhetorical question, but I don’t see any way to really measure a judge’s intentions like this. The best solution to outliers, imo, is to have more data. A larger number of reviewers means a single bad vote has less of an impact.
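As a rough illustration of why more votes help: under simple averaging, one vote moves the mean by the gap between that vote and everyone else's average, divided by the total number of votes. The numbers below are made up, and `shift_from_one_vote` is a hypothetical helper.

```python
def shift_from_one_vote(honest_mean, outlier, n_votes):
    """How far one outlier vote pulls down the average of n_votes ballots.

    honest_mean is the average of the other n_votes - 1 ballots.
    """
    with_outlier = (honest_mean * (n_votes - 1) + outlier) / n_votes
    return honest_mean - with_outlier

# Hypothetical case: everyone else averages 8.5 and one voter gives a 1.
for n in (13, 50, 100):
    print(n, round(shift_from_one_vote(8.5, 1, n), 3))
```

The shift shrinks in proportion to 1/n, so doubling the number of voters halves the damage a single ballot can do.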

14 Likes

And there were about 26% more votes this year? Almost 900 more? So we’re moving in the right direction there, at least…

7 Likes

I get what you mean, and I also need to preface by saying that I usually rate things between 3-5, so please don’t take this as me accepting culpability; I was not the outlier there.

However, I think there’s also an argument to be presented about how wildly-different perspectives would produce results that you would never believe, and still not be “bad-faith”, necessarily.

I don’t know anything about the games you listed, but there could be a hypothetical game where tons of effort obviously went into it, but its prose is so flowery that one player might consider it “obfuscating”. They would not have been able to enjoy or progress in the game enough to see all the effort that was put in. The wider community, meanwhile, might have eaten it up and loved every moment.

Another case could be a character with some kind of condition, disability, neurodivergence, etc. A larger audience might completely miss the character being misrepresented or grievously-stereotyped, but that one player who overlaps with the character in some way might feel rather angry that someone like them would be presented in such an incorrect way, and that the game’s success might even be a kind of misinformation.

(I feel like this forum isn’t likely to let such grievances go unnoticed, but I have heard stuff like this a lot in other places.)

A game might be completely inaccessible for technical reasons, such as not working with a screen reader, making gameplay very difficult, or even impossible to continue after already getting invested.

Someone might have missed a major solution to a puzzle, and now everything else built around that problem seems subjectively self-important in hindsight, through a frustrated lens.

A game might have been extremely character-driven, but the person playing the game has extreme difficulty with understanding people, so the entire game seemed meandering and impossible to follow.

A game might be constructed almost entirely on puns and niche references, and while most people might have just shrugged it off, by golly, user195052 spent every last ounce of patience trying to understand what was going on.

These are just some examples off the top of my head that I have heard from other people, or have experienced for myself before.

I, on principle, don’t rate things lower than 3, unless the game itself seems malicious in its design, which almost never happens. However, anyone with a different rating bias might easily be dishing out 1s and 2s without intending to be spiteful or have bad faith about it.

Again, I don’t know anything about the games you specifically listed, so even if none of the previous examples apply, there’s still the problem of unknown perspectives giving unexpected outcomes.

Here’s the really crucial bit:
I’m not saying they’re not bad-faith. They could be. However, assuming that every low score on an otherwise-spotless game is bad-faith also erases anyone who genuinely had a fundamentally-different experience, which would be impossible for you to have accounted for, and these experiences are also valuable.

EDIT: I originally wrote this assuming a scale of 1 to 5, where 5 is the best.

12 Likes

It is just the nature of our voting system that the outcomes are rather unstable with regard to individual voters. We show scores to two decimal places (“8.25”) but in some sense that’s fake precision. One person voting with a headache will shift that last digit a lot.

Years ago, we discussed the idea of moving to ranked-choice voting. That reduces the impact of outliers. Only the order of your ballot matters, so “1 5 5 5 10” has the same impact as “4 5 5 5 6”.
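The "only the order matters" point can be checked with a small sketch. This is not a full ranked-choice algorithm (the post does not say which one was considered); it only reduces a ballot to its pairwise preferences, which is the information an order-based method would see. The game labels and the `pairwise_preferences` helper are hypothetical.

```python
from itertools import combinations

def pairwise_preferences(ballot):
    """Map each pair of games to the one this voter scored higher (None on a tie)."""
    prefs = {}
    for a, b in combinations(sorted(ballot), 2):
        if ballot[a] > ballot[b]:
            prefs[(a, b)] = a
        elif ballot[a] < ballot[b]:
            prefs[(a, b)] = b
        else:
            prefs[(a, b)] = None
    return prefs

games = ["A", "B", "C", "D", "E"]  # hypothetical game labels
harsh = dict(zip(games, [1, 5, 5, 5, 10]))
mild = dict(zip(games, [4, 5, 5, 5, 6]))

# Both ballots put A last, E first, and tie B, C, D in the middle,
# so an order-based method cannot tell them apart.
print(pairwise_preferences(harsh) == pairwise_preferences(mild))
```

Under plain averaging, by contrast, the first ballot pulls game A down far more than the second one does.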

However, ranked-choice algorithms are way more confusing to explain than the simple averaging we do today. There’s also a bunch of minor variations of how to run them. (Which algorithm you pick can affect the outcome of the contest, which from the outside doesn’t seem much better than having single individual voters affect the outcome of the contest.) There hasn’t been much enthusiasm for going down this route.

11 Likes

This also bugged me, to the point where I wondered if maybe a few people thought a 1 was the highest rating. Could that be?

6 Likes

I remember when I ran a “review the reviewers” competition during IFComp one year I was really shocked at the votes coming in, because people were giving really low scores to what I considered great reviewers and high scores to others, but not in any pattern that looked malicious. And a lot of people with the weird votes were people I had known in the community for a while, and weren’t just random trolls. So I think sometimes people do just feel like giving inappropriately low scores in a way that’s hard to refute.

As for purposely cruel voting, in 2020 several people had ballots disqualified for malicious voting:

4 Likes

There was a controversy Back In The Day when a group of people suggested giving 1s to every Twine game, which is part of why we have the new Code of Conduct. But those sorts of ballots seem like they should be easy enough to filter out, and I imagine probably have been, if they actually happened.

6 Likes

What? Why?

4 Likes

There was a contingent that didn’t think choice-based games should be considered interactive fiction, and objected to having them in the competition at all. They wanted to send a message that those games weren’t welcome.

Which is why the Code of Conduct now has a specific note about this. (“Intfiction.org promotes a broad definition of Interactive Fiction. Don’t claim a type or style of game already accepted by the community doesn’t belong.”) Nobody is obligated to like a particular type of IF, but that doesn’t mean it doesn’t belong in the comp.

8 Likes

Oof. I guess change is hard-- it was hard for me to come around to choice-based games. Still, yikes. Thinking about the recent accusations of gatekeeping, I suppose now I understand a little better what may have been meant by that.

7 Likes

Yeah occasionally I get a search result that leads to a thread that touches on this stuff from the before-times, and it is wild to see the kind of things people felt comfortable saying. Very happy to have missed all that, and that the community as a whole and this board in particular did the work to become a more inclusive place (on its own terms and also because it’s hard to see IF being as vital as it currently is if choice and parser were hermetically sealed off from each other).

EDIT: on topic I’ll say that these kinds of low ratings do suck and are hard to take seriously, but at the same time the Comp does have safeguards to protect against obvious bad faith voting, which these didn’t trip, and I’m not sure there’s a process for discarding outliers that will feel transparent and fair. Better for us all to cultivate a Zen-like indifference to the vagaries of rankings, I suppose, though that’s easier said than done.

11 Likes

Did attitudes change, or did certain people just leave? Thinking of Planck’s Principle here.

3 Likes

It happened during GamerGate, which actually started (or had a defining event) because of a Twine game (Depression Quest). A lot of people across the internet that were involved in gaming of every kind were trying to out games that ‘weren’t real’, in addition to attacking women and LGBTQ+ authors. The most prominent Twine author at the time was Porpentine, and so she became a kind of lightning rod. A lot of people left around then, and for some reason IFComp started thriving after that in terms of number and quality of games (maybe not directly related, since that’s when jmac improved a lot of things, but I think there was a connection).

Edit: Several people did leave to start a different IF forum which had less moderation, but it didn’t take off.

14 Likes

Hard agree on the 1s being almost impossible to deliver in good faith for …Star and …Lane. It also bugged me that they were counted.

I’d advocate for this - certainly doing outlier removal is not novel or controversial as a concept! Of course one must document that it’s being done but I would be happy if the organizers would prune these outliers. If the IFComp guidelines provide a rubric stating that a 1 should be:

then we can safely disregard the 1s on either of the games listed; anybody who did so either did not read the rules, did not play the game, or read the rules and consciously decided that they were not going to follow them.
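For what it's worth, one standard way to do this kind of pruning is Tukey's fences: drop any score more than 1.5 times the interquartile range below the first quartile or above the third. The sketch below is only an illustration of that rule on invented ballots. It is not how IFComp tallies votes, and `prune_outliers` is a hypothetical helper.

```python
from statistics import quantiles

def prune_outliers(scores, k=1.5):
    """Keep only scores inside Tukey's fences [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = quantiles(scores, n=4)  # quartiles, Python 3.8+
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [s for s in scores if low <= s <= high]

# Hypothetical ballots: nothing below a 6 except a single 1.
ballots = [1, 6, 7, 7, 8, 8, 8, 9, 9, 9, 10, 10]
print(prune_outliers(ballots))
```

A rule like this would have to be documented up front, and it cuts both ways: a game with a genuinely polarized reception could lose sincere low votes too.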

e: funnily enough, going through the games in reverse order, I can’t really see any obvious 10 outliers that couldn’t be interpreted as somebody just really liking a specific game (like, Glimmer has a 10 despite placing 53rd, but it also has a 9; the distribution isn’t as obviously off).

4 Likes

My recollection is that the Twine Wars (if you will) predated GamerGate by a couple years – howling dogs came out in 2012, Depression Quest in 2013 (and while the success of howling dogs kicked the Twine Wars up a notch, they had already started by then, I think). But I agree that some of it came from a similar place of people being aggro about an influx of women and LGBTQ+ people into a space dominated by straight cis men, and Twine was kind of a proxy for that in the way that outrage over “walking simulators” sometimes was in mainstream gaming spaces.

“What is a Real Game?” discourse also generally involves people whose egos depend heavily on Being Good at Games, and they feel oddly threatened by the concept of games that aren’t meant to be challenging and the idea that someone might play those games and consider themselves “a gamer” even though they can’t beat a Soulsborne game, or an IF player even though they’ve never solved a 100-room maze with a light source puzzle and a hunger mechanic and instadeaths. (I kid, I kid, but there was definitely a strain of “IF should be challenging and puzzleless IF isn’t real IF” that predated Twine but ended up getting rolled into Twine discourse.) But this tends to be tied up with the whole “gatekeeping of women and minorities” thing in ways that are difficult to disentangle.

There was also one part of it that I think was specific to IF, and that was the anxiety of “I don’t like this type of game, and what if it drowns out the type of game I like and no one ever makes anything that caters to my tastes again???” – which was mostly a choice-based vs. parser-based issue, but I think it was also bolstered by the fact that early Twine games had a sort of “house style” that not everyone enjoyed, and that style got conflated with the medium even though there was nothing about it that was inherent to Twine.

6 Likes

In the private channel for IF Comp 2022 authors, I voiced my, ah, consternation about the person who gave Cain a score of 1.

The 2’s and 3’s I can rationalize as people not caring for the subject matter (the source material is, after all, sacred to some) or finding the length off-putting (although not as daunting as Jim Aiken’s game, amirite???) or what-not. There’s just some people who don’t like what you’re offering. I’ve dealt with this before, in different mediums.

But to me a score of 1 means “this game shouldn’t have been entered in the comp.” Yes, everyone’s allowed their own rubric, but I don’t see how a 1 could mean much else. For various reasons I won’t discuss in public, I have wondered about it being a bad-faith vote, although I’m aware that the comp has measures to prevent that.

I honestly do not want to make a Federal case about my situation. The Miss Congeniality award tells me my peers appreciate my work. And, unless I’m missing something, Cain was the second-highest rated parser game this year (right behind Sector 471).

That’s all gratifying and I’m happy where I landed. It’s disconcerting to hear outliers played such a big role in the ordering of the top three, though.

18 Likes

Honestly, Jim, this is The Main Thing. It sucks that someone was giving out these low scores, but I wouldn’t let it bug you, because at the end of the day, IFComp is judged mostly by random internet people, and some of them have axes to grind.

But your peers deciding you win? That’s a major thing to be proud of. Not to discount the fabulous games that placed so well and so deservedly in IFComp, but I’d rather win Miss Congeniality than win the comp.

9 Likes