Outlier votes

… Anyway, I went off on a tangent up there, but one thing I think might help without opening the can of worms of “is this vote in good faith or not” is some kind of normalization for the number of ratings a game gets (though I don’t know how feasible this would be to implement). There are some games in this year’s comp that had 20 ratings, and others that had over 100, and the one random judge giving the game a 1 has a lot more impact on the former than the latter. A single person rating The Alchemist a 1, for example, would have dropped it 7 places in the rankings!
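(Back-of-the-envelope, with made-up numbers: an extra vote of 1 pulls a game's mean down by (mean - 1) / (n + 1), so it hits a 20-vote game roughly five times as hard as a 100-vote one. A quick sketch, purely illustrative:)

```python
# Rough illustration (made-up mean): how much a single stray 1 moves the average
# for a game with 20 ratings vs. one with 100 ratings.
def mean_drop(current_mean, n_votes, outlier=1):
    """Drop in the mean caused by adding one `outlier` vote."""
    new_mean = (current_mean * n_votes + outlier) / (n_votes + 1)
    return current_mean - new_mean

for n in (20, 100):
    print(f"{n} votes: mean drops by {mean_drop(7.5, n):.2f}")
# 20 votes: mean drops by 0.31
# 100 votes: mean drops by 0.06
```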

5 Likes

Outlier removal is used as a guard against honest wild opinions (or, in science, simple measurement error). It’s not great as a guard against malicious voting. If you document that the highest and lowest N scores will be dropped, you’re just telling bad actors how many friends to invite.

The comp admins consider apparently-malicious ballots on a case-by-case basis.

8 Likes

Let me preface this with saying that I don’t use statistics in my day-to-day, like, at all, and I’m not an expert. I studied it a little in college but that was a while ago. So, like…I dunno, I don’t really have any authority backing up what follows, either professionally or mathematically - so if I’m wrong about any of it, I mean, just tell me that I haven’t remembered enough statistics and I’ll nod and go “okay sure.”

Okay, that said, I think there’s probably a way to sand the most egregious stuff out that isn’t as easy to game as “dump the top/lowest X items”. I don’t have any specific suggestions off the top of my head but doing something like asking “given the mean and standard deviation, how likely is this result?” and flagging everything that seems sufficiently improbable seems doable.

I’m gonna use The Absence of Miriam Lane here as an example. It has 78 votes, a mean of 8.17, and a standard deviation of 1.59, as listed on the website. A rule of thumb I remember from my classes is that in a normal distribution, 99.7% should be within 3 standard deviations - so a vote of 1 is, like…uh, hold on. It’s more than 4 standard deviations off. Which is actually reasonable to expect if you’ve got a huge, huge, huge number of votes but we’ve got less than a huge number.

Obviously the scores aren’t actually normally distributed, I don’t have good knowledge of whether they should be (given that video game scores might have weird patterns), and this is all just off the top of my head; it’s just, I think that type of analysis could be reasonably done, and it would be reasonably defensible to say that anything that’s so obviously statistically improbable is suspect.

e: You know what, let’s see how much I remember. Okay.

So, assuming a normal distribution (lol) we have a mean of 8.17 and a standard deviation of 1.59.

Therefore the rating of 1 has a z-score of (8.17 - 1) / 1.59 = 4.51

According to this z-score chart I found on the internet, that’s…actually off the chart. The chart only goes up to 3.9 standard deviations, listing the probability at 3.9 as .00003. Fine, let’s just go with that?

Therefore, assuming the above, the probability of getting a 1 is less than .003%.

Okay, now…how can we ask “given that we have 78 votes, what’s the probability of at least one rating being a 1”? Argh. Ugh. I can’t remember this part. You invert it, right, and ask “What’s the probability of having zero 1s”?

Which is…(1 - .00003) ^ 78 = 0.99766270064?

Which means there’s about a quarter of a percent chance (assuming normal distribution & independent events) that, given 78 votes, at least one of them is a 1? Except the real value would be a lot lower, since I used the chart entry for 3.9 standard deviations and not 4.5.
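(If anyone wants to redo this without squinting at a chart: the same arithmetic in a few lines of Python, with the same simplifying assumptions - normal distribution, independent votes, and the mean/SD straight off the website:)

```python
import math

mean, sd, n_votes = 8.17, 1.59, 78

z = (mean - 1) / sd                           # z-score of a rating of 1 -> ~4.51
p_single = 0.5 * math.erfc(z / math.sqrt(2))  # normal tail probability  -> ~3.2e-06
p_at_least_one = 1 - (1 - p_single) ** n_votes
print(z, p_single, p_at_least_one)            # ~4.51, ~3.2e-06, ~0.00025
```

So using the actual 4.51 tail instead of the 3.9 chart entry, the chance of seeing at least one 1 in 78 votes comes out around 0.025%, not a quarter of a percent.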

Sorry, people, I don’t really know why I’m doing this here, I think I just nerd sniped myself. Apologies.

e2: Why the heck did they make me literally look up z-score charts in college? We had, like, graphing calculators, computers, like, this was well beyond the era of slide rules. What the heck.

3 Likes

Yeah, I hate to be a devil’s advocate, but I don’t see a way to separate good faith votes from bad ones. While I of course personally agree that your examples couldn’t reasonably deserve anything below a 5, it’s possible these games just really pissed someone off. (Some people just loathe well-made things.)

As Zarf points out, dropping outliers isn’t actually a very good guard against malice…it’s more like a guard against really wild opinions, but I think the comp has wisely decided to credit all opinions that aren’t obviously part of a hate campaign or something.

8 Likes

I think the magnitude of the adjustment that would be produced by some fix to the problem of unreasonable votes would be dwarfed by the inherent noisiness of the voting process itself. What if one of Brendan’s fans got sick and never gave it that 10-point rating? Big difference in the rankings. The composition of the judging pool is a much bigger factor than bad votes. There is no “real” ranking we can uncover by filtering out bad votes. We can just introduce another random element.

7 Likes

I dunno, I get the basic idea, but this definitely feels like a can of worms to me. Putting aside the fact that even if everyone’s casting their votes in good faith there’s no reason to expect they’d follow a normal distribution or any other well-behaved function – I think the idea of sharing the standard deviation is to provide an apples to apples measure of the variance in scores between the games, not to imply anything about what the ideal set of votes would look like – this would systematically wind up benefiting some games (less divisive ones) at the expense of others (more divisive ones).

(Like, this year’s Comp looks relatively well-behaved in the higher rankings, but in last year’s Comp, this kind of method would probably lead to low scores for Fine Felines and Funicular Simulator being thrown out, while Paradox Between Worlds would see low scores sticking around. Heck, I Contain Multitudes had a bunch of 10 votes almost 2 standard deviations away from the mean!)

If there was a reason to think that weirdo bad-faith votes were more likely to hit middle-of-the-road games, and less likely to hit more divisive ones, I guess this might be less of a problem, but if anything I’d think the reverse would be true – it’s just that this method is more likely to be able to detect shenanigans in the MOR case, and less able to in the golden-banana case.

Anyway I think this does suck for these authors, and if authors or others have reason to think there are bad-faith votes out there definitely that should be mentioned to the organizers. But I worry that any cure would be worse than the disease in terms of the overall trust in and transparency of the results.

8 Likes

If one of the 10s dropped out it’d still be first place; it’d take 4 of the 10s getting sick before it dropped to second. That said, yes, there’s a lot of noisiness even amongst clearly good faith voters.

As far as actually using a statistical system, yeah it’s definitely a can of worms and I don’t know enough to actually come up with a good argument or proposal about implementation; it’s just, I think you can definitely flag suspicious votes using that method.

I think part of the reason this is bothering me, though, is that if I look at The Absence of Miriam Lane and A Long Way to the Nearest Star, see the 1s listed, and plug them into the calculation, I think “They let you get away with that? You mean, I could do that?”

If I, personally, had had it out for Brendan Patrick Hennessy, I could have put in a 1 rating and that would have pulled it out of the top spot? (If it didn’t get filtered out it would have: adding a 1 would have bumped it to 2nd.) I mean, sure, I guess, everything’s pretty random, don’t take it so seriously, but - you know, if I’m looking at the scores and thinking “who would stop me from manipulating the votes?” then that itself damages my trust in the votes.

I don’t really agree? If you invert the question - can you make the votes more “fake” by adding bad votes - the answer seems likely to be yes. It wouldn’t need to be blatant, like adding 1s, but it doesn’t seem philosophically impossible or something.

If you’re saying that filtering is impractical and one shouldn’t try, sure, I can buy that as a practicality argument.

1 Like

As the number of games submitted to IFComp grows, this will further dilute the number of votes per game, increasing the chances of individual bad-faith votes affecting the rankings.

So… random thought… why don’t we do run-offs?

These outliers become less important the more votes per game, so we could use run-offs to increase the number of votes used to determine the final rankings.

Battle it out as we already do in a general vote, but then, for the top 20 slots, do a second-round run-off.

This would concentrate your existing pool of judges on fewer games, generating a larger sample of votes for each title, decreasing (but not outright discounting) the weight of apparent outliers.

This also has the benefit of being far simpler than a ranked-choice vote, and more immediately and intuitively understood by more people.

This would also give those who hadn’t previously played or voted on one or more of the titles in the top 20 a chance to play and vote on them in the second round.

If you wanted, you could even do a third round run-off for, let’s say, the top 5 of those 20 games, concentrating the voting power (and interest/tension/anticipation) of the entire internet community on determining which of the five deserves to walk away da big winner.

Seems like you’d get a more reliable result without the ickiness of discounting votes or the complications (perceived and/or real) of ranked choice voting.

3 Likes

It’s at moments like these that I like to mention Arrow’s Impossibility Theorem:

There is no fair way for more than one person to vote for more than two options.

Every possible voting method will have one of the following flaws:
- The relative rank between two options X and Y can be affected by votes on a third option Z, even if the scores of X and Y remain unchanged (like Independent party votes in the US)
- An option X might not be ranked higher than Y even if everyone prefers X over Y
- A single person makes all the decisions

Every method you can think of, including runoffs, fancy ranked methods, etc. will always have one of these flaws.
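(A toy illustration of the first flaw, using Borda count as the stand-in method and completely made-up ballots: with Z on the ballot, Y beats X; without Z, the same ballots put X above Y, even though nobody’s preference between X and Y changed.)

```python
# Made-up ballots: 3 voters rank X > Y > Z, 2 voters rank Y > Z > X.
ballots = [("X", "Y", "Z")] * 3 + [("Y", "Z", "X")] * 2

def borda(ballots, candidates):
    """Last place on a ballot gets 0 points, next gets 1, and so on."""
    scores = {c: 0 for c in candidates}
    for ballot in ballots:
        ranked = [c for c in ballot if c in candidates]
        for points, c in enumerate(reversed(ranked)):
            scores[c] += points
    return scores

print(borda(ballots, ["X", "Y", "Z"]))  # {'X': 6, 'Y': 7, 'Z': 2} -> Y beats X
print(borda(ballots, ["X", "Y"]))       # {'X': 3, 'Y': 2}         -> X beats Y
```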

If you do allow only a single judge, then all the other problems disappear, which is why the Ryan Veeder Quadrennial Exposition for Good Interactive Fiction has the best judging design. Also, if there are only two options, majority rules is ‘fair’.

9 Likes

Agreed, but the system is already flawed, as you just demonstrated. I’d like to think fairness is a spectrum and not a boolean.

4 Likes

But doesn’t the current voting system sidestep Arrow’s theorem, because it’s a cardinal system (you give a rating out of 10 to each entry) rather than a ranked one (you put all the games in an order from best to worst)? My layperson’s knowledge of Arrow’s Theorem is that it specifically applies only when you vote by ranking, which is where the first flaw raises its ugly head.

1 Like

I really like this idea personally. I think this might require @AmandaB to pimp out the comps a bit more, at least at first. You’d have a lot of people who might not realize there’s a runoff, and might cast their vote in the first stage and assume it’s done.

But as long as you can handle the inertia problem, I kinda like this idea. Also, if any trolls are casting bad-faith votes, then they might not bother to drop one in the second stage. If anyone had a genuine problem with a game, then either they cool off by the time the second stage starts, or it was actually a bad enough problem that the low vote occurs again and is actually deserved, regardless of what the majority might think. It still values all voices, but isn’t as “unstable”.

3 Likes

One way to handle the problem statistically is to use medians, rather than means. I think that might end up in some spectacular ties, though. (I’m saying this with irony: really, I’m just pointing out how hard this is.)
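(With made-up scores, just to show both halves of that: the median shrugs off a stray 1 that visibly dents the mean, but it also lands on the same round number for a lot of games, hence the ties.)

```python
from statistics import mean, median

scores = [8, 8, 7, 9, 8, 7, 8, 9, 8, 7]   # made-up ratings for one game
tampered = scores + [1]                   # the same game plus one stray 1

print(mean(scores), mean(tampered))       # 7.9 -> ~7.27: the mean takes the hit
print(median(scores), median(tampered))   # 8   -> 8:     the median doesn't budge
```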

I do think this is hard in practice. If you screen low scores for games that generally got higher scores, it’ll have the effect of boosting the scores for high-scoring games even more (because the 1s will get screened for games that score an average of 7-8, say, but not for games that score an average of 5-6). So it could potentially have harmful effects.

That said, 1s do seem anomalous in the voting system. It’s hard to credit that someone engaged fully with a game that they scored so low.

4 Likes

There is no “real” ranking we can uncover by filtering out bad votes. We can just introduce another random element.

I agree. I don’t see any cure here that isn’t worse than the disease.

I can get behind filtering out votes when there is evidence of outright fraud (i.e. an author using fake accounts to vote up their own submission and vote down others). Beyond that I don’t think it’s productive to try to police or second-guess voter intent. Note in particular that the competition rules explicitly reject a concrete rubric and do not require judges to score a game by how well-crafted it is; a judge could rank games according to emotional response to its themes while still exhibiting good faith.

I think it’s important not to lose the perspective that even in the best of circumstances, the ifcomp results are biased towards crowd-pleasers rather than games that take risks (in either form or content), and the final rankings correlate only very loosely with artistic merit or quality of craft. The list of extremely influential games that placed outside of the top 10 when entered into the comp is long indeed.

What about ranked pairs, instead of ranked choice? The judge marks that they’ve finished a game on the list, and then the website asks the judge if they liked it better or worse than each of (a random subset of) the games previously finished. (The ranked pairs could then be aggregated to compute a Tideman order).
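(Very roughly, the aggregation step could look something like the sketch below - textbook ranked pairs over a table of pairwise tallies, where prefer[(a, b)] counts the judges who liked a better than b. Ties, and pairs of games no judge ever compared, are entirely hand-waved.)

```python
from itertools import permutations

def ranked_pairs(games, prefer):
    """Tideman ranked pairs: prefer[(a, b)] = judges who liked a better than b."""
    def margin(a, b):
        return prefer.get((a, b), 0) - prefer.get((b, a), 0)

    # 1. Every head-to-head matchup with a clear winner, strongest margin first.
    pairs = sorted([(a, b) for a, b in permutations(games, 2) if margin(a, b) > 0],
                   key=lambda p: margin(*p), reverse=True)

    # 2. Lock results in one at a time, skipping any that would create a cycle.
    locked = set()

    def reaches(x, y):
        return any(b == y or reaches(b, y) for a, b in locked if a == x)

    for a, b in pairs:
        if not reaches(b, a):
            locked.add((a, b))

    # 3. A game ranks above everything it can reach in the locked graph.
    return sorted(games,
                  key=lambda g: sum(reaches(g, other) for other in games if other != g),
                  reverse=True)
```

The hard part in practice seems less the math than the coverage: most judges only finish a fraction of the entries, so plenty of pairs would have very thin (or zero) data behind them.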

11 Likes

I woke up to over thirty emails in my mailbox… turns out that thread blew up and I’m automatically notified. :smiley: Clearly, everyone who is pointing out the problems you’d introduce with different scoring systems is correct. I myself am tempted to think that doing something like

  1. calculate the mean using all votes
  2. remove all votes that are more than five points away from the mean
  3. recalculate the mean

would not be terrible (roughly what that looks like in code is sketched below). But I can see the arguments against introducing anything like this too. Possibly the best way to regain my Zen is by meditating on this piece of deep wisdom:
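(The sketch I mentioned above, with made-up votes - obviously not the real site code, just the three steps spelled out:)

```python
from statistics import mean

def trimmed_mean(votes, max_distance=5):
    """Steps 1-3: drop any vote more than `max_distance` from the raw mean, then recompute."""
    raw = mean(votes)
    kept = [v for v in votes if abs(v - raw) <= max_distance]
    return mean(kept)

print(trimmed_mean([8, 9, 7, 8, 9, 8, 1]))  # raw mean ~7.14, the 1 gets dropped -> ~8.17
```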

7 Likes

It would be interesting to see the results of IFComp 2022 based on these various ranking/scoring methods… but how would we all choose which method is best?

2 Likes

You can still do it automatically as an addition to the current voting system. No need to explain.

I think it’s hard to out-math these situations. Giving a game a “one” rating is not exactly evil genius stuff but implementing a filter for those cases will just encourage more elegant approaches, which will demand subtler detection methods, and so on.

The criteria for a “one” rating are surprisingly objective—it doesn’t seem to matter whether a judge even likes the game or not—so the ratings in question can’t really be about the games themselves. They might be about something as simple as “I hope my buddy wins”. They might be votes against rhetorical strategies, subjects, architectures.

I’m always curious about the motivations behind these actions. They may prove to be unknowable (this time, at least), but ultimately such problems are as much behavioral as they are mathematical (as in the case of the cabal that planned to give all Twine games bad scores).

7 Likes

Speaking as a first-time judge who wandered in cold, I took the advice to “use all the numbers” and “develop a personal rating criteria” seriously. If there is a cultural norm here that disproportionately biases scores of 1, 2, 3, I for one would want to align my criteria with that. I was not shy about giving out scores of 2 and 3, unfortunately. No 1’s at least! (Maybe stronger wording or cluing in the Judging Guidelines?) Certainly I hope I threw enough words out to at least justify my “good faith” bona fides.

Like many here, I’m not the person to find (if it’s even findable) an algorithm to detect ‘bad faith’. Decades ago, I ran an amateur fiction award where we used “Track and Field” scoring (rank the top 3, for 5, 3, and 1 points respectively), but here that would almost certainly punish longer games in an unacceptable way. It certainly had the (desired at the time, also probably not appropriate here) effect of mixing ‘popular’ in with ‘high quality’ as criteria.
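(For anyone curious, the tallying for that kind of scheme is about as simple as voting gets - each ballot just names a top three, worth 5, 3, and 1 points. A throwaway sketch:)

```python
from collections import Counter

POINTS = (5, 3, 1)  # 1st, 2nd, 3rd place on each ballot

def track_and_field(ballots):
    """Each ballot is a (first, second, third) tuple of titles."""
    totals = Counter()
    for ballot in ballots:
        for title, pts in zip(ballot, POINTS):
            totals[title] += pts
    return totals.most_common()  # titles sorted by total points, highest first
```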

8 Likes

Is the (anonymized) raw voting data publicly available?

6 Likes