Outlier votes

Historically I believe this has happened, though not to the extent of 60 accounts. Accounts can be traced by IP address, and it’s likely easy to review voters who only cast five votes to qualify and consistently boost one specific entry.

2 Likes

I gave a score of 1 to around three games: Traveller’s Log, 4 edith + 2 niki, and Tower of Plargh (I think; maybe one was a 2). To me they were much less enjoyable than the average comp game, and I think it was important to communicate that to the authors, since they entered a competition centered on feedback. It’s also important for them to know why, so I gave them 1 star on IFDB along with reviews.

I don’t think people are arguing about these kinds of ratings (especially since 25 other people gave one of those games a 1).

I do think the Miriam Lane vote was probably in bad taste. But I can imagine giving a very low rating to a game for ideological reasons. I unsubscribed from a podcast recently because the author was adapting a classic story which turned out to have the n-word in it, and they just casually recorded it and didn’t put a warning on the episode. I like to listen at home, and I don’t want my son wondering why he’s hearing the n-word. I even wrote the podcaster, and he said many people had complained and he was going to change his podcast to stop doing it.

Nothing in these games is as bad as that, but According to Cain, for instance (which I would have given a 10 if I hadn’t beta tested it) has very uncomfortable incest scenes. I think someone would be completely justified in 1-voting the game for the incest alone, since that would send a message about what kind of stories you want to see. But that kind of voting would be best accompanied by an actual review to explain the problem you have.

11 Likes

That’s really a good point. If you’re making a protest vote (against n-words, incest, or Texans) you should probably make sure the individual knows why you’re doing it. Otherwise, it’s just an unexplained shit SHOT out of the dark and does little to air your grievances, even during Festivus.

Edit: For obvious reasons.

5 Likes

And if you vote on enough games, like you and @DeusIrae do, powering through like you are frikking machines, you have a larger sample that warrants using the actual entire spread of scores. Someone who votes on all 70 games, gives five of them 1s, and spreads the other 65 across the rest of the range actually read the assignment.

Yeah, it’s weird for a highly-rated game to get a lone 1 when there are no other scores between 2 and 6, but an anomaly is not a pattern and can likely be chalked up to reviewer preference.

2 Likes

It’s kind of fun to imagine a reviewer who only gives out ones and tens

2 Likes

When I was a grad student, we had TA training, and one grad student said that he would only hand out 0’s and 10’s on assignments, because in math you’re either right or wrong and there’s no reason to give a score in the middle.

8 Likes

“The continuum hypothesis is true.”

4 Likes

So is the Axiom of Choice.

2 Likes

I mean, that’d be like “thumbs up/thumbs down” but IMHO doing that would be a bit disingenuous when the comp clearly asks for a rating from 1 to 10.

All - I appreciate the lively discussion, and for a few of you drawing my attention to it.

I will keep my remarks brief and do not intend to otherwise stay in the conversation, but I’d like to highlight a key point from the ifcomp.org website:

  • The website states that the competition organizers reserve the right to disqualify any ratings that appear to have been submitted under any other circumstances.

I personally spend many hours reviewing the votes. Going into detail as to how I go about this would just provide tips on how to game the system for the few people whose votes are not cast in good faith, but rest assured we do review.

Competition organizers have been discussing the approach to tabulating votes since the 1990s. No approach to voting will please everyone.

Prior to being the competition organizer, back when I did vote in the competition and reviewed games, I personally had been the outlier (or one of the outliers) at least a couple of times. I know that I played the games in good faith, and I know that I had a reason for my scores. The fact that very few people (or no one else) agreed with me is just how things go sometimes.

I’m not saying what we do is perfect, and we are willing to thoughtfully weigh suggestions. Please discuss, hypothesize, refine your ideas, and submit them through our post-competition survey.

Thank you for caring so much about the competition. That does mean a great deal to us.

— Jacqueline

25 Likes

I have been uncomfortable with my ability to make quantitative judgements about art for some time.

The rubric I’ve recently adopted is to consider “How likely am I to recommend this game?” This is still my subjective opinion but it takes into account my enthusiasm for the game, how close the game is to my personal taste, and a general assessment of how well-made the game is.

So my rankings are something like:

5 - Wouldn’t recommend to anyone, not confident enough it would be worth their time
6 - Would give a qualified recommendation to, say, an IF fan who asked about the game
7 - Would recommend the game to IF fans
8 - Would recommend the game to non-IF-playing friends I thought would like the game’s genre/story
9 - Would recommend to anyone who reads for pleasure
10 - Would recommend even to someone who doesn’t read for pleasure: “You still might love this”

(Ratings less than 5 are “how likely would I be to warn people against playing this game”)

7 Likes

I didn’t play nearest star until after the competition, but would have given it a “10” if I had (and I only gave one other “10”, to the game that won, as well as a 9 to the second-place game). Had I cast that vote, “nearest star” would have been in a dead tie for second. Every vote counts.

It is unfortunate that somebody was deliberately giving “1” to awesome games, although the other explanation is that someone doesn’t understand the scoring system and thinks “1” is the best possible score, which is not out of the realm of possibility.

3 Likes

This is an inevitable effect of sorting by the average score. In our Russian IF competition (called KONTIGR) we used the Schulze system (kind of) instead, which was more tolerant of outliers, though it required some knowledge of the method to understand the final scores (a higher entry threshold for those who are interested). If we had the list of votes, it would be quite easy to check what happens when that system is used instead. (Sorry, some links are in Russian only.)
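For anyone curious what that looks like in practice, here’s a minimal Python sketch of the standard Schulze method (the KONTIGR variant may have differed in details; the function name, ballot format, and example scores are my own invention). It turns score ballots into pairwise preferences and ranks entries by strongest-path wins, which is why a single stray 1 has little leverage:

```python
from itertools import permutations

def schulze_ranking(candidates, ballots):
    """Rank candidates with the (standard) Schulze method.

    candidates: list of entry names; ballots: list of dicts mapping
    an entry to the score that judge gave it (higher = better).
    A ballot may omit entries the judge did not rate.
    """
    # d[a][b]: number of ballots scoring a strictly above b
    d = {a: {b: 0 for b in candidates if b != a} for a in candidates}
    for ballot in ballots:
        for a, b in permutations(candidates, 2):
            if a in ballot and b in ballot and ballot[a] > ballot[b]:
                d[a][b] += 1

    # p[a][b]: strength of the strongest path from a to b
    # (a path's strength is its weakest pairwise win along the way)
    p = {a: {b: d[a][b] if d[a][b] > d[b][a] else 0
             for b in candidates if b != a} for a in candidates}
    for i in candidates:
        for j in candidates:
            if j == i:
                continue
            for k in candidates:
                if k == i or k == j:
                    continue
                p[j][k] = max(p[j][k], min(p[j][i], p[i][k]))

    # Rank by how many rivals each entry beats on strongest paths
    wins = {a: sum(p[a][b] > p[b][a] for b in candidates if b != a)
            for a in candidates}
    return sorted(candidates, key=lambda c: wins[c], reverse=True)

# Three entries, four judges' score ballots:
print(schulze_ranking(
    ["A", "B", "C"],
    [{"A": 9, "B": 5, "C": 1}, {"A": 8, "B": 7, "C": 6},
     {"B": 10, "A": 2, "C": 4}, {"C": 7, "A": 6, "B": 5}],
))  # -> ['A', 'B', 'C']
```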

But in the end, it is very important to respect the system in use and the results it produces when the organizers follow the rules. My comment is not about questioning the current approach.

3 Likes

I’ve only had a quick-ish skim through this massive thread so apologies if someone else has already made this point, but I didn’t see it in so many words.

Though it’s impossible to say what prompted any given person to rate a game low, there’s a real possibility that a 10 for one person could be a 1 for another due to something as simple as having played on different machines. I personally have put together two things that turned out to fail spectacularly on certain computers but not on others.

One was The Ten Million Invocations of Esnesnon 22/02/2022, which I’ve since fixed but which originally rendered that initial big “RECITE” button unclickable for some players: they simply couldn’t use it. The other was Spewnicorn 3310, which isn’t IF but does still illustrate the problem: on some devices/browsers (for reasons that escape me completely) vomiting downwards produces no upward thrust, which makes a fairly early level impossible to beat. I’ve even seen one person on YouTube pull it off in one level and then run into the problem in a later one in the same game.

I think that kind of problem is less likely in a text-based game, but Esnesnon shows it’s possible and it’s easy to imagine other situations that might cause one person difficulties that others simply don’t encounter. Maybe they’re using a screen reader and a neat text effect makes the words impossible to understand. Maybe it’s red text on green and that reviewer is colourblind. Maybe something just doesn’t work well on mobile. I’d hope that most low ratings of this sort would be accompanied by some text feedback and suspect plenty of others are simply given in bad faith, but I’d rather see the honest outliers reflected in the scores.

6 Likes

I’m mostly of the mind that things work well enough and don’t need to be fixed.

That said, one small and easy improvement would be to change the labels on the ballot page so that instead of 1, 2, 3, ... 9, 10 in the dropdown selector it would read 1 (worst), 2, 3, ... 9, 10 (best), eliminating the possibility of confused votes.

11 Likes

Two thoughts, mostly echoing other people:

–I don’t like striving for objectivity, because I have biases that impact my enjoyment, and I don’t want to discredit them. I do strive for consistency, which is why I try to review every game I score.

–I don’t consider the game that wins first place the “best” game. I consider it the game most people liked, if that makes sense. The Shawshank Redemption has been #1 on IMDb for years, but I don’t think I know anyone who says it’s their favorite movie ever. But most everyone thinks it’s great, and that has value to the consumer.

8 Likes

Yes, there are formulas and strategies. I’m no statistician, but I enjoy a good dataset. In statistics, this is a common problem: how to deal with the outlier.

This is why in public policy they talk about the median income or median housing price rather than the mean/average. There is always some outlier income or house in the quadrillions that would make an average meaningless, whereas the median just takes the middle value and is unmoved by extremes.
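Python’s standard library shows the difference in two lines (the income figures are invented):

```python
import statistics

incomes = [42_000, 48_000, 51_000, 55_000, 60_000, 10_000_000]
print(statistics.mean(incomes))    # 1709333.33... dragged up by one outlier
print(statistics.median(incomes))  # 53000.0, unaffected by the extreme value
```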

An outlier is well-defined in stats:

Extreme outliers tend to lie more than three times the interquartile range (below the first quartile or above the third quartile), and mild outliers lie between 1.5 and three times the interquartile range (below the first quartile or above the third quartile).

This comes from a great explainer article, “Outliers in Statistics: How to Find and Deal with Them in Your Data.”

And there are programmatic/functional ways to deal with outliers with tools as mundane as Excel or Google Sheets. One can use the QUARTILE() function to identify datasets with outliers and the TRIMMEAN() function to get a mean that excludes them. Here’s a useful article about it.
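If you’d rather script it than spreadsheet it, here’s a rough Python equivalent (the score list is made up; note that scipy’s trim_mean cuts the stated proportion from each tail, so trim_mean(x, 0.2) roughly matches Excel’s TRIMMEAN(x, 0.4), which splits its proportion between the two tails):

```python
import numpy as np
from scipy import stats

scores = np.array([7, 8, 8, 9, 9, 9, 10, 1])  # one suspicious low vote

# IQR fences, per the definition quoted above
q1, q3 = np.percentile(scores, [25, 75])
iqr = q3 - q1
mild = (scores < q1 - 1.5 * iqr) | (scores > q3 + 1.5 * iqr)
print(scores[mild])                  # [1] flagged as an outlier

# Trimmed mean: drop the top and bottom 20% before averaging
print(scores.mean())                 # 7.625
print(stats.trim_mean(scores, 0.2))  # 8.33..., the 1 no longer drags it down
```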

A uniform application of a statistically accepted practice eliminates any credible accusations of bias in calculating the winners.

Suppose for a moment I had 20 or so different accounts for the comp and played every game. From experience I could then tell which are the top ten contenders for the title. Each of my accounts would then rate around 15 games: five to eight which aren’t going to figure in the top spots anyway and which I can therefore rate fairly, six to nine which are a threat to the one I want to win and which get scores in the 3-5 range, and the one I’m pushing, which gets an 8-10. I would defeat all proposed measures for weeding out my votes and still determine the winner of the comp with a good probability.

PM me for Paypal details.

3 Likes

You’d still have to defeat @Jacqueline’s top secret fraud detection. There are plenty of others I’d cross before her. ¯\_(ツ)_/¯

6 Likes

And that’s an important point. Both the mean scores and the ranking of the games are highly uncertain.

Here’s an illustration why. In statistics, we typically use intervals to quantify the accuracy of estimates. Basically, instead of saying ‘the real score is X’, we say ‘our best guess of the real score is X, but it could easily be somewhere in the interval [Y, Z]’.

(The interval [Y, Z] typically used is called a 95% confidence interval, and has an exact definition, but that’s not really important here. The main thing is that it quantifies the uncertainty in the ‘estimate’ of the mean score. Short intervals are good, since that means the estimate is quite certain.)

Here is some math. A good approximation to the confidence interval is: (average score) ± (2 times the standard deviation) / √(number of votes). So for The Grown-Up Detective Agency, with 85 votes, an average score of 8.25 and a standard deviation of 1.35, the confidence interval is 8.25 ± 2 × 1.35 / √85 = [7.96, 8.54]. In other words, the uncertainty (also called the ‘margin of error’) in the 8.25 score is about ±0.29.
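In case anyone wants to check the arithmetic or run it for other games, here is that calculation in a few lines of Python (the figures are the ones quoted above):

```python
from math import sqrt

def margin_of_error(std_dev, n_votes):
    """Half-width of the approximate 95% confidence interval for the mean."""
    return 2 * std_dev / sqrt(n_votes)

mean, sd, n = 8.25, 1.35, 85  # The Grown-Up Detective Agency
moe = margin_of_error(sd, n)
print(f"{mean:.2f} ± {moe:.2f} -> [{mean - moe:.2f}, {mean + moe:.2f}]")
# 8.25 ± 0.29 -> [7.96, 8.54]
```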

So the two-digit (i.e., 0.01) precision is entirely ‘fake’! And this was for a game with very many votes.

The calculation demonstrates the inherent uncertainty in the scores (and, consequently, the rankings). Also note that these calculations assume the scores for a game are a simple random sample of the possible scores, i.e., that all judges used the randomiser to select which games to play and judge. Something which obviously isn’t true! So in the real world, the uncertainty in the ‘real score’ (e.g., the score a game would have gotten if all the judges – or all ‘typical IF players’ – had played it) is even greater.

In the end, I don’t think the odd outlier really matters. Just averaging the ratings gives quite reasonable estimates of the quality of the game. (For example, the top 10 games are probably pretty good games.) We should just be aware of the inherent uncertainty in the scores. (Even for a game with a hundred votes and a very low standard deviation of 1, the margin of error would be ±0.20.)

And even if the two decimals are ‘fake’, reporting the average to two decimals is reasonable, as it avoids having too many ties.

3 Likes