Outlier votes

kamineko · November 21, 2022, 5:22pm

I can’t speak to your (or anyone else’s) decision to assign ratings of 2 or 3, since those presumably involve aesthetic evaluations of some sort. I wouldn’t presume to evaluate someone else’s aesthetic judgements. If you had given something a 1, I would be curious to hear about your reasoning. The reason would interest me more than the rating. That was what I hoped to express in my post.

Speaking only for myself, I doubt I will ever give another game a numerical rating unless it is a closed loop (i.e., what is my favorite Infocom game). I think I will continue to nominate ribbons for Spring Thing, since those nominations—like my own tastes—are transparently subjective. I have been uncomfortable with my ability to make quantitative judgements about art for some time.

Please note that I am only talking about my own capacities. People can and should do whatever they find enjoyable, competitions or otherwise.

zarf · November 21, 2022, 5:32pm

No.

aschultz · November 21, 2022, 5:49pm

Certainly over the years I’ve felt less comfortable giving public scores to games. I’d rather say, hey, this worked and this didn’t, to give information for future players (should I play this or not? Should I be prepared to zone this out or not?) or to authors who want to try the next thing (is a sequel worth it? What sort of ideas were overlooked that could be made into something new?)

The more I gave scores (e.g. reviewing old NES games way back when) the more I realized it wasn’t just hot air when teachers talked about grades being imperfect, and there was stuff beyond the grades. It felt like they were just trying to mollify the “losers” at some point. But I realized a lot of games in IFComp I gave 5 or less to, I didn’t want to stamp them with BELOW AVERAGE or whatever. Or it was hard to objectively compare short and long games. That said, we need some sort of metric, but all the same I’d rather keep my actual scores quiet.

On the flipside to teachers giving grades, I remember people being rightly upset the spent weeks on a term paper and got “A, nice job” with no comments, not even what they could do or should try next.

As to the original topic: people who act in bad faith, will. It stinks that what happened, happened. On the balance of things I’d rather let people get away with semi bad faith actions and it seeems like most of us would too.

I’m also glad raw voting data is not available as that could be reverse engineered rather easily in many cases. For instance, for me, I could not vote on 2 entries I tested, and along with abstaining from my own entries, I’d be de-anonymized quickly.

Draconis · November 21, 2022, 5:51pm

For a pathological case, imagine a game that everyone either loved or hated (no in between), which loses all its votes when you remove the 1s and the 10s.

For a less pathological case, a game that some people loved but had an average of 5.0 would lose all its 10s, while a game that some people loved but had an average of 5.5 would keep them. Which doesn’t seem fair.

AmandaB · November 21, 2022, 5:54pm

I could have written this statement. I judge comps because that’s what comps are for, but I feel uncomfortable doing it and I change my ratings a lot and then feel uneasy about it for a long time. I never rate games on IFDB (which is really hypocritical because I like getting ratings), but I just don’t feel like a number out of 5 stars is how I want to permanently judge people’s hard work. IFComp is better because 10 stars allows for more nuance, but I fiddled with my ratings there a lot, too, and was unhappy with most of them when the comp ended.

spellmotif · November 21, 2022, 6:01pm

Would it be possible/advisable to implement a (small) warning for when someone ranks a game a 1, along the lines of “this is the worst rating you can give a game, and implies you don’t think it should have been entered into the comp altogether. Was this intentional?” with like, a little checkbox to disable future warnings or something?

I also see a lot of math-related solutions in here, but I think judging is very messy and human. I can only speak for myself, but there are games that I ranked a 3 for example that have placed very well in the comp. Things that turned me off a game altogether or made it a slog were apparently genuinely very fun for people.

However, I do remember constantly tabbing back and forth between my ballot and both the example rubric and other people’s rubrics to make sure I was rating games fairly and consistently. Maybe including that on the ballot page somewhere with the option to access it easily on the same page might reduce confusion around the judging criteria. Or creating a (private) space for users to fill in what their personal rubric is on the page itself, for easy reference?

kamineko · November 21, 2022, 6:08pm

I had already sworn off rating games, but I felt committed to supporting ParserComp for a number of reasons, and felt strongly that I should rate the games. My experience was much like yours. I told myself afterward that I wouldn’t do it again. I nearly got swept up in the excitement of IF Comp, but fortunately between my AMFV thing (which has turned out to be pretty big!) and the state of my own game I’ve been very busy. In fact, I should be working on my game right now

In my brief stint teaching writing, I told my students: “Show up and do what I say, and you’ll get an A.” I don’t think it was as easy as it sounds! Rather than doing a lot of markups, I conferenced multiple times a semester. Evaluations were always face-to-face and qualitative. I tried to get things right, but judging the work of another is fraught with peril. You can never really know where someone is coming from, or what their situation is like.

zarf · November 21, 2022, 6:18pm

This is listed in a suggested voting plan, but it’s not part of the competition rules that that’s what it means. Not everybody means a 1 that way. For some people, 1 just means “I liked this game the least of any entry this year.”

Anyhow, what if everybody decided to eschew 1s? Then 2 would be the lowest rating and we’d be having the same discussion next year about 2s.

JoshGrams · November 21, 2022, 6:30pm

Yeah, the comp rules seem pretty clear to me that any kind of rubric is fine. And in general I don’t see how you can force a single rubric on any widely varied set of people like this. I do think a bunch of people here tend to see 1 as “shouldn’t have been entered” and 10 as “perfect; couldn’t have been improved” and thus don’t really use them. But I also know people who plot their overall feelings about all the games along a line and then vote at the end by making their favorite game the 10 mark and their least favorite the 1 mark and eyeballing all the scores based on that. And that also seems like a perfectly reasonable thing to do.

I really think that statistically and also “this is how arbitrarily-self-selected groups of human beings work” there’s just no answer other than that it takes a lot of people voting to make it more even and less arbitrary, and that you just end up pushing the uneven arbitrariness around, and that it’s a mistake to get too upset or too swollen-headed based on how you do, because you know it could have gone very differently if the breeze had been in a different direction or the moon hadn’t been waxing gibbous that day…

evouga · November 21, 2022, 6:49pm

As an aside: there have been some proposals to try to compute reliability ratings for each judge when aggregating crowdsourced votes, to downweight users who disagree with consensus of the other judges. (For an example of such a scheme, see for instance O’Donovan et al., Exploratory Font Selection Using Crowdsourced Attributes).

One of my students has been doing research on the properties of these schemes. Unfortunately, for many judges casting a sparse set of votes, they are catastrophically unstable: they identify a small set of “oligarchs” that are assigned infinite reliability weight, with the rest of the judges marked as “trolls” with zero weight.

The moral is that trying to compute reliability weights for voters jointly with a linear preference scale is a very hard problem and seemingly reasonable approaches can have surprising unintended consequence.

rovarsson · November 21, 2022, 6:57pm

This is what I admired very much in @jjmcc’s review thread: A transparent rating rubric that was very personal and at the same time applied thoroughly and consistently. It both conceded the point that ratings are subjective, and found some way close-to-objective to make the ratings make sense across the board.

(I, on the other hand, mostly went with my gut. Which went up and down and back and forth. Which is why I’m not happy about some of the ratings I’ve given, mostly in the 4/5/6 ranges. I want to think up a personalized-but-consistent rubric of my own next time.)

pinkunz · November 21, 2022, 7:14pm

Ever since I made my run-off suggestion earlier, I’ve had this idea I can’t stop turning over in my head.

I’m imagining a few hour long Speed-IF Jam that starts with the announcement of the theme, sort of like Ludum Dare. Then, voting would determine the top so many submissions, maybe the top 10? It would depend on the number of submissions.

Those top games would then go into a second round where the authors would be given time (A couple weeks? A month?) to flesh out the games and spend more time developing the idea.

After this development time, another round of voting decides who did the most with fleshing out their Speed-IF game.

I like this, because the first round isn’t only a vote for who you liked best, but also a direct vote for the game you would like to see more of.

It also flexes two very different skill sets for the authors, raising an interesting challenge. Someone adept at quick coding and rapid idea generation may excel in the first round, but then struggle with creating a more long-form game from this start. Conversely, there are probably some very accomplished authors that would struggle to qualify for the second round in the first place.

Was chewing on what to call it, and so far all I got is SeedComp (IntroComp is rightfully taken, lol).

Edit-to-add: I imagine each author’s final second round submissions being listed directly next to the Speed-IF they submitted in round one, enabling easy comparison between the two.

Edit-to-add×2: Erm, anyway, yeah, sorry for the distraction. Please carry on with the uncomfortably familiar discussion of bad-faith actors impacting the vote and how little we can objectively do about it.

The0didactus · November 21, 2022, 7:35pm

Yeah I think by the proffered metric my game would have done substantially worse.

Pretty much all my games have had high standard devs. I fear plans like this

lpsmith · November 21, 2022, 7:56pm

I feel like part of the issue here is that a lot of us are programmers or mathematicians or otherwise skew to the ‘objective’ end of things, and have a very strong and very understandable desire for the results of the competition to be an objective rating of the relative quality of all the games. The best game should come in first, the next best game should come next, etc.

But any competition is not really a way to determine this. It’s a game where the quality of the entry plays a large part, but is not the only factor involved. Sports games are not always won by the better team. Political races are not always won by the better candidate.

If you have a favorite team, and think that they sports better than the other sportsers, you want your team to win the Big Sports, but cannot be entirely surprised if they do not, because Life Happens. And that doesn’t mean Your Team is bad, or anything, it just means they happened to lose.

We have had this discussion since, and I kid you not, the 90’s. We never end up changing anything because in the end, the rules are clear and understandable to everyone, and because while the system can be gamed, any system can be gamed, and the problem is the person gaming the system, not the system itself. You put in checks to guard against ‘obvious’ cheating, and most importantly, try to foster a community that Doesn’t Do That Sort Of Thing.

Personally, back when I could play all the game in the comp (~30 games or so), I would sort them by my favorite to least favorite, and give about three 10’s, three 9’s, etc., meaning I’d always give out a handful of 1’s. This spread my ‘voting influence’ over the entire field somewhat evenly. I know of other people who wanted their vote to mostly count towards determining the winner, so would hand out a single 10, maybe a couple 8’s and 9’s, and give everyone else 5’s and under.

When the comp got too large for me to play every game any more, I felt less comfortable doing that, and started judging vaguely based on what I would have given a game in previous years. At this point, the number of judges has grown enough and my sense of how much it matters has decreased enough that I felt OK this year ‘using all the numbers’, even though I did not feel like there were any games that didn’t belong in the comp. On the contrary, I could feel the love and care put into every single game I played, even the ones I didn’t care for. It was a good comp!

So while there is a definite gestalt opinion that 1’s are ‘objectively terrible games’, and therefore would not give a 1 to any ‘honest entry’, there are a lot of other ways to vote out there, and in the end, the comp is a game. We pour gatorade on the winner, and move on with our lives. And I’ll echo what many many people have said in this conversation that’s been going on for over 20 years: in the end, the comp is more ‘about’ the attention and the reviews and the support than it is about the ranking. The ranking is the excuse we use to come together and celebrate the collective creation of art; the ranking isn’t ultimately what we’re here to celebrate.

vivdunstan · November 21, 2022, 8:02pm

I’m someone else who generally uses the full range of votes. Though this year my lowest score for IFComp was 2. I’m extremely uncomfortable with people questioning others generally who give 1 scores. I would hope that if there is genuine bad faith voting this might show up for the competition organisers, looking at all the scores a particular voter has voted in the competition. Whereas someone who uses the full range should have a better spread. But yup, I am rather uncomfortable with how some of this discussion has gone.

A_Wthrmnstr · November 21, 2022, 8:06pm

I don’t get it. People are voting differently than you would’ve voted so therefor we need to recalibrate the whole voting system?
You say some games are so competent, there’s no way they deserve a 1, 2, or 3. What if there was something in that game that was particularly offensive to that judge? His vote gets trashed because you disagree with it?

rileypb · November 21, 2022, 8:12pm

No one wants to trash any votes. People are merely throwing around ideas.

evouga · November 21, 2022, 8:15pm

Rather I’d characterize the concern as the following:

There is concern that some ‘1’ votes were cast in bad faith, and;
That because many voters to only assign scores in the range [4,10], those bad faith votes have outsized influence.

pinkunz · November 21, 2022, 8:25pm

That’s why I suggested run-offs. Not too many folks are parsing the difference between 53rd and 54th place, but it increases the sample size for the top entries, decreasing the individual impact of every individual vote, while also counting every vote.

MiloM · November 21, 2022, 8:29pm

I know that this isn’t the point of the thread but I just wanted to say while I like this and am already excited to enter the purely hypothetical SeedComp (it’s called that now, it’s in my headcanon), I do worry that it misses that what makes a good short story (or game) good is different from what makes a good long story (or game) good. I agree that it’s cool that it requires the authors to be good at both forms, but I don’t think that it would, artistically, be a good idea to expect the games to be good at both forms.

Also because I think I should acknowledge it: I don’t like the quantitative solutions that have been suggested. I think it’s a bit silly to try to numerically appraise art anyway, but the idea of cutting votes outside of a certain range seems to remove the whole point of public voting. I think the problem is that the sort of ‘illusion’ of the competition is that we’re looking for an objective ‘Best Game’ ™, but that is inherently at odds with the very idea of public voting, which is always basically just a popularity contest. IFComp voting is effectively just an internet poll and internet polls have never, ever lead to the best ‘objective’ choice. What they lead to (in all its horribly messiness) is democratic choices. I feel that the effect of any procedural method for dealing with votes would just limit this democratic stance, raising the question: why do we have public voting at all?

As for run offs: I think it would be too much work for the organisers and too much commitment from the judges. I think instead of getting a more concentrated judgement, you’d just get voter apathy and less interest in the comp as a whole. At the moment, win or lose, we’re all in it together. Segregating the comp into multiple tiers of success doesn’t seem to vibe with the…um…vibe.

Okay I hadn’t planned to say that much. Apologies for inevitably having said something someone else has already put more eloquently above. (and sorry for playing devil’s advocate a bit, it’s my whole personality)