A little rubric chat

Well, it was my first year being an IFComp judge, which feels weird, for having been on the forum almost every day since mid-'19. First judging and reviewing of any comp, as well, but it’s the IFComp scoring that’s pertinent at the moment.
Rather than frame all of this coherently first and write it afterwards, I’m just going to babble.
I think the rubric I used was… sort of close to the standard proposed on the IFComp site? With maybe the caveat that I was treating 10 (and even 9) as a bit too much of a Holy Grail, like rating comp games against games of all time.
I feel like in the end, it’s not as helpful for comparing games to each other within the same comp. As in, if a game was respectable at all, it should get at least a 6 so the author doesn’t get their feelings hurt by seeing their game rated as a 4. But there are a dozen other games rated as a 6 that are distinctly better than this one, which don’t quite feel like a 7.
What I find myself wishing to do is to set the [perfectly respectable work; no egregious bugs; no egregious typography; author put in the work necessary to make this a functioning playable game] bar at something like a 3, so that the gradations of extra merit are more evident.
One rationale for this: how important is it, really, to finely grade the exact level of sub-par work? If we presume that every passable effort should start at a 5 or a 6, how important is it to determine whether an obviously inferior entry gets a 3 or a 2?
On the other hand, if I award a 3 for a game that I consider acceptably made even if not meriting any special attention, I feel like that can have no other effect than to make the author wonder why some judge unjustly trashed their game with an “abysmal” score.
So I find myself wishing one of two things: either that authors could know that “decent work” starts from a 3 and goes up from there, or else I could have those decimals that I talked about, so that if numbers 1-10 are supposed to rate quality as compared to games of all time, I can distinguish the comp games between 5-8 with tenths of a point to indicate which are better than others.

I’ve heard others mention “use all the numbers.” For those that do, where does “decent work” start on that scale for you? Again, my main takeaway is that it feels much less important to me to nitpick about how poorer entries rank against each other than to have a larger range of comparison for the games with merits. (At first blush, it feels sufficient in my mind to say “1: Game for one reason or another was clearly not comp-ready. Bugs or typography or content were either woefully unrefined/undertested or distasteful. 2: Game had some promise but had clear shortcomings that disqualified it from even an ‘acceptable/mediocre’ ranking. 3: From here on we talk about how far a game rises past basic muster and mediocrity…”)

7 Likes

I think extending your upper range could help more. The official rubric suggests the 9 is just a slight gradation above 7 and 8.

A lot of people will go back after and take their favorite game and make it a 10, even if it doesn’t necessarily reach a 10 in their personal scale of all games of all time.

One issue is that the average IFComp game has gone up in quality. If you look at IFComp 2009, their bottom half of games often struggle quite a bit or are very buggy. But our bottom half has some great games.

For me personally, for scores of 1-5, I think if you’re giving more than 2*(value) number of scores of that value, it’s probably too harsh. So more than 2 1’s, more than 6 3’s, etc. seems like a lot to me when you have 70-100 games.
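
For what it's worth, that rule of thumb is easy to check against a ballot. Here is a rough Python sketch; the function name `harsh_values` and the sample ballot are made up for illustration, and it only encodes the "more than 2*(value)" idea described above, nothing official.

```python
from collections import Counter

def harsh_values(scores):
    """Return the low values (1-5) that were given out more than 2*value times."""
    counts = Counter(scores)
    return [value for value in range(1, 6) if counts[value] > 2 * value]

# Hypothetical ballot: three 1s is over the limit of two, two 3s is well under six.
ballot = [1, 1, 1, 3, 3, 5, 6, 6, 7, 7, 8]
print(harsh_values(ballot))
```

An empty list would mean the ballot stays inside the rule of thumb for every low value.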

But… if you’re consistent, it takes the sting off. Getting a three sucks but if a lot of other people got a three you can say “okay, it wasn’t just me”.

9 Likes

Yeah, I guess the point was to envision a situation where a 3 didn’t “suck” but was instead like the baseline “You did pretty good,” but I’m aware that a new outlook on the rubric is not likely to take over the scene, so authors would continue to just feel like they were getting blasted at the sight of any 3.
Which leads me back to wishing for my decimal points, because the justice side of me wants to rate games against each other, not just “a bunch of 6s, a bunch of 7s, and some eights, with maybe a nine.”

4 Likes

I didn’t write reviews this year but I did last year, and I found it immensely helpful to break things down into 4-5 categories and award points based on how well a particular game did in that category. For example, something with good “Gameplay” will get two points in that category, other games may get a 1, and something with significant gameplay issues would get a 0. With one category usually being “because I liked it” this means most games will score a minimum of 5 but I also have a lot more confidence in the final score. It’s so hard for me to look at a game in its entirety and say “Ah yes, this is a 7” that I don’t think I could do it another way.
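
To make the category approach concrete, here is a small Python sketch of how it could be tallied. Only “Gameplay” and “because I liked it” come from the description above; the other category names and the sample points are assumptions for the sake of the example.

```python
# Five categories worth 0-2 points each, for a maximum of 10.
# "writing", "polish" and "originality" are hypothetical category names.
CATEGORIES = ("gameplay", "writing", "polish", "originality", "because I liked it")

def total_score(points):
    """Add up the per-category points after checking each is 0, 1 or 2."""
    for name in CATEGORIES:
        if points[name] not in (0, 1, 2):
            raise ValueError(f"{name} must be 0, 1 or 2")
    return sum(points[name] for name in CATEGORIES)

# Hypothetical game: strong gameplay, decent everywhere else.
example = {
    "gameplay": 2,
    "writing": 1,
    "polish": 1,
    "originality": 1,
    "because I liked it": 2,
}
print(total_score(example))
```

Note that a game scoring 0 in every category would total 0, which a judge would presumably bump to the lowest score the ballot allows.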

10 Likes

I’m generally very hesitant to give anything a score of 1 or 2, and I don’t automatically give a 9 or 10 unless something blows me away.

Perhaps this is a relic of judging IFComp when there were more broken games, and I remember them, and I don’t want to equate anything with that without evidence.

But you’re right–providing a relatively hard floor of 3 means giving a weak game a 3 and a very weak game a 3. And that’s a problem.

But 1) you don’t have to answer directly to the authors for that and 2) I’ve found this problem presents itself for all ranges. Looking at the entries I gave, say, a 6 to, I still feel the ones at the high end are superior to the ones at the low end. The cutoff has to be somewhere!

And yes, that would be harsh if I were the only person judging. Thankfully I am not! And it can be tough to remember, even if other people thought a lot like me, they’d have a coin flip for if game X deserved 6 and game Y deserved 5, or vice versa, and that’d balance out.

That said, shortening the rating range does potentially make high-6 more different from low-6. I just don’t feel right, though, giving something that does lots of basic things right a 2. They may only be basic things.

I’m also not big on sticking too much to a mean score of 5.5 as a perfect average. 6 is more like it from me–it’s a bit arbitrary, but my scores seem to cluster around there anyway.

I like going with scores from 1-100, not because I think IFComp itself should have the decimal points (this would probably drive a lot of judges batty), but because it gives me some fudge factor to rate entries without rating them against each other, at first. Then I can round off later, without saying “Really? Game X gets the same rating as Y, by how I voted? Crazy.”
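
The rounding-off step could look something like the Python sketch below. The exact rounding rule is an assumption (divide by ten, round halves up, never go below 1); the post above does not say how the rounding is actually done.

```python
def to_comp_score(fine_score):
    """Convert a private 1-100 score to a 1-10 ballot score (assumed rule: round half up)."""
    if not 1 <= fine_score <= 100:
        raise ValueError("fine_score must be between 1 and 100")
    # Adding 5 before integer division rounds halves up; max() keeps the floor at 1.
    return max(1, (fine_score + 5) // 10)

# Hypothetical scores: 64 and 65 sit next to each other but land on different sides.
print(to_comp_score(64), to_comp_score(65))
```

The fine-grained scores still record which games sat at the high or low end of each rounded score.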

I’ve come to accept I’m part of the random deviation, and like a lot of other people I allow an extra point or two for really subjective stuff, not for a whim, but as an admission I don’t understand everything, including why I feel the way I do. (Sometimes I deduct, too. Two entries that placed very well indeed, I thought were pandering, and I marked them down. Obviously I want to keep them anonymous, but it does happen. And it didn’t happen too often. I felt guilty about that at first, but I realize this probably isn’t a vindictiveness thing now.)

I’ve found a lot of worries I had about my own (very hazy) rubric were washed away with time–as a competitor I find I don’t really even remember my average scores! Also looking back on some scores I gave, I think ouch, I was too harsh there–but I’m confident other judges were lenient.

And I think in particular just giving myself time to revisit scores later, with some basic notes, helps a lot. 80-90% of the time I am perfectly okay with the original score I gave, but there will be occasional bumps up and down. And just seeing that as a pattern helps me to think, hey, I have thought about things enough, and maybe I could think of it more, but it has serious diminishing returns to scale.

These are all points worth thinking of but I found when I went in too deep that ruined my enjoyment of other games or made it hard for me to make any concrete judgements.

5 Likes

I used 5 as my score for “average” games—ones that were decent on the whole, but still had some issues. So then 1-4 were gradations for “worse than average” games, and 6-10 were gradations for better than average.

I did this as well, but with three categories, and point values of 0-3 for those categories (with possibility of giving a bonus point to get a game to a total score of 10). I felt like it worked well, and I felt confident about all the scores I gave, but it was interesting to see how much others’ scores varied from mine!

7 Likes

I tend to just go by feel on a 5 “star” scale and then doubling it and often bumping it up or down one.

But it’s always cool seeing people score based on several criteria and then adding it up.

Something I keep thinking about is going fully subjective: taking a big wide sheet of paper, plotting “how I feel about this game” on an unmarked “number line” across it (shades of mapping parser games and where do you start so you don’t run off one edge of the paper or the other?), and then drawing a 10-point scale when I’m done judging, with my least favorite game at a 1 and favorite one at a 10 and rounding scores to the nearest point in between. I’ve never gotten around to actually trying it and I’m not entirely sure if I’d be comfortable actually judging like that. But I think it’d be an interesting experiment even if I didn’t use those as my final numbers. Maybe one of these years…
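
For the curious, the "draw the scale afterwards" step amounts to a linear rescale. Here is a Python sketch under some assumptions: positions on the paper are measured in any unit (say, centimetres from the left edge), halves round up, and the game names and numbers are invented.

```python
def rescale(positions):
    """Map number-line positions to 1-10, least favorite at 1 and favorite at 10."""
    lo, hi = min(positions.values()), max(positions.values())
    if hi == lo:
        raise ValueError("need at least two distinct positions")
    # Linear stretch onto 1..10, then round to the nearest point (halves round up).
    return {
        game: int(1 + 9 * (x - lo) / (hi - lo) + 0.5)
        for game, x in positions.items()
    }

# Hypothetical measurements in centimetres from the left edge of the paper.
marks = {"Game A": 2.0, "Game B": 20.0, "Game C": 11.0}
print(rescale(marks))
```

Because the scale is drawn after the fact, it doesn't matter where on the paper the first game was plotted.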

8 Likes

Go players are traditionally ranked on a scale that runs from 25 kyu on the low end up to 1 kyu, followed by increasing ranks of 1-9 dan. There’s no strict lower limit to how bad you can be, which maybe avoids the problem of having unused low ranks on the 1-10 scale, and because even a low dan rank is an accomplishment, there’s a clear home for respectable but not masterful players.

2 Likes

Mmmh. Reading these posts I now understand how the comp was judged this year and why some games that I think of like “good games” had been graded so low.

I’m used to this (and don’t think I will ever change, as it is hardwired in me due to all the grades of my life, given or taken):

1-5 insufficient.
1/2/3 are all very bad votes and probably no one would get them anymore due to the quality of games in IFComp
4 bad game. Missed the point, is not funny, it’s handled badly in general.
5 oh well, it’s not THAT bad, but still not enough to be worthwhile. Maybe this game can get better post-comp. Hardly the same can be said with lower numbers.

6 sufficient
Not the work of a lifetime, but “ok”. “It works” we say as teachers when there is nothing plainly good in one’s work but neither anything inherently bad.

7-8 good
Satisfying. Somehow memorable. Probably not eternal but very good.

9 exceptional
It is getting close to a masterpiece. Or it is but something rubbed me the wrong way. Or, simply, not a masterpiece.

10
Anchorhead. Counterfeit Monkey. Leather Goddesses of Phobos. Add yours.

6 Likes

To be clear, I did not use the “3 as a baseline for an acceptable rating” mentality this year. Almost all of my ratings were 5-8. But my rubric left me feeling like I wanted more gradation among games mediocre and above.
I feel like assuming a 5 (or worse, a 6) as an “average game” just ends up reducing me to giving things a 1-4 rating, which is far less satisfying as a judge, because most games are going to meet the qualifications to be considered mediocre with some promise. And I still feel that it’s not that interesting or important to categorize the exact badness of games that are generally recognized as inferior. So the whole 1-4/5 region of the scale feels kind of wasted to me.
Just sayin’… I wish I could express my degrees of appreciation for a work more minutely without making the less-amazing efforts feel like they’re getting attacked with unfairly poor ratings.
And others may be right in saying, that if I use the scale I’m talking about, no one’s going to know, and the stats will balance themselves out anyway, but as an author I know how it feels to see those subpar numbers and just wonder why your game soured somebody so badly. So I feel hesitant about throwing anonymous 3s at somebody’s respectable work, cause they won’t know that it means a 5.5/6 according to the old rubric.

3 Likes

The Go ratings were interesting… I’m almost tempted to imagine a system where every numerical rating indicates the game was worthy of entering, or else it can simply be marked as “falling short” or “grossly falling short”. Then you get 10 whole numbers to talk about how good the game was :slightly_smiling_face:

3 Likes

I would surely appreciate it (although we may bump into a troll entry anyway, sooner or later).
My “worries” are all around the fact that if no one knows, a 4 will never be taken as a fairly good grade but very low. It’s all about perception.

As an example: for thesis grades at the university, we have to score a thesis 0 to 5. Zero means “ok” and the rest goes from “exceptional” to “superhuman”. Fact is, judges from outside don’t know this (or ignore it), and a very “meh” work gets a lot of 2s and 3s, which in our scheme mean “oh, you are Margherita Hack” and in theirs mean “you barely scratched the surface”.

2 Likes

I am curious, if you wanted more gradations, why you didn’t use 9 and 10?

4 Likes

I’m a fan of the Anthony Fantano school of review - if you have an entire number scale available, you should use it all where needed. It feels nonsensical to give average works 6’s or 7’s, that should be reserved for good works. You should be proud of a 7, not left thinking “damn, it’s only a 7”.

That means needing to give bad works 2’s or 3’s, sure. That might make some authors upset. But at the end of the day, if you’re submitting works to a competition you should expect to be judged.

4 Likes

Looking at my ifdb rankings (which are just half my ifcomp rankings, rounded up), I gave a 1 star (i.e. a 1 or a 2) to 4 ifcomp games last year, but not any this year. I did give out at least one 3 that I can remember this year, to a game that felt unfinished and buggy and short (by an author who I don’t think is on the forum).
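
The conversion described above ("half my ifcomp rankings, rounded up") can be written out in a line of Python. The function name is made up; the arithmetic is just the one stated in the post.

```python
def ifdb_stars(ifcomp_score):
    """Halve a 1-10 IFComp score and round up to get a 1-5 star rating."""
    if not 1 <= ifcomp_score <= 10:
        raise ValueError("ifcomp_score must be between 1 and 10")
    return (ifcomp_score + 1) // 2  # integer form of ceil(score / 2)

# Show the whole mapping from IFComp scores to stars.
for score in range(1, 11):
    print(score, "->", ifdb_stars(score))
```

This is why a 1 or a 2 both end up as one star, and a 3 lands at two.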

5 Likes

I’ve always treated five as “fine.” Average. Edit: and by average, I guess I mean competent in terms of both writing and tech. Sevens and eights are good games. Nines and tens are exceptional. There might be just a handful per year.

Because of my affection for punk rock and low-fi music, a well-written but hopelessly buggy game might not get a one or a two. RTE is about an unplayable IF Comp game from the 1990s that took second-to-last place, after all.

There are very few 1, 2, 9, or 10 games for me. A game I have a great time with is an 8. That’s what I consider a great game: 8. A 9 or 10 has to set itself apart somehow as special. It has to be more than a great game. Its themes or writing may really stick with me, for instance, or it may be innovative in some way.

I feel the same way about 1 and 2 scores. They’re more than bad. There has to be something truly obnoxious or unpleasant about them.

In practice, though, I only rate games that I consider sevens or better. I don’t finish anything else. There are just too many incredible games out there for me to settle for “fine.” We have decades of IF already made, plus everything coming out now. I’ll never experience every great IF game at my current rate, even if that’s all I play. I haven’t played every IFDB top 100 game, for instance, or every “best of all time” game, either. I have a lot of IF I want to get to.

I don’t like the feeling of handing out low ratings, anyway.

5 Likes

Well, one of the main reasons I made this post is because I came away dissatisfied with my rubric! I went into the comp with too much of a fixation on considering 9 and 10 as standards to be matched up against games of all time. Anyway, I feel pretty sure that in the future I may be more lax about what a 9 or 10 means within a given comp.

I wasn’t giving 6s or 7s as an “average” game. I’m just saying that if most comp games are going to at least check in as average, and if you reserve 10 for masterpieces that may only come every few comps (as I did), the range of scores you end up giving only varies three or four places. But in my mind I could rank that whole gamut of games in the 5-8 range much more specifically if more gradations were available.

3 Likes

For me, everything pretty damn good or clever is a 9 or 10. 8 is pretty good.

1 Like

As an author, these sorts of threads are useful to dial in on the audience. Not so much in the numbers but the requirements.

Maybe better – although risky and demanding of people who are already donating a huge amount of their time to the community – would be for judges to drop authors a note on why they rated their game the way they did. Of course authors would have to, in turn, not argue with it, but a little bit of info could go a long way.

In any case, sharing the rubrics can also help the community come to understand what might be “fair play”. One person may give 1s for games that do not start. Others for games that crash at all. Or have content they disagree with. Or if it happened to be the lowest of the five games they rated. (And similarly for 10s!)

I have mathematical reservations about “use all the numbers” but something like IF Comp is in a weird spot in terms of statistics. Outlying ratings can still have a significant impact on averages.
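
As a rough illustration of the outlier point, here is a Python sketch with invented numbers: ten judges who all give a 7, plus one judge who gives a 1. The vote counts are assumptions chosen for simplicity, not figures from any actual comp.

```python
def average_shift(ratings, outlier):
    """Return how much one extra rating moves the mean of the existing ratings."""
    before = sum(ratings) / len(ratings)
    after = (sum(ratings) + outlier) / (len(ratings) + 1)
    return after - before

# Hypothetical game: ten 7s, then a single 1 arrives.
shift = average_shift([7] * 10, 1)
print(f"mean moves by {shift:.2f}")
```

With so few votes per game, a single low score moves the average by about half a point here; the effect shrinks as the number of votes grows.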

In the end, I assume these are community events and assume best intent. Ratings are a necessary evil to make it a competition, but the real value is in the reviews, comments, discussions and follow-on works.

7 Likes

I’ve read the whole thread and have no new perspectives to add. I can just add the data points of myself.

I score mostly the same way I also score movies, where I use 6 the most. 6 being the basic positive but not becoming too remarkable movie experience. However 6 also ends up being my ‘Bizarre film with flashes of brilliance but also too many flaws to go higher…’ and variants. No wonder I overuse 6.

Here’s my ratings distribution on IMDB, where I’ve rated almost all (as far as I can tell, now) of the 5000 feature films I’ve seen:

[Image: my IMDb ratings distribution]

You can see my most used ratings in order are 6,7,5,8,4.

This is a ratings approach leaning on worth, not trying to use all the numbers more often. But after I’ve used it, in IFComp, I will push scores up and down a little bit to try to express my own preferences more accurately, only because I’m conscious everything in this batch will be ranked versus everything else in this batch. So I’m doing that thing where if I gave two things a six, and I liked one much better in retrospect, it may go to 7, etc. I’m separating things from each other for my own satisfaction.

-Wade

12 Likes