Does Games Criticism Need Quality Standards?

What? No.

The question being asked in the title of this post is ridiculous, to be clear, but it’s also a thing being pushed right now by some folks who paid for Twitter, and a lot of credulous folks are parroting it, going “oh THIS is what games need!” I’m not going to link to the original post – I’m sure you can follow the trail of people complaining about this or dunking on the OP to find it yourself – but I will describe what’s going on for anybody not interested in logging into Elon’s Website this morning.

Basically, the host of a very small, very new (as in, joined the Internet in February) podcast/YouTube channel about video games posted an image of a “metric scoresheet” they use to quantify their game reviews. The scoresheet has 42 different metric attributes across five separate categories (mechanics, story/character development, sound and music, graphics and art style, and “innovation”), and rates games on a fucking 500-point scale, with 485-500 rated as “masterpiece,” 460-484 as “outstanding,” and so on. The point spread is fascinating and absolutely skewed—games that score anywhere from 0 to 224 points (0% to 44.8%) are rated as “disasters”—but we can’t get lost in the weeds here. It’s not that certain parts of this metric scoresheet are ridiculous; it’s that the whole concept is goofy.

Now look, here’s my shit, right: if someone wants to come up with a way to sort through their own thoughts on games, a reliable guide to structure their own reviews, a method to remind themselves of the things they find important to talk about, that’s great. Go with god, everyone has their own way of keeping things locked down and I think that’s beautiful. But of course, that’s not exactly what’s going on. Here’s the podcaster, in their own words, about what they want for this scoresheet: “We believe other reviewers should adopt this method to enhance the clarity and reliability of their evaluations, fostering greater trust and satisfaction among their audiences.”

And because they have Twitter Blue, this post is getting shared and boosted by other Twitter Blue users, and the bolder ones, trying to grift off recent anti-games-media bullshit, are shoving it in folks’ faces, and then I have to see this fuckin COPC quarterly audit-ass document being talked about everywhere, even on goddamn Bluesky. So I want to talk about why this is ridiculous, and to do that, I’m going to talk about the typical structure of quality standards evaluation forms. (This is being done in a generic way, based on the decade of experience I have dealing with said forms almost every day at my day job. You’re welcome/I’m sorry.)

Let’s say you’re a business (ugh, fuck, I know), and you need to figure out a reliable method of evaluating your employees’ on-the-job performance. Already we’re stepping well outside the bounds of anything we might call “media criticism,” but keep up with me. Anyway, if you’re looking for a way to ensure that all employees find your evaluation standards fair, you’re going to need to run a fucking checklist on your checklist.

First: are your evaluation categories clearly defined? Is there any overlap between them? If so, can those categories be consolidated? With the scoresheet in question, we have five: game mechanics, story/character development, sound and music, graphics and art style, and “innovation and creativity.” You need to be able to sum each category up in a couple of sentences or less.

Second: are each category’s attributes clearly defined? Is it clear what the evaluator is looking for in each category, and are those expectations being communicated? In this scoresheet’s structure, there are 42 different attributes: 10 in game mechanics, eight in story/character development, four in sound, and 10 each in graphics/art style and innovation/creativity. Are these attributes understandable? And again, is there significant overlap between them? If so, those attributes should be consolidated for clarity. Additional thought needs to be given to attributes in different categories that may contradict each other, as we want to eliminate any possible unfairness in the evaluation process.

Third: is the scoring rubric clear? If the scorer marks something positively, is it clearly understood what is meant by that? If the scorer marks something negatively, is it similarly clearly understood? Would two scorers be able to come to similar or the same conclusions based on their own observations and understanding of the rubric? In this scoresheet, there are six different scores: one score out of 100 points for each of the five categories, and one overall score collating the results into a single number grade out of 500. These category scores and the overall score share a rubric: a seven-tier scale marking “masterpiece” at 97-100 (485-500 overall), “outstanding” at 92-96 (460-484), “excellent” at 86-91 (430-459), “good” at 75-85 (375-429), “fair” at 65-74 (325-374), “poor” at 55-64 (225-324), and “disaster” at 0-54 (0-224).
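Just to make that spread concrete, here’s a quick sketch of the overall rubric as a lookup table, in Python purely for illustration. The tier names and cutoffs are the ones from the scoresheet; the structure, the function, and the variable names are mine:

```python
# The overall-score rubric as described above: (minimum score, tier label).
OVERALL_TIERS = [
    (485, "masterpiece"),
    (460, "outstanding"),
    (430, "excellent"),
    (375, "good"),
    (325, "fair"),
    (225, "poor"),
    (0, "disaster"),
]

def overall_tier(score: int) -> str:
    """Map an overall score out of 500 to its tier label."""
    if not 0 <= score <= 500:
        raise ValueError("score must be between 0 and 500")
    for floor, label in OVERALL_TIERS:
        if score >= floor:
            return label

print(overall_tier(224))  # "disaster" -- even though 224/500 is 44.8% of the maximum
print(overall_tier(485))  # "masterpiece" -- the top tier only covers the top 3% or so of the scale
```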

By itself, this rubric resembles other game outlets’ score spreads, at least on the surface (for me it’s reminiscent of letter-grade spreads in high school). But we have a problem: each of the 42 attributes has a score as well. And because the podcaster wanted to keep each category at a 100-point scale, the attributes in both Story/Character Development and Sound have different point values from the rest of the attribute pool. What separates “narrative structure” (15 points) from “thematic depth” (10 points)? And how should we weigh deductions from those nonstandard attributes against deductions from standard ones? Each Sound attribute is on a 25-point scale; how finely can we really grade “voice acting quality” or “soundtrack composition” without a degree in audio engineering or a background in music theory? Again, we want to ensure maximum fairness.
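And here’s a back-of-the-envelope sketch of why those weights get weird. The attribute counts and the three point values named above (15, 10, and 25) come from the scoresheet; the assumption that the other categories split their 100 points evenly is entirely mine:

```python
# Attribute counts per category, as described above. Forcing every category to
# total 100 points means per-attribute weights can't be uniform across the
# whole sheet; the even splits below are my assumption for the categories
# where the scoresheet doesn't spell the weights out.
ATTRIBUTE_COUNTS = {
    "game mechanics": 10,
    "story/character development": 8,
    "sound and music": 4,
    "graphics and art style": 10,
    "innovation and creativity": 10,
}
CATEGORY_TOTAL = 100

assert sum(ATTRIBUTE_COUNTS.values()) == 42  # the 42 attributes, as advertised

for category, count in ATTRIBUTE_COUNTS.items():
    print(f"{category}: {count} attributes, {CATEGORY_TOTAL / count:g} points each if split evenly")

# story/character development can't split 100 evenly across eight attributes
# in whole points, which is how you end up with a 15-point "narrative
# structure" sitting next to a 10-point "thematic depth" on the same form.
```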

This is, I think, a fine place to stop; we’ve demonstrated enough of a breakdown in process that any quality standards regulatory team would send the sheet back for major revisions. But just to make sure we’ve covered everything, there’s one more section of our checklist to confirm.

Fourth: do the individual attributes contain drivers explaining or demonstrating phenomena that are a) specific, b) definable, c) observable and/or d) replicable? On a scoresheet with 42 attributes, this potentially opens us up to dozens of positive and negative drivers across a myriad of topics. We can’t standardize these drivers, either: you simply cannot use the same drivers for “puzzle design” as you do for “special effects.” And we can’t be generic about them, either: either the driver was observed or it wasn’t, and it is the justification for keeping or deducting points.
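For the non-QS-pilled among us, here’s a loose sketch of what a single driver amounts to on a form like this. The four criteria are the ones listed above; the field names, the class, and the example are made up by me for illustration:

```python
from dataclasses import dataclass

@dataclass
class Driver:
    """One observed (or not-observed) thing that justifies keeping or deducting points."""
    attribute: str      # which of the 42 attributes this hangs off of, e.g. "puzzle design"
    description: str    # specific and definable, stated so another scorer could check it
    observed: bool      # observable: it happened or it didn't, no vibes
    points_delta: int   # the deduction (or retention) it justifies

# A hypothetical example, not from anyone's actual review:
example = Driver(
    attribute="puzzle design",
    description="late-game puzzles reuse the same lock-and-key solution",
    observed=True,
    points_delta=-2,
)
```

One of those per observation, per attribute, multiplied across 42 attributes, is where the hundreds of data points in the next paragraph come from.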

When you have such a massive scoresheet with potentially hundreds of data points to consider, it is no longer possible to ensure fairness or even allow for proper calibration of secondary and tertiary scorers. The onboarding process for subsequent generations of scorers would be prohibitively long and expensive, and they would have to spend so much time making sure everyone was on the same page about what the categories, attributes and drivers meant that there’d be no time left for the actual process and practice of evaluation. The QS forms I’ve been subjected to (as well as those I’ve used on others) operate at about a quarter of the complexity of this scoresheet, and we still have to sit through monthly meetings and self-guided training sessions making sure everybody’s operating with the same understanding.


But just to be clear, just to make sure we’re all clear: there’s no way the rigor I just described is being performed by this podcaster as they use their scoresheet. There’s a certain point at which even this reviewer is just kind of saying “ah, fuck it” and assigning mostly arbitrary numeric values to each of the 42 attributes based on the general vibe of their experience with whichever game they’re talking about. And the filling-out of the form, and the posting of the form, and the promotion of the form: it’s all performative. At this point, the question of whether it’s possible to write an “objective” review has been so thoroughly beaten into the dust that folks just start laughing whenever its bones start creaking again, so shitheads who think games critics are The Literal Devil (???? this is its own fuckin wild story from last week, but we can’t get into it rn) have to move the goalposts. They have to say, “well, we know you can’t write an OBJECTIVE review, but how about promising to just fill out a couple standards, just a few quality metrics, so that you can” – how was it put? – “foster greater trust and satisfaction among your audiences?”

Bullshit.

Let’s say everyone adopted this standard, this scoresheet. Everyone in the industry is now grading games on a seven-tier, 500-point scale with 100-point sub-grades in the five major categories. You’d need to know, or be, somebody with a PhD in music theory and an extensive 25-year career as an audio engineer to grade the Sound category, which severely limits who gets to talk about games, but beyond that:

what the fuck is stopping the chuds who complain today about “woke” and “shilling” in IGN and GameSpot scores from complaining about this new, more granular scoring system tomorrow? Like, and I’m speaking directly to The Black Viking and the LongHouse Podcast now: do you think anyone who is currently mad about reviews and how they’re scored is going to be satisfied with your system? What’s gonna happen when you give a game a bad score because you think the game mechanics, sound, narrative, visual design or “innovation” are lacking (or perhaps a combination of all five), and some unhinged fucks decide you need to get death threats and have your name and address posted on a chan board as a result?

Can you take the heat if you ever get absorbed into Metacritic and now your “transparent,” “comprehensive” and “quantitative” scores are blamed for devs losing out on a bonus or getting laid off?

The thing is, it’s never actually been about the veracity or legitimacy of scores, or the rigor with which individual reviewers go about their work. It’s always been a fucking cultural divide, a struggle between those who see games as art and thus engage in criticism of them as art (with all that entails, as messy as it is) and those who reject the idea that we should ever think deeper about games than the surface-level dopamine hits they give us. You can’t satisfy folks who think the act of reviewing is the problem. I hope y’all realize that sooner than later.

Responses

  1. Good breakdown of all this. Crazy to think that this is a conversation still being had by some people in 2024, but some people’s refusal to engage with the things they like deeply continues to amaze (and sadden) me

