How I'm Watching the MCU, With Stats
Reconciling 11 online guides into a single score
Executive Summary
I create must-watch ratings for MCU movies via item-response theory and 11 online guides.
- If the goal is to get the payoff of Infinity War and Endgame…
- Watch Iron Man, Captain America, Avengers, Winter Soldier, Guardians 1, and Civil War.
- Optional, in order, are Ragnarok, Ultron, Captain Marvel, Wasp, and Black Panther.
- The remaining movies are apparently safe to skip.
In a future post, I’ll write up the novel watching order I devise from these ratings.
Problem
So you’ve decided to finally watch the MCU.
- PNG of Captain America sitting on a backwards chair goes here.
Well, you have if you’re me. I’ve decided this. But there’s a hangup:
I cannot possibly commit to binging 22 Marvel movies.
- I’ll be honest, I’m mostly here for the memes. I want to be culturally conversant.
- But I largely don’t watch movies. I saw ten total movies in the 2010s.
- I don’t expect ex ante that I’ll care a lot about these stories or characters, either.
- So I don’t know at what point I’ll get too bored with more MCU to continue.
I turned to the Internet for assistance, but found more info than I could use.
- I set out hoping I’d find one, clearly good guide to which movies I could skip.
- What I came back with was eleven guides, with different opinions, all of unknown quality.
So is there a good way to unify all these into one authoritative list of what (not) to watch?
Data and methods
Guides
A quick Startpage search turned up eleven guides to the MCU up through Endgame.
Sources ranged from geek blogs to media sites to general news outlets to some Redditor:
- 13th Dimension
- A Place To Hang Your Cape
- BGR
- Collider
- Insider
- Polygon
- Screenrant
- Slashfilm
- Vox
- Vulture (the article links to a later revision; I averaged the two versions)
- zomghax92 on Reddit
Each guide recommends some movies and not others, and they don’t agree on which.
- That is, each one’s picks partly capture some true signal of how essential each movie is.
- But each also contains some random noise due to its own author’s tendencies.
So how do we pull out that common signal to rate each movie statistically?
A first idea would be to average the guides, scoring each movie by how many lists picked it.
- This would treat each guide as containing the same balance of signal and noise.
- But some guides might just be better at choosing essential MCU movies than others!
- It’d be better to weight the average of the votes by the quality of the guides.
- However, that quality isn’t a manifest feature of a guide – not a number that can simply be read off each one.
Instead, a more involved model is needed to jointly rate the movies and the guides.
Item-response theory
Concept
A one-dimensional item-response theory (IRT) model can do that job.
- IRT models measure latent traits in respondents using their answers to test items.
- At the same time, they measure how clearly each item tests for the latent trait.
When you look at them right, a lot of different data can look like test results.
- The original, canonical application from psychometrics is educational aptitude testing.
- But IRT models are also used to measure, say, left-right ideology in politicians.
So how can the data above be mapped to the notion of items and respondents?
- Here, movies will function as respondents, and the eleven guides as test items.
- If a given guide picks a given movie, that will count as a correct answer to the item.
- The latent trait – the signal – recovered will be the “true” importance of each MCU entry.
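Concretely, once the guides’ picks are reshaped, the model sees one row per guide-movie pair. Here is a toy sketch with made-up values (the real data, built below, uses the same title, item, and value columns):
# toy sketch of the long "test data" layout: one row per guide-movie pair,
# with value = 1 if that guide recommends watching the movie (made-up values)
toy <- dplyr::tibble(
  title = c("Iron Man", "Iron Man", "Thor", "Thor"), # movies play the respondents
  item  = c("vox", "polygon", "vox", "polygon"),     # guides play the test items
  value = c(1, 1, 0, 1)                              # a pick = a "correct" answer
)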
Details
The model used is a parameterized equation for the probability of a “correct” answer:
\[\mathrm{Pr}(Y_{ij}=1) = \Lambda(-\alpha_i + \beta_i\delta_j)\]
- \(Y_{ij}\) is \(1\) if individual \(j\) got question \(i\) right and \(0\) otherwise.
- \(\Lambda(x)\) is the logistic function \(\mathrm{exp}(x)/(1 + \mathrm{exp}(x))\).
- \(\delta_j\) is the ability of respondent \(j\): their actual rating for the latent trait.
- \(\alpha_i\) is the difficulty of item \(i\): together with \(\beta_i\), it sets the trait level (\(\alpha_i/\beta_i\)) needed for an even chance of a right answer.
- \(\beta_i\) is the discrimination of item \(i\): how well it distinguishes higher trait levels from lower.
In words, the IRT model says that a test-taker will be more likely to get a question right:
- the lower the question’s difficulty \(\alpha_i\), and
- the higher their ability parameter \(\delta_j\) – in proportion to the question’s quality \(\beta_i\).
So, in this application, parameters represent the following:
- \(\delta_j\) rates how important watching movie \(j\) appears to be, given the data.
- \(\beta_i\) rates how consistent guide \(i\)’s picks are with the true, latent importance.
- \(\alpha_i\) rates how exhaustive the guide is overall.
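To make the formula concrete, here’s a quick numerical check in R with made-up parameter values, using the built-in plogis() for the logistic function \(\Lambda\):
# made-up parameters: a reasonably discriminating guide scoring a fairly essential movie
alpha <- 0.5 # item difficulty
beta  <- 2.0 # item discrimination
delta <- 1.0 # movie's latent importance
plogis(-alpha + beta * delta) # Pr(guide picks the movie), about 0.82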
Procedure
Load the data
I copied each guide’s picks into a spreadsheet and loaded them.
- I also captured each movie’s Rotten Tomatoes score, but I’m not using it here.
vengr <- suppressMessages(readr::read_csv(here::here("static", "data", "vengr.csv")))
vengr <- dplyr::select(vengr, -.data$tomatoes)
venga <- tidyr::pivot_longer(vengr, -.data$title, names_to = "item")
# also calculate the naive total votes and set aside
venge <- dplyr::group_by(venga, .data$title)
venge <- suppressMessages(dplyr::summarise(venge, votes = sum(.data$value)))
Impute the data
Where guides didn’t cast hard up/down votes on movies, I assigned partial votes.
- This mostly affects Vulture, who published two revisions of a tier ranking.
- Values for partial categories were evenly spaced within each guide (e.g. in steps of 1/6).
But the item-response theory model above doesn’t allow for partial votes.
So I used multiple imputation to randomly create several plausible, usable data sets:
- For each guide in each imputed data set, draw a random threshold \(p\) (the code below uses a continuous uniform draw, which works out the same as rolling a d6 when the partial votes come in sixths).
- Treat that guide as voting to watch any movie whose partial vote is at least \(p\).
Results can then be pooled across imputations by averaging the estimates from each one.
An alternative would be to roll a separate die for each vote, not a single die per guide.
- But it’s weird to think guides suppose movies are required independently of each other.
foo <- dplyr::tibble(i = 1:128)
foo <- dplyr::group_by(foo, .data$i)
foo <- dplyr::mutate(foo, venga = list(venga))
foo <- tidyr::unnest(foo, .data$venga)
foo <- dplyr::group_by(foo, .data$i, .data$item)
foo <- dplyr::mutate(foo, draw = runif(1))
foo <- dplyr::mutate(foo, value = .data$value >= .data$draw)
foo <- dplyr::mutate(foo, value = as.numeric(.data$value))
foo <- dplyr::select(foo, -.data$draw)
Filter the data
In each imputed data set, I filter out movies on which voting was unanimous.
- Parameters for consensus entries (either way) can’t be estimated.
- They can be simply set to +/- infinity as appropriate, without fitting.
This affected a few movies across all imputations, and two in a fraction of data sets.
- Every guide said to watch The Avengers (along with Infinity War and Endgame).
- All eleven said to skip The Incredible Hulk and Ant-Man.
- Captain America: Civil War was a consensus watch choice except for one partial vote.
- Iron Man 3 received only partial votes to watch.
foo <- dplyr::group_by(foo, .data$i, .data$title)
foo <- dplyr::mutate(foo,
                     OK = any(.data$value > 0) & any(.data$value < 1),
                     count = sum(.data$value))
quux <- dplyr::filter(foo, !.data$OK)
quux <- dplyr::group_by(quux, .data$i, .data$title, .data$count, .data$OK)
quux <- dplyr::summarise(quux)
quux <- dplyr::mutate(quux, delta = ifelse(.data$count > 0, Inf, -Inf))
foo <- dplyr::filter(foo, .data$OK)
foo <- dplyr::select(foo, -.data$count, -.data$OK)
foo <- dplyr::group_by(foo, .data$i)
foo <- tidyr::nest(foo)
Fit the model
There are R packages to fit IRT models, but the interfaces confuse me, so I’m rolling my own.
- I wouldn’t use the below for peer-reviewed work without more care, but here it’ll be fine.
The \(\beta\) and \(\delta\) parameters cannot both be fitted in one logistic regression.
- Instead, I alternate between fitting models for the item and respondent parameters.
- Each is fitted conditional on the last estimate of the other (i.e. as if the other were data).
- The alternating steps are repeated until all the parameters jointly converge.
Technical details that I could say more about, but I won’t here:
- For convenience, I fit mixed models that shrink estimates toward zero.
- The parameters are rescaled at each step for purposes of identification.
# Helper functions
huh <- function(x) {dplyr::select(x, tidyselect:::where(is.numeric))}
hah <- function(new, old) {
  y <- huh(new)
  abs((y - huh(old)) / y)
}
#' one-dimensional item-response theory model via lme4::glmer and an EM algorithm
#' @param u `data.frame` with columns `value` (the response), `item` (the item), and `title` (the respondent)
#' @param iter `numeric`. Maximum iterations. Default `128`.
#' @param tol `numeric`. Relative tolerance for convergence. Default `sqrt(sqrt(.Machine$double.eps))`.
#' @return a list with elements:
#' * `B` (the final item model) and
#' * `D` (the final respondent model)
#' and attributes:
#' * `z` (convergence) and
#' * `i` (number of iterations).
#' @export
irt <- function(u, iter = 128, tol = sqrt(sqrt(.Machine$double.eps))) {
  z <- 1 # convergence flag
  v <- u # working copy of data (not really needed)
  ## initialize alpha and *also* initialize delta *assuming beta = 1 for all items*
  A <- lme4::glmer(value ~ (1 | item) + # alpha
                     (1 | title) +      # delta
                     0,                 # no intercept -- think this is why many fits are singular
                   family = binomial, data = v)
  ## extract delta estimates and join to v
  ## (columns are named for the coefficient that will multiply them in the next fit:
  ##  this column holds the delta values, and the item model's slope on it is beta)
  d <- lme4::ranef(A)$title
  d <- dplyr::as_tibble(d, rownames = "title")
  d <- dplyr::rename(d, beta = .data$`(Intercept)`) # yes, this name is as desired
  v <- suppressMessages(dplyr::left_join(v, d))
  ## initialize alpha and beta assuming delta equals initial values
  B <- lme4::glmer(value ~ (beta | item) + 0, family = binomial, data = v)
  ## extract alpha and beta estimates and join to v
  ## (likewise, the column named delta holds the item betas; the title model's slope on it is delta)
  b <- lme4::ranef(B)$item
  b <- dplyr::as_tibble(b, rownames = "item")
  b <- dplyr::rename(b,
                     alpha = .data$`(Intercept)`,
                     delta = .data$beta)
  v <- suppressMessages(dplyr::left_join(v, b))
  for(i in 1:iter) {
    d_1 <- d
    b_1 <- b
    # estimate delta given alpha and beta
    D <- lme4::glmer(value ~ (delta + 0 | title) + offset(alpha) + 0, family = binomial, data = v)
    ## extract delta estimates and join to v
    d <- lme4::ranef(D)$title
    d <- dplyr::as_tibble(d, rownames = "title")
    d <- dplyr::rename(d, beta = .data$delta)
    d <- dplyr::mutate(d, beta = .data$beta / mean(abs(.data$beta))) # normalize
    v <- dplyr::select(v, -.data$beta)
    v <- suppressMessages(dplyr::left_join(v, d))
    ## estimate alpha and beta given delta
    B <- lme4::glmer(value ~ (beta | item) + 0, family = binomial, data = v)
    ## extract alpha and beta estimates and join to v
    b <- lme4::ranef(B)$item
    b <- dplyr::as_tibble(b, rownames = "item")
    b <- dplyr::rename(b,
                       alpha = .data$`(Intercept)`,
                       delta = .data$beta)
    v <- dplyr::select(v, -.data$alpha, -.data$delta)
    v <- suppressMessages(dplyr::left_join(v, b))
    if(max(hah(d, d_1)) < tol &
       max(hah(b, b_1)) < tol &
       !lme4::isSingular(D) &
       !lme4::isSingular(B) &
       TRUE) {
      z <- 0
      break
    }
  }
  out <- list(B = B, D = D)
  attr(out, "z") <- z
  attr(out, "i") <- i
  out
}
system.time(foo <- dplyr::mutate(foo, bar = list(irt(dplyr::first(.data$data)))))
saveRDS(foo, here::here("static", "data", "mcu_foo.rds"))
saveRDS(quux, here::here("static", "data", "mcu_quux.rds")) # the set-aside consensus ratings, reloaded below
Pool the results
Finally, I average over the imputations to get final ratings.
- There are standard error estimates, too, and the rules for pooling these (Rubin’s rules, sketched below) are known.
- However, it’s not clear those standard error estimates mean much in the mixed-model setting.
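For reference, Rubin’s rules would pool the per-imputation estimates \(\hat{\delta}_m\) and their variances \(\hat{V}_m\) over the \(M = 128\) imputations as
\[\bar{\delta} = \frac{1}{M}\sum_{m=1}^{M}\hat{\delta}_m, \qquad T = \frac{1}{M}\sum_{m=1}^{M}\hat{V}_m + \left(1 + \frac{1}{M}\right)\frac{1}{M-1}\sum_{m=1}^{M}\left(\hat{\delta}_m - \bar{\delta}\right)^2,\]
with \(\sqrt{T}\) as the pooled standard error. Below, only the first piece (the simple average) gets used.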
# here's one I made earlier
foo <- readRDS(here::here("static", "data", "mcu_foo.rds"))
quux <- readRDS(here::here("static", "data", "mcu_quux.rds"))
foo <- dplyr::mutate(foo, bar = list(lapply(dplyr::first(.data$bar), list)))
foo <- dplyr::mutate(foo, bar = list(dplyr::as_tibble(dplyr::first(.data$bar))))
foo <- suppressMessages(dplyr::select(foo, .data$bar))
foo <- tidyr::unnest(foo, .data$bar)
# pool the movie ratings
baz <- suppressMessages(dplyr::select(foo, mod = .data$D))
baz <- dplyr::mutate(baz, mod = list(lme4::ranef(dplyr::first(.data$mod))))
baz <- dplyr::mutate(baz, mod = list(dplyr::as_tibble(dplyr::first(.data$mod))))
baz <- tidyr::unnest(baz, .data$mod)
# prepare to join in the ratings that were infinite
baz <- tidyr::pivot_longer(baz, .data$condval:.data$condsd, "quantity", values_to = "value")
baz <- tidyr::pivot_wider(baz, names_from = .data$grp, values_from = .data$value)
baz <- tidyr::pivot_longer(baz, .data$`Ant-Man and the Wasp`:.data$`Captain America: Civil War`, names_to = "title")
quux <- dplyr::group_by(quux, .data$i, .data$title)
quux <- suppressMessages(dplyr::select(quux, .data$delta))
quux <- tidyr::pivot_longer(quux, .data$delta, names_to = "term", values_to = "degenval")
baz <- suppressMessages(dplyr::left_join(baz, quux))
baz <- dplyr::mutate(baz, value = ifelse(is.na(.data$value), .data$degenval, .data$value))
baz <- suppressMessages(dplyr::select(baz, -.data$degenval))
baz <- tidyr::pivot_wider(baz, names_from = .data$quantity, values_from = .data$value)
baz <- dplyr::group_by(baz, .data$term, .data$title)
baz <- suppressMessages(dplyr::summarise(baz,
n = sum(is.finite(.data$condval)),
mean = mean(.data$condval[is.finite(.data$condval)]),
median = median(.data$condval)))
baz <- dplyr::ungroup(baz)
# tack on vote totals for comparison
baz <- suppressMessages(dplyr::left_join(baz, venge))
# pool the guide ratings
bar <- suppressMessages(dplyr::select(foo, mod = .data$B))
bar <- dplyr::mutate(bar, mod = list(lme4::ranef(dplyr::first(.data$mod))))
bar <- dplyr::mutate(bar, mod = list(dplyr::as_tibble(dplyr::first(.data$mod))))
bar <- tidyr::unnest(bar, .data$mod)
bar <- dplyr::rename(bar, guide = .data$grp)
bar <- dplyr::mutate(bar, term = ifelse(.data$term == "(Intercept)", "alpha", as.character(.data$term)))
bar <- suppressMessages(dplyr::select(bar, .data$term:.data$condval))
bar <- tidyr::pivot_wider(bar, names_from = .data$term, values_from = .data$condval)
bar <- dplyr::group_by(bar, .data$guide)
bar <- suppressMessages(dplyr::summarise(bar, dplyr::across(tidyselect:::where(is.double), mean)))
Ratings
Movies
knitr::kable(dplyr::arrange(dplyr::select(baz, -.data$term), dplyr::desc(.data$mean)), digits = 3)
title | n | mean | median | votes |
---|---|---|---|---|
Captain America: Civil War | 82 | 1.469 | 1.527 | 10.33 |
Guardians of the Galaxy: Vol. 1 | 128 | 1.467 | 1.344 | 10.00 |
Iron Man | 128 | 1.124 | 1.002 | 8.00 |
Captain America: The First Avenger | 128 | 0.764 | 0.677 | 8.00 |
Captain America: The Winter Soldier | 128 | 0.469 | 0.263 | 6.33 |
Thor: Ragnarok | 128 | -0.120 | -0.256 | 5.50 |
Avengers: Age of Ultron | 128 | -0.261 | -0.286 | 4.50 |
Captain Marvel | 128 | -0.275 | -0.359 | 4.33 |
Ant-Man and the Wasp | 128 | -0.330 | -0.427 | 4.16 |
Black Panther | 128 | -0.386 | -0.431 | 4.33 |
Thor: The Dark World | 128 | -0.726 | -0.831 | 3.33 |
Thor | 128 | -0.957 | -1.012 | 2.00 |
Doctor Strange | 128 | -1.022 | -1.077 | 2.67 |
Spider-Man: Homecoming | 128 | -1.239 | -1.307 | 1.00 |
Iron Man 2 | 128 | -1.426 | -1.503 | 1.00 |
Iron Man 3 | 46 | -1.526 | -Inf | 0.33 |
Guardians of the Galaxy: Vol. 2 | 128 | -1.653 | -1.698 | 1.17 |
Compared to the simple vote total, there turn out not to be a lot of surprises in the ratings.
- The vote totals almost perfectly line up in the same order as the average ratings.
The ratings do provide clearer gaps between groups of pictures than the vote totals do, though.
- The movies with positive ratings form one clear group. These are the essential films.
- Below them sit five movies with near-equal, mildly negative ratings. These are the optional ones.
- Everything below that second group should be clearly safe to skip.
Median values are included because the reported mean is biased for Civil War and Iron Man 3.
- The mean can only be taken over the data sets where the rating wasn’t infinite.
- But the median can be taken over all 128 imputations, no matter what.
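A toy illustration of the difference, with placeholder draws standing in for the finite ratings:
# Civil War-like case: finite in 82 of the 128 imputations, +Inf in the other 46
median(c(rnorm(82, mean = 1.5), rep(Inf, 46)))   # finite: fewer than half the values are Inf
# Iron Man 3-like case: finite in only 46 of 128, -Inf in the other 82
median(c(rnorm(46, mean = -1.5), rep(-Inf, 82))) # -Inf: more than half the values are -Inf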
Sources
As noted above, the sources ranged from geek blogs to media sites to general news outlets.
The best at picking out essential movies to watch was… some Redditor.
knitr::kable(dplyr::arrange(bar, dplyr::desc(.data$beta)), digits = 3)
guide | alpha | beta |
---|---|---|
zomghax92 | 0.247 | 2.271 |
collider | 0.652 | 1.898 |
bgr | 0.934 | 1.683 |
vox | -0.104 | 1.640 |
polygon | 0.706 | 1.638 |
thirteenth | 0.160 | 1.617 |
cape | 0.670 | 1.034 |
screenrant | -0.938 | 0.780 |
slashfilm | -0.701 | 0.657 |
vulture | -0.331 | 0.472 |
insider | -0.133 | 0.420 |
I’m not informed enough to speak of “surprises” here, but a few things stood out to me:
- Vulture’s picks surprised me the most compared to other lists (Doctor Strange?).
- Vox dot com managed to hang with the big kids when it came to geek media criticism.
- A Place To Hang Your Cape’s list seemed well thought out, but the model wasn’t impressed.
- At least no one had a negative beta (which would mean actively preferring non-essential movies).
Discussion
This was fun to do, even if it didn’t teach me too much that simpler methods wouldn’t have.
I’m happy enough with these ratings to use them in making my MCU binge manageable.
Readers might wonder what, more deeply, the common signal the model recovered actually represents.
- Does a higher rating mean a given movie is more important for understanding Endgame?
- Or does it mean more that a given movie is a good movie that will be fun to watch?
This is a good question. The data and the model used, by themselves, don’t answer it.
- Each guide implicitly chose some balance between the former and the latter in making its picks.
- The latent trait extracted by the model is just an average over each guide’s balance.
- A two-dimensional IRT model – for tests of two traits at once – might help tease that out.