The Rise fo the Cinematic Superhero

analysis
api
R
Author

Matthew Harris

Published

August 20, 2022

Abstract
What actors haven’t been in a superhero film?

Introduction

It feels as though superhero movies have taken over the box office. It’s hard to name a popular actor who hasn’t appeared in at least one film where people can fly or move things with their minds. The goal of this project is to dig deeper into the growth of superhero movies and ultimately determine what percentage of popular actors have been in such films. Full disclosure, this analysis will be extremely subjective.

TLDR

  • Superhero movies have become more expensive, profitable, and abundant over the last 15 years
  • Over 1 out of every 3 active popular actors has been in a superhero movie
  • Stan Lee was a very busy man

Data

TMDb API

Most of the data used in this analysis, will be queried from The Movie Database. This site offers an API that is free to access with no rate limiting. More information can be found on the TMDb API documentation page. An unofficial TMDb package is also available that provides an R ready interface to the API. It hasn’t been updated in a couple of years so I chose to create the requests on my own. The recent release of the httr2 package should make this an easy process.

Inflation Adjustments

The monetary data that I’ll be pulling from TMDb won’t be inflation adjusted so I’ll need CPI data. Without adjustment my monetary comparisons wouldn’t be apples to apples. I’ll be using the blsR package to pull this data.

Packages

Code
library(dplyr)
library(tidyr)
library(ggplot2)
library(glue)
library(stringr)
library(jsonlite)
library(httr2)
library(purrr)
library(janitor)
library(lubridate)
library(scales)

TMDb Request

Before I can conduct my analysis I will need to query the data using the API. So what’s the best way to start?

Searching for Keywords

First I need to find all of the keywords that contain the word “superhero” and save the IDs for those keywords.

Code
keyword_search <- "superhero"

hero_id_df <- request("https://api.themoviedb.org/3/") %>%
  req_url_path_append("search/keyword") %>%
  req_url_query(api_key = keyring::key_get(service = "TMDb_API"), 
                query = "superhero",
                page = "1") %>%
  req_perform() %>%
  resp_body_string() %>%
  fromJSON() %>%
  pluck("results")
Code
hero_id_df %>%
  head()
                name     id
1          superhero   9715
2     superhero team 155030
3    superhero spoof 157677
4 death of superhero 174016
5   masked superhero 180734
6     superhero kids 191219

Superhero Movies

Using the keyword IDs I can search for any movies that contain those keywords. I’m also choosing to ignore animation films and films that weren’t released in theaters. The next section will cover the following steps:

  1. Determine how many pages of results there are
  2. create a function that makes a request for each pages
  3. Use the map function to iterate over the complete list of pages
Code
sh_page_count <- request("https://api.themoviedb.org/3/") %>%
  req_url_path_append("discover/movie") %>%
  req_url_query(api_key = keyring::key_get(service = "TMDb_API"), 
                certification_country = "US",
                language = "en-US",
                without_genres = 16,
                with_release_type = 3,
                region = "US",
                with_keywords = glue_collapse(hero_id_df$id, sep = '|'),
                primary_release_date.gte = "1970-01-01") %>%
  req_perform() %>%
  resp_body_string() %>%
  fromJSON() %>%
  pluck("total_pages")
Code
sh_page_count
[1] 12
Code
get_super_movies <- function(page_num) {
  request("https://api.themoviedb.org/3/") %>%
    req_url_path_append("discover/movie") %>%
    req_url_query(api_key = keyring::key_get(service = "TMDb_API"), 
                  certification_country = "US",
                  language = "en-US",
                  without_genres = 16,
                  with_release_type = 3,
                  region = "US",
                  with_keywords = glue_collapse(hero_id_df$id, sep = '|'),
                  primary_release_date.gte = "1970-01-01",
                  page = page_num) %>%
    req_perform() %>%
    resp_body_string() %>%
    fromJSON() %>%
    pluck("results")
}
Code
get_super_movies(1) %>%
  glimpse()
Rows: 20
Columns: 14
$ adult             <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FAL…
$ backdrop_path     <chr> "/p1F51Lvj3sMopG948F5HsBbl43C.jpg", "/wcKFYIiVDvRURr…
$ genre_ids         <list> <28, 12, 14>, <14, 28, 12>, <28, 12, 878>, <80, 964…
$ id                <int> 616037, 453395, 634649, 414906, 580489, 566525, 5244…
$ original_language <chr> "en", "en", "en", "en", "en", "en", "en", "en", "en"…
$ original_title    <chr> "Thor: Love and Thunder", "Doctor Strange in the Mul…
$ overview          <chr> "After his retirement is interrupted by Gorr the God…
$ popularity        <dbl> 7686.423, 2155.823, 1658.434, 861.089, 604.985, 542.…
$ poster_path       <chr> "/pIkRyD18kl4FhoCNQuWxWu5cBLM.jpg", "/9Gtg2DzBhmYamX…
$ release_date      <chr> "2022-07-08", "2022-05-06", "2021-12-17", "2022-03-0…
$ title             <chr> "Thor: Love and Thunder", "Doctor Strange in the Mul…
$ video             <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FAL…
$ vote_average      <dbl> 6.8, 7.5, 8.0, 7.7, 7.0, 7.7, 7.1, 8.3, 7.6, 7.7, 8.…
$ vote_count        <int> 1970, 5384, 14756, 5953, 7841, 7036, 6120, 25215, 64…

The function appears to work as the first two movies visible from the title column are Thor: Love and Thunder and Doctor Strange in the Multiverse of Madness. Next up is the iteration.

Code
hero_movie_df <- map_df(1:sh_page_count,
                        ~get_super_movies(.x))

It looks as though there have been 221 movies released in American theaters since Jan 1, 1970 that have been classified as a “superhero” film.

Movie Details

I could use the rest of the data found in the discover API call, but I want to know more. Have the budgets for superhero movies changed since 1970? Have they become more profitable? Additional information on these movies’ cost and performance can be found using a slightly different API call. Querying the data requires a similar process as collecting all of the movie IDs.

Code
get_super_details <- function(movie_id) {
  request("https://api.themoviedb.org/3/") %>%
    req_url_path_append(glue("movie/{movie_id}")) %>%
    req_url_query(api_key = keyring::key_get(service = "TMDb_API"), 
                  language = "en-US") %>%
    req_perform() %>%
    resp_body_string() %>%
    fromJSON()
}

Wow! This list contains a combination of lists and data frames. I can see some fields that I’m interested in, such as budget and vote_average.

Code
hero_detail_lst <- map(hero_movie_df$id,
       ~get_super_details(.x))
Code
hero_detail_df <- hero_detail_lst %>%
  map_df(~as_tibble(t(.x))) %>%
  select(budget, imdb_id, original_title, title,
         release_date, revenue, runtime, status, vote_average,
         vote_count, id) %>%
  unnest(cols = everything()) %>%
  rename(movie_id = id) %>%
  mutate(release_date = as_date(release_date))

hero_detail_df %>%
  glimpse()
Rows: 221
Columns: 11
$ budget         <int> 250000000, 200000001, 200000000, 185000000, 110000000, …
$ imdb_id        <chr> "tt10648342", "tt9419884", "tt10872600", "tt1877830", "…
$ original_title <chr> "Thor: Love and Thunder", "Doctor Strange in the Multiv…
$ title          <chr> "Thor: Love and Thunder", "Doctor Strange in the Multiv…
$ release_date   <date> 2022-07-06, 2022-05-04, 2021-12-15, 2022-03-01, 2021-0…
$ revenue        <dbl> 720000000, 953200000, 1901000000, 770836163, 506863592,…
$ runtime        <int> 119, 126, 148, 177, 97, 132, 149, 156, 132, 143, 181, 1…
$ status         <chr> "Released", "Released", "Released", "Released", "Releas…
$ vote_average   <dbl> 6.788, 7.504, 8.046, 7.749, 6.958, 7.670, 8.268, 7.073,…
$ vote_count     <int> 1961, 5371, 14748, 5948, 7839, 7033, 25212, 6114, 6484,…
$ movie_id       <int> 616037, 453395, 634649, 414906, 580489, 566525, 299536,…

All of the information that I want is in a tidy data frame and ready for some exploratory data analysis.

blsR Requests

Next I want to grab the CPI data.

Code
cpi_monthly <- blsR::get_series_table(
  series_id = "CUUR0000SA0",
  api_key = keyring::key_get("bls_reg_key"), # Enter your own API key here
  start_year = 1970,
  end_year = year(Sys.Date())
)

cpi_avg_annual <- cpi_monthly %>%
  arrange(year) %>%
  group_by(year) %>%
  summarize(avg_cpi = mean(value, na.rm = TRUE))

curr_cpi <- cpi_avg_annual %>%
  tail(1) %>%
  pull(avg_cpi)

cpi_avg_annual <- cpi_avg_annual %>%
  mutate(inf_rate = curr_cpi / avg_cpi)
Code
cpi_avg_annual %>%
  arrange(desc(year)) %>%
  head()
# A tibble: 6 × 3
   year avg_cpi inf_rate
  <int>   <dbl>    <dbl>
1  2022    289.     1   
2  2021    271.     1.07
3  2020    259.     1.12
4  2019    256.     1.13
5  2018    251.     1.15
6  2017    245.     1.18

Analysis: Part 1

Trended Release Counts

How many superhero movies have been released per year since 1970? Have these movies become more profitable over time? I’ll attempt to answer these questions with some simple tables and plots.

Code
hero_detail_df %>%
  mutate(release_year = year(release_date)) %>%
  filter(release_year < year(Sys.Date())) %>%
  count(release_year) %>%
  ggplot(aes(release_year, n)) +
  geom_line(col = "#D23630", size = 1.75) + 
  geom_point(col = "white", size = 3.5) +
  geom_point(col = "#D23630", size = 2.5) +
  scale_y_continuous(breaks = breaks_width(2)) +
  scale_x_continuous(breaks = breaks_width(5)) +
  labs(x = "Release Year", title = "Superhero Movies Released by Year") +
  theme_minimal(base_size = 14) +
  theme(axis.title.y = element_blank(),
        plot.title = element_text(hjust = 0.5, face = "bold"))

Super Performance

There has definitely been an upward trend in the number of superhero movies released per year. More of these types of films start to pop up around the mid 2000s. Have these newer superhero films been more financially successful?

Code
hero_detail_adj_df <- hero_detail_df %>%
  mutate(release_year = year(release_date)) %>%
  left_join(cpi_avg_annual, by = c("release_year" = "year")) %>%
  mutate(across(.cols = c(budget, revenue),
                .fns = ~ inf_rate * .x,
                .names = "{col}_inf_adj"),
         profit_inf_adj = revenue_inf_adj - budget_inf_adj,
         roi_inf_adj = profit_inf_adj / budget_inf_adj)

hero_detail_summ <- hero_detail_adj_df %>%
  filter(release_year < year(Sys.Date()), budget_inf_adj > 100000) %>%
  group_by(release_year) %>%
  summarize(across(.cols = c(budget_inf_adj, revenue_inf_adj, profit_inf_adj, roi_inf_adj),
                   .fns = median,
                   .names = "med_{col}"),
            .groups = "drop") %>%
  pivot_longer(cols = contains("med"), names_to = "metric")

hero_detail_summ %>%
  arrange(desc(release_year)) %>%
  head()
# A tibble: 6 × 3
  release_year metric                     value
         <dbl> <chr>                      <dbl>
1         2021 med_budget_inf_adj  197637559.  
2         2021 med_revenue_inf_adj 461770320.  
3         2021 med_profit_inf_adj  301523650.  
4         2021 med_roi_inf_adj             1.88
5         2020 med_budget_inf_adj   74939425.  
6         2020 med_revenue_inf_adj  54996136.  
Code
hero_detail_summ %>%
  filter(metric %in% c("med_budget_inf_adj", "med_profit_inf_adj")) %>%
  mutate(metric = case_when(metric == "med_budget_inf_adj" ~ "Budget", 
                            metric == "med_profit_inf_adj" ~ "Profit"), 
         value = if_else(metric == "Budget", value * -1, value)) %>%
  ggplot(aes(release_year, value, fill = metric)) +
  geom_col() +
  scale_y_continuous(labels = label_dollar(scale = 1/1e6, suffix = "M"), 
                     breaks = breaks_width(250e6)) +
  scale_x_continuous(breaks = breaks_width(5)) +
  scale_fill_manual(values = c("#D23630", "#1B998B")) +
  labs(x = "Release Year", fill = "Metric", title = "Median Performance by Year", caption = "Inflation Adjusted") +
  theme_minimal(base_size = 14) +
  theme(legend.position = "bottom", 
        plot.title = element_text(hjust = 0.5, face = "bold"),
        plot.caption = element_text(hjust = 0, size = 8),
        axis.title = element_blank())

Code
hero_detail_summ %>%
  filter(metric == "med_roi_inf_adj") %>%
  ggplot(aes(release_year, value)) +
  geom_hline(yintercept = 0, col = "black", linetype = 2) +
  geom_line(col = "#D23630", size = 1.75) + 
  geom_point(col = "white", size = 3.5) +
  geom_point(col = "#D23630", size = 2.5) +
  scale_y_continuous(labels = label_comma(suffix = "X"), 
                     breaks = breaks_width(1)) +
  scale_x_continuous(breaks = breaks_width(5)) +
  labs(x = "Release Year", title = "Median ROI by Year") +
  theme_minimal(base_size = 14) +
  theme(axis.title.y = element_blank(),
        plot.title = element_text(hjust = 0.5, face = "bold"))

There appear to be some early superhero films that performed great at the box office and offered investors some amazing returns. However, the highest returns seem to be held by movies released after 2012. There also appear to be less occurrences of annual losses for this genre after 2012. This could be a product of studios’ improved understanding of how to produce and market these types of movies.

The Strongest Avenger

Before I move on to collecting the cast information, I’d like to know what the top performing movie of each decade was.

Code
hero_detail_adj_df %>% 
  filter(budget_inf_adj > 100000, release_date < Sys.Date()) %>% 
  mutate(decade = glue("{release_year - (release_year %% 10)}'s")) %>% 
  group_by(decade) %>% 
  slice_max(order_by = profit_inf_adj,
            n = 1) %>% 
  ggplot(aes(decade, profit_inf_adj)) + 
  geom_col(fill = "#1B998B") +
  geom_text(aes(label = glue("{title}\n{label_percent(big.mark = ',', accuracy = 1.0)(roi_inf_adj)} ROI")), y = 15e7, hjust = 0, col = "white", fontface = "bold") +
  scale_y_continuous(labels = label_dollar(scale = 1/1e9, suffix = "B", accuracy = 0.1), 
                     breaks = breaks_width(500e6)) +
  labs(x = "Decade", y = "Profit") +
  coord_flip() +
  labs(title = "Top Performing Movies by Decade", y = "Profit") +
  theme_minimal(base_size = 14) +
  theme(axis.title.y = element_blank(),
        plot.title = element_text(hjust = 0.5, face = "bold"))

This plot is pretty interesting. Avengers: Endgame is by far the greatest superhero success at the box office, but 1989’s Batman offered investors a higher rate of return at a whopping 1,075% ROI.

Analysis: Part 2

Cast Info

The movie data that I’ve collected so far is great, but it lacks any information on the cast for each film. I’ll have to use a slightly different set of API calls to collect all of the cast listing for the movies I’ve identified.

Code
get_super_cast <- function(movie_id) {
  request("https://api.themoviedb.org/3/") %>%
    req_url_path_append(glue("movie/{movie_id}/credits")) %>%
    req_url_query(api_key = keyring::key_get(service = "TMDb_API")) %>%
    req_perform() %>%
    resp_body_string() %>%
    fromJSON()
}

hero_cast_lst <- map(hero_movie_df$id, ~get_super_cast(.x))
Code
hero_cast_df <- hero_cast_lst %>%
  map_df(~as_tibble(t(.x))) %>% 
  select(-crew) %>%
  mutate(df_len = map_dbl(cast, ~length(.x))) %>%
  filter(df_len > 0) %>%
  unnest(c(id, cast), names_repair = "unique") %>%
  rename(movie_id = 1, actor_id = 4, actor = 6) %>%
  select(-df_len)

hero_cast_df %>%
  head()
# A tibble: 6 × 13
  movie_id adult gender actor_id known_f…¹ actor origi…² popul…³ profi…⁴ cast_id
     <int> <lgl>  <int>    <int> <chr>     <chr> <chr>     <dbl> <chr>     <int>
1   616037 FALSE      2    74568 Acting    Chri… Chris …   113.  /jpurJ…      85
2   616037 FALSE      2     3894 Acting    Chri… Christ…    29.5 /qCpZn…      87
3   616037 FALSE      1    62561 Acting    Tess… Tessa …    34.6 /fycqd…      88
4   616037 FALSE      2    55934 Directing Taik… Taika …    41.6 /tQeio…      89
5   616037 FALSE      1      524 Acting    Nata… Natali…    65.1 /xcbZj…      86
6   616037 FALSE      1    59817 Acting    Jaim… Jaimie…    33.6 /hO8AD…      91
# … with 3 more variables: character <chr>, credit_id <chr>, order <int>, and
#   abbreviated variable names ¹​known_for_department, ²​original_name,
#   ³​popularity, ⁴​profile_path
# ℹ Use `colnames()` to see all variable names

Now that I have the cast data I can join it together with the rest of the movie data.

Code
hero_full_df <- hero_cast_df %>%
  left_join(hero_detail_adj_df, by = "movie_id")

King of the Cameos

Before I move on to the last step, I’d like to answer a couple more questions. First up, what actor has been in the most superhero movies? This question should be pretty easy to answer for any MCU fan.

Code
hero_full_df %>% 
  count(actor) %>% 
  arrange(desc(n)) %>% 
  head()
# A tibble: 6 × 2
  actor                 n
  <chr>             <int>
1 Stan Lee             43
2 Samuel L. Jackson    15
3 Chris Evans          13
4 Jon Favreau          10
5 Robert Downey Jr.    10
6 Chris Hemsworth       9
Code
hero_full_df %>% 
  filter(actor == "Stan Lee") %>% 
  slice_max(order_by = release_date, n = 1) %>% 
  select(title, release_date, actor, character)
# A tibble: 1 × 4
  title             release_date actor    character
  <chr>             <date>       <chr>    <chr>    
1 Avengers: Endgame 2019-04-24   Stan Lee Driver   

Stan Lee has made a cameo in nearly every Marvel film with his last being Avengers: Endgame. His cameos were always a great touch in the films.

Analysis: Part 3

All of this exploration has been fun, but I still need to answer my original question: How many popular actors have been in a superhero film?

Finale

After a little work I have my very rough estimate of how many popular actors have been in a superhero film, 38%.

I hope you enjoyed my unnecessarily deep look into the popularity of superhero movies. Thanks for checking out my post!!!