Introduction

It feels as though superhero movies have taken over the box office. It’s hard to name a popular actor who hasn’t appeared in at least one film where people can fly or move things with their minds. The goal of this project is to dig deeper into the growth of superhero movies and ultimately determine what percentage of popular actors have been in such films. Full disclosure, this analysis will be extremely subjective.

TLDR

Superhero movies have become more expensive, profitable, and abundant over the last 15 years
Over 1 out of every 3 active popular actors has been in a superhero movie
Stan Lee was a very busy man

Data

TMDb API

Most of the data used in this analysis, will be queried from The Movie Database. This site offers an API that is free to access with no rate limiting. More information can be found on the TMDb API documentation page. An unofficial TMDb package is also available that provides an R ready interface to the API. It hasn’t been updated in a couple of years so I chose to create the requests on my own. The recent release of the httr2 package should make this an easy process.

Inflation Adjustments

The monetary data that I’ll be pulling from TMDb won’t be inflation adjusted so I’ll need CPI data. Without adjustment my monetary comparisons wouldn’t be apples to apples. I’ll be using the blsR package to pull this data.

Packages

Code

library(dplyr)
library(tidyr)
library(ggplot2)
library(glue)
library(stringr)
library(jsonlite)
library(httr2)
library(purrr)
library(janitor)
library(lubridate)
library(scales)

TMDb Request

Before I can conduct my analysis I will need to query the data using the API. So what’s the best way to start?

Searching for Keywords

First I need to find all of the keywords that contain the word “superhero” and save the IDs for those keywords.

Code

keyword_search <- "superhero"

hero_id_df <- request("https://api.themoviedb.org/3/") %>%
  req_url_path_append("search/keyword") %>%
  req_url_query(api_key = keyring::key_get(service = "TMDb_API"), 
                query = "superhero",
                page = "1") %>%
  req_perform() %>%
  resp_body_string() %>%
  fromJSON() %>%
  pluck("results")

Code

hero_id_df %>%
  head()

                name     id
1          superhero   9715
2     superhero team 155030
3    superhero spoof 157677
4 death of superhero 174016
5   masked superhero 180734
6     superhero kids 191219

Superhero Movies

Using the keyword IDs I can search for any movies that contain those keywords. I’m also choosing to ignore animation films and films that weren’t released in theaters. The next section will cover the following steps:

Determine how many pages of results there are
create a function that makes a request for each pages
Use the map function to iterate over the complete list of pages

Code

sh_page_count <- request("https://api.themoviedb.org/3/") %>%
  req_url_path_append("discover/movie") %>%
  req_url_query(api_key = keyring::key_get(service = "TMDb_API"), 
                certification_country = "US",
                language = "en-US",
                without_genres = 16,
                with_release_type = 3,
                region = "US",
                with_keywords = glue_collapse(hero_id_df$id, sep = '|'),
                primary_release_date.gte = "1970-01-01") %>%
  req_perform() %>%
  resp_body_string() %>%
  fromJSON() %>%
  pluck("total_pages")

Code

sh_page_count

[1] 12

Code

get_super_movies <- function(page_num) {
  request("https://api.themoviedb.org/3/") %>%
    req_url_path_append("discover/movie") %>%
    req_url_query(api_key = keyring::key_get(service = "TMDb_API"), 
                  certification_country = "US",
                  language = "en-US",
                  without_genres = 16,
                  with_release_type = 3,
                  region = "US",
                  with_keywords = glue_collapse(hero_id_df$id, sep = '|'),
                  primary_release_date.gte = "1970-01-01",
                  page = page_num) %>%
    req_perform() %>%
    resp_body_string() %>%
    fromJSON() %>%
    pluck("results")
}

Code

get_super_movies(1) %>%
  glimpse()

Rows: 20
Columns: 14
$ adult             <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FAL…
$ backdrop_path     <chr> "/p1F51Lvj3sMopG948F5HsBbl43C.jpg", "/wcKFYIiVDvRURr…
$ genre_ids         <list> <28, 12, 14>, <14, 28, 12>, <28, 12, 878>, <80, 964…
$ id                <int> 616037, 453395, 634649, 414906, 580489, 566525, 5244…
$ original_language <chr> "en", "en", "en", "en", "en", "en", "en", "en", "en"…
$ original_title    <chr> "Thor: Love and Thunder", "Doctor Strange in the Mul…
$ overview          <chr> "After his retirement is interrupted by Gorr the God…
$ popularity        <dbl> 7686.423, 2155.823, 1658.434, 861.089, 604.985, 542.…
$ poster_path       <chr> "/pIkRyD18kl4FhoCNQuWxWu5cBLM.jpg", "/9Gtg2DzBhmYamX…
$ release_date      <chr> "2022-07-08", "2022-05-06", "2021-12-17", "2022-03-0…
$ title             <chr> "Thor: Love and Thunder", "Doctor Strange in the Mul…
$ video             <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FAL…
$ vote_average      <dbl> 6.8, 7.5, 8.0, 7.7, 7.0, 7.7, 7.1, 8.3, 7.6, 7.7, 8.…
$ vote_count        <int> 1970, 5384, 14756, 5953, 7841, 7036, 6120, 25215, 64…

The function appears to work as the first two movies visible from the title column are Thor: Love and Thunder and Doctor Strange in the Multiverse of Madness. Next up is the iteration.

Code

hero_movie_df <- map_df(1:sh_page_count,
                        ~get_super_movies(.x))

It looks as though there have been 221 movies released in American theaters since Jan 1, 1970 that have been classified as a “superhero” film.

Movie Details

I could use the rest of the data found in the discover API call, but I want to know more. Have the budgets for superhero movies changed since 1970? Have they become more profitable? Additional information on these movies’ cost and performance can be found using a slightly different API call. Querying the data requires a similar process as collecting all of the movie IDs.

Code

get_super_details <- function(movie_id) {
  request("https://api.themoviedb.org/3/") %>%
    req_url_path_append(glue("movie/{movie_id}")) %>%
    req_url_query(api_key = keyring::key_get(service = "TMDb_API"), 
                  language = "en-US") %>%
    req_perform() %>%
    resp_body_string() %>%
    fromJSON()
}

Wow! This list contains a combination of lists and data frames. I can see some fields that I’m interested in, such as budget and vote_average.

Code

hero_detail_lst <- map(hero_movie_df$id,
       ~get_super_details(.x))

Code

hero_detail_df <- hero_detail_lst %>%
  map_df(~as_tibble(t(.x))) %>%
  select(budget, imdb_id, original_title, title,
         release_date, revenue, runtime, status, vote_average,
         vote_count, id) %>%
  unnest(cols = everything()) %>%
  rename(movie_id = id) %>%
  mutate(release_date = as_date(release_date))

hero_detail_df %>%
  glimpse()

Rows: 221
Columns: 11
$ budget         <int> 250000000, 200000001, 200000000, 185000000, 110000000, …
$ imdb_id        <chr> "tt10648342", "tt9419884", "tt10872600", "tt1877830", "…
$ original_title <chr> "Thor: Love and Thunder", "Doctor Strange in the Multiv…
$ title          <chr> "Thor: Love and Thunder", "Doctor Strange in the Multiv…
$ release_date   <date> 2022-07-06, 2022-05-04, 2021-12-15, 2022-03-01, 2021-0…
$ revenue        <dbl> 720000000, 953200000, 1901000000, 770836163, 506863592,…
$ runtime        <int> 119, 126, 148, 177, 97, 132, 149, 156, 132, 143, 181, 1…
$ status         <chr> "Released", "Released", "Released", "Released", "Releas…
$ vote_average   <dbl> 6.788, 7.504, 8.046, 7.749, 6.958, 7.670, 8.268, 7.073,…
$ vote_count     <int> 1961, 5371, 14748, 5948, 7839, 7033, 25212, 6114, 6484,…
$ movie_id       <int> 616037, 453395, 634649, 414906, 580489, 566525, 299536,…

All of the information that I want is in a tidy data frame and ready for some exploratory data analysis.

`blsR` Requests

Next I want to grab the CPI data.

Code

cpi_monthly <- blsR::get_series_table(
  series_id = "CUUR0000SA0",
  api_key = keyring::key_get("bls_reg_key"), # Enter your own API key here
  start_year = 1970,
  end_year = year(Sys.Date())
)

cpi_avg_annual <- cpi_monthly %>%
  arrange(year) %>%
  group_by(year) %>%
  summarize(avg_cpi = mean(value, na.rm = TRUE))

curr_cpi <- cpi_avg_annual %>%
  tail(1) %>%
  pull(avg_cpi)

cpi_avg_annual <- cpi_avg_annual %>%
  mutate(inf_rate = curr_cpi / avg_cpi)

Code

cpi_avg_annual %>%
  arrange(desc(year)) %>%
  head()

# A tibble: 6 × 3
   year avg_cpi inf_rate
  <int>   <dbl>    <dbl>
1  2022    289.     1   
2  2021    271.     1.07
3  2020    259.     1.12
4  2019    256.     1.13
5  2018    251.     1.15
6  2017    245.     1.18

Analysis: Part 1

Trended Release Counts

How many superhero movies have been released per year since 1970? Have these movies become more profitable over time? I’ll attempt to answer these questions with some simple tables and plots.

Code

hero_detail_df %>%
  mutate(release_year = year(release_date)) %>%
  filter(release_year < year(Sys.Date())) %>%
  count(release_year) %>%
  ggplot(aes(release_year, n)) +
  geom_line(col = "#D23630", size = 1.75) + 
  geom_point(col = "white", size = 3.5) +
  geom_point(col = "#D23630", size = 2.5) +
  scale_y_continuous(breaks = breaks_width(2)) +
  scale_x_continuous(breaks = breaks_width(5)) +
  labs(x = "Release Year", title = "Superhero Movies Released by Year") +
  theme_minimal(base_size = 14) +
  theme(axis.title.y = element_blank(),
        plot.title = element_text(hjust = 0.5, face = "bold"))

Super Performance

There has definitely been an upward trend in the number of superhero movies released per year. More of these types of films start to pop up around the mid 2000s. Have these newer superhero films been more financially successful?

Code

hero_detail_adj_df <- hero_detail_df %>%
  mutate(release_year = year(release_date)) %>%
  left_join(cpi_avg_annual, by = c("release_year" = "year")) %>%
  mutate(across(.cols = c(budget, revenue),
                .fns = ~ inf_rate * .x,
                .names = "{col}_inf_adj"),
         profit_inf_adj = revenue_inf_adj - budget_inf_adj,
         roi_inf_adj = profit_inf_adj / budget_inf_adj)

hero_detail_summ <- hero_detail_adj_df %>%
  filter(release_year < year(Sys.Date()), budget_inf_adj > 100000) %>%
  group_by(release_year) %>%
  summarize(across(.cols = c(budget_inf_adj, revenue_inf_adj, profit_inf_adj, roi_inf_adj),
                   .fns = median,
                   .names = "med_{col}"),
            .groups = "drop") %>%
  pivot_longer(cols = contains("med"), names_to = "metric")

hero_detail_summ %>%
  arrange(desc(release_year)) %>%
  head()

# A tibble: 6 × 3
  release_year metric                     value
         <dbl> <chr>                      <dbl>
1         2021 med_budget_inf_adj  197637559.  
2         2021 med_revenue_inf_adj 461770320.  
3         2021 med_profit_inf_adj  301523650.  
4         2021 med_roi_inf_adj             1.88
5         2020 med_budget_inf_adj   74939425.  
6         2020 med_revenue_inf_adj  54996136.

Code

hero_detail_summ %>%
  filter(metric %in% c("med_budget_inf_adj", "med_profit_inf_adj")) %>%
  mutate(metric = case_when(metric == "med_budget_inf_adj" ~ "Budget", 
                            metric == "med_profit_inf_adj" ~ "Profit"), 
         value = if_else(metric == "Budget", value * -1, value)) %>%
  ggplot(aes(release_year, value, fill = metric)) +
  geom_col() +
  scale_y_continuous(labels = label_dollar(scale = 1/1e6, suffix = "M"), 
                     breaks = breaks_width(250e6)) +
  scale_x_continuous(breaks = breaks_width(5)) +
  scale_fill_manual(values = c("#D23630", "#1B998B")) +
  labs(x = "Release Year", fill = "Metric", title = "Median Performance by Year", caption = "Inflation Adjusted") +
  theme_minimal(base_size = 14) +
  theme(legend.position = "bottom", 
        plot.title = element_text(hjust = 0.5, face = "bold"),
        plot.caption = element_text(hjust = 0, size = 8),
        axis.title = element_blank())

Code

hero_detail_summ %>%
  filter(metric == "med_roi_inf_adj") %>%
  ggplot(aes(release_year, value)) +
  geom_hline(yintercept = 0, col = "black", linetype = 2) +
  geom_line(col = "#D23630", size = 1.75) + 
  geom_point(col = "white", size = 3.5) +
  geom_point(col = "#D23630", size = 2.5) +
  scale_y_continuous(labels = label_comma(suffix = "X"), 
                     breaks = breaks_width(1)) +
  scale_x_continuous(breaks = breaks_width(5)) +
  labs(x = "Release Year", title = "Median ROI by Year") +
  theme_minimal(base_size = 14) +
  theme(axis.title.y = element_blank(),
        plot.title = element_text(hjust = 0.5, face = "bold"))

There appear to be some early superhero films that performed great at the box office and offered investors some amazing returns. However, the highest returns seem to be held by movies released after 2012. There also appear to be less occurrences of annual losses for this genre after 2012. This could be a product of studios’ improved understanding of how to produce and market these types of movies.

The Strongest Avenger

Before I move on to collecting the cast information, I’d like to know what the top performing movie of each decade was.

Code

hero_detail_adj_df %>% 
  filter(budget_inf_adj > 100000, release_date < Sys.Date()) %>% 
  mutate(decade = glue("{release_year - (release_year %% 10)}'s")) %>% 
  group_by(decade) %>% 
  slice_max(order_by = profit_inf_adj,
            n = 1) %>% 
  ggplot(aes(decade, profit_inf_adj)) + 
  geom_col(fill = "#1B998B") +
  geom_text(aes(label = glue("{title}\n{label_percent(big.mark = ',', accuracy = 1.0)(roi_inf_adj)} ROI")), y = 15e7, hjust = 0, col = "white", fontface = "bold") +
  scale_y_continuous(labels = label_dollar(scale = 1/1e9, suffix = "B", accuracy = 0.1), 
                     breaks = breaks_width(500e6)) +
  labs(x = "Decade", y = "Profit") +
  coord_flip() +
  labs(title = "Top Performing Movies by Decade", y = "Profit") +
  theme_minimal(base_size = 14) +
  theme(axis.title.y = element_blank(),
        plot.title = element_text(hjust = 0.5, face = "bold"))

This plot is pretty interesting. Avengers: Endgame is by far the greatest superhero success at the box office, but 1989’s Batman offered investors a higher rate of return at a whopping 1,075% ROI.

Analysis: Part 2

Cast Info

The movie data that I’ve collected so far is great, but it lacks any information on the cast for each film. I’ll have to use a slightly different set of API calls to collect all of the cast listing for the movies I’ve identified.

Code

get_super_cast <- function(movie_id) {
  request("https://api.themoviedb.org/3/") %>%
    req_url_path_append(glue("movie/{movie_id}/credits")) %>%
    req_url_query(api_key = keyring::key_get(service = "TMDb_API")) %>%
    req_perform() %>%
    resp_body_string() %>%
    fromJSON()
}

hero_cast_lst <- map(hero_movie_df$id, ~get_super_cast(.x))

Code

hero_cast_df <- hero_cast_lst %>%
  map_df(~as_tibble(t(.x))) %>% 
  select(-crew) %>%
  mutate(df_len = map_dbl(cast, ~length(.x))) %>%
  filter(df_len > 0) %>%
  unnest(c(id, cast), names_repair = "unique") %>%
  rename(movie_id = 1, actor_id = 4, actor = 6) %>%
  select(-df_len)

hero_cast_df %>%
  head()

# A tibble: 6 × 13
  movie_id adult gender actor_id known_f…¹ actor origi…² popul…³ profi…⁴ cast_id
     <int> <lgl>  <int>    <int> <chr>     <chr> <chr>     <dbl> <chr>     <int>
1   616037 FALSE      2    74568 Acting    Chri… Chris …   113.  /jpurJ…      85
2   616037 FALSE      2     3894 Acting    Chri… Christ…    29.5 /qCpZn…      87
3   616037 FALSE      1    62561 Acting    Tess… Tessa …    34.6 /fycqd…      88
4   616037 FALSE      2    55934 Directing Taik… Taika …    41.6 /tQeio…      89
5   616037 FALSE      1      524 Acting    Nata… Natali…    65.1 /xcbZj…      86
6   616037 FALSE      1    59817 Acting    Jaim… Jaimie…    33.6 /hO8AD…      91
# … with 3 more variables: character <chr>, credit_id <chr>, order <int>, and
#   abbreviated variable names ¹known_for_department, ²original_name,
#   ³popularity, ⁴profile_path
# ℹ Use `colnames()` to see all variable names

Now that I have the cast data I can join it together with the rest of the movie data.

Code

hero_full_df <- hero_cast_df %>%
  left_join(hero_detail_adj_df, by = "movie_id")

King of the Cameos

Before I move on to the last step, I’d like to answer a couple more questions. First up, what actor has been in the most superhero movies? This question should be pretty easy to answer for any MCU fan.

Code

hero_full_df %>% 
  count(actor) %>% 
  arrange(desc(n)) %>% 
  head()

# A tibble: 6 × 2
  actor                 n
  <chr>             <int>
1 Stan Lee             43
2 Samuel L. Jackson    15
3 Chris Evans          13
4 Jon Favreau          10
5 Robert Downey Jr.    10
6 Chris Hemsworth       9

Code

hero_full_df %>% 
  filter(actor == "Stan Lee") %>% 
  slice_max(order_by = release_date, n = 1) %>% 
  select(title, release_date, actor, character)

# A tibble: 1 × 4
  title             release_date actor    character
  <chr>             <date>       <chr>    <chr>    
1 Avengers: Endgame 2019-04-24   Stan Lee Driver

Stan Lee has made a cameo in nearly every Marvel film with his last being Avengers: Endgame. His cameos were always a great touch in the films.

Analysis: Part 3

All of this exploration has been fun, but I still need to answer my original question: How many popular actors have been in a superhero film?

Popular Actors

First I need to identify these actors.

Code

pop_actor_page_n <- request("https://api.themoviedb.org/3/") %>%
  req_url_path_append(glue("person/popular")) %>%
  req_url_query(api_key = keyring::key_get(service = "TMDb_API"),
                language = "en-US") %>%
  req_perform() %>%
  resp_body_string() %>%
  fromJSON() %>%
  pluck("total_pages")

get_pop_actors <- function(page_num) {
  request("https://api.themoviedb.org/3/") %>%
    req_url_path_append(glue("person/popular")) %>%
    req_url_query(api_key = keyring::key_get(service = "TMDb_API"),
                  language = "en-US",
                  page = page_num) %>%
    req_perform() %>%
    resp_body_string() %>%
    fromJSON() %>%
    pluck("results")
}

pop_actors_lst <- map(1:pop_actor_page_n, ~get_pop_actors(.x))

Code

pop_actors_df <- pop_actors_lst %>%
  map_df(~as_tibble(.x))

pop_actors_df %>%
  select(popularity, name) %>% 
  slice_sample(n = 10)

# A tibble: 10 × 2
   popularity name             
        <dbl> <chr>            
 1       9.93 Ryan O'Neal      
 2      19.7  Brad Dourif      
 3      23.4  Thomas Doherty   
 4      15.1  Teagan Croft     
 5      27.8  Taylor Lautner   
 6       9.52 Casimere Jollette
 7      13.9  Julie Peters     
 8       9.48 Karlina Zhang    
 9      16.6  Sam Claflin      
10      11.0  Jacqueline Pinol

Really Really Popular Actors

This list is great but it contains some popular people that I don’t want. I only want to focus on current, popular, movie actors. I decided to filter for actors who have been in a American movie since 2005 and who are in the 75th percentile based upon the popularity metric.

Code

top_pop_actors_df <- pop_actors_df %>%
  # Unnest the known_for data frames for each actor
  unnest(known_for, names_repair = "unique") %>%
  rename(actor = 23, actor_id = 3,
         movie_id = 7) %>% 
  # Removes missing release date rows
  filter(release_date > 0) %>%  
  mutate(release_date = date(release_date)) %>% 
  group_by(actor) %>%
  # Filters for the newest known_for item for each actor
  slice_max(order_by = release_date, n = 1) %>% 
  ungroup() %>%
  filter(release_date > "2005-01-01", media_type == "movie", 
         known_for_department == "Acting", 
         original_language == "en") %>%
  # Creates a quantile value for each actor based upon their popularity
  mutate(q90th = quantile(popularity, probs = 0.9)) %>% 
  filter(popularity >= q90th) %>%
  select(actor, actor_id)

top_pop_actors_df %>% 
  slice_sample(n = 10)

# A tibble: 10 × 2
   actor              actor_id
   <chr>                 <int>
 1 Nicholas Galitzine  1425934
 2 James Purefoy         17648
 3 Dax Shepard           51298
 4 Dominic Cooper        55470
 5 Tyler Hoechlin        78198
 6 Michael Rooker        12132
 7 Alexandra Daddario   109513
 8 James Marsden         11006
 9 Doug Jones            17005
10 Eiza González       1222992

After filtering I’ve narrowed the pool down to the top 578 actors. A sample of this pool should show more recognizable names.

I can use the setdiff() function to identify the popular actors who have never been in a superhero film.

Code

non_supers <- top_pop_actors_df %>%
  dplyr::setdiff(hero_full_df %>%
                   distinct(actor, actor_id))

super_prc <- label_percent()(1 - (nrow(non_supers) / nrow(top_pop_actors_df)))

Finale

After a little work I have my very rough estimate of how many popular actors have been in a superhero film, 38%.

I hope you enjoyed my unnecessarily deep look into the popularity of superhero movies. Thanks for checking out my post!!!

Introduction

TLDR

Data

TMDb API

Inflation Adjustments

Packages

TMDb Request

Searching for Keywords

Superhero Movies

Movie Details

blsR Requests

Analysis: Part 1

Trended Release Counts

Super Performance

The Strongest Avenger

Analysis: Part 2

Cast Info

King of the Cameos

Analysis: Part 3

Popular Actors

Really Really Popular Actors

Finale

`blsR` Requests