Code
library(dplyr)
library(tidyr)
library(ggplot2)
library(glue)
library(stringr)
library(jsonlite)
library(httr2)
library(purrr)
library(janitor)
library(lubridate)
library(scales)Matthew Harris
August 20, 2022
It feels as though superhero movies have taken over the box office. It’s hard to name a popular actor who hasn’t appeared in at least one film where people can fly or move things with their minds. The goal of this project is to dig deeper into the growth of superhero movies and ultimately determine what percentage of popular actors have been in such films. Full disclosure, this analysis will be extremely subjective.
Most of the data used in this analysis, will be queried from The Movie Database. This site offers an API that is free to access with no rate limiting. More information can be found on the TMDb API documentation page. An unofficial TMDb package is also available that provides an R ready interface to the API. It hasn’t been updated in a couple of years so I chose to create the requests on my own. The recent release of the httr2 package should make this an easy process.
The monetary data that I’ll be pulling from TMDb won’t be inflation adjusted so I’ll need CPI data. Without adjustment my monetary comparisons wouldn’t be apples to apples. I’ll be using the blsR package to pull this data.
Before I can conduct my analysis I will need to query the data using the API. So what’s the best way to start?
First I need to find all of the keywords that contain the word “superhero” and save the IDs for those keywords.
keyword_search <- "superhero"
hero_id_df <- request("https://api.themoviedb.org/3/") %>%
req_url_path_append("search/keyword") %>%
req_url_query(api_key = keyring::key_get(service = "TMDb_API"),
query = "superhero",
page = "1") %>%
req_perform() %>%
resp_body_string() %>%
fromJSON() %>%
pluck("results")Using the keyword IDs I can search for any movies that contain those keywords. I’m also choosing to ignore animation films and films that weren’t released in theaters. The next section will cover the following steps:
map function to iterate over the complete list of pagessh_page_count <- request("https://api.themoviedb.org/3/") %>%
req_url_path_append("discover/movie") %>%
req_url_query(api_key = keyring::key_get(service = "TMDb_API"),
certification_country = "US",
language = "en-US",
without_genres = 16,
with_release_type = 3,
region = "US",
with_keywords = glue_collapse(hero_id_df$id, sep = '|'),
primary_release_date.gte = "1970-01-01") %>%
req_perform() %>%
resp_body_string() %>%
fromJSON() %>%
pluck("total_pages")get_super_movies <- function(page_num) {
request("https://api.themoviedb.org/3/") %>%
req_url_path_append("discover/movie") %>%
req_url_query(api_key = keyring::key_get(service = "TMDb_API"),
certification_country = "US",
language = "en-US",
without_genres = 16,
with_release_type = 3,
region = "US",
with_keywords = glue_collapse(hero_id_df$id, sep = '|'),
primary_release_date.gte = "1970-01-01",
page = page_num) %>%
req_perform() %>%
resp_body_string() %>%
fromJSON() %>%
pluck("results")
}Rows: 20
Columns: 14
$ adult <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FAL…
$ backdrop_path <chr> "/p1F51Lvj3sMopG948F5HsBbl43C.jpg", "/wcKFYIiVDvRURr…
$ genre_ids <list> <28, 12, 14>, <14, 28, 12>, <28, 12, 878>, <80, 964…
$ id <int> 616037, 453395, 634649, 414906, 580489, 566525, 5244…
$ original_language <chr> "en", "en", "en", "en", "en", "en", "en", "en", "en"…
$ original_title <chr> "Thor: Love and Thunder", "Doctor Strange in the Mul…
$ overview <chr> "After his retirement is interrupted by Gorr the God…
$ popularity <dbl> 7686.423, 2155.823, 1658.434, 861.089, 604.985, 542.…
$ poster_path <chr> "/pIkRyD18kl4FhoCNQuWxWu5cBLM.jpg", "/9Gtg2DzBhmYamX…
$ release_date <chr> "2022-07-08", "2022-05-06", "2021-12-17", "2022-03-0…
$ title <chr> "Thor: Love and Thunder", "Doctor Strange in the Mul…
$ video <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FAL…
$ vote_average <dbl> 6.8, 7.5, 8.0, 7.7, 7.0, 7.7, 7.1, 8.3, 7.6, 7.7, 8.…
$ vote_count <int> 1970, 5384, 14756, 5953, 7841, 7036, 6120, 25215, 64…
The function appears to work as the first two movies visible from the title column are Thor: Love and Thunder and Doctor Strange in the Multiverse of Madness. Next up is the iteration.
It looks as though there have been 221 movies released in American theaters since Jan 1, 1970 that have been classified as a “superhero” film.
I could use the rest of the data found in the discover API call, but I want to know more. Have the budgets for superhero movies changed since 1970? Have they become more profitable? Additional information on these movies’ cost and performance can be found using a slightly different API call. Querying the data requires a similar process as collecting all of the movie IDs.
Wow! This list contains a combination of lists and data frames. I can see some fields that I’m interested in, such as budget and vote_average.
hero_detail_df <- hero_detail_lst %>%
map_df(~as_tibble(t(.x))) %>%
select(budget, imdb_id, original_title, title,
release_date, revenue, runtime, status, vote_average,
vote_count, id) %>%
unnest(cols = everything()) %>%
rename(movie_id = id) %>%
mutate(release_date = as_date(release_date))
hero_detail_df %>%
glimpse()Rows: 221
Columns: 11
$ budget <int> 250000000, 200000001, 200000000, 185000000, 110000000, …
$ imdb_id <chr> "tt10648342", "tt9419884", "tt10872600", "tt1877830", "…
$ original_title <chr> "Thor: Love and Thunder", "Doctor Strange in the Multiv…
$ title <chr> "Thor: Love and Thunder", "Doctor Strange in the Multiv…
$ release_date <date> 2022-07-06, 2022-05-04, 2021-12-15, 2022-03-01, 2021-0…
$ revenue <dbl> 720000000, 953200000, 1901000000, 770836163, 506863592,…
$ runtime <int> 119, 126, 148, 177, 97, 132, 149, 156, 132, 143, 181, 1…
$ status <chr> "Released", "Released", "Released", "Released", "Releas…
$ vote_average <dbl> 6.788, 7.504, 8.046, 7.749, 6.958, 7.670, 8.268, 7.073,…
$ vote_count <int> 1961, 5371, 14748, 5948, 7839, 7033, 25212, 6114, 6484,…
$ movie_id <int> 616037, 453395, 634649, 414906, 580489, 566525, 299536,…
All of the information that I want is in a tidy data frame and ready for some exploratory data analysis.
blsR RequestsNext I want to grab the CPI data.
cpi_monthly <- blsR::get_series_table(
series_id = "CUUR0000SA0",
api_key = keyring::key_get("bls_reg_key"), # Enter your own API key here
start_year = 1970,
end_year = year(Sys.Date())
)
cpi_avg_annual <- cpi_monthly %>%
arrange(year) %>%
group_by(year) %>%
summarize(avg_cpi = mean(value, na.rm = TRUE))
curr_cpi <- cpi_avg_annual %>%
tail(1) %>%
pull(avg_cpi)
cpi_avg_annual <- cpi_avg_annual %>%
mutate(inf_rate = curr_cpi / avg_cpi)How many superhero movies have been released per year since 1970? Have these movies become more profitable over time? I’ll attempt to answer these questions with some simple tables and plots.
hero_detail_df %>%
mutate(release_year = year(release_date)) %>%
filter(release_year < year(Sys.Date())) %>%
count(release_year) %>%
ggplot(aes(release_year, n)) +
geom_line(col = "#D23630", size = 1.75) +
geom_point(col = "white", size = 3.5) +
geom_point(col = "#D23630", size = 2.5) +
scale_y_continuous(breaks = breaks_width(2)) +
scale_x_continuous(breaks = breaks_width(5)) +
labs(x = "Release Year", title = "Superhero Movies Released by Year") +
theme_minimal(base_size = 14) +
theme(axis.title.y = element_blank(),
plot.title = element_text(hjust = 0.5, face = "bold"))
There has definitely been an upward trend in the number of superhero movies released per year. More of these types of films start to pop up around the mid 2000s. Have these newer superhero films been more financially successful?
hero_detail_adj_df <- hero_detail_df %>%
mutate(release_year = year(release_date)) %>%
left_join(cpi_avg_annual, by = c("release_year" = "year")) %>%
mutate(across(.cols = c(budget, revenue),
.fns = ~ inf_rate * .x,
.names = "{col}_inf_adj"),
profit_inf_adj = revenue_inf_adj - budget_inf_adj,
roi_inf_adj = profit_inf_adj / budget_inf_adj)
hero_detail_summ <- hero_detail_adj_df %>%
filter(release_year < year(Sys.Date()), budget_inf_adj > 100000) %>%
group_by(release_year) %>%
summarize(across(.cols = c(budget_inf_adj, revenue_inf_adj, profit_inf_adj, roi_inf_adj),
.fns = median,
.names = "med_{col}"),
.groups = "drop") %>%
pivot_longer(cols = contains("med"), names_to = "metric")
hero_detail_summ %>%
arrange(desc(release_year)) %>%
head()# A tibble: 6 × 3
release_year metric value
<dbl> <chr> <dbl>
1 2021 med_budget_inf_adj 197637559.
2 2021 med_revenue_inf_adj 461770320.
3 2021 med_profit_inf_adj 301523650.
4 2021 med_roi_inf_adj 1.88
5 2020 med_budget_inf_adj 74939425.
6 2020 med_revenue_inf_adj 54996136.
hero_detail_summ %>%
filter(metric %in% c("med_budget_inf_adj", "med_profit_inf_adj")) %>%
mutate(metric = case_when(metric == "med_budget_inf_adj" ~ "Budget",
metric == "med_profit_inf_adj" ~ "Profit"),
value = if_else(metric == "Budget", value * -1, value)) %>%
ggplot(aes(release_year, value, fill = metric)) +
geom_col() +
scale_y_continuous(labels = label_dollar(scale = 1/1e6, suffix = "M"),
breaks = breaks_width(250e6)) +
scale_x_continuous(breaks = breaks_width(5)) +
scale_fill_manual(values = c("#D23630", "#1B998B")) +
labs(x = "Release Year", fill = "Metric", title = "Median Performance by Year", caption = "Inflation Adjusted") +
theme_minimal(base_size = 14) +
theme(legend.position = "bottom",
plot.title = element_text(hjust = 0.5, face = "bold"),
plot.caption = element_text(hjust = 0, size = 8),
axis.title = element_blank())
hero_detail_summ %>%
filter(metric == "med_roi_inf_adj") %>%
ggplot(aes(release_year, value)) +
geom_hline(yintercept = 0, col = "black", linetype = 2) +
geom_line(col = "#D23630", size = 1.75) +
geom_point(col = "white", size = 3.5) +
geom_point(col = "#D23630", size = 2.5) +
scale_y_continuous(labels = label_comma(suffix = "X"),
breaks = breaks_width(1)) +
scale_x_continuous(breaks = breaks_width(5)) +
labs(x = "Release Year", title = "Median ROI by Year") +
theme_minimal(base_size = 14) +
theme(axis.title.y = element_blank(),
plot.title = element_text(hjust = 0.5, face = "bold"))
There appear to be some early superhero films that performed great at the box office and offered investors some amazing returns. However, the highest returns seem to be held by movies released after 2012. There also appear to be less occurrences of annual losses for this genre after 2012. This could be a product of studios’ improved understanding of how to produce and market these types of movies.
Before I move on to collecting the cast information, I’d like to know what the top performing movie of each decade was.
hero_detail_adj_df %>%
filter(budget_inf_adj > 100000, release_date < Sys.Date()) %>%
mutate(decade = glue("{release_year - (release_year %% 10)}'s")) %>%
group_by(decade) %>%
slice_max(order_by = profit_inf_adj,
n = 1) %>%
ggplot(aes(decade, profit_inf_adj)) +
geom_col(fill = "#1B998B") +
geom_text(aes(label = glue("{title}\n{label_percent(big.mark = ',', accuracy = 1.0)(roi_inf_adj)} ROI")), y = 15e7, hjust = 0, col = "white", fontface = "bold") +
scale_y_continuous(labels = label_dollar(scale = 1/1e9, suffix = "B", accuracy = 0.1),
breaks = breaks_width(500e6)) +
labs(x = "Decade", y = "Profit") +
coord_flip() +
labs(title = "Top Performing Movies by Decade", y = "Profit") +
theme_minimal(base_size = 14) +
theme(axis.title.y = element_blank(),
plot.title = element_text(hjust = 0.5, face = "bold"))
This plot is pretty interesting. Avengers: Endgame is by far the greatest superhero success at the box office, but 1989’s Batman offered investors a higher rate of return at a whopping 1,075% ROI.
The movie data that I’ve collected so far is great, but it lacks any information on the cast for each film. I’ll have to use a slightly different set of API calls to collect all of the cast listing for the movies I’ve identified.
get_super_cast <- function(movie_id) {
request("https://api.themoviedb.org/3/") %>%
req_url_path_append(glue("movie/{movie_id}/credits")) %>%
req_url_query(api_key = keyring::key_get(service = "TMDb_API")) %>%
req_perform() %>%
resp_body_string() %>%
fromJSON()
}
hero_cast_lst <- map(hero_movie_df$id, ~get_super_cast(.x))# A tibble: 6 × 13
movie_id adult gender actor_id known_f…¹ actor origi…² popul…³ profi…⁴ cast_id
<int> <lgl> <int> <int> <chr> <chr> <chr> <dbl> <chr> <int>
1 616037 FALSE 2 74568 Acting Chri… Chris … 113. /jpurJ… 85
2 616037 FALSE 2 3894 Acting Chri… Christ… 29.5 /qCpZn… 87
3 616037 FALSE 1 62561 Acting Tess… Tessa … 34.6 /fycqd… 88
4 616037 FALSE 2 55934 Directing Taik… Taika … 41.6 /tQeio… 89
5 616037 FALSE 1 524 Acting Nata… Natali… 65.1 /xcbZj… 86
6 616037 FALSE 1 59817 Acting Jaim… Jaimie… 33.6 /hO8AD… 91
# … with 3 more variables: character <chr>, credit_id <chr>, order <int>, and
# abbreviated variable names ¹known_for_department, ²original_name,
# ³popularity, ⁴profile_path
# ℹ Use `colnames()` to see all variable names
Now that I have the cast data I can join it together with the rest of the movie data.
Before I move on to the last step, I’d like to answer a couple more questions. First up, what actor has been in the most superhero movies? This question should be pretty easy to answer for any MCU fan.
# A tibble: 6 × 2
actor n
<chr> <int>
1 Stan Lee 43
2 Samuel L. Jackson 15
3 Chris Evans 13
4 Jon Favreau 10
5 Robert Downey Jr. 10
6 Chris Hemsworth 9
# A tibble: 1 × 4
title release_date actor character
<chr> <date> <chr> <chr>
1 Avengers: Endgame 2019-04-24 Stan Lee Driver
Stan Lee has made a cameo in nearly every Marvel film with his last being Avengers: Endgame. His cameos were always a great touch in the films.
All of this exploration has been fun, but I still need to answer my original question: How many popular actors have been in a superhero film?
First I need to identify these actors.
pop_actor_page_n <- request("https://api.themoviedb.org/3/") %>%
req_url_path_append(glue("person/popular")) %>%
req_url_query(api_key = keyring::key_get(service = "TMDb_API"),
language = "en-US") %>%
req_perform() %>%
resp_body_string() %>%
fromJSON() %>%
pluck("total_pages")
get_pop_actors <- function(page_num) {
request("https://api.themoviedb.org/3/") %>%
req_url_path_append(glue("person/popular")) %>%
req_url_query(api_key = keyring::key_get(service = "TMDb_API"),
language = "en-US",
page = page_num) %>%
req_perform() %>%
resp_body_string() %>%
fromJSON() %>%
pluck("results")
}
pop_actors_lst <- map(1:pop_actor_page_n, ~get_pop_actors(.x))# A tibble: 10 × 2
popularity name
<dbl> <chr>
1 9.93 Ryan O'Neal
2 19.7 Brad Dourif
3 23.4 Thomas Doherty
4 15.1 Teagan Croft
5 27.8 Taylor Lautner
6 9.52 Casimere Jollette
7 13.9 Julie Peters
8 9.48 Karlina Zhang
9 16.6 Sam Claflin
10 11.0 Jacqueline Pinol
This list is great but it contains some popular people that I don’t want. I only want to focus on current, popular, movie actors. I decided to filter for actors who have been in a American movie since 2005 and who are in the 75th percentile based upon the popularity metric.
top_pop_actors_df <- pop_actors_df %>%
# Unnest the known_for data frames for each actor
unnest(known_for, names_repair = "unique") %>%
rename(actor = 23, actor_id = 3,
movie_id = 7) %>%
# Removes missing release date rows
filter(release_date > 0) %>%
mutate(release_date = date(release_date)) %>%
group_by(actor) %>%
# Filters for the newest known_for item for each actor
slice_max(order_by = release_date, n = 1) %>%
ungroup() %>%
filter(release_date > "2005-01-01", media_type == "movie",
known_for_department == "Acting",
original_language == "en") %>%
# Creates a quantile value for each actor based upon their popularity
mutate(q90th = quantile(popularity, probs = 0.9)) %>%
filter(popularity >= q90th) %>%
select(actor, actor_id)
top_pop_actors_df %>%
slice_sample(n = 10)# A tibble: 10 × 2
actor actor_id
<chr> <int>
1 Nicholas Galitzine 1425934
2 James Purefoy 17648
3 Dax Shepard 51298
4 Dominic Cooper 55470
5 Tyler Hoechlin 78198
6 Michael Rooker 12132
7 Alexandra Daddario 109513
8 James Marsden 11006
9 Doug Jones 17005
10 Eiza González 1222992
After filtering I’ve narrowed the pool down to the top 578 actors. A sample of this pool should show more recognizable names.
I can use the setdiff() function to identify the popular actors who have never been in a superhero film.
After a little work I have my very rough estimate of how many popular actors have been in a superhero film, 38%.
I hope you enjoyed my unnecessarily deep look into the popularity of superhero movies. Thanks for checking out my post!!!