Code
library(dplyr)
library(rvest)
library(lubridate)
library(purrr)
library(stringr)
library(data.table)
library(tidyr)
library(ggplot2)
library(scales)
library(ggsci)
Matthew Harris
February 11, 2021
This analysis was originally done in February of 2021 but I unfortunately didn’t save the data that I scrapped when it was run.
Like many people I have failed to successfully order a PS5 from any of the major electronic retailers in America. I’ve only attempted a couple of times but it appears as though the consoles sell out within minutes of being available. Since I don’t have a PS5 to distract me in my free time I decided to make a quick post on the launch so far. The data that I am using is pulled from the nowinstock.net website. This site is used to track item inventory for various products. The site contains a table of all previous PS5 stock information going all the way back to when the console was open for pre-order. # Packages
The first step is to scrape the table from the site. I prefer to use APIs to access data from websites but there isn’t one available for this site. Luckily the data can be scrapped in a single call and doesn’t require me to do any pagination.
# A tibble: 6 × 3
...1 date_time status
<dbl> <chr> <chr>
1 1 Fri Aug 12 2022 - 5:10 PM EST "Amazon - Console Preorder for\n$?"
2 2 Fri Aug 12 2022 - 4:26 PM EST "Amazon - Console Out of Stock"
3 3 Fri Aug 12 2022 - 1:55 PM EST "Sam's Club - Bundle: Disc Horizon Forbid…
4 4 Thu Aug 11 2022 - 6:25 AM EST "Amazon - Console Preorder for\n$?"
5 5 Thu Aug 11 2022 - 6:14 AM EST "Amazon - Console Out of Stock"
6 6 Tue Aug 09 2022 - 9:25 AM EST "Amazon - Bundle: Digital Horizon Forbidd…
Great. I now have a data frame that contains all of the stock information. Next I need to clean the data to make it a little easier to use. The date_time column should be pretty easy to reformat but the status column will need a little more attention. The goal is to extract the name of the retailer, console details, retail price, and the time that the console was in/out of stock. All of this can be achieved using regular expression and functions from the {stringr} and {lubridate} packages.
ps5_stock_data_clean <- ps5_stock_data %>%
mutate(sale_date = parse_date_time(date_time, "%b %d %Y - %H:%M %p"),
across(.cols = sale_date, .fns = c(day = day, mon = month, yr = year,
hr = hour, mnt = minute),
.names = "sale_{.fn}"),
retailer = str_extract(status, ".*(?=\\s-)"),
stock_status = if_else(str_detect(status, "In Stock"), TRUE, FALSE),
product_type = str_extract(status, "(?<=\\s-\\s).*"),
product_type = str_extract(product_type, ".*(?=\\sOut|In)"),
product_type = str_remove_all(product_type, "\\s|:"),
retail_price = as.double(str_extract(status, "(?<=\\$)\\d*\\.\\d*"))) %>%
select(-c(date_time, status))
ps5_stock_data %>%
glimpse()
Rows: 600
Columns: 3
$ ...1 <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 1…
$ date_time <chr> "Fri Aug 12 2022 - 5:10 PM EST", "Fri Aug 12 2022 - 4:26 PM …
$ status <chr> "Amazon - Console Preorder for\n$?", "Amazon - Console Out o…
I’ve decided to use the roll feature of the data.table join methods. This allows me to the in-stock and out-of-stock times that are closest to each other based upon the same retailer, product type and date. It’s not the best way to match the sales but it works for now.
ps5_in_stock <- ps5_stock_data_clean %>%
filter(sale_date >= "2020-11-11", stock_status == TRUE) %>%
mutate(in_date = sale_date) %>%
select(-stock_status) %>%
as.data.table()
setkey(ps5_in_stock, retailer, product_type, sale_date)
ps5_out_stock <- ps5_stock_data_clean %>%
filter(sale_date >= "2020-11-11", stock_status == FALSE) %>%
mutate(out_date = sale_date) %>%
select(sale_date, retailer, product_type, out_date) %>%
as.data.table()
setkey(ps5_out_stock, retailer, product_type, sale_date)
ps5_sales <- ps5_in_stock[ps5_out_stock, roll = TRUE] %>%
as_tibble() %>%
na.omit() %>%
mutate(sale_length = as.integer(out_date - in_date),
sale_date = as.Date(sale_date))
head(ps5_sales)
# A tibble: 6 × 13
...1 sale_date sale_day sale_mon sale_yr sale_hr sale_mnt retailer product…¹
<dbl> <date> <int> <int> <int> <int> <int> <chr> <chr>
1 505 2021-09-02 2 9 2021 11 9 Amazon ConsoleD…
2 73 2022-06-07 7 6 2022 1 8 Best Buy BundleDi…
3 75 2022-06-07 7 6 2022 12 8 Best Buy BundleDi…
4 75 2022-07-12 7 6 2022 12 8 Best Buy BundleDi…
5 49 2022-07-12 12 7 2022 2 6 Best Buy BundleDi…
6 47 2022-07-12 12 7 2022 2 45 Best Buy BundleDi…
# … with 4 more variables: retail_price <dbl>, in_date <dttm>, out_date <dttm>,
# sale_length <int>, and abbreviated variable name ¹product_type
# ℹ Use `colnames()` to see all variable names
I’ve got all of my data cleaned and formatted the way that I want, so what can I learn? The plot below details the sales from each retailer. The y axis measures how long the console took to sell out and the price of console is mapped to the size. From the plot I’m able to see that GameStop primarily sells more expensive bundles and most consoles sell out within 2 hours of going on sale.
ps5_sales %>%
arrange(desc(sale_date)) %>%
filter(sale_length < 86400) %>%
mutate(sale_length_minutes = sale_length / 60) %>%
ggplot(aes(sale_date, sale_length_minutes,
col = retailer, size = retail_price)) +
geom_point(alpha = 0.7) + scale_y_continuous(breaks = breaks_width(5)) +
scale_x_date(breaks = breaks_width("3 month"), date_labels = "%b %Y") +
scale_y_continuous(breaks = breaks_width(60)) +
scale_color_npg() +
theme(axis.text.x = element_text(angle = 45)) +
labs(x = "Sale Date", y = "Sale Length (minutes)",
col = "Retailer", size = "Retail Price ($)", title = "PS5 Sales by Retailer", subtitle = "Sale Length and Price") +
theme_minimal(base_size = 14) +
theme(plot.title = element_text(face = "bold", size = 15),
plot.subtitle = element_text(size = 12))