{purrr} and Excel

Author

Matthew Harris

Published

June 28, 2020

Abstract
Using {purrr} to automate Excel document creation.

Goals

Being able to export and share your work is a key step in the data analysis process. In this post I’ll walk through the process of performing some group level summary statistics and exporting those grouped data frames to separate tabs within an Excel document and separate Excel workbooks.

Packages

The key package for this demonstration will be writexl. It allows you to read/create Excel files without having to install other Java or Perl dependencies.

Code
library(dplyr)
library(readr)
library(tidyr)
library(purrr)
library(writexl)
library(lubridate)

Data Import

The data used for this analysis can be found at the following link.

Formula One Data

Data Wrangling

I’ll be using a data set that contains performance statistics on Formula 1 drivers.

Code
f1_summary <- f1_data %>% 
  # Removes special characters from driver names
  mutate(across(.cols = c("forename", "surname"), 
            .fns = ~iconv(x = .x, from = "UTF-8", 
                   to = "ASCII", ""))) %>%
  filter(race_year >= 2000) %>%
  mutate(podium = if_else(fl_position <= 3, 
                          TRUE, FALSE)) %>%
  group_by(driverId, race_year, forename, surname) %>%
  summarize(total_races = n(),
            fasest_lap = min(fastestLapTime, na.rm = TRUE),
            total_podiums = sum(podium, na.rm = TRUE),
            total_points = sum(points, na.rm = TRUE), .groups = "drop") %>%
  ungroup()

head(f1_summary)
# A tibble: 6 × 8
  driverId race_year forename surname  total_races fasest_lap  total_p…¹ total…²
     <dbl>     <dbl> <chr>    <chr>          <int> <drtn>          <int>   <dbl>
1        1      2007 Lewis    Hamilton          17 72.506 secs        14     109
2        1      2008 Lewis    Hamilton          18 74.159 secs        12      98
3        1      2009 Lewis    Hamilton          17 74.345 secs         7      49
4        1      2010 Lewis    Hamilton          19 73.851 secs         9     240
5        1      2011 Lewis    Hamilton          19 77.209 secs         8     227
6        1      2012 Lewis    Hamilton          20 77.020 secs        10     190
# … with abbreviated variable names ¹​total_podiums, ²​total_points

Splitting and Exporting the Data to Separate Tabs

Now that the table is in the desired format I can split it based upon the race_year variable using the group_split function. This allows me to separate the data and keep the grouping variable in the split tables. In order for the sheet naming to work correctly I also need to name the list of data frames that I have created. This is done by using the set_names function. I’m choosing to use the race years to create the names for the list. The final step is to pipe the named list of data frames into the write_xlsx function where it it used to create the Excel workbook with it’s named tabs.

Code
f1_summary %>%
  group_split(race_year) %>%
  set_names(map_chr(.x = .,
                    .f = ~ .$race_year %>%
                      unique() %>%
                      as.character())) %>%
  write_xlsx(path = "annual_f1_data.xlsx")

Exporting to Multiple Excel Files

A similar process is used to export the data frames into separate Excel files. This time I will be nesting the data and mapping over the data frames to create each file. The nest function allows me to use the grouping variable, in this case race_year, to name each Excel file.

Code
f1_summary %>%
  mutate(ry = race_year) %>%
  nest(data = -race_year) %>% 
  arrange(race_year) %>%
  walk2(.x = .$data, .y = .$race_year,
      .f = ~write_xlsx(x = .x,
                   path = paste0("f1_data_", .y, ".xlsx")))