Code
::pkg_attach(c("tidyverse", "httr2", "jsonlite", "ggsci",
xfun"cowplot", "here",
"lubridate", "glue", "scales",
"janitor", "skimr", "tigris"))
Matthew Harris
February 9, 2021
Arlington County offers a wealth of data for free through their Open Data Portal. What better way to learn more about the area that I live than through local data analysis. This post will cover dog permit registration within the county.
I’ll start by creating a function that creates an API call.
# The order by argument is necessary to ensure that duplicate rows aren't
# introduced when using pagination
arl_odp <- function(resource, ..., skip = 0, top = 10, orderby = NULL, f_select = NULL, r_filter = NULL) {
params <- list(
skip = skip,
top = top,
orderby = orderby,
select = f_select,
filter = r_filter
) %>%
set_names(nm = paste0("$", names(.)))
arl_resp <- request("https://datahub-v2.arlingtonva.us/api/") %>%
req_url_path_append(resource) %>%
req_url_query(!!!params) %>%
req_perform() %>%
resp_body_string() %>%
fromJSON()
if(is.data.frame(arl_resp)) {
return(arl_resp %>% clean_names())
} else {
return()
}
}
The functions works but the data types for the columns will need to be updated. I’ll iterate over chunks of the data instead of querying everything at once. I’m choosing to use the purr map()
function for this part of the process.
Great. I’ve got all of the data. Let’s examine it with the glimpse()
function see what we ended up with.
Rows: 6,250
Columns: 8
$ current_dog_license_key <int> 1, 2, 3, 4, 5, 6…
$ owner_address_text <chr> "10 BENDING OAK …
$ payment_date <chr> "2020-01-29T00:0…
$ license_term_code <chr> "LIFETIME", "LIF…
$ valid_through_date <chr> "2023-01-18T00:0…
$ dog_tag_nbr <chr> "003375L", "0059…
$ dog_sex_code <chr> "MALE", "FEMALE"…
$ spayed_neutered_ind <chr> "YES", "NO", "YE…
It looks like we have 6,250 rows. I’m assuming the payment_date
column refers to when the dog permit registration was submitted. There isn’t anything super interesting about the dogs other than the genders and their spayed/neutered status at registration. The owner_address_text
column also contains zip code information that can be extracted using regular expressions. Before I can start summarizing or visualizing anything I need to perform some data cleansing/transformations.
dog_permit_clean_df <- dog_permit_df %>%
clean_names() %>%
rename(owner_address = owner_address_text, license_term = license_term_code,
dog_gender = dog_sex_code, sn_ind = spayed_neutered_ind) %>%
select(owner_address, payment_date, license_term, dog_gender,
sn_ind)
dog_permit_clean_df <- dog_permit_clean_df %>%
mutate(reg_date = parse_date_time(payment_date, orders = "%Y-%m-%d%H:%M:%S"),
across(.cols = reg_date,
.fns = list(y = year, m = month, wd = weekdays),
.names = "reg_{fn}"),
reg_ym = glue("{reg_y}{str_pad(reg_m, 2, 'left', '0')}"),
curr_ym = glue("{year(Sys.Date())}{str_pad(month((Sys.Date())), 2, 'left', '0')}"),
owner_zip = as.factor(str_extract(owner_address,
"(?<=VA\\s)\\d{5}")),
sn_ind = if_else(sn_ind == "YES", TRUE, FALSE)) %>%
filter(str_detect(owner_address, "ARLINGTON")) %>%
na.omit()
dog_permit_clean_df %>%
head()
owner_address payment_date
1 100 N GRANADA ST ARLINGTON VA 22203-1227 2021-10-12T00:00:00.000Z
2 100 S PARK DR ARLINGTON VA 22204-1354 2020-04-09T00:00:00.000Z
3 1000 N RANDOLPH ST APT 509 ARLINGTON VA 22201-5628 2022-05-06T00:00:00.000Z
4 1000 N RANDOLPH ST APT 909 ARLINGTON VA 22201-5629 2020-10-13T00:00:00.000Z
5 1001 N RANDOLPH ST APT 1022 ARLINGTON VA 22201-5610 2020-07-17T00:00:00.000Z
6 1001 N RANDOLPH ST APT 209 ARLINGTON VA 22201-5603 2022-02-07T00:00:00.000Z
license_term dog_gender sn_ind reg_date reg_y reg_m reg_wd reg_ym curr_ym
1 LIFETIME MALE TRUE 2021-10-12 2021 10 Tuesday 202110 202208
2 LIFETIME MALE FALSE 2020-04-09 2020 4 Thursday 202004 202208
3 LIFETIME FEMALE FALSE 2022-05-06 2022 5 Friday 202205 202208
4 LIFETIME MALE TRUE 2020-10-13 2020 10 Tuesday 202010 202208
5 LIFETIME MALE TRUE 2020-07-17 2020 7 Friday 202007 202208
6 LIFETIME MALE TRUE 2022-02-07 2022 2 Monday 202202 202208
owner_zip
1 22203
2 22204
3 22201
4 22201
5 22201
6 22201
There are a lot of cool visualization that can be created from this new tidied data frame. I’ll start by graphing the frequency of dog registrations by month for each year.
dog_permit_clean_df %>%
filter(reg_ym != curr_ym, reg_date >= "2019-06-01") %>%
count(reg_m, reg_y) %>%
mutate(reg_y = factor(reg_y,
levels = seq(max(dog_permit_clean_df$reg_y),
min(dog_permit_clean_df$reg_y)))) %>%
ggplot(aes(reg_m, n,
col = reg_y)) +
geom_line(size = 1.1) +
geom_point(size = 3L, col = "white") +
geom_point(size = 1.9) +
scale_x_continuous(breaks = breaks_width(1)) +
scale_color_npg() +
labs(x = "Month", y = "n", col = "Year",
title = "Arlington County Dog Permit Registrations",
caption = glue("Arlington County Data API was used \\
for this purpose, but it is not endorsed \\
or certified by Arlington County.")) +
theme_minimal(base_size = 14) +
theme(text = element_text(family = "Courier"),
plot.title = element_text(hjust = 0.5, size = 15),
plot.caption = element_text(hjust = 0.45, size = 7))
I chose to overlay each year to get a better sense of how 2020 registrations compare to prior years. It looks as though there was a spike in June of 2020. This spike could be attributed to a lag in reporting due to Covid. Or the spike could be a genuine representation of Arlingtonians’ increased desire to be dog owners during the pandemic. I’m leaning towards the latter given that the amount of registrations from January to May are still above previous year.
Next I’d like to know where the dog are registered the most.
arlington_sf %>%
rename(zip_code = ZCTA5CE10) %>%
left_join(dog_permit_clean_df %>%
filter(reg_ym != curr_ym) %>%
count(owner_zip),
by = c("zip_code" = "owner_zip")) %>%
ggplot() + geom_sf(aes(fill = n)) +
geom_sf_label(aes(label = zip_code)) +
theme(text = element_text(family = "Courier"),
plot.title = element_text(hjust = 0.5, size = 15),
axis.title = element_blank(),
axis.text = element_blank(),
axis.ticks = element_blank(),
axis.line = element_blank(),
plot.caption = element_text(hjust = 0.35, size = 7)) +
scale_fill_gradient(low = "#86C0EA", high = "#0A2940") +
labs(title = "Dog Permit Registrations by Zip",
fill = "",
caption = glue("Arlington County Data API was used \\
for this purpose, but it is not endorsed \\
or certified by Arlington County."))