Arlington County offers a wealth of data for free through their Open Data Portal. What better way to learn more about the area that I live than through local data analysis. This post will cover dog permit registration within the county.

Packages

Code

xfun::pkg_attach(c("tidyverse", "httr2", "jsonlite", "ggsci",
                   "cowplot", "here",
                   "lubridate", "glue", "scales",
                   "janitor", "skimr", "tigris"))

API Calls

I’ll start by creating a function that creates an API call.

Code

# The order by argument is necessary to ensure that duplicate rows aren't
# introduced when using pagination
arl_odp <- function(resource, ..., skip = 0, top = 10, orderby = NULL, f_select = NULL, r_filter = NULL) {
  params <- list(
    skip = skip,
    top = top,
    orderby = orderby,
    select = f_select,
    filter = r_filter
  ) %>%
    set_names(nm = paste0("$", names(.)))
  
  arl_resp <- request("https://datahub-v2.arlingtonva.us/api/") %>%
    req_url_path_append(resource) %>%
    req_url_query(!!!params) %>%
    req_perform() %>%
    resp_body_string() %>%
    fromJSON()
  
  if(is.data.frame(arl_resp)) {
    return(arl_resp %>% clean_names())
  } else {
    return()
  }

}

The functions works but the data types for the columns will need to be updated. I’ll iterate over chunks of the data instead of querying everything at once. I’m choosing to use the purr map() function for this part of the process.

Code

dog_permit_df <- map2_df(
  .x = seq(0, 10000, 1000),
  .y = 1000,
  .f = ~ arl_odp(
    resource = "Tax/CurrentDogLicense",
    skip = .x,
    top = .y,
    orderby = "currentDogLicenseKey"))

Data Wrangling

Great. I’ve got all of the data. Let’s examine it with the glimpse() function see what we ended up with.

Code

dog_permit_df %>%
  glimpse(width = 50)

Rows: 6,250
Columns: 8
$ current_dog_license_key <int> 1, 2, 3, 4, 5, 6…
$ owner_address_text      <chr> "10 BENDING OAK …
$ payment_date            <chr> "2020-01-29T00:0…
$ license_term_code       <chr> "LIFETIME", "LIF…
$ valid_through_date      <chr> "2023-01-18T00:0…
$ dog_tag_nbr             <chr> "003375L", "0059…
$ dog_sex_code            <chr> "MALE", "FEMALE"…
$ spayed_neutered_ind     <chr> "YES", "NO", "YE…

It looks like we have 6,250 rows. I’m assuming the payment_date column refers to when the dog permit registration was submitted. There isn’t anything super interesting about the dogs other than the genders and their spayed/neutered status at registration. The owner_address_text column also contains zip code information that can be extracted using regular expressions. Before I can start summarizing or visualizing anything I need to perform some data cleansing/transformations.

Code

dog_permit_clean_df <- dog_permit_df %>%
  clean_names() %>%
  rename(owner_address = owner_address_text, license_term = license_term_code,
         dog_gender = dog_sex_code, sn_ind = spayed_neutered_ind) %>%
  select(owner_address, payment_date, license_term, dog_gender,
         sn_ind)

dog_permit_clean_df <- dog_permit_clean_df %>%
  mutate(reg_date = parse_date_time(payment_date, orders = "%Y-%m-%d%H:%M:%S"),
         across(.cols = reg_date, 
                .fns = list(y = year, m = month, wd = weekdays), 
                .names = "reg_{fn}"),
         reg_ym = glue("{reg_y}{str_pad(reg_m, 2, 'left', '0')}"),
         curr_ym = glue("{year(Sys.Date())}{str_pad(month((Sys.Date())), 2, 'left', '0')}"),
         owner_zip = as.factor(str_extract(owner_address,
                                           "(?<=VA\\s)\\d{5}")),
         sn_ind = if_else(sn_ind == "YES", TRUE, FALSE)) %>%
  filter(str_detect(owner_address, "ARLINGTON")) %>% 
  na.omit()

dog_permit_clean_df %>%
  head()

                                        owner_address             payment_date
1            100 N GRANADA ST ARLINGTON VA 22203-1227 2021-10-12T00:00:00.000Z
2               100 S PARK DR ARLINGTON VA 22204-1354 2020-04-09T00:00:00.000Z
3  1000 N RANDOLPH ST APT 509 ARLINGTON VA 22201-5628 2022-05-06T00:00:00.000Z
4  1000 N RANDOLPH ST APT 909 ARLINGTON VA 22201-5629 2020-10-13T00:00:00.000Z
5 1001 N RANDOLPH ST APT 1022 ARLINGTON VA 22201-5610 2020-07-17T00:00:00.000Z
6  1001 N RANDOLPH ST APT 209 ARLINGTON VA 22201-5603 2022-02-07T00:00:00.000Z
  license_term dog_gender sn_ind   reg_date reg_y reg_m   reg_wd reg_ym curr_ym
1     LIFETIME       MALE   TRUE 2021-10-12  2021    10  Tuesday 202110  202208
2     LIFETIME       MALE  FALSE 2020-04-09  2020     4 Thursday 202004  202208
3     LIFETIME     FEMALE  FALSE 2022-05-06  2022     5   Friday 202205  202208
4     LIFETIME       MALE   TRUE 2020-10-13  2020    10  Tuesday 202010  202208
5     LIFETIME       MALE   TRUE 2020-07-17  2020     7   Friday 202007  202208
6     LIFETIME       MALE   TRUE 2022-02-07  2022     2   Monday 202202  202208
  owner_zip
1     22203
2     22204
3     22201
4     22201
5     22201
6     22201

Visualization

There are a lot of cool visualization that can be created from this new tidied data frame. I’ll start by graphing the frequency of dog registrations by month for each year.

Code

dog_permit_clean_df %>%
  filter(reg_ym != curr_ym, reg_date >= "2019-06-01") %>%
  count(reg_m, reg_y) %>%
  mutate(reg_y = factor(reg_y,
                           levels = seq(max(dog_permit_clean_df$reg_y),
                                        min(dog_permit_clean_df$reg_y)))) %>%
  ggplot(aes(reg_m, n,
             col = reg_y)) +
  geom_line(size = 1.1) +
  geom_point(size = 3L, col = "white") +
  geom_point(size = 1.9) +
  scale_x_continuous(breaks = breaks_width(1)) +
  scale_color_npg() +
  labs(x = "Month", y = "n", col = "Year",
       title = "Arlington County Dog Permit Registrations",
       caption = glue("Arlington County Data API was used \\
                      for this purpose, but it is not endorsed \\
                      or certified by Arlington County.")) +
  theme_minimal(base_size = 14) +
  theme(text = element_text(family = "Courier"),
        plot.title = element_text(hjust = 0.5, size = 15),
        plot.caption = element_text(hjust = 0.45, size = 7))

I chose to overlay each year to get a better sense of how 2020 registrations compare to prior years. It looks as though there was a spike in June of 2020. This spike could be attributed to a lag in reporting due to Covid. Or the spike could be a genuine representation of Arlingtonians’ increased desire to be dog owners during the pandemic. I’m leaning towards the latter given that the amount of registrations from January to May are still above previous year.

Next I’d like to know where the dog are registered the most.

Code

arlington_sf <- tigris::zctas(starts_with = c("222"),
                              state = "virginia", cb = FALSE, year = 2010)

Code

arlington_sf %>%
  rename(zip_code = ZCTA5CE10) %>%
  left_join(dog_permit_clean_df %>%
              filter(reg_ym != curr_ym) %>%
              count(owner_zip),
            by = c("zip_code" = "owner_zip")) %>%
  ggplot() + geom_sf(aes(fill = n)) +
  geom_sf_label(aes(label = zip_code)) +
  theme(text = element_text(family = "Courier"),
        plot.title = element_text(hjust = 0.5, size = 15),
        axis.title = element_blank(),
        axis.text = element_blank(),
        axis.ticks = element_blank(),
        axis.line = element_blank(),
        plot.caption = element_text(hjust = 0.35, size = 7)) +
  scale_fill_gradient(low = "#86C0EA", high = "#0A2940") +
  labs(title = "Dog Permit Registrations by Zip",
       fill = "",
       caption = glue("Arlington County Data API was used \\
                      for this purpose, but it is not endorsed \\
                      or certified by Arlington County."))