Intro

Lately I have been doing some consulting work on the topic of commercial bank feasibility in the state of Idaho. My client is interested in identifying potentially under-served markets they could expand into. While I won’t publish specific data concerning my client, I thought this would be a good opportunity to play with some public data published by the Bureau of Labor Statistics, the US Census Bureau, the FDIC, and others.

In this analysis I will rely heavily on library(purrr), library(blscrapeR), and library(tidycensus). I will also, at some point, need library(rlang)’s enquo() and !! to keep my code reproducible and concise. For an introduction to library(rlang), you should first consult Programming with dplyr. While tidyeval may seem daunting at first, it is a powerful tool that every data scientist should have in their toolbox.

Thanks for reading, and if you have any questions please feel free to reach out to me on Twitter: @dylanjm_ds

All I See are Quotients

Here are the libraries we’ll need for this analytic foray.

library(blscrapeR)
library(tidyverse)
library(albersusa)
library(ggalt)
library(wesanderson)
library(cowplot)
library(tidycensus)

One of the first things that pops into my mind when thinking about potential market opportunities is the Location Quotient. The Location Quotient is defined as follows:

\[ LQ = \dfrac{e_{i}/e}{E_{i}/E} \]

Where:

\[
\begin{aligned}
e_{i} &= \text{regional employment in sector } i \\
e &= \text{total regional employment} \\
E_{i} &= \text{national employment in sector } i \\
E &= \text{total national employment}
\end{aligned}
\]

This ratio can provide a lot of insight into the saturation of a particular sector in any given area of the United States. As ratios go, a value above 1 suggests the sector is more concentrated in that market than in the nation as a whole (potential over-saturation), while a value below 1 suggests the sector is under-represented locally. What I want to do is look at the Location Quotient for Commercial Banking, Credit Unions, and Mortgage Loan Brokers (NAICS codes 5221, 52213, and 52231, respectively) for each county in Idaho.
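To make the ratio concrete, here is a toy calculation with made-up numbers (purely illustrative, not actual BLS figures): a county with 500 banking jobs out of 20,000 total jobs, in a nation with 2,000,000 banking jobs out of 150,000,000 total.

e_i <- 500    # regional employment in the sector
e   <- 20000  # total regional employment
E_i <- 2e6    # national employment in the sector
E   <- 150e6  # total national employment

(e_i / e) / (E_i / E)
## [1] 1.875

An LQ of 1.875 means banking employment is almost twice as concentrated in that county as it is nationally. In practice I won’t compute this by hand, since the QCEW data below already ships with location quotients calculated.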

Let’s write a function in combination with library(blscrapeR) to get all the data we need at once from the BLS’s Quarterly Census of Employment and Wages (QCEW).

# The original blscrapeR::qcew_api() is not really suited to take multiple years
# or NAICS codes at once. Here is a function that grabs it all!
get_industry_qcew <- function(years, niac_codes, qtr = "a"){
  # We want the years to repeat for each industry we're looking for
  year_rep <- rep(years, length(niac_codes))
  # We also want each code to be repeated for each year of interest
  niac_each <- rep(niac_codes, each = length(years))
  
  # Utilize purrr::map2() and purrr::possibly() to either grab our data 
  # or carry on when it hits an error
  industry_dat <- map2(year_rep, niac_each,
                       possibly(
                         ~ qcew_api(year = .x, qtr = qtr, 
                                    slice = "industry", sliceCode = .y),
                         otherwise = NULL
                       )
  ) %>% 
    compact() %>% # remove any NULL values from our list
    map_df(bind_rows) # create one large data set 
  return(industry_dat)
}

The above function can take multiple years and multiple industry codes and return the proper data. If you remember from one of my previous blog posts, Using purrr to wrangle and clean economic data, I used the possibly() function there as well, to ensure that my function would not stop when it hit an error while gathering the data. Since I’m only interested in annual averages right now, I’ll go ahead and run the function:

qcew_dat <- get_industry_qcew(2013:2016, c(5221, 52213, 52231))
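Because possibly() silently drops any request that errors out, a quick (optional) check is to confirm which year and industry combinations actually came back:

qcew_dat %>% 
  distinct(year, industry_code) %>% 
  arrange(industry_code, year)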

A quick glimpse() at the data shows the number of observations and variables I’ll be working with.

## Observations: 30,711
## Variables: 38
## $ area_fips                        <chr> "01000", "01000", "01001", "0...
## $ own_code                         <int> 3, 5, 5, 3, 5, 5, 5, 5, 5, 5,...
## $ industry_code                    <int> 5221, 5221, 5221, 5221, 5221,...
## $ agglvl_code                      <int> 56, 56, 76, 76, 76, 76, 76, 7...
## $ size_code                        <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ year                             <int> 2013, 2013, 2013, 2013, 2013,...
## $ qtr                              <chr> "A", "A", "A", "A", "A", "A",...
## $ disclosure_code                  <chr> "N", "", "", "N", "", "", "",...
## $ annual_avg_estabs                <int> 1, 2044, 25, 1, 109, 16, 8, 1...
## $ annual_avg_emplvl                <int> 0, 31836, 200, 0, 782, 126, 5...
## $ total_annual_wages               <dbl> 0, 1768643334, 8713843, 0, 35...
## $ taxable_annual_wages             <dbl> 0, 284144494, 1621091, 0, 656...
## $ annual_contributions             <int> 0, 6261283, 35548, 0, 168962,...
## $ annual_avg_wkly_wage             <int> 0, 1068, 836, 0, 863, 742, 54...
## $ avg_annual_pay                   <int> 0, 55555, 43497, 0, 44880, 38...
## $ lq_disclosure_code               <chr> "N", "", "", "N", "", "", "",...
## $ lq_annual_avg_estabs             <dbl> 15.86, 1.48, 2.53, 350.90, 1....
## $ lq_annual_avg_emplvl             <dbl> 0.00, 1.34, 1.52, 0.00, 0.97,...
## $ lq_total_annual_wages            <dbl> 0.00, 1.38, 1.53, 0.00, 1.05,...
## $ lq_taxable_annual_wages          <dbl> 0.00, 1.26, 1.46, 0.00, 0.81,...
## $ lq_annual_contributions          <dbl> 0.00, 1.15, 1.46, 0.00, 0.94,...
## $ lq_annual_avg_wkly_wage          <dbl> 0.00, 1.03, 1.01, 0.00, 1.08,...
## $ lq_avg_annual_pay                <dbl> 0.00, 1.03, 1.01, 0.00, 1.08,...
## $ oty_disclosure_code              <chr> "N", "", "", "N", "", "", "",...
## $ oty_annual_avg_estabs_chg        <int> 0, 31, 0, 0, -1, 1, 0, 1, 0, ...
## $ oty_annual_avg_estabs_pct_chg    <dbl> 0.0, 1.5, 0.0, 0.0, -0.9, 6.7...
## $ oty_annual_avg_emplvl_chg        <int> 0, -114, -5, 0, 0, -6, 0, 6, ...
## $ oty_annual_avg_emplvl_pct_chg    <dbl> 0.0, -0.4, -2.4, 0.0, 0.0, -4...
## $ oty_total_annual_wages_chg       <dbl> 0, 40036196, 1052977, 0, -767...
## $ oty_total_annual_wages_pct_chg   <dbl> 0.0, 2.3, 13.7, 0.0, -2.1, -1...
## $ oty_taxable_annual_wages_chg     <int> 0, -7696811, 204295, 0, -8503...
## $ oty_taxable_annual_wages_pct_chg <dbl> 0.0, -2.6, 14.4, 0.0, -11.5, ...
## $ oty_annual_contributions_chg     <int> 0, -2014732, -1282, 0, -65727...
## $ oty_annual_contributions_pct_chg <dbl> 0.0, -24.3, -3.5, 0.0, -28.0,...
## $ oty_annual_avg_wkly_wage_chg     <int> 0, 28, 116, 0, -19, 25, 27, 3...
## $ oty_annual_avg_wkly_wage_pct_chg <dbl> 0.0, 2.7, 16.1, 0.0, -2.2, 3....
## $ oty_avg_annual_pay_chg           <int> 0, 1451, 6081, 0, -977, 1297,...
## $ oty_avg_annual_pay_pct_chg       <dbl> 0.0, 2.7, 16.3, 0.0, -2.1, 3....

According to the field layout guide on the BLS website, lq_annual_avg_estabs is the variable I am looking for. Since I want to make a heat map of the LQ data by county, I’m going to go ahead and clean this data up a little bit:

qcew_dat_clean <- qcew_dat %>% 
  # Reorder columns and drop unwanted variables
  select(area_fips, year, qtr, 
         industry_code, everything(),
         -own_code, -agglvl_code, -size_code, -disclosure_code,
         -lq_disclosure_code, -oty_disclosure_code)

Now, I want to make my data set interpretable so that when I hand it to my client the file is easily digestible. So I’m going to create a County and State column by merging my data with a predefined data frame that comes with library(blscrapeR). I am also going to add a column with a verbal description of each NAICS code, which will make titling my plots easier later.

idaho_dat <- qcew_dat_clean %>% 
  left_join(area_titles, by = "area_fips") %>% 
  separate(area_title, c("county", "state"), sep = ",") %>% 
  mutate_at(vars(state, industry_code), trimws) %>% 
  left_join(niacs, by = "industry_code") %>% 
  mutate_at(vars(industry_title), as.character) %>% 
  select(area_fips, state, county, 
         year, qtr, industry_code, industry_title, everything()) %>% 
  filter(state %in% "Idaho")
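As a rough sanity check: Idaho has 44 counties, so each industry and year combination should have at most 44 rows; counties where the QCEW suppresses or simply doesn’t report these industries will be missing. An optional count makes that easy to eyeball:

idaho_dat %>% 
  count(industry_code, year)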

At this point, I’m dying to just get some plots printed (let’s be real: I wrote this code iteratively to make it as concise as possible and have already seen 1,000 plots of this data, but let’s suspend our disbelief so it feels natural). What I need to do now is leverage library(albersusa) to set up my mapping data.

# This will give me all of the lat/long geometries for each county in Idaho
us <- counties_composite()
us_map <- broom::tidy(us, region = "fips") %>% 
  filter(id %in% idaho_dat$area_fips)

Let’s take a look at how the Commercial Banking (Depository Credit Intermediation) LQ has done over the last few years in Idaho:

com_banking <- idaho_dat %>% 
  filter(industry_code %in% "5221")

ggplot() +
    geom_map(data = us_map, map = us_map,
             aes(long, lat, map_id = id),
             color = "black", fill = NA) + 
    geom_map(data = com_banking, map = us_map,
             aes(fill = lq_annual_avg_estabs, map_id = area_fips)) + 
    scale_fill_gradientn(colors = wes_palettes$Zissou1) +
    coord_map() + 
    facet_grid(.~ year) + 
    labs(title = glue::glue("{tools::toTitleCase(com_banking$industry_title)}"),
         fill = "LQ", x = NULL, y = NULL) +
    theme_minimal() +
    theme(strip.text = element_text(face = "bold"),
          axis.title = element_blank(),
          axis.text = element_blank(),
          legend.title = element_text(face = "bold"),
          plot.title = element_text(family = "Helvetica Bold Oblique"),
          plot.margin=grid::unit(c(0,0,0,0), "mm"))

Wow, okay, so quite a few counties in Idaho rank high on their LQ, but there are a few counties that show potential, and it appears that over time some markets have shifted away from being over-saturated and vice versa. While this is interesting to look at, let’s take a look at Credit Unions and Mortgage Brokers in Idaho over time.

Enquo the Quotient

For this next section, instead of repeating the code three times to create all the plots I need, I am going to take a moment to refactor it into a function. That function can then be passed to purrr::map() to create a list of plots that I can arrange into a grid using library(cowplot). I also want to write it so that changing the fill variable is easy, which is where tidyeval comes in.

# Not the prettiest function; function design is something I'm still learning.
# My gut feeling is that I shouldn't assume scope when writing a function,
# so I don't want to call variables from my global environment.
# Instead, I pass this function everything I have created thus far.
plot_industry_map <- function(dat, map_dat, ind_code, fill_var, fill_pall = "Zissou1"){
  # We want to plot a specific industry at a time
  ind_dat <- dat %>% 
    filter(industry_code %in% ind_code)
  
  # Here is where we use tidyeval to enquo the fill variable.
  fill_var_enc <- enquo(fill_var)
  
  # Here is the plot in question
  ggplot() +
    geom_map(data = map_dat, map = map_dat,
             aes(long, lat, map_id = id),
             color = "black", fill = NA) + 
    geom_map(data = ind_dat, map = map_dat,
             aes(fill = !!fill_var_enc, map_id = area_fips)) + 
    scale_fill_gradientn(colors = wes_palette(fill_pall)) +
    coord_map() + 
    facet_grid(.~ year) + 
    labs(title = glue::glue("{tools::toTitleCase(ind_dat$industry_title)}"),
         fill = "LQ") +
    theme_minimal() +
    theme(strip.text = element_text(face = "bold"),
          axis.title = element_blank(),
          axis.text = element_blank(),
          legend.title = element_text(face = "bold"),
          plot.title = element_text(family = "Helvetica Bold Oblique"))
}
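
Just to make sure the tidyeval plumbing works, the function can be tested on a single industry first, for example Credit Unions:

plot_industry_map(idaho_dat, us_map, "52213", lq_annual_avg_estabs)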

Now I should be able to call my function and create a list of plots!

plts <- map(c("5221","52213","52231"), plot_industry_map, 
            dat = idaho_dat, 
            map_dat = us_map, 
            fill_var = lq_annual_avg_estabs
)

cowplot::plot_grid(plotlist = plts, ncol = 1) %>% 
  add_sub(glue::glue("Source: Bureau of Labor Statistics QCEW Database",
                     "\n*Data not available for years with non-filled counties"), 
          y  = .5, hjust = -.8, size = 8, fontface = "italic") %>%
ggdraw()

Perfect. At this point I can see the saturation of each market across several industries. Don’t let the colors fool you, either: if you look at the scale on each map, you can see that credit unions are a highly saturated market in Idaho.

Let’s take a look at where all these banking institutions actually are. Using the FDIC Data Download tools, we can map each bank’s physical location, and at the same time we can look at median household income per county, pulled from the US Census Bureau’s American Community Survey via library(tidycensus):

idaho_banks <- read_csv(here::here("Data/ID_2017_Data.csv")) %>% 
  filter(SIMS_LONGITUDE < -1)

idaho_median_income <- get_acs(geography = "county", 
        variables = c(medincome = "B19013_001"), 
        state = "ID", geometry = TRUE)
ggplot() +
  geom_sf(data = idaho_median_income, aes(fill = estimate)) +
  geom_point(data = idaho_banks, 
             aes(y = SIMS_LATITUDE, x = SIMS_LONGITUDE),
             color = "grey95", size = .8) + 
  coord_sf(datum = NA) + 
  scale_fill_gradientn(colors = wes_palettes$Rushmore1,
                       labels = scales::dollar) + 
  labs(title = "Median Income Across Counties in Idaho",
       subtitle = "Each white dot represents a physical branch\nlocation of a bank registered by the FDIC, updated June 2017",
       fill =  "Estimate") + 
  theme(axis.title = element_blank(),
        plot.subtitle = element_text(face = "italic"))

As we can see, there are a few counties where median income is high but there are not many financial institutions. These could be prime opportunities for a bank to expand.

I hope you enjoyed this small tutorial on taking secondary data and turning it into valuable graphics that can be easily digested. While this would only be a small part of the consulting process for my client, I have created a reproducible analysis that, with only a few small changes, can produce this exact set of graphics for any state in the country. I have also been able to bypass Excel and the potential overhead of data sets that are too large for it to handle. If you have any questions or would like to see more, please feel free to ask! You can find a gist of my script on GitHub.