A good friend of mine works in the public health sector on food and nutrition policy, specifically access and equity for farmers' markets. I've learned a lot from her about the patterns and trends associated with farmers' markets in California. There's a normative aspect to her work: a belief that localizing supply chains and connecting farmers directly with consumers is going to be a key go-forward strategy for both environmentalism and food access. One of her research programs right now is focused on understanding the barriers discouraging Black communities from patronizing these markets at the same levels that other communities in California do. There's the obvious factor of price, yet that doesn't fully explain the consumption patterns, since there are definitely lower-price-point markets that working-class immigrant enclaves heavily rely on. In addition to price, then, maybe there's a secondary factor of proximity. Regardless, wrapped up in all of this are larger questions around food access for poor Black communities in America, a topic I'm not even close to understanding well enough to have a proper discussion about.
As part of building out my general knowledge there, I've been reading some papers around the discriminatory aspects of food access in America. In parallel, I decided to take a look at the data itself, specifically for San Francisco. Fortunately, the SF City Government has open sourced a lot of interesting datasets, one of which is a collection of all registered businesses in the city (including grocery stores and markets!). Below, I document my efforts in exploring this dataset and understanding a bit more about the trends of food and grocery retail in San Francisco.
Packages for this project. The tidyverse is pretty much default, as is ggplot plus plotting support (palettes/scales/dates). Since we're querying an API, I have httr/jsonlite. (If I were querying a database for work, I'd be looking more towards database clients such as RPostgresql.) ggmap is a bit more use-case dependent, since I know I'll be doing a lot more geospatial analysis here. Bear with me; it's been a very long time since I've done anything with maps and geolocation data.
library(tidyr)
library(dplyr)
library(ggplot2)
library(lubridate)
library(scales)
library(httr)
library(jsonlite)
library(readr)
library(purrr)
library(RColorBrewer)
library(ggmap)
# https://data.sfgov.org/Economy-and-Community/Registered-Business-Locations-San-Francisco/g8m3-pdis
The SF gov has a nicely maintained public API for accessing datasets. You can also just download a csv and read it in manually, but I thought I'd be fancy for reproducibility purposes! A few nuances regarding access patterns:

- I pull the data in 50,000-row pages with $limit/$offset, so we first need the total record count to know how many requests to make
- the location field comes back as nested JSON; when jsonlite parses the .json, it stores these values as a nested dataframe column within the larger dataframe

# check how many total records so we can back into the number of requests we have to make
records <- fromJSON(content(GET("https://data.sfgov.org/resource/g8m3-pdis.json?$select=count(ttxid)"), as="text"))[1,1]
total_pages <- ceiling(as.integer(records)/50000)
endpoint <- "https://data.sfgov.org/resource/g8m3-pdis.json"
# loop through the number of requests
for (i in seq_len(total_pages)) {
  # offset logic is pretty straightforward: on page 1 the offset is 0; on page 2 the offset is 50,000, i.e. we skip the first 50,000 rows and start at record 50,001
  url <- paste0(endpoint, "?$limit=50000&$offset=", format((i - 1) * 50000, scientific = FALSE), "&$order=ttxid")
  print(url)
  df_temp <- fromJSON(content(GET(url, add_headers(`X-App-Token` = "AQrzpwEVnUYlmA22uZxnaqiVY")), as = "text"))
  if ("location" %in% colnames(df_temp)) {
    # there's a nested df called location within our parsed dataframe; one of its columns is a list column called coordinates, containing long/lat as individual elements within a list
    location_df <-
      df_temp$location %>%
      mutate(
        long = map_dbl(coordinates, function(x) {if (is.null(x[1])) {as.double(NA)} else {x[1]}}),
        lat = map_dbl(coordinates, function(x) {if (is.null(x[2])) {as.double(NA)} else {x[2]}})
      ) %>%
      select(-coordinates)
    df_temp <- cbind.data.frame(df_temp %>% select(-location), location_df)
  } else {
    # some pages come back without a location column at all; add the columns so the pages rbind cleanly
    df_temp <- df_temp %>% mutate(lat = as.double(NA), long = as.double(NA), type = as.character(NA))
  }
  # bind everything together
  if (exists("business_df")) {
    business_df <- rbind.data.frame(business_df, df_temp)
  } else {
    business_df <- df_temp
  }
}
rm(df_temp)
rm(location_df)
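As an aside, the grow-a-dataframe-in-a-loop pattern above works fine at this scale, but a more purrr-flavored sketch would be to build all the page URLs up front and fetch them in one pass (same endpoint and 50,000-row page size as above; the nested location handling from the loop would still be needed before binding the pages together):
# build every page URL first, then fetch each one into a list of parsed dataframes
page_urls <- paste0(
  endpoint,
  "?$limit=50000&$offset=",
  format((seq_len(total_pages) - 1) * 50000, scientific = FALSE, trim = TRUE),
  "&$order=ttxid"
)
raw_pages <- map(page_urls, ~ fromJSON(content(GET(.x), as = "text")))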
Some data cleaning (timestamps mostly)…
# for timestamps, remove the T and everything after the '.' inclusive
business_df <- business_df %>%
  mutate(
    dba_start_date = as.POSIXct(gsub("\\..*$", "", gsub("T", " ", dba_start_date))),
    location_start_date = as.POSIXct(gsub("\\..*$", "", gsub("T", " ", location_start_date))),
    location_end_date = as.POSIXct(gsub("\\..*$", "", gsub("T", " ", location_end_date))),
    dba_end_date = as.POSIXct(gsub("\\..*$", "", gsub("T", " ", dba_end_date))),
    supervisor_district = as.integer(supervisor_district)
  )
# save the rds so we don't have to rerun this every time
business_df %>% saveRDS("sf_business_data.rds")
business_df <- readRDS("sf_business_data.rds")
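As an aside, lubridate will parse these ISO 8601 strings (T separator and fractional seconds included) on its own, so the gsub/as.POSIXct combination above could probably be collapsed into a single mutate(across(...)). A sketch of what that might look like, applied to the raw character columns in place of the timestamp-cleaning block above (note that ymd_hms() returns UTC, so timezone handling may differ slightly from as.POSIXct()):
# ymd_hms() copes with the T separator and the fractional seconds directly
ymd_hms("2012-03-01T00:00:00.000")

business_df <- business_df %>%
  mutate(
    across(
      c(dba_start_date, dba_end_date, location_start_date, location_end_date),
      ymd_hms
    ),
    supervisor_district = as.integer(supervisor_district)
  )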
Really basic Get to Know Your Data. Looks like Food Services and Retail Trade are the NAIC codes we want for groceries. There are a lot of businesses without a code; I did some quick checks and I don't think there are any groceries in there, mostly apartments and things like that. One thing to note is that multiple businesses can be registered at the same address. For example, there's a market in Laurel Heights that's one record, but inside that market is a butcher store, coffee shop, cheese shop, etc. that all registered as separate businesses within the same address. The "primary" market has a Food Services NAIC code, but all the secondary businesses have an NA code.
business_df %>%
  group_by(naic_code_description) %>%
  count()

business_df %>%
  filter(
    is.na(naic_code_description)
  ) %>%
  head()

business_df %>%
  filter(
    is.na(naic_code_description)
  ) %>%
  left_join(
    business_df %>%
      filter(
        naic_code_description %in% c("Food Services", "Retail Trade")
      ) %>%
      select(ttxid_match = ttxid, full_business_address),
    by = "full_business_address"
  ) %>%
  filter(
    is.na(ttxid_match)
  ) %>%
  head()
# "Food Services", "Retail Trade"
According to the description of the dataset on the SF Gov site, the NAIC code comes from a federal standard called the North American Industry Classification System. When businesses register with the city, a subset of these codes is needed for tax purposes. It's not super clear what the lic code is; it seems to stand for "business license code", and is definitely a much more detailed specification.
business_df %>%
  filter(
    naic_code_description %in% c("Food Services", "Retail Trade")
  ) %>%
  group_by(lic_code_description) %>%
  count(sort = TRUE)
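For reference, the raw code values sitting behind these descriptions can be pulled out with distinct(); this assumes the dataset also carries naic_code and lic_code columns alongside the description fields, which it appears to.
# map descriptions back to the underlying NAIC and license codes
business_df %>%
  distinct(naic_code, naic_code_description) %>%
  arrange(naic_code)

business_df %>%
  filter(naic_code_description %in% c("Food Services", "Retail Trade")) %>%
  distinct(lic_code, lic_code_description) %>%
  arrange(lic_code)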
Mostly SF-based businesses, no surprise.
business_df %>%
  filter(
    naic_code_description %in% c("Food Services", "Retail Trade")
  ) %>%
  group_by(city) %>%
  count(sort = TRUE)
Pretty fascinating…some very old businesses here!
business_df %>%
  filter(dba_start_date < '1950-01-01') %>%
  mutate(
    dba_name = substr(dba_name, 1, 40)
  ) %>%
  select(dba_name, dba_start_date, dba_end_date, naic_code_description, lic_code_description)
The pandemic has definitely hurt the number of new food/retail businesses being created :(
business_df %>%
  filter(
    !is.na(location_start_date),
    naic_code_description %in% c("Retail Trade", "Food Services")
  ) %>%
  group_by(
    business_registered_year = as.Date(floor_date(location_start_date, 'year')),
    naic_code_description
  ) %>%
  summarise(
    count = n()
  ) %>%
  ggplot(aes(x = business_registered_year, y = count, colour = naic_code_description)) +
  geom_line() +
  theme_bw() +
  scale_x_date(name = "Business Registration Year", date_breaks = "8 years", date_labels = "%Y") +
  scale_colour_brewer(name = "NAIC code description", palette = "Dark2") +
  scale_y_continuous(breaks = seq(0, 2000, 250))
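To put a rough number on that drop rather than just eyeballing the tail of the plot, here's a quick sketch counting new registrations per year for the last few years (same filters as the plot above):
# count new food/retail registrations per year since 2017
business_df %>%
  filter(
    !is.na(location_start_date),
    naic_code_description %in% c("Retail Trade", "Food Services"),
    location_start_date >= as.POSIXct("2017-01-01")
  ) %>%
  count(
    registration_year = year(location_start_date),
    naic_code_description
  ) %>%
  arrange(naic_code_description, registration_year)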
Decided to quickly plot retail/food businesses by location. As an FYI, I'm leveraging the ggmap package, which has a built-in integration with the Google Maps platform. This does require a little bit of authentication setup work (the typical API-key creation within the Google API console), but ggmap makes it incredibly simple. Probably took less than 10 minutes to set up. Anyways, nothing too shocking around the map results - kind of interesting that the Richmond doesn't have any dense business districts running along the north/south avenues. This is effectively a population density map…
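For anyone following along, the one-time ggmap setup mentioned above is roughly this (the key below is a placeholder for your own Google Maps API key):
# register the Google Maps key once per session (or persist it with write = TRUE)
register_google(key = "YOUR_GOOGLE_MAPS_API_KEY")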
map_data <-
  business_df %>%
  filter(
    !is.na(location_start_date),
    naic_code_description %in% c("Retail Trade", "Food Services")
  )

sf_map_plot <-
  get_googlemap(zoom = 12, size = c(480, 480), maptype = "road", center = c(-122.44, 37.76))

ggmap(sf_map_plot) +
  geom_point(data = map_data, aes(x = long, y = lat), size = .3, alpha = .1)
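Since the point cloud is really just tracing density, a contour version of the same map makes that explicit. A quick sketch using the same map_data (the fill bins are just whatever ggplot2 picks by default):
# same data, but as 2D density contours instead of individual points
ggmap(sf_map_plot) +
  stat_density_2d(
    data = map_data,
    aes(x = long, y = lat, fill = after_stat(level)),
    geom = "polygon",
    alpha = .3
  ) +
  scale_fill_viridis_c(name = "density")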