While you may be thinking “writing R code is not an easy way to download my data,” your instructor can confidently say that using tidycensus is much much easier than the “point and click” approach using multiple webpages or interacting with the Census API on your own. Even if you are not comfortable with coding in R yet, this module provides a step-by-step explanation of how to download data using tidycensus.
In R, packages contain functions that extend what R can do out of the box. We will be using multiple packages, but a good coding practice is to load only those your script needs to run.
Packages are loaded using the library() function. Statements loading your packages should be the first thing in an R script. We will be using the tidycensus, tidyverse, and sf packages in this module.
The next command in your script is a function that tells tidycensus which API key to use when requesting data.
### Load packages
library(tidycensus)
library(tidyverse)
library(sf)
### Load API key
census_api_key("YOUR API KEY GOES HERE")
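A key typed directly into a script is easy to leak when the script is shared. As a minimal sketch, the key can be read from an environment variable instead. This assumes the key was stored earlier under the name CENSUS_API_KEY, which is what census_api_key() does when called with install = TRUE.

```r
library(tidycensus)

# Read the key from the environment instead of hard-coding it in the script.
# Assumption: the key was saved earlier under the name CENSUS_API_KEY.
key <- Sys.getenv("CENSUS_API_KEY")

if (nzchar(key)) {
  # Register the key for the current session
  census_api_key(key)
} else {
  message("No key found. Run census_api_key(\"YOUR API KEY GOES HERE\", install = TRUE) once, then restart R.")
}
```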
The tidycensus package allows you to download data from numerous US Census sources. We will concentrate on the decennial census, ACS, and population estimates in this workshop; however, it is good to know that other data are available as well. In tidycensus, the data source is defined by which function you use. In all, five datasets are available:
get_acs(): American Community Survey (tables and spatial)
get_decennial(): decennial US Census (tables and spatial)
get_estimates(): Population Estimates (tables)
get_flows(): ACS Migration Flows (tables and spatial)
get_pums(): ACS Public Use Microdata Sample (tables)
One of the more impenetrable barriers to working with census data is figuring out exactly what data they have at what geographic resolution. In this workshop, we do not have time to cover all the different variables available from all the sources. However, there are some resources included in tidycensus that allow the user to explore the available data. The following set of commands should create a set of tables with variable information for the specific data source. While these are nice resources, the Census also includes various forms of help documentation for their data. For example, this page (https://www.census.gov/programs-surveys/acs/technical-documentation.html) has the technical information about the ACS data, including an Excel sheet with all the table names, codes, and descriptions.
### Load the geographic data helper for ACS data
data("acs5_geography")
### Retrieve the tables/variables for the most recent census
var_census_2020_pl <- load_variables(year = 2020,
                                     dataset = "pl")
### Retrieve the tables/variables for the most recent 5-year ACS
var_acs5_2020 <- load_variables(year = 2020,
                                dataset = "acs5")
### Retrieve the variables for the most recent 5-year PUMS data
data("pums_variables")
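The variable tables are long, so a keyword search is usually the quickest way in. The sketch below assumes the table returned by load_variables() has name, label, and concept columns. The helper find_variables() and the three rows of vars_demo are hypothetical: the rows are typed by hand to mimic that layout, and the exact label wording in the real table may differ.

```r
# Hand-typed stand-in for a variable table (illustrative rows only)
vars_demo <- data.frame(
  name    = c("B01001_001", "B01001_003", "B19013_001"),
  label   = c("Estimate!!Total:",
              "Estimate!!Total:!!Male:!!Under 5 years",
              "Estimate!!Median household income"),
  concept = c("SEX BY AGE", "SEX BY AGE", "MEDIAN HOUSEHOLD INCOME"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: keep rows whose label or concept matches a pattern
find_variables <- function(vars, pattern) {
  hit <- grepl(pattern, vars$label, ignore.case = TRUE) |
    grepl(pattern, vars$concept, ignore.case = TRUE)
  vars[hit, , drop = FALSE]
}

under5 <- find_variables(vars_demo, "under 5")
print(under5)
```

With the real table loaded, the same call would be find_variables(var_acs5_2020, "under 5").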
Once you determine which data source and which variables you need, you must also determine your study area (the geographic extent of the data request) and the geographic resolution of the data (note that not all variables are available at every enumeration unit). Whereas the function determines the data source, the function's parameters determine the details of the request.
We will begin by downloading age data, which happens to be in the first table of the available ACS data (B01001). For reference purposes, we will begin by downloading only a single variable, male population between the ages of 0 and 4 years (B01001_003) for Wake County, NC at the block group level.
### Retrieve data
wake_acs5_2020_A0004_M <- get_acs(state = "NC",
                                  county = "Wake",
                                  geography = "block group",
                                  variables = "B01001_003",
                                  year = 2020,
                                  survey = "acs5")
### Preview data
glimpse(wake_acs5_2020_A0004_M)
## Rows: 597
## Columns: 5
## $ GEOID    <chr> "371830501001", "371830501002", "371830501003", "371830503001", "371830503002", "371830503003", "371830504001"…
## $ NAME     <chr> "Block Group 1, Census Tract 501, Wake County, North Carolina", "Block Group 2, Census Tract 501, Wake County,…
## $ variable <chr> "B01001_003", "B01001_003", "B01001_003", "B01001_003", "B01001_003", "B01001_003", "B01001_003", "B01001_003"…
## $ estimate <dbl> 42, 9, 0, 0, 0, 0, 38, 0, 22, 34, 41, 0, 61, 106, 47, 74, 0, 80, 42, 10, 28, 0, 26, 0, 44, 0, 83, 0, 0, 0, 0, …
## $ moe      <dbl> 33, 14, 13, 13, 13, 13, 16, 13, 19, 33, 75, 13, 57, 58, 37, 44, 13, 61, 35, 19, 25, 13, 25, 13, 35, 13, 56, 13…
The result of this command is a new object called wake_acs5_2020_A0004_M that is a tibble (a fancy R table) with 597 rows (corresponding to 597 block groups in Wake County) and 5 columns (the GEOID, name, variable code, estimate, and margin of error [because this is survey data, not census data]). This is a flat table, meaning there is no spatial information contained in it (these data cannot be mapped). If you want the spatial features attached to the object, you must request them!
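The GEOID column is worth a closer look because it encodes the geographic hierarchy. As a rough sketch, assuming the standard 12-character block group layout (2-digit state, 3-digit county, 6-digit tract, 1-digit block group), the first few GEOIDs from the preview above can be split apart in base R. The object name parts is only illustrative.

```r
# First GEOIDs from the preview above
geoid <- c("371830501001", "371830501002", "371830503001")

# Split each identifier into its nested components by character position
parts <- data.frame(
  GEOID       = geoid,
  state       = substr(geoid, 1, 2),    # 37 = North Carolina
  county      = substr(geoid, 3, 5),    # 183 = Wake County
  tract       = substr(geoid, 6, 11),   # 050100 = Census Tract 501
  block_group = substr(geoid, 12, 12),
  stringsAsFactors = FALSE
)

print(parts)
```

Because the identifiers are stored as text, leading zeros survive. Converting them to numbers would drop leading zeros for states with codes below 10.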
### Retrieve data
wake_acs5_2020_A0004_M <- get_acs(state = "NC",
                                  county = "Wake",
                                  geography = "block group",
                                  variables = "B01001_003",
                                  year = 2020,
                                  survey = "acs5",
                                  geometry = TRUE) # This is the only difference from above!
### Preview data
glimpse(wake_acs5_2020_A0004_M)
## Rows: 597
## Columns: 6
## $ GEOID    <chr> "371830540162", "371830542041", "371830518001", "371830531102", "371830532051", "371830534092", "371830534112"…
## $ NAME     <chr> "Block Group 2, Census Tract 540.16, Wake County, North Carolina", "Block Group 1, Census Tract 542.04, Wake C…
## $ variable <chr> "B01001_003", "B01001_003", "B01001_003", "B01001_003", "B01001_003", "B01001_003", "B01001_003", "B01001_003"…
## $ estimate <dbl> 46, 228, 129, 205, 108, 58, 70, 55, 245, 114, 64, 345, 387, 53, 49, 50, 80, 35, 0, 36, 28, 69, 53, 0, 0, 23, 2…
## $ moe      <dbl> 43, 165, 192, 210, 50, 51, 47, 36, 103, 72, 60, 183, 246, 64, 85, 44, 90, 42, 13, 42, 25, 48, 63, 13, 13, 29, …
## $ geometry <MULTIPOLYGON [°]> MULTIPOLYGON (((-78.59955 3..., MULTIPOLYGON (((-78.52094 3..., MULTIPOLYGON (((-78.64622 3..., M…
Now, we have replaced the original wake_acs5_2020_A0004_M object with a spatial data object consisting of a set of spatial polygon features representing the block group boundaries (an sf object) linked to the values (a data.frame, which is another fancy R table). In this format, the spatial features are stored in a special table column called “geometry.”
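To make the "table plus geometry column" idea concrete, here is a small self-contained sketch. The two unit squares are made-up stand-ins for block group boundaries, the helper square() is hypothetical, and the coordinate reference system (EPSG:4269, NAD83) is an assumption chosen for illustration. Only the GEOIDs and estimates come from the preview above.

```r
library(sf)

# Hypothetical helper: a unit square with its lower-left corner at (x0, y0)
square <- function(x0, y0) {
  st_polygon(list(rbind(c(x0, y0), c(x0 + 1, y0), c(x0 + 1, y0 + 1),
                        c(x0, y0 + 1), c(x0, y0))))
}

# An sf object is a data.frame with one extra list-column of geometries
toy <- st_sf(
  GEOID    = c("371830540162", "371830542041"),
  estimate = c(46, 228),
  geometry = st_sfc(square(0, 0), square(1, 0), crs = 4269)
)
print(class(toy))

# Dropping the geometry column returns an ordinary flat table
flat <- st_drop_geometry(toy)
print(class(flat))
```

On the real object, plot(wake_acs5_2020_A0004_M["estimate"]) should draw a quick map of the estimate column.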
While the raw count estimates are great, we might be interested in the percent of residents living in a block group who are male and age 0-4. To make this calculation, we also need to request the total number of people per block group. This functionality is built directly into tidycensus using the summary_var parameter, or you can manually request the total population variable (B01001_001) in the variables parameter (shown below).
### Retrieve data
wake_acs5_2020_A0004_M <- get_acs(state = "NC",
                                  county = "Wake",
                                  geography = "block group",
                                  variables = c("B01001_001",
                                                "B01001_003"),
                                  year = 2020,
                                  survey = "acs5",
                                  geometry = TRUE)
### Preview data
glimpse(wake_acs5_2020_A0004_M)
## Rows: 1,194
## Columns: 6
## $ GEOID    <chr> "371830540162", "371830540162", "371830542041", "371830542041", "371830518001", "371830518001", "371830531102"…
## $ NAME     <chr> "Block Group 2, Census Tract 540.16, Wake County, North Carolina", "Block Group 2, Census Tract 540.16, Wake C…
## $ variable <chr> "B01001_001", "B01001_003", "B01001_001", "B01001_003", "B01001_001", "B01001_003", "B01001_001", "B01001_003"…
## $ estimate <dbl> 2107, 46, 3025, 228, 1989, 129, 1396, 205, 2567, 108, 2753, 58, 2243, 70, 2858, 55, 2611, 245, 2509, 114, 2485…
## $ moe      <dbl> 385, 43, 446, 165, 1168, 192, 535, 210, 423, 50, 576, 51, 392, 47, 511, 36, 564, 103, 436, 72, 358, 60, 988, 1…
## $ geometry <MULTIPOLYGON [°]> MULTIPOLYGON (((-78.59955 3..., MULTIPOLYGON (((-78.59955 3..., MULTIPOLYGON (((-78.52094 3..., M…
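As an aside, the summary_var alternative mentioned earlier might look like the sketch below. The object name wake_u5_with_total is hypothetical, and the request only runs when an API key is available in the CENSUS_API_KEY environment variable (an assumption about how the key is stored). With summary_var, the total is returned in extra columns (summary_est and summary_moe) rather than as extra rows.

```r
library(tidycensus)

# Only contact the Census API when a key is available in this session
if (nzchar(Sys.getenv("CENSUS_API_KEY"))) {
  wake_u5_with_total <- get_acs(state = "NC",
                                county = "Wake",
                                geography = "block group",
                                variables = "B01001_003",
                                summary_var = "B01001_001",
                                year = 2020,
                                survey = "acs5")
  # The total population sits beside each estimate instead of in its own row
  print(names(wake_u5_with_total))
}
```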
If you are used to working with spatial data, you may be surprised to see that the result of the two-variable request is now 1,194 rows long, which means that every spatial feature appears twice, once per variable. This long data format is a hallmark of tidy data, which tidycensus is built upon, but it is not ideal when working with spatial data because each spatial feature should be represented only once in the table. Fortunately, tidycensus has a solution built in: the output parameter, which can be set to “wide”.
### Retrieve data
wake_acs5_2020_A0004_M <- get_acs(state = "NC",
                                  county = "Wake",
                                  geography = "block group",
                                  variables = c("B01001_001",
                                                "B01001_003"),
                                  year = 2020,
                                  survey = "acs5",
                                  output = "wide",
                                  geometry = TRUE)
### Preview data
glimpse(wake_acs5_2020_A0004_M)
## Rows: 597
## Columns: 7
## $ GEOID       <chr> "371830540162", "371830542041", "371830518001", "371830531102", "371830532051", "371830534092", "3718305341…
## $ NAME        <chr> "Block Group 2, Census Tract 540.16, Wake County, North Carolina", "Block Group 1, Census Tract 542.04, Wak…
## $ B01001_001E <dbl> 2107, 3025, 1989, 1396, 2567, 2753, 2243, 2858, 2611, 2509, 2485, 2717, 5246, 2546, 2290, 1500, 5827, 3391,…
## $ B01001_001M <dbl> 385, 446, 1168, 535, 423, 576, 392, 511, 564, 436, 358, 988, 1482, 493, 813, 262, 1622, 723, 286, 547, 311,…
## $ B01001_003E <dbl> 46, 228, 129, 205, 108, 58, 70, 55, 245, 114, 64, 345, 387, 53, 49, 50, 80, 35, 0, 36, 28, 69, 53, 0, 0, 23…
## $ B01001_003M <dbl> 43, 165, 192, 210, 50, 51, 47, 36, 103, 72, 60, 183, 246, 64, 85, 44, 90, 42, 13, 42, 25, 48, 63, 13, 13, 2…
## $ geometry    <MULTIPOLYGON [°]> MULTIPOLYGON (((-78.59955 3..., MULTIPOLYGON (((-78.52094 3..., MULTIPOLYGON (((-78.64622 3...…
Now, wake_acs5_2020_A0004_M is back to 597 rows long. Notice that the variable, estimate, and moe columns have been replaced by columns named with the variable codes plus a suffix: E (for estimate) or M (for margin of error).
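With one row per block group, the percentage that motivated the two-variable request becomes a single column calculation. The sketch below uses the first three block groups from the preview above, typed into a small stand-in table. The names wide_demo and pct_male_0_4 are only illustrative.

```r
# First three block groups from the wide preview above
wide_demo <- data.frame(
  GEOID       = c("371830540162", "371830542041", "371830518001"),
  B01001_001E = c(2107, 3025, 1989),  # total population estimate
  B01001_003E = c(46, 228, 129)       # male, ages 0-4 estimate
)

# Percent of residents who are male and age 0-4.
# Block groups with no residents get NA instead of a division by zero.
wide_demo$pct_male_0_4 <- ifelse(wide_demo$B01001_001E > 0,
                                 100 * wide_demo$B01001_003E / wide_demo$B01001_001E,
                                 NA_real_)

print(round(wide_demo$pct_male_0_4, 2))
# 2.18 7.54 6.49
```

The same two lines should work on wake_acs5_2020_A0004_M itself. Keep in mind that the result is a ratio of two survey estimates, so it carries its own margin of error.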
In some cases, it may be easier to simply request all the variables in a table. For example, maybe you want all of the age and gender breakdown data in the ACS age table. To do this, instead of using the variables parameter, you can use the table parameter.
### Retrieve data
wake_acs5_2020_age <- get_acs(state = "NC",
                              county = "Wake",
                              geography = "block group",
                              table = "B01001", # Note the difference here!
                              year = 2020,
                              survey = "acs5",
                              output = "wide",
                              geometry = TRUE)
No preview of the data is included because the new object wake_acs5_2020_age has 101 columns: 2 columns for the GEOID and name, 6 columns containing the total population, male population, and female population (and their margins of error), 92 columns containing the 23 age group populations for males and females (and their margins of error), and 1 column for the spatial features.
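The column count can be checked without downloading anything. This sketch assumes the table's variables are numbered B01001_001 through B01001_049, which is what the 3 + 46 variables described above imply. It rebuilds the expected column names and then picks out only the estimate columns. The object names are illustrative.

```r
# Assumed variable codes for the table: B01001_001 ... B01001_049
codes <- sprintf("B01001_%03d", 1:49)

# Wide output pairs each code with an E (estimate) and M (margin of error) column
expected_cols <- c("GEOID", "NAME",
                   as.vector(rbind(paste0(codes, "E"), paste0(codes, "M"))),
                   "geometry")
print(length(expected_cols))

# Keep only the estimate columns. The pattern is anchored to the table code,
# so NAME (which also ends in "E") is not picked up by accident.
estimate_cols <- grep("^B01001_[0-9]{3}E$", expected_cols, value = TRUE)
print(length(estimate_cols))
```

Applied to the real object, the same pattern would be used as grep("^B01001_[0-9]{3}E$", names(wake_acs5_2020_age), value = TRUE).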
The function and options to download decennial data from the US Census are quite similar to those for the ACS data. For example, the following command will download the block level data for the housing table from the most recent census for Durham County (geometry is set to FALSE because the block level spatial data are large files that take a long time to download!).
### Retrieve data
durham_dec_2020_housing <- get_decennial(state = "NC",
                                         county = "Durham",
                                         geography = "block",
                                         table = "H1",
                                         year = 2020,
                                         sumfile = "pl",
                                         output = "wide",
                                         geometry = FALSE)
### Preview data
glimpse(durham_dec_2020_housing)
## Rows: 4,401
## Columns: 5
## $ GEOID   <chr> "370630013011001", "370630013031011", "370630013032014", "370630014002005", "370630015042000", "370630015051011…
## $ NAME    <chr> "Block 1001, Block Group 1, Census Tract 13.01, Durham County, North Carolina", "Block 1011, Block Group 1, Cen…
## $ H1_001N <dbl> 20, 17, 15, 12, 0, 0, 3, 360, 0, 16, 14, 24, 24, 18, 12, 31, 46, 0, 24, 1, 47, 25, 25, 18, 0, 10, 31, 3, 54, 12…
## $ H1_002N <dbl> 18, 6, 15, 9, 0, 0, 3, 356, 0, 16, 14, 24, 18, 18, 12, 30, 42, 0, 24, 1, 43, 6, 25, 12, 0, 2, 29, 3, 51, 12, 0,…
## $ H1_003N <dbl> 2, 11, 0, 3, 0, 0, 0, 4, 0, 0, 0, 0, 6, 0, 0, 1, 4, 0, 0, 0, 4, 19, 0, 6, 0, 8, 2, 0, 3, 0, 0, 0, 0, 13, 8, 7, …
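Unlike the ACS output, the decennial columns end in N and have no margin of error, because these are full counts rather than survey estimates. As a small sketch, assuming the usual layout of table H1 (H1_001N total housing units, H1_002N occupied, H1_003N vacant), the first three blocks from the preview can be used to check that the parts add up and to compute a vacancy rate. The names blocks_demo and pct_vacant are illustrative.

```r
# First three blocks from the preview above
blocks_demo <- data.frame(
  GEOID   = c("370630013011001", "370630013031011", "370630013032014"),
  H1_001N = c(20, 17, 15),  # total housing units (assumed meaning)
  H1_002N = c(18, 6, 15),   # occupied units (assumed meaning)
  H1_003N = c(2, 11, 0)     # vacant units (assumed meaning)
)

# Occupied plus vacant should equal the total in every block
parts_add_up <- all(blocks_demo$H1_002N + blocks_demo$H1_003N == blocks_demo$H1_001N)
print(parts_add_up)

# Percent of housing units that are vacant; blocks with no units get NA
blocks_demo$pct_vacant <- ifelse(blocks_demo$H1_001N > 0,
                                 100 * blocks_demo$H1_003N / blocks_demo$H1_001N,
                                 NA_real_)
print(round(blocks_demo$pct_vacant, 2))
# 10.00 64.71 0.00
```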
The last task in the data retrieval section is to learn how to write out (and then read back in) a spatial data layer. The sf package makes this relatively easy. The reference pages for reading and writing sf objects are here: https://r-spatial.github.io/sf/reference/st_read.html, https://r-spatial.github.io/sf/reference/st_write.html
## Write out data as a geopackage to current working directory
st_write(wake_acs5_2020_age, "wake_example_age.gpkg")
## Quiet version that will also overwrite existing file
write_sf(wake_acs5_2020_age, "wake_example_age.gpkg")
## Read in a spatial file
new_obj_wake <- st_read("wake_example_age.gpkg")
## Quiet version for reading in
new_obj_wake <- read_sf("wake_example_age.gpkg")
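A quick way to confirm that a file was written correctly is to read it back and compare it with the original. The sketch below does this round trip with a tiny made-up point layer written to a temporary folder, so it can be run without any downloaded data. The coordinates, the object names, and the file name toy_example.gpkg are all hypothetical.

```r
library(sf)

# Two made-up points standing in for a real spatial layer
toy <- st_sf(
  id       = c("a", "b"),
  geometry = st_sfc(st_point(c(-78.6, 35.8)), st_point(c(-78.9, 36.0)), crs = 4269)
)

# Write to a temporary file; append = FALSE replaces the layer if it already exists
path <- file.path(tempdir(), "toy_example.gpkg")
st_write(toy, path, append = FALSE, quiet = TRUE)

# Read the file back and compare the basics with the original
back <- st_read(path, quiet = TRUE)
print(nrow(back) == nrow(toy))
print(names(back))
```

After a round trip through a GeoPackage, the geometry column may come back under a different name (commonly geom), so code that refers to the column by name is safer using st_geometry().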
Create a new R script that contains (only) the required commands to download the 2018 ACS 5-year data for Median Household Income (B19013) at the tract level for Kent County, Michigan. Make sure that the data are in wide format and include the spatial features. Write out the data!
This page was last updated on February 26, 2024