Overview

While you may be thinking “writing R code is not an easy way to download my data,” your instructor can confidently say that using tidycensus is much easier than the “point and click” approach across multiple webpages or working with the Census API on your own. Even if you are not yet comfortable coding in R, this module provides a step-by-step explanation of how to download data using tidycensus.

Set Up

In R, packages contain functions that extend what R can do out of the box. We will be using multiple packages, but a good coding practice is to load only those your script needs to run.

Packages are loaded using the library() function. Statements that load your packages should come first in an R script. We will be using the tidycensus, tidyverse, and sf packages in this module.

The next command in your script is census_api_key(), a function that tells tidycensus which API key to use when requesting data.

### Load packages
library(tidycensus)
library(tidyverse)
library(sf)

### Load API key
census_api_key("YOUR API KEY GOES HERE")
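Hard-coding the key in every script makes it easy to leak when code is shared. As a hedged sketch, census_api_key() also accepts install = TRUE, which stores the key in your .Renviron file under the name CENSUS_API_KEY so that later sessions can find it without repeating the call above. The check below only inspects that environment variable; the object name key_is_set is made up for illustration.

```r
### One-time setup (run once, then remove the key from your script):
### census_api_key("YOUR API KEY GOES HERE", install = TRUE)

### In later sessions, check whether a key is available
key_is_set <- nzchar(Sys.getenv("CENSUS_API_KEY"))

if (!key_is_set) {
  message("No Census API key found; run census_api_key() first.")
}
```

If the message appears right after installing the key, restarting R (or reloading .Renviron) is usually what is missing.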

Data Sources

The tidycensus package allows you to download data from numerous US Census sources. We will concentrate on the decennial census, ACS, and population estimates in this workshop; however, it is good to know that other data are available as well. In tidycensus, the data source is determined by which function you use. In all, five data sources are available:

get_acs() American Community Survey (tables and spatial)
get_decennial() decennial US Census (tables and spatial)
get_estimates() Population Estimates (tables)
get_flows() ACS Migration Flows (tables and spatial)
get_pums() ACS Public Use Microdata Sample (tables)

Available Variables

One of the bigger barriers to working with census data is figuring out exactly which data are available at which geographic resolution. In this workshop, we do not have time to cover all the different variables available from all the sources. However, tidycensus includes some resources that allow the user to explore the available data. The commands below create a set of tables with variable information for each data source. While these are nice resources, the Census Bureau also provides various forms of help documentation for its data. For example, this page (https://www.census.gov/programs-surveys/acs/technical-documentation.html) has the technical information about the ACS data, including an Excel sheet with all the table names, codes, and descriptions.

### Load the geographic data helper for ACS data
data("acs5_geography")

### Retrieve the tables/variables for the most recent census
var_census_2020_pl <- load_variables(year = 2020,
                                     dataset = "pl")

### Retrieve the tables/variables for the most recent 5-year ACS
var_acs5_2020 <- load_variables(year = 2020,
                                dataset = "acs5")

### Retrieve the variables for the most recent 5-year PUMS data
data("pums_variables")
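The variable tables are long, so in practice you will search them rather than read them. The sketch below shows one way to do that with a text filter. To keep it runnable without an API key, it uses a small hand-typed stand-in (var_example) that mimics the name, label, and concept columns returned by load_variables(); the rows and labels are illustrative, so check the real table for exact wording.

```r
library(tibble)
library(dplyr)
library(stringr)

### Stand-in for the table returned by load_variables()
### (hand-typed, NOT downloaded; the real table has thousands of rows)
var_example <- tribble(
  ~name,        ~label,                                     ~concept,
  "B01001_001", "Estimate!!Total:",                         "SEX BY AGE",
  "B01001_002", "Estimate!!Total:!!Male:",                  "SEX BY AGE",
  "B01001_003", "Estimate!!Total:!!Male:!!Under 5 years",   "SEX BY AGE",
  "B01001_027", "Estimate!!Total:!!Female:!!Under 5 years", "SEX BY AGE"
)

### Keep only the variables whose label mentions the age group of interest
under5 <- var_example %>%
  filter(str_detect(label, "Under 5 years")) %>%
  select(name, label)

under5
```

The same filter() call should work on var_acs5_2020 once it has been loaded.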

Retrieve Data

ACS Data

Once you determine which data source and which variables you need, you must also determine your study area (the geographic extent of the data request) and the geographic resolution of the data (note that not all variables are available at every enumeration unit). Whereas the function itself determines the data source, the parameters of the function determine the details of the request.

We will begin by downloading age data, which happens to be in the first table of the available ACS data (B01001). For reference purposes, we will begin by downloading only a single variable, male population between the ages of 0 and 4 years (B01001_003) for Wake County, NC at the block group level.

### Retrieve data
wake_acs5_2020_A0004_M <- get_acs(state = "NC",
                                  county = "Wake",
                                  geography = "block group",
                                  variables = "B01001_003",
                                  year = 2020,
                                  survey = "acs5")

### Preview data
glimpse(wake_acs5_2020_A0004_M)

## Rows: 597
## Columns: 5
## $ GEOID    <chr> "371830501001", "371830501002", "371830501003", "371830503001", "371830503002", "371830503003", "371830504001"…
## $ NAME     <chr> "Block Group 1, Census Tract 501, Wake County, North Carolina", "Block Group 2, Census Tract 501, Wake County,…
## $ variable <chr> "B01001_003", "B01001_003", "B01001_003", "B01001_003", "B01001_003", "B01001_003", "B01001_003", "B01001_003"…
## $ estimate <dbl> 42, 9, 0, 0, 0, 0, 38, 0, 22, 34, 41, 0, 61, 106, 47, 74, 0, 80, 42, 10, 28, 0, 26, 0, 44, 0, 83, 0, 0, 0, 0, …
## $ moe      <dbl> 33, 14, 13, 13, 13, 13, 16, 13, 19, 33, 75, 13, 57, 58, 37, 44, 13, 61, 35, 19, 25, 13, 25, 13, 35, 13, 56, 13…

Add Spatial Features

The result of this command is a new object called wake_acs5_2020_A0004_M that is a tibble (a fancy R table) with 597 rows (corresponding to 597 block groups in Wake County) and 5 columns (including the GEOID, name, variable number, value, and error estimate [because this is survey data, not census data]). This is a flat table, meaning there is no spatial information contained in it (these data cannot be mapped). If you want the spatial features attached to the object, you must request them!

### Retrieve data
wake_acs5_2020_A0004_M <- get_acs(state = "NC",
                                  county = "Wake",
                                  geography = "block group",
                                  variables = "B01001_003",
                                  year = 2020,
                                  survey = "acs5",
                                  geometry = TRUE)           #<< This is the only difference from above!

### Preview data
glimpse(wake_acs5_2020_A0004_M)

## Rows: 597
## Columns: 6
## $ GEOID    <chr> "371830540162", "371830542041", "371830518001", "371830531102", "371830532051", "371830534092", "371830534112"…
## $ NAME     <chr> "Block Group 2, Census Tract 540.16, Wake County, North Carolina", "Block Group 1, Census Tract 542.04, Wake C…
## $ variable <chr> "B01001_003", "B01001_003", "B01001_003", "B01001_003", "B01001_003", "B01001_003", "B01001_003", "B01001_003"…
## $ estimate <dbl> 46, 228, 129, 205, 108, 58, 70, 55, 245, 114, 64, 345, 387, 53, 49, 50, 80, 35, 0, 36, 28, 69, 53, 0, 0, 23, 2…
## $ moe      <dbl> 43, 165, 192, 210, 50, 51, 47, 36, 103, 72, 60, 183, 246, 64, 85, 44, 90, 42, 13, 42, 25, 48, 63, 13, 13, 29, …
## $ geometry <MULTIPOLYGON [°]> MULTIPOLYGON (((-78.59955 3..., MULTIPOLYGON (((-78.52094 3..., MULTIPOLYGON (((-78.64622 3..., M…

Now, we have replaced the original wake_acs5_2020_A0004_M object with a spatial data object consisting of a set of spatial polygon features representing the block group boundaries (an sf object) linked to the values (a data.frame, which is another fancy R table). In this format, the spatial features are stored in a special table column called “geometry.”
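To make the "table plus geometry column" idea concrete, here is a small sketch that builds a toy sf object by hand instead of downloading one. The two squares, the GEOID values "A" and "B", and the helper name square() are all made up for illustration; only the structure mirrors what get_acs() returns with geometry = TRUE.

```r
library(sf)

### Helper (hypothetical) that builds a unit square starting at x0
square <- function(x0) {
  st_polygon(list(rbind(c(x0, 0), c(x0 + 1, 0), c(x0 + 1, 1),
                        c(x0, 1), c(x0, 0))))
}

### Two made-up polygons standing in for block groups
toy_sf <- st_sf(GEOID    = c("A", "B"),
                estimate = c(46, 228),
                geometry = st_sfc(square(0), square(1)))

### An sf object is a data.frame with a geometry column attached
class(toy_sf)

### Dropping the geometry returns an ordinary (non-spatial) table
toy_flat <- st_drop_geometry(toy_sf)
names(toy_flat)
```

Because the attributes live in a regular table, most data-wrangling functions can be applied to an sf object directly, and the geometry column comes along for the ride.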

Additional Variables

While the raw count estimates are great, we might be interested in the percent of residents living in a block group who are male and age 0-4. To make this calculation, we also need to request the total number of people per block group. This functionality is built directly into tidycensus using the summary_var parameter or you can manually request the total population variable (B01001_001) in the variables parameter (shown below).

### Retrieve data
wake_acs5_2020_A0004_M <- get_acs(state = "NC",
                                  county = "Wake",
                                  geography = "block group",
                                  variables = c("B01001_001",
                                                "B01001_003"),
                                  year = 2020,
                                  survey = "acs5",
                                  geometry = TRUE)

### Preview data
glimpse(wake_acs5_2020_A0004_M)

## Rows: 1,194
## Columns: 6
## $ GEOID    <chr> "371830540162", "371830540162", "371830542041", "371830542041", "371830518001", "371830518001", "371830531102"…
## $ NAME     <chr> "Block Group 2, Census Tract 540.16, Wake County, North Carolina", "Block Group 2, Census Tract 540.16, Wake C…
## $ variable <chr> "B01001_001", "B01001_003", "B01001_001", "B01001_003", "B01001_001", "B01001_003", "B01001_001", "B01001_003"…
## $ estimate <dbl> 2107, 46, 3025, 228, 1989, 129, 1396, 205, 2567, 108, 2753, 58, 2243, 70, 2858, 55, 2611, 245, 2509, 114, 2485…
## $ moe      <dbl> 385, 43, 446, 165, 1168, 192, 535, 210, 423, 50, 576, 51, 392, 47, 511, 36, 564, 103, 436, 72, 358, 60, 988, 1…
## $ geometry <MULTIPOLYGON [°]> MULTIPOLYGON (((-78.59955 3..., MULTIPOLYGON (((-78.59955 3..., MULTIPOLYGON (((-78.52094 3..., M…
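The summary_var route mentioned earlier is not demonstrated in this module, so here is a hedged sketch of it. It assumes an API key is already set and a network connection is available. The wrapper name get_wake_under5 and the column name pct_m_0004 are made up; the wrapper exists only so that the request runs when you call it. With summary_var, get_acs() should return the denominator in extra columns named summary_est and summary_moe, which keeps one row per block group.

```r
library(tidycensus)
library(dplyr)

### Hypothetical wrapper: request the count and its denominator together
get_wake_under5 <- function() {
  get_acs(state = "NC",
          county = "Wake",
          geography = "block group",
          variables = "B01001_003",
          summary_var = "B01001_001",
          year = 2020,
          survey = "acs5",
          geometry = TRUE) %>%
    ### Percent of residents who are male and age 0-4
    mutate(pct_m_0004 = 100 * estimate / summary_est)
}

### Uncomment to run the request (needs an API key and internet access)
### wake_under5 <- get_wake_under5()
```

Block groups with no residents would produce NaN in the percent column, so it is worth checking for those before mapping.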

Wide Format

If you are used to working with spatial data, you may be surprised to see that the output is now 1,194 rows long, which means that every spatial feature is replicated, once per requested variable (not an ideal situation). This long data format is a hallmark of tidy data, which tidycensus is built upon. However, the long format is not ideal when working with spatial data because each spatial feature should be represented only once in the table. Fortunately, tidycensus has a built-in solution: the output parameter, which can be set to “wide”.

### Retrieve data
wake_acs5_2020_A0004_M <- get_acs(state = "NC",
                                  county = "Wake",
                                  geography = "block group",
                                  variables = c("B01001_001",
                                                "B01001_003"),
                                  year = 2020,
                                  survey = "acs5",
                                  output = "wide",
                                  geometry = TRUE)

### Preview data
glimpse(wake_acs5_2020_A0004_M)

## Rows: 597
## Columns: 7
## $ GEOID       <chr> "371830540162", "371830542041", "371830518001", "371830531102", "371830532051", "371830534092", "3718305341…
## $ NAME        <chr> "Block Group 2, Census Tract 540.16, Wake County, North Carolina", "Block Group 1, Census Tract 542.04, Wak…
## $ B01001_001E <dbl> 2107, 3025, 1989, 1396, 2567, 2753, 2243, 2858, 2611, 2509, 2485, 2717, 5246, 2546, 2290, 1500, 5827, 3391,…
## $ B01001_001M <dbl> 385, 446, 1168, 535, 423, 576, 392, 511, 564, 436, 358, 988, 1482, 493, 813, 262, 1622, 723, 286, 547, 311,…
## $ B01001_003E <dbl> 46, 228, 129, 205, 108, 58, 70, 55, 245, 114, 64, 345, 387, 53, 49, 50, 80, 35, 0, 36, 28, 69, 53, 0, 0, 23…
## $ B01001_003M <dbl> 43, 165, 192, 210, 50, 51, 47, 36, 103, 72, 60, 183, 246, 64, 85, 44, 90, 42, 13, 42, 25, 48, 63, 13, 13, 2…
## $ geometry    <MULTIPOLYGON [°]> MULTIPOLYGON (((-78.59955 3..., MULTIPOLYGON (((-78.52094 3..., MULTIPOLYGON (((-78.64622 3...…

Now, wake_acs5_2020_A0004_M is back to 597 rows long. Notice that the previous column names have been replaced with the variable codes, along with E (for estimate) and M (for margin of error).
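With one row per block group, the percent calculation that motivated the extra variable becomes a single column operation. The sketch below types in the first three rows of the wide preview by hand so that it runs without a download; wide_example and pct_m_0004 are illustrative names, not part of tidycensus.

```r
library(tibble)
library(dplyr)

### First three rows of the wide preview above, typed in by hand
wide_example <- tibble(
  GEOID       = c("371830540162", "371830542041", "371830518001"),
  B01001_001E = c(2107, 3025, 1989),   # total population estimate
  B01001_003E = c(46, 228, 129)        # males age 0-4 estimate
)

### Percent of residents who are male and age 0-4
wide_example <- wide_example %>%
  mutate(pct_m_0004 = 100 * B01001_003E / B01001_001E)

wide_example
```

The same mutate() call should work on the full wake_acs5_2020_A0004_M object, geometry column included.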

Request Entire Table

In some cases, it may be easier to simply request all the variables in a table. For example, maybe you want all of the age and gender breakdown data in the ACS age table. To do this, instead of using the variables parameter, you can use the table parameter.

### Retrieve data
wake_acs5_2020_age <- get_acs(state = "NC",
                              county = "Wake",
                              geography = "block group",
                              table = "B01001",            #<< Note the difference here!
                              year = 2020,
                              survey = "acs5",
                              output = "wide",
                              geometry = TRUE)

No preview of the data is included because the new object wake_acs5_2020_age has 101 columns: 2 columns for the GEOID and name, 6 columns containing the total population, male population, and female population (and their margins of error), 92 columns containing the 23 age group populations for males and females (and their margins of error), and 1 column for the spatial features.
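With this many columns, you will often want just the estimates or just the margins of error. One possible approach, sketched below on a hand-typed stand-in (age_example, with values copied from the earlier preview), is to select columns by their naming pattern. A regular expression is used on purpose: a plain "ends with E" rule would also pick up the NAME column.

```r
library(tibble)
library(dplyr)

### Stand-in with the same column naming pattern as the wide table
age_example <- tibble(
  GEOID       = "371830540162",
  NAME        = "Block Group 2, Census Tract 540.16, Wake County, North Carolina",
  B01001_001E = 2107,
  B01001_001M = 385,
  B01001_003E = 46,
  B01001_003M = 43
)

### Keep the ID plus the estimate columns only
age_estimates <- age_example %>%
  select(GEOID, matches("^B01001_[0-9]{3}E$"))

names(age_estimates)
```

Swapping the final E for M in the pattern would select the margins of error instead.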

Decennial Census Data

The function and options to download decennial data from the US Census are quite similar to those for the ACS data. For example, the following command will download the block level data for the housing table from the most recent census for Durham County (geometry is set to FALSE because the block level spatial data are large files that take a long time to download!).

### Retrieve data
durham_dec_2020_housing <- get_decennial(state = "NC",
                                         county = "Durham",
                                         geography = "block",
                                         table = "H1",
                                         year = 2020,
                                         sumfile = "pl",
                                         output = "wide",
                                         geometry = FALSE)

### Preview data
glimpse(durham_dec_2020_housing)

## Rows: 4,401
## Columns: 5
## $ GEOID   <chr> "370630013011001", "370630013031011", "370630013032014", "370630014002005", "370630015042000", "370630015051011…
## $ NAME    <chr> "Block 1001, Block Group 1, Census Tract 13.01, Durham County, North Carolina", "Block 1011, Block Group 1, Cen…
## $ H1_001N <dbl> 20, 17, 15, 12, 0, 0, 3, 360, 0, 16, 14, 24, 24, 18, 12, 31, 46, 0, 24, 1, 47, 25, 25, 18, 0, 10, 31, 3, 54, 12…
## $ H1_002N <dbl> 18, 6, 15, 9, 0, 0, 3, 356, 0, 16, 14, 24, 18, 18, 12, 30, 42, 0, 24, 1, 43, 6, 25, 12, 0, 2, 29, 3, 51, 12, 0,…
## $ H1_003N <dbl> 2, 11, 0, 3, 0, 0, 0, 4, 0, 0, 0, 0, 6, 0, 0, 1, 4, 0, 0, 0, 4, 19, 0, 6, 0, 8, 2, 0, 3, 0, 0, 0, 0, 13, 8, 7, …
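As a small worked example of using these columns, the sketch below computes a vacancy rate per block. It assumes that H1_001N is total housing units, H1_002N is occupied units, and H1_003N is vacant units, which matches the preview (occupied plus vacant equals the total) but is worth confirming in the variable table from load_variables(). The first five rows are typed in by hand, and vacancy_rate is an illustrative column name.

```r
library(tibble)
library(dplyr)

### First five rows of the preview above, typed in by hand
housing_example <- tibble(
  H1_001N = c(20, 17, 15, 12, 0),   # assumed: total housing units
  H1_002N = c(18, 6, 15, 9, 0),     # assumed: occupied units
  H1_003N = c(2, 11, 0, 3, 0)       # assumed: vacant units
)

### Share of units that are vacant; blocks with no housing get NA
housing_example <- housing_example %>%
  mutate(vacancy_rate = if_else(H1_001N > 0,
                                H1_003N / H1_001N,
                                NA_real_))

housing_example
```

Many census blocks contain no housing at all, so guarding against division by zero matters more at the block level than at coarser geographies.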

Read/Write Data

The last task in the data retrieval section is to learn how to write out (and then read back in) a spatial data layer. The sf package makes this relatively easy. The reference pages for reading and writing sf objects are here: https://r-spatial.github.io/sf/reference/st_read.html, https://r-spatial.github.io/sf/reference/st_write.html

## Write out data as a geopackage to current working directory
st_write(wake_acs5_2020_age, "wake_example_age.gpkg")

## Quiet version that will also overwrite existing file
write_sf(wake_acs5_2020_age, "wake_example_age.gpkg")

## Read in a spatial file
new_obj_wake <- st_read("wake_example_age.gpkg")

## Quiet version for reading in
new_obj_wake <- read_sf("wake_example_age.gpkg")
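To try the write/read round trip without touching your working directory, here is a hedged sketch that uses a made-up point layer and a temporary file. The objects toy_sf, toy_path, and toy_back are illustrative names. It also shows the delete_dsn argument of st_write(), which replaces an existing file so that re-running a script does not stop with an error.

```r
library(sf)

### Made-up point layer standing in for the downloaded data
toy_sf <- st_sf(GEOID    = c("A", "B"),
                estimate = c(46, 228),
                geometry = st_sfc(st_point(c(0, 0)),
                                  st_point(c(1, 1)),
                                  crs = 4326))

### Temporary file path, removed when the R session ends
toy_path <- tempfile(fileext = ".gpkg")

### delete_dsn = TRUE replaces the file if it already exists
st_write(toy_sf, toy_path, delete_dsn = TRUE, quiet = TRUE)

### Read the layer back in and compare
toy_back <- read_sf(toy_path)
nrow(toy_back) == nrow(toy_sf)
```

After a round trip through a GeoPackage, the geometry column may come back under a different name than it had in R, so refer to it with st_geometry() rather than by name.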

Challenge

Create a new R script that contains (only) the required commands to download the 2018 ACS 5-year data for Median Household Income (B19013) at the tract level for Kent County, Michigan. Make sure that the data are in wide format and include the spatial features. Write out the data!


This page was last updated on February 26, 2024

Data Retrieval

Paul L. Delamater

Odum Institute

February 26, 2024