tidycensus code examples

Author

David W. Body

Published

June 11, 2022

KCRUG

These are the code examples I was unable to present today due to technical issues.

Setup

library(tidycensus)
library(tidyverse)
── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
✔ ggplot2 3.3.6     ✔ purrr   0.3.4
✔ tibble  3.1.7     ✔ dplyr   1.0.9
✔ tidyr   1.2.0     ✔ stringr 1.4.0
✔ readr   2.1.2     ✔ forcats 0.5.1
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
options(tigris_use_cache = TRUE)
# census_api_key("YOUR KEY GOES HERE")

2020 Decennial Census

What variables are available for the 2020 Decennial Census?

Data released for political redistricting pursuant to P.L. 94-171 is dataset “pl”.

dc2020_vars <- load_variables(2020, "pl", cache = TRUE)
dc2020_vars
# A tibble: 301 × 3
   name    label                                                         concept
   <chr>   <chr>                                                         <chr>  
 1 H1_001N " !!Total:"                                                   OCCUPA…
 2 H1_002N " !!Total:!!Occupied"                                         OCCUPA…
 3 H1_003N " !!Total:!!Vacant"                                           OCCUPA…
 4 P1_001N " !!Total:"                                                   RACE   
 5 P1_002N " !!Total:!!Population of one race:"                          RACE   
 6 P1_003N " !!Total:!!Population of one race:!!White alone"             RACE   
 7 P1_004N " !!Total:!!Population of one race:!!Black or African Americ… RACE   
 8 P1_005N " !!Total:!!Population of one race:!!American Indian and Ala… RACE   
 9 P1_006N " !!Total:!!Population of one race:!!Asian alone"             RACE   
10 P1_007N " !!Total:!!Population of one race:!!Native Hawaiian and Oth… RACE   
# … with 291 more rows

Variables are grouped into tables.

dc2020_vars |>
  transmute(table = str_sub(name, 1, 2), concept) |>
  count(table, concept)
# A tibble: 6 × 3
  table concept                                                                n
  <chr> <chr>                                                              <int>
1 H1    OCCUPANCY STATUS                                                       3
2 P1    RACE                                                                  71
3 P2    HISPANIC OR LATINO, AND NOT HISPANIC OR LATINO BY RACE                73
4 P3    RACE FOR THE POPULATION 18 YEARS AND OVER                             71
5 P4    HISPANIC OR LATINO, AND NOT HISPANIC OR LATINO BY RACE FOR THE PO…    73
6 P5    GROUP QUARTERS POPULATION BY MAJOR GROUP QUARTERS TYPE                10

Example: Total population of Nebraska counties.

Note

Notice the output about differential privacy.

# total population for Nebraska counties
ne_pop_2020 <- get_decennial(
  geography = "county",
  variables = "P2_001N",
  year = 2020,
  state = "Nebraska",
)
Getting data from the 2020 decennial Census
Using the PL 94-171 Redistricting Data summary file
Note: 2020 decennial Census data use differential privacy, a technique that
introduces errors into data to preserve respondent confidentiality.
ℹ Small counts should be interpreted with caution.
ℹ See https://www.census.gov/library/fact-sheets/2021/protecting-the-confidentiality-of-the-2020-census-redistricting-data.html for additional guidance.
This message is displayed once per session.
ne_pop_2020
# A tibble: 93 × 4
   GEOID NAME                    variable value
   <chr> <chr>                   <chr>    <dbl>
 1 31005 Arthur County, Nebraska P2_001N    434
 2 31011 Boone County, Nebraska  P2_001N   5379
 3 31017 Brown County, Nebraska  P2_001N   2903
 4 31023 Butler County, Nebraska P2_001N   8369
 5 31029 Chase County, Nebraska  P2_001N   3893
 6 31035 Clay County, Nebraska   P2_001N   6104
 7 31039 Cuming County, Nebraska P2_001N   9013
 8 31045 Dawes County, Nebraska  P2_001N   8199
 9 31051 Dixon County, Nebraska  P2_001N   5606
10 31057 Dundy County, Nebraska  P2_001N   1654
# … with 83 more rows

tidycensus gives us data in regular dataframes (tibbles), so we can wrangle it as usual.

Here are the counties with the five highest and five lowest populations in Nebraska.

ne_pop_2020 |>
  slice_max(value, n = 5)
# A tibble: 5 × 4
  GEOID NAME                       variable  value
  <chr> <chr>                      <chr>     <dbl>
1 31055 Douglas County, Nebraska   P2_001N  584526
2 31109 Lancaster County, Nebraska P2_001N  322608
3 31153 Sarpy County, Nebraska     P2_001N  190604
4 31079 Hall County, Nebraska      P2_001N   62895
5 31019 Buffalo County, Nebraska   P2_001N   50084
ne_pop_2020 |>
  slice_min(value, n = 5)
# A tibble: 5 × 4
  GEOID NAME                       variable value
  <chr> <chr>                      <chr>    <dbl>
1 31117 McPherson County, Nebraska P2_001N    399
2 31009 Blaine County, Nebraska    P2_001N    431
3 31005 Arthur County, Nebraska    P2_001N    434
4 31115 Loup County, Nebraska      P2_001N    607
5 31075 Grant County, Nebraska     P2_001N    611

First attempt at visualizing the data.

ne_pop_2020 |>
  ggplot(aes(value, NAME)) +
  geom_col()

Okay. That’s pretty bad. Let’s see if we can improve on that.

ne_pop_2020 |>
  mutate(NAME = str_remove(NAME, " County, Nebraska")) |>
  ggplot(aes(value, reorder(NAME, value))) +
  geom_col() +
  scale_x_continuous(labels = scales::label_comma()) +
  labs(title = "Population of Nebraska counties",
       subtitle = "2020 Decennial US Census",
       x = "Population",
       y = "") +
  theme_light()

Retrieving multiple variables at once

We can retrieve multiple variables at once by passing a character vector of variable names to get_decennial.

variables <- c("P2_001N", "P2_005N", "P2_002N", "P2_006N", "P2_007N",
               "P2_008N", "P2_009N", "P2_010N", "P2_011N")

ne_race_ethnicity_2020 <- get_decennial(
  geography = "county",
  variables = variables,
  year = 2020,
  state = "NE", # <=
  # output = "wide"
)
Getting data from the 2020 decennial Census
Using the PL 94-171 Redistricting Data summary file
ne_race_ethnicity_2020
# A tibble: 837 × 4
   GEOID NAME                    variable value
   <chr> <chr>                   <chr>    <dbl>
 1 31005 Arthur County, Nebraska P2_001N    434
 2 31011 Boone County, Nebraska  P2_001N   5379
 3 31017 Brown County, Nebraska  P2_001N   2903
 4 31023 Butler County, Nebraska P2_001N   8369
 5 31029 Chase County, Nebraska  P2_001N   3893
 6 31035 Clay County, Nebraska   P2_001N   6104
 7 31039 Cuming County, Nebraska P2_001N   9013
 8 31045 Dawes County, Nebraska  P2_001N   8199
 9 31051 Dixon County, Nebraska  P2_001N   5606
10 31057 Dundy County, Nebraska  P2_001N   1654
# … with 827 more rows

Note that by default, the output is in tidy format. We could get this into wide format using something like pivot_wider, but tidycensus gives us the option of requesting wide format.

Note that I also changed “Nebraska” to “NE”. tidycensus accepts either.

ne_race_ethnicity_2020 <- get_decennial(
  geography = "county",
  variables = variables,
  year = 2020,
  state = "NE", # <=
  output = "wide"
)
Getting data from the 2020 decennial Census
Using the PL 94-171 Redistricting Data summary file
ne_race_ethnicity_2020
# A tibble: 93 × 11
   GEOID NAME    P2_001N P2_005N P2_002N P2_006N P2_007N P2_008N P2_009N P2_010N
   <chr> <chr>     <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
 1 31005 Arthur…     434     400      12       3       0       0       0       0
 2 31011 Boone …    5379    5057     183      21      15       7       0       6
 3 31017 Brown …    2903    2629     176       2       7       6       0       2
 4 31023 Butler…    8369    7626     482      28      22       9       1      23
 5 31029 Chase …    3893    3209     562      20       2      11       0       2
 6 31035 Clay C…    6104    5372     523       8      13      10       0      14
 7 31039 Cuming…    9013    7638    1112      21      26      18       9      23
 8 31045 Dawes …    8199    6943     328     164     322      53       6      21
 9 31051 Dixon …    5606    4643     796      17      19      17       0       8
10 31057 Dundy …    1654    1491     103       7       9       1       3       4
# … with 83 more rows, and 1 more variable: P2_011N <dbl>

We can give the variables meaningful names by using a named vector like this. Notice that I am omitting total population (P2_001N) from the variable list and adding it back with the parameter summary_var.

variables <- c(
  # Total = "P2_001N",
  White = "P2_005N",
  Hispanic = "P2_002N",
  Black = "P2_006N",
  Native = "P2_007N",
  Asian = "P2_008N",
  HIPI = "P2_009N",
  Other = "P2_010N",
  TwoOrMore = "P2_011N")

ne_race_ethnicity_2020 <- get_decennial(
  geography = "county",
  variables = variables,
  year = 2020,
  summary_var = "P2_001N", # <=
  state = "NE",
  # output = "wide"
)
Getting data from the 2020 decennial Census
Using the PL 94-171 Redistricting Data summary file
ne_race_ethnicity_2020
# A tibble: 744 × 5
   GEOID NAME                    variable value summary_value
   <chr> <chr>                   <chr>    <dbl>         <dbl>
 1 31005 Arthur County, Nebraska White      400           434
 2 31011 Boone County, Nebraska  White     5057          5379
 3 31017 Brown County, Nebraska  White     2629          2903
 4 31023 Butler County, Nebraska White     7626          8369
 5 31029 Chase County, Nebraska  White     3209          3893
 6 31035 Clay County, Nebraska   White     5372          6104
 7 31039 Cuming County, Nebraska White     7638          9013
 8 31045 Dawes County, Nebraska  White     6943          8199
 9 31051 Dixon County, Nebraska  White     4643          5606
10 31057 Dundy County, Nebraska  White     1491          1654
# … with 734 more rows

We can wrangle the data as usual. For example here are totals for the state of Nebraska.

ne_race_ethnicity_2020 |>
  group_by(variable) |>
  summarize(n = sum(value)) |>
  arrange(desc(n))
# A tibble: 8 × 2
  variable        n
  <chr>       <dbl>
1 White     1484687
2 Hispanic   234715
3 Black       94405
4 TwoOrMore   72634
5 Asian       52359
6 Native      15051
7 Other        6335
8 HIPI         1318

Now we also have a separate column called summary_value that contains the total population for each county. That makes it convenient to calculate percentages. Here we compute the percentage of the population that is Hispanic or Latino for each county.

ne_hispanic_percent <- ne_race_ethnicity_2020 |>
  filter(variable == "Hispanic") |>
  mutate(percent = 100 * (value / summary_value)) |>
  select(NAME, percent)

ne_hispanic_percent
# A tibble: 93 × 2
   NAME                    percent
   <chr>                     <dbl>
 1 Arthur County, Nebraska    2.76
 2 Boone County, Nebraska     3.40
 3 Brown County, Nebraska     6.06
 4 Butler County, Nebraska    5.76
 5 Chase County, Nebraska    14.4 
 6 Clay County, Nebraska      8.57
 7 Cuming County, Nebraska   12.3 
 8 Dawes County, Nebraska     4.00
 9 Dixon County, Nebraska    14.2 
10 Dundy County, Nebraska     6.23
# … with 83 more rows
# Lowest 5
ne_hispanic_percent |>
  slice_min(percent, n = 5)
# A tibble: 5 × 2
  NAME                       percent
  <chr>                        <dbl>
1 McPherson County, Nebraska   0.752
2 Garfield County, Nebraska    1.32 
3 Pawnee County, Nebraska      1.53 
4 Hooker County, Nebraska      1.55 
5 Rock County, Nebraska        1.58 
# Highest 5
ne_hispanic_percent |>
  slice_max(percent, n = 5)
# A tibble: 5 × 2
  NAME                    percent
  <chr>                     <dbl>
1 Colfax County, Nebraska    47.2
2 Dakota County, Nebraska    40.8
3 Dawson County, Nebraska    35.8
4 Hall County, Nebraska      30.5
5 Saline County, Nebraska    28.5

Choropleth maps

The tidycensus package makes it easy to retrieve the geographic data necessary to draw maps. All we have to do is add geometry = TRUE.

ne_race_ethnicity_2020 <- get_decennial(
  geography = "county",
  variables = variables,
  year = 2020,
  state = "NE",
  summary_var = "P2_001N",
  geometry = TRUE # <=
)
Getting data from the 2020 decennial Census
Using the PL 94-171 Redistricting Data summary file
ne_race_ethnicity_2020
Simple feature collection with 744 features and 5 fields
Geometry type: MULTIPOLYGON
Dimension:     XY
Bounding box:  xmin: -104.0535 ymin: 40 xmax: -95.30829 ymax: 43.00171
Geodetic CRS:  NAD83
# A tibble: 744 × 6
   GEOID NAME             variable value summary_value                  geometry
   <chr> <chr>            <chr>    <dbl>         <dbl>        <MULTIPOLYGON [°]>
 1 31105 Kimball County,… White     3026          3434 (((-104.0534 41.17054, -…
 2 31105 Kimball County,… Hispanic   262          3434 (((-104.0534 41.17054, -…
 3 31105 Kimball County,… Black        1          3434 (((-104.0534 41.17054, -…
 4 31105 Kimball County,… Native      24          3434 (((-104.0534 41.17054, -…
 5 31105 Kimball County,… Asian       15          3434 (((-104.0534 41.17054, -…
 6 31105 Kimball County,… HIPI         0          3434 (((-104.0534 41.17054, -…
 7 31105 Kimball County,… Other        9          3434 (((-104.0534 41.17054, -…
 8 31105 Kimball County,… TwoOrMo…    97          3434 (((-104.0534 41.17054, -…
 9 31069 Garden County, … White     1722          1874 (((-102.6784 41.87489, -…
10 31069 Garden County, … Hispanic    88          1874 (((-102.6784 41.87489, -…
# … with 734 more rows

Notice that we now have a simple feature collection rather than an ordinary dataframe (tibble). The geographic data is in the geometry column. Geographic data is composed of polygons in the form of lists of longitude/latitude points. The geometry column is actually “multipolygons” to support geographic entities that are not contiguous. E.g. coastal counties that include islands.

Next we recompute the percentage Hispanic or Latino.

ne_hispanic_percent <- ne_race_ethnicity_2020 |>
  filter(variable == "Hispanic") |>
  mutate(percent = 100 * (value / summary_value)) |>
  select(NAME, percent)

ne_hispanic_percent
Simple feature collection with 93 features and 2 fields
Geometry type: MULTIPOLYGON
Dimension:     XY
Bounding box:  xmin: -104.0535 ymin: 40 xmax: -95.30829 ymax: 43.00171
Geodetic CRS:  NAD83
# A tibble: 93 × 3
   NAME                       percent                                   geometry
   <chr>                        <dbl>                         <MULTIPOLYGON [°]>
 1 Kimball County, Nebraska     7.63  (((-104.0534 41.17054, -104.0532 41.19995…
 2 Garden County, Nebraska      4.70  (((-102.6784 41.87489, -102.6772 41.99066…
 3 Wheeler County, Nebraska     2.58  (((-98.76116 42.08852, -98.71327 42.08851…
 4 Greeley County, Nebraska     2.38  (((-98.75222 41.74037, -98.69451 41.74043…
 5 Thomas County, Nebraska      4.33  (((-100.8461 42.08817, -100.8119 42.0881,…
 6 Hooker County, Nebraska      1.55  (((-101.4266 42.09219, -101.4122 42.09211…
 7 Perkins County, Nebraska     5.11  (((-102.0518 41.00387, -102.0332 41.00313…
 8 Dawes County, Nebraska       4.00  (((-103.5051 43.00077, -103.4046 43.00074…
 9 Holt County, Nebraska        5.14  (((-99.25704 42.8043, -99.25215 42.80782,…
10 McPherson County, Nebraska   0.752 (((-101.4073 41.40985, -101.4071 41.43708…
# … with 83 more rows

We can make our first choropleth using the plot function.

plot(ne_hispanic_percent["percent"])

If we want to use ggplot2, we use geom_sf.

ne_hispanic_percent |>
  ggplot(aes(geometry = geometry, fill = percent)) +
  geom_sf()

Let’s clean that up by removing the axis labels, selecting a different color palette, and adding some labels.

We’ll also transform the data to a different coordinate reference system (CRS) The details of this are beyond the scope of today’s demo, but notice the subtle change to the shape of the state.

library(sf) # for st_transform function
Linking to GEOS 3.9.0, GDAL 3.2.2, PROJ 7.2.1; sf_use_s2() is TRUE
ne_hispanic_percent |>
  st_transform(crs = 32104) |>
  ggplot(aes(geometry = geometry, fill = percent)) +
  geom_sf() +
  scale_fill_viridis_c() +
  theme_void() +
  labs(title = "Nebraska Hispanic or Latino Population",
       subtitle = "2020 Decennial US Census",
       fill = "Percent of total\ncounty population")

2016-2020 American Community Survey (ACS)

Working with the ACS is similar. We start by getting a list of available variables.

acs5_2020_vars <- load_variables(2020, "acs5", cache = TRUE)
acs5_2020_vars
# A tibble: 27,850 × 4
   name       label                                    concept    geography  
   <chr>      <chr>                                    <chr>      <chr>      
 1 B01001_001 Estimate!!Total:                         SEX BY AGE block group
 2 B01001_002 Estimate!!Total:!!Male:                  SEX BY AGE block group
 3 B01001_003 Estimate!!Total:!!Male:!!Under 5 years   SEX BY AGE block group
 4 B01001_004 Estimate!!Total:!!Male:!!5 to 9 years    SEX BY AGE block group
 5 B01001_005 Estimate!!Total:!!Male:!!10 to 14 years  SEX BY AGE block group
 6 B01001_006 Estimate!!Total:!!Male:!!15 to 17 years  SEX BY AGE block group
 7 B01001_007 Estimate!!Total:!!Male:!!18 and 19 years SEX BY AGE block group
 8 B01001_008 Estimate!!Total:!!Male:!!20 years        SEX BY AGE block group
 9 B01001_009 Estimate!!Total:!!Male:!!21 years        SEX BY AGE block group
10 B01001_010 Estimate!!Total:!!Male:!!22 to 24 years  SEX BY AGE block group
# … with 27,840 more rows

Notice that there are a lot more variables and the output contains a column indicating the smallest geography for which data is available for each variable.

Let’s get median household income for each county in Massachusetts.

ma_income_2020 <- get_acs(
  geography = "county",
  variables = "B19013_001",
  year = 2020,
  state = "MA",
  # moe_level = 99,
  # output = "wide"
)
Getting data from the 2016-2020 5-year ACS
ma_income_2020
# A tibble: 14 × 5
   GEOID NAME                             variable   estimate   moe
   <chr> <chr>                            <chr>         <dbl> <dbl>
 1 25001 Barnstable County, Massachusetts B19013_001    76863  1835
 2 25003 Berkshire County, Massachusetts  B19013_001    62166  2305
 3 25005 Bristol County, Massachusetts    B19013_001    71450  1420
 4 25007 Dukes County, Massachusetts      B19013_001    77318 11503
 5 25009 Essex County, Massachusetts      B19013_001    82225  1120
 6 25011 Franklin County, Massachusetts   B19013_001    61198  1823
 7 25013 Hampden County, Massachusetts    B19013_001    57623  1342
 8 25015 Hampshire County, Massachusetts  B19013_001    73518  2175
 9 25017 Middlesex County, Massachusetts  B19013_001   106202  1144
10 25019 Nantucket County, Massachusetts  B19013_001   112306 10162
11 25021 Norfolk County, Massachusetts    B19013_001   105320  1736
12 25023 Plymouth County, Massachusetts   B19013_001    92906  2042
13 25025 Suffolk County, Massachusetts    B19013_001    74881  1408
14 25027 Worcester County, Massachusetts  B19013_001    77155   994

Because the ACS is based on sampling, the variables represent estimates, and estimates have margins of error. Those are the columns named estimate and moe. By default, the margins of error are at the 90% level, but you can request 95% or 99% margin of error levels using the moe_level parameter.

ACS data can also be retrieved in wide format. The tidycensus package will append an E to the names of estimate columns and an M for the margin of error columns.

Finally, we’ll do a plot where we represent the uncertainty using error bars.

ma_income_2020 |>
  mutate(NAME = str_remove(NAME, " County, Massachusetts")) |>
  ggplot(aes(estimate, reorder(NAME, estimate))) +
  geom_point(color = "red", size = 2.5) +
  geom_errorbar(aes(xmin = estimate - moe, xmax = estimate + moe)) +
  scale_x_continuous(labels = scales::label_dollar()) +
  theme_light() +
  labs(
    title = "Median Household Income for Massachusetts Counties",
    subtitle = "with 90% compatibility intervals",
    x = "",
    y = "",
    caption = "Data source: 2016-2020 ACS & tidycensus R package"
  )

Exercise for the reader: Why do those two counties have such high margins of error?