# A tibble: 301 × 3
name label concept
<chr> <chr> <chr>
1 H1_001N " !!Total:" OCCUPA…
2 H1_002N " !!Total:!!Occupied" OCCUPA…
3 H1_003N " !!Total:!!Vacant" OCCUPA…
4 P1_001N " !!Total:" RACE
5 P1_002N " !!Total:!!Population of one race:" RACE
6 P1_003N " !!Total:!!Population of one race:!!White alone" RACE
7 P1_004N " !!Total:!!Population of one race:!!Black or African Americ… RACE
8 P1_005N " !!Total:!!Population of one race:!!American Indian and Ala… RACE
9 P1_006N " !!Total:!!Population of one race:!!Asian alone" RACE
10 P1_007N " !!Total:!!Population of one race:!!Native Hawaiian and Oth… RACE
# … with 291 more rows
# A tibble: 6 × 3
table concept n
<chr> <chr> <int>
1 H1 OCCUPANCY STATUS 3
2 P1 RACE 71
3 P2 HISPANIC OR LATINO, AND NOT HISPANIC OR LATINO BY RACE 73
4 P3 RACE FOR THE POPULATION 18 YEARS AND OVER 71
5 P4 HISPANIC OR LATINO, AND NOT HISPANIC OR LATINO BY RACE FOR THE PO… 73
6 P5 GROUP QUARTERS POPULATION BY MAJOR GROUP QUARTERS TYPE 10
Example: Total population of Nebraska counties.
Note
Notice the output about differential privacy.
# total population for Nebraska countiesne_pop_2020 <-get_decennial(geography ="county",variables ="P2_001N",year =2020,state ="Nebraska",)
Getting data from the 2020 decennial Census
Using the PL 94-171 Redistricting Data summary file
Note: 2020 decennial Census data use differential privacy, a technique that
introduces errors into data to preserve respondent confidentiality.
ℹ Small counts should be interpreted with caution.
ℹ See https://www.census.gov/library/fact-sheets/2021/protecting-the-confidentiality-of-the-2020-census-redistricting-data.html for additional guidance.
This message is displayed once per session.
ne_pop_2020
# A tibble: 93 × 4
GEOID NAME variable value
<chr> <chr> <chr> <dbl>
1 31005 Arthur County, Nebraska P2_001N 434
2 31011 Boone County, Nebraska P2_001N 5379
3 31017 Brown County, Nebraska P2_001N 2903
4 31023 Butler County, Nebraska P2_001N 8369
5 31029 Chase County, Nebraska P2_001N 3893
6 31035 Clay County, Nebraska P2_001N 6104
7 31039 Cuming County, Nebraska P2_001N 9013
8 31045 Dawes County, Nebraska P2_001N 8199
9 31051 Dixon County, Nebraska P2_001N 5606
10 31057 Dundy County, Nebraska P2_001N 1654
# … with 83 more rows
tidycensus gives us data in regular dataframes (tibbles), so we can wrangle it as usual.
Here are the counties with the five highest and five lowest populations in Nebraska.
ne_pop_2020 |>slice_max(value, n =5)
# A tibble: 5 × 4
GEOID NAME variable value
<chr> <chr> <chr> <dbl>
1 31055 Douglas County, Nebraska P2_001N 584526
2 31109 Lancaster County, Nebraska P2_001N 322608
3 31153 Sarpy County, Nebraska P2_001N 190604
4 31079 Hall County, Nebraska P2_001N 62895
5 31019 Buffalo County, Nebraska P2_001N 50084
ne_pop_2020 |>slice_min(value, n =5)
# A tibble: 5 × 4
GEOID NAME variable value
<chr> <chr> <chr> <dbl>
1 31117 McPherson County, Nebraska P2_001N 399
2 31009 Blaine County, Nebraska P2_001N 431
3 31005 Arthur County, Nebraska P2_001N 434
4 31115 Loup County, Nebraska P2_001N 607
5 31075 Grant County, Nebraska P2_001N 611
Using the PL 94-171 Redistricting Data summary file
ne_race_ethnicity_2020
# A tibble: 837 × 4
GEOID NAME variable value
<chr> <chr> <chr> <dbl>
1 31005 Arthur County, Nebraska P2_001N 434
2 31011 Boone County, Nebraska P2_001N 5379
3 31017 Brown County, Nebraska P2_001N 2903
4 31023 Butler County, Nebraska P2_001N 8369
5 31029 Chase County, Nebraska P2_001N 3893
6 31035 Clay County, Nebraska P2_001N 6104
7 31039 Cuming County, Nebraska P2_001N 9013
8 31045 Dawes County, Nebraska P2_001N 8199
9 31051 Dixon County, Nebraska P2_001N 5606
10 31057 Dundy County, Nebraska P2_001N 1654
# … with 827 more rows
Note that by default, the output is in tidy format. We could get this into wide format using something like pivot_wider, but tidycensus gives us the option of requesting wide format.
Note that I also changed “Nebraska” to “NE”. tidycensus accepts either.
We can give the variables meaningful names by using a named vector like this. Notice that I am omitting total population (P2_001N) from the variable list and adding it back with the parameter summary_var.
Using the PL 94-171 Redistricting Data summary file
ne_race_ethnicity_2020
# A tibble: 744 × 5
GEOID NAME variable value summary_value
<chr> <chr> <chr> <dbl> <dbl>
1 31005 Arthur County, Nebraska White 400 434
2 31011 Boone County, Nebraska White 5057 5379
3 31017 Brown County, Nebraska White 2629 2903
4 31023 Butler County, Nebraska White 7626 8369
5 31029 Chase County, Nebraska White 3209 3893
6 31035 Clay County, Nebraska White 5372 6104
7 31039 Cuming County, Nebraska White 7638 9013
8 31045 Dawes County, Nebraska White 6943 8199
9 31051 Dixon County, Nebraska White 4643 5606
10 31057 Dundy County, Nebraska White 1491 1654
# … with 734 more rows
We can wrangle the data as usual. For example here are totals for the state of Nebraska.
# A tibble: 8 × 2
variable n
<chr> <dbl>
1 White 1484687
2 Hispanic 234715
3 Black 94405
4 TwoOrMore 72634
5 Asian 52359
6 Native 15051
7 Other 6335
8 HIPI 1318
Now we also have a separate column called summary_value that contains the total population for each county. That makes it convenient to calculate percentages. Here we compute the percentage of the population that is Hispanic or Latino for each county.
# A tibble: 93 × 2
NAME percent
<chr> <dbl>
1 Arthur County, Nebraska 2.76
2 Boone County, Nebraska 3.40
3 Brown County, Nebraska 6.06
4 Butler County, Nebraska 5.76
5 Chase County, Nebraska 14.4
6 Clay County, Nebraska 8.57
7 Cuming County, Nebraska 12.3
8 Dawes County, Nebraska 4.00
9 Dixon County, Nebraska 14.2
10 Dundy County, Nebraska 6.23
# … with 83 more rows
# Lowest 5ne_hispanic_percent |>slice_min(percent, n =5)
# A tibble: 5 × 2
NAME percent
<chr> <dbl>
1 McPherson County, Nebraska 0.752
2 Garfield County, Nebraska 1.32
3 Pawnee County, Nebraska 1.53
4 Hooker County, Nebraska 1.55
5 Rock County, Nebraska 1.58
# Highest 5ne_hispanic_percent |>slice_max(percent, n =5)
# A tibble: 5 × 2
NAME percent
<chr> <dbl>
1 Colfax County, Nebraska 47.2
2 Dakota County, Nebraska 40.8
3 Dawson County, Nebraska 35.8
4 Hall County, Nebraska 30.5
5 Saline County, Nebraska 28.5
Choropleth maps
The tidycensus package makes it easy to retrieve the geographic data necessary to draw maps. All we have to do is add geometry = TRUE.
Notice that we now have a simple feature collection rather than an ordinary dataframe (tibble). The geographic data is in the geometry column. Geographic data is composed of polygons in the form of lists of longitude/latitude points. The geometry column is actually “multipolygons” to support geographic entities that are not contiguous. E.g. coastal counties that include islands.
Next we recompute the percentage Hispanic or Latino.
Simple feature collection with 93 features and 2 fields
Geometry type: MULTIPOLYGON
Dimension: XY
Bounding box: xmin: -104.0535 ymin: 40 xmax: -95.30829 ymax: 43.00171
Geodetic CRS: NAD83
# A tibble: 93 × 3
NAME percent geometry
<chr> <dbl> <MULTIPOLYGON [°]>
1 Kimball County, Nebraska 7.63 (((-104.0534 41.17054, -104.0532 41.19995…
2 Garden County, Nebraska 4.70 (((-102.6784 41.87489, -102.6772 41.99066…
3 Wheeler County, Nebraska 2.58 (((-98.76116 42.08852, -98.71327 42.08851…
4 Greeley County, Nebraska 2.38 (((-98.75222 41.74037, -98.69451 41.74043…
5 Thomas County, Nebraska 4.33 (((-100.8461 42.08817, -100.8119 42.0881,…
6 Hooker County, Nebraska 1.55 (((-101.4266 42.09219, -101.4122 42.09211…
7 Perkins County, Nebraska 5.11 (((-102.0518 41.00387, -102.0332 41.00313…
8 Dawes County, Nebraska 4.00 (((-103.5051 43.00077, -103.4046 43.00074…
9 Holt County, Nebraska 5.14 (((-99.25704 42.8043, -99.25215 42.80782,…
10 McPherson County, Nebraska 0.752 (((-101.4073 41.40985, -101.4071 41.43708…
# … with 83 more rows
We can make our first choropleth using the plot function.
plot(ne_hispanic_percent["percent"])
If we want to use ggplot2, we use geom_sf.
ne_hispanic_percent |>ggplot(aes(geometry = geometry, fill = percent)) +geom_sf()
Let’s clean that up by removing the axis labels, selecting a different color palette, and adding some labels.
We’ll also transform the data to a different coordinate reference system (CRS) The details of this are beyond the scope of today’s demo, but notice the subtle change to the shape of the state.
library(sf) # for st_transform function
Linking to GEOS 3.9.0, GDAL 3.2.2, PROJ 7.2.1; sf_use_s2() is TRUE
ne_hispanic_percent |>st_transform(crs =32104) |>ggplot(aes(geometry = geometry, fill = percent)) +geom_sf() +scale_fill_viridis_c() +theme_void() +labs(title ="Nebraska Hispanic or Latino Population",subtitle ="2020 Decennial US Census",fill ="Percent of total\ncounty population")
2016-2020 American Community Survey (ACS)
Working with the ACS is similar. We start by getting a list of available variables.
# A tibble: 27,850 × 4
name label concept geography
<chr> <chr> <chr> <chr>
1 B01001_001 Estimate!!Total: SEX BY AGE block group
2 B01001_002 Estimate!!Total:!!Male: SEX BY AGE block group
3 B01001_003 Estimate!!Total:!!Male:!!Under 5 years SEX BY AGE block group
4 B01001_004 Estimate!!Total:!!Male:!!5 to 9 years SEX BY AGE block group
5 B01001_005 Estimate!!Total:!!Male:!!10 to 14 years SEX BY AGE block group
6 B01001_006 Estimate!!Total:!!Male:!!15 to 17 years SEX BY AGE block group
7 B01001_007 Estimate!!Total:!!Male:!!18 and 19 years SEX BY AGE block group
8 B01001_008 Estimate!!Total:!!Male:!!20 years SEX BY AGE block group
9 B01001_009 Estimate!!Total:!!Male:!!21 years SEX BY AGE block group
10 B01001_010 Estimate!!Total:!!Male:!!22 to 24 years SEX BY AGE block group
# … with 27,840 more rows
Notice that there are a lot more variables and the output contains a column indicating the smallest geography for which data is available for each variable.
Let’s get median household income for each county in Massachusetts.
# A tibble: 14 × 5
GEOID NAME variable estimate moe
<chr> <chr> <chr> <dbl> <dbl>
1 25001 Barnstable County, Massachusetts B19013_001 76863 1835
2 25003 Berkshire County, Massachusetts B19013_001 62166 2305
3 25005 Bristol County, Massachusetts B19013_001 71450 1420
4 25007 Dukes County, Massachusetts B19013_001 77318 11503
5 25009 Essex County, Massachusetts B19013_001 82225 1120
6 25011 Franklin County, Massachusetts B19013_001 61198 1823
7 25013 Hampden County, Massachusetts B19013_001 57623 1342
8 25015 Hampshire County, Massachusetts B19013_001 73518 2175
9 25017 Middlesex County, Massachusetts B19013_001 106202 1144
10 25019 Nantucket County, Massachusetts B19013_001 112306 10162
11 25021 Norfolk County, Massachusetts B19013_001 105320 1736
12 25023 Plymouth County, Massachusetts B19013_001 92906 2042
13 25025 Suffolk County, Massachusetts B19013_001 74881 1408
14 25027 Worcester County, Massachusetts B19013_001 77155 994
Because the ACS is based on sampling, the variables represent estimates, and estimates have margins of error. Those are the columns named estimate and moe. By default, the margins of error are at the 90% level, but you can request 95% or 99% margin of error levels using the moe_level parameter.
ACS data can also be retrieved in wide format. The tidycensus package will append an E to the names of estimate columns and an M for the margin of error columns.
Finally, we’ll do a plot where we represent the uncertainty using error bars.