Urban Data Analysis

Domain Specific Languages

2019-05-06T05:00:00+00:00

A reflection on domain specific languages

I can’t recommend Mark van der Loo and Edwin de Jonge’s Statistical Data Cleaning in with Applications in R enough. It is extremely thorough, extremely useful, and serves as a great reference as well as an introduction.

However, in their chapter on their data validation validate package and how to use it–and it can’t be said enough that the package is brilliant and has changed how I have worked for a while–they pick up the question of why begin to develop specific syntaxes embedded in R at all? I’ve been searching for a way to express this myself, since so much of what makes the R universe unique are things like base R’s interesting statistical function syntax, e.g. lm(var1 ~ var2, var3). And I can’t think of a more concise statement of the pros and cons of embedding a domain specific language within another than theirs:

There are many advantages to implementing a data validation syntax embedded into R. First, R’s facilities to compute on the language including access to the abstract syntax tree of statements and nonstandard evaluation make experimenting with such an implementation a breeze. In particular, it makes it easy to experiment with different ideas and test them out in practice, something which is much harder while developing a standalone DSL. Second, using R as a host language means access to the truly enormous data processing and statistical capabilities that come with R and its packages at no cost whatsoever. Third, many users interested in data validation are already familiar with R and will be able to use the DSL with relative ease.

The downside of embedding a DSL into another language includes leakage: a user may (unwittingly) “escape” the DSL and use more advanced features of the host language. A second downside is that the syntax of the host language may be too limited to accurately capture the concepts for which the DSL was designed. It is of interest to note that R is more flexible than many other languages, since it allows for the definition of user-defined infix operators. The most famous example is probably the pipe operator %>% of the magrittr package. (From Statistical Data Cleaning in with Applications in R, p. 149)

Runs and wins: regression with tidy data

2019-04-12T23:00:00+00:00

A quick update to a chapter on analyzing baseball data

A great book for learning R is Analyzing Baseball Data with R by Mack Marchi and Jim Albert. Besides being a great book about how to automate a lot of tasks in baseball analytics, the book provides a context for many of the data analysis workflows that are able to be automated in R: if you know a little about baseball you can get a really great feel for many data analysis tasks, and vice versa.

Unfortunately, because the book was published in 2013, there’s little support in the text for tidy data workflows and a lot of use of base R: lots of subset(), lots of with(), and lots of plotting with plot() rather than ggplot(). This is good, on the one hand, because base R is already intuitive and useful and sometimes the tidy data framework (particularly ggplot) is not always the foolproof solution to a lot of problems. However, since the tidy data framework is generally even more intuitive and also so very extensible for many data analysis tasks, I’ve decided to give an indication of how some of the analyses might change.

1.1 The relation between runs and wins

Chapter 4 of Marchi and Albert’s book covers the relation between runs and wins. Specifically it follows the very interesting question of whether run differential matters in baseball. Run differential is the difference between the winning team’s runs and the runs of the opponent, or how much a team beat their opponent by. All a team has to do to beat an opponent, of course, is one more run than them. But not all opponents are beaten equally, as it were. Some teams may beat their opponents by a lot, in order presumably to protect their leads. Baseball can be a streaky game: few runs are scored, until when several people get on base, and then they can be all knocked in at once. On the other hand, some teams may play more defensively, keeping their opponents to low scores, and then squeaking by. Looking at the distribution of run differentials across teams’ seasons shows this:

We can also look at the summary statistics:

Minimum: -349
Median: 5
Mean: 0
Maximum: 411

The median run differential for the 2452 team-seasons played since 1900 is a remarkable number: 5 runs scored more than allowed across the entire season. On average–even more remarkably– teams scored no more runs than they allowed by the other team in each season. The distribution shows this well: most cases are precisely around 0, with a peak just below and just above 0.

1.2 A remark on regression

However, it is the information other than that able to be gathered by looking at central tendency which we will be concerned with, since each of the teams here may have had different outcomes in terms of their success at winning.

It’s important to review what we are trying to do when we say something like this, before we proceed. What we are doing in linear regression is trying to see if a distribution of a variable can be related linearly to another variable. We are trying to relate one variable to another by making it dependent on that variable in order to get a better answer about any and every variable. This allows us to find out more information using the model: given a certain run differential, we should be able to tell more about the winning percentage, in the sense that it should be able to be an element of a series which would fall on a line of other relationships between run differential and winning percentage. While our model may give us a precise answer, and the actual world may be more messy than that, we have nevertheless provided a link between two variables that is plausible. To use the language of some econometricians (specifically, Philip Hans Frances), this link allows us to shift from unconditional to conditional expectations, or an expectation that depends on a linked variable.

1.3 Calculating basic stats

First, let’s import libraries. We will use Lahman, an R package maintained by Chris Dalzell which allows you to access Sean Lahman’s amazing baseball database easily, with its comprehensive team statistics. We will also use the tidyverse package, for tidy data manipulation. Finally we will want broom, by Alex Hayes, which has tools for taking statistical test output and placing it in a tidy data format.

library(Lahman)
library(tidyverse)
library(broom)

Next, we can subset the data table on teams data which is accessible thanks to the Lahman package in the variable Teams. Marchi and Albert are concerned with run differentials from all teams from season 2000 onward, and I’ll follow them. However, I will not use subset() from base R to subset the data and keep the crucial fields I need. Insteaad, I will use dplyr and filter() applied to Teams to remove all values before 2000 using the yearID variable in the dataset, and select() to retain crucial fields for calculating run differential. This will include the team id, the year, league id, total games played, total wins, total losses, runs scored and runs allowed:

t <- Teams %>%
  filter(yearID > 2000) %>%
  select(teamID, yearID, lgID, G, W, L, R, RA)

Next, we can calculate the run differential by subtracting the runs allowed (RA) from runs (R), and add this as a separate column to our dataframe with mutate(). Then, we similarly calculate the win percentage (which Marchi and Albert make clear is actually better named a win proportion), by dividing the wins (W) from the total amount of games (wins plus losses, or W plus L):

t <- t %>%
  mutate(rundiff = R-RA) %>%
  mutate(wpct = (W/(W+L)))

If we look at the five number summary of run differentials with summary(), we see:

Minimum: -337
First Quartile: -78.25
Median: 4.5
Mean: 0
Third Quartile: 78.5
Maximum: 300

And if we look at the five number summary of winning percentage with summary(), we see:

Minimum: .265
First Quartile: .444
Median: .505
Mean: .500
Third Quartile: .556
Maximum: .716

That should give us some expectation of just how much run differentials can translate into winning percentages in the furst place: clearly no one is winning all the games and even the lowest performing team is winning roughly a quarter of their games (.265, to be exact). However, we can plot the two against each other to test the general assumption, which to Marchi and Albert seems obvious but which I treat with a little more hesitation, that greater run differential may translate into a higher winning percentage:

p <- ggplot(data=t, aes(x=rundiff, y=wpct))+
  geom_point()+
  labs(title="Winning Percentage vs Run Differentials, 2000-Present")+
  xlab("Run Differential")+ylab("Winning Percentage")+
  scale_x_continuous(breaks=seq(from=-400, to=400, by=100))+
  theme_classic()

If we do the same work and look at the data from 1900, this pattern becomes even more clear:

1.4 Building a model

We can understand this association in two ways: we can understand even better the actual historical association, or we can build a model to understand the expected winning percentage given a particular run differential. Marchi and Albert do the latter.

First, we suggest the model:

Winning percentage = a + b * run differential + residuals

And we make the linear model with lm(). We keep the data specified at the end of the function call because it allows us to focus on specifying (and if need arises, modifying) the actual model itself:

fit <- lm(wpct ~ rundiff, data=t)

Instead of viewing the results with summary() in base R, we can use tidy() from the broom package:

tidy(fit)

This returns the summary output of lm() as a tidy data frame. (This is not only useful in the moment, but is particularly useful if we would like to work with multiple models.) We will pay attention to the estimate variable:

# A tibble: 2 x 5
  term        estimate std.error statistic   p.value
  <chr>       <dbl>     <dbl>     <dbl>     <dbl>
1 (Intercept) 0.500     0.00116    433.      0.       
2 rundiff     0.000626  0.0000110  57.1      1.31e-215

Marchi and Albert explain the estimate variable and its two cases: a team with a run differential of zero will win half of its games (.500, the intercept), and a one run increase in a season’s run differential will correspond to a .00626 increase in winning percentage. The model, in other words, can be expressed:

Winning percentage = .500 + .00626 * run differential + residuals

We can use broom and its function augment() to return the team data with the fitted values and residuals added (automating a common workflow in base R):

fitaug <- augment(fit)

This returns a tibble with everything we need to know more about each of the values:

# A tibble: 480 x 9
    wpct rundiff .fitted .se.fit   .resid    .hat .sigma   .cooksd .std.resid
 * <dbl>   <int>   <dbl>   <dbl>    <dbl>   <dbl>  <dbl>     <dbl>      <dbl>
0.463     -39   0.476 0.00123 -0.0126  0.00237 0.0253 0.000296      -0.499
0.568     141   0.588 0.00193 -0.0203  0.00581 0.0253 0.00190       -0.806
0.543      86   0.554 0.00149 -0.0106  0.00347 0.0253 0.000307      -0.420
0.391    -142   0.411 0.00194 -0.0198  0.00586 0.0253 0.00182       -0.786
0.509      27   0.517 0.00119 -0.00757 0.00222 0.0253 0.0000998     -0.300
0.512       3   0.502 0.00116  0.0105  0.00209 0.0253 0.000179       0.414
0.543      76   0.548 0.00142 -0.00435 0.00317 0.0253 0.0000470     -0.172
0.407    -115   0.428 0.00171 -0.0206  0.00456 0.0253 0.00153       -0.816
0.562      76   0.548 0.00142  0.0142  0.00317 0.0253 0.000499       0.561
0.451      17   0.511 0.00117 -0.0600  0.00214 0.0252 0.00603       -2.37
# ... with 470 more rows

It can also be easily joined to the original team data with a simple left_join(). It also allows a much easier way to plot with ggplot. We can use the same plot as our previous one, only with an extra line added which uses the new data frame and the .fitted variable:

p+
  geom_line(data=fitaug, aes(x = rundiff, y = .fitted), size = 1, color="red")
p

All this would have to be achieved with a rather unintuitive set of calls to predict() in base R. Finally, we can look at the R-squared value and confirm the goodness of fit with a glance() in broom. glance(fit) returns:

# A tibble: 1 x 11
  r.squared adj.r.squared  sigma statistic   p.value    df logLik    AIC    BIC deviance
*     <dbl>         <dbl>  <dbl>     <dbl>     <dbl> <int>  <dbl>  <dbl>  <dbl>    <dbl>
1     0.872         0.872 0.0253     3260. 1.31e-215     2  1085. -2163. -2151.    0.306
# ... with 1 more variable: df.residual <int>

The r-squared value is 0.872, and the root mean standard error, which Marchi and Albert calculate, is listed in the sigma variable as 0.0253. Finally, we can further investigate the model by plotting a quick residual plot using the broom augmented model data, plotting the .fitted variable against the .resid variable. In order to be clear about which teams’ seasons are least captured by the model (or rather, where our residuals are the greatest), we can also create labels and pass this to the residual plot with geom_text(). First, we merge the augmented model (fitaug) back with our team data (t), taking time to create a label column with mutate() which combines the team id and their season. Then we simply specify the label in the aesthetic and tack on a geom_text() to our plot, making sure to filter out everything but the least extreme values (here, anything with residuals greater than .07):

fitaug <- left_join(t, fitaug, by = c("rundiff", "wpct")) %>%
  mutate(label = paste(teamID, yearID))

ggplot(data=fitaug, aes(x=.fitted, y=.resid, label=label))+
  geom_point(alpha=.3)+
  geom_hline(yintercept=0, col="red", linetype="dashed")+
  labs(title="Residual Plot")+
  xlab("Fitted Values")+ylab("Residuals")+
  geom_text(data=filter(fitaug, .resid > .07 | .resid < -.07),
            nudge_x=.025)
  theme_classic()

We see that the 2005 Arizona Diamondbacks, the 2006 Cleveland team, the 2008 Los Angeles Angels, and the 2016 Texas Rangers all have high residuals. A quick look at their run differential compared to their winning percentage shows why:

teamID yearID lgID   G   W  L   R  RA rundiff      wpct      .resid
  ARI   2005   NL 162  77 85 696 856    -160 0.4753086    0.07544339
  CLE   2006   AL 162  78 84 870 782      88 0.4814815   -0.07358386
  LAA   2008   AL 162 100 62 765 697      68 0.6172840    0.07473474
  TEX   2016   AL 162  95 67 765 757       8 0.5864198    0.08141895

Arizona had a very large negative run differential but still a rather average winning percentage (near the median of .500 and between the first and third quartile, see our consideration of it above). Cleveland had high differentials but average winning percentage; Texas had a small run differential with a high winning percentage. Finally, the Angels had run differential within the first and third quartiles (and somewhat close to the mean) and a high winning percentage. If the relation between run differentials and winning percentages are to be better understood, a more complex model may need to be built.

1.5 Conclusion

In the end, the tidy data format allows much cleaner code, much more familiar and intuitive operations with data, and also works well with visualization to make it a compelling alternative to many base R workflows. I hope this modification to one basic exercise in Marchi and Albert’s excellent book is helpful, and shows the ways that some approaches there and elsewhere might be updated.

Introduction to R as a GIS

2019-04-03T17:00:00+00:00

An introduction to using R as a GIS for urban spatial analysis

1.1 Overview

As an urban policy analyst and planner, I have to do a lot of analysis of spatial data. Over the last few years R has become a great way to do many of the basic tasks which would normally be done in QGIS or ArcGIS, and then link them to its powerful statistical tools. But as I’ve learned R as a GIS, I’ve noticed there are few introductions for the types of user cases which are common in my field. This brief walkthrough and the accompanying code should guide readers through some of this territory. My primary aim is clarity in these introductions, and at the same time demonstrating the workflows most optimal for data analysis.

In this introduction I’ll cover mostly data wrangling and initial exploration of some spatial datasets, and how to bring them into relationship with well-established datasets which may already be around, like US Census data. This is a common task for which we use GIS: we have a set of events which are geolocated, and we want to look at these events in relationship to their context. But not only do certain packages in R–particularly simple features spatial data–make all that work faster and easier, they also bring handling spatial data work within the flows of familiar data analysis tasks.

I will cover, then, several aspects of relating spatial data to their context:

Importing spatial data
Performing a spatial join
Visualizing spatial data
Other operations (a dissolve) and joining to US Census tracts
Looking for trends with US Census data

Again, these are basic operations; in future posts I’ll explore spatial analysis proper, and deriving actual conclusions from the data. Be sure to go to the project repository to find the code.

Each of these files is available in the “data” folder above. Our workflow will involve:

Importing the neighborhood boundaries, and importing and wrangling the collisions data to extract collisions in 2018
Joining the collisions to the neighborhood boundaries
Creating a summary (specifically, a count) of the number of collisions in each neighborhood
Visualizing the result

1.2 Importing and Wrangling Data

The data I will be using is available from the City of Seattle, which has made great strides in Open Data practices. To begin, I will use:

Vehicle collisions data, which is available as .csv file.
City Clerk data on the neighborhoods of Seattle, specifically the shapefile of the neighborhood boundaries with their identities.

We will import all the libraries we need:

library(sf)
library(tidyverse)
library(ggplot2)
library(scales)

The first, sf, is for dealing with data. The tidyverse will give us our data wrangling tools, and ggplot2, with the addition of the scales package, will be our framework for graphics.

Starting with the City Clerk shapefile of the neighborhood boundaries, I’ll import the file with read_sf():

neighborhoods <- read_sf("project/data/City_Clerk_Neighborhoods.shp")

We can immediately map the file with ggplot to see what looks like:

ggplot() +
  geom_sf(data = neighborhoods)

And we can see with head() what the data looks like:

OBJECTID PERIMETER S_HOOD L_HOOD L_HOODID SYMBOL SYMBOL2   AREA   ...
     <dbl>     <dbl> <chr>  <chr>     <dbl>  <dbl>   <dbl>  <dbl> ...
     1.      618. OOO    NA           0.     0.      0.  3588. ...
     2.      734. OOO    NA           0.     0.      0. 22295. ...
     3.     4088. OOO    NA           0.     0.      0. 56695. ...
     4.     1809. OOO    NA           0.     0.      0. 64157. ...
     5.      250. OOO    NA           0.     0.      0.  2993. ...
     6.      409. OOO    NA           0.     0.      0. 11371. ...

There are other variables as well. If you notice the last, geometry, you can see that the data also includes a variable for the geometries of the polygons in the shapefiles.

For my purposes, I want to simply look further at the S_HOOD variable, which has the City Clerk’s names for each of the major neighborhoods. I will want to join, later, on this variable. One matter of data cleaning: we can drop the areas where S_HOOD has a value of NA. I have checked, and these mostly small areas such as Kellogg Island, which should be connected to neighborhoods rather than considered seperate. A quick use of dplyr’s drop_na() can do that, which we can pipe, rather than repeatedly assign the variable. Though in this case the code is just about as long either way, it is good habit and leads to more readable code:

neighborhods <- neighborhoods %>% drop_na(S_HOOD)

One last matter before moving on to the collisions data: it is important to determine (and, for later, retrieve) the coordinate reference system from the shapefile-become-sf:

s_crs <- st_crs(neighborhoods)

Now I’ll import the collisions data. I also do some data wrangling because of the way the file is organized. I’ll begin with the import:

collisions <- read.csv("project/data/collisions.csv", stringsAsFactors = FALSE)

This data is tidy: the columns specify variable names, the rows specify cases. There’s only a little bit of wrangling to do. First, I’ll change the first variable name in the csv, which is hard to manipulate, and drop the rows which have locations but no other information (a few of which are in the dataset), using drop_na(), this time with all fields specified. The reasoning is that if there is a missing value, especially in the first two fields, we can’t plot it. Of course, these data wrangling tasks should be done with care and attention to how they affect the entire analysis:

# Change a misnamed column name in the csv
names <- colnames(collisions)
names[1] <-"X"
colnames(collisions) <- names
collisions <- collisions %>% drop_na()

Then I will filter to retrieve the collisions from 2018. This involves some data manipulation with dplyr. First, I select only the x and y columns and the date variable. I choose to rename the variables as I go (by specifying first the desired field name just for ease of reference. Then I create a field with mutate() in which I extract the first four characters of the date variable. For this I use the substring function substr, in which I specify the point to start and stop extracting characters in the date string. Since I am only going to be using dates from 2018, I actually replace the old date field by assigning it the same variable name (“date”). From there, I filter for the values of 2018 and then drop the date field, leaving me with just the coordinates:

collisions <- collisions %>%
  select(x = X, y = Y, date = INCDATE) %>%
  mutate(year = substr(date, start = 1, stop = 4)) %>%
  filter(year == "2018") %>%
  select(-date)

Next, I take the resulting collisions data frame and turn it into a sf object. The City of Seattle confirms that the coordinate reference system is the same as the shapefile of the neighborhoods (WGS-84 or EPSG:4326), and so I set it to the crs variable I extracted from that shapefile:

collisions_sf <- st_as_sf(collisions,
                   coords = c('x', 'y'),
                   crs = s_crs,
                   remove = F)

We can plot the result. This takes a little while with ggplot given the size of the data frame. I have changed the alpha transparency of the points to make their frequency easier to display:

ggplot() +
  geom_sf(data=collisions_sf, alpha=.3)

1.3 Spatially Joining the Data

Now we can join the data. I use st_join to specify a spatial join, and also specify that we want the collisions_sf shape joined to the neighborhoods shape. I will also make clear that I want all the collisions completely within the neighborhoods to be joined (this can be modified to include any of the usual qualities, including intersecting, touching, etc.):

collisions_join <- st_join(collisions_sf, neighborhoods, join = st_within)

We can see the results of the join, where the x and y variables of the collisions dataframe now have variables of the neighborhoods data joined to them:

x        y OBJECTID PERIMETER            S_HOOD           ...
-122.3329 47.70956      112  29413.55       Haller Lake ...
-122.2794 47.51707       81  38996.71 South Beacon Hill ...
-122.2907 47.69020       96  23840.22          Wedgwood ...
-122.3498 47.64651       46  26753.56  North Queen Anne ...
-122.3300 47.61226       63  13225.23        First Hill ...
-122.3050 47.60217       55  18241.55             Minor ...

1.4 Summarizing the Data

Now we want to count how many crashes are within the each neighborhood, and visualize the result. Since the sf objects are data frames, this operation can be done simply by summarizing the data as one would normally do with the group_by() and summarize() workflow of dplyr, here group_by(), which will be set to the S_HOOD variable and count() (a quick call to just count() would have been sufficient, but I wanted to specify the variable name for future reference.)

The only additional step I have to consider in this operation is that we have to set aside the geometry data which attaches to each case of the spatial data, in order to perform the count. Freed of the geometries, we can then re-attach them by joining them the back to the original neighborhoods dataset. To do this, I then make a quick call to as.data.frame() before peforming the grouping and summarizing:

collisions_count <- collisions_join %>%
  as.data.frame() %>%  #Use this to remove the sticky geometry
  group_by(S_HOOD) %>%
  summarize(collisions_n = n())

This gives us the number of collisions per neighborhood:

S_HOOD          collisions_n
 <chr>         <int>
Adams           164
Alki             69
Arbor Heights    23
Atlantic        244
Belltown        437
Bitter Lake     146

(As an aside, this detaching and re-attaching geometries here has to be done because sf can’t yet join spatial objects to spatial objects directly. But as you can see, it makes intuitive sense from within the workflow to see geometry data as “sticky”: the workflow is from extracting and manipulating the spatial data variables like any other tidy data, then joining the variables back to the data they came from, when they want to be used in context with all the other variables. The sf workflow just allows the geometries to unstick and stick back on when we want them.)

1.5 Visualizing

As mentioned above, we have to join the count data back to the original neighborhood data in order to see it in context with the rest of the variables. Just as in any GIS interface, there’s no need for any spatial joins here at all, but just a joining of the data: sf lets me just do this with a simple left_join() on the S_HOOD variable.

neighborhood_collisions <- left_join(neighborhoods,
                                     collisions_count,
                                     by="S_HOOD")

Now let’s plot the finished product with ggplot(). With a simple call to geom_sf(), mentioned earlier, we can specify that the fill should be the new variable we created which counts the number of collisions. I here also specify a scale fill, with some colors, and use the comma argument from the scales package to make sure that the data doesn’t display in scientific format in the legend. After setting the theme with theme_bw(), the final thing is to remove the axis text and ticks:

ggplot() +
  geom_sf(data = neighborhood_collisions, aes(fill = collisions_n)) +
  labs(fill = "Collisions in 2018") +
  scale_fill_continuous(low = "grey90",
                        high = "darkblue",
                        labels=comma)+
  theme_bw() +
  theme(axis.text.x = element_blank()) +
  theme(axis.text.y = element_blank()) +
  theme(axis.ticks.x = element_blank()) +
  theme(axis.ticks.y = element_blank())

However, a common operation would be not plotting just the count but the count per area. The City Clerk neighborhood’s file has this already calculated in square feet in the AREA field and so a simple change of the fill argument to collisions_n/AREA will plot the collisions per sq.ft., a more accurate and informative chloropleth map. If we wanted to make the calculation ourself, however, st_area() in sf allows us to do this (with the help of the lgeom package, which must be loaded), and the units package to convert the result, which is in square meters, to square feet. We then simply add this variable (after converting it to a decimal result) to our data (we can just use $) and use mutate to add another variable which would be the collisions per square foot. This then can be used for plotting the more accurate map:

library(lwgeom)
library(units)
neighborhood_areas <- st_area(neighborhood_collisions)
units(neighborhood_areas) <- with(ud_units, ft^2)

neighborhood_collisions$areas <- as.numeric(neighborhood_areas) #add row, convert to numeric

neighborhood_collisions <- neighborhood_collisions %>%
  mutate(collisions_sqft = collisions_n/areas)

ggplot()+
    geom_sf(data = neighborhood_collisions, aes(fill = collisions_sqft)) +
    labs(fill = "Seattle Collision Density, 2018 (Collisions per sqft.)") +
    scale_fill_continuous(low = "grey90",
                          high = "darkblue",
                          labels=comma)+
    theme_bw() +
    theme(axis.text.x = element_blank()) +
    theme(axis.text.y = element_blank()) +
    theme(axis.ticks.x = element_blank()) +
    theme(axis.ticks.y = element_blank())

1.6 Other Operations: Joins and the Census

We can also perform another common operation, which is to join this data to census tracts. In what follows, we will use the tigris package.

Our tasks will be, then:

Dissolving the neighborhood boundaries object to obtain a city boundary
Downloading Census data and selecting those within the city boundaries
Joining the collision data to the tracts

We can dissolve the city neighborhoods by simply creating a group within the sf which includes all of the neighborhoods, and summarizing over them:

s_city <- neighborhoods %>%
  mutate(group = 1) %>%
  group_by(group) %>%
  summarize()

The result is what you would expect, a dissolved polygon:

Now, let’s get the census tracts. We can set the environment so that we download simple features data with the tigris package, and put it in the cache as we work (rather than download it to a directory). Then with a call to tracts() we download the county data (specifying cb as true for a simpler geometry, rather than a 500k resolution geometry):

library(tigris)

options(tigris_class = "sf")
options(tigris_use_cache = TRUE)
s_tracts <- tracts(state="WA", county="King", cb=TRUE)

You can check if the census tracts are all sf objects by way of st_is_valid, which will check each polygon and return TRUE or FALSE, depending. Wrapping that in base R’s all(), which checks if all values are true–as in all(st_is_valid(s_tracts))–returns TRUE.

Next, we transform the tracts sf from its coordinate reference system into the reference system used by the Seattle census tracts:

s_tracts <- st_transform(s_tracts, crs=s_crs)

And then extract the census tracts which overlap with the city boundaries by simply subsetting one to the other. We can do this two ways. First, we proceed with a filter() which would keep everything in the s_tracts spatial data frame which did not intersect with the city boundaries. This is possible with st_intersect, which checks for exactly this, then returning the length() of the result, which is 0 or 1. We want to keep everything greater than 0, so we then pass this to filter():

s_city_tracts <- s_tracts %>% filter(lengths(st_intersects(s_tracts, s_city)) > 0)

Alternatively, and as some have pointed out, we can just perform this clip just as if we were subsetting one dataframe by another, as we would normally do:

s_city_tracts <- s_tracts[s_city,]

Plotting this shows us the census tracts overlaying our dissolved city limits polygon.

Now there is a problem which will produce a need for some further data wrangling. Some of the tracts share edges with the border of the city, and so were not left out when we subsetted the data. We can clean this if we want in two ways: either we can go back and, instead of subsetting the tracts by what is intersecting with the city border, we can do a join specifying that only areas within the city borders will be kept. This may, however, not work exactly if you have overlapping files (in our case, we do, because we are using different jurisdictional borders–this case appears often when using city boundaries). Alternatively, because the data is simple features data and works like a table, we can simply look in the tracts and drop the cases which include the tracts outside the border. The tradeoff with the latter approach is that it is only really useful where small amounts of tracts need to be excluded: it isn’t practical for large datasets. For simplicity’s sake, however, I’ll do the latter, though it is time-intensive, because it shows another great feature of dealing with simple feature data, which is that ggplot can immediately label things for you with a call to geom_sf_text() or geom_sf_label(). We’ll use the latter, with the TRACTCE variable, which will show tract numbers. We can then subset based on that:

ggplot(data=s_city_tracts) +
  geom_sf(data=s_city, fill="grey",color=NA)+
  geom_sf(fill=NA, color="black")+
  geom_sf_text(aes(label=TRACTCE))


s_city_tracts <- s_city_tracts %>%
  filter(! TRACTCE %in% c("020900", "021000",
                          "021100", "021300", "026400", "026100",
                          "026700","026600", "026300", "026001"))

Once we do this, we have the set of data we want to work with.

We can then go about all of the spatial joining to the census tracts just as we did above, and the calculations for density by the tract’s area in square feet. This gives us another detailed map when we plot it:

Now we can fetch data from the Census and look for spatial correlations (which, as in this example, might be tenuous), or simply research using the Census data with the addition of the data we joined.

1.7 Looking for Trends with Census Data

We can download Census data with the acs package, and Kyle Walker’s handy tidycensus, which uses acs. The acs package, an immensely useful tool developed for planners, economists, and anyone who uses census data, was put together by the data division of Puget Sound Regional Council. It requires an API key, which can be obtained easily from the U.S. Census Bureau. See help(package="acs") for instructions on how to set this up easily, with a quick use of api.key.install(key="YOUR API KEY"). Walker’s tidycensus uses acs but makes the process even easier by returning fetched data directly as tidy data frames. Even more useful, it uses tigris to join the data to the TIGER simple feature geometries, if you like. We don’t need to do that because of our previous step, but we will use it to fetch the census data. So, our workflow will look like this:

Fetch Census data
Wrangle for the kinds of data we want
Join to our data based on collisions
Look for trends

First, let’s fetch the data. We will be looking at a table familiar to many planners, American Community Survey data table B08141 on the means of transportation to work. It shows a breakdown of the various means of transportation for every census tract, and can be used to inform and justify policy decisions which would improve transportation planning in certain areas. What I want to see for the purposes of this introduction is something at once very basic and also complex: the varying degrees of car ownership of the households and the number of collisions per square foot. Notice that the table itself has many more demographic characteristics, including means of transportation to work: perhaps one of these may be better to use if we are looking at the relationship of collisions to the demography of neighborhoods. More on this later. For now, let’s do the work of fetching the data.

The data can be retrieved with the tidycensus function get_acs(). This assumes a little familiarity with acs, and I recommend looking at the latter more in depth to get a handle on exactly how to fetch data. But if you know how to work with Census data normally, the method is intuitive: you spend a lot of time making a geography for the data you want to retrieve, and then you specify the tables from which you want to fetch data, then usually wrangle it into a tidy data frame. tidycensus makes this even easier than acs, and does this all in one go: in the arguments, you specify the geography of the data you want, the table, and where you want it from. What’s more, it uses tigris just as we did above to append simple feature data to the geography, if you specify TRUE in the geometry argument. I am specifying FALSE because we already have that data:

library(tidycensus)
# Make sure API key set up

s_acs <- get_acs(geography = "tract", table = "B08141",
                state ="WA", county="King County", geometry = FALSE)

We now have the data. Each tract is specified with a GEOID and a NAME, and variables are listed as variable. In American Community Survey data, like we are using here, the margin of error is specified in the moe field:

GEOID       NAME                                    variable   estimate   moe
  <chr>       <chr>                                   <chr>         <dbl> <dbl>
53033000100 Census Tract 1, King County, Washington B08141_001    4060.  401.
53033000100 Census Tract 1, King County, Washington B08141_002     101.   61.
53033000100 Census Tract 1, King County, Washington B08141_003    1871.  356.
53033000100 Census Tract 1, King County, Washington B08141_004     972.  293.
53033000100 Census Tract 1, King County, Washington B08141_005    1116.  371.
53033000100 Census Tract 1, King County, Washington B08141_006    2224.  370.

While acs’s acs.fetch() returns a description of the variable, tidycensus assumes you are pretty certain of the variables you want. If you are uncertain of what you are looking for but know the table you want to look up, you can use acs or simply look at the table on the Census website to confirm: for example, B08141_001 is the variable showing the total number of households in the tract, while B08141_002 shows the total number with no vehicles available to travel to work, and B08141_005 shows the total number with three cars or more available.

What I will do next is use dplyr to filter for these three variables with the %in% operator, which looks within a string of characters where we place the three fields we want. Next, I drop the moe field, then spread() the data across three fields. This will turn the data frame from one which has variables arranged by census tract to one where each census tract has the three variable fields we want to consider (it can be undone with gather()). We can then use the latter to make calculations with mutate(), and specify the number of households with cars available, the number with no cars, and the number with three or more cars available, which we will later reduce to a density (per square foot of the census tract, a figure we already have):

s_popcars <- s_acs %>%
  filter(variable %in% c("B08141_001", "B08141_002", "B08141_005")) %>%
  select(-moe) %>%
  spread(key=variable, value=estimate) %>%
  rename(total=B08141_001, nocars=B08141_002, threecars="B08141_005") %>%
  mutate(pctcars = (total-nocars)/total) %>%
  mutate(pctnocars = nocars/total) %>%
  mutate(pctthreecars = threecars/total)

Now that we have that complete, we can make the join to the previous data which had census tracts, and keep all we need with select():

s_cars <- left_join(tract_collisions, s_popcars, by="GEOID") %>%
  select(GEOID, collisions_sqft, pctcars, pctnocars, geometry)

Now, if we plot this data, we can see certain trends. First let’s start with the amount of households which do not have cars available per area vs. the number of collisions per square foot in each tract. A simple point plot will be able to display the trend, and stat_smooth() can also display what an ordinary least squares regression model (lm) looks like on the data:

ggplot(s_cars, aes(x=cars/areas, y=collisions_sqft)) +
  geom_point()+
  stat_smooth(method="lm", color="Orange", se=FALSE) +
  labs(title="Collision Density by Density of Households with Cars Available
       in Seattle Census Tracts")+
  xlab("Households with 1, 2, or 3+ Cars / Sq.Ft.")+
  ylab("Collisions / Sq.Ft.")+
  scale_x_continuous(labels=comma)+
  scale_y_continuous(labels=comma)+
  theme_classic()

Now let’s look at collision density vs. the density of households without cars:

ggplot(s_cars, aes(x=nocars/areas, y=collisions_sqft)) +
  geom_point()+
  stat_smooth(method="lm", color="Orange", se=FALSE) +
  labs(title="Collision Density by Density of Households with 0 Cars Available
       in Seattle Census Tracts")+
  xlab("Households with 0 Cars / Sq.Ft.")+
  ylab("Collisions / Sq.Ft.")+
  scale_x_continuous(labels=comma)+
  scale_y_continuous(labels=comma)+
  theme_classic()

We could continue to explore the data in this way, and build better models using other Census data. This, and the changing of the geographies, is essentially how transportation planners have constructed Traffic Analysis Zones, and how they document trends concerning them. However, by way of bringing this to a close, we should note how little we have modeled to produce these results compared to more sophisticated transportation planning analyses, and, especially, how little there is a basis for concluding anything about the relationship between collisions and the mode of travel in each census tract.

Let’s be clear: our initial exploration of trends appears to show that we have positive relationship between collisions per square foot and the density of households with cars. But this relationship is still very unclear. We see this from comaring first plot to our second, which, while showing a steeper positive relationship also, has less frequency of collisions as the density of households with no cars available increases. One might conclude there is not really a realtionship between the number of collisions and the number of households with cars available from this data. In fact, if we plot our third census variable, we could come up with the opposite idea: that neighborhoods with more cars available have lower collision density!

This is not to introduce skepticism concerning our work, just to make clear that they are not results, but moments in the data exploration phase, useful to building a more robust model. This is immediately explained by mapping households with no cars available and with three or more cars, which is something now completely familiar to us:

ggplot()+
  geom_sf(data = s_cars, aes(fill = nocars/areas)) +
  labs(fill = "Seattle Density of Households with 0 Cars Available, 2018
       (by Census Tract)") +
  scale_fill_continuous(low = "grey90",
                        high = "darkblue",
                        labels=comma)+
  theme_bw() +
  theme(axis.text.x = element_blank()) +
  theme(axis.text.y = element_blank()) +
  theme(axis.ticks.x = element_blank()) +
  theme(axis.ticks.y = element_blank())

ggplot()+
    geom_sf(data = s_cars, aes(fill = threecars)) +
    labs(fill = "Seattle Density of Households with 3+ Cars Available, 2018
         (by Census Tract)") +
    scale_fill_continuous(low = "grey90",
                          high = "darkblue",
                          labels=comma)+
    theme_bw() +
    theme(axis.text.x = element_blank()) +
    theme(axis.text.y = element_blank()) +
    theme(axis.ticks.x = element_blank()) +
    theme(axis.ticks.y = element_blank())

Clearly, the relationship of collisions and households has to do with larger patterns in the urban form, including the walkability and bikability of the city, and also the sheer amount of activity within the core versus many of the outer neighborhoods (which might account for collisions showing a negative relationship with households with 3+ cars available). While we have accounted for some of this by using densities rather than counts, we have not at all considered the density of traffic or interactions between travelers, and the differences in available modes as a function of the differences in the infrastructure.

So, most fundamentally, we must go back to the basic assumption behind much of the exploration we have already done, and consider whether looking at demographic data based on one or two variables about census tracts can tell us anything remotely about collisions at all in the areas of the city where they appear (we also must consider collisions which emerge not because of local interactions, but interactions between local and non-local travelers). Again, this is not to undermine our faith in the data, just to underline the need for much more data exploration in order to build a model. What is encouraging is that model building, as every one dealing with these types of data know, is an iterative process: I hope that the tools above help make the phases of data exploration much easier to accomplish.

1.8 Conclusions

As you can see, you can do these common GIS operations rather easily. The only additional thing we may want to do, for now, is write our manipulated data to a shapefile with a simple call to write_sf():

write_sf(neighborhood_collisions,
  "project/data/neighborhood_collisions.shp",
  delete_layer = TRUE)

And our joined US Census Data:

write_sf(s_cars, "project/data/s_cars.shp",
  delete_layer = TRUE)

There even more things we can do with everything we’ve done now within R. I will have further introductions to spatial analysis with these workflows in the future.

In the meantime, for more information on visualizing the data in a more sophisticated manner than I have attempted here, you may want to check out r-spatial’s great series of posts on making maps in R, which also involve including many of the traditional cartographic elements useful for presentation-quality material. You might also want to see the sf package’s documentation for more information, in particular their cheatsheet, which works as a handy graphical summary of the package’s approach to many common spatial data manipulation operations. You may also want to check out the tigris, acs and tidycensus packages for info on easily retrieving and manipulating census data with R.

Introduction

2019-04-03T14:45:32+00:00

I hope this site will clearly introduce readers to some of the work I’ve been doing in urban data analysis. Check out my first post, a tutorial on using R as a GIS for urban data analysis work. Also make sure to check out my github repos for code.