<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="3.8.5">Jekyll</generator><link href="-%3E/blog/feed.xml" rel="self" type="application/atom+xml" /><link href="-%3E/blog/" rel="alternate" type="text/html" /><updated>2019-05-06T05:04:47+00:00</updated><id>-%3E/blog/feed.xml</id><title type="html">Urban Data Analysis</title><subtitle>Urban data analysis - issues and examples from my work</subtitle><entry><title type="html">Domain Specific Languages</title><link href="-%3E/blog/project/2019/05/06/DSLs.html" rel="alternate" type="text/html" title="Domain Specific Languages" /><published>2019-05-06T05:00:00+00:00</published><updated>2019-05-06T05:00:00+00:00</updated><id>-%3E/blog/project/2019/05/06/DSLs</id><content type="html" xml:base="-%3E/blog/project/2019/05/06/DSLs.html">&lt;p&gt;&lt;em&gt;A reflection on domain specific languages&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;https://raw.githubusercontent.com/michaeljoseph04/blog/gh-pages/images/retailers.png&quot; alt=&quot;retailers&quot; /&gt;&lt;/p&gt;

&lt;p&gt;I can’t recommend Mark van der Loo and Edwin de Jonge’s &lt;em&gt;Statistical Data Cleaning with Applications in R&lt;/em&gt; enough. It is extremely thorough, extremely useful, and serves as a great reference as well as an introduction.&lt;/p&gt;

&lt;p&gt;However, in the chapter on their data validation package, &lt;code class=&quot;highlighter-rouge&quot;&gt;validate&lt;/code&gt; (and it can’t be said enough that the package is brilliant and has changed how I work), they take up the question of why one would develop specific syntaxes embedded in R at all. I’ve been searching for a way to express this myself, since so much of what makes the R universe unique is things like base R’s interesting statistical function syntax, e.g. &lt;code class=&quot;highlighter-rouge&quot;&gt;lm(var1 ~ var2, var3)&lt;/code&gt;. And I can’t think of a more concise statement of the pros and cons of embedding a domain specific language within another than theirs:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;There are many advantages to implementing a data validation syntax embedded into R. First, R’s facilities to compute on the language including access to the abstract syntax tree of statements and nonstandard evaluation make experimenting with such an implementation a breeze. In particular, it makes it easy to experiment with different ideas and test them out in practice, something which is much harder while developing a standalone DSL. Second, using R as a host language means access to the truly enormous data processing and statistical capabilities that come with R and its packages at no cost whatsoever. Third, many users interested in data validation are already familiar with R and will be able to use the DSL with relative ease.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;blockquote&gt;
  &lt;p&gt;The downside of embedding a DSL into another language includes leakage: a user may (unwittingly) “escape” the DSL and use more advanced features of the host language. A second downside is that the syntax of the host language may be too limited to accurately capture the concepts for which the DSL was designed. It is of interest to note that R is more flexible than many other languages, since it allows for the definition of user-defined infix operators. The most famous example is probably the pipe operator %&amp;gt;% of the magrittr package. (From Statistical Data Cleaning with Applications in R, p. 149)&lt;/p&gt;
&lt;/blockquote&gt;</content><author><name></name></author><summary type="html">A reflection on domain specific languages</summary></entry><entry><title type="html">Runs and wins: regression with tidy data</title><link href="-%3E/blog/project/2019/04/12/runs-and-wins.html" rel="alternate" type="text/html" title="Runs and wins: regression with tidy data" /><published>2019-04-12T23:00:00+00:00</published><updated>2019-04-12T23:00:00+00:00</updated><id>-%3E/blog/project/2019/04/12/runs-and-wins</id><content type="html" xml:base="-%3E/blog/project/2019/04/12/runs-and-wins.html">&lt;p&gt;&lt;em&gt;A quick update to a chapter on analyzing baseball data&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;A great book for learning R is &lt;em&gt;Analyzing Baseball Data with R&lt;/em&gt; by Max Marchi and Jim Albert. Besides being a great book about how to automate a lot of tasks in baseball analytics, the book provides a context for many of the data analysis workflows that can be automated in R: if you know a little about baseball you can get a really great feel for many data analysis tasks, and vice versa.&lt;/p&gt;

&lt;p&gt;Unfortunately, because the book was published in 2013, there’s little support in the text for tidy data workflows and a lot of use of base R: lots of &lt;code class=&quot;highlighter-rouge&quot;&gt;subset()&lt;/code&gt;, lots of &lt;code class=&quot;highlighter-rouge&quot;&gt;with()&lt;/code&gt;, and lots of plotting with &lt;code class=&quot;highlighter-rouge&quot;&gt;plot()&lt;/code&gt; rather than &lt;code class=&quot;highlighter-rouge&quot;&gt;ggplot()&lt;/code&gt;. This is good, on the one hand, because base R is already intuitive and useful, and the tidy data framework (particularly ggplot) is not always a foolproof solution. However, since the tidy data framework is generally even more intuitive and also very extensible for many data analysis tasks, I’ve decided to give an indication of how some of the analyses might change.&lt;/p&gt;

&lt;h2 id=&quot;11-the-relation-between-runs-and-wins&quot;&gt;1.1 The relation between runs and wins&lt;/h2&gt;

&lt;p&gt;Chapter 4 of Marchi and Albert’s book covers the relation between runs and wins. Specifically, it follows the very interesting question of whether run differential matters in baseball. Run differential is the difference between the runs a team scores and the runs it allows, or &lt;em&gt;how much a team beat their opponents by&lt;/em&gt;, here totaled over a season. All a team has to do to beat an opponent, of course, is score one more run than them. But not all opponents are beaten equally, as it were. Some teams may beat their opponents by a lot, presumably in order to protect their leads. Baseball can be a streaky game: few runs are scored until several people get on base, and then they can all be knocked in at once. On the other hand, some teams may play more defensively, keeping their opponents to low scores and then squeaking by. Looking at the distribution of run differentials across teams’ seasons shows this:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;https://raw.githubusercontent.com/michaeljoseph04/blog/gh-pages/images/rundifhist.jpeg&quot; alt=&quot;rundifhist&quot; /&gt;&lt;/p&gt;

&lt;p&gt;We can also look at the summary statistics:&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;Minimum: -349&lt;/li&gt;
  &lt;li&gt;Median: 5&lt;/li&gt;
  &lt;li&gt;Mean: 0&lt;/li&gt;
  &lt;li&gt;Maximum: 411&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The median run differential for the 2452 team-seasons played since 1900 is a remarkable number: 5 more runs scored than allowed across the entire season. On average, even more remarkably, teams scored no more runs in a season than they allowed. The distribution shows this well: most cases are clustered around 0, with a peak just below and just above 0.&lt;/p&gt;
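
&lt;p&gt;One reason the mean sits at zero is worth spelling out: across a closed set of teams, every run one team scores is a run another team allows, so the run differentials have to cancel out. A minimal sketch with simulated data (the four teams, the schedule, and the scoring rate are all made up for illustration and are not taken from the Lahman data):&lt;/p&gt;
&lt;div class=&quot;highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;set.seed(1)
teams &amp;lt;- c(&quot;A&quot;, &quot;B&quot;, &quot;C&quot;, &quot;D&quot;)   # hypothetical teams
n_games &amp;lt;- 200                   # hypothetical schedule length

# Each game pairs a random home team with a different random away team
home &amp;lt;- sample(teams, n_games, replace = TRUE)
away &amp;lt;- vapply(home, function(h) sample(setdiff(teams, h), 1),
               character(1), USE.NAMES = FALSE)
home_runs &amp;lt;- rpois(n_games, 4.5)
away_runs &amp;lt;- rpois(n_games, 4.5)

# Season totals per team: runs scored and runs allowed
team &amp;lt;- c(home, away)
runs_scored &amp;lt;- tapply(c(home_runs, away_runs), team, sum)
runs_allowed &amp;lt;- tapply(c(away_runs, home_runs), team, sum)
rundiff &amp;lt;- runs_scored - runs_allowed

rundiff        # individual teams differ
sum(rundiff)   # but the differentials cancel, so their mean is zero
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;The interesting part of the distribution is therefore its spread and shape rather than its center.&lt;/p&gt;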

&lt;h2 id=&quot;12-a-remark-on-regression&quot;&gt;1.2 A remark on regression&lt;/h2&gt;
&lt;p&gt;However, we will be concerned with information beyond what central tendency can tell us, since each of the teams here may have had different outcomes in terms of their success at winning.&lt;/p&gt;

&lt;p&gt;It’s important to review what we are trying to do when we say something like this, before we proceed. What we are doing in linear regression is trying to see if the distribution of one variable can be related &lt;em&gt;linearly&lt;/em&gt; to another variable. We make one variable dependent on the other in order to get a better answer about the dependent variable than its overall average alone would give. This allows us to find out more information using the model: given a certain run differential, we should be able to say more about the winning percentage, in the sense that it should fall near a line summarizing the relationship between run differential and winning percentage. The model gives us a precise answer and the actual world is messier than that, but we have nevertheless provided a plausible link between two variables. To use the language of some econometricians (specifically, Philip Hans Franses), this link allows us to shift from unconditional to conditional expectations, that is, to an expectation that depends on a linked variable.&lt;/p&gt;
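
&lt;p&gt;The shift from unconditional to conditional expectations can be made concrete with a small simulation. This is only a sketch: the data are simulated (the slope, noise level, and sample size are arbitrary choices, not estimates from real seasons), and names such as &lt;code class=&quot;highlighter-rouge&quot;&gt;sim&lt;/code&gt; and &lt;code class=&quot;highlighter-rouge&quot;&gt;fit_sim&lt;/code&gt; are made up for illustration:&lt;/p&gt;
&lt;div class=&quot;highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;set.seed(42)
# Simulated seasons: a run differential and a winning percentage tied linearly to it
rundiff &amp;lt;- round(rnorm(300, mean = 0, sd = 100))
wpct &amp;lt;- 0.5 + 0.0006 * rundiff + rnorm(300, sd = 0.025)
sim &amp;lt;- data.frame(rundiff, wpct)

# Unconditional expectation: one number for every team, whatever its run differential
unconditional &amp;lt;- mean(sim$wpct)

# Conditional expectation: the expected winning percentage given a run differential
fit_sim &amp;lt;- lm(wpct ~ rundiff, data = sim)
conditional &amp;lt;- predict(fit_sim, newdata = data.frame(rundiff = c(-100, 0, 100)))

unconditional
conditional
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;The unconditional mean gives the same guess for every team, while the conditional expectation moves with the run differential we condition on.&lt;/p&gt;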

&lt;h2 id=&quot;13-calculating-basic-stats&quot;&gt;1.3 Calculating basic stats&lt;/h2&gt;
&lt;p&gt;First, let’s import libraries. We will use &lt;code class=&quot;highlighter-rouge&quot;&gt;Lahman&lt;/code&gt;, an &lt;a href=&quot;https://cran.r-project.org/web/packages/Lahman/index.html&quot;&gt;R package maintained by Chris Dalzell&lt;/a&gt; which allows you to access &lt;a href=&quot;http://www.seanlahman.com/&quot;&gt;Sean Lahman’s amazing baseball database&lt;/a&gt; easily, with its comprehensive team statistics. We will also use the &lt;code class=&quot;highlighter-rouge&quot;&gt;tidyverse&lt;/code&gt; package, for tidy data manipulation. Finally we will want &lt;code class=&quot;highlighter-rouge&quot;&gt;broom&lt;/code&gt;, &lt;a href=&quot;https://cran.r-project.org/web/packages/broom/index.html&quot;&gt;by Alex Hayes&lt;/a&gt;, which has tools for &lt;a href=&quot;https://cran.r-project.org/web/packages/broom/vignettes/broom.html&quot;&gt;taking statistical test output and placing it&lt;/a&gt; in a tidy data format.&lt;/p&gt;
&lt;div class=&quot;highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;library(Lahman)
library(tidyverse)
library(broom)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;Next, we can subset the teams data, which the &lt;code class=&quot;highlighter-rouge&quot;&gt;Lahman&lt;/code&gt; package makes available in the variable &lt;code class=&quot;highlighter-rouge&quot;&gt;Teams&lt;/code&gt;. Marchi and Albert are concerned with run differentials for all teams in the seasons after 2000 (that is, from 2001 onward), and I’ll follow them. However, I will not use &lt;code class=&quot;highlighter-rouge&quot;&gt;subset()&lt;/code&gt; from base R to subset the data and keep the crucial fields I need. Instead, I will use &lt;code class=&quot;highlighter-rouge&quot;&gt;dplyr&lt;/code&gt;: &lt;code class=&quot;highlighter-rouge&quot;&gt;filter()&lt;/code&gt; applied to &lt;code class=&quot;highlighter-rouge&quot;&gt;Teams&lt;/code&gt; keeps only the seasons after 2000 using the &lt;code class=&quot;highlighter-rouge&quot;&gt;yearID&lt;/code&gt; variable in the dataset, and &lt;code class=&quot;highlighter-rouge&quot;&gt;select()&lt;/code&gt; retains the crucial fields for calculating run differential. These are the team id, the year, league id, total games played, total wins, total losses, runs scored and runs allowed:&lt;/p&gt;
&lt;div class=&quot;highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;t &amp;lt;- Teams %&amp;gt;%
  filter(yearID &amp;gt; 2000) %&amp;gt;%
  select(teamID, yearID, lgID, G, W, L, R, RA)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;Next, we can calculate the run differential by subtracting the runs allowed (&lt;em&gt;RA&lt;/em&gt;) from runs (&lt;em&gt;R&lt;/em&gt;), and add this as a separate column to our dataframe with &lt;code class=&quot;highlighter-rouge&quot;&gt;mutate()&lt;/code&gt;. Then, we similarly calculate the win percentage (which Marchi and Albert make clear is actually better named a &lt;em&gt;win proportion&lt;/em&gt;) by dividing the wins (&lt;em&gt;W&lt;/em&gt;) by the total number of decided games (wins plus losses, or &lt;em&gt;W&lt;/em&gt; plus &lt;em&gt;L&lt;/em&gt;):&lt;/p&gt;
&lt;div class=&quot;highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;t &amp;lt;- t %&amp;gt;%
  mutate(rundiff = R-RA) %&amp;gt;%
  mutate(wpct = (W/(W+L)))
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;If we look at the summary statistics (the five number summary plus the mean) of run differentials with &lt;code class=&quot;highlighter-rouge&quot;&gt;summary()&lt;/code&gt;, we see:&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;Minimum: -337&lt;/li&gt;
  &lt;li&gt;First Quartile: -78.25&lt;/li&gt;
  &lt;li&gt;Median: 4.5&lt;/li&gt;
  &lt;li&gt;Mean: 0&lt;/li&gt;
  &lt;li&gt;Third Quartile: 78.5&lt;/li&gt;
  &lt;li&gt;Maximum: 300&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;And if we look at the same summary statistics for winning percentage with &lt;code class=&quot;highlighter-rouge&quot;&gt;summary()&lt;/code&gt;, we see:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Minimum: .265&lt;/li&gt;
  &lt;li&gt;First Quartile: .444&lt;/li&gt;
  &lt;li&gt;Median: .505&lt;/li&gt;
  &lt;li&gt;Mean: .500&lt;/li&gt;
  &lt;li&gt;Third Quartile: .556&lt;/li&gt;
  &lt;li&gt;Maximum: .716&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That should give us some expectation of just how much run differentials can translate into winning percentages in the first place: clearly no one is winning all the games, and even the lowest performing team is winning roughly a quarter of their games (.265, to be exact). However, we can plot the two against each other to test the general assumption, which to Marchi and Albert seems obvious but which I treat with a little more hesitation, that a greater run differential may translate into a higher winning percentage:&lt;/p&gt;
&lt;div class=&quot;highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;p &amp;lt;- ggplot(data=t, aes(x=rundiff, y=wpct))+
  geom_point()+
  labs(title=&quot;Winning Percentage vs Run Differentials, 2000-Present&quot;)+
  xlab(&quot;Run Differential&quot;)+ylab(&quot;Winning Percentage&quot;)+
  scale_x_continuous(breaks=seq(from=-400, to=400, by=100))+
  theme_classic()
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;&lt;img src=&quot;https://raw.githubusercontent.com/michaeljoseph04/blog/gh-pages/images/wpctrd.jpeg&quot; alt=&quot;wpctrd&quot; /&gt;&lt;/p&gt;

&lt;p&gt;If we do the same work and look at the data from 1900, this pattern becomes even more clear:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;https://raw.githubusercontent.com/michaeljoseph04/blog/gh-pages/images/wpctrd1900.jpeg&quot; alt=&quot;wpctrd1900&quot; /&gt;&lt;/p&gt;

&lt;h2 id=&quot;14-building-a-model&quot;&gt;1.4 Building a model&lt;/h2&gt;
&lt;p&gt;We can understand this association in two ways: we can understand even better the actual historical association, or we can build a model to understand the expected winning percentage given a particular run differential. Marchi and Albert do the latter.&lt;/p&gt;

&lt;p&gt;First, we suggest the model:&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Winning percentage = a + b * run differential + residuals&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;And we make the linear model with &lt;code class=&quot;highlighter-rouge&quot;&gt;lm()&lt;/code&gt;. We keep the data specified at the end of the function call because it allows us to focus on specifying (and if need arises, modifying) the actual model itself:&lt;/p&gt;
&lt;div class=&quot;highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;fit &amp;lt;- lm(wpct ~ rundiff, data=t)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;Instead of viewing the results with &lt;code class=&quot;highlighter-rouge&quot;&gt;summary()&lt;/code&gt; in base R, we can use &lt;code class=&quot;highlighter-rouge&quot;&gt;tidy()&lt;/code&gt; from the &lt;code class=&quot;highlighter-rouge&quot;&gt;broom&lt;/code&gt; package:&lt;/p&gt;
&lt;div class=&quot;highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;tidy(fit)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;This returns the summary output of &lt;code class=&quot;highlighter-rouge&quot;&gt;lm()&lt;/code&gt; as a tidy data frame. (This is not only useful in the moment, but is particularly useful if we would like to work with multiple models.) We will pay attention to the estimate variable:&lt;/p&gt;
&lt;div class=&quot;highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;# A tibble: 2 x 5
  term        estimate std.error statistic   p.value
  &amp;lt;chr&amp;gt;       &amp;lt;dbl&amp;gt;     &amp;lt;dbl&amp;gt;     &amp;lt;dbl&amp;gt;     &amp;lt;dbl&amp;gt;
1 (Intercept) 0.500     0.00116    433.      0.       
2 rundiff     0.000626  0.0000110  57.1      1.31e-215
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;Marchi and Albert explain the two rows of the estimate column: a team with a run differential of zero will win half of its games (.500, the intercept), and a one-run increase in a season’s run differential will correspond to a .000626 increase in winning percentage. The model, in other words, can be expressed:&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Winning percentage = .500 + .000626 * run differential + residuals&lt;/em&gt;&lt;/p&gt;
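
&lt;p&gt;To get a feel for the size of that slope, we can plug the rounded estimates from the &lt;code class=&quot;highlighter-rouge&quot;&gt;tidy(fit)&lt;/code&gt; table into the model by hand. This is just arithmetic on the printed values; the ten-run example and the 162-game season length are choices made here for illustration:&lt;/p&gt;
&lt;div class=&quot;highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;intercept &amp;lt;- 0.500      # rounded estimate from tidy(fit)
slope &amp;lt;- 0.000626       # rounded estimate from tidy(fit)
games &amp;lt;- 162            # assumed length of a regular season

# Expected winning percentage for a team that outscores its opponents by 10 runs
wpct_plus10 &amp;lt;- intercept + slope * 10

# Extra wins over a .500 team across a full season
extra_wins &amp;lt;- (wpct_plus10 - intercept) * games

wpct_plus10
extra_wins
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;Under this model, then, roughly ten additional runs of differential correspond to about one additional win over a season.&lt;/p&gt;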

&lt;p&gt;We can use &lt;code class=&quot;highlighter-rouge&quot;&gt;broom&lt;/code&gt; and its function &lt;code class=&quot;highlighter-rouge&quot;&gt;augment()&lt;/code&gt; to return the team data with the fitted values and residuals added (automating a common workflow in base R):&lt;/p&gt;
&lt;div class=&quot;highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;fitaug &amp;lt;- augment(fit)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;This returns a tibble with everything we need to know more about each of the values:&lt;/p&gt;
&lt;div class=&quot;highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;# A tibble: 480 x 9
    wpct rundiff .fitted .se.fit   .resid    .hat .sigma   .cooksd .std.resid
 * &amp;lt;dbl&amp;gt;   &amp;lt;int&amp;gt;   &amp;lt;dbl&amp;gt;   &amp;lt;dbl&amp;gt;    &amp;lt;dbl&amp;gt;   &amp;lt;dbl&amp;gt;  &amp;lt;dbl&amp;gt;     &amp;lt;dbl&amp;gt;      &amp;lt;dbl&amp;gt;
 1 0.463     -39   0.476 0.00123 -0.0126  0.00237 0.0253 0.000296      -0.499
 2 0.568     141   0.588 0.00193 -0.0203  0.00581 0.0253 0.00190       -0.806
 3 0.543      86   0.554 0.00149 -0.0106  0.00347 0.0253 0.000307      -0.420
 4 0.391    -142   0.411 0.00194 -0.0198  0.00586 0.0253 0.00182       -0.786
 5 0.509      27   0.517 0.00119 -0.00757 0.00222 0.0253 0.0000998     -0.300
 6 0.512       3   0.502 0.00116  0.0105  0.00209 0.0253 0.000179       0.414
 7 0.543      76   0.548 0.00142 -0.00435 0.00317 0.0253 0.0000470     -0.172
 8 0.407    -115   0.428 0.00171 -0.0206  0.00456 0.0253 0.00153       -0.816
 9 0.562      76   0.548 0.00142  0.0142  0.00317 0.0253 0.000499       0.561
10 0.451      17   0.511 0.00117 -0.0600  0.00214 0.0252 0.00603       -2.37
# ... with 470 more rows
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;It can also be easily joined to the original team data with a simple &lt;code class=&quot;highlighter-rouge&quot;&gt;left_join()&lt;/code&gt;. It also allows a much easier way to plot with &lt;code class=&quot;highlighter-rouge&quot;&gt;ggplot&lt;/code&gt;. We can use the same plot as our previous one, only with an extra line added which uses the new data frame and the &lt;em&gt;.fitted&lt;/em&gt; variable:&lt;/p&gt;
&lt;div class=&quot;highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;# Reassign p so the fitted line is kept when the plot is printed
p &amp;lt;- p+
  geom_line(data=fitaug, aes(x = rundiff, y = .fitted), size = 1, color=&quot;red&quot;)
p
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;&lt;img src=&quot;https://raw.githubusercontent.com/michaeljoseph04/blog/gh-pages/images/fitplot.jpeg&quot; alt=&quot;fitplot&quot; /&gt;&lt;/p&gt;
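
&lt;p&gt;For comparison, the base R route to the same fitted values, residuals, and fitted line might look something like the sketch below. It uses a small simulated data frame (the numbers are invented) rather than the team data, and &lt;code class=&quot;highlighter-rouge&quot;&gt;toy&lt;/code&gt; and &lt;code class=&quot;highlighter-rouge&quot;&gt;toy_fit&lt;/code&gt; are hypothetical names:&lt;/p&gt;
&lt;div class=&quot;highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;set.seed(7)
# Invented stand-in for the team data
toy &amp;lt;- data.frame(rundiff = c(-120, -60, -10, 0, 35, 80, 140))
toy$wpct &amp;lt;- 0.5 + 0.0006 * toy$rundiff + rnorm(nrow(toy), sd = 0.02)

toy_fit &amp;lt;- lm(wpct ~ rundiff, data = toy)

# Base R: attach fitted values and residuals by hand
toy$.fitted &amp;lt;- predict(toy_fit)
toy$.resid &amp;lt;- residuals(toy_fit)

# Base R: scatterplot with the fitted line drawn on top
plot(wpct ~ rundiff, data = toy)
abline(toy_fit, col = &quot;red&quot;)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;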

&lt;p&gt;All this would have to be achieved with a rather unintuitive set of calls to &lt;code class=&quot;highlighter-rouge&quot;&gt;predict()&lt;/code&gt; in base R. Finally, we can look at the R-squared value and confirm the goodness of fit with a &lt;code class=&quot;highlighter-rouge&quot;&gt;glance()&lt;/code&gt; in &lt;code class=&quot;highlighter-rouge&quot;&gt;broom&lt;/code&gt;. &lt;code class=&quot;highlighter-rouge&quot;&gt;glance(fit)&lt;/code&gt; returns:&lt;/p&gt;
&lt;div class=&quot;highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;# A tibble: 1 x 11
  r.squared adj.r.squared  sigma statistic   p.value    df logLik    AIC    BIC deviance
*     &amp;lt;dbl&amp;gt;         &amp;lt;dbl&amp;gt;  &amp;lt;dbl&amp;gt;     &amp;lt;dbl&amp;gt;     &amp;lt;dbl&amp;gt; &amp;lt;int&amp;gt;  &amp;lt;dbl&amp;gt;  &amp;lt;dbl&amp;gt;  &amp;lt;dbl&amp;gt;    &amp;lt;dbl&amp;gt;
1     0.872         0.872 0.0253     3260. 1.31e-215     2  1085. -2163. -2151.    0.306
# ... with 1 more variable: df.residual &amp;lt;int&amp;gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
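&lt;p&gt;As a rough guide to where two of these numbers come from, &lt;em&gt;r.squared&lt;/em&gt; and &lt;em&gt;sigma&lt;/em&gt; can be recomputed by hand from a model’s residuals. The sketch below uses simulated data (invented values, hypothetical names), not the team data:&lt;/p&gt;
&lt;div class=&quot;highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;set.seed(11)
x &amp;lt;- rnorm(200, sd = 100)
y &amp;lt;- 0.5 + 0.0006 * x + rnorm(200, sd = 0.025)
toy_fit &amp;lt;- lm(y ~ x)

res &amp;lt;- residuals(toy_fit)
n &amp;lt;- length(res)

# R-squared: share of the variance in y accounted for by the line
r_squared &amp;lt;- 1 - sum(res^2) / sum((y - mean(y))^2)

# sigma: residual standard error, dividing by the residual degrees of freedom (n - 2)
sigma_hat &amp;lt;- sqrt(sum(res^2) / (n - 2))

# Root mean square error divides by n instead, so it is slightly smaller
rmse &amp;lt;- sqrt(mean(res^2))

c(r_squared = r_squared, sigma = sigma_hat, rmse = rmse)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;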
&lt;p&gt;The R-squared value is 0.872, and the residual standard error, which is very close to the root mean square error that Marchi and Albert calculate, is listed in the &lt;em&gt;sigma&lt;/em&gt; variable as 0.0253. Finally, we can further investigate the model by plotting a quick residual plot using the &lt;code class=&quot;highlighter-rouge&quot;&gt;broom&lt;/code&gt; augmented model data, plotting the &lt;em&gt;.fitted&lt;/em&gt; variable against the &lt;em&gt;.resid&lt;/em&gt; variable. In order to be clear about which teams’ seasons are least captured by the model (or rather, where our residuals are the greatest), we can also create labels and pass them to the residual plot with &lt;code class=&quot;highlighter-rouge&quot;&gt;geom_text()&lt;/code&gt;. First, we merge the augmented model (&lt;em&gt;fitaug&lt;/em&gt;) back with our team data (&lt;em&gt;t&lt;/em&gt;), taking time to create a label column with &lt;code class=&quot;highlighter-rouge&quot;&gt;mutate()&lt;/code&gt; which combines the team id and their season. Then we simply specify the label in the aesthetic and tack on a &lt;code class=&quot;highlighter-rouge&quot;&gt;geom_text()&lt;/code&gt; to our plot, making sure to label only the most extreme values (here, anything with an absolute residual greater than .07):&lt;/p&gt;
&lt;div class=&quot;highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;# Join the model output back to the team data. Matching on rundiff and wpct
# assumes no two team-seasons share both values; augment(fit, data = t)
# would avoid the join altogether.
fitaug &amp;lt;- left_join(t, fitaug, by = c(&quot;rundiff&quot;, &quot;wpct&quot;)) %&amp;gt;%
  mutate(label = paste(teamID, yearID))

ggplot(data=fitaug, aes(x=.fitted, y=.resid, label=label))+
  geom_point(alpha=.3)+
  geom_hline(yintercept=0, col=&quot;red&quot;, linetype=&quot;dashed&quot;)+
  labs(title=&quot;Residual Plot&quot;)+
  xlab(&quot;Fitted Values&quot;)+ylab(&quot;Residuals&quot;)+
  # Label only the team-seasons with large residuals
  geom_text(data=filter(fitaug, .resid &amp;gt; .07 | .resid &amp;lt; -.07),
            nudge_x=.025)+
  theme_classic()
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;&lt;img src=&quot;https://raw.githubusercontent.com/michaeljoseph04/blog/gh-pages/images/residsbball.jpeg&quot; alt=&quot;residsbball&quot; /&gt;&lt;/p&gt;

&lt;p&gt;We see that the 2005 Arizona Diamondbacks, the 2006 Cleveland team, the 2008 Los Angeles Angels, and the 2016 Texas Rangers all have high residuals. A quick look at their run differential compared to their winning percentage shows why:&lt;/p&gt;

&lt;div class=&quot;highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;  teamID yearID lgID   G   W  L   R  RA rundiff      wpct        .resid
1    ARI   2005   NL 162  77 85 696 856    -160 0.4753086    0.07544339
2    CLE   2006   AL 162  78 84 870 782      88 0.4814815   -0.07358386
3    LAA   2008   AL 162 100 62 765 697      68 0.6172840    0.07473474
4    TEX   2016   AL 162  95 67 765 757       8 0.5864198    0.08141895
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;Arizona had a very large negative run differential but still a rather average winning percentage (near the mean of .500 and between the first and third quartiles; see our consideration of it above). Cleveland had a high run differential but a slightly below-average winning percentage; Texas had a small run differential with a high winning percentage. Finally, the Angels had a run differential between the first and third quartiles (and somewhat close to the mean) and a high winning percentage. If the relation between run differentials and winning percentages is to be better understood, a more complex model may need to be built.&lt;/p&gt;

&lt;h2 id=&quot;15-conclusion&quot;&gt;1.5 Conclusion&lt;/h2&gt;
&lt;p&gt;In the end, the tidy data format allows much cleaner code, much more familiar and intuitive operations with data, and also works well with visualization to make it a compelling alternative to many base R workflows. I hope this modification to one basic exercise in Marchi and Albert’s excellent book is helpful, and shows the ways that some approaches there and elsewhere might be updated.&lt;/p&gt;</content><author><name></name></author><summary type="html">A quick update to a chapter on analyzing baseball data</summary></entry><entry><title type="html">Introduction to R as a GIS</title><link href="-%3E/blog/project/2019/04/03/R-GIS-introduction.html" rel="alternate" type="text/html" title="Introduction to R as a GIS" /><published>2019-04-03T17:00:00+00:00</published><updated>2019-04-03T17:00:00+00:00</updated><id>-%3E/blog/project/2019/04/03/R-GIS-introduction</id><content type="html" xml:base="-%3E/blog/project/2019/04/03/R-GIS-introduction.html">&lt;p&gt;&lt;em&gt;An introduction to using R as a GIS for urban spatial analysis&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;https://raw.githubusercontent.com/michaeljoseph04/blog/gh-pages/images/map-3-cars.jpeg&quot; alt=&quot;map-3-cars&quot; /&gt;&lt;/p&gt;

&lt;h2 id=&quot;11-overview&quot;&gt;1.1 Overview&lt;/h2&gt;

&lt;p&gt;As an urban policy analyst and planner, I have to do a lot of analysis of spatial data. Over the last few years R has become a great way to do many of the basic tasks which would normally be done in QGIS or ArcGIS, and then link them to its powerful statistical tools. But as I’ve learned R as a GIS, I’ve noticed there are few introductions for the types of use cases which are common in my field. This brief walkthrough and the accompanying code should guide readers through some of this territory. My primary aim is clarity in these introductions, while also demonstrating the workflows best suited to data analysis.&lt;/p&gt;

&lt;p&gt;In this introduction I’ll cover mostly data wrangling and initial exploration of some spatial datasets, and how to bring them into relationship with well-established datasets which may already be around, like US Census data. This is a common task for which we use GIS: we have a set of events which are geolocated, and we want to look at these events in relationship to their context. Certain packages in R, particularly &lt;a href=&quot;https://r-spatial.github.io/sf/&quot;&gt;simple features&lt;/a&gt; (&lt;code class=&quot;highlighter-rouge&quot;&gt;sf&lt;/code&gt;), not only make all that work faster and easier, they also bring the handling of spatial data into the flow of familiar data analysis tasks.&lt;/p&gt;

&lt;p&gt;I will cover, then, several aspects of relating spatial data to their context:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Importing spatial data&lt;/li&gt;
  &lt;li&gt;Performing a spatial join&lt;/li&gt;
  &lt;li&gt;Visualizing spatial data&lt;/li&gt;
  &lt;li&gt;Other operations (a dissolve) and joining to US Census tracts&lt;/li&gt;
  &lt;li&gt;Looking for trends with US Census data&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Again, these are basic operations; in future posts I’ll explore spatial analysis proper, and deriving actual conclusions from the data. Be sure to go to the project &lt;a href=&quot;https://github.com/michaeljoseph04/gis_intro&quot;&gt;repository&lt;/a&gt; to find the code.&lt;/p&gt;

&lt;p&gt;The data files used here are available in the repository’s “data” folder. Our workflow will involve:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Importing the neighborhood boundaries, and importing and wrangling the collisions data to extract collisions in 2018&lt;/li&gt;
  &lt;li&gt;Joining the collisions to the neighborhood boundaries&lt;/li&gt;
  &lt;li&gt;Creating a summary (specifically, a count) of the number of collisions in each neighborhood&lt;/li&gt;
  &lt;li&gt;Visualizing the result&lt;/li&gt;
&lt;/ul&gt;
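
&lt;p&gt;Before turning to the real data, the join-and-count steps above can be sketched on toy geometry. Everything in this block is invented for illustration (two square “neighborhoods”, three “collision” points, no coordinate reference system); only the &lt;code class=&quot;highlighter-rouge&quot;&gt;sf&lt;/code&gt; functions are real, and the choice of &lt;code class=&quot;highlighter-rouge&quot;&gt;st_within&lt;/code&gt; as the join predicate is an assumption made for this sketch:&lt;/p&gt;
&lt;div class=&quot;highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;library(sf)

# Build a unit square whose lower-left corner is at (x0, 0)
square &amp;lt;- function(x0) {
  st_polygon(list(rbind(c(x0, 0), c(x0 + 1, 0), c(x0 + 1, 1), c(x0, 1), c(x0, 0))))
}

# Two made-up neighborhoods side by side
hoods &amp;lt;- st_sf(S_HOOD = c(&quot;Hood A&quot;, &quot;Hood B&quot;),
               geometry = st_sfc(square(0), square(1)))

# Three made-up collision locations
collisions &amp;lt;- st_as_sf(
  data.frame(id = 1:3, x = c(0.2, 0.7, 1.5), y = c(0.5, 0.5, 0.5)),
  coords = c(&quot;x&quot;, &quot;y&quot;)
)

# Spatial join: attach to each point the neighborhood it falls within
joined &amp;lt;- st_join(collisions, hoods, join = st_within)

# Count collisions per neighborhood
table(joined$S_HOOD)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;Here two of the points fall in Hood A and one in Hood B; the real workflow does the same thing with the Seattle files.&lt;/p&gt;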

&lt;h2 id=&quot;12-importing-and-wrangling-data&quot;&gt;1.2 Importing and Wrangling Data&lt;/h2&gt;

&lt;p&gt;The data I will be using is available from the &lt;a href=&quot;https://data.seattle.gov/&quot;&gt;City of Seattle&lt;/a&gt;, which has made great strides in &lt;a href=&quot;http://www.seattle.gov/tech/initiatives/open-data&quot;&gt;Open Data practices&lt;/a&gt;. To begin, I will use:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;a href=&quot;https://data.seattle.gov/Transportation/Collisions/vac5-r8kk&quot;&gt;Vehicle collisions data&lt;/a&gt;, which is available as a &lt;a href=&quot;http://data-seattlecitygis.opendata.arcgis.com/datasets/5b5c745e0f1f48e7a53acec63a0022ab_0.csv&quot;&gt;.csv file&lt;/a&gt;.&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://data.seattle.gov/dataset/City-Clerk-Neighborhoods/926y-cwh9&quot;&gt;City Clerk data on the neighborhoods of Seattle&lt;/a&gt;, specifically the &lt;a href=&quot;http://data-seattlecitygis.opendata.arcgis.com/datasets/b76cdd45f7b54f2a96c5e97f2dda3408_2.zip&quot;&gt;shapefile&lt;/a&gt; of the neighborhood boundaries with their identities.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We will import all the libraries we need:&lt;/p&gt;
&lt;div class=&quot;highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;library(sf)
library(tidyverse)
library(ggplot2)
library(scales)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;The first, &lt;code class=&quot;highlighter-rouge&quot;&gt;sf&lt;/code&gt;, is for dealing with spatial data. The &lt;code class=&quot;highlighter-rouge&quot;&gt;tidyverse&lt;/code&gt; will give us our data wrangling tools, and &lt;code class=&quot;highlighter-rouge&quot;&gt;ggplot2&lt;/code&gt;, with the addition of the &lt;code class=&quot;highlighter-rouge&quot;&gt;scales&lt;/code&gt; package, will be our framework for graphics.&lt;/p&gt;

&lt;p&gt;Starting with the City Clerk shapefile of the neighborhood boundaries, I’ll import the file with &lt;code class=&quot;highlighter-rouge&quot;&gt;read_sf()&lt;/code&gt;:&lt;/p&gt;
&lt;div class=&quot;highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;neighborhoods &amp;lt;- read_sf(&quot;project/data/City_Clerk_Neighborhoods.shp&quot;)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;We can immediately map the file with ggplot to see what it looks like:&lt;/p&gt;
&lt;div class=&quot;highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;ggplot() +
  geom_sf(data = neighborhoods)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;&lt;img src=&quot;https://raw.githubusercontent.com/michaeljoseph04/blog/gh-pages/images/plot1.jpeg&quot; alt=&quot;plot1&quot; /&gt;&lt;/p&gt;

&lt;p&gt;And we can see with &lt;code class=&quot;highlighter-rouge&quot;&gt;head()&lt;/code&gt; what the data looks like:&lt;/p&gt;
&lt;div class=&quot;highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;OBJECTID PERIMETER S_HOOD L_HOOD L_HOODID SYMBOL SYMBOL2   AREA   ...
     &amp;lt;dbl&amp;gt;     &amp;lt;dbl&amp;gt; &amp;lt;chr&amp;gt;  &amp;lt;chr&amp;gt;     &amp;lt;dbl&amp;gt;  &amp;lt;dbl&amp;gt;   &amp;lt;dbl&amp;gt;  &amp;lt;dbl&amp;gt; ...
1       1.      618. OOO    NA           0.     0.      0.  3588. ...
2       2.      734. OOO    NA           0.     0.      0. 22295. ...
3       3.     4088. OOO    NA           0.     0.      0. 56695. ...
4       4.     1809. OOO    NA           0.     0.      0. 64157. ...
5       5.      250. OOO    NA           0.     0.      0.  2993. ...
6       6.      409. OOO    NA           0.     0.      0. 11371. ...
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;There are other variables as well. If you notice the last, &lt;em&gt;geometry&lt;/em&gt;, you can see that the data also includes a variable for the geometries of the polygons in the shapefiles.&lt;/p&gt;

&lt;p&gt;For my purposes, I simply want to look further at the &lt;em&gt;S_HOOD&lt;/em&gt; variable, which has the City Clerk’s names for each of the major neighborhoods. I will want to join on this variable later. One matter of data cleaning: we can drop the areas where &lt;em&gt;S_HOOD&lt;/em&gt; has a value of &lt;em&gt;NA&lt;/em&gt;. I have checked, and these are mostly small areas such as &lt;a href=&quot;http://wikimapia.org/606799/Kellogg-Island&quot;&gt;Kellogg Island&lt;/a&gt;, which should be connected to neighborhoods rather than considered separate. A quick use of &lt;code class=&quot;highlighter-rouge&quot;&gt;tidyr&lt;/code&gt;’s &lt;code class=&quot;highlighter-rouge&quot;&gt;drop_na()&lt;/code&gt; (loaded with the tidyverse) can do that, and we can pipe the data into it rather than repeatedly assign the variable. Though in this case the code is just about as long either way, it is a good habit and leads to more readable code:&lt;/p&gt;
&lt;div class=&quot;highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;neighborhoods &amp;lt;- neighborhoods %&amp;gt;% drop_na(S_HOOD)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;One last matter before moving on to the collisions data: it is important to determine (and, for later, retrieve) the coordinate reference system from the shapefile-become-sf:&lt;/p&gt;
&lt;div class=&quot;highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;s_crs &amp;lt;- st_crs(neighborhoods)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;Now I’ll import the collisions data. I also do some data wrangling because of the way the file is organized. I’ll begin with the import:&lt;/p&gt;
&lt;div class=&quot;highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;collisions &amp;lt;- read.csv(&quot;project/data/collisions.csv&quot;, stringsAsFactors = FALSE)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;This data is tidy: the columns specify variable names, the rows specify cases. There’s only a little bit of wrangling to do. First, I’ll change the first variable name in the csv, which is hard to manipulate, and drop the rows which have locations but no other information (there are a few of these in the dataset), using &lt;code class=&quot;highlighter-rouge&quot;&gt;drop_na()&lt;/code&gt;, this time with no columns specified so that it checks every field. The reasoning is that if there is a missing value, especially in the first two fields (the coordinates), we can’t plot it. Of course, these data wrangling tasks should be done with care and attention to how they affect the entire analysis:&lt;/p&gt;
&lt;div class=&quot;highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;# Rename the misnamed first column in the csv
colnames(collisions)[1] &amp;lt;- &quot;X&quot;
# Drop rows with a missing value in any field
collisions &amp;lt;- collisions %&amp;gt;% drop_na()
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;Then I will filter to retrieve the collisions from 2018. This involves some data manipulation with &lt;code class=&quot;highlighter-rouge&quot;&gt;dplyr&lt;/code&gt;. First, I select only the x and y columns and the date variable. I choose to rename the variables as I go (by specifying the desired field name first), just for ease of reference. Then I create a field with &lt;code class=&quot;highlighter-rouge&quot;&gt;mutate()&lt;/code&gt; in which I extract the first four characters of the date variable. For this I use the substring function &lt;code class=&quot;highlighter-rouge&quot;&gt;substr&lt;/code&gt;, in which I specify the point to start and stop extracting characters in the date string. Since I am only going to be using dates from 2018, I actually replace the old date field by assigning it the same variable name (“date”). From there, I filter for the values of 2018 and then drop the date field, leaving me with just the coordinates:&lt;/p&gt;
&lt;div class=&quot;highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;collisions &amp;lt;- collisions %&amp;gt;%
  select(x = X, y = Y, date = INCDATE) %&amp;gt;%
  mutate(date = substr(date, start = 1, stop = 4)) %&amp;gt;%
  filter(date == &quot;2018&quot;) %&amp;gt;%
  select(-date)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;Next, I take the resulting collisions data frame and turn it into an &lt;code class=&quot;highlighter-rouge&quot;&gt;sf&lt;/code&gt; object. The City of Seattle confirms that the coordinate reference system is the same as that of the neighborhoods shapefile (WGS 84, i.e. EPSG:4326), and so I set it to the crs variable I extracted from that shapefile:&lt;/p&gt;
&lt;div class=&quot;highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;collisions_sf &amp;lt;- st_as_sf(collisions,
                   coords = c('x', 'y'),
                   crs = s_crs,
                   remove = FALSE)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;We can plot the result. This takes a little while with ggplot given the size of the data frame. I have changed the alpha transparency of the points to make their frequency easier to display:&lt;/p&gt;
&lt;div class=&quot;highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;ggplot() +
  geom_sf(data=collisions_sf, alpha=.3)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;&lt;img src=&quot;https://raw.githubusercontent.com/michaeljoseph04/blog/gh-pages/images/plot5.jpeg&quot; alt=&quot;plot5&quot; /&gt;&lt;/p&gt;
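
&lt;p&gt;To see the conversion in isolation, here is a minimal sketch on a two-row data frame. The two coordinate pairs are copied from the join output shown in the next section; the object names &lt;code class=&quot;highlighter-rouge&quot;&gt;pts&lt;/code&gt; and &lt;code class=&quot;highlighter-rouge&quot;&gt;pts_sf&lt;/code&gt; are just for illustration, and the crs is given as the EPSG code directly rather than taken from a shapefile:&lt;/p&gt;
&lt;div class=&quot;highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;library(sf)

# Two coordinate pairs (longitude in x, latitude in y)
pts &amp;lt;- data.frame(x = c(-122.3329, -122.2794),
                  y = c(47.70956, 47.51707))

# coords takes the column names in x, y order; 4326 is the EPSG code for WGS 84
pts_sf &amp;lt;- st_as_sf(pts, coords = c(&quot;x&quot;, &quot;y&quot;), crs = 4326, remove = FALSE)

class(pts_sf)        # includes &quot;sf&quot; and &quot;data.frame&quot;
st_crs(pts_sf)$epsg  # 4326
names(pts_sf)        # x and y are kept because remove = FALSE, plus geometry
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;Keeping the original columns with &lt;code class=&quot;highlighter-rouge&quot;&gt;remove = FALSE&lt;/code&gt; is why &lt;em&gt;x&lt;/em&gt; and &lt;em&gt;y&lt;/em&gt; are still visible after the join later on.&lt;/p&gt;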

&lt;h2 id=&quot;13-spatially-joining-the-data&quot;&gt;1.3 Spatially Joining the Data&lt;/h2&gt;

&lt;p&gt;Now we can join the data. I use &lt;code class=&quot;highlighter-rouge&quot;&gt;st_join&lt;/code&gt; to specify a spatial join, with the &lt;em&gt;collisions_sf&lt;/em&gt; shape joined to the &lt;em&gt;neighborhoods&lt;/em&gt; shape. I also make clear that I want only the collisions completely within the neighborhoods to be joined (this can be swapped for any of the usual spatial predicates, such as intersecting or touching):&lt;/p&gt;
&lt;div class=&quot;highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;collisions_join &amp;lt;- st_join(collisions_sf, neighborhoods, join = st_within)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;We can see the results of the join, where the &lt;em&gt;x&lt;/em&gt; and &lt;em&gt;y&lt;/em&gt; variables of the &lt;em&gt;collisions&lt;/em&gt; dataframe now have variables of the &lt;em&gt;neighborhoods&lt;/em&gt; data joined to them:&lt;/p&gt;
&lt;div class=&quot;highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;x        y OBJECTID PERIMETER            S_HOOD           ...
1 -122.3329 47.70956      112  29413.55       Haller Lake ...
2 -122.2794 47.51707       81  38996.71 South Beacon Hill ...
3 -122.2907 47.69020       96  23840.22          Wedgwood ...
4 -122.3498 47.64651       46  26753.56  North Queen Anne ...
5 -122.3300 47.61226       63  13225.23        First Hill ...
6 -122.3050 47.60217       55  18241.55             Minor ...
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
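
&lt;p&gt;A small sketch of what the join does with points that fall outside every polygon. The square “neighborhood” and the two points are made up for illustration and have no coordinate reference system:&lt;/p&gt;
&lt;div class=&quot;highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;library(sf)

# One hypothetical square neighborhood and two points: one inside, one outside
square &amp;lt;- st_polygon(list(rbind(c(0, 0), c(1, 0), c(1, 1), c(0, 1), c(0, 0))))
hoods  &amp;lt;- st_sf(S_HOOD = &quot;Toy Hood&quot;, geometry = st_sfc(square))
pts    &amp;lt;- st_sf(id = 1:2,
                geometry = st_sfc(st_point(c(0.5, 0.5)), st_point(c(2, 2))))

# st_join() is a left join by default, so unmatched points are kept
joined &amp;lt;- st_join(pts, hoods, join = st_within)
joined$S_HOOD  # &quot;Toy Hood&quot; for the first point, NA for the second
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;Because unmatched points are kept with &lt;em&gt;NA&lt;/em&gt; attributes, any collisions outside the neighborhood polygons will show up as an &lt;em&gt;NA&lt;/em&gt; group when the data is summarized.&lt;/p&gt;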

&lt;h2 id=&quot;14-summarizing-the-data&quot;&gt;1.4 Summarizing the Data&lt;/h2&gt;
&lt;p&gt;Now we want to count how many crashes are within each neighborhood, and visualize the result. Since the &lt;code class=&quot;highlighter-rouge&quot;&gt;sf&lt;/code&gt; objects are data frames, this operation can be done simply by summarizing the data as one would normally do with the &lt;code class=&quot;highlighter-rouge&quot;&gt;group_by()&lt;/code&gt; and &lt;code class=&quot;highlighter-rouge&quot;&gt;summarize()&lt;/code&gt; workflow of &lt;code class=&quot;highlighter-rouge&quot;&gt;dplyr&lt;/code&gt;: here &lt;code class=&quot;highlighter-rouge&quot;&gt;group_by()&lt;/code&gt; is set to the &lt;em&gt;S_HOOD&lt;/em&gt; variable and &lt;code class=&quot;highlighter-rouge&quot;&gt;summarize()&lt;/code&gt; counts the rows in each group with &lt;code class=&quot;highlighter-rouge&quot;&gt;n()&lt;/code&gt; (a quick call to just &lt;code class=&quot;highlighter-rouge&quot;&gt;count()&lt;/code&gt; would have been sufficient, but I wanted to specify the variable name for future reference).&lt;/p&gt;

&lt;p&gt;The only additional step I have to consider in this operation is that we have to set aside the geometry data which attaches to each case of the spatial data, in order to perform the count. Freed of the geometries, we can then re-attach them by joining the counts back to the original &lt;em&gt;neighborhoods&lt;/em&gt; dataset. To do this, I make a quick call to &lt;code class=&quot;highlighter-rouge&quot;&gt;as.data.frame()&lt;/code&gt; before performing the grouping and summarizing:&lt;/p&gt;
&lt;div class=&quot;highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;collisions_count &amp;lt;- collisions_join %&amp;gt;%
  as.data.frame() %&amp;gt;%  #Use this to remove the sticky geometry
  group_by(S_HOOD) %&amp;gt;%
  summarize(collisions_n = n())
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;This gives us the number of collisions per neighborhood:&lt;/p&gt;
&lt;div class=&quot;highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;S_HOOD          collisions_n
 &amp;lt;chr&amp;gt;         &amp;lt;int&amp;gt;
1 Adams           164
2 Alki             69
3 Arbor Heights    23
4 Atlantic        244
5 Belltown        437
6 Bitter Lake     146
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;(As an aside, this detaching and re-attaching of geometries has to be done because the attribute joins that &lt;code class=&quot;highlighter-rouge&quot;&gt;sf&lt;/code&gt; supports, such as &lt;code class=&quot;highlighter-rouge&quot;&gt;left_join()&lt;/code&gt;, don’t accept a second spatial object on the right-hand side. But as you can see, it makes intuitive sense from within the workflow to see geometry data as “sticky”: the workflow runs from extracting and manipulating the spatial data &lt;em&gt;variables&lt;/em&gt; like any other tidy data to joining the variables back to the data they came from, when we want to use them in context with all the other variables. The &lt;code class=&quot;highlighter-rouge&quot;&gt;sf&lt;/code&gt; workflow just allows the geometries to unstick and stick back on when we want them.)&lt;/p&gt;
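
&lt;p&gt;For comparison, here is a sketch of the &lt;code class=&quot;highlighter-rouge&quot;&gt;count()&lt;/code&gt; shorthand mentioned above, next to the approach used in the post. The three-point &lt;code class=&quot;highlighter-rouge&quot;&gt;toy_join&lt;/code&gt; object is a made-up stand-in for &lt;em&gt;collisions_join&lt;/em&gt;, and &lt;code class=&quot;highlighter-rouge&quot;&gt;st_drop_geometry()&lt;/code&gt; is swapped in for &lt;code class=&quot;highlighter-rouge&quot;&gt;as.data.frame()&lt;/code&gt; as a more explicit way to unstick the geometry. It assumes a version of &lt;code class=&quot;highlighter-rouge&quot;&gt;dplyr&lt;/code&gt; in which &lt;code class=&quot;highlighter-rouge&quot;&gt;count()&lt;/code&gt; has a &lt;code class=&quot;highlighter-rouge&quot;&gt;name&lt;/code&gt; argument:&lt;/p&gt;
&lt;div class=&quot;highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;library(sf)
library(dplyr)

# Toy stand-in for collisions_join: three points with a neighborhood name
toy_join &amp;lt;- st_sf(
  S_HOOD   = c(&quot;Adams&quot;, &quot;Adams&quot;, &quot;Alki&quot;),
  geometry = st_sfc(st_point(c(0, 0)), st_point(c(1, 1)), st_point(c(2, 2)))
)

# The approach used in the post
by_summarize &amp;lt;- toy_join %&amp;gt;%
  as.data.frame() %&amp;gt;%
  group_by(S_HOOD) %&amp;gt;%
  summarize(collisions_n = n())

# Shorthand: drop the geometry explicitly, then count() with a custom name
by_count &amp;lt;- toy_join %&amp;gt;%
  st_drop_geometry() %&amp;gt;%
  count(S_HOOD, name = &quot;collisions_n&quot;)

by_count$collisions_n  # 2 for Adams, 1 for Alki
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;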

&lt;h2 id=&quot;15-visualizing&quot;&gt;1.5 Visualizing&lt;/h2&gt;
&lt;p&gt;As mentioned above, we have to join the count data back to the original neighborhood data in order to see it in context with the rest of the variables. Just as in any GIS interface, there’s no need for any spatial joins here at all, but just a joining of the data: &lt;code class=&quot;highlighter-rouge&quot;&gt;sf&lt;/code&gt; lets me just do this with a simple &lt;code class=&quot;highlighter-rouge&quot;&gt;left_join()&lt;/code&gt; on the &lt;em&gt;S_HOOD&lt;/em&gt; variable.&lt;/p&gt;
&lt;div class=&quot;highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;neighborhood_collisions &amp;lt;- left_join(neighborhoods,
                                     collisions_count,
                                     by=&quot;S_HOOD&quot;)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;Now let’s plot the finished product with &lt;code class=&quot;highlighter-rouge&quot;&gt;ggplot()&lt;/code&gt;. With a simple call to &lt;code class=&quot;highlighter-rouge&quot;&gt;geom_sf()&lt;/code&gt;, mentioned earlier, we can specify that the &lt;code class=&quot;highlighter-rouge&quot;&gt;fill&lt;/code&gt; should be the new variable we created which counts the number of collisions. Here I also specify a fill scale, with some colors, and pass the &lt;code class=&quot;highlighter-rouge&quot;&gt;comma&lt;/code&gt; function from the &lt;code class=&quot;highlighter-rouge&quot;&gt;scales&lt;/code&gt; package as the &lt;code class=&quot;highlighter-rouge&quot;&gt;labels&lt;/code&gt; argument to make sure that the data doesn’t display in scientific format in the legend. After setting the theme with &lt;code class=&quot;highlighter-rouge&quot;&gt;theme_bw()&lt;/code&gt;, the final thing is to remove the axis text and ticks:&lt;/p&gt;
&lt;div class=&quot;highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;ggplot() +
  geom_sf(data = neighborhood_collisions, aes(fill = collisions_n)) +
  labs(fill = &quot;Collisions in 2018&quot;) +
  scale_fill_continuous(low = &quot;grey90&quot;,
                        high = &quot;darkblue&quot;,
                        labels = comma) +
  theme_bw() +
  theme(axis.text = element_blank(),
        axis.ticks = element_blank())
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;&lt;img src=&quot;https://raw.githubusercontent.com/michaeljoseph04/blog/gh-pages/images/plot6.jpeg&quot; alt=&quot;plot6&quot; /&gt;&lt;/p&gt;

&lt;p&gt;However, a more common operation is to plot not just the count but the count per area. The City Clerk neighborhoods file has this already calculated in square feet in the &lt;em&gt;AREA&lt;/em&gt; field, so a simple change of the &lt;code class=&quot;highlighter-rouge&quot;&gt;fill&lt;/code&gt; argument to &lt;code class=&quot;highlighter-rouge&quot;&gt;collisions_n/AREA&lt;/code&gt; will plot the collisions per sq. ft., a more informative choropleth map. If we wanted to make the calculation ourselves, however, &lt;code class=&quot;highlighter-rouge&quot;&gt;st_area()&lt;/code&gt; in &lt;code class=&quot;highlighter-rouge&quot;&gt;sf&lt;/code&gt; allows us to do this (with the help of the &lt;code class=&quot;highlighter-rouge&quot;&gt;lwgeom&lt;/code&gt; package, which must be loaded), and the &lt;code class=&quot;highlighter-rouge&quot;&gt;units&lt;/code&gt; package converts the result, which is in square meters, to square feet. We then simply add this variable to our data as a plain numeric column (we can just use &lt;code class=&quot;highlighter-rouge&quot;&gt;$&lt;/code&gt;) and use &lt;code class=&quot;highlighter-rouge&quot;&gt;mutate()&lt;/code&gt; to add another variable, the collisions per square foot. This can then be used for plotting the more accurate map:&lt;/p&gt;
&lt;div class=&quot;highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;library(lwgeom)
library(units)
neighborhood_areas &amp;lt;- st_area(neighborhood_collisions)
units(neighborhood_areas) &amp;lt;- with(ud_units, ft^2)

neighborhood_collisions$areas &amp;lt;- as.numeric(neighborhood_areas) # add column, convert to plain numeric

neighborhood_collisions &amp;lt;- neighborhood_collisions %&amp;gt;%
  mutate(collisions_sqft = collisions_n/areas)

ggplot() +
  geom_sf(data = neighborhood_collisions, aes(fill = collisions_sqft)) +
  labs(fill = &quot;Seattle Collision Density, 2018 (Collisions per sqft.)&quot;) +
  scale_fill_continuous(low = &quot;grey90&quot;,
                        high = &quot;darkblue&quot;,
                        labels = comma) +
  theme_bw() +
  theme(axis.text = element_blank(),
        axis.ticks = element_blank())
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;&lt;img src=&quot;https://raw.githubusercontent.com/michaeljoseph04/blog/gh-pages/images/plotF.jpeg&quot; alt=&quot;plotF&quot; /&gt;&lt;/p&gt;
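
&lt;p&gt;The unit conversion step can be checked on its own. This is a sketch using &lt;code class=&quot;highlighter-rouge&quot;&gt;set_units()&lt;/code&gt; from the &lt;code class=&quot;highlighter-rouge&quot;&gt;units&lt;/code&gt; package rather than the &lt;code class=&quot;highlighter-rouge&quot;&gt;ud_units&lt;/code&gt; lookup used above; the value of 100 square meters is made up for illustration:&lt;/p&gt;
&lt;div class=&quot;highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;library(units)

# 100 square meters, the unit st_area() reports for longitude/latitude data
area_m2 &amp;lt;- set_units(100, m^2)

# Setting a compatible unit on a units object converts the value
area_ft2 &amp;lt;- set_units(area_m2, ft^2)

as.numeric(area_ft2)  # about 1076.39, since 1 m^2 is about 10.7639 ft^2
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;Calling &lt;code class=&quot;highlighter-rouge&quot;&gt;as.numeric()&lt;/code&gt; at the end strips the unit, which is what lets the areas be divided into the plain integer counts afterwards.&lt;/p&gt;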

&lt;h2 id=&quot;16-other-operations-joins-and-the-census&quot;&gt;1.6 Other Operations: Joins and the Census&lt;/h2&gt;

&lt;p&gt;We can also perform another common operation, which is to join this data to census tracts. In what follows, we will use the &lt;code class=&quot;highlighter-rouge&quot;&gt;tigris&lt;/code&gt; package.&lt;/p&gt;

&lt;p&gt;Our tasks will be, then:&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;Dissolving the neighborhood boundaries object to obtain a city boundary&lt;/li&gt;
  &lt;li&gt;Downloading Census tracts and selecting those within the city boundaries&lt;/li&gt;
  &lt;li&gt;Joining the collision data to the tracts&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We can dissolve the city neighborhoods by simply creating a group within the sf which includes all of the neighborhoods, and summarizing over them:&lt;/p&gt;
&lt;div class=&quot;highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;s_city &amp;lt;- neighborhoods %&amp;gt;%
  mutate(group = 1) %&amp;gt;%
  group_by(group) %&amp;gt;%
  summarize()
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;The result is what you would expect, a dissolved polygon:
&lt;img src=&quot;https://raw.githubusercontent.com/michaeljoseph04/blog/gh-pages/images/plot7.jpeg&quot; alt=&quot;plot7&quot; /&gt;&lt;/p&gt;
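
&lt;p&gt;An alternative sketch of the same dissolve, calling &lt;code class=&quot;highlighter-rouge&quot;&gt;st_union()&lt;/code&gt; directly instead of grouping and summarizing. The two adjacent squares are made-up stand-ins for neighborhoods, and &lt;code class=&quot;highlighter-rouge&quot;&gt;sq()&lt;/code&gt; is a hypothetical helper that only builds them:&lt;/p&gt;
&lt;div class=&quot;highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;library(sf)

# Hypothetical helper: a unit square whose lower-left corner is at (x0, 0)
sq &amp;lt;- function(x0) {
  st_polygon(list(rbind(c(x0, 0), c(x0 + 1, 0), c(x0 + 1, 1),
                        c(x0, 1), c(x0, 0))))
}
toy_hoods &amp;lt;- st_sf(S_HOOD = c(&quot;A&quot;, &quot;B&quot;), geometry = st_sfc(sq(0), sq(1)))

# st_union() merges all geometries into one; st_sf() wraps it back into an sf
toy_city &amp;lt;- st_sf(geometry = st_union(toy_hoods))

nrow(toy_city)     # 1
st_area(toy_city)  # 2, the two unit squares combined
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;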

&lt;p&gt;Now, let’s get the census tracts. We can set options so that the tigris package returns simple features data and keeps what it downloads in its cache as we work. Then with a call to &lt;code class=&quot;highlighter-rouge&quot;&gt;tracts()&lt;/code&gt; we download the tracts for the county (specifying cb as true for the generalized 1:500k cartographic boundary file, a simpler geometry than the detailed TIGER/Line file):&lt;/p&gt;
&lt;div class=&quot;highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;library(tigris)

options(tigris_class = &quot;sf&quot;)
options(tigris_use_cache = TRUE)
s_tracts &amp;lt;- tracts(state=&quot;WA&quot;, county=&quot;King&quot;, cb=TRUE)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;You can check that the census tract geometries are all valid by way of &lt;code class=&quot;highlighter-rouge&quot;&gt;st_is_valid&lt;/code&gt;, which will check each polygon and return &lt;em&gt;TRUE&lt;/em&gt; or &lt;em&gt;FALSE&lt;/em&gt;. Wrapping that in base R’s &lt;code class=&quot;highlighter-rouge&quot;&gt;all()&lt;/code&gt;, which checks if all values are true, as in &lt;code class=&quot;highlighter-rouge&quot;&gt;all(st_is_valid(s_tracts))&lt;/code&gt;, returns &lt;em&gt;TRUE&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;Next, we transform the tracts &lt;em&gt;sf&lt;/em&gt; from its coordinate reference system into the reference system used by the Seattle neighborhoods file:&lt;/p&gt;
&lt;div class=&quot;highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;s_tracts &amp;lt;- st_transform(s_tracts, crs=s_crs)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;And then extract the census tracts which overlap with the city boundaries by simply subsetting one by the other. We can do this two ways. First, we proceed with a &lt;code class=&quot;highlighter-rouge&quot;&gt;filter()&lt;/code&gt; which keeps everything in the &lt;em&gt;s_tracts&lt;/em&gt; spatial data frame which intersects with the city boundaries. This is possible with &lt;code class=&quot;highlighter-rouge&quot;&gt;st_intersects&lt;/code&gt;, which checks for exactly this and returns, for each tract, the indices of the city polygons it intersects. Taking the &lt;code class=&quot;highlighter-rouge&quot;&gt;lengths()&lt;/code&gt; of the result gives &lt;em&gt;0&lt;/em&gt; or &lt;em&gt;1&lt;/em&gt; for each tract, since there is only one city polygon. We want to keep everything greater than &lt;em&gt;0&lt;/em&gt;, so we then pass this to &lt;code class=&quot;highlighter-rouge&quot;&gt;filter()&lt;/code&gt;:&lt;/p&gt;
&lt;div class=&quot;highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;s_city_tracts &amp;lt;- s_tracts %&amp;gt;% filter(lengths(st_intersects(s_tracts, s_city)) &amp;gt; 0)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
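&lt;p&gt;Alternatively, and as some have pointed out, we can perform this selection just as if we were subsetting one data frame by another, as we would normally do (the &lt;code class=&quot;highlighter-rouge&quot;&gt;sf&lt;/code&gt; method for &lt;code class=&quot;highlighter-rouge&quot;&gt;[&lt;/code&gt; uses &lt;code class=&quot;highlighter-rouge&quot;&gt;st_intersects&lt;/code&gt; by default):&lt;/p&gt;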
&lt;div class=&quot;highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;s_city_tracts &amp;lt;- s_tracts[s_city,]

&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;Plotting this shows us the census tracts overlaying our dissolved city limits polygon.
&lt;img src=&quot;https://raw.githubusercontent.com/michaeljoseph04/blog/gh-pages/images/city_tracts_big.jpeg&quot; alt=&quot;city_tracts_big&quot; /&gt;&lt;/p&gt;
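
&lt;p&gt;A toy sketch showing that the two approaches select the same rows, and that a tract which only shares an edge with the city still counts as intersecting. The squares and the &lt;code class=&quot;highlighter-rouge&quot;&gt;sq()&lt;/code&gt; helper are made up for illustration:&lt;/p&gt;
&lt;div class=&quot;highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;library(sf)

# Hypothetical helper: a unit square with its lower-left corner at (x0, y0)
sq &amp;lt;- function(x0, y0) {
  st_polygon(list(rbind(c(x0, y0), c(x0 + 1, y0), c(x0 + 1, y0 + 1),
                        c(x0, y0 + 1), c(x0, y0))))
}

# Toy city: the unit square. Toy tracts: inside, edge-sharing, far away
toy_city   &amp;lt;- st_sf(geometry = st_sfc(sq(0, 0)))
toy_tracts &amp;lt;- st_sf(TRACTCE = c(&quot;inside&quot;, &quot;touching&quot;, &quot;outside&quot;),
                    geometry = st_sfc(sq(0, 0), sq(1, 0), sq(5, 5)))

hits &amp;lt;- lengths(st_intersects(toy_tracts, toy_city))
hits  # 1 1 0

by_filter &amp;lt;- toy_tracts[hits &amp;gt; 0, ]
by_subset &amp;lt;- toy_tracts[toy_city, ]  # spatial subset, st_intersects by default
by_subset$TRACTCE  # &quot;inside&quot; &quot;touching&quot;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;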

&lt;p&gt;Now there is a problem which calls for some further data wrangling. Some of the tracts share edges with the border of the city, and so were not left out when we subsetted the data. We can clean this up in two ways. We can go back and, instead of subsetting the tracts by what intersects with the city border, do a join specifying that only areas &lt;em&gt;within&lt;/em&gt; the city borders will be kept. This may, however, not work exactly if you have overlapping files (in our case we do, because we are using different jurisdictional borders; this case appears often when using city boundaries). Alternatively, because the data is simple features data and works like a table, we can simply look through the tracts and drop the ones outside the border. The tradeoff with the latter approach is that it is only really useful where a small number of tracts need to be excluded: it isn’t practical for large datasets. For simplicity’s sake, however, I’ll do the latter, though it is time-intensive, because it shows another great feature of dealing with simple feature data, which is that &lt;code class=&quot;highlighter-rouge&quot;&gt;ggplot&lt;/code&gt; can immediately label things for you with a call to &lt;code class=&quot;highlighter-rouge&quot;&gt;geom_sf_text()&lt;/code&gt; or &lt;code class=&quot;highlighter-rouge&quot;&gt;geom_sf_label()&lt;/code&gt;. We’ll use the former, with the &lt;em&gt;TRACTCE&lt;/em&gt; variable, which will show tract numbers. We can then subset based on that:&lt;/p&gt;
&lt;div class=&quot;highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;ggplot(data=s_city_tracts) +
  geom_sf(data=s_city, fill=&quot;grey&quot;,color=NA)+
  geom_sf(fill=NA, color=&quot;black&quot;)+
  geom_sf_text(aes(label=TRACTCE))


s_city_tracts &amp;lt;- s_city_tracts %&amp;gt;%
  filter(! TRACTCE %in% c(&quot;020900&quot;, &quot;021000&quot;,
                          &quot;021100&quot;, &quot;021300&quot;, &quot;026400&quot;, &quot;026100&quot;,
                          &quot;026700&quot;,&quot;026600&quot;, &quot;026300&quot;, &quot;026001&quot;))

&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;Once we do this, we have the set of data we want to work with.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;https://raw.githubusercontent.com/michaeljoseph04/blog/gh-pages/images/tracts_final.jpeg&quot; alt=&quot;tracts_final&quot; /&gt;&lt;/p&gt;
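
&lt;p&gt;For larger datasets, where picking tracts out by hand isn’t practical, one possible variation (not what I did here, and only a sketch) is to keep each tract according to how much of its area falls inside the city. The rectangles, the &lt;code class=&quot;highlighter-rouge&quot;&gt;rect()&lt;/code&gt; helper and the 0.5 cutoff are all assumptions for illustration:&lt;/p&gt;
&lt;div class=&quot;highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;library(sf)

# Hypothetical helper: a rectangle from (x0, y0) to (x1, y1)
rect &amp;lt;- function(x0, y0, x1, y1) {
  st_polygon(list(rbind(c(x0, y0), c(x1, y0), c(x1, y1), c(x0, y1), c(x0, y0))))
}

toy_city   &amp;lt;- st_sf(geometry = st_sfc(rect(0, 0, 2, 2)))
toy_tracts &amp;lt;- st_sf(TRACTCE = c(&quot;inside&quot;, &quot;straddling&quot;, &quot;touching&quot;),
                    geometry = st_sfc(rect(0, 0, 1, 1),
                                      rect(1.5, 0, 3.5, 1),
                                      rect(2, 1, 3, 2)))

# Record each tract&#39;s full area, then measure the part that overlaps the city
toy_tracts$full_area &amp;lt;- as.numeric(st_area(toy_tracts))
# st_intersection() warns about attributes being carried over; harmless here
overlap &amp;lt;- suppressWarnings(st_intersection(toy_tracts, toy_city))
overlap$share &amp;lt;- as.numeric(st_area(overlap)) / overlap$full_area

# Keep tracts with more than half their area inside (arbitrary cutoff)
keep &amp;lt;- overlap$TRACTCE[overlap$share &amp;gt; 0.5]
toy_city_tracts &amp;lt;- toy_tracts[toy_tracts$TRACTCE %in% keep, ]
toy_city_tracts$TRACTCE  # &quot;inside&quot;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;The tract that only touches the border has an overlap with zero area, and the straddling one has a quarter of its area inside, so both are dropped.&lt;/p&gt;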

&lt;p&gt;We can then go about all of the spatial joining to the census tracts just as we did above, and the calculations for density by the tract’s area in square feet. This gives us another detailed map when we plot it:
&lt;img src=&quot;https://raw.githubusercontent.com/michaeljoseph04/blog/gh-pages/images/collision_density_final.jpeg&quot; alt=&quot;collision_density_final&quot; /&gt;&lt;/p&gt;
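
&lt;p&gt;The code for this step isn’t shown, so what follows is only a sketch of how the earlier steps might be repeated for tracts. &lt;code class=&quot;highlighter-rouge&quot;&gt;count_per_tract()&lt;/code&gt; is a hypothetical helper, the toy squares and points are made up, and it assumes the tracts carry a &lt;em&gt;GEOID&lt;/em&gt; column (as the &lt;code class=&quot;highlighter-rouge&quot;&gt;tigris&lt;/code&gt; tracts do):&lt;/p&gt;
&lt;div class=&quot;highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;library(sf)
library(dplyr)

# Hypothetical helper: count points per tract and divide by the tract area.
# Assumes `tracts` has a GEOID column and both inputs share a CRS.
count_per_tract &amp;lt;- function(points, tracts) {
  counts &amp;lt;- st_join(points, tracts, join = st_within) %&amp;gt;%
    st_drop_geometry() %&amp;gt;%
    filter(!is.na(GEOID)) %&amp;gt;%  # drop points outside every tract
    count(GEOID, name = &quot;collisions_n&quot;)

  out &amp;lt;- left_join(tracts, counts, by = &quot;GEOID&quot;)
  # With longitude/latitude data this area is in square meters and would
  # still need converting to square feet, as was done for the neighborhoods
  out$areas &amp;lt;- as.numeric(st_area(out))
  out$collisions_sqft &amp;lt;- out$collisions_n / out$areas
  out
}

# Toy check: two unit-square tracts, three points inside them, one outside
sq &amp;lt;- function(x0) {
  st_polygon(list(rbind(c(x0, 0), c(x0 + 1, 0), c(x0 + 1, 1),
                        c(x0, 1), c(x0, 0))))
}
toy_tracts &amp;lt;- st_sf(GEOID = c(&quot;A&quot;, &quot;B&quot;), geometry = st_sfc(sq(0), sq(1)))
toy_points &amp;lt;- st_sf(geometry = st_sfc(st_point(c(0.2, 0.5)), st_point(c(0.8, 0.5)),
                                      st_point(c(1.5, 0.5)), st_point(c(9, 9))))

toy_result &amp;lt;- count_per_tract(toy_points, toy_tracts)
toy_result$collisions_n  # 2 1
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;With the objects from this post, the call would presumably be &lt;code class=&quot;highlighter-rouge&quot;&gt;count_per_tract(collisions_sf, s_city_tracts)&lt;/code&gt;, with the result stored as the &lt;em&gt;tract_collisions&lt;/em&gt; object used below.&lt;/p&gt;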

&lt;p&gt;Now we can fetch data from the Census and look for spatial correlations (which, as in this example, might be tenuous), or simply research using the Census data with the addition of the data we joined.&lt;/p&gt;

&lt;h2 id=&quot;17-looking-for-trends-with-census-data&quot;&gt;1.7 Looking for Trends with Census Data&lt;/h2&gt;

&lt;p&gt;We can download Census data with the &lt;code class=&quot;highlighter-rouge&quot;&gt;acs&lt;/code&gt; package or with &lt;a href=&quot;https://walkerke.github.io/tidycensus/articles/basic-usage.html&quot;&gt;Kyle Walker’s&lt;/a&gt; handy &lt;code class=&quot;highlighter-rouge&quot;&gt;tidycensus&lt;/code&gt;. The &lt;code class=&quot;highlighter-rouge&quot;&gt;acs&lt;/code&gt; package, an immensely useful tool for planners, economists, and anyone who uses census data, was written by Ezra Haber Glenn with support from the Puget Sound Regional Council. Both packages require an API key, which can be obtained easily &lt;a href=&quot;https://api.census.gov/data/key_signup.html&quot;&gt;from the U.S. Census Bureau&lt;/a&gt;. For &lt;code class=&quot;highlighter-rouge&quot;&gt;acs&lt;/code&gt;, see &lt;code class=&quot;highlighter-rouge&quot;&gt;help(package=&quot;acs&quot;)&lt;/code&gt; for instructions on how to set this up with a quick use of &lt;code class=&quot;highlighter-rouge&quot;&gt;api.key.install(key=&quot;YOUR API KEY&quot;)&lt;/code&gt;; for &lt;code class=&quot;highlighter-rouge&quot;&gt;tidycensus&lt;/code&gt;, the equivalent is &lt;code class=&quot;highlighter-rouge&quot;&gt;census_api_key(&quot;YOUR API KEY&quot;)&lt;/code&gt;. Walker’s &lt;code class=&quot;highlighter-rouge&quot;&gt;tidycensus&lt;/code&gt; queries the Census API itself and makes the process even easier by returning fetched data directly as tidy data frames. Even more useful, it uses &lt;code class=&quot;highlighter-rouge&quot;&gt;tigris&lt;/code&gt; to join the data to the TIGER simple feature geometries, if you like. We don’t need to do that because of our previous step, but we will use it to fetch the census data. So, our workflow will look like this:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Fetch Census data&lt;/li&gt;
  &lt;li&gt;Wrangle for the kinds of data we want&lt;/li&gt;
  &lt;li&gt;Join to our data based on collisions&lt;/li&gt;
  &lt;li&gt;Look for trends&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;First, let’s fetch the data. We will be looking at a table familiar to many planners, &lt;a href=&quot;https://factfinder.census.gov/faces/tableservices/jsf/pages/productview.xhtml?src=bkmk&quot;&gt;American Community Survey data table B08141&lt;/a&gt; on the means of transportation to work. It shows a breakdown of the various means of transportation for every census tract, and can be used to inform and justify policy decisions which would improve transportation planning in certain areas. What I want to see for the purposes of this introduction is something at once very basic and also complex: the varying degrees of car ownership of the households and the number of collisions per square foot. Notice that the table itself has many more demographic characteristics, including means of transportation to work: perhaps one of these may be better to use if we are looking at the relationship of collisions to the demography of neighborhoods. More on this later. For now, let’s do the work of fetching the data.&lt;/p&gt;

&lt;p&gt;The data can be retrieved with the &lt;code class=&quot;highlighter-rouge&quot;&gt;tidycensus&lt;/code&gt; function &lt;code class=&quot;highlighter-rouge&quot;&gt;get_acs()&lt;/code&gt;. This assumes a little familiarity with &lt;code class=&quot;highlighter-rouge&quot;&gt;acs&lt;/code&gt;, and I recommend looking at the latter more in depth to get a handle on exactly how to fetch data. But if you know how to work with Census data normally, the method is intuitive: you spend a lot of time making a geography for the data you want to retrieve, and then you specify the tables from which you want to fetch data, then usually wrangle it into a tidy data frame. &lt;code class=&quot;highlighter-rouge&quot;&gt;tidycensus&lt;/code&gt; makes this even easier than &lt;code class=&quot;highlighter-rouge&quot;&gt;acs&lt;/code&gt;, and does this all in one go: in the arguments, you specify the geography of the data you want, the table, and where you want it from. What’s more, it uses &lt;code class=&quot;highlighter-rouge&quot;&gt;tigris&lt;/code&gt; just as we did above to append simple feature data to the geography, if you specify &lt;em&gt;TRUE&lt;/em&gt; in the &lt;code class=&quot;highlighter-rouge&quot;&gt;geometry&lt;/code&gt; argument. I am specifying &lt;em&gt;FALSE&lt;/em&gt; because we already have that data:&lt;/p&gt;

&lt;div class=&quot;highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;library(tidycensus)
# Make sure API key set up

s_acs &amp;lt;- get_acs(geography = &quot;tract&quot;, table = &quot;B08141&quot;,
                state =&quot;WA&quot;, county=&quot;King County&quot;, geometry = FALSE)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;We now have the data. Each tract is specified with a &lt;em&gt;GEOID&lt;/em&gt; and a &lt;em&gt;NAME&lt;/em&gt;, and variables are listed as &lt;em&gt;variable&lt;/em&gt;. In American Community Survey data, like we are using here, the margin of error is specified in the &lt;em&gt;moe&lt;/em&gt; field:&lt;/p&gt;
&lt;div class=&quot;highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;GEOID       NAME                                    variable   estimate   moe
  &amp;lt;chr&amp;gt;       &amp;lt;chr&amp;gt;                                   &amp;lt;chr&amp;gt;         &amp;lt;dbl&amp;gt; &amp;lt;dbl&amp;gt;
1 53033000100 Census Tract 1, King County, Washington B08141_001    4060.  401.
2 53033000100 Census Tract 1, King County, Washington B08141_002     101.   61.
3 53033000100 Census Tract 1, King County, Washington B08141_003    1871.  356.
4 53033000100 Census Tract 1, King County, Washington B08141_004     972.  293.
5 53033000100 Census Tract 1, King County, Washington B08141_005    1116.  371.
6 53033000100 Census Tract 1, King County, Washington B08141_006    2224.  370.
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;While &lt;code class=&quot;highlighter-rouge&quot;&gt;acs&lt;/code&gt;’s &lt;code class=&quot;highlighter-rouge&quot;&gt;acs.fetch()&lt;/code&gt; returns a description of the variable, &lt;code class=&quot;highlighter-rouge&quot;&gt;tidycensus&lt;/code&gt; assumes you are pretty certain of the variables you want. If you are uncertain of what you are looking for but know the table you want to look up, you can use &lt;code class=&quot;highlighter-rouge&quot;&gt;acs&lt;/code&gt; or simply look at the table on the Census website to confirm: for example, &lt;em&gt;B08141_001&lt;/em&gt; is the variable showing the total for the tract (strictly, the table counts workers 16 and over living in households), while &lt;em&gt;B08141_002&lt;/em&gt; shows the number with no vehicles available, and &lt;em&gt;B08141_005&lt;/em&gt; shows the number with three or more vehicles available.&lt;/p&gt;

&lt;p&gt;What I will do next is use &lt;code class=&quot;highlighter-rouge&quot;&gt;dplyr&lt;/code&gt; to filter for these three variables with the &lt;code class=&quot;highlighter-rouge&quot;&gt;%in%&lt;/code&gt; operator, which checks each value against a character vector where we place the three fields we want. Next, I drop the &lt;em&gt;moe&lt;/em&gt; field, then use &lt;code class=&quot;highlighter-rouge&quot;&gt;tidyr&lt;/code&gt;’s &lt;code class=&quot;highlighter-rouge&quot;&gt;spread()&lt;/code&gt; to spread the data across three fields. This will turn the data frame from one which has one row per variable per census tract to one where each census tract has a single row with the three variable fields we want to consider (it can be undone with &lt;code class=&quot;highlighter-rouge&quot;&gt;gather()&lt;/code&gt;). We can then use the latter to make calculations with &lt;code class=&quot;highlighter-rouge&quot;&gt;mutate()&lt;/code&gt;, computing the share of the total with at least one car available, the share with no cars, and the share with three or more cars available, which we will later reduce to a density (per square foot of the census tract, a figure we already have):&lt;/p&gt;

&lt;div class=&quot;highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;s_popcars &amp;lt;- s_acs %&amp;gt;%
  filter(variable %in% c(&quot;B08141_001&quot;, &quot;B08141_002&quot;, &quot;B08141_005&quot;)) %&amp;gt;%
  select(-moe) %&amp;gt;%
  spread(key=variable, value=estimate) %&amp;gt;%
  rename(total=B08141_001, nocars=B08141_002, threecars=B08141_005) %&amp;gt;%
  mutate(pctcars = (total-nocars)/total) %&amp;gt;%
  mutate(pctnocars = nocars/total) %&amp;gt;%
  mutate(pctthreecars = threecars/total)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
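&lt;p&gt;To see the reshape in isolation, here is a minimal sketch on a made-up two-tract data frame. The tract IDs and estimates are invented purely for illustration and are not Census figures; only the column names mirror the long format described above.&lt;/p&gt;
&lt;div class=&quot;highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;library(dplyr)
library(tidyr)

# Made-up long-format data: one row per tract/variable pair (illustrative values only)
toy_long &amp;lt;- tibble(
  GEOID    = rep(c(&quot;tract_a&quot;, &quot;tract_b&quot;), each = 3),
  variable = rep(c(&quot;B08141_001&quot;, &quot;B08141_002&quot;, &quot;B08141_005&quot;), times = 2),
  estimate = c(1000, 200, 100, 500, 50, 150)
)

# spread() gives one row per tract and one column per variable code
toy_wide &amp;lt;- toy_long %&amp;gt;%
  filter(variable %in% c(&quot;B08141_001&quot;, &quot;B08141_002&quot;, &quot;B08141_005&quot;)) %&amp;gt;%
  spread(key = variable, value = estimate) %&amp;gt;%
  rename(total = B08141_001, nocars = B08141_002, threecars = B08141_005) %&amp;gt;%
  mutate(pctnocars = nocars / total)

# gather() reverses the reshape: back to one row per tract/variable pair
toy_back &amp;lt;- toy_wide %&amp;gt;%
  select(-pctnocars) %&amp;gt;%
  gather(key = &quot;variable&quot;, value = &quot;estimate&quot;, total, nocars, threecars)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;Here &lt;code class=&quot;highlighter-rouge&quot;&gt;toy_wide&lt;/code&gt; has two rows (one per tract) and &lt;code class=&quot;highlighter-rouge&quot;&gt;toy_back&lt;/code&gt; has six again, which is the round trip between the two layouts.&lt;/p&gt;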
&lt;p&gt;Now that we have that complete, we can make the join to the previous data which had census tracts, and keep all we need with &lt;code class=&quot;highlighter-rouge&quot;&gt;select()&lt;/code&gt;:&lt;/p&gt;
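&lt;div class=&quot;highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;s_cars &amp;lt;- left_join(tract_collisions, s_popcars, by=&quot;GEOID&quot;) %&amp;gt;%
  mutate(cars = total - nocars) %&amp;gt;%  # households with at least one car, plotted below as cars/areas
  # keep the counts and the tract area (areas, assumed to come from tract_collisions) used in the plots
  select(GEOID, areas, collisions_sqft, cars, nocars, threecars, pctcars, pctnocars, geometry)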
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;Now, if we plot this data, we can see certain trends. Let’s start with the number of households which have cars available per unit area vs. the number of collisions per square foot in each tract. A simple point plot is enough to display the trend, and &lt;code class=&quot;highlighter-rouge&quot;&gt;stat_smooth()&lt;/code&gt; can overlay the line fitted by an ordinary least squares regression model (&lt;code class=&quot;highlighter-rouge&quot;&gt;lm&lt;/code&gt;):&lt;/p&gt;
&lt;div class=&quot;highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;ggplot(s_cars, aes(x=cars/areas, y=collisions_sqft)) +
  geom_point()+
  stat_smooth(method=&quot;lm&quot;, color=&quot;Orange&quot;, se=FALSE) +
  labs(title=&quot;Collision Density by Density of Households with Cars Available
       in Seattle Census Tracts&quot;)+
  xlab(&quot;Households with 1, 2, or 3+ Cars / Sq.Ft.&quot;)+
  ylab(&quot;Collisions / Sq.Ft.&quot;)+
  scale_x_continuous(labels=comma)+
  scale_y_continuous(labels=comma)+
  theme_classic()
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;&lt;img src=&quot;https://raw.githubusercontent.com/michaeljoseph04/blog/gh-pages/images/plot-cars.jpeg&quot; alt=&quot;plot-cars&quot; /&gt;&lt;/p&gt;

&lt;p&gt;Now let’s look at collision density vs. the density of households without cars:&lt;/p&gt;
&lt;div class=&quot;highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;ggplot(s_cars, aes(x=nocars/areas, y=collisions_sqft)) +
  geom_point()+
  stat_smooth(method=&quot;lm&quot;, color=&quot;Orange&quot;, se=FALSE) +
  labs(title=&quot;Collision Density by Density of Households with 0 Cars Available
       in Seattle Census Tracts&quot;)+
  xlab(&quot;Households with 0 Cars / Sq.Ft.&quot;)+
  ylab(&quot;Collisions / Sq.Ft.&quot;)+
  scale_x_continuous(labels=comma)+
  scale_y_continuous(labels=comma)+
  theme_classic()
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;&lt;img src=&quot;https://raw.githubusercontent.com/michaeljoseph04/blog/gh-pages/images/plot-0-cars.jpeg&quot; alt=&quot;plot-0-cars&quot; /&gt;&lt;/p&gt;
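&lt;p&gt;To put a number on the lines drawn by &lt;code class=&quot;highlighter-rouge&quot;&gt;stat_smooth()&lt;/code&gt;, the same kind of model can be fitted directly with &lt;code class=&quot;highlighter-rouge&quot;&gt;lm()&lt;/code&gt;. The sketch below uses a small made-up data frame (the values and the &lt;code class=&quot;highlighter-rouge&quot;&gt;cars_density&lt;/code&gt; column name are invented for illustration, not taken from the Seattle data) so that it runs on its own:&lt;/p&gt;
&lt;div class=&quot;highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;# Made-up tract-level values, for illustration only (not the Seattle data)
toy_tracts &amp;lt;- data.frame(
  cars_density    = c(1, 2, 3, 4, 5),
  collisions_sqft = c(2.1, 3.9, 6.2, 7.8, 10.1)
)

# The same ordinary least squares fit that stat_smooth(method = &quot;lm&quot;) draws
fit &amp;lt;- lm(collisions_sqft ~ cars_density, data = toy_tracts)

coef(fit)                # intercept and slope of the fitted line
summary(fit)$r.squared   # share of the variance the line accounts for
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;With the data from this post, the equivalent call would presumably be something like &lt;code class=&quot;highlighter-rouge&quot;&gt;lm(collisions_sqft ~ I(cars/areas), data = st_drop_geometry(s_cars))&lt;/code&gt;, assuming &lt;code class=&quot;highlighter-rouge&quot;&gt;s_cars&lt;/code&gt; carries the &lt;code class=&quot;highlighter-rouge&quot;&gt;cars&lt;/code&gt; and &lt;code class=&quot;highlighter-rouge&quot;&gt;areas&lt;/code&gt; columns used in the plots. Comparing the slope and R-squared across the fits is a quick way to judge how much weight each trend line deserves.&lt;/p&gt;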

&lt;p&gt;We could continue to explore the data in this way, and build better models using other Census data. This, together with changing the geographies, is essentially how transportation planners have constructed &lt;a href=&quot;https://en.wikipedia.org/wiki/Traffic_analysis_zone&quot;&gt;Traffic Analysis Zones&lt;/a&gt;, and how they document trends concerning them. However, by way of bringing this to a close, we should note how little we have modeled to produce these results compared to more sophisticated transportation planning analyses and, especially, how thin a basis we have for concluding anything about the relationship between collisions and the mode of travel in each census tract.&lt;/p&gt;

&lt;p&gt;Let’s be clear: our initial exploration of trends appears to show a positive relationship between collisions per square foot and the density of households with cars. But this relationship is still very unclear. We see this by comparing the first plot to the second, which also shows a positive, and steeper, relationship, but has a lower frequency of collisions as the density of households with no cars available increases. One might conclude from this data that there is not really a relationship between the number of collisions and the number of households with cars available. In fact, if we plot our third census variable, we could come up with the opposite idea: that neighborhoods with more cars available have lower collision density!&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;https://raw.githubusercontent.com/michaeljoseph04/blog/gh-pages/images/plot-3-cars.jpeg&quot; alt=&quot;plot-3-cars&quot; /&gt;&lt;/p&gt;

&lt;p&gt;This is not to introduce skepticism concerning our work, just to make clear that these plots are not results but moments in the &lt;em&gt;data exploration&lt;/em&gt; phase, useful for building a more robust model. The apparent contradiction is readily explained by mapping households with no cars available and households with three or more cars, a kind of map that is now completely familiar to us:&lt;/p&gt;
&lt;div class=&quot;highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;ggplot()+
  geom_sf(data = s_cars, aes(fill = nocars/areas)) +
  labs(fill = &quot;Seattle Density of Households with 0 Cars Available, 2018
       (by Census Tract)&quot;) +
  scale_fill_continuous(low = &quot;grey90&quot;,
                        high = &quot;darkblue&quot;,
                        labels=comma)+
  theme_bw() +
  theme(axis.text.x = element_blank()) +
  theme(axis.text.y = element_blank()) +
  theme(axis.ticks.x = element_blank()) +
  theme(axis.ticks.y = element_blank())
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;&lt;img src=&quot;https://raw.githubusercontent.com/michaeljoseph04/blog/gh-pages/images/map-0-cars.jpeg&quot; alt=&quot;map-0-cars&quot; /&gt;&lt;/p&gt;

&lt;div class=&quot;highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;ggplot()+
    geom_sf(data = s_cars, aes(fill = threecars/areas)) +
    labs(fill = &quot;Seattle Density of Households with 3+ Cars Available, 2018
         (by Census Tract)&quot;) +
    scale_fill_continuous(low = &quot;grey90&quot;,
                          high = &quot;darkblue&quot;,
                          labels=comma)+
    theme_bw() +
    theme(axis.text.x = element_blank()) +
    theme(axis.text.y = element_blank()) +
    theme(axis.ticks.x = element_blank()) +
    theme(axis.ticks.y = element_blank())
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;&lt;img src=&quot;https://raw.githubusercontent.com/michaeljoseph04/blog/gh-pages/images/map-3-cars.jpeg&quot; alt=&quot;map-3-cars&quot; /&gt;&lt;/p&gt;

&lt;p&gt;Clearly, the relationship between collisions and household car availability has to do with larger patterns in the urban form, including the walkability and bikeability of the city, and also the sheer amount of activity within the core versus many of the outer neighborhoods (which might account for collisions showing a negative relationship with households with 3+ cars available). While we have accounted for some of this by using densities rather than counts, we have not at all considered the density of traffic or interactions between travelers, or the differences in available modes as a function of the differences in the infrastructure.&lt;/p&gt;

&lt;p&gt;So, most fundamentally, we must go back to the basic assumption behind much of the exploration we have already done, and consider whether demographic data based on one or two variables about census tracts can tell us anything at all about collisions in the areas of the city where they occur (we must also consider collisions which arise not from local interactions, but from interactions between local and non-local travelers). Again, this is not to undermine our faith in the data, just to underline the need for much more data exploration in order to build a model. What is encouraging is that model building, as everyone dealing with these types of data knows, is an iterative process: I hope that the tools above make the phases of data exploration much easier to accomplish.&lt;/p&gt;

&lt;h2 id=&quot;18-conclusions&quot;&gt;1.8 Conclusions&lt;/h2&gt;

&lt;p&gt;As you can see, you can do these common GIS operations rather easily. The only additional thing we may want to do, for now, is write our manipulated data to a shapefile with a simple call to &lt;code class=&quot;highlighter-rouge&quot;&gt;write_sf()&lt;/code&gt;:&lt;/p&gt;
&lt;div class=&quot;highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;write_sf(neighborhood_collisions,
  &quot;project/data/neighborhood_collisions.shp&quot;,
  delete_layer = TRUE)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;And our joined US Census Data:&lt;/p&gt;
&lt;div class=&quot;highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;write_sf(s_cars, &quot;project/data/s_cars.shp&quot;,
  delete_layer = TRUE)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
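&lt;p&gt;One thing to watch when writing shapefiles: the format limits field names to 10 characters, so a longer name such as &lt;code class=&quot;highlighter-rouge&quot;&gt;collisions_sqft&lt;/code&gt; will be abbreviated on disk. If that matters, a GeoPackage is one alternative. The sketch below is an assumption-laden illustration rather than part of the workflow above: it builds a tiny made-up point layer, writes it to a temporary &lt;code class=&quot;highlighter-rouge&quot;&gt;.gpkg&lt;/code&gt; file, and reads it back with &lt;code class=&quot;highlighter-rouge&quot;&gt;read_sf()&lt;/code&gt; to check that the column name survives:&lt;/p&gt;
&lt;div class=&quot;highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;library(sf)

# A tiny made-up point layer, for illustration only
toy_sf &amp;lt;- st_sf(
  collisions_sqft = c(0.001, 0.002),
  geometry = st_sfc(st_point(c(-122.33, 47.61)),
                    st_point(c(-122.30, 47.65)),
                    crs = 4326)
)

# Write to a GeoPackage in a temporary directory (hypothetical file name)
out_path &amp;lt;- file.path(tempdir(), &quot;toy_sf.gpkg&quot;)
write_sf(toy_sf, out_path)

# Read the layer back and inspect the column names
toy_back_sf &amp;lt;- read_sf(out_path)
names(toy_back_sf)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;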

&lt;p&gt;There are even more things we can do within R with everything we’ve done so far. I will post further introductions to spatial analysis with these workflows in the future.&lt;/p&gt;

&lt;p&gt;In the meantime, for more information on visualizing the data in a more sophisticated manner than I have attempted here, you may want to check out &lt;a href=&quot;https://www.r-spatial.org/r/2018/10/25/ggplot2-sf.html&quot;&gt;r-spatial’s great series of posts on making maps&lt;/a&gt; in R, which also involve including many of the traditional cartographic elements useful for presentation-quality material. You might also want to see the &lt;code class=&quot;highlighter-rouge&quot;&gt;sf&lt;/code&gt; package’s &lt;a href=&quot;https://r-spatial.github.io/sf/&quot;&gt;documentation&lt;/a&gt; for more information, in particular their &lt;a href=&quot;https://github.com/rstudio/cheatsheets/blob/master/sf.pdf&quot;&gt;cheatsheet&lt;/a&gt;, which works as a handy graphical summary of the package’s approach to many common spatial data manipulation operations. You may also want to check out the &lt;code class=&quot;highlighter-rouge&quot;&gt;tigris&lt;/code&gt;, &lt;code class=&quot;highlighter-rouge&quot;&gt;acs&lt;/code&gt; and &lt;code class=&quot;highlighter-rouge&quot;&gt;tidycensus&lt;/code&gt; packages for info on easily retrieving and manipulating census data with R.&lt;/p&gt;</content><author><name></name></author><summary type="html">An introduction to using R as a GIS for urban spatial analysis</summary></entry><entry><title type="html">Introduction</title><link href="-%3E/blog/introduction/2019/04/03/introduction.html" rel="alternate" type="text/html" title="Introduction" /><published>2019-04-03T14:45:32+00:00</published><updated>2019-04-03T14:45:32+00:00</updated><id>-%3E/blog/introduction/2019/04/03/introduction</id><content type="html" xml:base="-%3E/blog/introduction/2019/04/03/introduction.html">&lt;p&gt;I hope this site will clearly introduce readers to some of the work I’ve been doing in urban data analysis. 
Check out my &lt;a href=&quot;http://michaeljoseph04.github.io/blog/2019-04-03-R-GIS-introduction.markdown&quot;&gt;first post, a tutorial on using R as a GIS for urban data analysis work&lt;/a&gt;. Also make sure to check out my &lt;a href=&quot;https://github.com/michaeljoseph04/&quot;&gt;github repos&lt;/a&gt; for code.&lt;/p&gt;</content><author><name></name></author><summary type="html">I hope this site will clearly introduce readers to some of the work I’ve been doing in urban data analysis. Check out my first post, a tutorial on using R as a GIS for urban data analysis work. Also make sure to check out my github repos for code.</summary></entry></feed>