class: center, middle, inverse, title-slide

# Regression

---
class: center, middle

### a regression model is just an equation for a scatterplot

---
class: center, middle

# `\(y_i = \alpha + \beta x_i + r_i\)`

<img src="09-sl-regression_files/figure-html/unnamed-chunk-2-1.png" style="display: block; margin: auto;" />

---
class: center, middle

# What's a good line?

---
class: center, middle

# The Conditional Average

---
class: center, middle

<img src="09-sl-regression_files/figure-html/unnamed-chunk-3-1.png" style="display: block; margin: auto;" />

---
class: center, middle

<img src="09-sl-regression_files/figure-html/unnamed-chunk-4-1.png" style="display: block; margin: auto;" />

---
class: center, middle

<img src="09-sl-regression_files/figure-html/unnamed-chunk-5-1.png" style="display: block; margin: auto;" />

---
class: center, middle

<img src="09-sl-regression_files/figure-html/unnamed-chunk-6-1.png" style="display: block; margin: auto;" />

---
class: center, middle

<img src="09-sl-regression_files/figure-html/unnamed-chunk-7-1.png" style="display: block; margin: auto;" />

---
class: middle

Two facts:

1. The average is the point that minimizes the RMS of the deviations from the average.
1. We want a line that captures the *conditional* average.

Just as the *average* minimizes the RMS of the deviations *from the average*, perhaps we should choose the **line** that minimizes the RMS of the deviations (or "residuals") **from the line**.

---
class: center, middle

## that's exactly what we do.

---
class: center, middle

**We want the pair of coefficients `\((\hat{\alpha}, \hat{\beta})\)` that minimizes the RMS of the residuals, or...**

`\(\DeclareMathOperator*{\argmin}{arg\,min}\)`

`\begin{equation}
(\hat{\alpha}, \hat{\beta}) = \displaystyle \argmin_{( \alpha, \, \beta ) \, \in \, \mathbb{R}^2} \sqrt{\frac{\sum_{i = 1}^n r_i^2}{n}}
\end{equation}`

---
class: center, middle

# HTF do we find this minimum???

---
class: center, middle

## grid search

---
class: center, middle

![](grid-search.gif)

---
class: center, middle

# Numerical Optimization

We need to minimize...

`\(f(\alpha, \beta) = \displaystyle \sqrt{\frac{\sum_{i = 1}^n r_i^2}{n}} = \sqrt{\frac{\sum_{i = 1}^n (y_i - \hat{y}_i)^2}{n}} = \sqrt{\frac{ \sum_{i = 1}^n [y_i - (\alpha + \beta x_i)]^2}{n}}\)`

---
class: center, middle
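
A rough sketch of the numerical route in R, using the built-in `cars` data as a stand-in for any scatterplot: hand the RMS of the residuals to `optim()`, which should land on essentially the same intercept and slope that `lm()` reports.

```r
# RMS of the residuals as a function of the intercept (par[1]) and slope (par[2])
rms <- function(par, x, y) {
  sqrt(mean((y - (par[1] + par[2] * x))^2))
}

# let a numerical optimizer search for the minimizing pair
optim(par = c(0, 0), fn = rms, x = cars$speed, y = cars$dist)$par

# the least-squares answer, for comparison
coef(lm(dist ~ speed, data = cars))
```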
---
class: center, middle

# analytical optimization (calculus)

---
class: middle

# Two Facts

1. If `\(x\)` goes up 1 SD, then (on average) `\(y\)` goes up `\(r\)` SDs.
1. The regression line goes through the point of averages.

---
class: middle

# RMS of the Residuals

- Just like the SD offers a give-or-take number around the average, the "RMS of the residuals" offers a give-or-take number around the regression line.
- Indeed, just as the SD is the RMS of the deviations from the average, the RMS of the residuals is the RMS of the deviations from the regression line. **The RMS of the residuals tells us how far typical points fall from the regression line.**
- Sometimes the RMS of the residuals is called the "RMS error (of the regression)" or the "standard error of the regression," or it is denoted `\(\hat{\sigma}\)`.
- We can compute the RMS of the residuals by computing each residual and then taking the root-mean-square. But we can also use the much simpler formula `\(\sqrt{1 - r^2} \times \text{ SD of }y\)` (we check this with code in a few slides).
- This formula makes sense because `\(y\)` has an SD, but `\(x\)` explains some of that variation. As `\(r\)` increases, `\(x\)` explains more and more of the variation, and the RMS of the residuals shrinks away from the SD of `\(y\)` toward zero. It turns out that the RMS of the residuals is the SD of `\(y\)` shrunk by a factor of `\(\sqrt{1 - r^2}\)`.

---
class: center, middle

# Adequacy of the Line

---
class: middle

<img src="09-sl-regression_files/figure-html/unnamed-chunk-9-1.png" style="display: block; margin: auto;" />

---
class: middle

<img src="09-sl-regression_files/figure-html/unnamed-chunk-10-1.png" style="display: block; margin: auto;" />

---
class: middle

<img src="09-sl-regression_files/figure-html/unnamed-chunk-11-1.png" style="display: block; margin: auto;" />

---
class: middle

<img src="09-sl-regression_files/figure-html/unnamed-chunk-12-1.png" style="display: block; margin: auto;" />

---
class: center, middle

# Computing

---
class: middle

```r
gamson <- read_rds("data/gamson.rds")

ggplot(gamson, aes(x = seat_share, y = portfolio_share)) + 
  geom_point() + 
  geom_smooth()
```

<img src="09-sl-regression_files/figure-html/unnamed-chunk-13-1.png" style="display: block; margin: auto;" />

---
class: middle

```r
ggplot(gamson, aes(x = seat_share, y = portfolio_share)) + 
  geom_point() + 
  geom_smooth(method = "lm", se = FALSE) + 
  geom_abline(intercept = 0, slope = 1, color = "red")
```

<img src="09-sl-regression_files/figure-html/unnamed-chunk-14-1.png" style="display: block; margin: auto;" />

---
class: middle

```r
fit <- lm(portfolio_share ~ seat_share, data = gamson)

coef(fit)
```

```
## (Intercept)  seat_share 
##  0.06913558  0.79158398
```

---
class: middle

## Standard Errors and *p*-Values

Almost all software reports *p*-values or standard errors by default.

1. The *p*-values are tests of the null hypothesis that the slope/intercept equals zero.
1. The standard errors have the usual interpretation--we can hop 2 to the left and 2 to the right to get a 95% CI.
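
---
class: middle

That rule of thumb is easy to check in code: build the rough interval by hopping 2 SEs to either side of each estimate, then compare it with `confint()`.

```r
# rough 95% CI: hop 2 SEs to the left and 2 SEs to the right of each estimate
est <- coef(summary(fit))[, "Estimate"]
se  <- coef(summary(fit))[, "Std. Error"]
cbind(lower = est - 2 * se, upper = est + 2 * se)

# exact 95% CI, for comparison
confint(fit)
```

---
class: middle

And a quick check of the shortcut formula from a few slides back, `\(\sqrt{1 - r^2} \times \text{SD of } y\)`, against the residuals themselves; both numbers should essentially match the "residual sd" reported on the next slide.

```r
# RMS of the residuals, computed directly from the fitted line
sqrt(mean(residuals(fit)^2))

# the shortcut formula
# (sd() divides by n - 1 rather than n, so the match is approximate but very close)
r <- cor(gamson$seat_share, gamson$portfolio_share)
sqrt(1 - r^2) * sd(gamson$portfolio_share)
```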
---
class: middle

```r
fit <- lm(portfolio_share ~ seat_share, data = gamson)

arm::display(fit, detail = TRUE)
```

```
## lm(formula = portfolio_share ~ seat_share, data = gamson)
##             coef.est coef.se t value Pr(>|t|)
## (Intercept) 0.07     0.00    17.12   0.00    
## seat_share  0.79     0.01    80.81   0.00    
## ---
## n = 826, k = 2
## residual sd = 0.07, R-Squared = 0.89
```

---
class: center, middle

# Warning

**statistical models describe the factual world**

---
class: middle

> With few exceptions, statistical data analysis describes the outcomes of real social processes and not the processes themselves. It is therefore important to attend to the descriptive accuracy of statistical models, and to refrain from reifying them. (Fox 2008, p. 3)