# Chapter 7 Estimation / Bootstrap / Uncertainty

```
library(tidyverse)
library(ggformula)
library(mosaic)
library(broom)
library(statthink)
# Set theme for plots
theme_set(theme_statthinking())
read_csv("https://raw.githubusercontent.com/lebebr01/statthink/master/data-raw/baby.csv")
baby <-head(baby)
```

```
## [90m# A tibble: 6 x 6[39m
## birth_weight gestational_days maternal_age maternal_height maternal_pregna…
## [3m[90m<dbl>[39m[23m [3m[90m<dbl>[39m[23m [3m[90m<dbl>[39m[23m [3m[90m<dbl>[39m[23m [3m[90m<dbl>[39m[23m
## [90m1[39m 120 284 27 62 100
## [90m2[39m 113 282 33 64 135
## [90m3[39m 128 279 28 64 115
## [90m4[39m 108 282 23 67 125
## [90m5[39m 136 286 25 62 93
## [90m6[39m 138 244 33 62 178
## [90m# … with 1 more variable: maternal_smoker [3m[90m<lgl>[90m[23m[39m
```

Bootstrap and resampling methods can be used to estimate the variability in the estimated effects.

## 7.1 Estimating Error

To get some sense of the amount of error in the estimate of the linear slope here, a bootstrap can be done to provide some evidence of the likely range of slope values. The bootstrap will take the following general steps:

- Resample the observed data available, with replacement
- Fit the same linear regression model as above.
- Save the slope coefficient representing the relationship between birth weight and gestational days
- Repeat steps 1 - 3 many times
- Explore the distribution of slope estimates from the many resampled data sets.

When this was done with the classification tree, a function was used to do these steps once, then these were repeated many times. Below is a function that does the steps 1 - 3 above a single time.

```
function(...) {
resample_baby <- baby %>%
baby_resample <- sample_n(nrow(baby), replace = TRUE)
%>%
baby_resample lm(birth_weight ~ gestational_days, data = .) %>%
coef(.) %>%
.[2] %>%
data.frame()
}
resample_baby()
```

```
## .
## gestational_days 0.4972706
```

Now that there is a function that does steps 1 - 3, these processes can now be repeated many times.

```
map(1:10000, resample_baby) %>%
baby_coef <- bind_rows()
names(baby_coef) <- 'slope'
gf_density(~ slope, data = baby_coef)
```

```
%>%
baby_coef df_stats(~ slope, quantile(c(0.05, 0.5, 0.95)))
```

```
## response 5% 50% 95%
## 1 slope 0.3947727 0.4686122 0.5445012
```

## 7.2 Categorical Predictor(s)

Before, linear regression has been ran with a continuous attribute. In both models, the baby’s birth weight was the outcome of interest and the predictor in one model was the number of gestational days and in the other was the age of the mother at time of birth. What happens when a categorical predictor is used instead of a continuous predictor? This section will introduce that idea with a categorical predictor that has two different levels.

### 7.2.1 Mother’s smoking

It is known that a mother smoking while pregnant can hamper the development of the unborn fetus. Will this transition into lower birth weight for baby’s born to mothers who smoked during the pregnancy? First, let’s explore the distribution and calculate descriptive statistics for birth weight across the two groups.

```
gf_density(~ birth_weight, color = ~ maternal_smoker, size = 1.25, fill = 'gray80', data = baby) %>%
gf_labs(x = 'Birth Weight (in oz)',
color = 'Smoked?')
```

What are the general take-aways from the distributions above? To give some additional information, a violin plot may be helpful.

```
gf_violin(birth_weight ~ maternal_smoker, data = baby, draw_quantiles = c(0.1, 0.5, 0.9),
fill = 'gray85', size = 1) %>%
gf_refine(coord_flip()) %>%
gf_labs(y = "Birth Weight (in oz)",
x = "Smoker?")
```

Any additional information shown here that shows differences? To finish the descriptive exploration, let’s compute some descriptive statistics.

```
%>%
baby df_stats(birth_weight ~ maternal_smoker, mean, sd, median, quantile(c(0.25, 0.75)), length)
```

```
## response maternal_smoker mean sd median 25% 75% length
## 1 birth_weight FALSE 123.0853 17.42370 123 113 134 715
## 2 birth_weight TRUE 113.8192 18.29501 115 101 126 459
```

### 7.2.2 Linear Regression - Categorical Predictor

Now it is time to fit a model to the data here to explore if there indeed is a difference in the population. We know descriptively there is a difference in the two group means and medians, but is this difference large enough to be practical? The model is fitted similar to before with the `lm()`

function and a similar formula as before. The outcome (birth weight) is to the left of the `~`

and the predictor (maternal smoking status) is to the right.

```
lm(birth_weight ~ maternal_smoker, data = baby)
smoker_reg <-coef(smoker_reg)
```

```
## (Intercept) maternal_smokerTRUE
## 123.085315 -9.266143
```

To explore what these coefficients mean in a bit more detail, let’s look at the data a bit more.

```
baby %>%
baby <- mutate(smoker = ifelse(maternal_smoker, 1, 0))
head(baby)
```

```
## [90m# A tibble: 6 x 7[39m
## birth_weight gestational_days maternal_age maternal_height maternal_pregna…
## [3m[90m<dbl>[39m[23m [3m[90m<dbl>[39m[23m [3m[90m<dbl>[39m[23m [3m[90m<dbl>[39m[23m [3m[90m<dbl>[39m[23m
## [90m1[39m 120 284 27 62 100
## [90m2[39m 113 282 33 64 135
## [90m3[39m 128 279 28 64 115
## [90m4[39m 108 282 23 67 125
## [90m5[39m 136 286 25 62 93
## [90m6[39m 138 244 33 62 178
## [90m# … with 2 more variables: maternal_smoker [3m[90m<lgl>[90m[23m, smoker [3m[90m<dbl>[90m[23m[39m
```

Instead of using the `maternal_smoker`

attribute, instead let’s run the model with the `smoker`

attribute.

```
lm(birth_weight ~ smoker, data = baby)
smoker_reg_new <-coef(smoker_reg_new)
```

```
## (Intercept) smoker
## 123.085315 -9.266143
```

Notice that the coefficients for the linear regression are the same no matter which attribute is entered into the model. When a categorical attribute is entered into the regression in R, the attribute is automatically converted into something called an indicator or dummy variable. This means that one of the two values are represented with a 1, the other with a 0. The value that is represented with a 0 is the one that is closer to the letter “A”, meaning that the 0 is the first category in alphabetical order.

To again get a better grasp, the descriptive stats and the coefficients from the regression are shown together below.

```
%>%
baby df_stats(birth_weight ~ maternal_smoker, mean, sd, median, quantile(c(0.25, 0.75)), length)
```

```
## response maternal_smoker mean sd median 25% 75% length
## 1 birth_weight FALSE 123.0853 17.42370 123 113 134 715
## 2 birth_weight TRUE 113.8192 18.29501 115 101 126 459
```

`coef(smoker_reg)`

```
## (Intercept) maternal_smokerTRUE
## 123.085315 -9.266143
```

### 7.2.3 Inference

Similar to the continuous predictor, resampling/bootstrap takes a similar method in the case with a single categorical predictor.

In order to get some sense of the amount of error in the estimate of the linear slope here, a bootstrap can be done to provide some evidence of the likely range of slope values. The bootstrap will take the following general steps:

- Resample the observed data available, with replacement
- Fit the same linear regression model as above.
- Save the slope coefficient representing the relationship between birth weight and gestational days
- Repeat steps 1 - 3 many times
- Explore the distribution of slope estimates from the many resampled data sets.

```
function(...) {
resample_baby <- baby %>%
baby_resample <- sample_n(nrow(baby), replace = TRUE)
%>%
baby_resample lm(birth_weight ~ maternal_smoker, data = .) %>%
coef(.) %>%
.[2] %>%
data.frame()
}
resample_baby()
```

```
## .
## maternal_smokerTRUE -7.79664
```

Now that there is a function that does steps 1 - 3, these processes can now be repeated many times.

```
map(1:10000, resample_baby) %>%
baby_coef <- bind_rows()
names(baby_coef) <- 'slope'
gf_density(~ slope, data = baby_coef)
```

```
%>%
baby_coef df_stats(~ slope, quantile(c(0.05, 0.5, 0.95)))
```

```
## response 5% 50% 95%
## 1 slope -11.03984 -9.277623 -7.499066
```

## 7.3 More than 2 categorical groups

Before the model contained one attribute that represented two groups. What happens when there are more than two groups for an attribute? To explore this, the college scorecard data will be used again.

` read_csv("https://raw.githubusercontent.com/lebebr01/statthink/master/data-raw/College-scorecard-clean.csv", guess_max = 10000) college_score <-`

```
## Parsed with column specification:
## cols(
## instnm = [31mcol_character()[39m,
## city = [31mcol_character()[39m,
## stabbr = [31mcol_character()[39m,
## preddeg = [31mcol_character()[39m,
## region = [31mcol_character()[39m,
## locale = [31mcol_character()[39m,
## adm_rate = [32mcol_double()[39m,
## actcmmid = [32mcol_double()[39m,
## ugds = [32mcol_double()[39m,
## costt4_a = [32mcol_double()[39m,
## costt4_p = [32mcol_double()[39m,
## tuitionfee_in = [32mcol_double()[39m,
## tuitionfee_out = [32mcol_double()[39m,
## debt_mdn = [32mcol_double()[39m,
## grad_debt_mdn = [32mcol_double()[39m,
## female = [32mcol_double()[39m,
## bachelor_degree = [32mcol_double()[39m
## )
```

`head(college_score)`

```
## [90m# A tibble: 6 x 17[39m
## instnm city stabbr preddeg region locale adm_rate actcmmid ugds costt4_a
## [3m[90m<chr>[39m[23m [3m[90m<chr>[39m[23m [3m[90m<chr>[39m[23m [3m[90m<chr>[39m[23m [3m[90m<chr>[39m[23m [3m[90m<chr>[39m[23m [3m[90m<dbl>[39m[23m [3m[90m<dbl>[39m[23m [3m[90m<dbl>[39m[23m [3m[90m<dbl>[39m[23m
## [90m1[39m Alaba… Norm… AL Bachel… South… City:… 0.903 18 [4m4[24m824 [4m2[24m[4m2[24m886
## [90m2[39m Unive… Birm… AL Bachel… South… City:… 0.918 25 [4m1[24m[4m2[24m866 [4m2[24m[4m4[24m129
## [90m3[39m Unive… Hunt… AL Bachel… South… City:… 0.812 28 [4m6[24m917 [4m2[24m[4m2[24m108
## [90m4[39m Alaba… Mont… AL Bachel… South… City:… 0.979 18 [4m4[24m189 [4m1[24m[4m9[24m413
## [90m5[39m The U… Tusc… AL Bachel… South… City:… 0.533 28 [4m3[24m[4m2[24m387 [4m2[24m[4m8[24m836
## [90m6[39m Aubur… Mont… AL Bachel… South… City:… 0.825 22 [4m4[24m211 [4m1[24m[4m9[24m892
## [90m# … with 7 more variables: costt4_p [3m[90m<dbl>[90m[23m, tuitionfee_in [3m[90m<dbl>[90m[23m,[39m
## [90m# tuitionfee_out [3m[90m<dbl>[90m[23m, debt_mdn [3m[90m<dbl>[90m[23m, grad_debt_mdn [3m[90m<dbl>[90m[23m, female [3m[90m<dbl>[90m[23m,[39m
## [90m# bachelor_degree [3m[90m<dbl>[90m[23m[39m
```

### 7.3.1 Explore distribution 3 groups

Early in the course, the distribution of admission rates by the primary degree that the institution grants was explored. Below is a violin plot that shows these three distributions.

```
gf_violin(adm_rate ~ preddeg, data = college_score, fill = 'gray85',
size = 1, draw_quantiles = c(0.1, 0.5, 0.9))
```

There may be some small differences between these groups, but more formally we can test this to understand the amount of uncertainty in the average of the distributions. This again will make use of the `lm()`

function in R and the formula is very similar to what was done before and mimics the formula from the violin plot above.

```
lm(adm_rate ~ preddeg, data = college_score)
adm_model <-coef(adm_model)
```

```
## (Intercept) preddegBachelor Degree preddegCertificate Degree
## 0.72296993 -0.05170254 0.02193828
```

Guesses as to what these coefficients represent? How were the categorical groups turned into the different elements in the model?

### 7.3.2 Overall model fit

There is a measure of overall model fit that is commonly used in the research literature for linear regression models, called R-squared. R-squared represents the proportion of variation in the outcome that is explained by the attributes in the model. The statistic ranges from 0 to 1 where values closer to 1 indicate larger percentages of variation explained. This can be extracted from the model directly.

`summary(adm_model)$r.squared`

`## [1] 0.01404376`

Another one can be computed from the baby data where the birth weight was the outcome and gestational days was the primary attribute used as a predictor.

```
lm(birth_weight ~ gestational_days, data = baby)
baby_reg <-summary(baby_reg)$r.squared
```

`## [1] 0.1660911`

For models with a single predictor variable, R-squared is the correlation coefficient squared. For example:

`cor(birth_weight ~ gestational_days, data = baby) ^ 2`

`## [1] 0.1660911`

## 7.4 Multiple Regression

What happens if we would like to combine the two predictors? Shown above is that the number of gestational days has a moderate relationship to the baby weight, therefore exploring the effects of smoking, it would be nice to remove the effect of gestational days from the baby weight. More specifically, this essentially allows us to make comparisons on the effect of smoking for the **same** gestational days. One way to think about this is through conditional means. Exploration of these visually first can be particularly helpful.

```
gf_point(birth_weight ~ gestational_days, data = baby, size = 3) %>%
gf_smooth() %>%
gf_facet_wrap(~ maternal_smoker)
```

`## `geom_smooth()` using method = 'loess'`

```
lm(birth_weight ~ I(gestational_days - mean(gestational_days)) + maternal_smoker, data = baby)
baby_reg_smoker <-coef(baby_reg_smoker)
```

```
## (Intercept)
## 122.7366688
## I(gestational_days - mean(gestational_days))
## 0.4511679
## maternal_smokerTRUE
## -8.3743990
```

We can write out the regression equation similar to before:

\[\begin{equation} birth\_weight = 122.67 + 0.49 (gestational\_days - mean(gestational\_days) - 8.17 maternal\_smoker + \epsilon \end{equation}\]

Let’s explore how these are interpreted. ### Distribution of Effects Similar to before, the distribution of effects can be obtained with the following steps:

- Resample the observed data available, with replacement
- Estimate linear model coefficients.
- Save terms of interest
- Repeat steps 1 - 3 many times
- Explore the distribution of median differences from the many resampled data sets.

```
function(...) {
resample_baby <- baby %>%
baby_resample <- sample_n(nrow(baby), replace = TRUE)
%>%
baby_resample lm(birth_weight ~ I(gestational_days - mean(gestational_days)) + maternal_smoker, data = .) %>%
tidy(.) %>%
select(term, estimate)
}resample_baby()
```

```
## [90m# A tibble: 3 x 2[39m
## term estimate
## [3m[90m<chr>[39m[23m [3m[90m<dbl>[39m[23m
## [90m1[39m (Intercept) 123.
## [90m2[39m I(gestational_days - mean(gestational_days)) 0.486
## [90m3[39m maternal_smokerTRUE -[31m8[39m[31m.[39m[31m68[39m
```

```
map(1:10000, resample_baby) %>%
coef_baby <- bind_rows()
%>%
coef_baby gf_density(~ estimate) %>%
gf_facet_wrap(~ term, scales = 'free')
```

### 7.4.1 Interactions

One additional idea that can be quite powerful is the idea of interactions. This was indirectly shown earlier in the course with classification and regression trees, where the models after each split re-evaluated which attributes were most helpful. In this way, the same attribute could be used in different places with different scores identifying the split. A similar idea can be explored in the regression framework, where the idea is that there are differential effects for different groups. This can be shown visually:

```
gf_point(birth_weight ~ gestational_days, data = baby, size = 3) %>%
gf_smooth() %>%
gf_facet_wrap(~ maternal_smoker)
```

`## `geom_smooth()` using method = 'loess'`

```
lm(birth_weight ~ I(gestational_days - mean(gestational_days)) * maternal_smoker, data = baby)
baby_reg_int <-coef(baby_reg_int)
```

```
## (Intercept)
## 122.799690
## I(gestational_days - mean(gestational_days))
## 0.369615
## maternal_smokerTRUE
## -8.257707
## I(gestational_days - mean(gestational_days)):maternal_smokerTRUE
## 0.230846
```

```
function(...) {
resample_baby <- baby %>%
baby_resample <- sample_n(nrow(baby), replace = TRUE)
%>%
baby_resample lm(birth_weight ~ I(gestational_days - mean(gestational_days)) * maternal_smoker, data = .) %>%
tidy(.) %>%
select(term, estimate)
}resample_baby()
```

```
## [90m# A tibble: 4 x 2[39m
## term estimate
## [3m[90m<chr>[39m[23m [3m[90m<dbl>[39m[23m
## [90m1[39m (Intercept) 124.
## [90m2[39m I(gestational_days - mean(gestational_days)) 0.395
## [90m3[39m maternal_smokerTRUE -[31m9[39m[31m.[39m[31m36[39m
## [90m4[39m I(gestational_days - mean(gestational_days)):maternal_smokerTRUE 0.131
```

```
map(1:10000, resample_baby) %>%
coef_baby <- bind_rows()
%>%
coef_baby gf_density(~ estimate) %>%
gf_facet_wrap(~ term, scales = 'free')
```

### 7.4.2 Evaluating model fit

As discussed earlier, R-square is a measure of overall model fit. These can be compared across the different models to see which one may be doing the best and explaining the most variation in the baby’s birth weight.

`summary(baby_reg)$r.square`

`## [1] 0.1660911`

`summary(smoker_reg)$r.square`

`## [1] 0.06091`

`summary(baby_reg_smoker)$r.square`

`## [1] 0.215661`

`summary(baby_reg_int)$r.square`

`## [1] 0.2249173`