# Chapter 11 Simple linear Regression

In this chapter, let’s continue with another common data modelling technique: linear regression. The premise of data modeling is to make explicit the relationship between:

- an
*outcome variable*\(y\), also called a*dependent variable*or response variable, and - an
*explanatory/predictor variable*\(x\), also called an*independent variable*or covariate.

Another way to state this is using mathematical terminology:
we will model the outcome variable \(y\) “as a function”
of the explanatory/predictor variable \(x\).
When we say “function” here,
we are not referring to functions in R such as the `ggplot()`

function,
but rather as a mathematical function.

Why do we have two different labels, explanatory and predictor, for the variable \(x\)? That’s because even though the two terms are often used interchangeably, roughly speaking data modeling serves one of two purposes:

**Modeling for explanation**: When you want to explicitly describe and quantify the relationship between the outcome variable \(y\) and a set of explanatory variables \(x\), determine the significance of any relationships, have measures summarizing these relationships, and possibly identify any*causal*relationships between the variables.**Modeling for prediction**: When you want to predict an outcome variable \(y\) based on the information contained in a set of predictor variables \(x\). Unlike modeling for explanation, however, you don’t care so much about understanding how all the variables relate and interact with one another, but rather only whether you can make good predictions about \(y\) using the information in \(x\).

For example, say you are interested in an outcome variable \(y\) of whether patients develop lung cancer and information \(x\) on their risk factors, such as smoking habits, age, and socioeconomic status. If we are modeling for explanation, we would be interested in both describing and quantifying the effects of the different risk factors. One reason could be that you want to design an intervention to reduce lung cancer incidence in a population, such as targeting smokers of a specific age group with advertising for smoking cessation programs. If we are modeling for prediction, however, we would not care so much about understanding how all the individual risk factors contribute to lung cancer, but rather only whether we can make good predictions of which people will contract lung cancer.

In this course, we will focus on modeling for explanation
and hence refer to \(x\) as *explanatory variables*.
If you are interested in learning about modeling for prediction,
we suggest you check out books and courses on the field of
*machine learning* such as
*An Introduction to Statistical Learning with Applications in R (ISLR)*
(James et al. 2017).

In this chapter, we will focus on one particular technique:
*linear regression*.
Linear regression is one of the most commonly-used
and easy-to-understand approaches to modeling.
Linear regression involves a *numerical* outcome variable \(y\)
and explanatory variables \(x\) that are either *numerical* or *categorical*.
Furthermore, the relationship between \(y\) and \(x\)
is assumed to be linear, or in other words, a line.

In the current Chapter on basic regression,
we will only consider models with one numerical explanatory variable \(x\)
This scenario is known as *simple linear regression*.

In Chapter 12
on inference for regression,
we will revisit the regression models we discussed in this chapter
and analyze the results using the tools for *statistical inference*
you have developed in Chapters 5, 6,
and 7 on sampling, confidence intervals,
and hypothesis testing, respectively.

In Chapter 13 on regression with one categorical explanatory variable, we will connect what you have learned from Chapter 10 — ANOVA — with the linear regression technique, and show that they are more similar than you may have previously thought.

Let’s now begin with basic regression,
which refers to linear regression models
with a continuous explanatory variable \(x\).
We will also discuss important statistical concepts
such as the *correlation coefficient*,
that “correlation is not necessarily causation,”
and what it means for a line to be “best-fitting.”

### Needed packages

Let’s get ready all the packages we will need for this chapter.

```
# Install xfun so that I can use xfun::pkg_load2
if (!requireNamespace('xfun')) install.packages('xfun')
xf <- loadNamespace('xfun')
cran_packages <- c(
"broom", # a new package we will introduce in this chapter
"dplyr",
"gapminder",
"ggplot2",
"janitor", # a new package we will introduce in this chapter
"moderndive",
"skimr"
)
if (length(cran_packages) != 0) xf$pkg_load2(cran_packages)
gg <- import::from(ggplot2, .all=TRUE, .into={new.env()})
dp <- import::from(dplyr, .all=TRUE, .into={new.env()})
import::from(magrittr, "%>%")
```

Recall one of the examples we saw in chapter 2,
in which Hans Rosling discussed global economy, health, and development.
Let’s use a slice of the same dataset,
and try to replicate his analyses partially.
Specifically, we will try to explain differences
in average life expectancy of a country
as a function of one numerical variable: average income of that country.
Could it be that countries with higher average income
also have higher average life expectancy?
Could it be that countries with higher average income
tend to have lower average life expectancy?
Or could it be that there is no monotonic relationship between
average income and average life expectancy?
We will answer these questions
by modeling the relationship between income and life expectancy
using *simple linear regression* where we have:

- A numerical outcome variable \(y\) (the average life expectancy) and
- A single numerical explanatory variable \(x\) (a country’s average income).

## 11.1 Exploring the data

The data on the 142 countries
can be found in the `gapminder`

data frame from the `gapminder`

package.
To keep things simple, let’s `filter()`

only the 2007 data
and save this data in a new data frame called `df_gapminder2007`

.

```
df_gapminder2007 <- gapminder::gapminder %>%
dp$filter(year == 2007) %>%
dp$select(-year) %>%
dp$rename(
life_exp = lifeExp,
gdp_per_cap = gdpPercap
) %>%
dp$mutate(income = log10(gdp_per_cap)*10000)
```

```
Rows: 142
Columns: 6
$ country <fct> Afghanistan, Albania, Algeria, Angola, Argentina, Austral…
$ continent <fct> Asia, Europe, Africa, Africa, Americas, Oceania, Europe, …
$ life_exp <dbl> 43.828, 76.423, 72.301, 42.731, 75.320, 81.235, 79.829, 7…
$ pop <int> 31889923, 3600523, 33333216, 12420476, 40301927, 20434176…
$ gdp_per_cap <dbl> 974.5803, 5937.0295, 6223.3675, 4797.2313, 12779.3796, 34…
$ income <dbl> 29888.18, 37735.69, 37940.25, 36809.91, 41065.10, 45370.0…
```

Observe that `Observations: 142`

indicates that
there are 142 rows/observations in `df_gapminder2007`

,
where each row corresponds to one observed country.

A full documentation of the dataset `gapminder`

, including its background,
can be found at gapminder.org.
A full description of all the variables included in `gapminder`

can be found in its associated help file
(run `?gapminder::gapminder`

in the console).

Here is a brief description of 6 variables
we selected in `df_gapminder2007`

:

**country**: Name of country.**continent**: Which of the five continents the country is part of. Note that “Americas” includes countries in both North and South America and that Antarctica is excluded.**life_exp**: Life expectancy in years.**pop**: Population, number of people living in the country.**gdp_per_cap**: Gross domestic product (GDP, in US dollars).**income**: log-transformed`gdp_per_cap`

, a proxy for average income

*Good to know*

**Types of data**

In Section 2.10.1, we have compared two common types of data —
continuous and categorical.
Related to this dichotomy, data can also be grouped into four types, namely
**nominal**, **ordinal**, **interval**, and **ratio**.
For example, **country** and **continent** in `df_gapminder2007`

are considered *nominal*.
There is no intrinsic order among countries,
nor is it sensible to measure distances (intervals) between the names of two countries.
*Ordinal* data, as the name suggests, refer to data that can be ordered,
such as low, medium, and high.
However, they should not be confused with *interval data*,
such as temperature,
which not only can be ordered (e.g., -2°C, -1°C, 0°C, 1°C, …),
but also allows for contant intervals between any adjacent numbers
(e.g., 5°C - 4°C = 8°C - 7°C).
With ordinal data, such as restaurant ratings,
it is not necessarily true that the difference between a one-star rated
and a two-star rated restaurant is equal to
that between a two-star rated and a three-star rated.
Finally, *ratio data* allow for not only ranking and constant intervals,
but also an absolute zero point.
Examples of ratio data include height, age, population size, and income.

This classification was
initally developed
by Stanley Smith Stevens and published in a 1946 *Science* article.
It is one of the most widely adopted typologies in psychology,
despite criticism from scholars from within psychology and other disciplines.
For example, it has long been debated whether behavioural and cognitive data
collected using Likert-type scales are ordinal
or could be considered as interval.
Team Likert Data Are Ordinal argues that the difference
between “strongly disagree” (1) and “disagree” (2)
is not always equal to that between “disagree” (2) and “indifferent” (3),
similar to restaurant ratings.
Meanwhile, Team Likert Data Could Be Used As Interval
contends that classifying these data as ordinal
would rule out many powerful statistical models including ANOVA and \(t\)-test,
all of which assume that the dependent variable is interval or ratio data.

An alternative way to look at the raw data values
is by choosing a random sample of the rows in `df_gapminder2007`

.
We can do so
by piping `df_gapminder2007`

into the `dplyr::sample_n()`

function.
Here we set the `size`

argument to be `5`

,
indicating that we want a random sample of 5 rows.
We display the results in Table 11.1.
Note that due to the random nature of the sampling,
you will likely end up with a different subset of 5 rows.

country | continent | life_exp | pop | gdp_per_cap | income |
---|---|---|---|---|---|

Togo | Africa | 58.420 | 5701579 | 882.970 | 29459.46 |

Sao Tome and Principe | Africa | 65.528 | 199579 | 1598.435 | 32036.95 |

Congo, Dem. Rep. | Africa | 46.462 | 64606759 | 277.552 | 24433.44 |

Lesotho | Africa | 42.592 | 2012649 | 1569.331 | 31957.15 |

Bulgaria | Europe | 73.005 | 7322858 | 10680.793 | 40286.04 |

### 11.1.1 Summary statistics — univariate

Now that we’ve looked at the raw values in our `df_gapminder2007`

data frame
and got a preliminary sense of the data,
let’s compute summary statistics for the two target varaibles —
a country’s average life expectancy `life_exp`

and its average `income`

.
We could use `dplyr::summarize()`

along with summary functions such as `mean()`

and `median()`

we saw in Section 3.4.
Instead, let’s create a customized function using `skimr::skim_with()`

.
This function would take in a data frame,
“skims” it, and returns commonly used summary statistics.

```
# Create a template function for descriptives
my_skim <- skimr::skim_with(base = skimr::sfl(n = length, missing = skimr::n_missing),
numeric = skimr::sfl(
mean,
sd,
iqr = IQR,
min,
p25 = ~ quantile(., 1/4),
median,
p75 = ~ quantile(., 3/4),
max
),
append = FALSE
) #sfl stands for "skimr function list"
```

Let’s take our `df_gapminder2007`

data frame,
`select()`

only the outcome and explanatory variables
`life_exp`

and `income`

, and pipe them into the `my_skim()`

function:

skim_variable | n | missing | mean | sd | iqr | min | p25 | median | p75 | max |
---|---|---|---|---|---|---|---|---|---|---|

life_exp | 142 | 0 | 67 | 12 | 19 | 40 | 57 | 72 | 76 | 83 |

income | 142 | 0 | 37418 | 5889 | 10448 | 24433 | 32106 | 37870 | 42555 | 46934 |

For the numerical variables `life_exp`

and `income`

it returns:

`n`

: the number of observations for each variable`missing`

: the number of missing values for each variable`mean`

: the average`sd`

: the standard deviation`iqr`

: the inter-quartile range, or the difference between the \(3^{rd}\) and the \(1^{st}\) quartile.`min`

: the*minimum*value, also known as the \(0^{th}\) percentile, as in the value at which 0% of observations are smaller than it`p25`

: the \(25^{th}\) percentile, the value at which 25% of observations are smaller than it, also known as the \(1^{st}\) quartile.`median`

: also known as the \(2^nd\) quartile, or the \(50^{th}\) percentile, the value at which 50% of observations are smaller than it`p75`

: the \(75^{th}\) percentile, the value at which 75% of observations are smaller than it, also known as the \(3^{rd}\) quartile.`max`

: the*maximum*value, also known as the \(100^{th}\) percentile, the value at which 100% of observations are smaller than it.

Looking at Table 11.2, we can see how the values of both variables distribute. For example, the mean life expectancy was 67 years, whereas the mean income was $37,418.

To understand the percentiles, let’s put them in the context.

```
gg$ggplot(df_gapminder2007,
mapping = gg$aes(x = "", y = life_exp)) +
# `x = ""` is necessary when there is only one vector to plot
gg$geom_boxplot(outlier.colour = "hotpink") +
# add the mean to the boxplot
gg$stat_summary(fun=mean, geom="point", shape=5, size=4) +
gg$geom_jitter(width = 0.1, height = 0, alpha = 0.2) +
# remove x axis title
gg$theme(axis.title.x = gg$element_blank())
```

```
gg$ggplot(df_gapminder2007,
mapping = gg$aes(x = "", y = income)) +
# `x = ""` is necessary when there is only one vector to plot
gg$geom_boxplot(outlier.colour = "hotpink") +
# add the mean to the boxplot
gg$stat_summary(fun=mean, geom="point", shape=5, size=4) +
gg$geom_jitter(width = 0.1, height = 0, alpha = 0.2) +
# remove x axis title
gg$theme(axis.title.x = gg$element_blank())
```

The middle 50% of life expectancy was between 57 to 76 years
(the first quartile, `p25`

, and third quartiles, `p75`

),
whereas the middle 50% of income falls within $32,106 to $42,555.

### 11.1.2 Summary statistics — bivariate

Since both the `life_exp`

and `income`

variables are numerical,
we can also apply a *scatterplot* to visualize this data.
Let’s do this using `geom_point()`

and display the result in Figure 11.1.

As the average income level increases, life expectancy also tends to go up. However, instead of tightly hugging the imaginary diagonal that goes from bottom left to upper right, as should be in a strictly linear relationship, some of the dots deviated, such as the ones highlighted in red in figure 11.1.

*Good to know*

**Logrithmic**

Run the following code chunk and compare the results to Figure 11.1:

```
# Code not run. Try on your own.
gg$ggplot(df_gapminder2007,
mapping = gg$aes(y = life_exp, x = gdp_per_cap)) +
gg$geom_point() +
gg$labs(title = "Relationship of wealth and health")
# plot with gdpPercap with log-scaled x axis
gg$ggplot(df_gapminder2007,
mapping = gg$aes(y = life_exp, x = gdp_per_cap)) +
gg$ geom_point() +
gg$scale_x_log10() +
gg$labs(title = "Relationship of wealth and health")
```

When two variables are numerical,
we can compute the *correlation coefficient*
between them.
Generally speaking, *coefficients* are quantitative expressions
of a specific phenomenon.
A *correlation coefficient* is a quantitative expression
of the *strength of the linear relationship between two numerical variables*.
Its value ranges between -1 and 1 where:

- -1 indicates a perfect
*negative relationship*: As one variable increases, the value of the other variable tends to go down, following a straight line. - 0 indicates no relationship: The values of both variables go up/down independently of each other.
- +1 indicates a perfect
*positive relationship*: As the value of one variable goes up, the value of the other variable tends to go up as well in a linear fashion.

`[.1, .3]`

: weak`[.3, .5]`

: moderate`[.5, 1)`

: strong

Figure 11.2 gives examples of 9 different correlation coefficient values for hypothetical numerical variables \(x\) and \(y\). For example, observe in the top right plot that for a correlation coefficient of -0.75 there is a negative linear relationship between \(x\) and \(y\), but it is not as strong as the negative linear relationship between \(x\) and \(y\) when the correlation coefficient is -0.9 or -1.

Loosely speaking, the tighter the dots are hugging the imaginary diagonal line, the stronger the relationship between the two variables and hence the larger the magnitude of the correlation coefficient.

Previously, we used `my_skim()`

function to compute what are known as *univariate*
summary statistics:
functions that take a single variable
and return some numerical summary of that variable.
However, there also exist *bivariate* summary statistics:
functions that take in two variables
and return some summary of those two variables;
such as a correlation coefficient.

The correlation coefficient can be computed using the `get_correlation()`

function in the `moderndive`

package.
In this case, the inputs to the function are the two numerical variables
for which we want to calculate the correlation coefficient.

We put the name of the outcome variable on the left-hand side
of the `~`

“tilde” sign,
while putting the name of the explanatory variable on the right-hand side.
This is known as R’s *formula notation*.
We will use this same “formula” syntax with regression later in this chapter.

```
# A tibble: 1 x 1
cor
<dbl>
1 0.809
```

An alternative way to compute correlation
is to use the `cor()`

summary function from base R
in conjunction with a `dplyr::summarize()`

:

In our case, the correlation coefficient of 0.809 indicates that the relationship between life expectancy and income is “strongly positive.” There is a certain amount of subjectivity in interpreting correlation coefficients, especially those that aren’t close to the extreme values of -1, 0, and 1. To develop your intuition about correlation coefficients, play the “Guess the Correlation” 1980’s style video game mentioned in Subsection 11.6.1.

## 11.2 Model fitting

Let’s build on the scatterplot in Figure 11.1
by adding a “best-fitting” line:
of all possible lines we can draw on this scatterplot,
it is the line that “best” fits through the cloud of points.
We will explain what criterion dictates which line fits the best
in Section 11.3.
For now, let’s request this line
by adding a new `geom_smooth(method = "lm", se = FALSE)`

layer
to the `ggplot()`

code that created the scatterplot
in Figure 11.1.
The `method = "lm"`

argument sets the line to be a “`l`

inear `m`

odel.”
The `se = FALSE`

argument
suppresses *standard error* uncertainty bars.
We have seen the concept of *standard error* of mean
in Section 5.2.1.
We will extend this concept to regression coefficients
in Chapter 12.

```
gg$ggplot(df_gapminder2007,
mapping = gg$aes(x = income, y = life_exp)) +
gg$geom_point() +
gg$labs(x = "Income", y = "Life Expectancy",
title = "Average Income and Life Expectancy") +
gg$geom_smooth(method = "lm", se = FALSE)
```

The line in the resulting Figure 11.3
is called a “regression line.”
The regression line is a visual summary
of the relationship between two numerical variables,
in our case the outcome variable `life_exp`

and the explanatory variable `income`

.
The positive slope of the blue line is consistent
with our earlier observed correlation coefficient of `cor_gapminder`

suggesting that there is a positive relationship
between these two variables: as the average income of a country increases,
so does the life expectancy in this country.
We will see later that, although the correlation coefficient
and the slope of a regression line typically have the same sign
(positive or negative), they typically do not have the same value.

You may recall from secondary/high school algebra that the equation of a line is \(y = a + b\cdot x\). (Note that the \(\cdot\) symbol is equivalent to the \(\times\) “multiply by” mathematical symbol.) It is defined by two coefficients \(a\) and \(b\). The intercept coefficient \(a\) is the value of \(y\) when \(x = 0\). The slope coefficient \(b\) for \(x\) is the increase in \(y\) for every increase of one in \(x\). This is also called the “rise over run.”

However, when defining a regression line like the regression line in Figure 11.3, we use slightly different notation: the equation of the regression line is \(\widehat{y} = b_0 + b_1 \cdot x\) . The intercept coefficient is \(b_0\), so \(b_0\) is the value of \(\widehat{y}\) when \(x = 0\). The slope coefficient for \(x\) is \(b_1\), i.e., the increase in \(\widehat{y}\) for every increase of one in \(x\). Why do we put a “hat” on top of the \(y\)? It’s a form of notation commonly used in regression to indicate that we have a “fitted value,” or the value of \(y\) on the regression line for a given \(x\) value. We will discuss this more in the upcoming Subsection 11.3.

We know that the regression line in Figure 11.3
has a positive slope \(b_1\) corresponding
to our explanatory \(x\) variable `income`

.
Why? Because as countries tend to have higher average `income`

,
so also do they tend to have higher `life_exp`

, or life expectancy.
What is the numerical value of the slope \(b_1\) and the intercept \(b_0\)?
Let’s not compute these two values by hand,
but rather let’s use a computer!

We can obtain the values of the intercept \(b_0\)
and the slope for `income`

\(b_1\)
by outputting a *linear regression table*. This is done in two steps:

- We first “fit” the linear regression model using the
`lm()`

function and save it in`health_model`

. - We get the regression table by applying the
`get_regression_table()`

function from the`moderndive`

package to`health_model`

.

```
# Fit regression model:
health_model <- lm(life_exp ~ income, data = df_gapminder2007)
# Get regression table:
moderndive::get_regression_table(health_model)
```

term | estimate | std_error | statistic | p_value | lower_ci | upper_ci |
---|---|---|---|---|---|---|

intercept | 4.950 | 3.858 | 1.283 | 0.202 | -2.677 | 12.576 |

income | 0.002 | 0.000 | 16.283 | 0.000 | 0.001 | 0.002 |

Let’s first focus on interpreting the regression table output
in Table 11.3,
and then we will later revisit the code that produced it.
In the `estimate`

column of Table 11.3
are the intercept \(b_0\) = 4.95
and the slope \(b_1\) = 0.002 for `income`

.
Thus the equation of the regression line in Figure 11.3 follows:

\[ \begin{aligned} \widehat{y} &= b_0 + b_1 \cdot x\\ \widehat{\text{life_exp}} &= b_0 + b_{\text{income}} \cdot\text{income}\\ &= 4.95 + 0.002\cdot\text{income} \end{aligned} \]

It is very important to report both the statistics and interpret them in plain English. In this example, it is not enough by just reporting the regression table and the equation of the regression line. As a good researcher, you are also expected to explain what these numbers mean.

The intercept \(b_0\) = 4.95 is the average life expectancy
\(\widehat{y}\) = \(\widehat{\text{life_exp}}\) for those countries
that have an `income`

of 0.
Or in graphical terms,
it’s where the line intersects the \(y\) axis when \(x\) = 0.
Note, however, that while the intercept of the regression line
has a mathematical interpretation,
it has no *practical* interpretation here,
since it is unlikely that a country will have an average income of 0.
Furthermore, looking at the scatterplot
in Figure 11.3,
no country has an income level anywhere near 0.

The more interesting parameter is the slope
\(b_1\) = \(b\_{\text{income}}\) for `income`

of 0.002,
as this summarizes the relationship between the income levels
and life expectancies.
Note that the sign is positive,
suggesting a positive relationship between these two variables,
meaning countries with higher average income
also tend to have longer life expectancy.
Formally, the slope is interpreted as:

For every increase of 1 unit in

`income`

, there is anassociatedincrease of,on average, 0.002 units of`life expectancy`

. Or, for every increase of $1,000 in`income`

, there is anassociated2 years increase in`life expectancy`

.

Recall from earlier that the correlation coefficient is 0.809.
Both correlation coefficient and the slope have the same positive sign,
but they have different values.
A correlation coefficient quantifies the strength of linear association
between two variables.
It is independent of units used in either variable.
In fact, correlation is one of the standardized effect sizes.
A regression slope, however, is unit-dependent.
Changing the unit of `income`

from $ (US dollar) to € (Euro),
for example, will change the magnitude of the slope.
Although a regression slope is not ideal for comparing results across studies,
its unit-dependency provides a meaningful context
for interpreting results within a single study.

Note the language used in the previous interpretation.
We only state that there is an *associated* increase and
not necessarily a *causal* increase.
For example, perhaps it’s not that higher income levels
directly cause longer life expectancies per se.
Instead, the reverse could hold true:
longevity may lead to late retirement,
which would contribute to higher income level.
In other words, just because two variables are strongly associated,
it doesn’t necessarily mean that one causes the other.
This is summed up in the often quoted phrase,
“correlation is not necessarily causation.”
We discuss this idea further in Subsection on
11.5.1.

Furthermore, we say that this associated increase is *on average*
0.002 units of `life expectancy`

,
because you might have two countries whose `income`

levels
differ by 1000 dollars,
but their difference in life expectancies will not necessarily be exactly
2 years (try for yourself).
What the slope of 0.002 is saying is that
across all contries, the *average* difference in life expectancy
between two contries whose incomes differ by 1000 dollars
is 2 years.

Now that we have learned how to compute the equation for the regression line
in Figure 11.3 using the values in the `estimate`

column
of Table 11.3,
and how to interpret the resulting intercept and slope,
let’s revisit the code that generated this table:

```
# Fit regression model:
health_model <- lm(life_exp ~ income, data = df_gapminder2007)
# Get regression table:
moderndive::get_regression_table(health_model)
```

First, we “fit” the linear regression model to the `data`

using the `lm()`

function and save this as `health_model`

.
When we say “fit,” we mean “find the best fitting line to this data.”
`lm()`

stands for “linear model” and is used as follows:
`lm(y ~ x, data = data_frame_name)`

where:

`y`

is the outcome variable, followed by a tilde`~`

. In our case,`y`

is set to`life_exp`

.`x`

is the explanatory variable. In our case,`x`

is set to`income`

.- The combination of
`y ~ x`

is called a*model formula*. (Note the order of`y`

and`x`

.) In our case, the model formula is`life_exp ~ income`

. We saw such model formulas earlier when we computed the correlation coefficient using the`get_correlation()`

function in Subsection 11.1. `data_frame_name`

is the name of the data frame that contains the variables`y`

and`x`

. In our case,`data_frame_name`

is the`df_gapminder2007`

data frame.

Second, we take the saved model in `health_model`

and apply the `get_regression_table()`

function from the `moderndive`

package
to it to obtain the regression table in Table 11.3.
This function is an example of what’s known in computer programming
as a *wrapper function*.
They take other pre-existing functions
and “wrap” them into a single function that hides its inner workings.

This concept is illustrated in Figure 11.4.

So all you need to worry about is
what the inputs look like and what the outputs look like;
you leave all the other details “under the hood of the car.”
In our regression modeling example, the `get_regression_table()`

function
takes a saved `lm()`

linear regression model as input
and returns a object of the regression table as output (which is saved
as type `data frame`

).
If you’re interested in learning more
about the `get_regression_table()`

function’s inner workings,
check out Subsection 11.5.2.

Lastly, you might be wondering what the remaining five columns
in Table 11.3 are: `std_error`

, `statistic`

, `p_value`

,
`lower_ci`

and `upper_ci`

.
They are the *standard error*, *test statistic*, *p-value*,
*lower 95% confidence interval bound*,
and *upper 95% confidence interval bound*.
They tell us about both the *statistical significance*
and *practical significance* of our results.
This is loosely the “meaningfulness” of our results
from a statistical perspective.
Let’s put aside these ideas for now
and revisit them in Chapter
12 on (statistical) inference for regression.

## 11.3 Finding the best-fitting line

Now that we have seen how to use function `moderndive::get_regression_table()`

to get the value of the intercept and the slope of a regression line,
let’s tie up some loose ends:
What criterion dictates which line is the the “best-fitting” regression line?
First, we need to introduce a few important concepts
through the following example.

Table 11.4 below is the 67th row from the `df_gapminder2007`

data frame.

country | continent | life_exp | income |
---|---|---|---|

Japan | Asia | 83 | 45005 |

What would be the value \(\widehat{\text{life_exp}}\) on the regression line
corresponding to Japan’s average `income`

of $45,005?
In Figure 11.5 we zoom in
on the circle, the sqaure, and the arrow for the country Japan:

- Circle: The
*observed value*\(y\) = 82.603 is Japan’s actual average life expectancy. - Square: The
*fitted value*\(\widehat{y}\) is the value on the regression line for \(x\) =`income`

= $45,005. This value is computed using the intercept and slope in the previous regression table:

\[\widehat{y} = b_0 + b_1 \cdot x = 4.95 + 0.002 \cdot 45004.57 = 79.59\]

- Arrow: The length of this arrow is the
*residual*and is computed by subtracting the fitted value \(\widehat{y}\) from the observed value \(y\). The residual can be thought of as a model’s error or “lack of fit” for a particular observation.

In the case of Japan, it is \(y - \widehat{y}\) = 82.603 - 79.59 = 3.013.

Figure 11.6 displays
three more arbitrarily chosen countries in addition to Japan,
with each country’s *observed* \(y\), *fitted* \(\hat{y}\),
and the *residual* \(y - \hat{y}\) marked.

Panel A in Figure 11.6 is a replica of Figure 11.5, which highlights the country Japan. The three other plots refer to:

A country with an average income of \(x\) = $36,954 and life expectancy of \(y\) = 72.96. The residual in this case is \(72.96 - 66.24 = 6.72\), which we mark with a new blue arrow in Panel B.

A country with an average income of \(x\) = $38,372 and life expectancy of \(y\) = 74.99. The residual in this case is \(74.99 - 68.59 = 6.4\), which we mark with a new blue arrow in Panel C.

A country with an average income of \(x\) = $31,653 and life expectancy of \(y\) = 54.11. The residual in this case is \(54.11 - 57.45 = -3.34\), which we mark with a new blue arrow in Panel D.

Now say we want to repeat this process
and compute both the fitted value
\(\widehat{y} = b_0 + b_1 \cdot x\)
and the residual \(y - \widehat{y}\) for *all* 142 countries
in the study.
We could repeat the previous calculations we have performed
by hand 142 times,
but that would be tedious and error prone.
Instead, let’s do this using a computer
with the `moderndive::get_regression_points()`

function.
Just like the `get_regression_table()`

function,
the `get_regression_points()`

function is a “wrapper” function.
However, this function returns a different output.
Let’s apply the `get_regression_points()`

function to `health_model`

,
which is where we saved our `lm()`

model in the previous section.
The results are presented in Table 11.5.

ID | life_exp | income | life_exp_hat | residual |
---|---|---|---|---|

1 | 43.828 | 29888.18 | 54.519 | -10.691 |

2 | 76.423 | 37735.69 | 67.534 | 8.889 |

3 | 72.301 | 37940.25 | 67.874 | 4.427 |

4 | 42.731 | 36809.91 | 65.999 | -23.268 |

5 | 75.320 | 41065.10 | 73.056 | 2.264 |

6 | 81.235 | 45370.05 | 80.196 | 1.039 |

7 | 79.829 | 45578.26 | 80.541 | -0.712 |

8 | 75.635 | 44741.59 | 79.154 | -3.519 |

9 | 64.062 | 31434.06 | 57.083 | 6.979 |

10 | 79.441 | 45275.35 | 80.039 | -0.598 |

11 | 56.728 | 31587.50 | 57.338 | -0.610 |

12 | 65.554 | 35823.06 | 64.362 | 1.192 |

13 | 74.852 | 38719.40 | 69.166 | 5.686 |

14 | 50.728 | 40993.30 | 72.937 | -22.209 |

15 | 72.390 | 39574.06 | 70.583 | 1.807 |

16 | 73.005 | 40286.04 | 71.764 | 1.241 |

17 | 52.295 | 30853.02 | 56.120 | -3.825 |

18 | 49.580 | 26335.40 | 48.627 | 0.953 |

19 | 59.723 | 32339.55 | 58.585 | 1.138 |

20 | 50.430 | 33100.76 | 59.847 | -9.417 |

21 | 80.653 | 45601.37 | 80.580 | 0.073 |

22 | 44.741 | 28488.15 | 52.197 | -7.456 |

23 | 50.651 | 32314.86 | 58.544 | -7.893 |

24 | 78.553 | 41196.40 | 73.274 | 5.279 |

25 | 72.961 | 36954.04 | 66.238 | 6.723 |

26 | 72.889 | 38455.06 | 68.728 | 4.161 |

27 | 65.152 | 29939.42 | 54.604 | 10.548 |

28 | 46.462 | 24433.44 | 45.473 | 0.989 |

29 | 55.322 | 35602.12 | 63.996 | -8.674 |

30 | 78.782 | 39843.05 | 71.030 | 7.752 |

31 | 48.328 | 31888.58 | 57.837 | -9.509 |

32 | 75.748 | 41649.24 | 74.025 | 1.723 |

33 | 78.273 | 39517.31 | 70.489 | 7.784 |

34 | 76.486 | 43585.69 | 77.237 | -0.751 |

35 | 78.332 | 45475.09 | 80.370 | -2.038 |

36 | 54.791 | 33185.81 | 59.988 | -5.197 |

37 | 72.235 | 37799.84 | 67.641 | 4.594 |

38 | 74.994 | 38371.63 | 68.589 | 6.405 |

39 | 71.338 | 37467.26 | 67.089 | 4.249 |

40 | 71.878 | 37580.30 | 67.277 | 4.601 |

41 | 51.579 | 40847.22 | 72.695 | -21.116 |

42 | 58.040 | 28071.08 | 51.506 | 6.534 |

43 | 52.947 | 28393.56 | 52.041 | 0.906 |

44 | 79.313 | 45212.31 | 79.935 | -0.622 |

45 | 80.657 | 44838.73 | 79.315 | 1.342 |

46 | 56.735 | 41207.87 | 73.293 | -16.558 |

47 | 59.448 | 28766.51 | 52.659 | 6.789 |

48 | 79.406 | 45074.56 | 79.706 | -0.300 |

49 | 60.022 | 31230.70 | 56.746 | 3.276 |

50 | 79.483 | 44399.39 | 78.586 | 0.897 |

51 | 70.259 | 37148.37 | 66.560 | 3.699 |

52 | 56.007 | 29743.52 | 54.279 | 1.728 |

53 | 46.388 | 27628.52 | 50.772 | -4.384 |

54 | 60.916 | 30797.73 | 56.028 | 4.888 |

55 | 70.198 | 35500.24 | 63.827 | 6.371 |

56 | 82.208 | 45990.64 | 81.225 | 0.983 |

57 | 73.338 | 42554.88 | 75.527 | -2.189 |

58 | 81.757 | 45584.78 | 80.552 | 1.205 |

59 | 64.698 | 33895.58 | 61.166 | 3.532 |

60 | 70.650 | 35490.83 | 63.811 | 6.839 |

61 | 70.964 | 40646.72 | 72.362 | -1.398 |

62 | 59.545 | 36504.11 | 65.492 | -5.947 |

63 | 78.885 | 46093.38 | 81.396 | -2.511 |

64 | 80.745 | 44069.36 | 78.039 | 2.706 |

65 | 80.546 | 44559.06 | 78.851 | 1.695 |

66 | 72.567 | 38645.63 | 69.044 | 3.523 |

67 | 82.603 | 45004.57 | 79.590 | 3.013 |

68 | 72.535 | 36550.87 | 65.569 | 6.966 |

69 | 54.110 | 31653.18 | 57.447 | -3.337 |

70 | 67.297 | 32022.34 | 58.059 | 9.238 |

71 | 78.623 | 43682.52 | 77.397 | 1.226 |

72 | 77.588 | 46749.25 | 82.484 | -4.896 |

73 | 71.993 | 40195.76 | 71.615 | 0.378 |

74 | 42.592 | 31957.15 | 57.951 | -15.359 |

75 | 45.678 | 26175.32 | 48.362 | -2.684 |

76 | 73.952 | 40812.57 | 72.638 | 1.314 |

77 | 59.443 | 30190.21 | 55.020 | 4.423 |

78 | 48.303 | 28804.42 | 52.722 | -4.419 |

79 | 74.241 | 40952.27 | 72.869 | 1.372 |

80 | 54.467 | 30181.10 | 55.005 | -0.538 |

81 | 64.164 | 32560.32 | 58.951 | 5.213 |

82 | 72.801 | 40396.91 | 71.948 | 0.853 |

83 | 76.195 | 40783.69 | 72.590 | 3.605 |

84 | 66.803 | 34907.69 | 62.844 | 3.959 |

85 | 74.543 | 39663.25 | 70.731 | 3.812 |

86 | 71.164 | 35820.83 | 64.359 | 6.805 |

87 | 42.082 | 29157.62 | 53.308 | -11.226 |

88 | 62.069 | 29749.72 | 54.290 | 7.779 |

89 | 52.906 | 36822.41 | 66.020 | -13.114 |

90 | 63.785 | 30379.68 | 55.335 | 8.450 |

91 | 79.762 | 45658.23 | 80.674 | -0.912 |

92 | 80.204 | 44011.42 | 77.943 | 2.261 |

93 | 72.899 | 34392.25 | 61.989 | 10.910 |

94 | 56.867 | 27921.65 | 51.258 | 5.609 |

95 | 46.859 | 33040.55 | 59.748 | -12.889 |

96 | 80.196 | 46933.50 | 82.789 | -2.593 |

97 | 75.640 | 43486.20 | 77.072 | -1.432 |

98 | 65.483 | 34159.66 | 61.604 | 3.879 |

99 | 75.537 | 39916.33 | 71.151 | 4.386 |

100 | 71.752 | 36204.32 | 64.995 | 6.757 |

101 | 71.421 | 38697.54 | 69.130 | 2.291 |

102 | 71.688 | 35038.56 | 63.061 | 8.627 |

103 | 75.563 | 41872.36 | 74.395 | 1.168 |

104 | 78.098 | 43119.58 | 76.464 | 1.634 |

105 | 78.746 | 42862.03 | 76.037 | 2.709 |

106 | 76.442 | 38848.02 | 69.379 | 7.063 |

107 | 72.476 | 40337.64 | 71.850 | 0.626 |

108 | 46.242 | 29360.55 | 53.644 | -7.402 |

109 | 65.528 | 32036.95 | 58.083 | 7.445 |

110 | 72.777 | 43355.55 | 76.855 | -4.078 |

111 | 63.062 | 32336.24 | 58.579 | 4.483 |

112 | 74.002 | 39906.29 | 71.134 | 2.868 |

113 | 42.568 | 29357.80 | 53.640 | -11.072 |

114 | 79.972 | 46734.19 | 82.459 | -2.487 |

115 | 74.663 | 42713.38 | 75.790 | -1.127 |

116 | 77.926 | 44110.85 | 78.108 | -0.182 |

117 | 48.159 | 29666.77 | 54.152 | -5.993 |

118 | 49.339 | 39670.64 | 70.744 | -21.405 |

119 | 80.941 | 44597.10 | 78.914 | 2.027 |

120 | 72.396 | 35988.01 | 64.636 | 7.760 |

121 | 58.556 | 34153.73 | 61.594 | -3.038 |

122 | 39.613 | 36545.12 | 65.560 | -25.947 |

123 | 80.884 | 45296.84 | 80.075 | 0.809 |

124 | 81.701 | 45741.06 | 80.811 | 0.890 |

125 | 74.143 | 36216.49 | 65.015 | 9.128 |

126 | 78.400 | 44581.58 | 78.888 | -0.488 |

127 | 52.517 | 30443.37 | 55.440 | -2.923 |

128 | 70.616 | 38726.46 | 69.178 | 1.438 |

129 | 58.420 | 29459.46 | 53.808 | 4.612 |

130 | 69.819 | 42554.78 | 75.527 | -5.708 |

131 | 73.923 | 38508.25 | 68.816 | 5.107 |

132 | 71.777 | 39272.82 | 70.084 | 1.693 |

133 | 51.542 | 30238.20 | 55.100 | -3.558 |

134 | 79.425 | 45211.81 | 79.934 | -0.509 |

135 | 78.242 | 46329.80 | 81.788 | -3.546 |

136 | 76.384 | 40257.75 | 71.717 | 4.667 |

137 | 73.747 | 40575.07 | 72.244 | 1.503 |

138 | 74.249 | 33876.70 | 61.134 | 13.115 |

139 | 73.422 | 34807.76 | 62.678 | 10.744 |

140 | 62.698 | 33580.82 | 60.644 | 2.054 |

141 | 42.384 | 31042.18 | 56.433 | -14.049 |

142 | 43.487 | 26718.29 | 49.262 | -5.775 |

Let’s inspect the individual columns in Table 11.5 and match them with the elements in Figure 11.7:

- The
`life_exp`

column represents the observed outcome variable \(y\). This is the y-position of the 142 solid black dots in Figure 11.7. - The
`income`

column represents the values of the explanatory variable \(x\). This is the x-position of the 142 solid black dots in Figure 11.7. - The
`life_exp_hat`

column represents the fitted values \(\widehat{y}\). This is the y-position of the 142 hollow black circles on the regression line in Figyre 11.7. - The
`residual`

column represents the residuals \(y - \widehat{y}\). This is the 142 vertical distances between the 142 solid black dots and their corresponding hollow black circles on the regression line.

Finding the best-fitting regression line
is equivalent to finding a line that minimizes the *residuals*,
or the distances from all the data points to the regression line combined.

In practice, we first square all the residuals and then sum the squares,
resulting in a quantity commonly known as the *sum of squared residuals*.
In an ideal world where the regression line fits all the points perfectly,
the sum of squared residuals would be 0.
This is because if the regression line fits all the points perfectly,
then the fitted value \(\widehat{y}\) equals the observed value \(y\) in all cases,
and hence the residual \(y-\widehat{y}\) = 0 in all cases,
and the sum of even a large number of 0’s is still 0.

In reality, however, the sum of squared residuals is almost never zero, even for the best-fitting regression line. Of all possible lines we can draw through the cloud of 142 points in Figure 11.3, the best-fitting regression line minimizes the sum of the squared residuals:

\[ \sum_{i=1}^{n}(y_i - \widehat{y}_i)^2 \]

Let’s use our data wrangling tools from Chapter 3 to compute the sum of squared residuals for the best-fitting line in the current example:

```
# Fit regression model:
health_model <- lm(life_exp ~ income, data = df_gapminder2007)
health_model <- lm(score ~ bty_avg,
data = evals_ch5)
# Get regression points:
regression_points <- moderndive::get_regression_points(health_model)
regression_points
# Compute sum of squared residuals
regression_points %>%
dp$mutate(squared_residuals = residual^2) %>%
dp$summarize(sum_of_squared_residuals = sum(squared_residuals))
```

```
# A tibble: 142 x 5
ID life_exp income life_exp_hat residual
<int> <dbl> <dbl> <dbl> <dbl>
1 1 43.8 29888. 54.5 -10.7
2 2 76.4 37736. 67.5 8.89
3 3 72.3 37940. 67.9 4.43
4 4 42.7 36810. 66.0 -23.3
5 5 75.3 41065. 73.1 2.26
6 6 81.2 45370. 80.2 1.04
7 7 79.8 45578. 80.5 -0.712
8 8 75.6 44742. 79.2 -3.52
9 9 64.1 31434. 57.1 6.98
10 10 79.4 45275. 80.0 -0.598
# … with 132 more rows
# A tibble: 1 x 1
sum_of_squared_residuals
<dbl>
1 7102.
```

Any other straight line drawn in Figure 11.7
would yield a sum of squared residuals greater than 7102.
This is a mathematically guaranteed fact
that you can prove using calculus and linear algebra.
That’s why a best-fitting linear regression line is also called
a *least-squares line*.

*Good to know*

**Why do we square the residuals (i.e., the arrow lengths)?**

By definition, residuals are calculated as \(y - \hat{y}\). Therefore, some residuals are positive (\(y > \hat{y}\)) whereas some are negative (\(y < \hat{y}\)). To make sure that both positive and negative residuals are treated equally, we can either use the absolute magnitude of residuals and add them up (Equation (11.1)) or square residuals and then add them up (Equation (11.2)).

\[ \lvert{y_1 - \hat{y}_1} \rvert + \lvert{y_2 - \hat{y}_2} \rvert + \cdots + \lvert{y_n - \hat{y}_n} \rvert \tag{11.1} \]

\[ (y_1 - \widehat{y}_1)^2 + (y_2 - \widehat{y}_2)^2 + \cdots + (y_n - \widehat{y}_n) \tag{11.2} \]

In the end, the second method, taking squares of residuals won. Because sum of squared residuals are differentiable, whereas sum of absolute values are not. In other words, sum of squares can be minimized analytically, without a computer.

*Learning check*

**(LC11.1)**
Note in Figure 11.8 there are 3 points marked with dots and:

- The “best” fitting solid regression line in blue
- An arbitrarily chosen dotted red line
- Another arbitrarily chosen dashed green line

Compute the sum of squared residuals by hand for each line and show that of these three lines, the regression line in blue has the smallest value.

## 11.4 A case study

#### Problem statement

Why do some professors and instructors at universities and colleges receive high teaching evaluations scores from students whereas others receive lower ones? Are there differences in teaching evaluations between instructors of different demographic groups? Could there be an impact due to student biases? These are all questions that are of interest to university/college administrators, as teaching evaluations are among the many criteria considered in determining which instructors and professors get promoted. Researchers at the University of Texas in Austin, Texas (UT Austin) tried to answer the following research question: what factors explain differences in instructor teaching evaluation scores? To this end, they collected instructor and course information on 463 course.

In this exercise, we will try to explain differences
in instructor teaching scores as a function of one numerical variable:
the instructor’s “beauty” score.
Could it be that instructors with higher “beauty” scores
also have higher teaching evaluations?
Could it be instead that instructors with higher “beauty” scores
tend to have lower teaching evaluations?
Or could it be that there is no relationship between “beauty” score
and teaching evaluations?
We’ll answer these questions by modeling the relationship between teaching scores
and “beauty” scores using *simple linear regression* where we have:

A numerical outcome variable \(y\) (the instructor’s teaching score) and

A single numerical explanatory variable \(x\) (the instructor’s “beauty” score).

### 11.4.1 Exploring the data

The data on the 463 courses at UT Austin
can be found in the `evals`

data frame included in the `moderndive`

package.
However, to keep things simple,
Let’s only `select()`

the subset of the variables
you need to consider in this example,
and save this data in a new data frame called `evals_simple`

:

#### Eyeballing the data

```
Rows: 463
Columns: 4
$ ID <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18…
$ score <dbl> 4.7, 4.1, 3.9, 4.8, 4.6, 4.3, 2.8, 4.1, 3.4, 4.5, 3.8, 4.5, 4…
$ bty_avg <dbl> 5.000, 5.000, 5.000, 5.000, 3.000, 3.000, 3.000, 3.333, 3.333…
$ age <int> 36, 36, 36, 36, 59, 59, 59, 51, 51, 40, 40, 40, 40, 40, 40, 4…
```

Observe that `Observations: 463`

indicates that
there are 463 rows/observations in `evals_simple`

,
where each row corresponds to one observed course at UT Austin.
It is important to note that the *observational unit*
is an individual course and not an individual instructor.
Since some instructors teach more than one course in an academic year,
the same instructor may appear more than once in the data.
Hence there are fewer than 463 unique instructors
being represented in `evals_simple`

.

A full description of all the variables included in `evals`

can be found at openintro.org
or by reading the associated help file (run `?moderndive::evals`

in the console).
Below is a list of descriptions for the 4 variables
we have selected in `evals_simple`

:

`ID`

: An identification variable used to distinguish between the 1 through 463 courses in the dataset.`score`

: A numerical variable of the course instructor’s average teaching score, where the average is computed from the evaluation scores from all students in that course. Teaching scores of 1 are lowest and 5 are highest. This is the outcome variable \(y\) of interest.`bty_avg`

: A numerical variable of the course instructor’s average “beauty” score, where the average is computed from a separate panel of six students. “Beauty” scores of 1 are lowest and 10 are highest. This is the explanatory variable \(x\) of interest.`age`

: A numerical variable of the course instructor’s age. This will be another explanatory variable \(x\) that will be used in an exercise.

An alternative way to look at the raw data values
is by choosing a random sample of the rows in `evals_simple`

by piping it into the `sample_n()`

function from the `dplyr`

package.
Here we set the `size`

argument to be `5`

,
indicating that we want a random sample of 5 rows.
We display the results below.
Note that due to the random nature of the sampling,
you will likely end up with a different subset of 5 rows.

ID | score | bty_avg | age |
---|---|---|---|

356 | 5.0 | 3.333 | 50 |

327 | 3.8 | 2.333 | 64 |

263 | 4.8 | 3.167 | 52 |

84 | 4.2 | 4.167 | 45 |

369 | 4.2 | 3.333 | 38 |

#### Summary statistics — univariate

Now that we’ve looked at the raw values in our `evals_simple`

data frame
and got a preliminary sense of the data,
let’s continue with computing summary statistics for the target variables —
an instructor’s teaching `score`

on a course
and the instructor’s “beauty” score denoted as `bty_avg`

.
As before, we will use the customized function `my_skim()`

.

If you are coming back to this section since last time you have created `my_skim()`

,
you will need to run the next chunk first.

```
# Create a template function for descriptives
my_skim <- skimr::skim_with(base = skimr::sfl(n = length, missing = skimr::n_missing),
numeric = skimr::sfl(
mean,
sd,
iqr = IQR,
min,
p25 = ~ quantile(., 1/4),
median,
p75 = ~ quantile(., 3/4),
max
),
append = FALSE
) #sfl stands for "skimr function list"
```

skim_variable | n | missing | mean | sd | iqr | min | p25 | median | p75 | max |
---|---|---|---|---|---|---|---|---|---|---|

score | 463 | 0 | 4.17 | 0.54 | 0.80 | 2.30 | 3.80 | 4.30 | 4.6 | 5.00 |

bty_avg | 463 | 0 | 4.42 | 1.53 | 2.33 | 1.67 | 3.17 | 4.33 | 5.5 | 8.17 |

Looking at this output, we can see how the values of both variables distribute. For example, the mean teaching score was 4.17 out of 5, whereas the mean “beauty” score was 4.42 out of 10. The boxplots below complement the information in Table 11.7.

```
gg$ggplot(evals_simple,
# `x = ""` is necessary when there is only one vector to plot
gg$aes(x = "", y = score)) +
gg$geom_boxplot() +
# add the mean to the boxplot
gg$stat_summary(fun=mean, geom="point", shape=5, size=4) +
gg$geom_jitter(position = gg$position_jitter(width = 0.1, height = 0), alpha = 0.2) +
# remove x axis title
gg$theme(axis.title.x = gg$element_blank()
)
```

```
gg$ggplot(evals_simple,
# `x = ""` is necessary when there is only one vector to plot
gg$aes(x = "", y = bty_avg)) +
gg$geom_boxplot() +
# add the mean to the boxplot
gg$stat_summary(fun=mean, geom="point", shape=5, size=4) +
gg$geom_jitter(position = gg$position_jitter(width = 0.1, height = 0), alpha = 0.2) +
# remove x axis title
gg$theme(axis.title.x = gg$element_blank()
)
```

#### Summary statistics — bivariate

```
gg$ggplot(evals_simple, gg$aes(x = bty_avg, y = score)) +
gg$geom_point() +
gg$labs(x = "Beauty Score",
y = "Teaching Score")
```

Observe that most “beauty” scores lie between 2 and 8, while most teaching scores lie between 3 and 5. Furthermore, the relationship between teaching score and “beauty” score seems “weakly positive.”

Computing the correlation coefficient confirms this intuition:

```
# A tibble: 1 x 1
correlation
<dbl>
1 0.187
```

### 11.4.2 Simple linear regression

Recall that there are two steps involved in fitting a simple linear regression model to the data:

“fit” the linear regression model using the

`lm()`

function and save it in`score_model`

.get the regression table by applying the

`get_regression_table()`

function from the`moderndive`

package to`score_model`

```
# Fit regression model:
score_model <- lm(score ~ bty_avg, data = evals_simple)
# Get regression table:
moderndive::get_regression_table(score_model, digits = 3, print = TRUE)
```

term | estimate | std_error | statistic | p_value | lower_ci | upper_ci |
---|---|---|---|---|---|---|

intercept | 3.880 | 0.076 | 50.961 | 0 | 3.731 | 4.030 |

bty_avg | 0.067 | 0.016 | 4.090 | 0 | 0.035 | 0.099 |

When we say “fit,” we mean “find the best fitting line to this data.”
`lm()`

stands for “linear model” and is used as follows:
`lm(y ~ x, data = data_frame_name)`

where:

`y`

is the outcome variable, followed by a tilde`~`

. In our case,`y`

is set to`score`

, which represents the teaching score an instructor received for a given course.`x`

is the explanatory variable. In our case,`x`

is set to an instructor’s “beauty” score, or`bty_avg`

.The combination of

`y ~ x`

is called a*model formula*. (Note the order of`y`

and`x`

.) In our case, the model formula is`score ~ bty_avg`

.`data_frame_name`

is the name of the data frame that contains the variables`y`

and`x`

. In our case,`data_frame_name`

is the`evals_simple`

data frame.

The `get_regression_table()`

function takes a saved `lm()`

linear regression model
as input and returns a data frame of the regression table as output.
If you’re interested in learning more
about the `get_regression_table()`

function’s inner workings,
check out Section 11.5.2.

#### Visualize

Let’s build on the scatterplot in Figure 11.11
by adding a “best-fitting” line:
of all possible lines we can draw on this scatterplot,
it is the line that “best” fits through the cloud of points.
We do this by adding a new `geom_smooth(method = "lm", se = FALSE)`

layer
to the `ggplot()`

code that created the scatterplot
in Figure 11.11.
The `method = "lm"`

argument sets the line to be a “`l`

inear `m`

odel.”
The `se = FALSE`

argument suppresses *standard error* uncertainty bars.

```
gg$ggplot(evals_simple, gg$aes(x = bty_avg, y = score)) +
gg$geom_point() +
gg$labs(x = "Beauty Score", y = "Teaching Score",
title = "Relationship between teaching and beauty scores") +
gg$geom_smooth(method = "lm", se = FALSE)
```

The line in the resulting Figure 11.12
is called a “regression line.”
The regression line is a visual summary of the relationship
between two numerical variables,
in our case the outcome variable `score`

and the explanatory variable `bty_avg`

.
The positive slope of the blue line is consistent
with our earlier observed correlation coefficient of 0.187
suggesting that there is a positive relationship between these two variables:
as instructors have higher “beauty” scores,
so also do they receive higher teaching evaluations.
Furthermore, a regression line is “best-fitting” in that
it minimizes the sum of squared residuals.

### 11.4.3 Interpretation

In the `estimate`

column of Table 11.8
are the intercept \(b_0\) = 3.88
and the slope \(b_1\) = 0.07 for `bty_avg`

.
Thus the equation of the regression line follows:

\[ \begin{aligned} \widehat{y} &= b_0 + b_1 \cdot x\\ \widehat{\text{score}} &= b_0 + b_{\text{bty}\_\text{avg}} \cdot\text{bty}\_\text{avg}\\ &= 3.88 + 0.067\cdot\text{bty}\_\text{avg} \end{aligned} \]

The intercept \(b_0\) = 3.88
is the average teaching score \(\widehat{y}\) = \(\widehat{\text{score}}\)
for those courses where the instructor had a “beauty” score `bty_avg`

of 0.
Or in graphical terms,
it’s where the line intersects the \(y\) axis when \(x\) = 0.
Note, however, that while the intercept of the regression line
has a mathematical interpretation, it has no *practical* interpretation here,
given that the average of six panelists’ “beauty” scores range from 1 to 10
and observing a `bty_avg`

of 0 is impossible.
Furthermore, looking at the scatterplot with the regression line
in Figure 11.1,
no instructors had a “beauty” score anywhere near 0.

Of greater interest is the slope \(b_1\) = \(b_{\text{bty_avg}}\) for `bty_avg`

of 0.067,
as this summarizes the relationship
between the teaching and “beauty” score variables.
Note that the sign is positive,
suggesting a positive relationship between these two variables,
meaning teachers with higher beauty scores also tend to have higher teaching scores.
The slope’s interpretation is:

For every increase of 1 unit in

`bty_avg`

, there is anassociatedincrease of,on average, 0.07 units of`score`

.

We only state that there is an *associated* increase
and not necessarily a *causal* increase.
For example, perhaps it’s not that higher “beauty” scores
directly cause higher teaching scores per se.
Instead, the following could hold true:
individuals from wealthier backgrounds
tend to have stronger educational backgrounds
and hence have higher teaching scores,
while at the same time these wealthy individuals
also tend to have higher “beauty” scores.
In other words, just because two variables are strongly associated,
it doesn’t necessarily mean that one causes the other.
This is summed up in the often quoted phrase,
“correlation is not necessarily causation.”

Furthermore, we say that this associated increase is *on average*
0.07 units of teaching `score`

,
because you might have two instructors whose `bty_avg`

scores differ by 1 unit,
but their difference in teaching scores won’t necessarily be exactly
0.07.
What the slope of 0.07 is saying
is that across all possible courses,
the *average* difference in teaching score
between two instructors whose “beauty” scores differ by one
is 0.07.

*Learning check*

**(LC11.2)**
Conduct a new linear regression analysis
with the same outcome variable \(y\) being `score`

but with `age`

as the new explanatory variable \(x\).
Remember, this involves the following steps:

- Eyeballing the raw data
- Creating data visualizations
- Computing summary statistics, both univariate and bivariate. What can you say about the relationship between age and teaching scores?
- Fitting a simple linear regression
where
`age`

is the explanatory variable \(x\) and`score`

being the outcome variable \(y\) - Interpreting results of the “best-fitting” line based on informtion in the regression table

How do the regression results match up with the results from the first three steps?

## 11.6 Conclusion

### 11.6.1 Additional resources

As we suggested in Subsection 11.1, interpreting coefficients that are not close to the extreme values of -1, 0, and 1 can be somewhat subjective. To help develop your sense of correlation coefficients, we suggest you play the 80s-style video game called, “Guess the Correlation,” at http://guessthecorrelation.com/.

### 11.6.2 What’s to come?

In this chapter, we have introduced *simple linear regression*,
where we fit models that have one continuous explanatory variable.
We also demonstrated how to interpret resutls from a regression table
such as Table 11.3.
However, we have only focused on the two leftmost columns: `term`

and `estimate`

.
In the next chapter, we will pick up where we left off
and focus on the remaining columns including `std_error`

and `p_value`

.