7  Machine learning

7.1 Multiple regression

When you are tracking lots of variables, it can be difficult to spot exactly which ones are driving an effect.

In this example, we’ve fudged the data so that headaches always happen only on days with less than 6 hours of sleep.

Also, headache days always have high stress. But sometimes we have high stress even without a headache.

Code
library(tidyverse)
library(lubridate)

set.seed(1984)
WEEKS <- 140

z <- function(x){
m = NULL
for(i in 1:WEEKS){
  m = c(c(rep(0,6),
              floor(runif(1,min=0,max=3))),
        m)
}

m
}
tracking_table <- data.frame(date=seq(from = today()-weeks(WEEKS),
                     by = "1 day", length.out = 7*WEEKS),
            headache = sample(c(TRUE,FALSE), 7*WEEKS,
                              prob = c(.05,.95),
                              replace = TRUE),
            stress = sample(c("low","medium","high"),
                            size = 7*WEEKS,
                            replace = TRUE),        
            icecream = sample(c(TRUE,FALSE), 7*WEEKS,
                              prob = c(.10,.90),
                              replace = TRUE),
            z = runif(7*WEEKS, min = -2.5, max = .5) + 8,
            wine = z(0))

tracking_table$headache <- tracking_table$z<6  # make headaches on days with low sleep
# tracking_table[tracking_table$stress=="high"] <- "high"  # make stress on headache days

knitr::kable( head(tracking_table,10), digits = 2) %>%
  kableExtra::kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
date headache stress icecream z wine
2020-04-15 FALSE high FALSE 8.24 0
2020-04-16 FALSE high FALSE 6.60 0
2020-04-17 FALSE medium TRUE 7.16 0
2020-04-18 FALSE low FALSE 6.89 0
2020-04-19 FALSE high FALSE 6.04 0
2020-04-20 FALSE high FALSE 7.34 0
2020-04-21 TRUE high FALSE 5.74 0
2020-04-22 FALSE high FALSE 6.87 0
2020-04-23 FALSE low FALSE 7.23 0
2020-04-24 FALSE high FALSE 6.89 0

Now make a linear model using the 980 rows from our dataframe.

Code
m <- lm(headache~icecream+z+wine+stress, data = tracking_table)
summary(m)

Call:
lm(formula = headache ~ icecream + z + wine + stress, data = tracking_table)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.46762 -0.21394 -0.03199  0.16852  0.57443 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)   2.085871   0.073447  28.400   <2e-16 ***
icecreamTRUE  0.030793   0.028976   1.063    0.288    
z            -0.272406   0.010301 -26.445   <2e-16 ***
wine         -0.002729   0.019588  -0.139    0.889    
stresslow    -0.012375   0.022061  -0.561    0.575    
stressmedium -0.034324   0.021410  -1.603    0.109    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.2776 on 974 degrees of freedom
Multiple R-squared:  0.4216,    Adjusted R-squared:  0.4186 
F-statistic:   142 on 5 and 974 DF,  p-value: < 2.2e-16

Those triple dots (“***”) in the right-hand column for Coefficients indicate items of very high significance.

In this case, as expected, there seems to be a strong relationship between both sleep and stress.

A quick yet dramatic way to visualize this uses the stat_smooth function of ggplot:

Code
tracking_table %>% ggplot(aes(y=headache, x=z)) + geom_point() + stat_smooth(formula = y~x, method = "lm", color = "red") +
facet_grid(headache ~ stress)

We clearly see that headaches always occur on days when Z<6.

And here’s what happens if we change the data so headaches always happen when stress is high.

Code
tracking_table[tracking_table$stress == "high",]$headache <- TRUE

tracking_table %>% ggplot(aes(y=headache, x=z)) + geom_point() + stat_smooth(formula = y~x, method = "lm", color = "red") +
facet_grid(headache ~ . + stress) +
  labs(title = paste("Adj R2 = ",signif(summary(m)$adj.r.squared, 5),
                     "Intercept =",signif(m$coef[[1]],5 ),
                     " Slope =",signif(m$coef[[2]], 5),
                     " P =",signif(summary(m)$coef[2,4], 5)))

As expected, the plot shows headaches either when z<6 or when stress is high.