6 Applications
Load the data we created in our Chapter 4 example.
Code
library(tidyverse)  # provides %>%, dplyr, tidyr, and ggplot2 used below
x <- readr::read_csv("./headache-variables.csv", show_col_types = FALSE)
With my 14 weeks of data, we can do a few basic calculations:
How frequent are my headaches? Simply total the number of headaches and divide by the number of days:
Code
# headaches per day
sum(x$headache) / length(x$headache)
[1] 0.08163265
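Because headache is a TRUE/FALSE column, the same rate can also be read off as the mean of the column. A minimal sketch, using a stand-in vector built from the counts in this chapter (8 headache days out of 98) rather than the real file:

```r
# Stand-in for x$headache: 8 headache days out of 98 (14 weeks); order is arbitrary
headache <- c(rep(TRUE, 8), rep(FALSE, 90))

# TRUE counts as 1 and FALSE as 0, so these two lines give the same rate
rate_by_division <- sum(headache) / length(headache)
rate_by_mean <- mean(headache)

print(rate_by_mean)
# [1] 0.08163265
```

That works out to roughly one headache every 12 days.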
6.1 Hypothesis
With the data collected and in a nice dataframe format, we can start to ask what might be driving the headaches. One of the first suspected culprits might be something that I eat.
Based on the data collected so far, can I make any guesses about what might be driving my headaches?
The most obvious place to check is whether I see any patterns on the days when I have headaches. Let’s filter for headache days only:
Code
x %>% filter(headache) %>% kableExtra::kable() %>%
  kableExtra::kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
date | headache | icecream | z | wine |
---|---|---|---|---|
2022-07-30 | TRUE | TRUE | 8.478644 | 0 |
2022-08-14 | TRUE | FALSE | 6.880779 | 0 |
2022-08-21 | TRUE | FALSE | 6.909671 | 0 |
2022-09-08 | TRUE | FALSE | 7.278277 | 0 |
2022-09-16 | TRUE | FALSE | 7.797994 | 0 |
2022-10-02 | TRUE | TRUE | 8.385846 | 0 |
2022-10-05 | TRUE | FALSE | 5.849701 | 0 |
2022-10-22 | TRUE | FALSE | 7.679254 | 0 |
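But maybe the headache takes a day or two to kick in. We can group the data into seven equal periods and see if we can spot any patterns in headache frequency. (With 14 weeks of data, ntile(date, 7) makes each period two weeks long, even though the column is labeled week.)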
Code
x %>% group_by(week = ntile(date,7)) %>%
  summarise(headaches = sum(headache),
            alcohol = sum(wine),
            icecream = sum(icecream)) %>% kableExtra::kable() %>%
  kableExtra::kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
week | headaches | alcohol | icecream |
---|---|---|---|
1 | 1 | 0 | 4 |
2 | 1 | 1 | 0 |
3 | 1 | 3 | 0 |
4 | 1 | 2 | 1 |
5 | 1 | 4 | 1 |
6 | 2 | 3 | 3 |
7 | 1 | 3 | 0 |
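To check how ntile() carves up the dates, here is a small sketch on a made-up run of 98 consecutive days. The start date is an assumption chosen for illustration; only the number of days matters.

```r
library(dplyr)

# Hypothetical start date; only the number of days (98) matters here
dates <- seq(as.Date("2022-07-17"), by = "day", length.out = 98)

# ntile() ranks the dates and splits them into 7 equally sized bins
bins <- ntile(dates, 7)

# Count how many days land in each bin
days_per_bin <- table(bins)
print(days_per_bin)
```

Each of the 7 bins holds 14 days. Passing 14 instead of 7 would give one bin per calendar week, though the totals would then differ from the table shown here.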
By simply eyeballing the data this way, you might see a pattern. For example, you might spot a week or two with an unusually large number of headaches and notice those weeks are accompanied by an unusually large consumption of some particular food.
But how do you know you’re not just guessing? What looks like a pattern might be a coincidence. To find out with more certainty, we will apply some statistics.
6.2 T-Testing
Hint: an Excel version of this exercise is in Section 12.1.
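The simplest test is called a “t-test”. This is a formula that compares the averages of two lists of numbers and returns a p-value: the probability of seeing a difference at least this large if the two lists really had the same underlying average. The lists do not have to be the same length, although ours are.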
What are the chances that the number of headaches per week is related to the amount of ice cream I eat per week?
If ice cream and headaches rose and fell together, I’d expect that over the weeks in this period the number of headaches and the number of ice cream days would be roughly equal. The t-test below checks whether the averages of these two columns differ by more than chance would explain.
Code
x_week <- x %>% group_by(week = ntile(date,7)) %>%
  summarise(headaches = sum(headache),
            alcohol = sum(wine),
            icecream = sum(icecream))
x_week %>% kableExtra::kable() %>%
  kableExtra::kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
week | headaches | alcohol | icecream |
---|---|---|---|
1 | 1 | 0 | 4 |
2 | 1 | 1 | 0 |
3 | 1 | 3 | 0 |
4 | 1 | 2 | 1 |
5 | 1 | 4 | 1 |
6 | 2 | 3 | 3 |
7 | 1 | 3 | 0 |
Code
with(x_week, t.test(headaches,icecream))[["p.value"]]
[1] 0.8254265
By convention, a p-value less than 0.05 (that is, less than 5%) is considered statistically significant. While this is not a hard and fast rule, it’s often a good place to start. A p-value greater than this, like the 0.83 here, means the difference between the two columns could easily be due to chance.
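To see what goes into that p-value, the same test can be rerun on the totals from the table above, typed in by hand. This is only a sketch for inspection; the vectors are copied from the table rather than read from the data file.

```r
# Totals per period, copied from the table above
headaches <- c(1, 1, 1, 1, 1, 2, 1)
icecream  <- c(4, 0, 0, 1, 1, 3, 0)

# Two-sample t-test; R's default is the Welch version (unequal variances)
result <- t.test(headaches, icecream)

# Average of each column, the t statistic, and the p-value
print(result$estimate)
print(result$statistic)
print(result$p.value)
```

The two averages are 8/7 (about 1.14) and 9/7 (about 1.29). That gap is small compared with how much the ice cream counts jump around from one period to the next, which is why the p-value comes out so high.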
6.3 Data visualization
The first step in a more sophisticated analysis is to plot the data to see if we can spot any particular patterns.
Code
x_week %>% pivot_longer(names_to = "activity",
                        values_to = "amount",
                        cols = alcohol:icecream ) %>%
  ggplot(aes(x=week,y=headaches)) +
  geom_bar(aes(x=week,y=amount, fill = activity),
           position = "dodge",
           stat = "identity") +
  geom_line(aes(x=week,y=headaches))
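The pivot_longer() step is what lets one geom_bar() call draw side-by-side bars for each activity. A minimal sketch of the reshaping on its own, with x_week rebuilt by hand from the table in this chapter rather than from the data file:

```r
library(tidyr)
library(tibble)

# Totals copied from the table in this chapter
x_week <- tibble(
  week      = 1:7,
  headaches = c(1, 1, 1, 1, 1, 2, 1),
  alcohol   = c(0, 1, 3, 2, 4, 3, 3),
  icecream  = c(4, 0, 0, 1, 1, 3, 0)
)

# Stack the alcohol and icecream columns into activity/amount pairs
x_long <- pivot_longer(x_week,
                       cols = alcohol:icecream,
                       names_to = "activity",
                       values_to = "amount")

print(head(x_long, 4))
```

The result has two rows per period, one for alcohol and one for icecream, with the headaches value repeated on both. The fill = activity aesthetic then colors the bars by that new column, while the line is drawn from the repeated headaches values.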