── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.5
✔ forcats 1.0.0 ✔ stringr 1.5.1
✔ ggplot2 3.5.1 ✔ tibble 3.2.1
✔ lubridate 1.9.3 ✔ tidyr 1.3.1
✔ purrr 1.0.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
Attaching package: 'rstatix'
The following object is masked from 'package:stats':
filter
Attaching package: 'janitor'
The following object is masked from 'package:rstatix':
make_clean_names
The following objects are masked from 'package:stats':
chisq.test, fisher.test
here() starts at /Users/mtaliaferro/Documents/GitHub/molb-7950
Problem Set Stats Bootcamp - class 13
Hypothesis Testing
biochem <- read_tsv("http://mtweb.cs.ucl.ac.uk/HSMICE/PHENOTYPES/Biochemistry.txt", show_col_types = FALSE) |>
janitor::clean_names()
# simplify names a bit more
colnames(biochem) <- gsub(pattern = "biochem_", replacement = "", colnames(biochem))
# we are going to simplify this a bit and only keep some columns
keep <- colnames(biochem)[c(1, 6, 9, 14, 15, 24:28)]
biochem <- biochem[, keep]
# get weights for each individual mouse
# careful: did not come with column names
weight <- read_tsv("http://mtweb.cs.ucl.ac.uk/HSMICE/PHENOTYPES/weight", col_names = F, show_col_types = FALSE)
# add column names
colnames(weight) <- c("subject_name", "weight")
# add weight to biochem table and get rid of NAs
# rename gender to sex
b <- inner_join(biochem, weight, by = "subject_name") |>
na.omit() |>
rename(sex = gender)
Problem # 1
Is there an association between mouse calcium and sodium levels?
1. Make a scatterplot to inspect variable — (2 pts)
ggscatter(
data = b,
y = "calcium",
x = "sodium"
)
2. Are they normal (enough)? — (1 pts)
ggqqplot(data = b, x = "calcium")
b |> shapiro_test(calcium)
# A tibble: 1 × 3
variable statistic p
<chr> <dbl> <dbl>
1 calcium 0.995 0.0000108
ggqqplot(data = b, x = "sodium")
b |> shapiro_test(sodium)
# A tibble: 1 × 3
variable statistic p
<chr> <dbl> <dbl>
1 sodium 0.995 0.00000432
Which test will you use and why?
I will use the Pearson since the qqplots look ok and there are ~1700 observations. Spearman is fine too - especially since sodium looks a little like an integer and not continuous.
3. Declare null hypothesis \(\mathcal{H}_0\) — (1 pts)
\(\mathcal{H}_0\) is that there is no dependency/association between \(calcium\) and \(sodium\)
4. Calculate and plot the correlation on a scatterplot — (2 pts)
ggscatter(
data = b,
y = "calcium",
x = "sodium"
) +
stat_cor(
method = "pearson",
label.x = 110,
label.y = 2.4
)
Problem # 2
Do mouse calcium levels explain mouse sodium levels? If so, to what extent?
Use a linear model to do the following:
1. Specify the Response and Explanatory variables — (2 pts)
The response variable y is sodium The explantory variable x is calcium
2. Declare the null hypothesis — (1 pts)
The null hypothesis is calcium levels do not explain/predict sodium levels.
3. Use the lm
function to create a fit (linear model) — (1 pts)
also save the slope and intercept for later
fit <- lm(formula = sodium ~ 1 + calcium, data = b)
int <- tidy(fit)[1, 2]
slope <- tidy(fit)[2, 2]
4. Add residuals to the data and create a plot visualizing the residuals — (1 pts)
b_fit <- augment(fit, data = b)
avg_sod <- mean(b$sodium)
ggplot(
data = b_fit,
aes(x = calcium, y = sodium)
) +
geom_point(size = 1, aes(color = .resid)) +
geom_abline(
intercept = pull(int),
slope = pull(slope),
col = "red"
) +
scale_color_gradient2(
low = "blue",
mid = "black",
high = "yellow"
) +
geom_segment(
aes(
xend = calcium,
yend = .fitted
),
alpha = .1
) + # plot line representing residuals
theme_linedraw()
5. Calculate the \(R^2\) and compare to \(R^2\) from fit — (2 pts)
\(R^2 = 1 - \displaystyle \frac {SS_{fit}}{SS_{null}}\)
\(SS_{fit} = \sum_{i=1}^{n} (data - line)^2 = \sum_{i=1}^{n} (y_{i} - (\beta_0 \cdot 1+ \beta_1 \cdot x)^2\)
\(SS_{null}\) — sum of squared errors around the mean of \(y\)
\(SS_{null} = \sum_{i=1}^{n} (data - mean)^2 = \sum_{i=1}^{n} (y_{i} - \overline{y})^2\)
6. Using \(R^2\), describe the extent to which calcium explains sodium levels — (2 pts)
\(calcium\) explains ~68% of variation in \(sodium\) levels
7. Report (do not calculate) the \(p-value\) and your decision on the null — (1 pts)
fit |>
glance() |>
select(p.value)
# A tibble: 1 × 1
p.value
<dbl>
1 0
The null hypothesis is calcium levels do not explain/predict sodium levels IS NOT SUPPORTED
Calcium levels to predict sodium levels.
Problem # 3
What is the association between mouse weight and age levels for different sexes?
1. Calculate the pearson correlation coefficient between weight and age for females and males — (2 pts)
b |>
group_by(factor(sex)) |>
cor_test(weight, age)
# A tibble: 2 × 9
`factor(sex)` var1 var2 cor statistic p conf.low conf.high method
<fct> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>
1 F weight age 0.21 6.30 4.67e-10 0.143 0.269 Pearson
2 M weight age 0.37 12.0 1.22e-30 0.314 0.427 Pearson
2. Describe your observations — (2 pts)
The relationship between weight and age is stronger for males than it is for females.