MOLB 7950 – Stats Bootcamp

Learning objectives

Familiarity with probabilities, distributions, and descriptive statistics
Perform exploratory data analysis
Know which statistical methods are appropriate for your data
Understand and execute different statistical approaches
“You can’t be neutral on a moving train”

Class overview

Motivation for why stats are important
History of stats

Why do we need to know statistics?

As scientists we need to assign confidence to our results.

Convince ourselves
Convince other scientists
Convince the public (additional challenges)

Stats + probability are not intuitive

See patterns where there are none
Miss patterns we are NOT expecting
We tend to be overconfident
Bad at Bayesian thinking
Don’t do well with dependencies
Don’t update expectations with new evidence

See patterns where there are none

Cross-section of male nematode worm Ascaris

  [1] "T" "T" "T" "H" "H" "H" "H" "T" "T" "H" "T" "T" "H" "H" "T" "T" "T" "H"
 [19] "T" "H" "H" "T" "H" "T" "H" "T" "H" "T" "H" "T" "H" "T" "T" "T" "H" "H"
 [37] "H" "T" "H" "H" "H" "T" "T" "H" "T" "T" "T" "T" "H" "T" "H" "T" "T" "T"
 [55] "T" "T" "H" "T" "T" "T" "H" "T" "H" "T" "H" "H" "H" "T" "T" "T" "T" "H"
 [73] "H" "T" "T" "T" "T" "T" "H" "T" "T" "H" "H" "T" "T" "T" "H" "T" "H" "H"
 [91] "T" "T" "H" "T" "H" "T" "T" "T" "H" "T"

Need help! My friend thinks coin flips are 50-50!
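A fair coin tends to produce streaks that look suspicious to the human eye. The sketch below is one way to check this; it is not part of the original slides. The seed and the variable names `flips`, `runs`, and `longest_run` are assumptions chosen for illustration.

```r
# Simulate 100 fair coin flips and measure the longest streak.
set.seed(1) # arbitrary seed, used only so the result is reproducible
flips <- sample(c("H", "T"), size = 100, replace = TRUE)

# rle() collapses consecutive repeats into (value, length) pairs
runs <- rle(flips)
longest_run <- max(runs$lengths)
longest_run

# Proportion of heads: usually close to, but rarely exactly, 0.5
mean(flips == "H")
```

Re-running with different seeds gives a feel for how long streaks get in purely random data.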

Miss existing patterns

ALWAYS visualize your data.

Let’s flip a fair coin

Flip a coin 5 times with equal prob of H or T

rbinom(n = 5, size = 1, prob = .5)

[1] 1 1 0 0 1

Again

rbinom(n = 5, size = 1, prob = .5)

[1] 1 0 1 1 0

Set a seed

# fix the seed so the random draws below are reproducible
set.seed(33)
rbinom(n = 5, size = 1, prob = .5)

[1] 0 0 0 1 1

# same seed, same draws
set.seed(33)
rbinom(n = 5, size = 1, prob = .5)

[1] 0 0 0 1 1

Let’s flip an unfair coin

Flip a coin 5 times with unequal probabilities (prob = .2 of getting a 1)

rbinom(n = 5, size = 1, prob = .2)

[1] 0 0 0 0 0

Again

rbinom(n = 5, size = 1, prob = .2)

[1] 0 0 0 0 0

Let’s summarize the flipping results

Flip a fair coin 10 times

rbinom(n = 10, size = 1, prob = .5)

 [1] 1 1 0 0 1 1 0 0 0 1

Flip a fair coin 10 times and calculate the mean

rbinom(n = 10, size = 1, prob = .5) |>
  mean()

[1] 0.5

Again

rbinom(n = 10, size = 1, prob = .5) |>
  mean()

[1] 0.5

Unfair coin

rbinom(n = 10, size = 1, prob = .2) |>
  mean()

[1] 0.3

Unfair coin, again

rbinom(n = 10, size = 1, prob = .2) |>
  mean()

[1] 0.3

Let’s go wild flipping

Flip a fair coin 10 times and calculate mean. Then do 5 rounds of that experiment.

numFlips <- 10
numRounds <- 5

myFairTosses <- vector()

for (i in 1:numRounds) {
  myFairTosses[i] <- rbinom(n = numFlips, size = 1, prob = .5) |> mean()
}

myFairTosses

[1] 0.5 0.4 0.5 0.3 0.4

Same thing for an unfair coin.

myUnfairTosses <- vector()

for (i in 1:numRounds) {
  myUnfairTosses[i] <- rbinom(n = numFlips, size = 1, prob = .2) |> mean()
}

myUnfairTosses

[1] 0.1 0.3 0.5 0.4 0.3
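The two loops above repeat the same pattern, so they could also be written with `replicate()`. This is a minimal sketch of that alternative, not the approach used in the slides; the helper `flip_mean()` is hypothetical and introduced only for illustration.

```r
numFlips <- 10
numRounds <- 5

# Hypothetical helper: the mean of one round of n flips
flip_mean <- function(n, prob) {
  mean(rbinom(n = n, size = 1, prob = prob))
}

# replicate() re-evaluates the expression numRounds times
myFairTosses <- replicate(numRounds, flip_mean(numFlips, prob = .5))
myUnfairTosses <- replicate(numRounds, flip_mean(numFlips, prob = .2))

myFairTosses
myUnfairTosses
```

This avoids growing a vector inside a loop and keeps the fair and unfair experiments on one line each.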

Tidy and visualize flips

Make a data frame with the means and accompanying info

library(tidyverse) # provides tibble(), pivot_longer(), and ggplot()

allFlips <- tibble(
  fair = myFairTosses,
  unfair = myUnfairTosses
) |>
  pivot_longer(
    cols = c("fair", "unfair"),
    names_to = "cheating",
    values_to = "avg"
  )

allFlips |> top_n(5)

# A tibble: 6 x 2
  cheating   avg
  <chr>    <dbl>
1 fair       0.5
2 fair       0.4
3 fair       0.5
4 unfair     0.5
5 unfair     0.4
6 fair       0.4

Plot it

library(cowplot) # provides theme_cowplot()

ggplot(allFlips, aes(x = cheating, y = avg, color = cheating)) +
  geom_jitter() +
  stat_summary(
    fun = mean, geom = "point", shape = 18,
    size = 3, color = "black"
  ) +
  ylim(-0.05, 1.05) +
  geom_hline(yintercept = .5, linetype = "dashed") + # true mean fair
  geom_hline(yintercept = .2, linetype = "dashed") + # true mean unfair
  theme_cowplot()

Play around some more

numFlips <- 50
numRounds <- 5

myFairTosses <- vector()

for (i in 1:numRounds) {
  myFairTosses[i] <- rbinom(n = numFlips, size = 1, prob = .5) |> mean()
}

myUnfairTosses <- vector()

for (i in 1:numRounds) {
  myUnfairTosses[i] <- rbinom(n = numFlips, size = 1, prob = .2) |> mean()
}

tibble(
  fair = myFairTosses,
  unfair = myUnfairTosses
) |>
  pivot_longer(cols = c("fair", "unfair"), names_to = "cheating", values_to = "avg") |>
  ggplot(aes(x = cheating, y = avg, color = cheating)) +
  geom_jitter() +
  stat_summary(
    fun = mean, geom = "point", shape = 18,
    size = 3, color = "black"
  ) +
  ylim(-0.05, 1.05) +
  geom_hline(yintercept = .5, linetype = "dashed") + # true mean fair
  geom_hline(yintercept = .2, linetype = "dashed") + # true mean unfair
  theme_cowplot()
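Raising `numFlips` from 10 to 50 should pull the round means closer to the dashed lines. A standard result explains why: the mean of n independent flips with success probability p has standard deviation sqrt(p(1 - p) / n). The sketch below compares that formula with a simulation; the function name `se_mean`, the seed, and the number of simulated rounds are assumptions made for illustration.

```r
# Theoretical standard deviation of the mean of n flips
se_mean <- function(p, n) sqrt(p * (1 - p) / n)

se_mean(p = .5, n = c(10, 50)) # fair coin
se_mean(p = .2, n = c(10, 50)) # unfair coin

# Empirical check: many rounds of 50 fair flips
set.seed(42) # arbitrary seed
sim_means <- replicate(10000, mean(rbinom(n = 50, size = 1, prob = .5)))
sd(sim_means)
```

The simulated spread should land close to the theoretical value for n = 50.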

The Monty Hall Problem

Suppose you’re on a game show, and you’re given the choice of three doors: Behind one door is a car; behind the others, goats. You pick a door, say No. 1, and the host, who knows what’s behind the doors, opens another door, say No. 3, which has a goat. He then says to you, “Do you want to pick door No. 2?” Is it to your advantage to switch your choice? ~ (From Parade magazine’s Ask Marilyn column)

Pick a door, any door

Will you switch?

Switching doubles your probability of winning (from 1/3 to 2/3)

Say you choose Door #1

Behind Door 1	Behind Door 2	Behind Door 3	Result if STAY	Result if SWITCH
Car	Goat	Goat	Car	Goat
Goat	Car	Goat	Goat	Car
Goat	Goat	Car	Goat	Car

Simulation of Monty Hall Problem
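The table above can be checked by brute force. The following is a minimal sketch of such a simulation, assuming the host always opens a goat door that the contestant did not pick; the function `play_game()`, the seed, and the number of games are hypothetical choices, not taken from the slides.

```r
set.seed(7950) # arbitrary seed for reproducibility
numGames <- 10000

# Play one game and report whether staying or switching wins
play_game <- function() {
  doors <- 1:3
  car <- sample(doors, 1)
  pick <- sample(doors, 1)

  # The host opens a door that is neither the pick nor the car
  host_options <- setdiff(doors, c(pick, car))
  # Index explicitly: sample(x, 1) on a length-1 x would sample from 1:x
  opened <- host_options[sample.int(length(host_options), 1)]

  # Switching means taking the one remaining closed door
  switched <- setdiff(doors, c(pick, opened))
  c(stay = pick == car, switch = switched == car)
}

results <- replicate(numGames, play_game())

# Fraction of games won under each strategy
wins <- rowMeans(results)
wins
```

The `stay` fraction should come out near 1/3 and the `switch` fraction near 2/3, matching the three rows of the table.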

You can’t be neutral on a moving train

Howard Zinn: “To Be Neutral, To Be Passive In A Situation Is To Collaborate With Whatever Is Going On”

Zinn, a prolific writer and scholar, tore down the wall intended to separate activism, or partisanship, from the professed objectivity of scholarship. Instead Zinn told his students that he did not

“pretend to an objectivity that was neither possible nor desirable. ‘You can’t be neutral on a moving train,’ I would tell them...Events are already moving in certain deadly directions, and to be neutral means to accept that.”

Zinn is writing in response to the timeless questions that burn within anyone who cares about creating a more just society and world. Is change possible? Where will it come from? Can we actually make a difference? How do you remain hopeful?

Modern Statistics, Beer, and Eugenics

Stanford Eugenics History Project

From small beginnings: to build an anti-eugenic future

Fathers of statistics

Sir Francis Galton (1822-1911)
Karl Pearson (1857-1936)
Sir Ronald Aylmer Fisher (1890-1962)

Sir Francis Galton (1822-1911)

Discovered regression to the mean
Re-discovered correlation and regression and discovered how to apply these in anthropology, psychology, and more
Defined the concept of standard deviation
Established the field of Eugenics in 1883
Darwin’s cousin.

Galton’s reasoning for coining the term eugenics:

“We greatly want a brief word to express the science of improving stock, which…takes cognisance of all influences that tend in however remote a degree to give the more suitable races or strains of blood a better chance of prevailing speedily over the less suitable than they otherwise would have had.”

Karl Pearson (1857-1936)

Pearson was Galton’s protege and developed or contributed to:

Hypothesis testing
The use of p-values
The chi-squared test
The correlation coefficient
Principal component analysis

Pearson authored timeless “classics” such as:

The Woman’s Question

National Life from the standpoint of science

Pearson, eugenics, and anti-semitism

In the year Mein Kampf was published, Pearson wrote an article called:
THE PROBLEM OF ALIEN IMMIGRATION INTO GREAT BRITAIN, ILLUSTRATED BY AN EXAMINATION OF RUSSIAN AND POLISH JEWISH CHILDREN

“[they] will develop into a parasitic race…Taken on the average, and regarding both sexes, this alien Jewish population is somewhat inferior physically and mentally to the native population.”

Sir Ronald Aylmer Fisher (1890-1962)

Fisher’s work established many important methods of statistical inference.

The iris dataset
Establishing p = 0.05 as the normal threshold for significant p-values
Promoting Maximum Likelihood Estimation
Developing the ANalysis Of VAriance (ANOVA)
The Genetical Theory of Natural Selection, which blended the work of Mendel and Darwin.

Fisher and eugenics

There is no lack of evidence for Fisher’s strong and consistent support for eugenics. Here is an example from as late as 1954.

Letter from R.A. Fisher to R. Ruggles Gates. Ronald Fisher Archive. University of Adelaide.

Storytime, pt I

Galton founded the Eugenics Record Office (1904)
Galton Eugenics Laboratory within University College London (UCL). Created by Pearson and funded by Galton. (1907)
Galton left UCL enough money to create a Chair in National Eugenics, filled by Pearson and then Fisher.
Annals of Human Genetics was established in 1925 by Pearson as the Annals of Eugenics, and obtained its current name in 1954.
Galton laboratory was incorporated into the Department of Eugenics, Biometry and Genetics at UCL in 1944.

Storytime, pt II

Renamed to the Department of Human Genetics and Biometry in 1966.
Became part of the Department of Biology at UCL in 1996.
In 2020: UCL renames three facilities that honoured prominent eugenicists
These views did not appear to be common at UCL in the 1930s. For example, they were not held by JBS Haldane, Egon Pearson (son of Karl), and Lionel Penrose.

What about in the US?

Eugenics Archive

U.S. Scientists’ Role in the Eugenics Movement (1907–1939): A Contemporary Biologist’s Perspective

Charles Davenport (first director of CSHL) and the Carnegie Institution

Cold Spring Harbor and German Eugenics in the 1930s

Eugenics and the history of Science and AAAS

Government policy

from “America’s Shameful History of Eugenics and Forced Sterilizations”

Modern day: Eugenics and beyond

Sordid genealogies: a conjectural history of Cambridge Analytica’s eugenic roots

American Renaissance

The 5 “races”

‘Race’ cannot be biologically defined due to genetic variation among human individuals and populations. (A) The old concept of the “five races:” African, Asian, European, Native American, and Oceanian. (B) Actual genetic variation in humans.

Polygenic Traits, Human Embryos, and Eugenic Dreams

An academic study debunked the idea of “Screening Human Embryos for Polygenic Traits,” but the CEO of the company that Stephen Hsu cofounded announced that they had screened human embryos for polygenic traits.

The amoral nonsense of Orchid’s embryo selection

Superior: The Return of Race Science

https://en.wikipedia.org/wiki/Superior:_The_Return_of_Race_Science

Weapons of Math Destruction

https://en.wikipedia.org/wiki/Weapons_of_Math_Destruction

Just a product of their time

From: Statistics, Eugenics, and Me

Sitting idly by as this happens will make us ‘a product of their time’. This is not good enough. Data Science needs more regulation. Doctors have the Hippocratic Oath, why don’t we have the Nightingale Oath: “Manipulate no data nor results. Promote ethical uses of statistics. Only train models you understand. Don’t promote Eugenics”.

Stats Bootcamp - class 10
