Problem Set 1 Key
Problem Set
Each problem below is worth 8 points.
The problem set is due 12pm on Aug 30.
Grading rubric
- Everything is good: 5 points
- Partially correct answers: 3-4 points
- Reasonable attempt: 2 points
Setup
Start by loading the libraries you need for the analysis below. When in doubt, start by loading the tidyverse package.
Question 1
Create 3 different vectors called x, y, and z:
- x should be a character vector of length 5 (hint: use LETTERS or letters)
- y should be a numeric vector of length 5 (hint: try 1:5 or c(1, 2, 3, 4, 5))
- z should be a logical vector of length 5 (hint: use TRUE and FALSE values)
Use length() to calculate the length of each vector.
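One possible answer is sketched below. The specific values are an assumption chosen to match the glimpse() output shown under Question 2; any character, numeric, and logical vectors of length 5 would satisfy the question.

```r
# Character vector: the first five uppercase letters
x <- LETTERS[1:5]

# Numeric vector: the integers 1 through 5
y <- 1:5

# Logical vector of TRUE and FALSE values
z <- c(TRUE, TRUE, FALSE, FALSE, FALSE)

# Each call should report a length of 5
length(x)
length(y)
length(z)
```

Note that 1:5 creates an integer vector, while c(1, 2, 3, 4, 5) creates a double vector. Both count as numeric.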
Question 2
Using the vectors you created above, create a new tibble with column names x, y, and z. Use the tibble() function to combine your vectors into a data frame.
Use nrow() and ncol() to calculate the number of rows and columns, both with and without the pipe operator.
Use glimpse() to get a quick overview of your tibble - this shows data types and first few values.
What do you notice about the length of the vectors and the number of rows?
nrow(tbl)
[1] 5
ncol(tbl)
[1] 3
# Get a quick overview
glimpse(tbl)
Rows: 5
Columns: 3
$ x <chr> "A", "B", "C", "D", "E"
$ y <int> 1, 2, 3, 4, 5
$ z <lgl> TRUE, TRUE, FALSE, FALSE, FALSE
Answer
The length of the vectors and the number of rows are the same, because tibble columns are simply the vectors we started with.
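The step that builds tbl does not appear in the output above. A minimal sketch of it follows, assuming the vectors hold the values displayed by glimpse(tbl); it also shows the row and column counts with and without the pipe operator.

```r
library(tibble)  # provides tibble()

# Vectors assumed from the glimpse() output above
x <- LETTERS[1:5]
y <- 1:5
z <- c(TRUE, TRUE, FALSE, FALSE, FALSE)

# Combine the vectors into a tibble; each vector becomes one column
tbl <- tibble(x = x, y = y, z = z)

# Without the pipe
nrow(tbl)
ncol(tbl)

# With the native pipe (requires R 4.1 or later)
tbl |> nrow()
tbl |> ncol()
```

Both forms return the same values, since tbl |> nrow() is just another way of writing nrow(tbl).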
Question 3
Let’s explore the penguins dataset that we loaded.
- Look at the number of rows with nrow() - this tells us how many penguins are in the dataset
- Look at the number of columns with ncol() - this tells us how many variables we measured
- Look at the column names with names() - this shows us what variables we have
- Get a glimpse of the data with glimpse() - this shows data types and sample values
# Explore the penguins dataset
nrow(penguins)
[1] 344
ncol(penguins)
[1] 8
names(penguins)
[1] "species" "island" "bill_len" "bill_dep" "flipper_len"
[6] "body_mass" "sex" "year"
glimpse(penguins)
Rows: 344
Columns: 8
$ species <fct> Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Ad…
$ island <fct> Torgersen, Torgersen, Torgersen, Torgersen, Torgersen, Tor…
$ bill_len <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34.1, 42.0, …
$ bill_dep <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18.1, 20.2, …
$ flipper_len <int> 181, 186, 195, NA, 193, 190, 181, 195, 193, 190, 186, 180,…
$ body_mass <int> 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 3475, 4250, …
$ sex <fct> male, female, female, NA, female, male, female, male, NA, …
$ year <int> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007…
Question 4
Next we will think about data tidying. Let’s start by analyzing the penguins dataset.
Part A: Is the penguins dataset tidy? To determine this, we need to think about the three principles of tidy data:
- Each variable forms a column
- Each observation forms a row
- Each type of observational unit forms a table
Look at the penguins dataset and answer:
- What are the variables in the dataset? (Hint: use names(penguins) to see them)
- Does each column represent a single variable?
- Does each row represent a single penguin observation?
# Look at the structure of penguins
penguins |> glimpse()
Rows: 344
Columns: 8
$ species <fct> Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Ad…
$ island <fct> Torgersen, Torgersen, Torgersen, Torgersen, Torgersen, Tor…
$ bill_len <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34.1, 42.0, …
$ bill_dep <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18.1, 20.2, …
$ flipper_len <int> 181, 186, 195, NA, 193, 190, 181, 195, 193, 190, 186, 180,…
$ body_mass <int> 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 3475, 4250, …
$ sex <fct> male, female, female, NA, female, male, female, male, NA, …
$ year <int> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007…
# What are the variables?
names(penguins)
[1] "species" "island" "bill_len" "bill_dep" "flipper_len"
[6] "body_mass" "sex" "year"
# Look at a few rows
penguins |> head()
  species    island bill_len bill_dep flipper_len body_mass    sex year
1  Adelie Torgersen     39.1     18.7         181      3750   male 2007
2  Adelie Torgersen     39.5     17.4         186      3800 female 2007
3  Adelie Torgersen     40.3     18.0         195      3250 female 2007
4  Adelie Torgersen       NA       NA          NA        NA   <NA> 2007
5  Adelie Torgersen     36.7     19.3         193      3450 female 2007
6  Adelie Torgersen     39.3     20.6         190      3650   male 2007
Answer: Yes, the penguins dataset is tidy because:
- Each column represents one variable (species, island, bill_len, etc.)
- Each row represents one penguin observation
- All observations are of the same type (penguin measurements)
Part B: Now let’s examine some datasets that are NOT tidy. Use data() to see available datasets, then look at these two examples:
Example 1: anscombe - This is a classic statistics dataset:
# Look at the anscombe dataset
anscombe
   x1 x2 x3 x4    y1   y2    y3    y4
1  10 10 10  8  8.04 9.14  7.46  6.58
2   8  8  8  8  6.95 8.14  6.77  5.76
3  13 13 13  8  7.58 8.74 12.74  7.71
4   9  9  9  8  8.81 8.77  7.11  8.84
5  11 11 11  8  8.33 9.26  7.81  8.47
6  14 14 14  8  9.96 8.10  8.84  7.04
7   6  6  6  8  7.24 6.13  6.08  5.25
8   4  4  4 19  4.26 3.10  5.39 12.50
9  12 12 12  8 10.84 9.13  8.15  5.56
10  7  7  7  8  4.82 7.26  6.42  7.91
11  5  5  5  8  5.68 4.74  5.73  6.89
glimpse(anscombe)
Rows: 11
Columns: 8
$ x1 <dbl> 10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5
$ x2 <dbl> 10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5
$ x3 <dbl> 10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5
$ x4 <dbl> 8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8
$ y1 <dbl> 8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68
$ y2 <dbl> 9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74
$ y3 <dbl> 7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73
$ y4 <dbl> 6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89
Is anscombe tidy? Think about:
- What are the actual variables? (Hint: x and y coordinates for different datasets)
- How many different datasets are encoded in the column names?
- What would a tidy version look like?
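One way a tidy version could be built is sketched below with tidyr::pivot_longer(). This is an illustration rather than a required part of the answer; the names anscombe_tidy and set are assumptions introduced here. The pattern "(.)(.)" splits each column name into its letter (x or y) and its dataset number.

```r
library(tidyr)  # provides pivot_longer()

# Split names like "x1" into a variable part (x or y) and a set number.
# ".value" tells pivot_longer() to keep x and y as separate columns.
anscombe_tidy <- anscombe |>
  pivot_longer(
    cols = everything(),
    names_to = c(".value", "set"),
    names_pattern = "(.)(.)"
  )

# One row per point, with columns for the set identifier, x, and y
head(anscombe_tidy)
```

The result has 44 rows (11 points in each of 4 datasets) and three columns, so each variable has its own column and each point has its own row.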
Example 2: Choose another dataset - Pick one more dataset from data() and analyze whether it’s tidy:
# Look at available datasets
data()
# Choose one and examine it (examples: WorldPhones, UCBAdmissions, HairEyeColor)
# Let's try WorldPhones as an example
WorldPhones
     N.Amer Europe Asia S.Amer Oceania Africa Mid.Amer
1951  45939  21574 2876   1815    1646     89      555
1956  60423  29990 4708   2568    2366   1411      733
1957  64721  32510 5230   2695    2526   1546      773
1958  68484  35218 6662   2845    2691   1663      836
1959  71799  37598 6856   3000    2868   1769      911
1960  76036  40341 8220   3145    3054   1905     1008
1961  79831  43173 9053   3338    3224   2005     1076
Part C: Write a brief explanation (2-3 sentences) for each dataset about:
1. Whether it’s tidy or not
2. What makes it tidy/untidy
3. What the variables actually represent
Your Analysis:
penguins: The penguins dataset is tidy because each column represents a single variable (species, island, bill measurements, etc.), each row represents one penguin observation, and all data is the same type of observational unit (individual penguin measurements). The variables are clearly defined and there’s no mixing of different types of information in single columns.
anscombe: The anscombe dataset is NOT tidy because it violates multiple tidy data principles. The actual variables are x-coordinates, y-coordinates, and dataset identifier, but the dataset identifier is encoded in the column names (x1, y1, x2, y2, etc.). Four different datasets are stored in one table, with each dataset’s x and y values spread across separate columns rather than being in rows with a dataset identifier column.
WorldPhones: The WorldPhones dataset is NOT tidy because it has years as row names instead of a proper column, and regions are spread across columns rather than being values in a “region” variable. The actual variables should be year, region, and number of phones, but currently the year and region information is stored in the structure of the table rather than as data values. A tidy version would have one row per year-region combination.
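The tidy version of WorldPhones described above could be produced roughly as follows. This is a sketch, not part of the required answer; the names phones_tidy, region, and phones are assumptions chosen for illustration. WorldPhones is stored as a matrix, so it is converted to a data frame first.

```r
library(tibble)  # provides rownames_to_column()
library(tidyr)   # provides pivot_longer()

phones_tidy <- WorldPhones |>
  as.data.frame() |>                     # matrix -> data frame, row names kept
  rownames_to_column(var = "year") |>    # move the years into a real column
  pivot_longer(
    cols = -year,                        # every column except year is a region
    names_to = "region",
    values_to = "phones"
  )

# Row names are stored as text, so convert the years back to integers
phones_tidy$year <- as.integer(phones_tidy$year)

head(phones_tidy)
```

The result has 49 rows, one per year-region combination (7 years by 7 regions), with year, region, and phones as ordinary columns.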
Submit
Be sure to click the “Render” button to render the HTML output.
Then paste the URL of the Posit Cloud project (NOT the HTML link) into the problem set on Canvas.