Problem Set 1 Key

Author

JH

Published

September 6, 2025

Problem Set

Each problem below is worth 8 points.

The problem set is due 12pm on Aug 30.

Grading rubric

  • Everything is good: 5 points
  • Partially correct answers: 3-4 points
  • Reasonable attempt: 2 points

Setup

Start by loading the libraries you need for the analysis below. When in doubt, start by loading the tidyverse package.

library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.2     ✔ tibble    3.2.1
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.0.4     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Question 1

Create 3 different vectors called x, y, and z:

  • x should be a character vector of length 5 (hint: use LETTERS or letters)
  • y should be a numeric vector of length 5 (hint: try 1:5 or c(1, 2, 3, 4, 5))
  • z should be a logical vector of length 5 (hint: use TRUE and FALSE values)

Use length() to calculate the length of each vector.

x <- LETTERS[1:5]
y <- 1:5
z <- c(TRUE, TRUE, FALSE, FALSE, FALSE)

x
[1] "A" "B" "C" "D" "E"
y
[1] 1 2 3 4 5
z
[1]  TRUE  TRUE FALSE FALSE FALSE
# Traditional way
length(x)
[1] 5
length(y)
[1] 5
length(z)
[1] 5
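
As an optional check on the vector types, here is a minimal sketch using base R's class(), is.numeric(), and typeof(). It recreates the three vectors so it runs on its own. Note that 1:5 produces an integer vector, which still counts as numeric, whereas c(1, 2, 3, 4, 5) would produce a double vector.

```r
x <- LETTERS[1:5]
y <- 1:5
z <- c(TRUE, TRUE, FALSE, FALSE, FALSE)

# class() reports the type of each vector
class(x)  # "character"
class(y)  # "integer"
class(z)  # "logical"

# Integer vectors count as numeric
is.numeric(y)  # TRUE

# c(1, 2, 3, 4, 5) stores doubles rather than integers
typeof(c(1, 2, 3, 4, 5))  # "double"
```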

Question 2

Using the vectors you created above, create a new tibble with column names x, y, and z. Use the tibble() function to combine your vectors into a data frame.

Use nrow() and ncol() to calculate the number of rows and columns, both with and without the pipe operator.

Use glimpse() to get a quick overview of your tibble - this shows data types and first few values.

What do you notice about the length of the vectors and the number of rows?

tbl <- tibble(x = x, y = y, z = z)

# Traditional way
nrow(tbl)
[1] 5
ncol(tbl)
[1] 3
# Get a quick overview
glimpse(tbl)
Rows: 5
Columns: 3
$ x <chr> "A", "B", "C", "D", "E"
$ y <int> 1, 2, 3, 4, 5
$ z <lgl> TRUE, TRUE, FALSE, FALSE, FALSE
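
The question also asks for the pipe version. A minimal sketch, assuming R 4.1 or later for the native pipe |> (the tibble is rebuilt here so the block runs on its own):

```r
library(tibble)

tbl <- tibble(
  x = LETTERS[1:5],
  y = 1:5,
  z = c(TRUE, TRUE, FALSE, FALSE, FALSE)
)

# With the pipe, the tibble on the left becomes the first argument
n_rows <- tbl |> nrow()
n_cols <- tbl |> ncol()

n_rows  # 5
n_cols  # 3
```

Both forms return the same values; the pipe only changes how the call is written.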

Answer

The length of the vectors and the number of rows are the same, because tibble columns are simply the vectors we started with.
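
One way to confirm this is to compare a column with the vector it came from. This is a sketch rather than a required part of the answer; the mismatched-length example at the end is a hypothetical illustration, not part of the problem set.

```r
library(tibble)

x <- LETTERS[1:5]
y <- 1:5
z <- c(TRUE, TRUE, FALSE, FALSE, FALSE)
tbl <- tibble(x = x, y = y, z = z)

# Each column is the original vector
identical(tbl$x, x)  # TRUE
identical(tbl$y, y)  # TRUE

# So the number of rows equals the vector length
nrow(tbl) == length(x)  # TRUE

# Vectors of incompatible lengths (here 5 and 3) cannot be combined
bad <- tryCatch(
  tibble(a = 1:5, b = 1:3),
  error = function(e) "error"
)
bad  # "error"
```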

Question 3

Let’s explore the penguins dataset that we loaded.

  1. Look at the number of rows with nrow() - this tells us how many penguins are in the dataset
  2. Look at the number of columns with ncol() - this tells us how many variables we measured
  3. Look at the column names with names() - this shows us what variables we have
  4. Get a glimpse of the data with glimpse() - this shows data types and sample values
# Explore the penguins dataset
nrow(penguins)
[1] 344
ncol(penguins)
[1] 8
names(penguins)
[1] "species"     "island"      "bill_len"    "bill_dep"    "flipper_len"
[6] "body_mass"   "sex"         "year"       
glimpse(penguins)
Rows: 344
Columns: 8
$ species     <fct> Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Ad…
$ island      <fct> Torgersen, Torgersen, Torgersen, Torgersen, Torgersen, Tor…
$ bill_len    <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34.1, 42.0, …
$ bill_dep    <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18.1, 20.2, …
$ flipper_len <int> 181, 186, 195, NA, 193, 190, 181, 195, 193, 190, 186, 180,…
$ body_mass   <int> 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 3475, 4250, …
$ sex         <fct> male, female, female, NA, female, male, female, male, NA, …
$ year        <int> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007…

Question 4

Next we will think about data tidying. Let’s start by analyzing the penguins dataset.

Part A: Is the penguins dataset tidy? To determine this, we need to think about the three principles of tidy data:

  1. Each variable forms a column
  2. Each observation forms a row
  3. Each type of observational unit forms a table

Look at the penguins dataset and answer:

  • What are the variables in the dataset? (Hint: use names(penguins) to see them)
  • Does each column represent a single variable?
  • Does each row represent a single penguin observation?
# Look at the structure of penguins
penguins |> glimpse()
Rows: 344
Columns: 8
$ species     <fct> Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Ad…
$ island      <fct> Torgersen, Torgersen, Torgersen, Torgersen, Torgersen, Tor…
$ bill_len    <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34.1, 42.0, …
$ bill_dep    <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18.1, 20.2, …
$ flipper_len <int> 181, 186, 195, NA, 193, 190, 181, 195, 193, 190, 186, 180,…
$ body_mass   <int> 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 3475, 4250, …
$ sex         <fct> male, female, female, NA, female, male, female, male, NA, …
$ year        <int> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007…
# What are the variables?
names(penguins)
[1] "species"     "island"      "bill_len"    "bill_dep"    "flipper_len"
[6] "body_mass"   "sex"         "year"       
# Look at a few rows
penguins |> head()
  species    island bill_len bill_dep flipper_len body_mass    sex year
1  Adelie Torgersen     39.1     18.7         181      3750   male 2007
2  Adelie Torgersen     39.5     17.4         186      3800 female 2007
3  Adelie Torgersen     40.3     18.0         195      3250 female 2007
4  Adelie Torgersen       NA       NA          NA        NA   <NA> 2007
5  Adelie Torgersen     36.7     19.3         193      3450 female 2007
6  Adelie Torgersen     39.3     20.6         190      3650   male 2007

Answer: Yes, the penguins dataset is tidy because:

  • Each column represents one variable (species, island, bill_len, etc.)
  • Each row represents one penguin observation
  • All observations are of the same type (penguin measurements)

Part B: Now let’s examine some datasets that are NOT tidy. Use data() to see available datasets, then look at these two examples:

Example 1: anscombe - This is a classic statistics dataset:

# Look at the anscombe dataset
anscombe
   x1 x2 x3 x4    y1   y2    y3    y4
1  10 10 10  8  8.04 9.14  7.46  6.58
2   8  8  8  8  6.95 8.14  6.77  5.76
3  13 13 13  8  7.58 8.74 12.74  7.71
4   9  9  9  8  8.81 8.77  7.11  8.84
5  11 11 11  8  8.33 9.26  7.81  8.47
6  14 14 14  8  9.96 8.10  8.84  7.04
7   6  6  6  8  7.24 6.13  6.08  5.25
8   4  4  4 19  4.26 3.10  5.39 12.50
9  12 12 12  8 10.84 9.13  8.15  5.56
10  7  7  7  8  4.82 7.26  6.42  7.91
11  5  5  5  8  5.68 4.74  5.73  6.89
glimpse(anscombe)
Rows: 11
Columns: 8
$ x1 <dbl> 10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5
$ x2 <dbl> 10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5
$ x3 <dbl> 10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5
$ x4 <dbl> 8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8
$ y1 <dbl> 8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68
$ y2 <dbl> 9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74
$ y3 <dbl> 7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73
$ y4 <dbl> 6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89

Is anscombe tidy? Think about:

  • What are the actual variables? (Hint: x and y coordinates for different datasets)
  • How many different datasets are encoded in the column names?
  • What would a tidy version look like?

Example 2: Choose another dataset - Pick one more dataset from data() and analyze whether it’s tidy:

# Look at available datasets
data()

# Choose one and examine it (examples: WorldPhones, UCBAdmissions, HairEyeColor)
# Let's try WorldPhones as an example
WorldPhones
     N.Amer Europe Asia S.Amer Oceania Africa Mid.Amer
1951  45939  21574 2876   1815    1646     89      555
1956  60423  29990 4708   2568    2366   1411      733
1957  64721  32510 5230   2695    2526   1546      773
1958  68484  35218 6662   2845    2691   1663      836
1959  71799  37598 6856   3000    2868   1769      911
1960  76036  40341 8220   3145    3054   1905     1008
1961  79831  43173 9053   3338    3224   2005     1076

Part C: Write a brief explanation (2-3 sentences) for each dataset about:

  1. Whether it’s tidy or not
  2. What makes it tidy/untidy
  3. What the variables actually represent

Your Analysis:

penguins: The penguins dataset is tidy because each column represents a single variable (species, island, bill measurements, etc.), each row represents one penguin observation, and all data is the same type of observational unit (individual penguin measurements). The variables are clearly defined and there’s no mixing of different types of information in single columns.

anscombe: The anscombe dataset is NOT tidy because it violates multiple tidy data principles. The actual variables are x-coordinates, y-coordinates, and dataset identifier, but the dataset identifier is encoded in the column names (x1, y1, x2, y2, etc.). Four different datasets are stored in one table, with each dataset’s x and y values spread across separate columns rather than being in rows with a dataset identifier column.
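
To make the tidy version concrete, here is one possible sketch using tidyr's pivot_longer(). The column name set for the dataset identifier is an assumed choice, not something the problem set specifies.

```r
library(tidyr)

# Split each column name into a variable (x or y) and a dataset number (1-4).
# ".value" tells pivot_longer() to use the first piece as the output column name.
anscombe_tidy <- anscombe |>
  pivot_longer(
    cols = everything(),
    names_to = c(".value", "set"),
    names_pattern = "(.)(.)"
  )

anscombe_tidy
```

The result has three columns (set, x, y) and 44 rows: 11 observations for each of the four datasets.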

WorldPhones: The WorldPhones dataset is NOT tidy because it has years as row names instead of a proper column, and regions are spread across columns rather than being values in a “region” variable. The actual variables should be year, region, and number of phones, but currently the year and region information is stored in the structure of the table rather than as data values. A tidy version would have one row per year-region combination.
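
A tidy version of WorldPhones could be built along these lines. This is a sketch only; the column names year, region, and phones are assumed choices that follow the variables named above.

```r
library(tibble)
library(tidyr)

# WorldPhones is a matrix, so convert it to a data frame first,
# then move the row names (years) into a real column.
phones_tidy <- WorldPhones |>
  as.data.frame() |>
  rownames_to_column("year") |>
  pivot_longer(
    cols = -year,
    names_to = "region",
    values_to = "phones"
  )

# Row names are stored as text, so convert year to integer
phones_tidy$year <- as.integer(phones_tidy$year)

phones_tidy
```

The result has one row per year-region combination: 7 years times 7 regions, or 49 rows.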

Submit

Be sure to click the “Render” button to render the HTML output.

Then paste the URL of the Posit Cloud project (NOT the HTML link) into the problem set on Canvas.