Problem Set 1 Key

Author

JH

Published

September 6, 2025

Problem Set

Each problem below is worth 5 points.

The problem set is due 12pm on Aug 26.

Grading rubric

  • Everything is good: 5 points
  • Partially correct answers: 3-4 points
  • Reasonable attempt: 2 points

Setup

Start by loading the libraries you need for the analysis below. When in doubt, start by loading the tidyverse package.

library(___)

Question 1

Create 3 different vectors called x, y, and z:

  • x should be a character vector of length 5 (hint: use LETTERS or letters)
  • y should be a numeric vector of length 5 (hint: try 1:5 or c(1, 2, 3, 4, 5))
  • z should be a logical vector of length 5 (hint: use TRUE and FALSE values)

Use length() to calculate the length of each vector.
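
One possible answer is sketched below. The specific values are arbitrary choices for illustration; any vectors of the right type and length would work.

```r
# Illustrative values only; any five elements of the right type are fine
x <- letters[1:5]                        # character vector
y <- 1:5                                 # numeric (integer) vector
z <- c(TRUE, FALSE, TRUE, FALSE, TRUE)   # logical vector

# Each call reports the number of elements in the vector
length(x)
length(y)
length(z)
```

All three calls should report the same length, since each vector was built with five elements.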

Question 2

Using the vectors you created above, create a new tibble with column names x, y, and z. Use the tibble() function to combine your vectors into a data frame.

Use nrow() and ncol() to calculate the number of rows and columns, both with and without the pipe operator.

Use glimpse() to get a quick overview of your tibble - this shows the data types and the first few values of each column.

What do you notice about the length of the vectors and the number of rows?

tbl <- tibble(___)

nrow(___)
ncol(___)

# Get a quick overview
glimpse(___)

Answer

The length of the vectors and the number of rows are the same, because tibble columns are simply the vectors we started with.
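
A minimal sketch of one way to fill in the template, assuming the illustrative vectors below and the native pipe `|>` (available in R 4.1 and later; the magrittr pipe `%>%` would work the same way):

```r
library(tidyverse)

# Vectors as in Question 1 (values are illustrative, not required)
x <- letters[1:5]
y <- 1:5
z <- c(TRUE, FALSE, TRUE, FALSE, TRUE)

# Each vector becomes one column of the tibble
tbl <- tibble(x = x, y = y, z = z)

# Without the pipe
nrow(tbl)
ncol(tbl)

# With the pipe
tbl |> nrow()
tbl |> ncol()

# Get a quick overview
glimpse(tbl)
```

Here nrow(tbl) matches length(x), and ncol(tbl) matches the number of vectors passed to tibble().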

Question 3

Let’s explore the penguins dataset that we loaded.

  1. Look at the number of rows with nrow() - this tells us how many penguins are in the dataset
  2. Look at the number of columns with ncol() - this tells us how many variables we measured
  3. Look at the column names with names() - this shows us what variables we have
  4. Get a glimpse of the data with glimpse() - this shows data types and sample values

# Explore the penguins dataset
nrow(___)
ncol(___)
names(___)
glimpse(___)

Question 4

Next we will think about data tidying. Let’s start by analyzing the penguins dataset.

Part A: Is the penguins dataset tidy? To determine this, we need to think about the three principles of tidy data:

  1. Each variable forms a column
  2. Each observation forms a row
  3. Each type of observational unit forms a table

Look at the penguins dataset and answer:

  • What are the variables in the dataset? (Hint: use names(penguins) to see them)
  • Does each column represent a single variable?
  • Does each row represent a single penguin observation?

Part B: Now let’s examine some datasets that are NOT tidy. Use data() to see available datasets, then look at these two examples:

Example 1: anscombe - This is a classic statistics dataset.

# Look at the anscombe dataset. Start by reading the help page with `?anscombe`

Is anscombe tidy? Think about:

  • What are the actual variables? (Hint: x and y coordinates for different datasets)
  • How many different datasets are encoded in the column names?
  • What would a tidy version look like?
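
The column names x1 through y4 encode two things at once: the variable (x or y) and the dataset number (1 to 4). A tidy version could therefore have one row per point, with columns for the set, x, and y. The sketch below is one way to get there, assuming tidyr's pivot_longer() (loaded with the tidyverse); the names `tidy_anscombe` and `set` are chosen here for illustration.

```r
library(tidyverse)

# anscombe ships with base R: 11 rows, columns x1..x4 and y1..y4
tidy_anscombe <- anscombe |>
  pivot_longer(
    everything(),
    names_to = c(".value", "set"),   # ".value" keeps x and y as separate columns
    names_pattern = "(.)(.)"         # first character: x or y; second: set number
  )

# One row per point: 11 points in each of 4 sets
glimpse(tidy_anscombe)
```

In this shape each variable is a column and each row is a single (x, y) observation from one of the four sets.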

Example 2: Choose another dataset - Pick one more dataset from data() and analyze whether it’s tidy:

# Look at available datasets
data()

Is your chosen dataset tidy? Think about:

  • What are the actual variables?
  • Are any values (such as a group, year, or condition) stored in the column names instead of in a column?
  • What would a tidy version look like?

Part C: Write a brief explanation (2-3 sentences) for each dataset about:

  1. Whether it’s tidy or not
  2. What makes it tidy/untidy
  3. What the variables actually represent

Your Analysis:

penguins: [Your answer here]

anscombe: [Your answer here]

[Your chosen dataset]: [Your answer here]

Submit

Be sure to click the “Render” button to render the HTML output.

Then paste the URL of the Posit Cloud project (NOT the HTML link) into the problem set on Canvas.