library(___)
Problem Set 1 Key
Problem Set
Each problem below is worth 5 points.
The problem set is due 12pm on Aug 26.
Grading rubric
- Everything is good: 5 points
- Partially correct answers: 3-4 points
- Reasonable attempt: 2 points
Setup
Start by loading the libraries you need for the analysis below. When in doubt, start by loading the tidyverse package.
Question 1
Create 3 different vectors called x, y, and z:
- x should be a character vector of length 5 (hint: use LETTERS or letters)
- y should be a numeric vector of length 5 (hint: try 1:5 or c(1, 2, 3, 4, 5))
- z should be a logical vector of length 5 (hint: use TRUE and FALSE values)
Use length() to calculate the length of each vector.
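One possible solution, as a minimal sketch. The specific values are arbitrary choices made for illustration; any five elements of the right type would satisfy the question.

```r
# Illustrative values only; any five elements of the right type work.
x <- LETTERS[1:5]                       # character vector
y <- 1:5                                # numeric (integer) vector
z <- c(TRUE, FALSE, TRUE, FALSE, TRUE)  # logical vector

# Each call reports the number of elements in the vector
length(x)
length(y)
length(z)
```

Note that 1:5 creates an integer vector, while c(1, 2, 3, 4, 5) creates a double vector; both count as numeric.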
Question 2
Using the vectors you created above, create a new tibble with column names x, y, and z. Use the tibble() function to combine your vectors into a data frame.
Use nrow() and ncol() to calculate the number of rows and columns, both with and without the pipe operator.
Use glimpse() to get a quick overview of your tibble - this shows data types and first few values.
What do you notice about the length of the vectors and the number of rows?
tbl <- tibble(___)
nrow(___)
ncol(___)
# Get a quick overview
glimpse(___)
Answer
The length of the vectors and the number of rows are the same, because tibble columns are simply the vectors we started with.
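The answer above can be checked with a short sketch. It assumes the tidyverse is installed, uses illustrative vector values, and uses the native pipe |>, which requires R 4.1 or later; the %>% pipe loaded with the tidyverse behaves the same way here.

```r
library(tidyverse)

# Illustrative vectors, as in Question 1
x <- letters[1:5]
y <- 1:5
z <- c(TRUE, FALSE, TRUE, FALSE, TRUE)

# Each vector becomes one column of the tibble
tbl <- tibble(x = x, y = y, z = z)

# Without the pipe
nrow(tbl)
ncol(tbl)

# With the pipe
tbl |> nrow()
tbl |> ncol()

# Get a quick overview
glimpse(tbl)

# The number of rows matches the length of each vector
nrow(tbl) == length(x)
```

Naming the columns explicitly (x = x) is optional; tibble(x, y, z) would pick up the same names from the vectors.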
Question 3
Let’s explore the penguins dataset that we loaded.
- Look at the number of rows with nrow() - this tells us how many penguins are in the dataset
- Look at the number of columns with ncol() - this tells us how many variables we measured
- Look at the column names with names() - this shows us what variables we have
- Get a glimpse of the data with glimpse() - this shows data types and sample values
# Explore the penguins dataset
nrow(___)
ncol(___)
names(___)
glimpse(___)
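A filled-in version of the template might look like the sketch below. The problem set does not say which package provides penguins; this assumes it comes from the palmerpenguins package, so adjust the library() call if your setup differs.

```r
library(tidyverse)
library(palmerpenguins)  # assumption: source of the penguins dataset

# Explore the penguins dataset
nrow(penguins)     # number of penguins (rows)
ncol(penguins)     # number of variables (columns)
names(penguins)    # variable names
glimpse(penguins)  # data types and sample values
```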
Question 4
Next we will think about data tidying. Let’s start by analyzing the penguins dataset.
Part A: Is the penguins dataset tidy? To determine this, we need to think about the three principles of tidy data:
- Each variable forms a column
- Each observation forms a row
- Each type of observational unit forms a table
Look at the penguins dataset and answer:
- What are the variables in the dataset? (Hint: use names(penguins) to see them)
- Does each column represent a single variable?
- Does each row represent a single penguin observation?
Part B: Now let’s examine some datasets that are NOT tidy. Use data() to see available datasets, then look at these two examples:
Example 1: anscombe - This is a classic statistics dataset.
# Look at the anscombe dataset. Start by reading the help page with `?anscombe`
Is anscombe tidy? Think about:
- What are the actual variables? (Hint: x and y coordinates for different datasets)
- How many different datasets are encoded in the column names?
- What would a tidy version look like?
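To make the questions above concrete, here is a hedged sketch of one way to tidy anscombe. It uses pivot_longer() from tidyr, which the problem set does not name, so treat it as one possible approach; the names set and anscombe_tidy are illustrative.

```r
library(tidyverse)

# Columns are x1..x4 and y1..y4: the letter is the variable,
# the digit identifies which of the four datasets the value belongs to.
names(anscombe)

# Split each column name into a variable (x or y) and a dataset label.
# ".value" tells pivot_longer() to keep x and y as separate columns.
anscombe_tidy <- anscombe |>
  pivot_longer(
    cols = everything(),
    names_to = c(".value", "set"),
    names_pattern = "(.)(.)"
  )

anscombe_tidy
```

In the result each row is one (x, y) point, and set records which dataset it came from, so every variable has its own column.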
Example 2: Choose another dataset - Pick one more dataset from data() and analyze whether it’s tidy:
# Look at available datasets
data()
Is this other dataset tidy? Think about:
- What are the actual variables?
- Are values of any variable encoded in the column names or row names?
- What would a tidy version look like?
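As an illustration of this kind of analysis, the sketch below uses WorldPhones from base R. This dataset is a hypothetical pick for demonstration, not one the problem set requires; the column names year, region, and phones and the object name phones_tidy are also illustrative.

```r
library(tidyverse)

# WorldPhones is a matrix: row names are years, column names are regions,
# and the cells hold telephone counts. Two variables (year, region) are
# stored in the row and column names rather than in columns.
WorldPhones

phones_tidy <- WorldPhones |>
  as.data.frame() |>
  rownames_to_column("year") |>   # move the years into a real column
  pivot_longer(
    cols = -year,
    names_to = "region",          # former column names
    values_to = "phones"          # former cell values
  )

phones_tidy
```

After reshaping, each row is one year-region observation, with year, region, and phones each in their own column.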
Part C: Write a brief explanation (2-3 sentences) for each dataset about:
- Whether it’s tidy or not
- What makes it tidy/untidy
- What the variables actually represent
Your Analysis:
penguins: [Your answer here]
anscombe: [Your answer here]
[Your chosen dataset]: [Your answer here]
Submit
Be sure to click the “Render” button to render the HTML output.
Then paste the URL of the Posit Cloud project (NOT the HTML link) into the problem set on Canvas.