R Bootcamp Problem Set 3
Setup
Start by loading the libraries you need for the analysis in the code chunk below. When in doubt, start by loading the tidyverse package.
Problem Set
Each problem below is worth 5 points.
Use the data files in the data/ directory to answer the questions.
For this problem set, you are allowed to help each other, but you are not allowed to post correct answers in Slack.
The problem set is due 5pm on Aug 28.
Grading rubric
- Everything is good: 5 points
- Partially correct answers: 3-4 points
- Reasonable attempt: 2 points
Question 1
Load the palmerpenguins package (already done above). Inspect the penguins tibble with summary() to see the distribution of variables and any missing values.
Use drop_na() to remove rows with NA values in the penguins tibble. Calculate how many rows were removed by subtracting the new count from the original count using nrow().
Then, use count() to explore the data and see how many penguins of each species we have. This is a simple but powerful way to understand your data!
Then, use replace_na() to replace NA values in bill_length_mm and bill_depth_mm with a value of 0. You’ll need to:
- Provide the data frame as the first argument
- Provide a named list showing which columns to replace and what values to use
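The three functions above can be tried out on a small made-up tibble before applying them to penguins. This is only a sketch: the toy data and its column names (group, height, width) are hypothetical and chosen for illustration, not taken from the problem set data.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data (not the penguins tibble)
toy <- tibble(
  group  = c("a", "a", "b", "b", "c"),
  height = c(1.2, NA, 3.4, 2.2, NA),
  width  = c(0.5, 0.7, NA, 0.9, 1.1)
)

# Drop every row that contains at least one NA
toy_complete <- drop_na(toy)

# Rows removed = original row count minus new row count
n_removed <- nrow(toy) - nrow(toy_complete)

# Tally how many rows belong to each group
group_counts <- count(toy, group)

# Data frame first, then a named list: column name = replacement value
toy_filled <- replace_na(toy, list(height = 0, width = 0))
```

Note that drop_na() discards whole rows, while replace_na() keeps every row and only fills in the listed columns.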
Question 2
Use arrange, filter, and select on a data frame. Let’s build this step by step to understand how pipes work:
- Import the data set data/data_transcript_exp_tidy.csv using read_csv() and here().
- Step 2a: First, just sort the tibble by expression data (count) from highest to lowest level using arrange(). Use desc() to get descending order.
- Step 2b: Then add filter() to keep only rows where count > 100. Chain this with the pipe operator.
- Step 2c: Finally, add select() to choose all columns except for type. Use the - operator to exclude columns.
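To see how the pipe chains these verbs, here is a minimal sketch on a made-up tibble. The data and the column names (player, team, points) are hypothetical, and the native pipe |> is an assumption that requires R 4.1 or later; the magrittr pipe %>% should behave the same way in this example.

```r
library(dplyr)

# Hypothetical toy data; column names are made up for illustration
scores <- tibble(
  player = c("p1", "p2", "p3", "p4"),
  team   = c("red", "blue", "red", "blue"),
  points = c(12, 45, 30, 8)
)

top_scores <- scores |>
  arrange(desc(points)) |>   # sort from highest to lowest
  filter(points > 10) |>     # keep rows above a threshold
  select(-team)              # drop one column, keep the rest
```

Each step receives the result of the previous one as its first argument, so the chain can be built up and checked one line at a time.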
Question 3
How will you:
- create a new column log10count that contains log10 transformed count values using mutate() and log10(), and
- rearrange the columns in the following order: ensembl_transcript_id, type, time, replicate, count, log10count using select().
Before showing the solution, remember:
- mutate() adds new columns (or modifies existing ones); it keeps all existing columns
- select() chooses columns and can reorder them; list them in the order you want
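The two reminders above can be illustrated on a small made-up tibble. This is a sketch under assumed names: measurements, sample, value, and batch are hypothetical and unrelated to the transcript data.

```r
library(dplyr)

# Hypothetical toy data
measurements <- tibble(
  sample = c("s1", "s2", "s3"),
  value  = c(10, 100, 1000),
  batch  = c(1, 1, 2)
)

reordered <- measurements |>
  mutate(log10value = log10(value)) |>       # new column is appended at the end
  select(sample, batch, value, log10value)   # columns come out in the order listed
```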
Question 4
Let’s explore grouping operations step by step. We’ll build your understanding progressively, starting with simple examples and then combining concepts.
Step 4a: First, try a simple grouping operation. Calculate the total count per transcript (ignoring time). Use:
- group_by() to group by transcript ID
- summarize() to calculate the sum of counts
- .groups = "drop" to remove grouping afterward (good practice!)
Step 4b: Now calculate a per-transcript sum, while keeping the time information (group by both transcript and time). This creates separate groups for each combination of transcript AND time:
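The difference between grouping by one variable and grouping by two can be seen in a small sketch. The sales tibble and its columns (store, month, units) are hypothetical stand-ins, not the transcript data.

```r
library(dplyr)

# Hypothetical toy data
sales <- tibble(
  store = c("north", "north", "north", "south", "south"),
  month = c("jan", "jan", "feb", "jan", "feb"),
  units = c(5, 3, 4, 10, 2)
)

# One group (and one output row) per store
per_store <- sales |>
  group_by(store) |>
  summarize(total_units = sum(units), .groups = "drop")

# One group (and one output row) per store-month combination
per_store_month <- sales |>
  group_by(store, month) |>
  summarize(total_units = sum(units), .groups = "drop")
```

Adding a second grouping variable yields more, smaller groups, and the extra column is carried through to the summary.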
Question 5
Create meaningful categories from your data using case_when(). This function lets you create new variables based on multiple conditions - it’s like a more powerful version of if_else().
Categorize the expression levels in the count column into meaningful groups:
- “Low” for counts less than 50
- “Medium” for counts between 50 and 200 (inclusive of 50, exclusive of 200)
- “High” for counts between 200 and 1000 (inclusive of 200, exclusive of 1000)
- “Very High” for counts 1000 and above
Use case_when() inside mutate() to create a new column called expression_level, then use count() to see how many transcripts fall into each category.
Question 6
Try to state and answer a new question! Stitch together several dplyr functions to answer it. I’m trying to wean you off the fill-in-the-blanks approach and get you to think independently using the tidyverse.
Here are some ideas to get you started; you don’t have to use any of them:
- Which transcript has the highest expression level at each time point? (Hint: use dplyr::slice_max() after grouping by time)
- What is the average expression level for each transcript across all time points? (Hint: use group_by() and summarize())
- Which time point has the highest total expression level across all transcripts? (Hint: group by time and summarize total count)
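For the first hint, this is roughly how slice_max() behaves after grouping, shown on a made-up tibble. The races data and its columns (heat, runner, speed) are hypothetical, so treat this as a sketch of the mechanics rather than an answer.

```r
library(dplyr)

# Hypothetical toy data
races <- tibble(
  heat   = c(1, 1, 2, 2),
  runner = c("ann", "bob", "ann", "bob"),
  speed  = c(7.1, 6.8, 6.9, 7.4)
)

fastest <- races |>
  group_by(heat) |>
  slice_max(speed, n = 1) |>   # keep the row with the largest speed in each group
  ungroup()
```

By default slice_max() keeps ties, so a group can return more than one row when several rows share the maximum.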