Practical Biological Data Analysis: Class 8: Introduction to matricies

The Rmarkdown for this class is on github

Goals for this class

Learn what is a matrix
Describe difference between matrix and data frame
Perform mathematical functions
Convert between matrix and data frames

Load packages

library(tidyverse)
library(here)

Download files

Before we get started, let’s download all of the files you will need for the next three classes.

# conditionally download all of the files used in rmarkdown from github 
source("https://raw.githubusercontent.com/rnabioco/bmsc-7810-pbda/main/_posts/2023-12-08-class-8-matricies/download_files.R")

What is a matrix?

A Matrix is an 2 dimensional object in R. We create a matrix using the matrix function

M <- matrix(c(10:21), nrow = 4, byrow = TRUE)
M

     [,1] [,2] [,3]
[1,]   10   11   12
[2,]   13   14   15
[3,]   16   17   18
[4,]   19   20   21

We can also use as.matrix on an existing dataframe

df <- data.frame("A" = c(10:13), "B" = c(14:17), "C" = (18:21))
df

new_mat <- as.matrix(df)
new_mat

      A  B  C
[1,] 10 14 18
[2,] 11 15 19
[3,] 12 16 20
[4,] 13 17 21

Just like data frames, we can name the rows and columns of the Matrix

rownames(new_mat) <- c("first", "second", "third", "forth")
colnames(new_mat) <- c("D", "E", "F")

new_mat

        D  E  F
first  10 14 18
second 11 15 19
third  12 16 20
forth  13 17 21

We can look at the structure of the matrix using str

str(new_mat)

 int [1:4, 1:3] 10 11 12 13 14 15 16 17 18 19 ...
 - attr(*, "dimnames")=List of 2
  ..$ : chr [1:4] "first" "second" "third" "forth"
  ..$ : chr [1:3] "D" "E" "F"

Here you can see that the type of this structure is int because it is a matrix consisting of integers. We can also see the row names and column names.

As with data frames, we can check the size of the matrix using nrow, ncol and dim

nrow(new_mat)

[1] 4

ncol(new_mat)

[1] 3

dim(new_mat)

[1] 4 3

We can also access data using brackets[

Selecting a single value:

new_mat[1,2]

[1] 14

Selecting a section of the matrix:

new_mat[1:3,2]

 first second  third 
    14     15     16

If we don’t provide an index for the row, R will return all rows:

new_mat[, 3]

 first second  third  forth 
    18     19     20     21

The same is true for the columns

new_mat[3,]

 D  E  F 
12 16 20

Because this matrix has row and column names, we can also pull out data based on those

new_mat["second", "D"]

[1] 11

Exercise

What value is in row 2 and column 3 of new_mat?

# TODO find the value in the matrix at row 2 and column 3

If we can make a matrix from a data frame, what’s the difference? Matrices can only have values of one type –> integer, boolean, character, while a dataframe can be a mix of types:

df <- data.frame("A" = c(10:12),
                 "B" = c("cat", "dog", "fish"),
                 "C" = c(TRUE, TRUE, FALSE))

df

   A    B     C
1 10  cat  TRUE
2 11  dog  TRUE
3 12 fish FALSE

M <- as.matrix(df)

M

     A    B      C      
[1,] "10" "cat"  "TRUE" 
[2,] "11" "dog"  "TRUE" 
[3,] "12" "fish" "FALSE"

typeof(df[,1])

[1] "integer"

typeof(M[,1])

[1] "character"

But Matrices can take any type of input

M <- matrix(rep(c(TRUE, FALSE), 4), nrow = 4, byrow = TRUE)
M

     [,1]  [,2]
[1,] TRUE FALSE
[2,] TRUE FALSE
[3,] TRUE FALSE
[4,] TRUE FALSE

typeof(M[,1])

[1] "logical"

Matrix opearions

If you’ve taken linear algebra, you’ve probably worked with matrices before. These same matrix operations can be done in R

Basic operations

We can do any of the mathematical operations for a matrix and one value. For example, we can add 5 to all values in a matrix, or subtract 2, or divide by 10

M <- matrix(c(10:21), nrow = 4, byrow = TRUE)
M

     [,1] [,2] [,3]
[1,]   10   11   12
[2,]   13   14   15
[3,]   16   17   18
[4,]   19   20   21

M + 1

     [,1] [,2] [,3]
[1,]   11   12   13
[2,]   14   15   16
[3,]   17   18   19
[4,]   20   21   22

M + 2

     [,1] [,2] [,3]
[1,]   12   13   14
[2,]   15   16   17
[3,]   18   19   20
[4,]   21   22   23

M - 5

     [,1] [,2] [,3]
[1,]    5    6    7
[2,]    8    9   10
[3,]   11   12   13
[4,]   14   15   16

M / 3

         [,1]     [,2] [,3]
[1,] 3.333333 3.666667    4
[2,] 4.333333 4.666667    5
[3,] 5.333333 5.666667    6
[4,] 6.333333 6.666667    7

M * 10

     [,1] [,2] [,3]
[1,]  100  110  120
[2,]  130  140  150
[3,]  160  170  180
[4,]  190  200  210

We can also provide a vector or another matrix to perform element-wise functions with.

vector <- c(2, 3, 4, 5)

M + vector

     [,1] [,2] [,3]
[1,]   12   13   14
[2,]   16   17   18
[3,]   20   21   22
[4,]   24   25   26

Here you can see that each element of the vector is added to a row ie element 1 is added to row 1, element 2 is added to row 2, etc.

The same is true for subtraction

M - vector

     [,1] [,2] [,3]
[1,]    8    9   10
[2,]   10   11   12
[3,]   12   13   14
[4,]   14   15   16

And multiplication and division

M / vector

         [,1]     [,2] [,3]
[1,] 5.000000 5.500000  6.0
[2,] 4.333333 4.666667  5.0
[3,] 4.000000 4.250000  4.5
[4,] 3.800000 4.000000  4.2

M * vector

     [,1] [,2] [,3]
[1,]   20   22   24
[2,]   39   42   45
[3,]   64   68   72
[4,]   95  100  105

What happens if there are a different number of rows as elements in the vector?

vector <- c(2, 3, 4)

M

     [,1] [,2] [,3]
[1,]   10   11   12
[2,]   13   14   15
[3,]   16   17   18
[4,]   19   20   21

M + vector

     [,1] [,2] [,3]
[1,]   12   14   16
[2,]   16   18   17
[3,]   20   19   21
[4,]   21   23   25

Note how the vector just gets reused, no error is thrown.

We can also perform these operations on two matrices

M1 <- matrix(c(10:21), nrow = 4, byrow = TRUE)
M2 <- matrix(c(110:121), nrow =4, byrow = TRUE)

M1

     [,1] [,2] [,3]
[1,]   10   11   12
[2,]   13   14   15
[3,]   16   17   18
[4,]   19   20   21

M2

     [,1] [,2] [,3]
[1,]  110  111  112
[2,]  113  114  115
[3,]  116  117  118
[4,]  119  120  121

M1 + M2

     [,1] [,2] [,3]
[1,]  120  122  124
[2,]  126  128  130
[3,]  132  134  136
[4,]  138  140  142

Note how elements in the same position of each matrix are added together

Note this also is true of vectors

v1 <- c(1,2,3)
v2 <- c(4)

v1 + v2

[1] 5 6 7

v3 <- c(5, 6)
v1 + v3

[1] 6 8 8

v4 <- c(10, 11, 12, 13, 14, 15)

v1 + v4

[1] 11 13 15 14 16 18

Exercise Multiply, subtract, and divide the two matrices M1 and M2

# TODO multiply, subtract and divide M1 and M2

Matrix multiplication

We will only briefly touch on matrix multiplication, but one reason matrices are very important in R is that you can perform multiplication with them. Exactly how this is done is explained nicely in a math is fun tutorial.

Let’s try to multiply two matrices together. Remember our first matrix has 4 rows and 3 columns:

dim(M)

[1] 4 3

So our new matrix must have 3 rows

M2 <- matrix(c(5:19), nrow = 3, byrow = TRUE)
M2

     [,1] [,2] [,3] [,4] [,5]
[1,]    5    6    7    8    9
[2,]   10   11   12   13   14
[3,]   15   16   17   18   19

Let’s perform matrix multiplication with these

M %*% M2

     [,1] [,2] [,3] [,4] [,5]
[1,]  340  373  406  439  472
[2,]  430  472  514  556  598
[3,]  520  571  622  673  724
[4,]  610  670  730  790  850

Performing functions on a matrix

So far, a matrix has looked a lot like a dataframe with some limitations. One of the places where matrices become the most useful is performing statistical functions because all items in a matrix are of the same type.

For this next section, let’s use some data that I downloaded from the social security office. This has the top 100 boy and girl names by state for 2020.

We can now read in the data and convert it to a matrix

names_mat <- read.csv(here("class_8-10_data", "boy_name_counts.csv"),
                           row.names = 1) %>%
  as.matrix()

names_mat[1:5, 1:5]

        Alabama Alaska Arizona Arkansas California
William     366     36     174      152       1021
James       304     34     215      105       1148
John        267     20      97       94        623
Elijah      254     42     284      143       1586
Noah        243     35     397      138       2625

Above you can see that we have the number of males with each name in each state. Looking at the structure, we can see that it is an integer matrix.

str(names_mat)

 int [1:40, 1:20] 366 304 267 254 243 207 187 183 173 166 ...
 - attr(*, "dimnames")=List of 2
  ..$ : chr [1:40] "William" "James" "John" "Elijah" ...
  ..$ : chr [1:20] "Alabama" "Alaska" "Arizona" "Arkansas" ...

We can now explore this data set using many of the functions you have already learned such as rowSums and colSums

Basic functions

First, lets find the sum for all of the rows - how many total babies were named each name?

rowSums(names_mat)

  William     James      John    Elijah      Noah      Liam     Mason 
     5067      5005      3295      5921      8118      8512      4040 
   Oliver     Henry   Jackson    Samuel     Jaxon     Asher   Grayson 
     5961      4121      3599      3549      2571      3212      2938 
     Levi   Michael    Carter  Benjamin   Charles     Wyatt    Thomas 
     3717      4061      3017      5265      2427      3255      2475 
    Aiden      Luke     David      Owen    Daniel     Logan    Joseph 
     3872      3230      3554      3369      4223      3933      3356 
    Lucas    Joshua      Jack Alexander  Maverick   Gabriel     Ethan 
     4794      2645      3399      4530      2588      3091      4322 
      Eli     Isaac    Hunter      Ezra  Theodore 
     2202      3029      1940      2999      3409

And then the columns - how many babies were included from each state?

colSums(names_mat)

    Alabama      Alaska     Arizona    Arkansas  California 
       5687         986        7371        3428       41256 
   Colorado Connecticut    Delaware     Florida     Georgia 
       6396        4211        1117       21661       11791 
     Hawaii       Idaho    Illinois     Indiana        Iowa 
       1111        2306       13407        8283        3508 
     Kansas    Kentucky   Louisiana       Maine    Maryland 
       3630        5518        4955        1372        6617

What if we want to find the percent of children with a given name across all states (divide the value by the row sum * 100) - what percent of total babies for each name came from each state:

percent_mat <- names_mat / rowSums(names_mat) * 100
percent_mat[1:5, 1:5]

         Alabama    Alaska  Arizona Arkansas California
William 7.223209 0.7104796 3.433985 2.999803   20.14999
James   6.073926 0.6793207 4.295704 2.097902   22.93706
John    8.103187 0.6069803 2.943854 2.852807   18.90744
Elijah  4.289816 0.7093396 4.796487 2.415133   26.78602
Noah    2.993348 0.4311407 4.890367 1.699926   32.33555

Remember from above that division using a vector will divide every element of a row by one value, so we can only do this using rowSums. In a few minutes we will discuss how do do this on the columns.

Summary functions

We can also find the minimum, maximum, mean, and median values of the whole matrix and any column. First, lets get summary data for the whole matrix using summary

summary(names_mat)[ , 1:3]

    Alabama           Alaska         Arizona     
 Min.   : 61.00   Min.   :14.00   Min.   : 84.0  
 1st Qu.: 94.75   1st Qu.:18.00   1st Qu.:143.8  
 Median :125.00   Median :22.50   Median :174.5  
 Mean   :142.18   Mean   :24.65   Mean   :184.3  
 3rd Qu.:163.00   3rd Qu.:28.00   3rd Qu.:191.2  
 Max.   :366.00   Max.   :44.00   Max.   :451.0

You can see that this calculates the min, max, mean, median, and quartiles for the columns.

What if we just want the minimum value for the “Alabama” names? We can run min while subsetting to just the column of interest

min(names_mat[, "Alabama"])

[1] 61

We can do the same for the rows Lets try this for “William”

min(names_mat["William",])

[1] 29

What if we wanted to find the smallest value in the whole matrix?

min(names_mat)

[1] 13

max works the same as min

Exercise Find the maximum value in the for “Noah”

# TODO Find the maximum value for Noah and the whole matrix

We can also find the mean, median, and standard deviation of any part of the matrix

By row:

mean(names_mat["William", ])

[1] 253.35

median(names_mat["William", ])

[1] 178

sd(names_mat["William", ])

[1] 239.8189

By column:

mean(names_mat[ , "Alabama"])

[1] 142.175

median(names_mat[ , "Alabama"])

[1] 125

sd(names_mat[ , "Alabama"])

[1] 67.66012

Transposition

One important quality of a matrix is being able to transpose it to interchange the rows and columns - here the rows become columns and columns become rows. We transpose using t() to the matrix. Let’s first look at this using the matrix we started with

M <- matrix(c(10:21), nrow = 4, byrow = TRUE)
M

     [,1] [,2] [,3]
[1,]   10   11   12
[2,]   13   14   15
[3,]   16   17   18
[4,]   19   20   21

t(M)

     [,1] [,2] [,3] [,4]
[1,]   10   13   16   19
[2,]   11   14   17   20
[3,]   12   15   18   21

Note that the output of transposing either a matrix or a data frame will be a matrix (because the type within a column of a data frame must be the same).

df

   A    B     C
1 10  cat  TRUE
2 11  dog  TRUE
3 12 fish FALSE

t(df)

  [,1]   [,2]   [,3]   
A "10"   "11"   "12"   
B "cat"  "dog"  "fish" 
C "TRUE" "TRUE" "FALSE"

str(df)

'data.frame':   3 obs. of  3 variables:
 $ A: int  10 11 12
 $ B: chr  "cat" "dog" "fish"
 $ C: logi  TRUE TRUE FALSE

str(t(df))

 chr [1:3, 1:3] "10" "cat" "TRUE" "11" "dog" "TRUE" "12" "fish" ...
 - attr(*, "dimnames")=List of 2
  ..$ : chr [1:3] "A" "B" "C"
  ..$ : NULL

Note how after the transposition, all items in the original df are now characters and we no longer have a dataframe.

Now let’s try this transposition on the names matrix we’ve been working with

transposed_mat <- t(names_mat)

transposed_mat[1:3,1:3]

        William James John
Alabama     366   304  267
Alaska       36    34   20
Arizona     174   215   97

Note how the columns are now names and the rows are now states.

Remember the note above where we could only divide by the rowSums? Now we can use this transposition to figure out the percent of children in each state with a given name (divide the value by the column sum * 100)

state_percents <- transposed_mat / rowSums(transposed_mat) * 100

state_percents <- t(state_percents)

state_percents[1:3, 1:3]

         Alabama   Alaska  Arizona
William 6.435731 3.651116 2.360602
James   5.345525 3.448276 2.916836
John    4.694918 2.028398 1.315968

Above we did this in several steps, but we can also do in in one step:

state_percents_2 <- t(t(names_mat) / colSums(names_mat)) * 100

identical(state_percents, state_percents_2)

[1] TRUE

Statistical tests

We can also use matrices to perform statistical tests, like t-tests. For instance, are the names Oliver and Noah, or Oliver and Thomas used different amounts?

First, let’s normalize the data to account for the fact that each state reported different numbers of births. To do this normalization, let’s first divide each value by the total number of children reported for that state. Remember, we need to first transpose the matrix to be able to divide by the colSums

normalized_mat <- t(t(names_mat) / colSums(names_mat))

Now that we have normalized values, we can do a t-test.

normalized_mat["Oliver", 1:3]

   Alabama     Alaska    Arizona 
0.03217865 0.04361055 0.04124271

normalized_mat["Noah", 1:3]

   Alabama     Alaska    Arizona 
0.04272903 0.03549696 0.05385972

normalized_mat["Thomas", 1:3]

   Alabama     Alaska    Arizona 
0.02092492 0.02028398 0.01329535

t.test(normalized_mat["Oliver",], normalized_mat["Noah",])


    Welch Two Sample t-test

data:  normalized_mat["Oliver", ] and normalized_mat["Noah", ]
t = -1.1861, df = 36.481, p-value = 0.2433
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -0.010275411  0.002689617
sample estimates:
 mean of x  mean of y 
0.04133411 0.04512700

Between Oliver and Noah, there does not seem to be a difference with the data we have. What about Oliver and Thomas?

t.test(normalized_mat["Oliver",], normalized_mat["Thomas",])


    Welch Two Sample t-test

data:  normalized_mat["Oliver", ] and normalized_mat["Thomas", ]
t = 9.3268, df = 21.544, p-value = 5.121e-09
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 0.01858464 0.02922945
sample estimates:
 mean of x  mean of y 
0.04133411 0.01742706

Here we can see that there is a difference between the mean values for Oliver and Thomas using a t.test

Using dataframes and matricies

For many of the tidyverse functions you’ve learned so far, a data frame is required. Fortunately, it is very easy to change between a data frame and a matrix.

normalized_dat <- data.frame(normalized_mat)

str(normalized_dat)

'data.frame':   40 obs. of  20 variables:
 $ Alabama    : num  0.0644 0.0535 0.0469 0.0447 0.0427 ...
 $ Alaska     : num  0.0365 0.0345 0.0203 0.0426 0.0355 ...
 $ Arizona    : num  0.0236 0.0292 0.0132 0.0385 0.0539 ...
 $ Arkansas   : num  0.0443 0.0306 0.0274 0.0417 0.0403 ...
 $ California : num  0.0247 0.0278 0.0151 0.0384 0.0636 ...
 $ Colorado   : num  0.0369 0.0364 0.0211 0.0317 0.0408 ...
 $ Connecticut: num  0.0356 0.0359 0.0302 0.024 0.0503 ...
 $ Delaware   : num  0.0269 0.0322 0.0251 0.0403 0.0466 ...
 $ Florida    : num  0.0244 0.0278 0.018 0.0436 0.0608 ...
 $ Georgia    : num  0.0469 0.0369 0.0317 0.0446 0.0494 ...
 $ Hawaii     : num  0.0261 0.0297 0.0252 0.0342 0.0513 ...
 $ Idaho      : num  0.0412 0.0399 0.0173 0.0351 0.0308 ...
 $ Illinois   : num  0.0345 0.0334 0.023 0.0309 0.053 ...
 $ Indiana    : num  0.0333 0.0321 0.0164 0.0396 0.0391 ...
 $ Iowa       : num  0.0425 0.0296 0.0154 0.0299 0.0371 ...
 $ Kansas     : num  0.0342 0.0336 0.0226 0.0358 0.0372 ...
 $ Kentucky   : num  0.0448 0.0399 0.0268 0.0419 0.0379 ...
 $ Louisiana  : num  0.0367 0.0349 0.0367 0.0478 0.044 ...
 $ Maine      : num  0.0277 0.0248 0.016 0.0255 0.035 ...
 $ Maryland   : num  0.0328 0.0378 0.0213 0.0293 0.0533 ...

Once we can move between matrices and data frames, we can start to tidy our data for plotting purposes. Let’s plot the distribution of name usage as a violin plot. Here we want the counts to be the y axis and the names to be the y axis.

The first thing we need to do is make our matrix into a data frame

names_dat <- data.frame(names_mat)

Next, we will want the names to be a column rather than the row names. We can do this using $ or tibble::rownames_to_column

names_dat <- rownames_to_column(names_dat, "name")

names_dat[1:3,1:3]

     name Alabama Alaska
1 William     366     36
2   James     304     34
3    John     267     20

# To set using $
# names_dat$name <- rownames(names_dat)

Next, we need to pivot_longer from tidyr. We want to take everything but the names column

pivot_columns <- colnames(names_dat)[colnames(names_dat) != "name"]

names_dat <- pivot_longer(names_dat, cols = all_of(pivot_columns),
                          names_to = "state", values_to = "count")

Note, we can use the pipe %>% from dplyr to put all of this into one statement.

# Here we will specify the columns to keep first
pivot_columns <- colnames(names_mat)

names_dat <- names_mat %>% 
  data.frame %>% 
  rownames_to_column("name") %>% 
  pivot_longer(cols = all_of(pivot_columns), 
               names_to = "state", values_to = "count")

With this new data frame, we can now plot the distribution of names

# I first set the theme
theme_set(theme_classic(base_size = 10))

ggplot(names_dat, aes(x = name, y = count,
                              fill = name)) + 
  geom_violin() +
  theme(axis.text.x = element_text(angle = 90,
                                   vjust = 0.5, hjust=1)) # rotate x axis

There are a few outliers here, almost certainly California. As we discussed above, normalizing the data helps put everything onto the same scale.

Exercise Can you make the same plot as above but use our normalized values?

# TODO make plot above with normalized values
# Hint start with normalized_dat

I’m a biologist, why should I care?

Many of your datasets can be analyzed using matrices
- If you analyze RNA-seq data, all of the processing and differential testing will be done in a matrix through DESeq2
- If you analyze protein data, that is best done in a matrix
- Single cell RNA-seq data is analyzed using matrices, especially sparse matrices
- Even just many measurements can be analyzed using a matrix
Matrices are efficient object types for most types of analysis you would want to do

Acknowldgements and additional references

The content of this class borrows heavily from previous tutorials:

Matrices
- R example
- Quick tips