RNAseq QC

Matthew Taliaferro

RNA Bioscience Initiative | CU Anschutz

2024-10-21

Learning Objectives

By the end of the class, you should be able to:

  • Understand how salmon can be used to quantify gene expression given an RNAseq dataset

  • Collapse transcript-level quantifications to gene level quantifications using tximport

  • Deduce the relationships of samples to each other using hierarchical clustering and PCA

Rigor & Reproducibility

As with all computational experiments (yes, they are experiments, don’t let your pipette-toting friends tell you otherwise), keeping track of what you did is key. In the old days, I kept a written notebook of commands that I ran. Sounds silly, but there were many times that I went back to that notebook to see exactly what the parameters were for a given run using a piece of software.

Today, there are better options. You are using one of the better ones right now. Notebooks, including RMarkdown (mainly for R) and Jupyter (mainly for Python), are a great way to keep track of what you did as well as give justification or explanation for your analyses using plain ’ol English.

Trust me, one day you will be glad you used them. The Methods section of your paper is never fun to write without them.

Overview

In this class, we will examine RNAseq data collected over a time-course of differentiation from mouse embryonic stem cells to cortical glutamatergic neurons (Hubbard et al, F1000 Research (2013)). In this publication, the authors differentiated mESCs to neurons using a series of in vitro culture steps over a period of 37 days. During this timecourse, samples were extracted at selected intervals for transcriptome analysis. Importantly, for each timepoint, either 3 or 4 samples were taken for RNA extraction, library preparation and sequencing. This allows us to efficiently use the statistical frameworks provided by the DESeq2 package to identify genes whose RNA expression changes across the timecourse.

Experiment design

Cells were grown in generic differentiation-promoting media (LIF-) for 8 days until aggreates were dissociated and replated in neuronal differentiation media. This day of replating was designated as in vitro day 0 (DIV0). The timepoints taken before this replating therefore happened at “negative” times (DIV-8 and DIV-4). Because naming files with dashes or minus signs can cause problems, these samples are referred to as DIVminus8 and DIVminus4. Following the replating, samples were taken at days 1, 7, 16, 21, and 28 (DIV1, DIV7, DIV16, DIV21, and DIV28).

QC overview

Today we will focus on some quality control steps that are good ideas to do for every RNAseq dataset you encounter, whether produced by yourself or someone else.

Goal

Today we will focus on some Quality Control steps that are good ideas to do for every RNA-seq data set you encounter, whether produced by yourself or someone else.

Quantifying transcript expression with salmon

We recently learned about the RNAseq quantification tool salmon. We won’t rehash the details here about how salmon works. For our purposes, we just need to know that salmon reads in a fastq file of sequencing reads and a fasta file of transcript sequences to be quantified. Let’s take a look at this fasta file of transcripts:

head -n 20 data/block-rna/gencodecomprehensive.vM17.allcdna_subset.fa

Looks like we have ensembl transcript IDs, which is a good idea. I can tell because they start with ‘ENS’. Using ensembl IDs as transcript names will allow us to later collate transcript expression levels into gene expression levels using a database that relates transcripts and genes. More on that later.

Making a transcriptome index

The first step in quantifying these transcripts is to make an index from them. This is done as follows:

salmon index -t <transcripts.fa> -i <transcripts.idx> --type quasi -k <k>

Here, transcripts.fa is a path to our fasta file, transcripts.idx is the name of the index that will be created, and k is the length of the kmers that will be used in the hash table related kmers and transcripts. k is the length of the minimum accepted match for a kmer in a read and a kmer in a transcript. Longer kmers (higher values of k) will therefore be more stringent, and lowering k may improve mapping sensitivity at the cost of some specificity. You may also see here how read lengths can influence what value for k you should choose.

Consider an experiment where we had 25 nt reads (this was true wayyyyyy back in the old, dark days of high-throughput sequencing). What’s going to happen if I quantify these reads using an index where the kmer size was set to 29? Well, nothing will align. The index has represented the transcriptome in 29 nt chunks. However, no read will match to these 29mers because there are no 29mers in these reads! As a general rule of thumb, for reads 75 nt and longer (which is the bulk of the data produced nowadays), a good value for k that maximizes both specificity and sensitivity is 31. However, datasets that you may retrieve from the internet, particularly older ones, may have shorter read lengths, so keep this is mind when defining k.

Quantifying reads against this index

Once we have our index, we can quantify transcripts in the index using reads from our fastq files.

salmon quant --libType A -p 8 --seqBias --gcBias --validateMappings -1 <forwardreads.fastq> -2 <reversereads.fastq> -o <outputname> --index <transcripts.idx>

In this command, our forward and reverse read fastq files are supplied to -1 and -2, respectively. If the experiment produced single end reads, -2 is omitted. <transcripts.idx> is the path to the index produced in the previous step. I’m not going to go through the rest of the flags used here, but their meanings as well as other options can be found here.

Salmon outputs

Let’s take a look at what salmon spits out. The first file we will look at is a log that is found at /logs/salmon_quant.log. This file contains info about the quantification, but there’s one line of this file in particular that we are interested in. It lets us know how many of the reads in the fastq file that salmon found a home for in the transcriptome fasta.

less data/block-rna/differentiation_salmonouts/DIV0.Rep1/logs/salmon_quant.log

There are a lot of lines in this file, but really only one that we are interested in. We want the one that tells us the “Mapping rate.” How could we easily and efficiently look at the mapping rates of all our samples? Grep!

#Get the mapping rates for all samples
#In each log file, the line that we are interested in contains the string 'Mapping ' (notice the space)
grep 'Mapping ' data/block-rna/differentiation_salmonouts/*/logs/salmon_quant.log

Moving from transcript quantifications to gene quantifications

As we discussed, salmon quantifies transcripts, not genes. However, genes are made up of transcripts, so we can calculate gene expression values from transcript expression values if we knew which transcripts belonged to which genes.

Relating genes and transcripts

We can get this relationships between transcripts and genes through biomaRt.

biomaRt has many tables that relate genes, transcripts, and other useful data include gene biotypes and gene ontology categories, even across species. Let’s use it here to get a table of genes and transcripts for the mouse genome.

               biomart                version
1 ENSEMBL_MART_ENSEMBL      Ensembl Genes 112
2   ENSEMBL_MART_MOUSE      Mouse strains 112
3     ENSEMBL_MART_SNP  Ensembl Variation 112
4 ENSEMBL_MART_FUNCGEN Ensembl Regulation 112

I encourage you to see what is in each mart, but for now we are only going to use ENSEMBL_MART_ENSEMBL.

Using biomaRt

Alright, we’ve chosen our mart. What data sets are available in this mart?

dataset description version
abrachyrhynchus_gene_ensembl Pink-footed goose genes (ASM259213v1) ASM259213v1
acalliptera_gene_ensembl Eastern happy genes (fAstCal1.3) fAstCal1.3
acarolinensis_gene_ensembl Green anole genes (AnoCar2.0v2) AnoCar2.0v2
acchrysaetos_gene_ensembl Golden eagle genes (bAquChr1.2) bAquChr1.2
acitrinellus_gene_ensembl Midas cichlid genes (Midas_v5) Midas_v5
amelanoleuca_gene_ensembl Giant panda genes (ASM200744v2) ASM200744v2
amexicanus_gene_ensembl Mexican tetra genes (Astyanax_mexicanus-2.0) Astyanax_mexicanus-2.0
anancymaae_gene_ensembl Ma's night monkey genes (Anan_2.0) Anan_2.0
aocellaris_gene_ensembl Clown anemonefish genes (ASM2253959v1) ASM2253959v1
apercula_gene_ensembl Orange clownfish genes (Nemo_v1) Nemo_v1
aplatyrhynchos_gene_ensembl Mallard genes (ASM874695v1) ASM874695v1
apolyacanthus_gene_ensembl Spiny chromis genes (ASM210954v1) ASM210954v1
applatyrhynchos_gene_ensembl Duck genes (CAU_duck1.0) CAU_duck1.0
atestudineus_gene_ensembl Climbing perch genes (fAnaTes1.3) fAnaTes1.3
bbbison_gene_ensembl American bison genes (Bison_UMD1.0) Bison_UMD1.0
bgrunniens_gene_ensembl Domestic yak genes (LU_Bosgru_v3.0) LU_Bosgru_v3.0
bihybrid_gene_ensembl Hybrid - Bos Indicus genes (UOA_Brahman_1) UOA_Brahman_1
bmusculus_gene_ensembl Blue whale genes (mBalMus1.v2) mBalMus1.v2
bmutus_gene_ensembl Wild yak genes (BosGru_v2.0) BosGru_v2.0
bsplendens_gene_ensembl Siamese fighting fish genes (fBetSpl5.2) fBetSpl5.2
btaurus_gene_ensembl Cow genes (ARS-UCD1.3) ARS-UCD1.3
cabingdonii_gene_ensembl Abingdon island giant tortoise genes (ASM359739v1) ASM359739v1
catys_gene_ensembl Sooty mangabey genes (Caty_1.0) Caty_1.0
cauratus_gene_ensembl Goldfish genes (ASM336829v1) ASM336829v1
cccarpio_gene_ensembl Common carp genes (Cypcar_WagV4.0) Cypcar_WagV4.0
cdromedarius_gene_ensembl Arabian camel genes (CamDro2) CamDro2
celegans_gene_ensembl Caenorhabditis elegans (Nematode, N2) genes (WBcel235) WBcel235
cgchok1gshd_gene_ensembl Chinese hamster CHOK1GS genes (CHOK1GS_HDv1) CHOK1GS_HDv1
cgobio_gene_ensembl Channel bull blenny genes (fCotGob3.1) fCotGob3.1
charengus_gene_ensembl Atlantic herring genes (Ch_v2.0.2v2) Ch_v2.0.2v2
chircus_gene_ensembl Goat genes (ARS1) ARS1
choffmanni_gene_ensembl Sloth genes (choHof1) choHof1
chyarkandensis_gene_ensembl Yarkand deer genes (CEY_v1) CEY_v1
cimitator_gene_ensembl Panamanian white-faced capuchin genes (Cebus_imitator-1.0) Cebus_imitator-1.0
cintestinalis_gene_ensembl C.intestinalis genes (KH) KH
cjacchus_gene_ensembl White-tufted-ear marmoset genes (mCalJac1.pat.X) mCalJac1.pat.X
cjaponica_gene_ensembl Japanese quail genes (Coturnix_japonica_2.0) Coturnix_japonica_2.0
clanigera_gene_ensembl Long-tailed chinchilla genes (ChiLan1.0) ChiLan1.0
cldingo_gene_ensembl Dingo genes (ASM325472v1) ASM325472v1
clfamiliaris_gene_ensembl Dog genes (ROS_Cfam_1.0) ROS_Cfam_1.0
clumpus_gene_ensembl Lumpfish genes (fCycLum1.pri) fCycLum1.pri
cmilii_gene_ensembl Elephant shark genes (Callorhinchus_milii-6.1.3) Callorhinchus_milii-6.1.3
cpbellii_gene_ensembl Painted turtle genes (Chrysemys_picta_bellii-3.0.3) Chrysemys_picta_bellii-3.0.3
cporcellus_gene_ensembl Guinea Pig genes (Cavpor3.0) Cavpor3.0
cporosus_gene_ensembl Australian saltwater crocodile genes (CroPor_comp1) CroPor_comp1
csabaeus_gene_ensembl Vervet-AGM genes (ChlSab1.1) ChlSab1.1
csavignyi_gene_ensembl C.savignyi genes (CSAV 2.0) CSAV 2.0
csemilaevis_gene_ensembl Tongue sole genes (Cse_v1.0) Cse_v1.0
csyrichta_gene_ensembl Tarsier genes (Tarsius_syrichta-2.0.1) Tarsius_syrichta-2.0.1
cvariegatus_gene_ensembl Sheepshead minnow genes (C_variegatus-1.0) C_variegatus-1.0
cwagneri_gene_ensembl Chacoan peccary genes (CatWag_v2_BIUU_UCD) CatWag_v2_BIUU_UCD
dclupeoides_gene_ensembl Denticle herring genes (fDenClu1.2) fDenClu1.2
dlabrax_gene_ensembl European seabass genes (dlabrax2021) dlabrax2021
dleucas_gene_ensembl Beluga whale genes (ASM228892v3) ASM228892v3
dmelanogaster_gene_ensembl Drosophila melanogaster (Fruit fly) genes (BDGP6.46) BDGP6.46
dnovemcinctus_gene_ensembl Armadillo genes (Dasnov3.0) Dasnov3.0
dordii_gene_ensembl Kangaroo rat genes (Dord_2.0) Dord_2.0
drerio_gene_ensembl Zebrafish genes (GRCz11) GRCz11
easinus_gene_ensembl Donkey genes (ASM1607732v2) ASM1607732v2
eburgeri_gene_ensembl Hagfish genes (Eburgeri_3.2) Eburgeri_3.2
ecaballus_gene_ensembl Horse genes (EquCab3.0) EquCab3.0
ecalabaricus_gene_ensembl Reedfish genes (fErpCal1.1) fErpCal1.1
eelectricus_gene_ensembl Electric eel genes (fEleEle1.pri) fEleEle1.pri
eeuropaeus_gene_ensembl Hedgehog genes (eriEur1) eriEur1
elucius_gene_ensembl Northern pike genes (fEsoLuc1.pri) fEsoLuc1.pri
etelfairi_gene_ensembl Lesser hedgehog tenrec genes (TENREC) TENREC
falbicollis_gene_ensembl Collared flycatcher genes (FicAlb1.5) FicAlb1.5
fcatus_gene_ensembl Cat genes (Felis_catus_9.0) Felis_catus_9.0
fheteroclitus_gene_ensembl Mummichog genes (Fundulus_heteroclitus-3.0.2) Fundulus_heteroclitus-3.0.2
gaculeatus_gene_ensembl Stickleback genes (GAculeatus_UGA_version5) GAculeatus_UGA_version5
gevgoodei_gene_ensembl Goodes thornscrub tortoise genes (rGopEvg1_v1.p) rGopEvg1_v1.p
gfortis_gene_ensembl Medium ground-finch genes (GeoFor_1.0) GeoFor_1.0
ggallus_gene_ensembl Chicken genes (bGalGal1.mat.broiler.GRCg7b) bGalGal1.mat.broiler.GRCg7b
ggorilla_gene_ensembl Gorilla genes (gorGor4) gorGor4
gmorhua_gene_ensembl Atlantic cod genes (gadMor3.0) gadMor3.0
hburtoni_gene_ensembl Burton's mouthbrooder genes (AstBur1.0) AstBur1.0
hcomes_gene_ensembl Tiger tail seahorse genes (H_comes_QL1_v1) H_comes_QL1_v1
hgfemale_gene_ensembl Naked mole-rat female genes (Naked_mole-rat_maternal) Naked_mole-rat_maternal
hhucho_gene_ensembl Huchen genes (ASM331708v1) ASM331708v1
hsapiens_gene_ensembl Human genes (GRCh38.p14) GRCh38.p14
ipunctatus_gene_ensembl Channel catfish genes (ASM400665v3) ASM400665v3
itridecemlineatus_gene_ensembl Squirrel genes (SpeTri2.0) SpeTri2.0
jjaculus_gene_ensembl Lesser Egyptian jerboa genes (JacJac1.0) JacJac1.0
kmarmoratus_gene_ensembl Mangrove rivulus genes (ASM164957v1) ASM164957v1
lafricana_gene_ensembl Elephant genes (Loxafr3.0) Loxafr3.0
lbergylta_gene_ensembl Ballan wrasse genes (BallGen_V1) BallGen_V1
lcalcarifer_gene_ensembl Barramundi perch genes (ASB_HGAPassembly_v1) ASB_HGAPassembly_v1
lchalumnae_gene_ensembl Coelacanth genes (LatCha1) LatCha1
lcrocea_gene_ensembl Large yellow croaker genes (L_crocea_2.0) L_crocea_2.0
llaticaudata_gene_ensembl Blue-ringed sea krait genes (latLat_1.0) latLat_1.0
lleishanense_gene_ensembl Leishan spiny toad genes (ASM966780v1) ASM966780v1
loculatus_gene_ensembl Spotted gar genes (LepOcu1) LepOcu1
marmatus_gene_ensembl Zig-zag eel genes (fMasArm1.2) fMasArm1.2
mauratus_gene_ensembl Golden Hamster genes (MesAur1.0) MesAur1.0
mcaroli_gene_ensembl Ryukyu mouse genes (CAROLI_EIJ_v1.1) CAROLI_EIJ_v1.1
mdomestica_gene_ensembl Opossum genes (ASM229v1) ASM229v1
mfascicularis_gene_ensembl Crab-eating macaque genes (Macaca_fascicularis_6.0) Macaca_fascicularis_6.0
mgallopavo_gene_ensembl Turkey genes (Turkey_5.1) Turkey_5.1
mleucophaeus_gene_ensembl Drill genes (Mleu.le_1.0) Mleu.le_1.0
mlucifugus_gene_ensembl Microbat genes (Myoluc2.0) Myoluc2.0
mmmarmota_gene_ensembl Alpine marmot genes (marMar2.1) marMar2.1
mmonoceros_gene_ensembl Narwhal genes (NGI_Narwhal_1) NGI_Narwhal_1
mmoschiferus_gene_ensembl Siberian musk deer genes (MosMos_v2_BIUU_UCD) MosMos_v2_BIUU_UCD
mmulatta_gene_ensembl Macaque genes (Mmul_10) Mmul_10
mmurdjan_gene_ensembl Pinecone soldierfish genes (fMyrMur1.1) fMyrMur1.1
mmurinus_gene_ensembl Mouse Lemur genes (Mmur_3.0) Mmur_3.0
mmusculus_gene_ensembl Mouse genes (GRCm39) GRCm39
mnemestrina_gene_ensembl Pig-tailed macaque genes (Mnem_1.0) Mnem_1.0
mochrogaster_gene_ensembl Prairie vole genes (MicOch1.0) MicOch1.0
mpahari_gene_ensembl Shrew mouse genes (PAHARI_EIJ_v1.1) PAHARI_EIJ_v1.1
mpfuro_gene_ensembl Ferret genes (MusPutFur1.0) MusPutFur1.0
mspicilegus_gene_ensembl Steppe mouse genes (MUSP714) MUSP714
mspretus_gene_ensembl Algerian mouse genes (SPRET_EiJ_v1) SPRET_EiJ_v1
mzebra_gene_ensembl Zebra mbuna genes (M_zebra_UMD2a) M_zebra_UMD2a
nbrichardi_gene_ensembl Lyretail cichlid genes (NeoBri1.0) NeoBri1.0
neugenii_gene_ensembl Wallaby genes (Meug_1.0) Meug_1.0
nfurzeri_gene_ensembl Turquoise killifish genes (Nfu_20140520) Nfu_20140520
ngalili_gene_ensembl Upper Galilee mountains blind mole rat genes (S.galili_v1.0) S.galili_v1.0
nleucogenys_gene_ensembl Gibbon genes (Nleu_3.0) Nleu_3.0
nnaja_gene_ensembl Indian cobra genes (Nana_v5) Nana_v5
nscutatus_gene_ensembl Mainland tiger snake genes (TS10Xv2-PRI) TS10Xv2-PRI
nvison_gene_ensembl American mink genes (NNQGG.v01) NNQGG.v01
oanatinus_gene_ensembl Platypus genes (mOrnAna1.p.v1) mOrnAna1.p.v1
oarambouillet_gene_ensembl Sheep genes (ARS-UI_Ramb_v2.0) ARS-UI_Ramb_v2.0
oaries_gene_ensembl Sheep (texel) genes (Oar_v3.1) Oar_v3.1
ocuniculus_gene_ensembl Rabbit genes (OryCun2.0) OryCun2.0
odegus_gene_ensembl Degu genes (OctDeg1.0) OctDeg1.0
ogarnettii_gene_ensembl Bushbaby genes (OtoGar3) OtoGar3
ojavanicus_gene_ensembl Javanese ricefish genes (OJAV_1.1) OJAV_1.1
okisutch_gene_ensembl Coho salmon genes (Okis_V2) Okis_V2
olatipes_gene_ensembl Japanese medaka HdrR genes (ASM223467v1) ASM223467v1
omelastigma_gene_ensembl Indian medaka genes (Om_v0.7.RACA) Om_v0.7.RACA
omykiss_gene_ensembl Rainbow trout genes (USDA_OmykA_1.1) USDA_OmykA_1.1
oniloticus_gene_ensembl Nile tilapia genes (O_niloticus_UMD_NMBU) O_niloticus_UMD_NMBU
oprinceps_gene_ensembl Pika genes (OchPri2.0-Ens) OchPri2.0-Ens
osinensis_gene_ensembl Chinese medaka genes (ASM858656v1) ASM858656v1
otshawytscha_gene_ensembl Chinook salmon genes (Otsh_v2.0) Otsh_v2.0
pabelii_gene_ensembl Sumatran orangutan genes (Susie_PABv2) Susie_PABv2
panubis_gene_ensembl Olive baboon genes (Panubis1.0) Panubis1.0
pcapensis_gene_ensembl Hyrax genes (proCap1) proCap1
pcatodon_gene_ensembl Sperm whale genes (ASM283717v2) ASM283717v2
pcinereus_gene_ensembl Koala genes (phaCin_unsw_v4.1) phaCin_unsw_v4.1
pcoquereli_gene_ensembl Coquerel's sifaka genes (Pcoq_1.0) Pcoq_1.0
pformosa_gene_ensembl Amazon molly genes (Poecilia_formosa-5.1.2) Poecilia_formosa-5.1.2
pkingsleyae_gene_ensembl Paramormyrops kingsleyae genes (PKINGS_0.1) PKINGS_0.1
platipinna_gene_ensembl Sailfin molly genes (P_latipinna-1.0) P_latipinna-1.0
pleo_gene_ensembl Lion genes (PanLeo1.0) PanLeo1.0
pmajor_gene_ensembl Great Tit genes (Parus_major1.1) Parus_major1.1
pmarinus_gene_ensembl Lamprey genes (Pmarinus_7.0) Pmarinus_7.0
pmbairdii_gene_ensembl Northern American deer mouse genes (HU_Pman_2.1) HU_Pman_2.1
pmuralis_gene_ensembl Common wall lizard genes (PodMur_1.0) PodMur_1.0
pnattereri_gene_ensembl Red-bellied piranha genes (fPygNat1.pri) fPygNat1.pri
pnyererei_gene_ensembl Makobe Island cichlid genes (PunNye1.0) PunNye1.0
ppaniscus_gene_ensembl Bonobo genes (panpan1.1) panpan1.1
ppardus_gene_ensembl Leopard genes (PanPar1.0) PanPar1.0
preticulata_gene_ensembl Guppy genes (Guppy_female_1.0_MT) Guppy_female_1.0_MT
psimus_gene_ensembl Greater bamboo lemur genes (Prosim_1.0) Prosim_1.0
psinensis_gene_ensembl Chinese softshell turtle genes (PelSin_1.0) PelSin_1.0
psinus_gene_ensembl Vaquita genes (mPhoSin1.pri) mPhoSin1.pri
ptaltaica_gene_ensembl Tiger genes (PanTig1.0) PanTig1.0
ptextilis_gene_ensembl Eastern brown snake genes (EBS10Xv2-PRI) EBS10Xv2-PRI
ptroglodytes_gene_ensembl Chimpanzee genes (Pan_tro_3.0) Pan_tro_3.0
pvampyrus_gene_ensembl Megabat genes (pteVam1) pteVam1
rbieti_gene_ensembl Black snub-nosed monkey genes (ASM169854v1) ASM169854v1
rferrumequinum_gene_ensembl Greater horseshoe bat genes (mRhiFer1_v1.p) mRhiFer1_v1.p
rnorvegicus_gene_ensembl Rat genes (mRatBN7.2) mRatBN7.2
rroxellana_gene_ensembl Golden snub-nosed monkey genes (Rrox_v1) Rrox_v1
saraneus_gene_ensembl Shrew genes (sorAra1) sorAra1
saurata_gene_ensembl Gilthead seabream genes (fSpaAur1.1) fSpaAur1.1
sbboliviensis_gene_ensembl Bolivian squirrel monkey genes (SaiBol1.0) SaiBol1.0
scanaria_gene_ensembl Common canary genes (SCA1) SCA1
scaustralis_gene_ensembl African ostrich genes (ASM69896v1) ASM69896v1
scerevisiae_gene_ensembl Saccharomyces cerevisiae genes (R64-1-1) R64-1-1
sdumerili_gene_ensembl Greater amberjack genes (Sdu_1.0) Sdu_1.0
sformosus_gene_ensembl Asian bonytongue genes (fSclFor1.1) fSclFor1.1
shabroptila_gene_ensembl Kakapo genes (bStrHab1_v1.p) bStrHab1_v1.p
sharrisii_gene_ensembl Tasmanian devil genes (mSarHar1.11) mSarHar1.11
sldorsalis_gene_ensembl Yellowtail amberjack genes (Sedor1) Sedor1
slucioperca_gene_ensembl Pike-perch genes (SLUC_FBN_1) SLUC_FBN_1
smaximus_gene_ensembl Turbot genes (ASM1334776v1) ASM1334776v1
smerianae_gene_ensembl Argentine black and white tegu genes (HLtupMer3) HLtupMer3
spartitus_gene_ensembl Bicolor damselfish genes (Stegastes_partitus-1.0.2) Stegastes_partitus-1.0.2
spunctatus_gene_ensembl Tuatara genes (ASM311381v1) ASM311381v1
ssalar_gene_ensembl Atlantic salmon genes (Ssal_v3.1) Ssal_v3.1
ssbamei_gene_ensembl Pig - Bamei genes (Bamei_pig_v1) Bamei_pig_v1
ssberkshire_gene_ensembl Pig - Berkshire genes (Berkshire_pig_v1) Berkshire_pig_v1
sscrofa_gene_ensembl Pig genes (Sscrofa11.1) Sscrofa11.1
sshampshire_gene_ensembl Pig - Hampshire genes (Hampshire_pig_v1) Hampshire_pig_v1
ssjinhua_gene_ensembl Pig - Jinhua genes (Jinhua_pig_v1) Jinhua_pig_v1
sslandrace_gene_ensembl Pig - Landrace genes (Landrace_pig_v1) Landrace_pig_v1
sslargewhite_gene_ensembl Pig - Largewhite genes (Large_White_v1) Large_White_v1
ssmeishan_gene_ensembl Pig - Meishan genes (Meishan_pig_v1) Meishan_pig_v1
sspietrain_gene_ensembl Pig - Pietrain genes (Pietrain_pig_v1) Pietrain_pig_v1
ssrongchang_gene_ensembl Pig - Rongchang genes (Rongchang_pig_v1) Rongchang_pig_v1
sstibetan_gene_ensembl Pig - Tibetan genes (Tibetan_Pig_v2) Tibetan_Pig_v2
ssusmarc_gene_ensembl Pig USMARC genes (USMARCv1.0) USMARCv1.0
sswuzhishan_gene_ensembl Pig - Wuzhishan genes (minipig_v1.0) minipig_v1.0
strutta_gene_ensembl Brown trout genes (fSalTru1.1) fSalTru1.1
svulgaris_gene_ensembl Eurasian red squirrel genes (mSciVul1.1) mSciVul1.1
tbelangeri_gene_ensembl Tree Shrew genes (tupBel1) tupBel1
tctriunguis_gene_ensembl Three-toed box turtle genes (T_m_triunguis-2.0) T_m_triunguis-2.0
tguttata_gene_ensembl Zebra finch genes (bTaeGut1_v1.p) bTaeGut1_v1.p
tnigroviridis_gene_ensembl Tetraodon genes (TETRAODON 8.0) TETRAODON 8.0
trubripes_gene_ensembl Fugu genes (fTakRub1.2) fTakRub1.2
ttruncatus_gene_ensembl Dolphin genes (turTru1) turTru1
uamericanus_gene_ensembl American black bear genes (ASM334442v1) ASM334442v1
umaritimus_gene_ensembl Polar bear genes (UrsMar_1.0) UrsMar_1.0
uparryii_gene_ensembl Arctic ground squirrel genes (ASM342692v1) ASM342692v1
vpacos_gene_ensembl Alpaca genes (vicPac1) vicPac1
vursinus_gene_ensembl Common wombat genes (bare-nosed_wombat_genome_assembly) bare-nosed_wombat_genome_assembly
vvulpes_gene_ensembl Red fox genes (VulVul2.2) VulVul2.2
xcouchianus_gene_ensembl Monterrey platyfish genes (Xiphophorus_couchianus-4.0.1) Xiphophorus_couchianus-4.0.1
xmaculatus_gene_ensembl Platyfish genes (X_maculatus-5.0-male) X_maculatus-5.0-male
xtropicalis_gene_ensembl Tropical clawed frog genes (UCB_Xtro_10.0) UCB_Xtro_10.0

A lot of stuff for a lot of species! Perhaps we want to limit it to see which ones are relevant to mouse.

dataset description version
mmusculus_gene_ensembl Mouse genes (GRCm39) GRCm39

Using biomaRt

Ah so we probably want the dataset called ‘mmusculus_gene_ensembl’!

Using biomaRt

name description page
ensembl_gene_id Gene stable ID feature_page
ensembl_gene_id_version Gene stable ID version feature_page
ensembl_transcript_id Transcript stable ID feature_page
ensembl_transcript_id_version Transcript stable ID version feature_page
ensembl_peptide_id Protein stable ID feature_page
ensembl_peptide_id_version Protein stable ID version feature_page
ensembl_exon_id Exon stable ID feature_page
description Gene description feature_page
chromosome_name Chromosome/scaffold name feature_page
start_position Gene start (bp) feature_page
end_position Gene end (bp) feature_page
strand Strand feature_page
band Karyotype band feature_page
transcript_start Transcript start (bp) feature_page
transcript_end Transcript end (bp) feature_page
transcription_start_site Transcription start site (TSS) feature_page
transcript_length Transcript length (including UTRs and CDS) feature_page
transcript_tsl Transcript support level (TSL) feature_page
transcript_gencode_basic GENCODE basic annotation feature_page
transcript_appris APPRIS annotation feature_page

Using biomaRt

So there are 2885 rows of goodies about the mouse genome and its relationship to many other genomes. However, you can probably see that the ones that are most useful to us right now are right at the top: ‘ensembl_transcript_id’ and ‘ensembl_gene_id’. We can use those attributes in our mart to make a table relating genes and transcripts.

I’m going to through one more attribute in: external_gene_name. Those are usually more informative than ensembl IDs.

ensembl_transcript_id ensembl_gene_id external_gene_name
ENSMUST00000082387 ENSMUSG00000064336 mt-Tf
ENSMUST00000082388 ENSMUSG00000064337 mt-Rnr1
ENSMUST00000082389 ENSMUSG00000064338 mt-Tv
ENSMUST00000082390 ENSMUSG00000064339 mt-Rnr2
ENSMUST00000082391 ENSMUSG00000064340 mt-Tl1
ENSMUST00000082392 ENSMUSG00000064341 mt-Nd1
ENSMUST00000082393 ENSMUSG00000064342 mt-Ti
ENSMUST00000082394 ENSMUSG00000064343 mt-Tq
ENSMUST00000082395 ENSMUSG00000064344 mt-Tm
ENSMUST00000082396 ENSMUSG00000064345 mt-Nd2
ENSMUST00000082397 ENSMUSG00000064346 mt-Tw
ENSMUST00000082398 ENSMUSG00000064347 mt-Ta
ENSMUST00000082399 ENSMUSG00000064348 mt-Tn
ENSMUST00000082400 ENSMUSG00000064349 mt-Tc
ENSMUST00000082401 ENSMUSG00000064350 mt-Ty
ENSMUST00000082402 ENSMUSG00000064351 mt-Co1
ENSMUST00000082403 ENSMUSG00000064352 mt-Ts1
ENSMUST00000082404 ENSMUSG00000064353 mt-Td
ENSMUST00000082405 ENSMUSG00000064354 mt-Co2
ENSMUST00000082406 ENSMUSG00000064355 mt-Tk

Using biomaRt

Alright this looks good! We are going to split this into two tables. One that contains transcript ID and gene ID, and the other that contains gene ID and gene name.

Getting gene level expression data with tximport

Now that we have our table relating transcripts and genes, we can give it to tximport to have it calculate gene-level expression data from our transcript-level expression data.

First, we have to tell it where the salmon quantification files (the quant.sf.gz files) are. Here’s what our directory structure that contains these files looks like:

Gene expression data with tximport

Gene expression data with tximport

You can see that we got a list of sample names and the absolute path to each sample’s quantification file.

Now we are ready to run tximport

tximport is going to want paths to all the quantification files (salm_dirs) and a table that relates transcripts to genes (t2g). Luckily, we happen to have those exact two things.

Gene expression data with tximport

Notice how we chose lengthscaledTPM for our abundance measurement. This is going to give us TPM values (transcripts per million) for expression in the $abundance slot. Let’s check out what we have now.

ensembl_gene_id DIV0.Rep1 DIV0.Rep2 DIV0.Rep3 DIV1.Rep1 DIV1.Rep2 DIV1.Rep3 DIV1.Rep4 DIV16.Rep1 DIV16.Rep2 DIV16.Rep3 DIV16.Rep4 DIV21.Rep1 DIV21.Rep2 DIV21.Rep3 DIV21.Rep4 DIV28.Rep1 DIV28.Rep2 DIV28.Rep3 DIV28.Rep4 DIV7.Rep1 DIV7.Rep2 DIV7.Rep3 DIV7.Rep4 DIVminus4.Rep1 DIVminus4.Rep2 DIVminus4.Rep3 DIVminus8.Rep1 DIVminus8.Rep2 DIVminus8.Rep3 DIVminus8.Rep4
ENSMUSG00000000001 148.004511 145.539529 164.680183 142.713005 138.042672 137.027499 123.894516 41.689798 41.554126 41.002841 39.522314 28.162815 29.694158 29.333682 27.956845 17.960041 17.979483 18.102228 19.590551 76.088940 76.669016 79.505738 76.038479 106.634101 106.395094 109.997598 92.799612 93.563524 95.500337 101.951963
ENSMUSG00000000003 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
ENSMUSG00000000028 34.180035 36.215924 39.486489 16.894844 15.391383 18.171282 16.654062 1.067147 1.270428 1.109391 1.136970 1.939558 1.003108 1.245186 1.086912 1.303824 0.935135 0.864451 0.862209 6.582942 6.752942 5.547107 5.510259 45.614077 47.851930 47.516125 59.718765 57.981259 57.935839 41.663343
ENSMUSG00000000031 7.213808 6.431925 13.253322 3.301713 2.539161 3.591509 2.798164 93.312776 56.314466 62.958855 68.565195 32.037271 37.465959 40.581328 24.780757 67.818115 64.333586 30.408290 60.198406 151.244212 137.227925 169.615059 150.599359 0.914481 1.402704 1.317939 0.177770 0.215726 0.320948 36.425779
ENSMUSG00000000037 5.877237 5.001733 7.579149 2.861842 3.176129 2.457556 3.591170 0.319281 0.708167 0.578467 0.513434 0.186198 0.260270 0.258075 0.585622 0.615895 0.133394 0.197050 1.005331 2.750903 1.827190 2.446732 2.894751 2.196733 2.309187 2.213591 3.187886 2.507155 2.624140 2.281871
ENSMUSG00000000049 0.000000 0.046208 0.000000 0.000000 0.000000 0.000000 0.411807 0.213589 0.118890 1.024154 0.628760 0.303218 0.449928 0.262019 0.105047 0.567535 0.649720 0.515246 0.360877 0.188254 0.050654 0.277837 0.189814 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
ENSMUSG00000000056 31.553583 28.610456 19.741875 59.046908 51.324548 54.015893 43.897451 36.575044 47.171198 41.459018 38.403920 55.170239 37.441141 44.043899 45.852392 38.906362 40.362321 44.933634 42.356811 41.828967 46.812515 44.172289 46.834670 14.509458 18.656028 16.335093 10.671419 10.666954 12.386288 8.077939
ENSMUSG00000000058 0.420304 0.507728 0.602757 0.357594 0.504091 0.498006 0.565744 0.931596 0.632998 0.606530 0.503290 0.292719 0.224968 0.401341 0.428359 0.490118 0.200389 0.038911 0.399079 0.194399 0.102993 0.063154 0.130644 0.163821 0.162316 0.251082 0.693347 0.201366 0.527565 0.482299
ENSMUSG00000000078 52.604855 64.802703 76.892914 60.446343 60.861211 61.059568 50.159693 32.474649 28.287563 27.391212 29.511742 19.632508 19.942257 20.182975 19.363735 16.845197 16.128839 18.828771 15.732413 52.939918 52.709453 54.334181 50.804497 51.427093 49.900737 53.274561 19.669879 19.472875 20.216492 21.591729
ENSMUSG00000000085 60.617248 59.182914 61.623794 72.657499 71.506733 75.816856 63.507273 36.498680 31.450681 36.983697 34.558348 31.893252 35.692047 31.095232 34.751542 36.339896 32.474173 38.456832 39.182603 52.494911 43.715591 45.720249 46.351620 9.949776 11.703136 9.320286 13.843137 11.699411 13.486194 10.366294
ENSMUSG00000000088 372.546920 364.764405 381.963748 389.617591 388.204349 413.080094 533.720843 882.254461 918.864762 881.567068 952.656933 1064.561469 1018.564149 1023.266589 935.493316 944.628320 1102.492902 1011.428801 993.369153 703.366663 749.229000 711.055019 744.360315 602.694088 539.095644 568.529023 715.919453 723.040511 713.534783 795.564123
ENSMUSG00000000093 18.748818 26.367729 28.740489 18.171262 17.023064 19.681817 21.696499 1.748123 1.159551 1.244226 1.286737 1.859175 1.536283 1.490581 1.446475 3.138448 3.362518 3.197714 3.445495 1.960168 1.812696 2.626110 2.401155 1.641559 1.120127 1.520160 0.123672 0.064585 0.118030 0.062646
ENSMUSG00000000094 0.114764 0.197628 0.237980 0.067150 0.050183 0.025762 0.000000 0.000000 0.000000 0.020946 0.000000 0.023273 0.000000 0.020613 0.020391 0.000000 0.000000 0.000000 0.000000 0.253309 0.359092 0.191826 0.280499 0.644665 0.508711 0.638906 1.610289 2.043695 1.513517 1.061169
ENSMUSG00000000103 0.000000 0.000000 0.019350 0.019932 0.000000 0.000000 0.000000 0.024280 0.000000 0.000000 0.027595 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.022570 0.020990 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.049946
ENSMUSG00000000120 18.187330 17.621063 16.764680 65.095814 63.742121 70.050947 65.437767 7.138900 6.728019 6.471314 7.243701 3.043655 3.779647 4.010854 3.292345 1.055312 1.336741 1.401802 1.323000 7.879629 8.472306 7.826161 8.344422 1.302789 1.494076 1.100786 2.720008 2.666598 3.080750 7.171139
ENSMUSG00000000125 9.426741 9.885487 5.404755 2.614957 1.888344 2.419112 2.543413 1.836920 2.186343 2.081065 2.033197 1.815480 1.714785 1.591194 1.707502 1.388731 1.150249 1.311930 1.489738 2.613648 2.668347 2.770951 2.336912 1.482033 1.217406 1.473157 0.168666 0.267824 0.162984 0.439311
ENSMUSG00000000126 1.215801 0.746274 0.814387 0.893695 0.726967 0.795409 0.811422 1.698931 0.816782 1.232844 1.175287 0.776041 0.671299 0.527801 0.856885 0.881084 0.587611 0.480369 0.599140 1.920124 1.992736 2.643161 2.442442 1.243716 0.787857 1.154595 1.480120 1.685512 1.662030 1.757216
ENSMUSG00000000127 13.113348 14.260035 14.074681 16.162555 17.427828 17.565289 11.863558 12.271221 14.241964 13.551247 14.347367 12.813797 12.260097 11.685471 12.283357 8.849005 8.757533 8.775956 9.371477 16.306853 15.553249 15.896542 15.505971 11.131771 11.417468 12.892073 12.916885 12.422329 12.138589 11.783150
ENSMUSG00000000131 55.485697 54.448611 62.549063 71.704310 71.301569 71.997625 85.422041 53.653944 54.569370 54.041498 53.203853 56.560171 55.599344 57.503726 56.142399 56.062512 54.730314 56.114317 55.657729 64.772660 60.737488 64.559072 65.908557 73.304601 82.616101 81.113216 72.032652 72.128976 70.421851 64.545350
ENSMUSG00000000134 33.480962 36.683455 38.983454 39.609181 45.133664 47.577079 38.693647 19.425951 20.279171 18.258503 19.466904 20.759852 19.514517 20.453484 23.945999 24.145265 23.148770 24.341792 20.775309 26.254515 25.201055 25.298568 24.448624 70.194375 70.469434 72.039951 62.710030 65.426174 62.960782 52.892113
ENSMUSG00000000142 34.556890 33.051268 31.249676 26.910897 25.255541 26.237510 24.678957 4.314812 3.851946 5.201375 4.297327 3.944854 3.709407 4.698792 4.180144 4.008057 4.408832 5.413859 4.183791 12.235464 12.443578 13.543412 12.406199 5.611935 4.639913 5.140094 3.517411 3.338406 3.652103 5.866998
ENSMUSG00000000148 12.975017 12.265602 13.479207 15.929561 15.998245 15.251781 16.439377 11.187122 11.765327 12.255481 11.805331 9.727571 10.811203 10.842452 9.828969 9.057784 8.261829 8.418036 8.912263 13.212425 11.883479 13.518251 13.482611 10.102323 10.474313 9.593967 7.725926 8.271340 8.221738 8.493223
ENSMUSG00000000149 62.786926 60.575599 64.868495 81.035249 77.612600 80.007932 79.501053 19.769167 19.068923 18.725772 19.826181 17.829466 16.300954 17.513712 15.040879 14.602690 13.900191 14.489153 13.408245 37.005858 35.055680 36.987634 36.668932 36.844849 38.723659 38.975206 27.466288 29.294849 29.622090 30.355845
ENSMUSG00000000154 0.333055 0.115844 0.432376 0.325385 0.332173 0.555266 0.274624 0.315355 0.635393 0.370886 0.117576 0.114637 0.156711 0.361648 0.316280 0.277836 0.297838 0.396606 0.245718 0.372255 0.472991 0.428795 0.411684 1.678056 1.611784 1.901671 3.967049 5.087375 4.540421 5.049930
ENSMUSG00000000157 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.019660 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.038957 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
ENSMUSG00000000159 0.000000 0.000000 0.000000 0.000000 0.066624 0.000000 0.000000 0.000000 0.000000 0.120014 0.192008 0.000000 0.069363 0.465487 0.133256 0.105527 0.609742 0.232339 0.000000 0.000000 0.000000 0.000000 0.034548 0.047125 0.116238 0.039351 0.258603 0.235119 0.181094 0.033496
ENSMUSG00000000167 7.167658 7.111275 4.562918 4.611452 5.165249 3.980233 5.810633 3.014607 2.165885 1.420868 2.402707 1.174127 1.731610 2.442734 1.718371 1.983554 1.487518 0.867104 1.640726 2.877345 4.086346 2.980909 2.788065 2.412113 1.093266 2.134765 3.075103 2.544267 4.048963 1.194254
ENSMUSG00000000168 46.898364 51.544129 50.803506 46.728899 46.416810 43.668901 42.770089 135.597979 142.174183 140.466776 155.267598 176.558772 177.449941 173.533809 177.441385 140.085372 135.955772 149.522280 139.289347 77.093701 65.875673 72.031057 69.113490 65.284363 59.589716 67.122029 66.846252 65.352465 66.011315 60.027737
ENSMUSG00000000171 183.510028 191.339737 185.366419 146.485489 136.055305 134.774362 144.419003 250.624457 258.445965 254.599788 264.936389 297.223823 290.955520 287.995065 284.714197 228.171116 236.310898 224.511224 221.322408 180.849179 173.186896 175.786723 175.691677 147.487631 156.118401 153.095971 219.561253 219.763695 210.634119 196.601985
ENSMUSG00000000182 0.103193 0.032365 0.029772 0.000000 0.000000 0.000000 0.000000 0.109766 0.257429 0.112760 0.211044 0.082172 0.037902 0.110168 0.111682 0.380279 0.139486 0.081007 0.246593 0.169495 0.356141 0.264438 0.269860 0.040837 0.041344 0.036939 0.000000 0.030240 0.000000 0.203803
ENSMUSG00000000183 0.000000 0.000000 0.010340 0.000000 0.010401 0.016126 0.000000 0.000000 0.000000 0.000000 0.015047 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.012499 0.000000 0.000000 0.000000 0.000000 0.012861 0.000000 0.000000 0.000000 0.020984
ENSMUSG00000000184 142.747177 147.826378 149.197724 66.761615 63.335841 65.070470 48.113890 21.348042 17.384228 15.935255 15.653769 16.292530 14.504585 17.797905 15.544638 17.701942 13.309930 15.605609 17.667904 28.101151 33.067019 38.044163 34.866641 7.046037 5.154655 5.428973 9.002990 11.536859 8.437719 2.867368
ENSMUSG00000000194 36.103417 38.818908 42.800890 36.817530 38.667199 36.778395 26.187279 33.002131 37.727627 36.606405 37.474231 36.703396 34.536615 36.214178 34.713799 33.020334 26.550992 28.899042 31.118602 34.300556 32.621947 33.363838 34.012681 30.084083 26.911440 26.198168 28.291703 30.043730 28.726287 29.337298
ENSMUSG00000000197 0.724701 0.864508 0.538785 1.073368 1.074458 0.792837 0.844630 25.556904 27.172721 27.223684 27.307364 23.469370 25.193771 24.809004 25.595933 22.134882 17.538889 21.238627 22.714445 11.528675 11.227085 11.667388 11.717720 0.717834 0.544315 0.843883 0.333168 0.479806 0.295036 0.135287
ENSMUSG00000000202 46.873410 38.860603 20.595529 21.277139 19.304139 22.041307 20.637502 1.998099 2.664860 2.147623 1.671951 0.707986 1.171352 2.805967 0.762410 1.275708 0.275591 0.391964 0.762712 3.703893 4.510447 4.937764 4.553973 1.186139 1.366082 0.695439 1.176641 2.441639 0.983961 3.130113
ENSMUSG00000000204 0.000000 0.077812 0.000000 0.036758 0.087585 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.043964 0.000000 0.000000 0.000000 0.018414 0.000000 0.000000 0.000000 0.038316 0.000000 0.000000 0.050157 0.000000 0.041803 0.000000 0.040884 0.152281
ENSMUSG00000000214 0.316821 0.376734 0.580427 0.074393 0.077525 0.232658 0.047076 0.000000 0.000000 0.000000 0.195838 0.050855 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.059494 0.028719 0.000000 0.080425 0.133600 0.051463 0.000000 0.062615 0.191165 0.206326 0.128695 0.070217
ENSMUSG00000000215 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.201293 0.000000 0.000000 0.079614 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.125601 0.000000 0.060125 0.339217 0.000000 0.526094 0.000000 0.000000 0.000000 0.000000 0.000000
ENSMUSG00000000216 0.055074 0.051534 0.015706 0.000000 0.031705 0.000000 0.000000 0.039904 0.022940 0.019738 0.000000 0.000000 0.020080 0.000000 0.000000 0.000000 0.018491 0.000000 0.038279 0.017680 0.000000 0.017147 0.017771 0.022180 0.000000 0.000000 0.000000 0.065627 0.033124 0.015572
ENSMUSG00000000223 1.490207 1.396739 1.636890 1.736106 1.971587 1.840778 1.365031 2.454484 2.567922 2.499309 2.498017 2.092696 1.983752 2.066465 1.717463 2.653865 1.954384 1.841953 2.723702 2.868183 2.988016 3.715324 3.097974 0.185306 0.237818 0.133284 0.229376 0.237803 0.222297 0.190730
ENSMUSG00000000244 0.083692 0.054689 0.130041 0.090199 0.193026 0.120789 0.083430 0.218939 0.038914 0.099444 0.015106 0.000000 0.000000 0.013459 0.000000 0.154390 0.067568 0.270073 0.127634 0.108335 0.113842 0.088430 0.052551 0.638876 0.595938 0.471887 0.533838 0.596377 0.453464 0.350619
ENSMUSG00000000247 3.581850 3.499372 3.528227 2.710486 3.101592 2.767801 2.565610 1.225779 0.659762 1.060458 0.978868 0.964519 0.809062 0.716354 0.546474 0.468901 0.592905 0.252756 0.383217 9.739157 11.367096 12.204956 9.050891 1.025635 1.058490 0.835800 0.563193 0.906406 0.423688 0.336354
ENSMUSG00000000248 0.000000 0.307460 0.275004 0.000000 0.000000 0.037225 0.000000 0.000000 0.000000 0.000000 0.000000 0.726755 0.000000 0.000000 0.000000 0.000000 0.000000 0.032926 0.094631 0.027475 0.000000 0.264993 0.027614 0.000000 0.035735 0.000000 0.000000 0.000000 0.000000 0.000000
ENSMUSG00000000253 5.002777 4.525335 4.179621 5.048770 4.947710 4.666987 4.505384 23.658215 24.517931 25.751619 26.217720 35.486540 36.812237 35.029205 34.464815 47.977399 51.307180 44.104269 43.421043 18.585852 19.591512 21.542503 19.528295 10.439729 10.042434 12.184047 20.222381 19.607781 19.986088 25.072968
ENSMUSG00000000263 2.981638 1.970632 2.113011 1.671629 1.341986 2.387899 0.558066 43.069555 53.099363 55.631124 57.681339 85.004519 87.323404 79.087829 83.067736 94.715692 60.793459 96.932514 95.108836 3.780805 3.893267 3.860449 4.027712 0.037352 0.037589 0.032486 0.056548 0.079220 0.058177 0.109525
ENSMUSG00000000266 4.207659 3.886346 3.670926 5.967348 5.313591 5.564873 3.908537 18.434130 19.178063 18.998609 19.210836 22.110790 21.232814 21.788512 20.701333 21.877453 18.154458 24.446192 24.809788 14.273999 14.445732 13.674834 13.532920 0.403795 0.353641 0.384367 0.045010 0.072257 0.094877 0.126511
ENSMUSG00000000275 15.337964 15.667032 22.966957 6.220162 6.831170 6.443813 6.385162 1.564104 1.618634 1.642415 1.574792 1.154512 1.511644 1.017289 0.828477 1.030165 0.983866 0.759897 1.061297 3.119988 3.554271 3.629310 3.486077 44.342670 44.274541 45.952267 68.204400 66.534540 67.546482 65.233133
ENSMUSG00000000276 6.707699 6.335487 5.715570 11.190985 11.445650 10.558276 8.857024 23.079722 24.157868 24.513742 24.306458 22.041553 22.124420 21.233254 22.742842 28.209391 23.221775 27.581952 27.546310 20.847164 20.360543 19.165914 19.357199 7.914284 8.048394 8.766309 11.622329 11.302601 11.791095 7.286859
ENSMUSG00000000278 90.617387 96.681487 113.010324 37.035350 35.574970 37.008561 35.981529 53.884610 53.796869 52.070052 55.294585 71.168345 71.952532 69.563003 72.057016 80.149296 90.857958 89.634870 77.269610 38.043631 38.898064 39.409246 38.148701 174.002655 172.278617 171.035085 162.763884 161.548267 159.085443 194.152958
ENSMUSG00000000282 28.352789 28.584065 28.490442 51.536750 48.043432 53.508818 41.034434 34.048141 39.276596 38.718342 42.472822 33.504992 34.355608 38.392612 31.313721 28.164786 28.592877 31.653213 29.223527 46.532177 45.670087 48.495228 50.032227 27.125277 25.598833 24.739579 18.692252 16.789268 15.102999 17.659619

TPM as an expression metric

Alright, not bad!

Let’s stop and think for a minute about what tximport did and the metric we are using (TPM). What does transcripts per million mean? Well, it means pretty much what it sounds like. For every million transcripts in the cell, X of them are this particular transcript. Importantly, this means when this TPM value was calculated from the number of counts a transcript received, this number had to be adjusted for both the total number of counts in the library and the length of a transcript.

If sample A had twice the number of total counts as sample B (i.e. was sequenced twice as deeply), then you would expect every transcript to have approximately twice the number of counts in sample A as it has in sample B. Similarly, if transcript X is twice as long as transcript Y, then you would expect that if they were equally expressed (i.e. the same number of transcript X and transcript Y molecules were present in the sample) that X would have approximately twice the counts that Y does. Working with expression units of TPM incorporates both of these normalizations.

So, if a TPM of X means that for every million transcripts in the sample that X of them were the transcript of interest, then the sum of TPM values across all species should equal one million, right?

Let’s check and see if that’s true.

TPM as an expression metric

[1] 985678.2
[1] 985291.2
[1] 985100.6

OK, not quite one million, but pretty darn close.

This notion that TPMs represent proportions of a whole also leads to another interesting insight into what tximport is doing here. If all transcripts belong to genes, then the TPM for a gene must be the sum of the TPMs of its transcripts. Can we verify that that is true?

TPM as an expression metric

ensembl_transcript_id DIV0.Rep1 DIV0.Rep2 DIV0.Rep3 DIV1.Rep1 DIV1.Rep2 DIV1.Rep3 DIV1.Rep4 DIV16.Rep1 DIV16.Rep2 DIV16.Rep3 DIV16.Rep4 DIV21.Rep1 DIV21.Rep2 DIV21.Rep3 DIV21.Rep4 DIV28.Rep1 DIV28.Rep2 DIV28.Rep3 DIV28.Rep4 DIV7.Rep1 DIV7.Rep2 DIV7.Rep3 DIV7.Rep4 DIVminus4.Rep1 DIVminus4.Rep2 DIVminus4.Rep3 DIVminus8.Rep1 DIVminus8.Rep2 DIVminus8.Rep3 DIVminus8.Rep4 ensembl_gene_id
ENSMUST00000119854 2.286511 2.475351 2.778492 3.752931 4.290970 3.606815 2.289835 4.753009 6.407206 6.093523 8.073422 4.114218 3.854100 4.504607 4.233634 0.000000 3.062028 2.657892 4.551532 7.708040 5.771621 4.913897 5.733164 0.821894 0.000000 0.791545 0.452225 0.283685 0.253438 0.000000 ENSMUSG00000037736
ENSMUST00000147408 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ENSMUSG00000052516
ENSMUST00000070080 17.294246 16.642412 20.145840 22.849371 23.111531 23.055949 18.922899 85.742059 92.095921 92.061439 90.117232 80.253664 82.962866 82.837521 83.218155 68.286276 59.179619 65.655743 67.027119 61.090866 60.764495 60.258015 58.545413 22.716225 23.042382 25.299848 18.289952 19.497363 17.597384 22.126199 ENSMUSG00000056124
ENSMUST00000033908 0.277638 0.771701 0.734236 0.835938 1.130580 1.774933 0.262763 0.231432 0.209738 0.302866 0.342360 0.394527 0.196355 0.088545 0.170725 0.088308 0.148182 0.037937 0.361527 0.124041 0.411704 0.216135 0.384765 0.627833 0.412680 0.393967 0.177899 0.097658 0.091573 0.404344 ENSMUSG00000031511
ENSMUST00000174016 8.545914 6.122679 4.978280 13.429257 13.448801 13.789999 9.139241 3.772152 6.493719 6.454788 7.620481 3.964981 3.769810 4.272302 4.711036 4.523138 4.025177 6.303732 5.551957 20.404943 18.886851 26.036222 24.703049 15.936284 17.060844 18.859078 9.623711 12.255722 10.591347 2.846295 ENSMUSG00000036036
ENSMUST00000082868 5.240588 7.435392 21.577246 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 3.157229 8.363363 2.106274 7.631468 4.013754 0.000000 ENSMUSG00000064802
ENSMUST00000187399 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.112263 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ENSMUSG00000090925
ENSMUST00000159512 1.197789 1.783770 1.592837 3.586005 2.904509 3.830286 1.389749 1.361614 1.953179 1.801017 2.061461 2.504494 1.787667 1.311947 1.985731 2.389696 0.801072 1.741500 1.538186 0.880757 2.094465 0.624705 1.609800 0.248865 0.954488 1.411125 1.029973 0.838168 0.218119 0.135338 ENSMUSG00000029207
ENSMUST00000194793 0.000000 0.068955 0.061078 0.064820 0.000000 0.000000 0.000000 0.078959 0.177114 0.299510 0.089493 0.181351 0.080516 0.310951 0.222825 0.324322 0.070346 0.163243 0.468088 0.067476 0.360582 0.332645 0.068630 0.000000 0.000000 0.000000 0.155780 0.076557 0.000000 0.000000 ENSMUSG00000103497
ENSMUST00000221298 0.077941 0.100916 0.090854 0.309668 0.244962 0.239357 0.299259 0.174734 0.291636 0.193253 0.231279 0.296867 0.117060 0.200170 0.109928 0.266300 0.103878 0.301950 0.395391 0.569643 0.215417 0.462677 0.353049 0.000000 0.000000 0.028680 0.000000 0.000000 0.000000 0.050635 ENSMUSG00000113688
ENSMUST00000138551 0.000000 0.000000 0.119676 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ENSMUSG00000034164
ENSMUST00000210944 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ENSMUSG00000110312
ENSMUST00000194146 0.066535 0.185273 0.055779 0.000000 0.000000 0.000000 0.070382 0.000000 0.000000 0.000000 0.078488 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.129887 0.000000 0.000000 0.000000 0.162839 0.143064 0.000000 0.000000 0.058096 0.000000 ENSMUSG00000102375
ENSMUST00000112725 1.270589 1.049486 1.271776 1.294794 1.619005 1.511612 0.893824 0.688353 0.884466 0.635204 0.709094 0.928998 0.846144 0.829175 0.696928 0.989624 1.018670 0.818393 1.100337 1.270363 0.878781 1.156467 1.281586 1.667594 1.633202 1.868757 1.477511 1.274160 1.640671 1.306233 ENSMUSG00000025269
ENSMUST00000181391 2.222061 1.416381 2.497798 3.006018 3.142753 1.571165 3.167750 3.530355 2.965603 4.649879 3.558337 3.465148 3.203126 3.106205 2.862499 0.000000 1.333758 2.039999 3.282519 5.710520 3.937353 6.847600 3.991743 1.733049 0.496494 1.601051 1.270107 1.507055 1.477476 0.000000 ENSMUSG00000030446
ENSMUST00000223290 0.000000 0.000000 0.000000 0.026984 0.000000 0.000000 0.032844 0.033378 0.000000 0.000000 0.000000 0.000000 0.000000 0.033128 0.000000 0.000000 0.031158 0.072496 0.032133 0.000000 0.031972 0.029910 0.000000 0.000000 0.036096 0.000000 0.029574 0.000000 0.000000 0.000000 ENSMUSG00000035954
ENSMUST00000116191 0.000000 0.044746 0.000000 0.000000 0.104528 0.168647 0.046020 0.048814 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.102273 0.042109 0.000000 0.069164 0.000000 0.111104 0.170086 0.000000 0.000000 ENSMUSG00000080078
ENSMUST00000060782 0.267962 0.170311 0.391316 0.446440 0.000000 0.000000 0.389636 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.087233 0.046855 0.000000 0.000000 0.110542 0.000000 0.048208 0.044187 0.083067 0.328807 0.000000 ENSMUSG00000051716
ENSMUST00000141479 0.960027 0.541858 0.000000 0.000000 0.473587 1.080745 0.000000 0.274066 0.000000 0.000000 0.000000 0.000000 0.359059 1.129914 0.300957 0.388096 0.000000 0.000000 0.801308 1.141492 1.285565 0.314952 1.575395 0.000000 0.348874 0.398863 0.000000 0.970236 1.421541 0.351342 ENSMUSG00000049119
ENSMUST00000211096 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ENSMUSG00000110353

OK so lets look at the expression of ENSMUSG00000020634 in the first sample (DIVminus8.Rep1).

TPM as an expression metric

[1] 46.71414
[1] 46.71414

Basic RNAseq QC

OK now that we’ve got expression values for all genes, we now might want to use these expression values to learn a little bit about our samples. One simple question is > Are replicates similar to each other, or at least more similar to each other than to other samples?

If our data is worth anything at all, we would hope that differences between replicates, which are supposed to be drawn from the same condition, are smaller than differences between samples drawn from different conditions. If that’s not true, it could indicate that one replicate is very different from other replciates (in which case we might want to remove it), or that the data in general is of poor quality.

Another question is:

How similar is each sample to every other sample?

In our timecourse, we might expect that samples drawn from adjacent timepoints might be more similar to each other than samples from more distant timepoints.

Hierarchical clustering

A simple way to think about this is to simply correlate TPM values for genes between samples. For plotting purposes here, let’s plot the log(TPM) of two samples against each other. However, for the actual correlation coefficient we are going to be using the Spearman correlation method, which uses ranks, not absolute values. This means that whether or not you take the log will have no effect on the Spearman correlation coefficient.

# DIVminus8.Rep1 vs DIVminus8.Rep2

# Since we are plotting log TPM values, we need to add a pseudocount to all samples.
# log(0) is a problem.

# Add pseudocounts and take log within ggplot function call
r.spearman <- cor.test(
  tpms$DIVminus8.Rep1, tpms$DIVminus8.Rep2,
  method = "spearman"
)$estimate[[1]]

r.spearman <- signif(r.spearman, 2)

ggplot(tpms, aes(x = log10(DIVminus8.Rep1 + 1e-3), y = log10(DIVminus8.Rep2 + 1e-3))) +
  geom_point() +
  theme_classic() +
  annotate("text", x = 2, y = 0, label = paste0("R = ", r.spearman))

Hierarchical clustering

Hierarchical clustering

With RNAseq data, the variance of a gene’s expression increases as the expression increases. However, using a pseudocount and taking the log of the expression value actually reverses this trend. Now, genes with the lowest expression have the most variance. Why is this a problem? Well, the genes with the most variance are going to be the ones that contribute the most to intersample differences. Ideally, we would like to therefore remove the relationship between expression and variance.

There are transformations, notably rlog and vst, that are made to deal with this, but they are best used when dealing with normalized count data, while here we are dealing with TPMs. We will talk about counts later, but not here.

So, for now, we will take another approach of simply using an expression threshold. Any gene that does not meet our threshold will be excluded from the analysis. Obviously where to set this threshold is a bit subjective. For now, we will set this cutoff at 1 TPM.

Hierarchical clustering

# DIVminus8.Rep1 vs DIVminus8.Rep2

# Since we are plotting log TPM values we will only keep for genes that have expression of at least 1 TPM in both samples
tpms.2samplecor <-
  dplyr::select(tpms, c(ensembl_gene_id, DIVminus8.Rep1, DIVminus8.Rep2)) |>
  filter(DIVminus8.Rep1 >= 1 & DIVminus8.Rep2 >= 1)

# pull the correlation coefficient
r.spearman <- cor.test(
  tpms.2samplecor$DIVminus8.Rep1,
  tpms.2samplecor$DIVminus8.Rep2,
  method = "spearman"
)$estimate[[1]]

# round/set sig digits
r.spearman <- signif(r.spearman, 2)

# plot
ggplot(tpms.2samplecor, aes(x = log10(DIVminus8.Rep1 + 1e-3), y = log10(DIVminus8.Rep2 + 1e-3))) +
  geom_point() +
  theme_classic() +
  annotate("text", x = 2, y = 1, label = paste0("R = ", r.spearman))

Hierarchical clustering

Filtering lowly expressed genes

OK that’s two samples compared to each other, but now we want to see how all samples compare to all other samples. Before we do this we need to decide how to apply our expression cutoff across many samples. Should a gene have to meet the cutoff in only one sample? In all samples? Let’s start by saying it has to meet the cutoff in at least half of the 30 samples.

[1] 50123
[1] 14843

Correlating gene expression values

Now we can use the cor function to calculate pairwise correlations in a .red[matrix] of TPM values.

          DIV0.Rep1 DIV0.Rep2 DIV0.Rep3 DIV1.Rep1 DIV1.Rep2 DIV1.Rep3 DIV1.Rep4
DIV0.Rep1 1.0000000 0.9921992 0.9857075 0.9344949 0.9320084 0.9318094 0.9341577
DIV0.Rep2 0.9921992 1.0000000 0.9895527 0.9290296 0.9269420 0.9270552 0.9301675
DIV0.Rep3 0.9857075 0.9895527 1.0000000 0.9172688 0.9149980 0.9161315 0.9175274
DIV1.Rep1 0.9344949 0.9290296 0.9172688 1.0000000 0.9932976 0.9913331 0.9857878
DIV1.Rep2 0.9320084 0.9269420 0.9149980 0.9932976 1.0000000 0.9911381 0.9851731
DIV1.Rep3 0.9318094 0.9270552 0.9161315 0.9913331 0.9911381 1.0000000 0.9863747
          DIV16.Rep1 DIV16.Rep2 DIV16.Rep3 DIV16.Rep4 DIV21.Rep1 DIV21.Rep2
DIV0.Rep1  0.6674215  0.6543366  0.6549751  0.6557675  0.6430027  0.6437791
DIV0.Rep2  0.6609602  0.6460813  0.6477044  0.6479861  0.6365390  0.6368957
DIV0.Rep3  0.6374190  0.6211743  0.6223612  0.6227213  0.6121500  0.6133398
DIV1.Rep1  0.7547993  0.7419703  0.7432819  0.7436624  0.7234613  0.7235614
DIV1.Rep2  0.7567175  0.7438094  0.7457015  0.7456283  0.7251852  0.7253125
DIV1.Rep3  0.7560271  0.7419976  0.7444082  0.7441764  0.7240899  0.7238994
          DIV21.Rep3 DIV21.Rep4 DIV28.Rep1 DIV28.Rep2 DIV28.Rep3 DIV28.Rep4
DIV0.Rep1  0.6455802  0.6422156  0.6385205  0.6416556  0.6336616  0.6352217
DIV0.Rep2  0.6387583  0.6339425  0.6312565  0.6348317  0.6259258  0.6283085
DIV0.Rep3  0.6138374  0.6091561  0.6080339  0.6112330  0.6026283  0.6047245
DIV1.Rep1  0.7259016  0.7220668  0.7162478  0.7139679  0.7103676  0.7129652
DIV1.Rep2  0.7277639  0.7246262  0.7182024  0.7164955  0.7130165  0.7151487
DIV1.Rep3  0.7259727  0.7221704  0.7165046  0.7140036  0.7110218  0.7129018
          DIV7.Rep1 DIV7.Rep2 DIV7.Rep3 DIV7.Rep4 DIVminus4.Rep1 DIVminus4.Rep2
DIV0.Rep1 0.7738807 0.7788370 0.7768059 0.7753821      0.8313811      0.8285264
DIV0.Rep2 0.7667818 0.7719339 0.7696461 0.7689326      0.8356858      0.8331484
DIV0.Rep3 0.7431343 0.7490053 0.7462518 0.7452457      0.8515066      0.8500630
DIV1.Rep1 0.8501343 0.8518467 0.8519696 0.8513401      0.7365926      0.7348312
DIV1.Rep2 0.8519090 0.8530144 0.8535201 0.8521009      0.7330103      0.7303372
DIV1.Rep3 0.8503243 0.8519535 0.8523296 0.8519729      0.7341659      0.7314245
          DIVminus4.Rep3 DIVminus8.Rep1 DIVminus8.Rep2 DIVminus8.Rep3
DIV0.Rep1      0.8298582      0.7952033      0.7960610      0.7958396
DIV0.Rep2      0.8342924      0.8016864      0.8025006      0.8009213
DIV0.Rep3      0.8516980      0.8181704      0.8185364      0.8187208
DIV1.Rep1      0.7354999      0.6986520      0.6993633      0.6996563
DIV1.Rep2      0.7320226      0.6954396      0.6958901      0.6953545
DIV1.Rep3      0.7338895      0.6967133      0.6970533      0.6969969
          DIVminus8.Rep4
DIV0.Rep1      0.7849066
DIV0.Rep2      0.7897993
DIV0.Rep3      0.8067435
DIV1.Rep1      0.6906591
DIV1.Rep2      0.6866114
DIV1.Rep3      0.6879394

Hierarchical clustering

Now we need to plot these and have similar samples (i.e. those that are highly correlated with each other) be clustered near to each other. We will use pheatmap to do this.

# let's pull information that we want to add as categories
pheatmap(
  tpms.cor,
  annotation_col = metadata[, 2:3],
  fontsize = 7,
  show_colnames = FALSE
)

This looks pretty good! There are two main points to takeaway here. First, all replicates for a given timepoint are clustering with each other. Second, you can kind of derive the order of the timepoints from the clustering. The biggest separation is between early (DIVminus8 to DIV1) and late (DIV7 to DIV28). After that you can then see finer-grained structure.

Hierarchical clustering

PCA analysis

Another way to visualize relationships is using a dimensionality reduction technique called Principal Component Analysis (PCA). Let’s watch this short video. It focuses more on how to interpret them rather than the math behind their creation.

PCA analysis

PCA works best when values are approximately normally distributed, so we will first take the log of our expression values.

With our cutoff as it is now (genes have to have expression of at least 1 TPM in half the samples), it is possible that we will have some 0 values. Taking the log of 0 might cause a problem, so we will add a pseudocount.

PCA analysis

Very similar interpretation as before (heatmap of correlation).