biomart version
1 ENSEMBL_MART_ENSEMBL Ensembl Genes 112
2 ENSEMBL_MART_MOUSE Mouse strains 112
3 ENSEMBL_MART_SNP Ensembl Variation 112
4 ENSEMBL_MART_FUNCGEN Ensembl Regulation 112
RNA Bioscience Initiative | CU Anschutz
2024-10-21
By the end of the class, you should be able to:
Understand how salmon
can be used to quantify gene expression given an RNAseq dataset
Collapse transcript-level quantifications to gene level quantifications using tximport
Deduce the relationships of samples to each other using hierarchical clustering and PCA
As with all computational experiments (yes, they are experiments, don’t let your pipette-toting friends tell you otherwise), keeping track of what you did is key. In the old days, I kept a written notebook of commands that I ran. Sounds silly, but there were many times that I went back to that notebook to see exactly what the parameters were for a given run using a piece of software.
Today, there are better options. You are using one of the better ones right now. Notebooks, including RMarkdown (mainly for R) and Jupyter (mainly for Python), are a great way to keep track of what you did as well as give justification or explanation for your analyses using plain ’ol English.
Trust me, one day you will be glad you used them. The Methods section of your paper is never fun to write without them.
In this class, we will examine RNAseq data collected over a time-course of differentiation from mouse embryonic stem cells to cortical glutamatergic neurons (Hubbard et al, F1000 Research (2013)). In this publication, the authors differentiated mESCs to neurons using a series of in vitro culture steps over a period of 37 days. During this timecourse, samples were extracted at selected intervals for transcriptome analysis. Importantly, for each timepoint, either 3 or 4 samples were taken for RNA extraction, library preparation and sequencing. This allows us to efficiently use the statistical frameworks provided by the DESeq2 package to identify genes whose RNA expression changes across the timecourse.
Cells were grown in generic differentiation-promoting media (LIF-) for 8 days until aggreates were dissociated and replated in neuronal differentiation media. This day of replating was designated as in vitro day 0 (DIV0). The timepoints taken before this replating therefore happened at “negative” times (DIV-8 and DIV-4). Because naming files with dashes or minus signs can cause problems, these samples are referred to as DIVminus8 and DIVminus4. Following the replating, samples were taken at days 1, 7, 16, 21, and 28 (DIV1, DIV7, DIV16, DIV21, and DIV28).
Today we will focus on some quality control steps that are good ideas to do for every RNAseq dataset you encounter, whether produced by yourself or someone else.
Today we will focus on some Quality Control steps that are good ideas to do for every RNA-seq data set you encounter, whether produced by yourself or someone else.
salmon
We recently learned about the RNAseq quantification tool salmon. We won’t rehash the details here about how salmon works. For our purposes, we just need to know that salmon reads in a fastq file of sequencing reads and a fasta file of transcript sequences to be quantified. Let’s take a look at this fasta file of transcripts:
Looks like we have ensembl transcript IDs, which is a good idea. I can tell because they start with ‘ENS’. Using ensembl IDs as transcript names will allow us to later collate transcript expression levels into gene expression levels using a database that relates transcripts and genes. More on that later.
The first step in quantifying these transcripts is to make an index from them. This is done as follows:
Here, transcripts.fa
is a path to our fasta file, transcripts.idx is the name of the index that will be created, and k is the length of the kmers that will be used in the hash table related kmers and transcripts. k is the length of the minimum accepted match for a kmer in a read and a kmer in a transcript. Longer kmers (higher values of k) will therefore be more stringent, and lowering k may improve mapping sensitivity at the cost of some specificity. You may also see here how read lengths can influence what value for k you should choose.
Consider an experiment where we had 25 nt reads (this was true wayyyyyy back in the old, dark days of high-throughput sequencing). What’s going to happen if I quantify these reads using an index where the kmer size was set to 29? Well, nothing will align. The index has represented the transcriptome in 29 nt chunks. However, no read will match to these 29mers because there are no 29mers in these reads! As a general rule of thumb, for reads 75 nt and longer (which is the bulk of the data produced nowadays), a good value for k that maximizes both specificity and sensitivity is 31. However, datasets that you may retrieve from the internet, particularly older ones, may have shorter read lengths, so keep this is mind when defining k.
Once we have our index, we can quantify transcripts in the index using reads from our fastq files.
salmon quant --libType A -p 8 --seqBias --gcBias --validateMappings -1 <forwardreads.fastq> -2 <reversereads.fastq> -o <outputname> --index <transcripts.idx>
In this command, our forward and reverse read fastq files are supplied to -1 and -2, respectively. If the experiment produced single end reads, -2 is omitted. <transcripts.idx> is the path to the index produced in the previous step. I’m not going to go through the rest of the flags used here, but their meanings as well as other options can be found here.
Let’s take a look at what salmon spits out. The first file we will look at is a log that is found at /logs/salmon_quant.log. This file contains info about the quantification, but there’s one line of this file in particular that we are interested in. It lets us know how many of the reads in the fastq file that salmon found a home for in the transcriptome fasta.
There are a lot of lines in this file, but really only one that we are interested in. We want the one that tells us the “Mapping rate.” How could we easily and efficiently look at the mapping rates of all our samples? Grep!
As we discussed, salmon quantifies transcripts, not genes. However, genes are made up of transcripts, so we can calculate gene expression values from transcript expression values if we knew which transcripts belonged to which genes.
We can get this relationships between transcripts and genes through biomaRt
.
biomaRt
has many tables that relate genes, transcripts, and other useful data include gene biotypes and gene ontology categories, even across species. Let’s use it here to get a table of genes and transcripts for the mouse genome.
biomart version
1 ENSEMBL_MART_ENSEMBL Ensembl Genes 112
2 ENSEMBL_MART_MOUSE Mouse strains 112
3 ENSEMBL_MART_SNP Ensembl Variation 112
4 ENSEMBL_MART_FUNCGEN Ensembl Regulation 112
I encourage you to see what is in each mart, but for now we are only going to use ENSEMBL_MART_ENSEMBL.
Alright, we’ve chosen our mart. What data sets are available in this mart?
dataset | description | version |
---|---|---|
abrachyrhynchus_gene_ensembl | Pink-footed goose genes (ASM259213v1) | ASM259213v1 |
acalliptera_gene_ensembl | Eastern happy genes (fAstCal1.3) | fAstCal1.3 |
acarolinensis_gene_ensembl | Green anole genes (AnoCar2.0v2) | AnoCar2.0v2 |
acchrysaetos_gene_ensembl | Golden eagle genes (bAquChr1.2) | bAquChr1.2 |
acitrinellus_gene_ensembl | Midas cichlid genes (Midas_v5) | Midas_v5 |
amelanoleuca_gene_ensembl | Giant panda genes (ASM200744v2) | ASM200744v2 |
amexicanus_gene_ensembl | Mexican tetra genes (Astyanax_mexicanus-2.0) | Astyanax_mexicanus-2.0 |
anancymaae_gene_ensembl | Ma's night monkey genes (Anan_2.0) | Anan_2.0 |
aocellaris_gene_ensembl | Clown anemonefish genes (ASM2253959v1) | ASM2253959v1 |
apercula_gene_ensembl | Orange clownfish genes (Nemo_v1) | Nemo_v1 |
aplatyrhynchos_gene_ensembl | Mallard genes (ASM874695v1) | ASM874695v1 |
apolyacanthus_gene_ensembl | Spiny chromis genes (ASM210954v1) | ASM210954v1 |
applatyrhynchos_gene_ensembl | Duck genes (CAU_duck1.0) | CAU_duck1.0 |
atestudineus_gene_ensembl | Climbing perch genes (fAnaTes1.3) | fAnaTes1.3 |
bbbison_gene_ensembl | American bison genes (Bison_UMD1.0) | Bison_UMD1.0 |
bgrunniens_gene_ensembl | Domestic yak genes (LU_Bosgru_v3.0) | LU_Bosgru_v3.0 |
bihybrid_gene_ensembl | Hybrid - Bos Indicus genes (UOA_Brahman_1) | UOA_Brahman_1 |
bmusculus_gene_ensembl | Blue whale genes (mBalMus1.v2) | mBalMus1.v2 |
bmutus_gene_ensembl | Wild yak genes (BosGru_v2.0) | BosGru_v2.0 |
bsplendens_gene_ensembl | Siamese fighting fish genes (fBetSpl5.2) | fBetSpl5.2 |
btaurus_gene_ensembl | Cow genes (ARS-UCD1.3) | ARS-UCD1.3 |
cabingdonii_gene_ensembl | Abingdon island giant tortoise genes (ASM359739v1) | ASM359739v1 |
catys_gene_ensembl | Sooty mangabey genes (Caty_1.0) | Caty_1.0 |
cauratus_gene_ensembl | Goldfish genes (ASM336829v1) | ASM336829v1 |
cccarpio_gene_ensembl | Common carp genes (Cypcar_WagV4.0) | Cypcar_WagV4.0 |
cdromedarius_gene_ensembl | Arabian camel genes (CamDro2) | CamDro2 |
celegans_gene_ensembl | Caenorhabditis elegans (Nematode, N2) genes (WBcel235) | WBcel235 |
cgchok1gshd_gene_ensembl | Chinese hamster CHOK1GS genes (CHOK1GS_HDv1) | CHOK1GS_HDv1 |
cgobio_gene_ensembl | Channel bull blenny genes (fCotGob3.1) | fCotGob3.1 |
charengus_gene_ensembl | Atlantic herring genes (Ch_v2.0.2v2) | Ch_v2.0.2v2 |
chircus_gene_ensembl | Goat genes (ARS1) | ARS1 |
choffmanni_gene_ensembl | Sloth genes (choHof1) | choHof1 |
chyarkandensis_gene_ensembl | Yarkand deer genes (CEY_v1) | CEY_v1 |
cimitator_gene_ensembl | Panamanian white-faced capuchin genes (Cebus_imitator-1.0) | Cebus_imitator-1.0 |
cintestinalis_gene_ensembl | C.intestinalis genes (KH) | KH |
cjacchus_gene_ensembl | White-tufted-ear marmoset genes (mCalJac1.pat.X) | mCalJac1.pat.X |
cjaponica_gene_ensembl | Japanese quail genes (Coturnix_japonica_2.0) | Coturnix_japonica_2.0 |
clanigera_gene_ensembl | Long-tailed chinchilla genes (ChiLan1.0) | ChiLan1.0 |
cldingo_gene_ensembl | Dingo genes (ASM325472v1) | ASM325472v1 |
clfamiliaris_gene_ensembl | Dog genes (ROS_Cfam_1.0) | ROS_Cfam_1.0 |
clumpus_gene_ensembl | Lumpfish genes (fCycLum1.pri) | fCycLum1.pri |
cmilii_gene_ensembl | Elephant shark genes (Callorhinchus_milii-6.1.3) | Callorhinchus_milii-6.1.3 |
cpbellii_gene_ensembl | Painted turtle genes (Chrysemys_picta_bellii-3.0.3) | Chrysemys_picta_bellii-3.0.3 |
cporcellus_gene_ensembl | Guinea Pig genes (Cavpor3.0) | Cavpor3.0 |
cporosus_gene_ensembl | Australian saltwater crocodile genes (CroPor_comp1) | CroPor_comp1 |
csabaeus_gene_ensembl | Vervet-AGM genes (ChlSab1.1) | ChlSab1.1 |
csavignyi_gene_ensembl | C.savignyi genes (CSAV 2.0) | CSAV 2.0 |
csemilaevis_gene_ensembl | Tongue sole genes (Cse_v1.0) | Cse_v1.0 |
csyrichta_gene_ensembl | Tarsier genes (Tarsius_syrichta-2.0.1) | Tarsius_syrichta-2.0.1 |
cvariegatus_gene_ensembl | Sheepshead minnow genes (C_variegatus-1.0) | C_variegatus-1.0 |
cwagneri_gene_ensembl | Chacoan peccary genes (CatWag_v2_BIUU_UCD) | CatWag_v2_BIUU_UCD |
dclupeoides_gene_ensembl | Denticle herring genes (fDenClu1.2) | fDenClu1.2 |
dlabrax_gene_ensembl | European seabass genes (dlabrax2021) | dlabrax2021 |
dleucas_gene_ensembl | Beluga whale genes (ASM228892v3) | ASM228892v3 |
dmelanogaster_gene_ensembl | Drosophila melanogaster (Fruit fly) genes (BDGP6.46) | BDGP6.46 |
dnovemcinctus_gene_ensembl | Armadillo genes (Dasnov3.0) | Dasnov3.0 |
dordii_gene_ensembl | Kangaroo rat genes (Dord_2.0) | Dord_2.0 |
drerio_gene_ensembl | Zebrafish genes (GRCz11) | GRCz11 |
easinus_gene_ensembl | Donkey genes (ASM1607732v2) | ASM1607732v2 |
eburgeri_gene_ensembl | Hagfish genes (Eburgeri_3.2) | Eburgeri_3.2 |
ecaballus_gene_ensembl | Horse genes (EquCab3.0) | EquCab3.0 |
ecalabaricus_gene_ensembl | Reedfish genes (fErpCal1.1) | fErpCal1.1 |
eelectricus_gene_ensembl | Electric eel genes (fEleEle1.pri) | fEleEle1.pri |
eeuropaeus_gene_ensembl | Hedgehog genes (eriEur1) | eriEur1 |
elucius_gene_ensembl | Northern pike genes (fEsoLuc1.pri) | fEsoLuc1.pri |
etelfairi_gene_ensembl | Lesser hedgehog tenrec genes (TENREC) | TENREC |
falbicollis_gene_ensembl | Collared flycatcher genes (FicAlb1.5) | FicAlb1.5 |
fcatus_gene_ensembl | Cat genes (Felis_catus_9.0) | Felis_catus_9.0 |
fheteroclitus_gene_ensembl | Mummichog genes (Fundulus_heteroclitus-3.0.2) | Fundulus_heteroclitus-3.0.2 |
gaculeatus_gene_ensembl | Stickleback genes (GAculeatus_UGA_version5) | GAculeatus_UGA_version5 |
gevgoodei_gene_ensembl | Goodes thornscrub tortoise genes (rGopEvg1_v1.p) | rGopEvg1_v1.p |
gfortis_gene_ensembl | Medium ground-finch genes (GeoFor_1.0) | GeoFor_1.0 |
ggallus_gene_ensembl | Chicken genes (bGalGal1.mat.broiler.GRCg7b) | bGalGal1.mat.broiler.GRCg7b |
ggorilla_gene_ensembl | Gorilla genes (gorGor4) | gorGor4 |
gmorhua_gene_ensembl | Atlantic cod genes (gadMor3.0) | gadMor3.0 |
hburtoni_gene_ensembl | Burton's mouthbrooder genes (AstBur1.0) | AstBur1.0 |
hcomes_gene_ensembl | Tiger tail seahorse genes (H_comes_QL1_v1) | H_comes_QL1_v1 |
hgfemale_gene_ensembl | Naked mole-rat female genes (Naked_mole-rat_maternal) | Naked_mole-rat_maternal |
hhucho_gene_ensembl | Huchen genes (ASM331708v1) | ASM331708v1 |
hsapiens_gene_ensembl | Human genes (GRCh38.p14) | GRCh38.p14 |
ipunctatus_gene_ensembl | Channel catfish genes (ASM400665v3) | ASM400665v3 |
itridecemlineatus_gene_ensembl | Squirrel genes (SpeTri2.0) | SpeTri2.0 |
jjaculus_gene_ensembl | Lesser Egyptian jerboa genes (JacJac1.0) | JacJac1.0 |
kmarmoratus_gene_ensembl | Mangrove rivulus genes (ASM164957v1) | ASM164957v1 |
lafricana_gene_ensembl | Elephant genes (Loxafr3.0) | Loxafr3.0 |
lbergylta_gene_ensembl | Ballan wrasse genes (BallGen_V1) | BallGen_V1 |
lcalcarifer_gene_ensembl | Barramundi perch genes (ASB_HGAPassembly_v1) | ASB_HGAPassembly_v1 |
lchalumnae_gene_ensembl | Coelacanth genes (LatCha1) | LatCha1 |
lcrocea_gene_ensembl | Large yellow croaker genes (L_crocea_2.0) | L_crocea_2.0 |
llaticaudata_gene_ensembl | Blue-ringed sea krait genes (latLat_1.0) | latLat_1.0 |
lleishanense_gene_ensembl | Leishan spiny toad genes (ASM966780v1) | ASM966780v1 |
loculatus_gene_ensembl | Spotted gar genes (LepOcu1) | LepOcu1 |
marmatus_gene_ensembl | Zig-zag eel genes (fMasArm1.2) | fMasArm1.2 |
mauratus_gene_ensembl | Golden Hamster genes (MesAur1.0) | MesAur1.0 |
mcaroli_gene_ensembl | Ryukyu mouse genes (CAROLI_EIJ_v1.1) | CAROLI_EIJ_v1.1 |
mdomestica_gene_ensembl | Opossum genes (ASM229v1) | ASM229v1 |
mfascicularis_gene_ensembl | Crab-eating macaque genes (Macaca_fascicularis_6.0) | Macaca_fascicularis_6.0 |
mgallopavo_gene_ensembl | Turkey genes (Turkey_5.1) | Turkey_5.1 |
mleucophaeus_gene_ensembl | Drill genes (Mleu.le_1.0) | Mleu.le_1.0 |
mlucifugus_gene_ensembl | Microbat genes (Myoluc2.0) | Myoluc2.0 |
mmmarmota_gene_ensembl | Alpine marmot genes (marMar2.1) | marMar2.1 |
mmonoceros_gene_ensembl | Narwhal genes (NGI_Narwhal_1) | NGI_Narwhal_1 |
mmoschiferus_gene_ensembl | Siberian musk deer genes (MosMos_v2_BIUU_UCD) | MosMos_v2_BIUU_UCD |
mmulatta_gene_ensembl | Macaque genes (Mmul_10) | Mmul_10 |
mmurdjan_gene_ensembl | Pinecone soldierfish genes (fMyrMur1.1) | fMyrMur1.1 |
mmurinus_gene_ensembl | Mouse Lemur genes (Mmur_3.0) | Mmur_3.0 |
mmusculus_gene_ensembl | Mouse genes (GRCm39) | GRCm39 |
mnemestrina_gene_ensembl | Pig-tailed macaque genes (Mnem_1.0) | Mnem_1.0 |
mochrogaster_gene_ensembl | Prairie vole genes (MicOch1.0) | MicOch1.0 |
mpahari_gene_ensembl | Shrew mouse genes (PAHARI_EIJ_v1.1) | PAHARI_EIJ_v1.1 |
mpfuro_gene_ensembl | Ferret genes (MusPutFur1.0) | MusPutFur1.0 |
mspicilegus_gene_ensembl | Steppe mouse genes (MUSP714) | MUSP714 |
mspretus_gene_ensembl | Algerian mouse genes (SPRET_EiJ_v1) | SPRET_EiJ_v1 |
mzebra_gene_ensembl | Zebra mbuna genes (M_zebra_UMD2a) | M_zebra_UMD2a |
nbrichardi_gene_ensembl | Lyretail cichlid genes (NeoBri1.0) | NeoBri1.0 |
neugenii_gene_ensembl | Wallaby genes (Meug_1.0) | Meug_1.0 |
nfurzeri_gene_ensembl | Turquoise killifish genes (Nfu_20140520) | Nfu_20140520 |
ngalili_gene_ensembl | Upper Galilee mountains blind mole rat genes (S.galili_v1.0) | S.galili_v1.0 |
nleucogenys_gene_ensembl | Gibbon genes (Nleu_3.0) | Nleu_3.0 |
nnaja_gene_ensembl | Indian cobra genes (Nana_v5) | Nana_v5 |
nscutatus_gene_ensembl | Mainland tiger snake genes (TS10Xv2-PRI) | TS10Xv2-PRI |
nvison_gene_ensembl | American mink genes (NNQGG.v01) | NNQGG.v01 |
oanatinus_gene_ensembl | Platypus genes (mOrnAna1.p.v1) | mOrnAna1.p.v1 |
oarambouillet_gene_ensembl | Sheep genes (ARS-UI_Ramb_v2.0) | ARS-UI_Ramb_v2.0 |
oaries_gene_ensembl | Sheep (texel) genes (Oar_v3.1) | Oar_v3.1 |
ocuniculus_gene_ensembl | Rabbit genes (OryCun2.0) | OryCun2.0 |
odegus_gene_ensembl | Degu genes (OctDeg1.0) | OctDeg1.0 |
ogarnettii_gene_ensembl | Bushbaby genes (OtoGar3) | OtoGar3 |
ojavanicus_gene_ensembl | Javanese ricefish genes (OJAV_1.1) | OJAV_1.1 |
okisutch_gene_ensembl | Coho salmon genes (Okis_V2) | Okis_V2 |
olatipes_gene_ensembl | Japanese medaka HdrR genes (ASM223467v1) | ASM223467v1 |
omelastigma_gene_ensembl | Indian medaka genes (Om_v0.7.RACA) | Om_v0.7.RACA |
omykiss_gene_ensembl | Rainbow trout genes (USDA_OmykA_1.1) | USDA_OmykA_1.1 |
oniloticus_gene_ensembl | Nile tilapia genes (O_niloticus_UMD_NMBU) | O_niloticus_UMD_NMBU |
oprinceps_gene_ensembl | Pika genes (OchPri2.0-Ens) | OchPri2.0-Ens |
osinensis_gene_ensembl | Chinese medaka genes (ASM858656v1) | ASM858656v1 |
otshawytscha_gene_ensembl | Chinook salmon genes (Otsh_v2.0) | Otsh_v2.0 |
pabelii_gene_ensembl | Sumatran orangutan genes (Susie_PABv2) | Susie_PABv2 |
panubis_gene_ensembl | Olive baboon genes (Panubis1.0) | Panubis1.0 |
pcapensis_gene_ensembl | Hyrax genes (proCap1) | proCap1 |
pcatodon_gene_ensembl | Sperm whale genes (ASM283717v2) | ASM283717v2 |
pcinereus_gene_ensembl | Koala genes (phaCin_unsw_v4.1) | phaCin_unsw_v4.1 |
pcoquereli_gene_ensembl | Coquerel's sifaka genes (Pcoq_1.0) | Pcoq_1.0 |
pformosa_gene_ensembl | Amazon molly genes (Poecilia_formosa-5.1.2) | Poecilia_formosa-5.1.2 |
pkingsleyae_gene_ensembl | Paramormyrops kingsleyae genes (PKINGS_0.1) | PKINGS_0.1 |
platipinna_gene_ensembl | Sailfin molly genes (P_latipinna-1.0) | P_latipinna-1.0 |
pleo_gene_ensembl | Lion genes (PanLeo1.0) | PanLeo1.0 |
pmajor_gene_ensembl | Great Tit genes (Parus_major1.1) | Parus_major1.1 |
pmarinus_gene_ensembl | Lamprey genes (Pmarinus_7.0) | Pmarinus_7.0 |
pmbairdii_gene_ensembl | Northern American deer mouse genes (HU_Pman_2.1) | HU_Pman_2.1 |
pmuralis_gene_ensembl | Common wall lizard genes (PodMur_1.0) | PodMur_1.0 |
pnattereri_gene_ensembl | Red-bellied piranha genes (fPygNat1.pri) | fPygNat1.pri |
pnyererei_gene_ensembl | Makobe Island cichlid genes (PunNye1.0) | PunNye1.0 |
ppaniscus_gene_ensembl | Bonobo genes (panpan1.1) | panpan1.1 |
ppardus_gene_ensembl | Leopard genes (PanPar1.0) | PanPar1.0 |
preticulata_gene_ensembl | Guppy genes (Guppy_female_1.0_MT) | Guppy_female_1.0_MT |
psimus_gene_ensembl | Greater bamboo lemur genes (Prosim_1.0) | Prosim_1.0 |
psinensis_gene_ensembl | Chinese softshell turtle genes (PelSin_1.0) | PelSin_1.0 |
psinus_gene_ensembl | Vaquita genes (mPhoSin1.pri) | mPhoSin1.pri |
ptaltaica_gene_ensembl | Tiger genes (PanTig1.0) | PanTig1.0 |
ptextilis_gene_ensembl | Eastern brown snake genes (EBS10Xv2-PRI) | EBS10Xv2-PRI |
ptroglodytes_gene_ensembl | Chimpanzee genes (Pan_tro_3.0) | Pan_tro_3.0 |
pvampyrus_gene_ensembl | Megabat genes (pteVam1) | pteVam1 |
rbieti_gene_ensembl | Black snub-nosed monkey genes (ASM169854v1) | ASM169854v1 |
rferrumequinum_gene_ensembl | Greater horseshoe bat genes (mRhiFer1_v1.p) | mRhiFer1_v1.p |
rnorvegicus_gene_ensembl | Rat genes (mRatBN7.2) | mRatBN7.2 |
rroxellana_gene_ensembl | Golden snub-nosed monkey genes (Rrox_v1) | Rrox_v1 |
saraneus_gene_ensembl | Shrew genes (sorAra1) | sorAra1 |
saurata_gene_ensembl | Gilthead seabream genes (fSpaAur1.1) | fSpaAur1.1 |
sbboliviensis_gene_ensembl | Bolivian squirrel monkey genes (SaiBol1.0) | SaiBol1.0 |
scanaria_gene_ensembl | Common canary genes (SCA1) | SCA1 |
scaustralis_gene_ensembl | African ostrich genes (ASM69896v1) | ASM69896v1 |
scerevisiae_gene_ensembl | Saccharomyces cerevisiae genes (R64-1-1) | R64-1-1 |
sdumerili_gene_ensembl | Greater amberjack genes (Sdu_1.0) | Sdu_1.0 |
sformosus_gene_ensembl | Asian bonytongue genes (fSclFor1.1) | fSclFor1.1 |
shabroptila_gene_ensembl | Kakapo genes (bStrHab1_v1.p) | bStrHab1_v1.p |
sharrisii_gene_ensembl | Tasmanian devil genes (mSarHar1.11) | mSarHar1.11 |
sldorsalis_gene_ensembl | Yellowtail amberjack genes (Sedor1) | Sedor1 |
slucioperca_gene_ensembl | Pike-perch genes (SLUC_FBN_1) | SLUC_FBN_1 |
smaximus_gene_ensembl | Turbot genes (ASM1334776v1) | ASM1334776v1 |
smerianae_gene_ensembl | Argentine black and white tegu genes (HLtupMer3) | HLtupMer3 |
spartitus_gene_ensembl | Bicolor damselfish genes (Stegastes_partitus-1.0.2) | Stegastes_partitus-1.0.2 |
spunctatus_gene_ensembl | Tuatara genes (ASM311381v1) | ASM311381v1 |
ssalar_gene_ensembl | Atlantic salmon genes (Ssal_v3.1) | Ssal_v3.1 |
ssbamei_gene_ensembl | Pig - Bamei genes (Bamei_pig_v1) | Bamei_pig_v1 |
ssberkshire_gene_ensembl | Pig - Berkshire genes (Berkshire_pig_v1) | Berkshire_pig_v1 |
sscrofa_gene_ensembl | Pig genes (Sscrofa11.1) | Sscrofa11.1 |
sshampshire_gene_ensembl | Pig - Hampshire genes (Hampshire_pig_v1) | Hampshire_pig_v1 |
ssjinhua_gene_ensembl | Pig - Jinhua genes (Jinhua_pig_v1) | Jinhua_pig_v1 |
sslandrace_gene_ensembl | Pig - Landrace genes (Landrace_pig_v1) | Landrace_pig_v1 |
sslargewhite_gene_ensembl | Pig - Largewhite genes (Large_White_v1) | Large_White_v1 |
ssmeishan_gene_ensembl | Pig - Meishan genes (Meishan_pig_v1) | Meishan_pig_v1 |
sspietrain_gene_ensembl | Pig - Pietrain genes (Pietrain_pig_v1) | Pietrain_pig_v1 |
ssrongchang_gene_ensembl | Pig - Rongchang genes (Rongchang_pig_v1) | Rongchang_pig_v1 |
sstibetan_gene_ensembl | Pig - Tibetan genes (Tibetan_Pig_v2) | Tibetan_Pig_v2 |
ssusmarc_gene_ensembl | Pig USMARC genes (USMARCv1.0) | USMARCv1.0 |
sswuzhishan_gene_ensembl | Pig - Wuzhishan genes (minipig_v1.0) | minipig_v1.0 |
strutta_gene_ensembl | Brown trout genes (fSalTru1.1) | fSalTru1.1 |
svulgaris_gene_ensembl | Eurasian red squirrel genes (mSciVul1.1) | mSciVul1.1 |
tbelangeri_gene_ensembl | Tree Shrew genes (tupBel1) | tupBel1 |
tctriunguis_gene_ensembl | Three-toed box turtle genes (T_m_triunguis-2.0) | T_m_triunguis-2.0 |
tguttata_gene_ensembl | Zebra finch genes (bTaeGut1_v1.p) | bTaeGut1_v1.p |
tnigroviridis_gene_ensembl | Tetraodon genes (TETRAODON 8.0) | TETRAODON 8.0 |
trubripes_gene_ensembl | Fugu genes (fTakRub1.2) | fTakRub1.2 |
ttruncatus_gene_ensembl | Dolphin genes (turTru1) | turTru1 |
uamericanus_gene_ensembl | American black bear genes (ASM334442v1) | ASM334442v1 |
umaritimus_gene_ensembl | Polar bear genes (UrsMar_1.0) | UrsMar_1.0 |
uparryii_gene_ensembl | Arctic ground squirrel genes (ASM342692v1) | ASM342692v1 |
vpacos_gene_ensembl | Alpaca genes (vicPac1) | vicPac1 |
vursinus_gene_ensembl | Common wombat genes (bare-nosed_wombat_genome_assembly) | bare-nosed_wombat_genome_assembly |
vvulpes_gene_ensembl | Red fox genes (VulVul2.2) | VulVul2.2 |
xcouchianus_gene_ensembl | Monterrey platyfish genes (Xiphophorus_couchianus-4.0.1) | Xiphophorus_couchianus-4.0.1 |
xmaculatus_gene_ensembl | Platyfish genes (X_maculatus-5.0-male) | X_maculatus-5.0-male |
xtropicalis_gene_ensembl | Tropical clawed frog genes (UCB_Xtro_10.0) | UCB_Xtro_10.0 |
A lot of stuff for a lot of species! Perhaps we want to limit it to see which ones are relevant to mouse.
dataset | description | version |
---|---|---|
mmusculus_gene_ensembl | Mouse genes (GRCm39) | GRCm39 |
Ah so we probably want the dataset called ‘mmusculus_gene_ensembl’!
name | description | page |
---|---|---|
ensembl_gene_id | Gene stable ID | feature_page |
ensembl_gene_id_version | Gene stable ID version | feature_page |
ensembl_transcript_id | Transcript stable ID | feature_page |
ensembl_transcript_id_version | Transcript stable ID version | feature_page |
ensembl_peptide_id | Protein stable ID | feature_page |
ensembl_peptide_id_version | Protein stable ID version | feature_page |
ensembl_exon_id | Exon stable ID | feature_page |
description | Gene description | feature_page |
chromosome_name | Chromosome/scaffold name | feature_page |
start_position | Gene start (bp) | feature_page |
end_position | Gene end (bp) | feature_page |
strand | Strand | feature_page |
band | Karyotype band | feature_page |
transcript_start | Transcript start (bp) | feature_page |
transcript_end | Transcript end (bp) | feature_page |
transcription_start_site | Transcription start site (TSS) | feature_page |
transcript_length | Transcript length (including UTRs and CDS) | feature_page |
transcript_tsl | Transcript support level (TSL) | feature_page |
transcript_gencode_basic | GENCODE basic annotation | feature_page |
transcript_appris | APPRIS annotation | feature_page |
So there are 2885 rows of goodies about the mouse genome and its relationship to many other genomes. However, you can probably see that the ones that are most useful to us right now are right at the top: ‘ensembl_transcript_id’ and ‘ensembl_gene_id’. We can use those attributes in our mart to make a table relating genes and transcripts.
I’m going to through one more attribute in: external_gene_name. Those are usually more informative than ensembl IDs.
ensembl_transcript_id | ensembl_gene_id | external_gene_name |
---|---|---|
ENSMUST00000082387 | ENSMUSG00000064336 | mt-Tf |
ENSMUST00000082388 | ENSMUSG00000064337 | mt-Rnr1 |
ENSMUST00000082389 | ENSMUSG00000064338 | mt-Tv |
ENSMUST00000082390 | ENSMUSG00000064339 | mt-Rnr2 |
ENSMUST00000082391 | ENSMUSG00000064340 | mt-Tl1 |
ENSMUST00000082392 | ENSMUSG00000064341 | mt-Nd1 |
ENSMUST00000082393 | ENSMUSG00000064342 | mt-Ti |
ENSMUST00000082394 | ENSMUSG00000064343 | mt-Tq |
ENSMUST00000082395 | ENSMUSG00000064344 | mt-Tm |
ENSMUST00000082396 | ENSMUSG00000064345 | mt-Nd2 |
ENSMUST00000082397 | ENSMUSG00000064346 | mt-Tw |
ENSMUST00000082398 | ENSMUSG00000064347 | mt-Ta |
ENSMUST00000082399 | ENSMUSG00000064348 | mt-Tn |
ENSMUST00000082400 | ENSMUSG00000064349 | mt-Tc |
ENSMUST00000082401 | ENSMUSG00000064350 | mt-Ty |
ENSMUST00000082402 | ENSMUSG00000064351 | mt-Co1 |
ENSMUST00000082403 | ENSMUSG00000064352 | mt-Ts1 |
ENSMUST00000082404 | ENSMUSG00000064353 | mt-Td |
ENSMUST00000082405 | ENSMUSG00000064354 | mt-Co2 |
ENSMUST00000082406 | ENSMUSG00000064355 | mt-Tk |
Alright this looks good! We are going to split this into two tables. One that contains transcript ID and gene ID, and the other that contains gene ID and gene name.
tximport
Now that we have our table relating transcripts and genes, we can give it to tximport to have it calculate gene-level expression data from our transcript-level expression data.
First, we have to tell it where the salmon quantification files (the quant.sf.gz
files) are. Here’s what our directory structure that contains these files looks like:
tximport
tximport
You can see that we got a list of sample names and the absolute path to each sample’s quantification file.
Now we are ready to run tximport
tximport
is going to want paths to all the quantification files (salm_dirs) and a table that relates transcripts to genes (t2g). Luckily, we happen to have those exact two things.
tximport
Notice how we chose lengthscaledTPM for our abundance measurement. This is going to give us TPM values (transcripts per million) for expression in the $abundance slot. Let’s check out what we have now.
ensembl_gene_id | DIV0.Rep1 | DIV0.Rep2 | DIV0.Rep3 | DIV1.Rep1 | DIV1.Rep2 | DIV1.Rep3 | DIV1.Rep4 | DIV16.Rep1 | DIV16.Rep2 | DIV16.Rep3 | DIV16.Rep4 | DIV21.Rep1 | DIV21.Rep2 | DIV21.Rep3 | DIV21.Rep4 | DIV28.Rep1 | DIV28.Rep2 | DIV28.Rep3 | DIV28.Rep4 | DIV7.Rep1 | DIV7.Rep2 | DIV7.Rep3 | DIV7.Rep4 | DIVminus4.Rep1 | DIVminus4.Rep2 | DIVminus4.Rep3 | DIVminus8.Rep1 | DIVminus8.Rep2 | DIVminus8.Rep3 | DIVminus8.Rep4 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
ENSMUSG00000000001 | 148.004511 | 145.539529 | 164.680183 | 142.713005 | 138.042672 | 137.027499 | 123.894516 | 41.689798 | 41.554126 | 41.002841 | 39.522314 | 28.162815 | 29.694158 | 29.333682 | 27.956845 | 17.960041 | 17.979483 | 18.102228 | 19.590551 | 76.088940 | 76.669016 | 79.505738 | 76.038479 | 106.634101 | 106.395094 | 109.997598 | 92.799612 | 93.563524 | 95.500337 | 101.951963 |
ENSMUSG00000000003 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
ENSMUSG00000000028 | 34.180035 | 36.215924 | 39.486489 | 16.894844 | 15.391383 | 18.171282 | 16.654062 | 1.067147 | 1.270428 | 1.109391 | 1.136970 | 1.939558 | 1.003108 | 1.245186 | 1.086912 | 1.303824 | 0.935135 | 0.864451 | 0.862209 | 6.582942 | 6.752942 | 5.547107 | 5.510259 | 45.614077 | 47.851930 | 47.516125 | 59.718765 | 57.981259 | 57.935839 | 41.663343 |
ENSMUSG00000000031 | 7.213808 | 6.431925 | 13.253322 | 3.301713 | 2.539161 | 3.591509 | 2.798164 | 93.312776 | 56.314466 | 62.958855 | 68.565195 | 32.037271 | 37.465959 | 40.581328 | 24.780757 | 67.818115 | 64.333586 | 30.408290 | 60.198406 | 151.244212 | 137.227925 | 169.615059 | 150.599359 | 0.914481 | 1.402704 | 1.317939 | 0.177770 | 0.215726 | 0.320948 | 36.425779 |
ENSMUSG00000000037 | 5.877237 | 5.001733 | 7.579149 | 2.861842 | 3.176129 | 2.457556 | 3.591170 | 0.319281 | 0.708167 | 0.578467 | 0.513434 | 0.186198 | 0.260270 | 0.258075 | 0.585622 | 0.615895 | 0.133394 | 0.197050 | 1.005331 | 2.750903 | 1.827190 | 2.446732 | 2.894751 | 2.196733 | 2.309187 | 2.213591 | 3.187886 | 2.507155 | 2.624140 | 2.281871 |
ENSMUSG00000000049 | 0.000000 | 0.046208 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.411807 | 0.213589 | 0.118890 | 1.024154 | 0.628760 | 0.303218 | 0.449928 | 0.262019 | 0.105047 | 0.567535 | 0.649720 | 0.515246 | 0.360877 | 0.188254 | 0.050654 | 0.277837 | 0.189814 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
ENSMUSG00000000056 | 31.553583 | 28.610456 | 19.741875 | 59.046908 | 51.324548 | 54.015893 | 43.897451 | 36.575044 | 47.171198 | 41.459018 | 38.403920 | 55.170239 | 37.441141 | 44.043899 | 45.852392 | 38.906362 | 40.362321 | 44.933634 | 42.356811 | 41.828967 | 46.812515 | 44.172289 | 46.834670 | 14.509458 | 18.656028 | 16.335093 | 10.671419 | 10.666954 | 12.386288 | 8.077939 |
ENSMUSG00000000058 | 0.420304 | 0.507728 | 0.602757 | 0.357594 | 0.504091 | 0.498006 | 0.565744 | 0.931596 | 0.632998 | 0.606530 | 0.503290 | 0.292719 | 0.224968 | 0.401341 | 0.428359 | 0.490118 | 0.200389 | 0.038911 | 0.399079 | 0.194399 | 0.102993 | 0.063154 | 0.130644 | 0.163821 | 0.162316 | 0.251082 | 0.693347 | 0.201366 | 0.527565 | 0.482299 |
ENSMUSG00000000078 | 52.604855 | 64.802703 | 76.892914 | 60.446343 | 60.861211 | 61.059568 | 50.159693 | 32.474649 | 28.287563 | 27.391212 | 29.511742 | 19.632508 | 19.942257 | 20.182975 | 19.363735 | 16.845197 | 16.128839 | 18.828771 | 15.732413 | 52.939918 | 52.709453 | 54.334181 | 50.804497 | 51.427093 | 49.900737 | 53.274561 | 19.669879 | 19.472875 | 20.216492 | 21.591729 |
ENSMUSG00000000085 | 60.617248 | 59.182914 | 61.623794 | 72.657499 | 71.506733 | 75.816856 | 63.507273 | 36.498680 | 31.450681 | 36.983697 | 34.558348 | 31.893252 | 35.692047 | 31.095232 | 34.751542 | 36.339896 | 32.474173 | 38.456832 | 39.182603 | 52.494911 | 43.715591 | 45.720249 | 46.351620 | 9.949776 | 11.703136 | 9.320286 | 13.843137 | 11.699411 | 13.486194 | 10.366294 |
ENSMUSG00000000088 | 372.546920 | 364.764405 | 381.963748 | 389.617591 | 388.204349 | 413.080094 | 533.720843 | 882.254461 | 918.864762 | 881.567068 | 952.656933 | 1064.561469 | 1018.564149 | 1023.266589 | 935.493316 | 944.628320 | 1102.492902 | 1011.428801 | 993.369153 | 703.366663 | 749.229000 | 711.055019 | 744.360315 | 602.694088 | 539.095644 | 568.529023 | 715.919453 | 723.040511 | 713.534783 | 795.564123 |
ENSMUSG00000000093 | 18.748818 | 26.367729 | 28.740489 | 18.171262 | 17.023064 | 19.681817 | 21.696499 | 1.748123 | 1.159551 | 1.244226 | 1.286737 | 1.859175 | 1.536283 | 1.490581 | 1.446475 | 3.138448 | 3.362518 | 3.197714 | 3.445495 | 1.960168 | 1.812696 | 2.626110 | 2.401155 | 1.641559 | 1.120127 | 1.520160 | 0.123672 | 0.064585 | 0.118030 | 0.062646 |
ENSMUSG00000000094 | 0.114764 | 0.197628 | 0.237980 | 0.067150 | 0.050183 | 0.025762 | 0.000000 | 0.000000 | 0.000000 | 0.020946 | 0.000000 | 0.023273 | 0.000000 | 0.020613 | 0.020391 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.253309 | 0.359092 | 0.191826 | 0.280499 | 0.644665 | 0.508711 | 0.638906 | 1.610289 | 2.043695 | 1.513517 | 1.061169 |
ENSMUSG00000000103 | 0.000000 | 0.000000 | 0.019350 | 0.019932 | 0.000000 | 0.000000 | 0.000000 | 0.024280 | 0.000000 | 0.000000 | 0.027595 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.022570 | 0.020990 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.049946 |
ENSMUSG00000000120 | 18.187330 | 17.621063 | 16.764680 | 65.095814 | 63.742121 | 70.050947 | 65.437767 | 7.138900 | 6.728019 | 6.471314 | 7.243701 | 3.043655 | 3.779647 | 4.010854 | 3.292345 | 1.055312 | 1.336741 | 1.401802 | 1.323000 | 7.879629 | 8.472306 | 7.826161 | 8.344422 | 1.302789 | 1.494076 | 1.100786 | 2.720008 | 2.666598 | 3.080750 | 7.171139 |
ENSMUSG00000000125 | 9.426741 | 9.885487 | 5.404755 | 2.614957 | 1.888344 | 2.419112 | 2.543413 | 1.836920 | 2.186343 | 2.081065 | 2.033197 | 1.815480 | 1.714785 | 1.591194 | 1.707502 | 1.388731 | 1.150249 | 1.311930 | 1.489738 | 2.613648 | 2.668347 | 2.770951 | 2.336912 | 1.482033 | 1.217406 | 1.473157 | 0.168666 | 0.267824 | 0.162984 | 0.439311 |
ENSMUSG00000000126 | 1.215801 | 0.746274 | 0.814387 | 0.893695 | 0.726967 | 0.795409 | 0.811422 | 1.698931 | 0.816782 | 1.232844 | 1.175287 | 0.776041 | 0.671299 | 0.527801 | 0.856885 | 0.881084 | 0.587611 | 0.480369 | 0.599140 | 1.920124 | 1.992736 | 2.643161 | 2.442442 | 1.243716 | 0.787857 | 1.154595 | 1.480120 | 1.685512 | 1.662030 | 1.757216 |
ENSMUSG00000000127 | 13.113348 | 14.260035 | 14.074681 | 16.162555 | 17.427828 | 17.565289 | 11.863558 | 12.271221 | 14.241964 | 13.551247 | 14.347367 | 12.813797 | 12.260097 | 11.685471 | 12.283357 | 8.849005 | 8.757533 | 8.775956 | 9.371477 | 16.306853 | 15.553249 | 15.896542 | 15.505971 | 11.131771 | 11.417468 | 12.892073 | 12.916885 | 12.422329 | 12.138589 | 11.783150 |
ENSMUSG00000000131 | 55.485697 | 54.448611 | 62.549063 | 71.704310 | 71.301569 | 71.997625 | 85.422041 | 53.653944 | 54.569370 | 54.041498 | 53.203853 | 56.560171 | 55.599344 | 57.503726 | 56.142399 | 56.062512 | 54.730314 | 56.114317 | 55.657729 | 64.772660 | 60.737488 | 64.559072 | 65.908557 | 73.304601 | 82.616101 | 81.113216 | 72.032652 | 72.128976 | 70.421851 | 64.545350 |
ENSMUSG00000000134 | 33.480962 | 36.683455 | 38.983454 | 39.609181 | 45.133664 | 47.577079 | 38.693647 | 19.425951 | 20.279171 | 18.258503 | 19.466904 | 20.759852 | 19.514517 | 20.453484 | 23.945999 | 24.145265 | 23.148770 | 24.341792 | 20.775309 | 26.254515 | 25.201055 | 25.298568 | 24.448624 | 70.194375 | 70.469434 | 72.039951 | 62.710030 | 65.426174 | 62.960782 | 52.892113 |
ENSMUSG00000000142 | 34.556890 | 33.051268 | 31.249676 | 26.910897 | 25.255541 | 26.237510 | 24.678957 | 4.314812 | 3.851946 | 5.201375 | 4.297327 | 3.944854 | 3.709407 | 4.698792 | 4.180144 | 4.008057 | 4.408832 | 5.413859 | 4.183791 | 12.235464 | 12.443578 | 13.543412 | 12.406199 | 5.611935 | 4.639913 | 5.140094 | 3.517411 | 3.338406 | 3.652103 | 5.866998 |
ENSMUSG00000000148 | 12.975017 | 12.265602 | 13.479207 | 15.929561 | 15.998245 | 15.251781 | 16.439377 | 11.187122 | 11.765327 | 12.255481 | 11.805331 | 9.727571 | 10.811203 | 10.842452 | 9.828969 | 9.057784 | 8.261829 | 8.418036 | 8.912263 | 13.212425 | 11.883479 | 13.518251 | 13.482611 | 10.102323 | 10.474313 | 9.593967 | 7.725926 | 8.271340 | 8.221738 | 8.493223 |
ENSMUSG00000000149 | 62.786926 | 60.575599 | 64.868495 | 81.035249 | 77.612600 | 80.007932 | 79.501053 | 19.769167 | 19.068923 | 18.725772 | 19.826181 | 17.829466 | 16.300954 | 17.513712 | 15.040879 | 14.602690 | 13.900191 | 14.489153 | 13.408245 | 37.005858 | 35.055680 | 36.987634 | 36.668932 | 36.844849 | 38.723659 | 38.975206 | 27.466288 | 29.294849 | 29.622090 | 30.355845 |
ENSMUSG00000000154 | 0.333055 | 0.115844 | 0.432376 | 0.325385 | 0.332173 | 0.555266 | 0.274624 | 0.315355 | 0.635393 | 0.370886 | 0.117576 | 0.114637 | 0.156711 | 0.361648 | 0.316280 | 0.277836 | 0.297838 | 0.396606 | 0.245718 | 0.372255 | 0.472991 | 0.428795 | 0.411684 | 1.678056 | 1.611784 | 1.901671 | 3.967049 | 5.087375 | 4.540421 | 5.049930 |
ENSMUSG00000000157 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.019660 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.038957 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
ENSMUSG00000000159 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.066624 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.120014 | 0.192008 | 0.000000 | 0.069363 | 0.465487 | 0.133256 | 0.105527 | 0.609742 | 0.232339 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.034548 | 0.047125 | 0.116238 | 0.039351 | 0.258603 | 0.235119 | 0.181094 | 0.033496 |
ENSMUSG00000000167 | 7.167658 | 7.111275 | 4.562918 | 4.611452 | 5.165249 | 3.980233 | 5.810633 | 3.014607 | 2.165885 | 1.420868 | 2.402707 | 1.174127 | 1.731610 | 2.442734 | 1.718371 | 1.983554 | 1.487518 | 0.867104 | 1.640726 | 2.877345 | 4.086346 | 2.980909 | 2.788065 | 2.412113 | 1.093266 | 2.134765 | 3.075103 | 2.544267 | 4.048963 | 1.194254 |
ENSMUSG00000000168 | 46.898364 | 51.544129 | 50.803506 | 46.728899 | 46.416810 | 43.668901 | 42.770089 | 135.597979 | 142.174183 | 140.466776 | 155.267598 | 176.558772 | 177.449941 | 173.533809 | 177.441385 | 140.085372 | 135.955772 | 149.522280 | 139.289347 | 77.093701 | 65.875673 | 72.031057 | 69.113490 | 65.284363 | 59.589716 | 67.122029 | 66.846252 | 65.352465 | 66.011315 | 60.027737 |
ENSMUSG00000000171 | 183.510028 | 191.339737 | 185.366419 | 146.485489 | 136.055305 | 134.774362 | 144.419003 | 250.624457 | 258.445965 | 254.599788 | 264.936389 | 297.223823 | 290.955520 | 287.995065 | 284.714197 | 228.171116 | 236.310898 | 224.511224 | 221.322408 | 180.849179 | 173.186896 | 175.786723 | 175.691677 | 147.487631 | 156.118401 | 153.095971 | 219.561253 | 219.763695 | 210.634119 | 196.601985 |
ENSMUSG00000000182 | 0.103193 | 0.032365 | 0.029772 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.109766 | 0.257429 | 0.112760 | 0.211044 | 0.082172 | 0.037902 | 0.110168 | 0.111682 | 0.380279 | 0.139486 | 0.081007 | 0.246593 | 0.169495 | 0.356141 | 0.264438 | 0.269860 | 0.040837 | 0.041344 | 0.036939 | 0.000000 | 0.030240 | 0.000000 | 0.203803 |
ENSMUSG00000000183 | 0.000000 | 0.000000 | 0.010340 | 0.000000 | 0.010401 | 0.016126 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.015047 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.012499 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.012861 | 0.000000 | 0.000000 | 0.000000 | 0.020984 |
ENSMUSG00000000184 | 142.747177 | 147.826378 | 149.197724 | 66.761615 | 63.335841 | 65.070470 | 48.113890 | 21.348042 | 17.384228 | 15.935255 | 15.653769 | 16.292530 | 14.504585 | 17.797905 | 15.544638 | 17.701942 | 13.309930 | 15.605609 | 17.667904 | 28.101151 | 33.067019 | 38.044163 | 34.866641 | 7.046037 | 5.154655 | 5.428973 | 9.002990 | 11.536859 | 8.437719 | 2.867368 |
ENSMUSG00000000194 | 36.103417 | 38.818908 | 42.800890 | 36.817530 | 38.667199 | 36.778395 | 26.187279 | 33.002131 | 37.727627 | 36.606405 | 37.474231 | 36.703396 | 34.536615 | 36.214178 | 34.713799 | 33.020334 | 26.550992 | 28.899042 | 31.118602 | 34.300556 | 32.621947 | 33.363838 | 34.012681 | 30.084083 | 26.911440 | 26.198168 | 28.291703 | 30.043730 | 28.726287 | 29.337298 |
ENSMUSG00000000197 | 0.724701 | 0.864508 | 0.538785 | 1.073368 | 1.074458 | 0.792837 | 0.844630 | 25.556904 | 27.172721 | 27.223684 | 27.307364 | 23.469370 | 25.193771 | 24.809004 | 25.595933 | 22.134882 | 17.538889 | 21.238627 | 22.714445 | 11.528675 | 11.227085 | 11.667388 | 11.717720 | 0.717834 | 0.544315 | 0.843883 | 0.333168 | 0.479806 | 0.295036 | 0.135287 |
ENSMUSG00000000202 | 46.873410 | 38.860603 | 20.595529 | 21.277139 | 19.304139 | 22.041307 | 20.637502 | 1.998099 | 2.664860 | 2.147623 | 1.671951 | 0.707986 | 1.171352 | 2.805967 | 0.762410 | 1.275708 | 0.275591 | 0.391964 | 0.762712 | 3.703893 | 4.510447 | 4.937764 | 4.553973 | 1.186139 | 1.366082 | 0.695439 | 1.176641 | 2.441639 | 0.983961 | 3.130113 |
ENSMUSG00000000204 | 0.000000 | 0.077812 | 0.000000 | 0.036758 | 0.087585 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.043964 | 0.000000 | 0.000000 | 0.000000 | 0.018414 | 0.000000 | 0.000000 | 0.000000 | 0.038316 | 0.000000 | 0.000000 | 0.050157 | 0.000000 | 0.041803 | 0.000000 | 0.040884 | 0.152281 |
ENSMUSG00000000214 | 0.316821 | 0.376734 | 0.580427 | 0.074393 | 0.077525 | 0.232658 | 0.047076 | 0.000000 | 0.000000 | 0.000000 | 0.195838 | 0.050855 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.059494 | 0.028719 | 0.000000 | 0.080425 | 0.133600 | 0.051463 | 0.000000 | 0.062615 | 0.191165 | 0.206326 | 0.128695 | 0.070217 |
ENSMUSG00000000215 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.201293 | 0.000000 | 0.000000 | 0.079614 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.125601 | 0.000000 | 0.060125 | 0.339217 | 0.000000 | 0.526094 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
ENSMUSG00000000216 | 0.055074 | 0.051534 | 0.015706 | 0.000000 | 0.031705 | 0.000000 | 0.000000 | 0.039904 | 0.022940 | 0.019738 | 0.000000 | 0.000000 | 0.020080 | 0.000000 | 0.000000 | 0.000000 | 0.018491 | 0.000000 | 0.038279 | 0.017680 | 0.000000 | 0.017147 | 0.017771 | 0.022180 | 0.000000 | 0.000000 | 0.000000 | 0.065627 | 0.033124 | 0.015572 |
ENSMUSG00000000223 | 1.490207 | 1.396739 | 1.636890 | 1.736106 | 1.971587 | 1.840778 | 1.365031 | 2.454484 | 2.567922 | 2.499309 | 2.498017 | 2.092696 | 1.983752 | 2.066465 | 1.717463 | 2.653865 | 1.954384 | 1.841953 | 2.723702 | 2.868183 | 2.988016 | 3.715324 | 3.097974 | 0.185306 | 0.237818 | 0.133284 | 0.229376 | 0.237803 | 0.222297 | 0.190730 |
ENSMUSG00000000244 | 0.083692 | 0.054689 | 0.130041 | 0.090199 | 0.193026 | 0.120789 | 0.083430 | 0.218939 | 0.038914 | 0.099444 | 0.015106 | 0.000000 | 0.000000 | 0.013459 | 0.000000 | 0.154390 | 0.067568 | 0.270073 | 0.127634 | 0.108335 | 0.113842 | 0.088430 | 0.052551 | 0.638876 | 0.595938 | 0.471887 | 0.533838 | 0.596377 | 0.453464 | 0.350619 |
ENSMUSG00000000247 | 3.581850 | 3.499372 | 3.528227 | 2.710486 | 3.101592 | 2.767801 | 2.565610 | 1.225779 | 0.659762 | 1.060458 | 0.978868 | 0.964519 | 0.809062 | 0.716354 | 0.546474 | 0.468901 | 0.592905 | 0.252756 | 0.383217 | 9.739157 | 11.367096 | 12.204956 | 9.050891 | 1.025635 | 1.058490 | 0.835800 | 0.563193 | 0.906406 | 0.423688 | 0.336354 |
ENSMUSG00000000248 | 0.000000 | 0.307460 | 0.275004 | 0.000000 | 0.000000 | 0.037225 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.726755 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.032926 | 0.094631 | 0.027475 | 0.000000 | 0.264993 | 0.027614 | 0.000000 | 0.035735 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
ENSMUSG00000000253 | 5.002777 | 4.525335 | 4.179621 | 5.048770 | 4.947710 | 4.666987 | 4.505384 | 23.658215 | 24.517931 | 25.751619 | 26.217720 | 35.486540 | 36.812237 | 35.029205 | 34.464815 | 47.977399 | 51.307180 | 44.104269 | 43.421043 | 18.585852 | 19.591512 | 21.542503 | 19.528295 | 10.439729 | 10.042434 | 12.184047 | 20.222381 | 19.607781 | 19.986088 | 25.072968 |
ENSMUSG00000000263 | 2.981638 | 1.970632 | 2.113011 | 1.671629 | 1.341986 | 2.387899 | 0.558066 | 43.069555 | 53.099363 | 55.631124 | 57.681339 | 85.004519 | 87.323404 | 79.087829 | 83.067736 | 94.715692 | 60.793459 | 96.932514 | 95.108836 | 3.780805 | 3.893267 | 3.860449 | 4.027712 | 0.037352 | 0.037589 | 0.032486 | 0.056548 | 0.079220 | 0.058177 | 0.109525 |
ENSMUSG00000000266 | 4.207659 | 3.886346 | 3.670926 | 5.967348 | 5.313591 | 5.564873 | 3.908537 | 18.434130 | 19.178063 | 18.998609 | 19.210836 | 22.110790 | 21.232814 | 21.788512 | 20.701333 | 21.877453 | 18.154458 | 24.446192 | 24.809788 | 14.273999 | 14.445732 | 13.674834 | 13.532920 | 0.403795 | 0.353641 | 0.384367 | 0.045010 | 0.072257 | 0.094877 | 0.126511 |
ENSMUSG00000000275 | 15.337964 | 15.667032 | 22.966957 | 6.220162 | 6.831170 | 6.443813 | 6.385162 | 1.564104 | 1.618634 | 1.642415 | 1.574792 | 1.154512 | 1.511644 | 1.017289 | 0.828477 | 1.030165 | 0.983866 | 0.759897 | 1.061297 | 3.119988 | 3.554271 | 3.629310 | 3.486077 | 44.342670 | 44.274541 | 45.952267 | 68.204400 | 66.534540 | 67.546482 | 65.233133 |
ENSMUSG00000000276 | 6.707699 | 6.335487 | 5.715570 | 11.190985 | 11.445650 | 10.558276 | 8.857024 | 23.079722 | 24.157868 | 24.513742 | 24.306458 | 22.041553 | 22.124420 | 21.233254 | 22.742842 | 28.209391 | 23.221775 | 27.581952 | 27.546310 | 20.847164 | 20.360543 | 19.165914 | 19.357199 | 7.914284 | 8.048394 | 8.766309 | 11.622329 | 11.302601 | 11.791095 | 7.286859 |
ENSMUSG00000000278 | 90.617387 | 96.681487 | 113.010324 | 37.035350 | 35.574970 | 37.008561 | 35.981529 | 53.884610 | 53.796869 | 52.070052 | 55.294585 | 71.168345 | 71.952532 | 69.563003 | 72.057016 | 80.149296 | 90.857958 | 89.634870 | 77.269610 | 38.043631 | 38.898064 | 39.409246 | 38.148701 | 174.002655 | 172.278617 | 171.035085 | 162.763884 | 161.548267 | 159.085443 | 194.152958 |
ENSMUSG00000000282 | 28.352789 | 28.584065 | 28.490442 | 51.536750 | 48.043432 | 53.508818 | 41.034434 | 34.048141 | 39.276596 | 38.718342 | 42.472822 | 33.504992 | 34.355608 | 38.392612 | 31.313721 | 28.164786 | 28.592877 | 31.653213 | 29.223527 | 46.532177 | 45.670087 | 48.495228 | 50.032227 | 27.125277 | 25.598833 | 24.739579 | 18.692252 | 16.789268 | 15.102999 | 17.659619 |
Alright, not bad!
Let’s stop and think for a minute about what tximport
did and the metric we are using (TPM). What does transcripts per million mean? Well, it means pretty much what it sounds like. For every million transcripts in the cell, X of them are this particular transcript. Importantly, this means when this TPM value was calculated from the number of counts a transcript received, this number had to be adjusted for both the total number of counts in the library and the length of a transcript.
If sample A had twice the number of total counts as sample B (i.e. was sequenced twice as deeply), then you would expect every transcript to have approximately twice the number of counts in sample A as it has in sample B. Similarly, if transcript X is twice as long as transcript Y, then you would expect that if they were equally expressed (i.e. the same number of transcript X and transcript Y molecules were present in the sample) that X would have approximately twice the counts that Y does. Working with expression units of TPM incorporates both of these normalizations.
So, if a TPM of X means that for every million transcripts in the sample that X of them were the transcript of interest, then the sum of TPM values across all species should equal one million, right?
Let’s check and see if that’s true.
[1] 985678.2
[1] 985291.2
[1] 985100.6
OK, not quite one million, but pretty darn close.
This notion that TPMs represent proportions of a whole also leads to another interesting insight into what tximport
is doing here. If all transcripts belong to genes, then the TPM for a gene must be the sum of the TPMs of its transcripts. Can we verify that that is true?
ensembl_transcript_id | DIV0.Rep1 | DIV0.Rep2 | DIV0.Rep3 | DIV1.Rep1 | DIV1.Rep2 | DIV1.Rep3 | DIV1.Rep4 | DIV16.Rep1 | DIV16.Rep2 | DIV16.Rep3 | DIV16.Rep4 | DIV21.Rep1 | DIV21.Rep2 | DIV21.Rep3 | DIV21.Rep4 | DIV28.Rep1 | DIV28.Rep2 | DIV28.Rep3 | DIV28.Rep4 | DIV7.Rep1 | DIV7.Rep2 | DIV7.Rep3 | DIV7.Rep4 | DIVminus4.Rep1 | DIVminus4.Rep2 | DIVminus4.Rep3 | DIVminus8.Rep1 | DIVminus8.Rep2 | DIVminus8.Rep3 | DIVminus8.Rep4 | ensembl_gene_id |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
ENSMUST00000119854 | 2.286511 | 2.475351 | 2.778492 | 3.752931 | 4.290970 | 3.606815 | 2.289835 | 4.753009 | 6.407206 | 6.093523 | 8.073422 | 4.114218 | 3.854100 | 4.504607 | 4.233634 | 0.000000 | 3.062028 | 2.657892 | 4.551532 | 7.708040 | 5.771621 | 4.913897 | 5.733164 | 0.821894 | 0.000000 | 0.791545 | 0.452225 | 0.283685 | 0.253438 | 0.000000 | ENSMUSG00000037736 |
ENSMUST00000147408 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | ENSMUSG00000052516 |
ENSMUST00000070080 | 17.294246 | 16.642412 | 20.145840 | 22.849371 | 23.111531 | 23.055949 | 18.922899 | 85.742059 | 92.095921 | 92.061439 | 90.117232 | 80.253664 | 82.962866 | 82.837521 | 83.218155 | 68.286276 | 59.179619 | 65.655743 | 67.027119 | 61.090866 | 60.764495 | 60.258015 | 58.545413 | 22.716225 | 23.042382 | 25.299848 | 18.289952 | 19.497363 | 17.597384 | 22.126199 | ENSMUSG00000056124 |
ENSMUST00000033908 | 0.277638 | 0.771701 | 0.734236 | 0.835938 | 1.130580 | 1.774933 | 0.262763 | 0.231432 | 0.209738 | 0.302866 | 0.342360 | 0.394527 | 0.196355 | 0.088545 | 0.170725 | 0.088308 | 0.148182 | 0.037937 | 0.361527 | 0.124041 | 0.411704 | 0.216135 | 0.384765 | 0.627833 | 0.412680 | 0.393967 | 0.177899 | 0.097658 | 0.091573 | 0.404344 | ENSMUSG00000031511 |
ENSMUST00000174016 | 8.545914 | 6.122679 | 4.978280 | 13.429257 | 13.448801 | 13.789999 | 9.139241 | 3.772152 | 6.493719 | 6.454788 | 7.620481 | 3.964981 | 3.769810 | 4.272302 | 4.711036 | 4.523138 | 4.025177 | 6.303732 | 5.551957 | 20.404943 | 18.886851 | 26.036222 | 24.703049 | 15.936284 | 17.060844 | 18.859078 | 9.623711 | 12.255722 | 10.591347 | 2.846295 | ENSMUSG00000036036 |
ENSMUST00000082868 | 5.240588 | 7.435392 | 21.577246 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 3.157229 | 8.363363 | 2.106274 | 7.631468 | 4.013754 | 0.000000 | ENSMUSG00000064802 |
ENSMUST00000187399 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.112263 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | ENSMUSG00000090925 |
ENSMUST00000159512 | 1.197789 | 1.783770 | 1.592837 | 3.586005 | 2.904509 | 3.830286 | 1.389749 | 1.361614 | 1.953179 | 1.801017 | 2.061461 | 2.504494 | 1.787667 | 1.311947 | 1.985731 | 2.389696 | 0.801072 | 1.741500 | 1.538186 | 0.880757 | 2.094465 | 0.624705 | 1.609800 | 0.248865 | 0.954488 | 1.411125 | 1.029973 | 0.838168 | 0.218119 | 0.135338 | ENSMUSG00000029207 |
ENSMUST00000194793 | 0.000000 | 0.068955 | 0.061078 | 0.064820 | 0.000000 | 0.000000 | 0.000000 | 0.078959 | 0.177114 | 0.299510 | 0.089493 | 0.181351 | 0.080516 | 0.310951 | 0.222825 | 0.324322 | 0.070346 | 0.163243 | 0.468088 | 0.067476 | 0.360582 | 0.332645 | 0.068630 | 0.000000 | 0.000000 | 0.000000 | 0.155780 | 0.076557 | 0.000000 | 0.000000 | ENSMUSG00000103497 |
ENSMUST00000221298 | 0.077941 | 0.100916 | 0.090854 | 0.309668 | 0.244962 | 0.239357 | 0.299259 | 0.174734 | 0.291636 | 0.193253 | 0.231279 | 0.296867 | 0.117060 | 0.200170 | 0.109928 | 0.266300 | 0.103878 | 0.301950 | 0.395391 | 0.569643 | 0.215417 | 0.462677 | 0.353049 | 0.000000 | 0.000000 | 0.028680 | 0.000000 | 0.000000 | 0.000000 | 0.050635 | ENSMUSG00000113688 |
ENSMUST00000138551 | 0.000000 | 0.000000 | 0.119676 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | ENSMUSG00000034164 |
ENSMUST00000210944 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | ENSMUSG00000110312 |
ENSMUST00000194146 | 0.066535 | 0.185273 | 0.055779 | 0.000000 | 0.000000 | 0.000000 | 0.070382 | 0.000000 | 0.000000 | 0.000000 | 0.078488 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.129887 | 0.000000 | 0.000000 | 0.000000 | 0.162839 | 0.143064 | 0.000000 | 0.000000 | 0.058096 | 0.000000 | ENSMUSG00000102375 |
ENSMUST00000112725 | 1.270589 | 1.049486 | 1.271776 | 1.294794 | 1.619005 | 1.511612 | 0.893824 | 0.688353 | 0.884466 | 0.635204 | 0.709094 | 0.928998 | 0.846144 | 0.829175 | 0.696928 | 0.989624 | 1.018670 | 0.818393 | 1.100337 | 1.270363 | 0.878781 | 1.156467 | 1.281586 | 1.667594 | 1.633202 | 1.868757 | 1.477511 | 1.274160 | 1.640671 | 1.306233 | ENSMUSG00000025269 |
ENSMUST00000181391 | 2.222061 | 1.416381 | 2.497798 | 3.006018 | 3.142753 | 1.571165 | 3.167750 | 3.530355 | 2.965603 | 4.649879 | 3.558337 | 3.465148 | 3.203126 | 3.106205 | 2.862499 | 0.000000 | 1.333758 | 2.039999 | 3.282519 | 5.710520 | 3.937353 | 6.847600 | 3.991743 | 1.733049 | 0.496494 | 1.601051 | 1.270107 | 1.507055 | 1.477476 | 0.000000 | ENSMUSG00000030446 |
ENSMUST00000223290 | 0.000000 | 0.000000 | 0.000000 | 0.026984 | 0.000000 | 0.000000 | 0.032844 | 0.033378 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.033128 | 0.000000 | 0.000000 | 0.031158 | 0.072496 | 0.032133 | 0.000000 | 0.031972 | 0.029910 | 0.000000 | 0.000000 | 0.036096 | 0.000000 | 0.029574 | 0.000000 | 0.000000 | 0.000000 | ENSMUSG00000035954 |
ENSMUST00000116191 | 0.000000 | 0.044746 | 0.000000 | 0.000000 | 0.104528 | 0.168647 | 0.046020 | 0.048814 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.102273 | 0.042109 | 0.000000 | 0.069164 | 0.000000 | 0.111104 | 0.170086 | 0.000000 | 0.000000 | ENSMUSG00000080078 |
ENSMUST00000060782 | 0.267962 | 0.170311 | 0.391316 | 0.446440 | 0.000000 | 0.000000 | 0.389636 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.087233 | 0.046855 | 0.000000 | 0.000000 | 0.110542 | 0.000000 | 0.048208 | 0.044187 | 0.083067 | 0.328807 | 0.000000 | ENSMUSG00000051716 |
ENSMUST00000141479 | 0.960027 | 0.541858 | 0.000000 | 0.000000 | 0.473587 | 1.080745 | 0.000000 | 0.274066 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.359059 | 1.129914 | 0.300957 | 0.388096 | 0.000000 | 0.000000 | 0.801308 | 1.141492 | 1.285565 | 0.314952 | 1.575395 | 0.000000 | 0.348874 | 0.398863 | 0.000000 | 0.970236 | 1.421541 | 0.351342 | ENSMUSG00000049119 |
ENSMUST00000211096 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | ENSMUSG00000110353 |
OK so lets look at the expression of ENSMUSG00000020634 in the first sample (DIVminus8.Rep1).
[1] 46.71414
[1] 46.71414
OK now that we’ve got expression values for all genes, we now might want to use these expression values to learn a little bit about our samples. One simple question is > Are replicates similar to each other, or at least more similar to each other than to other samples?
If our data is worth anything at all, we would hope that differences between replicates, which are supposed to be drawn from the same condition, are smaller than differences between samples drawn from different conditions. If that’s not true, it could indicate that one replicate is very different from other replciates (in which case we might want to remove it), or that the data in general is of poor quality.
Another question is:
How similar is each sample to every other sample?
In our timecourse, we might expect that samples drawn from adjacent timepoints might be more similar to each other than samples from more distant timepoints.
A simple way to think about this is to simply correlate TPM values for genes between samples. For plotting purposes here, let’s plot the log(TPM) of two samples against each other. However, for the actual correlation coefficient we are going to be using the Spearman correlation method, which uses ranks, not absolute values. This means that whether or not you take the log will have no effect on the Spearman correlation coefficient.
# DIVminus8.Rep1 vs DIVminus8.Rep2
# Since we are plotting log TPM values, we need to add a pseudocount to all samples.
# log(0) is a problem.
# Add pseudocounts and take log within ggplot function call
r.spearman <- cor.test(
tpms$DIVminus8.Rep1, tpms$DIVminus8.Rep2,
method = "spearman"
)$estimate[[1]]
r.spearman <- signif(r.spearman, 2)
ggplot(tpms, aes(x = log10(DIVminus8.Rep1 + 1e-3), y = log10(DIVminus8.Rep2 + 1e-3))) +
geom_point() +
theme_classic() +
annotate("text", x = 2, y = 0, label = paste0("R = ", r.spearman))
With RNAseq data, the variance of a gene’s expression increases as the expression increases. However, using a pseudocount and taking the log of the expression value actually reverses this trend. Now, genes with the lowest expression have the most variance. Why is this a problem? Well, the genes with the most variance are going to be the ones that contribute the most to intersample differences. Ideally, we would like to therefore remove the relationship between expression and variance.
There are transformations, notably rlog
and vst
, that are made to deal with this, but they are best used when dealing with normalized count data, while here we are dealing with TPMs. We will talk about counts later, but not here.
So, for now, we will take another approach of simply using an expression threshold. Any gene that does not meet our threshold will be excluded from the analysis. Obviously where to set this threshold is a bit subjective. For now, we will set this cutoff at 1 TPM.
# DIVminus8.Rep1 vs DIVminus8.Rep2
# Since we are plotting log TPM values we will only keep for genes that have expression of at least 1 TPM in both samples
tpms.2samplecor <-
dplyr::select(tpms, c(ensembl_gene_id, DIVminus8.Rep1, DIVminus8.Rep2)) |>
filter(DIVminus8.Rep1 >= 1 & DIVminus8.Rep2 >= 1)
# pull the correlation coefficient
r.spearman <- cor.test(
tpms.2samplecor$DIVminus8.Rep1,
tpms.2samplecor$DIVminus8.Rep2,
method = "spearman"
)$estimate[[1]]
# round/set sig digits
r.spearman <- signif(r.spearman, 2)
# plot
ggplot(tpms.2samplecor, aes(x = log10(DIVminus8.Rep1 + 1e-3), y = log10(DIVminus8.Rep2 + 1e-3))) +
geom_point() +
theme_classic() +
annotate("text", x = 2, y = 1, label = paste0("R = ", r.spearman))
OK that’s two samples compared to each other, but now we want to see how all samples compare to all other samples. Before we do this we need to decide how to apply our expression cutoff across many samples. Should a gene have to meet the cutoff in only one sample? In all samples? Let’s start by saying it has to meet the cutoff in at least half of the 30 samples.
[1] 50123
[1] 14843
Now we can use the cor
function to calculate pairwise correlations in a .red[matrix] of TPM values.
DIV0.Rep1 DIV0.Rep2 DIV0.Rep3 DIV1.Rep1 DIV1.Rep2 DIV1.Rep3 DIV1.Rep4
DIV0.Rep1 1.0000000 0.9921992 0.9857075 0.9344949 0.9320084 0.9318094 0.9341577
DIV0.Rep2 0.9921992 1.0000000 0.9895527 0.9290296 0.9269420 0.9270552 0.9301675
DIV0.Rep3 0.9857075 0.9895527 1.0000000 0.9172688 0.9149980 0.9161315 0.9175274
DIV1.Rep1 0.9344949 0.9290296 0.9172688 1.0000000 0.9932976 0.9913331 0.9857878
DIV1.Rep2 0.9320084 0.9269420 0.9149980 0.9932976 1.0000000 0.9911381 0.9851731
DIV1.Rep3 0.9318094 0.9270552 0.9161315 0.9913331 0.9911381 1.0000000 0.9863747
DIV16.Rep1 DIV16.Rep2 DIV16.Rep3 DIV16.Rep4 DIV21.Rep1 DIV21.Rep2
DIV0.Rep1 0.6674215 0.6543366 0.6549751 0.6557675 0.6430027 0.6437791
DIV0.Rep2 0.6609602 0.6460813 0.6477044 0.6479861 0.6365390 0.6368957
DIV0.Rep3 0.6374190 0.6211743 0.6223612 0.6227213 0.6121500 0.6133398
DIV1.Rep1 0.7547993 0.7419703 0.7432819 0.7436624 0.7234613 0.7235614
DIV1.Rep2 0.7567175 0.7438094 0.7457015 0.7456283 0.7251852 0.7253125
DIV1.Rep3 0.7560271 0.7419976 0.7444082 0.7441764 0.7240899 0.7238994
DIV21.Rep3 DIV21.Rep4 DIV28.Rep1 DIV28.Rep2 DIV28.Rep3 DIV28.Rep4
DIV0.Rep1 0.6455802 0.6422156 0.6385205 0.6416556 0.6336616 0.6352217
DIV0.Rep2 0.6387583 0.6339425 0.6312565 0.6348317 0.6259258 0.6283085
DIV0.Rep3 0.6138374 0.6091561 0.6080339 0.6112330 0.6026283 0.6047245
DIV1.Rep1 0.7259016 0.7220668 0.7162478 0.7139679 0.7103676 0.7129652
DIV1.Rep2 0.7277639 0.7246262 0.7182024 0.7164955 0.7130165 0.7151487
DIV1.Rep3 0.7259727 0.7221704 0.7165046 0.7140036 0.7110218 0.7129018
DIV7.Rep1 DIV7.Rep2 DIV7.Rep3 DIV7.Rep4 DIVminus4.Rep1 DIVminus4.Rep2
DIV0.Rep1 0.7738807 0.7788370 0.7768059 0.7753821 0.8313811 0.8285264
DIV0.Rep2 0.7667818 0.7719339 0.7696461 0.7689326 0.8356858 0.8331484
DIV0.Rep3 0.7431343 0.7490053 0.7462518 0.7452457 0.8515066 0.8500630
DIV1.Rep1 0.8501343 0.8518467 0.8519696 0.8513401 0.7365926 0.7348312
DIV1.Rep2 0.8519090 0.8530144 0.8535201 0.8521009 0.7330103 0.7303372
DIV1.Rep3 0.8503243 0.8519535 0.8523296 0.8519729 0.7341659 0.7314245
DIVminus4.Rep3 DIVminus8.Rep1 DIVminus8.Rep2 DIVminus8.Rep3
DIV0.Rep1 0.8298582 0.7952033 0.7960610 0.7958396
DIV0.Rep2 0.8342924 0.8016864 0.8025006 0.8009213
DIV0.Rep3 0.8516980 0.8181704 0.8185364 0.8187208
DIV1.Rep1 0.7354999 0.6986520 0.6993633 0.6996563
DIV1.Rep2 0.7320226 0.6954396 0.6958901 0.6953545
DIV1.Rep3 0.7338895 0.6967133 0.6970533 0.6969969
DIVminus8.Rep4
DIV0.Rep1 0.7849066
DIV0.Rep2 0.7897993
DIV0.Rep3 0.8067435
DIV1.Rep1 0.6906591
DIV1.Rep2 0.6866114
DIV1.Rep3 0.6879394
Now we need to plot these and have similar samples (i.e. those that are highly correlated with each other) be clustered near to each other. We will use pheatmap
to do this.
This looks pretty good! There are two main points to takeaway here. First, all replicates for a given timepoint are clustering with each other. Second, you can kind of derive the order of the timepoints from the clustering. The biggest separation is between early (DIVminus8 to DIV1) and late (DIV7 to DIV28). After that you can then see finer-grained structure.
Another way to visualize relationships is using a dimensionality reduction technique called Principal Component Analysis (PCA). Let’s watch this short video. It focuses more on how to interpret them rather than the math behind their creation.
PCA works best when values are approximately normally distributed, so we will first take the log of our expression values.
With our cutoff as it is now (genes have to have expression of at least 1 TPM in half the samples), it is possible that we will have some 0 values. Taking the log of 0 might cause a problem, so we will add a pseudocount.
Very similar interpretation as before (heatmap of correlation).
Course website: https://rnabioco.github.io/molb-7950