library(tidyverse)
library(ggbeeswarm)
Genes across the tree of life
In the previous episode we explored the different types of genes within a species, now it’s time to see how they compere between species across the tree of life. We already have the data for S_cerevisiae, so now we need to download the same kind of data for A_thaliana, D_melanogaster, E_coli and P_furiosus. Save all the tables in data/genes/.
This table has all the tables above combined: genes.tsv
Setup R environment
Import the libraries that we will need.
Import tables of genes
<- read.delim("data/genes.tsv", sep = "\t") genes
Types of genes per species
From the previous episode we learned that S cerevisiae has 10 types of RNA genes, protein-coding genes, pseudogenes and other. Let’s see which types are in all the species.
<- genes %>%
gene_type group_by(Species,Gene.Type)%>%
summarize(Number_genes = n())%>%
ungroup()%>%
group_by(Species)%>%
mutate(All_genes = sum(Number_genes), Percent_genes = Number_genes * 100 / All_genes)
head(gene_type)
Species | Gene.Type | Number_genes | All_genes | Percent_genes |
---|---|---|---|---|
A_thaliana | antisense_RNA | 79 | 38315 | 0.2061856 |
A_thaliana | lncRNA | 3480 | 38315 | 9.0826047 |
A_thaliana | miRNA | 326 | 38315 | 0.8508417 |
A_thaliana | misc_RNA | 508 | 38315 | 1.3258515 |
A_thaliana | ncRNA | 439 | 38315 | 1.1457654 |
A_thaliana | ncRNA_pseudogene | 2 | 38315 | 0.0052199 |
We can now plot this to do the comparison.
ggplot(gene_type, aes(x = Species, y = Percent_genes, fill = Species))+
geom_col()+
facet_wrap(~Gene.Type, scales = "free_y")+
theme(axis.text.x = element_text(angle=45, hjust = 1),
legend.position = "none")
We have many types of genes that are absent from some species. Let’s keep only the ones who are not present in all species.
<- gene_type %>%
gene_types_in_all group_by(Gene.Type)%>%
summarize(Number_species = n())%>%
filter(Number_species == 5)
gene_types_in_all
Gene.Type | Number_species |
---|---|
protein-coding | 5 |
pseudogene | 5 |
rRNA | 5 |
tRNA | 5 |
Let’s filter our gene_type
object to remain only with the gene types in all species and plot it again.
<- gene_type %>%
gene_type filter(Gene.Type %in% gene_types_in_all$Gene.Type)
ggplot(gene_type, aes(x = Species, y = Percent_genes, fill = Species))+
geom_col()+
facet_wrap(~Gene.Type, scales = "free_y")+
theme(axis.text.x = element_text(angle=45, hjust = 1),
legend.position = "none")
Size of genes
Previously we examined the sizes of the different types of genes in S cerevisiae, with our new data set we can now compare across the 5 species. Let’s go back to use the genes
dataframe, filter it to use only the common gene types, and calculate the size from the begin and end positions.
<- genes %>%
genes_size filter(Gene.Type %in% gene_types_in_all$Gene.Type)%>%
mutate(Size = End - Begin)
head(genes_size)
Accession | Begin | End | Chromosome | Orientation | Name | Symbol | Gene.ID | Gene.Type | Transcripts.accession | Protein.accession | Protein.length | Locus.tag | Species | Size |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
NC_000932.1 | 4 | 76 | Pltd | minus | tRNA-His | trnH | 1466273 | tRNA | NA | ArthCt088 | A_thaliana | 72 | ||
NC_000932.1 | 383 | 1444 | Pltd | minus | photosystem II protein D1 | psbA | 844802 | protein-coding | NP_051039.1 | 353 | ArthCp002 | A_thaliana | 1061 | |
NC_000932.1 | 1717 | 4347 | Pltd | minus | tRNA-Lys | trnK | 1466272 | tRNA | NA | ArthCt089 | A_thaliana | 2630 | ||
NC_000932.1 | 2056 | 3570 | Pltd | minus | maturase K | matK | 844797 | protein-coding | NP_051040.2 | 504 | ArthCp003 | A_thaliana | 1514 | |
NC_000932.1 | 5084 | 6188 | Pltd | minus | ribosomal protein S16 | rps16 | 844798 | protein-coding | NP_051041.1 | 79 | ArthCp004 | A_thaliana | 1104 | |
NC_000932.1 | 6616 | 6687 | Pltd | minus | tRNA-Gln | trnQ | 1466271 | tRNA | NA | ArthCt090 | A_thaliana | 71 |
Now let’s make a summary dataframe with the average length of each gene type per species.
<- genes_size %>%
size_avg group_by(Species, Gene.Type)%>%
summarise(Avg_Length = mean(Size, na.rm = TRUE))
`summarise()` has grouped output by 'Species'. You can override using the
`.groups` argument.
size_avg
Species | Gene.Type | Avg_Length |
---|---|---|
A_thaliana | protein-coding | 2403.22110 |
A_thaliana | pseudogene | 2399.95718 |
A_thaliana | rRNA | 1169.27273 |
A_thaliana | tRNA | 85.94196 |
D_melanogaster | protein-coding | 6961.84623 |
D_melanogaster | pseudogene | 985.41801 |
D_melanogaster | rRNA | 254.40541 |
D_melanogaster | tRNA | 74.26603 |
E_coli | protein-coding | 902.42192 |
E_coli | pseudogene | 1107.13235 |
E_coli | rRNA | 1451.81818 |
E_coli | tRNA | 77.02913 |
P_furiosus | protein-coding | 863.96383 |
P_furiosus | pseudogene | 539.93548 |
P_furiosus | rRNA | 2273.00000 |
P_furiosus | tRNA | 78.04762 |
S_cerevisiae | protein-coding | 1483.59844 |
S_cerevisiae | pseudogene | 959.00000 |
S_cerevisiae | rRNA | 1250.00000 |
S_cerevisiae | tRNA | 78.52508 |
If we plot the sizes using log10 transformation we will see the patterns more easily.
ggplot(data = genes_size, aes(x = Species, y = Size, color = Species))+
geom_quasirandom()+
geom_boxplot(alpha = 0.5, color = "darkgray")+
labs(title = "Gene Length", y = "Log 10 of gene length (bp)")+
facet_wrap(~Gene.Type, scales = "free_x", nrow =1)+
theme(legend.position = "none",
axis.text.x = element_text(angle = 45, hjust = 1))+
scale_y_log10(label = scales::comma)
In this plot we can see that across the tree of life, both rRNA and tRNA have similar sizes, but protein-coding and pseudogenes have a wider range of sizes, being a plant and animal the ones with bigger genes.
How can you explain the strange pattern in the sizes of rRNA genes?