Genes across the tree of life

In the previous episode we explored the different types of genes within a species, now it’s time to see how they compere between species across the tree of life. We already have the data for S_cerevisiae, so now we need to download the same kind of data for A_thaliana, D_melanogaster, E_coli and P_furiosus. Save all the tables in data/genes/.

This table has all the tables above combined: genes.tsv

Setup R environment

Import the libraries that we will need.

library(tidyverse)
library(ggbeeswarm)

Import tables of genes

genes <- read.delim("data/genes.tsv", sep = "\t")

Types of genes per species

From the previous episode we learned that S cerevisiae has 10 types of RNA genes, protein-coding genes, pseudogenes and other. Let’s see which types are in all the species.

gene_type <- genes %>%
  group_by(Species,Gene.Type)%>%
  summarize(Number_genes = n())%>%
  ungroup()%>%
  group_by(Species)%>%
  mutate(All_genes = sum(Number_genes), Percent_genes = Number_genes * 100 / All_genes)

head(gene_type)
Species Gene.Type Number_genes All_genes Percent_genes
A_thaliana antisense_RNA 79 38315 0.2061856
A_thaliana lncRNA 3480 38315 9.0826047
A_thaliana miRNA 326 38315 0.8508417
A_thaliana misc_RNA 508 38315 1.3258515
A_thaliana ncRNA 439 38315 1.1457654
A_thaliana ncRNA_pseudogene 2 38315 0.0052199

We can now plot this to do the comparison.

ggplot(gene_type, aes(x = Species, y = Percent_genes, fill = Species))+
  geom_col()+
  facet_wrap(~Gene.Type, scales = "free_y")+
  theme(axis.text.x = element_text(angle=45, hjust = 1),
        legend.position = "none")

We have many types of genes that are absent from some species. Let’s keep only the ones who are not present in all species.

gene_types_in_all <- gene_type %>%
  group_by(Gene.Type)%>%
  summarize(Number_species = n())%>%
  filter(Number_species == 5)
gene_types_in_all
Gene.Type Number_species
protein-coding 5
pseudogene 5
rRNA 5
tRNA 5

Let’s filter our gene_type object to remain only with the gene types in all species and plot it again.

gene_type <- gene_type %>%
  filter(Gene.Type %in% gene_types_in_all$Gene.Type)
ggplot(gene_type, aes(x = Species, y = Percent_genes, fill = Species))+
  geom_col()+
  facet_wrap(~Gene.Type, scales = "free_y")+
  theme(axis.text.x = element_text(angle=45, hjust = 1),
        legend.position = "none")

Size of genes

Previously we examined the sizes of the different types of genes in S cerevisiae, with our new data set we can now compare across the 5 species. Let’s go back to use the genes dataframe, filter it to use only the common gene types, and calculate the size from the begin and end positions.

genes_size <- genes %>%
  filter(Gene.Type %in% gene_types_in_all$Gene.Type)%>%
  mutate(Size = End - Begin)

head(genes_size)
Accession Begin End Chromosome Orientation Name Symbol Gene.ID Gene.Type Transcripts.accession Protein.accession Protein.length Locus.tag Species Size
NC_000932.1 4 76 Pltd minus tRNA-His trnH 1466273 tRNA NA ArthCt088 A_thaliana 72
NC_000932.1 383 1444 Pltd minus photosystem II protein D1 psbA 844802 protein-coding NP_051039.1 353 ArthCp002 A_thaliana 1061
NC_000932.1 1717 4347 Pltd minus tRNA-Lys trnK 1466272 tRNA NA ArthCt089 A_thaliana 2630
NC_000932.1 2056 3570 Pltd minus maturase K matK 844797 protein-coding NP_051040.2 504 ArthCp003 A_thaliana 1514
NC_000932.1 5084 6188 Pltd minus ribosomal protein S16 rps16 844798 protein-coding NP_051041.1 79 ArthCp004 A_thaliana 1104
NC_000932.1 6616 6687 Pltd minus tRNA-Gln trnQ 1466271 tRNA NA ArthCt090 A_thaliana 71

Now let’s make a summary dataframe with the average length of each gene type per species.

size_avg <- genes_size %>%
  group_by(Species, Gene.Type)%>%
  summarise(Avg_Length = mean(Size, na.rm = TRUE))
`summarise()` has grouped output by 'Species'. You can override using the
`.groups` argument.
size_avg
Species Gene.Type Avg_Length
A_thaliana protein-coding 2403.22110
A_thaliana pseudogene 2399.95718
A_thaliana rRNA 1169.27273
A_thaliana tRNA 85.94196
D_melanogaster protein-coding 6961.84623
D_melanogaster pseudogene 985.41801
D_melanogaster rRNA 254.40541
D_melanogaster tRNA 74.26603
E_coli protein-coding 902.42192
E_coli pseudogene 1107.13235
E_coli rRNA 1451.81818
E_coli tRNA 77.02913
P_furiosus protein-coding 863.96383
P_furiosus pseudogene 539.93548
P_furiosus rRNA 2273.00000
P_furiosus tRNA 78.04762
S_cerevisiae protein-coding 1483.59844
S_cerevisiae pseudogene 959.00000
S_cerevisiae rRNA 1250.00000
S_cerevisiae tRNA 78.52508

If we plot the sizes using log10 transformation we will see the patterns more easily.

ggplot(data = genes_size, aes(x = Species, y = Size, color = Species))+
  geom_quasirandom()+
  geom_boxplot(alpha = 0.5, color = "darkgray")+
  labs(title = "Gene Length", y = "Log 10 of gene length (bp)")+
  facet_wrap(~Gene.Type, scales = "free_x", nrow =1)+
  theme(legend.position = "none",
        axis.text.x = element_text(angle = 45, hjust = 1))+
  scale_y_log10(label = scales::comma)

In this plot we can see that across the tree of life, both rRNA and tRNA have similar sizes, but protein-coding and pseudogenes have a wider range of sizes, being a plant and animal the ones with bigger genes.

How can you explain the strange pattern in the sizes of rRNA genes?