Genes across the tree of life

In the previous episode we explored the different types of genes within a species, now it’s time to see how they compere between species across the tree of life. We already have the data for S_cerevisiae, so now we need to download the same kind of data for A_thaliana, D_melanogaster, E_coli and P_furiosus. Save all the tables in data/genes/.

This table has all the tables above combined: genes.tsv

Setup R environment

Import the libraries that we will need.

library(tidyverse)
library(ggbeeswarm)

Import tables of genes

genes <- read.delim("data/genes.tsv", sep = "\t")

Types of genes per species

From the previous episode we learned that S cerevisiae has 10 types of RNA genes, protein-coding genes, pseudogenes and other. Let’s see which types are in all the species.

gene_type <- genes %>%
  group_by(Species,Gene.Type)%>%
  summarize(Number_genes = n())%>%
  ungroup()%>%
  group_by(Species)%>%
  mutate(All_genes = sum(Number_genes), Percent_genes = Number_genes * 100 / All_genes)

head(gene_type)

Species	Gene.Type	Number_genes	All_genes	Percent_genes
A_thaliana	antisense_RNA	79	38315	0.2061856
A_thaliana	lncRNA	3480	38315	9.0826047
A_thaliana	miRNA	326	38315	0.8508417
A_thaliana	misc_RNA	508	38315	1.3258515
A_thaliana	ncRNA	439	38315	1.1457654
A_thaliana	ncRNA_pseudogene	2	38315	0.0052199

We can now plot this to do the comparison.

ggplot(gene_type, aes(x = Species, y = Percent_genes, fill = Species))+
  geom_col()+
  facet_wrap(~Gene.Type, scales = "free_y")+
  theme(axis.text.x = element_text(angle=45, hjust = 1),
        legend.position = "none")

We have many types of genes that are absent from some species. Let’s keep only the ones who are not present in all species.

gene_types_in_all <- gene_type %>%
  group_by(Gene.Type)%>%
  summarize(Number_species = n())%>%
  filter(Number_species == 5)
gene_types_in_all

Gene.Type	Number_species
protein-coding	5
pseudogene	5
rRNA	5
tRNA	5

Let’s filter our gene_type object to remain only with the gene types in all species and plot it again.

gene_type <- gene_type %>%
  filter(Gene.Type %in% gene_types_in_all$Gene.Type)

ggplot(gene_type, aes(x = Species, y = Percent_genes, fill = Species))+
  geom_col()+
  facet_wrap(~Gene.Type, scales = "free_y")+
  theme(axis.text.x = element_text(angle=45, hjust = 1),
        legend.position = "none")

Size of genes

Previously we examined the sizes of the different types of genes in S cerevisiae, with our new data set we can now compare across the 5 species. Let’s go back to use the genes dataframe, filter it to use only the common gene types, and calculate the size from the begin and end positions.

genes_size <- genes %>%
  filter(Gene.Type %in% gene_types_in_all$Gene.Type)%>%
  mutate(Size = End - Begin)

head(genes_size)

Accession	Begin	End	Chromosome	Orientation	Name	Symbol	Gene.ID	Gene.Type	Protein.accession	Protein.length	Locus.tag	Species	Size
NC_000932.1	4	76	Pltd	minus	tRNA-His	trnH	1466273	tRNA		NA	ArthCt088	A_thaliana	72
NC_000932.1	383	1444	Pltd	minus	photosystem II protein D1	psbA	844802	protein-coding	NP_051039.1	353	ArthCp002	A_thaliana	1061
NC_000932.1	1717	4347	Pltd	minus	tRNA-Lys	trnK	1466272	tRNA		NA	ArthCt089	A_thaliana	2630
NC_000932.1	2056	3570	Pltd	minus	maturase K	matK	844797	protein-coding	NP_051040.2	504	ArthCp003	A_thaliana	1514
NC_000932.1	5084	6188	Pltd	minus	ribosomal protein S16	rps16	844798	protein-coding	NP_051041.1	79	ArthCp004	A_thaliana	1104
NC_000932.1	6616	6687	Pltd	minus	tRNA-Gln	trnQ	1466271	tRNA		NA	ArthCt090	A_thaliana	71

Now let’s make a summary dataframe with the average length of each gene type per species.

size_avg <- genes_size %>%
  group_by(Species, Gene.Type)%>%
  summarise(Avg_Length = mean(Size, na.rm = TRUE))

`summarise()` has grouped output by 'Species'. You can override using the
`.groups` argument.

size_avg

Species	Gene.Type	Avg_Length
A_thaliana	protein-coding	2403.22110
A_thaliana	pseudogene	2399.95718
A_thaliana	rRNA	1169.27273
A_thaliana	tRNA	85.94196
D_melanogaster	protein-coding	6961.84623
D_melanogaster	pseudogene	985.41801
D_melanogaster	rRNA	254.40541
D_melanogaster	tRNA	74.26603
E_coli	protein-coding	902.42192
E_coli	pseudogene	1107.13235
E_coli	rRNA	1451.81818
E_coli	tRNA	77.02913
P_furiosus	protein-coding	863.96383
P_furiosus	pseudogene	539.93548
P_furiosus	rRNA	2273.00000
P_furiosus	tRNA	78.04762
S_cerevisiae	protein-coding	1483.59844
S_cerevisiae	pseudogene	959.00000
S_cerevisiae	rRNA	1250.00000
S_cerevisiae	tRNA	78.52508

If we plot the sizes using log10 transformation we will see the patterns more easily.

ggplot(data = genes_size, aes(x = Species, y = Size, color = Species))+
  geom_quasirandom()+
  geom_boxplot(alpha = 0.5, color = "darkgray")+
  labs(title = "Gene Length", y = "Log 10 of gene length (bp)")+
  facet_wrap(~Gene.Type, scales = "free_x", nrow =1)+
  theme(legend.position = "none",
        axis.text.x = element_text(angle = 45, hjust = 1))+
  scale_y_log10(label = scales::comma)

In this plot we can see that across the tree of life, both rRNA and tRNA have similar sizes, but protein-coding and pseudogenes have a wider range of sizes, being a plant and animal the ones with bigger genes.

How can you explain the strange pattern in the sizes of rRNA genes?