library(tidyverse)
library(ggbeeswarm)
Exploring genes
In the previous sessions we explored the number of genes in a genome, and we saw that there are genes classified as protein-coding and some are not. In this session we will explore those other types of genes.
Download table of genes of Saccharomyces cerevisiae
Go to NCBI Datasets and search for Saccharomyces cerevisiae and click on “View Annotated Genes”. In the “Filters” section select all gene types. This will give us a table with one row per gene with information about their position in the chromosomes, their name, type, length, etc. Select all rows and download the table with the name “S_cerevisiae.tsv” in your workshop directory in data/genes
.
You can also download that table from: S_cerevisiae.tsv
Setup R environment
Gene types of S. cerevisiae
Let’s start by reading the file we downloaded into a dataframe.
<- read.delim("data/genes/S_cerevisiae.tsv", sep = "\t") scerevisiae
With the function str()
we can show the structure of a dataframe, that is, the column names and types and description of the values.
str(scerevisiae)
'data.frame': 6470 obs. of 13 variables:
$ Accession : chr "NC_001133.9" "NC_001133.9" "NC_001133.9" "NC_001133.9" ...
$ Begin : int 1807 2480 7235 11565 12046 13363 21566 22395 24000 31567 ...
$ End : int 2169 2707 9016 11951 12426 13743 21850 22685 27968 32940 ...
$ Chromosome : chr "I" "I" "I" "I" ...
$ Orientation : chr "minus" "plus" "minus" "minus" ...
$ Name : chr "seripauperin PAU8" "uncharacterized protein" "putative permease SEO1" "uncharacterized protein" ...
$ Symbol : chr "PAU8" "" "SEO1" "" ...
$ Gene.ID : int 851229 1466426 851230 851232 851233 851234 851235 1466427 851236 851237 ...
$ Gene.Type : chr "protein-coding" "protein-coding" "protein-coding" "protein-coding" ...
$ Transcripts.accession: chr "NM_001180043.1" "NM_001184582.1" "NM_001178208.1" "NM_001179897.1" ...
$ Protein.accession : chr "NP_009332.1" "NP_878038.1" "NP_009333.1" "NP_009335.1" ...
$ Protein.length : int 120 75 593 128 126 126 94 96 1322 457 ...
$ Locus.tag : chr "YAL068C" "YAL067W-A" "YAL067C" "YAL065C" ...
We can see that by default all the columns with number where imported as integers and the columns with letters where imported as character. There is another data structure for characters that is called factor. Factors are used for categorical data. They consist of levels, which are the categories (set of possible values), and a vector with numbers that says which level is present in each position.
We can convert a character column to a factor with the function as.factor()
(there are equivalent as.type()
functions for different data types). To specify a column of a dataframe we use the symbol $
like in the following example, where we are reassigning to the column Gene.Type
of the dataframe scerevisiae
its original values transformed to factor.
$Gene.Type <- as.factor(scerevisiae$Gene.Type) scerevisiae
If we see the structure of the datraframe again we see that that column change its type and description.
str(scerevisiae)
'data.frame': 6470 obs. of 13 variables:
$ Accession : chr "NC_001133.9" "NC_001133.9" "NC_001133.9" "NC_001133.9" ...
$ Begin : int 1807 2480 7235 11565 12046 13363 21566 22395 24000 31567 ...
$ End : int 2169 2707 9016 11951 12426 13743 21850 22685 27968 32940 ...
$ Chromosome : chr "I" "I" "I" "I" ...
$ Orientation : chr "minus" "plus" "minus" "minus" ...
$ Name : chr "seripauperin PAU8" "uncharacterized protein" "putative permease SEO1" "uncharacterized protein" ...
$ Symbol : chr "PAU8" "" "SEO1" "" ...
$ Gene.ID : int 851229 1466426 851230 851232 851233 851234 851235 1466427 851236 851237 ...
$ Gene.Type : Factor w/ 13 levels "antisense_RNA",..: 5 5 5 5 5 5 5 5 5 5 ...
$ Transcripts.accession: chr "NM_001180043.1" "NM_001184582.1" "NM_001178208.1" "NM_001179897.1" ...
$ Protein.accession : chr "NP_009332.1" "NP_878038.1" "NP_009333.1" "NP_009335.1" ...
$ Protein.length : int 120 75 593 128 126 126 94 96 1322 457 ...
$ Locus.tag : chr "YAL068C" "YAL067W-A" "YAL067C" "YAL065C" ...
We can use the function levels()
to print the levels of a factor.
levels(scerevisiae$Gene.Type)
[1] "antisense_RNA" "misc_RNA" "ncRNA" "other"
[5] "protein-coding" "pseudogene" "RNase_MRP_RNA" "RNase_P_RNA"
[9] "rRNA" "snoRNA" "snRNA" "telomerase_RNA"
[13] "tRNA"
We now all the types of genes described in this S. cerevisiae genome!
Size and number of each type of gene
Let’s see how many of each type do we have, and their sizes. We can first obtain the size in bp by comparing the End position with the Begin position. And then get the number and average size of each type using the group_by()
and summarize()
combination of functions.
<- scerevisiae %>%
scerevisiae mutate(Size = End - Begin)
<- scerevisiae %>%
gene_type group_by(Gene.Type)%>%
summarize(Number = n(), Average_size = mean(Size) )
gene_type
Gene.Type | Number | Average_size |
---|---|---|
antisense_RNA | 7 | 2438.85714 |
misc_RNA | 10 | 1671.40000 |
ncRNA | 22 | 760.50000 |
other | 1 | 521.00000 |
protein-coding | 6014 | 1483.59844 |
pseudogene | 18 | 959.00000 |
RNase_MRP_RNA | 1 | 339.00000 |
RNase_P_RNA | 1 | 368.00000 |
rRNA | 14 | 1250.00000 |
snoRNA | 76 | 179.11842 |
snRNA | 6 | 400.33333 |
telomerase_RNA | 1 | 1300.00000 |
tRNA | 299 | 78.52508 |
To have a better representation of this let’s do a plot with the quasirandom and boxplot geoms.
ggplot(scerevisiae, aes(x=Gene.Type, y = Size, color = Gene.Type))+
geom_quasirandom()+
geom_boxplot(alpha = 0.5, color = "lightgray")+
theme(axis.text.x = element_text(angle = 45, hjust = 1),
legend.position = "none")
Again we have a case where it will be useful to transform the sizes to their logarithms to have a better view. Let’s use the automated way of ggplot to do the transformation scale_y_log10()
.
ggplot(scerevisiae, aes(x=Gene.Type, y = Size, color = Gene.Type))+
geom_quasirandom()+
geom_boxplot(alpha = 0.5, color = "lightgray")+
theme(axis.text.x = element_text(angle = 45, hjust = 1),
legend.position = "none")+
scale_y_log10(label = scales::comma)
Gene types by chromosome
Since we have the information of the chromosome where each genes is present. Let’s see how the different gene types are distributed in the different chromosomes. For this we can do a similar summary than before, but grouping by gene type and chromosome at the same time.
<- scerevisiae %>%
gene_type_by_chrom group_by(Gene.Type, Chromosome)%>%
summarize(Number = n(), Avg_size = mean(Size))
`summarise()` has grouped output by 'Gene.Type'. You can override using the
`.groups` argument.
head(gene_type_by_chrom)
Gene.Type | Chromosome | Number | Avg_size |
---|---|---|---|
antisense_RNA | II | 2 | 2931.5 |
antisense_RNA | VII | 2 | 2063.0 |
antisense_RNA | VIII | 1 | 983.0 |
antisense_RNA | XIII | 1 | 3014.0 |
antisense_RNA | XVI | 1 | 3086.0 |
misc_RNA | XII | 10 | 1671.4 |
To explore if different chromosomes have different proportions of gene types we can plot the number of genes in each type per chromosome in a bar chart.
ggplot(gene_type_by_chrom, aes(x = Chromosome, y = Number, fill = Gene.Type))+
geom_col()
In this plot we see that the chromosomes are not well ordered in the X axis, that is because they are being ordered alphabetically because they are in the character type.
Before we changed a character column to factor type with the as.factor()
function. Now we want to convert it to factor and specify the order of levels. To do that in one step we can use the factor()
function which will take the values, and the argument levels
which will be a vector of the levels in the order that we want them.
$Chromosome <- factor(gene_type_by_chrom$Chromosome, levels = c("I", "II", "III", "IV", "V", "VI", "VII", "VIII", "IX", "X", "XI", "XII", "XIII", "XIV", "XV", "XVI", "MT"))
gene_type_by_chromstr(gene_type_by_chrom$Gene.Type)
Factor w/ 13 levels "antisense_RNA",..: 1 1 1 1 1 2 3 3 3 3 ...
Now we can repeat the plot and the chromosomes will be in the specified order!
ggplot(gene_type_by_chrom, aes(x = Chromosome, y = Number, fill = Gene.Type))+
geom_col()
Since we have a lot more protein-coding genes that any other type, let’s remove them to have a better view.
<- gene_type_by_chrom %>%
non_protein_by_chrom filter(!Gene.Type == "protein-coding")
ggplot(non_protein_by_chrom, aes(x = Chromosome, y = Number, fill = Gene.Type))+
geom_col()
It would be better if we had a color palette that was easier to read. In ggplot there are functions (layers) to tweak each variable that is being represented in the plot. Such as the scale_y_log10()
that we used for the variable y, there are several scale
functions for the different types of variables. We can use scale_fill_brewer()
to “brew” a color palette for the variable fill
.
ggplot(non_protein_by_chrom, aes(x = Chromosome, y = Number, fill = Gene.Type))+
geom_col()+
scale_fill_brewer(palette = "Paired" )
Now it is easier to see that rRNA genes are only present in the mitochondrial chromosome and chromosome XII and that almost every chromosome has snoRNA genes, for example.