Exploring genes

In the previous sessions we explored the number of genes in a genome, and we saw that there are genes classified as protein-coding and some are not. In this session we will explore those other types of genes.

Download table of genes of Saccharomyces cerevisiae

Go to NCBI Datasets and search for Saccharomyces cerevisiae and click on “View Annotated Genes”. In the “Filters” section select all gene types. This will give us a table with one row per gene with information about their position in the chromosomes, their name, type, length, etc. Select all rows and download the table with the name “S_cerevisiae.tsv” in your workshop directory in data/genes.

You can also download that table from: S_cerevisiae.tsv

Setup R environment

library(tidyverse)
library(ggbeeswarm)

Gene types of S. cerevisiae

Let’s start by reading the file we downloaded into a dataframe.

scerevisiae <- read.delim("data/genes/S_cerevisiae.tsv", sep = "\t")

With the function str() we can show the structure of a dataframe, that is, the column names and types and description of the values.

str(scerevisiae)

'data.frame':   6470 obs. of  13 variables:
 $ Accession            : chr  "NC_001133.9" "NC_001133.9" "NC_001133.9" "NC_001133.9" ...
 $ Begin                : int  1807 2480 7235 11565 12046 13363 21566 22395 24000 31567 ...
 $ End                  : int  2169 2707 9016 11951 12426 13743 21850 22685 27968 32940 ...
 $ Chromosome           : chr  "I" "I" "I" "I" ...
 $ Orientation          : chr  "minus" "plus" "minus" "minus" ...
 $ Name                 : chr  "seripauperin PAU8" "uncharacterized protein" "putative permease SEO1" "uncharacterized protein" ...
 $ Symbol               : chr  "PAU8" "" "SEO1" "" ...
 $ Gene.ID              : int  851229 1466426 851230 851232 851233 851234 851235 1466427 851236 851237 ...
 $ Gene.Type            : chr  "protein-coding" "protein-coding" "protein-coding" "protein-coding" ...
 $ Transcripts.accession: chr  "NM_001180043.1" "NM_001184582.1" "NM_001178208.1" "NM_001179897.1" ...
 $ Protein.accession    : chr  "NP_009332.1" "NP_878038.1" "NP_009333.1" "NP_009335.1" ...
 $ Protein.length       : int  120 75 593 128 126 126 94 96 1322 457 ...
 $ Locus.tag            : chr  "YAL068C" "YAL067W-A" "YAL067C" "YAL065C" ...

We can see that by default all the columns with number where imported as integers and the columns with letters where imported as character. There is another data structure for characters that is called factor. Factors are used for categorical data. They consist of levels, which are the categories (set of possible values), and a vector with numbers that says which level is present in each position.

We can convert a character column to a factor with the function as.factor() (there are equivalent as.type() functions for different data types). To specify a column of a dataframe we use the symbol $ like in the following example, where we are reassigning to the column Gene.Type of the dataframe scerevisiae its original values transformed to factor.

scerevisiae$Gene.Type <- as.factor(scerevisiae$Gene.Type)

If we see the structure of the datraframe again we see that that column change its type and description.

str(scerevisiae)

'data.frame':   6470 obs. of  13 variables:
 $ Accession            : chr  "NC_001133.9" "NC_001133.9" "NC_001133.9" "NC_001133.9" ...
 $ Begin                : int  1807 2480 7235 11565 12046 13363 21566 22395 24000 31567 ...
 $ End                  : int  2169 2707 9016 11951 12426 13743 21850 22685 27968 32940 ...
 $ Chromosome           : chr  "I" "I" "I" "I" ...
 $ Orientation          : chr  "minus" "plus" "minus" "minus" ...
 $ Name                 : chr  "seripauperin PAU8" "uncharacterized protein" "putative permease SEO1" "uncharacterized protein" ...
 $ Symbol               : chr  "PAU8" "" "SEO1" "" ...
 $ Gene.ID              : int  851229 1466426 851230 851232 851233 851234 851235 1466427 851236 851237 ...
 $ Gene.Type            : Factor w/ 13 levels "antisense_RNA",..: 5 5 5 5 5 5 5 5 5 5 ...
 $ Transcripts.accession: chr  "NM_001180043.1" "NM_001184582.1" "NM_001178208.1" "NM_001179897.1" ...
 $ Protein.accession    : chr  "NP_009332.1" "NP_878038.1" "NP_009333.1" "NP_009335.1" ...
 $ Protein.length       : int  120 75 593 128 126 126 94 96 1322 457 ...
 $ Locus.tag            : chr  "YAL068C" "YAL067W-A" "YAL067C" "YAL065C" ...

We can use the function levels() to print the levels of a factor.

levels(scerevisiae$Gene.Type)

 [1] "antisense_RNA"  "misc_RNA"       "ncRNA"          "other"         
 [5] "protein-coding" "pseudogene"     "RNase_MRP_RNA"  "RNase_P_RNA"   
 [9] "rRNA"           "snoRNA"         "snRNA"          "telomerase_RNA"
[13] "tRNA"

We now all the types of genes described in this S. cerevisiae genome!

Size and number of each type of gene

Let’s see how many of each type do we have, and their sizes. We can first obtain the size in bp by comparing the End position with the Begin position. And then get the number and average size of each type using the group_by() and summarize() combination of functions.

scerevisiae <- scerevisiae %>%
  mutate(Size = End - Begin)

gene_type <- scerevisiae %>%
  group_by(Gene.Type)%>%
  summarize(Number = n(), Average_size = mean(Size) )
gene_type

Gene.Type	Number	Average_size
antisense_RNA	7	2438.85714
misc_RNA	10	1671.40000
ncRNA	22	760.50000
other	1	521.00000
protein-coding	6014	1483.59844
pseudogene	18	959.00000
RNase_MRP_RNA	1	339.00000
RNase_P_RNA	1	368.00000
rRNA	14	1250.00000
snoRNA	76	179.11842
snRNA	6	400.33333
telomerase_RNA	1	1300.00000
tRNA	299	78.52508

To have a better representation of this let’s do a plot with the quasirandom and boxplot geoms.

ggplot(scerevisiae, aes(x=Gene.Type, y = Size, color = Gene.Type))+
  geom_quasirandom()+
  geom_boxplot(alpha = 0.5, color = "lightgray")+
  theme(axis.text.x = element_text(angle = 45, hjust = 1),
        legend.position = "none")

Again we have a case where it will be useful to transform the sizes to their logarithms to have a better view. Let’s use the automated way of ggplot to do the transformation scale_y_log10().

ggplot(scerevisiae, aes(x=Gene.Type, y = Size, color = Gene.Type))+
  geom_quasirandom()+
  geom_boxplot(alpha = 0.5, color = "lightgray")+
  theme(axis.text.x = element_text(angle = 45, hjust = 1),
        legend.position = "none")+
  scale_y_log10(label = scales::comma)

Gene types by chromosome

Since we have the information of the chromosome where each genes is present. Let’s see how the different gene types are distributed in the different chromosomes. For this we can do a similar summary than before, but grouping by gene type and chromosome at the same time.

gene_type_by_chrom <- scerevisiae %>%
  group_by(Gene.Type, Chromosome)%>%
  summarize(Number = n(), Avg_size = mean(Size))

`summarise()` has grouped output by 'Gene.Type'. You can override using the
`.groups` argument.

head(gene_type_by_chrom)

Gene.Type	Chromosome	Number	Avg_size
antisense_RNA	II	2	2931.5
antisense_RNA	VII	2	2063.0
antisense_RNA	VIII	1	983.0
antisense_RNA	XIII	1	3014.0
antisense_RNA	XVI	1	3086.0
misc_RNA	XII	10	1671.4

To explore if different chromosomes have different proportions of gene types we can plot the number of genes in each type per chromosome in a bar chart.

ggplot(gene_type_by_chrom, aes(x = Chromosome, y = Number, fill = Gene.Type))+
    geom_col()

In this plot we see that the chromosomes are not well ordered in the X axis, that is because they are being ordered alphabetically because they are in the character type.

Before we changed a character column to factor type with the as.factor() function. Now we want to convert it to factor and specify the order of levels. To do that in one step we can use the factor() function which will take the values, and the argument levels which will be a vector of the levels in the order that we want them.

gene_type_by_chrom$Chromosome <- factor(gene_type_by_chrom$Chromosome, levels = c("I", "II", "III", "IV", "V", "VI", "VII", "VIII", "IX", "X", "XI", "XII", "XIII", "XIV", "XV", "XVI", "MT"))
str(gene_type_by_chrom$Gene.Type)

 Factor w/ 13 levels "antisense_RNA",..: 1 1 1 1 1 2 3 3 3 3 ...

Now we can repeat the plot and the chromosomes will be in the specified order!

ggplot(gene_type_by_chrom, aes(x = Chromosome, y = Number, fill = Gene.Type))+
    geom_col()

Since we have a lot more protein-coding genes that any other type, let’s remove them to have a better view.

non_protein_by_chrom <- gene_type_by_chrom %>%
  filter(!Gene.Type == "protein-coding")

ggplot(non_protein_by_chrom, aes(x = Chromosome, y = Number, fill = Gene.Type))+
    geom_col()

It would be better if we had a color palette that was easier to read. In ggplot there are functions (layers) to tweak each variable that is being represented in the plot. Such as the scale_y_log10() that we used for the variable y, there are several scale functions for the different types of variables. We can use scale_fill_brewer() to “brew” a color palette for the variable fill .

ggplot(non_protein_by_chrom, aes(x = Chromosome, y = Number, fill = Gene.Type))+
    geom_col()+
  scale_fill_brewer(palette = "Paired" )

Now it is easier to see that rRNA genes are only present in the mitochondrial chromosome and chromosome XII and that almost every chromosome has snoRNA genes, for example.