Genomic annotation

We have been exploring what is in a genome by looking at the genes, but there is more than only genes in genomes. Here we will see some other elements that we can find in genomes and also the parts of a gene.

Setup R environment

library(tidyverse)

Import annotation

Download the table with the annotated features of the Saccharomyces cerevisiae genome from this link. This table comes from the GFF file of the annotated reference genome from the Saccharomyces Genome Database Current Release. The GFF file was converted to a TSV file using AGAT.

Import it into a dataframe called annotation and add the argument stringsAsFactors = TRUE, this will make all the columns with strings (characters) to be imported as factors.

annotation <- read.delim("data/scerevisiae.tsv", sep = "\t", stringsAsFactors = TRUE)

str(annotation)

'data.frame':   63097 obs. of  10 variables:
 $ seq_id     : Factor w/ 17 levels "chrI","chrII",..: 9 9 9 9 9 9 9 9 9 9 ...
 $ primary_tag: Factor w/ 36 levels "blocked_reading_frame",..: 31 30 30 7 11 7 15 4 2 6 ...
 $ start      : int  1 1 1 838 838 2778 2781 2781 2790 2781 ...
 $ end        : int  34 781 781 1079 1079 4096 4096 4096 3932 2789 ...
 $ strand     : int  -1 -1 -1 1 1 1 1 1 1 1 ...
 $ ID         : Factor w/ 58347 levels "cds-1","cds-10",..: 39268 39227 39228 29020 45747 45731 45732 12918 1337 22003 ...
 $ locus      : Factor w/ 6878 levels "CEN1","CEN1_centromere",..: 2158 2158 2158 2158 2158 2431 2431 2431 2431 2431 ...
 $ Name       : Factor w/ 26413 levels "CEN1","CEN1_centromere",..: 188 186 187 9179 9179 9162 9164 9163 9163 9163 ...
 $ Note       : Factor w/ 6386 levels "[PIN(+)] prion, named for [PSI+] INducibility; an infectious protein conformation that is generally an ordered "| __truncated__,..: 3059 5938 3059 6167 6167 1737 3059 3059 3059 3059 ...
 $ Parent     : Factor w/ 18542 levels "CEN1","CEN10",..: 195 20 195 20 28 20 6515 6516 6516 6516 ...

Let’s see the levels of the column primary_tag to see what features are described in this annotation.

levels(annotation$primary_tag)

 [1] "blocked_reading_frame"              "CDS"                               
 [3] "centromere"                         "exon"                              
 [5] "external_transcribed_spacer_region" "five_prime_UTR"                    
 [7] "gene"                               "intein_encoding_region"            
 [9] "internal_transcribed_spacer_region" "intron"                            
[11] "long_terminal_repeat"               "LTR_retrotransposon"               
[13] "mating_type_region"                 "matrix_attachment_site"            
[15] "mRNA"                               "ncRNA"                             
[17] "ncRNA_gene"                         "non_transcribed_region"            
[19] "origin_of_replication"              "plus_1_translational_frameshift"   
[21] "pseudogene"                         "recombination_enhancer"            
[23] "rRNA"                               "rRNA_gene"                         
[25] "silent_mating_type_cassette_array"  "snoRNA"                            
[27] "snoRNA_gene"                        "snRNA"                             
[29] "snRNA_gene"                         "telomere"                          
[31] "telomeric_repeat"                   "three_prime_UTR"                   
[33] "transposable_element_gene"          "tRNA"                              
[35] "tRNA_gene"                          "uORF"

Since the column for the strand is coded as -1 and 1 it was imported as numeric, but it would be better represented as a factor.

annotation$strand <- as.factor(annotation$strand)

All genes in the genome

To have a graphic representation of the genes in the genome we can plot them using segments between the start and end positions of each gene along each chromosome. To do this let’s make another dataframe keeping only the genes and plot using geom_segment() and faceting by chromosome.

genes <- annotation %>%
  filter(primary_tag == "gene")

ggplot(genes)+
  geom_segment(aes(x = start, xend = end, y = strand, yend = strand),
               linewidth = 8, color = "darkgreen")+
  facet_wrap(~seq_id, ncol = 1, strip.position = "right")+
  theme(panel.grid = element_blank())

The genes of one chromosome

We can have a better view of this if we see only one chromosome.

genes_chrI <- genes %>%
  filter(seq_id == "chrI")

ggplot(genes_chrI)+
  geom_segment(aes(x = start, xend = end, y = strand, yend = strand),
               linewidth = 8, color = "darkgreen")+
  theme(panel.grid = element_blank())

Main features

To see what else is in the genome let’s select some of the main features (without nested features and no genes) and do a similar plot.

main_features <- annotation %>%
  filter(primary_tag %in% c( "ncRNA_gene", "snoRNA_gene", "snRNA_gene",
                             "tRNA_gene", "rRNA_gene", "origin_of_replication",
                             "transposable_element_gene", "mating_type_region",
                             "telomere","centromere"))

ggplot(main_features)+
  geom_segment(aes(x = start, xend = end,
                   y = strand, yend = strand,
                   color = primary_tag),
               linewidth = 8)+
  facet_wrap(~seq_id, ncol = 1, strip.position = "right")+
  theme(panel.grid = element_blank())

Main features of a region

Some features are too small to see them at this scale, so let’s zoom in to a region of a chromosome. For this we can filter by chromosome (seq_id) and position (start and end).

main_features_chrI <- main_features %>%
  filter(seq_id == "chrI",
         start > 100000, end < 200000)

ggplot(main_features_chrI)+
  geom_segment(aes(x = start, xend = end,
                   y = strand, yend = strand,
                   color = primary_tag),
               linewidth = 8)+
  theme(panel.grid = element_blank())

Exercise: Nested features of the genes in a region

Filter the annotation dataframe to remain only with the primary tags gene, exon, CDS, five_prime_UTR and three_prime_UTR in chromosome chrVII between the positions 61900 and 69500.

Plot this features as geom_segment() using the position as the x axis, the strand for the y axis, the locus as the coloring variable, and create panels by feature (primary tag).