Analysing a small dataset

Obtaining information from a public database

Go to NCBI Datasets and search for the species in the following list. Create a spreadsheet with the name of the organism, the number of total genes and protein-coding genes and the kingdom they belong to (Bacteria, Archaea, Fungi, Animal, Plant).

  • Homo sapiens
  • Mus musculus
  • Danio rerio
  • Pleurodeles waltl
  • Arabidopsis thaliana
  • Gossypium hirsutum
  • Oryza sativa
  • Zea mays
  • Drosophila melanogaster
  • Caenorhabditis elegans
  • Saccharomyces cerevisiae
  • Cryptocccus neoformans
  • Methanococcus jannaschii
  • Halobacterium salinarum
  • Escherichia coli
  • Helicobacter pylori
  • Haemophilus influenzae
  • Yersinia pestis
  • SARS CoV-2 virus
  • Influenza A virus

Create a directory for this workshop (e.g. Documents/SummmerScholars2024/) and a sub-directory called data/. Move your spreadsheet to data/ .

If you have issues with creating your spreadsheet, you can download one from here.

R environment setup

When we run code we have a “working directory”, the directory in our file system associated to the processes we are running. By default the working directory will be your home directory or the one where the script or Quarto document is located in. It is It is a good practice to explicitly specify your working at the beginning of your code. For this, in a regular script you would use the function setwd() (set working directory) but in a Quarto document if you use that in a code chunk it will only have an effect for that code chunk. To set the working directory for all the chunks, in the setup box at the top we can add execute-dir : file to explicitly set the working directory to be the one where the Quarto document file is located in. So let’s also save this Quarto document in the directory we crated for this workshop Documents/SummmerScholars2024/.

In this box let’s also add df-print: kable to use the “kable” format to print the dataframes in the Quarto reports as a nice table, instead of regular R output.

title: "Class 1"
format: html
df-print: kable
editor: visual
project:
  execute-dir: file

Another important setup step is loading the needed libraries. We will use the tidyverse library for plotting and data manipulation.

library(tidyverse)

Read the spreadsheet into a dataframe

To use in R the data that we obtained, we have to read it into a dataframe (a data structure with rows representing observations and columns representing variables). The read.csv() function will automatically make a dataframe from a comma-separated values file, we only need to tell it the path to the file.

genomes <- read.csv("data/size_genes.csv")
genomes
Scientific_name Common_name Kingdom Size Genes Protein_coding
Homo sapiens human Animal 3.0990e+03 59652 20080
Mus musculus house mouse Animal 2.7280e+03 50763 22198
Danio rerio zebrafish Animal 1.3730e+03 40031 26448
Pleurodeles waltl iberian ribbed newt Animal 2.0301e+04 13359 13359
Arabidopsis thaliana thale cress Plant 1.1910e+02 38312 27562
Gossypium hirsutum cotton Plant 2.3050e+03 99456 67585
Oryza sativa japanese rice Plant 3.7380e+02 35223 28738
Zea mays maize Plant 2.1820e+03 49897 34337
Drosophila melanogaster fruit fly Animal 1.4370e+02 17872 13962
Caenorhabditis elegans Animal 1.0030e+02 46927 19984
Saccharomyces cerevisiae brewer’s yeast Fungi 1.2070e+01 6470 6014
Cryptocccus neoformans Fungi 1.9050e+01 7004 6632
Methanocaldococcus jannaschii Archea 1.7400e+00 1903 1816
Halobacterium salinarum Archea 2.4300e+00 2584 2479
Escherichia coli Bacteria 4.6420e+00 4639 4288
Helicobacter pylori Bacteria 1.6240e+00 1558 1447
Haemophilus influenzae Bacteria 1.8460e+00 1834 1700
Yersinia pestis Bacteria 4.6580e+00 4240 3951
SARS CoV-2 virus Virus 2.9900e-02 11 11
Influenza A virus Virus 1.3630e-02 12 12
summary(genomes)
 Scientific_name    Common_name          Kingdom               Size          
 Length:20          Length:20          Length:20          Min.   :    0.014  
 Class :character   Class :character   Class :character   1st Qu.:    2.284  
 Mode  :character   Mode  :character   Mode  :character   Median :   59.675  
                                                          Mean   : 1638.650  
                                                          3rd Qu.: 1575.250  
                                                          Max.   :20301.000  
     Genes       Protein_coding 
 Min.   :   11   Min.   :   11  
 1st Qu.: 2414   1st Qu.: 2313  
 Median :10182   Median : 9996  
 Mean   :24087   Mean   :15130  
 3rd Qu.:41755   3rd Qu.:23260  
 Max.   :99456   Max.   :67585  

Plot the genome size vs. the number of protein-coding genes

To start understanding genomes we need to ask how the genome size is related to the number of genes it contains. Do small genomes have less genes than large genomes?

Let’s plot this relationship to visually check if larger genome means more genes, across the tree of life. We can start with a basic plot, only specifying which information goes in which axis and how are we going to graphically represent this.

ggplot(genomes, aes(x = Size, y = Protein_coding))+
  geom_point()

With this we can see that there is not a clear tendency in all our data set. Let’s add more information to understand it better.

ggplot(genomes, aes(x = Size, y = Protein_coding, shape = Kingdom))+
  geom_point()

Now we can see that Plants tend to have more protein-coding genes and that Animals don’t follow a linear tendency of increase of genome size and number of protein-coding genes.

We can add more specification about the graphical elements to our plot. For example we can make the points bigger and a certain color.

ggplot(genomes, aes(x = Size, y = Protein_coding, shape = Kingdom))+
  geom_point(color = "blue", size = 3)

If we want to use color to convey information as well, we can use it inside of the aes() function to specify the name of each species.

ggplot(genomes, aes(x = Size, y = Protein_coding, color = Scientific_name, shape = Kingdom))+
  geom_point(size = 3)

With ggplot we can add more layers to tweak the properties of out plots. The theme() function controls the elements not related to the data. Let’s use it to improve the size and position of the legends. And let’s use the labs() function to add a title and change the labels of the x and y axis.

ggplot(genomes, aes(x = Size, y = Protein_coding, color = Scientific_name, shape = Kingdom))+
  geom_point(size = 3)+
  theme(legend.box = "horizontal",
        legend.key.size = unit(0.5, "cm"))+
  labs(title = "Relationship between genome size and number of protein-coding genes",
       x = "Genome size (Mb)",
       y = "Number of protein-coding genes")

When we have data points that span very large numeric ranges it is useful to transform their values to have a better visualization of the tendencies in the data.

Tidyverse offers efficient ways to modify data. It uses pipes %>% to pass the result of one step as the input to the next step, and it has functions that are very easily combined with pipes to transform the data.

The function mutate() is used to create or modify variables in a dataframe. It takes a name for the column and an operation that will give the values to that column.

Let’s use these tools to create new columns with the logarithm of the columns Size and Protein_coding.

genomes <- genomes %>%
  mutate(Size_log = log10(Size),
         Protein_coding_log = log10(Protein_coding))

genomes
Scientific_name Common_name Kingdom Size Genes Protein_coding Size_log Protein_coding_log
Homo sapiens human Animal 3.0990e+03 59652 20080 3.4912216 4.302764
Mus musculus house mouse Animal 2.7280e+03 50763 22198 3.4358444 4.346314
Danio rerio zebrafish Animal 1.3730e+03 40031 26448 3.1376705 4.422393
Pleurodeles waltl iberian ribbed newt Animal 2.0301e+04 13359 13359 4.3075174 4.125774
Arabidopsis thaliana thale cress Plant 1.1910e+02 38312 27562 2.0759118 4.440311
Gossypium hirsutum cotton Plant 2.3050e+03 99456 67585 3.3626709 4.829850
Oryza sativa japanese rice Plant 3.7380e+02 35223 28738 2.5726393 4.458456
Zea mays maize Plant 2.1820e+03 49897 34337 3.3388547 4.535762
Drosophila melanogaster fruit fly Animal 1.4370e+02 17872 13962 2.1574568 4.144948
Caenorhabditis elegans Animal 1.0030e+02 46927 19984 2.0013009 4.300682
Saccharomyces cerevisiae brewer’s yeast Fungi 1.2070e+01 6470 6014 1.0817073 3.779163
Cryptocccus neoformans Fungi 1.9050e+01 7004 6632 1.2798950 3.821645
Methanocaldococcus jannaschii Archea 1.7400e+00 1903 1816 0.2405492 3.259116
Halobacterium salinarum Archea 2.4300e+00 2584 2479 0.3856063 3.394277
Escherichia coli Bacteria 4.6420e+00 4639 4288 0.6667051 3.632255
Helicobacter pylori Bacteria 1.6240e+00 1558 1447 0.2105860 3.160468
Haemophilus influenzae Bacteria 1.8460e+00 1834 1700 0.2662317 3.230449
Yersinia pestis Bacteria 4.6580e+00 4240 3951 0.6681995 3.596707
SARS CoV-2 virus Virus 2.9900e-02 11 11 -1.5243288 1.041393
Influenza A virus Virus 1.3630e-02 12 12 -1.8655041 1.079181
ggplot(genomes, aes(x = Size_log, y = Protein_coding_log, color = Scientific_name, shape = Kingdom))+
  geom_point(size = 3)+
  theme(legend.box = "horizontal",
        legend.key.size = unit(0.5, "cm"))+
  labs(title = "Relationship between genome size and number of protein-coding genes",
       x = "Log10 of genome size (Mb)",
       y = "Log 10 of number of protein-coding genes")

ggplot has functions that you can use as layers to automatically transform the data points and show a scale with meaningful units in the axis.

ggplot(genomes, aes(x = Size, y = Protein_coding, color = Scientific_name, shape = Kingdom))+
  geom_point(size = 3)+
  theme(legend.box = "horizontal",
        legend.key.size = unit(0.5, "cm"))+
  labs(title = "Relationship between genome size and number of protein-coding genes",
       x = "Genome size (Mb)",
       y = "Number of protein-coding genes")+
  scale_y_log10(label = scales::comma)+
  scale_x_log10(label = scales::comma)

Fraction of protein-coding genes

Another important question about genomes is how many of the total genes are protein-coding genes.

genomes <- genomes %>%
  mutate(Protein_coding_fraction = Protein_coding / Genes)

Let’s plot this using bars as the graphical representation of the protein-coding fraction.

ggplot(genomes, aes(x = Scientific_name, y = Protein_coding_fraction, fill = Kingdom))+
  geom_col()

Let’s improve the labels.

ggplot(genomes, aes(x = Scientific_name, y = Protein_coding_fraction, fill = Kingdom))+
  geom_col()+
  theme(axis.text.x = element_text(angle = 45, hjust =1))+
  labs(title = "Protein-coding fraction of genes",
       y = "Fraction",
       x = "Species")

In this plot we have all the information but it is difficult to see a pattern because the species are arranged in alphabetical order. It would be more useful to group the columns by Kingdom to see if there is a difference between kingdoms.

For this, ggplot has the function facet_wrap() to separate the plot into panels by group.

ggplot(genomes, aes(x = Scientific_name, y = Protein_coding_fraction, fill = Kingdom))+
  geom_col()+
  theme(axis.text.x = element_text(angle = 45, hjust =1))+
  labs(title = "Protein-coding fraction of genes",
       y = "Fraction",
       x = "Species")+
  facet_wrap(~Kingdom)

With the default parameters of facet_wrap() we get too much empty space because each panel is considering all species. Let’s each panel have only the corresponding species and arrange all panels in one row to be able to compare the y axis.

ggplot(genomes, aes(x = Scientific_name, y = Protein_coding_fraction, fill = Kingdom))+
  geom_col()+
  theme(axis.text.x = element_text(angle = 45, hjust =1),
        legend.position = "none")+
  labs(title = "Protein-coding fraction of genes",
       y = "Fraction",
       x = "Species")+
  facet_wrap(~Kingdom, nrow = 1, scales = "free_x")

Now we can clearly see that Virus only have protein-coding genes, that Archea, Bacteria and Fungi have a high fraction of protein-coding genes that is not very variable between species, that Plants and Animals have lower fractions, with Animals having more variable values than Plants. One of the animals, Pleurodeles waltl has only protein-coding genes, this is odd, maybe its annotation is incomplete and the rest of the genes were not reported.

This observations are a good place to start our quest to know what’s in a genome, but we only used a few species, in the next session we will use the power of R to analyze a bigger data set and get more meaningful conclusions.