Analysing a small dataset

Obtaining information from a public database

Go to NCBI Datasets and search for the species in the following list. Create a spreadsheet with the name of the organism, the number of total genes and protein-coding genes and the kingdom they belong to (Bacteria, Archaea, Fungi, Animal, Plant).

Homo sapiens
Mus musculus
Danio rerio
Pleurodeles waltl
Arabidopsis thaliana
Gossypium hirsutum
Oryza sativa
Zea mays
Drosophila melanogaster
Caenorhabditis elegans
Saccharomyces cerevisiae
Cryptocccus neoformans
Methanococcus jannaschii
Halobacterium salinarum
Escherichia coli
Helicobacter pylori
Haemophilus influenzae
Yersinia pestis
SARS CoV-2 virus
Influenza A virus

Create a directory for this workshop (e.g. Documents/SummmerScholars2024/) and a sub-directory called data/. Move your spreadsheet to data/ .

If you have issues with creating your spreadsheet, you can download one from here.

R environment setup

When we run code we have a “working directory”, the directory in our file system associated to the processes we are running. By default the working directory will be your home directory or the one where the script or Quarto document is located in. It is It is a good practice to explicitly specify your working at the beginning of your code. For this, in a regular script you would use the function setwd() (set working directory) but in a Quarto document if you use that in a code chunk it will only have an effect for that code chunk. To set the working directory for all the chunks, in the setup box at the top we can add execute-dir : file to explicitly set the working directory to be the one where the Quarto document file is located in. So let’s also save this Quarto document in the directory we crated for this workshop Documents/SummmerScholars2024/.

In this box let’s also add df-print: kable to use the “kable” format to print the dataframes in the Quarto reports as a nice table, instead of regular R output.

title: "Class 1"
format: html
df-print: kable
editor: visual
project:
  execute-dir: file

Another important setup step is loading the needed libraries. We will use the tidyverse library for plotting and data manipulation.

library(tidyverse)

Read the spreadsheet into a dataframe

To use in R the data that we obtained, we have to read it into a dataframe (a data structure with rows representing observations and columns representing variables). The read.csv() function will automatically make a dataframe from a comma-separated values file, we only need to tell it the path to the file.

genomes <- read.csv("data/size_genes.csv")
genomes

Scientific_name	Common_name	Kingdom	Size	Genes	Protein_coding
Homo sapiens	human	Animal	3.0990e+03	59652	20080
Mus musculus	house mouse	Animal	2.7280e+03	50763	22198
Danio rerio	zebrafish	Animal	1.3730e+03	40031	26448
Pleurodeles waltl	iberian ribbed newt	Animal	2.0301e+04	13359	13359
Arabidopsis thaliana	thale cress	Plant	1.1910e+02	38312	27562
Gossypium hirsutum	cotton	Plant	2.3050e+03	99456	67585
Oryza sativa	japanese rice	Plant	3.7380e+02	35223	28738
Zea mays	maize	Plant	2.1820e+03	49897	34337
Drosophila melanogaster	fruit fly	Animal	1.4370e+02	17872	13962
Caenorhabditis elegans		Animal	1.0030e+02	46927	19984
Saccharomyces cerevisiae	brewer’s yeast	Fungi	1.2070e+01	6470	6014
Cryptocccus neoformans		Fungi	1.9050e+01	7004	6632
Methanocaldococcus jannaschii		Archea	1.7400e+00	1903	1816
Halobacterium salinarum		Archea	2.4300e+00	2584	2479
Escherichia coli		Bacteria	4.6420e+00	4639	4288
Helicobacter pylori		Bacteria	1.6240e+00	1558	1447
Haemophilus influenzae		Bacteria	1.8460e+00	1834	1700
Yersinia pestis		Bacteria	4.6580e+00	4240	3951
SARS CoV-2 virus		Virus	2.9900e-02	11	11
Influenza A virus		Virus	1.3630e-02	12	12

summary(genomes)

 Scientific_name    Common_name          Kingdom               Size          
 Length:20          Length:20          Length:20          Min.   :    0.014  
 Class :character   Class :character   Class :character   1st Qu.:    2.284  
 Mode  :character   Mode  :character   Mode  :character   Median :   59.675  
                                                          Mean   : 1638.650  
                                                          3rd Qu.: 1575.250  
                                                          Max.   :20301.000  
     Genes       Protein_coding 
 Min.   :   11   Min.   :   11  
 1st Qu.: 2414   1st Qu.: 2313  
 Median :10182   Median : 9996  
 Mean   :24087   Mean   :15130  
 3rd Qu.:41755   3rd Qu.:23260  
 Max.   :99456   Max.   :67585

Plot the genome size vs. the number of protein-coding genes

To start understanding genomes we need to ask how the genome size is related to the number of genes it contains. Do small genomes have less genes than large genomes?

Let’s plot this relationship to visually check if larger genome means more genes, across the tree of life. We can start with a basic plot, only specifying which information goes in which axis and how are we going to graphically represent this.

ggplot(genomes, aes(x = Size, y = Protein_coding))+
  geom_point()

With this we can see that there is not a clear tendency in all our data set. Let’s add more information to understand it better.

ggplot(genomes, aes(x = Size, y = Protein_coding, shape = Kingdom))+
  geom_point()

Now we can see that Plants tend to have more protein-coding genes and that Animals don’t follow a linear tendency of increase of genome size and number of protein-coding genes.

We can add more specification about the graphical elements to our plot. For example we can make the points bigger and a certain color.

ggplot(genomes, aes(x = Size, y = Protein_coding, shape = Kingdom))+
  geom_point(color = "blue", size = 3)

If we want to use color to convey information as well, we can use it inside of the aes() function to specify the name of each species.

ggplot(genomes, aes(x = Size, y = Protein_coding, color = Scientific_name, shape = Kingdom))+
  geom_point(size = 3)

With ggplot we can add more layers to tweak the properties of out plots. The theme() function controls the elements not related to the data. Let’s use it to improve the size and position of the legends. And let’s use the labs() function to add a title and change the labels of the x and y axis.

ggplot(genomes, aes(x = Size, y = Protein_coding, color = Scientific_name, shape = Kingdom))+
  geom_point(size = 3)+
  theme(legend.box = "horizontal",
        legend.key.size = unit(0.5, "cm"))+
  labs(title = "Relationship between genome size and number of protein-coding genes",
       x = "Genome size (Mb)",
       y = "Number of protein-coding genes")

When we have data points that span very large numeric ranges it is useful to transform their values to have a better visualization of the tendencies in the data.

Tidyverse offers efficient ways to modify data. It uses pipes %>% to pass the result of one step as the input to the next step, and it has functions that are very easily combined with pipes to transform the data.

The function mutate() is used to create or modify variables in a dataframe. It takes a name for the column and an operation that will give the values to that column.

Let’s use these tools to create new columns with the logarithm of the columns Size and Protein_coding.

genomes <- genomes %>%
  mutate(Size_log = log10(Size),
         Protein_coding_log = log10(Protein_coding))

genomes

Scientific_name	Common_name	Kingdom	Size	Genes	Protein_coding	Size_log	Protein_coding_log
Homo sapiens	human	Animal	3.0990e+03	59652	20080	3.4912216	4.302764
Mus musculus	house mouse	Animal	2.7280e+03	50763	22198	3.4358444	4.346314
Danio rerio	zebrafish	Animal	1.3730e+03	40031	26448	3.1376705	4.422393
Pleurodeles waltl	iberian ribbed newt	Animal	2.0301e+04	13359	13359	4.3075174	4.125774
Arabidopsis thaliana	thale cress	Plant	1.1910e+02	38312	27562	2.0759118	4.440311
Gossypium hirsutum	cotton	Plant	2.3050e+03	99456	67585	3.3626709	4.829850
Oryza sativa	japanese rice	Plant	3.7380e+02	35223	28738	2.5726393	4.458456
Zea mays	maize	Plant	2.1820e+03	49897	34337	3.3388547	4.535762
Drosophila melanogaster	fruit fly	Animal	1.4370e+02	17872	13962	2.1574568	4.144948
Caenorhabditis elegans		Animal	1.0030e+02	46927	19984	2.0013009	4.300682
Saccharomyces cerevisiae	brewer’s yeast	Fungi	1.2070e+01	6470	6014	1.0817073	3.779163
Cryptocccus neoformans		Fungi	1.9050e+01	7004	6632	1.2798950	3.821645
Methanocaldococcus jannaschii		Archea	1.7400e+00	1903	1816	0.2405492	3.259116
Halobacterium salinarum		Archea	2.4300e+00	2584	2479	0.3856063	3.394277
Escherichia coli		Bacteria	4.6420e+00	4639	4288	0.6667051	3.632255
Helicobacter pylori		Bacteria	1.6240e+00	1558	1447	0.2105860	3.160468
Haemophilus influenzae		Bacteria	1.8460e+00	1834	1700	0.2662317	3.230449
Yersinia pestis		Bacteria	4.6580e+00	4240	3951	0.6681995	3.596707
SARS CoV-2 virus		Virus	2.9900e-02	11	11	-1.5243288	1.041393
Influenza A virus		Virus	1.3630e-02	12	12	-1.8655041	1.079181

ggplot(genomes, aes(x = Size_log, y = Protein_coding_log, color = Scientific_name, shape = Kingdom))+
  geom_point(size = 3)+
  theme(legend.box = "horizontal",
        legend.key.size = unit(0.5, "cm"))+
  labs(title = "Relationship between genome size and number of protein-coding genes",
       x = "Log10 of genome size (Mb)",
       y = "Log 10 of number of protein-coding genes")

ggplot has functions that you can use as layers to automatically transform the data points and show a scale with meaningful units in the axis.

ggplot(genomes, aes(x = Size, y = Protein_coding, color = Scientific_name, shape = Kingdom))+
  geom_point(size = 3)+
  theme(legend.box = "horizontal",
        legend.key.size = unit(0.5, "cm"))+
  labs(title = "Relationship between genome size and number of protein-coding genes",
       x = "Genome size (Mb)",
       y = "Number of protein-coding genes")+
  scale_y_log10(label = scales::comma)+
  scale_x_log10(label = scales::comma)

Fraction of protein-coding genes

Another important question about genomes is how many of the total genes are protein-coding genes.

genomes <- genomes %>%
  mutate(Protein_coding_fraction = Protein_coding / Genes)

Let’s plot this using bars as the graphical representation of the protein-coding fraction.

ggplot(genomes, aes(x = Scientific_name, y = Protein_coding_fraction, fill = Kingdom))+
  geom_col()

Let’s improve the labels.

ggplot(genomes, aes(x = Scientific_name, y = Protein_coding_fraction, fill = Kingdom))+
  geom_col()+
  theme(axis.text.x = element_text(angle = 45, hjust =1))+
  labs(title = "Protein-coding fraction of genes",
       y = "Fraction",
       x = "Species")

In this plot we have all the information but it is difficult to see a pattern because the species are arranged in alphabetical order. It would be more useful to group the columns by Kingdom to see if there is a difference between kingdoms.

For this, ggplot has the function facet_wrap() to separate the plot into panels by group.

ggplot(genomes, aes(x = Scientific_name, y = Protein_coding_fraction, fill = Kingdom))+
  geom_col()+
  theme(axis.text.x = element_text(angle = 45, hjust =1))+
  labs(title = "Protein-coding fraction of genes",
       y = "Fraction",
       x = "Species")+
  facet_wrap(~Kingdom)

With the default parameters of facet_wrap() we get too much empty space because each panel is considering all species. Let’s each panel have only the corresponding species and arrange all panels in one row to be able to compare the y axis.

ggplot(genomes, aes(x = Scientific_name, y = Protein_coding_fraction, fill = Kingdom))+
  geom_col()+
  theme(axis.text.x = element_text(angle = 45, hjust =1),
        legend.position = "none")+
  labs(title = "Protein-coding fraction of genes",
       y = "Fraction",
       x = "Species")+
  facet_wrap(~Kingdom, nrow = 1, scales = "free_x")

Now we can clearly see that Virus only have protein-coding genes, that Archea, Bacteria and Fungi have a high fraction of protein-coding genes that is not very variable between species, that Plants and Animals have lower fractions, with Animals having more variable values than Plants. One of the animals, Pleurodeles waltl has only protein-coding genes, this is odd, maybe its annotation is incomplete and the rest of the genes were not reported.

This observations are a good place to start our quest to know what’s in a genome, but we only used a few species, in the next session we will use the power of R to analyze a bigger data set and get more meaningful conclusions.