library(tidyverse)
Analysing a small dataset
Obtaining information from a public database
Go to NCBI Datasets and search for the species in the following list. Create a spreadsheet with the name of the organism, the number of total genes and protein-coding genes and the kingdom they belong to (Bacteria, Archaea, Fungi, Animal, Plant).
- Homo sapiens
- Mus musculus
- Danio rerio
- Pleurodeles waltl
- Arabidopsis thaliana
- Gossypium hirsutum
- Oryza sativa
- Zea mays
- Drosophila melanogaster
- Caenorhabditis elegans
- Saccharomyces cerevisiae
- Cryptocccus neoformans
- Methanococcus jannaschii
- Halobacterium salinarum
- Escherichia coli
- Helicobacter pylori
- Haemophilus influenzae
- Yersinia pestis
- SARS CoV-2 virus
- Influenza A virus
Create a directory for this workshop (e.g. Documents/SummmerScholars2024/
) and a sub-directory called data/
. Move your spreadsheet to data/
.
If you have issues with creating your spreadsheet, you can download one from here.
R environment setup
When we run code we have a “working directory”, the directory in our file system associated to the processes we are running. By default the working directory will be your home directory or the one where the script or Quarto document is located in. It is It is a good practice to explicitly specify your working at the beginning of your code. For this, in a regular script you would use the function setwd()
(set working directory) but in a Quarto document if you use that in a code chunk it will only have an effect for that code chunk. To set the working directory for all the chunks, in the setup box at the top we can add execute-dir : file
to explicitly set the working directory to be the one where the Quarto document file
is located in. So let’s also save this Quarto document in the directory we crated for this workshop Documents/SummmerScholars2024/.
In this box let’s also add df-print: kable
to use the “kable” format to print the dataframes in the Quarto reports as a nice table, instead of regular R output.
title: "Class 1"
format: html
df-print: kable
editor: visual
project:
execute-dir: file
Another important setup step is loading the needed libraries. We will use the tidyverse
library for plotting and data manipulation.
Read the spreadsheet into a dataframe
To use in R the data that we obtained, we have to read it into a dataframe (a data structure with rows representing observations and columns representing variables). The read.csv()
function will automatically make a dataframe from a comma-separated values file, we only need to tell it the path to the file.
<- read.csv("data/size_genes.csv")
genomes genomes
Scientific_name | Common_name | Kingdom | Size | Genes | Protein_coding |
---|---|---|---|---|---|
Homo sapiens | human | Animal | 3.0990e+03 | 59652 | 20080 |
Mus musculus | house mouse | Animal | 2.7280e+03 | 50763 | 22198 |
Danio rerio | zebrafish | Animal | 1.3730e+03 | 40031 | 26448 |
Pleurodeles waltl | iberian ribbed newt | Animal | 2.0301e+04 | 13359 | 13359 |
Arabidopsis thaliana | thale cress | Plant | 1.1910e+02 | 38312 | 27562 |
Gossypium hirsutum | cotton | Plant | 2.3050e+03 | 99456 | 67585 |
Oryza sativa | japanese rice | Plant | 3.7380e+02 | 35223 | 28738 |
Zea mays | maize | Plant | 2.1820e+03 | 49897 | 34337 |
Drosophila melanogaster | fruit fly | Animal | 1.4370e+02 | 17872 | 13962 |
Caenorhabditis elegans | Animal | 1.0030e+02 | 46927 | 19984 | |
Saccharomyces cerevisiae | brewer’s yeast | Fungi | 1.2070e+01 | 6470 | 6014 |
Cryptocccus neoformans | Fungi | 1.9050e+01 | 7004 | 6632 | |
Methanocaldococcus jannaschii | Archea | 1.7400e+00 | 1903 | 1816 | |
Halobacterium salinarum | Archea | 2.4300e+00 | 2584 | 2479 | |
Escherichia coli | Bacteria | 4.6420e+00 | 4639 | 4288 | |
Helicobacter pylori | Bacteria | 1.6240e+00 | 1558 | 1447 | |
Haemophilus influenzae | Bacteria | 1.8460e+00 | 1834 | 1700 | |
Yersinia pestis | Bacteria | 4.6580e+00 | 4240 | 3951 | |
SARS CoV-2 virus | Virus | 2.9900e-02 | 11 | 11 | |
Influenza A virus | Virus | 1.3630e-02 | 12 | 12 |
summary(genomes)
Scientific_name Common_name Kingdom Size
Length:20 Length:20 Length:20 Min. : 0.014
Class :character Class :character Class :character 1st Qu.: 2.284
Mode :character Mode :character Mode :character Median : 59.675
Mean : 1638.650
3rd Qu.: 1575.250
Max. :20301.000
Genes Protein_coding
Min. : 11 Min. : 11
1st Qu.: 2414 1st Qu.: 2313
Median :10182 Median : 9996
Mean :24087 Mean :15130
3rd Qu.:41755 3rd Qu.:23260
Max. :99456 Max. :67585
Plot the genome size vs. the number of protein-coding genes
To start understanding genomes we need to ask how the genome size is related to the number of genes it contains. Do small genomes have less genes than large genomes?
Let’s plot this relationship to visually check if larger genome means more genes, across the tree of life. We can start with a basic plot, only specifying which information goes in which axis and how are we going to graphically represent this.
ggplot(genomes, aes(x = Size, y = Protein_coding))+
geom_point()
With this we can see that there is not a clear tendency in all our data set. Let’s add more information to understand it better.
ggplot(genomes, aes(x = Size, y = Protein_coding, shape = Kingdom))+
geom_point()
Now we can see that Plants tend to have more protein-coding genes and that Animals don’t follow a linear tendency of increase of genome size and number of protein-coding genes.
We can add more specification about the graphical elements to our plot. For example we can make the points bigger and a certain color.
ggplot(genomes, aes(x = Size, y = Protein_coding, shape = Kingdom))+
geom_point(color = "blue", size = 3)
If we want to use color to convey information as well, we can use it inside of the aes()
function to specify the name of each species.
ggplot(genomes, aes(x = Size, y = Protein_coding, color = Scientific_name, shape = Kingdom))+
geom_point(size = 3)
With ggplot we can add more layers to tweak the properties of out plots. The theme()
function controls the elements not related to the data. Let’s use it to improve the size and position of the legends. And let’s use the labs()
function to add a title and change the labels of the x and y axis.
ggplot(genomes, aes(x = Size, y = Protein_coding, color = Scientific_name, shape = Kingdom))+
geom_point(size = 3)+
theme(legend.box = "horizontal",
legend.key.size = unit(0.5, "cm"))+
labs(title = "Relationship between genome size and number of protein-coding genes",
x = "Genome size (Mb)",
y = "Number of protein-coding genes")
When we have data points that span very large numeric ranges it is useful to transform their values to have a better visualization of the tendencies in the data.
Tidyverse offers efficient ways to modify data. It uses pipes %>%
to pass the result of one step as the input to the next step, and it has functions that are very easily combined with pipes to transform the data.
The function mutate()
is used to create or modify variables in a dataframe. It takes a name for the column and an operation that will give the values to that column.
Let’s use these tools to create new columns with the logarithm of the columns Size
and Protein_coding
.
<- genomes %>%
genomes mutate(Size_log = log10(Size),
Protein_coding_log = log10(Protein_coding))
genomes
Scientific_name | Common_name | Kingdom | Size | Genes | Protein_coding | Size_log | Protein_coding_log |
---|---|---|---|---|---|---|---|
Homo sapiens | human | Animal | 3.0990e+03 | 59652 | 20080 | 3.4912216 | 4.302764 |
Mus musculus | house mouse | Animal | 2.7280e+03 | 50763 | 22198 | 3.4358444 | 4.346314 |
Danio rerio | zebrafish | Animal | 1.3730e+03 | 40031 | 26448 | 3.1376705 | 4.422393 |
Pleurodeles waltl | iberian ribbed newt | Animal | 2.0301e+04 | 13359 | 13359 | 4.3075174 | 4.125774 |
Arabidopsis thaliana | thale cress | Plant | 1.1910e+02 | 38312 | 27562 | 2.0759118 | 4.440311 |
Gossypium hirsutum | cotton | Plant | 2.3050e+03 | 99456 | 67585 | 3.3626709 | 4.829850 |
Oryza sativa | japanese rice | Plant | 3.7380e+02 | 35223 | 28738 | 2.5726393 | 4.458456 |
Zea mays | maize | Plant | 2.1820e+03 | 49897 | 34337 | 3.3388547 | 4.535762 |
Drosophila melanogaster | fruit fly | Animal | 1.4370e+02 | 17872 | 13962 | 2.1574568 | 4.144948 |
Caenorhabditis elegans | Animal | 1.0030e+02 | 46927 | 19984 | 2.0013009 | 4.300682 | |
Saccharomyces cerevisiae | brewer’s yeast | Fungi | 1.2070e+01 | 6470 | 6014 | 1.0817073 | 3.779163 |
Cryptocccus neoformans | Fungi | 1.9050e+01 | 7004 | 6632 | 1.2798950 | 3.821645 | |
Methanocaldococcus jannaschii | Archea | 1.7400e+00 | 1903 | 1816 | 0.2405492 | 3.259116 | |
Halobacterium salinarum | Archea | 2.4300e+00 | 2584 | 2479 | 0.3856063 | 3.394277 | |
Escherichia coli | Bacteria | 4.6420e+00 | 4639 | 4288 | 0.6667051 | 3.632255 | |
Helicobacter pylori | Bacteria | 1.6240e+00 | 1558 | 1447 | 0.2105860 | 3.160468 | |
Haemophilus influenzae | Bacteria | 1.8460e+00 | 1834 | 1700 | 0.2662317 | 3.230449 | |
Yersinia pestis | Bacteria | 4.6580e+00 | 4240 | 3951 | 0.6681995 | 3.596707 | |
SARS CoV-2 virus | Virus | 2.9900e-02 | 11 | 11 | -1.5243288 | 1.041393 | |
Influenza A virus | Virus | 1.3630e-02 | 12 | 12 | -1.8655041 | 1.079181 |
ggplot(genomes, aes(x = Size_log, y = Protein_coding_log, color = Scientific_name, shape = Kingdom))+
geom_point(size = 3)+
theme(legend.box = "horizontal",
legend.key.size = unit(0.5, "cm"))+
labs(title = "Relationship between genome size and number of protein-coding genes",
x = "Log10 of genome size (Mb)",
y = "Log 10 of number of protein-coding genes")
ggplot has functions that you can use as layers to automatically transform the data points and show a scale with meaningful units in the axis.
ggplot(genomes, aes(x = Size, y = Protein_coding, color = Scientific_name, shape = Kingdom))+
geom_point(size = 3)+
theme(legend.box = "horizontal",
legend.key.size = unit(0.5, "cm"))+
labs(title = "Relationship between genome size and number of protein-coding genes",
x = "Genome size (Mb)",
y = "Number of protein-coding genes")+
scale_y_log10(label = scales::comma)+
scale_x_log10(label = scales::comma)
Fraction of protein-coding genes
Another important question about genomes is how many of the total genes are protein-coding genes.
<- genomes %>%
genomes mutate(Protein_coding_fraction = Protein_coding / Genes)
Let’s plot this using bars as the graphical representation of the protein-coding fraction.
ggplot(genomes, aes(x = Scientific_name, y = Protein_coding_fraction, fill = Kingdom))+
geom_col()
Let’s improve the labels.
ggplot(genomes, aes(x = Scientific_name, y = Protein_coding_fraction, fill = Kingdom))+
geom_col()+
theme(axis.text.x = element_text(angle = 45, hjust =1))+
labs(title = "Protein-coding fraction of genes",
y = "Fraction",
x = "Species")
In this plot we have all the information but it is difficult to see a pattern because the species are arranged in alphabetical order. It would be more useful to group the columns by Kingdom to see if there is a difference between kingdoms.
For this, ggplot has the function facet_wrap()
to separate the plot into panels by group.
ggplot(genomes, aes(x = Scientific_name, y = Protein_coding_fraction, fill = Kingdom))+
geom_col()+
theme(axis.text.x = element_text(angle = 45, hjust =1))+
labs(title = "Protein-coding fraction of genes",
y = "Fraction",
x = "Species")+
facet_wrap(~Kingdom)
With the default parameters of facet_wrap()
we get too much empty space because each panel is considering all species. Let’s each panel have only the corresponding species and arrange all panels in one row to be able to compare the y axis.
ggplot(genomes, aes(x = Scientific_name, y = Protein_coding_fraction, fill = Kingdom))+
geom_col()+
theme(axis.text.x = element_text(angle = 45, hjust =1),
legend.position = "none")+
labs(title = "Protein-coding fraction of genes",
y = "Fraction",
x = "Species")+
facet_wrap(~Kingdom, nrow = 1, scales = "free_x")
Now we can clearly see that Virus only have protein-coding genes, that Archea, Bacteria and Fungi have a high fraction of protein-coding genes that is not very variable between species, that Plants and Animals have lower fractions, with Animals having more variable values than Plants. One of the animals, Pleurodeles waltl has only protein-coding genes, this is odd, maybe its annotation is incomplete and the rest of the genes were not reported.
This observations are a good place to start our quest to know what’s in a genome, but we only used a few species, in the next session we will use the power of R to analyze a bigger data set and get more meaningful conclusions.