library(tidyverse)
Genome size vs. number of genes
Download assembly metadata of taxonomic groups
- Go to the home page of NCBI Datasets https://www.ncbi.nlm.nih.gov/datasets/
- Search “Metazoa”
- Click on “Browse all 15,582 genomes”
- Apply the filters:
- Select columns:
- Assembly
- Scientific name
- Size (Mb)
- Genes
- Protein-coding
- Pseudogenes
- Check the box to the left of “Assembly” to select all entries.
- Click on “Download/Download Table” and type “metazoa.tsv” for the name of the file and click download.
- Move the downloaded table to your workshop data directory.
Repeat the steps for the following groups of organisms: “Bacteria, Fungi, Archaea, Viridiplantae, Viruses”.
For Bacteria use the filters Reference genomes, Annotated by NCBI RefSeq, Assembly level: complete, Year released: 2018-2024,
For Fungi and Viridiplantae use only the filter: Annotated by NCBI RefSeq.
For Archaea use: Annotated by NCBI RefSeq and Assembly level: Chromosome to complete.
For Viruses use: Annotated by NCBI RefSeq, Assembly level: Complete, Year released: 2022-2024
You can download the tables from the following links: archaea.tsv, bacteria.tsv, fungi.tsv, metazoa.tsv, viridiplantae.tsv, viruses.tsv
Setting up R environment
Let’s start by making sure the working directory is our folder for this workshop and loading the tidyverse
library.
Import the assembly metadata
No we are going to import each table individually. It’s important that we put the kingdom name as the dataframe name.
<- read.delim("data/bacteria.tsv", sep ="\t", )
Bacteria <- read.delim("data/archaea.tsv", sep = "\t")
Archaea <- read.delim("data/metazoa.tsv", sep = "\t")
Metazoa <- read.delim("data/viridiplantae.tsv", sep = "\t")
Viridiplantae <- read.delim("data/fungi.tsv", sep = "\t")
Fungi <- read.delim("data/viruses.tsv", sep = "\t") Viruses
To join the dataframes into one we need first to create a list with. Which is an R data structure that hold elements that can be of different types and sizes. We are going to use the function lst()
which creates a list and uses the name of the dataframe as the name of the dataframe element of the list.
<- lst(Bacteria, Archaea, Metazoa, Viridiplantae, Fungi, Viruses) dfs_list
If you click in the name of the list in your Environment panel you can see the structure of the list and its elements, that in this case are all dataframes.
Now we can give this list to the bind_rows()
function to create a new dataframe with the rows of all the previous dataframes and a new column with the name of the dataframe of origin in every row.
<- bind_rows(dfs_list, .id = "Kingdom")
assemblies head(assemblies)
Kingdom | Assembly.Accession | Assembly.Name | Organism.Name | Assembly.Stats.Total.Sequence.Length | Annotation.Count.Gene.Total | Annotation.Count.Gene.Protein.coding | Annotation.Count.Gene.Pseudogene |
---|---|---|---|---|---|---|---|
Bacteria | GCF_000008865.2 | ASM886v2 | Escherichia coli O157:H7 str. Sakai | 5594605 | 5417 | 5155 | 136 |
Bacteria | GCF_006094375.1 | ASM609437v1 | Staphylococcus epidermidis | 2491058 | 2368 | 2231 | 53 |
Bacteria | GCF_019048385.1 | ASM1904838v1 | Pantoea agglomerans | 4692018 | 4361 | 4204 | 43 |
Bacteria | GCF_014131755.1 | ASM1413175v1 | Bacteroides thetaiotaomicron | 6304193 | 4875 | 4713 | 75 |
Bacteria | GCF_016026695.1 | ASM1602669v1 | Lactococcus garvieae | 2084337 | 2110 | 1981 | 50 |
Bacteria | GCF_900560965.1 | PPRCHA0 | Pseudomonas protegens CHA0 | 6868303 | 6270 | 6146 | 35 |
Now we can use the rename()
function to give better names to the columns.
<- assemblies %>%
assemblies rename("GenBank_accession" = "Assembly.Accession",
"Genome_accession" = "Assembly.Name",
"Organism" = "Organism.Name",
"Size" = "Assembly.Stats.Total.Sequence.Length",
"Genes" = "Annotation.Count.Gene.Total",
"Protein_coding" = "Annotation.Count.Gene.Protein.coding",
"Pseudogenes" = "Annotation.Count.Gene.Pseudogene")
str(assemblies)
'data.frame': 8770 obs. of 8 variables:
$ Kingdom : chr "Bacteria" "Bacteria" "Bacteria" "Bacteria" ...
$ GenBank_accession: chr "GCF_000008865.2" "GCF_006094375.1" "GCF_019048385.1" "GCF_014131755.1" ...
$ Genome_accession : chr "ASM886v2" "ASM609437v1" "ASM1904838v1" "ASM1413175v1" ...
$ Organism : chr "Escherichia coli O157:H7 str. Sakai" "Staphylococcus epidermidis" "Pantoea agglomerans" "Bacteroides thetaiotaomicron" ...
$ Size : num 5594605 2491058 4692018 6304193 2084337 ...
$ Genes : int 5417 2368 4361 4875 2110 6270 5334 3359 2067 1989 ...
$ Protein_coding : int 5155 2231 4204 4713 1981 6146 5109 3218 1957 1842 ...
$ Pseudogenes : int 136 53 43 75 50 35 77 48 45 60 ...
Since we downloaded the assembly metadata form database that might have many assemblies for each organism, lets remain only with one assembly per organism. Let’s use the function group_by()
to create a group of rows of the same Organism, and the function slice()
to keep only this first row in each group.
<- assemblies %>%
assemblies group_by(Organism) %>%
slice(1)
Now we can check the basic summary statistics for out entire data set.
summary(assemblies)
Kingdom GenBank_accession Genome_accession Organism
Length:8510 Length:8510 Length:8510 Length:8510
Class :character Class :character Class :character Class :character
Mode :character Mode :character Mode :character Mode :character
Size Genes Protein_coding Pseudogenes
Min. :9.910e+02 Min. : 1 Min. : 1 Min. : 1.0
1st Qu.:1.503e+04 1st Qu.: 8 1st Qu.: 8 1st Qu.: 18.0
Median :2.608e+06 Median : 2512 Median : 2416 Median : 48.0
Mean :1.392e+08 Mean : 5421 Mean : 4540 Mean : 371.3
3rd Qu.:5.601e+06 3rd Qu.: 5059 3rd Qu.: 4878 3rd Qu.: 128.5
Max. :4.005e+10 Max. :186827 Max. :103787 Max. :33511.0
NA's :3463
Genome length vs. protein-coding genes
Now that we have a bigger dataset than in the previous exercise, let’s see if we find the same tendencies in the relationship between genome size and number of protein-coding genes.
ggplot(assemblies, aes(x= Size, y = Protein_coding, color = Kingdom)) +
geom_point()+
labs(title = "Relationship between genome size and number of protein-coding genes",
x = "Genome size (bp)",
y = "Number of protein-coding genes")
We can see that the tendency of plants having more genes and animals having larger genomes looks true with the bigger data set. But we don’t see it very well because we have extremely large genomes that make the scale of the plot too big. Let’s see how we can identify extreme genomes
Identifying extreme genomes
We can again use the group_by()
function to group by Kingdom and mix it with the arrange()
function, which reorders the rows in each group in ascending order of the specified variable. Finally we can slice()
to get the first row of each arranged group. Note that if many rows had the same value of protein coding genes, since we are only picking one, we might be loosing genomes that also have the smallest number of genes.
<- assemblies %>%
fewest_genes group_by(Kingdom) %>%
arrange(Protein_coding)%>%
slice(1)
fewest_genes
Kingdom | GenBank_accession | Genome_accession | Organism | Size | Genes | Protein_coding | Pseudogenes |
---|---|---|---|---|---|---|---|
Archaea | GCF_023169545.1 | ASM2316954v1 | Nanobdella aerobiophila | 668961 | 765 | 719 | NA |
Bacteria | GCF_964019745.1 | ihAphAlni3.Zinderia_sp_1.1 | Candidatus Zinderia endosymbiont of Aphrophora alni | 336619 | 333 | 284 | 13 |
Fungi | GCF_000803265.1 | ASM80326v1 | Ordospora colligata OC4 | 2290528 | 1879 | 1820 | NA |
Metazoa | GCF_954871325.1 | tfGorSpeb1-WG-v2 | Gordionus sp. m RMFG-2023 | 287640266 | 10079 | 8321 | 33 |
Viridiplantae | GCF_000733215.1 | ASM73321v1 | Auxenochlorella protothecoides | 22924637 | 7016 | 7010 | NA |
Viruses | GCF_023148805.1 | ASM2314880v1 | Achyranthes virus A | 9491 | 1 | 1 | NA |
Do a little research about the species with the fewest genomes of each kingdom. Do they have something in common?
Now, how would you get the genomes with the largest number of protein-coding genes ? And the smallest genomes ?
Code
<- assemblies %>%
most_genes group_by(Kingdom) %>%
arrange(desc(Protein_coding))%>%
slice(1)
Code
<- assemblies %>%
smallest_genomes group_by(Kingdom) %>%
arrange(Size)%>%
slice(1)
To find the largest genomes, let’s not only get the single largest one but the 10 largest of each group. We can use the slice_head()
function instead of slice()
for this.
<- assemblies %>%
largest_genomes group_by(Kingdom) %>%
arrange(desc(Size))%>%
slice_head(n = 10)
largest_genomes
Kingdom | GenBank_accession | Genome_accession | Organism | Size | Genes | Protein_coding | Pseudogenes |
---|---|---|---|---|---|---|---|
Archaea | GCF_020618475.1 | ASM2061847v1 | Haladaptatus halobius | 7279389 | 7673 | 7335 | 275 |
Archaea | GCF_020700235.1 | ASM2070023v1 | Haladaptatus pallidirubidus | 6270521 | 6555 | 6279 | 209 |
Archaea | GCF_029338335.1 | ASM2933833v1 | Halomontanus rarus | 5998452 | 5559 | 5371 | 132 |
Archaea | GCF_000007345.1 | ASM734v1 | Methanosarcina acetivorans C2A | 5751492 | 4915 | 4690 | 157 |
Archaea | GCF_019879105.1 | ASM1987910v1 | Natrinema sp. SYSU A 869 | 5674674 | 5608 | 4985 | 550 |
Archaea | GCF_024266705.1 | ASM2426670v1 | Natrinema gelatinilyticum | 5443578 | 5342 | 4952 | 334 |
Archaea | GCF_000025325.1 | ASM2532v1 | Haloterrigena turkmenica DSM 5511 | 5440782 | 5281 | 5059 | 160 |
Archaea | GCF_000970145.1 | ASM97014v1 | Methanosarcina siciliae C2J | 5427890 | 4744 | 4444 | 233 |
Archaea | GCF_013456555.2 | ASM1345655v2 | Natrinema zhouii | 5336312 | 5348 | 5017 | 268 |
Archaea | GCF_037335815.1 | ASM3733581v1 | Salinirubellus sp. SYNS196 | 5327573 | 5257 | 5069 | 133 |
Bacteria | GCF_026965555.1 | ASM2696555v1 | Nannocystis poenicansa | 13228663 | 10193 | 10074 | 19 |
Bacteria | GCF_005144635.2 | ASM514463v2 | Polyangium aurulentum | 12320079 | 9670 | 9573 | 18 |
Bacteria | GCF_033097365.1 | ASM3309736v1 | Actinoplanes sichuanensis | 12144028 | 11135 | 10892 | 145 |
Bacteria | GCF_020215645.1 | ASM2021564v1 | Nonomuraea gerenzanensis | 12066052 | 11223 | 11059 | 73 |
Bacteria | GCF_025908395.1 | ASM2590839v1 | Streptomyces milbemycinicus | 12063392 | 10157 | 9716 | 351 |
Bacteria | GCF_035917475.1 | ASM3591747v1 | Nonomuraea fuscirosea | 11975494 | 11059 | 10881 | 96 |
Bacteria | GCF_019904175.1 | ASM1990417v1 | Dactylosporangium vinaceum | 11959907 | 10781 | 10558 | 134 |
Bacteria | GCF_017498545.1 | ASM1749854v1 | Sulfidibacter corallicola | 11786365 | 6977 | 6895 | 12 |
Bacteria | GCF_015689475.1 | ASM1568947v1 | Streptomyces solisilvae | 11639536 | 9472 | 9008 | 373 |
Bacteria | GCF_030294945.1 | SGFS_1.0 | Streptomyces graminofaciens | 11602651 | 10235 | 9681 | 460 |
Fungi | GCF_000439145.1 | ASM43914v3 | Rhizophagus irregularis DAOM 181602=DAOM 197198 | 136726313 | 26245 | 26143 | 3 |
Fungi | GCF_000151645.1 | ASM15164v1 | Tuber melanosporum | 124945702 | 7420 | 7420 | NA |
Fungi | GCF_026914185.1 | ASM2691418v1 | Puccinia triticina | 122823596 | 12736 | 12736 | NA |
Fungi | GCF_015039405.1 | Cananz1 | Cantharellus anzutake | 120173979 | 17292 | 16954 | 18 |
Fungi | GCF_000204055.1 | v1.0 | Melampsora larici-populina 98AG31 | 101129028 | 16263 | 16255 | 8 |
Fungi | GCF_021901695.1 | Pst134E36_v1_pri | Puccinia striiformis f. sp. tritici | 89491239 | 17813 | 17118 | NA |
Fungi | GCF_000149925.1 | ASM14992v1 | Puccinia graminis f. sp. tritici CRL 75-36-700-3 | 88644628 | 16308 | 15799 | 73 |
Fungi | GCF_002865645.1 | Melbi2 | Hyaloscypha bicolor E | 82384847 | 18640 | 18598 | 14 |
Fungi | GCF_020360975.1 | ASM2036097v1 | Hirsutella rhossiliensis | 81749301 | 12181 | 12053 | NA |
Fungi | GCF_016647785.1 | Suifus1 | Suillus fuscotomentosus | 79655913 | 18660 | 18510 | 3 |
Metazoa | GCF_019279795.1 | PAN1.0 | Protopterus annectens | 40054324612 | 30570 | 22039 | 1966 |
Metazoa | GCF_027579735.1 | aBomBom1.pri | Bombina bombina | 10019663791 | 31613 | 22468 | 1998 |
Metazoa | GCF_023864345.2 | iqSchSeri2.2 | Schistocerca serialis cubense | 9082806520 | 75810 | 17237 | 7394 |
Metazoa | GCF_021461395.2 | iqSchAmer2.1 | Schistocerca americana | 8990354198 | 81274 | 17662 | 7227 |
Metazoa | GCF_028390025.1 | aPseCor3.hap2 | Pseudophryne corroboree | 8872758793 | 186827 | 24591 | 33511 |
Metazoa | GCF_023898315.1 | iqSchNite1.1 | Schistocerca nitens | 8822555691 | 72560 | 17500 | 6288 |
Metazoa | GCF_021461385.2 | iqSchPice1.1 | Schistocerca piceifrons | 8742443784 | 96806 | 17490 | 10058 |
Metazoa | GCF_023897955.1 | iqSchGreg1.2 | Schistocerca gregaria | 8742418402 | 99467 | 19799 | 3862 |
Metazoa | GCF_023864275.1 | iqSchCanc2.1 | Schistocerca cancellata | 8536877781 | 103533 | 16907 | 6571 |
Metazoa | GCF_901001135.1 | aRhiBiv1.1 | Rhinatrema bivittatum | 5319222779 | 28114 | 21554 | 620 |
Viridiplantae | GCF_018294505.1 | IWGSC CS RefSeq v2.1 | Triticum aestivum | 14566502436 | 155463 | 103787 | 19065 |
Viridiplantae | GCF_002162155.2 | WEW_v2.1 | Triticum dicoccoides | 10677097747 | 102886 | 66764 | 14884 |
Viridiplantae | GCF_030272615.1 | Sugi_1.0 | Cryptomeria japonica | 9046532310 | 61123 | 41227 | 2981 |
Viridiplantae | GCF_003073215.2 | Tu2.1 | Triticum urartu | 4849077157 | 50810 | 35768 | 6509 |
Viridiplantae | GCF_904849725.1 | MorexV3_pseudomolecules_assembly | Hordeum vulgare subsp. vulgare | 4225577519 | 58438 | 31448 | 5778 |
Viridiplantae | GCF_002575655.2 | Aet v5.0 | Aegilops tauschii subsp. strangulata | 4218064899 | 62260 | 45657 | 5195 |
Viridiplantae | GCF_024323335.1 | CAAS_Psat_ZW6_1.0 | Pisum sativum | 3796066963 | 65672 | 40025 | 5386 |
Viridiplantae | GCF_000715135.1 | Ntab-TN90 | Nicotiana tabacum | 3642884816 | 74273 | 61780 | 3403 |
Viridiplantae | GCF_002878395.1 | UCD10Xv1.1 | Capsicum annuum | 3211819707 | 48341 | 32202 | 5748 |
Viridiplantae | GCF_004153795.1 | AHAU_CSS_1 | Camellia sinensis | 3105212962 | 68531 | 50838 | 6195 |
Viruses | GCF_029891415.1 | ASM2989141v1 | Cotonvirus japonicus | 1476527 | 1309 | 1306 | NA |
Viruses | GCF_002966405.1 | ASM296640v1 | Bodo saltans virus | 1385869 | 1207 | 1207 | NA |
Viruses | GCF_002966375.1 | ASM296637v1 | Acanthamoeba polyphaga mimivirus | 1221932 | 935 | 930 | NA |
Viruses | GCF_002924545.1 | IIT1_a5assembly | Powai lake megavirus | 1208707 | 996 | 996 | NA |
Viruses | GCF_004156295.1 | ASM415629v1 | Moumouvirus australiensis | 1098002 | 908 | 899 | NA |
Viruses | GCF_002966435.1 | ASM296643v1 | Moumouvirus goulette | 1016844 | 975 | 970 | NA |
Viruses | GCF_003057795.1 | ASM305779v1 | Tetraselmis virus 1 | 668031 | 663 | 653 | NA |
Viruses | GCF_006547245.1 | ASM654724v1 | Chrysochromulina parva virus | 437255 | 45 | 45 | NA |
Viruses | GCF_029882485.1 | ASM2988248v1 | Acanthamoeba castellanii medusavirus | 381277 | 463 | 461 | NA |
Viruses | GCF_002624965.1 | ASM262496v1 | Pseudomonas phage Phabio | 309157 | 471 | 468 | NA |
In Metazoans we can see that there is a very big jump in sizes from 5.3Gb to 8.5Gb and in Viridiplantae there is a bigger jump from 4.8Gb to 9Gb. To have a better view of our graph let’s keep the genomes with a size smaller or equal (<=
) than 8.5Gb.
<- assemblies %>%
filtered_assemblies filter(Size <= 8500000000)
ggplot(filtered_assemblies, aes(x= Size, y = Protein_coding, color = Kingdom)) +
geom_point()+
labs(title = "Relationship between genome size and number of protein-coding genes",
x = "Genome size (bp)",
y = "Number of protein-coding genes")+
theme(axis.text.x = element_text(angle = 45, hjust = 1))+
scale_y_continuous(labels = scales::comma)+
scale_x_continuous(labels = scales::comma)
Now we have a better view of the tendencies! At least in Metazoans and Viridiplantae. We can also use a logarithmic scale to improve the view.
ggplot(filtered_assemblies, aes(x= Size, y = Protein_coding, color = Kingdom)) +
geom_point()+
labs(title = "Relationship between genome size and number of protein-coding genes in Log10 scale",
x = "Genome size (bp)",
y = "Number of protein-coding genes")+
theme(axis.text.x = element_text(angle = 45, hjust = 1))+
scale_y_log10(labels = scales::comma)+
scale_x_log10(labels = scales::comma)
Or we can separate each kingdom into its own panel to see the tendencies in a proper scale for each group.
ggplot(filtered_assemblies, aes(x= Size, y = Protein_coding, color = Kingdom)) +
geom_point()+
labs(title = "Relationship between genome size and number of protein-coding genes",
x = "Genome size (bp)",
y = "Number of protein-coding genes")+
facet_wrap(~Kingdom, scales = "free")+
theme(legend.position = "none",
axis.text.x = element_text(angle = 45, hjust = 1))+
scale_y_continuous(labels = scales::comma)+
scale_x_continuous(labels = scales::comma)
Here we have a good view of the distribution of our data in each group, but the Viruses group seems to have a few very large genomes. When we filtered before we only considered the size, because we were using a value that was a proper threshold to remove both the animals and plants with very large genomes. In our last plot we can see that the jump of genome size in the Viruses is around the 500,000 bp. But we can not use that as a threshold in our entire data set. Instead, we are going to get a new dataframe big_viruses
with the viruses that we want to remove. Then we will filter our main data set to keep only the organisms that are not (!
) in (%in%
) the column Organisms of the big_viruses
dataframe (big_viruses$Organism)
.
<- filtered_assemblies %>%
big_viruses filter(Kingdom == "Viruses", Size > 500000)
<- filtered_assemblies %>%
filtered_assemblies filter(!Organism %in% big_viruses$Organism)
ggplot(filtered_assemblies, aes(x= Size, y = Protein_coding, color = Kingdom)) +
geom_point()+
labs(title = "Relationship between genome size and number of protein-coding genes",
x = "Genome size (bp)",
y = "Number of protein-coding genes")+
facet_wrap(~Kingdom, scales = "free")+
theme(legend.position = "none",
axis.text.x = element_text(angle = 45, hjust = 1))+
scale_y_continuous(labels = scales::comma)+
scale_x_continuous(labels = scales::comma)
Fraction of protein-coding genes vs. total genes
In the previous episode we saw that Viruses had only protein-coding genes, Archea, Bacteria and Fungi had a high proportion and plants and animals a lower proportion. Let’s see if that holds true with our large data set.
First let’s calculate the fraction of protein-coding genes.
<- filtered_assemblies %>%
filtered_assemblies mutate(Protein_coding_fraction = Protein_coding / Genes)
Since now we have a lot of data points it will be useful to plot the summary statistics as a boxplot, instead of a bar for each genome.
ggplot(filtered_assemblies, aes(Kingdom, Protein_coding_fraction, color = Kingdom))+
theme(legend.position = "none")+
geom_boxplot()
This is a very useful way to show the broad patterns of the data, and we can complement it with the raw data simultaneously. In ggplot we can add multiple representations of the data with different “geoms”. Let’s install and load the library ggbeeswarm
to be able to use the geom_quasirandom()
, which is plotted along the boxplot to show all the data points in an ordered way to be able to see their distribution.
library(ggbeeswarm)
ggplot(filtered_assemblies, aes(Kingdom, Protein_coding_fraction, color = Kingdom))+
theme(legend.position = "none")+
geom_quasirandom()+
geom_boxplot()
When using more than one geom, if they both use the same aes()
parameters, those can be placed in the pain ggplot()
function. If not, the independent aes()
and other parameters go in the specific geom layer.
ggplot(filtered_assemblies, aes(Kingdom, Protein_coding_fraction))+
theme(legend.position = "none")+
geom_quasirandom(aes(color = Kingdom))+
geom_boxplot(alpha = 0.5, color = "darkgray")+
labs(title = "Fraction of protein-coding genes in each kingdom",
y = "Fracion of protein-coding genes")
Import taxonomic data of metazoans
With the previous plots we see that metazoans have a very large range of variation both in the genome size vs. protein-coding gene relationship and the fraction of protein-coding genes. To see how is this explained by the different groups of animals, let’s get the taxonomic information. This table was downloaded from NCBI and it is available at taxonomy_metazoa.tsv. Download it and save it in your data/
directory.
<- read.delim("data/taxonomy_metazoa.tsv", na = c("NA", ""))
taxonomy_metazoa head(taxonomy_metazoa)
Query | Taxid | Tax.name | Authority | Rank | Basionym | Basionym.authority | Curator.common.name | Has.type.material | Group.name | Superkingdom.name | Superkingdom.taxid | Kingdom.name | Kingdom.taxid | Phylum.name | Phylum.taxid | Class.name | Class.taxid | Order.name | Order.taxid | Family.name | Family.taxid | Genus.name | Genus.taxid | Species.name | Species.taxid |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Homo sapiens | 9606 | Homo sapiens | Linnaeus, 1758 | SPECIES | NA | NA | human | no | primates | Eukaryota | 2759 | Metazoa | 33208 | Chordata | 7711 | Mammalia | 40674 | Primates | 9443 | Hominidae | 9604 | Homo | 9605 | Homo sapiens | 9606 |
Mus musculus | 10090 | Mus musculus | Linnaeus, 1758 | SPECIES | NA | NA | house mouse | no | rodents | Eukaryota | 2759 | Metazoa | 33208 | Chordata | 7711 | Mammalia | 40674 | Rodentia | 9989 | Muridae | 10066 | Mus | 10088 | Mus musculus | 10090 |
Drosophila melanogaster | 7227 | Drosophila melanogaster | Meigen, 1830 | SPECIES | NA | NA | fruit fly | no | flies | Eukaryota | 2759 | Metazoa | 33208 | Arthropoda | 6656 | Insecta | 50557 | Diptera | 7147 | Drosophilidae | 7214 | Drosophila | 7215 | Drosophila melanogaster | 7227 |
Danio rerio | 7955 | Danio rerio | (Hamilton, 1822) | SPECIES | Cyprinus rerio | Hamilton, 1822 | zebrafish | no | bony fishes | Eukaryota | 2759 | Metazoa | 33208 | Chordata | 7711 | Actinopteri | 186623 | Cypriniformes | 7952 | Danionidae | 2743709 | Danio | 7954 | Danio rerio | 7955 |
Bubalus bubalis | 89462 | Bubalus bubalis | NA | SPECIES | Bos bubalis | Linnaeus, 1758 | water buffalo | no | even-toed ungulates & whales | Eukaryota | 2759 | Metazoa | 33208 | Chordata | 7711 | Mammalia | 40674 | Artiodactyla | 91561 | Bovidae | 9895 | Bubalus | 9918 | Bubalus bubalis | 89462 |
Helicoverpa armigera | 29058 | Helicoverpa armigera | (Hubner, 1808) | SPECIES | Noctua armigera | Hübner, 1805 | cotton bollworm | no | moths | Eukaryota | 2759 | Metazoa | 33208 | Arthropoda | 6656 | Insecta | 50557 | Lepidoptera | 7088 | Noctuidae | 7100 | Helicoverpa | 7112 | Helicoverpa armigera | 29058 |
Let’s select only the columns with the taxonomy information. The function select()
lets you do this while renaming them. We will join this table with our dataframe, so it will be easier if we use the same name for the column they have in common, which is “Organism”.
<- taxonomy_metazoa %>%
taxonomy_metazoa_selected select(Organism = Query,
Common_name = Curator.common.name,
Phylum = Phylum.name,
Class = Class.name,
Order = Order.name,
Family = Family.name,
Genus = Genus.name,
Species = Species.name)
head(taxonomy_metazoa_selected)
Organism | Common_name | Phylum | Class | Order | Family | Genus | Species |
---|---|---|---|---|---|---|---|
Homo sapiens | human | Chordata | Mammalia | Primates | Hominidae | Homo | Homo sapiens |
Mus musculus | house mouse | Chordata | Mammalia | Rodentia | Muridae | Mus | Mus musculus |
Drosophila melanogaster | fruit fly | Arthropoda | Insecta | Diptera | Drosophilidae | Drosophila | Drosophila melanogaster |
Danio rerio | zebrafish | Chordata | Actinopteri | Cypriniformes | Danionidae | Danio | Danio rerio |
Bubalus bubalis | water buffalo | Chordata | Mammalia | Artiodactyla | Bovidae | Bubalus | Bubalus bubalis |
Helicoverpa armigera | cotton bollworm | Arthropoda | Insecta | Lepidoptera | Noctuidae | Helicoverpa | Helicoverpa armigera |
We will join our filtered_assemblies
dataframe with the new taxonomy_metazoa_selected
dataframe. We want only the rows that have information on both tables. By using the function inner_join()
we will put the columns of both dataframes in a single one and the rows will be matched by the column “Organism”. There are multiple joining functions in tidyverse, inner_join()
makes sure that only the organisms (rows) that are present in both tables go to the resulting one.
<-inner_join(filtered_assemblies,taxonomy_metazoa_selected, by = "Organism")
metazoa_assembly_tax head(metazoa_assembly_tax)
Kingdom | GenBank_accession | Genome_accession | Organism | Size | Genes | Protein_coding | Pseudogenes | Protein_coding_fraction | Common_name | Phylum | Class | Order | Family | Genus | Species |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Metazoa | GCF_021347895.1 | KAUST_Apoly_ChrSc | Acanthochromis polyacanthus | 956771783 | 30251 | 23921 | 328 | 0.7907507 | spiny chromis | Chordata | Actinopteri | NA | Pomacentridae | Acanthochromis | Acanthochromis polyacanthus |
Metazoa | GCF_904848185.1 | fAcaLat1.1 | Acanthopagrus latus | 685127588 | 30246 | 23786 | 189 | 0.7864180 | yellowfin seabream | Chordata | Actinopteri | Spariformes | Sparidae | Acanthopagrus | Acanthopagrus latus |
Metazoa | GCF_929443795.1 | bAccGen1.1 | Accipiter gentilis | 1398011397 | 19564 | 16976 | 165 | 0.8677162 | Northern goshawk | Chordata | Aves | Accipitriformes | Accipitridae | Accipiter | Accipiter gentilis |
Metazoa | GCF_027475565.1 | VMU_Ajub_asm_v1.0 | Acinonyx jubatus | 2377417351 | 33469 | 19999 | 3672 | 0.5975380 | cheetah | Chordata | Mammalia | Carnivora | Felidae | Acinonyx | Acinonyx jubatus |
Metazoa | GCF_902713425.1 | fAciRut3.2 maternal haplotype | Acipenser ruthenus | 1899794300 | 80041 | 38165 | 6639 | 0.4768181 | sterlet | Chordata | Actinopteri | Acipenseriformes | Acipenseridae | Acipenser | Acipenser ruthenus |
Metazoa | GCF_903995435.1 | mAcoRus1.1 | Acomys russatus | 2301522284 | 29331 | 20811 | 5096 | 0.7095223 | golden spiny mouse | Chordata | Mammalia | Rodentia | Muridae | Acomys | Acomys russatus |
Let’s repeat our plot of genome length vs protein-coding genes for our Metazoa data set.
ggplot(metazoa_assembly_tax) +
geom_point(aes(x= Size, y = Protein_coding, color = Phylum))+
scale_x_continuous(label=scales::comma)+
scale_y_continuous(label=scales::comma)+
theme(axis.text.x = element_text(angle = 45, hjust = 1))+
labs(title = "Genome length vs protein-coding genes of Metazoa",
x = "Genome size (bp)",
y = "Number of protein-coding genes")
We can see that Chordata is one of the phyla with more data points and it has a lot of variation. Let’s see the protein coding fraction as well.
ggplot(metazoa_assembly_tax, aes(Phylum, Protein_coding_fraction))+
geom_quasirandom(aes(color = Phylum))+
geom_boxplot(alpha = 0.5, color = "darkgray")+
theme(legend.position = "none",
axis.text.x = element_text(angle = 45, hjust = 1))+
labs(title = "Fraction of protein-coding genes per phylum of Metazoans",
y = "Fraction of protein-coding genes")
In this measure we see that Chordata and Cnidaria have a lot of variation. Showing the data points along with the boxplot, allows us to see that we have a lot more information from Chordates and Arthropoda than the rest of the phyla, so we have to keep in mind that our data set is not representing equaly all the animal diversity, so we would need to increase the number of assemblies of the other groups to have more reliable conclusions.