Genome size vs. number of genes

Download assembly metadata of taxonomic groups

  1. Go to the home page of NCBI Datasets https://www.ncbi.nlm.nih.gov/datasets/
  2. Search “Metazoa”
  3. Click on “Browse all 15,582 genomes”
  4. Apply the filters:
  5. Select columns:
    • Assembly
    • Scientific name
    • Size (Mb)
    • Genes
    • Protein-coding
    • Pseudogenes
  6. Check the box to the left of “Assembly” to select all entries.
  7. Click on “Download/Download Table” and type “metazoa.tsv” for the name of the file and click download.
  8. Move the downloaded table to your workshop data directory.

Repeat the steps for the following groups of organisms: “Bacteria, Fungi, Archaea, Viridiplantae, Viruses”.

For Bacteria use the filters Reference genomes, Annotated by NCBI RefSeq, Assembly level: complete, Year released: 2018-2024,

For Fungi and Viridiplantae use only the filter: Annotated by NCBI RefSeq.

For Archaea use: Annotated by NCBI RefSeq and Assembly level: Chromosome to complete.

For Viruses use: Annotated by NCBI RefSeq, Assembly level: Complete, Year released: 2022-2024

You can download the tables from the following links: archaea.tsv, bacteria.tsv, fungi.tsv, metazoa.tsv, viridiplantae.tsv, viruses.tsv

Setting up R environment

Let’s start by making sure the working directory is our folder for this workshop and loading the tidyverse library.

library(tidyverse)

Import the assembly metadata

No we are going to import each table individually. It’s important that we put the kingdom name as the dataframe name.

Bacteria <- read.delim("data/bacteria.tsv", sep ="\t", )
Archaea <- read.delim("data/archaea.tsv", sep = "\t")
Metazoa <- read.delim("data/metazoa.tsv", sep = "\t")
Viridiplantae <- read.delim("data/viridiplantae.tsv", sep = "\t")
Fungi <- read.delim("data/fungi.tsv", sep = "\t")
Viruses <- read.delim("data/viruses.tsv", sep = "\t")

To join the dataframes into one we need first to create a list with. Which is an R data structure that hold elements that can be of different types and sizes. We are going to use the function lst() which creates a list and uses the name of the dataframe as the name of the dataframe element of the list.

dfs_list <- lst(Bacteria, Archaea, Metazoa, Viridiplantae, Fungi, Viruses)

If you click in the name of the list in your Environment panel you can see the structure of the list and its elements, that in this case are all dataframes.

Now we can give this list to the bind_rows() function to create a new dataframe with the rows of all the previous dataframes and a new column with the name of the dataframe of origin in every row.

assemblies <- bind_rows(dfs_list, .id = "Kingdom")
head(assemblies)
Kingdom Assembly.Accession Assembly.Name Organism.Name Assembly.Stats.Total.Sequence.Length Annotation.Count.Gene.Total Annotation.Count.Gene.Protein.coding Annotation.Count.Gene.Pseudogene
Bacteria GCF_000008865.2 ASM886v2 Escherichia coli O157:H7 str. Sakai 5594605 5417 5155 136
Bacteria GCF_006094375.1 ASM609437v1 Staphylococcus epidermidis 2491058 2368 2231 53
Bacteria GCF_019048385.1 ASM1904838v1 Pantoea agglomerans 4692018 4361 4204 43
Bacteria GCF_014131755.1 ASM1413175v1 Bacteroides thetaiotaomicron 6304193 4875 4713 75
Bacteria GCF_016026695.1 ASM1602669v1 Lactococcus garvieae 2084337 2110 1981 50
Bacteria GCF_900560965.1 PPRCHA0 Pseudomonas protegens CHA0 6868303 6270 6146 35

Now we can use the rename() function to give better names to the columns.

assemblies <- assemblies %>%
  rename("GenBank_accession" = "Assembly.Accession",
         "Genome_accession" = "Assembly.Name",
         "Organism" = "Organism.Name",
         "Size" = "Assembly.Stats.Total.Sequence.Length",
         "Genes" = "Annotation.Count.Gene.Total",
         "Protein_coding" = "Annotation.Count.Gene.Protein.coding",
         "Pseudogenes" = "Annotation.Count.Gene.Pseudogene")
str(assemblies)
'data.frame':   8770 obs. of  8 variables:
 $ Kingdom          : chr  "Bacteria" "Bacteria" "Bacteria" "Bacteria" ...
 $ GenBank_accession: chr  "GCF_000008865.2" "GCF_006094375.1" "GCF_019048385.1" "GCF_014131755.1" ...
 $ Genome_accession : chr  "ASM886v2" "ASM609437v1" "ASM1904838v1" "ASM1413175v1" ...
 $ Organism         : chr  "Escherichia coli O157:H7 str. Sakai" "Staphylococcus epidermidis" "Pantoea agglomerans" "Bacteroides thetaiotaomicron" ...
 $ Size             : num  5594605 2491058 4692018 6304193 2084337 ...
 $ Genes            : int  5417 2368 4361 4875 2110 6270 5334 3359 2067 1989 ...
 $ Protein_coding   : int  5155 2231 4204 4713 1981 6146 5109 3218 1957 1842 ...
 $ Pseudogenes      : int  136 53 43 75 50 35 77 48 45 60 ...

Since we downloaded the assembly metadata form database that might have many assemblies for each organism, lets remain only with one assembly per organism. Let’s use the function group_by() to create a group of rows of the same Organism, and the function slice() to keep only this first row in each group.

assemblies <- assemblies %>%
  group_by(Organism) %>%
  slice(1)

Now we can check the basic summary statistics for out entire data set.

summary(assemblies)
   Kingdom          GenBank_accession  Genome_accession     Organism        
 Length:8510        Length:8510        Length:8510        Length:8510       
 Class :character   Class :character   Class :character   Class :character  
 Mode  :character   Mode  :character   Mode  :character   Mode  :character  
                                                                            
                                                                            
                                                                            
                                                                            
      Size               Genes        Protein_coding    Pseudogenes     
 Min.   :9.910e+02   Min.   :     1   Min.   :     1   Min.   :    1.0  
 1st Qu.:1.503e+04   1st Qu.:     8   1st Qu.:     8   1st Qu.:   18.0  
 Median :2.608e+06   Median :  2512   Median :  2416   Median :   48.0  
 Mean   :1.392e+08   Mean   :  5421   Mean   :  4540   Mean   :  371.3  
 3rd Qu.:5.601e+06   3rd Qu.:  5059   3rd Qu.:  4878   3rd Qu.:  128.5  
 Max.   :4.005e+10   Max.   :186827   Max.   :103787   Max.   :33511.0  
                                                       NA's   :3463     

Genome length vs. protein-coding genes

Now that we have a bigger dataset than in the previous exercise, let’s see if we find the same tendencies in the relationship between genome size and number of protein-coding genes.

ggplot(assemblies, aes(x= Size, y = Protein_coding, color = Kingdom)) + 
  geom_point()+
  labs(title = "Relationship between genome size and number of protein-coding genes",
       x = "Genome size (bp)",
       y = "Number of protein-coding genes")

We can see that the tendency of plants having more genes and animals having larger genomes looks true with the bigger data set. But we don’t see it very well because we have extremely large genomes that make the scale of the plot too big. Let’s see how we can identify extreme genomes

Identifying extreme genomes

We can again use the group_by() function to group by Kingdom and mix it with the arrange() function, which reorders the rows in each group in ascending order of the specified variable. Finally we can slice() to get the first row of each arranged group. Note that if many rows had the same value of protein coding genes, since we are only picking one, we might be loosing genomes that also have the smallest number of genes.

fewest_genes <- assemblies %>%
  group_by(Kingdom) %>%
  arrange(Protein_coding)%>%
  slice(1)
fewest_genes
Kingdom GenBank_accession Genome_accession Organism Size Genes Protein_coding Pseudogenes
Archaea GCF_023169545.1 ASM2316954v1 Nanobdella aerobiophila 668961 765 719 NA
Bacteria GCF_964019745.1 ihAphAlni3.Zinderia_sp_1.1 Candidatus Zinderia endosymbiont of Aphrophora alni 336619 333 284 13
Fungi GCF_000803265.1 ASM80326v1 Ordospora colligata OC4 2290528 1879 1820 NA
Metazoa GCF_954871325.1 tfGorSpeb1-WG-v2 Gordionus sp. m RMFG-2023 287640266 10079 8321 33
Viridiplantae GCF_000733215.1 ASM73321v1 Auxenochlorella protothecoides 22924637 7016 7010 NA
Viruses GCF_023148805.1 ASM2314880v1 Achyranthes virus A 9491 1 1 NA

Do a little research about the species with the fewest genomes of each kingdom. Do they have something in common?

Now, how would you get the genomes with the largest number of protein-coding genes ? And the smallest genomes ?

Code
most_genes <- assemblies %>%
  group_by(Kingdom) %>%
  arrange(desc(Protein_coding))%>%
  slice(1)
Code
smallest_genomes <- assemblies %>%
  group_by(Kingdom) %>%
  arrange(Size)%>%
  slice(1)

To find the largest genomes, let’s not only get the single largest one but the 10 largest of each group. We can use the slice_head() function instead of slice() for this.

largest_genomes <- assemblies %>%
  group_by(Kingdom) %>%
  arrange(desc(Size))%>%
  slice_head(n = 10)
largest_genomes
Kingdom GenBank_accession Genome_accession Organism Size Genes Protein_coding Pseudogenes
Archaea GCF_020618475.1 ASM2061847v1 Haladaptatus halobius 7279389 7673 7335 275
Archaea GCF_020700235.1 ASM2070023v1 Haladaptatus pallidirubidus 6270521 6555 6279 209
Archaea GCF_029338335.1 ASM2933833v1 Halomontanus rarus 5998452 5559 5371 132
Archaea GCF_000007345.1 ASM734v1 Methanosarcina acetivorans C2A 5751492 4915 4690 157
Archaea GCF_019879105.1 ASM1987910v1 Natrinema sp. SYSU A 869 5674674 5608 4985 550
Archaea GCF_024266705.1 ASM2426670v1 Natrinema gelatinilyticum 5443578 5342 4952 334
Archaea GCF_000025325.1 ASM2532v1 Haloterrigena turkmenica DSM 5511 5440782 5281 5059 160
Archaea GCF_000970145.1 ASM97014v1 Methanosarcina siciliae C2J 5427890 4744 4444 233
Archaea GCF_013456555.2 ASM1345655v2 Natrinema zhouii 5336312 5348 5017 268
Archaea GCF_037335815.1 ASM3733581v1 Salinirubellus sp. SYNS196 5327573 5257 5069 133
Bacteria GCF_026965555.1 ASM2696555v1 Nannocystis poenicansa 13228663 10193 10074 19
Bacteria GCF_005144635.2 ASM514463v2 Polyangium aurulentum 12320079 9670 9573 18
Bacteria GCF_033097365.1 ASM3309736v1 Actinoplanes sichuanensis 12144028 11135 10892 145
Bacteria GCF_020215645.1 ASM2021564v1 Nonomuraea gerenzanensis 12066052 11223 11059 73
Bacteria GCF_025908395.1 ASM2590839v1 Streptomyces milbemycinicus 12063392 10157 9716 351
Bacteria GCF_035917475.1 ASM3591747v1 Nonomuraea fuscirosea 11975494 11059 10881 96
Bacteria GCF_019904175.1 ASM1990417v1 Dactylosporangium vinaceum 11959907 10781 10558 134
Bacteria GCF_017498545.1 ASM1749854v1 Sulfidibacter corallicola 11786365 6977 6895 12
Bacteria GCF_015689475.1 ASM1568947v1 Streptomyces solisilvae 11639536 9472 9008 373
Bacteria GCF_030294945.1 SGFS_1.0 Streptomyces graminofaciens 11602651 10235 9681 460
Fungi GCF_000439145.1 ASM43914v3 Rhizophagus irregularis DAOM 181602=DAOM 197198 136726313 26245 26143 3
Fungi GCF_000151645.1 ASM15164v1 Tuber melanosporum 124945702 7420 7420 NA
Fungi GCF_026914185.1 ASM2691418v1 Puccinia triticina 122823596 12736 12736 NA
Fungi GCF_015039405.1 Cananz1 Cantharellus anzutake 120173979 17292 16954 18
Fungi GCF_000204055.1 v1.0 Melampsora larici-populina 98AG31 101129028 16263 16255 8
Fungi GCF_021901695.1 Pst134E36_v1_pri Puccinia striiformis f. sp. tritici 89491239 17813 17118 NA
Fungi GCF_000149925.1 ASM14992v1 Puccinia graminis f. sp. tritici CRL 75-36-700-3 88644628 16308 15799 73
Fungi GCF_002865645.1 Melbi2 Hyaloscypha bicolor E 82384847 18640 18598 14
Fungi GCF_020360975.1 ASM2036097v1 Hirsutella rhossiliensis 81749301 12181 12053 NA
Fungi GCF_016647785.1 Suifus1 Suillus fuscotomentosus 79655913 18660 18510 3
Metazoa GCF_019279795.1 PAN1.0 Protopterus annectens 40054324612 30570 22039 1966
Metazoa GCF_027579735.1 aBomBom1.pri Bombina bombina 10019663791 31613 22468 1998
Metazoa GCF_023864345.2 iqSchSeri2.2 Schistocerca serialis cubense 9082806520 75810 17237 7394
Metazoa GCF_021461395.2 iqSchAmer2.1 Schistocerca americana 8990354198 81274 17662 7227
Metazoa GCF_028390025.1 aPseCor3.hap2 Pseudophryne corroboree 8872758793 186827 24591 33511
Metazoa GCF_023898315.1 iqSchNite1.1 Schistocerca nitens 8822555691 72560 17500 6288
Metazoa GCF_021461385.2 iqSchPice1.1 Schistocerca piceifrons 8742443784 96806 17490 10058
Metazoa GCF_023897955.1 iqSchGreg1.2 Schistocerca gregaria 8742418402 99467 19799 3862
Metazoa GCF_023864275.1 iqSchCanc2.1 Schistocerca cancellata 8536877781 103533 16907 6571
Metazoa GCF_901001135.1 aRhiBiv1.1 Rhinatrema bivittatum 5319222779 28114 21554 620
Viridiplantae GCF_018294505.1 IWGSC CS RefSeq v2.1 Triticum aestivum 14566502436 155463 103787 19065
Viridiplantae GCF_002162155.2 WEW_v2.1 Triticum dicoccoides 10677097747 102886 66764 14884
Viridiplantae GCF_030272615.1 Sugi_1.0 Cryptomeria japonica 9046532310 61123 41227 2981
Viridiplantae GCF_003073215.2 Tu2.1 Triticum urartu 4849077157 50810 35768 6509
Viridiplantae GCF_904849725.1 MorexV3_pseudomolecules_assembly Hordeum vulgare subsp. vulgare 4225577519 58438 31448 5778
Viridiplantae GCF_002575655.2 Aet v5.0 Aegilops tauschii subsp. strangulata 4218064899 62260 45657 5195
Viridiplantae GCF_024323335.1 CAAS_Psat_ZW6_1.0 Pisum sativum 3796066963 65672 40025 5386
Viridiplantae GCF_000715135.1 Ntab-TN90 Nicotiana tabacum 3642884816 74273 61780 3403
Viridiplantae GCF_002878395.1 UCD10Xv1.1 Capsicum annuum 3211819707 48341 32202 5748
Viridiplantae GCF_004153795.1 AHAU_CSS_1 Camellia sinensis 3105212962 68531 50838 6195
Viruses GCF_029891415.1 ASM2989141v1 Cotonvirus japonicus 1476527 1309 1306 NA
Viruses GCF_002966405.1 ASM296640v1 Bodo saltans virus 1385869 1207 1207 NA
Viruses GCF_002966375.1 ASM296637v1 Acanthamoeba polyphaga mimivirus 1221932 935 930 NA
Viruses GCF_002924545.1 IIT1_a5assembly Powai lake megavirus 1208707 996 996 NA
Viruses GCF_004156295.1 ASM415629v1 Moumouvirus australiensis 1098002 908 899 NA
Viruses GCF_002966435.1 ASM296643v1 Moumouvirus goulette 1016844 975 970 NA
Viruses GCF_003057795.1 ASM305779v1 Tetraselmis virus 1 668031 663 653 NA
Viruses GCF_006547245.1 ASM654724v1 Chrysochromulina parva virus 437255 45 45 NA
Viruses GCF_029882485.1 ASM2988248v1 Acanthamoeba castellanii medusavirus 381277 463 461 NA
Viruses GCF_002624965.1 ASM262496v1 Pseudomonas phage Phabio 309157 471 468 NA

In Metazoans we can see that there is a very big jump in sizes from 5.3Gb to 8.5Gb and in Viridiplantae there is a bigger jump from 4.8Gb to 9Gb. To have a better view of our graph let’s keep the genomes with a size smaller or equal (<=) than 8.5Gb.

filtered_assemblies <- assemblies %>%
  filter(Size <= 8500000000)
ggplot(filtered_assemblies, aes(x= Size, y = Protein_coding, color = Kingdom)) + 
  geom_point()+
  labs(title = "Relationship between genome size and number of protein-coding genes",
       x = "Genome size (bp)",
       y = "Number of protein-coding genes")+
  theme(axis.text.x = element_text(angle = 45, hjust = 1))+
  scale_y_continuous(labels = scales::comma)+
  scale_x_continuous(labels = scales::comma)

Now we have a better view of the tendencies! At least in Metazoans and Viridiplantae. We can also use a logarithmic scale to improve the view.

ggplot(filtered_assemblies, aes(x= Size, y = Protein_coding, color = Kingdom)) + 
  geom_point()+
  labs(title = "Relationship between genome size and number of protein-coding genes in Log10 scale",
       x = "Genome size (bp)",
       y = "Number of protein-coding genes")+
  theme(axis.text.x = element_text(angle = 45, hjust = 1))+
  scale_y_log10(labels = scales::comma)+
  scale_x_log10(labels = scales::comma)

Or we can separate each kingdom into its own panel to see the tendencies in a proper scale for each group.

ggplot(filtered_assemblies, aes(x= Size, y = Protein_coding, color = Kingdom)) + 
  geom_point()+
  labs(title = "Relationship between genome size and number of protein-coding genes",
       x = "Genome size (bp)",
       y = "Number of protein-coding genes")+
  facet_wrap(~Kingdom, scales = "free")+
  theme(legend.position = "none",
        axis.text.x = element_text(angle = 45, hjust = 1))+
  scale_y_continuous(labels = scales::comma)+
  scale_x_continuous(labels = scales::comma)

Here we have a good view of the distribution of our data in each group, but the Viruses group seems to have a few very large genomes. When we filtered before we only considered the size, because we were using a value that was a proper threshold to remove both the animals and plants with very large genomes. In our last plot we can see that the jump of genome size in the Viruses is around the 500,000 bp. But we can not use that as a threshold in our entire data set. Instead, we are going to get a new dataframe big_viruses with the viruses that we want to remove. Then we will filter our main data set to keep only the organisms that are not (!) in (%in%) the column Organisms of the big_viruses dataframe (big_viruses$Organism).

big_viruses <- filtered_assemblies %>%
  filter(Kingdom == "Viruses", Size > 500000)
filtered_assemblies <- filtered_assemblies %>%
  filter(!Organism %in% big_viruses$Organism)
ggplot(filtered_assemblies, aes(x= Size, y = Protein_coding, color = Kingdom)) + 
  geom_point()+
  labs(title = "Relationship between genome size and number of protein-coding genes",
       x = "Genome size (bp)",
       y = "Number of protein-coding genes")+
  facet_wrap(~Kingdom, scales = "free")+
  theme(legend.position = "none",
        axis.text.x = element_text(angle = 45, hjust = 1))+
  scale_y_continuous(labels = scales::comma)+
  scale_x_continuous(labels = scales::comma)

Fraction of protein-coding genes vs. total genes

In the previous episode we saw that Viruses had only protein-coding genes, Archea, Bacteria and Fungi had a high proportion and plants and animals a lower proportion. Let’s see if that holds true with our large data set.

First let’s calculate the fraction of protein-coding genes.

filtered_assemblies <- filtered_assemblies %>%
  mutate(Protein_coding_fraction = Protein_coding / Genes)

Since now we have a lot of data points it will be useful to plot the summary statistics as a boxplot, instead of a bar for each genome.

ggplot(filtered_assemblies, aes(Kingdom, Protein_coding_fraction, color = Kingdom))+
  theme(legend.position = "none")+
  geom_boxplot()

This is a very useful way to show the broad patterns of the data, and we can complement it with the raw data simultaneously. In ggplot we can add multiple representations of the data with different “geoms”. Let’s install and load the library ggbeeswarm to be able to use the geom_quasirandom(), which is plotted along the boxplot to show all the data points in an ordered way to be able to see their distribution.

library(ggbeeswarm)
ggplot(filtered_assemblies, aes(Kingdom, Protein_coding_fraction, color = Kingdom))+
  theme(legend.position = "none")+
  geom_quasirandom()+
  geom_boxplot()

When using more than one geom, if they both use the same aes() parameters, those can be placed in the pain ggplot() function. If not, the independent aes() and other parameters go in the specific geom layer.

ggplot(filtered_assemblies, aes(Kingdom, Protein_coding_fraction))+
  theme(legend.position = "none")+
  geom_quasirandom(aes(color = Kingdom))+
  geom_boxplot(alpha = 0.5, color = "darkgray")+
  labs(title = "Fraction of protein-coding genes in each kingdom",
       y = "Fracion of protein-coding genes")

Import taxonomic data of metazoans

With the previous plots we see that metazoans have a very large range of variation both in the genome size vs. protein-coding gene relationship and the fraction of protein-coding genes. To see how is this explained by the different groups of animals, let’s get the taxonomic information. This table was downloaded from NCBI and it is available at taxonomy_metazoa.tsv. Download it and save it in your data/ directory.

taxonomy_metazoa <- read.delim("data/taxonomy_metazoa.tsv", na = c("NA", ""))
head(taxonomy_metazoa)
Query Taxid Tax.name Authority Rank Basionym Basionym.authority Curator.common.name Has.type.material Group.name Superkingdom.name Superkingdom.taxid Kingdom.name Kingdom.taxid Phylum.name Phylum.taxid Class.name Class.taxid Order.name Order.taxid Family.name Family.taxid Genus.name Genus.taxid Species.name Species.taxid
Homo sapiens 9606 Homo sapiens Linnaeus, 1758 SPECIES NA NA human no primates Eukaryota 2759 Metazoa 33208 Chordata 7711 Mammalia 40674 Primates 9443 Hominidae 9604 Homo 9605 Homo sapiens 9606
Mus musculus 10090 Mus musculus Linnaeus, 1758 SPECIES NA NA house mouse no rodents Eukaryota 2759 Metazoa 33208 Chordata 7711 Mammalia 40674 Rodentia 9989 Muridae 10066 Mus 10088 Mus musculus 10090
Drosophila melanogaster 7227 Drosophila melanogaster Meigen, 1830 SPECIES NA NA fruit fly no flies Eukaryota 2759 Metazoa 33208 Arthropoda 6656 Insecta 50557 Diptera 7147 Drosophilidae 7214 Drosophila 7215 Drosophila melanogaster 7227
Danio rerio 7955 Danio rerio (Hamilton, 1822) SPECIES Cyprinus rerio Hamilton, 1822 zebrafish no bony fishes Eukaryota 2759 Metazoa 33208 Chordata 7711 Actinopteri 186623 Cypriniformes 7952 Danionidae 2743709 Danio 7954 Danio rerio 7955
Bubalus bubalis 89462 Bubalus bubalis NA SPECIES Bos bubalis Linnaeus, 1758 water buffalo no even-toed ungulates & whales Eukaryota 2759 Metazoa 33208 Chordata 7711 Mammalia 40674 Artiodactyla 91561 Bovidae 9895 Bubalus 9918 Bubalus bubalis 89462
Helicoverpa armigera 29058 Helicoverpa armigera (Hubner, 1808) SPECIES Noctua armigera Hübner, 1805 cotton bollworm no moths Eukaryota 2759 Metazoa 33208 Arthropoda 6656 Insecta 50557 Lepidoptera 7088 Noctuidae 7100 Helicoverpa 7112 Helicoverpa armigera 29058

Let’s select only the columns with the taxonomy information. The function select() lets you do this while renaming them. We will join this table with our dataframe, so it will be easier if we use the same name for the column they have in common, which is “Organism”.

taxonomy_metazoa_selected <- taxonomy_metazoa %>%
  select(Organism = Query,
         Common_name = Curator.common.name,
         Phylum = Phylum.name,
         Class = Class.name,
         Order = Order.name,
         Family = Family.name,
         Genus = Genus.name,
         Species = Species.name)
head(taxonomy_metazoa_selected)
Organism Common_name Phylum Class Order Family Genus Species
Homo sapiens human Chordata Mammalia Primates Hominidae Homo Homo sapiens
Mus musculus house mouse Chordata Mammalia Rodentia Muridae Mus Mus musculus
Drosophila melanogaster fruit fly Arthropoda Insecta Diptera Drosophilidae Drosophila Drosophila melanogaster
Danio rerio zebrafish Chordata Actinopteri Cypriniformes Danionidae Danio Danio rerio
Bubalus bubalis water buffalo Chordata Mammalia Artiodactyla Bovidae Bubalus Bubalus bubalis
Helicoverpa armigera cotton bollworm Arthropoda Insecta Lepidoptera Noctuidae Helicoverpa Helicoverpa armigera

We will join our filtered_assemblies dataframe with the new taxonomy_metazoa_selected dataframe. We want only the rows that have information on both tables. By using the function inner_join() we will put the columns of both dataframes in a single one and the rows will be matched by the column “Organism”. There are multiple joining functions in tidyverse, inner_join() makes sure that only the organisms (rows) that are present in both tables go to the resulting one.

metazoa_assembly_tax <-inner_join(filtered_assemblies,taxonomy_metazoa_selected, by = "Organism")
head(metazoa_assembly_tax)
Kingdom GenBank_accession Genome_accession Organism Size Genes Protein_coding Pseudogenes Protein_coding_fraction Common_name Phylum Class Order Family Genus Species
Metazoa GCF_021347895.1 KAUST_Apoly_ChrSc Acanthochromis polyacanthus 956771783 30251 23921 328 0.7907507 spiny chromis Chordata Actinopteri NA Pomacentridae Acanthochromis Acanthochromis polyacanthus
Metazoa GCF_904848185.1 fAcaLat1.1 Acanthopagrus latus 685127588 30246 23786 189 0.7864180 yellowfin seabream Chordata Actinopteri Spariformes Sparidae Acanthopagrus Acanthopagrus latus
Metazoa GCF_929443795.1 bAccGen1.1 Accipiter gentilis 1398011397 19564 16976 165 0.8677162 Northern goshawk Chordata Aves Accipitriformes Accipitridae Accipiter Accipiter gentilis
Metazoa GCF_027475565.1 VMU_Ajub_asm_v1.0 Acinonyx jubatus 2377417351 33469 19999 3672 0.5975380 cheetah Chordata Mammalia Carnivora Felidae Acinonyx Acinonyx jubatus
Metazoa GCF_902713425.1 fAciRut3.2 maternal haplotype Acipenser ruthenus 1899794300 80041 38165 6639 0.4768181 sterlet Chordata Actinopteri Acipenseriformes Acipenseridae Acipenser Acipenser ruthenus
Metazoa GCF_903995435.1 mAcoRus1.1 Acomys russatus 2301522284 29331 20811 5096 0.7095223 golden spiny mouse Chordata Mammalia Rodentia Muridae Acomys Acomys russatus

Let’s repeat our plot of genome length vs protein-coding genes for our Metazoa data set.

ggplot(metazoa_assembly_tax) + 
  geom_point(aes(x= Size, y = Protein_coding, color = Phylum))+
  scale_x_continuous(label=scales::comma)+
  scale_y_continuous(label=scales::comma)+
  theme(axis.text.x = element_text(angle = 45, hjust = 1))+
  labs(title = "Genome length vs protein-coding genes of Metazoa",
       x = "Genome size (bp)",
       y = "Number of protein-coding genes")

We can see that Chordata is one of the phyla with more data points and it has a lot of variation. Let’s see the protein coding fraction as well.

ggplot(metazoa_assembly_tax, aes(Phylum, Protein_coding_fraction))+
  geom_quasirandom(aes(color = Phylum))+
  geom_boxplot(alpha = 0.5, color = "darkgray")+
  theme(legend.position = "none",
        axis.text.x = element_text(angle = 45, hjust = 1))+
  labs(title = "Fraction of protein-coding genes per phylum of Metazoans",
       y = "Fraction of protein-coding genes")

In this measure we see that Chordata and Cnidaria have a lot of variation. Showing the data points along with the boxplot, allows us to see that we have a lot more information from Chordates and Arthropoda than the rest of the phyla, so we have to keep in mind that our data set is not representing equaly all the animal diversity, so we would need to increase the number of assemblies of the other groups to have more reliable conclusions.