Genome size vs. number of genes

Download assembly metadata of taxonomic groups

Go to the home page of NCBI Datasets https://www.ncbi.nlm.nih.gov/datasets/
Search “Metazoa”
Click on “Browse all 15,582 genomes”
Apply the filters:
Select columns:
- Assembly
- Scientific name
- Size (Mb)
- Genes
- Protein-coding
- Pseudogenes
Check the box to the left of “Assembly” to select all entries.
Click on “Download/Download Table” and type “metazoa.tsv” for the name of the file and click download.
Move the downloaded table to your workshop data directory.

Repeat the steps for the following groups of organisms: “Bacteria, Fungi, Archaea, Viridiplantae, Viruses”.

For Bacteria use the filters Reference genomes, Annotated by NCBI RefSeq, Assembly level: complete, Year released: 2018-2024,

For Fungi and Viridiplantae use only the filter: Annotated by NCBI RefSeq.

For Archaea use: Annotated by NCBI RefSeq and Assembly level: Chromosome to complete.

For Viruses use: Annotated by NCBI RefSeq, Assembly level: Complete, Year released: 2022-2024

You can download the tables from the following links: archaea.tsv, bacteria.tsv, fungi.tsv, metazoa.tsv, viridiplantae.tsv, viruses.tsv

Setting up R environment

Let’s start by making sure the working directory is our folder for this workshop and loading the tidyverse library.

library(tidyverse)

Import the assembly metadata

No we are going to import each table individually. It’s important that we put the kingdom name as the dataframe name.

Bacteria <- read.delim("data/bacteria.tsv", sep ="\t", )
Archaea <- read.delim("data/archaea.tsv", sep = "\t")
Metazoa <- read.delim("data/metazoa.tsv", sep = "\t")
Viridiplantae <- read.delim("data/viridiplantae.tsv", sep = "\t")
Fungi <- read.delim("data/fungi.tsv", sep = "\t")
Viruses <- read.delim("data/viruses.tsv", sep = "\t")

To join the dataframes into one we need first to create a list with. Which is an R data structure that hold elements that can be of different types and sizes. We are going to use the function lst() which creates a list and uses the name of the dataframe as the name of the dataframe element of the list.

dfs_list <- lst(Bacteria, Archaea, Metazoa, Viridiplantae, Fungi, Viruses)

If you click in the name of the list in your Environment panel you can see the structure of the list and its elements, that in this case are all dataframes.

Now we can give this list to the bind_rows() function to create a new dataframe with the rows of all the previous dataframes and a new column with the name of the dataframe of origin in every row.

assemblies <- bind_rows(dfs_list, .id = "Kingdom")
head(assemblies)

Kingdom	Assembly.Accession	Assembly.Name	Organism.Name	Assembly.Stats.Total.Sequence.Length	Annotation.Count.Gene.Total	Annotation.Count.Gene.Protein.coding	Annotation.Count.Gene.Pseudogene
Bacteria	GCF_000008865.2	ASM886v2	Escherichia coli O157:H7 str. Sakai	5594605	5417	5155	136
Bacteria	GCF_006094375.1	ASM609437v1	Staphylococcus epidermidis	2491058	2368	2231	53
Bacteria	GCF_019048385.1	ASM1904838v1	Pantoea agglomerans	4692018	4361	4204	43
Bacteria	GCF_014131755.1	ASM1413175v1	Bacteroides thetaiotaomicron	6304193	4875	4713	75
Bacteria	GCF_016026695.1	ASM1602669v1	Lactococcus garvieae	2084337	2110	1981	50
Bacteria	GCF_900560965.1	PPRCHA0	Pseudomonas protegens CHA0	6868303	6270	6146	35

Now we can use the rename() function to give better names to the columns.

assemblies <- assemblies %>%
  rename("GenBank_accession" = "Assembly.Accession",
         "Genome_accession" = "Assembly.Name",
         "Organism" = "Organism.Name",
         "Size" = "Assembly.Stats.Total.Sequence.Length",
         "Genes" = "Annotation.Count.Gene.Total",
         "Protein_coding" = "Annotation.Count.Gene.Protein.coding",
         "Pseudogenes" = "Annotation.Count.Gene.Pseudogene")
str(assemblies)

'data.frame':   8770 obs. of  8 variables:
 $ Kingdom          : chr  "Bacteria" "Bacteria" "Bacteria" "Bacteria" ...
 $ GenBank_accession: chr  "GCF_000008865.2" "GCF_006094375.1" "GCF_019048385.1" "GCF_014131755.1" ...
 $ Genome_accession : chr  "ASM886v2" "ASM609437v1" "ASM1904838v1" "ASM1413175v1" ...
 $ Organism         : chr  "Escherichia coli O157:H7 str. Sakai" "Staphylococcus epidermidis" "Pantoea agglomerans" "Bacteroides thetaiotaomicron" ...
 $ Size             : num  5594605 2491058 4692018 6304193 2084337 ...
 $ Genes            : int  5417 2368 4361 4875 2110 6270 5334 3359 2067 1989 ...
 $ Protein_coding   : int  5155 2231 4204 4713 1981 6146 5109 3218 1957 1842 ...
 $ Pseudogenes      : int  136 53 43 75 50 35 77 48 45 60 ...

Since we downloaded the assembly metadata form database that might have many assemblies for each organism, lets remain only with one assembly per organism. Let’s use the function group_by() to create a group of rows of the same Organism, and the function slice() to keep only this first row in each group.

assemblies <- assemblies %>%
  group_by(Organism) %>%
  slice(1)

Now we can check the basic summary statistics for out entire data set.

summary(assemblies)

   Kingdom          GenBank_accession  Genome_accession     Organism        
 Length:8510        Length:8510        Length:8510        Length:8510       
 Class :character   Class :character   Class :character   Class :character  
 Mode  :character   Mode  :character   Mode  :character   Mode  :character  
                                                                            
                                                                            
                                                                            
                                                                            
      Size               Genes        Protein_coding    Pseudogenes     
 Min.   :9.910e+02   Min.   :     1   Min.   :     1   Min.   :    1.0  
 1st Qu.:1.503e+04   1st Qu.:     8   1st Qu.:     8   1st Qu.:   18.0  
 Median :2.608e+06   Median :  2512   Median :  2416   Median :   48.0  
 Mean   :1.392e+08   Mean   :  5421   Mean   :  4540   Mean   :  371.3  
 3rd Qu.:5.601e+06   3rd Qu.:  5059   3rd Qu.:  4878   3rd Qu.:  128.5  
 Max.   :4.005e+10   Max.   :186827   Max.   :103787   Max.   :33511.0  
                                                       NA's   :3463

Genome length vs. protein-coding genes

Now that we have a bigger dataset than in the previous exercise, let’s see if we find the same tendencies in the relationship between genome size and number of protein-coding genes.

ggplot(assemblies, aes(x= Size, y = Protein_coding, color = Kingdom)) + 
  geom_point()+
  labs(title = "Relationship between genome size and number of protein-coding genes",
       x = "Genome size (bp)",
       y = "Number of protein-coding genes")

We can see that the tendency of plants having more genes and animals having larger genomes looks true with the bigger data set. But we don’t see it very well because we have extremely large genomes that make the scale of the plot too big. Let’s see how we can identify extreme genomes

Identifying extreme genomes

We can again use the group_by() function to group by Kingdom and mix it with the arrange() function, which reorders the rows in each group in ascending order of the specified variable. Finally we can slice() to get the first row of each arranged group. Note that if many rows had the same value of protein coding genes, since we are only picking one, we might be loosing genomes that also have the smallest number of genes.

fewest_genes <- assemblies %>%
  group_by(Kingdom) %>%
  arrange(Protein_coding)%>%
  slice(1)
fewest_genes

Kingdom	GenBank_accession	Genome_accession	Organism	Size	Genes	Protein_coding	Pseudogenes
Archaea	GCF_023169545.1	ASM2316954v1	Nanobdella aerobiophila	668961	765	719	NA
Bacteria	GCF_964019745.1	ihAphAlni3.Zinderia_sp_1.1	Candidatus Zinderia endosymbiont of Aphrophora alni	336619	333	284	13
Fungi	GCF_000803265.1	ASM80326v1	Ordospora colligata OC4	2290528	1879	1820	NA
Metazoa	GCF_954871325.1	tfGorSpeb1-WG-v2	Gordionus sp. m RMFG-2023	287640266	10079	8321	33
Viridiplantae	GCF_000733215.1	ASM73321v1	Auxenochlorella protothecoides	22924637	7016	7010	NA
Viruses	GCF_023148805.1	ASM2314880v1	Achyranthes virus A	9491	1	1	NA

Do a little research about the species with the fewest genomes of each kingdom. Do they have something in common?

Now, how would you get the genomes with the largest number of protein-coding genes ? And the smallest genomes ?

Code

most_genes <- assemblies %>%
  group_by(Kingdom) %>%
  arrange(desc(Protein_coding))%>%
  slice(1)

Code

smallest_genomes <- assemblies %>%
  group_by(Kingdom) %>%
  arrange(Size)%>%
  slice(1)

To find the largest genomes, let’s not only get the single largest one but the 10 largest of each group. We can use the slice_head() function instead of slice() for this.

largest_genomes <- assemblies %>%
  group_by(Kingdom) %>%
  arrange(desc(Size))%>%
  slice_head(n = 10)
largest_genomes

Kingdom	GenBank_accession	Genome_accession	Organism	Size	Genes	Protein_coding	Pseudogenes
Archaea	GCF_020618475.1	ASM2061847v1	Haladaptatus halobius	7279389	7673	7335	275
Archaea	GCF_020700235.1	ASM2070023v1	Haladaptatus pallidirubidus	6270521	6555	6279	209
Archaea	GCF_029338335.1	ASM2933833v1	Halomontanus rarus	5998452	5559	5371	132
Archaea	GCF_000007345.1	ASM734v1	Methanosarcina acetivorans C2A	5751492	4915	4690	157
Archaea	GCF_019879105.1	ASM1987910v1	Natrinema sp. SYSU A 869	5674674	5608	4985	550
Archaea	GCF_024266705.1	ASM2426670v1	Natrinema gelatinilyticum	5443578	5342	4952	334
Archaea	GCF_000025325.1	ASM2532v1	Haloterrigena turkmenica DSM 5511	5440782	5281	5059	160
Archaea	GCF_000970145.1	ASM97014v1	Methanosarcina siciliae C2J	5427890	4744	4444	233
Archaea	GCF_013456555.2	ASM1345655v2	Natrinema zhouii	5336312	5348	5017	268
Archaea	GCF_037335815.1	ASM3733581v1	Salinirubellus sp. SYNS196	5327573	5257	5069	133
Bacteria	GCF_026965555.1	ASM2696555v1	Nannocystis poenicansa	13228663	10193	10074	19
Bacteria	GCF_005144635.2	ASM514463v2	Polyangium aurulentum	12320079	9670	9573	18
Bacteria	GCF_033097365.1	ASM3309736v1	Actinoplanes sichuanensis	12144028	11135	10892	145
Bacteria	GCF_020215645.1	ASM2021564v1	Nonomuraea gerenzanensis	12066052	11223	11059	73
Bacteria	GCF_025908395.1	ASM2590839v1	Streptomyces milbemycinicus	12063392	10157	9716	351
Bacteria	GCF_035917475.1	ASM3591747v1	Nonomuraea fuscirosea	11975494	11059	10881	96
Bacteria	GCF_019904175.1	ASM1990417v1	Dactylosporangium vinaceum	11959907	10781	10558	134
Bacteria	GCF_017498545.1	ASM1749854v1	Sulfidibacter corallicola	11786365	6977	6895	12
Bacteria	GCF_015689475.1	ASM1568947v1	Streptomyces solisilvae	11639536	9472	9008	373
Bacteria	GCF_030294945.1	SGFS_1.0	Streptomyces graminofaciens	11602651	10235	9681	460
Fungi	GCF_000439145.1	ASM43914v3	Rhizophagus irregularis DAOM 181602=DAOM 197198	136726313	26245	26143	3
Fungi	GCF_000151645.1	ASM15164v1	Tuber melanosporum	124945702	7420	7420	NA
Fungi	GCF_026914185.1	ASM2691418v1	Puccinia triticina	122823596	12736	12736	NA
Fungi	GCF_015039405.1	Cananz1	Cantharellus anzutake	120173979	17292	16954	18
Fungi	GCF_000204055.1	v1.0	Melampsora larici-populina 98AG31	101129028	16263	16255	8
Fungi	GCF_021901695.1	Pst134E36_v1_pri	Puccinia striiformis f. sp. tritici	89491239	17813	17118	NA
Fungi	GCF_000149925.1	ASM14992v1	Puccinia graminis f. sp. tritici CRL 75-36-700-3	88644628	16308	15799	73
Fungi	GCF_002865645.1	Melbi2	Hyaloscypha bicolor E	82384847	18640	18598	14
Fungi	GCF_020360975.1	ASM2036097v1	Hirsutella rhossiliensis	81749301	12181	12053	NA
Fungi	GCF_016647785.1	Suifus1	Suillus fuscotomentosus	79655913	18660	18510	3
Metazoa	GCF_019279795.1	PAN1.0	Protopterus annectens	40054324612	30570	22039	1966
Metazoa	GCF_027579735.1	aBomBom1.pri	Bombina bombina	10019663791	31613	22468	1998
Metazoa	GCF_023864345.2	iqSchSeri2.2	Schistocerca serialis cubense	9082806520	75810	17237	7394
Metazoa	GCF_021461395.2	iqSchAmer2.1	Schistocerca americana	8990354198	81274	17662	7227
Metazoa	GCF_028390025.1	aPseCor3.hap2	Pseudophryne corroboree	8872758793	186827	24591	33511
Metazoa	GCF_023898315.1	iqSchNite1.1	Schistocerca nitens	8822555691	72560	17500	6288
Metazoa	GCF_021461385.2	iqSchPice1.1	Schistocerca piceifrons	8742443784	96806	17490	10058
Metazoa	GCF_023897955.1	iqSchGreg1.2	Schistocerca gregaria	8742418402	99467	19799	3862
Metazoa	GCF_023864275.1	iqSchCanc2.1	Schistocerca cancellata	8536877781	103533	16907	6571
Metazoa	GCF_901001135.1	aRhiBiv1.1	Rhinatrema bivittatum	5319222779	28114	21554	620
Viridiplantae	GCF_018294505.1	IWGSC CS RefSeq v2.1	Triticum aestivum	14566502436	155463	103787	19065
Viridiplantae	GCF_002162155.2	WEW_v2.1	Triticum dicoccoides	10677097747	102886	66764	14884
Viridiplantae	GCF_030272615.1	Sugi_1.0	Cryptomeria japonica	9046532310	61123	41227	2981
Viridiplantae	GCF_003073215.2	Tu2.1	Triticum urartu	4849077157	50810	35768	6509
Viridiplantae	GCF_904849725.1	MorexV3_pseudomolecules_assembly	Hordeum vulgare subsp. vulgare	4225577519	58438	31448	5778
Viridiplantae	GCF_002575655.2	Aet v5.0	Aegilops tauschii subsp. strangulata	4218064899	62260	45657	5195
Viridiplantae	GCF_024323335.1	CAAS_Psat_ZW6_1.0	Pisum sativum	3796066963	65672	40025	5386
Viridiplantae	GCF_000715135.1	Ntab-TN90	Nicotiana tabacum	3642884816	74273	61780	3403
Viridiplantae	GCF_002878395.1	UCD10Xv1.1	Capsicum annuum	3211819707	48341	32202	5748
Viridiplantae	GCF_004153795.1	AHAU_CSS_1	Camellia sinensis	3105212962	68531	50838	6195
Viruses	GCF_029891415.1	ASM2989141v1	Cotonvirus japonicus	1476527	1309	1306	NA
Viruses	GCF_002966405.1	ASM296640v1	Bodo saltans virus	1385869	1207	1207	NA
Viruses	GCF_002966375.1	ASM296637v1	Acanthamoeba polyphaga mimivirus	1221932	935	930	NA
Viruses	GCF_002924545.1	IIT1_a5assembly	Powai lake megavirus	1208707	996	996	NA
Viruses	GCF_004156295.1	ASM415629v1	Moumouvirus australiensis	1098002	908	899	NA
Viruses	GCF_002966435.1	ASM296643v1	Moumouvirus goulette	1016844	975	970	NA
Viruses	GCF_003057795.1	ASM305779v1	Tetraselmis virus 1	668031	663	653	NA
Viruses	GCF_006547245.1	ASM654724v1	Chrysochromulina parva virus	437255	45	45	NA
Viruses	GCF_029882485.1	ASM2988248v1	Acanthamoeba castellanii medusavirus	381277	463	461	NA
Viruses	GCF_002624965.1	ASM262496v1	Pseudomonas phage Phabio	309157	471	468	NA

In Metazoans we can see that there is a very big jump in sizes from 5.3Gb to 8.5Gb and in Viridiplantae there is a bigger jump from 4.8Gb to 9Gb. To have a better view of our graph let’s keep the genomes with a size smaller or equal (<=) than 8.5Gb.

filtered_assemblies <- assemblies %>%
  filter(Size <= 8500000000)

ggplot(filtered_assemblies, aes(x= Size, y = Protein_coding, color = Kingdom)) + 
  geom_point()+
  labs(title = "Relationship between genome size and number of protein-coding genes",
       x = "Genome size (bp)",
       y = "Number of protein-coding genes")+
  theme(axis.text.x = element_text(angle = 45, hjust = 1))+
  scale_y_continuous(labels = scales::comma)+
  scale_x_continuous(labels = scales::comma)

Now we have a better view of the tendencies! At least in Metazoans and Viridiplantae. We can also use a logarithmic scale to improve the view.

ggplot(filtered_assemblies, aes(x= Size, y = Protein_coding, color = Kingdom)) + 
  geom_point()+
  labs(title = "Relationship between genome size and number of protein-coding genes in Log10 scale",
       x = "Genome size (bp)",
       y = "Number of protein-coding genes")+
  theme(axis.text.x = element_text(angle = 45, hjust = 1))+
  scale_y_log10(labels = scales::comma)+
  scale_x_log10(labels = scales::comma)

Or we can separate each kingdom into its own panel to see the tendencies in a proper scale for each group.

ggplot(filtered_assemblies, aes(x= Size, y = Protein_coding, color = Kingdom)) + 
  geom_point()+
  labs(title = "Relationship between genome size and number of protein-coding genes",
       x = "Genome size (bp)",
       y = "Number of protein-coding genes")+
  facet_wrap(~Kingdom, scales = "free")+
  theme(legend.position = "none",
        axis.text.x = element_text(angle = 45, hjust = 1))+
  scale_y_continuous(labels = scales::comma)+
  scale_x_continuous(labels = scales::comma)

Here we have a good view of the distribution of our data in each group, but the Viruses group seems to have a few very large genomes. When we filtered before we only considered the size, because we were using a value that was a proper threshold to remove both the animals and plants with very large genomes. In our last plot we can see that the jump of genome size in the Viruses is around the 500,000 bp. But we can not use that as a threshold in our entire data set. Instead, we are going to get a new dataframe big_viruses with the viruses that we want to remove. Then we will filter our main data set to keep only the organisms that are not (!) in (%in%) the column Organisms of the big_viruses dataframe (big_viruses$Organism).

big_viruses <- filtered_assemblies %>%
  filter(Kingdom == "Viruses", Size > 500000)
filtered_assemblies <- filtered_assemblies %>%
  filter(!Organism %in% big_viruses$Organism)

ggplot(filtered_assemblies, aes(x= Size, y = Protein_coding, color = Kingdom)) + 
  geom_point()+
  labs(title = "Relationship between genome size and number of protein-coding genes",
       x = "Genome size (bp)",
       y = "Number of protein-coding genes")+
  facet_wrap(~Kingdom, scales = "free")+
  theme(legend.position = "none",
        axis.text.x = element_text(angle = 45, hjust = 1))+
  scale_y_continuous(labels = scales::comma)+
  scale_x_continuous(labels = scales::comma)

Fraction of protein-coding genes vs. total genes

In the previous episode we saw that Viruses had only protein-coding genes, Archea, Bacteria and Fungi had a high proportion and plants and animals a lower proportion. Let’s see if that holds true with our large data set.

First let’s calculate the fraction of protein-coding genes.

filtered_assemblies <- filtered_assemblies %>%
  mutate(Protein_coding_fraction = Protein_coding / Genes)

Since now we have a lot of data points it will be useful to plot the summary statistics as a boxplot, instead of a bar for each genome.

ggplot(filtered_assemblies, aes(Kingdom, Protein_coding_fraction, color = Kingdom))+
  theme(legend.position = "none")+
  geom_boxplot()

This is a very useful way to show the broad patterns of the data, and we can complement it with the raw data simultaneously. In ggplot we can add multiple representations of the data with different “geoms”. Let’s install and load the library ggbeeswarm to be able to use the geom_quasirandom(), which is plotted along the boxplot to show all the data points in an ordered way to be able to see their distribution.

library(ggbeeswarm)
ggplot(filtered_assemblies, aes(Kingdom, Protein_coding_fraction, color = Kingdom))+
  theme(legend.position = "none")+
  geom_quasirandom()+
  geom_boxplot()

When using more than one geom, if they both use the same aes() parameters, those can be placed in the pain ggplot() function. If not, the independent aes() and other parameters go in the specific geom layer.

ggplot(filtered_assemblies, aes(Kingdom, Protein_coding_fraction))+
  theme(legend.position = "none")+
  geom_quasirandom(aes(color = Kingdom))+
  geom_boxplot(alpha = 0.5, color = "darkgray")+
  labs(title = "Fraction of protein-coding genes in each kingdom",
       y = "Fracion of protein-coding genes")

Import taxonomic data of metazoans

With the previous plots we see that metazoans have a very large range of variation both in the genome size vs. protein-coding gene relationship and the fraction of protein-coding genes. To see how is this explained by the different groups of animals, let’s get the taxonomic information. This table was downloaded from NCBI and it is available at taxonomy_metazoa.tsv. Download it and save it in your data/ directory.

taxonomy_metazoa <- read.delim("data/taxonomy_metazoa.tsv", na = c("NA", ""))
head(taxonomy_metazoa)

Query	Taxid	Tax.name	Authority	Rank	Basionym	Basionym.authority	Curator.common.name	Has.type.material	Group.name	Superkingdom.name	Superkingdom.taxid	Kingdom.name	Kingdom.taxid	Phylum.name	Phylum.taxid	Class.name	Class.taxid	Order.name	Order.taxid	Family.name	Family.taxid	Genus.name	Genus.taxid	Species.name	Species.taxid
Homo sapiens	9606	Homo sapiens	Linnaeus, 1758	SPECIES	NA	NA	human	no	primates	Eukaryota	2759	Metazoa	33208	Chordata	7711	Mammalia	40674	Primates	9443	Hominidae	9604	Homo	9605	Homo sapiens	9606
Mus musculus	10090	Mus musculus	Linnaeus, 1758	SPECIES	NA	NA	house mouse	no	rodents	Eukaryota	2759	Metazoa	33208	Chordata	7711	Mammalia	40674	Rodentia	9989	Muridae	10066	Mus	10088	Mus musculus	10090
Drosophila melanogaster	7227	Drosophila melanogaster	Meigen, 1830	SPECIES	NA	NA	fruit fly	no	flies	Eukaryota	2759	Metazoa	33208	Arthropoda	6656	Insecta	50557	Diptera	7147	Drosophilidae	7214	Drosophila	7215	Drosophila melanogaster	7227
Danio rerio	7955	Danio rerio	(Hamilton, 1822)	SPECIES	Cyprinus rerio	Hamilton, 1822	zebrafish	no	bony fishes	Eukaryota	2759	Metazoa	33208	Chordata	7711	Actinopteri	186623	Cypriniformes	7952	Danionidae	2743709	Danio	7954	Danio rerio	7955
Bubalus bubalis	89462	Bubalus bubalis	NA	SPECIES	Bos bubalis	Linnaeus, 1758	water buffalo	no	even-toed ungulates & whales	Eukaryota	2759	Metazoa	33208	Chordata	7711	Mammalia	40674	Artiodactyla	91561	Bovidae	9895	Bubalus	9918	Bubalus bubalis	89462
Helicoverpa armigera	29058	Helicoverpa armigera	(Hubner, 1808)	SPECIES	Noctua armigera	Hübner, 1805	cotton bollworm	no	moths	Eukaryota	2759	Metazoa	33208	Arthropoda	6656	Insecta	50557	Lepidoptera	7088	Noctuidae	7100	Helicoverpa	7112	Helicoverpa armigera	29058

Let’s select only the columns with the taxonomy information. The function select() lets you do this while renaming them. We will join this table with our dataframe, so it will be easier if we use the same name for the column they have in common, which is “Organism”.

taxonomy_metazoa_selected <- taxonomy_metazoa %>%
  select(Organism = Query,
         Common_name = Curator.common.name,
         Phylum = Phylum.name,
         Class = Class.name,
         Order = Order.name,
         Family = Family.name,
         Genus = Genus.name,
         Species = Species.name)
head(taxonomy_metazoa_selected)

Organism	Common_name	Phylum	Class	Order	Family	Genus	Species
Homo sapiens	human	Chordata	Mammalia	Primates	Hominidae	Homo	Homo sapiens
Mus musculus	house mouse	Chordata	Mammalia	Rodentia	Muridae	Mus	Mus musculus
Drosophila melanogaster	fruit fly	Arthropoda	Insecta	Diptera	Drosophilidae	Drosophila	Drosophila melanogaster
Danio rerio	zebrafish	Chordata	Actinopteri	Cypriniformes	Danionidae	Danio	Danio rerio
Bubalus bubalis	water buffalo	Chordata	Mammalia	Artiodactyla	Bovidae	Bubalus	Bubalus bubalis
Helicoverpa armigera	cotton bollworm	Arthropoda	Insecta	Lepidoptera	Noctuidae	Helicoverpa	Helicoverpa armigera

We will join our filtered_assemblies dataframe with the new taxonomy_metazoa_selected dataframe. We want only the rows that have information on both tables. By using the function inner_join() we will put the columns of both dataframes in a single one and the rows will be matched by the column “Organism”. There are multiple joining functions in tidyverse, inner_join() makes sure that only the organisms (rows) that are present in both tables go to the resulting one.

metazoa_assembly_tax <-inner_join(filtered_assemblies,taxonomy_metazoa_selected, by = "Organism")
head(metazoa_assembly_tax)

Kingdom	GenBank_accession	Genome_accession	Organism	Size	Genes	Protein_coding	Pseudogenes	Protein_coding_fraction	Common_name	Phylum	Class	Order	Family	Genus	Species
Metazoa	GCF_021347895.1	KAUST_Apoly_ChrSc	Acanthochromis polyacanthus	956771783	30251	23921	328	0.7907507	spiny chromis	Chordata	Actinopteri	NA	Pomacentridae	Acanthochromis	Acanthochromis polyacanthus
Metazoa	GCF_904848185.1	fAcaLat1.1	Acanthopagrus latus	685127588	30246	23786	189	0.7864180	yellowfin seabream	Chordata	Actinopteri	Spariformes	Sparidae	Acanthopagrus	Acanthopagrus latus
Metazoa	GCF_929443795.1	bAccGen1.1	Accipiter gentilis	1398011397	19564	16976	165	0.8677162	Northern goshawk	Chordata	Aves	Accipitriformes	Accipitridae	Accipiter	Accipiter gentilis
Metazoa	GCF_027475565.1	VMU_Ajub_asm_v1.0	Acinonyx jubatus	2377417351	33469	19999	3672	0.5975380	cheetah	Chordata	Mammalia	Carnivora	Felidae	Acinonyx	Acinonyx jubatus
Metazoa	GCF_902713425.1	fAciRut3.2 maternal haplotype	Acipenser ruthenus	1899794300	80041	38165	6639	0.4768181	sterlet	Chordata	Actinopteri	Acipenseriformes	Acipenseridae	Acipenser	Acipenser ruthenus
Metazoa	GCF_903995435.1	mAcoRus1.1	Acomys russatus	2301522284	29331	20811	5096	0.7095223	golden spiny mouse	Chordata	Mammalia	Rodentia	Muridae	Acomys	Acomys russatus

Let’s repeat our plot of genome length vs protein-coding genes for our Metazoa data set.

ggplot(metazoa_assembly_tax) + 
  geom_point(aes(x= Size, y = Protein_coding, color = Phylum))+
  scale_x_continuous(label=scales::comma)+
  scale_y_continuous(label=scales::comma)+
  theme(axis.text.x = element_text(angle = 45, hjust = 1))+
  labs(title = "Genome length vs protein-coding genes of Metazoa",
       x = "Genome size (bp)",
       y = "Number of protein-coding genes")

We can see that Chordata is one of the phyla with more data points and it has a lot of variation. Let’s see the protein coding fraction as well.

ggplot(metazoa_assembly_tax, aes(Phylum, Protein_coding_fraction))+
  geom_quasirandom(aes(color = Phylum))+
  geom_boxplot(alpha = 0.5, color = "darkgray")+
  theme(legend.position = "none",
        axis.text.x = element_text(angle = 45, hjust = 1))+
  labs(title = "Fraction of protein-coding genes per phylum of Metazoans",
       y = "Fraction of protein-coding genes")

In this measure we see that Chordata and Cnidaria have a lot of variation. Showing the data points along with the boxplot, allows us to see that we have a lot more information from Chordates and Arthropoda than the rest of the phyla, so we have to keep in mind that our data set is not representing equaly all the animal diversity, so we would need to increase the number of assemblies of the other groups to have more reliable conclusions.