The 2019-nCoV coronavirus dominated the news cycle this week. As I watched the reporting unfold, one of the questions that kept coming up was, “What is this virus most like?” Reporters, investors, medical practitioners, and the general public were all trying to find some epidemiological point of comparison. In other words, what are the phylogenetic relationships of 2019-nCoV?

Since 2019-nCoV is a coronavirus, the first comparison that comes to mind is SARS. But we can do better than just making a guess!

In the early days of the outbreak, Chinese health officials released the entire genome of the virus. That opens up the door to approximating the phylogenetic relationships between 2019-nCoV and some of the world’s other well-known virulent viral bugs.

Background on the Wuhan Coronavirus 2019-nCoV

Officially designated 2019-nCoV by the WHO (GenBank: MN908947, Chinese: 2019新型冠狀病毒), it appears to have originated from a seafood market in the city of Wuhan, China.


The situation is still developing. As of this writing, there is confirmed evidence of human-to-human transmission, and the virus has spread to the following locales:

  • Bangkok, Thailand
  • Tokyo, Japan
  • Seoul, Korea
  • Beijing, China
  • Shanghai, China
  • Guangdong, China
  • Dayuan
  • Hong Kong, China
  • Macau, China
  • Washington (state), United States of America
  • Vietnam
  • Singapore

It has killed at least 25 patients, and at least 800 cases remain active. Reported symptoms include fever in 90% of cases,[1] fatigue and dry cough in 80%,[2][3] and shortness of breath in 20%, with distress in 15%.[4][5][6] No specific treatment is presently available.

Yesterday, the Chinese government announced that it had taken the extraordinary step of instituting a near-total quarantine of Wuhan. That puts more than 11 million people on lockdown in one of China’s major port cities. The government also canceled major Lunar New Year celebrations in Beijing, and Shanghai Disneyland is closing until further notice.

It’s still too early to tell how serious this outbreak will become, but the WHO elected not to declare a global public health emergency when they met this week to discuss the situation.

Getting Some Data to Help Establish the Phylogenetic Relationships of the 2019-nCoV Wuhan Coronavirus

Even though I don’t work in bioinformatics, phylogenetic trees have a warm place in my heart. I earned a Bachelor of Science degree in Ecology, Behavior, and Evolution at UCLA and spent many an hour with them. These days, I still occasionally play with them during times like this when my curiosity gets the best of me.

There are loads of open-source digital tools to help parse through the treasure troves of genomic data out there. One set of tools I like to use is the rich collection of bioinformatics libraries in R. Once you know what libraries to use and understand some peer-reviewed techniques, all you need to do is fire up some scripts and you’re in business!

Picking Other Viruses for a Phylogenetic Comparison

The first step was to assemble a library of genomic data to use as points of comparison. I wanted to grab the new 2019-nCoV coronavirus along with some of the most recognizable bad bugs out there. My list included:

GenBank No.  Common Name       Family
JF915184     Influenza A H1N1  Orthomyxoviridae
JX947861     Influenza C       Orthomyxoviridae
KR063674     Marburg           Filoviridae
KT029139     MERS              Coronaviridae
KX169266     Influenza B       Orthomyxoviridae
MN908947     2019-nCoV         Coronaviridae
NC001563     West Nile         Flaviviridae
NC001617     Rhinovirus A      Picornaviridae
NC002031     Yellow Fever      Flaviviridae
NC002549     Ebola             Filoviridae
NC004718     SARS              Coronaviridae
NC012532     Zika              Flaviviridae
NC038312     Rhinovirus B      Picornaviridae

Fortunately, the National Center for Biotechnology Information (NCBI) keeps a library of FASTA files storing the complete genomes of such viruses.

Heading to the Library (GenBank and NCBI) to Grab Some Genomic Data

A user-friendly way of grabbing these genomes is the impressive Viral Sequence Selection Interface run by the NCBI, which is part of the U.S. National Library of Medicine.

Alternatively, if you know the GenBank number of the virus you want, you can head to the NCBI’s main web hub. Search for the GenBank number with the filters set to “All Databases.” This will return a record with a direct link to download the FASTA files.

Regardless of which method you choose, you want to grab complete genomes for this exercise. Sample integrity is usually clearly indicated in the accession record title or in the detailed record sheet provided by NCBI.
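If you would rather script the download than click through the website, NCBI also exposes records through its E-utilities service. The sketch below is Python rather than the R I used for this post, and it is only an illustration: the helper names fasta_url and download_fasta are made up for this example, and the dictionary holds just a few of the accessions from my list.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

# NCBI E-utilities endpoint for fetching records by identifier.
EFETCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"

# A few accessions from the list in this post (a subset, for illustration).
ACCESSIONS = {
    "MN908947": "2019-nCoV",
    "KT029139": "MERS",
    "KR063674": "Marburg",
}


def fasta_url(accession: str) -> str:
    """Build an efetch URL that returns one nucleotide record as FASTA text."""
    params = {
        "db": "nuccore",      # the nucleotide database
        "id": accession,
        "rettype": "fasta",
        "retmode": "text",
    }
    return f"{EFETCH_URL}?{urlencode(params)}"


def download_fasta(accession: str, timeout: float = 30.0) -> str:
    """Fetch the FASTA text for one accession (hypothetical helper; needs network access)."""
    with urlopen(fasta_url(accession), timeout=timeout) as response:
        return response.read().decode("utf-8")


# Printing the URLs does not touch the network.
for accession, name in ACCESSIONS.items():
    print(name, fasta_url(accession))
```

Calling download_fasta("MN908947") should return the record as one FASTA string, which you can save to disk or parse directly. If you fetch many records, check NCBI's usage guidelines for request rate limits first.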

Wait, What is a FASTA File?

FASTA (FastAll) files are to bioinformaticians what KML files are to cartographers. The format is text-based and represents either nucleotide sequences or amino acid sequences. These nucleotides or amino acids are coded using single-letter characters. The FASTA format also allows for sequence names and comments to precede the sequences in the file.
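To make the format concrete, here is a minimal Python sketch of a FASTA reader. It assumes the simplest layout: a header line starting with ">" followed by one or more lines of sequence. The function name parse_fasta and the two demo records are invented for illustration. For real work you would probably reach for an established package instead.

```python
from typing import Dict, Iterable


def parse_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Map each header line (minus the leading '>') to its full sequence."""
    records: Dict[str, str] = {}
    header = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            header = line[1:]
            records[header] = ""
        elif header is not None:
            # Sequences are often wrapped across lines, so join them back up.
            records[header] += line.upper()
    return records


# Two made-up records, not real viral sequences.
example = """>demo_1 a made-up record
ATTAAAGG
TTTA
>demo_2 another made-up record
GGCC
"""

records = parse_fasta(example.splitlines())
for name, sequence in records.items():
    print(name, len(sequence), sequence)
```

Passing an open file handle works the same way, since the function only needs something that yields lines.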

There are packages written to process FASTA-formatted data in (at least) Python, Ruby, Perl, and R. Of those, the Python and R solutions are probably the most widely used and robust. There are also Java programs like FigTree that provide a friendly UI for phylogenetic exploration.

FigTree is a Java-based application that allows you to explore phylogenetic trees. It’s overkill for the simple visualization I’m building.

For visualizing the phylogenetic relationships of 2019-nCoV, I used some R bioinformatics packages alongside the world-famous ggplot2. I didn’t need to explore a full phylogenetic tree as much as simply approximate and visualize relatedness.

Preparing for a Rough Sketch of Phylogenetic Relationships of 2019-nCoV

I got the ball rolling by rendering a rough sketch of the relationships between the genomes I collected in the last step. To accomplish this, I relied heavily on the work of biostatistician TF Khang and researchers Sims et al. and Zielezinski et al.

The whole process began with firing up the ol’ RStudio and analyzing k-mers.


K-in-the-what-now?

K-mers are simply strings of nucleotides (or other biological building blocks) that are k characters long, where k is a stand-in variable for any positive integer. Here are some examples:

ATTA (a 4-mer)
ATTAAA (a 6-mer)

Once you’ve picked a k value, you can scan the entire genome and identify all the possible k-mers in the genome. So, for the following sequence of 12 bases,

ATTAAAGGTTTA

we get the following 4-mer sequences:

ATTA (A 4-mer)
 TTAA (Another 4-mer appears when you shift one nucleotide over)
  TAAA (Oh look! Another 4-mer)
   AAAG (...)
    AAGG (...)
     AGGT (...)
      GGTT ( •_•)
       GTTT ( •_•)>⌐■-■
        TTTA (⌐■_■)
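The sliding window above is easy to express in code. Here is a small Python sketch (my own analysis was done in R, and the function name kmers is just for illustration) that lists every overlapping k-mer in a sequence. A sequence of length n yields n - k + 1 windows, so our 12 bases give nine 4-mers.

```python
from typing import List


def kmers(sequence: str, k: int) -> List[str]:
    """Return every overlapping substring of length k, shifting one base at a time."""
    if k <= 0:
        raise ValueError("k must be a positive integer")
    # range() is empty when k is longer than the sequence, so no windows are returned.
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


windows = kmers("ATTAAAGGTTTA", 4)
for offset, window in enumerate(windows):
    # Indent each window by its offset to mimic the staircase above.
    print(" " * offset + window)
```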

Now, the longer the k-mer sequences, the more specific your resolution ultimately gets. The odds of matching a 2-mer chain between two organisms are much greater than the odds of matching a 30-mer chain.

But your chains don’t always have to be large to be useful for comparisons. For my purposes, I was just making broad brushstroke comparisons between some taxonomic Family clusters of viruses. Per the work of TF Khang, k=5 is sufficient for this purpose.
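To compare genomes of different lengths, it helps to turn raw k-mer counts into a fixed-length profile with one slot for each of the 4^k possible k-mers (1,024 slots for k=5). The Python sketch below shows one way to do that. Converting counts to relative frequencies and skipping windows that contain characters other than A, C, G, and T are assumptions made for this illustration, not necessarily what the R packages do, and kmer_profile is a made-up name.

```python
from collections import Counter
from itertools import product
from typing import List


def kmer_profile(sequence: str, k: int = 5, alphabet: str = "ACGT") -> List[float]:
    """Relative frequency of every possible k-mer, in a fixed alphabetical order."""
    counts = Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))
    all_kmers = ["".join(letters) for letters in product(alphabet, repeat=k)]
    # Only windows made entirely of alphabet letters are counted (e.g. 'N' is skipped).
    total = sum(counts[kmer] for kmer in all_kmers)
    if total == 0:
        raise ValueError("sequence contains no valid k-mers of this length")
    return [counts[kmer] / total for kmer in all_kmers]


# Toy example with k=2 so the profile has only 4**2 = 16 slots.
toy_profile = kmer_profile("ATTAAAGGTTTA", k=2)
print(len(toy_profile), round(sum(toy_profile), 6))

# With k=5 the profile always has 4**5 = 1024 slots, whatever the genome length.
print(len(kmer_profile("ATTAAAGGTTTA", k=5)))
```

Because every profile has the same length and sums to 1, two viruses can be compared slot by slot even when one genome is several times longer than the other.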

If there are any bioinformaticians out there reading this, please check my work and let me know if you feel this is valid!


The Initial Sketch of the Phylogenetic Relationships of 2019-nCoV

Once I tabulated the k-mers for each virus sample, I compared them computationally. Per the recommendations of the bioinformatics material I referenced, I used the Jensen-Shannon distance. I then ran the resulting distances through the BIONJ neighbor-joining algorithm to build a phylogenetic tree, which I plotted with plot.phylo() in R. Lastly, a subsequent PCA seemed to confirm that my clustering was on the right track.
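For the curious, here is roughly what a Jensen-Shannon distance computes. This is a plain Python sketch, not the R code behind the plots. It assumes each virus has already been reduced to a k-mer frequency profile that sums to 1, and it uses the common definition of the distance as the square root of the base-2 Jensen-Shannon divergence. The names js_distance and distance_matrix, along with the toy profiles, are invented for illustration. Tree building with BIONJ is not shown.

```python
from math import log2, sqrt
from typing import Dict, List, Sequence


def js_distance(p: Sequence[float], q: Sequence[float]) -> float:
    """Square root of the base-2 Jensen-Shannon divergence between two distributions."""
    if len(p) != len(q):
        raise ValueError("profiles must have the same length")

    def kl(a: Sequence[float], b: Sequence[float]) -> float:
        # Kullback-Leibler divergence; terms with a zero probability contribute nothing.
        return sum(x * log2(x / y) for x, y in zip(a, b) if x > 0)

    midpoint = [(x + y) / 2 for x, y in zip(p, q)]
    divergence = 0.5 * kl(p, midpoint) + 0.5 * kl(q, midpoint)
    return sqrt(max(divergence, 0.0))  # guard against tiny negative rounding errors


def distance_matrix(profiles: Dict[str, Sequence[float]]) -> Dict[str, Dict[str, float]]:
    """Pairwise distances between every pair of named profiles."""
    return {a: {b: js_distance(pa, profiles[b]) for b in profiles}
            for a, pa in profiles.items()}


# Toy three-slot profiles; these are not real viral data.
toy: Dict[str, List[float]] = {
    "virus_a": [0.5, 0.3, 0.2],
    "virus_b": [0.4, 0.4, 0.2],
    "virus_c": [0.1, 0.1, 0.8],
}
matrix = distance_matrix(toy)
for name, row in matrix.items():
    print(name, {other: round(d, 3) for other, d in row.items()})
```

With base-2 logarithms the distance runs from 0 for identical profiles to 1 for profiles with no k-mers in common. A matrix like this is the kind of input a neighbor-joining algorithm takes.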

Simple unrooted phylogenetic tree visualization of the phylogenetic relationships of 2019-nCoV
A quick rough sketch of the phylogenetic relationships between our sampled viruses.

License: 2020 Xyzology | Reproducible with Attribution CC BY-SA
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

Improving the Visual Legibility

As a quick sketch, the graphic above works just fine, but I wanted to make it more presentable and legible. In the format above, labels are stacked on top of one another and the nodes aren’t always clear. It was time to fire up some more R libraries.

Trusty ggplot2 and ggtree to the Rescue

ggplot2 is always a go-to instrument for my data visualizations in R. In this case, I also leveraged the ggtree extension of ggplot2 to quickly render a few different data visualizations.

The first iteration is analogous to the sketch I made above. It used the familiar flat tree layout to produce the plot below.

Flat tree visualization of the phylogenetic relationships of 2019-nCoV
Standard flat ggtree plot of the phylogenetic relationships (k=6) between our sampled viruses. Not bad!


This is a perfectly valid way of representing the phylogenetic relationships of the 2019-nCoV coronavirus. In fact, data visualization purists would probably argue this is the best format to use. It’s the format you’d choose if you’re most concerned with keeping the nodes and relationships clear and highly interpretable.

Since I’m not publishing this in a journal, I was able to exercise more artistic freedom. I wanted to make a more visually appealing plot while preserving legibility. To do that, I reoriented the flat plot into a circular tree by combining some ggtree and ggplot2 functions.

The first plot forgoes a legend and the second one includes a legend. The viruses are color-coded based on their Families so viewers can quickly get a sense of taxonomic ordination.

Circular tree visualization of the phylogenetic relationships of 2019-nCoV without legend.
A circular ggtree + ggplot2 rendering of the phylogenetic relationships (k=6) between our sampled viruses. This version has no legend.

Circular tree visualization of the phylogenetic relationships of 2019-nCoV with legend.
Another circular ggtree + ggplot2 rendering of the phylogenetic relationships (k=6) between our sampled viruses. This time, I included a legend of the taxonomic color codes.


So What Does This Tell Us?

As demonstrated, bioinformatic data techniques help us quickly visualize the phylogenetic relationships between 2019-nCoV coronavirus and some familiar viral pathogens.

Based on the trees, it seems that 2019-nCoV is closely aligned with SARS and MERS. That’s no surprise because they’re all in the same family! Where it falls on the branching does, however, suggest that it is more closely related to SARS than to MERS. It would appear that SARS and 2019-nCoV share a more recent common ancestor than MERS and 2019-nCoV do.

But Don’t Take My Word for It…

Again, please take all this with the caveat that I am not a viral scientist or a bioinformatics professional. These visualizations and approximations should be taken only as one piece of potential evidence in a much, much, much larger bucket being looked at by much, much, much more informed and knowledgeable people.

Still, it’s always nice to leverage data like this to elevate my impressions above conjecture and into more statistically robust domains.

Thanks for reading! 🤓📊🖥️
