Using Coalescent Analysis to Map the 2016 Zika Virus Outbreak


In this write-up, I’ll retrieve and process Zika genome data, use Bayesian software to generate a Zika gene tree, and propose a route that Zika took to get from Africa to the Americas.


Zika virus was first detected in humans in Uganda ca. 1952, but went largely unnoticed with very few infections[1]. Decades later, Zika virus emerged in Brazil during the spring of 2015, and made its way to Central and North America by 2016[2], resulting in an estimated 1.5 million infections in Brazil[3]. The World Health Organization declared in November 2016 that the outbreak was no longer a public health emergency of international concern[4]. Despite yielding few fatal cases, the Zika virus outbreak garnered widespread attention due to its prevalence and coincidence with the 2016 Olympics.

In addition to its high infection rate, the spontaneity and location of the outbreak raise questions about the methods and routes by which Zika and other viruses move across continents. In the span of less than a century, Zika went from an uncommon African virus to 1.5 million infections halfway across the globe. A better understanding of how this happens may lead to earlier detection and prevention of future outbreaks. Accordingly, I am writing this with the intent to use Zika virus as a demonstration for mapping the route of viral epidemics.

To do this, I am going to capitalize on recent advancements in molecular sequencing and computational genomics. Bayesian Evolutionary Analysis Sampling Trees [5] (BEAST) is an example of such advancements. BEAST is a cross-platform software package that implements a Markov Chain Monte Carlo (MCMC) algorithm for inferring time-measured phylogenies using molecular clocks. This idea for estimating phylogenies goes back to the 1980s, when John Kingman developed coalescent theory. He writes: “The basic strategy is to select n individuals from a particular generation, and to trace back their descent, noting when there are common ancestors”[6].
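To make Kingman's idea a bit more concrete, here is a minimal sketch of the standard coalescent for a haploid population of constant size N. It is not what BEAST does internally, only the textbook model: while k lineages remain, the waiting time to the next common ancestor is exponential with rate k(k-1)/2, measured in units of N generations. The function names are my own.

```python
import random


def expected_tmrca(n):
    """Expected time to the most recent common ancestor of n samples.

    Time is in units of N generations (haploid, constant population size).
    Each stage with k lineages lasts 2 / (k * (k - 1)) on average.
    """
    return sum(2.0 / (k * (k - 1)) for k in range(2, n + 1))


def simulate_tmrca(n, rng):
    """Trace n lineages back in time until they share a single ancestor."""
    t = 0.0
    for k in range(n, 1, -1):
        rate = k * (k - 1) / 2.0      # number of lineage pairs that could merge
        t += rng.expovariate(rate)    # waiting time until the next merger
    return t


if __name__ == "__main__":
    rng = random.Random(1)
    sims = [simulate_tmrca(10, rng) for _ in range(20000)]
    print("expected :", expected_tmrca(10))
    print("simulated:", sum(sims) / len(sims))
```

The expected value simplifies to 2(1 - 1/n), so even a large sample is expected to coalesce within about 2N generations under this model.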

Specifically, I will use BEAST to construct a gene tree from Zika virus sequences and to generate a time frame for divergences that can be transposed onto a map as a hypothetical course for the outbreak.


Sequence Retrieval and Alignment

The data were retrieved from the National Center for Biotechnology Information’s (NCBI) Virus Variation database. I kept only sequences isolated from human hosts (because human infections are the ones I’m interested in) and restricted them to the E protein region (in the interest of sample size). I used MUSCLE [7] to align the sequences.
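The filtering step can be sketched roughly as follows. This is an illustration, not the script I actually ran: the `accession|host|country|year` header layout is an assumption (real NCBI headers depend on the download options you choose), the three records are made-up toy data, and the alignment itself is left to MUSCLE. The last helper only checks that an alignment came back with every row the same length.

```python
def parse_fasta(text):
    """Parse FASTA text into a list of (header, sequence) tuples."""
    records, header, chunks = [], None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def keep_human_hosts(records, host_field=1, sep="|"):
    """Keep records whose (assumed) host field names a human host."""
    kept = []
    for header, seq in records:
        fields = [f.strip() for f in header.split(sep)]
        if len(fields) > host_field and fields[host_field].lower() in {"human", "homo sapiens"}:
            kept.append((header, seq))
    return kept


def is_aligned(records):
    """An alignment should have rows of identical length (gaps included)."""
    return len({len(seq) for _, seq in records}) <= 1


if __name__ == "__main__":
    # Toy records with hypothetical identifiers, not real accessions.
    demo = (
        ">seqA|Human|Brazil|2015\nATG-CC\n"
        ">seqB|Mosquito|Uganda|1947\nATGACC\n"
        ">seqC|Homo sapiens|Thailand|2014\nATGA\nCC\n"
    )
    humans = keep_human_hosts(parse_fasta(demo))
    print([header for header, _ in humans], is_aligned(humans))
```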

BEAST Parameters and Priors

BEAST uses priors to estimate ages of tree nodes. Here I’ll describe the ones I set:

  • Partitions were made according to site (1 + 2 + 3) to account for codon position. Clock and Tree models were linked (due to the partitions deriving from the same sequence).
  • The HKY substitution model [8] was used, as research suggests this is best for ssRNA sequences [9]. A strict clock model was used with a substitution rate of 1.15E-3 (the Zika virus mutation rate) [10], and the tree prior was set to a Coalescent Bayesian Skyline model based on random tree selection.
  • Finally, the MCMC chain had a length of 10,000,000 with a pre-burnin length of 10,000 which should be sufficient for the sample size and sequence length.
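Two of the settings above are easy to illustrate outside of BEAST. The sketch below splits an in-frame coding sequence into its three codon-position partitions, and shows what a strict clock implies: the expected number of substitutions is simply rate × time × sites. I am assuming here that the 1.15E-3 rate is in substitutions per site per year, and the 1,000-site, 10-year figures are arbitrary round numbers for illustration, not values from my data.

```python
def split_codon_positions(seq):
    """Split an in-frame coding sequence into 1st, 2nd and 3rd codon positions."""
    return seq[0::3], seq[1::3], seq[2::3]


def expected_substitutions(rate, years, n_sites):
    """Strict clock: one constant rate on every branch of the tree.

    rate is assumed to be substitutions per site per year.
    """
    return rate * years * n_sites


if __name__ == "__main__":
    first, second, third = split_codon_positions("ATGGCCTAA")
    print(first, second, third)  # AGT TCA GCA

    # Arbitrary illustration: 1,000 sites evolving for 10 years.
    print(expected_substitutions(1.15e-3, 10, 1000))
```

Partitioning this way lets each codon position have its own substitution parameters, which matters because third positions usually change faster than first and second positions.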


Bayesian Skyline

The skyline plot below shows the effective population size of the Zika virus on a logarithmic scale. Note the distinct increase after 2010.

Figure 1 – Bayesian Skyline plot of Zika Virus effective population (Y) generated in BEAST from E protein sequences.
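For intuition about where a skyline plot comes from, here is a minimal sketch of the classic (non-Bayesian) skyline estimator. It is a simplification of what BEAST does: it assumes all samples were collected at the same time and treats every coalescent interval separately, whereas the Bayesian skyline groups intervals and averages over many trees. While k lineages remain, an interval of length t gives an estimate of t × k(k-1)/2 for the effective population size (scaled by generation time). The coalescent times in the example are made-up numbers.

```python
def classic_skyline(coalescent_times):
    """Classic skyline estimates from coalescent event times.

    coalescent_times: times before present of the n - 1 coalescent events
    in a tree of n samples that were all collected at time 0.
    Returns a list of (lineages, interval_length, estimate), from the
    present backwards.
    """
    n = len(coalescent_times) + 1
    estimates = []
    previous = 0.0
    for i, t in enumerate(sorted(coalescent_times)):
        k = n - i                     # lineages present during this interval
        interval = t - previous
        estimates.append((k, interval, interval * k * (k - 1) / 2.0))
        previous = t
    return estimates


if __name__ == "__main__":
    # Made-up times for a 4-sample tree, purely for illustration.
    for k, interval, estimate in classic_skyline([1.0, 3.0, 6.0]):
        print(k, interval, estimate)
```

Short intervals between coalescent events while many lineages are present translate into small estimates, and long intervals into large ones, which is why a burst of recent divergences shows up as a change in the plot.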

Gene Tree

The gene tree shows the divergences of each sample based on genetic variation. The tree begins ca. 1.2 kya in Africa. There is an explosion of diversity ca. 2010 that coincides with the population growth shown in the skyline plot above.

Figure 2 – Zika Virus E protein Gene tree generated in BEAST and displayed with DensiTree.


Using the gene tree above, I plotted a hypothetical route that the Zika virus took based on time and location of the divergences:

Figure 3 – Mapping a hypothetical route for the transmission of Zika Virus leading up to the outbreak in 2015-2016. Based on gene tree shown in figure 2.

To summarize what’s going on in the picture:

  • The Zika virus originates in Africa and diverges in the late 80’s, early 90’s with the Thailand sequence.
  • A couple of other lineages branch out to nearby Southeast Asian countries. Approximately one decade later, the virus diverges into several lineages around the Polynesian area.
  • The virus then appears to arrive in the Western Hemisphere around 2010, in South America, and spreads into Central America during the 2016 outbreak.

This route is consistent with other findings[11] (which is always good to hear). So, the next course of action is to speculate about what factors played a role in transmitting Zika virus at different points in its course to the Americas. For instance, what happened during the early 90’s between Southeast Asia and Africa, or ca. 2010 between Brazil and Oceania/Polynesia, that facilitated the spread across large bodies of water?

To do this, we need to know a little more about how Zika spreads. First, it uses mosquitoes as hosts. Second, it can be transmitted through blood, sexual contact, and other bodily fluids (urine, saliva, etc.). Thus, it is reasonable to assume that people get Zika through mosquito bites[12].

One proposal (and this applies to many mosquito-borne diseases) is that mosquitoes lay eggs in tires which then get sent across the world in shipping containers[13]. Some other ideas are that the virus got to Brazil via person-to-person transfer during large, global events like the 2014 FIFA World Cup. Of course, it could also be the result of someone returning from a vacation to Polynesia.

Whatever the case may be, it’s important, and also quite interesting, to be able to model genetic lineages and make some sense of viral outbreaks. Hopefully you gained some insight into how epidemiologists can use bioinformatic tools to trace the origins of a virus.


[1] Dick GW. Zika Virus II: Pathogenicity and Physical Properties. Transactions of the Royal Society of Tropical Medicine and Hygiene. September 1952.

[2] Centers for Disease Control and Prevention. Cumulative Zika Virus Disease Case Counts in the United States, 2015-2018. Reporting and Surveillance. Updated: February 2018.

[3] World Health Organization. Zika and Potential Complications. Zika Situation Report. February 2016.

[4] World Health Organization. Zika Virus, Microcephaly, and Guillain-Barre Syndrome. Zika Situation Report. November 2016.

[5] Drummond AJ. BEAST 2: A Software Platform for Bayesian Evolutionary Analysis. PLoS Computational Biology. 2014.

[6] Kingman JFC. On the Genealogy of Large Populations. Journal of Applied Probability. 1982.

[7] Edgar RC. MUSCLE: Multiple Sequence Alignment with High Accuracy and High Throughput. Nucleic Acids Research. 2004.

[8] Hasegawa M. Dating of the Human-Ape Splitting by a Molecular Clock of Mitochondrial DNA. Journal of Molecular Evolution. 1985.

[9] Shapiro B. Choosing Appropriate Substitution Models for the Phylogenetic Analysis of Protein-Coding Sequences. Molecular Biology and Evolution. January 2006.

[10] Metsky HC. Zika Virus Evolution and Spread in the Americas. Nature. February 2018.

[11] Faye O. Molecular Evolution of Zika Virus during Its Emergence in the 20th Century. PLoS Neglected Tropical Diseases. 2014.

[12] Centers for Disease Control and Prevention. Zika Virus Transmission Methods. Prevention and Transmission.

[13] Reiter P. The Used Tire Trade: A Mechanism for the Worldwide Dispersal of Container Breeding Mosquitoes. Journal of the American Mosquito Control Association. September 1987.
