How Genetic Similarity is Calculated and a Little on What the NASA Twin Study Really Means

You’ve probably heard statistics claiming that human and chimpanzee DNA is over 99% similar, or that humans share 50% of their DNA with a banana, or something along those lines. More recently, you might have seen news articles from sources such as CNN[1] or Time[2] claiming that the DNA of an astronaut, Scott Kelly, changed by 7% compared to his twin’s after he spent a year in space. Obviously, this isn’t true. What the articles meant to say, or should have said, was that 7% of the astronaut’s genes were differentially expressed compared to his twin’s.



Essentially, environmental factors, such as light or chemicals, can cause the body (or its cells) to express more of a particular gene in response. For instance, turtle embryos exposed to heat upregulate sox9, a gene involved in sex determination[3]. The NASA twin study write-up (the full study is not published yet, I believe) noted in particular that IGF-1, a protein related to bone and muscle density, steadily increased over the one-year stay in space[4].

This study, from what I can gather without having access to the full methods, measured the astronaut’s expression level for every gene, compared each one to the expression level in his twin, and noted that 7% of the genes had differential expression levels after landing.
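To make "7% of genes were differentially expressed" a bit more concrete, here is a minimal sketch of how such a fraction could be tallied. It is not the study's method (which I haven't seen): the gene names and expression counts are made up, and the two-fold cutoff (`min_abs_log2_fc=1.0`) is an assumption for illustration. Real analyses use replicates and statistical tests rather than a bare threshold.

```python
import math


def fraction_differentially_expressed(sample, control, min_abs_log2_fc=1.0):
    """Fraction of shared genes whose expression differs by at least the
    given log2 fold change between the two individuals."""
    shared = [gene for gene in sample if gene in control]
    changed = 0
    for gene in shared:
        # A pseudocount of 1 avoids dividing by zero for unexpressed genes.
        log2_fc = math.log2((sample[gene] + 1) / (control[gene] + 1))
        if abs(log2_fc) >= min_abs_log2_fc:
            changed += 1
    return changed / len(shared)


# Hypothetical expression counts, invented purely for illustration.
space_twin = {"geneA": 120, "geneB": 35, "geneC": 400, "geneD": 10}
ground_twin = {"geneA": 115, "geneB": 30, "geneC": 90, "geneD": 12}

print(fraction_differentially_expressed(space_twin, ground_twin))
```

Note that nothing in this calculation looks at the DNA sequence itself. Only the amount of each gene's product is compared.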

The buzz that these headlines garnered made me think that an explanation of how genetic similarity is calculated might be interesting, if not useful. So here’s a write-up on how researchers come up with statistics like the ones I mentioned in the opening paragraph.

Gene Orthology and Alignment

When comparing sequences between two species, you cannot approach it like you were reading a book. Over the course of speciation, genes can move (shoutout to Barbara)[5], chromosomes can fuse and break, etc. One cannot just grab a chromosome from a chimp and start comparing base pairs to another arbitrary chromosome from a human.

It’d be like a school question that asks you to list an order of events: you mess one up, every subsequent answer ends up off by one, and the teacher marks all of them wrong. Or, for a more biological example, it would be like saying that humans are only 23/24ths similar to chimps because we have one fewer chromosome pair than they do, even though our chromosome 2 is a fusion of two chromosomes that are still present, almost identically, as separate chromosomes in chimps[6].

There has to be some sort of alignment such that you are reading off base pairs from orthologous genes, that is, genes in different species that evolved from a common ancestral gene by speciation. This is accomplished through the use of an alignment algorithm, like BLAST[7], which finds regions of similarity between DNA sequences.
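The off-by-one problem is easy to demonstrate. The sketch below is not BLAST, which is a much faster heuristic for local alignment. It is a textbook global alignment (Needleman-Wunsch), and the scoring values (`match=1`, `mismatch=-1`, `gap=-2`) and the toy sequences are my own assumptions. It shows that a single inserted base wrecks a naive left-to-right comparison, while an alignment recovers the true similarity.

```python
def naive_identity(a, b):
    """Fraction of matching positions when both sequences are read left to right."""
    n = min(len(a), len(b))
    return sum(x == y for x, y in zip(a, b)) / n


def global_align(a, b, match=1, mismatch=-1, gap=-2):
    """Needleman-Wunsch global alignment; returns the two gapped strings."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)

    # Trace back from the bottom-right corner to rebuild the alignment.
    out_a, out_b = [], []
    i, j = len(a), len(b)
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return "".join(reversed(out_a)), "".join(reversed(out_b))


def aligned_identity(gapped_a, gapped_b):
    """Fraction of alignment columns where both sequences carry the same base."""
    matches = sum(x == y and x != "-" for x, y in zip(gapped_a, gapped_b))
    return matches / len(gapped_a)


seq_1 = "ACGTACGTAC"
seq_2 = "AACGTACGTAC"  # same sequence with one extra base at the start

gapped_1, gapped_2 = global_align(seq_1, seq_2)
print(naive_identity(seq_1, seq_2))
print(gapped_1)
print(gapped_2)
print(aligned_identity(gapped_1, gapped_2))
```

The naive comparison scores almost every position as a mismatch, while the aligned version pays for one gap and matches everything else.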

Simplified diagram of homology subtypes

Know Which Parts of the Genome are Considered

Another thing to consider is which genes are present in the analysis. Is it just transcribed regions? What about just protein-coding regions? Roughly 99% of the genome is not translated[8] (converted to proteins), and only about 10% is thought to have a biological function (at least, we haven’t found out what the rest does yet)[9]. Are the researchers taking into consideration the entire genome, or just parts of it? There is no correct way of doing this; it’s just something that needs to be considered. For instance, the mouse genome is 85% similar to the human genome when considering protein-coding regions only, but only 50% similar when considering non-coding regions[10].

Consensus and Polymorphism

As we know, there is genetic variation within species, not just between them. Not every human is alike, nor is every chimp, or mouse, or whatever. So do geneticists just pick a random individual from each species of interest and use their genomes for comparison? Typically, no. They generate a consensus sequence: a sort of sampling whereby the modal nucleotide at each site is considered the “correct”, or consensus, nucleotide for the entire sample.
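Taking the modal nucleotide at each site is simple enough to sketch in a few lines. This assumes the sequences are already aligned to the same length, and the sample sequences are invented for illustration. Real pipelines also have to deal with gaps, ambiguous base calls, and sequencing quality.

```python
from collections import Counter


def consensus(sequences):
    """Return the modal nucleotide at each site of equal-length, pre-aligned sequences."""
    if len(set(map(len, sequences))) != 1:
        raise ValueError("sequences must be aligned to the same length")
    result = []
    for column in zip(*sequences):
        # most_common(1) gives the most frequent base at this site;
        # ties go to whichever base appears first in the column.
        base, _count = Counter(column).most_common(1)[0]
        result.append(base)
    return "".join(result)


# Four made-up individuals from the same species.
sample = ["ACGTTA", "ACGATA", "ACCTTA", "ACGTTG"]
print(consensus(sample))
```

Each individual here differs from the others at one site, but the consensus ignores those one-off variants.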

Once a consensus sequence is derived and the genes are aligned, you can start reading the DNA, nucleotide by nucleotide, and identifying single nucleotide polymorphisms (SNPs). A SNP is a site where the nucleotide differs from the one at the same position in a reference sequence. Imagine you’re reading across a sequence and you see a guanine. You look at the same spot on another sequence and see a cytosine. BOOM! You’ve got a SNP and a tally for genetic dissimilarity.
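The tallying step can be sketched as follows. This is a toy version that assumes both sequences are already aligned and the same length, skips gap columns (marked with `-`), and uses invented sequences. The function names are my own.

```python
def snp_positions(seq_a, seq_b):
    """Positions where two aligned sequences carry different nucleotides (gap columns skipped)."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    return [i for i, (x, y) in enumerate(zip(seq_a, seq_b))
            if x != y and "-" not in (x, y)]


def percent_similarity(seq_a, seq_b):
    """Percentage of compared (non-gap) sites that are identical."""
    compared = sum("-" not in pair for pair in zip(seq_a, seq_b))
    return 100 * (1 - len(snp_positions(seq_a, seq_b)) / compared)


reference = "ACGTTAGC"
other = "ACCTTAGC"  # a cytosine where the reference has a guanine

print(snp_positions(reference, other))
print(percent_similarity(reference, other))
```

Whether gap columns (insertions and deletions) count toward dissimilarity is a choice each study has to make, and it can move the final percentage noticeably.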

Looking at Some Frequently Cited Statistics

Here I’m going to take a look at the methods behind the studies of some popular statistics. Let’s begin:

  1. “Humans and Chimpanzees Share 98% of Their DNA.” – This study from The Chimpanzee Sequence and Analysis Consortium[11] used BLASTZ-aligned sequences from a couple of chimps. They removed small (<15 kb) insertions and deletions, and only compared 13,454 orthologous genes out of the 19,277 annotated human genes.
  2. “Humans are 99.9% Genetically Identical to Each Other.”[12] – This statistic comes from The 1000 Genomes Project Consortium and compares 2,504 individuals from 26 populations. The entire genome is sampled, and they found that an average individual has about 4.1 million SNPs. This works out to about 99.86% similarity, considering the human genome is about 3,000,000,000 base pairs in length.
  3. “Humans Share 50% of Their DNA with Bananas.” – I cannot find a primary source on this :/ .
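The arithmetic behind the first two items is worth spelling out. This is just a back-of-the-envelope check using the figures quoted above. The variable names are mine, and the 3-billion-base-pair genome length is the same round number used in the text.

```python
# Item 2: SNPs per person versus genome length.
snps_per_person = 4.1e6
genome_length_bp = 3.0e9
human_similarity = 1 - snps_per_person / genome_length_bp
print(f"human-to-human similarity: {human_similarity:.2%}")

# Item 1: how much of the annotated human gene set was actually compared.
genes_compared = 13_454
genes_annotated = 19_277
share_compared = genes_compared / genes_annotated
print(f"share of annotated genes compared: {share_compared:.1%}")
```

The second number is a reminder that the chimpanzee figure describes roughly 70% of annotated human genes, not the whole genome.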

In Summary

Hopefully, this write-up gave some insight into what it means for species to be genetically similar to each other. Concerning the NASA twin study: no, Kelly’s DNA did not change after being in space for one year. Outside of an insignificant number of mutagenic events, which happen at the cellular level and not the genomic level, his DNA remained the same. One interesting finding from the study, however, is that his telomeres became significantly longer while in space, but returned to normal after landing. Overall, a 7% difference in gene expression is what we saw, but I guess that’s not as interesting as suggesting that he is no longer twins with his twin[13], o_O .


  3. Gilbert SF. “Environmental Sex Determination”. Developmental Biology. 6th edition. 2000. Available from:
  4. Edwards M. “NASA Twin Study Confirms Preliminary Findings”. NASA Human Research Strategic Communications. 2018. Available From:
  5. Ravindran S. “Barbara McClintock and the Discovery of Jumping Genes”. Proceedings of the National Academy of Sciences. 2012.
  6. Yunis JJ. “The Origin of Man: A Chromosomal Pictorial Legacy”. American Association for the Advancement of Science. 1982.
  7. Altschul, S.F., Gish, W., Miller, W., Myers, E.W. & Lipman, D.J. “Basic local alignment search tool.” J. Mol. Biol. 215:403-410. 1990.
  8. International Human Genome Sequencing Consortium (Feb 2001). “Initial sequencing and analysis of the human genome”. Nature. 409 (6822): 860–921. doi:10.1038/35057062.
  9. Ponting CP. “What Fraction of the Human Genome is Functional?”. Genome Research. 2011.
  10. National Human Genome Research Institute. “Why Mouse Matters”. Mouse Sequencing Consortium. 2000. Available From:
  11. The Chimpanzee Sequence and Analysis Consortium. “Initial Sequence of the Chimpanzee Genome and Comparison with the Human Genome”. Nature 437:69-87. 2005. doi:10.1038/nature04072
  12. The 1000 Genomes Project Consortium. “A global reference for human genetic variation.” Nature. 2015;526(7571):68-74. doi:10.1038/nature15393.

Photo Credit

R.A. Jensen, “Orthologs and paralogs – we need to get it right”, Genome Biology, 2001. Figure 1. Available From:
