Population STRUCTURE of Arctic Dogs & Wolves – How Similar Are They?

Overview

Arctic dogs are now lovable pets, but were once bred for their ability to pull a sled around. What’s more interesting, at least to me, is that they share a high degree of genetic similarity with wolves. Here, I’m going to be looking at the Arctic Dog clade to understand the relationships between the breeds that comprise it.

Loading Packages

Pophelper isn’t on the cran, so you’ll have to clone it from github or use “devtools”

library(xlsx)
library(reshape2)
library(hierfstat)
library(pophelper)
library(pegas)

The Data

I got this data from Dryad (DOI: http://dx.doi.org/10.5061/dryad.46170) from “Using multiple markers to elucidate the ancient, historical, and modern relationships among North American Arctic dog breeds” by Brown SK et. al. The data comes as a multisheeted excel file. The data we want is the genotype data for the dog breeds (sheet 3) and wolves (page 5). The wolf genotypes aren’t delimited so we’ll have to fix that. First, though, we’ll get the dog data into something we can work with.

df = read.xlsx("/Users/macuser/Downloads/Brown_et_al_Appendix_1.xlsx",3)
#Breeds are denoted by a row, let's change that to a column:
df$Breed = c("NA","CanID","CanID","CanID","CanID","CanID","CanID","CanID",
             "CanID","CanID","CanID","CanID","CanID","CanID","CanID","CanID","CanID",
             "CanID","CanID","CanID","CanID","CanID","CanID","CanID","CanID","CanID",
             "NA","GrnID","GrnID","GrnID","GrnID","GrnID","GrnID","GrnID","GrnID","NA",
             "AKHusky","AKHusky","AKHusky","AKHusky","AKHusky","AKHusky","AKHusky","AKHusky",
             "AKHusky","AKHusky","AKHusky","AKHusky","AKHusky","AKHusky","AKHusky","AKHusky",
             "NA","Malamute","Malamute","Malamute","Malamute","Malamute","Malamute",
             "Malamute","Malamute","Malamute","Malamute","Malamute","Malamute","Malamute",
             "Malamute","Malamute","NA","SBHusky","SBHusky","SBHusky","SBHusky","SBHusky",
             "SBHusky","SBHusky","SBHusky","SBHusky","SBHusky")
#Reorder so that Breed is the 2nd column after ID:
df = df[c(1,50,2:49)]
colnames(df) = c("ID","Breed",1:48)
#Now drop the Row that denoted breed:
df = df[-c(1,27,36,53,69),]
#Replace missing values with 0 (pretty much any sort of genotype analysis program reads missing values as 0)
df <- data.frame(lapply(df, as.character), stringsAsFactors=FALSE)
df[df=='?'] = '0'
#That messed my headers...gonna have to rename them:
colnames(df) = c("ID","Breed",1:48)
head(df[,c("ID","Breed",1:15)],5)
##         ID Breed   1   2   3   4   5   6   7   8   9  10 11 12  13  14  15
## 1 S12-0793 CanID 100 104 131 131 127 127 219 221 246 248 89 95 286 286 126
## 2 S12-0786 CanID 104 104 151 153 127 133 221 221 242 246 89 93 286 286 126
## 3  DR25915 CanID 104 104 131 149 117 127 221 221 242 242 89 89 286 286 122
## 4  DR25926 CanID  90 102 151 153 113 125 221 221   0   0 91 93 286 286 126
## 5  DR25938 CanID 100 104 149 151 127 127 221 221 242 248 93 95 286 286 118

Now for the Wolf data (this is gonna be a bit more challenging):

wolfDF = read.xlsx("/Users/macuser/Downloads/Brown_et_al_Appendix_1.xlsx",5)
#Drop the 'Wolf' header and add it as column like we did before:
wolfDF = wolfDF[-c(1),]
wolfDF$Breed = c("Wolf","Wolf","Wolf","Wolf","Wolf","Wolf","Wolf","Wolf","Wolf","Wolf",
               "Wolf","Wolf")
wolfDF = wolfDF[c(1,50,2:49)]
#Now we start Delimiting- First, get rid of the unalligned headers:
wolfDF= wolfDF[1:26]
#Rename remaining headers
colnames(wolfDF) = c("ID","Breed",1:24)
newDF = wolfDF[c('ID','Breed')]
#This loop will iterate through the dataframe columns with the genotype data and split the fields every 3 characters (the length of all genotype lables in our case). Then it trims flanking whitespace (leaving the space we use to delimit by) and seperates the fields into two columns. The columns are then appended to the newDF (above) which already contains ID and Breed.
for (i in wolfDF[c(3:26)]){
  wolfDF.i = gsub("(.{4})", "\\1 ", i)
  wolfDF.i = trimws(wolfDF.i, which = 'left')
  fodder = transform(wolfDF, wolfDF.i = colsplit(wolfDF.i,pattern = " ", names = c('1','2')))
  newDF = cbind(newDF,fodder[c('wolfDF.i.1','wolfDF.i.2')])
  #print(newDF)
}
colnames(newDF) = c("ID","Breed",1:48)
newDF$ID = gsub(",","",newDF$ID)
head(newDF[,c("ID","Breed",1:15)],5)
##         ID Breed   1   2   3   4   5   6   7   8   9  10 11 12  13  14  15
## 2 DR24542   Wolf  98 100 131 131 115 115 217 223 235 244 89 93 286 292 124
## 3 DR24544   Wolf  94 100 133 133 123 123 219 233 244 248 91 93 284 296 124
## 4 DR24545   Wolf 100 106 137 147 123 125 219 219 242 246 89 91 284 284 126
## 5 DR24546   Wolf  92  94 131 147 123 125 219 223 242 246 91 95 284 284 124
## 6 DR24547   Wolf  94 110 137 147 123 125 219 223 246 246 89 95 284 286 124

Combine the two dataframes

fullDF = rbind(df,newDF)
fullDF[, 3:50] <- sapply(fullDF[, 3:50], as.numeric)
rownames(fullDF) = (1:86)

Preliminary Analysis: PCA

Principal component analysis on genotypes is a way to cluster individuals based on genetic variation. In this case, indPCA is plotting all loci against each other in n dimensions, finding the axes corresponding to the highest degree of variation (principal component 1 and 2), and plotting individuals based on their deviation from these principal components. (This is the best video I’ve seen explaining PCA:
https://www.youtube.com/watch?v=_UVHneBUBW0).

Breed = fullDF[,c(2:50)]
BreedPCA = indpca(Breed)
plot(BreedPCA,cex=0.5,col = c(rep(1,25),rep(2,8),rep(3,16),rep(4,15),rep(5,10),rep(6,12)))

PCA1

Let’s drop that Outlier on the top left

Breed = Breed[-c(26),]
BreedPCA = indpca(Breed)
plot(BreedPCA,cex=0.5,col = c(rep(1,25),rep(2,7),rep(3,16),rep(4,15),rep(5,10),rep(6,12)))

PCA2

Cool. We can now visualize the genetic similarity between the breeds. The two breeds of inuit dogs are closely clustered and are the husky breeds. Malamute appears to be somewhat inbetween the two and wolves, as expected since there are technically a different species, make up there own group. There’s one wolf that looks like it might actually be an Alaskan Husky. We’re going to turn this classification up a notch using a program called STRUCTURE.

STRUCTURE Analysis

STRUCTURE uses a Bayesian approach via MCMC to form clusters given genotype data. It will compute the proportion of an individuals genome originating from each inferred population up to K populations.
First, we have to write out data to a text file that can be read by STRUCTURE.

#Name each loci (the original excel file has the real names for each locus, but these will do):
header = data.frame("Locus1","Locus2","Locus3","Locus4","Locus5","Locus6","Locus7","Locus8","Locus9","Locus10","Locus11","Locus12","Locus13","Locus14","Locus15","Locus16","Locus17","Locus18","Locus19","Locus20","Locus21","Locus22","Locus23","Locus24")
#Convert Breed to a Number:
StructureDF = fullDF
StructureDF$Breed[StructureDF$Breed=='CanID']=1
StructureDF$Breed[StructureDF$Breed=='GrnID']=2
StructureDF$Breed[StructureDF$Breed=='AKHusky']=3
StructureDF$Breed[StructureDF$Breed=='Malamute']=4
StructureDF$Breed[StructureDF$Breed=='SBHusky']=5
StructureDF$Breed[StructureDF$Breed=='Wolf']=6
#StructureDF$Sep = c(rep(1,86))
#StructureDF = StructureDF[c('ID','Breed','Sep',1:48)]
#Write in the headers:
write.table(header, "StructureData.txt", sep = " ", row.names = FALSE,col.names = FALSE, quote = FALSE)
#Write in the rest of the data:
write.table(StructureDF, "StructureData.txt", sep = " ", row.names = FALSE, col.names = FALSE, quote = FALSE, append = TRUE)

STRUCTURE is going to read this data to assign K populations to individuals. Here is a good tutorial for how to use STRUCTURE:

http://pbgworks.org/sites/pbgworks.org/files/Tutorial%20of%20STRUCTURE%20software.pdf.

These are some of the more significant parameters I used:

  • Burnin: 10000
  • Reps after burnin: 100000
  • Number of Iterations for K: 3
  • Range of K: 1-12 (I Chose 12 because that would allow for each breed to have a dicotomus subdivision. K should be atleast 6 since that is the number of breeds and the “expected” value for K).

Each run will be saved to a folder called “Results” found in your directory.

Now we need to access these output files.

#STRUCTURE directory = dogStructure,
#STUCTURE parameter set name = pset
#Default STRUCTURE result output (created automatically) = Results
f <-list.files("./dogStructure/pset/Results",full.names=TRUE)
f <- readQ(files=f,indlabfromfile = TRUE)
fDF <-tabulateQ(f)
fDF <-summariseQ(fDF)
fDF
##    loci ind  k runs  elpdmean       elpdsd elpdmin elpdmax
## 1    24  85  1    3 -6999.933   0.05773503 -7000.0 -6999.9
## 2    24  85  2    3 -6573.333   3.60878558 -6577.5 -6571.2
## 3    24  85  3    3 -6158.533   1.07857931 -6159.3 -6157.3
## 4    24  85  4    3 -6018.933   1.45716620 -6020.3 -6017.4
## 5    24  85  5    3 -6037.100  16.60451746 -6056.2 -6026.1
## 6    24  85  6    3 -5910.800   5.07641606 -5915.1 -5905.2
## 7    24  85  7    3 -5965.833  64.45807423 -6037.5 -5912.6
## 8    24  85  8    3 -5968.100  41.04826428 -6013.7 -5934.1
## 9    24  85  9    3 -6057.900  72.99321886 -6129.4 -5983.5
## 10   24  85 10    3 -6151.167  88.97922979 -6248.1 -6073.2
## 11   24  85 11    3 -6155.533 162.23440860 -6312.5 -5988.5
## 12   24  85 12    3 -6082.367 131.75876947 -6202.9 -5941.7

With the data loaded we can now find K.

Finding K

To find K, we’ll use the Evanno Method. The Evanno mehtod uses an ad hoc statistic, ∆K, based on the rate of change in the log probability of data between successive K values, to estimate the best value for K.

evanno <-evannoMethodStructure(fDF)
plot(evanno$k, evanno$deltaK, type="l", xlab="K", ylab="∆ K")

EvannoK

According to the Evanno Method, the best value for K is 3 followd by 4 and then 6. This means that there is not enough genetic variation between the 6 breeds to assign them distinct populations. Can you guess what the population assignments are based on our PCA? I’m going to bet that wolves are in their own population and that the Inuit dog breeds share a population as well as the Husky breeds. Malamutes could belong to either the Inuit population or the Husky clade… Those are my guesses, anyway.
We can use CLUMPP to actually see these population assignments.

CLUMPP

According to documentation, “CLUMPP permutes the clusters output by independent runs of clustering programs such as structure, so that they match up as closely as possible.” In other words, what we’ll be doing is alligning and merging the runs from STRUCTURE to come up with one consensus (it’s really more of an average) population assignment.

This will create folders in our directory for each value of K

clumppExport(dfq,prefix="dogsClumpp",parammode=3)

Now, we will look at all 3 runs for K = 3, the optimal value for K, before any CLUMPPing

#Access the file "paramfile" file in the folder we just made for K = 3:
noAlign = readQ(files = "./dogsClumpp_K3/dogsCLumpp_K3-combined.txt")
#Add color and labels:
col = c("tomato","steelblue1","lightgoldenrod","plum","seagreen1", "orange")
dogPops = c("CanID","CanID","CanID","CanID","CanID","CanID","CanID","CanID","CanID",
            "CanID","CanID","CanID","CanID","CanID","CanID","CanID","CanID","CanID",
            "CanID","CanID","CanID","CanID","CanID","CanID","CanID","GreenID",
            "GreenID","GreenID","GreenID","GreenID","GreenID","GreenID","AKHusky",
            "AKHusky","AKHusky","AKHusky","AKHusky","AKHusky",
            "AKHusky","AKHusky","AKHusky","AKHusky","AKHusky","AKHusky","AKHusky",
            "AKHusky","AKHusky","AKHusky","Malamute","Malamute","Malamute","Malamute",
            "Malamute","Malamute","Malamute","Malamute","Malamute","Malamute","Malamute",
            "Malamute","Malamute","Malamute","Malamute","SBHusky","SBHusky","SBHusky",
            "SBHusky","SBHusky","SBHusky","SBHusky","SBHusky","SBHusky","SBHusky",
            "Wolf","Wolf","Wolf","Wolf","Wolf","Wolf","Wolf","Wolf","Wolf","Wolf",
            "Wolf","Wolf")
dogPopLabels = as.data.frame(dogPops, stringsAsFactors = FALSE)
#Figure outputted in directory
plotQ(qlist=noAlign,imgoutput="join",grplab = dogPopLabels,grplabangle = 45,
      grplabpos = 0.5, grplabheight = 1.2, clustercol = col, showsp = FALSE)

Structure1

Now Align them with CLUMPP. The CLUMPP exe needs to be in the same folder as our file, first.

targetFolder = paste("./dogsClumpp_K3","/CLUMPP",sep="")
#CLUMPP exe location:
file.copy("/Users/macuser/Downloads/CLUMPP_MacOSX.1.1.2/CLUMPP",targetFolder)
# Move our R processing to that folder
setwd("./dogsClumpp_K3")
#Run CLUMPP
system("./CLUMPP")
#Revert dir change
setwd("..")

Plot the Aligned runs using the “…combined-aligned.txt” file:

withAlign = readQ("./dogsClumpp_K3/dogsClumpp_K3-combined-aligned.txt")
plotQ(qlist=withAlign,imgoutput="join",grplab = dogPopLabels,grplabangle = 45,
      grplabpos = 0.5, grplabheight = 1.2, clustercol = col, showsp = FALSE)

Structure2

Now we can see each run for K = 3 aligned, next, we’ll merge them into 1 average run:

merged = readQ("./dogsClumpp_K3/dogsClumpp_K3-combined-merged.txt")
plotQ(qlist=merged,imgoutput="sep",grplab = dogPopLabels,grplabangle = 45,
      grplabpos = 0.5, grplabheight = 1.2, clustercol = col, showsp = FALSE)

Structure3

We finally have our population assignments for K = 3, but I want to compart it with K = 6, the number of breeds in our dataset. So I’ll repeat some of the previous steps.

targetFolder = paste("./dogsClumpp_K6","/CLUMPP",sep="")
#CLUMPP exe location:
file.copy("/Users/macuser/Downloads/CLUMPP_MacOSX.1.1.2/CLUMPP",targetFolder)
# Move our R processing to that folder
setwd("./dogsClumpp_K6")
#Run CLUMPP
system("./CLUMPP")
#Revert dir change
setwd("..")
merged6 = readQ("./dogsClumpp_K6/dogsClumpp_K6-combined-merged.txt")
plotQ(qlist=withAlign,imgoutput="join",grplab = dogPopLabels,grplabangle = 45,
      grplabpos = 0.5, grplabheight = 1.2, clustercol = col, showsp = FALSE)

Structure4

Thoughts

Now that we have our results, we are at the endgame. Here’s some of my thoughts on what we found:

  • For K = 3, Malamutes were assigned primarily to the Husky group. Looking at the K = 6 plot, however, I’m kind of surprised that they aren’t in their own group, making K = 4 the best fit, though K = 4 was the second best candidate for K.
  • Apart from the lone wolf that I suspected was an outlier based on PCA, there is very little admixture from other groups in their own. The converse is slightly less true- that is, Huskies have a slight amount of wolf in their DNA.
  • There is a high degree of admixture between Alaskan and Siberian Huskies for K = 6. They might as well not be different breeds. In fact, after some Googling, Alaskan Huskies aren’t recognized as a breed by the AKC.
  • The PCA analysis is highly consistent with our STRUCTURE findings.

References

  • STRUCTURE: Pritchard K., et al. “Inference of Population Structure Using Multilocus Genotype Data.” GSoA. February 2000.
  • CLUMPP: https://web.stanford.edu/group/rosenberglab/clumpp.html
  • Data Source: Brown S.K, et al. “Using multiple markers to elucidate the ancient, historical and modern relationships among North American Arctic dog breeds.” Nature Heredity, June 2015.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google+ photo

You are commenting using your Google+ account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

w

Connecting to %s