A genome wide association study (GWAS) is a statistical analysis that identifies links between genetic variances and a given phenotype. This could be a disease, morphology, intelligence, etc. Essentially, we are trying to find regions on chromosomes across the entire genome that may be causal or indicative of a trait of interest.
There are many statistical methods for modelling GWA, but in this case, I will stick a simple linear model- a Wald test- using the software Gemma.
The data necessary to conduct a Wald test is fairly simple – you need genetic data preferably in plink, bimbam, or VCF format, and a phenotypic measurement. Databases like NCBI’s dbGap and the European Genome Phenome archive have what we need, but they restrict access to researchers only. In this example, I’ll use some data I found from a study published in Nature that is available here. It contains a some other data, but the one I will use is the Chinese pharmacogenic data in the ./Public/Genomics folder (arbitrarily selected tbh). This data is in plink binary format, thus containing a .bim (binary genomic data), .bed (SNP site data), and .fam (phenotype and familial data).
The data contains 4,032 SNPS of 106 Chinese individuals. The phenotype is a binary classification for “adverse drug reaction.” One thing to note is that Gemma reads the phenotype from the 6th column of the .fam file, but the data comes with it in the 5th column, so I wrote a python script to flip it. Another important detail is that the data should be one-hot encoded, i.e. 0 and 1 instead of 1 and 2 as it comes downloaded.
import pandas as pd df =pd.read_table('/Users/macuser/Desktop/GWASStuff/China_Pharm.fam',sep=' ',header=None) print(df.head(10)) df = df[[0,1,2,3,5,4]] print(df.head(10)) df.to_csv('/Users/macuser/Desktop/China_PharmFam.csv',sep=' ',index=False,header=False) # Change the csv extension to a text extension manually
0 1 2 3 4 5 0 M11072707 M11072707 0 0 -9 2 1 M11072306 M11072306 0 0 -9 2 2 M11081312 M11081312 0 0 -9 2 3 M11061605 M11061605 0 0 -9 1 4 M11071301 M11071301 0 0 -9 2 5 M11081715 M11081715 0 0 -9 2 6 M11080306 M11080306 0 0 -9 2 7 M11072311 M11072311 0 0 -9 2 8 M11081304 M11081304 0 0 -9 2 9 M11060903 M11060903 0 0 -9 1 0 1 2 3 5 4 0 M11072707 M11072707 0 0 2 -9 1 M11072306 M11072306 0 0 2 -9 2 M11081312 M11081312 0 0 2 -9 3 M11061605 M11061605 0 0 1 -9 4 M11071301 M11071301 0 0 2 -9 5 M11081715 M11081715 0 0 2 -9 6 M11080306 M11080306 0 0 2 -9 7 M11072311 M11072311 0 0 2 -9 8 M11081304 M11081304 0 0 2 -9 9 M11060903 M11060903 0 0 1 -9
After cleaning the data, move to the folder that contains the data of interest, call the Gemma executable from wherever its stored, and run the following line of code:
To break it down:
- -bfile loads in plink binary data follow this with the prefix of the data (that means all three files have to have the same prefix)
- -lm performs a linear model on the data, 1 denotes a Wald test
- -o denotes output file prefix for results which will include a text file with p-values and a log file
This will perform a Wald test on each site to test if it is significantly associated with a given phenotype. In this case, adverse reaction to drugs. It will output a text file that contains p-values for every SNP.
A Manhattan plot is a type of scatter plot that visualizes the p-value of a SNP, or rather its -log10 against its position on the chromosome. Here is a python script to generate a Manhattan plot with matplotlib that I tweaked from a stack user, here:
import pandas as pd import matplotlib.pyplot as plt import numpy as np pd.set_option('expand_frame_repr', False) df = pd.read_table('/Users/macuser/Desktop/GWASStuff/output/China_Pharm_Out.assoc.txt', sep = '\t') # print(df.head(10)) df['p_adj'] = -np.log10(df.p_wald) df.chr = df.chr.astype('category') # print(df.head(10)) df['ind'] = range(len(df)) df_grouped = df.groupby(('chr')) # print(df_grouped.head(10)) fig = plt.figure() ax = fig.add_subplot(111) colors = ['#E24E42','#008F95'] x_labels =  x_labels_pos =  for num, (name, group) in enumerate(df_grouped): group.plot(kind='scatter', x='ind', y='p_adj',color=colors[num % len(colors)], ax=ax, s=5) x_labels.append(name) x_labels_pos.append((group['ind'].iloc[-1] - (group['ind'].iloc[-1] - group['ind'].iloc)/2)) ax.set_xticks(x_labels_pos) ax.set_xticklabels(x_labels) ax.set_xlim([0, len(df)]) ax.set_ylim([0, 6]) ax.set_xlabel('Chromosome') plt.xticks(fontsize = 8,rotation=60) plt.yticks(fontsize = 8) # xticks = ax.xaxis.get_major_ticks() # xticks.set_visible(False) plt.show()
As you can see, there are some areas around the beginning of the chromosome 7 and in the middle of the chromosome 6 that seem to have a larger relative contribution to adverse drug reactions (tbh a p-value of 0.001 is not that significant in GWAS terms- you’re really looking for something along the order of 10^-6). The significant SNPs at chromosome 7 correspond to the gene ABCB5 – a protein that participates in ATP-dependent transmembrane transport.
Alright, that’s all for today.
- Sud A, et al. Genome-wide association studies of cancer: current insights and future perspectives. Nature Reviews Cancer 17, pages 692–704 (2017).
- Liu F, et al. A Genome-Wide Association Study Identifies Five Loci Influencing Facial Morphology in Europeans. PLOS Genetics 8(9): e1002932.
Davies G, Tenesa A, Payton A, et al. Genome-wide association studies establish that human intelligence is highly heritable and polygenic. Molecular psychiatry. 2011;16(10):996-1005.
- Hayes B. Overview of Statistical Methods for Genome-Wide Association Studies (GWAS). Methods Mol Bio 2013;1019:149-69.
- Xiang Zhou and Matthew Stephens (2012). Genome-wide efficient mixed-model analysis for association studies. Nature Genetics. 44: 821–824.
- Woie-Yuh S, et al. Establishing multiple omics baselines for three Southeast Asian populations in the Singapore Integrative Omics Study. Nature Comms 8: 653 2017.
- Brunham, L. R. et al. Pharmacogenomic diversity in Singaporean populations and Europeans. Pharmacogenomics J. 14, 555–563 (2014).