Common variants and rare variants

Since the first description of the structure of DNA in 1953 by Watson and Crick¹, based heavily on work by Rosalind Franklin^2,3, interest in mapping the human genetic code and disentangling the role of genetics in various diseases has increased drastically. Early efforts focused primarily on a small number of markers due to both methodological and practical constraints. These initial investigations used linkage analysis to identify the chromosomal location of a number of disease genes⁴. Linkage analysis relies on the observation that genes located close together on a chromosome tend to be inherited together during meiosis because they have a lower chance of being separated by recombination^4,5. These early studies were particularly successful for Mendelian disorders, such as Huntington’s disease⁶ and cystic fibrosis⁷, where a small number of genes are entirely responsible for disease etiology and progression⁸.

Building on the early progress of linkage studies, interest in so-called “candidate gene” studies grew over time, with varying degrees of success. In candidate gene studies, genes were selected based on a priori knowledge of their biological role and the hypothesized role of that biological process in disease etiology or progression⁸. A major limitation of this type of study is that it is very susceptible to publication bias and type I error⁹. One of the most widely reported failures of the candidate gene approach is the 5-HTTLPR polymorphism and its supposed association with depression⁹. Despite a large number of articles that claimed to identify a link between the 5-HTTLPR polymorphism and depression traits, these findings replicated poorly in gene-agnostic analyses^10,11. A majority of the scientific community has since moved away from candidate gene studies in favor of sample- and data-driven approaches such as the genome-wide association study (GWAS).

A GWAS identifies genetic loci associated with a trait by assessing differences in allele frequency between individuals with different phenotypes. A GWAS can be run either as a case-control study with a binary phenotype, such as diagnosis or no diagnosis¹², or on a continuous trait, e.g. blood pressure¹³. In a case-control design the associations are tested with a logistic regression model, whereas continuous traits call for a linear regression model¹⁴. It is common either to regress covariates such as age, age², and sex out of the phenotype prior to running the GWAS or to add them directly to the model as covariates¹⁵. One also commonly identifies multiple associated genetic variants located close together on the chromosome due to linkage disequilibrium, a feature of genetic studies that has to be accounted for or can be leveraged, depending on the goal of the analyses¹⁴. A classical GWAS applies the linear or logistic model to each SNP individually, which requires a stringent multiple-testing correction to control the type I error rate. Previous genetic studies have shown that the human genome contains approximately 1 million haploblocks (genetic regions with little recombination)¹⁶. Current convention holds that a Bonferroni correction for these 1 million haploblocks is adequate to limit false positives without substantially increasing false negatives, which is how the common threshold of 5 × 10⁻⁸ (0.05 / 10⁶) became the standard in the field^14,16.
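The per-SNP testing described above can be illustrated with a minimal sketch for a continuous trait. It is not the pipeline of any particular GWAS tool: the function names (`simulate`, `snp_association`, `run_gwas`), the simulated data, and the effect sizes are all assumptions made for illustration. The p-value uses a normal approximation to the t statistic, which should only be relied on for large samples. A case-control analysis would swap the linear model for a logistic one.

```python
import math
import numpy as np

# Bonferroni threshold for roughly 1 million independent haploblocks.
GENOME_WIDE_ALPHA = 0.05 / 1e6


def simulate(n=5000, m=200, causal=0, effect=0.5, seed=0):
    """Simulate toy genotypes, covariates, and a continuous phenotype (illustrative only)."""
    rng = np.random.default_rng(seed)
    maf = rng.uniform(0.1, 0.5, size=m)                          # minor allele frequencies
    genotypes = rng.binomial(2, maf, size=(n, m)).astype(float)  # allele counts 0/1/2
    age = rng.uniform(20, 80, size=n)
    sex = rng.integers(0, 2, size=n).astype(float)
    phenotype = effect * genotypes[:, causal] + 0.02 * age + 0.3 * sex + rng.normal(size=n)
    covariates = np.column_stack([age, age ** 2, sex])
    # z-score the covariates to keep the design matrix well conditioned
    covariates = (covariates - covariates.mean(axis=0)) / covariates.std(axis=0)
    return genotypes, phenotype, covariates


def snp_association(genotype, phenotype, covariates):
    """Linear regression of the phenotype on one SNP plus covariates.

    Returns (beta, p_value) for the SNP term.
    """
    n = len(phenotype)
    # Design matrix: intercept, SNP allele count, covariates.
    X = np.column_stack([np.ones(n), genotype, covariates])
    coef, *_ = np.linalg.lstsq(X, phenotype, rcond=None)
    resid = phenotype - X @ coef
    sigma2 = resid @ resid / (n - X.shape[1])        # residual variance
    cov = sigma2 * np.linalg.inv(X.T @ X)            # covariance of the estimates
    z = coef[1] / math.sqrt(cov[1, 1])
    p_value = math.erfc(abs(z) / math.sqrt(2))       # two-sided, normal approximation
    return coef[1], p_value


def run_gwas(genotypes, phenotype, covariates, alpha=GENOME_WIDE_ALPHA):
    """Test every SNP separately and flag those below the genome-wide threshold."""
    results = [snp_association(genotypes[:, j], phenotype, covariates)
               for j in range(genotypes.shape[1])]
    betas = np.array([r[0] for r in results])
    pvals = np.array([r[1] for r in results])
    return betas, pvals, pvals < alpha
```

Calling `run_gwas(*simulate())` returns the per-SNP effect estimates, their p-values, and a boolean mask of SNPs passing 5 × 10⁻⁸. The sketch adds the covariates to the model directly. Regressing them out of the phenotype beforehand is the alternative mentioned above.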

To run a GWAS with adequate statistical power, researchers require a large sample, often several thousand individuals, each of whom has had their genome sequenced. Needless to say, when these approaches were in their infancy, they were incredibly demanding in terms of resources. In the years immediately after the first human genome was sequenced at the start of this century¹⁷, the price of sequencing a single human genome could exceed $10,000,000 USD, whereas recently the cost has dropped below $1,000 USD, outpacing even Moore’s Law¹⁸. The technological advances that made genotyping possible and pushed down its cost over the past two decades have allowed for ever larger sample sizes. Whereas a genetic study investigating, for instance, Huntington’s disease could identify the causal locus with a small number of participants, a larger sample size is required to identify the genetic loci associated with more complex traits such as most psychiatric disorders.

1. Watson JD, Crick FHC. Molecular Structure of Nucleic Acids: A Structure for Deoxyribose Nucleic Acid. Nature [Internet]. 1953 Apr 1;171(4356):737–8. Available from: https://doi.org/10.1038/171737a0

2. Franklin RE, Gosling RG. Evidence for 2-Chain Helix in Crystalline Structure of Sodium Deoxyribonucleate. Nature [Internet]. 1953 Jul 1;172(4369):156–7. Available from: https://doi.org/10.1038/172156a0

3. Wilkins M. The third man of the double helix: The autobiography of Maurice Wilkins [Internet]. Oxford: Oxford University Press; 2003. Available from: http://www.loc.gov/catdir/enhancements/fy0620/2004296803-t.html

4. Pulst SM. Genetic Linkage Analysis. Archives of Neurology [Internet]. 1999 Jun 1 [cited 2022 May 11];56(6):667–72. Available from: https://doi.org/10.1001/archneur.56.6.667

5. Morton NE. Sequential tests for the detection of linkage. Am J Hum Genet [Internet]. 1955 Sep;7(3):277–318. Available from: https://pubmed.ncbi.nlm.nih.gov/13258560

6. Gusella JF, Wexler NS, Conneally PM, Naylor SL, Anderson MA, Tanzi RE, et al. A polymorphic DNA marker genetically linked to Huntington’s disease. Nature [Internet]. 1983 Nov 1;306(5940):234–8. Available from: https://doi.org/10.1038/306234a0

7. Tsui LC, Buchwald M, Barker D, Braman JC, Knowlton R, Schumm JW, et al. Cystic fibrosis locus defined by a genetically linked polymorphic DNA marker. Science. 1985 Nov 29;230(4729):1054–7.

8. Abdellaoui A, Verweij KJH. Genetica en psychiatrie. Tijdschr Psychiatr. 2022;64(5):260–5.

9. Border R, Johnson EC, Evans LM, Smolen A, Berley N, Sullivan PF, et al. No Support for Historical Candidate Gene or Candidate Gene-by-Interaction Hypotheses for Major Depression Across Multiple Large Samples. AJP [Internet]. 2019 May 1 [cited 2022 May 10];176(5):376–87. Available from: https://doi.org/10.1176/appi.ajp.2018.18070881

10. Bosker FJ, Hartman CA, Nolte IM, Prins BP, Terpstra P, Posthuma D, et al. Poor replication of candidate genes for major depressive disorder using genome-wide association data. Molecular Psychiatry [Internet]. 2011 May 1;16(5):516–32. Available from: https://doi.org/10.1038/mp.2010.38

11. Johnson EC, Border R, Melroy-Greif WE, de Leeuw CA, Ehringer MA, Keller MC. No Evidence That Schizophrenia Candidate Genes Are More Associated With Schizophrenia Than Noncandidate Genes. Biological Psychiatry [Internet]. 2017 Nov 15;82(10):702–8. Available from: https://www.sciencedirect.com/science/article/pii/S0006322317317729

12. Mullins N, Forstner AJ, O’Connell KS, Coombes B, Coleman JRI, Qiao Z, et al. Genome-wide association study of more than 40,000 bipolar disorder cases provides new insights into the underlying biology. Nature Genetics [Internet]. 2021 Jun 1;53(6):817–29. Available from: https://doi.org/10.1038/s41588-021-00857-4

13. Andreassen OA, McEvoy LK, Thompson WK, Wang Y, Reppe S, Schork AJ, et al. Identifying common genetic variants in blood pressure due to polygenic pleiotropy with associated phenotypes. Hypertension [Internet]. 2014/01/06 ed. 2014 Apr;63(4):819–26. Available from: https://pubmed.ncbi.nlm.nih.gov/24396023

14. Uffelmann E, Huang QQ, Munung NS, de Vries J, Okada Y, Martin AR, et al. Genome-wide association studies. Nature Reviews Methods Primers [Internet]. 2021 Aug 26;1(1):59. Available from: https://doi.org/10.1038/s43586-021-00056-9

15. Pirinen M, Donnelly P, Spencer CCA. Including known covariates can reduce power to detect genetic effects in case-control studies. Nature Genetics [Internet]. 2012 Aug 1;44(8):848–51. Available from: https://doi.org/10.1038/ng.2346

16. Altshuler D, Donnelly P, The International HapMap Consortium. A haplotype map of the human genome. Nature [Internet]. 2005 Oct 1;437(7063):1299–320. Available from: https://doi.org/10.1038/nature04226

17. Lander ES, Linton LM, Birren B, Nusbaum C, Zody MC, Baldwin J, et al. Initial sequencing and analysis of the human genome. Nature. 2001 Feb 15;409(6822):860–921.

18. Wetterstrand KA. DNA Sequencing Costs: Data from the NHGRI Genome Sequencing Program (GSP) [Internet]. [cited 2022 May 11]. Available from: www.genome.gov/sequencingcostsdata