Common variants and rare variants
Since the structure of DNA was first described in 1953 by Watson and Crick1, based heavily on work by Rosalind Franklin2,3, interest in mapping the human genetic code and disentangling the role of genetics in various diseases has increased drastically. Early efforts focused primarily on a small number of markers due to both methodological and practical constraints. These initial investigations used linkage analysis to identify the chromosomal location of a number of disease genes4. Linkage analysis relies on the observation that genes located close together on a chromosome tend to be inherited together during meiosis, because they are less likely to be separated by recombination4,5. These early studies were fairly successful, particularly for Mendelian disorders such as Huntington’s disease6 and cystic fibrosis7, where a small number of genes are entirely responsible for disease etiology and progression8.
Following the early progress of linkage studies, interest in so-called “candidate gene” studies increased over time, with varying degrees of success. In candidate gene studies, genes were selected based on a priori knowledge of their biological role and the hypothesized role of that biological process in disease etiology or progression8. A major limitation of this type of study is that it is very susceptible to publication bias and Type I error9. One of the most widely reported failures of the candidate gene approach is the 5-HTTLPR polymorphism and its supposed association with depression9. Despite a large number of articles that claimed to identify a link between the 5-HTTLPR polymorphism and depression traits, these findings were poorly replicated by gene-agnostic methods10,11. A majority of the scientific community has since moved away from candidate gene studies in favor of data-driven approaches such as the genome-wide association study (GWAS).
A GWAS identifies genetic loci associated with a trait by assessing the differences in allele frequency between individuals with different phenotypes. A GWAS can be designed either as a case-control study with a binary phenotype, such as diagnosis or no diagnosis12, or as a study of a continuous phenotype, e.g. blood pressure13. In a case-control design the associations are tested with a logistic regression model, whereas continuous traits call for a linear regression model14. It is common either to regress covariates such as age, age², and sex out of the phenotype prior to running the GWAS or to add them as covariates to the model directly15. Because of linkage disequilibrium, one also commonly identifies multiple trait-associated genetic variants located close together on the chromosome; this is a feature of genetic studies that has to be accounted for or can be leveraged, depending on the goal of the analyses14. A classical GWAS applies the linear or logistic model to each single-nucleotide polymorphism (SNP) individually, which requires stringent procedures to control the Type I error rate. Previous genetic studies have shown that the human genome contains approximately 1 million haploblocks (genetic regions with little recombination)16. Current convention holds that a Bonferroni correction for the 1 million haploblocks in the genome is adequate to limit the rate of false positives without substantially increasing the rate of false negatives, which is how the common threshold of 5e-8 (0.05/1e6) became the standard in the field14,16.
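To make the per-SNP testing and the Bonferroni threshold concrete, the following is a minimal sketch rather than a description of any particular GWAS software. It assumes an additive model on allele dosages (0, 1, or 2), a continuous phenotype, no covariates, and a normal approximation to the test statistic, which is reasonable only for large samples. The function names, the simulated effect size of 0.5, and the minor allele frequency of 0.3 are illustrative assumptions.

```python
import math
import random


def bonferroni_threshold(alpha: float, n_tests: int) -> float:
    """Per-test significance threshold under a Bonferroni correction."""
    return alpha / n_tests


def snp_association(dosages, phenotype):
    """Simple linear regression of a continuous phenotype on allele dosage.

    Returns (beta, standard_error, p_value). The two-sided p-value uses a
    normal approximation, an assumption that only holds for large samples.
    """
    n = len(dosages)
    mean_g = sum(dosages) / n
    mean_y = sum(phenotype) / n
    sxx = sum((g - mean_g) ** 2 for g in dosages)
    if sxx == 0:
        raise ValueError("SNP is monomorphic in this sample")
    sxy = sum((g - mean_g) * (y - mean_y) for g, y in zip(dosages, phenotype))
    beta = sxy / sxx
    intercept = mean_y - beta * mean_g
    rss = sum((y - intercept - beta * g) ** 2 for g, y in zip(dosages, phenotype))
    se = math.sqrt(rss / (n - 2) / sxx)
    z = beta / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided P(|Z| > |z|)
    return beta, se, p_value


def simulate_dosages(rng, n, maf):
    """Draw dosages as the sum of two independent alleles (hypothetical data)."""
    return [(rng.random() < maf) + (rng.random() < maf) for _ in range(n)]


rng = random.Random(0)
n_individuals = 2000

# One SNP with a simulated effect on the trait and one unrelated SNP.
causal = simulate_dosages(rng, n_individuals, maf=0.3)
null = simulate_dosages(rng, n_individuals, maf=0.3)
phenotype = [0.5 * g + rng.gauss(0.0, 1.0) for g in causal]

threshold = bonferroni_threshold(0.05, 1_000_000)  # 0.05 / 1e6
beta_causal, se_causal, p_causal = snp_association(causal, phenotype)
beta_null, se_null, p_null = snp_association(null, phenotype)

print(f"threshold:  {threshold:.1e}")
print(f"causal SNP: beta={beta_causal:.3f}, p={p_causal:.2e}")
print(f"null SNP:   beta={beta_null:.3f}, p={p_null:.2e}")
```

In a real analysis the same test would be repeated for every SNP, with covariates such as age and sex included in the model, and a logistic model would replace the linear one for a case-control phenotype.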
To run a GWAS with adequate statistical power, researchers require a large sample, often several thousand individuals, each of whom has been genotyped or had their genome sequenced. Needless to say, when these approaches were in their infancy, they were incredibly demanding in terms of resources. In the early years after the first human genome was sequenced early this century17, the price of sequencing a single human genome could exceed $10,000,000 USD, whereas recently the cost has dropped below $1,000 USD, outpacing even Moore’s Law18. The technological advances that made genotyping possible and pushed down its cost over the past two decades have allowed for larger and larger sample sizes. Whereas a genetic study investigating, for instance, Huntington’s disease could identify the causal locus with a small number of participants, a much larger sample is required to identify the genetic loci associated with more complex traits, such as most psychiatric disorders.