10 Years of GWAS Discovery: Biology, Function, and Translation

Main Text

Introduction

Here, we review the remarkable range of discoveries that genome-wide association studies (GWASs) have facilitated in population and complex-trait genetics, the biology of diseases, and translation toward new therapeutics. In the introductory sections, we provide a background for this review, summarize its scope and layout, and revisit the scientific rationale for GWASs. We then review general conclusions that can be drawn from GWAS discoveries across a wide range of traits. We subsequently highlight more specific results of discoveries and methods on the path from GWAS to biology and review progress in three exemplar diseases, namely type 2 diabetes (T2D [MIM: 125853]), auto-immune diseases (MIM: 109100), and schizophrenia (MIM: 181500). We end the review with a number of sections on the limitations of current experimental designs and possible ways to overcome these and a prediction on the future of GWASs for human traits.

Background

Five years ago, a number of us reviewed (and gave our opinion on) the first 5 years of discoveries that came from the experimental design of the GWAS.1 That review sought to set the record straight on the discoveries made by GWASs because at that time, there was still a level of misunderstanding and distrust about the purpose of and discoveries made by GWASs. There is now much more acceptance of the experimental design because the empirical results have been robust and overwhelming, as reviewed here.

Scope and Framework

Data generated from genome-wide SNP surveys have been exploited for addressing many scientific questions other than SNP-trait associations. We do not have the space to give adequate coverage of discoveries in evolutionary and population genetics, nor can we fully cover the many developments in analytic methods, although we will briefly mention some recent developments. The scope of our review is novel discoveries on the genetics and resulting biology of common adult diseases (auto-immune, metabolic, and psychiatric disease in particular) and their risk factors and the wider implications of those discoveries. GWAS discoveries have affected, and continue to affect, a wide variety of diseases and traits, many of which have been covered in other in-depth reviews. Our focus is on associations between complex traits and SNPs, but we note that there have been many reported associations between traits and copy-number variants (CNVs) and that there are known mechanisms by which CNVs can be associated with disease.2 Results from other genome-wide surveys, including exome and whole-genome sequencing (WGS) studies, are not reviewed here.

GWAS Rationale and Scientific Basis

The GWAS is an experimental design used to detect associations between genetic variants and traits in samples from populations. The primary goal of these studies is to better understand the biology of disease, under the assumption that a better understanding will lead to prevention or better treatment. The path from GWAS to biology is not straightforward because an association between a genetic variant at a genomic locus and a trait is not directly informative with respect to the target gene or the mechanism whereby the variant is associated with phenotypic differences. However, as reviewed herein, new types of data, new molecular technologies, and new analytical methods have provided opportunities to bridge the knowledge gap from sequence to consequence. GWASs have also been successfully implemented for better defining the relative role of genes and the environment in disease risk, assisting in risk prediction (enabling preventative and personalized medicine), and investigating natural selection and population differences (Table 1).

Table 1. The Role of GWAS SNP Arrays in Human Genetic Discoveries
Analysis	Purpose	Discoveries
GWAS	detecting trait-SNP associations	∼10,000 robust associations with diseases and disorders, quantitative traits, and genomic traits
Genome-wide CNV analysis	detecting trait-CNV associations	hundreds of associations with diseases and disorders
Genome-wide assessment of LD	quantifying genome architecture	large variation in LD in the genome
Estimation of SNP heritability^a	genetic architecture	large proportion of genetic variation captured by common SNPs
Estimation of genetic correlation^a	detecting and quantifying pleiotropy	pleiotropy is ubiquitous
Polygenic risk scores^a	detecting pleiotropy; validating GWAS discoveries	out-of-sample prediction works as expected; detection of novel trait associations
Mendelian randomization^a	testing causal relationships	replication of known causal relationships; empirical evidence of observational associations that are not causal
Population differences in allele frequencies	reconstructing human population history; detecting selection	genetic structure can mimic geographical structure; evidence of natural selection
Trait GWAS with -omics GWAS^a	fine-mapping; detecting target genes; function	two-thirds of GWAS-associated loci implicate a gene that is not the nearest gene to the most associated SNP

^a These analyses can be performed with GWAS summary statistics.

GWASs to date rely on and exploit linkage disequilibrium (LD), the correlation structure that exists among DNA variants in the current human genome as a result of historical evolutionary forces, particularly finite population size, mutation, recombination rate, and natural selection. The statistical power to detect associations between DNA variants and a trait depends on the experimental sample size, the distribution of effect sizes of (unknown) causal genetic variants that are segregating in the population, the frequency of those variants, and the LD between observed genotyped DNA variants and the unknown causal variants. Therefore, the potential of a GWAS to succeed for a particular trait or disease depends on (1) how many loci affecting the trait segregate in the population, (2) the joint distribution of effect size and allele frequency at those loci (sometimes called genetic architecture), (3) the experimental sample size, (4) the panel of genome-wide variants that are used in the GWAS, and (5) how heterogeneous the trait or disease being studied is. The last relates to both the biology of the trait and the ability to diagnose or measure it with precision.

If the genetic architecture of a particular trait or disease were known, the optimum experiments could be designed to detect specific variants. However, despite many theoretical studies on the likely relationship between allele frequency and trait loci, until the onset of GWASs, there was very little empirical data to validate predictions from theoretical models.

GWASs have been facilitated by the development of relatively inexpensive SNP arrays. Commonly used SNP arrays vary in their content, but they broadly contain 200,000 to more than 2,000,000 SNPs. To date, most genetic variants that have been surveyed through GWASs are common in the population, in that they have a minor allele frequency (MAF) typically larger than 1%. For the purpose of this review, we arbitrarily define common variants to have MAF ≥ 1% and rare variants to have MAF < 1%. The GWAS as an experimental design is more than just an array-based study of common variants. For example, association studies using WGS data are also GWASs. There is a continuum from GWASs based on SNP arrays to those using WGS, and the only difference (apart from cost) is the density of coverage of variation in the genome and the MAF spectrum of the variants.

LD between genetic variants is commonly measured as a squared correlation (r2) because this measure is linear in the sample size required for detecting association between an observed genotyped and an unobserved causal variant. LD r2 can be large only if the allele frequencies at the two loci match,3, 4 and this is the reason why GWASs from common SNP arrays are not powerful enough to detect associations due to rare causal variants (in addition to sample-size considerations; see below). Statistical imputation5, 6, 7 of unobserved variants can recover some of the information lost because of imperfect LD between observed genotypes and unobserved causal variants. Imputation is enabled by the fact that the genotypes of unobserved genetic variants can be predicted by the haplotypes inferred from multiple observed SNPs and the haplotypes observed from a fully sequenced reference panel.
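
The constraint that allele frequencies place on r2 can be illustrated numerically. The sketch below is not taken from the article: it assumes two biallelic loci and uses the standard bounds on the disequilibrium coefficient D (haplotype frequencies cannot be negative) to compute the largest r2 that a pair of allele frequencies permits. The function name `max_ld_r2` is introduced here for illustration only.

```python
def max_ld_r2(p_a: float, p_b: float) -> float:
    """Largest possible LD r^2 between two biallelic loci.

    p_a, p_b: frequencies of one designated allele at each locus.
    Illustrative helper, not part of any published software.
    """
    if not (0.0 < p_a < 1.0 and 0.0 < p_b < 1.0):
        raise ValueError("allele frequencies must lie strictly between 0 and 1")
    # r^2 = D^2 / (p_a (1 - p_a) p_b (1 - p_b))
    denom = p_a * (1.0 - p_a) * p_b * (1.0 - p_b)
    # D is bounded because all four haplotype frequencies must be >= 0.
    d_pos = min(p_a * (1.0 - p_b), (1.0 - p_a) * p_b)  # largest positive D
    d_neg = min(p_a * p_b, (1.0 - p_a) * (1.0 - p_b))  # largest |D| for negative D
    d_max = max(d_pos, d_neg)
    return d_max * d_max / denom


# A common array SNP (frequency 0.30) paired with causal variants of decreasing frequency.
for causal_freq in (0.30, 0.05, 0.01, 0.001):
    print(f"causal allele frequency {causal_freq}: max r2 = {max_ld_r2(0.30, causal_freq):.4f}")
```

Under these assumptions, a common SNP can be in complete LD (r2 = 1) only with a variant of matching frequency, and the attainable r2 falls steeply as the causal variant becomes rarer.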

In Figure 1, we summarize power calculations (see Appendix A for theory) of the minimum sample size required for detecting an association as a function of genotyping method (SNP array plus imputation or WGS), allele frequency, and effect size. Given that statistical imputation of variants as infrequent as 1/1,000 is still reasonably accurate,8 not much power of detection can be gained from WGS. For ultra-rare variants, for example, those with a frequency of 1/100,000, WGS can identify associations but only when the effect sizes of the polymorphisms (mutations) are very large. For example, for such rare variants with an effect size of 1 phenotypic standard deviation unit (about 7 cm for height or 5 BMI units), a sample size of more than one million is required (i.e., an allelic count of ten). For case-control studies of disease, the effect sizes of β = 0.01, 0.1, and 1 phenotypic standard deviation in Figure 1 correspond approximately to odds ratios of 1.02, 1.2, and 4, respectively, if we assume that both allele frequency and population prevalence are 0.01 or lower.9 Segregation of rare variants with very large effects might be observable in certain families, and then a family-based experimental design would be more efficient at locating and identifying such (near) monogenic traits. In addition, other genome-wide scans, such as whole-exome sequencing (WES) and WGS studies, allow testing for a burden of rare variants across shared functional units (e.g., genes) in a way that is not accessible to GWASs.
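
The dependence of sample size on allele frequency and effect size can be sketched with a simple approximation. This is not the calculation from Appendix A: it assumes a quantitative trait in an unselected sample, a single-variant additive test approximated by a normal test statistic, 80% power at the genome-wide threshold of 5 × 10−8, and a variance explained of 2p(1 − p)β² for a variant with frequency p and effect β (in phenotypic standard deviation units). The function name and the `r2` argument (LD between the tested and the causal variant) are introduced here for illustration.

```python
from statistics import NormalDist


def required_sample_size(freq: float, beta_sd: float, alpha: float = 5e-8,
                         power: float = 0.8, r2: float = 1.0) -> float:
    """Approximate sample size needed to detect one additive variant.

    freq    : allele frequency of the causal variant
    beta_sd : per-allele effect in phenotypic standard deviation units
    r2      : LD r^2 between the tested variant and the causal variant
              (1.0 means the causal variant itself is tested)
    Assumes the non-centrality parameter is roughly N * r2 * q2.
    """
    norm = NormalDist()
    z_alpha = -norm.inv_cdf(alpha / 2.0)   # two-sided significance threshold
    z_power = norm.inv_cdf(power)          # quantile matching the target power
    ncp_needed = (z_alpha + z_power) ** 2  # required non-centrality parameter
    q2 = 2.0 * freq * (1.0 - freq) * beta_sd ** 2  # variance explained by the variant
    return ncp_needed / (q2 * r2)


for freq, beta in ((0.3, 0.01), (0.01, 0.1), (0.001, 0.1), (1e-5, 1.0)):
    n = required_sample_size(freq, beta)
    print(f"frequency {freq:g}, effect {beta:g} SD: N of about {n:,.0f}")
```

Under these assumptions, a variant with frequency 1/100,000 and an effect of 1 standard deviation needs a sample on the order of two million, in line with the "more than one million" quoted above. The exact figure depends on the power target and on the approximations made.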

Results in Figure 1 are based on unselected population samples or population-based case-control samples and the detection of association between the trait or disease and the same genetic variant. Power will be increased for highly ascertained cases and enrichment of extreme cases or family-based studies with multiple cases of a rare disease. Furthermore, using WGS data for association analysis of rare variants has the potential to boost power through the combination of alleles of similar impact (e.g., via burden tests across a gene) under the assumption of multiple independent causative variants in a gene region. This strategy is justified from knowledge of monogenic disorders, where it is typical for different variants of the same gene to segregate with disease in different families. However, such tests also have challenges because prior knowledge about function or frequency is required for determining which alleles in a gene should be included in the burden count.

General Results

Complex Traits Are Highly Polygenic

GWAS results have now been reported for hundreds of complex traits across a wide range of domains, including common diseases, quantitative traits that are risk factors for disease, brain imaging phenotypes, genomic measures such as gene expression and DNA methylation, and social and behavioral traits such as subjective well-being and educational attainment. About 10,000 strong associations have been reported between genetic variants and one or more complex traits,10 where “strong” is defined as statistically significant at the genome-wide p value threshold of 5 × 10−8, excluding other genome-wide-significant SNPs in LD (r2 > 0.5) with the strongest association (Figure 2). GWAS associations have proven highly replicable, both within and between populations,11, 12 under the assumption of adequate sample sizes.
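
The counting rule above (genome-wide significance, then exclusion of significant SNPs in LD with a stronger association) resembles what is often called LD clumping. The following is a minimal sketch of a greedy version, assuming that p values and pairwise r2 values are already available. The SNP names, data, and function name are hypothetical, and the catalog behind the count of about 10,000 may apply different rules.

```python
def independent_signals(pvalues, ld_r2, p_threshold=5e-8, r2_threshold=0.5):
    """Greedy selection of lead SNPs among genome-wide-significant SNPs.

    pvalues : dict mapping SNP name -> association p value
    ld_r2   : dict mapping frozenset({snp_a, snp_b}) -> LD r^2
              (pairs that are absent are treated as r^2 = 0)
    Returns lead SNPs ordered from strongest to weakest association.
    """
    significant = sorted(
        (snp for snp, p in pvalues.items() if p < p_threshold),
        key=lambda snp: pvalues[snp],
    )
    leads = []
    for snp in significant:
        # Skip a SNP that is in LD with an already selected, stronger lead SNP.
        tagged = any(
            ld_r2.get(frozenset((snp, lead)), 0.0) > r2_threshold for lead in leads
        )
        if not tagged:
            leads.append(snp)
    return leads


# Hypothetical toy data: snp2 is a proxy for snp1; snp3 is a separate signal.
toy_p = {"snp1": 1e-12, "snp2": 3e-9, "snp3": 2e-8, "snp4": 1e-5}
toy_ld = {frozenset(("snp1", "snp2")): 0.9, frozenset(("snp1", "snp3")): 0.1}
print(independent_signals(toy_p, toy_ld))
```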

One unambiguous conclusion from GWASs is that for almost any complex trait that has been studied, many loci contribute to standing genetic variation. In other words, for most traits and diseases studied, the mutational target in the genome appears large so that polymorphisms in many genes contribute to genetic variation in the population. This means that, on average, the proportion of variance explained at the individual variants is small. Conversely, as predicted previously,1, 13 this observation implies that larger experimental sample sizes will lead to new discoveries, and that is exactly what has occurred over the last decade. For example, in 2009 the first genomic locus robustly associated with liability to schizophrenia was discovered with a sample of 3,000 cases;14 by 2014, this had increased to 108 with a sample size of 35,000 cases.15 Similarly, when the concept of “missing heritability”16 was introduced, it highlighted that in 2008, only 40 genome-wide-significant SNPs had been identified for height, and together these explained about 5% of heritability.17 In 2014, the number of associated SNPs had increased to ∼700, explaining more than 20% of heritability,18 and from the relationship between sample size and discoveries in the last 10 years, it is reasonable to predict that in the next few years, this will increase to thousands of variants, which will cumulatively explain a substantial proportion (e.g., more than one-third) of heritability.

The term polygenic describes the genetic architecture underpinning variation in a trait between individuals in a population, but what does it mean for each individual? It means that each individual will carry a number of alleles that increase (+) and a number of alleles that decrease (−) the trait or disease risk. There are so many possible combinations of these sets of alleles that each individual is likely to have a unique combination, and in studies designed to detect associated loci, the effect size of each allele is measured across the context of an averaged background, and the effect size of each locus is found to be small.

Pleiotropy Is Pervasive

The number of segregating variants in the human population is large but finite. It is not known what proportion of the segregating variants are associated with complex-trait genetic variation, but the fact that each of the many studied traits is associated with variants at hundreds to thousands of loci in the genome strongly suggests that some of the underlying causal variants are the same. Multiple lines of evidence are consistent with widespread pleiotropy for complex traits. First, Mendelian mutations that cause specific syndromes or diseases are frequently associated with multiple phenotypes in an affected individual. Second, pedigree studies have reported genetic correlations between traits, implying that a number of the same variants affect two or more traits in a consistent direction.19 Third, GWASs have shown that the same genetic variants can be significantly associated with multiple diseases and traits when the phenotypes are measured on different individuals (so that no environmental associations are driving the results).20, 21, 22 In the case of auto-immune diseases, evidence implies that at some loci, the same causal variants are driving the observations of associations across diseases.23, 24, 25 Fourth, analytical methods that estimate genetic correlations from GWAS data have provided evidence for widespread pleiotropy.20, 26

The corollary of pervasive pleiotropy for complex traits is that the paradigm of “one gene, one function, one trait” is the wrong way to view genetic variation in the human genome (and the same applies for studying disease in experimental organisms).27 It also implies that studying traits or disease in isolation with respect to past or present natural selection might lead to the wrong inference. The true nature of the pleiotropy is currently unknown but, in some cases, could imply an impact of the variants on different tissues and/or at different ages.

New Analysis Methodology Underpinning New Discovery

GWAS data have led to new analysis methods that fall into a number of categories depending on their purpose: (1) methods of better modeling population structure and relatedness between individuals in a sample during association analyses,28, 29, 30, 31, 32, 33, 34 (2) methods of detecting novel variants and gene loci on the basis of GWAS summary statistics,35, 36, 37 (3) methods of estimating and partitioning genetic (co)variance,38, 39 and (4) methods of inferring causality.40, 41, 42 In addition, GWAS discoveries and interpretation have benefited substantially from improved algorithms in statistical imputation of unobserved genotypes and statistical imputation of human leukocyte antigen (HLA) genes and amino acid polymorphisms.43, 44, 45, 46

Common Variants Together Tag a Substantial Proportion of Additive Genetic Variance

In addition to enabling the discovery of specific trait-locus associations, GWASs have facilitated estimation of how much of the total additive genetic variation due to segregating variants in the population is tagged by genotyped and imputed SNPs. This quantification of “SNP heritability” is informative with respect to the unknown genetic architecture of the trait. SNP heritability has provided objective guidance to inform decisions about which experimental designs are most efficient at detecting novel trait-locus associations on the basis of empirical data, i.e., increasing the sample size of GWASs. Classical estimation of total narrow-sense heritability (estimated from phenotypic records of samples that include family members) captures the total amount of additive genetic variance in the population irrespective of the joint distribution of allele frequency and effect size47 (we acknowledge a potential for bias by common environmental effects and non-additive genetic variation). In contrast, SNP heritability (estimated from the small genetic relationships among nominally unrelated individuals) captures only the proportion of additive genetic variance due to LD between the assayed and imputed SNPs and the unknown causal variants. Estimation and partitioning of additive genetic variation for quantitative traits and liability to disease have implied that one-third to two-thirds of genetic variation at causal variants can be tagged by common genotyped and imputed SNPs through LD.1, 48 At present, it is not known how much of the total additive genetic variation is due to causal variants with frequencies less than 1%. Evidence from imputed genotype data for height implies that more additive genetic variation is explained by variants with MAF < 10% than expected under an evolutionary neutral model, consistent with purifying selection of the height-associated loci.49 In the near future, when additive genetic variance will be estimated from WGS data in large samples, the contribution of observed rare and low-frequency variants will be estimated explicitly. Estimates from data available to date provide the first evidence for different genetic architectures between diseases;50 for example, there is more signal from rare variants for amyotrophic lateral sclerosis (motor neuron disease [MIM: 105400]) than for schizophrenia51 and more predicted loci for schizophrenia than for immune disorders52 and hypertension.53
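
One common way to formalize SNP heritability is a linear mixed model fitted to unrelated individuals. The notation below is introduced here as a sketch and is not taken from the article: y is the vector of phenotypes, g the aggregate additive effect of the SNPs, e the residual, x_{ij} the allele count (0, 1, or 2) of individual j at SNP i, p_i the allele frequency, and m the number of SNPs.

```latex
\begin{align}
\mathbf{y} &= \mathbf{g} + \mathbf{e}, &
\operatorname{var}(\mathbf{y}) &= \mathbf{A}\,\sigma_g^2 + \mathbf{I}\,\sigma_e^2, \\
A_{jk} &= \frac{1}{m}\sum_{i=1}^{m}
  \frac{(x_{ij} - 2p_i)(x_{ik} - 2p_i)}{2p_i(1 - p_i)}, &
h^2_{\mathrm{SNP}} &= \frac{\sigma_g^2}{\sigma_g^2 + \sigma_e^2}.
\end{align}
```

In this formulation, the off-diagonal elements of A are the small genetic relationships between pairs of unrelated individuals, and the variance components are typically estimated by restricted maximum likelihood. Because A is built only from assayed and imputed SNPs, the resulting estimate reflects only the variation at causal variants that those SNPs tag through LD.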

Theoretical and empirical observations suggest a place for non-additive genetic variation, and there have been many largely unsuccessful attempts to detect epistasis with GWAS data. There are a number of likely explanations. First, there is limited evidence that non-additive genetic variation makes up a large fraction of the total genetic variation, so detection requires larger sample sizes than those necessary for main effects. Second, the loss of information due to imperfect LD between genotyped SNPs and causal variants is larger for interactions than for main effects. For example, loss of information for additive effects is proportional to the LD r2, whereas information loss for dominance and additive-additive interaction effects is proportional to r4. The first observation also applies to interactions between genes and environmental factors. One replicable example of epistatic interaction is the ERAP1-HLA interaction for psoriasis (MIM: 177900) and ankylosing spondylitis (MIM: 106300).54
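
A small worked example makes the contrast between r2 and r4 concrete. The sketch below assumes one reading of the statement above: the information retained when a tag SNP stands in for a causal variant scales with r2 for additive effects and with r4 (that is, r2 squared) for dominance or additive-by-additive effects, so the sample size must grow by the reciprocal to compensate. The function name is illustrative.

```python
def sample_size_inflation(r2: float) -> dict:
    """Factor by which the sample size must grow to offset imperfect LD.

    r2: LD r^2 between a genotyped SNP and the causal variant.
    Assumes retained information scales with r^2 (additive effects) and
    with r^4 (dominance or additive-by-additive interaction effects).
    """
    if not 0.0 < r2 <= 1.0:
        raise ValueError("r2 must be in the interval (0, 1]")
    return {
        "additive": 1.0 / r2,
        "dominance_or_interaction": 1.0 / (r2 * r2),
    }


for r2 in (1.0, 0.8, 0.5):
    factors = sample_size_inflation(r2)
    print(f"r2 = {r2}: additive x{factors['additive']:.2f}, "
          f"interaction x{factors['dominance_or_interaction']:.2f}")
```

With r2 = 0.8, for instance, an additive effect needs 1.25 times the sample size, whereas a dominance or interaction effect needs about 1.56 times, and the gap widens quickly as LD weakens.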

The Utility of GWAS-Derived Genetic Predictors

In 2007, it was shown that one could use GWAS data from human studies to create genetic predictors for disease and other complex traits by estimating the effect size at multiple loci in a discovery sample and using those estimated SNP effects in independent samples13, 55 to generate a polygenic risk score (PRS) per individual. A thorough review of different methods of generating PRSs is outside the scope of this review, but currently the key driving force influencing prediction accuracy is the size of the discovery sample used for estimating the effects of individual variants. PRSs have been applied extensively over the last 5 years, not in a clinical setting for the prediction of a healthy individual’s risk of disease but in applications that facilitate new experimental designs and discoveries. Polygenic predictions are not particularly informative for an individual, but they do explain a sufficient proportion of variation (between 1% and 15% at present for highly polygenic traits without a major gene) to separate groups, for example, samples with the highest and lowest risk. They are also useful for detecting new trait associations by correlating observed phenotypes in a sample or cohort with the genetic prediction of another trait. This design is powerful, because if the discovery sample is fully independent of the new sample, an observed association between a complex trait and a genetic predictor from the discovery sample must be due to genetic factors, given that there are no shared environmental factors. The paradigm of PRSs can also be applied to the prediction of molecular phenotypes such as gene expression, even when they are not observed,56 for mining the human “phenome” for association with predictors derived from diseases and other traits57 or investigating genotype (proxied by PRS) by environment interaction.58
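
In its simplest form, a PRS is a weighted sum of allele counts, with weights taken from the discovery sample. The sketch below shows only that basic idea, under the assumption that effect sizes and genotypes refer to the same effect allele. It omits the choices that practical methods add, such as SNP selection, LD adjustment, and shrinkage of effects. The SNP names and values are hypothetical.

```python
def polygenic_risk_score(dosages: dict, effects: dict) -> float:
    """Weighted sum of effect-allele counts for one individual.

    dosages : SNP name -> count of the effect allele (0, 1, 2, or an
              imputed dosage) in the target individual
    effects : SNP name -> per-allele effect estimated in an independent
              discovery sample
    Only SNPs present in both inputs contribute to the score.
    """
    shared = dosages.keys() & effects.keys()
    return sum(effects[snp] * dosages[snp] for snp in shared)


# Hypothetical discovery-sample effects and one target individual's genotypes.
discovery_effects = {"snpA": 0.10, "snpB": -0.05, "snpC": 0.02}
individual = {"snpA": 2, "snpB": 1, "snpC": 0, "snpD": 1}  # snpD has no estimated effect
print(polygenic_risk_score(individual, discovery_effects))
```

Keeping the discovery and target samples independent, as the paragraph above stresses, is what allows an association between the score and a trait to be interpreted as genetic in origin.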

The Public Availability of Data Has Enabled Novel Research and Discoveries

Sharing of genetic data in the gene-mapping community has been a major enabling factor in gene-mapping success. At this point, the vast majority of the available data are from studies of populations of European descent, and it is hoped that data from other ethnic groups will be deposited more extensively in years to come.

The availability of GWAS summary statistics (the effect sizes and their standard errors or p values on millions of SNPs) in the public domain has increased dramatically in the last 5 years, and in 2017 hundreds of such datasets are publicly available.59 There are a number of reasons for this. Previous concerns about potential individual identification from GWAS summary data have proven to be unfounded, either because the sample size from GWAS summary statistics is typically very large or because a simple step such as providing average allele frequencies from a reference sample negates potential identification. The entire genomics field benefits from wide availability of genetic data. When a GWAS is published, full genome-wide summary statistics (at the very least) should be available for uncontrolled download. Funding bodies and journals could play a stronger role in enforcing such a requirement. The availability of summary statistics in the public domain has enabled discoveries of novel associations,37, 60, 61, 62 estimation of SNP-based heritability,63 quantification of pleiotropy across many traits,20, 21, 38 and creation of more accurate prediction scores, as well as follow-up with computational tools, functional assays, and model systems for the identification of candidate genes.

For the near future, the UK Biobank is pushing the barriers further by releasing both genome-wide genotypes and rich phenotypic data on 500,000 people to the international research community.

From GWAS to Biology

By design, associations detected by GWASs do not yield a particular gene target or mechanism. This is in contrast to the detection of Mendelian coding mutations in family studies, where the variant, target gene, and mechanism (change in protein) are identified simultaneously. Moreover, the sheer number of associated variants means that the battery of follow-up functional studies traditionally applied to new discoveries from Mendelian disease is not appropriate or achievable for discoveries of genes associated with complex traits. It should be noted that although the effect sizes of individual genetic variants are small in populations, their effect sizes on molecular phenotypes can be large, and the drug effects of gene targets can also be magnified (e.g., statins). Notably, the last 5 years have witnessed some clever laboratory experiments that have followed up on GWAS association, and these have led to the discovery of the target gene, for example, the targets of the associations between FTO (MIM: 610966) and obesity (MIM: 601665)64 and between the major histocompatibility complex (MHC) and schizophrenia.65 Performing similar or new laboratory experiments for many loci could be possible but would be time consuming and expensive.

Until recently, efforts to understand the biological mechanisms through which these various risk variants act have been thwarted by limitations in the capacity to perform large-scale evaluation of functional impact.66 The advent of sequence-based -omic analyses has been transformative by allowing functional analyses of risk variants to be pursued on the same genome scale (which has fueled their discovery) and allowing mechanistic inferences to be based on the behavior of the full set of risk loci for a given trait.67 The maps of regulatory annotations and connections in disease-relevant tissues, generated by projects such as ENCODE,68 Epigenome RoadMap,69 and GTEx,70 have been crucial to interpretation of the non-coding variants that account for the majority of GWAS-identified risk alleles. Tissue-specific resources could become increasingly important, and for neuro-psychiatric disorders in particular, appropriate human brain resources are essential. New initiatives such as CommonMind and PsychENCODE are providing data and tools for the neuro-psychiatry research community to follow up on GWAS signals. New analytical methods now provide the first steps of functional in silico follow-up by exploiting the availability of resource datasets detailing gene expression, epigenetic marks, 3D chromatin contacts,71 or other genomic annotations, including drug targets. One fertile area of method development is integrating data from GWASs and expression quantitative trait locus (eQTL) studies to identify associations between transcripts and complex traits.56, 61, 62 These methods are useful for prioritizing genes from known GWAS loci for functional follow-up, detecting novel gene-trait associations, and inferring the directions of associations.21, 27, 62 The analytical result that only about one-third of the associated genes are the nearest genes61, 62 is informative for the design of fine-mapping experiments.

One of the ultimate objectives of genetic research is to drive translational advances that enable more effective prevention and/or treatment of disease. Despite the inevitable time lag between basic research discoveries and clinical implementation, a growing number of examples highlight the diverse routes by which human genetics can inform translational medicine.

Three Exemplars of GWAS Success

Here, we focus on three examples of adult-onset disease to demonstrate some of the significant advances that have followed as a direct result of GWASs. Figure 3 illustrates examples of overlap between GWAS signals and known drug targets. In general, drug targets that are genetically informed have a higher probability of making it to phase III trials or to market, implying potentially huge cost savings to the pharmaceutical industry.72

Type 2 Diabetes

Variant and Gene Discovery. Scores of genes have been causally implicated in monogenic forms of diabetes (e.g., neonatal diabetes mellitus [MIM: 601410]73), but GWASs have now identified over 100 common variant signals.74, 75, 76 Recent efforts to extend GWASs beyond array-based genotyping and to access a broader range of variants through sequencing (particularly those of lower frequency) have revealed that most genetic variation influencing T2D appears to reside at common variant sites.74, 77 This chimes with the view of T2D as a largely post-reproductive trait and is consistent with a failure to detect compelling empirical evidence that T2D risk alleles have been subject to marked purifying selection.78, 79 In keeping with the age of these common risk alleles (which predates the diaspora of modern humans out of Africa), most common variant associations for T2D are replicated across major ethnic groups.75, 80 However, as increasingly diverse populations are genotyped and sequenced, more ethnic-specific alleles are being identified. Several of these alleles have a relatively large phenotypic impact and have risen to high frequency in specific populations, including variants in PAX4 (MIM: 167413) in East Asians74 and TBC1D4 (MIM: 612465) in Inuit.81 Efforts to identify compelling evidence for gene-gene and gene-environment interactions have been largely unsuccessful.82

From GWAS to Biology. Regulatory information on the key tissues of insulin action (fat, muscle, and liver)82, 83 and equivalent data from pancreatic islet material67, 84 have provided compelling evidence that the variants most strongly associated with T2D (as well as fasting glucose and other related quantitative traits) are preferentially located at active enhancers (particularly stretch enhancers) in pancreatic islets67, 84 and, to a lesser extent, at enhancers active in fat, muscle, and liver.83, 85 Increasing refinement of regulatory annotation has brought more precise localization of these global regulatory effects, for example, emphasizing specific transcription factor genes (such as FOXA2 [MIM: 600288]).86 These patterns of tissue-specific genomic enrichment tie in with studies of the physiological correlates of T2D risk alleles, as observed in physiological data from non-diabetic subjects; these have indicated that, whereas some T2D risk alleles have a primary effect on insulin action, most appear to be associated with reduced insulin secretion.87 These approaches have generated some notable advances, for example, cis-expression mapping has highlighted KLF14 (MIM: 609393) as the mediator of a chromosome 7 T2D signal that is associated with insulin resistance and hyperlipidemia (appropriately, this expression signal is specific to adipose tissue).85 Equivalent data from human islets have characterized the likely effector transcripts at several T2D GWAS loci (such as ZMIZ1 [MIM: 607159], MTNR1B [MIM: 600804], and ADCY5 [MIM: 600293]), where the major impact is to reduce insulin secretion.86, 88 Additional clues to the identification of the causal transcripts at certain GWAS loci have come from examining the credentials of the regional transcripts themselves, assigning candidacy on the basis of known biology (e.g., NOTCH2 [MIM: 600275] and GIPR [MIM: 137241]),89 involvement in related monogenic conditions (WFS1 [MIM: 606201], HNF1A [MIM: 142410], and HNF4A [MIM: 600281]),90, 91 or data from animal models (CDKAL1 [MIM: 611259]).92 Finally, the accumulation of data on coding variants (via exome sequencing and/or exome array genotyping) has highlighted several instances where GWAS signals previously attributed to non-coding variants can be reassigned to causal coding variants (e.g., TM6SF2 [MIM: 606563]74). For others, such as RREB1 (MIM: 602209), identification of T2D-associated coding variants, statistically independent of the original GWAS signal, flags the likely effector transcripts.74 All in all, it is possible to point to a compelling effector transcript at around one-third of the 100 T2D loci identified by GWASs. These genes represent legitimate targets for detailed empirical validation and mechanistic exploration. They also support efforts, via network-based approaches, to establish the extent to which the biology of T2D predisposition converges onto a restricted set of pathways.

Translation. Examples from T2D research highlight the diverse routes by which human genetics can inform translational medicine: (1) the combination of common-variant GWASs and candidate-gene resequencing has demonstrated that loss-of-function mutations in SLC30A8 (MIM: 611145; encoding a zinc transporter expressed in pancreatic islets) are protective for T2D, leading to efforts by several pharma companies to develop ZnT-8 antagonists;93 (2) the use of genetic variants as instruments that “simulate” variation in environmental and biochemical exposures has clarified the extent to which vitamin D intake, early nutrition, circulating lipid levels, and chronic inflammation play causal roles with respect to the development of T2D94, 95, 96, 97, 98 and has defined the relationship between insulin resistance and the distribution of adipose tissue;99 (3) the identification of genetic variants associated with individual variation in response to commonly used therapeutic agents has refined our understanding of the mechanisms through which those agents operate100,101 and, in some instances, has led to therapeutic optimization on the basis of genetic and/or clinical phenotype;102 and (4) the combination of -omic measurements, longitudinal clinical phenotypes, and GWAS data has highlighted sets of molecules (e.g., branched-chain amino acids) that not only are prospectively associated with T2D progression but could also play a causal role in T2D development and thereby provide valuable clinical tools for stratification and prognostication.103, 104

Auto-immune Diseases

Variant and Gene Discovery. In the last 5 years, GWASs have been undertaken for nearly all major immune-mediated diseases (with sample sizes of tens of thousands of case and control individuals for more common immune-mediated diseases studied either by GWASs or by more targeted chips, such as Illumina’s Immunochip105), resulting in hundreds of associated loci. The development of statistical approaches for cross-disease studies to identify pleiotropic loci has been particularly productive in identifying new genes and in better understanding the pathogenic relatedness of immune-mediated diseases. A recent cross-disease study involving the conditions ankylosing spondylitis (AS [MIM: 106300]), inflammatory bowel disease (IBD [MIM: 266600]), primary sclerosing cholangitis (MIM: 613806), and psoriasis identified without any further genotyping 30 new loci at genome-wide significance.24 Trans-ethnic studies have demonstrated substantial genetic overlap between ethnically remote populations;106, 107, 108 for example, genetic correlations of 0.76 and 0.79 between European and East Asians have been estimated for Crohn disease and ulcerative colitis (UC).106 Trans-ethnic comparisons of associations at shared loci have been quite helpful in pinpointing causal variants; for example, population-specific variation in HLA-DRB1 (MIM: 142857) associations in rheumatoid arthritis (RA [MIM: 180300]) has helped to define the key amino acids underpinning that association.

From GWAS to Biology. GWAS results have made key contributions to deeper biological understanding of immune-mediated diseases in the last 5 years. For example, in a cross-disease study, new loci included genes that for the first time implicate pathogenesis associated with methylation variation (DNMT3A [MIM: 602769] and DNMT3B [MIM: 602900]), bacteria-sensing genes (TLR4 [MIM: 603030]), genes influencing the host microbiome (FUT2 [MIM: 182100]), and NFKB pathway genes (NFKB1 [MIM: 164011], NFKBIA [MIM: 164008], TNFAIP3 [MIM: 191163]).24 Evidence of extensive pleiotropy includes variants that have different directions of association in different diseases or disease-specific variants at the same loci. This has relevance for the likely impact of targeting these loci therapeutically. For example, SNP rs1800693 in the major TNF-receptor gene TNFR1 (MIM: 191190) is associated in different directions with multiple sclerosis (MS [MIM: 126200]) and AS.109 The SNP leads to loss of the transmembrane domain of the receptor, and the risk SNP for MS (protective for AS) leads to increased serum soluble TNF receptor.110 TNF inhibition, including by decoy TNF receptor therapies, is highly effective for AS and many other immune-mediated diseases, but its use can be complicated by de novo development of MS, and in MS itself, it can exacerbate disease. Although this is a retrospective example, it demonstrates the potential of using genetics to predict toxicities. There are several agents in development where the genetics would point to the likelihood of toxicities. 
Examples include CD40 and its ligand, where the C allele of SNP rs1883832 in CD40 (MIM: 109535) is a risk factor for RA and auto-immune thyroid disorder (AITD) but is protective against MS and IBD, and the PTPN22 (MIM: 600716) variant c.1858C>T (p.Arg620Trp), which increases the risk of type 1 diabetes (MIM: 222100), systemic lupus erythematosus (MIM: 152700), vitiligo (MIM: 606579), AITD, and UC but is protective against Crohn disease.23 At the very least, this suggests that any clinical trials in these conditions should carefully screen for the development of the diseases with the converse genetic associations.

The MHC, as well as the HLA genes encoded within it, is the major locus for the majority of immune-mediated diseases. Although the major highly penetrant HLA types involved in different diseases have long been established, in the last 5 years, the ability to impute the composite amino acids and then test these for disease association has enabled research that has better defined the key components of the HLA proteins involved in disease. In RA, it had been known for roughly 30 years that a sequence of amino acids at positions 70–74 of HLA-DRB1 largely, though not fully, determines the differential association between HLA-DRB1 types and disease.111 Through the use of amino acid imputation and association studies, this “shared epitope” sequence was extended,112 and this information was used to provide a molecular explanation for the propensity of peptides with citrullinated component amino acids to induce disease.113 HLA variants have long been known to be major determinants of severe immunologically mediated adverse drug reactions. For example, toxicity to the anti-retroviral abacavir is largely restricted to HLA-B57 carriers. With the use of GWAS and HLA imputation, an HLA-DQA1∗0201-HLA-DRB1∗0701 haplotype has been shown to be strongly associated with the risk of thiopurine-induced pancreatitis, such that homozygotes for this haplotype have a 17% risk of this major side effect.114 It is likely that with the increasing use of genetic profiling in clinical practice, further examples will be identified in coming years.

Translation. GWAS results have already proven highly successful at initiating medication repositioning. For example, GWAS discoveries triggered the repositioning of biological medications targeting components of the IL-23 pathway (including IL-12p40, IL-17, and IL-23p19), and now these are mainstay treatments for psoriasis and psoriatic arthritis (MIM: 607507), are highly effective in AS, and (with the exception of IL-17 blockade) are effective in IBD, as suggested by early studies.115, 116, 117 The annual sales of these medications alone are likely to be greater than the total amount spent on GWASs in the past decade.

Many other GWAS discoveries have stimulated targeted therapy-development programs, a few of which are described here. The discovery of the association between PADI4 (MIM: 605347) and RA provided conclusive evidence that immunological reactions to epitopes that had been citrullinated by PAD enzymes were causatively involved in RA. This led to programs developing PAD inhibitors in RA, and these have shown significant promise.118, 119 Major drug-development programs have been initiated to target the M1 aminopeptidase genes ERAP1 (MIM: 606832) and ERAP2 (MIM: 609497) because of their genetic associations with AS, psoriasis, IBD, Behçet disease (MIM: 109650), and the rare condition Birdshot retinopathy (MIM: 605808).120

Bioinformatic follow-up of GWAS results has also been fruitful. For example, Okada et al. screened the overlap between genetic associations and known drug targets to demonstrate that existing RA therapies disproportionately target RA-associated gene products and their interacting protein partners.108 From this, they extrapolated that other agents with high levels of effects on these proteins would be enriched with potential new RA therapies and provided suggestive evidence that CDK4 and CDK6 inhibitors already in use, particularly in oncology, could be effective in RA. These agents have been shown to be effective in the collagen-induced arthritis model of RA and are now in human trials in RA in Japan.

Schizophrenia

Variant and Gene Discovery. Although psychiatric diseases had a slow start in GWAS locus identification, more than 50,000 samples have been genotyped in the last 5 years; the typical linear relationship between sample size and number of loci has been observed, and more than 100 risk loci have been discovered to date. These risk loci are enriched in genes containing de novo mutations in schizophrenia, autism (MIM: 209850), and intellectual disability,15 and several identified loci contain genes relevant to major hypotheses of schizophrenia etiology, including DRD2 (MIM: 126450; the target of anti-psychotic drugs) and genes involved in glutamatergic neurotransmission (GRM3 [MIM: 601115], GRIN2A [MIM: 138253], and GRIA1 [MIM: 138248]), as well as genes that extend previous observations of association with voltage-gated calcium channel subunits (CACNA1C [MIM: 114205], CACNB2 [MIM: 600003], and CACNA1I [MIM: 608230]).15 One of the most striking findings that emerged early in schizophrenia studies—at the stage where there were only a handful of genome-wide-significant loci—was the highly polygenic nature of the common variants contributing to risk.14 This observation has been widely replicated, and estimates are that 71% of 1 Mb genomic regions have at least one variant influencing schizophrenia risk,53 and there is evidence of substantial pleiotropy with other psychiatric disorders.26 However, genetic architecture, described as the mix of rare and common variants, is likely to differ between psychiatric disorders, as is already being observed for the higher rates of rare, de novo penetrant CNVs and single-nucleotide variants found in autism than in schizophrenia or bipolar disorder.121, 122, 123, 124, 125, 126, 127, 128, 129 PRS studies are being utilized extensively to investigate disease heterogeneity and contributions from environmental risk factors.

From GWAS to Biology. Functional follow-up is necessarily more difficult for psychiatric disorders, and to date, bioinformatic analyses have been the key focus, providing strategies for prioritization of loci. Schizophrenia risk loci are over-represented in regulatory regions active in the brain15, 130, 131 and are enriched in genes from postsynaptic density, postsynaptic membrane, dendritic spine, axon, and voltage-gated potassium channel pathways, as well as histone H3-K4 methylation;132 these pathways overlap with those identified in rare-variant studies of autism. Prioritization of GWAS results has progressed through integration with eQTL datasets, implicating synaptic genes (SNAP91 [MIM: 607923], TSNARE1, and CLCN4 [MIM: 302910]) and genes with roles in neurodevelopment (FURIN [MIM: 136950] and CNTN4 [MIM: 607280]). 3D contacts between risk variants and promoters, explored by chromosome conformation capture (Hi-C) in the subcortical plate and germinal zone of the developing human cortex, supported putative interactions between causal risk variants and promoters in glutamatergic and calcium signaling genes (GRIA1 [MIM: 138248], NLGN1 [MIM: 600568], GRIN2A [MIM: 138253], and CACNA1C [MIM: 114205]), in several genes long implicated in schizophrenia (including DRD2 and genes encoding acetylcholine receptor subunits), and in the genes SNAP91, TSNARE1, CLCN4, FURIN, and CNTN4.71

Fine-mapping has been accomplished for the strongest, and first identified, association with schizophrenia in the MHC region, a challenge because of its high genic content and high LD. The position of the association signal within the MHC region led to investigation of common structural haplotypes of complement factor 4 genes C4A (MIM: 120810) and C4B (MIM: 120820), combinations of which correlated well with schizophrenia risk and increased C4 expression and showed differential brain expression between case and control individuals.65 Several other complement proteins play a role in synapse elimination, and decreased numbers of synapses have long been suggested as a primary abnormality in schizophrenia. Observations that, in mice, a complement gene that shares features with human C4A and C4B is expressed in neurons and promotes synapse elimination in a developmental brain circuit strongly implicate this gene and its protein.

Translation. No new molecular targets for schizophrenia have been successfully identified since the first antipsychotic drugs were identified several decades ago. The reasons are likely to be manifold, but most drug development for schizophrenia has focused on achieving high-potency drugs for a single target—a methodology successful in many other areas of medicine—which necessitates a choice between the competing hypotheses of schizophrenia pathophysiology. GWAS results have provided unequivocal evidence of polygenicity, and because many of the GWAS loci contain genes that code for proteins among those indicated through multiple prior hypotheses, e.g., dopamine, glutamate, immune modulation, calcium signaling, and nicotinic cholinergic, future drug development could benefit from taking a multi-target approach. A proof-of-concept gene-set enrichment of schizophrenia risk alleles in sets of genes for drug targets identified several potential repurposing opportunities.133 Single-target medications could be appropriate for specific genetic subgroups, although identifying genetic subtypes is not yet part of the clinical trial paradigm.

Discussion

The Present

We have summarized the major kinds of discoveries made from GWASs focusing on adult traits and have reviewed the new biology and emerging translational outcomes for three diseases. Over the last decade, this experimental design has delivered a remarkably diverse set of discoveries in human genetics. For most traits and diseases studied, the mutational target in the genome appears large, in that polymorphisms in many genes contribute genetic variation in the population. Furthermore, the empirical evidence of widespread pleiotropy implies that many segregating variants affect multiple traits. A precise estimate of the proportion of all segregating genetic variants that are “functional” in the context of being associated with one or more traits, conditional on all other causal variants, remains elusive. For the highlighted traits, disorders, and diseases, we have given examples of routes from GWAS to biology and translation. For an experimental design only a decade old, this is an example of rapid translation of genetic findings toward clinical application.

The relationship between sample size and number of risk loci detected varies between traits, but all show a sharp increase at a critical sample size. To date, there has been no trait with evidence of a plateau of the number of risk loci discovered with increasing sample size. For some traits, such as height, schizophrenia, and IBD, discovery samples in the next 5 years are likely to continue to increase, perhaps at a lower rate per additional sample. A diminishing rate of discovery of new loci will provide a more complete picture of genetic architecture and will best satisfy the understanding of contributing biological pathways. According to the knowledge of Mendelian disease, the expectation is that multiple risk variants will be detected within loci that have already been identified. Hence, as sample sizes increase, the new discoveries of associated pathways will be saturated first, followed by genes and lastly variants.

GWASs have been successfully applied to molecular traits such as gene expression, DNA methylation,134 and metabolites.135 Conclusions from these studies are that most molecular phenotypes are just like other complex traits, in that differences between individuals are due to a combination of genetic factors and environmental exposures and that genetic loci can be mapped by GWASs.136 This makes the discovery of causal pathways from genomes to phenomes challenging, in that variation between people in modifiable risk factors might be partially anchored in DNA sequence variation for these “exposures.” Nevertheless, the combination of sequence variation with molecular phenotypes and disease data with novel analytical methods, such as Mendelian randomization,42 has great potential to unravel cause and consequence and to improve phenotypic prediction.137
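The logic of using a genetic variant as an instrument can be illustrated with a single-variant Wald ratio, one of the simplest Mendelian randomization estimators. The following is only a minimal sketch: the function name and the numbers are hypothetical, the standard error is a first-order approximation that ignores uncertainty in the SNP-exposure estimate, and the result is interpretable as a causal effect only under the usual instrumental-variable assumptions (for example, no pleiotropic path from the variant to the outcome).

```python
def wald_ratio(beta_exposure, beta_outcome, se_outcome):
    """Single-variant Mendelian randomization estimate (Wald ratio).

    beta_exposure: per-allele effect of the SNP on the exposure (e.g., a metabolite)
    beta_outcome:  per-allele effect of the same SNP on the outcome (e.g., disease liability)
    se_outcome:    standard error of beta_outcome
    Returns (estimate, approximate standard error).
    """
    if beta_exposure == 0:
        raise ValueError("the instrument must be associated with the exposure")
    estimate = beta_outcome / beta_exposure
    # First-order approximation: treats beta_exposure as known without error.
    se = se_outcome / abs(beta_exposure)
    return estimate, se


# Hypothetical summary statistics, for illustration only.
estimate, se = wald_ratio(beta_exposure=0.2, beta_outcome=0.05, se_outcome=0.01)
print(f"estimated effect of exposure on outcome: {estimate:.3f} (SE {se:.3f})")
```

In practice, many independent variants are combined and sensitivity analyses are used to probe pleiotropy; this sketch shows only the core ratio.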

GWASs to date have been based on SNP arrays designed to tag common variants in the genome. These arrays do not cover all genetic variants in the population, and it would seem natural that future GWASs will be based on WGS. However, the price differential between SNP arrays and WGS is still substantial, and array technology remains more robust than sequencing. Nevertheless, now hundreds of thousands of genomes are being sequenced as part of major initiatives, and the next 5 years will allow direct comparisons of discoveries made from sequencing and array studies. Interestingly, custom arrays without a GWAS “backbone” (such as the Immunochip, Metabochip, and exome-only arrays) have by and large failed to identify rare (MAF < 1%) variants at loci that were initially discovered from a GWAS, one of their aims. The reason for this is not clear. It could be because there are no rare variants of major effect, because the sample size is too small for detecting rare variants and/or estimating their effect size, because the chip coverage of rare variants is inadequate, or because of a combination of these. However, these custom arrays have led to the discovery of new loci and to fine-mapping at existing loci, mostly driven by increasing experimental sample size (see Appendix A on the relationship between sample size, imputation accuracy, and allele frequency on power of detection). A recent study of height using exome SNP arrays and a sample size of ∼700,000 reported 83 height-associated coding variants with a frequency of less than 5% and effect sizes of up to 2 cm.138 These variants each explain, on average, about the same amount of variation as common variants, whose effect sizes are of the order of 1 mm, because it is the combination of frequency and effect size that determines variation (Appendix A).
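The point that frequency and effect size jointly determine the variation explained can be made concrete with the standard additive-model expression, var = 2p(1 − p)β², where p is the allele frequency and β the per-allele effect. The sketch below assumes Hardy-Weinberg equilibrium and purely additive effects; Appendix A is not reproduced here, and the frequencies and effect sizes are illustrative values chosen to match the orders of magnitude quoted above, not estimates from the cited study.

```python
def variance_explained(freq, beta):
    """Additive genetic variance contributed by one biallelic variant.

    Assumes Hardy-Weinberg equilibrium and an additive per-allele effect:
        var = 2 * p * (1 - p) * beta**2
    The result is in squared units of beta (here mm^2).
    """
    if not 0.0 <= freq <= 1.0:
        raise ValueError("allele frequency must lie in [0, 1]")
    return 2.0 * freq * (1.0 - freq) * beta ** 2


# Illustrative values only: a rare variant with a 2 cm (20 mm) effect
# versus a common variant with a 1 mm effect.
rare = variance_explained(freq=0.001, beta=20.0)
common = variance_explained(freq=0.5, beta=1.0)
print(f"rare variant:   {rare:.4f} mm^2")
print(f"common variant: {common:.4f} mm^2")
```

Under these assumed values, the two variants contribute variance of the same order of magnitude even though their per-allele effects differ 20-fold.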

One limitation of both current array and WGS technology is that the precision of detection of structural variants (indels or inversions > 50 bp) is less than that of SNP detection. New technologies that enable long-range haplotyping are helping to overcome the weakness of short-read technologies, and cheap, genome-wide technologies for structural variants would constitute an important advance.

Fine-mapping of SNP-trait associations is the attempt to identify one or more causal variants that are responsible for the observed GWAS signals. Fine-mapping solely by statistical association is limited by experimental sample size and LD, given that the statistical evidence to separate a causal variant from a variant in LD with it is proportional to n(1 − r²) (see Equation A1 in Appendix A). If causal variants are not in the data (e.g., they have not been genotyped), then the imputation error also limits fine-mapping. With the likely availability of SNP-array-based GWAS data on very large sample sizes and WGS data on large sample sizes, statistical fine-mapping power will improve, and a small number of variants that are in extremely high LD might be identified as a plausible set of variants with a high probability of containing one or more causal variants. The use of additional information, such as prior knowledge of the likely function of specific variants given their location and surrounding DNA motif(s),139, 140 could help to reduce the set of statistical candidates to a smaller number. This is already a fertile area of statistical and bioinformatic research56, 62, 131, 141, 142 bringing together trait or disease GWAS results with those of tissue gene expression. More research on the resolution of fine-mapping is warranted, and this will be fueled by an expected increase in GWAS data on tissue- and cell-specific gene expression.
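The practical consequence of the n(1 − r²) scaling can be seen by asking how much larger a sample must be to retain the same discriminating evidence as LD increases. The sketch below uses only the stated proportionality; the constant of proportionality and the full form of Equation A1 are not reproduced here, and the function name is hypothetical.

```python
def relative_sample_size(r2):
    """Sample-size multiplier needed to keep n * (1 - r2) constant.

    r2 is the squared LD correlation between the causal variant and a
    competing variant; the baseline is two variants in no LD (r2 = 0).
    """
    if not 0.0 <= r2 < 1.0:
        raise ValueError("r2 must lie in [0, 1); at r2 = 1 the variants are indistinguishable")
    return 1.0 / (1.0 - r2)


for r2 in (0.0, 0.5, 0.9, 0.99):
    print(f"r2 = {r2:<4}  relative n = {relative_sample_size(r2):.1f}")
```

Because the multiplier grows without bound as r² approaches 1, variants in near-perfect LD cannot be separated by sample size alone, which is why functional annotation and cross-population LD differences become important.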

Most GWASs to date have been conducted on individuals of European descent, but there is a growing number of studies on populations of Asian and African ancestry. Because common variants contribute to the genetic architecture of complex traits, the expectation is that these common variants are evolutionarily old and shared across ethnicities, which is encouraging for generalizing treatments. The clearest demonstration of this, as discussed above, has been for IBD, for which the genetic correlation between Asian and European samples is close to 0.8, even though some individual risk loci differ in frequency or effect size.106 A characteristic of GWAS analyses to date is to strictly exclude individuals outside ethnicity boundaries on the basis of standard deviation units in principal-component dimensions. However, as sample sizes become larger, it is becoming possible to not only utilize but also take advantage of mixed and admixed ethnicity. The differing allele frequencies and LD structure across populations should aid in fine-mapping causal variants. New methods have emerged to deal with these data,75, 143 and we expect that this will be a fertile area of method development and discovery in the coming years.
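The exclusion rule described above, removing individuals who lie too many standard deviation units from a reference group in principal-component space, can be sketched as follows. This is a simplified illustration rather than any specific study's pipeline: the function name, the input layout (one tuple of PC scores per individual), and the default threshold of 6 SD are all assumptions.

```python
from statistics import mean, stdev


def ancestry_outliers(reference_pcs, sample_pcs, max_sd=6.0):
    """Return indices of samples lying more than max_sd standard deviations
    from the reference-group mean on any principal component.

    reference_pcs, sample_pcs: sequences of equal-length tuples of PC scores.
    """
    n_pcs = len(reference_pcs[0])
    centers = [mean(row[k] for row in reference_pcs) for k in range(n_pcs)]
    spreads = [stdev(row[k] for row in reference_pcs) for k in range(n_pcs)]
    flagged = []
    for i, row in enumerate(sample_pcs):
        if any(abs(row[k] - centers[k]) > max_sd * spreads[k] for k in range(n_pcs)):
            flagged.append(i)
    return flagged


# Toy data: five reference individuals and three study samples on two PCs.
reference = [(0.0, 0.0), (1.0, 1.0), (-1.0, -1.0), (0.5, -0.5), (-0.5, 0.5)]
samples = [(0.2, 0.1), (10.0, 0.0), (0.0, -3.0)]
print(ancestry_outliers(reference, samples))
```

A hard threshold of this kind discards admixed individuals outright, which is the practice that newer mixed-ancestry methods aim to relax.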

The Future

Does the GWAS have a future? Extrapolating the discoveries from the last 10 years to the future, if we were to keep with the current experimental strategy of SNP arrays and imputation, then ever-increasing sample sizes would undoubtedly lead to new genetic discoveries, particularly (1) the discovery of more variants and more genes associated with one or more traits, (2) accounting for more genetic variation, (3) more accurate genetic predictors, and (4) a greater ability to evaluate disease heterogeneity and to derive genetically informed diagnoses that might be more aligned to specific treatments. For biological enrichment analyses and the discovery or fine-tuning of pathways involved in quantitative traits and disease, more loci are likely to increase resolution. In fields where diagnostic criteria are not based on biological markers, such as psychiatry, GWASs have turned the field on its head by, for the first time, contributing quantitative data that can be used for completely re-evaluating the relationship of previously distinct disorders.

The future of GWASs will have old and new challenges. With ever larger studies, the new loci identified will typically individually have smaller effect sizes (e.g., less than 0.5 mm for a trait such as height and an odds ratio of 1.01 for common disease) or, for rare variants, will be at very low frequency.138 For disorders with population prevalence of the order of 0.1%, discovery will still be limited by experimental sample size, given that it will take many years to accumulate sample sizes of 100,000 cases or more. One challenge is how such loci can be fine-mapped or studied for mechanism. Upscaling of technology, either through interfacing with sequence-based -omic data or through upscaling by experimental perturbations (e.g., multiple-locus or genome-wide CRISPR), is likely to be key to overcoming the challenges of small effect size. What is likely to change in the near future is that GWASs by SNP arrays will be gradually replaced by GWASs by WGS, particularly for quantitative traits and very common diseases. Nonetheless, given a finite budget, larger sample sizes that are phenotypically more informative and genotyped on a SNP chip remain a powerful strategy for maximizing discovery. Fifteen years ago, genotyping technology was the limiting step to discovery, but now discovery is limited by phenotypic descriptors that could link with genetic data to allow disease stratification that might be more aligned with treatments. Furthermore, the emphasis in research will need to shift from gene discovery to translation into biological understanding and patient-focused outcomes, such as better diagnostic tests and novel treatments.

In conclusion, the experimental design of GWASs has led to a remarkable range of discoveries in human genetics over the last decade. It has delivered on its original aim of detecting associations between common DNA variants and human disease and disorders. It has led to a better understanding of the genetic architecture of complex traits and therefore of past natural selection on traits associated with fitness. It has led to the discovery of variants, genes, and biological pathways that play a role in specific diseases and disorders. It has led to new discoveries in disease epidemiology and to the discovery or repurposing of candidate therapeutics. As foreshadowed in 2007, it has indeed been a case of drinking from the fire hose.144 For the future, the combination of whole-genome surveys of genetic variation and detailed phenotypic and -omics data on millions of individuals will be a treasure trove for making new fundamental discoveries in human genetics. Some of those discoveries will be wholly unexpected, and others will detect or unravel biological mechanisms. Disease-specific discoveries will continue to spur the development and trials of new therapeutics, the understanding of pathways from sequence to consequence, and for some diseases, prevention or early intervention. In 10 years from now, genomic “personalized” or “precision” medicine is likely to be widespread and will include some applications to common diseases either directly through risk stratification for targeted prevention or intervention strategies or indirectly through new treatments where GWAS results provide the first step in the discovery pipeline. The experimental design of GWASs, which started as a theoretical exercise more than 20 years ago,145 has matured and delivered.