Cancer immunogenomics originally was framed by research supporting the hypothesis that cancer mutations generated novel peptides seen as “non-self” by the immune system. The search for these “neoantigens” has been facilitated by the combination of new sequencing technologies, specialized computational analyses, and HLA binding predictions that evaluate somatic alterations in a cancer genome and interpret their ability to produce an immune-stimulatory peptide. The resulting information can characterize a tumor’s neoantigen load, its cadre of infiltrating immune cell types, the T or B cell receptor repertoire, and direct the design of a personalized therapeutic.
Brief History of Tumor-Specific Mutant Antigens and Immunogenomics
The underpinnings of modern immunogenomics resulted from hypotheses generated and tested by visionaries in cancer immunology during the late 1980s through the 1990s. Their central hypothesis was that cancer cells presented novel, tumor-specific (i.e., mutated) peptides on the cancer cell surface bound by the patient’s HLA molecules. By virtue of this cell surface presentation, specific T cell immunity might be elicited to these “neoantigens.” Supporting evidence for this hypothesis was demonstrated in cancers of non-viral origin (Old and Boyse, 1964, Foley, 1953, Prehn and Main, 1957). This foundational work led to the identification and characterization of the role of MHC proteins in antigen presentation (Babbitt et al., 1985, Bjorkman et al., 1987). Concomitantly, methods to grow antigen-specific cytolytic T lymphocytes (CTLs) in culture were also developed (Cerottini et al., 1974, Gillis and Smith, 1977), as were the molecular biology procedures to clone and express gene products. Thierry Boon’s laboratory combined these new methods to identify the first tumor specific antigen (TSA), a point mutation in a protein called P91A (De Plaen et al., 1988). Subsequently, Hans Schreiber’s laboratory demonstrated that TSAs also function as neoantigens using primary UV-induced mouse tumors (Monach et al., 1995). Similarly, groups studying human melanomas showed they could identify T cells in the peripheral circulation that bind melanoma cells preferentially over normal cells from the same patient (Dubey et al., 1997, Knuth et al., 1984, Robbins et al., 1996, Van den Eynde et al., 1989). Shortly thereafter, Boon’s laboratory cloned the first human TSA, called MAGEA1 (van der Bruggen et al., 1991), and Sahin’s group demonstrated an autologous antibody-based method to clone and identify different human TSAs (Sahin et al., 1995). 
While these foundational studies established supporting evidence for the existence of tumor-specific peptide neoantigens, the lengthy and painstaking nature of these processes was unlikely to scale to clinical application for cancer patients.
More recently, these limitations have been alleviated by the application of new sequencing technologies and associated computational data analysis approaches. These methods, collectively referred to as “immunogenomics,” have improved the facility with which individual cancers can be studied to predict their neoantigens for prognostic purposes or to inform immunotherapeutic interventions. Complementary methods have been developed to study the changes in the T cell repertoire, to characterize the gene expression signatures of the immune cell types present in the tumor mass, and to design personalized vaccines or adoptive cell transfer (ACT) therapies. The now scalable nature of immunogenomic methods should permit their widespread clinical application, although there remain issues and challenges to be resolved. This primer will highlight the specific methods and describe the known strengths and weaknesses in modern immunogenomics.
Somatic Mutations Generate Neoantigens
It has long been known that cancer is caused by alterations to genomic DNA that impact protein functions, ultimately disrupting cellular control of pathways and resulting in the outgrowth of a tumor mass. Methods using next generation sequencing platforms generate data from tumor and normal DNA isolates that, once aligned to the Human Reference Genome sequence, can be interpreted to identify somatic alterations (Ley et al., 2008). In practice, such analyses aim to identify DNA alterations in known cancer genes, both oncogenes and tumor suppressors that combine to transform the founder cell. For certain oncogenes, identified mutations indicate therapeutic interventions that may successfully halt the tumor cell growth. By contrast, immunogenomic approaches aim to identify tumor-specific DNA alterations that predict amino acid sequence changes in all encoded proteins, and then evaluate their potential as neoantigens. In practice, most TSAs identified to-date are highly unique to each patient and generally do not involve known cancer genes.
Hence, the widespread use of next-generation sequencing (NGS) instrumentation has enabled immunogenomics, providing a facile way to generate data to predict tumor-specific neoantigens in a rapid, inexpensive and comprehensive manner (Gubin et al., 2015). NGS technologies have rapidly evolved over the past 10 years, resulting in dramatically increased amounts of sequencing data produced per instrument run at ever-decreasing costs (Mardis, 2017). In immunogenomics, since the focus is protein-coding genes, solution hybridization-based methods are used to select these sequences (“exome”) prior to sequencing (Bainbridge et al., 2010, Gnirke et al., 2009, Hodges et al., 2009). Importantly, the concomitant development of advanced variant detection algorithms that identify different classes of mutations from NGS data has enabled the identification of all classes of somatic variation. Accurate detection of variants in this setting is influenced by multiple factors, which are presented here in detail.
One important consideration for somatic variant detection is depth of coverage by NGS reads from the tumor. In principle, since tumor samples include variable percentages of normal cells, the depth of NGS data generated must be adjustable to ensure that a sufficient representation of tumor-derived sequence reads is obtained. Isolating DNA from selected, tumor-rich areas of a biopsy or resection sample is ideal, but not always possible, so average read depths of 300- to 500-fold exome coverage are typically attempted to compensate for the normal cell DNA-derived reads. A second reason for high coverage of the tumor-derived DNA is to enable the evaluation of founder clone versus subclonal mutations in the resulting data. Here, we define founder mutations as the original set of mutations present in the cell that transformed from normal to neoplastic, whereas subclonal mutations occur as the daughter cells of this founder acquire additional mutations during growth of the tumor mass. Based on this definition, founder clone mutations in diploid regions of the exome have a proportional fraction of variant-containing sequencing reads (variant allele fraction or VAF) that is around 50% (adjusted for normal DNA contribution), since most somatic mutations are heterozygous. In theory, neoantigens that result from founder clone mutations should elicit a T cell response that targets all cancer cells, rather than only the subset of tumor cells that would be targeted by a T cell response to subclonal neoantigens included in a vaccine.
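The relationship between normal cell contamination and the expected VAF can be made concrete with a short calculation. The sketch below assumes a heterozygous mutation in a diploid region, with normal cells contributing only reference alleles. The function names and the tolerance value are illustrative assumptions and do not come from any published pipeline.

```python
def expected_clonal_vaf(purity: float) -> float:
    """Expected VAF of a heterozygous founder mutation in a diploid region.

    Every tumor cell carries one mutant and one reference allele; every
    normal cell carries two reference alleles, so VAF = purity / 2.
    """
    if not 0.0 < purity <= 1.0:
        raise ValueError("purity must be in (0, 1]")
    return purity / 2.0


def cancer_cell_fraction(vaf: float, purity: float) -> float:
    """Estimated fraction of tumor cells carrying the variant (capped at 1)."""
    if not 0.0 < purity <= 1.0:
        raise ValueError("purity must be in (0, 1]")
    return min(1.0, 2.0 * vaf / purity)


def is_founder_candidate(vaf: float, purity: float, tolerance: float = 0.2) -> bool:
    """Treat a variant as a founder candidate if nearly all tumor cells carry it.

    The tolerance is an arbitrary illustrative value that absorbs sampling noise.
    """
    return cancer_cell_fraction(vaf, purity) >= 1.0 - tolerance
```

For a sample estimated at 60% tumor cells, a founder mutation is expected near a VAF of 30% rather than 50%, and a variant observed at 10% VAF would be placed in roughly one third of the tumor cells, i.e., in a subclone. Copy number changes and loss of heterozygosity violate the assumptions used here and would require a more complete model.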
Equally important to appropriate coverage depth for accurate prediction of variants is the algorithm or set of algorithms used to identify variants from the NGS exome data. The factors to consider here include the types of variants one wishes to evaluate in neoantigen discovery. For example, single nucleotide variants (point mutations) are easiest to predict with high accuracy because reads containing a single variant are readily aligned to their reference genome “match,” and because a variety of algorithms can also detect low-VAF variants. Variant detection from NGS reads has been an area of rapid development and there are many algorithms to choose from, with variable performance, as has been evaluated (Cornish and Guda, 2015, Ghoneim et al., 2014, Krøigård et al., 2016). By contrast, variants resulting from insertion or deletion of one or a few nucleotides (“indels”) are significantly more difficult to identify because standard alignment algorithms often fail to align the variant-containing sequencing reads, which leads to lower coverage of those reads in these regions (Jiang et al., 2012, Jiang et al., 2015, Ratan et al., 2015). However, indels may be important to immunogenomics efforts because they can introduce frameshift mutations that result in highly divergent amino acid sequences in the resulting protein and hence may produce strong predicted neoantigens. Increased read lengths on NGS platforms have improved indel detection, as has the use of gapped alignment or split-read algorithms that are computationally intensive but better able to align the indel-containing reads to the reference genome. Assembly-based realignment approaches also have been developed to improve the precision of indel variant detection (Mose et al., 2014, Narzisi et al., 2014).
Another type of somatic variation that can lead to highly altered amino acid sequences, and as a result create a neoantigenic peptide, is a structural variant that fuses two protein-coding sequences. These can result from inversion or deletion of a chromosomal segment or from chromosomal translocations. Detecting these alterations from exome sequencing data is quite challenging and error-prone, but RNA-based analysis can identify the resulting fusion transcript (Li et al., 2011, Scolnick et al., 2015, Zhang et al., 2016a, Kumar et al., 2016). The predicted fusion sequence can then be compared to NGS data from DNA (whole genome or exome sequencing) to identify supporting evidence of the genomic event causing the fusion. Recently, we adapted this approach for neoantigen prediction with a process called IntegrateNEO, using the TMPRSS2-ERG fusions common in prostate cancer to evaluate its ability to identify fusion peptide neoantigens (Zhang et al., 2016b). RNA-seq data bring added value to immunogenomics efforts beyond the detection of fusion peptides, as will be described later.
Once variant detection is completed, each variant is annotated to predict the amino acid change(s), if any, that result from the altered DNA sequence. Widely used computational tools such as Annovar and VEP are available to produce the translated peptides from the DNA data. The translated peptides constitute one type of input data for the neoantigen prediction software to calculate the class I or class II predicted binding affinities.
The second data input for neoantigen prediction is the set of HLA haplotypes of the patient, which is also derived from exome data, since exome capture reagents include the HLA gene loci. Previously, HLA typing was performed using a PCR-based and Sanger sequencing-based clinical assay. The repetitive nature of the HLA genes requires a high-stringency assembly of these genes, which can be achieved using the >500 bp read lengths from Sanger data. Sequence analysis of these regions based on hybrid capture-derived NGS reads, which are relatively short (∼100 bp), requires a stringent alignment of the read data to the IMGT/HLA database (Robinson et al., 2001) using a haplotype-resolved algorithm to interpret the HLA class I and II haplotypes. There now exist multiple algorithms for accomplishing these data interpretations, including Polysolver (Shukla et al., 2015), HLAMiner (Warren et al., 2012), and OptiType (Szolek et al., 2014). Typically, one interprets the normal tissue-derived exome data to obtain the HLA haplotypes. Clinical analysis of these genes also should include repeating the alignment with the tumor-derived exome data and calling mutations, in order to identify HLA alleles that are impacted by nonsense mutations, deletions, or other similarly deleterious types of somatic alterations that may influence the presence of that allele (Shukla et al., 2015). Some algorithms also can use RNA-derived data to interpret the HLA haplotypes (Warren et al., 2012).
Another critical component of identifying neoantigens is the in silico prediction of HLA class I and II binding affinities for specific peptides. These predictions are quite computationally complex and require machine learning-based approaches to establish models for the different types of binding site interactions. In particular, each peptide interacts with the binding pocket residues of the many different HLA proteins through the amino acid side chains of specific residues. Therefore, the binding affinity of any peptide is sequence-specific relative to that patient’s HLA proteins, some of which may be common and some rare. There also are differences in the binding of peptides by class I or class II HLA that impact the precision of neoantigen prediction, as described later. Finally, there is considerable debate about an appropriate cutoff value for binding affinity in terms of what does or does not constitute a strong neoantigen candidate (Duan et al., 2014, Bassani-Sternberg et al., 2016).
The initial approach to computational HLA binding predictions utilized a neural network-based learning method developed from a training set of experimentally derived binding affinities for class I HLA proteins and different peptides. This effort resulted in an HLA class I binding prediction software known as netMHC, devised by researchers in the Center for Biological Sequence Analysis at the Technical University of Denmark (Lundegaard et al., 2008a, Lundegaard et al., 2008b, Nielsen et al., 2003). The predictor has improved over time with the availability of training datasets for HLA proteins that are rarer in the population, although calculated binding affinities for the rarest HLA alleles in humans remain less certain (Wang et al., 2010). An interim approach to address rare HLA class I binding calculations was PickPocket, which extrapolated from variants with known binding specificity to those without existing experimental data (Zhang et al., 2009). The most recent version is netMHCstabpan (Rasmussen et al., 2016), which uses a neural network approach based on a dataset of stability values calculated for different peptide-MHC-I complexes, rather than their binding affinity values, since the stability of their interaction has experimentally been shown to be more strongly correlated to T cell immunogenicity. Another early method developed to generate class I binding predictions was based on a stabilized matrix method (SMM) algorithm developed by Peters and Sette (Peters and Sette, 2005). This approach models the sequence specificity of binding processes as a means of predicting outcomes for untested sequences. SMM not only predicts HLA binding but also evaluates peptide transport as a function of antigen presentation and proteasomal cleavage with the TAP algorithm.
Subsequent efforts to develop new class I binding affinity prediction software have included the use of combined support vector machine-based (SVM) and random forest machine-learning approaches (Srivastava et al., 2013), or combined the information obtained from amino acid pairwise contact potentials and quantum topology molecular similarity descriptors (Saethang et al., 2013) to better model HLA class I peptide interactions.
Figure 1
An Overall Workflow for Neoantigen Discovery and Personalized Cancer Vaccine Design
Starting from next-generation sequencing of DNA exomes to compare tumor to normal DNA, and of tumor RNA to evaluate gene expression, this figure illustrates the steps outlined in the primer to identify tumor-specific mutant antigens (neoantigens) from NGS data, to evaluate the neoantigens, and to design a personalized neoantigen vaccine.
With the requisite information generated by NGS to call somatic variants and interpret their impact on protein sequences, and to identify the HLA haplotypes specific to the patient, neoantigen prediction software can be used to predict both the class I and class II HLA binding affinities for each tumor-unique set of peptides. Considerations and specifics for these prediction approaches are described in detail below. A number of binding prediction tools and associated immunogenomics algorithms are available at the Immune Epitope DataBase (IEDB) analysis resource (http://tools.immuneepitope.org/main/) (Robinson et al., 2013). The IEDB web interface permits the input of peptide sequences for sequential evaluation by user-configured steps using the software of choice to predict neoantigens. Software pipelines also are publicly available for local download and computation of neoantigen predictions by end-users, including pVAC-seq (https://github.com/griffithlab/pVAC-Seq) and epidisco (https://github.com/hammerlab). An overall workflow for the processes described above is shown in Figure 1.
Class I Predictions
Approaches to predict HLA class I neoantigens typically begin by parsing the tumor-specific peptides predicted from variant calling as 21-mer peptides that encompass the variant amino acid(s) placed as near to the center of the 21-mer as possible. This is easiest to envisage for simple non-synonymous amino acid substitutions (Figure 2A). The variant-containing 21-mer is then tiled to define a set of 8- to 11-mers to input for binding calculations, based on HLA class I binding characteristics (Figure 2B). These peptide sets are parsed along with their corresponding wild-type peptide sequences as input data for consideration by neoantigen prediction software, along with information about the HLA class I haplotypes determined for the patient. The resulting list of neoantigens can be quite extensive, depending upon the numbers and types of input peptide sequences and the diversity of the HLA haplotypes. Applying several criteria, if desired, can winnow the numbers of neoantigens. One conventional approach is to only consider variant peptides with a strong- to intermediate-binding affinity (typically lower than 500 nM), but this arbitrary cutoff is controversial because strong neoantigens can have calculated affinities that are weaker than their actual affinities. This sometimes is due to the presence of a rare HLA haplotype, for which the neural net software provides an inaccurate binding affinity prediction. Thus, for each altered locus, one can select the candidate peptide with the single best binding affinity to each corresponding HLA allele across all peptide lengths considered, or proceed with all candidates for all HLA alleles to additional filtering steps, as follows.
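The windowing and tiling step for a single amino acid substitution can be sketched in a few lines. This is a minimal illustration, not the implementation used by any particular pipeline; the function names are hypothetical, positions are 0-based, and the input is assumed to be the full mutant protein sequence.

```python
from typing import List, Tuple


def window_around_variant(protein: str, pos: int, flank: int = 10) -> Tuple[str, int]:
    """Return the window (up to 21 residues) around `pos` and the variant's index in it.

    The window is truncated when the variant lies near either end of the protein.
    """
    start = max(0, pos - flank)
    end = min(len(protein), pos + flank + 1)
    return protein[start:end], pos - start


def tile_peptides(window: str, variant_idx: int, lengths=range(8, 12)) -> List[str]:
    """List every 8- to 11-mer in the window that contains the variant residue."""
    peptides = []
    for k in lengths:
        first = max(0, variant_idx - k + 1)         # leftmost start that still covers the variant
        last = min(variant_idx, len(window) - k)    # rightmost start that fits in the window
        for start in range(first, last + 1):
            peptides.append(window[start:start + k])
    return peptides
```

For a full 21-mer with the variant at the center, this yields 38 peptides (8 + 9 + 10 + 11). Running the same tiling on the wild-type sequence at the same position produces the matched wild-type peptides used for comparison.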
Figure 2
Idealized Selection of Mutant-Containing Peptides for Neoantigen Prediction
(A) The localized peptides that tile across and contain the mutated amino acid substitution are identified and parsed into the neoantigen prediction pipeline. Each peptide is considered for HLA binding strength relative to its non-mutant (wild-type) counterpart.
(B) Shown is the top scoring candidate peptide that was selected across all specified k-mers and between all HLA types that were input to the neoantigen prediction pipeline.
Three important additional filters should be applied to remove false positives: (1) RNA-based filtering to remove genes with no evidence of expression, (2) filtering based on exome data coverage depth at the variant loci, and (3) filtering based on variant allele fraction (VAF)-based metrics. The RNA expression filter ensures that each peptide is supported by evidence of RNA expression, wherein evidence of RNA expression is considered a reasonable, but not absolute, proxy that the gene is expressed in the tumor cell proteome. For the NGS coverage filter, a minimum level of normal read coverage depth is required to ensure there is sufficient sequencing data coverage from the normal tissue (i.e., supports a true positive somatic variant call). Finally, both DNA and RNA data should be evaluated to ascertain the percentage of variant-containing reads or variant allele fraction (VAF). As described earlier, this criterion helps to inform the final list of neoantigen candidates by providing information on whether a specific alteration is shared across all tumor cells (i.e., in the founder clone) or is subclonal, based on DNA sequencing data, and ensures that a variant is expressed in the tumor RNA. The latter is especially important in tumor types with a high mutation load such as those with chemical or UV damage to DNA, since upward of 50% of mutations are typically not expressed in RNA (or protein by inference) for these tumors. With these filtering steps completed, a list of high-confidence, predicted neoantigenic peptides, the HLA class I proteins predicted to bind them, their calculated binding affinity value(s), and the binding affinity value(s) of the cognate wild-type peptides can be parsed for further consideration in vaccine design or other immunological evaluations such as neoantigen burden.
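A filtering step of this kind might be expressed as follows. The sketch assumes each candidate already carries its predicted affinity and the supporting sequencing metrics. The record layout, field names, and every threshold other than the 500 nM affinity cutoff mentioned earlier are hypothetical values chosen only for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Candidate:
    peptide: str
    hla_allele: str
    mutant_affinity_nm: float  # predicted binding affinity of the mutant peptide
    gene_expression: float     # gene-level expression from tumor RNA (e.g., FPKM)
    normal_depth: int          # read depth at the variant locus in the normal exome
    tumor_dna_vaf: float       # variant allele fraction in tumor DNA
    tumor_rna_vaf: float       # variant allele fraction in tumor RNA


def passes_filters(c: Candidate,
                   max_affinity_nm: float = 500.0,
                   min_expression: float = 1.0,
                   min_normal_depth: int = 20,
                   min_dna_vaf: float = 0.05,
                   min_rna_vaf: float = 0.05) -> bool:
    """Apply the affinity cutoff plus the expression, coverage, and VAF filters."""
    return (
        c.mutant_affinity_nm < max_affinity_nm    # strong- to intermediate-binding
        and c.gene_expression >= min_expression   # gene shows evidence of expression
        and c.normal_depth >= min_normal_depth    # normal coverage supports a somatic call
        and c.tumor_dna_vaf >= min_dna_vaf        # variant is supported in tumor DNA
        and c.tumor_rna_vaf >= min_rna_vaf        # variant allele itself is expressed
    )


def filter_candidates(candidates: List[Candidate], **thresholds) -> List[Candidate]:
    """Keep only the candidates that pass every filter."""
    return [c for c in candidates if passes_filters(c, **thresholds)]
```

Keeping the thresholds as parameters makes it straightforward to relax the affinity cutoff for patients with rare HLA haplotypes, where the predicted values are less reliable.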
In the former case, neoantigen predictions have been tested in clinical trials of personalized vaccines, with demonstrated ability to elicit specific T cell responses (Schumacher et al., 2014, Carreno et al., 2015, Tran et al., 2014). In the latter case, there are demonstrated correlations between neoantigen burden and the likelihood of response to checkpoint blockade therapies (Le et al., 2015, Rizvi et al., 2015, Snyder et al., 2014, Van Allen et al., 2015), and a demonstration that predicted neoantigens also are the epitopes targeted by checkpoint blockade immunotherapies (Gubin et al., 2014).
Figure 3
Structure and Diversity in the T Cell Receptor
(A) The mature T cell heterodimer, consisting of α- and β-subunit chains. The α subunit chains consist of variable (V), joining (J), and constant (C) regions, whereas the β subunit includes an additional diversity (D) region.
(B) V-D-J recombination and post-transcriptional processing of a TCR-β subunit chain.
Class II Predictions
HLA class II predictions are significantly more difficult to generate with precision due to the nature of the HLA class II proteins. First, class II HLA proteins are heterodimers of alpha and beta peptides encoded by four different loci in the human genome. Only one of these four loci is not highly polymorphic (Robinson et al., 2003), meaning there is extensive HLA class II polymorphism in the general population. This becomes somewhat less complex if neoantigen predictions focus on the most frequently expressed class II molecules (McKinney et al., 2013). Second, certain peptides bind to multiple different HLA class II molecules and are responsible for the majority of antigen-specific T cell responses. These so-called “promiscuous peptides” are difficult to predict using computational approaches. Third, the HLA class II binding groove is open on both ends, and although the core binding motif is a 9-mer amino acid sequence, variable-length peptides are allowed to bind. Many of the HLA-II polymorphic sites lie in regions of the binding groove outside the core motif binding region, which allows the flanking amino acid sequences on either side of the motif sequence to influence its binding affinity. As a result, binding affinities are difficult to predict with a high degree of precision. Input data for MHC class II binding predictions consist of 15-mer representatives of each somatic neoantigen candidate peptide, along with the patient’s HLA class II haplotypes. A binding affinity cutoff of <1,000 nM may be utilized to distinguish strong binders, but given the vagaries of binding affinity predictions described above, this cutoff may not be appropriate. RNA expression level has been identified as a critical filtering parameter for predicted class II neoantigen candidates, whereby those peptides corresponding to genes with higher relative expression values from RNA-seq data analysis are considered to be the strongest candidates (Kreiter et al., 2015).
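One way to see why the open-ended class II groove complicates prediction is to enumerate the possible placements of a 9-mer core within a 15-mer input peptide. The sketch below is purely illustrative: the function name is hypothetical, and it only lists the candidate registers and their flanks without scoring them, which is the part that requires a trained predictor.

```python
from typing import List, Tuple


def binding_registers(peptide: str, core_len: int = 9) -> List[Tuple[str, str, str]]:
    """Return (N-terminal flank, core, C-terminal flank) for every core placement."""
    return [
        (peptide[:start], peptide[start:start + core_len], peptide[start + core_len:])
        for start in range(len(peptide) - core_len + 1)
    ]
```

A 15-mer has seven possible registers, each pairing a different 9-mer core with different flanking residues, so a predictor must evaluate both which core is bound and how the flanks modulate that binding.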
Computational predictions, considering the aforementioned caveats for both class I and II, therefore only offer putative neoantigen candidates that may be subject to a variety of errors or sources of inaccuracy. In addition to what we already have described, there are other challenges to accurate neoantigen prediction. First, even though RNA evidence supports a variant as being expressed, the most accurate evidence of a peptide’s presence in the cell is identifying that peptide from mass spectrometry-based proteomic data derived from the specific tumor under study. Second, binding affinity calculations are more accurate for the common class I HLA haplotypes, less so for rarer haplotypes. Third, a significant biological confounder of neoantigen discovery is our inability to predict precisely which of the putative neoantigen peptides will be processed in the tumor cell degradasome, then bound to and properly presented by HLA molecules on the cell surface. This critical component of T cell activation must occur for the neoantigen to stimulate a specific immune response, yet it is presently not possible to computationally predict the processing and presentation of peptides by HLA. One way to inform neoantigen prediction methods is using experimental measurements of T-cell-based immune responses to the predicted peptide epitopes. There are conventional methods such as EliSpot (IFN-gamma release) assays (Cole, 2005), flow cytometry-based dextramer assays (Carreno et al., 2015), and mass spectrometry-based evaluation of HLA-bound peptides (Gubin et al., 2014). However, scalable, high-throughput methods are in development at present and will require time and testing.
Immune Repertoire Profiling
Cellular immune responses from T cells and humoral immune responses from B cells are stimulated by exposures to antigens, including pathogens, allergens, and neoantigens. V(D)J recombination in the primary lymphoid organs creates the incredibly diverse and unique repertoire of the hypervariable regions of B cell receptors (BCR) and T cell receptors (TCR), and somatic hypermutations contribute to additional BCR diversity (Figure 3). During B and T cell development, self-antigens are presented to B and T cells to select out self-reacting types, and to ensure only B and T cells that recognize and attack foreign antigens are in the circulation. T cells only recognize foreign proteins presented on MHC, while B cells can also target foreign DNA, lipids, or carbohydrates. Upon recognition of foreign antigens and with the presence of co-stimulatory molecules, B and T cells express cell surface activation markers, attack foreign antigens, secrete cytokines, stimulate each other, and proliferate (Pasternack, 1994). One goal of immunogenomic studies is to characterize the repertoire of B and T cells in patients with cancer, especially before and after immunotherapy-based interventions.
DNA sequencing approaches have enabled the characterization of immune repertoires (Pasternack, 1994, Robins, 2013). After a pioneering study introduced the technique (Freeman et al., 2009), a plethora of immune repertoire methods have been published and commercial solutions are also available. Several studies (Calis and Rosenberg, 2014, Hou et al., 2016, Yaari and Kleinstein, 2015) have evaluated the experimental techniques and practical advice needed for immune repertoire profiling. Basically, multiplex PCR can amplify the recombined V(D)J regions from either mRNA or DNA in the B or T cells. The V(D)J, and most importantly, the variable complementarity-determining region CDR3 sequences, and their respective abundance can be resolved by high-throughput sequencing. Paired-end sequencing with additional PCR primers in the middle of the fragment permits full-length TCR repertoire sequencing with short read NGS technology to resolve the V/J pairing (Cole et al., 2016). One caveat to this approach is that PCR biases and sequencing errors can falsely increase the total repertoire with deeper sequencing coverage, so unique molecular identifier barcodes should be used to eliminate such artifacts (Cole et al., 2016), although such an approach is presently only available for RNA-based repertoire profiling.
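The role of unique molecular identifiers (UMIs) can be illustrated with a simplified collapsing step. The sketch assumes each read has already been reduced to a (UMI, CDR3) pair, and it takes the most frequent CDR3 within each UMI group as that molecule's sequence. It does not reproduce the algorithm of any specific tool, and real implementations also correct errors within the UMI itself.

```python
from collections import Counter, defaultdict
from typing import Iterable, Tuple


def collapse_by_umi(reads: Iterable[Tuple[str, str]]) -> Counter:
    """Count unique molecules per CDR3 from (umi, cdr3) read pairs.

    Reads sharing a UMI are treated as PCR copies of one original molecule,
    and the majority CDR3 in the group suppresses isolated sequencing errors.
    """
    by_umi = defaultdict(Counter)
    for umi, cdr3 in reads:
        by_umi[umi][cdr3] += 1

    molecules = Counter()
    for seqs in by_umi.values():
        consensus, _ = seqs.most_common(1)[0]
        molecules[consensus] += 1
    return molecules
```

Counting molecules rather than reads means that a clonotype amplified preferentially by PCR is not over-represented, and a low-frequency error variant that appears only inside a UMI group is not reported as a separate clone.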
Computational methods, as summarized elsewhere (Greiff et al., 2015a, Greiff et al., 2015b), are important components for the analysis, annotation, and visualization of immune repertoires. To this end, IMGT (Giudicelli et al., 1997) is the most widely cited immunogenetics database and provides many useful tools such as V-QUEST and HighV-QUEST (Alamyar et al., 2012) as well as statistical metrics (Aouinti et al., 2015) for the analysis and annotation of immune repertoire data. VDJtools (Shugay et al., 2015) is a comprehensive analysis framework for T cell and B cell repertoire sequencing data. It includes MiXCR for fast alignment and clonal type assembly (Bolotin et al., 2015), MIGEC for removing duplicates and combining barcodes (Shugay et al., 2014), and VDJviz for visualization (Bagaev et al., 2016), and provides basic statistical analyses for characterizing and comparing different immune repertoires.
The initial output from a repertoire profiling analysis is a list of BCR/TCR CDR3 sequences, sometimes including the adjoining V and J sequences, each followed by an abundance estimate. This output allows samples to be compared and clustered, if desired. For example, common CDR3 sequences that are shared among individuals indicate BCR/TCR clones that recognize common antigens such as herpes or common cold viruses. In comparison, CDR3 clones that are rare among patients but are abundant within a tumor, and more importantly for BCR lineage-related CDR3s with small numbers of mutations, indicate T/B cell recognition of the patient tumor-specific antigens (Saul et al., 2016). Repertoire profiles from individuals with similar ethnic backgrounds, lifestyles, or environmental exposures are often clustered. Two independent metrics, diversity (often measured by the Shannon entropy), and evenness (indicative of the degree of clonal expansion), have been proposed as important characteristics of immune repertoires (Greiff et al., 2015b). Since V(D)J recombination in TCR only occurs in children, TCR diversity generally declines with age. In contrast, V(D)J recombination in BCR occurs throughout life although at reduced levels in adults, and activated BCR undergoes somatic hypermutation to improve the antibody affinity to the recognized antigen, so the BCR diversity distributions assume more complex patterns. Although immune-stimulating events such as allergy or vaccination could shift the abundance of some clones, the immune repertoire has been suggested as a means to monitor an individual’s immune health (Johnson et al., 2014). The utility of this metric depends on the accurate measure of clonal abundance, which requires linear amplification from multiplex PCR products and additional normalization of TCR/BCR expression levels for RNA-based profiles. Furthermore, the method and time-span of sample storage can also influence sample quality for repertoire profiling.
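Both metrics can be computed directly from the clonotype abundance list. The sketch below uses the Shannon entropy for diversity and, as an assumption, Pielou's index (entropy divided by its maximum possible value) for evenness; other definitions of evenness and clonality are in use, and the function names are illustrative.

```python
import math
from typing import Sequence


def shannon_entropy(counts: Sequence[int]) -> float:
    """Shannon entropy (natural log) of a clonotype abundance distribution."""
    total = sum(counts)
    if total <= 0:
        raise ValueError("counts must contain at least one positive value")
    return -sum((c / total) * math.log(c / total) for c in counts if c > 0)


def pielou_evenness(counts: Sequence[int]) -> float:
    """Entropy normalized by log(number of clonotypes); 1.0 means perfectly even.

    A repertoire with a single clonotype is assigned 0.0 by convention here.
    """
    n_clones = sum(1 for c in counts if c > 0)
    if n_clones <= 1:
        return 0.0
    return shannon_entropy(counts) / math.log(n_clones)
```

A repertoire of four equally abundant clones has an evenness of 1, whereas one dominated by a single expanded clone (for example, counts of 97, 1, 1, and 1) has an evenness close to 0.12, reflecting strong clonal expansion.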
While immune repertoires are informative, profiling them over large sample cohorts can be expensive. Computational methods have been developed to directly infer immune repertoires from unselected bulk tumor RNA-seq data, such as TRUST for TCR (Li et al., 2016a) and V’DJer for BCR (Mose et al., 2016). The hypervariability of the CDR3 regions of TCR and BCR renders the RNA-seq reads from these regions unmappable to the human reference genome sequence, and somatic hypermutation adds additional challenges to BCR mapping and alignment. Both of the aforementioned methods select unmappable RNA-seq reads, align these unmapped reads to each other with de Bruijn graph methods, de novo assemble these alignments into contigs, and use IMGT (Giudicelli et al., 1997) to annotate those containing CDR3 motifs as potential BCR or TCR. Although these approaches only recover the most abundant of the immune repertoires, they were used to analyze RNA-seq data across tumor samples profiled by The Cancer Genome Atlas (TCGA) and resulted in novel findings. For example, TRUST revealed increased T cell clonal diversity in tumor types with higher mutational loads and potential neoantigens based on their co-occurrence with CDR3-containing sequences in the tumors (Li et al., 2016a), while V’DJer reported higher somatic hypermutation in IgG and IgA than in IgM (Mose et al., 2016).
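The general idea of assembling overlapping reads with a de Bruijn graph can be shown with a toy example. This is not the implementation used by TRUST or V’DJer; it is a minimal sketch with hypothetical function names that ignores sequencing errors, reverse complements, and read abundance, and it extends a contig only while the path through the graph is unambiguous.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set


def build_de_bruijn(reads: Iterable[str], k: int) -> Dict[str, Set[str]]:
    """Map each (k-1)-mer prefix to the (k-1)-mer suffixes that follow it in any read."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph


def extend_contig(graph: Dict[str, Set[str]], seed: str) -> str:
    """Walk forward from `seed` while each node has exactly one successor."""
    contig = seed
    node = seed
    seen = {seed}
    while len(graph.get(node, ())) == 1:
        (nxt,) = graph[node]
        if nxt in seen:       # stop rather than loop around a cycle
            break
        contig += nxt[-1]     # each step adds one new base
        seen.add(nxt)
        node = nxt
    return contig
```

Given three overlapping 10-base reads drawn from a 19-base sequence, building the graph with k = 5 and extending from the first four bases reconstructs the full sequence. In a real analysis the resulting contigs would then be annotated for CDR3 motifs.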
Published studies have made fascinating observations on how immune repertoires can reflect an individual patient’s immune health and predict their response to therapy. The ability to reconstruct a more diverse TCR repertoire after autologous hematopoietic stem cell transplantation has been observed to predict better transplant outcomes in multiple sclerosis patients (Johnson et al., 2014, Muraro et al., 2014). Another study used TCR repertoire sequencing to compare each patient’s TCR before and after dendritic cell-based neoantigen vaccine dosing, illustrating expanded TCRs for the vaccine peptides that elicited a T cell response (Carreno et al., 2015). For metastatic melanoma patients, the anti-CTLA4 antibody ipilimumab has been shown to increase peripheral blood TCR diversity (Robert et al., 2014), and those patients with higher peripheral TCR diversity before treatment were reported to respond better to ipilimumab (Postow et al., 2015). In contrast, the anti-PD-1 antibody pembrolizumab showed better efficacy in melanoma patients whose pre-treatment tumor-infiltrating T cells were less diverse and more clonal (Tumeh et al., 2014). This study also demonstrated that more tumor-infiltrating T cell clones expanded after treatment in the therapy-responsive group than in the (non-responding) disease progression group. Although these pioneering studies were conducted on a limited number of patients, they do suggest the TCR repertoire as a potential universal cancer immunotherapy biomarker (McNeel, 2016). Potentially, overall patient immune health inferred from the peripheral TCR repertoire, together with signs of neoantigen recognition and clonal expansion in the tumor TCR repertoire before treatment, could predict better patient response to cancer immunotherapies.
As an example, one bioinformatics study using a Potential Support Vector Machine-based approach reported the ability to predict an individual’s age, health, transplantation status, and development of lymphoid cancer based on repertoire profiles (Greiff et al., 2015b).
Distribution of Tumor Infiltrating Lymphocytes
Large-scale molecular tumor profiling often selects samples with high tumor purity to best characterize the molecular signatures of the tumor. While most cancer genomics studies are focused on the cancerous cells in the tumor tissue, the impurities, such as stromal cells, endothelial cells, and immune cells, could have a major impact on the development and progression of cancer. With genomic profiling, tumor purity can be estimated from DNA copy number (Carter et al., 2012), SNP allele frequency (Li and Li, 2014), RNA-seq (Yoshihara et al., 2013), or DNA methylation (Zhang et al., 2015, Zheng et al., 2014) data. Interestingly, these methods using orthogonal tumor profiling modalities yield very consistent tumor purity estimates, in distinct contrast to the estimates provided by pathologists, suggesting that molecular and morphological changes in the tumor do not appear simultaneously.
Pertinent to immunogenomic studies of cancer is the evaluation of tumor-infiltrating lymphocytes (TILs), which can involve traditional approaches such as flow cytometry and multiplex immunohistochemistry. Flow cytometry uses antibodies against proteins uniquely expressed on different subpopulations of immune cells to isolate specific subsets of these cells from blood or tissues. The resulting cell counts characterize the relative abundance of different subpopulations in individual cancer samples and can reveal changes following treatment. Flow cytometry requires relatively large fresh tissue samples for study, but the resulting isolated cells, once sorted, can be cultured and profiled. Multiplex immunohistochemistry (IHC) can simultaneously capture the expression levels of multiple proteins in formalin-fixed paraffin-embedded (FFPE) tissue, with the advantage of capturing their spatial organization and co-expression patterns, although the number of proteins that can be differentially stained on each tissue slide is limited.
In addition to these conventional approaches, recent computational methods have also advanced our understanding of TILs. In a seminal study (Rooney et al., 2015), Rooney and colleagues used granzyme A and perforin expression levels to model the immune cytolytic activities in tumors studied in TCGA, observing increased cytolytic activities in tumors with higher mutation load, copy number aberration, viral infection, and lower tumor stage. This signature gene-based approach has been employed by two recent studies (Angelova et al., 2015, Şenbabaoğlu et al., 2016) to estimate immune subset abundance based on a collection of pre-selected markers. CIBERSORT (Newman et al., 2015) used an expert-selected signature of about 500 genes to infer the abundance of 22 different tumor-infiltrating immune components. In contrast, TIMER (Li et al., 2016b) selected cancer-specific signature genes to eliminate the bias from highly expressed genes in cancer cells and deconvolved only six immune components to ensure that collinear expression between closely related immune cells did not affect the deconvolution accuracy. These studies confirmed previous observations (Bindea et al., 2013, Rooney et al., 2015) and reported that CD8+ T cells are associated with better overall survival and fewer relapses, whereas macrophages are associated with worse clinical outcome in many cancer types (Li et al., 2016b).
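A two-gene cytolytic activity score of the kind described above can be sketched in a few lines. The example assumes the score is the geometric mean of granzyme A (GZMA) and perforin (PRF1) expression in TPM with a small offset to avoid zeros; the offset value, sample names, and expression values are illustrative assumptions rather than data from the cited study.

```python
import math


def cytolytic_score(gzma_tpm, prf1_tpm, offset=0.01):
    """Geometric mean of GZMA and PRF1 expression (TPM).

    The offset keeps the score defined when a gene has zero expression;
    its value here is an assumption of this sketch.
    """
    return math.sqrt((gzma_tpm + offset) * (prf1_tpm + offset))


# Hypothetical per-sample expression values (TPM): (GZMA, PRF1).
samples = {
    "tumor_A": (120.0, 85.0),
    "tumor_B": (4.0, 9.0),
    "tumor_C": (0.0, 2.5),
}

# Rank samples from most to least cytolytic activity.
ranked = sorted(samples, key=lambda s: cytolytic_score(*samples[s]), reverse=True)
for name in ranked:
    print(name, round(cytolytic_score(*samples[name]), 2))
```

A geometric mean is used so that a sample must express both genes to score highly, rather than being driven by one transcript alone.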
There have been inconsistent observations on whether the abundance of B cells is associated with improved cancer survival (DiLillo et al., 2010, Perricone et al., 2004, Qin et al., 1998, Schultz et al., 1990). One potential reason is that B cells with different activation statuses may either inhibit or promote T cell functions (Nelson, 2010). Another possible reason is that B cells are sometimes enriched at the margins of tumor capsules instead of evenly distributed throughout the tumor tissue (Kroeger et al., 2016, Lao et al., 2016, Nelson, 2010, Shi et al., 2013). Therefore, abundance estimates of B cells may vary depending on the specific tumor section assayed. By contrast, TCR-seq of different sections of a large ovarian tumor (Emerson et al., 2013) revealed that T cells are spatially homogeneous within the tumor, similar to peripheral blood. Therefore, it is possible that the correlation of TIL abundance with patient outcome will depend on the homogeneity of TIL distribution for different cancer types.
Applications
The culmination of our renewed understanding of the immune system and its interaction potential with cancer cells has been a decades-long effort to develop therapeutic approaches that boost existing immune responses against neoplastic cells. These efforts span widely variable approaches, and a comprehensive review has been recently published that explores the broad landscape of cancer immunotherapies (Galluzzi et al., 2014).
Certain types of cancer immunotherapies act to re-invigorate existing immunity that has been suppressed in the tumor microenvironment. These so-called “checkpoint blockade” therapies were devised to address our fundamental understanding of immunosuppression and T cell exhaustion, and provide a relatively tumor-specific immune response. However, there often are attendant side effects of variable severity, because their action targets native immune molecules such as CTLA-4, PD-1 and PD-L1. Potentially, more specific targeting could result from using putative neoantigens predicted by NGS-based analysis, described above, delivered as patient-specific vaccines meant to stimulate an immune response that is highly specific for the tumor cells. In this paradigm, several different vaccine types (or “platforms”) have emerged and are actively being tested in pre-clinical and clinical settings, as follows (Hirayama and Nishimura, 2016, Overwijk et al., 2013, Vormehr et al., 2015, Zhang et al., 2016c).
- DNA minicassette vaccines: One vaccine platform is based on piecing together the individual coding sequences for each predicted neoantigen peptide into a DNA construct that contains a specific human promoter element to drive peptide production, once introduced into the patient. The sequence-verified vaccine construct can be electroporated into patient-derived dendritic cells and the DCs then re-infused into the patient. Synthetic DNA is relatively cheaply and quickly obtained, even with the attendant GMP requirements for sequence verification prior to use in a human vaccine. Hence, concerns about cost and scalability of this approach are minimal. One design consideration is ensuring that no self-antigens are potentially encoded by the junctions between each neoantigen sequence, but this is relatively easy to confirm computationally once the proposed vaccine design is in-hand.
- Peptide vaccines: Synthetic peptides representing computationally identified neoantigens can be combined and solubilized in the presence of one or more immune-stimulatory adjuvants to create patient-specific peptide vaccines. These can be directly injected intramuscularly, intradermally, or subcutaneously as a means of presenting the neoantigenic peptides during maturation of native dendritic cells, which then can prime a robust and specific immune response. Short neoantigen peptides of 8–12 amino acids can directly bind to HLAs expressed on the surface of antigen-presenting dendritic cells, thereby priming a T cell-specific response. Peptide vaccines also can be composed of synthetic long peptides (25–30 amino acids), which require uptake, processing, and presentation by antigen-presenting cells in order to elicit an immune response. While GMP-grade peptides are expensive to manufacture, this is a scalable enterprise and, when coupled with the simplicity of the peptide vaccine design, is being applied in clinical trials of patient-specific vaccines (W. Gillanders, personal communication).
- RNA vaccines: Conceptually similar to DNA and peptide vaccines are RNA-based neoantigen vaccines, wherein the RNA encodes the various predicted neoantigens that are unique to each patient’s tumor. As with all RNA-based therapeutics, the lability of RNA creates a need to stabilize the RNA molecules and to provide for appropriate uptake by antigen-presenting cells so the encoded peptides can be processed and presented. The cost and scalability of RNA synthesis are as straightforward as for DNA, so packaging and stabilization are the main challenges for this platform, which is being actively pursued in the research setting.
- Autologous dendritic cell vaccines: Dendritic cells (DCs) occupy a unique position at the interface of innate and adaptive immunity, and have been shown to effect a robust, therapeutically relevant anti-neoplastic immune response. In particular, autologous dendritic cells can be isolated from patients and conditioned ex vivo to mature, thereby providing immune-stimulatory functions. When coupled with neoantigenic peptides from patient-specific analyses, the resulting dendritic cell vaccine can be re-infused and has been shown to elicit neoantigen-specific T cell immunity and an attendant expansion of the neoantigen-specific TCR (Carreno et al., 2015, Galluzzi et al., 2014). Emphasizing their specificity for tumor cells, no severe adverse events were recorded in this initial trial of patient-specific DC vaccines. However, not all of the predicted neoantigens elicited a T cell response, indicating that our ability to predict even class I neoantigens will require additional precision, as discussed herein. While these early first-in-human results are exciting, the preparation of dendritic cell vaccines requires significant amounts of peripheral blood mononuclear cells for dendritic cell isolation, as well as time- and effort-intensive laboratory work to culture and mature the DCs ex vivo. Hence, their scalability may be in question for broad-based clinical use.
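The junction check mentioned for DNA minicassette designs can be sketched as follows. The example enumerates every peptide that spans a boundary between adjacent neoantigen segments in the concatenated construct and flags those found in a reference set of self peptides. The peptide length range, the function names, and the `self_peptides` set (which in practice might hold all k-mers from the normal human proteome) are assumptions for illustration, and an exact-match lookup is a simplification of any real screening procedure.

```python
def junction_peptides(segments, min_len=8, max_len=11):
    """Return all peptides of length min_len..max_len that span a junction
    between adjacent segments of the concatenated construct."""
    construct = "".join(segments)
    boundaries, pos = [], 0
    for seg in segments[:-1]:
        pos += len(seg)
        boundaries.append(pos)  # index of the first residue after a junction
    found = set()
    for length in range(min_len, max_len + 1):
        for start in range(len(construct) - length + 1):
            end = start + length
            # A peptide spans a junction if it holds residues on both sides.
            if any(start < b < end for b in boundaries):
                found.add(construct[start:end])
    return found


def flag_self_junctions(segments, self_peptides, min_len=8, max_len=11):
    """Junction-spanning peptides that exactly match a known self peptide."""
    spanning = junction_peptides(segments, min_len, max_len)
    return sorted(spanning & set(self_peptides))


# Hypothetical neoantigen segments and a hypothetical self-peptide set.
segments = ["AAAAKLLL", "MMMMRQQQ"]
self_peptides = {"LLMM", "XXXX"}
print(flag_self_junctions(segments, self_peptides, min_len=4, max_len=4))
# ['LLMM']
```

If any junction peptide is flagged, one simple remedy is to reorder the segments and rerun the check.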
Besides cancer vaccines, another type of genomics-driven patient-specific cancer immunotherapy is adoptive cell transfer (ACT), as pioneered by Rosenberg and colleagues (Rosenberg and Restifo, 2015). In this approach, T cells extracted from a cancer patient, either from peripheral blood or the resected tumor, are activated and expanded ex vivo by IL-2 treatment before being infused back into the same patient to kill the cancer cells. Preparatory lymphodepletion of the patient, by either chemotherapy or radiation, is an important step prior to infusion that improves the engraftment and persistence of the adoptively transferred T cells, thus increasing the durability of tumor regression (Dudley et al., 2002). ACT cells not only persist for months after infusion, but also expand in the patient. The two additional genomic approaches below have been shown to further enhance tumor-specific killing and broaden the cancer types suitable for ACT. Despite the cost and the technical and logistical challenges of ACT, this personalized immunotherapy has demonstrated promising rates and duration of response.
- Genetically engineering T cells: T cells extracted from patients can be genetically engineered to express TCRs that specifically recognize proteins expressed only in the patient’s cancer cells, such as the melanoma/melanocyte-specific MART-1 antigen (Morgan et al., 2006) or the cancer-testis antigen NY-ESO-1 (Robbins et al., 2011). T cells also can be engineered by viral transduction to express a chimeric antigen receptor (CAR) that uniquely recognizes the B cell-specific CD19 (Kalos et al., 2011, Kochenderfer et al., 2010) on the cell surface. Linking the CAR with a co-stimulatory domain such as CD137 (Imai et al., 2004, Milone et al., 2009) or engineering the cells to express another chimeric costimulatory receptor recognizing a second antigen (Kloss et al., 2013) have both improved T cell antitumor activity. Recently a new clinical trial has been proposed in which CRISPR technology is applied to further engineer NY-ESO-1-targeting engineered T cells. Using a small number of CRISPR guide RNAs to knock out the PD-1 gene and the cells’ intrinsic TCR, this approach aims to eliminate immune suppression and improve the NY-ESO-1 receptor response. If proven effective, CRISPR-based genome engineering could provide new opportunities to manipulate other genes in immune cells ex vivo to achieve desired cancer-killing phenotypes.
- Expanding tumor-specific T cells: Instead of engineering the autologous T cells ex vivo, this approach separately cultures tumor-infiltrating T cell clones or subpopulations, then selects those reacting against tumor cells for massive expansion before patient infusion. With the emergence of exome sequencing, scientists can call somatic mutations from the tumor and computationally predict immunogenic neoantigens. Testing the immunogenicity of these mutations in parallel uses minigene constructs (described above) that encode the mutated peptides and are cloned into expression vectors; RNA transcribed in vitro from the vectors can then be electroporated into antigen-presenting cells (APCs). Culturing the tumor-infiltrating T cells for reactivity against these APCs selects the tumor-specific T cells and identifies the immunogenic mutant minigenes (Robbins et al., 2013). Compared to genetically engineered T cells, the final product infused into patients using this approach is composed of populations of different T cells recognizing different neoantigens. While most of these neoantigens are hypothesized to arise from passenger mutations, recently the Rosenberg group identified four T cell clones that specifically react to the KRAS G12D mutation in colon cancer (Tran et al., 2016), thereby drugging the undruggable.
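The step from a called somatic mutation to candidate peptides can be sketched as below. The first function enumerates every short peptide of a given length range that contains the substituted residue, as input for HLA binding prediction; the second extracts a mutation-centered window of the kind a minigene might encode. The function names, the 0-based position convention, the default flank of 12 residues, and the toy protein are assumptions for illustration; only single amino acid substitutions are handled.

```python
def mutant_peptides(protein, pos, alt, lengths=range(8, 13)):
    """All peptides of the requested lengths that contain the substituted
    residue. `pos` is the 0-based index of the mutated amino acid."""
    mutant = protein[:pos] + alt + protein[pos + 1:]
    peptides = set()
    for length in lengths:
        first = max(0, pos - length + 1)
        last = min(pos, len(mutant) - length)
        for start in range(first, last + 1):
            peptides.add(mutant[start:start + length])
    return sorted(peptides)


def minigene_window(protein, pos, alt, flank=12):
    """Mutated residue with up to `flank` wild-type residues on each side."""
    mutant = protein[:pos] + alt + protein[pos + 1:]
    return mutant[max(0, pos - flank):pos + flank + 1]


# Hypothetical 20-residue protein with an M -> A substitution at index 10.
protein = "ACDEFGHIKLMNPQRSTVWY"
candidates = mutant_peptides(protein, pos=10, alt="A", lengths=[8, 9])
print(len(candidates), "candidate peptides, e.g.", candidates[0])
print(minigene_window(protein, pos=10, alt="A", flank=3))  # IKLANPQ
```

Each candidate peptide would then be scored against the patient’s HLA alleles by a binding predictor, which is outside the scope of this sketch.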
Future Perspectives
As high-throughput technologies improve and our immunology knowledge grows, the future of immunogenomics-based applications to cancer appears quite promising and likely will continue to broaden. Technological and computational innovations will be instrumental to overcome existing challenges and move the field forward. First, despite the advances offered by algorithms such as NetMHCpan, the accuracy of MHC presentation prediction, especially for rarer alleles and for MHC class II, awaits improvement. In addition, most studies use MHC presentation of somatic mutations as a proxy to predict immunogenicity, although it is unclear which presented somatic mutations will elicit immune responses. Experimental assays such as ELISPOT are currently used to validate the predicted neoantigens, although such assays are still conducted in a low-throughput fashion (Cole, 2005).
Second, TIL deconvolution methods such as CIBERSORT and TIMER use reference expression profiling data on sorted immune components from peripheral blood. These methods could be combined with NanoString-based measures of immune marker genes in addition to bulk tissue RNA-seq data for inexpensive profiling of large archival tumor cohorts. However, gene expression in tumor-resident immune cells might differ significantly from that in peripheral blood, which could influence the accuracy of these inference methods. Recent developments in single-cell analysis techniques, such as CyTOF (Newell et al., 2012) and single-cell RNA-seq (Klein et al., 2015, Macosko et al., 2015, Tirosh et al., 2016), might offer more quantitative alternatives. However, for very detailed TIL deconvolution on large sample cohorts, the required starting tumor material and cost of single-cell experiments need to decrease significantly for widespread use.
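The general idea of reference-based deconvolution can be illustrated with a small non-negative least squares fit: a bulk expression vector is modeled as a non-negative mixture of reference profiles from sorted cell types. This is a generic sketch solved by projected gradient descent and is not the algorithm of CIBERSORT or TIMER; the signature matrix, marker genes, and mixture values are invented for the example.

```python
def deconvolve(signature, mixture, iterations=2000):
    """Estimate cell-type fractions f >= 0 minimizing ||signature @ f - mixture||^2.

    signature: list of rows, one per gene, each with one value per cell type.
    mixture:   bulk expression value for each gene.
    """
    n_genes, n_types = len(signature), len(signature[0])
    # The sum of squared entries bounds the largest eigenvalue of S^T S,
    # so its inverse is a safe (if conservative) gradient step size.
    step = 1.0 / sum(v * v for row in signature for v in row)
    weights = [0.0] * n_types
    for _ in range(iterations):
        residual = [
            sum(signature[g][t] * weights[t] for t in range(n_types)) - mixture[g]
            for g in range(n_genes)
        ]
        gradient = [
            sum(signature[g][t] * residual[g] for g in range(n_genes))
            for t in range(n_types)
        ]
        # Gradient step followed by projection onto the non-negative orthant.
        weights = [max(0.0, w - step * d) for w, d in zip(weights, gradient)]
    total = sum(weights)
    return [w / total for w in weights] if total > 0 else weights


# Hypothetical reference profiles (rows: marker genes; columns: cell types).
cell_types = ["CD8 T cell", "B cell", "macrophage"]
signature = [
    [10.0, 1.0, 0.0],  # CD8A
    [0.0, 9.0, 1.0],   # CD19
    [1.0, 0.0, 8.0],   # CD68
    [5.0, 5.0, 5.0],   # PTPRC
]
mixture = [5.3, 2.9, 2.1, 5.0]  # bulk sample built from fractions 0.5/0.3/0.2

for name, frac in zip(cell_types, deconvolve(signature, mixture)):
    print(f"{name}: {frac:.3f}")
```

If the reference profiles from blood do not match expression in tumor-resident cells, the fitted fractions will be biased, which is the limitation raised in the preceding paragraph.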
Third, monitoring an individual patient’s immune repertoire in peripheral blood or tumors provides insights into their immune health as well as their response to allergens, vaccines or therapies (Robins, 2013). However, there are still many challenges ahead, such as how to identify the specific TCR / BCR that recognizes each specific somatic mutation and how accurate the immune repertoire is at predicting patient response to immunotherapy. Other challenges include how to robustly estimate the total immune repertoire in different samples from the same individual, normalize bias from minor immune events, and distinguish immune repertoire signals from normal versus pathogenic immune events.
Last but not least, predicting response to immunotherapies, including tumor killing effects and autoimmune side effects, is still an open question. So far, higher T cell infiltration (Taube et al., 2012, Tumeh et al., 2014), higher PD-1 or PD-L1/L2 expression (Garon et al., 2015, Herbst et al., 2014, Taube et al., 2012), higher neoantigen load from BRCA or somatic mutations in DNA repair pathway genes (Hugo et al., 2016, Snyder et al., 2014, Van Allen et al., 2015) or from microsatellite instability (Le et al., 2015), higher peripheral baseline TCR diversity (Postow et al., 2015), lower tumor-infiltrating TCR diversity (Tumeh et al., 2014), and lack of mutations in interferon gamma (IFNG) (Gao et al., 2016), beta-2-microglobulin (B2M) (Zaretsky et al., 2016), or JAK1/JAK2 (Zaretsky et al., 2016) have been associated with better response to immunotherapies in various cancer types. A comprehensive model that integrates all of these factors to accurately predict patient response to immunotherapy is still lacking, and likely will require much more data to train and refine. In addition, methods to predict the optimal combination of immunotherapies, or of immunotherapies with other targeted therapies, chemotherapy, or radiation therapy, for individual patients still await development. Despite all the aforementioned challenges, the exciting results obtained to date from cancer immunotherapies will continue to motivate the biomedical research community to overcome these challenges and explore new frontiers.