MARRVEL: Integration of Human and Model Organism Genetic Resources to Facilitate Functional Annotation of the Human Genome

One major challenge encountered with interpreting human genetic variants is the limited understanding of the functional impact of genetic alterations on biological processes. Furthermore, there remains an unmet demand for an efficient survey of the wealth of information on human homologs in model organisms across numerous databases. To efficiently assess the large volume of publicly available information, it is important to provide a concise summary of the most relevant information in a rapid user-friendly format. To this end, we created MARRVEL (model organism aggregated resources for rare variant exploration). MARRVEL is a publicly available website that integrates information from six human genetic databases and seven model organism databases. For any given variant or gene, MARRVEL displays information from OMIM, ExAC, ClinVar, Geno2MP, DGV, and DECIPHER. Importantly, it curates model organism-specific databases to concurrently display a concise summary regarding the human gene homologs in budding and fission yeast, worm, fly, fish, mouse, and rat on a single webpage. Experiment-based information on tissue expression, protein subcellular localization, biological process, and molecular function for the human gene and its homologs in the seven model organisms is arranged into a concise output. Hence, rather than visiting multiple separate databases for variant and gene analysis, users can obtain important information by searching once through MARRVEL. Altogether, MARRVEL dramatically improves the efficiency and accessibility of data collection and, by integrating 18 million records from public databases across disciplines, facilitates the analysis of human genes and variants for clinical diagnosis and basic research.

Introduction

One major challenge encountered with interpreting human genetic variants is the limited understanding of the functional impact of genetic alterations on biological processes. Traditional variant interpretation methodology relies on restricting clinical interpretation to known Mendelian diseases and employing in silico prediction algorithms. For most genes, few variants have a reliable and validated clinical significance designation, which makes it difficult to differentiate between benign and pathogenic variants or to determine whether variants in a candidate gene are causative.1 The wealth of biological information available across multiple model organisms, such as known molecular functions of the candidate gene, could aid in the interpretation of variants. However, there are major barriers to searching for biological data in specific model organism databases: evaluating orthologs is intricate, and the seven websites differ in their organization, their approaches, and their use of gene or protein identifiers (Figure S1). This limits the efficiency of incorporating known model organism data into the analysis of candidate genes.
Therefore, there is an unmet demand for resources to facilitate rapid curation of available human gene and variant information, to determine conservation, and to gather relevant information on homologous genes in model organisms. Furthermore, such data compilation is relevant to evaluating the consequences of human genetic variation in model organisms.2 To provide a concise and user-friendly curation of pertinent and publicly available knowledge, we created MARRVEL (model organism aggregated resources for rare variant exploration). MARRVEL is an open-access resource that synthesizes genetic and model organism information from several public databases into a single user-friendly website (Figure 1).

Figure 1. Overall Structure of MARRVEL

The major impetus for developing MARRVEL arose from growing efforts to analyze the potential pathogenicity of genetic alterations in genes that are either not previously associated with human genetic disease or associated with different clinical features. A wide range of efforts for the discovery of disease-causing variants include the research consortiums for rare (e.g., Center for Mendelian Genomics3 and Undiagnosed Diseases Network4) or common (CHARGE consortium3) diseases, clinical genetics laboratories, large-scale sequencing projects,5, 6 and collaborations between human geneticists and model organism researchers.7 Together, these research efforts generate growing numbers of large human genomic datasets that require the development of resources and tools to facilitate efficient data analysis.
For example, the Undiagnosed Diseases Network4 combines the expertise of clinicians, sequencing centers (e.g., whole-exome, whole-genome, RNA-seq), metabolomics laboratories, and model organism scientists (fruit fly, zebrafish, and mouse) to diagnose individuals with rare disorders that eluded traditional diagnostic modalities. Many of these cases are predicted to have a primary genetic cause but the suspected causative variant may not be in disease-associated genes. When candidate pathogenic gene variants are identified, model organism data available for predicted orthologs of the human gene are an invaluable resource for interpreting the biological significance of the genetic alterations. However, this model organism-based resource is underutilized due to limited accessibility by non-model organism researchers. Currently, researchers need to visit and navigate separate model organism-specific databases (e.g., FlyBase,8 MGI,9 ZFIN10) that utilize distinct genotype and phenotype nomenclature as well as data organization. Moreover, in the study of genes or variants linked to human diseases, model organisms provide powerful platforms for mechanistic studies. Hence, a user-friendly open-access web-based resource to curate and synthesize current knowledge and resources from model organisms and human genomics databases is invaluable.11, 12, 13

Material and Methods
Human Genetics Databases
Human genetics data are extracted from Online Mendelian Inheritance in Man (OMIM),14 Exome Aggregation Consortium (ExAC),15 Genotype to Mendelian Phenotype (Geno2MP), ClinVar,16 Database of Genomic Variants (DGV),17 and DECIPHER (database of genomic variation and phenotype in humans using Ensembl resources).18
We display the human gene description, gene-phenotype relationships, and reported alleles from OMIM. Next, the control population gene summary from the ExAC15 database is displayed. ExAC is a public collection of more than 60,000 exomes that have been selected against individuals with severe early-onset Mendelian phenotypes.15 When MARRVEL is applied primarily to early-onset pediatric phenotypes and used to evaluate candidate genes for Mendelian disease, the ExAC data can be considered a “control” dataset. We refer to these data as “control” throughout the paper, although these samples should not be treated as controls for other phenotypes, such as adult neurodegenerative phenotypes. Within the control population gene summary, we include the pLI (the probability of being loss-of-function [LoF] intolerant) score of a gene, which assesses the probability that a gene is extremely intolerant to loss-of-function variants (nonsense, splice acceptor, and splice donor variants) caused by single-nucleotide changes.15
We next display data from the Geno2MP database. Geno2MP is a database sponsored by the University of Washington Center for Mendelian Genomics that displays variants from Mendelian gene discovery projects and provides phenotype information for individuals with specific genotypes, including affected and unaffected family members.
Next, we extract data from ClinVar,16 which contains more than 255,000 unique variants annotated with clinical significance and review status (i.e., level of evidence). When a user searches for a gene and variant, MARRVEL displays all ClinVar variants reported in the gene of interest, summarizes the number of variants in each category of clinical significance, and highlights any variant(s) that match the location of the variant of interest. We provide both a high-level summary of the variants in terms of their reported clinical significance and a table with details for each reported variant. In this table, any alleles that overlap with the location of the variant of interest are highlighted in blue.
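The ClinVar summary described above can be illustrated with a minimal Python sketch. It is not MARRVEL's implementation: the `ClinVarRecord` type, its field names, and the function `summarize_clinvar` are hypothetical, and records are assumed to carry 1-based inclusive coordinates and a single clinical significance label.

```python
from collections import Counter
from typing import Iterable, List, NamedTuple, Tuple


class ClinVarRecord(NamedTuple):
    """Hypothetical, simplified view of one ClinVar variant in a gene."""
    chrom: str
    start: int         # 1-based, inclusive
    stop: int          # 1-based, inclusive
    significance: str  # e.g. "Pathogenic", "Benign"


def summarize_clinvar(
    records: Iterable[ClinVarRecord], chrom: str, pos: int
) -> Tuple[Counter, List[ClinVarRecord]]:
    """Count variants per clinical significance category and collect the
    records whose span covers the queried position (the ones to highlight)."""
    records = list(records)  # allow two passes over any iterable
    counts = Counter(r.significance for r in records)
    highlighted = [
        r for r in records
        if r.chrom == chrom and r.start <= pos <= r.stop
    ]
    return counts, highlighted
```

The counts correspond to the high-level summary, and the second return value corresponds to the rows that would be highlighted in the detail table.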
We then display data from the Database of Genomic Variants (DGV),17 which contains a large collection of structural variants from more than 54,000 individuals. The database includes samples from individuals who were reportedly healthy at the time of ascertainment, drawn from up to 72 different studies. Using the DGV database, we report all copy-number variants (CNVs) that overlap the input gene. If a CNV containing the gene of interest exists, we display the frequency, type of CNV, and publications associated with the CNV.
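Reporting CNVs that overlap or contain a gene reduces to an interval comparison. The following Python sketch shows one way to express it, assuming 1-based closed intervals on a shared coordinate system; the `Interval` type and the function names are hypothetical and are not taken from MARRVEL or DGV.

```python
from typing import List, NamedTuple, Tuple


class Interval(NamedTuple):
    """Hypothetical genomic interval with 1-based, inclusive coordinates."""
    chrom: str
    start: int
    end: int


def overlaps(a: Interval, b: Interval) -> bool:
    """True if the two intervals share at least one base."""
    return a.chrom == b.chrom and a.start <= b.end and b.start <= a.end


def contains(outer: Interval, inner: Interval) -> bool:
    """True if `outer` fully spans `inner`."""
    return (
        outer.chrom == inner.chrom
        and outer.start <= inner.start
        and inner.end <= outer.end
    )


def cnvs_for_gene(gene: Interval, cnvs: List[Interval]) -> List[Tuple[Interval, bool]]:
    """Return every CNV overlapping the gene, paired with a flag that is
    True when the CNV contains the whole gene."""
    return [(cnv, contains(cnv, gene)) for cnv in cnvs if overlaps(cnv, gene)]
```

The flag separates CNVs that encompass the entire gene from those that only partially overlap it.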
Finally, based on the variant coordinate, we display additional CNV information from the DECIPHER18 database, which includes common variants from the control population. Due to data display restrictions, we are able to provide users only with control population data from DECIPHER.
Gene Function Databases
Biological and genetic features of human genes and their putative orthologous genes, including tissue expression pattern and Gene Ontology (GO) terms, are extracted from the following model organism databases: Saccharomyces Genome Database (SGD)19 for the budding yeast Saccharomyces cerevisiae, PomBase20 for the fission yeast Schizosaccharomyces pombe, WormBase21 for the nematode worm Caenorhabditis elegans, FlyBase8 for the fruit fly Drosophila melanogaster, ZFIN10 for the zebrafish Danio rerio, Mouse Genome Informatics (MGI)9 for mouse Mus musculus, and Rat Genome Database22 and Bodymap23 for rat Rattus norvegicus. For humans, we extract GO terms from QuickGO24 and tissue expression data from GTEx25 and Protein Atlas.26 To identify the putative orthologs of the human gene, we incorporate information from DIOPT (Drosophila RNAi Screening Center [DRSC] Integrative Ortholog Prediction Tool),27 an online tool integrating 14 ortholog prediction tools to provide a homology score for each predicted ortholog pair. Additionally, DIOPT is used to display a multiple protein alignment that is generated with MAFFT and human gene functional domains.27
Data Processing
MARRVEL search allows three types of inputs: a single HUGO gene symbol,28 a single human variant, or a combination of both. The human variant input can be in a format conforming to HGVS nomenclature29 or in the genomic variant format ([chromosome number]:[genomic coordinate] [Reference nucleotide]>[Alternate nucleotide]), for example 6:99365567T>C. If the variant is input in HGVS nomenclature format, then the Mutalyzer Position Converter tool30 is used to transform the variant input into a genomic coordinate, as variants stored in our database follow the genomic variant format.
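As an illustration of the genomic variant format, the Python sketch below parses inputs such as 6:99365567T>C into their components. The regular expression, the accepted chromosome names, the optional `chr` prefix, and the tolerance for whitespace are assumptions made for this example; they are not MARRVEL's actual validation rules. Inputs that do not match (for instance, HGVS strings) return `None` and would need conversion by a separate tool.

```python
import re
from typing import NamedTuple, Optional


class GenomicVariant(NamedTuple):
    chrom: str
    pos: int
    ref: str
    alt: str


# Assumed pattern: chromosome, colon, coordinate, optional space, REF>ALT.
_VARIANT_RE = re.compile(
    r"^(?:chr)?(?P<chrom>[0-9]{1,2}|X|Y|MT?)\s*:\s*(?P<pos>[0-9]+)\s*"
    r"(?P<ref>[ACGT]+)\s*>\s*(?P<alt>[ACGT]+)$",
    re.IGNORECASE,
)


def parse_genomic_variant(text: str) -> Optional[GenomicVariant]:
    """Parse '[chromosome]:[coordinate][ref]>[alt]'; return None if the
    text is not in this format."""
    match = _VARIANT_RE.match(text.strip())
    if match is None:
        return None
    return GenomicVariant(
        chrom=match["chrom"].upper(),
        pos=int(match["pos"]),
        ref=match["ref"].upper(),
        alt=match["alt"].upper(),
    )
```

Calling `parse_genomic_variant("6:99365567T>C")` yields chromosome `"6"`, position `99365567`, reference `"T"`, and alternate `"C"`.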
If the input to MARRVEL includes both a variant and a gene symbol, data from OMIM14 are retrieved using the OMIM API and the gene summary table is extracted from the ExAC website in real time. Variant data from the ExAC15 and Geno2MP databases are retrieved from our MySQL31 database as explained in the following section. For ClinVar,16 MARRVEL searches by the gene symbol, reports all alleles that overlap with the input gene, and provides a summary of the clinical significance of these alleles. MARRVEL displays DGV17 copy-number variants based on the genes that are encompassed by the copy-number variants. Variant data from DECIPHER18 are retrieved from our MySQL database based on the chromosomal location.
If the input includes only a gene symbol, MARRVEL retrieves the gene summary table from the ExAC website. For Geno2MP, it shows all variants in the database that overlap the gene, together with their heterozygote counts, homozygote counts, and the sum of the two. For DGV, it shows all CNV regions overlapping the gene. DECIPHER data are not retrieved because DECIPHER does not report data associated with genes.
When the input includes only a variant, MARRVEL first searches the ExAC database to retrieve the variant information. It then shows gene-related information, such as OMIM data, orthologs and their functions, and the protein alignment, for the first gene that the ExAC database matches to the variant.
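The three input cases can be summarized as a small dispatch function. The Python sketch below lists only the retrieval steps named in the text for each case; the function name `plan_queries` and the step descriptions are illustrative, and the actual system may query additional sources that are not described here.

```python
from typing import List, Optional


def plan_queries(gene: Optional[str] = None, variant: Optional[str] = None) -> List[str]:
    """Return the retrieval steps described for each kind of input."""
    if gene and variant:
        return [
            "OMIM: gene data via the OMIM API",
            "ExAC: gene summary table from the ExAC website",
            "ExAC: variant data from the local MySQL database",
            "Geno2MP: variant data from the local MySQL database",
            "ClinVar: all alleles overlapping the gene, with significance summary",
            "DGV: copy-number variants encompassing the gene",
            "DECIPHER: variant data from the local MySQL database, by chromosomal location",
        ]
    if gene:
        # DECIPHER is skipped for gene-only searches.
        return [
            "ExAC: gene summary table from the ExAC website",
            "Geno2MP: all variants overlapping the gene, with het/hom counts",
            "DGV: all CNV regions overlapping the gene",
        ]
    if variant:
        return [
            "ExAC: look up the variant and take the first matching gene",
            "Gene-level data for that gene: OMIM, orthologs and functions, protein alignment",
        ]
    raise ValueError("provide a gene symbol, a variant, or both")
```

Writing the cases out this way makes the asymmetry explicit: DECIPHER is consulted only when a variant coordinate accompanies the gene, and a variant-only search depends on ExAC to resolve the gene.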
For any combination of gene and variant input, the gene function table includes the following columns: the orthologous genes column, the DIOPT27 score column, the tissue expression column, and the associated GO terms’ columns. The orthologous genes column displays the putative orthologs predicted using DIOPT with a link to each organism database as well as a PubMed link. The PubMed link is generated from the NCBI
