Email updates

Keep up to date with the latest news and content from Journal of Clinical Bioinformatics and BioMed Central.

Open Access Highly Accessed Database

Huvariome: a web server resource of whole genome next-generation sequencing allelic frequencies to aid in pathological candidate gene selection

Andrew Stubbs1*, Elizabeth A McClellan1, Sebastiaan Horsman1, Saskia D Hiltemann12, Ivo Palli1, Stephan Nouwens1, Anton HJ Koning1, Frits Hoogland3, Joke Reumers4, Daphne Heijsman1, Sigrid Swagemakers1, Andreas Kremer1, Jules Meijerink5, Diether Lambrechts4 and Peter J van der Spek1

Author Affiliations

1 Department of Bioinformatics, Erasmus University Medical Center, Molewaterplein 50, Rotterdam, The Netherlands

2 Department of Urology, Erasmus University Medical Center, Molewaterplein 50, Rotterdam, The Netherlands

3 VX Company IT Services, Baarnsche dijk 8, 3741, LR, Baarn, The Netherlands

4 Vesalius Research Center, VIB and University of Leuven, Gasthuisberg Herestraat 49, 3000, Leuven, Belgium

5 Department of Pediatric Oncology, Sophia Children’s Hospital, Erasmus University Medical Center, Molewaterplein 50, Rotterdam, The Netherlands

For all author emails, please log on.

Journal of Clinical Bioinformatics 2012, 2:19  doi:10.1186/2043-9113-2-19

The electronic version of this article is the complete one and can be found online at: http://www.jclinbioinformatics.com/content/2/1/19


Received:29 May 2012
Accepted:16 October 2012
Published:19 November 2012

© 2012 Stubbs et al.; licensee BioMed Central Ltd.

This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Abstract

Background

Next generation sequencing provides clinical research scientists with direct read out of innumerable variants, including personal, pathological and common benign variants. The aim of resequencing studies is to determine the candidate pathogenic variants from individual genomes, or from family-based or tumor/normal genome comparisons. Whilst the use of appropriate controls within the experimental design will minimize the number of false positive variations selected, this number can be reduced further with the use of high quality whole genome reference data to minimize false positives variants prior to candidate gene selection. In addition the use of platform related sequencing error models can help in the recovery of ambiguous genotypes from lower coverage data.

Description

We have developed a whole genome database of human genetic variations, Huvariome, determined by whole genome deep sequencing data with high coverage and low error rates. The database was designed to be sequencing technology independent but is currently populated with 165 individual whole genomes consisting of small pedigrees and matched tumor/normal samples sequenced with the Complete Genomics sequencing platform. Common variants have been determined for a Benelux population cohort and represented as genotypes alongside the results of two sets of control data (73 of the 165 genomes), Huvariome Core which comprises 31 healthy individuals from the Benelux region, and Diversity Panel consisting of 46 healthy individuals representing 10 different populations and 21 samples in three Pedigrees. Users can query the database by gene or position via a web interface and the results are displayed as the frequency of the variations as detected in the datasets. We demonstrate that Huvariome can provide accurate reference allele frequencies to disambiguate sequencing inconsistencies produced in resequencing experiments. Huvariome has been used to support the selection of candidate cardiomyopathy related genes which have a homozygous genotype in the reference cohorts. This database allows the users to see which selected variants are common variants (> 5% minor allele frequency) in the Huvariome core samples, thus aiding in the selection of potentially pathogenic variants by filtering out common variants that are not listed in one of the other public genomic variation databases. The no-call rate and the accuracy of allele calling in Huvariome provides the user with the possibility of identifying platform dependent errors associated with specific regions of the human genome.

Conclusion

Huvariome is a simple to use resource for validation of resequencing results obtained by NGS experiments. The high sequence coverage and low error rates provide scientists with the ability to remove false positive results from pedigree studies. Results are returned via a web interface that displays location-based genetic variation frequency, impact on protein function, association with known genetic variations and a quality score of the variation base derived from Huvariome Core and the Diversity Panel data. These results may be used to identify and prioritize rare variants that, for example, might be disease relevant. In testing the accuracy of the Huvariome database, alleles of a selection of ambiguously called coding single nucleotide variants were successfully predicted in all cases. Data protection of individuals is ensured by restricted access to patient derived genomes from the host institution which is relevant for future molecular diagnostics.

Keywords:
Medical genetics; Medical genomics; Whole genome sequencing; Allele frequency; Cardiomyopathy

Background

Next-generation sequencing (NGS) provides scientists with the ability to screen for genetic variants at a higher density than genome wide screens with array based platforms [1]. The choice of sequencing the whole genome or only the exome, the latter comprising approximately 1% of the entire genome, depends on the type of research question to be addressed. Exome sequencing delivers information on the coding regions of the genome [2] and has been successfully applied, and continues to be applied, to determine the causative genetic event in Mendelian inherited diseases [3]. Whole genome sequencing provides scientists with an unbiased view of genetic variation of the genome including promoters, intronic splicing regulators, regulatory regions (enhancers, silencers), non-coding RNAs (microRNAs, snoRNAs, lincRNA) and structural variation including copy number [4]. There are 3 to 4 million SNVs per human genome of which approximately 10% are novel variants, some of which are false positives and may confound the selection of disease causing variants [4]. Variants detected in other genomes are less likely to be artifacts; hence the use of databases to store high quality personal variants will improve the detection of pathogenic variants. The advent of whole genome and exome sequencing tests, replacing the single variant assay as clinical genetics tests and for cancer diagnosis based on reduced costs, will require access to large scale central databases to distinguish clinically relevant variations from neutral polymorphisms [5]. The central requirement for implementing NGS into clinical practice is to allow simple and secure access to databases containing curated knowledge of variants scored as clinically relevant pathogenic mutations with standardized clinical reporting. Several existing projects that support the detection of common deleterious variants in the population include Online Mendelian Inheritance in Man [6], dbSNP [7], Database of Genomic Variants [8] and Human Gene Mutation Database [9]. SeattleSeq Annotation [10] and ENGINES [11] are both web services for easy access to the genotypes stored in dbSNP, and for annotation of variants for both hg18 and hg19 genome builds. NGS catalog [12] which is built on SeattleSeq provides scientists with an integrated view of public literature derived variation results, summarized by sequencing platform type (e.g. RNAseq), technology platform (e.g. HiSeq2000) and linked to the publication from which the results were derived. ANNOVAR [13], a command line tool, is popular with bioinformaticians and is used to annotate experimentally derived variants with common and rare variants derived from the popular sources (e.g. dbSNP), the 1000 genomes project [14], and Exome Variant Server [15], and to provide functional impact where appropriate in coding regions. However, to the medical research scientist the majority of these results have been made available in the web application SNPnexus [16], which delivers functional annotation of novel and known variants and improved access via positional mapping through contig or clone coordinates. Huvariome provides the user with whole genome allele frequencies, their associated quality score (detection and chance to detect the variant), gene based ranking and integrated access to publicly available data for the detection of common, rare and deleterious variants. The functional impact of variants in Huvariome is provided by the Complete Genomics (CG) annotation pipeline [17]. The novelty of Huvariome is that it provides rapid and simple access to SNV, short indels, and de novo assembled regions of the genome at any position in the genome with allelic frequencies and associated error for position in the human genome. Huvariome also delivers common variants from a small cohort of Benelux genomes from unrelated individuals with no disease association. In light of these developments we have developed a simple application, Huvariome, which goes beyond the current platforms with similar goals [10,11] to enable efficient allelic frequencies searching in both public and private genomes for clinical research scientists.

Construction and content

Subjects

Standard for human subject and data protection

All records for the biological specimens are maintained within the hospital health record management system and an anonymized sample code was supplied with the DNA and used to map the returned sequence data to the appropriate sample information stored in the database. All subjects whose whole genome sequence (WGS) results are stored within the database were approved of by the Institutional Review Board of the Erasmus MC, Rotterdam, the Netherlands (MEC-2011-253, date of approval February 27th, 2011) in which patients gave written informed consent according to institutional and national guidelines. Formalized meta-data relating to the individual from whom the genome was sequenced, but with no name or hospital identification code, is stored (Table 1), thus preventing the individual from being identified from the database. In addition the variants within a genome for the samples in Huvariome which are not publicly available are not presented on the public user interface to ensure that these individuals cannot be identified by their genomic variation.

Table 1. Standardized parameters for meta-data

Next generation sequencing

Paired-end sequencing for all DNA samples was performed with the Complete Genomics service provider using a proprietary sequencing-by-ligation technology [17]. Complete Genomics also performed primary data analysis, including image analysis, base calling, alignment and variant calling. Reads were mapped to the NCBI Build 36.1 reference genome using a fast algorithm and initial mappings were expanded by local de novo assembly on all regions of the genome that contain single nucleotide variations (SNVs) relative to the reference genome [17]. Sequencing reads were mapped to the reference genome with versions 1.1.0 to 1.12.0 of the Complete Genomics Analysis (CGA) pipeline from which the derived variant files include the SNVs, inserts and deletions (indels), and substitutions (subs) with confidence scores and explicit differentiation of “no-variant” from “no-call”. Currently these data are shipped as bzip format on 1.5 Tb discs and uploaded to the Department of Bioinformatics IBM server. The resultant genomes are at least 40X (~120 GB) mapped coverage with accurate calls >95% for the genomes.

Development of the database

Informatics infrastructure

Huvariome is developed with an Oracle 11i 64-bit relational database (Enterprise Edition Release 11.2.0.1.0), the code developed in Perl 5.8.8 and the graphical user interface developed with PHP 5.3.3, and is available on an Apache 2.2.3 server. The database is designed to store all variation types detected by Complete Genomics and are supplied in the variation results file (“VAR” file) which includes SNVs, indels and subs up to ~100bp as defined by the Complete Genomics Release Notes Assembly Software v2.0 [18]. Those variations that occur in a gene (5′UTR, 3′UTR, exon and intron) are supplied with additional annotation describing the associated gene in the gene file. The VAR and gene files are loaded into the Oracle database using a custom loader that was developed to provide quality assurance upon upload and to be easily adapted to accommodate changes in the annotation pipeline of Complete Genomics. The database stores variants relative to a reference genome such that only differences to the reference are listed, allowing for a substantial reduction of data. Each genome is annotated with a minimal set of required information detailing the individual sample and the relationship with other samples (e.g. tumor versus paired normal and parents versus children). This minimal information of the genome source ensures that the propensity of variation to appear in a subset of the data can be traced and allows users to perform meta-analyses across the whole database to rapidly identify cancer associated and family based variants. Currently all genomes are mapped to NCBI build 36.3 and annotated using RefSeq data for gene and protein annotations, dbSNP version 130 [7], DGV [8] for known variations and GenomeTraxTM (Biobase, Germany) for multiple annotations including HGMD Professional.

Development of the schema

Database design was developed to reflect, where possible, the original data tables supplied by Complete Genomics [18] to ensure the ability to store genomes as the data output from the CGA pipeline advances with richer annotation and improved quality measurements included with the supplied data. Variation data, SNVs, indels and subs, supplied in the VAR files are stored as alleles in the var Tables 1 and 2 (Figure 1). Annotation can be scaled to include any annotation type or source including the reference genome associated with the original mapped reads from the primary sequencing files. Annotation is connected to allow for fast updates and migration, e.g. to NCBI build 37 and meta-data concerning the individual genomes, and the minimal information defining a phenotype is stored in a sample table (Figure 1).

Table 2. Summary of huvariome genomes

thumbnailFigure 1. Huvariome database schema. The part of the database schema representing the relationship between the core variation and annotation tables based on the content delivered by Complete Genomics.

User interface

Data in Huvariome can be queried and retrieved through a web interface that allows users to search the datasets for a specific gene or request information for a genomic region by means of a list of positions. The user inputs variations as a tab or space separated list of variation positions in the format chromosome, begin, end (optional) using a zero based format [18] (Figure 2). After the query is completed, the results, e.g. a table of allelic frequencies ordered by position per genetic group in the Diversity Panel, are returned on a webpage via an email containing the link to the table. Results for Huvariome Core (HVC) include the SNV quality measures derived from HVC and the Diversity Panel and are annotated with functional impact derived from Complete Genomics annotation. The results are additionally annotated with existing genetic variations, regulation features and disease association, as mentioned above. Registered users can submit a list of genomic positions to obtain allele frequencies and corresponding annotation for each of the HVC, the Diversity Panel, a Disease Cohort (genomes with associated disease phenotype), and a Pedigree Cohort (genomes of related individuals). Guest users can submit a list of genomic positions to receive the allele frequency for up to fifty nucleotide positions from the Diversity Panel with the associated no-call rate (see section Allele No-call rate). Common annotation, symbol, transcript identifier, the impact on the gene (gene component), known genomic variations (DGV), and regions of common sequences in multiple species (VISTA) are provided for each variation (Figure 3). Variations specific to a population, such as with the Diversity and Pedigree Panels, are returned as population specific variations which include the impact of variation on the coding sequence and associated dbSNP variants (Figure 3). Users can submit a list of genomic positions to receive the allele frequency for up to one hundred nucleotide positions and corresponding annotation for the Diversity Panel, the Pedigree Cohorts, the associated common variation tag (without frequencies from the HVC Panel) and the associated no-call rate. This functionality is accomplished by storing each observed variation indexed by both the library to which they belong and the location of origin in the genome to which the sequences were aligned. Registered users for can access their own genomes for study from the same access page.

thumbnailFigure 2. Huvariome database access screen. Users can access the system from a central page (http://huvariome.erasmusmc.nl webcite) in which a genome is chosen and variants to be searched are uploaded as tab or space delimited search requests such that there is one variation per line. A region is searched by including an end position with the chromosome and start position.

thumbnailFigure 3. Huvariome genotype frequency page. The output page giving the distribution of allelic variations in the Diversity Panel of genomes for the European (CEU) and African (ASW) populations. In this example the first ten positions that were queried from the cardiomyopathy data set from Meder et al. 2011 [25] are shown. Each variant is returned per row with the frequency of each genotype highlighted by the size of the associated blue bar. Abbreviations: chromosome (chr); 0-based location (pos); reference allele (ref); variant alleles 1 and 2 (a1, a2); indels (ins, del); substitutions (sub); no-call (unkn.); no-call rate (nc rate); external reference (xref); predicted amino acid change (impact); gene symbol (gsym); gene component (comp), e.g. exon, intron; and variant from database of genomic variants (dgv).

Utility

Content and individual characteristics

The Huvariome database currently contains variations from 165 deep sequenced human genomes, at the time of writing, obtained with WGS, including tumor and associated normal genomes, small pedigrees and 42 individuals representing 10 different populations. The 69 individual freely available genomes [19] are non-diseased samples which include a Yoruban trio, a CEPH/Utah pedigree of 17 family members, a Puerto Rican trio, and a diversity panel representing nine different populations, from the Coriell Institute for Medical Research. Samples were sequenced to an average genome-wide coverage of about 80X (range of 51X to 89X). Huvariome includes the additional control dataset, beyond the Diversity and Pedigree Panel, called Huvariome Core (HVC), which is a subset of 31 genomes from unrelated individuals with no detectable disease association. The average coverage for HVC genomes is 70X with no-call rate of 3.5% and 3.25 million SNVs per individual genome (Table 2). Additional quality control analysis includes the calculation of the transition (Ti) to transversion (Tv) and heterozygous (hetero) to homozygous (homo) SNV ratios for several genomics features including, coding, intergenic, intronic, untranslated region, splice sites, transcription start site and the impact on the resultant protein as synonymous, missense and nonsense changes (Table 3). The Ti/Tv ratios for synonymous, missense and nonsense variants are 5.20, 2.14 and 1.93, respectively, in the 31 genomes which are in agreement with values of 5.6, 2.31 and 2.13 determined by Tennessen et al. 2012 [20]. The coding and intergenic Ti/Tv for the 31 genomes are 2.99 and 2.07, which are consistent with those published by Guo et al. 2012 [21] of 2.81 and 2.27 for exonic and non-exonic regions. The het/homo ratio for the same regions for HVC, ranges from 1.40-0.97 (Table 3) and are consistent with previous studies [20].

Table 3. Quality measures for huvariome genomes

Allele no-call rate

We used the database to determine the no-call rate of allele calling at all 3 billion positions in the human genome. The control genomes are used to calculate a SNV no-call rate (nc rate) at the base pair level:

<a onClick="popup('http://www.jclinbioinformatics.com/content/2/1/19/mathml/M1','MathML',630,470);return false;" target="_blank" href="http://www.jclinbioinformatics.com/content/2/1/19/mathml/M1">View MathML</a>

where n is the number of no-calls (unidentifiable alleles at the position) and t is the total number of genomes. The fraction n/t is the proportion of alleles that are not able to be sequenced at a given base, and it is subtracted from 1 so that the higher the nc rate, the more plausible the base is called the correct nucleotide. In other words, this value indicates how likely the base is able to be sequenced and can be viewed as a measure of reliability for the individual base (Figure 3).

Common variants

The minor allele frequency (MAF) for each SNV in HVC of 31 genomes is calculated as the smaller of the number of occurrences of a reference allele or its variant allele divided by the number of samples (n=31) as outline by Zhu et al. 2011 [22]. A SNV with a MAF equal to zero indicates the genotype is the same for all samples and is subsequently removed whilst the remaining SNVs are placed into one of 31 MAF bins. The AWclust package [23] from Bioconductor, which is not included in Huvariome, was used to determine the similarity of the all genomes to the HVC samples using a modified input to match the application. The HVC samples (red) clearly segregate with the CEU and apart from the African and Asian populations of the Diversity Panel (Figure 4). The % of the number of SNVs with a MAF ≤ 5% or > 5% are 91.8% and 8.2% for the Exome Project and 87.4% and 12.4% for the 31 genomes used in Huvariome, providing evidence of consistency between these CEU study cohorts.

thumbnailFigure 4. Similarity of HVC to diversity panel. The output from AWClust [23] is shown in a multi-dimensional scaling plot. The genomes of HVC samples (HuVar), the Chinese and Japanese (CHB, JPT), the African (MKK, ASW, YRI, LWK), the Indian and Mexican (MXL, GIH) and the European (CEU) populations are shown in this plot. The HVC (HuVar =red) overlap with the CEU population, indicating the samples are more similar to the European samples than the others.

Case study 1: confirmation of polymorphic variation

To identify new polymorphic variants and to demonstrate how Huvariome can provide an accurate prediction of rare SNVs we selected a set of 26 non-synonymous coding SNVs (cSNVs) which were found in one of the eight genomes sequenced in the study by Ng et al. 2009 [24]. Twenty six cSNVs which were called as potentially damaging but have ambiguous genotype information at these positions (Table 4, [24]) have been analyzed using Huvariome reference genomes, HVC and Diversity Panel (Table 2). The genotype for the eight HapMap samples (International HapMap Consortium, 2003) used in the study are displayed in the columns beginning with “NA” and the last three columns represent the genotypes predicted for each group from the two reference cohorts in Huvariome (Table 4). Seven of these 26 cSNVs genotypes are homozygous reference, as determined from Huvariome and heterozygous in the remaining 19 cases; the correct allele has been called in all cases (Table 4). This example demonstrates the value of a high quality reference cohort to help disambiguate potential false positive calls from NGS studies.

Table 4. Confirmation of genotypes from 26 ambiguous variations calls

Case study 2: cardiomyopathy genes

In this example we use the recently described list of resequenced candidate cardiomyopathy associated genes [25] to determine the common variants and the non-polymorphic variants, and therefore candidate cardiomyopathy genes, in our cohorts. A list of 38 randomly selected variants and 6 variations associated with DCM or HCM which were Sanger sequenced to confirm their presence in the sample of origin (Table 5 and 5[25]), were analyzed for allele frequency variation using Huvariome. Meder et al. 2011 [25] could confirm 86% of variants by Sanger sequencing and we could predict normal alleles in 100% of cases with an nc rate (0.0-0.1) based on the two reference cohorts, Diversity Panel and HVC. The 30 variants, present in 19 genes, were confirmed in Huvariome as being polymorphic with 26 having an associated dbSNP number and two of the polymorphic variants not being present in HVC genomes (Table 5). The six known variants were annotated in Huvariome (Table 6, HGMD) with Biobase HGMD and confirmed the prediction, however the reverse complement is listed if in Huvariome the gene in HGMD is on the reverse strand (e.g. MYBP3 47324447C>T is listed as 977G>A in Table 6). An additional microdeletion in TNNT2 is present in HGMD whereas deletion of C was originally observed (Table 6) [25]. In addition to the six known cardiomyopathy variants there are six novel candidate caridomyopathy variants for which the genomic positions are homozygous in all 71 reference genomes (Table 6).

Table 5. Confirmation of known population variation

Table 6. Variations in candidate cardiomyopathy genes

Common gene single nucleotide variation rate

In addition to the allele frequencies from HVC, Diversity and Pedigree Panels, any variant which is known to be part of a gene is used to search our database germ line SNV reported for the HVC reference set. We implemented a method to calculate the exon variation rate r per gene:

<a onClick="popup('http://www.jclinbioinformatics.com/content/2/1/19/mathml/M2','MathML',630,470);return false;" target="_blank" href="http://www.jclinbioinformatics.com/content/2/1/19/mathml/M2">View MathML</a>

Where m is the number of exons within a given gene,vi is the number of variants in exon i, and li is the length (in base pairs) of exon i. The ratio within parentheses is the proportion of bases that are variants in exons out of all bases within the exons of the gene. The negative log transformation produces a score such that a relatively small value corresponds to a gene with a large number of variants per base within the exon. Likewise, a larger score indicates a gene with a smaller number of variants per base within the exon. The nine candidate cardiomyopathy genes from Table 6 were used to search this resource. The results (Table 7) demonstrate that these genes have similar mutation rates compared with all known genes (26,000) listed in the database where the most variable gene is HLA-DRB6 (rate = 2.1) and the least is AHNAK (rate = 9.8). These data suggest that these nine genes are correctly annotated with no missed paralogs and add support that these variations (Table 6) are associated with cardiomyopathy candidate genes.

Table 7. Rate of reference genome variation in candidate cardiomyopathy genes

Discussion

Huvariome was developed utilizing Oracle 11i technology designed to run on the Oracle Exadata platform [26], which was selected based on a number of favorable characteristics including scalability (Exadata scales linearly with added hardware) and performance (smart scans and hybrid columnar compression providing deep compression) [27]. This ensures that data do not need to be replicated as in a de-normalized data delivery platform such as Biomart [28] in which the data in the primary tables must be transformed and thus replicated to deliver fast return of results.

WGS was chosen as a basis for Huvariome to provide research scientists and the research community with a reference cohort for allele frequencies and for base quality checking at any position in the human genome. Here the database is presented as a resource for prioritizing rare SNVs identified with NGS technology. In contrast to other projects, only high coverage genome sequences are used and no imputation has been performed to infer unsequenced variants. Huvariome has been successfully used to prioritize candidate cancer targets and genomic variations detected in familial congenital malformations [29].

The system has been developed to address the need to access genetic variation frequency and assigned probability in control population datasets (e.g. to determine the frequency of a change in the population) and to perform aggregate analyses and assign validation probabilities to observed, naturally occurring variants based on sequencing characteristics across a population. To support these goals we have included the common variation determined in a reference population representative of the Benelux population as part of the output from the public reference datasets provided. In addition we have used the common variations present in HuVariome Core and the Diversity panel to determine the allele “no-call” rate per base of the human genome. We have demonstrated the ability of Huvariome to determine the variants in another resequencing project [24] and to support candidate gene selection in a cardiomyopathy resquencing project [25].

Conclusions

Huvariome was developed to facilitate data storage of WGS and the analysis of genetic variation detected by WGS in research and clinical diagnostics environment, which both require a secure and scalable database. Huvariome provides a user-friendly interface to access genetic variation data from diverse cohort studies for the identification of disease-promoting variations in the underlying database. The variants are annotated to provide users with a wealth of information that they would otherwise have to retrieve manually. The use of high depth and low error whole genome sequencing ensures a high accuracy of allele calling, and the no-call rate offers additional information about the allele frequencies at each base in the human genome build. The database is currently used for several tasks including SNV discovery and in silico validation. Since Huvariome contains data from experiments as well as from reference cohorts, we can separate rare polymorphisms from candidate disease-causing variants. Access to variations obtained from the public Diversity Panel data is freely available from the Huvariome web site. The examples show that Huvariome is a powerful application to confirm ambiguous genotype calls with the associated no-call per base of the human genome. The application allows users to easily compare their genotypes with the 69 reference genomes of the Diversity Panel and Pedigrees to prioritize the candidate gene selection for both family and tumor-based genome analysis. The use of Huvariome Core samples provide additional support to determine if a variant is common (or rare), if the gene that is a candidate has an excess of variations beyond what is statistically expected if a variant is common, and the no-call rate associated with sequencing any base in the reference genome. This application has been successfully used in candidate gene selection for both tumor profiling and Mendelian inheritance studies [29].

We are currently enhancing the performance and scalability by migrating this application to run on Oracle Exadata hardware, allowing highly optimized parallel processing and high compression capability for cheaper storage and faster querying [27], and are developing summary pages that include visualizations using TIBCO Spotfire Web Player technology [30]. The data loaders developed in this project can easily be adapted to accommodate changes in data format thereby making the database sequencing platform independent which will allow sequencing results from other NGS platform (e.g. Illumina, Roche, Life Technologies) and data types (e.g. RNAseq) to be incorporated into this database.

We encourage collaborators to upload their own variants files into the knowledge archive initially in collaboration with the Erasmus University Medical Center and in the future via an optimized upload website with an agreed policy and standardized format and to ensure that the data quality is maintained.

Availability

Huvariome is freely accessible for use from the web site at URL: http://huvariome.erasmusmc.nl webcite.

Abbreviations

SNV: Single nucleotide variation; HVC: Huvariome Core; WGS: Whole genome sequence; MAF: Minor allele frequency; DGV: Database of genomic variants; NGS: Next generation sequencing.

Competing interests

The authors declare that they have no competing interests.

Authors’ contributions

AS, EM, DL, PvdS and AHJK drafted the manuscript. SH, IP, AHJK and AS are involved with the design, data integration and maintenance of the database and application. EM, JR, SDH and AS are responsible for the case studies and statistical analysis. SS, DH, SN, AK, JM and PvdS provided analytical support and testing of the application. All authors read and approved the final manuscript.

Acknowledgements

This work was supported by a grant from the Netherlands Genomic Initiative (050040212). We would like to thank the support of Rick Tearle, Greg Tyrelle and Steve Lincoln with whom invaluable discussions for the presentation of Complete Genomics Data helped to complete this application development.

References

  1. Manolio TA, Collins FS, Cox NJ, Goldstein DB, Hindorff LA, Hunter DJ, McCarthy MI, Ramos EM, Cardon LR, Chakravarti A, Cho JH, Guttmacher AE, Kong A, Kruglyak L, Mardis E, Rotimi CN, Slatkin M, Valle D, Whittemore AS, Boehnke M, Clark AG, Eichler EE, Gibson G, Haines JL, Mackay TF, McCarroll SA, Visscher PM: Finding the missing heritability of complex diseases.

    Nature 2009, 461:747-753. PubMed Abstract | Publisher Full Text | PubMed Central Full Text OpenURL

  2. Clark MJ, Chen R, Lam HY, Karczewski KJ, Chen R, Euskirchen G, Butte AJ, Snyder M: Performance comparison of exome DNA sequencing technologies.

    Nat Biotechnol 2011, 10:908-914. OpenURL

  3. Montenegro G, Powell E, Huang J, Speziani F, Edwards YJ, Beecham G, Hulme W, Siskind C, Vance J, Shy M, Züchner S: Exome sequencing allows for rapid gene identification in a Charcot-Marie-Tooth family.

    Ann Neurol 2011, 3:464-470. OpenURL

  4. Roach JC, Glusman G, Smit AF, Huff CD, Hubley R, Shannon PT, Rowen L, Pant KP, Goodman N, Bamshad M, Shendure J, Drmanac R, Jorde LB, Hood L, Galas DJ: Analysis of genetic inheritance in a family quartet by whole-genome sequencing.

    Science 2010, 328:636-639. PubMed Abstract | Publisher Full Text | PubMed Central Full Text OpenURL

  5. Ropers HH: On the future of genetic risk assessment.

    J Community Genet 2012.

    [Epub ahead of print]

    OpenURL

  6. Online Mendelian Inheritance in Man (OMIM)http://omim.org/ webcite

  7. Sherry ST, Ward MH, Kholodov M, Baker J, Phan L, Smigielski EM, Sirotkin K: dbSNP: the NCBI database of genetic variation.

    Nucleic Acids Res 2001, 1:308-311. OpenURL

  8. Iafrate AJ, Feuk L, Rivera MN, Listewnik ML, Donahoe PK, Qi Y, Scherer SW, Lee C: Detection of large-scale variation in the human genome.

    Nat Genet 2004, 36:949-951. PubMed Abstract | Publisher Full Text OpenURL

  9. Stenson PD, Ball EV, Howells K, Phillips AD, Mort M, Cooper DN: The Human Gene Mutation Database: providing a comprehensive central mutation database for molecular diagnostics and personalized genomics.

    Hum Genomics 2009, 2:69-72. OpenURL

  10. SeattleSeq Annotationhttp://snp.gs.washington.edu/SeattleSeqAnnotation/ webcite

  11. Amigo J, Salas A, Phillips C: ENGINES: exploring single nucleotide variation in entire human genomes.

    BMC Bioinformatics 2011, 12:105. PubMed Abstract | BioMed Central Full Text | PubMed Central Full Text OpenURL

  12. Xia J, Wang Q, Jia P, Wang B, Pao W, Zhao Z: NGS Catalog: A database of next generation sequencing studies in humans.

    Hum Mutat 2012, 33:E2341-E2355. PubMed Abstract | Publisher Full Text OpenURL

  13. Wang K, Li M, Hakonarson H: ANNOVAR: functional annotation of genetic variants from high-throughput sequencing data.

    Nucleic Acids Res 2010, 38:E164. PubMed Abstract | Publisher Full Text | PubMed Central Full Text OpenURL

  14. 1000 genomes project http://www.1000genomes.org/ webcite

  15. NHLBI Exome Sequencing Project (ESP) http://evs.gs.washington.edu/EVS/ webcite

  16. Chelala C, Khan A, Lemoine NR: SNPnexus: A web database for functional annotation of newly discovered and public domain Single Nucleotide Polymorphisms.

    Bioinformatics 2009, 25:655-661. PubMed Abstract | Publisher Full Text | PubMed Central Full Text OpenURL

  17. Carnevali P, Baccash J, Halpern AL, Nazarenko I, Nilsen GB, Pant KP, Ebert JC, Brownley A, Morenzoni M, Karpinchyk V, Martin B, Ballinger DG, Drmanac R: Computational techniques for human genome resequencing using mated gapped reads.

    J Comput Biol 2012, 3:279-292. OpenURL

  18. Complete Genomics Release Notes Assembly Software v2.0http://www.completegenomics.com/customer-support/documentation/100357139.html webcite

  19. Tennessen JA, Bigham AW, O’Connor TD, Fu W, Kenny EE, Gravel S, McGee S, Do R, Liu X, Jun G, Kang HM, Jordan D, Leal SM, Gabriel S, Rieder MJ, Abecasis G, Altshuler D, Nickerson DA, Boerwinkle E, Sunyaev S, Bustamante CD, Bamshad MJ, Akey JM, Broad GO, Seattle GO: NHLBI Exome Sequencing Project. Evolution and functional impact of rare coding variation from deep sequencing of human exomes.

    Science 2012, 337:64-69. PubMed Abstract | Publisher Full Text OpenURL

  20. Guo Y, Long J, He J, Li CI, Cai Q, Shu XO, Zheng W, Li C: Exome sequencing generates high quality data in non-target regions.

    BMC Genomics 2012, 13:194. PubMed Abstract | BioMed Central Full Text | PubMed Central Full Text OpenURL

  21. Zhu Q, Ge D, Maia JM, Zhu M, Petrovski S, Dickson SP, Heinzen EL, Shianna KV, Goldstein DB: A genome-wide comparison of the functional properties of rare and common genetic variants in humans.

    Am J Hum Genet 2011, 88:458-468. PubMed Abstract | Publisher Full Text | PubMed Central Full Text OpenURL

  22. Complete Genomics Public Genomes http://community.completegenomics.com/kb/m/knowledgebase/320/download.aspx webcite

  23. Gao X, Starmer JD: AWclust: point-and-click software for non-parametric population structure analysis.

    BMC Bioinformatics 2008, 9:77. PubMed Abstract | BioMed Central Full Text | PubMed Central Full Text OpenURL

  24. Ng SB, Turner EH, Robertson PD, Flygare SD, Bigham AW, Lee C, Shaffer T, Wong M, Bhattacharjee A, Eichler EE, Bamshad M, Nickerson DA, Shendure J: Targeted capture and massively parallel sequencing of 12 human exomes.

    Nature 2009, 461:272-276. PubMed Abstract | Publisher Full Text | PubMed Central Full Text OpenURL

  25. Meder B, Haas J, Keller A, Heid C, Just S, Borries A, Boisguerin V, Scharfenberger-Schmeer M, Stähler P, Beier M, Weichenhan D, Strom TM, Pfeufer A, Korn B, Katus HA, Rottbauer W: Targeted next-generation sequencing for the molecular genetic diagnostics of cardiomyopathies.

    Circ Cardiovasc Genet 2011, 2:110-122. OpenURL

  26. Oracle Exadata platformhttp://www.oracle.com/us/products/database/exadata/overview/index.html webcite

  27. Oracle Exadata technical whitepaperhttp://www.oracle.com/technetwork/database/exadata/exadata-technical-whitepaper-134575.pdf webcite

  28. Smedley D, Haider S, Ballester B, Holland R, London D, Thorisson G, Kasprzyk A: BioMart–biological queries made easy.

    BMC Genomics 2009, 10:22. PubMed Abstract | BioMed Central Full Text | PubMed Central Full Text OpenURL

  29. Nieminen P, Morgan NV, Fenwick AL, Parmanen S, Veistinen L, Mikkola ML, van der Spek PJ, Giraud A, Judd L, Arte S, Brueton LA, Wall SA, Mathijssen IMJ, Maher ER, Wilkie AOM, Kreiborg S, Thesleff I: Inactivation of IL11 Signaling Causes Craniosynostosis, Delayed Tooth Eruption, and Supernumerary Teeth.

    Am J Hum Genet 2011, 89:67-81. PubMed Abstract | Publisher Full Text | PubMed Central Full Text OpenURL

  30. TIBCO Spotfire Webplayerhttp://spotfire.tibco.com/products/web-player/interactive-dashboards.aspx webcite