Summary of Databases
International projects and functional genomic studies have produced a wealth of information across diverse biological domains that can be used for variant annotation. To make variant annotation convenient, we have pre-processed selected annotation fields from several commonly used databases. Users can download all of the databases, or only those a study needs, directly from our website and then annotate large numbers of variants locally. For small-scale annotation (e.g., <10,000 variants), users can instead access the databases remotely via FTP/HTTP. The program uses the index file to fetch, in a distributed fashion, only the database slices needed for the given variants, so the entire database does not have to be downloaded and annotation starts sooner. Because genomic databases are large and frequently updated, KGGA also allows users to add custom databases for annotation.
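The idea of fetching only the needed slices can be illustrated with a minimal Python sketch. It assumes a hypothetical index that maps each block of a database file to a starting genomic position, a byte offset, and a byte length; this layout and the function names are assumptions made for illustration and do not describe KGGA's actual index format or code. The sketch finds the blocks that contain the query variants, merges adjacent byte ranges, and formats a standard HTTP `Range` header value.

```python
from bisect import bisect_right


def byte_ranges_for_variants(index, variants):
    """Return merged, inclusive (first_byte, last_byte) ranges covering the variants.

    index    -- iterable of (chrom, block_start_pos, byte_offset, byte_length);
                a hypothetical layout, not KGGA's real index format
    variants -- iterable of (chrom, pos)
    """
    # Group blocks by chromosome and sort them by starting position.
    blocks_by_chrom = {}
    for chrom, start_pos, offset, length in index:
        blocks_by_chrom.setdefault(chrom, []).append((start_pos, offset, length))
    starts_by_chrom = {}
    for chrom, blocks in blocks_by_chrom.items():
        blocks.sort()
        starts_by_chrom[chrom] = [b[0] for b in blocks]

    # Pick the block whose start position is the last one <= the variant position.
    needed = set()
    for chrom, pos in variants:
        if chrom not in blocks_by_chrom:
            continue
        i = bisect_right(starts_by_chrom[chrom], pos) - 1
        if i < 0:
            continue
        _, offset, length = blocks_by_chrom[chrom][i]
        needed.add((offset, offset + length - 1))

    # Merge ranges that touch or overlap so fewer requests are needed.
    merged = []
    for first, last in sorted(needed):
        if merged and first <= merged[-1][1] + 1:
            merged[-1] = (merged[-1][0], max(merged[-1][1], last))
        else:
            merged.append((first, last))
    return merged


def range_header(ranges):
    """Format inclusive byte ranges as an HTTP Range header value."""
    return "bytes=" + ",".join(f"{first}-{last}" for first, last in ranges)


example_index = [
    ("chr1", 1, 0, 100),
    ("chr1", 1001, 100, 50),
    ("chr1", 2001, 150, 70),
    ("chr2", 1, 220, 30),
]
example_variants = [("chr1", 500), ("chr1", 1500), ("chr2", 10), ("chrX", 5)]
example_ranges = byte_ranges_for_variants(example_index, example_variants)
print(range_header(example_ranges))
```

Not every server honors several ranges in one request, so a client built along these lines might instead send one request per merged range.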
Gene feature annotation
We support three well-established gene annotation systems: RefSeq genes (refgene), GENCODE genes (gencode), and the UCSC KnownGene database (knowngene). In KGGA, databases used for gene feature annotation should be provided in FASTA format. KGGA also accepts custom gene databases in FASTA format, so users can integrate gene annotations tailored to their own research needs or preferences.
Database Name | FTP/HTTP | Version/Date |
---|---|---|
refgene | https://ftp.ncbi.nih.gov/refseq/H_sapiens/annotation/ | 2024-08-27 |
gencode | https://www.gencodegenes.org/human/ | V47 |
knowngene | https://hgdownload.soe.ucsc.edu/goldenPath/hg38/database/ | 2024-08-19 |
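For readers preparing a custom gene database, the general shape of a FASTA file can be checked with a short Python sketch like the one below. It only parses generic FASTA (a `>` header line followed by sequence lines); the header fields that KGGA expects are not specified in this section, and the record identifiers in the example are made up for illustration.

```python
def read_fasta(lines):
    """Parse FASTA text lines into a dict of {record_id: sequence}.

    The record id is taken as the first whitespace-delimited token after '>'.
    Sequence lines are concatenated and upper-cased.
    """
    records = {}
    record_id = None
    chunks = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if record_id is not None:
                records[record_id] = "".join(chunks)
            record_id = line[1:].split()[0]
            chunks = []
        elif record_id is not None:
            chunks.append(line.upper())
    if record_id is not None:
        records[record_id] = "".join(chunks)
    return records


# Hypothetical records, for illustration only.
example_lines = [
    ">tx_0001 example transcript",
    "acgt",
    "TTGA",
    "",
    ">tx_0002",
    "GGCC",
]
example_records = read_fasta(example_lines)
for name, seq in example_records.items():
    print(name, len(seq))
```

To check a file on disk, the same function can be given an open file handle, for example `read_fasta(open(path))`, where `path` points to the user's own FASTA file.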
Common variant annotation databases
In KGGA, common variant annotation databases, such as gnomAD for allele frequency annotation, CADD for variant function annotation, and ClinVar for disease linkage annotation, should be provided in GTB format. These databases are typically stored as well-structured tables, which makes them easy to convert into GTB format. Brief descriptions of the databases are given below. For more details, click on a database name to view the full list of fields available for it.
Database Name | Short Description |
---|---|
gnomAD | Allele frequency data in the Genome Aggregation Database (gnomAD) v4 dataset (GRCh38) is derived from 730,947 exomes and 76,215 genomes from unrelated individuals of diverse ancestries. The pre-made GTB format of gnomAD(v4.1) for KGGA can be downloaded here. |
CADD | Combined Annotation Dependent Depletion (CADD) is a widely used metric of mutation deleteriousness that integrates more than 100 annotations for all possible single-nucleotide variants (SNVs) of the GRCh38/hg38 human reference genome. The pre-made GTB format of CADD(v1.7) for KGGA can be downloaded here. |
FAVOR | Functional Annotation of Variants - Online Resource (FAVOR) provides comprehensive, multi-faceted functional annotations that summarize findings for all nine billion possible SNVs across the genome (build GRCh38). The pre-made GTB format of FAVOR(v2.0) for KGGA can be downloaded here. |
dbNSFP | dbNSFP is a database for functional prediction and annotation of all potential non-synonymous single-nucleotide variants (nsSNVs) in the human genome. The pre-made GTB format of dbNSFP(v5.0a) for KGGA can be downloaded here. |
PEXT | Proportion expressed across transcripts (pext) is a transcript-level annotation metric that quantifies isoform expression for variants in gnomAD v2 (hg19) across 54 GTEx tissues. The pre-made GTB format of PEXT(v2) for KGGA can be downloaded here. |
ClinVar | ClinVar is a public database managed by the National Center for Biotechnology Information (NCBI) that provides information about the relationship between genetic variation and human health. The pre-made GTB format of ClinVar for KGGA can be downloaded here. |
Region annotation databases
To make more resources conveniently usable, KGGA lets users specify customized third-party resources for annotation. For example, given a file name documented in EpiMap, KGGA can download the corresponding epigenetic marker resource directly from the public EpiMap Repository.
Database Name | Short Description |
---|---|
EpiMap | EpiMap is one of the most comprehensive human epigenome maps, providing approximately 15,000 datasets across 833 bio-samples and 18 epigenomic marks. It delivers rich gene-regulatory annotations encompassing chromatin states, high-resolution enhancers, activity patterns, enhancer modules, upstream regulators, and downstream target genes. |
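Region annotation amounts to asking which annotated interval, if any, contains a variant. The Python sketch below shows one common way to do this lookup. It assumes BED-like regions (0-based, half-open) that do not overlap within a chromosome, and 1-based variant positions as in VCF. The function names and the example regions and labels are illustrative assumptions, not the tool's actual implementation or real EpiMap records.

```python
from bisect import bisect_right


def build_region_index(regions):
    """Group (chrom, start, end, label) regions by chromosome, sorted by start.

    Coordinates are assumed to be BED-like: 0-based, half-open [start, end),
    and regions on the same chromosome are assumed not to overlap.
    """
    index = {}
    for chrom, start, end, label in regions:
        index.setdefault(chrom, []).append((start, end, label))
    for intervals in index.values():
        intervals.sort()
    return index


def annotate_position(index, chrom, pos):
    """Return the label of the region containing a 1-based position, or None."""
    intervals = index.get(chrom, [])
    pos0 = pos - 1  # convert the 1-based position to 0-based
    # Find the last interval whose start is <= pos0.
    i = bisect_right(intervals, (pos0, float("inf"), "")) - 1
    if i >= 0:
        start, end, label = intervals[i]
        if start <= pos0 < end:
            return label
    return None


# Made-up regions for illustration.
example_regions = [
    ("chr1", 100, 200, "EnhA1"),
    ("chr1", 200, 300, "TssA"),
    ("chr2", 0, 50, "Quies"),
]
example_region_index = build_region_index(example_regions)
print(annotate_position(example_region_index, "chr1", 101))
```

If the regions can overlap, as with peak calls from several marks, a binary search over start positions alone is not enough, and an interval tree or a linear scan per chromosome would be needed.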