Variant Annotation and Filtration with External Databases
Variant annotation draws on external databases to attach additional information to each genetic variant. These annotations, such as allele frequencies, expression levels, gene features, prediction scores, and epigenetic markers, can then be used to filter and prioritize variants.
Allele Frequency
Allele frequency annotation adds population-level allele frequencies to the analysis of genetic variants. For available databases, refer to the allele frequency databases. Users can also create customized annotation files by following the guidance.
Annotation Option | Description | Default |
---|---|---|
--freq-database | Specifies the reference databases for allele frequency annotation. Format: --freq-database; Example: --freq-database gnomad field=gnomAD::EAS | [OFF] |
Once configured, users can filter variants based on allele frequencies in the reference population.
Filtration Option | Description | Default |
---|---|---|
--db-af | Excludes variants with alternative allele frequency (AF) outside the range [min, max]. Format: --db-af; Example: --db-af 0.05~1.0; Valid Range: 0.0 to 1.0 | [OFF] |
--db-maf | Excludes variants with minor allele frequency (MAF) outside the range [min, max]. Format: --db-maf; Example: --db-maf 0.05~0.5; Valid Range: 0.0 to 0.5 | [OFF] |
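The range semantics of the two filters can be illustrated with a short Python sketch. It is not KGGA's implementation. It assumes that both bounds are inclusive, that the `min~max` string is split on `~`, and that MAF at a biallelic site is the smaller of AF and 1 - AF. The function names are hypothetical, and the handling of variants with no frequency in the database is not covered.

```python
def parse_range(spec: str) -> tuple[float, float]:
    """Parse a 'min~max' string such as '0.05~1.0' into two floats."""
    low_text, high_text = spec.split("~")
    low, high = float(low_text), float(high_text)
    if low > high:
        raise ValueError(f"min exceeds max in range {spec!r}")
    return low, high


def minor_allele_frequency(alt_af: float) -> float:
    """Frequency of the less common allele at a biallelic site."""
    return min(alt_af, 1.0 - alt_af)


def passes_db_af(alt_af: float, spec: str) -> bool:
    """True if the alternative allele frequency lies in [min, max] (assumed inclusive)."""
    low, high = parse_range(spec)
    return low <= alt_af <= high


def passes_db_maf(alt_af: float, spec: str) -> bool:
    """True if the minor allele frequency lies in [min, max] (assumed inclusive)."""
    low, high = parse_range(spec)
    return low <= minor_allele_frequency(alt_af) <= high
```

Under these assumptions, a variant with AF 0.9 passes both `--db-af 0.05~1.0` and `--db-maf 0.05~0.5`, because its minor allele frequency is about 0.1.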
Allelic Expression
Expression annotation adds the allelic expression of variants to the analysis. Normally, only expressed mutations are actionable. For available databases, see the [curated GTEx expression dataset](../../databases/pext.md).
Annotation Option | Description | Default |
---|---|---|
--exp-database | Specifies the reference databases for expression annotation. Format: --exp-database; Example: --exp-database | [OFF] |
Once configured, users can filter variants based on their expression values.
Filtration Option | Description | Default |
---|---|---|
--exp-range | Excludes variants with standardized expression values outside the range [min, max]. Format: --exp-range; Example: --exp-range 0.05~1.0; Valid Range: 0.0 to 1.0 | [OFF] |
Gene Features
Gene feature annotation identifies which gene or transcript a variant affects and its functional role. KGGA supports three gene definition systems: RefSeq genes (refgene), GENCODE genes (gencode), and UCSC KnownGene (knowngene). Users can also provide customized gene model databases following the example format.
Annotation Option | Description | Default |
---|---|---|
--gene-model-database | Specifies the gene model database for annotation. Format: --gene-model-database; Example: --gene-model-database refgene | refgene |
Users can fine-tune gene feature annotations with the following parameters:
Annotation Option | Description | Default |
---|---|---|
--splicing-distance | Sets the base-pair distance for defining splicing junction variants. Format: --splicing-distance; Example: --splicing-distance 3; Valid Setting: ≥1 | 3 |
--upstream-distance | Sets the upstream region length (bp) from the transcription start site. Format: --upstream-distance; Example: --upstream-distance 1000; Valid Setting: ≥1 | 1000 |
--downstream-distance | Sets the downstream region length (bp) from the transcription end site. Format: --downstream-distance; Example: --downstream-distance 1000; Valid Setting: ≥1 | 1000 |
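How the three distance parameters shape positional annotation can be sketched in Python. This is a deliberately simplified illustration, not KGGA's algorithm. It assumes a single transcript with 1-based inclusive coordinates, treats only intronic bases near an exon boundary as splicing, and ignores coding consequences. The `Transcript` class and `classify` function are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class Transcript:
    start: int                     # leftmost genomic coordinate (1-based, inclusive)
    end: int                       # rightmost genomic coordinate (inclusive)
    strand: str                    # "+" or "-"
    exons: list[tuple[int, int]]   # (start, end) pairs, inclusive


def classify(pos: int, tx: Transcript, splicing_distance: int = 3,
             upstream_distance: int = 1000, downstream_distance: int = 1000) -> str:
    """Return a coarse positional label for a variant relative to one transcript."""
    if pos < tx.start:
        gap = tx.start - pos
        # On the minus strand, the left flank lies past the transcription end.
        if tx.strand == "+":
            return "upstream" if gap <= upstream_distance else "intergenic"
        return "downstream" if gap <= downstream_distance else "intergenic"
    if pos > tx.end:
        gap = pos - tx.end
        if tx.strand == "+":
            return "downstream" if gap <= downstream_distance else "intergenic"
        return "upstream" if gap <= upstream_distance else "intergenic"
    for exon_start, exon_end in tx.exons:
        if exon_start <= pos <= exon_end:
            return "exonic"
    # Inside the transcript but outside every exon: intronic, possibly near a junction.
    for exon_start, exon_end in tx.exons:
        if 0 < exon_start - pos <= splicing_distance or 0 < pos - exon_end <= splicing_distance:
            return "splicing"
    return "intronic"
```

In this sketch, raising `splicing_distance` widens the intronic window labeled as splicing, while lowering `upstream_distance` or `downstream_distance` moves flanking variants into the intergenic class.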
To disable gene feature annotation:
Annotation Option | Description | Default |
---|---|---|
--disable-gene-feature | Disables gene feature annotation. Format: --disable-gene-feature | [OFF] |
KGGA assigns numeric codes to gene features for easy reference:
Feature | Code | Explanation |
---|---|---|
Frameshift | 0 | Short indel causing a frameshift. |
Nonframeshift | 1 | Short indel causing amino acid loss without frameshift. |
Start-loss | 2 | Loss of start codon (ATG). |
Stop-loss | 3 | Loss of stop codon (TAG, TAA, TGA). |
Stop-gain | 4 | Gain of a stop codon, potentially truncating the protein. |
Splicing | 5 | Within 3 bp of a splicing junction (adjustable). |
Missense | 6 | Codon change leading to a different amino acid. |
Synonymous | 7 | Codon change without amino acid alteration. |
Exonic | 8 | Mapped to the exonic region without precise annotation. |
UTR5 | 9 | Within 5' untranslated region. |
UTR3 | 10 | Within 3' untranslated region. |
Intronic | 11 | Within an intron. |
Upstream | 12 | Within 1 kb upstream of transcription start site (adjustable). |
Downstream | 13 | Within 1 kb downstream of transcription end site (adjustable). |
ncRNA | 14 | Within a non-coding RNA transcript. |
Intergenic | 15 | In intergenic region. |
Monomorphic | 16 | Not a sequence variation (potential reference genome error). |
Unknown | 17 | No annotation available. |
Users can filter variants based on specific gene features or genes:
Filtration Option | Description | Default |
---|---|---|
--gene-feature-included | Retains variants with specified gene feature codes. Format: --gene-feature-included; Example: --gene-feature-included 0~6,9,10; Valid Codes: 0 to 17 | 0~17 |
--gene-excluded | Excludes variants associated with specified genes. Format: --gene-excluded GeneSymbol1,GeneSymbol2,... | [OFF] |
--gene-retained | Retains variants associated with specified genes. Format: --gene-retained GeneSymbol1,GeneSymbol2,... | [OFF] |
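To make the code list syntax concrete, the following Python sketch expands a value such as `0~6,9,10` into individual feature codes and maps them back to the feature names in the table above. The parsing rules (`~` for an inclusive range, `,` as separator) are inferred from the example and are an assumption. `parse_feature_codes` is a hypothetical helper, not part of KGGA.

```python
# Feature names indexed by their numeric code, copied from the table above.
FEATURE_NAMES = [
    "Frameshift", "Nonframeshift", "Start-loss", "Stop-loss", "Stop-gain",
    "Splicing", "Missense", "Synonymous", "Exonic", "UTR5", "UTR3",
    "Intronic", "Upstream", "Downstream", "ncRNA", "Intergenic",
    "Monomorphic", "Unknown",
]


def parse_feature_codes(spec: str) -> set[int]:
    """Expand a spec such as '0~6,9,10' into a set of integer codes."""
    codes: set[int] = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "~" in part:
            low_text, high_text = part.split("~")
            codes.update(range(int(low_text), int(high_text) + 1))
        else:
            codes.add(int(part))
    invalid = sorted(code for code in codes if not 0 <= code < len(FEATURE_NAMES))
    if invalid:
        raise ValueError(f"codes outside 0 to 17: {invalid}")
    return codes


retained = parse_feature_codes("0~6,9,10")
retained_names = [FEATURE_NAMES[code] for code in sorted(retained)]
```

With this reading, `0~6,9,10` keeps the protein-altering and splicing categories plus both UTR classes, and drops synonymous, unspecified exonic, intronic, flanking, ncRNA, and intergenic variants.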
Prediction Scores
KGGA provides access to a curated selection of databases for variant functional and pathogenic prediction scores.
Annotation Option | Description | Default |
---|---|---|
--variant-annotation-database | Specifies the databases for variant annotation. Format: --variant-annotation-database; Example: --variant-annotation-database dbnsfp field=dbNSFP::VEST,dbNSFP::FATHMM-XF,dbNSFP::Eigen,dbNSFP::CADD. KGGA currently provides four databases: CADD, FAVOR, dbNSFP, and ClinVar. | [OFF] |
Epigenetic Markers
KGGA can directly download epigenetic marker resources from the EpiGenome public domain, specifically from the EpiMap Repository.
Annotation Option | Description | Default |
---|---|---|
--region-annotation-database | Specifies the interval/regional databases for annotation. Format: --region-annotation-database; Example: --region-annotation-database EpiMap subID=BSS00001 marker=H3K4me3. Parameters: name, the database name (e.g., EpiMap); subID, the subject ID from EpiMap; marker, the epigenetic marker (e.g., H3K4me3); path, the path to the database file (optional for EpiMap); field, the specific fields to include. Note: For name=EpiMap, KGGA automatically downloads data from the EpiMap Repository if it is not available locally. | NA |
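The database options in this section share a common shape: a database name followed by `key=value` parameters. The Python sketch below shows one way such a value could be split into its parts. It is based only on the examples shown here, so the grammar is an assumption, and `parse_database_spec` is a hypothetical helper rather than a KGGA function.

```python
import shlex


def parse_database_spec(value: str) -> dict[str, str]:
    """Split 'NAME key=value ...' into a dict with the name stored under 'name'."""
    tokens = shlex.split(value)
    if not tokens:
        raise ValueError("empty database specification")
    name, *pairs = tokens
    spec = {"name": name}
    for token in pairs:
        key, separator, item = token.partition("=")
        if not separator or not key:
            raise ValueError(f"expected key=value, got {token!r}")
        spec[key] = item
    return spec


epimap = parse_database_spec("EpiMap subID=BSS00001 marker=H3K4me3")
dbnsfp = parse_database_spec("dbnsfp field=dbNSFP::VEST,dbNSFP::CADD")
# Comma-separated fields can then be split into individual column names.
dbnsfp_fields = dbnsfp["field"].split(",")
```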