Variant Annotation and Filtration with External Databases

KGGA can annotate variants with information from external databases and then filter and prioritize them by those annotations, including allele frequencies, expression levels, gene features, prediction scores, and epigenetic markers.


Allele Frequency

Allele frequency annotation adds population-level allele frequencies to each variant. For available databases, refer to the allele frequency databases. Users can also build customized annotation files by following the guidance.

Annotation Option Description Default
--freq-database Specifies the reference databases for allele frequency annotation.
Format: --freq-database path= field=[field1,field2,...]
Example: --freq-database gnomad field=gnomAD::EAS
- name: Database name (e.g., gnomad).
- path: Path to the database (local or FTP/HTTP). Can be omitted if the resources folder is specified.
- field: Specific fields to include (e.g., gnomAD::EAS). If unspecified, all fields are selected.
[OFF]

Once configured, users can filter variants based on allele frequencies in the reference population.

Filtration Option Description Default
--db-af Excludes variants with alternative allele frequency (AF) outside the range [min, max].
Format: --db-af ~
Example: --db-af 0.05~1.0
Valid Range: 0.0 to 1.0
[OFF]
--db-maf Excludes variants with minor allele frequency (MAF) outside the range [min, max].
Format: --db-maf ~
Example: --db-maf 0.05~0.5
Valid Range: 0.0 to 0.5
[OFF]
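
The two filters differ only in which frequency they test. The following is a minimal Python sketch of that logic, not KGGA's implementation. It assumes the `min~max` syntax shown in the examples, inclusive bounds, and a biallelic site where the minor allele frequency is `min(AF, 1 - AF)`. All function names are hypothetical.

```python
def parse_range(spec):
    """Parse a 'min~max' string such as '0.05~1.0' into a (low, high) tuple."""
    low_text, high_text = spec.split("~")
    low, high = float(low_text), float(high_text)
    if low > high:
        raise ValueError(f"empty range: {spec}")
    return low, high


def passes_af(alt_af, spec):
    """Keep a variant whose alternative allele frequency lies in [low, high]."""
    low, high = parse_range(spec)
    return low <= alt_af <= high


def passes_maf(alt_af, spec):
    """Keep a variant whose minor allele frequency lies in [low, high].

    Assumption: the site is biallelic, so the minor allele frequency is the
    smaller of the alternative and reference allele frequencies.
    """
    low, high = parse_range(spec)
    maf = min(alt_af, 1.0 - alt_af)
    return low <= maf <= high
```

Under these assumptions, a variant with an alternative allele frequency of 0.97 passes a range of `0.05~1.0` on AF but fails a range of `0.05~0.5` on MAF, because its minor allele frequency is about 0.03.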

Allelic Expression

Expression annotation adds the allelic expression of variants to the analysis. Normally, only expressed mutations are actionable. For available databases, see the [curated GTEx expression dataset](../../databases/pext.md).

Annotation Option Description Default
--exp-database Specifies the reference databases for expression annotation.
Format: --exp-database path= field=[field1,field2,...]
Example: --exp-database field=
[OFF]

Once configured, users can filter variants based on their expression values.

Filtration Option Description Default
--exp-range Excludes variants with standardized expression values outside the range [min, max].
Format: --exp-range ~
Example: --exp-range 0.05~1.0
Valid Range: 0.0 to 1.0
[OFF]

Gene Features

Gene feature annotation identifies which gene or transcript a variant affects and its functional role. KGGA supports three gene definition systems: RefSeq genes (refgene), GENCODE genes (gencode), and UCSC KnownGene (knowngene). Users can also provide customized gene model databases following the example format.

Annotation Option Description Default
--gene-model-database Specifies the gene model database for annotation.
Format: --gene-model-database path=
Example: --gene-model-database refgene
- name: Database name (e.g., refgene).
- path: Path to the database (optional if the resources folder is set).
refgene

Users can fine-tune gene feature annotations with the following parameters:

Annotation Option Description Default
--splicing-distance Sets the base-pair distance for defining splicing junction variants.
Format: --splicing-distance
Example: --splicing-distance 3
Valid Setting: ≥1
3
--upstream-distance Sets the upstream region length (bp) from the transcription start site.
Format: --upstream-distance
Example: --upstream-distance 1000
Valid Setting: ≥1
1000
--downstream-distance Sets the downstream region length (bp) from the transcription end site.
Format: --downstream-distance
Example: --downstream-distance 1000
Valid Setting: ≥1
1000
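
To make the three distance parameters concrete, here is a small Python sketch of how such thresholds could be applied. It is an illustration under stated assumptions, not KGGA's actual algorithm. Coordinates are assumed to be 1-based and inclusive, distance is measured from the nearest transcript or exon boundary, and the function names and return labels are hypothetical.

```python
def flank_label(pos, tx_start, tx_end, strand, upstream=1000, downstream=1000):
    """Label a position relative to a transcript spanning [tx_start, tx_end].

    tx_start < tx_end in genomic coordinates regardless of strand.
    Returns 'upstream', 'downstream', 'within', or None (too far away).
    """
    if tx_start <= pos <= tx_end:
        return "within"
    # Which genomic side of the transcript the position is on, and how far.
    left_of_tx = pos < tx_start
    distance = tx_start - pos if left_of_tx else pos - tx_end
    # On the plus strand the genomic left side precedes the transcription
    # start site; on the minus strand the sides are swapped.
    before_tss = left_of_tx if strand == "+" else not left_of_tx
    if before_tss:
        return "upstream" if distance <= upstream else None
    return "downstream" if distance <= downstream else None


def near_splice_junction(pos, exon_boundaries, splicing_distance=3):
    """True if pos lies within splicing_distance bp of any exon boundary."""
    return any(abs(pos - b) <= splicing_distance for b in exon_boundaries)
```

With the default distances, a position 500 bp to the genomic left of a transcript is labeled upstream on the plus strand and downstream on the minus strand, which is why the strand has to be part of the check.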

To disable gene feature annotation:

Annotation Option Description Default
--disable-gene-feature Disables gene feature annotation.
Format: --disable-gene-feature
[OFF]

KGGA assigns numeric codes to gene features for easy reference:

| Feature | Code | Explanation |
|---|---|---|
| Frameshift | 0 | Short indel causing a frameshift. |
| Nonframeshift | 1 | Short indel causing amino acid loss without frameshift. |
| Start-loss | 2 | Loss of start codon (ATG). |
| Stop-loss | 3 | Loss of stop codon (TAG, TAA, TGA). |
| Stop-gain | 4 | Gain of a stop codon, potentially truncating the protein. |
| Splicing | 5 | Within 3 bp of a splicing junction (adjustable). |
| Missense | 6 | Codon change leading to a different amino acid. |
| Synonymous | 7 | Codon change without amino acid alteration. |
| Exonic | 8 | Mapped to the exonic region without precise annotation. |
| UTR5 | 9 | Within 5' untranslated region. |
| UTR3 | 10 | Within 3' untranslated region. |
| Intronic | 11 | Within an intron. |
| Upstream | 12 | Within 1 kb upstream of transcription start site (adjustable). |
| Downstream | 13 | Within 1 kb downstream of transcription end site (adjustable). |
| ncRNA | 14 | Within a non-coding RNA transcript. |
| Intergenic | 15 | In intergenic region. |
| Monomorphic | 16 | Not a sequence variation (potential reference genome error). |
| Unknown | 17 | No annotation available. |

Users can filter variants based on specific gene features or genes:

Filtration Option Description Default
--gene-feature-included Retains variants with specified gene feature codes.
Format: --gene-feature-included ,,...
Example: --gene-feature-included 0~6,9,10
Valid Codes: 0 to 17
0~17
--gene-excluded Excludes variants associated with specified genes.
Format: --gene-excluded GeneSymbol1,GeneSymbol2,...
[OFF]
--gene-retained Retains variants associated with specified genes.
Format: --gene-retained GeneSymbol1,GeneSymbol2,...
[OFF]
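
The code list mixes single codes and `~` ranges. The Python sketch below shows one way to expand such a list and map it back to feature names. It assumes ranges are inclusive, as the default `0~17` with valid codes 0 to 17 suggests. `FEATURE_NAMES` and `expand_codes` are hypothetical helpers for illustration, not part of KGGA.

```python
# List index equals the feature code from the table above.
FEATURE_NAMES = [
    "Frameshift", "Nonframeshift", "Start-loss", "Stop-loss", "Stop-gain",
    "Splicing", "Missense", "Synonymous", "Exonic", "UTR5", "UTR3",
    "Intronic", "Upstream", "Downstream", "ncRNA", "Intergenic",
    "Monomorphic", "Unknown",
]


def expand_codes(spec):
    """Expand a code list such as '0~6,9,10' into a sorted list of ints."""
    codes = set()
    for part in spec.split(","):
        if "~" in part:
            # Assumption: 'first~last' includes both endpoints.
            first, last = (int(x) for x in part.split("~"))
            codes.update(range(first, last + 1))
        else:
            codes.add(int(part))
    invalid = sorted(c for c in codes if not 0 <= c < len(FEATURE_NAMES))
    if invalid:
        raise ValueError(f"feature codes out of range: {invalid}")
    return sorted(codes)


selected = expand_codes("0~6,9,10")
names = [FEATURE_NAMES[code] for code in selected]
```

For the example value `0~6,9,10`, this selects the seven protein-altering and splicing categories (Frameshift through Missense) plus UTR5 and UTR3, leaving out synonymous, intronic, and intergenic variants.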

Prediction Scores

KGGA provides access to a curated selection of databases for variant functional and pathogenic prediction scores.

Annotation Option Description Default
--variant-annotation-database Specifies the databases for variant annotation.
Format: --variant-annotation-database path= field=[field1,field2,...]
Example: --variant-annotation-database dbnsfp field=dbNSFP::VEST,dbNSFP::FATHMM-XF,dbNSFP::Eigen,dbNSFP::CADD
KGGA currently provides four databases: CADD, FAVOR, dbNSFP, and ClinVar.
[OFF]

Epigenetic Markers

KGGA can directly download epigenetic marker resources from the EpiGenome public domain, specifically from the EpiMap Repository.

Annotation Option Description Default
--region-annotation-database Specifies the interval/regional databases for annotation.
Format: --region-annotation-database subID=[] marker=[] path= field=[field1,field2,...]
Example: --region-annotation-database EpiMap subID=BSS00001 marker=H3K4me3
- name: Database name (e.g., EpiMap).
- subID: Subject ID from EpiMap.
- marker: Epigenetic marker (e.g., H3K4me3).
- path: Path to the database file (optional for EpiMap).
- field: Specific fields to include.
Note: For name=EpiMap, KGGA automatically downloads data from the EpiMap Repository if it is not available locally.
NA
Copyright © MiaoXin Li, all rights reserved. Last modified time: 2025-04-26 03:48:14
