Variant Annotation and Filtration with External Databases

Annotating variants against external databases adds information that the variant calls themselves do not carry, such as allele frequencies, expression levels, gene features, prediction scores, and epigenetic markers. These annotations can then be used to filter and prioritize variants.

Allele Frequency

Allele frequency annotation incorporates population-level allele frequencies into the analysis of genetic variants. For available databases, refer to the allele frequency databases. Users can also build custom annotation files by following the guidance.

Annotation Option	Description	Default
--freq-database	Specifies the reference databases for allele frequency annotation. Format: --freq-database name path=path field=[field1,field2,...] Example: --freq-database gnomad field=gnomAD::EAS - name: Database name (e.g., gnomad). - path: Path to the database (local or FTP/HTTP). If the resources folder is specified, path can be omitted. - field: Specific fields to include (e.g., gnomAD::EAS). If unspecified, all fields are selected.	[OFF]
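
To make the option layout concrete, the following Python sketch splits a database specification such as `gnomad field=gnomAD::EAS` into a name, an optional path, and an optional field list. The function `parse_database_spec` and the dictionary it returns are hypothetical and not part of KGGA; the sketch assumes only the layout described in the table above, and KGGA's own parser may behave differently.

```python
def parse_database_spec(spec: str) -> dict:
    """Split 'name path=... field=[f1,f2]' into its components.

    Hypothetical helper for illustration only.
    """
    tokens = spec.split()
    if not tokens:
        raise ValueError("empty database specification")

    # The first token is the database name; the rest are key=value pairs.
    result = {"name": tokens[0], "path": None, "fields": None}
    for token in tokens[1:]:
        key, sep, value = token.partition("=")
        if not sep:
            raise ValueError(f"expected key=value, got {token!r}")
        if key == "path":
            result["path"] = value
        elif key == "field":
            # The examples show the field list both with and without brackets.
            result["fields"] = [f for f in value.strip("[]").split(",") if f]
        else:
            raise ValueError(f"unknown key {key!r}")
    return result


spec = parse_database_spec("gnomad field=gnomAD::EAS")
print(spec["name"], spec["path"], spec["fields"])
```

A `fields` value of `None` stands for "all fields selected", matching the behaviour described for an unspecified field list.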

Once configured, users can filter variants based on allele frequencies in the reference population.

Filtration Option	Description	Default
--db-af	Excludes variants with alternate allele frequency (AF) outside the range [min, max]. Format: --db-af min~max Example: --db-af 0.05~1.0 Valid Range: 0.0 to 1.0	[OFF]
--db-maf	Excludes variants with minor allele frequency (MAF) outside the range [min, max]. Format: --db-maf min~max Example: --db-maf 0.05~0.5 Valid Range: 0.0 to 0.5	[OFF]
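
The difference between the two filters can be sketched in Python. This is an illustration under stated assumptions, not KGGA's implementation: the range bounds are treated as inclusive, MAF is taken as the smaller of the alternate and reference frequencies at a biallelic site, and the variant names and frequencies are made up.

```python
def parse_range(text: str) -> tuple[float, float]:
    """Parse a 'min~max' string such as '0.05~1.0'."""
    low, high = (float(part) for part in text.split("~"))
    if low > high:
        raise ValueError(f"min {low} exceeds max {high}")
    return low, high


def minor_allele_frequency(af: float) -> float:
    """Assumption: biallelic site, so MAF = min(AF, 1 - AF)."""
    return min(af, 1.0 - af)


def keep_by_af(af: float, af_range: str) -> bool:
    """Mimic --db-af: keep the variant if AF lies inside [min, max]."""
    low, high = parse_range(af_range)
    return low <= af <= high  # assumption: bounds are inclusive


def keep_by_maf(af: float, maf_range: str) -> bool:
    """Mimic --db-maf: keep the variant if MAF lies inside [min, max]."""
    low, high = parse_range(maf_range)
    return low <= minor_allele_frequency(af) <= high


# Hypothetical alternate allele frequencies from a reference population.
variants = {"var_a": 0.02, "var_b": 0.30, "var_c": 0.97}

kept_af = [name for name, af in variants.items() if keep_by_af(af, "0.05~1.0")]
kept_maf = [name for name, af in variants.items() if keep_by_maf(af, "0.05~0.5")]
print(kept_af)
print(kept_maf)
```

Here `var_c` passes the AF range because its alternate allele is common, but fails the MAF range because its minor (reference) allele is rare. How variants absent from the database are handled is not covered by this sketch.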

Allelic Expression

Expression annotation allows users to incorporate allelic expression of variants into the analysis of genetic variants. Normally, only expressed mutations are actionable. For available databases, see the curated GTEx expression dataset.

Annotation Option	Description	Default
--exp-database	Specifies the reference databases for expression annotation. Format: --exp-database name path=path field=[field1,field2,...] Example: --exp-database field=	[OFF]

Once configured, users can filter variants based on their expression values.

Filtration Option	Description	Default
--exp-range	Excludes variants with standardized expression values outside the range [min, max]. Format: --exp-range min~max Example: --exp-range 0.05~1.0 Valid Range: 0.0 to 1.0	[OFF]

Gene Features

Gene feature annotation identifies which gene or transcript a variant affects and its functional role. KGGA supports three gene definition systems: RefSeq genes (refgene), GENCODE genes (gencode), and UCSC KnownGene (knowngene). Users can also provide customized gene model databases following the example format.

Annotation Option	Description	Default
--gene-model-database	Specifies the gene model database for annotation. Format: --gene-model-database name path=path Example: --gene-model-database refgene - name: Database name (e.g., refgene). - path: Path to the database (optional if resources folder is set).	refgene

Users can fine-tune gene feature annotations with the following parameters:

Annotation Option	Description	Default
--splicing-distance	Sets the base-pair distance for defining splicing junction variants. Format: --splicing-distance N Example: --splicing-distance 3 Valid Setting: ≥1	3
--upstream-distance	Sets the upstream region length (bp) from the transcription start site. Format: --upstream-distance N Example: --upstream-distance 1000 Valid Setting: ≥1	1000
--downstream-distance	Sets the downstream region length (bp) from the transcription end site. Format: --downstream-distance N Example: --downstream-distance 1000 Valid Setting: ≥1	1000
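
As a rough illustration of what these three distances control, the Python sketch below classifies a position relative to a single transcript. It is not KGGA's algorithm. It assumes 1-based inclusive coordinates, treats only intronic positions as splicing candidates, counts distances inclusively, and swaps upstream and downstream on the minus strand; the function names and coordinates are hypothetical.

```python
def is_splicing(pos: int, exons: list[tuple[int, int]], splicing_distance: int = 3) -> bool:
    """True if pos is intronic and within splicing_distance bp of an exon edge.

    `exons` is a list of (start, end) pairs sorted by genomic position.
    """
    for (_, left_end), (right_start, _) in zip(exons, exons[1:]):
        if left_end < pos < right_start:  # pos lies inside this intron
            return (pos - left_end <= splicing_distance
                    or right_start - pos <= splicing_distance)
    return False


def classify_flank(pos: int, tx_start: int, tx_end: int, strand: str,
                   upstream: int = 1000, downstream: int = 1000):
    """Return 'upstream', 'downstream', or None for a position near a transcript."""
    if tx_start <= pos <= tx_end:
        return None  # inside the transcript, not a flanking position
    if pos < tx_start:
        distance = tx_start - pos
        side = "upstream" if strand == "+" else "downstream"
    else:
        distance = pos - tx_end
        side = "downstream" if strand == "+" else "upstream"
    limit = upstream if side == "upstream" else downstream
    return side if distance <= limit else None


exons = [(100, 200), (300, 400)]  # hypothetical two-exon transcript
print(is_splicing(203, exons))                # 3 bp into the intron
print(is_splicing(204, exons))                # 4 bp into the intron
print(classify_flank(50, 100, 400, "+"))      # 50 bp before the transcript start
print(classify_flank(1500, 100, 400, "+"))    # 1100 bp past the transcript end
```

Raising `splicing_distance` widens the window around each exon boundary, and raising the upstream or downstream limit pulls more flanking positions out of the intergenic category.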

To disable gene feature annotation:

Annotation Option	Description	Default
--disable-gene-feature	Disables gene feature annotation. Format: --disable-gene-feature	[OFF]

KGGA assigns numeric codes to gene features for easy reference:

Feature	Code	Explanation
Frameshift	0	Short indel causing a frameshift.
Nonframeshift	1	Short indel that inserts or deletes amino acids without shifting the reading frame.
Start-loss	2	Loss of start codon (ATG).
Stop-loss	3	Loss of stop codon (TAG, TAA, TGA).
Stop-gain	4	Gain of a stop codon, potentially truncating the protein.
Splicing	5	Within 3 bp of a splicing junction (adjustable).
Missense	6	Codon change leading to a different amino acid.
Synonymous	7	Codon change without amino acid alteration.
Exonic	8	Mapped to the exonic region without precise annotation.
UTR5	9	Within 5' untranslated region.
UTR3	10	Within 3' untranslated region.
Intronic	11	Within an intron.
Upstream	12	Within 1 kb upstream of transcription start site (adjustable).
Downstream	13	Within 1 kb downstream of transcription end site (adjustable).
ncRNA	14	Within a non-coding RNA transcript.
Intergenic	15	In intergenic region.
Monomorphic	16	Not a sequence variation (potential reference genome error).
Unknown	17	No annotation available.
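
When post-processing annotated output in a script, it can help to decode these numbers back into names. The sketch below mirrors the table as a Python `IntEnum`; the class name `GeneFeature` and its member names are illustrative choices, not identifiers defined by KGGA.

```python
from enum import IntEnum


class GeneFeature(IntEnum):
    """Gene feature codes, copied from the table above."""
    FRAMESHIFT = 0
    NONFRAMESHIFT = 1
    START_LOSS = 2
    STOP_LOSS = 3
    STOP_GAIN = 4
    SPLICING = 5
    MISSENSE = 6
    SYNONYMOUS = 7
    EXONIC = 8
    UTR5 = 9
    UTR3 = 10
    INTRONIC = 11
    UPSTREAM = 12
    DOWNSTREAM = 13
    NCRNA = 14
    INTERGENIC = 15
    MONOMORPHIC = 16
    UNKNOWN = 17


# Decode a numeric code into a readable name.
print(GeneFeature(6).name)

# Collect the codes from frameshift through missense (0 to 6).
codes_0_to_6 = sorted(int(f) for f in GeneFeature if f <= GeneFeature.MISSENSE)
print(codes_0_to_6)
```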

Users can filter variants based on specific gene features or genes:

Filtration Option	Description	Default
--gene-feature-included	Retains variants with specified gene feature codes. Codes can be listed individually or as ranges written start~end. Format: --gene-feature-included code1,code2,... Example: --gene-feature-included 0~6,9,10 Valid Codes: 0 to 17	0~17
--gene-excluded	Excludes variants associated with specified genes. Format: --gene-excluded GeneSymbol1,GeneSymbol2,...	[OFF]
--gene-retained	Retains variants associated with specified genes. Format: --gene-retained GeneSymbol1,GeneSymbol2,...	[OFF]
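
The code list syntax in the `--gene-feature-included` example (`0~6,9,10`) can be expanded as in the Python sketch below. The function `parse_feature_codes` is hypothetical and the validation rules are assumptions; it only illustrates how such a string maps to a set of codes.

```python
def parse_feature_codes(text: str, max_code: int = 17) -> set[int]:
    """Expand a string like '0~6,9,10' into the set of codes it names."""
    codes: set[int] = set()
    for part in text.split(","):
        part = part.strip()
        if not part:
            continue
        if "~" in part:
            # A range such as '0~6' covers both endpoints.
            low, high = (int(value) for value in part.split("~"))
            if low > high:
                raise ValueError(f"invalid range {part!r}")
            codes.update(range(low, high + 1))
        else:
            codes.add(int(part))

    invalid = sorted(c for c in codes if not 0 <= c <= max_code)
    if invalid:
        raise ValueError(f"codes outside 0..{max_code}: {invalid}")
    return codes


included = parse_feature_codes("0~6,9,10")

# Hypothetical annotated variants as (identifier, gene feature code) pairs.
annotated = [("var_a", 6), ("var_b", 11), ("var_c", 9)]
retained = [name for name, code in annotated if code in included]
print(sorted(included))
print(retained)
```

With this selection, the intronic variant (code 11) is dropped while the missense (6) and 5' UTR (9) variants are kept.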

Prediction Scores

KGGA provides access to a curated selection of databases for variant functional and pathogenic prediction scores.

Annotation Option	Description	Default
--variant-annotation-database	Specifies the databases for variant annotation. Format: --variant-annotation-database name path=path field=[field1,field2,...] Example: --variant-annotation-database dbnsfp field=dbNSFP::VEST,dbNSFP::FATHMM-XF,dbNSFP::Eigen,dbNSFP::CADD KGGA currently provides four databases: CADD, FAVOR, dbNSFP, and ClinVar.	[OFF]

Epigenetic Markers

KGGA can download epigenetic marker resources directly from public epigenome data, specifically from the EpiMap Repository.

Annotation Option	Description	Default
--region-annotation-database	Specifies the interval/regional databases for annotation. Format: --region-annotation-database name subID=[] marker=[] path=path field=[field1,field2,...] Example: --region-annotation-database EpiMap subID=BSS00001 marker=H3K4me3 - name: Database name (e.g., EpiMap). - subID: Subject ID from EpiMap. - marker: Epigenetic marker (e.g., H3K4me3). - path: Path to the database file (optional for EpiMap). - field: Specific fields to include. Note: For name=EpiMap, KGGA automatically downloads data from the EpiMap Repository if not locally available.	NA
