Variant Annotation and Filtration with External Databases enhance the analysis of genetic variations by leveraging external databases to provide additional information about variants. This process aids in filtering and prioritizing variants based on various annotations.
Gene
Gene feature Annotation & Filtration
Gene feature annotation is used to identify which gene or transcript is affected by a variant and what functional role it has on known genes. We now support two gene definition systems: RefSeq genes (refgene) and GENCODE genes (gencode).
Options
The tutorial command is:
java -Xmx8g -jar ../kggsum.jar \
annot \
--ref-gty-file ./1kg_hg19_eur_chr1.vcf.gz \
refG=hg19 \
--sum-file ./scz_gwas_eur_chr1.tsv.gz \
cp12Cols=CHR,BP,A1,A2 \
pbsCols=P,OR,SE \
betaType=2 \
prevalence=0.01 \
--gene-model-database refgene \
--gene-model-database gencode \
--threads 18 \
--output ./ta1
Flag | Description | Default |
---|---|---|
annot |
Trigger the annotation procedure. | - |
--gene-model-database |
Specifies the gene model database for annotating variants to genes. Syntax Specifies the gene model database for annotating variants to genes. Syntax --gene-model-database name [path=<path>] Arguments name (Required) The name of the database. Accepted values are:1. refgene OR gencode: Use a built-in database. KGGSum automatically loads the data from its internal resources directory. 2. customized: Use a user-provided gene model file. This requires the path argument. path=<path> (Required only when Custom File Format When using the customized option, the file must adhere to the following format: No header row: The file should not contain a header. Four columns: Each row represents a single gene and must contain four space- or tab-separated columns in this order: Column 1: Chromosome ID (e.g., 1, chrX) Column 2: Transcription Start Position Column 3: Transcription End Position Column 4: Gene Symbol or ID (e.g., MYH9, ENSG00000100345) Example content: 1 339070 350389 ENSBTAG00000006648 1 475398 475516 ENSBTAG00000049697 Important This option is mutually exclusive with --xqtl-file. The --xqtl-file option is intended for datasets where variants are already mapped to genes (e.g., eQTLs/pQTLs). Do not use both options in the same run. |
- |
--upstream-distance |
Set the region length (bp) of upstream from the transcription start site. Format: --upstream-distance <int> Example: --upstream-distance 1000 Valid setting: [int] >=1 |
1000 |
--downstream-distance |
Set the region length (bp) of the downstream from the transcription start site. Format: --downstream-distance <int> Example: --downstream-distance 1000 Valid setting: [int] >=1 |
1000 |
--disable-gene-feature |
Disable gene feature annotation. Format: --disable-gene-feature |
|
--gene-feature-in |
Retain variants with specified annotated gene features . Format: --gene-feature-in <int~int>,<int>,... Example: --gene-feature-in 0~6,9,10 Valid setting: [int] 0 ~ 17 |
0 ~ 17 |
... | ... | ... |
KGGA has 18 number codes for the gene features after annotation.
Feature | Code | Explanation |
---|---|---|
Frameshift | 0 | Short insertion or deletion results in a completely different translation from the original. |
Nonframeshift | 1 | Short insertion or deletion results in loss of amino acids in the translated proteins. |
Start-loss | 2 | Indels or nucleotide substitution results in the loss of the start codon (ATG) (mutated into a non-start codon). |
Stop-loss | 3 | Indels or nucleotide substitution results in the loss of stop codons (TAG, TAA, TGA). |
Stop-gain | 4 | Indels or nucleotide substitutions result in the new stop codons (TAG, TAA, TGA), which may truncate the protein. |
Splicing | 5 | Variant is within 3-bp of a splicing junction (use --splicing-distance x to change this; the unit of x is base-pairs). |
Missense | 6 | Nucleotide substitution results in a codon coding for a different amino acid. |
Synonymous | 7 | Nucleotide substitution does not change amino acids. |
Exonic | 8 | Due to the loss of sequences in the reference database, this variant can only be mapped into the exonic region without more precise annotation. |
UTR5 | 9 | Within a 5' untranslated region. |
UTR3 | 10 | Within a 3' untranslated region. |
Intronic | 11 | Within an intron. |
Upstream | 12 | Within 1-kb region upstream of the transcription start site (use --upstream-distance x to change this, the unit of x is base-pair). |
Downstream | 13 | Within 1-kb region downstream of the transcription end site (use `--downstream-distance x to change this; the unit of x is base-pairs). |
ncRNA | 14 | Within a transcript without protein-coding annotation in the gene definition. |
Intergenic | 15 | Variant is in the intergenic region. |
Monomorphic | 16 | It is not a sequence variation, which may result from bugs in the reference genome in variant calling. |
Unknown | 17 | Variants have no annotation. |
Output
The gene feature annotation results are saved in OutputVariants2TSVTask/variants.hg38.tsv.gz
. There are two relevant columns in the file:
Header | Description |
---|---|
... | ... |
MarkFeatureGene | The gene where a variant is located. When a variant is mapped onto multiple genes, the gene that leads to the smallest code is called the marker gene. |
MarkGeneFeature | The coordinate of the first SNP. |
... | .. |
Clump
Prune and Clump Variants Based on P-Values and Linkage Disequilibrium (LD)
In KGGSum, the "Clump" feature enables users to prune and clump Genome-Wide Association Study (GWAS) variants using p-values from summary statistics and linkage disequilibrium (LD) calculated from a reference population. This process produces a refined set of independent, statistically significant, and harmonized variants, ready for downstream applications such as Mendelian Randomization (MR) analysis with third-party tools or platforms.
LD Clumping Methodology
The LD clumping method filters out variants within a user-defined physical distance (set by windowKb) if their LD exceeds a specified threshold (controlled by r2Cut). Variant removal follows a prioritized, three-step approach to retain the most informative variants:
- Variants with Larger P-Values
Variants with less significant p-values (i.e., larger values) are prioritized for removal. The significance threshold is determined by the family-wise error rate (set by pCut) and adjusted using multiple testing correction methods: "fixed" (no correction), "bonf" (Bonferroni), "bhfdr" (Benjamini-Hochberg FDR), "byfdr" (Benjamini-Yekutieli FDR), or "localfdr" (local FDR).
- When multiple GWAS datasets are provided, the p-value to use is specified by pIndex; otherwise, the p-value from the first dataset is used by default.
- Among variants with p-values below the cutoff, those with the smallest sum of -log(p) across datasets or missing p-values in any dataset are prioritized for removal.
- Variants with More LD Connections
Variants in LD with a greater number of other variants within the windowKb distance are prioritized for removal to reduce redundancy.
- For example, consider three variants: A, B, and C, with LD r² values of 0.8 (A-B), 0.8 (B-C), and 0.64 (A-C). Variant B, which is in LD with both A and C, is removed, while A and C are retained. This preserves variants with fewer LD dependencies.
- Variants with Lower Minor Allele Frequency (MAF) If the first two criteria do not differentiate between variants (e.g., equal LD connections and similar p-values), the variant with the lower MAF is removed. Retaining variants with higher MAF ensures greater informativeness, as they are typically more robust for statistical analysis.
This hierarchical approach maximizes the retention of variants that are:
- Less redundant (fewer LD connections),
- More significant (smaller p-values),
- More informative (higher MAF).
Options
The tutorial command is:
java -Xmx8g -jar ../kggsum.jar \
annot \
--ld-clump r2Cut=0.1 \
windowKb=10000 \
pIndex=1 \
pCut=0.05 \
pCorrect=bhfdr \
--ref-gty-file ./1kg_hg19_eur_chr1.vcf.gz \
refG=hg19 \
--sum-file ./smoking_chr1.tsv.gz \
cp12Cols=CHR,POS,A1,A2 \
pbsCols=Pval,Beta,SE \
prevalence=0.05 \
betaType=1 \
--sum-file ./scz_gwas_eur_chr1.tsv.gz \
cp12Cols=CHR,BP,A1,A2 \
pbsCols=P,OR,SE \
betaType=2 \
prevalence=0.01 \
--threads 18 \
--output ./ta2
This command clumps variants from two GWAS datasets (smoking and schizophrenia), prioritizing the smoking dataset’s p-values. It uses a 10 Mb window and an LD threshold of 0.1, retaining independent, significant variants after FDR correction.
Annotation option | Description | Default |
---|---|---|
--ld-clump |
r2Cut : LD threshold (e.g., 0.1). Controls the degree of independence among retained variants. A lower r2Cut yields a smaller, more independent set.windowKb : Physical distance window in kilobases (e.g., 10000). Often set between 500 kb and 10000 kb (10 Mb), depending on study needs. Larger windows capture long-range LD but increase computational load, while smaller windows focus on local LD patterns.pIndex : Index of the p-value column to use when multiple GWAS datasets are provided (e.g., 1). Ensures the correct dataset drives significance assessment in multi-dataset analyses.pCut : Significance cutoff (e.g., 0.05). Determines the pool of variants considered, with stricter thresholds retaining only highly significant ones.pCorrect : Multiple testing correction method (e.g., "bhfdr").Format: --ld-clump r2Cut=<value> windowKb=<value> pIndex=<index> pCut=<value> pCorrect=<method> Example: --ld-clump r2Cut=0.1 windowKb=10000 pIndex=1 pCut=0.05 pCorrect=bhfdr |
[OFF] |
Function
Functional score annotation at variants
In KGGSum, the GWAS variants can also be annotated with multiple genomic features. Three databases are available: gnomAD for allele frequency annotation, CADD for variant function annotation, and ClinVar for disease linkage annotation. Note that the annotation datasets should be downloaded from an independent resource domain of KGGA.
Database Name | Short Description | Tag |
---|---|---|
CADD | Combined Annotation Dependent Depletion (CADD) is a widely used matrix for mutation deleteriousness and integrates more than 100 annotations for all possible single-nucleotide variants (SNVs) of the GRCh38/hg38 human reference genome. | cadd |
Favor | Functional Annotation of Variants - Online Resource (FAVOR) provides comprehensive multi-faceted variant functional annotations that summarize findings of all possible nine billion SNVs across the genome (build GRCh38). | favor |
ClinVar | ClinVar is a public database managed by the National Center for Biotechnology Information (NCBI) that provides information about the relationship between genetic variation and human health. | clinvar |
Options
The tutorial command is:
java -Xmx8g -jar ../kggsum.jar \
annot \
--ref-gty-file ./1kg_hg19_eur_chr1.vcf.gz \
refG=hg19 \
--sum-file ./scz_gwas_eur_chr1.tsv.gz \
cp12Cols=CHR,BP,A1,A2 \
pbsCols=P,OR,SE \
betaType=2 \
prevalence=0.01 \
--variant-annotation-database cadd \
field=Epigenetics::EncodeDNase-max,Epigenetics::EncodeDNase-sum,ProteinFunction::CADD_PHRED \
--threads 18 \
--output ./ta2
Annotation option | Description | Default |
---|---|---|
--variant-annotation-database |
Set the reference databases used for annotation at variants. --variant-annotation-database is a combination of parameters, and the usage rules are the same as those of --freq-database .Format: --variant-annotation-database <name> path=<path> field=[field1,field2,...] Example: --variant-annotation-database cadd field=ProteinFunction::CADD_PHRED |
[OFF] |
Additional epigenetic resources from third-party databases
To facilitate the convenient use of more resources, KGGSum provides an approach that allows users to specify customized third-party resources for annotation. For example, by setting the file name documented in EpiMap, KGGSum can directly download epigenetic marker resources from the EpiGenome public domain, specifically from the EpiMap Repository.
Database Name | Short Description |
---|---|
EpiMap | EpiMap is one of the most comprehensive maps of the human epigenome, providing approximately 15,000 datasets across 833 bio-samples and 18 epigenomic marks. It delivers rich gene-regulatory annotations encompassing chromatin states, high-resolution enhancers, activity patterns, enhancer modules, upstream regulators, and downstream target genes. |
Annotation Option | Description | Default |
---|---|---|
--region-annotation-database | Specifies the interval/regional databases for annotation. Format: --region-annotation-database Example: --region-annotation-database EpiMap subID=BSS00001 marker=H3K4me3 - name: Database name (e.g., EpiMap). - subID: Subject ID from EpiMap. - marker: Epigenetic marker (e.g., H3K4me3). - path: Path to the database file (optional for EpiMap). - field: Specific fields to include. Note: For name=EpiMap, KGGSum automatically downloads data from the EpiMap Repository if not locally available. |
NA |
Frequency
Allele Frequency Annotation & Filtration
Allele frequency annotation allows users to incorporate population-level allele frequency information into the analysis of genetic variants. Click to view the provided allele frequency annotation databases.
Database Name | Short Description | Tag |
---|---|---|
gnomAD | Allele frequency data in the Genome Aggregation Database (gnomAD) v4 dataset (GRCh38) is derived from 730,947 exomes and 76,215 genomes from unrelated individuals of diverse ancestries. | gnomad |
Options
The tutorial command is:
java -Xmx4g -jar ../kggsum.jar \
annot \
--ref-gty-file ./1kg_hg19_eur_chr1.vcf.gz \
refG=hg19 \
--sum-file ./scz_gwas_eur_chr1.tsv.gz \
cp12Cols=CHR,BP,A1,A2 \
pbsCols=P,OR,SE \
betaType=2 \
prevalence=0.01 \
--freq-database gnomad \
field=gnomAD_joint::ALL,gnomAD_joint::NFE \
--threads 18 \
--output ./ta3
Annotation option | Description | Default |
---|---|---|
--freq-database |
Set the reference databases used for allele frequency annotation. --freq-database is a combination of parameters. The <name> is identified as the database name (such as gnomad). The <path> identifies the path to the database, which can be a local file path or an FTP/HTTP file path. If the path of the resources folder has been specified, the <path> parameter can be bypassed. The <field> is identified as the specified field filtered under this database. If no value is set, all fields of the specified database are selected by default.Format: --freq-database <name> path=<path> field=[field1,field2,...] Example: --freq-database gnomad field=gnomAD::EAS |
[OFF] |
Once the reference databases for allele frequency annotation have been properly configured, you can effectively filter variants by examining their allele frequencies within the reference population.
Filtration option | Description | Default |
---|---|---|
--db-af |
Exclude variants with alternative allele frequency (AF) outside the range [min , max ] in allele frequency databases.Format: --db-af <min>~<max> Example: --db-af 0.05~1.0 Valid setting: [float] 0.0 ~ 1.0 |
[OFF] |
--db-maf |
Exclude variants with minor allele frequency (MAF) outside the range [min , max ] in allele frequency databases.Format: --db-maf <min>~<max> Example: --db-maf 0.05~0.5 Valid setting: [float] 0.0 ~ 0.5 |
[OFF] |