Variant Annotation and Filtration with External Databases enhance the analysis of genetic variations by leveraging external databases to provide additional information about variants. This process aids in filtering and prioritizing variants based on various annotations.

Gene

Gene feature Annotation & Filtration

Gene feature annotation is used to identify which gene or transcript is affected by a variant and what functional role it has on known genes. We now support two gene definition systems: RefSeq genes (refgene) and GENCODE genes (gencode).

Options

The tutorial command is:

java -Xmx8g -jar ../kggsum.jar \
   annot \
   --ref-gty-file ./1kg_hg19_eur_chr1.vcf.gz \
              refG=hg19 \
   --sum-file ./scz_gwas_eur_chr1.tsv.gz \
              cp12Cols=CHR,BP,A1,A2 \
              pbsCols=P,OR,SE \
              betaType=2 \
              prevalence=0.01 \
   --gene-model-database refgene \
   --gene-model-database gencode \
   --threads 18 \
   --output ./ta1
Flag Description Default
annot Trigger the annotation procedure. -
--gene-model-database Specifies the gene model database for annotating variants to genes.
Syntax
Specifies the gene model database for annotating variants to genes.
Syntax
--gene-model-database name [path=<path>]
Arguments
name (Required) The name of the database. Accepted values are:
1. refgene OR gencode: Use a built-in database. KGGSum automatically loads the data from its internal resources directory.
2. customized: Use a user-provided gene model file. This requires the path argument.

path=<path> (Required only when is customized) The location of your custom gene model file. This can be a local file path (e.g., /data/my_genes.txt) or a URL (e.g., http://example.com/genes.txt).

Custom File Format
When using the customized option, the file must adhere to the following format:
No header row: The file should not contain a header.
Four columns: Each row represents a single gene and must contain four space- or tab-separated columns in this order:
Column 1: Chromosome ID (e.g., 1, chrX)
Column 2: Transcription Start Position
Column 3: Transcription End Position
Column 4: Gene Symbol or ID (e.g., MYH9, ENSG00000100345)
Example content:
1 339070 350389 ENSBTAG00000006648
1 475398 475516 ENSBTAG00000049697

Important
This option is mutually exclusive with --xqtl-file. The --xqtl-file option is intended for datasets where variants are already mapped to genes (e.g., eQTLs/pQTLs). Do not use both options in the same run.
-
--upstream-distance Set the region length (bp) of upstream from the transcription start site.
Format: --upstream-distance <int>
Example: --upstream-distance 1000
Valid setting: [int] >=1
1000
--downstream-distance Set the region length (bp) of the downstream from the transcription start site.
Format: --downstream-distance <int>
Example: --downstream-distance 1000
Valid setting: [int] >=1
1000
--disable-gene-feature Disable gene feature annotation.
Format: --disable-gene-feature
--gene-feature-in Retain variants with specified annotated gene features .
Format: --gene-feature-in <int~int>,<int>,...
Example:--gene-feature-in 0~6,9,10
Valid setting: [int] 0 ~ 17
0 ~ 17
... ... ...

KGGA has 18 number codes for the gene features after annotation.

Feature Code Explanation
Frameshift 0 Short insertion or deletion results in a completely different translation from the original.
Nonframeshift 1 Short insertion or deletion results in loss of amino acids in the translated proteins.
Start-loss 2 Indels or nucleotide substitution results in the loss of the start codon (ATG) (mutated into a non-start codon).
Stop-loss 3 Indels or nucleotide substitution results in the loss of stop codons (TAG, TAA, TGA).
Stop-gain 4 Indels or nucleotide substitutions result in the new stop codons (TAG, TAA, TGA), which may truncate the protein.
Splicing 5 Variant is within 3-bp of a splicing junction (use --splicing-distance x to change this; the unit of x is base-pairs).
Missense 6 Nucleotide substitution results in a codon coding for a different amino acid.
Synonymous 7 Nucleotide substitution does not change amino acids.
Exonic 8 Due to the loss of sequences in the reference database, this variant can only be mapped into the exonic region without more precise annotation.
UTR5 9 Within a 5' untranslated region.
UTR3 10 Within a 3' untranslated region.
Intronic 11 Within an intron.
Upstream 12 Within 1-kb region upstream of the transcription start site (use --upstream-distance x to change this, the unit of x is base-pair).
Downstream 13 Within 1-kb region downstream of the transcription end site (use `--downstream-distance x to change this; the unit of x is base-pairs).
ncRNA 14 Within a transcript without protein-coding annotation in the gene definition.
Intergenic 15 Variant is in the intergenic region.
Monomorphic 16 It is not a sequence variation, which may result from bugs in the reference genome in variant calling.
Unknown 17 Variants have no annotation.

Output

The gene feature annotation results are saved in OutputVariants2TSVTask/variants.hg38.tsv.gz. There are two relevant columns in the file:

Header Description
... ...
MarkFeatureGene The gene where a variant is located. When a variant is mapped onto multiple genes, the gene that leads to the smallest code is called the marker gene.
MarkGeneFeature The coordinate of the first SNP.
... ..

Clump

Prune and Clump Variants Based on P-Values and Linkage Disequilibrium (LD)

In KGGSum, the "Clump" feature enables users to prune and clump Genome-Wide Association Study (GWAS) variants using p-values from summary statistics and linkage disequilibrium (LD) calculated from a reference population. This process produces a refined set of independent, statistically significant, and harmonized variants, ready for downstream applications such as Mendelian Randomization (MR) analysis with third-party tools or platforms.

LD Clumping Methodology

The LD clumping method filters out variants within a user-defined physical distance (set by windowKb) if their LD exceeds a specified threshold (controlled by r2Cut). Variant removal follows a prioritized, three-step approach to retain the most informative variants:

  1. Variants with Larger P-Values Variants with less significant p-values (i.e., larger values) are prioritized for removal. The significance threshold is determined by the family-wise error rate (set by pCut) and adjusted using multiple testing correction methods: "fixed" (no correction), "bonf" (Bonferroni), "bhfdr" (Benjamini-Hochberg FDR), "byfdr" (Benjamini-Yekutieli FDR), or "localfdr" (local FDR).
    • When multiple GWAS datasets are provided, the p-value to use is specified by pIndex; otherwise, the p-value from the first dataset is used by default.
    • Among variants with p-values below the cutoff, those with the smallest sum of -log(p) across datasets or missing p-values in any dataset are prioritized for removal.
  2. Variants with More LD Connections Variants in LD with a greater number of other variants within the windowKb distance are prioritized for removal to reduce redundancy.
    • For example, consider three variants: A, B, and C, with LD r² values of 0.8 (A-B), 0.8 (B-C), and 0.64 (A-C). Variant B, which is in LD with both A and C, is removed, while A and C are retained. This preserves variants with fewer LD dependencies.
  3. Variants with Lower Minor Allele Frequency (MAF) If the first two criteria do not differentiate between variants (e.g., equal LD connections and similar p-values), the variant with the lower MAF is removed. Retaining variants with higher MAF ensures greater informativeness, as they are typically more robust for statistical analysis.

This hierarchical approach maximizes the retention of variants that are:

  • Less redundant (fewer LD connections),
  • More significant (smaller p-values),
  • More informative (higher MAF).

Options

The tutorial command is:

java -Xmx8g -jar ../kggsum.jar \
   annot \
   --ld-clump r2Cut=0.1 \
              windowKb=10000 \
              pIndex=1 \
              pCut=0.05 \              
              pCorrect=bhfdr \
   --ref-gty-file ./1kg_hg19_eur_chr1.vcf.gz \
              refG=hg19 \
   --sum-file ./smoking_chr1.tsv.gz \
              cp12Cols=CHR,POS,A1,A2 \
              pbsCols=Pval,Beta,SE \
              prevalence=0.05   \
              betaType=1   \
   --sum-file ./scz_gwas_eur_chr1.tsv.gz \
              cp12Cols=CHR,BP,A1,A2 \
              pbsCols=P,OR,SE \
              betaType=2 \
              prevalence=0.01 \     
   --threads 18 \
   --output ./ta2

This command clumps variants from two GWAS datasets (smoking and schizophrenia), prioritizing the smoking dataset’s p-values. It uses a 10 Mb window and an LD threshold of 0.1, retaining independent, significant variants after FDR correction.

Annotation option Description Default
--ld-clump r2Cut: LD threshold (e.g., 0.1). Controls the degree of independence among retained variants. A lower r2Cut yields a smaller, more independent set.
windowKb: Physical distance window in kilobases (e.g., 10000). Often set between 500 kb and 10000 kb (10 Mb), depending on study needs. Larger windows capture long-range LD but increase computational load, while smaller windows focus on local LD patterns.
pIndex: Index of the p-value column to use when multiple GWAS datasets are provided (e.g., 1). Ensures the correct dataset drives significance assessment in multi-dataset analyses.
pCut: Significance cutoff (e.g., 0.05). Determines the pool of variants considered, with stricter thresholds retaining only highly significant ones.
pCorrect: Multiple testing correction method (e.g., "bhfdr").
Format: --ld-clump r2Cut=<value> windowKb=<value> pIndex=<index> pCut=<value> pCorrect=<method>
Example: --ld-clump r2Cut=0.1 windowKb=10000 pIndex=1 pCut=0.05 pCorrect=bhfdr
[OFF]

Function

Functional score annotation at variants

In KGGSum, the GWAS variants can also be annotated with multiple genomic features. Three databases are available: gnomAD for allele frequency annotation, CADD for variant function annotation, and ClinVar for disease linkage annotation. Note that the annotation datasets should be downloaded from an independent resource domain of KGGA.

Database Name Short Description Tag
CADD Combined Annotation Dependent Depletion (CADD) is a widely used matrix for mutation deleteriousness and integrates more than 100 annotations for all possible single-nucleotide variants (SNVs) of the GRCh38/hg38 human reference genome. cadd
Favor Functional Annotation of Variants - Online Resource (FAVOR) provides comprehensive multi-faceted variant functional annotations that summarize findings of all possible nine billion SNVs across the genome (build GRCh38). favor
ClinVar ClinVar is a public database managed by the National Center for Biotechnology Information (NCBI) that provides information about the relationship between genetic variation and human health. clinvar

Options

The tutorial command is:

java -Xmx8g -jar ../kggsum.jar \
   annot \
   --ref-gty-file ./1kg_hg19_eur_chr1.vcf.gz \
              refG=hg19 \
   --sum-file ./scz_gwas_eur_chr1.tsv.gz \
              cp12Cols=CHR,BP,A1,A2 \
              pbsCols=P,OR,SE \
              betaType=2 \
              prevalence=0.01 \
   --variant-annotation-database cadd \
              field=Epigenetics::EncodeDNase-max,Epigenetics::EncodeDNase-sum,ProteinFunction::CADD_PHRED \
   --threads 18 \
   --output ./ta2
Annotation option Description Default
--variant-annotation-database Set the reference databases used for annotation at variants. --variant-annotation-database is a combination of parameters, and the usage rules are the same as those of --freq-database.
Format: --variant-annotation-database <name> path=<path> field=[field1,field2,...]
Example: --variant-annotation-database cadd field=ProteinFunction::CADD_PHRED
[OFF]

Additional epigenetic resources from third-party databases

To facilitate the convenient use of more resources, KGGSum provides an approach that allows users to specify customized third-party resources for annotation. For example, by setting the file name documented in EpiMap, KGGSum can directly download epigenetic marker resources from the EpiGenome public domain, specifically from the EpiMap Repository.

Database Name Short Description
EpiMap EpiMap is one of the most comprehensive maps of the human epigenome, providing approximately 15,000 datasets across 833 bio-samples and 18 epigenomic marks. It delivers rich gene-regulatory annotations encompassing chromatin states, high-resolution enhancers, activity patterns, enhancer modules, upstream regulators, and downstream target genes.
Annotation Option Description Default
--region-annotation-database Specifies the interval/regional databases for annotation.
Format: --region-annotation-database subID=[] marker=[] path= field=[field1,field2,...]
Example: --region-annotation-database EpiMap subID=BSS00001 marker=H3K4me3
- name: Database name (e.g., EpiMap).
- subID: Subject ID from EpiMap.
- marker: Epigenetic marker (e.g., H3K4me3).
- path: Path to the database file (optional for EpiMap).
- field: Specific fields to include.
Note: For name=EpiMap, KGGSum automatically downloads data from the EpiMap Repository if not locally available.
NA

Frequency

Allele Frequency Annotation & Filtration

Allele frequency annotation allows users to incorporate population-level allele frequency information into the analysis of genetic variants. Click to view the provided allele frequency annotation databases.

Database Name Short Description Tag
gnomAD Allele frequency data in the Genome Aggregation Database (gnomAD) v4 dataset (GRCh38) is derived from 730,947 exomes and 76,215 genomes from unrelated individuals of diverse ancestries. gnomad

Options

The tutorial command is:

java -Xmx4g -jar ../kggsum.jar \
   annot \
   --ref-gty-file ./1kg_hg19_eur_chr1.vcf.gz \
              refG=hg19 \
   --sum-file ./scz_gwas_eur_chr1.tsv.gz \
              cp12Cols=CHR,BP,A1,A2 \
              pbsCols=P,OR,SE \
              betaType=2 \
              prevalence=0.01 \       
     --freq-database gnomad \
              field=gnomAD_joint::ALL,gnomAD_joint::NFE \
   --threads 18 \
   --output ./ta3
Annotation option Description Default
--freq-database Set the reference databases used for allele frequency annotation. --freq-database is a combination of parameters. The <name> is identified as the database name (such as gnomad). The <path>identifies the path to the database, which can be a local file path or an FTP/HTTP file path. If the path of the resources folder has been specified, the <path> parameter can be bypassed. The <field> is identified as the specified field filtered under this database. If no value is set, all fields of the specified database are selected by default.
Format: --freq-database <name> path=<path> field=[field1,field2,...]
Example: --freq-database gnomad field=gnomAD::EAS
[OFF]

Once the reference databases for allele frequency annotation have been properly configured, you can effectively filter variants by examining their allele frequencies within the reference population.

Filtration option Description Default
--db-af Exclude variants with alternative allele frequency (AF) outside the range [min, max] in allele frequency databases.
Format: --db-af <min>~<max>
Example:--db-af 0.05~1.0
Valid setting: [float] 0.0 ~ 1.0
[OFF]
--db-maf Exclude variants with minor allele frequency (MAF) outside the range [min, max] in allele frequency databases.
Format: --db-maf <min>~<max>
Example:--db-maf 0.05~0.5
Valid setting: [float] 0.0 ~ 0.5
[OFF]
Copyright ©MiaoXin Li all right reservedLast modified time: 2025-07-26 03:22:20

results matching ""

    No results matching ""