Methods & Options

Variant Annotation and Filtration with External Databases enhance the analysis of genetic variations by leveraging external databases to provide additional information about variants. This process aids in filtering and prioritizing variants based on various annotations.

How to use this page¶

This page documents analysis-specific options for the annot command in KGGSum.

Quick jump: - Gene - Clump - Function - RSID - Frequency

General idea: - --ref-gty-file + --sum-file define the variant set and required context - each section enables a different type of annotation/filtering - enabled annotation results are typically exported under OutputVariants2TSVTask/

Gene¶

What this section does¶

Annotates variants with gene features (refgene/gencode) and optionally filters variants by gene-feature codes.

Inputs you typically set¶

--gene-model-database (refgene/gencode or a custom gene model file)
--upstream-distance / --downstream-distance (gene-feature window sizes)
--gene-feature-in (keep only selected feature codes) and/or --disable-gene-feature

Output you typically scan¶

OutputVariants2TSVTask/variants.hg38.tsv.gz (look for MarkFeatureGene and MarkGeneFeature)

Quick interpretation (scan order)¶

Confirm gene feature codes are present and match your intended gene model (refgene vs gencode).
Check MarkFeatureGene and MarkGeneFeature are filled for the variants you expected to annotate.
If many variants are missing gene feature assignments, review --gene-model-database and gene-feature window sizes (--upstream-distance/--downstream-distance).

Gene feature Annotation & Filtration¶

Gene feature annotation is used to identify which gene or transcript is affected by a variant and what functional role it has on known genes. We now support two gene definition systems: RefSeq genes (refgene) and GENCODE genes (gencode).

Options¶

The tutorial command is:

java -Xmx8g -jar ../kggsum.jar \
   annot \
   --ref-gty-file ./1kg_hg19_eur_chr1.vcf.gz \
              refG=hg19 \
   --sum-file ./scz_gwas_eur_chr1.tsv.gz \
              cp12Cols=CHR,BP,A1,A2 \
              pbsCols=P,OR,SE \
              betaType=2 \
              prevalence=0.01 \
   --gene-model-database refgene \
   --gene-model-database gencode \
   --threads 18 \
   --output ./ta1

Flag	Description	Default
`annot`	Trigger the annotation procedure.	-
`--gene-model-database`	Specifies the gene model database for annotating variants to genes. Syntax Specifies the gene model database for annotating variants to genes. Syntax `--gene-model-database name [path=<path>]` Arguments `name` (Required) The name of the database. Accepted values are: 1. refgene OR gencode: Use a built-in database. KGGSum automatically loads the data from its internal resources directory. 2. customized: Use a user-provided gene model file. This requires the path argument. `path=<path>` (Required only when is customized) The location of your custom gene model file. This can be a local file path (e.g., /data/my_genes.txt) or a URL (e.g., http://example.com/genes.txt). Custom File Format When using the customized option, the file must adhere to the following format: No header row: The file should not contain a header. Four columns: Each row represents a single gene and must contain four space- or tab-separated columns in this order: Column 1: Chromosome ID (e.g., 1, chrX) Column 2: Transcription Start Position Column 3: Transcription End Position Column 4: Gene Symbol or ID (e.g., MYH9, ENSG00000100345) Example content: 1 339070 350389 ENSBTAG00000006648 1 475398 475516 ENSBTAG00000049697 Important This option is mutually exclusive with –xqtl-file. The –xqtl-file option is intended for datasets where variants are already mapped to genes (e.g., eQTLs/pQTLs). Do not use both options in the same run.	-
`--upstream-distance`	Set the region length (bp) of upstream from the transcription start site. Format: `--upstream-distance <int>` Example: `--upstream-distance 1000` Valid setting: [int] >=1	1000
`--downstream-distance`	Set the region length (bp) of the downstream from the transcription start site. Format: `--downstream-distance <int>` Example: `--downstream-distance 1000` Valid setting: [int] >=1	1000
`--disable-gene-feature`	Disable gene feature annotation. Format: `--disable-gene-feature`
`--gene-feature-in`	Retain variants with specified annotated gene features . Format: `--gene-feature-in <int~int>,<int>,...` Example:`--gene-feature-in 0~6,9,10` Valid setting: [int] 0 ~ 17	0 ~ 17
…	…	…

KGGA has 18 number codes for the gene features after annotation.

Feature	Code	Explanation
Frameshift	0	Short insertion or deletion results in a completely different translation from the original.
Nonframeshift	1	Short insertion or deletion results in loss of amino acids in the translated proteins.
Start-loss	2	Indels or nucleotide substitution results in the loss of the start codon (ATG) (mutated into a non-start codon).
Stop-loss	3	Indels or nucleotide substitution results in the loss of stop codons (TAG, TAA, TGA).
Stop-gain	4	Indels or nucleotide substitutions result in the new stop codons (TAG, TAA, TGA), which may truncate the protein.
Splicing	5	Variant is within 3-bp of a splicing junction (use `--splicing-distance x` to change this; the unit of x is base-pairs).
Missense	6	Nucleotide substitution results in a codon coding for a different amino acid.
Synonymous	7	Nucleotide substitution does not change amino acids.
Exonic	8	Due to the loss of sequences in the reference database, this variant can only be mapped into the exonic region without more precise annotation.
UTR5	9	Within a 5’ untranslated region.
UTR3	10	Within a 3’ untranslated region.
Intronic	11	Within an intron.
Upstream	12	Within 1-kb region upstream of the transcription start site (use `--upstream-distance x` to change this, the unit of x is base-pair).
Downstream	13	Within 1-kb region downstream of the transcription end site (use `–downstream-distance x to change this; the unit of x is base-pairs).
ncRNA	14	Within a transcript without protein-coding annotation in the gene definition.
Intergenic	15	Variant is in the intergenic region.
Monomorphic	16	It is not a sequence variation, which may result from bugs in the reference genome in variant calling.
Unknown	17	Variants have no annotation.

Output¶

The gene feature annotation results are saved in OutputVariants2TSVTask/variants.hg38.tsv.gz. There are two relevant columns in the file:

Header	Description
…	…
MarkFeatureGene	The gene where a variant is located. When a variant is mapped onto multiple genes, the gene that leads to the smallest code is called the marker gene.
MarkGeneFeature	The coordinate of the first SNP.
…	..

Clump¶

What this section does¶

Prunes and clumps variants using GWAS p-values plus LD from a reference population, producing a more independent and analysis-ready variant set.

Inputs you typically set for clumping¶

--ld-clump sub-options:
r2Cut (LD threshold)
windowKb (physical distance window)
pCut and pCorrect (significance + multiple-testing correction)
pIndex (which GWAS p-value column to use when multiple summary datasets are provided)

Output you typically scan¶

OutputVariants2TSVTask/variants.hg38.tsv.gz (variant set should reflect the clumped/pruned variants)

Quick interpretation (scan order)¶

Confirm --ld-clump is enabled in your command.
Compare variant counts/sets before vs after clumping (the clumped set should be smaller and less redundant).
Ensure downstream annotation columns (if enabled) are based on the clumped variant set.

Prune and Clump Variants Based on P-Values and Linkage Disequilibrium (LD)¶

In KGGSum, the “Clump” feature enables users to prune and clump Genome-Wide Association Study (GWAS) variants using p-values from summary statistics and linkage disequilibrium (LD) calculated from a reference population. This process produces a refined set of independent, statistically significant, and harmonized variants, ready for downstream applications such as Mendelian Randomization (MR) analysis with third-party tools or platforms.

LD Clumping Methodology¶

The LD clumping method filters out variants within a user-defined physical distance (set by windowKb) if their LD exceeds a specified threshold (controlled by r2Cut). Variant removal follows a prioritized, three-step approach to retain the most informative variants:

Variants with Larger P-Values Variants with less significant p-values (i.e., larger values) are prioritized for removal. The significance threshold is determined by the family-wise error rate (set by pCut) and adjusted using multiple testing correction methods: “fixed” (no correction), “bonf” (Bonferroni), “bhfdr” (Benjamini-Hochberg FDR), “byfdr” (Benjamini-Yekutieli FDR), or “localfdr” (local FDR).
When multiple GWAS datasets are provided, the p-value to use is specified by pIndex; otherwise, the p-value from the first dataset is used by default.
Among variants with p-values below the cutoff, those with the smallest sum of -log(p) across datasets or missing p-values in any dataset are prioritized for removal.
Variants with More LD Connections Variants in LD with a greater number of other variants within the windowKb distance are prioritized for removal to reduce redundancy.
For example, consider three variants: A, B, and C, with LD r² values of 0.8 (A-B), 0.8 (B-C), and 0.64 (A-C). Variant B, which is in LD with both A and C, is removed, while A and C are retained. This preserves variants with fewer LD dependencies.
Variants with Lower Minor Allele Frequency (MAF) If the first two criteria do not differentiate between variants (e.g., equal LD connections and similar p-values), the variant with the lower MAF is removed. Retaining variants with higher MAF ensures greater informativeness, as they are typically more robust for statistical analysis.

This hierarchical approach maximizes the retention of variants that are:

Less redundant (fewer LD connections),
More significant (smaller p-values),
More informative (higher MAF).

Options¶

The tutorial command is:

java -Xmx8g -jar ../kggsum.jar \
   annot \
   --ld-clump r2Cut=0.1 \
              windowKb=10000 \
              pIndex=1 \
              pCut=0.05 \              
              pCorrect=bhfdr \
   --ref-gty-file ./1kg_hg19_eur_chr1.vcf.gz \
              refG=hg19 \
   --sum-file ./smoking_chr1.tsv.gz \
              cp12Cols=CHR,POS,A1,A2 \
              pbsCols=Pval,Beta,SE \
              prevalence=0.05   \
              betaType=1   \
   --sum-file ./scz_gwas_eur_chr1.tsv.gz \
              cp12Cols=CHR,BP,A1,A2 \
              pbsCols=P,OR,SE \
              betaType=2 \
              prevalence=0.01 \     
   --threads 18 \
   --output ./ta2

This command clumps variants from two GWAS datasets (smoking and schizophrenia), prioritizing the smoking dataset’s p-values. It uses a 10 Mb window and an LD threshold of 0.1, retaining independent, significant variants after FDR correction.

Annotation option	Description	Default
`--ld-clump`	`r2Cut`: LD threshold (e.g., 0.1). Controls the degree of independence among retained variants. A lower r2Cut yields a smaller, more independent set. `windowKb`: Physical distance window in kilobases (e.g., 10000). Often set between 500 kb and 10000 kb (10 Mb), depending on study needs. Larger windows capture long-range LD but increase computational load, while smaller windows focus on local LD patterns. `pIndex`: Index of the p-value column to use when multiple GWAS datasets are provided (e.g., 1). Ensures the correct dataset drives significance assessment in multi-dataset analyses. `pCut`: Significance cutoff (e.g., 0.05). Determines the pool of variants considered, with stricter thresholds retaining only highly significant ones. `pCorrect`: Multiple testing correction method (e.g., “bhfdr”). Format: `--ld-clump r2Cut=<value> windowKb=<value> pIndex=<index> pCut=<value> pCorrect=<method>` Example: `--ld-clump r2Cut=0.1 windowKb=10000 pIndex=1 pCut=0.05 pCorrect=bhfdr`	[OFF]

Function¶

What this section does¶

Annotates variants with functional scores/functional annotations from external databases (e.g., CADD, FAVOR, ClinVar) and optionally filters based on selected fields.

Inputs you typically set for functional annotation¶

--variant-annotation-database <name> ... (e.g., cadd, favor, clinvar)
field=... to choose which annotation fields to export
(common) --ref-gty-file, --sum-file, and output prefix via --output

Output you typically scan¶

OutputVariants2TSVTask/variants.hg38.tsv.gz with the annotation fields you requested via field=... (selected columns should appear)

Quick interpretation (scan order)¶

In the exported TSV, confirm the annotation fields selected by field=... are present.
If the fields are missing or empty, double-check field names and ensure the external resource files are available/downloaded.

Functional score annotation at variants¶

In KGGSum, the GWAS variants can also be annotated with multiple genomic features. Three databases are available: gnomAD for allele frequency annotation, CADD for variant function annotation, and ClinVar for disease linkage annotation. Note that the annotation datasets should be downloaded from an independent resource domain of KGGA.

Database Name	Short Description	Tag
CADD	Combined Annotation Dependent Depletion (CADD) is a widely used matrix for mutation deleteriousness and integrates more than 100 annotations for all possible single-nucleotide variants (SNVs) of the GRCh38/hg38 human reference genome.	cadd
Favor	Functional Annotation of Variants - Online Resource (FAVOR) provides comprehensive multi-faceted variant functional annotations that summarize findings of all possible nine billion SNVs across the genome (build GRCh38).	favor
ClinVar	ClinVar is a public database managed by the National Center for Biotechnology Information (NCBI) that provides information about the relationship between genetic variation and human health.	clinvar

Options¶

The tutorial command is:

java -Xmx8g -jar ../kggsum.jar \
   annot \
   --ref-gty-file ./1kg_hg19_eur_chr1.vcf.gz \
              refG=hg19 \
   --sum-file ./scz_gwas_eur_chr1.tsv.gz \
              cp12Cols=CHR,BP,A1,A2 \
              pbsCols=P,OR,SE \
              betaType=2 \
              prevalence=0.01 \
   --variant-annotation-database cadd \
              field=Epigenetics::EncodeDNase-max,Epigenetics::EncodeDNase-sum,ProteinFunction::CADD_PHRED \
   --threads 18 \
   --output ./ta2

Annotation option	Description	Default
`--variant-annotation-database`	Set the reference databases used for annotation at variants. `--variant-annotation-database` is a combination of parameters, and the usage rules are the same as those of `--freq-database`. Format: `--variant-annotation-database <name> path=<path> field=[field1,field2,...]` Example: `--variant-annotation-database cadd field=ProteinFunction::CADD_PHRED`	[OFF]

Additional epigenetic resources from third-party databases¶

To facilitate the convenient use of more resources, KGGSum provides an approach that allows users to specify customized third-party resources for annotation. For example, by setting the file name documented in EpiMap, KGGSum can directly download epigenetic marker resources from the EpiGenome public domain, specifically from the EpiMap Repository.

What this subsection does¶

Lets you annotate variants using additional epigenetic/interval resources from third-party databases (e.g., EpiMap), including optionally auto-downloaded EpiMap datasets.

Inputs you typically set¶

--region-annotation-database EpiMap ... (or other supported interval resources via --region-annotation-database <name> ...)
field=[field1,field2,...] to choose exported annotation columns

Output you typically scan¶

OutputVariants2TSVTask/variants.hg38.tsv.gz and verify the epigenetic columns selected by field=... are present.

Quick interpretation (scan order)¶

Confirm the --region-annotation-database parameters match the intended subID/marker.
Check the selected epigenetic annotation columns appear in the exported TSV.
If results are unexpectedly sparse, verify that your region/marker selection overlaps the variant coordinates in the chosen genome build.

Database Name	Short Description
EpiMap	EpiMap is one of the most comprehensive maps of the human epigenome, providing approximately 15,000 datasets across 833 bio-samples and 18 epigenomic marks. It delivers rich gene-regulatory annotations encompassing chromatin states, high-resolution enhancers, activity patterns, enhancer modules, upstream regulators, and downstream target genes.

Annotation Option	Description	Default
–region-annotation-database	Specifies the interval/regional databases for annotation. Format: –region-annotation-database subID=[] marker=[] path= field=[field1,field2,…] Example: –region-annotation-database EpiMap subID=BSS00001 marker=H3K4me3 - name: Database name (e.g., EpiMap). - subID: Subject ID from EpiMap. - marker: Epigenetic marker (e.g., H3K4me3). - path: Path to the database file (optional for EpiMap). - field: Specific fields to include. Note: For name=EpiMap, KGGSum automatically downloads data from the EpiMap Repository if not locally available.	NA

RSID¶

What this section does¶

Annotates variants with rsIDs from a dbSNP-based resource (you may need to download the rsID dataset manually).

Inputs you typically set¶

--variant-annotation-database snp (dbSNP/rsID resource)

Output you typically scan¶

OutputVariants2TSVTask/variants.hg38.tsv.gz (confirm rsID-related columns are populated for variants present in the rsID resource)

Quick interpretation (scan order)¶

Confirm rsID annotation is enabled via --variant-annotation-database snp.
In the exported TSV, check that rsID fields are filled for variants that should exist in the rsID dataset.
If most variants have no rsID, verify dataset availability/build compatibility and input variant coordinates.

Assign DBSNP’s rsIDs to variants¶

In KGGSum, the GWAS variants can also be annotated with DBSNP’s rsIDs. Note that the annotation dataset needs to be downloaded manually.

Database Name	Short Description	Tag
dbsnp	The variants’ coordinates and rs IDs from NCBI’s dbSNP. RSID coordinates can be accessed at NCBI FTP. We provide a GTB version of this dataset, converted from dbSNP’s VCF (B157).	dbsnp

Options¶

The tutorial command is:

java -Xmx8g -jar ../kggsum.jar \
   annot \
   --ref-gty-file ./1kg_hg19_eur_chr1.vcf.gz \
              refG=hg19 \
   --sum-file ./scz_gwas_eur_chr1.tsv.gz \
              cp12Cols=CHR,BP,A1,A2 \
              pbsCols=P,OR,SE \
              betaType=2 \
              prevalence=0.01 \
   --variant-annotation-database snp \
   --threads 18 \
   --output ./ta2

Frequency¶

What this section does¶

Annotates variants with population allele frequency information (e.g., from gnomAD) and optionally filters variants by AF/MAF thresholds.

Inputs you typically set¶

--freq-database gnomad ... (choose the reference frequency database and which fields to export)
optional filtration:
--db-af <min>~<max>
--db-maf <min>~<max>

Output you typically scan¶

OutputVariants2TSVTask/variants.hg38.tsv.gz and verify: - frequency-related columns selected by field=... are present - filtration via --db-af / --db-maf is reflected in the retained variant set

Quick interpretation (scan order)¶

Confirm the frequency columns selected by field=... appear in the exported TSV.
Compare variant counts before vs after filtration: stricter AF/MAF ranges should reduce the retained set.
If too few variants remain, relax --db-af/--db-maf thresholds or verify AF/MAF field selection.

Allele Frequency Annotation & Filtration¶

Allele frequency annotation allows users to incorporate population-level allele frequency information into the analysis of genetic variants. Click to view the provided allele frequency annotation databases.

Database Name	Short Description	Tag
gnomAD	Allele frequency data in the Genome Aggregation Database (gnomAD) v4 dataset (GRCh38) is derived from 730,947 exomes and 76,215 genomes from unrelated individuals of diverse ancestries.	gnomad

Options¶

The tutorial command is:

java -Xmx4g -jar ../kggsum.jar \
   annot \
   --ref-gty-file ./1kg_hg19_eur_chr1.vcf.gz \
              refG=hg19 \
   --sum-file ./scz_gwas_eur_chr1.tsv.gz \
              cp12Cols=CHR,BP,A1,A2 \
              pbsCols=P,OR,SE \
              betaType=2 \
              prevalence=0.01 \       
     --freq-database gnomad \
              field=gnomAD_joint::ALL,gnomAD_joint::NFE \
   --threads 18 \
   --output ./ta3

Annotation option	Description	Default
`--freq-database`	Set the reference databases used for allele frequency annotation. `--freq-database` is a combination of parameters. The `<name>` is identified as the database name (such as gnomad). The `<path>`identifies the path to the database, which can be a local file path or an FTP/HTTP file path. If the path of the resources folder has been specified, the `<path>` parameter can be bypassed. The `<field>` is identified as the specified field filtered under this database. If no value is set, all fields of the specified database are selected by default. Format: `--freq-database <name> path=<path> field=[field1,field2,...]` Example: `--freq-database gnomad field=gnomAD::EAS`	[OFF]

Once the reference databases for allele frequency annotation have been properly configured, you can effectively filter variants by examining their allele frequencies within the reference population.

Filtration option	Description	Default
`--db-af`	Exclude variants with alternative allele frequency (AF) outside the range [`min`, `max`] in allele frequency databases. Format: `--db-af <min>~<max>` Example:`--db-af 0.05~1.0` Valid setting: [float] 0.0 ~ 1.0	[OFF]
`--db-maf`	Exclude variants with minor allele frequency (MAF) outside the range [`min`, `max`] in allele frequency databases. Format: `--db-maf <min>~<max>` Example:`--db-maf 0.05~0.5` Valid setting: [float] 0.0 ~ 0.5	[OFF]