
Variant Annotation & Filtration

At a glance

The annotate module enriches variants with functional annotations and (optionally) filters them using:

  • gene features / xQTL mappings
  • population allele frequency (e.g., gnomAD)
  • conservation and epigenetic resources (depending on your database selections)

In practice, this module helps you turn raw variant signals into a prioritized, analysis-ready variant set.

Typical command skeleton

java -jar kggsum.jar annot \
  --sum-file <GWAS summary> \
  --ref-gty-file <reference genotype> refG=<hg19|hg38> \
  --output <output_prefix> \
  [options]
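
As a minimal sketch of how the skeleton might be filled in, the wrapper below assembles the command from four arguments and, by default, only prints it for review. The function name `run_annot`, the `DRY_RUN` switch, and the file names in the example call are hypothetical and not part of KGGSum; only the flags shown in the skeleton above are used.

```shell
#!/bin/sh
# Hypothetical helper: assemble the annot command from the skeleton above.
#   $1 = GWAS summary file, $2 = reference genotype file,
#   $3 = genome build (hg19 or hg38), $4 = output prefix
# DRY_RUN (default 1) is an assumption of this sketch, not a KGGSum option.
run_annot() {
  set -- java -jar kggsum.jar annot \
    --sum-file "$1" \
    --ref-gty-file "$2" "refG=$3" \
    --output "$4"
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "$*"      # print the assembled command only
  else
    "$@"           # actually launch KGGSum
  fi
}

# Example call with placeholder file names (replace with your own paths).
run_annot gwas.sumstats.tsv.gz ref_panel.vcf.gz hg38 results/annot_run1
```

Setting `DRY_RUN=0` before the call runs the command instead of printing it; any further options from options.md would be appended by hand.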

Start with the option anchors in options.md, then open the database subpages under databases/ if you want to understand what each resource provides.

Inputs you typically provide

For all annotate runs, you must provide:

  • --ref-gty-file: reference genotypes (LD + coordinate context)
  • --sum-file: GWAS summary statistics (used to append SNP-level signals)
  • --output: output prefix

What you additionally provide depends on which annotation/filtering steps you enable in options.md:

  • Gene feature / xQTL-style mapping: --gene-model-database (or equivalent), plus gene-feature filtering options
  • Clumping (optional): --ld-clump sub-options
  • Functional/region databases: --variant-annotation-database ... / --region-annotation-database ... and field=...
  • Frequency: --freq-database ... and optional --db-af / --db-maf
  • rsID annotation: --variant-annotation-database snp (the dataset must be made available manually)

Outputs you typically scan

  • OutputVariants2TSVTask/variants.hg38.tsv.gz: the consolidated TSV produced by the annotation/filtering steps you enabled

Quick interpretation mindset

Scan in this order:

  1. Confirm the variant set you expect exists (especially if --ld-clump is enabled).
  2. In the TSV, verify which columns are exported (controlled by field=... and enabled options).
  3. Use the gene-feature codes and frequencies as filtering signals to decide which variants are most credible for downstream causal or association analyses.
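
The scan described above can be sketched with standard tools. The two helper functions below are hypothetical (not part of KGGSum) and assume only that the output is a gzipped, tab-separated file whose first line is a header. Column names depend on your field=... selections, so the column you filter on has to be taken from the header you actually see.

```shell
#!/bin/sh
# Hypothetical helpers for inspecting the exported TSV.

# list_columns FILE: print "index<TAB>name" for each header column.
list_columns() {
  gzip -dc -- "$1" |
    awk -F'\t' 'NR == 1 { for (i = 1; i <= NF; i++) print i "\t" $i; exit }'
}

# filter_max FILE COLUMN MAX: keep the header plus rows whose COLUMN value
# is a plain number <= MAX. Rows with missing or non-numeric values are dropped.
filter_max() {
  gzip -dc -- "$1" | awk -F'\t' -v col="$2" -v max="$3" '
    NR == 1 {
      for (i = 1; i <= NF; i++) if ($i == col) idx = i
      if (!idx) { print "column not found: " col > "/dev/stderr"; exit 1 }
      print
      next
    }
    $idx ~ /^[-+]?[0-9]*\.?[0-9]+([eE][-+]?[0-9]+)?$/ && $idx + 0 <= max + 0 { print }
  '
}
```

For example, `list_columns OutputVariants2TSVTask/variants.hg38.tsv.gz` shows which columns were exported, and `filter_max <file> <frequency column> 0.01` keeps variants at or below a 1% frequency, where the column name is whatever the header reports.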

About

The annotate module rapidly annotates millions of variants with genomic features from one or more databases, using all or a subset of each database's fields. It also offers filters based on the annotation results, such as gene feature, population frequency, conservation, and epigenetic modification, to help interpret significant association signals.

Workflow of the Annotate Module

  1. Generation: Extract variant coordinates and frequencies from the VCF or GBC file to create a root variant set for further analysis.

  2. Gene Annotation: Annotate the root variant set with gene features or xQTLs.

  3. Append: Integrate GWAS variants and their summary statistics into the annotated root variant set.

  4. Variant Annotation: Annotate variants with databases stored in GTB format, such as CADD (Combined Annotation Dependent Depletion) and gnomAD, to characterize your variants in detail. KGGSum supports rapid, one-stop annotation of hundreds of fields from one or more databases.

  5. Clump and Prune: Clump and prune GWAS variants using p-values from summary statistics and LD calculated from a reference population.
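
To illustrate the idea behind clumping, here is a minimal sketch of a generic greedy procedure: visit variants from the smallest p-value upward, make each unclaimed variant an index variant, and assign to it every remaining variant whose r² with it meets a threshold. This is not KGGSum's implementation. The function name, both input layouts, and the threshold argument are assumptions for illustration, and practical clumping usually also applies p-value cutoffs and a distance window, which are omitted here.

```shell
#!/bin/sh
# Hypothetical sketch of greedy LD clumping (not KGGSum's own code).
# Usage: greedy_clump PVALUES LD R2_MIN
#   PVALUES: variant_id<TAB>p      (no header; assumed layout)
#   LD:      id_a<TAB>id_b<TAB>r2  (no header; assumed layout)
# Prints:  index_variant<TAB>p<TAB>comma-separated clumped variants
greedy_clump() {
  # sort -g (general numeric) handles values such as 1e-8; GNU and BSD sort have it
  LC_ALL=C sort -t "$(printf '\t')" -k2,2g -- "$1" |
    awk -F'\t' -v ld="$2" -v r2_min="$3" '
      BEGIN {
        # Remember every pair whose r2 reaches the threshold, in both directions
        while ((getline line < ld) > 0) {
          split(line, f, "\t")
          if (f[3] + 0 >= r2_min + 0) { pair[f[1], f[2]] = 1; pair[f[2], f[1]] = 1 }
        }
        close(ld)
      }
      { id[NR] = $1; p[NR] = $2 }   # rows arrive sorted by ascending p-value
      END {
        for (i = 1; i <= NR; i++) {
          if (id[i] in claimed) continue
          claimed[id[i]] = 1        # new index variant
          members = ""
          for (j = i + 1; j <= NR; j++) {
            if (!(id[j] in claimed) && ((id[i], id[j]) in pair)) {
              claimed[id[j]] = 1
              members = members (members == "" ? "" : ",") id[j]
            }
          }
          print id[i] "\t" p[i] "\t" members
        }
      }
    '
}
```

The double loop is quadratic in the number of variants, which is acceptable for a small illustration but not for genome-wide input.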


Basic Usage

java -jar kggsum.jar annot --sum-file <input1> --ref-gty-file <input2> --output <output> [options]