Variant annotation and filtration with external databases enhances the analysis of genetic variation by adding information about each variant from external resources. These annotations can then be used to filter and prioritize variants.

Gene

Gene feature Annotation & Filtration

Gene feature annotation identifies which gene or transcript a variant affects and what functional effect the variant has on known genes. Two gene definition systems are currently supported: RefSeq genes (refgene) and GENCODE genes (gencode).

Options

The tutorial command is:

java -Xmx8g -jar ../kggsum.jar \
   annot \
   --sum-file ./scz_gwas_eur_chr1.tsv.gz \
              cp12Cols=CHR,BP,A1,A2 \
              pbsCols=P,OR,SE \
              betaType=2 \
              prevalence=0.01 \
   --ref-gty-file ./1kg_hg19_eur_chr1.vcf.gz \
              refG=hg19 \
   --gene-model-database refgene \
   --gene-model-database gencode \
   --threads 18 \
   --output ./ta1

| Flag | Description | Default |
|---|---|---|
| `annot` | Trigger the annotation procedure. | - |
| `--gene-model-database` | Set the reference gene model database(s) used for gene feature annotation. This option takes a combination of parameters: `<name>` is the database name (such as `refgene` or `gencode`), and `<path>` is the path to the database, which can be a local file path or an FTP/HTTP file path. If the path of the resources folder has been specified, the `<path>` parameter can be omitted. Note: this option conflicts with `--xqtl-file`.<br>Format: `--gene-model-database <name> path=<path>`<br>Example: `--gene-model-database refgene` | - |
| `--upstream-distance` | Set the length (bp) of the region upstream of the transcription start site.<br>Format: `--upstream-distance <int>`<br>Example: `--upstream-distance 1000`<br>Valid setting: [int] >=1 | 1000 |
| `--downstream-distance` | Set the length (bp) of the region downstream of the transcription end site.<br>Format: `--downstream-distance <int>`<br>Example: `--downstream-distance 1000`<br>Valid setting: [int] >=1 | 1000 |
| `--disable-gene-feature` | Disable gene feature annotation.<br>Format: `--disable-gene-feature` | |
| `--gene-feature-in` | Retain variants with the specified annotated gene features.<br>Format: `--gene-feature-in <int~int>,<int>,...`<br>Example: `--gene-feature-in 0~6,9,10`<br>Valid setting: [int] 0 ~ 17 | 0 ~ 17 |
| ... | ... | ... |
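
The `--gene-feature-in` value mixes ranges (`0~6`) and single codes (`9,10`). As a rough illustration of how such a selection expands, here is a minimal Python sketch; the function name `parse_feature_codes` is hypothetical and this is not KGGSum's own parser.

```python
def parse_feature_codes(spec, max_code=17):
    """Expand a selection such as '0~6,9,10' into a sorted list of codes.

    Hypothetical helper for illustration only; KGGSum's parsing may differ.
    """
    codes = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "~" in part:
            # 'a~b' denotes the inclusive range a..b
            lo, hi = (int(x) for x in part.split("~"))
        else:
            lo = hi = int(part)
        if lo > hi or lo < 0 or hi > max_code:
            raise ValueError(f"invalid feature code range: {part!r}")
        codes.update(range(lo, hi + 1))
    return sorted(codes)


selected = parse_feature_codes("0~6,9,10")
print(selected)
```

Under these assumptions, the documented default `0 ~ 17` expands to every code, so no variant is dropped by gene feature.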

KGGA uses 18 numeric codes (0 to 17) for the gene features after annotation.

| Feature | Code | Explanation |
|---|---|---|
| Frameshift | 0 | Short insertion or deletion results in a completely different translation from the original. |
| Nonframeshift | 1 | Short insertion or deletion results in loss of amino acids in the translated proteins. |
| Start-loss | 2 | Indel or nucleotide substitution results in the loss of the start codon (ATG), which is mutated into a non-start codon. |
| Stop-loss | 3 | Indel or nucleotide substitution results in the loss of a stop codon (TAG, TAA, TGA). |
| Stop-gain | 4 | Indel or nucleotide substitution results in a new stop codon (TAG, TAA, TGA), which may truncate the protein. |
| Splicing | 5 | Variant is within 3 bp of a splicing junction (use `--splicing-distance x` to change this; the unit of x is base pairs). |
| Missense | 6 | Nucleotide substitution results in a codon coding for a different amino acid. |
| Synonymous | 7 | Nucleotide substitution does not change the amino acid. |
| Exonic | 8 | Due to the loss of sequences in the reference database, this variant can only be mapped to the exonic region without more precise annotation. |
| UTR5 | 9 | Within a 5' untranslated region. |
| UTR3 | 10 | Within a 3' untranslated region. |
| Intronic | 11 | Within an intron. |
| Upstream | 12 | Within the 1-kb region upstream of the transcription start site (use `--upstream-distance x` to change this; the unit of x is base pairs). |
| Downstream | 13 | Within the 1-kb region downstream of the transcription end site (use `--downstream-distance x` to change this; the unit of x is base pairs). |
| ncRNA | 14 | Within a transcript without protein-coding annotation in the gene definition. |
| Intergenic | 15 | Variant is in an intergenic region. |
| Monomorphic | 16 | Not a sequence variation; this may result from errors in the reference genome used in variant calling. |
| Unknown | 17 | Variant has no annotation. |
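
The Upstream (12) and Downstream (13) features are defined relative to the transcription start and end sites, so the flanking side depends on the strand of the transcript. The Python sketch below illustrates that logic under stated assumptions: coordinates are 1-based and inclusive, strand is `"+"` or `"-"`, and the function name `flank_feature` is hypothetical. How KGGSum treats boundary positions is not specified here.

```python
def flank_feature(pos, tx_start, tx_end, strand, upstream=1000, downstream=1000):
    """Classify a position outside a transcript as 'Upstream' or 'Downstream'.

    tx_start <= tx_end are genomic coordinates (assumed 1-based, inclusive).
    Returns None if the position is inside the transcript or beyond both flanks.
    """
    if strand not in ("+", "-"):
        raise ValueError("strand must be '+' or '-'")
    if tx_start <= pos <= tx_end:
        return None  # inside the transcript: another feature code applies
    if strand == "+":
        # transcription start site is at tx_start
        left, right = "Upstream", "Downstream"
        left_len, right_len = upstream, downstream
    else:
        # on the minus strand the start site is at tx_end
        left, right = "Downstream", "Upstream"
        left_len, right_len = downstream, upstream
    if tx_start - left_len <= pos < tx_start:
        return left
    if tx_end < pos <= tx_end + right_len:
        return right
    return None


print(flank_feature(900, 1000, 2000, "+"))
print(flank_feature(900, 1000, 2000, "-"))
```

The `upstream` and `downstream` defaults mirror the documented 1000-bp defaults of `--upstream-distance` and `--downstream-distance`.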

Output

The gene feature annotation results are saved in OutputVariants2TSVTask/variants.hg38.tsv.gz. There are two relevant columns in the file:

| Header | Description |
|---|---|
| ... | ... |
| MarkFeatureGene | The gene where a variant is located. When a variant maps to multiple genes, the gene yielding the smallest feature code is called the mark gene. |
| MarkGeneFeature | The gene feature of the variant in the mark gene. |
| ... | ... |
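
The mark gene rule (smallest feature code wins) can be sketched in a few lines of Python. The gene names and the `pick_mark_gene` helper are hypothetical, and the tie-breaking behavior shown (first gene listed) is an assumption rather than documented behavior.

```python
def pick_mark_gene(annotations):
    """Return the (gene, feature_code) pair with the smallest feature code.

    annotations: non-empty iterable of (gene_symbol, feature_code) pairs.
    On ties, the first pair in input order is returned (an assumption).
    """
    return min(annotations, key=lambda pair: pair[1])


# A variant mapped to three hypothetical genes:
# intronic (11) in one, missense (6) in another, upstream (12) of a third.
hits = [("GENE_A", 11), ("GENE_B", 6), ("GENE_C", 12)]
mark_gene, mark_feature = pick_mark_gene(hits)
print(mark_gene, mark_feature)
```

Because lower codes correspond to features listed earlier in the code table, this rule reports the missense hit rather than the intronic or upstream one.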

Function

Functional score annotation at variants

In KGGSum, GWAS variants can also be annotated with multiple genomic features. The databases listed below are available: CADD and FAVOR for variant function annotation, and ClinVar for disease linkage annotation. Allele frequency annotation with gnomAD is described in the Frequency section. Note that the annotation datasets must be downloaded from KGGA's independent resource domain.

| Database Name | Short Description | Tag |
|---|---|---|
| CADD | Combined Annotation Dependent Depletion (CADD) is a widely used metric of mutation deleteriousness that integrates more than 100 annotations for all possible single-nucleotide variants (SNVs) of the GRCh38/hg38 human reference genome. | cadd |
| FAVOR | Functional Annotation of Variants - Online Resource (FAVOR) provides comprehensive multi-faceted variant functional annotations that summarize findings for all nine billion possible SNVs across the genome (build GRCh38). | favor |
| ClinVar | ClinVar is a public database managed by the National Center for Biotechnology Information (NCBI) that provides information about the relationship between genetic variation and human health. | clinvar |

Options

The tutorial command is:

java -Xmx8g -jar ../kggsum.jar \
   annot \
   --sum-file ./scz_gwas_eur_chr1.tsv.gz \
              cp12Cols=CHR,BP,A1,A2 \
              pbsCols=P,OR,SE \
              betaType=2 \
              prevalence=0.01 \
   --ref-gty-file ./1kg_hg19_eur_chr1.vcf.gz \
              refG=hg19 \
   --variant-annotation-database cadd \
              field=Epigenetics::EncodeDNase-max,Epigenetics::EncodeDNase-sum,ProteinFunction::CADD_PHRED \
   --threads 18 \
   --output ./ta2

| Annotation option | Description | Default |
|---|---|---|
| `--variant-annotation-database` | Set the reference databases used for variant annotation. This option takes a combination of parameters, and its usage rules are the same as those of `--freq-database`.<br>Format: `--variant-annotation-database <name> path=<path> field=[field1,field2,...]`<br>Example: `--variant-annotation-database cadd field=ProteinFunction::CADD_PHRED` | [OFF] |
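
Long comma-separated `field=` lists are easy to mistype on the command line. As one possible convenience, the Python sketch below assembles the option tokens from a list of field names, following the `<name> path=<path> field=...` pattern shown in the table. The helper `build_db_option` is hypothetical and only formats strings; it does not call KGGSum.

```python
def build_db_option(flag, name, fields=None, path=None):
    """Build the argument tokens for a database option.

    flag:   e.g. '--variant-annotation-database' or '--freq-database'
    name:   database tag, e.g. 'cadd'
    fields: optional list of field names, joined with commas
    path:   optional database path (local or FTP/HTTP)
    """
    tokens = [flag, name]
    if path:
        tokens.append(f"path={path}")
    if fields:
        tokens.append("field=" + ",".join(fields))
    return tokens


args = build_db_option(
    "--variant-annotation-database",
    "cadd",
    fields=[
        "Epigenetics::EncodeDNase-max",
        "Epigenetics::EncodeDNase-sum",
        "ProteinFunction::CADD_PHRED",
    ],
)
print(" ".join(args))
```

The resulting tokens could be appended to the `java -jar` command list passed to `subprocess.run`, which avoids shell quoting issues.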

Additional epigenetic resources from third-party databases

To make additional resources easier to use, KGGSum lets users specify customized third-party resources for annotation. For example, given a file name documented in EpiMap, KGGSum can download epigenetic marker resources directly from the public epigenome domain, specifically from the EpiMap Repository.

| Database Name | Short Description |
|---|---|
| EpiMap | EpiMap is one of the most comprehensive maps of the human epigenome. It provides approximately 15,000 datasets across 833 bio-samples and 18 epigenomic marks, and delivers a rich set of gene-regulatory annotations encompassing chromatin states, high-resolution enhancers, activity patterns, enhancer modules, upstream regulators, and downstream target genes. |

| Annotation Option | Description | Default |
|---|---|---|
| `--region-annotation-database` | Specifies the interval/regional databases for annotation.<br>Format: `--region-annotation-database <name> subID=[<subID>] marker=[<marker>] path=<path> field=[field1,field2,...]`<br>Example: `--region-annotation-database EpiMap subID=BSS00001 marker=H3K4me3` | |

Parameters:
- name: Database name (e.g., EpiMap).
- subID: Subject ID from EpiMap.
- marker: Epigenetic marker (e.g., H3K4me3).
- path: Path to the database file (optional for EpiMap).
- field: Specific fields to include.

Note: For name=EpiMap, KGGA automatically downloads data from the EpiMap Repository if it is not available locally.
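
Region annotation amounts to asking which interval, if any, contains a variant's position. The Python sketch below shows one common way to answer that with binary search. It assumes intervals on a single chromosome that are 0-based, half-open, and non-overlapping; the peak labels and function names are hypothetical, and KGGSum's internal implementation is not described here.

```python
from bisect import bisect_right


def build_index(regions):
    """Sort (start, end, label) intervals and return (starts, ordered)."""
    ordered = sorted(regions)
    starts = [region[0] for region in ordered]
    return starts, ordered


def annotate(pos, index):
    """Return the label of the interval containing pos, or None."""
    starts, ordered = index
    # Rightmost interval whose start is <= pos
    i = bisect_right(starts, pos) - 1
    if i >= 0 and ordered[i][0] <= pos < ordered[i][1]:
        return ordered[i][2]
    return None


index = build_index([(500, 650, "peak2"), (100, 200, "peak1")])
print(annotate(150, index))
print(annotate(300, index))
```

Each lookup is logarithmic in the number of intervals, which matters when annotating millions of GWAS variants against a marker track.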

Frequency

Allele Frequency Annotation & Filtration

Allele frequency annotation incorporates population-level allele frequency information into the analysis of genetic variants. The provided allele frequency annotation databases are listed below.

| Database Name | Short Description | Tag |
|---|---|---|
| gnomAD | Allele frequency data in the Genome Aggregation Database (gnomAD) v4 dataset (GRCh38) is derived from 730,947 exomes and 76,215 genomes from unrelated individuals of diverse ancestries. | gnomad |

Options

The tutorial command is:

java -Xmx4g -jar ../kggsum.jar \
   annot \
   --sum-file ./scz_gwas_eur_chr1.tsv.gz \
              cp12Cols=CHR,BP,A1,A2 \
              pbsCols=P,OR,SE \
              betaType=2 \
              prevalence=0.01 \
   --ref-gty-file ./1kg_hg19_eur_chr1.vcf.gz \
              refG=hg19 \
   --freq-database gnomad \
              field=gnomAD_joint::ALL,gnomAD_joint::NFE \
   --threads 18 \
   --output ./ta3

| Annotation option | Description | Default |
|---|---|---|
| `--freq-database` | Set the reference databases used for allele frequency annotation. This option takes a combination of parameters: `<name>` is the database name (such as `gnomad`); `<path>` is the path to the database, which can be a local file path or an FTP/HTTP file path (if the path of the resources folder has been specified, `<path>` can be omitted); `<field>` names the fields to select from this database. If no field is set, all fields of the specified database are selected by default.<br>Format: `--freq-database <name> path=<path> field=[field1,field2,...]`<br>Example: `--freq-database gnomad field=gnomAD::EAS` | [OFF] |

Once the reference databases for allele frequency annotation have been configured, you can filter variants by their allele frequencies in the reference population.

| Filtration option | Description | Default |
|---|---|---|
| `--db-af` | Exclude variants with alternative allele frequency (AF) outside the range [min, max] in allele frequency databases.<br>Format: `--db-af <min>~<max>`<br>Example: `--db-af 0.05~1.0`<br>Valid setting: [float] 0.0 ~ 1.0 | [OFF] |
| `--db-maf` | Exclude variants with minor allele frequency (MAF) outside the range [min, max] in allele frequency databases.<br>Format: `--db-maf <min>~<max>`<br>Example: `--db-maf 0.05~0.5`<br>Valid setting: [float] 0.0 ~ 0.5 | [OFF] |

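
The difference between `--db-af` and `--db-maf` is that the latter folds the frequency, so an alternative allele at frequency 0.97 has a MAF of 0.03. The Python sketch below illustrates both filters under the assumption that MAF is computed as min(AF, 1 - AF) for a biallelic variant. The function names are hypothetical, and how KGGSum treats variants missing from the database is not covered.

```python
def parse_range(spec):
    """Parse '<min>~<max>' into a (min, max) pair of floats within [0, 1]."""
    lo, hi = (float(x) for x in spec.split("~"))
    if not 0.0 <= lo <= hi <= 1.0:
        raise ValueError(f"invalid frequency range: {spec!r}")
    return lo, hi


def passes_db_af(af, spec):
    """Keep a variant if its alternative allele frequency lies in [min, max]."""
    lo, hi = parse_range(spec)
    return lo <= af <= hi


def passes_db_maf(af, spec):
    """Keep a variant if its minor allele frequency lies in [min, max]."""
    lo, hi = parse_range(spec)
    maf = min(af, 1.0 - af)  # fold the frequency (biallelic assumption)
    return lo <= maf <= hi


print(passes_db_af(0.97, "0.05~1.0"))
print(passes_db_maf(0.97, "0.05~0.5"))
```

In this example the same variant passes the AF filter but fails the MAF filter, because its minor allele is rare.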
Copyright © MiaoXin Li, all rights reserved. Last modified time: 2025-04-04 04:19:54
