Variant Annotation and Filtration with External Databases enhance the analysis of genetic variations by leveraging external databases to provide additional information about variants. This process aids in filtering and prioritizing variants based on various annotations.
Gene
Gene feature Annotation & Filtration
Gene feature annotation is used to identify which gene or transcript is affected by a variant and what functional role it has on known genes. We now support two gene definition systems: RefSeq genes (refgene) and GENCODE genes (gencode).
Options
The tutorial command is:
java -Xmx8g -jar ../kggsum.jar \
annot \
--sum-file ./scz_gwas_eur_chr1.tsv.gz \
cp12Cols=CHR,BP,A1,A2 \
pbsCols=P,OR,SE \
betaType=2 \
prevalence=0.01 \
--ref-gty-file ./1kg_hg19_eur_chr1.vcf.gz \
refG=hg19 \
--gene-model-database refgene \
--gene-model-database gencode \
--threads 18 \
--output ./ta1
Flag | Description | Default |
---|---|---|
annot |
Trigger the annotation procedure. | - |
--gene-model-database |
et the reference gene model database(s) used for gene feature annotation. It is a combination of parameters. The <name> is identified as the database name (such as refgene or gencode). The <path> identifies the path to the database, which can be a local file path or an FTP/HTTP file path. If the path of the resources folder has been specified, the <path> parameter can be bypassed. Note This option conflicts with --xqtl-file .Format: --gene-model-database <name> path=<path> Example: --gene-model-database refgene |
- |
--upstream-distance |
Set the region length (bp) of upstream from the transcription start site. Format: --upstream-distance <int> Example: --upstream-distance 1000 Valid setting: [int] >=1 |
1000 |
--downstream-distance |
Set the region length (bp) of downstream from the transcription start site. Format: --downstream-distance <int> Example: --downstream-distance 1000 Valid setting: [int] >=1 |
1000 |
--disable-gene-feature |
Disable gene feature annotation. Format: --disable-gene-feature |
|
--gene-feature-in |
Retain variants with specified annotated gene features . Format: --gene-feature-in <int~int>,<int>,... Example: --gene-feature-in 0~6,9,10 Valid setting: [int] 0 ~ 17 |
0 ~ 17 |
... | ... | ... |
KGGA has 18 number codes for the gene features after annotation.
Feature | Code | Explanation |
---|---|---|
Frameshift | 0 | Short insertion or deletion results in a completely different translation from the original. |
Nonframeshift | 1 | Short insertion or deletion results in loss of amino acids in the translated proteins. |
Start-loss | 2 | Indels or nucleotide substitution results in the loss of the start codon (ATG) (mutated into a non-start codon). |
Stop-loss | 3 | Indels or nucleotide substitution results in the loss of stop codons (TAG, TAA, TGA). |
Stop-gain | 4 | Indels or nucleotide substitution result in the new stop codons (TAG, TAA, TGA), which may truncate the protein. |
Splicing | 5 | Variant is within 3-bp of a splicing junction (use --splicing-distance x to change this; the unit of x is base-pair). |
Missense | 6 | Nucleotide substitution results in a codon coding for a different amino acid. |
Synonymous | 7 | Nucleotide substitution does not change amino acids. |
Exonic | 8 | Due to the loss of sequences in the reference database, this variant can only be mapped into the exonic region without more precise annotation. |
UTR5 | 9 | Within a 5' untranslated region. |
UTR3 | 10 | Within a 3' untranslated region. |
Intronic | 11 | Within an intron. |
Upstream | 12 | Within 1-kb region upstream of transcription start site (use --upstream-distance x to change this, the unit of x is base-pair). |
Downstream | 13 | Within 1-kb region downstream of the transcription end site (use `--downstream-distance x to change this; the unit of x is base-pair). |
ncRNA | 14 | Within a transcript without protein-coding annotation in the gene definition. |
Intergenic | 15 | Variant is in intergenic region. |
Monomorphic | 16 | It is not a sequence variation, which may result from bugs in the reference genome in variant calling. |
Unknown | 17 | Variants has no annotation. |
Output
The gene feature annotation results are saved in OutputVariants2TSVTask/variants.hg38.tsv.gz
. There are two relevant columns in the file:
Header | Description |
---|---|
... | ... |
MarkFeatureGene | The gene where a variant is located. When a variant is mapped onto multiple genes, the genes led to the smallest code is called the mark gene. |
MarkGeneFeature | The coordinate of the first SNP. |
... | .. |
Function
Functional score annotation at variants
In KGGSum, the GWAS variants can also be annotated with multiple genomic features. Three databases are available: gnomAD for allele frequency annotation, CADD for variant function annotation, and ClinVar for disease linkage annotation. Note that the annotation datasets should be downloaded from an independent resource domain of KGGA.
Database Name | Short Description | Tag |
---|---|---|
CADD | Combined Annotation Dependent Depletion (CADD) is a widely used matrix for mutation deleteriousness and integrates more than 100 annotations for all possible single-nucleotide variants (SNVs) of the GRCh38/hg38 human reference genome. | cadd |
Favor | Functional Annotation of Variants - Online Resource (FAVOR) provides comprehensive multi-faceted variant functional annotations that summarize findings of all possible nine billion SNVs across the genome (build GRCh38). | favor |
ClinVar | ClinVar is a public database managed by the National Center for Biotechnology Information (NCBI) that provides information about the relationship between genetic variation and human health. | clinvar |
Options
The tutorial command is:
java -Xmx8g -jar ../kggsum.jar \
annot \
--sum-file ./scz_gwas_eur_chr1.tsv.gz \
cp12Cols=CHR,BP,A1,A2 \
pbsCols=P,OR,SE \
betaType=2 \
prevalence=0.01 \
--ref-gty-file ./1kg_hg19_eur_chr1.vcf.gz \
refG=hg19 \
--variant-annotation-database cadd \
field=Epigenetics::EncodeDNase-max,Epigenetics::EncodeDNase-sum,ProteinFunction::CADD_PHRED \
--threads 18 \
--output ./ta2
Annotation option | Description | Default |
---|---|---|
--variant-annotation-database |
Set the reference databases used for annotation at variants. --variant-annotation-database is a combination of parameters, and the usage rules are the same as those of --freq-database .Format: --variant-annotation-database <name> path=<path> field=[field1,field2,...] Example: --variant-annotation-database cadd field=ProteinFunction::CADD_PHRED |
[OFF] |
Additional epigenetic resources from third-party databases
To facilitate the convenient use of more resources, KGGSum provides an interactive approach that allows users to specify customized third-party resources for annotation. For example, by setting the file name documented in EpiMap, KGGSum can directly download epigenetic marker resources from the EpiGenome public domain, specifically from the EpiMap Repository.
Database Name | Short Description |
---|---|
EpiMap | EpiMap is one of the most comprehensive maps of the human epigenome, provides approximately 15,000 datasets across 833 bio-samples and 18 epigenomic marks, delivers a rich of gene-regulatory annotations encompassing chromatin states, high-resolution enhancers, activity patterns, enhancer modules, upstream regulators, and downstream target genes. |
Annotation Option | Description | Default |
---|---|---|
--region-annotation-database | Specifies the interval/regional databases for annotation. Format: --region-annotation-database Example: --region-annotation-database EpiMap subID=BSS00001 marker=H3K4me3 - name: Database name (e.g., EpiMap). - subID: Subject ID from EpiMap. - marker: Epigenetic marker (e.g., H3K4me3). - path: Path to the database file (optional for EpiMap). - field: Specific fields to include. Note: For name=EpiMap, KGGA automatically downloads data from the EpiMap Repository if not locally available. |
Frequency
Allele Frequency Annotation & Filtration
Allele frequency annotation allows users to incorporate population-level allele frequency information into the analysis of genetic variants. Click to view the provided allele frequency annotation databases.
Database Name | Short Description | Tag |
---|---|---|
gnomAD | Allele frequency data in the Genome Aggregation Database (gnomAD) v4 dataset (GRCh38) is derived from 730,947 exomes and 76,215 genomes from unrelated individuals of diverse ancestries. | gnomad |
Options
The tutorial command is:
java -Xmx4g -jar ../kggsum.jar \
annot \
--sum-file ./scz_gwas_eur_chr1.tsv.gz \
cp12Cols=CHR,BP,A1,A2 \
pbsCols=P,OR,SE \
betaType=2 \
prevalence=0.01 \
--ref-gty-file ./1kg_hg19_eur_chr1.vcf.gz \
refG=hg19 \
--freq-database gnomad \
field=gnomAD_joint::ALL,gnomAD_joint::NFE \
--threads 18 \
--output ./ta3
Annotation option | Description | Default |
---|---|---|
--freq-database |
Set the reference databases used for allele frequency annotation. --freq-database is a combination of parameters. The <name> is identified as the database name (such as gnomad). The <path> identifies the path to the database, which can be a local file path or an FTP/HTTP file path. If the path of the resources folder has been specified, the <path> parameter can be bypassed. The <field> is identified as the specified field filtered under this database. If no value is set, all fields of the specified database are selected by default.Format: --freq-database <name> path=<path> field=[field1,field2,...] Example: --freq-database gnomad field=gnomAD::EAS |
[OFF] |
Once the reference databases for allele frequency annotation have been properly configured, you can effectively filter variants by examining their allele frequencies within the reference population.
Filtration option | Description | Default |
---|---|---|
--db-af |
Exclude variants with alternative allele frequency (AF) outside the range [min , max ] in allele frequency databases.Format: --db-af <min>~<max> Example: --db-af 0.05~1.0 Valid setting: [float] 0.0 ~ 1.0 |
[OFF] |
--db-maf |
Exclude variants with minor allele frequency (MAF) outside the range [min , max ] in allele frequency databases.Format: --db-maf <min>~<max> Example: --db-maf 0.05~0.5 Valid setting: [float] 0.0 ~ 0.5 |
[OFF] |