Example
This section provides examples of using the annotate module in KGGA, a software tool designed for geneticists. The module extracts variants and genotypes from VCF and PED files after quality control (QC), annotates them with various genomic features, and applies filters to identify high-quality, rare variants for downstream analysis.
Annotate Variants with Gene Features and Filter Based on Local Allele Frequencies
In this example, variants and genotypes are extracted from VCF and PED files after QC. The variants are annotated with gene features from the RefGene and Gencode databases. Variants that are common in the local sample (MAF > 0.05) or that lie outside exonic regions are excluded, leaving a subset of high-quality, rare exonic variants for further analysis.
java -Dccf.remote.timeout=60 -jar kgga.jar \
annotate \
--input-gty-file ./example/assoc.hg19.vcf.gz \
refG=hg19 \
--ped-file http://api.pmglab.top/kgga/download/example/assoc.ped \
--output ./test/demo2 \
--threads 6 \
--seq-ac 1 \
--local-maf 0~0.05 \
--gene-model-database refgene \
--gene-model-database gencode \
--gene-feature-included 0~8
Key Command Options
- --input-gty-file: Specifies the input VCF file; the refG sub-option gives its reference genome build (e.g., hg19).
- --ped-file: Provides sample metadata (e.g., case/control status).
- --output: Defines the output directory.
- --threads: Sets the number of processing threads.
- --seq-ac: Filters variants by allele count (e.g., ≥1).
- --local-maf: Excludes variants with minor allele frequency (MAF) > 0.05 in the local sample.
- --gene-model-database: Specifies gene annotation databases (e.g., RefGene, Gencode).
- --gene-feature-included: Limits annotations to exonic features (values 0~8).
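To make the --seq-ac and --local-maf thresholds concrete, the following Python sketch recomputes the alternative allele count and the MAF from genotype counts and checks them against the values used in the command above. It illustrates the arithmetic only and is not KGGA's implementation: the function names are hypothetical, and reading 0~0.05 as a closed interval and leaving missing genotypes out of the denominator are both assumptions.

```python
def parse_range(spec):
    """Parse a 'min~max' string such as '0~0.05' into (min, max) floats."""
    low, high = spec.split("~")
    return float(low), float(high)


def allele_stats(ref_hom, het, alt_hom):
    """Return (alt allele count, minor allele frequency) for a biallelic site.

    Missing genotypes are left out of the denominator (an assumption).
    """
    alt_count = het + 2 * alt_hom
    total_alleles = 2 * (ref_hom + het + alt_hom)
    if total_alleles == 0:
        return 0, 0.0
    alt_freq = alt_count / total_alleles
    return alt_count, min(alt_freq, 1.0 - alt_freq)


def passes(ref_hom, het, alt_hom, min_ac=1, maf_range="0~0.05"):
    """True if the site has at least min_ac alt alleles and a MAF in range."""
    low, high = parse_range(maf_range)
    alt_count, maf = allele_stats(ref_hom, het, alt_hom)
    return alt_count >= min_ac and low <= maf <= high


print(passes(95, 5, 0))    # 5 alt alleles out of 200, MAF 0.025 -> True
print(passes(60, 30, 10))  # 50 alt alleles out of 200, MAF 0.25 -> False
print(passes(100, 0, 0))   # no alt allele observed -> False
```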
Outputs
The annotated variants are saved in the output directory ./test/demo2/. The final results are stored in variants.hg38.tsv.gz, with columns including:
Header | Description |
---|---|
CHROM | Chromosome (hg38) |
POS | Position (hg38) |
REF | Reference allele |
ALT | Alternative allele |
GeneFeature@MarkFeatureGene | Gene feature mark |
GeneFeature@MarkGeneFeature | Gene feature details |
GeneFeature@GeneFeatureDetails | Detailed gene feature information |
GeneFeature@HitGenes | Genes affected by the variant |
GeneFeature@HitGeneFeatures | Features of the affected genes |
SOURCE@hg19_CHROM | Original chromosome (hg19) |
SOURCE@hg19_POS | Original position (hg19) |
GTYSUM@RefHomGtyNum_ALL | Number of reference homozygotes (all samples) |
GTYSUM@HetGtyNum_ALL | Number of heterozygotes (all samples) |
GTYSUM@AltHomGtyNum_ALL | Number of alternative homozygotes (all samples) |
GTYSUM@MissingGtyNum_ALL | Number of missing genotypes (all samples) |
GTYSUM@RefHomGtyNum_CASE | Number of reference homozygotes (case samples) |
GTYSUM@HetGtyNum_CASE | Number of heterozygotes (case samples) |
GTYSUM@AltHomGtyNum_CASE | Number of alternative homozygotes (case samples) |
GTYSUM@MissingGtyNum_CASE | Number of missing genotypes (case samples) |
GTYSUM@RefHomGtyNum_CONTROL | Number of reference homozygotes (control samples) |
GTYSUM@HetGtyNum_CONTROL | Number of heterozygotes (control samples) |
GTYSUM@AltHomGtyNum_CONTROL | Number of alternative homozygotes (control samples) |
GTYSUM@MissingGtyNum_CONTROL | Number of missing genotypes (control samples) |
- Intermediate Files: Files with the .gtb suffix in the ConvertVCF2GTBTask subfolder can be used for subsequent analysis.
- Log Files: Processing logs are saved in the log subfolder.
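As a starting point for downstream work, the sketch below reads the output table with the Python standard library and derives the alternative allele frequency in cases and controls from the GTYSUM columns listed above. It assumes the file is tab-separated with a single header line holding exactly those column names and that the counts are integers; the function names are hypothetical.

```python
import csv
import gzip
import os


def alt_allele_freq(row, group):
    """Alternative allele frequency for one group (ALL, CASE or CONTROL).

    Returns None when the group has no called genotypes.
    """
    ref_hom = int(row[f"GTYSUM@RefHomGtyNum_{group}"])
    het = int(row[f"GTYSUM@HetGtyNum_{group}"])
    alt_hom = int(row[f"GTYSUM@AltHomGtyNum_{group}"])
    called = ref_hom + het + alt_hom
    if called == 0:
        return None
    return (het + 2 * alt_hom) / (2 * called)


def summarize(path):
    """Yield (CHROM, POS, REF, ALT, case AF, control AF) for each variant."""
    with gzip.open(path, "rt", newline="") as handle:
        reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
        for row in reader:
            yield (
                row["CHROM"], row["POS"], row["REF"], row["ALT"],
                alt_allele_freq(row, "CASE"),
                alt_allele_freq(row, "CONTROL"),
            )


if __name__ == "__main__":
    table = "./test/demo2/variants.hg38.tsv.gz"
    if os.path.exists(table):  # only runs once the example has produced output
        for record in summarize(table):
            print(*record, sep="\t")
```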
Annotate Variants with Gene Features, Reference Allele Frequencies, and Deleterious Scores
Here, variants and genotypes are extracted from VCF and PED files after QC and annotated with gene features (RefGene, Gencode), allele frequencies from gnomAD (East Asian and African populations), and deleterious scores from dbNSFP. Variants that are common in the local sample (AF > 0.05) or in the reference database (AF > 0.01), or that lie outside exonic regions, are excluded, leaving rare variants with potential functional impact.
java -Dccf.remote.timeout=60 -jar kgga.jar \
annotate \
--input-gty-file ./example/assoc.hg19.vcf.gz \
refG=hg19 \
--ped-file http://api.pmglab.top/kgga/download/example/assoc.ped \
--output ./test/demo2 \
--threads 6 \
--seq-ac 1 \
--local-af 0~0.05 \
--gene-model-database refgene \
--gene-model-database gencode \
--gene-feature-included 0~8 \
--freq-database gnomad \
field=gnomAD_joint@EAS,gnomAD_joint@AFR \
--db-af 0~0.01 \
--variant-annotation-database dbnsfp
Key Command Options
- --local-af: Filters variants by local allele frequency (e.g., 0–0.05).
- --freq-database: Specifies the reference frequency database (e.g., gnomAD); the field sub-option selects which frequency fields to use (here, the EAS and AFR populations).
- --db-af: Excludes variants with allele frequency > 0.01 in the reference database.
- --variant-annotation-database: Adds deleterious scores from dbNSFP.
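The two frequency filters combine as a logical AND: a variant is kept only if it is rare in the local sample and rare in the reference database. The Python sketch below spells out that logic with the thresholds from the command above. It is an illustration, not KGGA's code: the function name is hypothetical, and two behaviours are assumptions, namely that a variant is dropped as soon as any selected population exceeds the limit and that a variant absent from the database is treated as rare.

```python
def is_rare(local_af, db_afs, local_max=0.05, db_max=0.01):
    """Keep a variant only if it is rare locally and in every listed field.

    db_afs maps a field name (e.g. 'gnomAD_joint@EAS') to an allele
    frequency, or to None when the variant is absent from the database.
    Treating absent variants as rare is an assumption of this sketch.
    """
    if not 0.0 <= local_af <= local_max:
        return False
    for af in db_afs.values():
        if af is not None and af > db_max:
            return False
    return True


# Rare locally, rare in EAS, absent from AFR -> kept
print(is_rare(0.02, {"gnomAD_joint@EAS": 0.004, "gnomAD_joint@AFR": None}))
# Rare locally but AF 0.03 in AFR -> excluded
print(is_rare(0.02, {"gnomAD_joint@EAS": 0.004, "gnomAD_joint@AFR": 0.03}))
# Common locally -> excluded regardless of the database
print(is_rare(0.08, {"gnomAD_joint@EAS": 0.0, "gnomAD_joint@AFR": 0.0}))
```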
Notes
The gnomAD and dbNSFP databases are not included in the tutorial folder due to their large size. Download the corresponding GTB files from [XXXX] and place them in resources/gnomad/ and resources/dbnsfp/ before running the command.
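A quick check before launching a long run can catch a missing database early. The Python sketch below reports which of the expected folders are absent or contain no GTB file. The helper name is hypothetical, and it assumes the database files carry a .gtb suffix somewhere under each folder; adjust the pattern if your download is laid out differently.

```python
from pathlib import Path


def missing_resources(root="resources", names=("gnomad", "dbnsfp")):
    """Return the database folders that are absent or hold no .gtb entry."""
    missing = []
    for name in names:
        folder = Path(root) / name
        # rglob also looks in subfolders, in case files are nested by build
        if not folder.is_dir() or not any(folder.rglob("*.gtb")):
            missing.append(str(folder))
    return missing


problems = missing_resources()
if problems:
    print("Missing or empty database folders:", ", ".join(problems))
else:
    print("All expected database folders contain GTB files.")
```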
Outputs
Results are saved in ./test/demo2/ in the same layout as the first example, with additional columns for the gnomAD allele frequencies and the dbNSFP deleterious scores.
Annotate Variants with Gene Features, Allele Frequencies, Non-Coding Scores, and Epigenetic Markers
In this example, individuals are selected from VCF and PED files, and stricter-than-default QC is applied to their genotypes. Variants are annotated with gene features (RefGene, Gencode), allele frequencies (gnomAD), non-coding functional scores (CADD), and epigenetic markers (EpiMap Repository, H3K4me3 for sample BSS00001). The output is formatted as PLINK BED files.
java -Dccf.remote.timeout=60 -jar kgga.jar \
annotate \
--input-gty-file ./example/assoc.hg19.vcf.gz \
refG=hg19 \
--ped-file ./example/assoc.ped \
--output ./test/demo2 \
--threads 6 \
--seq-ac 1 \
--gene-model-database refgene \
--gene-model-database gencode \
--freq-database gnomad \
field=gnomAD_joint@EAS,gnomAD_joint@AFR \
--variant-annotation-database cadd \
--region-annotation-database name=EpiMap \
subID=BSS00001 \
marker=H3K4me3 \
--output-gty-format PLINK_BED
Key Command Options
- --variant-annotation-database: Adds non-coding scores from CADD.
- --region-annotation-database: Specifies epigenetic annotations (e.g., EpiMap, H3K4me3 for sample BSS00001).
- --output-gty-format: Sets the output format to PLINK BED (.bed, .bim, .fam files).
Outputs
Results are saved in ./test/demo2/. Genotypes are written as a PLINK binary fileset (.bed, .bim, .fam), and the annotations include non-coding scores and epigenetic markers.
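A PLINK binary fileset can be sanity-checked without PLINK itself. The Python sketch below counts samples in the .fam file and variants in the .bim file, then verifies that the .bed file starts with the PLINK 1 magic bytes and has the size those counts imply (two bits per genotype, each variant padded to a whole byte). The function name and the file prefix in the usage lines are hypothetical, since the names KGGA gives the exported files are not stated here.

```python
import os

BED_MAGIC = bytes([0x6C, 0x1B, 0x01])  # PLINK 1 binary, variant-major layout


def check_plink_fileset(prefix):
    """Return (n_samples, n_variants) after basic consistency checks."""
    prefix = str(prefix)
    with open(prefix + ".fam") as fam:
        n_samples = sum(1 for line in fam if line.strip())
    with open(prefix + ".bim") as bim:
        n_variants = sum(1 for line in bim if line.strip())
    with open(prefix + ".bed", "rb") as bed:
        magic = bed.read(3)
    if magic != BED_MAGIC:
        raise ValueError("not a variant-major PLINK 1 .bed file")
    bytes_per_variant = (n_samples + 3) // 4  # 4 genotypes per byte, rounded up
    expected = 3 + n_variants * bytes_per_variant
    actual = os.path.getsize(prefix + ".bed")
    if actual != expected:
        raise ValueError(f".bed has {actual} bytes, expected {expected}")
    return n_samples, n_variants


if __name__ == "__main__":
    prefix = "./test/demo2/genotypes"  # hypothetical prefix; use the real one
    if os.path.exists(prefix + ".bed"):
        print(check_plink_fileset(prefix))
```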
General Notes
- Ensure all external databases (e.g., gnomAD, dbNSFP, CADD, EpiMap) are downloaded and correctly placed in the resources/ directory before execution.
- Adjust file paths and parameters based on your specific dataset and analysis needs.
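When the same analysis is run on several datasets, it can help to assemble the command from parameters instead of editing it by hand. The Python sketch below rebuilds the argument list of the first example using only the options shown in this section. The function name and the input paths in the usage lines are hypothetical placeholders.

```python
import shlex


def build_annotate_command(vcf, ped, output, ref_genome="hg19", threads=6,
                           jar="kgga.jar",
                           gene_models=("refgene", "gencode")):
    """Assemble the argument list of the first example with custom paths."""
    cmd = [
        "java", "-Dccf.remote.timeout=60", "-jar", jar, "annotate",
        "--input-gty-file", vcf, f"refG={ref_genome}",
        "--ped-file", ped,
        "--output", output,
        "--threads", str(threads),
        "--seq-ac", "1",
        "--local-maf", "0~0.05",
    ]
    for model in gene_models:
        cmd += ["--gene-model-database", model]
    cmd += ["--gene-feature-included", "0~8"]
    return cmd


# Hypothetical input paths; replace them with your own files.
cmd = build_annotate_command("./data/cohort.hg19.vcf.gz",
                             "./data/cohort.ped",
                             "./results/run1")
print(shlex.join(cmd))  # shell-quoted form, ready to paste into a terminal
```

To launch the run from Python, the list can be passed directly to subprocess.run(cmd, check=True), which avoids any shell quoting issues.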