Example

This section provides examples of using the annotate module in KGGA, a software tool designed for geneticists. The module extracts variants and genotypes from VCF and PED files after quality control (QC), annotates them with various genomic features, and applies filters to identify high-quality, rare variants for downstream analysis.


Annotate Variants with Gene Features and Filter Based on Local Allele Frequencies

In this example, variants and genotypes are extracted from VCF and PED files post-QC. The variants are annotated with gene features using the RefGene and Gencode databases. Variants with common allele frequencies in the local sample (MAF > 0.05) or located outside exonic regions are excluded, yielding a subset of high-quality, rare exonic variants for further analysis.

java -Dccf.remote.timeout=60 -jar kgga.jar \
annotate \
   --input-gty-file ./example/assoc.hg19.vcf.gz \
                    refG=hg19 \
   --ped-file http://api.pmglab.top/kgga/download/example/assoc.ped \
   --output ./test/demo2 \
   --threads 6 \
   --seq-ac 1 \
   --local-maf 0~0.05 \
   --gene-model-database refgene \
   --gene-model-database gencode \
   --gene-feature-included 0~8

Key Command Options

  • --input-gty-file: Specifies the input VCF file and reference genome (e.g., hg19).
  • --ped-file: Provides sample metadata (e.g., case/control status).
  • --output: Defines the output directory.
  • --threads: Sets the number of processing threads.
  • --seq-ac: Filters variants by allele count (e.g., ≥1).
  • --local-maf: Excludes variants with minor allele frequency (MAF) > 0.05 in the local sample.
  • --gene-model-database: Specifies gene annotation databases (e.g., RefGene, Gencode).
  • --gene-feature-included: Limits annotations to exonic features (values 0~8).

Outputs

The annotated variants are saved in the output directory ./test/demo2/. The final results are stored in variants.hg38.tsv.gz, with columns including:

Header                          Description
CHROM                           Chromosome (hg38)
POS                             Position (hg38)
REF                             Reference allele
ALT                             Alternative allele
GeneFeature@MarkFeatureGene     Gene feature mark
GeneFeature@MarkGeneFeature     Gene feature details
GeneFeature@GeneFeatureDetails  Detailed gene feature information
GeneFeature@HitGenes            Genes affected by the variant
GeneFeature@HitGeneFeatures     Features of the affected genes
SOURCE@hg19_CHROM               Original chromosome (hg19)
SOURCE@hg19_POS                 Original position (hg19)
GTYSUM@RefHomGtyNum_ALL         Number of reference homozygotes (all samples)
GTYSUM@HetGtyNum_ALL            Number of heterozygotes (all samples)
GTYSUM@AltHomGtyNum_ALL         Number of alternative homozygotes (all samples)
GTYSUM@MissingGtyNum_ALL        Number of missing genotypes (all samples)
GTYSUM@RefHomGtyNum_CASE        Number of reference homozygotes (case samples)
GTYSUM@HetGtyNum_CASE           Number of heterozygotes (case samples)
GTYSUM@AltHomGtyNum_CASE        Number of alternative homozygotes (case samples)
GTYSUM@MissingGtyNum_CASE       Number of missing genotypes (case samples)
GTYSUM@RefHomGtyNum_CONTROL     Number of reference homozygotes (control samples)
GTYSUM@HetGtyNum_CONTROL        Number of heterozygotes (control samples)
GTYSUM@AltHomGtyNum_CONTROL     Number of alternative homozygotes (control samples)
GTYSUM@MissingGtyNum_CONTROL    Number of missing genotypes (control samples)

  • Intermediate Files: Files with the .gtb suffix in the ConvertVCF2GTBTask subfolder can be used for subsequent analysis.
  • Log Files: Processing logs are saved in the log subfolder.
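For a quick look at the result table, the following sketch prints selected columns by header name instead of by position. It is a generic helper rather than a KGGA feature: the function name tsv_columns is hypothetical, and it assumes that the first line of the file is a tab-separated header using the column names listed above.

```shell
# Hypothetical helper (not part of KGGA): print selected columns of a
# gzip-compressed, tab-separated file, choosing them by header name.
# Usage: tsv_columns FILE.tsv.gz 'NAME1,NAME2,...'
tsv_columns() {
  gzip -dc "$1" | awk -F'\t' -v OFS='\t' -v names="$2" '
    NR == 1 {
      n = split(names, want, ",")
      for (i = 1; i <= NF; i++) idx[$i] = i     # header name -> column number
      for (j = 1; j <= n; j++)
        if (!(want[j] in idx)) {
          print "missing column: " want[j] > "/dev/stderr"
          exit 1
        }
    }
    {
      # Emit the requested columns in the requested order (header included)
      line = $(idx[want[1]])
      for (j = 2; j <= n; j++) line = line OFS $(idx[want[j]])
      print line
    }'
}
```

Assuming the result file sits directly in the output directory, a call such as tsv_columns ./test/demo2/variants.hg38.tsv.gz 'CHROM,POS,GTYSUM@HetGtyNum_CASE,GTYSUM@HetGtyNum_CONTROL' would list heterozygote counts in cases and controls for each variant.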

Annotate Variants with Gene Features, Reference Allele Frequencies, and Deleterious Scores

Here, variants and genotypes are extracted from VCF and PED files after QC and annotated with gene features (RefGene, Gencode), allele frequencies from gnomAD (East Asian and African populations), and deleterious scores from dbNSFP. Variants that are common in the local sample (AF > 0.05) or in the reference database (AF > 0.01), or that lie outside exonic regions, are excluded, leaving rare variants with potential functional impact.

java -Dccf.remote.timeout=60 -jar kgga.jar \
annotate \
   --input-gty-file ./example/assoc.hg19.vcf.gz \
                    refG=hg19 \
   --ped-file http://api.pmglab.top/kgga/download/example/assoc.ped \
   --output ./test/demo2 \
   --threads 6 \
   --seq-ac 1 \
   --local-af 0~0.05 \
   --gene-model-database refgene \
   --gene-model-database gencode \
   --gene-feature-included 0~8 \
   --freq-database gnomad \
                   field=gnomAD_joint@EAS,gnomAD_joint@AFR \
   --db-af 0~0.01 \
   --variant-annotation-database dbnsfp

Key Command Options

  • --local-af: Filters variants by local allele frequency (e.g., 0–0.05).
  • --freq-database: Specifies the reference frequency database (e.g., gnomAD) and fields (e.g., EAS, AFR populations).
  • --db-af: Excludes variants with allele frequency > 0.01 in the reference database.
  • --variant-annotation-database: Adds deleterious scores from dbNSFP.
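The reference-frequency cutoff can be pictured as a check applied to each selected population field. The sketch below is not taken from KGGA: the helper name passes_db_af is hypothetical, and it assumes that a variant is kept only if every supplied frequency is at most the cutoff and that a missing value (written here as "." or an empty string) means the variant is absent from that population. How KGGA itself combines several fields should be confirmed in its documentation.

```shell
# Hypothetical helper (not part of KGGA): succeed when every supplied
# allele frequency is at most MAX. Values "." or "" are treated as
# "variant absent from that population" and pass (an assumption).
# Usage: passes_db_af MAX AF [AF...]
passes_db_af() {
  max=$1; shift
  for af in "$@"; do
    case $af in ''|.) continue ;; esac          # skip missing frequencies
    awk -v a="$af" -v m="$max" 'BEGIN { exit !(a + 0 <= m + 0) }' || return 1
  done
  return 0
}
```

With a cutoff of 0.01, passes_db_af 0.01 0.002 0.0005 succeeds, whereas passes_db_af 0.01 0.002 0.03 fails because the second population exceeds the cutoff.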

Notes

The gnomAD and dbNSFP databases are not included in the tutorial folder due to their large size. Download the corresponding GTB files from [XXXX] and place them in resources/gnomad/ and resources/dbnsfp/ before running the command.
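Before launching a long run, it can help to confirm that the downloaded databases are where the command expects them. The following pre-flight check is a sketch under assumptions: the function name check_gtb_resources is hypothetical, and it assumes the downloaded files carry the .gtb suffix; adjust the pattern if the actual file names differ.

```shell
# Hypothetical pre-flight check (not part of KGGA): confirm that each
# named subfolder of a resources directory holds at least one *.gtb entry.
# Usage: check_gtb_resources RESOURCES_DIR NAME [NAME...]
check_gtb_resources() {
  root=$1; shift
  status=0
  for name in "$@"; do
    if [ -d "$root/$name" ] &&
       [ -n "$(find "$root/$name" -name '*.gtb' | head -n 1)" ]; then
      echo "ok: $root/$name"
    else
      echo "missing: no .gtb entry under $root/$name" >&2
      status=1                                  # remember the failure, keep checking
    fi
  done
  return "$status"
}
```

For the layout described above, the call would be check_gtb_resources resources gnomad dbnsfp; a non-zero exit status indicates that at least one database still needs to be downloaded.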

Outputs

Results are saved in ./test/demo2/ with a format similar to the first example, including additional columns for allele frequencies and deleterious scores.


Annotate Variants with Gene Features, Allele Frequencies, Non-Coding Scores, and Epigenetic Markers

In this example, individuals are selected from VCF and PED files, and stricter-than-default QC is applied to their genotypes. Variants are annotated with gene features (RefGene, Gencode), allele frequencies (gnomAD), non-coding functional scores (CADD), and epigenetic markers (EpiMap Repository, H3K4me3 for sample BSS00001). The output is formatted as PLINK BED files.

java -Dccf.remote.timeout=60 -jar kgga.jar \
annotate \
   --input-gty-file ./example/assoc.hg19.vcf.gz \
                    refG=hg19 \
   --ped-file ./example/assoc.ped \
   --output ./test/demo2 \
   --threads 6 \
   --seq-ac 1 \
   --gene-model-database refgene \
   --gene-model-database gencode \
   --freq-database gnomad \
                   field=gnomAD_joint@EAS,gnomAD_joint@AFR \
   --variant-annotation-database cadd \
   --region-annotation-database name=EpiMap \
                                subID=BSS00001 \
                                marker=H3K4me3 \
   --output-gty-format PLINK_BED

Key Command Options

  • --variant-annotation-database: Adds non-coding scores from CADD.
  • --region-annotation-database: Specifies epigenetic annotations (e.g., EpiMap, H3K4me3 for BSS00001).
  • --output-gty-format: Sets the output format to PLINK BED (.bed, .bim, .fam files).

Outputs

Results are saved in ./test/demo2/. Genotypes are written as a PLINK binary fileset (.bed, .bim, .fam), together with annotations for non-coding scores and epigenetic markers.
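A PLINK binary fileset is only usable when all three files are present, and a .bed file begins with the fixed bytes 0x6c 0x1b, followed by 0x01 for the usual variant-major layout. The sketch below checks both conditions. The function name check_plink_fileset is hypothetical, and the file prefix KGGA writes is not stated here, so it is passed in as an argument.

```shell
# Hypothetical helper (not part of KGGA): check that PREFIX.bed, PREFIX.bim
# and PREFIX.fam exist and are non-empty, and that the .bed file starts with
# the PLINK magic bytes 0x6c 0x1b followed by 0x01 (variant-major layout).
# Usage: check_plink_fileset PREFIX
check_plink_fileset() {
  for ext in bed bim fam; do
    [ -s "$1.$ext" ] || { echo "missing or empty: $1.$ext" >&2; return 1; }
  done
  # Read the first three bytes as hex and strip the whitespace od adds
  magic=$(od -An -tx1 -N3 "$1.bed" | tr -d ' \n')
  [ "$magic" = "6c1b01" ] || { echo "unexpected header in $1.bed: $magic" >&2; return 1; }
}
```

The function returns 0 when the fileset looks intact, so it can guard a later step, for example check_plink_fileset PREFIX && echo "fileset looks complete", where PREFIX is the path of the .bed file without its extension.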


General Notes

  • Ensure all external databases (e.g., gnomAD, dbNSFP, CADD, EpiMap) are downloaded and correctly placed in the resources/ directory before execution.
  • Adjust file paths and parameters based on your specific dataset and analysis needs.

Copyright © MiaoXin Li. All rights reserved. Last modified: 2025-03-22 04:23:57
