Example

This section provides examples of using the pruning module in KGGA. The module enables users to select individuals from VCF and PED files, perform quality control, and prune variants based on association p-values, LD, or a combination of both, while optionally incorporating functional annotations such as gene features.


Prune Variants According to P-values

In this example, individuals present in both the VCF and PED files are selected, and the allelic association of each variant is examined. Variants with an association p-value (computed using the Cauchy Combination Test) greater than 1E-2 are excluded. The retained variants and their genotypes are stored in a VCF file for subsequent analysis.

java -Dccf.remote.timeout=60 -Xmx10g -jar kgga.jar \
   prune \
   --input-gty-file https://idc.biosino.org/pmglab/resource/kgg/kgga/example/assoc.hg19.vcf.gz \
                    refG=hg19 \
   --ped-file https://idc.biosino.org/pmglab/resource/kgg/kgga/example/assoc.ped \
   --output ./test/demo3 \
   --local-maf 0.05~0.5 \
   --output-gty-format VCF \
   --allele-num 2~4 \
   --threads 4 \
   --assoc allelic,model-all \
   --p-cut 1E-2

Key Command Options

  • --input-gty-file: Specifies the input VCF file and its reference genome (e.g., hg19).
  • --ped-file: Provides sample metadata (e.g., case/control status).
  • --output: Defines the output directory (e.g., ./test/demo3).
  • --local-maf: Filters variants by minor allele frequency (e.g., 0.05 to 0.5).
  • --output-gty-format: Sets the output format (e.g., VCF).
  • --allele-num: Filters variants by the number of alleles (e.g., 2 to 4).
  • --threads: Sets the number of processing threads (e.g., 4).
  • --assoc: Specifies the association methods (e.g., allelic and model-all).
  • --p-cut: Excludes variants with p-values greater than 1E-2.
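
To make the p-value filter concrete, the following Python sketch combines several per-test p-values for a variant with an equal-weight Cauchy Combination Test and then applies the 1E-2 cut-off. It is an illustration of the general technique, not KGGA's implementation: the function names, the toy p-values, and the assumption that the combined p-value merges the p-values of the individual association tests are all hypothetical.

```python
import math


def cauchy_combine(p_values):
    """Combine p-values with the equal-weight Cauchy Combination Test."""
    # Map each p-value to a standard Cauchy variate and average them.
    stats = [math.tan((0.5 - p) * math.pi) for p in p_values]
    t = sum(stats) / len(stats)
    # Convert the averaged statistic back to a p-value (Cauchy tail probability).
    return 0.5 - math.atan(t) / math.pi


def filter_by_p(per_test_p, p_cut=1e-2):
    """Keep variants whose combined p-value does not exceed p_cut."""
    combined = {vid: cauchy_combine(ps) for vid, ps in per_test_p.items()}
    return {vid: p for vid, p in combined.items() if p <= p_cut}


# Toy per-test p-values for two hypothetical variants (not KGGA output).
per_test_p = {
    "var1": [0.001, 0.02, 0.004],
    "var2": [0.3, 0.6, 0.2],
}
kept = filter_by_p(per_test_p, p_cut=1e-2)
print(sorted(kept))
```

In this toy input only var1 passes the threshold, mirroring the rule that variants with a combined p-value greater than the --p-cut value are excluded.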

Outputs

The pruned variants are saved in the specified output directory ./test/demo3/. The final results are stored in variants.hg38.tsv.gz, with columns including:

Header   Description
CHROM    Chromosome (hg38)
POS      Position (hg38)
REF      Reference allele
ALT      Alternative allele
...      Additional annotations

  • Intermediate Files: Files with the .gtb suffix in the ConvertVCF2GTBTask subfolder can be used for further analysis.
  • Log Files: Processing logs are saved in the log subfolder.
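
A quick way to check the result table is to read its first rows. The Python sketch below assumes the file is gzip-compressed, tab-delimited, and starts with a header line containing the column names listed above; the helper names are hypothetical and the actual header layout should be confirmed against your own output.

```python
import csv
import gzip


def parse_variant_table(lines, limit=5):
    """Parse tab-delimited lines (header first) into a list of dicts."""
    reader = csv.DictReader(lines, delimiter="\t")
    rows = []
    for row in reader:
        rows.append(row)
        if len(rows) >= limit:
            break
    return rows


def preview(path, limit=5):
    """Return the first `limit` rows of a gzip-compressed TSV file."""
    with gzip.open(path, "rt", newline="") as handle:
        return parse_variant_table(handle, limit)
```

After a run has finished, calling preview("./test/demo3/variants.hg38.tsv.gz") returns the first rows as dictionaries keyed by column name, for example row["CHROM"] and row["POS"].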

Prune Variants According to LD Only

In this example, individuals present in both the VCF and PED files are selected, and LD pruning is performed with a maximum LD r² of 0.1. The retained variants and their genotypes are stored in a VCF file for subsequent analysis.

java -Dccf.remote.timeout=60 -Xmx10g -jar kgga.jar \
   prune \
   --input-gty-file https://idc.biosino.org/pmglab/resource/kgg/kgga/example/assoc.hg19.vcf.gz \
                    refG=hg19 \
   --ped-file https://idc.biosino.org/pmglab/resource/kgg/kgga/example/assoc.ped \
   --output ./test/demo3 \
   --local-maf 0.05~0.5 \
   --output-gty-format VCF \
   --allele-num 2~4 \
   --threads 4 \
   --r2-cut 0.1

Key Command Options

  • --r2-cut: Sets the LD r² threshold for pruning (e.g., 0.1). All other options have the same meaning as in the p-value pruning example.
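
The idea behind an r² threshold can be sketched in a few lines of Python. The example estimates r² as the squared Pearson correlation of genotype dosages (0, 1, 2) and greedily keeps a variant only if it is not in LD with any variant already kept. This is a generic illustration under stated assumptions, not KGGA's algorithm: the variant order, the absence of a distance window, and the toy genotypes are all hypothetical.

```python
def r2(x, y):
    """Squared Pearson correlation between two genotype dosage vectors."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # a monomorphic variant carries no LD information
    return sxy * sxy / (sxx * syy)


def ld_prune(genotypes, r2_cut=0.1):
    """Greedy pruning: keep a variant only if r2 <= r2_cut with every kept one."""
    kept = []
    for name, dosages in genotypes.items():
        if all(r2(dosages, genotypes[k]) <= r2_cut for k in kept):
            kept.append(name)
    return kept


# Toy dosages for eight hypothetical individuals.
genotypes = {
    "rs1": [0, 1, 2, 0, 1, 2, 0, 1],
    "rs2": [0, 1, 2, 0, 1, 2, 0, 2],  # nearly identical to rs1
    "rs3": [0, 0, 0, 1, 1, 1, 2, 2],  # weakly correlated with rs1
}
pruned = ld_prune(genotypes, r2_cut=0.1)
print(pruned)
```

Here rs2 is dropped because it is strongly correlated with rs1, while rs3 stays because its r² with rs1 is below the cut-off.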

Outputs

The pruned variants are saved in ./test/demo3/ with the same file structure as described in the p-value pruning example:

  • Final results in variants.hg38.tsv.gz.
  • Intermediate .gtb files in the ConvertVCF2GTBTask subfolder.
  • Logs in the log subfolder.

LD-Clumping of Variants According to P-values and Gene Features

In this example, individuals present in both the VCF and PED files are selected, and the allelic association of each variant is examined. LD clumping is performed on variants with p-values less than 1E-5, using an LD threshold of r² = 0.1. Additionally, variants in LD with less favorable gene feature codes (i.e., those less likely to affect gene function) are removed. The retained variants and their genotypes are stored in a VCF file.

java -Dccf.remote.timeout=60 -Xmx10g -jar kgga.jar \
   prune \
   --input-gty-file https://idc.biosino.org/pmglab/resource/kgg/kgga/example/assoc.hg19.vcf.gz \
                    refG=hg19 \
   --ped-file https://idc.biosino.org/pmglab/resource/kgg/kgga/example/assoc.ped \
   --output ./test/demo3 \
   --local-maf 0.05~0.5 \
   --output-gty-format VCF \
   --allele-num 2~4 \
   --threads 4 \
   --p-cut 1E-5 \
   --r2-cut 0.1 \
   --clump Assoc@CCT_P,GeneFeature@MarkGeneFeature

Key Command Options

  • --p-cut: Sets the p-value threshold for selecting variants (e.g., 1E-5).
  • --r2-cut: Sets the LD threshold for clumping (e.g., 0.1).
  • --clump: Specifies the fields for ranking variants (e.g., Assoc@CCT_P for p-values and GeneFeature@MarkGeneFeature for gene features).
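
The clumping logic described above can be pictured with the following Python sketch. It makes several assumptions that are not confirmed by this page: variants are ranked by p-value first and by gene feature code second, a smaller feature code is treated as more favorable, and pairwise r² values are supplied as a precomputed lookup. All names and data are hypothetical.

```python
def clump(variants, ld, p_cut=1e-5, r2_cut=0.1):
    """Greedy clumping: walk variants from best to worst rank and drop
    any variant in LD (r2 > r2_cut) with a better-ranked kept variant."""
    candidates = [v for v in variants if v["p"] < p_cut]
    # Assumed ranking: smaller p-value first, then smaller feature code.
    candidates.sort(key=lambda v: (v["p"], v["feature_code"]))
    kept = []
    for v in candidates:
        in_ld = any(
            ld.get(frozenset((v["id"], k["id"])), 0.0) > r2_cut for k in kept
        )
        if not in_ld:
            kept.append(v)
    return [v["id"] for v in kept]


# Toy variants; feature codes are illustrative only.
variants = [
    {"id": "rsA", "p": 2e-8, "feature_code": 7},
    {"id": "rsB", "p": 2e-8, "feature_code": 1},
    {"id": "rsC", "p": 5e-7, "feature_code": 5},
    {"id": "rsD", "p": 3e-4, "feature_code": 0},  # fails the p-value cut
]
# Precomputed pairwise r2; pairs not listed are treated as unlinked.
ld = {
    frozenset(("rsA", "rsB")): 0.9,
    frozenset(("rsA", "rsC")): 0.02,
    frozenset(("rsB", "rsC")): 0.02,
}
clumped = clump(variants, ld, p_cut=1e-5, r2_cut=0.1)
print(clumped)
```

In this toy input rsA and rsB tie on p-value, so the feature code decides which one represents the clump; rsC forms its own clump because it is not in LD with rsB.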

Outputs

The results are saved in ./test/demo3/ with the same structure as in previous examples:

  • Final results in variants.hg38.tsv.gz.
  • Intermediate .gtb files in the ConvertVCF2GTBTask subfolder.
  • Logs in the log subfolder.

General Notes

  • File Paths and Parameters: Adjust the file paths (e.g., --input-gty-file, --ped-file, --output) and parameters (e.g., --p-cut, --r2-cut, --local-maf) based on your specific dataset and analysis requirements.
  • Flexibility with Clumping: The --clump option allows users to prioritize variants based on p-values and gene features, ensuring that the most functionally relevant variants are retained.

Copyright © MiaoXin Li, all rights reserved. Last modified time: 2025-04-05 13:36:04
