Variant Pruning

About

The prune module in KGGA is a powerful tool designed to refine genetic datasets by selecting variants that exhibit low linkage disequilibrium (LD) while retaining those with significant phenotypic associations and functional importance. This module is particularly valuable for reducing redundancy in large-scale genetic data, ensuring that the retained variants are both statistically robust and biologically meaningful. Its primary function is LD-based pruning, which integrates functional annotations (e.g., gene features, prediction scores) and phenotypic association p-values to prioritize variants. The pruned variants and their associated genotypes serve as high-quality input for downstream analyses, such as genetic risk prediction for complex diseases like diabetes or cardiovascular conditions.

This module is ideal for researchers aiming to streamline datasets without sacrificing critical genetic signals, making it a cornerstone for studies requiring efficient yet impactful variant selection.

Workflow of the Prune Module

  1. Association Testing: Identifies variants with statistically significant associations to a phenotype of interest. The module calculates the genetic association (e.g., via regression or chi-square tests) between each variant and the target phenotype, generating a p-value for each variant. Variants with p-values exceeding a user-defined threshold (e.g., --p-cut 1E-5) are pruned. This ensures that only variants with strong evidence of phenotypic relevance are retained.
  2. LD pruning: Reduces redundancy by removing variants in high LD with others, maximizing the number of independent variants retained. Within a user-specified genomic window (e.g., 10 MB, set via --window-kb), the module identifies pairwise LD between variants using a correlation metric (e.g., r², set via --r2-cut). Variants with the highest number of LD connections (i.e., correlated with the most other variants) are prioritized for removal. This approach minimizes redundancy while preserving dataset diversity. If multiple variants have the same number of LD connections, the variant with the higher minor allele frequency (MAF) is removed. If MAFs are equal, variants are removed randomly to maintain fairness.
  3. LD Clumping: Enhances LD pruning by incorporating variant significance and functional annotations for prioritized retention. Similar to LD pruning, variants are ranked based on user-defined criteria (e.g., p-values, functional scores) before pruning occurs. The --clump option allows users to specify prioritization fields, such as association p-values (Assoc@CCT_P) or gene annotations (GeneFeature@MarkGeneFeature).

Output and Format

Upon completion, the prune module generates output files in two complementary formats with coordinates, genomic features, and genotypes of retained variants: GTB (Genotype Block) binary format and text format. The GTB format is a compact, custom binary format optimized for rapid reading and integration with other KGGA modules. The text format is human-readable and has tab-separated values (TSV) for easy inspection and external use.

pruning

Basic Usage

To execute the prune module, use the following command structure in a terminal or shell environment:

java -jar kgga.jar prune --input <input1> --input <input2> --output <output> [options]
Copyright ©MiaoXin Li all right reservedLast modified time: 2025-04-02 13:14:04

results matching ""

    No results matching ""