The Variant Prune module is designed to filter and select variants based on linkage disequilibrium (LD), association p-values, and other functional annotations. This module allows users to efficiently reduce the number of redundant variants and streamline subsequent analysis. The retained variants and their corresponding genotype information can be exported into various formats, making them suitable for multiple downstream applications.

Association

KGGA provides a framework for single-variant association tests that incorporates gene-environment interactions and supports multiple genetic models. To improve statistical power and robustness, particularly in small-sample studies, KGGA combines the p-values from multiple association tests at each variant using the Cauchy Combination Test.


Association Testing Methods

KGGA offers a variety of association tests for both binary and continuous traits, allowing flexibility based on study design and research questions. The key command-line options and their functionalities are outlined below:

Option | Description | Default
--p-cut | Sets the maximum p-value threshold for selecting variants for further analysis. | [OFF]
--assoc | Specifies the association analysis method and covariates. | [OFF]
    Format: --assoc method= [sex=y/n] [interaction=1,...] [standardBeta=y/n] [minCell=5]
    Example: --assoc method=allelic,model-trend,logistic-add sex=n interaction=1,3 standardBeta=n minCell=0

Association Methods (method) of --assoc

KGGA supports multiple association testing methods, each tailored to binary or continuous traits. These methods generate specific p-values and statistics, which are combined by default using the Cauchy Combination Test for a unified assessment.

Methods for Binary Traits

  • allelic: Performs a chi-square test on the 2x2 disease-by-allele table (case/control status by allele counts).

    • Output: Assoc@Allelic_Assoc_P
  • model-trend: Applies the Cochran-Armitage trend test to assess trends in genotype frequencies.

    • Output: Assoc@Trend_Model_P
  • model-all: Uses Fisher’s exact test on the 2x2 table for an overall association.

    • Output: Assoc@Allelic_Model_P
  • model-dom: Tests the dominant model (e.g., DD + Dd vs. dd) using Fisher’s exact test.

    • Output: Assoc@Dominant_Model_P
  • model-rec: Tests the recessive model (e.g., DD vs. Dd + dd) using Fisher’s exact test.

    • Output: Assoc@Recessive_Model_P
  • logistic-add: Logistic regression testing the additive effect of the minor allele.

    • Outputs: Assoc@Logistic_Add_P, Assoc@Logistic_Add_Beta, Assoc@Logistic_Add_OR, Assoc@Logistic_Add_OR_Upper, Assoc@Logistic_Add_OR_Lower
  • logistic-dom: Logistic regression for the dominant model (DD + Dd vs. dd).

    • Outputs: Assoc@Logistic_Dom_P, Assoc@Logistic_Dom_Beta, Assoc@Logistic_Dom_OR, Assoc@Logistic_Dom_OR_Upper, Assoc@Logistic_Dom_OR_Lower
  • logistic-rec: Logistic regression for the recessive model (DD vs. Dd + dd).

    • Outputs: Assoc@Logistic_Rec_P, Assoc@Logistic_Rec_Beta, Assoc@Logistic_Rec_OR, Assoc@Logistic_Rec_OR_Upper, Assoc@Logistic_Rec_OR_Lower
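
To make the contingency-table tests concrete, here is a minimal Python sketch of an allelic chi-square test and a Cochran-Armitage trend test computed from genotype counts. It is an illustration under stated assumptions, not KGGA's implementation: the function names are hypothetical, the allelic test is assumed to use allele counts (the textbook form), the trend test uses scores 2, 1, 0 for DD, Dd, dd, and no continuity correction or minCell check is applied.

```python
import math

def chi2_1df_pvalue(chi2):
    """Upper-tail p-value of a chi-square statistic with 1 degree of freedom."""
    return math.erfc(math.sqrt(chi2 / 2.0))

def allelic_test(cases, controls):
    """Chi-square test on the 2x2 table of allele counts.

    cases and controls are (DD, Dd, dd) genotype counts (hypothetical input format).
    """
    a = 2 * cases[0] + cases[1]        # D alleles in cases
    b = 2 * cases[2] + cases[1]        # d alleles in cases
    c = 2 * controls[0] + controls[1]  # D alleles in controls
    d = 2 * controls[2] + controls[1]  # d alleles in controls
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return float("nan")  # an empty row or column: the test is undefined
    chi2 = n * (a * d - b * c) ** 2 / denom
    return chi2_1df_pvalue(chi2)

def trend_test(cases, controls, scores=(2, 1, 0)):
    """Cochran-Armitage trend test, using the identity chi2 = N * r^2."""
    n = sum(cases) + sum(controls)
    r = sum(cases)  # number of cases
    totals = [ca + co for ca, co in zip(cases, controls)]
    sx = sum(x * t for x, t in zip(scores, totals))
    sxx = sum(x * x * t for x, t in zip(scores, totals))
    sxy = sum(x * ca for x, ca in zip(scores, cases))
    denom = (n * sxx - sx ** 2) * (n * r - r ** 2)
    if denom == 0:
        return float("nan")  # no variation in genotype or in phenotype
    chi2 = n * (n * sxy - sx * r) ** 2 / denom
    return chi2_1df_pvalue(chi2)

# Illustrative counts only: (DD, Dd, dd) in cases and in controls.
print(allelic_test((30, 20, 10), (10, 20, 30)))
print(trend_test((30, 20, 10), (10, 20, 30)))
```

Both tests have one degree of freedom, which is why the p-value can be obtained from the complementary error function without an external statistics library.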

Methods for Continuous Traits

  • anova: Performs an ANOVA test to compare phenotypic means across genotype groups.
    • Outputs: Assoc@ANOVA_P, Assoc@ANOVA_FStatistic, Assoc@ANOVA_DFB, Assoc@ANOVA_DFW
  • linear-add: Linear regression testing the additive effect of the minor allele.
    • Outputs: Assoc@Linear_Add_P, Assoc@Linear_Add_Beta, Assoc@Linear_Add_T
  • linear-dom: Linear regression for the dominant model (DD + Dd vs. dd).
    • Outputs: Assoc@Linear_Dom_P, Assoc@Linear_Dom_Beta, Assoc@Linear_Dom_T
  • linear-rec: Linear regression for the recessive model (DD vs. Dd + dd).
    • Outputs: Assoc@Linear_Rec_P, Assoc@Linear_Rec_Beta, Assoc@Linear_Rec_T
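
The additive, dominant, and recessive regression models differ only in how a genotype is coded as a predictor. The sketch below shows one common coding, assuming D is the minor allele and genotypes are stored as the number of D copies (0, 1, or 2). The helper names `encode` and `slope` are hypothetical and the data are made up; KGGA's internal coding may differ.

```python
def encode(minor_allele_count, model):
    """Code a genotype (0, 1, or 2 copies of the minor allele) for a genetic model."""
    if model == "add":
        return minor_allele_count                   # 0, 1, 2
    if model == "dom":
        return 1 if minor_allele_count >= 1 else 0  # DD + Dd vs. dd
    if model == "rec":
        return 1 if minor_allele_count == 2 else 0  # DD vs. Dd + dd
    raise ValueError(f"unknown model: {model}")

def slope(x, y):
    """Ordinary least-squares slope of y on x (simple linear regression)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    return sxy / sxx

# Made-up genotypes and phenotype values, for illustration only.
genotypes = [0, 0, 1, 1, 2, 2]
phenotype = [1.0, 1.2, 2.1, 1.9, 3.0, 3.2]
for model in ("add", "dom", "rec"):
    x = [encode(g, model) for g in genotypes]
    print(model, round(slope(x, phenotype), 3))
```

The same coding applies to the logistic models; only the regression link changes.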

Note: KGGA employs the Cauchy Combination Test by default to combine the p-values from the selected methods into a single p-value. This can improve statistical power, especially in small-sample scenarios. For more details, refer to Liu & Xie (2020): Cauchy combination test: a powerful test with analytic p-value calculation under arbitrary dependency structures (Journal of the American Statistical Association, 115(529), 393-402).

  • Output: Assoc@CCT_P
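
For reference, the Cauchy Combination Test statistic from Liu & Xie (2020) is T = Σ wᵢ·tan((0.5 − pᵢ)·π), and the combined p-value is 0.5 − arctan(T)/π. The sketch below implements that formula with equal weights by default. The function name is hypothetical, and the weights KGGA actually uses are not stated in this document.

```python
import math

def cauchy_combination(p_values, weights=None):
    """Combine p-values with the Cauchy Combination Test (Liu & Xie, 2020).

    Equal weights summing to 1 are assumed unless weights are supplied.
    """
    if not p_values:
        raise ValueError("at least one p-value is required")
    if weights is None:
        weights = [1.0 / len(p_values)] * len(p_values)
    # Each p-value is mapped to a standard Cauchy variate, then averaged.
    t = sum(w * math.tan((0.5 - p) * math.pi) for w, p in zip(weights, p_values))
    # Upper tail of the standard Cauchy distribution at t.
    return 0.5 - math.atan(t) / math.pi

# Illustrative p-values from three tests at one variant.
print(cauchy_combination([0.001, 0.5, 0.9]))
```

Because the tangent transform is dominated by the smallest p-values, one strong signal is not diluted much by several weak ones.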


Additional Association Options of --assoc

Option | Description | Default
sex | When set to y, includes sex as a covariate and prioritizes it in SNP × covariate interaction analyses. | [sex=n]
interaction | Specifies covariates for SNP × covariate interaction terms in regression models. Covariates must be included in the PED file. | [OFF]
    Example: interaction=1,3 includes interactions with the first and third covariates.
standardBeta | When set to y for linear regression methods, standardizes the phenotype (mean 0, unit variance), yielding standardized beta coefficients. | [standardBeta=n]
minCell | For contingency table tests (e.g., chi-square, Fisher’s exact), skips tests if any cell has fewer observations than minCell. | [minCell=0]

Examples of Association Models

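  • Basic Logistic Regression (Additive Model):
    • Command: --assoc method=logistic-add sex=y
    • Model: Y = b0 + b1.ADD + b2.SEX + b3.A + b4.B + e
  • Logistic Regression with Interactions:
    • Command: --assoc method=logistic-add sex=y interaction=1,3
    • Model: Y = b0 + b1.ADD + b2.SEX + b3.A + b4.B + b5.(ADD*SEX) + b6.(ADD*B) + e

In these models, ADD is the additive genotype code, and A and B stand for two further covariates from the PED file. With sex=y, sex is counted as the first covariate, so interaction=1,3 selects SEX and B.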
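
The way interaction=1,3 expands into model terms can be sketched as follows. This is an illustration only: `design_row` is a hypothetical helper, and the 1-based covariate indexing with sex placed first is an assumption read off the example models above, not a documented KGGA rule.

```python
def design_row(add, sex, covariates, interaction=()):
    """Return predictor values [ADD, SEX, covariates..., interaction terms...].

    interaction holds 1-based indices into the covariate list with sex first
    (an assumption drawn from the example models, not a confirmed rule).
    """
    main = [sex] + list(covariates)  # SEX, A, B, ...
    row = [add] + main               # ADD followed by the main effects
    row += [add * main[i - 1] for i in interaction]  # ADD * selected covariate
    return row

# ADD=2, SEX=1, A=0.5, B=3.0; interaction=1,3 adds ADD*SEX and ADD*B.
print(design_row(2, 1, [0.5, 3.0], interaction=(1, 3)))
```

Each returned value lines up with one coefficient b1 to b6 in the interaction model; the intercept b0 is left out of the row.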

Notes

  • Covariate Specification: Covariates for interaction terms must be explicitly included in the PED file.
  • P-value Combination: The Cauchy Combination Test integrates multiple p-values, improving robustness across tests.
  • Filtering with --p-cut: Use --p-cut to filter variants by p-value, with optional LD clumping via --r2-cut.

LD Pruning

Linkage Disequilibrium (LD) pruning is a straightforward method designed to eliminate genetic variants that exhibit high LD within a defined physical distance. This process reduces redundancy in genetic datasets by retaining variants that are as independent as possible, which is crucial for downstream genetic analyses.

LD Pruning Method

The LD pruning method in this context removes variants within a specified physical distance (controlled by --window-kb) if their LD exceeds a designated threshold (set by --r2-cut). The removal of variants follows a prioritized approach based on two key criteria:

  1. Variants with More LD Connections: A variant that is in LD with a greater number of other variants within the specified window is prioritized for removal. For instance, consider three variants A, B, and C with LD r² values of 0.8 between A and B, 0.8 between B and C, and 0.64 between A and C. With a threshold between these values (e.g., --r2-cut 0.7), B exceeds the threshold with both A and C, while A and C each exceed it only with B, so B is removed and A and C are retained. This step minimizes redundancy by preserving variants with fewer LD dependencies.
  2. Variants with Lower Minor Allele Frequency (MAF): If the first criterion does not distinguish between variants (i.e., they have an equal number of LD connections), the variant with the lower MAF is removed. This ensures that variants with higher MAF, which are typically more informative, are preferentially retained.

By applying these criteria, the method maximizes the retention of variants that are both less redundant (fewer LD connections) and more informative (higher MAF).
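
The two criteria above can be sketched as a greedy loop. This is one plausible reading of the description, not KGGA's actual algorithm: the function name and data structures are hypothetical, all variants are assumed to lie within one window, and ties beyond MAF are left unresolved.

```python
def ld_prune(variants, r2, maf, r2_cut):
    """Greedy pruning sketch: drop the most connected variant until no pair exceeds r2_cut.

    variants: list of variant IDs (assumed to lie within one window)
    r2: dict mapping frozenset({a, b}) -> pairwise r²
    maf: dict mapping variant ID -> minor allele frequency
    """
    kept = set(variants)
    while True:
        # Count, for each remaining variant, the partners whose r² exceeds the cut.
        links = {v: 0 for v in kept}
        for pair, value in r2.items():
            a, b = tuple(pair)
            if a in kept and b in kept and value > r2_cut:
                links[a] += 1
                links[b] += 1
        candidates = [v for v in kept if links[v] > 0]
        if not candidates:
            return sorted(kept)
        # Remove the variant with the most links; break ties by lower MAF.
        worst = max(candidates, key=lambda v: (links[v], -maf[v]))
        kept.remove(worst)

# The A/B/C example from the text, with a threshold of 0.7 and made-up MAFs.
r2 = {
    frozenset({"A", "B"}): 0.8,
    frozenset({"B", "C"}): 0.8,
    frozenset({"A", "C"}): 0.64,
}
maf = {"A": 0.3, "B": 0.2, "C": 0.1}
print(ld_prune(["A", "B", "C"], r2, maf, r2_cut=0.7))
```

Recounting links after every removal matters: once B is gone, A and C have no remaining connections and both survive.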

Command-Line Options for LD Pruning

The following table outlines the command-line options available for LD pruning, including their descriptions and default values:

Option | Description | Default
--r2-cut | Specifies the LD threshold (r²) for pruning. Variants with an LD r² value exceeding this threshold within the defined window are candidates for removal. | [OFF]
    Format: --r2-cut
    Example: --r2-cut 0.8
--window-kb | Defines the size of the sliding window (in kilobases) for LD pruning. Variants separated by a distance greater than this window are assumed to be in linkage equilibrium, and their LD coefficients are not computed to reduce computational effort. | 10000
    Format: --window-kb
    Example: --window-kb 10000

Notes

  • Enabling LD Pruning: The --r2-cut option must be explicitly set to activate LD pruning; otherwise, this step is bypassed.
  • Window Size Impact: The --window-kb parameter determines the maximum distance over which LD is evaluated. Larger windows may yield a more thorough LD assessment but will increase computational time.
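
The effect of the window on computation can be illustrated with a short sketch that evaluates r² only for variant pairs within the window. The function names are hypothetical, and r² is approximated here by the squared Pearson correlation of 0/1/2 genotype codes; the estimator KGGA uses is not specified in this document.

```python
def genotype_r2(x, y):
    """Squared Pearson correlation between two genotype vectors (0/1/2 codes)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # monomorphic variant: correlation undefined, treated as no LD
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return sxy * sxy / (sxx * syy)

def windowed_r2(positions, genotypes, window_kb):
    """Compute r² only for pairs of variants at most window_kb apart.

    positions: base-pair coordinates on one chromosome, sorted ascending
    genotypes: one genotype vector per variant, in the same order
    """
    max_bp = window_kb * 1000
    result = {}
    for i in range(len(positions)):
        for j in range(i + 1, len(positions)):
            if positions[j] - positions[i] > max_bp:
                break  # positions are sorted, so later variants are farther still
            result[(i, j)] = genotype_r2(genotypes[i], genotypes[j])
    return result

# Three made-up variants; the third lies about 2 Mb from the first two.
positions = [1000, 5000, 2_000_000]
genotypes = [[0, 1, 2, 0], [0, 1, 2, 0], [2, 1, 0, 1]]
print(windowed_r2(positions, genotypes, window_kb=10))
```

With sorted positions, the early `break` keeps the number of evaluated pairs proportional to the variant density within the window rather than to the square of the total variant count.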

LD Clumping

LD Clumping in KGGA is an advanced linkage disequilibrium (LD) pruning method designed to reduce redundancy in genetic datasets by retaining variants with favorable characteristics. Unlike conventional LD clumping methods, such as those implemented in PLINK, which prioritize variants solely based on smaller p-values, KGGA’s LD clumping offers greater flexibility. It allows users to consider both association p-values and functional impact (e.g., gene features, CADD scores) when deciding which variants to retain. This makes it a powerful tool for genetic studies where both statistical significance and biological relevance are critical.


How LD Clumping Works

Within a specified LD pruning window (controlled by --window-kb), variants are sorted according to user-defined fields (specified via --clump). The clumping process starts with the variant having the most favorable value (e.g., the smallest p-value or highest functional score) and proceeds sequentially to variants with less favorable values. A variant is pruned if its LD r² with a more favorable, retained variant exceeds the LD threshold (set by --r2-cut).

Prioritization Criteria

The clumping process follows a hierarchical ranking system:

  1. Custom Fields (--clump): Variants are ranked based on the fields specified in --clump. For example, these could include association p-values (e.g., Assoc@CCT_P) or functional annotations (e.g., GeneFeature@MarkGeneFeature). Variants with more favorable values—such as smaller p-values or higher functional scores—are retained.
  2. LD Connections and MAF: If two variants are ranked equally based on the custom fields, the variant with more LD connections (i.e., in LD above the threshold with more other variants) is prioritized for removal. If they also have the same number of LD connections, the variant with the lower minor allele frequency (MAF) is removed.
  3. Random Selection: If all ranking conditions are equal (i.e., identical custom field values, LD connections, and MAF), one of the variants is randomly selected for pruning.

This approach favors retaining variants that are both statistically significant and biologically meaningful, offering a more nuanced alternative to traditional LD clumping.
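
A minimal sketch of the first ranking step follows, assuming each variant carries a p-value and a numeric functional score. The function name, the dictionary keys, and the variant IDs are hypothetical. The LD-connection, MAF, and random tiebreakers are omitted for brevity, so this is not a full rendering of KGGA's procedure.

```python
def ld_clump(variants, r2, r2_cut):
    """Greedy clumping sketch: keep a variant unless it is in LD with a better one.

    variants: list of dicts with keys "id", "p" (p-value), "score" (functional score)
    r2: dict mapping frozenset({id_a, id_b}) -> pairwise r² (missing pairs count as 0)
    """
    # Most favorable first: smaller p-value, then higher functional score.
    ranked = sorted(variants, key=lambda v: (v["p"], -v["score"]))
    kept = []
    for v in ranked:
        in_ld = any(
            r2.get(frozenset({v["id"], k["id"]}), 0.0) > r2_cut for k in kept
        )
        if not in_ld:
            kept.append(v)
    return [v["id"] for v in kept]

# Made-up variants and LD values, for illustration only.
variants = [
    {"id": "rs1", "p": 1e-8, "score": 3},
    {"id": "rs2", "p": 1e-6, "score": 9},
    {"id": "rs3", "p": 1e-4, "score": 5},
]
r2 = {
    frozenset({"rs1", "rs2"}): 0.9,
    frozenset({"rs2", "rs3"}): 0.85,
    frozenset({"rs1", "rs3"}): 0.2,
}
print(ld_clump(variants, r2, r2_cut=0.8))
```

In this sketch a variant is compared only against variants that were retained, so rs3 survives even though it is in LD with the already pruned rs2. Whether KGGA treats such cases the same way is not stated in this document.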


Options

The following table outlines the key command-line option for LD clumping in KGGA, including its description and default behavior:

Option | Description | Default
--clump | Specifies the genomic feature names or fields used to rank variants during clumping. | -
    Format: --clump
    Example: --clump Assoc@CCT_P,GeneFeature@MarkGeneFeature
    Explanation: In this example, variants are ranked first by their association p-value (Assoc@CCT_P), with smaller p-values preferred, and then by gene feature codes (GeneFeature@MarkGeneFeature), with higher codes indicating greater functional impact. Variants with larger p-values or less favorable gene features are pruned within the LD window.

Additional LD Pruning Options

LD Clumping in KGGA also relies on the standard LD pruning parameters to define the pruning window and LD threshold. These are the same options described under LD Pruning:

Option | Description | Default
--r2-cut | Specifies the LD threshold (r²) for clumping. Variants with LD r² exceeding this value are considered for removal. | [OFF]
    Format: --r2-cut
    Example: --r2-cut 0.8
--window-kb | Defines the sliding window size (in kilobases) for LD clumping. Variants beyond this distance are assumed to be in linkage equilibrium. | 10000
    Format: --window-kb
    Example: --window-kb 10000

Key Advantages

  • Flexibility: KGGA’s LD clumping goes beyond p-value-based pruning by incorporating functional annotations, making it ideal for studies where biological impact matters.
  • Customizable Ranking: Users can define multiple fields in --clump to create a tailored hierarchy for variant retention.
  • Balanced Pruning: The use of LD connections, MAF, and random selection as tiebreakers ensures a robust and fair pruning process when custom rankings are insufficient.
Copyright © MiaoXin Li, all rights reserved. Last modified time: 2025-04-24 07:49:09
