Connection to Third-Party Tools or Platforms

KGGA’s Connect module integrates with four third-party tools and platforms (PLINK2, PLINK, GCTA, and Python) for downstream genetic analyses. After KGGA has preprocessed large-scale genotype data (for example, cleaning and annotating it), the module either writes the result in a format the target tool reads directly (e.g., PLINK’s BED or PGEN) or keeps the data in RAM for direct use by these tools.

PLINK2

The PLINK2 integration lets KGGA write cleaned genotypes in PLINK2’s PGEN format and then launch PLINK2 directly on them, for example to run regression with its generalized linear model (--glm). For additional PLINK2 analysis options, refer to the PLINK2 documentation.

Example Usage

java -Djava.library.path="$(pip3 show jep | grep Location | awk '{print $2"/jep"}')" \
   -Xmx10g -jar kgga.jar \
   clean \
   --input-gty-file ./example/assoc.hg19.vcf.gz refG=hg19 \
   --ped-file ./example/assoc.ped \
   --output ./test/demo1 \
   --threads 4 \
   --allele-num 2~4 \
   --seq-ac 1 \
   --min-obs-rate 0.9 \
   --hwe 1E-5 \
   --plink2 "--glm genotypic interaction --covar ./example/cov.plink2.txt --parameters 1-4,7" \
            path=./tools/plink2

The --plink2 Option

The --plink2 option is a composite parameter that specifies the path to the PLINK2 executable and its analysis options.

  • path: Defines the directory containing the PLINK2 executable.
  • Analysis Options: Specifies PLINK2 parameters, such as --glm for regression analysis, as shown in the example.

PLINK

The PLINK integration generates cleaned genotypes in PLINK’s BED format and launches PLINK (version ≤ 1.9) for analysis, such as regression using its generalized linear model (--glm). For additional PLINK analysis options, refer to the PLINK documentation.

Example Usage

java -Xmx10g -jar kgga.jar \
   clean \
   --input-gty-file ./example/assoc.hg19.vcf.gz refG=hg19 \
   --ped-file https://idc.biosino.org/pmglab/resource/kgg/kgga/example/assoc.ped \
   --output ./test/demo1 \
   --threads 4 \
   --allele-num 2~4 \
   --seq-ac 1 \
   --min-obs-rate 0.9 \
   --hwe 1E-5 \
   --plink "--glm genotypic interaction --covar ./example/cov.txt --parameters 1-4,7"

The --plink Option

The --plink option is a composite parameter that configures the path to the PLINK executable and its analysis options.

  • path: (Optional) Specifies the directory containing the PLINK executable. If omitted, KGGA assumes PLINK is in the system path.
  • Analysis Options: Defines PLINK parameters, such as --glm for regression analysis, as shown in the example.

GCTA

The GCTA integration produces cleaned genotypes in PLINK’s BED format and launches GCTA to perform analyses, such as creating a genetic relationship matrix (GRM) between pairs of individuals from retained variants. For additional GCTA analysis options, refer to the GCTA documentation.

Example Usage

java -Xmx10g -jar kgga.jar \
   clean \
   --input-gty-file ./example/assoc.hg19.vcf.gz refG=hg19 \
   --ped-file https://idc.biosino.org/pmglab/resource/kgg/kgga/example/assoc.ped \
   --output ./test/demo1 \
   --threads 4 \
   --allele-num 2~4 \
   --seq-ac 1 \
   --min-obs-rate 0.9 \
   --hwe 1E-5 \
   --gcta "--make-grm --thread-num 4" \
         path=./tools/gcta64

The --gcta Option

The --gcta option is a composite parameter that specifies the path to the GCTA executable and its analysis options.

  • path: Defines the directory containing the GCTA executable.
  • Analysis Options: Specifies GCTA parameters, such as --make-grm for generating a GRM, as shown in the example.
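
To make the GRM concrete, the following is a minimal Python sketch of the commonly cited estimator, in which the relationship between individuals j and k is the average over variants of (x_ij - 2p_i)(x_ik - 2p_i) / (2p_i(1 - p_i)), where x is the alternate-allele count and p_i the allele frequency of variant i. It is an illustration only, not GCTA's implementation: the function name grm and the toy data are hypothetical, genotypes are assumed complete (no missing calls), monomorphic variants are skipped, and the same formula is applied to the diagonal, which GCTA may treat differently.

```python
def grm(dosages):
    """Illustrative genetic relationship matrix.

    dosages: list of variants, each a list of per-sample
             alternate-allele counts (0, 1 or 2), no missing values.
    Returns a samples x samples matrix as nested lists.
    """
    n_samples = len(dosages[0])
    total = [[0.0] * n_samples for _ in range(n_samples)]
    used = 0  # number of polymorphic variants that contributed

    for row in dosages:
        p = sum(row) / (2.0 * n_samples)  # alternate-allele frequency
        if p <= 0.0 or p >= 1.0:
            continue  # monomorphic variant: variance is zero, skip it
        denom = 2.0 * p * (1.0 - p)
        centered = [x - 2.0 * p for x in row]
        for j in range(n_samples):
            for k in range(j + 1):
                value = centered[j] * centered[k] / denom
                total[j][k] += value
                if k != j:
                    total[k][j] += value  # keep the matrix symmetric
        used += 1

    if used == 0:
        raise ValueError("no polymorphic variants to build a GRM from")
    return [[value / used for value in row] for row in total]


# Toy data: 2 variants x 3 samples (hypothetical).
toy_grm = grm([[0, 1, 2], [2, 1, 0]])
```

In practice the matrix written by --make-grm should be used for downstream GCTA analyses; the sketch is only meant to show what each entry measures.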

Python

The Python integration enables users to access cleaned genotype data in GTB format through KGGA’s Java APIs, allowing direct manipulation in RAM for custom Python analyses. This requires proficiency in both Java and Python programming.

Example Usage

The following Java code demonstrates how to read cleaned genotypes in GTB format and reformat them for use in a Python function:

// Initialize a global Python interpreter instance.
GlobalPythonInterpreter python = new GlobalPythonInterpreter();

// Initialize a GTBReader to read from a specific GTB (Genotype Block) file.
// This file likely contains genetic variant data.
GTBReader reader = new GTBReader("./test1/GenerateAnnotationBaseTask/variants.annot.gty.hg38.gtb");

// Get the total number of variants from the reader.
int varNum = (int) reader.numOfVariants();

// Initialize an array to store variant coordinates (e.g., "chr1:12345").
String[] coordinates = new String[varNum];

// Initialize a 2D array to store genotype data for each variant.
// The second dimension will be sized per variant later.
int[][] genotypes = new int[varNum][];

// Initialize a counter for the current variant ID/index.
int varID = 0;

// Loop through each variant in the GTB file.
while (reader.hasNext()) {
    // Read the next variant record.
    Variant variant = reader.read();

    // Get the genotype codes for the current variant.
    // tmpGty is likely a 2D array, possibly [alleles_per_sample][samples] or [ploidy_level][samples].
    // For diploid organisms, it's often [2][number_of_samples], where tmpGty[0] is the first allele
    // and tmpGty[1] is the second allele for each sample.
    int[][] tmpGty = variant.getGenotypes().getGenotypeCodes();

    // Initialize the genotype array for the current variant, sized by the number of samples.
    // Assumes tmpGty[0].length gives the number of samples.
    genotypes[varID] = new int[tmpGty[0].length];

    // For each sample, sum the two allele codes.
    // This typically converts diploid genotypes (e.g., 0/0, 0/1, 1/1) into a count of
    // the alternate allele (e.g., 0, 1, 2).
    for (int i = 0; i < tmpGty[0].length; i++) {
        genotypes[varID][i] = tmpGty[0][i] + tmpGty[1][i];
    }

    // Get and store the coordinate of the current variant as a string.
    coordinates[varID] = variant.getCoordinate().toString();

    // Increment the variant ID/index.
    varID++;
}

// Close the GTBReader to release resources.
reader.close();

// --- Python Interaction ---

// Execute Python code: import the NumPy library as 'np'.
python.exec("import numpy as np");

// Pass the Java 'genotypes' array to Python as a variable named 'gty'.
python.setValue("gty", genotypes);

// Pass the Java 'coordinates' array to Python as a variable named 'var'.
python.setValue("var", coordinates);

// Call a user-defined Python function named 'prs' (for example, a polygenic risk score).
// Neither 'prs' nor 'model' is defined in this snippet: both must already exist in the
// interpreter, e.g. defined or imported by an earlier python.exec(...) call.
// The 'var' array set above is likewise available to any Python code run here.
python.exec("prs(gty, model)");

// Close the Python interpreter and release its resources.
python.close();

This code reads GTB data, extracts genotypes and coordinates, and passes them to a Python function (prs) for analysis.
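
The Java snippet calls prs(gty, model) without showing either name. As a hedged illustration of what the Python side could look like, the sketch below defines a simple additive score: for each sample, the sum over variants of weight times allele count. Everything here is an assumption rather than part of KGGA: the body of prs, the choice to represent model as a plain list of per-variant weights in the same order as the rows of gty, and the toy values. It also assumes gty is indexable as variants x samples (as built in the Java loop) and contains no missing-genotype codes.

```python
def prs(gty, model):
    """Hypothetical additive polygenic score.

    gty:   variants x samples allele counts (rows align with 'model').
    model: one effect weight per variant (assumed representation).
    Returns one score per sample.
    """
    if len(gty) != len(model):
        raise ValueError("model must have one weight per variant in gty")

    n_samples = len(gty[0]) if len(gty) > 0 else 0
    scores = [0.0] * n_samples

    for row, weight in zip(gty, model):
        for sample_index, allele_count in enumerate(row):
            scores[sample_index] += weight * allele_count
    return scores


# Toy data: 2 variants x 3 samples with made-up weights.
example_gty = [[0, 1, 2], [2, 1, 0]]
example_model = [0.5, -1.0]
example_scores = prs(example_gty, example_model)
```

A function like this would need to be defined in the interpreter before the Java code runs python.exec("prs(gty, model)"), for instance through an earlier python.exec call that defines or imports it.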


Copyright © MiaoXin Li. All rights reserved. Last modified: 2025-05-16 01:23:49
