Phenotype Prediction

About

The predict module in KGGA builds and applies models that predict phenotypes from genotypes and covariates. It is intended for genomic research and clinical applications, such as prioritizing individuals at high disease risk based on their genetic profiles. Covariates such as age and sex can be included in the models to improve prediction accuracy.

What the Module Does

The predict module builds phenotype prediction models and applies them to new samples. It allows users to:

Leverage both genotype data (from VCF or GTB formats) and supplementary covariate information.
Use the output of upstream KGGA modules (clean, annotate, and prune) as quality-controlled input data.
Flexibly partition datasets into distinct training, testing, and prediction samples for robust model development and validation.
Train prediction models with machine learning methods, starting with ELAG (Ensemble Learning with Adaptive Sampling Guided by Policy Gradient).
Generate portable prediction models and detailed prediction scores for further analysis or clinical interpretation.

Workflow of the Predict Module

The workflow runs from data preparation to phenotype prediction in four stages:

Input and Preprocessing: The module accepts genotypes from whole-genome variants in VCF or GTB formats. Before prediction, users can leverage other KGGA modules to perform quality control (clean), annotate variants with gene and genomic features (annotate), and reduce redundancy based on linkage disequilibrium (prune).
Sample Partitioning: Sample sets are specified explicitly. Users define a training sample for model construction, an optional testing sample for model evaluation, and a separate prediction sample for which phenotype predictions are desired.
Model Training: Using the specified training data, the module builds a predictive model. Currently, a method named ELAG (Ensemble Learning with Adaptive Sampling Guided by Policy Gradient) is available for disease risk prediction. Support for more methods will be added in the future.
Prediction: The trained model is then applied to the prediction sample to generate risk scores or phenotype values for each individual.
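The sample partitioning stage can be illustrated with a minimal Python sketch. It is not part of KGGA and does not show how KGGA itself receives the sample sets. It assumes, purely for illustration, that individuals with a known phenotype are split into training and testing samples and that individuals with a missing phenotype form the prediction sample. The function name `partition_samples`, the `test_fraction` parameter, and the sample IDs are hypothetical.

```python
import random
from typing import Dict, List, Optional


def partition_samples(
    phenotypes: Dict[str, Optional[float]],
    test_fraction: float = 0.2,
    seed: int = 0,
) -> Dict[str, List[str]]:
    """Split sample IDs into disjoint training, testing, and prediction sets.

    Assumption (not from the KGGA documentation): samples whose phenotype
    is None are the ones to predict; all others are available for training
    and testing.
    """
    if not 0.0 <= test_fraction < 1.0:
        raise ValueError("test_fraction must be in [0, 1)")

    labeled = sorted(s for s, p in phenotypes.items() if p is not None)
    unlabeled = sorted(s for s, p in phenotypes.items() if p is None)

    # A seeded shuffle keeps the split reproducible between runs.
    random.Random(seed).shuffle(labeled)
    n_test = round(len(labeled) * test_fraction)

    return {
        "training": labeled[n_test:],
        "testing": labeled[:n_test],  # empty when test_fraction is 0
        "prediction": unlabeled,
    }


# Hypothetical cohort: ten phenotyped individuals, three without a phenotype.
cohort: Dict[str, Optional[float]] = {f"S{i}": float(i % 2) for i in range(10)}
cohort.update({"U1": None, "U2": None, "U3": None})

sets = partition_samples(cohort, test_fraction=0.2, seed=42)
for name, ids in sets.items():
    print(name, len(ids), ids)
```

Keeping the three sets disjoint matters because evaluating the model on individuals it was trained on would overstate its accuracy. Setting `test_fraction` to 0 mirrors the optional nature of the testing sample.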

Output and Format

Upon completion, the predict module writes three output files for model evaluation and downstream use:

Trained Prediction Model: A serialized binary file containing the complete, trained model. This portable file can be loaded for external use or to predict phenotypes on new datasets without the need for retraining.
Model Variant List: A tab-separated (TSV) file listing the variants and covariates used to build the prediction model. This file often includes feature importance scores or model weights, which indicate how much each feature contributes to the prediction.
Prediction Values: A tab-separated (TSV) file containing the prediction results for the prediction sample. It typically includes individual IDs along with their calculated prediction values or risk scores, ready for further statistical analysis or individual prioritization.
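As a hedged illustration of downstream use, the Python sketch below ranks individuals by prediction value from a tab-separated file. The column names `SampleID` and `Score` and the example rows are assumptions made for this sketch; the actual header written by the predict module should be checked in the output file and passed through `id_col` and `score_col`.

```python
import csv
import io
from typing import List, Optional, TextIO, Tuple


def rank_by_score(
    handle: TextIO,
    id_col: str = "SampleID",   # hypothetical column name
    score_col: str = "Score",   # hypothetical column name
    top_n: Optional[int] = None,
) -> List[Tuple[str, float]]:
    """Return (individual ID, score) pairs sorted from highest to lowest score."""
    reader = csv.DictReader(handle, delimiter="\t")
    rows = [(row[id_col], float(row[score_col])) for row in reader]
    rows.sort(key=lambda pair: pair[1], reverse=True)
    return rows if top_n is None else rows[:top_n]


# Made-up example content standing in for a real prediction file.
example_tsv = "SampleID\tScore\nS1\t0.12\nS2\t0.87\nS3\t0.45\n"

ranked = rank_by_score(io.StringIO(example_tsv), top_n=2)
for sample_id, score in ranked:
    print(f"{sample_id}\t{score:.2f}")
```

For a real file, an opened file object can be passed instead of the `io.StringIO` wrapper, for example `with open(path, newline="") as handle: rank_by_score(handle)`.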

Basic Usage

To execute the predict module, use the following command structure in a terminal or shell environment:

java -jar kgga.jar predict --input <input1> --input <input2> --output <output> [options]
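When the command is launched from a script, the argument list can be assembled programmatically. The following Python sketch uses only the elements shown in the command structure above (`predict`, repeated `--input`, and `--output`). The helper name `build_predict_command` and the placeholder paths are hypothetical, and no additional KGGA options are assumed; anything passed through `extra_options` must come from the KGGA documentation.

```python
import shlex
from typing import List, Sequence


def build_predict_command(
    inputs: Sequence[str],
    output: str,
    jar: str = "kgga.jar",
    extra_options: Sequence[str] = (),
) -> List[str]:
    """Assemble the argument list for a KGGA predict run without executing it."""
    if not inputs:
        raise ValueError("at least one input is required")

    cmd = ["java", "-jar", jar, "predict"]
    for path in inputs:
        # Each input gets its own --input flag, as in the command structure.
        cmd += ["--input", path]
    cmd += ["--output", output]
    cmd += list(extra_options)
    return cmd


# Placeholder paths; replace them with real input and output locations.
cmd = build_predict_command(["first_input", "second_input"], "output_dir")
print(shlex.join(cmd))
```

Building the command as a list rather than a single string avoids quoting problems with paths that contain spaces. The list can be passed to `subprocess.run(cmd, check=True)` once the paths point to real files.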