Methods & Options

The predict module predicts phenotypes from genotypes and covariates. Several methods are available under this module, and contributions of new methods built on this infrastructure are welcome.

Polygenic Risk Prediction Using Genotype Data

ELAG

ELAG is a machine learning framework designed for polygenic risk prediction from high-dimensional genomic data. The workflow begins with genotype data that has undergone standard quality control to preserve a large variant pool. ELAG’s core is a policy-gradient-inspired algorithm that dynamically samples variant subsets, which are then used to train an ensemble of base classifiers. The collective performance of this ensemble provides a reward signal that iteratively refines the sampling policy. The final output is a trained ensemble model for predicting binary disease status (case/control).
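
The sampling loop described above can be illustrated with a small toy sketch. It is not ELAG's implementation: the nearest-centroid classifier, the REINFORCE-style heuristic update (not an exact policy gradient), the mean-accuracy reward, the running baseline, and every function name and parameter below are assumptions chosen for brevity. Rewards are computed on the training data itself, whereas a real workflow would use held-out folds.

```python
import math
import random


def softmax(logits):
    """Turn per-variant logits into sampling probabilities."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sample_subset(probs, k, rng):
    """Draw k distinct variant indices, weighted by the current policy."""
    pool = list(range(len(probs)))
    weights = list(probs)
    chosen = []
    for _ in range(k):
        pos = rng.choices(range(len(pool)), weights=weights, k=1)[0]
        chosen.append(pool.pop(pos))
        weights.pop(pos)
    return chosen


def centroid_accuracy(X, y, cols):
    """Toy base classifier: nearest class centroid on the sampled columns."""
    def centroid(label):
        rows = [r for r, t in zip(X, y) if t == label]
        return [sum(r[c] for r in rows) / len(rows) for c in cols]

    c0, c1 = centroid(0), centroid(1)
    hits = 0
    for row, label in zip(X, y):
        d0 = sum((row[c] - m) ** 2 for c, m in zip(cols, c0))
        d1 = sum((row[c] - m) ** 2 for c, m in zip(cols, c1))
        hits += int((1 if d1 < d0 else 0) == label)
    return hits / len(y)


def train_policy(X, y, k=2, n_models=5, epochs=20, lr=0.5, seed=0):
    """Return final sampling probabilities after reward-driven updates."""
    rng = random.Random(seed)
    logits = [0.0] * len(X[0])
    baseline = 0.5  # running average of past rewards (assumed starting value)
    for _ in range(epochs):
        policy = softmax(logits)
        subsets = [sample_subset(policy, k, rng) for _ in range(n_models)]
        # Ensemble reward: mean accuracy of the base classifiers (a simplification).
        reward = sum(centroid_accuracy(X, y, s) for s in subsets) / n_models
        # Raise logits of sampled variants when the reward beats the baseline.
        for subset in subsets:
            for c in subset:
                logits[c] += lr * (reward - baseline)
        baseline = 0.9 * baseline + 0.1 * reward
    return softmax(logits)


def make_toy_data(n=40, n_variants=6, seed=1):
    """Hypothetical data: variant 0 tracks the label, the rest are noise."""
    rng = random.Random(seed)
    y = [i % 2 for i in range(n)]
    X = []
    for label in y:
        informative = 2 * label if rng.random() < 0.9 else rng.randint(0, 2)
        X.append([informative] + [rng.randint(0, 2) for _ in range(n_variants - 1)])
    return X, y


X, y = make_toy_data()
print([round(p, 3) for p in train_policy(X, y)])
```

The printed list is the learned sampling probability of each toy variant. In this sketch a single reward is shared by all subsets drawn in an epoch, which mirrors the idea of ensemble-level feedback in the simplest possible way.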

Citations

Lihang Ye, Nan Lin, Yin Luo, …, Miaoxin Li. Policy gradient-guided ensemble learning for enhanced polygenic risk prediction in ultra-high-dimensional genomics. Preprint, medRxiv.

Functionality

ELAG enhances disease risk prediction through:

1) Adaptive variant sampling guided by a novel policy gradient algorithm that leverages feedback from machine learning classifier performance;

2) An ensemble learning framework that effectively addresses the challenges of ultra-high-dimensional genomic data;

3) Dual functionality, serving not only as a prediction tool but also as an upstream module that selects genetic variants to boost the performance of existing polygenic risk score (PRS) methods;

4) Integration of environmental factors alongside genetic variants to enhance prediction accuracy.

Options and Examples

Below is an example command that trains and tests ELAG on genomic data. If functional annotations such as CADD scores or external GWAS summary statistics are available, they can be supplied as additional information. The output includes individual scores and the prediction performance on the test set.

ELAG uses Python as its computational engine, so make sure that KGGA is properly configured to communicate with Python. We recommend using Docker to streamline this integration. For Docker configuration details, please refer to RServerDocker.

java -Xmx10g -jar kgga.jar \
    predict \
       --input-gty-file ./wgs/comm/variants.maf05.hg38.gtb \
       --ped-file ./AD/merged_all.fam \
       --output ./test/predict/AD \
       --threads 20 \
       --variant-annotation-database cadd \
                                    field=ProteinFunction@CADD_RawScore \
       --min-obs-rate 0.8 \
       --allele-num 2~4 \
       --local-maf 0.01~0.5 \
       --sum-file ./AD/gwas.summary.txt.gz \
                  cp12Cols=chromosome,base_pair_location,effect_allele,other_allele \
                  pbsCols=p_value \
                  refG=hg19 \
       --elag functionScore=ProteinFunction@CADD_RawScore \
              maxEpoch=20 \
              crossFold=5 \
              baggingNumber=10 \
              variantSampleSize=5000 \
              ignoreGty=n \
              permutNum=100 \
       --assign-sample trainingSample=./AD/merged_data_with_age_sex.fam \
                       testingSample=./AD/merged_test_data_with_age_sex.fam \
                       pheno=disease \
                       covar=age,sex

Note: The phenotype to analyze is read from the file given by --ped-file and selected with pheno=<name> under --assign-sample (pheno=disease in the example above). If no phenotype is named, the first phenotype (the 6th column) is used by default.

Key Command-Line Options

--input-gty-file /path/to/genotype.gtb
Path to the input genotype file (e.g., GTB format).
--ped-file /path/to/pedigree_or_phenotype.fam
Provides sample IDs, pedigree information, and phenotypes.
Note: The phenotype used can be specified together with --assign-sample pheno=<name>; otherwise the first phenotype column (6th column in .fam) is used by default.
--output /path/to/output_prefix_or_dir
Output directory or file prefix for all result files.
--threads <int>
Number of CPU threads to use.
--variant-annotation-database <db> field=<field1>,<field2>,...
Selects the variant annotation source and the fields to import as features.
• Example:
--variant-annotation-database cadd field=ProteinFunction@CADD_PHRED,ProteinFunction@CADD_RawScore
--min-obs-rate <float>
Minimum per-variant call rate (proportion of non-missing genotypes) required to keep a variant.
• Example: --min-obs-rate 0.8 keeps variants with ≥80% observed genotypes.
--allele-num <low>~<high>
Keeps variants whose number of alleles falls within the given range.
• Example: 2~4 keeps bi-allelic up to quad-allelic sites.
--local-maf <low>~<high>
Filters variants by minor allele frequency computed in the current dataset.
• Example: 0.01~0.5 keeps variants with MAF between 1% and 50%.
--sum-file /path/to/sumstats.txt[.gz] cp12Cols=<chr,pos,effect_allele,other_allele> pbsCols=<pval_col> refG=<build>
Imports external GWAS summary statistics and maps the required columns; refG sets the reference genome build of the summary file.
• Example:
--sum-file … cp12Cols=chromosome,base_pair_location,effect_allele,other_allele pbsCols=p_value refG=hg19
--assign-sample trainingSample=/path/to/train.fam testingSample=/path/to/test.fam pheno=<phenotype_name> covar=<cov1,cov2,...>
Explicitly assigns training/testing sample sets, selects the phenotype column to analyze, and specifies covariates to adjust (e.g., age,sex).
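
To make the three variant filters concrete, here is a minimal sketch of how a single variant could be checked against --min-obs-rate, --allele-num, and --local-maf. It assumes genotypes coded as alternate-allele counts (0, 1, 2) with None for a missing call, treats both range bounds as inclusive, and computes MAF as for a bi-allelic site. The function names are hypothetical, and KGGA's internal computation may differ.

```python
def call_rate(genotypes):
    """Proportion of non-missing genotypes (None marks a missing call)."""
    return sum(g is not None for g in genotypes) / len(genotypes)


def minor_allele_frequency(genotypes):
    """MAF from alternate-allele counts, using observed genotypes only."""
    observed = [g for g in genotypes if g is not None]
    if not observed:
        return 0.0
    alt_freq = sum(observed) / (2 * len(observed))
    return min(alt_freq, 1 - alt_freq)


def keep_variant(genotypes, n_alleles, min_obs_rate=0.8,
                 allele_num=(2, 4), local_maf=(0.01, 0.5)):
    """Apply the three filters; defaults mirror the example command."""
    return (
        call_rate(genotypes) >= min_obs_rate
        and allele_num[0] <= n_alleles <= allele_num[1]
        and local_maf[0] <= minor_allele_frequency(genotypes) <= local_maf[1]
    )


# Hypothetical variant: 10 samples, one missing call, four alternate alleles.
variant = [0, 1, 2, None, 0, 0, 1, 0, 0, 0]
print(call_rate(variant), round(minor_allele_frequency(variant), 3))
print(keep_variant(variant, n_alleles=2))
```

In this toy case the call rate is 9/10 and the MAF is 4/18, so the variant passes all three filters.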

The `--elag` Option

The --elag option is a composite parameter that configures the ELAG module (policy-gradient-guided adaptive variant sampling plus a bagged ensemble).

Sub-Options

• functionScore=<annotation_field>: annotation score used to guide sampling (e.g., ProteinFunction@CADD_RawScore)
• maxEpoch=<int>: maximum training epochs for the policy
• crossFold=<k>: k-fold cross-validation within training
• baggingNumber=<int>: number of bootstrap models in the ensemble
• variantSampleSize=<int>: number of variants sampled per iteration/epoch
• ignoreGty=y|n: whether to ignore individual-level genotypes (n means use genotypes)
• permutNum=<int>: number of phenotype-label permutations for empirical significance assessment
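
As background for permutNum, the sketch below shows a generic label-permutation test: the phenotype labels are shuffled, the score is recomputed, and the empirical p-value is the share of permuted scores at least as large as the observed one (with the common add-one correction). This documentation does not describe how ELAG carries out the procedure, so the function names, the add-one correction, and the use of a fixed score function instead of retraining per permutation are all assumptions.

```python
import random


def empirical_p_value(observed, null_scores):
    """Add-one corrected share of null scores >= the observed score."""
    exceed = sum(s >= observed for s in null_scores)
    return (exceed + 1) / (len(null_scores) + 1)


def permutation_null(labels, score_fn, n_perm=100, seed=0):
    """Score n_perm random shufflings of the phenotype labels."""
    rng = random.Random(seed)
    null = []
    for _ in range(n_perm):
        shuffled = list(labels)
        rng.shuffle(shuffled)
        null.append(score_fn(shuffled))
    return null


# Hypothetical predictions and labels for eight individuals.
predicted = [1, 1, 1, 0, 0, 0, 1, 0]
labels = [1, 1, 1, 0, 0, 0, 0, 1]


def accuracy(candidate_labels):
    """Agreement between the fixed predictions and a label vector."""
    return sum(p == t for p, t in zip(predicted, candidate_labels)) / len(predicted)


null = permutation_null(labels, accuracy, n_perm=100)
print(empirical_p_value(accuracy(labels), null))
```

With n_perm permutations the smallest attainable p-value under this correction is 1 / (n_perm + 1), which is one reason to choose a larger permutation count when finer resolution is needed.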

Format

--elag functionScore=<field> [maxEpoch=<int>] [crossFold=<k>] [baggingNumber=<int>] [variantSampleSize=<int>] [ignoreGty=y|n] [permutNum=<int>]

Example

--elag functionScore=ProteinFunction@CADD_RawScore maxEpoch=20 crossFold=5 baggingNumber=10 variantSampleSize=5000 ignoreGty=n permutNum=100

Output File Description

ELAG generates four main output files:

Model Save List: Saves the best model trained.
Predict Result: The output values of each individual in the test set from every predictor.
Test Metrics: The evaluation metrics of the entire test set.
Validate Metrics: The evaluation metrics of the validation set in each fold.

Example Predict Result

The *predict_result.tsv file contains the following columns:

IID: Individual ID.
1, 2, …, N: The output of each classifier, one column per classifier (numbered 1 to N).
mean: The mean value of all classifier outputs.

IID 1   2   3   ... 50  mean 
1015905 0.3278  0.3446  0.4327  ... 0.6307  0.4914
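
A file with this layout can be read with the Python standard library. The sketch below assumes tab-separated columns and strips stray whitespace from the header names; the sample content and the function name are hypothetical, not real ELAG output.

```python
import csv
import io


def load_predictions(handle):
    """Map each IID to (per-classifier outputs, reported mean)."""
    reader = csv.DictReader(handle, delimiter="\t")
    # Header names may carry trailing spaces, so normalize them first.
    reader.fieldnames = [name.strip() for name in reader.fieldnames]
    table = {}
    for row in reader:
        outputs = [float(v) for name, v in row.items()
                   if name not in ("IID", "mean")]
        table[row["IID"]] = (outputs, float(row["mean"]))
    return table


# Hypothetical three-classifier file.
example = (
    "IID\t1\t2\t3\tmean \n"
    "S001\t0.30\t0.40\t0.50\t0.40\n"
    "S002\t0.70\t0.80\t0.90\t0.80\n"
)
for iid, (outputs, reported) in load_predictions(io.StringIO(example)).items():
    recomputed = sum(outputs) / len(outputs)
    print(iid, round(recomputed, 4), reported)
```

Recomputing the mean from the per-classifier columns is a quick consistency check before using the mean column as the individual score.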

Example Test Metrics

The *test_metrics.tsv file contains columns such as:

epoch: Training epoch number.
fold: Fold number; mean indicates the average across folds.
auc_roc: Area Under the ROC Curve (AUC-ROC). Higher values indicate better separation between positive and negative classes.
auc_pr: Area Under the Precision-Recall Curve (AUC-PR). More informative than AUC-ROC when classes are imbalanced.
ks: Kolmogorov–Smirnov statistic. The maximum difference between the cumulative distributions of predicted scores for the positive and negative classes.
best_accuracy: Accuracy at the optimal threshold.
best_threshold: Threshold value that maximizes accuracy.
best_recall: Recall at the optimal threshold.
best_precision: Precision at the optimal threshold.
best_f1: F1 score at the optimal threshold.
best_mcc: Matthews Correlation Coefficient. A balanced metric even for imbalanced datasets.

epoch   fold    sample  auc_roc auc_pr  ks  best_accuracy   best_threshold  best_recall best_precision  best_f1 best_mcc
1   mean    mean    0.7266  0.7127  0.3841  0.6890  0.5 0.6524  0.7039  0.6772  0.3791
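
For readers who want to see how such metrics relate to raw scores, the sketch below computes AUC-ROC, the KS statistic, the accuracy-maximizing threshold, and MCC in plain Python using their textbook definitions. Whether ELAG uses exactly these formulas, tie handling, or threshold grid is not stated here, so treat the functions (all hypothetically named) as illustrations only.

```python
import math


def auc_roc(labels, scores):
    """Probability that a random case outscores a random control (ties = 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def confusion(labels, scores, threshold):
    """Counts (tp, fp, tn, fn) when score >= threshold is called a case."""
    tp = fp = tn = fn = 0
    for y, s in zip(labels, scores):
        if s >= threshold:
            tp, fp = tp + (y == 1), fp + (y == 0)
        else:
            tn, fn = tn + (y == 0), fn + (y == 1)
    return tp, fp, tn, fn


def ks_statistic(labels, scores):
    """Largest gap between true-positive and false-positive rates."""
    best = 0.0
    for t in set(scores):
        tp, fp, tn, fn = confusion(labels, scores, t)
        best = max(best, abs(tp / (tp + fn) - fp / (fp + tn)))
    return best


def best_accuracy(labels, scores):
    """(accuracy, threshold) at the lowest accuracy-maximizing threshold."""
    candidates = []
    for t in sorted(set(scores)):
        tp, fp, tn, fn = confusion(labels, scores, t)
        candidates.append(((tp + tn) / len(labels), t))
    return max(candidates, key=lambda pair: pair[0])


def mcc(tp, fp, tn, fn):
    """Matthews correlation coefficient; 0.0 when it is undefined."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


# Hypothetical labels (1 = case) and predicted scores.
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
accuracy, threshold = best_accuracy(labels, scores)
print(auc_roc(labels, scores), ks_statistic(labels, scores), accuracy, threshold)
print(round(mcc(*confusion(labels, scores, threshold)), 4))
```

Here the candidate thresholds are the distinct observed scores, and ties in accuracy resolve to the lowest threshold; both are design choices of this sketch.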

Example Validate Metrics

The *validate_metrics.tsv file contains columns such as:

epoch: Training epoch number.
fold: Fold number; mean indicates the average across folds.
auc_roc: Area Under the ROC Curve (AUC-ROC). Higher values indicate better separation between positive and negative classes.
auc_pr: Area Under the Precision-Recall Curve (AUC-PR). More informative than AUC-ROC when classes are imbalanced.
ks: Kolmogorov–Smirnov statistic. The maximum difference between the cumulative distributions of predicted scores for the positive and negative classes.
best_accuracy: Accuracy at the optimal threshold.
best_threshold: Threshold value that maximizes accuracy.
best_recall: Recall at the optimal threshold.
best_precision: Precision at the optimal threshold.
best_f1: F1 score at the optimal threshold.
best_mcc: Matthews Correlation Coefficient. A balanced metric even for imbalanced datasets.

epoch   fold    sample  auc_roc auc_pr  ks  best_accuracy   best_threshold  best_recall best_precision  best_f1 best_mcc
1   1   mean    0.6088  0.6039  0.2014  0.5989  0.55    0.4846  0.6283  0.5472  0.2033