Phenotype and gene expression simulation

The simulate module in KGGA simulates phenotypes (binary or quantitative traits) from real genotype data. It is particularly suited to methodological studies in which multiple genetic loci or genes influence a phenotype under a polygenic model. Using real genotypes preserves the natural genetic structure, namely linkage disequilibrium (LD) patterns, allele frequencies, and gene boundaries, which makes the simulations more realistic. This strategy requires a large sample (e.g., n > 500,000) so that a given heritability can be controlled accurately.

About

KGGA's simulate module leverages high-performance computing to generate synthetic phenotypes from large-scale genotype datasets. It supports polygenic models where genetic effects are distributed across numerous variants, enabling researchers to test statistical methods, evaluate power in GWAS, or model complex trait architectures. By integrating with KGGA's other modules, users can preprocess genotypes to focus on specific variant sets (e.g., rare variants or gene regions) before simulation.

What the Module Does

The simulate module enables rapid simulations with large genotype datasets. It allows users to:

Simulate a binary or quantitative phenotype given heritability, sample sizes, and sample numbers.
Simulate two related phenotypes given causal effects, heritability, sample sizes, and sample numbers.
Simulate space-time specific expression of genes and phenotypes regulated by the genes given heritability, gene number, specific expression-related parameters, sample sizes, and sample numbers.

These features support advanced scenarios, such as modeling pleiotropy or spatiotemporal gene regulation in transcriptomic studies.

Workflow of the Simulate Module

This module follows a structured workflow, beginning with data preparation and culminating in phenotype simulation. The key stages are as follows:

Input and Preprocessing: The module accepts genotypes from whole-genome variants in VCF or GTB formats. Before simulation, users can leverage other KGGA modules to perform quality control (clean), annotate variants with gene and genomic features (annotate), and reduce redundancy based on linkage disequilibrium (prune).
Phenotype Simulation: Given the heritability of phenotypes or genes, KGGA simulates phenotypes, and gene expression if required, for all genotyped input subjects. Simulations account for polygenic risk, environmental noise, and user-defined parameters to ensure realistic variance partitioning.
Phenotype Sampling: The module randomly assigns subjects with simulated phenotypes to multiple samples according to the preset sample size and sample number. Subjects are drawn at random without replacement, which supports replication studies and subsampling analyses.
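
The simulation and sampling stages above can be illustrated with a minimal Python sketch. It is not KGGA's implementation: it assumes a simple additive model in which the phenotype is a genetic value plus Gaussian noise, with the noise variance chosen so that the genetic variance makes up the target heritability, and it assumes that the samples are disjoint. The function names (simulate_quantitative, split_into_samples), the toy genotype matrix, and the effect sizes are all hypothetical; a real run would start from real genotypes.

```python
import random
import statistics


def simulate_quantitative(genotypes, effects, h2, rng):
    """Simulate a standardized quantitative trait under an additive model.

    genotypes: one list of allele counts (0/1/2) per subject.
    effects:   one effect size per variant (hypothetical values).
    h2:        target heritability, 0 < h2 <= 1.
    """
    if not 0.0 < h2 <= 1.0:
        raise ValueError("h2 must be in (0, 1]")
    # Genetic value: effect-weighted sum of allele counts.
    g = [sum(b * x for b, x in zip(effects, row)) for row in genotypes]
    var_g = statistics.pvariance(g)
    # Choose Var(e) so that Var(g) / (Var(g) + Var(e)) equals h2.
    sd_e = (var_g * (1.0 - h2) / h2) ** 0.5
    y = [gi + rng.gauss(0.0, sd_e) for gi in g]
    # Standardize to mean 0 and variance 1.
    mu = statistics.fmean(y)
    sd = statistics.pstdev(y)
    return [(yi - mu) / sd for yi in y]


def split_into_samples(n_subjects, sample_size, sample_number, rng):
    """Draw subject indices without replacement into disjoint samples."""
    needed = sample_size * sample_number
    if needed > n_subjects:
        raise ValueError("not enough subjects for the requested samples")
    drawn = rng.sample(range(n_subjects), needed)
    return [drawn[i * sample_size:(i + 1) * sample_size]
            for i in range(sample_number)]


# Toy data only: random allele counts stand in for real genotypes.
rng = random.Random(0)
n_subjects, n_variants = 2000, 20
genotypes = [[rng.choice((0, 1, 2)) for _ in range(n_variants)]
             for _ in range(n_subjects)]
effects = [rng.gauss(0.0, 1.0) for _ in range(n_variants)]

phenotype = simulate_quantitative(genotypes, effects, h2=0.5, rng=rng)
samples = split_into_samples(n_subjects, sample_size=500, sample_number=3, rng=rng)
```

In a finite sample the realized share of genetic variance deviates from the target because the noise is random, which is one reason a large input sample helps when a given heritability must be controlled accurately.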

Output and Format

Upon completion, the simulate module generates multiple output files designed for subsequent analysis.

Case/Control Samples: One pedigree file per sample, each containing case/control subjects. The number of files equals the sample number. Files are in PED format for compatibility with tools like PLINK.
Quantitative Phenotype Samples: One pedigree file per sample, each containing subjects with quantitative traits. The number of files equals the sample number. Phenotypes are standardized (mean=0, variance=1) unless otherwise specified.
Gene Expression Profiles: A number of tab-separated (TSV) files containing the simulated gene expression. Typically, each row represents a gene (labeled by gene symbol), and each column represents an expression condition, such as a tissue space grid (for spatial transcriptomic data) or a time grid (temporal transcriptomic data). Files include metadata headers for heritability and simulation parameters.

All outputs are stored in the specified output directory, with optional compression (e.g., .gz) for large files.
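
As a quick sanity check on a case/control output file, the following Python sketch tallies phenotype codes. It assumes the standard PLINK PED layout, in which the sixth whitespace-separated column holds the phenotype (1 = control, 2 = case, 0 or -9 = missing); whether KGGA's pedigree files use exactly this coding should be confirmed against an actual output file. The function name count_ped_phenotypes and the example lines are hypothetical.

```python
def count_ped_phenotypes(lines):
    """Tally case/control status from PED-style lines.

    Assumes the PLINK convention: the sixth column is the phenotype,
    coded 1 = control, 2 = case, 0 or -9 = missing.
    """
    counts = {"case": 0, "control": 0, "missing": 0}
    for line in lines:
        fields = line.split()
        if len(fields) < 6:
            continue  # skip blank or malformed lines
        code = fields[5]
        if code == "2":
            counts["case"] += 1
        elif code == "1":
            counts["control"] += 1
        else:
            counts["missing"] += 1
    return counts


# Hypothetical example lines: FID IID PAT MAT SEX PHENO genotypes...
example_lines = [
    "F1 S1 0 0 1 2 A A G G",
    "F2 S2 0 0 2 1 A C G G",
    "F3 S3 0 0 1 -9 A A G T",
    "F4 S4 0 0 2 2 C C T T",
]
ped_counts = count_ped_phenotypes(example_lines)
```

For a real file, the same function can be applied to an open file handle, since it only iterates over lines and reads the first six columns.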

Basic Usage

To execute the simulate module, use the following command structure in a terminal or shell environment:

java -jar kgga.jar simulate --input <input1> --input <input2> --output <output> [options]

Simulate