Simulate
About
The simulate module is a powerful and flexible tool designed for generating synthetic phenotypes from real genotype data. This functionality is crucial for a wide range of applications in genetic research, including but not limited to: evaluating the statistical power of new analytical methods, validating the performance of association testing frameworks, understanding complex genetic architectures, and exploring causal relationships between traits (e.g., for Mendelian Randomization studies).
The module provides two sophisticated simulation frameworks:
- QTL-based Simulation: Generates phenotypes based on quantitative trait loci (QTLs) using established principles of quantitative genetics and the liability threshold model. It can simulate single traits, as well as causally related traits.
- Gene Expression-Mediated Simulation: Simulates phenotypes through a hierarchical model where genetic variants (eQTLs) first influence gene expression levels, which in turn determine the final complex trait. This framework can model both spatial and temporal (developmental) gene expression patterns.
By leveraging real genotype data from VCF or KGGA's native GTB format, the simulate module produces biologically plausible datasets that capture the linkage disequilibrium and allele frequency patterns of the source population.
1. Phenotype Simulation Based on Quantitative Trait Loci (QTLs)
This simulation framework operates on the principle that phenotypic variation arises from the combined effects of genetic variation and environmental factors. The simulation uses real genotypes to model the genetic component. Given a user-defined heritability, the environmental variance is calculated and applied to generate a continuous quantitative phenotype. For disease traits, this quantitative value is treated as a liability score, and a threshold corresponding to the specified disease prevalence is used to assign case/control status.
Example 1: Simulate a Single Phenotype
This example demonstrates how to simulate a single disease trait based on a set of 150 QTLs with varying effect models. The total heritability is set to 25%, and the disease prevalence is 1%. The underlying mathematical models can be viewed here.
java -Xmx10g -jar kgga.jar simulate \
--input-gty-file ./variants.maf05.hg38.gtb \
--output ./test/simulate_single_pheno \
--threads 20 \
--min-obs-rate 0.4 \
--allele-num 2~4 \
--local-maf 0.05~0.5 \
--r2-cut 0.1 \
--phenotype-variants qtlPattern=50@1,50@2,50@3 \
heritability=0.25 \
prevalence=0.01 \
confounding=0.05 \
--sampling-group groupNum=10 \
--sampling sampleSize=6000 \
caseProportion=0.5
- --phenotype-variants: Configures the genetic architecture of the simulated phenotype, including the number and effect models of QTLs, heritability, prevalence (for binary traits), and confounding variance.
- --sampling: Specifies the number of replicate datasets to generate and the desired number of cases and controls to sample for each replicate.
- --r2-cut: Sets the maximum Linkage Disequilibrium (LD) r² allowed between selected QTLs, ensuring they are relatively independent.
Example 2: Simulate Two Causally Related Phenotypes
This example simulates two phenotypes where the first trait (exposure) has a causal effect on the second (outcome). This is particularly useful for generating data to test methods like Mendelian Randomization. The underlying mathematical models can be viewed here.
java -Xmx10g -jar kgga.jar simulate \
--input-gty-file ./variants.maf05.hg38.gtb \
--output ./test/simulate_causal_pheno \
--threads 20 \
--local-maf 0.05~0.5 \
--r2-cut 0.1 \
--phenotype-variants qtlPattern=150@1 \
heritability=0.25 \
prevalence=0.01 \
confounding=0.05 \
--phenotype-variants qtlPattern=150@1 \
heritability=0.25 \
prevalence=0.01 \
confounding=0.05 \
--phenotype-causal causalEffect=0.2 \
causalHeritability=0.05 \
pleiotropyIVProportion=0.3 \
directPleiotropyHeritability=0.02 \
randPleiotropyHeritability=0.02 \
--sampling-group groupNum=10 \
overlapping=0 \
--sampling sampleSize=10000 \
caseProportion=0.5 \
--sampling sampleSize=10000 \
caseProportion=0.5
- --phenotype-causal: Specifies the causal effect of the first simulated phenotype (exposure) on the second (outcome). It also defines the proportion of the exposure's genetic variance that contributes to this causal pathway.
- --sampling: Set the sample size and the proportion of case individuals within a sample.
- --sampling-group: Defines the number of sample groups and the proportion of individuals that are shared between the samples within a group. The number of samples within a group is determined by the number of '--sampling' options.
Full Options for QTL-based Simulation
Option | Description | Default |
---|---|---|
--phenotype-variants | Configures the genetic architecture for a phenotype. This option can be specified multiple times for simulating multiple traits. Format: qtlPattern=[spec] heritability=[float] prevalence=[float] confounding=[float] - qtlPattern: A comma-separated list specifying the number and effect model of QTLs (e.g., n1@model1,n2@model2). Three effect models are supported: 1 (additive), 2 (pairwise interaction), and 3 (three-way interaction). - heritability: The proportion of phenotypic variance explained by all specified QTLs (range: 0-1). - prevalence: The disease prevalence for binary traits (range: 0-1). If omitted, a quantitative trait is simulated. - confounding: The proportion of variance explained by a common confounding factor (range: 0-1). Example: qtlPattern=50@1,50@2,50@3 heritability=0.25 prevalence=0.01 confounding=0.05 |
[OFF] |
--sampling | Specifies the sampling size from the full simulated population. Format: sampleSize=[int] caseProportion=[float] -sampleSize: The number of total subjects included in each sample. -caseProportion: The cases among the all subjects in each sample. Note: The sampled phenotypes are written to separate PED files, suitable for direct use in other KGGA modules. |
[OFF] |
--sampling-group | Specifies the sampling number and overlapping rates among samples. One group contains at least one sample. The number of samples in each group is determined by the number of '--sampling'. Format: groupNum=[int] overlapping=[float] - groupNum: The number of replicate sample groups to draw. - overlapping: The proportion of case subjects in each sample. A value of 0.5 means 50% overlap. |
|
--phenotype-causal | Specifies a causal relationship between two consecutively defined phenotypes. Format: causalEffect=[float] causalHeritability=[float] - causalEffect: The strength of the causal effect from the exposure to the outcome (e.g., -1 to 1). - causalHeritability: The proportion of the exposure's genetic variance that mediates the causal effect. - pleiotropyIVProportion: The proportion of correlated pleiotropic variants. - directPleiotropyHeritability: Variance proportions for directional pleiotropic variants. This is used to set the corrected horizontal pleiotropic effects. - randPleiotropyHeritability: Variance proportions for random pleiotropic variants. This is used to set the uncorrected horizontal pleiotropic effects. Note: The two values must be either zero or non-zero together. |
[OFF] |
--r2-cut | Specifies the maximum LD r² for QTL selection. QTLs in high LD with an already selected QTL will be excluded. Format: [float] Example: --r2-cut 0.1 |
1.0 |
2. Phenotype Simulation Based on Gene Expression Mediation
This advanced framework simulates phenotypes via a biologically inspired hierarchical model: Genotype -> Gene Expression -> Phenotype. It first models the expression of specified genes as a quantitative trait influenced by local eQTLs. It then uses these simulated gene expression levels as predictors for the final complex phenotype. This approach is ideal for investigating transcriptomic mediation and can simulate data for both spatial and temporal expression profiles.
Example 3: Simulate a Phenotype Mediated by Spatially-Variable Gene Expression
This command simulates spatial transcriptomic data across 10 chips. It models 500 genes with expression patterns concentrated in specific "hot spots" and subsequently simulates a disease phenotype where 250 of these genes act as drivers. Here are the underlying mathematical models on spatial transcriptomic profiles and eQTL gene expression regulating a phenotype.
java -Xmx10g -jar kgga.jar simulate \
--input-gty-file ./variants.maf05.hg38.gtb \
--output ./test/simulate_spatial \
--threads 20 \
--r2-cut 0.1 \
--xqtl-gene-file ./eqtlgenes.txt \
--spatial-gene chipSize=100 \
hotLocations=50:50,10:50 \
maxHotExpressionGeneNum=500 \
dropoutRate=0.4 \
chipNum=10 \
--xqtl-gene maxQTLNum=4 \
effectScale=1 \
heritability=0.3 \
--phenotype-gene maxDriverNum=250 \
heritability=0.15 \
prevalence=0.01 \
--sampling-group groupNum=10 \
--sampling sampleSize=6000 \
caseProportion=0.5
- --xqtl-gene-file: Provides a list of genes eligible for simulation.
- --spatial-gene: Defines the parameters for simulating spatial expression profiles, including chip dimensions and gene expression patterns.
- --xqtl-gene: Configures the genetic regulation of gene expression by eQTLs.
- --phenotype-gene: Configures the final phenotype as a function of the simulated gene expression levels.
Example 4: Simulate a Phenotype Mediated by Temporally-Variable Gene Expression
This example simulates gene expression data across a developmental timeline (ages 0-100). It models genes with expression levels that change over time and uses this temporal data to simulate a final phenotype. The mathematical models can be viewed here.
java -Xmx10g -jar kgga.jar \
simulate \
--input-gty-file ./variants.maf05.hg38.gtb \
--output ./test/simulateS \
--threads 20 \
--min-obs-rate 0.4 \
--allele-num 2~4 \
--local-maf 0.05~0.5 \
--r2-cut 0.1 \
--xqtl-gene-file /home/lmx/MyJava/idea/kgga2/eqtlgenes.txt \
--timing-gene startAge=0 \
endAge=100 \
ageInterval=1 \
sampleSizePerAge=10 \
maxTimingGenes=500 \
sampleNum=10 \
--timing-gene-model peakTime=20 \
scaleWidth=5 \
amplitude=10
--xqtl-gene maxQTLNum=4 \
effectScale=1 \
heritability=0.3 \
--phenotype-gene maxDriverNum=250 \
heritability=0.15 \
prevalence=0.01 \
--sampling-group groupNum=10 \
--sampling sampleSize=6000 \
caseProportion=0.5
Key Command-Line Options
--timing-gene
: Sets the parameters for the temporal simulation, including the age range and sampling scheme.--timing-gene-model
: Defines the mathematical model governing the temporal expression pattern of genes (e.g., peak expression time).
Full Options for Gene Expression-Mediated Simulation
Option | Description | Default |
---|---|---|
--xqtl-gene-file |
Specifies the path to a file containing a list of candidate genes for expression simulation. | [OFF] |
--xqtl-gene |
Configures the eQTL effects on gene expression. Format: maxQTLNum=[int] effectScale=[float] heritability=[float] - maxQTLNum : The maximum number of eQTLs per gene.- effectScale : A scaling factor for eQTL effect sizes.- heritability : The proportion of expression variance explained by eQTLs. |
[OFF] |
--phenotype-gene |
Configures the effect of gene expression on the final phenotype. Format: maxDriverNum=[int] heritability=[float] prevalence=[float] - maxDriverNum : The maximum number of driver genes influencing the phenotype.- heritability : The proportion of phenotypic variance explained by the driver genes.- prevalence : The disease prevalence if simulating a binary trait. |
[OFF] |
--spatial-gene |
Sets parameters for spatial transcriptomics simulation. Format: chipSize=[int] hotLocations=[coord] ... - chipSize : The dimension of the square chip (e.g., 100 for 100x100).- hotLocations : Comma-separated coordinates for centers of high-expression.- maxHotExpressionGeneNum : The number of genes to assign to hot spots.- dropoutRate : The rate of technical dropouts.- chipNum : The number of spatial chips to simulate. |
[OFF] |
--timing-gene |
Sets parameters for temporal (developmental) expression simulation. Format: startAge=[int] endAge=[int] ... - startAge , endAge : The age range for simulation.- ageInterval : The interval between age points.- sampleSizePerAge : Number of individuals to simulate per age point.- maxTimingGenes : The number of genes with temporal expression patterns.- sampleNum : The number of temporal expression profiles. |
[OFF] |
--timing-gene-model |
Configures the model for temporal expression. Format: peakTime=[int] scaleWidth=[int] amplitude=[float] - peakTime : The age of peak gene expression.- scaleWidth : The width of the expression peak.- amplitude : The maximum expression level. |
[OFF] |
Outputs
Simulated data is organized into task-specific subdirectories within the specified output folder.
QTL-based Simulation Output
Generated within a PhenotypeSimulateTask
folder:
- Population Phenotypes: A master file named
phenotype.ped
containing the simulated phenotypes for all individuals in the input genotype file. - Sampled Datasets: A series of numbered PED files (
phenotype.ped.0
,phenotype.ped.1
, etc.), corresponding to each replicate specified by--sampling
. Each file contains the sampled cases and controls for one replicate analysis.
Gene Expression-Mediated Simulation Output
Generated within an ExpressionPhenotypeSimulateTask
folder, in addition to the phenotype files described above:
- Spatial Expression Profiles: A series of text files (
gene.spatial.expression.txt.0
,gene.spatial.expression.txt.1
, etc.), one for each simulated chip. These files contain the gene expression matrices for each spatial location. - Temporal Expression Profiles: A series of text files (
gene.timing.expression.txt.0
,gene.timing.expression.txt.1
, etc.), one for each replicate. These files contain matrices of gene expression across the simulated age points.
Log File
A log file is generated summarizing the simulation parameters and key results, such as the calculated genetic and environmental variance components.
2025-09-04 12:11:27 The genetic variance and environmental variance are 95733.04 and 287199.12, respectively, to meet the heritability of 0.25.
```