The prioritize module ranks variants, genes, or genomic regions based on allele frequencies in reference populations and annotated functional prediction scores. Several methods are available under this module, and we welcome contributions of new methods built on this infrastructure.

Prioritize genes enriched with rare sequence variants

RUNNER

The RUNNER function leverages a negative binomial regression model to prioritize genes or genomic regions enriched with rare sequence variants compared to background mutations. This methodology builds on foundational research detailed in our previous publications (WITER and RUNNER). It constructs a model to adjust mutation counts in predefined genomic regions or genes using multiple genomic features, such as reference allele frequencies, region lengths, and conservation scores. It evaluates the statistical enrichment of these adjusted counts, weighted by functional scores, under negative binomial distributions.

Citations

Lin Jiang, Hui Jiang, ..., Pak Chung Sham, and Miaoxin Li. Deviation from baseline mutation burden provides powerful and robust rare-variants association test for complex diseases. Nucleic Acids Res. 2022 Apr 8;50(6):e34.
Lin Jiang, Jingjing Zheng, ..., Yonghong Zhu, and Miaoxin Li. WITER: a powerful method for estimation of cancer-driver genes using a weighted iterative regression modelling background mutation counts. Nucleic Acids Res. 2019 Sep 19;47(16):e96.

Functionality

RUNNER identifies genes or regions with an unexpectedly high number of rare variants by:

Adjusting mutation counts with a negative binomial regression model based on genomic features.
Assessing the enrichment of the adjusted counts, weighted by functional scores, to provide a statistical measure of significance.
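The enrichment idea can be illustrated with a minimal sketch, assuming a plain (not zero-truncated) negative binomial with mean `mu` and dispersion `theta`. This is not RUNNER's actual procedure: the function names, the expected count of 5, and the observed count of 20 are hypothetical, and the dispersion is simply borrowed from the Theta value in the example log on this page.

```python
import math


def nb_log_pmf(y: int, mu: float, theta: float) -> float:
    """Log-probability of count y under a negative binomial (mean mu, dispersion theta)."""
    return (
        math.lgamma(y + theta) - math.lgamma(theta) - math.lgamma(y + 1)
        + theta * math.log(theta / (theta + mu))
        + y * math.log(mu / (theta + mu))
    )


def nb_upper_tail(observed: int, mu: float, theta: float) -> float:
    """P(Y >= observed): probability of seeing at least this many mutations."""
    below = sum(math.exp(nb_log_pmf(y, mu, theta)) for y in range(observed))
    return max(0.0, 1.0 - below)


# Hypothetical gene: the background model expects 5 mutations, 20 are observed.
p_value = nb_upper_tail(observed=20, mu=5.0, theta=2.64)
print(f"upper-tail probability: {p_value:.3g}")
```

A small upper-tail probability means the gene carries more rare variants than the background model predicts, which is the kind of signal the enrichment p-values summarize.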

Options and Examples

Below is an example command demonstrating how RUNNER examines genes enriched with non-synonymous rare variants. Mutation counts are weighted by functional scores from dbNSFP and adjusted for evolutionary conservation, mutation frequency in reference populations (e.g., gnomAD), and gene length. The output provides enrichment p-values for each gene.

RUNNER utilizes R as its computational engine. Please ensure that KGGA is properly configured to communicate with R. We recommend using Docker to streamline this integration. For Docker configuration details, please refer to RServerDocker.

java -Xmx10g -jar kgga.jar \
 prioritize \
   -i /path/to/genotype/file \
   --ped-file /path/to/pedigree/phenotype/file \
   --output ./demo4 \
   --threads 8 \
   --variant-annotation-database dbNSFP \
   --variant-annotation-database conservation \
   --gene-feature-in 0~6 \
   --gene-model-database refgene \
   --gene-model-database gencode \
   --freq-database gnomad \
   --db-af 0~0.01 \
   --runner conservation@GC,conservation@CpG,conservation@priPhCons,conservation@mamPhCons,conservation@verPhCons,conservation@priPhyloP,conservation@mamPhyloP,conservation@verPhyloP,conservation@GerpN,conservation@GerpS,conservation@fitCons_all \
     weight=dbNSFP@VEST,dbNSFP@FATHMM-XF,dbNSFP@Eigen,dbNSFP@CADD,dbNSFP@GenoCanyon \
   --r-server localhost:6300

Key Command-Line Options

-i /path/to/genotype/file: Specifies the input genotype file.
--ped-file /path/to/pedigree/phenotype/file: Provides pedigree and phenotype data.
--output ./demo4: Sets the output directory or file prefix.
--threads 8: Defines the number of processing threads.
--variant-annotation-database dbNSFP: Uses dbNSFP for variant annotations.
--variant-annotation-database conservation: Incorporates conservation data.
--gene-feature-in 0~6: Selects gene features (e.g., exons) to analyze.
--gene-model-database refgene: Uses the refGene database for gene models.
--gene-model-database gencode: Adds the GENCODE database for gene models.
--freq-database gnomad: Employs gnomAD for allele frequency data.
--db-af 0~0.01: Filters variants with allele frequencies between 0 and 0.01.
--r-server localhost:6300: Specifies the address and port of the R server that KGGA connects to for running the R-based analyses.

The `--runner` Option

The --runner option is a composite parameter that configures the regression analysis. All specified fields must be defined in the annotation database (--variant-annotation-database).

Sub-Options

predictor: Database fields used as predictors in the regression (e.g., conservation scores like GC, CpG, etc.). Default: region length if unspecified.
weight: Fields used as weights for mutation counts (e.g., functional scores like VEST, CADD). If @AFGRE is included, recalibrated weights are applied; otherwise, raw weights are used.
adjustMethod: Method to adjust mutation counts (full or cut). Default: cut.
combineWeight: Combines specified scores before weighting (yes or no). Default: yes.
countOnce: Counts only the most impactful variant per subject in a region/gene (yes or no). Default: yes.
freqRatio: Excludes variants where the case-to-control allele frequency ratio is outside the range 1/[value] to [value]. Larger values widen this range, so variants with more extreme ratios are retained. Default: 3.
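To make the freqRatio and countOnce rules concrete, here is a minimal sketch of how they could be applied. It is not KGGA's implementation. The function names and record layout are hypothetical, and two behaviors are assumptions of the sketch: a variant with a zero frequency in either group is kept because its ratio is undefined, and "most impactful" is read as "highest weight".

```python
from collections import defaultdict


def passes_freq_ratio(case_af: float, control_af: float, freq_ratio: float = 3.0) -> bool:
    """True if case_af / control_af lies within [1/freq_ratio, freq_ratio]."""
    if case_af <= 0 or control_af <= 0:
        # Assumption of this sketch: the ratio is undefined, so keep the variant.
        return True
    ratio = case_af / control_af
    return 1.0 / freq_ratio <= ratio <= freq_ratio


def weighted_counts(records, count_once: bool = True) -> dict:
    """Sum variant weights per gene from (subject, gene, weight) records.

    With count_once=True, only the highest-weight variant of each subject
    in each gene contributes to that gene's total.
    """
    totals = defaultdict(float)
    if not count_once:
        for _subject, gene, weight in records:
            totals[gene] += weight
        return dict(totals)

    best = {}
    for subject, gene, weight in records:
        key = (subject, gene)
        best[key] = max(weight, best.get(key, weight))
    for (_subject, gene), weight in best.items():
        totals[gene] += weight
    return dict(totals)


# Hypothetical records: subject s1 carries two variants in the same gene.
records = [("s1", "RET", 0.9), ("s1", "RET", 0.4), ("s2", "RET", 0.5)]
print(weighted_counts(records, count_once=True))
print(weighted_counts(records, count_once=False))
```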

Format

--runner <field1>,<field2> [weight=<field3>,<field4>] [adjustMethod=full/cut] [combineWeight=y/n] [countOnce=y/n] [freqRatio=3]

Example

--runner conservation@priPhCons,conservation@mamPhCons weight=dbNSFP@CADD combineWeight=y
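The structure of the composite value can be sketched as a small parser. This is an illustration of the format only, not KGGA's own parser: `parse_runner` and `RUNNER_DEFAULTS` are hypothetical names, and the defaults are copied from the sub-option list above.

```python
# Defaults taken from the sub-option descriptions on this page.
RUNNER_DEFAULTS = {
    "adjustMethod": "cut",
    "combineWeight": "yes",
    "countOnce": "yes",
    "freqRatio": "3",
}


def parse_runner(value: str) -> dict:
    """Split a --runner value into predictors, weights, and key=value sub-options."""
    config = {"predictors": [], "weight": [], **RUNNER_DEFAULTS}
    for token in value.split():
        key, eq, raw = token.partition("=")
        if not eq:
            # A bare token is the comma-separated list of predictor fields.
            config["predictors"] = token.split(",")
        elif key == "weight":
            config["weight"] = raw.split(",")
        elif key in RUNNER_DEFAULTS:
            config[key] = raw
        else:
            raise ValueError(f"unknown --runner sub-option: {key}")
    return config


example = "conservation@priPhCons,conservation@mamPhCons weight=dbNSFP@CADD combineWeight=y"
print(parse_runner(example))
```

Each field keeps its `database@field` form, which makes it easy to check that every referenced database was also passed through --variant-annotation-database.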

Output

RUNNER generates three main outputs:

Regression Model Summary: Displays the fitted zero-truncated negative binomial regression model, including estimates, standard errors, z-values, and p-values for predictors.
QQ Plot: A quantile-quantile plot of p-values, saved as a PDF file.
Prioritized Genes/Regions: A text file listing p-values and statistics for each gene or region.

Example Regression Model Summary

2025-04-19 10:29:02 Best standardized score bin: 0.45; Optimal truncation point: 1; MLFC: 0.039896260517655234
2025-04-19 10:29:04 The zero-truncated negative-binomial regression model fitted for region-based mutation counts regression:
  Variable                  Estimate         Std. Error      z value         Pr(>|z|)
  (Intercept)               0.924802489      0.102559969     9.017187660     1.92979663121953e-19
  conservation_GC           0.239511946      0.166125249     1.441755221     0.149371439179087
  conservation_CpG         -0.126699149      0.357418197    -0.354484328     0.722975946812435
  ...
  Theta = 2.64204113945025
  Log-likelihood: -17561.7404288397
  AIC: 35155.4808576794
  Number of iterations in BFGS optimization: 26

Output File Description

The *.regression.prioritized.genes.txt file contains columns such as:

ID: Gene or region identifier (e.g., RET).
Region: Type of region (e.g., Exons).
Chromosome: Chromosome number (e.g., 10).
StartPosition: Region start position (e.g., 43100542).
EndPosition: Region end position (e.g., 43129617).
CaseUnweightedMutationCounts: Mutation counts in cases (e.g., 60).
conservation@GC, conservation@CpG, ...: Predictor values.
ln_RegionLength: Log of region length.
ln_RefCount: Log of reference count.
ln_RegionLength_RefCount: Interaction term.
ControlUnweightedMutationCounts: Mutation counts in controls (e.g., 27).
z: Z-score (e.g., 4.33).
p: P-value (e.g., 7.57e-06).
FDRq: FDR-adjusted q-value (e.g., 0.0350).

Example Output Row

ID    Region    Chromosome    StartPosition    EndPosition    CaseUnweightedMutationCounts    conservation@GC    conservation@CpG    conservation@priPhCons    conservation@mamPhCons    conservation@verPhCons    conservation@priPhyloP    conservation@mamPhyloP    conservation@verPhyloP    conservation@GerpN    conservation@GerpS    conservation@fitCons_all    ln_RegionLength    ln_RefCount    ln_RegionLength_RefCount    ControlUnweightedMutationCounts    z    p    FDRq
RET  Exons  10  43100542  43129617  60  0.6173  0.0970  0.5678  0.6683  0.7831  0.4373  1.1918  2.3258  5.1527  3.4412  0.5569  1.2887  4.3521  5.6084  27  4.33  7.57e-06  0.0350
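For downstream filtering, the prioritized-genes table can be read with a few lines of Python. The sketch below assumes the file is tab-delimited with a header row, which should be checked against a real output file. `significant_genes` is a hypothetical helper, and `SAMPLE` is an abbreviated copy of the RET row above, reduced to six columns.

```python
import csv
import io

# Abbreviated version of the example row above (only six of the columns).
SAMPLE = (
    "ID\tRegion\tChromosome\tz\tp\tFDRq\n"
    "RET\tExons\t10\t4.33\t7.57e-06\t0.0350\n"
)


def significant_genes(handle, max_fdr: float = 0.05):
    """Return (ID, p, FDRq) tuples with FDRq <= max_fdr, sorted by p-value."""
    reader = csv.DictReader(handle, delimiter="\t")
    hits = [
        (row["ID"], float(row["p"]), float(row["FDRq"]))
        for row in reader
        if float(row["FDRq"]) <= max_fdr
    ]
    return sorted(hits, key=lambda hit: hit[1])


for gene, p, q in significant_genes(io.StringIO(SAMPLE)):
    print(gene, p, q)
```

To run it on a real result, pass an open file handle for the *.regression.prioritized.genes.txt file instead of the in-memory sample.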

iRUNNER

iRUNNER extends RUNNER to assess the synergistic rare mutation burden in gene pairs or triples, referred to as interaction units, within case samples. These interaction units represent genes with functional cooperation, such as protein interactions, gene co-expression, or shared biological pathways, and must be supplied to KGGA. iRUNNER counts mutations that occur in every gene of an interaction unit within the same patient. Once mutation counts are obtained, the analysis proceeds identically to RUNNER’s process for individual genes.
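The co-occurrence requirement can be sketched as follows. This is a simplified illustration that counts carriers (subjects) rather than weighted mutations, which may differ from what iRUNNER computes internally. The function name, subject IDs, and the gene names `GENE_B` and `GENE_C` are hypothetical.

```python
def count_unit_carriers(mutated_genes_by_subject: dict, unit) -> int:
    """Count subjects carrying a qualifying mutation in every gene of the unit."""
    required = set(unit)
    return sum(
        1
        for genes in mutated_genes_by_subject.values()
        if required <= set(genes)
    )


# Hypothetical carriers: each subject maps to the genes in which it has a rare variant.
subjects = {
    "s1": {"RET", "GENE_B"},
    "s2": {"RET"},
    "s3": {"RET", "GENE_B", "GENE_C"},
}
print(count_unit_carriers(subjects, ("RET", "GENE_B")))
print(count_unit_carriers(subjects, ("RET", "GENE_B", "GENE_C")))
```

A subject with a mutation in only one gene of a pair contributes nothing to that pair, which is what distinguishes the interaction-unit count from a per-gene count.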

Citations

Hui Jiang, ..., and Miaoxin Li. Exploring Genetic Interactions with Rare Variants Reveals Gene Networks Susceptible to Complex Diseases.

Options and Examples

Below is an example command demonstrating how iRUNNER examines interaction units enriched with non-synonymous rare variants:

java -Xmx10g -jar kgga.jar \
 prioritize \
   <options identical to the RUNNER example above> \
   --interaction file=/home/lmx/MyJava/netbeans/kggseq1/resources/Coding_predict_fixed_0.5.txt.gz \
                 threshold=0.8

The --interaction Option

The --interaction option is a composite parameter that specifies the file containing interaction units.

Sub-Options

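file: Specifies the path to the file containing gene interactions. Each line lists the genes of one interaction unit followed by its interaction score (gene1, gene2, ..., interactionScore), with fields delimited by the separator set in sep.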
threshold: Excludes gene combinations with interaction scores below this value.
hasHead: Indicates whether the file includes a header row (yes or no). Default: yes.
sep: Defines the separator used in each line. Options: TAB, COMMA, SEMICOLON, BLANK. Default: TAB.

Format

--interaction file=<path> threshold=<float> [hasHead=y/n] [sep=TAB/COMMA/SEMICOLON/BLANK]
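The sub-options map naturally onto a small reader. The sketch below is not KGGA's loader: `read_interaction_units` and `SEPARATORS` are hypothetical names, BLANK is assumed to mean a single space, and the gene names in the sample are placeholders.

```python
import csv
import io

# Assumed mapping from the sep keywords to delimiter characters.
SEPARATORS = {"TAB": "\t", "COMMA": ",", "SEMICOLON": ";", "BLANK": " "}


def read_interaction_units(handle, threshold: float, has_head: bool = True, sep: str = "TAB"):
    """Return (genes, score) pairs whose interaction score is at least threshold."""
    reader = csv.reader(handle, delimiter=SEPARATORS[sep])
    if has_head:
        next(reader, None)  # skip the header row
    units = []
    for fields in reader:
        if not fields:
            continue
        *genes, score = fields  # the last field is the interaction score
        if float(score) >= threshold:
            units.append((tuple(genes), float(score)))
    return units


# Placeholder content: two gene pairs and one gene triple.
sample = "gene1\tgene2\tscore\nA\tB\t0.9\nA\tC\t0.5\nA\tB\tC\t0.8\n"
print(read_interaction_units(io.StringIO(sample), threshold=0.8))
```

For a gzip-compressed file such as the one in the example command, the handle can come from `gzip.open(path, "rt")` in the standard library.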

Output

The outputs, including the Regression Model Summary, QQ Plot, and Prioritized Genes/Regions, are similar to those generated by RUNNER.

PubMed

The PubMed function retrieves papers that co-mention the genes to be prioritized and specified phenotypes. It queries the PubMed database to identify relevant publications.

Functionality

Purpose: Retrieve paper IDs from PubMed that mention both the prioritized genes and the specified phenotypes.
Query Mechanism: Uses keywords for phenotypes and a specified query field to search for co-mentions in PubMed.

Options

Option	Description	Default
--pubmed-mining	Specifies the phenotype keywords and the query field used to retrieve the IDs of PubMed papers that co-mention the prioritized genes and phenotypes. Phenotype keywords are separated by + (e.g., disease+name1). If activated, the default query field is 'Title/Abstract'; the other available field is 'Text+Word'.	[OFF]

Format

--pubmed-mining <keyword1+keyword2> [field=Title/Abstract]

Example

--pubmed-mining Hirschsprung

This command queries PubMed for papers that mention both the prioritized genes and "Hirschsprung" in the title or abstract.
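A co-mention query of this kind can be sketched against the public NCBI E-utilities `esearch` endpoint. How KGGA builds its query internally is not documented here, so this is only an illustration: `build_term` and `build_url` are hypothetical helpers, and combining multiple phenotype keywords with AND is an assumption of the sketch.

```python
from urllib.parse import urlencode

ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_term(gene: str, keywords: str, field: str = "Title/Abstract") -> str:
    """Build a PubMed term requiring the gene and every phenotype keyword in the field."""
    # Assumption: each '+'-separated keyword must be present (AND logic).
    parts = [f"{kw}[{field}]" for kw in keywords.split("+") if kw]
    return " AND ".join([f"{gene}[{field}]"] + parts)


def build_url(gene: str, keywords: str, field: str = "Title/Abstract") -> str:
    """Return an esearch URL whose JSON response lists the matching PubMed IDs."""
    query = {"db": "pubmed", "term": build_term(gene, keywords, field), "retmode": "json"}
    return f"{ESEARCH}?{urlencode(query)}"


print(build_term("RET", "Hirschsprung"))
print(build_url("RET", "Hirschsprung"))
```

The sketch only constructs the request; fetching the URL returns the paper IDs that PubMed matches for that gene and phenotype.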
