The prioritize module ranks variants, genes, or genomic regions based on allele frequencies in reference populations and annotated functional prediction scores. Several methods are available under this module, and contributions of new methods built on this infrastructure are welcome.
Prioritize genes enriched with rare sequence variants
RUNNER
The RUNNER function leverages a negative binomial regression model to prioritize genes or genomic regions enriched with rare sequence variants compared to background mutations. This methodology builds on foundational research detailed in our previous publications (WITER and RUNNER). It constructs a model to adjust mutation counts in predefined genomic regions or genes using multiple genomic features, such as reference allele frequencies, region lengths, and conservation scores. It evaluates the statistical enrichment of these adjusted counts, weighted by functional scores, under negative binomial distributions.
Citations
- Lin Jiang, Hui Jiang, ..., Pak Chung Sham, and Miaoxin Li. Deviation from baseline mutation burden provides powerful and robust rare-variants association test for complex diseases. Nucleic Acids Res. 2022 Apr 8;50(6):e34.
- Lin Jiang, Jingjing Zheng, ..., Yonghong Zhu, and Miaoxin Li. WITER: a powerful method for estimation of cancer-driver genes using a weighted iterative regression modelling background mutation counts. Nucleic Acids Res. 2019 Sep 19;47(16):e96.
Functionality
RUNNER identifies genes or regions with an unexpectedly high number of rare variants by:
- Adjusting mutation counts with a negative binomial regression model based on genomic features.
- Assessing the enrichment of the adjusted counts, weighted by functional scores, to obtain a statistical measure of significance.
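The enrichment idea can be illustrated with a minimal sketch: given an expected count and a dispersion parameter, compute the probability of seeing at least the observed count under a negative binomial distribution. This is not RUNNER's implementation (which works on weighted, adjusted counts); the function names and the example numbers (`observed=60`, `mu=25.0`, `theta=2.64`) are hypothetical.

```python
import math


def nb_log_pmf(k: int, mu: float, theta: float) -> float:
    """Log-probability of count k under a negative binomial with
    mean mu (> 0) and dispersion theta (> 0), variance mu + mu^2/theta."""
    return (
        math.lgamma(k + theta) - math.lgamma(theta) - math.lgamma(k + 1)
        + theta * math.log(theta / (theta + mu))
        + k * math.log(mu / (theta + mu))
    )


def nb_upper_tail(observed: int, mu: float, theta: float) -> float:
    """P(X >= observed): one-sided enrichment p-value.
    Computed as 1 - P(X < observed), so very small p-values lose precision."""
    below = sum(math.exp(nb_log_pmf(k, mu, theta)) for k in range(observed))
    return max(0.0, 1.0 - below)


# Hypothetical gene: 60 rare variants observed where the model expects 25.
p_value = nb_upper_tail(observed=60, mu=25.0, theta=2.64)
print(f"enrichment p-value: {p_value:.4g}")
```

A smaller p-value indicates that the gene carries more rare variants than the background model predicts.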
Options and Examples
Below is an example command demonstrating how RUNNER examines genes enriched with non-synonymous rare variants. Mutation counts are weighted by functional scores from dbNSFP and adjusted for evolutionary conservation, mutation frequency in reference populations (e.g., gnomAD), and gene length. The output provides enrichment p-values for each gene.
java -Xmx10g -jar kgga.jar \
prioritize \
-i /path/to/genotype/file \
--ped-file /path/to/pedigree/phenotype/file \
--output ./demo4 \
--threads 8 \
--variant-annotation-database dbNSFP \
--variant-annotation-database conservation \
--gene-feature-in 0~6 \
--gene-model-database refgene \
--gene-model-database gencode \
--freq-database gnomad \
--db-af 0~0.01 \
--runner conservation@GC,conservation@CpG,conservation@priPhCons,conservation@mamPhCons,conservation@verPhCons,conservation@priPhyloP,conservation@mamPhyloP,conservation@verPhyloP,conservation@GerpN,conservation@GerpS,conservation@fitCons_all \
weight=dbNSFP@VEST,dbNSFP@FATHMM-XF,dbNSFP@Eigen,dbNSFP@CADD,dbNSFP@GenoCanyon
Key Command-Line Options
-i /path/to/genotype/file
: Specifies the input genotype file.
--ped-file /path/to/pedigree/phenotype/file
: Provides pedigree and phenotype data.
--output ./demo4
: Sets the output directory or file prefix.
--threads 8
: Defines the number of processing threads.
--variant-annotation-database dbNSFP
: Uses dbNSFP for variant annotations.
--variant-annotation-database conservation
: Incorporates conservation data.
--gene-feature-in 0~6
: Selects gene features (e.g., exons) to analyze.
--gene-model-database refgene
: Uses the refGene database for gene models.
--gene-model-database gencode
: Adds the GENCODE database for gene models.
--freq-database gnomad
: Employs gnomAD for allele frequency data.
--db-af 0~0.01
: Filters variants with allele frequencies between 0 and 0.01.
The --runner Option
The --runner option is a composite parameter that configures the regression analysis. All specified fields must be defined in the annotation databases (--variant-annotation-database).
Sub-Options
- predictor: Database fields used as predictors in the regression (e.g., conservation scores such as GC and CpG). Default: region length if unspecified.
- weight: Fields used as weights for mutation counts (e.g., functional scores such as VEST and CADD). If @AFGRE is included, recalibrated weights are applied; otherwise, raw weights are used.
- adjustMethod: Method to adjust mutation counts (full or cut). Default: cut.
- combineWeight: Combines specified scores before weighting (yes or no). Default: yes.
- countOnce: Counts only the most impactful variant per subject in a region/gene (yes or no). Default: yes.
- freqRatio: Excludes variants where the case-to-control allele frequency ratio is outside 1/[value] to [value]. Larger values correspond to more extreme variants. Default: 3.
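As an illustration of what countOnce changes, the sketch below sums variant weights per gene, either keeping only one variant per subject or keeping all of them. It assumes that "most impactful" means the highest weight, which the documentation does not state explicitly; the function name and the toy data are hypothetical.

```python
from collections import defaultdict


def weighted_counts(carriers, count_once=True):
    """Sum variant weights per gene.

    carriers: iterable of (subject, gene, weight), one entry per rare
    variant carried by a subject.
    count_once=True keeps only the highest-weight variant per subject
    and gene (assumed meaning of "most impactful").
    """
    by_gene_subject = defaultdict(list)
    for subject, gene, weight in carriers:
        by_gene_subject[(gene, subject)].append(weight)

    totals = defaultdict(float)
    for (gene, _subject), weights in by_gene_subject.items():
        totals[gene] += max(weights) if count_once else sum(weights)
    return dict(totals)


# Toy data: subject S1 carries two variants in RET, S2 carries one.
toy = [("S1", "RET", 0.75), ("S1", "RET", 0.25), ("S2", "RET", 0.5)]
print(weighted_counts(toy, count_once=True))   # S1 contributes 0.75 only
print(weighted_counts(toy, count_once=False))  # S1 contributes 0.75 + 0.25
```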
Format
--runner <field1>,<field2> [weight=<field3>,<field4>] [adjustMethod=full/cut] [combineWeight=y/n] [countOnce=y/n] [freqRatio=3]
Example
--runner conservation@priPhCons,conservation@mamPhCons weight=dbNSFP@CADD combineWeight=y
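To make the composite format concrete, here is a small sketch that splits the tokens following --runner into predictors and key=value sub-options, filling in the defaults listed above. It is a hypothetical helper for illustration, not KGGA's own parser; in particular, treating `y` and `yes` as equivalent is an assumption.

```python
def parse_runner(tokens):
    """Parse the tokens that follow --runner (illustrative only)."""
    config = {
        "predictors": [],        # positional, comma-separated fields
        "weight": [],
        "adjustMethod": "cut",   # documented default
        "combineWeight": True,   # documented default: yes
        "countOnce": True,       # documented default: yes
        "freqRatio": 3.0,        # documented default
    }
    for token in tokens:
        if "=" not in token:
            config["predictors"].extend(token.split(","))
            continue
        key, value = token.split("=", 1)
        if key == "weight":
            config[key] = value.split(",")
        elif key == "adjustMethod":
            config[key] = value
        elif key in ("combineWeight", "countOnce"):
            # Assumption: "y" and "yes" both mean true.
            config[key] = value.lower() in ("y", "yes")
        elif key == "freqRatio":
            config[key] = float(value)
        else:
            raise ValueError(f"unknown --runner sub-option: {key}")
    return config


example = "conservation@priPhCons,conservation@mamPhCons weight=dbNSFP@CADD combineWeight=y"
print(parse_runner(example.split()))
```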
Output
RUNNER generates three main outputs:
- Regression Model Summary: Displays the fitted zero-truncated negative binomial regression model, including estimates, standard errors, z-values, and p-values for predictors.
- QQ Plot: A quantile-quantile plot of p-values, saved as a PDF file.
- Prioritized Genes/Regions: A text file listing p-values and statistics for each gene or region.
Example Regression Model Summary
2025-04-19 10:29:02 Best standardized score bin: 0.45; Optimal truncation point: 1; MLFC: 0.039896260517655234
2025-04-19 10:29:04 The zero-truncated negative-binomial regression model fitted for region-based mutation counts regression:
Variable Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.924802489 0.102559969 9.017187660 1.92979663121953e-19
conservation_GC 0.239511946 0.166125249 1.441755221 0.149371439179087
conservation_CpG -0.126699149 0.357418197 -0.354484328 0.722975946812435
...
Theta = 2.64204113945025
Log-likelihood: -17561.7404288397
AIC: 35155.4808576794
Number of iterations in BFGS optimization: 26
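The columns of this summary can be checked by hand. The sketch below assumes the usual Wald-test conventions (z = Estimate / Std. Error, two-sided normal p-value) and the standard definition AIC = 2k - 2 log L; these are general statistical conventions, not something the log itself states.

```python
import math


def wald_test(estimate: float, std_error: float):
    """Return (z, two-sided p-value) under a standard normal reference."""
    z = estimate / std_error
    p = math.erfc(abs(z) / math.sqrt(2.0))  # equals 2 * (1 - Phi(|z|))
    return z, p


# Rows copied from the example summary above.
rows = {
    "(Intercept)": (0.924802489, 0.102559969),
    "conservation_GC": (0.239511946, 0.166125249),
    "conservation_CpG": (-0.126699149, 0.357418197),
}
for name, (estimate, std_error) in rows.items():
    z, p = wald_test(estimate, std_error)
    print(f"{name:18s} z = {z: .6f}   p = {p:.6g}")

# Number of fitted parameters implied by AIC = 2k - 2 * logLik.
log_likelihood = -17561.7404288397
aic = 35155.4808576794
n_params = (aic + 2.0 * log_likelihood) / 2.0
print(f"implied number of parameters: {n_params:.1f}")
```

The implied parameter count is 16, which would be consistent with an intercept, the eleven conservation predictors, the three length and reference-count terms, and Theta, although the elided rows of the log are not shown here.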
Output File Description
The *.regression.prioritized.genes.txt file contains columns such as:
- ID: Gene or region identifier (e.g., RET).
- Region: Type of region (e.g., Exons).
- Chromosome: Chromosome number (e.g., 10).
- StartPosition: Region start position (e.g., 43100542).
- EndPosition: Region end position (e.g., 43129617).
- CaseUnweightedMutationCounts: Mutation counts in cases (e.g., 60).
- conservation@GC, conservation@CpG, ...: Predictor values.
- ln_RegionLength: Log of region length.
- ln_RefCount: Log of reference count.
- ln_RegionLength_RefCount: Interaction term between ln_RegionLength and ln_RefCount.
- ControlUnweightedMutationCounts: Mutation counts in controls (e.g., 27).
- z: Z-score (e.g., 4.33).
- p: P-value (e.g., 7.57e-06).
- FDRq: FDR-adjusted q-value (e.g., 0.0350).
Example Output Row
ID Region Chromosome StartPosition EndPosition CaseUnweightedMutationCounts conservation@GC conservation@CpG conservation@priPhCons conservation@mamPhCons conservation@verPhCons conservation@priPhyloP conservation@mamPhyloP conservation@verPhyloP conservation@GerpN conservation@GerpS conservation@fitCons_all ln_RegionLength ln_RefCount ln_RegionLength_RefCount ControlUnweightedMutationCounts z p FDRq
RET Exons 10 43100542 43129617 60 0.6173 0.0970 0.5678 0.6683 0.7831 0.4373 1.1918 2.3258 5.1527 3.4412 0.5569 1.2887 4.3521 5.6084 27 4.33 7.57e-06 0.0350
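For downstream use, a row like the one above can be turned into a record keyed by column name. The sketch below splits on whitespace, which is an assumption about the file's delimiter, and the 0.05 cutoff is only an example threshold.

```python
CONSERVATION = [
    "GC", "CpG", "priPhCons", "mamPhCons", "verPhCons", "priPhyloP",
    "mamPhyloP", "verPhyloP", "GerpN", "GerpS", "fitCons_all",
]
HEADER = " ".join(
    ["ID", "Region", "Chromosome", "StartPosition", "EndPosition",
     "CaseUnweightedMutationCounts"]
    + [f"conservation@{name}" for name in CONSERVATION]
    + ["ln_RegionLength", "ln_RefCount", "ln_RegionLength_RefCount",
       "ControlUnweightedMutationCounts", "z", "p", "FDRq"]
)
ROW = (
    "RET Exons 10 43100542 43129617 60 0.6173 0.0970 0.5678 0.6683 0.7831 "
    "0.4373 1.1918 2.3258 5.1527 3.4412 0.5569 1.2887 4.3521 5.6084 27 "
    "4.33 7.57e-06 0.0350"
)


def parse_row(header_line: str, data_line: str) -> dict:
    """Map column names to string values (whitespace-delimited, assumed)."""
    names = header_line.split()
    values = data_line.split()
    if len(names) != len(values):
        raise ValueError(f"expected {len(names)} fields, got {len(values)}")
    return dict(zip(names, values))


record = parse_row(HEADER, ROW)
# Keep genes passing an example FDR threshold of 0.05.
significant = [r["ID"] for r in [record] if float(r["FDRq"]) < 0.05]
print(significant)
```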
iRUNNER
iRUNNER extends RUNNER to assess the synergistic rare mutation burden in gene pairs or triples, referred to as interaction units, within case samples. These interaction units represent genes with functional cooperation, such as protein interactions, gene co-expression, or shared biological pathways, and must be supplied to KGGA. For each interaction unit, iRUNNER counts mutations that occur in every gene of the unit within the same patient. Once mutation counts are obtained, the analysis proceeds identically to RUNNER's process for individual genes.
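One way to read this counting rule is sketched below: a patient contributes to an interaction unit only if they carry a rare mutation in every gene of that unit. This is an interpretation for illustration, not iRUNNER's actual code; the function name, patient IDs, and gene sets are toy values.

```python
def count_unit_carriers(mutated_genes_by_patient, units):
    """Count, per interaction unit, the patients mutated in all of its genes.

    mutated_genes_by_patient: dict mapping patient ID -> set of genes
    in which that patient carries at least one rare variant.
    units: iterable of gene tuples (pairs or triples).
    """
    counts = {}
    for unit in units:
        required = set(unit)
        counts[unit] = sum(
            1 for genes in mutated_genes_by_patient.values()
            if required <= genes  # patient is mutated in every gene
        )
    return counts


patients = {
    "P1": {"RET", "GDNF"},
    "P2": {"RET"},
    "P3": {"RET", "GDNF", "NRG1"},
}
units = [("RET", "GDNF"), ("RET", "GDNF", "NRG1")]
print(count_unit_carriers(patients, units))
```

Patient P2 is mutated in RET alone and therefore contributes to neither unit.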
Citations
- Hui Jiang, ..., and Miaoxin Li. Exploring Genetic Interactions with Rare Variants Reveals Gene Networks Susceptible to Complex Diseases.
Options and Examples
Below is an example command demonstrating how iRUNNER examines interaction units enriched with non-synonymous rare variants:
java -Xmx10g -jar kgga.jar \
prioritize \
# Identical to the above options for RUNNER
--interaction file=/home/lmx/MyJava/netbeans/kggseq1/resources/Coding_predict_fixed_0.5.txt.gz \
threshold=0.8
The --interaction Option
The --interaction option is a composite parameter that specifies the file containing interaction units.
Sub-Options
- file: Specifies the path to the file containing gene interactions. Each line must follow the format: gene1,gene2,...,interactionScore.
- threshold: Excludes gene combinations with interaction scores below this value.
- hasHead: Indicates whether the file includes a header row (yes or no). Default: yes.
- sep: Defines the separator used in each line. Options: TAB, COMMA, SEMICOLON, BLANK. Default: TAB.
Format
--interaction file=<path> threshold=<float> [hasHead=y/n] [sep=TAB/COMMA/SEMICOLON/BLANK]
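The effect of threshold, hasHead, and sep can be pictured with a short reader for such a file. This is a hypothetical sketch of the documented behavior, not KGGA's loader; it takes the last field of each line as the interaction score and keeps units whose score is not below the threshold.

```python
SEPARATORS = {"TAB": "\t", "COMMA": ",", "SEMICOLON": ";", "BLANK": " "}


def load_units(lines, threshold, has_head=True, sep="TAB"):
    """Return [(genes, score)] for units with score >= threshold."""
    delimiter = SEPARATORS[sep]
    rows = [line.strip() for line in lines if line.strip()]
    if has_head:
        rows = rows[1:]  # drop the header row
    units = []
    for row in rows:
        *genes, score_text = row.split(delimiter)
        score = float(score_text)
        if score >= threshold:  # scores below the threshold are excluded
            units.append((tuple(genes), score))
    return units


# Toy file content with a header row, tab-separated (the documented default).
toy_lines = [
    "gene1\tgene2\tscore",
    "RET\tGDNF\t0.92",
    "A\tB\t0.5",
]
print(load_units(toy_lines, threshold=0.8))
```

With real data, the lines could come from `gzip.open(path, "rt")` for a `.gz` file such as the one in the example command.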
Output
The outputs, including the Regression Model Summary, QQ Plot, and Prioritized Genes/Regions, are similar to those generated by RUNNER.
PubMed
The PubMed function retrieves papers that co-mention the genes to be prioritized and specified phenotypes. It queries the PubMed database to identify relevant publications.
Functionality
- Purpose: Retrieve paper IDs from PubMed that mention both the prioritized genes and the specified phenotypes.
- Query Mechanism: Uses keywords for phenotypes and a specified query field to search for co-mentions in PubMed.
Options
Option | Description | Default |
---|---|---|
--pubmed-mining | Specifies phenotype keywords and the query field type for retrieving paper IDs from PubMed that co-mention the prioritized genes and phenotypes. Phenotype keywords are separated by + (e.g., disease+name1). By default, this function is inactive. If activated, the default query field is 'Title/Abstract'. Another available field is 'Text+Word'. Format: --pubmed-mining Example: --pubmed-mining Hirschsprung. This command queries PubMed for papers that mention both the prioritized genes and "Hirschsprung" in the title or abstract. | [OFF] |
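A co-mention search of this kind can be expressed as a PubMed query term and sent to the NCBI E-utilities `esearch` endpoint. The sketch below only builds the term and the request URL; how KGGA actually constructs its query is not documented here, and combining multiple phenotype keywords with OR is an assumption, as are the function names.

```python
from urllib.parse import urlencode

ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_term(gene: str, phenotype_option: str, field: str = "Title/Abstract") -> str:
    """Build a PubMed term requiring the gene and any phenotype keyword.

    phenotype_option uses '+' between keywords, as in --pubmed-mining.
    Assumption: keywords are alternatives (OR), not all required.
    """
    keywords = [kw for kw in phenotype_option.split("+") if kw]
    phenotype = " OR ".join(f"{kw}[{field}]" for kw in keywords)
    return f"{gene}[{field}] AND ({phenotype})"


def esearch_url(term: str, retmax: int = 20) -> str:
    """Return an esearch URL that would list matching PubMed IDs."""
    return ESEARCH + "?" + urlencode({"db": "pubmed", "term": term, "retmax": retmax})


term = build_term("RET", "Hirschsprung")
print(term)
print(esearch_url(term))
```

Fetching the URL returns an XML document whose `Id` elements are the PubMed IDs of the matching papers.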