Genes
GATES and ECS
Performs gene-based association analysis using GATES (a rapid and powerful Gene-based Association Test using Extended Simes procedure) and ECS (Effective Chi-square Statistics).
GATES (Li et al. 2011) is essentially an extension of the Simes procedure to dependent tests, since the individual GWAS tests are dependent due to LD. GATES calculates an effective number of independent p-values, which is then used by a Simes procedure. ECS (Li et al. 2019) first converts the p-values of a gene to chi-square statistics (one degree of freedom). Then, it merges all chi-square statistics of the gene after correcting for the redundancy among the statistics due to LD. The merged statistic is called an ECS and is used to calculate the p-value of the gene.
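The two building blocks mentioned above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not KGGSum's implementation: `simes_p` is the classical Simes procedure for independent tests (GATES works with effective numbers of independent p-values estimated from LD, which is not reproduced here), and `p_to_chisq1` / `chisq1_to_p` show the conversion between a two-sided p-value and a one-degree-of-freedom chi-square statistic, the starting point of ECS. All function names are hypothetical.

```python
import math
from statistics import NormalDist


def simes_p(pvalues):
    """Classical Simes combined p-value, valid for independent tests."""
    ps = sorted(pvalues)
    m = len(ps)
    # Minimum over j of m * p_(j) / j, where p_(j) is the j-th smallest p-value.
    return min(1.0, min(m * p / j for j, p in enumerate(ps, start=1)))


def p_to_chisq1(p):
    """Two-sided p-value in (0, 1] -> chi-square statistic with 1 degree of freedom."""
    # Lower-tail normal quantile; the sign is irrelevant after squaring.
    z = NormalDist().inv_cdf(p / 2.0)
    return z * z


def chisq1_to_p(x):
    """Chi-square statistic with 1 degree of freedom -> upper-tail p-value."""
    return math.erfc(math.sqrt(x / 2.0))
```

The LD correction that merges the per-variant chi-square statistics into a single ECS is the substantive part of the method and is deliberately left out of this sketch.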
Citations
For gene-based association analysis: Miaoxin Li, Hong-Sheng Gui, Johnny S.H. Kwan, et al. GATES: a rapid and powerful gene-based association test using extended Simes procedure. The American Journal of Human Genetics, 2011, 88(3):283-293. PubMed Link
For ECS and conditional gene-based association analysis: Miaoxin Li, Lin Jiang, T S H Mak, et al. A powerful conditional gene-based association approach implicated functionally important genes for schizophrenia. Bioinformatics, 2019, 35(4):628-635. PubMed Link
For pathway or (gene-set) based association analysis: Hongsheng Gui, Johnny S Kwan, Pak C Sham et al. Sharing of Genes and Pathways Across Complex Phenotypes: A Multilevel Genome-Wide Analysis. Genetics, 2017, 206(3):1601-1609. PubMed Link
Options
This analysis inputs the p-values of SNPs and outputs the p-values of genes. The tutorial command is:
java -Xmx4g -jar ../kggsum.jar \
assoc \
--sum-file ./scz_gwas_eur_chr1.tsv.gz \
cp12Cols=CHR,BP \
pbsCols=P \
refG=hg19 \
--ref-gty-file ./1kg_hg19_eur_chr1.vcf.gz \
refG=hg19 \
--output ./t1
Flag | Description |
---|---|
assoc | Trigger the gene-based association analysis. |
--sum-file | Specifies the file path of a GWAS summary file. The detailed description can be found here. |
--ref-gty-file | Specifies the file path of a reference genotype file. The detailed description can be found here. |
--output | Specifies the prefix of the output folder. |
--gene-model-database | Specifies the gene boundaries for the variant-to-gene mapping. The default one is gencode. It can be refgene as well. The former contains more non-coding genes than the latter, while the latter may be more reliable. The exact boundaries are set by --upstream-distance and --downstream-distance. The default values are 1000 bp for both boundaries. |
--xqtl-file | Specifies the target genes associated with each variant, typically using eQTL summary statistics. The detailed description can be found here. Note that the --gene-model-database and --xqtl-file options are mutually exclusive; specifying the latter will automatically disable the former. The --xqtl-file is usually used for causal gene inference via a Mendelian Randomization method (e.g., EMIC). |
--max-condi-gene | Specifies the maximal number of significant genes for the conditional gene-based test. The default value is 1000. The value -1 disables this setting. |
Output
The numeric results of gene-based association tests are saved in GeneBasedAssociationTask\genes.hg38.assoc.txt. These are the columns in the file:
Header | Description |
---|---|
RegionID | Region IDs or Gene symbols |
Chromosome | Chromosome of the region or gene |
StartPosition | The position of the first SNP in the region or gene |
EndPosition | The position of the last SNP in the region or gene |
#Var | Number of variants within the region or gene |
GATES.P | p-value of GATES |
ECS.P | p-value of ECS |
CCT.P | the combined p-value of ECS and GATES by the Cauchy Combination test |
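For readers unfamiliar with the Cauchy Combination test behind CCT.P, the following is a minimal sketch of the standard formula: each p-value is mapped to a Cauchy variate, the variates are averaged, and the average is mapped back to a p-value. The function name and the use of equal weights are assumptions for illustration; the source does not state which weights KGGSum applies.

```python
import math


def cauchy_combination(pvalues, weights=None):
    """Combine p-values with the Cauchy Combination test (weights must sum to 1)."""
    k = len(pvalues)
    if weights is None:
        weights = [1.0 / k] * k  # equal weights: an assumption, not taken from the source
    # Map each p-value to a standard Cauchy variate and take the weighted sum.
    t = sum(w * math.tan((0.5 - p) * math.pi) for w, p in zip(weights, pvalues))
    # Upper-tail probability of the standard Cauchy distribution at t.
    return 0.5 - math.atan(t) / math.pi
```

With two identical inputs the combined value equals the input, and a single very small p-value dominates the result, which is why the test remains sensitive when only one of GATES or ECS is significant.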
The Q-Q plots for p-values of gene-based association tests by GATES or ECS are saved in GeneBasedAssociationTask\genes.hg38.assoc.qq.pdf.
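As a hedged illustration of how the association table might be post-processed, the sketch below parses text with the documented column headers and keeps genes that pass a Bonferroni threshold on one p-value column. It assumes the file is tab-delimited with a single header row, which the source does not state explicitly; the sample rows and the function name are hypothetical and do not represent real results.

```python
import csv
import io

# Hypothetical rows that only mimic the documented header; not real results.
SAMPLE = (
    "RegionID\tChromosome\tStartPosition\tEndPosition\t#Var\tGATES.P\tECS.P\tCCT.P\n"
    "GENE_A\t1\t1000\t5000\t12\t1.0e-7\t3.0e-8\t5.0e-8\n"
    "GENE_B\t1\t9000\t12000\t7\t0.20\t0.35\t0.26\n"
)


def top_genes(handle, column="ECS.P", cutoff=0.05):
    """Return (gene, p) pairs passing a Bonferroni threshold, smallest p first."""
    rows = list(csv.DictReader(handle, delimiter="\t"))
    if not rows:
        return []
    threshold = cutoff / len(rows)  # Bonferroni over all genes in the file
    hits = [(r["RegionID"], float(r[column])) for r in rows
            if float(r[column]) <= threshold]
    return sorted(hits, key=lambda item: item[1])


print(top_genes(io.StringIO(SAMPLE)))
```

To apply it to a real run, pass an open file handle for the genes.hg38.assoc.txt file from your own output folder instead of the in-memory sample.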
In addition, a conditional gene-based association test is then carried out on the significant genes to remove those whose significance is mostly due to LD with the most significant gene in a local region. The results are stored in GeneBasedConditionalAssociationTask\genes.hg38.condi.assoc.txt. These are the columns in the file:
Header | Description |
---|---|
... | ... |
Condi.ECS.P | The p-value of the conditional gene-based test by ECS |
In the above analysis, variants are mapped to genes according to their physical positions in the gene models (--gene-model-database). However, remote regulatory variants may be missed because they fall outside these boundaries. One can use xQTL data to link distant variants to genes with the option --xqtl-file.
Heritability
Gene/region-based heritability estimation by EHE
Heritability measures how well differences in people’s genes account for differences in their phenotypes. This EHE analysis estimates each gene's heritability and performs gene-based association tests simultaneously (Miao et al. 2023).
Citations
Lin Miao, Lin Jiang, Bin Tang, Pak Chung Sham and Miaoxin Li. Dissecting the high-resolution genetic architecture of complex phenotypes by accurately estimating gene-based conditional heritability. The American Journal of Human Genetics (2023). 110(9):1534–1548. PubMed Link
Options
The tutorial command is:
java -Xmx4g -jar ../kggsum.jar \
assoc \
--calc-ehe \
--sum-file ./scz_gwas_eur_chr1.tsv.gz \
cp12Cols=CHR,BP \
pbsCols=P \
sampleSizeCols=Nca,Nco \
refG=hg19 \
prevalence=0.01 \
--ref-gty-file ./1kg_hg19_eur_chr1.vcf.gz \
refG=hg19 \
--output ./t2
Flag | Description |
---|---|
assoc | Trigger the gene-based association analysis. |
--calc-ehe | Asks the program to estimate heritability when calculating association. It has two sub-options. calcCondi: also calculate the conditional heritability. This may be time-consuming, so it can be turned off by setting the value to no. The default is yes. "topGeneNum=0 |
--sum-file | Specifies the file path of a GWAS summary file. For quantitative traits, a single column specifying the sample sizes is required. For binary traits, two columns indicating the case and control sample sizes are necessary. Additionally, for a disease phenotype, the disease prevalence must be specified. The detailed description can be found here. |
--ref-gty-file | Specifies the file path of a reference genotype file. The detailed description can be found here. |
--output | Specifies the prefix of the output folder. |
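The source does not say how EHE uses the case and control sample sizes and the prevalence. One standard reason such inputs are needed in heritability analysis of case-control data is the transformation of a heritability estimate from the observed (0/1) scale to the liability scale. The sketch below shows that generic transformation only; it is an assumption that something similar applies here, and the function name is hypothetical.

```python
from statistics import NormalDist


def liability_scale_factor(prevalence, case_fraction):
    """Multiplier converting observed-scale h2 (case-control) to liability-scale h2.

    prevalence:    population prevalence K of the disease
    case_fraction: proportion P of cases in the GWAS sample, Nca / (Nca + Nco)
    """
    nd = NormalDist()
    threshold = nd.inv_cdf(1.0 - prevalence)  # liability threshold for prevalence K
    z = nd.pdf(threshold)                     # standard normal density at the threshold
    k, p = prevalence, case_fraction
    return (k * (1.0 - k)) ** 2 / (z * z * p * (1.0 - p))
```

For example, `liability_scale_factor(0.01, 0.5)` gives the multiplier for a balanced case-control sample of a disease with 1% prevalence, the value used for `prevalence` in the tutorial command.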
Output
The gene-based association p-values and heritability estimates are saved in GeneBasedAssociationTask\genes.hg38.assoc.txt. These are the columns in the file:
Header | Description |
---|---|
... | ... |
eH2 | The estimated heritability of the region or gene by EHE. |
eH2.SE | The standard error of the estimated heritability. |
In addition, a conditional gene-based estimation is then carried out for significant genes to remove genes that have heritability merely due to LD with the most significant gene in a local region. The results are stored in GeneBasedConditionalAssociationTask\genes.hg38.condi.assoc.txt. These are the columns in the file:
Header | Description |
---|---|
... | ... |
Condi.eH2 | The estimated conditional heritability of the region or gene by EHE. |
Condi.eH2.SE | The standard error of the estimated conditional heritability. |
CellTypes
DESE
DESE (Driver-tissue/cell Estimation by Selective Expression; Jiang et al. 2019) estimates driver tissues by tissue-selective expression of phenotype-associated genes in GWAS. The assumption is that the tissue-selective expression of causal or susceptibility genes indicates the tissues where complex phenotypes primarily develop, which are called driver tissues. Therefore, a driver tissue is likely to be enriched with the selective expression of susceptibility genes of a phenotype.
DESE initially analyzed the association by mapping SNPs to genes according to their physical distance. We further demonstrated that grouping eQTLs of a gene or a transcript to perform the association analysis could be more powerful. We named the eQTL-guided DESE eDESE. KGGSum implements DESE and eDESE with an improved effective chi-squared statistic to control type I error rates and remove redundant associations (Li et al. 2022).
DESE performs phenotype-tissue association tests and conditional gene-based association tests at the same time. This analysis inputs p-values of a GWAS and expression profile of multiple tissues and outputs p-values of phenotype-tissue associations and conditional p-values of genes.
Citations
For phenotype-associated tissue estimation by DESE: Lin Jiang, Chao Xue, Sheng Dai, et al. DESE: estimating driver tissues by selective expression of genes associated with complex diseases or traits. Genome biology, 2019, 20(1):1-19. PubMed Link
For phenotype-associated tissues' susceptibility genes and isoforms estimation: Xiangyi Li, Lin Jiang, Chao Xue, et al. A conditional gene-based association framework integrating isoform-level eQTL data reveals new susceptibility genes for schizophrenia. Elife. 2022 Apr 12;11:e70779. PubMed Link
For phenotype-associated cell-type estimation by DESE: Xue C#, Jiang L#, Zhou M, Long Q, Chen Y, Li X, Peng W, Yang Q, Li M. PCGA: a comprehensive web server for phenotype-cell-gene association analysis. Nucleic Acids Res. 2022 May 26;50(W1):W568-76.
Options
The tutorial command is:
java -Xmx4g -jar ../kggsum.jar \
assoc \
--sum-file ./scz_gwas_eur_chr1.tsv.gz \
cp12Cols=CHR,BP \
pbsCols=P \
refG=hg19 \
--ref-gty-file ./1kg_hg19_eur_chr1.vcf.gz \
refG=hg19 \
--gene-score-file ../resources/GTEx_v8_TMM_all.gene.meanSE.txt.gz \
--output ./t3
Flag | Description |
---|---|
assoc, --sum-file, --ref-gty-file, and --output | These have the same functions as previously described. |
--gene-score-file | Specifies a gene score file. The scores can represent various attributes, such as RNA expression, protein expression, epigenetic markers, or perturbation profiles at genes. See more at the Input Data Description. |
--gene-p-cut | Sets the p-value threshold to select significant genes for the conditional gene-based test. The default value is 0.05. |
--gene-multiple-testing | Specifies the method for multiple testing correction with a given p-value threshold to select significant genes for the conditional gene-based test. It accepts three method labels. bonf: Bonferroni correction with the family-wise threshold specified by --gene-p-cut. benfdr: filter by the false discovery rate (FDR) calculated by the Benjamini-Hochberg procedure; the threshold is also defined by --gene-p-cut. fixed: filter by the p-value threshold specified by --gene-p-cut without any multiple testing correction. The default value is bonf. |
--max-condi-gene | Sets the maximal number of significant genes for the conditional gene-based test. The default value is 1000. The value -1 disables this setting. |
--permutation-num | Sets the number of permutations used to adjust the p-values of driver-tissue or driver-cell-type inference for selection bias and multiple testing. The default value is 100. A larger number will take more running time. |
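The three labels of --gene-multiple-testing correspond to well-known selection rules. The sketch below shows how each rule would pick genes from a set of gene-level p-values at a given --gene-p-cut. It is a minimal illustration, not KGGSum's code: the function name is hypothetical, the input values are made up, and details such as whether the comparison is strict are assumptions.

```python
def select_significant(gene_p, cut=0.05, method="bonf"):
    """Select genes for the conditional test from a {gene: p-value} mapping."""
    m = len(gene_p)
    if m == 0:
        return set()
    if method == "fixed":
        # Raw p-value threshold, no multiple testing correction.
        return {g for g, p in gene_p.items() if p <= cut}
    if method == "bonf":
        # Bonferroni: family-wise threshold divided by the number of genes.
        return {g for g, p in gene_p.items() if p <= cut / m}
    if method == "benfdr":
        # Benjamini-Hochberg step-up: find the largest rank k with p_(k) <= cut * k / m.
        ranked = sorted(gene_p.items(), key=lambda item: item[1])
        k = 0
        for rank, (_, p) in enumerate(ranked, start=1):
            if p <= cut * rank / m:
                k = rank
        return {g for g, _ in ranked[:k]}
    raise ValueError(f"unknown method: {method}")


# Made-up p-values for four hypothetical genes.
example = {"A": 0.001, "B": 0.02, "C": 0.04, "D": 0.5}
print(sorted(select_significant(example, method="bonf")))
print(sorted(select_significant(example, method="benfdr")))
print(sorted(select_significant(example, method="fixed")))
```

In this toy input the three rules select progressively more genes, which reflects the usual ordering of stringency: Bonferroni is the strictest and the fixed threshold the most permissive.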
Output files
This function produces three sets of results: the gene-based association summary statistics saved in GeneBasedAssociationTask\genes.hg38.assoc.txt, the gene-based conditional association summary statistics saved in GeneBasedConditionalAssociationTask\genes.hg38.condi.assoc.txt, and the integrative enrichment summary statistics saved in GeneBasedConditionalAssociationTask\$scoreFileName.enrichment.txt.
The enrichment file reports the results of the Wilcoxon rank-sum test, which tests whether the median selective expression of the phenotype-associated genes is significantly higher than that of the other genes in the interrogated tissue. The file contains four columns:
Header | Description |
---|---|
Condition | Name of the tissue being tested |
Unadjusted(p) | Unadjusted p-values for the tissue-phenotype associations |
Adjusted(p) | Adjusted p-values, corrected for both selection bias and multiple testing by permutation of gene scores within each condition. |
Median(IQR)SigVsAll | Median (interquartile range) expression of the conditionally significant genes and all the background genes |
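To make the Unadjusted(p) column concrete, here is a minimal sketch of a one-sided Wilcoxon rank-sum test that asks whether scores of a selected gene set tend to be higher than those of the background genes. It uses the normal approximation without tie or continuity correction, which is a simplification chosen for brevity; the source does not specify which variant KGGSum uses, and the function name is hypothetical.

```python
import math
from statistics import NormalDist


def rank_sum_greater(selected, background):
    """One-sided rank-sum test: are `selected` scores larger than `background`?

    Returns (U statistic of the selected group, approximate p-value).
    """
    pooled = sorted([(v, 0) for v in selected] + [(v, 1) for v in background])
    n = len(pooled)
    ranks = [0.0] * n
    i = 0
    while i < n:
        # Find the run of tied values starting at i and give it the average rank.
        j = i
        while j + 1 < n and pooled[j + 1][0] == pooled[i][0]:
            j += 1
        mid_rank = (i + j) / 2.0 + 1.0  # mean of the 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[k] = mid_rank
        i = j + 1
    n1, n2 = len(selected), len(background)
    r1 = sum(r for r, (_, group) in zip(ranks, pooled) if group == 0)
    u1 = r1 - n1 * (n1 + 1) / 2.0
    mean = n1 * n2 / 2.0
    sd = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12.0)
    z = (u1 - mean) / sd
    return u1, 1.0 - NormalDist().cdf(z)
```

The permutation step behind Adjusted(p) is not shown; it would repeat a test of this kind on permuted gene scores to calibrate the unadjusted p-values.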
Drugs
DESE
Infers effective drugs for a disease studied by GWAS from drug-induced (perturbation) gene expression profiles using DESE. The assumption is that effective drugs may treat a disease by specifically perturbing the expression of its susceptibility genes. A detailed explanation can be found in this paper. The options and input format are the same as those of the above analysis for associated cell types; the only difference is which expression profiles are input. Instead of the gene expression profiles of various cell types or tissues, the gene expression profiles perturbed by various drugs are specified by --gene-score-file.
Citations
- Li X, Xue C, Zhu Z, Yu X, Yang Q, Cui L, Li M. Application of GWAS summary data and drug-induced gene expression profiles of neural progenitor cells in psychiatric drug prioritization analysis. Mol Psychiatry. 2025 Jan;30(1):111-121.
Options
The tutorial command is:
java -Xmx4g -jar ../kggsum.jar \
assoc \
--sum-file ./scz_gwas_eur_chr1.tsv.gz \
cp12Cols=CHR,BP \
pbsCols=P \
refG=hg19 \
--ref-gty-file ./1kg_hg19_eur_chr1.vcf.gz \
refG=hg19 \
--gene-score-file https://idc.biosino.org/pmglab/resource/kgg/kggsum/datasets/drugs/GEO_expression_profiles/hipsc_ctrl_with_se_drug_induced_foldchange.txt.gz \
--threads 20 \
--output ./t4
The options are identical to those for the associated cell-type inference.
Output
The output files are the same as those of the CellTypes association analyses. The prioritized drugs are saved in GeneBasedConditionalAssociationTask\$scoreFileName.enrichment.txt, in which the Wilcoxon rank-sum test produces the enrichment scores, and the permutation approach is used to produce valid p-values that account for multiple testing, selection bias, and internal correlation of gene perturbation scores.
Spatiality
Infer the spatial heterogeneity of cell types associated with complex diseases using DESE. We pre-integrated large-scale single-cell transcriptomics and spatial transcriptomics data to generate high-quality gene expression profiles of spatially specific cell types. The gene expression profiles can be input into DESE to estimate disease-associated spatially specific cell types and genes. Given the need for complex interactive visualizations, this functionality is implemented on the PSC web server (https://pmglab.top/psc), enabling convenient and rapid analysis and visualization of the results. Incidentally, the underlying program of PSC is still powered by the KGGSum platform.
Citations
- Xue C., Liu M., Zhou M., Li M., PSC: a comprehensive web server for resolving spatial heterogeneity of cell types associated with complex phenotypes, 2024. Unpublished manuscript.
Output
The output results can be obtained and visualized on the website (https://pmglab.top/psc).