Genes

GATES and ECS

Performs gene-based association analysis using GATES (a rapid and powerful Gene-based Association Test using the Extended Simes procedure) and ECS (Effective Chi-square Statistics).

GATES (Li et al. 2011) extends the Simes procedure to dependent tests, because the individual GWAS tests within a gene are correlated due to LD. GATES calculates an effective number of independent p-values, which is then used in a Simes procedure. ECS (Li et al. 2019) first converts the p-values of a gene to chi-square statistics (one degree of freedom). It then merges all chi-square statistics of the gene after correcting for the redundancy among them due to LD. The merged statistic is called an ECS and is used to calculate the p-value of the gene.
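The two building blocks named above can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions, not KGGSum's implementation: `simes` is the classical Simes test, which treats the p-values as independent, whereas GATES replaces the raw counts with effective numbers of independent p-values derived from LD. The function names are hypothetical.

```python
import math
from statistics import NormalDist
from typing import Sequence


def p_to_chisq1(p: float) -> float:
    """Convert a two-sided p-value (0 < p <= 1) to a 1-df chi-square statistic."""
    z = NormalDist().inv_cdf(p / 2.0)  # lower-tail normal quantile
    return z * z


def chisq1_to_p(x: float) -> float:
    """Upper-tail probability of a 1-df chi-square statistic."""
    return math.erfc(math.sqrt(x / 2.0))


def simes(pvalues: Sequence[float]) -> float:
    """Classical Simes combined p-value: min over j of m * p_(j) / j.

    Assumes independent p-values; the LD-based effective numbers used by
    GATES are NOT computed here.
    """
    m = len(pvalues)
    ordered = sorted(pvalues)
    return min(m * p / j for j, p in enumerate(ordered, start=1))
```

For example, `simes([0.01, 0.04, 0.03])` compares 3 × 0.01 / 1, 3 × 0.03 / 2, and 3 × 0.04 / 3 and returns the smallest of them.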

Citations

For gene-based association analysis: Miaoxin Li, Hong-Sheng Gui, Johnny S.H. Kwan, et al. GATES: a rapid and powerful gene-based association test using the extended Simes procedure. The American Journal of Human Genetics, 2011, 88(3):283-293. PubMed Link
For ECS and conditional gene-based association analysis: Miaoxin Li, Lin Jiang, T S H Mak, et al. A powerful conditional gene-based association approach implicated functionally important genes for schizophrenia. Bioinformatics, 2019, 35(4):628-635. PubMed Link
For pathway or (gene-set) based association analysis: Hongsheng Gui, Johnny S Kwan, Pak C Sham et al. Sharing of Genes and Pathways Across Complex Phenotypes: A Multilevel Genome-Wide Analysis. Genetics, 2017, 206(3):1601-1609. PubMed Link

Options

This analysis takes the p-values of SNPs as input and outputs the p-values of genes. The tutorial command is:

java -Xmx4g -jar ../kggsum.jar \
   assoc \
   --ref-gty-file ./1kg_hg19_eur_chr1.vcf.gz \
                  refG=hg19 \
   --sum-file ./scz_gwas_eur_chr1.tsv.gz \
              cp12Cols=CHR,BP \
              pbsCols=P \
              refG=hg19 \
   --output ./t1

Flag	Description
`assoc`	Trigger the gene-based association analysis.
`--sum-file`	Specifies the file path of a GWAS summary file. The detailed description can be found here.
`--ref-gty-file`	Specifies the file path of a reference genotype file. The detailed description can be found here.
`--output`	Specifies the prefix of the output folder.
`--gene-model-database`	Specifies the gene model database for annotating variants to genes. Syntax: `--gene-model-database name [path=<path>]`. Arguments: `name` (required) is the name of the database; accepted values are (1) `refgene` or `gencode`, which use a built-in database that KGGSum loads automatically from its internal resources directory, and (2) `customized`, which uses a user-provided gene model file and requires the `path` argument. `path=<path>` (required only when `name` is `customized`) is the location of the custom gene model file, either a local file path (e.g., /data/my_genes.txt) or a URL (e.g., http://example.com/genes.txt). Custom file format: the file must have no header row, and each row must describe a single gene in four space- or tab-separated columns in this order: (1) chromosome ID (e.g., 1, chrX); (2) transcription start position; (3) transcription end position; (4) gene symbol or ID (e.g., MYH9, ENSG00000100345). Example content: `1 339070 350389 ENSBTAG00000006648` and `1 475398 475516 ENSBTAG00000049697`. Important: (1) This option is mutually exclusive with `--xqtl-file`, which is intended for datasets where variants are already mapped to genes (e.g., eQTLs/pQTLs); do not use both options in the same run. (2) The exact gene boundaries used for mapping variants are set by `--upstream-distance` and `--downstream-distance`; the default is 5000 bp for both.
`--xqtl-file`	Specifies the target genes associated with each variant, typically using eQTL summary statistics. The detailed description can be found here. Note that the `--gene-model-database` and `--xqtl-file` options are mutually exclusive; specifying the latter will automatically disable the former. The `--xqtl-file` is usually used for causal gene inference via a Mendelian Randomization method (e.g., EMIC).
`--max-condi-gene`	Specifies the maximal number of significant genes for the conditional gene-based test. The default value is 1000. The value -1 disables this setting.
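Before passing a file to `--gene-model-database customized path=<path>`, it can help to check that it follows the four-column, header-free format described in the table above. The following is a minimal sketch of such a check; the function `parse_gene_model` is a hypothetical helper written for this illustration and is not part of KGGSum.

```python
from typing import List, Tuple


def parse_gene_model(text: str) -> List[Tuple[str, int, int, str]]:
    """Parse custom gene model text: chromosome, start, end, gene ID per row.

    Raises ValueError if a row does not have exactly four space- or
    tab-separated columns, or if a position is not an integer (which also
    catches an accidental header row).
    """
    genes = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # skip blank lines
        fields = line.split()  # splits on spaces and tabs
        if len(fields) != 4:
            raise ValueError(
                f"line {line_no}: expected 4 columns, found {len(fields)}"
            )
        chrom, start, end, gene = fields
        genes.append((chrom, int(start), int(end), gene))
    return genes


example = (
    "1 339070 350389 ENSBTAG00000006648\n"
    "1\t475398\t475516\tENSBTAG00000049697\n"
)
print(parse_gene_model(example))
```

The sketch deliberately does not check that the start position is smaller than the end position, since the documentation above does not state such a requirement.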

Output

The numeric results of gene-based association tests are saved in GeneBasedAssociationTask\genes.hg38.assoc.txt. These are the columns in the file:

Header	Description
RegionID	Region IDs or Gene symbols
Chromosome	Chromosome of the region or gene
StartPosition	The position of the first SNP in the region or gene
EndPosition	The position of the last SNP in the region or gene
#Var	Number of variants within the region or gene
GATES.P	p-value of GATES
ECS.P	p-value of ECS
CCT.P	The combined p-value of ECS and GATES by the Cauchy combination test
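To make the `CCT.P` column concrete, here is a minimal sketch of the Cauchy combination test. It assumes equal weights for the combined p-values; whether KGGSum weights GATES and ECS equally is an assumption of this example, and the function name is hypothetical.

```python
import math
from typing import Sequence


def cauchy_combination(pvalues: Sequence[float]) -> float:
    """Combine p-values (each strictly between 0 and 1) with equal weights.

    Each p-value is mapped to a standard Cauchy variate, the variates are
    averaged, and the average is mapped back through the Cauchy tail.
    """
    t = sum(math.tan((0.5 - p) * math.pi) for p in pvalues) / len(pvalues)
    return 0.5 - math.atan(t) / math.pi


# Combining the GATES and ECS p-values of one gene (made-up numbers).
print(cauchy_combination([0.003, 0.0008]))
```

A useful property for sanity checks is that combining identical p-values returns the same value. For extremely small p-values the formula loses floating-point precision, so a production implementation would need a separate approximation in that range.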

The Q-Q plots for p-values of gene-based association tests by GATES or ECS are saved in GeneBasedAssociationTask\genes.hg38.assoc.qq.pdf.

In addition, a conditional gene-based association test is then carried out on the significant genes to remove those whose significance is mostly due to LD with the most significant gene in a local region. The results are stored in GeneBasedConditionalAssociationTask\genes.hg38.condi.assoc.txt. These are the columns in the file:

Header	Description
...	...
Condi.ECS.P	The p-value of the conditional gene-based test by ECS

In the above analysis, variants are mapped to genes according to their physical positions in the gene models (--gene-model-database). However, remote regulatory variants that lie outside the gene boundaries may be missed by this position-based mapping. One can use xQTLs to link distant variants to genes with the option --xqtl-file.

Heritability

Gene/region-based heritability estimation by EHE

Heritability measures how well differences in people’s genes account for differences in their phenotypes. This EHE analysis estimates each gene's heritability and performs gene-based association tests simultaneously (Miao et al. 2023).

Citations

Lin Miao, Lin Jiang, Bin Tang, Pak Chung Sham and Miaoxin Li. Dissecting the high-resolution genetic architecture of complex phenotypes by accurately estimating gene-based conditional heritability. The American Journal of Human Genetics (2023). 110(9):1534–1548. PubMed Link

Options

The tutorial command is:

java -Xmx4g -jar ../kggsum.jar \
   assoc \
   --calc-ehe \
   --ref-gty-file ./1kg_hg19_eur_chr1.vcf.gz \
                  refG=hg19 \
   --sum-file ./scz_gwas_eur_chr1.tsv.gz \
              cp12Cols=CHR,BP \
              pbsCols=P \
              sampleSizeCols=Nca,Nco \
              refG=hg19 \
              prevalence=0.01 \
   --output ./t2

Flag	Description
`assoc`	Trigger the gene-based association analysis.
`--calc-ehe`	Asks the program to estimate heritability when calculating association. It has two sub-options. `calcCondi`: also calculate the conditional heritability. This may be time-consuming, so it can be turned off by setting the value to no. The default is yes. `topGeneNum=0`
`--sum-file`	Specifies the file path of a GWAS summary file. For quantitative traits, a single column specifying the sample sizes is required. For binary traits, two columns indicating the case and control sample sizes are necessary. Additionally, for a disease phenotype, the disease prevalence must be specified. The detailed description can be found here. Note: It is recommended to exclude the Human Leukocyte Antigen (HLA) region for two main reasons: 1) Computational Efficiency: To save computing time, as this region has highly complex linkage disequilibrium (LD) patterns. 2) Statistical Inference: To avoid the strong, potentially confounding signals from the immune system, which can dominate the analysis. This can be achieved by setting the option exclude=chr6:27477797~35448354 (for hg19, coordinates include 1 Mbp flanking extensions).
`--ref-gty-file`	Specifies the file path of a reference genotype file. The detailed description can be found here.
`--output`	Specifies the prefix of the output folder.

Output

The gene-based association p-values and heritability estimates are saved in GeneBasedAssociationTask\genes.hg38.assoc.txt. These are the columns in the file:

Header	Description
...	...
eH2	The estimated heritability of the region or gene by EHE.
eH2.SE	The standard error of the estimated heritability.

In addition, a conditional gene-based estimation is then carried out for significant genes to remove genes that have heritability merely due to LD with the most significant gene in a local region. The results are stored in GeneBasedConditionalAssociationTask\genes.hg38.condi.assoc.txt. These are the columns in the file:

Header	Description
...	...
Condi.eH2	The estimated conditional heritability of the region or gene by EHE.
Condi.eH2.SE	The standard error of the estimated conditional heritability.

CellTypes

DESE

DESE (Driver-tissue/cell Estimation by Selective Expression; Jiang et al. 2019) estimates driver tissues by tissue-selective expression of phenotype-associated genes in GWAS. The assumption is that the tissue-selective expression of causal or susceptibility genes indicates the tissues where complex phenotypes primarily develop, which are called driver tissues. Therefore, a driver tissue is likely to be enriched with the selective expression of susceptibility genes of a phenotype.

DESE initially analyzed the association by mapping SNPs to genes according to their physical distance. We further demonstrated that grouping eQTLs of a gene or a transcript to perform the association analysis could be more powerful. We named the eQTL-guided DESE eDESE. KGGSum implements DESE and eDESE with an improved effective chi-squared statistic to control type I error rates and remove redundant associations (Li et al. 2022).

DESE performs phenotype-tissue association tests and conditional gene-based association tests at the same time. This analysis inputs p-values of a GWAS and expression profile of multiple tissues and outputs p-values of phenotype-tissue associations and conditional p-values of genes.

Citations

For phenotype-associated tissue estimation by DESE: Lin Jiang, Chao Xue, Sheng Dai, et al. DESE: estimating driver tissues by selective expression of genes associated with complex diseases or traits. Genome biology, 2019, 20(1):1-19. PubMed Link
For phenotype-associated tissues' susceptibility genes and isoforms estimation: Xiangyi Li, Lin Jiang, Chao Xue, et al. A conditional gene-based association framework integrating isoform-level eQTL data reveals new susceptibility genes for schizophrenia. Elife. 2022 Apr 12;11:e70779. PubMed Link
For phenotype-associated cell-type estimation by DESE: Xue C#, Jiang L#, Zhou M, Long Q, Chen Y, Li X, Peng W, Yang Q, Li M. PCGA: a comprehensive web server for phenotype-cell-gene association analysis. Nucleic Acids Res. 2022 May 26;50(W1):W568-76.

Options

The tutorial command is:

java -Xmx4g -jar ../kggsum.jar \
   assoc \
   --ref-gty-file ./1kg_hg19_eur_chr1.vcf.gz \
                  refG=hg19 \
   --sum-file ./scz_gwas_eur_chr1.tsv.gz \
              cp12Cols=CHR,BP \
              pbsCols=P \
              refG=hg19 \
              exclude=chr6:27477797~35448354   \
   --gene-score-file ../resources/GTEx_v8_TMM_all.gene.meanSE.txt.gz \
   --output ./t3

Flag	Description
`assoc`, `--sum-file`, `--ref-gty-file`, `--output`	Have the same functions as previously described.
`--gene-score-file`	Specifies a gene score file. The scores can represent various attributes, such as RNA expression, protein expression, epigenetic markers, or perturbation profiles at genes. See more at the Input Data Description
`--gene-p-cut`	Set p-value threshold to select significant genes for the conditional gene-based test. The default value is 0.05.
`--gene-multiple-testing`	Specifies the method for multiple testing correction, applied with the given p-value threshold, to select significant genes for the conditional gene-based test. It has four alternative method labels. `bonf`: Bonferroni correction with the family-wise threshold specified by `--gene-p-cut`. `benfdr`: filter by the false discovery rate (FDR) calculated by the Benjamini-Hochberg procedure; the threshold is also defined by `--gene-p-cut`. `fixed`: filter by the p-value threshold specified by `--gene-p-cut` without any multiple testing correction. `byfdr`: filter by the FDR calculated by the Benjamini-Yekutieli procedure, which is more suitable for dependent tests; the threshold is also defined by `--gene-p-cut`. The default value is `bonf`.
`--max-condi-gene`	Set the maximal number of significant genes for the conditional gene-based test. The default value is 1000. The value -1 disables this setting.
`--permutation-num`	Set the number of permutations to adjust the p-value for driver-tissue or -cell types inference due to selection bias and multiple testing. The default value is 100. A larger number will take more running time.
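The four selection rules behind `--gene-multiple-testing` can be illustrated with a short Python sketch. It follows the textbook definitions of the Bonferroni, Benjamini-Hochberg, and Benjamini-Yekutieli procedures; the function name, the use of `<=` in the comparisons, and the returned index list are assumptions of this example rather than a description of KGGSum's internals.

```python
from typing import List, Sequence


def select_significant(
    pvalues: Sequence[float], cutoff: float = 0.05, method: str = "bonf"
) -> List[int]:
    """Return the indices of genes passing the chosen selection rule."""
    m = len(pvalues)
    if m == 0:
        return []
    if method == "fixed":  # raw threshold, no correction
        return [i for i in range(m) if pvalues[i] <= cutoff]
    if method == "bonf":  # family-wise threshold divided by the test count
        return [i for i in range(m) if pvalues[i] <= cutoff / m]
    if method in ("benfdr", "byfdr"):
        # Benjamini-Yekutieli divides the threshold by the harmonic sum.
        c = sum(1.0 / k for k in range(1, m + 1)) if method == "byfdr" else 1.0
        order = sorted(range(m), key=lambda i: pvalues[i])
        last = 0  # largest rank whose p-value passes the step-up threshold
        for rank, i in enumerate(order, start=1):
            if pvalues[i] <= rank * cutoff / (m * c):
                last = rank
        return sorted(order[:last])
    raise ValueError(f"unknown method label: {method}")


gene_p = [0.001, 0.02, 0.03, 0.5]  # made-up gene-level p-values
for label in ("bonf", "benfdr", "fixed", "byfdr"):
    print(label, select_significant(gene_p, 0.05, label))
```

With these made-up p-values, `bonf` and `byfdr` keep only the first gene, while `benfdr` and `fixed` keep the first three, which shows how much the choice of label can change the set of genes entering the conditional test.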

Output files

This function produces three sets of results: the gene-based association summary statistics saved in GeneBasedAssociationTask\genes.hg38.assoc.txt, the gene-based conditional association summary statistics saved in GeneBasedConditionalAssociationTask\genes.hg38.condi.assoc.txt, and the integrative enrichment summary statistics saved in GeneBasedConditionalAssociationTask\$scoreFileName.enrichment.txt. The enrichment file holds the results of the Wilcoxon rank-sum test, which tests whether the median selective expression of the phenotype-associated genes is significantly higher than that of the other genes in the interrogated tissue. The file contains four columns:

Header	Description
Condition	Name of the tissue being tested
Unadjusted(p)	Unadjusted p-values for the tissue-phenotype associations
Adjusted(p)	Adjusted p-values were calculated by adjusting both selection bias and multiple testing by permutation of gene scores within each condition.
Median(IQR)SigVsAll	Median (interquartile range) expression of the conditionally significant genes and all the background genes
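The unadjusted p-value in this table comes from a one-sided Wilcoxon rank-sum test. The sketch below shows the idea using a normal approximation with a tie correction and no continuity correction; these choices, the function names, and the toy scores are assumptions of this example. It does not reproduce the permutation-based adjustment used for the `Adjusted(p)` column.

```python
import math
from statistics import NormalDist
from typing import List, Sequence


def midranks(values: Sequence[float]) -> List[float]:
    """Rank values from 1, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2 + 1  # mean of the 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = average
        i = j + 1
    return ranks


def rank_sum_greater(sig: Sequence[float], other: Sequence[float]) -> float:
    """One-sided p-value that scores in `sig` tend to exceed those in `other`."""
    n1, n2 = len(sig), len(other)
    pooled = list(sig) + list(other)
    n = n1 + n2
    w = sum(midranks(pooled)[:n1])  # rank sum of the first group
    u = w - n1 * (n1 + 1) / 2  # Mann-Whitney U statistic
    counts = {}
    for v in pooled:
        counts[v] = counts.get(v, 0) + 1
    tie_term = sum(t ** 3 - t for t in counts.values())
    variance = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    if variance == 0:  # all scores identical: no evidence either way
        return 0.5
    z = (u - n1 * n2 / 2) / math.sqrt(variance)
    return 1.0 - NormalDist().cdf(z)


# Toy selective-expression scores in one tissue.
print(rank_sum_greater([5.0, 6.0, 7.0], [1.0, 2.0, 3.0, 4.0]))
```

The normal approximation is rough for groups as small as in the toy call; it is used here only to keep the example free of third-party dependencies.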

Drugs

DESE

Infers effective drugs for a GWAS disease from selective drug-perturbed gene expression profiles using DESE. The assumption is that effective drugs may treat a disease by specifically disturbing the expression of its susceptibility genes. A detailed explanation can be found in the paper cited below. The options and input format are the same as those of the above analysis for associated cell types. The only difference is which expression profiles are supplied: instead of the gene expression profiles of various cell types or tissues, the gene expression profiles perturbed by various drugs are specified by --gene-score-file.

Citations

Li X, Xue C, Zhu Z, Yu X, Yang Q, Cui L, Li M. Application of GWAS summary data and drug-induced gene expression profiles of neural progenitor cells in psychiatric drug prioritization analysis. Mol Psychiatry. 2025 Jan;30(1):111-121.

Options

The tutorial command is:

java -Xmx4g -jar ../kggsum.jar \
   assoc \
   --ref-gty-file ./1kg_hg19_eur_chr1.vcf.gz \
                  refG=hg19 \
   --sum-file ./scz_gwas_eur_chr1.tsv.gz \
              cp12Cols=CHR,BP \
              pbsCols=P \
              refG=hg19 \
   --gene-score-file https://idc.biosino.org/pmglab/resource/kgg/kggsum/datasets/drugs/GEO_expression_profiles/hipsc_ctrl_with_se_drug_induced_foldchange.txt.gz \
   --threads 20 \
   --output ./t4

The options are identical to those for the associated Cell-type inference.

Output

The output files are the same as those of the CellTypes association analyses. The prioritized drugs are saved in GeneBasedConditionalAssociationTask\$scoreFileName.enrichment.txt, in which the Wilcoxon rank-sum test produces the enrichment scores, and the permutation approach is used to make valid statistical p-values with the consideration of multiple testing, selection bias, and internal correlation of gene perturbation scores.

Spatiality

Infers the spatial heterogeneity of cell types associated with complex diseases using DESE. We pre-integrated large-scale single-cell transcriptomics and spatial transcriptomics data to generate high-quality gene expression profiles of spatially specific cell types. These gene expression profiles can be input into DESE to estimate disease-associated spatially specific cell types and genes. Because the results need complex interactive visualizations, this functionality is implemented on the PSC web server (https://pmglab.top/psc), which enables convenient and rapid analysis and visualization of the results. The underlying program of PSC is still powered by the KGGSum platform.

Citations

Xue C., Liu M., Zhou M., Li M., PSC: a comprehensive web server for resolving spatial heterogeneity of cell types associated with complex phenotypes, 2024. Unpublished manuscript.

Output

The output results can be obtained and visualized on the website (https://pmglab.top/psc).
