Genes

GATES and ECS

Performs gene-based association analysis using GATES (a rapid and powerful Gene-based Association Test using Extended Simes procedure) and ECS (Effective Chi-square Statistics).

GATES (Li et al. 2011) extends the Simes procedure to dependent tests, since the individual GWAS tests are correlated due to LD. GATES calculates an effective number of independent p-values, which is then used in a Simes procedure. ECS (Li et al. 2019) first converts the p-values of a gene to chi-square statistics (one degree of freedom). Then, it merges all chi-square statistics of a gene after correcting for the redundancy among the statistics due to LD. The merged statistic is called an ECS and is used to calculate the p-value of the gene.
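The two building blocks mentioned above can be sketched in a few lines of Python. This is only an illustration, not KGGSum's implementation: `simes_p` is the classic Simes combination using the raw counts (GATES substitutes LD-based effective numbers of independent p-values, which are not computed here), and `p_to_chi2_1df` shows the p-value to chi-square conversion, assuming two-sided p-values. Both function names are hypothetical.

```python
from statistics import NormalDist


def simes_p(pvalues):
    """Classic Simes combined p-value: min over j of m * p_(j) / j.

    Uses the raw number of tests m and the raw rank j. GATES replaces
    both with effective numbers estimated from LD (not shown here).
    """
    ordered = sorted(pvalues)
    m = len(ordered)
    return min(m * p / rank for rank, p in enumerate(ordered, start=1))


def p_to_chi2_1df(p):
    """Convert a two-sided p-value (0 < p <= 1) to a 1-df chi-square statistic.

    Assumption: the p-value comes from a two-sided z-test, so the
    statistic is the squared normal quantile at p / 2.
    """
    z = NormalDist().inv_cdf(p / 2.0)
    return z * z
```

For example, `simes_p([0.01, 0.04, 0.03])` takes the minimum of 3 × 0.01 / 1, 3 × 0.03 / 2, and 3 × 0.04 / 3.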

Citations

  1. For gene-based association analysis: Miaoxin Li, Hong-Sheng Gui, Johnny S.H. Kwan, et al. GATES: a rapid and powerful gene-based association test using extended Simes procedure. The American Journal of Human Genetics, 2011, 88(3):283-293. PubMed Link

  2. For ECS and conditional gene-based association analysis: Miaoxin Li, Lin Jiang, T S H Mak, et al. A powerful conditional gene-based association approach implicated functionally important genes for schizophrenia. Bioinformatics, 2019, 35(4):628-635. PubMed Link

  3. For pathway or (gene-set) based association analysis: Hongsheng Gui, Johnny S Kwan, Pak C Sham et al. Sharing of Genes and Pathways Across Complex Phenotypes: A Multilevel Genome-Wide Analysis. Genetics, 2017, 206(3):1601-1609. PubMed Link

Options

This analysis inputs the p-values of SNPs and outputs the p-values of genes. The tutorial command is:

java -Xmx4g -jar ../kggsum.jar \
   assoc \
   --sum-file ./scz_gwas_eur_chr1.tsv.gz \
              cp12Cols=CHR,BP \
              pbsCols=P \
              refG=hg19 \
   --ref-gty-file ./1kg_hg19_eur_chr1.vcf.gz \
                  refG=hg19 \
   --output ./t1
Flag Description
assoc Trigger the gene-based association analysis.
--sum-file Specifies the file path of a GWAS summary file. The detailed description can be found here.
--ref-gty-file Specifies the file path of a reference genotype file. The detailed description can be found here.
--output Specifies the prefix of the output folder.
--gene-model-database Specifies the gene boundaries for the variant-to-gene mapping. The default one is gencode. It can be refgene as well. The former contains more non-coding genes than the latter, while the latter may be more reliable. The exact boundaries are set by --upstream-distance and --downstream-distance. The default values are 1000 bp for both boundaries.
--xqtl-file Specifies the target genes associated with each variant, typically using eQTL summary statistics. The detailed description can be found here. Note that the --gene-model-database and --xqtl-file options are mutually exclusive; specifying the latter will automatically disable the former. The --xqtl-file is usually used for causal gene inference via a Mendelian Randomization method (e.g., EMIC).
--max-condi-gene Specifies the maximal number of significant genes for the conditional gene-based test. The default value is 1000. The value -1 disables this setting.

Output

The numeric results of gene-based association tests are saved in GeneBasedAssociationTask\genes.hg38.assoc.txt. These are the columns in the file:

Header Description
RegionID Region IDs or Gene symbols
Chromosome Chromosome of the region or gene
StartPosition The position of the first SNP in the region or gene
EndPosition The position of the last SNP in the region or gene
#Var Number of variants within the region or gene
GATES.P p-value of GATES
ECS.P p-value of ECS
CCT.P the combined p-value of ECS and GATES by the Cauchy Combination test
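As a rough illustration of how a Cauchy combination test merges two p-values such as GATES.P and ECS.P, the sketch below applies the standard Cauchy combination formula with equal weights. The equal weighting and the function name `cauchy_combination` are assumptions; the source does not state how KGGSum weights the two tests.

```python
import math


def cauchy_combination(pvalues):
    """Combine p-values with the Cauchy combination test (equal weights).

    Each p-value in (0, 1) is mapped to a standard Cauchy variate,
    the variates are averaged, and the average is mapped back to a p-value.
    """
    if not pvalues:
        raise ValueError("at least one p-value is required")
    if any(p <= 0.0 or p >= 1.0 for p in pvalues):
        raise ValueError("p-values must lie strictly between 0 and 1")
    t = sum(math.tan((0.5 - p) * math.pi) for p in pvalues) / len(pvalues)
    return 0.5 - math.atan(t) / math.pi
```

When all inputs are equal, the combined p-value equals that common value, which is a quick sanity check on the formula.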

The Q-Q plots for p-values of gene-based association tests by GATES or ECS are saved in GeneBasedAssociationTask\genes.hg38.assoc.qq.pdf.

In addition, a conditional gene-based association test is then carried out on the significant genes to remove those whose significance is mostly due to LD with the most significant gene in a local region. The results are stored in GeneBasedConditionalAssociationTask\genes.hg38.condi.assoc.txt. These are the columns in the file:

Header Description
... ...
Condi.ECS.P The p-value of the conditional gene-based test by ECS

In the above analysis, variants are mapped to genes according to their physical positions in the gene models (--gene-model-database). However, remote regulatory variants that fall outside the gene boundaries are not included. One can use xQTL data to link distant variants to genes with the --xqtl-file option.


Heritability

Gene/region-based heritability estimation by EHE

Heritability measures how well differences in people’s genes account for differences in their phenotypes. This EHE analysis estimates each gene's heritability and performs gene-based association tests simultaneously (Miao et al. 2023).

Citations

Lin Miao, Lin Jiang, Bin Tang, Pak Chung Sham and Miaoxin Li. Dissecting the high-resolution genetic architecture of complex phenotypes by accurately estimating gene-based conditional heritability. The American Journal of Human Genetics (2023). 110(9):1534–1548. PubMed Link

Options

The tutorial command is:

java -Xmx4g -jar ../kggsum.jar \
   assoc \
   --calc-ehe \
   --sum-file ./scz_gwas_eur_chr1.tsv.gz \
              cp12Cols=CHR,BP \
              pbsCols=P \
              sampleSizeCols=Nca,Nco \
              refG=hg19 \
              prevalence=0.01 \
   --ref-gty-file ./1kg_hg19_eur_chr1.vcf.gz \
                  refG=hg19 \
   --output ./t2
Flag Description
assoc Trigger the gene-based association analysis.
--calc-ehe Asks the program to estimate heritability when calculating association. It has two sub-options.
calcCondi: also calculates the conditional heritability. This may be time-consuming, so it can be turned off by setting the value to no. The default is yes.
topGeneNum=0
--sum-file Specifies the file path of a GWAS summary file. For quantitative traits, a single column specifying the sample sizes is required. For binary traits, two columns indicating the case and control sample sizes are necessary. Additionally, for a disease phenotype, the disease prevalence must be specified. The detailed description can be found here.
--ref-gty-file Specifies the file path of a reference genotype file. The detailed description can be found here.
--output Specifies the prefix of the output folder.

Output

The gene-based association p-values and heritability estimates are saved in GeneBasedAssociationTask\genes.hg38.assoc.txt. These are the columns in the file:

Header Description
... ...
eH2 The estimated heritability of the region or gene by EHE.
eH2.SE The standard error of the estimated heritability.
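The eH2 and eH2.SE columns can be combined into approximate confidence intervals. The sketch below is a hypothetical helper, not part of KGGSum: it assumes the output file is tab-delimited with a header row containing RegionID, eH2, and eH2.SE, and it uses a normal approximation (estimate ± 1.96 × SE) for a 95% interval. Rows with non-numeric values are skipped.

```python
import csv


def heritability_intervals(path, z=1.96):
    """Yield (RegionID, eH2, lower, upper) for each parsable row.

    Assumptions: tab-delimited text with a header row, and a normal
    approximation for the interval (eH2 +/- z * eH2.SE).
    """
    with open(path, newline="") as handle:
        for row in csv.DictReader(handle, delimiter="\t"):
            try:
                h2 = float(row["eH2"])
                se = float(row["eH2.SE"])
            except (KeyError, TypeError, ValueError):
                continue  # skip rows without numeric estimates
            yield row.get("RegionID"), h2, h2 - z * se, h2 + z * se
```

A call such as `list(heritability_intervals("genes.hg38.assoc.txt"))` would return one tuple per gene; the path here is only a placeholder for the file inside the output folder.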

In addition, a conditional gene-based estimation is then carried out for significant genes to remove genes that have heritability merely due to LD with the most significant gene in a local region. The results are stored in GeneBasedConditionalAssociationTask\genes.hg38.condi.assoc.txt. These are the columns in the file:

Header Description
... ...
Condi.eH2 The estimated conditional heritability of the region or gene by EHE.
Condi.eH2.SE The standard error of the estimated conditional heritability.

CellTypes

DESE

DESE (Driver-tissue/cell Estimation by Selective Expression; Jiang et al. 2019) estimates driver tissues by tissue-selective expression of phenotype-associated genes in GWAS. The assumption is that the tissue-selective expression of causal or susceptibility genes indicates the tissues where complex phenotypes develop primarily, which are called driver tissues. Therefore, a driver tissue is likely to be enriched with the selective expression of susceptibility genes of a phenotype.

DESE initially analyzed the association by mapping SNPs to genes according to their physical distance. We further demonstrated that grouping eQTLs of a gene or a transcript to perform the association analysis could be more powerful. We named the eQTL-guided DESE eDESE. KGGSum implements DESE and eDESE with an improved effective chi-squared statistic to control type I error rates and remove redundant associations (Li et al. 2022).

DESE performs phenotype-tissue association tests and conditional gene-based association tests at the same time. This analysis takes the p-values of a GWAS and the expression profiles of multiple tissues as input and outputs the p-values of phenotype-tissue associations and the conditional p-values of genes.

Citations

  1. For phenotype-associated tissue estimation by DESE: Lin Jiang, Chao Xue, Sheng Dai, et al. DESE: estimating driver tissues by selective expression of genes associated with complex diseases or traits. Genome biology, 2019, 20(1):1-19. PubMed Link

  2. For phenotype-associated tissues' susceptibility genes and isoforms estimation: Xiangyi Li, Lin Jiang, Chao Xue, et al. A conditional gene-based association framework integrating isoform-level eQTL data reveals new susceptibility genes for schizophrenia. Elife. 2022 Apr 12;11:e70779. PubMed Link

  3. For phenotype-associated cell-type estimation by DESE: Xue C#, Jiang L#, Zhou M, Long Q, Chen Y, Li X, Peng W, Yang Q, Li M. PCGA: a comprehensive web server for phenotype-cell-gene association analysis. Nucleic Acids Res. 2022 May 26;50(W1):W568-76.

Options

The tutorial command is:

java -Xmx4g -jar ../kggsum.jar \
   assoc \
   --sum-file ./scz_gwas_eur_chr1.tsv.gz \
              cp12Cols=CHR,BP \
              pbsCols=P \
              refG=hg19 \
   --ref-gty-file ./1kg_hg19_eur_chr1.vcf.gz \
                  refG=hg19 \
   --gene-score-file ../resources/GTEx_v8_TMM_all.gene.meanSE.txt.gz \
   --output ./t3
Flag Description
assoc, --sum-file, --ref-gty-file, and --output have the same functions as previously described.
--gene-score-file Specifies a gene score file. The scores can represent various attributes, such as RNA expression, protein expression, epigenetic markers, or perturbation profiles at genes. See more at the Input Data Description.
--gene-p-cut Set p-value threshold to select significant genes for the conditional gene-based test. The default value is 0.05.
--gene-multiple-testing Specifies the method for multiple testing correction with a given p-value threshold to select significant genes for the conditional gene-based test. It has three alternative method labels:
bonf: Bonferroni correction with family-wise threshold specified by --gene-p-cut
benfdr: Filter by the false discovery rate (FDR) calculated by Benjamini-Hochberg procedure. The threshold is also defined by --gene-p-cut
fixed: Filter by the p-value threshold specified by --gene-p-cut without any multiple testing correction.
The default value is bonf.
--max-condi-gene Set the maximal number of significant genes for the conditional gene-based test. The default value is 1000. The value -1 disables this setting.
--permutation-num Set the number of permutations used to adjust the p-values of driver-tissue or cell-type inference for selection bias and multiple testing. The default value is 100. A larger number increases the running time.
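The three --gene-multiple-testing labels described above correspond to standard selection rules, which can be sketched as follows. This is a textbook illustration under stated assumptions, not KGGSum's code: the function name `select_genes` is hypothetical, and whether the program uses "less than" or "less than or equal to" at the boundary is not stated in the source.

```python
def select_genes(pvalues, cut=0.05, method="bonf"):
    """Return the indices of genes that pass the chosen selection rule.

    fixed : p <= cut, no correction
    bonf  : p <= cut / n (Bonferroni, family-wise threshold cut)
    benfdr: Benjamini-Hochberg step-up procedure at FDR level cut
    """
    n = len(pvalues)
    if method == "fixed":
        return [i for i, p in enumerate(pvalues) if p <= cut]
    if method == "bonf":
        return [i for i, p in enumerate(pvalues) if p <= cut / n]
    if method == "benfdr":
        order = sorted(range(n), key=lambda i: pvalues[i])
        k_max = 0  # largest rank k with p_(k) <= k * cut / n
        for rank, i in enumerate(order, start=1):
            if pvalues[i] <= rank * cut / n:
                k_max = rank
        return sorted(order[:k_max])
    raise ValueError("method must be 'bonf', 'benfdr', or 'fixed'")
```

With the same threshold, bonf is the most conservative of the three rules and fixed the least, which is why the number of genes entering the conditional test can differ considerably between labels.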

Output files

This function produces three sets of results: the gene-based association summary statistics saved in GeneBasedAssociationTask\genes.hg38.assoc.txt, the gene-based conditional association summary statistics saved in GeneBasedConditionalAssociationTask\genes.hg38.condi.assoc.txt, and the integrative enrichment summary statistics saved in GeneBasedConditionalAssociationTask\$scoreFileName.enrichment.txt. The enrichment file reports the result of the Wilcoxon rank-sum test, which tests whether the median selective expression of the phenotype-associated genes is significantly higher than that of the other genes in the interrogated tissue. The file contains four columns:

Header Description
Condition Name of the tissue being tested
Unadjusted(p) Unadjusted p-values for the tissue-phenotype associations
Adjusted(p) Adjusted p-values, calculated by adjusting for both selection bias and multiple testing through permutation of gene scores within each condition.
Median(IQR)SigVsAll Median (interquartile range) expression of the conditionally significant genes and all the background genes
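The unadjusted enrichment p-value comes from a one-sided Wilcoxon rank-sum test. The sketch below illustrates that kind of test with a normal approximation. It is a simplified illustration rather than the program's implementation: the function name `rank_sum_greater` is hypothetical, ties count as half a win, and no tie or continuity correction is applied to the variance.

```python
from math import sqrt
from statistics import NormalDist


def rank_sum_greater(sample, background):
    """One-sided rank-sum test that `sample` tends to exceed `background`.

    Returns (U, p), where U counts the pairs with sample > background
    (ties count 0.5) and p uses the normal approximation to U.
    """
    n1, n2 = len(sample), len(background)
    if n1 == 0 or n2 == 0:
        raise ValueError("both groups must be non-empty")
    u = 0.0
    for x in sample:
        for y in background:
            if x > y:
                u += 1.0
            elif x == y:
                u += 0.5
    mean_u = n1 * n2 / 2.0
    sd_u = sqrt(n1 * n2 * (n1 + n2 + 1) / 12.0)
    z = (u - mean_u) / sd_u
    return u, 1.0 - NormalDist().cdf(z)
```

Here `sample` would hold the selective-expression scores of the conditionally significant genes in one tissue and `background` the scores of the remaining genes. The permutation-based adjustment reported in Adjusted(p) is not reproduced.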

Drugs

DESE

Infers effective drugs for a GWAS disease from selective drug-perturbed gene expression profiles by DESE. The assumption is that effective drugs may treat a disease by specifically disturbing the expression of disease-susceptibility genes. A detailed explanation can be found in this paper. The options and input format are the same as those of the above analyses for associated cell types. The only difference is which expression profiles are input: instead of the gene expression profiles of various cell types or tissues, the gene expression profiles perturbed by various drugs are specified by --gene-score-file.

Citations

  1. Li X, Xue C, Zhu Z, Yu X, Yang Q, Cui L, Li M. Application of GWAS summary data and drug-induced gene expression profiles of neural progenitor cells in psychiatric drug prioritization analysis. Mol Psychiatry. 2025 Jan;30(1):111-121.

Options

The tutorial command is:

java -Xmx4g -jar ../kggsum.jar \
   assoc \
   --sum-file ./scz_gwas_eur_chr1.tsv.gz \
              cp12Cols=CHR,BP \
              pbsCols=P \
              refG=hg19 \
   --ref-gty-file ./1kg_hg19_eur_chr1.vcf.gz \
                  refG=hg19 \
   --gene-score-file https://idc.biosino.org/pmglab/resource/kgg/kggsum/datasets/drugs/GEO_expression_profiles/hipsc_ctrl_with_se_drug_induced_foldchange.txt.gz \
   --threads 20 \
   --output ./t4

The options are identical to those for the associated Cell-type inference.

Output

The output files are the same as those of the CellTypes association analyses. The prioritized drugs are saved in GeneBasedConditionalAssociationTask\$scoreFileName.enrichment.txt, in which the Wilcoxon rank-sum test produces the enrichment scores, and the permutation approach is used to produce valid p-values that account for multiple testing, selection bias, and internal correlation of gene perturbation scores.

Spatiality

Infer the spatial heterogeneity of cell types associated with complex diseases using DESE. We pre-integrated large-scale single-cell transcriptomics and spatial transcriptomics data to generate high-quality gene expression profiles of spatially specific cell types. The gene expression profiles can be input into DESE to estimate disease-associated spatially specific cell types and genes. Given the need for complex interactive visualizations, this functionality is implemented on the PSC web server (https://pmglab.top/psc), enabling convenient and rapid analysis and visualization of the results. Incidentally, the underlying program of PSC is still powered by the KGGSum platform.

Citations

  1. Xue C., Liu M., Zhou M., Li M., PSC: a comprehensive web server for resolving spatial heterogeneity of cell types associated with complex phenotypes, 2024. Unpublished manuscript.

Output

The output results can be obtained and visualized on the website (https://pmglab.top/psc).

Copyright © MiaoXin Li, all rights reserved. Last modified time: 2025-04-04 04:12:13
