Brief Document

KGGSum（Knowledge-based Genetic and Genomic analysis platform for GWAS Summary statistics）¶

KGGSum (Knowledge-based Genetic and Genomic analysis platform for GWAS Summary statistics) is an open-source, Java-based platform designed for interpreting LARGE-scale GWAS signals through comprehensive, integrative analyses of omics data using advanced statistical models. It provides a cohesive framework that combines GWAS summary statistics with diverse omics layers, ranging from transcriptomic and proteomic data to perturbation sets, expression data across different cell types. Utilizing the robust GTB and ECC file systems, KGGSum effectively manages intermediate and resource data, supporting advanced models for prioritizing genes, cell types, gene networks, life exposures, and even microbiomes driving complex phenotypes. With an intuitive command-line interface, extensive functionality, and thorough documentation, KGGSum is a versatile, one-stop solution for large-scale post-GWAS analyses.

Diangram of KGGSum

Functionality¶

KGGSum currently supports the following function modules:

association: a module that links genes, cell types, and gene networks to phenotypes based on the GWAS signals at variants. Various association analyses are driven by the corresponding resource data.
causation: a module that infers causation from genes or exposures to phenotypes, mainly by various Mendelian randomization methods.
annotation: a module that annotates variants with gene features and functional prediction scores.

Functions of KGGSum

Unique Advantages of KGGSum¶

Fast and easy screening for multiple LARGE-scale GWAS summary datasets at a time
Comprehensive with numerous high-quality resources for integrative analyses
One-stop platform for a series of professional data mining

Citations¶

We kindly ask that you cite the relevant papers associated with the methods you utilize in the KGGSum platform. You can find the corresponding references under each method’s introduction.

Download and Setup

KGGSum and Its Running Resources¶

KGGSum is a Java-based tool distributed as a Java Archive file (kggsum.jar). Depending on the analysis type, additional resource files may be needed:

Gene-based association tests (GATES, ECS) and heritability estimation (EHE) require gene annotations.
Gene-expression causal-effect estimation (EMIC) requires eQTL summary statistics.
Almost all analyses in KGGSum require reference genotypes to estimate linkage disequilibrium of variants.

The following table provides download links for the latest version of KGGSum and its resources:

File	Description	Main Version	Size
kggsum.jar	Core program file	1.0	~32 MB
resources.zip	Resource files (excluding reference genotypes and eQTL data)	1.0	~180 MB
tutorials.zip	Example datasets for tutorials	1.0	~255 MB
	For historical releases

Contents of resources.zip

The resources.zip package provides the essential annotation and coordinate-conversion files used by KGGSum:

File	Description
`resources/liftover/hg19ToHg38.over.chain.gz`	Chain file for converting coordinates from hg19 to hg38
`resources/reference/CanonicalTranscript.txt.gz`	Canonical transcript metadata (gene symbol, Ensembl ID, etc.)
`resources/reference/GEncode_hg38_kggseq_v2.txt.gz`	GENCODE gene annotations (hg38)
`resources/reference/refGene_hg38_kggseq_v2.txt.gz`	RefGene gene annotations (hg38)

💡 KGGSum allows flexible resource usage — you can download only the files needed for your specific analysis.

Installation and Environment Setup¶

KGGSum requires Java Runtime Environment (JRE) 1.8 or higher. You may use either Oracle Java or OpenJDK.

To verify your installation, open a terminal (or Command Prompt on Windows) and run:

java -version

If the output shows a version ≥ 11, your Java environment is ready, and KGGSum can run normally.

💡 If KGGSum cannot start, check whether Java is added to your system PATH variable.

Also, ensure your system meets the recommended hardware specifications, especially for large datasets.

Quick Tutorial Setup¶

Follow these steps to set up a minimal working environment for KGGSum tutorials:

1. Download Required Files: kggsum.jar, resources.zip and tutorials.zip
2. Extract Files: Unzip both resources.zip and tutorials.zip into your preferred directory.
3. Organize Folder Structure: Place all components under one working folder, e.g.:

my-project/
├── kggsum.jar
├── resources/
└── tutorials/

4. Test the Installation

Open a terminal in the tutorials/ directory and run:
```
java -jar ../kggsum.jar
```
If installed successfully, KGGSum will print usage information or help text.

Notes

The resources/ folder contains files necessary for tutorials (e.g., liftover chains, gene annotations).
For advanced analyses, additional resources (e.g., reference genotypes, eQTL summary statistics) may be downloaded separately.

Resources

Resources for Analyses¶

We have investigated various types of public data that you can download to carry out large-scale knowledge-based data mining analyses on KGGSum. The following are available links for downloading.

Type	Description
Reference	Variant IDs and genotypes in VCF or GTB formats are used as reference panels to correct the coordinates LD of GWAS variants.
Genotypes	1. 1000 Genomes Project (1KG): the genotypes of 5 different ancestry panels can be downloaded here. 2. The Haplotype Reference Consortium (HRC): It contains high-quality haplotype reference data for approximately 64,940 European individuals. The dataset publishes its sub-reference dataset at https://ega-archive.org/datasets/EGAD00001002729, and you can apply to download the original .vcf file. Note that the ancestry of the reference panel should be identical to that of the GWAS sample.
dbSNP IDs	This is required when only dbSNP SNP IDs (rs#), but not genomic coordinates, are available in the GWAS summary data. RSID coordinates can be accessed at NCBI FTP. We provide a GTB version of this dataset, converted from dbSNP’s VCF (B157).

GWAS summary	The GWAS summary statistics of phenotypes. Many public domains are providing such data. As long as the data are in text and contain the columns required by KGGSum, it can be used as input.
Phenotypes	The following domains are widely used with downloadable GWAS summary statistics datasets. 1. Open GWAS: A database of genetic associations from GWAS summary datasets for querying or downloading. Note: The GWAS summary data downloaded from Open GWAS in VCF form can be directly used as input of KGGSum without format conversion. 2. GWAS Catalog: The NHGRI-EBI Catalog of human genome-wide association studies 3. Pan-UKBB: Pan-UKBB provides a multi-ancestry analysis of 7,228 phenotypes using a generalized mixed model association testing framework spanning 16,131 genome-wide association studies. 4. FinnGen: is a research project in genomics and personalized medicine. It is a large public-private partnership that has collected and analyzed genome and health data from 500,000 Finnish biobank donors to understand the genetic basis of diseases.
Microbes	A collection of GWAS summary datasets for microbes quantities in the human digestive system. They were curated from three databases: Dutch Microbiome Project, Mibiogen, and Finrisk. The curated datasets can be downloaded from here.

Gene expression	The gene expression profiles are curated from various public domains.
Tissues	The expression profiles of genes in 54 organs of tissues were curated from the GTEx project. It has been included in the downloaded resources.zip file, with the file name, GTEx_v8_TMM_all.gene.meanSE.txt.gz.
Cell-types	Gene expression profiles of 6,598 single-cell types from humans and mice, compiled by PCGA (https://pmglab.top/pcga) from publicly available scRNA-seq datasets. The curated datasets can be downloaded from here.
Spatial and cell-types	The phenotype-relevant Spatial and cell types can be inferred at the online platform, PSC https://pmglab.top/psc/. The backend analysis tool for this online platform is KGGSum.
Drug perturbation	The gene expression perturbation profiles in multiple cell types by various drugs. The preprocessing pipeline and methods can be seen in this Molecular Psychiatry paper. The curated datasets can be downloaded from here.

xQTL	The variants linked to genes by properties of genes such as RNA expression, splicing, protein expression, and DNA methylation.
eQTL	GTEx eQTL: cis-eQTL summary statistics calculated from the gene or transcript-level expression profile of 49 tissues or organs provided by the GTEx project (v8). The curated datasets can be downloaded from here.

Resources for Annotation¶

The great efforts of international projects and functional genomic studies have led to a rich tapestry of information from diverse biological domains that can be leveraged for variant annotation. We meticulously pre-processed some annotation fields from several commonly used databases to facilitate convenient variant annotation. Users can download the entire or study-needed databases directly from our website and start annotations for a large number of variants. For small-scale annotations (e.g., <10,000 variants), users can access these databases remotely via FTP/HTTP. The program will automatically fetch the specified slices of databases based on the index file of the given variants in a distributed fashion, eliminating the need to download the entire database locally and ensuring a more convenient and expedited process. Given the expansive and dynamic nature of genomic databases, KGGSum also offers users the adaptability to incorporate custom databases for annotation purposes.

Gene feature annotation¶

We support three well-established gene annotation systems: RefSeq genes (refgene), and GENCODE genes (gencode). In KGGSum, databases used for gene feature annotation should be constructed in FASTA format. Additionally, KGGA also accommodates custom gene databases formatted in FASTA, allowing users to integrate their own gene annotations tailored to specific research needs or preferences.

Database Name	FTP/HTTP	Version
refgene	https://ftp.ncbi.nih.gov/refseq/H_sapiens/annotation/	Updated on 2024-08-27
encode	https://www.gencodegenes.org/human/	Release 47

FAQ

General¶

Why do I need KGGSum?

KGGSum is fast, convenient, and integrates multiple analysis functions for large-scale GWAS summary data mining.

How do I cite KGGSum?

The publication of the specific method(s) you use should be cited to allow readers to understand the principles of your analyses.

Feedback and Support¶

Where do I report bugs or feedback?

You can send an email to limx54@163.com for this.

Using AI Assistance¶

How can I use generative AI to assist with KGGSum?

You can print the full documents of the online user manual as a PDF file and upload the PDF file to the AI and then ask questions like:

“Please generate commands for me to perform gene-based analysis using KGGSum with my GWAS summary file (abc.tsv.gz) and reference genome = hg19”.

Troubleshooting¶

Java Version Issue (KGGSum)¶

Symptoms¶

When running Association → Region/Gene-based association analysis, you may see:
java.lang.UnsupportedClassVersionError: org/apache/arrow/memory/BufferAllocator ... class file version 55.0 ... up to 52.0
Followed by: java.lang.ArrayIndexOutOfBoundsException: -1

Root Cause¶

Apache Arrow integrated in KGGSum v1.0 is compiled with Java 11 (class version 55).
Your runtime environment only provides Java 8 (class version 52), causing class loading to fail.
Thread-pool tasks exit unexpectedly, triggering a secondary exception (array out-of-bounds).

Solution¶

Install OpenJDK 17 on Ubuntu: sudo apt install openjdk-17-jre
Run KGGSum with the newer Java runtime: /usr/lib/jvm/java-17-openjdk-amd64/bin/java -jar kggsum.jar
Result: the issue disappears; the Association example command completes normally with full output.

Conclusion¶

In practice, Java Runtime ≥ 21 is more reliable (even though the manual states ≥ 17).

gsMap Installation Environment Issue¶

Symptoms¶

In Conda (Python 3.13), running pip install gsmap may fail because numpy<2.0.0 needs to be built from source, with errors such as:
ERROR: Unknown compiler(s)... [Errno 2] No such file or directory: 'gcc'

Root Cause¶

Python 3.13 may not have an available prebuilt NumPy wheel, so pip falls back to building from source.
Your system does not have a C/C++ compiler installed (gcc/clang), so the build fails.
Ecosystem support is still catching up: gsmap is not yet well-supported on Python 3.13.

Solution¶

Create a new Python 3.10 environment: conda create -n gsMap python=3.10
Install dependencies in the new environment: conda install numpy, then pip install gsmap
Result: gsmap installs and works normally.

Note¶

TC-04 depends on gsmap. Due to local Python 3.13 and a missing compiler, this testing round does not cover sDESE.

Detailed Document

Basic Options

Input

The main input of KGGSum includes the GWAS summary data in a text file and reference genotypes of GWAS in VCF or GTB formats.

GWAS summary data file¶

The GWAS summary data file includes variants with GWAS association summaries arranged in rows, with columns separated by whitespace or tabs. The minimal required columns are genomic coordinates and p-values (CHR, BP, and P by default). Additional statistical attributes may be needed for specific analyses; these requirements are detailed in the descriptions of the corresponding analysis functions:

CHR	POS	A1	A2	AF_A1	Beta	SE	P
1	86028	T	C	0.908392	-0.00108	0.00356	0.7626
1	693731	A	G	0.877949	-0.00034	0.00339	0.9209
1	713092	A	G	0.00623653	-0.04053	0.02491	0.1038
1	714596	T	C	0.962227	-0.00045	0.00615	0.9407
1	715205	C	G	0.993317	0.03543	0.02428	0.1445

The corresponding option for the GWAS summary input file is --sum-file or -sf. More settings about the variants and statistics can be specified in the option.

--sum-file, -sf Specify the path and required columns of one or more GWAS summary input files.

Format

--sum-file file [cp12Cols=CHR,POS,..] [r12Cols=RSID,..] [pbsCols=P,...] [refG=] [sep=] [freqA1Col=] [sampleSizeCols=] [betaType=<0/1/2>] [prevalence=] [exclude=]

Example

--sum-file ./CAD_UKBIOBANK.gz cp12Cols=chr,bp,a1,a2 pbsCols=pval,beta,se refG=hg19 freqA1Col=AF_A1 exclude=chr6:27477797~35448354

Options

file specifies the path to the GWAS summary statistics. This can be a local file, an internet URL, or an intranet file path accessed via SFTP.
Note: A LOCAL file path allows wildcards (say, *.tsv) to specify multiple files as a single input. KGGSum will process the files one by one.

cp12Cols specifies the column name in the summary file: chromosome, positions, effective (VCF alternative, A1) allele, and base (VCF reference, A2) allele.

r12Cols specifies the column names in the summary file: dbSNP rs ID, effective allele (VCF alternative, A1), and base allele (VCF reference, A2). The corresponding coordinates will be retrieved from the dbSNP database. Ensure the database file in GTB format is downloaded, unzipped, and placed in KGGSum’s working directory: ./resources/dbsnp/*.gtb.
Note: r12Cols and cp12Cols are mutually exclusive.

pbsCols specifies the column name in the summary file, which are p-values, effect size, and standard errors of effect size.

type specifies the file type of the GWAS summary file. The default format is TSV (Tab-Separated Values). Two alternative formats are also supported:

VCF: Recognized by the .vcf.gz or .vcf file extension. This is an adaptation of the Variant Call Format for GWAS summary statistics, compliant with the Open GWAS standard. Note that no column mapping is required for standard GWAS-VCF files. The default column attributes are: pbsCols (LP, ES, SE), sampleSizeCols (NS, NC), and freqA1Col (AF). For more details, see the gwas-vcf-specification.

GTB: Recognized by the .gtb file extension. This is a custom format developed by KGGSum to efficiently store GWAS summary statistics. No column mapping is required for standard GWAS-GTB files. The default column attributes are: pbsCols (P, Beta, SE), sampleSizeCols (N), and freqA1Col (A1F).

refG specifies the reference genome of input variants, . The default is refG=hg38. Note that incorrect specification of the genome version will lead to the mismatching of GWAS variants with the annotation base variants. All built-in annotation of KGGSum is hg38.

sep specifies the separator of the summary file. By default, it can recognize tabs and spaces. It recognizes four values, . The UNIVERSAL means tabs and spaces or commas. The default is sep=UNIVERSAL

freqA1Col specifies the column for the value of A1's frequency sampleSizeCols specifies the columns for the sample sizes of cases and controls. If only one column is specified, it is supposed to be the whole sample size.

betaType specifies the type of effect sizes, <0/1/2>.
0 means coefficients of linear regression for a quantitative phenotype beta; 1 means coefficients of logistic regression or the logarithms of odds ratio for a qualitative phenotype; 2 means the odds ratio for a qualitative phenotype. The default is betaType=1.

prevalence specifies the disease prevalence in a population. This is only required for a GWAS of disease phenotypes. exclude specifies the genomic regions of variants to be excluded, e.g., exclude=chr6:27477797~35448354 to exclude the HLC regions (hg19) to avoid time-consuming computation due to complex LD patterns.

genomicControl asks KGGSum to adjust the p-values and chi-square statistics using a genomic control (GC) factor from the input GWAS data before all follow-up analyses. genomicControl=-1 means to use the calculated GC factor from input p-values.

Reference genotypes for linkage-disequilibrium calculation¶

In the analysis of some methods, genotype data from GWAS samples are required to perform linkage disequilibrium correction. However, such genotypes are often unavailable. In these cases, ancestry-matched reference genotypes can be used as a substitute for KGGSum. Suitable reference datasets include genotypes from the 1000 Genomes Project or the UK Biobank. These references are primarily used to estimate LD for common variants. Ideally, the reference dataset should include between 500 and 5000 subjects; larger datasets, while more comprehensive, may increase computation time. KGGSum supports two genotype file formats: VCF and [GenoType Block (GTB)], with the option --ref-gty-file:

Option	Description	Default
`--ref-gty-file`	Specify the path reference genotype file. It is a combination of multiple parameters. `<type>` is used to specify the format of the input file. `<refG>` is used to specify the reference genome of input variants. Format: `--input <file> type=[AUTO/VCF/GTB] refG=[hg18/hg19/hg38]` Eaxmple: `--input ./example.vcf.gz type=VCF refG=hg38`	type=AUTO refG=hg19

Gene Score Profiles¶

The gene score profile contains various values representing different contexts or conditions, such as RNA or protein expression levels or perturbation effects. Each row corresponds to a gene, and the columns, separated by tabs, represent different contexts or conditions. For each context or condition, two columns can be provided: one for the mean (labeled with .mean) and another for the standard error (SE, labeled with .SE). While the SE column is optional, including it can enhance the accuracy of the analysis. Below is an example of the file format:

Gene	Adipose-Subcutaneous.mean	Adipose-Subcutaneous.SE	Adipose-VisceralOmentum.mean	Adipose-VisceralOmentum.SE	…
ENSG00000223972.5	0.0038016	0.00036668	0.0045709	0.00046303	…
ENSG00000227232.5	1.9911	0.030021	1.8841	0.040247	…
ENSG00000278267.1	0.00049215	0.00010645	0.00036466	9.29E-05	…
ENSG00000243485.5	0.0047772	0.00038018	0.0067897	0.00074318	…
ENSG00000237613.2	0.0030462	0.00027513	0.0030465	0.00031694	…

The path and relevant settings can be specified by the option:

Option Description

--gene-score-file The scores can represent various attributes, such as RNA expression, protein expression, epigenetic markers, or perturbation profiles at genes. Each row corresponds to a gene, and each column (except the first) represents a condition. The first column should contain the gene symbols. This is a combination parameter with the following options:

file: Specifies the file path of the gene score file, which can be a local path or a remote path accessed via a network. NOTE: For a LOCAL file path, it allows wildcards (say, brain.tsv) to specify multiple files as a single input.
NOTE*: It also supports a more efficient format, Feather, to speed up analysis

calcSpecificity: Triggers the calculation of the specificity of gene scores for each condition. The default is “y(es)”.

disableDirection: Instructs KGGSum to ignore the directionality of specificity. The default is “n(o)”.
ignoreSE: Instructs KGGSum to ignore the SE of expression mean values (if available) when calculating the specificity. The default is “n(o)”.
minValue: Instructs KGGSum to ignore the expression below a minimal value. The default is NA.

Format: --gene-score-file file=file/path calcSpecifity=<y/n> disableDirection=<y/n>
Example: --gene-score-file file/path
calcSpecifity=y

xQTL summary data¶

This dataset is used to link variants to their target genes, typically using each gene’s eQTL summary statistics. Each row represents an eQTL and must include the following nine columns: gene symbol, gene ID, chromosome, position, p-value, effective (alternative) allele, base (reference) allele, effect size, and standard error. Below is an example of the file format:

symbol	id	chr	pos	ref	alt	altfreq	beta	se	p
LINC00115	ENSG00000225880	1	796375	T	C	0.149	-0.223	0.081	5.87E-03
LINC00115	ENSG00000225880	1	797440	T	C	0.159	-0.24	0.078	2.28E-03
LINC00115	ENSG00000225880	1	802496	C	T	0.146	-0.247	0.083	2.95E-03
LINC00115	ENSG00000225880	1	812743	C	T	0.17	-0.19	0.073	9.57E-03
LINC01128	ENSG00000228794	1	693731	A	G	0.118	-0.258	0.094	6.31E-03
LINC01128	ENSG00000228794	1	731718	T	C	0.151	-0.293	0.084	4.50E-04

The path and relevant settings can be specified by the option:

Option Description

--xqtl-file Specify the xQTL summary file. This is a combination of parameters.
In the file, one row represents a genetic variant with its association summary to a gene. The association can be calculated based on various gene characteristics, including RNA expression (eQTL), RNA splicing (sQTL), protein expression (pQTL), and methylation (mQTL). The first column should contain the gene symbols. This is a combination parameter with the following options:
- file specifies the path to the xQTL summary statistics. This can be a local file, an internet URL, or an intranet file path accessed via SFTP. NOTE: For a LOCAL file path, it allows wildcards (say, a*b?.qtl.tsv.gz) to specify multiple files as a single input.
-cp12Cols specifies the column names in the summary file, which are chromosome, positions, effective (VCF alternative, A1) allele, and base (VCF reference, A2) allele.
-pbsCols specifies the column names in the summary file, which are p-values, effect size, and standard errors of effect size.
-giCols specifies the column names in the summary file, which are gene symbols, and gene ID.
-freqA1Col specifies the column for the value of A1’s frequency
-sampleSizeCols specifies the columns for the sample sizes for the xQTL.
-refG specifies the reference genome of input variants. The default value is hg19.
-sep specifies the separator of the summary file. By default, it can recognize Tabs, spaces and commas, and the corresponding tag is UNIVERSAL
-pCut specifies the p-value threshold for selecting significant xQTL for subsequent analyses. The default value is pCut=1E-6
-ldCut specifies the LD r² to prune highly redundant xQTLs. The default value is ldCut=0.8

Format: --xqtl-file file=file/path [cp12Cols=chr,pos,alt,ref] [pbsCols=p,beta,se] [giCols=symbol,id] [refG=hg19] [sep=TAB] [freqA1Col=altfreq] [sampleSizeCols=neff] [pCut=1E-6][ldCut=0.8]

KGGSum operates using a task-by-task model for all analyses. Each task generates an efficient binary file named variants.annot.hg38.gtb, which stores variants along with harmonized statistics, intermediate results and annotations. This design allows users to seamlessly resume interrupted workflows or re-run analyses with adjusted parameters, starting from the previous breakpoints or creating branched workflows.

Option	Description	Default
`--output`	Specify the output folder path. All data from each task will be put under the specified folder. That preserve intermediate files and can avoid duplicate tasks. Format: `--output <dir>` Example: `--output ./out/test`	./kggsum
`--clean-intermediate-data`	Clean the all intermediate data of the analysis, reducing memory usage. Format: `--clean-intermediate-data`	[OFF]

While each analysis has a unique task, some common tasks are listed below.

File	Description
ConvertVCF2GTBTask*.gtb	The gtb format file converted from the input VCF file by `--ref-gty-file`
GenerateRootVariantSetTask/variants.annot.hg38.gtb	The variants extracted from the reference genotype file specified by `--ref-gty-file` in gtb format will be used as the base for the following analysis.
AppendVariants2RootVariantSetTask/variants.annot.hg38.gtb	The base variants appended with GWAS summary statistics specified by the `--sum-file`
GeneFeatureAnnotationTask/variants.annot.hg38.gtb	The variants annotated with gene features subsequently
OutputVariants2TSVTask/variants.hg38.tsv.gz	The GWAS summary and annotations of variants retained for analysis in TSV format.

There are also some options for resource file paths, template data, and parallel tasks.

Option	Description	Default
`--threads`	Specify the number of threads on which the program will be running. Format: `--threads <int>` Example: `--threads 8` NOTE: As a rule of thumb, please do not give a thread number larger than the number of CPU cores. Too many threads may even slow down the analysis.	4
`--channel`	Specify the path of resource data files. It could be a local system file path, an intranet file path, or even an internet one. Format: `--channel path/to/file` Example: `--channel https://idc.biosino.org/pmglab/resource/kgg/kggsum/resources/` Note: this option should be placed ahead of the main options. e.g., `java -Xmx4g -jar ../kggsum.jar --channel ./resources assoc`	./resources

`--chromosome-tag`	Specify the recognizable chromosomes. By default, it only considers the standard chromosomes, 1, …, 22, X, Y, and M.	1,…,22,X,Y,M
update	Check and update the local kggsum.jar file to the latest version available on the website. Example: `java -jar kggsum.jar update` Note: The update option does not work on Windows. On Windows, you must manually download the latest kggsum.jar file and replace the existing one.	-

Association

Association¶

About¶

The association module in KGGSum offers various functions to link genes, cell types, gene networks, and even drugs to phenotypes using GWAS signals from variants. A common step across all analyses involves aggregating association signals from multiple variants to derive gene-level associations using GATES (Li et al., 2011) and ECS (Li et al., 2019). Advanced association analyses build upon these gene-level associations. The rationale behind each type of association analysis is detailed in the respective option descriptions and relevant publications.

Main Workflow of the Association Module¶

Generation: Extract variant coordinates and frequencies from the VCF or GTB file to create a root variant set for further analysis.
Annotation: Annotate the root variant with gene features or xQTLs.
Append: Integrate GWAS variants and their summary statistics into the annotated root variant set.
Gene-based association: Conduct gene-based association analyses based on the gene feature annotations.
Advanced association: Conduct gene-based association analyses based on the gene feature annotations.
…: (Additional association analysis as specified).

assoc

Basic Usage¶

java -jar kggsum.jar assoc --ref-gty-file <input1> --sum-file <input2> --output <output> [options]

By default, the program performs gene-based association tests using GATES and ECS. Additional association analyses in this module can be initiated by specifying relevant input options (e.g., --gene-score-file).

Methods & Options

Genes¶

GATES and ECS¶

Performs the gene-based association analysis using GATES (a rapid and powerful Gene-based Association Test using Extended Simes procedure) and ECS (an Effective Chi-square Statistics).

GATES (Li et al. 2011) is basically an extension of the Simes procedure to dependent tests, as the individual GWAS tests are dependent due to LD. GATES calculates an effective number of independent p-values, which is then used by a Simes procedure. ECS (Li et al. 2019) first converts the p-values of a gene to chi-square statistics(one degree of freedom). Then, it merges all chi-square statistics of a gene after correcting the redundancy of the statistics due to LD. The merged statistic is called an ECS and is used to calculate the p-value of the gene.

Citations¶

For gene-based association analysis: Miaoxin Li, Hong-Sheng Gui, Johnny S.H. Kwan, et al. GATES: a rapid and powerful gene-based association test using the extended Simes procedure. The American Journal of Human Genetics, 2011, 88(3):283-293. PubMed Link
For ECS and conditional gene-based association analysis Miaoxin Li, Lin Jiang, T S H Mak, et al. A powerful conditional gene-based association approach implicated functionally important genes for schizophrenia. Bioinformatics, 2019, 35(4):628-635. PubMed Link
For pathway or (gene-set) based association analysis: Hongsheng Gui, Johnny S Kwan, Pak C Sham et al. Sharing of Genes and Pathways Across Complex Phenotypes: A Multilevel Genome-Wide Analysis. Genetics, 2017, 206(3):1601-1609. PubMed Link

Options¶

This analysis takes the p-values of SNPs as input and outputs the p-values of genes. The tutorial command is:

java -Xmx4g -jar ../kggsum.jar \
   assoc \
   --ref-gty-file ./1kg_hg19_eur_chr1.vcf.gz \
                  refG=hg19 \
   --sum-file ./scz_gwas_eur_chr1.tsv.gz \
              cp12Cols=CHR,BP \
              pbsCols=P \
              refG=hg19 \
   --output ./t1

Flag	Description
`assoc`	Trigger the gene-based association analysis.
`--sum-file`	Specifies the file path of a GWAS summary file. The detailed description can be found here.
`--ref-gty-file`	Specifies the file path of a reference genotype file. The detailed description can be found here.
`--output`	Specifies the prefix of the output folder.
`--gene-model-database`	Specifies the gene model database for annotating variants to genes. Syntax `--gene-model-database name [path=<path>]` Arguments `name` (Required) The name of the database. Accepted values are: 1. refgene OR gencode: Use a built-in database. KGGSum automatically loads the data from its internal resources directory. 2. customized: Use a user-provided gene model file. This requires the path argument. `path=<path>` (Required only when is customized) The location of your custom gene model file. This can be a local file path (e.g., /data/my_genes.txt) or a URL (e.g., http://example.com/genes.txt). Custom File Format When using the customized option, the file must adhere to the following format: No header row: The file should not contain a header. Four columns: Each row represents a single gene and must contain four space- or tab-separated columns in this order: Column 1: Chromosome ID (e.g., 1, chrX) Column 2: Transcription Start Position Column 3: Transcription End Position Column 4: Gene Symbol or ID (e.g., MYH9, ENSG00000100345) Example content: 1 339070 350389 ENSBTAG00000006648 1 475398 475516 ENSBTAG00000049697 Important 1. This option is mutually exclusive with –xqtl-file. The –xqtl-file option is intended for datasets where variants are already mapped to genes (e.g., eQTLs/pQTLs). Do not use both options in the same run. 2. The exact boundaries are set by `--upstream-distance` and `--downstream-distance`. The default values are 5000 bp for both boundaries.
`--xqtl-file`	Specifies the target genes associated with each variant, typically using eQTL summary statistics. The detailed description can be found here. Note that the `--gene-model-database` and `--xqtl-file` options are mutually exclusive; specifying the latter will automatically disable the former. The `--xqtl-file` is usually used for causal gene inference via a Mendelian Randomization method (e.g., EMIC).
`--max-condi-gene`	Specifies the maximal number of significant genes for the conditional gene-based test. The default value is 1000. The value -1 disables this setting.

Output¶

The numeric results of gene-based association tests are saved in GeneBasedAssociationTask/genes.hg38.assoc.txt. These are the columns in the file:

Header	Description
RegionID	Region IDs or Gene symbols
Chromosome	Chromosome of the region or gene
StartPosition	The position of the first SNP in the region or gene
EndPosition	The position of the last SNP in the region or gene
#Var	Number of variants within the region or gene
GATES.P	p-value of ECS
ECS.P	p-value of GATES
CCT.P	the combined p-value of ECS and GATES by the Cauchy Combination test

The Q-Q plots for p-values of gene-based association tests by GATES or ECS are saved in GeneBasedAssociationTask/genes.hg38.assoc.qq.pdf.

In addition, the conditional gene-based association test is then carried out for significant genes to remove significant genes, mostly due to LD with the most significant gene in a local region. The results are stored in GeneBasedConditionalAssociationTask\genes.hg38.condi.assoc.txt. These are the columns in the file:

Header	Description
…	…
Condi.ECS.P	The p-value of the conditional gene-based test by ECS

In the above analysis, variants are mapped to genes according to their physical positions in the gene models (--gene-model-database). However, remote regulatory variants may not be included depending on the position. One can use xQTL to link distant variants to genes by option –eqtl-file.

Heritability¶

Gene-based heritability by EHE¶

Heritability measures how well differences in people’s genes account for differences in their phenotypes. This EHE analysis estimates each gene’s heritability and performs gene-based association tests simultaneously (Miao et al. 2023).

Citations¶

Lin Miao, Lin Jiang, Bin Tang, Pak Chung Sham and Miaoxin Li. Dissecting the high-resolution genetic architecture of complex phenotypes by accurately estimating gene-based conditional heritability. The American Journal of Human Genetics (2023). 110(9):1534–1548. PubMed Link

Options¶

The tutorial command is:

java -Xmx4g -jar ../kggsum.jar \
   assoc \
   --calc-ehe \
   --ref-gty-file ./1kg_hg19_eur_chr1.vcf.gz \
                  refG=hg19 \
   --sum-file ./scz_gwas_eur_chr1.tsv.gz \
              cp12Cols=CHR,BP \
              pbsCols=P \
              sampleSizeCols=Nca,Nco \
              refG=hg19 \
              prevalence=0.01 \   
   --output ./t2

Flag	Description
`assoc`	Trigger the gene-based association analysis.
–calc-ehe	Ask the program to estimate heritability when calculating association. It has two sub-options. calcCondi： also calculate the conditional heritability. Note it may be time-consuming. So it can be turned off by setting a value no. The default is yes. “topGeneNum=0
`--sum-file`	Specifies the file path of a GWAS summary file. For quantitative traits, a single column specifying the sample sizes is required. For binary traits, two columns indicating the case and control sample sizes are necessary. Additionally, for a disease phenotype, the disease prevalence must be specified. The detailed description can be found here. Note: It is recommended to exclude the Human Leukocyte Antigen (HLA) region for two main reasons: 1) Computational Efficiency: To save computing time, as this region has highly complex linkage disequilibrium (LD) patterns. 2) Statistical Inference: To avoid the strong, potentially confounding signals from the immune system, which can dominate the analysis. This can be achieved by setting the option exclude=chr6:27477797~35448354 (for hg19, coordinates include 1 Mbp flanking extensions).
`--ref-gty-file`	Specifies the file path of a reference genotype file. The detailed description can be found here.
`--out`	Specifies the prefix of the output folder.

Output¶

The gene-based association p-values and heritability estimates are saved in GeneBasedAssociationTask\genes.hg38.assoc.txt. These are the columns in the file:

Header	Description
…	…
eH2	The estimated heritability of the region or gene by EHE.
eH2.SE	The standard error of the estimated heritability.

In addition, a conditional gene-based estimation is then carried out for significant genes to remove genes that have heritability merely due to LD with the most significant gene in a local region. The results are stored in GeneBasedConditionalAssociationTask\genes.hg38.condi.assoc.txt. These are the columns in the file:

Header	Description
…	…
Condi.eH2	The estimated conditional heritability of the region or gene by EHE.
Condi.eH2.SE	The standard error of the estimated conditional heritability.

CellTypes¶

DESE¶

DESE (Driver-tissue/cell Estimation by Selective Expression; Jiang et al.. 2019) estimates driver tissues by tissue-selective expression of phenotype-associated genes in GWAS. The assumption is that the tissue-selective expression of causal or susceptibility genes indicates the tissues where complex phenotypes develop primarily, which are called driver tissues. Therefore, a driver tissue is likely to be enriched with the selective expression of susceptibility genes of a phenotype.

DESE initially analyzed the association by mapping SNPs to genes according to their physical distance. We further demonstrated that grouping eQTLs of a gene or a transcript to perform the association analysis could be more powerful. We named the eQTL-guided DESE eDESE. KGGSum implements DESE and eDESE with an improved effective chi-squared statistic to control type I error rates and remove redundant associations (Li et al. 2022).

DESE performs phenotype-tissue association tests and conditional gene-based association tests at the same time. This analysis inputs p-values of a GWAS and expression profile of multiple tissues and outputs p-values of phenotype-tissue associations and conditional p-values of genes.

Citations¶

For phenotype-associated tissue estimation by DESE: Lin Jiang, Chao Xue, Sheng Dai, et al. DESE: estimating driver tissues by selective expression of genes associated with complex diseases or traits. Genome biology, 2019, 20(1):1-19. PubMed Link
For phenotype-associated tissues’ susceptibility genes and isoforms estimation: Xiangyi Li, Lin Jiang, Chao Xue, et al. A conditional gene-based association framework integrating isoform-level eQTL data reveals new susceptibility genes for schizophrenia. Elife. 2022 Apr 12;11:e70779. PubMed Link
For phenotype-associated cell-type estimation by DESE: Xue C#, Jiang L#, Zhou M, Long Q, Chen Y, Li X, Peng W, Yang Q, Li M. PCGA: a comprehensive web server for phenotype-cell-gene association analysis. Nucleic Acids Res. 2022 May 26;50(W1):W568-76.

Options¶

The tutorial command is:

java -Xmx4g -jar ../kggsum.jar \
   assoc \
   --ref-gty-file ./1kg_hg19_eur_chr1.vcf.gz \
                  refG=hg19 \
   --sum-file ./scz_gwas_eur_chr1.tsv.gz \
              cp12Cols=CHR,BP \
              pbsCols=P \
              refG=hg19 \
              exclude=chr6:27477797~35448354   \
   --gene-score-file ../resources/GTEx_v8_TMM_all.gene.meanSE.txt.gz \
   --output ./t3

Flag	Description
`assoc`	`--sum-file`, `--ref-gty-file`, and `--out` have the same functions as previously described.
`--gene-score-file`	Specifies a gene score file. The scores can represent various attributes, such as RNA expression, protein expression, epigenetic markers, or perturbation profiles at genes. See more at the Input Data Description
`--gene-p-cut`	Set p-value threshold to select significant genes for the conditional gene-based test. The default value is 0.05.
`--gene-multiple-testing`	Specifies the method for multiple testing correction with a given p-value threshold to select significant genes for the conditional gene-based test. It has three alternative method labels: bonf: Bonferroni correction with family-wise threshold specified by `--gene-p-cut` bhfdr: Filter by the false discovery rate (FDR) calculated by Benjamini-Hochberg procedure. The threshold is also defined by `--gene-p-cut` fixed: Filter by the p-value threshold specified by `--gene-p-cut` without any multiple testing correction. byfdr: Filter by the false discovery rate (FDR) calculated by Benjamini-Yekutieli procedure, which is more suitable for dependent tests. The threshold is also defined by `--gene-p-cut` The default value is bonf.
`--max-condi-gene`	Set the maximal number of significant genes for the conditional gene-based test. The default value is 1000. The value -1 disables this setting.
`--permutation-num`	Set the number of permutations to adjust the p-value for driver-tissue or -cell types inference due to selection bias and multiple testing. The default value is 100. A larger number will take more running time.

Output files¶

This function produces three sets of results: the gene-based association summary statistics saved in GeneBasedAssociationTask\genes.hg38.assoc.txt, the gene-based conditional association summary statistics saved in GeneBasedAssociationTask\genes.hg38.condi.assoc.txt, and the integrative enrichment summary statistics saved in GeneBasedConditionalAssociationTask\$scoreFileName.enrichment.txt. Basically, this is the result of the Wilcoxon rank-sum test, which tests whether the selective expression median of the phenotype-associated genes is significantly higher than that of the other genes in the interrogated tissue. The file contains four columns:

Header	Description
Condition	Name of the tissue being tested
Unadjusted(p)	Unadjusted p-values for the tissue-phenotype associations
Adjusted(p)	Adjusted p-values were calculated by adjusting both selection bias and multiple testing by permutation of gene scores within each condition.
Median(IQR)SigVsAll	Median (interquartile range) expression of the conditionally significant genes and all the background genes Heritability

Drugs¶

pDESE¶

Infer effective drugs for a GWAS disease with selective perturbation of gene expression profile by pDESE (named pDESE). The assumption is that effective drugs may treat disease by specifically disturbing the expression of disease-susceptible genes. A detailed explanation can be found in this paper. The options and input format are the same as those of the above analyses for associated cell types. The difference is just in what expression profiles are input. Instead of the gene expression profiles of various cell types or tissues, the perturbed gene expression profiles by various drugs are specified by --gene-score-file.

Citations¶

Li X, Xue C, Zhu Z, Yu X, Yang Q, Cui L, Li M. Application of GWAS summary data and drug-induced gene expression profiles of neural progenitor cells in psychiatric drug prioritization analysis. Mol Psychiatry. 2025 Jan;30(1):111-121.

Options¶

The tutorial command is:

java -Xmx4g -jar ../kggsum.jar \
   assoc \
   --ref-gty-file ./1kg_hg19_eur_chr1.vcf.gz \
                  refG=hg19 \
   --sum-file ./scz_gwas_eur_chr1.tsv.gz \
              cp12Cols=CHR,BP \
              pbsCols=P \
              refG=hg19 \
   --gene-score-file https://idc.biosino.org/pmglab/resource/kgg/kggsum/datasets/drugs/GEO_expression_profiles/hipsc_ctrl_with_se_drug_induced_foldchange.txt.gz \
   --threads 20 \
   --output ./t4

The options are identical to those for the associated Cell-type inference.

Output¶

The output files are identical to those of the CellTypes association analyses. The prioritized drugs are saved in GeneBasedConditionalAssociationTask\$scoreFileName.enrichment.txt, where the Wilcoxon rank-sum test generates enrichment scores. The permutation approach is used to obtain valid statistical p-values, taking into account multiple testing, selection bias, and internal correlation of gene perturbation scores.

Spatial Analysis¶

sDESE¶

Infer disease-associated cellular microenvironments by integrating spatial transcriptomics with GWAS summary data using sDESE. The core principle is that disease-susceptible genes are likely to be enriched in specific spatial domains or cellular niches. A detailed explanation of the core method and its application can be found in Genome Biology, 2019, 20(1):1-19 and Nature. 2025;641(8064):932-941.

The analysis is a two-step process. First, Gene Specificity Scores (GSS) are calculated from a spatial transcriptomics dataset (e.g., an H5AD file) using the gsmap tool, a required dependency. Second, the resulting GSS file is used as the expression profile input for the association analysis, specified by the –gene-score-file option.

Citations¶

For spatial mapping by DESE: Jiang L, Xue C, Dai S, et al. DESE: estimating driver tissues by selective expression of genes associated with complex diseases or traits. Genome Biology, 2019, 20(1): 222. PubMed Link
For gene specificity scores by gsMap: Song L, Chen W, Hou J, et al. Spatially resolved mapping of cells associated with human complex traits. Nature. 2025;641(8064):932-941.
For benchmarking of spatial mapping methods: Liu M, Xue C, Li M. (2025+). SMECT: benchmarking post-GWAS methods for spatial mapping of cells associated with human complex traits. In preparation.

Prerequisites¶

The first step of the sDESE analysis relies on the gsmap Python package. Please install it in a dedicated Conda environment.

conda create -n gsMap python>=3.10
conda activate gsMap
pip install gsmap

For a detailed description of gsmap and its parameters, please refer to the official gsmap documentation.

Options¶

The tutorial command demonstrates the two-step workflow.

This step processes the spatial transcriptomics data (.h5ad file) to derive a gene score file.

# 1.1: Learn latent representations from the spatial data
gsmap run_find_latent_representations \
    --workdir /path/to/your/workdir/MDD \
    --sample_name 'MDD_mmbrain' \
    --input_hdf5_path /path/to/your/Mouse_Brain.h5ad \
    --annotation 'annotation' \
    --data_layer 'count'

# 1.2: Convert latent representations to gene specificity scores
gsmap run_latent_to_gene \
    --workdir /path/to/your/workdir/MDD \
    --sample_name 'MDD_mmbrain' \
    --annotation 'annotation' \
    --latent_representation 'latent_GVAE' \
    --num_neighbour 51 \
    --num_neighbour_spatial 201 \
    --homolog_file /path/to/resource/mouse_human_homologs.txt

This process generates a gene score file, typically named [sample_name]_gsmap_gene_marker_score.feather, in your working directory.

This step uses the GSS file generated above to perform the association analysis with GWAS summary statistics.

Important: Reading .feather files requires a specific Java runtime setting on JDK 17 and later due to Java’s module system. You must set the JDK_JAVA_OPTIONS environment variable before executing the kggsum.jar command.

JDK_JAVA_OPTIONS=--add-opens=java.base/java.nio=ALL-UNNAMED \
java -Xmx48g -jar /path/to/kggsum.jar \
   assoc \
   --ref-gty-file /path/to/EUR.hg19.vcf.bgz \
                  refG=hg19 \
   --sum-file /path/to/MDD.gwas.sum.tsv.gz \
              cp12Cols=chr,bp,a1,a2 \
              pbsCols=p,beta,se \
              refG=hg19 \
   --gene-score-file /path/to/your/workdir/MDD/MDD_mmbrain_gsmap_gene_marker_score.feather \
                     calcSpecificity=N \
   --threads 30 \
   --output ./sdese_output

Note: The key parameter is –gene-score-file, which points to the output from gsmap in Step 1.

Output¶

The output files are structurally identical to those from the pDESE and CellTypes association analyses. Prioritized spatial domains associated with the trait are saved in GeneBasedConditionalAssociationTask$scoreFileName.enrichment.txt.

This file contains enrichment scores calculated by the Wilcoxon rank-sum test. A permutation approach is employed to obtain robust statistical p-values, corrected for multiple testing, selection bias, and the internal correlation of gene specificity scores.

PSC Web Server¶

In addition to the command-line workflow, we have pre-integrated large-scale single-cell and spatial transcriptomics data to generate high-quality gene expression profiles of spatially specific cell types. These profiles can be used to estimate disease-associated cell types and genes via our interactive PSC web server.

The PSC web server is built upon the KGGSum platform, offering a user-friendly interface for rapid analysis and visualization. It is accessible at: https://pmglab.top/psc.

Citation¶

Xue C, Liu M, Zhou M, Li M. PSC: a comprehensive web server for resolving spatial heterogeneity of cell types associated with complex phenotypes. 2024. Unpublished manuscript.

Output¶

All analysis results and visualizations are directly available on the PSC website.

Temporal Analysis¶

tDESE¶

Infer age-specific windows during which disease-associated genetic variants are likely to exert their effects using tDESE (named tDESE). tDESE integrates developmental gene expression with genome-wide association studies to identify critical developmental stages during which genetic variants linked to complex diseases may influence disease risk. The underlying assumption is that disease-associated variants are more likely to act during specific age periods. Detailed methodology can be found in this paper.

The options and input format are similar to those used for analyses of associated cell types. The key difference is in the input expression profiles: instead of gene expression profiles from various cell types or tissues, tDESE uses temporal gene expression profiles across developmental stages. Additionally, the parameter fittingMethod=GAM must be specified to indicate that these are age-specific expression profiles, allowing tDESE to fit expression levels across ages and generate smoothed, age-resolved profiles for downstream analysis.

Input Format for Temporal Expression Profiles¶

Temporal expression profiles follow the same format as regular expression profiles, with the following differences in column naming:

Each column name should start with the age (in years), followed by any covariates.
Covariates are specified as value@type, where type is l for linear (continuous) variables or f for categorical variables.
Multiple covariates are separated by semicolons (;).

Example column name: 12;male@f;7.8@l

This indicates that the column corresponds to age 12, with two covariates:

male@f: gender, categorical variable with value male
7.8@l: RIN value, continuous variable with value 7.8

This format allows tDESE to fit expression levels across ages while accounting for covariates, generating smoothed, age-resolved expression profiles for downstream analysis.

Citations¶

Chao Xue#, Xuegao Yu#, Lihang Ye, Ying Luo, Liubin Zhang, Bin Tang, Hongsheng Gui, Lin Jiang, Miaoxin Li*. Developmental expression of risk genes implicates the age of onset for neuropsychiatric disorders. bioRxiv. 2025

Prerequisites¶

tDESE requires Python to perform expression and age fitting, so you need to install the necessary Python environment and packages first. You can do this using conda:

conda create -n tdese_env python=3.10 -y
conda activate tdese_env
pip install numpy pandas joblib pygam tqdm scipy

Options¶

The tutorial command is:

java -Xmx4g -jar ../kggsum.jar \
   assoc \
   --ref-gty-file ./1kg_hg19_eur_chr1.vcf.gz \
                  refG=hg19 \
   --sum-file ./scz_gwas_eur_chr1.tsv.gz \
              cp12Cols=CHR,BP \
              pbsCols=P \
              refG=hg19 \
   --gene-score-file https://idc.biosino.org/pmglab/resource/kgg/kggsum/datasets/development/brain/PsyEn-LIBD_szControl.forKGGSUM.txt.gz \
                     fittingMethod=GAM \
   --threads 20 \
   --output ./tdese

The options are identical to those for the associated Cell-type inference.

Output¶

The output files are identical to those of the CellTypes association analyses. The prioritized ages are saved in GeneBasedConditionalAssociationTask\$scoreFileName.enrichment.txt, where the Wilcoxon rank-sum test generates enrichment scores. The permutation approach is used to obtain valid statistical p-values, accounting for multiple testing, selection bias, and the internal correlation of gene perturbation scores.

Causation

Causation inference¶

About¶

The causation module provides advanced Mendelian randomization methods for inferring causation from genes, lifestyle, or microbes (as exposures) to phenotypes (as outcomes). It enables rapid inference screening of tens of exposures and outcomes simultaneously.

Main Workflow of the Causation Module¶

Generation: Extract variant coordinates and frequencies from the VCF or GBC file to create a root variant set for further analysis.
Annotation: Annotate the root variant with gene features or xQTLs.
Append: Integrate GWAS variants and their summary statistics into the annotated root variant set.
LD Clumping: Select the significant variants of exposures as IVs and remove redundant IVs according to LD.
Gene Sticking: Link IV to the potential target gene according to LD. Note that this is a rough linking, and some target genes may not be true.
MR analysis: Infer the causality by suitable MR methods (e.g., EMIC or PCMR)
…: (Additional analysis as specified).

causal

Basic Usage¶

java -jar kggsum.jar causal --sum-file <input1> --ref-gty-file <input2> --output <output> [options]

Methods & Options

Genes¶

EMIC¶

EMIC (Effective-median-based Mendelian randomization framework for Inferring the Causal genes of complex phenotypes) inferences gene expressions’ causal effect on a complex phenotype with dependent expression quantitative loci by a robust median-based Mendelian randomization. The effective-median method solved the high false-positive issue in the existing MR methods due to either correlation among instrumental variables or noises in approximated linkage disequilibrium (LD). EMIC can further perform a pleiotropy fine-mapping analysis to remove possible false-positive estimates (Jiang et al. 2022).

Citations¶

[1] Lin Jiang, Lin Miao, Guorong Yi, Xiangyi Li, Chao Xue, Mulin Jun Li, Hailiang Huang and Miaoxin Li. Powerful and robust inference of causal genes of complex phenotypes with dependent expression quantitative loci by a novel median-based Mendelian randomization. Am J Hum Genet. 2022 May 5;109(5):838-856. PubMed

Options¶

The tutorial command is:

java -Xmx4g -jar ../kggsum.jar \
   causal \
   --ref-gty-file ./1kg_hg19_eur_chr1.vcf.gz \
              refG=hg19 \
   --emic-mr removePleiotropicEQTL=Y
   --xqtl-file https://idc.biosino.org/pmglab/resource/kgg/kggsum/datasets/gtex/eqtl/hg38/Brain_Frontal_Cortex_BA9_eur_v8_tmm_p01.gene.hg38.cov.eqtl.tsv.gz \
              refG=hg38 \
   --sum-file ./scz_gwas_eur_chr1.tsv.gz \
              cp12Cols=CHR,BP,A1,A2 \
              pbsCols=P,OR,SE \
              betaType=2 \
              prevalence=0.01 \
   --threads 18 \
   --output ./test/ba9_scz_causal

Flag	Description	Default
`causal`	Trigger the causality inference. The default analysis is EMIC to infer causal genes of phenotypes.	-
`--emic-mr`	Triggers the EMIC-MR (eQTL-mediated MR) analysis. This is a combination parameter with the following options: • `plotPCut`: Sets the P-value threshold for plotting significant results. Default is 1E-5. • `outputPCut`: Sets the P-value threshold for results to be included in the output file. Default is 1E-5. • `multiPCorrect`: Specifies the method for multiple testing correction. There are four candidate methods: fixed (no correction), bonf (Bonferroni correction), bhfdr (Benjamini and Hochberg FDR), and byfdr (Benjamini and Yekutieli FDR). Default is fixed. • `removePleiotropicEQTL`: A flag to indicate whether to remove potentially pleiotropic eQTLs before analysis. Accepts Y (Yes) or N (No). Default is N. Format: `--emic-mr plotPCut=[p-value] outputPCut=[p-value] multiPCorrect=[fixed] removePleiotropicEQTL=[N]` Example: `--emic-mr outputPCut=1E-8 multiPCorrect=bonf removePleiotropicEQTL=Y`
`--xqtl-file`	Specifies a file containing SNP effects on gene or transcript expression. The file should be a text table, where each row represents a single SNP, and columns are delimited by tabs or spaces. This is a combination parameter; further details can be found in the description of –xqtl-file Input Data Description	-
…	…	…

Output¶

The numeric results of EMIC are saved in GeneBasedCausationTask/genes.hg38.emic.tsv. There are nine columns in the file:

Header	Description
SymbolID	The gene symbol
ChromosomeID	Chromosome of a gene
Start	The coordinate of the first SNP.
End	The coordinate of the last SNP.
ExpressionID	The gene ID.
::IVNum	Number of IVs within the gene.
::Effect	The estimated causality effect.
::SE	The standard error of the estimated causality effect.
::P	p-value of EMIC for statistical causality test.

Phenotypes¶

PCMR¶

PCMR (Pleiotropic Clustering model for MR analysis) is a tool for analyzing GWAS summary statistics (provided via –sum-file) to infer causal relationships between phenotypes. It is designed to tackle correlated horizontal pleiotropy, a common challenge in Mendelian Randomization (MR) studies.

By extending the zero modal pleiotropy assumption (ZEMPA), PCMR improves causal inference even in the presence of a high proportion of pleiotropic variants. It tackles the difficulty of distinguishing between correlated pleiotropic effects and true causal effects by combining them into a single “correlated HVP effect,” modeled using a Gaussian Mixture Model. This allows PCMR to categorize instrumental variables (IVs) effectively, including identifying those with causal effects.

PCMR also includes a pleiotropy test to detect correlated horizontal pleiotropy and enhances causal inference in these scenarios. This makes it a powerful tool for evaluating the causal effects of gene expression on complex phenotypes. (Tang et al., 2024).

Citations¶

[1] Bin Tang, Nan Lin, Junhao Liang, Guorong Yi, Liubin Zhang, Wenjie Peng, Chao Xue, Hui Jiang, Miaoxin Li. Leveraging Pleiotropic Clustering to Address High Proportion Correlated Horizontal Pleiotropy in Mendelian Randomization Studies. Nat Commun. 2025 Mar 21;16(1):2817 PubMed

Options¶

This main analysis inputs GWAS summary of SNPs and outputs p-values of genes. The following are options for an example:

java -Xmx4g -jar ../kggsum.jar \
  causal \
   --pcmr 1T2,2T1 \
   --ref-gty-file ./1kg_hg19_eur_chr1.vcf.gz \
   --sum-file ./smoking_chr1.tsv.gz \
              cp12Cols=CHR,POS,A1,A2 \
              pbsCols=Pval,Beta,SE \
              prevalence=0.05   \
              betaType=1   \
   --sum-file ./scz_gwas_eur_chr1.tsv.gz \
              cp12Cols=CHR,BP,A1,A2 \
              pbsCols=P,OR,SE \
              betaType=2 \
              prevalence=0.01 \   
   --threads 10 \
   --output ./test/smk_scz_pcmr \
   --exclude-complementary-allele

Format	Description	Default
`--vbmr` or `--variant-based-mr`	Triggers the variant-based MR analysis. This is a combination parameter with the following options: • `causalPair`: Defines the direction of causal inference, where traits (indicated by their order number) are specified in the `--sum-file`. For example, a value of 1T2 indicates an inference of causation from the phenotype(s) specified by the first `--sum-file` to the phenotype(s) listed in the second`--sum-file`. • `effIVPCut`: Sets the p-value threshold for selecting instrumental variables. • `effIVPCorrect`: Set a method for multiple testing of p-values for selecting instrumental variables. There are three candidate methods: fixed (no correction), bonf (Bonferroni correction), and bhdfr (Benjamini and Hochberg FDR). • `ldPruneCut`: Sets the r² threshold for LD clumping. • `initIVPCut`: Sets the p-value threshold for selecting instrumental variables to model uncorrected pleiotropic effects. • `ldStickCut`: Sets the LD r² threshold for clustering genes whose SNPs are in LD with an instrumental variable. Format: `--pcmr causalPair= effIVPCut=[p-value] effIVPCorrect=[fixed] ldPruneCut=[r²] initIVPCut=[p-value] ldStickCut= [r²]` Example: `--pcmr 1T2,2T1 5E-8 fixed 0.1 0.5 0.8`	causalPair=1T2,2T1 effIVPCut=5E-8 effIVPCorrect=fixed ldPruneCut=0.1 initIVPCut=0.5 ldStickCut=0.8
`--exclude-complementary-allele`	If specified, variants with complementary alleles (e.g., A/T and C/G) are excluded from the analysis.	-
	The description of other options is the same as that for association analyses.

Tip: Accelerating Batch MR Analysis

To significantly accelerate your MR analysis when exploring numerous exposures and outcomes, we recommend organizing your data for batch processing. Place all exposure GWAS files into one dedicated folder and all outcome GWAS files into another. Crucially, ensure all files within a given folder have identical column names. This standardized format allows you to specify the entire set of files with a single --sum-file option. Furthermore, this folder-based approach enables KGGSum to optimize the LD pruning process for selecting independent IVs, substantially reducing overall computation time.

Output¶

At the end of the PCMR analysis, the results are summarized and stored in a file named MendelianRandomization.summary.tsv. Meanwhile, the main causal inference results are detailed on the screen. Here is a case example:

2024-11-19 21:15:04 Clustering (2 categories) phi: [0.29485845139124117,0.3508103656075827] 
2024-11-19 21:15:04 Heterogeneity test by P_plei-test for correlated horizontal pleiotropy: 0.86885
2024-11-19 21:16:17 Correlated horizontal pleiotropy may be absent (P_plei-tes >= 0.20), and the estimate causal effect is: 

- By the one-category model of PCMR: 
  - The causal effect(SE): 0.323(0.0523); OR: 1.38(1.25-1.53) 
   - PCMR's causality evaluation p-value: 6.56e-10

- By Inverse-Variance Weighted MedianMR: 
  - The causal effect(SE): 0.321(0.0634); OR: 1.38(1.22-1.56) 
  - Median-based causality evaluation p-value: 4.03e-07

According to PCMR’s heterogeneity test, it indicates the failure to reject the null hypothesis of no correlated pleiotropy ($P_{plei-tes} >= 0.20$). The causality may be more appropriately inferred by the one-category model of PCMR and conventional inverse-variance weighted median MR.

Below is an explanation of the output columns of MendelianRandomization.summary.tsv, specifically from the PCMR (Pleiotropic Clustering model for Mendelian Randomization) analysis, as well as other conventional Mendelian Randomization (MR) methods provided for comparison.

General Columns¶

These columns provide the basic setup of the analysis:

Exposure: The variable tested as the potential cause. This could be a genetic variant, a gene expression profile, a lifestyle factor, or a microbial abundance, depending on the input data specified via the –sum-file in KGGSum. For example, it might represent smoking behavior in a phenotype-to-phenotype analysis or a gene’s expression level in a gene-to-phenotype analysis.
Outcome: The variable tested as the effect, typically a phenotype or disease of interest (e.g., schizophrenia). This is specified in the second –sum-file when using the –pcmr option with a causal direction like 1T2 (first file to second file).
IVNum: The number of Instrumental Variables (IVs) used in the analysis. IVs are genetic variants significantly associated with the exposure (based on a p-value threshold) but assumed to affect the outcome only through the exposure. This is determined by the effIVPCut parameter in the –pcmr option (default: 5E-8).
PCut: The p-value cutoff used to select IVs from the exposure GWAS summary statistics. This threshold filters variants with a significant association with the exposure (e.g., effIVPCut=5E-8 in the PCMR options).
FDRCut: The False Discovery Rate cutoff, an alternative method to control for multiple testing when selecting IVs. This may be applied if effIVPCorrect=bhfdr is specified in the –pcmr option, adjusting the p-value threshold using the Benjamini-Hochberg procedure.

Two-Category PCMR Model Results¶

The Two-Category PCMR model divides IVs into two groups to account for correlated horizontal pleiotropy (where IVs affect the outcome through pathways other than the exposure). These columns report the results:

TwoCategoryPCMR_CorrelatedHorizontalPleiotropy: A binary indicator (e.g., Absent/Present) showing whether correlated horizontal pleiotropy is detected. This is based on a statistical test (see P_Plei below). If “Present,” the two-category model is preferred for causal inference.
TwoCategoryPCMR_P_Plei: The p-value from the pleiotropy test. A small p-value (e.g., < 0.2) suggests significant correlated horizontal pleiotropy, meaning some IVs influence the outcome independently of the exposure.
TwoCategoryPCMR_P_CausalEval: The p-value for assessing the causal effect in the two-category PCMR model. A small p-value (e.g., < 0.05) suggests that the group of instrumental variables (IVs) deriving the dominant causal effect can be confidently identified, supporting a significant causal link between the exposure and the outcome. Conversely, an insignificant p-value does not necessarily imply the absence of a causal relationship; rather, it may indicate that the two IV groups—one tied to the true causal effect and the other to pleiotropic effects—are too similar in size to determine which reflects the causal pathway. For further details, see the PCMR paper.
TwoCategoryPCMR_C1_Effect(SE): The estimated causal effect (beta coefficient) and its standard error (SE) for the first category of IVs. This represents the effect size of the exposure on the outcome for this group.
TwoCategoryPCMR_C1_OR(95CI): The odds ratio (OR) and its 95% confidence interval (CI) for the first category. For binary outcomes, this indicates the change in odds of the outcome per unit change in the exposure (e.g., 1.38 means a 38% increase).
TwoCategoryPCMR_C2_Effect(SE): The estimated causal effect and standard error for the second category of IVs.
TwoCategoryPCMR_C2_OR(95CI): The odds ratio and 95% CI for the second category.

One-Category PCMR Model Results¶

If no significant pleiotropy is detected (i.e., P_Plei ≥ 0.2), the One-Category PCMR model is used, assuming all IVs reflect a single causal pathway:

OneCategoryPCMR_P: The p-value for the causal effect in the one-category model. A small value suggests a significant causal relationship.
OneCategoryPCMR_Beta(SE): The estimated causal effect (beta) and its standard error.
OneCategoryPCMR_OR(95CI): The odds ratio and 95% CI for the causal effect.

Other MR Methods¶

KGGSum also provides results from standard MR methods for comparison, each with its own assumptions and robustness to pleiotropy:

Inverse-Variance Weighted (IVW) MR¶

IVW_MR_P: P-value for the causal effect using the IVW method, which combines IV effects weighted by their precision.
IVW_MR_Beta(SE): Estimated causal effect and standard error.
IVW_MR_OR(95CI): Odds ratio and 95% CI.

Egger MR¶

EGGER_MR_P: P-value for the causal effect, robust to pleiotropy but less powerful than IVW.
EGGER_MR_Beta(SE): Estimated causal effect and standard error.
EGGER_MR_OR(95CI): Odds ratio and 95% CI.
EGGER_MR_Intercept(SE): The intercept from Egger regression, indicating pleiotropy if significantly different from zero.
EGGER_MR_InterceptP: P-value for the intercept; a small value suggests pleiotropy.

Median MR¶

Median_MR_P: P-value using the median-based method, robust to outliers.
Median_MR_Beta(SE): Estimated causal effect and standard error.
Median_MR_OR(95CI): Odds ratio and 95% CI.

Mode-Based Estimate (MBE) MR¶

MBE_MR_P: P-value for the mode-based method, focusing on the most common effect size.
MBE_MR_Beta(SE): Estimated causal effect and standard error.
MBE_MR_OR(95CI): Odds ratio and 95% CI.

Robust IVW (RIVW) MR¶

RIVW_MR_P: P-value for a rerandomized robust IVW method, MR.Rerand, adjusting for the winner’s curse.
RIVW_MR_Beta(SE): Estimated causal effect and standard error.
RIVW_MR_OR(95CI): Odds ratio and 95% CI.

JCWC MR¶

JCWC_MR_P: P-value for a new and unpublished MR method (adjust for winner’s curse and false positive IVs).
JCWC_MR_Beta(SE): Estimated causal effect and standard error.
JCWC_MR_OR(95CI): Odds ratio and 95% CI.

Interpretation guideline¶

Check for Pleiotropy: Start with TwoCategoryPCMR_P_Plei. If it’s ≥ 0.2, pleiotropy is not significant, and the OneCategoryPCMR results are more appropriate. If < 0.2, use the TwoCategoryPCMR results.
Causal Effect Significance: Look at the p-values (P_CausalEval for two-category, OneCategoryPCMR_P for one-category, or conventional MR p-values). A value < 0.05 suggests a significant causal effect.
Effect Size and Direction: The Beta (effect size) and OR (odds ratio) indicate the strength and direction of the causal relationship. Positive beta/OR > 1 means the exposure increases the outcome; negative beta/OR < 1 means that it decreases it.
Consistency Across Methods: Compare PCMR results with conventional MR methods (IVW, Egger, etc.). Consistent estimates across methods strengthen confidence in the causal inference.
Example Interpretation: If TwoCategoryPCMR_P_Plei = 0.20 (no pleiotropy), OneCategoryPCMR_P = 6.56e-10, and OneCategoryPCMR_OR = 1.38 (1.25-1.53), it suggests a significant causal effect where a unit increase in the exposure raises the odds of the outcome by 38%, with no evidence of pleiotropy complicating the analysis.

In addition, the summary statistics of IVs for MR are saved in the file named variants.hg38.tsv.gz under the subdirectory of PCMRTask. There are thirteen columns in the file:

Header	Description
CHROM	Chromosome of the gene
POS	The coordinate of the IV with the lowest GWAS p-value
REF	The reference sequence base
ALT	The alternative sequence base
MarkFeatureGene	The Gene annotated with the SNP
MarkGeneFeature	The feature of the gene annotated with the SNP
[exposure]::P	The P value of this SNP on exposure
[exposure]::Beta	The effect of this SNP on exposure
[exposure]::SE	The effect’s standard error of this SNP on exposure
[outcome]::P	The P value of this SNP on the outcome
[outcome]::Beta	The effect of this SNP on the outcome
[outcome]::SE	The effect’s standard error of this SNP on the outcome
Class	The category given by PCMR, may be 1, 2, 3, etc.

A graphical result file is also presented as IVScatterPlots.pdf in the same directory. The horizontal axis of the graph represents the effect of SNPs on the exposure variable, while the vertical axis represents the effect of SNPs on the outcome variable. Each point signifies an SNP selected by PCMR, along with the confidence interval of its corresponding effect size. Different colors are used to distinguish between different types of points identified by PCMR. The slope of the diagonal line with the same color represents the effect of the exposure on the outcome by the corresponding category of SNPs.

There are also some intermediate output files in the directory Actinobacteria_AN_pcmr. A brief introduction is provided in the following table.

File	Description
ConvertVCF2GTBTask\EUR.hg19.gtb	the gtb format file of input VCF file
GenerateRootVariantSetTask\variants.annot.hg38.gtb	An hg38 file in gtb format with genotype data removed and annotation information added
IncorporateVariants2RootVariantSetTask\{exposure_file}.gtb	the gtb format file of the exposure sum file. By default, all coordinates are converted to hg38.
IncorporateVariants2RootVariantSetTask\{outcome_file}.gtb	the gtb format file of the outcome sum file. By default, all coordinates are converted to hg38.
IncorporateVariants2RootVariantSetTask\variants.annot.hg38.gtb	The gtb format file of the SNPS selected by P-value in the exposure and outcome files
GeneFeatureAnnotationTask\variants.annot.hg38.gtb	variants.annot.hg38.gtb in IncorporateVariants2RootVariantSetTask with annotation
LDGeneStickingTask\variants.annot.hg38.gtb	The loci annotated to the nearest gene region
LDPruningTask\variants.annot.hg38.5.0E-8.gtb variants.annot.hg38.0.5.gtb	The files containing the retained SNPs after LD clumping.

Microbes¶

Infer causality from microbes to phenotypes by multiple MR methods¶

We provide GWAS summary statistics of microbes to enable users to identify causal microbes associated with phenotypes using Mendelian Randomization (MR) methods. The advanced MR method, PCMR, which is more robust to correlated horizontal pleiotropy (see details above), is applied as the primary analysis. Simultaneously, the presence of correlated horizontal pleiotropy is assessed. If no significant correlated horizontal pleiotropy is detected, a conventional IVW-based MR method is subsequently performed to evaluate the significance of the estimated causal effects.

KGGSum is uniquely suited for exploring causal relationships between the microbiome and complex traits, offering two primary advantages:

1) A Comprehensive Built-in Database: KGGSum comes pre-packaged with GWAS summary statistics for over 500 microbes, compiled from three large-scale cohorts. This curated resource eliminates the need for extensive data sourcing. (See the resource page for details.) 2) High-Throughput Analysis: The software is optimized for efficiency, enabling the rapid and simultaneous analysis of hundreds of microbes against multiple complex phenotypes in a single run.

Citations¶

[1] Bin Tang, Nan Lin, Junhao Liang, Guorong Yi, Liubin Zhang, Wenjie Peng, Chao Xue, Hui Jiang, Miaoxin Li. Leveraging Pleiotropic Clustering to Address High Proportion Correlated Horizontal Pleiotropy in Mendelian Randomization Studies. Nat Commun. 2025 Mar 21;16(1):2817 PubMed

Options¶

This main analysis inputs a GWAS summary of SNPs and outputs p-values of genes. The following are options for an example:

java -Xmx4g -jar ../kggsum.jar \
  causal \
   --pcmr 1T2 \
          effIVPCut=1E-3 \
   --ref-gty-file ./1kg_hg19_eur_chr1.vcf.gz \
   --sum-file ./microbiome/mibiogen/DMP.gwas.hg38.gtb \
              pbsCols=P,Beta,SE \
              refG=hg38 \
              betaType=0 \
   --sum-file ./*.tsv.gz \
              cp12Cols=CHR,BP,A1,A2 \
              pbsCols=P,OR,SE \
              betaType=2 \
              prevalence=0.01 \   
   --threads 10 \
   --output ./test/microb_scz_casual \
   --exclude-complementary-allele

Format	Description	Default
`--pcmr`	Triggers the PCMR analysis. This is a parameter set with multiple sub-options, as described above. Example: `--pcmr 1T2,2T1 5E-5 fixed 0.1 0.5 0.8`	causalPair=1T2,2T1 effIVPCut=5E-8 effIVPCorrect=fixed ldPruneCut=0.1 initIVPCut=0.5 ldStickCut=0.8
..	…	-

Output¶

The output results of the analyses are also the same as those of the phenotype causality. The causality inference results are summarized and stored in the MendelianRandomization.summary.tsv file.

Annotation

Variant Annotation & Filtration¶

About¶

The annotate module enables rapid annotation of millions of variants with genomic features using one or multiple different databases, leveraging the full or partial fields of these databases. Additionally, it offers a variety of filtering functions based on annotation results, such as gene feature filtering, population frequency, conservation, and epigenetic modification, to assist in interpreting and deciphering the significant association signals.

Workflow of the Annotate Module¶

Generation: Extract variant coordinates and frequencies from the VCF or GBC file to create a root variant set for further analysis.
Gene Annotation: Annotate the root variant with gene features or xQTLs.
Append: Integrate GWAS variants and their summary statistics into the annotated root variant set.
Variant Annotation： Annotate variants with databases (e.g., CADD - Combined Annotation Dependent Depletion, and gnomAD) stored in GTB format to gain comprehensive insights into your genetic variations. KGGSum allows for rapid, one-stop annotation of hundreds of fields from one or multiple databases.

Clump and Prune: Clump and prune GWAS variants using p-values from summary statistics and LD calculated from a reference population. annotate

Basic Usage¶

java -jar kggsum.jar annot --sum-file <input1> --ref-gty-file <input2> --output <output> [options]

Methods & Options

Variant Annotation and Filtration with External Databases enhance the analysis of genetic variations by leveraging external databases to provide additional information about variants. This process aids in filtering and prioritizing variants based on various annotations.

Gene¶

Gene feature Annotation & Filtration¶

Gene feature annotation is used to identify which gene or transcript is affected by a variant and what functional role it has on known genes. We now support two gene definition systems: RefSeq genes (refgene) and GENCODE genes (gencode).

Options¶

The tutorial command is:

java -Xmx8g -jar ../kggsum.jar \
   annot \
   --ref-gty-file ./1kg_hg19_eur_chr1.vcf.gz \
              refG=hg19 \
   --sum-file ./scz_gwas_eur_chr1.tsv.gz \
              cp12Cols=CHR,BP,A1,A2 \
              pbsCols=P,OR,SE \
              betaType=2 \
              prevalence=0.01 \
   --gene-model-database refgene \
   --gene-model-database gencode \
   --threads 18 \
   --output ./ta1

Flag	Description	Default
`annot`	Trigger the annotation procedure.	-
`--gene-model-database`	Specifies the gene model database for annotating variants to genes. Syntax Specifies the gene model database for annotating variants to genes. Syntax `--gene-model-database name [path=<path>]` Arguments `name` (Required) The name of the database. Accepted values are: 1. refgene OR gencode: Use a built-in database. KGGSum automatically loads the data from its internal resources directory. 2. customized: Use a user-provided gene model file. This requires the path argument. `path=<path>` (Required only when is customized) The location of your custom gene model file. This can be a local file path (e.g., /data/my_genes.txt) or a URL (e.g., http://example.com/genes.txt). Custom File Format When using the customized option, the file must adhere to the following format: No header row: The file should not contain a header. Four columns: Each row represents a single gene and must contain four space- or tab-separated columns in this order: Column 1: Chromosome ID (e.g., 1, chrX) Column 2: Transcription Start Position Column 3: Transcription End Position Column 4: Gene Symbol or ID (e.g., MYH9, ENSG00000100345) Example content: 1 339070 350389 ENSBTAG00000006648 1 475398 475516 ENSBTAG00000049697 Important This option is mutually exclusive with –xqtl-file. The –xqtl-file option is intended for datasets where variants are already mapped to genes (e.g., eQTLs/pQTLs). Do not use both options in the same run.	-
`--upstream-distance`	Set the region length (bp) of upstream from the transcription start site. Format: `--upstream-distance <int>` Example: `--upstream-distance 1000` Valid setting: [int] >=1	1000
`--downstream-distance`	Set the region length (bp) of the downstream from the transcription start site. Format: `--downstream-distance <int>` Example: `--downstream-distance 1000` Valid setting: [int] >=1	1000
`--disable-gene-feature`	Disable gene feature annotation. Format: `--disable-gene-feature`
`--gene-feature-in`	Retain variants with specified annotated gene features . Format: `--gene-feature-in <int~int>,<int>,...` Example:`--gene-feature-in 0~6,9,10` Valid setting: [int] 0 ~ 17	0 ~ 17
…	…	…

KGGA has 18 number codes for the gene features after annotation.

Feature	Code	Explanation
Frameshift	0	Short insertion or deletion results in a completely different translation from the original.
Nonframeshift	1	Short insertion or deletion results in loss of amino acids in the translated proteins.
Start-loss	2	Indels or nucleotide substitution results in the loss of the start codon (ATG) (mutated into a non-start codon).
Stop-loss	3	Indels or nucleotide substitution results in the loss of stop codons (TAG, TAA, TGA).
Stop-gain	4	Indels or nucleotide substitutions result in the new stop codons (TAG, TAA, TGA), which may truncate the protein.
Splicing	5	Variant is within 3-bp of a splicing junction (use `--splicing-distance x` to change this; the unit of x is base-pairs).
Missense	6	Nucleotide substitution results in a codon coding for a different amino acid.
Synonymous	7	Nucleotide substitution does not change amino acids.
Exonic	8	Due to the loss of sequences in the reference database, this variant can only be mapped into the exonic region without more precise annotation.
UTR5	9	Within a 5’ untranslated region.
UTR3	10	Within a 3’ untranslated region.
Intronic	11	Within an intron.
Upstream	12	Within 1-kb region upstream of the transcription start site (use `--upstream-distance x` to change this, the unit of x is base-pair).
Downstream	13	Within 1-kb region downstream of the transcription end site (use `–downstream-distance x to change this; the unit of x is base-pairs).
ncRNA	14	Within a transcript without protein-coding annotation in the gene definition.
Intergenic	15	Variant is in the intergenic region.
Monomorphic	16	It is not a sequence variation, which may result from bugs in the reference genome in variant calling.
Unknown	17	Variants have no annotation.

Output¶

The gene feature annotation results are saved in OutputVariants2TSVTask/variants.hg38.tsv.gz. There are two relevant columns in the file:

Header	Description
…	…
MarkFeatureGene	The gene where a variant is located. When a variant is mapped onto multiple genes, the gene that leads to the smallest code is called the marker gene.
MarkGeneFeature	The coordinate of the first SNP.
…	..

Clump¶

Prune and Clump Variants Based on P-Values and Linkage Disequilibrium (LD)¶

In KGGSum, the “Clump” feature enables users to prune and clump Genome-Wide Association Study (GWAS) variants using p-values from summary statistics and linkage disequilibrium (LD) calculated from a reference population. This process produces a refined set of independent, statistically significant, and harmonized variants, ready for downstream applications such as Mendelian Randomization (MR) analysis with third-party tools or platforms.

LD Clumping Methodology¶

The LD clumping method filters out variants within a user-defined physical distance (set by windowKb) if their LD exceeds a specified threshold (controlled by r2Cut). Variant removal follows a prioritized, three-step approach to retain the most informative variants:

Variants with Larger P-Values Variants with less significant p-values (i.e., larger values) are prioritized for removal. The significance threshold is determined by the family-wise error rate (set by pCut) and adjusted using multiple testing correction methods: “fixed” (no correction), “bonf” (Bonferroni), “bhfdr” (Benjamini-Hochberg FDR), “byfdr” (Benjamini-Yekutieli FDR), or “localfdr” (local FDR).
When multiple GWAS datasets are provided, the p-value to use is specified by pIndex; otherwise, the p-value from the first dataset is used by default.
Among variants with p-values below the cutoff, those with the smallest sum of -log(p) across datasets or missing p-values in any dataset are prioritized for removal.
Variants with More LD Connections Variants in LD with a greater number of other variants within the windowKb distance are prioritized for removal to reduce redundancy.
For example, consider three variants: A, B, and C, with LD r² values of 0.8 (A-B), 0.8 (B-C), and 0.64 (A-C). Variant B, which is in LD with both A and C, is removed, while A and C are retained. This preserves variants with fewer LD dependencies.
Variants with Lower Minor Allele Frequency (MAF) If the first two criteria do not differentiate between variants (e.g., equal LD connections and similar p-values), the variant with the lower MAF is removed. Retaining variants with higher MAF ensures greater informativeness, as they are typically more robust for statistical analysis.

This hierarchical approach maximizes the retention of variants that are:

Less redundant (fewer LD connections),
More significant (smaller p-values),
More informative (higher MAF).

Options¶

The tutorial command is:

java -Xmx8g -jar ../kggsum.jar \
   annot \
   --ld-clump r2Cut=0.1 \
              windowKb=10000 \
              pIndex=1 \
              pCut=0.05 \              
              pCorrect=bhfdr \
   --ref-gty-file ./1kg_hg19_eur_chr1.vcf.gz \
              refG=hg19 \
   --sum-file ./smoking_chr1.tsv.gz \
              cp12Cols=CHR,POS,A1,A2 \
              pbsCols=Pval,Beta,SE \
              prevalence=0.05   \
              betaType=1   \
   --sum-file ./scz_gwas_eur_chr1.tsv.gz \
              cp12Cols=CHR,BP,A1,A2 \
              pbsCols=P,OR,SE \
              betaType=2 \
              prevalence=0.01 \     
   --threads 18 \
   --output ./ta2

This command clumps variants from two GWAS datasets (smoking and schizophrenia), prioritizing the smoking dataset’s p-values. It uses a 10 Mb window and an LD threshold of 0.1, retaining independent, significant variants after FDR correction.

Annotation option	Description	Default
`--ld-clump`	`r2Cut`: LD threshold (e.g., 0.1). Controls the degree of independence among retained variants. A lower r2Cut yields a smaller, more independent set. `windowKb`: Physical distance window in kilobases (e.g., 10000). Often set between 500 kb and 10000 kb (10 Mb), depending on study needs. Larger windows capture long-range LD but increase computational load, while smaller windows focus on local LD patterns. `pIndex`: Index of the p-value column to use when multiple GWAS datasets are provided (e.g., 1). Ensures the correct dataset drives significance assessment in multi-dataset analyses. `pCut`: Significance cutoff (e.g., 0.05). Determines the pool of variants considered, with stricter thresholds retaining only highly significant ones. `pCorrect`: Multiple testing correction method (e.g., “bhfdr”). Format: `--ld-clump r2Cut=<value> windowKb=<value> pIndex=<index> pCut=<value> pCorrect=<method>` Example: `--ld-clump r2Cut=0.1 windowKb=10000 pIndex=1 pCut=0.05 pCorrect=bhfdr`	[OFF]

Function¶

Functional score annotation at variants¶

In KGGSum, the GWAS variants can also be annotated with multiple genomic features. Three databases are available: gnomAD for allele frequency annotation, CADD for variant function annotation, and ClinVar for disease linkage annotation. Note that the annotation datasets should be downloaded from an independent resource domain of KGGA.

Database Name	Short Description	Tag
CADD	Combined Annotation Dependent Depletion (CADD) is a widely used matrix for mutation deleteriousness and integrates more than 100 annotations for all possible single-nucleotide variants (SNVs) of the GRCh38/hg38 human reference genome.	cadd
Favor	Functional Annotation of Variants - Online Resource (FAVOR) provides comprehensive multi-faceted variant functional annotations that summarize findings of all possible nine billion SNVs across the genome (build GRCh38).	favor
ClinVar	ClinVar is a public database managed by the National Center for Biotechnology Information (NCBI) that provides information about the relationship between genetic variation and human health.	clinvar

Options¶

The tutorial command is:

java -Xmx8g -jar ../kggsum.jar \
   annot \
   --ref-gty-file ./1kg_hg19_eur_chr1.vcf.gz \
              refG=hg19 \
   --sum-file ./scz_gwas_eur_chr1.tsv.gz \
              cp12Cols=CHR,BP,A1,A2 \
              pbsCols=P,OR,SE \
              betaType=2 \
              prevalence=0.01 \
   --variant-annotation-database cadd \
              field=Epigenetics::EncodeDNase-max,Epigenetics::EncodeDNase-sum,ProteinFunction::CADD_PHRED \
   --threads 18 \
   --output ./ta2

Annotation option	Description	Default
`--variant-annotation-database`	Set the reference databases used for annotation at variants. `--variant-annotation-database` is a combination of parameters, and the usage rules are the same as those of `--freq-database`. Format: `--variant-annotation-database <name> path=<path> field=[field1,field2,...]` Example: `--variant-annotation-database cadd field=ProteinFunction::CADD_PHRED`	[OFF]

Additional epigenetic resources from third-party databases¶

To facilitate the convenient use of more resources, KGGSum provides an approach that allows users to specify customized third-party resources for annotation. For example, by setting the file name documented in EpiMap, KGGSum can directly download epigenetic marker resources from the EpiGenome public domain, specifically from the EpiMap Repository.

Database Name	Short Description
EpiMap	EpiMap is one of the most comprehensive maps of the human epigenome, providing approximately 15,000 datasets across 833 bio-samples and 18 epigenomic marks. It delivers rich gene-regulatory annotations encompassing chromatin states, high-resolution enhancers, activity patterns, enhancer modules, upstream regulators, and downstream target genes.

Annotation Option	Description	Default
–region-annotation-database	Specifies the interval/regional databases for annotation. Format: –region-annotation-database subID=[] marker=[] path= field=[field1,field2,…] Example: –region-annotation-database EpiMap subID=BSS00001 marker=H3K4me3 - name: Database name (e.g., EpiMap). - subID: Subject ID from EpiMap. - marker: Epigenetic marker (e.g., H3K4me3). - path: Path to the database file (optional for EpiMap). - field: Specific fields to include. Note: For name=EpiMap, KGGSum automatically downloads data from the EpiMap Repository if not locally available.	NA

RSID¶

Assign DBSNP’s rsIDs to variants¶

In KGGSum, the GWAS variants can also be annotated with DBSNP’s rsIDs. Note that the annotation dataset needs to be downloaded manually.

Database Name	Short Description	Tag
dbsnp	The variants’ coordinates and rs IDs from NCBI’s dbSNP. RSID coordinates can be accessed at NCBI FTP. We provide a GTB version of this dataset, converted from dbSNP’s VCF (B157).	dbsnp

Options¶

The tutorial command is:

java -Xmx8g -jar ../kggsum.jar \
   annot \
   --ref-gty-file ./1kg_hg19_eur_chr1.vcf.gz \
              refG=hg19 \
   --sum-file ./scz_gwas_eur_chr1.tsv.gz \
              cp12Cols=CHR,BP,A1,A2 \
              pbsCols=P,OR,SE \
              betaType=2 \
              prevalence=0.01 \
   --variant-annotation-database snp \
   --threads 18 \
   --output ./ta2

Frequency¶

Allele Frequency Annotation & Filtration¶

Allele frequency annotation allows users to incorporate population-level allele frequency information into the analysis of genetic variants. Click to view the provided allele frequency annotation databases.

Database Name	Short Description	Tag
gnomAD	Allele frequency data in the Genome Aggregation Database (gnomAD) v4 dataset (GRCh38) is derived from 730,947 exomes and 76,215 genomes from unrelated individuals of diverse ancestries.	gnomad

Options¶

The tutorial command is:

java -Xmx4g -jar ../kggsum.jar \
   annot \
   --ref-gty-file ./1kg_hg19_eur_chr1.vcf.gz \
              refG=hg19 \
   --sum-file ./scz_gwas_eur_chr1.tsv.gz \
              cp12Cols=CHR,BP,A1,A2 \
              pbsCols=P,OR,SE \
              betaType=2 \
              prevalence=0.01 \       
     --freq-database gnomad \
              field=gnomAD_joint::ALL,gnomAD_joint::NFE \
   --threads 18 \
   --output ./ta3

Annotation option	Description	Default
`--freq-database`	Set the reference databases used for allele frequency annotation. `--freq-database` is a combination of parameters. The `<name>` is identified as the database name (such as gnomad). The `<path>`identifies the path to the database, which can be a local file path or an FTP/HTTP file path. If the path of the resources folder has been specified, the `<path>` parameter can be bypassed. The `<field>` is identified as the specified field filtered under this database. If no value is set, all fields of the specified database are selected by default. Format: `--freq-database <name> path=<path> field=[field1,field2,...]` Example: `--freq-database gnomad field=gnomAD::EAS`	[OFF]

Once the reference databases for allele frequency annotation have been properly configured, you can effectively filter variants by examining their allele frequencies within the reference population.

Filtration option	Description	Default
`--db-af`	Exclude variants with alternative allele frequency (AF) outside the range [`min`, `max`] in allele frequency databases. Format: `--db-af <min>~<max>` Example:`--db-af 0.05~1.0` Valid setting: [float] 0.0 ~ 1.0	[OFF]
`--db-maf`	Exclude variants with minor allele frequency (MAF) outside the range [`min`, `max`] in allele frequency databases. Format: `--db-maf <min>~<max>` Example:`--db-maf 0.05~0.5` Valid setting: [float] 0.0 ~ 0.5	[OFF]

Again, following workflow-based programming, programmers can easily call the API function to annotate sequence variants. The input variants should be stored in the GTB format. The following are codes to annotate variant gene features and allele frequencies.

// This code snippet demonstrates how to configure and execute an annotation workflow using the KGGA
// platform. The snippet includes setting up the annotation resources, defining input data, configuring
// annotation databases, and exporting annotated variants to a TSV file format. The code also shows how
// to add tasks to a workflow and execute them sequentially.

public class AnnotationExample {
    public static void main(String[] args) {
        // Step 1: Create an object for annotation options.
        // The `AnnotationOptions` object is used to configure various settings related to the annotation 
        // process, such as specifying databases and frequency cutoffs.
        AnnotationOptions annotationOptions = new AnnotationOptions();

        try {
            // Step 2: Set the source of annotation resources.
            // The `Channel` class is used to add local and online channels (resource paths) where 
            // annotation databases and other resources are stored. These channels provide access to 
            // annotation data, such as population allele frequencies or gene annotations.
            Channel.addChannel("./resources"); // Local resource path
            Channel.addChannel("http://pmglab.top/kgga/resources"); // Online resource URL

            // Step 3: Specify the workspace directory for the workflow.
            // The workspace is a directory that stores intermediate files and results generated during the 
            // annotation and analysis processes.
            File workspace = new File("./test1");

            // Step 4: Define the number of threads for parallel processing.
            // The number of threads determines how many tasks will be processed simultaneously, allowing 
            // for faster execution when running on multi-core systems.
            int threadNum = 4;

            // Step 5: Create an executor object to manage and execute the workflow.
            // The `Executor` class provides a framework to handle the various tasks required for annotation 
            // and other computational steps.
            Executor workflow = new Executor();

            // Step 6: Track the workflow and workspace for output management.
            // This utility function links the workflow with the specified workspace, ensuring that 
            // all intermediate files and outputs are managed properly.
            Utility.addTrack(workflow, workspace);

            // Step 7: Execute any initial tasks defined in the workflow.
            // Although there are no specific tasks at this point, this command ensures that the workflow 
            // is in a clean state before adding new tasks.
            workflow.execute();
            workflow.clearTasks(); // Clear any residual tasks to prevent conflicts.

            // Step 8: Set the path to the input GTB file for annotation.
            // This file contains the genetic variants to be annotated. The file path is passed to the 
            // workflow as a parameter named "AnnotationBaseVariantSet".
            String inputAnnotationBasedGTBPath = "/Users/jianglin/Working/tools/kgga/test/demo2/GenerateAnnotationBaseTask/variants.annot.hg38.gtb";
            File annotationBasedGTB = new File(inputAnnotationBasedGTBPath);
            workflow.setParam("AnnotationBaseVariantSet", annotationBasedGTB);

            // Step 9: Configure the annotation databases.
            // Various annotation databases can be specified to provide additional information about 
            // genetic variants, such as allele frequencies or gene annotations.
            // - `freqDatabase`: Specifies databases for allele frequency annotations. Here, the `gnomad` 
            //   database is used with specific subpopulations: "EAS" (East Asian) and "AFR" (African).
            annotationOptions.freqDatabase.add(new DatabaseDescription("gnomad", new String[]{"gnomADv4.0::EAS", "gnomADv4.0::AFR"}));

            // Set the allele frequency (AF) range filter.
            // This filter restricts the annotated variants to those with allele frequencies between 0 and 
            // 0.01, ensuring that only rare variants are considered.
            annotationOptions.dbAf = new FloatInterval(0, 0.01f);

            // Specify the gene annotation database.
            // The `gencode` database is used here, providing comprehensive gene annotation information. 
            // Gene annotations help link genetic variants to genes and identify their functional context.
            annotationOptions.geneDatabase = List.singleton(new DatabaseDescription("gencode"));

            // Step 10: Add the annotation tasks to the workflow.
            // The `AnnotationPipeline` class builds the necessary tasks for variant annotation using the 
            // specified options, workspace, and number of threads. These tasks are then added to the workflow.
            workflow.addTasks(new AnnotationPipeline(annotationOptions, workspace, threadNum).build());

            // Step 11: Add a task to export annotated variants to a TSV file.
            // The `OutputVariants2TSVTask` is responsible for exporting the annotated variants stored in 
            // the GTB format to a tab-separated value (TSV) file. This file can be easily viewed and 
            // processed by other bioinformatics tools.
            workflow.addTasks(new OutputVariants2TSVTask(workspace, threadNum));

            // Step 12: Execute the workflow to perform annotation and export tasks.
            // This command initiates the workflow, executing all tasks sequentially and producing the 
            // desired output files (e.g., annotated TSV files) in the specified workspace.
            workflow.execute();

        } catch (IOException e) {
            // Handle any exceptions that occur during file I/O operations.
            throw new RuntimeException(e);
        }
    }
}

Changelog

2025¶

Add plot and tDESE

v1.2.2 Release

2025-12-6 Feature

1) Introducing Scatter and Forest Plots ro visualize your EMIC analysis results more effectively. 2) Fixed bugs to ensure reliable processing of multiple GWAS summary inputs. 3) Added a function to fit age-specific gene expression, enabling temporal inference into the dynamics of complex genetic diseases.

Improve the robustness

v1.2.1 Release

2025-11-6 Optimization

Improve the robustness of task tracking and gene-based ESC functions.

Support Feather format for gene-score file

2025-10-1 Feature

Allow the Feather format for --gene-score-file option to speed up the analysis.

Non-human GWAS support

2025-7-26 Feature

Allow gene-based association analysis of non-human GWAS when a gene model is specified by a customized --gene-model-database.

LD clumping enhancement

2025-7-21 Optimization

Add LD clumping with data harmonization in the Annotation module.

LD blocking & clumping optimization

2025-6-15 Optimization Feature

Improve algorithms of LD blocking and clumping.

Cell type prioritization

2025-4-11 Feature

Develop scalable algorithms to prioritize cell types in large-scale datasets containing thousands of cell types or expression profiles.

PCMR analysis summary table

2025-4-4 Feature

Generate a summary table for PCMR analysis.

Resource link update

2025-4-2 Normal

Update some download links of KGGSum for resource data.

Data flow optimization

2025-3-26 Optimization

Optimize some internal data flows to speed up analysis and reduce memory usage.

Initial release

2025-3-10 Release

Release the first formal version of KGGSum.