Brief Document

KGGA（Knowledge-based Genetic and Genomic Analyses Accelerator）¶

Introduction¶

KGGA is an open-source, high-performance computational platform engineered to revolutionize large-scale genomic studies. Leveraging the GTB genomic encoding framework and CCF distributed computing architecture, KGGA achieves acceleration of 10–100X in processing cohorts exceeding 1 million subjects, while maintaining analytical rigor, enabling population-scale genomics on standard workstations (≥8 cores, 16GB RAM). By democratizing supercomputer-grade genomics for standard research environments, KGGA provides an industrial-grade solution for large-scale genome-wide association studies (GWAS), rare variant analysis, and clinical genomic interpretation.

Functionality¶

KGGA currently supports the following function modules:

clean: a basic module that makes genotype-level and variant-level QC, and extracts variant and sample subgroups. The processed variants can be exported into various formats (e.g., plink, pgen, and VCF) or further analyzed by the other functional modules on KGGA.
annotate: a module that quickly annotates variants using hundreds of fields from multiple databases simultaneously, and filters variants based on the annotations.
prune: a module that prunes variants according to association p-value, linkage disequilibrium (LD), and annotation scores. The genotypes of pruned variants can be exported in multiple formats for follow-up analyses.
prioritize: a module that prioritizes variants, genes, or genomic regions based on allelic frequencies in reference populations by various comprehensive analysis pipelines.
connect: designed for advanced users, KGGA also provides a suite of high-performance API functions that enable direct interaction with third-party applications built using Java, Python, or R. These APIs support bidirectional integration, allowing them to be seamlessly invoked by external tools as well.
simulate: a module for generating synthetic phenotypes from real genotype data based on real genotypes. This functionality is crucial for a wide range of applications in genetic research, including but not limited to: evaluating the statistical power of new analytical methods and exploring causal relationships between traits (e.g., for Mendelian Randomization studies).
prediction : a module designed to construct and apply predictive models for phenotypes using genotypes and covariates. This module is particularly valuable for genomic research and clinical applications, such as prioritizing individuals with high disease risk based on their genetic profiles.

overview

Key Advances of KGGA¶

Ultra-high speed and minimal memory footprint, approaching hardware limits
Modular incremental execution & resumable workflows
Extensive biomedical knowledge base
Seamless integration with analysis tools and platforms via efficient APIs and binary formats

Citations¶

The GTB format paper: Zhang L, Yuan Y, Peng W, Tang B, Li MJ, Gui H, Wang Q, Li M*. GBC: A parallel toolkit based on fast-accessible byte blocks for extremely large-scale genotypes of species. Genome Biol. 2023 Apr 17;24(1):76.

System Requirements and Setup for KGGA¶

This section provides the system requirements for running KGGA, a Java-based tool for genetic analysis, along with detailed instructions for setting up the Java Runtime Environment (JRE) and downloading necessary resources.

KGGA and Its Running Resources¶

KGGA is distributed as a Java Archive file (kgga.jar) and requires additional resources, such as reference genes and genomic annotations, for most analyses. Below are the download links for the latest version of KGGA and its resources:

File	Description	Version	Size
kgga.jar	The KGGA program	1.0	~23 MB
tutorials.zip	Toy input data and minimal resources for tutorials	1.0	~200 MB
resources	Folder with full resource data files (GTB or text format)	1.0	Varies
	For historical releases

Contents of resources.zip

The resources.zip package provides the essential annotation and coordinate-conversion files used by KGGA:

File	Description
`resources/liftover/hg19ToHg38.over.chain.gz`	Chain file for converting coordinates from hg19 to hg38
`resources/reference/CanonicalTranscript.txt.gz`	Canonical transcript metadata (gene symbol, Ensembl ID, etc.)
`resources/reference/GEncode_hg38_kggseq_v2.txt.gz`	GENCODE gene annotations (hg38)
`resources/reference/refGene_hg38_kggseq_v2.txt.gz`	RefGene gene annotations (hg38)

💡 KGGSum allows flexible resource usage — you can download only the files needed for your specific analysis.

System Requirements¶

KGGA operates within a Java Virtual Machine (JVM), ensuring compatibility across multiple operating systems. Below are the hardware and software requirements:

Hardware/Software	Requirement
Operating System	Any OS supporting Java (e.g., Linux, macOS, Windows)
Java Runtime Environment	Java SE JRE version 1.8 or higher
CPU	4 cores or more recommended
Memory	16 GB RAM or higher recommended
Free Disk Space	Up to 10 GB for KGGA and associated datasets

Setting Up a Java Runtime Environment (JRE)¶

KGGA requires a Java Runtime Environment (JRE) version 1.8 or higher. You can use either Java SE JRE or OpenJDK JRE.

Installation Steps¶

Download and Install JRE
For Java SE JRE, visit the official Java download page and follow the instructions.
For OpenJDK JRE, refer to the OpenJDK installation guide.
Verify Installation
Open a terminal (Linux/macOS) or Command Prompt/PowerShell (Windows).
Enter the command:
```
java -version
```
If installed correctly, you should see output like:

Java(TM) SE Runtime Environment (build 1.8.0_XXX)

or

OpenJDK Runtime Environment (build 1.8.0_XXX)
If the command is not recognized, confirm that the JRE is installed and that the java command is in your system’s PATH.

Setting Up an Environment for Quick Tutorials¶

Follow these steps to set up an environment for running the tutorial examples:

Download Files
kgga.jar
tutorials.zip
Unzip tutorials.zip
Extract the contents to a directory (e.g., tutorials/).
Place kgga.jar
Copy or move the latest kgga.jar into the tutorials/ directory.
Verify KGGA Installation
Open a terminal (Linux/macOS) or Command Prompt/PowerShell (Windows).
Navigate to the tutorials/ directory:
```
cd path/to/tutorials
```
Run the following command to test KGGA:
```
java -jar kgga.jar
```
If successful, KGGA will display usage information or a help message.

Notes¶

The resources/ folder within tutorials/ provides minimal data for tutorials, such as gene annotations and liftover chains.
For more extensive analyses, download additional resources from the resources folder as needed.

General Notes¶

Ensure your system meets the recommended hardware specifications, particularly for large datasets.
If Java-related issues arise, double-check the JRE installation and PATH configuration.
Advanced analyses may require additional resources (e.g., reference genotypes, annotation databases) available via the provided download links.

Summary of Databases¶

The great efforts of international projects and functional genomic studies have led to a rich tapestry of information from diverse biological domains that can be leveraged for variant annotation. To facilitate convenient variant annotation, we meticulously pre-processed some annotation fields from several commonly used databases. Users can download the entire database or study the necessary databases directly from our website and start annotating a large number of variants. For small-scale annotations (e.g., <10,000 variants), users can access these databases remotely via FTP/HTTP. The program will automatically fetch the specified slices of databases based on the index file of the given variants in a distributed fashion, eliminating the need to download the entire database locally and ensuring a more convenient and expedited process. Given the expansive and dynamic nature of genomic databases, KGGA also allows users to incorporate custom databases for annotation purposes.

Gene feature annotation¶

We support three well-established gene annotation systems: RefSeq genes (refgene), GENCODE genes (gencode), and UCSC KnownGene database (knowngene). In KGGA, databases used for gene feature annotation should be constructed in FASTA format. Additionally, KGGA also accommodates custom gene databases formatted in FASTA, allowing users to integrate their own gene annotations tailored to specific research needs or preferences.

Database Name	Source	Pre-made for KGGA	Source Version/Date
refgene	https://ftp.ncbi.nih.gov/refseq/H_sapiens/annotation/	https://idc.biosino.org/pmglab/resource/kgg/kgga/resources/gene/refGene_hg38_kggseq_v2.txt.gz	2024-08-27
encode	https://www.gencodegenes.org/human/	https://idc.biosino.org/pmglab/resource/kgg/kgga/resources/gene/GEncode_hg38_kggseq_v2.txt.gz	V47
knowngene	https://hgdownload.soe.ucsc.edu/goldenPath/hg38/database/	https://idc.biosino.org/pmglab/resource/kgg/kgga/resources/gene/knownGene_hg38_kggseq_v2.txt.gz	2024-08-19

Variant annotation databases¶

In KGGA, common variant annotation databases, such as gnomAD for allele frequency annotation, CADD for variant function annotation, and ClinVar for disease linkage annotation, should be provided in GTB format. These databases are typically stored in well-structured tables, making them easily converted into GTB format. Below are brief descriptions of these databases. For more details, please click on the database name to view the full list of fields available for each.

Database Name	Short Description
gnomAD	Allele frequency data in the Genome Aggregation Database (gnomAD) v4 dataset (GRCh38) is derived from 730,947 exomes and 76,215 genomes from unrelated individuals of diverse ancestries. The pre-made GTB format of gnomAD(v4.1) for KGGA can be downloaded here.
CADD	Combined Annotation Dependent Depletion (CADD) is a widely used matrix for mutation deleteriousness. It integrates more than 100 annotations for all possible single-nucleotide variants (SNVs) of the GRCh38/hg38 human reference genome. The pre-made GTB format of CADD(v1.7) for KGGA can be downloaded here.
FAVOR	Functional Annotation of Variants - Online Resource (FAVOR) provides comprehensive multi-faceted variant functional annotations that summarize findings of all possible nine billion SNVs across the genome (build GRCh38). The pre-made GTB format of FAVOR(v2.0) for KGGA can be downloaded here.
dbNSFP	dbNSFP is a database for functional prediction and annotation of all potential non-synonymous single-nucleotide variants (nsSNVs) in the human genome. The pre-made GTB format of dbNSFP(v5.0a) for KGGA can be downloaded here.
PEXT	Proportion expressed across transcripts (pext) is a transcript-level annotation metric that quantifies isoform expression for variants in gnomAD v2 (hg19) across 54 GTEx tissues. The pre-made GTB format of PEXT(v2) for KGGA can be downloaded here.
ClinVar	ClinVar is a public database managed by the National Center for Biotechnology Information (NCBI) that provides information about the relationship between genetic variation and human health. The pre-made GTB format of ClinVar for KGGA can be downloaded here.

Region annotation databases¶

To facilitate the convenient use of more resources, KGGSum provides an approach that allows users to specify customized third-party resources for annotation. For example, by setting the file name documented in EpiMap, KGGSum can directly download epigenetic marker resources from the EpiGenome public domain, specifically from the EpiMap Repository.

Database Name	Short Description
EpiMap	EpiMap is one of the most comprehensive human epigenome maps, providing approximately 15,000 datasets across 833 bio-samples and 18 epigenomic marks. It delivers rich gene-regulatory annotations encompassing chromatin states, high-resolution enhancers, activity patterns, enhancer modules, upstream regulators, and downstream target genes.

Quick Start

Quick-Start Examples¶

After confirming that a Java runtime environment (version 8 or higher) is installed, you can run kgga.jar directly without additional configuration. The following examples showcase how KGGA cleans and annotates sequence variants and genotypes. These quick-start demos use small datasets retrieved from the internet, requiring no local resources. They are optimized for small datasets and serve as an easy way to explore KGGA’s core features.

Perform Quality Control on VCF Data and Output in PLINK BED Format¶

This example retrieves genotypes from an online VCF file, applies default quality control (QC) criteria, filters for low-frequency variants (minor allele frequency between 0.05 and 0.5), and exports the cleaned data in PLINK’s BED format.

Example Command¶

java -jar kgga.jar \
   clean \
   --input-gty-file https://idc.biosino.org/pmglab/resource/kgg/kgga/example/assoc.hg19.vcf.gz refG=hg19 \
   --ped-file https://idc.biosino.org/pmglab/resource/kgg/kgga/example/assoc.ped \
   --output ./test/demo1 \
   --local-maf 0.05~0.5 \
   --output-gty-format PLINK_BED

Key Parameters¶

java This is the command that starts the Java Virtual Machine (JVM). Since KGGA is a Java-based application, java is used to execute it.
-jar kgga.jar This tells the JVM to run the kgga.jar file, a Java Archive (JAR) that contains the KGGA application. A JAR file bundles all the necessary code and key resources, making it the executable file for KGGA.
–input-gty-file refG=hg19: Specifies the input VCF file (fetched from a URL) and the reference genome (hg19).
–ped-file : Provides the pedigree and phenotype file, also retrieved from a URL.
–output ./test/demo1: Defines the output directory for the cleaned data.
–local-maf 0.05~0.5: Filters variants with a minor allele frequency (MAF) between 0.05 and 0.5.
–output-gty-format PLINK_BED: Exports the cleaned genotypes in PLINK’s BED format.

Output¶

The cleaned genotype data is saved in PLINK BED format in the specified directory (./test/demo1).

Annotate Variants from VCF Data and Output in TSV Format¶

This example retrieves genotypes from an online VCF file, applies default QC criteria, filters for rare variants (frequency<=1%), annotates gene features, filters for non-synonymous variants, and adds functional prediction scores from the dbNSFP database. The results are saved in a compressed TSV file.

Example Command¶

java -Dccf.remote.timeout=60 -jar kgga.jar \
   annotate \
   --input-gty-file https://idc.biosino.org/pmglab/resource/kgg/kgga/example/rare.disease.hg19.vcf refG=hg19 \
   --ped-file https://idc.biosino.org/pmglab/resource/kgg/kgga/example/rare.disease.ped.txt \
   --output ./test/demo2 \
   --gene-feature-included 0~6 \
   --freq-database gnomad \
   --db-af 0~0.01 \
   --variant-annotation-database dbnsfp

Key Parameters¶

-Dccf.remote.timeout=60 This part sets a system property for the JVM, configuring a timeout for remote operations within KGGA. The value 60 sets the timeout to 60 seconds. The Dccf.remote.timeout=60 setting is key when KGGA interacts with remote resources, like online databases or large datasets. A timeout of 60 seconds might work fine in many cases, but you might need to tweak it if:
Your network is slow, causing delays.
You’re working with big data that takes longer to fetch or process.
You see timeout errors during execution.

By setting this property, you control how long KGGA waits before giving up on a remote operation, making it adaptable to your specific needs.

–input-gty-file refG=hg19: Specifies the input VCF file (fetched from a URL) and the reference genome (hg19).
–ped-file : Provides the pedigree and phenotype file from a URL.
–output ./test/demo2: Sets the output directory for the annotated data.
–freq-database gnomad: Uses gnomAD population frequency data to annotate variants.
–db-af 0~0.01: Filters out variants with allele frequencies outside 0–1% in the specified database (gnomAD). Retains rare/low-frequency variants for analysis.
–gene-feature-included 0~6: Annotates gene features (e.g., frameshift, missense), using indices 0 to 6.
–variant-annotation-database dbnsfp: Adds functional prediction scores from the dbNSFP database.

Output¶

The annotated variants and their details are saved in a compressed TSV file in the specified directory (./test/demo2).

GBC

Starting GBC 中文 ¶

GBC is now part of the KGGA toolkit. It specializes in genotype data processing—including memory encoding, storage encoding, and computational encoding to optimize performance for specific tasks—and the design of efficient coordinate search algorithms. To launch GBC’s command-line interface, use the following command:

java -jar kgga.jar gbc

You can access detailed documentation for all commands by adding the –help flag.

Core Functionality: The convert Command¶

GBC’s primary command is convert, which enables parallel conversion between common genomic analysis file formats. It also supports features like filtering, LiftOver, Biallelic conversion, quality control (QC), sorting, and concatenation.

Syntax¶

java -jar kgga.jar gbc convert <source>2<target> [options]

Examples: vcf2gtb, plink2gtb, gtb2vcf

Note: Conversion to PLINK-PGEN requires additional extensions (see “Extended Functionalities” below).

Format Conversion Examples¶

Converting VCF to GTB¶

This example converts a VCF file into a GTB file with default quality control and specific options.

Command:

java -jar kgga.jar gbc convert vcf2gtb ~/ukb24310_c1_b6089_v1.vcf.gz --field  --prune --seq-an 1~ --seq-af 0.000001~0.999999 -o /Users/suranyi/ukb24310_c1_b6089_v1.3.gtb

Details:

Converts the input VCF file to GTB format.
Applies default QC filters:
GQ >= 20
DP >= 8
MQ >= 40
PL >= 20
LPL >= 20
FT == PASS
AD_HOM_REF <= 0.05
AD_HOM_ALT >= 0.75
AD_HET >= 0.25
–field: Removes INFO and FILTER fields from the VCF (no parameters specified).
–prune: Trims alternate (ALT) mutations with an allele count (AC) of 0.
–seq-an 1~: Filters by allele number range (minimum 1, no upper limit).
–seq-af 0.000001~0.999999: Filters by allele frequency range.

To disable QC, add –disable-qc. For more details on QC parameters, run:

java -jar kgga.jar gbc convert -h

Converting VCF to PLINK-PGEN¶

This example converts a VCF file to PLINK-PGEN format.

Command:

java -Djava.library.path=$(pip3 show jep | grep Location | awk '{print $2"/jep"}') -jar kgga.jar gbc convert vcf2plink ./ukb24310_c1_b6089_v1.vcf.gz  -o ./ukb24310_c1_b6089_v1  --output-type pgen

Details:

Requires the jep library path for PLINK-PGEN support (see “Extended Functionalities” below).
Outputs a PLINK-PGEN file with the specified name.

Converting Multiple VCF Files to a Single GTB File¶

This example processes multiple VCF files with a LiftOver operation.

Command:

java -jar kgga.jar gbc convert vcf2gtb 1kg.phase3.v5.shapeit2.amr.hg19.chr*.vcf.gz -o ~/tmp/AMR.hg38.gtb --liftover hg19ToHg38

Details:

Combines multiple chromosome-specific VCF files into one GTB file.
Applies a LiftOver from hg19 to hg38 coordinates.

Subset Extraction: Filtering Genotypes¶

You can extract subsets of genotype data (e.g., specific individuals or positions) using the convert command with additional options:

–individual ,,…: Select specific individuals by ID.
–pos [expression]: Filter by genomic position.
–index-range ~: Filter by index range.
–allele-num ~: Filter by allele number.
–seq-ac ~: Filter by allele count.
–seq-af ~: Filter by allele frequency.

Example: PLINK to VCF with Subset Extraction¶

Command:

java -Djava.library.path=$(pip3 show jep | grep Location | awk '{print $2"/jep"}') -jar kgga.jar gbc convert plink2vcf ./ukb24310_c1_b6089_v1 --input-type pgen --individual 1718672,2380098,5176706,4729017,1930596 --seq-an 1~  -o ./ukb24310_c1_b6089_v1.s5.vcf.gz

Details:

Converts a PLINK-PGEN file to VCF.
Filters to include only the specified individuals.
Applies an allele number filter (–seq-an 1~).

Additional Command-Line Features¶

GBC offers several other useful commands:

Queue Merging: java -jar kgga.jar gbc merge <file1> <file2> -o <output>

Merges two genotype files into one.

Vertical Concatenation (e.g., for chromosome files): java -jar kgga.jar gbc concat <input> <input> ... --output <output>

Combines multiple files into a single output.

Linkage Disequilibrium (LD) Calculation: java -jar kgga.jar gbc ld
Graphical User Interface (GUI): java -jar kgga.jar gbc gui

Launches a visual interface for GBC.

Database Creation: java -jar kgga.jar gbc make-database

Generates a database file from genotype data.

Note: Genomic annotation and analysis tools are integrated into KGGA (visit http://pmglab.top/kgga).

Extended Functionalities: PLINK and BGEN Support¶

To enable support for PLINK and BGEN formats, install the required Python libraries:

pip install jep zstandard pgenlib bgen_reader

When running GBC, specify the jep library path:

java -Djava.library.path="$(pip3 show jep | grep Location | awk '{print $2"/jep"}')" -jar kgga.jar gbc

Example Equivalent:

java -Djava.library.path=/opt/homebrew/lib/python3.13/site-packages/jep -jar kgga.jar gbc

Finding the jep Path¶

For Windows users or non-standard installations, determine the jep directory by running:

pip3 show jep

Locate the Location field in the output, append /jep, and use the resulting path with -Djava.library.path.

Remarks¶

CCF Architecture (Version 4.x):
Features a more flexible row-column block design and fine-grained parallelization for better performance.
Optimized for low memory usage (e.g., encoding and filtering UK Biobank whole-genome genotypes at 1GB per thread).
Incompatible with version 3.x file formats.
Development Focus:
KGGA and CCF prioritize systematic engineering improvements via Java APIs.
Command-line tools are supplementary and may have usability limitations, to be addressed in the next minor release (ccf-4.6).
File Merging:
Currently supports basic merging based on coordinate/REF consistency and standard bases (ATCG alleles).
Future updates will enhance performance, handle multi-allelic sites, and add advanced merging modes (e.g., intersection, union, complement, left alignment).

UKB

Managing Genotype Data on the Cloud: A Tutorial with DNANexus and UKB WES 中文 ¶

1. Required Software¶

KGGA Software Package: An integrated tool for managing GBC, CCF, genomes, and other large-scale public data.
Java Runtime Environment
Download Oracle JDK via wget https://download.oracle.com/java/24/latest/jdk-24_linux-x64_bin.tar.gz.

Since the Java environment may be used multiple times on different compute workers, we recommend storing these resources in your project directory.

2. Step-by-Step Procedure¶

2.1. Download Software Locally¶

On your local machine, download KGGA, the gene annotation file, and Oracle JDK:

# Use -c to enable resumable downloads
wget -c https://pmglab.top/kgga/download/lib/latest/kgga.jar
wget -c https://idc.biosino.org/pmglab/resource/kgg/kgga/resources/gene/refGene_hg38_kggseq_v2.txt.gz
wget -c https://download.oracle.com/java/24/latest/jdk-24_linux-x64_bin.tar.gz

2.2. Upload to the RAP Platform¶

Navigate to your project on the DNANexus platform, upload the files from the previous step, and note their file IDs:

KGGA: file-J21ZP4QJ5qQ25bjXZJX435Bq
JDK: file-J214ZzjJ5qQ8GfF272Fy4ZqP
refGene_hg38: file-J21Y8J0J5qQ6KZPxjvpgjF0P

2.3. Obtain Data Paths¶

The UK Biobank genotype data is stored in the following directories on the DNANexus Research Analysis Platform (RAP):

# WES VCF
<PROJECT-ID>:/Bulk/Exome sequences/Population level exome OQFE variants, pVCF format - final release/

# WGS VCF
<PROJECT-ID>:/Bulk/DRAGEN WGS/ML-corrected DRAGEN population level WGS variants, pVCF format [500k release]/

# WGS PGEN
<PROJECT-ID>:/Bulk/DRAGEN WGS/DRAGEN population level WGS variants, PLINK format [500k release]/

For more details on the genotype data storage, please refer to: https://biobank.ndph.ox.ac.uk/ukb/label.cgi?id=100314

The WES data is split by chromosome and then by genomic blocks. The table below shows the total size and file count for each chromosome.

Chromosome	VCF.GZ [GB]	No. of Files
chr1	2,273.42	97
chr2	1,636.94	71
chr3	1,304.10	56
chr4	890.46	39
chr5	994.17	43
chr6	1,128.24	48
chr7	1,091.85	47
chr8	824.10	35
chr9	983.73	42
chr10	914.43	40
chr11	1,382.71	57
chr12	1,187.92	52
chr13	402.98	18
chr14	716.82	30
chr15	780.47	34
chr16	1,141.16	47
chr17	1,360.13	56
chr18	357.89	16
chr19	1,662.61	65
chr20	595.50	25
chr21	246.88	11
chr22	539.19	23
chrX	563.02	24
chrY	5.97	1
ALL	22,984.69	977

All files are named in the format ukb23157_c_b_v1.vcf.gz.

2.4. Launch a Worker and Configure the Environment¶

Launch an interactive worker. A suggested configuration is mem1_ssd1_v2_x4 (7.8 GB RAM, 93 GB disk, 4 cores).

In the worker’s terminal, set up the environment:

cd /opt/notebooks/
dx download file-J21ZP4QJ5qQ25bjXZJX435Bq
dx download file-J214ZzjJ5qQ8GfF272Fy4ZqP
tar zxvf ./jdk-24_linux-x64_bin.tar.gz
export PATH=/opt/notebooks/jdk-24.0.2/bin:$PATH

2.5. Download and Process Data in Batches¶

Individual UKB VCF files are very large, making it impractical to process all sub-files for a chromosome at once. Therefore, we will process the data in batches. First, we divide the files for an entire chromosome into k parts. Then, each worker processes one part (which contains multiple VCF sub-files). The processing includes genotype/individual-level QC and format conversion to the compact GTB format. The resulting GTB files are uploaded back to the DNANexus platform. Finally, once all parts of a chromosome are processed, they are concatenated into a single, complete GTB file.

The following script processes one batch of VCF files. It can be run in parallel on multiple workers by specifying a different BATCH_INDEX for each.

# ---------- Job Basic Information ----------
  CHROM="21"
  # Total number of batches to split the chromosome files into
  BATCH_NUM=5
  # Current batch index (0-based, from 0 to BATCH_NUM-1)
  BATCH_INDEX=0
  # Input data project and directory
  INPUT_PROJECT_ID="J04yP4QJ5qQ4pgP2xQpb0Qp3"
  INPUT_PROJECT_DIR="/Bulk/Exome sequences/Population level exome OQFE variants, pVCF format - final release/"
  # Output data project and directory. [Ensure this directory exists to avoid upload errors]
  OUTPUT_PROJECT_ID="J04yP4QJ5qQ4pgP2xQpb0Qp3"
  OUTPUT_PROJECT_DIR="/user/zlb/ukb/wes/chr${CHROM}"

# ---------- File List ----------
  echo "[STEP 1] Finding VCF files for chromosome ${CHROM}..."
  # Search for all vcf.gz files in the specified input path
  mapfile -t FILES < <(
    dx find data -r --name "ukb23157_c${CHROM}_b*.vcf.gz" \
      --path "project-${INPUT_PROJECT_ID}:${INPUT_PROJECT_DIR}/" \
      --brief \
    | xargs -n1 -I {} dx describe {} --name
  )

  TOTAL_FILES=${#FILES[@]}
  echo "Found ${TOTAL_FILES} files"

# ---------- Job Mapping ----------
  if [[ $BATCH_NUM -le 0 ]]; then
    echo "Error: BATCH_NUM must be greater than 0"
    exit 1
  fi

  # Calculate the size of each batch (ceiling)
  BATCH_SIZE=$(( (TOTAL_FILES + BATCH_NUM - 1) / BATCH_NUM ))
  START_INDEX=$(( BATCH_INDEX * BATCH_SIZE ))
  END_INDEX=$(( START_INDEX + BATCH_SIZE - 1 ))

  # Validate the batch index
  if [[ $BATCH_INDEX -ge $BATCH_NUM ]]; then
    echo "Error: BATCH_INDEX cannot exceed BATCH_NUM-1"
    exit 1
  fi

  # Handle the last batch which may be smaller
  if [[ $END_INDEX -ge $TOTAL_FILES ]]; then
    END_INDEX=$(( TOTAL_FILES - 1 ))
  fi

  # Check for empty batches (e.g., if BATCH_INDEX is too high)
  if [[ $START_INDEX -gt $END_INDEX ]]; then
    echo "Warning: Current batch is empty (all files have likely been processed)."
    exit 0
  fi

  echo "Total batches: ${BATCH_NUM} | Approx. ${BATCH_SIZE} files per batch"
  echo "Current batch: ${BATCH_INDEX} (file index range ${START_INDEX}-${END_INDEX})"

# ---------- Processing Logic ----------
  for (( i=START_INDEX; i<=END_INDEX; i++ ))
  do
      INPUT="${FILES[$i]}"
      echo "Progress: $((i - START_INDEX + 1))/$((END_INDEX - START_INDEX + 1)) - File ${INPUT}"

      # Output filename, e.g., rename ukb23157_c1_b1.vcf.gz to ukb23157_c1_b1.gtb
      OUTPUT="$(basename ${INPUT} .vcf.gz).gtb"

      # Check if the output file already exists on the platform and skip if it does
      if dx ls "project-${OUTPUT_PROJECT_ID}:${OUTPUT_PROJECT_DIR}/${OUTPUT}" > /dev/null 2>&1; then
          echo "File already exists, skipping: ${OUTPUT}"
          continue
      fi

      # Download the file if it doesn't exist locally
      if ls "${INPUT}" > /dev/null 2>&1; then
          echo "Local file exists, skipping download: ${INPUT}"
      else
          echo "Downloading file: ${INPUT}"
          dx download "project-${INPUT_PROJECT_ID}:${INPUT_PROJECT_DIR}/${INPUT}"
      fi

      # Process the file: convert VCF to GTB with filtering
      echo "Converting VCF -> GTB..."
      java -jar -Xms6g -Xmx6g kgga.jar gbc convert \
           vcf2gtb "${INPUT}" -o "${OUTPUT}" \
           --field --prune --seq-an 1~ --seq-af 0.000001~0.999999 \
           -t 4 || { echo "Java conversion failed for ${INPUT}"; exit 1; }

      # Upload the resulting GTB file
      echo "Uploading file: ${OUTPUT}"
      dx upload "${OUTPUT}" --path "project-${OUTPUT_PROJECT_ID}:${OUTPUT_PROJECT_DIR}/${OUTPUT}" || { echo "Upload failed for ${OUTPUT}"; exit 1; }

      # Clean up by deleting local files to save space
      rm "${INPUT}" "${OUTPUT}" || { echo "Failed to clean up files"; exit 1; }
      echo "Finished processing: ${INPUT}"
  done

echo "Batch ${BATCH_INDEX} processing complete"

After all batches for a chromosome have been processed, execute the following commands on a single worker to merge the intermediate GTB files:

# ---------- Job Basic Information ----------
  CHROM="21"
  # Input directory containing the intermediate GTB files
  INPUT_PROJECT_ID="J04yP4QJ5qQ4pgP2xQpb0Qp3"
  INPUT_PROJECT_DIR="/user/zlb/ukb/wes/chr${CHROM}"
  # Output project and directory for the final merged file
  OUTPUT_PROJECT_ID="J04yP4QJ5qQ4pgP2xQpb0Qp3"
  OUTPUT_PROJECT_DIR="/user/zlb/ukb/wes/"
  OUTPUT="chr${CHROM}.hg38.gtb"

# ---------- Processing Logic ----------
  # Get a list of all intermediate GTB files
  mapfile -t FILES < <(
    dx find data --name "ukb23157_c${CHROM}_b*.gtb" \
      --path "project-${INPUT_PROJECT_ID}:${INPUT_PROJECT_DIR}/" \
      --brief \
    | xargs -n1 -I {} dx describe {} --name
  )

  # Download all intermediate files
  for (( i=0; i<${#FILES[@]}; i++ ))
  do
      dx download "project-${INPUT_PROJECT_ID}:${INPUT_PROJECT_DIR}/${FILES[$i]}"
  done

  # Sort files numerically by block number to ensure correct order
  SORTED_FILES=($(ls ./*.gtb | sort -t '_' -k3.2n))

  # Concatenate the GTB files
  java -jar kgga.jar gbc concat "${SORTED_FILES[@]}" -o "${OUTPUT}"

  # Upload the final merged file
  dx upload "${OUTPUT}" --path "project-${OUTPUT_PROJECT_ID}:${OUTPUT_PROJECT_DIR}/${OUTPUT}" || { echo "Upload failed for ${OUTPUT}"; exit 1; }

  # Delete the intermediate sub-files
  rm -rf ukb23157_c*.gtb

This process results in a single GTB file for each chromosome.

2.6. Convert GTB to Other Formats (Genotype-level Analysis)¶

The GTB format typically saves 50%–99% of storage space compared to other formats. Its analysis and computation performance is comparable to high-performance bit-encoded formats like PLINK’s PGEN and BED. A brief guide can be found at https://pmglab.top/kgga/v1.0/GBCQuickStartEn.html.

To access specific subsets of samples and genomic regions and export them to PGEN format (first, install dependencies using pip3 install jep zstandard pgenlib):

java -Djava.library.path="$(pip3 show jep | grep Location | awk '{print $2"/jep"}')" -jar \
    kgga.jar gbc convert gtb2plink chr21.hg38.gtb --seq-af 0.05~0.95 -ot pgen

2.7. Annotate Variants using KGGA¶

KGGA can perform high-speed annotation of genomic attributes directly on GTB files. This example demonstrates how to annotate variants based on gene features and filter them to retain only protein-altering non-synonymous mutations. For more detailed functionalities, please refer to the KGGA online documentation.

To annotate and filter variants, saving the results to a TSV text file:

# Prepare the resource directory for KGGA annotation
mkdir -p resources/gene
cd resources/gene

# Download the resource file for KGGA annotation (using the file ID from step 2.2)
dx download file-J21Y8J0J5qQ6KZPxjvpgjF0P
cd ../../

# Start the annotation process
# This command annotates variants in chr21.hg38.gtb with an Allele Count >= 1,
# using the refGene model, and includes functional consequences from nonsynonymous to LoF (0~6).
java -Xmx4g -jar kgga.jar annotate \
   --input-gty-file ./chr21.hg38.gtb \
   --output ./annot \
   --seq-ac 1 \
   --gene-model-database refgene \
   --gene-feature-included 0~6

You can quickly check the results with command-line tools:

zless /opt/notebooks/annot/variants.hg38.tsv.gz

FAQ

Frequently Asked Questions (FAQ)¶

1. Why do I need KGGA?¶

Response: KGGA is a fast and convenient platform designed for large-scale genotype analysis of extensive population samples. It provides a wide range of powerful quality control (QC) and analysis functions, along with comprehensive resources to support your work. Additionally, KGGA integrates seamlessly with external tools and platforms, making it a versatile choice for diverse genetic analyses.

2. How do I cite KGGA?¶

Response: We kindly ask that you cite the publications of the methods you use within KGGA. This practice helps readers understand the underlying principles and methodologies applied in your analyses.

3. Where do I report bugs and provide feedback on KGGA?¶

Response: To report bugs or share feedback, please send an email to limx54@163.com. We welcome your input to help improve the platform.

4. How can I use generative AI to assist with KGGA?¶

Response: You can leverage generative AI by uploading the PDF version of the KGGA user manual to the AI tool. Once uploaded, you can ask specific questions, such as: “Please generate commands for me to perform gene-based prioritization using KGGA with my VCF.gz input file (path=abc.vcf.gz) and reference genome hg19.” This approach allows you to quickly obtain tailored guidance for your tasks.

Detailed Document

Basic Options

Input¶

KGGA currently accepts VCF and GTB format files (one or more, with or without PED files) as input files for germ-line variants research and MAF format files for somatic mutations analysis.

Option	Description	Default
`--input-gty-file`	Specify the input file. `--input-gty-file` is a combination of parameters. `<type>` is used to specify the format of the input file. `<refG>` is used to specify the reference genome of input variants. Format: `--input <file> type=[AUTO/VCF/GTB/MAF] refG=[hg18/hg19/hg38]` Eaxmple: `--input ./example.vcf.gz type=VCF refG=hg38`	type=AUTO refG=hg38

Variant and Genotype File¶

For germ-line variant research, the expected input data for the program should follow the VCF (Variant Calling Format) format. Here is a short description. The program accepts input from VCF files in

text format. Suffix as .vcf.
GZ compression format by gzip <path/to/VCF>. Suffix as .vcf.gz.
BGZ compression format by bgzip <path/to/VCF>. Suffix as .vcf.bgz or .vcf.gz.
GenoType Block (GTB) format produced by Genotype Blocking Compressor (GBC), which facilitates ultra-fast access for large-scale genotypes of hundreds of thousands of subjects. Suffix as .gtb. The program will automatically convert the input VCF file to GTB format in the first step. The GTB file generated can be used as an input file for subsequent analyses. Alternatively, you can manually generate the GTB file using the command:

java -jar kgga.jar gbc convert vcf2gtb <path/to/VCF> --output <output> [options]

Pedigree and Phenotype File (optional)¶

To specify the phenotypes corresponding to subjects in the VCF file and the pedigree relationships between subjects or to analyze only a subset of subjects in the VCF file, you must provide information about the samples and record them in the PED file. The PED file should be in the LINKAGE Pedigree format. Here is a short description.

Option	Description	Default
`--ped-file`	Specify the PED file with phenotypes. `--ped-file`is a combination of parameters. `<pheno>` is used to set the column name of the major phenotype in the PED file. `<covar>` is used to set the column name(s) used as covariate phenotype(s). By default, the individual IDs in the PED file must be unique and identical to the ones defined in the VCF file(s). However, users can also ask KGGA to use a composite individual ID, which is combined as “FamilyID$IndividualID” by setting the as Y (true) to match the VCF file(s). Format: `--ped-file <file> pheno=[columnName] covar=[columnName2,columnName2,...] composite=[Y/N]` Example: `--ped-file ./example.ped pheno=disease covar=QT,age composite=Y`	composite=N

Mutation Annotation Format File¶

For somatic mutation research, the expected input data for the program should follow the MAF (Mutation Annotation Format) format. Here is a short description. The program accepts input with MAF files in

text format. Suffix as .maf.
GZ compression format by gzip <path/to/MAF>. Suffix as .maf.gz.
BGZ compression format by bgzip <path/to/MAF>. Suffix as .maf.bgz or .maf.gz.
GenoType Block (GTB) format produced by Genotype Blocking Compressor (GBC), which facilitates ultra-fast access for large-scale genotypes of hundreds of thousands of subjects. Suffix as .gtb. The program will automatically convert the input MAF file to GTB format in the first step. The GTB file generated can be used as an input file for subsequent analyses. Additionally, you can manually generate the GTB file using the command:

java -jar kgga.jar gbc maf2gtb <path/to/MAF> -o <path/to/out>

Output¶

Path¶

After analysis, the final retained variants with annotations are stored in a compressed TSV format under the specified output folder. During the analyses, intermediate data for each task are stored in GTB format, facilitating efficient management of large datasets and enabling analysis resumption from specific breakpoints—a critical feature for studies involving extensive computational demands.

Option	Description	Default
`--output`	Specify the output folder path. Format: `--output <dir>` Example: `--output ./out/test`	./kgga

Formats¶

KGGA supports exporting processed genotype data to several commonly used formats, such as VCF and PLINK, to facilitate downstream analysis using other software tools.

Option	Description	Default
`--output-gty-format`	Specify the output format for genotype data. Format: `--output-gty-format <GTB/VCF/PLINK_BED/PLINK_PGEN>` Example: `--output-gty-format GTB`	[OFF]

General Setting¶

There are also some options for resource file paths, template data, and parallel tasks.

Option	Description	Default
`--threads`	Specify the number of threads on which the program will be running. Format: `--threads <int>` Example: `--threads 8` NOTE As a rule of thumb, please do not give a thread number larger than the number of CPU cores. Too many threads may even slow down the analysis.	4
–channel	Specify the path of resource data files. It could be a local system file path, an intranet file path, or even an internet one. Format: `--channel path/to/file` Example: `--channel /public1/resources` Example: `--channel https://idc.biosino.org/pmglab/resource/kgg/kgga/resources/`	./resources
–cache	Specify the path of the temporary data files. It must be a local file path on the Operating System. Format: `--cache path/to/file` Example: `--cache ./tmp/`	[-o]/tmp/
–rand-seed	Specify a seed for a random number generator. It can be any int value except for 0. Format: `--rand-seed <int>` Example: `--rand-seed 12334456`	current time
–chromosome-tag	Specify the recognizable chromosomes. By default, it only considers the standard chromosomes, 1, …, 22, X, Y, and M.	1,…,22,X,Y,M
`--r-server`	Set the address and port number of the R Server. On KGGA, this is required for RUNNER analysis. Format: `--r-server host_ip:host_port` Example:`--r-server localhost:6300`	[OFF]
update	Check and update the local kgga.jar file to the latest version available on the website. Example: `java -jar kgga.jar update` Note: The update option does not work on Windows. On Windows, you must manually download the latest kgga.jar file and replace the existing one.	-

Data Cleaning & QC

Variant & Sample Cleaning¶

About¶

The clean Module is the only way that must be passed for all other Modules of KGGA. It produces a high-quality subset of variants and extracts genotypes of samples in the efficient binary format GTB for follow-up analyses.

Workflow of the Manage Module¶

Genotype-level Quality Control: Removing genotypes with low sequencing quality, low read depth, or confusion.
Variant-level Quality Control: Removing variants with low sequencing quality, low mapping quality, or other filtration criteria (e.g., allele number, allele frequency, specific INFO field of VCF).

The above two steps are based on the VCF file(s) only.

Incorporate Phenotype Data: Integrating phenotype data of subjects from the PED file to extract their corresponding genotypes.
Sample-level Variant Selection: Selecting variants based on mutation type, allele frequency, missing genotype rate, or the Hardy-Weinberg equilibrium test in a specific subset of samples.
Generate Annotation Base Set: Consolidate all cleaned variants from input files (where applicable) into a unified GTB-format variant file. This dataset retains only genomic coordinates (hg38/GRCh38), alleles, and genotype frequency summaries, forming the foundational dataset for downstream KGGA analyses.

Note: In the last two steps, if a PED file is provided, the genotype data for individuals who are present in both the VCF and PED files will be utilized for the selection process. Without a PED file, the selection will proceed based solely on the genotype data available for all individuals listed in the VCF file.

manage

Basic Usage¶

java -jar kgga.jar clean --input <input1> --input <input2> --output <output> [options]

If no other option is set, the program will perform some preset filters by default. Users are also allowed to set custom filtration criteria if needed or disable QC entirely.

Examples¶

This section demonstrates how to use the manage module in KGGA to select a subset of samples from VCF and PED files, apply stringent quality control filters to their genotypes, and export the results in formats suitable for downstream genetic analysis, such as PLINK BED or PGEN.

Selecting a Subset of Samples and Outputting Genotypes in PLINK BED Format¶

In this example, individuals present in both the VCF and PED files are selected, and their genotypes are subjected to stricter quality control than the default settings. Variants are filtered based on allele number, allele count, observation rate, and Hardy-Weinberg equilibrium (HWE). The cleaned genotypes are then output in PLINK BED format for subsequent analysis.

java -Xmx10g -jar kgga.jar \
   clean \
   --input-gty-file ./example/assoc.hg19.vcf.gz \
                   refG=hg19 \
   --ped-file https://idc.biosino.org/pmglab/resource/kgg/kgga/example/assoc.ped \
   --output ./test/demo1 \
   --threads 4 \
   --allele-num 2~4 \
   --seq-ac 1 \
   --min-obs-rate 0.9 \
   --hwe 1E-5 \
   --output-gty-format PLINK_BED

Key Command Options¶

–input-gty-file: Specifies the input VCF file and reference genome (e.g., hg19).
–ped-file: Provides sample metadata, such as case/control status, from a PED file.
–output: Defines the output directory (e.g., ./test/demo1).
–threads: Sets the number of processing threads (e.g., 4).
–allele-num: Filters variants by the number of alleles (e.g., 2 to 4).
–seq-ac: Filters variants by minimum allele count (e.g., ≥1).
–min-obs-rate: Sets the minimum genotype observation rate (e.g., 0.9 or 90%).
–hwe: Applies a Hardy-Weinberg equilibrium filter with a p-value threshold (e.g., 1E-5).
–output-gty-format: Specifies the output format (e.g., PLINK_BED).

Outputting Genotypes in PGEN Format¶

KGGA also supports exporting cleaned genotypes in other formats, such as PGEN. This requires the Python library libjep, which must be installed before running the command. For installation instructions, refer to xxx.

java -Djava.library.path="$(pip3 show jep | grep Location | awk '{print $2"/jep"}')" \
   -Xmx10g -jar kgga.jar \
   clean \
   --input-gty-file ./example/assoc.hg19.vcf.gz \
                   refG=hg19 \
   --ped-file ./example/assoc.ped \
   --output ./test/demo1 \
   --threads 4 \
   --allele-num 2~4 \
   --seq-ac 1 \
   --min-obs-rate 0.9 \
   --hwe 1E-5 \
   --output-gty-format PLINK_PGEN

Additional Command Option¶

-Djava.library.path: Specifies the path to the libjep.jnilib file, required for PGEN output.

Notes¶

KGGA also supports exporting genotypes in VCF format by setting –output-gty-format VCF, though this does not require libjep.
Ensure the libjep library is correctly installed and the path is accurate when using PGEN output.

Outputs¶

The cleaned variants, along with basic count summaries, are saved in the specified output directory (e.g., ./test/demo1/). The genotypes are stored in the chosen format (e.g., PLINK BED or PGEN) within this folder.

The final selection results are saved in a compressed file named variants.hg38.tsv.gz, with the following columns:

Header	Description
CHROM	Chromosome (hg38)
POS	Position (hg38)
REF	Reference allele
ALT	Alternative allele
GTYSUM::RefHomGtyNum_ALL	Number of reference homozygous genotypes (all samples)
GTYSUM::HetGtyNum_ALL	Number of heterozygous genotypes (all samples)
GTYSUM::AltHomGtyNum_ALL	Number of alternative homozygous genotypes (all samples)
GTYSUM::MissingGtyNum_ALL	Number of missing genotypes (all samples)
GTYSUM::RefHomGtyNum_CASE	Number of reference homozygous genotypes (case samples)
GTYSUM::HetGtyNum_CASE	Number of heterozygous genotypes (case samples)
GTYSUM::AltHomGtyNum_CASE	Number of alternative homozygous genotypes (case samples)
GTYSUM::MissingGtyNum_CASE	Number of missing genotypes (case samples)
GTYSUM::RefHomGtyNum_CONTROL	Number of reference homozygous genotypes (control samples)
GTYSUM::HetGtyNum_CONTROL	Number of heterozygous genotypes (control samples)
GTYSUM::AltHomGtyNum_CONTROL	Number of alternative homozygous genotypes (control samples)
GTYSUM::MissingGtyNum_CONTROL	Number of missing genotypes (control samples)

Additional Output Files¶

Intermediate Files: Files with the .gtb suffix, located in the ConvertVCF2GTBTask subfolder, can be used directly for subsequent analysis.
Log Files: Processing logs are saved in the log subfolder within the output directory.

General Notes¶

Adjust file paths (e.g., –input-gty-file, –ped-file, –output) and parameters (e.g., –allele-num, –hwe) based on your dataset and analysis needs.
The –output-gty-format option allows flexibility in choosing the output format (e.g., PLINK_BED, PLINK_PGEN, or VCF).
For PGEN output, ensure the libjep library is installed and the –jep path is correctly specified.

VCF-based Quality Control¶

Genotype-level Quality Control¶

Removing genotypes if at least one of the options is set to true.

Option	Description	Default
`--gty-gq`	Exclude genotypes with the minimal genotype quality (Phred Quality Score) per genotype < `minGq`. Set to ‘0’ to disable this filter. Format: `--gty-gq <minGq>` Example: `--gty-gq 20` Valid setting: [int] >=0	20
`--gty-dp`	Exclude genotypes with the minimal read depth per genotype < `minDp`. Set to ‘0’ to disable this filter. Format: `--gty-dp <minDp>` Example: `--gty-dp 8` Valid setting: [int] >=0	8
`--gty-pl`	Exclude genotypes with the second smallest normalized Phred-scaled likelihoods for genotypes < `minPl`. Otherwise, there would be confusing genotypes. Set to ‘0’ to disable this filter. Format: `--gty-pl <minPl>` Example: `--gty-pl 20` Valid setting: [int]>=0	20
`--gty-ad-hom-ref`	Exclude genotypes with the fraction of the reads carrying alternative allele > `maxAdHomRef` at a reference-allele homozygous genotype. Set to ‘1’ to disable this filter. Format: `--gty-ad-hom-ref <maxAdHomRef>` Example: `--gty-ad-hom-ref 0.05` Valid setting: [float] 0.0 ~ 1.0	0.05
`--gty-ad-hom-alt`	Exclude genotypes with the fraction of the reads carrying alternative allele < `minAdHomAlt` at an alternative-allele homozygous genotype. Set to ‘0’ to disable this filter. Format: `--gty-ad-hom-alt <minAdHomAlt>` Example: `--gty-ad-hom-alt 0.75` Valid setting: [float] 0.0 ~ 1.0	0.75
`--gty-ad-het`	Exclude genotypes with the fraction of the reads carrying alternative allele < `minAdHet` at a heterozygous genotype. Set to ‘0’ to disable this filter. Format: `--gty-ad-het <minAdHet>` Example: `--gty-ad-het 0.25` Valid setting: [float] 0.0 ~ 1.0	0.25
`--gty-qc`	Exclude genotypes where the genotype quality metric corresponding to the `keyword` has not passed Java expression quality control. `--gty-qc`is is a combination of parameters that meet custom genotype QC needs. The `default` tag is used to control whether to retain or discard the genotype when a quality control parsing error occurs. Format: `--gty-qc <keyword> <rule> default=[RETAIN/DISCARD]` Example: `--gty-qc DP "DP.toInt()>=10" default=DISCARD`	[OFF]

Variant-level Quality Control¶

Option	Description	Default
`--allele-num`	Exclude variants with the alternative allele number per variant outside the range [`minAlleleNum`, `maxAlleleNum`]. Format: `--allele-num <minAlleleNum>~<maxAlleleNum>` Example: `--allele-num 2~4` Valid setting: [int] 0 ~ 255	[OFF]
`--seq-ac`	Exclude variants with the alternative allele count (AC) per variant outside the range [`minAc`, `maxAc`]. Format: `--seq-ac <minAc>~<maxAc>` Example: `--seq-ac 1~10` Valid setting: [int] >=0	[OFF]
`--seq-an`	Exclude variants with the non-missing allele number (AN) per variant outside the range [`minAn`, `maxAn`]. Format: `--seq-an <minAn>~<maxAn>` Example: `--seq-an 160~200` Valid setting: [int] >=0	[OFF]
`--seq-af`	Exclude variants with the alternative allele frequency (AF) per variant outside the range [`minAf`, `maxAf`]. Format: `--seq-af <minAf>~<maxAf>` Example: `--seq-af 0.05~1.0` Valid setting: [float] 0.0 ~ 1.0	[OFF]
`--seq-qual`	Exclude variants with the minimal overall sequencing quality score (Phred Quality Score) per variant < `minQual`. Format: `--seq-qual <minQual>` Example: `--seq-qual 30` Valid setting: [float] >=0.0	30
`--seq-mq`	Exclude variants with the minimal overall mapping quality score (Mapping Quality Score) per variant < `minMq`. Format: `--seq-mq <minMq>` Example: `--seq-mq 20` Valid setting: [float] >=0.0	20
`--seq-fs`	Exclude variants with the overall strand bias Phred-scaled p-value (using Fisher’s exact test) per variant > `maxFs`. The strand bias estimation is best suited for low-coverage situations. Set to ‘100’ to disable this filter as the maximal phred-scaled p-value is 100. Format: `--seq-fs <maxFs>` Example: `--seq-fs 60` Valid setting: [float] >=0.0	100
`--seq-info`	Exclude variants where the value of the specified `keyword` in the INFO field does not pass the Java expression quality control. `--seq-info`is a combination of parameters. The `default` tag is used to control whether to retain or discard the genotype when a quality control parsing error occurs. Format: `--seq-info <keyword> <rule> default=[RETAIN/DISCARD]` Example: `--seq-info keyword=MQ rule=MQ.char2Float()>=20 default=DISCARD`	[OFF]
`--seq-filter`	Exclude variants where the value of the specified keyword in the FILTER field of VCF does not pass quality control. It uses Java expressions flexibly. In the expressions, e.g., ‘value.XXX(string)’ operates the value of the FILTER field as a Java String. Format: `--seq-filter Java String expression` Example: `--seq-filter value.valueEquals(\"PASS\")` will exclude variants at which the FILTER field is not equal to PASS –seq-filter value.indexOf("q10") != -1` will exclude variants at which the FILTER field contains q10	[OFF]

Turn Off Quality Control¶

All quality control options mentioned above, including genotype-level and variant-level quality control options, can be turned off by --disable-qc.

Option	Description	Default
`--disable-qc`	Disable all quality control options mentioned above. NOTE It cannot be used in conjunction with other quality control options, as it will render them ineffective. Format: `--disable-qc`	[OFF]

Mutation Type¶

Option	Description	Default
`--only-snv`	Only single-nucleotide polymorphism variants (SNP) are retained and analyzed. Format: `--only-snv`	[OFF]
`--only-indel`	Only small insertion or deletion (InDel, <=50 bp) variants are retained and analyzed. Format: `--only-indel`	[OFF]

In the following functions, if a PED file is provided, the genotype data for individuals present in both the VCF and PED files will be utilized for the selection process. Without a PED file, the selection will proceed based solely on the genotype data available for all individuals listed in the VCF file.

Allele Frequency¶

Option	Description	Default
`--local-af`	Exclude variants in all subjects with alternative allele frequency (AF) outside the range [`minAF`, `maxAF`]. Format: `--local-af <minAF>~<maxAF>` Example:`--local-af 0.05~1.0` Valid setting: [float] 0.0 ~ 1.0	[OFF]
`--local-af-case`	Exclude variants in cases with alternative allele frequency (AF) outside the range [`minAF`, `maxAF`]. Format: `--local-af-case <minAF>~<maxAF>` Example:`--local-af-case 0.05~1.0` Valid setting: [float] 0.0 ~ 1.0	[OFF]
`--local-af-control`	Exclude variants in controls with alternative allele frequency (AF) outside the range [`minAF`, `maxAF`]. Format: `--local-af-control <minAF>~<maxAF>` Example:`--local-af-control 0.05~1.0` Valid setting: [float] 0.0 ~ 1.0	[OFF]
`--min-case-control-af-ratio`	Exclude variants at which the alternative allele frequency (AF) in cases is less than that of in controls multiplied by a specified ratio. Format: `--min-case-control-af-ratio <ratio>` Example:`--min-case-control-af-ratio 2.0` Valid setting: [float] >= 0.0	[OFF]
`--local-maf`	Exclude variants in all subjects with minor allele frequency (MAF) outside the range [`minMAF`, `maxMAF`]. By definition, MAF represents the frequency of the less common allele. An interesting thing about the human reference genome is that the “reference” allele is not always the common or “major” allele in the human population. When AF<=0.5, MAF equals AF; when AF > 0.5, MAF is calulated as 1-AF. Format: `--local-maf <minMAF>~<maxMAF>` Example:`--local-maf 0.05~0.5` Valid setting: [float] 0.0 ~ 0.5	[OFF]
`--local-maf-case`	Exclude variants in cases with minor allele frequency (MAF) outside the range [`minMAF`, `maxMAF`]. Format: `--local-maf-case <minMAF>~<maxMAF>` Example:`--local-maf-case 0.05~0.5` Valid setting: [float] 0.0 ~ 0.5	[OFF]
`--local-maf-control`	Exclude variants in controls with minor allele frequency (MAF) outside the range [`minMAF`, `maxMAF`]. Format: `--local-maf-control <minMAF>~<maxMAF>` Example:`--local-maf-control 0.05~0.5` Valid setting: [float] 0.0 ~ 0.5	[OFF]
`--min-case-control-maf-ratio`	Exclude variants at which the minor allele frequency (MAF) in cases is less than that of in controls multiplied by a specified ratio. Format: `--min-case-control-maf-ratio <ratio>` Example:`--min-case-control-maf-ratio 2.0` Valid setting: [float] >= 0.0	[OFF]

Missing Genotype Rate¶

Option	Description	Default
`--min-obs-rate`	Exclude variants in all subjects with the observed rate of non-missing genotypes <`minObsRate`. Format: `--min-obs-rate <minObsRate>` Example: `--min-obs-rate 0.8` Valid setting: [float] 0.0 ~ 1.0	[OFF]
`--min-obs-rate-case`	Exclude variants in cases with the observed rate of non-missing genotypes <`minObsRate`. Format: `--min-obs-rate-case <minObsRate>` Example: `--min-obs-rate-case 0.8` Valid setting: [float] 0.0 ~ 1.0	[OFF]
`--min-obs-rate-control`	Exclude variants in controls with the observed rate of non-missing genotypes <`minObsRate`. Format: `--min-obs-rate-control <minObsRate>` Example: `--min-obs-rate-control 0.8` Valid setting: [float] 0.0 ~ 1.0	[OFF]

Hardy-Weinberg Equilibrium¶

Option	Description	Default
`--hwe`	Exclude variants in all subjects with the Hardy-Weinberg test p value <= `pThreshold`. Format: `--hwe <pThreshold>` Example: `--hwe 1E-5` Valid setting: [double] 0.0 ~ 1.0	[OFF]
`--hwe-case`	Exclude variants in cases with the Hardy-Weinberg test p value <=`pThreshold`. Format: `--hwe-case <pThreshold>` Example: `--hwe-case 1E-5` Valid setting: [double] 0.0 ~ 1.0	[OFF]
`--hwe-control`	Exclude variants in controls with the Hardy-Weinberg test p value <=`pThreshold`. Format: `--hwe-control <pThreshold>` Example: `--hwe-control 1E-5` Valid setting: [double] 0.0 ~ 1.0	[OFF]

API Usage

KGGA adopts workflow-based programming, in which an analysis is divided into multiple and concatenated tasks. A programmer can call the Java API functions of the data select package in KGGA to process genotype data for quality control, basic statistics, and format conversion. The following are two examples.

Select variants with good quality¶

// Import necessary libraries and classes for handling input/output operations, VCF file manipulation, and genomic data processing.

import edu.sysu.pmglab.executor.Executor;
import edu.sysu.pmglab.gtb.GTBManager;
import edu.sysu.pmglab.gtb.genome.coordinate.RefGenomeVersion;
import edu.sysu.pmglab.kgga.command.executor.Utility;
import edu.sysu.pmglab.kgga.command.pipeline.GeneralIOOptions;
import edu.sysu.pmglab.kgga.command.pipeline.PreprocessingPipeline;
import edu.sysu.pmglab.kgga.command.pipeline.VCFQualityControlOptions;
import edu.sysu.pmglab.kgga.command.validator.VariantFileMeta;
import edu.sysu.pmglab.kgga.io.InputPhenotypeFileSet;
import edu.sysu.pmglab.kgga.io.InputType;

import java.io.File;
import java.io.IOException;

  public class Example {
    public static void main(String[] args) {
    // Step 1: Create an instance of GeneralIOOptions, which holds the general input and output options for file handling.
        GeneralIOOptions ioOptions = new GeneralIOOptions();

        // Step 2: Create an instance of VCFQualityControlOptions, which specifies quality control settings for VCF files.
        VCFQualityControlOptions vcfQualityControlOptions = new VCFQualityControlOptions();

        try {
            // Step 3: Add a VCF file to the input options. This VCF file contains the genetic variant data to be analyzed.
            // The VariantFileMeta constructor takes three parameters:
            // - URL or local file path of the VCF file,
            // - Input type (VCF format in this case),
            // - Reference genome version of the input VCF file (e.g., hg19).
            ioOptions.inputGTYFiles.add(new VariantFileMeta("https://idc.biosino.org/pmglab/resource/kgg/kgga/example/assoc.hg19.vcf.gz", InputType.VCF, RefGenomeVersion.hg19));

            // Step 4: Specify the path of the subject information file. The PED file contains phenotype and sample information for each individual in the VCF file. This is necessary for linking genetic variants with phenotypic data.
            ioOptions.phenoFileSet = new InputPhenotypeFileSet("https://idc.biosino.org/pmglab/resource/kgg/kgga/example/assoc.ped");

            // Step 5: Define the output directory where the results will be stored. In this example, the workspace directory is set to "./test1".
            File workspace = new File("./test1");
            ioOptions.output = workspace;

            // Step 6: Initialize an Executor instance, which serves as the workflow controller.
            // The Executor manages and coordinates different stages of the data processing pipeline.
            Executor workflow = new Executor();

            // Step 7: Generate a file to track the execution status and the paths of files storing intermediate analysis data.
            Utility.addTrack(workflow, workspace);

            // Step 8: Run the preprocessing pipeline to generate a GTB (Genotype Block) file with annotations based on the provided inputs.
            // This step add the quality control, and sub-sample extraction tasks using the options specified in ioOptions and vcfQualityControlOptions.
            GTBManager annotationBasedGTB = PreprocessingPipeline.INSTANCE.generateAnnotationBase(ioOptions, vcfQualityControlOptions, workflow, workspace);

            // Step 9: Execute the entire workflow. This command starts the data processing and writes the selected variants and subjects
            // (after filtering and quality control) to the output directory.
            workflow.execute();

            // Step 10: Print the file path of the generated GTB file to the console, allowing the user to verify the output location.
            System.out.println("The selected variants are stored in " + annotationBasedGTB.getFile());

        } catch (IOException e) {
            // Catch and handle any I/O exceptions that may occur during file reading/writing operations.
            throw new RuntimeException(e);
        }
    }
}

Export genotypes of selected variants in the TSV format and PGEN format(plink2)¶

Two tasks can be added to the above workflow to export selected variants in the TSV format and genotypes in PGEN format(plink2).

    // Step 1: Specify the output genotype format as PLINK_PGEN, which is a format used by PLINK, 
    // a widely used tool for large-scale genotype data analysis. This option sets the desired format for exporting genotypes. Note: the codes should be put in front of the API function PreprocessingPipeline.INSTANCE.generateAnnotationBase
    ioOptions.outputGtyFormat = InputType.PLINK_PGEN;

   // Step 2: Clear existing tasks in the workflow.
   // This ensures that no previously defined tasks are retained, which could interfere with the current workflow.
   workflow.clearTasks();

    // Step 3: Set the parameter for "AnnotationBaseVariantSet" in the workflow using the file path 
    // of the previously generated GTB file (annotationBasedGTB).
    // This GTB file contains the annotated variants and serves as the input data for subsequent tasks.
    workflow.setParam("AnnotationBaseVariantSet", new File(annotationBasedGTB.getFile().getPath()));

    // Step 4: Add a task to export the variants in the GTB file to a tab-separated value (TSV) file.
    // The OutputVariants2TSVTask is responsible for extracting the variants from the GTB format 
    // and saving them in a human-readable TSV file. The parameters are:
    // - ioOptions: Contains file paths and format information,
    // - workspace: Specifies the directory where the output will be stored,
    // - false: Indicates that the export task will not be run in a parallelized manner.
    workflow.addTasks(new OutputVariants2TSVTask(ioOptions, workspace, false));

    // Step 5: Check if an output genotype format is specified. If the `outputGtyFormat` is not null,
    // proceed to export genotypes to the specified format.

    // Step 6: Add a task to export genotypes to the specified format (e.g., PLINK_PGEN).
    // The OutputGenotypes2OtherTask is responsible for converting genotypes stored in the GTB file 
    // into the specified output format, which can be any supported format like PLINK, BGEN, etc.
    workflow.addTasks(new OutputGenotypes2OtherTask(ioOptions, workspace));


    // Step 7: Execute the workflow to run all added tasks sequentially.
    // This command initiates the export of variants and genotypes, producing the desired output files 
    // (e.g., TSV and PLINK files) in the specified directory.
    workflow.execute();

Annotation

Variant Annotation & Filtration¶

About¶

The annotate module in KGGA is designed to assist with genetic variant analysis, a critical process in understanding genetic variations and their impacts on health and disease. This module provides a flexible and comprehensive approach to annotating and filtering variants, making it easier for researchers to focus on relevant genetic data from large datasets.

What the Module Does¶

The annotate module enables the rapid annotation of millions of genetic variants using one or multiple databases. It allows users to select specific fields or entire databases for annotation, leveraging tools like gnomAD for allele frequency, dbNSFP and CADD for prediction scores, and others for expression and gene feature data. This is particularly useful for large-scale genomic studies where efficiency and accuracy are paramount.

Workflow Analysis¶

The workflow is described as follows, following the manage module in KGGA:

Allele Frequency Annotation & Filtration
Annotation: This step involves adding population-level allele frequency information, such as minor allele frequency, often sourced from databases like gnomAD. For example, it may include data on how common a variant is in specific populations.
Filtration: Variants are then selected based on specific frequency criteria, such as those with a minor allele frequency below a threshold in particular population ancestries. This is crucial for identifying rare variants potentially linked to diseases.
Allele Expression Annotation & Filtration
Annotation: This adds expression level information, such as PEXT (Proportion Expression Across Transcripts) scores from gnomAD, which summarize exonic region expression across tissues based on GTEx data (gnomAD PEXT). This helps assess how variants might affect gene expression.
Filtration: Variants are filtered based on expression levels, such as selecting those with high expression in certain tissues, which can be relevant for studying tissue-specific effects.
Gene Feature Annotation & Filtration
Annotation: This determines which gene or transcript a variant affects and classifies its functional role, such as missense (changing an amino acid) or nonsense (introducing a premature stop codon). This is standard in tools like ANNOVAR or VEP, used for genomic location and impact assessment.
Filtration: Variants are selected based on specific gene features, such as those in coding regions, particular genes, or functional categories, aiding in prioritizing variants for further analysis.
Prediction Score Annotation
Annotation: This step adds prediction scores from databases like dbNSFP, CADD, and FAVOR, which provide functional or pathogenic impact assessments. For instance, CADD scores predict variant deleteriousness, while dbNSFP offers multiple prediction algorithms. The module is noted for allowing rapid, one-stop annotation of hundreds of fields from one or multiple databases, enhancing efficiency for large-scale studies.
Notably, this step does not mention filtration, suggesting it may be used in subsequent analyses rather than filtered within this module.
Region Mark Annotation
Annotation: This involves adding information from databases with genomic interval features, such as epigenomic markers (e.g., histone modifications, DNase hypersensitivity sites), to highlight variants that regulate gene expression. This is important for understanding non-coding variant effects, often overlooked in traditional analyses.
Like prediction scores, this step lacks a mentioned filtration process, possibly indicating its role is primarily annotative for later use.

Output and Format¶

After processing, the module saves the filtered variants in a custom GTB (Genotype Block) binary format, which is efficient for further analysis in other KGGA modules. It also outputs results in text format for human readability, ensuring accessibility for researchers.

annot

Basic Usage¶

java -jar kgga.jar annotate --input <input1> --input <input2> --output <output> [options]

The annotate module in KGGA is designed to streamline the process of variant quality control and gene feature annotation by default. However, if you do not require these processes, the module offers the flexibility to disable them, allowing you to tailor the annotation workflow to your specific needs.

Example

Example¶

This section provides examples of using the annotate module in KGGA, a software tool designed for geneticists. The module extracts variants and genotypes from VCF and PED files after quality control (QC), annotates them with various genomic features, and applies filters to identify high-quality, rare variants for downstream analysis.

Annotate Variants with Gene Features and Filter Based on Local Allele Frequencies¶

In this example, variants and genotypes are extracted from VCF and PED files post-QC. The variants are annotated with gene features using the RefGene and Gencode databases. Variants with common allele frequencies in the local sample (MAF > 0.05) or located outside exonic regions are excluded, yielding a subset of high-quality, rare exonic variants for further analysis.

java -Dccf.remote.timeout=60 -jar kgga.jar \
annotate \
   --input-gty-file ./example/assoc.hg19.vcf.gz \
                    refG=hg19 \
   --ped-file http://api.pmglab.top/kgga/download/example/assoc.ped \
   --output ./test/demo2 \
   --threads 6 \
   --seq-ac 1 \
   --local-maf 0~0.05 \
   --gene-model-database refgene \
   --gene-model-database gencode \
   --gene-feature-included 0~8

Key Command Options¶

–input-gty-file: Specifies the input VCF file and reference genome (e.g., hg19).
–ped-file: Provides sample metadata (e.g., case/control status).
–output: Defines the output directory.
–threads: Sets the number of processing threads.
–seq-ac: Filters variants by allele count (e.g., ≥1).
–local-maf: Excludes variants with minor allele frequency (MAF) > 0.05 in the local sample.
–gene-model-database: Specifies gene annotation databases (e.g., RefGene, Gencode).
–gene-feature-included: Limits annotations to exonic features (values 0～8).

Outputs¶

The annotated variants are saved in the output directory ./test/demo2/. The final results are stored in variants.hg38.tsv.gz, with columns including:

Header	Description
CHROM	Chromosome (hg38)
POS	Position (hg38)
REF	Reference allele
ALT	Alternative allele
GeneFeature@MarkFeatureGene	Gene feature mark
GeneFeature@MarkGeneFeature	Gene feature details
GeneFeature@GeneFeatureDetails	Detailed gene feature information
GeneFeature@HitGenes	Genes affected by the variant
GeneFeature@HitGeneFeatures	Features of the affected genes
SOURCE@hg19_CHROM	Original chromosome (hg19)
SOURCE@hg19_POS	Original position (hg19)
GTYSUM@RefHomGtyNum_ALL	Number of reference homozygotes (all samples)
GTYSUM@HetGtyNum_ALL	Number of heterozygotes (all samples)
GTYSUM@AltHomGtyNum_ALL	Number of alternative homozygotes (all samples)
GTYSUM@MissingGtyNum_ALL	Number of missing genotypes (all samples)
GTYSUM@RefHomGtyNum_CASE	Number of reference homozygotes (case samples)
GTYSUM@HetGtyNum_CASE	Number of heterozygotes (case samples)
GTYSUM@AltHomGtyNum_CASE	Number of alternative homozygotes (case samples)
GTYSUM@MissingGtyNum_CASE	Number of missing genotypes (case samples)
GTYSUM@RefHomGtyNum_CONTROL	Number of reference homozygotes (control samples)
GTYSUM@HetGtyNum_CONTROL	Number of heterozygotes (control samples)
GTYSUM@AltHomGtyNum_CONTROL	Number of alternative homozygotes (control samples)
GTYSUM@MissingGtyNum_CONTROL	Number of missing genotypes (control samples)

Intermediate Files: Files with the .gtb suffix in the ConvertVCF2GTBTask subfolder can be used for subsequent analysis.
Log Files: Processing logs are saved in the log subfolder.

Annotate Variants with Gene Features, Reference Allele Frequencies, and Deleterious Scores¶

Here, variants and genotypes are extracted from VCF and PED files after QC and annotated with gene features (RefGene, Gencode), allele frequencies from gnomAD (East Asian and African populations), and deleterious scores from dbNSFP. Variants with common allele frequencies in the local sample (AF > 0.05) or reference database (AF > 0.01), or those beyond exonic regions, are excluded, identifying rare variants with potential functional impact.

java -Dccf.remote.timeout=60 -jar kgga.jar \
annotate \
   --input-gty-file ./example/assoc.hg19.vcf.gz \
                    refG=hg19 \
   --ped-file http://api.pmglab.top/kgga/download/example/assoc.ped \
   --output ./test/demo2 \
   --threads 6 \
   --seq-ac 1 \
   --local-af 0~0.05 \
   --gene-model-database refgene \
   --gene-model-database gencode \
   --gene-feature-included 0~8 \
   --freq-database gnomad \
                   field=gnomAD_joint@EAS,gnomAD_joint@AFR \
   --db-af 0~0.01 \
   --variant-annotation-database dbnsfp

Key Command Options¶

–local-af: Filters variants by local allele frequency (e.g., 0–0.05).
–freq-database: Specifies the reference frequency database (e.g., gnomAD) and fields (e.g., EAS, AFR populations).
–db-af: Excludes variants with allele frequency > 0.01 in the reference database.
–variant-annotation-database: Adds deleterious scores from dbNSFP.

Notes¶

The gnomAD and dbNSFP databases are not included in the tutorial folder due to their large size. Download the corresponding GTB files from [XXXX] and place them in resources/gnomad/ and resources/dbnsfp/ before running the command.

Outputs¶

Results are saved in ./test/demo2/ with a format similar to the first example, including additional columns for allele frequencies and deleterious scores.

Annotate Variants with Gene Features, Allele Frequencies, Non-Coding Scores, and Epigenetic Markers¶

In this example, individuals are selected from VCF and PED files, and stricter-than-default QC is applied to their genotypes. Variants are annotated with gene features (RefGene, Gencode), allele frequencies (gnomAD), non-coding functional scores (CADD), and epigenetic markers (EpiMap Repository, H3K4me3 for sample BSS00001). The output is formatted as PLINK BED files.

java -Dccf.remote.timeout=60 -jar kgga.jar \
annotate \
   --input-gty-file ./example/assoc.hg19.vcf.gz \
                    refG=hg19 \
   --ped-file ./example/assoc.ped \
   --output ./test/demo2 \
   --threads 6 \
   --seq-ac 1 \
   --gene-model-database refgene \
   --gene-model-database gencode \
   --freq-database gnomad \
                   field=gnomAD_joint@EAS,gnomAD_joint@AFR \
   --variant-annotation-database cadd \
   --region-annotation-database name=EpiMap \
                                subID=BSS00001 \
                                marker=H3K4me3 \
   --output-gty-format PLINK_BED

Key Command Options¶

–variant-annotation-database: Adds non-coding scores from CADD.
–region-annotation-database: Specifies epigenetic annotations (e.g., EpiGenome, H3K4me3 for BSS00001).
–output-gty-format: Sets the output format to PLINK BED (.bed, .bim, .fam files).

Outputs¶

Results are saved in ./test/demo2/ in PLINK BED format, with additional files (.bim, .fam) and annotations for non-coding scores and epigenetic markers.

General Notes¶

Ensure all external databases (e.g., gnomAD, dbNSFP, CADD, EpiGenome) are downloaded and correctly placed in the resources/ directory before execution.
Adjust file paths and parameters based on your specific dataset and analysis needs.

Variant Annotation and Filtration with External Databases¶

Variant annotation and filtration with external databases enhance the analysis of genetic variations by leveraging external resources to provide additional information about variants. This process aids in filtering and prioritizing variants based on various annotations, such as allele frequencies, expression levels, gene features, prediction scores, and epigenetic markers.

Allele Frequency¶

Allele frequency annotation allows users to incorporate population-level allele frequency information into the analysis of genetic variants. For available databases, refer to the allele frequency databases. Users can also make customized annotation files according to the guidance.

Annotation Option	Description	Default
–freq-database	Specifies the reference databases for allele frequency annotation. Format: –freq-database path= field=[field1,field2,…] Example: –freq-database gnomad field=gnomAD::EAS - : Database name (e.g., gnomad). - : Path to the database (local or FTP/HTTP). If the resources folder is specified, can be omitted. - : Specific fields to include (e.g., gnomAD::EAS). If unspecified, all fields are selected.	[OFF]

Once configured, users can filter variants based on allele frequencies in the reference population.

Filtration Option	Description	Default
–db-af	Excludes variants with alternative allele frequency (AF) outside the range [min, max]. Format: –db-af ~ Example: –db-af 0.05~1.0 Valid Range: 0.0 to 1.0	[OFF]
–db-maf	Excludes variants with minor allele frequency (MAF) outside the range [min, max]. Format: –db-maf ~ Example: –db-maf 0.05~0.5 Valid Range: 0.0 to 0.5	[OFF]

Allelic Expression¶

Expression annotation allows users to incorporate allelic expression of variants into the analysis of genetic variants. Normally, only mutations that are expressed are actionable. For available databases, see the curated GTEx expression dataset.

Annotation Option	Description	Default
–exp-database	Specifies the reference databases for expression annotation. Format: –exp-database path= field=[field1,field2,…] Example: –exp-database field=	[OFF]

Once configured, users can filter variants based on their expression values.

Filtration Option	Description	Default
–exp-range	Excludes variants with standardized expression values outside the range [min, max]. Format: –exp-range ~ Example: –exp-range 0.05~1.0 Valid Range: 0.0 to 1.0	[OFF]

Gene Features¶

Gene feature annotation identifies which gene or transcript a variant affects and its functional role. KGGA supports three gene definition systems: RefSeq genes (refgene), GENCODE genes (gencode), and UCSC KnownGene (knowngene). Users can also provide customized gene model databases following the example format.

Annotation Option	Description	Default
–gene-model-database	Specifies the gene model database for annotation. Format: –gene-model-database path= Example: –gene-model-database refgene - : Database name (e.g., refgene). - : Path to the database (optional if resources folder is set).	refgene

Users can fine-tune gene feature annotations with the following parameters:

Annotation Option	Description	Default
–splicing-distance	Sets the base-pair distance for defining splicing junction variants. Format: –splicing-distance Example: –splicing-distance 3 Valid Setting: ≥1	3
–upstream-distance	Sets the upstream region length (bp) from the transcription start site. Format: –upstream-distance Example: –upstream-distance 1000 Valid Setting: ≥1	1000
–downstream-distance	Sets the downstream region length (bp) from the transcription end site. Format: –downstream-distance Example: –downstream-distance 1000 Valid Setting: ≥1	1000

To disable gene feature annotation:

Annotation Option	Description	Default
–disable-gene-feature	Disables gene feature annotation. Format: –disable-gene-feature	[OFF]

KGGA assigns numeric codes to gene features for easy reference:

Feature	Code	Explanation
Frameshift	0	Short indel causing a frameshift.
Nonframeshift	1	Short indel causing amino acid loss without frameshift.
Start-loss	2	Loss of start codon (ATG).
Stop-loss	3	Loss of stop codon (TAG, TAA, TGA).
Stop-gain	4	Gain of a stop codon, potentially truncating the protein.
Splicing	5	Within 3 bp of a splicing junction (adjustable).
Missense	6	Codon change leading to a different amino acid.
Synonymous	7	Codon change without amino acid alteration.
Exonic	8	Mapped to the exonic region without precise annotation.
UTR5	9	Within 5’ untranslated region.
UTR3	10	Within 3’ untranslated region.
Intronic	11	Within an intron.
Upstream	12	Within 1 kb upstream of transcription start site (adjustable).
Downstream	13	Within 1 kb downstream of transcription end site (adjustable).
ncRNA	14	Within a non-coding RNA transcript.
Intergenic	15	In intergenic region.
Monomorphic	16	Not a sequence variation (potential reference genome error).
Unknown	17	No annotation available.

Users can filter variants based on specific gene features or genes:

Filtration Option	Description	Default
–gene-feature-included	Retains variants with specified gene feature codes. Format: –gene-feature-included ,,… Example: –gene-feature-included 0~6,9,10 Valid Codes: 0 to 17	0~17
–gene-excluded	Excludes variants associated with specified genes. Format: –gene-excluded GeneSymbol1,GeneSymbol2,…	[OFF]
–gene-retained	Retains variants associated with specified genes. Format: –gene-retained GeneSymbol1,GeneSymbol2,…	[OFF]

Prediction Scores¶

KGGA provides access to a curated selection of databases for variant functional and pathogenic prediction scores.

Annotation Option	Description	Default
–variant-annotation-database	Specifies the databases for variant annotation. Format: –variant-annotation-database path= field=[field1,field2,…] Example: –variant-annotation-database dbnsfp field=dbNSFP::VEST,dbNSFP::FATHMM-XF,dbNSFP::Eigen,dbNSFP::CADD There have been four databases provided on KGGA: CADD FAVOR dbNSFP ClinVar .	[OFF]

Epigenetic Markers¶

KGGA can directly download epigenetic marker resources from the EpiGenome public domain, specifically from the EpiMap Repository.

Annotation Option	Description	Default
–region-annotation-database	Specifies the interval/regional databases for annotation. Format: –region-annotation-database subID=[] marker=[] path= field=[field1,field2,…] Example: –region-annotation-database EpiMap subID=BSS00001 marker=H3K4me3 - name: Database name (e.g., EpiMap). - subID: Subject ID from EpiMap. - marker: Epigenetic marker (e.g., H3K4me3). - path: Path to the database file (optional for EpiMap). - field: Specific fields to include. Note: For name=EpiMap, KGGSum automatically downloads data from the EpiMap Repository if not locally available.	NA

Again, following workflow-based programming, programmers can easily call the API function to annotate sequence variants. The input variants should be stored in the GTB format. The following are codes to annotate variant gene features and allele frequencies.

// This code snippet demonstrates how to configure and execute an annotation workflow using the KGGA // platform. The snippet includes setting up the annotation resources, defining input data, configuring // annotation databases, and exporting annotated variants to a TSV file format. The code also shows how // to add tasks to a workflow and execute them sequentially.

import edu.sysu.pmglab.annotation.database.DatabaseDescription;
import edu.sysu.pmglab.container.interval.FloatInterval;
import edu.sysu.pmglab.container.list.List;
import edu.sysu.pmglab.executor.Executor;
import edu.sysu.pmglab.io.file.Channel;
import edu.sysu.pmglab.kgga.command.executor.Utility;
import edu.sysu.pmglab.kgga.command.pipeline.AnnotationOptions;
import edu.sysu.pmglab.kgga.command.pipeline.AnnotationPipeline;
import edu.sysu.pmglab.kgga.command.task.OutputVariants2TSVTask;

import java.io.File;
import java.io.IOException;

public class AnnotationExample {
    public static void main(String[] args) {
     // Step 1: Create an object for annotation options.
        // The `AnnotationOptions` object is used to configure various settings related to the annotation
        // process, such as specifying databases and frequency cutoffs.
        AnnotationOptions annotationOptions = new AnnotationOptions();
        try {

            // Step 2: Set the source of annotation resources.
            // The `Channel` class is used to add local and online channels (resource paths) where
            // annotation databases and other resources are stored. These channels provide access to
            // annotation data, such as population allele frequencies or gene annotations.
            Channel.addChannel("./resources"); // Local resource path
            Channel.addChannel("https://idc.biosino.org/pmglab/resource/kgg/kgga/resources"); // Online resource URL

            // Step 3: Specify the workspace directory for the workflow.
            // The workspace is a directory that stores intermediate files and results generated during the
            // annotation and analysis processes.
            File workspace = new File("./test1");

            // Step 4: Define the number of threads for parallel processing.
            // The number of threads determines how many tasks will be processed simultaneously, allowing
            // for faster execution when running on multi-core systems.
            int threadNum = 4;

            // Step 5: Create an executor object to manage and execute the workflow.
            // The `Executor` class provides a framework to handle the various tasks required for annotation
            // and other computational steps.
            Executor workflow = new Executor();

            // Step 6: Track the workflow and workspace for output management.
            // This utility function links the workflow with the specified workspace, ensuring that
            // all intermediate files and outputs are managed properly.
            Utility.addTrack(workflow, workspace);

            // Step 7: Execute any initial tasks defined in the workflow.
            // Although there are no specific tasks at this point, this command ensures that the workflow
            // is in a clean state before adding new tasks.
            workflow.execute();
            workflow.clearTasks(); // Clear any residual tasks to prevent conflicts.

            // Step 8: Set the path to the input GTB file for annotation.
            // This file contains the genetic variants to be annotated. The file path is passed to the
            // workflow as a parameter named "AnnotationBaseVariantSet".
            String inputAnnotationBasedGTBPath = "./test1/GenerateAnnotationBaseTask/variants.annot.hg38.gtb";
            File annotationBasedGTB = new File(inputAnnotationBasedGTBPath);
            workflow.setParam("AnnotationBaseVariantSet", annotationBasedGTB);

            // Step 9: Configure the annotation databases.
            // Various annotation databases can be specified to provide additional information about
            // genetic variants, such as allele frequencies or gene annotations.
            // - `freqDatabase`: Specifies databases for allele frequency annotations. Here, the `gnomad`
            //   database is used with specific subpopulations: "EAS" (East Asian) and "AFR" (African).
            annotationOptions.freqDatabase.add(new DatabaseDescription("gnomad", new String[]{"gnomAD_joint@EAS", "gnomAD_joint@AFR"}));

            // Set the allele frequency (AF) range filter.
            // This filter restricts the annotated variants to those with allele frequencies between 0 and
            // 0.01, ensuring that only rare variants are considered.
            annotationOptions.dbAf = new FloatInterval(0, 0.01f);

            // Specify the gene annotation database.
            // The `gencode` database is used here, providing comprehensive gene annotation information.
            // Gene annotations help link genetic variants to genes and identify their functional context.
            annotationOptions.geneDatabase = List.singleton(new DatabaseDescription("gencode"));

            // Step 10: Add the annotation tasks to the workflow.
            // The `AnnotationPipeline` class builds the necessary tasks for variant annotation using the
            // specified options, workspace, and number of threads. These tasks are then added to the workflow.
            workflow.addTask(new AnnotationPipeline(annotationOptions, workspace, threadNum).build());

            // Step 11: Add a task to export annotated variants to a TSV file.
            // The `OutputVariants2TSVTask` is responsible for exporting the annotated variants stored in
            // the GTB format to a tab-separated value (TSV) file. This file can be easily viewed and
            // processed by other bioinformatics tools.
            workflow.addTask(new OutputVariants2TSVTask(workspace, threadNum));

            // Step 12: Execute the workflow to perform annotation and export tasks.
            // This command initiates the workflow, executing all tasks sequentially and producing the
            // desired output files (e.g., annotated TSV files) in the specified workspace.
            workflow.execute();

        } catch (IOException e) {
            // Handle any exceptions that occur during file I/O operations.
            throw new RuntimeException(e);
        }
 }
}

Pruning

Variant Pruning¶

About¶

The prune module in KGGA is a powerful tool designed to refine genetic datasets by selecting variants that exhibit low linkage disequilibrium (LD) while retaining those with significant phenotypic associations and functional importance. This module is particularly valuable for reducing redundancy in large-scale genetic data, ensuring that the retained variants are both statistically robust and biologically meaningful. Its primary function is LD-based pruning, which integrates functional annotations (e.g., gene features, prediction scores) and phenotypic association p-values to prioritize variants. The pruned variants and their associated genotypes serve as high-quality input for downstream analyses, such as genetic risk prediction for complex diseases like diabetes or cardiovascular conditions.

This module is ideal for researchers aiming to streamline datasets without sacrificing critical genetic signals, making it a cornerstone for studies requiring efficient yet impactful variant selection.

Workflow of the Prune Module¶

Association Testing: Identifies variants with statistically significant associations to a phenotype of interest. The module calculates the genetic association (e.g., via regression or chi-square tests) between each variant and the target phenotype, generating a p-value for each variant. Variants with p-values exceeding a user-defined threshold (e.g., –p-cut 1E-5) are pruned. This ensures that only variants with strong evidence of phenotypic relevance are retained.
LD pruning: Reduces redundancy by removing variants in high LD with others, maximizing the number of independent variants retained. Within a user-specified genomic window (e.g., 10 MB, set via –window-kb), the module identifies pairwise LD between variants using a correlation metric (e.g., r², set via –r2-cut). Variants with the highest number of LD connections (i.e., correlated with the most other variants) are prioritized for removal. This approach minimizes redundancy while preserving dataset diversity. If multiple variants have the same number of LD connections, the variant with the higher minor allele frequency (MAF) is removed. If MAFs are equal, variants are removed randomly to maintain fairness.
LD Clumping: Enhances LD pruning by incorporating variant significance and functional annotations for prioritized retention. Similar to LD pruning, variants are ranked based on user-defined criteria (e.g., p-values, functional scores) before pruning occurs. The –clump option allows users to specify prioritization fields, such as association p-values (Assoc@CCT_P) or gene annotations (GeneFeature@MarkGeneFeature).

Output and Format¶

Upon completion, the prune module generates output files in two complementary formats with coordinates, genomic features, and genotypes of retained variants: GTB (Genotype Block) binary format and text format. The GTB format is a compact, custom binary format optimized for rapid reading and integration with other KGGA modules. The text format is human-readable and has tab-separated values (TSV) for easy inspection and external use.

pruning

Basic Usage¶

To execute the prune module, use the following command structure in a terminal or shell environment:

java -jar kgga.jar prune --input <input1> --input <input2> --output <output> [options]

Example¶

This section provides examples of using the pruning module in KGGA. The module enables users to select individuals from VCF and PED files, perform quality control, and prune variants based on association p-values, LD, or a combination of both, while optionally incorporating functional annotations such as gene features.

Prune Variants According to P-values¶

In this example, individuals present in both the VCF and PED files are selected, and the allelic association of each variant is examined. Variants with an association p-value (computed using the Cauchy Combination Test) greater than 1E-2 are excluded. The retained variants and their genotypes are stored in a VCF file for subsequent analysis.

java -Dccf.remote.timeout=60 -Xmx10g -jar kgga.jar \
   prune \
   --input-gty-file https://idc.biosino.org/pmglab/resource/kgg/kgga/example/assoc.hg19.vcf.gz \
                    refG=hg19 \
   --ped-file https://idc.biosino.org/pmglab/resource/kgg/kgga/example/assoc.ped \
   --output ./test/demo3 \
   --local-maf 0.05~0.5 \
   --output-gty-format VCF \
   --allele-num 2~4 \
   --threads 4 \
   --assoc allelic,model-all \
   --p-cut 1E-2

Key Command Options¶

–input-gty-file: Specifies the input VCF file and reference genome of the VCF file(e.g., hg19).
–ped-file: Provides sample metadata (e.g., case/control status).
–output: Defines the output directory (e.g., ./test/demo3).
–local-maf: Filters variants by minor allele frequency (e.g., 0.05 to 0.5).
–output-gty-format: Sets the output format (e.g., VCF).
–allele-num: Filters variants by the number of alleles (e.g., 2 to 4).
–threads: Sets the number of processing threads (e.g., 4).
–assoc: Specifies the association methods (e.g., allelic and model-all).
–p-cut: Excludes variants with p-values greater than 1E-2.

Outputs¶

The pruned variants are saved in the specified output directory ./test/demo3/. The final results are stored in variants.hg38.tsv.gz, with columns including:

Header	Description
CHROM	Chromosome (hg38)
POS	Position (hg38)
REF	Reference allele
ALT	Alternative allele
…	Additional annotations

Intermediate Files: Files with the .gtb suffix in the ConvertVCF2GTBTask subfolder can be used for further analysis.
Log Files: Processing logs are saved in the log subfolder.

Prune Variants According to LD Only¶

In this example, individuals present in the VCF and PED files are selected, and LD-pruning is performed with a maximum LD r² of 0.1. The retained variants and their genotypes are stored in a VCF file for subsequent analysis.

java -Dccf.remote.timeout=60 -Xmx10g -jar kgga.jar \
   prune \
   --input-gty-file https://idc.biosino.org/pmglab/resource/kgg/kgga/example/assoc.hg19.vcf.gz \
                    refG=hg19 \
   --ped-file https://idc.biosino.org/pmglab/resource/kgg/kgga/example/assoc.ped \
   --output ./test/demo3 \
   --local-maf 0.05~0.5 \
   --output-gty-format VCF \
   --allele-num 2~4 \
   --threads 4 \
   --r2-cut 0.1

Key Command Options¶

–r2-cut: Sets the LD threshold for pruning (e.g., 0.1).

Outputs¶

The pruned variants are saved in ./test/demo3/ with the same file structure as described in the p-value pruning example:

Final results in variants.hg38.tsv.gz.
Intermediate .gtb files in the ConvertVCF2GTBTask subfolder.
Logs in the log subfolder.

LD-Clumping of Variants According to P-values and Gene Features¶

In this example, individuals present in the VCF and PED files are selected, and the allelic association of each variant is examined. LD-clumping is performed on variants with p-values less than 1E-5, using an LD threshold of r² = 0.1. Additionally, variants in LD with less favorable gene feature codes (i.e., those less likely to affect gene function) are removed. The retained variants and their genotypes are stored in a VCF file.

java -Dccf.remote.timeout=60 -Xmx10g -jar kgga.jar \
   prune \
   --input-gty-file https://idc.biosino.org/pmglab/resource/kgg/kgga/example/assoc.hg19.vcf.gz \
                    refG=hg19 \
   --ped-file https://idc.biosino.org/pmglab/resource/kgg/kgga/example/assoc.ped \
              pheno=disease1 \
              covar=age \
   --output ./test/demo3 \
   --local-maf 0.05~0.5 \
   --output-gty-format VCF \
   --allele-num 2~4 \
   --threads 4 \
   --p-cut 1E-5  \
   --r2-cut 0.1  \
   --clump Assoc@CCT_P,GeneFeature@MarkGeneFeature

Key Command Options¶

–p-cut: Sets the p-value threshold for selecting variants (e.g., 1E-5).
–r2-cut: Sets the LD threshold for clumping (e.g., 0.1).
–clump: Specifies the fields for ranking variants (e.g., Assoc@CCT_P for p-values and GeneFeature@MarkGeneFeature for gene features).

Outputs¶

The results are saved in ./test/demo3/ with the same structure as in previous examples:

Final results in variants.hg38.tsv.gz.
Intermediate .gtb files in the ConvertVCF2GTBTask subfolder.
Logs in the log subfolder.

General Notes¶

File Paths and Parameters: Adjust the file paths (e.g., –input-gty-file, –ped-file, –output) and parameters (e.g., –p-cut, –r2-cut, –local-maf) based on your specific dataset and analysis requirements.
Flexibility with Clumping: The –clump option allows users to prioritize variants based on p-values and gene features, ensuring that the most functionally relevant variants are retained.

Methods & Options

The Variant Prune module is designed to filter and select variants based on linkage disequilibrium (LD), association p-values, and other functional annotations. This module allows users to efficiently reduce the number of redundant variants and streamline subsequent analysis. The retained variants and their corresponding genotype information can be exported into various formats, making them suitable for multiple downstream applications.

Association¶

KGGA provides a robust framework for single-variant association tests, incorporating genetic-environment interactions and supporting multiple genetic models. To improve statistical power and robustness, particularly in small-sample studies, KGGA combines multiple association p-values at a variant using the Cauchy Combination Test.

Association Testing Methods¶

KGGA offers a variety of association tests for both binary and continuous traits, allowing flexibility based on study design and research questions. The key command-line options and their functionalities are outlined below:

Option	Description	Default
`--p-cut`	Sets the maximum p-value threshold for selecting variants for further analysis.	[OFF]
`--assoc`	Specifies the association analysis method. Format: –assoc method= [sex=y/n] [interaction=1,…] [standardBeta=y/n] [minCell=5] Example: –assoc method=allelic,model-trend,logistic-add sex=n interaction=1,3 standardBeta=n minCell=0 Note: The phenotypes and covariates are specified by `--ped-file` with `<pheno>` and `<covar>`. Otherwise, the first phenotype ( the 6th column) is used by default.	[OFF]

Association Methods (method) of `--assoc`¶

KGGA supports multiple association testing methods, each tailored to binary or continuous traits. These methods generate specific p-values and statistics, which are combined by default using the Cauchy Combination Test for a unified assessment.

Methods for Binary Traits¶

allelic: Performs a chi-square test on the 2x2 disease-by-genotype table.
Output: Assoc@Allelic_Assoc_P
model-trend: Applies the Cochran-Armitage trend test to assess trends in genotype frequencies.
Output: Assoc@Trend_Model_P
model-all: Uses Fisher’s exact test on the 2x2 table for an overall association.
Output: Assoc@Allelic_Model_P
model-dom: Tests the dominant model (e.g., DD + Dd vs. dd) using Fisher’s exact test.
Output: Assoc@Dominant_Model_P
model-rec: Tests the recessive model (e.g., DD vs. Dd + dd) using Fisher’s exact test.
Output: Assoc@Recessive_Model_P
logistic-add: Logistic regression testing the additive effect of the minor allele.
Outputs: Assoc@Logistic_Add_P, Assoc@Logistic_Add_Beta, Assoc@Logistic_Add_OR, Assoc@Logistic_Add_OR_Upper, Assoc@Logistic_Add_OR_Lower
logistic-dom: Logistic regression for the dominant model (DD + Dd vs. dd).
Outputs: Assoc@Logistic_Dom_P, Assoc@Logistic_Dom_Beta, Assoc@Logistic_Dom_OR, Assoc@Logistic_Dom_OR_Upper, Assoc@Logistic_Dom_OR_Lower
logistic-rec: Logistic regression for the recessive model (DD vs. Dd + dd).
Outputs: Assoc@Logistic_Rec_P, Assoc@Logistic_Rec_Beta, Assoc@Logistic_Rec_OR, Assoc@Logistic_Rec_OR_Upper, Assoc@Logistic_Rec_OR_Lower

Methods for Continuous Traits¶

anova: Performs an ANOVA test to compare phenotypic means across genotype groups.
Outputs: Assoc@ANOVA_P, Assoc@ANOVA_FStatistic, Assoc@ANOVA_DFB, Assoc@ANOVA_DFW
linear-add: Linear regression testing the additive effect of the minor allele.
Outputs: Assoc@Linear_Add_P, Assoc@Linear_Add_Beta, Assoc@Linear_Add_T
linear-dom: Linear regression for the dominant model (DD + Dd vs. dd).
Outputs: Assoc@Linear_Dom_P, Assoc@Linear_Dom_Beta, Assoc@Linear_Dom_T
linear-rec: Linear regression for the recessive model (DD vs. Dd + dd).
Outputs: Assoc@Linear_Rec_P, Assoc@Linear_Rec_Beta, Assoc@Linear_Rec_T

Note: KGGA employs the Cauchy Combination Test by default to combine p-values from selected methods into a single, comprehensive p-value.

Outputs: Assoc@CCT_P

This method enhances statistical power, especially in small-sample scenarios. For more details, refer to Liu & Xie (2020): Cauchy combination test: a powerful test with analytic p-value calculation under arbitrary dependency structures (Journal of the American Statistical Association, 115(529), 393-402).

Additional Association Options of `--assoc`¶

Option	Description	Default
sex	When set to y, includes sex as a covariate and prioritizes it in SNP × covariate interaction analyses.	[sex=n]
interaction	Specifies covariates for SNP × covariate interaction terms in regression models. Covariates must be included in the PED file. Example: interaction=1,3 includes interactions with the first and third covariates.	[OFF]
standardBeta	When set to y for linear regression methods, standardizes the phenotype (mean 0, unit variance), yielding standardized beta coefficients.	[standardBeta=n]
minCell	For contingency table tests (e.g., chi-square, Fisher’s exact), skips tests if any cell has fewer observations than minCell.	[minCell=0]

Examples of Association Models¶

Basic Logistic Regression (Additive Model): Command: –assoc method=logistic-add sex=y Model: Y = b0 + b1.ADD + b2.SEX + b3.A + b4.B + e
Logistic Regression with Interactions: Command: –assoc method=logistic-add sex=y interaction=1,3 Model: Y = b0 + b1.ADD + b2.SEX + b3.A + b4.B + b5.(ADD*SEX) + b6.(ADD*B) + e

Notes¶

Covariate Specification: Covariates for interaction terms must be explicitly included in the PED file.
P-value Combination: The Cauchy Combination Test integrates multiple p-values, improving robustness across tests.
Filtering with –p-cut: Use –p-cut to filter variants by p-value, with optional LD clumping via –r2-cut.

LD Pruning¶

Linkage Disequilibrium (LD) pruning is a straightforward method designed to eliminate genetic variants that exhibit high LD within a defined physical distance. This process reduces redundancy in genetic datasets by retaining variants that are as independent as possible, which is crucial for downstream genetic analyses.

LD Pruning Method¶

The LD pruning method in this context removes variants within a specified physical distance (controlled by –window-kb) if their LD exceeds a designated threshold (set by –r2-cut). The removal of variants follows a prioritized approach based on two key criteria:

Variants with More LD Connections A variant that is in LD with a greater number of other variants within the specified window is prioritized for removal. For instance, consider three variants: A, B, and C. If the LD r² values are 0.8 between A and B, 0.8 between B and C, and 0.64 between A and C, variant B—being in LD with both A and C—will be removed, while A and C are retained. This step minimizes redundancy by preserving variants with fewer LD dependencies.
Variants with Lower Minor Allele Frequency (MAF) If the first criterion does not distinguish between variants (i.e., they have an equal number of LD connections), the variant with the lower MAF is removed. This ensures that variants with higher MAF, which are typically more informative, are preferentially retained.

By applying these criteria, the method maximizes the retention of variants that are both less redundant (fewer LD connections) and more informative (higher MAF).

Command-Line Options for LD Pruning¶

The following table outlines the command-line options available for LD pruning, including their descriptions and default values:

Option	Description	Default
–r2-cut	Specifies the LD threshold (r²) for pruning. Variants with an LD r² value exceeding this threshold within the defined window are candidates for removal. Format: –r2-cut Example: –r2-cut 0.8	[OFF]
–window-kb	Defines the size of the sliding window (in kilobases) for LD pruning. Variants separated by a distance greater than this window are assumed to be in linkage equilibrium, and their LD coefficients are not computed to reduce computational effort. Format: –window-kb Example: –window-kb 10000	10000

Notes¶

Enabling LD Pruning: The –r2-cut option must be explicitly set to activate LD pruning; otherwise, this step is bypassed.
Window Size Impact: The –window-kb parameter determines the maximum distance over which LD is evaluated. Larger windows may yield a more thorough LD assessment but will increase computational time.

LD Clumping¶

LD Clumping in KGGA is an advanced linkage disequilibrium (LD) pruning method designed to reduce redundancy in genetic datasets by retaining variants with favorable characteristics. Unlike conventional LD clumping methods, such as those implemented in PLINK, which prioritize variants solely based on smaller p-values, KGGA’s LD clumping offers greater flexibility. It allows users to consider both association p-values and functional impact (e.g., gene features, CADD scores) when deciding which variants to retain. This makes it a powerful tool for genetic studies where both statistical significance and biological relevance are critical.

How LD Clumping Works¶

Within a specified LD pruning window (controlled by –window-kb), variants are sorted according to user-defined fields (specified via –clump). The clumping process starts with the variant having the most favorable value (e.g., the smallest p-value or highest functional score) and proceeds sequentially to variants with less favorable values. Variants in LD with a more favorable variant are pruned if they exceed the LD threshold.

Prioritization Criteria¶

The clumping process follows a hierarchical ranking system:

Custom Fields (–clump): Variants are ranked based on the fields specified in –clump. For example, these could include association p-values (e.g., Assoc@CCT_P) or functional annotations (e.g., GeneFeature@MarkGeneFeature). Variants with more favorable values—such as smaller p-values or higher functional scores—are retained.
LD Connections and MAF: If two variants are ranked equally based on the custom fields, the variant with more LD connections (i.e., higher correlation with other variants) or a lower minor allele frequency (MAF) is prioritized for removal.
Random Selection: If all ranking conditions are equal (i.e., identical custom field values, LD connections, and MAF), one of the variants is randomly selected for pruning.

This approach ensures that the retained variants are both statistically significant and biologically meaningful, offering a more nuanced alternative to traditional LD clumping.

Options¶

The following table outlines the key command-line option for LD clumping in KGGA, including its description and default behaviors:

Option	Description	Default
–clump	Specifies the genomic feature names or fields used to rank variants during clumping. Format: –clump Example: –clump Assoc@CCT_P,GeneFeature@MarkGeneFeature Explanation: In this example, variants are ranked first by their association p-value (Assoc@CCT_P), with smaller p-values preferred, and then by gene feature codes (GeneFeature@MarkGeneFeature), with higher codes indicating greater functional impact. Variants with larger p-values or less favorable gene features are pruned within the LD window.	-

Additional LD Pruning Options¶

LD Clumping in KGGA also relies on standard LD pruning parameters to define the pruning window and LD threshold. These options are typically inherited from the broader LD pruning framework:

Option	Description	Default
–r2-cut	Specifies the LD threshold (r²) for clumping. Variants with LD r² exceeding this value are considered for removal. Format: –r2-cut Example: –r2-cut 0.8	[OFF]
–window-kb	Defines the sliding window size (in kilobases) for LD clumping. Variants beyond this distance are assumed to be in linkage equilibrium. Format: –window-kb Example: –window-kb 10000	10000

Key Advantages¶

Flexibility: KGGA’s LD clumping goes beyond p-value-based pruning by incorporating functional annotations, making it ideal for studies where biological impact matters.
Customizable Ranking: Users can define multiple fields in –clump to create a tailored hierarchy for variant retention.
Balanced Pruning: The use of LD connections, MAF, and random selection as tiebreakers ensures a robust and fair pruning process when custom rankings are insufficient.

Again, following workflow-based programming, programmers can easily call the API function to annotate sequence variants. The input variants should be stored in the GTB format. The following are codes to annotate variant gene features and allele frequencies.

// This code snippet demonstrates how to configure and execute an annotation workflow using the KGGA // platform. The snippet includes setting up the annotation resources, defining input data, configuring // annotation databases, and exporting annotated variants to a TSV file format. The code also shows how // to add tasks to a workflow and execute them sequentially.

import edu.sysu.pmglab.container.interval.FloatInterval;
import edu.sysu.pmglab.executor.Executor;
import edu.sysu.pmglab.gtb.GTBManager;
import edu.sysu.pmglab.gtb.genome.coordinate.RefGenomeVersion;
import edu.sysu.pmglab.kgga.command.executor.Utility;
import edu.sysu.pmglab.kgga.command.pipeline.GeneralIOOptions;
import edu.sysu.pmglab.kgga.command.pipeline.LDPruneOptions;
import edu.sysu.pmglab.kgga.command.pipeline.PreprocessingPipeline;
import edu.sysu.pmglab.kgga.command.pipeline.VCFQualityControlOptions;
import edu.sysu.pmglab.kgga.command.task.LDPruningTask;
import edu.sysu.pmglab.kgga.command.task.OutputVariants2TSVTask;
import edu.sysu.pmglab.kgga.command.validator.VariantFileMeta;
import edu.sysu.pmglab.kgga.io.InputPhenotypeFileSet;
import edu.sysu.pmglab.kgga.io.InputType;

import java.io.File;
import java.io.IOException;

public class AnnotationExample {
 public static void main(String[] args) {
    // Initialize general input/output options.
GeneralIOOptions ioOptions = new GeneralIOOptions();
// Initialize options for VCF file quality control.
VCFQualityControlOptions vcfQualityControlOptions = new VCFQualityControlOptions();
try {
    // Add an input VCF file (from a URL) to the IO options.
    // Specifies its type (VCF) and reference genome version (hg19).
    ioOptions.inputGTYFiles.add(new VariantFileMeta("https://idc.biosino.org/pmglab/resource/kgg/kgga/example/assoc.hg19.vcf.gz", InputType.VCF, RefGenomeVersion.hg19));

    // Step : Specify the path of the subject information file.
    // The PED file contains phenotype and sample information for each individual in the VCF file.
    // This is necessary for linking genetic variants with phenotypic data.
    ioOptions.phenoFileSet = new InputPhenotypeFileSet("https://idc.biosino.org/pmglab/resource/kgg/kgga/example/assoc.ped");

    // Set a filter for local Minor Allele Frequency (MAF) to be between 0.05 (5%) and 0.5 (50%).
    ioOptions.localMaf = new FloatInterval(0.05f, 0.5f);

    // The workspace is a directory that stores intermediate files and results generated during the
    // annotation and analysis processes.
    File workspace = new File("./test1"); // Define the workspace directory.

    // Set the number of threads to use for parallel processing.
    int threadNum = 4;

    // Initialize a workflow executor to manage and run tasks.
    Executor workflow = new Executor();

    // Add tracking or logging for the workflow, associated with the workspace.
    Utility.addTrack(workflow, workspace);

    // Generate an annotation-based Genotype Table (GTB).
    // This is a key preprocessing step that takes IO options, QC options,
    // the workflow executor, and the workspace directory as input.
    // It likely reads VCFs, applies QC, and creates a structured GTB file.
    GTBManager annotationBasedGTB = PreprocessingPipeline.INSTANCE.generateAnnotationBase(ioOptions, vcfQualityControlOptions, workflow, workspace);

    // Execute the tasks currently added to the workflow (i.e., the GTB generation).
    workflow.execute();

    // Clear any residual tasks from the workflow to prevent conflicts before adding new ones.
    workflow.clearTasks();

    // Get the file path of the generated annotation-based GTB and create a File object for it.
    File localGTBFile = new File(String.valueOf(annotationBasedGTB.getFile()));

    // Set the generated GTB file as a parameter named "AnnotationBaseVariantSet" for the workflow.
    // This makes it available to subsequent tasks.
    workflow.setParam("AnnotationBaseVariantSet", localGTBFile);

    // Initialize options for Linkage Disequilibrium (LD) pruning.
    LDPruneOptions ldPruneOptions = new LDPruneOptions();

    // Set the R-squared threshold for LD pruning. Variants with R^2 > 0.01 with another variant
    // in a window might be removed.
    ldPruneOptions.pruneR2 = 0.01f;

    // Add an LD pruning task to the workflow using the specified options, workspace,
    // a boolean flag (true, specific purpose depends on task implementation, e.g., overwrite), and thread number.
    workflow.addTask(new LDPruningTask(ldPruneOptions, workspace, true, threadNum));
    // Add a task to output the processed (e.g., pruned) variants to a TSV (Tab-Separated Values) file.
    workflow.addTask(new OutputVariants2TSVTask(workspace, threadNum));

    // Execute the task to output variants to TSV.
    workflow.execute();

} catch (IOException e) {
    // Handle any exceptions that occur during file I/O operations.
    // Wrap and rethrow as a RuntimeException for simplicity here,
    // though more specific handling might be needed in production.
    throw new RuntimeException(e);
}
 }
}

Prioritization

Region/Gene Prioritization with the prioritize Module¶

About¶

The prioritize module in KGGA is a powerful tool designed to rank genomic regions and genes based on the presence of multiple mutations. It extends beyond simply analyzing allelic distribution differences tied to phenotypes by incorporating genomic feature annotations, such as functional prediction scores and allelic frequencies from reference populations. This integrated approach allows researchers to prioritize regions or genes that are most likely to influence the phenotype under investigation, making it an invaluable asset for genetic studies targeting complex traits or diseases.

Functions in the Prioritize Module¶

The prioritize module has had three specialized functions by far, each addressing a unique aspect of prioritization. These can be applied individually or in combination, depending on the research objectives and available data.

RUNNER: Identifies and prioritizes genes or genomic regions enriched with rare sequence variants compared to expected background mutation rates. It utilizes a negative binomial regression model to evaluate the enrichment of rare variants within a gene or region. The model integrates allelic frequencies sourced from reference populations (e.g., gnomAD) to pinpoint rare variants and functional prediction scores (such as CADD or PolyPhen scores) to assess the potential impact of variants.
PubMed: Prioritize genes according to the literature in PubMed. Papers that co-mention the genes to be prioritized and specified phenotypes will be retrieved directly.

Basic Usage¶

java -jar kgga.jar prioritize --input <input1> --input <input2> --output <output> [options]

Methods & Options

The prioritize module is designed to prioritize variants, genes, or genomic regions based on allelic frequencies in reference populations and annotated functional prediction scores. Multiple methods under this module are available. We are also calling for more new methods based on this powerful infrastructure.

Prioritize genes enriched with rare sequence variants¶

RUNNER¶

The RUNNER function leverages a negative binomial regression model to prioritize genes or genomic regions enriched with rare sequence variants compared to background mutations. This methodology builds on foundational research detailed in our previous publications (WITER and RUNNER). It constructs a model to adjust mutation counts in predefined genomic regions or genes using multiple genomic features, such as reference allele frequencies, region lengths, and conservation scores. It evaluates the statistical enrichment of these adjusted counts, weighted by functional scores, under negative binomial distributions.

Citations¶

Lin Jiang, Hui Jiang, …, Pak Chung Sham, and Miaoxin Li. Deviation from baseline mutation burden provides powerful and robust rare-variants association test for complex diseases. Nucleic Acids Res. 2022 Apr 8;50(6):e34. PubMed Link
Lin Jiang, Jingjing Zheng, …, Yonghong Zhu, and Miaoxin Li. WITER: a powerful method for estimation of cancer-driver genes using a weighted iterative regression modelling background mutation counts. Nucleic Acids Res. 2019 Sep 19;47(16):e96. PubMed Link

Functionality¶

RUNNER identifies genes or regions with an unexpectedly high number of rare variants by:

Adjusting mutation counts with a negative binomial regression model based on genomic features.
Assessing enrichment using functional score weights provides a statistical measure of significance.

Options and Examples¶

Below is an example command demonstrating how RUNNER examines genes enriched with non-synonymous rare variants. Mutation counts are weighted by functional scores from dbNSFP and adjusted for evolutionary conservation, mutation frequency in reference populations (e.g., gnomAD), and gene length. The output provides enrichment p-values for each gene.

RUNNER utilizes R as its computational engine. Please ensure that KGGA is properly configured to communicate with R. We recommend using Docker to streamline this integration. For Docker configuration details, please refer to RServerDocker.

java -Xmx10g -jar kgga.jar \
 prioritize \
   -i /path/to/genotype/file \
   --ped-file /path/to/pedigree/phenotype/file \
   --output ./demo4 \
   --threads 8 \
   --variant-annotation-database dbNSFP \
   --variant-annotation-database conservation \
   --gene-feature-in 0~6 \
   --gene-model-database refgene \
   --gene-model-database gencode \
   --freq-database gnomad \
   --db-af 0~0.01 \
   --runner conservation@GC,conservation@CpG,conservation@priPhCons,conservation@mamPhCons,conservation@verPhCons,conservation@priPhyloP,conservation@mamPhyloP,conservation@verPhyloP,conservation@GerpN,conservation@GerpS,conservation@fitCons_all \
     weight=dbNSFP@VEST,dbNSFP@FATHMM-XF,dbNSFP@Eigen,dbNSFP@CADD,dbNSFP@GenoCanyon
   --r-server localhost:6300

Note: The disease phenotypes to be analyzed are specified by --ped-file with <pheno>. Otherwise, the first phenotype ( the 6th column) is used by default.

Key Command-Line Options¶

-i /path/to/genotype/file: Specifies the input genotype file.
--ped-file /path/to/pedigree/phenotype/file: Provides pedigree and phenotype data.
--output ./demo4: Sets the output directory or file prefix.
--threads 8: Defines the number of processing threads.
--variant-annotation-database dbNSFP: Uses dbNSFP for variant annotations.
--variant-annotation-database conservation: Incorporates conservation data.
--gene-feature-in 0~6: Selects gene features (e.g., exons) to analyze.
--gene-model-database refgene: Uses the refGene database for gene models.
--gene-model-database gencode: Adds the GENCODE database for gene models.
--freq-database gnomad: Employs gnomAD for allele frequency data.
--db-af 0~0.01: Filters variants with allele frequencies between 0 and 0.01.
--r-server localhost:6300: Specifies the address and port of the R server that the KGGA will connect to for running the R-based analyses.

The `--runner` Option¶

The --runner option is a composite parameter that configures the regression analysis. All specified fields must be defined in the annotation database (--variant-annotation-database).

Sub-Options¶

predictor: Database fields used as predictors in the regression (e.g., conservation scores like GC, CpG, etc.). Default: region length if unspecified.
weight: Fields used as weights for mutation counts (e.g., functional scores like VEST, CADD). If @AFGRE is included, recalibrated weights are applied; otherwise, raw weights are used.
adjustMethod: Method to adjust mutation counts (full or cut). Default: cut.
combineWeight: Combines specified scores before weighting (yes or no). Default: yes.
countOnce: Counts only the most impactful variant per subject in a region/gene (yes or no). Default: yes.
freqRatio: Excludes variants where the case-to-control allele frequency ratio is outside 1/[value] to [value]. Larger values correspond to more extreme variants. Default: 3.

Format¶

--runner <field1>,<field2> [weight=<field3>,<field4>] [adjustMethod=full/cut] [combineWeight=y/n] [countOnce=y/n] [freqRatio=3]

Example¶

--runner conservation@priPhCons,conservation@mamPhCons weight=dbNSFP@CADD combineWeight=y

Output¶

RUNNER generates three main outputs:

Regression Model Summary: Displays the fitted zero-truncated negative binomial regression model, including estimates, standard errors, z-values, and p-values for predictors.
QQ Plot: A quantile-quantile plot of p-values, saved as a PDF file.
Prioritized Genes/Regions: A text file listing p-values and statistics for each gene or region.

Example Regression Model Summary¶

2025-04-19 10:29:02 Best standardized score bin: 0.45; Optimal truncation point: 1; MLFC: 0.039896260517655234
2025-04-19 10:29:04 The zero-truncated negative-binomial regression model fitted for region-based mutation counts regression:
  Variable                  Estimate         Std. Error      z value         Pr(>|z|)
  (Intercept)               0.924802489      0.102559969     9.017187660     1.92979663121953e-19
  conservation_GC           0.239511946      0.166125249     1.441755221     0.149371439179087
  conservation_CpG         -0.126699149      0.357418197    -0.354484328     0.722975946812435
  ...
  Theta = 2.64204113945025
  Log-likelihood: -17561.7404288397
  AIC: 35155.4808576794
  Number of iterations in BFGS optimization: 26

Output File Description¶

The *.regression.prioritized.genes.txt file contains columns such as:

ID: Gene or region identifier (e.g., RET).
Region: Type of region (e.g., Exons).
Chromosome: Chromosome number (e.g., 10).
StartPosition: Region start position (e.g., 43100542).
EndPosition: Region end position (e.g., 43129617).
CaseUnweightedMutationCounts: Mutation counts in cases (e.g., 60).
conservation@GC, conservation@CpG, …: Predictor values.
ln_RegionLength: Log of region length.
ln_RefCount: Log of reference count.
ln_RegionLength_RefCount: Interaction term.
ControlUnweightedMutationCounts: Mutation counts in controls (e.g., 27).
z: Z-score (e.g., 4.33).
p: P-value (e.g., 7.57e-06).
FDRq: FDR-adjusted q-value (e.g., 0.0350).

Example Output Row¶

ID  Region  Chromosome  StartPosition   EndPosition CaseUnweightedMutationCounts    conservation@GC conservation@CpG    conservation@priPhCons  conservation@mamPhCons  conservation@verPhCons  conservation@priPhyloP  conservation@mamPhyloP  conservation@verPhyloP  conservation@GerpN  conservation@GerpS  conservation@fitCons_all    ln_RegionLength ln_RefCount ln_RegionLength_RefCount    ControlUnweightedMutationCounts z   p   FDRq
RET  Exons  10  43100542  43129617  60  0.6173  0.0970  0.5678  0.6683  0.7831  0.4373  1.1918  2.3258  5.1527  3.4412  0.5569  1.2887  4.3521  5.6084  27  4.33  7.57e-06  0.0350

iRUNNER¶

iRUNNER is an extended method from RUNNER designed to assess the synergistic rare mutation burden in gene pairs or triples, referred to as interaction units, within case samples. These interaction units represent genes with functional cooperation, such as protein interactions, gene co-expression, or shared biological pathways, and must be supplied to KGGA. iRUNNER counts mutations occurring in every gene within an interaction unit at a patient. Once mutation counts are obtained, the analysis proceeds identically to RUNNER’s process for individual genes.

Citations¶

Hui Jiang, …, and Miaoxin Li. Exploring Genetic Interactions with Rare Variants Reveals Gene Networks Susceptible to Complex Diseases. Link

Options and Examples¶

Below is an example command demonstrating how iRUNNER examines genes enriched with non-synonymous rare variants:

java -Xmx10g -jar kgga.jar \
 prioritize \
   # Identical to the above options for RUNNER 
 --interaction file=/home/lmx/MyJava/netbeans/kggseq1/resources/Coding_predict_fixed_0.5.txt.gz \
               threshold=0.8

The –interaction Option¶

The –interaction option is a composite parameter that specifies the file containing interaction units.

Sub-Options¶

file: Specifies the path to the file containing gene interactions. Each line must follow the format: gene1,gene2,…,interactionScore.
threshold: Excludes gene combinations with interaction scores below this value.
hasHead: Indicates whether the file includes a header row (yes or no). Default: yes.
sep: Defines the separator used in each line. Options: TAB, COMMA, SEMICOLON, BLANK. Default: TAB.

Format¶

--interaction file=<path> threshold=<float> [hasHead=y/n] [sep=TAB/COMMA/SEMICOLON/BLANK]

Output¶

The outputs, including the Regression Model Summary, QQ Plot, and Prioritized Genes/Regions, are similar to those generated by RUNNER.

PubMed¶

The PubMed function retrieves papers that co-mention the genes to be prioritized and specified phenotypes. It queries the PubMed database to identify relevant publications.

Functionality¶

Purpose: Retrieve paper IDs from PubMed that mention both the prioritized genes and the specified phenotypes.
Query Mechanism: Uses keywords for phenotypes and a specified query field to search for co-mentions in PubMed.

Options¶

Option	Description	Default
–pubmed-mining	Specifies phenotype keywords and the query field type for retrieving paper IDs from PubMed that co-mention the prioritized genes and phenotypes. Phenotype keywords are separated by + (e.g., disease+name1). By default, this function is inactive. If activated, the default query field is ‘Title/Abstract’. Another available field is ‘Text+Word’. Format: –pubmed-mining , [field=Title/Abstract] Example: –pubmed-mining Hirschsprung This command queries PubMed for papers that mention both the prioritized genes and “Hirschsprung” in the title or abstract.	[OFF]

Connection

Connection to third-party tools or platforms¶

About¶

The Connect module in KGGA enables integration with external tools and platforms for additional genetic analyses. It leverages KGGA’s robust capabilities for cleaning, annotating, and preprocessing large-scale genotype data. Processed data is stored in efficient formats —such as PLINK’s BED or PGEN binary files on disk —or maintained as optimized computing objects in RAM. These outputs serve as inputs for third-party tools or platforms, facilitating follow-up genetic analysis of complex traits or diseases.

Functions in the Connect Module¶

The Connect module currently supports four specialized functions, each seamlessly integrated with a specific tool or platform:

PLINK2: Generates genotypes and phenotypes in PLINK’s PGEN format after preprocessing, enabling direct analysis with PLINK2.
PLINK: Produces genotypes and phenotypes in PLINK’s BED format after preprocessing, supporting analysis with PLINK (version ≤ 1.9).
GCTA: Creates genotypes and phenotypes in PLINK’s BED format after preprocessing, allowing direct analysis with GCTA.
Python: Provides Java APIs to access computing objects in RAM, enabling users to perform custom analyses in Python. This option requires proficiency in both Java and Python programming.

Basic Usage¶

The following command demonstrates how to preprocess data and connect to a third-party tool:

java -jar kgga.jar <clean/annotate/prune> \
     --input <input1> \
     --input <input2> \
     --output <output> \
     [options] \
     --plink2/--plink/--gcta [tools' options]

Key Parameters¶

: Specifies the preprocessing task (e.g., cleaning, annotating, or pruning genotype data).
–input –input : Defines input files for preprocessing.
–output : Sets the output file or directory for processed data.
[options]: Additional parameters for preprocessing (specific to the task).
–plink2/–plink/–gcta: Selects the third-party tool for analysis, generating compatible output formats and launching the tool directly.

Methods & Options

Connection to Third-Party Tools or Platforms¶

KGGA’s Connect module facilitates seamless integration with four third-party tools and platforms—PLINK2, PLINK, GCTA, and Python—for advanced genetic analyses. By leveraging KGGA’s powerful data preprocessing capabilities, such as cleaning and annotating large-scale genotype data, the module generates efficient output formats (e.g., PLINK’s BED or PGEN) or maintains data in RAM for direct use by these tools.

PLINK2¶

The PLINK2 integration allows KGGA to produce cleaned genotypes in PLINK’s PGEN format and directly launch PLINK2 for analysis, such as regression using its generalized linear model (–glm). For additional PLINK2 analysis options, refer to the PLINK2 documentation.

Example Usage¶

java -Djava.library.path="$(pip3 show jep | grep Location | awk '{print $2"/jep"}')" \
   -Xmx10g -jar kgga.jar \
   clean \
   --input-gty-file ./example/assoc.hg19.vcf.gz refG=hg19 \
   --ped-file ./example/assoc.ped \
   --output ./test/demo1 \
   --threads 4 \
   --allele-num 2~4 \
   --seq-ac 1 \
   --min-obs-rate 0.9 \
   --hwe 1E-5 \
   --plink2 "--glm genotypic interaction --covar ./example/cov.plink2.txt --parameters 1-4,7" \
            path=./tools/plink2

The –plink2 Option¶

The –plink2 option is a composite parameter that specifies the path to the PLINK2 executable and its analysis options.

path: Defines the directory containing the PLINK2 executable.
Analysis Options: Specifies PLINK2 parameters, such as –glm for regression analysis, as shown in the example.

Note: The phenotypes and covariates are specified by --ped-file with <pheno> and <covar>. Otherwise, the first phenotype ( the 6th column) is used by default.

PLINK¶

The PLINK integration generates cleaned genotypes in PLINK’s BED format and launches PLINK (version ≤ 1.9) for analysis, such as regression using its generalized linear model (–glm). For additional PLINK analysis options, refer to the PLINK documentation.

Example Usage¶

java -Xmx10g -jar kgga.jar \
   clean \
   --input-gty-file ./example/assoc.hg19.vcf.gz refG=hg19 \
   --ped-file https://idc.biosino.org/pmglab/resource/kgg/kgga/example/assoc.ped \
   --output ./test/demo1 \
   --threads 4 \
   --allele-num 2~4 \
   --seq-ac 1 \
   --min-obs-rate 0.9 \
   --hwe 1E-5 \
   --plink "--glm genotypic interaction --covar ./example/cov.txt --parameters 1-4,7"

The –plink Option¶

The –plink option is a composite parameter that configures the path to the PLINK executable and its analysis options.

path: (Optional) Specifies the directory containing the PLINK executable. If omitted, KGGA assumes PLINK is in the system path.
Analysis Options: Defines PLINK parameters, such as –glm for regression analysis, as shown in the example.

Note: The phenotypes and covariates are specified by --ped-file with <pheno> and <covar>. Otherwise, the first phenotype ( the 6th column) is used by default.

GCTA¶

The GCTA integration produces cleaned genotypes in PLINK’s BED format and launches GCTA to perform analyses, such as creating a genetic relationship matrix (GRM) between pairs of individuals from retained variants. For additional GCTA analysis options, refer to the GCTA documentation.

Example Usage¶

java -Xmx10g -jar kgga.jar \
   clean \
   --input-gty-file ./example/assoc.hg19.vcf.gz refG=hg19 \
   --ped-file https://idc.biosino.org/pmglab/resource/kgg/kgga/example/assoc.ped \
   --output ./test/demo1 \
   --threads 4 \
   --allele-num 2~4 \
   --seq-ac 1 \
   --min-obs-rate 0.9 \
   --hwe 1E-5 \
   --gcta "--make-grm --thread-num 4" \
         path=./tools/gcta64

The –gcta Option¶

The –gcta option is a composite parameter that specifies the path to the GCTA executable and its analysis options.

path: Defines the directory containing the GCTA executable.
Analysis Options: Specifies GCTA parameters, such as –make-grm for generating a GRM, as shown in the example.

Note: The phenotypes and covariates are specified by --ped-file with <pheno> and <covar>. Otherwise, the first phenotype ( the 6th column) is used by default.

Python¶

The Python integration enables users to access cleaned genotype data in GTB format through KGGA’s Java APIs, allowing direct manipulation in RAM for custom Python analyses. This requires proficiency in both Java and Python programming.

Example Usage¶

The following Java code demonstrates how to read cleaned genotypes in GTB format and reformat them for use in a Python function:

// Initialize a global Python interpreter instance.
GlobalPythonInterpreter python = new GlobalPythonInterpreter();

// Initialize a GTBReader to read from a specific GTB (Genotype Block) file.
// This file likely contains genetic variant data.
GTBReader reader = new GTBReader("./test1/GenerateAnnotationBaseTask/variants.annot.gty.hg38.gtb");

// Get the total number of variants from the reader.
int varNum = (int) reader.numOfVariants();

// Initialize an array to store variant coordinates (e.g., "chr1:12345").
String[] coordinates = new String[varNum];

// Initialize a 2D array to store genotype data for each variant.
// The second dimension will be sized per variant later.
int[][] genotypes = new int[varNum][];

// Initialize a counter for the current variant ID/index.
int varID = 0;

// Loop through each variant in the GTB file.
while (reader.hasNext()) {
    // Read the next variant record.
    Variant variant = reader.read();

    // Get the genotype codes for the current variant.
    // tmpGty is likely a 2D array, possibly [alleles_per_sample][samples] or [ploidy_level][samples].
    // For diploid organisms, it's often [2][number_of_samples], where tmpGty[0] is the first allele
    // and tmpGty[1] is the second allele for each sample.
    int[][] tmpGty = variant.getGenotypes().getGenotypeCodes();

    // Initialize the genotype array for the current variant, sized by the number of samples.
    // Assumes tmpGty[0].length gives the number of samples.
    genotypes[varID] = new int[tmpGty[0].length];

    // For each sample, sum the two allele codes.
    // This typically converts diploid genotypes (e.g., 0/0, 0/1, 1/1) into a count of
    // the alternate allele (e.g., 0, 1, 2).
    for (int i = 0; i < tmpGty[0].length; i++) {
        genotypes[varID][i] = tmpGty[0][i] + tmpGty[1][i];
    }

    // Get and store the coordinate of the current variant as a string.
    coordinates[varID] = variant.getCoordinate().toString();

    // Increment the variant ID/index.
    varID++;
}

// Close the GTBReader to release resources.
reader.close();

// --- Python Interaction ---

// Execute Python code: import the NumPy library as 'np'.
python.exec("import numpy as np");

// Pass the Java 'genotypes' array to Python as a variable named 'gty'.
python.setValue("gty", genotypes);

// Pass the Java 'coordinates' array to Python as a variable named 'var'.
python.setValue("var", coordinates);

// Execute Python code: call a Python function named 'prs'.
// This function likely uses the 'gty' (genotypes) and a 'model' variable.
// 'model' is assumed to be defined or loaded within the Python environment/script.
// 'prs' could stand for Polygenic Risk Score calculation.
python.exec("prs(gty, model)");

// Close the Python interpreter and release its resources.
python.close();

This code reads GTB data, extracts genotypes and coordinates, and passes them to a Python function (prs) for analysis.

Simulation

Phenotype and gene expression simulation¶

The simulate module in KGGA is designed to simulate phenotypes (including binary traits and quantitative traits) based on real genotype data. This module is particularly suitable for methodological studies in which multiple genetic loci or genes influence a phenotype under polygenic models. An advantage of using real genotypes for simulation is that it maintains the natural genetic structure for more realistic simulations, where linkage disequilibrium (LD) patterns, allele frequencies, and gene boundaries are preserved. The condition for such a simulation strategy is that a large sample (e.g., n > 500,000) is available, as this is required for accurately controlling a given heritability.

About¶

KGGA’s simulate module leverages high-performance computing to generate synthetic phenotypes from large-scale genotype datasets. It supports polygenic models where genetic effects are distributed across numerous variants, enabling researchers to test statistical methods, evaluate power in GWAS, or model complex trait architectures. By integrating with KGGA’s other modules, users can preprocess genotypes to focus on specific variant sets (e.g., rare variants or gene regions) before simulation.

What the Module Does¶

The simulate module enables rapid simulations with large genotype datasets. It allows users to:

Simulate a binary or quantitative phenotype given heritability, sample sizes, and sample numbers.
Simulate two related phenotypes given causal effects, heritability, sample sizes, and sample numbers.
Simulate space-time specific expression of genes and phenotypes regulated by the genes given heritability, gene number, specific expression-related parameters, sample sizes, and sample numbers.

These features support advanced scenarios, such as modeling pleiotropy or spatiotemporal gene regulation in transcriptomic studies.

Workflow of the Simulate Module¶

This module follows a structured workflow, beginning with data preparation and culminating in phenotype simulation. The key stages are as follows:

Input and Preprocessing: The module accepts genotypes from whole-genome variants in VCF or GTB formats. Before simulation, users can leverage other KGGA modules to perform quality control (clean), annotate variants with gene and genomic features (annotate), and reduce redundancy based on linkage disequilibrium (prune).
Phenotype Simulation: Given the heritability of phenotypes or genes, KGGA will simulate phenotypes for all input subjects with genotypes and gene expression (if required). Simulations account for polygenic risk, environmental noise, and user-defined parameters to ensure realistic variance partitioning.
Phenotype Sampling: The module will randomly assign subjects with simulated phenotypes into multiple samples according to the preset sample size and sample number. Subjects are randomly drawn without replacement, allowing for replication studies or bootstrapping analyses.

Output and Format¶

Upon completion, the simulate module generates multiple output files designed for subsequent analysis.

Case/Control Samples: A number of pedigree files, each for a sample, containing case/control subjects. The number is equal to the sample number. Files are in PED format for compatibility with tools like PLINK.
Quantitative Phenotype Samples: A number of pedigree files, each for a sample, containing subjects with quantitative traits. The number is equal to the sample number. Phenotypes are standardized (mean=0, variance=1) unless otherwise specified.
Gene Expression Profiles: A number of tab-separated (TSV) files containing the simulated gene expression. Typically, each row represents a gene (labeled by gene symbol), and each column represents an expression condition, such as a tissue space grid (for spatial transcriptomic data) or a time grid (temporal transcriptomic data). Files include metadata headers for heritability and simulation parameters.

All outputs are stored in the specified output directory, with optional compression (e.g., .gz) for large files.

Basic Usage¶

To execute the predict module, use the following command structure in a terminal or shell environment:

java -jar kgga.jar simulate --input <input1> --input <input2> --output <output> [options]

Simulate Methods & Options¶

About¶

The simulate module is a powerful and flexible tool designed for generating synthetic phenotypes from real genotype data. This functionality is crucial for a wide range of applications in genetic research, including but not limited to: evaluating the statistical power of new analytical methods, validating the performance of association testing frameworks, understanding complex genetic architectures, and exploring causal relationships between traits (e.g., for Mendelian Randomization studies).

The module provides two sophisticated simulation frameworks:

QTL-based Simulation: Generates phenotypes based on quantitative trait loci (QTLs) using established principles of quantitative genetics and the liability threshold model. It can simulate single traits, as well as causally related traits.
Gene Expression-Mediated Simulation: Simulates phenotypes through a hierarchical model where genetic variants (eQTLs) first influence gene expression levels, which in turn determine the final complex trait. This framework can model both spatial and temporal (developmental) gene expression patterns.

By leveraging real genotype data from VCF or KGGA’s native GTB format, the simulate module produces biologically plausible datasets that capture the linkage disequilibrium and allele frequency patterns of the source population.

Phenotype Simulation Based on Quantitative Trait Loci (QTLs)¶

This simulation framework operates on the principle that phenotypic variation arises from the combined effects of genetic variation and environmental factors. The simulation uses real genotypes to model the genetic component. Given a user-defined heritability, the environmental variance is calculated and applied to generate a continuous quantitative phenotype. For disease traits, this quantitative value is treated as a liability score, and a threshold corresponding to the specified disease prevalence is used to assign case/control status.

Example 1: Simulate a Single Phenotype¶

This example demonstrates how to simulate a single disease trait based on a set of 150 QTLs with varying effect models. The total heritability is set to 25%, and the disease prevalence is 1%. The underlying mathematical models can be viewed here.

java -Xmx10g -jar kgga.jar simulate \
   --input-gty-file ./variants.maf05.hg38.gtb \
   --output ./test/simulate_single_pheno \
   --threads 20 \
   --min-obs-rate 0.4 \
   --allele-num 2~4 \
   --local-maf 0.05~0.5 \
   --r2-cut 0.1 \
   --phenotype-variants qtlPattern=50@1,50@2,50@3 \
                        heritability=0.25 \
                        prevalence=0.01 \
                        confounding=0.05 \
   --sampling-group groupNum=10 \
   --sampling sampleSize=6000 \
              caseProportion=0.5

–phenotype-variants: Configures the genetic architecture of the simulated phenotype, including the number and effect models of QTLs, heritability, prevalence (for binary traits), and confounding variance.
–sampling: Specifies the number of replicate datasets to generate and the desired number of cases and controls to sample for each replicate.
–r2-cut: Sets the maximum Linkage Disequilibrium (LD) r² allowed between selected QTLs, ensuring they are relatively independent.

This example simulates two phenotypes where the first trait (exposure) has a causal effect on the second (outcome). This is particularly useful for generating data to test methods like Mendelian Randomization. The underlying mathematical models can be viewed here.

java -Xmx10g -jar kgga.jar simulate \
   --input-gty-file ./variants.maf05.hg38.gtb \
   --output ./test/simulate_causal_pheno \
   --threads 20 \
   --local-maf 0.05~0.5 \
   --r2-cut 0.1 \
   --phenotype-variants qtlPattern=150@1 \
                        heritability=0.25 \
                        prevalence=0.01 \
                        confounding=0.05 \
   --phenotype-variants qtlPattern=150@1 \
                        heritability=0.25 \
                        prevalence=0.01 \
                        confounding=0.05 \
   --phenotype-causal causalEffect=0.2 \
                      causalHeritability=0.05 \
                      pleiotropyIVProportion=0.3 \
                      directPleiotropyHeritability=0.02 \
                      randPleiotropyHeritability=0.02 \
   --sampling-group groupNum=10 \
                   overlapping=0 \
   --sampling sampleSize=10000 \
              caseProportion=0.5 \
   --sampling sampleSize=10000 \
              caseProportion=0.5

–phenotype-causal: Specifies the causal effect of the first simulated phenotype (exposure) on the second (outcome). It also defines the proportion of the exposure’s genetic variance that contributes to this causal pathway.
–sampling: Set the sample size and the proportion of case individuals within a sample.
–sampling-group: Defines the number of sample groups and the proportion of individuals that are shared between the samples within a group. The number of samples within a group is determined by the number of ‘–sampling’ options.

Full Options for QTL-based Simulation¶

Option	Description	Default
–phenotype-variants	Configures the genetic architecture for a phenotype. This option can be specified multiple times for simulating multiple traits. Format: qtlPattern=[spec] heritability=[float] prevalence=[float] confounding=[float] - qtlPattern: A comma-separated list specifying the number and effect model of QTLs (e.g., n1@model1,n2@model2). Three effect models are supported: 1 (additive), 2 (pairwise interaction), and 3 (three-way interaction). - heritability: The proportion of phenotypic variance explained by all specified QTLs (range: 0-1). - prevalence: The disease prevalence for binary traits (range: 0-1). If omitted, a quantitative trait is simulated. - confounding: The proportion of variance explained by a common confounding factor (range: 0-1). Example: qtlPattern=50@1,50@2,50@3 heritability=0.25 prevalence=0.01 confounding=0.05	[OFF]
–sampling	Specifies the sampling size from the full simulated population. Format: sampleSize=[int] caseProportion=[float] -sampleSize: The number of total subjects included in each sample. -caseProportion: The cases among the all subjects in each sample. Note: The sampled phenotypes are written to separate PED files, suitable for direct use in other KGGA modules.	[OFF]
–sampling-group	Specifies the sampling number and overlapping rates among samples. One group contains at least one sample. The number of samples in each group is determined by the number of ‘–sampling’. Format: groupNum=[int] overlapping=[float] - groupNum: The number of replicate sample groups to draw. - overlapping: The proportion of case subjects in each sample. A value of 0.5 means 50% overlap.
–phenotype-causal	Specifies a causal relationship between two consecutively defined phenotypes. Format: causalEffect=[float] causalHeritability=[float] - causalEffect: The strength of the causal effect from the exposure to the outcome (e.g., -1 to 1). - causalHeritability: The proportion of the exposure’s genetic variance that mediates the causal effect. - pleiotropyIVProportion: The proportion of correlated pleiotropic variants. - directPleiotropyHeritability: Variance proportions for directional pleiotropic variants. This is used to set the corrected horizontal pleiotropic effects. - randPleiotropyHeritability: Variance proportions for random pleiotropic variants. This is used to set the uncorrected horizontal pleiotropic effects. Note: The two values must be either zero or non-zero together.	[OFF]
–r2-cut	Specifies the maximum LD r² for QTL selection. QTLs in high LD with an already selected QTL will be excluded. Format: [float] Example: –r2-cut 0.1	1.0

Phenotype Simulation Based on Gene Expression Mediation¶

This advanced framework simulates phenotypes via a biologically inspired hierarchical model: Genotype -> Gene Expression -> Phenotype. It first models the expression of specified genes as a quantitative trait influenced by local eQTLs. It then uses these simulated gene expression levels as predictors for the final complex phenotype. This approach is ideal for investigating transcriptomic mediation and can simulate data for both spatial and temporal expression profiles.

Example 3: Simulate a Phenotype Mediated by Spatially-Variable Gene Expression¶

This command simulates spatial transcriptomic data across 10 chips. It models 500 genes with expression patterns concentrated in specific “hot spots” and subsequently simulates a disease phenotype where 250 of these genes act as drivers. Here are the underlying mathematical models on spatial transcriptomic profiles and eQTL gene expression regulating a phenotype.

java -Xmx10g -jar kgga.jar simulate \
   --input-gty-file ./variants.maf05.hg38.gtb \
   --output ./test/simulate_spatial \
   --threads 20 \
   --r2-cut 0.1 \
   --xqtl-gene-file ./eqtlgenes.txt \
   --spatial-gene chipSize=100 \
                  hotLocations=50:50,10:50 \
                  maxHotExpressionGeneNum=500 \
                  dropoutRate=0.4 \
                  chipNum=10 \
   --xqtl-gene maxQTLNum=4 \
               effectScale=1 \
               heritability=0.3 \
   --phenotype-gene maxDriverNum=250 \
                    heritability=0.15 \
                    prevalence=0.01 \
   --sampling-group groupNum=10 \
   --sampling sampleSize=6000 \
              caseProportion=0.5

–xqtl-gene-file: Provides a list of genes eligible for simulation.
–spatial-gene: Defines the parameters for simulating spatial expression profiles, including chip dimensions and gene expression patterns.
–xqtl-gene: Configures the genetic regulation of gene expression by eQTLs.
–phenotype-gene: Configures the final phenotype as a function of the simulated gene expression levels.

Example 4: Simulate a Phenotype Mediated by Temporally-Variable Gene Expression¶

This example simulates gene expression data across a developmental timeline (ages 0-100). It models genes with expression levels that change over time and uses this temporal data to simulate a final phenotype. The mathematical models can be viewed here.

java -Xmx10g -jar kgga.jar \
 simulate \
   --input-gty-file ./variants.maf05.hg38.gtb \
   --output ./test/simulateS \
   --threads 20 \
   --min-obs-rate 0.4 \
   --allele-num 2~4 \
   --local-maf 0.05~0.5 \
   --r2-cut 0.1 \
   --xqtl-gene-file /home/lmx/MyJava/idea/kgga2/eqtlgenes.txt \
   --timing-gene startAge=0 \
                 endAge=100 \
                 ageInterval=1 \
                 sampleSizePerAge=10 \
                 maxTimingGenes=500 \
                 sampleNum=10 \
   --timing-gene-model peakTime=20 \
                       scaleWidth=5 \
                       amplitude=10
   --xqtl-gene maxQTLNum=4 \
               effectScale=1 \
               heritability=0.3 \
   --phenotype-gene maxDriverNum=250 \
                    heritability=0.15 \
                    prevalence=0.01 \   
   --sampling-group groupNum=10 \
   --sampling sampleSize=6000 \
              caseProportion=0.5

Key Command-Line Options¶

--timing-gene: Sets the parameters for the temporal simulation, including the age range and sampling scheme.
--timing-gene-model: Defines the mathematical model governing the temporal expression pattern of genes (e.g., peak expression time).

Full Options for Gene Expression-Mediated Simulation¶

Option	Description	Default
`--xqtl-gene-file`	Specifies the path to a file containing a list of candidate genes for expression simulation.	[OFF]
`--xqtl-gene`	Configures the eQTL effects on gene expression. Format: `maxQTLNum=[int] effectScale=[float] heritability=[float]` - `maxQTLNum`: The maximum number of eQTLs per gene. - `effectScale`: A scaling factor for eQTL effect sizes. - `heritability`: The proportion of expression variance explained by eQTLs.	[OFF]
`--phenotype-gene`	Configures the effect of gene expression on the final phenotype. Format: `maxDriverNum=[int] heritability=[float] prevalence=[float]` - `maxDriverNum`: The maximum number of driver genes influencing the phenotype. - `heritability`: The proportion of phenotypic variance explained by the driver genes. - `prevalence`: The disease prevalence if simulating a binary trait.	[OFF]
`--spatial-gene`	Sets parameters for spatial transcriptomics simulation. Format: `chipSize=[int] hotLocations=[coord] ...` - `chipSize`: The dimension of the square chip (e.g., 100 for 100x100). - `hotLocations`: Comma-separated coordinates for centers of high-expression. - `maxHotExpressionGeneNum`: The number of genes to assign to hot spots. - `dropoutRate`: The rate of technical dropouts. - `chipNum`: The number of spatial chips to simulate.	[OFF]
`--timing-gene`	Sets parameters for temporal (developmental) expression simulation. Format: `startAge=[int] endAge=[int] ...` - `startAge`, `endAge`: The age range for simulation. - `ageInterval`: The interval between age points. - `sampleSizePerAge`: Number of individuals to simulate per age point. - `maxTimingGenes`: The number of genes with temporal expression patterns. - `sampleNum`: The number of temporal expression profiles.	[OFF]
`--timing-gene-model`	Configures the model for temporal expression. Format: `peakTime=[int] scaleWidth=[int] amplitude=[float]` - `peakTime`: The age of peak gene expression. - `scaleWidth`: The width of the expression peak. - `amplitude`: The maximum expression level.	[OFF]

Outputs¶

Simulated data is organized into task-specific subdirectories within the specified output folder.

QTL-based Simulation Output¶

Generated within a PhenotypeSimulateTask folder: * Population Phenotypes: A master file named phenotype.ped containing the simulated phenotypes for all individuals in the input genotype file. * Sampled Datasets: A series of numbered PED files (phenotype.ped.0, phenotype.ped.1, etc.), corresponding to each replicate specified by --sampling. Each file contains the sampled cases and controls for one replicate analysis.

Gene Expression-Mediated Simulation Output¶

Generated within an ExpressionPhenotypeSimulateTask folder, in addition to the phenotype files described above: * Spatial Expression Profiles: A series of text files (gene.spatial.expression.txt.0, gene.spatial.expression.txt.1, etc.), one for each simulated chip. These files contain the gene expression matrices for each spatial location. * Temporal Expression Profiles: A series of text files (gene.timing.expression.txt.0, gene.timing.expression.txt.1, etc.), one for each replicate. These files contain matrices of gene expression across the simulated age points.

Log File¶

A log file is generated summarizing the simulation parameters and key results, such as the calculated genetic and environmental variance components.

2025-09-04 12:11:27 The genetic variance and environmental variance are 95733.04 and 287199.12, respectively, to meet the heritability of 0.25.```

Prediction

Phenotype Prediction¶

About¶

The predict module in KGGA is a powerful tool designed to construct and apply predictive models for phenotypes using genotypes and covariates. This module is particularly valuable for genomic research and clinical applications, such as prioritizing individuals with high disease risk based on their genetic profiles. To enhance prediction accuracy, the module seamlessly integrates covariates like age and sex into the models, bridging the gap between genomic data and real-world outcomes.

What the Module Does¶

The predict module enables the rapid development and application of phenotype prediction models. It allows users to:

Leverage both genotype data (from VCF or GTB formats) and supplementary covariate information.
Integrate seamlessly with upstream KGGA modules (clean, annotate, and prune) to ensure high-quality input data.
Flexibly partition datasets into distinct training, testing, and prediction samples for robust model development and validation.
Utilize advanced machine learning algorithms for prediction, starting with ELAG (Ensemble Learning with Adaptive Sampling Guided by Policy Gradient).
Generate portable prediction models and detailed prediction scores for further analysis or clinical interpretation.

Workflow of the Predict Module¶

This module follows a structured workflow, beginning with data preparation and culminating in phenotype prediction. The key stages are as follows:

Input and Preprocessing: The module accepts genotypes from whole-genome variants in VCF or GTB formats. Before prediction, users can leverage other KGGA modules to perform quality control (clean), annotate variants with gene and genomic features (annotate), and reduce redundancy based on linkage disequilibrium (prune).
Sample Partitioning: A unique feature of this module is the explicit specification of sample sets. Users can define a training sample for model construction, a testing sample for model evaluation (optional), and a separate prediction sample for which phenotype predictions are desired.
Model Training: Using the specified training data, the module builds a predictive model. Currently, a method named ELAG (Ensemble Learning with Adaptive Sampling Guided by Policy Gradient) is available for disease risk prediction. Support for more methods will be added in the future.
Prediction: The trained model is then applied to the prediction sample to generate risk scores or phenotype values for each individual.

Output and Format¶

Upon completion, the predict module generates multiple output files designed for different purposes, ensuring comprehensive results for evaluation and downstream use.

Trained Prediction Model: A serialized binary file containing the complete, trained model. This portable file can be loaded for external use or to predict phenotypes on new datasets without the need for retraining.
Model Variant List: A text file (TSV) listing the specific variants and covariates that were used to build the prediction model. This file often includes feature importance scores or model weights, providing insights into the biological drivers of the prediction.
Prediction Values: A tab-separated (TSV) file containing the prediction results for the prediction sample. It typically includes individual IDs along with their calculated prediction values or risk scores, ready for further statistical analysis or individual prioritization.

Basic Usage¶

To execute the predict module, use the following command structure in a terminal or shell environment:

java -jar kgga.jar predict --input <input1> --input <input2> --output <output> [options]

Methods & Options

The predict module is designed to predict phenotypes based on genotypes and covariates. Multiple methods under this module are available. We are also calling for new methods based on this powerful infrastructure.

Polygenic Risk Prediction Using Genotype Data¶

ELAG¶

ELAG is a machine learning framework designed for polygenic risk prediction from high-dimensional genomic data. The workflow begins with genotype data that has undergone standard quality control to preserve a large variant pool. ELAG’s core is a policy-gradient-inspired algorithm that dynamically samples variant subsets, which are then used to train an ensemble of base classifiers. The collective performance of this ensemble provides a reward signal that iteratively refines the sampling policy. The final output is a trained ensemble model for predicting binary disease status (case/control).

Citations¶

Lihang Ye, Nan Lin, Yin Luo, …, Miaoxin Li. Policy gradient-guided ensemble learning for enhanced polygenic risk prediction in ultra-high-dimensional genomics. PrePrint. medrxiv

Functionality¶

ELAG enhances disease risk prediction through:

1) Adaptive variant sampling guided by a novel policy gradient algorithm that leverages feedback from machine learning classifier performance;

2) An ensemble learning framework that effectively addresses the challenges of ultra–high-dimensional genomic data;

3) Dual functionality, serving not only as a prediction tool but also as an upstream module to boost the performance of existing polygenic risk score (PRS) methods by selecting genetic variants.

4) Besides genetic variants, it can also integrate environmental factors to enhance prediction accuracy.

Options and Examples¶

Below is an example command demonstrating how ELAG can be used to train and test for genomic data. If functional annotations like CADD scores and external GWAS summary data can be used, we can also input them as external information. The output includes individual scores and the prediction performance of the test set.

ELAG utilizes Python as its computational engine. Please ensure that KGGA is properly configured to communicate with Python. We recommend using Docker to streamline this integration. For Docker configuration details, please refer to RServerDocker.

java -Xmx10g -jar kgga.jar \
    predict \
       --input-gty-file ./wgs/comm/variants.maf05.hg38.gtb \
       --ped-file ./AD/merged_all.fam \
       --output ./test/predict/AD \
       --threads 20 \
       --variant-annotation-database cadd \
                                    field=ProteinFunction@CADD_RawScore \
       --min-obs-rate 0.8 \
       --allele-num 2~4 \
       --local-maf 0.01~0.5 \
       --sum-file ./AD/gwas.summary.txt.gz \
                  cp12Cols=chromosome,base_pair_location,effect_allele,other_allele \
                  pbsCols=p_value \
                  refG=hg19 \
       --elag functionScore=ProteinFunction@CADD_RawScore \
              maxEpoch=20 \
              crossFold=5 \
              baggingNumber=10 \
              variantSampleSize=5000 \
              ignoreGty=n \
              permutNum=100 \
       --assign-sample trainingSample=./AD/merged_data_with_age_sex.fam \           testingSample=./AD/merged_test_data_with_age_sex.fam \
                       pheno=disease \
                       covar=age,sex

Note: The disease phenotypes to be analyzed are specified by --ped-file with <pheno>. Otherwise, the first phenotype ( the 6th column) is used by default.

Key Command-Line Options¶

--input-gty-file /path/to/genotype.gtb
Path to the input genotype file (e.g., GTB format).
--ped-file /path/to/pedigree_or_phenotype.fam
Provides sample IDs, pedigree information, and phenotypes.
Note: The phenotype used can be specified together with --assign-sample pheno=<name>; otherwise the first phenotype column (6th column in .fam) is used by default.
--output /path/to/output_prefix_or_dir
Output directory or file prefix for all result files.
--threads <int>
Number of CPU threads to use.
--variant-annotation-database <db> field=<field1>,<field2>,...
Selects the variant annotation source and the fields to import as features.
• Example:
--variant-annotation-database cadd field=ProteinFunction@CADD_PHRED,ProteinFunction@CADD_RawScore
--min-obs-rate <float>
Minimum per-variant call rate (proportion of non-missing genotypes) required to keep a variant.
• Example: --min-obs-rate 0.8 keeps variants with ≥80% observed genotypes.
--allele-num <low>~<high>
Keeps variants whose number of alleles falls within the given range.
• Example: 2~4 keeps bi-allelic up to quad-allelic sites.
--local-maf <low>~<high>
Filters variants by minor allele frequency computed in the current dataset.
• Example: 0.01~0.5 keeps variants with MAF between 1% and 50%.
--sum-file /path/to/sumstats.txt[.gz] cp12Cols=<chr,pos,effect_allele,other_allele> pbsCols=<pval_col> refG=<build>
Imports external GWAS summary statistics and maps the required columns; refG sets the reference genome build of the summary file.
• Example:
–sum-file … cp12Cols=chromosome,base_pair_location,effect_allele,other_allele pbsCols=p_value refG=hg19.
--assign-sample trainingSample=/path/to/train.fam testingSample=/path/to/test.fam pheno=<phenotype_name> covar=<cov1,cov2,...>
Explicitly assigns training/testing sample sets, selects the phenotype column to analyze, and specifies covariates to adjust (e.g., age,sex).

The `--elag` Option¶

The --elag option is a composite parameter that configures the prediction and configures the ELAG module (policy-gradient–guided adaptive variant sampling + bagged ensemble).

Sub-Options¶

• functionScore=<annotation_field>: annotation score used to guide sampling (e.g., ProteinFunction@CADD_RawScore)
• maxEpoch=<int>: maximum training epochs for the policy
• crossFold=<k>: k-fold cross-validation within training
• baggingNumber=<int>: number of bootstrap models in the ensemble
• variantSampleSize=<int>: number of variants sampled per iteration/epoch
• ignoreGty=y|n: whether to ignore individual-level genotypes (n means use genotypes)
• permutNum=<int>: number of phenotype-label permutations for empirical significance assessment

Format¶

--elag functionScore=<field> [maxEpoch=<int>] [crossFold=<k>] [baggingNumber=<int>] [variantSampleSize=<int>] [ignoreGty=y/n] [permutNum=<int>]

Example¶

--elag functionScore=ProteinFunction@CADD_RawScore maxEpoch=20 crossFold=5 baggingNumber=10 variantSampleSize=5000 ignoreGty=n permutNum=100

Output File Description¶

ELAG generates four main output files:

Model Save List: Saves the best model trained.
Predict Result: The output values of each individual in the test set from every predictor.
Test Metrics: The evaluation metrics of the entire test set.
Validate Metrics: The evaluation metrics of the validation set in each fold.

Example Predict Result¶

The *predict_result.tsv file contains the following columns:

IID: Individual ID.
1, 2, …, N: The output of the N-th classifier.
mean: The mean value of all classifier outputs.

IID 1   2   3   ... 50  mean 
1015905 0.3278  0.3446  0.4327  ... 0.6307  0.4914

Example Test Metrics¶

The *test_metrics.tsv file contains columns such as:

epoch: Training epoch number.
fold: Fold number; mean indicates the average across folds.
auc_roc: Area Under the ROC Curve (AUC-ROC). Higher values indicate better separation between positive and negative classes.
auc_pr: Area Under the Precision-Recall Curve (AUC-PR). More informative than AUC-ROC when classes are imbalanced.
ks: Kolmogorov–Smirnov statistic. Measures the maximum difference between cumulative distributions of positive and negative classes.
best_accuracy: Accuracy at the optimal threshold.
best_threshold: Threshold value that maximizes accuracy.
best_recall: Recall at the optimal threshold.
best_precision: Precision at the optimal threshold.
best_f1: F1 score at the optimal threshold
best_mcc: Matthews Correlation Coefficient. A balanced metric even for imbalanced datasets.

epoch   fold    sample  auc_roc auc_pr  ks  best_accuracy   best_threshold  best_recall best_precision  best_f1 best_mcc
1   mean    mean    0.7266  0.7127  0.3841  0.6890  0.5 0.6524  0.7039  0.6772  0.3791

Example Validate Metrics¶

The *validate_metrics.tsv file contains columns such as:

epoch: Training epoch number.
fold: Fold number; mean indicates the average across folds.
auc_roc: Area Under the ROC Curve (AUC-ROC). Higher values indicate better separation between positive and negative classes.
auc_pr: Area Under the Precision-Recall Curve (AUC-PR). More informative than AUC-ROC when classes are imbalanced.
ks: Kolmogorov–Smirnov statistic. Measures the maximum difference between cumulative distributions of positive and negative classes.
best_accuracy: Accuracy at the optimal threshold.
best_threshold: Threshold value that maximizes accuracy.
best_recall: Recall at the optimal threshold.
best_precision: Precision at the optimal threshold.
best_f1: F1 score at the optimal threshold
best_mcc: Matthews Correlation Coefficient. A balanced metric even for imbalanced datasets.

epoch   fold    sample  auc_roc auc_pr  ks  best_accuracy   best_threshold  best_recall best_precision  best_f1 best_mcc
1   1   mean    0.6088  0.6039  0.2014  0.5989  0.55    0.4846  0.6283  0.5472  0.2033

Changelog

2025¶

Prune module performance improvement

v1.2.1 Release

2025-10-1 Optimization

Speed up linear and logistic regression analysis in the prune module.

New simulation module

2025-10-1 Feature

Add a new module for simulation.

RUNNER bug fix

2025-8-12 Bug Fix

Fix a bug for RUNNER to handle a sample without controls.

Task tracking optimization

2025-7-3 Optimization

Update the task tracking model to reuse intermediate data across projects.

GBC library update

2025-6-3 Optimization

Update the GBC lib of KGGA.

Initial release

2025-5-1 Release

Release the first formal version of KGGA.

Brief Document

KGGA（Knowledge-based Genetic and Genomic Analyses Accelerator）¶

Introduction¶

Functionality¶

Key Advances of KGGA¶

Citations¶

System Requirements and Setup for KGGA¶

KGGA and Its Running Resources¶

System Requirements¶

Setting Up a Java Runtime Environment (JRE)¶

Installation Steps¶

Setting Up an Environment for Quick Tutorials¶

Notes¶

General Notes¶

Summary of Databases¶

Gene feature annotation¶

Variant annotation databases¶

Region annotation databases¶

Quick Start

Quick-Start Examples¶

Perform Quality Control on VCF Data and Output in PLINK BED Format¶

Example Command¶

Key Parameters¶

Output¶

Annotate Variants from VCF Data and Output in TSV Format¶

Example Command¶

Key Parameters¶

Output¶

GBC

Starting GBC 中文¶

Core Functionality: The convert Command¶

Syntax¶

Format Conversion Examples¶

Converting VCF to GTB¶

Converting VCF to PLINK-PGEN¶

Converting Multiple VCF Files to a Single GTB File¶

Subset Extraction: Filtering Genotypes¶

Example: PLINK to VCF with Subset Extraction¶

Additional Command-Line Features¶

Extended Functionalities: PLINK and BGEN Support¶

Finding the jep Path¶

Remarks¶

UKB

Managing Genotype Data on the Cloud: A Tutorial with DNANexus and UKB WES 中文¶

1. Required Software¶

2. Step-by-Step Procedure¶

2.1. Download Software Locally¶

2.2. Upload to the RAP Platform¶

2.3. Obtain Data Paths¶

2.4. Launch a Worker and Configure the Environment¶

2.5. Download and Process Data in Batches¶

2.6. Convert GTB to Other Formats (Genotype-level Analysis)¶

2.7. Annotate Variants using KGGA¶

FAQ

Frequently Asked Questions (FAQ)¶

1. Why do I need KGGA?¶

2. How do I cite KGGA?¶

3. Where do I report bugs and provide feedback on KGGA?¶

4. How can I use generative AI to assist with KGGA?¶

Detailed Document

Basic Options

Input¶

Variant and Genotype File¶

Pedigree and Phenotype File (optional)¶

Mutation Annotation Format File¶

Output¶

Path¶

Formats¶

General Setting¶

Data Cleaning & QC

Variant & Sample Cleaning¶

About¶

Workflow of the Manage Module¶

Basic Usage¶

Examples¶

Selecting a Subset of Samples and Outputting Genotypes in PLINK BED Format¶

Key Command Options¶

Outputting Genotypes in PGEN Format¶

Additional Command Option¶

Notes¶

Starting GBC 中文 ¶

Managing Genotype Data on the Cloud: A Tutorial with DNANexus and UKB WES 中文 ¶

Association Methods (method) of `--assoc`¶

Additional Association Options of `--assoc`¶