Starting GBC
GBC is now part of the KGGA toolkit. It specializes in genotype data processing, including memory, storage, and computational encodings that optimize performance for specific tasks, and in the design of efficient coordinate search algorithms. To launch GBC's command-line interface, use the following command:
java -jar kgga.jar gbc
You can access detailed documentation for all commands by adding the --help flag.
Core Functionality: The convert Command
GBC's primary command is convert, which converts between common genomic analysis file formats in parallel. It also supports filtering, LiftOver, biallelic conversion, quality control (QC), sorting, and concatenation.
Syntax
java -jar kgga.jar gbc convert <source>2<target> [options]
Examples: vcf2gtb, plink2gtb, gtb2vcf
Note: Conversion to PLINK-PGEN requires additional extensions (see "Extended Functionalities" below).
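The subcommand name is the source format, the digit 2, and the target format joined together. A minimal shell sketch of assembling a command line from that pattern is shown below. The helper name gbc_convert_cmd is hypothetical, and it only prints the command rather than running it, so it makes no claim about which format pairs GBC accepts.

```shell
# Hypothetical helper: print (do not run) a convert command for a format pair.
# Arguments: <source> <target> <input> <output>
gbc_convert_cmd() {
    src=$1 tgt=$2 input=$3 output=$4
    # The subcommand is "<source>2<target>", e.g. vcf + gtb -> vcf2gtb.
    printf 'java -jar kgga.jar gbc convert %s2%s %s -o %s\n' \
        "$src" "$tgt" "$input" "$output"
}

# Print the command for a VCF to GTB conversion (file names are placeholders).
gbc_convert_cmd vcf gtb in.vcf.gz out.gtb
```

Printing the command first is a cheap way to review a batch of conversions before launching them.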
Format Conversion Examples
Converting VCF to GTB
This example converts a VCF file into a GTB file with default quality control and specific options.
Command:
java -jar kgga.jar gbc convert vcf2gtb ~/ukb24310_c1_b6089_v1.vcf.gz --field --prune --seq-an 1~ --seq-af 0.000001~0.999999 -o /Users/suranyi/ukb24310_c1_b6089_v1.3.gtb
Details:
- Converts the input VCF file to GTB format.
- Applies default QC filters:
  - GQ >= 20
  - DP >= 8
  - MQ >= 40
  - PL >= 20
  - LPL >= 20
  - FT == PASS
  - AD_HOM_REF <= 0.05
  - AD_HOM_ALT >= 0.75
  - AD_HET >= 0.25
- --field: Given without parameters, removes the INFO and FILTER fields from the VCF.
- --prune: Removes alternate (ALT) alleles whose allele count (AC) is 0.
- --seq-an 1~: Filters by allele number range (minimum 1, no upper limit).
- --seq-af 0.000001~0.999999: Filters by allele frequency range.
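The range options above use a min~max notation in which a side may be left empty, as in 1~. The shell sketch below only illustrates how such an expression splits into its two bounds. The function name describe_range is hypothetical, it does not reproduce GBC's own parser, and whether GBC accepts an empty lower bound is an assumption.

```shell
# Illustrative only: split a "min~max" expression into its two bounds.
# An empty side is reported as "unbounded".
describe_range() {
    range=$1
    case $range in
        *"~"*) ;;                                  # must contain a tilde
        *) echo "not a range: $range" >&2; return 1 ;;
    esac
    lower=${range%%"~"*}    # text before the first tilde
    upper=${range#*"~"}     # text after the first tilde
    printf 'min=%s max=%s\n' "${lower:-unbounded}" "${upper:-unbounded}"
}

describe_range '1~'                   # prints: min=1 max=unbounded
describe_range '0.000001~0.999999'    # prints: min=0.000001 max=0.999999
```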
To disable QC, add --disable-qc. For more details on QC parameters, run:
java -jar kgga.jar gbc convert -h
Converting VCF to PLINK-PGEN
This example converts a VCF file to PLINK-PGEN format.
Command:
java -Djava.library.path=$(pip3 show jep | grep Location | awk '{print $2"/jep"}') -jar kgga.jar gbc convert vcf2plink ./ukb24310_c1_b6089_v1.vcf.gz -o ./ukb24310_c1_b6089_v1 --output-type pgen
Details:
- Requires the jep library path for PLINK-PGEN support (see "Extended Functionalities" below).
- Outputs a PLINK-PGEN file with the specified name.
Converting Multiple VCF Files to a Single GTB File
This example combines several per-chromosome VCF files into one GTB file and applies a LiftOver from hg19 to hg38 during conversion.
Command:
java -jar kgga.jar gbc convert vcf2gtb 1kg.phase3.v5.shapeit2.amr.hg19.chr*.vcf.gz -o ~/tmp/AMR.hg38.gtb --liftover hg19ToHg38
Details:
- Combines multiple chromosome-specific VCF files into one GTB file.
- Applies a LiftOver from hg19 to hg38 coordinates.
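In the command above, the chr* wildcard is expanded by the shell before GBC starts, so GBC receives one argument per matching file. If nothing matches, most shells pass the pattern through unchanged. The sketch below counts how many of the expanded arguments are existing files. The helper name count_matches is hypothetical.

```shell
# Hypothetical pre-flight check: count how many arguments are existing files.
# With an unmatched wildcard the literal pattern is passed in and counts as 0.
count_matches() {
    n=0
    for f in "$@"; do
        if [ -e "$f" ]; then
            n=$((n + 1))
        fi
    done
    echo "$n"
}

# Report how many input files the wildcard from the example resolves to.
echo "matching VCF files: $(count_matches 1kg.phase3.v5.shapeit2.amr.hg19.chr*.vcf.gz)"
```

A count of 0 before a long conversion usually means the working directory or file prefix is wrong.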
Subset Extraction: Filtering Genotypes
You can extract subsets of genotype data (e.g., specific individuals or positions) using the convert command with additional options:
- --individual <id>,<id>,...: Select specific individuals by ID.
- --pos [expression]: Filter by genomic position.
- --index-range <start>~<end>: Filter by index range.
- --allele-num <min>~<max>: Filter by allele number.
- --seq-ac <min>~<max>: Filter by allele count.
- --seq-af <min>~<max>: Filter by allele frequency.
Example: PLINK to VCF with Subset Extraction
Command:
java -Djava.library.path=$(pip3 show jep | grep Location | awk '{print $2"/jep"}') -jar kgga.jar gbc convert plink2vcf ./ukb24310_c1_b6089_v1 --input-type pgen --individual 1718672,2380098,5176706,4729017,1930596 --seq-an 1~ -o ./ukb24310_c1_b6089_v1.s5.vcf.gz
Details:
- Converts a PLINK-PGEN file to VCF.
- Filters to include only the specified individuals.
- Applies an allele number filter (--seq-an 1~).
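For longer sample lists, typing the IDs inline becomes error-prone. As a sketch, assuming the IDs are kept one per line in a text file (the file name ids.txt and the helper join_ids are hypothetical), the comma-separated form used by --individual can be built like this:

```shell
# Hypothetical helper: join sample IDs (one per line) with commas.
# Blank lines are dropped first so they do not create empty list entries.
join_ids() {
    grep -v '^[[:space:]]*$' "$1" | paste -sd, -
}

# Example use (not executed here):
#   java -jar kgga.jar gbc convert plink2vcf ... --individual "$(join_ids ids.txt)" ...
```

The sketch does not validate the IDs. It only changes how the list is written on the command line.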
Additional Command-Line Features
GBC offers several other useful commands:
Queue Merging:
java -jar kgga.jar gbc merge <file1> <file2> -o <output>
Merges two genotype files into one.
Vertical Concatenation (e.g., for chromosome files):
java -jar kgga.jar gbc concat <input> <input> ... --output <output>
Combines multiple files into a single output.
Linkage Disequilibrium (LD) Calculation:
java -jar kgga.jar gbc ld
Graphical User Interface (GUI):
java -jar kgga.jar gbc gui
Launches a visual interface for GBC.
Database Creation:
java -jar kgga.jar gbc make-database
Generates a database file from genotype data.
Note: Genomic annotation and analysis tools are integrated into KGGA (visit http://pmglab.top/kgga).
Extended Functionalities: PLINK and BGEN Support
To enable support for PLINK and BGEN formats, install the required Python libraries:
pip install jep zstandard pgenlib bgen_reader
When running GBC, specify the jep library path:
java -Djava.library.path="$(pip3 show jep | grep Location | awk '{print $2"/jep"}')" -jar kgga.jar gbc
Equivalent command with the path written out explicitly:
java -Djava.library.path=/opt/homebrew/lib/python3.13/site-packages/jep -jar kgga.jar gbc
Finding the jep Path
For Windows users or non-standard installations, determine the jep directory by running:
pip3 show jep
Locate the Location field in the output, append /jep, and use the resulting path with -Djava.library.path.
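The manual steps above can be sketched as a small shell function for Unix-like shells. The name jep_dir_from_pip_show is hypothetical. It reads the output of pip3 show jep on standard input and matches the Location field exactly, splitting on the colon and space so that install paths containing spaces stay intact.

```shell
# Hypothetical helper: read `pip3 show jep` output on stdin and print
# "<Location>/jep". Only a line whose field name is exactly "Location"
# is used, so "Editable project location" lines are ignored.
jep_dir_from_pip_show() {
    awk -F': ' '$1 == "Location" { print $2 "/jep"; exit }'
}

# Example use (not executed here):
#   JEP_DIR=$(pip3 show jep | jep_dir_from_pip_show)
#   java -Djava.library.path="$JEP_DIR" -jar kgga.jar gbc
```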
Remarks
- CCF Architecture (Version 4.x):
  - Features a more flexible row-column block design and fine-grained parallelization for better performance.
  - Optimized for low memory usage (e.g., encoding and filtering UK Biobank whole-genome genotypes at 1 GB per thread).
  - Incompatible with version 3.x file formats.
- Development Focus:
  - KGGA and CCF prioritize systematic engineering improvements via Java APIs.
  - Command-line tools are supplementary and may have usability limitations, to be addressed in the next minor release (ccf-4.6).
- File Merging:
  - Currently supports basic merging based on coordinate/REF consistency and standard bases (ATCG alleles).
  - Future updates will enhance performance, handle multi-allelic sites, and add advanced merging modes (e.g., intersection, union, complement, left alignment).