Build GTB for VCF
GTB is a format we designed for compressing and storing large-scale genotype data from haploid and diploid species. It supports varying allele numbers, arbitrary chromosome tags, and both phased and unphased genotypes. Building a GTB archive requires input that conforms to the VCF file specification. Because genotype quality is rarely used in downstream analysis, GBC does not store it by default. Instead, GBC applies quality control to variant sites and genotypes during the build, so that only sites and genotypes of reliable quality are stored. To build a GTB archive from VCF files, run the following command:
java -jar gbc.jar vcf2gtb <input> [output] [options]
Once the archive has been built, GBC checks whether the output GTB file is ordered by coordinates. If it is not, GBC calls GTBSorter to sort it. To store VCF files from other species, see Chromosome Tag Declaration.
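To make "ordered by coordinates" concrete, the following shell sketch checks whether a VCF file is already sorted before conversion. It is only an illustration on the VCF text: the function name `vcf_is_sorted` is hypothetical, and the rule it applies (each chromosome forms one contiguous run, and positions never decrease within that run) is an assumption rather than GBC's actual check, which is performed on the GTB output.

```shell
#!/usr/bin/env bash
# Hypothetical helper, not part of GBC: exits 0 if the VCF body is
# coordinate-ordered, non-zero otherwise.
# Assumed rule: each chromosome appears as one contiguous run and POS never
# decreases inside that run. GBC's own check may differ.
vcf_is_sorted() {
  local vcf="$1"
  # gzip -cdf passes uncompressed input through unchanged, so both
  # .vcf and .vcf.gz inputs work.
  gzip -cdf -- "$vcf" | awk -F'\t' '
    /^#/ { next }                        # skip meta and header lines
    $1 != chrom {
      if ($1 in seen) { bad = 1; exit }  # chromosome reappears after another one
      seen[$1] = 1; chrom = $1; last = 0
    }
    ($2 + 0) < last { bad = 1; exit }    # position went backwards
    { last = $2 + 0 }
    END { exit bad + 0 }
  '
}
```

A possible use is `vcf_is_sorted ./assoc.hg19.vcf.gz && echo "already sorted"`. Since GBC sorts unordered output itself, such a pre-check is optional.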
[!NOTE|label:Example I|style:callout]
Use GBC to build the archive for the example file
https://pmglab.top/gbc/download/assoc.hg19.vcf.gz
and set the following parameters:
- Store phased genotypes;
- Set the compression level to 16;
- Lift over the reference genome version from hg19 to hg38.
The commands for this task are as follows:
# Download the data file
wget https://pmglab.top/gbc/download/assoc.hg19.vcf.gz -O assoc.hg19.vcf.gz
# Run directly in the terminal
java -jar gbc.jar vcf2gtb ./assoc.hg19.vcf.gz ./assoc.hg38.gtb \
  -p -l 16 --liftover hg19ToHg38
# Run it using docker
docker run -v `pwd`:`pwd` -w `pwd` --rm -it -m 4g gbc \
  vcf2gtb ./assoc.hg19.vcf.gz ./assoc.hg38.gtb \
  -p -l 16 --liftover hg19ToHg38
[!NOTE|label:Example II|style:callout]
Use GBC to build the archive for the example file
https://pmglab.top/gbc/download/rare.disease.hg19.vcf.gz
and save all the fields:
# Run directly in the terminal
java -jar gbc.jar vcf2gtb https://pmglab.top/gbc/download/rare.disease.hg19.vcf.gz -f ALL
# Run it using docker
docker run -v `pwd`:`pwd` -w `pwd` --rm -it -m 4g gbc \
  vcf2gtb https://pmglab.top/gbc/download/rare.disease.hg19.vcf.gz -f ALL
[!TIP|label:Build GTB archives for all VCF files in the specified directory|style:callout]
In a macOS or Linux terminal, you can use the following statements to build GTB archives for all files with the extension .vcf.gz under the folder $DATA_PATH:
DATA_PATH="/Data"
OUTPUT_PATH="/Data"
for file in "${DATA_PATH}"/*.vcf.gz
do
  java -jar gbc.jar vcf2gtb "${file}" "${OUTPUT_PATH}/$(basename "$file" .vcf.gz).gtb"
done
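The loop above can also be wrapped in functions so that it copes with file names containing spaces and with directories that contain no matching files. The following is a minimal sketch under stated assumptions: the function names `gtb_path_for` and `build_all_gtb` and the `GBC_JAR` variable are hypothetical and not part of GBC. Only the `vcf2gtb <input> [output]` invocation comes from the usage shown earlier.

```shell
#!/usr/bin/env bash
# Hypothetical helper: map /in/sample.vcf.gz (or /in/sample.vcf) to /out/sample.gtb.
gtb_path_for() {
  local input="$1" out_dir="$2" name
  name="$(basename -- "$input")"
  name="${name%.gz}"    # drop a trailing .gz, if present
  name="${name%.vcf}"   # drop a trailing .vcf, if present
  printf '%s/%s.gtb\n' "${out_dir%/}" "$name"
}

# Hypothetical wrapper: convert every *.vcf.gz file in a directory.
# GBC_JAR is an assumed variable name; it defaults to ./gbc.jar.
build_all_gtb() {
  local data_dir="$1" out_dir="$2" file
  for file in "$data_dir"/*.vcf.gz; do
    [ -e "$file" ] || continue   # the glob matched nothing
    java -jar "${GBC_JAR:-gbc.jar}" vcf2gtb "$file" \
      "$(gtb_path_for "$file" "$out_dir")"
  done
}
```

For example, `build_all_gtb /Data /Data` would behave like the loop above, while `gtb_path_for` can be called on its own to preview the output name for a given input.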
Program Options
Usage: vcf2gtb <input> [output] [options]
Java-API: edu.sysu.pmglab.gbc.VCF2GTB
About: Compress and build *.gtb (genotype block format) for *.vcf (variant call format).
Options:
--chromosome Specify the chromosome tags file. e.g., identify 'X, chrX,
CHRX, ChrX' as '(int) 22' chromosome.
format: --chromosome <file>
--threads,-t Set the number of threads.
default: 4
format: --threads <int>
--level,-l Compression level to use when ZSTD works.
default: 16
format: --level <int> (0 ~ 22)
GTB Archive Options:
--add-meta Add the specified metas to the GTB file.
format: --add-meta <key>=<value> <key>=<value> ...
--phased,-p Set the status of genotype to phased.
--liftover Lift over variants from one reference genome version to another
(chain files are downloaded from
http://hgdownload.cse.ucsc.edu/goldenPath/<version>/liftOver).
format: --liftover <string>
([hg19ToHg38/hg38ToHg19/hg18ToHg19/hg18ToHg38] (ignoreCase))
--field,-f Add the specified fields from the VCF file to the GTB file.
default: GENOTYPE
format: --field <string>,<string>,...
([META/ID/QUAL/FILTER/INFO/GENOTYPE/ALL/NONE] (ignoreCase))
Quality Control Options:
--no-qc Disable all quality control methods.
--allele-num Exclude variants with the alternative allele number per
variant out of the range [minAlleleNum, maxAlleleNum].
default: 0-15
format: --allele-num <minAlleleNum>-<maxAlleleNum> (0 ~
255)
--gty-gq Exclude genotypes with the minimal genotype quality (Phred
Quality Score) per genotype < minGq.
default: 20
format: --gty-gq <minGq> (>= 0)
--gty-dp Exclude genotypes with the minimal read depth per genotype
< minDp.
default: 8
format: --gty-dp <minDp> (>= 0)
--gty-pl Exclude genotypes with the second smallest normalized
Phred-scaled likelihoods for genotypes < minPl. Otherwise,
there would be confusing genotypes.
default: 20
format: --gty-pl <minPl> (>= 0)
--gty-ad-hom-ref Exclude genotypes with the fraction of the reads carrying
alternative allele > maxAdHomRef at a reference-allele
homozygous genotype.
default: 0.05
format: --gty-ad-hom-ref <maxAdHomRef> (0.0 ~ 1.0)
--gty-ad-hom-alt Exclude genotypes with the fraction of the reads carrying
alternative allele < minAdHomAlt at a alternative-allele
homozygous genotype.
default: 0.75
format: --gty-ad-hom-alt <minAdHomAlt> (0.0 ~ 1.0)
--gty-ad-het Exclude genotypes with the fraction of the reads carrying
alternative allele < minAdHet at a heterozygous genotype.
default: 0.25
format: --gty-ad-het <minAdHet> (0.0 ~ 1.0)
--seq-qual Exclude variants with the minimal overall sequencing
quality score (Phred Quality Score) per variant < minQual.
default: 30.0
format: --seq-qual <minQual> (>= 0.0)
--seq-fs Exclude variants with the overall strand bias Phred-scaled
p-value (using Fisher's exact test) per variant > maxFs.
format: --seq-fs <maxFs> (>= 0.0)
--seq-mq Exclude variants with the minimal overall mapping quality
score (Mapping Quality Score) per variant < minMq.
default: 20.0
format: --seq-mq <minMq> (>= 0.0)
--seq-info Exclude variants with the information (i.e., INFO in VCF)
field contain or do not contain (starts with ^) the
specified strings.
format: --seq-info <string> <string> ...
--seq-ac Exclude variants with the alternate allele count (AC) per
variant out of the range [minAc, maxAc].
format: --seq-ac <minAc>-<maxAc> (>= 0)
--seq-af Exclude variants with the alternate allele frequency (AF)
per variant out of the range [minAf, maxAf].
format: --seq-af <minAf>-<maxAf> (0.0 ~ 1.0)
--seq-an Exclude variants with the non-missing allele number (AN)
per variant out of the range [minAn, maxAn].
default: 1-
format: --seq-an <minAn>-<maxAn> (>= 0)
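The three `--gty-ad-*` options all compare the same quantity, the fraction of reads carrying the alternative allele, against a threshold that depends on the genotype. The shell sketch below works through that arithmetic with the documented defaults (0.05, 0.75, and 0.25). It is an illustration only: the function `ad_passes` is hypothetical, it assumes a biallelic site where the fraction is alt / (ref + alt), and it treats a genotype with zero reads as failing. GBC's actual implementation may handle these cases differently.

```shell
#!/usr/bin/env bash
# Hypothetical helper, not part of GBC: exits 0 if the genotype would be kept
# under the allelic-depth rules described above, using the documented defaults.
# Usage: ad_passes <hom-ref|het|hom-alt> <ref_reads> <alt_reads>
ad_passes() {
  awk -v kind="$1" -v ref="$2" -v alt="$3" \
      -v max_hom_ref=0.05 -v min_hom_alt=0.75 -v min_het=0.25 '
    BEGIN {
      depth = ref + alt
      if (depth <= 0) exit 1             # assumption: no reads means no evidence
      frac = alt / depth                 # fraction of reads carrying the ALT allele
      if (kind == "hom-ref")      keep = (frac <= max_hom_ref)
      else if (kind == "het")     keep = (frac >= min_het)
      else if (kind == "hom-alt") keep = (frac >= min_hom_alt)
      else exit 2                        # unknown genotype kind
      exit (keep ? 0 : 1)
    }'
}
```

For instance, `ad_passes het 36 4` fails because only 10% of the reads carry the alternative allele, which is below the default `--gty-ad-het` threshold of 0.25.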
API Toolkit
The API class for converting VCF files to GTB files is edu.sysu.pmglab.gbc.VCF2GTB. A usage example follows:
VCF2GTB.of("https://pmglab.top/gbc/download/assoc.hg19.vcf.gz")
        .setOutputFile(new File("./assoc.hg38.gtb"))
        .storeOriginMeta(true)
        .liftOverWith(RefGenomeVersion.hg19, RefGenomeVersion.hg38)
        .setThreads(4)
        .convert();