Build GTB for VCF

We designed the GTB format for compressing and storing large-scale genotype data from haploid and diploid species. It supports variable allele numbers, custom chromosome tags, and both phased and unphased genotypes. Building a GTB archive requires input that conforms to the VCF file specification. Because genotype quality is rarely used in downstream analysis, GBC does not store it by default. Instead, GBC applies quality control to variant sites and genotypes during the build, so that only sites and genotypes of reliable quality are stored. To build a GTB archive from a VCF file, run the following on the command line:

java -jar gbc.jar vcf2gtb <input> [output] [options]

Once the archive has been built, GBC checks whether the output GTB file is ordered by coordinates. If it is not, GBC calls GTBSorter to sort it. For storing VCF files from other species, see Chromosome Tag Declaration.
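
Because an unordered result triggers an additional sorting pass, it can help to know beforehand whether the input is already ordered. The following is a minimal sketch of such a pre-check, not part of GBC: the function name `vcf_is_sorted` is hypothetical, and it assumes an uncompressed, tab-delimited VCF in which the records of each chromosome are contiguous.

```shell
# Hypothetical helper (not part of GBC): succeed if the records of a VCF are
# in non-decreasing POS order within each run of the same chromosome.
# It does not detect a chromosome that reappears later in the file.
vcf_is_sorted() {
    awk -F '\t' '
        /^#/ { next }                                  # skip header lines
        $1 == chrom && $2 + 0 < pos { bad = 1; exit }  # position went backwards
        { chrom = $1; pos = $2 + 0 }                   # remember the last record
        END { exit bad }                               # status 0 = ordered, 1 = unordered
    ' "$@"
}
```

With a file argument it can be used as `vcf_is_sorted input.vcf && echo ordered`. Without an argument it reads standard input, so a compressed file can be streamed in with `gzip -dc input.vcf.gz | vcf_is_sorted`.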

[!NOTE|label:Example I|style:callout]

Use GBC to build the archive for the example file https://pmglab.top/gbc/download/assoc.hg19.vcf.gz and set the following parameters:

  • Store phased genotypes;
  • Set the compression level to 16;
  • Lift over the variant coordinates from reference genome hg19 to hg38.

The commands to complete this task are as follows:

# Download the data file
wget https://pmglab.top/gbc/download/assoc.hg19.vcf.gz -O assoc.hg19.vcf.gz

# Run directly in the terminal
java -jar gbc.jar vcf2gtb ./assoc.hg19.vcf.gz ./assoc.hg38.gtb \
                          -p -l 16 --liftover hg19ToHg38

# Run it using docker
docker run -v `pwd`:`pwd` -w `pwd` --rm -it -m 4g gbc \
vcf2gtb ./assoc.hg19.vcf.gz ./assoc.hg38.gtb \
        -p -l 16 --liftover hg19ToHg38

[!NOTE|label:Example II|style:callout]

Use GBC to build the archive for the example file https://pmglab.top/gbc/download/rare.disease.hg19.vcf.gz and save all the fields:

# Run directly in the terminal
java -jar gbc.jar vcf2gtb https://pmglab.top/gbc/download/rare.disease.hg19.vcf.gz -f ALL

# Run it using docker
docker run -v `pwd`:`pwd` -w `pwd` --rm -it -m 4g gbc \
vcf2gtb https://pmglab.top/gbc/download/rare.disease.hg19.vcf.gz -f ALL

[!TIP|label:Build GTB archives for all VCF files in the specified directory|style:callout]

In a macOS or Linux terminal, you can use the following loop to build GTB archives for all files with the extension .vcf.gz in the folder $DATA_PATH:

DATA_PATH="/Data"
OUTPUT_PATH="/Data"

for file in "${DATA_PATH}"/*.vcf.gz
do
    # Skip the unexpanded pattern when no .vcf.gz file exists
    [ -e "${file}" ] || continue
    java -jar gbc.jar vcf2gtb "${file}" "${OUTPUT_PATH}/$(basename "${file}" .vcf.gz).gtb"
done

Program Options

Usage: vcf2gtb <input> [output] [options]
Java-API: edu.sysu.pmglab.gbc.VCF2GTB
About: Compress and build *.gtb (genotype block format) for *.vcf (variant call format).
Options:
  --chromosome  Specify the chromosome tags file. e.g., identify 'X, chrX, 
                CHRX, ChrX' as '(int) 22' chromosome.
                format: --chromosome <file>
  --threads,-t  Set the number of threads.
                default: 4
                format: --threads <int>
  --level,-l    Compression level to use when ZSTD works.
                default: 16
                format: --level <int> (0 ~ 22)
GTB Archive Options:
  --add-meta   Add the specified metas to the GTB file.
               format: --add-meta <key>=<value> <key>=<value> ...
  --phased,-p  Set the status of genotype to phased.
  --liftover   Lift over variants from one reference genome version to another 
               (chain files are downloaded from 
               http://hgdownload.cse.ucsc.edu/goldenPath/<version>/liftOver). 
               format: --liftover <string> 
               ([hg19ToHg38/hg38ToHg19/hg18ToHg19/hg18ToHg38] (ignoreCase))
  --field,-f   Add the specified fields from the VCF file to the GTB file.
               default: GENOTYPE
               format: --field <string>,<string>,... 
               ([META/ID/QUAL/FILTER/INFO/GENOTYPE/ALL/NONE] (ignoreCase))
Quality Control Options:
  --no-qc           Disable all quality control methods.
  --allele-num      Exclude variants with the alternative allele number per 
                    variant out of the range [minAlleleNum, maxAlleleNum].
                    default: 0-15
                    format: --allele-num <minAlleleNum>-<maxAlleleNum> (0 ~ 
                    255) 
  --gty-gq          Exclude genotypes with the minimal genotype quality (Phred 
                    Quality Score) per genotype < minGq.
                    default: 20
                    format: --gty-gq <minGq> (>= 0)
  --gty-dp          Exclude genotypes with the minimal read depth per genotype 
                    < minDp.
                    default: 8
                    format: --gty-dp <minDp> (>= 0)
  --gty-pl          Exclude genotypes with the second smallest normalized 
                    Phred-scaled likelihoods for genotypes < minPl. Otherwise, 
                    there would be confusing genotypes.
                    default: 20
                    format: --gty-pl <minPl> (>= 0)
  --gty-ad-hom-ref  Exclude genotypes with the fraction of the reads carrying 
                    alternative allele > maxAdHomRef at a reference-allele 
                    homozygous genotype.
                    default: 0.05
                    format: --gty-ad-hom-ref <maxAdHomRef> (0.0 ~ 1.0)
  --gty-ad-hom-alt  Exclude genotypes with the fraction of the reads carrying 
                    alternative allele < minAdHomAlt at a alternative-allele 
                    homozygous genotype.
                    default: 0.75
                    format: --gty-ad-hom-alt <minAdHomAlt> (0.0 ~ 1.0)
  --gty-ad-het      Exclude genotypes with the fraction of the reads carrying 
                    alternative allele < minAdHet at a heterozygous genotype.
                    default: 0.25
                    format: --gty-ad-het <minAdHet> (0.0 ~ 1.0)
  --seq-qual        Exclude variants with the minimal overall sequencing 
                    quality score (Phred Quality Score) per variant < minQual.
                    default: 30.0
                    format: --seq-qual <minQual> (>= 0.0)
  --seq-fs          Exclude variants with the overall strand bias Phred-scaled 
                    p-value (using Fisher's exact test) per variant > maxFs.
                    format: --seq-fs <maxFs> (>= 0.0)
  --seq-mq          Exclude variants with the minimal overall mapping quality 
                    score (Mapping Quality Score) per variant < minMq.
                    default: 20.0
                    format: --seq-mq <minMq> (>= 0.0)
  --seq-info        Exclude variants with the information (i.e., INFO in VCF) 
                    field contain or do not contain (starts with ^) the 
                    specified strings.
                    format: --seq-info <string> <string> ...
  --seq-ac          Exclude variants with the alternate allele count (AC) per 
                    variant out of the range [minAc, maxAc].
                    format: --seq-ac <minAc>-<maxAc> (>= 0)
  --seq-af          Exclude variants with the alternate allele frequency (AF) 
                    per variant out of the range [minAf, maxAf].
                    format: --seq-af <minAf>-<maxAf> (0.0 ~ 1.0)
  --seq-an          Exclude variants with the non-missing allele number (AN) 
                    per variant out of the range [minAn, maxAn].
                    default: 1-
                    format: --seq-an <minAn>-<maxAn> (>= 0)
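
As an illustration of how the options above combine, the sketch below assembles a vcf2gtb call with stricter genotype filters and an allele-frequency range. The file names and threshold values are placeholders chosen for illustration, and the `run` wrapper and `DRY_RUN` variable are hypothetical helpers, not part of GBC. Only flags from the list above are used. By default the command is printed for review rather than executed.

```shell
# Placeholder paths, chosen only for illustration.
INPUT="./cohort.vcf.gz"
OUTPUT="./cohort.gtb"

# Hypothetical wrapper: print the command unless DRY_RUN=0 is set.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# 8 threads; keep genotypes with GQ >= 30 and DP >= 10;
# keep variants whose alternate allele frequency lies in [0.01, 1.0].
run java -jar gbc.jar vcf2gtb "${INPUT}" "${OUTPUT}" \
    --threads 8 --gty-gq 30 --gty-dp 10 --seq-af 0.01-1.0
```

Setting `DRY_RUN=0` before the call executes the command instead of printing it, which requires gbc.jar in the working directory.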

API Toolkit

The API class for converting VCF files to GTB files is edu.sysu.pmglab.gbc.VCF2GTB. A usage example is as follows:

VCF2GTB.of("https://pmglab.top/gbc/download/assoc.hg19.vcf.gz")
        .setOutputFile(new File("./assoc.hg38.gtb"))
        .storeOriginMeta(true)
        .liftOverWith(RefGenomeVersion.hg19, RefGenomeVersion.hg38)
        .setThreads(4)
        .convert();

Copyright © Liubin Zhang, all rights reserved. Last modified time: 2023-04-12 08:26:49
