Export as BED Format

PLINK does not convert VCF files to BED format in parallel. GBC provides a multi-threaded alternative: on the command line, use the following command to export a GTB file to BED format:

java -jar gbc.jar gtb2bed <input> [output] [options]

When output is not specified, GBC derives the output names from the input file <input>.gtb and writes <input>.bed, <input>.bim and <input>.fam. If the input is a remote file, the output files are stored in the current local working directory.
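
The default naming rule above can be sketched as a small shell helper. This only illustrates the rule as stated here and is not taken from GBC's source code: the function name `default_bed_prefix` is hypothetical, and it assumes the prefix is the input name with the `.gtb` suffix removed.

```shell
# Hypothetical helper: derive the default output prefix for a GTB input,
# following the rule described above (an assumption, not GBC's own code).
default_bed_prefix() {
  input="$1"
  case "$input" in
    http://*|https://*)
      # Remote input: keep only the file name, so the outputs land in the
      # current working directory.
      input="${input##*/}"
      ;;
  esac
  # Strip a trailing .gtb suffix, if present.
  printf '%s\n' "${input%.gtb}"
}

prefix=$(default_bed_prefix "https://pmglab.top/gbc/download/assoc.hg19.gtb")
for ext in bed bim fam; do
  echo "${prefix}.${ext}"
done
```

For a local input such as ./data/assoc.hg19.gtb, the same helper keeps the directory part and yields the prefix ./data/assoc.hg19.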

[!NOTE|label:Example|style:callout]

Here is the command to use GBC to output https://pmglab.top/gbc/download/assoc.hg19.gtb in BED format:

# Run directly in the terminal
java -jar gbc.jar gtb2bed https://pmglab.top/gbc/download/assoc.hg19.gtb

# Run it using docker
docker run -v `pwd`:`pwd` -w `pwd` --rm -it -m 4g gbc \
gtb2bed https://pmglab.top/gbc/download/assoc.hg19.gtb

Please note that the rules for converting from GTB to BED files differ from those PLINK employs when converting VCF to BED:

  • GBC always designates REF as A1 in the .bim file instead of assigning A1 based on allele frequency.
  • For multiallelic variants, PLINK retains only the two alleles with the highest allele frequencies, designating them as A2 and A1, respectively. In contrast, GBC splits each multiallelic site into multiple biallelic sites, designating REF as A1 and ALT as A2 for each. If the current GTB file was constructed from a BED file (using bed2gtb), GBC uses the A1 and A2 from the BED file as REF and ALT, respectively, so the exported output matches PLINK's BED format.
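
To make the allele-labeling rule concrete, the following sketch prints .bim-style rows (chromosome, variant ID, genetic distance, base-pair position, A1, A2) for a site with several ALT alleles. It covers only the A1/A2 assignment described above, not how genotypes are recoded. The function name `split_to_bim` is hypothetical, and reusing the same variant ID and a genetic distance of 0 for every split row are assumptions rather than documented GBC behavior.

```shell
# Illustrative only: read sites as "chrom pos id REF ALT1,ALT2,..." and
# print one .bim-style row per ALT allele, with REF as A1 and ALT as A2.
# Reusing the ID and writing 0 for the genetic distance are assumptions.
split_to_bim() {
  awk '{
    n = split($5, alts, ",")
    for (i = 1; i <= n; i++)
      printf "%s\t%s\t0\t%s\t%s\t%s\n", $1, $3, $2, $4, alts[i]
  }'
}

# A triallelic site (REF=A, ALT=C,G) becomes two biallelic rows.
echo "1 1000 rs1 A C,G" | split_to_bim
```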

Program Options

Usage: gtb2bed <input> [output] [options]
Java-API: edu.sysu.pmglab.gbc.toolkit.bed.BEDWriter
About: Decompress and export variants from *.gtb (genotype block format) to 
       *.bed (PLINK binary biallelic genotype table).
Options:
  --chromosome  Specify the chromosome tags file. e.g., identify 'X, chrX, 
                CHRX, ChrX' as '(int) 22' chromosome.
                format: --chromosome <string>
  --threads,-t  Set the number of threads.
                default: 4
                format: --threads <int> (1 ~ 10)
Subset Selection Options:
  --subject,-s         Retrieve the genotypes of the specified subject (by 
                       subject names). Subject names can be stored in a file 
                       with comma-separated format, and pass in via '-s @file'.
                       format: --subject <string>,<string>,...
  --subject-range,-sr  Retrieve the genotypes of the specified subject (by 
                       intervals of subject index).
                       format: --subject-range <minIndex>-<maxIndex> (>= 0)
  --subject-index,-si  Retrieve the genotypes of the specified subject (by 
                       subject indexes).
                       format: --subject-index <index1>,<index2>,... (>= 0)
  --pos,-p             Retrieve the variants by the specified coordinates of 
                       variant. 
                       format: --pos <chr>:<pos>,<pos>,... ... (>= 1)
  --pos-range,-pr      Retrieve the variants by the specified coordinate 
                       intervals of variant.
                       format: --pos-range <chr>:<minPos>-<maxPos>,... (>= 1)
  --index-range,-ir    Retrieve the variants by the line-index of variant.
                       format: --index-range <minIndex>-<maxIndex> (>= 0)
Quality Control Options:
  --allele-num       Exclude variants with the alternative allele number per 
                     variant out of the range [minAlleleNum, maxAlleleNum].
                     format: --allele-num <minAlleleNum>-<maxAlleleNum> (0 ~ 
                     255) 
  --seq-ac           Exclude variants with the alternate allele count (AC) per 
                     variant out of the range [minAc, maxAc].
                     format: --seq-ac <minAc>-<maxAc> (>= 0)
  --seq-af           Exclude variants with the alternate allele frequency (AF) 
                     per variant out of the range [minAf, maxAf].
                     format: --seq-af <minAf>-<maxAf> (0.0 ~ 1.0)
  --seq-an           Exclude variants with the non-missing allele number (AN) 
                     per variant out of the range [minAn, maxAn].
                     format: --seq-an <minAn>-<maxAn> (>= 0)
  --field-condition  Extract variants by the values of the specified 
                     supplementary fields. For comparable fields, the 
                     'condition' format is 'minValue-maxValue'; for other 
                     formats, the 'condition' is multiple optional values 
                     separated by ','.
                     format: --field-condition <field>=<condition> 
                     <field>=<condition> ...

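As a sketch of how the options above combine, the snippet below assembles a gtb2bed call that restricts the export to a coordinate window, filters by alternate allele frequency, and raises the thread count. The option names come from the help text; the values are illustrative, and it assumes a local assoc.hg19.gtb whose chromosome label is 1. The command is printed rather than executed, so it can be checked before running.

```shell
# Dry run: assemble the command from documented options and print it.
# All values below are illustrative assumptions, not recommended settings.
set -- gtb2bed assoc.hg19.gtb \
  --pos-range 1:1000000-2000000 \
  --seq-af 0.05-0.95 \
  --threads 8
cmd="java -jar gbc.jar $*"
echo "$cmd"
# To execute it for real, run: java -jar gbc.jar "$@"
```

Keeping the arguments in the positional parameters makes it easy to reuse the same list for the Docker form of the command shown earlier.
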
API Toolkit

Read/write support for the BED format is located in the package edu.sysu.pmglab.gbc.toolkit.bed. BEDGenotypes implements the mapping between BED genotypes and GTB genotypes. BED files can be created with either BEDWriter or BEDPartWriter. Since GTB supports splitting the entire file into equal parts by variant count for parallel processing, the same functionality is implemented for BED (i.e., BEDPartWriter). The .fam file can be generated using BEDWriter.generateFam(Individual[] individuals, String fileName) or BEDWriter.generateFam(String[] individualNames, String fileName).
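
For orientation, a PLINK .fam file has one row per individual with six columns: family ID, individual ID, father ID, mother ID, sex, and phenotype. The sketch below builds such rows from a list of names, which is the idea behind the String[] overload of generateFam. It is a stand-in, not the GBC implementation: the function name `names_to_fam` is hypothetical, and using the name as the family ID and PLINK's missing-value codes (0 for parents and sex, -9 for phenotype) for the other fields are assumptions that may differ from what BEDWriter writes.

```shell
# Hypothetical stand-in: print one .fam row per individual name.
# The filler values are assumptions, not GBC's documented output.
names_to_fam() {
  for name in "$@"; do
    # Columns: FID IID father mother sex phenotype
    printf '%s\t%s\t0\t0\t0\t-9\n' "$name" "$name"
  done
}

names_to_fam S1 S2
```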

Copyright © Liubin Zhang. All rights reserved. Last modified time: 2023-04-18 21:57:57
