Concatenate Multiple GTB Files

Concatenation appends variants row by row, usually with no coordinate overlap between the inputs. The most common use is to merge variants from the same cohort of subjects that are scattered across several files (split by chromosome, by number of variants, or by file size) back into a single file, e.g., concatenating chr1~chr22, chrX, and chrY from 1000GP3 into one file. Use the following command to concatenate multiple GTB files:

java -jar gbc.jar concat <input> <input> ... -o <output> [options]

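When multiple files are concatenated, GBC checks the first variant of each input file and orders the files by those coordinates, so that the concatenated output stays as close to "coordinate-ordered" as possible. To sort the variants themselves by CHROM and POS, pass the --sort option.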
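The ordering step described above can be imitated with a minimal shell sketch. This is only an illustration of the idea, not GBC's implementation: the file names, chromosome indexes, and first positions in the table are made-up values, and GBC reads the first variant from each file itself.

```shell
# Hypothetical table: file name, chromosome index, position of the first variant.
# All values are invented for illustration.
records='part3.gtb 7 2000
part1.gtb 5 1000
part4.gtb 10 3000
part2.gtb 5 9000'

# Order the files by chromosome index, then by first position,
# and keep only the file names joined by single spaces.
ordered=$(printf '%s\n' "$records" | sort -k2,2n -k3,3n | cut -d' ' -f1 | paste -s -d ' ' -)
echo "$ordered"
```

Files whose first variants lie on the same chromosome (part1.gtb and part2.gtb here) are ordered by position, so the result does not depend on the order in which the inputs were listed.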


[!NOTE|label:Example|style:callout]

Concatenate multiple subfiles generated by GTBSplitter using GBC:

# Run directly in the terminal
java -jar gbc.jar concat ./rare.disease.hg19/chr5.gtb ./rare.disease.hg19/chr7.gtb ./rare.disease.hg19/chr8.gtb ./rare.disease.hg19/chr10.gtb ./rare.disease.hg19/chr17.gtb ./rare.disease.hg19/chr21.gtb -o ./rare.disease.hg19.concat.gtb

# Run it using docker
docker run -v `pwd`:`pwd` -w `pwd` --rm -it -m 4g gbc \
concat ./rare.disease.hg19/chr5.gtb ./rare.disease.hg19/chr7.gtb ./rare.disease.hg19/chr8.gtb ./rare.disease.hg19/chr10.gtb ./rare.disease.hg19/chr17.gtb ./rare.disease.hg19/chr21.gtb -o ./rare.disease.hg19.concat.gtb

[!TIP|label:Concatenate all GTB files in a folder|style:callout]

In a macOS or Linux terminal, you can use the following command substitution to list all files with the extension .gtb under the "DATA_PATH" folder and pass them as separate, whitespace-delimited arguments (the paths themselves must not contain whitespace):

$(find $DATA_PATH -type f -name '*.gtb' -exec echo {} \;)

For example, if the folder contains only the files listed in the example above, that command is equivalent to the following:

# Run directly in the terminal
java -jar gbc.jar concat $(find ./rare.disease.hg19 -type f -name '*.gtb' -exec echo {} \;) -o ./rare.disease.hg19.concat.gtb
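If file or folder names may contain spaces, the unquoted command substitution splits them into several arguments. A minimal POSIX shell sketch of an alternative is shown below. The function name concat_all is hypothetical, it only looks at the top level of the folder (unlike find, which also descends into subfolders), and it prints the command as a dry run instead of executing it.

```shell
# concat_all DIR OUT: print a gbc concat command for every *.gtb file directly in DIR.
# Hypothetical helper for illustration; remove the leading 'echo' to run the command.
concat_all() {
    dir=$1
    out=$2

    # Collect the matches as positional parameters so each path stays one argument.
    set --
    for f in "$dir"/*.gtb; do
        if [ -e "$f" ]; then
            set -- "$@" "$f"
        fi
    done

    if [ "$#" -eq 0 ]; then
        echo "no .gtb files found in $dir" >&2
        return 1
    fi

    echo java -jar gbc.jar concat "$@" -o "$out"
}
```

For the example above, the call would be: concat_all ./rare.disease.hg19 ./rare.disease.hg19.concat.gtb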

Program Options

Usage: concat <input> <input> ... -o <output> [options]
Java-API: edu.sysu.pmglab.gbc.toolkit.GTBConcat
About: Concatenate multiple *.gtb files. Concatenate means adding on rows 
       to *.gtb (e.g. re-combining split chromosomes). The program sorts 
       the input files by the coordinates of the first variant of each
       *.gtb in order to make the output files as ordered as possible.
Options:
  *--output,-o  Set the output file.
                format: --output <file>
  --chromosome  Specify the chromosome tags file. e.g., identify 'X, chrX, 
                CHRX, ChrX' as '(int) 22' chromosome.
                format: --chromosome <file>
  --threads,-t  Set the number of threads.
                default: 4
                format: --threads <int> (>= 1)
  --field,-f    Select the specified fields from the *.gtb file to the output 
                file (all fields by default).
                format: --field <string>,<string>,...
  --subject,-s  Retrieve the genotypes of the specified subject (by subject 
                names). Subject names can be stored in a file with 
                comma-separated format, and pass in via '-s @file'.
                format: --subject <string>,<string>,...
  --no-gt       Do not load and store genotypes.
  --sort        Sort the variants by coordinate fields (CHROM, POS).
  --add-meta    Add the specified metas to the output file.
                format: --add-meta <key>=<value> <key>=<value> ...
  --rm-meta     Remove all meta information.
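
As a sketch of how the options listed above combine, the following dry run assembles a command that concatenates two of the example files, sorts the variants with --sort, and uses 8 threads. The thread count is an arbitrary choice, and the script prints the command rather than running it.

```shell
# Dry run: assemble the argument list and print it instead of executing it.
# Input and output paths are taken from the example earlier on this page.
set -- concat \
    ./rare.disease.hg19/chr5.gtb \
    ./rare.disease.hg19/chr7.gtb \
    -o ./rare.disease.hg19.concat.gtb \
    --sort \
    --threads 8

cmd="java -jar gbc.jar $*"
echo "$cmd"
```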

API Toolkit

The API tool for concatenating GTB files is edu.sysu.pmglab.gbc.toolkit.GTBConcat. A usage example follows:

GTBConcat.of(new File("./rare.disease.hg19"))
        .addManager("./rare.disease.hg19/chr5.gtb")
        .addManager("./rare.disease.hg19/chr7.gtb")
        .addManager("./rare.disease.hg19/chr8.gtb")
        .addManager("./rare.disease.hg19/chr10.gtb")
        .addManager("./rare.disease.hg19/chr17.gtb")
        .addManager("./rare.disease.hg19/chr19.gtb")
        .addManager("./rare.disease.hg19/chr21.gtb")
        .submit();

Copyright © Liubin Zhang, all rights reserved. Last modified time: 2023-04-10 11:18:19
