Concatenate Multiple GTB
Concatenate means adding new variants by row, usually with no overlap in coordinates between the variants being joined. The most common case is rejoining variants from the same cohort of subjects that are scattered across several files (split by chromosome, by number of variants, or by sub-file size) into a single file (e.g., concatenating chr1~chr22, chrX, and chrY from 1000GP3 into one file). Use the following command to concatenate multiple GTB files:
java -jar gbc.jar concat <input> <input> ... -o <output> [options]
When multiple files are concatenated, GBC checks the first variant of each subfile and orders the subfiles accordingly, so that the concatenated file remains as "coordinate-ordered" as possible.
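The ordering idea can be illustrated with a small shell sketch. This is not GBC's implementation: it assumes you already have, for each subfile, the chromosome index and position of its first variant as tab-separated text. The function name `order_by_first_variant` and the input layout are hypothetical, chosen only for illustration.

```shell
# Illustration only; not GBC's internal code.
# Input (stdin): one line per subfile, laid out as
#   <chromosome index><TAB><position of first variant><TAB><path>
# Output (stdout): the paths, ordered by chromosome index, then by position.
order_by_first_variant() {
  tab=$(printf '\t')
  sort -t "$tab" -k1,1n -k2,2n | cut -f3
}
```

Feeding it three lines for `chr10.gtb`, `chr5.b.gtb`, and `chr5.a.gtb` prints the two chr5 subfiles first (lowest starting position first), followed by `chr10.gtb`. Note that ordering whole subfiles this way does not reorder the variants inside them.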
[!NOTE|label:Example|style:callout]
Concatenate multiple subfiles generated by GTBSplitter using GBC:
# Run directly in the terminal
java -jar gbc.jar concat ./rare.disease.hg19/chr5.gtb ./rare.disease.hg19/chr7.gtb ./rare.disease.hg19/chr8.gtb ./rare.disease.hg19/chr10.gtb ./rare.disease.hg19/chr17.gtb ./rare.disease.hg19/chr21.gtb -o ./rare.disease.hg19.concat.gtb

# Run it using docker
docker run -v `pwd`:`pwd` -w `pwd` --rm -it -m 4g gbc \
concat ./rare.disease.hg19/chr5.gtb ./rare.disease.hg19/chr7.gtb ./rare.disease.hg19/chr8.gtb ./rare.disease.hg19/chr10.gtb ./rare.disease.hg19/chr17.gtb ./rare.disease.hg19/chr21.gtb -o ./rare.disease.hg19.concat.gtb
[!TIP|label:Concatenate all GTB files in a folder|style:callout]
In a macOS or Linux terminal, you can use the following statement to list all files with the extension .gtb under the "DATA_PATH" folder and join them with spaces:
$(find $DATA_PATH -type f -name '*.gtb' -exec echo {} \;)
For example, the command in the example above is equivalent to the following:
# Run directly in the terminal
java -jar gbc.jar concat $(find ./rare.disease.hg19 -type f -name '*.gtb' -exec echo {} \;) -o ./rare.disease.hg19.concat.gtb
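Because an unquoted `$(find ...)` expansion is split on whitespace, paths that contain spaces would be broken into several arguments. A minimal sketch of a more defensive alternative is shown below. It uses a shell glob instead of `find`, so unlike the statement above it only picks up files directly inside the folder, not in subfolders. The function name `concat_dir` and the `GBC` override variable are hypothetical helpers introduced for this example, not part of GBC.

```shell
# concat_dir DIR OUTPUT
# Concatenate every *.gtb file directly inside DIR (not in subfolders) into OUTPUT.
# GBC is a hypothetical override for the launcher; with GBC=echo the command
# is printed instead of executed.
concat_dir() {
  dir=$1
  out=$2
  set --                                # reuse "$@" as the list of input files
  for f in "$dir"/*.gtb; do
    [ -f "$f" ] && set -- "$@" "$f"     # skips the literal pattern when nothing matches
  done
  if [ "$#" -eq 0 ]; then
    echo "no .gtb files found in $dir" >&2
    return 1
  fi
  # Left unquoted on purpose so "java -jar gbc.jar" splits into separate words.
  ${GBC:-java -jar gbc.jar} concat "$@" -o "$out"
}
```

With this function defined, `concat_dir ./rare.disease.hg19 ./rare.disease.hg19.concat.gtb` would issue the same kind of command as the example above. The glob lists files in lexical order (so `chr10.gtb` comes before `chr5.gtb`), which should not matter here because GBC orders the inputs by their first variant.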
Program Options
Usage: concat <input> <input> ... -o <output> [options]
Java-API: edu.sysu.pmglab.gbc.toolkit.GTBConcat
About: Concatenate multiple *.gtb files. Concatenate means adding on rows
       to *.gtb (e.g. re-combining split chromosomes). The program sorts
       the input files by the coordinates of the first variant of each
       *.gtb in order to make the output files as ordered as possible.
Options:
  *--output,-o    Set the output file.
                  format: --output <file>
  --chromosome    Specify the chromosome tags file. e.g., identify 'X, chrX,
                  CHRX, ChrX' as '(int) 22' chromosome.
                  format: --chromosome <file>
  --threads,-t    Set the number of threads.
                  default: 4
                  format: --threads <int> (>= 1)
  --field,-f      Select the specified fields from the *.gtb file to the output
                  file (all fields by default).
                  format: --field <string>,<string>,...
  --subject,-s    Retrieve the genotypes of the specified subject (by subject
                  names). Subject names can be stored in a file with
                  comma-separated format, and pass in via '-s @file'.
                  format: --subject <string>,<string>,...
  --no-gt         Do not load and store genotypes.
  --sort          Sort the variants by coordinate fields (CHROM, POS).
  --add-meta      Add the specified metas to the output file.
                  format: --add-meta <key>=<value> <key>=<value> ...
  --rm-meta       Remove all meta information.
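As an illustration of how these options can be combined, the sketch below keeps only a few subjects, sorts the variants by coordinate, and raises the thread count. The subject names, the file name `subjects.txt`, and the output path are placeholders; the flags are the ones listed above. The command is executed only when `gbc.jar` is found in the current directory; otherwise it is printed.

```shell
# Placeholder subject names, stored comma-separated as described for --subject.
printf '%s\n' 'SUBJECT_A,SUBJECT_B,SUBJECT_C' > subjects.txt

# Build the argument list once so it can be either run or printed.
set -- concat \
  ./rare.disease.hg19/chr5.gtb ./rare.disease.hg19/chr7.gtb \
  -o ./rare.disease.hg19.subset.gtb \
  -s @subjects.txt --sort -t 8

if [ -f gbc.jar ]; then
  java -jar gbc.jar "$@"
else
  echo "gbc.jar not found; would run: java -jar gbc.jar $*"
fi
```

Passing `--sort` requests a sort by CHROM and POS, which goes beyond the default behaviour of only ordering the input files by their first variant.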
API Toolkit
The API tool for concatenating GTB files is edu.sysu.pmglab.gbc.GTBConcat, and the usage example is as follows:
GTBConcat.of(new File("./rare.disease.hg19"))
.addManager("./rare.disease.hg19/chr5.gtb")
.addManager("./rare.disease.hg19/chr7.gtb")
.addManager("./rare.disease.hg19/chr8.gtb")
.addManager("./rare.disease.hg19/chr10.gtb")
.addManager("./rare.disease.hg19/chr17.gtb")
.addManager("./rare.disease.hg19/chr19.gtb")
.addManager("./rare.disease.hg19/chr21.gtb")
.submit();