Merge Multiple GTB
Merging means creating a superset of variant calls across multiple individuals (e.g., merging the variants of different populations in 1000GP3); records that share the same coordinates are merged with each other. Because merging non-genotype data across files is complex (e.g., updating certain statistical fields or merging annotation information), the command-line tool GBC provides only genotype merging. Use the following command to merge the genotypes of GTB files:
java -jar gbc.jar merge <input> <input> ... -o <output> [options]
The merged file can be split by the GTBExporter. When more than two files are merged, GBC uses a two-by-two (pairwise) merging strategy, optimized with a min-heap weighted by sample size, so that sample sets of arbitrary size can be merged.
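The pairwise strategy can be illustrated with a small, self-contained sketch. It is not GBC's implementation: the class `MergePlan`, the `Node` type, and the sample counts are hypothetical, and the sketch only assumes that the two inputs with the fewest samples are merged first and the result is pushed back onto the heap.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.PriorityQueue;

public class MergePlan {
    /** An input or intermediate file: a label plus its sample count (hypothetical type). */
    static final class Node {
        final String label;
        final long samples;

        Node(String label, long samples) {
            this.label = label;
            this.samples = samples;
        }
    }

    /** Returns the pairwise merge steps in the order a min-heap would schedule them. */
    static List<String> plan(List<Node> inputs) {
        // Min-heap ordered by sample count: the smallest sample sets are polled first.
        PriorityQueue<Node> heap =
                new PriorityQueue<>((a, b) -> Long.compare(a.samples, b.samples));
        heap.addAll(inputs);

        List<String> steps = new ArrayList<>();
        while (heap.size() > 1) {
            Node a = heap.poll();
            Node b = heap.poll();
            // The merged result carries the samples of both inputs and re-enters the heap.
            Node merged = new Node("(" + a.label + "+" + b.label + ")", a.samples + b.samples);
            steps.add(a.label + " + " + b.label + " -> " + merged.samples + " samples");
            heap.add(merged);
        }
        return steps;
    }

    public static void main(String[] args) {
        // Illustrative sample counts only; they are not taken from any real dataset.
        List<Node> inputs = Arrays.asList(
                new Node("a.gtb", 10), new Node("b.gtb", 40),
                new Node("c.gtb", 20), new Node("d.gtb", 35));
        plan(inputs).forEach(System.out::println);
    }
}
```

With these illustrative counts, the two smallest inputs (`a.gtb` and `c.gtb`) are merged first, and the largest input (`b.gtb`) is touched only in the final step, which keeps large sample sets from being rewritten repeatedly.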
[!NOTE|label:Example|style:callout]
Merging all Y-chromosome genotypes of 1000GP3 using GBC:
# Download the data file
wget https://pmglab.top/gbc/download/1000GP3.hg19.chrY/afr.gtb \
  https://pmglab.top/gbc/download/1000GP3.hg19.chrY/amr.gtb \
  https://pmglab.top/gbc/download/1000GP3.hg19.chrY/eas.gtb \
  https://pmglab.top/gbc/download/1000GP3.hg19.chrY/eur.gtb \
  https://pmglab.top/gbc/download/1000GP3.hg19.chrY/sas.gtb

# Run directly in the terminal
java -jar gbc.jar merge ./afr.gtb ./amr.gtb ./eas.gtb ./eur.gtb ./sas.gtb \
  -o ./1000GP3.chrY.gtb

# Run it using docker
docker run -v `pwd`:`pwd` -w `pwd` --rm -it -m 4g gbc \
  merge ./afr.gtb ./amr.gtb ./eas.gtb ./eur.gtb ./sas.gtb \
  -o ./1000GP3.chrY.gtb
Program Options
Usage: merge <input> <input> ... -o <output> [options]
Java-API: edu.sysu.pmglab.gbc.toolkit.GTBMerger
About: Merge genotypes of individuals in multiple *.gtb into a single *.gtb.
Merge means creating a superset of variant calls across multiple
individuals.
Options:
*--output,-o Set the output file.
format: --output <file>
--chromosome Specify the chromosome tags file. e.g., identify 'X, chrX,
CHRX, ChrX' as '(int) 22' chromosome.
format: --chromosome <string>
--threads,-t Set the number of threads.
default: 4
format: --threads <int> (>= 1)
--method,-m Method for handling coordinates in different files (union,
intersection or alignment), the missing genotype is replaced by
'.'.
default: alignment
format: --method <string> ([union/intersection/alignment]
(ignoreCase))
--no-gt Do not load and store genotypes.
--add-meta Add the specified metas to the output file.
format: --add-meta <key>=<value> <key>=<value> ...
--rm-meta Remove all meta information.
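The effect of the union and intersection values of --method on the output coordinates can be sketched as follows. This is a simplified model rather than GBC code: the class `CoordinateMergeSketch`, the `Mode` enum, and the one-genotype-per-file tables are assumptions made for illustration, and the alignment mode is not modelled because the help text does not define it.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.SortedMap;
import java.util.SortedSet;
import java.util.TreeMap;
import java.util.TreeSet;

public class CoordinateMergeSketch {
    /** Hypothetical stand-in for two of the --method values. */
    enum Mode { UNION, INTERSECTION }

    /**
     * Merges two position -> genotype tables (one sample per table, for brevity).
     * Each output row holds the genotype from the first table, then the second.
     */
    static SortedMap<Integer, String[]> merge(Map<Integer, String> first,
                                              Map<Integer, String> second,
                                              Mode mode) {
        SortedSet<Integer> positions = new TreeSet<>(first.keySet());
        if (mode == Mode.UNION) {
            positions.addAll(second.keySet());    // keep coordinates seen in either file
        } else {
            positions.retainAll(second.keySet()); // keep coordinates seen in both files
        }

        SortedMap<Integer, String[]> merged = new TreeMap<>();
        for (int pos : positions) {
            // A genotype missing from one file is written as '.'.
            merged.put(pos, new String[] {
                    first.getOrDefault(pos, "."),
                    second.getOrDefault(pos, ".")
            });
        }
        return merged;
    }

    public static void main(String[] args) {
        Map<Integer, String> first = new HashMap<>();
        first.put(100, "0|1");
        first.put(200, "1|1");
        Map<Integer, String> second = new HashMap<>();
        second.put(200, "0|0");
        second.put(300, "0|1");

        for (Mode mode : Mode.values()) {
            System.out.println(mode);
            merge(first, second, mode).forEach((pos, gts) ->
                    System.out.println("  " + pos + "\t" + String.join("\t", gts)));
        }
    }
}
```

In this toy input, union keeps positions 100, 200 and 300 and fills the gaps with '.', while intersection keeps only position 200.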
API Toolkit
GTB files can also be merged through the API tool edu.sysu.pmglab.gbc.GTBMerger, for example:
GTBMerger.of("./afr.gtb", "./amr.gtb")
    .setMergeOperator(GTBMerger.MergeOperator.UNION)
    .submit();
In GTBMerger, non-genotype fields are merged by adding the extra field names and field types with addField and setting the new field values with addValueConverter.