Build GTB for TSV
Many genomic coordinate-based databases are stored in TSV or CSV files (e.g., dbNSFP, GWAS Catalog, regBase, etc.) and can also be converted to GTB format. On average, choosing an appropriate numeric encoding for each field when building the GTB archive saves more than 30% of storage space and significantly improves access and retrieval performance. Use the following command to build GTB archives for TSV files from the command line:
java -jar gbc.jar tsv2gtb <input> [output] [options]
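To illustrate why a numeric encoding can take less space than text, the minimal Java sketch below compares the bytes needed to store a few values as delimited text with the bytes needed to store them as 4-byte floats. This is not GBC code: the class name, method names, and sample values are hypothetical and only serve the illustration.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Hypothetical illustration, not part of GBC.
class EncodingSizeSketch {
    /** Bytes needed to store the values as delimited text (one separator byte per value). */
    static int textBytes(String[] values) {
        int total = 0;
        for (String v : values) {
            total += v.getBytes(StandardCharsets.UTF_8).length + 1; // +1 for the separator
        }
        return total;
    }

    /** Bytes needed to store the same values as 4-byte IEEE 754 floats. */
    static int floatBytes(String[] values) {
        ByteBuffer buffer = ByteBuffer.allocate(values.length * Float.BYTES);
        for (String v : values) {
            buffer.putFloat(Float.parseFloat(v));
        }
        return buffer.position();
    }

    public static void main(String[] args) {
        // Made-up values in the style of a p_value column.
        String[] pValues = {"0.000012345", "3.2e-08", "0.4871203", "1.05e-12"};
        System.out.println("text:  " + textBytes(pValues) + " bytes");
        System.out.println("float: " + floatBytes(pValues) + " bytes");
    }
}
```

A fixed-width float keeps roughly seven significant decimal digits, so the smaller size comes at the cost of precision; which type suits a field depends on how the values are used downstream.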
[!NOTE|label:Example|style:callout]
Use GBC to build an archive for the example file https://pmglab.top/gbc/download/gwas-coordinate.tsv.gz (downloaded from the GWAS Catalog). The commands to complete this task are as follows:

    # Download the data file
    wget https://pmglab.top/gbc/download/gwas-coordinate.tsv.gz -O gwas-coordinate.tsv.gz

    # Run directly in the terminal
    java -jar gbc.jar tsv2gtb ./gwas-coordinate.tsv.gz ./gwas-coordinate.hg19.gtb \
      --chromosome-field chromosome --position-field base_pair_location --allele-field other_allele,effect_allele \
      --field variant_id=string p_value=float odds_ratio=float \
      --rename-field variant_id=ID

    # Run it using docker
    docker run -v `pwd`:`pwd` -w `pwd` --rm -it -m 4g gbc \
      tsv2gtb ./gwas-coordinate.tsv.gz ./gwas-coordinate.hg19.gtb \
      --chromosome-field chromosome --position-field base_pair_location --allele-field other_allele,effect_allele \
      --field variant_id=string p_value=float odds_ratio=float \
      --rename-field variant_id=ID
Program Options
    Usage: tsv2gtb <input> [output] [options]
    Java-API: edu.sysu.pmglab.ccf.TSV2CCF
    About: Compress and build *.gtb (genotype block format) for *.tsv (tab-separated
           values format).
    Options:
      --chromosome        Specify the chromosome tags file. e.g., identify 'X,
                          chrX, CHRX, ChrX' as '(int) 22' chromosome.
                          format: --chromosome <file>
      --threads,-t        Set the number of threads.
                          default: 4
                          format: --threads <int>
      --add-meta          Add the specified metas to the output file.
                          format: --add-meta <key>=<value> <key>=<value> ...
      --chromosome-field  Set the field to identify as the chromosome.
                          default: CHROM
                          format: --chromosome-field <string>
      --position-field    Set the field to identify as the position.
                          default: POS
                          format: --position-field <string>
      --allele-field      Set the fields to identify as the alleles.
                          default: REF,ALT
                          format: --allele-field <string>,<string>,...
      --position-type     Set the position type of the coordinate field.
                          default: 1_based
                          format: --position-type <string> ([0_based/1_based] or
                          [0/1])
    TSV Format Options:
      --separator         Set the bytecode for the delimiter to be used between the
                          different fields. (\t by default)
                          default: 9
                          format: --separator <byte> (-128 ~ 127)
      --auto-format       Automatically identify meta fields in TSV files. Fields
                          starting with '##' are recognized as description information
                          and fields starting with '#' are recognized as header lines.
                          default: true
                          format: --auto-format [true/false]
      --startsWith        Identify the beginning of the specified string as the header
                          line.
                          format: --startsWith <string>
      --skip-line         Number of lines to skip at the start of the file.
                          format: --skip-line <int>
      --noHeaderLine      This dataset has no header row. If this parameter is passed
                          in, the field names are specified as V1,V2,....
      --store-meta        Store meta information.
      --field,-f          Select fields in the dataset and specifies their data types.
                          If the parameter '--noHeaderLine' is passed in, use Vi for
                          the ith field's name.
                          format: --field <field>=<type> <field>=<type> ...
      --rename-field      Reset field names for *.gtb directly.
                          format: --rename-field <old>=<new> ...
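The `--separator` option takes the numeric byte code of the delimiter (9 for a tab, so 44 would be a comma), and `--noHeaderLine` names the fields V1,V2,.... The Java sketch below shows one way these two behaviours could look; it is an illustration under those assumptions, not GBC's implementation, and the class and method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration, not part of GBC.
class HeaderSketch {
    /** Splits one line on a single-byte separator given by its numeric code (9 = tab, 44 = comma). */
    static List<String> split(String line, byte separator) {
        List<String> fields = new ArrayList<>();
        byte[] bytes = line.getBytes(StandardCharsets.UTF_8);
        int start = 0;
        for (int i = 0; i <= bytes.length; i++) {
            // Close the current field at each separator and at the end of the line.
            if (i == bytes.length || bytes[i] == separator) {
                fields.add(new String(bytes, start, i - start, StandardCharsets.UTF_8));
                start = i + 1;
            }
        }
        return fields;
    }

    /** Returns the field names: the header cells, or V1,V2,... when the file has no header line. */
    static List<String> fieldNames(String firstLine, byte separator, boolean hasHeader) {
        List<String> cells = split(firstLine, separator);
        if (hasHeader) {
            return cells;
        }
        List<String> names = new ArrayList<>();
        for (int i = 1; i <= cells.size(); i++) {
            names.add("V" + i);
        }
        return names;
    }

    public static void main(String[] args) {
        System.out.println(fieldNames("chromosome\tbase_pair_location\tp_value", (byte) 9, true));
        System.out.println(fieldNames("1,10583,0.32", (byte) 44, false));
    }
}
```

With generated names, a type selection for a headerless file would refer to fields by position, for example `--field V3=float`.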
API Toolkit
The API tool for converting TSV files to GTB files is `edu.sysu.pmglab.ccf.TSV2CCF`. A CCF file (CCF is the underlying protocol of the GTB format) is recognized as a GTB file provided that:
- it has the mandatory fields `CHROM` and `POS`, both of type `int`;
- if it contains variable allele fields (e.g., REF and ALT in a VCF file), the field is named `ALLELE` and has type `ByteCodeArray`.
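The two conditions above can be sketched as a small schema check. The snippet is a minimal illustration that represents a schema as a map from field name to type name; that representation, and the class and method names, are assumptions made for the example and are not the CCF API.

```java
import java.util.Map;

// Hypothetical illustration, not part of GBC or CCF.
class GtbSchemaSketch {
    /** Returns true if the schema satisfies the GTB conditions described in the text. */
    static boolean looksLikeGtb(Map<String, String> schema) {
        // Mandatory coordinate fields, both of type int.
        boolean hasCoordinates = "int".equals(schema.get("CHROM"))
                && "int".equals(schema.get("POS"));
        // The allele field is optional, but if present it must be a ByteCodeArray.
        String alleleType = schema.get("ALLELE");
        boolean alleleOk = alleleType == null || "ByteCodeArray".equals(alleleType);
        return hasCoordinates && alleleOk;
    }

    public static void main(String[] args) {
        Map<String, String> withAlleles = Map.of(
                "CHROM", "int", "POS", "int", "ALLELE", "ByteCodeArray");
        Map<String, String> missingPos = Map.of("CHROM", "int");
        System.out.println(looksLikeGtb(withAlleles));
        System.out.println(looksLikeGtb(missingPos));
    }
}
```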
We downloaded VarNoteDB_FA_regBase_prediction.gz (total size 232.13 GB) and converted it to GTB format using the following Java program:
    // Convert the TSV archive to a GTB file (hg19 coordinates)
    CCFTable table = TSV2CCF.of("./VarNoteDB_FA_regBase_prediction.gz", new File("./VarNoteDB_FA_regBase_prediction.hg19.gtb"))
        .setAutoMeta(true)
        .setThreads(6)
        // Declare the output fields and their types
        .addField("CHROM", FieldType.Int)
        .addField("POS", FieldType.Int)
        .addField("ALLELE", FieldType.ByteCodeArray)
        .addField("regBase_REG", FieldType.HalfFloatArray)
        .addField("regBase_CAN", FieldType.HalfFloatArray)
        .addField("regBase_PAT", FieldType.HalfFloatArray)
        // Skip rows whose chromosome label is not recognized
        .addTSVFilter(values -> Chromosome.get(values.get("Chrom")) != null)
        // Fill one output record from the parsed TSV values
        .addValueConverter((values, record) -> {
            record.set("CHROM", Chromosome.get(values.get("Chrom")).getChromosomeIndex());
            record.set("POS", values.get("Pos_end").toInt());
            record.set("ALLELE", new ByteCode[]{values.get("Ref"), values.get("Alts")});
            record.set("regBase_REG", new Float[]{values.get("REG").toFloat(), values.get("REG_PHRED").toFloat()});
            record.set("regBase_CAN", new Float[]{values.get("CAN").toFloat(), values.get("CAN_PHRED").toFloat()});
            record.set("regBase_PAT", new Float[]{values.get("PAT").toFloat(), values.get("PAT_PHRED").toFloat()});
        })
        // Describe each score field in the file metadata
        .addMeta("regBase_REG", "<Type=HalfFloatArray,Source=\"http://2792wttzz8.xuduan.vip/VarNoteDB/hg19/VarNoteDB_FA_regBase_prediction/VarNoteDB_FA_regBase_prediction.gz\",Description=\"regBase_REG is a composite prediction model to score functional SNVs from existing tools for base-wise annotation of human genome. The first element of the array is the predicted score value provided by the database, and the second element is the phred-like score value.\">")
        .addMeta("regBase_CAN", "<Type=HalfFloatArray,Source=\"http://2792wttzz8.xuduan.vip/VarNoteDB/hg19/VarNoteDB_FA_regBase_prediction/VarNoteDB_FA_regBase_prediction.gz\",Description=\"regBase_CAN is a composite prediction model to score cancer driver non-coding regulatory SNVs from existing tools for base-wise annotation of human genome. The first element of the array is the predicted score value provided by the database, and the second element is the phred-like score value.\">")
        .addMeta("regBase_PAT", "<Type=HalfFloatArray,Source=\"http://2792wttzz8.xuduan.vip/VarNoteDB/hg19/VarNoteDB_FA_regBase_prediction/VarNoteDB_FA_regBase_prediction.gz\",Description=\"regBase_PAT is a composite prediction model to score pathogenic SNVs from existing tools for base-wise annotation of human genome. The larger the score the more likely the SNP has damaging effect. The first element of the array is the predicted score value provided by the database, and the second element is the phred-like score value.\">")
        .convert();

    // Lift over from hg19 to hg38 and sort
    GTBExporter.of("./VarNoteDB_FA_regBase_prediction.hg19.gtb")
        .liftOver(RefGenomeVersion.hg19, RefGenomeVersion.hg38)
        .sort(true)
        .setOutputFile(new File("./VarNoteDB_FA_regBase_prediction.hg38.gtb"))
        .setThreads(6)
        .submit();
The sizes of the generated files are:
- VarNoteDB_FA_regBase_prediction.hg19.gtb: 83.50 GB
- VarNoteDB_FA_regBase_prediction.hg38.gtb: 83.45 GB
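As a quick arithmetic check on these figures, the sketch below computes the fraction of space saved relative to the 232.13 GB input. The class and method names are hypothetical. Bear in mind that the input is a gzip archive and the program above stores only the selected fields, so the ratio reflects the whole conversion, not the encoding alone.

```java
// Hypothetical illustration, not part of GBC.
class SizeRatioSketch {
    /** Fraction of space saved when a file shrinks from originalGb to convertedGb. */
    static double savedFraction(double originalGb, double convertedGb) {
        return 1.0 - convertedGb / originalGb;
    }

    public static void main(String[] args) {
        double original = 232.13; // VarNoteDB_FA_regBase_prediction.gz
        System.out.printf("hg19: %.3f%n", savedFraction(original, 83.50));
        System.out.printf("hg38: %.3f%n", savedFraction(original, 83.45));
    }
}
```

Both files come out at roughly 64% smaller than the input.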