Build GTB for TSV

Many coordinate-based genomic databases are distributed as TSV or CSV files (e.g., dbNSFP, GWAS Catalog, regBase), and these can also be converted to GTB format. Building the GTB with an appropriate numeric encoding for each field saves more than 30% of storage space on average and significantly improves access and retrieval performance. To build a GTB archive from a TSV file, run the following command:

java -jar gbc.jar tsv2gtb <input> [output] [options]
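
The storage saving mentioned above comes largely from replacing text cells with fixed-width numeric encodings. The following is a minimal, self-contained sketch of that idea only; it is not GBC's actual encoder, and the class name `EncodingSizeSketch` and the sample values are assumptions made for illustration. Note that a 32-bit float keeps only about 7 significant decimal digits, so the encoding should match the precision the data actually needs.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Hypothetical illustration: compares the size of a numeric TSV cell stored
// as text with the size of the same value stored as a 32-bit float.
public class EncodingSizeSketch {

    /** Bytes needed to store the cell as plain UTF-8 text, as in a TSV file. */
    static int textBytes(String cell) {
        return cell.getBytes(StandardCharsets.UTF_8).length;
    }

    /** Bytes needed to store the cell as a 32-bit IEEE 754 float. */
    static int floatBytes(String cell) {
        ByteBuffer buffer = ByteBuffer.allocate(Float.BYTES);
        buffer.putFloat(Float.parseFloat(cell));
        return buffer.position();
    }

    public static void main(String[] args) {
        // Made-up sample cells, not taken from any real database.
        String[] cells = {"0.000123456", "1.2345678", "0.05"};
        for (String cell : cells) {
            System.out.println(cell + ": text=" + textBytes(cell)
                    + " bytes, float=" + floatBytes(cell) + " bytes");
        }
    }
}
```

The text form also needs a delimiter per cell, which the sketch does not count.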

[!NOTE|label:Example|style:callout]

Use GBC to build an archive for the example file https://pmglab.top/gbc/download/gwas-coordinate.tsv.gz, which was downloaded from the GWAS Catalog. The commands for this task are as follows:

# Download the data file
wget https://pmglab.top/gbc/download/gwas-coordinate.tsv.gz -O gwas-coordinate.tsv.gz

# Run directly in the terminal
java -jar gbc.jar tsv2gtb ./gwas-coordinate.tsv.gz ./gwas-coordinate.hg19.gtb \
                          --chromosome-field chromosome --position-field base_pair_location --allele-field other_allele,effect_allele \
                          --field variant_id=string p_value=float odds_ratio=float \
                          --rename-field variant_id=ID

# Run it using docker
docker run -v `pwd`:`pwd` -w `pwd` --rm -it -m 4g gbc \
tsv2gtb ./gwas-coordinate.tsv.gz ./gwas-coordinate.hg19.gtb \
        --chromosome-field chromosome --position-field base_pair_location --allele-field other_allele,effect_allele \
        --field variant_id=string p_value=float odds_ratio=float \
        --rename-field variant_id=ID

Program Options

Usage: tsv2gtb <input> [output] [options]
Java-API: edu.sysu.pmglab.ccf.TSV2CCF
About: Compress and build *.gtb (genotype block format) for *.tsv (tab-separated
       values format).
Options:
  --chromosome        Specify the chromosome tags file. e.g., identify 'X, 
                      chrX, CHRX, ChrX' as '(int) 22' chromosome.
                      format: --chromosome <file>
  --threads,-t        Set the number of threads.
                      default: 4
                      format: --threads <int>
  --add-meta          Add the specified metas to the output file.
                      format: --add-meta <key>=<value> <key>=<value> ...
  --chromosome-field  Set the field to identify as the chromosome.
                      default: CHROM
                      format: --chromosome-field <string>
  --position-field    Set the field to identify as the position.
                      default: POS
                      format: --position-field <string>
  --allele-field      Set the fields to identify as the alleles.
                      default: REF,ALT
                      format: --allele-field <string>,<string>,...
  --position-type     Set the position type of the coordinate field.
                      default: 1_based
                      format: --position-type <string> ([0_based/1_based] or 
                      [0/1]) 
TSV Format Options:
  --separator     Set the bytecode for the delimiter to be used between the 
                  different fields. (\t by default)
                  default: 9
                  format: --separator <byte> (-128 ~ 127)
  --auto-format   Automatically identify meta fields in TSV files. Fields 
                  starting with '##' are recognized as description information 
                  and fields starting with '#' are recognized as header lines.
                  default: true
                  format: --auto-format [true/false]
  --startsWith    Identify the beginning of the specified string as the header 
                  line. 
                  format: --startsWith <string>
  --skip-line     Number of lines to skip at the start of the file.
                  format: --skip-line <int>
  --noHeaderLine  This dataset has no header row. If this parameter is passed 
                  in, the field names are specified as V1,V2,....
  --store-meta    Store meta information.
  --field,-f      Select fields in the dataset and specifies their data types. 
                  If the parameter '--noHeaderLine' is passed in, use Vi for 
                  the ith field's name.
                  format: --field <field>=<type> <field>=<type> ...
  --rename-field  Reset field names for *.gtb directly.
                  format: --rename-field <old>=<new> ...
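
The `--separator` option takes the delimiter as a byte value rather than as a character (the default 9 is the tab character). The sketch below shows one way to look up the byte value of other common delimiters; the class name `SeparatorCodeSketch` is hypothetical and not part of GBC.

```java
// Hypothetical helper: prints the byte values of common field delimiters,
// which is the form that the --separator option expects.
public class SeparatorCodeSketch {

    /** Returns the byte value of a single-byte (ASCII) delimiter. */
    static byte codeOf(char delimiter) {
        if (delimiter > 127) {
            throw new IllegalArgumentException("Not a single-byte ASCII delimiter: " + delimiter);
        }
        return (byte) delimiter;
    }

    public static void main(String[] args) {
        char[] delimiters = {'\t', ',', ';', '|', ' '};
        String[] names = {"tab", "comma", "semicolon", "pipe", "space"};
        for (int i = 0; i < delimiters.length; i++) {
            System.out.println(names[i] + " -> " + codeOf(delimiters[i]));
        }
    }
}
```

Under this reading of the option, a comma-separated file would be converted by adding `--separator 44` to the tsv2gtb command.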

API Toolkit

The API tool for converting TSV files to GTB files is edu.sysu.pmglab.ccf.TSV2CCF. The CCF format is the underlying protocol of the GTB format, and a CCF file is recognized as a GTB file provided that:

  • it has the mandatory fields CHROM and POS, both of type int;
  • if it contains allele fields (e.g., REF and ALT in a VCF file), they are stored in a field named ALLELE of type ByteCodeArray.
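
The two conditions above can be expressed as a simple schema check. The sketch below is a standalone illustration that models a schema as a map from field name to type name; it does not call the GBC API, and the class `GtbSchemaCheckSketch` and its string type names (mirroring `FieldType.Int` and `FieldType.ByteCodeArray`) are assumptions.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration of the GTB recognition rules, using plain strings
// for field types instead of the real GBC FieldType class.
public class GtbSchemaCheckSketch {

    /**
     * Returns true when CHROM and POS are both present with type Int and,
     * if an ALLELE field is present, it has type ByteCodeArray.
     * Alleles stored under any other name cannot be detected by this check.
     */
    static boolean looksLikeGtb(Map<String, String> fieldTypes) {
        boolean hasCoordinates = "Int".equals(fieldTypes.get("CHROM"))
                && "Int".equals(fieldTypes.get("POS"));
        boolean alleleOk = !fieldTypes.containsKey("ALLELE")
                || "ByteCodeArray".equals(fieldTypes.get("ALLELE"));
        return hasCoordinates && alleleOk;
    }

    public static void main(String[] args) {
        Map<String, String> schema = new LinkedHashMap<>();
        schema.put("CHROM", "Int");
        schema.put("POS", "Int");
        schema.put("ALLELE", "ByteCodeArray");
        schema.put("regBase_REG", "HalfFloatArray");
        System.out.println("Recognized as GTB: " + looksLikeGtb(schema));
    }
}
```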

We downloaded VarNoteDB_FA_regBase_prediction.gz (total size 232.13 GB) and converted it to GTB format using the following Java program:

CCFTable table = TSV2CCF.of("./VarNoteDB_FA_regBase_prediction.gz", new File("./VarNoteDB_FA_regBase_prediction.hg19.gtb"))
        .setAutoMeta(true)
        .setThreads(6)
        .addField("CHROM", FieldType.Int)
        .addField("POS", FieldType.Int)
        .addField("ALLELE", FieldType.ByteCodeArray)
        .addField("regBase_REG", FieldType.HalfFloatArray)
        .addField("regBase_CAN", FieldType.HalfFloatArray)
        .addField("regBase_PAT", FieldType.HalfFloatArray)
        // Keep only rows whose "Chrom" value maps to a known chromosome
        .addTSVFilter(values -> Chromosome.get(values.get("Chrom")) != null)
        // Map the source TSV columns onto the GTB fields declared above
        .addValueConverter((values, record) -> {
            record.set("CHROM", Chromosome.get(values.get("Chrom")).getChromosomeIndex());
            record.set("POS", values.get("Pos_end").toInt());
            record.set("ALLELE", new ByteCode[]{values.get("Ref"), values.get("Alts")});
            // Each score field stores the pair {raw score, PHRED score}
            record.set("regBase_REG", new Float[]{values.get("REG").toFloat(), values.get("REG_PHRED").toFloat()});
            record.set("regBase_CAN", new Float[]{values.get("CAN").toFloat(), values.get("CAN_PHRED").toFloat()});
            record.set("regBase_PAT", new Float[]{values.get("PAT").toFloat(), values.get("PAT_PHRED").toFloat()});
        })
        .addMeta("regBase_REG", "<Type=HalfFloatArray,Source=\"http://2792wttzz8.xuduan.vip/VarNoteDB/hg19/VarNoteDB_FA_regBase_prediction/VarNoteDB_FA_regBase_prediction.gz\",Description=\"regBase_REG is a composite prediction model to score functional SNVs from existing tools for base-wise annotation of human genome. The first element of the array is the predicted score value provided by the database, and the second element is the phred-like score value.\">")
        .addMeta("regBase_CAN", "<Type=HalfFloatArray,Source=\"http://2792wttzz8.xuduan.vip/VarNoteDB/hg19/VarNoteDB_FA_regBase_prediction/VarNoteDB_FA_regBase_prediction.gz\",Description=\"regBase_CAN is a composite prediction model to score cancer driver non-coding regulatory SNVs from existing tools for base-wise annotation of human genome. The first element of the array is the predicted score value provided by the database, and the second element is the phred-like score value.\">")
        .addMeta("regBase_PAT", "<Type=HalfFloatArray,Source=\"http://2792wttzz8.xuduan.vip/VarNoteDB/hg19/VarNoteDB_FA_regBase_prediction/VarNoteDB_FA_regBase_prediction.gz\",Description=\"regBase_PAT is a composite prediction model to score pathogenic SNVs from existing tools for base-wise annotation of human genome. The larger the score the more likely the SNP has damaging effect. The first element of the array is the predicted score value provided by the database, and the second element is the phred-like score value.\">")
        .convert();

// liftover and sort
GTBExporter.of("./VarNoteDB_FA_regBase_prediction.hg19.gtb")
        .liftOver(RefGenomeVersion.hg19, RefGenomeVersion.hg38)
        .sort(true)
        .setOutputFile(new File("./VarNoteDB_FA_regBase_prediction.hg38.gtb"))
        .setThreads(6).submit();

The sizes of the generated files are:

  • VarNoteDB_FA_regBase_prediction.hg19.gtb: 83.50 GB
  • VarNoteDB_FA_regBase_prediction.hg38.gtb: 83.45 GB
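
Relative to the 232.13 GB download, both archives are roughly 64% smaller. The short sketch below reproduces that arithmetic from the sizes reported above; the class name `SizeReductionSketch` is hypothetical.

```java
// Hypothetical helper: computes the fraction of storage saved by conversion,
// using the file sizes reported in this section.
public class SizeReductionSketch {

    /** Fraction of the original size that is saved (0.0 means no saving). */
    static double savedFraction(double originalGb, double convertedGb) {
        return 1.0 - convertedGb / originalGb;
    }

    public static void main(String[] args) {
        double originalGb = 232.13; // VarNoteDB_FA_regBase_prediction.gz
        System.out.println("hg19 saved fraction: " + savedFraction(originalGb, 83.50));
        System.out.println("hg38 saved fraction: " + savedFraction(originalGb, 83.45));
    }
}
```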

Copyright © Liubin Zhang, all rights reserved. Last modified time: 2023-04-12 08:27:06