GTB Indexer
The GTB indexer records the coordinate range and pointer range of each chromosome. The index makes it possible to quickly determine whether the corresponding GTB/CCF file is ordered and to locate the pointer interval that contains the requested data. Use the following command to build an index for a GTB/CCF file:
java -jar gbc.jar index <input> [output] [options]
The index file is not mandatory in the GTB toolkits, because reading coordinates from a GTB file is very fast (genome-wide coordinates can usually be read in less than 2 seconds). However, for widely used coordinate-based databases in non-VCF formats and for coordinate-interval databases (e.g., transcript- or gene-based databases), the index file helps to quickly determine whether the file is coordinate-ordered and provides an access interface that is uniform with GTB. Specifically, for regular GTB files, retrieving specified coordinates requires only a search within a defined range, because the coordinates are ordered at the chromosome level; for CCF files, the index helps GBC programs quickly find search cutoff points.
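To illustrate the idea of a chromosome-level index, the following is a minimal sketch, not the GBC implementation. The class `ChromosomeIndexSketch`, its nested `Entry` type, and the `pointerRange` method are hypothetical names introduced here. The sketch assumes that each chromosome has one closed coordinate range and one half-open pointer range, as in the index output printed by the `index` command, and it uses the chr4 values from the example on this page.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch of a chromosome-level index; not the GBC API.
class ChromosomeIndexSketch {
    // One entry per chromosome: closed coordinate range [minPos, maxPos]
    // and half-open pointer range [pointerStart, pointerEnd).
    static final class Entry {
        final int minPos;
        final int maxPos;
        final long pointerStart;
        final long pointerEnd;

        Entry(int minPos, int maxPos, long pointerStart, long pointerEnd) {
            this.minPos = minPos;
            this.maxPos = maxPos;
            this.pointerStart = pointerStart;
            this.pointerEnd = pointerEnd;
        }

        // True when the closed query interval [start, end] overlaps this entry.
        boolean overlaps(int start, int end) {
            return start <= maxPos && end >= minPos;
        }
    }

    private final Map<String, Entry> entries = new LinkedHashMap<>();

    void put(String chromosome, Entry entry) {
        entries.put(chromosome, entry);
    }

    // Returns {pointerStart, pointerEnd} to search, or null when the
    // chromosome is absent or the query lies outside its coordinate range.
    long[] pointerRange(String chromosome, int start, int end) {
        Entry e = entries.get(chromosome);
        if (e == null || !e.overlaps(start, end)) {
            return null;
        }
        return new long[] {e.pointerStart, e.pointerEnd};
    }

    public static void main(String[] args) {
        ChromosomeIndexSketch index = new ChromosomeIndexSketch();
        index.put("chr4", new Entry(10006, 191043882, 0L, 5732585L));

        long[] range = index.pointerRange("chr4", 100000, 200000);
        System.out.println(range[0] + " " + range[1]);
        System.out.println(index.pointerRange("chr5", 1, 100) == null);
    }
}
```

A query that misses the index entirely can be rejected without touching the data file, and a query that hits only needs to be searched inside the returned pointer interval.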
Chromosome tags in GTB indexes follow the Chromosome Tag Declaration. When building an index for a GTB file from a non-human genome, specify the corresponding chromosome declaration file with the --chromosome option.
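The role of a chromosome tag declaration can be pictured as a lookup from several spellings of a chromosome name to one integer index. The sketch below is only an illustration under that assumption. `ChromosomeTagSketch`, `declare`, and `indexOf` are hypothetical names, and the code does not read or reproduce the actual declaration file format. The alias set and the index 22 for X come from the `--chromosome` option description.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of alias-to-index resolution; not the GBC API.
class ChromosomeTagSketch {
    private final Map<String, Integer> aliasToIndex = new HashMap<>();

    // Registers every alias as a spelling of the same chromosome index.
    void declare(int index, String... aliases) {
        for (String alias : aliases) {
            aliasToIndex.put(alias, index);
        }
    }

    // Returns the declared index, or -1 when the tag was never declared.
    int indexOf(String tag) {
        return aliasToIndex.getOrDefault(tag, -1);
    }

    public static void main(String[] args) {
        ChromosomeTagSketch tags = new ChromosomeTagSketch();
        tags.declare(22, "X", "chrX", "CHRX", "ChrX");

        System.out.println(tags.indexOf("chrX"));
        System.out.println(tags.indexOf("scaffold_1"));
    }
}
```

An undeclared tag resolving to -1 shows why a declaration file is needed for non-human genomes: names outside the declared set cannot be placed in the index.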
[!NOTE|label:Example I|style:callout]
Build an index for 1000GP3-EAS-chr4:
# Download the data file
wget https://pmglab.top/gbc/download/1kg.phase3.v5.shapeit2.eas.hg19.chr4.gtb

# Run directly in the terminal
java -jar gbc.jar index 1kg.phase3.v5.shapeit2.eas.hg19.chr4.gtb

# Run it using docker
docker run -v `pwd`:`pwd` -w `pwd` --rm -it -m 4g gbc \
index 1kg.phase3.v5.shapeit2.eas.hg19.chr4.gtb
The terminal outputs the following message:
2023-04-02 08:52:40 INFO [main] GBC - Command Line Interface Succeeded to build index for 1kg.phase3.v5.shapeit2.eas.hg19.chr4.gtb
2023-04-02 08:52:40 INFO [main] GBC - Command Line Interface Total Processing time: 0.837 s
##Source=1kg.phase3.v5.shapeit2.eas.hg19.chr4.gtb
##SourceFileSize=31.340 MB
##SourceLastModifiedTime=2023/04/01 03:17:15
#CHROM POS POINTER
chr4 [10006, 191043882] [0, 5732585)
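The printed index can also be read back programmatically. The following is a rough sketch that parses one data row of the output shown above. It assumes the columns are separated by whitespace exactly as printed, which may differ from the on-disk index file. `IndexLineParserSketch` and `parse` are hypothetical names, not part of GBC.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical parser for rows such as: chr4 [10006, 191043882] [0, 5732585)
class IndexLineParserSketch {
    // Groups: chromosome, minPos, maxPos, pointerStart, pointerEnd.
    private static final Pattern ROW = Pattern.compile(
            "^(\\S+)\\s+\\[(\\d+),\\s*(\\d+)\\]\\s+\\[(\\d+),\\s*(\\d+)\\)$");

    // Returns {chromosome, minPos, maxPos, pointerStart, pointerEnd} as strings,
    // or null for header lines (starting with '#') and rows that do not match.
    static String[] parse(String line) {
        String trimmed = line.trim();
        if (trimmed.startsWith("#")) {
            return null;
        }
        Matcher m = ROW.matcher(trimmed);
        if (!m.matches()) {
            return null;
        }
        return new String[] {m.group(1), m.group(2), m.group(3), m.group(4), m.group(5)};
    }

    public static void main(String[] args) {
        String[] row = parse("chr4 [10006, 191043882] [0, 5732585)");
        System.out.println(String.join(" | ", row));
        System.out.println(parse("#CHROM POS POINTER") == null);
    }
}
```

The coordinate range is written with closed brackets and the pointer range with a half-open bracket, so the pattern matches `]` for the first interval and `)` for the second.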
Program Options
Usage: index <input> [options]
Java-API: edu.sysu.pmglab.gbc.GTBIndexer
About: Construct chromosome-level index tree for coordinate-ordered *.gtb
       or *.ccf to improve access speed.
Options:
  --output,-o            Set the output file.
                         format: --output <file>
  --chromosome           Specify the chromosome tags file. e.g., identify 'X,
                         chrX, CHRX, ChrX' as '(int) 22' chromosome.
                         format: --chromosome <file>
  --threads,-t           Set the number of threads.
                         default: 4
                         format: --threads <int>
  --coordinate-field,-c  Set the field to identify as the coordinate (i.e.,
                         chromosome and position).
                         default: CHROM,POS
                         format: --coordinate-field <CHROM>,<POS>
  --position-type,-p     Set the position type of the coordinate field.
                         default: 1_based
                         format: --position-type <string> ([0_based/1_based] or
                         [0/1])
API Toolkit
The API tool for managing the index is edu.sysu.pmglab.gbc.GTBIndexer, which provides methods for determining whether a file contains a specified chromosome and for retrieving the coordinate range and pointer range of a specified chromosome, among others. The API tool for building and loading index files is edu.sysu.pmglab.gbc.GTBIndexer.Builder. The following example builds an index for a GTB file:
GTBIndexer.Builder.of("/Users/suranyi/project/GBC/GBC-stable-1.0/docs/web-docs/1kg.phase3.v5.shapeit2.eas.hg19.chr4.gtb")
        .build();
For database files in CCF format, use .setCoordinateFields to set the fields identified as genomic coordinates and .mapRecordToVariant to map records to variants.