GTB Indexer

The GTB indexer records the coordinate range and pointer range of each chromosome. The index is used to quickly determine whether the corresponding GTB/CCF file is ordered and to locate the pointer interval that holds the requested data. Use the following command to build an index for a GTB/CCF file:

java -jar gbc.jar index <input> [options]

An index file is not mandatory for the GTB toolkits, because reading coordinates from a GTB file is very fast (genome-wide coordinates can usually be read in less than 2 seconds). However, for widely used coordinate-based databases (non-VCF formats) and coordinate-interval databases (e.g., transcript- or gene-based databases), the index file helps to quickly determine whether the file is coordinate-ordered and provides an access interface uniform with that of GTB. Specifically, for regular GTB files, retrieving specified coordinates requires only a search within a defined range because the coordinates are ordered at the chromosome level, whereas for CCF files, the index helps GBC programs quickly find the search cutoff points.
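To make the idea of a chromosome-level index concrete, the following is a minimal sketch, not GBC's implementation. It assumes that "ordered" means each chromosome's records are contiguous with non-decreasing positions, and that a pointer is simply the ordinal of a record in the file. The class `ChromosomeIndexSketch`, its `Range` holder, and the `add` method are hypothetical names introduced only for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class ChromosomeIndexSketch {
    /** Hypothetical per-chromosome summary; not a GBC class. */
    static final class Range {
        long minPos, maxPos;   // closed coordinate range [minPos, maxPos]
        long startPtr, endPtr; // half-open pointer range [startPtr, endPtr)

        Range(long pos, long ptr) {
            minPos = maxPos = pos;
            startPtr = ptr;
            endPtr = ptr + 1;
        }
    }

    final Map<String, Range> ranges = new LinkedHashMap<>();
    boolean ordered = true;
    private String lastChrom = null;
    private long lastPos = 0;
    private long pointer = 0; // assumption: a pointer is the record's ordinal

    /** Feed records in file order. */
    void add(String chrom, long pos) {
        Range r = ranges.get(chrom);
        if (r == null) {
            ranges.put(chrom, new Range(pos, pointer));
        } else {
            // A chromosome seen before must be the one just read,
            // and its position must not decrease.
            if (!chrom.equals(lastChrom) || pos < lastPos) {
                ordered = false;
            }
            r.minPos = Math.min(r.minPos, pos);
            r.maxPos = Math.max(r.maxPos, pos);
            r.endPtr = pointer + 1;
        }
        lastChrom = chrom;
        lastPos = pos;
        pointer++;
    }

    public static void main(String[] args) {
        ChromosomeIndexSketch index = new ChromosomeIndexSketch();
        index.add("chr1", 100);
        index.add("chr1", 250);
        index.add("chr2", 50);
        index.add("chr2", 75);
        for (Map.Entry<String, Range> e : index.ranges.entrySet()) {
            Range r = e.getValue();
            System.out.printf("%s\t[%d, %d]\t[%d, %d)%n",
                    e.getKey(), r.minPos, r.maxPos, r.startPtr, r.endPtr);
        }
        System.out.println("ordered = " + index.ordered);
    }
}
```

With such a summary, a query for one chromosome only needs to search inside that chromosome's pointer range, and the `ordered` flag tells a reader whether a range-restricted search is valid at all.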

The chromosome tags for GTB indexes follow the Chromosome Tag Declaration, and the corresponding chromosome declaration file needs to be specified when constructing indexes for GTB files of non-human genomes.
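The format of the chromosome declaration file is not shown in this section, so the following sketch only illustrates the general idea of mapping several tag spellings to one integer chromosome index. It reuses the X example from the `--chromosome` option help text; `ChromosomeTagSketch`, `declare`, and `indexOf` are hypothetical names and not part of the GBC API.

```java
import java.util.HashMap;
import java.util.Map;

public class ChromosomeTagSketch {
    // Maps every declared spelling of a chromosome tag to its integer index.
    private final Map<String, Integer> aliasToIndex = new HashMap<>();

    /** Register all spellings that should resolve to the same chromosome index. */
    void declare(int index, String... aliases) {
        for (String alias : aliases) {
            aliasToIndex.put(alias, index);
        }
    }

    /** Returns the declared index for a tag, or -1 if the tag was never declared. */
    int indexOf(String tag) {
        return aliasToIndex.getOrDefault(tag, -1);
    }

    public static void main(String[] args) {
        ChromosomeTagSketch tags = new ChromosomeTagSketch();
        // Example taken from the option help text: X, chrX, CHRX, ChrX -> 22.
        tags.declare(22, "X", "chrX", "CHRX", "ChrX");
        System.out.println(tags.indexOf("chrX"));
        System.out.println(tags.indexOf("scaffold_1"));
    }
}
```

A tag that is missing from the declaration resolves to -1 here, which mirrors why a declaration file is needed for non-human genomes: their chromosome names would otherwise be unknown to the indexer.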

Example I

Build an index for 1000GP3-EAS-chr4:

# Download the data file
wget https://pmglab.top/gbc/download/1kg.phase3.v5.shapeit2.eas.hg19.chr4.gtb

# Run directly in the terminal
java -jar gbc.jar index 1kg.phase3.v5.shapeit2.eas.hg19.chr4.gtb 

# Run with Docker
docker run -v `pwd`:`pwd` -w `pwd` --rm -it -m 4g gbc \
index 1kg.phase3.v5.shapeit2.eas.hg19.chr4.gtb

The terminal outputs the following message:

2023-04-02 08:52:40 INFO  [main] GBC - Command Line Interface Succeeded to build index for 1kg.phase3.v5.shapeit2.eas.hg19.chr4.gtb
2023-04-02 08:52:40 INFO  [main] GBC - Command Line Interface Total Processing time: 0.837 s
##Source=1kg.phase3.v5.shapeit2.eas.hg19.chr4.gtb
##SourceFileSize=31.340 MB
##SourceLastModifiedTime=2023/04/01 03:17:15
#CHROM    POS    POINTER
chr4    [10006, 191043882]    [0, 5732585)
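Each data row of the printed index lists a chromosome, a closed coordinate range, and a half-open pointer range. As a rough illustration of how such a row can be consumed, the sketch below parses the row shown above with a regular expression. It assumes whitespace-separated columns exactly as printed; `IndexLineDemo`, `IndexEntry`, and `parse` are hypothetical names and do not belong to the GBC API.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class IndexLineDemo {
    /** Hypothetical holder for one index row; not a GBC class. */
    static final class IndexEntry {
        final String chromosome;
        final long minPos, maxPos;             // closed range [minPos, maxPos]
        final long pointerStart, pointerEnd;   // half-open range [pointerStart, pointerEnd)

        IndexEntry(String chromosome, long minPos, long maxPos, long pointerStart, long pointerEnd) {
            this.chromosome = chromosome;
            this.minPos = minPos;
            this.maxPos = maxPos;
            this.pointerStart = pointerStart;
            this.pointerEnd = pointerEnd;
        }

        /** True if the position lies inside the indexed coordinate range. */
        boolean mayContain(long pos) {
            return pos >= minPos && pos <= maxPos;
        }
    }

    // Matches rows such as: chr4    [10006, 191043882]    [0, 5732585)
    private static final Pattern ROW = Pattern.compile(
            "^(\\S+)\\s+\\[(\\d+),\\s*(\\d+)\\]\\s+\\[(\\d+),\\s*(\\d+)\\)$");

    static IndexEntry parse(String line) {
        Matcher m = ROW.matcher(line.trim());
        if (!m.matches()) {
            throw new IllegalArgumentException("Not an index row: " + line);
        }
        return new IndexEntry(m.group(1),
                Long.parseLong(m.group(2)), Long.parseLong(m.group(3)),
                Long.parseLong(m.group(4)), Long.parseLong(m.group(5)));
    }

    public static void main(String[] args) {
        IndexEntry entry = parse("chr4    [10006, 191043882]    [0, 5732585)");
        System.out.println(entry.chromosome + " may contain 55000000: " + entry.mayContain(55_000_000L));
        System.out.println(entry.chromosome + " may contain 5000: " + entry.mayContain(5_000L));
    }
}
```

A position outside the coordinate range, such as 5000 here, can be rejected without touching the data file at all.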

Program Options

Usage: index <input> [options]
Java-API: edu.sysu.pmglab.gbc.GTBIndexer
About: Construct chromosome-level index tree for coordinate-ordered *.gtb
       or *.ccf to improve access speed.
Options:
  --output,-o            Set the output file.
                         format: --output <file>
  --chromosome           Specify the chromosome tags file. e.g., identify 'X, 
                         chrX, CHRX, ChrX' as '(int) 22' chromosome.
                         format: --chromosome <file>
  --threads,-t           Set the number of threads.
                         default: 4
                         format: --threads <int>
  --coordinate-field,-c  Set the field to identify as the coordinate (i.e., 
                         chromosome and position).
                         default: CHROM,POS
                         format: --coordinate-field <CHROM>,<POS>
  --position-type,-p     Set the position type of the coordinate field.
                         default: 1_based
                         format: --position-type <string> ([0_based/1_based] or 
                         [0/1])

API Toolkit

The API class for managing an index is edu.sysu.pmglab.gbc.GTBIndexer, which provides methods for checking whether a file contains a specified chromosome and for querying the coordinate range and pointer range of a specified chromosome, among others. The API class for building and loading index files is edu.sysu.pmglab.gbc.GTBIndexer.Builder. The following example builds an index for a GTB file:

import edu.sysu.pmglab.gbc.GTBIndexer;
// Build an index for the GTB file at the given path.
GTBIndexer.Builder.of("/Users/suranyi/project/GBC/GBC-stable-1.0/docs/web-docs/1kg.phase3.v5.shapeit2.eas.hg19.chr4.gtb")
        .build();

For database files in CCF format, use .setCoordinateFields to set the fields identified as genomic coordinates and .mapRecordToVariant to map records to variants.

Copyright © Liubin Zhang, all rights reserved. Last modified time: 2023-04-10 15:00:51
