IO Methods
Introduction
Regular file IO streams (or methods) describe the basic information of a file with the File class and use a single byte as the smallest unit of data to read and write. GBC, by contrast, describes the basic information of a GTB file with the GTBManager class and uses a single variant as the smallest unit of data to read and write. The following is a list of the most commonly used APIs:
- edu.sysu.pmglab.gbc.variant.Variant: Standard variant object.
- edu.sysu.pmglab.gbc.variant.Variants: Collection object for storing multiple variants with the same coordinates.
- edu.sysu.pmglab.gbc.GTBIndexer: GTB indexer for quickly locating the range (pointer and position) of the variants on a given chromosome.
- edu.sysu.pmglab.gbc.GTBManager: Describes the basic information of GTB.
- edu.sysu.pmglab.gbc.GTBWriter: Write Variant objects to GTB.
- edu.sysu.pmglab.gbc.GTBPartWriter: Write multiple variants to GTB in parallel.
- edu.sysu.pmglab.gbc.GTBReader: Read variants from GTB.
- edu.sysu.pmglab.gbc.GTBFilter: Apply quality control conditions (AC, AN, coordinate range, etc.) at the level of the variant to the GTBReader.
- edu.sysu.pmglab.gbc.GTBFormat: Describe the basic format of GTB.
- edu.sysu.pmglab.gbc.GTBMeta: Used to manage the meta information of GTB.
The following is a simple example of GTB-based IO. It reads the variants from https://pmglab.top/gbc/download/assoc.hg19.gtb, keeps only the variants with coordinates in the range chr1:0-200000, and performs liftover from hg19 to hg38:
// load GTB
GTBManager manager = GTBManager.load("https://pmglab.top/gbc/download/assoc.hg19.gtb");
// initialize and create GTBWriter; every variant written must carry the two added fields
GTBWriter writer = GTBWriter.Builder.of(new File("./assoc.hg19.gtb"))
        .addField("hg38_CHROM", FieldType.Int)
        .addField("hg38_POS", FieldType.Int)
        .addSubjects(manager.getSubjects())
        .build()
        .writeMeta(manager.getMeta());
// create GTBReader
GTBReader reader = new GTBReader(manager);
// create filter
GTBFilter filter = new GTBFilter().filterByCoordinateRange(Chromosome.get("chr1"), new Interval<>(0, 200000));
// initialize liftover from hg19 to hg38
LiftOver liftOver = LiftOver.load(RefGenomeVersion.hg19, RefGenomeVersion.hg38);
Variant variant;
// read the variants that satisfy the filter conditions
while ((variant = reader.read(filter)) != null) {
    // lift over the coordinate and record its mapped position
    Entry<Chromosome, Integer> hg38Coordinates = liftOver.convertCoordinate(variant.getChromosome(), variant.getPosition());
    // assumption: convertCoordinate returns null for coordinates without an hg38 mapping
    // (the GTBFilter example checks its result against null), so such variants are skipped
    if (hg38Coordinates == null) {
        continue;
    }
    variant.setProperty("hg38_CHROM", hg38Coordinates.getKey().getChromosomeIndex());
    variant.setProperty("hg38_POS", hg38Coordinates.getValue());
    writer.write(variant);
}
reader.close();
writer.close();
GTBFormat
The GTBFormat includes 3 parameters: compression level, maximum number of variants per block, and genotype status (phased or unphased).
- The higher the compression level, the higher the compression ratio and the slower the compression speed.
- The more variants per block, the higher the compression ratio, at the cost of slower access and higher memory overhead when reading and writing.
To ensure the feasibility of parallel operation of large-scale genotype data, each genotype block in a GTB is no larger than 128 MB in size.
In addition, GTBFormat provides the following two common methods for normalizing supplemental field names and subject names:
- checkSubjectName(String subjectName): Check the subject name. Invisible characters (such as spaces, tabs, and line breaks), backslashes, equal signs, colons, semicolons, commas, and single and double quotes are replaced with underscores (_).
- fieldNameChecker(String fieldName): Check whether the field name is valid. Field names are not allowed to be CHROM, POS, ALLELE, or GENOTYPE.
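The two rules above can be illustrated with a small standalone sketch. It is not the GBC implementation: the class NameRules and its method names are hypothetical, the character set is taken from the description above, and the reserved-name comparison is assumed to be case-sensitive.

```java
import java.util.Set;

// Hypothetical helper that mirrors the documented rules; NOT part of GBC.
final class NameRules {
    // Field names that the text lists as not allowed (assumed case-sensitive).
    private static final Set<String> RESERVED = Set.of("CHROM", "POS", "ALLELE", "GENOTYPE");

    // Replace whitespace and control characters, backslashes, '=', ':', ';', ','
    // and single/double quotes with an underscore.
    static String normalizeSubjectName(String subjectName) {
        return subjectName.replaceAll("[\\s\\p{Cntrl}\\\\=:;,'\"]", "_");
    }

    // A field name is accepted here when it is non-null and not one of the reserved names.
    static boolean isValidFieldName(String fieldName) {
        return fieldName != null && !RESERVED.contains(fieldName);
    }
}
```

For instance, a subject name such as "NA 12878;a=b" would become "NA_12878_a_b" under this sketch, while "hg38_POS" passes the field-name check and "POS" does not.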
GTBMeta
GTB uses GTBMeta to manage meta information. It mainly contains the add(String metaKey, String metaValue) and add(ByteCode metaKey, ByteCode metaValue) methods. As a mandatory rule, GTBMeta refuses to write meta information starting with $gtbformat.
The following is an example for writing and reading meta-information using GTBMeta:
GTBMeta meta = new GTBMeta();
meta.add("fileformat", "VCFv4.4");
meta.add("contig", "<ID=1,assembly=b37,length=249250621>");
meta.add("contig", "<ID=2,assembly=b37,length=243199373>");
meta.add("FORMAT", "<ID=GT,Number=1,Type=String,Description=\"Genotype\">");
// Read the meta information, the result is an array of all meta information values with the key contig
BaseArray<ByteCode> contigs = meta.get("contig").values();
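To make the behaviour concrete, the following standalone sketch models a multi-valued meta store in which one key can hold several values and keys with the reserved prefix are refused. The class MetaSketch is hypothetical and is not GTBMeta; in particular, throwing an exception on a reserved key is an assumption, since the text only says that such entries are refused.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical multi-valued meta store; NOT the GTBMeta implementation.
final class MetaSketch {
    // Reserved prefix taken from the text above.
    private static final String RESERVED_PREFIX = "$gtbformat";
    // Insertion-ordered map from a meta key to all values added under it.
    private final Map<String, List<String>> entries = new LinkedHashMap<>();

    // Add one value under a key; repeated keys accumulate values.
    MetaSketch add(String key, String value) {
        if (key.startsWith(RESERVED_PREFIX)) {
            // Assumption: refusal is modelled as an exception.
            throw new IllegalArgumentException("Reserved meta key: " + key);
        }
        entries.computeIfAbsent(key, k -> new ArrayList<>()).add(value);
        return this;
    }

    // All values stored under a key, in insertion order (empty if the key is absent).
    List<String> values(String key) {
        return entries.getOrDefault(key, List.of());
    }
}
```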
GTBWriter
The GTBWriter and GTBPartWriter are both initialized using the builder pattern through GTBWriter.Builder. During initialization, the following methods are used to configure the writer:
- setFormat(GTBFormat format): set the format of GTB.
- addSubject(String subjectName): add subject.
- addSubjects(String[] subjectNames): add multiple subjects.
- addSubjects(Iterable<String> subjectNames): add multiple subjects.
- addField(String fieldName, FieldType fieldType): add an additional field.
- addFields(Map<String, FieldType> fields): add multiple additional fields.
For the types supported by FieldType, see: Enumerated FieldType. When fields are added to the GTBWriter, the corresponding property key-value pair must be set on every Variant object that is written; otherwise a null pointer exception is thrown.
Upon completion of the configuration, the GTBWriter is instantiated using the .build() method, or the GTBPartWriter is instantiated using .build(int nThreads).
GTBPartWriter:
-----   ++++++   ~~~~~~   --->   -----++++++~~~~~~
Part1   Part2    Part3           final
The most common use of a part writer is to assign one part to each chromosome and stitch the parts together by chromosome tag after compression is complete.
Both GTBWriter and GTBPartWriter write meta information via writeMeta(String metaKey, String metaValue) or writeMeta(ByteCode metaKey, ByteCode metaValue). The GTBWriter writes variants by calling write(Variant variant) and write(Variants variants), while with the GTBPartWriter you first specify the part index, obtain the corresponding GTBWriter, and then write the variants through it. Be sure to call the close() method to close the file IO stream when you finish writing.
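The part-then-stitch idea can be sketched without GBC as follows. This is only an illustration under stated assumptions: the class PartWriterSketch is hypothetical, records are plain strings rather than variants, and each part is buffered in memory instead of being compressed to a file. Each part is written independently on a worker thread, and the finished parts are then concatenated in part order.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical illustration of writing parts in parallel and stitching them; NOT GTBPartWriter.
final class PartWriterSketch {
    // Writes every part into its own buffer on a worker thread, then joins the buffers in part order.
    static byte[] writeInParts(List<List<String>> parts, int nThreads) {
        ExecutorService pool = Executors.newFixedThreadPool(nThreads);
        try {
            List<Future<byte[]>> futures = new ArrayList<>();
            for (List<String> part : parts) {
                futures.add(pool.submit(() -> {
                    // Each task owns its buffer, so no synchronization is needed.
                    ByteArrayOutputStream buffer = new ByteArrayOutputStream();
                    for (String record : part) {
                        byte[] bytes = (record + "\n").getBytes(StandardCharsets.UTF_8);
                        buffer.write(bytes, 0, bytes.length);
                    }
                    return buffer.toByteArray();
                }));
            }
            // Stitch: the submission order, not the completion order, decides the final layout.
            ByteArrayOutputStream merged = new ByteArrayOutputStream();
            for (Future<byte[]> future : futures) {
                byte[] bytes = future.get();
                merged.write(bytes, 0, bytes.length);
            }
            return merged.toByteArray();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Interrupted while writing parts", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("A part failed to write", e.getCause());
        } finally {
            pool.shutdown();
        }
    }
}
```

Because the results are collected in submission order, the final output is deterministic even when the parts finish at different times.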
GTBManager
GTBManager is used to maintain information about GTB files pointed to by the same file path. It is designed with a cache structure so that multiple requests for the same file's manager only scan the file's index information once, and all subsequent requests return the same manager object. To load a GTB file's manager, use GTBManager.load(Object fileObject).
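The caching behaviour described above (one scan per path, the same object returned afterwards) can be sketched with a concurrent map. The class ManagerCacheSketch, its nested Manager type, and the scan counter are hypothetical and exist only for illustration; how GTBManager implements its cache internally is not stated here.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical per-path cache; NOT the GTBManager implementation.
final class ManagerCacheSketch {
    // Stand-in for the information gathered by scanning a file's index.
    static final class Manager {
        final String path;

        Manager(String path) {
            this.path = path;
        }
    }

    // One manager per file path, shared by all callers.
    private static final Map<String, Manager> CACHE = new ConcurrentHashMap<>();
    // Counts how many times a (simulated) index scan actually ran.
    static final AtomicInteger SCANS = new AtomicInteger();

    // Returns the cached manager for the path, scanning only on the first request.
    static Manager load(String path) {
        return CACHE.computeIfAbsent(path, p -> {
            SCANS.incrementAndGet(); // simulated index scan
            return new Manager(p);
        });
    }
}
```

Calling load twice with the same path returns the same object and increments the scan counter only once.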
GTBManager contains the following commonly used methods:
- getFieldNum(): get the number of fields the current GTB contains.
- getVariantNum(): get the number of variants.
- getSubjectNum(): get the number of subjects.
- containField(String fieldName): whether the current GTB contains this field.
- containSubject(String subject): whether the current GTB contains this subject.
- getFields(): get all the fields and their types.
- getSubjects(): get all the subjects.
- getFieldType(String fieldName): get the field type of the specified field.
- indexOfSubject(String subject): get the subject index by subject's name.
- subjectOfIndex(int index): get the subject name by subject's index.
- getMeta(): get the meta information.
- isOrdered(): check if the GTB is ordered by coordinates.
GTBFilter
GTBFilter is used for fast filtering when reading variants in GTB. It contains the following common methods:
- filterByAC(Interval<Integer> range)
- filterByAN(Interval<Integer> range)
- filterByAF(Interval<Float> range)
- filterByAlleleNum(Interval<Float> range)
- filterByChromosome(Chromosome chromosome)
- filterByChromosomes(Chromosome... chromosomes)
- filterByChromosomes(Iterable<Chromosome> chromosomes)
- filterByCoordinate(Chromosome chromosome, int position)
- filterByCoordinates(Chromosome chromosome, Iterable<Integer> poses)
- filterByCoordinates(Map<Chromosome, Iterable<Integer>> poses)
- filterByCoordinateRange(Chromosome chromosome, Interval<Integer> ranges)
- filterByCoordinateRanges(Map<Chromosome, Interval<Integer>> poses)
- filterByVariantIndex(Interval<Long> indexRangeOfVariant)
In addition, the following method can be used for complex conditions:
- filterByVariant(Function<Variant, Boolean> function)
GTBFilter supports chained calls, as shown in the following example, which combines several filters, including a custom one:
LiftOver liftOver = LiftOver.load(RefGenomeVersion.hg19, RefGenomeVersion.hg38);
GTBFilter filter = new GTBFilter()
        .filterByAC(new Interval<>(0, 100))
        .filterByAF(new Interval<>(0.1f, 0.9f))
        .filterByChromosome(Chromosome.get("chr1"))
        .filterByVariant(variant -> liftOver.convertCoordinate(variant) != null);
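The chaining style can be reproduced in a few lines of plain Java, which may help when reasoning about how the conditions combine. The sketch below is not GTBFilter: the class FilterChainSketch, the Site type, and the method names are hypothetical. It assumes that all conditions must hold at once (logical AND) and that interval bounds are inclusive.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

// Hypothetical chained filter; NOT the GTBFilter implementation.
final class FilterChainSketch {
    // Minimal stand-in for a variant record.
    static final class Site {
        final String chromosome;
        final int position;
        final int alleleCount;

        Site(String chromosome, int position, int alleleCount) {
            this.chromosome = chromosome;
            this.position = position;
            this.alleleCount = alleleCount;
        }
    }

    private final List<Predicate<Site>> conditions = new ArrayList<>();

    // Keep sites whose allele count lies in [min, max] (inclusive bounds are an assumption).
    FilterChainSketch byAlleleCount(int min, int max) {
        conditions.add(site -> site.alleleCount >= min && site.alleleCount <= max);
        return this;
    }

    // Keep sites on the given chromosome.
    FilterChainSketch byChromosome(String chromosome) {
        conditions.add(site -> site.chromosome.equals(chromosome));
        return this;
    }

    // Keep sites accepted by an arbitrary user-supplied condition.
    FilterChainSketch bySite(Predicate<Site> custom) {
        conditions.add(custom);
        return this;
    }

    // A site passes only if every registered condition accepts it.
    boolean accept(Site site) {
        for (Predicate<Site> condition : conditions) {
            if (!condition.test(site)) {
                return false;
            }
        }
        return true;
    }
}
```

Each method returns the filter itself, which is what allows the calls to be chained in a single expression.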
GTBReader
GTBReader is used to create a reader instance that reads variants one at a time. It provides the following constructors:
- GTBReader(Object manager): default constructor; loads all genotypes and complementary fields.
- GTBReader(Object manager, boolean loadGenotype): optionally loads the genotypes.
- GTBReader(Object manager, boolean loadGenotype, boolean loadField): optionally loads the genotypes and the complementary fields.
- GTBReader(Object manager, boolean loadGenotype, Iterable<String> fields): optionally loads the genotypes and loads only the selected complementary fields. When fields is null, all complementary fields are loaded.
Note that a single GTBReader is not thread-safe; create multiple GTBReader instances if you need to read a single GTB in parallel. An instantiated GTBReader mainly provides the following methods:
- read(): read and return a variant object. Returns null when the end of the file is reached.
- reads(): read multiple contiguous variants with the same coordinates and return them as a variant collection object. Returns null when the end of the file is reached.
- read(GTBFilter filter): read a variant that satisfies the conditions. Returns null when the end of the file is reached.
- tell(): get the pointer of the GTB file (i.e., variant index).
- seek(long variantIndex): move the file pointer to the specified variant index.
- seek(GTBFilter filter): move the file pointer to the place that meets the conditions. Returns false when no such variant exists.
- limit(Interval<Long> ranges): set the current read range of the reader.
- remaining(): get the number of variants left to be read by the current reader.
- part(int nParts): divide the GTBReader into nParts equal parts from the current pointer to the end of the file.
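The arithmetic behind dividing the remaining variants into equal parts can be sketched as follows. This is an illustration only: the class RangeSplitSketch is hypothetical, ranges are represented as half-open [start, end) pairs of variant indexes, and giving the leftover variants to the first parts is an assumption, since the text does not say how GTBReader handles a remainder.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical range splitter; NOT the GTBReader.part implementation.
final class RangeSplitSketch {
    // Splits [start, end) into nParts contiguous half-open ranges whose sizes differ by at most one.
    static List<long[]> split(long start, long end, int nParts) {
        if (nParts <= 0 || end < start) {
            throw new IllegalArgumentException("nParts must be positive and end must not precede start");
        }
        long total = end - start;
        long base = total / nParts;   // minimum size of every part
        long extra = total % nParts;  // the first 'extra' parts get one more variant (assumption)
        List<long[]> parts = new ArrayList<>();
        long cursor = start;
        for (int i = 0; i < nParts; i++) {
            long size = base + (i < extra ? 1 : 0);
            parts.add(new long[] {cursor, cursor + size});
            cursor += size;
        }
        return parts;
    }
}
```

With a pointer at 0, 10 variants remaining, and 3 parts, this sketch yields the ranges [0, 4), [4, 7), and [7, 10). Each range could then be handed to its own reader instance, in line with the note that a single reader is not thread-safe.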
For variants that have been read in, the genotype array is obtained with Variant.getProperty(IGenotypes.class). For details on how to manipulate the genotype array, see IGenotypes.