KGGA adopts workflow-based programming, in which an analysis is divided into multiple tasks that are chained together and run in sequence. A programmer can call the Java API functions of the data select package in KGGA to process genotype data for quality control, basic statistics, and format conversion. The following are two examples.
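The idea of queuing tasks and running them in order can be illustrated with a minimal plain-Java sketch. `MiniWorkflow` and all of its methods are hypothetical names invented for this illustration; the class is not KGGA's `Executor` and makes no claim about how KGGA is implemented. It only shows the pattern used in the examples: set parameters, add tasks, execute, then clear the tasks before building the next stage.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Consumer;

// Hypothetical stand-in for a workflow controller; this is NOT KGGA's Executor.
class MiniWorkflow {
    private final List<Consumer<Map<String, Object>>> tasks = new ArrayList<>();
    private final Map<String, Object> params = new LinkedHashMap<>();

    // Queue a task; nothing runs until execute() is called.
    MiniWorkflow addTask(Consumer<Map<String, Object>> task) {
        tasks.add(task);
        return this;
    }

    // Store a named value that every task can read or overwrite.
    MiniWorkflow setParam(String name, Object value) {
        params.put(name, value);
        return this;
    }

    Object getParam(String name) {
        return params.get(name);
    }

    // Drop queued tasks but keep parameters, so earlier results can feed a later run.
    void clearTasks() {
        tasks.clear();
    }

    // Run the queued tasks in the order they were added.
    void execute() {
        for (Consumer<Map<String, Object>> task : tasks) {
            task.accept(params);
        }
    }
}

public class MiniWorkflowDemo {
    public static void main(String[] args) {
        MiniWorkflow workflow = new MiniWorkflow();
        workflow.setParam("input", 21);
        // Task 1 derives a value from the input parameter.
        workflow.addTask(p -> p.put("doubled", (Integer) p.get("input") * 2));
        // Task 2 consumes the result of task 1.
        workflow.addTask(p -> p.put("report", "doubled=" + p.get("doubled")));
        workflow.execute();
        System.out.println(workflow.getParam("report"));
    }
}
```

Because tasks share one parameter map, a later task can pick up whatever an earlier task produced, which is the same role the named parameter plays in the second KGGA example.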

Select variants with good quality.

// Import necessary libraries and classes for handling input/output operations, VCF file manipulation, and genomic data processing.

import edu.sysu.pmglab.executor.Executor;
import edu.sysu.pmglab.gtb.GTBManager;
import edu.sysu.pmglab.gtb.genome.coordinate.RefGenomeVersion;
import edu.sysu.pmglab.kgga.command.executor.Utility;
import edu.sysu.pmglab.kgga.command.pipeline.GeneralIOOptions;
import edu.sysu.pmglab.kgga.command.pipeline.PreprocessingPipeline;
import edu.sysu.pmglab.kgga.command.pipeline.VCFQualityControlOptions;
import edu.sysu.pmglab.kgga.command.validator.VariantFileMeta;
import edu.sysu.pmglab.kgga.io.InputPhenotypeFileSet;
import edu.sysu.pmglab.kgga.io.InputType;

import java.io.File;
import java.io.IOException;

public class Example {
    public static void main(String[] args) {
        // Step 1: Create an instance of GeneralIOOptions, which holds the general input and output options for file handling.
        GeneralIOOptions ioOptions = new GeneralIOOptions();

        // Step 2: Create an instance of VCFQualityControlOptions, which specifies quality control settings for VCF files.
        VCFQualityControlOptions vcfQualityControlOptions = new VCFQualityControlOptions();

        try {
            // Step 3: Add a VCF file to the input options. This VCF file contains the genetic variant data to be analyzed.
            // The VariantFileMeta constructor takes three parameters:
            // - URL or local file path of the VCF file,
            // - Input type (VCF format in this case),
            // - Reference genome version of the input VCF file (e.g., hg19).
            ioOptions.inputGTYFiles.add(new VariantFileMeta("https://idc.biosino.org/pmglab/resource/kgg/kgga/example/assoc.hg19.vcf.gz", InputType.VCF, RefGenomeVersion.hg19));

            // Step 4: Specify the path of the subject information file. The PED file contains phenotype and sample information for each individual in the VCF file. This is necessary for linking genetic variants with phenotypic data.
            ioOptions.phenoFileSet = new InputPhenotypeFileSet("https://idc.biosino.org/pmglab/resource/kgg/kgga/example/assoc.ped");

            // Step 5: Define the output directory where the results will be stored. In this example, the workspace directory is set to "./test1".
            File workspace = new File("./test1");
            ioOptions.output = workspace;

            // Step 6: Initialize an Executor instance, which serves as the workflow controller.
            // The Executor manages and coordinates different stages of the data processing pipeline.
            Executor workflow = new Executor();

            // Step 7: Generate a file to track the execution status and the paths of files storing intermediate analysis data.
            Utility.addTrack(workflow, workspace);

            // Step 8: Run the preprocessing pipeline to generate a GTB (Genotype Block) file with annotations based on the provided inputs.
            // This step adds the quality control and sub-sample extraction tasks, using the options specified in ioOptions and vcfQualityControlOptions.
            GTBManager annotationBasedGTB = PreprocessingPipeline.INSTANCE.generateAnnotationBase(ioOptions, vcfQualityControlOptions, workflow, workspace);

            // Step 9: Execute the entire workflow. This command starts the data processing and writes the selected variants and subjects
            // (after filtering and quality control) to the output directory.
            workflow.execute();

            // Step 10: Print the file path of the generated GTB file to the console, allowing the user to verify the output location.
            System.out.println("The selected variants are stored in " + annotationBasedGTB.getFile());

        } catch (IOException e) {
            // Wrap any I/O exception raised during file reading/writing in an unchecked exception and rethrow it.
            throw new RuntimeException(e);
        }
    }
}
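After the workflow finishes, it can be useful to see which files were written to the workspace. The following is a minimal sketch that uses only the Java standard library; `ListWorkspace` and `listFiles` are hypothetical names, and the default path `./test1` is taken from the example above. It makes no assumption about the names of the files KGGA produces.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class ListWorkspace {
    // Return the paths (relative to dir) of all regular files under dir, sorted.
    static List<String> listFiles(Path dir) throws IOException {
        try (Stream<Path> paths = Files.walk(dir)) {
            return paths.filter(Files::isRegularFile)
                        .map(p -> dir.relativize(p).toString())
                        .sorted()
                        .collect(Collectors.toList());
        }
    }

    public static void main(String[] args) throws IOException {
        // Use the first argument if given, otherwise the workspace of the example.
        Path workspace = Paths.get(args.length > 0 ? args[0] : "./test1");
        if (!Files.isDirectory(workspace)) {
            System.out.println("No workspace directory at " + workspace);
            return;
        }
        for (String name : listFiles(workspace)) {
            System.out.println(name);
        }
    }
}
```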

Export selected variants in TSV format and genotypes in PGEN format (PLINK 2)

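Two tasks can be added to the above workflow to export the selected variants in TSV format and their genotypes in PGEN format (PLINK 2). The statements below continue the main method of the first example and reuse its ioOptions, workspace, workflow, and annotationBasedGTB variables; import statements for OutputVariants2TSVTask and OutputGenotypes2OtherTask are also required.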

    // Step 1: Set the output genotype format to PLINK_PGEN, a format used by PLINK,
    // a widely used tool for large-scale genotype data analysis.
    // Note: this statement must be placed before the call to
    // PreprocessingPipeline.INSTANCE.generateAnnotationBase in the first example.
    ioOptions.outputGtyFormat = InputType.PLINK_PGEN;

    // Step 2: Clear existing tasks in the workflow.
    // This ensures that no previously defined tasks are retained, which could interfere with the current workflow.
    workflow.clearTasks();

    // Step 3: Set the parameter for "AnnotationBaseVariantSet" in the workflow using the file path 
    // of the previously generated GTB file (annotationBasedGTB).
    // This GTB file contains the annotated variants and serves as the input data for subsequent tasks.
    workflow.setParam("AnnotationBaseVariantSet", new File(annotationBasedGTB.getFile().getPath()));

    // Step 4: Add a task to export the variants in the GTB file to a tab-separated value (TSV) file.
    // The OutputVariants2TSVTask is responsible for extracting the variants from the GTB format 
    // and saving them in a human-readable TSV file. The parameters are:
    // - ioOptions: Contains file paths and format information,
    // - workspace: Specifies the directory where the output will be stored,
    // - false: Indicates that the export task will not be run in a parallelized manner.
    workflow.addTasks(new OutputVariants2TSVTask(ioOptions, workspace, false));

    // Step 5: Check whether an output genotype format is specified. Genotypes are exported
    // only if `outputGtyFormat` is not null.
    if (ioOptions.outputGtyFormat != null) {
        // Step 6: Add a task to export genotypes to the specified format (e.g., PLINK_PGEN).
        // The OutputGenotypes2OtherTask converts genotypes stored in the GTB file
        // into the specified output format, which can be any supported format like PLINK, BGEN, etc.
        workflow.addTasks(new OutputGenotypes2OtherTask(ioOptions, workspace));
    }


    // Step 7: Execute the workflow to run all added tasks sequentially.
    // This command initiates the export of variants and genotypes, producing the desired output files 
    // (e.g., TSV and PLINK files) in the specified directory.
    workflow.execute();
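Once the TSV export has been written, it can be inspected with a few lines of standard Java. The sketch below is only an illustration: `TsvPreview` and `readRows` are hypothetical names, and the demo file with its `colA`/`colB`/`colC` columns is placeholder content created so the code runs on its own. It does not reflect the actual file name or column layout of KGGA's output, which should be checked in the workspace directory.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

public class TsvPreview {
    // Read up to maxRows lines of a tab-separated file and split each line into fields.
    static List<String[]> readRows(Path tsv, int maxRows) throws IOException {
        List<String[]> rows = new ArrayList<>();
        try (BufferedReader reader = Files.newBufferedReader(tsv, StandardCharsets.UTF_8)) {
            String line;
            while (rows.size() < maxRows && (line = reader.readLine()) != null) {
                rows.add(line.split("\t", -1)); // limit -1 keeps trailing empty fields
            }
        }
        return rows;
    }

    public static void main(String[] args) throws IOException {
        // Placeholder content so the sketch is runnable; this is not real KGGA output.
        Path demo = Files.createTempFile("variants", ".tsv");
        Files.write(demo, List.of("colA\tcolB\tcolC", "1\t2\t", "x\ty\tz"));
        for (String[] row : readRows(demo, 10)) {
            System.out.println(row.length + " fields: " + String.join(" | ", row));
        }
        Files.delete(demo);
    }
}
```

To preview a real export, replace the temporary file with the path of the TSV file found in the workspace.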

API Usage
