Examples

This section demonstrates how to use the manage module in KGGA to select a subset of samples from VCF and PED files, apply stringent quality control filters to their genotypes, and export the results in formats suitable for downstream genetic analysis, such as PLINK BED or PGEN.


In this example, individuals present in both the VCF and PED files are selected, and their genotypes are subjected to stricter quality control than the default settings. Variants are filtered based on allele number, allele count, observation rate, and Hardy-Weinberg equilibrium (HWE). The cleaned genotypes are then output in PLINK BED format for subsequent analysis.

java -Xmx10g -jar kgga.jar \
   clean \
   --input-gty-file ./example/assoc.hg19.vcf.gz \
                   refG=hg19 \
   --ped-file https://idc.biosino.org/pmglab/resource/kgg/kgga/example/assoc.ped \
   --output ./test/demo1 \
   --threads 4 \
   --allele-num 2~4 \
   --seq-ac 1 \
   --min-obs-rate 0.9 \
   --hwe 1E-5 \
   --output-gty-format PLINK_BED

Key Command Options

  • --input-gty-file: Specifies the input VCF file and reference genome (e.g., hg19).
  • --ped-file: Provides sample metadata, such as case/control status, from a PED file.
  • --output: Defines the output directory (e.g., ./test/demo1).
  • --threads: Sets the number of processing threads (e.g., 4).
  • --allele-num: Filters variants by the number of alleles (e.g., 2 to 4).
  • --seq-ac: Filters variants by minimum allele count (e.g., ≥1).
  • --min-obs-rate: Sets the minimum genotype observation rate (e.g., 0.9 or 90%).
  • --hwe: Applies a Hardy-Weinberg equilibrium filter with a p-value threshold (e.g., 1E-5).
  • --output-gty-format: Specifies the output format (e.g., PLINK_BED).
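
When the same filters are reused across several datasets, it can help to keep the command in one place. The following is a minimal sketch, not part of KGGA: the function name kgga_clean_cmd and its positional arguments are hypothetical, and it only prints the command (a dry run) using the flags shown in the example above.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of KGGA): print the clean command for a given
# VCF, PED file, and output directory so it can be reviewed before running.
# Only flags that appear in the example above are used.
kgga_clean_cmd() {
  local vcf="$1" ped="$2" out="$3" fmt="${4:-PLINK_BED}"
  # printf repeats the '%s ' format for every argument, giving one line.
  printf '%s ' \
    java -Xmx10g -jar kgga.jar clean \
    --input-gty-file "$vcf" refG=hg19 \
    --ped-file "$ped" \
    --output "$out" \
    --threads 4 \
    --allele-num '2~4' \
    --seq-ac 1 \
    --min-obs-rate 0.9 \
    --hwe 1E-5 \
    --output-gty-format "$fmt"
  printf '\n'
}

# Usage: print the command for the example data.
#   kgga_clean_cmd ./example/assoc.hg19.vcf.gz ./example/assoc.ped ./test/demo1
```

Because the helper only prints, paths containing spaces would need quoting before the printed line is pasted into a shell.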

Outputting Genotypes in PGEN Format

KGGA also supports exporting cleaned genotypes in other formats, such as PGEN. PGEN output requires the Python package jep, which provides the libjep native library and must be installed before running the command. For installation instructions, refer to xxx.

java -Djava.library.path="$(pip3 show jep | grep Location | awk '{print $2"/jep"}')" \
   -Xmx10g -jar kgga.jar \
   clean \
   --input-gty-file ./example/assoc.hg19.vcf.gz \
                   refG=hg19 \
   --ped-file ./example/assoc.ped \
   --output ./test/demo1 \
   --threads 4 \
   --allele-num 2~4 \
   --seq-ac 1 \
   --min-obs-rate 0.9 \
   --hwe 1E-5 \
   --output-gty-format PLINK_PGEN

Additional Command Option

  • -Djava.library.path: Specifies the directory that contains the libjep native library (e.g., libjep.jnilib), required for PGEN output.

Notes

  • KGGA also supports exporting genotypes in VCF format by setting --output-gty-format VCF, which does not require libjep.
  • Ensure the libjep library is correctly installed and the path is accurate when using PGEN output.
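
Before launching a PGEN export, the library path can be checked separately. This is a minimal sketch under stated assumptions: the function names are hypothetical, and the file names libjep.so (Linux) and libjep.jnilib (macOS) are assumed rather than taken from the KGGA documentation. The awk step mirrors the pipeline used in the command above.

```shell
# Read `pip3 show jep` output on stdin and print the jep package directory,
# i.e. the value passed to -Djava.library.path in the command above.
jep_dir_from_pip_show() {
  awk '$1 == "Location:" { print $2 "/jep" }'
}

# Succeed if the directory contains a libjep native library.
# Assumption: the file is named libjep.so (Linux) or libjep.jnilib (macOS).
has_libjep() {
  [ -e "$1/libjep.so" ] || [ -e "$1/libjep.jnilib" ]
}

# Usage (requires jep to be installed):
#   dir="$(pip3 show jep | jep_dir_from_pip_show)"
#   has_libjep "$dir" || echo "libjep not found in $dir" >&2
```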

Outputs

The cleaned variants, along with basic count summaries, are saved in the specified output directory (e.g., ./test/demo1/). The genotypes are stored in the chosen format (e.g., PLINK BED or PGEN) within this folder.

The final selection results are saved in a compressed file named variants.hg38.tsv.gz, with the following columns:

Header                         Description
CHROM                          Chromosome (hg38)
POS                            Position (hg38)
REF                            Reference allele
ALT                            Alternative allele
GTYSUM::RefHomGtyNum_ALL       Number of reference homozygous genotypes (all samples)
GTYSUM::HetGtyNum_ALL          Number of heterozygous genotypes (all samples)
GTYSUM::AltHomGtyNum_ALL       Number of alternative homozygous genotypes (all samples)
GTYSUM::MissingGtyNum_ALL      Number of missing genotypes (all samples)
GTYSUM::RefHomGtyNum_CASE      Number of reference homozygous genotypes (case samples)
GTYSUM::HetGtyNum_CASE         Number of heterozygous genotypes (case samples)
GTYSUM::AltHomGtyNum_CASE      Number of alternative homozygous genotypes (case samples)
GTYSUM::MissingGtyNum_CASE     Number of missing genotypes (case samples)
GTYSUM::RefHomGtyNum_CONTROL   Number of reference homozygous genotypes (control samples)
GTYSUM::HetGtyNum_CONTROL      Number of heterozygous genotypes (control samples)
GTYSUM::AltHomGtyNum_CONTROL   Number of alternative homozygous genotypes (control samples)
GTYSUM::MissingGtyNum_CONTROL  Number of missing genotypes (control samples)
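
The genotype counts above can be used to sanity-check the filters after a run. The sketch below is illustrative only and makes several assumptions: the table is tab-separated with a header line whose names match the columns listed above, the observation rate is taken to be the fraction of non-missing genotypes, and the alternative allele count is computed as for a biallelic variant (heterozygotes plus twice the alternative homozygotes). The function name summarize_variants is hypothetical.

```shell
# Read the variant table on stdin and print, per variant:
# CHROM, POS, alternative allele count, observation rate (3 decimals).
summarize_variants() {
  awk -F '\t' '
    NR == 1 {
      # Map each header name to its column index so column order does not matter.
      for (i = 1; i <= NF; i++) col[$i] = i
      next
    }
    {
      ref  = $(col["GTYSUM::RefHomGtyNum_ALL"])
      het  = $(col["GTYSUM::HetGtyNum_ALL"])
      alt  = $(col["GTYSUM::AltHomGtyNum_ALL"])
      miss = $(col["GTYSUM::MissingGtyNum_ALL"])
      called = ref + het + alt
      total  = called + miss
      # Assumed definition: observation rate = called genotypes / all genotypes.
      rate = (total > 0) ? called / total : 0
      printf "%s\t%s\t%d\t%.3f\n", $(col["CHROM"]), $(col["POS"]), het + 2 * alt, rate
    }'
}

# Usage (the exact file location inside the output directory is an assumption):
#   gzip -dc ./test/demo1/variants.hg38.tsv.gz | summarize_variants | head
```

With --min-obs-rate 0.9, every retained variant would be expected to show a rate of at least 0.900 under this assumed definition.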

Additional Output Files

  • Intermediate Files: Files with the .gtb suffix, located in the ConvertVCF2GTBTask subfolder, can be used directly for subsequent analysis.
  • Log Files: Processing logs are saved in the log subfolder within the output directory.
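
To locate the intermediate files for reuse, a short search of the output directory is usually enough. This is a sketch assuming the ConvertVCF2GTBTask subfolder sits somewhere under the output directory; the function name list_gtb_files is hypothetical.

```shell
# List .gtb entries found under any ConvertVCF2GTBTask subfolder of the
# given output directory, sorted for stable output.
list_gtb_files() {
  find "$1" -path '*/ConvertVCF2GTBTask/*' -name '*.gtb' | sort
}

# Usage:
#   list_gtb_files ./test/demo1
```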

General Notes

  • Adjust file paths (e.g., --input-gty-file, --ped-file, --output) and parameters (e.g., --allele-num, --hwe) based on your dataset and analysis needs.
  • The --output-gty-format option allows flexibility in choosing the output format (e.g., PLINK_BED, PLINK_PGEN, or VCF).
  • For PGEN output, ensure the libjep library is installed and the -Djava.library.path value is correctly specified.

Copyright © MiaoXin Li, all rights reserved. Last modified time: 2025-04-23 06:36:51
