Managing Genotype Data on the Cloud: A Tutorial with DNANexus and UKB WES
1. Required Software
- KGGA Software Package: An integrated tool for managing GBC, CCF, genomes, and other large-scale public data.
- Java Runtime Environment
- Download Oracle JDK via wget https://download.oracle.com/java/24/latest/jdk-24_linux-x64_bin.tar.gz.
Since the Java environment may be used multiple times on different compute workers, we recommend storing these resources in your project directory.
2. Step-by-Step Procedure
2.1. Download Software Locally
On your local machine, download KGGA, the gene annotation file, and Oracle JDK:
# Use -c to enable resumable downloads
wget -c https://pmglab.top/kgga/download/lib/v1/kgga.jar
wget -c https://idc.biosino.org/pmglab/resource/kgg/kgga/resources/gene/refGene_hg38_kggseq_v2.txt.gz
wget -c https://download.oracle.com/java/24/latest/jdk-24_linux-x64_bin.tar.gz
2.2. Upload to the RAP Platform
Navigate to your project on the DNANexus platform, upload the files from the previous step, and note their file IDs:
KGGA: file-J21ZP4QJ5qQ25bjXZJX435Bq
JDK: file-J214ZzjJ5qQ8GfF272Fy4ZqP
refGene_hg38: file-J21Y8J0J5qQ6KZPxjvpgjF0P
2.3. Obtain Data Paths
The UK Biobank genotype data is stored in the following directories on the DNANexus Research Analysis Platform (RAP):
# WES VCF
<PROJECT-ID>:/Bulk/Exome sequences/Population level exome OQFE variants, pVCF format - final release/
# WGS VCF
<PROJECT-ID>:/Bulk/DRAGEN WGS/ML-corrected DRAGEN population level WGS variants, pVCF format [500k release]/
# WGS PGEN
<PROJECT-ID>:/Bulk/DRAGEN WGS/DRAGEN population level WGS variants, PLINK format [500k release]/
For more details on the genotype data storage, please refer to: https://biobank.ndph.ox.ac.uk/ukb/label.cgi?id=100314
The WES data is split by chromosome and then by genomic blocks. The table below shows the total size and file count for each chromosome.
Chromosome | Total VCF.GZ Size (GB) | No. of Files |
---|---|---|
chr1 | 2,273.42 | 97 |
chr2 | 1,636.94 | 71 |
chr3 | 1,304.10 | 56 |
chr4 | 890.46 | 39 |
chr5 | 994.17 | 43 |
chr6 | 1,128.24 | 48 |
chr7 | 1,091.85 | 47 |
chr8 | 824.10 | 35 |
chr9 | 983.73 | 42 |
chr10 | 914.43 | 40 |
chr11 | 1,382.71 | 57 |
chr12 | 1,187.92 | 52 |
chr13 | 402.98 | 18 |
chr14 | 716.82 | 30 |
chr15 | 780.47 | 34 |
chr16 | 1,141.16 | 47 |
chr17 | 1,360.13 | 56 |
chr18 | 357.89 | 16 |
chr19 | 1,662.61 | 65 |
chr20 | 595.50 | 25 |
chr21 | 246.88 | 11 |
chr22 | 539.19 | 23 |
chrX | 563.02 | 24 |
chrY | 5.97 | 1 |
ALL | 22,984.69 | 977 |
All files are named in the format ukb23157_c<chromosome>_b<block>*.vcf.gz (e.g., the chromosome 21 blocks match ukb23157_c21_b*.vcf.gz).
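These totals translate directly into worker sizing. A quick back-of-the-envelope check of the average block size, using the ALL row of the table above:

```shell
# Average pVCF block size, computed from the ALL row of the table above
awk 'BEGIN { total = 22984.69; files = 977; printf "avg %.2f GB per file\n", total / files }'
```

At roughly 23.5 GB per block, a worker's 93 GB disk holds only a few raw blocks at a time, which is why the batch script below downloads, converts, and deletes one file per iteration.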
2.4. Launch a Worker and Configure the Environment
Launch an interactive worker. A suggested configuration is mem1_ssd1_v2_x4 (7.8 GB RAM, 93 GB disk, 4 cores).
In the worker's terminal, set up the environment:
cd /opt/notebooks/
dx download file-J21ZP4QJ5qQ25bjXZJX435Bq
dx download file-J214ZzjJ5qQ8GfF272Fy4ZqP
tar zxvf ./jdk-24_linux-x64_bin.tar.gz
# Adjust the version in the path below to match the directory actually extracted from the tarball
export PATH=/opt/notebooks/jdk-24.0.2/bin:$PATH
2.5. Download and Process Data in Batches
Individual UKB VCF files are very large, making it impractical to process all sub-files for a chromosome at once. Therefore, we will process the data in batches. First, we divide the files for an entire chromosome into k parts. Then, each worker processes one part (which contains multiple VCF sub-files). The processing includes genotype/individual-level QC and format conversion to the compact GTB format. The resulting GTB files are uploaded back to the DNANexus platform. Finally, once all parts of a chromosome are processed, they are concatenated into a single, complete GTB file.
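The batch-to-file-index mapping described above is plain ceiling division. A minimal standalone sketch (using the 977-file overall count from the table purely for illustration):

```shell
# Map each batch index to an inclusive range of file indices via ceiling division
TOTAL_FILES=977
BATCH_NUM=5
BATCH_SIZE=$(( (TOTAL_FILES + BATCH_NUM - 1) / BATCH_NUM ))  # ceil(977/5) = 196
for (( b = 0; b < BATCH_NUM; b++ )); do
  START=$(( b * BATCH_SIZE ))
  END=$(( START + BATCH_SIZE - 1 ))
  if (( END >= TOTAL_FILES )); then END=$(( TOTAL_FILES - 1 )); fi  # clamp the final batch
  echo "batch ${b}: files ${START}-${END}"
done
```

Note that the final batch is smaller than the others; the clamp is what the real script's END_INDEX check implements.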
The following script processes one batch of VCF files. It can be run in parallel on multiple workers by specifying a different BATCH_INDEX for each.
# ---------- Job Basic Information ----------
CHROM="21"
# Total number of batches to split the chromosome files into
BATCH_NUM=5
# Current batch index (0-based, from 0 to BATCH_NUM-1)
BATCH_INDEX=0
# Input data project and directory
INPUT_PROJECT_ID="J04yP4QJ5qQ4pgP2xQpb0Qp3"
INPUT_PROJECT_DIR="/Bulk/Exome sequences/Population level exome OQFE variants, pVCF format - final release/"
# Output data project and directory. [Ensure this directory exists to avoid upload errors]
OUTPUT_PROJECT_ID="J04yP4QJ5qQ4pgP2xQpb0Qp3"
OUTPUT_PROJECT_DIR="/user/zlb/ukb/wes/chr${CHROM}"
# ---------- File List ----------
echo "[STEP 1] Finding VCF files for chromosome ${CHROM}..."
# Search for all vcf.gz files in the specified input path
mapfile -t FILES < <(
dx find data -r --name "ukb23157_c${CHROM}_b*.vcf.gz" \
--path "project-${INPUT_PROJECT_ID}:${INPUT_PROJECT_DIR}/" \
--brief \
| xargs -n1 -I {} dx describe {} --name
)
TOTAL_FILES=${#FILES[@]}
echo "Found ${TOTAL_FILES} files"
# ---------- Job Mapping ----------
if [[ $BATCH_NUM -le 0 ]]; then
echo "Error: BATCH_NUM must be greater than 0"
exit 1
fi
# Calculate the size of each batch (ceiling)
BATCH_SIZE=$(( (TOTAL_FILES + BATCH_NUM - 1) / BATCH_NUM ))
START_INDEX=$(( BATCH_INDEX * BATCH_SIZE ))
END_INDEX=$(( START_INDEX + BATCH_SIZE - 1 ))
# Validate the batch index
if [[ $BATCH_INDEX -ge $BATCH_NUM ]]; then
echo "Error: BATCH_INDEX cannot exceed BATCH_NUM-1"
exit 1
fi
# Handle the last batch which may be smaller
if [[ $END_INDEX -ge $TOTAL_FILES ]]; then
END_INDEX=$(( TOTAL_FILES - 1 ))
fi
# Check for empty batches (e.g., if BATCH_INDEX is too high)
if [[ $START_INDEX -gt $END_INDEX ]]; then
echo "Warning: Current batch is empty (no files fall in this index range)."
exit 0
fi
echo "Total batches: ${BATCH_NUM} | Approx. ${BATCH_SIZE} files per batch"
echo "Current batch: ${BATCH_INDEX} (file index range ${START_INDEX}-${END_INDEX})"
# ---------- Processing Logic ----------
for (( i=START_INDEX; i<=END_INDEX; i++ ))
do
INPUT="${FILES[$i]}"
echo "Progress: $((i - START_INDEX + 1))/$((END_INDEX - START_INDEX + 1)) - File ${INPUT}"
# Output filename, e.g., rename ukb23157_c1_b1.vcf.gz to ukb23157_c1_b1.gtb
OUTPUT="$(basename "${INPUT}" .vcf.gz).gtb"
# Check if the output file already exists on the platform and skip if it does
if dx ls "project-${OUTPUT_PROJECT_ID}:${OUTPUT_PROJECT_DIR}/${OUTPUT}" > /dev/null 2>&1; then
echo "File already exists, skipping: ${OUTPUT}"
continue
fi
# Download the file if it doesn't exist locally
if [[ -f "${INPUT}" ]]; then
echo "Local file exists, skipping download: ${INPUT}"
else
echo "Downloading file: ${INPUT}"
dx download "project-${INPUT_PROJECT_ID}:${INPUT_PROJECT_DIR}/${INPUT}"
fi
# Process the file: convert VCF to GTB with filtering
echo "Converting VCF -> GTB..."
java -Xms6g -Xmx6g -jar kgga.jar gbc convert \
vcf2gtb "${INPUT}" -o "${OUTPUT}" \
--field --prune --seq-an 1~ --seq-af 0.000001~0.999999 \
-t 4 || { echo "Java conversion failed for ${INPUT}"; exit 1; }
# Upload the resulting GTB file
echo "Uploading file: ${OUTPUT}"
dx upload "${OUTPUT}" --path "project-${OUTPUT_PROJECT_ID}:${OUTPUT_PROJECT_DIR}/${OUTPUT}" || { echo "Upload failed for ${OUTPUT}"; exit 1; }
# Clean up by deleting local files to save space
rm "${INPUT}" "${OUTPUT}" || { echo "Failed to clean up files"; exit 1; }
echo "Finished processing: ${INPUT}"
done
echo "Batch ${BATCH_INDEX} processing complete"
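The output-name derivation inside the loop is a plain basename substitution; shown standalone below (the input name is a hypothetical block file following the ukb23157_c<chrom>_b<block> pattern):

```shell
# Derive the GTB output name from a VCF block name, as the loop above does
INPUT="ukb23157_c21_b7.vcf.gz"                 # hypothetical example name
OUTPUT="$(basename "${INPUT}" .vcf.gz).gtb"    # strip .vcf.gz, append .gtb
echo "${OUTPUT}"
```

Because basename also strips any leading directory components, the same expression works whether INPUT is a bare file name or a full local path.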
After all batches for a chromosome have been processed, execute the following commands on a single worker to merge the intermediate GTB files:
# ---------- Job Basic Information ----------
CHROM="21"
# Input directory containing the intermediate GTB files
INPUT_PROJECT_ID="J04yP4QJ5qQ4pgP2xQpb0Qp3"
INPUT_PROJECT_DIR="/user/zlb/ukb/wes/chr${CHROM}"
# Output project and directory for the final merged file
OUTPUT_PROJECT_ID="J04yP4QJ5qQ4pgP2xQpb0Qp3"
OUTPUT_PROJECT_DIR="/user/zlb/ukb/wes/"
OUTPUT="chr${CHROM}.hg38.gtb"
# ---------- Processing Logic ----------
# Get a list of all intermediate GTB files
mapfile -t FILES < <(
dx find data --name "ukb23157_c${CHROM}_b*.gtb" \
--path "project-${INPUT_PROJECT_ID}:${INPUT_PROJECT_DIR}/" \
--brief \
| xargs -n1 -I {} dx describe {} --name
)
# Download all intermediate files
for (( i=0; i<${#FILES[@]}; i++ ))
do
dx download "project-${INPUT_PROJECT_ID}:${INPUT_PROJECT_DIR}/${FILES[$i]}"
done
# Sort files numerically by block number to ensure correct order
mapfile -t SORTED_FILES < <(ls ./*.gtb | sort -t '_' -k3.2n)
# Concatenate the GTB files
java -jar kgga.jar gbc concat "${SORTED_FILES[@]}" -o "${OUTPUT}"
# Upload the final merged file
dx upload "${OUTPUT}" --path "project-${OUTPUT_PROJECT_ID}:${OUTPUT_PROJECT_DIR}/${OUTPUT}" || { echo "Upload failed for ${OUTPUT}"; exit 1; }
# Delete the intermediate sub-files
rm -f ukb23157_c*.gtb
This process results in a single GTB file for each chromosome.
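The numeric sort key is the subtle step in the merge: a plain lexicographic sort would place b10 before b2 and corrupt the concatenation order. The key -k3.2n (third underscore-delimited field, starting at its second character, i.e., the digits after "b") can be verified on sample names:

```shell
# Numeric sort on the block number embedded in the file name
printf 'ukb23157_c21_b10.gtb\nukb23157_c21_b2.gtb\nukb23157_c21_b1.gtb\n' \
  | sort -t '_' -k3.2n
```

The output lists b1, then b2, then b10, i.e., true numeric block order.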
2.6. Convert GTB to Other Formats (Genotype-level Analysis)
The GTB format typically saves 50%–99% of storage space compared to other formats. Its analysis and computation performance is comparable to high-performance bit-encoded formats like PLINK's PGEN and BED. A brief guide can be found at https://pmglab.top/kgga/v1.0/GBCQuickStartEn.html.
KGGA can export specific subsets of samples and genomic regions from a GTB file. The example below filters variants by allele frequency and exports them to PGEN format (first, install the Python dependencies using pip3 install jep zstandard pgenlib):
java -Djava.library.path="$(pip3 show jep | grep Location | awk '{print $2"/jep"}')" -jar \
kgga.jar gbc convert gtb2plink chr21.hg38.gtb --seq-af 0.05~0.95 -ot pgen
2.7. Annotate Variants using KGGA
KGGA can perform high-speed annotation of genomic attributes directly on GTB files. This example demonstrates how to annotate variants based on gene features and filter them to retain only protein-altering non-synonymous mutations. For more detailed functionalities, please refer to the KGGA online documentation.
To annotate and filter variants, saving the results to a TSV text file:
# Prepare the resource directory for KGGA annotation
mkdir -p resources/gene
cd resources/gene
# Download the resource file for KGGA annotation (using the file ID from step 2.2)
dx download file-J21Y8J0J5qQ6KZPxjvpgjF0P
cd ../../
# Start the annotation process
# This command annotates variants in chr21.hg38.gtb with an Allele Count >= 1,
# using the refGene model, and includes functional consequences from nonsynonymous to LoF (0~6).
java -Xmx4g -jar kgga.jar annotate \
--input-gty-file ./chr21.hg38.gtb \
--output ./annot \
--seq-ac 1~ \
--gene-model-database refgene \
--gene-feature-included 0~6
You can quickly check the results with command-line tools:
zless /opt/notebooks/annot/variants.hg38.tsv.gz
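For a quick tally of the annotation results, standard text tools suffice. The sketch below builds a tiny mock TSV (the column names are illustrative assumptions, not KGGA's actual output schema) and counts variants per gene feature:

```shell
# Build a tiny mock annotation table (column names are hypothetical stand-ins,
# not KGGA's actual output schema)
printf 'CHROM\tPOS\tREF\tALT\tGeneFeature\n21\t100\tA\tG\tmissense\n21\t200\tC\tT\tstopgain\n21\t300\tG\tA\tmissense\n' \
  | gzip > mock.variants.tsv.gz
# Count variants per gene feature, skipping the header line
gzip -dc mock.variants.tsv.gz | tail -n +2 | cut -f5 | sort | uniq -c | sort -rn
```

Pointing the same pipeline at the real variants.hg38.tsv.gz (with the correct column index for the gene-feature field) gives a fast sanity check that the feature filter behaved as intended.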