Managing Genotype Data on the Cloud: A Tutorial with DNANexus and UKB WES
1. Required Software
- KGGA Software Package: An integrated tool for managing GBC, CCF, genomes, and other large-scale public data.
- Java Runtime Environment
- Download Oracle JDK via wget https://download.oracle.com/java/24/latest/jdk-24_linux-x64_bin.tar.gz.
Since the Java environment may be used multiple times on different compute workers, we recommend storing these resources in your project directory.
2. Step-by-Step Procedure
2.1. Download Software Locally
On your local machine, download KGGA, the gene annotation file, and Oracle JDK:
# Use -c to enable resumable downloads
wget -c https://pmglab.top/kgga/download/lib/v1/kgga.jar
wget -c https://idc.biosino.org/pmglab/resource/kgg/kgga/resources/gene/refGene_hg38_kggseq_v2.txt.gz
wget -c https://download.oracle.com/java/24/latest/jdk-24_linux-x64_bin.tar.gz
2.2. Upload to the RAP Platform
Navigate to your project on the DNANexus platform, upload the files from the previous step, and note their file IDs:
KGGA: file-J21ZP4QJ5qQ25bjXZJX435Bq
JDK: file-J214ZzjJ5qQ8GfF272Fy4ZqP
refGene_hg38: file-J21Y8J0J5qQ6KZPxjvpgjF0P
2.3. Obtain Data Paths
The UK Biobank genotype data is stored in the following directories on the DNANexus Research Analysis Platform (RAP):
# WES VCF
<PROJECT-ID>:/Bulk/Exome sequences/Population level exome OQFE variants, pVCF format - final release/
# WGS VCF
<PROJECT-ID>:/Bulk/DRAGEN WGS/ML-corrected DRAGEN population level WGS variants, pVCF format [500k release]/
# WGS PGEN
<PROJECT-ID>:/Bulk/DRAGEN WGS/DRAGEN population level WGS variants, PLINK format [500k release]/
For more details on the genotype data storage, please refer to: https://biobank.ndph.ox.ac.uk/ukb/label.cgi?id=100314
The WES data is split by chromosome and then by genomic blocks. The table below shows the total size and file count for each chromosome.
Chromosome | Total VCF.GZ Size (GB) | No. of Files |
---|---|---|
chr1 | 2,273.42 | 97 |
chr2 | 1,636.94 | 71 |
chr3 | 1,304.10 | 56 |
chr4 | 890.46 | 39 |
chr5 | 994.17 | 43 |
chr6 | 1,128.24 | 48 |
chr7 | 1,091.85 | 47 |
chr8 | 824.10 | 35 |
chr9 | 983.73 | 42 |
chr10 | 914.43 | 40 |
chr11 | 1,382.71 | 57 |
chr12 | 1,187.92 | 52 |
chr13 | 402.98 | 18 |
chr14 | 716.82 | 30 |
chr15 | 780.47 | 34 |
chr16 | 1,141.16 | 47 |
chr17 | 1,360.13 | 56 |
chr18 | 357.89 | 16 |
chr19 | 1,662.61 | 65 |
chr20 | 595.50 | 25 |
chr21 | 246.88 | 11 |
chr22 | 539.19 | 23 |
chrX | 563.02 | 24 |
chrY | 5.97 | 1 |
ALL | 22,984.69 | 977 |
All files are named in the format ukb23157_c<chromosome>_b<block>*.vcf.gz (e.g., the chromosome 21 blocks match ukb23157_c21_b*.vcf.gz).
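These totals translate directly into worker sizing. A quick back-of-the-envelope check of the average block size, using the ALL row of the table above:

```shell
# Average pVCF block size, computed from the ALL row of the table above
awk 'BEGIN { total = 22984.69; files = 977; printf "avg %.2f GB per file\n", total / files }'
```

At roughly 23.5 GB per block, a worker's 93 GB disk holds only a few raw blocks at a time, which is why the batch script below downloads, converts, and deletes one file per iteration.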
2.4. Launch a Worker and Configure the Environment
Launch an interactive worker. A suggested configuration is mem1_ssd1_v2_x4 (7.8 GB RAM, 93 GB disk, 4 cores).
In the worker's terminal, set up the environment:
cd /opt/notebooks/
dx download file-J21ZP4QJ5qQ25bjXZJX435Bq
dx download file-J214ZzjJ5qQ8GfF272Fy4ZqP
tar zxvf ./jdk-24_linux-x64_bin.tar.gz
# Adjust the version in the path below to match the directory actually extracted from the tarball
export PATH=/opt/notebooks/jdk-24.0.2/bin:$PATH
2.5. Download and Process Data in Batches
Individual UKB VCF files are very large, making it impractical to process all sub-files for a chromosome at once. Therefore, we will process the data in batches. First, we divide the files for an entire chromosome into k parts. Then, each worker processes one part (which contains multiple VCF sub-files). The processing includes genotype/individual-level QC and format conversion to the compact GTB format. The resulting GTB files are uploaded back to the DNANexus platform. Finally, once all parts of a chromosome are processed, they are concatenated into a single, complete GTB file.
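The batch-to-file-index mapping described above is plain ceiling division. A minimal standalone sketch (using the 977-file overall count from the table purely for illustration):

```shell
# Map each batch index to an inclusive range of file indices via ceiling division
TOTAL_FILES=977
BATCH_NUM=5
BATCH_SIZE=$(( (TOTAL_FILES + BATCH_NUM - 1) / BATCH_NUM ))  # ceil(977/5) = 196
for (( b = 0; b < BATCH_NUM; b++ )); do
  START=$(( b * BATCH_SIZE ))
  END=$(( START + BATCH_SIZE - 1 ))
  if (( END >= TOTAL_FILES )); then END=$(( TOTAL_FILES - 1 )); fi  # clamp the final batch
  echo "batch ${b}: files ${START}-${END}"
done
```

Note that the final batch is smaller than the others; the clamp is what the real script's END_INDEX check implements.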
The following script processes one batch of VCF files. It can be run in parallel on multiple workers by specifying a different BATCH_INDEX for each.
# ---------- Job Basic Information ----------
CHROM="21"
# Total number of batches to split the chromosome files into
BATCH_NUM=5
# Current batch index (0-based, from 0 to BATCH_NUM-1)
BATCH_INDEX=0
# Input data project and directory
INPUT_PROJECT_ID="J04yP4QJ5qQ4pgP2xQpb0Qp3"
INPUT_PROJECT_DIR="/Bulk/Exome sequences/Population level exome OQFE variants, pVCF format - final release/"
# Output data project and directory. [Ensure this directory exists to avoid upload errors]
OUTPUT_PROJECT_ID="J04yP4QJ5qQ4pgP2xQpb0Qp3"
OUTPUT_PROJECT_DIR="/user/zlb/ukb/wes/chr${CHROM}"
# ---------- File List ----------
echo "[STEP 1] Finding VCF files for chromosome ${CHROM}..."
# Search for all vcf.gz files in the specified input path
mapfile -t FILES < <(
dx find data -r --name "ukb23157_c${CHROM}_b*.vcf.gz" \
--path "project-${INPUT_PROJECT_ID}:${INPUT_PROJECT_DIR}/" \
--brief \
| xargs -n1 -I {} dx describe {} --name
)
TOTAL_FILES=${#FILES[@]}
echo "Found ${TOTAL_FILES} files"
# ---------- Job Mapping ----------
if [[ $BATCH_NUM -le 0 ]]; then
echo "Error: BATCH_NUM must be greater than 0"
exit 1
fi
# Calculate the size of each batch (ceiling)
BATCH_SIZE=$(( (TOTAL_FILES + BATCH_NUM - 1) / BATCH_NUM ))
START_INDEX=$(( BATCH_INDEX * BATCH_SIZE ))
END_INDEX=$(( START_INDEX + BATCH_SIZE - 1 ))
# Validate the batch index
if [[ $BATCH_INDEX -ge $BATCH_NUM ]]; then
echo "Error: BATCH_INDEX cannot exceed BATCH_NUM-1"
exit 1
fi
# Handle the last batch which may be smaller
if [[ $END_INDEX -ge $TOTAL_FILES ]]; then
END_INDEX=$(( TOTAL_FILES - 1 ))
fi
# Check for empty batches (e.g., if BATCH_INDEX is too high)
if [[ $START_INDEX -gt $END_INDEX ]]; then
echo "Warning: Current batch is empty (no files fall in this index range)."
exit 0
fi
echo "Total batches: ${BATCH_NUM} | Approx. ${BATCH_SIZE} files per batch"
echo "Current batch: ${BATCH_INDEX} (file index range ${START_INDEX}-${END_INDEX})"
# ---------- Processing Logic ----------
for (( i=START_INDEX; i<=END_INDEX; i++ ))
do
INPUT="${FILES[$i]}"
echo "Progress: $((i - START_INDEX + 1))/$((END_INDEX - START_INDEX + 1)) - File ${INPUT}"
# Output filename, e.g., rename ukb23157_c1_b1.vcf.gz to ukb23157_c1_b1.gtb
OUTPUT="$(basename "${INPUT}" .vcf.gz).gtb"
# Check if the output file already exists on the platform and skip if it does
if dx ls "project-${OUTPUT_PROJECT_ID}:${OUTPUT_PROJECT_DIR}/${OUTPUT}" > /dev/null 2>&1; then
echo "File already exists, skipping: ${OUTPUT}"
continue
fi
# Download the file if it doesn't exist locally
if [[ -f "${INPUT}" ]]; then
echo "Local file exists, skipping download: ${INPUT}"
else
echo "Downloading file: ${INPUT}"
dx download "project-${INPUT_PROJECT_ID}:${INPUT_PROJECT_DIR}/${INPUT}"
fi
# Process the file: convert VCF to GTB with filtering
echo "Converting VCF -> GTB..."
java -Xms6g -Xmx6g -jar kgga.jar gbc convert \
vcf2gtb "${INPUT}" -o "${OUTPUT}" \
--field --prune --seq-an 1~ --seq-af 0.000001~0.999999 \
-t 4 || { echo "Java conversion failed for ${INPUT}"; exit 1; }
# Upload the resulting GTB file
echo "Uploading file: ${OUTPUT}"
dx upload "${OUTPUT}" --path "project-${OUTPUT_PROJECT_ID}:${OUTPUT_PROJECT_DIR}/${OUTPUT}" || { echo "Upload failed for ${OUTPUT}"; exit 1; }
# Clean up by deleting local files to save space
rm "${INPUT}" "${OUTPUT}" || { echo "Failed to clean up files"; exit 1; }
echo "Finished processing: ${INPUT}"
done
echo "Batch ${BATCH_INDEX} processing complete"
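The output-name derivation inside the loop is a plain basename substitution; shown standalone below (the input name is a hypothetical block file following the ukb23157_c<chrom>_b<block> pattern):

```shell
# Derive the GTB output name from a VCF block name, as the loop above does
INPUT="ukb23157_c21_b7.vcf.gz"                 # hypothetical example name
OUTPUT="$(basename "${INPUT}" .vcf.gz).gtb"    # strip .vcf.gz, append .gtb
echo "${OUTPUT}"
```

Because basename also strips any leading directory components, the same expression works whether INPUT is a bare file name or a full local path.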
After all batches for a chromosome have been processed, execute the following commands on a single worker to merge the intermediate GTB files:
# ---------- Job Basic Information ----------
CHROM="21"
# Input directory containing the intermediate GTB files
INPUT_PROJECT_ID="J04yP4QJ5qQ4pgP2xQpb0Qp3"
INPUT_PROJECT_DIR="/user/zlb/ukb/wes/chr${CHROM}"
# Output project and directory for the final merged file
OUTPUT_PROJECT_ID="J04yP4QJ5qQ4pgP2xQpb0Qp3"
OUTPUT_PROJECT_DIR="/user/zlb/ukb/wes/"
OUTPUT="chr${CHROM}.hg38.gtb"
# ---------- Processing Logic ----------
# Get a list of all intermediate GTB files
mapfile -t FILES < <(
dx find data --name "ukb23157_c${CHROM}_b*.gtb" \
--path "project-${INPUT_PROJECT_ID}:${INPUT_PROJECT_DIR}/" \
--brief \
| xargs -n1 -I {} dx describe {} --name
)
# Download all intermediate files
for (( i=0; i<${#FILES[@]}; i++ ))
do
dx download "project-${INPUT_PROJECT_ID}:${INPUT_PROJECT_DIR}/${FILES[$i]}"
done
# Sort files numerically by block number to ensure correct order
mapfile -t SORTED_FILES < <(ls ./*.gtb | sort -t '_' -k3.2n)
# Concatenate the GTB files
java -jar kgga.jar gbc concat "${SORTED_FILES[@]}" -o "${OUTPUT}"
# Upload the final merged file
dx upload "${OUTPUT}" --path "project-${OUTPUT_PROJECT_ID}:${OUTPUT_PROJECT_DIR}/${OUTPUT}" || { echo "Upload failed for ${OUTPUT}"; exit 1; }
# Delete the intermediate sub-files
rm -f ukb23157_c*.gtb
This process results in a single GTB file for each chromosome.
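The numeric sort key is the subtle step in the merge: a plain lexicographic sort would place b10 before b2 and corrupt the concatenation order. The key -k3.2n (third underscore-delimited field, starting at its second character, i.e., the digits after "b") can be verified on sample names:

```shell
# Numeric sort on the block number embedded in the file name
printf 'ukb23157_c21_b10.gtb\nukb23157_c21_b2.gtb\nukb23157_c21_b1.gtb\n' \
  | sort -t '_' -k3.2n
```

The output lists b1, then b2, then b10, i.e., true numeric block order.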
2.6. Convert GTB to Other Formats (Genotype-level Analysis)
The GTB format typically saves 50%–99% of storage space compared to other formats. Its analysis and computation performance is comparable to high-performance bit-encoded formats like PLINK's PGEN and BED. A brief guide can be found at https://pmglab.top/kgga/v1.0/GBCQuickStartEn.html.
KGGA can export specific subsets of samples and genomic regions from a GTB file. The example below filters variants by allele frequency and exports them to PGEN format (first, install the Python dependencies using pip3 install jep zstandard pgenlib):
java -Djava.library.path="$(pip3 show jep | grep Location | awk '{print $2"/jep"}')" -jar \
kgga.jar gbc convert gtb2plink chr21.hg38.gtb --seq-af 0.05~0.95 -ot pgen
2.7. Annotate Variants using KGGA
KGGA can perform high-speed annotation of genomic attributes directly on GTB files. This example demonstrates how to annotate variants based on gene features and filter them to retain only protein-altering non-synonymous mutations. For more detailed functionalities, please refer to the KGGA online documentation.
To annotate and filter variants, saving the results to a TSV text file:
# Prepare the resource directory for KGGA annotation
mkdir -p resources/gene
cd resources/gene
# Download the resource file for KGGA annotation (using the file ID from step 2.2)
dx download file-J21Y8J0J5qQ6KZPxjvpgjF0P
cd ../../
# Start the annotation process
# This command annotates variants in chr21.hg38.gtb with an Allele Count >= 1,
# using the refGene model, and includes functional consequences from nonsynonymous to LoF (0~6).
java -Xmx4g -jar kgga.jar annotate \
--input-gty-file ./chr21.hg38.gtb \
--output ./annot \
--seq-ac 1~ \
--gene-model-database refgene \
--gene-feature-included 0~6
You can quickly check the results with command-line tools:
zless /opt/notebooks/annot/variants.hg38.tsv.gz
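For a quick tally of the annotation results, standard text tools suffice. The sketch below builds a tiny mock TSV (the column names are illustrative assumptions, not KGGA's actual output schema) and counts variants per gene feature:

```shell
# Build a tiny mock annotation table (column names are hypothetical stand-ins,
# not KGGA's actual output schema)
printf 'CHROM\tPOS\tREF\tALT\tGeneFeature\n21\t100\tA\tG\tmissense\n21\t200\tC\tT\tstopgain\n21\t300\tG\tA\tmissense\n' \
  | gzip > mock.variants.tsv.gz
# Count variants per gene feature, skipping the header line
gzip -dc mock.variants.tsv.gz | tail -n +2 | cut -f5 | sort | uniq -c | sort -rn
```

Pointing the same pipeline at the real variants.hg38.tsv.gz (with the correct column index for the gene-feature field) gives a fast sanity check that the feature filter behaved as intended.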