PMGLab    KGGSeq: A biological Knowledge-based mining platform for Genomic and Genetic studies using Sequence data
 
 
KGGSeq Application
Go to mirror site in China (higher speed)
Type | File | Size | Version
Package | KGGSeq with minimal resource bundles for quick testing | 1.3GB | 1.2
Package | KGGSeq + Full resource bundles (for hg19) | 27GB | 1.2
Package | KGGSeq + Full resource bundles (for hg38) | 27GB | 1.2
Latest core library | KGGSeq executable file only | 31MB | 1.2
All | All available resources in OneDrive | - | -
User Manual | Only online version provided since 0.3 | - | 1.1+
Source codes | KGGSeq GitHub | - | 1.1+

Datasets
Type | File | Version
Example data | examples.zip | 1.1+
ExoVar training sets | ExoVar | June, 2012

Note:
If you have any questions about KGGSeq, please email limx54@gmail.com.
You are also welcome to join our Google group, which is used for communication and discussion of KGGSeq usage and functions.


Disclaimer:
KGGSeq is free of charge. All materials on the website are provided without any warranty. Please use them at your own risk.

Updated:
06/06/2024
Fix a bug in output of ANNOVAR format.
04/29/2023
Fix a bug in resource data location of dbNCFP scores.
04/11/2023
Allow KGGSeq to read a highly addressable byte-encoding genotype blocking format, GTB.
01/30/2023
Fixed a minor bug for annotating variants in duplicated genomic regions.
09/27/2022
Fixed a bug in the new algorithm (released on 08/26/2022) of variants' gene feature annotation for Indels.
08/26/2022
Improve the speed of variants' gene feature annotation.
07/26/2022
Add clinvar_clnsig and clinvar_trait of dbNSFP4.2a into the output file.
01/01/2022
1. Update resources of KGGSeq, including RefGene, GENCODE, UCSC gene models and dbNSFP4.2a.
2. Add variant expression in multiple tissues and cell-types from GTEx.
3. Add more analysis options for RUNNER.
11/16/2021
1. Allow RUNNER to consider customized weights of variants.
2. Add mutation rates of gnomAD as covariates of RUNNER.
3. Enable users to select C/T type variants for all analyses.
09/25/2021
1. Adjust the disease status (--phen) by covariates (--cov) with logistic regression (--adj-disease) for RUNNER.
2. Allow excluding genes with only one variant (--ignore-gene-fewer-var 1).
02/18/2021
Fix KGGSeq's resource download link for the new website.
07/17/2020
Fix a resource bug for RUNNER analysis based on reference genome hg38.
06/16/2020
Simplify options for mutation burden tests by setting default parameters.
03/18/2020
Upgrade KGGSeq from v1.1 to v1.2 in which a new analysis named Pheno-RUNNER is integrated.
03/01/2020
Update the gnomAD genome R3.0 for hg38.
12/07/2019
Fix a bug for option '--genes-out'.
11/28/2019
Fix a minor bug in checking R packages.
11/20/2019
Fix a minor bug in double-hit filtering.
10/30/2019
Fix a minor bug in searching allele-frequency of indels.
10/03/2019
Fix a minor bug in output of RVTESTS analysis.
08/19/2019
Improve the input and output of RVTESTS for gene-based association.
07/06/2019
1. Add a new analysis function, WITER, into KGGSeq for estimating cancer-driver genes. See details in the user manual.
2. Fix a bug of missing unlabeled variants in gene feature filtering.
03/29/2019
1. Add a new analysis function, RUNNER, into KGGSeq for powerful genetic mapping of rare susceptibility mutations in patient-only and case-control samples. See details in the user manual.
2. Fix a bug of missing unlabeled variants in gene feature filtering.
03/05/2019
Fix a minor bug for merging downloaded dbncfp files.
02/18/2019
Fix a minor bug for using dbnsfp and dbncfp together in --db-score.
01/15/2019
Update from v1.0 to v1.1, which has 2 major updates besides fixes for some minor bugs:
1. The program coding structure is revised substantially.
2. Update refGene model and gnomAD dataset to be v2.1.
11/17/2018
Fix a minor bug in regulatory prediction at non-coding variants.
6/20/2018
Fix a minor bug in numbering the affected exon in gene feature annotation.
4/10/2018
Fix a minor bug in filtering variants by allele-frequencies for InDels.
3/16/2018
Fix a minor bug in predicting pathogenic mutations for Mendelian diseases.
1/25/2018
Fix a minor bug in predicting regulatory potential by some cell types.
1/12/2018
Fix a minor bug in allele frequency annotation for InDels of multiple alleles.
1/8/2018
1. Annotate genes with tissue- or cell-type specific expression.
2. Improve prediction model for cancer driver mutations.
3. Update the variant annotation with dbNSFP v3.5.
4. Update OMIM information.
5. Update COSMIC database to be V83.
9/11/2017
Fix a minor bug of counting variants in genes
7/20/2017
Fix a minor bug of missing indel variants in output VCF files.
7/10/2017
Fix a minor annotation bug for variants with mixed substitution and insertion.
6/30/2017
Fix a minor annotation bug for frameshift and non-frameshift variants
5/18/2017
Allow KGGSeq to jointly annotate non-synonymous variants and splicing variants within X bp for each subject when phased genotypes in VCF format are available. See more.
5/4/2017
Fix a minor bug in allele frequency annotation by dbNCFP.
4/14/2017
1. Fix a minor bug in allele frequency annotation by ExAC.
2. Add allele frequencies from the Genome Aggregation Database (gnomAD), which contains 123,136 exome sequences and 15,496 whole-genome sequences from unrelated individuals.
3/28/2017
Fix a minor bug in the function of gene feature annotation at deletion variants.
3/28/2017
Updated the ExAC database from v0.3 to v1.0 and fixed a bug of unmatched alleles at deletion variants.
3/26/2017
Fix a bug in LD pruning!
3/18/2017
Add a function for tissue- or cell-type-specific epigenomic weighting to prioritize regulatory variants and disease-associated genes.
3/8/2017
Allow KGGSeq to accept a super simple VCF format which only contains the first 5 columns (#CHROM POS ID REF ALT).
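As an illustration of what such a five-column file looks like, the following minimal Python sketch trims each record of a full VCF to its first five tab-separated columns. The function name `to_minimal_vcf` and the sample records are hypothetical, and passing `##` meta-information lines through unchanged is an assumption; this entry does not say whether KGGSeq needs them.

```python
import io

def to_minimal_vcf(lines):
    """Yield VCF lines reduced to the first five columns.

    Hypothetical helper for illustration; not part of KGGSeq.
    """
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        if line.startswith("##"):
            yield line  # assumption: keep meta-information lines as they are
            continue
        # Header (#CHROM ...) and data lines: keep CHROM, POS, ID, REF, ALT only.
        yield "\t".join(line.split("\t")[:5])

# Made-up example input with the standard eight fixed VCF columns.
full_vcf = io.StringIO(
    "##fileformat=VCFv4.2\n"
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n"
    "1\t12345\trs001\tA\tG\t50\tPASS\t.\n"
    "2\t67890\t.\tCT\tC\t99\tPASS\t.\n"
)

minimal = list(to_minimal_vcf(full_vcf))
print("\n".join(minimal))
```

The printed header line then reads `#CHROM POS ID REF ALT` (tab-separated), matching the column list named in this entry.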
2/25/2017
Update the download link of mouse phenotype dataset.
2/10/2017
Fix a bug in pathogenic prediction under the feature specific model.
1/28/2017
Fix a bug of missing annotation by dbNSFP dataset.
1/26/2017
Add non-synonymous allele frequencies from the DiscovEHR Collaboration, which contains over 50,000 participants.
12/29/2016
Fix a bug in parallel computing with the R package 'snow'.
12/16/2016
1. Fix a bug in a VCF with a mixture of phased and unphased genotypes.
2. Update the protein interaction database STRING to be V10.
3. Update the gene set database GSEA to be V5.2.
12/10/2016
Fix a bug in knockout mouse annotation.
11/25/2016
Update the variant and frequency information of the ExAC database (to r0.3.1), in which 60,706 unrelated individuals were sequenced.
11/14/2016
Fix a bug for frequency-based filter for deletions.
11/08/2016
Fix broken connection to NCBI PubMed.
11/02/2016
Change the prediction model for pathogenic variants of Mendelian diseases.
10/28/2016
Fix a bug in computing regulatory composite scores.
10/18/2016
Count double-hit genes in unaffected offspring of a trio.
10/16/2016
Fixed a bug in producing index file for RVTESTS.
9/24/2016
Fixed a minor bug for RSID annotation with hg38.
9/18/2016
Fixed some minor bugs for hg38.
8/24/2016
Fixed the broken link for DDD dataset.
7/28/2016
Add a tag for a VCF file containing both unphased and phased genotypes.
7/21/2016
Fix a minor bug in IBS estimation
7/14/2016
1. Reduce the size of output genotype data in KED binary format.
2. Fix a minor bug in building index file.
7/7/2016
Fix a minor bug in the unique gene filter (--unique-gene-filter).
6/28/2016
Fix a minor bug in output format.
6/21/2016
1. Update the function for functional or regulatory prediction at whole genome variants.
2. Update the function for pathogenic prediction at genes.
3. Fixed a bug in annotation by known genes for zebrafish and mouse.
5/9/2016
Fixed a bug in COSMIC cancer annotation.
5/5/2016
Fixed an output bug of reads for de novo mutations.
04/27/2016
Fixed a minor bug in annotation of dbNSFP dataset.
04/24/2016
Speed up the filtering by allele frequencies by a multiple-thread matching function.
04/18/2016
Add a function of LD pruning.
04/12/2016
Fixed a minor bug in resource downloading.
04/02/2016
Fixed the modified URL and format in the Development Disorder Genotype – Phenotype Database.
03/30/2016
Fix a minor bug in QQPlot of RVTESTS.
03/10/2016
Refine detailed gene feature annotation labels for InDels.
03/06/2016
Fixed some minor bugs in parsing annotation data with the faster algorithm.
02/26/2016
Improve the speed of parsing compressed annotation.
02/02/2016
Improve VCF parsing to tolerate some ill-formatted input.
01/05/2016
Improve the association analysis with RVTESTS, in which the analysis models can be flexibly specified via KGGSeq.
12/28/2015
Fixed a minor bug for Hardy-Weinberg test.
12/24/2015
Fixed bugs in producing input files for PLINK and RVTESTS.
12/20/2015
Optimize memory usage for large scale input.
12/12/2015
Release KGGSeq V1.0, a milestone upgrade for large-scale whole genome sequencing studies of complex diseases. See more in the online user manual.
10/28/2015
Fixed a bug in gene-feature annotation for non-synonymous variants.
4/23/2015
Fixed a bug in IBD region filtering.
3/25/2015
Format the output of gene feature annotations according to Human Genome Variation Society (HGVS) recommendations.
3/20/2015
Refine the gene feature annotation. Add the stop-loss gene feature annotation for variants.
2/28/2015
1. Update dbNSFP to be version 2.9.
2. Fix a bug for the option --genes-in.
1/18/2015
Restructure the resource data folders of kggseq.
12/18/2014
Update the function of gene-based analysis for cancer driver mutations.
12/2/2014
1. Update dbNSFP to be V2.8.
2. Fixed a minor bug for --double-hit-gene-trio-filter.
11/24/2014
Add variants in the ExAC dataset for filtration.
11/21/2014
1. Improved the basic gene feature annotation with RefGene database (more accurate).
2. Fixed a bug in exporting ANNOVAR format for deletion variants.
11/18/2014
1. Add gene feature annotation from Ensembl database.
2. Optimize the VCF parsing algorithm for using less memory.
11/9/2014
Fixed potential Exception bugs in parsing VCF and sorting variants!
10/25/2014
1. Add the latest version of 1000 Genomes Project data for data merging.
2. Optimize the function of merging 1000 Genomes data.
10/22/2014
1. Refine filtering by reference variants. See more.
2. Provide a dataset dbsnp138nf for hard filtering without consideration of allele frequencies.
10/07/2014
1. Add an option '--no-qc' to turn off all quality control functions for variants and genotypes in VCF format. Thanks to Ricky Lali for the suggestion!
2. Add "--missing-gty X" to denote missing genotypes in the output VCF file. It is "--missing-gty ./." by default. For vcf-tools, one should set "--missing-gty .". Thanks to Lorenzo Tattini for the suggestion!
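To show how the two encodings of a missing genotype differ in a VCF record, here is a small Python sketch that rewrites "./." as "." in the sample columns. It is not KGGSeq's implementation; the helper `recode_missing` and the sample record are made up for illustration, and the code assumes GT is the first sub-field of each sample column.

```python
def recode_missing(record, missing="."):
    """Replace missing diploid genotypes (./. or .|.) with another symbol.

    Hypothetical helper for illustration; assumes GT is the first
    colon-separated sub-field of every sample column.
    """
    fields = record.rstrip("\n").split("\t")
    # The first nine columns are fixed VCF fields (CHROM ... FORMAT);
    # per-sample genotype columns start at index 9.
    for i in range(9, len(fields)):
        parts = fields[i].split(":")
        if parts[0] in ("./.", ".|."):
            parts[0] = missing
            fields[i] = ":".join(parts)
    return "\t".join(fields)

# Made-up record with three samples, two of them with missing genotypes.
record = "1\t12345\trs001\tA\tG\t50\tPASS\t.\tGT:DP\t0/1:20\t./.:0\t./."
recoded = recode_missing(record)
print(recoded)
```

With the default symbol, the last two sample columns become `.:0` and `.`, while the called genotype `0/1:20` is left untouched.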
10/03/2014
Fixed two minor bugs: 1) getting stuck and 2) missing genotypes in VCF output. Thanks to Mike Chong for reporting the bugs!
09/29/2014
Fixed a bug in parsing numeric values from text.
09/24/2014
1. Fixed typos in the log file.
2. Further optimize the text search engine to speed up the procedure.
09/18/2014
Integrate the functional prediction scores in dbNSFP 2.7 to predict pathogenic variants for Mendelian diseases and cancers! A new prediction score, VEST3_score, is added.
09/04/2014
Optimize the prioritization and filtering procedure to speed up the analysis.
09/03/2014
Integrate reference variants released by the 1000 Genomes Project in May 2013 (containing over 81 million sequence variants from 2504 subjects).
09/01/2014
Fixed a minor bug in reference allele frequency-based filtering.
08/27/2014
Improve the output of LOG information and allow the screen LOG information to be redirected to a file.
08/08/2014
Allow VCF input without any phenotype information for basic annotation and prioritization.
07/27/2014
Correct errors about allele frequencies in dbSNP 137 and 138 and add dbSNP141 for allele-frequency-based filtering.
07/17/2014
Fixed a minor bug for allele frequency filtering.
05/15/2014
Incorporate dbsnp138 for filtering and clarify variants with missing allele frequencies and not existing in databases. See more here.
05/03/2014
Allow the ANNOVAR format to have a header row and multiple columns for comments. See more here.
05/01/2014
Add a function for gene-based filtering. See more here.
04/28/2014
Fixed a minor bug in summary statistics of eliminated variants.
04/16/2014
1. Add two variant-filtering functions by super-duplicate regions and the number of variants in a gene.
2. Save resulting variants in Excel2007+ format, which has larger capacity than Excel1997.
04/10/2014
Fixed a bug in finding matched Indels. Thanks to JIBIN JOHN from UNIVERSITY OF DELHI for reporting the problem and preparing the testing data!
03/20/2014
Fixed small bugs for tri-allelic variants and exome length calculation.
03/14/2014
1. Update the pathogenic prediction models for Mendelian and cancer diseases with dbNSFP 2.4.
2. Add more flexible functions for de novo mutation filtration.
02/15/2014
Add the UCSC known genes for variants annotation.
02/09/2014
Update dbNSFP to be version 2.3, in which there are fewer missing prediction scores.
01/24/2014
The gene set defined by GENCODE is used for gene feature annotation. An advantage of GENCODE over RefGene may be that the former contains definitions of mitochondrial genes. See more here.
12/09/2013
Improved the double-hit gene filtration function. See more in Double-hit gene filter.
11/21/2013
Incorporate genotypes from the HapMap and 1000 Genomes Projects to more accurately check cross-subject contamination in small samples. See more in Sample QC.
11/9/2013
1. Update the deleteriousness prediction scores in dbNSFP to be version 2.1.
2. Fix several cryptic bugs. Thanks to Michael Chong for reporting the bugs!
08/06/2013
Fixed a bug in annotation of splicing variants. Thanks to Martin Haagmans for reporting the bug!
07/14/2013
1. Add a function to explore double-hit genes. See more here.
2. Refine an algorithm to speed up the exploration of longest IBS and homozygous genotype regions. See more here.
06/23/2013
1. Add two short tutorials for cancer-driver somatic mutation and de novo mutation analysis (see more by clicking the links on the upper-left panel).
2. Released a module to detect somatic mutations, genes and pathways driving cancer based on exome sequencing data. See details here.
3. Fixed a number of potential bugs.
05/29/2013
Add a variant-level QC tag, --vcf-filter-in, to use external QC functions such as Variant Quality Score Recalibration (VQSR). See details here.
05/25/2013
Modified the definition of --gty-af-alt and added a QC tag, --gty-af-het, which is critical for de novo mutation scanning. See details here.
05/16/2013
Extend the De novo mutation scanning function for multiple independent families or pedigrees at a time.
04/23/2013
Add an OMIM disease gene annotation function.
04/15/2013
Add the ESP6500 dataset as reference dataset for filtration of known sequence variants (hg19_ESP6500AA and hg19_ESP6500EA).
04/08/2013
Add the dbSNP137 dataset as reference dataset for filtration of known sequence variants (hg18_dbsnp137 and hg19_dbsnp137).
04/05/2013
1. Add a function to detect somatic mutation(s) using matched tumor and non-tumor samples.
2. Update the pathogenic prediction by the dbNSFP v2.0 and add at most 13 available functional impact scores to more accurately calculate pathogenic probability by Logistic Regression Model.
3. Refine the cancer driver prediction function by a more reasonable training dataset.
02/06/2013
Add a function to predict "driver" somatic mutations for cancer by the 12 available functional impact scores from dbNSFP v2.0b4.
12/08/2012
Update the pathogenic prediction by the dbNSFP v2.0b4 and add at most 12 available functional impact scores to more accurately calculate pathogenic probability by Logistic Regression Model.
09/06/2012
Add a genotype filter for the detection of de novo mutations in trios.
08/08/2012
Update from KGGSeq V0.2 to KGGSeq V0.3
1. Added a function to export genotypes into the KGGSeq binary genotype format, which uses much less space to store phased genotypes of sequences with multiple alternative alleles (up to 3).
2. Improved the algorithm to map sequence variants to RefGene, which is now quicker and more accurate.
3. Simplified the commands.
4. Re-wrote the user manual, which is much easier to read.
06/23/2012
Merge all library files into a single kggseq.jar. This facilitates the redistribution of kggseq.
06/13/2012
1. Add a function for outputting genotypes in Plink format and flexibly select the sequence variants according to allele frequencies. This would be useful for QC checking by using Plink http://pngu.mgh.harvard.edu/~purcell/plink/ibdibs.shtml.
2. Use a new training dataset, ExoVar, to predict disease-causal nsSNVs.
3. Integrate a public variants dataset from NHLBI GO Exome Sequencing Project (ESP, http://evs.gs.washington.edu/EVS/) for variants filtration.
05/07/2012
Update the deleteriousness scores from dbNSFP 2.0, http://sites.google.com/site/jpopgen/dbNSFP.
03/15/2012
Add the protein features from http://www.uniprot.org/ and reference variants 1000g2012feb, dbSNP132 and dbSNP135 in hg19.
02/15/2012
Integrate the available Uniprot protein domains to annotate amino acid changes.
12/02/2011
Provide a Commands Generator with graphic interface for kggseq.
09/12/2011
Add a function to convert the VCF format into a format which can be recognized by kggseq as local filters (--local-filter).
08/25/2011
Release the first formal version of KGGSeq after extensive testing.

Archive:
Type | File | Version
MS Windows / Mac OS X / Linux | KGGSeq.zip | 1.0
User Manual | User Manual | 1.0
MS Windows / Mac OS X / Linux | KGGSeq.zip | 0.8
User Manual | User Manual | 0.8
MS Windows / Mac OS X / Linux | KGGSeq.zip | 0.2
User Manual | User Manual.pdf | 0.2
MS Windows / Linux | KGGSeq.Commands.Generator | 0.2


Miao-xin Li, Precision Medical Genomics Laboratory, All rights reserved.