Introduction and References
Install and run
Short Tutorials
Quick examples
Options
FAQ
|
KGGSeq: A biological Knowledge-based mining platform for Genomic and Genetic studies using Sequence data
(Version 0.8)
Miaoxin Li, Hongsheng Gui, Suying Bao and JS Kwan |
KGGSeq is designed to make use of biological knowledge for sequencing-based gene mapping of human diseases/traits. Compared with other tools like Plink/Seq (“A library for the analysis of genetic variation data”, http://atgu.mgh.harvard.edu/plinkseq/), KGGSeq focuses more on downstream analysis. Currently it mainly has two pipeline carefully designed to filter and prioritize gene variants in exome sequencing of rare Mendelian disorders and common complex disorders. More pipelines will be developed on the KGGSeq platform for the knowledge-based downstream analysis of monogenic and complex diseases/traits using sequencing data.
|
References
1. Li MX, Gui HS, Kwan JS, Bao SY, Sham PC. A comprehensive framework for prioritizing variants in exome sequencing studies of Mendelian diseases.Nucleic Acids Res. 2012 Apr;40(7):e53. PubMed NAR
2. Li et al. Predicting Mendelian disease-causing non-synonymous single nucleotide variants in exome sequencing studies. PLoS Genet. 2013 Jan;9(1):e1003143. PubMed PLoS Genet
|
Install and run
Java Runtime Environment (JRE) version 6.0 is required for KGGSeq. It can be downloaded from the Java web site.
The version number for KGGSEQ is 1.6 or up. Installing the JRE is very easy in Windows OS and Mac OS X.
In Linux, you have more work to do. Details of the installation can be found http://www.java.com/en/download/help/linux_install.xml.
In Ubuntu, if you have an error message like: "Exception in thread "AWT-EventQueue-0" java.awt.HeadlessException ... " , please install the Sun Java Running Environment (JRE) first.
To install the Sun JRE on Ubuntu(10.04), please use the following commands:
sudo add-apt-repository "deb http://archive.canonical.com/ lucid partner"
sudo apt-get update
sudo apt-get install sun-java6-jre sun-java6-plugin sun-java6-fonts
Detailed explanation of above commands can be found at http://www.ubuntugeek.com/how-install-sun-java-runtime-environment-jre-in-ubuntu-10-04-lucid-lynx.html.
For Mac OS, the JRE 1.6 has been available at http://developer.apple.com/java/download/ since April 2008. Mac OS users may need update the Java application to run KGGSeq. A potential problem is that this update does not replace the existing installation of J2SE 5.0 or change the default version of Java. Similar to the Linux OS, the Java_Home environmental variable has to be configured to initiate KGGSeq.
|
Simply decompress the archive and run the following command
java -Xms256m -Xmx1300m -jar ./kggseq.jar <arguments>
The arguments -Xms256m and -Xmx1300m set the initial and maximum Java heap sizes for KGGSeq as 256 megabytes and 1.3 gigabytes respectively. Specifying a larger maximum heap size can speed up the analysis. A higher setting like -Xmx2g or even-Xmx5g is required when there is a large number of variants, say 5 million. The number, however, should be less than the size of physical memory of a machine.
Note <arguments >can be saved in a flat text file.
Note To use KGGSeq under Proxy Network, you need specifically use the following java command to configure the proxy settings (Thank Daniele Yumi for this suggestion):
java -Dhttp.proxyHost=xxx.xxx.xxx -Dhttp.proxyPort=xxxx -Dhttp.proxyUser=xxx -Dhttp.proxyPassword=xxx -jar "./kggseq.jar"
Simply download a file named kggseq.commands.generator (http://statgenpro.psychiatry.hku.hk/limx/kggseq/download/kggseq.commands.generator.jar) the archive and run the following command
java -jar "./kggseq.commands.generator.jar"
Note kggseq.commands.generator.jar is an independent utility with friendly GRAPHIC interface to help users preparing input arguments of KGGSeq.
Note To use KGGSeq.Commands.Generator under Proxy Network, you need specifically use the following java command to configure the proxy settings (Thank Daniele Yumi for this suggestion):
java -Dhttp.proxyHost=xxx.xxx.xxx -Dhttp.proxyPort=xxxx -Dhttp.proxyUser=xxx -Dhttp.proxyPassword=xxx -jar "./kggseq.commands.generator.jar" |
Quick example
Example 1: Prioritize variants based on the hg18 assembly for a rare Mendelian disease
Files needed:
1)a VCF file (rare.disease.hg18.vcf); and
2)a linkage pedigree file (rare.disease.ped.txt).
Note All files were included in the examples folder of KGGSeq.
Run the command below:
java -Xms256m -Xmx1300m -jar kggseq.jar examples/param.rare.disease.hg18.txt
We now walk through the parameter file “param.rare.disease.hg18.txt” before going into the results. Lines starting with hash sign # are comments. Detailed interpretation for each argument in the parameter file is included in 'Options' part.
#one argument per line
#I.Environmental setting
--buildver hg18 \ \ #line 1
--resource ./resources \ \ #line 2
#--no-lib-check \ #line 3
#--no-resource-check \ \ #line 4
#II. Specify the input files
--vcf-file examples/rare.disease.hg18.vcf \ #line 5
--ped-file examples/rare.disease.ped.txt \ #line 6, or specify --indiv-pheno X:1,Y:1,Z:2
#III. Output setting
--out ./test1 \ #line 7
--excel \ #line 8
--o-seattleseq \ #line 9
--o-vcf \ #line 10
#--o-flanking-seq 50 \ #line 11, need large RAM memory
#IV. QC
--gty-qual 10 \ #line 12
--gty-dp 4 \ #line 13
--gty-af-ref 0.05 \ #line 14
--gty-af-het 0.25 \ #line 15
--vcf-filter-in PASS,VQSRTrancheSNP90.00to93.00,VQSRTrancheSNP93.00to95.00,VQSRTrancheSNP95.00to97.00,VQSRTrancheSNP97.00to99.00 \ #line 16
--seq-qual 50 \ #line 17
--seq-mq 20 \ #line 18
--seq-sb 0 \ #line 19
--min-obsa 1 \ #line 20
--min-obsu 0 \ #line 21
#V. Filtering and Prioritization
--genotype-filter 3 \ #line 22
--ibs-case-filter 1000 \ #line 23, or specify 'ibdregions.txt' file
--db-gene refgene \ #line 24
--gene-feature-in 0,1,2,3,4,5,6 \ #line 25
--db-filter hg18_1kgasn2010,hg18_1kgeur2010,hg18_dbsnp138,hg18_dbsnp141,hg18_ESP5400 \ #line 26
--rare-allele-freq 0.006 \ #line 27
--db-score dbnsfp \ #line 28
--filter-nondisease-variant \ #line 29
--regions-out chrX,chrY \ #line 30
#VI. Annotation
--genome-annot \#line 31
--candi-list ATXN1,ATXN2,ATXN8OS,ATXN8,ATXN10,TTBK2 \ #line 32
--pathway-annot cura \ #line 33
--ppi-annot string \ #line 34
--ppi-depth 1 \ #line 35
--pubmed-mining Spinocerebellar+ataxia \ #line 36
Part (I): Specify general environmental setting
Argument ‘line 1~4’ is used to set general KGGSeq running enviroment, including human genome build, resource path and program update
Part (II): Specify the input files
Arguments ‘line 5~6’ are used to specify the input files which support various data format, and they are compulsory for running KGGSeq.
Part (III): Output setting
Argument ‘line 7~11’ is used to set the output file name and type, which can be produced simutaneously by KGGSeq.
Part (IV): Specify Quality control conditions
Arguments ‘line 12~21’ are used to apply QC conditions (Genotype or Variant level) on input dataset, and then exclude low quality genotypes or variants in the following analysis.
Part (V): Filtering and Prioritization
Arguments ‘line 22~30’ are used to apply a combination of filters (genetic inheritance, gene feature, allele frequency, functional importance, etc) to enable better prioritization of the key variants/genes.
Part (VI): Annotation
Arguments ‘line 31~36’ are used to annotate remained list of variants/genes by searching various genomic annotation database and literature evidence.
Notes Most of the above arguments are optional (except line 5 and 6), so user can mask some lines by “#” or delete the lines. Under this circumstance, user can have a systematic view of the impact for each level or even steps. And it will be easier to produce this parameter file by Kggseq command generator we provide.
After running KGGSeq, you will get three file named ‘test0808_1.fit.xls’, ‘test0808_1.fit.vcf’, and ‘test0808_1.fit.seattleseq’ in the same directory of KGGSeq.jar. For this example case, 2 missense variants residing in CTBP2 gene are retained and all of them are predicted to be disease causal.
We also provide another 5 quick examples (corresponding to different input datasets) in the same folder. Users are recommended to practice each of them for a better understanding of each parameter and their combinations.
Example 2: Prioritize variants based on the hg19 assembly for a rare Mendelian disease
Files needed:
1) a VCF file (rare.disease.hg19.vcf); and
2) a pedigree file (rare.disease.ped.txt).
Note: All files were included in the example folder of KGGSeq.
Run the command below:
java -Xms256m -Xmx1300m -jar kggseq.jar examples/param.rare.disease.hg19.txt
The major difference of example 2 from example 1 is to set human genome build version at hg19 and use different genotype datasets for filteration (See details in param.rare.disease.hg19.txt). After running KGGSeq for example 2, one file named ‘test0808_1.fit.xls’ will be saved in the current directory.
Example3: Prioritize variants based on the hg18 assembly for a complex disease
Files needed:
1) a Plink/Seq summary result file (complex.disease.v.assoc.hg18.out)
Run the command below:
java -Xmx1300m -jar kggseq.jar examples/param.complex.disease.v.assoc.hg18.txt
After running KGGSeq for example 3 by the settings in param.complex.disease.v.assoc.hg18.txt, one file named ‘kggseqtest3.cd.fit.xls’ and one figure named ‘kggseqtest3.cd.qq.png’ will be saved in the current directory. In the excel file, 17 variants (only 1 missense variant) residing in 6 genes (SAMD11, NOC2L, KLHL17, PLEKHN1, AGRN and C1orf159) are remained and well annotated.
Example 4: Prioritize genes based on the hg18 assembly for a complex disease
Files needed:
1) a Plink/Seq summary result file (complex.gene.assoc.hg18.out)
Run the command below:
java -Xmx1300m -jar kggseq.jar examples/param.complex.disease.gene.assoc.hg18.txt
After running KGGSeq for example 4 by the settings in param.complex.disease.gene.assoc.hg18.txt, one file named ‘test3.fit.xls’ and one figure named ‘test3.qq.png’ will be saved in the current directory. In the excel file, 17 variants (only 1 missense variant) residing in 6 genes (SAMD11, NOC2L, KLHL17, PLEKHN1, AGRN and C1orf159) are remained and well annotated.
|
Options
Set the version of reference human genome buildjava -jar kggseq.jar --buildver hg## All inputs will be analyzed and ouputted according to the setted genome version.
NOTE
Currently, KGGSeq only supports hg18 and hg19. It is --buildver hg19 by default.
Specify the local path to store the resource datasets java -jar kggseq.jar --resource path/to/resource/files
Once initiated, KGGSeq will automatically download the resource data from its web server to the local folder. By default, it is --resource "./"
Switch off the auto-checking function for itself.
java -jar kggseq.jar --no-lib-check
By default, KGGSeq will automatically check and download the available latest version of it, once initiated.
Switch off the auto-checking function for all relevant resource datasets.
java -jar kggseq.jar --no-resource-check
By default, KGGSeq will automatically check and download the available latest version of resource datasets, once initiated.
|
VCF format with genotypes
The format KGGSeq supporting best is the VCF format, which stored sequence variants and called genotype information. The VCF file can be flat text or be compressed as .gz.
java -jar kggseq.jar --vcf-file path/to/file1 --ped-file path/to/file2
NOTE
If the data were stored in different files chromosome by chromosome, you can use _CHROM_ to denote the chromosome names [1...Y] or directly specify [1,2,X]in the file name so that kggseq can read data in multiple files in a run.
NOTE
In the VCF file, each subject should have a unique SubjectID, which is used to link the phenotypes and/or pedigree structure.
VCF format WITHOUT genotypes
KGGSeq can also accept VCF data without genotype information. But there are less usable functions for this type data.
java -jar kggseq.jar --no-gty-vcf-file path/to/file1
In addition, KGGSeq also support other popular variants and/or genotypes formats which are popular in next generation studies.
ANNOVAR format
java -jar kggseq.jar --annovar-file path/to/file
NOTE
Since v0.4, KGGSeq recognize an extended ANNOVAR format in which a head row and and multiple columns for comments are allowed.
Example:
chr startpos endpos ref alt comment1 comment2 comment3
1 69428 69428 T G T 92 129
1 69476 69476 T C T 1 0
Samtools Pileup format java -jar kggseq.jar --pileup-file path/to/file
Plink/Seq variants association output --v-assoc-file
java -jar kggseq.jar --v-assoc-file path/to/file
The plink/seq outputs of variant association test, which are stored in a text (or compressed "*.gz") file.
Tips Procedures for running plink/seq to get the summary statistics for individual variants.
pseq proj v-assoc --phenotype my.phenotype [ and other filter options] > path/to/file
OR
pseq data.vcf.gz v-assoc --phenotype phenofile.txt my.phenotype [ and other filter options] > path/to/file
Plink/Seq variants Linear and logistic regression tests output --glm-file
java -jar kggseq.jar --glm-file path/to/file
The plink/seq outputs of variant association test, which are stored in a text (or compressed "*.gz") file.
Tips Procedures for running plink/seq to get the summary statistics for individual variants.
pseq proj glm --phenotype phe1 --covar mds1 mds2 [ and other filter options] > path/to/file
Plink/Seq gene-based association output: --assoc-file
java -jar kggseq.jar --assoc-file path/to/file
The plink/seq outputs of gene-based association test, which are stored in a text (or compressed) file.
Tips Procedures for running plink/seq to get the summary statistics for groups of variants or genes:
pseq proj assoc --phenotype my.phenotype --options calpha vt --mask loc.group=refseq [ and other filter options] > path/to/file
Tips The --associ-file and --v-associ-file (or --glm-file)can be specified at a time, in which KGGSeq will integrate the variants into genes.
VAAST gene-based association output
java -jar kggseq.jar --vaast-simple-file path/to/file
The output of VAAST of gene-based association analysis result in a text file.
The following are the example format:
RANK Gene p-value Score Variants
1 TEKT4 2.84070221452526e-12 47.62418618 chr2:94906054;47.62;T->A;M->K;0,8
2 HLA-C_DUP_02 2.84070221452526e-12 42.05922817 chr6:31345752;42.06;T->C;T->A;0,12
3 USP6 2.84070221452526e-12 41.80136081 chr17:4977987;41.80;G->A;V->I;0,8
KGGSeq binary genotype file set --ked-file
java -jar kggseq.jar --ked-file path/to/fileset
The KGGSeq binary genotype produced by KGGSeq, which is more flexible to store the different types of genotypes than the Plink BED format.
When there are genotypes imputted say in VCF format, KGGSeq should know to whom they belong to and the phenotypes of the sample. So you need provide your sample information in this case.
Concise subject ID and disease status: --indiv-pheno
java -jar kggseq.jar --vcf-file path/to/file --indiv-pheno SubjectID1:0,SubjectID2:2,...
Specify the individual IDs and affection status, which is coded as: 0 missing or unknow,1 unaffected,2 affected.
Note The individual IDs must be identical to the ones defined in the VCF file(s).
Conventional linkage pedigree file: --ped-file
java -jar kggseq.jar --vcf-file path/to/file1 --ped-file path/to/file2 (or --composite-subject-id)
Specify the subject IDs, relationship between subjects, and disease status of them; it is a conventional Linkage Format, detailed here.
- Pedigree Name
A unique alphanumeric identifier for this individual's family. Unrelated individuals should not share a pedigree name.
- Individual ID
An alphanumeric identifier for this individual. Should be unique within his family (see above).
- Father's ID
Identifier corresponding to father's individual ID or "0" if unknown father. Note
that if a father ID is specified, the father must also appear in the file.
- Mother's ID
Identifier corresponding to mother's individual ID or "0" if unknown mother Note that if a mother ID is specified, the mother must also appear in the file.
- Sex
Individual's gender (1=MALE, 2=FEMALE).
- Affection status
Affection status to be used for association tests (0=UNKNOWN, 1=UNAFFECTED,2=AFFECTED).
Note By default, the subject IDs in the VCF file must be identical to Individual IDs in the pedigree file, which are assumed unique in the whole pedigree file. However, one can also ask KGGSeq to use a composite subject ID in the CVF file as "PedigreeName$IndividualID" by the option --composite-subject-id.
|
KGGSeq can flexibly output different formats of the prioritization and annotation results for either final validation or further analysis by third-party tools.
Output file path and file name prefix: --outjava -jar kggseq.jar --vcf-file path/to/file1 --ped-file path/to/file2 --out path/to/prefixname
Specify path and prefix name of outputs. It is "./kggseq" by default.
Text format: This is by defaultjava -jar kggseq.jar --vcf-file path/to/file1 --ped-file path/to/file2
By default, KGGSeq output results in TEXT format in a file named kggseq.flt.txt, in which the fields (or columns) are delimited by the tabs.
Excel format: --excel
java -jar kggseq.jar --vcf-file path//to/file1 --ped-file path/to/file2 --excel
The results will be stored in a Excel file with the name kggseq.flt.xls.
Produce Polyphen input: --o-polyphen
java -jar kggseq.jar --vcf-file path/to/file1 --ped-file path/to/file2 --o-polyphen
Generate an extra copy of output for the prioritized variants in Polyphen input format, which can be further analyzed by Polyphen.
Produce SeattleSeq input: --o-seattleseq
java -jar kggseq.jar --vcf-file path/to/file1 --ped-file path/to/file2 --o-seattleseq
Generate an extra copy of output for the prioritized variants in SeattleSeq input format, which can be further annotated by SeattleSeq.
Produce ANNOVAR input: --o-annovar
java -jar kggseq.jar --vcf-file path/to/file1 --ped-file path/to/file2 --o-annovar
Generate an extra copy of output for the prioritized variants in ANNOVAR input format, which can be further annotated by ANNOVAR.
Produce VCF input: --o-vcf
java -jar kggseq.jar --vcf-file path/to/file1 --ped-file path/to/file2 --o-vcf
Generate an extra copy of output for the prioritized variants in VCF format, which can be further analyzed by other tools.
Note Please use --missing-gty X to denote missing genotypes in the output VCF file. It is --missing-gty ./. by default.
Produce PLINK input: --o-plink-ped
java -jar kggseq.jar --vcf-file path/to/file1 --ped-file path/to/file2 --o-plink-ped
Generate an extra copy of output for the prioritized variants in PLINK PED and MAP file, which can be further annotated by PLINK.
Extract the flanking sequence of each sequence variant: --o-flanking-seq
java -jar kggseq.jar --vcf-file path/to/file1 --ped-file path/to/file2 --o-flanking-seq 50
The source sequence information is the UCSC Chromosome dataset hg18 or hg19, which are human genome reference sequences on the forward strand. The unit of the specified flanking sequence is base-pair(bp). The retrieved sequences will be in the *.flt.txt or *.flt.xls with the column label "FlankingSeqXXbp".
Note Users might need set large memory (by -Xmx4g) to carry out this function as KGGSeq will load the whole sequences of a chromosome.
PLink binary genotypes files set: --o-plink-bed
java -jar kggseq.jar --vcf-file path/to/file1 --ped-file path/to/file2 --composite-subject-id --o-plink-bed
Generate an extra copy of output for the prioritized variants in PLink binary genotype format (SNP-major model), which can store the genotype information more efficiently.
Note This format does not support variants with over 2 alleles and phased genotypes, which will be ignored by Kggseq.To record this information, you are suggested to use the Kggseq binary genotype format (below).
Kggseq binary genotypes files set: --o-ked
java -jar kggseq.jar --vcf-file path/to/file1 --ped-file path/to/file2 --composite-subject-id --o-ked
Generate an extra copy of output for the prioritized variants in KGGSeq binary genotype format, which use much less space to store the genotype information.
Here is an exmaple VCF file:
##fileformat=VCFv4.0
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT 1$A 1$B 1$C
chr1 4793 rs6682385 A G 596.57 PASS AC=2 GT 1/1 0/0 ./.
chr1 53560 . T C 573.51 PASS AC=1; GT ./. 0/0 0/1
chr1 12887054 rs3738105 T C,G 5572.99 PASS AC=2,1 GT 1/2 0/0 0/1 |
And the corresponding pedigree file:
1 A 0 0 1 1
1 B 0 0 2 1
1 C A B 1 2 |
After running the command above, KGGSeq will generat three file:
kggseq.ked (the KGGSeq binary genotype file, genotype information)
kggseq.fam (first six columns of the conventional Linkage Pedigree file)
kggseq.bim (extended MAP file: two extra cols for reference and alternative allele names, like: A G,C)
The kggseq.fam and kggseq.bim are similar to PLINK BED format.
But the kggseq.ked has a different way to present genotypes, which is more flexible.
Here is the binary view of kggseq.ked file:
|-magic number--| |phase mode| |----------------------------------genotype data----------------------------------|
10011110 10000010 00000000 10000000 11000000 10100000 11000000 10000000 00100000 11000000 |
The followings define the KGGSeq binary genotype file format
- First two bytes 10011110 10000010 for the validation checking of a KGGSeq binary genotype file.
- Third byte is 00000000 (Unphased genotypes) or 00000001 (Phased genotypes)
- Genotype data, either for unphased or phased ones in SNP-major order.The 4th and 5th bytes (10000000 and 11000000) present genotypes of chr1:4793. The A, B and C have bit genotypes 11,01 and 00 respectively. The 6th and 7th bytes (10100000 and 11000000) present genotypes of chr1:53560. The 8th, 9th and 10th bytes (10000000, 00100000 and 11000000) present genotypes of chr1:12887054.
- New "row" in the VCF always starts a new byte
- KGGSeq currently considers a variant with at most 4 alleles (either phased or not).
For each sequence variant, the number of bytes required to present a genotype is determined by the number of alleles and the genotype phase status.
I. Unphased -genotypes
I.I Bi-allelic sequence variant (2 bits)
|
missing |
Reference homozygous |
Heterozygous |
Alternative homozygous |
VCF genotype |
./. |
0/0 |
0/1 |
1/1 |
Bits |
00 |
01 |
10 |
11 |
Decimal |
0 |
1 |
2 |
3 |
I.II Tri-allelic sequence variant (3 bits)
|
missing |
Reference homozygous |
Heterozygous |
Heterozygous |
Alternative homozygous |
Heterozygous |
Alternative homozygous |
VCF genotype |
./. |
0/0 |
0/1 |
0/2 |
1/1 |
1/2 |
2/2 |
Bits |
000 |
001 |
010 |
011 |
100 |
101 |
110 |
Decimal |
0 |
1 |
2 |
3 |
4 |
5 |
6 |
I.III Quad-allelic sequence variant (4 bits)
|
missing |
Reference homozygous |
Heterozygous |
Heterozygous |
Heterozygous |
Alternative homozygous |
Heterozygous |
VCF genotype |
./. |
0/0 |
0/1 |
0/2 |
0/3 |
1/1 |
1/2 |
Bits |
000 |
0001 |
0010 |
0011 |
0100 |
0101 |
0110 |
Decimal |
0 |
1 |
2 |
3 |
4 |
5 |
6 |
|
Heterozygous |
Alternative homozygous |
Heterozygous |
Alternative homozygous |
VCF genotype |
1/3 |
2/2 |
2/3 |
3/3 |
Bits |
0111 |
1000 |
1001 |
1010 |
Decimal |
7 |
8 |
9 |
10 |
II. Phased -genotypes
II.I Bi-allelic sequence variants (3 bits)
|
missing |
Reference homozygous |
Heterozygous |
Heterozygous |
Alternative homozygous |
VCF genotype |
.|. |
0|0 |
0|1 |
1|0 |
1|1 |
Bits |
000 |
001 |
010 |
011 |
100 |
Decimal |
0 |
1 |
2 |
3 |
4 |
II.II Tri-allelic sequence variant (4 bits)
|
missing |
Reference homozygous |
Heterozygous |
Heterozygous |
Heterozygous |
Heterozygous |
Alternative homozygous |
VCF genotype |
.|. |
0|0 |
0|1 |
1|0 |
0|2 |
2|0 |
1|1 |
Bits |
000 |
0001 |
0010 |
0011 |
0100 |
0101 |
0110 |
Decimal |
0 |
1 |
2 |
3 |
4 |
5 |
6 |
|
Heterozygous |
Heterozygous |
Alternative homozygous |
VCF genotype |
1|2 |
2|1 |
2|2 |
Bits |
0111 |
1000 |
1001 |
Decimal |
7 |
8 |
9 |
II.III Quad-allelic sequence variants (5 bits)
|
missing |
Reference homozygous |
Heterozygous |
Heterozygous |
Heterozygous |
Heterozygous |
Heterozygous |
VCF genotype |
.|. |
0|0 |
0|1 |
1|0 |
0|2 |
2|0 |
0|3 |
Bits |
00000 |
00001 |
00010 |
00011 |
00100 |
00101 |
00110 |
Decimal |
0 |
1 |
2 |
3 |
4 |
5 |
6 |
|
Heterozygous |
Alternative homozygous |
Heterozygous |
Heterozygous |
Heterozygous |
Heterozygous |
VCF genotype |
3|0 |
1|1 |
1|2 |
2|1 |
1|3 |
3|1 |
Bits |
00111 |
01000 |
01001 |
01010 |
01011 |
01100 |
Decimal |
7 |
8 |
9 |
10 |
11 |
12 |
|
Alternative homozygous |
Heterozygous |
Heterozygous |
Alternative homozygous |
VCF genotype |
2|2 |
2|3 |
3|2 |
3|3 |
Bits |
01101 |
01110 |
01111 |
10000 |
Decimal |
13 |
14 |
15 |
16 |
|
After genetic variants were inputed from a variety of data formats, KGGSeq provides a few optional quality control filters on individual genotype, genomic variant and sample phenotype.
By adopting one or a combination of these QCs before performing downstream prioritization/annotation, it could help the user to reduce false positives.
Typical information contained in individual genotype field usually covers read depth(--gty-dp), genotype quality (--gty-qual) and allelic balance ratio (--gty-af-ref, --gty-af-het and --gty-af-alt):
java -jar kggseq.jar --vcf-file path/to/file1 --indiv-pheno Indiv:1,Indiv:2 --gty-qual 10 --gty-dp 4 --gty-sec-pl 20 --gty-af-ref 0.05 --gty-af-het 0.25 --gty-af-alt 0.75 --o-vcf path/to/file2
They are in relationship of logical AND, and not exclusive of each other,user can specify one or any combination of these parameters at one time.
--gty-qual 10 |
Exclude genotypes with the minimal genotyping quality (Phred Quality Score) per genotype < 10. The default setting is --gty-qual 10. |
--gty-dp 4 |
Exclude genotypes with the minimal read depth per genotype < 4. The default setting is --gty-dp 4. |
--gty-sec-pl 20 |
Exclude genotypes with the second smallest normalized, Phred-scaled likelihoods for genotypes < 20. (otherwise, there would be confusing genotypes).The default setting is --gty-sec-pl 20. |
--gty-af-ref 0.05 |
Exclude genotypes with the fraction of the reads carrying alternative allele >= 5% at a reference-allele homozygous genotype, usually indicating by AF in a VCF file. The default setting is --gty-af-ref 0.05 . This tag can be disabled by --gty-af-ref 1 |
--gty-af-het 0.25 |
Exclude genotypes with the fraction of the reads carrying alternative allele <= 25% at a heterozygous genotype.
The default setting is --gty-af-het 0.25. This tag can be disabled by --gty-af-het 0 |
--gty-af-alt 0.75 |
Exclude variants with the fraction of the reads carrying alternative allele <= 50% at a alternative-allele homozygous genotype.
The default setting is --gty-af-alt 0.75. This tag can be disabled by --gty-af-alt 0 |
--gty-somat-p 0.05 |
Exclude genotypes at which the distribution of reads carrying reference and alternative alleles are not significantly (p>0.05) different between tumor tissues and non-tumor tissues. By default, the p value threshold is 0.05. |
Note The AF is prioritized in the quality control process if the both AF and Ad are provided in the VCF file as the AF is usually estimated using high quality reads
After running individual genotype QC, there will be two output files produced: one default Kggseq output stored in txt (or Excel by specifying "--excel"), in which the basic annotations for each clean variant are provided;
the other one is clean vcf file, in which the genotype field which does not meet the conditions specified will be replaced by missing value ./., hence not considered in following analysis.
##Key information provided in default basic annotation file:
"AffectedRefHomGtyNum", Number of affected individuals with reference homozygote at this variant;
"AffectedHetGtyNum", Number of affected individuals with heterozygote at this variant;
"AffectedAltHomGtyNum", Number of affected individuals with non-ref homozygote;
"UnaffectedRefHomGtyNum", Number of unaffected individuals with reference homozygote at this variant;
"UnaffectedHetGtyNum", Number of unaffected individuals with heterozygote at this variant;
"UnaffectedAltHomGtyNum", Number of unaffected individuals with non-ref homozygote;
Note The QC parameters in this section are designed for VCF input,which may not be suitable for other input formats.
Note All above quality control functions at genotypes in VCF format can be turned off by --no-qc.
Each genomic variant corresponds to one row in the input dataset. Typical information for one variant includes overall sequencing quality (--seq-qual), RMS Mapping Quality (--seq-mq),
and strand bias (--seq-sb or Phred-scaled p-value using Fisher's exact test --seq-fs). To make use of this information:
java -jar kggseq.jar --vcf-file path/to/file1 --indiv-pheno Indiv:1,Indiv:2 --vcf-filter-in PASS,VQSRTrancheSNP95.00to97.00,VQSRTrancheSNP97.00to99.00 --seq-qual 50 --seq-mq 20 --seq-sb -10 --seq-fs 60 --o-vcf path/to/file2
Interpretation of key parameters and their values:
Arguments/values |
Annotation |
--vcf-filter-in |
exlude the variants with "FILTER" not matching the specified QC labels in the VCF input file. e.g., --vcf-filter-in PASS,VQSRTrancheSNP90.00to93.00,VQSRTrancheSNP93.00to95.00,VQSRTrancheSNP95.00to97.00,VQSRTrancheSNP97.00to99.00 . By default, this tag is ignored. |
--seq-qual 50 |
set the minimum overall sequencing quality score (Phred Quality Score) for the variant at 50; |
--seq-mq 20 |
set the minimum overall mapping quality score (Phred Quality Score) for the variant at 20; |
--seq-sb -10 |
set the maximal overall strand bias score for the variant at -10 |
--seq-fs 60 |
set the maximal overall strand bias Phred-scaled p-value (using Fisher's exact test) for the variant at 60; |
Note All above quality control functions at variants in VCF format can be turned off by --no-qc.
In addition, KGGSeq can make use of phenotype information (specified by "--indiv-pheno" or "--ped-file") to filter genomic variants. This QC is quite useful when multiple samples are included in one input file, like trio or case/control study design.
java -jar kggseq.jar --vcf-file path/to/file1 --ped-file path/to/file2 --min-heta 1(or --min-homa 1) --min-hetu 1(or --min-homu1) --min-obsa 2(or --min-obsu 2) --min-obs 2 --o-vcf path/to/file3
Moreover, this function can work with options to flexibly set the sample size and missingness of genotypes. The following table list these options.
Arguments/values |
Annotation |
--min-heta 1 |
Set minimal observed number of heterozygotes genotypes in cases (the affected) as 1. |
--min-homa 1 |
Set minimal observed number of alternate homozygsote(s) genotypes in cases (the affected) as 1; --min-heta and --min-homa are logical OR relation. |
--min-hetu 1 |
Set minimal observed number of heterozygotes genotypes in controls (the unaffected) as 1. |
--min-homu 1 |
Set minimal observed number of alternate homozygsote(s) genotypes in controls(the unaffected) as 1; --min-hetu and --min-homu are logical OR relation. |
--min-obsa 2 |
Set minimal observed number of non-missing genotypes in cases (the affected) as 2. |
--min-obsu 2 |
Set minimal observed number of non-missing genotypes in controls(the unaffected) as 2; --min-obsa and --min-obsu are logical AND relation. |
--min-obs 2 |
Set minimal observed number of non-missing genotypes in all samples as 2. |
--hwe-control 0.01 |
Exclude variants in controls with the Hardy-Weinberg test p value <= 0.01. By default, this tag is disabled. |
--hwe-case 0.01 |
Exclude variants in cases with the Hardy-Weinberg test p value <= 0.01. By default, this tag is disabled. |
--hwe-all 0.01 |
Exclude variants in all subjects with the Hardy-Weinberg test p value <= 0.01. By default, this tag is disabled. |
After running individual genotype QC, there will be two output files produced: one default Kggseq output stored in txt (or Excel by specifying "--excel"), in which the basic annotations for each clean variant are provided (see above section);
the other one is clean vcf file, in which the variant row which does not meet the conditions specified will be removed completely, hence not considered in following analysis.
Note The genomic variant QC is also available for non-vcf type of input, by using type of variants, or genomic position only.
KGGSeq provides two set of functions to facilitate quality assessment of sequencing data of each subject.
Firstly, you can use KGGSeq to convert genotypes from VCF format to Plink binary genotype format (see Outputs part) so that routine sample QC (pairwise relatedness, mendel error rate and sex check) in genome-wide association study (GWAS) analysis can be performed quickly. The samples with unexpected relationship or error rate are suggested to be removed.
java -jar kggseq.jar --vcf-file path-to-file1 --ped-file path/to/file2 <Variants and Genotype QC Settings> --db-merge <Reference genotype set> --o-plink-bed
plink --bfile kggseq.merged --genome
The following is the reference genotypes integrated by KGGSeq: (KGGSeq will automatically download it for you according to the set name specified by --db-merge )
Reference genotype set |
Description |
hapmap2.r22.ceu.hg19 |
Haplotypes of Hapmap 2 release 22. Convert the coordinates to be hg19 from hg18 by UCSC lift over function. Complied from
here.
|
hapmap2.r22.chbjpt.hg19 |
|
hapmap2.r22.yri.hg19 |
|
hapmap3.r2.ceu.hg19 |
Haplotpyes of Hapmap 3 release 2. Convert the coordinates to be hg19 from hg18 by UCSC lift over function. Compiled from
here.
|
hapmap3.r2.chbjpt.hg19 |
|
hapmap3.r2.mex.hg19 |
|
hapmap3.r2.tsi.hg19 |
|
hapmap3.r2.yri.hg19 |
|
1kg.phase1.v3.shapeit2.asn.hg19 |
Haplotpyes of 1000 Genomes Project version 3. Donwload from
here.
|
1kg.phase1.shapeit2.v3.eur.hg19 |
|
1kg.phase1.shapeit2.v3.afr.hg19 |
|
1kg.phase1.v3.shapeit2.amr.hg19 |
|
1kg.phase3.v5.shapeit2.eur.hg19 |
Haplotpyes of 1000 Genomes Project Phase3 of version 5. Donwload from
here.
|
1kg.phase3.v5.shapeit2.afr.hg19 |
|
1kg.phase3.v5.shapeit2.amr.hg19 |
|
1kg.phase3.v5.shapeit2.sas.hg19 |
|
1kg.phase3.v5.shapeit2.eas.hg19 |
|
We acknowledge the complied VCF data of 1000 Genomes projects by the author of MACH, Dr. LI Yun. To see detailed the description about the data,
please visit http://www.sph.umich.edu/csg/abecasis/MACH/download/1000G.2012-03-14.html |
An unexpectedly high pair-wise relatedness (denoted by the PI_HAT of plink) suggests cross-individual contamination
Note To estimate the relatedness (the PI_HAT), you normally do NOT need the genotypes datasets from 1000 Genome Projects (which will require rather large memory). The merging with ancestry-matched HapMap genotype data is often sufficient for an accurate estimation of relatedness. However, the estimation of PI_HAT without reference genotypes in a small local sample will result in an over-estimated value because of the inaccurate allele frequencies derived from the tiny sample.
plink --bfile kggseq --mendel
plink --bfile kggseq --check-sex
Secondly, we provide plotting function (see Plotting part) to visualization the distribution of minor alleles per individual. The outlier from the distribution of total rare variants (MAF < 0.01, or novel in the common database) across all individuals can be easily detected and then ignored in the following analysis.
java -jar kggseq.jar --vcf-file path-to-file1 --out path/to/prefixname --mafplot
java -jar kggseq.jar --vcf-file path-to-file1 --ped-file path/to/file2 --db-filter hg18_dbsnp138,hg18_dbsnp141 --allele-freq 0,0.01 --o-vcf path/to/file3
Note The meanings for -db-filter and --allele-freq in this section are provied in "Allele Frequency Trimming" section.
|
KGGSeq offers simple but potentially powerful strategies to variants filtering and prioritization, which help users isolate most promising disease causal candiate variants from sequence variants to be considered.Users could prioritize variants based on some or all of the following evidences and/or functions: Genetic inheritance sharing, gene feature, allele frequency, biological function prediction, sequence variant type, and phenotype-relevant filtering.
Inheritance pattern filter: --genotype-filter
java -jar ./kggseq.jar --buildver hg19 --vcf-file path/to/file1 --ped-file path/to/file2 --genotype-filter 1,2
Filter out variants for which their genotypes are not consistent with the assumption of disease inheritance pattern.
Functions and applicable models for each code of '--genotype-filter'options are listed below:
Code |
Function |
Applicable model |
1 |
Exclude variants at which affected subjects have heterozygous genotypes. |
Recessive |
2 |
Exclude variants at which both affected and unaffected subjects have the same homozygous genotypes. |
Recessive and full penetrance causal mutation(s) in the sample; compound-heterozygosity |
3 |
Exclude variants at which affected subjects have reference homozygous genotypes. |
Dominant without consanguineous mating |
4 |
Exclude variants at which both affected and unaffected subjects have the same heterozygous genotypes. |
Dominant with full penetrance causal mutation(s) in the sample |
5 |
Exclude variants at which affected subjects have alternative homozygous genotypes. |
Dominant without heterogeneity |
6 |
Exclude variants at which affected family members have NO shared alleles. |
Full penetrance causal mutation(s) in the sample |
7 |
ONLY include variants at which an offspring has one or two non-inherited alleles.
Note
Used as --genotype-filter 4,7 to filter out the de novo mutations where both affected and unaffected subjects have the same heterozygous genotypes.
Note Used as --ignore-homo to ignore the de novo mutations where affected and unaffected subjects have the same ALTERNATIVE homozygous genotypes.
|
Detect de novo mutations which have nearly full penetrance.
In the main output file, there is a column named DenovoMutationEvent to record the genotypes of a child and his or her parents.
Example: N140_0:0/1:46,59&N140_1:0/0:57,0&N140_2:0/0:68,0
The child N140_0 has genotype 0/1 with 46 and 59 reads carrying reference alleles and alternative alleles respectively. The father N140_1 and mother N140_2 are homozygous 0/0.
|
8 |
ONLY include variants at which 1) the paired non-tumor (unaffected) and tumor (affected) samples have different genotypes.
Note The tag --indiv-pair has to be used together to specify the matched non-tumor and tumor samples, e.g. --indiv-pair tumorIndivID1:normalIndivID1,tumorIndivID2:normalIndivID2
|
Detect somatic mutation(s) among matched tumor and non-tumor samples. |
Note Multiple filtration codes have logical OR relationship.
Note The inheritance mode based filtration is proposed under strong assumptions for rare Mendelian disease with clear inheritance mode. If the inheritance mode is elusive, such filtration is not suggested as it may lead to the miss of genuine causal mutation(s). This is particularly true when one has no sufficient information to distinguish the compound-heterozygosity diseases from the recessive diseases.
Extract mutations affecting both copies of a gene in a diploid: --double-hit-gene-trio-filter or --double-hit-gene-phased-filter
java -jar ./kggseq.jar --buildver hg19 --vcf-file path/to/file1 --ped-file path/to/file2 --db-gene refgene --gene-feature-in 0,1,2,3,4,5,6 --db-filter 1kg201204,dbsnp137,ESP6500AA,ESP6500EA --rare-allele-freq 0.033 --db-score dbnsfp --mendel-causing-predict all --double-hit-gene-trio-filter (OR --double-hit-gene-phased-filter)
This function is designed for a disorder suspected to be under compound-heterozygous or recessive inheritance mode, in which both copies of a gene on the two homologous chromosomes of a patient are damaged by two different mutations or the same mutation. For recessive mode, it simply checks variants with homozygous genotypes in patients. For the compound-heterozygous mode, it can use two different input data, phased genotypes of a patient or unphased genotypes in a trio. Here the trio refers to the two parents and an offspring. When these alleles causing a disease at one locus, it follows the recessive model; and when they are at two loci, it follows the compound-heterozygosity model. In both cases, a gene is hit twice. This is the reason why it has the name 'double-hit gene'.
Besides the file storing prioritized sequence variants, this analysis with --double-hit-gene-trio-filter option will lead to a list of double-hit genes in at least one subject. The counts of double-hit genes of each subject are saved in *.doublehit.gene.trios.flt.count.xls (or *.doublehit.gene.trios.flt.count.txt) and the genotypes of involved sequence variants are saved in *.doublehit.gene.trios.flt.gty.xls (or *.doublehit.gene.trios.flt.gty.txt). The corresponding files produced by the analysis with --double-hit-gene-phased-filter option are *.doublehit.gene.phased.flt.count.xls (or *.doublehit.gene.phased.flt.count.txt) and *.doublehit.gene.phased.flt.gty.xls (or *.doublehit.gene.phased.flt.gty.txt).
Note The --double-hit-gene-trio-filter or --double-hit-gene-phased-filter only works for inputs specified by --vcf-file and requires parents' genotypes to infer the phases of the alleles. And sequence variants with missed parents' genotypes are ignored.
Note To ignore homozygous mutation (i.e., implicating the autosomal recessive model), please add the tag --ignore-homo.
Extract a gene on which EVERY affected subject has at least one case-unique alternative allele. But these alleles may be from different variants: --unique-gene-filter
java -jar ./kggseq.jar --buildver hg19 --vcf-file path/to/file1 --ped-file path/to/file2 --db-filter 1kg201204,dbsnp137,ESP6500AA,ESP6500EA --rare-allele-freq 0.02 --db-score dbnsfp --mendel-causing-predict all --unique-gene-filter
Sequence variants beyond this type of genes will be filtered out.
Note --unique-gene-filter only works for inputs specified by --vcf-file.
Filter sequence variants by Identical by state (IBS): --ibs-case-filter X
java -jar ./kggseq.jar --buildver hg19 --vcf-file path/to/file1 --ped-file path/to/file2 --ibs-case-filter X
Search the longest region (in minimal length X kb) of each interesting variant, in which there is at least one allele identical among all cases. This function can be used to highlight potential causal mutations or exclude non-disease-relevant sequence variants as the causal mutation(s) is very likely to have long shared regions among all affected family members of Mendelian dominant and recessive diseases. However, IBS checking is just a quick but over-estimation of identical by descent (IBD), which is a more accurate measure of shared regions among related individuals. KGGSeq currently has no function to estimate IBD regions but allow users to input the IBD information estimated by third-party tools into KGGSeq for annotation.
Note --ibs-case-filter only works for inputs specified by --vcf-file.
Identical by descent (IBD) annotation: --ibd-annot
java -jar ./kggseq.jar --buildver hg19 --vcf-file path/to/file1 --ped-file path/to/file2 --ibd-annot path/to/file3
User can also provide IBD (i.e., the two alleles arose from the same allele in an earlier generation) or significant linkage regions, which are estimated by a third party statistical genetics tools (such as BEAGLE, Merlin and SimWalk2). Variants within these regions will be highlighted. For an ibd or linkage annotation file, title line is needed, and the file format is like:
CHR START END
1 6134174 10082841
1 202855724 207862411
Note Title line is needed.
Filter sequence variants by homozygosity: --homozygosity-case-filter X
java -jar ./kggseq.jar --buildver hg19 --vcf-file path/to/file1 --ped-file path/to/file2 --homozygosity-case-filter X
Search the longest region (in minimal length X kb) of consecutive homozygous genotype for each interesting variant with minimal length X kb in cases but not in controls. For cancer diseases, this function can be used to examine the loss of heterozygosity straightforwardly. For rare recessive diseases, it can be used to examine the runs of homozygosity straightforwardly. But KGGSeq currently does not have methods to test the statistical significance.
Note --homozygosity-case-filter only works for inputs specified by --vcf-file.
Filtration according to gene features: --db-gene & --gene-feature-in
java -jar ./kggseq.jar --buildver hg19 --vcf-file path/to/file1 --ped-file path/to/file2 --db-gene refgene,gencode,knowngene --gene-feature-in 0,1,2,3,4,5,6
The '--dbgene' sets database(s) to annotate and filter variants.
Gene database label |
Supported reference genome |
Description |
refgene |
hg18 and hg19 |
The RefGene datasase compiled by UCSC from hg18 refGene and hg19 refGene
Note: RefSeq has NO mitochondria gene definition. |
gencode |
hg19 |
The GENCODE gene sets.See home page and the paper of GENCODE for details.
Note: GECODE contains similar number of coding genes but more transcripts than RefGene. It HAS the mitochondria gene definition. |
knowngene |
hg18 and hg19 |
The UCSC knonwGene datasase compiled by UCSC from hg18 knownGene and hg19 knownGene
|
ensembl |
hg19 |
The Ensembl gene datasase compiled by UCSC from hg19 ensGene
|
With this filter, variants beyond '--gene-feature-in' region will be excluded. It is '--gene-feature-in 0,1,2,3,4,5...,15' by default.
Illustration of number codes for the gene features:
Feature |
Code |
Explanation |
Frameshift |
0 |
Short insertion or deletion result in a completely different translation from the original. |
Nonframeshift |
1 |
Short insertion or deletion result in loss of amino acids in the translated proteins. |
Startloss |
2 |
Indels or nucleotide substitution result in the loss of start codon(ATG) (mutated into a non-start codon). |
Stoploss |
3 |
Indels or nucleotide substitution result in the loss of stop codons (TAG, TAA, TGA) |
Stopgain |
4 |
Indels or nucleotide substitution result in the new stop codons (TAG, TAA, TGA), which may truncate the protein. |
Splicing |
5 |
variant is within 2-bp of a splicing junction (use --splicing x to change this, the unit of x is base-pair) |
Missense |
6 |
Variants result in a codon coding for a different amino acid (missense) |
Synonymous |
7 |
Nucleotide substitution does not change amino acid. |
Exonic |
8 |
Due to loss of sequences, only map a variant into exonic region |
UTR5 |
9 |
variant within a 5' untranslated region |
UTR3 |
10 |
variant within a 3' untranslated region |
Intronic |
11 |
Variants within an intron |
Upstream |
12 |
variant overlaps 1-kb region upstream of transcription start site? (use --neargene x to change this, the unit of x is base-pair) |
Downstream |
13 |
variant overlaps 1-kb region downtream of transcription end site (use --neargene x to change this, the unit of x is base-pair) |
ncRNA |
14 |
variant overlaps a transcript without coding annotation in the gene definition (see Notes below for more explanation) |
Intergenic |
15 |
variant is in intergenic region |
Monomorphic |
16 |
It is not a sequence variation actually, which may be resulted from bugs in reference genome in variant calling. |
Unknown |
17 |
Variants has no annotation. |
The output of this function is a list of variables providing all involved gene features of a considered variant.
MostImportantFeatureGene: The gene corresponds to the variant’s smallest gene feature code.
MostImportantGeneFeature: The variant’s smallest gene feature code.
RefGeneFeatures: All matched gene features in the reference gene database.
GENCODEFeatures: All matched gene features in the GENCODE gene database.
The gene feature annotations are described in accordance with the Human Genome Variation Society (HGVS) recommendations. The followings are some examples:
KLHL17:NM_198317:c.1918A>C:p.T640P:(12Exons):exon12:missense
Means: At nucleotide 76 of gene KLHL17's transcript NM_198317, an A is changed to a C, which changes the 640th amino-acid T to be P in this transcript's protein. This transcript has 12 exons in total and this variant is in the 12th exon.
DNAJC8:NM_014280:c.237+2T>G:(9Exons):exon3GTdonor
Means:This is a T to G substitution of the 2nd nucleotide of the 2 nd intron (from its 5' part) positioned between coding DNA nucleotides 237 and 238, which will affect the GT splicing donor.
HNRNPCL2:NM_001136561:c.-169C>G:(2Exons):5UTR
Means: This is a C to G substitution of nucleotide, 169 nucleotides upstream of the ATG translation initiation codon (coding DNA -169).
PRAMEF4:NM_001009611:c.*62C>T:(4Exons):3UTR
Means: This is a C to T substitution of nucleotide coding DNA 62, located in the 3' UTR.
C1orf159:NM_017891:c.311-23T>C:(10Exons):intronic6
Means: This is a T to C substitution of the 23 nucleotide of the 6 th intron (from its 3' part) positioned between coding DNA nucleotides 310 and 311.
HNRNPCL1:NM_001013631:c.*138G>A:(2Exons):downstream
Means: This is a C to A substitution of nucleotide located 293 nucleotides downstream of the gene, i.e. the polyA-addition site.
TNFRSF18:NM_148901:c.-61C>T:(4Exons):upstream
Means: This is a C to T substitution of nucleotide, 61 nucleotides upstream of the ATG translation initiation codon (coding DNA -61).
1. Allele frequency trimming according to database: --db-filter
java -jar ./kggseq.jar --buildver hg19 --vcf-file path/to/file1 --ped-file path/to/file2 --db-filter 1kg201204,dbsnp141,ESP6500AA,ESP6500EA --rare-allele-freq 0.005 --db-filter-hard dbsnp138nf
This mode enables variants filtering according to allele frequencies recorded in public data sources.
The '--db-filter' is use to set databases for filtration. In the above case, variants with allele alternative allele frequency EQUAL to or over 0.005 will be excluded (not including the boundary). No spaces are allowed between and within database names.
Moreover, '--db-filter-hard' is use to filter ALL existing variants in a database. But it should be always careful to use this hard filtering.
List of public databases for allele frequency trimming
Genome Build |
DB Label |
Explanation |
hg18 |
1kg201305 |
All sample of 1000 Genomes Project release 2013 May (over 81 million sequence variants from 2504 subjects) |
hg18 |
1kgasn2010 |
1000 Genomes Project Pilot Data SNPs + indels (2010 March call set) for ASN (asian) |
hg18 |
1kgeur2010 |
1000 Genomes Project Pilot Data SNPs + indels (2010 March call set) for EUR (european) |
hg18 |
1kgafr2010 |
1000 Genomes Project Pilot Data SNPs + indels (2010 March call set) for AFR (african) |
hg18 |
1kgasn2009 |
1000 Genomes Project Pilot Data SNPs (2009 April call set) for ASN (asian) |
hg18 |
1kgeur2009 |
1000 Genomes Project Pilot Data SNPs (2009 April call set) for EUR (european) |
hg18 |
1kgafr2009 |
1000 Genomes Project Pilot Data SNPs (2009 April call set) for AFR (african) |
hg18 |
dbsnp137 |
dbSNP version 137,compiled from UCSC with coordinates converted by liftover. |
hg18 |
dbsnp138 |
dbSNP version 138, compiled from UCSC with coordinates converted by liftover. |
hg18 |
dbsnp138nf |
dbSNP version 138 without the flagged SNPs by UCSC (so nonflagged: nf). In UCSC the flagged SNPs include SNPs clinically associated by dbSNP, mapped to a single location in the reference genome assembly, and not known to have a minor allele frequency of at least 1%. The coordinates were converted from hg19 to hg18 by liftover. |
hg18 |
dbsnp141 |
dbSNP version 141, compiled from NCBI dbSNP V141. Note Indels are ingored. |
hg18 |
ESP5400 |
a public variants dataset from NHLBI GO Exome Sequencing Project (ESP) |
hg19 |
1kg201305 |
All sample of 1000 Genomes Project release 2013 May (over 81 million sequence variants from 2504 subjects) |
hg19 |
1kg201204 |
All sample of 1000 Genomes Project release 2012 April |
hg19 |
1kgafr201204 |
African descendent samples of 1000 Genomes Project release in 2012 April |
hg19 |
1kgeur201204 |
European descendent samples of 1000 Genomes Project release in 2012 April |
hg19 |
1kgamr201204 |
American descendent samples of 1000 Genomes Project release in 2012 April |
hg19 |
1kgasn201204 |
Asian descendent samples of 1000 Genomes Project release in 2012 April |
hg19 |
dbsnp135 |
dbSNP version 135, compiled from UCSC. |
hg19 |
dbsnp137 |
dbSNP version 137, compiled from UCSC. |
hg19 |
dbsnp138 |
dbSNP version 138, compiled from UCSC. |
hg19 |
dbsnp138nf |
dbSNP version 138 without the flagged SNPs by UCSC (so nonflagged: nf). In UCSC the flagged SNPs include SNPs clinically associated by dbSNP, mapped to a single location in the reference genome assembly, and not known to have a minor allele frequency of at least 1%. |
hg19 |
dbsnp141 |
dbSNP version 141, compiled from NCBI dbSNP V141. Note Indels are ingored. |
hg19 |
ESP5400 |
a public variants dataset from NHLBI GO Exome Sequencing Project (ESP) |
hg19 |
ESP6500AA |
a public variants dataset of African American from NHLBI GO Exome Sequencing Project (ESP) |
hg19 |
ESP6500EA |
a public variants dataset of European American from NHLBI GO Exome Sequencing Project (ESP) |
hg19 |
exac |
Exome Aggregation Consortium (ExAC): Variants from 61,486 unrelated individuals sequenced as part of various disease-specific and population genetic studies (ESP) |
The output of this function is a list of variables providing all involved gene features of a considered variant.
MaxDBAltAF: The maximum alternative allele frequency across datasets specified.
altFreq@hgXX_XXXXX: The alternative allele frequency at a dataset hgXX_XXXXX.
NoteVariants without allele frequencies in reference databases are marked by '.' in the results. Moreover, Variants not EXISTING in reference database are marked by 'N'.
2. Users can also provide local filters to filter variants. --local-filter
java -jar ./kggseq.jar --buildver hg19 --vcf-file path/to/file1 --ped-file path/to/file2 --local-filter path/to/file3 --rare-allele-freq 0.005 --local-filter-hard path/to/file4
Moreover, '--local-filter-hard' is use to filter ALL existing variants in a database.
The '--local-filter' option sets local datasets for filterating. KGGSeq accepts local dataset files in the following format: each file has 5 columns [Chromosome, Physical Position, Reference Allele, Alternative Allele(s), and Frequency of alternative allele(s)], separated by tab character. KGGSeq provides a function to convert data in VCF format to that in the local dataset format:
java -jar ./kggseq.jar --make-filter --vcf-file path/to/file --buildver-in hg18 --out test.hg18.var --buildver-out hg19
An example file is like:
Chrom Position Reference Alternative Freq
1 469 C G 0.150
1 492 C T 0.175
1 519 G C 0.067
1 874290 - 0ACAGAG 0.809
1 875913 CAG 3 0.8431
Note The title line is needed and Freq is optional.
3. An in-house VCF file can be used for allele frequency trimming: --local-filter-vcf
java -jar ./kggseq.jar --buildver hg19 --vcf-file path/to/file1 --ped-file path/to/file2 --local-filter-vcf path/to/file3 --rare-allele-freq 0.005 --local-filter-vcf-hard path/to/file4
If the data were stored in different files chromosome by chromosome, you can use ' _CHROM_' to denote the chromosome names [1...Y] or directly specify [1,2,X] in the file name so that kggseq can read data in multiple files in a run.
Similarly, '--local-vcf-filter-hard' is use to filter ALL existing variants in a database.
4. An in-house VCF file without genotypes can also be used for allele frequency trimming: --local-filter-no-gty-vcf
java -jar ./kggseq.jar --buildver hg19 --vcf-file path/to/file1 --ped-file path/to/file2 --local-filter-no-gty-vcf path/to/file3 --rare-allele-freq 0.005 --local-filter-no-gty-vcf-hard path/to/file4
Similarly, '--local-filter-no-gty-vcf-hard' is use to filter ALL existing variants in a database.
Note1. To remove variants in database regardless of their frequencies, please use '--rare-allele-freq 0'.
Note2. To keep variants in a frequency range,[a,b], please use '--allele-freq a,b' option. The boundaries a and b are included.
Note3. The options, '--allele-freq' and '--rare-allele-freq', are mutually exclusive and the latter has higher priority.
Prioritize according to functional impact scores: --db-score
java -jar ./kggseq.jar --buildver hg19 --vcf-file path/to/file1 --ped-file path/to/file2 --db-score dbnsfp
Annotate the non-synonymous sequence variants (i.e., those alter proteins) by publicly avaiable functional impact scores. Currently KGGSeq only support the dataset of dbNSFP, an integrated database of multiple algorithms for the comprehensive collection of human of deleteriousness/functional predictions at non-synonymous sequence variants of human genome. These scores will finally be combined by Logistic regression method to estimate the posterior probability of being a pathogenic variants.
And the --filter-nondisease-variant can be used to jointly filterd out variant predicted to be non-disease causal variant.
Ingore insertion and deletion sequence variants (indels): --ignore-indel
java -jar ./kggseq.jar --buildver hg19 --vcf-file path/to/file1 --ped-file path/to/file2 --ignore-indel
Accordingly, users can also ask kggseq to ingore Single nucleotide variants (SNVs) : --ignore-snv
java -jar ./kggseq.jar --buildver hg19 --vcf-file path/to/file1 --ped-file path/to/file2 --ignore-snv
By default, KGGSeq will process both SNV and Indel.
Ignore some genomic regions: --regions-out
java -jar ./kggseq.jar --buildver hg19 --vcf-file path/to/file1 --ped-file path/to/file2 --regions-out chrX,chrY:1-10000
Alternatively, one can also exclusively process sequence variants at some genome regions:
java -jar ./kggseq.jar --buildver hg19 --vcf-file path/to/file1 --ped-file path/to/file2 --regions-in chr2,chr4:21212-233454
Ignore variants within some genes: --genes-out
java -jar ./kggseq.jar --buildver hg19 --vcf-file path/to/file1 --ped-file path/to/file2 --genes-out TCF4,CNNM2,ANK3
Alternatively, one can KEEP variants only within some genes:
java -jar ./kggseq.jar --buildver hg19 --vcf-file path/to/file1 --ped-file path/to/file2 --genes-in TCF4,CNNM2,ANK3
Filter out variants in super duplicate regions: --superdup-filter
java -jar ./kggseq.jar --buildver hg19 --vcf-file path/to/file1 --ped-file path/to/file2 --superdup-filter
KGGSeq can eliminate sequence variants within putative super-duplicate genomic regions defined in a dataset (genomicSuperDups) from UCSC datasets , which have higher genotyping error rate. (See more ).
Note Anoher option --superdup-annot allows you to just mark (but not to remove) the variants in the super duplicate regions. A column named SuperDupKValue will be added in the resulting dataset
Filter out genes with too many putative pathogenic variants: --gene-var-filter
java -jar ./kggseq.jar --buildver hg19 --vcf-file path/to/file1 --ped-file path/to/file2 [options to filter out common and neutral variants] --gene-var-filter 4
After a number of hard-filtering by allele frequencies and pathogenic prediction, you may see some genes harboring candidate variants. Empirically, these genes are usually not interesting and their mutations are often produced by some unknown factors such pseudogene or platform bias. As a rule of thumb, it is safe to set a cutoff 4 or more.
Filter for case or control unique variants: --filter-model case-unique or --filter-model control-unique
java -jar ./kggseq.jar --buildver hg19 --vcf-file path/to/file1 --ped-file path/to/file2 --filter-model case-unique
Case unique: extract variants with alternative alleles only present among affected individuals.
Control unique: extract variants with alternative alleles only present among unaffected individuals.
Note The --filter-model only works for inputs specified by --vcf-file and --v-assoc-file.
Note This function can work with options to flexibly set the sample size and missingness of genotypes, detalied in the Variant QC part.
Filtration by statistical association: --filter-model association
java -jar ./kggseq.jar --buildver hg19 --filter-model association --v-assoc-file path/to/file3 --p-value-cutoff 0.01 --multiple-testing benfdr --qqplot Exclude variants which are not interesting in terms of statistical p-values for association calculated by PLINK/Seq . The '--p-value-cutoff' sets p-value cutoffs for filtration; the '--multiple-testing' sets the approaches for multiple testing comparisons; and the '--qqplot' enables Q-Q plot of the association results.
Multiple testing method |
Description |
no |
No multiple testing methods. Set a fixed p value cutoff by --p-value-cutoff. |
benfdr |
Benjamini-Hochberg method, which aims to control the FDR (specified by --p-value-cutoff), or "False Discovery Rate," rather than the FWE or "FamilyWise Error rate". |
bonhol |
Bonferroni-Holm method, which is less conservative and more powerful than the Standard Bonferroni method given a family-wise error rate specified by --p-value-cutoff. |
bonf |
Standard Bonferroni correction given a family-wise error rate specified by --p-value-cutoff. |
Note This functions also supports inputs specified by --vcf-file, --assoc-file, which is the output of PLINK/Seq gene-based association analysis result, and --vaast-simple-file, which is the output of VAAST gene-based association analysis result.
|
Provide genomic functional annotation of variants: --genome-annot
java -jar kggseq.jar --vcf-file path/to/file1 --ped-file path/to/file2 --genome-annot
Currently supported functional annotation categories are Pseudogenes, transcription factor binding sites (TFBSs), enhancers, and UniProt Feature. The option will append three extra fields to the output file:
Pseudogenes: Pseudogenes listed in http://tables.pseudogene.org/set.py?id=Human61
DispensableDeleteriousVariants: Number of dispensable deleterious variants for the gene
in the 1000 Genomes Project
TFBSconsSite[name:rawScore:zScore]: Conserved TFBSs in the UCSC genome browser
vistaEnhancer[name:+ve/-ve]: Known enhancers in the VISTA enhancer browser
UniProtFeature: Annotate a variant of coding gene using the UniProt protein annotations.
UniProtFeature |
Definition |
region of interest |
Extent of a region of interest in the sequence |
active site |
Amino acid(s) involved in the activity of an enzyme. |
calcium-binding region |
Extent of a calcium-binding region. |
glycosylation site |
Glycosylation site. |
chain |
Extent of a polypeptide chain in the mature protein. |
coiled-coil region |
Extent of a coiled-coil region |
compositionally biased region |
Extent of a compositionally biased region |
sequence conflict |
Different papers report differing sequences. |
cross-link |
Posttranslationally formed amino acid bonds. |
disulfide bond |
Disulfide bond. |
DNA-binding region |
Extent of a DNA-binding region. |
domain |
Extent of a domain, which is defined as a specific |
helix |
DSSP secondary structure |
intramembrane region |
|
initiator methionine |
Initiator methionine. |
modified residue |
|
lipid moiety-binding region |
|
metal ion-binding site |
Binding site for a metal ion. |
short sequence motif |
Short (up to 20 amino acids) sequence motif of biological interest |
mutagenesis site |
Site which has been experimentally altered. |
non-consecutive residues |
Non-consecutive residues. |
non-standard amino acid |
Non-standard aminoacid used for pyrrolysine |
non-terminal residue |
|
nucleotide phosphate-binding region |
Extent of a nucleotide phosphate binding region. |
peptide |
Extent of a released active peptide. |
propeptide |
Extent of a propeptide. |
repeat |
Extent of an internal sequence repetition. |
signal peptide |
Extent of a signal sequence (prepeptide). |
site |
Any interesting single amino-acid site on the sequence. |
strand |
DSSP secondary structure |
transit peptide |
|
topological domain |
Topological domain |
transmembrane region |
Extent of a transmembrane region. |
turn |
DSSP secondary structure |
unsure residue |
Uncertainties in the sequence |
sequence variant |
Authors report that sequence variants exist. |
splice variant |
Description of sequence variants produced by alternative |
zinc finger region |
Extent of a zinc finger region. |
binding site |
|
Note Users should be aware of possible mapping error due to these pseudogenes.
Examine whether variants share the same PPIs with some genes of interest: --candi-list gene1,gene2,...,geneN --ppi-annot ppiDatabase
java -jar kggseq.jar --vcf-file path/to/file1 --ped-file path/to/file2 --candi-list gene1,gene2,...,geneN --ppi-annot string --ppi-depth 1
Note Currently, the only supported PPI database is STRING (i.e., string).
Note Instead of specifying the gene list using the --candi-list option, users can specify a file containing the list using the --candi-file option.
Note The maximum distance of a PPI between a specifified candidate genes and a gene containing the potential causal variants can be adjusted using the --ppi-depth option. The defaul is --ppi-depth 1.
The option will append the following fields to the output file:
IsWithinCandidateGene: Whether the variant is within a candidate gene
PPI: PPI shared by at least one gene of interest and the gene
containing the variant
Examine whether variants share the same pathways with some genes of interest: --candi-list gene1,gene2,...,geneN --pathway-annot pathwayDatabase
java -jar kggseq.jar --vcf-file path/to/file1 --ped-file path/to/file2 --candi-list gene1,gene2,...,geneN --pathway-annot PathwayDatabase
Currently, users can choose between 5 pathway databases from GSEA. The following are brief descriptions of the 5 datasets COPIED from the page of GSEA.
PathwayDatabase |
Description |
Version |
cano |
Gene sets from the pathway databases. Usually, these gene sets are canonical representations of a biological process compiled by domain experts. (1452 gene sets) |
MSigDB 3.1 |
cura |
Curated gene sets: Gene sets collected from various sources such as online pathway databases, publications in PubMed, and knowledge of domain experts. The gene set page for each gene set lists its source. (4850 gene sets) |
MSigDB 3.1 |
onco |
Oncogenic signatures: Gene sets represent signatures of cellular pathways which are often dis-regulated in cancer. The majority of signatures were generated directly from microarray data from NCBI GEO or from internal unpublished profiling experiments which involved perturbation of known cancer genes. In addition, a small number of oncogenic signatures were curated from scientific publications. (189 gene sets) |
MSigDB 3.1 |
cmop |
Computational gene sets: Computational gene sets defined by mining large collections of cancer-oriented microarray data. (858 gene sets) |
MSigDB 3.1 |
onto |
GO gene sets Gene sets are named by GO term and contain genes annotated by that term. GSEA users: Gene set enrichment analysis identifies gene sets consisting of co-regulated genes; GO gene sets are based on ontologies and do not necessarily comprise co-regulated genes. (1454 gene sets) |
MSigDB 3.1 |
Note The --candi-list and --candi-file options can be shared with --ppi-annot.
The option will append the following fields to the output file:
IsWithinCandidateGene: Whether the variant is within a candidate gene
SharedPathway: Pathway shared by at least one gene of interest and the gene
containing the variant
PubMed text mining:--pubmed-mining searchTerm or --pubmed-mining-gene searchTerm
java -jar kggseq.jar --vcf-file path/to/file1 --ped-file path/to/file2 --pubmed-mining search+Term1,search+Term2
KGGSeq can mine the titles and abstracts of published papers in PubMed via the NCBI E-utilities to find a co-mention with the searched term(s) of interest (i.e., disease) and the genes/cytogeneic regions in which the
variants are located. The option --pubmed-mining will append the following two fields to the output file while the --pubmed-mining-gene will only add the PubMedIDGene information.
PubMedIDIdeogram : PubMed ID of articles in which the term and the cytogeneic
position of the variant are co-mentioned
PubMedIDGene : PubMed ID of articles in which the term and the gene
containing the variant are co-mentioned
IMPORTANT Please use + instead of blank characters within a search term.
OMIM term annotation:--omim-annot
java -jar kggseq.jar --vcf-file path/to/file1 --ped-file path/to/file2 --omim-annot
KGGSeq can extract all available disorder names linked to a gene in which the
variants are located from the OMIM dataset morbidmap. The command will append two extra fields to the output file:
DiseaseName(s)MIMid : Disorder, <disorder MIM no.> (<phene mapping key>)
Phenotype mapping method <phene mapping key>:
1 - the disorder is placed on the map based on its association with
a gene, but the underlying defect is not known.
2 - the disorder has been placed on the map by linkage; no mutation has
been found.
3 - the molecular basis for the disorder is known; a mutation has been
found in the gene.
4 - a contiguous gene deletion or duplication syndrome, multiple genes
are deleted or duplicated causing the phenotype.
GeneMIMid : Gene/locus MIM no.
COSMIC canncer information: --cosmic-annot
java -jar kggseq.jar --vcf-file path/to/file1 --ped-file path/to/file2 --indiv-pair NONTUMOR.1:TUMOR.1,NONTUMOR.2:TUMOR.2 --genotype-filter 8 --cosmic-annot
KGGSeq can retrieve the confirmed somatic mutation information registered for variants and genes in the COSMIC database.
The command will append a field to the output file. e.g.,
COSMICCancerInfo : known cancer and occurrence time of the somatic mutation in COSMIC dataset. e.g., {serous_carcinoma=3, astrocytoma_Grade_IV=1}
This somatic mutation occurred 3 times in samples of 'serous_carcinoma' and once in samples of 'astrocytoma_Grade_IV'.
One attractive feature of KGGSeq is that it combines the functional impact scores from multiple algorithms (i.e., SIFT and PolyPhen2) to predict whether variants are disease-causing or not.
If you use this feature in any published work, please cite the following manuscript:
Li et al. Predicting Mendelian disease-causing non-synonymous single nucleotide variants in exome sequencing studies. PLoS Genet. 2013 Jan;9(1):e1003143
Predict Mendelian disease-causing variants: --db-score dbnsfp --mendel-causing-predict
java -jar kggseq.jar --vcf-file path/to/file1 --ped-file path/to/file2 --db-score dbnsfp --mendel-causing-predict all --filter-nondisease-variant
Use at most 12 available functional impact scores (listed below) collected in dbNSFP database to RE-predict whether a nonsynonymous single nucleotide variant (SNV) will potentially be Mendelian disease causal or not by Logistic regression model (See methods in our paper, Li et al. PLoS Genet. 2013 Jan;9(1):e1003143). By default, KGGSeq tries all subsets of the scores (at least two) for a combinatorial prediction and finally present a prediction which can pass the threshold to classify the SNV as a potentially disease causal or no. In conjunction with the prediction model, the --filter-nondisease-variant tag will be used to filter out the nondisease-causative variants predicted by Logistic regression model (See methods in Li et al. PLoS Genet. 2013 Jan;9(1):e1003143). .Note that this trying will increase the false positive rate although it decreases the false negative rate. So you may need more information to justify the involved gene(s).
Figure: Receiver operating characteristic (ROC) and Precision recall(PR) curves and area under the curves (AUC) of individual scores and combined score by Logistic regression model
Note: this is an updated plot of Figure 1 in our paper . Here we combined more individual functional impact scores, which is the only difference.
On the other hand, one can FIX the prediction using a specified subset or full set of the 5 impact scores by option like --mendel-causing-predict 4,6,7
The coding for the functional impact scores used in'--mendel-causing-predict'options is listed below:
Coding |
Method |
Description (Mostly copied from dbNSFP) |
- |
SLR |
Sitewise Likelihood-ratio (SLR) test statistic for testing natural selection on codons. A negative value indicates negative selection, and a positive value indicates
positive selection. Larger magnitude of the value suggests stronger evidence. |
1 |
SIFT |
SIFT uses the 'Sorting Tolerant From Intolerant' (SIFT) algorithm to predict whether a single amino acid substitution affects protein function or not, based on the assumption that important amino acids in a protein sequence should be conserved throughout evolution and substitutions at highly conserved sites are expected to affect protein function.A small scoreindicates a high chance for a substitutionto damage the protein function. |
2 |
Polyphen2_HDIV |
Polyphen2 score based on HumDiv, i.e. hdiv_prob.
The score ranges from 0 to 1, and the corresponding prediction is "probably damaging"
if it is in [0.957,1]; "possibly damaging" if it is in [0.453,0.956]; "benign"
if it is in [0,0.452]. Score cutoff for binary classification is 0.5, i.e.
the prediction is "neutral" if the score is smaller than 0.5 and "deleterious" if the
score is larger than 0.5. Multiple entries separated by ";" |
3 |
Polyphen2_HVAR |
Polyphen2 predicts the possible impact of an amino acid substitution on the structure and function of a human protein using straightforward physical and comparative considerations by an iterative greedy algorithm. In the present study, we use the original scores generated by the HumVar (instead ofHumDiv) trained model as it is preferred for the diagnosis of Mendelian diseases. The scores range from 0 to 1. A substitution with larger score has a higher possibility to damage the protein function. |
4 |
LRT |
LRT employed a likelihood ratio test to assess variant deleteriousnessbased on a comparative genomics data set of 32 vertebrate species. The identified deleterious mutations could disrupt highly conserved amino acids within protein-coding sequences, which are likely to be unconditionally deleterious.The scores range from 0 to 1. A larger score indicates a larger deleterious effect. |
5 |
MutationTaster |
MutationTaster assesses the impact of the disease-causing potential of a sequence variant by a naive Bayes classifier using multiple resources such as evolutionary conservation, splice-site changes, loss of protein features and changes that might affect mRNA level. The scores range from 0 to 1. The larger score suggests a higher probability to cause a human disease. |
6 |
MutationAssessor |
MutationAssessor "functional impact of a variant :
predicted functional (high, medium), predicted non-functional (low, neutral)"
Please refer to Reva et al. Nucl. Acids Res. (2011) 39(17):e118 for details |
7 |
FATHMM |
FATHMM default score (weighted for human inherited-disease mutations with Disease Ontology); If a score is smaller than -1.5 the corresponding NS is predicted as "D(AMAGING)"; otherwise it is predicted as "T(OLERATED)". If there's more than one scores associated with the same NS due to isoforms, the smallest score (most damaging) was used. Please refer to Shihab et al Hum. Mut. (2013) 34(1):57-65 for details |
8 |
VEST3 |
VEST 3.0 score. Score ranges from 0 to 1. The larger the score the more likely
the mutation may cause functional change. In case there are multiple scores for the
same variant, the largest score (most damaging) is presented. Please refer to
Carter et al., (2013) BMC Genomics. 14(3) 1-16 for details.
Please note this score is free for non-commercial use. For more details please refer to
http://wiki.chasmsoftware.org/index.php/SoftwareLicense. Commercial users should contact
the Johns Hopkins Technology Transfer office. |
9 |
CADD_score |
Combined Annotation Dependent Depletion (CADD) score for funtional prediction of a SNP. Please refer to Kircher et al.
(2014) Nature Genetics 46(3):310-5 for details. The larger the score the more likely
the SNP has damaging effect. |
10 |
GERP++_NR |
Neutral rate |
11 |
GERP++_RS |
RS score, the larger the score, the more conserved the site |
12 |
PhyloP100way_vertebrate |
PhyloP estimates the evolutional conservation at each variant from multiple alignments of 100 vertebrate genomes to the human genome based on a phylogenetic hidden Markov model. |
13 |
29way_logOdds |
SiPhy score based on 29 mammals genomes. The larger the score,
the more conserved the site. |
Note Predictions at variants with missing scores at specified methods will be ignored!
The option will append the following fields to the output file:
...
Polyphen2_HDIV_pred: Polyphen2 prediction based on HumDiv, "D" ("porobably damaging"),
"P" ("possibly damaging") and "B" ("benign"). Multiple entries separated by ";"
Polyphen2_HVAR_pred: Polyphen2 prediction based on HumVar, "D" ("porobably damaging"),
"P" ("possibly damaging") and "B" ("benign"). Multiple entries separated by ";"
LRT_pred: Classification using LRT (D = deleterious, N = neutral,
or U = unknown)
MutationTaster_pred: Classification using MutationTaster (A =
disease_causing_automatic, D = disease_causing, N =
polymorphism, or P = polymorphism_automatic)
MutationAssessor_pred: MutationAssessor "functional impact of a variant :
predicted functional (high, medium), predicted non-functional (low, neutral)"
DiseaseCausalProb_ExoVarTrainedModel: Conditional probability of being Mendelian disease-causing
given the above prediction scores under a logistic regression
model trained by our dataset ExoVar.
IsRareDiseaseCausal_ExoVarTrainedModel: Classification using the logistic regression model
(Y = disease-causing or N = neutral)
BestCombinedTools:OptimalCutoff:TP:TN : The subset of original prediction tools (out of the 13 tools) used for the combined prediction by our Logistic Regression model which have the largest posterior probability among all possible combinatorial subsets: the cutoff leads to the maximal Matthews correlation coefficient (MCC): the corresponding true positive and true negative at the maximal MCC.
Note By default the option --mendel-causing-predict all is always switched on unless other predicion types are specified.
Predict the driver somatic-mutations and genes of cancers by an integrative methods: --db-score dbnsfp --cancer-driver-predict
java -jar kggseq.jar --vcf-file path/to/file1 --ped-file path/to/file2 --indiv-pair NONTUMOR.1:TUMOR.1,NONTUMOR.2:TUMOR.2 --genotype-filter 8 --cosmic-annot --db-score dbnsfp --cancer-driver-predict all --filter-nondisease-variant
A driver mutation is defined as the mutation that leads to selective advantage to a cancer clone by either increasing the cancer cells' survival or reproduction. Use at most 12 available functional impact scores (listed above, exept for LRT_Omega) collected in dbNSFP database to RE-predict whether a nonsynonymous single nucleotide variant (SNV) will potentially be the somatic driver mutations or not by Logistic regression model trained under COSMIC database. By default, KGGSeq tries all subsets of the scores (at least two) for a combinatorial prediction and finally present a prediction which can pass the threshold to classify the SNV as a potentially disease causal or no. In conjunction with the prediction model, the --filter-nondisease-variant tag will be used to filter out the nondisease-causative variants predicted by Logistic regression model.
Figure: Receiver operating characteristic (ROC) and Precision recall(PR) curves and area under the curves (AUC) of individual scores and combined score by Logistic regression model
The option will append the following fields to the output file:
...
CancerDriverProb_COSMICTrainedModel: Conditional probability of being cancer driver mutation
given the above prediction scores under a logistic regression
model trained by the COSMIC dataset.
IsCancerDriver_COSMICTrainedModel: Classification using the logistic regression model
(Y = cancer-drivering or N = passenger)
BestCombinedTools:OptimalCutoff:TP:TN: The subset of original prediction tools (out of the 13 tools) used for the combined prediction by our Logistic Regression model which have the largest posterior probability among all possible combinatorial subsets: the cutoff leads to the maximal Matthews correlation coefficient (MCC): the corresponding true positive and true negative at the maximal MCC.
In addition, genes of these predicted cancer-driver somatic mutations are evaluated by our statistical approach (FUSE).
Kwan JS, Ng OL, Sham PC and Li MX. FUSE: FUnctional and Statistical Evidence combined to identify driver genes in cancer genomes. (Submitted)
In the output file of the cancer-driver genes (*.cancer.gene.xls), there are detailed statistical evaluation results of each gene:
GeneSymbol: Official gene symbols defined by HGNC.
AvgCodingLen: The actual or averaged (of multiple isoforms) length (in kb) of all exons in a gene.
#AllNS: The observed number of non-synonymous and splicing somatic sequence variants.
#PositiveNS: The number of predicted possible cancer-driver non-synonymous and splicing somatic sequence variants.
#S: The observed number of synonymous somatic sequence variants.
CancerDriverPValue: The probability of having the observed number or more predicted possible cancer-driver mutations in a gene by chance.
NonCancerDriverPValue: The probability of having the observed number or more synonymous mutations in a gene. The synonymous somatic mutations are assumed to not contribute to the cancer cell growth advantage. This measure can be used to exclude a gene having synonymous somatic mutations due to technique artifacts or mapping errors.
Reads: The reads carrying reference and alternative alleles between tumor and non-tumor tissues. Example:
NORMAL.107 <-> TUMOR.107,112/0,85/32,5.00E-11,84.329414|NORMAL.105 <-> TUMOR.105,52/0,46/24,2.48E-7,54.260868
Here the somatic mutations are observed in two subjects separated by |. For the first subject, the reads carrying reference and alternative alleles between tumor and non-tumor tissues are 112/0 and 85/32 respectively, which are significantly different between the two different tissues, p=5.00E-11 and odds ratio =84.33.
PPI: protein-protein interactions between this gene and inputted disease genes.
Pathway: biological pathways shared by this gene and inputted disease genes.
PubMedID: PubMed ID of articles in which the disease name and the gene are co-mentioned.
COSMICCancerInfo: known cancer and occurrence time of the somatic mutations of this gene in COSMIC dataset.
Similarly, the FUSE method is also used to statistically assess the biological pathways enriched in predicted cancer-driver somatic mutations: --db-score dbnsfp --cancer-driver-predict --pathway-db pathwayDatabase or --pathway-db path/to/mypahway/file
java -jar kggseq.jar --vcf-file path/to/file1 --ped-file path/to/file2 --indiv-pair NONTUMOR.1:TUMOR.1,NONTUMOR.2:TUMOR.2 --genotype-filter 8 --cosmic-annot --db-score dbnsfp --cancer-driver-predict all --filter-nondisease-variant --pathway-db pathwayDatabase
Currently, there are 5 pathway databases from GSEA. The following are brief descriptions of the 5 datasets COPIED from the page of GSEA.
PathwayDatabase |
Description |
Version |
cano |
Gene sets from the pathway databases. Usually, these gene sets are canonical representations of a biological process compiled by domain experts. (1452 gene sets) |
MSigDB 3.1 |
cura |
Curated gene sets: Gene sets collected from various sources such as online pathway databases, publications in PubMed, and knowledge of domain experts. The gene set page for each gene set lists its source. (4850 gene sets) |
MSigDB 3.1 |
onco |
Oncogenic signatures: Gene sets represent signatures of cellular pathways which are often dis-regulated in cancer. The majority of signatures were generated directly from microarray data from NCBI GEO or from internal unpublished profiling experiments which involved perturbation of known cancer genes. In addition, a small number of oncogenic signatures were curated from scientific publications. (189 gene sets) |
MSigDB 3.1 |
cmop |
Computational gene sets: Computational gene sets defined by mining large collections of cancer-oriented microarray data. (858 gene sets) |
MSigDB 3.1 |
onto |
GO gene sets Gene sets are named by GO term and contain genes annotated by that term. GSEA users: Gene set enrichment analysis identifies gene sets consisting of co-regulated genes; GO gene sets are based on ontologies and do not necessarily comprise co-regulated genes. (1454 gene sets) |
MSigDB 3.1 |
Alternatively, uses can specify the own pathway gene sets in a text file. The format of the pathway file is:
[Column 1: Pathway ID]
[Column 2: Pathway URL]
[Column 3..N: Gene Symbols in the pathway; separated by spaces].
No title row is required!!!
e.g.,
----------------------------------------------
KEGG_GLYCOLYSIS_GLUCONEOGENESIS http://www.broadinstitute.org/KEGG_GLYCOLYSIS_GLUCONEOGENESIS.html LDHC LDHB LDHA
KEGG_CITRATE_CYCLE_TCA_CYCLE http://www.broadinstitute.org/KEGG_CITRATE_CYCLE_TCA_CYCLE.html GJB1 OGDHL OGDH PDHB IDH3G LDHB
...
...
In the output file of the cancer-driver pathways (*.cancer.pathway.xls), there are detailed statistical evaluation results of each pathway:
Pathway: the input pathway name or id
Size: the number of genes within the pathway
TotalExonLen: the length (in kb)of exons of all genes in the pathway
TotalNSSomatNum: the number of non-synonymous and splicing somatic sequence variants of all genes in the pathway
TotalSSomatNum: the number of synonymous somatic sequence variants Of all genes in the pathway
Genes: genes with predicted possible cancer-driver non-synonymous and splicing somatic sequence variants
CancerDriverPValue: The probability of having the observed number or more predicted possible cancer-driver mutations within a pathway by chance.
NonCancerDriverPValue: The probability of having the observed number or more synonymous mutations within a pathway by chance. The synonymous somatic mutations are assumed to not contribute to the cancer cell growth advantage. This measure can be used to exclude a pathway having synonymous somatic mutations due to technique artifacts or mapping errors.
HypergeometricPValue: Hypergeometric enrichment set test p-value of a biological pathway enriched by genes with promising cancer-driver gene p-values (described above). The selection of the genes for hypergeometric test is determined by Benjamini Hochberg FDR test with a cutoff specified by --pathway-gene-p.
Note This feature will be available soon. Stay tuned for an update!
|
Generate a histogram of minor allele frequency (MAF) --mafplot
java -jar kggseq.jar --vcf-file path/to/file1 --ped-file path/to/file2 --out path/to/prefixname --db-filter hg18_1kgasn2010,hg18_1kgeur2010,hg18_1kgafr2010,hg18_ESP5400 --allele-freq 0,1 --mafplot
Note Currently, --mafplot only supports the inputs specified by --vcf-file and --v-assoc-file.
Generate a Quantile-Quantile (Q-Q) plot for association p-values --qqplot
java -jar kggseq.jar --v-assoc-file path/to/file --out path/to/prefixname --qqplot
Note Currently, --mafplot only supports the inputs specified by --assoc-file, --v-assoc-file.
|
1) Does Kggseq provide gene expression information in the output?
Kggseq does not implement any gene expression database for annotating location or quantity of gene expression. But users can input Kggseq’s final gene list to GeneCards for gene expression information.
2) Why Kggseq does not read my VCF file?
If you use standard VCF output from GATK pipeline, it usually contains variants on mitochondrial DNA. However, mitochondrial DNA is not annotated by gene feature database. Therefore Kggseq currently only accept VCF file excluding variants on ChrM.
3) Is there a link for annotating shared pathways in terms of biological function?
Kggseq implements MsigDB database, therefore each pathway item output by Kggseq can be directed to a weblink which contain concrete information for the corresponding pathway.
4) How shall I prepare the candidate gene list? What is the gene list for?
The candidate gene list can include any gene interesting to the user. It may be verified causal gene for the disease, or potential causal gene. The gene list is used to fish more hidden interesting genes in order to narrow down the searching region. If no gene list provided, Kggseq can still perform routine ‘filtration + sharing’ strategy for sequencing data.
5)Suppose I have in-house control genomes in VCF format, how can I transform it into Kggseq acceptable file?
There are two alternative ways to allow you use the in-house VCF data (containing genotypes of multiple individuals) as filter.
1. Convert the VCF data into a standard kggseq local filter format by the command: --make-filter --vcf-file path/to/vcffile --buildver-in hg18 --out test.hg19.var --buildver-out hg19 first. And then set the --local-filter test.hg19.var
2. Directly set --local-filter-vcf path/to/vcffile. However, this way is less efficient than the above way as it will take additional time to parse the VCF data.
6) What is the default setting for minor allele frequency (MAF) cut-offs in the filtration step? Is there a practical guidance for MAF selection?
The default setting for MAF cut-offs is 0, which means only novel variants can survive in the filtration process. MAF selection is a difficult problem, but generally two suggestions can be adopted: for rare Mendelian disease, MAF cut-off at 0 or 0.01 is reasonable; for common complex disease of relatively rare alleles with large effect size, MAF cut-off is dependent on disease prevalence, penetrance and genetic model, so a higher cut-off like 0.05 might be more reasonable.
7) Can I use ANNOVAR and Kggseq together on my dataset?
Kggseq is quite flexible for interacting with other sequence-oriented analytical programs/software (including SOAPSNP, ANNOVAR, PLINK/seq, VAAST, etc). It can accept various input formats, and output different kinds of sequence data. In case of ANNOVAR, Kggseq can read ANNOVAR-formatted sequence variants, and write the final remaining variants in ANNOVAR format.
8)Can I run Kggseq on my Lenovo or Mac laptop? Is it time consuming to run a complete Kggseq process?
Normally, Kggseq run well and fast with >=1 GB RAM memory. Hence current laptop are certainty affordable for running Kggseq. The whole process need only <10 minutes, unless PubMed searching is set in the parameter file. For PubMed searching, it really depends on the speed of the network and also the loading of the NCBI PubMed database.
9)How do I report an error or bugs to Kggseq?
You are welcomed to join our google group (http://groups.google.com/group/kggseq_user?hl=en). This site is used for communication and discussion of Kggseq usage and functions. You can also directly write an email to limx54@yahoo.com, or kggseq_user@googlegroups.com. |
|