Options
General settings

Inputs
  Variants&genotypes
  Phenotype&pedigree

Outputs
Ordinary outputs
Binary outputs

Quality control
Genotype QC
Variant QC
Sample QC

Filtering
Genetic inheritance
Gene feature
Allele frequency
Type&region&gene
Supper duplicate
Phenotype
Expression

Annotation
dbSNP RSID
Alternative splicing
Structure variation
Protein interaction
GeneSet
Mouse phenotype
Zebrafish Phenotype
DDD Study
PubMed literature
OMIM
COSMIC

Prediction at variant
Mendelian disease
Cancer somatic driver
Complex disease
Composite:PRVCS
Context:cepip
GF-Specific

Prediction at gene
Pathogenicity
Phenotype mining
Expression

Statistical tests
Gene Mutation Burden
   ITER
   WITER
   WITER-Ref
RUNNER
GGI-RUNNER
Association at variant
Association at gene
SKAT
Rvtests
Association at geneset
SKAT
Rvtests

Plotting
Allele frequency
Q-Q

Miscellaneous
Linkage Disequilibrium
LD Pruning

FAQ

KGGSeq: A biological Knowledge-based mining platform for Genomic and Genetic studies using Sequence data

(Version 1.0+)
Miaoxin Li, Mulin Jun Li, Jiang Li, Jacob Shujui Hsu, Hongsheng Gui, Zhicheng Pan, Suying Bao and JS Kwan

Introduction

KGGSeq is designed to make use of biological knowledge and statistical models to identify genetic loci responsible for human diseases/traits by high-throughput sequencing data. Compared with other tools like PLINK/Seq (“A library for the analysis of genetic variation data”, http://atgu.mgh.harvard.edu/plinkseq/), KGGSeq focuses more on integrative downstream analysis. Since version 1.0, it has been equipped with a series of integrative methods for characterizing genetic mutations/polymorphsim of Mendelian disorders, complex disorders and cancers by using whole genome sequencing data and versatile public resources .

References

1.   Li MX, Gui HS, Kwan JS, Bao SY, Sham PC. A comprehensive framework for prioritizing variants in exome sequencing studies of Mendelian diseases.Nucleic Acids Res. 2012 Apr;40(7):e53. PubMed NAR‹For the entire filtration, annotation and prioritization framework›
2.   Li M, Li J, Li MJ, Pan Z, Hsu JS, Liu DJ, Zhan X, Wang J, Song Y, Sham PC. Robust and rapid algorithms facilitate large-scale whole genome sequencing downstream analysis in an integrative framework. Nucleic Acids Res. 2017 Jan 23. pii: gkx019. doi: 10.1093/nar/gkx019 PubMed‹For the entire filtration, annotation and prioritization framework with whole genome sequencing data›
3.   Li MX, Kwan JS, Bao SY, Yang W, Ho SL, Song YQ, Sham PC. Predicting Mendelian disease-causing non-synonymous single nucleotide variants in exome sequencing studies. PLoS Genet. 2013 Jan;9(1):e1003143. PubMed PLoS Genet‹For Mendelian pathogenic prediction of non-synonymous variants by logistic regression›
4.   Hsu JS, Kwan JS, Pan Z, Garcia-Barcelo MM, Sham PC, Li M. Inheritance Modes Specific Pathogenicity Prioritization (ISPP) for Human Protein Coding Genes. Bioinformatics. 2016 Oct 15;32(20):3065-3071.‹For causal prediction at gene levels ›
5.   Li MJ, Pan Z, Liu Z, Wu J, Wang P, Zhu Y, Xu F, Xia Z, Sham PC, Kocher JP, Li M, Liu JS, Wang J. Predicting regulatory variants with composite statistic. Bioinformatics. 2016 Sep 15;32(18):2729-36.‹For regulatory potential prediction at non-coding variants ›
6.   Li MJ, Li M, Liu Z, Yan B, Pan Z, Huang D, Liang Q, Ying D, Xu F, Yao H, Wang P, Kocher JA, Xia Z, Sham PC, Liu JS, Wang J. cepip: context-dependent epigenomic weighting for prioritization of regulatory variants and disease-associated genes, Genome Biology (2017) 18:52 ‹For tissue- or cell type specific regulatory potential prediction at non-coding variants ›
7.   Jiang et al. WITER: a powerful method for estimation of cancer-driver genes using a weighted iterative regression modelling background mutation counts. Nucleic Acids Res. 2019 Sep 19;47(16):e96. doi: 10.1093/nar/gkz566. PubMed ‹For mutation burden test at cancer-driver genes›
8.   Jiang et al. Deviation from baseline mutation burden provides powerful and robust rare-variants association test for complex diseases. Nucleic Acids Res. 2021 Dec 20. doi: 10.1093/nar/gkab1234. PubMed ‹For mutation burden test at genes with rare variants; RUNNER›

Install and run

System requirement

Java Runtime Environment (JRE) version 1.6 (or up) is required for KGGSeq. It can be downloaded from the Java web site. Installing the JRE is very easy in Windows OS and Mac OS X.

In Linux, you have more work to do. Details of the installation can be found http://www.java.com/en/download/help/linux_install.xml.

In Ubuntu, if you have an error message like: "Exception in thread "AWT-EventQueue-0" java.awt.HeadlessException ... " , please install the Sun Java Running Environment (JRE) first.

To install the Sun JRE on Ubuntu(10.04), please use the following commands:
sudo add-apt-repository "deb http://archive.canonical.com/ lucid partner"
sudo apt-get update
sudo apt-get install sun-java6-jre sun-java6-plugin sun-java6-fonts
Detailed explanation of above commands can be found at http://www.ubuntugeek.com/how-install-sun-java-runtime-environment-jre-in-ubuntu-10-04-lucid-lynx.html.

For Mac OS, the JRE 1.6 has been available at http://developer.apple.com/java/download/ since April 2008. Mac OS users may need update the Java application to run KGGSeq. A potential problem is that this update does not replace the existing installation of J2SE 5.0 or change the default version of Java. Similar to the Linux OS, the Java_Home environmental variable has to be configured to initiate KGGSeq.

Installing KGGSeq

Simply decompress the archive and run the following command

java -Xmx10g -jar ./kggseq.jar <arguments>

The arguments -Xmx10g set the maximum Java heap sizes for KGGSeq as 10 gigabytes. Specifying a larger maximum heap size can speed up the analysis when processing large datasets. A higher setting like -Xmx20g is required when there is over 50 million. The number, however, should be less than the size of physical memory of a machine.

Note <arguments >can be saved in a flat text file.

Note For some advanced statistical analysis in which some third-party tools are called, you have to run R and C++ make command to install them. Please watch our demo video for details.

Note To use KGGSeq under Proxy Network, you need specifically use the following java command to configure the proxy settings (Thank Daniele Yumi for this suggestion):

  java -Dhttp.proxyHost=xxx.xxx.xxx -Dhttp.proxyPort=xxxx -Dhttp.proxyUser=xxx -Dhttp.proxyPassword=xxx -jar "./kggseq.jar"

Note KGGSeq calls R packages through RServe in a 'localhost' machine with '6311' port. To specify the host and port of RServe, you need add the following two options:
java ./kggseq.jar --rserve-host xxx.xxx.xxx --rserve-port xxxx <other arguments>

Quick example

Example : Prioritize variants for a rare Mendelian dominant disease, Arthrogryposis

Files needed:
1)a VCF file (rare.disease.hg19.vcf); and
2)a linkage pedigree file (rare.disease.ped.txt).

Note All files were included in the examples folder of KGGSeq.

Run the command below:

java -Xmx5g -jar kggseq.jar examples/param.rare.disease.hg19.txt

We now walk through the parameter file “param.rare.disease.hg19.txt” before going into the results. Lines starting with hash sign # are comments. Detailed interpretation for each argument in the parameter file is included in 'Options' part.

#one argument per line
#I.Environmental setting
--buildver hg19 \ \ #line 1
--nt 4 \ \ #line 2
#--lib-update \ #line 3
#--resource-update \ \ #line 4

#II. Specify the input files
--vcf-file examples/rare.disease.hg19.vcf \ #line 5
--ped-file examples/rare.disease.ped.txt \ #line 6, or specify --indiv-pheno X:1,Y:1,Z:2

#III. Output setting
--out ./test1 \ #line 7
--excel \ #line 8
--o-vcf \ #line 10
#--o-flanking-seq 50 \ #line 11, need large RAM memory

#IV. QC
--gty-qual 20 \ #line 12
--gty-dp 8 \ #line 13
--gty-af-ref 0.05 \ #line 14
--gty-af-het 0.25 \ #line 15
--vcf-filter-in PASS,VQSRTrancheSNP90.00to93.00,VQSRTrancheSNP93.00to95.00,VQSRTrancheSNP95.00to97.00,VQSRTrancheSNP97.00to99.00 \ #line 16
--seq-qual 30 \ #line 17
--seq-mq 20 \ #line 18
--seq-fs 60 \ #line 19
--min-obsa 1 \ #line 20

#V. Filtering
--genotype-filter 3,4,5,6 \ #line 21 for dominant mode
#--ibs-case-filter 1000 \ #line 22, or specify 'ibdregions.txt' file
--regions-out chrX,chrY \ #line 23
--db-filter 1kg201204,dbsnp141,ESP6500AA,ESP6500EA \ #line 24
--rare-allele-freq 0.006 \ #line 25
--db-filter-hard dbsnp138nf \ #line 26
--db-gene refgene,gencode,knowngene \ #line 27
--gene-feature-in 0,1,2,3,4,5,6 \ #line 28
--superdup-filter \ #line 29
--gene-var-filter 4 \ #line 30

#VI. Annotation
--scsnv-annot \#line 31
--dgv-cnv-annot \#line 32
--omim-annot \#line 32
--candi-list ECEL1,MYBPC1,TNNI2,TNNT3,TPM2 \#line 33
--geneset-annot cura \ #line 34
--ppi-annot string \ #line 35
--ppi-depth 1 \ #line 36
--phenotype-term Arthrogryposis,Arthrogryposis+multiplex+congenita \ #line 37
--pubmed-mining \ #line 38
--mouse-pheno \ #line 39
--zebrafish-pheno \ #line 40
--ddd-annot \ #line 41

#VII. Prediction at variants
--db-score dbnsfp \ #line 42
--mendel-causing-predict best \ #line 43
--filter-nondisease-variant \ #line 44

#VIII. Prediction at genes
--patho-gene-predict \ #line 45
--phenolyzer-prediction \# line 46 the '--phenotype-term' option has to be specified prior to this option.

Part (I): Specify general environmental setting

Argument ‘line 1~4’ is used to set general KGGSeq running enviroment, including human genome build, resource path and program update

Part (II): Specify the input files

Arguments ‘line 5~6’ are used to specify the input files which support various data format, and they are compulsory for running KGGSeq.

Part (III): Output setting

Argument ‘line 7~11’ is used to set the output file name and type, which can be produced simutaneously by KGGSeq.

Part (IV): Specify Quality control conditions

Arguments ‘line 12~20’ are used to apply QC conditions (Genotype or Variant level) on input dataset, and then exclude low quality genotypes or variants in the following analysis.

Part (V): Filtering

Arguments ‘line 21~30’ are used to apply a combination of filters (genetic inheritance, gene feature, allele frequency, functional importance, etc) to enable better prioritization of the key variants/genes.

Part (VI): Annotation

Arguments ‘line 31~41’ are used to annotate remained variants/genes by searching various genomic annotation database and literature evidence.

Part (VII) & Part (VIII): Pathogenic prediction at variants and genes

Arguments ‘line 42~46’ are used to predict pathogenicity at remained variants/genes.

Notes Most of the above arguments are optional (except line 5 and 6), so user can mask some lines by “#” or delete the lines. Under this circumstance, user can have a systematic view of the impact for each level or even steps. And it will be easier to produce this parameter file by KGGSeq command generator we provide.

Options

Performance &environmental settings

Set the number of parallel tasks for an analysis job. java -jar kggseq.jar --nt #

The # denotes the number of parallel threads to perform the analysis.

Set the version of reference human genome buildjava -jar kggseq.jar --buildver hg##

All inputs will be analyzed and outputted according to the set genome version.

NOTE Currently, KGGSeq only supports hg18, hg19, and hg38. It is --buildver hg19 by default.

Specify the local path to store the resource datasets java -jar kggseq.jar --resource path/to/resource/files

Once initiated, KGGSeq will automatically download the resource data from its web server to the local folder. By default, it is --resource "./"

Switch off the auto-checking function for itself. java -jar kggseq.jar --no-lib-check

By default, KGGSeq will automatically check the latest version of the kggseq.jar.

Switch on the auto-checking and-updating function for itself. java -jar kggseq.jar --lib-update

With this option, KGGSeq will automatically check the latest version of the kggseq.jar and update it if necessary.

Switch on auto-checking and-updating for NEEDED resource datasets. java -jar kggseq.jar --resource-update

With this option, KGGSeq will automatically check and download the available latest version of resource datasets, once initiated.

Switch off the auto-log function to write log in a text file. java -jar kggseq.jar --no-log

By default, KGGSeq will automatically write log in a text file.

Input formats

Variants and genotypes

VCF format with genotypes
java -jar kggseq.jar --vcf-file path/to/file1 --ped-file path/to/file2 The format that KGGSeq supporting best is the VCF format, which stores sequence variants and called genotype information. The VCF file can be flatted as text or compressed as .gz. KGGSeq will read all genotypes either phased or unphased but will ignore the phase status.

VCF format with phased genotypes
java -jar kggseq.jar --vcf-file path/to/file1 --vcf-phased --ped-file path/to/file2 Currently, KGGSeq only allows all genotypes phased in a VCF file. When there a mixture of phased and unphased genotypes in a VCF file, KGGSeq will report an error with this option.

NOTE To achieve the best of parallel computing advantages of KGGSeq, it is strongly recommended to compress the input data with blocked GNU Zip format by bgzip <path/to/your/file>. See more here. Data compressed with conventional GNU Zip format cannot be handled in parallel.

NOTE The parallel computing also supports VCF data compressed by Burrows–Wheeler algorithm ( bzip2), which may save over 20% space than the standard GNU Zip. However,KGGSeq may take 3 to 4 folds longer time to decompress the data.

NOTE To set KGGseq handling input file in parallel, the parameter --nt N should be added. Not only the .gz and .bz2 formats are supported, but the original VCF files can also be read parallel.

NOTE If the data were stored in different files, chromosome by chromosome, you can use _CHROM_ to denote the chromosome names [1...Y] or directly specify [1,2,X]in the file name so that KGGSeq can read data in multiple files in a run.In this scenario, the separated VCF files must contain the SAME subjects in the SAME order.

NOTE In the VCF file, each subject should have a unique SubjectID, which is used to link the phenotypes and/or pedigree structure.

VCF format WITHOUT genotypes
KGGSeq can also accept VCF data without genotype information. But there are less usable functions for this type data.
java -jar kggseq.jar --no-gty-vcf-file path/to/file1

ANNOVAR format java -jar kggseq.jar --annovar-file path/to/file NOTE Since v0.4, KGGSeq recognize an extended ANNOVAR format in which a head row and multiple columns for comments are allowed.

Example:
chr startpos endpos ref alt comment1 comment2 comment3
1 69428 69428 T G T 92 129
1 69476 69476 T C T 1 0

GenoType Block (GTB) format --gtb-file

java -jar kggseq.jar --gtb-file path/to/fileset The GTB binary genotypeformat produced by GBCGenoType Blocking Compressor(GBC) , which facilitates ultra-fast access for large scale genotypes of hundreds of thousands of subjects.

KGGSeq binary genotype file set --ked-file

java -jar kggseq.jar --ked-file path/to/fileset The KGGSeq binary genotype produced by KGGSeq, which is more flexible to store the different types of genotypes than the PLINK BED format.

The Cancer Genome Atlas (TCGA) project's Mutation Annotation Format (MAF)
java -jar kggseq.jar --maf-file path/to/file1.maf A Mutation Annotation Format (MAF) file (.maf) is a tab-delimited text file that lists mutations. The MAF file can be flatted as text or compressed as .gz.

NOTE KGGSeq's WITER function recognizes a simplified MAF format as well which contains extracted SIX columns only. The head names of the six columns are fixed but the order does not matter. Otherwise KGGSeq would abort with an exception.

Example:
Tumor_Sample_UUID Chromosome Start_Position Reference_Allele Tumor_Allele1 Tumor_Allele2
TCGA-A8-A06P chr19 58864307 C A C
TCGA-A8-A06P chr19 58864307 C A C
TCGA-E9-A1NH chr19 58864366 G A G
TCGA-E9-A22B chr19 58862784 C T C
BR-MEX-015 chr10 52566490 C T C
TCGA-BH-A0HP chr10 52595854 G A G
TCGA-BH-A18P chr10 52595937 G A G
TCGA-A2-A0EY chr12 9246090 C T C
TCGA-A8-A08G chr12 9251298 G A G
TCGA-B6-A0IC chr12 9220358 - T -

Phenotypes and pedigree

When there are genotypes inputted in VCF format, KGGSeq should know to whom they belong and the phenotypes of the samples. So you need provide your sample information in this case.

Concise subject ID and disease status: --indiv-pheno java -jar kggseq.jar --vcf-file path/to/file --indiv-pheno SubjectID1:0,SubjectID2:2,... Specify the individual IDs and affection status, which is coded as: 0 missing or unknow,1 unaffected,2 affected.

Note The individual IDs must be unique and identical to the ones defined in the VCF file(s).

Conventional linkage pedigree file: --ped-file java -jar kggseq.jar --vcf-file path/to/file1 --ped-file path/to/file2 (or --composite-subject-id) Specify the subject IDs, relationship between subjects, and disease status of them; it is a conventional Linkage Format, detailed here.

Pedigree Name
A unique alphanumeric identifier for this individual's family. Unrelated individuals should not share a pedigree name.

Individual ID
An alphanumeric identifier for this individual. Should be unique within his family (see above).

Father's ID
Identifier corresponding to father's individual ID or "0" if unknown father. Note that if a father ID is specified, the father must also appear in the file.

Mother's ID
Identifier corresponding to mother's individual ID or "0" if unknown mother Note that if a mother ID is specified, the mother must also appear in the file.

Sex
Individual's gender (1=MALE, 2=FEMALE).

Affection status
Affection status to be used for association tests (0=UNKNOWN, 1=UNAFFECTED,2=AFFECTED).

Note By default, the subject IDs in the VCF file must be identical to Individual IDs in the pedigree file, which are assumed unique in the whole pedigree file. However, one can also ask KGGSeq to use a composite subject ID in the CVF file as "PedigreeName$IndividualID" by the option `--composite-subject-id`.

Besides, KGGSeq also accepts a pedigree format with header information (See the following example),in which the phenotype names are specified. Some of the phenotype names may be designated as covariates in statistical association analysis.

                                        fid iid fatid motid sex DISEASE QT  AGE  
                                        1  NA12344 NA12347 NA12348 1   1   94.17   NA   
                                        1  NA12347 0   0   1   NA   19.54  40.0  
                                        1  NA12348 0   0   2   2   NA  36.6

Note The head line must start with 'fid iid'. Otherwise, KGGSeq will regard the first row as a subject-line.
Note The missing phenotype values are denoted by "NA".

Outputs

KGGSeq can flexibly output different formats of the prioritization and annotation results for validation or further analysis by third-party tools.

Output file path and file name prefix: --outjava -jar kggseq.jar --vcf-file path/to/file1 --ped-file path/to/file2 --out path/to/prefixname Specify path and prefix name of outputs. It is "./kggseq" by default.

Note Note all outputs (except for the Excel format) of KGGSeq are to compressed by GNU Zip format automatically by default. Please use --no-gz to disable the automatic compression function.

Ordinary outputs

Text format: This is by defaultjava -jar kggseq.jar --vcf-file path/to/file1 --ped-file path/to/file2 By default, KGGSeq output results in TEXT format in a file named kggseq.flt.txt, in which the fields (or columns) are delimited by the tabs.

Excel format: --excel java -jar kggseq.jar --vcf-file path//to/file1 --ped-file path/to/file2 --excel The results will be stored in a Excel file with the name kggseq.flt.xls.

For ANNOVAR input: --o-annovar java -jar kggseq.jar --vcf-file path/to/file1 --ped-file path/to/file2 --o-annovar Generate an extra copy of output for the prioritized variants in ANNOVAR input format, which can be further annotated by ANNOVAR.

For VCF input: --o-vcf java -jar kggseq.jar --vcf-file path/to/file1 --ped-file path/to/file2 --o-vcf Generate an extra copy of output for the prioritized variants in VCF format, which can be further analyzed by other tools.

Note Please use --missing-gty X to denote missing genotypes in the output VCF file. It is --missing-gty ./. by default.

For VCF input, but in a simplified format: --o-svcf java -jar kggseq.jar --vcf-file path/to/file1 --ped-file path/to/file2 --o-svcf Generate an extra copy of output for the prioritized variants in simplified VCF format, in which only genotypes are included in the FORMAT column.

The group-file format for EPACTS input or other tools : --o-gene-grp java -jar kggseq.jar --vcf-file path/to/file1 --ped-file path/to/file2 --o-gene-grp Output coordinates, reference and alternative alleles of left variants of each gene in a group-file format, which can be further analyzed by other tools for more sophisticated gene-based association analysis.

Note According to coordinates, a variant may be mapped onto multiple genes.

For PLINK input: --o-plink-ped java -jar kggseq.jar --vcf-file path/to/file1 --ped-file path/to/file2 --o-plink-ped Generate an extra copy of output for the prioritized variants in PLINK PED and MAP file, which can be further annotated by PLINK.

Extract the flanking sequence of each sequence variant: --o-flanking-seq java -jar kggseq.jar --vcf-file path/to/file1 --ped-file path/to/file2 --o-flanking-seq 50 The source sequence information is the UCSC Chromosome dataset hg18 or hg19, which are human genome reference sequences on the forward strand. The unit of the specified flanking sequence is base-pair(bp). The retrieved sequences will be in the *.flt.txt or *.flt.xls with the column label "FlankingSeqXXbp".

Note Users might need set large memory (by -Xmx20g) to carry out this function as KGGSeq will load the whole sequences of a chromosome.

Binary outputs

PLINK binary genotypes files set: --o-plink-bed java -jar kggseq.jar --vcf-file path/to/file1 --ped-file path/to/file2 --composite-subject-id --o-plink-bed Generate an extra copy of output for the prioritized variants in PLink binary genotype format (SNP-major model), which can store the genotype information more efficiently.
Note This format does not support variants with over 2 alleles and phased genotypes, which will be ignored by KGGSeq.To record this information, you are suggested to use the KGGSeq binary genotype format (below).

KGGSeq binary genotypes files set: --o-ked java -jar kggseq.jar --vcf-file path/to/file1 --ped-file path/to/file2 --composite-subject-id --o-ked Generate an extra copy of output for the prioritized variants in KGGSeq binary genotype format, which use much less space to store the genotype information.

Here is an example VCF file:

##fileformat=VCFv4.0
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT 1$A 1$B 1$C
chr1 4793 rs6682385 A G 596.57 PASS AC=2 GT 1/1 0/0 ./.
chr1 53560 . T C 573.51 PASS AC=1; GT ./. 0/0 0/1
chr1 12887054 rs3738105 T C,G 5572.99 PASS AC=2,1 GT 1/2 0/0 0/1

And the corresponding pedigree file:

1 A 0 0 1 1
1 B 0 0 2 1
1 C A B 1 2

After running the command above, KGGSeq will generat three file:
kggseq.ked (the KGGSeq binary genotype file, genotype information)
kggseq.kam (the KGGSeq family data file, first six columns of the conventional Linkage Pedigree file)
kggseq.bim (the KGGSeq variants map data file, two extra columns for reference and alternative allele names, like: A G,C). The kggseq.kim is similar to the .bim file of PLINK BED format. But it has one additional column (the last column) which denotes the compression status in the kggseq.ked file.The '-1' means that the genotypes are not compressed while another value suggested that the genotypes are compressed for a variant. The value divided by 3 denotes the number of "1" bits in the encoded genotypes of a variant for all subjects. The actual positions of "1" bits in the encoded genotypes are stored in the kggseq.ked file.
An example:

1 rs558604819 0 10642 G A 63
1 rs575272151 0 11008 C G -1
1 rs544419019 0 11012 C G -1
1 rs561109771 0 11063 T G 45

Explanation: 1) The binary genotypes of variant rs558604819 are compressed into 63 bytes. As every 3 bytes denote a position of '1' bits, this number means that there are 21 bits with a value of '1' in the binary genotypes. One can obtain all positions by reading the consecutive 3 bytes. Given the positions of '1', the original binary genotypes can be restored. 2) The binary genotypes of variant rs575272151 are not compressed.

Compared to the plink .bed format, the kggseq.ked has a different way to present genotypes, which is more flexible. Here is the binary view of kggseq.ked file:

|-magic number--| |phase mode| |------------------genotype data-------------------|
10011110 10000010 00000000 `10100100` `10010100` `10000000` `10000000`

The followings are the interpretation of the KGGSeq binary genotype file format

First two bytes `10011110 10000010` for the validation checking of a KGGSeq binary genotype file.

Third byte is `00000000` (Unphased genotypes) or `00000001` (Phased genotypes)

Genotype data, either for unphased or phased ones in Variant-major order ((i.e. list all individuals for the first Variant, all individuals for the second Variant, etc) ). The 4th byte (`10100100`) denotes genotypes of chr1:4793 by two bit-blocks (denoted by green and red respectively). At this site, the genotypes of A, B and C are first encoded as 10, 00 and 11 respectively (See details in the encoding table). The binary codes at first and second places are put into two blocks 101 and 001. The block size is equal to the number of total subjects. It is 3 here. The remaining spaces in a byte, which can contain 8 bits, are set as zero bits.
The 5th byte (`10010100` ) denotes genotypes of chr1:53560. The 6th and 7th bytes (`10000000` and `10000000` ) present genotypes of chr1:12887054, which has three alleles.

New "row" in the VCF file always starts a new byte (8 bits)

For each sequence variant, the number of bytes required to present a genotype is determined by the number of alleles and the genotype phase status.
I. Unphased -genotypes
I.I Bi-allelic sequence variant (2 bits)

Reference homozygous Heterozygous Alternative homozygous missing

VCF genotype 0/0 0/1 1/1 ./.

Bits 00 01 10 11

Decimal 0 1 2 3

I.II Tri-allelic sequence variant (3 bits)

Reference homozygous Heterozygous Heterozygous Alternative homozygous Heterozygous Alternative homozygous missing

VCF genotype 0/0 0/1 0/2 1/1 1/2 2/2 ./.

Bits 000 001 010 011 100 101 110

Decimal 0 1 2 3 4 5 6

I.III Quad-allelic sequence variant (4 bits)

Reference homozygous Heterozygous Heterozygous Heterozygous Alternative homozygous Heterozygous Heterozygous

VCF genotype 0/0 0/1 0/2 0/3 1/1 1/2 1/3

Bits 0000 0001 0010 0011 0100 0101 0110

Decimal 0 1 2 3 4 5 6

Alternative homozygous Heterozygous Alternative homozygous missing

VCF genotype 2/2 2/3 3/3 ./.

Bits 0111 1000 1001 1010

Decimal 7 8 9 10

II. Phased -genotypes
II.I Bi-allelic sequence variants (3 bits)

Reference homozygous Heterozygous Heterozygous Alternative homozygous missing

VCF genotype 0|0 0|1 1|0 1|1 .|.

Bits 000 001 010 011 100

Decimal 0 1 2 3 4

II.II Tri-allelic sequence variant (4 bits)

Reference homozygous Heterozygous Heterozygous Heterozygous Heterozygous Alternative homozygous Heterozygous

VCF genotype 0|0 0|1 0|2 1|0 1|1 1|2 2|0

Bits 0000 0001 0010 0011 0100 0101 0110

Decimal 0 1 2 3 4 5 6

Heterozygous Alternative homozygous missing

VCF genotype 2|1 2|2 .|.

Bits 0111 1000 1001

Decimal 7 8 9

II.III Quad-allelic sequence variants (5 bits)

Reference homozygous Heterozygous Heterozygous Heterozygous Heterozygous Heterozygous Heterozygous

VCF genotype 0|0 0|1 0|2 0|3 1|0 1|1 1|2

Bits 00000 00001 00010 00011 00100 00101 00110

Decimal 0 1 2 3 4 5 6

Alternative homozygous Heterozygous Heterozygous Heterozygous Heterozygous Alternative homozygous

VCF genotype 1|3 2|0 2|1 2|2 2|3 3|0

Bits 00111 01000 01001 01010 01011 01100

Decimal 7 8 9 10 11 12

Heterozygous Heterozygous Alternative homozygous missing

VCF genotype 3|1 3|2 3|3 .|.

Bits 01101 01110 01111 10000

Decimal 13 14 15 16

Quality Control

After genetic variants were inputed from a variety of data formats, KGGSeq provides a few optional quality control filters on individual genotype, genomic variant and sample phenotype. By adopting one or a combination of these QCs before performing downstream prioritization/annotation, it could help the user to reduce false positives.

Genotype QC

This is for genotype level quality control. Typical information in each individual genotype field covers read depth(--gty-dp), genotype quality (--gty-qual) and allelic balance ratio (--gty-af-ref, --gty-af-het and --gty-af-alt):

java -jar kggseq.jar --vcf-file path/to/file1 --indiv-pheno Indiv:1,Indiv:2 --gty-qual 20 --gty-dp 8 --gty-sec-pl 20 --gty-af-ref 0.05 --gty-af-het 0.25 --gty-af-alt 0.75 --o-vcf path/to/file2

They are in relationship of logical AND, and not exclusive of each other,user can specify one or any combination of these parameters at one time.

--gty-qual 20	Exclude genotypes with the minimal genotyping quality (Phred Quality Score) per genotype < 10. The default setting is `--gty-qual 20`.
--gty-dp 8	Exclude genotypes with the minimal read depth per genotype < 8. The default setting is `--gty-dp 8`.
--gty-sec-pl 20	Exclude genotypes with the second smallest normalized, Phred-scaled likelihoods for genotypes < 20. (otherwise, there would be confusing genotypes).The default setting is `--gty-sec-pl 20`.
--gty-af-ref 0.05	Exclude genotypes with the fraction of the reads carrying alternative allele >= 5% at a reference-allele homozygous genotype, usually indicating by AF in a VCF file. The default setting is `--gty-af-ref 0.05` . `--gty-af-ref 1` disables this tag.
--gty-af-het 0.25	Exclude genotypes with the fraction of the reads carrying alternative allele <= 25% at a heterozygous genotype. The default setting is `--gty-af-het 0.25`. `--gty-af-het 0` disables this tag.
--gty-af-alt 0.75	Exclude genotypes with the fraction of the reads carrying alternative allele <= 75% at a alternative-allele homozygous genotype. The default setting is `--gty-af-alt 0.75`. `--gty-af-alt 0` disables this tag.
--gty-somat-p 0.05	Exclude genotypes at which the distribution of reads carrying reference and alternative alleles are not significantly (p>0.05) different between tumor tissues and non-tumor tissues. By default, the p value threshold is 0.05.

Note The AF is prioritized in the quality control process if the both AF and Ad are provided in the VCF file as the AF is usually estimated using high quality reads

After genotype QC, two output files will be produced. 1)The default KGGSeq output in txt format (or Excel by specifying "--excel") which includes the basic annotations for each clean variant; 2)The vcf file: the genotype field which does not meet the QC conditions will be replaced by missing value ./., hence not considered in following analysis.

##Key information provided in default basic annotation file:
"AffectedRefHomGtyNum", Number of affected individuals with reference homozygote at this variant;
"AffectedHetGtyNum", Number of affected individuals with heterozygote at this variant;
"AffectedAltHomGtyNum", Number of affected individuals with non-ref homozygote;
"UnaffectedRefHomGtyNum", Number of unaffected individuals with reference homozygote at this variant;
Gene feature filtering "UnaffectedHetGtyNum", Number of unaffected individuals with heterozygote at this variant;
"UnaffectedAltHomGtyNum", Number of unaffected individuals with non-ref homozygote;

Note All above QC functions are designed for VCF input format,which may not be suitable for other input formats.

Note All above QC functions can be turned off by --no-qc.

Note In the LOG file, the null genotype refers to either 1) a missing genotype in the input VCF file or 2) a genotype failing QC. Similarly, the non-null genotype refers to a genotype pass QC.

Variant QC

This is for variant level quality control. In general VCF format, each line corresponds to the variants on the same position. Typical information for one variant includes overall sequencing quality (--seq-qual), RMS Mapping Quality (--seq-mq), and strand bias (--seq-sb or Phred-scaled p-value using Fisher's exact test --seq-fs). To make use of this information:

java -jar kggseq.jar --vcf-file path/to/file1 --indiv-pheno Indiv:1,Indiv:2 --vcf-filter-in PASS,VQSRTrancheSNP95.00to97.00,VQSRTrancheSNP97.00to99.00 --seq-qual 30 --seq-mq 20 --seq-fs 60 --o-vcf path/to/file2

Interpretation of key parameters and their values:

Arguments/values	Annotation
--vcf-filter-in	exlude the variants with "FILTER" not matching the specified QC labels in the VCF input file. e.g., --vcf-filter-in PASS,VQSRTrancheSNP90.00to93.00,VQSRTrancheSNP93.00to95.00,VQSRTrancheSNP95.00to97.00,VQSRTrancheSNP97.00to99.00 . By default, this tag is ignored.
--seq-qual 30	set the minimum overall sequencing quality score (Phred Quality Score) for the variant at 30;(default: --seq-qual 30)
--seq-mq 20	set the minimum overall mapping quality score (Mapping Quality Score) for the variant at 20;(default: --seq-mq 20)
--seq-fs 60	set the maximal overall strand bias Phred-scaled p-value (using Fisher's exact test) for the variant at 60; However, strand bias estimated using Fisher's Exact Test is best suited for low coverage situations. So this tag is ignored by default, which is --seq-fs 0.
--max-allele 4	ignore variant with alleles over 4; (default: --max-allele 4)

Note All above quality control functions can be turned off by --no-qc.

In addition, KGGSeq can use phenotype information (specified by "--indiv-pheno" or "--ped-file") to filter variants, especially when multiple samples are included in one input file such as trio or case/control study design.

java -jar kggseq.jar --vcf-file path/to/file1 --ped-file path/to/file2 --min-heta 1(or --min-homa 1) --min-hetu 1(or --min-homu1) --min-obsa 2(or --min-obsu 2) --min-obs 2 --o-vcf path/to/file3

Moreover, It's flexible to match the sample size and missingness of genotypes. Here are the options for different study design. (cases = affected subjects ; controls = unaffected subjects)

Arguments/values	Annotation
--min-heta 1	Set minimal observed number of heterozygotes genotypes in cases as 1.
--min-homa 1	Set minimal observed number of alternate homozygsote(s) genotypes in cases as 1; `--min-heta` and `--min-homa` are logical OR relation.
--min-hetu 1	Set minimal observed number of heterozygotes genotypes in controls as 1.
--min-homu 1	Set minimal observed number of alternate homozygsote(s) genotypes in controls as 1; `--min-hetu` and `--min-homu` are logical OR relation.
--min-obsa 2	Set minimal observed number of non-missing genotypes in cases as 2.
--min-obsa-rate 0.8	Set proportion of minimal observed number of non-missing genotypes in cases as 80%.
--min-obsu 2	Set minimal observed number of non-missing genotypes in controls as 2; `--min-obsa` and `--min-obsu` are logical AND relation.
--min-obsu-rate 0.8	Set proportion of minimal observed number of non-missing genotypes in controls as 80%; `--min-obsa` and `--min-obsu` are logical AND relation.
--min-obs 2	Set minimal observed number of non-missing genotypes in all samples as 2.
--min-obs-rate 0.8	Set proportion of minimal observed number of non-missing genotypes in all samples as 80%.
--hwe-control 0.01	Exclude variants in controls with the Hardy-Weinberg test p value <= 0.01. By default, this tag is disabled.
--hwe-case 0.01	Exclude variants in cases with the Hardy-Weinberg test p value <= 0.01. By default, this tag is disabled.
--hwe-all 0.01	Exclude variants in all subjects with the Hardy-Weinberg test p value <= 0.01. By default, this tag is disabled.

After running individual genotype QC, there will be two output files produced: one default KGGSeq output stored in txt (or Excel by specifying "--excel"), in which the basic annotations for each clean variant are provided (see above section); the other one is clean vcf file, in which the variant row which does not meet the conditions specified will be removed completely, hence not considered in following analysis.

Note The genomic variant QC is also available for non-vcf type of input, by using type of variants, or genomic position only.

Sample QC

KGGSeq provides two set of functions to facilitate quality assessment of sequencing data of each subject.
Firstly, you can use KGGSeq to convert genotypes from VCF format to PLINK binary genotype format (see Outputs part) so that routine sample QC (pairwise relatedness, mendel error rate and sex check) in genome-wide association study (GWAS) analysis can be performed quickly. The samples with unexpected relationship or error rate are suggested to be removed.

java -jar kggseq.jar --vcf-file path-to-file1 --ped-file path/to/file2 <Variants and Genotype QC Settings> --db-merge <Reference genotype set> --o-plink-bed plink --bfile kggseq.merged --genome The following is the reference genotypes integrated by KGGSeq: (KGGSeq will automatically download it for you according to the set name specified by --db-merge )

Reference genotype set	Description
hapmap2.r22.ceu.hg19	Haplotypes of Hapmap 2 release 22. Convert the coordinates to be hg19 from hg18 by UCSC lift over function. Complied from here.
hapmap2.r22.chbjpt.hg19
hapmap2.r22.yri.hg19
hapmap3.r2.ceu.hg19	Haplotpyes of Hapmap 3 release 2. Convert the coordinates to be hg19 from hg18 by UCSC lift over function. Compiled from here.
hapmap3.r2.chbjpt.hg19
hapmap3.r2.mex.hg19
hapmap3.r2.tsi.hg19
hapmap3.r2.yri.hg19
1kg.phase1.v3.shapeit2.asn.hg19	Haplotpyes of 1000 Genomes Project version 3. Donwload from here.
1kg.phase1.shapeit2.v3.eur.hg19
1kg.phase1.shapeit2.v3.afr.hg19
1kg.phase1.v3.shapeit2.amr.hg19
1kg.phase3.v5.shapeit2.eur.hg19	Haplotpyes of 1000 Genomes Project Phase3 of version 5. Donwload from here.
1kg.phase3.v5.shapeit2.afr.hg19
1kg.phase3.v5.shapeit2.amr.hg19
1kg.phase3.v5.shapeit2.sas.hg19
1kg.phase3.v5.shapeit2.eas.hg19
We acknowledge the complied VCF data of 1000 Genomes projects by the author of MACH, Dr. LI Yun. To see detailed the description about the data, please visit http://www.sph.umich.edu/csg/abecasis/MACH/download/1000G.2012-03-14.html

An unexpectedly high pairwise relatedness (denoted by the PI_HAT of plink) suggests cross-individual contamination

Note To estimate the relatedness (the PI_HAT), you normally do NOT need the genotypes datasets from 1000 Genome Projects (which will require rather large memory). The merging with ancestry-matched HapMap genotype data is often sufficient for an accurate estimation of relatedness. However, the estimation of PI_HAT without reference genotypes in a small local sample will result in an over-estimated value because of the inaccurate allele frequencies derived from the tiny sample. plink --bfile kggseq --mendel plink --bfile kggseq --check-sex

Secondly, we provide plotting function (see Plotting part) to visualization the distribution of minor alleles per individual. The outlier from the distribution of total rare variants (MAF < 0.01, or novel in the common database) across all individuals can be easily detected and then ignored in the following analysis.

java -jar kggseq.jar --vcf-file path-to-file1 --out path/to/prefixname --mafplot java -jar kggseq.jar --vcf-file path-to-file1 --ped-file path/to/file2 --db-filter hg18_dbsnp138,hg18_dbsnp141 --allele-freq 0,0.01 --o-vcf path/to/file3

Note The meanings for -db-filter and --allele-freq in this section are provied in "Allele Frequency Trimming" section.

Filtering

KGGSeq offers simple but potentially powerful strategies to variants filtering and prioritization, which help users isolate most promising disease causal candiate variants from sequence variants to be considered.Users could prioritize variants based on some or all of the following evidences and/or functions: Genetic inheritance sharing, gene feature, allele frequency, biological function prediction, sequence variant type, and phenotype-relevant filtering.

Genetic inheritance sharing

Inheritance pattern filter: --genotype-filter

java -jar ./kggseq.jar --buildver hg19 --vcf-file path/to/file1 --ped-file path/to/file2 --genotype-filter 1,2

Filter out variants for which their genotypes are not consistent with the assumption of disease inheritance pattern.

Functions and applicable models for each code of '--genotype-filter'options are listed below:

Code	Function	Applicable model
1	Exclude variants at which affected subjects have heterozygous genotypes.	Recessive
2	Exclude variants at which both affected and unaffected subjects have the same homozygous genotypes.	Recessive and full penetrance causal mutation(s) in the sample; compound-heterozygosity
3	Exclude variants at which affected subjects have reference homozygous genotypes.	Dominant without consanguineous mating
4	Exclude variants at which both affected and unaffected subjects have the same heterozygous genotypes.	Dominant with full penetrance causal mutation(s) in the sample
5	Exclude variants at which affected subjects have alternative homozygous genotypes.	Dominant without heterogeneity
6	Exclude variants at which affected family members have NO shared alleles.	Full penetrance causal mutation(s) in the sample
7	ONLY include variants at which an offspring has one or two non-inherited alleles. Note Used as `--genotype-filter 4,7` to filter out the de novo mutations where both affected and unaffected subjects have the same heterozygous genotypes. Note Used as `--ignore-homo` to ignore the de novo mutations where affected and unaffected subjects have the same ALTERNATIVE homozygous genotypes.	Detect de novo mutations which have nearly full penetrance. In the main output file, there is a column named DenovoMutationEvent to record the genotypes of a child and his or her parents. Example: N140_0:0/1:46,59&N140_1:0/0:57,0&N140_2:0/0:68,0 The child N140_0 has genotype 0/1 with 46 and 59 reads carrying reference alleles and alternative alleles respectively. The father N140_1 and mother N140_2 are homozygous 0/0.
8	ONLY include variants at which 1) the paired non-tumor (unaffected) and tumor (affected) samples have different genotypes. Note The tag `--indiv-pair` has to be used together to specify the matched non-tumor and tumor samples, e.g. `--indiv-pair tumorIndivID1:normalIndivID1,tumorIndivID2:normalIndivID2`	Detect somatic mutation(s) among matched tumor and non-tumor samples.

Note Multiple filtration codes have logical OR relationship.

Note The inheritance mode based filtration is proposed under strong assumptions for rare Mendelian disease with clear inheritance mode. If the inheritance mode is elusive, such filtration is not suggested as it may lead to the miss of genuine causal mutation(s). This is particularly true when one has no sufficient information to distinguish the compound-heterozygosity diseases from the recessive diseases.

Extract mutations affecting both copies of a gene in a diploid: --double-hit-gene-trio-filter or --double-hit-gene-phased-filter

java -jar ./kggseq.jar --buildver hg19 --vcf-file path/to/file1 --ped-file path/to/file2 --db-gene refgene --gene-feature-in 0,1,2,3,4,5,6 --db-filter 1kg201204,dbsnp137,ESP6500AA,ESP6500EA --rare-allele-freq 0.033 --db-score dbnsfp --mendel-causing-predict best --double-hit-gene-trio-filter (OR --double-hit-gene-phased-filter)

This function is designed for a disorder suspected to be under compound-heterozygous or recessive inheritance mode, in which both copies of a gene on the two homologous chromosomes of a patient are damaged by two different mutations or the same mutation. For recessive mode, it simply checks variants with homozygous genotypes in patients. For the compound-heterozygous mode, it can use two different input data, phased genotypes of a patient or unphased genotypes in a trio. Here the trio refers to the two parents and an offspring. When these alleles causing a disease at one locus, it follows the recessive model; and when they are at two loci, it follows the compound-heterozygosity model. In both cases, a gene is hit twice. This is the reason why it has the name 'double-hit gene'.

Besides the file storing prioritized sequence variants, this analysis with --double-hit-gene-trio-filter option will lead to a list of double-hit genes in at least one subject. The counts of double-hit genes of each subject are saved in *.doublehit.gene.trios.flt.count.xls (or *.doublehit.gene.trios.flt.count.txt) and the genotypes of involved sequence variants are saved in *.doublehit.gene.trios.flt.gty.xls (or *.doublehit.gene.trios.flt.gty.txt).

The '*doublehit.gene.trios.flt.count*' and the '*doublehit.gene.trios.flt.gty*' files or sheets contain multiple columns:

Gene: Gene symbols
PubMed: The PubMed ID(s) of papers co-mentioning the gene and phenotypes(or diseases) specified by '--phenotype-term'.
ExonLen: The length of total exons in kb. The overlapped regions are ignored.
CountHitCase: Total number of subjects having double-hit variants in the gene.
<Subjects>: Whether a subject carry the double-hit variants,1 for yes and 0 for no in the 'flt.count' file or sheet. In the 'gty.count' file or sheet, the variants positions and genotypes are shown.
Example:
17797224:0/0,0/1,0/1;17827193:0/1,0/0,0/1: The genotypes of father, mother, and child are 0/0, 0/1 and 0/1 at the variant with coordinate 17797224. At the other variant, the genotypes of father, mother, and child are 0/1, 0/0 and 0/1. Apparently, both copies of this gene in the child are hit by the two mutations.

The corresponding files produced by the analysis with --double-hit-gene-phased-filter option are *.doublehit.gene.phased.flt.count.xls (or *.doublehit.gene.phased.flt.count.txt) and *.doublehit.gene.phased.flt.gty.xls (or *.doublehit.gene.phased.flt.gty.txt).

The '*doublehit.gene.phased.flt.count*' and the '*doublehit.gene.phased.flt.gty*' files or sheets contain multiple columns:

Gene: Gene symbols
PubMed: The PubMed ID(s) of papers co-mentioning the gene and phenotypes(or diseases) specified by '--phenotype-term'.
P: The p-values of chi-square test to examine whether the double-hit mutations are associated with a disease.
CountCtl: Total number of controls having double-hit variants in the gene.
CountCase: Total number of patients having double-hit variants in the gene.
<Subjects>: Whether a subject carry the double-hit variants,1 for yes and 0 for no in the 'flt.count' file or sheet. In the 'gty.count' file or sheet, the variants positions and genotypes are shown.
Example:
105352884:0|1;105353305:1|0: The genotypes of a subject are 0|1 and 1|0 at the variants with coordinate 105352884 and 105353305. Apparently, both copies of this gene in the subject are hit by the two mutations.

Note Both filters have a strong assumption of full penetrance.

Note The --double-hit-gene-trio-filter or --double-hit-gene-phased-filter only works for inputs specified by --vcf-file and requires parents' genotypes to infer the phases of the alleles. And sequence variants with missed parents' genotypes are ignored.

Note To ignore homozygous mutation (i.e., implicating the autosomal recessive model), please add the tag --ignore-homo.

Extract a gene on which EVERY affected subject has at least one case-unique alternative allele. But these alleles may be from different variants: --unique-gene-filter

java -jar ./kggseq.jar --buildver hg19 --vcf-file path/to/file1 --ped-file path/to/file2 --db-filter 1kg201204,dbsnp137,ESP6500AA,ESP6500EA --rare-allele-freq 0.02 --db-score dbnsfp --mendel-causing-predict best --unique-gene-filter

Sequence variants beyond this type of genes will be filtered out.

Note --unique-gene-filter only works for inputs specified by --vcf-file.

Filter sequence variants by Identical by state (IBS): --ibs-case-filter X

java -jar ./kggseq.jar --buildver hg19 --vcf-file path/to/file1 --ped-file path/to/file2 --ibs-case-filter X

Search the longest region (in minimal length X kb) of each interesting variant, in which there is at least one allele identical (IBS>0) among all cases at all consecutive variants. This function can be used to highlight potential causal mutations or exclude non-disease-relevant sequence variants as the causal mutation(s) is very likely to have long shared regions among all affected family members of Mendelian dominant and recessive diseases. However, IBS checking is just a quick but over-estimation of identical by descent (IBD), which is a more accurate measure of shared regions among related individuals. KGGSeq currently has no function to estimate IBD regions but allow users to input the IBD information estimated by third-party tools into KGGSeq for annotation.

Note --ibs-case-filter only works for inputs specified by --vcf-file.

In the main output file, two columns will be appended:

LongestIBSRegion:The longest region in which all consecutive variants hae IBS over 0 in patients.
Example: 'chr2:46259552-57303414#Var:26469StartAtTailEndAtTail' means in the IBS region chr2:46259552-57303414 there are 26469 consecutive variants. This region start with and end at the boundaries of the input regions or chromosomes
LongestIBSRegionLength(bp): The length of the region.

Identical by descent (IBD) annotation: --ibd-annot

java -jar ./kggseq.jar --buildver hg19 --vcf-file path/to/file1 --ped-file path/to/file2 --ibd-annot path/to/file3

User can also provide IBD (i.e., the two alleles arose from the same allele in an earlier generation) or significant linkage regions, which are estimated by a third party statistical genetics tools (such as BEAGLE, Merlin and SimWalk2). Variants within these regions will be highlighted. For an ibd or linkage annotation file, title line is needed, and the file format is like:

CHR START END
1 6134174 10082841
1 202855724 207862411

Note Title line is needed.

In the main output file, two columns will be appended:

IBDRegion: The region with IBD information.
IBDRegionAnnot: The annotation information at the region.

Filter sequence variants by homozygosity: --homozygosity-case-filter X

java -jar ./kggseq.jar --buildver hg19 --vcf-file path/to/file1 --ped-file path/to/file2 --homozygosity-case-filter X

Search the longest region (in minimal length X kb) of consecutive homozygous genotype for each interesting variant with minimal length X kb in cases but not in controls. For cancer diseases, this function can be used to examine the loss of heterozygosity straightforwardly. For rare recessive diseases, it can be used to examine the runs of homozygosity straightforwardly. But KGGSeq currently does not have methods to test the statistical significance.

Note --homozygosity-case-filter only works for inputs specified by --vcf-file.

In the main output file, two columns will be appended:

LongestHomozygosityRegion longest region in which all consecutive variants hae IBS over 0 in patients.
Example: 'chr2:46259552-57303414#Var:26469StartAtTailEndAtTail' means in the IBS region chr2:46259552-57303414 there are 26469 consecutive variants. This region start with and end at the boundaries of the input regions or chromosomes
LongestHomozygosityRegionLength(bp): The length of the region.

Gene feature filtering

Filtration according to gene features: --db-gene & --gene-feature-in

java -jar ./kggseq.jar --buildver hg19 --vcf-file path/to/file1 --ped-file path/to/file2 --dgv-cnv-annot --splicing 6 --db-gene refgene,gencode,knowngene --gene-feature-in 0,1,2,3,4,5,6

Note --gene-feature-context X allows KGGSeq to jointly annoate non-synonymous variants and splicing variants within X (say, X=5) bp for each subject. However, this requires phased genotypes in VCF format as specified by --vcf-phased.

The '--dbgene' sets database(s) to annotate and filter variants.

Gene database label	Supported reference genome	Description
refgene	hg18, hg19 and hg38	The RefGene database compiled by UCSC from hg18 refGene and hg19 refGene Note: RefSeq has NO mitochondria gene definition.
gencode	hg19 and hg38	The GENCODE gene sets, gencode.v19.annotation.gtf for hg19 and gencode.v23.annotation.gtf for hg38.See home page and the paper of GENCODE for details. Note: GENCODE contains similar number of coding genes but more transcripts than RefGene. It HAS the mitochondria gene definition.
knowngene	hg18, hg19 and hg38	The UCSC knonwGene database compiled by UCSC from hg18 knownGene and hg19 knownGene
path/to/file	-	KGGSeq also accepts a gene dataset customized by users. The example can be seen here. The following are the column names: bin: The bin ID number name: The name of a transcript chrom: The chromosome name strand: The strand information of the transcript txStart: The start site of a transcript txEnd: The end site of a transcript cdsStart: The start site of coding DNA sequence cdsEnd: The end site of coding DNA sequence exonCount: The number of exons exonStarts: The start sites of all exons exonEnds: The end sites of all exons score: mapping score name2: Gene name cdsStartStat: Un-used cdsEndStat: Un-used exonFrames: Un-used sequences: DNA sequences of exons delSites: Deletion sites on the human reference genome, compared to the latest cDNA insSites: Insertion sites on the human reference genome, compared to the latest cDNA

With this filter, variants beyond regions sepcified by '--gene-feature-in' will be excluded. It is '--gene-feature-in 0,1,2,3,4,5...,15' by default.

Illustration of number codes for the gene features:

Feature	Code	Explanation
Frameshift	0	Short insertion or deletion result in a completely different translation from the original.
Nonframeshift	1	Short insertion or deletion results in loss of amino acids in the translated proteins.
Startloss	2	Indels or nucleotide substitution result in the loss of start codon(ATG) (mutated into a non-start codon).
Stoploss	3	Indels or nucleotide substitution result in the loss of stop codons (TAG, TAA, TGA)
Stopgain	4	Indels or nucleotide substitution result in the new stop codons (TAG, TAA, TGA), which may truncate the protein.
Splicing	5	variant is within 2-bp of a splicing junction (use `--splicing x` to change this, the unit of x is base-pair)
Missense	6	Variants result in a codon coding for a different amino acid (missense)
Synonymous	7	Nucleotide substitution does not change amino acid.
Exonic	8	Due to loss of sequences in reference database, this variant can only be mapped into exonic region without more precise annotation.
UTR5	9	Within a 5' untranslated region
UTR3	10	Within a 3' untranslated region
Intronic	11	Within an intron
Upstream	12	Within 1-kb region upstream of transcription start site (use `--neargene x` to change this, the unit of x is base-pair)
Downstream	13	Within 1-kb region downtream of transcription end site (use `--neargene x` to change this, the unit of x is base-pair)
ncRNA	14	Within a transcript without protein-coding annotation in the gene definition (see Notes below for more explanation)
Intergenic	15	variant is in intergenic region
Monomorphic	16	It is not a sequence variation actually, which may be resulted from bugs in reference genome in variant calling.
Unknown	17	Variants has no annotation.

The output of this function is a list of variables providing all involved gene features of a considered variant.

MostImportantFeatureGene: The gene corresponds to the variant’s smallest gene feature code.
MostImportantGeneFeature: The variant’s smallest gene feature code.
RefGeneFeatures: All matched gene features in the reference gene database.
GENCODEFeatures: All matched gene features in the GENCODE gene database.

The gene feature annotations are described in accordance with the Human Genome Variation Society (HGVS) recommendations. The followings are some examples:

GBP4:NM_052941:c.1633A.G>C.C:p.M545L:(11Exons):exon10:missense
Means: At nucleotide 1633 and 1635 of gene GBP4's transcript NM_052941, the nucleotides A and G are changed to a C and C respectively, which jointly changes the 545^th amino-acid M to be L in this transcript's protein. This transcript has 11 exons in total and this variant is in the 10^th exon. The . denotes reference allele.
KLHL17:NM_198317:c.1918A>C:p.T640P:(12Exons):exon12:missense
Means: At nucleotide 1918 of gene KLHL17's transcript NM_198317, an A is changed to a C, which changes the 640^th amino-acid T to be P in this transcript's protein. This transcript has 12 exons in total and this variant is in the 12^th exon.
SYNGR1:NM_004711:(4Exons):exon4:c.C605ins+CAA:p.P202ins??
Means: At nucleotide 605 of gene SYNGR1's transcript NM_004711, there is a 3bp insertion. This transcript has 4 exons in total and this variant is in the 4^th exon. The '+' denotes an insertion. The affected peptide is unknown ('??')
PKD1L2:NM_001076780:c.A705delAA-:p.N236:(18Exons):exon4:frameshift
Means: At nucleotide 705 of gene PKD1L2's transcript NM_001076780, there is a 2bp deletion. This transcript has 18 exons in total and this variant is in the 4^th exon. The '-' denotes an deletion. The variant results in a frameshift of protein-coding .
DNAJC8:NM_014280:c.237+2T>G:(9Exons):exon3GTdonor
Means:This is a T to G substitution of the 2^nd nucleotide of the 2 ^nd intron (from its 5' part) positioned between coding DNA nucleotides 237 and 238, which will affect the GT splicing donor.
HNRNPCL2:NM_001136561:c.-169C>G:(2Exons):5UTR
Means: This is a C to G substitution of nucleotide, 169 nucleotides upstream of the ATG translation initiation codon (coding DNA -169).
PRAMEF4:NM_001009611:c.62C>T:(4Exons):3UTR*
Means: This is a C to T substitution of nucleotide coding DNA 62, located in the 3' UTR.
C1orf159:NM_017891:c.311-23T>C:(10Exons):intronic6
Means: This is a T to C substitution of the 23 nucleotide of the 6 ^th intron (from its 3' part) positioned between coding DNA nucleotides 310 and 311.
HNRNPCL1:NM_001013631:c.138G>A:(2Exons):downstream*
Means: This is a C to A substitution of nucleotide located 138 nucleotides downstream of the gene, i.e. the polyA-addition site.
TNFRSF18:NM_148901:c.-61C>T:(4Exons):upstream
Means: This is a C to T substitution of nucleotide, 61 nucleotides upstream of the ATG translation initiation codon (coding DNA -61).

Table: Compare non-synonymous gene feature annotation of three popular tools and two gene models for variants discovered in 1000 Genomes Project AFR panel

Note: The RefGene model was provided by UCSC on Oct 10, 2015 for KGGSeq; and for ANNOVAR and the latest RefGene model (by Oct 10, 2015) from their websites was used. a: Because ANNOVAR has no startloss category but annotates startloss as missense, no comparison was made for this category. b: Because dbSNP has no startloss category, no consistency was calculated for this category. c: rate of consistency with annotations in the dbSNP. The default parameters settings of each tool were used for the annotation.

Allele frequency trimming by 4 ways

1. Allele frequency trimming according to database: --db-filter

java -jar ./kggseq.jar --buildver hg19 --vcf-file path/to/file1 --ped-file path/to/file2  --db-filter 1kg201204,dbsnp141,ESP6500AA,ESP6500EA  --rare-allele-freq 0.005 --db-filter-hard dbsnp138nf

This mode enables variants filtering according to allele frequencies recorded in public data sources.
The '--db-filter' is use to set databases for filtration. In the above case, variants with alternative allele frequency EQUAL to or over 0.005 will be excluded. No spaces are allowed between and within database names.
Moreover, '--db-filter-hard' is use to filter ALL existing variants in a database. But it should be always careful to use this hard filtering.

Note1. To keep variants in a frequency range,[a,b], please use '--allele-freq a,b' option. The boundaries a and b are included.
Note2. The options, '--allele-freq' and '--rare-allele-freq', are mutually exclusive and the latter has higher priority.
Note3. By default, '--rare-allele-freq 0.01', will be used if allele frequency cutoff is set.
Note4. Difference between '--rare-allele-freq' and '--allele-freq': The '--rare-allele-freq 0,0.01' does not consider the variants with missing frequency in public databases, but '--rare-allele-freq 0.01' does. So '--rare-allele-freq 0.01' returns much more variants than '--rare-allele-freq 0,0.01'.

List of public databases for allele frequency trimming

DB Label	Explanation
1kgeur201305	495 subjects in the EUR panel of 1000 Genomes Project release in 2013 May (around 24.0 million sequence variants)
1kgeas201305	496 subjects in the EAS panel of 1000 Genomes Project release in 2013 May (around 23.5 million sequence variants)
1kgafr201305	645 subjects in the AFR panel of 1000 Genomes Project release in 2013 May (around 41.7 million sequence variants)
1kgsas201305	485 subjects in the SAS panel of 1000 Genomes Project release in 2013 May (around 26.7 million sequence variants)
1kgamr201305	346 subjects in the AMR panel of 1000 Genomes Project release in 2013 May (around 28.2 million sequence variants)
1kg201305	All sample of 1000 Genomes Project release 2013 May (over 81 million sequence variants from 2504 subjects)
1kg201204	All sample of 1000 Genomes Project release 2012 April
1kgafr201204	African descendent samples of 1000 Genomes Project release in 2012 April
1kgeur201204	European descendent samples of 1000 Genomes Project release in 2012 April
1kgamr201204	American descendent samples of 1000 Genomes Project release in 2012 April
1kgasn201204	Asian descendent samples of 1000 Genomes Project release in 2012 April
dbsnp135	dbSNP version 135, compiled from UCSC.
dbsnp137	dbSNP version 137, compiled from UCSC.
dbsnp138	dbSNP version 138, compiled from UCSC.
dbsnp138nf	dbSNP version 138 without the flagged SNPs by UCSC (so nonflagged: nf). In UCSC the flagged SNPs include SNPs clinically associated by dbSNP, mapped to a single location in the reference genome assembly, and not known to have a minor allele frequency of at least 1%.
dbsnp141	dbSNP version 141, compiled from NCBI dbSNP V141. Note Indels are ignored.
ESP5400	a public variants dataset from NHLBI GO Exome Sequencing Project (ESP)
ESP6500AA	a public variants dataset of African American from NHLBI GO Exome Sequencing Project (ESP)
ESP6500EA	a public variants dataset of European American from NHLBI GO Exome Sequencing Project (ESP)
exac	Exome Aggregation Consortium (ExAC r0.3.1): Variants from 60,706 unrelated individuals sequenced. The samples are partitioned in to 7 ancestrally different panels, East Asian(eas, n=4327), South Asian (sas, n=8256), African/African American(afr, n= 5203), American(amr, n=5789 ), Finnish(fin, n=3307), Non-Finnish European(nfe, n=33370), Other (oth, n=454). By default (`--db-filter exac`), KGGSeq will use frequencies of all of the 7 panels for annotation and filtering. You can also specify the panels of your interests by the abbreviations. For example, you can specify the frequencies by of East Asian and Non-Finnish European by `--db-filter exac.eas.nfe`.
ehr	DiscovEHR: More than 4 million Variants from more than 50,000 MyCode® participants. The DiscovEHR Collaboration between the Regeneron Genetics Center and Geisinger Health System brings together high throughput DNA sequencing with longitudinal electronic health records for discovery of genetic variation important for human disease and therapeutic response.
gadexome	Genome Aggregation Database: More than 15 million Variants from more than 123,136 participants sequenced in exomes. The samples are partitioned in to 8 ancestrally different panels, East Asian(eas, n=8,624), South Asian (sas, n=15,391), African/African American(afr, n=7,652), Latino (amr, n=16,791), Finnish(fin, n=11,150), Non-Finnish European(nfe, n=55,860),Ashkenazi Jewish (asj, n=4,925), Other (oth, n=2,743). By default (`--db-filter gadexome`), KGGSeq will use frequencies of all of the 8 panels for annotation and filtering. You can also specify the panels of your interests by the abbreviations. For example, you can specify the frequencies by of East Asian and Non-Finnish European by `--db-filter gadexome.eas.nfe`.
gadgenome	Genome Aggregation Database: More than 229 million Variants from more than 15,496 participants sequenced in whole genomes. The samples are partitioned in to 7 ancestrally different panels, East Asian(eas, n=881), African/African American(afr, n=4,368), Latino (amr, n=419), Finnish(fin, n=1,747), Non-Finnish European(nfe, n=7,509),Ashkenazi Jewish (asj, n=151), Other (oth, n=491). By default (`--db-filter gadgenome`), KGGSeq will use frequencies of all of the 7 panels for annotation and filtering. You can also specify the panels of your interests by the abbreviations. For example, you can specify the frequencies by of East Asian and Non-Finnish European by `--db-filter gadgenome.eas.nfe`.

The output of this function is a list of variables providing all involved gene features of a considered variant.

MaxDBAltAF: The maximum alternative allele frequency across datasets specified.
altFreq@hgXX_XXXXX: The alternative allele frequency at a dataset hgXX_XXXXX.
NoteVariants without allele frequencies in reference databases are marked by '.' in the results. Moreover, Variants not EXISTING in reference database are marked by 'N'.

2. Users can also provide local filters to filter variants. --local-filter

java -jar ./kggseq.jar --buildver hg19 --vcf-file path/to/file1 --ped-file path/to/file2 --local-filter path/to/file3 --rare-allele-freq 0.005 --local-filter-hard path/to/file4

Moreover, '--local-filter-hard' is use to filter ALL existing variants in a database. Multiple local datasets can be specified for filtering at a time, '--local-filter path/to/file3.1,path/to/file3.2'.

The '--local-filter' option sets local datasets for filterating. KGGSeq accepts local dataset files in the following format: each file has 5 columns [Chromosome, Physical Position, Reference Allele, Alternative Allele(s), and Frequency of alternative allele(s)], separated by tab character. KGGSeq provides a function to convert data in VCF format to that in the local dataset format:

java -jar ./kggseq.jar --make-filter --vcf-file path/to/file --out test.var An example file is like:

Chrom Position Reference Alternative Freq
1 469 C G 0.150
1 492 C T 0.175
1 519 G C 0.067
1 874290 - 0ACAGAG 0.809
1 875913 CAG 3 0.8431

Note The title line is needed and Freq is optional.

3. An in-house VCF file can be used for allele frequency trimming: --local-filter-vcf

java -jar ./kggseq.jar --buildver hg19 --vcf-file path/to/file1 --ped-file path/to/file2 --local-filter-vcf path/to/file3 --rare-allele-freq 0.005 --local-filter-vcf-hard path/to/file4

If the data were stored in different files chromosome by chromosome, you can use ' _CHROM_' to denote the chromosome names [1...Y] or directly specify [1,2,X] in the file name so that KGGSeq can read data in multiple files in a run.

Similarly, '--local-filter-vcf-hard' is use to filter ALL existing variants in a database.

4. An in-house VCF file without genotypes can also be used for allele frequency trimming: --local-filter-no-gty-vcf

java -jar ./kggseq.jar --buildver hg19 --vcf-file path/to/file1 --ped-file path/to/file2 --local-filter-no-gty-vcf path/to/file3 --rare-allele-freq 0.005 --local-filter-no-gty-vcf-hard path/to/file4

Similarly, '--local-filter-no-gty-vcf-hard' is use to filter ALL existing variants in a database.

5. Filter variants by minor allele frequency in the input sample: --filter-sample-maf-le or --filter-sample-maf-oe

java -jar ./kggseq.jar --buildver hg19 --vcf-file path/to/file1 --ped-file path/to/file2 --filter-sample-maf-le 0.01

--filter-sample-maf-le: Filter out variants with minor allele frequency less than OR equal to a cutoff (x) in the input samples.
--filter-sample-maf-oe: Filter out variants with minor allele frequency over OR equal to a cutoff (x) in the input samples.

6. Filter variants by minor allele frequency in the input sample: --filter-case-maf-le or --filter-case-maf-oe

java -jar ./kggseq.jar --buildver hg19 --vcf-file path/to/file1 --ped-file path/to/file2 --filter-case-maf-le 0.01

--filter-case-maf-le: Filter out variants with minor allele frequency less than OR equal to a cutoff (x) in the input cases.
--filter-case-maf-oe: Filter out variants with minor allele frequency over OR equal to a cutoff (x) in the input cases.

7. Filter variants by minor allele frequency in the input controls --filter-control-maf-le or --filter-control-maf-oe

java -jar ./kggseq.jar --buildver hg19 --vcf-file path/to/file1 --ped-file path/to/file2 --filter-control-maf-le 0.01

--filter-control-maf-le: Filter out variants with minor allele frequency less than OR equal to a cutoff (x) in the input controls.
--filter-control-maf-oe: Filter out variants with minor allele frequency over OR equal to a cutoff (x) in the input controls.

Variant type, region and gene filtering

Ignore copy number variation variants (CNV): --ignore-cnv

java -jar ./kggseq.jar --buildver hg19 --vcf-file path/to/file1 --ped-file path/to/file2 --ignore-cnv

Ignore insertion and deletion sequence variants (indels): --ignore-indel

java -jar ./kggseq.jar --buildver hg19 --vcf-file path/to/file1 --ped-file path/to/file2 --ignore-indel

Accordingly, users can also ask KGGSeq to ingore Single nucleotide variants (SNVs) : --ignore-snv

java -jar ./kggseq.jar --buildver hg19 --vcf-file path/to/file1 --ped-file path/to/file2 --ignore-snv

By default, KGGSeq will process all the three types of variants.

Ingore single nucleotide variants (SNVs) with alleles of C/T or T/C: --ignore-ct-var

java -jar ./kggseq.jar --buildver hg19 --vcf-file path/to/file1 --ped-file path/to/file2 --ignore-ct-var

Ingore single nucleotide variants (SNVs) without alleles of C/T and T/C: --only-ct-var

java -jar ./kggseq.jar --buildver hg19 --vcf-file path/to/file1 --ped-file path/to/file2 --ignore-ct-var

By default, KGGSeq will process all variants regardless of C/T allele. The option could be used with WITER or RUNNER for gene-based mutation burden tests .

Ignore some genomic regions: --regions-out

java -jar ./kggseq.jar --buildver hg19 --vcf-file path/to/file1 --ped-file path/to/file2 --regions-out chrX,chrY:1-10000 Alternatively, one can also exclusively process sequence variants at some genome regions: java -jar ./kggseq.jar --buildver hg19 --vcf-file path/to/file1 --ped-file path/to/file2 --regions-in chr2,chr4:21212-233454

Ignore variants within some genes: --genes-out

java -jar ./kggseq.jar --buildver hg19 --vcf-file path/to/file1 --ped-file path/to/file2 --genes-out TCF4,CNNM2,ANK3 Alternatively, one can KEEP variants only within some genes: java -jar ./kggseq.jar --buildver hg19 --vcf-file path/to/file1 --ped-file path/to/file2 --genes-in TCF4,CNNM2,ANK3

Filter out genes with too many variants: --ignore-gene-over-var

java -jar ./kggseq.jar --buildver hg19 --vcf-file path/to/file1 --ped-file path/to/file2 [options to filter out common and neutral variants] --ignore-gene-over-var 5

After a number of hard-filtering by allele frequencies and pathogenic prediction, you may see some genes harboring candidate variants in one or more persons. Empirically, these genes are usually not interesting and their mutations are often produced by some unknown factors such pseudogene or platform bias. This option allows users to remove all mutations of this person at this problematic gene. As a rule of thumb, it is safe to set a cutoff 5 or more.

Filter out genes with too few variants: --ignore-gene-fewer-var

java -jar ./kggseq.jar --buildver hg19 --vcf-file path/to/file1 --ped-file path/to/file2 [options to filter out common and neutral variants] --ignore-gene-fewer-var 1

After a number of hard-filtering by allele frequencies and pathogenic prediction, you may see some genes harboring two few variants. Empirically, in genes-based association tests for rare variants, it may be better to exclude genes with only 1 variant.

Super duplicate region filtering

Filter out variants in super duplicate regions: --superdup-filter

java -jar ./kggseq.jar --buildver hg19 --vcf-file path/to/file1 --ped-file path/to/file2 --superdup-filter

KGGSeq can eliminate sequence variants within putative super-duplicate genomic regions defined in a dataset (genomicSuperDups) from UCSC datasets , which have higher genotyping error rate. (See more ).

Note Another option --superdup-annot allows you to just mark (but not to remove) the variants in the super duplicate regions. A column named SuperDupKValue (K-value calculated with Jukes-Cantor model) will be added in the resulting dataset

Case-control Phenotype

Filter for case or control unique variants: --filter-model case-unique or --filter-model control-unique java -jar ./kggseq.jar --buildver hg19 --vcf-file path/to/file1 --ped-file path/to/file2 --filter-model case-unique Case unique: extract variants with alternative alleles only present among affected individuals.
Control unique: extract variants with alternative alleles only present among unaffected individuals.

Note The --filter-model only works for inputs specified by --vcf-file .
Note This function can work with options to flexibly set the sample size and missingness of genotypes, detalied in the Variant QC part.

Expression at specified tissues

Select variants expressed at a number of tissues or cell-types --var-expression-tissues --min-var-expression

java -jar ./kggseq.jar --buildver hg19  --vcf-file path/to/file1 --ped-file path/to/file2 --var-expression-tissues TissueLabel1,TissueLabel2,TissueLabel3 --min-var-expression 0.1

--var-expression-tissues: specify the tissue of cell types where the coding variants are expressed. The expression information is retrieved from a Nature paper, which were generated based on the GTEx dataset. The following are tissue or cell-type names.

Number	TissueLabel	Explanation
1	Cells_Transformedfibroblasts	Cells of transformed fibroblasts.
2	Prostate	Prostate
3	Spleen	Spleen
4	Brain_FrontalCortex_BA9_	Brain_FrontalCortex_BA9_
5	SmallIntestine_TerminalIleum	SmallIntestine_TerminalIleum
6	MinorSalivaryGland	MinorSalivaryGland
7	Artery_Coronary	Artery_Coronary
8	Skin_SunExposed_Lowerleg_	Skin_SunExposed_Lowerleg_
9	Cells_EBV_transformedlymphocytes	Cells_EBV_transformedlymphocytes
10	Brain_Hippocampus	Brain_Hippocampus
11	Esophagus_Muscularis	Esophagus_Muscularis
12	Brain_Nucleusaccumbens_basalganglia_	Brain_Nucleusaccumbens_basalganglia_
13	Artery_Tibial	Artery_Tibial
14	Brain_Hypothalamus	Brain_Hypothalamus
15	Adipose_Visceral_Omentum_	Adipose_Visceral_Omentum_
16	Cervix_Ectocervix	Cervix_Ectocervix
17	Brain_Spinalcord_cervicalc_1_	Brain_Spinalcord_cervicalc_1_
18	Brain_CerebellarHemisphere	Brain_CerebellarHemisphere
19	Nerve_Tibial	Nerve_Tibial
20	Breast_MammaryTissue	Breast_MammaryTissue
21	Liver	Liver
22	Skin_NotSunExposed_Suprapubic_	Skin_NotSunExposed_Suprapubic_
23	AdrenalGland	AdrenalGland
24	Vagina	Vagina
25	Pancreas	Pancreas
26	Lung	Lung
27	FallopianTube	FallopianTube
28	Pituitary	Pituitary
29	Muscle_Skeletal	Muscle_Skeletal
30	Colon_Transverse	Colon_Transverse
31	Artery_Aorta	Artery_Aorta
32	Heart_AtrialAppendage	Heart_AtrialAppendage
33	Adipose_Subcutaneous	Adipose_Subcutaneous
34	Esophagus_Mucosa	Esophagus_Mucosa
35	Heart_LeftVentricle	Heart_LeftVentricle
36	Brain_Cerebellum	Brain_Cerebellum
37	Brain_Cortex	Brain_Cortex
38	Thyroid	Thyroid
39	Brain_Substantianigra	Brain_Substantianigra
40	Kidney_Cortex	Kidney_Cortex
41	Uterus	Uterus
42	Stomach	Stomach
43	WholeBlood	WholeBlood
44	Bladder	Bladder
45	Brain_Anteriorcingulatecortex_BA24_	Brain_Anteriorcingulatecortex_BA24_
46	Brain_Putamen_basalganglia_	Brain_Putamen_basalganglia_
47	Brain_Caudate_basalganglia_	Brain_Caudate_basalganglia_
48	Colon_Sigmoid	Colon_Sigmoid
49	Cervix_Endocervix	Cervix_Endocervix
50	Ovary	Ovary
51	Esophagus_GastroesophagealJunction	Esophagus_GastroesophagealJunction
52	Testis	Testis
53	Brain_Amygdala	Brain_Amygdala
54	mean_proportion	Averaged proportion in all tissues/cell-types

--min-var-expression: filter out variants with the standardized expression ≤ the specified value, (ranged from 0 to 1).

Annotation

dbSNP RSID

Assign dbSNP RSID to variants:--rsid
java -jar kggseq.jar --vcf-file path/to/file1 --ped-file path/to/file2 --rsid KGGSeq can update the variants' dbSNP rsID (if available) according to their genomic coordinates. The resource data used are from SNPChrPosOnRef (hg19/hg38) [help].

Alternative splicing

Potential of altering splicing--scsnv-annot
java -jar kggseq.jar --vcf-file path/to/file1 --ped-file path/to/file2 --scsnv-annot Annotate a SNV with predicted potential of altering splicing from dbscSNV , which include all possible human SNVs within splicing consensus regions (-3 to +8 at the 5' splice site and -12 to +2 at the 3' splice site).
The command will append a field to the output file.

ada_score@dbScSNV: The prediction score by ADA approach which combines multiple existing tools. See more in Jian et al (2014)

Structure variation

Map a variant against known structure variation --dgv-cnv-annot
java -jar kggseq.jar --vcf-file path/to/file1 --ped-file path/to/file2 --dgv-cnv-annot Annotate a variant whether it is within a known insertion or deletion registered in the Database of Genomic Variants , which will help exclude candidates which are not interesting.
The command will append a field to the output file.

DGVIDs: the ID in the Database of Genomic Variants.
CNVSampleSize: the size of sample, used to derive the structure.
LossCNV: Number of subjects having loss of copy number variation.
GainCNV: Number of subjects having excessive copy number variation.

Shared protein-protein interactions (PPIs)

Examine whether variants share the same PPIs with some genes of interest: --candi-list gene1,gene2,...,geneN --ppi-annot ppiDatabase

java -jar kggseq.jar --vcf-file path/to/file1 --ped-file path/to/file2 --candi-list gene1,gene2,...,geneN --ppi-annot string --ppi-depth 1

Note Currently, the only supported PPI database is STRING (i.e., string).

Note Instead of specifying the gene list using the --candi-list option, users can specify a file containing the list using the --candi-file option.

In the file containing candidate genes, each row can only have one gene. The format is like the following:

GeneSymbol1
GeneSymbol2
GeneSymbol3
GeneSymbol4

Note The maximum distance of a PPI between a specifified candidate genes and a gene containing the potential causal variants can be adjusted using the --ppi-depth option. The defaul is --ppi-depth 1.

The option will append the following fields to the output file:

IsWithinCandidateGene: Whether the variant is within a candidate gene
PPI: PPI shared by at least one gene of interest and the gene containing the variant

Shared genesets

Examine whether variants share the same genesets with some genes of interest: --candi-list gene1,gene2,...,geneN --geneset-annot GenesetDatabase

java -jar kggseq.jar --vcf-file path/to/file1 --ped-file path/to/file2 --candi-list gene1,gene2,...,geneN --geneset-annot GeneSetDatabase

Currently, users can choose between 5 geneset databases from GSEA. The following are brief descriptions of the 5 datasets COPIED from the page of GSEA.

GeneSetDatabase	Description	Version
cano	Gene sets from the geneset databases. Usually, these gene sets are canonical representations of a biological process compiled by domain experts. (1452 gene sets)	MSigDB 3.1
cura	Curated gene sets: Gene sets collected from various sources such as online geneset databases, publications in PubMed, and knowledge of domain experts. The gene set page for each gene set lists its source. (4850 gene sets)	MSigDB 3.1
onco	Oncogenic signatures: Gene sets represent signatures of cellular genesets which are often dis-regulated in cancer. The majority of signatures were generated directly from microarray data from NCBI GEO or from internal unpublished profiling experiments which involved perturbation of known cancer genes. In addition, a small number of oncogenic signatures were curated from scientific publications. (189 gene sets)	MSigDB 3.1
cmop	Computational gene sets: Computational gene sets defined by mining large collections of cancer-oriented microarray data. (858 gene sets)	MSigDB 3.1
onto	GO gene sets Gene sets are named by GO term and contain genes annotated by that term. GSEA users: Gene set enrichment analysis identifies gene sets consisting of co-regulated genes; GO gene sets are based on ontologies and do not necessarily comprise co-regulated genes. (1454 gene sets)	MSigDB 3.1

Note The --candi-list and --candi-file options can be shared with --ppi-annot.

The option will append the following fields to the output file:

IsWithinCandidateGene: Whether the variant is within a candidate gene
SharedGeneSet: GeneSet shared by at least one gene of interest and the gene containing the variant

Mouse Phenotype

Annotate the genes with known mouse phenotypes as reference: --mouse-pheno. java -jar kggseq.jar --vcf-file path/to/file1 --ped-file path/to/file2 --mouse-pheno The genes harboring interested mutations will be annotated by known mouse phenotypes of ortholog genes in the file of Mouse/Human Orthology with Phenotype Annotations from Mouse Genome Informatics(MGI) and in the file of phenotype-genotype from International Mouse Phenotyping Consortium (IMPC)
An column will be appended in the main output file:

MousePhenotype (MGI):mouse phenotypes of ortholog genes. Multiple phenotypes are separated by '||'.
MousePhenotype (IMPC):mouse phenotypes of ortholog genes. Multiple phenotypes are separated by '||'.

Zebrafish Phenotype

Annotate the genes with known zebrafish phenotypes as reference: --zebrafish-pheno. java -jar kggseq.jar --vcf-file path/to/file1 --ped-file path/to/file2 --zebrafish-pheno The genes harboring interested mutations will be annotated by known zebrafish phenotypes of ortholog genes in the file of Phenotypic Zebrafish genes with Human Orthology from The Zebrafish Model Organism Database.
An column will be appended in the main output file:

ZebrafishPhenotype :Affected structure or process subterm after the knock-out of the human ortholog genes in Zebrafish.

DDD Study

Annotate by disease names in Deciphering Developmental Disorders (DDD) study: --ddd-annot. java -jar kggseq.jar --vcf-file path/to/file1 --ped-file path/to/file2 --ddd-annot The genes harboring interested mutations will be annotated by the documented disease names in Gene-based genotype to phenotype of Deciphering Developmental Disorders (DDD) study .
An column will be appended in the main output file:

DDDPhenotype : confirmation status;disease;pubmed_ids

PubMed literature

Assign PubMed ID to a gene:--phenotype-term searchTerm and --pubmed-mining

java -jar kggseq.jar --vcf-file path/to/file1 --ped-file path/to/file2 --phenotype-term search+Term1,search+Term2 --pubmed-mining

KGGSeq can mine the titles and abstracts of published papers in PubMed via the NCBI E-utilities to find a co-mention with the searched term(s) of interest (i.e., disease) and the genes/cytogeneic regions in which the variants are located. The option --pubmed-mining will append the following two fields to the output file.

PubMedIDIdeogram : PubMed ID of articles in which the term and the cytogeneic position of the variant are co-mentioned
PubMedIDGene : PubMed ID of articles in which the term and the gene containing the variant are co-mentioned

IMPORTANT Please use + instead of blank characters within a search term.

OMIM

Assign OMIM term to a gene --omim-annot
java -jar kggseq.jar --vcf-file path/to/file1 --ped-file path/to/file2 --omim-annot KGGSeq can extract all available disorder names linked to a gene in which the variants are located from the OMIM dataset morbidmap. The command will append two extra fields to the output file:

DiseaseName(s)MIMid : Disorder, <disorder MIM no.> (<phene mapping key>)
Phenotype mapping method <phene mapping key>:
1 - the disorder is placed on the map based on its association with a gene, but the underlying defect is not known.
2 - the disorder has been placed on the map by linkage; no mutation has been found.
3 - the molecular basis for the disorder is known; a mutation has been found in the gene.
4 - a contiguous gene deletion or duplication syndrome, multiple genes are deleted or duplicated causing the phenotype.
GeneMIMid : Gene/locus MIM no.

COSMIC cancer information

Assign COSMIC cancer annotation to a gene: --cosmic-annot

java -jar kggseq.jar --vcf-file path/to/file1 --ped-file path/to/file2 --indiv-pair NONTUMOR.1:TUMOR.1,NONTUMOR.2:TUMOR.2 --genotype-filter 8 --cosmic-annot

KGGSeq can retrieve the confirmed somatic mutation information registered for variants and genes in the COSMIC database. The command will append a field to the output file. e.g.,

COSMICCancerInfo : known cancer and occurrence time of the somatic mutation in COSMIC dataset. e.g., {serous_carcinoma=3, astrocytoma_Grade_IV=1}
This somatic mutation occurred 3 times in samples of 'serous_carcinoma' and once in samples of 'astrocytoma_Grade_IV'.

Prediction at variant

KGGSeq provides a series of methods to combine existing functional scores of DNA variants from multiple algorithms (i.e., SIFT, PolyPhen2, GWAVA and CADD) for a more accurate pathogenic prediction at coding and non-coding variants.

Predicting Mendelian disease-causing variants

Predict Mendelian disease-causing variants: --db-score dbnsfp --mendel-causing-predict
java -jar kggseq.jar --vcf-file path/to/file1 --ped-file path/to/file2 --db-score dbnsfp --mendel-causing-predict best --filter-nondisease-variant Use functional or deleterious scores (listed below) collected in dbNSFP v3.0+ database to RE-predict whether a nonsynonymous single nucleotide variant (SNV) will potentially be Mendelian disease causal or not by Logistic regression model (See methods in our paper, Li et al. PLoS Genet. 2013 Jan;9(1):e1003143). By default, KGGSeq try the top 10 subsets of the scores (according to their area under the curves (AUC) of receiver operating characteristic (ROC)) for a combinatorial prediction until a positive prediction occurs. This is a way to relief influence of missing values. Users can try more subsets by a number on the option, say --mendel-causing-predict best30. However, a larger number will increase the false positive rate. In conjunction with the prediction model, the --filter-nondisease-variant tag will be used to filter out the nondisease-causative variants predicted by Logistic regression model (See methods in Li et al. PLoS Genet. 2013 Jan;9(1):e1003143). .Note that this trying will increase the false positive rate although it decreases the false negative rate. So you may need more information to justify the involved gene(s).

Figure: Receiver operating characteristic (ROC) and Precision recall(PR) curves and area under the curves (AUC) of individual scores and combined score by Logistic regression model
Note: this is an updated plot of Figure 1 in our paper . Here we combined more individual functional impact scores, which is the only difference.

Although the logistic regression model was proposed more than 6 years ago,it has comparable performance with two recent methods, REVEL (Am J Hum Genet. 2016) and M-CAP (Nat Genet. 2016) .

Figure: The performance comparison of KGGSeq´s pathogenic prediction with REVEL and M-CAP. Note: A) area under the curve of receiver operating characteristic, B)area under the curve of precision-recall. This plot is from KGGSeq's reference paper. Please read details in producing the plot in the paper .

On the other hand, one can FIX the prediction using a specified subset or full set of the 5 impact scores by option like --mendel-causing-predict 4,6,7
The coding for the functional or deleterious scores used in'--mendel-causing-predict'options is listed below:

Coding	Method	Description (Mostly copied from dbNSFP)
1	SIFT	SIFT uses the 'Sorting Tolerant From Intolerant' (SIFT) algorithm to predict whether a single amino acid substitution affects protein function or not, based on the assumption that important amino acids in a protein sequence should be conserved throughout evolution and substitutions at highly conserved sites are expected to affect protein function.A small scoreindicates a high chance for a substitutionto damage the protein function.
2	Polyphen2_HDIV	Polyphen2 score based on HumDiv, i.e. hdiv_prob. The score ranges from 0 to 1, and the corresponding prediction is "probably damaging" if it is in [0.957,1]; "possibly damaging" if it is in [0.453,0.956]; "benign" if it is in [0,0.452]. Score cutoff for binary classification is 0.5, i.e. the prediction is "neutral" if the score is smaller than 0.5 and "deleterious" if the score is larger than 0.5. Multiple entries separated by ";"
3	Polyphen2_HVAR	Polyphen2 predicts the possible impact of an amino acid substitution on the structure and function of a human protein using straightforward physical and comparative considerations by an iterative greedy algorithm. In the present study, we use the original scores generated by the HumVar (instead ofHumDiv) trained model as it is preferred for the diagnosis of Mendelian diseases. The scores range from 0 to 1. A substitution with larger score has a higher possibility to damage the protein function.
4	LRT	LRT employed a likelihood ratio test to assess variant deleteriousness based on a comparative genomics data set of 32 vertebrate species. The identified deleterious mutations could disrupt highly conserved amino acids within protein-coding sequences, which are likely to be unconditionally deleterious.The scores range from 0 to 1. A larger score indicates a larger deleterious effect.
5	MutationTaster	MutationTaster assesses the impact of the disease-causing potential of a sequence variant by a naive Bayes classifier using multiple resources such as evolutionary conservation, splice-site changes, loss of protein features and changes that might affect mRNA level. The scores range from 0 to 1. The larger score suggests a higher probability to cause a human disease.
6	MutationAssessor	MutationAssessor "functional impact of a variant : predicted functional (high, medium), predicted non-functional (low, neutral)" Please refer to Reva et al. Nucl. Acids Res. (2011) 39(17):e118 for details
7	FATHMM	FATHMM default score (weighted for human inherited-disease mutations with Disease Ontology); If a score is smaller than -1.5 the corresponding NS is predicted as "D(AMAGING)"; otherwise it is predicted as "T(OLERATED)". If there's more than one scores associated with the same NS due to isoforms, the smallest score (most damaging) was used. Please refer to Shihab et al Hum. Mut. (2013) 34(1):57-65 for details
8	VEST3	VEST 3.0 score. Score ranges from 0 to 1. The larger the score the more likely the mutation may cause functional change. In case there are multiple scores for the same variant, the largest score (most damaging) is presented. Please refer to Carter et al., (2013) BMC Genomics. 14(3) 1-16 for details. Please note this score is free for non-commercial use. For more details please refer to http://wiki.chasmsoftware.org/index.php/SoftwareLicense. Commercial users should contact the Johns Hopkins Technology Transfer office.
9	PROVEAN	PROVEAN score (PROVEANori). Scores range from -14 to 14. The smaller the score the more likely the SNP has damaging effect..
10	CADD	Combined Annotation Dependent Depletion (CADD) score for funtional prediction of a SNP. Please refer to Kircher et al. (2014) Nature Genetics 46(3):310-5 for details. The larger the score the more likely the SNP has damaging effect.
11	GERP++_NR	Neutral rate
12	GERP++_RS	RS score, the larger the score, the more conserved the site
13	phastCons7way_vertebrate	phastCons conservation score based on the multiple alignments of 7 vertebrate genomes (including human). The larger the score, the more conserved the site. Scores range from 0 to 1.
14	SiPhy_29way_logOdds	SiPhy score based on 29 mammals genomes. The larger the score, the more conserved the site. Scores range from 0 to 37.9718 in dbNSFP.

Note Predictions at variants with missing scores at specified methods will be ignored!
Note If you use this feature in any published work, please cite the following manuscript: Li et al. Predicting Mendelian disease-causing non-synonymous single nucleotide variants in exome sequencing studies. PLoS Genet. 2013 Jan;9(1):e1003143

The option will append the following fields to the output file:

...
Polyphen2_HDIV_pred: Polyphen2 prediction based on HumDiv, "D" ("probably damaging"), "P" ("possibly damaging") and "B" ("benign"). Multiple entries separated by ";"
Polyphen2_HVAR_pred: Polyphen2 prediction based on HumVar, "D" ("probably damaging"), "P" ("possibly damaging") and "B" ("benign"). Multiple entries separated by ";"
LRT_pred: Classification using LRT (D = deleterious, N = neutral, or U = unknown)
MutationTaster_pred: Classification using MutationTaster (A = disease_causing_automatic, D = disease_causing, N = polymorphism, or P = polymorphism_automatic)
MutationAssessor_pred: MutationAssessor "functional impact of a variant : predicted functional (high, medium), predicted non-functional (low, neutral)"
MetaSVM_pred: Prediction of an SVM based ensemble prediction score by Dong et al. [(2015) Human Molecular Genetics 24(8):2125-2137]."T(olerated)" or"D(amaging)".
MetaLR_pred: Prediction of a MetaLR based ensemble prediction score by Dong et al. [(2015) Human Molecular Genetics 24(8):2125-2137]."T(olerated)" or "D(amaging)".
clinvar_clnsig: clinical significance as to the clinvar data set; 2 - Benign, 3 - Likely benign, 4 - Likely pathogenic, 5 - Pathogenic, 6 - drug response, 7 - histocompatibility. A negative score means the the score is for the ref allele clinvar_trait: the trait/disease the clinvar_clnsig referring to (Please see more description about the attributes at https://drive.google.com/file/d/0B60wROKy6OqcWllRZjZXM2Vuc2c/view
[...]
DiseaseCausalProb_ExoVarTrainedModel: Conditional probability of being Mendelian disease-causing given the above prediction scores under a logistic regression model trained by our dataset ExoVar.
IsRareDiseaseCausal_ExoVarTrainedModel: Classification using the logistic regression model (Y = disease-causing or N = neutral)
BestCombinedTools:OptimalCutoff:TP:TN : The subset of original prediction tools (out of the 13 tools) used for the combined prediction by our Logistic Regression model which have the largest posterior probability among all possible combinatorial subsets: the cutoff leads to the maximal Matthews correlation coefficient (MCC): the corresponding true positive and true negative at the maximal MCC.

Note By default the option --mendel-causing-predict best is always switched on unless other prediction types are specified.

Predict high frequent somatic mutation potential in cancers

Predict non-synonyms highly frequent somatic-mutations of cancers by an integrative approach based on : --db-score dbnsfp --cancer-mut-predict
java -jar kggseq.jar --maf-file path/to/file1 --ped-file path/to/file2 --db-score dbnsfp --cancer-mut-predict We built a model to predict high-frequent somatic mutation potential as prior weights. This was a random forest (ensemble of 500 decision trees) model trained by a large cancer somatic mutation database, COSMIC (V83). We collected 4,320 somatic mutation variants occurring over 15 times in primary cancer tissues to constitute a positive variant set in COSMIC(V83). A negative control variant set containing 258,846 somatic mutation variants was randomly sampled from the COSMIC as well. Each of the control variant occurred only once in primary cancer tissues. The predictors at each variant include 19 deleterious or conservation scores from the database dbNSFP v3.5. See the names of all tools in the following Figure and description the predictors in a web page of dbNSFP v3.5. The area under of the receiver operating characteristic curve of the random forest model was 79%, which was much better than a multivariate logistic regression model and individual predictors.

Table: The area under the curves (AUC) of receiver operating characteristic (ROC) curves of individual scores and combined score by the random forest model and Logistic regression model in a 10-fold cross-validation. As shown in the above table, the combined prediction by the random forest outperformed that by the Logistic regression.

The option will append the following fields to the output file:

...
IsHighMut_COSMICTrainedModel: Classification results by the random forest model (Y = frequent somatic mutations potential or N = low somatic mutations potential)
RandomForestScore:Proportion of positive predictions in the random forest model in the benchmark dataset. It ranges from 0 to 1. The larger value, the more confident a positive prediction (Y).

NoteThe somatic mutations of patients can be input in VCF format:

java -jar kggseq.jar --vcf-file path/to/file1 --ped-file path/to/file2 --indiv-pair NONTUMOR.1:TUMOR.1,NONTUMOR.2:TUMOR.2 --genotype-filter 8 --cosmic-annot --db-score dbnsfp --cancer-mut-predict

NoteMore descriptions about this model can been seen in our paper: Jiang et al. WITER: a powerful method for estimation of cancer-driver genes using a weighted iterative regression modelling background mutation counts. Nucleic Acids Research (In press)

Predicting complex disease-causing variants

Predict regulatory or pathogenic potential of complex diseases at non-coding variants by different models.

Predicting noncoding regulatory variants based on composite strategy

Predict regulatory variants by 10 functional scores at non-coding variants: --db-score dbncfp_known [or dbncfp_all] --regulatory-causing-predict
java -jar kggseq.jar --vcf-file path/to/file1 --db-score dbncfp_known --regulatory-causing-predict all Use 10 existing functional prediction scores (See above table) to RE-predict whether a SNV will potentially be regulatory. By default, KGGSeq uses all scores for a combinatorial prediction. On the other hand, one can FIX the prediction using any specified subset (>2) or full set (all) of the impact scores by option like --regulatory-causing-predict 1,6,8,10.

The existing functional prediction scores at non-coding variants are listed below:

Coding	Method	Description
1	GWAVA_Region	GWAVA uses the random forest algorithm to build three classifiers using all available annotations to discriminate between the disease variants and variants from each of the three control sets. This control set first was composed of all 1KG variants in the 1 kb surrounding each of the HGMD variants.
2	GWAVA_TSS	GWAVA uses the random forest algorithm to build three classifiers using all available annotations to discriminate between the disease variants and variants from each of the three control sets. This control set first was matched for distance to the nearest TSS genome-wide.
3	GWAVA_Unmatched	GWAVA uses the random forest algorithm to build three classifiers using all available annotations to discriminate between the disease variants and variants from each of the three control sets. This control set first was constructed from a random selection of SNVs from across the genome in order to sample overall background.
4	CADD_CScore	"Raw" CADD scores come straight from the CADD model, and are interpretable as the extent to which the annotation profile for a given variant suggests that that variant is likely to be "observed" (negative values) vs "simulated" (positive values). These values have no absolute unit of meaning and are incomparable across distinct annotation combinations, training sets, or model parameters. However, raw values do have relative meaning, with higher values indicating that a variant is more likely to be simulated (or "not observed") and therefore more likely to have deleterious effects.
5	DANN	DANN uses the same feature set and training data as CADD to train a deep neural network (DNN). DNNs can capture nonlinear relationships among features and are better suited than SVMs for problems with a large number of samples and features.
6	FATHMM-MKL	FATHMM-MKL uses MKL classifier to predict the functional consequences of both coding and non-coding sequence variants from various genomic annotations and weights the significance of each component annotation source.
7	FunSeq	FunSeq filters mutations overlapping 1000 Genomes variants and then prioritizes those in regions under strong selection (sensitive and ultrasensitive), breaking TF motifs, and those associated with hubs. It can score the deleterious potential of variants in single or multiple genomes. The scores for each noncoding variant vary from 0 to 6, with 6 corresponding to maximum deleterious effect. When multiple tumor genomes are given as input, FunSeq also identifies recurrent mutations in the same element.
8	FunSeq2	FunSeq2 is originally to annotate and prioritize somatic alterations integrating various resources from genomic and cancer studies. The framework consists of two components: (1) data context from uniformly processing large-scale datasets; and (2) a high-throughput variant prioritization pipeline. FunSeq2 can also be used to prioritize noncoding genetic variants.
9	GWAS3D	GWAS3D systematically assesses the genetic variants that could affect regulatory elements, by integrating annotations from cell type-specific chromatin states, epigenetic modifications, sequence motifs and cross-species conservation. It combines the original GWAS signal, risk haplotype, binding affinity significance and conservation information to prioritize the leading variants, and infer the putative causal variant in the LD of leading variant.
10	SuRFR	SuRFR integrates functional annotation and prior biological knowledge to prioritise candidate functional variants by regression model. It introduces novel training and validation datasets that i) capture the regional heterogeneity of genomic annotation better than previously applied approaches, and ii) facilitate understanding of which annotations are most important for discriminating different classes of functionally relevant variants from background variants.

If you have a lot of variants without annotation of existing tools, you can use the whole genome annotation dataset, --db-score dbncfp_all. However, you have to manually download the file from hgXX_all_SNV_dbNCFP.gz by efficient tools (e.g. FileZilla) and put it into the folder of KGGSeqFolder/resources/hgXX/hgXX_all_SNV_dbNCFP.gz.

Besides the above existing functional prediction scores, this option will append the following fields to the output file:

BF:Bayes factor (BF) that compares the two probability models, in which the null hypothesis is that the variant is neutral, and the alternative hypothesis is that the variant is causal.
Composite_Prob: the overall probability of the variant being regulatory-causal, which is calculated by multiplying the probabilities of being causal of multiple scores.

Reference: Mulin Jun Li et al. Predicting regulatory variants with composite statistic. Bioinformatics. 2016 Sep 15;32(18):2729-36.

Predicting context-dependent noncoding regulatory variants based on composite strategy and tissue/cell type-specific model

Note <argument>: --regulatory-causing-predict is described in last selection (composite strategy).

Predict pathogenic variants by adding cell-type specific regulatory epigenomic markers: --db-score dbncfp_known [or dbncfp_all] --regulatory-causing-predict all --cell
java -jar kggseq.jar --vcf-file path/to/file1 --db-score dbncfp_known --regulatory-causing-predict all --cell GM12878 Assigning a cell type, KGGSeq uses a logit model to measure the probability of regulatory causality for given variants in selected condition. It finally combine the above composite model and context-dependent model into an unified model to estimate the posterior regulatory probability in a specific cell type/tissue. ftp://jjwanglab.org/PRVCS/dbNCFP/dbNCFP_hg19.ANN.bgz

If you have a lot of variants without annotation of existing tools (rare/de novo/somatic variants), you can use the whole genome annotation dataset, --db-score dbncfp_all. However, you have to manually download the file from hgXX_all_SNV_dbNCFP.gz by efficient tools (e.g. FileZilla) and put it into the folder of KGGSeqFolder/resources/hgXX/hgXX_all_SNV_dbNCFP.gz.

KGGSeq supports 16 ENCODE cell types as follows:

Coding	Tissue	Description
A549	Epithelium	epithelial cell line derived from a lung carcinoma tissue. (PMID: 175022), "This line was initiated in 1972 by D.J. Giard, et al. through explant culture of lung carcinomatous tissue from a 58-year-old caucasian male." - ATCC, newly promoted to tier 2: not in 2011 analysis.
CD14	Monocytes	Monocytes-CD14+ are CD14-positive cells from human leukapheresis production, from donor RO 01746 (draw 1 ID is RO 01746, draw 2 ID is RO 01826), newly promoted to tier 2: not in 2011 analysis.
GM12878	Blood	B-lymphocyte, lymphoblastoid, International HapMap Project - CEPH/Utah - European Caucasion, Epstein-Barr Virus.
H1-hESC	Embryonic stem cell	embryonic stem cells.
HMEC	Breast	mammary epithelial cells.
HSMM	Muscle	skeletal muscle myoblasts.
HSMMT	Muscle	HSMM cell derived skeletal muscle myotubes cell line.
HUVEC	Vessel	umbilical vein endothelial cells.
HeLa-S3	Cervix	cervical carcinoma.
HepG2	Liver	hepatocellular carcinoma.
IMR90	Lung	fetal lung fibroblasts, newly promoted to tier 2: not in 2011 analysis.
K562	Blood	leukemia, "The continuous cell line K-562 was established by Lozzio and Lozzio from the pleural effusion of a 53-year-old female with chronic myelogenous leukemia in terminal blast crises." - ATCC.
NH-A	Brain	astrocytes (also called Astrocy).
NHDF	Skin	dermal fibroblasts from temple / breast
NHEK	Skin	epidermal keratinocytes.
NHLF	Lung	lung fibroblasts.

NoteBy default, KGGSeq will use GM12878 for context-dependent prioritization.

KGGSeq also supports 127 RoadMap human reference epigenomes : java -jar kggseq.jar --vcf-file path/to/file1 --db-score dbncfp_known --regulatory-causing-predict all --cell E116

Coding	Lineage Group	Epigenome Mnemonic	Epigenome Name	Anatomy	Type
E001	ESC	ESC.I3	ES-I3 Cell Line	ESC	CellLine
E002	ESC	ESC.WA7	ES-WA7 Cell Line	ESC	CellLine
E003	ESC	ESC.H1	H1 Cell Line	ESC	CellLine
E004	ES-deriv	ESDR.H1.BMP4.MESO	H1 BMP4 Derived Mesendoderm Cultured Cells	ESC_DERIVED	CellLineDerived
E005	ES-deriv	ESDR.H1.BMP4.TROP	H1 BMP4 Derived Trophoblast Cultured Cells	ESC_DERIVED	CellLineDerived
E006	ES-deriv	ESDR.H1.MSC	H1 Derived Mesenchymal Stem Cells	ESC_DERIVED	CellLineDerived
E007	ES-deriv	ESDR.H1.NEUR.PROG	H1 Derived Neuronal Progenitor Cultured Cells	ESC_DERIVED	CellLineDerived
E008	ESC	ESC.H9	H9 Cell Line	ESC	CellLine
E009	ES-deriv	ESDR.H9.NEUR.PROG	H9 Derived Neuronal Progenitor Cultured Cells	ESC_DERIVED	CellLineDerived
E010	ES-deriv	ESDR.H9.NEUR	H9 Derived Neuron Cultured Cells	ESC_DERIVED	CellLineDerived
E011	ES-deriv	ESDR.CD184.ENDO	hESC Derived CD184+ Endoderm Cultured Cells	ESC_DERIVED	CellLineDerived
E012	ES-deriv	ESDR.CD56.ECTO	hESC Derived CD56+ Ectoderm Cultured Cells	ESC_DERIVED	CellLineDerived
E013	ES-deriv	ESDR.CD56.MESO	hESC Derived CD56+ Mesoderm Cultured Cells	ESC_DERIVED	CellLineDerived
E014	ESC	ESC.HUES48	HUES48 Cell Line	ESC	CellLine
E015	ESC	ESC.HUES6	HUES6 Cell Line	ESC	CellLine
E016	ESC	ESC.HUES64	HUES64 Cell Line	ESC	CellLine
E017	IMR90	LNG.IMR90	IMR90 fetal lung fibroblasts Cell Line	LUNG	CellLine
E018	iPSC	IPSC.15b	iPS-15b Cell Line	IPSC	CellLine
E019	iPSC	IPSC.18	iPS-18 Cell Line	IPSC	CellLine
E020	iPSC	IPSC.20B	iPS-20b Cell Line	IPSC	CellLine
E021	iPSC	IPSC.DF.6.9	iPS DF 6.9 Cell Line	IPSC	CellLine
E022	iPSC	IPSC.DF.19.11	iPS DF 19.11 Cell Line	IPSC	CellLine
E023	Mesench	FAT.MSC.DR.ADIP	Mesenchymal Stem Cell Derived Adipocyte Cultured Cells	FAT	CellLineDerived
E024	ESC	ESC.4STAR	Cell Line	ESC	CellLine
E025	Mesench	FAT.ADIP.DR.MSC	Adipose Derived Mesenchymal Stem Cell Cultured Cells	FAT	PrimaryCell
E026	Mesench	STRM.MRW.MSC	Bone Marrow Derived Cultured Mesenchymal Stem Cells	NECTIVE	PrimaryCell
E027	Epithelial	BRST.MYO	Breast Myoepithelial Primary Cells	BREAST	PrimaryCell
E028	Epithelial	BRST.HMEC.35	Breast variant Human Mammary Epithelial Cells (vHMEC)	BREAST	PrimaryCell
E029	HSC & B-cell	BLD.CD14.PC	Primary monocytes from peripheral blood	BLOOD	PrimaryCell
E030	HSC & B-cell	BLD.CD15.PC	Primary neutrophils from peripheral blood	BLOOD	PrimaryCell
E031	HSC & B-cell	BLD.CD19.CPC	Primary B cells from cord blood	BLOOD	PrimaryCell
E032	HSC & B-cell	BLD.CD19.PPC	Primary B cells from peripheral blood	BLOOD	PrimaryCell
E033	Blood & T-cell	BLD.CD3.CPC	Primary T cells from cord blood	BLOOD	PrimaryCell
E034	Blood & T-cell	BLD.CD3.PPC	Primary T cells from peripheral blood	BLOOD	PrimaryCell
E035	HSC & B-cell	BLD.CD34.PC	Primary hematopoietic stem cells	BLOOD	PrimaryCell
E036	HSC & B-cell	BLD.CD34.CC	Primary hematopoietic stem cells short term culture	BLOOD	PrimaryCell
E037	Blood & T-cell	BLD.CD4.MPC	Primary T helper memory cells from peripheral blood 2	BLOOD	PrimaryCell
E038	Blood & T-cell	BLD.CD4.NPC	Primary T helper naive cells from peripheral blood	BLOOD	PrimaryCell
E039	Blood & T-cell	BLD.CD4.CD25M.CD45RA.NPC	Primary T helper naive cells from peripheral blood	BLOOD	PrimaryCell
E040	Blood & T-cell	BLD.CD4.CD25M.CD45RO.MPC	Primary T helper memory cells from peripheral blood 1	BLOOD	PrimaryCell
E041	Blood & T-cell	BLD.CD4.CD25M.IL17M.PL.TPC	Primary T helper cells PMA-I stimulated	BLOOD	PrimaryCell
E042	Blood & T-cell	BLD.CD4.CD25M.IL17P.PL.TPC	Primary T helper 17 cells PMA-I stimulated	BLOOD	PrimaryCell
E043	Blood & T-cell	BLD.CD4.CD25M.TPC	Primary T helper cells from peripheral blood	BLOOD	PrimaryCell
E044	Blood & T-cell	BLD.CD4.CD25.CD127M.TREGPC	Primary T regulatory cells from peripheral blood	BLOOD	PrimaryCell
E045	Blood & T-cell	BLD.CD4.CD25I.CD127.TMEMPC	Primary T cells effector/memory enriched from peripheral blood	BLOOD	PrimaryCell
E046	HSC & B-cell	BLD.CD56.PC	Primary Natural Killer cells from peripheral blood	BLOOD	PrimaryCell
E047	Blood & T-cell	BLD.CD8.NPC	Primary T killer naive cells from peripheral blood	BLOOD	PrimaryCell
E048	Blood & T-cell	BLD.CD8.MPC	Primary T killer memory cells from peripheral blood	BLOOD	PrimaryCell
E049	Mesench	STRM.CHON.MRW.DR.MSC	Mesenchymal Stem Cell Derived Chondrocyte Cultured Cells	NECTIVE	PrimaryCell
E050	HSC & B-cell	BLD.MOB.CD34.PC.F	Primary hematopoietic stem cells G-CSF-mobilized Female	BLOOD	PrimaryCell
E051	HSC & B-cell	BLD.MOB.CD34.PC.M	Primary hematopoietic stem cells G-CSF-mobilized Male	BLOOD	PrimaryCell
E052	Myosat	MUS.SAT	Muscle Satellite Cultured Cells	MUSCLE	PrimaryCell
E053	Neurosph	BRN.CRTX.DR.NRSPHR	Cortex derived primary cultured neurospheres	BRAIN	PrimaryCell
E054	Neurosph	BRN.GANGEM.DR.NRSPHR	Ganglion Eminence derived primary cultured neurospheres	BRAIN	PrimaryCell
E055	Epithelial	SKIN.PEN.FRSK.FIB.01	Foreskin Fibroblast Primary Cells skin01	SKIN	PrimaryCell
E056	Epithelial	SKIN.PEN.FRSK.FIB.02	Foreskin Fibroblast Primary Cells skin02	SKIN	PrimaryCell
E057	Epithelial	SKIN.PEN.FRSK.KER.02	Foreskin Keratinocyte Primary Cells skin02	SKIN	PrimaryCell
E058	Epithelial	SKIN.PEN.FRSK.KER.03	Foreskin Keratinocyte Primary Cells skin03	SKIN	PrimaryCell
E059	Epithelial	SKIN.PEN.FRSK.MEL.01	Foreskin Melanocyte Primary Cells skin01	SKIN	PrimaryCell
E061	Epithelial	SKIN.PEN.FRSK.MEL.03	Foreskin Melanocyte Primary Cells skin03	SKIN	PrimaryCell
E062	Blood & T-cell	BLD.PER.MONUC.PC	Primary mononuclear cells from peripheral blood	BLOOD	PrimaryCell
E063	Adipose	FAT.ADIP.NUC	Adipose Nuclei	FAT	PrimaryTissue
E065	Heart	VAS.AOR	Aorta	VASCULAR	PrimaryTissue
E066	Other	LIV.ADLT	Liver	LIVER	PrimaryTissue
E067	Brain	BRN.ANG.GYR	Brain Angular Gyrus	BRAIN	PrimaryTissue
E068	Brain	BRN.ANT.CAUD	Brain Anterior Caudate	BRAIN	PrimaryTissue
E069	Brain	BRN.CING.GYR	Brain Cingulate Gyrus	BRAIN	PrimaryTissue
E070	Brain	BRN.GRM.MTRX	Brain Germinal Matrix	BRAIN	PrimaryTissue
E071	Brain	BRN.HIPP.MID	Brain Hippocampus Middle	BRAIN	PrimaryTissue
E072	Brain	BRN.INF.TMP	Brain Inferior Temporal Lobe	BRAIN	PrimaryTissue
E073	Brain	BRN.DL.PRFRNTL.CRTX	Brain Dorsolateral Prefrontal Cortex	BRAIN	PrimaryTissue
E074	Brain	BRN.SUB.NIG	Brain Substantia Nigra	BRAIN	PrimaryTissue
E075	Digestive	GI.CLN.MUC	Colonic Mucosa	GI_COLON	PrimaryTissue
E076	Sm. Muscle	GI.CLN.SM.MUS	Colon Smooth Muscle	GI_COLON	PrimaryTissue
E077	Digestive	GI.DUO.MUC	Duodenum Mucosa	GI_DUODENUM	PrimaryTissue
E078	Sm. Muscle	GI.DUO.SM.MUS	Duodenum Smooth Muscle	GI_DUODENUM	PrimaryTissue
E079	Digestive	GI.ESO	Esophagus	S	PrimaryTissue
E080	Other	ADRL.GLND.FET	Fetal Adrenal Gland	ADRENAL	PrimaryTissue
E081	Brain	BRN.FET.M	Fetal Brain Male	BRAIN	PrimaryTissue
E082	Brain	BRN.FET.F	Fetal Brain Female	BRAIN	PrimaryTissue
E083	Heart	HRT.FET	Fetal Heart	HEART	PrimaryTissue
E084	Digestive	GI.L.INT.FET	Fetal Intestine Large	GI_INTESTINE	PrimaryTissue
E085	Digestive	GI.S.INT.FET	Fetal Intestine Small	GI_INTESTINE	PrimaryTissue
E086	Other	KID.FET	Fetal Kidney	KIDNEY	PrimaryTissue
E087	Other	PANC.ISLT	Pancreatic Islets	PANCREAS	PrimaryTissue
E088	Other	LNG.FET	Fetal Lung	LUNG	PrimaryTissue
E089	Muscle	MUS.TRNK.FET	Fetal Muscle Trunk	MUSCLE	PrimaryTissue
E090	Muscle	MUS.LEG.FET	Fetal Muscle Leg	MUSCLE_LEG	PrimaryTissue
E091	Other	PLCNT.FET	Placenta	PLACENTA	PrimaryTissue
E092	Digestive	GI.STMC.FET	Fetal Stomach	GI_STOMACH	PrimaryTissue
E093	Thymus	THYM.FET	Fetal Thymus	THYMUS	PrimaryTissue
E094	Digestive	GI.STMC.GAST	Gastric	GI_STOMACH	PrimaryTissue
E095	Heart	HRT.VENT.L	Left Ventricle	HEART	PrimaryTissue
E096	Other	LNG	Lung	LUNG	PrimaryTissue
E097	Other	OVRY	Ovary	OVARY	PrimaryTissue
E098	Other	PANC	Pancreas	PANCREAS	PrimaryTissue
E099	Other	PLCNT.AMN	Placenta Amnion	PLACENTA	PrimaryTissue
E100	Muscle	MUS.PSOAS	Psoas Muscle	MUSCLE	PrimaryTissue
E101	Digestive	GI.RECT.MUC.29	Rectal Mucosa Donor 29	GI_RECTUM	PrimaryTissue
E102	Digestive	GI.RECT.MUC.31	Rectal Mucosa Donor 31	GI_RECTUM	PrimaryTissue
E103	Sm. Muscle	GI.RECT.SM.MUS	Rectal Smooth Muscle	GI_RECTUM	PrimaryTissue
E104	Heart	HRT.ATR.R	Right Atrium	HEART	PrimaryTissue
E105	Heart	HRT.VNT.R	Right Ventricle	HEART	PrimaryTissue
E106	Digestive	GI.CLN.SIG	Sigmoid Colon	GI_COLON	PrimaryTissue
E107	Muscle	MUS.SKLT.M	Skeletal Muscle Male	MUSCLE	PrimaryTissue
E108	Muscle	MUS.SKLT.F	Skeletal Muscle Female	MUSCLE	PrimaryTissue
E109	Digestive	GI.S.INT	Small Intestine	GI_INTESTINE	PrimaryTissue
E110	Digestive	GI.STMC.MUC	Stomach Mucosa	GI_STOMACH	PrimaryTissue
E111	Sm. Muscle	GI.STMC.MUS	Stomach Smooth Muscle	GI_STOMACH	PrimaryTissue
E112	Thymus	THYM	Thymus	THYMUS	PrimaryTissue
E113	Other	SPLN	Spleen	SPLEEN	PrimaryTissue
E114	ENCODE	LNG.A549.ETOH002.CNCR	A549 EtOH 0.02pct Lung Carcinoma Cell Line	LUNG	CellLine_Cancer
E115	ENCODE	BLD.DND41.CNCR	Dnd41 TCell Leukemia Cell Line	BLOOD	CellLine_Cancer
E116	ENCODE	BLD.GM12878	GM12878 Lymphoblastoid Cell Line	BLOOD	CellLine
E117	ENCODE	CRVX.HELAS3.CNCR	HeLa-S3 Cervical Carcinoma Cell Line	CERVIX	CellLine_Cancer
E118	ENCODE	LIV.HEPG2.CNCR	HepG2 Hepatocellular Carcinoma Cell Line	LIVER	CellLine_Cancer
E119	ENCODE	BRST.HMEC	HMEC Mammary Epithelial Primary Cells	BREAST	CellLine
E120	ENCODE	MUS.HSMM	HSMM Skeletal Muscle Myoblasts Cell Line	MUSCLE	CellLine
E121	ENCODE	MUS.HSMMT	HSMM cell derived Skeletal Muscle Myotubes Cell Line	MUSCLE	CellLine
E122	ENCODE	VAS.HUVEC	HUVEC Umbilical Vein Endothelial Cells Cell Line	VASCULAR	CellLine
E123	ENCODE	BLD.K562.CNCR	K562 Leukemia Cell Line	BLOOD	CellLine
E124	ENCODE	BLD.CD14.MONO	Monocytes-CD14+ RO01746 Cell Line	BLOOD	CellLine
E125	ENCODE	BRN.NHA	NH-A Astrocytes Cell Line	BRAIN	CellLine
E126	ENCODE	SKIN.NHDFAD	NHDF-Ad Adult Dermal Fibroblast Primary Cells	SKIN	CellLine
E127	ENCODE	SKIN.NHEK	NHEK-Epidermal Keratinocyte Primary Cells	SKIN	CellLine
E128	ENCODE	LNG.NHLF	NHLF Lung Fibroblast Primary Cells	LUNG	CellLine
E129	ENCODE	BONE.OSTEO	Osteoblast Primary Cells	BONE	CellLine

The option will append the following fields to the output file:

BF: Bayes factor
Composite_Prob:Posterior probability of being regualtory given selected functional prediction scores.
Cell_Prob: Cell type-specific regulatory potential.
Combine_Prob: Combined posterior regulatory probability.

Note Epigenomes of some specific marks are missing in some tissues/cells, which may affect the prediction. We will use imputed epigenomes to handle this missing data problem soon!

Figure: Using GWAS fine mapped SNPs and 127 recently profiled human epigenomes, we showed that our context-dependent score can comprehensively depict the phenotypic cell type specificity.

Reference: Mulin Jun Li et al. cepip: context-dependent epigenomic weighting for prioritization of regulatory variants and disease-associated genes, Genome Biology (2017) 18:52

Gene feature specific model --db-score dbncfp_known [or dbncfp_all] --ldgf-func-predict java -jar kggseq.jar --vcf-file path/to/file1 --db-score dbncfp_known --ldgf-func-predict --buildver hg19

1) Annotate non-coding variants by above 10 existing functional prediction scores (See above table) in a known variant dataset and 2) jointly predict the functions importance by a new statistical model (linkage disequilibrium expanded and gene feature specific model). The known dataset contains around 110 million variants from multiple high throughput sequencing resources, including the 1000 Genomes Project Phase 3, UK10K Project, ESP6500 Project and ExAC database. The joint prediction also uses over 700 epigenomic markers for open chromatin, transcription factor binding, histone modification, RNA polymerase binding, and DNA methylation.

If you have a lot of variants without annotation of existing tools, you can use the whole genome annotation dataset, --db-score dbncfp_all. However, you have to manually download the file from hgXX_all_SNV_dbNCFP.gz by efficient tools (e.g. FileZilla) and put it into the folder of KGGSeqFolder/resources/hgXX/hgXX_all_SNV_dbNCFP.gz. Note Predictions at variants with missing scores at specified methods will use population mean of corresponding method!

Note Add the option tag --verbose-noncoding to show all detailed features used in the pathogenic prediction. However, this will consume a lot of memory!

Figure: Receiver operating characteristic (ROC) curves for performances from different gene feature model.

Besides the above existing functional prediction scores, this option will append the following fields to the output file:

IsComplexDiseasePathogenic: whether the variant is predicted to be pathogenic by the model trained with known pathogenic variants and neutral control variants.
RandomForestScore: the pathogenic scores of the random forest prediction model.

Reference: Pan et al. An improved functional prediction by linkage disequilibrium expanded and gene feature specific scores at non-coding variants. (Submitted)

Prediction at gene

Pathogenicity

Predict genes’ pathogenicity: --patho-gene-predict java -jar kggseq.jar --vcf-file path/to/file1 --ped-file path/to/file2 --out path/to/prefixname --excel --db-gene refgene --patho-gene-predict Gene-based pathogenicity estimation was developed for Autosomal Dominant (AD), Autosomal Recessive (AR), X chromosome linked (XL) and Paediatric diseases (PAE). By using random forest approach to combine various important gene characteristics with six existing gene prioritization methods for disease causal genes in Clinical Genomic Database, the inheritance-specific pathogenicity prioritization (ISPP) could be estimated. The higher score indicates the more similar molecular characteristics to the known disease causal genes under the corresponding inheritance mode. The six existing gene prioritization methods are Haploinsufficiency (HI), Recessive (REC), Network centrality (NET), Genic Intolerance (RVIS), Gene Damage Index (GDI) and Gene Constraint score (CONS).

The output is a series of pathogenic prediction scores for each gene. By comparing all ISPP scores of each gene, the possible inheritance mode and corresponding pathogenicity could be estimated. Please see the example on Table 1 of ISPP paper

ISPP_AD: Inheritance specific gene pathogenicity for Dominant diseases
ISPP_AR: Inheritance specific gene pathogenicity for Recessive diseases
ISPP_XL: Inheritance specific gene pathogenicity for X chromosome linked diseases
HI: Gene level haploinsufficiency estimation by Ni Huang et al
REC: A model to estimate the probability that a mutation causes a recessive disease by Daniel G. MacArthur et al.
NET: Gene centrality and indispensability in various protein-protein interactions and regulatory networks by Ekta Khurana et al.
RVIS: Residual Variation Intolerance Score , a gene score based module intended to help in the interpretation of human sequence data by Slave Petrovski et al.
GDI: Gene Damage Index (GDI) by Itan Y et al.
CONS: De novo mutation effectiveness prediction by Kaitlin E Samocha et al.

Reference: Jacob Hsu et al. Inheritance Modes Specific Pathogenicity Prioritization (ISPP) for Human Protein Coding Genes. Bioinformatics. 2016 Oct 15;32(20):3065-3071.

Phenotype Mining

Mine the phenotype of disease: --phenotype-term searchTerm and --phenolyzer-prediction java -jar kggseq.jar --vcf-file path/to/file1 --ped-file path/to/file2 --out path/to/prefixname --excel --db-gene refgene --phenotype-term searchTerm1,searchTerm2 --phenolyzer-prediction KGGseq has integrated the advanced tool phenolyzer as a plugin to implement the function of phenoytpe mining and predection. The score marked by phenolyer will by used to annotate the related genes in the output file.

NoteTo run Phenolyzer, your computer need install Perl. Please read the installation instructions.

Tissue-specific Expression

Tissues or cell-types where the transcripts of genes are specifically expressed: --tissue-spec-annot java -jar kggseq.jar --vcf-file path/to/file1 --ped-file path/to/file2 --out path/to/prefixname --excel --db-gene refgene --tissue-spec-annot Annotate genes with tissues or cell-types in which the transcripts of genes are specifically expressed. The tissue and cell-types are produced by a robust z-score approach for tissue-specific expression with the usage of GTEx RNA-sequencing data (version 7) at 52 tissues/cell-types. The tissues or cell-types are collected here as their tissue or cell-type specific expression p-values are less than 1E-6. See detail in REZ.

The output of this function is a list of tissues or cell-types in which the transcripts of genes are specifically expressed.

TissueWithSpecificExpression: the transcripts and tissues or cell-types where it is pecifically expressed. Multiple tissues and cell-types are delimited by commas and multiple transcripts are delimited by semicolons.

Statistical tests

Cancer-driver gene mutation burden test by Iterative zero-truncated negative-binomial regression (ITER) based on somatic mutations

Prioritize genes with high mutations rates: --iter-gene-codingjava -jar kggseq.jar --maf-file path/to/file1.maf --out path/to/prefixname [other tags] --iter-gene-coding

Note By default, the setting of --response-gene-feature is 0,1,2,3,4,5,6 (i.e., all non-synonymous variants) and the setting of --explanatory-gene-feature is 7 (i.e., the synonymous variants only).
Note The gene feature code can be found HERE.

This approach is designed for identifying specific cancer driver genes in two scenarios.
1) Evaluation of excess of mutations at non-synonymous variants in which number of synonymous mutations is used as one of explanatory variables in the regression model:

java -jar kggseq.jar --maf-file path/to/file1.maf --out path/to/prefixname --iter-gene-coding --cancer-label XXX 2) Evaluation of excess of mutations at synonymous variants in which no variant is used as an explanatory variable in the regression model: java -jar kggseq.jar --maf-file path/to/file1.maf --out path/to/prefixname --response-gene-feature 7 --explanatory-gene-feature - --iter-gene-coding --cancer-label XXX

The option --cancer-label XXX asks the tool to add cancer-matched predictors into the prediction model. Otherwise, the tool will only use the non-cancer-matched predictors. The values of cancer-matched predictors are integrated in the resource data. The following are the cancer labels. The names of the predictors can be seen below.

Label	FullName
ACC	Adrenocortical carcinoma
BLCA	Bladder Urothelial Carcinoma
BRCA	Breast invasive carcinoma
CESC	Cervical squamous cell carcinoma and endocervical adenocarcinoma
CHOL	Cholangiocarcinoma
COAD	Colon adenocarcinoma
CRL	Cell Line Control
DLBC	Lymphoid Neoplasm Diffuse Large B-cell Lymphoma
ESCA	Esophageal carcinoma
GBM	Glioblastoma multiforme
HNSC	Head and Neck squamous cell carcinoma
KICH	Kidney Chromophobe
KIRC	Kidney renal clear cell carcinoma
KIRP	Kidney renal papillary cell carcinoma
LAML	Acute Myeloid Leukemia
LCML	Chronic Myelogenous Leukemia
LGG	Brain Lower Grade Glioma
LIHC	Liver hepatocellular carcinoma
LUAD	Lung adenocarcinoma
LUSC	Lung squamous cell carcinoma
MESO	Mesothelioma
MISC	Miscellaneous
OV	Ovarian serous cystadenocarcinoma
PAAD	Pancreatic adenocarcinoma
PCPG	Pheochromocytoma and Paraganglioma
PRAD	Prostate adenocarcinoma
READ	Rectum adenocarcinoma
SARC	Sarcoma
SKCM	Skin Cutaneous Melanoma
STAD	Stomach adenocarcinoma
TGCT	Testicular Germ Cell Tumors
THCA	Thyroid carcinoma
THYM	Thymoma
UCEC	Uterine Corpus Endometrial Carcinoma
UCS	Uterine Carcinosarcoma
UVM	Uveal Melanoma

In the output file of the analysis (*.gene.mutationburden.txt or *.gene.mutationburden.xlsx) in the following format:

Example:
GeneSymbol ResponseVar ResponseVarScore ExplanatoryVar AvgCodingLen expr reptime hic constraint_score Residual P
PIK3CA 340 2167 0 3.307 12.90393121 613 11 5.42012467 54.15940338 0
TP53 341 2770 0 1.448 14.54284996 213 34 1.378924009 81.89842183 0
AKT1 28 178 0 1.554 13.47551291 247 36 4.027570836 19.06261681 2.58E-81
GATA3 106 108 0 1.36 12.11686978 675 -2 2.949085091 13.66765846 7.92E-43
RUNX1 24 92 0 1.639 12.01318547 429 45 2.484099017 12.81066283 7.15E-38

Here are description of the column names.

GeneSymbol: Official gene symbols defined by HGNC.
ResponseVar: The number of mutant alleles as a response variable in the regression.
ResponseVarScore: The number of mutant alleles as a response variable in the regression.
ExplanatoryVar: The number of mutant alleles as an explanatory variable in the regression.
RegionLength: The actual or averaged (of multiple isoforms) length (in kb) of all exons in a gene.
CellLineExpression: Gene expression values were averaged expression across 91 cell lines in the Cancer Cell Line Encylcopedia (CCLE).
ReplicationTiming : DNA replication time of a gene was measured in HeLa cells, ranging from 100 (very early) to 1000 (very late)
Hic: HiC-based chromatin state of a gene was measured from HiC experiments in K562 cells, ranging approximately from -50 (very closed) to +50 (very open)
ConstraintScore: constraint score for de novo mutation potential from Samocha et al (2014) Nature Genetics.
TissueExpression : Cancer-matched gene expression. Batch effects normalized mRNA data called from RNAseq for each gene. Curated from https://atacseq.xenahubs.net
CNV :Cancer-matched Gene’s copy number variation in cancer tissues, curated from https://atacseq.xenahubs.net
ATACSeq : Cancer-matched Chromatin accessibility by ATAC-Seq data, curated from https://atacseq.xenahubs.net
Methylation : Cancer-matched DNA-methylation level, curated from https://atacseq.xenahubs.net
Residual: Residuals (observed number of mutations - fitted number of mutations) in the regression after correction for the factors above.
P: The probability of be equal or over the StandizedResidue under a standard normal distribution.
AlleleScore: All mutation counts and functions scores used to calculate the mutation burden p-values by ITER or WITER.
BHFDR-q: False Discovery Rate (FDR) q-value for multiple testing by Benjamini-Hochberg procedure.

Cancer-driver gene mutation burden test by weighted iterative zero-truncated negative-binomial regression (WITER) based on somatic mutations

Prioritize genes with high mutations rates: --witer-gene-codingjava -jar kggseq.jar --maf-file path/to/file1.maf --out path/to/prefixname [other tags] --witer-gene-coding

This approach is designed for identifying specific cancer driver genes in three scenarios.
1) Evaluation of excess of mutations at non-synonymous variants in which number of synonymous mutations is used as one of explanatory variables in the regression model:

java -jar kggseq.jar --maf-file path/to/file1.maf --out path/to/prefixname --witer-gene-coding --cancer-label XXX The output of WITER is the same as that of ITER except that the is the weighted somatic mutation counts and is larger than the somatic mutation counts.

Cancer-driver gene mutation burden test by weighted iterative zero-truncated negative-binomial regression with reference sample (WITER-ref) based on somatic mutations

Prioritize genes with high mutations rates: --witer-gene-coding

java -jar kggseq.jar --maf-file path/to/file1.maf --out path/to/prefixname [other tags] --witer-gene-coding  --ref-mut-file path/to/referenceDataset --cancer-label XXX

java -jar kggseq.jar --maf-file path/to/file1.maf --out path/to/prefixname --witer-gene-coding --ref-mut-file path/to/referenceDataset --cancer-label XXX The output of WITER is the same as that of ITER except that the is the weighted somatic mutation counts and is larger than the somatic mutation counts.

...
MergedAlleleNum:The number of weighted mutant alleles from reference samples.

Detect disease susceptibility gene mutation burden test by recursive truncated negative-binomial regression (RUNNER) for complex diseases

Test baseline mutation burden at genes with rare coding variants for complex diseases: --runner-gene-coding

java -jar kggseq.jar --nt 6 [or more threads] --vcf-file path/to/file1 --ped-file path/to/file2 --out path/to/prefixname [or other tags] --excel --qqplot --db-filter gadexome.[an ancestry tag],gadgenome.[an ancestry tag],1kgeas201305 --rare-allele-freq 0.01 --min-case-control-freq-ratio 3 --filter-case-maf-oe 0.1 --gene-freq-score [an ancestry tag] --runner-gene-coding

See the detailed example options in the short tutorial

NoteThe ancestry tags are defined by Genome Aggregation Database.

Tag	Ancestry
eas	East Asian
afr	African/African American
sas	South Asian
eur	Non-Finnish European
fin	Finnish
amr	Latino American
asj	Ashkenazi Jewish
control	For the option `--gene-freq-score`, the gene frequency score of a local control sample can be used optionally with this tag. However, unless the size of control sample is over 500, the control tag will lead to decreased power according to our experience.

NoteThe --disease-causing-predict best is the same as --mendel-causing-predict best.
NoteThe --gene-freq-score should be specified with ancestrally matched population for the analyzed sample.
NoteThe --min-case-control-freq-ratio When there are control samples, this option can be used to filter out variants at which the alternative allele frequency in cases is less than that in controls multiplied by X.
NoteThe --adj-disease --residual-prob X --phe disease1 --cov cov1,cov2,... Adjust the disease statues (--phen) by covariables (--cov) with the logistic regression. Subjects with the absolute value of deviance residual less than the Xth (--residual-prob) quantile of a standard normal distribution will be ignored. The defaud value of X is 0.8.

Here are description of the column names in the output file named *.gene.mutationburden.txt.

GeneSymbol: Official gene symbols defined by HGNC.
ResponseVar: The number of mutant alleles.
ResponseVarScore: The number of weighted mutant alleles as a response variable in the regression.
ExplanatoryVar (Optional): The number of mutant alleles as an explanatory variable in the regression.
RegionLength: The length (in kb) of regions where the tested variant type are counted in a gene.
An ancestry tag e.g., EAS : The accumulated allele frequency (named frequency score) in ancestrally matched populations.
**RegionLength*tag: The multiplication of RegionLength and ancestry tag used as an interaction term in the regression
Residual:Deviance residual of mutation burden in the regression model.
P: The probability of be equal or over the Standized Residue under a standard normal distribution.
PumMedID(Optional): The queried PubMed ID(s) of papers co-mentioning the gene and phenotypes(or diseases).
BHFDR-q:** Q-value calculated by Benjamini Hochberg FDR method for multiple testing.

For advanced users, the following are more options to customize the analysis.
1.Perform burden tests without weighting the variants by functional scores --runner1-gene-coding

2.Perform burden tests using local controls to generate gene frequency score --gene-freq-score CONTROL

3.Perform burden tests and search related papers in PubMed --pubmed-mining-top-gene 10

4.Perform burden tests for other types of coding variants --response-gene-feature 2,3,4,5,6 The 2,3,4,5,6 denotes the codes of gene features.

5.Perform burden tests by adding other covariables into RUNNER. --gene-cov-file

The format of the gene covariable is simple:

Gene score1 score1 ...
SAMD11 2.114 0.076914032 ...
NOC2L 2.345 0.046528595 ...
KLHL17 1.989 0.026329064 ...
PLEKHN1 2.2 0.047853652 ...
... ... ... ...

6.Perform burden tests by customized weights or scores at variants. --var-score-file

The weights must be compressed in bgzip file, in which there are five columns. The missing values are denoted by '.'.

Chromosome POS Reference AlternativeAllele Weights
1 865580 G A 0.000568741
1 865592 A C 0.018698691
1 865544 C +GGCT .
1 865656 G T 0.011239452
1 865701 G A .
... ... ... ... ...

7.Perform burden tests among variants with GC types or not --ignore-ct-var or --only-ct-var

8.Perform burden tests among variants expressed at a number of specified tissues or cell types. --var-expression-tissues --min-var-expression

Reference: Jiang et al. Deviation from baseline mutation burden provides powerful and robust rare-variants association test for complex diseases. Nucleic Acids Res. 2021 Dec 20;gkab1234. doi: 10.1093/nar/gkab1234.PubMed

Detect Gene-Gene Interaction by a Recursive trUNcated NEgative-binomial Regression model (GGI-RUNNER) for complex diseases

About GGI-RUNNER

GGI-RUNNER is designed specifically for detecting gene-gene interactions involving rare variants in complex diseases. It assesses the enrichment of the rare variant interaction burden among patients compared to the baseline in the general population, rather than focusing on differences in genotype distribution between cases and controls. The baselines are estimated for pairwise genes by a recursive truncated negative-binomial regression model on multiple genomic features available from public data. The unique statistical strategy of GGI-RUNNER makes it particularly well-suited for studies with limited sample sizes.

Gene-gene interaction testing with rare non-synonymous variants for complex diseases: --runner-digene-coding-interact

java -jar kggseq.jar --nt 10 --buildver hg38 --vcf-file path/to/file1 --ped-file path/to/file2 --out path/to/prefixname --db-gene refgene,gencode --gene-feature-in 0,1,2,3,4,5,6 --db-filter gadexome.[an ancestry tag],gadgenome.[an ancestry tag] --rare-allele-freq 0.01 --gene-freq-score [an ancestry tag] --gene-pair-file /path/to/DIEP_SCORES_file --digene-prob 0.9 --runner-digene-coding-interact --qqplot Note The numerical values following the parameter options can be adjusted flexibly according to needs.
NoteThe --gene-freq-score should be specified with ancestrally matched population for the analyzed sample.
NoteThe --digene-prob should be used in couple with --gene-pair-file, allows for the selection of gene pairs with scores higher than X (e.g. 0.9) for gene-gene interaction analysis. You can download the DIEP_SCORES_file or use other references, but it must adhere to the format of Gene1\tGene2\tScore.

There are more options to customize the analysis

1. Adding quality control filters for individual genotype, genomic variant and sample phenotype (Recommend)

... --hwe-control 1.0E-5 --max-allele 4 --gty-dp 8 --gty-qual 20 --min-obs-rate 0.8 [and other options you need]

2. Specifying variants that are more likely to be enriched in patients (Recommend)

... --min-case-control-freq-ratio 2.0

3. Filtering out variants with a high likelihood of being private in the studied sample (Recommend)

... --filter-sample-maf-oe 0.1

4. Specifying the types of variants to be tested

... --gene-feature-in 0,4,5 Note The 0,4,5 denote the codes of gene feature.
Note The current design of GGI-RUNNER is only applicable to coding variants and does not consider non-coding variants.

5. Ignore variants within some genes

... --genes-out TTN,MUC16,OBSCN

The output

The gene-gene interaction analysis results will summarized in the output file named *.digene.mutationburden.case.txt. The output is in the form

GeneSymbol: Official gene symbols defined by HGNC [Gene1:Gene2].
ResponseVar: The unweighted rare variant interaction burden in patients.
DiGeneScore: The pre-calculate score (e.g. DIEP score) for the gene pair.
ResponseVarScore: The weighted rare variant interaction burden in patients in the regression.
RegionLength: The product of coding region length (in kb) of Gene1 and Gene2.
An ancestry tag e.g., EAS: The product of the accumulated allele frequency (named frequency score) in ancestrally matched populations of Gene1 and Gene2.
**RegionLength*tag: The product of RegionLength and ancestry tag used as an interaction term in the regression.
oe_mis: The product of the observed/expected ratios for missense variations (from gnomAD) of Gene1 and Gene2.
oe_lof: The product of the observed/expected ratios for loss-of-function variations (from gnomAD) of Gene1 and Gene2.
ExonGC: The product of the GC content of the exons (from gnomAD) of Gene1 and Gene2.
CtrlVarScore: The weighted (non-integerization) rare variant interaction burden in controls.
Residual: Deviance residual of the rare variant interaction burden in the regression model.
P: The probability of be equal or over the Standized Residue under a standard normal distribution.
BHFDR-q**: Q-value calculated by Benjamini Hochberg FDR method for multiple testing.

The QQ plot of P-values will stored in the output file named *.digene.mutationburden.case.qq.png.

Association at variants

See the detailed example options in the short tutorial

Conventional statistical association tests at each individual variants: --var-assoc

java -jar ./kggseq.jar --buildver hg19 --var-assoc --p-value-cutoff 0.01 --multiple-testing benfdr --qqplot Calculate p-values (by chi-squared test) for case-control association and odds ratio. The '--p-value-cutoff' sets p-value cutoffs for filtration; the '--multiple-testing' sets the approaches for multiple testing comparisons; and the '--qqplot' enables Q-Q plot of the association results.

Multiple testing method	Description
no	No multiple testing methods. Set a fixed p value cutoff by `--p-value-cutoff`.
benfdr	Benjamini-Hochberg method, which aims to control the FDR (specified by `--p-value-cutoff`), or "False Discovery Rate," rather than the FWE or "FamilyWise Error rate".
bonhol	Bonferroni-Holm method, which is less conservative and more powerful than the Standard Bonferroni method given a family-wise error rate specified by `--p-value-cutoff`.
bonf	Standard Bonferroni correction given a family-wise error rate specified by `--p-value-cutoff`.

The output of this function is a list of p-values and odds ratios at a variant.

PValueAllelic: The p-value calculated by chi-square test for allelic association between case-control status.
OddAllelic: The odds ratio of alleles in cases and controls.
PValueDom: The p-value calculated by chi-square test for genotype association between case-control status under dominant mode.
OddDom: The odds ratio of genotypes in cases and controls under dominant mode.
PValueRec: The p-value calculated by chi-square test for genotype association between case-control status under recessive mode.
OddRec: The odds ratio of genotypes in cases and controls under recessive mode.
PValueGeno: The p-value calculated by chi-square test for genotype association between case-control status.

Note Currently, this function only work for case-control study without confounding factors as covariables.

Variant-based association test by Rvtests: --rvtest-var

java -jar ./kggseq.jar --rvtest-var score,wald,exact,... --rvtest-vcf --rvtest-remove-set --vcf-file path/to/vcf/file --ped-file path/to/pedigree/file --phe phenotypeName --cov covariateName1,covariateName2,...,covariateNameN --out path/to/prefixname --excel --db-gene refgene KGGSeq will automatically download the C++ sources codes from Github and un-compress them into a folder named "plugin/rvtests-master" under the KGGSeq working folder. Users are required go into the "rvtests-master" to make an executable file from the C++ source codes. See more in a demonstration video .

Arguments/values	Annotation
--rvtest-var model names	Initiate the calling of RVTest for variants based association analysis. The model names are defined in Rvtests models. Currently, it supports score, wald, exact, dominantExact, famLRT, famScore, famGrammarGamma and firth models. The default models are "score,wald".
--rvtest-vcf	Ask KGGSeq to produce a new simplified VCF file, in which only genotypes are included in the FORMAT column, as the input of RVTests.
--rvtest-remove-set	Remove the intermediate file defining sets of variants for RVTests group-wise test in your hard disk.

An extra file with the suffix ".rvtest.variant.xlsx" will be created in the targeted output path.

Association at gene

See the detailed example options in the short tutorial

Gene-based association test by SKAT: --skat-gene

java -jar ./kggseq.jar --skat-gene --vcf-file path/to/vcf/file --ped-file path/to/pedigree/file --phe phenotypeName --cov covariateName1,covariateName2,...,covariateNameN --out path/to/prefixname --excel --db-gene refgene [--p-value-cutoff 0.01 --multiple-testing benfdr --qqplot] The gene-based analysis will be carried out by SKAT which was integrated as a plugin in KGGseq. The '--p-value-cutoff' sets p-value cutoffs for filtration; the '--multiple-testing' sets the approaches for multiple testing comparisons; and the '--qqplot' enables Q-Q plot of the association results.

Multiple testing method	Description
no	No multiple testing methods. Set a fixed p value cutoff by `--p-value-cutoff`.
benfdr	Benjamini-Hochberg method, which aims to control the FDR (specified by `--p-value-cutoff`), or "False Discovery Rate," rather than the FWE or "FamilyWise Error rate".
bonhol	Bonferroni-Holm method, which is less conservative and more powerful than the Standard Bonferroni method given a family-wise error rate specified by `--p-value-cutoff`.
bonf	Standard Bonferroni correction given a family-wise error rate specified by `--p-value-cutoff`.

An extra file with the suffix ".skat.gene.xlsx" will be created in the targeted output path. The '--skat-gene' will provide the p values of SKAT,SKATO and Burden test. If the phenotype is binary, the option '--skat-gene binary' can be used, but the speed will decrease.

SKAT:The p-value calculated by SKAT.
SKATO:The p-value calculated by SKAT-O.
Burden:The p-value calculated by Burden test.

Note To use this function, the user should install R language platform first.

Note When total minor allele count (MAC) is very small in a gene, SKATBinary ( '--skat-gene binary' ) has better performance than SKAT( '--skat-gene' ). However, because SKATBinary uses a resampling approach to get a p-value, it is slower (sometimes quite slower) than SKAT function. When MAC is large, the performance of the two tests is similar.

Note The names of phenotype ( '--phe' ) and co-variables ( '--cov' ) are defined in the pedigree file .

Note If the option --skat-cutoff N is added, the genes with variants less than N will not be involved into SKAT. The default value of N is 0.

Note If the option --perm-pheno is added, phenotypes will be permuted prior to association analysis.

Gene-based association test by Rvtests: --rvtest-gene

java -jar ./kggseq.jar --rvtest-gene cmc,cmat,price,skat[nPerm=1000:alpha=0.001:beta1=1:beta2=20],... --rvtest-vcf --rvtest-remove-set --vcf-file path/to/vcf/file --ped-file path/to/pedigree/file --phe phenotypeName --cov covariateName1,covariateName2,...,covariateNameN --out path/to/prefixname --excel --db-gene refgene KGGSeq will automatically download the C++ sources codes from Github and un-compress them into a folder named "plugin/rvtests-master" under the KGGSeq working folder. Users are required go into the "rvtests-master" to make an executable file from the C++ source codes. See more in a demonstration video .

Arguments/values	Annotation
--rvtest-gene model names	Initiate the calling of RVTest for gene based association analysis. The model names are defined in Rvtests models. Currently, it supports cmc, zeggini, mb, fp, exactCMC, cmcWald, rarecover and cmat models.The default models are "cmc,skato[nPerm=1000:alpha=0.001:beta1=1:beta2=20]".
--rvtest-vcf	Ask KGGSeq to produce a new VCF file as the input of RVTests. However, this is very time-consuming for large sample. It is advised to use `--o-vcf` to produce a VCF file with good quality once and to use that cleaned VCF file as the new KGGSeq input. In this scenario, the `--rvtest-vcf` is not needed, which may substantially reduce the analysis time.
--rvtest-remove-set	Remove the intermediate file defining sets of variants for RVTests group-wise test in your hard disk.
--rvtest-min-var N	Remove genes with too few variants (< N) for RVTests group-wise test.

An extra file with the suffix ".rvtest.gene.xlsx" will be created in the targeted output path.

Association at geneset

Geneset-based association test by SKAT: --skat-geneset

java -jar ./kggseq.jar --skat-geneset --geneset-file path/to/geneset/file --vcf-file path/to/vcf/file --ped-file path/to/pedigree/file --phe phenotypeName --cov covariateName1,covariateName2,...,covariateNameN --out path/to/prefixname --excel --db-gene refgene [--p-value-cutoff 0.01 --multiple-testing benfdr --qqplot] The gene-based analysis will be carried out by SKAT which was integrated as a plugin in KGGseq. The '--p-value-cutoff' sets p-value cutoffs for filtration; the '--multiple-testing' sets the approaches for multiple testing comparisons; and the '--qqplot' enables Q-Q plot of the association results.

Note, the pathway gene sets are specified in a text file in th following format:

[Column 1: GeneSet ID]
[Column 2: GeneSet URL]
[Column 3..N: Gene Symbols in the geneset; separated by spaces].
No title row is required!!!

e.g.,
----------------------------------------------
KEGG_GLYCOLYSIS_GLUCONEOGENESIS https://www.gsea-msigdb.org/gsea/msigdb/geneset_page.jsp?geneSetName=KEGG_GLYCOLYSIS_GLUCONEOGENESIS LDHC LDHB LDHA
KEGG_CITRATE_CYCLE_TCA_CYCLE https://www.gsea-msigdb.org/gsea/msigdb/geneset_page.jsp?geneSetName=KEGG_CITRATE_CYCLE_TCA_CYCLE GJB1 OGDHL OGDH PDHB IDH3G LDHB
...
...

An extra file with the suffix ".skat.geneset.xlsx" will be created in the targeted output path. The '--skat-geneset' will provide the p values of SKAT,SKATO and Burden test. If the phenotype is binary, the option '--skat-gene binary' can be used, but the speed will decrease.

SKAT:The p-value calculated by SKAT.
SKATO:The p-value calculated by SKAT-O.
Burden:The p-value calculated by Burden test.

Note To use this function, the user should install R language platform first.

Note The names of phenotype ( '--phe' ) and co-variables ( '--cov' ) are defined in the pedigree file .

Note If the option --skat-cutoff N is added, the genes with variants less than N will not be involved into SKAT. The default value of N is 0.

Note If the option --perm-pheno is added, phenotypes will be permuted prior to association analysis.

Geneset-based association test by Rvtests: --rvtest-geneset

java -jar ./kggseq.jar --rvtest-geneset cmc,skato[nPerm=1000:alpha=0.001:beta1=1:beta2=20],... --rvtest-vcf --rvtest-remove-set --geneset-file path/to/geneset/file --vcf-file path/to/vcf/file --ped-file path/to/pedigree/file --phe phenotypeName --cov covariateName1,covariateName2,...,covariateNameN --out path/to/prefixname --excel --db-gene refgene [--p-value-cutoff 0.01 --multiple-testing benfdr --qqplot] KGGSeq will automatically download the C++ sources codes from Github and un-compress them into a folder named "plugin/rvtests-master" under the KGGSeq working folder. Users are required go into the "rvtests-master" to make an executable file from the C++ source codes. See more in a demonstration video .

Arguments/values	Annotation
--rvtest-geneset model names	Initiate the calling of RVTest for gene based association analysis. The model names are defined in Rvtests models. It supports most of the models, e.g., cmc, zeggini, mb, fp, exactCMC, cmcWald, rarecover,skato and cmat. The default models are "cmc,skato[nPerm=1000:alpha=0.001:beta1=1:beta2=20]".
--rvtest-vcf	Ask KGGSeq to produce a new VCF file as the input of RVTests. However, this is very time-consuming for large sample. It is advised to use `--o-vcf` to produce a VCF file with good quality once and to use that cleaned VCF file as the new KGGSeq input. In this scenario, the `--rvtest-vcf` is not needed, which may substantially reduce the analysis time.
--rvtest-remove-set	Remove the intermediate file defining set of variants for RVTests group-wise test.

An extra file with the suffix ".rvtest.geneset.xlsx" will be created in the targeted output path. The '--rvtest-geneset' will provide the p values of multiple set-based tests for rare variants.

Plotting

Generate a histogram of minor allele frequency (MAF) according to allele frequencies in reference dataset --mafplot-ref or in sample --mafplot-sample. java -jar kggseq.jar --vcf-file path/to/file1 --ped-file path/to/file2 --out path/to/prefixname --db-filter 1kg201204,dbsnp141,ESP6500AA,ESP6500EA --allele-freq 0,1 --mafplot-ref java -jar kggseq.jar --vcf-file path/to/file1 --ped-file path/to/file2 --out path/to/prefixname --mafplot-sample Note Currently, --mafplot-sample only supports the inputs specified by --vcf-file.

Generate a Quantile-Quantile (Q-Q) plot for association p-values --qqplot java -jar kggseq.jar --vcf-file path/to/file --out path/to/prefixname --qqplot --skat-gene [or] --var-assoc

Miscellaneous

Calculate pairwise the linkage disequilibrium or genotypic correlation --calc-ld. java -Xmx15g -jar kggseq.jar --vcf-file path/to/file --ped-file path/to/file --out path/to/prefixname --calc-ld --db-filter 1kg201204,dbsnp141,ESP6500AA,ESP6500EA --allele-freq 0.05,0.95 --regions-in chr4:21212-233454 If the genotypes are phased, KGGSeq will calculate the linkage disequilibrium coefficient r-square. For uphased genotypes, KGGSeq will calculate the Pearson genotypic correlation of variants, in which the genotypes are encoded by the number of alternative alleles.
Note Currently, --calc-ld only supports variants with two alleles.
Note By default, KGGSeq will only calculate LD between variants within a window of 100,000,000 base-pairs. The window size can be changed by --ld-win X , in which X denotes a distance in base-pair.
Note By default, only the r-square or correlation square >0.01 will be saved in an output file. This can be changed by --ld-out X, in which X denotes r-square or correlation square. .

The results will be outputted in a text file with the suffix ".ld" or ".ld.gz". The following is an exmaple of the output format:

[chromsome] [coordinate of the first variant] [coordinate of the second variant] [r-square OR correlation]
22 16051249 16052962 0.8680002633575351
22 16051249 16053659 0.13732498950210686
22 16051249 16053862 0.984494959737711

Prune variants in high LD (r or genotypic correlation square >= x) --ld-prune X. java -Xmx15g -jar kggseq.jar --vcf-file path/to/file --ped-file path/to/file --out path/to/prefixname --ld-prune 0.9 --filter-sample-maf-le 0.05 --o-svcf KGGSeq will calculate the pairwise LD between variants within a window of 100,000,000 base-pairs. If a pair of variants have r-square or correlation square over 0.9, the variation with larger coordinate will be pruned. The variants with minor allele frequency <=5% are excluded (See details on --filter-sample-maf-le). The window size can be changed by --ld-win X, in which X denotes a distance in base-pair.

FAQ?

1) Can a compressed VCF file be processed in parallel?

Yes, it can. However, the files has to be compressed in a blocked GNU Zip format or Burrows–Wheeler format. See more here.

2) Does the KGGSeq binary genotype format support variants of over two alleles and/or phased genotypes?

Yes, it does support variants of over two alleles and/or phased genotypes. See more here.

3) How are the settings '--rare-allele-freq c' and '--allele-freq 0,c' different?

'--rare-allele-freq c' will excluded variants with alternative allele frequency EQUAL to or over c in the reference datasets; and '--allele-freq 0,c' will also excluded variants having no allele frequency in the reference datasets beside the variants excluded by '--rare-allele-freq c'

4) Does KGGSeq provide gene expression information in the output?

A project aiming to harness tissue-specific expression information to prioritize candidate genes is underway.

5) How shall I prepare the candidate gene list for protein-interaction and geneset annotation? What is the gene list for?

The candidate gene list can include any gene interesting to the user. It may be verified causal gene for the disease, or potential causal gene. The gene list is used to fish more hidden interesting genes in order to narrow down the searching region. If no gene list provided, KGGSeq can still perform routine ‘filtration + sharing’ strategy for sequencing data.

6)Suppose I have in-house control genomes in VCF format, how can I transform it into KGGSeq acceptable file?

There are two alternative ways to allow you use the in-house VCF data (containing genotypes of multiple individuals) as filter.

1. Convert the VCF data into a standard KGGSeq local filter format by the command: --make-filter --vcf-file path/to/vcffile --out test.var first. And then set the --local-filter test.hg19.var

2. Directly set --local-filter-vcf path/to/vcffile. However, this way is less efficient than the above way as it will take additional time to parse the VCF data.

7) Is there a practical guidance for MAF selection?

MAF selection is an open question, but generally three suggestions can be adopted: for rare dominant Mendelian disease, MAF cut-off at 0 or 0.005 is reasonable; for rare recessive Mendelian disease, MAF cut-off at 0 or 0.03 is reasonable; for common complex disease of relatively rare alleles with large effect size, MAF cut-off is dependent on disease prevalence, penetrance and genetic model, so a higher cut-off like 0.05 might be more reasonable.

8) Can I use ANNOVAR and KGGSeq together on my dataset?

KGGSeq is quite flexible for interacting with other sequence-oriented analytical programs/software (including SOAPSNP, ANNOVAR, PLINK/seq, VAAST, etc). It can accept various input formats, and output different kinds of sequence data. In case of ANNOVAR, KGGSeq can read ANNOVAR-formatted sequence variants, and write the final remaining variants in ANNOVAR format.

9) Can I run KGGSeq on my Lenovo or Mac laptop? Is it time consuming to run a complete KGGSeq process?

Normally, KGGSeq run well and fast with >=1 GB RAM memory. Hence current laptop are certainty affordable for running KGGSeq. The whole process need only <10 minutes, unless PubMed searching is set in the parameter file. For PubMed searching, it really depends on the speed of the network and also the loading of the NCBI PubMed database.

10) How do I report an error or bugs to KGGSeq?

You are welcomed to join our google group (http://groups.google.com/group/kggseq_user?hl=en). This site is used for communication and discussion of KGGSeq usage and functions. You can also directly write an email to limx54@yahoo.com, or kggseq_user@googlegroups.com.

KGGSeq: A biological Knowledge-based mining platform for Genomic and Genetic studies using Sequence data

Quick example

Options

Example: chr startpos endpos ref alt comment1 comment2 comment3 1 69428 69428 T G T 92 129 1 69476 69476 T C T 1 0

fid iid fatid motid sex DISEASE QT AGE 1 NA12344 NA12347 NA12348 1 1 94.17 NA 1 NA12347 0 0 1 NA 19.54 40.0 1 NA12348 0 0 2 2 NA 36.6 Note The head line must start with 'fid iid'. Otherwise, KGGSeq will regard the first row as a subject-line. Note The missing phenotype values are denoted by "NA".

CHR START END 1 6134174 10082841 1 202855724 207862411 Note Title line is needed.

IBDRegion: The region with IBD information. IBDRegionAnnot: The annotation information at the region.

Chrom Position Reference Alternative Freq 1 469 C G 0.150 1 492 C T 0.175 1 519 G C 0.067 1 874290 - 0ACAGAG 0.809 1 875913 CAG 3 0.8431 Note The title line is needed and Freq is optional.

ada_score@dbScSNV: The prediction score by ADA approach which combines multiple existing tools. See more in Jian et al (2014)

DGVIDs: the ID in the Database of Genomic Variants. CNVSampleSize: the size of sample, used to derive the structure. LossCNV: Number of subjects having loss of copy number variation. GainCNV: Number of subjects having excessive copy number variation.

GeneSymbol1 GeneSymbol2 GeneSymbol3 GeneSymbol4

IsWithinCandidateGene: Whether the variant is within a candidate gene PPI: PPI shared by at least one gene of interest and the gene containing the variant

IsWithinCandidateGene: Whether the variant is within a candidate gene SharedGeneSet: GeneSet shared by at least one gene of interest and the gene containing the variant

MousePhenotype (MGI):mouse phenotypes of ortholog genes. Multiple phenotypes are separated by '||'. MousePhenotype (IMPC):mouse phenotypes of ortholog genes. Multiple phenotypes are separated by '||'.

ZebrafishPhenotype :Affected structure or process subterm after the knock-out of the human ortholog genes in Zebrafish.

DDDPhenotype : confirmation status;disease;pubmed_ids

PubMedIDIdeogram : PubMed ID of articles in which the term and the cytogeneic position of the variant are co-mentioned PubMedIDGene : PubMed ID of articles in which the term and the gene containing the variant are co-mentioned

COSMICCancerInfo : known cancer and occurrence time of the somatic mutation in COSMIC dataset. e.g., {serous_carcinoma=3, astrocytoma_Grade_IV=1} This somatic mutation occurred 3 times in samples of 'serous_carcinoma' and once in samples of 'astrocytoma_Grade_IV'.

BF: Bayes factor Composite_Prob:Posterior probability of being regualtory given selected functional prediction scores. Cell_Prob: Cell type-specific regulatory potential. Combine_Prob: Combined posterior regulatory probability.

IsComplexDiseasePathogenic: whether the variant is predicted to be pathogenic by the model trained with known pathogenic variants and neutral control variants. RandomForestScore: the pathogenic scores of the random forest prediction model.

TissueWithSpecificExpression: the transcripts and tissues or cell-types where it is pecifically expressed. Multiple tissues and cell-types are delimited by commas and multiple transcripts are delimited by semicolons.

Cancer-driver gene mutation burden test by Iterative zero-truncated negative-binomial regression (ITER) based on somatic mutations

Cancer-driver gene mutation burden test by weighted iterative zero-truncated negative-binomial regression (WITER) based on somatic mutations

Cancer-driver gene mutation burden test by weighted iterative zero-truncated negative-binomial regression with reference sample (WITER-ref) based on somatic mutations

... MergedAlleleNum:The number of weighted mutant alleles from reference samples.

Detect disease susceptibility gene mutation burden test by recursive truncated negative-binomial regression (RUNNER) for complex diseases

Gene score1 score1 ... SAMD11 2.114 0.076914032 ... NOC2L 2.345 0.046528595 ... KLHL17 1.989 0.026329064 ... PLEKHN1 2.2 0.047853652 ... ... ... ... ...

Chromosome POS Reference AlternativeAllele Weights 1 865580 G A 0.000568741 1 865592 A C 0.018698691 1 865544 C +GGCT . 1 865656 G T 0.011239452 1 865701 G A . ... ... ... ... ...

Detect Gene-Gene Interaction by a Recursive trUNcated NEgative-binomial Regression model (GGI-RUNNER) for complex diseases

SKAT:The p-value calculated by SKAT. SKATO:The p-value calculated by SKAT-O. Burden:The p-value calculated by Burden test.

SKAT:The p-value calculated by SKAT. SKATO:The p-value calculated by SKAT-O. Burden:The p-value calculated by Burden test.

[chromsome] [coordinate of the first variant] [coordinate of the second variant] [r-square OR correlation] 22 16051249 16052962 0.8680002633575351 22 16051249 16053659 0.13732498950210686 22 16051249 16053862 0.984494959737711

Example:
chr startpos endpos ref alt comment1 comment2 comment3
1 69428 69428 T G T 92 129
1 69476 69476 T C T 1 0

fid iid fatid motid sex DISEASE QT AGE 1 NA12344 NA12347 NA12348 1 1 94.17 NA 1 NA12347 0 0 1 NA 19.54 40.0 1 NA12348 0 0 2 2 NA 36.6
Note The head line must start with 'fid iid'. Otherwise, KGGSeq will regard the first row as a subject-line.
Note The missing phenotype values are denoted by "NA".

CHR START END
1 6134174 10082841
1 202855724 207862411

Note Title line is needed.

IBDRegion: The region with IBD information.
IBDRegionAnnot: The annotation information at the region.

Chrom Position Reference Alternative Freq
1 469 C G 0.150
1 492 C T 0.175
1 519 G C 0.067
1 874290 - 0ACAGAG 0.809
1 875913 CAG 3 0.8431

Note The title line is needed and Freq is optional.

DGVIDs: the ID in the Database of Genomic Variants.
CNVSampleSize: the size of sample, used to derive the structure.
LossCNV: Number of subjects having loss of copy number variation.
GainCNV: Number of subjects having excessive copy number variation.

GeneSymbol1
GeneSymbol2
GeneSymbol3
GeneSymbol4

IsWithinCandidateGene: Whether the variant is within a candidate gene
PPI: PPI shared by at least one gene of interest and the gene containing the variant

IsWithinCandidateGene: Whether the variant is within a candidate gene
SharedGeneSet: GeneSet shared by at least one gene of interest and the gene containing the variant

MousePhenotype (MGI):mouse phenotypes of ortholog genes. Multiple phenotypes are separated by '||'.
MousePhenotype (IMPC):mouse phenotypes of ortholog genes. Multiple phenotypes are separated by '||'.

PubMedIDIdeogram : PubMed ID of articles in which the term and the cytogeneic position of the variant are co-mentioned
PubMedIDGene : PubMed ID of articles in which the term and the gene containing the variant are co-mentioned

COSMICCancerInfo : known cancer and occurrence time of the somatic mutation in COSMIC dataset. e.g., {serous_carcinoma=3, astrocytoma_Grade_IV=1}
This somatic mutation occurred 3 times in samples of 'serous_carcinoma' and once in samples of 'astrocytoma_Grade_IV'.

BF: Bayes factor
Composite_Prob:Posterior probability of being regualtory given selected functional prediction scores.
Cell_Prob: Cell type-specific regulatory potential.
Combine_Prob: Combined posterior regulatory probability.

IsComplexDiseasePathogenic: whether the variant is predicted to be pathogenic by the model trained with known pathogenic variants and neutral control variants.
RandomForestScore: the pathogenic scores of the random forest prediction model.

...
MergedAlleleNum:The number of weighted mutant alleles from reference samples.

Gene score1 score1 ...
SAMD11 2.114 0.076914032 ...
NOC2L 2.345 0.046528595 ...
KLHL17 1.989 0.026329064 ...
PLEKHN1 2.2 0.047853652 ...
... ... ... ...

Chromosome POS Reference AlternativeAllele Weights
1 865580 G A 0.000568741
1 865592 A C 0.018698691
1 865544 C +GGCT .
1 865656 G T 0.011239452
1 865701 G A .
... ... ... ... ...

SKAT:The p-value calculated by SKAT.
SKATO:The p-value calculated by SKAT-O.
Burden:The p-value calculated by Burden test.

SKAT:The p-value calculated by SKAT.
SKATO:The p-value calculated by SKAT-O.
Burden:The p-value calculated by Burden test.

[chromsome] [coordinate of the first variant] [coordinate of the second variant] [r-square OR correlation]
22 16051249 16052962 0.8680002633575351
22 16051249 16053659 0.13732498950210686
22 16051249 16053862 0.984494959737711