Input
The main input of KGGSum includes the GWAS summary data in a text file and reference genotypes of GWAS in VCF or GTB formats.
GWAS summary data file
The GWAS summary data file includes variants with GWAS association summaries arranged in rows, with columns separated by whitespace or tabs. The minimal required columns are genomic coordinates and p-values (CHR, BP, and P by default). Additional statistical attributes may be needed for specific analyses; these requirements are detailed in the descriptions of the corresponding analysis functions.
CHR | POS | A1 | A2 | AF_A1 | Beta | SE | P |
---|---|---|---|---|---|---|---|
1 | 86028 | T | C | 0.908392 | -0.00108 | 0.00356 | 0.7626 |
1 | 693731 | A | G | 0.877949 | -0.00034 | 0.00339 | 0.9209 |
1 | 713092 | A | G | 0.00623653 | -0.04053 | 0.02491 | 0.1038 |
1 | 714596 | T | C | 0.962227 | -0.00045 | 0.00615 | 0.9407 |
1 | 715205 | C | G | 0.993317 | 0.03543 | 0.02428 | 0.1445 |
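Before passing a file to KGGSum, it can help to confirm that the columns you intend to use are present in the header. The following is a minimal Python sketch and is not part of KGGSum; the function name `missing_columns` is hypothetical, and the column names passed to it are taken from the example table above.

```python
import io

def missing_columns(handle, required):
    """Return the required column names that are absent from the header line.

    The header is split on any run of whitespace, which covers both the
    space-separated and tab-separated layouts described above.
    """
    header = handle.readline().split()
    return [name for name in required if name not in header]

# Example using the header and first row of the table above.
example = io.StringIO(
    "CHR POS A1 A2 AF_A1 Beta SE P\n"
    "1 86028 T C 0.908392 -0.00108 0.00356 0.7626\n"
)
print(missing_columns(example, ["CHR", "POS", "P"]))  # prints []
```

An empty list means every requested column was found; any names returned would need to be mapped or added before the analysis.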
The corresponding option for the GWAS summary input file is --sum-file or -sf. Further settings for the variants and statistics can be specified within this option.
Format
--sum-file file [cp12Cols=CHR,POS,..] [r12Cols=RSID,..] [pbsCols=P,...] [refG=] [sep=] [freqA1Col=] [sampleSizeCols=] [betaType=<0/1/2>] [prevalence=] [exclude=]
Example
--sum-file ./CAD_UKBIOBANK.gz cp12Cols=chr,bp,a1,a2 pbsCols=pval,beta,se refG=hg19 freqA1Col=AF_A1 exclude=chr6:545554~444555545
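When KGGSum is driven from a script, the key=value settings of --sum-file can be assembled programmatically. The sketch below is illustrative only: the helper `sum_file_args` is a hypothetical name, and how KGGSum itself is launched is not covered here. It only builds the tokens shown in the example above.

```python
def sum_file_args(path, **settings):
    """Build the token list for --sum-file from keyword settings.

    Each setting becomes one key=value token; list or tuple values are
    joined with commas, as in cp12Cols=chr,bp,a1,a2.
    """
    tokens = ["--sum-file", path]
    for key, value in settings.items():
        if isinstance(value, (list, tuple)):
            value = ",".join(str(v) for v in value)
        tokens.append(f"{key}={value}")
    return tokens

args = sum_file_args(
    "./CAD_UKBIOBANK.gz",
    cp12Cols=["chr", "bp", "a1", "a2"],
    pbsCols=["pval", "beta", "se"],
    refG="hg19",
    freqA1Col="AF_A1",
)
print(" ".join(args))
```

The resulting tokens would be appended to the rest of the KGGSum command line.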
Options
file
specifies the path to the GWAS summary statistics. This can be a local file, an internet URL, or an intranet file path accessed via SFTP. Note: A local file path allows wildcards (e.g., '*.tsv') to specify multiple files as a single input. KGGSum will process the files one by one.
cp12Cols
specifies the column names in the summary file for chromosome, position, effective (VCF alternative, A1) allele, and base (VCF reference, A2) allele.
r12Cols
specifies the column names in the summary file for dbSNP rs ID, effective allele (VCF alternative, A1), and base allele (VCF reference, A2). The corresponding coordinates will be retrieved from the dbSNP database. Ensure the database file in GTB format is downloaded, unzipped, and placed in KGGSum’s working directory: ./resources/dbsnp/*.gtb. Note that r12Cols and cp12Cols are mutually exclusive.
pbsCols
specifies the column names in the summary file for p-values, effect sizes, and standard errors of the effect sizes.
type
specifies the file type of the GWAS summary file. The default is the TSV format. Two alternative formats, VCF and GTB, are also supported.
refG
specifies the reference genome of the input variants. The default is refG=hg19. Note that specifying the wrong genome version will cause GWAS variants to be mismatched with the annotation variants. All built-in annotations of KGGSum are based on hg38.
sep
specifies the separator of the summary file. By default, it recognizes tabs and spaces. Four values are recognized; UNIVERSAL means tabs, spaces, or commas. The default is sep=UNIVERSAL.
freqA1Col
specifies the column containing the frequency of the A1 allele.
sampleSizeCols
specifies the columns for the sample sizes of cases and controls. If only one column is specified, it is taken as the total sample size.
betaType
specifies the type of effect size (0, 1, or 2). 0 means coefficients of linear regression for a quantitative phenotype; 1 means coefficients of logistic regression or the logarithm of the odds ratio for a qualitative phenotype; 2 means the odds ratio for a binary phenotype. The default is betaType=1.
prevalence
specifies the disease prevalence in a population. This is only required for a GWAS of disease phenotypes.
exclude
specifies the genomic regions of variants to be excluded.
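The betaType values 1 and 2 describe the same quantity on two scales: an odds ratio and its logarithm. As a small illustration, the Python sketch below converts one to the other. It assumes the natural logarithm, which is the scale of logistic regression coefficients; the function name is hypothetical and the code is not part of KGGSum.

```python
import math

def odds_ratio_to_log_odds(odds_ratio):
    """Convert an odds ratio (betaType=2 scale) to its natural log (betaType=1 scale)."""
    if odds_ratio <= 0:
        raise ValueError("an odds ratio must be positive")
    return math.log(odds_ratio)

# An odds ratio of 1 (no effect) corresponds to a log odds ratio of 0.
print(odds_ratio_to_log_odds(1.0))  # prints 0.0
```

Declaring the wrong betaType would put effect sizes on the wrong scale, so it is worth checking which one a summary file reports.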
Reference genotypes for linkage-disequilibrium calculation
Some analysis methods require genotype data from the GWAS samples to correct for linkage disequilibrium (LD). However, such genotypes are often unavailable. In these cases, ancestry-matched reference genotypes can be used as a substitute in KGGSum. Suitable reference datasets include genotypes from the 1000 Genomes Project or the UK Biobank. These references are primarily used to estimate LD for common variants. Ideally, the reference dataset should include between 500 and 5000 subjects; larger datasets, while more comprehensive, may increase computation time. KGGSum supports two genotype file formats, VCF and GenoType Block (GTB), specified with the option --ref-gty-file.
Option | Description | Default |
---|---|---|
--ref-gty-file | Specify the input file. This is a combination of multiple parameters: type specifies the format of the input file, and refG specifies the reference genome of the input variants. Format: --ref-gty-file <file> type=[AUTO/VCF/GTB] refG=[hg18/hg19/hg38] Example: --ref-gty-file ./example.vcf.gz type=VCF refG=hg38 | type=AUTO refG=hg19 |
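To see whether a VCF reference panel falls in the suggested range of 500 to 5000 subjects, the sample columns on its header line can be counted. The sketch below is a standalone Python helper, not a KGGSum feature; the function name `count_vcf_samples` is hypothetical. It relies only on the standard VCF layout, in which sample columns follow the nine fixed columns on the #CHROM line.

```python
import gzip

def count_vcf_samples(path):
    """Count the sample columns on the #CHROM header line of a VCF file."""
    # Plain gzip reading also handles bgzip-compressed .vcf.gz files.
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as handle:
        for line in handle:
            if line.startswith("#CHROM"):
                fields = line.rstrip("\n").split("\t")
                # Fixed columns: CHROM POS ID REF ALT QUAL FILTER INFO FORMAT
                return max(len(fields) - 9, 0)
    raise ValueError("no #CHROM header line found")

# Usage (the path is a placeholder):
# n_subjects = count_vcf_samples("./example.vcf.gz")
```

A count far outside the suggested range may be a cue to choose a different panel or to subsample it.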
Gene Score Profiles
The gene score profile contains various values representing different contexts or conditions, such as RNA or protein expression levels or perturbation effects. Each row corresponds to a gene, and the columns, separated by tabs, represent different contexts or conditions. For each context or condition, two columns can be provided: one for the mean (labeled with .mean) and another for the standard error (SE, labeled with .SE). While the SE column is optional, including it can enhance the accuracy of the analysis. Below is an example of the file format.
Gene | Adipose-Subcutaneous.mean | Adipose-Subcutaneous.SE | Adipose-VisceralOmentum.mean | Adipose-VisceralOmentum.SE | … |
---|---|---|---|---|---|
ENSG00000223972.5 | 0.0038016 | 0.00036668 | 0.0045709 | 0.00046303 | … |
ENSG00000227232.5 | 1.9911 | 0.030021 | 1.8841 | 0.040247 | … |
ENSG00000278267.1 | 0.00049215 | 0.00010645 | 0.00036466 | 9.29E-05 | … |
ENSG00000243485.5 | 0.0047772 | 0.00038018 | 0.0067897 | 0.00074318 | … |
ENSG00000237613.2 | 0.0030462 | 0.00027513 | 0.0030465 | 0.00031694 | … |
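The .mean and .SE naming convention can be checked mechanically before an analysis. The following Python sketch, which is illustrative and not part of KGGSum, reads a header line and reports which conditions come with an SE column; the function name `conditions_from_header` is hypothetical.

```python
def conditions_from_header(header_line):
    """Map each condition name to whether an .SE column accompanies its .mean column."""
    columns = header_line.rstrip("\n").split("\t")[1:]  # skip the gene column
    has_se = {}
    # First pass: every .mean column defines a condition.
    for name in columns:
        if name.endswith(".mean"):
            has_se.setdefault(name[: -len(".mean")], False)
    # Second pass: mark conditions that also have an .SE column.
    for name in columns:
        if name.endswith(".SE") and name[: -len(".SE")] in has_se:
            has_se[name[: -len(".SE")]] = True
    return has_se

header = (
    "Gene\tAdipose-Subcutaneous.mean\tAdipose-Subcutaneous.SE\t"
    "Adipose-VisceralOmentum.mean\tAdipose-VisceralOmentum.SE"
)
print(conditions_from_header(header))
```

Conditions mapped to False have a mean but no standard error, which is allowed because the SE column is optional.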
The path and relevant settings can be specified with the following option.
Option | Description |
---|---|
--gene-score-file | The scores can represent various attributes, such as RNA expression, protein expression, epigenetic markers, or perturbation profiles at genes. Each row corresponds to a gene, and each column (except the first) represents a condition. The first column should contain the gene symbols. This is a combination parameter with the following options. file: specifies the file path of the gene score file, which can be a local path or a remote path accessed via a network. Note: a local file path allows wildcards (e.g., 'brain*.tsv') to specify multiple files as a single input. calcSpecificity: triggers the calculation of the specificity of gene scores for each condition. The default is "y(es)". noDirection: instructs KGGSum to ignore the directionality of specificity. The default is "y(es)". Format: --gene-score-file file=file/path calcSpecificity=<y/n> noDirection=<y/n> Example: --gene-score-file file/path calcSpecificity=y |
xQTL summary data
This dataset is used to link variants to their target genes, typically using each gene’s eQTL summary statistics. Each row represents an eQTL and must include the following nine columns: gene symbol, gene ID, chromosome, position, p-value, effective (alternative) allele, base (reference) allele, effect size, and standard error. Below is an example of the file format.
symbol | id | chr | pos | ref | alt | altfreq | beta | se | p |
---|---|---|---|---|---|---|---|---|---|
LINC00115 | ENSG00000225880 | 1 | 796375 | T | C | 0.149 | -0.223 | 0.081 | 5.87E-03 |
LINC00115 | ENSG00000225880 | 1 | 797440 | T | C | 0.159 | -0.24 | 0.078 | 2.28E-03 |
LINC00115 | ENSG00000225880 | 1 | 802496 | C | T | 0.146 | -0.247 | 0.083 | 2.95E-03 |
LINC00115 | ENSG00000225880 | 1 | 812743 | C | T | 0.17 | -0.19 | 0.073 | 9.57E-03 |
LINC01128 | ENSG00000228794 | 1 | 693731 | A | G | 0.118 | -0.258 | 0.094 | 6.31E-03 |
LINC01128 | ENSG00000228794 | 1 | 731718 | T | C | 0.151 | -0.293 | 0.084 | 4.50E-04 |
The path and relevant settings can be specified with the following option.
Option | Description |
---|---|
--xqtl-file | Specify the xQTL summary file. In the file, one row represents a genetic variant with its association summary to a gene. The association can be calculated based on various gene characteristics, including RNA expression (eQTL), RNA splicing (sQTL), protein expression (pQTL), and methylation (mQTL). The first column should contain the gene symbols. This is a combination parameter with the following options. file: specifies the path to the xQTL summary statistics. This can be a local file, an internet URL, or an intranet file path accessed via SFTP. Note: a local file path allows wildcards (e.g., a*b?.qtl.tsv.gz) to specify multiple files as a single input. cp12Cols: specifies the column names in the summary file for chromosome, position, effective (VCF alternative, A1) allele, and base (VCF reference, A2) allele. pbsCols: specifies the column names in the summary file for p-values, effect sizes, and standard errors of the effect sizes. giCols: specifies the column names in the summary file for gene symbol and gene ID. freqA1Col: specifies the column containing the frequency of the A1 allele. sampleSizeCols: specifies the columns for the sample sizes of the xQTL. refG: specifies the reference genome of the input variants. The default value is hg19. sep: specifies the separator of the summary file. By default, it recognizes tabs, spaces, and commas; the corresponding tag is UNIVERSAL. pCut: specifies the p-value threshold for selecting significant xQTLs for subsequent analyses. The default value is pCut=1E-6. ldCut: specifies the LD r² threshold for pruning highly redundant xQTLs. The default value is ldCut=0.8. Format: --xqtl-file file=file/path [cp12Cols=chr,pos,alt,ref] [pbsCols=p,beta,se] [giCols=symbol,id] [refG=hg19] [sep=TAB] [freqA1Col=altfreq] [sampleSizeCols=neff] [pCut=1E-6] [ldCut=0.8] |
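To preview which rows a given pCut would keep, the p-value column can be filtered outside KGGSum. The Python sketch below is an illustration under assumptions: the function name `significant_xqtls` is hypothetical, the comparison is taken to be "less than or equal to" (the exact rule used by KGGSum is not stated here), and LD pruning with ldCut is not reproduced.

```python
import csv
import io

def significant_xqtls(handle, p_column="p", p_cut=1e-6):
    """Yield rows of a tab-separated xQTL table whose p-value is at most p_cut."""
    for row in csv.DictReader(handle, delimiter="\t"):
        if float(row[p_column]) <= p_cut:
            yield row

# Two rows from the example table above.
example = io.StringIO(
    "symbol\tid\tchr\tpos\tref\talt\taltfreq\tbeta\tse\tp\n"
    "LINC00115\tENSG00000225880\t1\t796375\tT\tC\t0.149\t-0.223\t0.081\t5.87E-03\n"
    "LINC01128\tENSG00000228794\t1\t731718\tT\tC\t0.151\t-0.293\t0.084\t4.50E-04\n"
)
# A looser threshold than the default is used so that an example row can pass.
kept = list(significant_xqtls(example, p_cut=1e-3))
print([row["pos"] for row in kept])  # prints ['731718']
```

With the default pCut=1E-6, none of the rows in the example table would be selected.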