Input

The main inputs of KGGSum are GWAS summary data in a text file and reference genotypes in VCF or GTB format.

GWAS summary data file

The GWAS summary data file includes variants with GWAS association summaries arranged in rows, with columns separated by whitespace or tabs. The minimal required columns are genomic coordinates and p-values (CHR, BP, and P by default). Additional statistical attributes may be needed for specific analyses; these requirements are detailed in the descriptions of the corresponding analysis functions.

CHR	POS	A1	A2	AF_A1	Beta	SE	P
1	86028	T	C	0.908392	-0.00108	0.00356	0.7626
1	693731	A	G	0.877949	-0.00034	0.00339	0.9209
1	713092	A	G	0.00623653	-0.04053	0.02491	0.1038
1	714596	T	C	0.962227	-0.00045	0.00615	0.9407
1	715205	C	G	0.993317	0.03543	0.02428	0.1445
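Before running an analysis, it can help to confirm that a summary file actually contains the columns you intend to use. The following is a minimal Python sketch, not part of KGGSum; the helper name `missing_columns` and the in-memory sample are illustrative assumptions.

```python
import io

def missing_columns(handle, required):
    """Return the required column names that are absent from the header line.

    Hypothetical helper, not part of KGGSum. The header is split on any run
    of whitespace, which covers both tab- and space-separated files.
    """
    header = handle.readline().split()
    return [name for name in required if name not in header]

# Header and first data row copied from the example above.
sample = io.StringIO(
    "CHR\tPOS\tA1\tA2\tAF_A1\tBeta\tSE\tP\n"
    "1\t86028\tT\tC\t0.908392\t-0.00108\t0.00356\t0.7626\n"
)
print(missing_columns(sample, ["CHR", "POS", "P"]))
```

A non-empty result lists the names that would need to be mapped to the file's actual column headers.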

The GWAS summary input file is specified with --sum-file (or -sf). Further settings for the variants and statistics can be given as sub-options of the same option.

Format

--sum-file file [cp12Cols=CHR,POS,..] [r12Cols=RSID,..] [pbsCols=P,...] [refG=] [sep=] [freqA1Col=] [sampleSizeCols=] [betaType=<0/1/2>] [prevalence=] [exclude=]

Example

--sum-file ./CAD_UKBIOBANK.gz cp12Cols=chr,bp,a1,a2 pbsCols=pval,beta,se refG=hg19 freqA1Col=AF_A1 exclude=chr6:27477797~35448354

Options

file specifies the path to the GWAS summary statistics. This can be a local file, an internet URL, or an intranet file path accessed via SFTP. Note: A LOCAL file path allows wildcards (say, '*.tsv') to specify multiple files as a single input. KGGSum will process the files one by one.
cp12Cols specifies the column names in the summary file for chromosome, position, effective (VCF alternative, A1) allele, and base (VCF reference, A2) allele.
r12Cols specifies the column names in the summary file: dbSNP rs ID, effective allele (VCF alternative, A1), and base allele (VCF reference, A2). The corresponding coordinates will be retrieved from the dbSNP database. Ensure the database file in GTB format is downloaded, unzipped, and placed in KGGSum’s working directory: ./resources/dbsnp/*.gtb. Note that r12Cols and cp12Cols are mutually exclusive.
pbsCols specifies the column names in the summary file for p-values, effect sizes, and standard errors of the effect sizes.
type specifies the file type of the GWAS summary file. The default is TSV; VCF and GTB are also supported.
refG specifies the reference genome of the input variants. The default is refG=hg38. Note that an incorrect genome version will cause GWAS variants to be mismatched with the annotation variants. All built-in annotations of KGGSum use hg38.
sep specifies the column separator of the summary file. It recognizes four values; UNIVERSAL matches tabs, spaces, or commas. The default is sep=UNIVERSAL.
freqA1Col specifies the column containing the frequency of the A1 allele.
sampleSizeCols specifies the columns for the sample sizes of cases and controls. If only one column is specified, it is taken as the total sample size.
betaType specifies the type of effect size (0, 1, or 2): 0 means linear-regression coefficients for a quantitative phenotype; 1 means logistic-regression coefficients or the logarithm of the odds ratio for a qualitative phenotype; 2 means the odds ratio for a binary phenotype. The default is betaType=1.
prevalence specifies the disease prevalence in a population. This is only required for a GWAS of disease phenotypes.
exclude specifies genomic regions whose variants are excluded, e.g., exclude=chr6:27477797~35448354 excludes the HLA region (hg19), which avoids time-consuming computation caused by its complex LD patterns.
genomicControl asks KGGSum to adjust the p-values and chi-square statistics by a genomic control (GC) factor from the input GWAS data before all follow-up analyses. genomicControl=-1 uses the GC factor calculated from the input p-values.
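For orientation, the GC factor is conventionally defined as the median observed chi-square statistic (1 degree of freedom) divided by the expected median under the null. The Python sketch below follows that textbook definition; it is an assumption that KGGSum computes the factor the same way, and the function names are hypothetical.

```python
from statistics import NormalDist, median

_STD_NORMAL = NormalDist()
# Median of a chi-square distribution with 1 degree of freedom (about 0.455),
# obtained as the square of the 75th percentile of the standard normal.
_CHI2_1DF_MEDIAN = _STD_NORMAL.inv_cdf(0.75) ** 2

def gc_lambda(p_values):
    """Genomic control factor: median observed chi-square / expected median.

    Hypothetical helper illustrating the standard definition only.
    """
    # A two-sided p-value p corresponds to z = inv_cdf(p / 2); chi-square = z^2.
    # Values outside (0, 1] are skipped because they cannot be converted.
    chi2 = [_STD_NORMAL.inv_cdf(p / 2) ** 2 for p in p_values if 0 < p <= 1]
    return median(chi2) / _CHI2_1DF_MEDIAN

def gc_adjust(chi2_stat, lam):
    """Deflate a single chi-square statistic by the GC factor."""
    return chi2_stat / lam

# P-values taken from the example summary table above.
print(gc_lambda([0.7626, 0.9209, 0.1038, 0.9407, 0.1445]))
```

A factor close to 1 indicates little inflation; values well above 1 suggest that the test statistics are inflated.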

Reference genotypes for linkage-disequilibrium calculation

Some methods require genotype data from the GWAS samples to correct for linkage disequilibrium (LD). Such genotypes are often unavailable, in which case ancestry-matched reference genotypes can be used in KGGSum as a substitute. Suitable reference datasets include genotypes from the 1000 Genomes Project or the UK Biobank. These references are primarily used to estimate LD for common variants. Ideally, the reference dataset should include between 500 and 5000 subjects; larger datasets are more comprehensive but may increase computation time. KGGSum supports two genotype file formats, VCF and GenoType Block (GTB), specified with the option --ref-gty-file.

Option	Description	Default
`--ref-gty-file`	Specifies the reference genotype file. It is a combination of multiple parameters: `type` gives the format of the input file and `refG` gives the reference genome of the input variants. Format: `--ref-gty-file <file> type=[AUTO/VCF/GTB] refG=[hg18/hg19/hg38]` Example: `--ref-gty-file ./example.vcf.gz type=VCF refG=hg38`	type=AUTO refG=hg19
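To make the role of the reference genotypes concrete, LD between two variants is commonly summarized as r², the squared Pearson correlation of their genotype dosages (0, 1, or 2 copies of an allele) across subjects. The sketch below illustrates that general quantity in Python; it does not describe KGGSum's internal implementation, and `ld_r2` is a hypothetical name.

```python
def ld_r2(g1, g2):
    """Squared Pearson correlation between two genotype dosage vectors.

    Hypothetical helper for illustration. Each vector holds one dosage
    (0, 1, or 2) per subject, in the same subject order.
    """
    n = len(g1)
    if n == 0 or n != len(g2):
        raise ValueError("genotype vectors must be non-empty and of equal length")
    mean1 = sum(g1) / n
    mean2 = sum(g2) / n
    cov = sum((a - mean1) * (b - mean2) for a, b in zip(g1, g2))
    var1 = sum((a - mean1) ** 2 for a in g1)
    var2 = sum((b - mean2) ** 2 for b in g2)
    if var1 == 0 or var2 == 0:
        # Monomorphic variant: correlation is undefined; this sketch returns 0.
        return 0.0
    return (cov * cov) / (var1 * var2)

print(ld_r2([0, 1, 2, 1], [2, 1, 0, 1]))  # perfectly anti-correlated dosages
```

Because r² is estimated from a sample, very small reference panels give noisy estimates, which is one reason a panel of at least several hundred subjects is recommended above.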

Gene Score Profiles

The gene score profile contains various values representing different contexts or conditions, such as RNA or protein expression levels or perturbation effects. Each row corresponds to a gene, and the columns, separated by tabs, represent different contexts or conditions. For each context or condition, two columns can be provided: one for the mean (labeled with .mean) and another for the standard error (SE, labeled with .SE). While the SE column is optional, including it can enhance the accuracy of the analysis. Below is an example of the file format.

Gene	Adipose-Subcutaneous.mean	Adipose-Subcutaneous.SE	Adipose-VisceralOmentum.mean	Adipose-VisceralOmentum.SE	…
ENSG00000223972.5	0.0038016	0.00036668	0.0045709	0.00046303	…
ENSG00000227232.5	1.9911	0.030021	1.8841	0.040247	…
ENSG00000278267.1	0.00049215	0.00010645	0.00036466	9.29E-05	…
ENSG00000243485.5	0.0047772	0.00038018	0.0067897	0.00074318	…
ENSG00000237613.2	0.0030462	0.00027513	0.0030465	0.00031694	…
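The `.mean`/`.SE` naming convention makes it straightforward to pair columns by condition. The Python sketch below shows one way to do so for a tab-separated header; it is only an illustration of the layout described above, and `pair_score_columns` is a hypothetical helper rather than a KGGSum function.

```python
def pair_score_columns(header_line):
    """Map each condition to the indices of its .mean and optional .SE columns.

    Hypothetical helper. Column 0 is assumed to hold the gene identifier;
    columns without a .mean or .SE suffix are ignored.
    """
    columns = header_line.rstrip("\n").split("\t")
    conditions = {}
    for index, name in enumerate(columns[1:], start=1):
        for suffix, key in ((".mean", "mean"), (".SE", "se")):
            if name.endswith(suffix):
                condition = name[: -len(suffix)]
                entry = conditions.setdefault(condition, {"mean": None, "se": None})
                entry[key] = index
    return conditions

header = (
    "Gene\tAdipose-Subcutaneous.mean\tAdipose-Subcutaneous.SE\t"
    "Adipose-VisceralOmentum.mean\tAdipose-VisceralOmentum.SE"
)
for condition, cols in pair_score_columns(header).items():
    print(condition, cols)
```

A condition whose `se` entry stays `None` has no SE column, which is permitted because the SE is optional.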

The path and related settings are specified with the following option.

Option Description

--gene-score-file The scores can represent various attributes, such as RNA expression, protein expression, epigenetic markers, or perturbation profiles at genes. Each row corresponds to a gene, and each column (except the first) represents a condition. The first column should contain the gene symbols. This is a combination parameter with the following options:
file: Specifies the path of the gene score file, which can be a local path or a remote path accessed via a network. NOTE: A LOCAL file path allows wildcards (say, brain*.tsv) to specify multiple files as a single input. A more efficient format, Feather, is also supported to speed up analysis.
calcSpecificity: Triggers the calculation of the specificity of gene scores for each condition. The default is "y(es)".
disableDirection: Instructs KGGSum to ignore the directionality of specificity. The default is "n(o)".
ignoreSE: Instructs KGGSum to ignore the SE of expression mean values (if available) when calculating the specificity. The default is "n(o)".
minValue: Instructs KGGSum to ignore the expression below a minimal value. The default is NA.

Format: --gene-score-file file=file/path calcSpecificity=<y/n> disableDirection=<y/n>
Example: --gene-score-file file/path \
calcSpecificity=y

xQTL summary data

This dataset is used to link variants to their target genes, typically using each gene’s eQTL summary statistics. Each row represents an eQTL and must include the following nine columns: gene symbol, gene ID, chromosome, position, p-value, effective (alternative) allele, base (reference) allele, effect size, and standard error. Below is an example of the file format.

symbol	id	chr	pos	ref	alt	altfreq	beta	se	p
LINC00115	ENSG00000225880	1	796375	T	C	0.149	-0.223	0.081	5.87E-03
LINC00115	ENSG00000225880	1	797440	T	C	0.159	-0.24	0.078	2.28E-03
LINC00115	ENSG00000225880	1	802496	C	T	0.146	-0.247	0.083	2.95E-03
LINC00115	ENSG00000225880	1	812743	C	T	0.17	-0.19	0.073	9.57E-03
LINC01128	ENSG00000228794	1	693731	A	G	0.118	-0.258	0.094	6.31E-03
LINC01128	ENSG00000228794	1	731718	T	C	0.151	-0.293	0.084	4.50E-04

The path and related settings are specified with the following option.

Option Description

--xqtl-file Specifies the xQTL summary file.
In the file, each row represents a genetic variant with its association summary for a gene. The association can be based on various gene characteristics, including RNA expression (eQTL), RNA splicing (sQTL), protein expression (pQTL), and methylation (mQTL). The first column should contain the gene symbols. This is a combination parameter with the following options:

file specifies the path to the xQTL summary statistics. This can be a local file, an internet URL, or an intranet file path accessed via SFTP. NOTE: For a LOCAL file path, it allows wildcards (say, a*b?.qtl.tsv.gz) to specify multiple files as a single input.

cp12Cols specifies the column names in the summary file, which are chromosome, positions, effective (VCF alternative, A1) allele, and base (VCF reference, A2) allele.

pbsCols specifies the column names in the summary file, which are p-values, effect size, and standard errors of effect size.

giCols specifies the column names in the summary file for gene symbols and gene IDs.

freqA1Col specifies the column containing the frequency of the A1 allele.

sampleSizeCols specifies the columns for the sample sizes for the xQTL.

refG specifies the reference genome of input variants. The default value is hg19.

sep specifies the separator of the summary file. By default, it recognizes tabs, spaces, and commas; the corresponding tag is UNIVERSAL.

pCut specifies the p-value threshold for selecting significant xQTLs for subsequent analyses. The default value is pCut=1E-6.

ldCut specifies the LD r² threshold used to prune highly redundant xQTLs. The default value is ldCut=0.8.

Format: --xqtl-file file=file/path [cp12Cols=chr,pos,alt,ref] [pbsCols=p,beta,se] [giCols=symbol,id] [refG=hg19] [sep=TAB] [freqA1Col=altfreq] [sampleSizeCols=neff] [pCut=1E-6] [ldCut=0.8]
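The effect of a p-value threshold such as pCut can be illustrated with a few lines of Python. This is a sketch under stated assumptions, not KGGSum's code: the helper `select_significant` is hypothetical, and whether KGGSum treats the threshold as inclusive is not stated here, so the `<=` comparison is an assumption.

```python
import csv
import io

def select_significant(rows, p_col="p", p_cut=1e-6):
    """Keep xQTL records whose p-value is at or below p_cut.

    Hypothetical helper; rows are dicts keyed by column name.
    """
    return [row for row in rows if float(row[p_col]) <= p_cut]

# Two rows copied from the example xQTL table above.
sample = io.StringIO(
    "symbol\tid\tchr\tpos\tref\talt\taltfreq\tbeta\tse\tp\n"
    "LINC00115\tENSG00000225880\t1\t796375\tT\tC\t0.149\t-0.223\t0.081\t5.87E-03\n"
    "LINC01128\tENSG00000228794\t1\t731718\tT\tC\t0.151\t-0.293\t0.084\t4.50E-04\n"
)
records = list(csv.DictReader(sample, delimiter="\t"))

# A looser threshold than the default is used here so that one row survives.
kept = select_significant(records, p_cut=1e-3)
print([row["pos"] for row in kept])
```

With the default threshold of 1E-6, neither of these two example rows would be retained.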
