Input

The main inputs of KGGSum are GWAS summary data in a text file and reference genotypes in VCF or GTB format.

GWAS summary data file

The GWAS summary data file lists one variant per row together with its GWAS association summary statistics, with columns separated by spaces or tabs. The minimal required columns are genomic coordinates and p-values (CHR, BP, and P by default). Additional statistical attributes may be needed for specific analyses; these requirements are detailed in the descriptions of the corresponding analysis functions.

CHR POS A1 A2 AF_A1 Beta SE P
1 86028 T C 0.908392 -0.00108 0.00356 0.7626
1 693731 A G 0.877949 -0.00034 0.00339 0.9209
1 713092 A G 0.00623653 -0.04053 0.02491 0.1038
1 714596 T C 0.962227 -0.00045 0.00615 0.9407
1 715205 C G 0.993317 0.03543 0.02428 0.1445
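
Before passing a file to KGGSum, it can help to confirm that the table is well formed. The following is a minimal sketch, not part of KGGSum: it reads a space- or tab-delimited table and checks that the coordinate and p-value columns are present. The function name `read_summary` and the required column names (taken from the example header above) are assumptions made for illustration.

```python
import io

# Assumed column names, copied from the example header above.
REQUIRED = ("CHR", "POS", "P")

def read_summary(handle, required=REQUIRED):
    """Parse a whitespace-delimited GWAS summary table into a list of dicts."""
    header = handle.readline().split()
    missing = [c for c in required if c not in header]
    if missing:
        raise ValueError(f"missing required columns: {missing}")
    rows = []
    for line in handle:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        if len(fields) != len(header):
            raise ValueError(f"expected {len(header)} fields, got {len(fields)}")
        rows.append(dict(zip(header, fields)))
    return rows

sample = io.StringIO(
    "CHR POS A1 A2 AF_A1 Beta SE P\n"
    "1 86028 T C 0.908392 -0.00108 0.00356 0.7626\n"
    "1 693731 A G 0.877949 -0.00034 0.00339 0.9209\n"
)
rows = read_summary(sample)
print(len(rows), rows[0]["P"])
```

Values are kept as strings so that the check does not alter the numeric formatting of the original file.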

The GWAS summary file is specified with the option --sum-file (or -sf). Further settings for the variants and statistics are passed as key=value parameters of the same option.

Format

--sum-file file [cp12Cols=CHR,POS,..] [r12Cols=RSID,..] [pbsCols=P,...] [refG=] [sep=] [freqA1Col=] [sampleSizeCols=] [betaType=<0/1/2>] [prevalence=] [exclude=]

Example

--sum-file ./CAD_UKBIOBANK.gz cp12Cols=chr,bp,a1,a2 pbsCols=pval,beta,se refG=hg19 freqA1Col=AF_A1 exclude=chr6:545554~444555545
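
When KGGSum is launched from a script, the combined parameters can be assembled programmatically. The sketch below only builds the argument tokens shown in the example; it does not run KGGSum, and the helper name `sum_file_args` is hypothetical.

```python
def sum_file_args(path, **settings):
    """Return the tokens for --sum-file: the path followed by key=value pairs."""
    tokens = ["--sum-file", path]
    for key, value in settings.items():
        # Lists and tuples become comma-separated values, as in cp12Cols=chr,bp,a1,a2.
        if isinstance(value, (list, tuple)):
            value = ",".join(str(v) for v in value)
        tokens.append(f"{key}={value}")
    return tokens

args = sum_file_args(
    "./CAD_UKBIOBANK.gz",
    cp12Cols=["chr", "bp", "a1", "a2"],
    pbsCols=["pval", "beta", "se"],
    refG="hg19",
    freqA1Col="AF_A1",
)
print(" ".join(args))
```

Keeping the tokens as a list makes it possible to hand them to a process launcher without shell quoting issues.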

Options

  • file specifies the path to the GWAS summary statistics. This can be a local file, an internet URL, or an intranet file path accessed via SFTP. Note: A LOCAL file path allows wildcards (say, '*.tsv') to specify multiple files as a single input. KGGSum will process the files one by one.

  • cp12Cols specifies the column names in the summary file for chromosome, position, effective (VCF alternative, A1) allele, and base (VCF reference, A2) allele.

  • r12Cols specifies the column names in the summary file: dbSNP rs ID, effective allele (VCF alternative, A1), and base allele (VCF reference, A2). The corresponding coordinates will be retrieved from the dbSNP database. Ensure the database file in GTB format is downloaded, unzipped, and placed in KGGSum’s working directory: ./resources/dbsnp/*.gtb. Note that r12Cols and cp12Cols are mutually exclusive.

  • pbsCols specifies the column names in the summary file for p-values, effect sizes, and standard errors of the effect sizes.

  • type specifies the file type of the GWAS summary file. The default one is the TSV format. In addition, there are two alternative formats, VCF and GTB.

  • refG specifies the reference genome of the input variants. The default is refG=hg19. Note that an incorrect genome version will cause GWAS variants to be mismatched with the annotation variants. All built-in annotations of KGGSum use hg38.

  • sep specifies the separator of the summary file. It recognizes four values, among them UNIVERSAL, which accepts tabs, spaces, or commas. The default is sep=UNIVERSAL.

  • freqA1Col specifies the column containing the frequency of the A1 allele.

  • sampleSizeCols specifies the columns for the sample sizes of cases and controls. If only one column is specified, it is taken as the total sample size.

  • betaType specifies the type of effect size, <0/1/2>.

    0 means coefficients of linear regression for a quantitative phenotype; 1 means coefficients of logistic regression, i.e. the logarithm of the odds ratio, for a qualitative (binary) phenotype; 2 means the odds ratio for a binary phenotype. The default is betaType=1.

  • prevalence specifies the disease prevalence in a population. This is only required for a GWAS of disease phenotypes.

  • exclude specifies the genomic regions of variants to be excluded.
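
The relation between betaType=1 and betaType=2 described above is a logarithm: a logistic regression coefficient is the natural log of the odds ratio. The sketch below illustrates that conversion only; the function name `to_log_odds` is hypothetical and this is not KGGSum's internal code.

```python
import math

def to_log_odds(effect, beta_type):
    """Return the effect on the log-odds scale for betaType 1 or 2."""
    if beta_type == 1:
        return effect            # already a logistic coefficient, i.e. log(OR)
    if beta_type == 2:
        if effect <= 0:
            raise ValueError("an odds ratio must be positive")
        return math.log(effect)  # odds ratio -> log(odds ratio)
    # betaType 0 is a linear-regression coefficient on a different scale.
    raise ValueError("betaType 0 has no log-odds equivalent")

print(to_log_odds(1.0, 2))   # an odds ratio of 1 means no effect
print(to_log_odds(-0.5, 1))
```

Declaring the wrong betaType for a file of odds ratios would treat values centred on 1 as if they were centred on 0, so it is worth checking the column against the source study before running an analysis.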

Reference genotypes for linkage-disequilibrium calculation

Some analysis methods require genotype data from the GWAS samples to correct for linkage disequilibrium (LD). However, such genotypes are often unavailable. In these cases, ancestry-matched reference genotypes can be used as a substitute in KGGSum. Suitable reference datasets include genotypes from the 1000 Genomes Project or the UK Biobank. These references are primarily used to estimate LD for common variants. Ideally, the reference dataset should include between 500 and 5000 subjects; larger datasets, while more comprehensive, may increase computation time. KGGSum supports two genotype file formats, VCF and GenoType Block (GTB), through the option --ref-gty-file.

Option: --ref-gty-file
Description: Specifies the reference genotype file. This is a combination of multiple parameters: type sets the format of the input file, and refG sets the reference genome of the input variants.
Format: --ref-gty-file <file> type=[AUTO/VCF/GTB] refG=[hg18/hg19/hg38]
Example: --ref-gty-file ./example.vcf.gz type=VCF refG=hg38
Default: type=AUTO refG=hg19
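
To check whether a VCF reference panel falls within the suggested range of 500 to 5000 subjects, the sample count can be read from the header. The sketch below relies only on the VCF convention that the #CHROM line holds nine fixed columns followed by one column per sample; the function names are hypothetical, and GTB files are not covered.

```python
import gzip

def count_vcf_samples(lines):
    """Count sample columns on the #CHROM header line of a VCF."""
    for line in lines:
        if line.startswith("#CHROM"):
            columns = line.rstrip("\n").split("\t")
            # CHROM..INFO are 8 fixed columns, FORMAT is the 9th, samples follow.
            return max(0, len(columns) - 9)
        if not line.startswith("#"):
            break  # reached data records without seeing a header line
    raise ValueError("no #CHROM header line found")

def count_samples_in_file(path):
    """Open a plain or gzip-compressed VCF and count its samples."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as handle:
        return count_vcf_samples(handle)

header = [
    "##fileformat=VCFv4.2\n",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tS1\tS2\tS3\n",
]
n = count_vcf_samples(header)
print(n, 500 <= n <= 5000)
```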

Gene Score Profiles

The gene score profile contains various values representing different contexts or conditions, such as RNA or protein expression levels or perturbation effects. Each row corresponds to a gene, and the columns, separated by tabs, represent different contexts or conditions. For each context or condition, two columns can be provided: one for the mean (labeled with .mean) and another for the standard error (SE, labeled with .SE). While the SE column is optional, including it can enhance the accuracy of the analysis. Below is an example of the file format.

Gene Adipose-Subcutaneous.mean Adipose-Subcutaneous.SE Adipose-VisceralOmentum.mean Adipose-VisceralOmentum.SE
ENSG00000223972.5 0.0038016 0.00036668 0.0045709 0.00046303
ENSG00000227232.5 1.9911 0.030021 1.8841 0.040247
ENSG00000278267.1 0.00049215 0.00010645 0.00036466 9.29E-05
ENSG00000243485.5 0.0047772 0.00038018 0.0067897 0.00074318
ENSG00000237613.2 0.0030462 0.00027513 0.0030465 0.00031694
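
The .mean and .SE naming convention can be checked by pairing the header columns per condition. The following is a minimal sketch under the assumption that every score column carries one of these two suffixes; the function name `parse_conditions` is hypothetical and KGGSum's own parser may accept other layouts.

```python
def parse_conditions(header_line):
    """Map each condition name to the column indices of its mean and SE."""
    columns = header_line.rstrip("\n").split("\t")
    conditions = {}
    # Column 0 holds the gene identifier; score columns start at index 1.
    for index, name in enumerate(columns[1:], start=1):
        if name.endswith(".mean"):
            conditions.setdefault(name[: -len(".mean")], {})["mean"] = index
        elif name.endswith(".SE"):
            conditions.setdefault(name[: -len(".SE")], {})["se"] = index
        else:
            # Assumption of this sketch: unsuffixed columns are rejected.
            raise ValueError(f"column without .mean or .SE suffix: {name}")
    return conditions

header = (
    "Gene\tAdipose-Subcutaneous.mean\tAdipose-Subcutaneous.SE\t"
    "Adipose-VisceralOmentum.mean\tAdipose-VisceralOmentum.SE"
)
conditions = parse_conditions(header)
print(sorted(conditions))
```

A condition whose entry lacks the "se" key has no standard error column, which is permitted since the SE is optional.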

The path and relevant settings are specified with the --gene-score-file option.

Option Description
--gene-score-file The scores can represent various attributes, such as RNA expression, protein expression, epigenetic markers, or perturbation profiles at genes. Each row corresponds to a gene, and each column (except the first) represents a condition. The first column should contain the gene symbols. This is a combination parameter with the following options:
file: Specifies the path of the gene score file, which can be a local path or a remote path accessed via a network. Note: a LOCAL file path allows wildcards (say, 'brain*.tsv') to specify multiple files as a single input.
calcSpecificity: Triggers the calculation of the specificity of gene scores for each condition. The default is "y(es)".
noDirection: Instructs KGGSum to ignore the directionality of specificity. The default is "y(es)".

Format: --gene-score-file file=file/path calcSpecificity=<y/n> noDirection=<y/n>
Example: --gene-score-file file/path \
calcSpecificity=y

xQTL summary data

This dataset is used to link variants to their target genes, typically using each gene’s eQTL summary statistics. Each row represents an eQTL and must include the following nine columns: gene symbol, gene ID, chromosome, position, p-value, effective (alternative) allele, base (reference) allele, effect size, and standard error. Below is an example of the file format.

symbol id chr pos ref alt altfreq beta se p
LINC00115 ENSG00000225880 1 796375 T C 0.149 -0.223 0.081 5.87E-03
LINC00115 ENSG00000225880 1 797440 T C 0.159 -0.24 0.078 2.28E-03
LINC00115 ENSG00000225880 1 802496 C T 0.146 -0.247 0.083 2.95E-03
LINC00115 ENSG00000225880 1 812743 C T 0.17 -0.19 0.073 9.57E-03
LINC01128 ENSG00000228794 1 693731 A G 0.118 -0.258 0.094 6.31E-03
LINC01128 ENSG00000228794 1 731718 T C 0.151 -0.293 0.084 4.50E-04

The path and relevant settings are specified with the --xqtl-file option.

Option Description
--xqtl-file Specifies the xQTL summary file. In the file, each row represents a genetic variant with its association summary for a gene. The association can be calculated from various gene characteristics, including RNA expression (eQTL), RNA splicing (sQTL), protein expression (pQTL), and methylation (mQTL). The first column should contain the gene symbols. This is a combination parameter with the following options:

file specifies the path to the xQTL summary statistics. This can be a local file, an internet URL, or an intranet file path accessed via SFTP. Note: a LOCAL file path allows wildcards (say, a*b?.qtl.tsv.gz) to specify multiple files as a single input.

cp12Cols specifies the column names in the summary file, which are chromosome, positions, effective (VCF alternative, A1) allele, and base (VCF reference, A2) allele.

pbsCols specifies the column names in the summary file, which are p-values, effect size, and standard errors of effect size.

giCols specifies the column names in the summary file for the gene symbol and gene ID.

freqA1Col specifies the column containing the frequency of the A1 allele.

sampleSizeCols specifies the columns for the sample sizes for the xQTL.

refG specifies the reference genome of input variants. The default value is hg19.

sep specifies the separator of the summary file. By default, it recognizes tabs, spaces, and commas; the corresponding value is UNIVERSAL.

pCut specifies the p-value threshold for selecting significant xQTLs for subsequent analyses. The default value is pCut=1E-6.

ldCut specifies the LD r² threshold used to prune highly redundant xQTLs. The default value is ldCut=0.8.

Format: --xqtl-file file=file/path [cp12Cols=chr,pos,alt,ref] [pbsCols=p,beta,se] [giCols=symbol,id] [refG=hg19] [sep=TAB] [freqA1Col=altfreq] [sampleSizeCols=neff] [pCut=1E-6] [ldCut=0.8]
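
The effect of pCut can be previewed on a file before running KGGSum. The sketch below keeps rows whose p-value does not exceed the threshold; whether KGGSum treats the boundary as inclusive is an assumption here, and the function name `significant_xqtls` is hypothetical. LD pruning with ldCut needs genotype data and is not shown.

```python
import io

def significant_xqtls(handle, p_col="p", p_cut=1e-6):
    """Return rows of a whitespace-delimited xQTL table with p <= p_cut."""
    header = handle.readline().split()
    p_index = header.index(p_col)  # raises ValueError if the column is absent
    kept = []
    for line in handle:
        fields = line.split()
        if fields and float(fields[p_index]) <= p_cut:
            kept.append(dict(zip(header, fields)))
    return kept

sample = io.StringIO(
    "symbol id chr pos ref alt altfreq beta se p\n"
    "LINC00115 ENSG00000225880 1 796375 T C 0.149 -0.223 0.081 5.87E-03\n"
    "LINC01128 ENSG00000228794 1 731718 T C 0.151 -0.293 0.084 4.50E-04\n"
)
# A looser threshold than the default is used so that the small example keeps a row.
hits = significant_xqtls(sample, p_cut=1e-3)
print([h["pos"] for h in hits])
```

With the default pCut=1E-6, none of the example rows above would be retained.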
Copyright © MiaoXin Li, all rights reserved. Last modified time: 2025-04-03 07:35:44
