Resources for Analyses

We have surveyed various types of public data that you can download to carry out large-scale, knowledge-based data-mining analyses with KGGSum. The available download links are listed below.

| Type | Description |
| --- | --- |
| Reference genotypes | Variant IDs and genotypes in VCF or GTB format, used as reference panels to correct the coordinates and LD of GWAS variants. (1) 1000 Genomes Project (1KG): the genotypes of 5 ancestry panels can be downloaded here. (2) The Haplotype Reference Consortium (HRC): high-quality haplotype reference data comprising approximately 64,940 haplotypes of predominantly European ancestry. A subset of the panel is published at https://ega-archive.org/datasets/EGAD00001002729, where you can apply for access to the original .vcf file. Note that the ancestry of the reference panel should match that of the GWAS sample. |
| dbSNP IDs | Required when the GWAS summary data provide only dbSNP IDs (rs#) and no genomic coordinates. rsID coordinates can be obtained from the NCBI FTP site. We provide a GTB version of this dataset, converted from the dbSNP VCF (build 157). |
| GWAS summary | GWAS summary statistics of phenotypes. Many public repositories provide such data. Any plain-text dataset that contains the columns required by KGGSum can be used as input. |
| Phenotypes | The following repositories are widely used and offer downloadable GWAS summary statistics. (1) Open GWAS: a database of genetic associations from GWAS summary datasets, available for querying or downloading. Note: GWAS summary data downloaded from Open GWAS in VCF format can be used directly as KGGSum input without format conversion. (2) GWAS Catalog: the NHGRI-EBI Catalog of human genome-wide association studies. (3) Pan-UKBB: a multi-ancestry analysis of 7,228 phenotypes, spanning 16,131 genome-wide association studies, based on a generalized mixed model association testing framework. (4) FinnGen: a research project in genomics and personalized medicine. It is a large public-private partnership that has collected and analyzed genome and health data from 500,000 Finnish biobank donors to understand the genetic basis of diseases. |
| Microbes | A collection of GWAS summary datasets for microbial abundances in the human digestive system, curated from three databases: Dutch Microbiome Project, MiBioGen, and FINRISK. The curated datasets can be downloaded here. |
| Gene expression | Gene expression profiles curated from various public sources. |
| Tissues | Expression profiles of genes in 54 organs or tissues, curated from the GTEx project. The file is included in the downloaded resources.zip as GTEx_v8_TMM_all.gene.meanSE.txt.gz. |
| Cell-types | Gene expression profiles of 6,598 cell types from humans and mice, compiled by PCGA (https://pmglab.top/pcga) from publicly available scRNA-seq datasets. The curated datasets can be downloaded here. |
| Spatial and cell-types | Phenotype-relevant spatial and cell types can be inferred on the online platform PSC (https://pmglab.top/psc/). KGGSum is the backend analysis tool for this platform. |
| Drug perturbation | Gene expression perturbation profiles induced by various drugs in multiple cell types. The preprocessing pipeline and methods are described in this Molecular Psychiatry paper. The curated datasets can be downloaded here. |
| xQTL | Variants linked to genes through gene properties such as RNA expression, splicing, protein expression, and DNA methylation. |
| eQTL | GTEx eQTL: cis-eQTL summary statistics calculated from the gene- or transcript-level expression profiles of 49 tissues or organs provided by the GTEx project (v8). The curated datasets can be downloaded here. |
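
The dbSNP IDs entry above covers the case where summary data carry only rs IDs. As a rough illustration of that lookup (not KGGSum's own implementation), the Python sketch below builds an rsID-to-coordinate map from VCF-formatted lines and attaches coordinates to summary rows. The function names, the `SNP`, `CHR`, and `BP` column names, and the toy records are all assumptions made for this example; the records are not real dbSNP entries.

```python
import io


def load_rsid_coordinates(vcf_lines):
    """Build {rsID: (chrom, pos)} from an iterable of VCF-formatted lines."""
    coords = {}
    for line in vcf_lines:
        # Skip meta-information, the header line, and empty lines.
        if line.startswith("#") or not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        chrom, pos, rsid = fields[0], int(fields[1]), fields[2]
        coords[rsid] = (chrom, pos)
    return coords


def add_coordinates(summary_rows, coords, id_column="SNP"):
    """Return (rows with CHR/BP attached, list of IDs that were not found)."""
    annotated, missing = [], []
    for row in summary_rows:
        hit = coords.get(row[id_column])
        if hit is None:
            missing.append(row[id_column])
            continue
        annotated.append({**row, "CHR": hit[0], "BP": hit[1]})
    return annotated, missing


# Toy VCF excerpt with made-up records (hypothetical, for illustration only).
TOY_VCF = (
    "##fileformat=VCFv4.2\n"
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n"
    "1\t1000\trs0000001\tA\tG\t.\t.\t.\n"
    "2\t2500\trs0000002\tC\tT\t.\t.\t.\n"
)

coords = load_rsid_coordinates(io.StringIO(TOY_VCF))
rows = [{"SNP": "rs0000001", "P": 0.01}, {"SNP": "rs9999999", "P": 0.5}]
annotated, missing = add_coordinates(rows, coords)
print(annotated)
print(missing)
```

An in-memory dictionary is only practical for a subset of dbSNP; for the full dataset, an indexed format such as the GTB file mentioned in the table is the more realistic choice.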

Resources for Annotation

International projects and functional genomic studies have produced a wealth of information from diverse biological domains that can be leveraged for variant annotation. To make annotation convenient, we pre-processed selected annotation fields from several commonly used databases. Users can download either the entire collection or only the databases a study needs from our website and then annotate large numbers of variants locally. For small-scale annotations (e.g., <10,000 variants), users can access these databases remotely via FTP/HTTP. The program uses each database's index file to fetch only the slices that cover the given variants, so the entire database does not have to be downloaded and the process is faster. Because genomic databases are large and change frequently, KGGSum also lets users incorporate custom databases for annotation.
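
The remote, index-based access described above can be pictured with HTTP range requests. The Python sketch below is an assumption about the general technique, not a description of how KGGSum actually fetches data: it merges the byte ranges that an index might return for a set of variants and builds one `Range` request per merged slice. The URL, the byte offsets, and the function names are hypothetical.

```python
from urllib.request import Request


def merge_ranges(ranges, gap=0):
    """Merge overlapping or adjacent (start, end) byte ranges with inclusive ends.

    Ranges separated by at most `gap` bytes are also merged, trading a little
    extra download volume for fewer requests.
    """
    merged = []
    for start, end in sorted(ranges):
        if merged and start <= merged[-1][1] + 1 + gap:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def build_range_request(url, start, end):
    """Create (but do not send) an HTTP request for bytes start..end inclusive."""
    return Request(url, headers={"Range": f"bytes={start}-{end}"})


# Hypothetical byte ranges that an index lookup might return for a few variants.
wanted = [(100, 199), (0, 49), (50, 99), (500, 599)]
slices = merge_ranges(wanted)

# Hypothetical database URL; pass each request to urllib.request.urlopen to send it.
requests = [
    build_range_request("https://example.org/annotation.db", s, e)
    for s, e in slices
]
for req in requests:
    print(req.get_header("Range"))
```

Merging neighbouring ranges before requesting them keeps the number of round trips small, which is usually the dominant cost when only a few thousand variants are annotated.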

Gene feature annotation

We support two well-established gene annotation systems: RefSeq genes (refgene) and GENCODE genes (gencode). In KGGSum, databases used for gene feature annotation should be constructed in FASTA format. KGGSum also accepts custom gene databases in FASTA format, so users can integrate their own gene annotations tailored to specific research needs or preferences.

| Database Name | FTP/HTTP | Version |
| --- | --- | --- |
| refgene | https://ftp.ncbi.nih.gov/refseq/H_sapiens/annotation/ | Updated on 2024-08-27 |
| gencode | https://www.gencodegenes.org/human/ | Release 47 |

Copyright © MiaoXin Li. All rights reserved. Last modified time: 2025-04-04 04:02:20
