FAPI: Fast and Accurate P-value imputation for genetic association

FAPI (Fast and Accurate P-value Imputation) is a multi-threaded Java application that infers the association p-value of a single-nucleotide polymorphism (SNP) from the association p-values of the SNPs in linkage disequilibrium (LD) with it. The p-value imputation method is described in the reference paper cited at the end of this page. Unlike genotype imputation tools such as IMPUTE and MACH, FAPI requires neither phased alleles nor raw study genotypes, which makes it very fast while retaining high imputation accuracy.
FAPI has three main functions:

  1). impute p values for untyped SNPs;
  2). assess the quality of p values for typed SNPs;
  3). perform meta-analysis at both untyped and typed SNPs.


Installation of Java Runtime Environment (JRE)
The JRE is required to run FAPI on any operating system (OS). It can be downloaded from http://java.sun.com/javase/downloads/index.jsp for free.

Installation of FAPI
FAPI does not have an installation wizard yet. After downloading and decompressing the package, launch it from a command prompt window provided by the OS with the command java -Xmx1g -jar "./fapi.jar" [arguments]. In this command, -Xmx[size] sets the maximum Java heap size available to FAPI. A larger maximum heap size can speed up the analysis; a higher setting such as -Xmx4g is suggested for a large number of SNPs, say more than 5,000,000. The heap size, however, should be less than the size of physical memory.

Summary statistics (i.e., p-values) input
The primary input of FAPI is a text file (optionally compressed as a *.gz file) containing summary statistics of SNPs, with the first row as the header. The columns are delimited by spaces or tabs. Four types of information about each SNP are required: chromosome, marker ID, coordinate, and p-value.
The following is an example:
CHR SNPID POS P-value1 Test-Mode P-value2 ...
4 rs1513559 12232332 0.02301 additive 0.007688 ...
4 rs1841043 122323365 0.01115 additive 0.119 ...
... ... ... ... ... ... ...
By default, FAPI expects the first four columns to contain these four types of information in the same ORDER as above. Otherwise, users need to specify the columns with the tags --chrom-col, --marker-col, --position-col and --p-col, which are described below.
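
As an illustration of the expected layout, the following Python sketch parses such a table by header name, roughly mirroring what the --chrom-col, --marker-col, --position-col and --p-col tags select. It is not FAPI's own parser: the function names open_text and parse_summary_lines are hypothetical, and the default column names are simply those of the example above.

```python
import gzip


def open_text(path):
    """Open a plain or gzip-compressed text file for reading."""
    return gzip.open(path, "rt") if path.endswith(".gz") else open(path, "r")


def parse_summary_lines(lines, chrom_col="CHR", marker_col="SNPID",
                        position_col="POS", p_col="P-value1"):
    """Parse whitespace-delimited rows, locating columns by header name."""
    rows = iter(lines)
    header = next(rows).split()  # the first row is the header
    wanted = {"chrom": chrom_col, "marker": marker_col,
              "position": position_col, "p": p_col}
    index = {key: header.index(name) for key, name in wanted.items()}
    records = []
    for line in rows:
        fields = line.split()  # splits on spaces or tabs
        if not fields:
            continue  # skip blank lines
        records.append({
            "chrom": fields[index["chrom"]],
            "marker": fields[index["marker"]],
            "position": int(fields[index["position"]]),
            "p": float(fields[index["p"]]),
        })
    return records


# The two data rows of the example above, without the elided "..." columns.
example = [
    "CHR SNPID POS P-value1 Test-Mode P-value2",
    "4 rs1513559 12232332 0.02301 additive 0.007688",
    "4 rs1841043 122323365 0.01115 additive 0.119",
]
records = parse_summary_lines(example)
print(records[0])
```

To read a real file, pass the handle returned by open_text(path) as the lines argument; a different p-value column can be chosen with, for example, p_col="P-value2".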

Hint: If you only have rsIDs on hand, you can first use another simple tool, SnpTracker, to retrieve the chromosomes and coordinates of the SNPs.
Data sets containing LD information of SNPs
FAPI currently supports two input formats for the reference data used to calculate LD between SNPs. Users can choose whichever format is appropriate.
Phased or unphased genotypes in VCF format  
FAPI can read either phased or unphased genotypes in VCF format to account for LD between SNPs.

To ease the preparation of the LD data, we provide VCF data for the most widely used haplotypes, originally released by the HapMap project and the 1000 Genomes Project.

In the analysis, users can use resource tags to specify the data of interest. FAPI checks whether the data exist on the local machine; if they do not, FAPI automatically downloads them from the FAPI website using a multi-threaded download function.
Resource tag Description
hapmap2.r22.ceu.hg19 Haplotypes of HapMap 2 release 22. Coordinates were converted from hg18 to hg19 with the UCSC liftOver tool. Compiled from here.
hapmap3.r2.ceu.hg19 Haplotypes of HapMap 3 release 2. Coordinates were converted from hg18 to hg19 with the UCSC liftOver tool. Compiled from here.
1kg.phase1.v3.asn.hg19 Haplotypes of 1000 Genomes Project phase 1 version 3. Downloaded from here.
1kg.phase3.v5.afr.hg19 Haplotypes of 1000 Genomes Project phase 3 version 5. Downloaded from here.

Note: The resource files are huge. If FAPI fails to download a complete copy of the 1KG resource files, you can go to our resource file page and download them with a dedicated download tool.

We thank Dr. LI Yun, the author of MACH, for the compiled VCF data of the 1000 Genomes Project. For a detailed description of the data, please visit
Unphased genotypes in PLINK binary format
FAPI can directly read unphased genotypes in PLINK binary format, a compressed format that can be stored and processed more efficiently.

FAPI calculates the genotypic correlation to approximate the degree of LD between SNPs. A PLINK binary file set always consists of three linked files, *.fam, *.bim and *.bed, which should be placed in the same folder.
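
A minimal Python sketch of a genotypic correlation is given below, assuming genotypes are coded as minor-allele counts 0, 1 or 2, with None for missing calls. It computes the squared Pearson correlation between two SNPs as an approximation to r-squared. The function name genotype_r2 and the handling of missing or monomorphic data are assumptions made for illustration, not FAPI's actual implementation.

```python
def genotype_r2(g1, g2):
    """Squared Pearson correlation of two genotype vectors coded 0/1/2.

    Individuals with a missing call (None) at either SNP are dropped.
    Returns 0.0 when the correlation is undefined (fewer than two
    complete pairs, or a SNP with no variation).
    """
    pairs = [(a, b) for a, b in zip(g1, g2)
             if a is not None and b is not None]
    n = len(pairs)
    if n < 2:
        return 0.0
    mean_a = sum(a for a, _ in pairs) / n
    mean_b = sum(b for _, b in pairs) / n
    cov = sum((a - mean_a) * (b - mean_b) for a, b in pairs)
    var_a = sum((a - mean_a) ** 2 for a, _ in pairs)
    var_b = sum((b - mean_b) ** 2 for _, b in pairs)
    if var_a == 0 or var_b == 0:
        return 0.0  # monomorphic SNP: no LD information
    return (cov * cov) / (var_a * var_b)


# Two SNPs in perfect LD, and two that are uncorrelated.
print(genotype_r2([0, 1, 2, 1], [0, 1, 2, 1]))
print(genotype_r2([0, 0, 2, 2], [0, 2, 0, 2]))
```

Because the genotypes are unphased, this correlation only approximates the haplotype-based r-squared, which is why the text above describes it as an approximation of LD.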

FAPI has three main functions. Please see the demo website for details.
Impute association p-values at untyped SNPs
Impute p-values at untyped SNPs and conduct meta-analysis
Validate association p-values at typed SNPs

Tag Name Description
Analysis functions
--impute Impute the p-values of untyped SNPs given the p-values of typed SNPs according to LD information.
--meta Impute the p-values of untyped SNPs and perform meta-analysis of multiple p-value sets.

Hint: If you want to perform meta-analysis without imputing the p-values, please use the --meta and --noimpute options.
--qc Impute the p-values of typed SNPs from the p-values of other typed SNPs according to LD information, and estimate the probability of observing the reported p-values of the typed SNPs.
--size Set the sample sizes of cases and controls, separated by a colon, that were used to generate the p-values. These are used as weights in meta-analysis. The default value is 1:1.
Input file settings
--pfile Specify the path of a file containing p values and genomic information of sequence variants.
--gfile Specify the path and type of the files containing genotypes for calculating LD, in VCF or PLINK binary format. The path and type of each file are separated by a double colon, e.g., path/to/file::vcf or path/to/file::plink

If the reference genotypes are stored in separate files, one per chromosome, you can use _CHROM_ to denote the chromosome name [1...Y] in the file name, e.g., chr_CHROM_.phase1.cvf.chinese.hg19
--chrom-col The header name of the column containing chromosome information in the file specified by --pfile. The 1st column is used by default if this tag is not specified.
--marker-col The header name of the column containing SNP rsID information in the file specified by --pfile. The 2nd column is used by default if this tag is not specified.
--position-col The header name of the column containing coordinate information in the file specified by --pfile. The 3rd column is used by default if this tag is not specified.
--p-col The header name of the column containing p-values in each file specified by --pfile. For multiple pfiles, the column names of the p-value sources are delimited by commas. The 4th column is used by default if this tag is not specified.
--missing-p The label of missing p-values in the file specified by --pfile. The number starts from 1.
--maf Filter out SNPs in the reference panel whose minor allele frequency is less than the given value.
Performance and Accuracy settings
--nt The maximum number of CPU threads run in parallel.
--window-size Set the maximal number of SNPs with actual p-values in a scan window for imputation. The default value is 10.
--window-len Set the maximal length of a scan window for imputation. The default value is 1,000,000 bp.
--ignore-r2 Set the maximal pairwise LD (r-squared) between SNPs that is ignored in imputation. The default value is 0.01.
--conf-filter Filter out imputed p-values with a confidence score over the set value. The default value is 0.3.
--out Specify the path and prefix name for the output data.
--resource Specify the path of the resource files. By default, it is a sub-folder named resources in the main program folder.
--no-web Switch off the automatic self-update function.
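
The tags above can be combined into a single launch command. The Python sketch below assembles one such command as an argument list and lists the per-chromosome file names that a _CHROM_ template stands for. It is only an illustration: the helper names, the file paths, the reading of [1...Y] as chromosomes 1 to 22 plus X and Y, and the standard java -jar launch form are assumptions, and the exact argument layout should be checked against the demo pages.

```python
def expand_chrom_template(template):
    """List the per-chromosome file names a _CHROM_ template stands for."""
    # Assumption: [1...Y] means autosomes 1-22 followed by X and Y.
    chroms = [str(i) for i in range(1, 23)] + ["X", "Y"]
    return [template.replace("_CHROM_", chrom) for chrom in chroms]


def build_fapi_command(pfile, gfile, gtype, out_prefix,
                       heap="4g", jar="./fapi.jar", threads=None):
    """Assemble an imputation command using only tags documented above."""
    command = [
        "java", "-Xmx" + heap, "-jar", jar,
        "--impute",
        "--pfile", pfile,
        "--gfile", gfile + "::" + gtype,  # path and type joined by '::'
        "--out", out_prefix,
    ]
    if threads is not None:
        command += ["--nt", str(threads)]
    return command


# Hypothetical paths, for illustration only.
command = build_fapi_command("gwas.pvalues.txt.gz", "ref/chr_CHROM_.vcf.gz",
                             "vcf", "results/demo", threads=4)
print(" ".join(command))

reference_files = expand_chrom_template("ref/chr_CHROM_.vcf.gz")
print(reference_files[0], reference_files[-1])
```

Once the paths point to real files, the argument list can be passed to subprocess.run; the expanded file names are handy for checking that every per-chromosome reference file exists before launching.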
Comments and suggestions are welcome; please e-mail limx54@163.com
    Kwan JS^, Li MX^*, Deng JE, Sham PC*. FAPI: Fast and Accurate P-value Imputation for genome-wide association study. Eur J Hum Genet. 2015 Aug 26. doi: 10.1038/ejhg.2015.190.

Miao-xin Li, Professor on Precision Medical Genetics & Bioinformatics, Zhongshan School of Medicine, Sun Yat-sen University, All rights reserved.