BHIT is a novel Bayesian partition method for detecting SNP interactions (epistasis). The approach builds a Bayesian model on both continuous and discrete data to partition multiple-phenotype data. Compared with other methods on both simulated and real data, the key strengths of BHIT are as follows: (i) With its Bayesian model and MCMC search, BHIT can efficiently explore high-order interactions. (ii) BHIT's Bayesian model is flexible across continuous and discrete data, so both continuous and discrete phenotypes can be handled simultaneously, and interactions within or between phenotypes and genetic data can also be detected.
Request Download BHIT
- x86_64 (64-bit) Linux system
- G++ version 4.4.7 or later
- GSL Library (http://www.gnu.org/software/gsl/)
- GSLWrap Library (http://gslwrap.sourceforge.net)
BHIT inputfilename outputfilename iterNum burninNum observNum SNPNum PhenoNum MAF newRunningTag
- inputfilename: Name of the input file
- outputfilename: Name of the output file
- iterNum: Number of MCMC iterations
- burninNum: Number of MCMC burn-in iterations
- observNum: Number of observations, e.g. the size of the population
- SNPNum: Number of SNPs
- PhenoNum: Number of phenotypes
- MAF: Minor allele frequency; should be less than 1
- newRunningTag: 1 to start a new run, 0 to continue from a saved status
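As an illustration of the argument order above, the following minimal Python sketch assembles a BHIT command line. The executable path `./BHIT`, the helper names, and the iteration counts are assumptions chosen for the example; only the order of the nine arguments comes from the usage line.

```python
import subprocess


def build_bhit_command(input_file, output_file, iter_num, burnin_num,
                       observ_num, snp_num, pheno_num, maf, new_running_tag,
                       executable="./BHIT"):
    """Return the BHIT argument list in the documented positional order.

    `executable` is an assumed install location, not a documented path.
    """
    if not 0 <= maf < 1:
        raise ValueError("MAF should be less than 1")
    if new_running_tag not in (0, 1):
        raise ValueError("newRunningTag must be 1 (new run) or 0 (continue)")
    return [
        executable, input_file, output_file,
        str(iter_num), str(burnin_num),
        str(observ_num), str(snp_num), str(pheno_num),
        str(maf), str(new_running_tag),
    ]


def run_bhit(command):
    """Run BHIT and raise CalledProcessError if it exits with an error."""
    return subprocess.run(command, check=True)


# Hypothetical run: 3 observations, 5 SNPs, 2 phenotypes, new status.
command = build_bhit_command("input.txt", "output.txt",
                             100000, 50000, 3, 5, 2, 0.05, 1)
print(" ".join(command))
```

Calling `run_bhit(command)` would start the run on a machine where the binary is actually installed.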
Example in input.txt
1 2 3 2 1 0.55 0.48
2 1 1 3 2 0.86 0.37
2 2 1 1 3 0.10 0.76
Each line represents one observation. In this example, one observation consists of 5 SNPs followed by 2 quantitative phenotypes. Each SNP is represented as a digit: 1 represents homozygous major allele, 2 represents heterozygous, and 3 represents homozygous minor allele.
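A quick way to check a file against this layout before running BHIT is sketched below. The function names are hypothetical, and the phenotypes are parsed as floating-point numbers because the example uses quantitative traits; discrete phenotypes may need different handling.

```python
def parse_bhit_line(line, snp_num, pheno_num):
    """Split one observation into (genotype codes, phenotype values)."""
    fields = line.split()
    if len(fields) != snp_num + pheno_num:
        raise ValueError(
            f"expected {snp_num + pheno_num} fields, found {len(fields)}")
    genotypes = [int(token) for token in fields[:snp_num]]
    for code in genotypes:
        # 1 = homozygous major, 2 = heterozygous, 3 = homozygous minor
        if code not in (1, 2, 3):
            raise ValueError(f"invalid SNP code: {code}")
    phenotypes = [float(token) for token in fields[snp_num:]]
    return genotypes, phenotypes


def read_bhit_input(lines, snp_num, pheno_num):
    """Parse every non-empty line; the result length should equal observNum."""
    return [parse_bhit_line(line, snp_num, pheno_num)
            for line in lines if line.strip()]


example = [
    "1 2 3 2 1 0.55 0.48",
    "2 1 1 3 2 0.86 0.37",
    "2 2 1 1 3 0.10 0.76",
]
observations = read_bhit_input(example, snp_num=5, pheno_num=2)
print(len(observations), "observations parsed")
```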
Note: Users can use convert.pl to convert PLINK raw files to input files. (To convert a ped or bed file to a raw file, use the plink --recodeA option.)
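For readers who want to see the mapping involved, here is a minimal Python sketch of how one row of a PLINK raw file could be turned into a BHIT input line. It is not the convert.pl script, whose behavior is not described here. It assumes the standard raw layout (six leading columns FID, IID, PAT, MAT, SEX, PHENOTYPE, then one additive count per SNP), assumes the counted allele is the minor allele, and takes the phenotype values as a separate, user-supplied list.

```python
def raw_genotype_to_bhit(token):
    """Map a count of 0, 1 or 2 minor alleles to BHIT codes 1, 2 or 3."""
    if token == "NA":
        raise ValueError("missing genotype: impute before converting")
    count = int(token)
    if count not in (0, 1, 2):
        raise ValueError(f"unexpected allele count: {token}")
    return count + 1


def convert_raw_row(row, phenotypes):
    """Build one BHIT line from a raw-file row plus phenotype values."""
    fields = row.split()
    genotype_tokens = fields[6:]  # skip FID IID PAT MAT SEX PHENOTYPE
    codes = [raw_genotype_to_bhit(token) for token in genotype_tokens]
    return " ".join([str(code) for code in codes] +
                    [str(value) for value in phenotypes])


# Hypothetical individual with 5 SNPs and 2 quantitative phenotypes.
print(convert_raw_row("FAM1 IND1 0 0 1 -9 0 1 2 1 0", [0.55, 0.48]))
```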
Output: BHIT writes 3 output files.
- outputfile: results of BHIT
- tracefile: recording marginal likelihood status
- tagfile: recording MCMC status
   0 1
D1 0 1
D2 0 1
D3 0 1
D4 1 0
D5 1 0
C1 0 1
C2 0 1
The 0 column represents independent data sets, and the other columns show that D1, D2, D3, C1 and C2 are in one partition, identified by 1.
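The partition table can be turned into explicit groups with a few lines of Python. The sketch below assumes the table is laid out as a header row of partition identifiers followed by one row per variable (a name, then one 0/1 flag per identifier); that layout is inferred from the example above rather than from a format specification, and the function name is hypothetical.

```python
def parse_partition_table(text):
    """Return {partition identifier: [variable names flagged with 1]}."""
    rows = [line.split() for line in text.splitlines() if line.strip()]
    header, body = rows[0], rows[1:]
    groups = {identifier: [] for identifier in header}
    for name, *flags in body:
        if len(flags) != len(header):
            raise ValueError(f"row for {name} does not match the header")
        for identifier, flag in zip(header, flags):
            if flag == "1":
                groups[identifier].append(name)
    return groups


table = """
   0 1
D1 0 1
D2 0 1
D3 0 1
D4 1 0
D5 1 0
C1 0 1
C2 0 1
"""
groups = parse_partition_table(table)
print("independent:", groups["0"])
print("partition 1:", groups["1"])
```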
Flowchart of BHIT pipeline
The BHIT pipeline for general species is shown in the figure below. In the preprocessing stage, missing-data imputation methods (Nputet, fastPHASE, etc.) should be applied to fill in missing values in the genotype data, if any. SNPs with MAF less than 0.05 are then filtered out. All genotype data should be converted to the appropriate format with PLINK --recodeA. If the input phenotype is a continuous trait, the Kolmogorov-Smirnov test should be used to check whether it follows a normal distribution. After that, the genotype and phenotype data should be combined and converted to the BHIT file format with the script provided on the BHIT website. To deal with genome-wide SNPs, we provide three strategies for using BHIT in the pipeline:
- Strategy A has two stages: feature selection methods (LASSO, etc.) are first used to filter all SNPs, and BHIT is then run only on the filtered set.
- Strategy B splits the SNPs by chromosome and runs BHIT on each chromosome individually.
- Strategy C focuses mainly on SNPs located in protein-coding regions and/or in known regions defined by the user.
Finally, check and validate all the results.
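The MAF filtering step in preprocessing can be illustrated with a short Python sketch. It works on genotypes already coded as 1, 2 and 3 (homozygous major, heterozygous, homozygous minor), so each code minus one is the number of minor alleles carried. The function names are hypothetical, and in practice this step is normally done with PLINK before conversion.

```python
def minor_allele_frequency(codes):
    """Compute the MAF of one SNP from its 1/2/3 genotype codes."""
    if not codes:
        raise ValueError("no genotypes given")
    # Each individual carries two alleles; (code - 1) of them are minor.
    frequency = sum(code - 1 for code in codes) / (2 * len(codes))
    # Take the smaller of the two allele frequencies in case the
    # allele coded as "minor" is the more common one in this sample.
    return min(frequency, 1 - frequency)


def filter_snps_by_maf(genotype_rows, threshold=0.05):
    """Return the column indices of SNPs with MAF >= threshold."""
    columns = list(zip(*genotype_rows))  # one tuple of codes per SNP
    return [index for index, column in enumerate(columns)
            if minor_allele_frequency(column) >= threshold]


# Hypothetical data: 3 individuals (rows) by 3 SNPs (columns).
rows = [
    [1, 2, 3],
    [1, 1, 3],
    [1, 2, 3],
]
print(filter_snps_by_maf(rows))
```

SNPs whose index is not returned would be dropped before the data are converted to the BHIT format.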
If you have questions or suggestions, please contact Juexin Wang (firstname.lastname@example.org).
Juexin Wang, Trupti Joshi, Babu Valliyodan, Haiying Shi, Yanchun Liang, Henry T. Nguyen, Jing Zhang, and Dong Xu. "A Bayesian model for detection of high-order interactions among genetic variants in genome-wide association studies." BMC Genomics 16, no. 1 (2015): 1011.