Evaluation of three methods related to genome-wide association studies for identifying gene loci using simulated data
Abstract
Introduction: Because SNPs are distributed widely throughout the genome, these markers are widely used in livestock breeding research. They have been used to predict disease risk in humans, to localize genetic variants responsible for complex traits through genome-wide association studies (GWAS), and to predict the genetic values of economically important traits in plant and animal breeding (Zhang et al 2015). Most whole-genome scanning methods fall into two groups: single-SNP genome-wide association studies (SSGWAS) and multiple-marker methods. SSGWAS can identify a large number of common variants affecting quantitative traits; however, a large proportion of the genetic variance remains unexplained (Shirali et al 2018). In quantitative traits, the proportion of phenotypic variance explained by SNPs is related to the number of adjacent SNPs in a genomic region, and the heritability attributable to such a region is defined as the regional heritability. Regional Heritability Mapping (RHM) is used to identify small genomic regions and can capture more of the missing genetic variation (Nagamine et al 2012). RHM uses a mixed-model framework based on Restricted Maximum Likelihood (REML) in which two variance components, one contributed by the whole genome and one by a specific genomic region, are fitted to estimate the genomic and regional heritabilities, respectively (Uemoto et al 2013). fastBAT (fast and flexible set-Based Association Test) is a further method that performs fast set-based association analysis (Bakshi et al 2016). The purpose of this study was to compare the SNPs and regions identified by these genome-wide association methods, to compare the results with the simulated QTLs, and to determine the false positive results of each method.
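As an illustration of the two-component model described for RHM, one common way to write it is given here. The abstract does not state the model explicitly, so the symbols (fixed effects \beta, whole-genome effects u with relationship matrix G, regional effects v with regional relationship matrix G_r, residuals e) are assumptions chosen for this sketch.

```latex
\[
\begin{aligned}
\mathbf{y} &= \mathbf{X}\boldsymbol{\beta} + \mathbf{Z}\mathbf{u} + \mathbf{Z}\mathbf{v} + \mathbf{e},\\
\mathbf{u} &\sim N(\mathbf{0}, \mathbf{G}\sigma_u^2),\quad
\mathbf{v} \sim N(\mathbf{0}, \mathbf{G}_r\sigma_v^2),\quad
\mathbf{e} \sim N(\mathbf{0}, \mathbf{I}\sigma_e^2),\\
h_g^2 &= \frac{\sigma_u^2}{\sigma_p^2},\qquad
h_r^2 = \frac{\sigma_v^2}{\sigma_p^2},\qquad
\sigma_p^2 = \sigma_u^2 + \sigma_v^2 + \sigma_e^2 .
\end{aligned}
\]
```

Under this notation, h_g^2 is the genomic heritability and h_r^2 the regional heritability, both estimated by REML for each region tested.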
Material and methods: Markers and populations were simulated with a forward-in-time process using QMSim software (Sargolzaei and Schenkel 2009). For this population, 27586 single nucleotide polymorphisms (SNPs) were considered on 3 pairs of autosomal chromosomes. Simulation was performed under 3 scenarios with 75, 150 and 300 quantitative trait loci (QTL). After quality control, the number of SNPs in the analysis ranged from 19662 to 23817. For each scenario, 10 replicates were simulated. In all scenarios, heritability was 0.2, divided equally between the polygenic and QTL effects. Whole-genome and pedigree-based genetic relationship matrices were used in all 3 methods to estimate genetic parameters. Whole-genome additive effects were estimated using the whole-genome relationship matrix built from all SNPs, and the additive effects of genomic regions were estimated using the regional genomic relationship matrix. Both genomic relationship matrices were computed from SNP-based genetic relationships between individuals using GCTA software (Yang et al 2011). The pedigree-based genetic relationship matrix was built from the kinship between individuals using the pedigree package (Coster 2013) in RStudio (RStudio Inc 2013). For RHM and the estimation of variance components, windows of 50 genotyped SNPs were used, with an overlap of 25 SNPs between consecutive windows throughout the genome. SSGWAS analyses were performed with the MLMA method (Yu et al 2006) in GCTA. MLMA P-values were evaluated against a 5% significance threshold with Bonferroni correction. The SSGWAS results were then analysed with the fastBAT method, also in GCTA.
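The window layout and the Bonferroni threshold described in the methods can be sketched in a few lines of Python. This is a minimal illustration, not the authors' pipeline: the function names are hypothetical, SNPs are treated as one ordered list (in practice windows would be built per chromosome), and the last window is simply truncated at the end of the SNP list.

```python
def sliding_windows(n_snps, window_size=50, step=25):
    """Return (start, end) SNP index pairs, end exclusive.

    With window_size=50 and step=25, consecutive windows share 25 SNPs.
    Assumption: the final window is truncated rather than dropped.
    """
    windows = []
    start = 0
    while start < n_snps:
        end = min(start + window_size, n_snps)
        windows.append((start, end))
        if end == n_snps:
            break  # the last SNP is covered; stop here
        start += step
    return windows


def bonferroni_threshold(alpha, n_tests):
    """Per-test P-value threshold for a family-wise error rate of alpha."""
    if n_tests <= 0:
        raise ValueError("n_tests must be positive")
    return alpha / n_tests


# Example with the smallest post-QC SNP count reported in the text.
windows = sliding_windows(19662)
threshold = bonferroni_threshold(0.05, 19662)
```

The threshold shrinks as the number of tested SNPs grows, which is one reason single-SNP tests can miss QTLs with small individual effects.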
Results and discussion: For each replicate, after significant SNPs were identified, the genetic variance explained by these SNPs was estimated using the equation of Falconer & Mackay (1996). Table 1 reports the number of QTLs detected by SSGWAS, the MAF of these QTLs, and the range and mean of the genetic variance explained by the significant SNPs and QTLs. Across the 30 simulation replicates, SSGWAS detected 16 QTLs, of which 2 had MAF≤0.1 and the remainder had MAF≥0.1. The fastBAT method identified 107 significant regions and detected 120 QTLs across the 3 scenarios, 52 of them with MAF≤0.1. All QTLs detected by fastBAT and SSGWAS were also detected by RHM. RHM detected 612 regions containing simulated QTLs and 316 QTLs with MAF≤0.1. In all replicates, the variance explained by SNPs was equal to the variance explained by QTLs. SSGWAS detected fewer QTLs than the other two methods, and the maximum variance explained by its QTLs was 14.9%. A detected QTL was classified as a false positive when the windows before and after the significant window containing it held no significant QTL. The percentage of false positive QTLs was higher in SSGWAS than in the other two methods. In fastBAT, unlike the other two methods, none of the detected QTLs were false positives. Table 5 shows the number of detected QTLs, the MAF range of QTLs, and the range and mean of the genetic variance explained by the detected QTLs and SNPs in fastBAT. Many QTLs and regions detected by RHM were not detected by SSGWAS or fastBAT. The genetic variance explained by QTLs detected in RHM ranged from 7.26 to 46.86%, higher than in the other two methods. Table 6 compares the three methods by the number of detected QTLs, the number of false positive QTLs, the number of stable QTLs and the number of detected QTLs with MAF≤0.1.
We found that QTLs with MAF≤0.1 were detected more frequently by RHM than by the other two methods. These results confirm that the RHM method was able to identify more of the QTLs affecting the trait variance.
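The abstract cites an equation for the genetic variance explained by significant SNPs without reproducing it. A minimal sketch follows, assuming the intended expression is the standard additive variance of a biallelic locus, 2p(1-p)a^2, where p is the allele frequency and a the allele substitution effect. The function names and the example values are hypothetical and are not taken from the study.

```python
def snp_additive_variance(p, a):
    """Additive genetic variance of one biallelic locus: 2 * p * (1 - p) * a**2.

    p: allele frequency (0..1); a: allele substitution effect.
    Assumption: this is the form of the equation the abstract refers to.
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("allele frequency must lie in [0, 1]")
    return 2.0 * p * (1.0 - p) * a ** 2


def percent_genetic_variance(snps, total_genetic_variance):
    """Percentage of total genetic variance explained by a set of SNPs.

    snps: iterable of (allele_frequency, effect) pairs.
    Assumption: loci are treated as independent, so variances are summed.
    """
    if total_genetic_variance <= 0:
        raise ValueError("total genetic variance must be positive")
    explained = sum(snp_additive_variance(p, a) for p, a in snps)
    return 100.0 * explained / total_genetic_variance


# Illustrative values only: two SNPs, total genetic variance of 2.0.
example = percent_genetic_variance([(0.5, 1.0), (0.1, 0.5)], 2.0)
```

Because the variance scales with p(1-p), a low-MAF QTL explains little variance unless its effect is large, which is consistent with single-SNP tests detecting few QTLs with MAF≤0.1.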