In MethylIT, the signal detection step requires for the knowledge of the probability distribution of the methylation noise, which is the null hypothesis used to discriminate the signal from the noise. Both, signal and noise, are expressed in terms of an information divergence of methylation levels, which (currently in MethylIT) could be the Hellinger divergence or the total variation distance. As shown in reference (1), on statistical physics basis, the probability distribution of the noise is a member of the generalized gamma distribution. In particular, if the methylation changes on the DNA molecule follow a Poisson process, then the noise follows a Weibull probability distribution.
Function 'gofReport' search for the best fitted model between the set of models requested by the user. Two goodnessoffit (GoF) criteria are applied to select the best fitted model: Akaike's information criterion (AIC) and the correlation coefficient of crossvalidations for the nonlinear regressions (R.Cross.val) (2). These criteria evaluate different information inferred from the models. AIC deals with the tradeoff between the goodness of fit of the model and the complexity of the model, while R.Cross.val provides information on the prediction power/performance of the model when confronted with external dataset.
Although the numerical algorithms to accomplish the nonlinear fit are not
perfect, in general, the model with the lowest AIC must have the highest
R.Cross.val. If the model with the lowest AIC has not the highest
R.Cross.val, then further analyzes are required. These analyzes could include
the visualization of the graphics for the density distribution, evaluation of
whether the parameter values can be meaningful or not, etc. Nevertheless, the
best model will, in general, lead to the identification of the greater amount
of potential DMPs and DMPs, as well as, the highest classification accuracy
estimated with functions estimateCutPoint
and
evaluateDIMPclass
. In the worse scenario, these observations
can ultimately lead to a posthoc decision on which the best model is.
gofReport( HD, model = c("Weibull2P", "Weibull3P", "Gamma2P", "Gamma3P"), column = 9, absolute = FALSE, output = c("best.model", "all"), confl_model = FALSE, num.cores = 1L, verbose = FALSE, ... )
HD  An 'InfDiv' object returned by function


model  A character vector naming the models to fit. Default is
model = c('Weibull2P', 'Weibull3P', 'Gamma2P', 'Gamma3P'). See

column  An integer number denoting the index of the GRanges column where
the information divergence is given. Default column = 9, which is the
column where the Hellinger divergence values are reported (by default) by
function 
absolute  Logic (default, FALSE). Total variation (TV, the difference of methylation levels) is normally an output in the downstream MethylIT analysis. If 'absolute = TRUE', then TV is transformed into TV, which is an information divergence that can be fitted as well. 
output  If output == 'all', the table with the GoF statistics is returned in a list together with the best fitted model and the corresponding statistics. Default is 'best.model' 
confl_model  Logic. If TRUE, then the best model based on highest R.Cross.val is returned for those samples where the model(s) with lowest AIC has not the highest R.Cross.val. 
num.cores  The number of cores to use in the nonlinear fit step, i.e.
at most how many child processes will be run simultaneously (see

verbose  If TRUE, prints the function log to stdout 
...  Further arguments to pass to 
If 'output = 'best.model'', a character vector with the name of the
best fitted model for each sample and model statistics, which can be used the
next step to get the potential DMPs with function
getPotentialDIMP
. Otherwise, it will return a list with the
table carrying the GoF values and the previously mentioned data.
The best model selection is based on the lowest AIC. However, if 'confl_model = TRUE', then the returned list will contain a sublist named 'confl_model' with the best model selected based on the highest R.Cross.val. This sublist is returned only if at least there is one sample where the model with the lowest AIC has not the highest R.Cross.val. This report is useful to analyze any conflict between models. For example, some times the best model selected based on AIC has a R.Cross.val = 0.99945, while the highest R.Cross.val is 0.99950. In such a situation the model with lowest AIC is still fine. However, some times the model with the best AIC has some meaningless parameter value. For example, scale = 0.0000000001 in 'GGamma3P' or in 'Weibull2P' models. This last situation can result from the numerical algorithm used in the parameter estimation. The numerical algorithms for nonlinear fit estimation are not perfect!
For distribution models that include a location parameter ('Weibull3P', 'Gamma3P', and 'GGamma4P') there is an additional important constraint, which cannot be evaluated with 'AIC' or 'R.Cross.val': \(\mu > 0\). That is, the information divergences are strictly positive magnitudes, therefore, any best model in terms of AIC with values \(\mu < 0\) are meaningless and rejected.
R. Sanchez and S. A. Mackenzie, “Information Thermodynamics of Cytosine DNA Methylation,” PLoS One, vol. 11, no. 3, p. e0150427, Mar. 2016.
Stevens JP. Applied Multivariate Statistics for the Social Sciences. Fifth Edit. Routledge Academic; 2009.
nonlinearFitDist
## Loading information divergence dataset data(HD) ## Subsetting object HD (for the sake of runnig a faster example) hd < lapply(HD, function(x) x[seq_len(100)], keep.attr = TRUE) ## The GoF report dt < gofReport(hd)#>    0%  ==================  25%  ===================================  50%  ====================================================  75%  ====================================================================== 100% #> #> *** Creating report ...#> Warning: The best fitted model for sample(s) T1 require(s) for further analysis. #> The model with the lowest AIC must have the highest R.Cross.val#> w2p_AIC w2p_R.Cross.val w3p_AIC w3p_R.Cross.val g2p_AIC #> C1 455.0259 0.9963849 NA NA 437.3394 #> C2 426.8983 0.9951161 431.1954 0.9956952 409.3027 #> C3 582.1629 0.9990002 NA NA 560.5544 #> T1 494.5917 0.9975048 493.8734 0.9975519 461.7047 #> T2 472.0464 0.9971160 NA NA 477.2328 #> T3 517.5384 0.9980853 515.9888 0.9981240 493.3195 #> g2p_R.Cross.val g3p_AIC g3p_R.Cross.val bestModel #> C1 0.9958804 434.8702 0.9958220 w2p #> C2 0.9950139 411.1744 0.9953230 w3p #> C3 0.9988600 559.6221 0.9987277 w2p #> T1 0.9970319 484.9045 0.9974664 Needs revision #> T2 0.9974107 Inf 0.0000000 g2p #> T3 0.9977381 526.9134 0.9983114 g3p #>## To get the potential DMPs ps_dmp < getPotentialDIMP(LR = hd, nlms = dt$nlms, div.col = 9L, dist.name = dt$bestModel)