In Methyl-IT, the signal detection step requires for the knowledge of the probability distribution of the methylation noise, which is the null hypothesis used to discriminate the signal from the noise. Both, signal and noise, are expressed in terms of an information divergence of methylation levels, which (currently in Methyl-IT) could be the Hellinger divergence or the total variation distance. As shown in reference (1), on statistical physics basis, the probability distribution of the noise is a member of the generalized gamma distribution. In particular, if the methylation changes on the DNA probability distribution.

Function 'gofReport' search for the best fitted model between the set of models requested by the user. Two goodness-of-fit (GoF) criteria are applied to select the best fitted model: Akaike's information criterion (AIC) and the correlation coefficient of cross-validations for the nonlinear regressions (R.Cross.val) (2). These criteria evaluate different information inferred from the models. AIC deals with the trade-off between the goodness of fit of the model and the complexity of the model, while R.Cross.val provides information on the prediction power/performance of the model when confronted with external dataset.

Although the numerical algorithms to accomplish the nonlinear fit are not perfect, in general, the model with the lowest AIC must have the highest R.Cross.val. If the model with the lowest AIC has not the highest R.Cross.val, then further analyzes are required. These analyzes could include the visualization of the graphics for the density distribution, evaluation of whether the parameter values can be meaningful or not, etc. Nevertheless, the best model will, in general, lead to the identification of the greater amount of potential DMPs and DMPs, as well as, the highest classification accuracy estimated with functions estimateCutPoint and evaluateDIMPclass. In the worse scenario, these observations can ultimately lead to a post-hoc decision on which the best model is.

gofReport(
HD,
model = c("Weibull2P", "Weibull3P", "Gamma2P", "Gamma3P", "GGamma3P", "GGamma4P"),
column = 9,
absolute = FALSE,
output = c("best.model", "all"),
alt_models = FALSE,
r.cv = FALSE,
npoints = NULL,
min.scale = 1e-05,
min.mu = 0.001,
mu.rel.err = 0.89,
num.cores = 1L,
verbose = FALSE,
...
)

# S3 method for ProbDistrList
print(x, ...)

## Arguments

HD

An 'InfDiv' object returned by function estimateDivergence.

model

A character vector naming the models to fit. Default is model = c('Weibull2P', 'Weibull3P', 'Gamma2P', 'Gamma3P'). See nonlinearFitDist for more options.

column

An integer number denoting the index of the GRanges column where the information divergence is given. Default column = 9, which is the column where the Hellinger divergence values are reported (by default) by function estimateDivergence.

absolute

Logic (default, FALSE). Total variation (TV, the difference of methylation levels) is normally an output in the downstream MethylIT analysis. If 'absolute = TRUE', then TV is transformed into |TV|, which is an information divergence that can be fitted as well.

output

If output == 'all', the table with the GoF statistics is returned in a list together with the best fitted model and the corresponding statistics. Default is 'best.model'

alt_models

logical(1). If TRUE, then the best model based on highest R.Cross.val is returned for those samples where the model(s) with lowest AIC has not the highest R.Cross.val.

r.cv

logical(1). Whether to select the best model based on the highest R.Cross.val (2) (see details section).

npoints

number of points to be used in the fit. Default is NULL.

min.scale

A number. The nonlinear fit of GGamma family distributions sometimes yields weird estimation of the scale parameter with values < 1e-5. Based on our experience, scale parameter values lower than min.scale = 1e-5 are probably meaningless. The result of numerical errors, the nonlinear algorithm are not perfect.

min.mu, mu.rel.err

Numbers. Models with location parameter $$\mu$$ sometimes reports $$\mu$$ values close to zero, which generally corresponds to a non statistically significant model coefficient/parameter. This detail is not detected by the AIC or the cross-validations correlation coefficient. In this case the best model is the model without location parameter, where $$\mu = 0$$. If $$\mu < min.mu$$ and $$|E(\mu) - min(idiv)|/min(idiv) > mu.rel.err$$, where $$E(\mu)$$ is the estimated value and $$min(idiv)$$ is the minimum observed value of the information divergence provided, then the issue is reported in the column named 'model'. It is expected that, at least $$mu.rel.err <= 0.89$$. That is, that its expected that the difference between $$E(\mu)$$ and $$min(idiv)$$ is not greater than the 89%.

num.cores

The number of cores to use in the nonlinear fit step, i.e. at most how many child processes will be run simultaneously (see bplapply function from BiocParallel package).

verbose

If TRUE, prints the function log to stdout

...

Further arguments to pass to nonlinearFitDist.

x

An object from 'ProbDistrList' class

## Value

If 'output = 'best.model'', a character vector with the name of the best fitted model for each sample and model statistics, which can be used the next step to get the potential DMPs with function getPotentialDIMP. Otherwise, it will return a list with the table carrying the GoF values and the previously mentioned data.

## Details

The best model selection is based on the lowest AIC. However, if $$alt_models = TRUE$$ and $$r.cv = FALSE$$, then the returned list will contain two sublists named 'alt_models' and 'alt_nlms' with the best model selected based on the highest R.Cross.val. If $$r.cv = TRUE$$, best model selection is based on the highest R.Cross.val and sublist named 'alt_models' will carry the model selected based on the lowest AIC. This sublist is returned only if at least there is one sample where the model with the lowest AIC has not the highest R.Cross.val. This report is useful to analyze any conflict between models. For example, some times the best model selected based on AIC has a R.Cross.val = 0.99945, while the highest R.Cross.val is 0.99950. In such a situation the model with lowest AIC is still fine. However, some times the model with the best AIC has some meaningless parameter value. For example, scale = 0.0000000001 in 'GGamma3P' or in 'Weibull2P' models. This last situation can result from the numerical algorithm used in the parameter estimation. The numerical algorithms for nonlinear fit estimation are not perfect!

For distribution models that include a location parameter ('Weibull3P', 'Gamma3P', and 'GGamma4P') there is an additional important constraint, which cannot be evaluated with 'AIC' or 'R.Cross.val': $$\mu > 0$$. That is, the information divergences are strictly positive magnitudes, therefore, any best model in terms of AIC with values $$\mu < 0$$ are meaningless and rejected.

## References

1. R. Sanchez and S. A. Mackenzie, “Information Thermodynamics of Cytosine DNA Methylation,” PLoS One, vol. 11, no. 3, p. e0150427, Mar. 2016.

2. Stevens JP. Applied Multivariate Statistics for the Social Sciences. Fifth Edit. Routledge Academic; 2009.

## Author

Robersy Sanchez 11/25/2019 https://github.com/genomaths

## Examples

data(HD)

## Subsetting object HD (for the sake of runnig a faster example)
hd <- lapply(HD, function(x) x[seq_len(100)], keep.attr = TRUE)

## The GoF report
dt <- gofReport(hd)
#>
|
|                                                                      |   0%
|
|============                                                          |  17%
|
|=======================                                               |  33%
|
|===================================                                   |  50%
|
|===============================================                       |  67%
|
|==========================================================            |  83%
|
|======================================================================| 100%
#>
#>  *** Creating report ...
#>      w2p_AIC w2p_R.Cross.val w3p_AIC w3p_R.Cross.val   g2p_AIC g2p_R.Cross.val
#> C1 -444.0595       0.9963234     Inf               0 -441.8100       0.9960348
#> C2 -405.1271       0.9938462     Inf               0 -390.5324       0.9940991
#> C3 -491.5704       0.9978073     Inf               0 -511.8578       0.9981278
#> T1 -479.9246       0.9970872     Inf               0 -448.9375       0.9966225
#> T2 -470.1729       0.9970691     Inf               0 -473.6995       0.9972262
#> T3 -531.2909       0.9983590     Inf               0 -504.9352       0.9979546
#>    g3p_AIC g3p_R.Cross.val  gg3p_AIC gg3p_R.Cross.val gg4p_AIC gg4p_R.Cross.val
#> C1     Inf               0 -442.4498        0.9961000      Inf                0
#> C2     Inf               0 -434.9770        0.9963306      Inf                0
#> C3     Inf               0 -515.8301        0.9983549      Inf                0
#> T1     Inf               0 -486.5860        0.9976408      Inf                0
#> T2     Inf               0 -472.8683        0.9970996      Inf                0
#> T3     Inf               0 -525.0245        0.9983480      Inf                0
#>    bestModel
#> C1       w2p
#> C2      gg3p
#> C3      gg3p
#> T1      gg3p
#> T2       g2p
#> T3       w2p
#>

## Report the model where AIC and R.Cross.val are in conflict.
dt <- gofReport(HD = hd, output = "all", alt_models = TRUE)
#>
|
|                                                                      |   0%
|
|============                                                          |  17%
|
|=======================                                               |  33%
|
|===================================                                   |  50%
|
|===============================================                       |  67%
|
|==========================================================            |  83%
|
|======================================================================| 100%
#>
#>  *** Creating report ...