R/estimateBayesianDivergence.R
estimateBayesianDivergence.Rd
The Information divergence of methylation levels is estimated using the direct estimation or a Bayesian approach of the methylation levels. if 'meth.level = FALSE', Hellinger divergence is computed as given in reference (1). The Bayesian approach followed is described in reference (2).
estimateBayesianDivergence(
x,
Bayesian = FALSE,
init.pars = NULL,
via.optim = TRUE,
JD = FALSE,
jd.stat = FALSE,
columns = c(mC1 = 1, uC1 = 2, mC2 = 3, uC2 = 4),
meth.level = FALSE,
preserve.gr = FALSE,
loss.fun = c("linear", "huber", "smooth", "cauchy", "arctg"),
logbase = 2,
verbose = FALSE,
num.cores = 1L,
tasks = 0L,
...
)
A matrix of counts or GRanges object with the table of counts in the meta-columns (methylated mC and unmethylated uC cytosines). Unless specified in the parameter 'columns', the methylation counts must be given in the first four columns: 'mC1' and 'uC1' methylated and unmethylated counts for control sample, and 'mC2' and 'uC2' methylated and unmethylated counts for treatment sample, respectively.
logical. Whether to perform the estimations based on posterior estimations of methylation levels.
initial parameter values. Defaults is NULL and an initial
guess is estimated using optim
function. If the initial
guessing fails initial parameter values are to alpha = 1 &
beta = 1, which imply the parsimony pseudo-counts greater than zero.
Logical. Whether to estimate beta distribution parameters
via optim
or nls.lm
. If any of
this approaches fail then parameters used init.pars will be returned.
Logic (Default:FALSE). Option on whether to add a column with
values of J-information divergence (see estimateJDiv
).
It is only compute if JD = TRUE and meth.level = FALSE.
logical(1). Whether to compute the \(JD\) statistic with
asymptotic Chi-squared distribution with one degree of freedom (see
estimateJDiv
).
Vector of integer numbers of the columns where the counts 'mC1', 'uC1', 'mC2', and 'uC2' are in the matrix (default 1 to 4). That is, the input could have more than 4 columns, but only 4 columns with the counts are used. if 'meth.level == TRUE', then columns = \(c('p1', 'p2')\).
logical(1). Methylation levels can be provided in place of counts.
logical(1). Option of whether to preserve all the metadata from the original GRanges object.
Loss function(s) used in the estimation of the best fitted model to beta distribution (only applied when Bayesian=TRUE; see (Loss function)). This fitting uses the approach followed in in the R package usefr. After \(z = 1/2 * sum((f(x) - y)^2)\) we have:
"linear": linear function which gives a standard least squares: \(loss(z) = z\).
"huber": Huber loss, \(loss(z) = ifelse(z \leq 1, z, sqrt(z) -1)\).
"smooth": Smooth approximation to the sum of residues absolute values: \(loss(z) = 2*(sqrt(z + 1) - 1)\).
"cauchy": Cauchy loss: \(loss(z) = log(z + 1)\).
"arctg": arc-tangent loss function: \(loss(x) = atan(z)\).
Loss 'linear' function works well for most of the methylation datasets with acceptable quality.
Logarithm base used to compute the JD (if JD = TRUE). Logarithm base 2 is used as default (bit unit). Use logbase = exp(1) for natural logarithm.
if TRUE, prints the function log to stdout
The number of cores to use, i.e. at most how many child processes will be run simultaneously (see 'bplapply' function from BiocParallel package).
integer(1). The number of tasks per job. value must be a scalar integer >= 0L. In this documentation a job is defined as a single call to a function, such as bplapply, bpmapply etc. A task is the division of the \(X\) argument into chunks. When tasks == 0 (default), \(X\) is divided as evenly as possible over the number of workers (see MulticoreParam from BiocParallel package).
Optional parameter values for: maxiter, ftol, ptol, and gradtol
from nlsLM
and nlm
functions.
The input matrix or GRanges object with the four columns of counts and additional columns. If Bayessian = TRUE, the results are based on the posterior estimations of methylation levels. The basic additional columns are:
The original matrix of methylated \(c_{ij}\) and unmethylated \(t_{ij}\) read counts from control \(j=1\) and treatment \(j=2\) samples at positions \(i\).
'p1' and 'p2': methylation levels for control and treatment, respectively. If 'meth.level = FALSE' and 'Bayesian = TRUE' (recommended), 'p1' and 'p2' are estimated following the Bayesian approach described in reference (1).
'bay.TV': total variation TV = p2 - p1
'TV': total variation based on simple counts: \(TV=c_1/(c_1+t_1)-c_2/(c_2+t_2)\), where \(c_i\) and \(t_i\) denote methylated and unmethylated read counts, respectively.
'hdiv': Hellinger divergence. If Bayesian = TRUE, the results are based on the posterior estimations of methylation levels. if 'meth.level = FALSE', then 'hdiv' is computed as given in reference (1), otherwise as: $$hdiv = (sqrt(p_1) - sqrt(p_2))^2 + (sqrt(1 -p_1) - sqrt(1 - p_2))^2$$
For the current version, the Information divergence of methylation levels is estimated based on Hellinger divergence. If read counts are provided, then Hellinger divergence is computed as given in the first formula from Theorem 1 from reference 1. In the present case,
$$H = 2 (n_1 + 1) (n_2 + 1)*((sqrt(p_1) - sqrt(p_2))^2 + (sqrt(1-p_2) - sqrt(1-p_2))^2)/(n_1 + n_2 + 2)$$
where \(n_1\) and \(n_2\) are the coverage for the control and treatment, respectively. Notice that each row from the matrix of counts correspond to a single cytosine position and has four values corresponding to 'mC1' and 'uC1' (control), and mC2' and 'uC2' for treatment.
If the methylation levels are provided in place of counts, then Hellinger divergence is computed as: $$H = (sqrt(p_1) - sqrt(p_2))^2 + (sqrt(1 - p_1) - sqrt(1 - p_2))^2$$
This formula assumes that the probability vectors derived from the
methylation levels \(p_i = c(p_{i2}, 1 - p_{i2})\) (see function
estimateHellingerDiv
) are an unbiased estimation of the
expected one. The Bayesian approach followed here is described in reference
(2).
Basu A., Mandal A., Pardo L. Hypothesis testing for two discrete populations based on the Hellinger distance. Stat. Probab. Lett. 2010, 80: 206-214.
Sanchez R, Yang X, Maher T, Mackenzie S. Discrimination of DNA Methylation Signal from Background Variation for Clinical Diagnostics. Int. J. Mol Sci, 2019, 20:5343.
## The read count data are created
gr <- data.frame(chr = 'chr1', start = 1:10, end = 1:10,
strand = '*',
mC1 = rnbinom(size = 10, mu = 4, n = 500),
uC1 = rnbinom(size = 10, mu = 4, n = 500),
mC2 = rnbinom(size = 10, mu = 4, n = 500),
uC2 = rnbinom(size = 10, mu = 4, n = 500))
gr <- makeGRangesFromDataFrame(gr, keep.extra.columns = TRUE)
## Estimation of the information divergences
hd <- estimateBayesianDivergence(gr, JD = TRUE)
## Keep in mind that Hellinger and J divergences are, in general,
## correlated
cor.test(x = as.numeric(hd$hdiv), y = as.numeric(hd$jdiv),
method = 'kendall')
#>
#> Kendall's rank correlation tau
#>
#> data: as.numeric(hd$hdiv) and as.numeric(hd$jdiv)
#> z = 26.237, p-value < 2.2e-16
#> alternative hypothesis: true tau is not equal to 0
#> sample estimates:
#> tau
#> 0.7889815
#>
## An example with methylation levels
set.seed(123)
sites = 10
dat = data.frame(chr = 'chr1', start = 1:sites, end = 1:sites,strand = '*',
m1 = runif(n = sites), m2 = runif(n = sites))
dat = makeGRangesFromDataFrame(dat, keep.extra.columns = TRUE)
## Transforming the list of data frames into a GRanges object
hd <- estimateBayesianDivergence(x = dat, columns = c(p1 = 1, p2 = 2),
meth.level = TRUE, preserve.gr = TRUE)