The Information divergence of methylation levels is estimated using the direct estimation or a Bayesian approach of the methylation levels. if 'meth.level = FALSE', Hellinger divergence is computed as given in reference (1). The Bayesian approach followed is described in reference (2).

estimateBayesianDivergence(
  x,
  Bayesian = FALSE,
  init.pars = NULL,
  via.optim = TRUE,
  JD = FALSE,
  jd.stat = FALSE,
  columns = c(mC1 = 1, uC1 = 2, mC2 = 3, uC2 = 4),
  meth.level = FALSE,
  preserve.gr = FALSE,
  loss.fun = c("linear", "huber", "smooth", "cauchy", "arctg"),
  logbase = 2,
  verbose = FALSE,
  num.cores = 1L,
  tasks = 0L,
  ...
)

Arguments

x

A matrix of counts or GRanges object with the table of counts in the meta-columns (methylated mC and unmethylated uC cytosines). Unless specified in the parameter 'columns', the methylation counts must be given in the first four columns: 'mC1' and 'uC1' methylated and unmethylated counts for control sample, and 'mC2' and 'uC2' methylated and unmethylated counts for treatment sample, respectively.

Bayesian

logical. Whether to perform the estimations based on posterior estimations of methylation levels.

init.pars

initial parameter values. Defaults is NULL and an initial guess is estimated using optim function. If the initial guessing fails initial parameter values are to alpha = 1 & beta = 1, which imply the parsimony pseudo-counts greater than zero.

via.optim

Logical. Whether to estimate beta distribution parameters via optim or nls.lm. If any of this approaches fail then parameters used init.pars will be returned.

JD

Logic (Default:FALSE). Option on whether to add a column with values of J-information divergence (see estimateJDiv). It is only compute if JD = TRUE and meth.level = FALSE.

jd.stat

logical(1). Whether to compute the \(JD\) statistic with asymptotic Chi-squared distribution with one degree of freedom (see estimateJDiv).

columns

Vector of integer numbers of the columns where the counts 'mC1', 'uC1', 'mC2', and 'uC2' are in the matrix (default 1 to 4). That is, the input could have more than 4 columns, but only 4 columns with the counts are used. if 'meth.level == TRUE', then columns = \(c('p1', 'p2')\).

meth.level

logical(1). Methylation levels can be provided in place of counts.

preserve.gr

logical(1). Option of whether to preserve all the metadata from the original GRanges object.

loss.fun

Loss function(s) used in the estimation of the best fitted model to beta distribution (only applied when Bayesian=TRUE; see (Loss function)). This fitting uses the approach followed in in the R package usefr. After \(z = 1/2 * sum((f(x) - y)^2)\) we have:

  1. "linear": linear function which gives a standard least squares: \(loss(z) = z\).

  2. "huber": Huber loss, \(loss(z) = ifelse(z \leq 1, z, sqrt(z) -1)\).

  3. "smooth": Smooth approximation to the sum of residues absolute values: \(loss(z) = 2*(sqrt(z + 1) - 1)\).

  4. "cauchy": Cauchy loss: \(loss(z) = log(z + 1)\).

  5. "arctg": arc-tangent loss function: \(loss(x) = atan(z)\).

Loss 'linear' function works well for most of the methylation datasets with acceptable quality.

logbase

Logarithm base used to compute the JD (if JD = TRUE). Logarithm base 2 is used as default (bit unit). Use logbase = exp(1) for natural logarithm.

verbose

if TRUE, prints the function log to stdout

num.cores

The number of cores to use, i.e. at most how many child processes will be run simultaneously (see 'bplapply' function from BiocParallel package).

tasks

integer(1). The number of tasks per job. value must be a scalar integer >= 0L. In this documentation a job is defined as a single call to a function, such as bplapply, bpmapply etc. A task is the division of the \(X\) argument into chunks. When tasks == 0 (default), \(X\) is divided as evenly as possible over the number of workers (see MulticoreParam from BiocParallel package).

...

Optional parameter values for: maxiter, ftol, ptol, and gradtol from nlsLM and nlm functions.

Value

The input matrix or GRanges object with the four columns of counts and additional columns. If Bayessian = TRUE, the results are based on the posterior estimations of methylation levels. The basic additional columns are:

1)

The original matrix of methylated \(c_{ij}\) and unmethylated \(t_{ij}\) read counts from control \(j=1\) and treatment \(j=2\) samples at positions \(i\).

2)

'p1' and 'p2': methylation levels for control and treatment, respectively. If 'meth.level = FALSE' and 'Bayesian = TRUE' (recommended), 'p1' and 'p2' are estimated following the Bayesian approach described in reference (1).

3)

'bay.TV': total variation TV = p2 - p1

4)

'TV': total variation based on simple counts: \(TV=c_1/(c_1+t_1)-c_2/(c_2+t_2)\), where \(c_i\) and \(t_i\) denote methylated and unmethylated read counts, respectively.

5)

'hdiv': Hellinger divergence. If Bayesian = TRUE, the results are based on the posterior estimations of methylation levels. if 'meth.level = FALSE', then 'hdiv' is computed as given in reference (1), otherwise as: $$hdiv = (sqrt(p_1) - sqrt(p_2))^2 + (sqrt(1 -p_1) - sqrt(1 - p_2))^2$$

Details

For the current version, the Information divergence of methylation levels is estimated based on Hellinger divergence. If read counts are provided, then Hellinger divergence is computed as given in the first formula from Theorem 1 from reference 1. In the present case,

$$H = 2 (n_1 + 1) (n_2 + 1)*((sqrt(p_1) - sqrt(p_2))^2 + (sqrt(1-p_2) - sqrt(1-p_2))^2)/(n_1 + n_2 + 2)$$

where \(n_1\) and \(n_2\) are the coverage for the control and treatment, respectively. Notice that each row from the matrix of counts correspond to a single cytosine position and has four values corresponding to 'mC1' and 'uC1' (control), and mC2' and 'uC2' for treatment.

If the methylation levels are provided in place of counts, then Hellinger divergence is computed as: $$H = (sqrt(p_1) - sqrt(p_2))^2 + (sqrt(1 - p_1) - sqrt(1 - p_2))^2$$

This formula assumes that the probability vectors derived from the methylation levels \(p_i = c(p_{i2}, 1 - p_{i2})\) (see function estimateHellingerDiv) are an unbiased estimation of the expected one. The Bayesian approach followed here is described in reference (2).

References

  1. Basu A., Mandal A., Pardo L. Hypothesis testing for two discrete populations based on the Hellinger distance. Stat. Probab. Lett. 2010, 80: 206-214.

  2. Sanchez R, Yang X, Maher T, Mackenzie S. Discrimination of DNA Methylation Signal from Background Variation for Clinical Diagnostics. Int. J. Mol Sci, 2019, 20:5343.

Author

Robersy Sanchez (https://genomaths.com)

Examples

## The read count data are created
gr <- data.frame(chr = 'chr1', start = 1:10, end = 1:10,
                strand = '*',
                mC1 = rnbinom(size = 10, mu = 4, n = 500),
                uC1 = rnbinom(size = 10, mu = 4, n = 500),
                mC2 = rnbinom(size = 10, mu = 4, n = 500),
                uC2 = rnbinom(size = 10, mu = 4, n = 500))
gr <- makeGRangesFromDataFrame(gr, keep.extra.columns = TRUE)

## Estimation of the information divergences
hd <- estimateBayesianDivergence(gr, JD = TRUE)

## Keep in mind that Hellinger and J divergences are, in general,
## correlated
cor.test(x = as.numeric(hd$hdiv), y = as.numeric(hd$jdiv),
        method = 'kendall')
#> 
#> 	Kendall's rank correlation tau
#> 
#> data:  as.numeric(hd$hdiv) and as.numeric(hd$jdiv)
#> z = 26.237, p-value < 2.2e-16
#> alternative hypothesis: true tau is not equal to 0
#> sample estimates:
#>       tau 
#> 0.7889815 
#> 

## An example with methylation levels
set.seed(123)
sites = 10
dat = data.frame(chr = 'chr1', start = 1:sites, end = 1:sites,strand = '*',
                m1 = runif(n = sites), m2 = runif(n = sites))

dat =  makeGRangesFromDataFrame(dat, keep.extra.columns = TRUE)

## Transforming the list of data frames into a GRanges object
hd <- estimateBayesianDivergence(x = dat, columns = c(p1 = 1, p2 = 2),
                                meth.level = TRUE, preserve.gr = TRUE)