Given the methylation levels of two individuals at a given cytosine site, this function computes the J information divergence (JD) between the methylation levels. The motivation for introducing JD in Methyl-IT is based on the following:

  1. It is a symmetrised form of the Kullback–Leibler divergence (\(D_{KL}\)). Kullback and Leibler themselves defined the divergence as \(D_{KL}(P || Q) + D_{KL}(Q || P)\), which is symmetric and nonnegative, where the probability distributions P and Q are defined on the same probability space (see reference (1) and Wikipedia).

  2. In general, JD is highly correlated with the Hellinger divergence, which is the main divergence currently used in Methyl-IT (see the examples for the function estimateDivergence).

  3. By construction, the unit of measurement of JD is the bit of information, which sets the basis for further information-thermodynamics analyses.

estimateJDiv(
  p,
  logbase = 2,
  stat = FALSE,
  n = NULL,
  output = c("single", "all")
)

Arguments

p

A numerical vector of the methylation levels \(p = (p_1, p_2)\) from individuals 1 and 2.

logbase

Logarithm base used to compute the JD. Logarithm base 2 is used as default. Use logbase = exp(1) for natural logarithm.

stat

logical(1). Whether to compute the statistic with asymptotic chi-squared distribution with one degree of freedom (see Details).

n

An integer vector \(n = (n_1, n_2)\) carrying the coverages used in the computation of the methylation levels \(p = (p_1, p_2)\). Default is NULL. If provided, then the coverages are used to compute the J divergence statistic as reported in reference (3), which follows a chi-squared probability distribution.

output

Optional. When 'stat = TRUE', whether to return only a single value ('single') or \(JD\) together with its asymptotic chi-squared statistic ('all').

Value

The J divergence value for the given methylation levels is returned.

Details

The methylation levels \(p_{ij}\), at a given cytosine site \(i\) from an individual \(j\), lead to the probability vectors \(p^{ij} = (p_{ij}, 1 - p_{ij})\). The J-information divergence between the probability vector \(p^{ij}\) and a reference vector \(q^i = (q_i, 1 - q_i)\) is then given by (1-2):

$$JD(p^{ij}, q^i) = \frac{1}{2}\left[(p_{ij} - q_i)\log\left(\frac{p_{ij}}{q_i}\right) + (q_i - p_{ij})\log\left(\frac{1 - p_{ij}}{1 - q_i}\right)\right]$$

Missing methylation levels, reported as NA or NaN, are replaced with zero.
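A minimal R sketch of this computation for a single cytosine site (an illustrative re-implementation, not the MethylIT code itself, and without the numerical trick mentioned in the Examples) could look as follows:

jdiv_sketch <- function(p, logbase = 2) {
    ## p = c(p_ij, q_i); NA/NaN methylation levels are replaced with zero
    p[is.na(p)] <- 0
    ((p[1] - p[2]) * log(p[1] / p[2], base = logbase) +
        (p[2] - p[1]) * log((1 - p[1]) / (1 - p[2]), base = logbase)) / 2
}

jdiv_sketch(c(0.5, 0.5))  # equal methylation levels give zero divergence

For binary distributions this expression coincides with one half of the symmetrized Kullback–Leibler divergence cited in the motivation above.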

The statistic with asymptotic chi-squared distribution is based on the statistic suggested by Kupperman (1957) (2) for \(D_{KL}\), which is discussed in references (3-4). That is,

$$\frac{2\,(n_1 + 1)(n_2 + 1)}{n_1 + n_2 + 2}\,JD(p^{ij}, q^i)$$

where \(n_1\) and \(n_2\) are the total counts (the coverages, in the case of methylation) used to compute the probabilities \(p_{ij}\) and \(q_i\). The 1 added to each count is a basic Bayesian correction that prevents zero counts.
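A rough sketch of how this statistic and an approximate p-value could be obtained, assuming the illustrative jdiv_sketch helper above and made-up coverages, is:

p  <- c(0.3, 0.7)                    # methylation levels (made up)
n1 <- 30; n2 <- 25                   # coverages (made up)
jd <- jdiv_sketch(p)
stat <- 2 * (n1 + 1) * (n2 + 1) * jd / (n1 + n2 + 2)
pval <- pchisq(stat, df = 1, lower.tail = FALSE)  # asymptotic p-value, 1 df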

References

  1. Kullback S, Leibler RA. On Information and Sufficiency. Ann Math Stat. 1951;22: 79–86. doi:10.1214/aoms/1177729694.

  2. Kupperman M. Further applications of information theory to multivariate analysis and statistical inference. Ph.D. dissertation, George Washington University; 1957.

  3. Salicrú M, Morales D, Menéndez ML, Pardo L. On the applications of divergence type measures in testing statistical hypotheses. Journal of Multivariate Analysis. 1994. pp. 372–391. doi:10.1006/jmva.1994.1068.

  4. Basu A, Mandal A, Pardo L. Hypothesis testing for two discrete populations based on the Hellinger distance. Stat Probab Lett. 2010;80: 206–214. doi:10.1016/j.spl.2009.10.008.

  5. Chung JK, Kannappan PL, Ng CT, Sahoo PK. Measures of distance between probability distributions. J Math Anal Appl. 1989;138: 280–292.

  6. Lin J. Divergence Measures Based on the Shannon Entropy. IEEE Trans Inform Theory. 1991;37: 145–151.

  7. Sanchez R, Mackenzie SA. Information thermodynamics of cytosine DNA methylation. PLoS One. 2016;11: e0150427.

See also

https://en.wikipedia.org/wiki/Kullback-Leibler_divergence for more details, and estimateDivergence for a usage example.

Author

Robersy Sanchez 11/27/2019 https://github.com/genomaths

Examples

p <- c(0.5, 0.5)
estimateJDiv(p)
#> [1] 0

## A numerical trick is implemented. The J-divergence values are the
## same for the following vectors:
p <- c(0.9999999999999999, 0)
q <- c(1, 0)

estimateJDiv(p) == estimateJDiv(q)
#> [1] TRUE
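
## The asymptotic chi-squared statistic described in Details can be
## requested through the 'stat', 'n' and 'output' arguments; the
## coverages below are hypothetical values used only for illustration:
p <- c(0.3, 0.7)
estimateJDiv(p, stat = TRUE, n = c(30L, 25L), output = "all")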