Given the methylation levels from two individuals at a given cytosine site, this function computes the J information divergence (JD) between the methylation levels. The motivation to introduce JD in Methyl-IT rests on the following:
It is a symmetrised form of the Kullback–Leibler divergence (\(D_{KL}\)). Kullback and Leibler themselves defined the divergence as \(D_{KL}(P || Q) + D_{KL}(Q || P)\), which is symmetric and nonnegative, where the probability distributions P and Q are defined on the same probability space (see reference (1) and Wikipedia).
In general, JD is highly correlated with the Hellinger divergence, which is the main divergence currently used in Methyl-IT (see the examples for function estimateDivergence).
By construction, the unit of measurement of JD is the bit of information, which sets the basis for further information-thermodynamics analyses.
estimateJDiv(
p,
logbase = 2,
stat = FALSE,
n = NULL,
output = c("single", "all")
)
A numerical vector of the methylation levels \(p = (p_1, p_2)\) from individuals 1 and 2.
Logarithm base used to compute the JD. Default is logarithm base 2. Use logbase = exp(1) for the natural logarithm.
logical(1). Whether to compute the statistic with asymptotic Chi-squared distribution with one degree of freedom (see Details).
An integer vector \(n = (n_1, n_2)\) carrying the coverages used in the computation of the methylation levels \(p = (p_1, p_2)\). Default is NULL. If provided, then the coverages are used to compute the J divergence statistic as reported in reference (3), which follows a Chi-squared probability distribution.
Optional. If 'stat = TRUE', whether to return \(JD\) only or together with its asymptotic Chi-squared statistic.
The J divergence value for the given methylation levels is returned.
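A minimal usage sketch, assuming the Methyl-IT package that exports estimateJDiv is loaded; the methylation levels and coverages below are illustrative values, not real data:

    ## JD between methylation levels 0.35 and 0.10 at one cytosine site
    estimateJDiv(p = c(0.35, 0.10))

    ## With coverages supplied, the asymptotic Chi-squared statistic
    ## can be requested as well (coverages are made up for illustration)
    estimateJDiv(p = c(0.35, 0.10), stat = TRUE, n = c(30, 25), output = "all")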
The methylation levels \(p_{ij}\), at a given cytosine site i from an individual j, lead to the probability vectors \(p^{ij} = (p_{ij}, 1 - p_{ij})\). Then, the J information divergence between \(p^{ij}\) and the reference probability vector \(q^i = (q_i, 1 - q_i)\) is given by (1-2):
$$JD(p^{ij}, q^i) = \frac{1}{2}\left[(p_{ij} - q_i)\log\left(\frac{p_{ij}}{q_i}\right) + (q_i - p_{ij})\log\left(\frac{1 - p_{ij}}{1 - q_i}\right)\right]$$
Missing methylation levels, reported as NA or NaN, are replaced with zero.
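As a worked sketch, the formula above can be evaluated directly in R with the base-2 logarithm used by default; the levels here are illustrative:

    p <- 0.35  # methylation level p_ij, individual j
    q <- 0.10  # reference methylation level q_i
    jd <- ((p - q) * log2(p / q) + (q - p) * log2((1 - p) / (1 - q))) / 2
    jd  # J divergence in bits, about 0.28
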
The statistic with asymptotic Chi-squared distribution is based on the statistic suggested by Kupperman (1957) (2) for \(D_{KL}\) and discussed in references (3-4). That is,
$$\frac{2\,(n_1 + 1)(n_2 + 1)}{n_1 + n_2 + 2}\, JD(p^{ij}, q^i)$$
where \(n_1\) and \(n_2\) are the total counts (coverage in the case of methylation) used to compute the probabilities \(p_{ij}\) and \(q_i\); the +1 terms implement a basic Bayesian correction that prevents zero counts.
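A sketch of the statistic under these assumptions, with made-up coverages; note that the sketch computes JD with natural logarithms, on the assumption that the Chi-squared asymptotics are stated in nats, so a base-2 JD would be rescaled by log(2):

    p <- 0.35; q <- 0.10  # illustrative methylation levels
    jd <- ((p - q) * log(p / q) + (q - p) * log((1 - p) / (1 - q))) / 2  # nats
    n1 <- 30; n2 <- 25    # illustrative coverages
    stat <- 2 * (n1 + 1) * (n2 + 1) * jd / (n1 + n2 + 2)
    pchisq(stat, df = 1, lower.tail = FALSE)  # asymptotic p-value, 1 d.f.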
Kullback S, Leibler RA. On Information and Sufficiency. Ann Math Stat. 1951;22: 79–86. doi:10.1214/aoms/1177729694.
Kupperman M. Further applications of information theory to multivariate analysis and statistical inference. Ph.D. Dissertation, George Washington University; 1957.
Salicrú M, Morales D, Menéndez ML, Pardo L. On the applications of divergence type measures in testing statistical hypotheses. Journal of Multivariate Analysis. 1994. pp. 372–391. doi:10.1006/jmva.1994.1068.
Basu A, Mandal A, Pardo L. Hypothesis testing for two discrete populations based on the Hellinger distance. Stat Probab Lett. Elsevier B.V.; 2010;80: 206–214. doi:10.1016/j.spl.2009.10.008.
Chung JK, Kannappan PL, Ng CT, Sahoo PK. Measures of distance between probability distributions. J Math Anal Appl. 1989;138: 280–292.
Lin J. Divergence measures based on the Shannon entropy. IEEE Trans Inform Theory. 1991;37: 145–151.
Sanchez R, Mackenzie SA. Information thermodynamics of cytosine DNA methylation. PLoS One. 2016;11: e0150427.
https://en.wikipedia.org/wiki/Kullback-Leibler_divergence
See the Wikipedia page above for more details, and estimateDivergence for an example of using it.