Given a the methylation levels from two individuals at a given cytosine site, this function computes the J information divergence (JD) between methylation levels. The motivation to introduce JD in Methyl-IT is founded on:

1. It is a symmetrised form of Kullback–Leibler divergence ($$D_{KL}$$). Kullback and Leibler themselves actually defined the divergence as: $$D_{KL}(P || Q) + D_{KL}(Q || P)$$, which is symmetric and nonnegative, where the probability distributions P and Q are defined on the same probability space (see reference (1) and Wikipedia).

2. In general, JD is highly correlated with Hellinger divergence, which is the main divergence currently used in Methyl-IT (see examples for function estimateDivergence.

3. By construction, the unit of measurement of JD is: bit of information, which set the basis for further information-thermodynamics analyses.

estimateJDiv(
p,
logbase = 2,
stat = FALSE,
n = NULL,
output = c("single", "all")
)

## Arguments

p

A numerical vector of the methylation levels $$p = (p1, p2)$$ from individuals 1 and 2.

logbase

Logarithm base used to compute the JD. Logarithm base 2 is used as default. Use logbase = exp(1) for natural logarithm.

stat

logical(1). Whether to compute the statistic with asymptotic Chi-squared distribution with one degree of freedom (details).

n

An integer vector $$n = (n_1, n_2)$$ carrying the coverage used in the computation of the methylation levels $$p = (p_1, p_2)$$. Default is NULL. If provided, then the coverages are used to compute J divergence statistic as reported in reference (3), which follows Chi-squared probablity distribution.

output

Optional. If 'stat = TRUE', whether to return $$JD$$ and its asymptotic chi-squared statistic.

## Value

The J divergence value for the given methylation levels is returned

## Details

The methylation levels $$p_{ij}$$, at a given cytosine site i from an individual j, lead to the probability vectors $$p^{ij} = (p_{ij}, 1 - p_{ij})$$. Then, the J-information divergence between the methylation levels $$p_{ij}$$ and $$q^i = (q_i, 1 - q_i)$$, used as reference, is given by (1-2):

$$JD(p^{ij}, q^i) = ((p_{ij} - q_i) * log(p_{ij}/q_i) + (q_i - p_{ij}) * log((1 - p_{ij})/(1 - q_i)))/2$$

Missing methylation levels, reported as NA or NaN, are replaced with zero.

The statistic with asymptotic Chi-squared distribution is based on the statistic suggested by Kupperman (1957) (2) for $$D_{KL}$$ and commented in references (3-4). That is,

$$2*(n_{1} + 1)*(n_{2} + 1)*JD(p^{ij}, q^i)/(n_1 + n_2 + 2)$$

Where $$n_{1}$$ and $$n_{2}$$ are the total counts (coverage in the case of methylation) used to compute the probabilities $$p_{ij}$$ and $$q_i$$. A basic Bayesian correction is added to prevent zero counts.

https://en.wikipedia.org/wiki/Kullback-Leibler_divergence for more details, and estimateDivergence for an example of using it.

## Author

Robersy Sanchez 11/27/2019 https://github.com/genomaths

## Examples

p <- c(0.5, 0.5)
estimateJDiv(p)
#>  0

## A numerical trick is implemented. The J-divergence values are the
## same for the following vectors:
p <- c(0.9999999999999999, 0)
q <- c(1, 0)

estimateJDiv(p) == estimateJDiv(q)
#>  FALSE