Robersy Sanchez

rus547@psu.edu

Mackenzie’s lab

Department of Biology and Plant Science.

Pennsylvania State University, University Park, PA 16802

rus547@psu.edu

Mackenzie’s lab

Department of Biology and Plant Science.

Pennsylvania State University, University Park, PA 16802

Abstract

Methylation analysis with Methyl-IT is illustrated on simulated datasets of methylated and unmethylated read counts with relatively high average of methylation levels: 0.15 and 0.286 for control and treatment groups, respectively. The main Methyl-IT downstream analysis is presented alongside the application of Fisher’s exact test. The importance of a signal detection step is shown.Methyl-IT R package offers a methylome analysis approach based on information thermodynamics (IT) and signal detection. Methyl-IT approach confront detection of differentially methylated cytosine as a signal detection problem. This approach was designed to discriminate methylation regulatory signal from background noise induced by molecular stochastic fluctuations. Methyl-IT R package is not limited to the IT approach but also includes Fisher’s exact test (FT), Root-mean-square statistic (RMST) or Hellinger divergence (HDT) tests. Herein, we will show that a signal detection step is also required for FT, RMST, and HDT as well. It is worthy to notice that, as for any standard statistical analysis, any serious robust methylation analysis requires for a descriptive statistical analysis at the beginning or at different downstream steps. This is an ABC principle taught in any undergraduate course on statistics. Methylation analysis is not the exception of the rule. This detail is included in this example.

Note: This example was made with the MethylIT version at https://github.com/genomaths/MethylIT.

It must NOT run with the current version available at https://git.psu.edu/genomath/MethylIT

Methylome datasets of whole-genome bisulfite sequencing (WGBS) are available at Gene Expression Omnibus (GEO DataSets). The data set are downloaded providing the GEO accession numbers for each data set to the function *getGEOSuppFiles* (for details type ?getGEOSuppFiles in the R console). Then, datasets can be read using function *readCounts2GRangesList*. An example on how to load datasets of read-counts of methylated and unmethylated cytosine into Methyl-IT is given in the Cancer example at https://git.psu.edu/genomath/MethylIT (https://git.psu.edu/genomath/MethylIT_data/blob/master/cancer_example_04-03-18.pdf)

For the current example on methylation analysis with Methyl-IT we will use simulated data. Read-count matrices of methylated and unmethylated cytosine are generated with Methyl-IT function *simulateCounts*.

Function *simulateCounts* randomly generates prior methylation levels using Beta distribution function. The expected mean of methylation levels that we would like to have can be estimated using the auxiliary function:

```
bmean <- function(alpha, beta) alpha/(alpha + beta)
alpha.ct <- 0.09
alpha.tt <- 0.2
c(control.group = bmean(alpha.ct, 0.5), treatment.group = bmean(alpha.tt, 0.5),
mean.diff = bmean(alpha.tt, 0.5) - bmean(alpha.ct, 0.5))
```

```
## control.group treatment.group mean.diff
## 0.1525424 0.2857143 0.1331719
```

This simple function uses the \(\alpha\) (*shape1*) and \(\beta\) (*shape2*) parameters from the Beta distribution function to compute the expected value of methylation levels. In the current case, we expect to have a difference of methylation levels about 0.133 between the control and the treatment.

Function *simulateCounts* from MethylIT.utils R package will be used to generate the datasets, which will include three group of samples: reference, control, and treatment.

```
library(MethylIT)
library(MethylIT.utils)
# The number of cytosine sites to generate
sites = 50000
# Set a seed for pseudo-random number generation
set.seed(124)
control.nam <- c("C1", "C2", "C3")
treatment.nam <- c("T1", "T2", "T3")
# Reference group
ref0 = simulateCounts(num.samples = 4, sites = sites, alpha = alpha.ct, beta = 0.5,
size = 50, theta = 4.5,
sample.ids = c("R1", "R2", "R3", "R4"))
# Control group
ctrl = simulateCounts(num.samples = 3, sites = sites, alpha = alpha.ct, beta = 0.5,
size = 50, theta = 4.5, sample.ids = control.nam)
# Treatment group
treat = simulateCounts(num.samples = 3, sites = sites, alpha = alpha.tt, beta = 0.5,
size = 50, theta = 4.5, sample.ids = treatment.nam)
```

Notice that reference and control groups of samples are not identical but belong to the same population.

The estimation of the divergences of methylation levels is required to proceed with the application of signal detection basic approach. The information divergence is estimated here using the function *estimateDivergence*. For each cytosine site, methylation levels are estimated according to the formulas: \(p_i={n_i}^{mC_j}/({n_i}^{mC_j}+{n_i}^{C_j})\), where \({n_i}^{mC_j}\) and \({n_i}^{C_j}\) are the number of methylated and unmethylated cytosines at site \(i\).

If a Bayesian correction of counts is selected in function *estimateDivergence*, then methylated read counts are modeled by a beta-binomial distribution in a Bayesian framework, which accounts for the biological and sampling variations (Hebestreit, Dugas, and Klein 2013; Robinson et al. 2014; Dolzhenko and Smith 2014). In our case we adopted the Bayesian approach suggested in reference (Baldi and Brunak 2001) (Chapter 3).

Two types of information divergences are estimated: *TV*, total variation (*TV*, absolute value of methylation levels) and Hellinger divergence (*H*). *TV* is computed according to the formula: \(TV_d=|p_{tt}-p_{ct}|\) and *H*:

\[H(\hat p_{ij},\hat p_{ir}) = w_i\Big[(\sqrt{\hat p_{ij}} - \sqrt{\hat p_{ir}})^2+(\sqrt{1-\hat p_{ij}} - \sqrt{1-\hat p_{ir}})^2\Big]\] where \(w_i = 2 \frac{m_{ij} m_{ir}}{m_{ij} + m_{ir}}\), \(m_{ij} = {n_i}^{mC_j}+{n_i}^{uC_j}+1\), \(m_{ir} = {n_i}^{mC_r}+{n_i}^{uC_r}+1\) and \(j \in {\{c,t}\}\)

The equation for Hellinger divergence is given in reference (Basu, Mandal, and Pardo 2010), but any other information theoretical divergences could be used as well.

Divergences are estimated for control and treatment groups in respect to a virtual sample, which is created applying function *poolFromGRlist* on the reference group.

```
# Reference sample
ref = poolFromGRlist(ref0, stat = "mean", num.cores = 4L, verbose = FALSE)
# Methylation level divergences
DIVs <- estimateDivergence(ref = ref, indiv = c(ctrl, treat), Bayesian = TRUE,
num.cores = 6L, percentile = 1, verbose = FALSE)
```

The mean of methylation levels differences is:

```
## C1 C2 C3 T1 T2
## -0.0009820776 -0.0014922009 -0.0022257725 0.1358867135 0.1359160219
## T3
## 0.1309217360
```

Likewise for any other signal in nature, the analysis of methylation signal requires for the knowledge of its probability distribution. In the current case, the signal is represented in terms of the Hellinger divergence of methylation levels (*H*).

```
divs = DIVs[order(names(DIVs))]
# To remove hd == 0 to estimate. The methylation signal only is given for
divs = lapply(divs, function(div) div[ abs(div$hdiv) > 0 ], keep.attr = TRUE)
names(divs) <- names(DIVs)
# Data frame with the Hellinger divergences from both groups of samples samples
l = c(); for (k in 1:length(divs)) l = c(l, length(divs[[k]]))
data_div <- data.frame(H = c(abs(divs$C1$hdiv), abs(divs$C2$hdiv), abs(divs$C3$hdiv),
abs(divs$T1$hdiv), abs(divs$T2$hdiv), abs(divs$T3$hdiv)),
sample = c(rep("C1", l[1]), rep("C2", l[2]), rep("C3", l[3]),
rep("T1", l[4]), rep("T2", l[5]), rep("T3", l[6]))
)
```

Empirical critical values for the probability distribution of \(H\) and \(TV\) can be obtained using *quantile* function from the R package *stats*.

```
critical.val <- do.call(rbind, lapply(divs, function(x) {
hd.95 = quantile(x$hdiv, 0.95)
tv.95 = quantile(abs(x$TV), 0.95)
return(c(tv = tv.95, hd = hd.95))
}))
critical.val
```

```
## tv.95% hd.95%
## C1 0.7907276 81.47256
## C2 0.7888943 80.95873
## C3 0.7972732 81.27145
## T1 0.9263158 113.73798
## T2 0.9240569 114.45228
## T3 0.9213483 111.54258
```

The kernel density estimation yields the empirical density shown in the graphics:

```
suppressMessages(library(ggplot2))
# Some information for graphic
crit.val.ct <- max(critical.val[c("C1", "C2", "C3"), 2]) # 81.5
crit.val.tt <- min(critical.val[c("T1", "T2", "T3"), 2]) # 111.5426
# Density plot with ggplot
ggplot(data_div, aes(x = H, colour = sample, fill = sample)) +
geom_density(alpha = 0.05, bw = 0.2, position = "identity", na.rm = TRUE,
size = 0.4) + xlim(c(0, 125)) +
xlab(expression(bolditalic("Hellinger divergence (H)"))) +
ylab(expression(bolditalic("Density"))) +
ggtitle("Density distribution for control and treatment") +
geom_vline(xintercept = crit.val.ct, color = "red", linetype = "dashed", size = 0.4) +
annotate(geom = "text", x = crit.val.ct-2, y = 0.3, size = 5,
label = 'bolditalic(H[alpha == 0.05]^CT==81.5)',
family = "serif", color = "red", parse = TRUE) +
geom_vline(xintercept = crit.val.tt, color = "blue", linetype = "dashed", size = 0.4) +
annotate(geom = "text", x = crit.val.tt -2, y = 0.2, size = 5,
label = 'bolditalic(H[alpha == 0.05]^TT==114.5)',
family = "serif", color = "blue", parse = TRUE) +
theme(
axis.text.x = element_text( face = "bold", size = 12, color="black",
margin = margin(1,0,1,0, unit = "pt" )),
axis.text.y = element_text( face = "bold", size = 12, color="black",
margin = margin( 0,0.1,0,0, unit = "mm")),
axis.title.x = element_text(face = "bold", size = 13,
color="black", vjust = 0 ),
axis.title.y = element_text(face = "bold", size = 13,
color="black", vjust = 0 ),
legend.title = element_blank(),
legend.margin = margin(c(0.3, 0.3, 0.3, 0.3), unit = 'mm'),
legend.box.spacing = unit(0.5, "lines"),
legend.text = element_text(face = "bold", size = 12, family = "serif")
)
```

The graphic above shows that with high probability the methylation signal induced by the treatment has *H* values \(H^{TT}_{\alpha=0.05}\geq114.5\). According to the critical value estimated for the differences of methylation levels, the methylation signal holds \(TV^{TT}_{\alpha=0.05}\geq0.926\).

Notice that most of the methylation changes are not signal but noise (found to the left of the critical values). This situation is typical for all the natural and technologically generated signals.

Assuming that the background methylation variation is consistent with a Poisson process and that methylation changes conform to the second law of thermodynamics, the Hellinger divergence of methylation levels follows a Weibull distribution probability or some member of the generalized gamma distribution family (Sanchez and Mackenzie 2016).

As suggested from the empirical density graphics (above), the critical values \(H_{\alpha=0.05}\) and \(TV_{\alpha=0.05}\) can be used as cutpoints to select potential DMPs. After setting \(dist.name = "ECDF"\) and \(tv.cut = 0.926\) in Methyl-IT function *getPotentialDIMP*, potential DMPs are estimated using the empirical cummulative distribution function (ECDF) and the critical value \(TV_{\alpha=0.05}=0.926\).

In Methyl-IT, Fisher’s exact test (FT) is implemented in function *FisherTest*. In the current case, a pairwise group application of FT to each cytosine site is performed. The differences between the group means of read counts of methylated and unmethylated cytosines at each site are used for testing (\(pooling.stat ="mean"\)). Notice that only cytosine sites with critical values \(TV_d>0.926\) are tested (*tv.cut = 0.926*).

```
ft = FisherTest(LR = divs, tv.cut = 0.926,
pAdjustMethod = "BH", pooling.stat = "mean",
pvalCutOff = 0.05, num.cores = 4L,
verbose = FALSE, saveAll = FALSE)
ft.tv <- getPotentialDIMP(LR = ft, div.col = 9L, dist.name = "None",
tv.cut = 0.926, tv.col = 7, alpha = 0.05)
```

The above setting would impose additional constrains on the output of DMPs resulting from Fisher’s exact test depending on how strong is the methylation background noise in the control population. Basically, methylation variations that can spontaneously occur in the control population with relatively high frequencies are disregarded. The decisions are based on the empirical cumulative distribution function (ECDF).

Potential DMPs can be estimated using the critical values derived from the fitted Weibull 2-parameters models, which are obtained after the non-linear fit of the theoretical model on the genome-wide \(H\) values for each individual sample using Methyl-IT function *nonlinearFitDist* (Sanchez and Mackenzie 2016). As before, only cytosine sites with critical values \(TV_d>0.926\) are considered DMPs. Notice that, it is always possible to use any other values of \(H\) and \(TV_d\) as critical values, but whatever could be the value it will affect the final accuracy of the classification performance of DMPs into two groups, DMPs from control and DNPs from treatment (see below). So, it is important to do an good choices of the critical values.

```
nlms.wb <- nonlinearFitDist(divs, column = 9L, verbose = FALSE, num.cores = 6L)
# Potential DMPs from 'Weibull2P' model
DMPs.wb <- getPotentialDIMP(LR = divs, nlms = nlms.wb, div.col = 9L,
tv.cut = 0.926, tv.col = 7, alpha = 0.05,
dist.name = "Weibull2P")
nlms.wb$T1
```

```
## Estimate Std. Error t value Pr(>|t|)) Adj.R.Square
## shape 0.5413711 0.0003964435 1365.570 0 0.991666592250838
## scale 19.4097502 0.0155797315 1245.833 0
## rho R.Cross.val DEV
## shape 0.991666258901194 0.996595598767685 34.7217494754823
## scale
## AIC BIC COV.shape COV.scale
## shape -221720.747067975 -221694.287733122 1.571674e-07 -1.165129e-06
## scale -1.165129e-06 2.427280e-04
## COV.mu n
## shape NA 50000
## scale NA 50000
```

As in the case of Weibull 2-parameters model, potential DMPs can be estimated using the critical values derived from the fitted Gamma 2-parameters models and only cytosine sites with critical values \(TV_d>0.926\) are considered DMPs.

```
nlms.g2p <- nonlinearFitDist(divs, column = 9L, verbose = FALSE, num.cores = 6L,
dist.name = "Gamma2P")
# Potential DMPs from 'Gamma2P' model
DMPs.g2p <- getPotentialDIMP(LR = divs, nlms = nlms.g2p, div.col = 9L,
tv.cut = 0.926, tv.col = 7, alpha = 0.05,
dist.name = "Gamma2P")
nlms.g2p$T1
```

```
## Estimate Std. Error t value Pr(>|t|)) Adj.R.Square
## shape 0.3866249 0.0001480347 2611.717 0 0.999998194156282
## scale 76.1580083 0.0642929555 1184.547 0
## rho R.Cross.val DEV
## shape 0.999998194084045 0.998331935080129 0.00752417919133131
## scale
## AIC BIC COV.alpha COV.scale
## shape -265404.29138371 -265369.012270572 2.191429e-08 -8.581717e-06
## scale -8.581717e-06 4.133584e-03
## COV.mu df
## shape NA 49998
## scale NA 49998
```

There is not a one-to-one mapping between \(TV_d\) and \(H\). However, at each cytosine site \(i\), these information divergences hold the inequality: \[TV_d(p^{tt}_i,p^{ct}_i)\leq \sqrt{2}H_d(p^{tt}_i,p^{ct}_i)\]

where \(H_d(p^{tt}_i,p^{ct}_i)=\sqrt{\frac{H(p^{tt}_i,p^{ct}_i)}w}\) is the Hellinger distance and \(H(p^{tt}_i,p^{ct}_i)\) is the Helliger divergence equation given above. The critical values for the model Gamma2P can be retrived using:

```
unlist(lapply(nlms.g2p, function(model) {
shape <- model$Estimate[1]
scale <- model$Estimate[2]
return(qgamma(0.95, shape = shape, scale = scale))
}))
```

```
## C1 C2 C3 T1 T2 T3
## 97.40329 97.19037 96.98129 123.78138 123.55315 120.30413
```

So, potential DMPs detected with FT can be constrained with the critical value \(H^{TT}_{\alpha=0.05}\geq120.3\) (Steerneman et al. 1983):

```
# Potential DMPs from Fisher's exact test
ft.hd <- getPotentialDIMP(LR = ft, div.col = 9L, hdiv.col = 9L, hdiv.cut = 120.3,
tv.cut = 0.926, tv.col = 7, alpha = 0.05, dist.name = "None")
```

Summary table:

```
data.frame(ft = unlist(lapply(ft.tv, length)), ft.hd = unlist(lapply(ft.hd, length)),
ecdf = unlist(lapply(DMP.ecdf, length)), Weibull = unlist(lapply(DMPs.wb, length)),
Gamma = unlist(lapply(DMPs.g2p, length)))
```

```
## ft ft.hd ecdf Weibull Gamma
## C1 1253 726 63 756 935
## C2 1221 724 62 755 925
## C3 1280 733 64 768 947
## T1 2504 1429 126 924 1346
## T2 2464 1439 124 942 1379
## T3 2408 1354 121 979 1354
```

Notice that there are significant differences between the numbers of DMPs detected by Fisher’s exact test (FT) and other approaches. There at least about 2408 - 1354 = 1054 DMPs in the treatment samples that are not induced by the treatment, which can naturally occurs in both groups, control and treatment. In other words, the theoretical probabilistic models of Weibull and Gamma distributions are used to describe those methylation events resultant of the normal (non-stressful) biological processes. Any information divergence of methylation level below the critical value \(TV_{\alpha=0.05}\) or \(H_{\alpha=0.05}\) is a methylation event that can occur in normal conditions. *It does not matter how much significant* *a test like FT could be, since it is not about the signification of the test, but about* *how big is the probability to observe that methylation event in the control population*.

Critical values \(H_{\alpha=0.05}\) can be estimated from each fitted model using *predict* function:

```
crit.wb = predict(nlms.wb, pred = "quant", q = 0.95, dist.name = "Weibull2P")
crit.g2p = predict(nlms.g2p, pred = "quant", q = 0.95, dist.name = "Gamma2P")
data.frame(wb = do.call(rbind, crit.wb), g2p = do.call(rbind, crit.g2p))
```

```
## wb g2p
## C1 117.1289 97.40329
## C2 117.4248 97.19037
## C3 116.8375 96.98129
## T1 147.2984 123.78138
## T2 147.3413 123.55315
## T3 142.6252 120.30413
```

The graphics for the empirical (in black) and Gamma (in blue) densities distributions of Hellinger divergence of methylation levels for sample T1 are shown below. The 2-parameter gamma model is build by using the parameters estimated in the non-linear fit of \(H\) values from sample T1. The critical values estimated from the 2-parameter gamma distribution \(H^{\Gamma}_{\alpha=0.05}=123.78\) is more ‘conservative’ than the critical value based on the empirical distribution \(H^{Emp}_{\alpha=0.05}=114.5\). That is, in accordance with the empirical distribution, for a methylation change to be considered a signal its \(H\) value must be \(H\geq114.5\), while according with the 2-parameter gamma model any cytosine carrying a signal must hold \(H\geq124\).

```
suppressMessages(library(ggplot2))
# Some information for graphic
dt <- data_div[data_div$sample == "T1", ]
coef <- nlms.g2p$T1$Estimate # Coefficients from the non-linear fit
dgamma2p <- function(x) dgamma(x, shape = coef[1], scale = coef[2])
qgamma2p <- function(x) qgamma(x, shape = coef[1], scale = coef[2])
# 95% quantiles
q95 <- qgamma2p(0.95) # Gamma model based quantile
emp.q95 = quantile(divs$T1$hdiv, 0.95) # Empirical quantile
# Density plot with ggplot
ggplot(dt, aes(x = H)) +
geom_density(alpha = 0.05, bw = 0.2, position = "identity", na.rm = TRUE,
size = 0.4) + xlim(c(0, 150)) +
stat_function(fun = dgamma2p, colour = "blue") +
xlab(expression(bolditalic("Hellinger divergence (H)"))) +
ylab(expression(bolditalic("Density"))) +
ggtitle("Empirical and Gamma densities distributions of Hellinger divergence (T1)") +
geom_vline(xintercept = emp.q95, color = "black", linetype = "dashed", size = 0.4) +
annotate(geom = "text", x = emp.q95 - 20, y = 0.16, size = 5,
label = 'bolditalic(H[alpha == 0.05]^Emp==114.5)',
family = "serif", color = "black", parse = TRUE) +
geom_vline(xintercept = q95, color = "blue", linetype = "dashed", size = 0.4) +
annotate(geom = "text", x = q95 + 10, y = 0.14, size = 5,
label = 'bolditalic(H[alpha == 0.05]^Gamma==124)',
family = "serif", color = "blue", parse = TRUE) +
theme(
axis.text.x = element_text( face = "bold", size = 12, color="black",
margin = margin(1,0,1,0, unit = "pt" )),
axis.text.y = element_text( face = "bold", size = 12, color="black",
margin = margin( 0,0.1,0,0, unit = "mm")),
axis.title.x = element_text(face = "bold", size = 13,
color="black", vjust = 0 ),
axis.title.y = element_text(face = "bold", size = 13,
color="black", vjust = 0 ),
legend.title = element_blank(),
legend.margin = margin(c(0.3, 0.3, 0.3, 0.3), unit = 'mm'),
legend.box.spacing = unit(0.5, "lines"),
legend.text = element_text(face = "bold", size = 12, family = "serif")
)
```