Example of Methylome Analysis with MethylIT using Cancer Datasets

A Methyl-IT user guide. A guide for users

Robersy Sanchez$^1$
rus547@psu.edu

Sally Mackenzie$^{1,2}$
sam795@psu.edu

$^1$Department of Biology.
Pennsylvania State University, University Park, PA 16802

$^2$Department of Plant Science.
Pennsylvania State University, University Park, PA 16802

16 January 2024

Source: vignettes/cancer_example.Rmd

cancer_example.Rmd

Abstract

An example of methylation analysis with Methyl-IT, a novel methylome analysis procedure based on a signal-detection and machine- learning approach, is provided here. Methylation analysis involves a signal detection problem, and the method was designed to discriminate methylation regulatory signal from background noise induced by thermal fluctuations. Methyl-IT enhances the resolution of genome methylation behavior to reveal network-associated responses, offering resolution of gene pathway influences not attainable with previous methods. Herein, an example of MethylIT application to the analysis of breast cancer methylomes is presented.

Methyl-IT

MethylIT is an R package for methylome analysis based on a signal-detection machine learning approach (SD-ML). This approach is postulated to provide greater sensitivity for resolving true signal from the methylation background within the methylome (1, 2). Because the biological signal created within the dynamic methylome environment characteristic of plants is not free from background noise, the approach, designated Methyl-IT, includes the application of signal detection theory (3–6).

A basic requirement for the application of signal detection is the knowledge of the probability distribution of the background noise, which is used as null hypothesis in the detection of the methylation signal. This probability distribution is, in general, a member of generalized gamma distribution family, and it can be deduced on a statistical mechanical/thermodynamics basis from DNA methylation induced by thermal fluctuations (1). Herein, we provide an example of Methyl-IT application to the analysis of breast cancer methylomes. Due to the size of human methylome the current example only covers the analysis of chromosome 13. A full description of Methyl-IT application of methylome analysis in plants is given in the manuscript (2).

The R packages required in this example are:

suppressPackageStartupMessages({
        library(MethylIT)
        library(BiocParallel)
        library(ggplot2) # graphic
        library(reshape2) # To reshape the data frame
        library(grid) # For multiple plots
        library(gridExtra) # For multiple plots
        library(VennDiagram)
        library(rtracklayer) # To import gene annotation
    }
)

Available datasets and reading

Methylome datasets of whole-genome bisulfite sequencing (WGBS) are available at Gene Expression Omnibus (GEO DataSets). For the current example, datasets from breast tissues (normal and cancer) and embryonic stem cells will be downloaded from GEO. The data set are downloaded providing the GEO accession numbers for each data set to the function ‘getGEOSuppFiles’ (see the function getGEOSuppFiles help). Some times the ftp site is not available and the information must be manually downloaded from GEO database.

# Embryonic stem cells datasets
esc.files = getGEOSuppFiles(GEO = c("GSM2041690", "GSM2041691", "GSM2041692"),
                            verbose = FALSE)
# Breast tissues (normal, cancer, metastasis)
cancer.files = getGEOSuppFiles(GEO = c("GSM1279517","GSM1279514",
                                        "GSM1279513"), verbose = FALSE)

The file path and name of each downloaded dataset is found in the output variables ‘esc.files’ and ‘cancer.files’.

It important to keep in mind, however, that the file access at the Gene Expression Omnibus - NCBI can be changed at the convenience of GEO database and, in this case, getGEOSuppFile can fail.

Reading datasets

Datasets for our example can be read with function readCounts2GRangesList. To specify the reading of only chromosome 13, we can specify the parameter ‘chromosomes = “Chr13”’. The symbol chromosome 13, in this case “Chr13”, must be consistent with the annotation provided in the given GEO dataset. Each file is wholly read with the setting ‘chromosomes = “Chr13”’ and then the GRanges are built only with chromosome 13, which could be time consuming. However, users working on Linux OS can specify the reading of specific lines from each file by using regular expressions. For example, if only chromosomes 1 and 3 are required, then we can set chromosomes = NULL (default) and ‘chromosome.pattern = “^Chr[1,3]”’. This will read all the lines in the downloaded files starting with the words “Chr1” or “Chr3”. If we are interested in chromosomes 1 and 2, then we can set ‘chromosome.pattern = “^Chr[1-2]”’. If all the chromosomes are required, then set chromosomes = NULL and chromosome.pattern = NULL (default). Obviously, before read the files, user must check them to see which annotation was used by the experimenters to denote the chromosomes, e.g., “chr13” or just “13”, etc.

# Embryonic stem cells datasets
ref = readCounts2GRangesList(filenames = esc.files,
                            sample.id = c("ESC1","ESC2", "ESC3"),
                            columns = c(seqnames = 1, start = 2,
                                        mC = 4, uC = 5), pattern = "^chr13",
                            remove = TRUE, verbose = FALSE)
# Breast tissues (normal, cancer, metastasis)
LR = readCounts2GRangesList(filenames = cancer.files,
                            sample.id = c("Breast_normal", "Breast_cancer",
                                            "Breast_metastasis"),
                            columns = c(seqnames = 1, start = 2,
                                        mC = 3, uC = 4),
                            remove = TRUE, pattern = "^13",
                            chromosome.names = "chr13", verbose = FALSE)

In real data analysis, a bigger sample size must be used. For the purpose of this example, to illustrate the application of methylation analysis with Methyl-IT, we will just add an artificial normal sample.

LR$Breast_normal1 <- poolFromGRlist(c(ref[1], LR$Breast_normal,
                                        LR$Breast_normal, LR$Breast_normal,
                                        LR$Breast_normal, LR$Breast_normal),
                                    stat = "mean", num.cores = 6L,
                                    verbose = FALSE)

In the metacolumn of the last GRanges object, mC and uC stand for the methylated and unmethylated read counts, respectively. Notice that option ‘remove = TRUE’ remove the decompressed files (default: FALSE, see ? readCounts2GRangesList for more details about this function).

The reference individual

Any two objects can be compared based on some measured variable if, and only if, the measurements where taken in the same metric space, the same coordinate system, in respect to the same origin of coordinates. Usually, in our daily 3D experience, our brain automatically sets up the origin of coordinates equal to zero. The differences found in the comparison depend on the reference used to perform the measurements and from the metric system. The space where the objects are located (or the set of objects) together with the metric is called metric space.

Unfortunately, this is not the case for the methylation process. Each individual from the same population and even each chromosome follows an independent stochastic methylation process. This is not a surprise, since cytosine methylation affect the mechanical properties of DNA molecule with strong dependence of the DNA sequence context (until, at least. 6 bases at both sides of each cytosine site) (7, 8).

To evaluate the methylation differences between individuals from control and treatment we introduce a metric in the bidimensional space of methylation levels $P_i = (p_i, 1-p_i)$. Vectors $P_i$ provide a measurement of the uncertainty of methylation levels (2). However, to perform the comparison between the uncertainty of methylation levels from each group of individuals, control $(c)$ and treatment $(t)$, we should estimate the uncertainty variation with respect to the same individual reference on the mentioned metric space. The reason to measure the uncertainty variation with respect to the same reference resides in that even sibling individuals follow an independent ontogenetic development and, consequently, their developments follow independent stochastic processes. This a consequence of the “omnipresent” action of the second law of thermodynamics in living organisms. In the current example, we will create the reference individual by pooling the methylation counts from the embryonic stem cells.

It should be noticed that the results are sensitive to the reference used. The statistics mean, median, or sum of the read counts at each cytosine site of some control samples can be used to create a virtual reference sample. A virtual reference individual can be built with function poolFromGRlist. It is up to the user whether to apply the ‘row sum’, ‘row mean’ or ‘row median’ of methylated and unmethylated read counts at each cytosine site across individuals. Notice that when “mean” is selected the virtual reference individual is the group centroid, which play a fundamental role in multivariate statistic.

Ref = poolFromGRlist(ref, stat = "mean", num.cores = 6L, verbose = FALSE)

Ref

## GRanges object with 1560637 ranges and 2 metadata columns:
##             seqnames    ranges strand |        mC        uC
##                <Rle> <IRanges>  <Rle> | <numeric> <numeric>
##         [1]    chr13  19020631      * |         1         1
##         [2]    chr13  19020633      * |         2         2
##         [3]    chr13  19020642      * |         1         1
##         [4]    chr13  19020643      * |         2         2
##         [5]    chr13  19020679      * |         1         1
##         ...      ...       ...    ... .       ...       ...
##   [1560633]    chr13 115108993      * |         1         3
##   [1560634]    chr13 115109022      * |         1         1
##   [1560635]    chr13 115109023      * |         3         4
##   [1560636]    chr13 115109523      * |         2         2
##   [1560637]    chr13 115109524      * |         1         1
##   -------
##   seqinfo: 1 sequence from an unspecified genome; no seqlengths

Only direct lab experiments can reveal whether differences detected with distinct references outside the experimental conditions for control and treatment groups are real. The best reference would be estimated using a subset of individuals from control group. Such a reference will contribute to remove the intragroup variation, in control and in treatment groups, induced by environmental changes external to or not controlled by the experimental conditions.

Methylation analysis for each cytosine position is frequently performed in the bidimensional space of $(methylated, unmethylated)$ read counts. Frequently, Fisher’s exact test is applied to a single cytosine position, under the null hypothesis that the proportions $p_{ct}=methylated_{ct}/(methylated_{ct}+unmethylated_{ct})$ and $p_{tt}=methylated_{tt}/(methylated_{tt}+unmethylated_{tt})$ are the same for control and treatment, respectively. In this case, the implicit reference point for the counts at every cytosine positions is $(methylated = 0,unmethylated = 0)$, which corresponds to the point $P_i = (0, 1)$.

In our case, the Hellinger divergence (one of the metric used, here) of each individual in respect to the reference is the variable to test in place of $(methylated, unmethylated)$ read counts or the methylation levels $P_i = (p_i, 1-p_i)$.

The use of references is restricted by the thermodynamics basis of the theory. The current information-thermodynamics based approach is supported on the following postulate: “High changes of Hellinger divergences are less frequent than low changes, provided that the divergence is proportional to the amount of energy required to process the change of information in the methylation system”.

The last postulate acknowledges the action of the second law of thermodynamics on the biomolecular methylation system. For the methylation system, it implies that the frequencies of the information divergences between methylation levels must be proportional to a Boltzmann factor (see supplementary information from reference (1)). In other words, the frequencies of information divergences values should follow a trend proportional to an exponential decay.

If we do not observe such a behavior, then either the reference is too far from experimental condition or we are dealing with an extreme situation where the methylation machinery in the cell is dysfunctional. The last situation is found, for example, in the silencing mutation at the gene of cytosine-DNA-methyltransferase in Arabidopsis thaliana. Methylation of 5-methylcytosine at CpG dinucleotides is maintained by MET1 in plants.

Hellinger divergence estimation

To perform the comparison between the uncertainty of methylation levels from each group of individuals, control $(c)$ and treatment $(t) $, the divergence between the methylation levels of each individual is estimated with respect to the same reference on the metric space formed by the vector set $P_i = (p_i, 1-p_i)$ and the Hellinger divergence $H$. Basically, the information divergence between the methylation levels of an individual $j$ and reference sample $r$ is estimated according to the Hellinger divergence given by the formula:

\[H(\hat p_{ij},\hat p_{ir}) = w_i[(\sqrt{\hat p_{ij}} - \sqrt{\hat p_{ir}})^2+(\sqrt{1-\hat p_{ij}} - \sqrt{1-\hat p_{ir}})^2]\] where $w_i = 2

$, $m_{ij} = {n_i}^{mC_j}+{n_i}^{uC_j}+1$, $m_{ir} = {n_i}^{mC_r}+ {n_i}^{uC_r}+1$ and $j \in {\{c,t}\}$.

This equation for Hellinger divergence is given in reference (9), but other information theoretical divergences can be used as well. Next, the information divergence for control (Breast_normal) and treatment (Breast_cancer and Breast_metastasis) samples are estimated with respect to the reference virtual individual. A Bayesian correction of counts can be selected or not. In a Bayesian framework, methylated read counts are modeled by a beta-binomial distribution, which accounts for both the biological and sampling variations (10–12).

In our case we adopted the Bayesian approach suggested in reference (13) (Chapter 3). In a Bayesian framework with uniform priors, the methylation level can be defined as: $p = (mC + 1)/ (mC + uC + 2)$. However, the most natural statistical model for replicated BS-seq DNA methylation measurements is beta-binomial (the beta distribution is a prior conjugate of binomial distribution). We consider the parameter $p$ (methylation level) in the binomial distribution as randomly drawn from a beta distribution. The hyper-parameters $\alpha$ and $\beta$ from the beta-binomial distribution are interpreted as pseudo-counts. The information divergence is estimated here using the function estimateDivergence:

date()

## [1] "Tue Jan 16 15:53:06 2024"

ptm <- proc.time()

HD = estimateDivergence(ref = Ref, 
                        indiv = LR, 
                        Bayesian = TRUE,
                        min.coverage = 5,
                        and.min.cov = FALSE,
                        high.coverage = 300,
                        percentile = 0.999, 
                        verbose = FALSE)

cat((proc.time() - ptm)[3]/60, "minutes.", date()) # in min

## 0.7843 minutes. Tue Jan 16 15:53:53 2024

HD$Breast_cancer

## GRanges object with 584538 ranges and 9 metadata columns:
##            seqnames    ranges strand |        c1        t1        c2        t2
##               <Rle> <IRanges>  <Rle> | <numeric> <numeric> <numeric> <numeric>
##        [1]    chr13  19020631      * |         1         1        14        24
##        [2]    chr13  19020633      * |         2         2        14        25
##        [3]    chr13  19020643      * |         2         2         7        38
##        [4]    chr13  19020782      * |         1         1         6        56
##        [5]    chr13  19020786      * |         1         1        10        53
##        ...      ...       ...    ... .       ...       ...       ...       ...
##   [584534]    chr13 115108776      * |         1         1        52        20
##   [584535]    chr13 115108789      * |         2         2        27        43
##   [584536]    chr13 115108993      * |         1         3        72         5
##   [584537]    chr13 115109023      * |         3         4        56        36
##   [584538]    chr13 115109524      * |         1         1        31         9
##                   p1        p2        TV     bay.TV      hdiv
##            <numeric> <numeric> <numeric>  <numeric> <numeric>
##        [1]  0.451770  0.372577 -0.131579 -0.0791929 0.0361197
##        [2]  0.452016  0.364028 -0.141026 -0.0879881 0.0713908
##        [3]  0.452016  0.178801 -0.344444 -0.2732151 0.8097173
##        [4]  0.451770  0.118131 -0.403226 -0.3336386 0.8438179
##        [5]  0.451770  0.175569 -0.341270 -0.2762003 0.5281366
##        ...       ...       ...       ...        ...       ...
##   [584534]  0.451770  0.703359  0.222222   0.251589 0.3807153
##   [584535]  0.452016  0.387024 -0.114286  -0.064992 0.0405589
##   [584536]  0.449459  0.905400  0.685065   0.455941 2.5160771
##   [584537]  0.451113  0.599126  0.180124   0.148013 0.3253924
##   [584538]  0.451770  0.737171  0.275000   0.285402 0.4843780
##   -------
##   seqinfo: 1 sequence from an unspecified genome; no seqlengths

Function ‘estimateDivergence’ returns a list of GRanges objects with the four columns of counts, the information divergence, and additional columns:

1. The original matrix of methylated ($c_i$) and unmethylated
    ($t_i$) read counts from control ($i=1$) and treatment ($i=2$) samples.
2. "p1" and "p2": methylation levels for control and treatment,
    respectively.
3. "bay.TV": total variation TV = p2 - p1.
4. "TV": total variation based on simple counts: $TV=c1/(c1+t1)-
    c2/(c2+t2)$.
5. "hdiv": Hellinger divergence.

If Bayesian = TRUE, results are based on the posterior estimations of methylation levels $p1$ and $p2$. Filtering by coverage is provided at this step which would be used unless previous filtering by coverage had been applied. This is a pairwise filtering. Cytosine sites with ‘coverage’ > ‘min.coverage’ and ‘coverage’ < ‘percentile’ (e.g., 99.9 coverage percentile) in at least one of the samples are preserved. The coverage percentile used is the maximum estimated from both samples: reference and individual.

For some GEO datasets only the methylation levels for each cytosine site are provided. In this case, Hellinger divergence can be estimated as given in reference (1):

\[H(\hat p_{ij},\hat p_{ir}) = 2 \Big[(\sqrt{\hat p_{ij}} - \sqrt{\hat p_{ir}})^2+(\sqrt{1-\hat p_{ij}} - \sqrt{1-\hat p_{ir}})^2\Big]\]

Descriptive statistical analysis

A descriptive analysis on the distribution of Hellinger divergences of methylation levels is recommended. Like in any other statistical analysis, descriptive statistical analysis help us to detect potential issues in the raw data. It is the user responsibility to perform quality check of his/her dataset before start to apply Methyl-IT. Nevertheless, the quality checking of the raw data is not perfect. So, it is healthy to sees for potential issues.

Histogram and boxplots of divergences estimated in each sample

First, the data of interest (Hellinger divergences, “hdiv”) are selected from the GRanges objects:

normal = HD$Breast_normal[, "hdiv"]
normal1 = HD$Breast_normal1[, "hdiv"]
cancer = HD$Breast_cancer[, "hdiv"]
metastasis = HD$Breast_metastasis[, "hdiv"]

Next, a single GRanges object is built from the above set of GRanges objects using the function uniqueGRanges. Notice that the number of cores to use for parallel computation can be specified.

hd = uniqueGRanges(list(normal, normal1, cancer, metastasis),
                    missing = NA, verbose = FALSE)
colnames(mcols(hd)) <- c("normal", "normal1", "cancer", "metastasis")
hd

## GRanges object with 793551 ranges and 4 metadata columns:
##            seqnames    ranges strand |     normal    normal1    cancer
##               <Rle> <IRanges>  <Rle> |  <numeric>  <numeric> <numeric>
##        [1]    chr13  19020631      * | 0.16998196 0.15993886 0.0361197
##        [2]    chr13  19020633      * | 0.55408069 0.43588832 0.0713908
##        [3]    chr13  19020643      * | 0.35265356 0.34033618 0.8097173
##        [4]    chr13  19020680      * | 0.00658000 0.00809711        NA
##        [5]    chr13  19020687      * | 0.00676718 0.00155600        NA
##        ...      ...       ...    ... .        ...        ...       ...
##   [793547]    chr13 115108776      * |   0.472447   0.452470 0.3807153
##   [793548]    chr13 115108789      * |   0.412965   0.374928 0.0405589
##   [793549]    chr13 115108993      * |   2.033301   1.778906 2.5160771
##   [793550]    chr13 115109023      * |   2.490385   2.362445 0.3253924
##   [793551]    chr13 115109524      * |   0.756880   0.713464 0.4843780
##            metastasis
##             <numeric>
##        [1]  1.1844907
##        [2]  1.8145776
##        [3]  0.0171404
##        [4]         NA
##        [5]         NA
##        ...        ...
##   [793547]         NA
##   [793548]         NA
##   [793549]   2.406915
##   [793550]   0.523286
##   [793551]         NA
##   -------
##   seqinfo: 1 sequence from an unspecified genome; no seqlengths

Now, the Hellinger divergences estimated for each sample are in a single matrix on the metacolumn of the GRanges object and we can proceed to build the histogram and boxplot graphics for these data. Finally, the data included in the graphic are:

# Define an auxiliary function
fun_length <- function(x){
    return(data.frame(y = median(x, na.rm = TRUE) + 1, 
                      label = paste0("n = ", length(x))))
}
data <- data.frame(normal = hd$normal, normal1 = hd$normal1,
                    cancer = hd$cancer, metastasis = hd$metastasis)
data = suppressMessages(melt(data))
colnames(data) <- c("Breast.tissue", "HD")
data = data[data$HD > 0, ]
DataFrame(data)

## DataFrame with 3174204 rows and 2 columns
##           Breast.tissue         HD
##                <factor>  <numeric>
## 1                normal 0.16998196
## 2                normal 0.55408069
## 3                normal 0.35265356
## 4                normal 0.00658000
## 5                normal 0.00676718
## ...                 ...        ...
## NA.617427    NA                 NA
## NA.617428    NA                 NA
## 3174202      metastasis   2.406915
## 3174203      metastasis   0.523286
## NA.617429    NA                 NA

p1 = ggplot(data, aes(x = HD, fill = Breast.tissue, colour = Breast.tissue)) +
        geom_histogram(alpha = 0.5, binwidth = 0.5, position = "identity",
                na.rm = TRUE, size = 0.7) + xlim(-0.4,20) +
        theme(axis.title.x = element_text(face = "bold", size = 20),
                axis.text.x = element_text(face = "bold", size = 20, 
                                        color =  "black", hjust = 0.5, 
                                        vjust = 0.75),
        axis.text.y = element_text(face = "bold", size = 20, color = "black"),
        axis.title.y = element_text(face = "bold", size = 20,color = "black"),
        legend.text = element_text(size = 20, face = "bold"),
        legend.title = element_text(size = 20, face = "bold")) +
    ylab( "Counts" ) +
    ggtitle("Histogram of Hellinger Divergence")

## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

# For visualization purposes HD is limited to the interval 0 to 50
dt = data[ which(data$HD < 50), ]
p2 = ggplot(dt,aes(x = Breast.tissue, y = HD , fill = Breast.tissue)) +
    geom_boxplot(na.rm = TRUE) + ylim(-0.1, 20) +
    stat_summary(fun.data = fun_length, geom = "text",
                position = position_dodge(width = 0.9), vjust = 1,
                size = 6, fontface = "bold") +
    theme(axis.title.x = element_text(face = "bold", size = 20),
        axis.text.x = element_text(face = "bold", size = 20, color = "black",
                                    hjust = 0.5, vjust = 0.75),
        axis.text.y = element_text(face = "bold", size = 20, color = "black"),
        axis.title.y = element_text(face = "bold", size = 20,color = "black"),
        legend.position = "none" ) +
    ggtitle("Boxplot of Hellinger Divergence")

grid.arrange(p1, p2, ncol = 2)

## Warning: Removed 6 rows containing non-finite values (`stat_summary()`).

Except for the tail, most of the methylation changes occurred under the area covered by the density curve corresponding to the normal breast tissue. This is theoretically expected. This area is explainable in statistical physical terms and, theoretically, it should fit a Weibull distribution. The tails regions cover the methylation changes that, with high probability, are not induced by thermal fluctuation and are not addressed to stabilize the DNA molecule. These changes are methylation signal. Professor David J. Miller (Department of Electrical Engineering, Penn State) proposed modeling the distribution as a mixed Weibull distribution to simultaneously describe the background methylation noise and the methylation signal (personal communication, January, 2018). This model approach seems to be supported by the above histogram, but it must be studied before being incorporated in a future version of Methyl-IT.

Checking cytosine read counts

According to the parameter values set for the function estimateDivergence, for each cytosine site, at least one individual must have a 10 reads or more, and the maximum coverage must be 500 or less.

# Maximum coverage at each cytosine site
covr <- lapply(HD, function(x) {
    cov1 <- x$c1 + x$t1
    cov2 <- x$c2 + x$t2
    cov <- apply(cbind(cov1, cov2), 1, max)
    return(cov)
})

# Quantiles
do.call(rbind, lapply(covr, function(x) {
    q60 <- quantile(x, 0.6)
    q9999 <- quantile(x, 0.9999)
    idx1 <- which(x >= q60)
    idx2 <- which(x <= 500)
    q95 <- quantile(x, 0.95)

    idx <- intersect(idx1, idx2)
    return(c(round(summary(x)), q60, 
            quantile(x, c(0.95, 0.99, 0.999, 0.9999)),
            num.siteGreater_8 = sum(x >= 8),
            q60_to_500 = sum((x >= q60) & (x <= 500)),
            num.siteGreater_500 = sum(x > 500)))})
)

##                   Min. 1st Qu. Median Mean 3rd Qu. Max. 60% 95% 99% 99.9%
## Breast_normal        5      19     30   31      41  291  34  57  69    81
## Breast_cancer        5      18     31   37      51  266  38  88 112   137
## Breast_metastasis    5      14     27   31      43  281  32  70  91   115
## Breast_normal1       5      17     26   27      35  279  29  48  58    68
##                     99.99% num.siteGreater_8 q60_to_500 num.siteGreater_500
## Breast_normal      94.0000            747005     319227                   0
## Breast_cancer     165.0000            569474     234247                   0
## Breast_metastasis 158.8469            435153     187884                   0
## Breast_normal1     80.0000            738787     318786                   0

The quantiles indicates acceptable values for the coverages. Descriptive statistics for methylated reads can be checked as well.

# Maximum coverage at each cytosine site
methc <- lapply(HD, function(x) {
    r <- apply(cbind(x$c1, x$c2), 1, max)
    return(r)
})

# Quantiles
do.call(rbind, lapply(methc, function(x) {
    q10 <- quantile(x, 0.1)
    q60 <- quantile(x, 0.6)
    q9999 <- quantile(x, 0.9999)
    idx1 <- which(x >= q60)
    idx2 <- which(x <= 500)
    q95 <- quantile(x, 0.95)

    idx <- intersect(idx1, idx2)
    return(c(round(summary(x)), q10, q60,
            quantile(x, c(0.95, 0.99, 0.999, 0.9999)),
            num.siteGreater_8 = sum(x >= 8),
            q60_to_500 = sum((x >= q60) & (x <= 500)),
            num.siteGreater_500 = sum(x > 500)))})
)

##                   Min. 1st Qu. Median Mean 3rd Qu. Max. 10% 60% 95% 99% 99.9%
## Breast_normal        4      14     24   25      34  265   7  28  49  60    72
## Breast_cancer        4      10     21   27      38  252   5  27  74  98   123
## Breast_metastasis    4       7     16   21      30  234   4  21  56  75    98
## Breast_normal1       4      12     20   21      29  258   6  24  41  50    60
##                     99.99% num.siteGreater_8 q60_to_500 num.siteGreater_500
## Breast_normal      84.0000            677552     309300                   0
## Breast_cancer     149.5463            472596     237092                   0
## Breast_metastasis 121.0000            328544     184064                   0
## Breast_normal1     70.0000            664530     304195                   0

The critical values from empirical cumulative probability distributions will be used in the downstream analysis.

critical.val <- do.call(rbind, lapply(HD, function(x) {
    hd.95 = quantile(x$hdiv, 0.95)
    tv.95 = quantile(abs(x$bay.TV), 0.95)
    return(c(tv = tv.95, hd = hd.95))
}))

critical.val

##                      tv.95%   hd.95%
## Breast_normal     0.4625591 2.888832
## Breast_cancer     0.4795708 3.413531
## Breast_metastasis 0.4638647 3.104796
## Breast_normal1    0.4442791 2.476178

Nonlinear fit of probability distribution models

A basic requirement for the application of signal detection is the knowledge of the probability distribution of the background noise. Probability distribution, as a Weibull distribution model, can be deduced on a statistical mechanical/thermodynamics basis for DNA methylation induced by thermal fluctuations (1).

Assuming that this background methylation variation is consistent with a Poisson process, it can be distinguished from variation associated with methylation regulatory machinery, which is non-independent for all genomic regions (1). An information-theoretic divergence to express the variation in methylation induced by background thermal fluctuations will follow a Weibull distribution model, provided that it is proportional to the minimum energy dissipated per bit of information associated with the methylation change.

The nonlinear fit to a Weibull distribution model is performed by the function nonlinearFitDist. In general, an information divergence of methylation levels will follow a probability distribution from the family of generalized gamma distribution. Next, the best fitted model will be selected from one of the following distributions: two- and three-parameter Weibull distributions, gamma with two- and three-parameters (“Gamma2P” or “Gamma3P”), and generalized gamma with three-parameter (“GGamma3P”).

The best model is selected on the basis of Akaike’s information criterion and the correlation coefficient of cross-validations for the nonlinear regressions (R.Cross.val). These criteria evaluate different information inferred from the models. AIC deals with the trade-off between the goodness of fit (GOF) of the model and the complexity of the model, while R.Cross.val provides information on the prediction power/performance of the model when confronted with external dataset. Cross-validations for the nonlinear regressions (R.Cross.val) were performed as described in reference (14). In addition, Stein’s formula for adjusted R squared ($\rho$) was used as an estimator of the average cross-validation predictive power (14).

In general, the best fitted model yields the best DMP classification performance (see below). However, on special dataset, as the used in the current example. Function gofReport search for the best fitted model between the set of models requested by the user.

The best model including “GGamma3P” model:

d <- c("Weibull2P", "Weibull3P", "Gamma2P", "Gamma3P", "GGamma3P")
gof_1 <- gofReport(HD = HD, column = 9L, model = d, 
                   num.cores = 6L,
                   output = "all", verbose = FALSE)

## 
  |                                                                            
  |                                                                      |   0%
  |                                                                            
  |==============                                                        |  20%
  |                                                                            
  |============================                                          |  40%
  |                                                                            
  |==========================================                            |  60%
  |                                                                            
  |========================================================              |  80%
  |                                                                            
  |======================================================================| 100%
## 
##  *** Creating report ...

gof_1$bestModel

##     Breast_normal     Breast_cancer Breast_metastasis    Breast_normal1 
##        "GGamma3P"        "GGamma3P"        "GGamma3P"        "GGamma3P"

The best model excluding “GGamma3P” model:

d <- c("Weibull2P", "Weibull3P", "Gamma2P", "Gamma3P")
gof_2 <- gofReport(HD = HD, 
                   column = 9L, 
                   model = d, 
                   num.cores = 6L, 
                   output = "all", 
                   verbose = FALSE)

## 
  |                                                                            
  |                                                                      |   0%
  |                                                                            
  |==================                                                    |  25%
  |                                                                            
  |===================================                                   |  50%
  |                                                                            
  |====================================================                  |  75%
  |                                                                            
  |======================================================================| 100%
## 
##  *** Creating report ...

gof_2$bestModel

##     Breast_normal     Breast_cancer Breast_metastasis    Breast_normal1 
##       "Weibull2P"         "Gamma2P"         "Gamma2P"       "Weibull2P"

When “GGamma3P” model is included, the goodness-of-fit indicators used here suggest that it is best fitted model. However, a final decision must take into account the DMP classification performance as well (see below), which is accomplished after a signal detection step. The reader must keep in mind that we are dealing here with numerical algorithms, which are not perfect. Depending on dataset, the numerical computation will confront different challengers. The difference between g3p_R.Cross.val and gg3p_R.Cross.val suggests that model predictions powers for “GGamma3P” and “Gamma3P” are close.

gof_1$stats

##                    w2p_AIC w2p_R.Cross.val w3p_AIC w3p_R.Cross.val  g2p_AIC
## Breast_normal     -3194788       0.9955368     Inf               0 -3165495
## Breast_cancer     -2001256       0.9906862     Inf               0 -2153574
## Breast_metastasis -1635630       0.9910272     Inf               0 -1847487
## Breast_normal1    -3371455       0.9964339     Inf               0 -3326676
##                   g2p_R.Cross.val g3p_AIC g3p_R.Cross.val gg3p_AIC
## Breast_normal           0.9945423     Inf               0 -4452069
## Breast_cancer           0.9912956     Inf               0 -2813248
## Breast_metastasis       0.9940890     Inf               0 -2324142
## Breast_normal1          0.9957575     Inf               0 -5057365
##                   gg3p_R.Cross.val bestModel
## Breast_normal            0.9989777      gg3p
## Breast_cancer            0.9971319      gg3p
## Breast_metastasis        0.9979396      gg3p
## Breast_normal1           0.9995615      gg3p

It is worthy to observe that although in the current example we successfully performed the nonlinear fit for GGamma3P model, this is not the general case when the number of cytosine sites goes close or above $10^6$, for which the computational cost is very high. The nonlinear fit of GGamma3P or GGamma4P is, in general, difficult since two different set of estimated parameters could produce very close distribution curves. Also, it is important to check the value of the scaling parameter from each distribution. The numerical algorithms could find the best fitted model with scaling parameter values close to zero, say e.g. $10^{-8}$, and with the highest GOFs. Situations like this are biologically meaningless and the fitted models must be discarded (1).

Signal detection

The information thermodynamics-based approach is postulated to provide greater sensitivity for resolving true signal from the thermodynamic background within the methylome (1). Since the biological signal created within the dynamic methylome environment characteristic of living organisms is not free from background noise, the approach, designated Methyl-IT, includes the application of signal detection theory (3–6). Signal detection is a critical step to increase sensitivity and resolution of methylation signal by reducing the signal-to-noise ratio and objectively controlling the false positive rate and prediction accuracy/risk.

Potential methylation signal

The first estimation in our signal detection step is the identification of the cytosine sites carrying potential methylation signal $PS$. The methylation regulatory signal does not hold the theoretical distribution and, consequently, for a given level of significance $\alpha$ (Type I error probability, e.g. $\alpha = 0.05$), cytosine positions $k$ with information divergence $H_k >= H_{\alpha = 0.05}$ can be selected as sites carrying potential signals $PS$. The value of $\alpha$ can be specified.

For example, potential signals with $H_k > H_{\alpha = 0.01}$ can be selected. For each sample, cytosine sites are selected based on the corresponding fitted theoretical distribution model estimated in the previous step. Additionally, since cytosine with $|TV_{d_k}| < 0.1$ are the most abundant sites, depending on the sample (experiment), cytosine positions $k$ with $H_k >= H_{\alpha = 0.05}$ and $|TV_{d_k}| < 0.1$ can be observed. To prevent the last situation we can select the $PS$ with the additional constraint $|TV_{d_k}| > TV_0$, where $TV_0$ (‘tv.cut’) is a user specified value. The $PS$ is detected with the function getPotentialDIMP:

PS1 = getPotentialDIMP(LR = HD, dist.name = gof_1$bestModel, nlms = gof_1$nlms,
                        div.col = 9, alpha = 0.05, tv.col = 8, tv.cut = 0.5)
PS2 = getPotentialDIMP(LR = HD, dist.name = gof_2$bestModel, nlms = gof_2$nlms,
                        div.col = 9, alpha = 0.05, tv.col = 8, tv.cut = 0.5)
PS1$Breast_normal

## GRanges object with 372 ranges and 10 metadata columns:
##         seqnames    ranges strand |        c1        t1        c2        t2
##            <Rle> <IRanges>  <Rle> | <numeric> <numeric> <numeric> <numeric>
##     [1]    chr13  20325510      * |         4         4        60         0
##     [2]    chr13  20414617      * |         3         3        59         0
##     [3]    chr13  20647291      * |         2         2        59         0
##     [4]    chr13  20658445      * |         4         4        59         0
##     [5]    chr13  20755264      * |         2         2        62         0
##     ...      ...       ...    ... .       ...       ...       ...       ...
##   [368]    chr13 114597079      * |         3         3        63         0
##   [369]    chr13 114839635      * |         3         3        62         0
##   [370]    chr13 114840738      * |         3         3        63         0
##   [371]    chr13 115057919      * |         4         4        58         0
##   [372]    chr13 115090233      * |         2         2        61         0
##                p1        p2        TV    bay.TV      hdiv       wprob
##         <numeric> <numeric> <numeric> <numeric> <numeric>   <numeric>
##     [1]  0.450574  0.952109       0.5  0.501535   5.73503 1.50139e-11
##     [2]  0.450303  0.951394       0.5  0.501091   4.56377 1.27451e-05
##     [3]  0.450029  0.951394       0.5  0.501365   3.36306 8.66788e-03
##     [4]  0.450574  0.951394       0.5  0.500820   5.69272 2.94217e-11
##     [5]  0.450029  0.953476       0.5  0.503447   3.42776 6.89877e-03
##     ...       ...       ...       ...       ...       ...         ...
##   [368]  0.450303  0.954131       0.5  0.503828   4.68786 4.56271e-06
##   [369]  0.450303  0.953476       0.5  0.503173   4.65774 5.89952e-06
##   [370]  0.450303  0.954131       0.5  0.503828   4.68786 4.56271e-06
##   [371]  0.450574  0.950658       0.5  0.500084   5.64958 5.74997e-11
##   [372]  0.450029  0.952802       0.5  0.502773   3.40662 7.44184e-03
##   -------
##   seqinfo: 1 sequence from an unspecified genome; no seqlengths

PS2$Breast_normal

## GRanges object with 237 ranges and 10 metadata columns:
##         seqnames    ranges strand |        c1        t1        c2        t2
##            <Rle> <IRanges>  <Rle> | <numeric> <numeric> <numeric> <numeric>
##     [1]    chr13  20325510      * |         4         4        60         0
##     [2]    chr13  20414617      * |         3         3        59         0
##     [3]    chr13  20658445      * |         4         4        59         0
##     [4]    chr13  21002790      * |         4         4        62         0
##     [5]    chr13  21289664      * |         2         2        69         0
##     ...      ...       ...    ... .       ...       ...       ...       ...
##   [233]    chr13 114497528      * |         3         3        60         0
##   [234]    chr13 114597079      * |         3         3        63         0
##   [235]    chr13 114839635      * |         3         3        62         0
##   [236]    chr13 114840738      * |         3         3        63         0
##   [237]    chr13 115057919      * |         4         4        58         0
##                p1        p2        TV    bay.TV      hdiv      wprob
##         <numeric> <numeric> <numeric> <numeric> <numeric>  <numeric>
##     [1]  0.450574  0.952109       0.5  0.501535   5.73503 0.00772371
##     [2]  0.450303  0.951394       0.5  0.501091   4.56377 0.02095007
##     [3]  0.450574  0.951394       0.5  0.500820   5.69272 0.00800731
##     [4]  0.450574  0.953476       0.5  0.502902   5.81721 0.00720116
##     [5]  0.450029  0.957704       0.5  0.507675   3.56496 0.04900990
##     ...       ...       ...       ...       ...       ...        ...
##   [233]  0.450303  0.952109       0.5  0.501806   4.59571  0.0203880
##   [234]  0.450303  0.954131       0.5  0.503828   4.68786  0.0188494
##   [235]  0.450303  0.953476       0.5  0.503173   4.65774  0.0193392
##   [236]  0.450303  0.954131       0.5  0.503828   4.68786  0.0188494
##   [237]  0.450574  0.950658       0.5  0.500084   5.64958  0.0083073
##   -------
##   seqinfo: 1 sequence from an unspecified genome; no seqlengths

Model “GGamma3P” reported 20263 - 11314 = 8949 more DMPs in the normal tissue than model “Gamma3P”.

Notice that the total variation distance $TV_d=|TV|$ is an information divergence as well and that it can be used in place of Hellinger divergence (1). The set of vectors $P_i = (p_i, 1-p_i)$ and distance function $TV_d$ integrate a metric space (2). In particular, after takes the Manhattan distance between healthy individuals and patients methylation levels we have:

\[M_d(p_{ij}, p_{ir})=\frac{1}{2}(|\hat p_{ij} - \hat p_{ir}| + | (1-\hat p_{ij}) - (1-\hat p_{ir})|)=|\hat p_{ij} - \hat p_{ir}|=TV_d(p_{ij}, p_{ir})\].

That is, the absolute difference of methylation levels is a particular case of Manhattan distance, which is named total variation distance $TV_d$. The quantitative effect of the vector components $1-\hat p_{ij}$ and $1-\hat p_{ir}$ (in our case, the effect of unmethylated read counts) is not present in $TV_d$ as in $H(\hat p_{ij},\hat p_{ir})$.

Histogram and boxplots of methylation potential signals

As before, a single GRanges object is built from the above set GRanges objects using the function uniqueGRanges, and the Hellinger divergences of the cytosine sites carrying $PS$ (for each sample) are located in a single matrix on the metacolumn of the GRanges object.

# Define an auxiliar function
fun_length <- function(x){
    return(data.frame(y = median(x) + 1.5, label = paste0("n = ", length(x))))
}

# The data to be used in the boxplot
ps1 = uniqueGRanges(PS1, missing = NA, columns = 9, verbose = FALSE,
                    num.cores = 6L)
colnames(mcols(ps1)) <- c("normal", "cancer", "metastasis", "normal1")
dat1 = data.frame(normal = ps1$normal, normal1 = ps1$normal1,
                cancer = ps1$cancer, metastasis = ps1$metastasis)
dat1 = suppressMessages(melt(dat1))
colnames(dat1) <- c("Breast.tissue", "HD")
idx <- which(is.na(dat1$HD))
dat1 <- dat1[-idx, ] # To remove missing data

ps2 <- uniqueGRanges(PS2, missing = NA, columns = 9, verbose = FALSE,
                    num.cores = 6L)
colnames(mcols(ps2)) <- c("normal", "cancer", "metastasis", "normal1")
dat2 = data.frame(normal = ps2$normal, normal1 = ps2$normal1,
        cancer = ps2$cancer, metastasis = ps2$metastasis)
dat2 = suppressMessages(melt(dat2))
colnames(dat2) <- c("Breast.tissue", "HD")
idx <- which(is.na(dat2$HD))
dat2 <- dat2[-idx, ] # To remove missing data

p1 = ggplot(dat1, aes(x = HD, fill = Breast.tissue, colour = Breast.tissue)) +
    geom_histogram(alpha = 0.5, bins = 50, position = "identity",
        na.rm = TRUE, size = 0.7) + ylab("Counts") + xlim(0,30) +
theme(axis.title.x = element_text(face = "bold", size = 20),
        axis.text.x = element_text(face = "bold", size = 20, color = "black",
                                hjust = 0.5, vjust = 0.75),
        axis.text.y = element_text(face = "bold", size = 20, color = "black"),
        axis.title.y = element_text(face = "bold", size = 20,color = "black"),
        legend.text = element_text(size = 20, face = "bold"),
        legend.title = element_text(size = 20, face = "bold") ) +
    ggtitle("Histogram for Potential methylation signal. GGamma3P")

p2 = ggplot(dat1,aes(x = Breast.tissue, y = HD , fill = Breast.tissue)) +
    geom_boxplot(na.rm = TRUE) + ylim(0, 15) +
    stat_summary(fun.data = fun_length, geom = "text", na.rm = TRUE,
                position = position_dodge(width = 0.9), vjust = 1,
                size = 6, fontface = "bold") +
    geom_hline(yintercept = 6, linetype="dashed", color = "red") +
    theme(axis.title.x = element_text(face = "bold", size = 20),
        axis.text.x = element_text(face = "bold", size = 20, color = "black",
                                    hjust = 0.5, vjust = 0.75),
        axis.text.y = element_text(face = "bold", size = 20, color = "black"),
        axis.title.y = element_text(face = "bold", size = 20,color = "black"),
        legend.position = "none" ) +
    ggtitle("Boxplot for Potential methylation signal. GGamma3P")


p3 = ggplot(dat2, aes(x = HD, fill = Breast.tissue, colour = Breast.tissue)) +
    geom_histogram(alpha = 0.5, bins = 50, position = "identity",
        na.rm = TRUE, size = 0.7) + ylab("Counts") + xlim(0,30) +
theme(axis.title.x = element_text(face = "bold", size = 20),
        axis.text.x = element_text(face = "bold", size = 20, color = "black",
                                    hjust = 0.5, vjust = 0.75),
        axis.text.y = element_text(face = "bold", size = 20, color = "black"),
        axis.title.y = element_text(face = "bold", size = 20,color = "black"),
        legend.text = element_text(size = 20, face = "bold"),
        legend.title = element_text(size = 20, face = "bold")) +
    ggtitle("Histogram for Potential methylation signal. Gamma3P")

p4 = ggplot(dat2,aes(x = Breast.tissue, y = HD , fill = Breast.tissue)) +
    geom_boxplot(na.rm = TRUE) + ylim(0, 15) +
    stat_summary(fun.data = fun_length, geom = "text", na.rm = TRUE,
                position = position_dodge(width = 0.9), vjust = 1,
                size = 6, fontface = "bold") +
    geom_hline(yintercept = 6, linetype="dashed", color = "red") +
    theme(axis.title.x = element_text(face = "bold", size = 20),
        axis.text.x = element_text(face = "bold", size = 20, color = "black",
                                    hjust = 0.5, vjust = 0.75),
        axis.text.y = element_text(face = "bold", size = 20, color = "black"),
        axis.title.y = element_text(face = "bold", size = 20,color = "black"),
        legend.position = "none" ) +
    ggtitle("Boxplot for Potential methylation signal. Gamma3P")

grid.arrange(p1, p2, p3, p4, ncol = 2)

The graphics show that “Gamma3P” provides more information on the discrimination of treatment potential DMPs from the control.

Cutpoint estimation Laws of statistical physics can account for background

methylation, a response to thermal fluctuations that presumably functions in DNA stability (1, 2). True signal is detected based on the optimal cutpoint (15), which can be estimated from the area under the curve (AUC) of a receiver operating characteristic (ROC) curve built from a logistic regression performed with the potential signals from controls and treatments. The ROC AUC is equivalent to the probability that a randomly- chosen positive instance is ranked more highly than a randomly-chosen negative instance (16). In the current context, the AUC is equivalent to the probability to distinguish a randomly-chosen methylation regulatory signal induced by the treatment from a randomly-chosen signal in the control.

The need for the application of (what is now known as) signal detection in cancer research was pointed out by Youden in the midst of the last century (17). In the next example, the simple cutpoint estimation available in Methyl-IT is based on the application of Youden index (17). Although cutpoints are estimated for a single variable, the classification performance can be evaluated for several variables and applying different model classifiers. A optimal cutpoint distinguishes disease stages from healthy individual. The performance of this classification is given in the output of function estimateCutPoint.

A model classifier can be requested for further predictions and its classification performance is also provided. Below, the selected model classifier is a logistic model applied on the principal components derived from a previous principal component analysis (PCA, classifier1 = “pca.logistic”“, clas.perf = TRUE). Four predictor variables are available: the Hellinger divergence of methylation levels (hdiv), total variation distance (with Bayesian correction for the methylation levels, bay.TV, absolute difference of methylation levels), relative position of cytosine site in the chromosome (pos), and the logarithm base two of the probability to observe a Hellinger divergence value $H$ greater than the critical value $H_{\alpha = 0.05}$ (values given as probabilities in object PS, wprob).

Notice that the cutpoint can be estimated for any of the two currently available information divergences: Hellinger divergence (div.col = 9) or the total variation distance (with Bayesian correction, div.col = 8).

Cutpoint based on Youden index

Null distribution “GGamma3P”:

cutpoint1 = estimateCutPoint(LR = PS1,
                    control.names = c("Breast_normal","Breast_normal1"),
                    treatment.names = c("Breast_cancer", "Breast_metastasis"),
                    simple = TRUE,
                    classifier1 = "pca.logistic",
                    column = c(hdiv = TRUE, bay.TV = TRUE, 
                                wprob = TRUE, pos = TRUE),
                    n.pc = 4 , center = TRUE, scale = TRUE,
                    div.col = 9, clas.perf = TRUE)
cutpoint1$cutpoint

## [1] 3.408457

cutpoint1$testSetPerformance

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   CT   TT
##         CT    0    2
##         TT  110 1538
##                                           
##                Accuracy : 0.9321          
##                  95% CI : (0.9189, 0.9438)
##     No Information Rate : 0.9333          
##     P-Value [Acc > NIR] : 0.6026          
##                                           
##                   Kappa : -0.0024         
##                                           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.9987          
##             Specificity : 0.0000          
##          Pos Pred Value : 0.9333          
##          Neg Pred Value : 0.0000          
##              Prevalence : 0.9333          
##          Detection Rate : 0.9321          
##    Detection Prevalence : 0.9988          
##       Balanced Accuracy : 0.4994          
##                                           
##        'Positive' Class : TT              
##

cutpoint1$testSetModel.FDR

## [1] 0.001298701

Null distribution “Gamma3P”:

cutpoint2 = estimateCutPoint(LR = PS2,
                    control.names = c("Breast_normal","Breast_normal1"),
                    treatment.names = c("Breast_cancer", "Breast_metastasis"),
                    simple = TRUE,
                    classifier1 = "pca.logistic",
                    column = c(hdiv = TRUE, bay.TV = TRUE,
                                wprob = TRUE, pos = TRUE),
                    n.pc = 4 , center = TRUE, scale = TRUE,
                    div.col = 9, clas.perf = TRUE)
cutpoint2$cutpoint

## [1] 4.846669

cutpoint2$testSetPerformance

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  CT  TT
##         CT  42   0
##         TT   0 487
##                                      
##                Accuracy : 1          
##                  95% CI : (0.9931, 1)
##     No Information Rate : 0.9206     
##     P-Value [Acc > NIR] : < 2.2e-16  
##                                      
##                   Kappa : 1          
##                                      
##  Mcnemar's Test P-Value : NA         
##                                      
##             Sensitivity : 1.0000     
##             Specificity : 1.0000     
##          Pos Pred Value : 1.0000     
##          Neg Pred Value : 1.0000     
##              Prevalence : 0.9206     
##          Detection Rate : 0.9206     
##    Detection Prevalence : 0.9206     
##       Balanced Accuracy : 1.0000     
##                                      
##        'Positive' Class : TT         
##

cutpoint2$testSetModel.FDR

## [1] 0

The cutpoint estimated based “GGamma3P” model is lower than the value based on “Gamma3P” model. Often, this is a good detail, since it permits the identification of more sites with high probability to carry methylation signal. However, the evaluation of the classification performance has the last word on model selection. In the current case, the DMP classification performance is much better for “Gamma3P” model than for “GGamma3P”.

In particular, the probabilities derived from each distribution model provide fundamental information to the model classifier. Hence, the probability that a potential DMP could carry a methylation signal estimated using “Gamma3P” model is more informative (for the model classifier) than the probability estimated using “GGamma3P” model.

Cutpoint estimated using a model classifier

In the current case, the DMP classification performance for the signal derived with null distribution “Gamma3P” can be improved searching for an optimal cutpoint with a model classifier and using the total variation distance ( with Bayesian correction, bay.TV).

cutpoint2 = estimateCutPoint(LR = PS2,
                    control.names = c("Breast_normal","Breast_normal1"),
                    treatment.names = c("Breast_cancer", "Breast_metastasis"),
                    simple = FALSE,
                    classifier1 = "pca.logistic",
                    classifier2 = "pca.qda",
                    column = c(hdiv = TRUE, bay.TV = TRUE,
                                wprob = TRUE, pos = TRUE),
                    n.pc = 4 , center = TRUE, scale = TRUE,
                    div.col = 8)
cutpoint2$cutpoint

## [1] 0.5000129

cutpoint2$testSetPerformance

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  CT  TT
##         CT  97   0
##         TT   0 514
##                                     
##                Accuracy : 1         
##                  95% CI : (0.994, 1)
##     No Information Rate : 0.8412    
##     P-Value [Acc > NIR] : < 2.2e-16 
##                                     
##                   Kappa : 1         
##                                     
##  Mcnemar's Test P-Value : NA        
##                                     
##             Sensitivity : 1.0000    
##             Specificity : 1.0000    
##          Pos Pred Value : 1.0000    
##          Neg Pred Value : 1.0000    
##              Prevalence : 0.8412    
##          Detection Rate : 0.8412    
##    Detection Prevalence : 0.8412    
##       Balanced Accuracy : 1.0000    
##                                     
##        'Positive' Class : TT        
##

cutpoint2$testSetModel.FDR

## [1] 0

The good DMP classification performance was kept at lower cutpoint value.

DMPs

Cytosine sites carrying a methylation signal are designated differentially informative methylated positions (DMPs). The probability that a DMP is not induced by the treatment is given by the probability of false alarm (P_FA, false positive, Sanchez et al. 2019). That is, the biological signal is naturally present in the control as well as in the treatment. According to the classical view in the epigenomic field, a DMP is a cytosine position carrying a significant methylation signal, which might be identified as differentially methylated cytosine site by Fisher’s exact test (or other current tests).

In Methyl-IT, a DMP is a DNA cytosine position with high probability to be differentially hyper-methylated or hypo-methylated in the treatment with respect to a given control (reference individual). Notice that Methyl- IT definition of DMP is not deterministic in an ordinary sense, but stochastic- deterministic in physico-mathematical terms. In other words, if a given cytosine position is identified as a hyper-methylated DMP in the treatment, then with high probability such a site will be found more frequently hyper- methylated than hypo-methylated in the treatment population*_. Observe that it does not mean that, for all arbitrary subset of samples, the assumed cytosine position will be hypermethylated.

DMPs are selected with the function selectDIMP

DMPs = selectDIMP(PS2, div.col = 8, cutpoint = cutpoint2$cutpoint)

The descriptive statistic of the read counts at DMPs

After the signal detection step the read counts at DMPs is notable greater than at the general methylated cytosine background:

# Maximum coverage at each cytosine site
methc <- lapply(DMPs, function(x) {
                        r <- apply(cbind(x$c1, x$c2), 1, max)
                        return(r)
})

# Quantiles
do.call(rbind, lapply(methc, function(x) {
    q10 <- quantile(x, 0.1)
    q60 <- quantile(x, 0.6)
    q9999 <- quantile(x, 0.9999)
    idx1 <- which(x >= q60)
    idx2 <- which(x <= 500)
    q95 <- quantile(x, 0.95)

    idx <- intersect(idx1, idx2)
    return(c(round(summary(x)), q10, q60,
            quantile(x, c(0.95, 0.99, 0.999, 0.9999)),
            num.siteGreater_8 = sum(x >= 8),
            q60_to_500 = sum((x >= q60) & (x <= 500)),
            num.siteGreater_500 = sum(x > 500)))})
)

##                   Min. 1st Qu. Median Mean 3rd Qu. Max.  10%  60%    95%    99%
## Breast_normal       27      59     62   63      67   91 57.0 63.0  76.00  84.92
## Breast_cancer       43      70     83   84      99  160 57.6 89.0 118.00 130.00
## Breast_metastasis   52      59     68   71      79  119 56.5 72.0  96.75 110.20
## Breast_normal1      61      61     63   63      64   64 61.0 63.4  64.00  64.00
##                     99.9%   99.99% num.siteGreater_8 q60_to_500
## Breast_normal      89.820  90.8820               237        101
## Breast_cancer     157.712 159.7928              1037        417
## Breast_metastasis 117.775 118.8775               246         99
## Breast_normal1     64.000  64.0000                 5          2
##                   num.siteGreater_500
## Breast_normal                       0
## Breast_cancer                       0
## Breast_metastasis                   0
## Breast_normal1                      0

The minimum amount of methylated read counts is observed “Breast_metastasis” (15 reads) and for any the samples the tenth percent of DMPs is 26 or more reads.

Venn Diagram of DMPs

The Venn diagram of DMPs reveals that the number cytosine site carrying methylation signal with a divergence level comparable to that observed in breast tissues with cancer and metastasis is relatively small (2797 DMPs). The number of DMPs decreased in the breast tissue with metastasis, but, as shown in the last boxplot, the intensity of the signal increased.

n12 = length(GenomicRanges::intersect(DMPs$Breast_normal, 
                                    DMPs$Breast_cancer))
n13 = length(GenomicRanges::intersect(DMPs$Breast_normal, 
                                    DMPs$Breast_metastasis))
n23 = length(GenomicRanges::intersect(DMPs$Breast_cancer,
                                    DMPs$Breast_metastasis))
n123 = length(Reduce(GenomicRanges::intersect,
                    list(DMPs$Breast_normal, DMPs$Breast_cancer,
                        DMPs$Breast_metastasis)))

grid.newpage()
v = draw.triple.venn(area1 = length(DMPs$Breast_normal),
                    area2 = length(DMPs$Breast_cancer),
                    area3 = length(DMPs$Breast_metastasis),
                    n12 = n12, n23 = n23, n13 = n13, n123 = n123,
                    category = c("Breast_normal", "Breast_cancer",
                                "Breast_metastasis"),
                    lty = rep("blank", 3), fill = c("blue", "yellow", 
                                                    "magenta"),
                    alpha = c(0.1, 0.2, 0.3),
                    cat.pos = c(-110, 110, 0),
                    cat.col = c("blue", "darkorange", "red"),
                    cat.dist = c( -0.1, -0.1, -0.26),
                    cex = rep(1.7, 7),
                    cat.cex = c( 1.5, 1.5, 1.5),
                    label.col = c( "blue", "darkorange", "darkorange",
                                "red", "white", "red", "red"),
                    scaled = TRUE)
grid.draw(v)

Notice that natural methylation regulatory signals (not induced by the treatment) are present in both groups, control and treatment. The signal detection step permits us to discriminate the “ordinary” signals observed in the control from those induced by the treatment (a disease in the current case). In addition, this diagram reflects a classification of DMPs only based on the cytosine positions. That is, this Venn diagram cannot tell us whether DMPs at the same position can be distinguishable or not. For example, DMPs at the same positions in control and treatment can happen with different probabilities estimated from their corresponding fitted Weibull distributions (see below).

Monte Carlo Evaluation of the Model Classification

Performance The regulatory methylation signal is an output from a natural process that continuously takes place across the ontogenetic development of the organism. Therefore, we expect to see methylation signal in natural, ordinary conditions. Function ‘evaluateDIMPclass’ can be used to perform a classification of DMPs into two classes: DMPS from control and DMPs from treatment samples, as well as an evaluation of the classification performance (for more details see evaluateDIMPclass).

As a matter of fact this function is called by function estimateCutPoint. Additional feature not used before is the possibility to perform Monte Carlo evaluation of the model classifier performance. In above split of the sample into two subsets, training and test datasets, we would be just lucky getting the right proportion that yields the best classification performance. What about is the random split is repeat 300 times?

The performance of the “pca.qda” classifier is:

performance <- evaluateDIMPclass(LR = DMPs,
                    control.names = c("Breast_normal", "Breast_normal1"),
                    treatment.names = c("Breast_cancer", "Breast_metastasis"),
                    column = c(hdiv = TRUE, bay.TV = TRUE, 
                                wprob = TRUE, pos = TRUE),
                    classifier = "pca.qda", prop = 0.6,
                    n.pc = 4 , center = TRUE, scale = TRUE,
                    num.boot = 300, output = "mc.val",
                    num.cores = multicoreWorkers(), tasks = 2L)
performance

##     Accuracy          Kappa        AccuracyLower    AccuracyUpper   
##  Min.   :0.9918   Min.   :0.9687   Min.   :0.9810   Min.   :0.9973  
##  1st Qu.:1.0000   1st Qu.:1.0000   1st Qu.:0.9940   1st Qu.:1.0000  
##  Median :1.0000   Median :1.0000   Median :0.9940   Median :1.0000  
##  Mean   :0.9998   Mean   :0.9991   Mean   :0.9936   Mean   :0.9999  
##  3rd Qu.:1.0000   3rd Qu.:1.0000   3rd Qu.:0.9940   3rd Qu.:1.0000  
##  Max.   :1.0000   Max.   :1.0000   Max.   :0.9940   Max.   :1.0000  
##                                                                     
##   AccuracyNull    AccuracyPValue      McnemarPValue      Sensitivity
##  Min.   :0.8412   Min.   :0.000e+00   Min.   :0.07364   Min.   :1   
##  1st Qu.:0.8412   1st Qu.:0.000e+00   1st Qu.:0.08863   1st Qu.:1   
##  Median :0.8412   Median :0.000e+00   Median :0.47950   Median :1   
##  Mean   :0.8412   Mean   :3.189e-40   Mean   :0.31413   Mean   :1   
##  3rd Qu.:0.8412   3rd Qu.:0.000e+00   3rd Qu.:0.47950   3rd Qu.:1   
##  Max.   :0.8412   Max.   :2.341e-38   Max.   :0.47950   Max.   :1   
##                                       NA's   :286                   
##   Specificity     Pos Pred Value   Neg Pred Value   Precision          Recall 
##  Min.   :0.9485   Min.   :0.9904   Min.   :1      Min.   :0.9904   Min.   :1  
##  1st Qu.:1.0000   1st Qu.:1.0000   1st Qu.:1      1st Qu.:1.0000   1st Qu.:1  
##  Median :1.0000   Median :1.0000   Median :1      Median :1.0000   Median :1  
##  Mean   :0.9985   Mean   :0.9997   Mean   :1      Mean   :0.9997   Mean   :1  
##  3rd Qu.:1.0000   3rd Qu.:1.0000   3rd Qu.:1      3rd Qu.:1.0000   3rd Qu.:1  
##  Max.   :1.0000   Max.   :1.0000   Max.   :1      Max.   :1.0000   Max.   :1  
##                                                                               
##        F1           Prevalence     Detection Rate   Detection Prevalence
##  Min.   :0.9952   Min.   :0.8412   Min.   :0.8412   Min.   :0.8412      
##  1st Qu.:1.0000   1st Qu.:0.8412   1st Qu.:0.8412   1st Qu.:0.8412      
##  Median :1.0000   Median :0.8412   Median :0.8412   Median :0.8412      
##  Mean   :0.9999   Mean   :0.8412   Mean   :0.8412   Mean   :0.8415      
##  3rd Qu.:1.0000   3rd Qu.:0.8412   3rd Qu.:0.8412   3rd Qu.:0.8412      
##  Max.   :1.0000   Max.   :0.8412   Max.   :0.8412   Max.   :0.8494      
##                                                                         
##  Balanced Accuracy
##  Min.   :0.9742   
##  1st Qu.:1.0000   
##  Median :1.0000   
##  Mean   :0.9992   
##  3rd Qu.:1.0000   
##  Max.   :1.0000   
##

Differentially methylated genomic regions (DMRs)

In Methyl-IT, the concept of DMR is generalized and it is not limited to any particular genomic region found with specific clustering algorithm. It can be applied to any naturally or algorithmically defined genomic region. For example, an exon region identified statistically to be differentially methylated by using generalized linear regression model (GLM) is a DMR. Differentially methylated genes (DMGs) are estimated from group comparisons for the number of DMPs on gene-body regions between control and treatment. This permits the extension of the concept of DMR by considering a differentially methylated gene as a particular case of DMRs.

Our degree of confidence in whether DMP counts in both groups of samples, control and treatment, represent true biological signal was determined in the signal detection step. To estimate DMRs, we followed similar steps to those proposed in Bioconductor R package DESeq2 (18), but our GLM test looks for statistical difference between the groups based on gene-body DMP counts overlapping a given genomic region rather than read counts. The regression analysis of the generalized linear model (GLM) with logarithmic link was applied to test the difference between group counts.

The fitting algorithmic approaches provided by ‘glm’ and ‘glm.nb’ functions from the R packages stat and MASS, respectively, were used for Poisson (PR), Quasi-Poisson (QPR) and Negative Binomial (NBR) linear regression analyses, respectively.

Differentially methylated genes (DMGs)

We shall call DMGs those DIMRs restricted to gene-body regions. DMGs are detected using function countTest2. Gene annotation is taken from Ensembl

# To load human gene annotation
# AG = import(con = paste0("ftp://ftp.ensembl.org/pub/release-91/gff3/",
# "homo_sapiens/Homo_sapiens.GRCh38.91.gff3.gz"))
# genes = AG[ AG$type == "gene", c( "gene_id", "biotype", "Name" ) ]
# genes = genes[ genes$biotype == "protein_coding", "gene_id" ]
# seqlevels(genes, "coarse") <- "13" # To keep a consistent chromosome 
# annotation
# seqlevels(genes) <- "chr13"

Here, for the sake of brevity, we just load the “RData” compressed file containing only the gene annotation:

con = url(paste0("https://git.psu.edu/genomath/MethylIT_examples/raw/master/",
                "Homo_sapiens_GRCh38.91_genes.RData"))
load(con)
genes = genes[ genes$biotype == "protein_coding", ]
seqlevels(genes, "coarse") <- "13" # To keep a consistent chromosome annotation
seqlevels(genes) <- "chr13"
genes

## GRanges object with 323 ranges and 3 metadata columns:
##         seqnames              ranges strand |         gene_id        biotype
##            <Rle>           <IRanges>  <Rle> |     <character>    <character>
##     [1]    chr13   19173770-19181852      - | ENSG00000198033 protein_coding
##     [2]    chr13   19422877-19536762      - | ENSG00000132958 protein_coding
##     [3]    chr13   19633681-19673459      + | ENSG00000196199 protein_coding
##     [4]    chr13   19674752-19783019      - | ENSG00000121390 protein_coding
##     [5]    chr13   19823482-19863636      - | ENSG00000132950 protein_coding
##     ...      ...                 ...    ... .             ...            ...
##   [319]    chr13 113977783-114132611      - | ENSG00000185989 protein_coding
##   [320]    chr13 114179238-114223084      + | ENSG00000283361 protein_coding
##   [321]    chr13 114234887-114272723      + | ENSG00000130177 protein_coding
##   [322]    chr13 114281584-114305817      + | ENSG00000169062 protein_coding
##   [323]    chr13 114314513-114327328      + | ENSG00000198824 protein_coding
##                Name
##         <character>
##     [1]      TUBA3C
##     [2]       TPTE2
##     [3]    MPHOSPH8
##     [4]       PSPC1
##     [5]       ZMYM5
##     ...         ...
##   [319]       RASA3
##   [320]  AL160396.2
##   [321]       CDC16
##   [322]       UPF3A
##   [323]      CHAMP1
##   -------
##   seqinfo: 1 sequence from an unspecified genome; no seqlengths

Function getDIMPatGenes is used to count the number of DMPs at gene-body. Nevertheless, it can be used for any arbitrary specified genomic region as well. The operation of this function is based on the ‘findOverlaps’ function from the ‘GenomicRanges’ Bioconductor R package. The ‘findOverlaps’ function has several critical parameters like, for example, ‘maxgap’, ‘minoverlap’, and ‘ignore.strand’. In our function getDIMPatGenes, except for setting ignore.strand = TRUE and type = “within”, we preserve the rest of default ‘findOverlaps’ parameters.

In this case, these are important parameter settings because the local mechanical effect of methylation changes on a DNA region where a gene is located is independent of the strand where the gene is encoded. That is, methylation changes located in any of the two DNA strands inside the gene-body region will affect the flexibility of the DNA molecule (7, 19).

DMPsBN = getDIMPatGenes(GR = DMPs$Breast_normal, GENES = genes)
DMPsBN1 = getDIMPatGenes(GR = DMPs$Breast_normal1, GENES = genes)
DMPsBC = getDIMPatGenes(GR = DMPs$Breast_cancer, GENES = genes)
DMPsBM = getDIMPatGenes(GR = DMPs$Breast_metastasis, GENES =
        genes)

The number of DMPs on the strand where a gene is encoded is obtained by setting ignore.strand = FALSE. However, for the current example results will be the same since the datasets downloaded from GEO do not have strand information. Next, the above GRanges objects carrying the DMP counts from each sample are grouped into a single GRanges object. Since we have only one control, to perform group comparison and to move forward with this example, we duplicated ‘Breast_normal’ sample. Obviously, the confidence on the results increases with the number of sample replications per group (in this case, it is only an illustrative example on how to perform the analysis, since a fair comparison requires for more than one replicate in the control group).

Genes.DMPs = uniqueGRanges( list(DMPsBN[, 2], DMPsBN1[, 2],
                                DMPsBC[, 2], DMPsBM[, 2]),
                            type = "equal", verbose = FALSE,
                            ignore.strand = TRUE )
colnames( mcols(Genes.DMPs)) <- c("Breast_normal", "Breast_normal1",
                                    "Breast_cancer", "Breast_metastasis")

Next, the set of mapped genes are annotated

GeneID = subsetByOverlaps(genes, Genes.DMPs, type = "equal",
                            ignore.strand = FALSE)
names( Genes.DMPs ) <- GeneID$gene_id
Genes.DMPs

## GRanges object with 127 ranges and 4 metadata columns:
##                   seqnames              ranges strand | Breast_normal
##                      <Rle>           <IRanges>  <Rle> |     <numeric>
##   ENSG00000121390    chr13   19674752-19783019      * |             0
##   ENSG00000165475    chr13   20403667-20525857      * |             1
##   ENSG00000032742    chr13   20567069-20691437      * |             1
##   ENSG00000150456    chr13   20728731-20773958      * |             0
##   ENSG00000150457    chr13   20973032-21061547      * |             1
##               ...      ...                 ...    ... .           ...
##   ENSG00000283199    chr13 113953705-113973997      * |             0
##   ENSG00000185989    chr13 113977783-114132611      * |             1
##   ENSG00000283361    chr13 114179238-114223084      * |             2
##   ENSG00000130177    chr13 114234887-114272723      * |             2
##   ENSG00000169062    chr13 114281584-114305817      * |             0
##                   Breast_normal1 Breast_cancer Breast_metastasis
##                        <numeric>     <numeric>         <numeric>
##   ENSG00000121390              0             0                 1
##   ENSG00000165475              0             0                 1
##   ENSG00000032742              0             2                 1
##   ENSG00000150456              0             0                 2
##   ENSG00000150457              0             0                 3
##               ...            ...           ...               ...
##   ENSG00000283199              0             2                 0
##   ENSG00000185989              0            32                 5
##   ENSG00000283361              0            20                 6
##   ENSG00000130177              0            11                 3
##   ENSG00000169062              0             6                 1
##   -------
##   seqinfo: 1 sequence from an unspecified genome; no seqlengths

Now, we build a glmDataSet object.

## An experiment design is set.
colData <- data.frame(condition = factor(c("BN","BN","BC","BC"),
                                        levels = c("BN", "BC")),
                        c("Breast_normal","Breast_normal1",
                            "Breast_cancer","Breast_metastasis"),
                        row.names = 2)
## A RangedGlmDataSet is created
ds <- glmDataSet(GR = Genes.DMPs, colData = colData)

DMG analysis is performed with the function countTest2

DMGs = countTest2(DS = ds, num.cores = 4L, minCountPerIndv = 8,
                countFilter = TRUE, maxGrpCV = c(1, 1),
                Minlog2FC = 1, pvalCutOff = 0.05,
                MVrate = .95, verbose = FALSE)
DMGs

## GRanges object with 3 ranges and 12 metadata columns:
##                   seqnames              ranges strand | Breast_normal
##                      <Rle>           <IRanges>  <Rle> |     <numeric>
##   ENSG00000088387    chr13   98793429-99086625      * |             2
##   ENSG00000185989    chr13 113977783-114132611      * |             1
##   ENSG00000283361    chr13 114179238-114223084      * |             2
##                   Breast_normal1 Breast_cancer Breast_metastasis    log2FC
##                        <numeric>     <numeric>         <numeric> <numeric>
##   ENSG00000088387              0            33                 7   2.35138
##   ENSG00000185989              0            32                 5   2.56495
##   ENSG00000283361              0            20                 6   1.94591
##                   scaled.deviance     pvalue          model   adj.pval
##                         <numeric>  <numeric>    <character>  <numeric>
##   ENSG00000088387         10.2110 0.00192563   Neg.Binomial 0.00332571
##   ENSG00000185989          9.4506 0.00258248 Neg.Binomial.W 0.00332571
##   ENSG00000283361         10.0206 0.00332571   Neg.Binomial 0.00332571
##                   CT.SignalDensity TT.SignalDensity SignalDensityVariation
##                          <numeric>        <numeric>              <numeric>
##   ENSG00000088387      3.41068e-06      6.82135e-05            6.48028e-05
##   ENSG00000185989      3.22937e-06      1.19487e-04            1.16257e-04
##   ENSG00000283361      2.28066e-05      2.96486e-04            2.73679e-04
##   -------
##   seqinfo: 1 sequence from an unspecified genome; no seqlengths

Here, we adds the gene alias:

hits = findOverlaps(DMGs, genes, type = "equal")
DMGs = DMGs[queryHits(hits)]
DMGs$alias <- genes$Name[subjectHits(hits)]

BRCA2, a breast cancer associated risk gene, is found between the DMGs

# DMGs
DMGs[ grep( "ENSG00000139618", names(DMGs) ) ]

## GRanges object with 0 ranges and 13 metadata columns:
##    seqnames    ranges strand | Breast_normal Breast_normal1 Breast_cancer
##       <Rle> <IRanges>  <Rle> |     <numeric>      <numeric>     <numeric>
##    Breast_metastasis    log2FC scaled.deviance    pvalue       model  adj.pval
##            <numeric> <numeric>       <numeric> <numeric> <character> <numeric>
##    CT.SignalDensity TT.SignalDensity SignalDensityVariation       alias
##           <numeric>        <numeric>              <numeric> <character>
##   -------
##   seqinfo: 1 sequence from an unspecified genome; no seqlengths

Concluding remarks

Results fundamentally depend on the use of the proper reference samples. If there is not enough control samples to create an independent reference virtual individual, then the centroid from the control group can be used instead, which can be built with function poolFromGRlist.

The reference is used to put individuals from control and patients in the same system of reference. The information divergences of methylation levels with respect to the reference individual are estimated for each cytosine site for each individual, from control and treatment. As a result, the information on the natural spontaneous variation in the control and treatment populations is carried in the probability distributions of the information divergences, estimated for each individual. It does not matter how statistically significant a DMP would be in the group of patients.

What really matter is how big is the probability to observe the same methylation event in the control group in respect to the patient group. Decisions are ultimately made on the basis of such classification probabilities at single cytosine positions. The signal detection based approach implemented in Methyl-IT is addressed to confront the mentioned issue and, in consequence, the level of resolution of Methyl-IT approach reach the single cytosine site.

Acknowledgments

We thank Professor David J Miller for valuable conversations and suggestions on our mathematical modeling.

Funding The work was supported by funding from NSF-SBIR

(2015-33610-23428-UNL) and the Bill and Melinda Gates Foundation (OPP1088661).

References.

R. Sanchez, S. A. Mackenzie, Information Thermodynamics of Cytosine DNA Methylation. PLOS ONE 11, e0150427 (2016).

R. Sanchez, X. Yang, T. Maher, S. Mackenzie, Discrimination of DNA Methylation Signal from Background Variation for Clinical Diagnostics. Int. J. Mol. Sci. 20, 5343 (2019).

M. Greiner, D. Pfeiffer, R. D. Smith, Principles and practical application of the receiver-operating characteristic analysis for diagnostic tests. Preventive Veterinary Medicine 45, 23–41 (2000).

J. V. Carter, J. Pan, S. N. Rai, S. Galandiuk, ROC-ing along: Evaluation and interpretation of receiver operating characteristic curves. Surgery 159, 1638–1645 (2016).

R. Harpaz, et al., Performance of Pharmacovigilance Signal Detection Algorithms for the FDA Adverse Event Reporting System. Clin Pharmacol ther 93, 1–19 (2013).

S. Kruspe, et al., Rapid and Sensitive Detection of Breast Cancer Cells in Patient Blood with Nuclease-Activated Probe Technology. Molecular Therapy - Nucleic Acids 8, 542–557 (2017).

P. M. D. Severin, X. Zou, H. E. Gaub, K. Schulten, Cytosine methylation alters DNA mechanical properties. Nucleic acids research 39, 8740–51 (2011).

T. T. M. Ngo, et al., Effects of cytosine modifications on DNA flexibility and nucleosome mechanical stability. Nature Communications 7, 10813 (2016).

A. Basu, A. Mandal, L. Pardo, Hypothesis testing for two discrete populations based on the Hellinger distance. Statistics & Probability Letters 80, 206–214 (2010).

10.

K. Hebestreit, M. Dugas, H.-U. Klein, Detection of significantly differentially methylated regions in targeted bisulfite sequencing data. Bioinformatics (Oxford, England) 29, 1647–53 (2013).

11.

M. D. Robinson, et al., Statistical methods for detecting differentially methylated loci and regions. Frontiers in genetics 5, 324 (2014).

12.

E. Dolzhenko, A. D. Smith, Using beta-binomial regression for high-precision differential methylation analysis in multifactor whole-genome bisulfite sequencing experiments. BMC bioinformatics 15, 215 (2014).

13.

P. Baldi, S. Brunak, Bioinformatics: the machine learning approach, Second (MIT Press, 2001).

14.

J. P. Stevens, Applied Multivariate Statistics for the Social Sciences, Fifth Edit (Routledge Academic, 2009).

15.

M. López-Ratón, M. X. Rodríguez-Álvarez, C. Cadarso-Suárez, F. Gude-Sampedro, Others, OptimalCutpoints: an R package for selecting optimal cutpoints in diagnostic tests. Journal of statistical software 61, 1–36 (2014).

16.

T. Fawcett, An introduction to ROC analysis (2005) https:/doi.org/10.1016/j.patrec.2005.10.010.

17.

W. J. Youden, Index for rating diagnostic tests. Cancer 3, 32–35 (1950).

18.

M. I. Love, W. Huber, S. Anders, Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biology 15, 1–34 (2014).

19.

J. S. Choy, et al., DNA methylation increases nucleosome compaction and rigidity. Journal of the American Chemical Society 132, 1782–1783 (2010).

Session info

Here is the output of sessionInfo() on the system on which this document was compiled running pandoc 3.1.1:

## R version 4.3.2 (2023-10-31)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 22.04.3 LTS
## 
## Matrix products: default
## BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.10.0 
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## time zone: America/New_York
## tzcode source: system (glibc)
## 
## attached base packages:
## [1] grid      stats4    stats     graphics  grDevices utils     datasets 
## [8] methods   base     
## 
## other attached packages:
##  [1] VennDiagram_1.7.3    futile.logger_1.4.3  gridExtra_2.3       
##  [4] reshape2_1.4.4       ggplot2_3.4.4        BiocParallel_1.34.2 
##  [7] MethylIT_0.3.2.7     rtracklayer_1.60.1   GenomicRanges_1.52.1
## [10] GenomeInfoDb_1.36.4  IRanges_2.34.1       S4Vectors_0.38.2    
## [13] BiocGenerics_0.46.0  knitr_1.45          
## 
## loaded via a namespace (and not attached):
##   [1] rstudioapi_0.15.0           jsonlite_1.8.7             
##   [3] magrittr_2.0.3              farver_2.1.1               
##   [5] rmarkdown_2.25              fs_1.6.3                   
##   [7] BiocIO_1.10.0               zlibbioc_1.46.0            
##   [9] ragg_1.2.6                  vctrs_0.6.5                
##  [11] memoise_2.0.1               Rsamtools_2.16.0           
##  [13] RCurl_1.98-1.13             htmltools_0.5.7            
##  [15] S4Arrays_1.0.6              lambda.r_1.2.4             
##  [17] pROC_1.18.5                 caret_6.0-94               
##  [19] sass_0.4.8                  parallelly_1.36.0          
##  [21] bslib_0.6.1                 desc_1.4.3                 
##  [23] plyr_1.8.9                  futile.options_1.0.1       
##  [25] lubridate_1.9.3             cachem_1.0.8               
##  [27] GenomicAlignments_1.36.0    lifecycle_1.0.4            
##  [29] minpack.lm_1.2-4            iterators_1.0.14           
##  [31] pkgconfig_2.0.3             Matrix_1.6-4               
##  [33] R6_2.5.1                    fastmap_1.1.1              
##  [35] GenomeInfoDbData_1.2.10     MatrixGenerics_1.12.3      
##  [37] future_1.33.0               digest_0.6.33              
##  [39] colorspace_2.1-0            AnnotationDbi_1.62.2       
##  [41] textshaping_0.3.7           RSQLite_2.3.3              
##  [43] labeling_0.4.3              randomForest_4.7-1.1       
##  [45] fansi_1.0.6                 timechange_0.2.0           
##  [47] httr_1.4.7                  abind_1.4-5                
##  [49] compiler_4.3.2              proxy_0.4-27               
##  [51] bit64_4.0.5                 withr_2.5.2                
##  [53] DBI_1.1.3                   highr_0.10                 
##  [55] MASS_7.3-60                 lava_1.7.3                 
##  [57] DelayedArray_0.26.7         rjson_0.2.21               
##  [59] ModelMetrics_1.2.2.2        tools_4.3.2                
##  [61] future.apply_1.11.0         nnet_7.3-19                
##  [63] glue_1.6.2                  restfulr_0.0.15            
##  [65] nlme_3.1-164                generics_0.1.3             
##  [67] recipes_1.0.8               gtable_0.3.4               
##  [69] class_7.3-22                data.table_1.14.8          
##  [71] utf8_1.2.4                  XVector_0.40.0             
##  [73] foreach_1.5.2               pillar_1.9.0               
##  [75] stringr_1.5.1               genefilter_1.82.1          
##  [77] splines_4.3.2               dplyr_1.1.4                
##  [79] lattice_0.22-5              survival_3.5-7             
##  [81] bit_4.0.5                   annotate_1.78.0            
##  [83] tidyselect_1.2.0            Biostrings_2.68.1          
##  [85] SummarizedExperiment_1.30.2 xfun_0.41                  
##  [87] Biobase_2.60.0              hardhat_1.3.0              
##  [89] timeDate_4022.108           matrixStats_1.2.0          
##  [91] proto_1.0.0                 stringi_1.8.3              
##  [93] yaml_2.3.8                  evaluate_0.23              
##  [95] codetools_0.2-19            tibble_3.2.1               
##  [97] cli_3.6.2                   rpart_4.1.23               
##  [99] nls2_0.3-3                  xtable_1.8-4               
## [101] systemfonts_1.0.5           munsell_0.5.0              
## [103] jquerylib_0.1.4             Rcpp_1.0.11                
## [105] globals_0.16.2              png_0.1-8                  
## [107] XML_3.99-0.16               parallel_4.3.2             
## [109] pkgdown_2.0.7               gower_1.0.1                
## [111] blob_1.2.4                  bitops_1.0-7               
## [113] listenv_0.9.0               ipred_0.9-14               
## [115] e1071_1.7-13                scales_1.3.0               
## [117] prodlim_2023.08.28          purrr_1.0.2                
## [119] crayon_1.5.2                rlang_1.1.2                
## [121] formatR_1.14                KEGGREST_1.40.1