Abstract

A fast introduction into methylation analysis with Methyl-IT is provided here. The methylome analysis implemented in Methyl-IT based on a signal-detection and machine-learning approach. Vignettes with further examples are available at https://genomaths.github.io/methylit/.

library(MethylIT)

Background

MethylIT is an R package for methylome analysis based on a signal-detection machine learning approach (SD-ML). This approach is postulated to provide greater sensitivity for resolving true signal from the methylation background within the methylome (Sanchez and Mackenzie 2016; Sanchez et al. 2019). Because the biological signal created within the dynamic methylome environment characteristic of plants is not free from background noise, the approach, designated Methyl-IT, includes the application of signal detection theory (Greiner, Pfeiffer, and Smith 2000; Carter et al. 2016; Harpaz et al. 2013; Kruspe et al. 2017). A basic requirement for the application of signal detection is the knowledge of the probability distribution of the background noise, which is used as null hypothesis in the detection of the methylation signal. This probability distribution is, in general, a member of generalized gamma distribution family, and it can be deduced on a statistical mechanical/thermodynamics basis from DNA methylation induced by thermal fluctuations (Sanchez and Mackenzie 2016).

Data sets

Methyl-IT uses the read counts of methylation called derived from whole-genome bisulfite sequencing (WGBS). Methylome datasets of WGBS are available at Gene Expression Omnibus (GEO DataSets). For the current example, we just load RData files containing the read-counts from several samples. Please, notice that we are using here just a toy data set, which permits us to move us fast on Methyl-IT pipeline.

## ===== Download methylation data from PSU's GitLab ========
url <- paste0("https://git.psu.edu/genomath/MethylIT_examples",
            "/raw/master/MethylIT_testing/",
            "memoryLine_100k_samples_cg.RData")
temp <- tempfile(fileext = ".RData")
download.file(url = url, destfile = temp)
load(temp)
file.remove(temp); rm(temp, url)
#> [1] TRUE

names(meth_samples)
#>  [1] "W_3_1" "W_3_2" "W_3_3" "W_3_4" "W_3_5" "M_3_1" "M_3_2" "M_3_3" "M_3_4"
#> [10] "M_3_5"

Gene annotation

Download Arabidopsis thaliana TAIR10 gene annotation data

url <- paste0("https://git.psu.edu/genomath/MethylIT_examples",
            "/raw/master/MethylIT_testing/",
            "Arabidopsis_thaliana.TAIR10.38_only_genes.RData")
temp <- tempfile(fileext = ".RData")
download.file(url = url, destfile = temp)
load(temp)
file.remove(temp); rm(temp, url)
#> [1] TRUE

The reference sample

To perform the comparison between the uncertainty of methylation levels from each group of individuals, control \((c)\) and treatment \((t)\), we should estimate the uncertainty variation with respect to the same reference individual on the mentioned metric space. The reason to measure the uncertainty variation with respect to the same reference resides in that even sibling individuals follow an independent ontogenetic development and, consequently, their developments follow independent stochastic processes. The reference individual is estimated with function poolFromGRlist:

The centroid (of methylation levels) of the group control is used as reference samples. Notice that poolFromGRlist function only uses numerical data. Hence, we must specify the columns carrying the read-count data.

idx <- grep("W", names(meth_samples))
ref <- poolFromGRlist(meth_samples[idx], stat = "mean", num.cores = 4L)
#> *** Building a unique GRanges object from the list...
#>  *** Building coordinates for the new GRanges object ...
#>  *** Strand information is preserved ...
#>  *** Processing GRanges for sample: #1
#>  *** Processing GRanges for sample: #2...
#>  *** Processing GRanges for sample: #3...
#>  *** Processing GRanges for sample: #4...
#>  *** Processing GRanges for sample: #5...
#>  *** Sorting by chromosomes and start positions...
#> *** Building a virtual methylome...

Estimation of information divergences

To perform the comparison between the uncertainty of methylation levels from each group of individuals, control \((c)\) and treatment \((t)\), the divergence between the methylation levels of each individual is estimated with respect to the reference sample.

ml_samples <- meth_samples[c("W_3_1","W_3_2","W_3_3","W_3_4","W_3_5",
                             "M_3_1","M_3_2","M_3_3","M_3_4","M_3_5")]
ml_hd <- estimateDivergence(ref = ref,
                            indiv = ml_samples,
                            Bayesian = TRUE,
                            min.coverage = c(4,4),
                            min.meth = 1,
                            high.coverage = 500,
                            percentile = 0.999,
                            num.cores = 4L,
                            tasks = 2L, verbose = FALSE)
ml_hd$M_3_1
#> GRanges object with 13097 ranges and 9 metadata columns:
#>           seqnames    ranges strand |        c1        t1        c2        t2
#>              <Rle> <IRanges>  <Rle> | <numeric> <numeric> <numeric> <numeric>
#>       [1]        1       109      + |         3         0         4         0
#>       [2]        1       115      + |         3         0         4         0
#>       [3]        1       161      + |         4         2         5         1
#>       [4]        1       310      + |         6         6         4        12
#>       [5]        1       311      - |         1         3         3         2
#>       ...      ...       ...    ... .       ...       ...       ...       ...
#>   [13093]       Pt     84348      - |         1       210         2       186
#>   [13094]       Pt     84350      + |         2       159         0       133
#>   [13095]       Pt     84351      - |         1       202         2       175
#>   [13096]       Pt     84353      + |         1       153         0       127
#>   [13097]       Pt     84354      - |         1       194         0       164
#>                   p1         p2          TV      bay.TV       hdiv
#>            <numeric>  <numeric>   <numeric>   <numeric>  <numeric>
#>       [1]   0.840693   0.859045    0.000000   0.0183519 0.00293525
#>       [2]   0.840693   0.859045    0.000000   0.0183519 0.00293525
#>       [3]   0.636949   0.764541    0.166667   0.1275924 0.13704986
#>       [4]   0.501713   0.273283   -0.250000  -0.2284300 0.82382952
#>       [5]   0.321422   0.577309    0.350000   0.2558870 0.36730311
#>       ...        ...        ...         ...         ...        ...
#>   [13093] 0.00827573 0.01481755  0.00589896  0.00654182   0.191203
#>   [13094] 0.01697753 0.00601333 -0.01242236 -0.01096420   0.412627
#>   [13095] 0.00859952 0.01573017  0.00637333  0.00713065   0.205566
#>   [13096] 0.01130985 0.00629390 -0.00649351 -0.00501595   0.103208
#>   [13097] 0.00894968 0.00488764 -0.00512821 -0.00406204   0.109975
#>   -------
#>   seqinfo: 7 sequences from an unspecified genome; no seqlengths

Function estimateDivergence returns a list of GRanges objects with the four columns of counts, the information divergence, and additional columns:

  1. The original matrix of methylated (\(c_i\)) and unmethylated (\(t_i\)) read counts from control (\(i=1\)) and treatment (\(i=2\)) samples.
  2. “p1” and “p2”: methylation levels for control and treatment, respectively.
  3. “bay.TV”: total variation \(TV = p_2 - p_1\).
  4. “TV”: total variation based on simple counts: \(TV=c_1/(c_1+t_1)-c_2/(c_2+t_2)\).
  5. “hdiv”: Hellinger divergence.

Observe that the absolute values \(|TV|\) is an information divergence as well, which is known as total variation distance (widely used in information theory).

Descriptive statistical analysis

A descriptive analysis on the distribution of Hellinger divergences of methylation levels is recommended. After the raw data filtering introduced with estimateDivergence function, some coverage issue could be detected. Notice that Methyl-IT pipeline assumes that the data quality is good enough.

Like in any other statistical analysis, descriptive statistical analysis help us to detect potential issues in the raw data. It is the user responsibility to perform quality check of his/her dataset before start to apply Methyl-IT. Nevertheless, the quality checking of the raw data is not perfect. So, it is healthy to sees for potential issues.

This is an optional step not included in Methyl-IT. It is up to the user to verify the data quality.

## Critical values
critical.val <- do.call(rbind, lapply(ml_hd, function(x) {
    hd.95 = quantile(x$hdiv, 0.95)
    tv.95 = quantile(abs(x$bay.TV), 0.95)
    btv.95 = quantile(abs(x$TV), 0.95)
    return(c(tv = tv.95, hd = hd.95, bay.TV = btv.95)
    )})
)
critical.val
#>          tv.95%    hd.95% bay.TV.95%
#> W_3_1 0.1946736 0.6295961  0.2500000
#> W_3_2 0.2008298 0.6281197  0.2500000
#> W_3_3 0.1856191 0.6115074  0.2500000
#> W_3_4 0.1911147 0.6250676  0.2500000
#> W_3_5 0.1889018 0.6153943  0.2500000
#> M_3_1 0.2920013 1.0756956  0.3500000
#> M_3_2 0.2693999 1.2647016  0.3333333
#> M_3_3 0.2930976 1.1794525  0.3662121
#> M_3_4 0.2536551 1.4159539  0.3333333
#> M_3_5 0.2625488 1.1071675  0.3333333

Quality checking of the coverage

After the raw data filtering introduced with estimateDivergence function, some coverage issue could be detected.

covr <- lapply(ml_hd, function(x) {
                cov <- x$c2 + x$t2
                return(cov)
})

do.call(rbind, lapply(covr, function(x) {
    q60 <- quantile(x, 0.6)
    q9999 <- quantile(x, 0.9999)
    idx1 <- which(x >= q60)
    idx2 <- which(x <= 500)
    q95 <- quantile(x, 0.95)
    idx <- intersect(idx1, idx2)
    return(c(
        round(summary(x)),
        q60, quantile(x, c(0.95, 0.99, 0.999, 0.9999)),
        num.siteGreater_8 = sum(x >= 8),
        q60_to_500 = sum((x >= q60) & (x <= 500)),
        num.siteGreater_500 = sum(x > 500)
))})
)
#>       Min. 1st Qu. Median Mean 3rd Qu. Max. 60% 95%    99%   99.9%   99.99%
#> W_3_1    0       6     11   33      40  497  14 103 397.00 475.644 492.0466
#> W_3_2    0       6     10   28      39  500  13  93 190.00 483.169 499.0000
#> W_3_3    0       7     12   33      40  491  15 112 351.45 424.000 466.4075
#> W_3_4    0       6     10   31      41  499  14  97 370.26 495.000 499.0000
#> W_3_5    0       7     12   34      41  495  16 114 370.02 453.010 486.6202
#> M_3_1    0       6     10   28      38  500  13  88 368.00 492.000 499.6904
#> M_3_2    0       6     11   27      30  499  14  77 349.32 457.000 494.6320
#> M_3_3    0       5      9   20      28  497  11  56 166.00 438.908 490.4885
#> M_3_4    0       7     11   28      30  494  15  81 347.97 452.597 493.0000
#> M_3_5    0       6      9   22      29  488  12  63 179.86 434.158 484.6693
#>       num.siteGreater_8 q60_to_500 num.siteGreater_500
#> W_3_1              8964       5460                   0
#> W_3_2              8170       5358                   0
#> W_3_3              9359       5439                   0
#> W_3_4              8699       5324                   0
#> W_3_5             10063       5583                   0
#> M_3_1              8478       5275                   0
#> M_3_2              8992       5536                   0
#> M_3_3              7667       5577                   0
#> M_3_4              9362       5410                   0
#> M_3_5              8310       5501                   0

Estimation of the best fitted probability distribution model

A basic requirement for the application of signal detection is the knowledge of the probability distribution of the background noise. Probability distribution, as a Weibull distribution model, can be deduced on a statistical mechanical/thermodynamics basis for DNA methylation induced by thermal fluctuations (Sanchez and Mackenzie 2016).

gof <- gofReport(HD = ml_hd,
                model = c("Weibull2P", "Weibull3P",
                        "Gamma2P", "Gamma3P"),
                column = 9,
                output = "all",
                confl_model = TRUE,
                num.cores = 4L,
                task = 2L,
                verbose = FALSE
)
#> Warning in gofReport(HD = ml_hd, model = c("Weibull2P", "Weibull3P", "Gamma2P", : 
#>  The best fitted model for sample(s) M_3_3, M_3_4 require(s) for further analysis. 
#> The model with the lowest AIC must have the highest R.Cross.val
gof$stats
gof$bestModel

Signal detection

The information thermodynamics-based approach is postulated to provide greater sensitivity for resolving true signal from the thermodynamic background within the methylome (Sanchez and Mackenzie 2016). Since the biological signal created within the dynamic methylome environment characteristic of living organisms is not free from background noise, the approach, designated Methyl-IT, includes the application of signal detection theory (Greiner, Pfeiffer, and Smith 2000; Carter et al. 2016; Harpaz et al. 2013; Kruspe et al. 2017). Signal detection is a critical step to increase sensitivity and resolution of methylation signal by reducing the signal-to-noise ratio and objectively controlling the false positive rate and prediction accuracy/risk.

Potential methylation signal

The first estimation in our signal detection step is the identification of the cytosine sites carrying potential methylation signal \(PS\). The methylation regulatory signal does not hold the theoretical distribution and, consequently, for a given level of significance \(\alpha\) (Type I error probability, e.g. \(\alpha = 0.05\)), cytosine positions \(k\) with information divergence \(H_k >= H_{\alpha = 0.05}\) can be selected as sites carrying potential signals \(PS\). The value of \(\alpha\) can be specified. For example, potential signals with \(H_k > H_{\alpha = 0.01}\) can be selected. For each sample, cytosine sites are selected based on the corresponding fitted theoretical distribution model estimated in the previous step.

ps <- getPotentialDIMP(LR = ml_hd,
                        nlms = gof$nlms,
                        div.col = 9L,
                        tv.col = 8L,
                        tv.cut = 0.25,
                        dist.name = gof$bestModel)
ps$M_3_1
#> GRanges object with 413 ranges and 10 metadata columns:
#>         seqnames    ranges strand |        c1        t1        c2        t2
#>            <Rle> <IRanges>  <Rle> | <numeric> <numeric> <numeric> <numeric>
#>     [1]        1       511      + |        10         5         4         8
#>     [2]        1      5459      + |         6         2         2         5
#>     [3]        1     12141      - |         0        10         2         5
#>     [4]        1     24196      + |         6        14        16         0
#>     [5]        1     24312      + |        17         1         9         5
#>     ...      ...       ...    ... .       ...       ...       ...       ...
#>   [409]        5     93766      - |         5         3         1         6
#>   [410]        5     93916      - |         0         6         5         3
#>   [411]        5     94024      + |         0        10         3         2
#>   [412]        5     94054      + |         7         2         0         4
#>   [413]        5     94055      - |         5         1         1         5
#>                p1        p2        TV    bay.TV      hdiv       wprob
#>         <numeric> <numeric> <numeric> <numeric> <numeric>   <numeric>
#>     [1] 0.6531875  0.353670 -0.333333 -0.299518   1.31719 2.46254e-02
#>     [2] 0.7136184  0.326729 -0.464286 -0.386890   1.32192 2.43687e-02
#>     [3] 0.0661127  0.326729  0.285714  0.260616   1.11313 3.88924e-02
#>     [4] 0.3147732  0.955159  0.700000  0.640386  10.38634 2.59456e-10
#>     [5] 0.9120627  0.628866 -0.301587 -0.283196   2.07893 4.73040e-03
#>     ...       ...       ...       ...       ...       ...         ...
#>   [409] 0.6080277  0.210430 -0.482143 -0.397598   1.45669  0.01810026
#>   [410] 0.1015119  0.605237  0.625000  0.503725   2.46603  0.00208971
#>   [411] 0.0661127  0.577309  0.600000  0.511196   2.73856  0.00118239
#>   [412] 0.7409696  0.144571 -0.777778 -0.596398   2.69305  0.00130000
#>   [413] 0.7708080  0.238123 -0.666667 -0.532685   2.15187  0.00405209
#>   -------
#>   seqinfo: 7 sequences from an unspecified genome; no seqlengths

Cutpoint estimation. DMPs

Laws of statistical physics can account for background methylation, a response to thermal fluctuations that presumably functions in DNA stability (Sanchez and Mackenzie 2016; Sanchez et al. 2019). True signal is detected based on the optimal cutpoint (López-Ratón et al. 2014).

The need for the application of (what is now known as) signal detection in cancer research was pointed out by Youden in the midst of the last century (Youden 1950). In the next example, the simple cutpoint estimation available in Methyl-IT is based on the application of Youden index (Youden 1950). Although cutpoints are estimated for a single variable, the classification performance can be evaluated for several variables and applying different model classifiers. A optimal cutpoint distinguishes disease stages from healthy individual. The performance of this classification is given in the output of function estimateCutPoint.

A model classifier can be requested for further predictions and its classification performance is also provided. Below, the selected model classifier is a quadratic discriminant analysis (QDA) (classifier1 = “qda”, clas.perf = TRUE). Four predictor variables are available: the Hellinger divergence of methylation levels (hdiv), total variation distance (TV, absolute difference of methylation levels), relative position of cytosine site in the chromosome (pos), and the logarithm base two of the probability to observe a Hellinger divergence value \(H\) greater than the critical value \(H_{\alpha = 0.05}\) (values given as probabilities in object PS, wprob).

Notice that the cutpoint can be estimated for any of the two currently available information divergences: Hellinger divergence (div.col = 9) or the total variation distance (with Bayesian correction, div.col = 8).

### Cutpoint estimated using a model classifier
cutpoint <- estimateCutPoint(LR = ps,
                            simple = FALSE,
                            div.col = 9,
                            control.names = c("W_3_1","W_3_2","W_3_3",
                                              "W_3_4","W_3_5"),
                            treatment.names = c("M_3_1","M_3_2","M_3_3",
                                                "M_3_4","M_3_5"),
                            column = c(hdiv = TRUE, bay.TV = TRUE,
                                        wprob = TRUE, pos = TRUE),
                            classifier1 = "pca.qda", n.pc = 4,
                            classifier2 = "pca.logistic",
                            center = TRUE,
                            scale = TRUE,
                            clas.perf = TRUE,
                            verbose = FALSE
)
cutpoint$cutpoint
#> [1] 1.003921
cutpoint$testSetPerformance
#> Confusion Matrix and Statistics
#> 
#>           Reference
#> Prediction  CT  TT
#>         CT 138   0
#>         TT   0 662
#>                                      
#>                Accuracy : 1          
#>                  95% CI : (0.9954, 1)
#>     No Information Rate : 0.8275     
#>     P-Value [Acc > NIR] : < 2.2e-16  
#>                                      
#>                   Kappa : 1          
#>                                      
#>  Mcnemar's Test P-Value : NA         
#>                                      
#>             Sensitivity : 1.0000     
#>             Specificity : 1.0000     
#>          Pos Pred Value : 1.0000     
#>          Neg Pred Value : 1.0000     
#>              Prevalence : 0.8275     
#>          Detection Rate : 0.8275     
#>    Detection Prevalence : 0.8275     
#>       Balanced Accuracy : 1.0000     
#>                                      
#>        'Positive' Class : TT         
#> 
cutpoint$testSetModel.FDR
#> [1] 0

Differentially methylated positions (DMPs) can be now identified using selectDIMP function

dmps <- selectDIMP(ps, div.col = 9, cutpoint = cutpoint$cutpoint)
data.frame(dmps =unlist(lapply(dmps, length)))
#>       dmps
#> W_3_1   70
#> W_3_2   74
#> W_3_3   58
#> W_3_4   83
#> W_3_5   58
#> M_3_1  412
#> M_3_2  328
#> M_3_3  341
#> M_3_4  262
#> M_3_5  313

Differentially methylated genes (DMGs)

Differentially methylated genes (DMGs) are estimated from group comparisons for the number of DMPs on gene-body regions between control and treatment.

Function getDIMPatGenes is used to count the number of DMPs at gene-body. Nevertheless, it can be used for any arbitrary specified genomic region as well. The operation of this function is based on the ‘findOverlaps’ function from the ‘GenomicRanges’ Bioconductor R package. The ‘findOverlaps’ function has several critical parameters like, for example, ‘maxgap’, ‘minoverlap’, and ‘ignore.strand’. In our function getDIMPatGenes, except for setting ignore.strand = TRUE and type = “within”, we preserve the rest of default ‘findOverlaps’ parameters. In this case, these are important parameter settings because the local mechanical effect of methylation changes on a DNA region where a gene is located is independent of the strand where the gene is encoded. That is, methylation changes located in any of the two DNA strands inside the gene-body region will affect the flexibility of the DNA molecule (Choy et al. 2010; Severin et al. 2011).

nams <- names(dmps)
dmps_at_genes <- getDIMPatGenes(GR = dmps, GENES = genes, ignore.strand = TRUE)
dmps_at_genes <- uniqueGRanges(dmps_at_genes, columns = 2L,
                                ignore.strand = TRUE, type = "equal")
#>  *** Building coordinates for the new GRanges object ...
#>  *** Setting strand information to * ...
#>  *** Processing GRanges for sample: #1
#>  *** Processing GRanges for sample: #2...
#>  *** Processing GRanges for sample: #3...
#>  *** Processing GRanges for sample: #4...
#>  *** Processing GRanges for sample: #5...
#>  *** Processing GRanges for sample: #6...
#>  *** Processing GRanges for sample: #7...
#>  *** Processing GRanges for sample: #8...
#>  *** Processing GRanges for sample: #9...
#>  *** Processing GRanges for sample: #10...
#>  *** Sorting by chromosomes and start positions...
colnames(mcols(dmps_at_genes)) <- nams
dmps_at_genes
#> GRanges object with 80 ranges and 10 metadata columns:
#>        seqnames      ranges strand |     W_3_1     W_3_2     W_3_3     W_3_4
#>           <Rle>   <IRanges>  <Rle> | <numeric> <numeric> <numeric> <numeric>
#>    [1]        1   3631-5899      * |         0         0         0         2
#>    [2]        1   6788-9130      * |         0         0         1         0
#>    [3]        1 11649-13714      * |         0         0         0         0
#>    [4]        1 23121-31227      * |         2         1         5         1
#>    [5]        1 31170-33171      * |         0         0         0         0
#>    ...      ...         ...    ... .       ...       ...       ...       ...
#>   [76]        5 58011-60136      * |         0         0         0         0
#>   [77]        5 61017-63936      * |         0         0         1         0
#>   [78]        5 84474-86275      * |         0         0         0         0
#>   [79]        5 86543-90117      * |         0         0         0         0
#>   [80]        5 91838-95701      * |         0         2         0         0
#>            W_3_5     M_3_1     M_3_2     M_3_3     M_3_4     M_3_5
#>        <numeric> <numeric> <numeric> <numeric> <numeric> <numeric>
#>    [1]         0         1         2         1         3         1
#>    [2]         0         0         0         1         0         1
#>    [3]         0         1         0         0         0         0
#>    [4]         1        11        12        11        11         9
#>    [5]         1         0         0         0         1         1
#>    ...       ...       ...       ...       ...       ...       ...
#>   [76]         0         0         0         0         0         1
#>   [77]         1         7         6         5         6         5
#>   [78]         0         0         0         0         0         1
#>   [79]         0         0         0         0         0         1
#>   [80]         0         7         3         4         5         5
#>   -------
#>   seqinfo: 5 sequences from an unspecified genome; no seqlengths

Setting the experiment design

The experimental design is set with glmDataSet function

colData <- data.frame(condition = factor(c("WT", "WT", "WT", "WT", "WT",
                                            "ML", "ML", "ML", "ML", "ML"),
                                        levels = c("WT", "ML")),
                    nams,
                    row.names = 2)
## A RangedGlmDataSet is created
ds <- glmDataSet(GR = dmps_at_genes, colData = colData)

Testing for DMGs

DMGs are detected using function countTest2.

dmgs <- countTest2(DS = ds, num.cores = 4L,
                    tasks = 2L,
                    minCountPerIndv = 7,
                    maxGrpCV = c(1, 1),
                    Minlog2FC = 1,
                    CountPerBp = 0.001,
                    test = "LRT",
                    verbose = TRUE)
#> *** Number of GR after filtering counts 4
#> *** GLM...
dmgs
#> GRanges object with 4 ranges and 18 metadata columns:
#>       seqnames      ranges strand |     W_3_1     W_3_2     W_3_3     W_3_4
#>          <Rle>   <IRanges>  <Rle> | <numeric> <numeric> <numeric> <numeric>
#>   [1]        1 23121-31227      * |         2         1         5         1
#>   [2]        3 31145-34670      * |         0         2         3         4
#>   [3]        4 51166-53449      * |         5         1         2         0
#>   [4]        5 34538-37999      * |         0         0         2         0
#>           W_3_5     M_3_1     M_3_2     M_3_3     M_3_4     M_3_5    log2FC
#>       <numeric> <numeric> <numeric> <numeric> <numeric> <numeric> <numeric>
#>   [1]         1        11        12        11        11         9   1.68640
#>   [2]         2         6        10        10         6         7   1.01160
#>   [3]         0        15        10         6        10         8   1.42403
#>   [4]         1         8         7         9         6         6   1.63413
#>       scaled.deviance      pvalue        model    adj.pval CT.SignalDensity
#>             <numeric>   <numeric>  <character>   <numeric>        <numeric>
#>   [1]         33.2474 8.11462e-09 Neg.Binomial 1.62292e-08      0.000246700
#>   [2]         13.5878 2.27656e-04 Neg.Binomial 2.27656e-04      0.000623936
#>   [3]         26.9510 2.08680e-07 Neg.Binomial 2.78240e-07      0.000700525
#>   [4]         44.2033 2.95980e-11 Neg.Binomial 1.18392e-10      0.000173310
#>       TT.SignalDensity SignalDensityVariation
#>              <numeric>              <numeric>
#>   [1]       0.00133218             0.00108548
#>   [2]       0.00221214             0.00158820
#>   [3]       0.00429072             0.00359019
#>   [4]       0.00207972             0.00190641
#>   -------
#>   seqinfo: 5 sequences from an unspecified genome; no seqlengths

References

Carter, Jane V., Jianmin Pan, Shesh N. Rai, and Susan Galandiuk. 2016. “ROC-ing along: Evaluation and interpretation of receiver operating characteristic curves.” Surgery 159 (6): 1638–45. https://doi.org/10.1016/j.surg.2015.12.029.

Choy, John S., Sijie Wei, Ju Yeon Lee, Song Tan, Steven Chu, and Tae Hee Lee. 2010. “DNA methylation increases nucleosome compaction and rigidity.” Journal of the American Chemical Society 132 (6): 1782–3. https://doi.org/10.1021/ja910264z.

Greiner, M, D Pfeiffer, and R D Smith. 2000. “Principles and practical application of the receiver-operating characteristic analysis for diagnostic tests.” Preventive Veterinary Medicine 45 (1-2): 23–41. https://doi.org/10.1016/S0167-5877(00)00115-X.

Harpaz, Rave, William DuMouchel, Paea LePendu, Anna Bauer-Mehren, Patrick Ryan, and Nigam H Shah. 2013. “Performance of Pharmacovigilance Signal Detection Algorithms for the FDA Adverse Event Reporting System.” Clin Pharmacol Ther 93 (6): 1–19. https://doi.org/10.1038/clpt.2013.24.Performance.

Kruspe, Sven, David D. Dickey, Kevin T. Urak, Giselle N. Blanco, Matthew J. Miller, Karen C. Clark, Elliot Burghardt, et al. 2017. “Rapid and Sensitive Detection of Breast Cancer Cells in Patient Blood with Nuclease-Activated Probe Technology.” Molecular Therapy - Nucleic Acids 8 (September): 542–57. https://doi.org/10.1016/j.omtn.2017.08.004.

López-Ratón, Mónica, Mar\(\backslash\)\(\backslash\)ia Xosé Rodríguez-Álvarez, Carmen Cadarso-Suárez, Francisco Gude-Sampedro, and Others. 2014. “OptimalCutpoints: an R package for selecting optimal cutpoints in diagnostic tests.” Journal of Statistical Software 61 (8): 1–36. https://www.jstatsoft.org/article/view/v061i08.

Sanchez, Robersy, and Sally A. Mackenzie. 2016. “Information Thermodynamics of Cytosine DNA Methylation.” Edited by Barbara Bardoni. PLOS ONE 11 (3): e0150427. https://doi.org/10.1371/journal.pone.0150427.

Sanchez, Robersy, Xiaodong Yang, Thomas Maher, and Sally Mackenzie. 2019. “Discrimination of DNA Methylation Signal from Background Variation for Clinical Diagnostics.” Int. J. Mol. Sci. 20 (21): 5343. https://doi.org/https://doi.org/10.3390/ijms20215343.

Severin, Philip M D, Xueqing Zou, Hermann E Gaub, and Klaus Schulten. 2011. “Cytosine methylation alters DNA mechanical properties.” Nucleic Acids Research 39 (20): 8740–51. https://doi.org/10.1093/nar/gkr578.

Youden, W. J. 1950. “Index for rating diagnostic tests.” Cancer 3 (1): 32–35. https://doi.org/10.1002/1097-0142(1950)3:1<32::AID-CNCR2820030106>3.0.CO;2-3.

Session info

Here is the output of sessionInfo() on the system on which this document was compiled running pandoc 2.7.3:

#> R version 4.0.0 (2020-04-24)
#> Platform: x86_64-pc-linux-gnu (64-bit)
#> Running under: Ubuntu 18.04.4 LTS
#> 
#> Matrix products: default
#> BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.7.1
#> LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.7.1
#> 
#> locale:
#>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
#>  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
#>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
#>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
#>  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
#> 
#> attached base packages:
#> [1] parallel  stats4    stats     graphics  grDevices utils     datasets 
#> [8] methods   base     
#> 
#> other attached packages:
#> [1] MethylIT_0.3.2.1     rtracklayer_1.48.0   GenomicRanges_1.40.0
#> [4] GenomeInfoDb_1.24.0  IRanges_2.22.2       S4Vectors_0.26.1    
#> [7] BiocGenerics_0.34.0  BiocStyle_2.16.0    
#> 
#> loaded via a namespace (and not attached):
#>  [1] nlme_3.1-148                bitops_1.0-6               
#>  [3] matrixStats_0.56.0          fs_1.4.1                   
#>  [5] bit64_0.9-7                 lubridate_1.7.9            
#>  [7] rprojroot_1.3-2             tools_4.0.0                
#>  [9] backports_1.1.7             R6_2.4.1                   
#> [11] rpart_4.1-15                DBI_1.1.0                  
#> [13] colorspace_1.4-1            nnet_7.3-14                
#> [15] withr_2.2.0                 tidyselect_1.1.0           
#> [17] bit_1.1-15.2                compiler_4.0.0             
#> [19] Biobase_2.48.0              desc_1.2.0                 
#> [21] DelayedArray_0.14.0         bookdown_0.19              
#> [23] scales_1.1.1                genefilter_1.70.0          
#> [25] pkgdown_1.5.1               stringr_1.4.0              
#> [27] digest_0.6.25               Rsamtools_2.4.0            
#> [29] rmarkdown_2.2               XVector_0.28.0             
#> [31] pkgconfig_2.0.3             htmltools_0.4.0            
#> [33] rlang_0.4.6                 RSQLite_2.2.0              
#> [35] rstudioapi_0.11             generics_0.0.2             
#> [37] BiocParallel_1.22.0         dplyr_1.0.0                
#> [39] ModelMetrics_1.2.2.2        nls2_0.2                   
#> [41] RCurl_1.98-1.2              magrittr_1.5               
#> [43] GenomeInfoDbData_1.2.3      Matrix_1.2-18              
#> [45] Rcpp_1.0.4.6                munsell_0.5.0              
#> [47] lifecycle_0.2.0             stringi_1.4.6              
#> [49] pROC_1.16.2                 yaml_2.2.1                 
#> [51] MASS_7.3-51.6               SummarizedExperiment_1.18.1
#> [53] zlibbioc_1.34.0             plyr_1.8.6                 
#> [55] recipes_0.1.12              blob_1.2.1                 
#> [57] grid_4.0.0                  crayon_1.3.4               
#> [59] lattice_0.20-41             Biostrings_2.56.0          
#> [61] splines_4.0.0               annotate_1.66.0            
#> [63] knitr_1.28                  pillar_1.4.4               
#> [65] reshape2_1.4.4              codetools_0.2-16           
#> [67] XML_3.99-0.3                glue_1.4.1                 
#> [69] evaluate_0.14               data.table_1.12.8          
#> [71] BiocManager_1.30.10         vctrs_0.3.1                
#> [73] foreach_1.5.0               gtable_0.3.0               
#> [75] purrr_0.3.4                 assertthat_0.2.1           
#> [77] ggplot2_3.3.1               xfun_0.14                  
#> [79] gower_0.2.1                 prodlim_2019.11.13         
#> [81] xtable_1.8-4                e1071_1.7-3                
#> [83] ArgumentCheck_0.10.2        class_7.3-17               
#> [85] survival_3.1-12             timeDate_3043.102          
#> [87] minpack.lm_1.2-1            tibble_3.0.1               
#> [89] iterators_1.0.12            AnnotationDbi_1.50.0       
#> [91] GenomicAlignments_1.24.0    memoise_1.1.0              
#> [93] lava_1.6.7                  ellipsis_0.3.1             
#> [95] caret_6.0-86                ipred_0.9-9