Abstract

Herein, an example of Methyl-IT application to the detection of differential methylated regions (DMRs) on first generation MSH1 RNAi transgene segregants data set is presented.

Introduction

MethylIT is an R package for methylome analysis based on a signal-detection machine learning approach (SD-ML). This approach is postulated to provide greater sensitivity for resolving true signal from the methylation background within the methylome (Sanchez and Mackenzie 2016; Sanchez et al. 2019). Because the biological signal created within the dynamic methylome environment characteristic of plants is not free from background noise, the approach, designated Methyl-IT, includes the application of signal detection theory (Greiner, Pfeiffer, and Smith 2000; Carter et al. 2016; Harpaz et al. 2013; Kruspe et al. 2017).

A basic requirement for the application of signal detection is the knowledge of the probability distribution of the background noise, which is used as null hypothesis in the detection of the methylation signal. This probability distribution is, in general, a member of generalized gamma distribution family, and it can be deduced on a statistical mechanical/thermodynamics basis from DNA methylation induced by thermal fluctuations (Sanchez and Mackenzie 2016).

Herein, we provide an example of Methyl-IT application to DMR estimation on patients with PALL (Wahlberg et al. 2016). Due to the size of human methylome the current example only covers the analysis of chromosome 9. A full description of Methyl-IT application of methylome analysis for Clinical Diagnostics is given in the manuscript (Sanchez et al. 2019). A wider analysis of PALL data set is given in (Sanchez and Mackenzie 2020).

The first step of DMR estimation follows the general step of Methyl-IT pipeline for the estimation of DMPs. In Methyl-IT, a DMP is a cytosine positions that, with high probability, carries a statistically significant methylation change in respect to the same position in a control population. Once DMPs are identified, a simple algorithm can be applied to identify clusters of DMPs using dmpClusters function. Next, countTest2 is applied to test whether a given cluster of DMPs is a DMR.

It is worthy to notices that any clustering algorithm will impose non-natural constraints on the methylation patterning, which not necessarily will be associated to the local DNA physico-chemical properties and to local molecular evolutionary features of every DMR detected. Every cytosine methylation has a concrete physical effect of the DNA, altering the DNA persistence length, i.e., altering the local DNA flexibility (Severin et al. 2011; Pérez et al. 2012; Yusufaly, Li, and Olson 2013). DNA flexibility was shown to affect the binding of proteins to methylated DNA and DNA-containing lesion (Teng and Hwang 2018; Hodges, Hudson, and Buck-Koehntop 2020) and nucleosome mechanical stability (Ngo et al. 2016). In other words, the sizes of naturally occurrying methylation motifs are not arbitrarily set, but very well constrained under molecular evolutionary pressure, probably, particular for each species.

The R packages required in this example are:

        library(MethylIT)
        library(MethylIT.utils)
        library(ggplot2) # graphic
        library(grid) # For multiple plots
        library(gridExtra) # For multiple plots
        library(rtracklayer) # To import gene annotation

Estimation of differentially methylated regions (DMRs)

DMP data set is available for download.

## ===== Download methylation data from Github ========
url <- paste0("https://github.com/genomaths/genomaths.github.io/raw/master/AT_memory/",
              "dmps_mm-vs-nm_all-context_gen1_nmRef_min-meth-3_cov-4_04-16-20.RData")
temp <- tempfile(fileext = ".RData")
download.file(url = url, destfile = temp)
load(temp)
file.remove(temp); rm(temp, url)

The gene annotation

## ===== Download methylation data from GitHub ========
url <- paste0("https://github.com/genomaths/genomaths.github.io/raw/master/",
              "Annotation/Arabidopsis_thaliana_genes_TAIR10.38.gtf.RData")
temp <- tempfile(fileext = ".RData")
download.file(url = url, destfile = temp)
load(temp)
file.remove(temp); rm(temp, url);

GENES <- genes[genes$type == "gene"]

## To get exons and introns
exons <- genes[genes$type == "exon"]
introns <- setdiff(x = genes, y = exons, ignore.strand = TRUE)

## To add the corresponding gene names to introns
hits <- findOverlaps(introns, genes, type = "within", ignore.strand = TRUE)
introns <- introns[queryHits(hits)]
mcols(introns) <- mcols(genes[subjectHits(hits)])

Cluster of DMPs

Function dmpClusters search for cluster of DMPs found from all the samples. The basic idea reside in that if a given cluster of DMP is a DMR, then it will be consistently differentially methylated in all treatment samples, and we can identify them applying function countTest2.

There is not a unique approach to build DMP clusters, since several clustering criteria can be applied. Herein, we use the two approach implemented in dmpClusters function (follow the link or type ?dmpClusters to see the help for the function).

clust_50_100 = dmpClusters(GR = dmps, maxDist = 50, minNumDMPs = 10,
                           maxClustDist = 100, method = "fixed.int",
                           num.cores = 1L, verbose = FALSE)

clust_50_100
## GRanges object with 18035 ranges and 1 metadata column:
##           seqnames            ranges strand |      dmps
##              <Rle>         <IRanges>  <Rle> | <integer>
##       [1]        1           183-330      * |        27
##       [2]        1           647-809      * |        10
##       [3]        1       18052-18224      * |        14
##       [4]        1       18643-18746      * |        12
##       [5]        1       42279-42423      * |        40
##       ...      ...               ...    ... .       ...
##   [18031]        5 26670460-26670651      * |        15
##   [18032]        5 26788467-26788575      * |        11
##   [18033]        5 26945319-26945446      * |        11
##   [18034]        5 26974071-26974175      * |        10
##   [18035]        5 26974820-26975362      * |        39
##   -------
##   seqinfo: 5 sequences from an unspecified genome; no seqlengths

DMPs clusterization strongly depends on the maximum distance permitted between two consecutive DMPs that belong to the same cluster (parameter: minNumDMPs) and on the minimum number of DMPs that must be present on a detected agglomeration of DMPs to be considered a cluster (parameter: maxDist).

.clust_50_100 = dmpClusters(GR = dmps, maxDist = 50, minNumDMPs = 10,
                           maxClustDist = 100, method = "relaxed",
                           num.cores = 1L, verbose = FALSE)
.clust_50_100
## GRanges object with 17528 ranges and 1 metadata column:
##           seqnames            ranges strand |      dmps
##              <Rle>         <IRanges>  <Rle> | <integer>
##       [1]        1            94-330      * |        29
##       [2]        1           647-880      * |        13
##       [3]        1       17833-18512      * |        35
##       [4]        1       18643-18746      * |        12
##       [5]        1       42279-42423      * |        40
##       ...      ...               ...    ... .       ...
##   [17524]        5 26788399-26788575      * |        12
##   [17525]        5 26835058-26835302      * |        13
##   [17526]        5 26945319-26945446      * |        11
##   [17527]        5 26974071-26974329      * |        14
##   [17528]        5 26974728-26975362      * |        40
##   -------
##   seqinfo: 5 sequences from an unspecified genome; no seqlengths

Alternatively, we can use function dmpClustering from MethylIT.utils R package. Notice that although ‘method’ parameter are named the same, on each function these stand for different approaches.

clust.50.100 = dmpClustering(dmps = dmps, minNumDMPs = 10, win.size = 50,
                            step.size = 50, method = "fixed.int",
                            maxClustDist = 100, verbose = FALSE)
clust.50.100
## GRanges object with 15521 ranges and 1 metadata column:
##           seqnames            ranges strand |      dmps
##              <Rle>         <IRanges>  <Rle> | <integer>
##       [1]        1            94-330      * |        29
##       [2]        1           647-880      * |        13
##       [3]        1       17833-18746      * |        47
##       [4]        1       42279-42423      * |        40
##       [5]        1       44603-44848      * |        19
##       ...      ...               ...    ... .       ...
##   [15517]        5 26835058-26835302      * |        13
##   [15518]        5 26891628-26892012      * |        10
##   [15519]        5 26945319-26945446      * |        11
##   [15520]        5 26974071-26974329      * |        14
##   [15521]        5 26974728-26975362      * |        40
##   -------
##   seqinfo: 5 sequences from an unspecified genome; no seqlengths
.clust.50.100 = dmpClustering(dmps = dmps, minNumDMPs = 10, method = "relaxed",
                            maxClustDist = 100, verbose = FALSE)
.clust.50.100
## GRanges object with 17647 ranges and 1 metadata column:
##           seqnames            ranges strand |      dmps
##              <Rle>         <IRanges>  <Rle> | <integer>
##       [1]        1            94-330      * |        29
##       [2]        1           647-880      * |        13
##       [3]        1       17833-18512      * |        35
##       [4]        1       18643-18746      * |        12
##       [5]        1       42279-42423      * |        40
##       ...      ...               ...    ... .       ...
##   [17643]        5 26788399-26788575      * |        12
##   [17644]        5 26835058-26835302      * |        13
##   [17645]        5 26945319-26945446      * |        11
##   [17646]        5 26974071-26974329      * |        14
##   [17647]        5 26974728-26975362      * |        40
##   -------
##   seqinfo: 5 sequences from an unspecified genome; no seqlengths

DMR detection

The count of the number of DMPs on each region will be accomplished with getDMPatRegions function. This function operates based on findOverlaps function from the ‘GenomicRanges’ Bioconductor R package. The findOverlaps function has several critical parameters like, for example, ‘maxgap’, ‘minoverlap’, and ‘ignore.strand’. In our getDMPatRegions function, except for setting ignore.strand = TRUE and type = “within”, we preserve the rest of default ‘findOverlaps’ parameters.

Experiment design for each clustering approach

Design for method = “fixed.int” from dmpClusters function

## ==== Setting up the experiment design to test for DMRs ===
nams <- names(dmps)
dmps_at_clust_50_100 <- getDMPatRegions(GR = dmps, regions = clust_50_100,
                                       ignore.strand = TRUE)
dmps_at_clust_50_100 <- uniqueGRanges(dmps_at_clust_50_100, columns = 2L,
                                      ignore.strand = TRUE, type = 'equal',
                                      verbose = FALSE)
colnames(mcols(dmps_at_clust_50_100)) <- nams

colData <- data.frame(condition = factor(c("MM", "MM","MM","MM", "MM",
                                           "NM","NM","NM","NM"),
                                         levels = c("NM", "MM")),
                      nams,
                      row.names = 2)

## Build a RangedGlmDataSet object
ds_50_100 <- glmDataSet(GR = dmps_at_clust_50_100, colData = colData)

The GRanges object with DMPs on it

dmps_at_clust_50_100
## GRanges object with 18035 ranges and 9 metadata columns:
##           seqnames            ranges strand |   M_1_105   M_1_168   M_1_179
##              <Rle>         <IRanges>  <Rle> | <numeric> <numeric> <numeric>
##       [1]        1           183-330      * |         0        22         2
##       [2]        1           647-809      * |         0         3         4
##       [3]        1       18052-18224      * |         2         2         1
##       [4]        1       18643-18746      * |         2         1         0
##       [5]        1       42279-42423      * |        15        16         0
##       ...      ...               ...    ... .       ...       ...       ...
##   [18031]        5 26670460-26670651      * |         5         1         8
##   [18032]        5 26788467-26788575      * |         7         1         1
##   [18033]        5 26945319-26945446      * |         3         2         0
##   [18034]        5 26974071-26974175      * |         3         4         0
##   [18035]        5 26974820-26975362      * |         4        11         8
##              M_1_27     M_1_3   NM_1_12   NM_1_45   NM_1_70   NM_1_86
##           <numeric> <numeric> <numeric> <numeric> <numeric> <numeric>
##       [1]         3         1         0         0         2         0
##       [2]         2         1         0         0         0         0
##       [3]         3         7         0         0         0         0
##       [4]         3         0         0         2         1         3
##       [5]        13        27         0         0         0         0
##       ...       ...       ...       ...       ...       ...       ...
##   [18031]         3         0         0         0         0         0
##   [18032]         1         0         1         0         0         0
##   [18033]         2         2         2         1         0         1
##   [18034]         3         0         1         4         2         3
##   [18035]         3        11         2         3         0         0
##   -------
##   seqinfo: 5 sequences from an unspecified genome; no seqlengths

The glmDataSet object.

ds_50_100
## RangedGlmDataSet with 18035 regions and 9 columns (individuals) with factor levels 'NM' and 'MM' 
## The accessible objects in the dataset are: 
##             Length Class      Mode     
## GR           18035 GRanges    S4       
## counts      162315 -none-     numeric  
## colData          1 data.frame list     
## sampleNames      9 -none-     character
## levels           2 -none-     character
## optionData       0 -none-     NULL

Design for method = “relaxed” from dmpClusters function

## ==== Setting up the experiment design to test for DMRs ===
nams <- names(dmps)
.dmps_at_clust_50_100 <- getDMPatRegions(GR = dmps, regions = .clust_50_100,
                                       ignore.strand = TRUE)
.dmps_at_clust_50_100 <- uniqueGRanges(.dmps_at_clust_50_100, columns = 2L,
                                      ignore.strand = TRUE, type = 'equal',
                                      verbose = FALSE)
colnames(mcols(.dmps_at_clust_50_100)) <- nams

## Build a RangedGlmDataSet object
.ds_50_100 <- glmDataSet(GR = .dmps_at_clust_50_100, colData = colData)

Design for method = “fixed.int” from dmpClustering function

## ==== Setting up the experiment design to test for DMRs ===
nams <- names(dmps)
dmps_at_clust.50.100 <- getDMPatRegions(GR = dmps, regions = clust.50.100,
                                       ignore.strand = TRUE)
dmps_at_clust.50.100 <- uniqueGRanges(dmps_at_clust.50.100, columns = 2L,
                                      ignore.strand = TRUE, type = 'equal',
                                      verbose = FALSE)
colnames(mcols(dmps_at_clust.50.100)) <- nams

## Build a RangedGlmDataSet object
ds.50.100 <- glmDataSet(GR = dmps_at_clust.50.100, colData = colData)

Design for method = “relaxed” from dmpClustering function

## ==== Setting up the experiment design to test for DMRs ===
nams <- names(dmps)
dmps_at.clust.50.100 <- getDMPatRegions(GR = dmps, regions = .clust.50.100,
                                       ignore.strand = TRUE)
dmps_at.clust.50.100 <- uniqueGRanges(dmps_at.clust.50.100, columns = 2L,
                                      ignore.strand = TRUE, type = 'equal',
                                      verbose = FALSE)
colnames(mcols(dmps_at.clust.50.100)) <- nams

## Build a RangedGlmDataSet object
ds_50.100 <- glmDataSet(GR = dmps_at.clust.50.100, colData = colData)

DMR detection

Herein, we are requesting a minimum density DMP of 1/100 bp = 0.01/bp (CountPerBp = 0.01), minimum of logarithm of fold changes in the number of DMPs, treatment versus control samples, equal to 1 (Minlog2FC = 1). Also, it is requested a coefficient of variation, on each sample set, not greater than 1 (maxGrpCV = c(1, 1)).

Notice that in the case of species with big genes, for example, human genes, we might want to use lower densities, e.g., CountPerBp = 0.001.

DMRs derived method = “fixed.int” from dmpClusters function

dmrs_fixed.int = countTest2(DS = ds_50_100,
                            minCountPerIndv = 8,
                            maxGrpCV = c(1, 1),
                            Minlog2FC = 1,
                            CountPerBp = 0.01,
                            test = 'LRT',
                            num.cores = 4L,
                            verbose = TRUE)

dmrs_fixed.int

The DMR associated genes at a \(distance \leq 1000\):

hits <- distanceToNearest(dmrs_fixed.int, GENES, ignore.strand=FALSE)

## DMR associated genes 
hits <- hits[mcols(hits)[,1] <= 1000]
GENES[subjectHits(hits)]
## GRanges object with 975 ranges and 4 metadata columns:
##         seqnames            ranges strand |     gene_id     type   gene_biotype
##            <Rle>         <IRanges>  <Rle> | <character> <factor>    <character>
##     [1]        1       43087-43295      - |   AT1G04003     gene         lncRNA
##     [2]        1     239841-242703      - |   AT1G01660     gene protein_coding
##     [3]        1     258717-258963      - |   AT1G04037     gene         lncRNA
##     [4]        1     428650-430720      - |   AT1G02220     gene protein_coding
##     [5]        1   1194187-1196889      + |   AT1G04425     gene          ncRNA
##     ...      ...               ...    ... .         ...      ...            ...
##   [971]        5 23775891-23779855      + |   AT5G58880     gene protein_coding
##   [972]        5 25934568-25935612      + |   AT5G64890     gene protein_coding
##   [973]        5 25967824-25968638      + |   AT5G65005     gene protein_coding
##   [974]        5 26068112-26069652      + |   AT5G65230     gene protein_coding
##   [975]        5 26411796-26414609      - |   AT5G66050     gene protein_coding
##           gene_name
##         <character>
##     [1]        <NA>
##     [2]        <NA>
##     [3]        <NA>
##     [4]      NAC003
##     [5]        <NA>
##     ...         ...
##   [971]        <NA>
##   [972]        PEP2
##   [973]        <NA>
##   [974]     AtMYB53
##   [975]        <NA>
##   -------
##   seqinfo: 5 sequences from an unspecified genome; no seqlengths

DMRs derived method = “relaxed” from dmpClusters function

.dmrs_relaxed = countTest2(DS = .ds_50_100,
                           minCountPerIndv = 8,
                           maxGrpCV = c(1, 1),
                           Minlog2FC = 1,
                           CountPerBp = 0.01,
                           test = 'LRT',
                           num.cores = 4L,
                           verbose = TRUE)
.dmrs_relaxed

The DMR assciated genes at a \(distance \leq 1000\):

hits <- distanceToNearest(.dmrs_relaxed, GENES, ignore.strand=FALSE)

## DMR associated genes 
hits <- hits[mcols(hits)[,1] <= 1000]
GENES[subjectHits(hits)]
## GRanges object with 1215 ranges and 4 metadata columns:
##          seqnames            ranges strand |     gene_id     type
##             <Rle>         <IRanges>  <Rle> | <character> <factor>
##      [1]        1       43087-43295      - |   AT1G04003     gene
##      [2]        1     239841-242703      - |   AT1G01660     gene
##      [3]        1     258717-258963      - |   AT1G04037     gene
##      [4]        1     428650-430720      - |   AT1G02220     gene
##      [5]        1   1194187-1196889      + |   AT1G04425     gene
##      ...      ...               ...    ... .         ...      ...
##   [1211]        5 25934568-25935612      + |   AT5G64890     gene
##   [1212]        5 25967824-25968638      + |   AT5G65005     gene
##   [1213]        5 26068112-26069652      + |   AT5G65230     gene
##   [1214]        5 26161624-26169191      - |   AT5G65460     gene
##   [1215]        5 26411796-26414609      - |   AT5G66050     gene
##            gene_biotype   gene_name
##             <character> <character>
##      [1]         lncRNA        <NA>
##      [2] protein_coding        <NA>
##      [3]         lncRNA        <NA>
##      [4] protein_coding      NAC003
##      [5]          ncRNA        <NA>
##      ...            ...         ...
##   [1211] protein_coding        PEP2
##   [1212] protein_coding        <NA>
##   [1213] protein_coding     AtMYB53
##   [1214] protein_coding        KCA2
##   [1215] protein_coding        <NA>
##   -------
##   seqinfo: 5 sequences from an unspecified genome; no seqlengths

DMRs derived method = “fixed.int” from dmpClustering function

dmrs.fixed.int = countTest2(DS = ds.50.100,
                            minCountPerIndv = 8,
                            maxGrpCV = c(1, 1),
                            Minlog2FC = 1,
                            CountPerBp = 0.01,
                            test = 'LRT',
                            num.cores = 4L,
                            verbose = TRUE)
dmrs.fixed.int

The DMR assciated genes at a \(distance \leq 1000\):

hits <- distanceToNearest(dmrs.fixed.int, GENES, ignore.strand=FALSE)

## DMR associated genes 
hits <- hits[mcols(hits)[,1] <= 1000]
GENES[subjectHits(hits)]
## GRanges object with 1151 ranges and 4 metadata columns:
##          seqnames            ranges strand |     gene_id     type
##             <Rle>         <IRanges>  <Rle> | <character> <factor>
##      [1]        1       43087-43295      - |   AT1G04003     gene
##      [2]        1     258717-258963      - |   AT1G04037     gene
##      [3]        1     428650-430720      - |   AT1G02220     gene
##      [4]        1   1194187-1196889      + |   AT1G04425     gene
##      [5]        1   1431177-1433010      - |   AT1G05010     gene
##      ...      ...               ...    ... .         ...      ...
##   [1147]        5 25934568-25935612      + |   AT5G64890     gene
##   [1148]        5 25967824-25968638      + |   AT5G65005     gene
##   [1149]        5 26068112-26069652      + |   AT5G65230     gene
##   [1150]        5 26161624-26169191      - |   AT5G65460     gene
##   [1151]        5 26411796-26414609      - |   AT5G66050     gene
##            gene_biotype   gene_name
##             <character> <character>
##      [1]         lncRNA        <NA>
##      [2]         lncRNA        <NA>
##      [3] protein_coding      NAC003
##      [4]          ncRNA        <NA>
##      [5] protein_coding        ACO4
##      ...            ...         ...
##   [1147] protein_coding        PEP2
##   [1148] protein_coding        <NA>
##   [1149] protein_coding     AtMYB53
##   [1150] protein_coding        KCA2
##   [1151] protein_coding        <NA>
##   -------
##   seqinfo: 5 sequences from an unspecified genome; no seqlengths

DMRs derived method = “relaxed” from dmpClustering function

dmrs.relaxed = countTest2(DS = ds_50.100,
                          minCountPerIndv = 8,
                          maxGrpCV = c(1, 1),
                          Minlog2FC = 1,
                          CountPerBp = 0.01,
                          test = 'LRT',
                          num.cores = 4L,
                          verbose = TRUE)
dmrs.relaxed

The DMR assciated genes at a \(distance \leq 1000\):

hits <- distanceToNearest(dmrs.relaxed, GENES, ignore.strand=FALSE)

## DMR associated genes 
hits <- hits[mcols(hits)[,1] <= 1000]
GENES[subjectHits(hits)]
## GRanges object with 1213 ranges and 4 metadata columns:
##          seqnames            ranges strand |     gene_id     type
##             <Rle>         <IRanges>  <Rle> | <character> <factor>
##      [1]        1       43087-43295      - |   AT1G04003     gene
##      [2]        1     239841-242703      - |   AT1G01660     gene
##      [3]        1     258717-258963      - |   AT1G04037     gene
##      [4]        1     428650-430720      - |   AT1G02220     gene
##      [5]        1   1194187-1196889      + |   AT1G04425     gene
##      ...      ...               ...    ... .         ...      ...
##   [1209]        5 25934568-25935612      + |   AT5G64890     gene
##   [1210]        5 25967824-25968638      + |   AT5G65005     gene
##   [1211]        5 26068112-26069652      + |   AT5G65230     gene
##   [1212]        5 26161624-26169191      - |   AT5G65460     gene
##   [1213]        5 26411796-26414609      - |   AT5G66050     gene
##            gene_biotype   gene_name
##             <character> <character>
##      [1]         lncRNA        <NA>
##      [2] protein_coding        <NA>
##      [3]         lncRNA        <NA>
##      [4] protein_coding      NAC003
##      [5]          ncRNA        <NA>
##      ...            ...         ...
##   [1209] protein_coding        PEP2
##   [1210] protein_coding        <NA>
##   [1211] protein_coding     AtMYB53
##   [1212] protein_coding        KCA2
##   [1213] protein_coding        <NA>
##   -------
##   seqinfo: 5 sequences from an unspecified genome; no seqlengths

Naturally ocurring regions

As was mentioned in the introduction algorithmically identified DMRs do not necessarily will be associated to the local DNA physico-chemical properties and to local molecular evolutionary features of every DMR detected.

All the weight of the evolutionary pressure lie down on naturally ocurrying methylation patternings inside, for example (but not limited to), exons and introns.

For introns we have:

## ==== Setting up the experiment design to test for DMRs ===
nams <- names(dmps)
dmps_at_exons <- getDIMPatGenes(GR = dmps, GENES = exons,
                                ignore.strand = TRUE,
                                output = "GRanges")

## Build a RangedGlmDataSet object
ds_exons <- glmDataSet(GR = dmps_at_exons, colData = colData)

Differentially methylated exons

dmrs_exons = countTest2(DS = ds_exons,
                        minCountPerIndv = 8,
                        maxGrpCV = c(1, 1),
                        Minlog2FC = 1,
                        CountPerBp = 0.01,
                        test = 'LRT',
                        num.cores = 4L,
                        verbose = TRUE)
## *** Number of GR after filtering counts 194
## *** GLM...
dmrs_exons
## GRanges object with 140 ranges and 17 metadata columns:
##             seqnames            ranges strand |   M_1_105   M_1_168   M_1_179
##                <Rle>         <IRanges>  <Rle> | <numeric> <numeric> <numeric>
##   AT1G10745        1   3572879-3573173      * |         5         7        26
##   AT1G17150        1   5866658-5867176      * |         5         6        16
##   AT1G33030        1 11964776-11965351      * |         6        18        21
##   AT1G33720        1 12222407-12224108      * |        36        12         0
##   AT1G33720        1 12222407-12224108      * |        36        12         0
##         ...      ...               ...    ... .       ...       ...       ...
##   AT5G49010        5 19867969-19868877      * |        24        14        20
##   AT5G49010        5 19867969-19868877      * |        24        14        20
##   AT5G51930        5 21103008-21103568      * |         6        20         0
##   AT5G52090        5 21167432-21169462      * |        46        30        19
##   AT5G54700        5 22222886-22224526      * |        25        17        22
##                M_1_27     M_1_3   NM_1_12   NM_1_45   NM_1_70   NM_1_86
##             <numeric> <numeric> <numeric> <numeric> <numeric> <numeric>
##   AT1G10745         7         5         1         0         0         0
##   AT1G17150         8         6         0         0         1         0
##   AT1G33030         6        12         0         3         0         3
##   AT1G33720        20        24         4         8         0         0
##   AT1G33720        20        24         4         8         0         0
##         ...       ...       ...       ...       ...       ...       ...
##   AT5G49010         6        20         2         0         0         0
##   AT5G49010         6        20         2         0         0         0
##   AT5G51930         8        34         2         0         0         2
##   AT5G52090        27        27         8         4         6         9
##   AT5G54700        14        12         1         6         2         2
##                log2FC scaled.deviance      pvalue        model    adj.pval
##             <numeric>       <numeric>   <numeric>  <character>   <numeric>
##   AT1G10745   2.17475        17.49778 2.87643e-05 Neg.Binomial 4.60882e-05
##   AT1G17150   1.99606        24.10357 9.12911e-07 Neg.Binomial 1.86551e-06
##   AT1G33030   1.69378        19.38609 1.06782e-05 Neg.Binomial 1.83613e-05
##   AT1G33720   1.57898         6.53006 1.06066e-02 Neg.Binomial 1.06824e-02
##   AT1G33720   1.57898         6.53006 1.06066e-02 Neg.Binomial 1.06824e-02
##         ...       ...             ...         ...          ...         ...
##   AT5G49010   2.47373        42.93430 5.66094e-11 Neg.Binomial 1.73520e-10
##   AT5G49010   2.47373        42.93430 5.66094e-11 Neg.Binomial 1.73520e-10
##   AT5G51930   1.98787         9.76251 1.78107e-03 Neg.Binomial 2.14642e-03
##   AT5G52090   1.48497        41.38165 1.25228e-10 Neg.Binomial 3.67856e-10
##   AT5G54700   1.87877        45.37607 1.62609e-11 Neg.Binomial 5.87895e-11
##             CT.SignalDensity TT.SignalDensity SignalDensityVariation
##                    <numeric>        <numeric>              <numeric>
##   AT1G10745      0.000847458        0.0338983             0.03305085
##   AT1G17150      0.000481696        0.0157996             0.01531792
##   AT1G33030      0.002604167        0.0218750             0.01927083
##   AT1G33720      0.001762632        0.0108108             0.00904818
##   AT1G33720      0.001762632        0.0108108             0.00904818
##         ...              ...              ...                    ...
##   AT5G49010      0.000550055        0.0184818             0.01793179
##   AT5G49010      0.000550055        0.0184818             0.01793179
##   AT5G51930      0.001782531        0.0242424             0.02245989
##   AT5G52090      0.003323486        0.0146726             0.01134909
##   AT5G54700      0.001675807        0.0109689             0.00929311
##   -------
##   seqinfo: 5 sequences from an unspecified genome; no seqlengths

For introns we have:

## ==== Setting up the experiment design to test for DMRs ===
nams <- names(dmps)
dmps_at_introns <- getDIMPatGenes(GR = dmps, GENES = introns,
                                 ignore.strand = TRUE,
                                 output = "GRanges")

## Build a RangedGlmDataSet object
ds_introns <- glmDataSet(GR = dmps_at_introns, colData = colData)

Differentially methylated introns

dmrs_introns = countTest2(DS = ds_introns,
                          minCountPerIndv = 8,
                          maxGrpCV = c(1, 1),
                          Minlog2FC = 1,
                          CountPerBp = 0.01,
                          test = 'LRT',
                          num.cores = 4L,
                          verbose = TRUE)
## *** Number of GR after filtering counts 2
## *** GLM...
dmrs_introns
## GRanges object with 2 ranges and 17 metadata columns:
##             seqnames            ranges strand |   M_1_105   M_1_168   M_1_179
##                <Rle>         <IRanges>  <Rle> | <numeric> <numeric> <numeric>
##   AT2G14000        2   5885110-5885454      * |         4        12        10
##   AT5G52115        5 21174335-21175487      * |        12        19        13
##                M_1_27     M_1_3   NM_1_12   NM_1_45   NM_1_70   NM_1_86
##             <numeric> <numeric> <numeric> <numeric> <numeric> <numeric>
##   AT2G14000         5         9         5         1         1         1
##   AT5G52115        10         9         0         2         0         3
##                log2FC scaled.deviance      pvalue        model    adj.pval
##             <numeric>       <numeric>   <numeric>  <character>   <numeric>
##   AT2G14000   1.38629         14.2099 1.63508e-04 Neg.Binomial 1.63508e-04
##   AT5G52115   1.79914         47.0358 6.97017e-12 Neg.Binomial 1.39403e-11
##             CT.SignalDensity TT.SignalDensity SignalDensityVariation
##                    <numeric>        <numeric>              <numeric>
##   AT2G14000       0.00579710        0.0231884             0.01739130
##   AT5G52115       0.00108413        0.0109280             0.00984389
##   -------
##   seqinfo: 5 sequences from an unspecified genome; no seqlengths

Differentially methylated genes DMGs

The last results for introns and exons contrasts with the result for the whole gene-body.

## ==== Setting up the experiment design to test for DMRs ===
nams <- names(dmps)
dmps_at_genes <- getDIMPatGenes(GR = dmps, GENES = genes,
                                ignore.strand = TRUE,
                                output = "GRanges")

## Build a RangedGlmDataSet object
ds_genes <- glmDataSet(GR = dmps_at_genes, colData = colData)

Differentically methylated genes

dmrs_genes = countTest2(DS = ds_genes,
                        minCountPerIndv = 8,
                        maxGrpCV = c(1, 1),
                        Minlog2FC = 1,
                        CountPerBp = 0.01,
                        test = 'LRT',
                        num.cores = 4L,
                        verbose = TRUE)
## *** Number of GR after filtering counts 970
## *** GLM...
dmrs_genes
## GRanges object with 894 ranges and 17 metadata columns:
##             seqnames            ranges strand |   M_1_105   M_1_168   M_1_179
##                <Rle>         <IRanges>  <Rle> | <numeric> <numeric> <numeric>
##   AT1G05035        1   1444983-1446116      * |        11        15        12
##   AT1G07025        1   2157658-2158158      * |         8        10        18
##   AT1G07025        1   2157658-2158158      * |         8        10        18
##   AT1G09260        1   2993381-2993820      * |        10         6        24
##   AT1G09260        1   2993381-2993820      * |        10         6        24
##         ...      ...               ...    ... .       ...       ...       ...
##   AT5G52115        5 21174321-21175857      * |        18        27        17
##   AT5G54700        5 22222886-22225616      * |        42        28        36
##   AT5G56920        5 23024896-23025915      * |        19        38        17
##   AT5G60945        5 24526245-24526636      * |         4         8        14
##   AT5G60945        5 24526245-24526636      * |         4         8        14
##                M_1_27     M_1_3   NM_1_12   NM_1_45   NM_1_70   NM_1_86
##             <numeric> <numeric> <numeric> <numeric> <numeric> <numeric>
##   AT1G05035        22         9         7         2         2         7
##   AT1G07025        16         8         4         2         0         2
##   AT1G07025        16         8         4         2         0         2
##   AT1G09260         2         2         2         0         0         6
##   AT1G09260         2         2         2         0         0         6
##         ...       ...       ...       ...       ...       ...       ...
##   AT5G52115        12        23         0         2         0         3
##   AT5G54700        23        20         2        11         3         3
##   AT5G56920        19        33         2         4         6        14
##   AT5G60945         4        10         2         0         0         0
##   AT5G60945         4        10         2         0         0         0
##                log2FC scaled.deviance      pvalue        model    adj.pval
##             <numeric>       <numeric>   <numeric>  <character>   <numeric>
##   AT1G05035   1.12059        16.07033 6.10329e-05 Neg.Binomial 8.03913e-05
##   AT1G07025   1.46634        25.55214 4.30588e-07 Neg.Binomial 8.56402e-07
##   AT1G07025   1.46634        25.55214 4.30588e-07 Neg.Binomial 8.56402e-07
##   AT1G09260   1.18377         4.41686 3.55854e-02 Neg.Binomial 3.57048e-02
##   AT1G09260   1.18377         4.41686 3.55854e-02 Neg.Binomial 3.57048e-02
##         ...       ...             ...         ...          ...         ...
##   AT5G52115   2.20460         63.4528 1.64254e-15 Neg.Binomial 1.08335e-14
##   AT5G54700   1.83636         42.8798 5.82085e-11 Neg.Binomial 2.20308e-10
##   AT5G56920   1.35504         18.3625 1.82615e-05 Neg.Binomial 2.75303e-05
##   AT5G60945   1.79176         20.6233 5.59125e-06 Neg.Binomial 9.23637e-06
##   AT5G60945   1.79176         20.6233 5.59125e-06 Neg.Binomial 9.23637e-06
##             CT.SignalDensity TT.SignalDensity SignalDensityVariation
##                    <numeric>        <numeric>              <numeric>
##   AT1G05035       0.00396825        0.0121693             0.00820106
##   AT1G07025       0.00399202        0.0239521             0.01996008
##   AT1G07025       0.00399202        0.0239521             0.01996008
##   AT1G09260       0.00454545        0.0200000             0.01545455
##   AT1G09260       0.00454545        0.0200000             0.01545455
##         ...              ...              ...                    ...
##   AT5G52115      0.000813273        0.0126220             0.01180872
##   AT5G54700      0.001739290        0.0109118             0.00917246
##   AT5G56920      0.006372549        0.0247059             0.01833333
##   AT5G60945      0.001275510        0.0204082             0.01913265
##   AT5G60945      0.001275510        0.0204082             0.01913265
##   -------
##   seqinfo: 5 sequences from an unspecified genome; no seqlengths

It seems to be that methylation patterning inside of gene-body forms a sort continuous regions simultaneously covering portions of exon and introns. In particular, this is true for genes experiencing alternative splicing, where some regions inside introns works as exons as well.

References.

Carter, Jane V., Jianmin Pan, Shesh N. Rai, and Susan Galandiuk. 2016. “ROC-ing along: Evaluation and interpretation of receiver operating characteristic curves.” Surgery 159 (6): 1638–45. https://doi.org/10.1016/j.surg.2015.12.029.

Greiner, M, D Pfeiffer, and R D Smith. 2000. “Principles and practical application of the receiver-operating characteristic analysis for diagnostic tests.” Preventive Veterinary Medicine 45 (1-2): 23–41. https://doi.org/10.1016/S0167-5877(00)00115-X.

Harpaz, Rave, William DuMouchel, Paea LePendu, Anna Bauer-Mehren, Patrick Ryan, and Nigam H Shah. 2013. “Performance of Pharmacovigilance Signal Detection Algorithms for the FDA Adverse Event Reporting System.” Clin Pharmacol Ther 93 (6): 1–19. https://doi.org/10.1038/clpt.2013.24.Performance.

Hodges, Amelia J., Nicholas O. Hudson, and Bethany A. Buck-Koehntop. 2020. “Cys2His2 Zinc Finger Methyl-CpG Binding Proteins: Getting a Handle on Methylated DNA.” Journal of Molecular Biology 432 (6): 1640–60. https://doi.org/10.1016/j.jmb.2019.09.012.

Kruspe, Sven, David D. Dickey, Kevin T. Urak, Giselle N. Blanco, Matthew J. Miller, Karen C. Clark, Elliot Burghardt, et al. 2017. “Rapid and Sensitive Detection of Breast Cancer Cells in Patient Blood with Nuclease-Activated Probe Technology.” Molecular Therapy - Nucleic Acids 8 (September): 542–57. https://doi.org/10.1016/j.omtn.2017.08.004.

Ngo, Thuy T. M., Jejoong Yoo, Qing Dai, Qiucen Zhang, Chuan He, Aleksei Aksimentiev, and Taekjip Ha. 2016. “Effects of cytosine modifications on DNA flexibility and nucleosome mechanical stability.” Nature Communications 7 (February): 10813. https://doi.org/10.1038/ncomms10813.

Pérez, Alberto, Chiara Lara Castellazzi, Federica Battistini, Kathryn Collinet, Oscar Flores, Ozgen Deniz, Maria Luz Ruiz, et al. 2012. “Impact of methylation on the physical properties of DNA.” Biophysical Journal 102 (9): 2140–8. https://doi.org/10.1016/j.bpj.2012.03.056.

Sanchez, Robersy, and Sally A. Mackenzie. 2016. “Information Thermodynamics of Cytosine DNA Methylation.” Edited by Barbara Bardoni. PLOS ONE 11 (3): e0150427. https://doi.org/10.1371/journal.pone.0150427.

Sanchez, Robersy, and Sally A Mackenzie. 2020. “Integrative Network Analysis of Differentially Methylated and Expressed Genes for Biomarker Identification in Leukemia.” Scientific Reports 10 (1): 2123. https://doi.org/10.1038/s41598-020-58123-2.

Sanchez, Robersy, Xiaodong Yang, Thomas Maher, and Sally Mackenzie. 2019. “Discrimination of DNA Methylation Signal from Background Variation for Clinical Diagnostics.” Int. J. Mol. Sci. 20 (21): 5343. https://doi.org/https://doi.org/10.3390/ijms20215343.

Severin, Philip M D, Xueqing Zou, Hermann E Gaub, and Klaus Schulten. 2011. “Cytosine methylation alters DNA mechanical properties.” Nucleic Acids Research 39 (20): 8740–51. https://doi.org/10.1093/nar/gkr578.

Teng, Xiaojing, and Wonmuk Hwang. 2018. “Effect of Methylation on Local Mechanics and Hydration Structure of DNA.” Biophysical Journal 114 (8): 1791–1803. https://doi.org/10.1016/j.bpj.2018.03.022.

Wahlberg, Per, Anders Lundmark, Jessica Nordlund, Stephan Busche, Amanda Raine, Karolina Tandre, Lars Rönnblom, et al. 2016. “DNA methylome analysis of acute lymphoblastic leukemia cells reveals stochastic de novo DNA methylation in CpG islands.” Epigenomics 8 (10): 1367–87. https://doi.org/10.2217/epi-2016-0052.

Yusufaly, Tahir I, Yun Li, and Wilma K Olson. 2013. “5-Methylation of cytosine in CG:CG base-pair steps: a physicochemical mechanism for the epigenetic control of DNA nanomechanics.” The Journal of Physical Chemistry. B 117 (51): 16436–42. https://doi.org/10.1021/jp409887t.

Session info

Here is the output of sessionInfo() on the system on which this document was compiled running pandoc 2.7.3:

## R version 4.0.0 (2020-04-24)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 18.04.4 LTS
## 
## Matrix products: default
## BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.7.1
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.7.1
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
##  [1] grid      parallel  stats4    stats     graphics  grDevices utils    
##  [8] datasets  methods   base     
## 
## other attached packages:
##  [1] gridExtra_2.3        ggplot2_3.3.1        MethylIT.utils_0.1  
##  [4] MethylIT_0.3.2.1     rtracklayer_1.48.0   GenomicRanges_1.40.0
##  [7] GenomeInfoDb_1.24.0  IRanges_2.22.2       S4Vectors_0.26.1    
## [10] BiocGenerics_0.34.0  knitr_1.28          
## 
## loaded via a namespace (and not attached):
##   [1] colorspace_1.4-1            ellipsis_0.3.1             
##   [3] class_7.3-17                modeltools_0.2-23          
##   [5] mclust_5.4.6                rprojroot_1.3-2            
##   [7] XVector_0.28.0              fs_1.4.1                   
##   [9] rstudioapi_0.11             flexmix_2.3-15             
##  [11] bit64_0.9-7                 AnnotationDbi_1.50.0       
##  [13] prodlim_2019.11.13          lubridate_1.7.9            
##  [15] codetools_0.2-16            splines_4.0.0              
##  [17] Formula_1.2-3               pROC_1.16.2                
##  [19] Rsamtools_2.4.0             caret_6.0-86               
##  [21] annotate_1.66.0             kernlab_0.9-29             
##  [23] compiler_4.0.0              backports_1.1.7            
##  [25] assertthat_0.2.1            Matrix_1.2-18              
##  [27] htmltools_0.4.0             tools_4.0.0                
##  [29] gtable_0.3.0                glue_1.4.1                 
##  [31] GenomeInfoDbData_1.2.3      reshape2_1.4.4             
##  [33] dplyr_1.0.0                 Rcpp_1.0.4.6               
##  [35] Biobase_2.48.0              pkgdown_1.5.1              
##  [37] vctrs_0.3.1                 Biostrings_2.56.0          
##  [39] nlme_3.1-148                iterators_1.0.12           
##  [41] betareg_3.1-3               lmtest_0.9-37              
##  [43] timeDate_3043.102           gower_0.2.1                
##  [45] xfun_0.14                   stringr_1.4.0              
##  [47] lifecycle_0.2.0             XML_3.99-0.3               
##  [49] zlibbioc_1.34.0             MASS_7.3-51.6              
##  [51] zoo_1.8-8                   scales_1.1.1               
##  [53] ipred_0.9-9                 SummarizedExperiment_1.18.1
##  [55] sandwich_2.5-1              yaml_2.2.1                 
##  [57] memoise_1.1.0               rpart_4.1-15               
##  [59] segmented_1.1-0             stringi_1.4.6              
##  [61] RSQLite_2.2.0               genefilter_1.70.0          
##  [63] desc_1.2.0                  foreach_1.5.0              
##  [65] BiocParallel_1.22.0         lava_1.6.7                 
##  [67] rlang_0.4.6                 pkgconfig_2.0.3            
##  [69] matrixStats_0.56.0          bitops_1.0-6               
##  [71] evaluate_0.14               lattice_0.20-41            
##  [73] purrr_0.3.4                 GenomicAlignments_1.24.0   
##  [75] recipes_0.1.12              bit_1.1-15.2               
##  [77] tidyselect_1.1.0            plyr_1.8.6                 
##  [79] magrittr_1.5                nls2_0.2                   
##  [81] R6_2.4.1                    generics_0.0.2             
##  [83] DelayedArray_0.14.0         DBI_1.1.0                  
##  [85] pillar_1.4.4                withr_2.2.0                
##  [87] mixtools_1.2.0              survival_3.1-12            
##  [89] RCurl_1.98-1.2              nnet_7.3-14                
##  [91] tibble_3.0.1                crayon_1.3.4               
##  [93] rmarkdown_2.2               minpack.lm_1.2-1           
##  [95] data.table_1.12.8           blob_1.2.1                 
##  [97] ModelMetrics_1.2.2.2        digest_0.6.25              
##  [99] xtable_1.8-4                munsell_0.5.0