This function prepares the data for the estimation of information divergences and works as a wrapper calling the functions that compute selected information divergences of methylation levels. In the downstream analysis, the probability distribution of a given information divergence is used in Methyl-IT as the null hypothesis of the noise distribution, which permits, in a further signal detection step, the discrimination of the methylation regulatory signal from the background noise.

For the current version, two information divergences of methylation levels are computed by default: 1) Hellinger divergence (H) and 2) the total variation distance (TVD). In the context of methylation analysis TVD corresponds to the absolute difference of methylation levels. Here, although the variable reported is the total variation (TV), the variable actually used for the downstream analysis is TVD. Once a differentially methylated position (DMP) is identified in the downstream analysis, TV is the standard indicator of whether the cytosine position is hyper- or hypo-methylated.

The option to compute the J-information divergence (JD) is currently provided. The motivation to introduce this divergence is given in the help of function estimateJDiv.

estimateDivergence(
  ref,
  indiv,
  Bayesian = FALSE,
  init.pars = NULL,
  via.optim = TRUE,
  columns = c(mC = 1, uC = 2),
  min.coverage = 4,
  and.min.cov = TRUE,
  min.meth = 4,
  min.umeth = 0,
  min.sitecov = 4,
  high.coverage = NULL,
  percentile = 0.9999,
  JD = FALSE,
  jd.stat = FALSE,
  ignore.strand = FALSE,
  y.centroid = NULL,
  num.cores = multicoreWorkers(),
  tasks = 0L,
  meth.level = FALSE,
  loss.fun = c("linear", "huber", "smooth", "cauchy", "arctg"),
  logbase = 2,
  verbose = TRUE,
  ...
)

Arguments

ref

The GRanges object of the reference individual that will be used in the estimation of the information divergence.

indiv

A list of GRanges objects from the individuals that will be used in the estimation of the information divergence.

Bayesian

logical(1). Whether to perform the estimations based on posterior estimations of methylation levels.

init.pars

initial parameter values. Defaults is NULL and an initial guess is estimated using optim function. If the initial guessing fails initial parameter values are to alpha = 1 and beta = 1, which imply the parsimony pseudo-counts greater than zero.

via.optim

Optional. Only used if Bayesian = TRUE Whether to estimate beta distribution parameters via optim or nls.lm. If any of this approaches fail then parameters used init.pars will be returned.

columns

Vector of one or two integer numbers denoting the indexes of the columns where the methylated and unmethylated read counts are found or, if meth.level = TRUE, the columns corresponding to the methylation levels. If columns = NULL and meth.level = FALSE, then columns = c(1,2) is assumed. If columns = NULL and meth.level = TRUE, then columns = 1 is assumed.

min.coverage

An integer or an integer vector of length 2. Cytosine sites where the coverage in both samples, 'x' and 'y', are less than 'min.coverage' are discarded. The cytosine site is preserved, however, if the coverage is greater than 'min.coverage' in at least one sample. If 'min.coverage' is an integer vector, then the corresponding min coverage is applied to each sample.

and.min.cov

Logical. Whether to apply the logical AND to select the cytosine sites based on \(min.coverage\). If FALSE, then a logical OR is applied, and cytosine sites where at least one sample hold the \(min.coverage\) are preserved. Default is TRUE. To get the results with previous default parameter set: and.min.cov = FALSE.

min.meth

An integer or an integer vector of length 2. Cytosine sites where the numbers of read counts of methylated cytosine in both samples, '1' and '2', are less than 'min.meth' are discarded. If 'min.meth' is an integer vector, then the corresponding min number of reads is applied to each sample. Default is min.meth = 4.

min.umeth

An integer or an integer vector of length 2 (\(min.umeth = c(min.umeth1, min.umeth2)\)). Min number of reads to consider cytosine position. Specifically cytosine positions where \((uC \leq min.umeth) and (mC > 0) and (mC \leq min.meth)\) hold will be removed, where mC and uC stand for the numbers of methylated and unmethylated reads. Default is \(min.umeth = 0\).

min.sitecov

An integer. The minimum total coverage. Only sites where the total coverage \((cov1 + cov2)\) is greater than 'min.sitecov' are considered for downstream analysis, where cov1 and cov2 are the coverages for samples 1 and 2, respectively.

high.coverage

An integer for read counts. Cytosine sites having higher coverage than this are discarded.

percentile

Threshold to remove the outliers from each file and all files stacked.

JD

Logic (Default:FALSE). Option on whether to add a column with values of J-information divergence (see estimateJDiv). It is only compute if JD = TRUE and meth.level = FALSE.

jd.stat

logical(1). Whether to compute the \(JD\) statistic with asymptotic Chi-squared distribution with one degree of freedom (see estimateJDiv).

ignore.strand

When set to TRUE, the strand information is ignored in the overlap of GRanges-class objects. This is a parameter passed to uniqueGRanges function. Default value: FALSE.

y.centroid

Optional. A GRanges-class object corresponding to the treatment/individual centroide. This information is applied to reduce the bias originated by missing cytosine sites. The centroide can be compute applying function poolFromGRlist.

num.cores

The number of cores to use, i.e. at most how many child processes will be run simultaneously (see 'bplapply' function from BiocParallel package).

tasks

integer(1). The number of tasks per job. value must be a scalar \(integer >= 0L\). In this documentation a job is defined as a single call to a function, such as bplapply, bpmapply etc. A task is the division of the X argument into chunks. When tasks == 0 (default), X is divided as evenly as possible over the number of workers (see MulticoreParam from BiocParallel package).

meth.level

logical(1) Whether methylation levels are given in place of counts. Default is FALSE.

loss.fun

Loss function(s) used in the estimation of the best fitted model to beta distribution (only applied when Bayesian=TRUE; see (Loss function)). This fitting uses the approach followed in in the R package usefr. After \(z = 1/2 * sum((f(x) - y)^2)\) we have:

  1. "linear": linear function which gives a standard least squares: \(loss(z) = z\).

  2. "huber": Huber loss, \(loss(z) = ifelse(z \leq 1, z, sqrt(z) -1)\).

  3. "smooth": Smooth approximation to the sum of residues absolute values: \(loss(z) = 2*(sqrt(z + 1) - 1)\).

  4. "cauchy": Cauchy loss: \(loss(z) = log(z + 1)\).

  5. "arctg": arc-tangent loss function: \(loss(x) = atan(z)\).

Loss 'linear' function works well for most of the methylation datasets with acceptable quality.

logbase

Logarithm base used to compute the JD (if JD = TRUE). Logarithm base 2 is used as default (bit unit). Use logbase = \(exp(1)\) for natural logarithm.

verbose

if TRUE, prints the function log to stdout

...

Optional parameters for uniqueGRanges function.

Value

An object from 'infDiv' class with the four columns of counts, the information divergence, and additional columns:

1) A matrix:

The original matrix of methylated \(c_{ij}\) and unmethylated \(t_{ij}\) read counts from control \(j=1\) and treatment \(j=2\) samples at positions \(i\).

2) 'p1' and 'p2':

methylation levels for control and treatment, respectively. If 'meth.level = FALSE' and 'Bayesian = TRUE' (recommended), 'p1' and 'p2' are estimated following the Bayesian approach described in reference (1).

3) 'bay.TV':

total variation TV = p2 - p1

4) 'TV':

total variation based on simple counts: \(TV=c_1/(c_1+t_1)-c_2/(c_2+t_2)\), where \(c_i\) and \(t_i\) denote methylated and unmethylated read counts, respectively.

5) Hellinger divergence, 'hdiv':

If Bayesian = TRUE, the results are based on the posterior estimations of methylation levels. if meth.level = FALSE', then 'hdiv' is computed as given in reference (2), otherwise as: $$hdiv = (sqrt(p_1) - sqrt(p_2))^2 + (sqrt(1 -p_1) - sqrt(1 - p_2))^2$$

Details

If read counts are provided, then Hellinger divergence is computed as given in the first formula from Theorem 1 from reference (1). In the present case:

$$H = 2 (n_1 + 1) (n_2 + 1)*((\sqrt(p_1) - \sqrt(p_2))^2 + (\sqrt(1-p_2) - \sqrt(1-p_2))^2)/(n_1 + n_2 + 2)$$

where \(n_1\) and \(n_2\) are the coverage for the control and treatment, respectively. Notice that each row from the matrix of counts correspond to a single cytosine position and has four values corresponding to 'mC1' and 'uC1' (control), and mC2' and 'uC2' for treatment.

According with the above equation, to estimate Hellinger divergence, not only the methylation levels are considered in the estimation of H, but also the control and treatment coverage at each given cytosine site. At this point, it is worthy to do mention that if the reference sample is derived with function poolFromGRlist using the 'sum' of read counts to compute a methylation pool, then 'min.coverage' parameter value must be used to prevent an over estimation of the divergence for low coverage cytosines sites. For example, if a reference sample is derived as the methylation pool of read count sum from 3 individuals and we want to consider only methylation sites with minimum coverage of 4, then we can set \(min.coverage = c(12, 4)\), where the number 12 (3 x 4) is the minimum coverage requested for the each cytosine site in the reference sample.

If the methylation levels are provided in place of counts, then the Hellinger divergence is computed as: $$H = (\sqrt(p_1) - \sqrt(p_2))^2 + (\sqrt(1 - p_1) - \sqrt(1 - p_2))^2$$

This formula assumes that the probability vectors derived from the methylation levels \(p_i = c(p_{i1}, 1 - p_{i2}\)) (see estimateHellingerDiv are an unbiased estimation of the expected one. The function applies a pairwise filtering after building a single GRanges from the two GRanges objects. Experimentally available cytosine sites are paired using the function uniqueGRanges.

It is important to observe that several filtering conditions are provided to select biological meaningful cytosine positions, which prevent to carry experimental errors in the downstream analyses. By filtering the read count we try to remove bad quality data, which would be in the edge of the experimental error originated by the BS-seq sequencing. It is user responsibility to check whether cytosine positions used in the analysis are biological meaningful. For example, a cytosine position with counts mC1 = 10 and uC1 = 20 in the 'ref' sample and mC2 = 1 and uC2 = 0 in an 'indv' sample will lead to methylation levels p1 = 0.333 and p2 = 1, respectively, and TV = p2 - p1 = 0.667, which apparently indicates a hypermethylated site. However, there are not enough reads supporting p2 = 1. A Bayesian estimation of TV will reveal that this site would be, in fact, hypomethylated. So, the best practice will be the removing of sites like that. This particular case is removed under the default settings: min.coverage = 4, min.meth = 4, and min.umeth = 0 (see example for function uniqueGRfilterByCov, called by 'estimateDivergence').

A source of bias is originated by missing cytosine sites. Missing data are frequently found in experimental data sets and, in particular, in bisulfite genomic sequencing data. Typically, in statistical analyses, the bias originated by missing data (for given variable) is mitigated by using the mean of the known values for the corresponding variable. That is, in present case, if the reads for some cytosine site are missed in a sample from a set of, e.g., three individuals, then the means of reads (methylated and unmethylated) for such site are applied as an estimation of the best expected ("guessed") value of missed reads. Obviously, if the reads are missed in all the samples, then the site is discarded. See examples from function uniqueGRfilterByCov.

Notice that filtering the data sets to remove undesired cytosine sites is user responsibility. Here, for the sake of facilitating a smooth transition to MethylIT pipeline, we provide some filtering options. It is up to the user whether to apply them or not. For example, if the user wants no filtering at all, then just set: min.coverage = 0, and.min.cov = FALSE, min.meth = 0, min.umeth = 0, min.sitecov = 0, high.coverage = NULL, percentile = 1.

References

  1. Sanchez R, Yang X, Maher T, Mackenzie S. Discrimination of DNA Methylation Signal from Background Variation for Clinical Diagnostics. Int. J. Mol Sci, 2019, 20:5343.

  2. Basu A., Mandal A., Pardo L. Hypothesis testing for two discrete populations based on the Hellinger distance. Stat. Probab. Lett. 2010, 80: 206-214.

Author

Robersy Sanchez (https://genomaths.com)

Examples

## Load a dataset of simulated read counts.
## This is a list of GRanges objects
data("rcounts", package = "MethylIT")

## The first GRanges object is the referenc sample
rcounts$ref
#> GRanges object with 10000 ranges and 2 metadata columns:
#>           seqnames    ranges strand |        mC        uC
#>              <Rle> <IRanges>  <Rle> | <numeric> <numeric>
#>       [1]        1         1      + |        18        46
#>       [2]        1         2      + |         2       140
#>       [3]        1         3      - |         4        24
#>       [4]        1         4      + |        41        55
#>       [5]        1         5      + |        36       145
#>       ...      ...       ...    ... .       ...       ...
#>    [9996]        1      9996      - |        49        49
#>    [9997]        1      9997      + |        24        95
#>    [9998]        1      9998      + |        18        47
#>    [9999]        1      9999      - |        32        95
#>   [10000]        1     10000      - |        28       116
#>   -------
#>   seqinfo: 1 sequence from an unspecified genome; no seqlengths

## For the safe of time, we will use just the first 500 cytosine positions
rc <- lapply(rcounts, function(x) x[ 1:500 ])
rc$C1
#> GRanges object with 500 ranges and 2 metadata columns:
#>         seqnames    ranges strand |        mC        uC
#>            <Rle> <IRanges>  <Rle> | <numeric> <numeric>
#>     [1]        1         1      + |        57         7
#>     [2]        1         2      + |         0       142
#>     [3]        1         3      - |         0        28
#>     [4]        1         4      + |        64        32
#>     [5]        1         5      + |         0       181
#>     ...      ...       ...    ... .       ...       ...
#>   [496]        1       496      + |       123        20
#>   [497]        1       497      - |        53         8
#>   [498]        1       498      + |         0       129
#>   [499]        1       499      + |         0        28
#>   [500]        1       500      - |         6       135
#>   -------
#>   seqinfo: 1 sequence from an unspecified genome; no seqlengths

## The are three control and three treat samples
names(rc)
#> [1] "ref" "C1"  "C2"  "C3"  "T1"  "T2"  "T3" 

## The estimation of information divergences includes both available
## divergences: Hellinger and J divergences.
hd <- estimateDivergence(
    ref = rc$ref,
    indiv = rc[ -1 ],
    Bayesian = TRUE,
    num.cores = 6L,
    percentile = 1,
    JD = TRUE,
    verbose = FALSE)
hd
#> InfDiv object of length: 6
#> ------- 
#> $C1
#> GRanges object with 323 ranges and 10 metadata columns:
#>         seqnames    ranges strand |        c1        t1        c2        t2
#>            <Rle> <IRanges>  <Rle> | <numeric> <numeric> <numeric> <numeric>
#>     [1]        1         1      + |        18        46        57         7
#>     [2]        1         3      - |         4        24         0        28
#>     [3]        1         4      + |        41        55        64        32
#>     [4]        1         5      + |        36       145         0       181
#>     [5]        1         6      - |        16        97        64        49
#>     ...      ...       ...    ... .       ...       ...       ...       ...
#>   [319]        1       495      + |        16        78         0        94
#>   [320]        1       496      + |        31       112       123        20
#>   [321]        1       497      - |        13        48        53         8
#>   [322]        1       498      + |        57        72         0       129
#>   [323]        1       500      - |         2       140         6       135
#>                p1         p2         TV     bay.TV      hdiv       jdiv
#>         <numeric>  <numeric>  <numeric>  <numeric> <numeric>  <numeric>
#>     [1]  0.272555 0.86505099   0.609375   0.592496  26.14515   1.213627
#>     [2]  0.161637 0.01061095  -0.142857  -0.151026   2.77443   0.314745
#>     [3]  0.407564 0.65457939   0.239583   0.247015   6.03698   0.180549
#>     [4]  0.199746 0.00175225  -0.198895  -0.197994  31.85233   0.708001
#>     [5]  0.147470 0.55803651   0.424779   0.410567  22.64077   0.588705
#>     ...       ...        ...        ...        ...       ...        ...
#>   [319] 0.1744742 0.00333588 -0.1702128 -0.1711383 13.073338 0.51175476
#>   [320] 0.2167029 0.84894900  0.6433566  0.6322461 65.407536 1.37339109
#>   [321] 0.2134473 0.84284463  0.6557377  0.6293973 27.809084 1.35469532
#>   [322] 0.4259443 0.00244640 -0.4418605 -0.4234979 54.855382 1.74503697
#>   [323] 0.0270367 0.04412242  0.0284687  0.0170857  0.307412 0.00625465
#>   -------
#>   seqinfo: 1 sequence from an unspecified genome; no seqlengths
#> 
#> ...
#> <5 more GRanges elements>
#> ------- 

## Keep in mind that Hellinger and J divergences are, in general,
## correlated
cor.test(x = hd$T1$hdiv, y = hd$T1$jdiv, method = 'kendall')
#> 
#> 	Kendall's rank correlation tau
#> 
#> data:  hd$T1$hdiv and hd$T1$jdiv
#> z = 23.802, p-value < 2.2e-16
#> alternative hypothesis: true tau is not equal to 0
#> sample estimates:
#>       tau 
#> 0.7979042 
#> 
cor.test(x = hd$C1$hdiv, y = hd$C1$jdiv, method = 'pearson')
#> 
#> 	Pearson's product-moment correlation
#> 
#> data:  hd$C1$hdiv and hd$C1$jdiv
#> t = 34.446, df = 321, p-value < 2.2e-16
#> alternative hypothesis: true correlation is not equal to 0
#> 95 percent confidence interval:
#>  0.8614380 0.9083527
#> sample estimates:
#>       cor 
#> 0.8871664 
#> 

## Next, using a data set from the example for function
## 'uniqueGRfilterByCov' and same filtering conditions.
## The reference sample:
strands <- c("+","-","+","-", "+","-","+","+","+","+","+")
pos <- c(10,11,11,12,13,13,14,15,16,17,18)

x <- data.frame(chr = 'chr1', start = pos, end = pos,
                mC = c(2,3,2,5,10,7,9,11,4,10,7),
                uC = c(2,30,20,4,8,0,10,3,0,8,1),
                strand = strands)

x <- makeGRangesFromDataFrame(x, keep.extra.columns = TRUE)

## The treatment sample and the centroide of treatment group:
y <- data.frame(chr = 'chr1', start = 11:18, end = 11:18,
                mC2 = c(4,1,2,1,4,5:7), uC2 = c(0,0,2:7),
                strand = c("+","-","-","+","+","+","+","+"))

y <- makeGRangesFromDataFrame(y, keep.extra.columns = TRUE)

y_centroid <- data.frame(
                chr = 'chr1',
                start = pos, end = pos,
                mC2 = c(8,7,6,7,5,8,1:5), uC2 = 0:10,
                strand = c("+","-","+","-", "+","-","+","+","+","+","+"))
y_centroid <- makeGRangesFromDataFrame(y_centroid,
                                      keep.extra.columns = TRUE)


## Estimation of the divergences whitout applying Bayesian correction of
## the methylation levels:
estimateDivergence(ref = x,
                   indiv = y,
                   and.min.cov = FALSE,
                   min.coverage = c(1, 8),
                   ignore.strand = FALSE,
                   y.centroid = y_centroid,
                   min.percentile = FALSE,
                   percentile = 1,
                   verbose = FALSE)
#> InfDiv object of length: 1
#> ------- 
#> [[1]]
#> GRanges object with 11 ranges and 8 metadata columns:
#>        seqnames    ranges strand |        c1        t1        c2        t2
#>           <Rle> <IRanges>  <Rle> | <numeric> <numeric> <numeric> <numeric>
#>    [1]     chr1        10      + |         2         2         8         0
#>    [2]     chr1        11      - |         3        30         7         1
#>    [3]     chr1        11      + |         2        20         6         2
#>    [4]     chr1        12      - |         5         4         7         3
#>    [5]     chr1        13      + |        10         8         5         4
#>    [6]     chr1        13      - |         7         0         8         5
#>    [7]     chr1        14      + |         9        10         1         6
#>    [8]     chr1        15      + |        11         3         4         4
#>    [9]     chr1        16      + |         4         0         5         5
#>   [10]     chr1        17      + |        10         8         6         6
#>   [11]     chr1        18      + |         7         1         7         7
#>               p1        p2         TV       hdiv
#>        <numeric> <numeric>  <numeric>  <numeric>
#>    [1] 0.5000000  1.000000  0.5000000  3.7657700
#>    [2] 0.0909091  0.875000  0.7840909 10.8412820
#>    [3] 0.0909091  0.750000  0.6590909  6.7831887
#>    [4] 0.5555556  0.700000  0.1444444  0.2355480
#>    [5] 0.5555556  0.555556  0.0000000  0.0000000
#>    [6] 1.0000000  0.615385 -0.3846154  4.3890857
#>    [7] 0.4736842  0.142857 -0.3308271  1.5590061
#>    [8] 0.7857143  0.500000 -0.2857143  1.0325249
#>    [9] 1.0000000  0.500000 -0.5000000  4.0272818
#>   [10] 0.5555556  0.500000 -0.0555556  0.0478316
#>   [11] 0.8750000  0.500000 -0.3750000  1.9926489
#>   -------
#>   seqinfo: 1 sequence from an unspecified genome; no seqlengths
#> 
#> ...
#> <0 more GRanges elements>
#> ------- 

## Estimation of the divergences applying Bayesian correction.
## Without Bayesian correction of the methylation levels, the Hellinger
## divergences values can be over estimated. The correction is introduced
## based on the assumption that the methylation levels follow Beta
## probability distribution.
estimateDivergence(ref = x,
                   indiv = y,
                   Bayesian = TRUE,
                   and.min.cov = FALSE,
                   min.coverage = c(1, 8),
                   ignore.strand = FALSE,
                   y.centroid = y_centroid,
                   min.percentile = FALSE,
                   percentile = 1,
                   verbose = FALSE)
#> InfDiv object of length: 1
#> ------- 
#> [[1]]
#> GRanges object with 11 ranges and 9 metadata columns:
#>        seqnames    ranges strand |        c1        t1        c2        t2
#>           <Rle> <IRanges>  <Rle> | <numeric> <numeric> <numeric> <numeric>
#>    [1]     chr1        10      + |         2         2         8         0
#>    [2]     chr1        11      - |         3        30         7         1
#>    [3]     chr1        11      + |         2        20         6         2
#>    [4]     chr1        12      - |         5         4         7         3
#>    [5]     chr1        13      + |        10         8         5         4
#>    [6]     chr1        13      - |         7         0         8         5
#>    [7]     chr1        14      + |         9        10         1         6
#>    [8]     chr1        15      + |        11         3         4         4
#>    [9]     chr1        16      + |         4         0         5         5
#>   [10]     chr1        17      + |        10         8         6         6
#>   [11]     chr1        18      + |         7         1         7         7
#>               p1        p2         TV      bay.TV       hdiv
#>        <numeric> <numeric>  <numeric>   <numeric>  <numeric>
#>    [1]  0.512885  0.807508  0.5000000  0.29462359 0.64589935
#>    [2]  0.128421  0.732692  0.7840909  0.60427045 5.99400265
#>    [3]  0.144870  0.657875  0.6590909  0.51300506 3.89142742
#>    [4]  0.548925  0.637327  0.1444444  0.08840153 0.08500909
#>    [5]  0.551755  0.542473  0.0000000 -0.00928233 0.00113916
#>    [6]  0.856076  0.587671 -0.3846154 -0.26840507 0.95922062
#>    [7]  0.481495  0.306742 -0.3308271 -0.17475254 0.36894253
#>    [8]  0.739457  0.508242 -0.2857143 -0.23121426 0.65275653
#>    [9]  0.795141  0.507169 -0.5000000 -0.28797127 0.64927842
#>   [10]  0.551755  0.506344 -0.0555556 -0.04541138 0.03195970
#>   [11]  0.778853  0.505689 -0.3750000 -0.27316428 0.94029072
#>   -------
#>   seqinfo: 1 sequence from an unspecified genome; no seqlengths
#> 
#> ...
#> <0 more GRanges elements>
#> -------