estimateCutPoint {MethylIT}    R Documentation

Estimate cutpoints to distinguish the treatment methylation signal from the control

Description

Given a list of two GRanges objects, control and treatment, carrying the potential signals (prior classification) from controls and treatments in terms of an information divergence (given the meta-columns), the function estimates the cutpoints of the control group versus treatment group.

Usage

estimateCutPoint(LR, control.names, treatment.names, simple = TRUE,
  column = c(hdiv = TRUE, TV = TRUE, wprob = FALSE, pos = FALSE),
  classifier1 = c("logistic", "pca.logistic", "lda", "qda", "pca.lda",
  "pca.qda"), classifier2 = NULL, tv.cut = 0.25, div.col = NULL,
  clas.perf = FALSE, post.cut = 0.5, prop = 0.6, n.pc = 1,
  find.cut = FALSE, cut.interval = c(0.5, 0.8), cut.incr = 0.01,
  stat = 1, num.cores = 1L, tasks = 0L, ...)

Arguments

LR

An object of class 'pDMP', previously obtained with the function getPotentialDIMP.

control.names, treatment.names

Names/IDs of the control and treatment samples, which must be included in the variable LR.

simple

Logical (default, TRUE). If TRUE, then the Youden Index is used to estimate the cutpoint.

column

A named logical vector selecting the predictor variables to be used: Hellinger divergence "hdiv", total variation "TV", probability of potential DIMP "wprob", and the relative cytosine site position "pos" with respect to the chromosome where it is located. The relative position is estimated as (x - x.min)/(x.max - x.min), where x.min and x.max are the minimum and maximum positions for the corresponding chromosome, respectively. If "wprob = TRUE", then the base-10 logarithm of "wprob" will be used as predictor in place of "wprob".

classifier1, classifier2

Classification model to use. Option "logistic" applies a logistic regression model; "lda" applies a Linear Discriminant Analysis (LDA); "qda" applies a Quadratic Discriminant Analysis (QDA); "pca.logistic" applies a logistic regression model using the Principal Components (PCs) estimated with Principal Component Analysis (PCA) as predictor variables; "pca.lda" applies LDA using PCs as predictor variables; and "pca.qda" applies QDA using PCs as predictor variables. If classifier2 is not NULL, then it will be used to evaluate the classification performance, and the corresponding best fitted model will be returned.

tv.cut

A cutoff for the total variation distance (TVD) to be applied to each site/range. Only sites/ranges k with TVD_{k} > tv.cut are used in the analysis. Its value must be a number with 0 < tv.cut < 1. Default is tv.cut = 0.25.

div.col

Column number for divergence variable for which the estimation of the cutpoint will be performed.

clas.perf

Logical. Whether to evaluate the classification performance for the estimated cutpoint using a model classifier when 'simple = TRUE'. Default, FALSE.

post.cut

If 'simple = FALSE', this is the posterior probability used to decide whether a DMP belongs to the treatment group. Default post.cut = 0.5.

prop

Proportion used to split the dataset for the logistic regression (group versus divergence at DIMPs) into two subsets, training and testing. Default prop = 0.6.

n.pc

Number of principal components (PCs) to use if the classifier is not 'logistic'. In the current case, the maximum number of PCs is 4.

find.cut

Logical. Whether to search for an optimal cutoff value to classify DMPs based on the given specifications.

cut.interval

0 < cut.interval < 1. If find.cut = TRUE, the interval of treatment group posterior probabilities in which to search for a cutpoint. Default cut.interval = c(0.5, 0.8).

cut.incr

0 < cut.incr < 0.1. If find.cut = TRUE, the size of the successive increments used to run over the interval cut.interval. Default, cut.incr = 0.01.

stat

An integer indicating the statistic to be used in the testing when find.cut = TRUE. Default, stat = 1. The mapping to statistic names is:

  • 0 = "Accuracy"

  • 1 = "Sensitivity"

  • 2 = "Specificity"

  • 3 = "Pos Pred Value"

  • 4 = "Neg Pred Value"

  • 5 = "Precision"

  • 6 = "Recall"

  • 7 = "F1"

  • 8 = "Prevalence"

  • 9 = "Detection Rate"

  • 10 = "Detection Prevalence"

  • 11 = "Balanced Accuracy"

  • 12 = "FDR"

num.cores, tasks

Parameters for parallel computation using the package BiocParallel: the number of cores to use, i.e. at most how many child processes will be run simultaneously (see bplapply), and the number of tasks per job (only for Linux OS).
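The two derived predictors described under column can be sketched in base R as follows. This is only an illustration under stated assumptions: the helper name relative_pos is hypothetical, the denominator is assumed to be the chromosome span (x.max - x.min), the input values are toy data, and the package's internal code may differ.

```r
## Hypothetical helper (not part of MethylIT): min-max relative position
## of cytosine sites within a single chromosome.
relative_pos <- function(x) {
  x.min <- min(x)
  x.max <- max(x)
  (x - x.min) / (x.max - x.min)   # assumes at least two distinct positions
}

pos <- c(100, 250, 400, 1000)     # toy site coordinates on one chromosome
rel <- relative_pos(pos)          # rescaled so that values lie in [0, 1]

wprob <- c(0.04, 0.001, 0.02, 1e-5)  # toy probabilities of potential DIMP
log_wprob <- log10(wprob)            # predictor used in place of wprob
```

Rescaling per chromosome keeps the position predictor comparable across chromosomes of different lengths, and the logarithm spreads out very small probabilities.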

Details

The function estimates the optimal cutpoint for the classification of the differentially methylated (cytosine) positions into two classes: DMPs from control and DMPs from treatment. The simplest approach to estimate the cutpoint is based on the Youden Index. More complex approaches based on several machine learning models are provided as well.
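The idea behind a Youden Index cutpoint can be sketched in base R. The snippet below is a minimal illustration on toy divergence values, not the package's implementation: the data, the variable names, and the rule of calling a site "treatment" when its divergence is at or above the cutpoint are all assumptions.

```r
## Toy divergence values (assumed data, not produced by MethylIT)
control   <- c(1, 2, 3, 4, 5)
treatment <- c(4, 6, 7, 8, 9)

## Candidate cutpoints: every observed divergence value
cuts <- sort(unique(c(control, treatment)))

## Youden index J = sensitivity + specificity - 1 at each candidate,
## calling a site "treatment" when its divergence is >= the cutpoint
youden <- vapply(cuts, function(ct) {
  sens <- mean(treatment >= ct)   # treatment sites correctly called
  spec <- mean(control < ct)      # control sites correctly called
  sens + spec - 1
}, numeric(1))

best_cut <- cuts[which.max(youden)]  # cutpoint with the largest J
best_J   <- max(youden)
```

The cutpoint with the largest J balances sensitivity against specificity without requiring any fitted model, which is why it is the simplest of the available approaches.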

Results of the classification performance resulting from the estimated cutpoint are normally given, except in those extreme situations where the statistics to evaluate performance cannot be estimated. More than one classifier model can be applied. For example, one classifier (a logistic model) can be used to estimate the posterior classification probabilities of DMPs into those from control and those from treatment. These probabilities are then used to estimate the cutpoint in a range of values from, say, 0.5 to 0.8. Next, a different classifier can be used to evaluate the classification performance. Different classifier models would yield different performances. Models are returned and can be used in further prediction with new datasets from the same batch experiment. This is a machine learning approach to discriminate the biological regulatory signal naturally generated in the control from the one induced by the treatment.
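The first stage of this workflow, posterior probabilities from a logistic model followed by a scan over a range of posterior cutoffs, can be sketched with base R alone. The snippet is a rough illustration on simulated data and does not reproduce the internals of estimateCutPoint: the data frame, the labels "CT" and "TT", and the use of accuracy as the selection statistic are assumptions, and the second-classifier evaluation step is omitted.

```r
set.seed(1)
n <- 200
## Simulated data: one divergence-like predictor per site (assumed)
dat <- data.frame(
  group = factor(rep(c("CT", "TT"), each = n), levels = c("CT", "TT")),
  hdiv  = c(rnorm(n, mean = 1), rnorm(n, mean = 3))
)

## Split into training and testing subsets (cf. prop = 0.6)
idx   <- sample(nrow(dat), size = round(0.6 * nrow(dat)))
train <- dat[idx, ]
test  <- dat[-idx, ]

## Logistic regression giving posterior probabilities of treatment
fit  <- glm(group ~ hdiv, data = train, family = binomial())
post <- predict(fit, newdata = test, type = "response")

## Scan posterior cutoffs (cf. cut.interval and cut.incr)
grid <- seq(0.5, 0.8, by = 0.01)
acc  <- vapply(grid, function(p) {
  pred <- ifelse(post > p, "TT", "CT")
  mean(pred == test$group)        # accuracy on the test set
}, numeric(1))

best_post_cut <- grid[which.max(acc)]  # cutoff with the best accuracy
```

Evaluating the cutoffs on held-out data rather than on the training subset gives a less optimistic estimate of how the chosen cutoff would perform on new samples.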

Value

Depending on the parameter settings, the function returns a list with the following elements:

  1. cutpoint: Cutpoint estimated.

  2. testSetPerformance: Performance evaluation on the test set.

  3. testSetModel.FDR: False discovery rate on the test set.

  4. model: Model used in the performance evaluation.

  5. modelConfMatrix: Confusion matrix for the whole dataset derived applying the model classifier used in the performance evaluation.

  6. initModel: Initial classifier model applied to estimate posterior classifications used in the cutpoint estimation.

  7. postProbCut: Posterior probability used to estimate the cutpoint.

  8. classifier: Name of the model classifier used in the performance evaluation.

  9. statistic: Name of the performance statistic used to find the cutpoint when find.cut = TRUE.

  10. optStatVal: Value of the performance statistic at the cutpoint.
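As a rough illustration of how a confusion matrix relates to performance statistics such as the false discovery rate, the sketch below uses base R on toy labels. The label vectors and variable names are assumptions, and the formulas are the standard textbook definitions, which are assumed (not confirmed here) to match what the function reports.

```r
## Toy reference labels and predictions (assumed data)
truth <- factor(c("TT", "TT", "TT", "TT", "CT", "CT", "CT", "CT", "CT", "TT"),
                levels = c("CT", "TT"))
pred  <- factor(c("TT", "TT", "TT", "CT", "CT", "CT", "TT", "CT", "CT", "TT"),
                levels = c("CT", "TT"))

## Confusion matrix: rows are predictions, columns are reference labels
conf_mat <- table(Prediction = pred, Reference = truth)

TP <- conf_mat["TT", "TT"]   # treatment sites called treatment
FP <- conf_mat["TT", "CT"]   # control sites called treatment
FN <- conf_mat["CT", "TT"]   # treatment sites called control
TN <- conf_mat["CT", "CT"]   # control sites called control

sensitivity <- TP / (TP + FN)
specificity <- TN / (TN + FP)
fdr         <- FP / (FP + TP)   # false discovery rate
```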

Examples

library(GenomicRanges) ## makeGRangesFromDataFrame(), GRangesList()
library(MethylIT)
set.seed(123) ## To set a seed for random number generation
## GRanges object of the reference with methylation levels in
## its metacolumn
num.points <- 5000
Ref <- makeGRangesFromDataFrame(
  data.frame(chr = '1',
             start = 1:num.points,
             end = 1:num.points,
             strand = '*',
             p1 = rbeta(num.points, shape1 = 1, shape2 = 1.5)),
  keep.extra.columns = TRUE)

## List of GRanges objects with the individuals' methylation levels
Indiv <- GRangesList(
  sample11 = makeGRangesFromDataFrame(
    data.frame(chr = '1',
               start = 1:num.points,
               end = 1:num.points,
               strand = '*',
               p2 = rbeta(num.points, shape1 = 1.5, shape2 = 2)),
    keep.extra.columns = TRUE),
  sample12 = makeGRangesFromDataFrame(
    data.frame(chr = '1',
               start = 1:num.points,
               end = 1:num.points,
               strand = '*',
               p2 = rbeta(num.points, shape1 = 1.6, shape2 = 2)),
    keep.extra.columns = TRUE),
  sample21 = makeGRangesFromDataFrame(
    data.frame(chr = '1',
               start = 1:num.points,
               end = 1:num.points,
               strand = '*',
               p2 = rbeta(num.points, shape1 = 40, shape2 = 4)),
    keep.extra.columns = TRUE),
  sample22 = makeGRangesFromDataFrame(
    data.frame(chr = '1',
               start = 1:num.points,
               end = 1:num.points,
               strand = '*',
               p2 = rbeta(num.points, shape1 = 41, shape2 = 4)),
    keep.extra.columns = TRUE))
## To estimate Hellinger divergence using only the methylation levels.
HD <- estimateDivergence(ref = Ref, indiv = Indiv, meth.level = TRUE,
                         columns = 1)
## To perform the nonlinear regression analysis
nlms <- nonlinearFitDist(HD, column = 4, verbose = FALSE)

## Next, the potential signal can be estimated
PS <- getPotentialDIMP(LR = HD, nlms = nlms, div.col = 4, alpha = 0.05)

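## Finally, estimate the cutpoint with the Youden Index (simple = TRUE).
## 'interaction' is not a named argument in Usage; it is passed through '...'
cutpoint <- estimateCutPoint(LR = PS, simple = TRUE, find.cut = FALSE,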
                             column = c(hdiv = TRUE, TV = TRUE,
                                        wprob = TRUE, pos = TRUE),
                             interaction = "hdiv:TV", clas.perf = FALSE,
                             control.names = c("sample11", "sample12"),
                             treatment.names = c("sample21", "sample22"))

[Package MethylIT version 0.3.1 ]