`R/estimateCutPoint.R`

`estimateCutPoint.Rd`

Given a list of two GRanges objects, control and treatment, carrying the potential signals (prior classification) from controls and treatments in terms of an information divergence (given in the metacolumns), the function estimates the cutpoint separating the control group from the treatment group.

```
estimateCutPoint(
  LR,
  control.names,
  treatment.names,
  simple = TRUE,
  column = c(hdiv = TRUE, jdiv = TRUE, jdiv.stat = FALSE, TV = TRUE, bay.TV = FALSE,
    wprob = TRUE, pos = TRUE),
  classifier1 = c("logistic", "pca.logistic", "lda", "qda", "pca.lda", "pca.qda",
    "random_forest"),
  classifier2 = NULL,
  tv.cut = 0.25,
  tv.col = NULL,
  div.col = NULL,
  clas.perf = FALSE,
  post.cut = 0.5,
  prop = 0.6,
  center = FALSE,
  scale = FALSE,
  n.pc = 1,
  interactions = NULL,
  cut.values = NULL,
  stat = 0,
  cutp_data = FALSE,
  maxnodes = NULL,
  ntree = 400,
  nsplit = 1L,
  num.cores = 1L,
  tasks = 0L,
  ...
)
```

- LR
An object from 'pDMP' class. This object is previously obtained with the function `getPotentialDIMP`.

- control.names, treatment.names
Names/IDs of the control and treatment samples, which must be included in the variable LR.

- simple
Logical (default, TRUE). If TRUE, then the Youden Index is used to estimate the cutpoint. If FALSE, the minimum information divergence value with posterior classification probability greater than *post.cut* (usually *post.cut* = 0.5), as estimated by *classifier1*, will be the reported cutpoint, unless a better cutpoint is found in the set of values provided by the user in the parameter *cut.values*.

- column
A logical vector naming the predictor variables to be used: Hellinger divergence 'hdiv', total variation 'TV', probability of potential DMP 'wprob', and the relative cytosine site position 'pos' with respect to the chromosome where it is located. The relative position is estimated as (x - x.min)/(x.max - x.min), where x.min and x.max are the minimum and maximum positions for the corresponding chromosome, respectively. If 'wprob = TRUE', then the base-10 logarithm of 'wprob' will be used as predictor in place of 'wprob'.

- classifier1, classifier2
Classification model to use. Option 'logistic' applies a logistic regression model; 'lda' applies a Linear Discriminant Analysis (LDA); 'qda' applies a Quadratic Discriminant Analysis (QDA); 'pca.logistic' applies a logistic regression model using the principal components (PCs) estimated with Principal Component Analysis (PCA) as predictor variables; 'pca.lda' applies LDA using PCs as predictor variables; 'pca.qda' applies QDA using PCs as predictor variables; and 'random_forest' applies a Random Forest classifier. If classifier2 is not NULL, then it will be used to evaluate the classification performance, and the corresponding best fitted model will be returned.

- tv.cut
A cutoff for the total variation distance to be applied to each site/range. Only sites/ranges *k* with \(TVD_{k} > tv.cut\) are used in the analysis. Its value must be a number, \(0 < tv.cut < 1\). Default is \(tv.cut = 0.25\).

- tv.col
Column number for the total variation to be used for filtering cytosine positions (if provided).

- div.col
Column number for divergence variable for which the estimation of the cutpoint will be performed.

- clas.perf
Logic. Whether to evaluate the classification performance for the estimated cutpoint using a model classifier when 'simple=TRUE'. Default, FALSE.

- post.cut
If 'simple=FALSE', this is the posterior probability threshold used to decide whether a DMP belongs to the treatment group. Default *post.cut* = 0.5.

- prop
Proportion used to split the dataset used in the logistic regression, group versus divergence (at DMPs), into two subsets: training and testing.

- center
A logical value indicating whether the variables should be shifted to be zero centered.

- scale
A logical value indicating whether the variables should be scaled to have unit variance.

- n.pc
Number of principal components (PCs) to use if the classifier is not 'logistic'. In the current case, the maximum number of PCs is 4.

- interactions
Only applies if a logistic classifier is used. Variable interactions to consider in the logistic regression model. Any pairwise combination of the variables 'hdiv', 'TV', 'wprob', and 'pos' can be provided. For example: 'hdiv:TV', 'wprob:pos', 'wprob:TV', etc.

- cut.values
Cut values of the information divergence (ID) specified in *div.col* at which to check the classification performance (0 < *cut.values* < max ID). If provided, the search for a cutpoint will include these values.

- stat
An integer number indicating the statistic to be used in the testing when *simple* = FALSE. The mapping for statistic names is:

0 = 'Accuracy'

1 = 'Sensitivity'

2 = 'Specificity'

3 = 'Pos Pred Value'

4 = 'Neg Pred Value'

5 = 'Precision'

6 = 'Recall'

7 = 'F1'

8 = 'Prevalence'

9 = 'Detection Rate'

10 = 'Detection Prevalence'

11 = 'Balanced Accuracy'

12 = 'FDR'

- cutp_data
logical(1) (optional). If TRUE, and simple = TRUE, then a data frame for further analysis or estimation of the optimal cutpoint based only on the selected divergence is provided.

- maxnodes, ntree
Only for the Random Forest classifier (`randomForest`). Parameter maxnodes is the maximum number of terminal nodes that trees in the forest can have. If not given, trees are grown to the maximum possible. Parameter ntree stands for the number of trees to grow. This should not be set to too small a number, to ensure that every input row gets predicted at least a few times.

- nsplit
Only for the Random Forest classifier. The Random Forest (`randomForest`) package uses a C+Fortran implementation which only supports integer indexes, so any data frame/data table/matrix with more than 2^31 elements (the limit for integers) gives an error. The option nsplit is applied to train nsplit models, each with \(ntrees = floor(ntree/nsplit)\) trees, i.e., rep(ntrees, nsplit), which are finally combined to obtain a forest with ntree trees.

- num.cores, tasks
Parameters for parallel computation using the package `BiocParallel-package`: the number of cores to use, i.e. at most how many child processes will be run simultaneously (see `bplapply`), and the number of tasks per job (only for Linux OS).

- ...
arguments passed to or from other methods.
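
As a small arithmetic illustration of the nsplit description, the sketch below computes the per-model tree counts with base R only. It follows the formula as stated above; how the package treats any remainder when ntree is not a multiple of nsplit is not stated here, so the last comment is an observation about the formula, not about the package's behavior.

```r
# Hypothetical settings, chosen so that ntree is not a multiple of nsplit
ntree  <- 400L
nsplit <- 3L

# Trees per sub-model, following ntrees = floor(ntree / nsplit)
ntrees <- floor(ntree / nsplit)

# One entry per sub-model, as in rep(ntrees, nsplit)
sizes <- rep(ntrees, nsplit)
sizes
#> [1] 133 133 133

# Total number of trees after combining the sub-models under this formula
sum(sizes)
#> [1] 399
```

With the default nsplit = 1L, a single model with all ntree trees is trained.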

Depending on the parameter settings, the function returns a list with the following elements:

cutpoint: Cutpoint estimated.

testSetPerformance: Performance evaluation on the test set.

testSetModel.FDR: False discovery rate on the test set.

model: Model used in the performance evaluation.

modelConfMatrix: Confusion matrix for the whole dataset derived applying the model classifier used in the performance evaluation.

initModel: Initial classifier model applied to estimate posterior classifications used in the cutpoint estimation.

postProbCut: Posterior probability used to estimate the cutpoint.

classifier: Name of the model classifier used in the performance evaluation.

statistic: Name of the performance statistic used to find the cutpoint when *simple* = FALSE.

optStatVal: Value of the performance statistic at the cutpoint.

The function estimates the optimal cutpoint for the classification of the differentially methylated (cytosine) positions into two classes: DMPs from control and DMPs from treatment. The simplest approach to estimate the cutpoint is based on the application of the Youden Index. More complex approaches based on several machine learning models are provided as well.
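
The Youden Index approach can be illustrated with a minimal base R sketch. The data are simulated and the helper `youden_cutpoint()` is hypothetical (it is not part of the package); it only shows the idea of choosing the divergence value that maximizes sensitivity + specificity - 1.

```r
# Simulated divergence values (hypothetical data, not from the package)
set.seed(1)
div   <- c(rnorm(200, mean = 2), rnorm(200, mean = 4))
group <- rep(c("control", "treatment"), each = 200)

# Youden index J(c) = sensitivity(c) + specificity(c) - 1, where a site is
# classified as 'treatment' when its divergence is >= c
youden_cutpoint <- function(x, y, positive = "treatment") {
    cuts <- sort(unique(x))
    j <- vapply(cuts, function(cp) {
        pred <- x >= cp
        sens <- mean(pred[y == positive])    # true positive rate
        spec <- mean(!pred[y != positive])   # true negative rate
        sens + spec - 1
    }, numeric(1))
    list(cutpoint = cuts[which.max(j)], youden = max(j))
}

res <- youden_cutpoint(div, group)
res$cutpoint   # candidate value with the largest Youden index
res$youden     # value of the index at that cutpoint
```

Only the single divergence variable enters this calculation, which is why this option is the simplest one.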

Results of the classification performance resulting from the estimated cutpoint are normally given, with the exception of those extreme situations where the statistics to evaluate performance cannot be estimated. More than one classifier model can be applied. For example, one classifier (logistic model) can be used to estimate the posterior classification probabilities of DMPs into those from control and those from treatment. These probabilities are then used to estimate the cutpoint in a range of values from, say, 0.5 to 0.8. Next, a different classifier can be used to evaluate the classification performance. Different classifier models would yield different performances. Models are returned and can be used in further prediction with new datasets from the same batch experiment. This is a machine learning approach to discriminate the biological regulatory signal naturally generated in the control from that induced by the treatment.
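
A minimal base R sketch of the posterior-probability idea follows, under stated assumptions: the data are simulated, `glm()` stands in for the first classifier, and the names `dat`, `prop`, and `post_cut` are illustrative only. It is not the package's implementation, and it uses a single predictor to keep the logic visible.

```r
# Simulated divergences for control and treatment sites (hypothetical data)
set.seed(2)
n <- 300
dat <- data.frame(
    hdiv  = c(rgamma(n, shape = 2), rgamma(n, shape = 6)),
    group = factor(rep(c("control", "treatment"), each = n))
)

prop     <- 0.6   # proportion of sites used for training
post_cut <- 0.5   # posterior probability threshold

# Split into training and testing subsets
idx   <- sample(nrow(dat), size = floor(prop * nrow(dat)))
train <- dat[idx, ]
test  <- dat[-idx, ]

# Logistic regression: group versus divergence
fit <- glm(group ~ hdiv, family = binomial, data = train)

# Posterior probability of belonging to the 'treatment' class
post <- predict(fit, newdata = dat, type = "response")

# Cutpoint: minimum divergence with posterior probability > post_cut
cutpoint <- min(dat$hdiv[post > post_cut])

# Accuracy on the test set when classifying with the cutpoint
pred <- ifelse(test$hdiv >= cutpoint, "treatment", "control")
acc  <- mean(pred == test$group)
c(cutpoint = cutpoint, accuracy = acc)
```

Raising `post_cut` toward 0.8 moves the cutpoint to larger divergence values, trading sensitivity for specificity.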

Notice that the estimation of an optimal cutpoint based on the application of the Youden Index (simple = TRUE) only uses the information provided by the selected information divergence. As a result, classification results based on only one variable can be poor or can fail. However, option simple = FALSE uses the information from several variables following a machine-learning (ML) approach.

Nevertheless, when simple = TRUE, an ML model classifier can still be built
using the estimated optimal cutpoint by setting clas.perf = TRUE. Such an ML
model can be used for predictions in further analyses with the function
`predictDIMPclass`.
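
The predictor variables described for the column argument and the tv.cut filter can be sketched in base R as follows. The data frame `sites` and the helper `rel_pos()` are hypothetical, and the sketch assumes the relative position is a min-max scaling of the site coordinate within each chromosome; the package's internal code may differ.

```r
# Toy table of cytosine sites (hypothetical values)
sites <- data.frame(
    chr   = c("chr1", "chr1", "chr1", "chr2", "chr2"),
    start = c(100, 550, 1000, 20, 420),
    TV    = c(0.10, 0.40, 0.90, 0.30, 0.75),
    wprob = c(0.2, 0.01, 0.001, 0.05, 0.0001)
)

# Relative position within a chromosome: (x - x.min) / (x.max - x.min)
rel_pos <- function(x) {
    rng <- max(x) - min(x)
    if (rng == 0) return(rep(0, length(x)))  # single-position chromosome
    (x - min(x)) / rng
}
sites$pos <- ave(sites$start, sites$chr, FUN = rel_pos)

# Base-10 logarithm of 'wprob', used in place of 'wprob'
sites$logP <- log10(sites$wprob)

# Keep only sites whose total variation exceeds the cutoff
tv_cut <- 0.25
kept <- sites[sites$TV > tv_cut, ]
kept
```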

```
## Load a dataset of potential signals from the package and
## perform the cutpoint estimation
library(MethylIT)
data(PS)
cutp <- mlCutpoint(LR = PS,
                   column = c(hdiv = TRUE, TV = TRUE,
                              wprob = TRUE, pos = TRUE),
                   classifier1 = 'qda', n.pc = 4,
                   control.names = c('C1', 'C2', 'C3'),
                   treatment.names = c('T1', 'T2', 'T3'),
                   tv.cut = 0.68, prop = 0.6,
                   cut.values = seq(114, 118, 1),
                   div.col = 9L)
cutp
#> Cutpoint estimation with 'qda' classifier
#> Cutpoint search performed using model posterior probabilities
#>
#> Posterior probability used to get the cutpoint = 0.5
#> Cytosine sites with treatment PostProbCut >= 0.5 have a
#> divergence value >= 117.0424
#>
#> Optimized statistic: Accuracy = 1
#> Cutpoint = 117.04
#>
#> Model classifier 'qda'
#>
#> The accessible objects in the output list are:
#>                    Length Class           Mode
#> cutpoint            1     -none-          numeric
#> testSetPerformance  6     confusionMatrix list
#> testSetModel.FDR    1     -none-          numeric
#> model              13     qdaDMP          list
#> modelConfMatrix     6     confusionMatrix list
#> initModel           1     -none-          character
#> postProbCut         1     -none-          numeric
#> postCut             1     -none-          numeric
#> classifier          1     -none-          character
#> statistic           1     -none-          character
#> optStatVal          1     -none-          numeric
#> cutpData            1     -none-          logical
```