Evaluate DMPs Classification — evaluateDIMPclass • MethylIT

For a given cutpoint (previously estimated with the function estimateCutPoint), 'evaluateDIMPclass' will return the evaluation of the classification of DMPs into two clases: DMPS from control and DMPs from treatment samples.

evaluateDIMPclass(
  LR,
  control.names,
  treatment.names,
  column = c(hdiv = FALSE, jdiv = FALSE, jdiv.stat = FALSE, TV = FALSE, bay.TV = FALSE,
    wprob = FALSE, pos = FALSE),
  classifier = c("logistic", "pca.logistic", "lda", "qda", "pca.lda", "pca.qda",
    "random_forest"),
  pval.col = NULL,
  n.pc = 1,
  center = FALSE,
  scale = FALSE,
  interactions = NULL,
  output = "conf.mat",
  prop = 0.6,
  post.cut = 0.5,
  maxnodes = NULL,
  ntree = 400,
  nsplit = 1L,
  num.boot = 100,
  num.cores = 1L,
  tasks = 0L,
  seed = 1234,
  verbose = FALSE
)

Arguments

LR: An object from 'pDMP' class. including control and treatment GRanges containing divergence values for each DMP in the meta-column. LR is generated by the function 'selectDIMP' Each GRanges object must correspond to a sample. For example, if a sample is named 's1', then this sample can be accessed in the list of GRanges objects as LR$s1.
control.names: Names/IDs of the control samples, which must be include in the variable LR.
treatment.names: Names/IDs of the treatment samples, which must be included in the variable LR.
column: a logical vector for column names for the predictor variables to be used: Hellinger divergence 'hdiv', total variation 'TV', TV estimated with Bayesian correction of methylation levels 'bay.TV', probability of potential DMP 'wprob', and the relative cytosine site position 'pos' in respect to the chromosome where it is located. The relative position is estimated as (x - x.min)/(x.max - x), where x.min and x.max are the maximum and minimum for the corresponding chromosome, respectively. If 'wprob = TRUE', then Logarithm base-10 of 'wprob' will be used as predictor in place of 'wprob'.
classifier: Classification model to use. Option 'logistic' applies a logistic regression model; option 'lda' applies a Linear Discriminant Analysis (LDA); 'qda' applies a Quadratic Discriminant Analysis (QDA), 'pca.logistic' applies logistic regression model using the Principal Component (PCs) estimated with Principal Component Analysis (PCA) as predictor variables. pca.lda' applies LDA using PCs as predictor variables, and the option 'pca.qda' applies a Quadratic Discriminant Analysis (QDA) using PCs as predictor variables.
pval.col: Column number for p-value used in the performance analysis and estimation of the cutpoints. Default: NULL. If NULL it is assumed that the column is named 'wprob'.
n.pc: Number of principal components (PCs) to use if the classifier is not 'logistic'. In the current case, the maximun number of PCs is 4.
center: A logical value indicating whether the variables should be shifted to be zero centered.
scale: A logical value indicating whether the variables should be
interactions: Variable interactions to consider in a logistic regression model. Any pairwise combination of the variable 'hdiv', 'TV', 'wprob', and 'pos' can be provided. For example: 'hdiv:TV', 'wprob:pos', 'wprob:TV', etc.
output: Type of output to request: output = c('conf.mat', 'mc.val', 'boot.all', 'all'). See below.
prop: Proportion to split the dataset used in the logistic regression: group versus divergence (at DMPs) into two subsets, training and testing.
post.cut: Posterior probability to decide whether a DMPs belong to treatment group. Default post.cut = 0.5.
maxnodes, ntree: Only for Random Forest classifier (randomForest, 'random_forest'). Maximum number $maxnodes$ of terminal nodes trees in the forest can have. If not given, trees are grown to the maximum possible. Parameter $ntree$ stands for the number of trees to grow. This should not be set to too small a number, to ensure that every input row gets predicted at least a few times.
nsplit: Only for Random Forest classifier. The Random Forest (randomForest, 'random_forest') package uses a C+Fortran implementation which only supports integer indexes, so any dataframe/data table/matrix with >2^31 elements (limit for integers) gives an error. The option nsplit is applied to train $ntrees=floor(ntree/nsplit)$ models with $rep(ntrees,nsplit)$ which are finally combined to obtain a forest with $ntree$. Each model in this would contain $ntrees$.
num.boot: Number of bootstrap validations to perform in the evaluation of the logistic regression: group versus divergence (at DMPs).
num.cores, tasks: Parameters for parallel computation using package BiocParallel-package: the number of cores to use, i.e. at most how many child processes will be run simultaneously (see bplapply and the number of tasks per job (only for Linux OS).
seed: Random seed used for random number generation.
verbose: if TRUE, prints the function log to stdout

Value

Setting output = 'conf.mat' will perform a logistic regression group versus divergence (at DMPs) to evaluate the discrimination between control-DMPs and treatment-DMPs. The evaluation of this classification is provided through the function 'confusionMatrix' from R package 'caret'. 'mc.val' will perform a 'num.boot'-times Monte Carlo (bootstrap) validation and return a summary. By default function 'confusionMatrix' from R package caret' randomly splits the sample into two subsets, training and testing, according to the supplied proportion 'prop' (i.e., prop = 0.6). After selecting output = 'mc.val', the function 'confusionMatrix' will be executed 'num.boot'-times, each time performing a different random split of the sample. 'boot.all' same as 'mc.val' plus a matrix with statistcs reported by 'confusionMatrix'. 'all' return a list with all the mentioned outputs.

Details

The regulatory methylation signal is also an output from a natural process that continuously takes place across the ontogenetic development of the organisms. So, we expect to see methylation signal on natural ordinary conditions. Here, to distinguish a control methylation signal from a treatment, three classification models are provided: 1) logistic, 2) Linear Discriminant Analysis (LDA) and 3) Quadratic Discriminant Analysis (QDA). In particular, four predictor variables can be used: Hellinger divergence 'hdiv', total variation 'TV', probability of potential DMP 'wprob' and DMP genomic coordinated 'pos'. Principal component analysis (PCA) is used to convert a set of observations of possibly correlated predictor variables into a set of values of linearly uncorrelated variables (principal components, PCs). The PCs are used as new, uncorrelated predictor variables for LDA, QDA, and logistic classifiers.

A classification result with low accuracy and compromising values from other classification performance indicators (see below) suggest that the treatment does not induce a significant regulatory signal different from control.

Author

Robersy Sanchez (https://genomaths.com).

Examples

## Get a data set of DMPs
data(dmps, package = 'MethylIT')

## Classification of DMPs into two clases: DMPS from control and DMPs
## from treatment samples and evaluation of the classifier performance
## (for more details see ?evaluateDIMPclass).
perf <- evaluateDIMPclass(LR = dmps,
                        column = c(hdiv = TRUE, TV = TRUE,
                                    wprob = TRUE, pos = TRUE),
                        classifier = 'lda', n.pc = 4L,
                        control.names =  c('C1', 'C2', 'C3'),
                        treatment.names = c('T1', 'T2', 'T3'),
                        center = TRUE, scale = TRUE, prop = 0.6)

## Model classification performance
perf$Performance
#> Confusion Matrix and Statistics
#> 
#>           Reference
#> Prediction  CT  TT
#>         CT  53   0
#>         TT   0 610
#>                                      
#>                Accuracy : 1          
#>                  95% CI : (0.9945, 1)
#>     No Information Rate : 0.9201     
#>     P-Value [Acc > NIR] : < 2.2e-16  
#>                                      
#>                   Kappa : 1          
#>                                      
#>  Mcnemar's Test P-Value : NA         
#>                                      
#>             Sensitivity : 1.0000     
#>             Specificity : 1.0000     
#>          Pos Pred Value : 1.0000     
#>          Neg Pred Value : 1.0000     
#>              Prevalence : 0.9201     
#>          Detection Rate : 0.9201     
#>    Detection Prevalence : 0.9201     
#>       Balanced Accuracy : 1.0000     
#>                                      
#>        'Positive' Class : TT         
#>