R: Cutpoint estimation based on Mixtures of Gamma Distributions

R Documentation

Cutpoint estimation based on Mixtures of Gamma Distributions

Description

This functions estimates cutpoint value to classify DMPs into two classes: 1) from treatment and 2) from control, based on Mixtures of Gamma Distributions. The cutpoint estimations are limited to the analysis of mixture distributions of the form: F(x) = λ G(x) + (1 - λ) H(x), where λ \in [0, 1], G(x) and F(x) are the gamma cummulative distribution functions distributions followed by the information divergences estimated for individuals from control and treatment populations, respectively.

Usage

gammaMixtCut(LR, post.cut = 0.5, div.col = NULL, tv.col = NULL,
  tv.cut = 0.25, find.cut = FALSE, control.names = NULL,
  treatment.names = NULL, column = c(hdiv = FALSE, TV = FALSE, wprob =
  FALSE, pos = FALSE), classifier = c("logistic", "pca.logistic", "lda",
  "svm", "qda", "pca.lda", "pca.qda"), prop = 0.6, clas.perf = FALSE,
  cut.interval = c(0.5, 0.8), cut.incr = 0.01, stat = 1,
  maximize = TRUE, num.cores = 1L, tasks = 0L,
  tol = .Machine$double.eps^0.5, maxiter = 1000, ...)

Arguments

`LR`	A "pDMP"or "InfDiv" object obtained with functions `getPotentialDIMP` or `estimateDivergence`. These are list of GRanges objects, where each GRanges object from the list must have at least two columns: a column containing the total variation of methylation level (TV, difference of methylation levels) and a column containing a divergence of methylation levels (it could be TV or Hellinger divergence).
`post.cut`	Posterior probability to dicide whether a DMPs belong to treatment group. Default post.cut = 0.5.
`div.col`	Column number for divergence of methylation levels used in the estimation of the cutpoints. Default: 9L (hdiv column from an InfDiv object).
`tv.col`	Column number where the total variation is located in the metadata from each GRanges object.
`tv.cut`	A cutoff for the total variation distance to be applied to each site/range. Only sites/ranges k with TVD_{k} > tv.cut are are used in the analysis. Its value must be a number 0 < tv.cut < 1. Default is tv.cut = 0.25.
`find.cut`	Logic. Wether to search for an optimal cutoff value to classify DMPs based on given specifications.
`control.names, treatment.names`	Optional. Names/IDs of the control and treatment samples, which must be include in the variable LR (default, NULL). However, these are required if any of the parameters find.cut or clas.perf are set TRUE.
`treatment.names`	Optional. Names/IDs of the treatment samples, which must be include in the variable LR (default, NULL).
`column`	a logical vector for column names for the predictor variables to be used: Hellinger divergence "hdiv", total variation "TV", probability of potential DIMP "wprob", and the relative cytosine site position "pos" in respect to the chromosome where it is located. The relative position is estimated as (x - x.min)/(x.max - x), where x.min and x.max are the maximum and minimum for the corresponding chromosome, repectively. If "wprob = TRUE", then Logarithm base-10 of "wprob" will be used as predictor in place of "wprob" ( see `evaluateDIMPclass`).
`classifier`	Classification model to use. Option "logistic" applies a logistic regression model; option "lda" applies a Linear Discriminant Analysis (LDA); "qda" applies a Quadratic Discriminant Analysis (QDA), "pca.logistic" applies logistic regression model using the Principal Component (PCs) estimated with Principal Component Analysis (PCA) as predictor variables. pca.lda" applies LDA using PCs as predictor variables, and the option "pca.qda" applies a Quadratic Discriminant Analysis (QDA) using PCs as predictor variables. 'SVM' applies Support Vector Machines classifier from R package e1071.
`prop`	Proportion to split the dataset used in the logistic regression: group versus divergence (at DIMPs) into two subsets, training and testing.
`clas.perf`	Logic. Whether to return the classificaiton performance for the estimated cutpoint. Default, FALSE.
`cut.interval`	0 < cut.interval < 0.1. If find.cut= TRUE, the interval of treatment group posterior probabilities where to search for a cutpoint. Deafult cut.interval = c(0.5, 0.8).
`cut.incr`	0 < cut.incr < 0.1. If find.cut= TRUE, the sucesive increamental values runing on the interval cut.interval. Deafult, cut.incr = 0.01.
`stat`	An integer number indicating the statistic to be used in the testing when find.cut = TRUE. The mapping for statistic names are: 0 = "Accuracy", 1 = "Sensitivity", 2 = "Specificity", 3 = "Pos Pred Value", 4 = "Neg Pred Value", 5 = "Precision", 6 = "Recall", 7 = "F1", 8 = "Prevalence", 9 = "Detection Rate", 10 = "Detection Prevalence", 11 = "Balanced Accuracy", 12 = FDR.
`maximize`	Whether to maximize the performance indicator given in parameter 'stat'. Default: TRUE.
`num.cores, tasks`	Paramaters for parallele computation using package `BiocParallel-package`: the number of cores to use, i.e. at most how many child processes will be run simultaneously (see `bplapply` and the number of tasks per job (only for Linux OS).
`...`	Additional arameters to pass to functions `evaluateDIMPclass` and `gammamixEM`.

Details

After the estimation of potential DMPs, the pool of DMPs from control and treatment is assumed that follows mixtures of Gamma distributions corresponding to two populations. A posterior probability 2d-vector is estimated for each DMP. The cutpoint is determined from the intersection of the two gamma probabilities density distributions. That is, f(x) and g(x) are the estimatied densities for control and treatment groups, repectively, then the cutpoint is the values of x for which f(x) = g(x).

The Mixtures of Gamma Distributions (MGD) is estimated by using function gammamixEM from package *mixtools*. By default function gammamixEM produces returns a long list including the posterior probability to belong to the treatment. Here, by the sake of brevety only the information on the fitted model is given. The posterior model probability can be retrieved by using *predict* function. Accordign with MGD model, DMPs with a posterior probability to belong to the treatment group greater than *post.cut = 0.5* is classified as *DMP from treatment*. The post.cut can be modified. For all the cases 0 < post.cut < 1. The cutpoint and, hence, the classification derived throught MGD model might differ from that provided throught evaluateDIMPclass function, which includes more information about the DMP and, therefore, reports better performance. The classification perfomance reported when *clas.perf* = TRUE or *find.cut* = TRUE is created with function evaluateDIMPclass for the especified matchin learning model.

If parameter *find.cut = TRUE*, then a search for the best cutpoint in a predifined inteval (*cut.interval*) is performed calling function evaluateDIMPclass.

[Package MethylIT.utils version 0.3.1 ]