Department of Biology and Plant Science.

Pennsylvania State University, University Park, PA 16802

Maintainer: Robersy Sanchez

Pennsylvania State University, University Park, PA 16802

Maintainer: Robersy Sanchez

`vignettes/pca_lda_with_methylit.Rmd`

`pca_lda_with_methylit.Rmd`

Abstract

When methylation analysis is intended for diagnostic/prognostic purposes, for example, in clinical applications for patient diagnostics, to know whether the patient would be in healthy or disease stage we would like to have a good predictor tool in our side. It turns out that classical machine learning (ML) tools like hierarchical clustering, principal components and linear discriminant analysis can help us to reach such a goal. The current Methyl-IT downstream analysis is equipped with the mentioned ML tools.

The discrimination of the methylation signal from the stochastic methylation background resultant from the standard (non-stressful) biological processes is a critical step for the genome-wide methylation analysis. Such a discrimination requires for the knowledge of the probability distribution of the information divergence of methylation levels and a proper evaluation of the classification performance of differentially methylated positions (DMPs) into two classes: DMPs from control and DMPs from treatment.

Once a methylation analysis is performed to detect the genome-wide DMPs, a natural next step is to investigate whether the detected DMPs carries enough discriminatory power to discern between individual groups. Then, we must move from the methylation analysis to statistical/machine learning field (multivariate statistics). To reach our goal, each individual can be represented as a vector from the \(N\)-dimensional space of DMPs, where DMPs can be expressed in terms of some suitable measurement as function of the methylation levels. These suitable measurement are information divergences, such as total variation distance (\(TV_d\), the absolute value of methylation levels) and Hellinger divergence (\(H\)). It is worthy to observe that methylation levels are uncertainty measurements and, therefore, they do not carry information per se. Information is only carried on measurements expressing uncertainty variations like \(TV_d\), \(H\), Jensen-Shannon Divergence, and in general, the so called information theoretical divergences (1).

The \(N\)-dimensional space of DMPs
is, usually, too big. So, a first dimension reduction is required. This
can be done by splitting the chromosomes into non-overlapping genomic
region and next computing some statistic like the *sum* or
*mean* of some information divergences of methylation levels at
DMPs inside of each region. Still we will have to deal with a
multidimensional space, but acceptable for a meaningful application of
hierarchical clustering (HC), principal component analysis (PCA) and
linear discriminant analysis (LDA). In the current examples, Methyl-IT
methylation analysis will be applied to a dataset of simulated samples
to detect DMPs on then. Next, after space dimension reduction, the
mentioned machine learning approached will be applied.

For the current example on methylation analysis with Methyl-IT we will use simulated data. Read-count matrices of methylated and unmethylated cytosine are generated with MethylIT.utils function simulateCounts. A basic example generating datasets is given in the web page: Methylation analysis with Methyl-IT

```
library(MethylIT)
library(MethylIT.utils)
library(ggplot2)
library(ape)
alpha.ct <- 0.019
alpha.g1 <- 0.022
alpha.g2 <- 0.025
# The number of cytosine sites to generate
sites = 1.5e5
# Set a seed for pseudo-random number generation
set.seed(124)
control.nam <- c("C1", "C2", "C3", "C4", "C5")
treatment.nam1 <- c("T1", "T2", "T3", "T4", "T5")
treatment.nam2 <- c("T6", "T7", "T8", "T9", "T10")
# Reference group
ref0 = simulateCounts(num.samples = 3, sites = sites, alpha = alpha.ct,
beta = 0.5, size = 50, theta = 4.5,
sample.ids = c("R1", "R2", "R3"))
# Control group
ctrl = simulateCounts(num.samples = 5, sites = sites, alpha = alpha.ct,
beta = 0.5, size = 50, theta = 4.5,
sample.ids = control.nam)
# Treatment group II
treat = simulateCounts(num.samples = 5, sites = sites, alpha = alpha.g1,
beta = 0.5, size = 50, theta = 4.5,
sample.ids = treatment.nam1)
# Treatment group II
treat2 = simulateCounts(num.samples = 5, sites = sites, alpha = alpha.g2,
beta = 0.5, size = 50, theta = 4.5,
sample.ids = treatment.nam2)
```

A reference sample (virtual individual) is created using individual samples from the control population using function poolFromGRlist. The reference sample is further used to compute the information divergences of methylation levels, \(TV_d\) and \(H\), with function estimateDivergence. This is a first fundamental step to remove the background noise (fluctuations) generated by the inherent stochasticity of the molecular processes in the cells.

```
# === Methylation level divergences ===
# Reference sample
ref = poolFromGRlist(ref0, stat = "mean", num.cores = 4L, verbose = FALSE)
divs <- estimateDivergence(
ref = ref,
indiv = c(ctrl, treat, treat2),
Bayesian = TRUE,
and.min.cov = FALSE,
num.cores = 6L,
percentile = 1,
verbose = FALSE)
# To remove hd == 0 to estimate. The methylation signal only is given for
divs = lapply(divs, function(div) div[ abs(div$hdiv) > 0 ], keep.attr = TRUE)
names(divs) <- names(divs)
divs
```

```
## InfDiv object of length: 15
## -------
## $C1
## GRanges object with 26720 ranges and 9 metadata columns:
## seqnames ranges strand | c1 t1 c2 t2
## <Rle> <IRanges> <Rle> | <numeric> <numeric> <numeric> <numeric>
## [1] 1 7 * | 2 42 6 38
## [2] 1 12 * | 23 62 0 85
## [3] 1 19 * | 12 27 0 39
## [4] 1 22 * | 13 27 0 40
## [5] 1 31 * | 58 49 80 27
## ... ... ... ... . ... ... ... ...
## [26716] 1 149977 * | 11 35 34 12
## [26717] 1 149979 * | 16 34 0 50
## [26718] 1 149990 * | 24 54 0 78
## [26719] 1 149994 * | 25 51 76 0
## [26720] 1 149995 * | 16 63 0 79
## p1 p2 TV bay.TV hdiv
## <numeric> <numeric> <numeric> <numeric> <numeric>
## [1] 0.0721422 0.13768432 0.0909091 0.0655421 0.526471
## [2] 0.2659280 0.00327293 -0.2705882 -0.2626551 19.800862
## [3] 0.2920976 0.00697775 -0.3076923 -0.2851199 9.313997
## [4] 0.3068138 0.00681017 -0.3250000 -0.3000036 10.213202
## [5] 0.5191743 0.73900460 0.2056075 0.2198303 5.688769
## ... ... ... ... ... ...
## [26716] 0.235823 0.71967696 0.500000 0.483854 11.7687
## [26717] 0.305641 0.00549133 -0.320000 -0.300150 13.0602
## [26718] 0.299161 0.00356061 -0.307692 -0.295601 20.8075
## [26719] 0.318165 0.98256166 0.671053 0.664396 51.1029
## [26720] 0.203882 0.00351646 -0.202532 -0.200366 13.2064
## -------
## seqinfo: 1 sequence from an unspecified genome; no seqlengths
##
## ...
## <14 more GRanges elements>
## -------
```

To get some statistical description about the sample is useful. Here,
empirical critical values for the probability distribution of \(H\) and \(TV\) can is obtained using
*quantile* function from the R package *stats*.

```
critical.val <- do.call(rbind, lapply(divs, function(x) {
x <- x[x$hdiv > 0]
hd.95 = quantile(x$hdiv, 0.95)
tv.95 = quantile(abs(x$bay.TV), 0.95)
return(c(tv = tv.95, hd = hd.95))
}))
critical.val
```

```
## tv.95% hd.95%
## C1 0.6523406 62.26763
## C2 0.6500765 60.89674
## C3 0.6508274 61.48960
## C4 0.4858965 56.63342
## C5 0.4970348 58.16023
## T1 0.9236731 126.34191
## T2 0.9136241 120.45367
## T3 0.9137810 121.99373
## T4 0.9192143 124.54298
## T5 0.9225918 124.25236
## T6 0.9447922 132.69308
## T7 0.9447348 134.88441
## T8 0.9408379 131.80496
## T9 0.9482600 134.68997
## T10 0.9456255 133.87163
```

Here, the methylation signal is expressed in terms of Hellinger divergence of methylation levels. Here, the signal distribution is modelled by a Weibull probability distribution model. Basically, the model could be a member of the generalized gamma distribution family. For example, it could be gamma distribution, Weibull, or log-normal. To describe the signal, we may prefer a model with a cross-validations: R.Cross.val > 0.95. Cross-validations for the nonlinear regressions are performed in each methylome as described in (2). The non-linear fit is performed through the function nonlinearFitDist

The best model excluding “GGamma3P” model, for which the computation is computationally expensive:

```
d <- c("Weibull2P", "Weibull3P", "Gamma2P", "Gamma3P")
gof <- gofReport(HD = divs, column = 9L, model = d, num.cores = 6L,
output = "all", verbose = FALSE)
```

`gof$bestModel`

```
## C1 C2 C3 C4 C5 T1
## "Weibull2P" "Weibull2P" "Weibull2P" "Weibull2P" "Weibull2P" "Weibull2P"
## T2 T3 T4 T5 T6 T7
## "Weibull2P" "Weibull2P" "Weibull2P" "Weibull2P" "Weibull2P" "Weibull2P"
## T8 T9 T10
## "Weibull2P" "Weibull2P" "Weibull2P"
```

Although we did not included the main member of the generalized gamma
distribution family (“GGamma3P”), the Cross-validations correlation
coefficient *w3p_R.Cross.val* values above 0.99 gives us some
peace of mind on our confidence on the model prediction
power/performance when confronted with external data sets (AIC and BIC
are not conceived to provide such an information).

`gof$stats`

```
## w2p_AIC w2p_R.Cross.val w3p_AIC w3p_R.Cross.val g2p_AIC g2p_R.Cross.val
## C1 -172642.2 0.9994639 Inf 0 -164820.4 0.9993928
## C2 -175595.0 0.9995130 Inf 0 -167532.7 0.9994499
## C3 -177238.5 0.9995419 Inf 0 -169774.2 0.9994861
## C4 -219212.7 0.9994948 Inf 0 -206296.0 0.9993131
## C5 -217448.4 0.9994364 Inf 0 -205264.1 0.9992280
## T1 -168080.2 0.9972702 Inf 0 -160888.3 0.9963827
## T2 -167285.4 0.9972727 Inf 0 -160748.1 0.9964683
## T3 -170302.3 0.9974339 Inf 0 -163998.4 0.9966883
## T4 -168551.4 0.9973218 Inf 0 -162169.8 0.9965314
## T5 -167122.8 0.9972268 Inf 0 -160694.8 0.9964093
## T6 -173245.9 0.9971876 Inf 0 -165656.8 0.9962175
## T7 -170905.5 0.9970145 Inf 0 -163342.2 0.9959969
## T8 -171372.8 0.9970756 Inf 0 -163649.6 0.9960768
## T9 -170097.3 0.9969614 Inf 0 -162250.6 0.9958859
## T10 -171267.1 0.9970285 Inf 0 -163857.5 0.9960168
## g3p_AIC g3p_R.Cross.val bestModel
## C1 Inf 0 w2p
## C2 Inf 0 w2p
## C3 Inf 0 w2p
## C4 Inf 0 w2p
## C5 Inf 0 w2p
## T1 Inf 0 w2p
## T2 Inf 0 w2p
## T3 Inf 0 w2p
## T4 Inf 0 w2p
## T5 Inf 0 w2p
## T6 Inf 0 w2p
## T7 Inf 0 w2p
## T8 Inf 0 w2p
## T9 Inf 0 w2p
## T10 Inf 0 w2p
```

The above statistical description of the dataset (evidently) suggests that there two main groups: control and treatments, while treatment group would split into two subgroups of samples. In the current case, to search for a good cutpoint, we do not need to use all the samples. The critical value \(H_{\alpha=0.05}=33.51223\) suggests that any optimal cutpoint for the subset of samples T1 to T5 will be optimal for the samples T6 to T10 as well.

Below, we are letting the PCA+LDA model classifier to take the decision on whether a differentially methylated cytosine position is a treatment DMP. To do it, Methyl-IT function getPotentialDIMP is used to get methylation signal probabilities of the observed \(H\) values for all cytosine site (alpha = 1), in accordance with the 2-parameter Weibull distribution model. Next, this information will be used to identify DMPs using Methyl-IT function estimateCutPoint. Cytosine positions with \(H\) values above the cutpoint are considered DMPs. Finally, a PCA + QDA model classifier will be fitted to classify DMPs into two classes: DMPs from control and those from treatment.

```
dmps <- getPotentialDIMP(LR = divs,
nlms = gof$nlms,
div.col = 9L,
tv.cut = 0.461,
tv.col = 8,
alpha = 0.05,
dist.name = gof$bestModel)
cutp = estimateCutPoint(LR = dmps,
simple = FALSE,
column = c(hdiv = TRUE, TV = TRUE,
wprob = TRUE, pos = TRUE),
classifier1 = "pca.lda",
classifier2 = "pca.qda",
control.names = control.nam,
treatment.names = treatment.nam1,
post.cut = 0.5,
cut.values = seq(15, 38, 1),
clas.perf = TRUE,
prop = 0.6,
center = FALSE,
scale = FALSE,
n.pc = 4,
div.col = 9L,
stat = 0)
cutp
```

```
## Cutpoint estimation with 'pca.lda' classifier
## Cutpoint search performed using model posterior probabilities
##
## Posterior probability used to get the cutpoint = 0.5
## Cytosine sites with treatment PostProbCut >= 0.5 have a
## divergence value >= 78.68617
##
## Optimized statistic: Accuracy = 1
## Cutpoint = 78.686
##
## Model classifier 'pca.qda'
##
## The accessible objects in the output list are:
## Length Class Mode
## cutpoint 1 -none- numeric
## testSetPerformance 6 confusionMatrix list
## testSetModel.FDR 1 -none- numeric
## model 5 pcaQDA list
## modelConfMatrix 6 confusionMatrix list
## initModel 1 -none- character
## postProbCut 1 -none- numeric
## postCut 1 -none- numeric
## classifier 1 -none- character
## statistic 1 -none- character
## optStatVal 1 -none- numeric
## cutpData 1 -none- logical
```

In this case, the cutpoint \(H_{cutpoint}=92.138\) is lower from what is expected from the higher treatment empirical \(H_{\alpha = 0.05}^{TT_{Emp}}=120.45119\) critical value: \(H_{cutpoint} < H_{\alpha = 0.05}^{TT_{Emp}}\).

The model performance in in the test dataset is:

`cutp$testSetPerformance`

```
## Confusion Matrix and Statistics
##
## Reference
## Prediction CT TT
## CT 2579 0
## TT 0 6418
##
## Accuracy : 1
## 95% CI : (0.9996, 1)
## No Information Rate : 0.7133
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 1
##
## Mcnemar's Test P-Value : NA
##
## Sensitivity : 1.0000
## Specificity : 1.0000
## Pos Pred Value : 1.0000
## Neg Pred Value : 1.0000
## Prevalence : 0.7133
## Detection Rate : 0.7133
## Detection Prevalence : 0.7133
## Balanced Accuracy : 1.0000
##
## 'Positive' Class : TT
##
```

The above cutpoint can be used to identify DMPs from control and
treatment. The PCA+QDA model classifier can be used any time to
discriminate control DMPs from those treatment. DMPs are retrieved using
selectDIMP
function (`selectDIMP`

):

```
dmps <- selectDIMP(LR = divs, div.col = 9L, cutpoint = cutp$cutpoint,
tv.cut = 0.34, tv.col = 7)
sapply(dmps,length)
```

```
## C1 C2 C3 C4 C5 T1 T2 T3 T4 T5 T6 T7 T8 T9 T10
## 704 659 644 789 861 3403 3244 3234 3308 3328 3826 3888 3840 3957 3875
```

Next, to represent individual samples as vectors from the
*N*-dimensional space, we can use getGRegionsStat
function from MethylIT.utils R
package. Here, the simulated “chromosome” is split into regions of 450bp
non-overlapping windows. and the *density* of Hellinger
divergences values is taken for each windows.

```
ns <- names(dmps)
DMRs <- getGRegionsStat(GR = dmps, win.size = 350, step.size = 350,
stat = "density", column = 9L, scaling = 1)
names(DMRs) <- ns
as(DMRs, "GRangesList")
```

Hierarchical clustering (HC) is an unsupervised machine learning approach. HC can provide an initial estimation of the number of possible groups and information on their members. However, the effectivity of HC will depend on the experimental dataset, the metric used, and the agglomeration algorithm applied. For an unknown reason (and based on the personal experience of the author working in numerical taxonomy), Ward’s agglomeration algorithm performs much better on biological experimental datasets than the rest of the available algorithms like UPGMA, UPGMC, etc.

```
dmgm <- uniqueGRanges(DMRs, verbose = FALSE)
dmgm <- t(as.matrix(mcols(dmgm)))
rownames(dmgm) <- ns
sampleNames <- ns
hc = hclust(dist(data.frame( dmgm ))^2, method = 'ward.D2')
hc.rsq = hc
hc.rsq$height <- sqrt( hc$height )
```

```
colors = sampleNames
colors[grep("C", colors)] <- "green4"
colors[grep("T[6-9]{1}", colors)] <- "red"
colors[grep("T10", colors)] <- "red"
colors[grep("T[1-5]", colors)] <- "blue"
# rgb(red, green, blue, alpha, names = NULL, maxColorValue = 1)
clusters.color = c(rgb(0, 0.7, 0, 0.1), rgb(0, 0, 1, 0.1), rgb(1, 0.2, 0, 0.1))
par(font.lab=2,font=3,font.axis=2, mar=c(0,4,2,0), family="serif" , lwd = 0.4)
plot(as.phylo(hc.rsq), tip.color = colors, label.offset = 0.5, font = 2, cex = 0.9,
edge.width = 0.4, direction = "downwards", no.margin = FALSE,
align.tip.label = TRUE, adj = 0)
axisPhylo( 2, las = 1, lwd = 0.4, cex.axis = 1.4, hadj = 0.8, tck = -0.01 )
## cuts = c(xleft, ybottom, xright, ytop)
hclust_rect(hc.rsq, k = 2L, border = c("green4", "blue", "red"),
color = clusters.color, cuts = c(0.56, 1.4, 0.41, 120))
```

```
## Warning in is.null(color) || length(color) < l: 'length(x) = 2 > 1' in coercion
## to 'logical(1)'
```

Here, we have use function as.phylo from the R package ape for better dendrogram visualization and function hclust_rect from MethylIT.utils R package to draw rectangles with background colors around the branches of a dendrogram highlighting the corresponding clusters.

MethylIT function pcaLDA will be used to perform the PCA and PCA + LDA analyses. The function returns a list of two objects: 1) ‘lda’: an object of class ‘lda’ from package ‘MASS’. 2) ‘pca’: an object of class ‘prcomp’ from package ‘stats’. For information on how to use these objects see ?lda and ?prcomp.

Unlike hierarchical clustering (HC), LDA is a supervised machine learning approach. So, we must provide a prior classification of the individuals, which can be derived, for example, from the HC, or from a previous exploratory analysis with PCA.

```
# A prior classification derived from HC
grps <- cutree(hc, k = 3)
grps[grep(1, grps)] <- "CT"
grps[grep(2, grps)] <- "T1"
grps[grep(3, grps)] <- "T2"
grps <- factor(grps)
ld <- pcaLDA(data = data.frame(dmgm), grouping = grps, n.pc = 3, max.pc = 3,
scale = FALSE, center = FALSE, tol = 1e-6)
summary(ld$pca)
```

```
## Importance of first k=3 (out of 15) components:
## PC1 PC2 PC3
## Standard deviation 221.4347 35.6409 29.79811
## Proportion of Variance 0.9032 0.0234 0.01636
## Cumulative Proportion 0.9032 0.9266 0.94300
```

We may retain enough components so that the cumulative percent of
variance accounted for at least 70 to 80%. By setting \(scale=TRUE\) and \(center=TRUE\), we could have different
results and would improve or not our results. In particular, these
settings are essentials if the *N*-dimensional space is
integrated by variables from different measurement scales/units, for
example, Kg and g, or Kg and Km.

The individual coordinates in the principal components (PCs) are returned by function pcaLDA. In the current case, based on the cumulative proportion of variance, the two firsts PCs carried about the 84% of the total sample variance and could split the sample into meaningful groups.

```
pca.coord <- ld$pca$x
pca.coord
```

```
## PC1 PC2 PC3
## C1 -45.00927 7.884912 17.39353588
## C2 -43.05409 6.015694 2.75771385
## C3 -36.48632 -2.033121 2.35944370
## C4 -44.85556 -7.591736 -0.03176889
## C5 -45.62115 -0.162804 7.90475426
## T1 -293.61151 8.474484 -4.12206109
## T2 -253.43665 -30.909978 43.28911190
## T3 -224.13477 10.719154 -11.12880291
## T4 -232.37265 36.469259 0.60205009
## T5 -224.31342 22.069410 3.54331693
## T6 -290.02667 51.944972 6.58200024
## T7 -286.12184 -19.867377 -8.32584719
## T8 -259.19317 -6.488206 -89.57270641
## T9 -270.07011 -101.465917 12.84137767
## T10 -257.04937 35.129342 41.52432128
```

Next, the graphic for individual coordinates in the two firsts PCs can be easily visualized now:

```
dt <- data.frame(pca.coord[, 1:2], subgrp = grps)
p0 <- theme(
axis.text.x = element_text( face = "bold", size = 18, color="black",
# hjust = 0.5, vjust = 0.5,
family = "serif", angle = 0,
margin = margin(1,0,1,0, unit = "pt" )),
axis.text.y = element_text( face = "bold", size = 18, color="black",
family = "serif",
margin = margin( 0,0.1,0,0, unit = "mm" )),
axis.title.x = element_text(face = "bold", family = "serif", size = 18,
color="black", vjust = 0 ),
axis.title.y = element_text(face = "bold", family = "serif", size = 18,
color="black",
margin = margin( 0,2,0,0, unit = "mm" ) ),
legend.title=element_blank(),
legend.text = element_text(size = 20, face = "bold", family = "serif"),
legend.position = c(0.88, 0.14),
panel.border = element_rect(fill=NA, colour = "black",size=0.07),
panel.grid.minor = element_line(color= "white",size = 0.2),
axis.ticks = element_line(size = 0.1), axis.ticks.length = unit(0.5, "mm"),
plot.margin = unit(c(1,1,0,0), "lines"))
```

```
## Warning: The `size` argument of `element_line()` is deprecated as of ggplot2 3.4.0.
## ℹ Please use the `linewidth` argument instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
```

```
## Warning: The `size` argument of `element_rect()` is deprecated as of ggplot2 3.4.0.
## ℹ Please use the `linewidth` argument instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
```

```
ggplot(dt, aes(x = PC1, y = PC2, colour = grps)) +
geom_vline(xintercept = 0, color = "white", size = 1) +
geom_hline(yintercept = 0, color = "white", size = 1) +
geom_point(size = 6) +
scale_color_manual(values = c("green4","blue","brown1")) +
stat_ellipse(aes(x = PC1, y = PC2, fill = subgrp), data = dt, type = "norm",
geom = "polygon", level = 0.5, alpha=0.2, show.legend = FALSE) +
scale_fill_manual(values = c("green4","blue","brown1")) + p0
```

```
## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
```

In the current case, better resolution is obtained with the linear discriminant functions, which is based on the three firsts PCs. Notice that the number principal components used the LDA step must be lower than the number of individuals (\(N\)) divided by 3: \(N/3\).

```
ind.coord <- predict(ld, newdata = data.frame(dmgm), type = "scores")
dt <- data.frame(ind.coord, subgrp = grps)
p0 <- theme(
axis.text.x = element_text( face = "bold", size = 18, color="black",
# hjust = 0.5, vjust = 0.5,
family = "serif", angle = 0,
margin = margin(1,0,1,0, unit = "pt" )),
axis.text.y = element_text( face = "bold", size = 18, color="black",
family = "serif",
margin = margin( 0,0.1,0,0, unit = "mm" )),
axis.title.x = element_text(face = "bold", family = "serif", size = 18,
color="black", vjust = 0 ),
axis.title.y = element_text(face = "bold", family = "serif", size = 18,
color="black",
margin = margin( 0,2,0,0, unit = "mm" ) ),
legend.title=element_blank(),
legend.text = element_text(size = 20, face = "bold", family = "serif"),
legend.position = c(0.15, 0.83),
panel.border = element_rect(fill=NA, colour = "black",size=0.07),
panel.grid.minor = element_line(color= "white",size = 0.2),
axis.ticks = element_line(size = 0.1), axis.ticks.length = unit(0.5, "mm"),
plot.margin = unit(c(1,1,0,0), "lines"))
ggplot(dt, aes(x = LD1, y = LD2, colour = grps)) +
geom_vline(xintercept = 0, color = "white", size = 1) +
geom_hline(yintercept = 0, color = "white", size = 1) +
geom_point(size = 6) +
scale_color_manual(values = c("green4","blue","brown1")) +
stat_ellipse(aes(x = LD1, y = LD2, fill = subgrp), data = dt, type = "norm",
geom = "polygon", level = 0.5, alpha=0.2, show.legend = FALSE) +
scale_fill_manual(values = c("green4","blue","brown1")) + p0
```

`## Too few points to calculate an ellipse`

1.

F.
Liese, I. Vajda, On divergences and informations in statistics and
information theory. *IEEE Transactions on Information
Theory* **52**, 4394–4412 (2006).

2.

J.
P. Stevens, *Applied Multivariate Statistics for
the Social Sciences*, Fifth Edit (Routledge Academic,
2009).

Here is the output of `sessionInfo()`

on the system on
which this document was compiled running pandoc 2.19.2:

```
## R version 4.2.3 (2023-03-15)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 22.04.2 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.10.0
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] stats4 stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] ape_5.6-2 ggplot2_3.4.1 MethylIT.utils_0.1
## [4] MethylIT_0.3.2.6.3 rtracklayer_1.58.0 GenomicRanges_1.50.2
## [7] GenomeInfoDb_1.34.9 IRanges_2.32.0 S4Vectors_0.36.1
## [10] BiocGenerics_0.44.0
##
## loaded via a namespace (and not attached):
## [1] systemfonts_1.0.4 plyr_1.8.8
## [3] splines_4.2.3 BiocParallel_1.32.5
## [5] listenv_0.9.0 digest_0.6.31
## [7] foreach_1.5.2 htmltools_0.5.4
## [9] fansi_1.0.4 magrittr_2.0.3
## [11] memoise_2.0.1 recipes_1.0.4
## [13] globals_0.16.2 Biostrings_2.66.0
## [15] annotate_1.76.0 gower_1.0.1
## [17] matrixStats_0.63.0 sandwich_3.0-2
## [19] hardhat_1.2.0 timechange_0.2.0
## [21] pkgdown_2.0.7 colorspace_2.1-0
## [23] blob_1.2.4 textshaping_0.3.6
## [25] xfun_0.37 dplyr_1.1.0
## [27] crayon_1.5.2 RCurl_1.98-1.10
## [29] jsonlite_1.8.4 genefilter_1.80.3
## [31] survival_3.5-5 zoo_1.8-11
## [33] iterators_1.0.14 glue_1.6.2
## [35] gtable_0.3.2 ipred_0.9-13
## [37] zlibbioc_1.44.0 XVector_0.38.0
## [39] DelayedArray_0.24.0 future.apply_1.10.0
## [41] scales_1.2.1 DBI_1.1.3
## [43] Rcpp_1.0.10 xtable_1.8-4
## [45] bit_4.0.5 proxy_0.4-27
## [47] Formula_1.2-4 lava_1.7.1
## [49] prodlim_2019.11.13 httr_1.4.5
## [51] modeltools_0.2-23 farver_2.1.1
## [53] pkgconfig_2.0.3 XML_3.99-0.13
## [55] flexmix_2.3-18 nnet_7.3-18
## [57] sass_0.4.5 utf8_1.2.3
## [59] caret_6.0-93 labeling_0.4.2
## [61] tidyselect_1.2.0 rlang_1.1.0
## [63] reshape2_1.4.4 AnnotationDbi_1.60.0
## [65] munsell_0.5.0 tools_4.2.3
## [67] cachem_1.0.7 cli_3.6.0
## [69] generics_0.1.3 RSQLite_2.2.20
## [71] evaluate_0.20 stringr_1.5.0
## [73] fastmap_1.1.1 yaml_2.3.7
## [75] ragg_1.2.5 betareg_3.1-4
## [77] ModelMetrics_1.2.2.2 knitr_1.42
## [79] bit64_4.0.5 fs_1.6.1
## [81] purrr_1.0.1 randomForest_4.7-1.1
## [83] KEGGREST_1.38.0 future_1.31.0
## [85] nlme_3.1-162 compiler_4.2.3
## [87] rstudioapi_0.14 png_0.1-8
## [89] e1071_1.7-13 tibble_3.2.0
## [91] bslib_0.4.2 stringi_1.7.12
## [93] highr_0.10 desc_1.4.2
## [95] lattice_0.20-45 Matrix_1.5-3
## [97] vctrs_0.6.0 pillar_1.8.1
## [99] lifecycle_1.0.3 lmtest_0.9-40
## [101] jquerylib_0.1.4 data.table_1.14.6
## [103] bitops_1.0-7 R6_2.5.1
## [105] BiocIO_1.8.0 parallelly_1.34.0
## [107] codetools_0.2-19 MASS_7.3-58.2
## [109] SummarizedExperiment_1.28.0 proto_1.0.0
## [111] rprojroot_2.0.3 rjson_0.2.21
## [113] minpack.lm_1.2-3 withr_2.5.0
## [115] GenomicAlignments_1.34.0 Rsamtools_2.14.0
## [117] GenomeInfoDbData_1.2.9 parallel_4.2.3
## [119] grid_4.2.3 rpart_4.1.19
## [121] timeDate_4022.108 class_7.3-21
## [123] rmarkdown_2.20 nls2_0.3-3
## [125] MatrixGenerics_1.10.0 pROC_1.18.0
## [127] Biobase_2.58.0 lubridate_1.9.2
## [129] restfulr_0.0.15
```