vignettes/GenExp_Methylation_association_hyper_down-regulated.Rmd
GenExp_Methylation_association_hyper_down-regulated.Rmd
Abstract
This document is Supplementary Information for the manuscript: “Segregation of an MSH1 RNAi transgene produces heritable non-genetic memory in association with methylome reprogramming” (1). Statistical analysis supporting the association between gene expression and cytosine DNA methylation at gene-body is provided.
The classical statistical approach to analyze the association between two variables is based in correlation tests. However, small or zero correlation coefficient does not necessarily implies independence of the tested variables/processes and, consequently, does not implies the lack of association at all. There various steps involved in investigating the dependence between two random variables. The state of the art include the application of copula distributions. Herein, we provide study of the statistical dependence structure between gene expression and methylation in memory line. To uncover the association between gene expression and cytosine DNA methylation, first, the probability distributions of gene expression (in terms of logarithm of fold changes: logFC) and methylation divergences are estimated.
Next, the joint probability distribution is estimated, expressed in term of copula distribution. All the analyses are performed in R and can be reproduced as presented here. The association is investigated with the application of copula distributions.
With the exception of usefr R package, all the packages are available in Bioconductor and in CRAN repository. Package usefr is available in GitHub at https://github.com/genomaths/usefr You can install usefr package from GitHub typing the following command in the R console:
devtools::install_git("https://github.com/genomaths/usefr.git")
## Warning in fun(libname, pkgname): couldn't connect to display ":0"
Gene expression data sets were derived using the R package edgeR. All the genes with the minimal quality of gene expression differences are included in the analyses. That is, to evaluate the existence of an association between gene expression and cytosine methylation all the genes with meaningful expression and methylation values must be included, regardless they would be differentially methylated and/or differentially expressed or not.
Gene expression data for Arabidopsis Memory first generation (G1)
## ===== Download methylation data from Github ========
url <- paste0("github.com/genomaths/genomaths.github.io/raw/master/supp-inf/",
"mm_nm_wt_transgen_meth_density_and_gene-exp_at_genes_06-01-19.",
"RData")
temp <- tempfile(fileext = ".RData")
download.file(url = url, destfile = temp)
load(temp)
file.remove(temp); rm(temp, url)
## Gene expresion data set
deg_mm1_wt1
## An object of class "DGEExact"
## $table
## logFC logCPM PValue adj.pval
## AT1G01010 -0.60092985 3.066880 9.377730e-05 0.0004180724
## AT1G01020 -0.60729200 2.913263 2.803733e-05 0.0001409788
## AT1G01030 -0.25438232 -1.478816 5.857299e-01 0.6890000078
## AT1G01040 0.03597925 5.632805 7.833824e-01 0.8505857888
## AT1G01046 -1.61271935 -2.996378 2.235591e-02 0.0498548032
## 20284 more rows ...
##
## $comparison
## [1] "WT1" "MM1"
##
## $genes
## genes.seqnames genes.start genes.end genes.strand genes.type
## AT1G01010 1 3631 5899 + gene
## AT1G01020 1 6788 9130 - gene
## AT1G01030 1 11649 13714 - gene
## AT1G01040 1 23121 31227 + gene
## AT1G01046 1 28500 28706 + gene
## genes.gene_biotype genes.gene_id genes.gene_name length
## AT1G01010 protein_coding AT1G01010 NAC001 2269
## AT1G01020 protein_coding AT1G01020 ARV1 2343
## AT1G01030 protein_coding AT1G01030 NGA3 2066
## AT1G01040 protein_coding AT1G01040 DCL1 8107
## AT1G01046 miRNA AT1G01046 MIR838A 207
## 20284 more rows ...
Methylation dataset contains the densities of methylation levels from control (Col-0) and Memory lines samples at DMP on gene-body regions. Densities were computed using function getGRegionsStat from the R package MethylIT.utils.
Methylation data for a memory G1 sample
dmp_den_genes[, "M_1_105"]
## GRanges object with 33164 ranges and 1 metadata column:
## seqnames ranges strand | M_1_105
## <Rle> <IRanges> <Rle> | <numeric>
## AT1G01010 1 3631-5899 * | 0.841607
## AT1G01020 1 6788-9130 * | 0.389668
## AT1G03987 1 11101-11372 * | 0.000000
## AT1G01030 1 11649-13714 * | 0.000000
## AT1G01040 1 23121-31227 * | 6.776141
## ... ... ... ... . ...
## AT5G09945 5 26966885-26967079 * | 0.00000
## AT5G67630 5 26967378-26969400 * | 1.28801
## AT5G67640 5 26969516-26970668 * | 0.00000
## AT5G09955 5 26971389-26971689 * | 1.05172
## AT5G09965 5 26972177-26972644 * | 0.00000
## -------
## seqinfo: 5 sequences from an unspecified genome; no seqlengths
The analysis will be performed with Memory line (M1) and wild type (WT1) control, both of them, MM1 and WT1, first generation G1.
idx <- na.omit(match(rownames(deg_mm1_wt1), names(dmp_den_genes)))
dmp_den_m1 <- dmp_den_genes[idx, c("M_1_105", "M_1_168", "M_1_179", "M_1_27", "M_1_3")]
dmp_den_wt1 <- dmp_den_genes[idx, c( "W_1_1", "W_1_3", "W_1_4", "W_1_5")]
The means of methylation levels densities are computed for each group
mcols(dmp_den_wt1) <- rowMeans(as.matrix(mcols(dmp_den_wt1)))
mcols(dmp_den_m1) <- rowMeans(as.matrix(mcols(dmp_den_m1)))
The absolute difference of the group density means of methylation levels between control (Col-0) and Memory lines at gene-body regions (G1).
dmp_den_wt1_mm1 <- dmp_den_m1
mcols(dmp_den_wt1_mm1) <- as.matrix(mcols(dmp_den_m1)) - as.matrix(mcols(dmp_den_wt1))
colnames(mcols(dmp_den_wt1_mm1)) <- "dmp_den"
idx <- which(dmp_den_wt1_mm1$dmp_den > 0)
dmp_den_wt1_mm1 <- dmp_den_wt1_mm1[idx]
dmp_den_wt1_mm1
## GRanges object with 20022 ranges and 1 metadata column:
## seqnames ranges strand | dmp_den
## <Rle> <IRanges> <Rle> | <numeric>
## AT1G01010 1 3631-5899 * | 0.734204
## AT1G01020 1 6788-9130 * | 0.304498
## AT1G01030 1 11649-13714 * | 0.180618
## AT1G01040 1 23121-31227 * | 6.571474
## AT1G01046 1 28500-28706 * | 2.597185
## ... ... ... ... . ...
## AT4G40060 4 18571239-18573084 * | 0.166230
## AT4G40065 4 18574503-18575585 * | 0.121346
## AT4G40070 4 18576216-18577774 * | 0.151446
## AT4G40085 4 18579194-18580702 * | 0.161783
## AT4G40080 4 18579314-18580717 * | 0.141613
## -------
## seqinfo: 5 sequences from an unspecified genome; no seqlengths
Final variables: |group means density differences| and \(|logFC|\). For the sake of brevity and clarity, we shall call the first variable: divergence of methylation densities or simply methylation divergence (\(MD\)).
.deg_mm1_wt1 <- deg_mm1_wt1[match(names(dmp_den_wt1_mm1), rownames(deg_mm1_wt1)), ]
MD <- dmp_den_wt1_mm1$dmp_den # Methylation (|group mean density differences|)
idx <- which(.deg_mm1_wt1$table$logFC < 0 ) # Down-regulated
logFC <- abs(.deg_mm1_wt1$table$logFC[idx]) # Gene Expression (|logFC|)
MD <- MD[idx]
DataFrame(MD,logFC)
## DataFrame with 13962 rows and 2 columns
## MD logFC
## <numeric> <numeric>
## 1 0.734204 0.600930
## 2 0.304498 0.607292
## 3 0.180618 0.254382
## 4 2.597185 1.612719
## 5 0.823158 0.515993
## ... ... ...
## 13958 0.383503 0.371768
## 13959 0.532277 0.136153
## 13960 0.166230 0.201576
## 13961 0.161783 0.115828
## 13962 0.141613 1.628195
There are 13962 genes to be included in the downstream analysis.
Searching for the best fitted distribution functions by using R functions fitdistr from MASS R package and fitCDF from usefr R package did not yield a suitable results. The best fitted model was found for a mixture distribution model obtained with the function fitMixDist from usefr R package. The best fitted model for \(MD\), from comparison MM1 vs WT1, was a mixture of Normal and Weibull distributions.
We have left the algorithm to estimate the initial value
x <- MD[ MD > 0] # Weibull distribution is defineed for MD > 0.
meth_FIT <- fitMixDist(X = x, args = list(norm = c(mean = NA, sd = NA),
weibull = c(shape = NA, scale = NA)),
dens = TRUE, npoints = 900)
## fitting ...
##
|
| | 0%
|
|=================================== | 50%
|
|======================================================================| 100%
## *** Performing nonlinear regression model crossvalidation...
summary(meth_FIT$fit)
##
## Parameters:
## Estimate Std. Error t value Pr(>|t|)
## mean 0.280665 0.001606 174.7 <2e-16 ***
## sd 0.173905 0.001340 129.8 <2e-16 ***
## shape 2.111314 0.019854 106.3 <2e-16 ***
## scale 1.386690 0.008022 172.9 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.02188 on 1022 degrees of freedom
## Number of iterations to termination: 13
## Reason for termination: Relative error in the sum of squares is at most `ftol'.
cat("\n Mixture proportions (phi):", meth_FIT$phi)
##
## Mixture proportions (phi): 0.4118775 0.5881225
The graphic for histogram and density distribution curves:
par(mar = c(4, 4, 2, 1), font.lab=2, font=2,font.axis=2,family="serif" , lwd = 1.2, cex = 1.4)
hist(x, 240, freq = FALSE, las = 1, #ylim = c(0, 0.021), cex = 1,
panel.first={points(0, 0, pch=16, cex=1e6, col="grey95")
grid(col="white", lty = 1)}, main = text,
cex.main = 0.9, xlim = c(0, 5), family = "serif",
col = "cyan1", border = "deepskyblue", xaxt ="n", yaxt = "n", ann = FALSE)
axis(1, padj = -1, las = 1, lwd = 0.8, tck = -0.02)
axis(2, hadj = 0.7, las = 1, lwd = 0.8, tck = -0.02)
mtext(side = 1, text = "Divergence of methylation level density at gene-body",
line = 1.6, cex = 1.4, font = 3)
mtext(side = 2, text = "Density distribution", line = 2.1, cex = 1.4, font = 3)
mtext("(Normal & Weibull Mixture Dists.)", cex = 1.4, font = 3)
x1 <- seq(0, 10, by = 0.01)
lines(x1, dmixtdistr(x1, phi = meth_FIT$phi, arg = meth_FIT$args), col = "blue")
The best fitted model for the \(|logFC|\), from comparison MM1 vs WT1, was a mixture of Gamma and Weibull distributions. Function fitMixDist calls the mixture function models set up in function mixtdistr
x <- logFC[ logFC > 0 ]
gexp_FIT <- fitMixDist(X = x, args = list(gamma = c(shape = NA, scale = NA),
weibull = c(shape = NA, scale = NA)),
dens = TRUE, npoints = 900)
## fitting ...
##
|
| | 0%
|
|=================================== | 50%
|
|======================================================================| 100%
## *** Performing nonlinear regression model crossvalidation...
summary(gexp_FIT$fit)
##
## Parameters:
## Estimate Std. Error t value Pr(>|t|)
## shape 0.683987 0.017598 38.87 <2e-16 ***
## scale 1.774870 0.115703 15.34 <2e-16 ***
## shape 1.372105 0.005934 231.22 <2e-16 ***
## scale 0.407377 0.001764 230.91 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.02787 on 937 degrees of freedom
## Number of iterations to termination: 16
## Reason for termination: Relative error in the sum of squares is at most `ftol'.
cat("\n Mixture proportions (phi):", gexp_FIT$phi)
##
## Mixture proportions (phi): 0.2149922 0.7850078
The graphic for histogram and density distribution curves:
par(mar = c(3, 3, 2, 1), font.lab=2, font=2,font.axis=2,family="serif", lwd = 1.2, cex = 1.4)
hist(x, 200, freq = FALSE, las = 1, #ylim = c(0, 0.021), cex = 1,
panel.first={points(0, 0, pch=16, cex=1e6, col="grey95")
grid(col="white", lty = 1)}, main = text,
cex.main = 0.9, xlim = c(0, 4), family = "serif",
col = "cyan1", border = "deepskyblue", xaxt ="n", yaxt = "n", ann = FALSE)
axis(1, padj = -1.5, las = 1, lwd = 0.5, tck = -0.02)
axis(2, hadj = 0.4, las = 1, lwd = 0.5, tck = -0.02)
mtext(side = 1, text = expression(italic(abs(logFC))),
line = 1.3, cex = 1.4, font = 3)
mtext(side = 2, text = "Density distribution", line = 1.6, cex = 1.4, font = 3)
mtext("(Weibull & Normal Mixture Dists.)", cex = 1.4, font = 3)
x1 <- seq(0, 10, by = 0.01)
lines(x1, dmixtdistr(x1, phi = gexp_FIT$phi, arg = gexp_FIT$args), col = "blue")
There various steps involved in investigating the dependence between two random variables. The state of the art in this subject include the application of copula distributions.
Sklar’s theorem ((2, 3)) named after Abe Sklar, provides the theoretical foundation for the application of copulas. Sklar’s theorem states that every multivariate cumulative distribution function (CDF):
\[H(x_1, ..., x_d)=Pr(X_1 \leq x_1, ..., X_d \leq x_d)\]
of any \(d-tuple\) of continuous random variables \((X_1, ..., X_d)\) can be expressed in terms of its marginals CDFs: \(F_i(x_i)=Pr(X_i \leq x_i)\) a copula \(C\):
\[H(x_1, ..., x_d)=C(F_1(x_1), ..., F_d(x_d))\] While Sklar showed that \(C\) and \(F_i\) are uniquely determined when H is known, a valid model for \((X_1, ..., X_d)\) arises from the last equation whenever the “ingredients” are chosen from given parametric families of distribution ((4)): \[F \in (F_\delta), C \in (C_\alpha)\]
Current literature on copula theory provides abundant information on models and copula families. Here, for the particular case of bi-dimensional copula defined for \(MD\) and \(|logFC|\), we have chosen Farlie–Gumbel–Morgenstern family of copulas, which are defined for, each \(\theta \in [-1, 1]\) by: \[C_\alpha(u, v)=uv+\alpha uv(1-u)(1-v)\] \(u,v \in [0,1]\). The correlation coefficient \(\rho\) is linked to \(\alpha\) by the relationship: \(\rho=\frac{\alpha}{3}\), which ranges from \(-\frac{1}{3}\) to \(\frac{1}{3}\), where \(u=F(x)\), \(v=G(y)\), with marginal CDFs: \(F(x)\) and \(G(y)\).
The first step requires for provide the parameters values, previously estimated, of the marginal distributions, which were given throughout the function mixtdistr. The the specific parameter values are retrieved from the previous nonlinear fit obtained for the marginal CDFs:
margins = c("mixtdistr", "mixtdistr")
parMargins = list(
list(phi = meth_FIT$phi, arg = meth_FIT$args),
list(phi = gexp_FIT$phi, arg = gexp_FIT$args))
The parameter estimation and goodness-of-fit (GOF) test take advantages of the tools available in the R package copula. Both, parameter estimation and GOF, can be accomplished with function bicopulaGOF from the R package usefr.
fgm_cop <- bicopulaGOF(x = MD, y = logFC, copula = "fgmCopula", sample.size = 13000,
margins = margins, paramMargins = parMargins, nboots = 10,
approach = "chisq", breaks = 30, num.cores = 6L, seed = 123,
verbose = FALSE)
The null hypothesis that the data follows FGM copula distribution is not rejected.
fgm_cop$gof
## Chisq.stat mc_p.value sample.size num.sampl
## 1.117631e+03 2.727273e-01 1.300000e+04 1.000000e+01
All the information needed about the fitted FGM copula is found in the object fgm_cop$copula, which is a mvdc object from copula package with several slots. In particular, the specific information about FGM copula is found in:
fgm_cop$copula@copula
## Farlie-Gumbel-Morgenstern (FGM) copula, dim. d = 2
## param.: -0.509
## Dimension: 2
## Parameters:
## alpha1.2 = -0.5091995
The P-P plot is build with function ppCplot from the R package usefr. If the GOF is good enough, then it is expected that points in the graphics are closed to the diagonal (red) straight line. Details on how the computation is performed are provided in the help for function. ppCplot.
g <- ppCplot(X = MD, Y = logFC, copula = fgm_cop$copula, npoints = 500, col = "blue", cex = 0.5)
Information on the data and copula used to yield the graphic are stored in the output variable “g”. For example, the empirical and theoretical probabilities used to built the graphic can be retrieved from the data frame: “g$data”:
DataFrame(g$data)
## DataFrame with 500 rows and 2 columns
## emprob thprob
## <numeric> <numeric>
## 1 0.000859476 0.000838487
## 2 0.114811632 0.119596864
## 3 0.198682137 0.199610747
## 4 0.174616817 0.181781110
## 5 0.449863916 0.455994109
## ... ... ...
## 496 0.100774 0.0964349
## 497 0.139092 0.1387958
## 498 0.193740 0.1872904
## 499 0.489042 0.4987606
## 500 0.464475 0.4624016
The domain region \((X, Y)\) is defined and function dMvdc is used to compute the density \(z\) in the grid \([X,Y] \times [X,Y]\) using the previous FMG fitted copula.
X = seq(0, 1.5, length = 100)
Y = seq(0, 1.5, length = 100)
sp_rho <- round(rho(fgm_cop$copula@copula),3) # Spearman's rho: -0.171
z <- outer(X, Y, function(x,y) dMvdc(cbind(x,y),fgm_cop$copula)) * 4
phi = 20; theta = 45
zf <- function(x,y) dMvdc(cbind(x,y),fgm_cop$copula) * 3.2
cat("Spearman's rho = ", sp_rho)
## Spearman's rho = -0.17
An auxiliary function to draw the panels
panelfirst <- function(pmat) {
par(mfrow = c(1, 1),mar=c(4,7,0, 1), font=2,lwd = 0.01, cex =1, family = "serif")
YZ <- trans3D(x = rep(0, ncol(z)), Y, z[18,], pmat = pmat)
scatter2D(YZ$x, YZ$y, colvar = z[18,], lwd = 1,
type = "l", add = TRUE, colkey = FALSE)
XZ <- trans3D(x = X, y = rep(max(Y), nrow(z)),
z = z[, 12], pmat = pmat)
scatter2D(XZ$x, XZ$y, colvar = z[, 12], lwd = 1,
type = "l", add = TRUE, colkey = FALSE)
}
The graphical settings depend on the final printing destiny and on the dataset.
par(mfrow = c(1, 1),mar=c(4,7,0, 1), font=2,lwd = 0.1, cex =1, family = "serif")
persp3D(X, Y, z, lwd = 0.01, image = TRUE,
xlab = "", ylab = "", zlab = "", family = "serif",
box = T, panel.first = panelfirst, bty = "u",
col.panel = "steelblue", col.grid = "black", lwd.grid = 0.5,
contour = list(col = "black", side = "zmin", lwd = 0.4),
ticktype = "detailed", clab = "", expand = 0.7, d = 2,
phi = phi, theta = theta, resfac = 2.5, cex.lab = 1, lwd = 0.005,
xlim = c(0, 1.5), zlim = c(-5, 6), font=2, cex.axis = 1.2,
colkey = list(length = 0.7, width = 0.6, col.ticks = "black", lwd = 0.01,
font=2, shift = 0, dist = 0, lwd.ticks = 0.01, cex.clab = 1.2,
cex.axis = 1, line = -0.8))
text(0.38,-0.48, srt = 34, labels = expression(paste("|", italic(log[2]), italic(FC), "|")),
xpd = TRUE,font = 2, cex = 1.2)
text(-0.3,-0.44, srt = -28, labels = expression(paste(italic("MD"))),
xpd = TRUE, font=2, cex = 1.2)
text(-0.3,-0.6, srt = 0, labels = expression(paste("Spearman' s ", italic(rho), ": ", -0.17)),
xpd = TRUE, font = 1, cex = 1)
The marginal distribtuions are drawn in the \(XZ\) and \(YZ\) vertical planes. Consistent with the
Spearman’ s \(\rho=-0.17\), for a fixed
value of \(|log_2FC|\), (a line
parallel to \(MD\) axis) the joint
probability density distribution function \(f(X \leq MD, Y \leq |log_2FC|)\) decreases
with the increment of \(MD\). That is,
the probability to observe DEGs with values, say between 0.1 and
0.5,
\(0.1 \leq |log_2FC| \leq 0.5\), is
greater for \(0.1 \eq MD \leq 0.5\)
than for \(0.5 < MD \leq 1\).
The 2D-dimensional plot permits to see in more detail the projection of the density surface on the \(XY\) plane and contour lines.
It is not difficult to see that for a fixed value of \(MD\) (a line parallel to \(|log_2FC|\) axis) the joint probability density distribution function \(f(X \leq MD, Y \leq |log_2FC|)\) (color gradient from red to blue) decreases with the increment of \(|log_2FC|\). That is, the probability to observe \(MD\) values, say between 0.1 and 0.5 (see the squared region below) \(0.1 \leq MD \leq 0.5\), is greater for \(0.1 \leq |log_2FC| \leq 0.5\) than for \(0.5 < |log_2FC| \leq 1\).
In summary, gene expression and methylation processes (2kb upstream gene TSS regions) are not independent in memory line first generation. There exist an stochastic deterministic dependence between \(|log_2FC|\) and \(MD\) that can be expressed in terms joint probability distribution, which leads to an inverse relationship between the conditioned expected values \(|log_2FC|\) given \(MD\), and vice versa. This association is, however, weaker than the association found for gene expression and methylation at gene body.
The above 3D and 2D density plots and results suggest that most of the genes have low \(|logFC|\) and low \(MD\), and the number of genes decrease for middle levels of \(|logFC|\) and \(MD\). This observation leads to the application of a linear-by-linear association test.
The linear-by-linear test (LLT) is used to test a linear association among variables in a contingency table with ordered categories ((5)). The null hypothesis for the linear-by-linear test is that there is no association among the variables in the table.
It is obvious from the 2D density plots that a relationship for \(MD\) < 0.08 is not meaningful. In particular, both variables, \(|logFC|\) and \(MD\), are at the level of technical noise. The percentage of the genes at noise level are:
## [1] 1.78
Hence, the genes considered in the downstream analysis are:
idx <- which(MD > 0.08)
md <- MD[ idx]
logfc <- logFC[ idx ]
dt <- data.frame(md = md , logfc = logfc)
nrow(dt)
## [1] 13714
To apply the LLT the values of \(|logFC|\) and \(MD\) must be discretized into ordinal values. A K-means discretization approach is applied with following auxiliary function
# ------------------- Auxiliary function for discretization ------------------ #
Kmean.discretization <- function(x, categories, algorithm = "Hartigan-Wong",
iter.max = 10^6, column = 1L) {
centers <- kmeans(x, centers = categories, algorithm = algorithm,
iter.max = iter.max)$centers
# order the centers
centers <- sort(centers)
cl <- stats::kmeans(stats::na.omit(x), centers = centers,
algorithm = algorithm, iter.max = iter.max )
return( data.frame(x, cluster = cl$cluster) )
}
cl = Kmean.discretization(md, categories = 3, algorithm = "MacQueen" )
cl_summary <- sapply(1:3, function(k) summary(cl$x[cl$cluster == k]))
colnames(cl_summary) <- c("low", "middle", "high")
cl_summary
## low middle high
## Min. 0.08010906 0.9764634 2.346377
## 1st Qu. 0.25350885 1.1902080 2.557401
## Median 0.39796318 1.4417072 2.850635
## Mean 0.44995143 1.5025125 3.188218
## 3rd Qu. 0.63007886 1.7738444 3.420480
## Max. 0.97620682 2.3446362 20.519587
Cutpoint are now applied:
cl = Kmean.discretization(logfc, categories = 3, algorithm = "MacQueen" )
cl_summary <- sapply(1:3, function(k) summary(cl$x[cl$cluster == k]))
colnames(cl_summary) <- c("eLow", "eMiddle", "eHigh")
cl_summary
## eLow eMiddle eHigh
## Min. 2.270881e-05 0.6366017 1.835555
## 1st Qu. 1.345836e-01 0.7410084 2.070626
## Median 2.526246e-01 0.9079608 2.399485
## Mean 2.729755e-01 0.9997545 2.671355
## 3rd Qu. 4.015427e-01 1.1817728 2.905358
## Max. 6.360073e-01 1.8328136 9.407983
In the current case, the contingency table suggests an obvious relationship between \(|logFC|\) and \(MD\).
##
## Low Middle High
## eLow 5815 3890 761
## eMiddle 1997 630 177
## eHigh 355 68 21
The spineplot of this contingency table (table(x = logfc_level, y = md_level)) help to visualize the existent relationship between \(|logFC|\) and \(MD\).
par( mar = c(5, 7, 4, 4), family = "Serif", font = 2, cex = 0.8, las = 1, lwd = 0.2 )
tb <- spineplot(m[, c(3,2,1)], col = c( "blue", "dodgerblue","deepskyblue" ),
xlab = "", ylab = "")
mtext(side = 1, text = expression(italic(abs(logFC))),
line = 2.5, cex = 1, font = 3)
mtext(side = 2, text = expression(paste(italic("MD"))),
line = 4, cex = 1, font = 3)
abline(a = 0.0, b = 0.72, col = "white", lty = "dashed", lwd = 1)
A dashed line was added to emphasize the direction of linear trend. From “Low” to “high” there seems to be a linear relationship between pairwise proportion of \(|logFC|\) and \(MD\).
Conceptually, the spineplot is a plot of the conditional probability \(P(y|x)\) against \(P(x)\). Since the joint probability distribution of these variable is described by a FGM copula, a linear relationship between \(P(y|x)\) against \(P(x)\) must be expected, which is a property of FGM copula (6).
Here, we use the LLT test implemented in function lbl_test from the R package coin. Two test variants are available: 1) permutation test and asymptotic test.
The asymptotic LLT test:
lbl_test( m )
##
## Asymptotic Linear-by-Linear Association Test
##
## data: Var2 (ordered) by Var1 (eLow < eMiddle < eHigh)
## Z = -14.655, p-value < 2.2e-16
## alternative hypothesis: two.sided
The permutation LLT test:
lbl_test( m, distribution = "approximate", B = 10000, alternative = "less" )
##
## Approximative Linear-by-Linear Association Test
##
## data: Var2 (ordered) by Var1 (eLow < eMiddle < eHigh)
## Z = -14.655, p-value < 1e-04
## alternative hypothesis: less
The null hypothesis of no association among the variables in the table is rejected. In the current case the p-value < 1e-04 and, hence, we do not have enough evidence to support the null hypothesis.
The variable \(MD\) was split into three levels, “0”: “Low” (\(0 < MD \leq 0.5\)), “1”: “Middle” (\(0.5 < MD \leq 1\)), and “2”: “High” ($1 < MD $)
meth_cut <- 1
xy <- cbind(MD, logFC)
idx <- which(logFC > 0.5)
xy <- data.frame(xy[idx, ])
xy$md <- 0
xy$md[xy$MD > meth_cut] <- 1
The glm support the decrease of \(-logFC\) with th increment of \(MD\). The model coefficients are highly statistically significant.
##
## Call:
## glm(formula = logFC ~ md, family = Gamma(link = "identity"),
## data = xy)
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.09233 0.01227 89.030 <2e-16 ***
## md -0.17546 0.02040 -8.602 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for Gamma family taken to be 0.4203748)
##
## Null deviance: 1283.8 on 4662 degrees of freedom
## Residual deviance: 1255.4 on 4661 degrees of freedom
## AIC: 6431
##
## Number of Fisher Scoring iterations: 3
The ANOVA test confirm the negative trend given by the coefficient of MD (highly statistically significant).
anova(gm, test = "LRT")
## Analysis of Deviance Table
##
## Model: Gamma, link: identity
##
## Response: logFC
##
## Terms added sequentially (first to last)
##
##
## Df Deviance Resid. Df Resid. Dev Pr(>Chi)
## NULL 4662 1283.8
## md 1 28.416 4661 1255.4 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The boxplot suggest a clear trend of \(-logFC\) decreasing with increment of MD levels. Notice that the notchs from the boxes do not overlap, which implies that the differences between medians from three groups (“Low”, “Middle”, and “High”) are statistically significant. So, the GLM results can be visualized.
par(bty = "n", family = "serif", font = 2, cex = 1, lwd = 0.1, mar = c(4, 5, 2, 1.1))
boxplot(logFC ~ md, data = xy, ylim = c(0.4, 2.5), xaxt ="n", yaxt = "n", ann = FALSE)
rect(0.5, 0.4, 3.5, 5, col = "grey94", lty = 0, lwd = 1)
grid(0, 5, col = "white", lwd = 2, lty = 1)
boxplot(logFC ~ md, data = xy, ylim = c(0.4, 2.5), notch = TRUE, xaxt ="n",
yaxt = "n", ann = FALSE, pch = 20, medcol = "darkblue", whisklwd = 1, staplelwd = 1,
col = c("deepskyblue", "dodgerblue"), outcol = "red", add = TRUE)
axis(1, padj = 0.1, at = 1:2, lwd = 0.5, tck = -0.02, line = -0.6, cex.axis = 1.3,
labels = c("Low", "High"))
axis(2, hadj = 1, las = 1, lwd = 0.5, tck = -0.02, line = -1, cex.axis = 1.4)
mtext(side = 1, text = expression(paste(italic("MD"))),
line = 2.2, cex = 1.4, font = 3)
mtext(side = 2, text = expression(italic(-logFC)), line = 2.1, cex = 1.4, font = 3)
text(x = 1.5, y = 2.2, "***", pos = 3, cex = 1.2 )
.deg_mm1_wt1 <- deg_mm1_wt1[match(names(dmp_den_wt1_mm1), rownames(deg_mm1_wt1)), ]
MD <- dmp_den_wt1_mm1$dmp_den # Methylation (|group mean density differences|)
idx <- which(.deg_mm1_wt1$table$logFC > 0 ) # Up-regulated
logFC <- .deg_mm1_wt1$table$logFC[idx] # Gene Expression (|logFC|)
MD <- MD[idx]
DataFrame(MD,logFC)
## DataFrame with 6060 rows and 2 columns
## MD logFC
## <numeric> <numeric>
## 1 6.571474 0.0359792
## 2 1.199583 0.5287217
## 3 0.762753 0.1245334
## 4 0.274631 0.0142820
## 5 0.623658 0.4497949
## ... ... ...
## 6056 0.223080 0.317756
## 6057 0.634709 0.394906
## 6058 0.296653 0.943018
## 6059 0.121346 1.680553
## 6060 0.151446 1.921001
There are 6060 genes with \(MD > 0\) and \(logFC > 0\) to be included in the downstream analysis.
The best fitted model for \(MD\), from comparison MM1 vs WT1, was a mixture of Normal and Weibull distributions.
x <- MD[ MD > 0]
meth_FIT <- fitMixDist(X = x, args = list(norm = c(mean = NA, sd = NA),
weibull = c(shape = NA, scale = NA)),
dens = TRUE, npoints = 900)
## fitting ...
##
|
| | 0%
|
|=================================== | 50%
|
|======================================================================| 100%
## *** Performing nonlinear regression model crossvalidation...
summary(meth_FIT$fit)
##
## Parameters:
## Estimate Std. Error t value Pr(>|t|)
## mean 0.285889 0.002124 134.60 <2e-16 ***
## sd 0.163948 0.001777 92.27 <2e-16 ***
## shape 2.093709 0.029261 71.55 <2e-16 ***
## scale 1.360781 0.011846 114.88 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.04601 on 1044 degrees of freedom
## Number of iterations to termination: 12
## Reason for termination: Relative error in the sum of squares is at most `ftol'.
cat("\n Mixture proportions (phi):", meth_FIT$phi)
##
## Mixture proportions (phi): 0.4201527 0.5798473
The graphic for histogram and density distribution curves:
par(mar = c(4, 4, 2, 1), font.lab=2, font=2,font.axis=2,family="serif" , lwd = 1.2, cex = 1.4)
hist(MD, 90, freq = FALSE, las = 1, #ylim = c(0, 0.021), cex = 1,
panel.first={points(0, 0, pch=16, cex=1e6, col="grey95")
grid(col="white", lty = 1)}, main = text,
cex.main = 0.9, xlim = c(0, 5), family = "serif",
col = "cyan1", border = "deepskyblue", xaxt ="n", yaxt = "n", ann = FALSE)
axis(1, padj = -1, las = 1, lwd = 0.8, tck = -0.02)
axis(2, hadj = 0.7, las = 1, lwd = 0.8, tck = -0.02)
mtext(side = 1, text = "Divergence of methylation level density at gene-body",
line = 1.6, cex = 1.4, font = 3)
mtext(side = 2, text = "Density distribution", line = 2.1, cex = 1.4, font = 3)
mtext("(Normal & Weibull Mixture Dists.)", cex = 1.4, font = 3)
x1 <- seq(0, 10, by = 0.01)
lines(x1, dmixtdistr(x1, phi = meth_FIT$phi, arg = meth_FIT$args), col = "blue")
The best fitted model for the \(|logFC|\), from comparison MM1 vs WT1, was a mixture of Gamma and Weibull distributions. Function fitMixDist calls the mixture function models set up in function mixtdistr
gexp_FIT <- fitMixDist(X = logFC, args = list(gamma = c(shape = NA, scale = NA),
weibull = c(shape = NA, scale = NA)),
dens = TRUE, npoints = 900)
## fitting ...
##
|
| | 0%
|
|=================================== | 50%
|
|======================================================================| 100%
## *** Performing nonlinear regression model crossvalidation...
summary(gexp_FIT$fit)
##
## Parameters:
## Estimate Std. Error t value Pr(>|t|)
## shape 1.346227 0.266677 5.048 5.54e-07 ***
## scale 0.718222 0.148811 4.826 1.67e-06 ***
## shape 1.028815 0.006226 165.255 < 2e-16 ***
## scale 0.276247 0.009824 28.121 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.04628 on 791 degrees of freedom
## Number of iterations to termination: 7
## Reason for termination: Relative error in the sum of squares is at most `ftol'.
cat("\n Mixture proportions (phi):", gexp_FIT$phi)
##
## Mixture proportions (phi): 0.2454972 0.7545028
The graphic for histogram and density distribution curves:
par(mar = c(3, 3, 2, 1), font.lab=2, font=2,font.axis=2,family="serif", lwd = 1.2, cex = 1.4)
hist(logFC, 120, freq = FALSE, las = 1, #ylim = c(0, 0.021), cex = 1,
panel.first={points(0, 0, pch=16, cex=1e6, col="grey95")
grid(col="white", lty = 1)}, main = text,
cex.main = 0.9, xlim = c(0, 4), family = "serif",
col = "cyan1", border = "deepskyblue", xaxt ="n", yaxt = "n", ann = FALSE)
axis(1, padj = -1.5, las = 1, lwd = 0.5, tck = -0.02)
axis(2, hadj = 0.4, las = 1, lwd = 0.5, tck = -0.02)
mtext(side = 1, text = expression(italic(abs(logFC))),
line = 1.3, cex = 1.4, font = 3)
mtext(side = 2, text = "Density distribution", line = 1.6, cex = 1.4, font = 3)
mtext("(Gamma & Weibull Mixture Dists.)", cex = 1.4, font = 3)
x1 <- seq(0, 10, by = 0.01)
lines(x1, dmixtdistr(x1, phi = gexp_FIT$phi, arg = gexp_FIT$args), col = "blue")
The first step requires for provide the parameters values, previously estimated, of the marginal distributions, which were given throughout the function mixtdistr. The specific parameter values are retrieved from the previous nonlinear fit obtained for the marginal CDFs:
margins = c("mixtdistr", "mixtdistr")
parMargins = list(
list(phi = meth_FIT$phi, arg = meth_FIT$args),
list(phi = gexp_FIT$phi, arg = gexp_FIT$args))
The parameter estimation and goodness-of-fit (GOF) test take advantage of the tools available in the R package copula. Both, parameter estimation and GOF, can be accomplished with function bicopulaGOF from the R package usefr.
fgm_cop <- bicopulaGOF(x = MD, y = logFC, copula = "fgmCopula", sample.size = 5999,
margins = margins, paramMargins = parMargins, nboots = 10,
approach = "chisq", breaks = 30, num.cores = 6L, seed = 123,
verbose = FALSE)
The null hypothesis that the data follows FGM copula distribution is not rejected.
fgm_cop$gof
## Chisq.stat mc_p.value sample.size num.sampl
## 924.4173554 0.4545455 5999.0000000 10.0000000
All the information needed about the fitted FGM copula is found in the object fgm_cop$copula, which is a mvdc object from copula package with several slots. In particular, the specific information about FGM copula is found in:
fgm_cop$copula@copula
## Farlie-Gumbel-Morgenstern (FGM) copula, dim. d = 2
## param.: -0.492
## Dimension: 2
## Parameters:
## alpha1.2 = -0.491595
The P-P plot is build with function ppCplot from the R package usefr. If the GOF is good enough, then it is expected that points in the graphics are closed to the diagonal (red) straight line. Details on how the computation is performed are provided in the help for function. ppCplot.
g <- ppCplot(X = MD, Y = logFC, copula = fgm_cop$copula, npoints = 500, col = "blue", cex = 0.5)
Information on the data and copula used to yield the graphic are stored in the output variable “g”. For example, the empirical and theoretical probabilities used to built the graphic can be retrieved from the data frame: “g$data”:
DataFrame(g$data)
## DataFrame with 500 rows and 2 columns
## emprob thprob
## <numeric> <numeric>
## 1 0.122277 0.128040
## 2 0.181023 0.186947
## 3 0.127558 0.131029
## 4 0.139109 0.141022
## 5 0.286964 0.294494
## ... ... ...
## 496 0.167162 0.167162
## 497 0.189109 0.182595
## 498 0.675743 0.676203
## 499 0.339109 0.339881
## 500 0.444389 0.439957
The domain region \((X, Y)\) is defined and function dMvdc is used to compute the density \(z\) in the grid \([X,Y] \times [X,Y]\) using the previous FMG fitted copula.
X = seq(0, 1.5, length = 100)
Y = seq(0, 1.8, length = 100)
sp_rho <- round(rho(fgm_cop$copula@copula),3) # Spearman' s rho: -0.171
z <- outer(X, Y, function(x,y) dMvdc(cbind(x,y),fgm_cop$copula)) * 4
phi = 15; theta = 45
sp_rho
## [1] -0.164
An auxiliary function to draw the panels
panelfirst <- function(pmat) {
par(mfrow = c(1, 1),mar=c(4,7,0, 1), font=2,lwd = 0.01, cex =1, family = "serif")
YZ <- trans3D(x = rep(0, ncol(z)), Y, z[18,], pmat = pmat)
scatter2D(YZ$x, YZ$y, colvar = z[18,], lwd = 1,
type = "l", add = TRUE, colkey = FALSE)
XZ <- trans3D(x = X, y = rep(max(Y), nrow(z)),
z = z[, 12], pmat = pmat)
scatter2D(XZ$x, XZ$y, colvar = z[, 12], lwd = 1,
type = "l", add = TRUE, colkey = FALSE)
}
The graphical settings depend on the final printing destiny and on the dataset.
par(mfrow = c(1, 1),mar=c(4,7,0, 1), font=2,lwd = 0.1, cex =1, family = "serif")
persp3D(X, Y, z, lwd = 0.01, image = TRUE,
xlab = "", ylab = "", zlab = "", family = "serif",
box = T, panel.first = panelfirst, bty = "u",
col.panel = "steelblue", col.grid = "black", lwd.grid = 0.5,
contour = list(col = "black", side = "zmin", lwd = 0.4),
ticktype = "detailed", clab = "", expand = 0.7, d = 2,
phi = phi, theta = theta, resfac = 2.5, cex.lab = 1, lwd = 0.005,
xlim = c(0, 1.5), zlim = c(-5, 9), font=2, cex.axis = 1.2,
colkey = list(length = 0.5, width = 0.6, col.ticks = "black", lwd = 0.01,
font=2, shift = 0, dist = 0, lwd.ticks = 0.01, cex.clab = 1.2,
cex.axis = 1, line = -0.8))
text(0.38,-0.4, srt = 38, labels = expression(paste("|", italic(log[2]), italic(FC), "|")),
xpd = TRUE,font = 2, cex = 1.2)
text(-0.41,-0.39, srt = -20, labels = expression(paste(italic("MD"))),
xpd = TRUE, font=2, cex = 1.2)
text(-0.36,-0.5, srt = 0, labels = expression(paste("Spearman' s ", italic(rho), ": ", -0.164)),
xpd = TRUE, font = 1, cex = 1)
The marginal distributions are drawn in the \(XZ\) and \(YZ\) vertical planes. Consistent with the
Spearman’ s \(\rho=-0.164\), for a
fixed value of \(|log_2FC|\), (a line
parallel to \(MD\) axis) the joint
probability density distribution function \(f(X \leq MD, Y \leq log_2FC)\) decreasses
with the increment of \(MD\). That is,
the probability to observe DEGs with values, say between 0.1 and
0.5,
\(0.1 \leq log_2FC \leq 0.5\), is
greater for \(0.1 \leq MD \leq 0.5\)
than for \(0.5 < MD \leq 1\).
The 2D-dimensional plot permits to see in more detail the projection of the density surface on the \(XY\) plane and contour lines.
It is not difficult to see that for a fixed value of \(MD\) (a line parallel to \(|log_2FC|\) axis) the joint probability density distribution function \(f(X \leq MD, Y \leq |log_2FC|)\) (color gradient from red to blue) decreasses with the increment of \(|log_2FC|\). That is, the probability to observe \(MD\) values, say between 0.1 and 0.5 (see the squared region below) \(0.1 \leq MD \leq 0.5\), is greater for \(0.1 \leq |log_2FC| \leq 0.5\) than for \(0.5 < |log_2FC| \leq 1\).
In summary, gene expression and methylation processes are not independent in memory line first generation. There exist an stochastic deterministic dependence between \(|log_2FC|\) and \(MD\) that can be expressed in terms joint probability distribution, which leads to an inverse relationship between the conditioned expected values \(|log_2FC|\) given \(MD\), and vice versa.
As in the previous case (above), tt is obvious from the 2D density plots that for \(MD\) < 0.08 the any relationship is not meaningful. In particular, both variables, \(logFC\) and \(MD\), are at the level of technical noise. The percentage of the genes at noise level are:
## [1] 1.42
The genes for the downstream analysis are:
idx <- which(MD > 0.08)
md <- MD[ idx]
logfc <- logFC[ idx ]
dt <- data.frame(md = md , logfc = logfc)
nrow(dt)
## [1] 5974
# ------------------- Auxiliary function for discretization ------------------ #
Kmean.discretization <- function(x, categories, algorithm = "Hartigan-Wong",
iter.max = 10^6, column = 1L) {
centers <- kmeans(x, centers = categories, algorithm = algorithm,
iter.max = iter.max)$centers
# order the centers
centers <- sort(centers)
cl <- stats::kmeans(stats::na.omit(x), centers = centers,
algorithm = algorithm, iter.max = iter.max )
return( data.frame(x, cluster = cl$cluster) )
}
cl = Kmean.discretization(md, categories = 3, algorithm = "MacQueen" )
cl_summary <- sapply(1:3, function(k) summary(cl$x[cl$cluster == k]))
colnames(cl_summary) <- c("low", "middle", "high")
cl_summary
## low middle high
## Min. 0.0809488 0.9703714 2.348914
## 1st Qu. 0.2543436 1.1740022 2.546431
## Median 0.3925193 1.4335135 2.922496
## Mean 0.4465135 1.4933123 3.204277
## 3rd Qu. 0.6216258 1.7647503 3.509952
## Max. 0.9683489 2.3473622 10.496433
Cutpoint are now applied:
cl = Kmean.discretization(logfc, categories = 3, algorithm = "MacQueen" )
cl_summary <- sapply(1:3, function(k) summary(cl$x[cl$cluster == k]))
colnames(cl_summary) <- c("eLow", "eMiddle", "eHigh")
cl_summary
## eLow eMiddle eHigh
## Min. 3.772575e-05 0.6448911 2.052246
## 1st Qu. 8.178209e-02 0.7713312 2.324534
## Median 1.865436e-01 0.9716668 2.575225
## Mean 2.257154e-01 1.0636271 3.027001
## 3rd Qu. 3.430369e-01 1.2671241 3.394137
## Max. 6.443115e-01 2.0306128 7.945112
In the current case, the contingency table suggests an obvious relationship between \(|logFC|\) and \(MD\).
##
## Low Middle High
## eLow 2721 1676 355
## eMiddle 795 210 59
## eHigh 128 24 6
The spineplot of this contingency table (table(x = logfc_level, y = md_level)) help to visualize the existent relationship between \(|logFC|\) and \(MD\).
par( mar = c(5, 7, 4, 4), family = "Serif", font = 2, cex = 0.8, las = 1, lwd = 0.2 )
tb <- spineplot(m[, c(3,2,1)], col = c( "blue", "dodgerblue","deepskyblue" ),
xlab = "", ylab = "")
mtext(side = 1, text = expression(italic(logFC)),
line = 2.5, cex = 1, font = 3)
mtext(side = 2, text = expression(paste(italic("MD"))),
line = 4, cex = 1, font = 3)
abline(a = 0.0, b = 0.72, col = "white", lty = "dashed", lwd = 1)
A dashed line was added to emphasize the direction of linear trend. From “Low” to “high” there seems to be a linear relationship between pairwise proportion of \(|logFC|\) and \(MD\).
Conceptually, the spineplot is a plot of the conditional probability \(P(y|x)\) against \(P(x)\). Since the joint probability distribution of these variable is described by a FGM copula, a linear relationship between \(P(y|x)\) against \(P(x)\) must be expected, which is a property of FGM copula (6).
Here, we use the LLT test implemented in function lbl_test from the R package coin. Two test variants are available: 1) permutation test and asymptotic test.
The asymptotic LLT test:
lbl_test( m )
##
## Asymptotic Linear-by-Linear Association Test
##
## data: Var2 (ordered) by Var1 (eLow < eMiddle < eHigh)
## Z = -10.163, p-value < 2.2e-16
## alternative hypothesis: two.sided
The permutation LLT test:
lbl_test( m, distribution = "approximate", B = 10000, alternative = "less" )
##
## Approximative Linear-by-Linear Association Test
##
## data: Var2 (ordered) by Var1 (eLow < eMiddle < eHigh)
## Z = -10.163, p-value < 1e-04
## alternative hypothesis: less
The null hypothesis of no association among the variables in the table is rejected. In the current case the \(p-value < 1e-04\) and, hence, we do not have enough evidence to support the null hypothesis.
The variable \(MD\) is split int three levels, 0:“Low” (\(0 < MD \leq 0.5\)), 1:“Middle” (\(0.5 < MD \leq 1\)), and 2:“High” ($1 < MD $)
meth_cut1 <- 0.5
meth_cut2 <- 1.2
xy <- cbind(MD, logFC)
idx <- which(logFC > 0.5)
xy <- data.frame(xy[idx, ])
xy$md <- 0
xy$md[xy$MD > meth_cut1] <- 1
xy$md[xy$MD > meth_cut2] <- 2
The glm support the decrease of \(-logFC\) with the increment of \(MD\). The model coefficients are higly statistically significant.
##
## Call:
## glm(formula = logFC ~ md, family = Gamma(link = "identity"),
## data = xy)
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.18737 0.02644 44.915 < 2e-16 ***
## md -0.10495 0.02264 -4.635 3.84e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for Gamma family taken to be 0.490523)
##
## Null deviance: 532.96 on 1702 degrees of freedom
## Residual deviance: 522.93 on 1701 degrees of freedom
## AIC: 2751.1
##
## Number of Fisher Scoring iterations: 4
The ANOVA test confirm the negative trend given by the coefficient of MD (highly statistically significant).
anova(gm, test = "LRT")
## Analysis of Deviance Table
##
## Model: Gamma, link: identity
##
## Response: logFC
##
## Terms added sequentially (first to last)
##
##
## Df Deviance Resid. Df Resid. Dev Pr(>Chi)
## NULL 1702 532.96
## md 1 10.034 1701 522.93 6.106e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Only genes with \(MD < 1\) are analyzed. That is, only two groups are considered: 0:“Middle” (\(0.5 < MD \leq 1\)), and 1:“High” ($1 < MD $)
dt <- xy[xy$MD < meth_cut2, ]
dt$md <- 0
dt$md[ dt$MD > meth_cut1 ] <- 1
gm <- glm(formula = logFC ~ md, data = dt, family = Gamma(link = "identity"))
summary( gm )
##
## Call:
## glm(formula = logFC ~ md, family = Gamma(link = "identity"),
## data = dt)
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.19166 0.02834 42.049 < 2e-16 ***
## md -0.12276 0.04503 -2.726 0.00649 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for Gamma family taken to be 0.4971405)
##
## Null deviance: 430.00 on 1342 degrees of freedom
## Residual deviance: 426.45 on 1341 degrees of freedom
## AIC: 2291.1
##
## Number of Fisher Scoring iterations: 3
cat("\n =============== The ANOVA test =============== \n\n")
##
## =============== The ANOVA test ===============
anova(gm, test = "LRT")
## Analysis of Deviance Table
##
## Model: Gamma, link: identity
##
## Response: logFC
##
## Terms added sequentially (first to last)
##
##
## Df Deviance Resid. Df Resid. Dev Pr(>Chi)
## NULL 1342 430.00
## md 1 3.5479 1341 426.45 0.007552 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Only genes with \(MD < 1\) are analyzed. That is, only two groups are considered: 0:“Low” (\(0 < MD \leq 0.5\)), 1:“Middle” (\(0.5 < MD \leq 1\)).
dt <- xy[xy$MD > meth_cut1, ]
dt$md <- 0
dt$md[ dt$MD > meth_cut2 ] <- 1
gm <- glm(formula = logFC ~ md, data = dt, family = Gamma(link = "identity"))
summary( gm )
##
## Call:
## glm(formula = logFC ~ md, family = Gamma(link = "identity"),
## data = dt)
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.06890 0.03514 30.419 <2e-16 ***
## md -0.08433 0.05084 -1.659 0.0976 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for Gamma family taken to be 0.5014626)
##
## Null deviance: 238.85 on 823 degrees of freedom
## Residual deviance: 237.48 on 822 degrees of freedom
## AIC: 1171.9
##
## Number of Fisher Scoring iterations: 3
cat("\n =============== The ANOVA test =============== \n\n")
##
## =============== The ANOVA test ===============
anova(gm, test = "LRT")
## Analysis of Deviance Table
##
## Model: Gamma, link: identity
##
## Response: logFC
##
## Terms added sequentially (first to last)
##
##
## Df Deviance Resid. Df Resid. Dev Pr(>Chi)
## NULL 823 238.85
## md 1 1.3641 822 237.48 0.09908 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The boxplot suggest a clear trend of \(-logFC\) decreasing with increment of MD levels. Notice that the notchs from the boxes do not overlap, which implies that the differences between medians from three groups (“Low”, “Middle”, and “High”) are statistically significant. So, the GLM results can be visualized.
par(bty = "n", family = "serif", font = 2, cex = 1, lwd = 0.1)
boxplot(logFC ~ md, data = xy, ylim = c(0.4, 2.5), xaxt ="n", yaxt = "n", ann = FALSE)
rect(0.5, 0.4, 3.5, 5, col = "grey94", lty = 0, lwd = 1)
grid(0, 5, col = "white", lwd = 2, lty = 1)
boxplot(logFC ~ md, data = xy, ylim = c(0.4, 2.5), notch = TRUE, xaxt ="n",
yaxt = "n", ann = FALSE, pch = 20, medcol = "darkblue", whisklwd = 1, staplelwd = 1,
col = c("deepskyblue", "dodgerblue", "dodgerblue3" ), outcol = "red", add = TRUE)
axis(1, padj = -0.6, at = 1:3, lwd = 0.5, tck = -0.02, line = -0.6, cex.axis = 1.3,
labels = c("Low", "Middle", "High"))
axis(2, hadj = 0.6, las = 1, lwd = 0.5, tck = -0.02, line = -1, cex.axis = 1.3)
mtext(side = 1, text = expression(paste(italic("MD"))),
line = 1.3, cex = 1.4, font = 3)
mtext(side = 2, text = expression(italic(-logFC)), line = 1, cex = 1.4, font = 3)
text(x = 1.5, y = 2.2, "**", pos = 3, cex = 1.2 )
Here is the output of sessionInfo()
on the system on
which this document was compiled running pandoc 3.1.1:
## R version 4.3.2 (2023-10-31)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 22.04.3 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.10.0
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: America/New_York
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats4 stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] coin_1.4-3 survival_3.5-7 copula_1.1-2
## [4] plot3D_1.4 usefr_0.1.1 edgeR_3.42.4
## [7] limma_3.56.2 ggplot2_3.4.4 GenomicRanges_1.52.1
## [10] GenomeInfoDb_1.36.4 IRanges_2.34.1 S4Vectors_0.38.2
## [13] BiocGenerics_0.46.0
##
## loaded via a namespace (and not attached):
## [1] ADGofTest_0.3 tidyselect_1.2.0 libcoin_1.0-10
## [4] dplyr_1.1.4 bitops_1.0-7 fastmap_1.1.1
## [7] RCurl_1.98-1.13 TH.data_1.1-2 digest_0.6.33
## [10] mixdist_0.5-5 lifecycle_1.0.4 magrittr_2.0.3
## [13] compiler_4.3.2 rlang_1.1.2 sass_0.4.8
## [16] tools_4.3.2 utf8_1.2.4 yaml_2.3.8
## [19] knitr_1.45 mclust_6.0.1 multcomp_1.4-25
## [22] BiocParallel_1.34.2 withr_2.5.2 purrr_1.0.2
## [25] numDeriv_2016.8-1.1 desc_1.4.3 grid_4.3.2
## [28] pcaPP_2.0-3 fansi_1.0.6 stabledist_0.7-1
## [31] colorspace_2.1-0 MASS_7.3-60 scales_1.3.0
## [34] cli_3.6.2 mvtnorm_1.2-4 rmarkdown_2.25
## [37] ragg_1.2.6 generics_0.1.3 rstudioapi_0.15.0
## [40] nls2_0.3-3 cachem_1.0.8 stringr_1.5.1
## [43] zlibbioc_1.46.0 modeltools_0.2-23 splines_4.3.2
## [46] parallel_4.3.2 pspline_1.0-19 XVector_0.40.0
## [49] matrixStats_1.2.0 vctrs_0.6.5 sandwich_3.0-2
## [52] misc3d_0.9-1 Matrix_1.6-4 jsonlite_1.8.7
## [55] minpack.lm_1.2-4 systemfonts_1.0.5 locfit_1.5-9.8
## [58] jquerylib_0.1.4 proto_1.0.0 glue_1.6.2
## [61] pkgdown_2.0.7 codetools_0.2-19 stringi_1.8.3
## [64] cubature_2.1.0 gtable_0.3.4 gsl_2.1-8
## [67] munsell_0.5.0 tibble_3.2.1 pillar_1.9.0
## [70] htmltools_0.5.7 GenomeInfoDbData_1.2.10 R6_2.5.1
## [73] tcltk_4.3.2 textshaping_0.3.7 evaluate_0.23
## [76] lattice_0.22-5 highr_0.10 memoise_2.0.1
## [79] bslib_0.6.1 Rcpp_1.0.11 xfun_0.41
## [82] zoo_1.8-12 fs_1.6.3 pkgconfig_2.0.3