Abstract

This document is Supplementary Information for the manuscript: “Segregation of an MSH1 RNAi transgene produces heritable non-genetic memory in association with methylome reprogramming” (1). Statistical analysis supporting the association between gene expression and cytosine DNA methylation at gene-body is provided.

## Background

The classical statistical approach to analyze the association between two variables is based in correlation tests. However, small or zero correlation coefficient does not necessarily implies independence of the tested variables/processes and, consequently, does not implies the lack of association at all. There various steps involved in investigating the dependence between two random variables. The state of the art include the application of copula distributions. Herein, we provide study of the statistical dependence structure between gene expression and methylation in memory line. To uncover the association between gene expression and cytosine DNA methylation, first, the probability distributions of gene expression (in terms of logarithm of fold changes: logFC) and methylation divergences are estimated.

Next, the joint probability distribution is estimated, expressed in term of copula distribution. All the analyses are performed in R and can be reproduced as presented here. The association is investigated with the application of copula distributions.

## Required R packages

With the exception of usefr R package, all the packages are available in Bioconductor and in CRAN repository. Package usefr is available in GitHub at https://github.com/genomaths/usefr You can install usefr package from GitHub typing the following command in the R console:

 devtools::install_git("https://github.com/genomaths/usefr.git")
library( GenomicRanges )
library( ggplot2 )
library( edgeR )
library( usefr )
library( plot3D )
## Warning in fun(libname, pkgname): couldn't connect to display ":0"
library( copula )
library( coin )

## Analysis for down-regulated genes

### Data sets

Gene expression data sets were derived using the R package edgeR. All the genes with the minimal quality of gene expression differences are included in the analyses. That is, to evaluate the existence of an association between gene expression and cytosine methylation all the genes with meaningful expression and methylation values must be included, regardless they would be differentially methylated and/or differentially expressed or not.

Gene expression data for Arabidopsis Memory first generation (G1)

## ===== Download methylation data from Github ========
url <- paste0("github.com/genomaths/genomaths.github.io/raw/master/supp-inf/",
"mm_nm_wt_transgen_meth_density_and_gene-exp_at_genes_06-01-19.",
"RData")
temp <- tempfile(fileext = ".RData")
file.remove(temp); rm(temp, url)
## Gene expresion data set
deg_mm1_wt1
## An object of class "DGEExact"
## $table ## logFC logCPM PValue adj.pval ## AT1G01010 -0.60092985 3.066880 9.377730e-05 0.0004180724 ## AT1G01020 -0.60729200 2.913263 2.803733e-05 0.0001409788 ## AT1G01030 -0.25438232 -1.478816 5.857299e-01 0.6890000078 ## AT1G01040 0.03597925 5.632805 7.833824e-01 0.8505857888 ## AT1G01046 -1.61271935 -2.996378 2.235591e-02 0.0498548032 ## 20284 more rows ... ## ##$comparison
## [1] "WT1" "MM1"
##
## $genes ## genes.seqnames genes.start genes.end genes.strand genes.type ## AT1G01010 1 3631 5899 + gene ## AT1G01020 1 6788 9130 - gene ## AT1G01030 1 11649 13714 - gene ## AT1G01040 1 23121 31227 + gene ## AT1G01046 1 28500 28706 + gene ## genes.gene_biotype genes.gene_id genes.gene_name length ## AT1G01010 protein_coding AT1G01010 NAC001 2269 ## AT1G01020 protein_coding AT1G01020 ARV1 2343 ## AT1G01030 protein_coding AT1G01030 NGA3 2066 ## AT1G01040 protein_coding AT1G01040 DCL1 8107 ## AT1G01046 miRNA AT1G01046 MIR838A 207 ## 20284 more rows ... Methylation dataset contains the densities of methylation levels from control (Col-0) and Memory lines samples at DMP on gene-body regions. Densities were computed using function getGRegionsStat from the R package MethylIT.utils. Methylation data for a memory G1 sample dmp_den_genes[, "M_1_105"] ## GRanges object with 33164 ranges and 1 metadata column: ## seqnames ranges strand | M_1_105 ## <Rle> <IRanges> <Rle> | <numeric> ## AT1G01010 1 3631-5899 * | 0.841607 ## AT1G01020 1 6788-9130 * | 0.389668 ## AT1G03987 1 11101-11372 * | 0.000000 ## AT1G01030 1 11649-13714 * | 0.000000 ## AT1G01040 1 23121-31227 * | 6.776141 ## ... ... ... ... . ... ## AT5G09945 5 26966885-26967079 * | 0.00000 ## AT5G67630 5 26967378-26969400 * | 1.28801 ## AT5G67640 5 26969516-26970668 * | 0.00000 ## AT5G09955 5 26971389-26971689 * | 1.05172 ## AT5G09965 5 26972177-26972644 * | 0.00000 ## ------- ## seqinfo: 5 sequences from an unspecified genome; no seqlengths ### Dataset for the first generation The analysis will be performed with Memory line (M1) and wild type (WT1) control, both of them, MM1 and WT1, first generation G1. idx <- na.omit(match(rownames(deg_mm1_wt1), names(dmp_den_genes))) dmp_den_m1 <- dmp_den_genes[idx, c("M_1_105", "M_1_168", "M_1_179", "M_1_27", "M_1_3")] dmp_den_wt1 <- dmp_den_genes[idx, c( "W_1_1", "W_1_3", "W_1_4", "W_1_5")] The means of methylation levels densities are computed for each group mcols(dmp_den_wt1) <- rowMeans(as.matrix(mcols(dmp_den_wt1))) mcols(dmp_den_m1) <- rowMeans(as.matrix(mcols(dmp_den_m1))) The absolute difference of the group density means of methylation levels between control (Col-0) and Memory lines at gene-body regions (G1). dmp_den_wt1_mm1 <- dmp_den_m1 mcols(dmp_den_wt1_mm1) <- as.matrix(mcols(dmp_den_m1)) - as.matrix(mcols(dmp_den_wt1)) colnames(mcols(dmp_den_wt1_mm1)) <- "dmp_den" idx <- which(dmp_den_wt1_mm1$dmp_den > 0)
dmp_den_wt1_mm1 <- dmp_den_wt1_mm1[idx]
dmp_den_wt1_mm1
## GRanges object with 20022 ranges and 1 metadata column:
##             seqnames            ranges strand |   dmp_den
##                <Rle>         <IRanges>  <Rle> | <numeric>
##   AT1G01010        1         3631-5899      * |  0.734204
##   AT1G01020        1         6788-9130      * |  0.304498
##   AT1G01030        1       11649-13714      * |  0.180618
##   AT1G01040        1       23121-31227      * |  6.571474
##   AT1G01046        1       28500-28706      * |  2.597185
##         ...      ...               ...    ... .       ...
##   AT4G40060        4 18571239-18573084      * |  0.166230
##   AT4G40065        4 18574503-18575585      * |  0.121346
##   AT4G40070        4 18576216-18577774      * |  0.151446
##   AT4G40085        4 18579194-18580702      * |  0.161783
##   AT4G40080        4 18579314-18580717      * |  0.141613
##   -------
##   seqinfo: 5 sequences from an unspecified genome; no seqlengths

Final variables: |group means density differences| and $$|logFC|$$. For the sake of brevity and clarity, we shall call the first variable: divergence of methylation densities or simply methylation divergence ($$MD$$).

.deg_mm1_wt1 <- deg_mm1_wt1[match(names(dmp_den_wt1_mm1), rownames(deg_mm1_wt1)), ]

MD <- dmp_den_wt1_mm1$dmp_den # Methylation (|group mean density differences|) idx <- which(.deg_mm1_wt1$table$logFC < 0 ) # Down-regulated logFC <- abs(.deg_mm1_wt1$table$logFC[idx]) # Gene Expression (|logFC|) MD <- MD[idx] DataFrame(MD,logFC) ## DataFrame with 13962 rows and 2 columns ## MD logFC ## <numeric> <numeric> ## 1 0.734204 0.600930 ## 2 0.304498 0.607292 ## 3 0.180618 0.254382 ## 4 2.597185 1.612719 ## 5 0.823158 0.515993 ## ... ... ... ## 13958 0.383503 0.371768 ## 13959 0.532277 0.136153 ## 13960 0.166230 0.201576 ## 13961 0.161783 0.115828 ## 13962 0.141613 1.628195 There are 13962 genes to be included in the downstream analysis. ### Distribution of the divergence of methylation level density at gene-body. MM1 vs WT1 Searching for the best fitted distribution functions by using R functions fitdistr from MASS R package and fitCDF from usefr R package did not yield a suitable results. The best fitted model was found for a mixture distribution model obtained with the function fitMixDist from usefr R package. The best fitted model for $$MD$$, from comparison MM1 vs WT1, was a mixture of Normal and Weibull distributions. We have left the algorithm to estimate the initial value x <- MD[ MD > 0] # Weibull distribution is defineed for MD > 0. meth_FIT <- fitMixDist(X = x, args = list(norm = c(mean = NA, sd = NA), weibull = c(shape = NA, scale = NA)), dens = TRUE, npoints = 900) ## fitting ... ## | | | 0% | |=================================== | 50% | |======================================================================| 100% ## *** Performing nonlinear regression model crossvalidation... summary(meth_FIT$fit)
##
## Parameters:
##       Estimate Std. Error t value Pr(>|t|)
## mean  0.280658   0.001606   174.7   <2e-16 ***
## sd    0.173889   0.001339   129.8   <2e-16 ***
## shape 2.111095   0.019847   106.4   <2e-16 ***
## scale 1.386660   0.008021   172.9   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.02187 on 1022 degrees of freedom
## Number of iterations to termination: 13
## Reason for termination: Relative error in the sum of squares is at most ftol'.
cat("\n Mixture proportions (phi):", meth_FIT$phi) ## ## Mixture proportions (phi): 0.4118238 0.5881762 The graphic for histogram and density distribution curves: par(mar = c(4, 4, 2, 1), font.lab=2, font=2,font.axis=2,family="serif" , lwd = 1.2, cex = 1.4) hist(x, 240, freq = FALSE, las = 1, #ylim = c(0, 0.021), cex = 1, panel.first={points(0, 0, pch=16, cex=1e6, col="grey95") grid(col="white", lty = 1)}, main = text, cex.main = 0.9, xlim = c(0, 5), family = "serif", col = "cyan1", border = "deepskyblue", xaxt ="n", yaxt = "n", ann = FALSE) axis(1, padj = -1, las = 1, lwd = 0.8, tck = -0.02) axis(2, hadj = 0.7, las = 1, lwd = 0.8, tck = -0.02) mtext(side = 1, text = "Divergence of methylation level density at gene-body", line = 1.6, cex = 1.4, font = 3) mtext(side = 2, text = "Density distribution", line = 2.1, cex = 1.4, font = 3) mtext("(Normal & Weibull Mixture Dists.)", cex = 1.4, font = 3) x1 <- seq(0, 10, by = 0.01) lines(x1, dmixtdistr(x1, phi = meth_FIT$phi, arg = meth_FIT$args), col = "blue") ### Distribution of the $$|logFC|$$. MM1 vs WT1 The best fitted model for the $$|logFC|$$, from comparison MM1 vs WT1, was a mixture of Gamma and Weibull distributions. Function fitMixDist calls the mixture function models set up in function mixtdistr x <- logFC[ logFC > 0 ] gexp_FIT <- fitMixDist(X = x, args = list(gamma = c(shape = NA, scale = NA), weibull = c(shape = NA, scale = NA)), dens = TRUE, npoints = 900) ## fitting ... ## | | | 0% | |=================================== | 50% | |======================================================================| 100% ## *** Performing nonlinear regression model crossvalidation... summary(gexp_FIT$fit)
##
## Parameters:
##       Estimate Std. Error t value Pr(>|t|)
## shape 0.683987   0.017598   38.87   <2e-16 ***
## scale 1.774870   0.115703   15.34   <2e-16 ***
## shape 1.372105   0.005934  231.22   <2e-16 ***
## scale 0.407377   0.001764  230.91   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.02787 on 937 degrees of freedom
## Number of iterations to termination: 16
## Reason for termination: Relative error in the sum of squares is at most ftol'.
cat("\n Mixture proportions (phi):", gexp_FIT$phi) ## ## Mixture proportions (phi): 0.2149922 0.7850078 The graphic for histogram and density distribution curves: par(mar = c(3, 3, 2, 1), font.lab=2, font=2,font.axis=2,family="serif", lwd = 1.2, cex = 1.4) hist(x, 200, freq = FALSE, las = 1, #ylim = c(0, 0.021), cex = 1, panel.first={points(0, 0, pch=16, cex=1e6, col="grey95") grid(col="white", lty = 1)}, main = text, cex.main = 0.9, xlim = c(0, 4), family = "serif", col = "cyan1", border = "deepskyblue", xaxt ="n", yaxt = "n", ann = FALSE) axis(1, padj = -1.5, las = 1, lwd = 0.5, tck = -0.02) axis(2, hadj = 0.4, las = 1, lwd = 0.5, tck = -0.02) mtext(side = 1, text = expression(italic(abs(logFC))), line = 1.3, cex = 1.4, font = 3) mtext(side = 2, text = "Density distribution", line = 1.6, cex = 1.4, font = 3) mtext("(Weibull & Normal Mixture Dists.)", cex = 1.4, font = 3) x1 <- seq(0, 10, by = 0.01) lines(x1, dmixtdistr(x1, phi = gexp_FIT$phi, arg = gexp_FIT$args), col = "blue") ### Joint distribution of $$MD$$ and $$|logFC|$$ expressed in terms of Copula distribution. There various steps involved in investigating the dependence between two random variables. The state of the art in this subject include the application of copula distributions. Sklar’s theorem ((2, 3)) named after Abe Sklar, provides the theoretical foundation for the application of copulas. Sklar’s theorem states that every multivariate cumulative distribution function (CDF): $H(x_1, ..., x_d)=Pr(X_1 \leq x_1, ..., X_d \leq x_d)$ of any $$d-tuple$$ of continuous random variables $$(X_1, ..., X_d)$$ can be expressed in terms of its marginals CDFs: $$F_i(x_i)=Pr(X_i \leq x_i)$$ a copula $$C$$: $H(x_1, ..., x_d)=C(F_1(x_1), ..., F_d(x_d))$ While Sklar showed that $$C$$ and $$F_i$$ are uniquely determined when H is known, a valid model for $$(X_1, ..., X_d)$$ arises from the last equation whenever the “ingredients” are chosen from given parametric families of distribution ((4)): $F \in (F_\delta), C \in (C_\alpha)$ Current literature on copula theory provides abundant information on models and copula families. Here, for the particular case of bi-dimensional copula defined for $$MD$$ and $$|logFC|$$, we have chosen Farlie–Gumbel–Morgenstern family of copulas, which are defined for, each $$\theta \in [-1, 1]$$ by: $C_\alpha(u, v)=uv+\alpha uv(1-u)(1-v)$ $$u,v \in [0,1]$$. The correlation coefficient $$\rho$$ is linked to $$\alpha$$ by the relationship: $$\rho=\frac{\alpha}{3}$$, which ranges from $$-\frac{1}{3}$$ to $$\frac{1}{3}$$, where $$u=F(x)$$, $$v=G(y)$$, with marginal CDFs: $$F(x)$$ and $$G(y)$$. #### Fitting Farlie–Gumbel–Morgenstern copulas with $$MD$$ and $$|logFC|$$ marginal CDFs The first step requires for provide the parameters values, previously estimated, of the marginal distributions, which were given throughout the function mixtdistr. The the specific parameter values are retrieved from the previous nonlinear fit obtained for the marginal CDFs: margins = c("mixtdistr", "mixtdistr") parMargins = list( list(phi = meth_FIT$phi, arg = meth_FIT$args), list(phi = gexp_FIT$phi, arg = gexp_FIT$args)) The parameter estimation and goodness-of-fit (GOF) test take advantages of the tools available in the R package copula. Both, parameter estimation and GOF, can be accomplished with function bicopulaGOF from the R package usefr. fgm_cop <- bicopulaGOF(x = MD, y = logFC, copula = "fgmCopula", sample.size = 13000, margins = margins, paramMargins = parMargins, nboots = 10, approach = "chisq", breaks = 30, num.cores = 6L, seed = 123, verbose = FALSE) The null hypothesis that the data follows FGM copula distribution is not rejected. fgm_cop$gof
##   Chisq.stat   mc_p.value  sample.size    num.sampl
## 1.117631e+03 1.818182e-01 1.300000e+04 1.000000e+01

All the information needed about the fitted FGM copula is found in the object fgm_cop$copula, which is a mvdc object from copula package with several slots. In particular, the specific information about FGM copula is found in: fgm_cop$copula@copula
## Farlie-Gumbel-Morgenstern (FGM) copula, dim. d = 2
##  param.: -0.509
## Dimension:  2
## Parameters:
##   alpha1.2   = -0.5091997

#### P-P plot. Graphical Diagnostics of GOF

The P-P plot is build with function ppCplot from the R package usefr. If the GOF is good enough, then it is expected that points in the graphics are closed to the diagonal (red) straight line. Details on how the computation is performed are provided in the help for function. ppCplot.

g <- ppCplot(X = MD, Y = logFC, copula = fgm_cop$copula, npoints = 500, col = "blue", cex = 0.5) Information on the data and copula used to yield the graphic are stored in the output variable “g”. For example, the empirical and theoretical probabilities used to built the graphic can be retrieved from the data frame: “g$data”:

DataFrame(g$data) ## DataFrame with 500 rows and 2 columns ## emprob thprob ## <numeric> <numeric> ## 1 0.000859476 0.000838486 ## 2 0.114811632 0.119596858 ## 3 0.198682137 0.199610746 ## 4 0.174616817 0.181781104 ## 5 0.449863916 0.455994107 ## ... ... ... ## 496 0.100774 0.0964349 ## 497 0.139092 0.1387958 ## 498 0.193740 0.1872904 ## 499 0.489042 0.4987606 ## 500 0.464475 0.4624016 #### Three-dimensional copula density plot The domain region $$(X, Y)$$ is defined and function dMvdc is used to compute the density $$z$$ in the grid $$[X,Y] \times [X,Y]$$ using the previous FMG fitted copula. X = seq(0, 1.5, length = 100) Y = seq(0, 1.5, length = 100) sp_rho <- round(rho(fgm_cop$copula@copula),3) # Spearman's rho: -0.171

z <- outer(X, Y, function(x,y) dMvdc(cbind(x,y),fgm_cop$copula)) * 4 phi = 20; theta = 45 zf <- function(x,y) dMvdc(cbind(x,y),fgm_cop$copula) * 3.2

cat("Spearman's rho = ", sp_rho)
## Spearman's rho =  -0.17

An auxiliary function to draw the panels

panelfirst <- function(pmat) {
par(mfrow = c(1, 1),mar=c(4,7,0, 1), font=2,lwd = 0.01, cex =1, family = "serif")
YZ <- trans3D(x = rep(0, ncol(z)), Y, z[18,], pmat = pmat)

scatter2D(YZ$x, YZ$y, colvar = z[18,], lwd = 1,
type = "l", add = TRUE, colkey = FALSE)

XZ <- trans3D(x = X, y = rep(max(Y), nrow(z)),
z = z[, 12], pmat = pmat)

scatter2D(XZ$x, XZ$y, colvar = z[, 12], lwd = 1,
type = "l", add = TRUE, colkey = FALSE)

}

The graphical settings depend on the final printing destiny and on the dataset.

par(mfrow = c(1, 1),mar=c(4,7,0, 1), font=2,lwd = 0.1, cex =1, family = "serif")
persp3D(X, Y, z, lwd = 0.01, image = TRUE,
xlab = "", ylab = "", zlab = "", family = "serif",
box = T, panel.first = panelfirst, bty = "u",
col.panel = "steelblue", col.grid = "black", lwd.grid = 0.5,
contour = list(col = "black", side = "zmin", lwd = 0.4),
ticktype = "detailed", clab = "", expand = 0.7, d = 2,
phi = phi, theta = theta, resfac = 2.5, cex.lab = 1, lwd = 0.005,
xlim = c(0, 1.5), zlim = c(-5, 6), font=2, cex.axis = 1.2,
colkey = list(length = 0.7, width = 0.6, col.ticks = "black", lwd = 0.01,
font=2, shift = 0, dist = 0, lwd.ticks = 0.01,  cex.clab = 1.2,
cex.axis = 1, line = -0.8))
text(0.38,-0.48, srt = 34, labels = expression(paste("|", italic(log[2]), italic(FC), "|")),
xpd = TRUE,font = 2, cex = 1.2)
text(-0.3,-0.44, srt = -28, labels = expression(paste(italic("MD"))),
xpd = TRUE, font=2, cex = 1.2)
text(-0.3,-0.6, srt = 0, labels = expression(paste("Spearman' s ", italic(rho), ": ", -0.17)),
xpd = TRUE, font = 1, cex = 1)

The marginal distribtuions are drawn in the $$XZ$$ and $$YZ$$ vertical planes. Consistent with the Spearman’ s $$\rho=-0.17$$, for a fixed value of $$|log_2FC|$$, (a line parallel to $$MD$$ axis) the joint probability density distribution function $$f(X \leq MD, Y \leq |log_2FC|)$$ decreases with the increment of $$MD$$. That is, the probability to observe DEGs with values, say between 0.1 and 0.5,
$$0.1 \leq |log_2FC| \leq 0.5$$, is greater for $$0.1 \eq MD \leq 0.5$$ than for $$0.5 < MD \leq 1$$.

#### Two-dimensional copula density plot, Contour plot

The 2D-dimensional plot permits to see in more detail the projection of the density surface on the $$XY$$ plane and contour lines.

It is not difficult to see that for a fixed value of $$MD$$ (a line parallel to $$|log_2FC|$$ axis) the joint probability density distribution function $$f(X \leq MD, Y \leq |log_2FC|)$$ (color gradient from red to blue) decreases with the increment of $$|log_2FC|$$. That is, the probability to observe $$MD$$ values, say between 0.1 and 0.5 (see the squared region below) $$0.1 \leq MD \leq 0.5$$, is greater for $$0.1 \leq |log_2FC| \leq 0.5$$ than for $$0.5 < |log_2FC| \leq 1$$.

In summary, gene expression and methylation processes (2kb upstream gene TSS regions) are not independent in memory line first generation. There exist an stochastic deterministic dependence between $$|log_2FC|$$ and $$MD$$ that can be expressed in terms joint probability distribution, which leads to an inverse relationship between the conditioned expected values $$|log_2FC|$$ given $$MD$$, and vice versa. This association is, however, weaker than the association found for gene expression and methylation at gene body.

### Asymptotic linear-by-linear association test

The above 3D and 2D density plots and results suggest that most of the genes have low $$|logFC|$$ and low $$MD$$, and the number of genes decrease for middle levels of $$|logFC|$$ and $$MD$$. This observation leads to the application of a linear-by-linear association test.

The linear-by-linear test (LLT) is used to test a linear association among variables in a contingency table with ordered categories ((5)). The null hypothesis for the linear-by-linear test is that there is no association among the variables in the table.

It is obvious from the 2D density plots that a relationship for $$MD$$ < 0.08 is not meaningful. In particular, both variables, $$|logFC|$$ and $$MD$$, are at the level of technical noise. The percentage of the genes at noise level are:

round(100 * sum(MD < 0.08)/length(MD), 2)
## [1] 1.78

Hence, the genes considered in the downstream analysis are:

idx <- which(MD > 0.08)
md  <- MD[ idx]
logfc  <- logFC[ idx ]
dt <- data.frame(md = md , logfc = logfc)
nrow(dt)
## [1] 13714

#### Discretization of $$MD$$ and $$|logFC|$$

To apply the LLT the values of $$|logFC|$$ and $$MD$$ must be discretized into ordinal values. A K-means discretization approach is applied with following auxiliary function

# ------------------- Auxiliary function for discretization ------------------ #
Kmean.discretization <- function(x, categories, algorithm = "Hartigan-Wong",
iter.max = 10^6, column = 1L) {
centers <- kmeans(x, centers = categories, algorithm = algorithm,
iter.max = iter.max)$centers # order the centers centers <- sort(centers) cl <- stats::kmeans(stats::na.omit(x), centers = centers, algorithm = algorithm, iter.max = iter.max ) return( data.frame(x, cluster = cl$cluster) )
}

#### Discretization for $$MD$$

cl = Kmean.discretization(md, categories = 3, algorithm = "MacQueen" )
cl_summary <- sapply(1:3, function(k) summary(cl$x[cl$cluster == k]))
colnames(cl_summary) <- c("low", "middle", "high")
cl_summary
##                low    middle      high
## Min.    0.08010906 0.9764634  2.346377
## 1st Qu. 0.25350885 1.1902080  2.557401
## Median  0.39796318 1.4417072  2.850635
## Mean    0.44995143 1.5025125  3.188218
## 3rd Qu. 0.63007886 1.7738444  3.420480
## Max.    0.97620682 2.3446362 20.519587

Cutpoint are now applied:

cutpoint1 <- min(cl_summary[,2])
cutpoint2 <- min(cl_summary[,3])

dt$md_level <- rep("High") dt$md_level[ dt$md < cutpoint2 ] <- "Middle" dt$md_level[ dt$md < cutpoint1 ] <- "Low" dt$md_level <- factor(dt$md_level, levels = c("Low", "Middle", "High")) #### Discretization for $$|logFC|$$ cl = Kmean.discretization(logfc, categories = 3, algorithm = "MacQueen" ) cl_summary <- sapply(1:3, function(k) summary(cl$x[cl$cluster == k])) colnames(cl_summary) <- c("eLow", "eMiddle", "eHigh") cl_summary ## eLow eMiddle eHigh ## Min. 2.270881e-05 0.6366017 1.835555 ## 1st Qu. 1.345836e-01 0.7410084 2.070626 ## Median 2.526246e-01 0.9079608 2.399485 ## Mean 2.729755e-01 0.9997545 2.671355 ## 3rd Qu. 4.015427e-01 1.1817728 2.905358 ## Max. 6.360073e-01 1.8328136 9.407983 cutpoint1 <- min(cl_summary[,2]) cutpoint2 <- min(cl_summary[,3]) dt$logfc_level <- rep("eHigh")
dt$logfc_level[ dt$logfc < cutpoint2 ] <- "eMiddle"
dt$logfc_level[ dt$logfc < cutpoint1   ] <- "eLow"

dt$logfc_level <- factor(dt$logfc_level, levels = c("eLow", "eMiddle", "eHigh"))

#### The contingency table and its graphic visualization

In the current case, the contingency table suggests an obvious relationship between $$|logFC|$$ and $$MD$$.

m <- as.matrix( base::table( dt$logfc_level, dt$md_level) )
m
##
##            Low Middle High
##   eLow    5815   3890  761
##   eMiddle 1997    630  177
##   eHigh    355     68   21

The spineplot of this contingency table (table(x = logfc_level, y = md_level)) help to visualize the existent relationship between $$|logFC|$$ and $$MD$$.

par( mar = c(5, 7, 4, 4), family = "Serif", font = 2, cex = 0.8, las = 1, lwd = 0.2 )
tb <- spineplot(m[, c(3,2,1)], col = c( "blue", "dodgerblue","deepskyblue" ),
xlab = "", ylab = "")
mtext(side = 1, text = expression(italic(abs(logFC))),
line = 2.5, cex = 1, font = 3)
mtext(side = 2, text = expression(paste(italic("MD"))),
line = 4, cex = 1, font = 3)
abline(a = 0.0, b = 0.72, col = "white", lty = "dashed", lwd = 1)

A dashed line was added to emphasize the direction of linear trend. From “Low” to “high” there seems to be a linear relationship between pairwise proportion of $$|logFC|$$ and $$MD$$.

Conceptually, the spineplot is a plot of the conditional probability $$P(y|x)$$ against $$P(x)$$. Since the joint probability distribution of these variable is described by a FGM copula, a linear relationship between $$P(y|x)$$ against $$P(x)$$ must be expected, which is a property of FGM copula (6).

#### Testing association between $$|logFC|$$ and $$MD$$

Here, we use the LLT test implemented in function lbl_test from the R package coin. Two test variants are available: 1) permutation test and asymptotic test.

The asymptotic LLT test:

lbl_test( m )
##
##  Asymptotic Linear-by-Linear Association Test
##
## data:  Var2 (ordered) by Var1 (eLow < eMiddle < eHigh)
## Z = -14.655, p-value < 2.2e-16
## alternative hypothesis: two.sided

The permutation LLT test:

lbl_test( m, distribution = "approximate", B = 10000, alternative = "less" )
##
##  Approximative Linear-by-Linear Association Test
##
## data:  Var2 (ordered) by Var1 (eLow < eMiddle < eHigh)
## Z = -14.655, p-value < 1e-04
## alternative hypothesis: less

The null hypothesis of no association among the variables in the table is rejected. In the current case the p-value < 1e-04 and, hence, we do not have enough evidence to support the null hypothesis.

### Generalized linear model (GLM)

The variable $$MD$$ was split into three levels, “0”: “Low” ($$0 < MD \leq 0.5$$), “1”: “Middle” ($$0.5 < MD \leq 1$$), and “2”: “High” ($1 < MD$)

meth_cut <- 1

xy <- cbind(MD, logFC)
idx <- which(logFC > 0.5)
xy <- data.frame(xy[idx, ])
xy$md <- 0 xy$md[xy$MD > meth_cut] <- 1 The glm support the decrease of $$-logFC$$ with th increment of $$MD$$. The model coefficients are highly statistically significant. gm <- glm(formula = logFC ~ md, data=xy, family = Gamma(link = "identity")) summary( gm ) ## ## Call: ## glm(formula = logFC ~ md, family = Gamma(link = "identity"), ## data = xy) ## ## Deviance Residuals: ## Min 1Q Median 3Q Max ## -0.6916 -0.4656 -0.2364 0.1463 3.3044 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) 1.09233 0.01227 89.030 <2e-16 *** ## md -0.17546 0.02040 -8.602 <2e-16 *** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## (Dispersion parameter for Gamma family taken to be 0.4203748) ## ## Null deviance: 1283.8 on 4662 degrees of freedom ## Residual deviance: 1255.4 on 4661 degrees of freedom ## AIC: 6431 ## ## Number of Fisher Scoring iterations: 3 The ANOVA test confirm the negative trend given by the coefficient of MD (highly statistically significant). anova(gm, test = "LRT") ## Analysis of Deviance Table ## ## Model: Gamma, link: identity ## ## Response: logFC ## ## Terms added sequentially (first to last) ## ## ## Df Deviance Resid. Df Resid. Dev Pr(>Chi) ## NULL 4662 1283.8 ## md 1 28.416 4661 1255.4 < 2.2e-16 *** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 #### Boxplot The boxplot suggest a clear trend of $$-logFC$$ decreasing with increment of MD levels. Notice that the notchs from the boxes do not overlap, which implies that the differences between medians from three groups (“Low”, “Middle”, and “High”) are statistically significant. So, the GLM results can be visualized. par(bty = "n", family = "serif", font = 2, cex = 1, lwd = 0.1, mar = c(4, 5, 2, 1.1)) boxplot(logFC ~ md, data = xy, ylim = c(0.4, 2.5), xaxt ="n", yaxt = "n", ann = FALSE) rect(0.5, 0.4, 3.5, 5, col = "grey94", lty = 0, lwd = 1) grid(0, 5, col = "white", lwd = 2, lty = 1) boxplot(logFC ~ md, data = xy, ylim = c(0.4, 2.5), notch = TRUE, xaxt ="n", yaxt = "n", ann = FALSE, pch = 20, medcol = "darkblue", whisklwd = 1, staplelwd = 1, col = c("deepskyblue", "dodgerblue"), outcol = "red", add = TRUE) axis(1, padj = 0.1, at = 1:2, lwd = 0.5, tck = -0.02, line = -0.6, cex.axis = 1.3, labels = c("Low", "High")) axis(2, hadj = 1, las = 1, lwd = 0.5, tck = -0.02, line = -1, cex.axis = 1.4) mtext(side = 1, text = expression(paste(italic("MD"))), line = 2.2, cex = 1.4, font = 3) mtext(side = 2, text = expression(italic(-logFC)), line = 2.1, cex = 1.4, font = 3) text(x = 1.5, y = 2.2, "***", pos = 3, cex = 1.2 ) ## Analysis for up-regulated genes ### Data set (II) .deg_mm1_wt1 <- deg_mm1_wt1[match(names(dmp_den_wt1_mm1), rownames(deg_mm1_wt1)), ] MD <- dmp_den_wt1_mm1$dmp_den # Methylation (|group mean density differences|)
idx <- which(.deg_mm1_wt1$table$logFC > 0 ) # Up-regulated
logFC <- .deg_mm1_wt1$table$logFC[idx] # Gene Expression (|logFC|)
MD <- MD[idx]
DataFrame(MD,logFC)
## DataFrame with 6060 rows and 2 columns
##             MD     logFC
##      <numeric> <numeric>
## 1     6.571474 0.0359792
## 2     1.199583 0.5287217
## 3     0.762753 0.1245334
## 4     0.274631 0.0142820
## 5     0.623658 0.4497949
## ...        ...       ...
## 6056  0.223080  0.317756
## 6057  0.634709  0.394906
## 6058  0.296653  0.943018
## 6059  0.121346  1.680553
## 6060  0.151446  1.921001

There are 6060 genes with $$MD > 0$$ and $$logFC > 0$$ to be included in the downstream analysis.

### Distribution of the divergence of methylation level density at gene-body. MM1 vs WT1 (II)

The best fitted model for $$MD$$, from comparison MM1 vs WT1, was a mixture of Normal and Weibull distributions.

x <- MD[ MD > 0]
meth_FIT <- fitMixDist(X = x, args = list(norm = c(mean = NA, sd = NA),
weibull = c(shape = NA, scale = NA)),
dens = TRUE, npoints = 900)
## fitting ...
##
|
|                                                                      |   0%
|
|===================================                                   |  50%
|
|======================================================================| 100%
## *** Performing nonlinear regression model  crossvalidation...
summary(meth_FIT$fit) ## ## Parameters: ## Estimate Std. Error t value Pr(>|t|) ## mean 0.285889 0.002124 134.60 <2e-16 *** ## sd 0.163948 0.001777 92.27 <2e-16 *** ## shape 2.093709 0.029261 71.55 <2e-16 *** ## scale 1.360781 0.011846 114.88 <2e-16 *** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Residual standard error: 0.04601 on 1044 degrees of freedom ## Number of iterations to termination: 12 ## Reason for termination: Relative error in the sum of squares is at most ftol'. cat("\n Mixture proportions (phi):", meth_FIT$phi)
##
##  Mixture proportions (phi): 0.4201527 0.5798473

The graphic for histogram and density distribution curves:

par(mar = c(4, 4, 2, 1), font.lab=2, font=2,font.axis=2,family="serif" , lwd = 1.2, cex = 1.4)
hist(MD, 90, freq = FALSE, las = 1, #ylim = c(0, 0.021), cex = 1,
panel.first={points(0, 0, pch=16, cex=1e6, col="grey95")
grid(col="white", lty = 1)}, main = text,
cex.main = 0.9, xlim = c(0, 5), family = "serif",
col = "cyan1", border = "deepskyblue", xaxt ="n", yaxt = "n", ann = FALSE)
axis(1, padj = -1, las = 1, lwd = 0.8, tck = -0.02)
axis(2, hadj = 0.7, las = 1, lwd = 0.8, tck = -0.02)
mtext(side = 1, text = "Divergence of methylation level density at gene-body",
line = 1.6, cex = 1.4, font = 3)
mtext(side = 2, text = "Density distribution", line = 2.1, cex = 1.4, font = 3)
mtext("(Normal & Weibull Mixture Dists.)", cex = 1.4, font = 3)
x1 <- seq(0, 10, by = 0.01)
lines(x1, dmixtdistr(x1, phi = meth_FIT$phi, arg = meth_FIT$args), col = "blue")

### Distribution of the $$|logFC|$$. MM1 vs WT1 (II)

The best fitted model for the $$|logFC|$$, from comparison MM1 vs WT1, was a mixture of Gamma and Weibull distributions. Function fitMixDist calls the mixture function models set up in function mixtdistr

gexp_FIT <- fitMixDist(X = logFC, args = list(gamma = c(shape = NA, scale = NA),
weibull = c(shape = NA, scale = NA)),
dens = TRUE, npoints = 900)
## fitting ...
##
|
|                                                                      |   0%
|
|===================================                                   |  50%
|
|======================================================================| 100%
## *** Performing nonlinear regression model  crossvalidation...
summary(gexp_FIT$fit) ## ## Parameters: ## Estimate Std. Error t value Pr(>|t|) ## shape 1.346227 0.266677 5.048 5.54e-07 *** ## scale 0.718222 0.148811 4.826 1.67e-06 *** ## shape 1.028815 0.006226 165.255 < 2e-16 *** ## scale 0.276247 0.009824 28.121 < 2e-16 *** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Residual standard error: 0.04628 on 791 degrees of freedom ## Number of iterations to termination: 7 ## Reason for termination: Relative error in the sum of squares is at most ftol'. cat("\n Mixture proportions (phi):", gexp_FIT$phi)
##
##  Mixture proportions (phi): 0.2454972 0.7545028

The graphic for histogram and density distribution curves:

par(mar = c(3, 3, 2, 1), font.lab=2, font=2,font.axis=2,family="serif", lwd = 1.2, cex = 1.4)
hist(logFC, 120, freq = FALSE, las = 1, #ylim = c(0, 0.021), cex = 1,
panel.first={points(0, 0, pch=16, cex=1e6, col="grey95")
grid(col="white", lty = 1)}, main = text,
cex.main = 0.9, xlim = c(0, 4), family = "serif",
col = "cyan1", border = "deepskyblue", xaxt ="n", yaxt = "n", ann = FALSE)
axis(1, padj = -1.5, las = 1, lwd = 0.5, tck = -0.02)
axis(2, hadj = 0.4, las = 1, lwd = 0.5, tck = -0.02)
mtext(side = 1, text = expression(italic(abs(logFC))),
line = 1.3, cex = 1.4, font = 3)
mtext(side = 2, text = "Density distribution", line = 1.6, cex = 1.4, font = 3)
mtext("(Gamma & Weibull Mixture Dists.)", cex = 1.4, font = 3)
x1 <- seq(0, 10, by = 0.01)
lines(x1, dmixtdistr(x1, phi = gexp_FIT$phi, arg = gexp_FIT$args), col = "blue")

### Joint distribution of $$MD$$ and $$|logFC|$$ expressed in terms of Copula distribution (II)

#### Fitting Farlie–Gumbel–Morgenstern copulas with $$MD$$ and $$|logFC|$$ marginal CDFs (II)

The first step requires for provide the parameters values, previously estimated, of the marginal distributions, which were given throughout the function mixtdistr. The specific parameter values are retrieved from the previous nonlinear fit obtained for the marginal CDFs:

margins = c("mixtdistr", "mixtdistr")
parMargins = list(
list(phi = meth_FIT$phi, arg = meth_FIT$args),
list(phi = gexp_FIT$phi, arg = gexp_FIT$args))

The parameter estimation and goodness-of-fit (GOF) test take advantage of the tools available in the R package copula. Both, parameter estimation and GOF, can be accomplished with function bicopulaGOF from the R package usefr.

fgm_cop <- bicopulaGOF(x = MD, y = logFC, copula = "fgmCopula", sample.size = 5999,
margins = margins, paramMargins = parMargins, nboots = 10,
approach = "chisq", breaks = 30, num.cores = 6L, seed = 123,
verbose = FALSE)

The null hypothesis that the data follows FGM copula distribution is not rejected.

fgm_cop$gof ## Chisq.stat mc_p.value sample.size num.sampl ## 924.4173554 0.3636364 5999.0000000 10.0000000 All the information needed about the fitted FGM copula is found in the object fgm_cop$copula, which is a mvdc object from copula package with several slots. In particular, the specific information about FGM copula is found in:

fgm_cop$copula@copula ## Farlie-Gumbel-Morgenstern (FGM) copula, dim. d = 2 ## param.: -0.492 ## Dimension: 2 ## Parameters: ## alpha1.2 = -0.491595 #### P-P plot. Graphical Diagnostics of GOF (II) The P-P plot is build with function ppCplot from the R package usefr. If the GOF is good enough, then it is expected that points in the graphics are closed to the diagonal (red) straight line. Details on how the computation is performed are provided in the help for function. ppCplot. g <- ppCplot(X = MD, Y = logFC, copula = fgm_cop$copula, npoints = 500, col = "blue", cex = 0.5)

Information on the data and copula used to yield the graphic are stored in the output variable “g”. For example, the empirical and theoretical probabilities used to built the graphic can be retrieved from the data frame: “g$data”: DataFrame(g$data)
## DataFrame with 500 rows and 2 columns
##        emprob    thprob
##     <numeric> <numeric>
## 1    0.122277  0.128040
## 2    0.181023  0.186947
## 3    0.127558  0.131029
## 4    0.139109  0.141022
## 5    0.286964  0.294494
## ...       ...       ...
## 496  0.167162  0.167162
## 497  0.189109  0.182595
## 498  0.675743  0.676203
## 499  0.339109  0.339881
## 500  0.444389  0.439957

#### Three-dimentional copula density plot (II)

The domain region $$(X, Y)$$ is defined and function dMvdc is used to compute the density $$z$$ in the grid $$[X,Y] \times [X,Y]$$ using the previous FMG fitted copula.

X = seq(0, 1.5, length = 100)
Y = seq(0, 1.8, length = 100)
sp_rho <- round(rho(fgm_cop$copula@copula),3) # Spearman' s rho: -0.171 z <- outer(X, Y, function(x,y) dMvdc(cbind(x,y),fgm_cop$copula)) * 4
phi = 15; theta = 45
sp_rho
## [1] -0.164

An auxiliary function to draw the panels

panelfirst <- function(pmat) {
par(mfrow = c(1, 1),mar=c(4,7,0, 1), font=2,lwd = 0.01, cex =1, family = "serif")
YZ <- trans3D(x = rep(0, ncol(z)), Y, z[18,], pmat = pmat)

scatter2D(YZ$x, YZ$y, colvar = z[18,], lwd = 1,
type = "l", add = TRUE, colkey = FALSE)

XZ <- trans3D(x = X, y = rep(max(Y), nrow(z)),
z = z[, 12], pmat = pmat)

scatter2D(XZ$x, XZ$y, colvar = z[, 12], lwd = 1,
type = "l", add = TRUE, colkey = FALSE)

}

The graphical settings depend on the final printing destiny and on the dataset.

par(mfrow = c(1, 1),mar=c(4,7,0, 1), font=2,lwd = 0.1, cex =1, family = "serif")
persp3D(X, Y, z, lwd = 0.01, image = TRUE,
xlab = "", ylab = "", zlab = "", family = "serif",
box = T, panel.first = panelfirst, bty = "u",
col.panel = "steelblue", col.grid = "black", lwd.grid = 0.5,
contour = list(col = "black", side = "zmin", lwd = 0.4),
ticktype = "detailed", clab = "", expand = 0.7, d = 2,
phi = phi, theta = theta, resfac = 2.5, cex.lab = 1, lwd = 0.005,
xlim = c(0, 1.5), zlim = c(-5, 9), font=2, cex.axis = 1.2,
colkey = list(length = 0.5, width = 0.6, col.ticks = "black", lwd = 0.01,
font=2, shift = 0, dist = 0, lwd.ticks = 0.01,  cex.clab = 1.2,
cex.axis = 1, line = -0.8))
text(0.38,-0.4, srt = 38, labels = expression(paste("|", italic(log[2]), italic(FC), "|")),
xpd = TRUE,font = 2, cex = 1.2)
text(-0.41,-0.39, srt = -20, labels = expression(paste(italic("MD"))),
xpd = TRUE, font=2, cex = 1.2)
text(-0.36,-0.5, srt = 0, labels = expression(paste("Spearman' s ", italic(rho), ": ", -0.164)),
xpd = TRUE, font = 1, cex = 1)

The marginal distributions are drawn in the $$XZ$$ and $$YZ$$ vertical planes. Consistent with the Spearman’ s $$\rho=-0.164$$, for a fixed value of $$|log_2FC|$$, (a line parallel to $$MD$$ axis) the joint probability density distribution function $$f(X \leq MD, Y \leq log_2FC)$$ decreasses with the increment of $$MD$$. That is, the probability to observe DEGs with values, say between 0.1 and 0.5,
$$0.1 \leq log_2FC \leq 0.5$$, is greater for $$0.1 \leq MD \leq 0.5$$ than for $$0.5 < MD \leq 1$$.

#### Two-dimensional copula density plot, Contour plot (II)

The 2D-dimensional plot permits to see in more detail the projection of the density surface on the $$XY$$ plane and contour lines.

It is not difficult to see that for a fixed value of $$MD$$ (a line parallel to $$|log_2FC|$$ axis) the joint probability density distribution function $$f(X \leq MD, Y \leq |log_2FC|)$$ (color gradient from red to blue) decreasses with the increment of $$|log_2FC|$$. That is, the probability to observe $$MD$$ values, say between 0.1 and 0.5 (see the squared region below) $$0.1 \leq MD \leq 0.5$$, is greater for $$0.1 \leq |log_2FC| \leq 0.5$$ than for $$0.5 < |log_2FC| \leq 1$$.

In summary, gene expression and methylation processes are not independent in memory line first generation. There exist an stochastic deterministic dependence between $$|log_2FC|$$ and $$MD$$ that can be expressed in terms joint probability distribution, which leads to an inverse relationship between the conditioned expected values $$|log_2FC|$$ given $$MD$$, and vice versa.

### Asymptotic linear-by-linear association test (II)

As in the previous case (above), tt is obvious from the 2D density plots that for $$MD$$ < 0.08 the any relationship is not meaningful. In particular, both variables, $$logFC$$ and $$MD$$, are at the level of technical noise. The percentage of the genes at noise level are:

round(100 * sum(MD < 0.08)/length(MD), 2)
## [1] 1.42

The genes for the downstream analysis are:

idx <- which(MD > 0.08)
md  <- MD[ idx]
logfc  <- logFC[ idx ]
dt <- data.frame(md = md , logfc = logfc)
nrow(dt)
## [1] 5974

#### Discretization of $$MD$$ and $$logFC$$ (II)

# ------------------- Auxiliary function for discretization ------------------ #
Kmean.discretization <- function(x, categories, algorithm = "Hartigan-Wong",
iter.max = 10^6, column = 1L) {
centers <- kmeans(x, centers = categories, algorithm = algorithm,
iter.max = iter.max)$centers # order the centers centers <- sort(centers) cl <- stats::kmeans(stats::na.omit(x), centers = centers, algorithm = algorithm, iter.max = iter.max ) return( data.frame(x, cluster = cl$cluster) )
}

#### Discretization for $$MD$$ (II)

cl = Kmean.discretization(md, categories = 3, algorithm = "MacQueen" )
cl_summary <- sapply(1:3, function(k) summary(cl$x[cl$cluster == k]))
colnames(cl_summary) <- c("low", "middle", "high")
cl_summary
##               low    middle      high
## Min.    0.0809488 0.9703714  2.348914
## 1st Qu. 0.2543436 1.1740022  2.546431
## Median  0.3925193 1.4335135  2.922496
## Mean    0.4465135 1.4933123  3.204277
## 3rd Qu. 0.6216258 1.7647503  3.509952
## Max.    0.9683489 2.3473622 10.496433

Cutpoint are now applied:

cutpoint1 <- min(cl_summary[,2])
cutpoint2 <- min(cl_summary[,3])

dt$md_level <- rep("High") dt$md_level[ dt$md < cutpoint2 ] <- "Middle" dt$md_level[ dt$md < cutpoint1 ] <- "Low" dt$md_level <- factor(dt$md_level, levels = c("Low", "Middle", "High")) #### Discretization for $$logFC$$ (II) cl = Kmean.discretization(logfc, categories = 3, algorithm = "MacQueen" ) cl_summary <- sapply(1:3, function(k) summary(cl$x[cl$cluster == k])) colnames(cl_summary) <- c("eLow", "eMiddle", "eHigh") cl_summary ## eLow eMiddle eHigh ## Min. 3.772575e-05 0.6448911 2.052246 ## 1st Qu. 8.178209e-02 0.7713312 2.324534 ## Median 1.865436e-01 0.9716668 2.575225 ## Mean 2.257154e-01 1.0636271 3.027001 ## 3rd Qu. 3.430369e-01 1.2671241 3.394137 ## Max. 6.443115e-01 2.0306128 7.945112 cutpoint1 <- min(cl_summary[,2]) cutpoint2 <- min(cl_summary[,3]) dt$logfc_level <- rep("eHigh")
dt$logfc_level[ dt$logfc < cutpoint2 ] <- "eMiddle"
dt$logfc_level[ dt$logfc < cutpoint1   ] <- "eLow"

dt$logfc_level <- factor(dt$logfc_level, levels = c("eLow", "eMiddle", "eHigh"))

#### The contingency table and its graphic visualization (II)

In the current case, the contingency table suggests an obvious relationship between $$|logFC|$$ and $$MD$$.

m <- as.matrix( base::table( dt$logfc_level, dt$md_level) )
m
##
##            Low Middle High
##   eLow    2721   1676  355
##   eMiddle  795    210   59
##   eHigh    128     24    6

The spineplot of this contingency table (table(x = logfc_level, y = md_level)) help to visualize the existent relationship between $$|logFC|$$ and $$MD$$.

par( mar = c(5, 7, 4, 4), family = "Serif", font = 2, cex = 0.8, las = 1, lwd = 0.2 )
tb <- spineplot(m[, c(3,2,1)], col = c( "blue", "dodgerblue","deepskyblue" ),
xlab = "", ylab = "")
mtext(side = 1, text = expression(italic(logFC)),
line = 2.5, cex = 1, font = 3)
mtext(side = 2, text = expression(paste(italic("MD"))),
line = 4, cex = 1, font = 3)
abline(a = 0.0, b = 0.72, col = "white", lty = "dashed", lwd = 1)

A dashed line was added to emphasize the direction of linear trend. From “Low” to “high” there seems to be a linear relationship between pairwise proportion of $$|logFC|$$ and $$MD$$.

Conceptually, the spineplot is a plot of the conditional probability $$P(y|x)$$ against $$P(x)$$. Since the joint probability distribution of these variable is described by a FGM copula, a linear relationship between $$P(y|x)$$ against $$P(x)$$ must be expected, which is a property of FGM copula (6).

#### Testing association between $$logFC$$ and $$MD$$ (II)

Here, we use the LLT test implemented in function lbl_test from the R package coin. Two test variants are available: 1) permutation test and asymptotic test.

The asymptotic LLT test:

lbl_test( m )
##
##  Asymptotic Linear-by-Linear Association Test
##
## data:  Var2 (ordered) by Var1 (eLow < eMiddle < eHigh)
## Z = -10.163, p-value < 2.2e-16
## alternative hypothesis: two.sided

The permutation LLT test:

lbl_test( m, distribution = "approximate", B = 10000, alternative = "less" )
##
##  Approximative Linear-by-Linear Association Test
##
## data:  Var2 (ordered) by Var1 (eLow < eMiddle < eHigh)
## Z = -10.163, p-value < 1e-04
## alternative hypothesis: less

The null hypothesis of no association among the variables in the table is rejected. In the current case the $$p-value < 1e-04$$ and, hence, we do not have enough evidence to support the null hypothesis.

### Generalized linear model (GLM) (II)

The variable $$MD$$ is split int three levels, 0:“Low” ($$0 < MD \leq 0.5$$), 1:“Middle” ($$0.5 < MD \leq 1$$), and 2:“High” ($1 < MD$)

meth_cut1 <- 0.5
meth_cut2 <- 1.2

xy <- cbind(MD, logFC)
idx <- which(logFC > 0.5)
xy <- data.frame(xy[idx, ])
dt$md <- 0 dt$md[ dt$MD > meth_cut1 ] <- 1 gm <- glm(formula = logFC ~ md, data = dt, family = Gamma(link = "identity")) summary( gm ) ## ## Call: ## glm(formula = logFC ~ md, family = Gamma(link = "identity"), ## data = dt) ## ## Deviance Residuals: ## Min 1Q Median 3Q Max ## -0.7587 -0.5299 -0.2535 0.1411 2.9756 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) 1.19166 0.02834 42.049 < 2e-16 *** ## md -0.12276 0.04503 -2.726 0.00649 ** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## (Dispersion parameter for Gamma family taken to be 0.4971405) ## ## Null deviance: 430.00 on 1342 degrees of freedom ## Residual deviance: 426.45 on 1341 degrees of freedom ## AIC: 2291.1 ## ## Number of Fisher Scoring iterations: 3 cat("\n =============== The ANOVA test =============== \n\n") ## ## =============== The ANOVA test =============== anova(gm, test = "LRT") ## Analysis of Deviance Table ## ## Model: Gamma, link: identity ## ## Response: logFC ## ## Terms added sequentially (first to last) ## ## ## Df Deviance Resid. Df Resid. Dev Pr(>Chi) ## NULL 1342 430.00 ## md 1 3.5479 1341 426.45 0.007552 ** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 #### GLM group II (“Middle” $$MD$$) versus group III (“High” $$MD$$) (II) Only genes with $$MD < 1$$ are analyzed. That is, only two groups are considered: 0:“Low” ($$0 < MD \leq 0.5$$), 1:“Middle” ($$0.5 < MD \leq 1$$). dt <- xy[xy$MD > meth_cut1, ]
dt$md <- 0 dt$md[ dt\$MD > meth_cut2   ] <- 1

gm <- glm(formula = logFC ~ md, data = dt, family = Gamma(link = "identity"))
summary( gm )
##
## Call:
## glm(formula = logFC ~ md, family = Gamma(link = "identity"),
##     data = dt)
##
## Deviance Residuals:
##     Min       1Q   Median       3Q      Max
## -0.6729  -0.4812  -0.2722   0.1356   2.9756
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  1.06890    0.03514  30.419   <2e-16 ***
## md          -0.08433    0.05084  -1.659   0.0976 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for Gamma family taken to be 0.5014626)
##
##     Null deviance: 238.85  on 823  degrees of freedom
## Residual deviance: 237.48  on 822  degrees of freedom
## AIC: 1171.9
##
## Number of Fisher Scoring iterations: 3
cat("\n =============== The ANOVA test  =============== \n\n")
##
##  =============== The ANOVA test  ===============
anova(gm, test = "LRT")
## Analysis of Deviance Table
##
##
## Response: logFC
##
## Terms added sequentially (first to last)
##
##
##      Df Deviance Resid. Df Resid. Dev Pr(>Chi)
## NULL                   823     238.85
## md    1   1.3641       822     237.48  0.09908 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

#### Boxplot (II)

The boxplot suggest a clear trend of $$-logFC$$ decreasing with increment of MD levels. Notice that the notchs from the boxes do not overlap, which implies that the differences between medians from three groups (“Low”, “Middle”, and “High”) are statistically significant. So, the GLM results can be visualized.

par(bty = "n", family = "serif", font = 2, cex = 1, lwd = 0.1)
boxplot(logFC ~ md, data = xy, ylim = c(0.4, 2.5), xaxt ="n",  yaxt = "n", ann = FALSE)
rect(0.5, 0.4, 3.5, 5, col = "grey94", lty = 0, lwd = 1)
grid(0, 5, col = "white", lwd = 2, lty = 1)
boxplot(logFC ~ md, data = xy, ylim = c(0.4, 2.5), notch = TRUE, xaxt ="n",
yaxt = "n", ann = FALSE, pch = 20, medcol = "darkblue", whisklwd = 1, staplelwd = 1,
col = c("deepskyblue", "dodgerblue", "dodgerblue3" ), outcol = "red", add  = TRUE)
axis(1, padj = -0.6, at = 1:3, lwd = 0.5, tck = -0.02, line = -0.6, cex.axis = 1.3,
labels = c("Low", "Middle", "High"))
axis(2, hadj = 0.6, las = 1, lwd = 0.5, tck = -0.02, line = -1, cex.axis = 1.3)
mtext(side = 1, text = expression(paste(italic("MD"))),
line = 1.3, cex = 1.4, font = 3)
mtext(side = 2, text = expression(italic(-logFC)), line = 1, cex = 1.4, font = 3)
text(x = 1.5, y = 2.2, "**", pos = 3, cex = 1.2 )

## References

1.
2.
A. Sklar, Fonctions de répartition à n dimensions et leurs marges. Publ. Inst. Statist. Univ. Paris 8, 229–231 (1959).
3.
A. Sklar, Random Variables, Joint Distribution Functions, and Copulas. Kybernetika 9, 449–460 (1973).
4.
G. Christian, A. Favre, Everything You Always Wanted to Know about Copula Modeling but Were Afraid to Ask. J HYDROL ENG 12, 347–368 (2007).
5.
A. Agresti, An Introduction to Categorical Data Analysis (John Wiley & Sons, Inc., 2007).
6.
N. Balakrishna, C. D. Lai, Distributions Expressed as Copulas (Springer New York, 2009).

## Session info

Here is the output of sessionInfo() on the system on which this document was compiled running pandoc 2.19.2:

## R version 4.2.2 Patched (2022-11-10 r83330)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 22.04.1 LTS
##
## Matrix products: default
## BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.10.0
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0
##
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] stats4    stats     graphics  grDevices utils     datasets  methods
## [8] base
##
## other attached packages:
##  [1] coin_1.4-2           survival_3.4-0       copula_1.1-1
##  [4] plot3D_1.4           usefr_0.1.0          edgeR_3.38.4
##  [7] limma_3.52.4         ggplot2_3.4.0        GenomicRanges_1.48.0
## [10] GenomeInfoDb_1.32.4  IRanges_2.30.1       S4Vectors_0.34.0
## [13] BiocGenerics_0.42.0
##
## loaded via a namespace (and not attached):
##  [1] bitops_1.0-7           matrixStats_0.63.0     fs_1.5.2
##  [4] rprojroot_2.0.3        numDeriv_2016.8-1.1    tools_4.2.2
##  [7] bslib_0.4.1            utf8_1.2.2             R6_2.5.1
## [10] DBI_1.1.3              colorspace_2.0-3       withr_2.5.0
## [13] tidyselect_1.2.0       compiler_4.2.2         pspline_1.0-19
## [16] textshaping_0.3.6      cli_3.4.1              sandwich_3.0-2
## [19] desc_1.4.2             sass_0.4.4             scales_1.2.1
## [22] mvtnorm_1.1-3          pkgdown_2.0.6          systemfonts_1.0.4
## [25] stringr_1.5.0          digest_0.6.30          rmarkdown_2.18
## [28] XVector_0.36.0         pkgconfig_2.0.3        htmltools_0.5.3
## [31] highr_0.9              fastmap_1.1.0          stabledist_0.7-1
## [37] jquerylib_0.1.4        generics_0.1.3         zoo_1.8-11
## [40] jsonlite_1.8.3         mixdist_0.5-5          mclust_6.0.0
## [43] BiocParallel_1.30.4    dplyr_1.0.10           RCurl_1.98-1.9
## [46] magrittr_2.0.3         nls2_0.3-3             modeltools_0.2-23
## [49] GenomeInfoDbData_1.2.8 Matrix_1.5-3           Rcpp_1.0.9
## [52] munsell_0.5.0          fansi_1.0.3            proto_1.0.0
## [55] lifecycle_1.0.3        multcomp_1.4-20        stringi_1.7.8
## [58] yaml_2.3.6             MASS_7.3-58.1          zlibbioc_1.42.0
## [61] grid_4.2.2             misc3d_0.9-1           parallel_4.2.2
## [64] lattice_0.20-45        splines_4.2.2          locfit_1.5-9.6
## [67] knitr_1.41             pillar_1.8.1           tcltk_4.2.2
## [70] cubature_2.0.4.5       codetools_0.2-18       glue_1.6.2
## [73] evaluate_0.18          vctrs_0.5.1            gtable_0.3.1
## [76] purrr_0.3.5            assertthat_0.2.1       cachem_1.0.6
## [79] xfun_0.35              libcoin_1.0-9          ragg_1.2.4
## [82] pcaPP_2.0-3            gsl_2.1-7.1            minpack.lm_1.2-2
## [85] tibble_3.1.8           memoise_2.0.1          TH.data_1.1-1