tableBoots — tableBoots • usefr

Parametric Bootstrap of \(n x m\) Contingency independence test. It is assumed that the provided contingency table summarizes the results where two or more groups are compared and the outcome is a categorical variable, i.e., disease vs. no disease, pass vs. fail, treated patient vs. control group, etc.

The goodness of fit statistic is the root-mean-square statistic (RMST) or Hellinger divergence, as proposed by Perkins et al. [1, 2]. Hellinger divergence (HD) is computed as proposed in [3].

tableBoots(
  x,
  stat = c("rmst", "hd", "chisq", "all"),
  out.stat = FALSE,
  num.permut = 100
)

Arguments

x: A numerical matrix corresponding to cross tabulation \(n x m\) table (contingency table).
stat: Statistic to be used in the testing: 'rmst','hdiv', or 'all'.
out.stat: logical(1). Whether to return the values of the statistics used: the bootstrap mean and the original value estimated.
num.permut: Number of permutations.

Value

A p-value probability

Details

For goodness-of-fit the following null hypothesis is tested \(H_\theta: p = p(\theta)\) To conduct a single simulation, we perform the following three-step procedure [1,2]:

To generate m i.i.d. draws according to the model distribution \(p(\theta)\), where \(\theta'\) is the estimate calculated from the experimental data,
To estimate the parameter \(\theta\) from the data generated in Step 1, obtaining a new estimate \(\theta\)est.
To calculate the statistic under consideration (HD, RMST), using the data generated in Step 1 and taking the model distribution to be \(\theta\)est, where \(\theta\)est is the estimate calculated in Step 2 from the data generated in Step 1.

After conducting many such simulations, the confidence level for rejecting the null hypothesis is the fraction of the statistics calculated in step 3 that are less than the statistic calculated from the empirical data. The significance level \(\alpha\) is the same as a confidence level of \(1-\alpha\).

References

Perkins W, Tygert M, Ward R. Chi^2 and Classical Exact Tests Often Wildly Misreport Significance; the Remedy Lies in Computers [Internet]. Uploaded to ArXiv. 2011. Report No.: arXiv:1108.4126v2.
Perkins, W., Tygert, M. & Ward, R. Computing the confidence levels or a root-mean square test of goodness-of-fit. 217, 9072-9084 (2011).
Basu, A., Mandal, A. & Pardo, L. Hypothesis testing for two discrete populations based on the Hellinger distance. Stat. Probab. Lett. 80, 206-214 (2010).

Author

Robersy Sanchez (https://genomaths.com).

Examples

set.seed(123)
## The numbers of methylated and unmethylted "read-counts" targeting DNA
## cytosine sites are the typical raw data resulting from bisulfate
## sequencing experiments (BiSeq). Methylation analysis requires for the
## application of some statistical tests to identify differentially
## methylated sites/ positions (DMSs or DMPs). BiSeq experiments on humans
## are relatively expensive (still in 2020) and some researcher is willing
## to accept a site coverage (total) of 4 read-counts as threshold for a
## site to be included in the analysis. Let's suppose that at a given site
## we have the following contingency table:

site_res <- matrix(c(3, 1, 1, 3),
    nrow = 2,
    dimnames = list(
        status = c("ctrl", "treat"),
        treat = c("meth", "unmeth")
    )
)
site_res
#>        treat
#> status  meth unmeth
#>   ctrl     3      1
#>   treat    1      3

## The application of classical test variants yield the following results
fisher.test(site_res)$p.value
#> [1] 0.4857143
fisher.test(site_res, alternative = "greater")$p.value
#> [1] 0.2428571
fisher.test(site_res, simulate.p.value = TRUE)$p.value
#> [1] 0.4857143
chisq.test(site_res)$p.value
#> [1] 0.4795001
chisq.test(site_res, simulate.p.value = TRUE)$p.value
#> [1] 0.4832584

## That is, these test did not detect differences between the read-counts
## for control and the treated patients. 'tableBoots' function suggests
## that the site is borderline.

tableBoots(site_res, stat = "all", num.permut = 1e3)
#>  rmst.p.value  hdiv.p.value chisq.p.value 
#>    0.09390609    0.04095904    0.14585415 

## A slight change in the read counts identify a meaningful difference
## not detected by the classicla tests
site_res <- matrix(c(3, 0, 1, 4),
    nrow = 2,
    dimnames = list(
        status = c("ctrl", "treat"),
        treat = c("meth", "unmeth")
    )
)
site_res
#>        treat
#> status  meth unmeth
#>   ctrl     3      1
#>   treat    0      4

set.seed(23)
tableBoots(site_res, stat = "all", num.permut = 1e3)
#>  rmst.p.value  hdiv.p.value chisq.p.value 
#>   0.007992008   0.008991009   0.010989011 
fisher.test(site_res)$p.value
#> [1] 0.1428571
fisher.test(site_res, alternative = "greater")$p.value
#> [1] 0.07142857
fisher.test(site_res, simulate.p.value = TRUE)$p.value
#> [1] 0.1428571
chisq.test(site_res)$p.value
#> [1] 0.144127
chisq.test(site_res, simulate.p.value = TRUE)$p.value
#> [1] 0.1324338

## Now, let's suppose that we want to test whether methylation status from
## two DNA sites the same differentially methylated region (DMR)
## are independent, i.e. whether the treatment affected the methylation
## status of the two sites. In this case, the counts grouped into four
## categories.
set.seed(1)
site_res <- matrix(c(3, 1, 1, 3, 3, 0, 1, 4),
    nrow = 2, byrow = FALSE,
    dimnames = list(
        status = c("ctrl", "treat"),
        treat = c(
            "meth.1", "unmeth.1",
            "meth.2", "unmeth.2"
        )
    )
)
site_res
#>        treat
#> status  meth.1 unmeth.1 meth.2 unmeth.2
#>   ctrl       3        1      3        1
#>   treat      1        3      0        4

## Chi-squared from the R package 'stats'
chisq.test(site_res)$p.bvalue
#> NULL
chisq.test(site_res, simulate.p.value = TRUE, B = 2e3)$p.value
#> [1] 0.1244378

tableBoots(site_res, stat = "all", num.permut = 999)
#>  rmst.p.value  hdiv.p.value chisq.p.value 
#>         0.065         0.087         0.068 

## Results above are in border. If we include, third site,
## then sinces would different.
site_res <- matrix(c(3, 1, 1, 3, 3, 0, 1, 4, 4, 0, 1, 4),
    nrow = 2, byrow = FALSE,
    dimnames = list(
        status = c("ctrl", "treat"),
        treat = c(
            "meth.1", "unmeth.1",
            "meth.2", "unmeth.2",
            "meth.3", "unmeth.3"
        )
    )
)
site_res
#>        treat
#> status  meth.1 unmeth.1 meth.2 unmeth.2 meth.3 unmeth.3
#>   ctrl       3        1      3        1      4        1
#>   treat      1        3      0        4      0        4

## That is, we have not reason to believe that the observed methylation
## levels in the treatment are not independent
chisq.test(site_res)$p.value
#> [1] 0.02764776
chisq.test(site_res, simulate.p.value = TRUE, B = 2e3)$p.value
#> [1] 0.017991
tableBoots(site_res, stat = "all", num.permut = 999)
#>  rmst.p.value  hdiv.p.value chisq.p.value 
#>         0.014         0.016         0.010