Parametric Bootstrap of \(n x m\) Contingency independence test. It is assumed that the provided contingency table summarizes the results where two or more groups are compared and the outcome is a categorical variable, i.e., disease vs. no disease, pass vs. fail, treated patient vs. control group, etc.

The goodness of fit statistic is the root-mean-square statistic (RMST) or Hellinger divergence, as proposed by Perkins et al. [1, 2]. Hellinger divergence (HD) is computed as proposed in [3].

tableBoots(
  x,
  stat = c("rmst", "hd", "chisq", "all"),
  out.stat = FALSE,
  num.permut = 100
)

Arguments

x

A numerical matrix corresponding to cross tabulation \(n x m\) table (contingency table).

stat

Statistic to be used in the testing: 'rmst','hdiv', or 'all'.

out.stat

logical(1). Whether to return the values of the statistics used: the bootstrap mean and the original value estimated.

num.permut

Number of permutations.

Value

A p-value probability

Details

For goodness-of-fit the following null hypothesis is tested \(H_\theta: p = p(\theta)\) To conduct a single simulation, we perform the following three-step procedure [1,2]:

  1. To generate m i.i.d. draws according to the model distribution \(p(\theta)\), where \(\theta'\) is the estimate calculated from the experimental data,

  2. To estimate the parameter \(\theta\) from the data generated in Step 1, obtaining a new estimate \(\theta\)est.

  3. To calculate the statistic under consideration (HD, RMST), using the data generated in Step 1 and taking the model distribution to be \(\theta\)est, where \(\theta\)est is the estimate calculated in Step 2 from the data generated in Step 1.

After conducting many such simulations, the confidence level for rejecting the null hypothesis is the fraction of the statistics calculated in step 3 that are less than the statistic calculated from the empirical data. The significance level \(\alpha\) is the same as a confidence level of \(1-\alpha\).

References

  1. Perkins W, Tygert M, Ward R. Chi^2 and Classical Exact Tests Often Wildly Misreport Significance; the Remedy Lies in Computers [Internet]. Uploaded to ArXiv. 2011. Report No.: arXiv:1108.4126v2.

  2. Perkins, W., Tygert, M. & Ward, R. Computing the confidence levels or a root-mean square test of goodness-of-fit. 217, 9072-9084 (2011).

  3. Basu, A., Mandal, A. & Pardo, L. Hypothesis testing for two discrete populations based on the Hellinger distance. Stat. Probab. Lett. 80, 206-214 (2010).

Author

Robersy Sanchez (https://genomaths.com).

Examples

set.seed(123)
## The numbers of methylated and unmethylted "read-counts" targeting DNA
## cytosine sites are the typical raw data resulting from bisulfate
## sequencing experiments (BiSeq). Methylation analysis requires for the
## application of some statistical tests to identify differentially
## methylated sites/ positions (DMSs or DMPs). BiSeq experiments on humans
## are relatively expensive (still in 2020) and some researcher is willing
## to accept a site coverage (total) of 4 read-counts as threshold for a
## site to be included in the analysis. Let's suppose that at a given site
## we have the following contingency table:

site_res <- matrix(c(3, 1, 1, 3),
    nrow = 2,
    dimnames = list(
        status = c("ctrl", "treat"),
        treat = c("meth", "unmeth")
    )
)
site_res
#>        treat
#> status  meth unmeth
#>   ctrl     3      1
#>   treat    1      3

## The application of classical test variants yield the following results
fisher.test(site_res)$p.value
#> [1] 0.4857143
fisher.test(site_res, alternative = "greater")$p.value
#> [1] 0.2428571
fisher.test(site_res, simulate.p.value = TRUE)$p.value
#> [1] 0.4857143
chisq.test(site_res)$p.value
#> [1] 0.4795001
chisq.test(site_res, simulate.p.value = TRUE)$p.value
#> [1] 0.4832584

## That is, these test did not detect differences between the read-counts
## for control and the treated patients. 'tableBoots' function suggests
## that the site is borderline.

tableBoots(site_res, stat = "all", num.permut = 1e3)
#>  rmst.p.value  hdiv.p.value chisq.p.value 
#>    0.09390609    0.04095904    0.14585415 

## A slight change in the read counts identify a meaningful difference
## not detected by the classicla tests
site_res <- matrix(c(3, 0, 1, 4),
    nrow = 2,
    dimnames = list(
        status = c("ctrl", "treat"),
        treat = c("meth", "unmeth")
    )
)
site_res
#>        treat
#> status  meth unmeth
#>   ctrl     3      1
#>   treat    0      4

set.seed(23)
tableBoots(site_res, stat = "all", num.permut = 1e3)
#>  rmst.p.value  hdiv.p.value chisq.p.value 
#>   0.007992008   0.008991009   0.010989011 
fisher.test(site_res)$p.value
#> [1] 0.1428571
fisher.test(site_res, alternative = "greater")$p.value
#> [1] 0.07142857
fisher.test(site_res, simulate.p.value = TRUE)$p.value
#> [1] 0.1428571
chisq.test(site_res)$p.value
#> [1] 0.144127
chisq.test(site_res, simulate.p.value = TRUE)$p.value
#> [1] 0.1324338

## Now, let's suppose that we want to test whether methylation status from
## two DNA sites the same differentially methylated region (DMR)
## are independent, i.e. whether the treatment affected the methylation
## status of the two sites. In this case, the counts grouped into four
## categories.
set.seed(1)
site_res <- matrix(c(3, 1, 1, 3, 3, 0, 1, 4),
    nrow = 2, byrow = FALSE,
    dimnames = list(
        status = c("ctrl", "treat"),
        treat = c(
            "meth.1", "unmeth.1",
            "meth.2", "unmeth.2"
        )
    )
)
site_res
#>        treat
#> status  meth.1 unmeth.1 meth.2 unmeth.2
#>   ctrl       3        1      3        1
#>   treat      1        3      0        4

## Chi-squared from the R package 'stats'
chisq.test(site_res)$p.bvalue
#> NULL
chisq.test(site_res, simulate.p.value = TRUE, B = 2e3)$p.value
#> [1] 0.1244378

tableBoots(site_res, stat = "all", num.permut = 999)
#>  rmst.p.value  hdiv.p.value chisq.p.value 
#>         0.065         0.087         0.068 

## Results above are in border. If we include, third site,
## then sinces would different.
site_res <- matrix(c(3, 1, 1, 3, 3, 0, 1, 4, 4, 0, 1, 4),
    nrow = 2, byrow = FALSE,
    dimnames = list(
        status = c("ctrl", "treat"),
        treat = c(
            "meth.1", "unmeth.1",
            "meth.2", "unmeth.2",
            "meth.3", "unmeth.3"
        )
    )
)
site_res
#>        treat
#> status  meth.1 unmeth.1 meth.2 unmeth.2 meth.3 unmeth.3
#>   ctrl       3        1      3        1      4        1
#>   treat      1        3      0        4      0        4

## That is, we have not reason to believe that the observed methylation
## levels in the treatment are not independent
chisq.test(site_res)$p.value
#> [1] 0.02764776
chisq.test(site_res, simulate.p.value = TRUE, B = 2e3)$p.value
#> [1] 0.017991
tableBoots(site_res, stat = "all", num.permut = 999)
#>  rmst.p.value  hdiv.p.value chisq.p.value 
#>         0.014         0.016         0.010