R: Bootstrap test for Goodness of fit (GoF)

mcgoftest {usefr}

R Documentation

Bootstrap test for Goodness of fit (GoF)

Description

To accomplish the nonlinear fit of a probability distribution function (*PDF*), different optimization algorithms can be used, and each algorithm may return a different set of estimated parameter values. AIC and BIC are not useful (in this case) for deciding which set of parameter values is the best. Goodness-of-fit (GoF) tests can help in this case.

Usage

mcgoftest(varobj, distr, pars, num.sampl = 999, sample.size,
  stat = c("ks", "ad", "rmst", "chisq"), breaks = NULL,
  parametric = TRUE, seed = 1, num.cores = 1, tasks = 0)

Arguments

`varobj`	A vector containing observations, the variable for which the CDF parameters were estimated.
`distr`	The name of the cumulative distribution function (CDF), or a concrete CDF, from which the cumulative probabilities are estimated. The distribution distr must be defined in the namespace of some package or in an environment defined by the user.
`pars`	CDF model parameters. A list of parameters to evaluate the CDF.
`num.sampl`	Number of resamplings.
`sample.size`	Size of the samples used for each sampling.
`stat`	One string denoting the statistic to be used in the test: "ks": Kolmogorov–Smirnov, "ad": Anderson–Darling statistic, "chisq": Pearson's Chi-squared, and "rmst": Root Mean Square statistic.
`breaks`	Default is NULL. Basically, it is the same as in function `hist`. If breaks = NULL, then function 'nclass.FD' (see `nclass`) is applied to estimate the breaks.
`parametric`	Logical object. If TRUE, then samples are drawn from the theoretical population described by distr. Default: TRUE.
`seed`	An integer used to set a 'seed' for random number generation.
`num.cores, tasks`	Parameters for parallel computation using package `BiocParallel-package`: the number of cores to use, i.e. at most how many child processes will be run simultaneously (see `bplapply`), and the number of tasks per job (only for Linux OS).

Details

The test is intended for continuous distributions. If the sampling size is smaller than the size of the sample, then the test becomes a Monte Carlo test. The test is based on the use of measures of goodness of fit (statistics). The following statistics are available:

Kolmogorov–Smirnov statistic (ks). Limitations: sensitive to ties [1]. Only the parametric Monte Carlo resampling can be used, provided that there are no ties in the data.
Anderson–Darling statistic (ad) [2]. Limitation: by construction, it depends on the sample size. So, the sampling size must be close to the sample size if Monte Carlo resampling is used, which could be a limitation if the sample size is too large [2]. In particular, this could be an issue in some genomic applications. It is worth highlighting that, for the current application, the Anderson–Darling statistic is not standardized as is typically done when testing GoF for the normal distribution with the Anderson–Darling test. Standardization is not required, since the statistic is not compared with a corresponding theoretical value. In addition, since the computation of this statistic requires the data to be sorted [2], it does not make sense to perform a permutation test. That is, the maximum sampling size is the sample size less 1.
Pearson's Chi-squared statistic (chisq). Limitation: the sample must be discretized (partitioned into bins), which could be a source of bias that leads to the rejection of the null hypothesis. Here, the discretization is done using the resources from function hist.
Root Mean Square statistic (rmst). Limitation: the same as 'chisq'.
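
The discretization step behind the binned statistics can be illustrated with base R. This is a minimal sketch and not the usefr implementation: it assumes that a NULL `breaks` resolves to the class count from `nclass.FD`, which is then passed to `hist`, and it computes only Pearson's statistic against expected counts derived from the CDF. The normal sample and its parameters are illustrative.

```r
set.seed(1)
x <- rnorm(1000, mean = 1.5, sd = 2)

# Assumption: the Freedman-Diaconis class count is used as the suggested
# number of bins; hist() may adjust it to obtain "pretty" break points.
h <- hist(x, breaks = nclass.FD(x), plot = FALSE)

# Probability of each bin under the hypothesized CDF, renormalized so that
# the expected counts sum to the sample size (mass outside the outer
# breaks is ignored).
p_bin <- diff(pnorm(h$breaks, mean = 1.5, sd = 2))
p_bin <- p_bin / sum(p_bin)
expected <- length(x) * p_bin

# Pearson's Chi-squared statistic on the binned data
chisq <- sum((h$counts - expected)^2 / expected)
```

A different choice of breaks changes the value of `chisq`, which is the source of bias mentioned above.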

Value

A numeric vector with the following data:

Statistic value.
mc_p.value: the probability of finding the observed, or more extreme, results when the null hypothesis H_0 of a study question is true, obtained by the Monte Carlo resampling approach.
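
How a Monte Carlo p-value of this kind can be obtained is sketched below for the Kolmogorov–Smirnov statistic with parametric resampling. The function `mc_ks_sketch` and its argument `rfun` are hypothetical names introduced for illustration; this is not the code of `mcgoftest`. The sketch assumes that each simulated sample has the same size as the observed one and uses the common (1 + count) / (num.sampl + 1) convention, which may differ from what the package does.

```r
# Hypothetical helper, not part of usefr.
# x: observations; cdf: CDF function; rfun: matching random generator;
# pars: named list of distribution parameters.
mc_ks_sketch <- function(x, cdf, rfun, pars, num.sampl = 199) {
    # Two-sided KS distance between the empirical CDF and the model CDF
    ks_stat <- function(v) {
        n <- length(v)
        Fx <- do.call(cdf, c(list(sort(v)), pars))
        i <- seq_len(n)
        max(pmax(i / n - Fx, Fx - (i - 1) / n))
    }
    stat_obs <- ks_stat(x)
    # Parametric resampling: draw from the hypothesized distribution
    stat_mc <- replicate(
        num.sampl,
        ks_stat(do.call(rfun, c(list(length(x)), pars)))
    )
    # Proportion of simulated statistics at least as extreme as observed
    p <- (1 + sum(stat_mc >= stat_obs)) / (num.sampl + 1)
    c(KS.stat = stat_obs, mc_p.value = p)
}

set.seed(1)
x <- rweibull(500, shape = 0.5, scale = 1.2)
res <- mc_ks_sketch(x, pweibull, rweibull, list(shape = 0.5, scale = 1.2))
```

Because the observed statistic is compared only with simulated values of the same statistic, no theoretical null distribution is needed.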

Author(s)

Robersy Sanchez (https://genomaths.com).

References

1. Feller, W. On the Kolmogorov-Smirnov Limit Theorems for Empirical Distributions. Ann. Math. Stat. 19, 177–189 (1948).
2. Anderson, T. W. & Darling, D. A. A Test Of Goodness Of Fit. J. Am. Stat. Assoc. 49, 765–769 (1954).
3. Watson, G. S. On Chi-Square Goodness-Of-Fit Tests for Continuous Distributions. J. R. Stat. Soc. Ser. B Stat. Methodol. 20, 44–72 (1958).

Examples

# Example 1
# Let us generate a random sample from a specified Weibull distribution:
# Set a seed
set.seed( 1 )
# Random sample from Weibull( x | shape = 0.5, scale = 1.2 )
x = rweibull(10000, shape = 0.5, scale = 1.2)

# The MC KS test accepts the null hypothesis that variable x comes
# from Weibull(x | shape = 0.5, scale = 1.2), while the standard
# Kolmogorov-Smirnov test rejects the null hypothesis.
mcgoftest(x, distr = pweibull, pars = c( 0.5, 1.2 ), num.sampl = 500,
        sample.size = 1000, num.cores = 4)

# Example 2
# Let us generate a random sample from a specified Normal
# distribution:
# Set a seed
set.seed( 1 )
x = rnorm(10000, mean = 1.5, sd = 2)

# The MC KS test accepts the null hypothesis that variable x comes
# from N(x | mean = 1.5, sd = 2), while the standard
# Kolmogorov-Smirnov test rejects the null hypothesis.
mcgoftest(x, distr = pnorm, pars = c(1.5, 2), num.sampl = 500,
          sample.size = 1000, num.cores = 1)
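
The comments above refer to the standard Kolmogorov–Smirnov test. For comparison, it can be run on the sample from Example 1 with `ks.test` from the stats package, as in the sketch below. No particular output is claimed here, since the result depends on the generated sample.

```r
# Standard one-sample KS test against the same Weibull model as Example 1
set.seed(1)
x <- rweibull(10000, shape = 0.5, scale = 1.2)
ks <- ks.test(x, "pweibull", shape = 0.5, scale = 1.2)

# Observed KS distance and its p-value from the theoretical distribution
ks$statistic
ks$p.value
```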

[Package usefr version 0.1.0 ]