mcgoftest {usefr} R Documentation

Bootstrap test for Goodness of fit (GoF)

Description

Different optimization algorithms can be used to accomplish the nonlinear fit of a probability distribution function (PDF), and each algorithm will return a different set of estimated parameter values. AIC and BIC are not useful in this case for deciding which set of parameter values is the best. Goodness-of-fit (GoF) tests can help in this case.

Usage

mcgoftest(varobj, distr, pars, num.sampl = 999, sample.size,
stat = c("ks", "ad", "rmst", "chisq"), breaks = NULL,
parametric = TRUE, seed = 1, num.cores = 1, tasks = 0)

Arguments

varobj
    A vector containing observations, the variable for which the CDF parameters were estimated.

distr
    The name of the cumulative distribution function (CDF), or a concrete CDF, from which the cumulative probabilities are estimated. The distribution distr must be defined in the namespace of a package or in an environment defined by the user.

pars
    CDF model parameters. A list of parameters used to evaluate the CDF.

num.sampl
    Number of resamplings.

sample.size
    Size of the samples used for each sampling.

stat
    One string denoting the statistic to be used in the test: "ks": Kolmogorov–Smirnov, "ad": Anderson–Darling statistic, "chisq": Pearson's Chi-squared, and "rmst": Root Mean Square statistic.

breaks
    Default is NULL. Basically the same as in function hist. If breaks = NULL, then function 'nclass.FD' (see nclass) is applied to estimate the breaks.

parametric
    Logical object. If TRUE, then samples are drawn from the theoretical population described by distr. Default: TRUE.

seed
    An integer used to set a 'seed' for random number generation.

num.cores, tasks
    Parameters for parallel computation using package BiocParallel-package: the number of cores to use, i.e. at most how many child processes will be run simultaneously (see bplapply), and the number of tasks per job (only for Linux OS).
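The way distr and pars fit together can be illustrated with a small sketch. It assumes that the CDF is called with the observations as first argument followed by the parameters; the helper eval_cdf and the user-defined CDF pexp2 are hypothetical names introduced here, and the internals of mcgoftest may differ.

```r
# Hypothetical helper (not part of usefr): evaluate a CDF given either its
# name or the function itself, plus a vector or list of parameters.
eval_cdf <- function(q, distr, pars) {
    cdf <- match.fun(distr)  # accepts a function or the name of a function
    do.call(cdf, c(list(q), as.list(pars)))
}

# A user-defined CDF (exponential), defined in the calling environment
pexp2 <- function(q, rate) 1 - exp(-rate * q)

# CDF given by name, with named parameters
p1 <- eval_cdf(1, "pweibull", c(shape = 0.5, scale = 1.2))
# CDF given as a function, with positional parameters
p2 <- eval_cdf(1, pweibull, list(0.5, 1.2))
# User-defined CDF
p3 <- eval_cdf(2, pexp2, list(rate = 0.5))
```

Naming the elements of pars avoids relying on the argument order of the CDF.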

Details

The test is intended for continuous distributions. If the sampling size is smaller than the size of the sample, then the test becomes a Monte Carlo test. The test is based on the use of goodness-of-fit measures (statistics). The following statistics are available:

• Kolmogorov–Smirnov statistic (ks). Limitation: sensitive to ties. Only the parametric Monte Carlo resampling can be used (provided that there are no ties in the data).

• Anderson–Darling statistic (ad). Limitation: by construction, it depends on the sample size. So, the size of the sampling must be close to the sample size if Monte Carlo resampling is used, which could be a limitation if the sample size is too large. In particular, this could be an issue in some genomic applications. It is worth highlighting that, for the current application, the Anderson–Darling statistic is not standardized as is typically done when testing GoF for the normal distribution with the Anderson–Darling test. Standardization is not required, since the statistic is not compared with a corresponding theoretical value. In addition, since the computation of this statistic requires the data to be put in order, it does not make sense to perform a permutation test. That is, the maximum sampling size is the sample size minus 1.

• Pearson's Chi-squared statistic (chisq). Limitation: the sample must be discretized (partitioned into bins), which could be a source of bias that leads to the rejection of the null hypothesis. Here, the discretization is done using the resources from function hist.

• Root Mean Square statistic (rmst). Limitation: the same as 'chisq'.
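The statistics listed above can be sketched in base R as follows. This is a minimal illustration, not the code used inside mcgoftest: the helper names ks_stat, ad_stat and binned_stats are hypothetical, and the root mean square statistic is assumed here to be the root mean square difference between observed and expected relative bin frequencies.

```r
# Hypothetical helpers; names and definitions are illustrative assumptions.

# Kolmogorov-Smirnov distance between the empirical CDF and the CDF `cdf`
ks_stat <- function(x, cdf, ...) {
    n <- length(x)
    i <- seq_len(n)
    Fx <- cdf(sort(x), ...)
    max(pmax(i / n - Fx, Fx - (i - 1) / n))
}

# Unstandardized Anderson-Darling statistic A^2 (requires ordered data)
ad_stat <- function(x, cdf, ...) {
    n <- length(x)
    i <- seq_len(n)
    Fx <- cdf(sort(x), ...)
    -n - mean((2 * i - 1) * (log(Fx) + log1p(-rev(Fx))))
}

# Binned statistics: bins from hist(); breaks default to nclass.FD
binned_stats <- function(x, cdf, ..., breaks = NULL) {
    if (is.null(breaks)) breaks <- grDevices::nclass.FD(x)
    h <- hist(x, breaks = breaks, plot = FALSE)
    n <- length(x)
    expected <- n * diff(cdf(h$breaks, ...))
    observed <- h$counts
    keep <- expected > 0  # skip bins with zero expected count
    c(chisq = sum((observed[keep] - expected[keep])^2 / expected[keep]),
      rmst = sqrt(mean((observed / n - expected / n)^2)))
}

set.seed(1)
x <- rnorm(1000, mean = 1.5, sd = 2)
ks <- ks_stat(x, pnorm, mean = 1.5, sd = 2)
ad <- ad_stat(x, pnorm, mean = 1.5, sd = 2)
bs <- binned_stats(x, pnorm, mean = 1.5, sd = 2)
```

The two binned statistics depend on the choice of breaks, which is the source of bias mentioned for 'chisq' and 'rmst'.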

Value

A numeric vector with the following data:

1. Statistic value.

2. mc_p.value: the probability of finding the observed, or more extreme, results when the null hypothesis H_0 of a study question is true, obtained by a Monte Carlo resampling approach.
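A Monte Carlo p-value of this kind can be sketched as follows. The sketch assumes parametric resampling, the Kolmogorov–Smirnov statistic, simulated samples of the same size as the data, and the common estimator (1 + number of simulated statistics at least as large as the observed one) / (number of resamplings + 1). The variable names are hypothetical, and mcgoftest itself may compute the value differently.

```r
# Hypothetical sketch of a parametric Monte Carlo p-value (KS statistic).
set.seed(1)
x <- rweibull(500, shape = 0.5, scale = 1.2)
pars <- list(shape = 0.5, scale = 1.2)
num_sampl <- 199

# KS distance between the empirical CDF of x and Weibull(pars)
ks_stat <- function(x, pars) {
    n <- length(x)
    i <- seq_len(n)
    Fx <- do.call(pweibull, c(list(sort(x)), pars))
    max(pmax(i / n - Fx, Fx - (i - 1) / n))
}

# Statistic on the observed data
stat_obs <- ks_stat(x, pars)

# Statistics on samples drawn from the theoretical population
stat_sim <- replicate(num_sampl, {
    y <- do.call(rweibull, c(list(length(x)), pars))
    ks_stat(y, pars)
})

# Share of simulated statistics at least as extreme as the observed one
mc_p_value <- (1 + sum(stat_sim >= stat_obs)) / (num_sampl + 1)
res <- c(statistic = stat_obs, mc_p.value = mc_p_value)
```

With this estimator the smallest attainable p-value is 1 / (num_sampl + 1), so the number of resamplings bounds the resolution of the test.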

Author(s)

Robersy Sanchez (https://genomaths.com).

References

1. Feller, W. On the Kolmogorov-Smirnov Limit Theorems for Empirical Distributions. Ann. Math. Stat. 19, 177–189 (1948).

2. Anderson, T. W. & Darling, D. A. A Test of Goodness of Fit. J. Am. Stat. Assoc. 49, 765–769 (1954).

3. Watson, G. S. On Chi-Square Goodness-Of-Fit Tests for Continuous Distributions. J. R. Stat. Soc. Ser. B Stat. Methodol. 20, 44–72 (1958).

See Also

Distribution fitting: fitMixDist, fitdistr, fitCDF.

Examples

# Example 1
# Let us generate a random sample from a specified Weibull distribution:
# Set a seed
set.seed(1)
# Random sample from Weibull( x | shape = 0.5, scale = 1.2 )
x <- rweibull(10000, shape = 0.5, scale = 1.2)

# The MC KS test accepts the null hypothesis that variable x comes
# from Weibull(x | shape = 0.5, scale = 1.2), while the standard
# Kolmogorov-Smirnov test rejects the null hypothesis.
mcgoftest(x, distr = pweibull, pars = c(0.5, 1.2), num.sampl = 500,
          sample.size = 1000, num.cores = 4)

# Example 2
# Let us generate a random sample from a specified Normal
# distribution:
# Set a seed
set.seed(1)
x <- rnorm(10000, mean = 1.5, sd = 2)

# The MC KS test accepts the null hypothesis that variable x comes
# from N(x | mean = 1.5, sd = 2), while the standard
# Kolmogorov-Smirnov test rejects the null hypothesis.
mcgoftest(x, distr = pnorm, pars = c(1.5, 2), num.sampl = 500,
          sample.size = 1000, num.cores = 1)
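For comparison, the standard Kolmogorov-Smirnov test mentioned in the comments can be run on the same kind of data with ks.test from package stats. This is only a sketch for checking the result on a given sample; the outcome depends on the sample drawn, and no particular output is implied here.

```r
# Standard one-sample KS test against N(mean = 1.5, sd = 2)
set.seed(1)
x <- rnorm(10000, mean = 1.5, sd = 2)
res <- ks.test(x, "pnorm", mean = 1.5, sd = 2)
res$statistic  # KS distance D
res$p.value    # p-value from the theoretical distribution of D
```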

[Package usefr version 0.1.0 ]