Autmorphism Probability

This function applies a Dirichlet-Multinomial Modelling (in Bayesian framework) to compute the posterior probability of each type of mutational event. DNA bases are classified based on the physicochemical criteria used to ordering the set of codons: number of hydrogen bonds (strong-weak, S-W), chemical type (purine-pyrimidine, Y-R), and chemical groups (amino versus keto, M-K) (see reference 4). Preserved codon positions are labeled with letter “H”.

As a result, mutational events are grouped by type of mutation covering all the possible combinations of symbols: "Y", "R", "M", "W", "K", "S", and "H", for example: "YSH", "RSH", "MSH", "WSH", "KSH", "SSH", "HSH", "YHH", "RHH", and so on. Insertion/deletion mutations are not considered.

Maximum Likelihood Estimation (MLE) for Dirichlet Parameters

Given the observed data $n = (n_1, ..., n_k)$, where $n_i$ is the frequency for the mutation type $i$, we want to estimate the parameters of the Dirichlet distribution, $\alpha = (\alpha_1, ..., \alpha_k)$, that maximize the marginal likelihood.

Marginal Likelihood

The marginal likelihood of the data under the Dirichlet-multinomial model is given by

$$P\left(n\middle|\alpha\right) = \frac{N!}{\prod_{i=1}^{k}n_i} \frac{\Gamma\left(\sum_{i=1}^{k}\alpha_i\right)}{\Gamma \left(N+\sum_{i=1}^{k}\alpha_i\right)}\prod_{i=1}^{k}\frac{\Gamma \left(n_i+\alpha_i\right)}{\Gamma\left(\alpha_i\right)}$$

where $N = \sum_{i=1}^{k}n_i$.

Optimization

To perform MLE, we maximize the log of this likelihood: $log(P\left(n\middle|\alpha\right))$. That is, we aim to maximize this log-likelihood with respect to $\alpha$. This is done numerically because there's no closed-form solution. Here, we use:

$$Arg\max_{\alpha} {\{log\left(P\left(n\middle|\alpha\right)\right\}}$$

with initial guess set as $\alpha_i = 1 / (n_i + 1)$.

Posterior Distribution

Let be $\theta$ the vector of probabilities for each mutation type. It's the parameter we're estimating, which represents the probabilities of observing each mutation type in the multinomial distribution. The conjugate distribution in this context refers to the Dirichlet distribution, which is the prior distribution for $\theta$. When we have observed data, the posterior distribution of $\theta$ given this data is also a Dirichlet distribution due to the conjugate property.

The prior distribution, before observing any data, our belief about the distribution of $\theta$ is given by:

$$\theta\sim\ Dirichlet\left(\alpha_1,\alpha_k,\ldots,\alpha_k\right)$$

where the $\alpha_i$ are known as initial concentration parameters, which represents our initial belief or assumption about the distribution of the mutation types. These parameters control how the probability mass is distributed among the mutation types. We might set these initial parameters based on some prior knowledge or simply to provide a non-informative prior.

Once we have the MLE for $\alpha$, the posterior distribution of $\theta$ given the data updates these parameters to incorporate the observed data:

$$\theta\mid\ n\sim\ Dirichlet\left(\alpha_1^\ast+n1,\alpha_k^\ast+n_2, \ldots,\alpha_k^\ast+n_k\right)$$

where $\alpha^\ast_i$ are the MLE estimates, i.e., $\alpha_i^\ast+n_i$ are our updated belief about the frequencies of mutation types in the population sample.

The expected values of $\theta_i$, given the data under the posterior Dirichlet distribution is:

$$E\left[\theta_i\right]=\frac{\alpha_i^*+n_i} {\sum_{j=1}^{k}\left(\alpha_j^*+n_j\right)}$$

These expected values directly corresponds to the "posterior probability" of observing mutation type $i$.

This approach provides a rigorous estimation of the Dirichlet parameters under the Dirichlet-multinomial model using MLE.

Usage

automorphism_prob(x, ...)

# S4 method for class 'AutomorphismByCoef'
automorphism_prob(
  x,
  initial_alpha = NULL,
  method = c("L-BFGS-B", "Nelder-Mead", "BFGS", "CG", "SANN", "Brent"),
  maxit = 500,
  abstol = 10^-8,
  ...
)

# S4 method for class 'AutomorphismByCoefList'
automorphism_prob(
  x,
  initial_alpha = NULL,
  method = c("L-BFGS-B", "Nelder-Mead", "BFGS", "CG", "SANN", "Brent"),
  maxit = 500,
  abstol = 10^-8
)

Arguments

x: An AutomorphismByCoefList-class object returned by function automorphism_bycoef.
...: Not in use yet.
initial_alpha: A vector of initial guess values for pseudo counts $\alpha_i$ (see Description). Default is: $\alpha_i = 1 / (n_i + 1)$, which corresponds to $initial_alpha = NULL$.
method, maxit, abstol: Parameter values to pass into optim.

Value

A data frame with the posterior probabilities.

Examples

## Load the data set
data("autby_coef", package = "GenomAutomorphism")
post_prob <- automorphism_prob(autby_coef[1:10])
head(post_prob,10)
#>    Codon frequency     alpha Posterior_Probability
#> 1    RHH        27 1.3362573            0.17597995
#> 2    HHY        25 1.3362573            0.16355911
#> 3    HRH        25 1.3362573            0.16355911
#> 4    HHR         9 1.3362573            0.06419246
#> 5    HMH         9 1.3362573            0.06419246
#> 6    SHH         9 1.3362573            0.06419246
#> 7    HYH         4 1.3362573            0.03314038
#> 8    HSH         3 1.3362573            0.02692996
#> 9    MHH         3 1.3362573            0.02692996
#> 10   YHH         1 0.7934929            0.01113834