Skip to contents

Abstract

Analysis of DNA mutational events on DNA Multiple Sequence Alignment (MSA) by means of automorphisms between pairwise DNA sequences algebraically represented as Abelian finite group. Herein, show a simple example on how the package can be applied.

Overview

This is a R package to compute the autimorphisms between pairwise aligned DNA sequences represented as elements from a Genomic Abelian group as described in reference (1). In a general scenario, whole chromosomes or genomic regions from a population (from any species or close related species) can be algebraically represented as a direct sum of cyclic groups or more specifically Abelian p-groups. Basically, we propose the representation of multiple sequence alignments (MSA) of length N as a finite Abelian group created by the direct sum of homocyclic Abelian groups of prime-power order:

\[ \qquad G = (\mathbb{Z}_{p^{\alpha_{1}}_1})^{n_1} \oplus (\mathbb{Z}_{p^{\alpha_{2}}_1})^{n_2} \oplus \dots \oplus (\mathbb{Z}_{p^{\alpha_{k}}_k})^{n_k} \]

Where, the \(p_i\)’s are prime numbers, \(\alpha_i \in \mathbb{N}\) and \(\mathbb{Z}_{p^{\alpha_{i}}_i}\) is the group of integer modulo \(p^{\alpha_{i}}_i\).

For the purpose of estimating the automorphism between two aligned DNA sequences, \(p^{\alpha_{i}}_i \in \{5, 2^6, 5^3 \}\).

Automorphisms

Herein, automorphisms are considered algebraic descriptions of mutational event observed in codon sequences represented on different Abelian groups. In particular, as described in references (3-4), for each representation of the codon set on a defined Abelian group there are 24 possible isomorphic Abelian groups. These Abelian groups can be labeled based on the DNA base-order used to generate them. The set of 24 Abelian groups can be described as a group isomorphic to the symmetric group of degree four (\(S_4\), see reference (4)).

For further support about the symmetric group on the 24 Abelian group of genetic-code cubes, users can also see Symmetric Group of the Genetic-Code Cubes., specifically the Mathematica notebook IntroductionToZ5GeneticCodeVectorSpace.nb and interact with it using Wolfram Player, freely available (for Windows and Linux OS) at, https://www.wolfram.com/player/.

Automorphisms on \(\mathbb{Z}_{64}\)

First, we proceed to load the R package required for our analysis

Next, we proceed to check the DNA multiple sequence alignment (MSA) file. This ia a FASTA file carrying the MSA of mammals (somatic) Cystocrome c. Notice that we are familiar with the FASTA file, then it is better to directly read it with function automorphisms.

However, for the current example, this step can be bypassed, since the MSA is provided provided together with GenomAutomorphism R package

## Do not run it. This is included with package
URL <- paste0("https://github.com/genomaths/seqalignments/raw/master/CYCS/",
              "primate_cytochrome_c_(CYCS)_18_sequences.fasta")

cyc_aln <- readDNAMultipleAlignment(filepath = URL)

Load MSA available in the package

data("cyc_aln")
cyc_aln
#> DNAMultipleAlignment with 19 rows and 318 columns
#>       aln                                                                                           names               
#>  [1] ATGGGTGATGTTGAGAAAGGCAAGAAGATTTTTATTATGAAGTGT...AGGGCAGACTTAATAGCTTATCTCAAAAAAGCTACTAATGAGTAA NM_018947.6 Homo ...
#>  [2] ATGGGTGATGTTGAGAAAGGCAAGAAGATTTTTATTATGAAGTGT...AGGGCAGACTTAATAGCTTATCTCAAAAAAGCTACTAATGAGTAA gb|BC021994.1|  [...
#>  [3] ATGGGTGATGTTGAGAAAGGCAAGAAGATTTTTATTATGAAGTGT...AGGGCAGACTTAATAGCTTATCTCAAAAAAGCTACTAATGAGTAA ref|XM_019031021....
#>  [4] ATGGGTGATGTTGAGAAAGGCAAGAAGATTTTTATTATGAAGTGT...AGGGCAGACTTAATAGCTTATCTCAAAAAAGCTACTAATGAGTAA dbj|AK311836.1|  ...
#>  [5] ATGGGTGATGTTGAGAAAGGCAAGAAGATTTTTATTATGAAGTGT...AGGGCAGACTTAATAGCTTATCTCAAAAAAGCTACTAATGAGTAA gb|BC067222.1|  [...
#>  [6] ATGGGTGATGTTGAGAAAGGCAAGAAGATTTTTATTATGAAGTGT...AGGGCAGACTTAATAGCTTATCTCAAAAAAGCTACTAATGAGTAA gb|BC068464.1|  [...
#>  [7] ATGGGTGATGTTGAGAAAGGCAAGAAGATTTTTATTATGAAGTGT...AGGGCAGACTTAATAGCTTATCTCAAAAAAGCTACTAATGAGTAA gb|BC015130.1|  [...
#>  [8] ATGGGTGATGTTGAGAAAGGCAAGAAGATTTTTATTATGAAGTGT...AGGGCAGACTTGATAGCTTATCTCAAAAAAGCTACTAATGAGTAA ref|XM_032759077....
#>  [9] ATGGGTGATGTTGAGAAAGGCAAGAAGATTTTTATTATGAAGTGT...AGGGCAGACTTGATAGCTTATCTCAAAAAAGCTACTAATGAGTAA ref|XM_003270423....
#> [10] ATGGGTGATGTTGAGAAAGGCAAGAAGATTTTTATTATGAAGTGT...AGGGCAGACTTGATAGCTTATCTCAAAAAAGCTACTAATGAGTAA ref|XM_033192823....
#> [11] ATGGGTGATGTTGAGAAAGGCAAGAAGATTTTTATTATGAAGTGT...AGGGCAGACTTGATAGCTTATCTCAAAAAAGCTACTAATGAGTAA ref|XM_003896182....
#> [12] ATGGGTGATGTTGAGAAAGGCAAGAAGATTTTTATTATGAAGTGT...AGGGCAGACTTGATAGCTTATCTCAAAAAAGCTACTAATGAGTAA ref|XM_003895240....
#> [13] ATGGGTGATGTTGAGAAAGGCAAGAAGATTTTTATTATGAAGTGT...AGGGCAGACTTGATAGCTTATCTCAAAAAAGCTACTAATGAGTAA ref|XM_010375926....
#> [14] ATGGGTGATGTTGAGAAAGGCAAGAAGATTTTTATTATGAAGTGT...AGGGCAGACTTGATAGCTTATCTCAAAAAAGCTACTAATGAGTAA ref|XM_028845949....
#> [15] ATGGGTGATGTTGAGAAAGGCAAGAAGATTTTTATTATGAAGTGT...AGGGCAGACTTGATAGCTTATCTCAAAAAAGCTACTAATGAGTAA ref|XM_028845948....
#> [16] ATGGGTGATGTTGAGAAAGGCAAGAAGATTTTTATTATGAAGTGT...AGGGCAGACTTGATAGCTTATCTCAAAAAAGCTACTAATGAGTAA ref|XM_025380676....
#> [17] ATGGGTGATGTTGAGAAAGGCAAGAAGATTTTTATTATGAAGTGT...AGGGCAGACTTGATAGCTTATCTCAAAAAAGCTACTAATGAGTAA ref|XM_025380675....
#> [18] ATGGGTGATGTTGAGAAAGGCAAGAAGATTTTTATTATGAAGTGT...AGGGCAGACTTAATAGCTTATCTCAAAAAAGCTACTAATGAGTAA ref|NM_001131167....
#> [19] ATGGGTGATGTTGAGAAAGGCAAGAAGATTTTTATTATGAAGTGT...AGGGCAGACTTAATAGCTTATCTCAAAAAAGCTACTAATGAGTAA emb|CR857183.1|  ...

The sequence names

strtrim(rownames(cyc_aln), 100)
#>  [1] "NM_018947.6 Homo sapiens cytochrome c, somatic (CYCS), mRNA"                                         
#>  [2] "gb|BC021994.1|  [organism=Homo sapiens] [note=Vector: pDNR-LIB] [gcode=1] [clone=MGC:24510 IMAGE:409"
#>  [3] "ref|XM_019031021.2|  [location=chromosome] [chromosome=7] [organism=Gorilla gorilla gorilla] [isolat"
#>  [4] "dbj|AK311836.1|  [organism=Homo sapiens] [gcode=1] [clone=BRACE3022676] [tissue_type=cerebellum] [cl"
#>  [5] "gb|BC067222.1|  [organism=Homo sapiens] [note=Vector: pDNR-LIB] [gcode=1] [clone=MGC:72005 IMAGE:368"
#>  [6] "gb|BC068464.1|  [organism=Homo sapiens] [note=Vector: pBluescriptR] [gcode=1] [clone=MGC:87065 IMAGE"
#>  [7] "gb|BC015130.1|  [organism=Homo sapiens] [note=Vector: pDNR-LIB] [gcode=1] [clone=MGC:24248 IMAGE:393"
#>  [8] "ref|XM_032759077.1|  [chromosome=Unknown] [organism=Hylobates moloch] [isolate=HMO894] [note=Lionel]"
#>  [9] "ref|XM_003270423.4|  [location=chromosome] [chromosome=17] [organism=Nomascus leucogenys] [isolate=A"
#> [10] "ref|XM_033192823.1|  [chromosome=Unknown] [organism=Trachypithecus francoisi] [isolate=TF-2019V2] [g"
#> [11] "ref|XM_003896182.3|  [location=chromosome] [chromosome=4] [organism=Papio anubis] [isolate=15944] [g"
#> [12] "ref|XM_003895240.5|  [location=chromosome] [chromosome=2] [organism=Papio anubis] [isolate=15944] [g"
#> [13] "ref|XM_010375926.2|  [location=chromosome] [chromosome=6] [organism=Rhinopithecus roxellana] [isolat"
#> [14] "ref|XM_028845949.1|  [location=chromosome] [chromosome=3] [organism=Macaca mulatta] [isolate=AG07107"
#> [15] "ref|XM_028845948.1|  [location=chromosome] [chromosome=3] [organism=Macaca mulatta] [isolate=AG07107"
#> [16] "ref|XM_025380676.1|  [location=chromosome] [chromosome=3] [organism=Theropithecus gelada] [isolate=D"
#> [17] "ref|XM_025380675.1|  [location=chromosome] [chromosome=3] [organism=Theropithecus gelada] [isolate=D"
#> [18] "ref|NM_001131167.2|  [chromosome=7] [organism=Pongo abelii] [gcode=1] [chromosome=7] [map=7] Pongo a"
#> [19] "emb|CR857183.1|  [organism=Pongo abelii] [gcode=1] [clone=DKFZp468F1016] [tissue_type=heart] [clone_"

The corresponding aminoacid sequence is:

translate(cyc_aln@unmasked)
#> AAStringSet object of length 19:
#>      width seq                                                                                      names               
#>  [1]   106 MGDVEKGKKIFIMKCSQCHTVEKGGKHKTGPNLHGLFGRKTGQ...LMEYLENPKKYIPGTKMIFVGIKKKEERADLIAYLKKATNE* NM_018947.6 Homo ...
#>  [2]   106 MGDVEKGKKIFIMKCSQCHTVEKGGKHKTGPNLHGLFGRKTGQ...LMEYLENPKKYIPGTKMIFVGIKKKEERADLIAYLKKATNE* gb|BC021994.1|  [...
#>  [3]   106 MGDVEKGKKIFIMKCSQCHTVEKGGKHKTGPNLHGLFGRKTGQ...LMEYLENPKKYIPGTKMIFVGIKKKEERADLIAYLKKATNE* ref|XM_019031021....
#>  [4]   106 MGDVEKGKKIFIMKCSQCHTVEKGGKHKTGPNLHGLFGRKTGQ...LMEYLENPKKYIPGTKMIFVGIKKKEERADLIAYLKKATNE* dbj|AK311836.1|  ...
#>  [5]   106 MGDVEKGKKIFIMKCSQCHTVEKGGKHKTGPNLHGLFGRKTGQ...LMEYLENPKKYIPGTKMIFVGIKKKEERADLIAYLKKATNE* gb|BC067222.1|  [...
#>  ...   ... ...
#> [15]   106 MGDVEKGKKIFIMKCSQCHTVEKGGKHKTGPNLHGLFGRKTGQ...LMEYLENPKKYIPGTKMIFVGIKKKEERADLIAYLKKATNE* ref|XM_028845948....
#> [16]   106 MGDVEKGKKIFIMKCSQCHTVEKGGKHKTGPNLHGLFGRKTGQ...LMEYLENPKKYIPGTKMIFVGIKKKEERADLIAYLKKATNE* ref|XM_025380676....
#> [17]   106 MGDVEKGKKIFIMKCSQCHTVEKGGKHKTGPNLHGLFGRKTGQ...LMEYLENPKKYIPGTKMIFVGIKKKEERADLIAYLKKATNE* ref|XM_025380675....
#> [18]   106 MGDVEKGKKIFIMKCSQCHTVEKGGKHKTGPNLHGLFGRKTGQ...LMEYLENPKKYIPGTKMIFVGIKKKEERADLIAYLKKATNE* ref|NM_001131167....
#> [19]   106 MGDVEKGKKIFIMKCSQCHTVEKGGKHKTGPNLHGLFGRKTGQ...LMEYLENPKKYIPGTKMIFVGIKKKEERADLIAYLKKATNE* emb|CR857183.1|  ...

Next, function automorphisms will be applied to represent the codon sequence in the Abelian group \(\mathbb{Z}_{64}\) (i.e., the set of integers remainder modulo 64). The codon coordinates are requested on the cube ACGT. Following reference (4)), cubes are labeled based on the order of DNA bases used to define the sum operation.

In Z64, automorphisms are described as functions \(f(x) = k\,x \quad mod\,64\), where \(k\) and \(x\) are elements from the set of integers modulo 64. Below, in function automorphisms three important arguments are given values: group = “Z64”, cube = c(“ACGT”, “TGCA”), and cube_alt = c(“CATG”, “GTAC”). Setting for group specifies on which group the automorphisms will be computed. These groups can be: “Z5”, “Z64”, “Z125”, and “Z5^3”.

In groups “Z64” and “Z125” not all the mutational events can be described as automorphisms from a given cube. So, a character string denoting pairs of “dual” the genetic-code cubes, as given in references (1-4)), is given as argument for cube. That is, the base pairs from the given cubes must be complementary each other. Such a cube pair are call dual cubes and, as shown in reference (4)), each pair integrates group. If automorphisms are not found in first set of dual cubes, then the algorithm search for automorphism in a alternative set of dual cubes.

## Do not run it. This is included with package

nams <- c("human_1", "human_2", "gorilla", "human_3", "human_4",
          "human_5", "human_6", "silvery_gibbon", "white_cheeked_gibbon",
          "françois_langur", "olive_baboon_1", "olive_baboon_2",
          "golden_monkey", "rhesus_monkeys_1", "rhesus_monkeys_2",
          "gelada_baboon_1", "gelada_baboon_2", "orangutan_1", "orangutan_2")

cyc_autm <- automorphisms(filepath = URL, 
                      group = "Z64", 
                      cube = c("ACGT", "TGCA"),
                      cube_alt = c("CATG", "GTAC"),
                      nms = nams,
                      verbose = FALSE)

Object autm is included with package and can be load typing:

data(cyc_autm, package = "GenomAutomorphism")
cyc_autm
#> AutomorphismList object of length: 171
#> names(171): human_1.human_2 human_1.gorilla human_1.human_3 ... gelada_baboon_2.orangutan_1 gelada_baboon_2.orangutan_2 orangutan_1.orangutan_2 
#> ------- 
#> Automorphism object with 106 ranges and 8 metadata columns:
#>         seqnames    ranges strand |        seq1        seq2         aa1         aa2    coord1    coord2      autm
#>            <Rle> <IRanges>  <Rle> | <character> <character> <character> <character> <numeric> <numeric> <numeric>
#>     [1]        1         1      + |         ATG         ATG           M           M        50        50         1
#>     [2]        1         2      + |         GGT         GGT           G           G        43        43         1
#>     [3]        1         3      + |         GAT         GAT           D           D        11        11         1
#>     [4]        1         4      + |         GTT         GTT           V           V        59        59         1
#>     [5]        1         5      + |         GAG         GAG           E           E        10        10         1
#>     ...      ...       ...    ... .         ...         ...         ...         ...       ...       ...       ...
#>   [102]        1       102      + |         GCT         GCT           A           A        27        27         1
#>   [103]        1       103      + |         ACT         ACT           T           T        19        19         1
#>   [104]        1       104      + |         AAT         AAT           N           N         3         3         1
#>   [105]        1       105      + |         GAG         GAG           E           E        10        10         1
#>   [106]        1       106      + |         TAA         TAA           *           *        12        12         1
#>                cube
#>         <character>
#>     [1]        ACGT
#>     [2]        ACGT
#>     [3]        ACGT
#>     [4]        ACGT
#>     [5]        ACGT
#>     ...         ...
#>   [102]        ACGT
#>   [103]        ACGT
#>   [104]        ACGT
#>   [105]        ACGT
#>   [106]        ACGT
#>   -------
#>   seqinfo: 1 sequence from an unspecified genome; no seqlengths
#> ...
#> <170 more DFrame element(s)>
#> Two slots: 'DataList' & 'SeqRanges'
#> -------

An AutomorphismList class object is returned. The codon sequences (seq1 and seq2) with their corresponding coordinates (left) are returned, as well as the coordinated representation on \(\mathbb{Z}_{64}\) (coord1 and coord2). Observe that two new columns were added, the automorphism coefficient \(k\) (named as autm) and the genetic-code cube where the automorphism was found. By convention the DNA sequence is given for the positive strand. Since the dual cube of “ACGT” corresponds to the complementary base order TGCA, automorphisms described by the cube TGCA represent mutational events affecting the DNA negative strand (-).

Bar plot automorphism distribution by cubes

The automorphism distribution by cubes can be summarized in the bar-plot graphic

counts <- table(unlist(lapply(cyc_autm@DataList, 
            function(x) {
              x = data.frame(x)
              x =  x[ x$autm != 1 | is.na(x$autm), ]
              return(x$cube)}
            )
        ))

par(family = "serif", cex = 0.8, font = 2, mar = c(4,6,4,4))
barplot(counts, #main="Automorphism distribution",
        xlab="Genetic-code cube representation",
        ylab="Fixed mutational events",
col=c("darkblue","red", "darkgreen", "magenta", "orange"), 
        border = NA, axes = FALSE, ylim = c(0,300),
        cex.lab = 2, cex.main = 1.5, cex.names = 2)
axis(2, at = c(0, 100, 200, 300), cex.axis = 1.5)
mtext(side = 1,line = -1.5, at = c(0.7, 1.9),
      text = paste0( counts ), cex = 1.4,
      col = c("white","yellow"))

Bar plot automorphism distribution by regions

The last result can be summarized by gene regions as follow:

autm_range <- automorphismByRanges(cyc_autm, 
                                  min.len = 2, 
                                  verbose = FALSE)
autm_range
#> GRangesList object of length 108:
#> $human_1.human_4
#> GRanges object with 3 ranges and 1 metadata column:
#>       seqnames    ranges strand |        cube
#>          <Rle> <IRanges>  <Rle> | <character>
#>   [1]        1      1-60      + |        ACGT
#>   [2]        1        61      + |        TGCA
#>   [3]        1    62-106      + |        ACGT
#>   -------
#>   seqinfo: 1 sequence from an unspecified genome; no seqlengths
#> 
#> $human_1.silvery_gibbon
#> GRanges object with 3 ranges and 1 metadata column:
#>       seqnames    ranges strand |        cube
#>          <Rle> <IRanges>  <Rle> | <character>
#>   [1]        1      1-94      + |        ACGT
#>   [2]        1        95      + |        TGCA
#>   [3]        1    96-106      + |        ACGT
#>   -------
#>   seqinfo: 1 sequence from an unspecified genome; no seqlengths
#> 
#> $human_1.white_cheeked_gibbon
#> GRanges object with 3 ranges and 1 metadata column:
#>       seqnames    ranges strand |        cube
#>          <Rle> <IRanges>  <Rle> | <character>
#>   [1]        1      1-94      + |        ACGT
#>   [2]        1        95      + |        TGCA
#>   [3]        1    96-106      + |        ACGT
#>   -------
#>   seqinfo: 1 sequence from an unspecified genome; no seqlengths
#> 
#> ...
#> <105 more elements>

That is, function automorphismByRanges permits the classification of the pairwise alignment of protein-coding sub-regions based on the mutational events observed on it quantitatively represented as automorphisms on genetic-code cubes.

Searching for automorphisms on \(\mathbb{Z}_{64}\) permits us a quantitative differentiation between mutational events at different codon positions from a given DNA protein-encoding region. As shown in reference (4)) a set of different cubes can be applied to describe the best evolutionary aminoacid scale highly correlated with aminoacid physicochemical properties describing the observed evolutionary process in a given protein.

More information about this subject can be found in the supporting material from reference (4)) at GitHub GenomeAlgebra_SymmetricGroup, particularly by interacting with the Mathematica notebook Genetic-Code-Scales_of_Amino-Acids.nb.

The automorphism distribution by cubes can be summarized in the bar-plot graphic

cts <- unlist(autm_range)
counts <- table(cts$cube)

par(family = "serif", cex = 0.6, font = 2, las = 1)
barplot(counts, main="Automorphism distribution",
        xlab="Genetic-code cube representation",
        col=c("darkblue","red", "darkgreen"), 
        border = NA, axes = T)
mtext(side = 1,line = -6, at = c(0.7, 1.9, 3.1),
      text = paste0( counts ), cex = 1.4,
      col = c("white","yellow"))

Grouping automorphism by automorphism’s coefficients

Automorphisms with the same automorphism’s coefficients can be grouped.

autby_coef <- automorphism_bycoef(cyc_autm, 
                                 verbose = FALSE)
autby_coef
#> AutomorphismByCoefList object of length 171:
#> $human_1.human_2
#> AutomorphismByCoef object with 45 ranges and 7 metadata columns:
#>        seqnames    ranges strand |        seq1        seq2         aa1         aa2      autm    mut_type        cube
#>           <Rle> <IRanges>  <Rle> | <character> <character> <character> <character> <numeric> <character> <character>
#>    [1]        1     1-106      + |         ATG         ATG           M           M         1         HHH        ACGT
#>    [2]        1     1-106      + |         GGT         GGT           G           G         1         HHH        ACGT
#>    [3]        1     1-106      + |         GAT         GAT           D           D         1         HHH        ACGT
#>    [4]        1     1-106      + |         GTT         GTT           V           V         1         HHH        ACGT
#>    [5]        1     1-106      + |         GAG         GAG           E           E         1         HHH        ACGT
#>    ...      ...       ...    ... .         ...         ...         ...         ...       ...         ...         ...
#>   [41]        1     1-106      + |         GAC         GAC           D           D         1         HHH        ACGT
#>   [42]        1     1-106      + |         TTA         TTA           L           L         1         HHH        ACGT
#>   [43]        1     1-106      + |         ATA         ATA           I           I         1         HHH        ACGT
#>   [44]        1     1-106      + |         GCT         GCT           A           A         1         HHH        ACGT
#>   [45]        1     1-106      + |         TAA         TAA           *           *         1         HHH        ACGT
#>   -------
#>   seqinfo: 1 sequence from an unspecified genome; no seqlengths
#> 
#> ...
#> <170 more elements>

Every single base mutational event across the MSA was classified according IUPAC nomenclature: 1) According to the number of hydrogen bonds (on DNA/RNA double helix): strong S={C, G} (three hydrogen bonds) and weak W={A, U} (two hydrogen bonds). According to the chemical type: purines R={A, G} and pyrimidines Y={C, U}. 3). According to the presence of amino or keto groups on the base rings: amino M={C, A} and keto K={G, T}. Constant (hold) base positions were labeled with letter H. So, codon positions labeled as HKH means that the first and third bases remains constant and mutational events between bases G and T were found in the MSA.

cts <- unlist(autby_coef)
counts <- table(cts$mut_type[ cts$autm != 1 & cts$autm != -1 ])
counts <- sort(counts, decreasing = TRUE)

par(family = "serif", cex.axis = 1.4, font = 2, las = 1, 
    cex.main = 1.4, cex.lab = 2)
barplot(counts, main="Automorphism distribution per Mutation type",
        col = c("red", "orange", "yellow", "green", "dodgerblue"), 
        border = NA, axes = TRUE,las=2)
mtext(side = 1,line = -2, at = c(0.7, 1.9, 3.1, 4.3, 5.5),
      text = paste0( counts ), cex = 1.4,
      col = c("white", "red", "blue", "black", "yellow"))

Bar plot automorphism distribution by cubes

The automorphism distribution by cubes can be summarized in the bar-plot graphic.

Object autby_coef carried all the pairwise comparisons, while it will be enough to use data from a single species as reference, e.g., humans.

First the data must be reordered into a object:

hautby_coef <- autby_coef[ grep("human", names(autby_coef)) ]
h_autby_coef <- unlist(hautby_coef)
h_autby_coef <- h_autby_coef[ which(h_autby_coef$autm != 1) ]

nams <- names(h_autby_coef)
nams <- sub("human[_][1-6][.]", "", nams)
nams <- sub("[_][1-6]", "", nams)

dt <- data.frame(h_autby_coef, species = nams)
dt <- dt[, c("start", "autm", "species", "cube")]

Nominal variables are transformed into

dt$start <- as.numeric(dt$start)
dt$autm <- as.numeric(dt$autm)
dt$cube <- as.factor(dt$cube)
dt$species <- as.factor(dt$species)
DataFrame(dt)
#> DataFrame with 282 rows and 4 columns
#>         start      autm   species     cube
#>     <numeric> <numeric>  <factor> <factor>
#> 1          37         3   gorilla     ACGT
#> 2          51        19   human       ACGT
#> 3          61        51   human       TGCA
#> 4          41         3   human       ACGT
#> 5          18        33   human       ACGT
#> ...       ...       ...       ...      ...
#> 278        54        43 orangutan     TGCA
#> 279        16        54 orangutan     ACGT
#> 280        18        33 orangutan     ACGT
#> 281        37         3 orangutan     ACGT
#> 282        54        43 orangutan     TGCA
counts <- table(dt$cube)

par(family = "serif", cex = 0.6, font = 2, mar=c(4,6,4,4))
barplot(counts, #main="Automorphism distribution",
        xlab="Genetic-code cube representation",
        ylab="Fixed mutational events",
        col=c("darkblue","red", "darkgreen", "magenta", "orange"), 
        border = NA, axes = F, #ylim = c(0, 6000),
        cex.lab = 2, cex.main = 1.5, cex.names = 2)
axis(2, at = c(0, 50, 100, 150, 200), cex.axis = 1.5)
mtext(side = 1,line = -2.5, at = c(0.7, 1.9, 3.1, 4.3, 5.5),
      text = paste0( counts ), cex = 1.4,
      col = c("white", "yellow"))

Conserved and non-conserved regions

A specific function is provided to get the coordinates of conserved and non-conserved regions, which can be used in further downstream analyses.

The conserved regions from each pairwise comparison are retrieved typing:

conserved_regions(x = autby_coef)
#> AutomorphismByCoef object with 11023 ranges and 7 metadata columns:
#>                           seqnames    ranges strand |        seq1        seq2         aa1         aa2      autm
#>                              <Rle> <IRanges>  <Rle> | <character> <character> <character> <character> <numeric>
#>       human_1.orangutan_1        1      1-15      + |         ATG         ATG           M           M         1
#>       human_1.orangutan_1        1      1-15      + |         GGT         GGT           G           G         1
#>       human_1.orangutan_1        1      1-15      + |         GAT         GAT           D           D         1
#>       human_1.orangutan_1        1      1-15      + |         GTT         GTT           V           V         1
#>       human_1.orangutan_1        1      1-15      + |         GAG         GAG           E           E         1
#>                       ...      ...       ...    ... .         ...         ...         ...         ...       ...
#>   orangutan_1.orangutan_2        1     1-106      + |         GAC         GAC           D           D         1
#>   orangutan_1.orangutan_2        1     1-106      + |         TTA         TTA           L           L         1
#>   orangutan_1.orangutan_2        1     1-106      + |         ATA         ATA           I           I         1
#>   orangutan_1.orangutan_2        1     1-106      + |         GCT         GCT           A           A         1
#>   orangutan_1.orangutan_2        1     1-106      + |         TAA         TAA           *           *         1
#>                              mut_type        cube
#>                           <character> <character>
#>       human_1.orangutan_1         HHH        ACGT
#>       human_1.orangutan_1         HHH        ACGT
#>       human_1.orangutan_1         HHH        ACGT
#>       human_1.orangutan_1         HHH        ACGT
#>       human_1.orangutan_1         HHH        ACGT
#>                       ...         ...         ...
#>   orangutan_1.orangutan_2         HHH        ACGT
#>   orangutan_1.orangutan_2         HHH        ACGT
#>   orangutan_1.orangutan_2         HHH        ACGT
#>   orangutan_1.orangutan_2         HHH        ACGT
#>   orangutan_1.orangutan_2         HHH        ACGT
#>   -------
#>   seqinfo: 1 sequence from an unspecified genome; no seqlengths

The non-conserved regions from each pairwise comparison are obtained with the same function but different settings:

ncs <- conserved_regions(x = autby_coef, conserved = FALSE)
ncs
#> AutomorphismByCoef object with 396 ranges and 7 metadata columns:
#>                                seqnames    ranges strand |        seq1        seq2         aa1         aa2      autm
#>                                   <Rle> <IRanges>  <Rle> | <character> <character> <character> <character> <numeric>
#>            human_1.orangutan_1        1        16      + |         TCC         TCG           S           S        54
#>            human_1.orangutan_2        1        16      + |         TCC         TCG           S           S        54
#>            human_2.orangutan_1        1        16      + |         TCC         TCG           S           S        54
#>            human_2.orangutan_2        1        16      + |         TCC         TCG           S           S        54
#>            gorilla.orangutan_1        1        16      + |         TCC         TCG           S           S        54
#>                            ...      ...       ...    ... .         ...         ...         ...         ...       ...
#>   rhesus_monkeys_2.orangutan_2        1        95      + |         TTG         TTA           L           L        34
#>    gelada_baboon_1.orangutan_1        1        95      + |         TTG         TTA           L           L        34
#>    gelada_baboon_1.orangutan_2        1        95      + |         TTG         TTA           L           L        34
#>    gelada_baboon_2.orangutan_1        1        95      + |         TTG         TTA           L           L        34
#>    gelada_baboon_2.orangutan_2        1        95      + |         TTG         TTA           L           L        34
#>                                   mut_type        cube
#>                                <character> <character>
#>            human_1.orangutan_1         HHS        ACGT
#>            human_1.orangutan_2         HHS        ACGT
#>            human_2.orangutan_1         HHS        ACGT
#>            human_2.orangutan_2         HHS        ACGT
#>            gorilla.orangutan_1         HHS        ACGT
#>                            ...         ...         ...
#>   rhesus_monkeys_2.orangutan_2         HHR        ACGT
#>    gelada_baboon_1.orangutan_1         HHR        ACGT
#>    gelada_baboon_1.orangutan_2         HHR        ACGT
#>    gelada_baboon_2.orangutan_1         HHR        ACGT
#>    gelada_baboon_2.orangutan_2         HHR        ACGT
#>   -------
#>   seqinfo: 1 sequence from an unspecified genome; no seqlengths

Subsetting regions of interest

The application of some of the available bioinformatic tools is straightforward. For example, knowing the positions for the conserved and non-conserved regions, we can subset into the alignment region of potential interest.

The coordinates carrying mutational events are:

sort(unique(start(ncs)))
#> [1] 16 18 37 41 51 54 59 61 95

We will use function DNAStringSet from the Bioconductor R package Biostrings to subset the MSA of primate cytochrome c. 

cyc_aln_region <- DNAStringSet(cyc_aln, start = 37*3 - 2, end = 42*3)
cyc_aln_region
#> DNAStringSet object of length 19:
#>      width seq                                                                                      names               
#>  [1]    18 TTTGGGCGGAAGACAGGT                                                                       NM_018947.6 Homo ...
#>  [2]    18 TTTGGGCGGAAGACAGGT                                                                       gb|BC021994.1|  [...
#>  [3]    18 TTCGGGCGGAAGACAGGT                                                                       ref|XM_019031021....
#>  [4]    18 TTTGGGCGGAAGACAGGT                                                                       dbj|AK311836.1|  ...
#>  [5]    18 TTTGGGCGGAAGACAGGT                                                                       gb|BC067222.1|  [...
#>  ...   ... ...
#> [15]    18 TTCGGGCGGAAGACAGGT                                                                       ref|XM_028845948....
#> [16]    18 TTCGGGCGGAAGACAGGT                                                                       ref|XM_025380676....
#> [17]    18 TTCGGGCGGAAGACAGGT                                                                       ref|XM_025380675....
#> [18]    18 TTCGGGCGGAAGACAGGT                                                                       ref|NM_001131167....
#> [19]    18 TTCGGGCGGAAGACAGGT                                                                       emb|CR857183.1|  ...

Next, function ggseqlogo from the R package ggseqlogo will be applied to get the logo sequence from the selected MSA region.

ggseqlogo( as.character(cyc_aln_region) )
#> Warning: The `<scale>` argument of `guides()` cannot be `FALSE`. Use "none" instead as of ggplot2 3.3.4.
#>  The deprecated feature was likely used in the ggseqlogo package.
#>   Please report the issue at <https://github.com/omarwagih/ggseqlogo/issues>.
#> This warning is displayed once every 8 hours.
#> Call `lifecycle::last_lifecycle_warnings()` to see where this warning was generated.

The corresponding aminoacid sequence will be

cyc_aln_region_aa = translate(cyc_aln_region)
cyc_aln_region_aa
#> AAStringSet object of length 19:
#>      width seq                                                                                      names               
#>  [1]     6 FGRKTG                                                                                   NM_018947.6 Homo ...
#>  [2]     6 FGRKTG                                                                                   gb|BC021994.1|  [...
#>  [3]     6 FGRKTG                                                                                   ref|XM_019031021....
#>  [4]     6 FGRKTG                                                                                   dbj|AK311836.1|  ...
#>  [5]     6 FGRKTG                                                                                   gb|BC067222.1|  [...
#>  ...   ... ...
#> [15]     6 FGRKTG                                                                                   ref|XM_028845948....
#> [16]     6 FGRKTG                                                                                   ref|XM_025380676....
#> [17]     6 FGRKTG                                                                                   ref|XM_025380675....
#> [18]     6 FGRKTG                                                                                   ref|NM_001131167....
#> [19]     6 FGRKTG                                                                                   emb|CR857183.1|  ...

And the corresponding aminoacid sequence logo will is

ggseqlogo( as.character(cyc_aln_region_aa) )

That is, the first mutational event reported in the codon sequence correspond to a synonymous mutation.

Alternatively, we can start translating the DNA MSA.

cyc_aln_aa <- translate(cyc_aln@unmasked)
cyc_aln_aa
#> AAStringSet object of length 19:
#>      width seq                                                                                      names               
#>  [1]   106 MGDVEKGKKIFIMKCSQCHTVEKGGKHKTGPNLHGLFGRKTGQ...LMEYLENPKKYIPGTKMIFVGIKKKEERADLIAYLKKATNE* NM_018947.6 Homo ...
#>  [2]   106 MGDVEKGKKIFIMKCSQCHTVEKGGKHKTGPNLHGLFGRKTGQ...LMEYLENPKKYIPGTKMIFVGIKKKEERADLIAYLKKATNE* gb|BC021994.1|  [...
#>  [3]   106 MGDVEKGKKIFIMKCSQCHTVEKGGKHKTGPNLHGLFGRKTGQ...LMEYLENPKKYIPGTKMIFVGIKKKEERADLIAYLKKATNE* ref|XM_019031021....
#>  [4]   106 MGDVEKGKKIFIMKCSQCHTVEKGGKHKTGPNLHGLFGRKTGQ...LMEYLENPKKYIPGTKMIFVGIKKKEERADLIAYLKKATNE* dbj|AK311836.1|  ...
#>  [5]   106 MGDVEKGKKIFIMKCSQCHTVEKGGKHKTGPNLHGLFGRKTGQ...LMEYLENPKKYIPGTKMIFVGIKKKEERADLIAYLKKATNE* gb|BC067222.1|  [...
#>  ...   ... ...
#> [15]   106 MGDVEKGKKIFIMKCSQCHTVEKGGKHKTGPNLHGLFGRKTGQ...LMEYLENPKKYIPGTKMIFVGIKKKEERADLIAYLKKATNE* ref|XM_028845948....
#> [16]   106 MGDVEKGKKIFIMKCSQCHTVEKGGKHKTGPNLHGLFGRKTGQ...LMEYLENPKKYIPGTKMIFVGIKKKEERADLIAYLKKATNE* ref|XM_025380676....
#> [17]   106 MGDVEKGKKIFIMKCSQCHTVEKGGKHKTGPNLHGLFGRKTGQ...LMEYLENPKKYIPGTKMIFVGIKKKEERADLIAYLKKATNE* ref|XM_025380675....
#> [18]   106 MGDVEKGKKIFIMKCSQCHTVEKGGKHKTGPNLHGLFGRKTGQ...LMEYLENPKKYIPGTKMIFVGIKKKEERADLIAYLKKATNE* ref|NM_001131167....
#> [19]   106 MGDVEKGKKIFIMKCSQCHTVEKGGKHKTGPNLHGLFGRKTGQ...LMEYLENPKKYIPGTKMIFVGIKKKEERADLIAYLKKATNE* emb|CR857183.1|  ...
cyc_aln_aa_region <- AAStringSet(cyc_aln_aa, start = 37, end = 42)
cyc_aln_aa_region
#> AAStringSet object of length 19:
#>      width seq                                                                                      names               
#>  [1]     6 FGRKTG                                                                                   NM_018947.6 Homo ...
#>  [2]     6 FGRKTG                                                                                   gb|BC021994.1|  [...
#>  [3]     6 FGRKTG                                                                                   ref|XM_019031021....
#>  [4]     6 FGRKTG                                                                                   dbj|AK311836.1|  ...
#>  [5]     6 FGRKTG                                                                                   gb|BC067222.1|  [...
#>  ...   ... ...
#> [15]     6 FGRKTG                                                                                   ref|XM_028845948....
#> [16]     6 FGRKTG                                                                                   ref|XM_025380676....
#> [17]     6 FGRKTG                                                                                   ref|XM_025380675....
#> [18]     6 FGRKTG                                                                                   ref|NM_001131167....
#> [19]     6 FGRKTG                                                                                   emb|CR857183.1|  ...
ggseqlogo( as.character(cyc_aln_aa_region) )

References

1. Sanchez R, Morgado E, Grau R. Gene algebra from a genetic code algebraic structure. J Math Biol. 2005 Oct;51(4):431-57. doi: 10.1007/s00285-005-0332-8. Epub 2005 Jul 13. PMID: 16012800. ( PDF).

2. Robersy Sanchez, Jesús Barreto (2021) Genomic Abelian Finite Groups. doi: 10.1101/2021.06.01.446543.

3. M. V José, E.R. Morgado, R. Sánchez, T. Govezensky, The 24 possible algebraic representations of the standard genetic code in six or in three dimensions, Adv. Stud. Biol. 4 (2012) 119–152.PDF.

4. R. Sanchez. Symmetric Group of the Genetic–Code Cubes. Effect of the Genetic–Code Architecture on the Evolutionary Process MATCH Commun. Math. Comput. Chem. 79 (2018) 527-560. PDF.