Skip to contents

Abstract

A fast introduction into the analysis of DNA mutational events as modular matrix operation, i.e., as automorphisms over a ring.

Introduction

In the Abelian p-group defined on \(\mathbb{Z}_{p^{\alpha_{i}}_i}\) the endomorphisms \(\eta_i \in End(\mathbb{Z}_{p^{\alpha_{i}}_i})\) are described as functions \(f(x)=kx\) mod \(p^{\alpha_{i}}_i\), where \(k\) and \(x\) are elements from the set of integers modulo \(p^{\alpha_{i}}_i\).

Next, we proceed to load the R package required libraries

Modular operation on homocyclic group

The mutational events transforming the codon sequence ATACCCATGGCCAAC into the sequence ACACCCATGACCAAC can be represented by means of automorphisms.

In the cube ACGT the codon sequence ATACCCATGGCCAAC is represented by the vector \((48,21,50,25,1) \in (\mathbb{Z}_{2^6})^5\), where \((\mathbb{Z}_{2^6})^5\) stands for the homocyclic Abelian group of vectors of integers modulo \(\mathbb{Z}_{2^6}\), while the sequence ACACCCATGACCAAC is represented by \((16,21,50,17,1)\).

To translate the DNA sequence into numerical vectors from \((\mathbb{Z}_{2^6})^5\) they are previously incorporated into a DNAStringSet object.

dna <- DNAStringSet(c('ATACCCATGGCCAAC', 'ACACCCATGACCAAC'))
dna
#> DNAStringSet object of length 2:
#>     width seq
#> [1]    15 ATACCCATGGCCAAC
#> [2]    15 ACACCCATGACCAAC
cd <- codon_coord(dna, cube = "ACGT", group = "Z64")
cd
#> CodonGroup object with 5 ranges and 4 metadata columns:
#>       seqnames    ranges strand |        seq1        seq2      coord1      coord2
#>          <Rle> <IRanges>  <Rle> | <character> <character> <character> <character>
#>   [1]        1         1      + |         ATA         ACA          48          16
#>   [2]        1         2      + |         CCC         CCC          21          21
#>   [3]        1         3      + |         ATG         ATG          50          50
#>   [4]        1         4      + |         GCC         ACC          25          17
#>   [5]        1         5      + |         AAC         AAC           1           1
#>   -------
#>   seqinfo: 1 sequence from an unspecified genome; no seqlengths

A list of coordinates is retrieved from object cd

cd <- coordList(get_coord(cd))
cd
#> $coord1
#> [1] 48 21 50 25  1
#> 
#> $coord2
#> [1] 16 21 50 17  1

Homomorphism matrices

Most of the mutational events in small DNA segments can be represented by means homomorphisms with which, in turns, are presented by diagonal matrices. The diagonal elements of matrix a transforming coordinate elements from coord1 into coord2 are estimated as shown below

dg <- modlineq(cd$coord1, cd$coord2, 64)
dg
#> [1]  3  1  1 57  1

That is, we have the matrix:

diag(dg)
#>      [,1] [,2] [,3] [,4] [,5]
#> [1,]    3    0    0    0    0
#> [2,]    0    1    0    0    0
#> [3,]    0    0    1    0    0
#> [4,]    0    0    0   57    0
#> [5,]    0    0    0    0    1

The multiplication of vector coord1 by the matrix diag(dg) yields vector coord2:

(cd$coord1 %*% diag(dg)) %% 64
#>      [,1] [,2] [,3] [,4] [,5]
#> [1,]   16   21   50   17    1
# Or
cat("\n---- \n")
#> 
#> ----
(cd$coord1 %*% diag(dg)) %% 64 == cd$coord2
#>      [,1] [,2] [,3] [,4] [,5]
#> [1,] TRUE TRUE TRUE TRUE TRUE
dna2 <- DNAStringSet(c('GACAGAGCAGTATTAGCTTCACAC', 'GAAAAC---GTATTA---TCAAAG'))
dna2
#> DNAStringSet object of length 2:
#>     width seq
#> [1]    24 GACAGAGCAGTATTAGCTTCACAC
#> [2]    24 GAAAAC---GTATTA---TCAAAG
cd2 <- codon_coord(dna2, cube = "ACGT", group = "Z64")
cd2
#> CodonGroup object with 8 ranges and 4 metadata columns:
#>       seqnames    ranges strand |        seq1        seq2      coord1      coord2
#>          <Rle> <IRanges>  <Rle> | <character> <character> <character> <character>
#>   [1]        1         1      + |         GAC         GAA           9           8
#>   [2]        1         2      + |         AGA         AAC          32           1
#>   [3]        1         3      + |         GCA         ---          24        <NA>
#>   [4]        1         4      + |         GTA         GTA          56          56
#>   [5]        1         5      + |         TTA         TTA          60          60
#>   [6]        1         6      + |         GCT         ---          27        <NA>
#>   [7]        1         7      + |         TCA         TCA          28          28
#>   [8]        1         8      + |         CAC         AAG           5           2
#>   -------
#>   seqinfo: 1 sequence from an unspecified genome; no seqlengths
cd2$coord2[ is.na(cd2$coord2) ] <- 0
cd2
#> CodonGroup object with 8 ranges and 4 metadata columns:
#>       seqnames    ranges strand |        seq1        seq2      coord1      coord2
#>          <Rle> <IRanges>  <Rle> | <character> <character> <character> <character>
#>   [1]        1         1      + |         GAC         GAA           9           8
#>   [2]        1         2      + |         AGA         AAC          32           1
#>   [3]        1         3      + |         GCA         ---          24           0
#>   [4]        1         4      + |         GTA         GTA          56          56
#>   [5]        1         5      + |         TTA         TTA          60          60
#>   [6]        1         6      + |         GCT         ---          27           0
#>   [7]        1         7      + |         TCA         TCA          28          28
#>   [8]        1         8      + |         CAC         AAG           5           2
#>   -------
#>   seqinfo: 1 sequence from an unspecified genome; no seqlengths
cd2 <- coordList(get_coord(cd2))
cd2
#> $coord1
#> [1]  9 32 24 56 60 27 28  5
#> 
#> $coord2
#> [1]  8  1  0 56 60  0 28  2

Coordinates of seq2 can be obtained from seq1 solving the equation: \(seq1 \cdot x = seq2\) mod \(n\)

modulo <- c(64,125,64,64,64,64,64,64)
coord1 <- cd2$coord1
coord2 <- cd2$coord2

dg <- modlineq(coord1, coord2, modulo)
dg
#> [1]  8 43  0  1  1  0  1 26

Or in a form of diagonal matrix:

m  <- diag(dg)

That is, the matrix multiplication below yields coordinates of seq2.

(coord1 %*% m) %% modulo
#>      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]
#> [1,]    8    1    0   56   60    0   28    2

Or

(coord1 %*% m) %% modulo == coord2
#>      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]
#> [1,] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE

Affine Transformations

Next, we search for the homomorphims transforming coordinates coord2 into coodinates coord1:

modulo <- c(64,125,64,64,64,64,64,64)

dg1 <- modlineq(a = coord2, b = coord1, n = modulo, no.sol = 0L)
dg1
#> $diag
#> [1]  0 32  0  1  1  0  1  0
#> 
#> $translation
#> [1]  9  0 24  0  0 27  0  5

This is an affine transformation, which involves the sum of a homomorphism:

m1 <- diag(dg1$diag)
m1
#>      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]
#> [1,]    0    0    0    0    0    0    0    0
#> [2,]    0   32    0    0    0    0    0    0
#> [3,]    0    0    0    0    0    0    0    0
#> [4,]    0    0    0    1    0    0    0    0
#> [5,]    0    0    0    0    1    0    0    0
#> [6,]    0    0    0    0    0    0    0    0
#> [7,]    0    0    0    0    0    0    1    0
#> [8,]    0    0    0    0    0    0    0    0

Plus a translation. That is:

(coord2 %*% m1 + dg1$translation) %% modulo
#>      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]
#> [1,]    9   32   24   56   60   27   28    5

# Or
cat("\n---- \n")
#> 
#> ----

(coord2 %*% m1 + dg1$translation) %% modulo == coord1
#>      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]
#> [1,] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE

However, the solution is not unique. For example, setting no.sol = 1, we have another solution

dg2 <- modlineq(a = coord2, b = coord1, n = modulo, no.sol = 1)
dg2
#> $diag
#> [1]  1 32  1  1  1  1  1  1
#> 
#> $translation
#> [1]  1  0 24  0  0 27  0  3
m2 <- diag(dg2$diag)

(coord2 %*% m2 + dg2$translation) %% modulo
#>      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]
#> [1,]    9   32   24   56   60   27   28    5

# Or
cat("\n---- \n")
#> 
#> ----

(coord2 %*% m2 + dg2$translation) %% modulo == coord1
#>      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]
#> [1,] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE

Since the second element of seq2 has coordinate is 1, to build a homomorphism that transform coordinates from seq2 into coordinates from seq1, we can just sum the translation vector to the second column of matrix m2:

m3 <- m2
m3[, 2] <- (m2[, 2] + dg2$translation) %% modulo
m3
#>      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]
#> [1,]    1    1    0    0    0    0    0    0
#> [2,]    0   32    0    0    0    0    0    0
#> [3,]    0   24    1    0    0    0    0    0
#> [4,]    0    0    0    1    0    0    0    0
#> [5,]    0    0    0    0    1    0    0    0
#> [6,]    0   27    0    0    0    1    0    0
#> [7,]    0    0    0    0    0    0    1    0
#> [8,]    0    3    0    0    0    0    0    1

It is easy to see that:

(coord2 %*% t(m3)) %% modulo
#>      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]
#> [1,]    9   32   24   56   60   27   28    5

Or, as a column vector:

(m3 %*% coord2) %% modulo
#>      [,1]
#> [1,]    9
#> [2,]   32
#> [3,]   24
#> [4,]   56
#> [5,]   60
#> [6,]   27
#> [7,]   28
#> [8,]    5