Pattern Recognition of mtDNA with Associative Models

In this paper we apply an associative memory to the pattern recognition of mtDNA, which can be useful for identifying bodies and human remains. In particular, we use both morphological heteroassociative memories: max and min. We treat the pattern recognition problem as a classification task. Our proposal showed correct recall: 100% of the learned patterns were recovered. We simulated a corrupted sample of mtDNA by adding two types of noise, additive and subtractive. The memory showed correct recall when 55% or less of either type of noise was applied.


Introduction
Scientific and technological advances in the tools used in genetics have enabled new applications in other disciplines. More and more frequently, forensic groups collaborate with other medical specialities. Nowadays, the application of computational algorithms in several areas has become essential. In this work, we applied artificial intelligence tools to create software that is able to recognize patterns of mtDNA as a forensic application.
Mitochondria [1] are structures within cells that convert the energy from food into a form that cells can use. Although most DNA is packaged in chromosomes within the nucleus, mitochondria also have a small amount of their own DNA. This genetic material is known as mitochondrial DNA or mtDNA. This type of DNA resists adverse conditions without being degraded.
Several approaches have been applied to analyse and recognize DNA. Craven and Shavlik [2] applied artificial neural networks to recognize promoters in DNA. A genetic algorithm was applied to find variable-length motifs (transcription factor binding sites) [3]. A Self-Organizing Neural Network [4] was used to identify motifs. A related work on motifs applied the RISO software and box-link structures [5]. In addition, voting and pattern-matching algorithms [6] were used to identify motifs. Wells and Sperling [7] used the software PAUP 4.0 to analyse and identify mtDNA of the blow fly subfamily Chrysomyinae to support a forensic-entomological analysis. In another paper, a neural network based multi-classifier system [8] for the identification of Escherichia coli promoter sequences in strings of DNA is presented. Pushpalatha and Mukunthan [9] applied neural-fuzzy mapping to DNA fingerprint identification. A genetic algorithm [10] was used as a preprocessing tool to identify significant genes from DNA microarrays. Inbamalar and Sivakumar [11] applied the DWT Coiflet 5 to detect the protein coding regions in DNA sequences.
In this work, we applied Morphological Associative Memories to identify people from their mtDNA. We use the two types of heteroassociative memories: max and min. The results from these memories are mapped to binary vectors and then the AND operation is applied between the two binary results.

Associative memory
Associative Memories (AM) [12] associate patterns x with y, which can represent any concept: faces, fingerprints, DNA sequences, animals, books, preferences, diseases, etc. We can extract particular features of these concepts to form patterns x and y.
There are two phases for designing an associative memory: Training and Recalling.
In the Training Phase, the process of associating patterns x with patterns y is performed. Once this phase is done, we say that the memory is built.
The input and output patterns are represented by vectors. The association of these vectors takes place in the Training Phase, and the Recalling Phase allows recovering the patterns. The stimuli are the input patterns, represented by the set x = {x^1, x^2, x^3, ..., x^p}, where p is the number of associated patterns. The responses are the output patterns, represented by y = {y^1, y^2, ..., y^p}. The set of associations of input and output patterns is called the fundamental set or training set and is represented as follows: {(x^P, y^P) | P = 1, 2, ..., p}.

Morphological memories
The basic computations occurring in the proposed morphological network [13] are based on the algebraic lattice structure (ℝ, ∨, ∧, +), where the symbols ∨ and ∧ denote the binary operations of maximum and minimum, respectively. Using this lattice structure, for an m × p matrix A and a p × n matrix B with entries from ℝ, the matrix product C = A ∨ B, also called the max product of A and B, is defined by equation (1):

c_ij = ⋁_{k=1}^{p} (a_ik + b_kj). (1)
The min product of A and B induced by the lattice structure is defined in a similar fashion. Specifically, the (i, j)-th entry of C = A ∧ B is given by equation (2):

c_ij = ⋀_{k=1}^{p} (a_ik + b_kj). (2)
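As a concrete illustration, both products can be sketched with NumPy broadcasting (the function names are ours; the paper's implementation was in Matlab):

```python
import numpy as np

def max_product(A, B):
    """Max product C = A ∨ B: c_ij = max_k (a_ik + b_kj), eq. (1)."""
    # Pair every a_ik with every b_kj via broadcasting, then maximize over k.
    return (A[:, :, None] + B[None, :, :]).max(axis=1)

def min_product(A, B):
    """Min product C = A ∧ B: c_ij = min_k (a_ik + b_kj), eq. (2)."""
    return (A[:, :, None] + B[None, :, :]).min(axis=1)

A = np.array([[1, 2], [0, 3]])
B = np.array([[4, 0], [1, 5]])
print(max_product(A, B))   # [[5 7], [4 8]]
print(min_product(A, B))   # [[3 1], [4 0]]
```

Note that, unlike the ordinary matrix product, the sum over k is replaced by a maximum (or minimum) and the per-term multiplication by an addition.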
Since the max and min outer products of a column vector and a row vector coincide, the notational burden is reduced by denoting these identical morphological outer vector products by y^k × (−x^k)^t. With these definitions, we present the algorithms for the training and recalling phases.

Training Phase
1. For each of the p associations (x^P, y^P), the min product is used to build the matrix y^P × (−x^P)^t of dimensions m × n, where (−x^P)^t is the transposed negative input pattern.
2. The maximum and minimum operators (∨ and ∧) are applied to the p matrices to obtain the M and W memories, as equations (3) and (4) show:

M = ⋁_{P=1}^{p} [y^P × (−x^P)^t] (3)

W = ⋀_{P=1}^{p} [y^P × (−x^P)^t] (4)

Recalling phase
In this phase, the min and max products, ∧ and ∨, are applied between memory M or W and an input pattern x^Z, where Z ∈ {1, 2, ..., p}, to obtain the column vector y of dimension m, as equations (5) and (6) show:

y = M ∧ x^Z (5)

y = W ∨ x^Z (6)
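The training and recalling phases, equations (3) to (6), can be sketched as follows. The toy patterns and the 500 / −500 markers mirror the classification setup used for the mtDNA dataset; the function and variable names are illustrative, not from the paper's Matlab code:

```python
import numpy as np

def train(X, Y):
    """Build the max memory M (eq. 3) and the min memory W (eq. 4).
    X holds the input patterns as columns (n x p), Y the outputs (m x p)."""
    # outer[i, j, P] is the (i, j) entry of y^P x (-x^P)^t.
    outer = Y[:, None, :] - X[None, :, :]
    return outer.max(axis=2), outer.min(axis=2)   # M, W

def recall_with_M(M, x):
    """Min product M ∧ x, eq. (5)."""
    return (M + x[None, :]).min(axis=1)

def recall_with_W(W, x):
    """Max product W ∨ x, eq. (6)."""
    return (W + x[None, :]).max(axis=1)

# Two toy classes; the 500 / -500 markers follow the classification
# setup described for the mtDNA dataset.
X = np.array([[1, 0, 2], [0, 2, 1]]).T        # x^1, x^2 as columns
Y_max = np.array([[500, 0], [0, 500]]).T      # outputs used to build M
Y_min = np.array([[-500, 0], [0, -500]]).T    # outputs used to build W

M, _ = train(X, Y_max)
_, W = train(X, Y_min)

print(recall_with_M(M, X[:, 0]))   # the 500 marker appears at index 0
print(recall_with_W(W, X[:, 1]))   # the -500 marker appears at index 1
```

As the printed vectors show, the entries outside the marker position are not necessarily zero, which is why the binarization and AND step described later is needed.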

mtDNA recognition
In this section we describe the dataset used in this work and then the algorithm to store and recognize the mtDNA.

mtDNA dataset
The mtDNA dataset [14] was obtained from the National Center for Biotechnology Information (NCBI), which is part of the National Library of Medicine of the United States. The NCBI also offers free online bioinformatics tools for the analysis of DNA sequences, RNA and proteins. The dataset has 276 patterns; each pattern has 60 characters among adenine (a), cytosine (c), guanine (g) and thymine (t), which represent the four nitrogenous bases contained in DNA.

Algorithm
The data was stored in an Excel file. The data (characters) are converted to integers using the ASCII code. Since the lower-case characters start at decimal number 97, we subtracted 97 from all the dataset; therefore, the numbers representing the characters are: 0-a, 2-c, 6-g and 19-t. For example, one of the sequences is

gatcacaggt ctatcaccct attaaccact cacgggagct ctccatgcat ttggtatttt

and the corresponding decimal numbers, represented as a vector, are [6, 0, 19, 2, 0, 2, 0, 6, 6, 19, ...]. This vector corresponds to an input pattern x. Then, we have 276 input patterns of dimension 60 and 276 output patterns, y, of dimension 276.

When we use the max memory, the vectors y are built with the value 500 at the position indicating the class of the pattern and 0 elsewhere. The vectors used to build the min memory are similar, but we change the value 500 to -500. We use this value because we have observed that the value at the diagonal must be greater than the maximum value of the elements of the input vectors [15]. We want to highlight that we treat the recognition problem as a classification task. Therefore, every pattern of mtDNA is seen as a class, so we have 276 different classes. That is the reason why the output vectors have the number 500 or -500 at the index that indicates the number of the class.

Now we apply the training phase to build both memories, max and min, using equations (3) and (4). Then we present an input pattern to the memories to recall its corresponding output vector. The following example illustrates these operations with vectors of similar values but lower dimension, using only the max memory. Applying the recalling phase (equations (5) and (6)), we present an input pattern to M to recall its corresponding output pattern. From these results, we can observe that only the last two patterns are recalled, i.e., well classified; we conclude this because the number 500 appears at the correct index of the vector.
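The encoding step just described can be sketched in a few lines (the variable and function names are ours, not from the paper's Matlab code):

```python
# The 60-character sequence from the example above (spaces removed).
seq = ("gatcacaggt" "ctatcaccct" "attaaccact"
       "cacgggagct" "ctccatgcat" "ttggtatttt")

# ASCII code minus 97: a -> 0, c -> 2, g -> 6, t -> 19.
x = [ord(ch) - 97 for ch in seq]
print(x[:10])   # [6, 0, 19, 2, 0, 2, 0, 6, 6, 19]

def output_vector(class_index, n_classes=276, marker=500):
    """Output pattern: the marker value at the class position, 0 elsewhere.
    Use marker=-500 when building the vectors for the min memory."""
    y = [0] * n_classes
    y[class_index] = marker
    return y

y1 = output_vector(0)   # output for the pattern of class 1
```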
The min memory is used in the same way to recall its output patterns. The last step is to apply an AND operation between the vectors recalled by the two memories: each recalled vector is first mapped to a binary vector, with 1 at the positions holding the marker value (500 for the max memory, -500 for the min memory) and 0 elsewhere, and then the two binary vectors are combined elementwise.

The result for the first pattern indicates that it was well classified, i.e., the pattern corresponds to class 1, because the number 1 is located at the first position. Doing the same with the remaining vectors, we can observe that in all cases the patterns were classified in the correct class.
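The binarization and AND step can be sketched as follows; the recalled values are illustrative (the marker sits at the correct class index, the other entries are whatever the memories produce):

```python
import numpy as np

# Recalled vectors for one pattern of class 1 (illustrative values).
y_max = np.array([500, 498])     # recalled with the max memory M
y_min = np.array([-500, -499])   # recalled with the min memory W

b_max = (y_max == 500).astype(int)    # binarize: 1 where the 500 marker appears
b_min = (y_min == -500).astype(int)   # binarize: 1 where the -500 marker appears
combined = b_max & b_min              # elementwise AND of the two binary vectors
predicted_class = int(combined.argmax()) + 1   # classes are numbered from 1

print(combined, predicted_class)   # [1 0] 1
```

Requiring both memories to agree filters out positions where only one memory happens to produce the marker value.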

Results
We developed the software in Matlab 2013, running on a Dell® laptop with an Intel® Core i7 processor.
We built two morphological associative memories: max and min with the 276 patterns of the dataset.
In the recalling phase we recovered all the patterns. This means that our memory has no forgetting factor: everything that is learned is recalled. We can therefore say that our memory exhibits correct recall.
We simulated the problem when the sample of DNA is corrupted with noise (or incomplete).
We applied additive and subtractive noise to the patterns. Table 1 shows the results when a percentage of additive noise is applied.

Table 1. Recall of patterns with additive noise.

Additive noise (%)   Correct recall
10                   yes
20                   yes
30                   yes
40                   yes
50                   yes
51                   yes
52                   yes
53                   yes
54                   yes
55                   yes
56                   no
57                   no
58                   no
59                   no
60                   no

From Table 1, it can be observed that we have correct recall when the percentage of additive noise is less than or equal to 55%.
We observed the same behaviour when we applied subtractive noise to the patterns: from 56% of subtractive noise on, there is no correct recall.
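A minimal sketch of the noise simulation, assuming the noise corrupts a given percentage of a pattern's components by ±1 (the paper states the percentage of noise but not the magnitude of each corruption, so the `amount` parameter is our assumption):

```python
import numpy as np

rng = np.random.default_rng(0)   # fixed seed so the corruption is repeatable

def corrupt(x, percent, kind="additive", amount=1):
    """Corrupt `percent`% of the components of pattern x.
    Additive noise raises a component by `amount`; subtractive lowers it.
    The magnitude `amount` is an assumption, not from the paper."""
    x = np.array(x, dtype=int)
    k = int(round(len(x) * percent / 100))
    idx = rng.choice(len(x), size=k, replace=False)   # components to corrupt
    x[idx] += amount if kind == "additive" else -amount
    return x

x = [6, 0, 19, 2, 0, 2, 0, 6, 6, 19]
noisy = corrupt(x, 50, "additive")   # exactly 5 of the 10 components change
```

Each corrupted pattern is then presented to the memories, and recall is counted as correct only if the combined binary output points at the pattern's true class.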

Conclusions
We applied an artificial intelligence tool to recognize patterns of mtDNA. We chose this kind of DNA because it is resistant to degradation.
The Morphological Associative Memories proved a suitable algorithm for this application. The model has low computational complexity because its main operations are addition, subtraction, and the max and min operators.
We treat the pattern recognition problem as a classification task. This is why we built the output vectors with the value 500 or -500 at the index that indicates the number of the corresponding class.
Our proposal recalled all the trained patterns; therefore, it exhibits correct recall and a forgetting factor of zero.
If a sample of DNA is corrupted, the sequence may be incomplete or some of its elements may be changed, even though this type of DNA is resistant. We simulated this problem by adding noise to the patterns, both additive and subtractive. We found that when the patterns contained more than 55% of noise, the memory was not able to recognize the corresponding class.