Analysis of Human Papillomavirus Using Datamining - Apriori, Decision Tree, and Support Vector Machine (SVM) and its Application Field

Human Papillomavirus (HPV) has many types compared to other viruses and plays a key role in causing diverse diseases, especially cervical cancer. In this study, we aim to distinguish the features of HPV types of different degrees of fatality by analyzing their DNA sequences. We used the Decision Tree Algorithm, the Apriori Algorithm, and the Support Vector Machine in our experiments. By analyzing the DNA sequences, we discovered some relationships between certain types of HPV, especially the most fatal types, 16 and 18. Moreover, we concluded that it would be possible for scientists to develop more potent HPV treatments by applying the relationships and features that HPV exhibits.


Introduction
Human Papillomavirus, or HPV, is a DNA virus belonging to the papillomavirus family. It can induce a number of infections in keratinocytes of the skin or mucous membranes. HPV is known worldwide to induce various genital cancers, such as cervical cancer, vaginal cancer, and anal cancer. Most viruses pass through cell membranes and replicate inside the host cell to survive. During this process, HPV produces several kinds of proteins that make host cells divide without limit. The E6 protein inhibits p53, which induces apoptosis, and the E7 protein inhibits Rb, which restricts the cell cycle. Therefore, host cells infected by HPV keep dividing and can eventually turn into cancer cells.
According to numerous studies, HPV is important for understanding the mechanism of cancer development. It is a necessary cause of cervical cancer in particular. In other words, without HPV infection and the presence of HPV DNA, several types of cancer will not develop. This implies that if we can curb HPV infection and gene expression, it is possible to prevent and cure cervical cancer and other types of genital cancer. Since these cancers are highly fatal to humans, scientists are trying to reveal the overall process of HPV expression in the human body in order to develop medication for cervical cancer. Recently, HPV vaccines were developed and are being offered to girls aged under 15. However, to minimize the adverse effects and maximize the efficiency of HPV vaccines, it is crucial to reveal the molecular structure of Human Papillomavirus so that a medication specific to HPV can be created. The molecular structure of HPV proteins is determined by their amino acid sequences, which are in turn encoded in the viral DNA using four types of bases: A, G, C, and T. In this paper, we analyzed the DNA sequences of human papillomaviruses classified as 'high-risk' and 'probable high-risk' in order to find similarities and differences between those two categories.

Human Papillomavirus (HPV)
Human papillomavirus (HPV) causes many different diseases, and its types are classified by risk level. By analyzing various cases, HPV types were classified as 15 high-risk types (16, 18, 31, 33, 35, 39, 45, 51, 52, 56, 58, 59, 68, 73, 82), 3 probable high-risk types (26, 53, 66), and 12 low-risk types (6, 11, 40, 42, 43, 44, 54, 61, 70, 72, 81, CP6108) [1]. The DNA of HPV types 16 and 18 has been found to be closely associated with human genital cancer, which supports the concept that HPV types 16 and 18 are key factors in the aetiology of genital cancer. Furthermore, regarding vaccines [2], the HPV-16/18 AS04-adjuvanted vaccine was immunogenic, generally well tolerated, and effective against HPV-16 or HPV-18 infections; that research analyzed efficacy in the final event-driven analysis of women who were vaccinated at months 0, 1, and 6. The HPV-16/18 AS04-adjuvanted vaccine showed high efficacy against CIN2+ associated with HPV-16/18 and non-vaccine oncogenic HPV types [3]. In the field of cervical cancer, genital HPV is the central etiologic factor in cervical cancer worldwide [5]. The presence of HPV in virtually all cervical cancers also implies the highest worldwide attributable fraction so far reported for a specific cause of any major human cancer [6]. Furthermore, integration of HPV-16 DNA, which occurs in the cervix, can result in increased expression of the viral E6 and E7 oncogenes through altered mRNA stability and lead to cervical cancer. Finally, the demonstration that more than 20 different genital HPV types are associated with cervical cancer has important implications for cervical cancer prevention strategies that include the development of vaccines targeted to genital HPVs [4].

Datamining Algorithms
In this study, we used three different datamining algorithms: Decision Tree Algorithm, Apriori Algorithm, and Support Vector Machine Algorithm.

Support Vector Machine (SVM)
Support Vector Machines are supervised learning models with associated learning algorithms which analyze data and recognize patterns, utilized for classification and regression analysis.
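To make the idea of a supervised margin classifier concrete, the following is a minimal sketch of a linear soft-margin SVM trained by subgradient descent on the regularised hinge loss. It is not the software used in this study; the function names, hyperparameters (lam, lr, epochs), and the toy two-dimensional data are assumptions chosen only for illustration.

```python
def train_linear_svm(X, y, lam=0.01, lr=0.1, epochs=200):
    """Fit weights w and bias b for labels in {-1, +1}.

    Minimises lam/2 * ||w||^2 + hinge loss by per-sample subgradient steps.
    All hyperparameter values here are illustrative assumptions.
    """
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            score = sum(wj * xj for wj, xj in zip(w, xi)) + b
            if yi * score < 1:
                # Sample is misclassified or inside the margin:
                # both the regulariser and the hinge term contribute.
                w = [wj - lr * (lam * wj - yi * xj) for wj, xj in zip(w, xi)]
                b += lr * yi
            else:
                # Sample is safely classified: only the regulariser contributes.
                w = [wj - lr * lam * wj for wj in w]
    return w, b


def predict(w, b, x):
    """Return +1 or -1 depending on the side of the separating hyperplane."""
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) + b >= 0 else -1


# Toy, linearly separable data (hypothetical, not HPV data).
X = [[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -3.0]]
y = [1, 1, -1, -1]
w, b = train_linear_svm(X, y)
predictions = [predict(w, b, x) for x in X]
```

On this toy set the learned hyperplane separates the two classes; real sequence data would first have to be encoded as numeric feature vectors.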

Apriori Algorithm
Apriori is an algorithm for frequent item set mining and association rule learning over transactional databases. It operates by identifying the frequent individual items in the database and extending them to larger and larger item sets as long as those item sets appear sufficiently often in the database.
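The level-wise growth described above can be sketched as follows. This is a minimal illustration, assuming each transaction is the set of amino acid letters found in one sequence window; the function name, the support threshold, and the toy transactions are hypothetical and are not taken from the experiments.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {itemset: count} for every itemset that occurs in at least
    min_support transactions (itemsets are frozensets)."""
    transactions = [frozenset(t) for t in transactions]

    # Level 1: count the individual items.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    current = {s: n for s, n in counts.items() if n >= min_support}
    frequent = dict(current)

    k = 2
    while current:
        # Join step: merge frequent (k-1)-itemsets into k-item candidates.
        candidates = {a | b for a, b in combinations(list(current), 2)
                      if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must be frequent.
        candidates = {c for c in candidates
                      if all(frozenset(s) in current
                             for s in combinations(c, k - 1))}
        # Count step: keep the candidates that are frequent enough.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        current = {s: n for s, n in counts.items() if n >= min_support}
        frequent.update(current)
        k += 1
    return frequent


# Toy transactions: one-letter amino acid codes per window (made up).
windows = ["LTV", "LV", "LT", "TV", "LTV"]
frequent = apriori(windows, min_support=3)
```

Here every single letter and every pair is frequent, while the triple {L, T, V} appears in only two windows and is dropped.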

Decision Tree Algorithm
A decision tree is a decision support tool that uses a tree-like graph or model of decisions and their possible results. It is one way to display an algorithm. Decision trees are generally used in operations research, especially in decision analysis, to help identify the strategy most likely to reach a goal.
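As a hedged illustration of how a tree picks its decisions when used for classification, the sketch below selects the single window position whose amino acid best separates the class labels, using information gain. It covers only the root split of a tree, positions are 0-based here, and the windows and labels are invented for the example.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total)
                for n in Counter(labels).values())


def information_gain(windows, labels, position):
    """Entropy reduction from splitting on the residue at `position`."""
    groups = {}
    for window, label in zip(windows, labels):
        groups.setdefault(window[position], []).append(label)
    remainder = sum(len(g) / len(labels) * entropy(g)
                    for g in groups.values())
    return entropy(labels) - remainder


def best_position(windows, labels):
    """Index of the position with the highest information gain."""
    return max(range(len(windows[0])),
               key=lambda p: information_gain(windows, labels, p))


# Toy data (hypothetical): only the middle residue distinguishes the types.
windows = ["LTV", "LTA", "LSV", "LSA"]
labels = ["HPV 16", "HPV 16", "HPV 18", "HPV 18"]
root = best_position(windows, labels)
gain = information_gain(windows, labels, root)
```

A full tree would apply the same selection recursively to each group produced by the split.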

Method
In this study, we used 11 base sequences of different types of HPV (HPV 16, HPV 18, HPV 26, HPV 31, HPV 33, HPV 35, HPV 53, HPV 66, HPV 68a, HPV 68b, HPV 82). As mentioned in the related research above, three of them (HPV 26, HPV 53, HPV 66) belong to the probable high-risk group, and the others belong to the high-risk group. If we discover similarities and differences in base sequences between these two groups, we may be able to find amino acids that play a dominant role in causing cervical cancer. We therefore extracted the full DNA sequences of these viruses from NCBI (National Center for Biotechnology Information) and applied several algorithms to identify similarities and differences between them.
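The text does not state how base sequences were converted into amino acids. One plausible reading, shown here only as an assumption, is a translation of one reading frame with the standard genetic code. The table is built from the conventional TCAG ordering; the function name and the short example sequence are hypothetical.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order (TTT, TTC, TTA, ...).
BASES = "TCAG"
AMINO_ACIDS = ("FFLLSSSSYY**CC*W"   # first base T
               "LLLLPPPPHHQQRRRR"   # first base C
               "IIIMTTTTNNKKSSRR"   # first base A
               "VVVVAAAADDEEGGGG")  # first base G
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def translate(dna, frame=0):
    """Translate a DNA string into one-letter amino acid codes.

    '*' marks a stop codon; 'X' marks a codon with an unknown base.
    Choosing frame 0 is an assumption; real coding regions must be located
    from the sequence annotation.
    """
    dna = dna.upper()
    protein = []
    for i in range(frame, len(dna) - 2, 3):
        protein.append(CODON_TABLE.get(dna[i:i + 3], "X"))
    return "".join(protein)


# Made-up sequence, not taken from any HPV genome.
peptide = translate("ATGCTGACCGTTTAA")
```

The example yields Methionine, Leucine, Threonine, and Valine followed by a stop codon.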
In this paper, we used three types of datamining algorithms: Decision Tree Algorithm, Apriori Algorithm, and Support Vector Machine (SVM). The Decision Tree Algorithm is well suited to clarifying distinct differences between amino acid sequences, while the Apriori Algorithm is well suited to revealing similarities. SVM can provide more accurate results since it can work in higher-dimensional spaces than the other algorithms.
We conducted three experiments for each algorithm. The base sequence of each virus is too long to analyze at once, so we divided it into several parts. Accordingly, we analyzed the HPV sequences with three window sizes: 9-windows, 13-windows, and 17-windows.
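The division into windows can be sketched as below. Whether the windows overlapped is not stated, so the sliding step of 1 is an assumption, exposed as a parameter; the function name and the 17-residue example string are hypothetical.

```python
def sliding_windows(sequence, size, step=1):
    """Split a sequence into windows of a fixed size.

    step=1 gives overlapping windows; step=size gives disjoint blocks.
    Which of the two applies to the experiments is an assumption.
    """
    return [sequence[i:i + size]
            for i in range(0, len(sequence) - size + 1, step)]


# Made-up amino acid string of length 17.
protein = "MLTVLSTVILCMLTVLS"
window_counts = {size: len(sliding_windows(protein, size))
                 for size in (9, 13, 17)}
```

Each window then becomes one training example, so smaller window sizes produce more examples from the same sequence.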
For the SVM algorithm in particular, we carried out the experiments with 10-fold cross validation and applied four types of functions: normal, polynomial, polynomial2, and RBF.
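The setup above can be illustrated with common textbook definitions. Reading "normal" as the linear kernel is an assumption, and the text does not define "polynomial2", so it is not modelled beyond the degree parameter. The parameter defaults (degree, coef0, gamma) and the unshuffled fold assignment are also assumptions.

```python
from math import exp


def linear_kernel(x, z):
    """Plain dot product (assumed to correspond to the 'normal' setting)."""
    return sum(a * b for a, b in zip(x, z))


def polynomial_kernel(x, z, degree=2, coef0=1.0):
    """(x . z + coef0) ** degree; the default values are illustrative."""
    return (linear_kernel(x, z) + coef0) ** degree


def rbf_kernel(x, z, gamma=0.5):
    """exp(-gamma * ||x - z||^2); the gamma value is illustrative."""
    return exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, z)))


def k_fold_indices(n_samples, k=10):
    """Yield (train, test) index lists so that every sample is held out
    exactly once. Samples are not shuffled in this sketch."""
    indices = list(range(n_samples))
    for fold in range(k):
        test = indices[fold::k]
        held_out = set(test)
        train = [i for i in indices if i not in held_out]
        yield train, test


folds = list(k_fold_indices(20, k=10))
```

In each of the 10 rounds a model is trained on the train indices and scored on the held-out test indices, and the per-fold accuracies are averaged.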

Results
Table 1 shows the results of the 9-windows Decision Tree algorithm. It is remarkable that every virus possesses its own unique rule and that every extracted rule has a probability of at least 0.75. This value is high enough to suggest that every HPV type has its own distinguishable trait. We can also see that Threonine, Leucine, and Valine recur frequently in Table 1, which may indicate that these three amino acids play a key role in HPV. Moreover, the amino acids extracted from position 2 and position 9 form the distinguishing rule of each virus, so position 2 and position 9 appear to be important factors that make HPV types different.

9 window results
Table 2 shows the results of the 13-windows Decision Tree algorithm. All extracted rules have a probability of 0.75, which is quite high, as mentioned before.

13 window results
Isoleucine, Valine, and Threonine also play a significant role in the extracted rules. In particular, the rule for HPV 68b involves 6 amino acids of almost equal importance. Moreover, unlike in the 9-windows results, Threonine and Valine co-occur in many HPV types in the 13-windows results.

HPV 82 L(36) T(32)

Table 6 shows the results of the 17-windows Apriori Algorithm. According to Table 6, it is again clear that Leucine plays a dominant role in all HPV types. Unlike in Table 4 and Table 5, the extracted rules include more amino acids, and Threonine, Valine, and Serine also take part in the distinguishing rules.

Table 6. 17 window rule (Amino Acid(Frequency))

Because the results of Apriori and Decision Tree do not have much in common, we conducted a third experiment using SVM. However, because of the excessive amount of data, an infinite loop occurred and we were only able to use the normal type. We used 10-fold cross validation, and the message "Average loss on test set 90.0000 Zero/one-error on test set 90.00% (35 correct, 315 incorrect, 350 total)" was repeated 10 times; that is, only 10% of the test examples were classified correctly in each fold. The correct ratio was low due to the large amount of data.

After experiencing the difficulty of experimenting with such a large amount of data, we decided to compare a few specific types considered to differ greatly. From the results of Apriori and Decision Tree, we found that among the high-risk types, HPV 68b has the least similarity with HPV 16 and 18, which are most frequently found in cases of cancer. We therefore repeated the experiment in depth with HPV 18 and HPV 68b.
Table 7 presents the 9-windows results of the SVM algorithm. The average accuracy rate is highest for RBF and lowest for Polynomial2 and Normal. Since the accuracy rate is quite low overall (about half), HPV 18 and HPV 68b are not clearly divided into two classes. However, this result is still meaningful since RBF reaches 76.416%, which suggests that HPV 18 and HPV 68b have different properties in their amino acid sequences.

13 window results
[...] of 9-windows. This result also indicates that HPV 18 and HPV 68b have some different traits, but that they are not clearly separated.

Conclusion
In this study, we found that Leucine, Isoleucine, Threonine, and Valine are the most dominant amino acids in Human Papillomavirus. Leucine acts on building muscles and regulating blood sugar. One of the major functions of Isoleucine is proteinogenesis in the body. Threonine helps to maintain the proper protein balance in the body. Valine has a stimulant effect and is needed for muscle metabolism, tissue repair, and the maintenance of a proper nitrogen balance in the body. Most of the factors that induce cancers other than cervical cancer are composed mostly of Leucine. As a result, inhibitors of Leucine synthesis are widely used as anti-cancer medicine, but they also disrupt the synthesis of Leucine that is essential to the human body.
Unlike many other pathogens, HPV is made up of diverse amino acids. The malaria parasite, for instance, is mostly made of Leucine, which is a main amino acid of human muscle tissue. HPV, however, consists of Leucine, Valine, Threonine, Isoleucine, Cysteine, and many others in similar proportions. This implies that HPV can operate in various types of tissues composed of different amino acids. Furthermore, since the sequence and types of amino acids differ greatly among HPV types, they will produce diverse kinds of proteins, which will cause different symptoms in the human body. Since the extracted rules and components are distinct for every type of HPV, further research is needed to find out the functions of the proteins that are produced from these different amino acid sequences and that attack the human body.
Through this experiment, we found that there are both differences and similarities between the viruses that we analyzed. At the start of this study, we classified Human Papillomaviruses into three groups: 15 high-risk types, 3 probable high-risk types, and 12 low-risk types. In particular, HPV 16 and HPV 18, which are the key factors in inducing cervical cancer, show a high degree of agreement in their amino acid composition, especially in the percentage of Leucine and Valine. HPV 18 and other Human Papillomaviruses, in contrast, show some remarkable differences. This result indicates that medications with fewer adverse effects could be made by taking these structural differences into account. On this basis, we concluded that it would be possible to treat cervical cancer by inhibiting the synthesis of certain kinds of amino acids that are prevalent in HPV 16 and HPV 18.

Table 3 shows the results of the 17-windows Decision Tree algorithm. It is noticeable that many rules of the various HPV types consist of 2 amino acids. Also, in the 17-windows results, position 6 plays a key role in the extracted rules. Furthermore, the rules for HPV 16 and HPV 33 contain Cysteine, a finding unique to these two viruses, which may indicate a key difference between HPV types.

Table 3. 17 window rule (POS=Position)

[...] Valine and Threonine are exclusive. In other words, when an HPV type has Valine, it does not have Threonine. This rule may help to reveal the difference between lethal and non-lethal Human Papillomaviruses. This table also implies differences between the high-risk group and the probable high-risk group: the high-risk group has both Leucine and Valine, while the probable high-risk group does not. This suggests that Leucine and Valine play a key role in causing high-risk cervical cancer.

Table 5 shows the results of the 13-windows Apriori Algorithm. According to Table 5, it is also clear that Leucine plays a dominant role in all HPV types.