# Trends in Artificial Intelligence

ISSN: 2643-6000

REVIEW ARTICLE | VOLUME 3 | ISSUE 1 OPEN ACCESS

# Artificial Intelligence (AI) Tools Constructed via the 5-Steps Rule for Predicting Post-Translational Modifications

Kuo-Chen Chou 1,2*

1. Gordon Life Science Institute, Massachusetts, USA
2. Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu, China

Kuo-Chen C (2019) Artificial Intelligence (AI) Tools Constructed via the 5-Steps Rule for Predicting Post-Translational Modifications. Trends Artif Intell 3(1):60-74.

Accepted: August 12, 2019 | Published Online: August 14, 2019

## Abstract

Identifying the sites of post-translational modifications (PTMs) in protein, RNA, and DNA sequences is currently a very active research topic, because the information thus obtained is very useful for an in-depth understanding of biological processes at the cellular level and for developing effective drugs against major diseases, including cancers and Alzheimer's disease. Although PTM sites can be determined by various experimental techniques, doing so purely by experiment is both time-consuming and costly. With the avalanche of biological sequences generated in the post-genomic age, it is highly desirable to develop artificial intelligence (AI) tools for rapidly and effectively identifying PTM sites. In the last few years, many efforts have been made in this regard, and considerable progress has been achieved. This review focuses on those AI tools that have the following two features. (1) They have been developed by strictly observing the 5-steps rule, so that each has a user-friendly web-server from which most experimental scientists can easily get their desired data without going through the detailed mathematics involved. (2) Their cornerstones are PseAAC (Pseudo Amino Acid Composition) or PseKNC (Pseudo K-tuple Nucleotide Composition), and hence their prediction quality is generally remarkably higher than that of most other PTM prediction methods lacking such a basis.

## Keywords

Artificial intelligence (AI) tools, Five-step rules, Post-translational modifications, Absolute true rate, Web-server

## Introduction

Post-translational modification, or PTM, means the covalent and generally enzymatic modification of proteins after they are biosynthesized. After being synthesized by ribosomes, proteins may undergo PTM to form the mature protein products. PTMs can occur on the amino acid side chains of a protein or at its C- or N-terminus. They can covalently modify an existing functional group of an amino acid or introduce a new one. Therefore, the chemical repertoire of the 20 standard amino acids can be considerably extended by PTMs.

According to their occurrence in three different types of biological sequences, PTMs can be classified into the following three different categories: (1) PTLM (post-translational modification) in proteins, (2) PTCM (post-transcriptional modification) in RNA, and (3) PTRM (post-replication modification) in DNA. PTMs play a key role in providing bio-macromolecules with structural and functional diversity, as well as in regulating cellular plasticity and dynamics. Meanwhile, PTMs are also closely associated with many major diseases including cancer, Alzheimer's, and Parkinson's. Therefore, identifying the PTM sites in biological sequences is very important for both basic research and drug development.

## Historical Reflection

Before going on, it is illuminating to make a historical reflection. For quite a long time, the information derived from computational approaches was not trusted very much by most experimental scientists, owing to the notorious local minimum problem [1]. They trusted only results determined by experiments, and regarded computational results as unreliable unless they had been confirmed experimentally. This situation has changed during the last decade or so, owing to the rapid development of structural bioinformatics and sequential bioinformatics. For the 3D structures of proteins, what experimental scientists trusted most were those determined by X-ray crystallography. Unfortunately, crystallography is time-consuming and expensive, and not all proteins can be successfully crystallized. Membrane proteins are difficult to crystallize, and most of them will not dissolve in normal solvents. Accordingly, very few membrane protein structures have been determined so far. NMR is indeed a very powerful tool for determining the 3D structures of membrane proteins (see, e.g., [2-19]), but it is also time-consuming and costly. In order to acquire structural information in a timely manner, a series of 3D protein structures have been derived by means of structural bioinformatics tools (see, e.g., [20-32]), and they have been found very useful in conducting mutagenesis studies [33] for rational drug design. Meanwhile, facing the explosive growth of biological sequences discovered in the post-genomic age, and in order to use them for drug development in a timely manner, a lot of useful information has been revealed or deduced by various AI tools via the PseAAC approach [34-36] and the PseKNC approach [37-39]. This kind of AI technique has played an increasingly important role in driving medicinal chemistry into an unprecedented revolution [40,41] by significantly speeding up the process of finding novel drugs [42-44].

In the last few years, many AI tools have been developed for predicting the PTM sites in biological sequences [40,45-87] in compliance with Chou's 5-steps rule [88], that is, by addressing the following five questions: (1) how to select or construct a valid benchmark dataset to train and test the predictor; (2) how to represent the samples with an effective formulation that can truly reflect their intrinsic correlation with the target to be predicted; (3) how to introduce or develop a powerful algorithm to conduct the prediction; (4) how to properly perform cross-validation tests to objectively evaluate the anticipated prediction accuracy; (5) how to establish a user-friendly web-server for the predictor that is accessible to the public.

The AI tools constructed through the 5-steps rule have the following notable merits: (1) they are crystal clear in their logical development, (2) they are completely transparent in operation, (3) their reported results are easy for others to reproduce, (4) they have high potential for stimulating other sequence-analysis methods, and (5) they are very convenient for a broad range of experimental scientists to use.

Therefore, the current review focuses only on those AI tools that were developed through Chou's 5-steps rule [88]. As for the importance of the 5-steps rule and how to use it in developing new predictors for proteome and genome analyses, see the Wikipedia article at https://en.wikipedia.org/wiki/5-step_rules.

Besides, with the avalanche of biological sequences in the post-genomic era, one of the most important but also most difficult problems in developing AI tools for biology is how to express a biological sequence as a discrete model or a vector while still retaining considerable sequence-order information or its key pattern characteristics. This is because all the existing machine-learning algorithms (such as the "Optimization" algorithm [89], the "Covariance Discriminant" or "CD" algorithm [90,91], the "Nearest Neighbor" or "NN" algorithm [92], and the "Support Vector Machine" or "SVM" algorithm [92,93]) can only handle vectors, as elaborated in a comprehensive review [40].

However, a vector defined in a discrete model may completely lose all the sequence-pattern information. To avoid completely losing the sequence-pattern information for proteins, the pseudo amino acid composition [34] or PseAAC [35] was proposed. Ever since the concept of Chou's PseAAC was proposed, it has been widely used in nearly all the areas of computational proteomics (see, e.g., [45,48,52,60,67,73,77,78,82,83,85-87,94-251] as well as a long list of references cited in [41]).
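The loss of sequence-pattern information described above can be seen in a small sketch. The first function computes the plain amino acid composition, which is identical for any two sequences built from the same residues; the second appends a few sequence-order correlation terms in the spirit of PseAAC. This is a simplified illustration, not the exact formulation of [34]: the function names, the default `weight` of 0.05, and especially `TOY_SCALE` (an arbitrary placeholder that merely ranks residues alphabetically, standing in for real physicochemical property values) are assumptions made for this example.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Hypothetical placeholder scale: NOT real physicochemical data.
# A real application would use normalized property values instead.
TOY_SCALE = {aa: i / 19 for i, aa in enumerate(AMINO_ACIDS)}


def amino_acid_composition(seq):
    """Return the 20 normalized occurrence frequencies of the residues."""
    counts = Counter(seq)
    return [counts[aa] / len(seq) for aa in AMINO_ACIDS]


def order_correlation(seq, scale, lag):
    """Mean squared property difference between residues `lag` positions apart."""
    pairs = len(seq) - lag  # assumes len(seq) > lag
    total = sum((scale[seq[i]] - scale[seq[i + lag]]) ** 2 for i in range(pairs))
    return total / pairs


def pseudo_composition(seq, scale, max_lag, weight=0.05):
    """Return 20 composition terms plus `max_lag` order-correlation terms."""
    freqs = amino_acid_composition(seq)
    thetas = [order_correlation(seq, scale, lag) for lag in range(1, max_lag + 1)]
    norm = sum(freqs) + weight * sum(thetas)
    return [f / norm for f in freqs] + [weight * t / norm for t in thetas]


seq_a = "ACDACD"
seq_b = "AACCDD"  # same residues as seq_a, different order

# The plain composition cannot tell the two sequences apart ...
print(amino_acid_composition(seq_a) == amino_acid_composition(seq_b))
# ... whereas the vector with order-correlation terms can.
print(pseudo_composition(seq_a, TOY_SCALE, 2) == pseudo_composition(seq_b, TOY_SCALE, 2))
```

The resulting vector has 20 + `max_lag` components that sum to one, so it can be fed to any of the vector-based algorithms mentioned earlier while still carrying some information about residue order.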

Because PseAAC has been widely and increasingly used, four powerful open-access software packages, called 'PseAAC' [252], 'PseAAC-Builder' [128], 'propy' [146], and 'PseAAC-General' [166], were established. The first three generate various modes of Chou's special PseAAC [253], while the fourth generates those of Chou's general PseAAC [88], including not only all the special modes of feature vectors for proteins but also higher-level feature vectors such as the "Functional Domain" mode (see Eqs.9-10 of [88]), the "Gene Ontology" mode (see Eqs.11-12 of [88]), and the "Sequential Evolution" or "PSSM" mode (see Eqs.13-14 of [88]).

Meanwhile, the idea of PseAAC was extended to generate various modes of feature vectors for DNA and RNA sequences [37-39,254-258], and it has proved very useful there as well.
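For nucleotide sequences, the starting point of such feature vectors is the k-tuple nucleotide composition, i.e. the normalized frequency of every possible k-mer. The sketch below covers only this composition part and omits the additional "pseudo" components that PseKNC adds to retain longer-range sequence-order effects; the function name `ktuple_composition` and its parameters are assumptions made for illustration.

```python
from itertools import product


def ktuple_composition(seq, k=2, alphabet="ACGT"):
    """Return the normalized frequencies of all len(alphabet)**k k-tuples.

    Assumes len(seq) >= k. Windows containing characters outside
    `alphabet` are skipped. For RNA, pass alphabet="ACGU".
    """
    kmers = ["".join(p) for p in product(alphabet, repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    total = len(seq) - k + 1  # number of overlapping windows
    for i in range(total):
        window = seq[i:i + k]
        if window in counts:
            counts[window] += 1
    return [counts[kmer] / total for kmer in kmers]


vector = ktuple_composition("ACGTAC", k=2)
print(len(vector))  # 16 dinucleotides for k = 2
```

With k = 2 the vector has 16 components and with k = 3 it has 64, so the dimension grows quickly with k.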

Given an AI tool, its name can be defined as

Name of 𝔸𝕀 tool = 𝔸𝕀(𝕏) (1)

where the wildcard 𝕏 denotes the web-server or software on which the AI tool is based. For instance, when 𝕏 = iSNO-PseAAC, the AI tool is for predicting cysteine S-nitrosylation sites in proteins by incorporating position-specific amino acid propensity into pseudo amino acid composition; when 𝕏 = iRNA-PseU, the AI tool is for predicting RNA pseudouridine sites; when 𝕏 = iDNA-Methyl, the AI tool is for predicting DNA methylation sites via pseudo trinucleotide composition; and so forth.

## Sixteen AI Tools for Identifying PTM or PTLM Sites in Protein Sequences

The 16 AI tools are: (1) 𝔸𝕀 (iSNO-PseAAC) [46]; (2) 𝔸𝕀 (iSNO-AAPair) [47]; (3) 𝔸𝕀 (iMethyl-AAC) [49]; (4) 𝔸𝕀 (iHyd-PseAAC) [50]; (5) 𝔸𝕀 (iNitro-Tyr) [51]; (6) 𝔸𝕀 (iUbiq-Lys) [54]; (7) 𝔸𝕀 (iSuc-PseOpt) [56]; (8) 𝔸𝕀 (pSuc-Lys) [57]; (9) 𝔸𝕀 (iCar-PseCp) [58]; (10) 𝔸𝕀 (pSumo-CD) [59]; (11) 𝔸𝕀 (iHyd-PseCp) [62]; (12) 𝔸𝕀 (iPTM-mLys) [63]; (13) 𝔸𝕀 (iPhos-PseEn) [64]; (14) 𝔸𝕀 (iPGK-PseAAC) [68]; (15) 𝔸𝕀 (iPhos-PseEvo) [71]; (16) 𝔸𝕀 (iPreny-PseAAC) [72]. Their functions and web-server links are each given in Table 1.

## Seven AI Tools for Identifying PTM or PTCM Sites in RNA Sequences

The 7 AI tools are: (1) 𝔸𝕀 (iRNA-PseU) [55]; (2) 𝔸𝕀 (pRNAm-PC) [61]; (3) 𝔸𝕀 (iRNA-PseColl) [66]; (4) 𝔸𝕀 (iRNA-methyl) [69]; (5) 𝔸𝕀 (iRNAm5C-PseDNC) [70]; (6) 𝔸𝕀 (iRNA(m6A)-PseDNC) [75]; (7) 𝔸𝕀 (iRNA-3typeA) [76]. Their functions and web-server links are each given in Table 2.

## One AI Tool for Identifying PTM or PTRM Sites in DNA Sequences

𝔸𝕀 (iDNA-Methyl) is the AI tool for identifying the PTM sites in DNA sequences [259]. Its function and web-server link are given in Table 3.

## Discussions

For measuring the success rates of the AI tools, a set of four metrics [260] is usually used in the literature. They are: (1) overall accuracy or Acc, (2) Matthews correlation coefficient or MCC, (3) sensitivity or Sn, and (4) specificity or Sp, as given below

$\left\{\begin{array}{l}S_{n}=\frac{TP}{TP+FN}\\ S_{p}=\frac{TN}{TN+FP}\\ Acc=\frac{TP+TN}{TP+TN+FP+FN}\\ MCC=\frac{\left(TP\times TN\right)-\left(FP\times FN\right)}{\sqrt{\left(TP+FP\right)\left(TP+FN\right)\left(TN+FP\right)\left(TN+FN\right)}}\end{array}\right.\qquad\left(2\right)$

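Although the above four metrics, taken directly from mathematics textbooks, have often been used in the literature to measure the quality of a prediction method, they lack intuitiveness and are not easy for most biologists to understand. This is particularly true of the MCC (the Matthews correlation coefficient), which is a very important metric for reflecting the stability of a prediction method. Fortunately, based on the symbols Chou introduced for studying protein signal peptides [261,262], a set of four intuitive metrics was derived [47,263,264], as given below

$\left\{\begin{array}{l}S_{n}=1-\frac{N_{-}^{+}}{N^{+}}\\ S_{p}=1-\frac{N_{+}^{-}}{N^{-}}\\ Acc=1-\frac{N_{-}^{+}+N_{+}^{-}}{N^{+}+N^{-}}\\ MCC=\frac{1-\left(\frac{N_{-}^{+}}{N^{+}}+\frac{N_{+}^{-}}{N^{-}}\right)}{\sqrt{\left(1+\frac{N_{+}^{-}-N_{-}^{+}}{N^{+}}\right)\left(1+\frac{N_{-}^{+}-N_{+}^{-}}{N^{-}}\right)}}\end{array}\right.\qquad\left(3\right)$

where $N^{+}$ is the total number of positive samples investigated, $N_{-}^{+}$ the number of positive samples incorrectly predicted to be negative, $N^{-}$ the total number of negative samples investigated, and $N_{+}^{-}$ the number of negative samples incorrectly predicted to be positive. Eq.3 follows from Eq.2 by substituting $TP=N^{+}-N_{-}^{+}$, $FN=N_{-}^{+}$, $TN=N^{-}-N_{+}^{-}$, and $FP=N_{+}^{-}$.

According to Eq.3 we can easily see the following. When $N_{-}^{+}=0$, meaning that none of the positive samples is mispredicted to be negative, we have the sensitivity Sn = 1; when $N_{-}^{+}=N^{+}$, meaning that all the positive samples are mispredicted to be negative, we have Sn = 0. Likewise, when $N_{+}^{-}=0$, meaning that none of the negative samples is incorrectly predicted to be positive, we have the specificity Sp = 1; when $N_{+}^{-}=N^{-}$, meaning that all the negative samples are incorrectly predicted to be positive, we have Sp = 0. When $N_{-}^{+}=N_{+}^{-}=0$, meaning that none of the positive samples and none of the negative samples is incorrectly predicted, we have the overall accuracy Acc = 1 and MCC = 1; when $N_{-}^{+}=N^{+}$ and $N_{+}^{-}=N^{-}$, meaning that all the positive samples and all the negative samples are mispredicted, we have Acc = 0 and MCC = -1, i.e. total disagreement between prediction and observation; and when $N_{-}^{+}=N^{+}/2$ and $N_{+}^{-}=N^{-}/2$, we have Acc = 0.5 and MCC = 0, meaning no better than random prediction. As we can see from the above discussion, it is much more intuitive and easier to understand a predictor's four metrics, particularly its Matthews correlation coefficient, when using Eq.3 instead of Eq.2.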
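The agreement between the confusion-matrix form of Eq.2 and an error-count form of the same four metrics can be checked numerically. The sketch below assumes the usual correspondence: `n_pos` and `n_neg` are the total numbers of positive and negative samples, `pos_missed` is the number of positives predicted as negative (FN), and `neg_missed` is the number of negatives predicted as positive (FP). The function and parameter names are chosen for this illustration only.

```python
import math


def metrics_from_confusion(tp, tn, fp, fn):
    """Sn, Sp, Acc, MCC from confusion-matrix counts (the form of Eq.2).

    Assumes no denominator is zero, i.e. both classes are present and
    both classes are predicted at least once.
    """
    sn = tp / (tp + fn)
    sp = tn / (tn + fp)
    acc = (tp + tn) / (tp + tn + fp + fn)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom
    return sn, sp, acc, mcc


def metrics_from_error_counts(n_pos, n_neg, pos_missed, neg_missed):
    """The same four metrics written in terms of totals and error counts."""
    sn = 1 - pos_missed / n_pos
    sp = 1 - neg_missed / n_neg
    acc = 1 - (pos_missed + neg_missed) / (n_pos + n_neg)
    denom = math.sqrt(
        (1 + (neg_missed - pos_missed) / n_pos)
        * (1 + (pos_missed - neg_missed) / n_neg)
    )
    mcc = (1 - (pos_missed / n_pos + neg_missed / n_neg)) / denom
    return sn, sp, acc, mcc


# 100 positives (20 missed) and 200 negatives (50 missed)
a = metrics_from_confusion(tp=80, tn=150, fp=50, fn=20)
b = metrics_from_error_counts(n_pos=100, n_neg=200, pos_missed=20, neg_missed=50)
print(all(math.isclose(x, y) for x, y in zip(a, b)))
```

The limiting cases discussed in the text can be reproduced the same way, for example by passing zero error counts (perfect prediction) or error counts equal to the class totals (everything mispredicted).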

It is instructive to point out, however, that some AI tools have a multi-label feature, such as 𝔸𝕀 (iPTM-mLys) [63], which can identify multiple lysine PTM sites and their different types. Indeed, multi-label systems (where a sample may simultaneously belong to several classes) have become increasingly frequent in both systems biology [265-292] and systems medicine [293,294].

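To examine the performance of multi-label AI tools, one also needs a set of global metrics [295,296], as elaborated below.

$\left\{\begin{array}{l}\text{Aiming}\uparrow=\frac{1}{N^{q}}\sum_{k=1}^{N^{q}}\left(\frac{\left\|L_{k}\cap L_{k}^{*}\right\|}{\left\|L_{k}^{*}\right\|}\right)\\ \text{Coverage}\uparrow=\frac{1}{N^{q}}\sum_{k=1}^{N^{q}}\left(\frac{\left\|L_{k}\cap L_{k}^{*}\right\|}{\left\|L_{k}\right\|}\right)\\ \text{Accuracy}\uparrow=\frac{1}{N^{q}}\sum_{k=1}^{N^{q}}\left(\frac{\left\|L_{k}\cap L_{k}^{*}\right\|}{\left\|L_{k}\cup L_{k}^{*}\right\|}\right)\\ \text{Absolute true}\uparrow=\frac{1}{N^{q}}\sum_{k=1}^{N^{q}}\Delta\left(L_{k},L_{k}^{*}\right)\\ \text{Absolute false}\downarrow=\frac{1}{N^{q}}\sum_{k=1}^{N^{q}}\left(\frac{\left\|L_{k}\cup L_{k}^{*}\right\|-\left\|L_{k}\cap L_{k}^{*}\right\|}{M}\right)\end{array}\right.\qquad\left(4\right)$

where $N^{q}$ is the total number of query or tested samples, $M$ is the total number of different labels for the investigated system, $\|\ \|$ is the operator that counts the number of elements in the set it acts on, $\cup$ is the symbol for "union" in set theory, $\cap$ is the symbol for "intersection", $L_{k}$ is the subset that contains all the labels observed by experiments for the k-th tested sample, $L_{k}^{*}$ is the subset that contains all the labels predicted for the k-th sample, and

$\Delta\left(L_{k},L_{k}^{*}\right)=\left\{\begin{array}{ll}1, & \text{if all the labels in } L_{k}^{*} \text{ are identical to those in } L_{k}\\ 0, & \text{otherwise}\end{array}\right.$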

In Eq.4, the first four metrics, marked with an up arrow $\uparrow$, are called positive metrics, meaning that the larger the rate, the better the prediction quality; the fifth metric, marked with a down arrow $\downarrow$, is called a negative metric, implying just the opposite. As we can see from Eq.4: (1) the "Aiming" defined by the 1st sub-equation checks the rate or percentage of correctly predicted labels over all the labels actually predicted; (2) the "Coverage" defined in the 2nd sub-equation checks the rate of correctly predicted labels over the real labels in the system concerned; (3) the "Accuracy" in the 3rd sub-equation checks the average ratio of correctly predicted labels over the total labels, including correctly and incorrectly predicted labels as well as real labels that were missed in the prediction; (4) the "Absolute true" in the 4th sub-equation checks the ratio of perfectly or completely correct prediction events over the total prediction events; (5) the "Absolute false" in the 5th sub-equation checks the ratio of completely wrong predictions over the total prediction events.

The five metrics in Eq.4 reflect the quality of a multi-label predictor from five different angles at the global level. It is instructive to point out, however, that among the five global metrics the most important one, and also the one whose success rate is most difficult to improve, is the "Absolute true" or "perfectly correct" rate [295]. Why? Because the scoring standard for the absolute true rate is very harsh. Consider a sample that actually has the three states ("A", "B", "C") simultaneously. If the predicted result is not exactly these three states but, say, ("A", "B") or ("A", "B", "C", "D"), no score whatsoever is given. In other words, one point is added to the absolute true rate if and only if the predicted outcome for the sample is perfectly identical to its actual status; otherwise, zero. That is why many investigators have chosen not to report the absolute true rate at all, since doing so would mean reporting a very low success rate for their prediction methods.

The set of metrics in Eq.4 is used to evaluate the prediction quality of a multi-label AI tool over all the samples in the entire system concerned [296], and hence is called the "set of metrics for the global accuracy" or the "set of global metrics".
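The five global metrics can be sketched in a few lines using Python sets for the observed and predicted label subsets. The sketch follows the verbal definitions given above (correct labels over predicted labels, over real labels, over their union, and so on); the function name `global_metrics` and the handling of empty label sets are assumptions made for this illustration.

```python
def global_metrics(observed, predicted, n_labels):
    """Compute the five global multi-label metrics.

    observed, predicted: equally long lists of label sets, one pair per
    tested sample. n_labels: total number of distinct labels in the system.
    Assumption: a sample with an empty set contributes 0 to any ratio
    whose denominator would otherwise be zero.
    """
    n = len(observed)
    sums = {"aiming": 0.0, "coverage": 0.0, "accuracy": 0.0,
            "absolute_true": 0.0, "absolute_false": 0.0}
    for obs, pred in zip(observed, predicted):
        inter = len(obs & pred)   # correctly predicted labels
        union = len(obs | pred)   # real or predicted labels
        sums["aiming"] += inter / len(pred) if pred else 0.0
        sums["coverage"] += inter / len(obs) if obs else 0.0
        sums["accuracy"] += inter / union if union else 0.0
        sums["absolute_true"] += 1.0 if obs == pred else 0.0
        sums["absolute_false"] += (union - inter) / n_labels
    return {name: total / n for name, total in sums.items()}


# The sample with states ("A", "B", "C") from the text, predicted three ways
observed = [{"A", "B", "C"}] * 3
predicted = [{"A", "B"}, {"A", "B", "C", "D"}, {"A", "B", "C"}]
for name, value in global_metrics(observed, predicted, n_labels=4).items():
    print(f"{name}: {value:.3f}")
```

In this example only the third prediction earns a point for the absolute true rate, which is why that rate (1/3) is far lower than the aiming, coverage, and accuracy values computed for the same three predictions.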

## Concluding Remarks and Perspectives

The AI tools introduced in this review paper for predicting PTM sites have been all established by following the 5-steps rule [88], and hence they each have a user-friendly web server for the majority of experimental scientists to easily get their desired data. Also, their cornerstones are based on PseAAC [34-36,88,253] or PseKNC [37,254,256, 257,264,297], and hence their prediction quality is usually higher than the other PTM prediction methods without using the PseAAC or PseKNC approach.

As we can see from the three sections above that list the tools, most of the available web-servers are for AI tools aimed at identifying PTM sites in protein sequences (16), followed by those for RNA sequences (7), with the fewest for DNA sequences (1). It is anticipated, however, that with more experimental data available in the future, the benchmark datasets for the PTM sites in RNA and DNA sequences will be enriched as well. The existing AI tools can then not only be easily extended to cover more RNA and DNA sequences, but also be further improved in prediction quality for all kinds of biological sequences.

It is worth noting that the 5-steps rule has recently also been used in many different areas [84,298-314].

Meanwhile, it has not escaped our notice that graphic approaches to studying biological and medical systems can provide an intuitive vision and useful insights for analyzing the complicated relations therein, as shown for enzyme fast reaction systems [315-317], graphical rules in molecular biology [318-321], and low-frequency internal motion in biomacromolecules (such as proteins and DNA) [322]. In particular, this kind of insightful implication has also been demonstrated in [323] and many follow-up publications [324-339].

## Acknowledgement

The author wishes to thank Dr. Michelle Claus for the invitation to write this paper.
