users.cecs.anu.edu.au

Users.cecs.anu.edu.au

PROTEIN-CHEMICAL INTERACTION PREDICTION VIA KERNELIZED
SPARSE LEARNING SVM
YI SHI*1, XINHUA ZHANG1, XIAOPING LIAO2, GUOHUI LIN1, DALE SCHUURMANS1 1Department of Computing Science, University of Alberta, Edmonton, Alberta T6G 2E8, Canada 2Department of Agricultural, Food and Nutritional Science, University of Alberta, Edmonton, Alberta T6G 2P5, Canada Given the diﬃculty of experimental determination of drug-protein interactions, there is a signiﬁcantmotivation to develop eﬀective in silico prediction methods that can provide both new predictions forexperimental veriﬁcation and supporting evidence for experimental results. Most recently, classiﬁca-tion methods such as support vector machines (SVMs) have been applied to drug-target prediction.
Unfortunately, these methods generally rely on measures of the maximum "local similarity" betweentwo protein sequences, which could mask important drug-protein interaction information since drugsare much smaller molecules than proteins and drug-target binding regions must comprise only smalllocal regions of the proteins. We therefore develop a novel sparse learning method that considers setsof short peptides. Our method integrates feature selection, multi-instance learning, and Gaussiankernelization into an L1 norm support vector machine classiﬁer. Experimental results show that it notonly outperformed the previous methods but also pointed to an optimal subset of potential bindingregions. Supplementary materials are available at "www.cs.ualberta.ca/ ys3/drug_target".
Keywords: Drug-target interaction; SVM; Sparse learning; Kernelization.
Proteins operate in highly interconnected networks ("interactome networks") that play a cen-tral role in governing cell functions. If a protein's conformation is changed, its function canbe altered, thus aﬀecting cell function. Drugs are small molecules that bind to target pro-teins to intensionally change the protein conformation, ultimately achieving treatment eﬀects.
The function of many classes of pharmaceutically useful protein targets, such as enzymes, ionchannels, G protein coupled receptors (GPCRs), and nuclear receptors, can be modulated byligand interaction. Identifying interaction between ligands and proteins is therefore a key togenomic drug discovery.
Various high-throughput technologies for analyzing the genome, the transcriptome, and the proteome have enhanced our understanding of the space populated by protein classes.
Meanwhile, the development of high-throughput screening technology has enabled broaderexploration of the space of chemical compounds.1–3 The goal of the chemical genomics re-search is to identify potentially useful compounds, such as imaging probes and drug leads,by relating the chemical space to the genomic space. Unfortunately, our understanding of therelationship between the chemical and the genomic spaces remains insuﬃcient. For example,the PubChem database at NCBI4 contains information of millions of chemical compounds,but the number of compounds with known target proteins is limited. The lack of documentedprotein-chemical interactions suggests that many remain to be discovered, which motivates the need for improved methods for inferring potential drug-target interactions automaticallyand eﬃciently. To facilitate the study of protein-chemical interactions, Kuhn et al. created aprotein-chemical interaction database called STITCH,5 which, up to now, contains interactionsfor between 300,000 small molecules and 2.6 million proteins from 1,133 organisms.
By elucidating the interaction between proteins and drug molecules, 3D-structure based "docking analysis" has been the principle method for drug discovery.6–8 In docking analysis,drug-protein binding aﬃnities are modeled by non-covalent intermolecular interactions, such ashydrogen bonding, electrostatic interactions, hydrophobic and Van der Waals forces. Throughestablishing equations that model the physical interaction between a receptor and potentialligand, the potential energy of binding can be calculated. There are many docking softwaretools available, including DOCK,8 GOLD,6 and AutoDock.7 All these methods require com-plete 3D structural information for the target, which might not be available in practice. Sucha major disadvantage makes docking analyses infeasible for genome wide application.
Given the diﬃculty of experimental determination of compound-protein interactions,9,10 there is a signiﬁcant motivation to develop eﬀective in silico prediction methods that canprovide both new predictions for experimental veriﬁcation and supporting evidence for exper-imental results. To predict compound-protein interactions various computational approacheshave been developed. Keiser et al.11 propose using the known structure of a set of ligandsto predict target protein families. This method does not take advantage of available pro-tein sequence information, and is thus limited to those between known ligands and proteinfamilies. Campillos et al.12 propose predicting drug-target interaction based on similaritiesbetween side-eﬀects of known drugs. Some results of this approach have been veriﬁed by invitro binding assays, but the approach remains limited to predictions involving drugs withknown side-eﬀects. Yamanishi et al.13 have investigated the relationship between drug chem-ical structure, target protein sequence, and drug-target network topology, and developed aregression-based learning method for predicting unknown drug-target interactions. In partic-ular, they integrated the chemical and the genomic spaces into a uniﬁed space, referred toas the "pharmacological space", wherein chemical-chemical, protein-protein, and chemical-protein similarities can be modeled. Perlman et al.14 used a combination of Smith-Watermanscore, protein-protein interaction, and Gene Ontology information to measure the gene-genesimilarity (similarity between targets), but these ancillary information is not always availablemaking the prediction hard to extend to general case, and the way of combining diﬀerentinformation sources is somehow tricky.
Most recently, classiﬁcation methods have been adopted in drug-target prediction.15–17 These methods ﬁrstly calculate the similarities between targets and/or drugs, then use thesesimilarities to construct kernel matrices for the classiﬁers, such as the support vector machines(SVMs) for predicting novel drug-target interactions. The prediction can be cast into twoways, one for drug side or drug-to-target and the other for target side. For drug-to-targetprediction, drug-drug similarities are ﬁrst obtained, based on structural or pharmacologicalinformation; then a bipartite known drug-target interaction graph is constructed; for a newdrug with known structural or pharmacological information, its similarities to known drugs arecalculated to predict its interactions with known targets using the bipartite interaction graph.
Similarly for target-to-drug prediction, target-target similarities are ﬁrst obtained using theprimary amino acid sequences;13,17,18 then for a new target with known primary sequence, itssimilarities to known targets are calculated to predict its interactions with known drugs againusing the bipartite interaction graph.
It should be pointed out that in the state-of-the-art works of target-to-drug prediction, the target-target similarity is deﬁned out of the normalized Smith-Waterman score.17 ThisS-W score measures the maximum "local similarity" between two protein sequences,19 thusreasonable, but the local similarity still uses the whole sequences and consequently mightinvolve long substrings, which is unreasonable. In fact, long substrings could mask importantinteraction information, since drugs are usually much smaller molecules than proteins and thedrug-target binding sites mostly comprise of only small local regions of the target.
In this work, we focus on the latter target-to-drug prediction to address the issues in the existing works. We ﬁrst attempt to identify key local binding regions from the common shortsubstrings shared by proteins that interact with the same drug. These key short substringsare then used to construct a vector representation for a protein sequence, to be used in thetraining and testing phases of a classiﬁer. The use of key short substrings (i.e. potential bindingregions) as features for the targets is a more direct and meaningful representation for druginteraction prediction. Additionally, the explicit vector representation of targets, as opposedto assessing similarity based on the S-W score, maps the targets into higher dimensionalspaces, thus increasing the eﬀectiveness of kernel-based classiﬁers. We remark that our use ofcommon short substrings diﬀer from the substring composition representation for proteins,15which uses all substrings while disregarding whether interactions exist.
The rest of the paper is organized as follows. In Section 2, we introduce the details of our prediction method, in which we focus on the SVM classiﬁers. We demonstrate in Section 3the performance of our method compared against the existing ones. Lastly, in Section 4, wediscuss the advantages and disadvantages of our method and propose future work.
2. Methods
The drug-target interaction prediction framework is the same as in Bleakley et al.,17 in whichwe assume a dataset containing m drugs d1, d2, . . , dm and n targets t1, t2, . . , tn, and the binaryindicator on whether or not drug di interacts target tj. The goal is to predict which of thedrugs a new target tc will interact.
2.1. Target Vectorization
In the bipartite local model (BLM) by Bleakley et al.,17 to which our method will compareagainst, the similarity between two targets t and t′ is deﬁned as the normalized Smith-Waterman score:17 SW (t, t′) s(t, t′) = √ SW (t, t) SW (t′, t′) where SW (·, ·) denotes the original Smith-Waterman score.19 As we mentioned in the intro-duction, such a similarity measure might overlook the key short sequence regions to which adrug binds.
To address this issue, we want to identify the common short substrings of the targets that interact the same drug. We consider one drug, say di, at a time. From the dataset, weﬁrst retrieve the set of targets Ti = {ti1, ti2, . . , tin } interacting with d i. By including the new target tc, we obtain another set Ti ∪ {tc}. Using a substring length lower bound, we computefor each of the two sets Ti and Ti ∪ {tc} the multi-set of pairwise maximal common substrings,denoted as withoutSS = {si1, si2, . . , siq′} and withSS = {si1, si2, . . , sip′}, respectively. In eachof the two multi-sets, if two substrings diﬀer at at most one position, they are merged intoone and their frequencies are summed together. This way, we obtain two reduced sets with-outSS = {si1, si2, . . , siq} and withSS = {si1, si2, . . , sip}, containing q and p unique substringsrespectively, and each substring is associated with its number of occurences.
Using the substrings in set withSS and their occurrences, we can map the n training targets and the new target tc into the p dimensional Euclidean space, where each substring representsa dimension and the coordinate of target t in dimension s is calculated as the normalizedmatch score between t and s in set withSS : L(t, s) · cs M (t, s) = ∑p i=1 si where L(·, ·) is length of the longest common substring between the two sequences and cs isthe number of occurrence of substring s. Intuitively, if target tc contains a long substring thatis also frequent in the binding targets, then its match score for this feature substring willbe high indicating a high likelihood of binding. We use (M (t, s1), M (t, s2), . . , M (t, sp)) as thevector representation for target t.
This way we obtain an n × p training matrix X, where each row represents a training target, and a p × 1 testing vector xc representing the new target tc, along with the n × 1 binary
training label vector y (with 1 indicating the target interacts with drug di and −1 otherwise).
The task is to construct a classiﬁer to return 1 if the new target tc interacts with drug di, or
−1 otherwise.
The classiﬁcation problem can be analogously formulated using set withoutSS substring set. Next we show how to construct a classiﬁer from the training data.
2.2. Classification with Feature Selection
In any classiﬁcation problem, the quality of features used determines the accuracy of pre-dictions. Here, features correspond to substrings of target proteins, which comprise potentialbinding regions between the proteins and drugs. Thus, selecting good features not only im-proves classiﬁcation accuracy, but also provides candidate drug-target binding sites for furtherinvestigation. We investigated an approach that integrates feature selection in L1-norm basedsupport vector machine (SVM) classiﬁcation method.
The primal form of L1-norm SVM is: min β∥w∥1 + 1T ξ
ξ ≥ 1 − △(y)(Xw − b1),
ξ ≥ 0.
where △(y) denotes putting the vector y on the main diagonal of a square matrix. Here
X ∈ Rn×p, y ∈ {+1, −1}n, n is the number of data points (targets), and p is the number of
features. Since by Micchelli et al.20
w∥1 = min
+ γj) = min (wT △(γ)−1w + γT 1),
(wT △(γ)−1w + γT 1) + 1T ξ
ξ ≥ 1 − △(y)(Xw − b1),
ξ ≥ 0, γ ≥ 0.
By introducing Lagrangian multipliers λ ≥ 0 and µ ≥ 0, (4) becomes
(wT △(γ)−1w + γT 1) + 1T ξ + λT (1 − △(y)(Xw − b1) − ξ) − µT ξ
λ ≥ 0, µ ≥ 0, γ ≥ 0.
Let the objective function of (5) be L1, and let ∂L1 = 0, we get λ = 1 − µ. Therefore, since
µ ≥ 0, we conclude that λ ≤ 1, hence 0 ≤ λ ≤ 1. By substitution, (5) becomes
(wT △(γ)−1w + γT 1) + λT 1 − λT △(y)Xw + bλT △(y)1
0 ≤ λ ≤ 1,
γ ≥ 0.
Let the objective function of (6) be L2, and let ∂L2 = 0. We get λT y = 0, so (6) becomes
(wT △(γ)−1w + γT 1) + λT 1 − λT △(y)Xw
0 ≤ λ ≤ 1,
λT y = 0,
γ ≥ 0.
Let the objective function of (7) be L3, and let ∂L3 = 0, we get β△(γ)−1w − XT △(y)λ = 0,
so that w = 1 △(γ)XT △(y)λ. By substitution, (7) becomes
min max λT 1 − 1 λT △(y)X△(γ)XT △(y)λ +
0 ≤ λ ≤ 1,
λT y = 0,
γ ≥ 0.
Note that γ is the feature selection vector. Crucially, this problem is convex in γ and has no
local minima,21 hence it provides an optimal form of feature selection that can be eﬃcientlyobtained in conjunction with SVM training. Because a drug may bind to diﬀerent regionsof diﬀerent proteins, i.e., diﬀerent regions on diﬀerent targets can bind to the same drug, each positive data point may correspond to a diﬀerent set of important features (substrings).
Therefore, the nature of this drug-target classiﬁcation problem is essentially a multi-instanceclassiﬁcation problem. To address this, we consider two ideas: Idea (a) Use a radial basis function (RBF) kernel (Gaussian kernel), rather than a linear
kernel since this addresses the multi-instance classiﬁcation problem more eﬀectively afterimplicitly mapping data points to an inﬁnite dimensional space. After Gaussian kernelization, the original linear kernel matrix K = X△(γ)XT becomes K′ = e
Idea (b) Because each positive data point may correspond to a unique set of important
features, in principle each positive example xi should employ its own feature selection vector γ+
while all negative examples should share a same vector γ−. So we get K′′ = e
for all i and j, where γi = γ+ if y
i = +1, and γi = γ− if yi = −1. Here ⊙ stands for element-wise
Idea (a) can be easily applied to (8) at the sacriﬁce of convexity, while applying Idea (b) to (8) will introduce too many extra coeﬃcients which makes the model computationally
expensive. To circumvent these issues, we introduce an eﬃcient approach to re-weight the
features. Intuitively, we wish to down-weight the features that are false positive indicator of
binding, i.e. features that have a high score/value at some negative training examples (not
bind). This motivation is similar to the case in multi-instance learning, where false positive
indicators call for more strict control than true positive indicators. Towards this end, we
introduce a p-dimensional weight vector c corresponding to the p features, and re-scale the
feature matrix X by ˜
X = X△(c). A simple formula of c that concretizes our intuition is
ij , where aij = 1 if xij ≤ 1 − ϵ and yi = 1, and aij = 0 otherwise. Here ϵ is a small positive number, and all elements in X are assumed to have been normalized to [0, 1].
Therefore by replacing X with ˜ X in (8), we encourage using features that indicate less false positive, and formally we obtain min max λT 1 − 1 λT △(y)K′△(y)λ +
0 ≤ λ ≤ 1,
λT y = 0,
γ ≥ 0,
where K′ = exp −1 (˜
i − ˜
xi − ˜
We solve (9) by using a combination of L-BFGS-B (Limited-memory Broyden-Fletcher- Goldfarb-Shanno Bounded Optimization) and gradient decent method over γ. After optimiza-
tion, we get solutions for γ and λ. γ serves as a useful feature selector, with γj > ϵ indicating
the j's features should be selected and otherwise not. λ can be used to construct the hyper-
plane in the SVM and to predict new data points. Given a test data point (target) xc, we can
predict its label (binding to the drug or not) based on the sign of the classiﬁer's output:
xc − ˜
xc − ˜
xi) − b.
As a key step for solving (9), we need the partial derivative of the objective function in (9) (denoted as L4) with respect to the k's feature selector γk: xik − ˜ xjk)2 = exp xi − ˜
xi − ˜
xik − ˜ xjk)2 . 3. Experimental Results and Discussion
3.1. Methods under Comparison
We compared our method with the state-of-the-art method proposed by Bleakley et al.17 Inparticular, we focused on target-side prediction of their method to make the two approachescomparable. Bleakley et al.17 used the normalized Smith-Waterman score in (1) to evaluatethe similarity between two target sequences. In the context of SVM classiﬁcation, they usedthis target-target similarity matrix as the kernel matrix, i.e. the kernel matrix was ﬁxed intheir method. Based on this similarity measure, nearest neighbor (NN) classiﬁers can also beconstructed as a baseline. We refer to Bleakley et al.'s approach as BLM SVM and BLM NNrespectively. On the other hand, our methods include: • SS L1-SVM: L1-SVM with withSS feature (the main model of this paper),• SS L1-SVM: the classic L2 norm SVM with withSS feature,• SS NN FS: nearest neighbor classiﬁer based on the features selected by SS L1-SVM,• SS NN noFS: nearest neighbor classiﬁer based on all withSS features.
3.2. Experiment Settings
The framework of our experiment is similar to Bleakley et al.17 Speciﬁcally, we enumeratedall pairs of drug di and protein t in the whole data set. For each (di, t) pair, we treated t asthe single test example while the remaining proteins were used as training examples. To learnan L1 and L2 SVM, we chose the hyper-parameters (e.g. β and ϵ) by using three-fold crossvalidation on the training set, making sure that all the three folders contain at least one targetthat binds to the drug (i.e., at least one positive example). After the classiﬁcation model waslearned, we applied it to protein t in a way like (10), and obtained a score yit that could besubsequently used to compute useful performance measures (see Section 3.4). All yit calculatedby ranging over all drugs di and target t in the data set constituted a drug-by-target scoretable.
We set the minimum length of a feature sub-sequence to 5 after trying all lengths from 4 to 12, noting that a too small cutoﬀ generated excessively many features while a too bigcutoﬀ generated an insuﬃcient number of features.
We used drug-target interaction information from Bleakley et al.,17 which was collected fromthe KEGG BRITE,2 BRENDA,22 SuperTarget23 and DrugBank24 databases. In particular, we The precision-recall curves of the four methods SS L1-SVM, SS SVM, BLM SVM, BLM NN, SS NN FS, SS NN noFS on three data set. The results are based on training data with drug interactingwith at least 2 targets.
used three data sets—nuclear receptors, GPCRS, and ion channel—which have 54, 223, 210drugs, 26, 95, 204 targets, and 90, 635, 1476 interactions, respectively. The three data setsused in this article are identical to those used in the state-of-the-art study,17 which facilitatesa fair comparison between the two methods. Since we only focused on target-side prediction,we did not require any drug structural or pharmacological information to obtain drug-drugsimilarity information. The amino acid sequences of the target proteins were obtained fromthe KEGG GENES database.2 3.4. Classification results
We used ﬁve measurements to evaluate the quality of drug-target prediction: Area under thePrecision-Recall Curve (AUPR), Area under the ROC Curve (AUC), F-Measure, Precision(or Speciﬁcation), and Recall (or Sensitivity). With the prediction score table yit availablefrom Section 3.2, these performance measures were all computed in a micro-average fashion.
That is, given a cutoﬀ point, all yit could be converted into a binary label via thresholding(i.e., binding or not). By comparing these labels with the ground truth over the whole drug-by-target score table, we derived the number of false positive and false negative, which led toPrecision, Recall, and F-Measure. The AUPR and AUC were derived by increasing the cutoﬀswith a ﬁne step size, which led to thousands of points in the precision-recall curve. Of the ﬁvemeasurements, AUPR, AUC, and F-Measure are more robust than the others.
We only demonstrate the results based on withSS feature because the withoutSS feature set did not result in as good performance. Tables 1, 2, and 3 demonstrate the eﬀectivenessof the diﬀerent drug-target prediction methods over the ﬁve evaluation quantities. The F-Measure, Precision, and Recall scores reported in these tables were obtained at the cutoﬀ pointwhere F-Measure was maximized for respective methods. Figure 1 demonstrates the precision-recall curves of SS L1-SVM and SS SVM compared to BLM SVM, BLM NN, SS NN FS, andSS NN noFS on three data sets, namely Nuclear, GPCR, and Ion Channel from left to right.
Based on these evaluation, the SVM approaches that use withSS feature set (i.e., SS L1-SVM and SS SVM) outperform the current state-of-the-art methods BLM SVM andBLM NN. Moreover, the L1 norm feature selection method SS L1-SVM is more eﬀective thanthe traditional SVM method; it uses only 72.85%, 85.02%, and 62.86% of the original features Evaluations of classiﬁcation quality on Nuclear data set. The F-Measure, Precision, and Recall scores were obtained at the cutoﬀ point where F-Measure wasmaximized for respective prediction methods.
Evaluations of classiﬁcation quality on GPCR data set. The F-Measure, Precision, and Recall scores were obtained at the cutoﬀ point where F-Measure wasmaximized for respective prediction methods.
Evaluations of classiﬁcation quality on Ion data set. The F-Measure, Preci- sion, and Recall scores were obtained at the cutoﬀ point where F-Measure was maxi-mized for respective prediction methods.
in the Nuclear, GPCR, and Ion Channels datasets, respectively. The signiﬁcant reduction infeature set size can not only make the classiﬁcation more eﬃcient and eﬀective, it can alsohelp biological practitioners to identify important features more accurately.
We further investigated the prediction result generated by the SS L1-SVM method and the BLM SVM method. At the prediction cutoﬀ where both methods attained their ownmaximum F-Measure score, there were 8, 127, and 78 true positive interactions that SS L1-SVM managed to identify but were missed by BLM SVM. This was in comparison to 7, 16, 52 true positives that were identiﬁed by BLM SVM but not by SS L1-SVM. False positive isanother important measurement of a method. On the three datasets Nuclear, GPCR, and IonChannels, SS L1-SVM generated 0, 73, and 139 false positive interactions, compared to 2, 85,117 false positive interactions generated by BLM SVM.
Some interesting case studies are in order. On the Nuclear dataset, the two nearest neigh- bors of the target protein RORB (KEGG Homo sapiens protein ID "hsa6096") under the nor-malized Smith-Waterman score are RORA ("hsa6095") and RORC ("hsa6097"), with scores0.578 and 0.458 respectively. RORB and RORC share a common interacting drug Tretinoin(KEGG drug ID "D00094") while RORB and RORA do not. According to the BLM method,RORB will be predicted to have no interaction with Tretinoin because its nearest neighborRORA does not interact with Tretinoin. On the contrary, our method can correctly identifythe interaction between RORB and Tretinoin because the withSS feature set based methodcan discover two important substrings "EVVLVRMCRA-N" and "N-TV-FEGKYGGM" thatexist in both RORB and RORC. Therefore, although the overall match score between RORBand RORC is not the highest, their feature vectors (with respect to the two feature substrings)are the most similar.
On the GPCR dataset, the ﬁve nearest neighbors of the target protein CHRM1 (KEGG Homo sapiens protein ID "hsa1128") under the normalized Smith-Waterman scores areCHRM5 ("hsa1133"), CHRM3 ("hsa1131"), CHRM4 ("hsa1132"), CHRM2 ("hsa1129"),and HRH3 ("hsa11255"), with scores 0.4707, 0.4536, 0.4237, 0.4228, and 0.2446 respec-tively. Although CHRM1 is supposed to bind to drug Metoclopramide (KEGG drug ID"D00726"), none of its ﬁve nearest neighbors bind to this drug. In fact binding occursonly with the 6-th nearest neighbor, HRH2 ("hsa3274"), whose SWnorm score with re-spect to CHRM1 is 0.2137. Therefore, the BLM methods can hardly predict that CHRM1binds to Metoclopramide. In contrast, our method can correctly predict this interac-tion because the important substrings such as "KRTPRRAA", "Y-AKRTP-RAA-MI-L-W", and "NYFL-SLA-AD" are present in both CHRM1 and several proteins that bindto Metoclopramide, e.g., HTR1A ("hsa3350"), HTR1B ("hsa3351"), HTR1D ("hsa3352"),HTR1E ("hsa3354"), HTR1F ("hsa3355"), HTR2A ("hsa3356"), HTR2B ("hsa3357"),HTR2C("hsa3358"), HTR4("hsa3360"), HTR5A("hsa3361"), and HTR6("hsa3362"), whichare all considered as faraway neighbors according to the SWnorm scores.
The binding regions discovered by our computational model can also be found on the Ion dataset. To provide potential drug-target binding regions for further investigation, weproduced all the important common substrings selected by the SS L1-SVM method, whichare made available online at "http://www.cs.ualberta.ca/ ys3/drug_target".
In this article, we proposed a novel drug-target interaction prediction method based on poten-tial drug-target binding regions. According to the evaluation metrics, the proposed methodsigniﬁcantly outperformed the current state-of-the-art methods. More importantly, it identi-ﬁed a number of drug-target interactions that were missed by previous methods. We believethat the poor recall of previous methods is due to the use of a target kernel matrix based on Smith-Waterman score: a low overall similarity between two protein sequences does notmean they do not share common drug binding regions. This drawback was overcome in ourapproach by collecting a large number of candidate binding regions (i.e., common substrings)that subsequently played the primary role in interaction prediction. In addition, the use of anexplicit vector representation, as opposed to implicit similarity measure, enabled the easy useof non-linear kernel expansions that were not possible for ﬁxed kernel methods like BLM.
Besides the kernels based on substring feature vector, we believe the techniques of string kernel proposed in25 could be useful in this problem. One straightforward beneﬁt of using thestring kernel is that it will automatically consider all substrings of a given sequence pair. It canalso provide more intuitive understanding of substring-based sequence similarities than usingGaussian kernel. However, to employ the string kernel, one needs to customize the featureselection function into the model and to extend the non-gapped matching in string kernels.
We presented a feature selection method based on L1-norm SVM that could not only pre- dict the binding relations more accurately, but also ﬁnd important candidate binding regions(features). It integrated feature selection directly into L1-norm SVM and kernelized the op-timization model. A drawback was that the sparse regularization term tended to select onlya single feature from the candidate set. This is a well known problem with L1 based regu-larization.26 To avoid this limitation, we will investigate a combination of L1 and L2 normregularizers, known as the elastic net,26 which is generally more eﬀective at group featureselection. Another possible extension is to adopt the OSCAR model,27 which appears evenmore eﬀective. We also discovered that the inference problem of drug-target interaction—insome cases—can be considered as a multi-instance learning problem. So we proposed usingmultiple feature selection vectors for each positive training example in theory and applied thefeature cost vector to address the multi-instance problem in practice. We hypothesize thatmore advanced machine learning methods speciﬁcally tailored for multi-instance classiﬁcationcan further improve the accuracy of drug-target interaction prediction. Moreover, consideringthat protein 3D structures carry the essential binding information and an increasing amount ofprotein 3D structure is being made available (e.g., PSI Nature Structural Biology Knowledge-base28), incorporating protein 3D information in the prediction model in addition to sequenceinformation would lead to promising improvement.
1. C. M. Dobson, Chemical space and biology, Nature 432, 824 (2004).
2. M. Kanehisa, S. Goto, M. Hattori and et al., From genomics to chemical genomics: New devel-
opments in kegg, Nucleic Acids Res. 34, D354 (2006).
3. B. R. Stockwell, Chemical genetics: Ligand-based discovery of gene function, Nat. Rev. Genet. 1, 116 (2000).
4. D. L. Wheeler, T. Barrett, D. A. Benson and et al., Database resources of the national center for biotechnology information, Nucleic Acids Res. 34, D173 (2006).
5. M. Kuhn, D. Szklarczyk, A. Franceschini, C. von Mering, L. J. Jensen and P. Bork, Stitch 3: zooming in on protein-chemical interactions, Nucleic Acids Res. 40, 876 (2012).
6. G. Jones, P. Willett, R. C. Glen and et al., Development and validation for a genetic algorithm for ﬂexible docking, J. Mol. Biol. 267, 727 (1997).
7. G. M. Morris, D. S. Goodsell, R. S. Halliday and et al., Automated docking using a lamarckian genetic algorithm and empirical binding free energy function, J. Comput. Chem. 19, 1639 (1998).
8. B. K. Shoichet, D. L. Bodian and I. D. Kuntz, Molecular docking using shape descriptors, J. Comput. Chem. 13, 380 (1992).
9. S. J. Haggarty, K. M. Koeller, J. C. Wong and et al., Multidimensional chemical genetic analysis of diversity-oriented synthesis-derived deacetylase inhibitors using cell-based assays, Chem. Biol.
10, 383 (2003).
10. F. G. Kuruvilla, A. F. Shamji, S. M. Sternson and et al., Dissecting glucose signalling with diversity-oriented synthesis and small-molecule microarrays, Nature 416, 653 (2002).
11. M. J. Keiser, B. L. Roth, B. N. Armbruster and et al., Relating protein pharmacology by ligand chemistry, Nat. Biotech. 25, 197 (2007).
12. M. Campillos, M. Kuhn, A. C. Gavin and et al., Drug target identiﬁcation using side-eﬀect similarity, Science 321, 263 (2008).
13. Y. Yamanishi, M. Araki, A. Gutteridge and et al., Prediction of drug-target interaction networks from the integration of chemical and genomic spaces, Bioinformatics 24, i232 (2008).
14. L. Perlman, A. Gottlieb, N. Atias, E. Ruppin and R. Sharan, Combining drug and gene similarity measures for drug-target elucidation, J. Comput. Biol. 18(2), 133 (2011).
15. N. Nagamine and Y. Sakakibara, Statistical prediction of protein-chemical interactions based on chemical structure and mass spectrometry data, Bioinformatics 23, 2004 (2007).
16. L. Jacob and J. P. Vert, Protein-ligand interaction prediction: An improved chemogenomics approach, Bioinformatics 24, 2149 (2008).
17. K. Bleakley and Y. Yamanishi, Supervised prediction of drug-target interactions using bipartite local models, Bioinformatics 25, 2397 (2009).
18. Y. Yamanishi, M. Kotera, M. Kanehisa and et al., Drug-target interaction prediction from chem- ical, genomic and pharmacological data in an integrated framework, Bioinformatics 26, i246
(2010).
19. T. F. Smith and M. Waterman, Identiﬁcation of common molecular subsequences, J. Mol. Biol 147, 195 (1981).
20. C. Micchelli and M. Pontil, Learning the kernel function via regularization, J. Mach. Learn. Res. 6, 1099 (2006).
21. G. Lanckriet, M. Cristianini, P. Bartlett and et al., Learning the kernel matrix with semi-deﬁnite programming, J. Mach. Learn. Res. 5, 27 (2004).
22. I. Schomburg, A. Chang, C. Ebeling and et al., Brenda, the enzyme database: Updates and major new developments, Nucleic Acids Res. 32, D431 (2004).
23. S. Gunther, M. Kuhn, M. Dunkel and et al., Supertarget and matador: Resources for exploring drug-target relationships, Nucleic Acids Res. 36, D919 (2008).
24. D. S. Wishart, C. Knox, A. C. Guo and et al., Drugbank: A knowledgebase for drugs, drug actions and drug targets, Nucleic Acids Res. 36, D901 (2007).
25. S. V. N. Vishwanathan and A. J. Smola, Fast kernels for string and tree matching, NIPS 15,
26. H. Zou and T. Hastie, Regularization and variable selection via the elastic net, Journal of the Royal Statistical Society B 67, 301 (2005).
27. W. Zhong and J. Kwok, Eﬃcient sparse modeling with automatic feature grouping, ICML 28,
28. H. M. Berman, Psi nature structural biology knowledgebase (www.sbkb.org/kb/index.html)

Source: http://users.cecs.anu.edu.au/~xzhang/papers/Shietal13.pdf

revistes.ub.edu

2. La Plataforma de Afectados por la Hipoteca. De la Crisis a la Estafa. Del Prozac al Lluís Mangot Sala1 La Plataforma d'Afectats per la Hipoteca ha esdevingut el moviment més potent dels últims dos anys en termes d'impacte en les estructures polítiques i econòmiques, així com de presència als mitjans de comunicació de l'Estat. A partir d'una perspectiva que combina l'anàlisi estructural del context en què

ÿþm

Diet Coke Biocultural Case Studies Phenylketonuria, Cretinism and Phenylalanine pathways  Phenylalanine hydroxylase insufficiency causes phyenlketonuria (PKU) 1. Phenylalanine hydroxylase  Most common genetic abnormality in the U.S. (1:10,000 overall, about 1:2,500 Europeans)  Growth retardation, mental retardation, depigmentation of skin, hair