Users.cecs.anu.edu.au
PROTEIN-CHEMICAL INTERACTION PREDICTION VIA KERNELIZED
SPARSE LEARNING SVM
YI SHI*1, XINHUA ZHANG1, XIAOPING LIAO2, GUOHUI LIN1, DALE SCHUURMANS1
1
Department of Computing Science, University of Alberta,
Edmonton, Alberta T6G 2E8, Canada
2
Department of Agricultural, Food and Nutritional Science, University of Alberta,
Edmonton, Alberta T6G 2P5, Canada
Given the difficulty of experimental determination of drug-protein interactions, there is a significantmotivation to develop effective
in silico prediction methods that can provide both new predictions forexperimental verification and supporting evidence for experimental results. Most recently, classifica-tion methods such as support vector machines (SVMs) have been applied to drug-target prediction.
Unfortunately, these methods generally rely on measures of the maximum "local similarity" betweentwo protein sequences, which could mask important drug-protein interaction information since drugsare much smaller molecules than proteins and drug-target binding regions must comprise only smalllocal regions of the proteins. We therefore develop a novel sparse learning method that considers setsof short peptides. Our method integrates feature selection, multi-instance learning, and Gaussiankernelization into an
L1 norm support vector machine classifier. Experimental results show that it notonly outperformed the previous methods but also pointed to an optimal subset of potential bindingregions. Supplementary materials are available at "www.cs.ualberta.ca/ ys3/drug_target".
Keywords: Drug-target interaction; SVM; Sparse learning; Kernelization.
Proteins operate in highly interconnected networks ("interactome networks") that play a cen-tral role in governing cell functions. If a protein's conformation is changed, its function canbe altered, thus affecting cell function. Drugs are small molecules that bind to target pro-teins to intensionally change the protein conformation, ultimately achieving treatment effects.
The function of many classes of pharmaceutically useful protein targets, such as enzymes, ionchannels, G protein coupled receptors (GPCRs), and nuclear receptors, can be modulated byligand interaction. Identifying interaction between ligands and proteins is therefore a key togenomic drug discovery.
Various high-throughput technologies for analyzing the genome, the transcriptome, and
the proteome have enhanced our understanding of the space populated by protein classes.
Meanwhile, the development of high-throughput screening technology has enabled broaderexploration of the space of chemical compounds.1–3 The goal of the chemical genomics re-search is to identify potentially useful compounds, such as imaging probes and drug leads,by relating the chemical space to the genomic space. Unfortunately, our understanding of therelationship between the chemical and the genomic spaces remains insufficient. For example,the PubChem database at NCBI4 contains information of millions of chemical compounds,but the number of compounds with known target proteins is limited. The lack of documentedprotein-chemical interactions suggests that many remain to be discovered, which motivates
the need for improved methods for inferring potential drug-target interactions automaticallyand efficiently. To facilitate the study of protein-chemical interactions, Kuhn et al. created aprotein-chemical interaction database called STITCH,5 which, up to now, contains interactionsfor between 300,000 small molecules and 2.6 million proteins from 1,133 organisms.
By elucidating the interaction between proteins and drug molecules, 3D-structure based
"docking analysis" has been the principle method for drug discovery.6–8 In docking analysis,drug-protein binding affinities are modeled by non-covalent intermolecular interactions, such ashydrogen bonding, electrostatic interactions, hydrophobic and Van der Waals forces. Throughestablishing equations that model the physical interaction between a receptor and potentialligand, the potential energy of binding can be calculated. There are many docking softwaretools available, including DOCK,8 GOLD,6 and AutoDock.7 All these methods require com-plete 3D structural information for the target, which might not be available in practice. Sucha major disadvantage makes docking analyses infeasible for genome wide application.
Given the difficulty of experimental determination of compound-protein interactions,9,10
there is a significant motivation to develop effective
in silico prediction methods that canprovide both new predictions for experimental verification and supporting evidence for exper-imental results. To predict compound-protein interactions various computational approacheshave been developed. Keiser et al.11 propose using the known structure of a set of ligandsto predict target protein families. This method does not take advantage of available pro-tein sequence information, and is thus limited to those between known ligands and proteinfamilies. Campillos et al.12 propose predicting drug-target interaction based on similaritiesbetween side-effects of known drugs. Some results of this approach have been verified by
invitro binding assays, but the approach remains limited to predictions involving drugs withknown side-effects. Yamanishi et al.13 have investigated the relationship between drug chem-ical structure, target protein sequence, and drug-target network topology, and developed aregression-based learning method for predicting unknown drug-target interactions. In partic-ular, they integrated the chemical and the genomic spaces into a unified space, referred toas the "pharmacological space", wherein chemical-chemical, protein-protein, and chemical-protein similarities can be modeled. Perlman et al.14 used a combination of Smith-Watermanscore, protein-protein interaction, and Gene Ontology information to measure the gene-genesimilarity (similarity between targets), but these ancillary information is not always availablemaking the prediction hard to extend to general case, and the way of combining differentinformation sources is somehow tricky.
Most recently, classification methods have been adopted in drug-target prediction.15–17
These methods firstly calculate the similarities between targets and/or drugs, then use thesesimilarities to construct kernel matrices for the classifiers, such as the support vector machines(SVMs) for predicting novel drug-target interactions. The prediction can be cast into twoways, one for drug side or drug-to-target and the other for target side. For drug-to-targetprediction, drug-drug similarities are first obtained, based on structural or pharmacologicalinformation; then a bipartite known drug-target interaction graph is constructed; for a newdrug with known structural or pharmacological information, its similarities to known drugs arecalculated to predict its interactions with known targets using the bipartite interaction graph.
Similarly for target-to-drug prediction, target-target similarities are first obtained using theprimary amino acid sequences;13,17,18 then for a new target with known primary sequence, itssimilarities to known targets are calculated to predict its interactions with known drugs againusing the bipartite interaction graph.
It should be pointed out that in the state-of-the-art works of target-to-drug prediction,
the target-target similarity is defined out of the normalized Smith-Waterman score.17 ThisS-W score measures the maximum "local similarity" between two protein sequences,19 thusreasonable, but the local similarity still uses the whole sequences and consequently mightinvolve
long substrings, which is unreasonable. In fact, long substrings could mask importantinteraction information, since drugs are usually much smaller molecules than proteins and thedrug-target binding sites mostly comprise of only small local regions of the target.
In this work, we focus on the latter target-to-drug prediction to address the issues in the
existing works. We first attempt to identify key local binding regions from the
common shortsubstrings shared by proteins that interact with the same drug. These key short substringsare then used to construct a vector representation for a protein sequence, to be used in thetraining and testing phases of a classifier. The use of key short substrings (i.e. potential bindingregions) as features for the targets is a more direct and meaningful representation for druginteraction prediction. Additionally, the explicit vector representation of targets, as opposedto assessing similarity based on the S-W score, maps the targets into higher dimensionalspaces, thus increasing the effectiveness of kernel-based classifiers. We remark that our use ofcommon short substrings differ from the substring composition representation for proteins,15which uses all substrings while disregarding whether interactions exist.
The rest of the paper is organized as follows. In Section 2, we introduce the details of our
prediction method, in which we focus on the SVM classifiers. We demonstrate in Section 3the performance of our method compared against the existing ones. Lastly, in Section 4, wediscuss the advantages and disadvantages of our method and propose future work.
2. Methods
The drug-target interaction prediction framework is the same as in Bleakley et al.,17 in whichwe assume a dataset containing
m drugs
d1
, d2
, . . , dm and
n targets
t1
, t2
, . . , tn, and the binaryindicator on whether or not drug
di interacts target
tj. The goal is to predict which of thedrugs a new target
tc will interact.
2.1. Target Vectorization
In the
bipartite local model (BLM) by Bleakley et al.,17 to which our method will compareagainst, the similarity between two targets
t and
t′ is defined as the normalized Smith-Waterman score:17
SW (
t, t′)
s(
t, t′) = √
SW (
t, t)
SW (
t′, t′)
where
SW (
·, ·) denotes the original Smith-Waterman score.19 As we mentioned in the intro-duction, such a similarity measure might overlook the key short sequence regions to which adrug binds.
To address this issue, we want to identify the common short substrings of the targets
that interact the same drug. We consider one drug, say
di, at a time. From the dataset, wefirst retrieve the set of targets
Ti =
{ti1
, ti2
, . . , tin } interacting with
d
i. By including the new
target
tc, we obtain another set
Ti ∪ {tc}. Using a substring length lower bound, we computefor each of the two sets
Ti and
Ti ∪ {tc} the multi-set of pairwise maximal common substrings,denoted as
withoutSS =
{si1
, si2
, . . , siq′} and
withSS =
{si1
, si2
, . . , sip′}, respectively. In eachof the two multi-sets, if two substrings differ at at most one position, they are merged intoone and their frequencies are summed together. This way, we obtain two
reduced sets
with-outSS =
{si1
, si2
, . . , siq} and
withSS =
{si1
, si2
, . . , sip}, containing
q and
p unique substringsrespectively, and each substring is associated with its number of occurences.
Using the substrings in set
withSS and their occurrences, we can map the
n training targets
and the new target
tc into the
p dimensional Euclidean space, where each substring representsa dimension and the coordinate of target
t in dimension
s is calculated as the normalizedmatch score between
t and
s in set
withSS :
L(
t, s)
· cs
M (
t, s) = ∑
p
i=1
si
where
L(
·, ·) is length of the longest common substring between the two sequences and
cs isthe number of occurrence of substring
s. Intuitively, if target
tc contains a long substring thatis also frequent in the binding targets, then its match score for this feature substring willbe high indicating a high likelihood of binding. We use (
M (
t, s1)
, M (
t, s2)
, . . , M (
t, sp)) as thevector representation for target
t.
This way we obtain an
n × p training matrix
X, where each row represents a training
target, and a
p × 1 testing vector
xc representing the new target
tc, along with the
n × 1 binary
training label vector
y (with 1 indicating the target interacts with drug
di and
−1 otherwise).
The task is to construct a classifier to return 1 if the new target
tc interacts with drug
di, or
−1 otherwise.
The classification problem can be analogously formulated using set
withoutSS substring
set. Next we show how to construct a classifier from the training data.
2.2. Classification with Feature Selection
In any classification problem, the quality of features used determines the accuracy of pre-dictions. Here, features correspond to substrings of target proteins, which comprise potentialbinding regions between the proteins and drugs. Thus, selecting good features not only im-proves classification accuracy, but also provides candidate drug-target binding sites for furtherinvestigation. We investigated an approach that integrates feature selection in
L1-norm basedsupport vector machine (SVM) classification method.
The primal form of
L1-norm SVM is:
min
β∥w∥1 +
1T ξ
ξ ≥ 1 − △(
y)(
Xw − b1)
,
ξ ≥ 0.
where
△(
y) denotes putting the vector
y on the main diagonal of a square matrix. Here
X ∈ R
n×p,
y ∈ {+1
, −1
}n,
n is the number of data points (targets), and
p is the number of
features. Since by Micchelli et al.20
w∥1 = min
+
γj) = min (
wT △(
γ)
−1
w +
γT 1)
,
(
wT △(
γ)
−1
w +
γT 1) +
1T ξ
ξ ≥ 1 − △(
y)(
Xw − b1)
,
ξ ≥ 0, γ ≥ 0.
By introducing Lagrangian multipliers
λ ≥ 0 and
µ ≥ 0, (4) becomes
(
wT △(
γ)
−1
w +
γT 1) +
1T ξ +
λT (
1 − △(
y)(
Xw − b1)
− ξ)
− µT ξ
λ ≥ 0, µ ≥ 0, γ ≥ 0.
Let the objective function of (5) be
L1, and let
∂L1 = 0, we get
λ =
1 − µ. Therefore, since
µ ≥ 0, we conclude that
λ ≤ 1, hence
0 ≤ λ ≤ 1. By substitution, (5) becomes
(
wT △(
γ)
−1
w +
γT 1) +
λT 1 − λT △(
y)
Xw +
bλT △(
y)
1
0 ≤ λ ≤ 1,
γ ≥ 0.
Let the objective function of (6) be
L2, and let
∂L2 = 0. We get
λT y = 0, so (6) becomes
(
wT △(
γ)
−1
w +
γT 1) +
λT 1 − λT △(
y)
Xw
0 ≤ λ ≤ 1,
λT y = 0
,
γ ≥ 0.
Let the objective function of (7) be
L3, and let
∂L3 = 0, we get
β△(
γ)
−1
w − XT △(
y)
λ =
0,
so that
w = 1
△(
γ)
XT △(
y)
λ. By substitution, (7) becomes
min max
λT 1 − 1
λT △(
y)
X△(
γ)
XT △(
y)
λ +
0 ≤ λ ≤ 1,
λT y = 0
,
γ ≥ 0.
Note that
γ is the feature selection vector. Crucially, this problem is convex in
γ and has no
local minima,21 hence it provides an optimal form of feature selection that can be efficientlyobtained in conjunction with SVM training. Because a drug may bind to different regionsof different proteins, i.e., different regions on different targets can bind to the same drug,
each positive data point may correspond to a different set of important features (substrings).
Therefore, the nature of this drug-target classification problem is essentially a multi-instanceclassification problem. To address this, we consider two ideas:
Idea (a) Use a radial basis function (RBF) kernel (Gaussian kernel), rather than a linear
kernel since this addresses the multi-instance classification problem more effectively afterimplicitly mapping data points to an infinite dimensional space. After Gaussian kernelization,
the original linear kernel matrix
K =
X△(
γ)
XT becomes
K′ =
e
Idea (b) Because each positive data point may correspond to a unique set of important
features, in principle each positive example
xi should employ its own feature selection vector
γ+
while all negative examples should share a same vector
γ−. So we get
K′′ =
e
for all
i and
j, where
γi =
γ+ if
y
i = +1, and
γi =
γ− if
yi =
−1. Here
⊙ stands for element-wise
Idea (a) can be easily applied to (8) at the sacrifice of convexity, while applying Idea
(b) to (8) will introduce too many extra coefficients which makes the model computationally
expensive. To circumvent these issues, we introduce an efficient approach to re-weight the
features. Intuitively, we wish to down-weight the features that are false positive indicator of
binding, i.e. features that have a high score/value at some negative training examples (not
bind). This motivation is similar to the case in multi-instance learning, where false positive
indicators call for more strict control than true positive indicators. Towards this end, we
introduce a
p-dimensional weight vector
c corresponding to the
p features, and re-scale the
feature matrix
X by ˜
X =
X△(
c). A simple formula of
c that concretizes our intuition is
ij , where
aij = 1 if
xij ≤ 1
− ϵ and
yi = 1, and
aij = 0 otherwise. Here
ϵ is a
small positive number, and all elements in
X are assumed to have been normalized to [0
, 1].
Therefore by replacing
X with ˜
X in (8), we encourage using features that indicate less false
positive, and formally we obtain
min max
λT 1 − 1
λT △(
y)
K′△(
y)
λ +
0 ≤ λ ≤ 1,
λT y = 0
,
γ ≥ 0,
where
K′ = exp
−1 (
˜
i − ˜
xi − ˜
We solve (9) by using a combination of L-BFGS-B (Limited-memory Broyden-Fletcher-
Goldfarb-Shanno Bounded Optimization) and gradient decent method over
γ. After optimiza-
tion, we get solutions for
γ and
λ.
γ serves as a useful feature selector, with
γj > ϵ indicating
the
j's features should be selected and otherwise not.
λ can be used to construct the hyper-
plane in the SVM and to predict new data points. Given a test data point (target)
xc, we can
predict its label (binding to the drug or not) based on the sign of the classifier's output:
xc − ˜
xc − ˜
xi)
− b.
As a key step for solving (9), we need the partial derivative of the objective function in (9)
(denoted as
L4) with respect to the
k's feature selector
γk:
xik − ˜
xjk)2 = exp
xi − ˜
xi − ˜
xik − ˜
xjk)2
.
3. Experimental Results and Discussion
3.1. Methods under Comparison
We compared our method with the state-of-the-art method proposed by Bleakley et al.17 Inparticular, we focused on target-side prediction of their method to make the two approachescomparable. Bleakley et al.17 used the normalized Smith-Waterman score in (1) to evaluatethe similarity between two target sequences. In the context of SVM classification, they usedthis target-target similarity matrix as the kernel matrix, i.e. the kernel matrix was fixed intheir method. Based on this similarity measure, nearest neighbor (NN) classifiers can also beconstructed as a baseline. We refer to Bleakley et al.'s approach as BLM SVM and BLM NNrespectively. On the other hand, our methods include:
• SS L1-SVM: L1-SVM with
withSS feature (the main model of this paper),
• SS L1-SVM: the classic L2 norm SVM with
withSS feature,
• SS NN FS: nearest neighbor classifier based on the features selected by SS L1-SVM,
• SS NN noFS: nearest neighbor classifier based on all
withSS features.
3.2. Experiment Settings
The framework of our experiment is similar to Bleakley et al.17 Specifically, we enumeratedall pairs of drug
di and protein
t in the whole data set. For each (
di, t) pair, we treated
t asthe single test example while the remaining proteins were used as training examples. To learnan L1 and L2 SVM, we chose the hyper-parameters (e.g.
β and
ϵ) by using three-fold crossvalidation on the training set, making sure that all the three folders contain at least one targetthat binds to the drug (i.e., at least one positive example). After the classification model waslearned, we applied it to protein
t in a way like (10), and obtained a score
yit that could besubsequently used to compute useful performance measures (see Section 3.4). All
yit calculatedby ranging over all drugs
di and target
t in the data set constituted a drug-by-target scoretable.
We set the minimum length of a feature sub-sequence to 5 after trying all lengths from
4 to 12, noting that a too small cutoff generated excessively many features while a too bigcutoff generated an insufficient number of features.
We used drug-target interaction information from Bleakley et al.,17 which was collected fromthe KEGG BRITE,2 BRENDA,22 SuperTarget23 and DrugBank24 databases. In particular, we
The precision-recall curves of the four methods SS L1-SVM, SS SVM, BLM SVM, BLM NN,
SS NN FS, SS NN noFS on three data set. The results are based on training data with drug interactingwith at least 2 targets.
used three data sets—nuclear receptors, GPCRS, and ion channel—which have 54, 223, 210drugs, 26, 95, 204 targets, and 90, 635, 1476 interactions, respectively. The three data setsused in this article are identical to those used in the state-of-the-art study,17 which facilitatesa fair comparison between the two methods. Since we only focused on target-side prediction,we did not require any drug structural or pharmacological information to obtain drug-drugsimilarity information. The amino acid sequences of the target proteins were obtained fromthe KEGG GENES database.2
3.4. Classification results
We used five measurements to evaluate the quality of drug-target prediction: Area under thePrecision-Recall Curve (AUPR), Area under the ROC Curve (AUC), F-Measure, Precision(or Specification), and Recall (or Sensitivity). With the prediction score table
yit availablefrom Section 3.2, these performance measures were all computed in a micro-average fashion.
That is, given a cutoff point, all
yit could be converted into a binary label via thresholding(i.e., binding or not). By comparing these labels with the ground truth over the whole drug-by-target score table, we derived the number of false positive and false negative, which led toPrecision, Recall, and F-Measure. The AUPR and AUC were derived by increasing the cutoffswith a fine step size, which led to thousands of points in the precision-recall curve. Of the fivemeasurements, AUPR, AUC, and F-Measure are more robust than the others.
We only demonstrate the results based on
withSS feature because the
withoutSS feature
set did not result in as good performance. Tables 1, 2, and 3 demonstrate the effectivenessof the different drug-target prediction methods over the five evaluation quantities. The F-Measure, Precision, and Recall scores reported in these tables were obtained at the cutoff pointwhere F-Measure was maximized for respective methods. Figure 1 demonstrates the precision-recall curves of SS L1-SVM and SS SVM compared to BLM SVM, BLM NN, SS NN FS, andSS NN noFS on three data sets, namely Nuclear, GPCR, and Ion Channel from left to right.
Based on these evaluation, the SVM approaches that use
withSS feature set (i.e.,
SS L1-SVM and SS SVM) outperform the current state-of-the-art methods BLM SVM andBLM NN. Moreover, the L1 norm feature selection method SS L1-SVM is more effective thanthe traditional SVM method; it uses only 72
.85%, 85
.02%, and 62
.86% of the original features
Evaluations of classification quality on Nuclear data set. The F-Measure,
Precision, and Recall scores were obtained at the cutoff point where F-Measure wasmaximized for respective prediction methods.
Evaluations of classification quality on GPCR data set. The F-Measure,
Precision, and Recall scores were obtained at the cutoff point where F-Measure wasmaximized for respective prediction methods.
Evaluations of classification quality on Ion data set. The F-Measure, Preci-
sion, and Recall scores were obtained at the cutoff point where F-Measure was maxi-mized for respective prediction methods.
in the Nuclear, GPCR, and Ion Channels datasets, respectively. The significant reduction infeature set size can not only make the classification more efficient and effective, it can alsohelp biological practitioners to identify important features more accurately.
We further investigated the prediction result generated by the SS L1-SVM method and
the BLM SVM method. At the prediction cutoff where both methods attained their ownmaximum F-Measure score, there were 8, 127, and 78 true positive interactions that SS L1-SVM managed to identify but were missed by BLM SVM. This was in comparison to 7, 16,
52 true positives that were identified by BLM SVM but not by SS L1-SVM. False positive isanother important measurement of a method. On the three datasets Nuclear, GPCR, and IonChannels, SS L1-SVM generated 0, 73, and 139 false positive interactions, compared to 2, 85,117 false positive interactions generated by BLM SVM.
Some interesting case studies are in order. On the Nuclear dataset, the two nearest neigh-
bors of the target protein RORB (KEGG Homo sapiens protein ID "hsa6096") under the nor-malized Smith-Waterman score are RORA ("hsa6095") and RORC ("hsa6097"), with scores0.578 and 0.458 respectively. RORB and RORC share a common interacting drug
Tretinoin(KEGG drug ID "D00094") while RORB and RORA do not. According to the BLM method,RORB will be predicted to have no interaction with
Tretinoin because its nearest neighborRORA does not interact with
Tretinoin. On the contrary, our method can correctly identifythe interaction between RORB and
Tretinoin because the
withSS feature set based methodcan discover two important substrings "EVVLVRMCRA-N" and "N-TV-FEGKYGGM" thatexist in both RORB and RORC. Therefore, although the overall match score between RORBand RORC is not the highest, their feature vectors (with respect to the two feature substrings)are the most similar.
On the GPCR dataset, the five nearest neighbors of the target protein CHRM1 (KEGG
Homo sapiens protein ID "hsa1128") under the normalized Smith-Waterman scores areCHRM5 ("hsa1133"), CHRM3 ("hsa1131"), CHRM4 ("hsa1132"), CHRM2 ("hsa1129"),and HRH3 ("hsa11255"), with scores 0.4707, 0.4536, 0.4237, 0.4228, and 0.2446 respec-tively. Although CHRM1 is supposed to bind to drug
Metoclopramide (KEGG drug ID"D00726"), none of its five nearest neighbors bind to this drug. In fact binding occursonly with the 6-th nearest neighbor, HRH2 ("hsa3274"), whose
SWnorm score with re-spect to CHRM1 is 0.2137. Therefore, the BLM methods can hardly predict that CHRM1binds to
Metoclopramide. In contrast, our method can correctly predict this interac-tion because the important substrings such as "KRTPRRAA", "Y-AKRTP-RAA-MI-L-W", and "NYFL-SLA-AD" are present in both CHRM1 and several proteins that bindto
Metoclopramide, e.g., HTR1A ("hsa3350"), HTR1B ("hsa3351"), HTR1D ("hsa3352"),HTR1E ("hsa3354"), HTR1F ("hsa3355"), HTR2A ("hsa3356"), HTR2B ("hsa3357"),HTR2C("hsa3358"), HTR4("hsa3360"), HTR5A("hsa3361"), and HTR6("hsa3362"), whichare all considered as faraway neighbors according to the
SWnorm scores.
The binding regions discovered by our computational model can also be found on the
Ion dataset. To provide potential drug-target binding regions for further investigation, weproduced all the important common substrings selected by the SS L1-SVM method, whichare made available online at "http://www.cs.ualberta.ca/ ys3/drug_target".
In this article, we proposed a novel drug-target interaction prediction method based on poten-tial drug-target binding regions. According to the evaluation metrics, the proposed methodsignificantly outperformed the current state-of-the-art methods. More importantly, it identi-fied a number of drug-target interactions that were missed by previous methods. We believethat the poor recall of previous methods is due to the use of a target kernel matrix based
on Smith-Waterman score: a low overall similarity between two protein sequences does notmean they do not share common drug binding regions. This drawback was overcome in ourapproach by collecting a large number of candidate binding regions (i.e., common substrings)that subsequently played the primary role in interaction prediction. In addition, the use of anexplicit vector representation, as opposed to implicit similarity measure, enabled the easy useof non-linear kernel expansions that were not possible for fixed kernel methods like BLM.
Besides the kernels based on substring feature vector, we believe the techniques of string
kernel proposed in25 could be useful in this problem. One straightforward benefit of using thestring kernel is that it will automatically consider all substrings of a given sequence pair. It canalso provide more intuitive understanding of substring-based sequence similarities than usingGaussian kernel. However, to employ the string kernel, one needs to customize the featureselection function into the model and to extend the non-gapped matching in string kernels.
We presented a feature selection method based on
L1-norm SVM that could not only pre-
dict the binding relations more accurately, but also find important candidate binding regions(features). It integrated feature selection directly into
L1-norm SVM and kernelized the op-timization model. A drawback was that the sparse regularization term tended to select onlya single feature from the candidate set. This is a well known problem with
L1 based regu-larization.26 To avoid this limitation, we will investigate a combination of
L1 and
L2 normregularizers, known as the elastic net,26 which is generally more effective at group featureselection. Another possible extension is to adopt the OSCAR model,27 which appears evenmore effective. We also discovered that the inference problem of drug-target interaction—insome cases—can be considered as a multi-instance learning problem. So we proposed usingmultiple feature selection vectors for each positive training example in theory and applied thefeature cost vector to address the multi-instance problem in practice. We hypothesize thatmore advanced machine learning methods specifically tailored for multi-instance classificationcan further improve the accuracy of drug-target interaction prediction. Moreover, consideringthat protein 3D structures carry the essential binding information and an increasing amount ofprotein 3D structure is being made available (e.g., PSI Nature Structural Biology Knowledge-base28), incorporating protein 3D information in the prediction model in addition to sequenceinformation would lead to promising improvement.
1. C. M. Dobson, Chemical space and biology,
Nature 432, 824 (2004).
2. M. Kanehisa, S. Goto, M. Hattori and et al., From genomics to chemical genomics: New devel-
opments in kegg,
Nucleic Acids Res. 34, D354 (2006).
3. B. R. Stockwell, Chemical genetics: Ligand-based discovery of gene function,
Nat. Rev. Genet.
1, 116 (2000).
4. D. L. Wheeler, T. Barrett, D. A. Benson and et al., Database resources of the national center
for biotechnology information,
Nucleic Acids Res. 34, D173 (2006).
5. M. Kuhn, D. Szklarczyk, A. Franceschini, C. von Mering, L. J. Jensen and P. Bork, Stitch 3:
zooming in on protein-chemical interactions,
Nucleic Acids Res. 40, 876 (2012).
6. G. Jones, P. Willett, R. C. Glen and et al., Development and validation for a genetic algorithm
for flexible docking,
J. Mol. Biol. 267, 727 (1997).
7. G. M. Morris, D. S. Goodsell, R. S. Halliday and et al., Automated docking using a lamarckian
genetic algorithm and empirical binding free energy function,
J. Comput. Chem. 19, 1639 (1998).
8. B. K. Shoichet, D. L. Bodian and I. D. Kuntz, Molecular docking using shape descriptors,
J.
Comput. Chem. 13, 380 (1992).
9. S. J. Haggarty, K. M. Koeller, J. C. Wong and et al., Multidimensional chemical genetic analysis
of diversity-oriented synthesis-derived deacetylase inhibitors using cell-based assays,
Chem. Biol.
10, 383 (2003).
10. F. G. Kuruvilla, A. F. Shamji, S. M. Sternson and et al., Dissecting glucose signalling with
diversity-oriented synthesis and small-molecule microarrays,
Nature 416, 653 (2002).
11. M. J. Keiser, B. L. Roth, B. N. Armbruster and et al., Relating protein pharmacology by ligand
chemistry,
Nat. Biotech. 25, 197 (2007).
12. M. Campillos, M. Kuhn, A. C. Gavin and et al., Drug target identification using side-effect
similarity,
Science 321, 263 (2008).
13. Y. Yamanishi, M. Araki, A. Gutteridge and et al., Prediction of drug-target interaction networks
from the integration of chemical and genomic spaces,
Bioinformatics 24, i232 (2008).
14. L. Perlman, A. Gottlieb, N. Atias, E. Ruppin and R. Sharan, Combining drug and gene similarity
measures for drug-target elucidation,
J. Comput. Biol. 18(2), 133 (2011).
15. N. Nagamine and Y. Sakakibara, Statistical prediction of protein-chemical interactions based on
chemical structure and mass spectrometry data,
Bioinformatics 23, 2004 (2007).
16. L. Jacob and J. P. Vert, Protein-ligand interaction prediction: An improved chemogenomics
approach,
Bioinformatics 24, 2149 (2008).
17. K. Bleakley and Y. Yamanishi, Supervised prediction of drug-target interactions using bipartite
local models,
Bioinformatics 25, 2397 (2009).
18. Y. Yamanishi, M. Kotera, M. Kanehisa and et al., Drug-target interaction prediction from chem-
ical, genomic and pharmacological data in an integrated framework,
Bioinformatics 26, i246
(2010).
19. T. F. Smith and M. Waterman, Identification of common molecular subsequences,
J. Mol. Biol
147, 195 (1981).
20. C. Micchelli and M. Pontil, Learning the kernel function via regularization,
J. Mach. Learn. Res.
6, 1099 (2006).
21. G. Lanckriet, M. Cristianini, P. Bartlett and et al., Learning the kernel matrix with semi-definite
programming,
J. Mach. Learn. Res. 5, 27 (2004).
22. I. Schomburg, A. Chang, C. Ebeling and et al., Brenda, the enzyme database: Updates and
major new developments,
Nucleic Acids Res. 32, D431 (2004).
23. S. Gunther, M. Kuhn, M. Dunkel and et al., Supertarget and matador: Resources for exploring
drug-target relationships,
Nucleic Acids Res. 36, D919 (2008).
24. D. S. Wishart, C. Knox, A. C. Guo and et al., Drugbank: A knowledgebase for drugs, drug
actions and drug targets,
Nucleic Acids Res. 36, D901 (2007).
25. S. V. N. Vishwanathan and A. J. Smola, Fast kernels for string and tree matching,
NIPS 15,
26. H. Zou and T. Hastie, Regularization and variable selection via the elastic net,
Journal of the
Royal Statistical Society B 67, 301 (2005).
27. W. Zhong and J. Kwok, Efficient sparse modeling with automatic feature grouping,
ICML 28,
28. H. M. Berman, Psi nature structural biology knowledgebase (www.sbkb.org/kb/index.html)
Source: http://users.cecs.anu.edu.au/~xzhang/papers/Shietal13.pdf
2. La Plataforma de Afectados por la Hipoteca. De la Crisis a la Estafa. Del Prozac al Lluís Mangot Sala1 La Plataforma d'Afectats per la Hipoteca ha esdevingut el moviment més potent dels últims dos anys en termes d'impacte en les estructures polítiques i econòmiques, així com de presència als mitjans de comunicació de l'Estat. A partir d'una perspectiva que combina l'anàlisi estructural del context en què
Diet Coke Biocultural Case Studies Phenylketonuria, Cretinism and Phenylalanine pathways Phenylalanine hydroxylase insufficiency causes phyenlketonuria (PKU) 1. Phenylalanine hydroxylase Most common genetic abnormality in the U.S. (1:10,000 overall, about 1:2,500 Europeans) Growth retardation, mental retardation, depigmentation of skin, hair