Learning the Relationships between Drug, Symptom, and Medical Condition
Mentions in Social Media
Information Retrieval Lab
Department of Computer Science
Georgetown University
Copyright © 2016, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

Abstract

We consider the general problem of learning relationships between drugs, symptoms, and medical conditions mentioned on Twitter, with the goal of estimating probability distributions to reduce the difficulties presented by social media's incomplete picture. If a user mentions taking a drug and experiencing several unexpected symptoms, for example, are the symptoms associated with that drug or is it more likely that the symptoms are associated with an unmentioned underlying condition? We describe a model for learning from and utilizing such relationships. We demonstrate that our approach identifies drugs that are similar based on their associated symptoms (or conditions), identifies conditions that are similar based on their associated symptoms, and can determine whether a symptom is caused by a medical condition or by a drug (i.e., a drug side effect).

Social media data are subject to many biases, and sampled data sources such as Twitter's streaming API compound the problem. There is no guarantee that users will mention all details that are relevant to the mining task being performed, and even when users do provide complete information, the public Twitter API's sampling may prevent all of the tweets containing relevant information from being collected.

We reduce the difficulty of mining health-related data with incomplete information by modeling the relationships between drugs, symptoms, and medical conditions. If a user mentions suffering from a medical condition and experiencing a symptom, for example, the symptom may be caused by the medical condition or it may be caused by an unmentioned drug the user is taking. Our model addresses this problem by assessing the probability that a user who mentioned a condition and symptom is also associated with an unmentioned drug. Similarly, we demonstrate how our model estimates the similarity between drugs, symptoms, and conditions and determines whether symptom mentions are more likely to be associated with a condition mention or with a drug mention (i.e., drug side effects).

Many have considered the problem of modeling health-related latent topics in social media. Chen et al. (2014) use a temporal topic model to model users' flu infection statuses (i.e., healthy, exposed, or infected) over time based on the users' tweets. Paul and Dredze (2011) propose the Ailment Topic Aspect Model (ATAM+) to associate treatment, symptom, and general terms with latent ailment topics. While our goal of learning the associations between drug, symptom, and condition mentions is similar to modeling health-related topics, we differ in that we do not try to discover latent topics. We use concept extraction and medical thesauri to identify mentions rather than training a topic model to discover topics (i.e., term categories). We envision our model as augmenting health-related mining tasks such as discovering drug side effects or estimating the prevalence of a disease.

We model relationships between symptoms, drugs, and medical condition mentions in tweets as a Bayesian network. A user's medical conditions determine what drugs the user takes. The user's conditions and drugs determine what symptoms the user experiences; a symptom may be caused by the condition or it may be a side effect of a drug the user is taking. We compute a joint probability distribution over symptoms, drugs, and conditions and use it to compute conditional probability distributions among them. We identify symptoms, drugs, and conditions in tweets using the CRF method described in (Yates, Goharian, and Frieder 2015).

Let S be a random variable over all symptoms, D be a random variable over all drugs, and C be a random variable over all conditions. Let Count_{S,D,C;U} be the number of times S, D, and C are mentioned by the Twitter user U during a d-day window. We use nil values in cases where a d-day window contains only one or two types of random variable; that is, if a user mentions one or more symptoms or conditions during a window but does not mention any drugs, we set D = ∅ for that window. The joint probability mass function over S, D, and C (Eq. 1) is estimated from these counts.

Conditional probabilities between any two of the random variables are then computed by marginalizing out the third variable. The conditional probability of extracting the symptom S given a drug D, for example, is:

$$\Pr(S \mid D) = \frac{\sum_{C} \Pr(S, D, C)}{\sum_{S'} \sum_{C} \Pr(S', D, C)} \qquad (2)$$
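As a rough illustration of the counting and marginalization described above, the following Python sketch pools per-user window counts into a joint distribution and derives Pr(S | D) from it. Normalizing the pooled counts is an assumption of this sketch, since the exact estimator is not reproduced here. The names `joint_distribution`, `symptom_given_drug`, and the `NIL` placeholder are hypothetical, and the toy counts are made up for the example. The nil value is treated like any other value of a variable, which is also an assumption.

```python
from collections import Counter, defaultdict

NIL = None  # hypothetical placeholder for a nil value such as D = ∅


def joint_distribution(user_counts):
    """Estimate Pr(S, D, C) by pooling per-user counts and normalizing.

    user_counts maps a user id to a Counter keyed by
    (symptom, drug, condition) triples. Normalized pooled counts are an
    assumption of this sketch, not necessarily the paper's estimator.
    """
    pooled = Counter()
    for counts in user_counts.values():
        pooled.update(counts)
    total = sum(pooled.values())
    if total == 0:
        return {}
    return {triple: n / total for triple, n in pooled.items()}


def symptom_given_drug(joint, drug):
    """Pr(S | D = drug): sum out the condition, then renormalize over S."""
    marginal = defaultdict(float)
    for (symptom, d, _condition), p in joint.items():
        if d == drug:
            marginal[symptom] += p
    norm = sum(marginal.values())
    if norm == 0:
        return {}
    return {symptom: p / norm for symptom, p in marginal.items()}


# Toy data (invented): two users, counts keyed by (symptom, drug, condition).
user_counts = {
    "u1": Counter({("headache", "advil", NIL): 2, ("fever", "advil", "flu"): 1}),
    "u2": Counter({("headache", "advil", "flu"): 1, ("cough", NIL, "flu"): 4}),
}
joint = joint_distribution(user_counts)
advil = symptom_given_drug(joint, "advil")
```

The same pattern gives the other conditionals (for example Pr(S | C) or Pr(C | D)) by changing which position of the triple is fixed and which is summed out.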
Conditional probability distributions can be inspected to identify associations between symptoms, drugs, and conditions. The Kullback–Leibler divergence between distributions can be used to compare the similarity of random variables, such as comparing the similarity of symptoms associated with two drugs D1 and D2:

$$D_{KL}\big(\Pr(S \mid D_1) \,\|\, \Pr(S \mid D_2)\big) \qquad (3)$$

Finally, we can identify symptoms that are either more highly associated with a condition than a drug (i.e., are symptoms of the condition) or more highly associated with a drug than a condition (i.e., are drug side effects) by taking the difference of the given drug's and condition's conditional probabilities:

$$\Pr(S \mid D) - \Pr(S \mid C) \qquad (4)$$

Our dataset consists of thesauri containing symptom, drug, and condition terms and a Twitter corpus collected between November 2013 and November 2015. Rather than using Twitter's streaming 1% sample API, we used Twitter's statuses/filter API and queried for tweets geolocated within the United States and Canada to maximize our per-user coverage. Our Twitter corpus contains approximately 1.5 billion tweets written by 11 million users, of which about 18 million (1.2%) tweets contained a term from our final thesauri (i.e., were health-related tweets). Most health-related tweets mentioned at least one symptom (57%) or condition (37%), with only 7.6% of health-related tweets mentioning a drug. We use the SIDER (Kuhn et al. 2016) drug database to identify drug terms in our corpus. We use the MedSyn thesaurus (Yates and Goharian 2013), a thesaurus containing both lay person and expert terminology derived from the Unified Medical Language System (Bodenreider 2004), to identify health-related terms that may express symptom or condition concepts. Additionally, we manually verified every symptom, drug, and condition term that occurred at least 400 times in our corpus and removed ambiguous terms from our thesauri.

Drug, symptom, and condition associations

To evaluate how well we learn associations between drugs, symptoms, and conditions, we compute the conditional probability (Eq. 2) of symptoms given NSAIDs (nonsteroidal anti-inflammatory drugs), which are a common class of over-the-counter painkilling drugs (i.e., Pr(S | D)). The top ten symptoms for each drug are:

Advil: headache (0.11), confused (0.08), sleepy (0.07), throw up (0.06), pass out (0.04), cough, cramps, feel sick, hangover & fever (0.03)
Aleve: headache (0.10), sleepy (0.06), confused (0.06), gynecomastia (0.04), cramps (0.04), throw up (0.04), feel sick (0.03), ache (0.03), fever (0.03) & cough (0.02)
Aspirin: nasal polyps (0.08), polyps (0.08), tumor (0.05), swelling (0.05), swollen (0.04), headache (0.03), salivary glands (0.03), runny nose (0.03), fever & apoptosis (0.02)
Tylenol: asthma (0.06), headache (0.06), fever (0.04), confused (0.04), hard of hearing, throw up, hearing loss, tumor, sleepy & migraine (0.03)

Symptoms that NSAIDs are commonly taken to relieve, such as headaches and fevers, are associated with every drug. Symptoms of underlying conditions that NSAIDs do not cause or treat also appear, however, such as cough, sleepy, and runny nose. This illustrates that while conditional probability can be used to find associations between drugs and symptoms, the association may be an indirect association that also involves an underlying condition. We demonstrate in a later section that Eq. 4 can be used to separate symptoms caused by an underlying condition from symptoms caused by a drug (i.e., drug side effects).

Table 1 shows the symptoms and drugs most strongly associated with four common conditions (i.e., those symptoms and drugs with the highest conditional probabilities given one of the conditions). Many of the symptoms are clearly symptoms of the condition: migraine, headache, sneeze, swollen, and cough are symptoms of allergies; tumors, polyps, and weight loss are symptoms of breast cancer; nausea (feeling sick), fever, and headache are flu symptoms; and shoulder pain is commonly associated with strokes. The relationships between conditions and drugs are less clear, with alcohol, cocaine, and testosterone commonly appearing. Allergies are correctly associated with allergy medications (i.e., zyrtec, prednisone, benadryl, and histamine), however, as well as a drug that alleviates an allergy symptom (imitrex). Tamoxifen, one of the most common breast cancer drugs, ranks highly for the breast cancer condition. Tylenol, which ranks highly for the flu, is not associated with flu treatment but is used to alleviate some flu symptoms. The model's difficulty identifying drugs associated with breast cancer, the flu, and strokes may be caused by the fact that no drugs are commonly taken for these conditions; the flu is a common condition with no cure. Drugs are used in the treatment of breast cancer and strokes, but these are relatively rare conditions so their associated drugs are less likely to be mentioned on Twitter.

Table 1: The symptoms and drugs most strongly associated with four common conditions. Many symptoms (e.g., migraine, headache, sneeze, etc.) and drugs (e.g., zyrtec, prednisone, benadryl) are correctly associated with the allergy condition. The other conditions are correctly associated with many symptoms, but not with many drugs.

Table 2: KL divergences between nonsteroidal anti-inflammatory drugs derived from drug-symptom distributions. Lower numbers indicate a higher similarity. Each drug's brand name and generic name was treated as a unique drug for the purpose of evaluating the drug similarities. Our model correctly identifies Acetaminophen and Tylenol and Ibuprofen and Advil as similar drugs, but fails to identify Naproxen and Aleve as being similar. The relative results do not change when drug-condition distributions are instead used to compute the similarity.

Similarities between drugs

To evaluate how well our model can be used to measure the similarity between two drugs, we compute the KL divergence (Eq. 3) between NSAIDs (nonsteroidal anti-inflammatory drugs). NSAIDs are commonly referred to both by their brand names and generic names, making them ideal for evaluating our drug-similarity metric. The KL divergences between pairs of NSAIDs are shown in Table 2. The corresponding brand name or generic name for each drug is shown in parentheses in the first column. Lower numbers indicate higher degrees of similarity. KL divergence is not symmetric, so we compare drugs D1 and D2 by taking
the mean of the KL divergences of their drug-symptom distributions. This approach can also be used to measure the similarity between distributions conditioned on two conditions or two symptoms.

Our model correctly indicates that Ibuprofen is the most similar drug to Advil and that Acetaminophen is the most similar drug to Tylenol. It incorrectly indicates that Naproxen and Aleve are more similar to Ibuprofen and Advil than the two drugs are to each other. This error may be caused by a much lower number of tweets mentioning Aleve and Naproxen; these drug terms occur in our corpus approximately 20% and 8% as often as the next least frequent term (Acetaminophen), respectively.

Condition symptoms vs. drug side effects

We evaluate how well our model can be used to distinguish symptoms caused by a condition from symptoms caused by a drug (i.e., drug side effects) by comparing conditional probabilities as shown in Eq. 4. The drugs and conditions to compare were chosen by selecting the five most frequently occurring drugs with a clearly associated condition; some drugs such as morphine, adderall, aspirin, and benadryl occurred more frequently, but were not strongly associated with any condition (as determined by Pr(C | D)). Such drugs either belonged to more general classes of drugs that are often used for symptom relief (i.e., NSAIDs and antihistamines) or were drugs that are known to be commonly abused (e.g., morphine, adderall, xanax, etc.). The top five drugs and their associated conditions are: prednisone (allergies), lipitor (diabetes), prozac (depression), zoloft (depression), and paxil (depression).

Table 3: Symptoms most strongly associated with drugs (D row) and conditions (C row) for each drug and condition pair. Many drug side effects are correctly identified (e.g., nausea with Prednisone, weight gain with Prozac and Paxil, etc.). Similarly, there is high agreement among the conditions associated with depression with no more than two entries differing between any pair. Note the following terms were abbreviated: embolism, high blood pressure, incontinence, suicidal thoughts, and thrombosis.

The top ten symptoms attributed to each drug and condition are shown in Table 3. Symptoms attributed to the drug are shown in the D rows (top half) and symptoms attributed to the condition are shown in the C rows (bottom half). The symptoms associated with depression (D=Prozac,
D=Zoloft, and D=Paxil) are strikingly similar, with only one entry differing between the Prozac and Paxil columns (i.e., cramps vs. feel sick) and two entries unique to the Zoloft column (i.e., hangover and inflammation). The symptoms associated with the drugs that treat depression are less accurate, with several terms that do not appear to be related at all (e.g., clubfoot, sneeze, tumor, irritable bowel, etc.). The drug symptoms caries, tooth decay, weight gain, incontinence, and diarrhea are known side effects.

Similarly, many of the symptoms associated with Prednisone (i.e., swelling, mood changes, nausea, and exhaustion) and Lipitor (i.e., headache, migraine, weight gain, muscle soreness, and stomach pain) are known side effects of those drugs. Many of the symptoms associated with allergies are allergy symptoms, such as sneeze, headache, cough, and runny nose, whereas fewer symptoms appear to be correctly associated with diabetes (i.e., exhaustion, weight loss, and nausea). These results illustrate that while we differentiate between symptoms caused by conditions and symptoms caused by drugs (i.e., drug side effects), identifying causal relationships is difficult and should be handled with care.

We described a model for learning associations between mentions of drugs, symptoms, and medical conditions in Twitter, and investigated its ability to (1) learn associations between drugs, symptoms, and conditions, (2) identify conditions or drugs that are similar based on their associated symptoms, and (3) differentiate between symptoms caused by drugs (i.e., drug side effects) and symptoms caused by a condition that a drug is being taken to treat. We find that our approach is often able to correctly identify equivalent drugs as similar and to correctly separate a condition's symptoms from drug side effects. We envision incorporating our approach with health-related text mining systems to improve their accuracy. Systems for discovering expected and unexpected drug side effects could benefit from our method for differentiating between conditions' symptoms and drug side effects, for example, and our drug similarity and condition similarity measures could be used to help identify drug and condition synonyms.

References

Bodenreider, O. 2004. The Unified Medical Language System (UMLS): integrating biomedical terminology. Nucleic Acids Research 32(Database issue):D267–70.
Chen, L.; Hossain, K. S. M. T.; Butler, P.; Ramakrishnan, N.; and Prakash, B. A. 2014. Flu gone viral: Syndromic surveillance of flu on twitter using temporal topic models. In IEEE ICDM'14.
Kuhn, M.; Letunic, I.; Jensen, L. J.; and Bork, P. 2016. The SIDER database of drugs and side effects. Nucleic Acids Research 44(Database issue).
Paul, M., and Dredze, M. 2011. You are what you tweet: Analyzing twitter for public health. In AAAI ICWSM'11.
Yates, A., and Goharian, N. 2013. ADRTrace: Detecting Expected and Unexpected Adverse Drug Reactions from User Reviews on Social Media Sites. In ECIR'13.
Yates, A.; Goharian, N.; and Frieder, O. 2015. Extracting Adverse Drug Reactions from Social Media. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI'15).
Source: http://ir.cs.georgetown.edu/downloads/icwsm16_adr.pdf