Context-sensitive medical information retrieval

by Mordechai Averbuch, Tom H Karson, Benjamin Ben-Ami, Oded Maimon, Lior Rokach
Studies In Health Technology And Informatics (2004)

Abstract

Substantial medical data such as pathology reports, operative reports, discharge summaries, and radiology reports are stored in textual form. Databases containing free-text medical narratives often need to be searched to find relevant information for clinical and research purposes. Terms that appear in these documents tend to appear in different contexts. The context of negation, a negative finding, is of special importance, since many of the most frequently described findings are those denied by the patient or subsequently "ruled out." Hence, when searching free-text narratives for patients with a certain medical condition, if negation is not taken into account, many of the retrieved documents will be irrelevant. The purpose of this work is to develop a methodology for automated learning of negative context patterns in medical narratives and test the effect of context identification on the performance of medical information retrieval. The algorithm presented significantly improves the performance of information retrieval done on medical narratives. The precision improves from about 60%, when using context-insensitive retrieval, to nearly 100%. The impact on recall is only minor. In addition, context-sensitive queries enable the user to search for terms in ways not otherwise available.

Available from www.ncbi.nlm.nih.gov
Context-Sensitive Medical Information Retrieval

Mordechai Averbuch a, Tom H. Karson b, Benjamin Ben-Ami c, Oded Maimon d, Lior Rokach d
a Tel-Aviv Sourasky Medical Center and Faculty of Medicine, Tel-Aviv University, Tel-Aviv, Israel
b Departments of Clinical Informatics and Cardiology, Mount Sinai School of Medicine, New York, USA
c Department of Computer Science, Tel-Aviv University, Tel-Aviv, Israel
d Department of Industrial Engineering, Tel-Aviv University, Tel-Aviv, Israel

MEDINFO 2004
M. Fieschi et al. (Eds)
Amsterdam: IOS Press
© 2004 IMIA. All rights reserved

Keywords:
Information Storage and Retrieval, Medical Informatics, Information Management, Information Systems

Introduction
In a 1973 review the Chief of the Computer Research Branch at the US National Institutes of Health asserted that the data underlying the patient care process “are in the large majority nonnumeric in form and are formulated almost exclusively within the constructs of natural language [1].” Today, over 30 years later, much of the data stored in hospital information systems are still stored as free-text, including history and physical exams, pathology reports, operative reports, discharge summaries, and radiology reports. Databases containing free-text medical narratives often need to be searched to find relevant information for clinical and research purposes.
Medical narratives present some unique problems. When a physician writes an encounter note, a highly telegraphic form of language may be used. There are often very few (if any) grammatically correct sentences. Acronyms and abbreviations are frequently used. Very few of these abbreviations and acronyms can be found in a dictionary and they are highly idiosyncratic to the medical domain and local practice. Often misspellings, errors in phraseology, and transcription errors are found in dictated reports.
Various articles have been published evaluating methodologies for efficient information retrieval in the medical domain [2][3][4]. A search for patients with a specific symptom or set of findings might result in numerous records retrieved. The mere presence of a search term in the text, however, does not imply that records retrieved are indeed relevant to the query. Depending upon the various contexts that a term might have, only a small portion of the retrieved records may actually be relevant.
A number of investigators have tried to cope with the problem of a negative context [5][6][7]. Their detection of negative context is based on a regular expression built from a short list of negative terms supplied by a human expert. There is no work that tries to learn the profile of a context automatically and then uses this profile to examine various methods of context classification in the medical domain. Moreover, no work has been done to measure the effect of context on the result of medical information retrieval. The purpose of this work was to develop a methodology for learning negative context patterns in medical narratives and measure the effect of context identification on the performance of medical information retrieval.
Methods
Overview
Figure 1 presents a block diagram of the different components of the system. All medical documents are loaded into a database. Human experts review each document. Using a context tagging application, the experts specify the context (c) of each appearance of a medical term (t). The set of available contexts (C), where C={C1,...,Cn}, is predefined based on the specific application. For instance, in negation detection [5] the context set is C={"Negative", "Positive"}.
The resulting context-tagged document dataset (D) is divided into 2 sets: (1) the training set, which contains two-thirds (2/3) of the documents along with the context of a few of the medical terms, and (2) the test set, which contains the remaining documents along with the context of a few different medical terms. The training set and the test set, therefore, contain different documents and the context of different medical terms.
The training set serves as the input to the learning algorithm. The output of the learning algorithm is the context profile (L). Each context has its own profile that consists of a list of indicative terms. For instance, the profile of a negative context may be L_negative={"negative for", "denies"}.
The context profile becomes an input into the retrieval algorithm. Queries for terms found in the test set are then created and run utilizing the retrieval algorithm, resulting in a set of retrieved documents. The recall, precision and F-measure [8] were measured for each of the queries.
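The three measures named above can be computed per query as in the following minimal sketch. It assumes the balanced form of the F-measure (the harmonic mean of precision and recall); the paper cites [8] for the F-measure but this excerpt does not state which weighting was used. The function name `precision_recall_f` and the document identifiers are hypothetical.

```python
def precision_recall_f(retrieved, relevant):
    """Return (precision, recall, F-measure) for a single query.

    retrieved: ids of documents returned by the retrieval algorithm
    relevant:  ids of documents tagged as truly relevant by the experts
    Assumption: balanced F-measure (harmonic mean of precision and recall).
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


# Hypothetical query: five documents retrieved, three of them relevant,
# and one relevant document missed.
p, r, f = precision_recall_f({1, 2, 3, 4, 5}, {1, 2, 3, 6})
print(p, r, f)
```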
Learning Algorithm
The core of the system is the learning algorithm. Its output, the context profile (L), is created by scanning the documents in the training set. All words or phrases (w) that appear in the same sentence as a tagged term are put on a list and statistics are generated regarding their appearance in other contexts. This list is then filtered using a threshold parameter, to eliminate rare words or phrases. Based on the UMLS Dictionary [9], the list is further reduced by removing all words or phrases that have medical context. (This removes medical terms that tend to correlate with tagged terms.) The next step is calculating the information gain (IG) for each term in each context. Equation 1 shows how IG is calculated, where H(c) is the entropy of the context c and H(c|term) is the conditional entropy for the context of the given term.
The last step of the algorithm is to remove terms from each context profile whose IG is below a given threshold. Pseudo-code of the algorithm is shown in Figure 2.
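The learning steps described above can be sketched in Python as follows. This is an illustration under several assumptions that are not in the paper: sentences arrive already split and tokenized as (words, context) pairs, only single words are considered (no phrases), both entropies are estimated as binary entropies over sentence counts, and a plain set of words stands in for the UMLS Dictionary. The names `learn_profiles`, `min_appearances` and `min_gain` are hypothetical.

```python
import math
from collections import Counter


def entropy(pos, total):
    """Binary entropy (in bits) of an event seen `pos` times out of `total`."""
    if total == 0 or pos in (0, total):
        return 0.0
    p = pos / total
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))


def information_gain(sentences, context, word):
    """IG(c, w) = H(c) - H(c | w), estimated over tagged sentences."""
    n = len(sentences)
    with_w = [ctx for words, ctx in sentences if word in words]
    without_w = [ctx for words, ctx in sentences if word not in words]
    h_c = entropy(sum(ctx == context for _, ctx in sentences), n)
    # Conditional entropy: weighted entropy of the two partitions.
    h_c_given_w = sum(
        len(part) / n * entropy(part.count(context), len(part))
        for part in (with_w, without_w)
    )
    return h_c - h_c_given_w


def learn_profiles(sentences, contexts, dictionary, min_appearances, min_gain):
    """Build one profile (a set of indicative words) per context."""
    profiles = {}
    for c in contexts:
        counts = Counter()
        for words, ctx in sentences:
            if ctx == c:
                counts.update(set(words))  # count each word once per sentence
        profiles[c] = {
            w for w, k in counts.items()
            if k >= min_appearances                            # drop rare words
            and w not in dictionary                            # drop medical terms
            and information_gain(sentences, c, w) >= min_gain  # keep informative words
        }
    return profiles


# Hypothetical toy training data.
training = [
    (["patient", "denies", "pain"], "Negative"),
    (["patient", "denies", "nausea"], "Negative"),
    (["patient", "reports", "pain"], "Positive"),
    (["patient", "reports", "nausea"], "Positive"),
]
profiles = learn_profiles(training, ["Negative", "Positive"],
                          dictionary={"pain", "nausea"},
                          min_appearances=2, min_gain=0.5)
print(profiles)
```

In this toy data the word "patient" appears in every sentence, so it carries no information about the context and is filtered out by the gain threshold, while "pain" and "nausea" are removed as dictionary terms.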
Sentence Boundary Identification
The learning algorithm presented above requires that free-text be broken up into sentences. Normally, sentence boundaries can be detected by scanning the text for a period, exclamation point or question mark. This approach, however, does not work for medical narratives. As can be seen below, periods are frequently used within a sentence:
• Patient was discharged on Lopressor 25 milligrams p.o. b.i.d.
• After multiple attempts only 750 cc. of fluid were removed.
• He was evaluated by Dr. ___ of Neurology.
• Rechecked potassium was 4.4.
In this study sentence boundary determination still begins by scanning text for periods. Then, each period is evaluated to determine if it is part of a regular expression. Table 1 shows regular expressions written using Perl notation. If a period is part of a regular expression, it is marked as “not a separator.” All other periods are considered sentence separators.
Figure 2 - Pseudo-Code of the Proposed Learning Algorithm
Table 1: Regular Expressions Marked As “Not a Separator”
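The period-marking procedure described above can be sketched with Python's `re` module, whose syntax for these patterns matches the Perl notation. The pattern list is a subset of the Table 1 expressions as they could be read from the extracted text, and the function name `split_sentences` is hypothetical; this is an illustration, not the authors' implementation.

```python
import re

# Subset of the Table 1 patterns; a period inside any match is "not a separator".
NOT_A_SEPARATOR = re.compile("|".join([
    r"(b|t|q)\.i\.d\.?",
    r"p\.o\.?",
    r"p\.r\.n",
    r"q\.d\.?",
    r"q\.h\.s",
    r"mg\.",
    r"cc\.",
    r"\.([0-9]+)",
    r"(Dr\.)(\s?)(\w+)",
]))


def split_sentences(text):
    """Split text at every period that is not covered by a protected pattern."""
    protected = set()
    for match in NOT_A_SEPARATOR.finditer(text):
        protected.update(range(match.start(), match.end()))

    sentences, start = [], 0
    for i, ch in enumerate(text):
        if ch == "." and i not in protected:
            sentence = text[start:i + 1].strip()
            if sentence:
                sentences.append(sentence)
            start = i + 1
    tail = text[start:].strip()  # text after the last separator, if any
    if tail:
        sentences.append(tail)
    return sentences


note = ("He was evaluated by Dr. Smith of Neurology. "
        "Rechecked potassium was 4.4. "
        "Only 750 cc. of fluid were removed.")
for sentence in split_sentences(note):
    print(sentence)
```

One consequence of this scheme is that an abbreviation ending a sentence (for example a trailing "b.i.d.") keeps its final period protected, so the split falls at the next unprotected period instead.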
Retrieval Algorithm
The retrieval part of the experiment is meant to simulate queries made by physicians. All the documents in the test set are scanned for the query terms tested. In each document where query terms are found, a context classification, either positive or negative, is made for each appearance of the term. The context is classified by searching all the terms of the sentence where the query term is found and comparing them to the negative context profile. If a term is found in the negative context profile, that appearance of the query term is marked as negative. After classifying all appearances of the query terms in a document, the document is re-
Figure 1 - Overview of Methodology

$IG(c, \mathrm{term}) = H(c) - H(c \mid \mathrm{term})$    (1)

Regular expressions of Table 1:
• (b|t|q)\.i\.d\.?
• p\.o\.?
• \.([0-9]+)
• cc\.
• p\.r\.n
• q\.d\.?
• \. of
• \., and
• q\.h\.s
• mg\.
• (Dr\.)(\s?)(\w+)
• \sq\.

Pseudo-code of the learning algorithm (Figure 2):
Input:
  C = {c_1, ..., c_n}: a set of contexts (c_1 through c_n)
  D_train = {d_1, ..., d_m}: a set of documents (d_1 through d_m); e.g., the Training Set
  T = {w_1, w_2, ..., w_q}: dictionary terms (for instance the UMLS Dictionary)
  M = {(d_i, t_j, c_l)}: a set of manually tagged terms, where c_l is the context of term t_j in document d_i
  min_a: the value of the minimum required number of appearances
  min_i: the minimum value of the threshold parameter for information gain (IG)
L_c^t = ∅ ∀ c, t
For each c ∈ C
  For each document d ∈ D_train
    For each term t, s.t. (d, t, c) ∈ M
      For each word (and phrase) w in the sentence of t
        If w ∉ L_c^t
          Then add w to L_c^t
        Else
          Increase the number of appearances of w in L_c^t
  Define a context profile L_c = ∪_t L_c^t
  Remove from L_c all words with appearances of less than min_a
  Remove all words in L_c that appear in T
  For each w ∈ L_c calculate the information gain IG(c, w)
  Remove from L_c all words where IG(c, w) is less than min_i
Return L_c1, ..., L_cn
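The appearance-level classification described under Retrieval Algorithm can be sketched as below. The sketch assumes the document has already been split into sentences and uses simple case-insensitive substring matching against the negative context profile, which is an assumption rather than the paper's stated matching rule. It stops at labelling each appearance, because the document-level decision is cut off in this excerpt. The function names are hypothetical.

```python
def classify_appearance(sentence, negative_profile):
    """Label one sentence containing the query term as Negative or Positive."""
    lowered = sentence.lower()
    if any(cue in lowered for cue in negative_profile):
        return "Negative"
    return "Positive"


def classify_document(sentences, query_term, negative_profile):
    """Return one context label per sentence in which the query term appears."""
    labels = []
    for sentence in sentences:
        if query_term.lower() in sentence.lower():
            labels.append(classify_appearance(sentence, negative_profile))
    return labels


# Example profile taken from the Overview section of the paper.
negative_profile = {"negative for", "denies"}

# Hypothetical document, already split into sentences.
document = [
    "Patient denies chest pain.",
    "Chest pain radiating to the left arm.",
    "No fever.",
]
print(classify_document(document, "chest pain", negative_profile))
```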
