In Proceedings of ESWC, 2009. Special Mention from the Best Paper award evaluation committee.
Concept Search
Fausto Giunchiglia, Uladzimir Kharkevich, Ilya Zaihrayeu
Department of Information Engineering and Computer Science University of Trento, Italy {fausto,kharkevi,ilya}@disi.unitn.it
Abstract
In this paper we present a novel approach, called Concept Search,
which extends syntactic search, i.e., search based on the
computation of string similarity between words, with semantic
search, i.e., search based on the computation of semantic
relations between concepts. The key idea of Concept Search is to
operate on complex concepts and to maximally exploit the semantic
information available, reducing to syntactic search only when
necessary, i.e., when no semantic information is available. The
experimental results show that Concept Search performs at least as
well as syntactic search, improving the quality of results as a
function of the amount of available semantics.
1 Introduction
Historically, there have been two major approaches to information
retrieval that we call syntactic search and semantic
search. Syntactic search use words or multi-word phrases as
atomic elements in document and query representations. The search
procedure is essentially based on the syntactic matching of
document and query representations. Search engines, exploiting
syntactic search, are known to suffer in general from low
precision while being good at recall. Semantic search is based on
fetching document and query representations through a semantic
analysis of their contents using natural language processing
techniques and, later, on retrieving documents by matching these
semantic representations. The idea is that, differently from
syntactic search, semantic search exploits the meaning of
words, thus avoiding many of the well known problems of syntactic
search, e.g., the problems of polysemy and synonymy.
Semantics-based approaches, in general, allow to reach a higher
precision but lower recall.
In this paper we propose a novel approach called Concept
Search (C-Search in short) which extends syntactic search
with semantics. The main idea is to keep the same machinery which
has made syntactic search so successful, but to modify it so that,
whenever possible, syntactic search is substituted by semantic
search, thus improving the system performance. This is why we say
that C-Search is semantics enabled syntactic search.
Semantics can be enabled along different dimensions, on different
levels, and to different extents forming a space of approaches
lying between purely syntactic search and fully semantic search.
We call this space the semantic continuum. In principle,
C-Search can be tuned to work at any point in the semantic
continuum taking advantage of semantics when and where possible.
As a special case, when no semantic information is available,
C-Search reduces to syntactic search, i.e., results produced by
C-Search and syntactic search are the same.
The remainder of the paper is organized as follows. In
Section 2, we first discuss information
retrieval (IR) in general, and then focus on syntactic search. In this
section, we provide a general model of IR which
will be latter extended to semantic search. In
Section 3, we introduce and describe
the semantic continuum. In Section 4, we
describe how C-Search is positioned within the semantic
continuum. In Section 5, we show how
C-Search can be efficiently implemented using inverted index
technology. Section 6 presents experimental
results. In Section 7, we discuss the
state-of-the-art in semantic search and compare our approach with
other related approaches. Section 8 concludes
the paper.
2 Syntactic Search
The goal of an IR system is to map
a natural language query q (in a query set Q), which specifies
a certain user information needs, to a set of documents d in the
document collection D which meet these needs, and to order these
documents according to their relevance. IR can therefore be
represented as a mapping function:
In order to implement an IR System we need to decide (i) which
models (Model) are used for document and query
representation, for computing query answers and relevance ranking,
(ii) which data structures (Data Structure) are used for
indexing document representations in a way to allow for an
efficient retrieval, (iii) what is an atomic element (Term)
in document and query representations, and (iv) which matching
techniques (Match) are used for matching document and query
terms. Thus, an IR System can be abstractly modelled as the
following 4-tuple:
|
IR_System = 〈Model, Data Structure, Term, Match 〉 |
| (2) |
The bag of words model, i.e., the model in which the
ordering of words in a document is not considered, is the most
widely used model for document representation. The boolean
model, the vector space model, and the probabilistic
model are the classical examples of models used for computing
query answers and relevance ranking [20].
Conventional search engines rank query results using the cosine
similarity from the vector space model with terms weighted by
different variations of the tf-idf weight measure. Various
index structures, such as the signature file and the
inverted index, are used as data structures for
efficient retrieval. Inverted index, which stores mappings from
terms to their locations in documents, is the most popular
solution [20]. Finally, in syntactic search,
Term and Match are instantiated as follows:
- Term - a word or a multi-word phrase;
- Match - a syntactic matching of words or phrases.
In the simplest case, syntactic matching is implemented
as search for equivalent words. These words are often stemmed.
Furthermore, some systems perform approximate matching by
searching for words with common prefixes or words within a certain
edit distance with a given word.
There are several problems which may negatively affect the
performance of syntactic search, as discussed below.
Polysemy. The same word may have multiple meanings
and, therefore, query results, computed by a syntactic search
engine, may contain documents where the query word is used in a
meaning which is different from what the user had in mind. For
instance, a document D1 (in Figure 1)
which talks about baby in the sense of a very young mammal
is irrelevant if the user looks for documents about baby in
the sense of a human child (see query Q1 in
Figure 1). An answer for query Q1,
computed by a syntactic search engine, includes document D1,
while the correct answer is the empty set.
Figure 1: Queries and a document collection
Synonymy. Two different words can express the same
meaning in a given context, i.e., they can be synonyms. For
instance, words mark and print are synonymous when
used in the sense of a visible indication made on a surface,
however, only documents using word print will be returned if
the user query was exactly this word. An answer for query Q2 (in
Figure 1), computed by a syntactic
search engine, is the empty set, while the correct answer includes
the document D3.
Complex concepts. Syntactic search engines fall
short in taking into account complex concepts formed by natural
language phrases and in discriminating among them. Consider, for
instance, document D2 (in Figure 1).
This document describes two concepts: a laptop computer and
a coffee table. Query Q3 (in
Figure 1) denotes concept computer
table which is quite different from both concepts described in
D2, whereas a syntactic search engine is likely to return D2
in response to Q3, because both words computer and table occur
in this document. The correct answer is the empty set.
Related concepts. Syntactic search does not take
into account concepts which are semantically related to the query
concepts. For instance, a user looking for carnivores might
not only be interested in documents which talk about carnivores
but also in those which talk about the various kinds of carnivores
such as dogs and cats. An answer for query Q4 (in
Figure 1), computed by a syntactic
search, is the empty set, while the correct answer might include
documents D1 and D3, depending on user information needs and
available semantic information.
3 The Semantic Continuum
In order to address the problems of syntactic search described in
Section 2, we extend syntactic search
with semantics. In the following, we identify three dimensions
where semantics can improve syntactic search and represent these
dimensions in the cartesian space shown in
Figure 2.
Figure 2: Semantic Continuum
From natural language to formal language
(NL2FL-axis in Figure 2). To solve the
problems related to the ambiguity of natural language, namely, the
problems of polysemy and synonymy, we need to move from words,
expressed in a natural language, to concepts (word senses),
expressed in an unambiguous formal language. An overview of
existing approaches to sense based IR is presented
in [26]. In the NL2FL-axis, 0 represents the
situation where only words are used, while 1 represents the
situation where only concepts are used. When we move from words to
concepts, it is not always possible to find a concept which
corresponds to a given word. The main reason for this problem is
the lack of background knowledge [16], i.e., a concept
corresponding to a given word may not exist in the lexical
database. To address this problem, indexing and retrieval in the
continuum are performed by using both syntactic and semantic
information, i.e., a word itself is used as a concept, when its
denoted concept is not known.
From words to phrases (W2P-axis in
Figure 2). To solve the problem related
to complex concepts, we need to analyze natural language phrases,
which denote these concepts. It is well known that in natural
language concepts are expressed mostly as noun
phrases [28]. An example of a noun phrase
parsing algorithm and its application to document indexing and
retrieval is described in [32]. There are approaches
in which the conceptual content of noun phrases is also analyzed
(see e.g. [1]). In general, concepts can be
expressed as more complex phrases (e.g., verb phrases) and
possibly as a free text. In the W2P-axis, 0 represents the
situation where only single words are used, while 1 represents the
situation where complex concepts, extracted from a free text, are
used.
From string similarity to semantic similarity
(KNOW-axis in Figure 2). The problem
with related concepts can be solved by incorporating knowledge
about term relatedness. For instance, it can be statistical
knowledge about word co-occurrence (see e.g. [10]),
lexical knowledge about synonyms and related words (see
e.g. [22]), or ontological knowledge about classes,
individuals, and their relationships (see
e.g. [23]). In the KNOW-axis, 0 represents the
situation where only string similarity is used during the term
matching process, while 1 represents the situation where complete
ontological knowledge is used during term matching.
The three-dimensional space contained in the cube (see
Figure 2) represents the semantic
continuum where the origin (0,0,0) is a purely syntactic search,
the point with coordinates (1,1,1) is a fully semantic search, and
all points in between represent search approaches in which
semantics is enabled to different extents. C-Search can be
positioned anywhere in the semantic continuum with syntactic
search being its base case, and semantic search being the optimal
solution, at the moment beyond the available technology.
4 Concept Search
C-Search is implemented according to the model of
Equations 1 and 2 from
Section 2. In our proposed solution,
C-Search reuses retrieval models (Model) and data
structures (Data Structure) of syntactic search with the
only difference in that now words (W) are substituted with
concepts (C) and syntactic matching of words (WMatch) is
extended to semantic matching of concepts (SMatch). This idea is
schematically represented in the equation below:
|
Syntatic Search → [[ Term(W → C), Match(WMatch → SMatch) ]] → C−Search |
|
Let us consider in details how the words in W are converted
into the complex concepts in C and also how the semantic
matching SMatch is implemented.
4.1 From Words To Complex Concepts (W → C)
Searching documents, in C-Search, is implemented using
complex concepts expressed in a propositional Description Logic
(DL) language (i.e., a DL language without roles). Complex
concepts are computed by analyzing meaning of the words and
phrases.
Single words are converted into atomic concepts uniquely
identified as lemma-sn, where lemma is the lemma of
the word, and sn is the sense number in WordNet. For
instance, the word dog used in the sense of a domestic dog,
which is the first sense in WordNet, is converted into the atomic
concept dog-1. The conversion of words into concepts is
performed as follows. First, we look up and enumerate all meanings
of the word in WordNet. Next, we perform word sense filtering,
i.e., we discard word senses which are not relevant in the given
context. In order to do this, we follow the approach presented
in [31], which exploits POS tagging information and
WordNet lexical database for disambiguation of words in short noun
phrases. Differently from [31] we do not use the
disambiguation technique which leaves only the most probable sense
of the word, because of its low accuracy. If more than one sense
is left after the word sense filtering step then we keep all these
senses. If no senses from WordNet are found then lemma is
used as the identifier for the atomic concept. In this case,
C-Search is reduced to syntactic search.
Complex concepts are computed by extracting phrases and by
analyzing their meaning. Noun phrases are translated into the
logical conjunction of atomic concepts corresponding to the words.
For instance, the noun phrase A little dog represents a
concept, whose extension is the set of all dogs of a small size.
It is translated into the concept little-4 ∩dog-1.
Concepts in natural language can be described ambiguously. For
instance, the phrase A little dog or a huge cat represents a
concept which encodes the fact that it is unknown whether the
only animal described in the document is a little dog
or a huge cat. In order to support complex concepts which
encode uncertainty (partial information) coming from the
coordination conjunction OR in natural language, we
introduce the notion of descriptive phrase. We define
descriptive phrase as a set of noun phrases, representing
alternative concepts, connected by OR:
|
descriptive_phrase :: = noun_phrase {OR noun_phrase} |
| (3) |
Descriptive phrases are translated into logical
disjunction of formulas corresponding to the noun phrases. For
instance, phrase A little dog or a huge cat is translated
into concept (little-4 ∩dog-1) ∪(huge-1 ∩cat-1). To locate descriptive phrases
we, first, follow a standard NLP pipeline to locate noun phrases,
i.e., we perform sentence detection, tokenization, part-of-speech
(POS) tagging, and noun phrase chunking. The first step allows as
to locate noun phrases. As a second step, that we call descriptive
phrase chunking, we locate descriptive phrases satisfying
Formula 3. Our implementation is based
on GATE [8].
Queries usually are short phrases (i.e., 1-3 words) and, as shown
in [31], standard NLP technology, primarily designed
to be applied on full-fledged sentences, is not effective enough
in this application scenario. An atomic concept, in a query, can
be computed incorrectly, because of the selection of a wrong
part-of-speech tag. In order to address this problem, for
short queries, we use a POS-tagger which is specifically trained
on short phrases [31]. On the other hand, for long
queries (i.e., 4 words or more), we use the standard NLP
technology.
Even if atomic concepts are computed correctly, complex concepts
can be erroneously computed. One of the reasons is that a complex
concept can be represented as a sequence of words without
following the grammar for noun phrases. For instance, the query
cat huge is converted into two atomic concepts
cat-1 and huge-1, while the correct concept
might be cat-1 ∩huge-1. Another reason is that
a query describing more than one concept, without properly
separating them, can be recognized as a single complex concept.
For instance, the query dog cat is converted into the
concept dog-1 ∩cat-1, while the user might be
actually looking for a document describing both animals, i.e.,
dog-1 and cat-1. The examples described above
show that, in general, it is unknown how atomic concepts Aq1,
..., Aqn, extracted from short queries, should be combined
in order to build complex query concepts. To represent this
uncertainty we use the following query
|
(Aq1 AND ... AND Aqn ) AND (Aq1 ∪... ∪Aqn) |
| (4) |
where the first part (Aq1 AND ... AND Aqn) encodes the fact that it is known that the query answer
should contain documents which are relevant to all the atomic
concepts in the query. The second part, i.e., the complex concept
Aq1 ∪... ∪Aqn, can be equivalently rewritten as
(Aq1) ∪(Aq2) ∪... ∪(Aq1 ∩Aq2) ∪... ∪(Aq1 ∩... ∩Aqn) and, therefore, encodes
the fact that it is unknown which complex concept (e.g.,
Aq1 ∩Aq2, or Aq1 ∩... ∩Aqn) should be
actually described.
4.2 From Word to Concept Matching (WMatch → SMatch)
In C-Search, we allow the search of documents describing
concepts which are semantically related to query concepts. We
assume that, when a user is searching for a concept, she is also
interested in more specific concepts.1 For example, the extension of concept
(little-4 ∩dog-1) ∪(huge-1 ∩cat-1) is a subset of the extension of concept
carnivore-1. Therefore, documents describing the former
concept should be returned as answers to the query describing the
latter concept. Formally a query answer A(Cq, T) is
defined as follows:
|
A(Cq, T) = {d | ∃Cd ∈ d , s.t. T |= Cd ⊆ Cq} |
| (5) |
where Cq is a complex query concept extracted from
the query q, Cd is a complex document concept extracted from the
document d, and T is a terminological knowledge base
(the background knowledge) which is used in order to check if
Cd is more specific then Cq.
Equation 5 states that the answer to a query
concept Cq is the set of all documents d, such that, there
exists concept Cd in d which is more specific than the query
concept Cq.
During query processing we need to compute A(Cq, T)
for every query concept Cq in the query. One approach is to
sequentially iterate through each concept Cd, compare it to the
query concept Cq using semantic matching [17], and
collect those concepts for which semantic matching returns
more specific ( ⊆ ). However, this approach may become
prohibitory expensive as there may be thousands and millions of
concepts described in documents. In order to allow for the
efficient computation of A(Cq, T), we propose an
approach described below.
Let us assume, as it is the case in the current implementation,
that T consists of the terminological knowledge base
TWN generated from WordNet and extended by words
(represented as concepts) for which no senses in WordNet are
found. One small fragment of TWN is represented in
Figure 3.
Figure 3: Example of terminological knowledge base TWN
TWN can be thought of as an acyclic graph, where
links represent subsumption axioms in the form Ai ⊆ Aj, with Ai and Aj atomic concepts.
Concepts Cd and Cq, are created by translating descriptive
phrases into propositional DL formulas (see Section 4.1
for details). The resulting concepts are disjunctions (∪) of
conjunctions (∩) of atomic concepts (A) without negation,
i.e., Cd ≡ ∪∩Ad and Cq ≡ ∪∩Aq.
For example, a possible document concept and query concept are:
Cd ≡ (little−4 ∩dog−1) ∪(huge−1 ∩cat−1) Cq ≡ (small−4 ∩canine−2) ∪(large−1 ∩feline−1) |
|
By substituting Cd with ∪∩Ad, Cq with ∪∩Aq, and T with TWN in
Equation 5, we obtain:
|
A(∪∩Aq, TWN) = {d | ∃(∪∩Ad) ∈ d , s.t. TWN |= ∪∩Ad ⊆ ∪∩Aq} |
| (6) |
Let by C∪∩Aq denote the set of all complex
document concepts ∪∩Ad (in all documents d), which
are equivalent to or more specific than ∪∩Aq, in
formulas
|
C∪∩Aq = { ∪∩Ad | TWN |= ∪∩Ad ⊆ ∪∩Aq} |
| (7) |
Then Equation 6 can be rewritten as
follows:
|
A(∪∩Aq, TWN) = {d | ∃(∪∩Ad) ∈ d , s.t. (∪∩Ad) ∈ C∪∩Aq } |
| (8) |
In order to compute set C∪∩Aq, as defined
in Equation 7, we need to solve the following
subsumption problem
Given that TWN consists only of
subsumption axioms between atomic concepts, and that concepts
∪∩Ad and ∪∩Aq do not contain negations, the
problem in Equation 9 can be reduced
to the set of subsumption problems
This problem reduction is obtained by applying the following three
equations:2
|
TWN|=∪∩Ad ⊆ ∪∩Aq iff for all ∩Ad in ∪∩Ad,TWN|=∩Ad ⊆ ∪∩Aq |
| (11) |
|
TWN|=∩Ad ⊆ ∪∩Aq iff there exists ∩Aq in ∪∩Aq,TWN|=∩Ad ⊆ ∩Aq |
| (12) |
|
TWN|=∩Ad ⊆ ∩Aq ifffor all Aq in ∩Aq , TWN|=∩Ad ⊆ Aq |
| (13) |
Notice that the second part of each equation is the same as the
first part of the following equation, and that the first part of
Equation 11 and the last part of
Equation 13 are exactly
Equations 9 and
10. This proves that the above problem
reduction is correct.
If by C∩C we denote a set of all conjunctive
components ∩Ad, which are equivalent to or more specific
than concept C, i.e.,
|
C∩C = { ∩Ad | TWN |=∩Ad ⊆ C}, where C ∈ {Aq, ∩Aq, ∪∩Aq} |
| (14) |
Given Equations 11, 12,
and 13, the query answer A(Cq, TWN), as defined in 8, can be computed
by using the algorithm in the figure below.
The five phases of the algorithm
can be explained as follows (the next section reports how these
phases are actually implemented and their output on a running
example):
- Phase 1 (line 6) We compute C∩Aq,
i.e., the set of all ∩Ad, such that, ∩Ad ⊆ Aq.
- Phase 2 (lines 5-13) We compute C∩∩Aq, i.e., a set of all ∩Ad, such that, ∩Ad ⊆ ∩Aq. As it follows from
Equation 13, ∩Ad ∈ C∩∩Aq only if for every Aq in ∩Aq,
∩Ad ∈ C∩Aq. To compute
C∩∩Aq, we intersect sets
C∩Aq for all Aq in ∩Aq.
- Phase 3 (lines 2-15) We compute the set
C∩∪∩Aq, i.e., the set of all ∩Ad, such that, ∩Ad ⊆ ∪∩Aq. As it follows
from Equation 12, ∩Ad ∈ C∩∪∩Aq if ∩Ad ∈ C∩∩Aq at least for one ∩Aq in ∪∩Aq. To compute the set C∩∪∩Aq,
we take the union of all the sets C∩∩Aq for
all ∩Aq in ∪∩Aq.
- Phase 4 (line 16) We compute the set C∪∩Aq, i.e., the set of all complex document concepts ∪∩Ad, such that, ∪∩Ad ⊆ ∪∩Aq. As
it follows from Equation 11, ∪∩Ad ∈ C∪∩Aq only if all the conjunctive
components ∩Ad in ∪∩Ad belong to
C∩∪∩Aq. To compute the set
C∪∩Aq, we collect all ∪∩Ad which
consist only from conjunctive components ∩Ad in
C∩∪∩Aq.
- Phase 5 (line 17) We compute A(∪∩Aq,TWN) as defined in Equation 8,
i.e., by collecting all the documents which contain concepts from
C∪∩Aq.
5 Concept Search via Inverted Indexes
In C-Search, every document is represented as an enumerated
sequence of conjunctive components ∩Ad possibly connected
by symbol "∪". For example, in
Figure 4 we show the sequences of
∩Ad extracted from documents in
Figure 1.
Figure 4: Document and Query Representations
Rectangles in Figure 4 represent
either conjunctive components ∩Ad or the disjunction symbol
"∪", a number in a square at the left side of a rectangle
represents the position of the rectangle in the whole
sequence. Note, that symbol "∪" is used to specify that
conjunctive components ∩Ad connected by this symbol form a
single disjunctive concept ∪∩Ad, namely:
|
∪∩Ad :: = (∩Ad) { ("∪") (∩Ad)} |
| (15) |
For example, the first three positions in the
sequence for document D3 in
Figure 4 represent the concept
(little-4 ∩dog-1) ∪(huge-1 ∩cat-1).
The resulting document representations (see
Figure 4) are indexed using a
positional inverted index. In a positional inverted index, as used
in syntactic search, there are two parts: the dictionary,
i.e., a set of terms (t) used for indexing; and a set of posting
lists P(t). A posting list P(t) is a list of all postings for term
t:
|
P(t) = [〈d, freq, [position] 〉] |
|
where 〈 d, freq, [position]〉 is a
posting consisting of a document d associated with term t, the
frequency freq of t in d, and a list [position] of
positions of t in d.
In C-Search, we adopt a positional inverted index to index
conjunctive components ∩Ad by all more general or
equivalent atomic concepts from TWN3. For example, in
Figure 5 we show a fragment of the
positional inverted index created by using the document
representations in Figure 4.
| Dictionary (t) | Posting lists (P(t)) |
| ∪ | [〈D3,1,[2] 〉] |
| baby-3 | [〈D1,1,[1] 〉] |
| canine-2 | [〈D1,1,[1] 〉; 〈D3,1,[1] 〉] |
| carnivore-1 | [〈D1,2,[1,3] 〉; 〈D3,2,[1,3] 〉] |
| computer-1 | [〈D2,1,[1] 〉] |
| feline-1 | [〈D1,1,[3] 〉; 〈D3,1,[3] 〉] |
| leave | [〈D3,1,[4] 〉] |
| little-4 | [〈D1,1,[1] 〉; 〈D3,1,[1] 〉] |
Figure 5: Positional Inverted Index
The inverted index dictionary, in C-Search, consists of
atomic concepts from TWN (e.g., concepts
baby-3 and canine-2 in
Figure 5), and symbol "∪" (e.g., the
first term in Figure 5). The posting list
P(A) for an atomic concept A stores the positions of
conjunctive components ∩Ad, such that, ∩Ad ⊆ A. For instance, P(canine-2) = [〈D1,1,[1] 〉;〈D3,1,[1] 〉], which means that at first position in
documents D1 and D3 there are conjunctive components (i.e.,
small-4 ∩baby-3 ∩dog-1 and
little-4 ∩dog-1) which are more specific than
canine-2. The posting list P(∪) stores the positions
of the symbol "∪".
Now, let us see how Algorithm 1 in
Section 4.2 can be implemented by using the
positional information of conjunctive components ∩Ad stored
in the inverted index. Notice that below instead of conjunctive
components themselves we work only with their positions.
- Phase 1 Positions of conjunctive components ∩Ad
in the set C∩Aq (line 6 in Algorithm 1)
are computed by fetching the posting list P(Aq) for an atomic
concept Aq. For instance (see Figure 5),
|
C∩little−4 = [〈D1,1,[1] 〉; 〈D3,1,[1] 〉] C∩carnivore−1 = [〈D1,2,[1,3] 〉; 〈D3,2,[1,3] 〉] |
|
- Phase 2 The intersection of the sets of conjunctive
components (line 10 in Algorithm 1) is implemented by the
intersection of corresponding posting lists. For instance,
|
C∩little−4 ∩carnivore−1 = [〈D1,1,[1] 〉; 〈D3,1,[1] 〉] |
|
- Phase 3 The union of the sets of conjunctive
components (line 14 in Algorithm 1) is implemented by
uniting corresponding posting lists. For instance,
|
C∩canine−2 ∪feline−1 = [〈D1,2,[1,3] 〉; 〈D3,2,[1,3] 〉] |
|
- Phase 4 Every concept in set C∪∩Aq (line 16 in Algorithm 1) should consists only from
the conjunctive components in C∩∪∩Aq.
In order to find the positions of such concepts, we take the union
of the posting lists for C∩∪∩Aq with
the posting list for the symbol "∪". Then we filter out all
the positions which does not comply with the pattern defined in
Equation 15. For instance, for complex query
concept canine-2 ∪feline-1, we will find the
following complex document concepts:
〈D1,1,[1] 〉⇒ 1 | small−4 ∩baby−3 ∩dog−1 〈D1,1,[3] 〉⇒ 3 | huge−1 ∩white−1 ∩cat−1 〈D3,1,[1,2,3] 〉⇒ 1 | little−4 ∩dog−1 2 | ∪ 3 | huge−1 ∩cat−1 |
|
- Phase 5 The query answer (line 17 in
Algorithm 1) is computed by collecting the documents from
all the postings. For instance,
|
A(canine−2 ∪feline−1, TWN) = {D1, D3} |
|
If n is a number of atomic concepts Aq in the query concept
∪∩Aq, then to compute A(Cq, TWN) it
takes n posting list merges (i.e., intersections and unions).
Note that, in a positional inverted index, the same number of
posting list merges is required to process a phrase query
consisting of n+1 words [20].
In C-Search, query concepts Cq can be combined into more
complex queries q by using the boolean operators AND and NOT.
Query answer A(q, TWN) in this case is computed by
recursively applying the following rules:
A(qi AND qj, TWN) = A(qi, TWN) ∩A(qj, TWN) A(qi NOT qj, TWN) = A(qi, TWN) / A(qj, TWN) |
|
For instance, the query answer for query
baby-1 AND dog-1 (in
Figure 4) is computed as follows:
A(baby-1 AND dog-1, TWN) =
A(baby-1, TWN) ∩A(dog-1, TWN) = ∅∩{D1, D3} = ∅
Relevance of documents, in C-Search, is computed by adopting
the cosine similarity from the vector space model. Concepts are
weighted by tf-idf weight measure modified to take into
account the semantic relatedness of concepts.
6 Evaluation
In order to evaluate our approach, we built two IR systems. The
first system is an instantiation of syntactic search and is build
on top of
Lucene4,
an open source IR toolkit used in many search
applications5. Standard
tokenization and English Snowball stemmer were used for document
and query preprocessing. The AND operator was used as a default
boolean operator in a query. The second system is a semantics
enabled version of Lucene, implemented following the methodology
described in Sections 4
and 5.
As a data-set for our experiments, we used the TREC ad-hoc
document
collection6
(disks 4 and 5 minus the Congressional Record documents) and three
query sets: TREC6 (topics 301-350), TREC7 (topics 351-400) and
TREC8 (topics 401-450). Only the title for each topic was used as
a query. The whole data-set consists of 523,822 documents and 150
queries.
In the evaluation we used the standard IR measures and in
particular the mean average precision (MAP) and precision at K
(P@K), where K was set to 5, 10, and 15. The average precision for
a query is the mean of the precision obtained after each relevant
document is retrieved (using 0 as the precision for not retrieved
documents which are relevant). MAP is the mean of the average
precisions for all the queries in the test collection. P@K is the
percentage of relevant documents among the top K ranked documents.
MAP is used to evaluate the overall accuracy of IR system, while
P@K is used to evaluate the utility of IR system for users who
only see the top K results.
First, in Table 1, we report the evaluation results
for the two systems and further, in figure 6,
we provide recall-precision graphs, i.e., we plot precision as a
function of recall, for these systems.
Table 1: Evaluation results
| TREC6 (301-350) |
| | MAP | P@5 | P@10 | P@15 |
| Lucene | 0.1361 | 0.3200 | 0.2960 | 0,2573 |
| C-Search(Lucene) | 0.1711(+25.7%) | 0.3920(+22.5%) | 0.3480(+17.6%) | 0.3000(+16.6%) | |
| TREC7 (351-400) |
| | MAP | P@5 | P@10 | P@15 |
| Lucene | 0.1138 | 0.3560 | 0.3280 | 0.3000 |
| C-Search(Lucene) | 0.1375(+20.8%) | 0.4200(+18.0%) | 0.3680(+12.2%) | 0.3427(+14.2%) | |
| TREC8 (401-450) |
| | MAP | P@5 | P@10 | P@15 |
| Lucene | 0.1689 | 0.4320 | 0.4000 | 0.3573 |
| C-Search(Lucene) | 0.2070(+22.6%) | 0.4760(+10.2%) | 0.4280(+7.0%) | 0.4013(+12.3%) |
Figure 6: Recall-Precision Graphs
The experiments show that, on TREC ad-hoc data sets,
C-Search performs better than purely syntactic search, which
supports the underlying assumption of our approach. In particular,
from Table 1 we observe that C-Search improves
precision P@K for all K in all three TREC data sets. This is
coherent with the intuition that semantics improve on precision.
Notice that it means that we are able to show users more relevant
documents at the top of the list. From
Figure 6 we observe that the recall-precision
graphs for C-Search are above those for Lucene, which means
that the improvement in precision achieved by C-Search does
not decrease recall.
7 Related work
The fact that the syntactic nature of classical IR leads to
problems with precision was recognized in the IR community long
ago (e.g., see [29]). There were two major approaches
to addressing this problem. The first is based on natural language
processing and machine learning techniques in which (noun) phrases
in a document corpus are identified and organized in a subsumption
hierarchy which is then used to improve the precision of retrieval
(e.g., see [30]). The second is based on using a
linguistic database which is used to associate words in a document
corpus with atomic lexical concepts in this database and then to
index these documents with the associated concepts (e.g.,
see [27]). Our approach is different from both these
approaches. In fact, the former approach is still essentially
syntactic (and semantics is only implicitly derived with no
guarantee of correctness), while in the latter approach only
atomic concepts are indexed, whereas C-Search allows for
indexing of complex concepts and explicitly takes into account
possible relations between them. More importantly, our approach
extends syntactic search and does not replace it as in the
latter approach and, therefore, supports the continuum from purely
syntactic to fully semantic search.
In the Semantic Web community, semantic search is primarily seen
as the task of querying an RDF graph based on the mapping of terms
appearing in the input natural language query to the elements of
the graph. An analysis of existing semantic search systems is
provided in [18]. Our approach is fundamentally
different because, similarly to classical IR, the input query is
mapped to document contents and not to elements of a knowledge
representation structure. Document retrieval approaches developed
in the context of the Semantic Web are surveyed
in [19]. The matching of document and query
representations, in these approaches, is based on query expansion
(e.g., see [6]), graph traversal (e.g.,
see [25]), and RDF reasoning (e.g.,
see [9]). Differently from these approaches,
C-Search is based on the semantic matching of complex
concepts, where semantic matching is implemented by using
positional inverted index. Hybrid Search [3] is
similar to our approach in that it integrates syntactic search
with semantic search. Differently from us, in [3],
semantic search is implemented on metadata and is totally
separated from syntactic search, implemented on keywords. Instead,
in our approach semantic search is seamlessly integrated into
syntactic search by substitution words with concepts and at the
same time keeping the underlying machinery of syntactic search.
8 Conclusion
In this paper we have presented an approach where syntactic search
is extended with a semantic layer. The proposed approach performs
as good as syntactic search while allowing for an improvement
whenever semantics are available and can be exploited. The reported
experimental results provide a proof of concept of the quality of
proposed solution.
We are still at the beginning and a lot of work still needs
to be done. Future work includes:
- the development of more accurate document relevance metrics based on both syntactic and semantic similarity of query and document descriptions;
- the integration of more accurate algorithms for concept identification;
- providing support for queries in which concepts can be associated with a semantic scope such as equivalence, more/less general, disjoint;
- improving the performance of the system. Even if the system
is reasonably fast in the sense that it provides answers in much
less than a second, namely in a time which is reasonable for the
user to wait, it is slower than Lucene and we believe that there
is a lot of scope for improving its efficiency.
References
- [1]
-
T. Andreasen, P. A. Jensen, J. F. Nilsson, P. Paggio, B. S.
Pedersen, H. E.
Thomsen, Content-based text querying with ontological descriptors, Data &
Know. Eng. 48 (2) (2004) 199-219.
- [2]
-
F. Baader, D. Calvanese, D. McGuinness, D. Nardi,
P. Patel-Schneider, The
Description Logic Handbook: Theory, Implementation and Applications,
Cambridge University Press, 2003.
- [3]
-
R. Bhagdev, S. Chapman, F. Ciravegna, V. Lanfranchi, D. Petrelli,
Hybrid
search: Effectively combining keywords and semantic searches, in: ESWC, 2008.
- [4]
-
A. Budanitsky, G. Hirst, Evaluating wordnet-based measures of
lexical semantic
relatedness, Computational Linguistics 32 (1) (2006) 13-47.
- [5]
-
P. Castells, M. Fernández, D. Vallet, An adaptation of the
vector-space
model for ontology-based information retrieval, IEEE Trans. Knowl. Data Eng.
19 (2) (2007) 261-272.
- [6]
-
I. Celino, E. D. Valle, D. Cerizza, A. Turati, Squiggle: a
semantic search
engine for indexing and retrieval of multimedia content, in: SEMPS, 2006.
- [7]
-
W. B. Croft, User-specified domain knowledge for document
retrieval, in: SIGIR,
1986.
- [8]
-
H. Cunningham, D. Maynard, K. Bontcheva, V. Tablan, GATE: A
framework and
graphical development environment for robust NLP tools and applications, in:
40th Anniversary Meeting of the Association for Computational Linguistics,
2002.
- [9]
-
J. Davies, R. Weeks, QuizRDF: Search technology for the semantic
web, in:
HICSS, 2004.
- [10]
-
S. C. Deerwester, S. T. Dumais, T. K. Landauer, G. W. Furnas,
R. A. Harshman,
Indexing by latent semantic analysis, Journal of the American Society of
Information Science 41 (6) (1990) 391-407.
- [11]
-
W. A. Gale, K. W. Church, D. Yarowsky, One sense per discourse,
in: HLT '91:
Proceedings of the workshop on Speech and Natural Language, 1992.
- [12]
-
F. Giunchiglia, B. Dutta, V. Maltese, Faceted lightweight
ontologies, in:
Conceptual Modeling: Foundations and Applications, 2009.
- [13]
-
F. Giunchiglia, U. Kharkevich, I. Zaihrayeu, Concept search:
Semantics enabled
syntactic search, in: SemSearch2008 workshop at ESWC, 2008.
- [14]
-
F. Giunchiglia, U. Kharkevich, I. Zaihrayeu, Concept search, in:
ESWC, 2009.
- [15]
-
F. Giunchiglia, M. Marchese, I. Zaihrayeu, Encoding
classifications into
lightweight ontologies, in: Journal on Data Semantics (JoDS) VIII, Winter
2006.
- [16]
-
F. Giunchiglia, P. Shvaiko, M. Yatskevich, Discovering missing
background
knowledge in ontology matching, in: Proc. of ECAI, 2006.
- [17]
-
F. Giunchiglia, M. Yatskevich, P. Shvaiko, Semantic matching:
Algorithms and
implementation., Journal on Data Semantics (JoDS) 9 (2007) 1-38.
- [18]
-
M. Hildebrand, J. van Ossenbruggen, L. Hardman, An analysis of
search-based
user interaction on the semantic web, Tech. Rep. INS-E0706, CWI (2007).
- [19]
-
C. Mangold, A survey and classification of semantic search
approaches, Int. J.
Metadata Semantics and Ontology 2 (1) (2007) 23-34.
- [20]
-
C. Manning, P. Raghavan, H. Schtze, Introduction to Information
Retrieval,
Cambridge University Press, 2008.
- [21]
-
G. Miller, WordNet: An electronic Lexical Database, MIT Press,
1998.
- [22]
-
D. I. Moldovan, R. Mihalcea, Using wordnet and lexical operators
to improve
internet searches, IEEE Internet Computing 4 (1) (2000) 34-43.
- [23]
-
Gabor Nagypl, Improving information retrieval effectiveness by using
domain
knowledge stored in ontologies, OTM Workshops 2005, LNCS 3762.
- [24]
-
G. Pass, A. Chowdhury, C. Torgeson, A picture of search, in:
InfoScale'06:
Proceedings of the 1st international conference on Scalable information
systems, ACM, New York, NY, USA, 2006.
- [25]
-
C. Rocha, D. Schwabe, M. de Aragao, A hybrid approach for
searching in the
semantic web, in: 13th International World Wide Web Conference, 2004.
- [26]
-
M. Sanderson, Retrieving with good sense, Inf. Retr. 2 (1) (2000)
49-69.
- [27]
-
H. Schutze, J. O. Pedersen, Information retrieval based on word
senses, in: 4th
Annual Symposium on Document Analysis and Information Retrieval, 1995.
- [28]
-
J. F. Sowa, Conceptual Structures: Information Processing in Mind
and Machine,
Addison-Wesley, 1984.
- [29]
-
C. Stokoe, M. P. Oakes, J. Tait, Word sense disambiguation in
information
retrieval revisited (2003) 159-166.
- [30]
-
W. A. Woods, Conceptual indexing: A better way to organize
knowledge, Tech.
Rep. TR-97-61, Sun Microsystems Laboratories, USA (1997).
- [31]
-
I. Zaihrayeu, L. Sun, F. Giunchiglia, W. Pan, Q. Ju, M. Chi,
X. Huang, From web
directories to ontologies: Natural language processing challenges, in: 6th
International Semantic Web Conference (ISWC 2007), Springer, 2007.
- [32]
-
C. Zhai, Fast statistical parsing of noun phrases for document
indexing, Fifth
Conference on Applied Natural Language Processing (1997) 312-319.
Footnotes:
1This could be
easily generalized to any set of semantically related concepts.
The impact of this choice onto the system performance is part of
the future work.
2Note, that in general,
Equation 12 cannot be applied. One such case
is when negation is allowed, a counterexample is |= Ai ⊆ Aj ∪¬Aj. A second case is when T
contains axioms in the form Ai ⊆ Aj ∪Ak;
consider, e.g., T = { Ai ⊆ Aj ∪Ak }|= Ai ⊆ Aj ∪Ak.
3In [13], we showed how searching for
complex concepts can be implemented by indexing documents directly
by these concepts. The problem of this method is that the size of
the inverted index vocabulary, in the worst case, is exponential
with respect to size of terminology T. In the current
paper, we propose an approach in which we index complex concepts
by atomic concepts using the positional information in the
inverted index. The size of the vocabulary in this case is the
same as the size of T.
4http://lucene.apache.org/java/docs/index.html
5http://wiki.apache.org/lucene-java/PoweredBy
6http://trec.nist.gov/data/test_coll.html
File translated from
TEX
by
TTHgold,
version 4.00. On 23 Jan 2010, 16:50.
|