- Source: Extended Boolean model
The Extended Boolean model was described in a Communications of the ACM article appearing in 1983, by Gerard Salton, Edward A. Fox, and Harry Wu. The goal of the Extended Boolean model is to overcome the drawbacks of the Boolean model that has been used in information retrieval. The Boolean model doesn't consider term weights in queries, and the result set of a Boolean query is often either too small or too big. The idea of the extended model is to make use of partial matching and term weights as in the vector space model. It combines the characteristics of the Vector Space Model with the properties of Boolean algebra and ranks the similarity between queries and documents. This way a document may be somewhat relevant if it matches some of the queried terms and will be returned as a result, whereas in the Standard Boolean model it wasn't.
Thus, the extended Boolean model can be considered as a generalization of both the Boolean and vector space models; those two are special cases if suitable settings and definitions are employed. Further, research has shown effectiveness improves relative to that for Boolean query processing. Other research has shown that relevance feedback and query expansion can be integrated with extended Boolean query processing.
Definitions
In the Extended Boolean model, a document is represented as a vector (similarly to in the vector model). Each i dimension corresponds to a separate term associated with the document.
The weight of term Kx associated with document dj is measured by its normalized Term frequency and can be defined as:
w
x
,
j
=
f
x
,
j
∗
I
d
f
x
m
a
x
i
I
d
f
i
{\displaystyle w_{x,j}=f_{x,j}*{\frac {Idf_{x}}{max_{i}Idf_{i}}}}
where Idfx is inverse document frequency and fx,j the term frequency for term x in document j.
The weight vector associated with document dj can be represented as:
v
d
j
=
[
w
1
,
j
,
w
2
,
j
,
…
,
w
i
,
j
]
{\displaystyle \mathbf {v} _{d_{j}}=[w_{1,j},w_{2,j},\ldots ,w_{i,j}]}
The 2 Dimensions Example
Considering the space composed of two terms Kx and Ky only, the corresponding term weights are w1 and w2. Thus, for query qor = (Kx ∨ Ky), we can calculate the similarity with the following formula:
s
i
m
(
q
o
r
,
d
)
=
w
1
2
+
w
2
2
2
{\displaystyle sim(q_{or},d)={\sqrt {\frac {w_{1}^{2}+w_{2}^{2}}{2}}}}
For query qand = (Kx ∧ Ky), we can use:
s
i
m
(
q
a
n
d
,
d
)
=
1
−
(
1
−
w
1
)
2
+
(
1
−
w
2
)
2
2
{\displaystyle sim(q_{and},d)=1-{\sqrt {\frac {(1-w_{1})^{2}+(1-w_{2})^{2}}{2}}}}
Generalizing the idea and P-norms
We can generalize the previous 2D extended Boolean model example to higher t-dimensional space using Euclidean distances.
This can be done using P-norms which extends the notion of distance to include p-distances, where 1 ≤ p ≤ ∞ is a new parameter.
A generalized conjunctive query is given by:
q
o
r
=
k
1
∨
p
k
2
∨
p
.
.
.
.
∨
p
k
t
{\displaystyle q_{or}=k_{1}\lor ^{p}k_{2}\lor ^{p}....\lor ^{p}k_{t}}
The similarity of
q
o
r
{\displaystyle q_{or}}
and
d
j
{\displaystyle d_{j}}
can be defined as:
:
s
i
m
(
q
o
r
,
d
j
)
=
w
1
p
+
w
2
p
+
.
.
.
.
+
w
t
p
t
p
{\displaystyle sim(q_{or},d_{j})={\sqrt[{p}]{\frac {w_{1}^{p}+w_{2}^{p}+....+w_{t}^{p}}{t}}}}
A generalized disjunctive query is given by:
q
a
n
d
=
k
1
∧
p
k
2
∧
p
.
.
.
.
∧
p
k
t
{\displaystyle q_{and}=k_{1}\land ^{p}k_{2}\land ^{p}....\land ^{p}k_{t}}
The similarity of
q
a
n
d
{\displaystyle q_{and}}
and
d
j
{\displaystyle d_{j}}
can be defined as:
s
i
m
(
q
a
n
d
,
d
j
)
=
1
−
(
1
−
w
1
)
p
+
(
1
−
w
2
)
p
+
.
.
.
.
+
(
1
−
w
t
)
p
t
p
{\displaystyle sim(q_{and},d_{j})=1-{\sqrt[{p}]{\frac {(1-w_{1})^{p}+(1-w_{2})^{p}+....+(1-w_{t})^{p}}{t}}}}
Examples
Consider the query q = (K1 ∧ K2) ∨ K3. The similarity between query q and document d can be computed using the formula:
s
i
m
(
q
,
d
)
=
(
1
−
(
(
1
−
w
1
)
p
+
(
1
−
w
2
)
p
2
p
)
)
p
+
w
3
p
2
p
{\displaystyle sim(q,d)={\sqrt[{p}]{\frac {(1-{\sqrt[{p}]{({\frac {(1-w_{1})^{p}+(1-w_{2})^{p}}{2}}}}))^{p}+w_{3}^{p}}{2}}}}
Improvements over the Standard Boolean Model
Lee and Fox compared the Standard and Extended Boolean models with three test collections, CISI, CACM and INSPEC.
Using P-norms they obtained an average precision improvement of 79%, 106% and 210% over the Standard model, for the CISI, CACM and INSPEC collections, respectively.
The P-norm model is computationally expensive because of the number of exponentiation operations that it requires but it achieves much better results than the Standard model and even Fuzzy retrieval techniques. The Standard Boolean model is still the most efficient.
Further reading
Choi, Jongpill; Kim, Minkoo; Raghavan, Vijay V. (March 2006), "Adaptive relevance feedback method of extended Boolean model using hierarchical clustering techniques", Information Processing & Management, 42 (2): 331–349, doi:10.1016/j.ipm.2005.05.009
Zanger, Daniel Z. (November 2002), "Interpolation of the extended Boolean retrieval model", Information Processing & Management, 38 (6): 743–748, doi:10.1016/S0306-4573(02)00023-7
Fox, E.; Betrabet, S.; Koushik, M.; Lee, W. (1992), Information Retrieval: Algorithms and Data structures; Extended Boolean model, Prentice-Hall, Inc., archived from the original on 2013-09-28, retrieved 2017-09-09
Skorkovská, Lucie; Ircing, Pavel (2009), "Experiments with Automatic Query Formulation in the Extended Boolean Model", Text, Speech and Dialogue, Lecture Notes in Computer Science, vol. 5729, Springer Berlin / Heidelberg, pp. 371–378, doi:10.1007/978-3-642-04208-9_51, hdl:11025/16985, ISBN 978-3-642-04207-2
See also
Information retrieval
References
Kata Kunci Pencarian:
- Tf–idf
- Daftar masalah matematika yang belum terpecahkan
- Extended Boolean model
- Boolean
- Fuzzy retrieval
- Boolean-valued model
- Boolean algebra
- Information retrieval
- DE-9IM
- Boolean circuit
- Boolean function
- Complete Boolean algebra