- Source: Range mode query
In data structures, the range mode query problem asks to build a data structure on some input data to efficiently answer queries asking for the mode of any consecutive subset of the input.
Problem statement
Given an array
A
[
1
:
n
]
=
[
a
1
,
a
2
,
.
.
.
,
a
n
]
{\displaystyle A[1:n]=[a_{1},a_{2},...,a_{n}]}
, we wish to answer queries of the form
m
o
d
e
(
A
,
i
:
j
)
{\displaystyle mode(A,i:j)}
, where
1
≤
i
≤
j
≤
n
{\displaystyle 1\leq i\leq j\leq n}
. The mode
m
o
d
e
(
S
)
{\displaystyle mode(S)}
of any array
S
=
[
s
1
,
s
2
,
.
.
.
,
s
k
]
{\displaystyle S=[s_{1},s_{2},...,s_{k}]}
is an element
s
i
{\displaystyle s_{i}}
such that the frequency of
s
i
{\displaystyle s_{i}}
is greater than or equal to the frequency of
s
j
∀
j
∈
{
1
,
.
.
.
,
k
}
{\displaystyle s_{j}\;\forall j\in \{1,...,k\}}
. For example, if
S
=
[
1
,
2
,
4
,
2
,
3
,
4
,
2
]
{\displaystyle S=[1,2,4,2,3,4,2]}
, then
m
o
d
e
(
S
)
=
2
{\displaystyle mode(S)=2}
because it occurs three times, while all other values occur fewer times. In this problem, the queries ask for the mode of subarrays of the form
A
[
i
:
j
]
=
[
a
i
,
a
i
+
1
,
.
.
.
,
a
j
]
{\displaystyle A[i:j]=[a_{i},a_{i+1},...,a_{j}]}
.
= Theorem 1
=Let
A
{\displaystyle A}
and
B
{\displaystyle B}
be any multisets. If
c
{\displaystyle c}
is a mode of
A
∪
B
{\displaystyle A\cup B}
and
c
∉
A
{\displaystyle c\notin A}
, then
c
{\displaystyle c}
is a mode of
B
{\displaystyle B}
.
= Proof
=Let
c
∉
A
{\displaystyle c\notin A}
be a mode of
C
=
A
∪
B
{\displaystyle C=A\cup B}
and
f
c
{\displaystyle f_{c}}
be its frequency in
C
{\displaystyle C}
. Suppose that
c
{\displaystyle c}
is not a mode of
B
{\displaystyle B}
. Thus, there exists an element
b
{\displaystyle b}
with frequency
f
b
{\displaystyle f_{b}}
that is the mode of
B
{\displaystyle B}
. Since
b
{\displaystyle b}
is the mode of
B
{\displaystyle B}
and that
c
∉
A
{\displaystyle c\notin A}
, then
f
b
>
f
c
{\displaystyle f_{b}>f_{c}}
. Thus,
b
{\displaystyle b}
should be the mode of
C
{\displaystyle C}
which is a contradiction.
Results
Lower bound
Any data structure using
S
{\displaystyle S}
cells of
w
{\displaystyle w}
bits each needs
Ω
(
log
n
log
(
S
w
/
n
)
)
{\displaystyle \Omega \left({\frac {\log n}{\log(Sw/n)}}\right)}
time to answer a range mode query.
This contrasts with other range query problems, such as the range minimum query which have solutions offering constant time query time and linear space. This is due to the hardness of the mode problem, since even if we know the mode of
A
[
i
:
j
]
{\displaystyle A[i:j]}
and the mode of
A
[
j
+
1
:
k
]
{\displaystyle A[j+1:k]}
, there is no simple way of computing the mode of
A
[
i
:
k
]
{\displaystyle A[i:k]}
. Any element of
A
[
i
:
j
]
{\displaystyle A[i:j]}
or
A
[
j
+
1
:
k
]
{\displaystyle A[j+1:k]}
could be the mode. For example, if
m
o
d
e
(
A
[
i
:
j
]
)
=
a
{\displaystyle mode(A[i:j])=a}
and its frequency is
f
a
{\displaystyle f_{a}}
, and
m
o
d
e
(
A
[
j
+
1
:
k
]
)
=
b
{\displaystyle mode(A[j+1:k])=b}
and its frequency is also
f
a
{\displaystyle f_{a}}
, there could be an element
c
{\displaystyle c}
with frequency
f
a
−
1
{\displaystyle f_{a}-1}
in
A
[
i
:
j
]
{\displaystyle A[i:j]}
and frequency
f
a
−
1
{\displaystyle f_{a}-1}
in
A
[
j
+
1
:
k
]
{\displaystyle A[j+1:k]}
.
a
≠
c
≠
b
{\displaystyle a\not =c\not =b}
, but its frequency in
A
[
i
:
k
]
{\displaystyle A[i:k]}
is greater than the frequency of
a
{\displaystyle a}
and
b
{\displaystyle b}
, which makes
c
{\displaystyle c}
a better candidate for
m
o
d
e
(
A
[
i
:
k
]
)
{\displaystyle mode(A[i:k])}
than
a
{\displaystyle a}
or
b
{\displaystyle b}
.
Linear space data structure with square root query time
This method by Chan et al. uses
O
(
n
+
s
2
)
{\displaystyle O(n+s^{2})}
space and
O
(
n
/
s
)
{\displaystyle O(n/s)}
query time. By setting
s
=
n
{\displaystyle s={\sqrt {n}}}
, we get
O
(
n
)
{\displaystyle O(n)}
and
O
(
n
)
{\displaystyle O({\sqrt {n}})}
bounds for space and query time.
= Preprocessing
=Let
A
[
1
:
n
]
{\displaystyle A[1:n]}
be an array, and
D
[
1
:
Δ
]
{\displaystyle D[1:\Delta ]}
be an array that contains the distinct values of A, where
Δ
{\displaystyle \Delta }
is the number of distinct elements. We define
B
[
1
:
n
]
{\displaystyle B[1:n]}
to be an array such that, for each
i
{\displaystyle i}
,
B
[
i
]
{\displaystyle B[i]}
contains the rank (position) of
A
[
i
]
{\displaystyle A[i]}
in
D
{\displaystyle D}
. Arrays
B
,
D
{\displaystyle B,D}
can be created by a linear scan of
A
{\displaystyle A}
.
Arrays
Q
1
,
Q
2
,
.
.
.
,
Q
Δ
{\displaystyle Q_{1},Q_{2},...,Q_{\Delta }}
are also created, such that, for each
a
∈
{
1
,
.
.
.
,
Δ
}
{\displaystyle a\in \{1,...,\Delta \}}
,
Q
a
=
{
b
|
B
[
b
]
=
a
}
{\displaystyle Q_{a}=\{b\;|\;B[b]=a\}}
. We then create an array
B
′
[
1
:
n
]
{\displaystyle B'[1:n]}
, such that, for all
b
∈
{
1
,
.
.
.
,
n
}
{\displaystyle b\in \{1,...,n\}}
,
B
′
[
b
]
{\displaystyle B'[b]}
contains the rank of
b
{\displaystyle b}
in
Q
B
[
b
]
{\displaystyle Q_{B[b]}}
. Again, a linear scan of
B
{\displaystyle B}
suffices to create arrays
Q
1
,
Q
2
,
.
.
.
,
Q
Δ
{\displaystyle Q_{1},Q_{2},...,Q_{\Delta }}
and
B
′
{\displaystyle B'}
.
It is now possible to answer queries of the form "is the frequency of
B
[
i
]
{\displaystyle B[i]}
in
B
[
i
:
j
]
{\displaystyle B[i:j]}
at least
q
{\displaystyle q}
" in constant time, by checking whether
Q
B
[
i
]
[
B
′
[
i
]
+
q
−
1
]
≤
j
{\displaystyle Q_{B[i]}[B'[i]+q-1]\leq j}
.
The array is split B into
s
{\displaystyle s}
blocks
b
1
,
b
2
,
.
.
.
,
b
s
{\displaystyle b_{1},b_{2},...,b_{s}}
, each of size
t
=
⌈
n
/
s
⌉
{\displaystyle t=\lceil n/s\rceil }
. Thus, a block
b
i
{\displaystyle b_{i}}
spans over
B
[
i
⋅
t
+
1
:
(
i
+
1
)
t
]
{\displaystyle B[i\cdot t+1:(i+1)t]}
. The mode and the frequency of each block or set of consecutive blocks will be pre-computed in two tables
S
{\displaystyle S}
and
S
′
{\displaystyle S'}
.
S
[
b
i
,
b
j
]
{\displaystyle S[b_{i},b_{j}]}
is the mode of
b
i
∪
b
i
+
1
∪
.
.
.
∪
b
j
{\displaystyle b_{i}\cup b_{i+1}\cup ...\cup b_{j}}
, or equivalently, the mode of
B
[
b
i
t
+
1
:
(
b
j
+
1
)
t
]
{\displaystyle B[b_{i}t+1:(b_{j}+1)t]}
, and
S
′
{\displaystyle S'}
stores the corresponding frequency. These two tables can be stored in
O
(
s
2
)
{\displaystyle O(s^{2})}
space, and can be populated in
O
(
s
⋅
n
)
{\displaystyle O(s\cdot n)}
by scanning
B
{\displaystyle B}
s
{\displaystyle s}
times, computing a row of
S
,
S
′
{\displaystyle S,S'}
each time with the following algorithm:
algorithm computeS_Sprime is
input: Array B = [0:n - 1],
Array D = [0:Delta - 1],
Integer s
output: Tables S and Sprime
let S ← Table(0:n - 1, 0:n - 1)
let Sprime ← Table(0:n - 1, 0:n - 1)
let firstOccurence ← Array(0:Delta - 1)
for all i in {0, ..., Delta - 1} do
firstOccurence[i] ← -1
end for
for i ← 0:s - 1 do
let j ← i × t
let c ← 0
let fc ← 0
let noBlock ← i
let block_start ← j
let block_end ← min{(i + 1) × t - 1, n - 1}
while j < n do
if firstOccurence[B[j]] = -1 then
firstOccurence[B[j]] ← j
end if
if atLeastQInstances(firstOccurence[B[j]], block_end, fc + 1) then
c ← B[j]
fc ← fc + 1
end if
if j = block_end then
S[i * s + noBlock] ← c
Sprime[i × s + noBlock] ← fc
noBlock ← noBlock + 1
block_end ← min{block_end + t, n - 1}
end if
end while
for all j in {0, ..., Delta - 1} do
firstOccurence[j] ← -1
end for
end for
= Query
=We will define the query algorithm over array
B
{\displaystyle B}
. This can be translated to an answer over
A
{\displaystyle A}
, since for any
a
,
i
,
j
{\displaystyle a,i,j}
,
B
[
a
]
{\displaystyle B[a]}
is a mode for
B
[
i
:
j
]
{\displaystyle B[i:j]}
if and only if
A
[
a
]
{\displaystyle A[a]}
is a mode for
A
[
i
:
j
]
{\displaystyle A[i:j]}
. We can convert an answer for
B
{\displaystyle B}
to an answer for
A
{\displaystyle A}
in constant time by looking in
A
{\displaystyle A}
or
B
{\displaystyle B}
at the corresponding index.
Given a query
m
o
d
e
(
B
,
i
,
j
)
{\displaystyle mode(B,i,j)}
, the query is split in three parts: the prefix, the span and the suffix. Let
b
i
=
⌈
(
i
−
1
)
/
t
⌉
{\displaystyle b_{i}=\lceil (i-1)/t\rceil }
and
b
j
=
⌊
j
/
t
⌋
−
1
{\displaystyle b_{j}=\lfloor j/t\rfloor -1}
. These denote the indices of the first and last block that are completely contained in
B
{\displaystyle B}
. The range of these blocks is called the span. The prefix is then
B
[
i
:
m
i
n
{
b
i
t
,
j
}
]
{\displaystyle B[i:min\{b_{i}t,j\}]}
(the set of indices before the span), and the suffix is
B
[
m
a
x
{
(
b
j
+
1
)
t
+
1
,
i
}
:
j
]
{\displaystyle B[max\{(b_{j}+1)t+1,i\}:j]}
(the set of indices after the span). The prefix, suffix or span can be empty, the latter is if
b
j
<
b
i
{\displaystyle b_{j}
.
For the span, the mode
c
{\displaystyle c}
is already stored in
S
[
b
i
,
b
j
]
{\displaystyle S[b_{i},b_{j}]}
. Let
f
c
{\displaystyle f_{c}}
be the frequency of the mode, which is stored in
S
′
[
b
i
,
b
j
]
{\displaystyle S'[b_{i},b_{j}]}
. If the span is empty, let
f
c
=
0
{\displaystyle f_{c}=0}
. Recall that, by Theorem 1, the mode of
B
[
i
:
j
]
{\displaystyle B[i:j]}
is either an element of the prefix, span or suffix. A linear scan is performed over each element in the prefix and in the suffix to check if its frequency is greater than the current candidate
c
{\displaystyle c}
, in which case
c
{\displaystyle c}
and
f
c
{\displaystyle f_{c}}
are updated to the new value. At the end of the scan,
c
{\displaystyle c}
contains the mode of
B
[
i
:
j
]
{\displaystyle B[i:j]}
and
f
c
{\displaystyle f_{c}}
its frequency.
Scanning procedure
The procedure is similar for both prefix and suffix, so it suffice to run this procedure for both:
Let
x
{\displaystyle x}
be the index of the current element. There are three cases:
If
Q
B
[
x
]
[
B
′
[
x
]
−
1
]
≥
i
{\displaystyle Q_{B[x]}[B'[x]-1]\geq i}
, then it was present in
B
[
i
:
x
−
1
]
{\displaystyle B[i:x-1]}
and its frequency has already been counted. Pass to the next element.
Otherwise, check if the frequency of
B
[
x
]
{\displaystyle B[x]}
in
B
[
i
:
j
]
{\displaystyle B[i:j]}
is at least
f
c
{\displaystyle f_{c}}
(this can be done in constant time since it is the equivalent of checking it for
B
[
x
:
j
]
{\displaystyle B[x:j]}
).
If it is not, then pass to the next element.
If it is, then compute the actual frequency
f
x
{\displaystyle f_{x}}
of
B
[
x
]
{\displaystyle B[x]}
in
B
[
i
:
j
]
{\displaystyle B[i:j]}
by a linear scan (starting at index
B
′
[
x
]
+
f
c
−
1
{\displaystyle B'[x]+f_{c}-1}
) or a binary search in
Q
B
[
x
]
{\displaystyle Q_{B[x]}}
. Set
c
:=
B
[
x
]
{\displaystyle c:=B[x]}
and
f
c
:=
f
x
{\displaystyle f_{c}:=f_{x}}
.
This linear scan (excluding the frequency computations) is bounded by the block size
t
{\displaystyle t}
, since neither the prefix or the suffix can be greater than
t
{\displaystyle t}
. A further analysis of the linear scans done for frequency computations shows that it is also bounded by the block size. Thus, the query time is
O
(
t
)
=
O
(
n
/
s
)
{\displaystyle O(t)=O(n/s)}
.
Subquadratic space data structure with constant query time
This method by uses
O
(
n
2
log
log
n
log
n
)
{\displaystyle O\left({\frac {n^{2}\log {\log {n}}}{\log {n}}}\right)}
space for a constant time query. We can observe that, if a constant query time is desired, this is a better solution than the one proposed by Chan et al., as the latter gives a space of
O
(
n
2
)
{\displaystyle O(n^{2})}
for constant query time if
s
=
n
{\displaystyle s=n}
.
= Preprocessing
=Let
A
[
1
:
n
]
{\displaystyle A[1:n]}
be an array. The preprocessing is done in three steps:
Split the array
A
{\displaystyle A}
in
s
{\displaystyle s}
blocks
b
1
,
b
2
,
.
.
.
,
b
s
{\displaystyle b_{1},b_{2},...,b_{s}}
, where the size of each block is
t
=
⌈
n
/
s
⌉
{\displaystyle t=\lceil n/s\rceil }
. Build a table
S
{\displaystyle S}
of size
s
×
s
{\displaystyle s\times s}
where
S
[
i
,
j
]
{\displaystyle S[i,j]}
is the mode of
b
i
∪
b
i
+
1
∪
.
.
.
∪
b
j
{\displaystyle b_{i}\cup b_{i+1}\cup ...\cup b_{j}}
. The total space for this step is
O
(
s
2
)
{\displaystyle O(s^{2})}
For any query
m
o
d
e
(
A
,
i
,
j
)
{\displaystyle mode(A,i,j)}
, let
b
i
′
{\displaystyle b_{i'}}
be the block that contains
i
{\displaystyle i}
and
b
j
′
{\displaystyle b_{j'}}
be the block that contains
j
{\displaystyle j}
. Let the span be the set of blocks completely contained in
A
[
i
:
j
]
{\displaystyle A[i:j]}
. The mode
c
{\displaystyle c}
of the block can be retrieved from
S
{\displaystyle S}
. By Theorem 1, the mode can be either an element of the prefix (indices of
A
[
i
:
j
]
{\displaystyle A[i:j]}
before the start of the span), an element of the suffix (indices of
A
[
i
:
j
]
{\displaystyle A[i:j]}
after the end of the span), or
c
{\displaystyle c}
. The size of the prefix plus the size of the suffix is bounded by
2
t
{\displaystyle 2t}
, thus the position of the mode isstored as an integer ranging from
0
{\displaystyle 0}
to
2
t
{\displaystyle 2t}
, where
[
0
:
2
t
−
1
]
{\displaystyle [0:2t-1]}
indicates a position in the prefix/suffix and
2
t
{\displaystyle 2t}
indicates that the mode is the mode of the span. There are
(
t
2
)
{\displaystyle {\binom {t}{2}}}
possible queries involving blocks
b
i
′
{\displaystyle b_{i'}}
and
b
j
′
{\displaystyle b_{j'}}
, so these values are stored in a table of size
t
2
{\displaystyle t^{2}}
. Furthermore, there are
(
2
t
+
1
)
t
2
{\displaystyle (2t+1)^{t^{2}}}
such tables, so the total space required for this step is
O
(
t
2
(
2
t
+
1
)
t
2
)
{\displaystyle O(t^{2}(2t+1)^{t^{2}})}
. To access those tables, a pointer is added in addition to the mode in the table
S
{\displaystyle S}
for each pair of blocks.
To handle queries
m
o
d
e
(
A
,
i
,
j
)
{\displaystyle mode(A,i,j)}
where
i
{\displaystyle i}
and
j
{\displaystyle j}
are in the same block, all such solutions are precomputed. There are
O
(
s
t
2
)
{\displaystyle O(st^{2})}
of them, they are stored in a three dimensional table
T
{\displaystyle T}
of this size.
The total space used by this data structure is
O
(
s
2
+
t
2
(
2
t
+
1
)
t
2
+
s
t
2
)
{\displaystyle O(s^{2}+t^{2}(2t+1)^{t^{2}}+st^{2})}
, which reduces to
O
(
n
2
log
log
n
log
n
)
{\displaystyle O\left({\frac {n^{2}\log {\log {n}}}{\log {n}}}\right)}
if we take
t
=
log
n
/
log
log
n
{\displaystyle t={\sqrt {\log {n}/\log {\log {n}}}}}
.
= Query
=Given a query
m
o
d
e
(
A
,
i
,
j
)
{\displaystyle mode(A,i,j)}
, check if it is completely contained inside a block, in which case the answer is stored in table
T
{\displaystyle T}
. If the query spans exactly one or more blocks, then the answer is found in table
S
{\displaystyle S}
. Otherwise, use the pointer stored in table
S
{\displaystyle S}
at position
S
[
b
i
′
,
b
j
′
]
{\displaystyle S[b_{i'},b_{j'}]}
, where
b
i
′
,
b
j
′
{\displaystyle b_{i'},b_{j'}}
are the indices of the blocks that contain respectively
i
{\displaystyle i}
and
j
{\displaystyle j}
, to find the table
U
b
i
′
,
b
j
′
{\displaystyle U_{b_{i'},b_{j'}}}
that contains the positions of the mode for these blocks and use the position to find the mode in
A
{\displaystyle A}
. This can be done in constant time.
References
Kata Kunci Pencarian:
- Graph database
- Daftar kata yang dilindungi di SQL
- Range mode query
- Range query (computer science)
- Power Query
- CUBRID
- OBD-II PIDs
- BigQuery
- IBM Cognos Analytics
- Domain Name System
- ArangoDB
- Q code