Source: Count-distinct problem

In computer science, the count-distinct problem (also known in applied mathematics as the cardinality estimation problem) is the problem of finding the number of distinct elements in a data stream with repeated elements.
This is a well-known problem with numerous applications. The elements might represent IP addresses of packets passing through a router, unique visitors to a web site, elements in a large database, motifs in a DNA sequence, or elements of RFID/sensor networks.

Formal definition

Naive solution
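
As a baseline for the approximate methods listed further down, the exact count can be obtained by remembering every distinct element seen so far. The sketch below assumes that this is what the naive approach refers to; the function name `count_distinct_exact` is a hypothetical one chosen for illustration. Its memory use grows with the number of distinct elements, which is what the streaming algorithms try to avoid.

```python
def count_distinct_exact(stream):
    """Return the exact number of distinct items in an iterable.

    Memory grows linearly with the number of distinct items.
    """
    seen = set()
    for item in stream:
        seen.add(item)  # adding an item that is already present is a no-op
    return len(seen)


print(count_distinct_exact(["a", "b", "a", "c", "d", "b", "d"]))  # 4
```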

HyperLogLog algorithm

Streaming algorithms

= Min/max sketches

= Bottom-m sketches

= CVM Algorithm

    [...] then
        If $|B| < s$ then
            Insert $(a_t, u)$ in B
        else
            Let $(a', u')$ be such that $u' = \max\{u'' : (a'', u'') \in B, \forall a''\}$   /* the pair $(a', u')$ whose $u'$ is maximum in B */
            If $u > u'$ then
                $p \leftarrow u$
            else
                Replace $(a', u')$ with $(a_t, u)$
                $p \leftarrow u'$
End For
return $|B| / p$.
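
The listing above can be sketched in Python as follows. The opening steps are missing from the listing, so this sketch assumes the same preamble as the second listing in this section (start with p = 1, delete any earlier copy of the element, draw u uniformly from [0, 1)) and assumes the truncated guard is "if u < p". The function name `cvm_max_eviction` and the optional `rng` parameter are hypothetical.

```python
import random


def cvm_max_eviction(stream, s, rng=None):
    """Estimate the number of distinct items using a buffer of at most s pairs."""
    rng = rng or random.Random()
    p = 1.0
    buf = {}  # element -> the random value u drawn at its latest occurrence
    for a in stream:
        buf.pop(a, None)   # assumed preamble: forget any earlier copy of a
        u = rng.random()   # uniform in [0, 1)
        if u >= p:         # assumed form of the truncated guard "if u < p"
            continue
        if len(buf) < s:
            buf[a] = u
        else:
            a_max = max(buf, key=buf.get)  # pair whose u' is maximum in B
            u_max = buf[a_max]
            if u > u_max:
                p = u
            else:
                del buf[a_max]             # replace (a', u') with (a_t, u)
                buf[a] = u
                p = u_max
    return len(buf) / p
```

When the buffer is larger than the number of distinct elements, p never changes and the function returns the exact count. Finding the maximum with `max` costs O(s) per update; a heap would be a natural refinement but is left out to stay close to the listing.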

The previous version of the CVM algorithm is improved by the following modification by Donald Knuth, which adds a while loop to ensure that B is reduced.

Initialize $p \leftarrow 1$
Initialize max buffer size $s$, where $s \geq 1$
Initialize an empty buffer, B
For each element $a_t$ in data stream $A$ of size $n$ do:
    If $a_t$ is in B then
        Delete $a_t$ from B
    $u \leftarrow$ random number in $[0, 1)$
    If $u \leq p$ then
        Insert $(a_t, u)$ into B
    While $|B| = s \wedge u < p$ do
        Remove every element $(a', u')$ of B with $u' > \frac{p}{2}$
        $p \leftarrow \frac{p}{2}$
    End While
    If $u < p$ then
        Insert $(a_t, u)$ into B
End For
return $|B| / p$.
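
A line-by-line Python transcription of the listing with the while loop might look like the sketch below. The function name `cvm_halving`, the dictionary used for the buffer B, and the optional `rng` parameter are assumptions made for illustration. Because the buffer is a dictionary keyed by element, the second insert simply overwrites the first when both conditions hold.

```python
import random


def cvm_halving(stream, s, rng=None):
    """Estimate the number of distinct items; p is halved whenever B fills up."""
    rng = rng or random.Random()
    p = 1.0
    buf = {}  # element -> the random value u drawn at its latest occurrence
    for a in stream:
        buf.pop(a, None)          # delete a_t from B if present
        u = rng.random()          # uniform in [0, 1)
        if u <= p:
            buf[a] = u
        while len(buf) == s and u < p:
            # remove every pair with u' > p/2, then halve p
            buf = {x: ux for x, ux in buf.items() if ux <= p / 2}
            p /= 2
        if u < p:
            buf[a] = u
    return len(buf) / p
```

As long as fewer than s distinct elements have been seen, the loop never runs, p stays at 1 and the result is exact. After that, each surviving element is kept with probability roughly p, so dividing the buffer size by p scales the sample back up to an estimate.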

Weighted count-distinct problem

In its weighted version, each element is associated with a weight and the goal is to estimate the total sum of weights.
Formally,

Instance: A stream of weighted elements $x_1, x_2, \ldots, x_s$ with repetitions, and an integer $m$. Let $n$ be the number of distinct elements, namely $n = |\{x_1, x_2, \ldots, x_s\}|$, and let these elements be $\{e_1, e_2, \ldots, e_n\}$. Finally, let $w_j$ be the weight of $e_j$.
Objective: Find an estimate $\widehat{w}$ of $w = \sum_{j=1}^{n} w_j$ using only $m$ storage units, where $m \ll n$.
An example of an instance for the weighted problem is: $a(3), b(4), a(3), c(2), d(3), b(4), d(3)$. For this instance, $e_1 = a, e_2 = b, e_3 = c, e_4 = d$, the weights are $w_1 = 3, w_2 = 4, w_3 = 2, w_4 = 3$ and $\sum w_j = 12$.
As an application example, $x_1, x_2, \ldots, x_s$ could be IP packets received by a server. Each packet belongs to one of $n$ IP flows $e_1, e_2, \ldots, e_n$. The weight $w_j$ can be the load imposed by flow $e_j$ on the server. Thus, $\sum_{j=1}^{n} w_j$ represents the total load imposed on the server by all the flows to which packets $x_1, x_2, \ldots, x_s$ belong.
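
To make the objective concrete, the exact answer for the example instance can be computed by keeping one weight per distinct element. This is only a reference computation under the assumption that every occurrence of an element carries the same weight; it uses storage proportional to n, which the weighted problem rules out when m is much smaller than n. The function name `weighted_distinct_sum` is hypothetical.

```python
def weighted_distinct_sum(stream):
    """Exact sum of weights over the distinct elements of (element, weight) pairs."""
    weights = {}
    for element, weight in stream:
        weights[element] = weight  # repeated elements overwrite the same entry
    return sum(weights.values())


instance = [("a", 3), ("b", 4), ("a", 3), ("c", 2), ("d", 3), ("b", 4), ("d", 3)]
print(weighted_distinct_sum(instance))  # 12
```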

Solving the weighted count-distinct problem

Any extreme order statistics estimator (min/max sketches) for the unweighted problem can be generalized to an estimator for the weighted problem.
For example, the weighted estimator proposed by Cohen et al. can be obtained when the continuous max sketches estimator is extended to solve the weighted problem.
In particular, the HyperLogLog algorithm can be extended to solve the weighted problem. The extended HyperLogLog algorithm offers the best performance, in terms of statistical accuracy and memory usage, among all the other known algorithms for the weighted problem.
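
One way to see how a min sketch generalizes to weights is the textbook exponential-minimum construction sketched below. It is an illustration only, and is not claimed to be the exact estimator of Cohen et al. or the extended HyperLogLog. Each element is hashed to a pseudo-uniform value, turned into an exponential variable with rate equal to its weight, and only the minimum is kept per hash function. The minimum over distinct elements is then exponential with rate w, so k minima give the estimate (k - 1) / (sum of minima). The names `_uniform` and `weighted_estimate`, the use of SHA-256 as the hash, and k = 1024 are all assumptions.

```python
import hashlib
import math


def _uniform(seed, element):
    """Map (seed, element) to a reproducible pseudo-uniform value in (0, 1]."""
    digest = hashlib.sha256(f"{seed}:{element}".encode()).digest()
    return (int.from_bytes(digest[:8], "big") + 1) / 2**64


def weighted_estimate(stream, k=1024):
    """Estimate the sum of weights of distinct elements using k stored minima."""
    mins = [math.inf] * k
    for element, weight in stream:
        for i in range(k):
            # -ln(U) / weight is exponentially distributed with rate `weight`
            rank = -math.log(_uniform(i, element)) / weight
            if rank < mins[i]:
                mins[i] = rank
    return (k - 1) / sum(mins)


instance = [("a", 3), ("b", 4), ("a", 3), ("c", 2), ("d", 3), ("b", 4), ("d", 3)]
estimate = weighted_estimate(instance)  # should land near the true total of 12
```

Because the random value depends only on the hash of the element, a repeated element reproduces the same rank and cannot change a minimum, which is what makes the sketch insensitive to repetitions. The storage is the k minima, independent of the number of distinct elements.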
