- Source: Double hashing
Double hashing is a computer programming technique used in conjunction with open addressing in hash tables to resolve hash collisions, by using a secondary hash of the key as an offset when a collision occurs. Double hashing with open addressing is a classical data structure on a table $T$.
The double hashing technique uses one hash value as an index into the table and then repeatedly steps forward an interval until the desired value is located, an empty location is reached, or the entire table has been searched; but this interval is set by a second, independent hash function. Unlike the alternative collision-resolution methods of linear probing and quadratic probing, the interval depends on the data, so that values mapping to the same location have different bucket sequences; this minimizes repeated collisions and the effects of clustering.
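The search loop described above can be sketched as a small Python class. This is a minimal illustration, not a reference implementation: the class name `DoubleHashTable`, the prime table size of 11, and the way both hashes are derived from Python's built-in `hash` are assumptions made for the example, and deletion and resizing are left out.

```python
class DoubleHashTable:
    """Minimal open-addressing table using double hashing (illustrative sketch)."""

    def __init__(self, size=11):
        # Assumption: a prime size, so every step in 1..size-1 is coprime to it.
        self.size = size
        self.slots = [None] * size

    def _h1(self, key):
        # Primary hash: starting index in 0..size-1 (assumed choice).
        return hash(key) % self.size

    def _h2(self, key):
        # Secondary hash: step in 1..size-1, never zero (assumed choice).
        return 1 + (hash(key) // self.size) % (self.size - 1)

    def _probe(self, key):
        # Yield the bucket sequence: start, start + step, start + 2*step, ...
        start, step = self._h1(key), self._h2(key)
        for i in range(self.size):
            yield (start + i * step) % self.size

    def insert(self, key, value):
        for idx in self._probe(key):
            slot = self.slots[idx]
            if slot is None or slot[0] == key:
                self.slots[idx] = (key, value)
                return
        raise RuntimeError("table is full")

    def search(self, key):
        for idx in self._probe(key):
            slot = self.slots[idx]
            if slot is None:
                return None      # empty location reached: key is absent
            if slot[0] == key:
                return slot[1]   # desired value located
        return None              # entire table searched


table = DoubleHashTable()
for key in (3, 14, 25):          # all three keys start at index 3
    table.insert(key, str(key))
print(table.search(25))          # 25
print(table.search(36))          # None
```

The keys 3, 14 and 25 share the same starting index but receive different steps (1, 2 and 3), so they end up in slots 3, 5 and 6 rather than forming a single run.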
Given two random, uniform, and independent hash functions $h_1$ and $h_2$, the $i$th location in the bucket sequence for value $k$ in a hash table of $|T|$ buckets is:
$$h(i,k) = (h_1(k) + i \cdot h_2(k)) \bmod |T|.$$
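As a worked example of the formula above, the following sketch lists the full bucket sequence for two keys that collide at the same initial index. The function name `probe_sequence` and the concrete hash values are hypothetical, chosen only to make the arithmetic easy to follow.

```python
def probe_sequence(h1_k, h2_k, table_size):
    """Return [h(0,k), h(1,k), ..., h(|T|-1,k)] for the given hash values."""
    return [(h1_k + i * h2_k) % table_size for i in range(table_size)]


# Two keys with h1 = 3 in a table of 7 buckets, but different h2 values.
print(probe_sequence(3, 2, 7))  # [3, 5, 0, 2, 4, 6, 1]
print(probe_sequence(3, 5, 7))  # [3, 1, 6, 4, 2, 0, 5]
```

Both sequences start at bucket 3 and then diverge immediately, which is the behaviour that distinguishes double hashing from linear and quadratic probing.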
Generally, $h_1$ and $h_2$ are selected from a set of universal hash functions; $h_1$ is selected to have a range of $\{0, \ldots, |T|-1\}$ and $h_2$ to have a range of $\{1, \ldots, |T|-1\}$. Double hashing approximates a random distribution; more precisely, pair-wise independent hash functions yield a probability of $(n/|T|)^2$ that any pair of keys will follow the same bucket sequence.
Selection of h2(k)
The secondary hash function $h_2(k)$ should have several characteristics:
- It should never yield an index of zero.
- It should cycle through the whole table.
- It should be very fast to compute.
- It should be pair-wise independent of $h_1(k)$.
- The distribution characteristics of $h_2$ are irrelevant. It is analogous to a random-number generator.
- All $h_2(k)$ should be relatively prime to $|T|$.
In practice:
- If division hashing is used for both functions, the divisors are chosen as primes.
- If $|T|$ is a power of 2, the first and last requirements are usually satisfied by making $h_2(k)$ always return an odd number. This has the side effect of doubling the chance of collision due to one wasted bit.
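The coprimality requirement can be checked directly. The sketch below counts how many buckets a fixed step actually reaches in a power-of-two table, and shows one assumed way of forcing a step to be odd (setting the lowest bit). The helper names `visited_slots`, `covers_table` and `odd_step` are hypothetical.

```python
from math import gcd


def visited_slots(step, table_size, start=0):
    """Set of indices reached from `start` when probing with a fixed step."""
    return {(start + i * step) % table_size for i in range(table_size)}


def covers_table(step, table_size):
    """A step reaches every bucket exactly when it is coprime to the table size."""
    return gcd(step, table_size) == 1


def odd_step(raw_hash, table_size):
    """Force an odd step in 1..table_size-1 (assumes a power-of-two table size)."""
    return (raw_hash % table_size) | 1


table_size = 16
print(len(visited_slots(6, table_size)))    # 8  (gcd(6, 16) = 2)
print(len(visited_slots(7, table_size)))    # 16 (gcd(7, 16) = 1)
print(odd_step(6, table_size), odd_step(7, table_size))  # 7 7
```

The last line illustrates the wasted bit: raw values 6 and 7 map to the same step, so two keys are twice as likely to share a step as they would be with an unrestricted secondary hash.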
Analysis
Let $n$ be the number of elements stored in $T$; then $T$'s load factor is $\alpha = n/|T|$. Assume that two universal hash functions $h_1$ and $h_2$ are selected randomly, uniformly and independently to build a double hashing table $T$, and that all elements are put in $T$ by double hashing using $h_1$ and $h_2$.
Given a key $k$, the $(i+1)$-st hash location is computed by:
$$h(i,k) = (h_1(k) + i \cdot h_2(k)) \bmod |T|.$$
Let $T$ have fixed load factor $\alpha$, with $0 < \alpha < 1$.
Bradford and Katehakis showed the expected number of probes for an unsuccessful search in $T$, still using these initially chosen hash functions, is $\tfrac{1}{1-\alpha}$ regardless of the distribution of the inputs. Pair-wise independence of the hash functions suffices.
Like all other forms of open addressing, double hashing degrades toward linear-time search as the hash table approaches maximum capacity. The usual heuristic is to limit the table loading to 75% of capacity. Eventually, rehashing to a larger size will be necessary, as with all other open addressing schemes.
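To see why a load limit matters, the expected probe count $1/(1-\alpha)$ for an unsuccessful search can be evaluated at a few load factors. The function name below is hypothetical, and the sketch only evaluates the stated formula; it does not simulate a table.

```python
def expected_unsuccessful_probes(load_factor):
    """Evaluate 1 / (1 - alpha) for a load factor alpha strictly between 0 and 1."""
    if not 0 < load_factor < 1:
        raise ValueError("load factor must be strictly between 0 and 1")
    return 1 / (1 - load_factor)


for alpha in (0.5, 0.75, 0.9):
    probes = expected_unsuccessful_probes(alpha)
    print(f"{alpha:.2f} -> {probes:.1f}")
# 0.50 -> 2.0
# 0.75 -> 4.0
# 0.90 -> 10.0
```

At the 75% heuristic limit an unsuccessful search is expected to take about four probes, and the cost more than doubles by the time the table is 90% full.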
Variants
Peter Dillinger's PhD thesis points out that double hashing produces unwanted equivalent hash functions when the hash functions are treated as a set, as in Bloom filters: If $h_2(y) = -h_2(x)$ and $h_1(y) = h_1(x) + k \cdot h_2(x)$, then $h(i,y) = h(k-i,x)$ and the sets of hashes $\{h(0,x), \ldots, h(k,x)\} = \{h(0,y), \ldots, h(k,y)\}$ are identical. This makes a collision twice as likely as the hoped-for $1/|T|^2$.
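The equivalence can be verified numerically. In the sketch below the table size, the bound $k$, and the hash values for $x$ are arbitrary assumed numbers; the values for $y$ are then derived from the two conditions stated above.

```python
def hash_set(h1, h2, k, m):
    """Set {h(0), ..., h(k)} with h(i) = (h1 + i * h2) mod m."""
    return {(h1 + i * h2) % m for i in range(k + 1)}


m, k = 64, 4                     # table size and largest index (assumed values)
h1_x, h2_x = 10, 7               # hypothetical hash values for x
h2_y = (-h2_x) % m               # h2(y) = -h2(x)
h1_y = (h1_x + k * h2_x) % m     # h1(y) = h1(x) + k * h2(x)

print(sorted(hash_set(h1_x, h2_x, k, m)))  # [10, 17, 24, 31, 38]
print(sorted(hash_set(h1_y, h2_y, k, m)))  # [10, 17, 24, 31, 38]
```

The two keys have different $(h_1, h_2)$ pairs, yet they walk the same buckets in opposite directions, so a Bloom filter cannot tell them apart.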
There are additionally a significant number of mostly-overlapping hash sets; if $h_2(y) = h_2(x)$ and $h_1(y) = h_1(x) \pm h_2(x)$, then $h(i,y) = h(i \pm 1, x)$, and comparing additional hash values (expanding the range of $i$) is of no help.
Triple hashing
Adding a quadratic term $i^2$, $i(i+1)/2$ (a triangular number) or even $i^2 \cdot h_3(x)$ (triple hashing) to the hash function improves the hash function somewhat but does not fix this problem; if:
$$h_1(y) = h_1(x) + k \cdot h_2(x) + k^2 \cdot h_3(x),$$
$$h_2(y) = -h_2(x) - 2k \cdot h_3(x),$$
and
$$h_3(y) = h_3(x),$$
then
$$\begin{aligned}
h(k-i,y) &= h_1(y) + (k-i) \cdot h_2(y) + (k-i)^2 \cdot h_3(y) \\
&= h_1(y) + (k-i)(-h_2(x) - 2k h_3(x)) + (k-i)^2 h_3(x) \\
&= \ldots \\
&= h_1(x) + k h_2(x) + k^2 h_3(x) + (i-k) h_2(x) + (i^2 - k^2) h_3(x) \\
&= h_1(x) + i h_2(x) + i^2 h_3(x) \\
&= h(i,x).
\end{aligned}$$
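The identity $h(k-i,y) = h(i,x)$ for triple hashing can also be checked with concrete numbers. The modulus, the bound $k$, and the three hash values for $x$ below are arbitrary assumptions; the values for $y$ follow the three conditions in the derivation.

```python
def triple_hash(i, h1, h2, h3, m):
    """h(i) = (h1 + i * h2 + i^2 * h3) mod m."""
    return (h1 + i * h2 + i * i * h3) % m


m, k = 101, 5                          # table size and largest index (assumed)
h1_x, h2_x, h3_x = 17, 29, 3           # hypothetical hash values for x

h1_y = (h1_x + k * h2_x + k * k * h3_x) % m
h2_y = (-h2_x - 2 * k * h3_x) % m
h3_y = h3_x

mirrored = all(
    triple_hash(k - i, h1_y, h2_y, h3_y, m) == triple_hash(i, h1_x, h2_x, h3_x, m)
    for i in range(k + 1)
)
print(mirrored)  # True
```

Because the identity holds for every $i$ from 0 to $k$, the two keys again produce the same set of hashes, which is why the quadratic term alone does not remove the problem.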
Enhanced double hashing
Adding a cubic term $i^3$ or $(i^3 - i)/6$ (a tetrahedral number) does solve the problem, a technique known as enhanced double hashing. This can be computed efficiently by forward differencing.
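A minimal sketch of the forward-differencing idea follows, assuming the tetrahedral term $(i^3 - i)/6$ and an explicit modulus `m`. The function name and parameters are hypothetical. Each iteration uses only two additions: the running value `a` accumulates `b`, and `b` accumulates the next integer, so no multiplication or cubing is needed.

```python
def enhanced_double_hashes(h1_x, h2_x, count, m):
    """Return hashes[i] = (h1 + i*h2 + (i^3 - i)/6) mod m for i in 0..count-1."""
    hashes = []
    a, b = h1_x % m, h2_x % m
    for i in range(count):
        hashes.append(a)
        a = (a + b) % m        # first difference: moves a to the next hash value
        b = (b + i + 1) % m    # second difference: grows by 1, 2, 3, ...
    return hashes


print(enhanced_double_hashes(10, 7, 5, 64))  # [10, 17, 25, 35, 48]
```

With the same inputs, plain double hashing would give 10, 17, 24, 31, 38; the added terms 0, 0, 1, 4, 10 are the tetrahedral numbers.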
In addition to rectifying the collision problem, enhanced double hashing also removes double-hashing's numerical restrictions on $h_2(x)$'s properties, allowing a hash function similar in property to (but still independent of) $h_1$ to be used.
See also
Cuckoo hashing
2-choice hashing
External links
How Caching Affects Hashing by Gregory L. Heileman and Wenbin Luo 2005.
Hash Table Animation
klib a C library that includes double hashing functionality.