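Lec08 Universal Hashing and Perfect Hashing

MIT Introduction to Algorithms course: Lec08 Universal Hashing and Perfect Hashing. Corresponding textbook section: Section 11.5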

1. A weakness of hashing

Problem: For any hash function h, a set of keys exists that can cause the average access time of a hash table to skyrocket. That is, there is a set of keys that all hash to the same slot, which degrades performance.
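To make the weakness concrete, here is a minimal sketch. It assumes a fixed hash function `h(k) = k mod m`, which is a hypothetical choice for illustration and not one taken from the lecture. An adversary who knows `h` can pick keys that all land in the same slot.

```python
def h(k, m):
    """A fixed (non-random) hash function, k mod m. Illustrative assumption only."""
    return k % m

m = 8
# Adversarial keys: every multiple of m hashes to slot 0.
bad_keys = [i * m for i in range(10)]
slots = {h(k, m) for k in bad_keys}
print(slots)  # prints {0}: all ten keys share one slot
```

With every key in one slot, a chained table degenerates into a single linked list.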

IDEA: Choose the hash function at random, independently of the keys.

2. Universal hashing

Definition: Let $U$ be a universe of keys, and let $H$ be a finite collection of hash functions, each mapping $U$ to $\{0, 1, \ldots, m-1\}$. We say $H$ is universal if for all $x, y \in U$ with $x \neq y$, we have $|\{h\in H:h(x)=h(y)\}|=|H|/m$.

Equivalently, if h is chosen uniformly at random from H, the probability that x and y collide is $1/m$.

3. Universality theorem

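Theorem: Let h be a hash function chosen (uniformly) at random from a universal set H of hash functions. Suppose h is used to hash n arbitrary keys into the m slots of a table T. Then, for a given key x, we have $E[\text{number of collisions with } x]<n/m$, where $n/m$ is the load factor.

Why is the inequality strict? The proof shows that the expectation is $(n-1)/m$, because x cannot collide with itself.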

Proof

Let $C_x$ be the random variable denoting the total number of keys in T that collide with x, and let $c_{x y}=\left\{\begin{array}{ll} 1 & \text { if } h(x)=h(y) \\ 0 & \text { otherwise. } \end{array}\right.$

It follows that $E[c_{xy}]=1/m$ and $C_{x}=\sum_{y \in T-\{x\}} c_{x y}$.

Therefore, by linearity of expectation: \[ \begin{aligned} E\left[C_{x}\right] &=E\left[\sum_{y \in T-\{x\}} c_{x y}\right] \\ &=\sum_{y \in T-\{x\}} E\left[c_{x y}\right] \\ &=\sum_{y \in T-\{x\}} 1 / m \\ &=\frac{n-1}{m} < \frac{n}{m} . \end{aligned} \]
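The value $(n-1)/m$ can be checked exactly on a tiny instance. The sketch below uses the dot-product family $h_a$ constructed in section 4, with keys written as base-m digit tuples. The function name `expected_collisions` and the sample keys are made up for illustration. It averages the number of collisions with x over every function in the family.

```python
from fractions import Fraction
from itertools import product

def expected_collisions(x, keys, m):
    """Exact E[C_x] over the family h_a(k) = (sum of a_i * k_i) mod m."""
    def h(a, k):
        return sum(ai * ki for ai, ki in zip(a, k)) % m

    # Enumerate every coefficient vector a, i.e. every function in H.
    family = list(product(range(m), repeat=len(x)))
    total = sum(1 for a in family for y in keys if y != x and h(a, y) == h(a, x))
    return Fraction(total, len(family))

m = 5                                     # prime table size
keys = [(0, 1), (1, 0), (2, 3), (4, 4)]   # n = 4 keys as base-5 digit pairs
x = keys[0]
print(expected_collisions(x, keys, m))    # prints 3/5, i.e. (n - 1) / m
```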

4. Constructing a set of universal hash functions

Let m be prime. Decompose key k into r + 1 digits, each with value in the set $\{0, 1, \ldots, m-1\}$. That is, let $k = \langle k_0, k_1, \ldots, k_r \rangle$, where $0 \le k_i < m$. In other words, write k in base m.

Randomized strategy

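Pick $a = \langle a_0, a_1, \ldots, a_r \rangle$ where each $a_i$ is chosen independently and uniformly at random from $\{0, 1, \ldots, m-1\}$.

Define $h_{a}(k)=\left(\sum_{i=0}^{r} a_{i} k_{i}\right) \bmod m$, that is, take the dot product of a and k and reduce it modulo m.

How big is $H=\{h_a\}$? Each of the $r+1$ digits of $a$ has $m$ possible values, so $|H|=m^{r+1}$.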
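The construction above can be sketched in a few lines of Python. The helper names `digits_base_m` and `make_hash` are made up for this illustration, and keys are assumed to be nonnegative integers that fit in r + 1 base-m digits.

```python
import random

def digits_base_m(k, m, r):
    """Return the r + 1 base-m digits <k_0, ..., k_r> of k, least significant first."""
    if not 0 <= k < m ** (r + 1):
        raise ValueError("key does not fit in r + 1 base-m digits")
    out = []
    for _ in range(r + 1):
        out.append(k % m)
        k //= m
    return out

def make_hash(m, r, rng=random):
    """Pick a = <a_0, ..., a_r> uniformly at random and return (a, h_a)."""
    a = [rng.randrange(m) for _ in range(r + 1)]

    def h_a(k):
        ks = digits_base_m(k, m, r)
        # Dot product of a and the digits of k, reduced modulo m.
        return sum(ai * ki for ai, ki in zip(a, ks)) % m

    return a, h_a

rng = random.Random(0)            # seeded only to make the demo repeatable
a, h = make_hash(m=7, r=2, rng=rng)
print(digits_base_m(100, 7, 2))   # prints [2, 0, 2], since 100 = 2 + 0*7 + 2*49
print(h(100))                     # a slot index in 0..6
```

Drawing a fresh `a` per table is what makes the choice of function independent of the keys.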

Theorem. The set $H = \{h_a\}$ is universal.

Proof. Let $x = \langle x_0, x_1, \ldots, x_r \rangle$ and $y = \langle y_0, y_1, \ldots, y_r \rangle$ be distinct keys. They differ in at least one digit position; without loss of generality, assume position 0. For how many $h_a \in H$ do x and y collide? A collision means $h_a(x)=h_a(y)$, which implies that $\sum_{i=0}^{r} a_{i} x_{i} \equiv \sum_{i=0}^{r} a_{i} y_{i} \quad(\bmod m)$, or equivalently: \[ \sum_{i=0}^{r} a_{i}\left(x_{i}-y_{i}\right) \equiv 0 \quad(\bmod m) \]

\[ a_{0}\left(x_{0}-y_{0}\right)+\sum_{i=1}^{r} a_{i}\left(x_{i}-y_{i}\right) \equiv 0 \quad(\bmod m) \]

\[ a_{0}\left(x_{0}-y_{0}\right) \equiv-\sum_{i=1}^{r} a_{i}\left(x_{i}-y_{i}\right) \quad(\bmod m) \]

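Since $x_0\neq y_0$ and m is prime, the inverse $(x_0-y_0)^{-1}$ exists modulo m, so: \[ a_{0} \equiv\left(-\sum_{i=1}^{r} a_{i}\left(x_{i}-y_{i}\right)\right) \cdot\left(x_{0}-y_{0}\right)^{-1} \quad(\bmod m) \] Thus $a_0$ is determined by the other $a_i$: once $a_1, \ldots, a_r$ are fixed, exactly one value of $a_0$ makes x and y collide.

Hence the number of functions $h_a$ under which x and y collide is $m^r \cdot 1 = m^r = |H|/m$, which proves that H is universal.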
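The counting argument can be confirmed by brute force for small parameters. The following sketch uses the hypothetical helper name `count_colliding_functions` and takes keys as digit tuples. It enumerates every coefficient vector a and counts those for which the two keys collide.

```python
from itertools import product

def count_colliding_functions(x, y, m):
    """Count vectors a in {0..m-1}^(r+1) with h_a(x) == h_a(y), by brute force.

    x and y are digit tuples <x_0, ..., x_r> with entries in 0..m-1.
    """
    count = 0
    for a in product(range(m), repeat=len(x)):
        hx = sum(ai * xi for ai, xi in zip(a, x)) % m
        hy = sum(ai * yi for ai, yi in zip(a, y)) % m
        if hx == hy:
            count += 1
    return count

m, r = 5, 2
x, y = (1, 2, 3), (4, 2, 0)   # distinct keys with r + 1 = 3 base-5 digits
print(count_colliding_functions(x, y, m), m ** r)  # prints 25 25, i.e. |H|/m
```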

Fact from number theory

Theorem. Let m be prime. For any $z \in Z_m$ such that $z \neq 0$, there exists a unique $z^{-1} \in Z_m$ such that $z \cdot z^{-1} \equiv 1 \quad(\bmod m)$.

If m is not prime, this does not hold for every nonzero z. For example, 2 has no inverse modulo 4.
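A quick check of this fact, assuming Python 3.8 or later, where the built-in `pow` accepts a negative exponent together with a modulus:

```python
m = 7  # prime modulus
for z in range(1, m):
    inv = pow(z, -1, m)        # modular inverse of z modulo m
    assert (z * inv) % m == 1  # every nonzero z has an inverse

# With a composite modulus, some nonzero elements have no inverse.
try:
    pow(2, -1, 4)
    invertible = True
except ValueError:             # raised when the base is not invertible
    invertible = False
print(invertible)  # prints False
```

This is why the construction in section 4 requires m to be prime: the proof divides by $x_0 - y_0$ modulo m.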

5. Perfect hashing

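Given a set of n keys, construct a static hash table of size $m = O(n)$ such that SEARCH takes $\Theta(1)$ time in the worst case. In other words, the key set is fixed in advance and the table is static, so every lookup takes constant time.

IDEA: Two-level scheme with universal hashing at both levels. No collisions at level 2.

If $n_i$ items hash to slot i of the level-1 table, then the level-2 table for that slot has $m_i=n_i^2$ slots.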

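Theorem. Let H be a class of universal hash functions for a table of size $m = n^2$. Then, if we use a random $h \in H$ to hash n keys into the table, the expected number of collisions is at most 1/2.

Proof. By the definition of universality, each pair of distinct keys collides under h with probability $1/m = 1/n^2$. There are $\binom{n}{2}$ pairs of keys, so the expected number of collisions is \[ \binom{n}{2} \cdot \frac{1}{n^{2}}=\frac{n(n-1)}{2} \cdot \frac{1}{n^{2}}<\frac{1}{2} . \]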

Markov's inequality

For any nonnegative random variable X and any $t > 0$, we have $\Pr\{X \ge t\} \le E[X]/t$.

Corollary. The probability of no collisions is at least 1/2.

Proof. Apply Markov's inequality with $t=1$: the probability of 1 or more collisions is at most $E[X]/1 \le 1/2$.

Thus, just by testing random hash functions in H, we’ll quickly find one that works.

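Because the goal is a static table, this argument shows that a suitable hash function is easy to find: each random trial succeeds with probability at least 1/2, so the expected number of trials is at most 2.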
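The two-level scheme can be sketched as follows. This is a minimal illustration under stated assumptions, not the lecture's exact construction. Level-2 table sizes $n_i^2$ are generally not prime, so instead of the dot-product family from section 4 the sketch swaps in the family $h(k) = ((ak + b) \bmod p) \bmod m$ with a fixed prime p, for which two distinct keys collide with probability at most $1/m$. Keys are assumed to be distinct integers in $[0, p)$. The names `random_hash`, `build_perfect_table`, and `search` are hypothetical.

```python
import random

P = 2_147_483_647  # the prime 2^31 - 1; keys are assumed to lie in [0, P)

def random_hash(m, rng):
    """Draw h(k) = ((a*k + b) mod P) mod m with random a in 1..P-1, b in 0..P-1."""
    a = rng.randrange(1, P)
    b = rng.randrange(P)
    return lambda k: ((a * k + b) % P) % m

def build_perfect_table(keys, rng=None):
    """Build a static two-level table; slot i of level 1 gets n_i^2 level-2 slots."""
    if rng is None:
        rng = random.Random()
    keys = list(set(keys))            # the scheme assumes distinct keys
    if not keys:
        raise ValueError("need at least one key")
    n = len(keys)
    h1 = random_hash(n, rng)          # level 1: m = n slots
    buckets = [[] for _ in range(n)]
    for k in keys:
        buckets[h1(k)].append(k)
    level2 = []
    for bucket in buckets:
        size = len(bucket) ** 2
        while True:                   # retry until this bucket is collision-free
            h2 = random_hash(size, rng) if size else None
            table = [None] * size
            ok = True
            for k in bucket:
                j = h2(k)
                if table[j] is not None:
                    ok = False        # collision: draw a new h2
                    break
                table[j] = k
            if ok:
                break
        level2.append((h2, table))
    return h1, level2

def search(structure, k):
    """Worst-case O(1) lookup: one level-1 hash, one level-2 hash, one comparison."""
    h1, level2 = structure
    h2, table = level2[h1(k)]
    return h2 is not None and table[h2(k)] == k

keys = [3, 14, 15, 92, 65, 35, 89, 79, 32, 38]
t = build_perfect_table(keys, random.Random(42))
print(all(search(t, k) for k in keys), search(t, 7))  # prints True False
```

The sketch accepts whatever level-1 function it draws first. A fuller version could also redraw `h1` whenever the total level-2 size exceeds a chosen multiple of n.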

Analysis of storage

For the level-1 hash table T, choose $m = n$, and let $n_i$ be the random variable for the number of keys that hash to slot i in T. By using $n_i^2$ slots for the level-2 hash table $S_i$, the expected total storage required for the two-level scheme is therefore \[ E\left[\sum_{i=0}^{m-1} \Theta\left(n_{i}^{2}\right)\right]=\Theta(n) \] since the analysis is identical to the analysis from recitation of the expected running time of bucket sort.
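For readers without the recitation notes, one way to see the bound is sketched here. It is not necessarily the recitation's own argument. Write $n_i^2 = n_i + 2\binom{n_i}{2}$, and note that $\sum_i \binom{n_i}{2}$ is the number of pairs of keys that collide at level 1. Under the definition of universality above, each pair collides with probability $1/m$, so with $m = n$: \[ \begin{aligned} E\left[\sum_{i=0}^{m-1} n_{i}^{2}\right] &=E\left[\sum_{i=0}^{m-1} n_{i}\right]+2 E\left[\sum_{i=0}^{m-1}\binom{n_{i}}{2}\right] \\ &=n+2\binom{n}{2} \cdot \frac{1}{m} \\ &=n+(n-1) \\ &=2 n-1=\Theta(n) . \end{aligned} \]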