Hashing
Motivation
- database
- compilers and interpreters
- network router
- substring search
- string commonalities
- file/dir synchronization
Direct Access table:
0 | |
1 | |
2 | |
3 | |
4 | |
| … |
| … |
| … |
Hash requires mapping to an integer.
ideally:
- prehash hash(x) == hash(y) only when x == y
badness:
- keys may not be non negative integers
- gigantic memory hog
hashing -> hatchet
- reduce universe of all possible keys and reduce them down to some reasonable set of integers.
- idea: m = theta(n)
-
size of table proportial to size of # of things.
-
0 | | | | | | ———- 1 | | ———- 2 | | ———- 3 | | ———- 4 | | ———- | … | ———- | … | ———- m-1 | … | ———-
Collision: Multiple keys map to same slot in hash table.
- h(ki) == h(kj) but ki <> kj
- we can use chaining
How to define?
- function (h)
- all keys map to
h = { 0, ..., m-1 }
Two ways to deal with collisions:
- Chaining: store collisions as a list. (i.e. linked list)
- Open addressing
Chaining
----------
0 | | –> (item1) –> (item2) –> (nil) ———- 1 | | ———- 2 | | ———- 3 | | ———- 4 | | ———- | … | ———- | … | ———- m-1 | … | ———-
Worst case:
- theta(n) (i.e. all items have a collision)
Simple uniform hashing:
- Convenient for analysis.
- each key is equally likely to be hashed to any slot of the table, independent of where other keys hashing.
Analysis:
- expected length of a chain for n keys, m slots.
- n/m == alpha == load factor
- O(1 + alpha)
Hash functions:
- division method:
h(k) == k mod m- works okay if
mis prime and not close to power of 2 or power of 10.
- multiplication method:
h(k) = [(a*k) mod 2^w] >> (w - r)
- Universal hasing:
h(k) = [(ak+b) mod p] mod ma,bare random between 0 andp-1. p is a big prime number larger than the universe.{0, ..,p-1}
- worst case keys: k1 != k2
1/m
How to choose m?
- Want
m = theta(n)0(1)
Idea:
Start small; grow/shrink as necessary
E.g.
-
If n > m: grow table
-
grow table:
- m -> m'
- make table of size
m' - build new hash function
h' - rehash
- for item in T:
- t:insert(item)
- for item in T:
m' = 2mtable doubling
- make table of size
- m -> m'
Amortization:
- operation takes
T(n) amortized- if
koperations take <=k*T(n)time
- if
- spread out the high cost so you pay a little bit, but more often.
- ~ Think of meaning
T(n)on average, where average is taken over all the oeprations.
spread out the high cost so that it’s cheap on average all the time.
Open Adressing
- no chaining
- use arrays
- m: # of slots in the table
- m >= # of elements
----------
| item 1 |
----------
| item 2 |
----------
| |
----------
| |
----------
probing: try to see if we can insert something into the hashtable. if we fail, we will compute a slightly different hash until we find an empty slot that we can insert into this table.
Hash function specifies the order of slots to probe for a key. (for insert/search/delete)
hash function h(U, trial_count)
- U: universe of keys
- trial_count: integer between 0 -> m-1
h(k,1)
arbitrary key k
h(k,1), h(k,2), … h(k, m-1)
to be a permutation of 0,1 … m-1
----------
0 | |
----------
1 | |
----------
2 | |
----------
3 | |
----------
4 | |
----------
5 | |
----------
6 | |
----------
7 | |
----------
insert(k,v): keep probing until an empty slot is found. insert item when found.search(k): As long as the slots encountered are occupied by keys != k. keep probing until you either encounter k or find an empty slot.