Hashing

video

Motivation

database
compilers and interpreters
network router
substring search
string commonalities
file/dir synchronization

Direct Access table:

0 | |

1 | |

2 | |

3 | |

4 | |

| … |

Hash requires mapping to an integer.

ideally:

prehash hash(x) == hash(y) only when x == y

badness:

keys may not be non negative integers
gigantic memory hog

hashing -> hatchet

reduce universe of all possible keys and reduce them down to some reasonable set of integers.
idea: m = theta(n)
- size of table proportial to size of # of things.

0 | | | | | | ———- 1 | | ———- 2 | | ———- 3 | | ———- 4 | | ———- | … | ———- | … | ———- m-1 | … | ———-

Collision: Multiple keys map to same slot in hash table.

h(ki) == h(kj) but ki <> kj
we can use chaining

How to define?

function (h)
all keys map to h = { 0, ..., m-1 }

Two ways to deal with collisions:

Chaining: store collisions as a list. (i.e. linked list)
Open addressing

Chaining

----------

0 | | –> (item1) –> (item2) –> (nil) ———- 1 | | ———- 2 | | ———- 3 | | ———- 4 | | ———- | … | ———- | … | ———- m-1 | … | ———-

Worst case:

theta(n) (i.e. all items have a collision)

Simple uniform hashing:

Convenient for analysis.
each key is equally likely to be hashed to any slot of the table, independent of where other keys hashing.

Analysis:

expected length of a chain for n keys, m slots.
- n/m == alpha == load factor
- O(1 + alpha)

Hash functions:

division method:

h(k) == k mod m
works okay if m is prime and not close to power of 2 or power of 10.

multiplication method:

h(k) = [(a*k) mod 2^w] >> (w - r)

Universal hasing:

h(k) = [(ak+b) mod p] mod m
- a, b are random between 0 and p-1. p is a big prime number larger than the universe.
- {0, ..,p-1}
worst case keys: k1 != k2
- 1/m

How to choose m?

Want m = theta(n)
- 0(1)

Idea:

Start small; grow/shrink as necessary

E.g.

If n > m: grow table
grow table:
- m -> m'
  - make table of size m'
  - build new hash function h'
  - rehash
    - for item in T:
      - t:insert(item)
  - m' = 2m table doubling

Amortization:

operation takes T(n) amortized
- if k operations take <= k*T(n) time
spread out the high cost so you pay a little bit, but more often.
~ Think of meaning T(n) on average, where average is taken over all the oeprations.

spread out the high cost so that it’s cheap on average all the time.

Open Adressing

no chaining
use arrays
m: # of slots in the table
m >= # of elements

----------
| item 1 |
----------
| item 2 |
----------
|        |
----------
|        |
----------

probing: try to see if we can insert something into the hashtable. if we fail, we will compute a slightly different hash until we find an empty slot that we can insert into this table.

Hash function specifies the order of slots to probe for a key. (for insert/search/delete)

hash function h(U, trial_count)

U: universe of keys
trial_count: integer between 0 -> m-1

h(k,1) arbitrary key k

h(k,1), h(k,2), … h(k, m-1)

to be a permutation of 0,1 … m-1

  ----------
0 |        |
  ----------
1 |        |
  ----------
2 |        |
  ----------
3 |        |
  ----------
4 |        |
  ----------
5 |        |
  ----------
6 |        |
  ----------
7 |        |
  ----------

insert(k,v): keep probing until an empty slot is found. insert item when found.
search(k): As long as the slots encountered are occupied by keys != k. keep probing until you either encounter k or find an empty slot.

Hashing

Motivation

0 | |

1 | |

2 | |

3 | |

4 | |

| … |

| … |

| … |

Chaining

Hash functions:

How to choose m?

Open Adressing

Cryptographic Hash