master

Hashing

video

Motivation

  • database
  • compilers and interpreters
  • network router
  • substring search
  • string commonalities
  • file/dir synchronization

Direct Access table:


0 | |

1 | |

2 | |

3 | |

4 | |

| … |

| … |

| … |

Hash requires mapping to an integer.

ideally:

  • prehash hash(x) == hash(y) only when x == y

badness:

  1. keys may not be non negative integers
  2. gigantic memory hog

hashing -> hatchet

  • reduce universe of all possible keys and reduce them down to some reasonable set of integers.
  • idea: m = theta(n)
    • size of table proportial to size of # of things.


0 | | | | | | ———- 1 | | ———- 2 | | ———- 3 | | ———- 4 | | ———- | … | ———- | … | ———- m-1 | … | ———-

Collision: Multiple keys map to same slot in hash table.

  • h(ki) == h(kj) but ki <> kj
  • we can use chaining

How to define?

  • function (h)
  • all keys map to h = { 0, ..., m-1 }

Two ways to deal with collisions:

  • Chaining: store collisions as a list. (i.e. linked list)
  • Open addressing

Chaining

----------

0 | | –> (item1) –> (item2) –> (nil) ———- 1 | | ———- 2 | | ———- 3 | | ———- 4 | | ———- | … | ———- | … | ———- m-1 | … | ———-

Worst case:

  • theta(n) (i.e. all items have a collision)

Simple uniform hashing:

  • Convenient for analysis.
  • each key is equally likely to be hashed to any slot of the table, independent of where other keys hashing.

Analysis:

  • expected length of a chain for n keys, m slots.
    • n/m == alpha == load factor
    • O(1 + alpha)

Hash functions:

  1. division method:
  • h(k) == k mod m
  • works okay if m is prime and not close to power of 2 or power of 10.
  1. multiplication method:
  • h(k) = [(a*k) mod 2^w] >> (w - r)
  1. Universal hasing:
  • h(k) = [(ak+b) mod p] mod m
    • a, b are random between 0 and p-1. p is a big prime number larger than the universe.
    • {0, ..,p-1}
  • worst case keys: k1 != k2
    • 1/m

How to choose m?

  • Want m = theta(n)
    • 0(1)

Idea:

Start small; grow/shrink as necessary

E.g.

  • If n > m: grow table

  • grow table:

    • m -> m'
      • make table of size m'
      • build new hash function h'
      • rehash
        • for item in T:
          • t:insert(item)
      • m' = 2m table doubling

Amortization:

  • operation takes T(n) amortized
    • if k operations take <= k*T(n) time
  • spread out the high cost so you pay a little bit, but more often.
  • ~ Think of meaning T(n) on average, where average is taken over all the oeprations.

spread out the high cost so that it’s cheap on average all the time.

Open Adressing

  • no chaining
  • use arrays
  • m: # of slots in the table
  • m >= # of elements
----------
| item 1 |
----------
| item 2 |
----------
|        |
----------
|        |
----------

probing: try to see if we can insert something into the hashtable. if we fail, we will compute a slightly different hash until we find an empty slot that we can insert into this table.

Hash function specifies the order of slots to probe for a key. (for insert/search/delete)

hash function h(U, trial_count)

  • U: universe of keys
  • trial_count: integer between 0 -> m-1

h(k,1) arbitrary key k

h(k,1), h(k,2), … h(k, m-1)

to be a permutation of 0,1 … m-1

  ----------
0 |        |
  ----------
1 |        |
  ----------
2 |        |
  ----------
3 |        |
  ----------
4 |        |
  ----------
5 |        |
  ----------
6 |        |
  ----------
7 |        |
  ----------
  • insert(k,v): keep probing until an empty slot is found. insert item when found.
  • search(k): As long as the slots encountered are occupied by keys != k. keep probing until you either encounter k or find an empty slot.

Cryptographic Hash