Commit `b08b548`

mo khan <mo.khan@gmail.com>

2020-08-04 03:18:31

Add notes on Karp-Rabin algorithm

master

1 parent f5303d4

Changed files (2)

doc

mit-ocw

hash.md

kr.md

@@ -1,8 +1,8 @@
 # Hashing
 
-https://ocw.mit.edu/courses/electrical-engineering-and-computer-science/6-006-introduction-to-algorithms-fall-2011/lecture-videos/lecture-8-hashing-with-chaining/
+[video][mit-ocw]
 
-movitation
+## Motivation
 
 * database
 * compilers and interpreters
@@ -15,16 +15,23 @@ Direct Access table:
 
   ----------
 0 |        |
+  ----------
 1 |        |
+  ----------
 2 |        |
+  ----------
 3 |        |
+  ----------
 4 |        |
+  ----------
   | ...    |
+  ----------
   | ...    |
+  ----------
   | ...    |
   ----------
 
-Hash requires mapping to an integer.
+> Hash requires mapping to an integer.
 
 ideally:
 
@@ -116,8 +123,40 @@ Analysis:
 2. multiplication method:
   * `h(k) = [(a*k) mod 2^w] >> (w - r)`
 3. Universal hasing:
-  * `h(k) = [(ak+b) mod p] mod m
+  * `h(k) = [(ak+b) mod p] mod m`
     * `a`, `b` are random between 0 and `p-1`. p is a big prime number larger than the universe.
-    * {0, ..,p-1}
+    * `{0, ..,p-1}`
   * worst case keys: k1 != k2
     * `1/m`
+
+## How to choose m?
+
+* Want `m = theta(n)`
+  * `0(1)`
+
+Idea:
+
+> Start small; grow/shrink as necessary
+
+E.g.
+
+* If n > m: grow table
+
+* grow table:
+  * m -> m'
+    * make table of size `m'`
+    * build new hash function `h'`
+    * rehash
+      * for item in T:
+        * t:insert(item)
+    * `m' = 2m` table doubling
+
+Amortization:
+* operation takes `T(n) amortized`
+  * if `k` operations take <= `k*T(n)` time
+* spread out the high cost so you pay a little bit, but more often.
+* ~ Think of meaning `T(n)` on average, where average is taken over all the oeprations.
+
+> spread out the high cost so that it's cheap on average all the time.
+
+[mit-ocw]: https://ocw.mit.edu/courses/electrical-engineering-and-computer-science/6-006-introduction-to-algorithms-fall-2011/lecture-videos/lecture-8-hashing-with-chaining/

@@ -0,0 +1,74 @@
+# Karp-Rabin string matching algorithm
+* https://ocw.mit.edu/courses/electrical-engineering-and-computer-science/6-006-introduction-to-algorithms-fall-2011/lecture-videos/lecture-9-table-doubling-karp-rabin/
+
+* given two string `s` & `t`: does `s` occur as a substring of `t`?
+
+```plaintext
+t: [ | | | | | | | | | | | | | | | | ]
+s: [ | | | ]
+  s: [ | | | ]
+    s: [ | | | ]
+      s: [ | | | ]
+        s: [ | | | ]
+          s: [ | | | ]
+            s: [ | | | ]
+              s: [ | | | ]
+                s: [ | | | ]
+                  s: [ | | | ]
+                    s: [ | | | ]
+                      s: [ | | | ]
+                        s: [ | | | ]
+                          s: [ | | | ]
+```
+
+Time:
+  * `theta(|s| * |t| - |s|)
+  * `theta(|s| * |t|) # length of s * length of t
+
+Can we use hashing to get this down to linear time?
+
+* We are looking for `O(|s|+|t|)` length of `s` + `t` is better than `s` times `t`.
+
+Rolling hash ADT:
+
+* `r.append(c)`:
+  * add char.c to end of x
+* `r.skip(c)`:
+  * delete fist character of x (assuming it is c)
+
+
+* `r` maintains a string `x`
+  * `r()` hash value of x = `h(x)`
+
+```python
+for c in s: rs.append(c)
+for c in t[:len(s)]:
+  rt.append(c)
+if rs() == rt(): ...
+for i in range(len(s), len(t)):
+  rt.skip(t[i - len(s)]
+  rt.append(t[i])
+  if rs() == rt:
+    s == t[i-len(s) + 1: i + i]
+    if equal:
+      found match
+    else:
+      happens with probability <= 1/|s|
+```
+
+division method:
+* `h(k) = k mod m` where `m` is a random prime >= |s| (prime is greater than length of s)
+* treat string x as a multi digit number base is size of alphabet.
+  * ascii -> 256
+  * unicode -> x
+
+```python
+r.append(c):
+  u -> u * a + ord(c)
+  r -> r * a + ord(c) mod m
+
+r.skip():
+  u -> u - c * a^|u|-1
+```
+
+* https://www.youtube.com/watch?v=qQ8vS2btsxI

Commit b08b548

Commit `b08b548`