20-CS-4003-001 Organization of Programming Languages Fall 2017
Lab Assignment 11

Lambda calculus, Type theory, Formal semantics, Program analysis

Bloom Filters

Due: 22 November, 2017 (submit instructions: here)

Rationale:
    Simple use of the Maybe monad. Simple use of closure. Some sophistication in the design of Haskell code.
 
Background:
A Bloom filter is a simple probabilistic data structure for helping to determine whether a given object is in a given database. If the queried object is in the data base the response of the query is some form of "don't know," and if the queried object is not in the database then the response of the query is some form of "definitely not in the database." Bloom filters use very little space and very little time, which is their important benefit, but they must be designed carefully to keep the false positive rate low. Objects can be added to a Bloom filter but with each additional object the false positive rate may increase. Remember, Bloom filters are used only to determine set membership, not to retrieve information about an object that may be stored in a database. Google Chrome uses Bloom filters to make a preliminary decision as to whether a particular web site is malicious or safe. Bloom filters are used in spell checkers: a quick check that a word is not in the database results in the word being underlined in Microsoft Word for example. Bloom filters are used in a variety of large-scale network applications such as shared web caches, query routing, and replica location. In fact, you use Bloom filters everyday without realizing it.

Associated with a Bloom filter are k different hash functions, H1,H2,...,Hk. The number k is a parameter that is determined after an analysis of the database to be stored. Each hash function maps a string to a positive integer. Thus, for example, the hash functions might produce the following numbers for input Hello:

  H1("Hello") = 13
  H2("Hello") = 10
  ...
  Hk("Hello") = 25
The maximum number that can be output by any hash function is also a parameter which will be called nbits here. Also associated with a Bloom filter is an array of bits of size nbits. Each entry in the database contributes up to k 1s to this array through the k hash functions: each hash function output is an index into this array where a 1 is placed. For example, if k is 10, nbits is 60, the hash functions produce outputs {3,1,34,32,55,10,13,18,43,22} on input Hello, then, from a fresh database (no entries yet), adding Hello results in the following array where bit indices are in increasing order starting at 1 from the left:
  1010000001001000010001000000000101000000001000000000001000000
If There is added to the database with hash function outputs as follows: {16,56,10,17,34,8,19,49,41,32} then the array becomes this:
  1010000101001001111001000000000101000000101000001000001100000
and so on. In practice, of course, nbits is extremely large and k is not so great.

To determine whether a string is in the database the bits in the array at indices given by the outputs of the hash functions applied to the string are checked. If they are all 1 then nothing can be said because string string could just map to a false positive result. But if at least one bit has value 0, then it can be said with certainly that the string is not in the database.

 
Lab Problem:
Write haskell functions that implement a simple Bloom filter. It is OK to choose your own family of hash functions but the functions below are provided for your convenience and it is OK to use them also.
 {- A hash function that maps string to integers: start with an accumulator
    of value 'seed', for each character in the string update the accumulator
    as follows: just multiply the current accumulator value by the number 
    corresponding to the character then take the result modulo a 'modulus' 
    and add 1 to make sure 0 is not obtained.  The final accumulator value is
    the hash of the string.  But f returns the above missing an input string.
    Use f, for example, like this: 
      ((f 17 34) "Hello") which gives 
      13
 -}
 f :: Int -> Int -> [Char] -> Int
 f modulus seed = foldl (\acc x -> ((acc*(fromEnum x) `mod` modulus)+1)) seed
Thus, a collection of different hash functions can be obtained by using different seeds. The modulus will be the number of bits in the database array so will be the same for all hash functions. Define a variable of type Int named nbits. The database array can be called db and is of type [Bool]. Create a function named addToDB which takes a database array and string as input and outputs a new database array that includes the result of applying the hash functions to the input string. It should not be necessary to pass neither k nor nbits as arguments to addToDB.
 addToDB :: [Bool] -> [Char] -> [Bool]
 addToDB db str = ...
Create a function named makeDB which takes as input a list of strings corresponding to all data base entries and outputs a database array that takes into account all the strings in the input list.
 makeDB :: [[Char]] -> [Bool]
 makeDB strs = ...
Finally, create a function named query which takes as input a database array and a string and returns either Nothing if it cannot be determined that the input string is in the database and Just False if it is certainly not in the database.
 query :: [Char] -> [Bool] -> Maybe Bool
 query str db = ...
Note: implementing other functions may be necessary or at least useful but you must implement addToDB, makeDB, and query.