1 minute read

Number of unique quotes in the text documents are estimated using Flajolet Martin algorithm.

1. Libraries are imported

The required libraries are: Glob, Pyhash, Scikit-learn, NumPy

2. Built hasher function with filtered quotes only. Trailing zeros are calculated

Generate n, d-dimensional arrays using the following pseudo-code:

Initialize empty list

define (Hasher):
  read lines
  use non-cryptographic Murmurhash and convert it to binary
  calculate trailing zeros
  append to empty list
}

3. Calculate maximum trailing zeros in each chunk

Compute the trailing zeros for specific chunks of quotes, and find the maximum of each chunk of the quotes.

4. Overall trailing zeros estimation

Calculate the median of mean maximum trailing zeros of each chunk. Let us call this value .

5. Estimate the number of unique quotes

According to the Flajolet Martin algorithm, the total number of unique quotes is .

Result:

Spam filter for big-data is built using Bloom filter.

1. Set an arbitrary bitarray size, and set everything to false

Arbitrarily set bitarray size as 350,000.

2. Count number of lines in train/test sets

Setup train/test as 80/20 split.

3. Calculate optimal number of hash functions

Result:

4. Training

After finding the value (k=5), implemented 5 hash functions (Murmur Hash, FNV Hash, XX Hash, Spooky Hash, and Farm Fingerprint Hash). Calculated mod m of the value, and set the value of bitarray as true for the corresponding index value as the output of the computation.

Trained the bitarray through the training set of only spams.

5. Testing

Result: Accurately predicted the spams with a FPR of 4%.

Check out the code here.