Description
Original issue created by Maaartinus on 2012-08-24 at 05:40 PM
**BUG 1:**
When the number of bits in a `BloomFilter` gets high, its FPP is much worse than expected. The culprit is the modular arithmetic in `BloomFilterStrategies.MURMUR128_MITZ_32`. You compute `x % N`, where `N = bits.size()` and `x` is uniformly distributed in the range `0..Integer.MAX_VALUE`. For big `N`, `x % N` is far from uniform in the range `0..N-1`. For example, with `N = 3 << 29`, values below `1 << 29` are twice as probable as the others.
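To see why: the range of `x` has size `1 << 31`, which wraps around `N = 3 << 29` once, so every residue below `1 << 29` is reached by two values of `x`, while every other residue is reached by only one. A standalone sketch (the class and variable names are mine, not part of any patch) that counts the hits per residue class:

```java
public class ModuloBiasDemo {
  public static void main(String[] args) {
    long n = 3L << 29;     // bits.size() in the example above
    long range = 1L << 31; // x is uniform in 0..Integer.MAX_VALUE
    // x % n == r holds for exactly floor((range - 1 - r) / n) + 1 values of x in [0, range)
    long hitsLow = (range - 1) / n + 1;               // residue 0
    long hitsHigh = (range - 1 - (1L << 29)) / n + 1; // residue 1 << 29
    System.out.println("residues below 1<<29 are hit " + hitsLow + "x");  // 2x
    System.out.println("residues >= 1<<29 are hit " + hitsHigh + "x");    // 1x
  }
}
```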
This non-uniformity leads to results like this:

```
desiredFpp   0.000001000000
expectedFpp  0.000000610089
realFpp      0.000003000000
```
Here, `desiredFpp` is the value passed to `BloomFilter.create`, and `expectedFpp` was reported after exactly `expectedInsertions` insertions were done. Obviously, far fewer bits than expected were set. If this happened once, it might be good luck, but here it's a sign of this bug, as the `realFpp` shows.
This problem is readily reproducible; it's no glitch caused by bad luck with the selected values. AFAIK it concerns all versions since the switch away from power-of-two sizes.
**BUG 2:**
With commit 31149b4 ("The number of bits can reach Integer.MAX_VALUE now, rather than Integer.MAX_VALUE/64") another bug was introduced. The commit message is obviously wrong, as up to `Integer.MAX_VALUE` longs can be allocated, allowing nearly `2**37` bits. However, the arithmetic is still `int`-based and can address only `2**31` bits, so most of the allocated memory gets wasted.
Even worse, `bits.size()` may overflow, leading to all kinds of disasters, like "/ by zero" (e.g. for `expectedInsertions=244412641` and `desiredFpp=1e-11`) or using only 64 bits.
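A quick demonstration of the overflow, using the numbers from the example above (the reconstruction of the size computation is my assumption about the `int`-based code, not a quote of the actual source):

```java
public class SizeOverflowDemo {
  public static void main(String[] args) {
    // expectedInsertions=244412641 with desiredFpp=1e-11 needs ~1.29e10 bits,
    // i.e. about 201326592 longs:
    int dataLength = 201326592;
    int intSize = dataLength * 64;  // int overflow: 12884901888 == 3 * 2^32, low 32 bits are 0
    System.out.println(intSize);    // prints 0 -> any "x % size" throws "/ by zero"
    long longSize = (long) dataLength * 64; // long arithmetic gives the correct size
    System.out.println(longSize);   // prints 12884901888
  }
}
```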
**INEFFICIENCY:**
In `MURMUR128_MITZ_32` there is one modulus operation and one unpredictable branch per hash function. This is quite wasteful, as it's enough to compute the modulus for the two basic hashes and then use conditional subtraction.
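A sketch of the conditional-subtraction idea (hypothetical code, assuming a `long`-indexed `BitArray`; both base hashes are reduced once, and each further reduction is a compare-and-subtract):

```java
long bitSize = bits.size();
long h1 = (hash1 & Long.MAX_VALUE) % bitSize; // one modulus for the first base hash
long h2 = (hash2 & Long.MAX_VALUE) % bitSize; // one modulus for the second base hash
long index = h1;
for (int i = 0; i < numHashFunctions; i++) {
  bits.set(index);
  index += h2;          // index < bitSize and h2 < bitSize, so index < 2 * bitSize
  if (index >= bitSize) {
    index -= bitSize;   // conditional subtraction replaces '%' inside the loop
  }
}
```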
**ENHANCEMENT 1:**
As the filter may take up to 16 GB, there should be a method to find out the memory consumption.
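For example, something like the following on `BitArray` would do (the method name is hypothetical; `data` is the backing `long[]`):

```java
// Hypothetical accessor reporting the backing array's footprint:
long sizeInBytes() {
  return 8L * data.length; // 8 bytes per long word
}
```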
**ENHANCEMENT 2:**
Possibly there could be a strategy using a power-of-two table, which may be faster. In case the speedup is non-negligible, such a strategy makes a lot of sense, as the additional memory (assuming rounding up) is not wasted at all: you get a better FPP.
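With a power-of-two size, the reduction degenerates to a mask, which is division-free, branch-free, and perfectly uniform (sketch):

```java
long bitSize = 1L << 34;  // table size rounded up to a power of two
long mask = bitSize - 1;
long index = hash & mask; // replaces '%': no division, no bias, no branch
```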
**QUESTION:**
I see no reason for limiting `numHashFunctions` to 255. In the `SerialForm`, there's an `int`, so why?
**PROPOSED SOLUTION:**
Because of serialized-form compatibility, I'd suggest leaving `MURMUR128_MITZ_32` alone and creating `MURMUR128_MITZ_64`, which
- extracts two `long`s instead of two `int`s from the `HashCode`
- uses `long` arithmetic for everything
The `BitArray` must use `long` indexes, and its `bitCount()` and `size()` must return `long`, too. This works with both strategies, and the `SerialForm` needs no change.
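Roughly, the new strategy's `put` could look like this (a sketch under the assumptions above, not the patch itself; `Longs` is `com.google.common.primitives.Longs`, and `BitArray.set(long)` is assumed to return whether the bit changed):

```java
<T> boolean put(T object, Funnel<? super T> funnel, int numHashFunctions, BitArray bits) {
  byte[] bytes = Hashing.murmur3_128().hashObject(object, funnel).asBytes();
  long hash1 = lowerEight(bytes);
  long hash2 = upperEight(bytes);
  long bitSize = bits.size(); // now a long
  boolean bitsChanged = false;
  long combined = hash1;
  for (int i = 0; i < numHashFunctions; i++) {
    bitsChanged |= bits.set((combined & Long.MAX_VALUE) % bitSize); // long arithmetic throughout
    combined += hash2;
  }
  return bitsChanged;
}

static long lowerEight(byte[] b) { // little-endian lower 8 bytes of the 128-bit hash
  return Longs.fromBytes(b[7], b[6], b[5], b[4], b[3], b[2], b[1], b[0]);
}

static long upperEight(byte[] b) { // little-endian upper 8 bytes
  return Longs.fromBytes(b[15], b[14], b[13], b[12], b[11], b[10], b[9], b[8]);
}
```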
For small filters (up to a few million bits, before the non-uniformity starts to cause problems) it's fine to use the old strategy; for larger ones the new one must be used. I'd suggest using the new one for all new filters.
In order to get maximum speed, the following comes to mind:
- create a package-private `HashCode.asSecondLong()`
- compute `hash2` only if `numHashFunctions > 1`
The attached patch solves it all.