Description
Original issue created by Maaartinus on 2012-08-24 at 05:40 PM
**BUG 1:**
When the number of bits in a `BloomFilter` gets high, its FPP is much worse than expected. The culprit is the modular arithmetic in `BloomFilterStrategies.MURMUR128_MITZ_32`. You compute `x % N`, where `N = bits.size()` and `x` is uniformly distributed in the range `0..Integer.MAX_VALUE`. For big `N`, `x % N` is far from uniform in the range `0..N-1`. For example, with `N = 3 << 29`, values below `1 << 29` are twice as probable as the others.
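To see why: the range of `x` has size `1 << 31`, which wraps around `N = 3 << 29` once, so every residue below `1 << 29` is reached by two values of `x`, while every other residue is reached by only one. A standalone sketch (the class and variable names are mine, not part of any patch) that counts the hits per residue class:

```java
public class ModuloBiasDemo {
  public static void main(String[] args) {
    long n = 3L << 29;     // bits.size() in the example above
    long range = 1L << 31; // x is uniform in 0..Integer.MAX_VALUE
    // x % n == r holds for exactly floor((range - 1 - r) / n) + 1 values of x in [0, range)
    long hitsLow = (range - 1) / n + 1;               // residue 0
    long hitsHigh = (range - 1 - (1L << 29)) / n + 1; // residue 1 << 29
    System.out.println("residues below 1<<29 are hit " + hitsLow + "x");  // 2x
    System.out.println("residues >= 1<<29 are hit " + hitsHigh + "x");    // 1x
  }
}
```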
This non-uniformity leads to results like this:

```
desiredFpp   0.000001000000
expectedFpp  0.000000610089
realFpp      0.000003000000
```
Here, `desiredFpp` is the value passed to `BloomFilter.create`, and `expectedFpp` was reported after exactly `expectedInsertions` insertions were done. Obviously, far fewer bits than expected were set. If this happened once, it might be good luck, but here it's a sign of this bug, as the `realFpp` shows.
This problem is readily reproducible; it's no glitch caused by bad luck with the selected values. AFAIK it concerns all versions since the switch away from power-of-two sizes.
**BUG 2:**
With commit 31149b4 ("The number of bits can reach Integer.MAX_VALUE now, rather than Integer.MAX_VALUE/64") another bug was introduced. The commit message is obviously wrong, as up to `Integer.MAX_VALUE` longs can be allocated, allowing nearly `2**37` bits. However, the arithmetic is still `int`-based and can address only `2**31` bits, so most of the allocated memory gets wasted.
Even worse, `bits.size()` may overflow, leading to all kinds of disasters, like "/ by zero" (e.g. for `expectedInsertions=244412641` and `desiredFpp=1e-11`) or using only 64 bits.
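A quick demonstration of the overflow, using the numbers from the example above (the reconstruction of the size computation is my assumption about the `int`-based code, not a quote of the actual source):

```java
public class SizeOverflowDemo {
  public static void main(String[] args) {
    // expectedInsertions=244412641 with desiredFpp=1e-11 needs ~1.29e10 bits,
    // i.e. about 201326592 longs:
    int dataLength = 201326592;
    int intSize = dataLength * 64;  // int overflow: 12884901888 == 3 * 2^32, low 32 bits are 0
    System.out.println(intSize);    // prints 0 -> any "x % size" throws "/ by zero"
    long longSize = (long) dataLength * 64; // long arithmetic gives the correct size
    System.out.println(longSize);   // prints 12884901888
  }
}
```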
**INEFFICIENCY:**
In `MURMUR128_MITZ_32` there is one modulus operation and one unpredictable branch per hash function. This is quite wasteful, as it's enough to compute the modulus for the two basic hashes and then use conditional subtraction.
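A sketch of the conditional-subtraction idea (hypothetical code, assuming a `long`-indexed `BitArray`; both base hashes are reduced once, and each further reduction is a compare-and-subtract):

```java
long bitSize = bits.size();
long h1 = (hash1 & Long.MAX_VALUE) % bitSize; // one modulus for the first base hash
long h2 = (hash2 & Long.MAX_VALUE) % bitSize; // one modulus for the second base hash
long index = h1;
for (int i = 0; i < numHashFunctions; i++) {
  bits.set(index);
  index += h2;          // index < bitSize and h2 < bitSize, so index < 2 * bitSize
  if (index >= bitSize) {
    index -= bitSize;   // conditional subtraction replaces '%' inside the loop
  }
}
```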
**ENHANCEMENT 1:**
As the filter may take up to 16 GB, there should be a method to find out the memory consumption.
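For example, something like the following on `BitArray` would do (the method name is hypothetical; `data` is the backing `long[]`):

```java
// Hypothetical accessor reporting the backing array's footprint:
long sizeInBytes() {
  return 8L * data.length; // 8 bytes per long word
}
```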
**ENHANCEMENT 2:**
Possibly there could be a strategy using a power-of-two table, which may be faster. In case the speedup is non-negligible, such a strategy makes a lot of sense, as the additional memory (assuming rounding up) is not wasted at all: you get a better FPP.
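With a power-of-two size, the reduction degenerates to a mask, which is division-free, branch-free, and perfectly uniform (sketch):

```java
long bitSize = 1L << 34;  // table size rounded up to a power of two
long mask = bitSize - 1;
long index = hash & mask; // replaces '%': no division, no bias, no branch
```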
**QUESTION:**
I see no reason for limiting `numHashFunctions` to 255. In the `SerialForm`, there's an `int`, so why?
**PROPOSED SOLUTION:**
Because of serialized-form compatibility, I'd suggest leaving `MURMUR128_MITZ_32` alone and creating `MURMUR128_MITZ_64`, which
- extracts two `long`s instead of two `int`s from the `HashCode`
- uses `long` arithmetic for everything
The `BitArray` must use `long` indexes, and its `bitCount()` and `size()` must return `long`, too. This works with both strategies, and the `SerialForm` needs no change.
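Roughly, the new strategy's `put` could look like this (a sketch under the assumptions above, not the patch itself; `Longs` is `com.google.common.primitives.Longs`, and `BitArray.set(long)` is assumed to return whether the bit changed):

```java
<T> boolean put(T object, Funnel<? super T> funnel, int numHashFunctions, BitArray bits) {
  byte[] bytes = Hashing.murmur3_128().hashObject(object, funnel).asBytes();
  long hash1 = lowerEight(bytes);
  long hash2 = upperEight(bytes);
  long bitSize = bits.size(); // now a long
  boolean bitsChanged = false;
  long combined = hash1;
  for (int i = 0; i < numHashFunctions; i++) {
    bitsChanged |= bits.set((combined & Long.MAX_VALUE) % bitSize); // long arithmetic throughout
    combined += hash2;
  }
  return bitsChanged;
}

static long lowerEight(byte[] b) { // little-endian lower 8 bytes of the 128-bit hash
  return Longs.fromBytes(b[7], b[6], b[5], b[4], b[3], b[2], b[1], b[0]);
}

static long upperEight(byte[] b) { // little-endian upper 8 bytes
  return Longs.fromBytes(b[15], b[14], b[13], b[12], b[11], b[10], b[9], b[8]);
}
```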
For small filters (up to a few million bits, before the non-uniformity starts to cause problems) it's fine to use the old strategy; for larger ones the new one must be used. I'd suggest using the new one for all new filters.
In order to get maximum speed, the following comes to mind:
- create a package-private `HashCode.asSecondLong()`
- compute `hash2` only if `numHashFunctions > 1`
The attached patch solves it all.