Is a nodegraph a bloom filter? #1898

Open
olgabot opened this issue Nov 5, 2019 · 3 comments
Comments

@olgabot
Contributor

olgabot commented Nov 5, 2019

Hello! Seems like I should have figured this out by now but somehow I haven't.

I'm calculating the expected false positive rates for various table sizes and numbers of hash functions. Am I understanding correctly that for Nodegraph, `n_tables` is the same as the number of hash functions, i.e. *k* in the Wikipedia article on Bloom filters? And is `tablesize` the same as *m* from Wikipedia?

Here is some code to calculate the false positive probability that all k-mers from a read were false positives:

```python
import math

def per_read_false_positive_coding_rate(n_kmers_in_read, n_total_kmers=1e7,
                                        n_hash_functions=DEFAULT_N_TABLES,
                                        tablesize=DEFAULT_MAX_TABLESIZE):
    exponent = -n_hash_functions * n_total_kmers / tablesize
    print(f"exponent: {exponent}")

    # Probability that a single k-mer is (falsely) reported present
    # per_kmer_fpr = math.pow(1 - math.exp(exponent), n_hash_functions)

    # Use built-in `expm1(x)` = exp(x) - 1, so -expm1(x) = 1 - exp(x)
    per_kmer_fpr = math.pow(-math.expm1(exponent), n_hash_functions)
    print(f"per kmer false positive rate: {per_kmer_fpr}")

    # Probability that all k-mers in the read are false positives
    per_read_fpr = math.pow(per_kmer_fpr, n_kmers_in_read)
    return per_read_fpr
```

And here are some example error rates:

```
In [88]: per_read_false_positive_coding_rate(30, 1e7, 4, 1e7)
exponent: -4.0
per kmer false positive rate: 0.9287257558982396
Out[88]: 0.10879894741447617

In [89]: per_read_false_positive_coding_rate(30, 1e7, 2, 1e7)
exponent: -2.0
per kmer false positive rate: 0.7476450724155088
Out[89]: 0.00016250407621135139
```
@olgabot
Contributor Author

olgabot commented Nov 5, 2019

Also see: czbiohub-sf/orpheum#8 (comment)

@standage
Member

standage commented Nov 6, 2019

> Am I understanding correctly that for Nodegraph, `n_tables` is the same as the number of hash functions, i.e. *k* in the Wikipedia article on Bloom filters? And is `tablesize` the same as *m* from Wikipedia?

Close. You are correct that `n_tables` is equivalent to k in the Wikipedia description. But `tablesize` is not equivalent to m in the Wikipedia definition. Rather, m is (approximately) the sum of all the table sizes in the Nodegraph.

I typically visualize khmer's implementation of the Nodegraph/Nodetable/Countgraph/Counttable as a table with several rows, many columns, and a jagged right edge. Each row in this table is what khmer is referring to with n_tables, and tablesize is the target length of each row. I say "target" length here, because the rows aren't the same length. khmer selects the length of each row by starting at tablesize and finding the next n prime numbers smaller than tablesize.

When it comes time to insert into or query the filter, the element is hashed n times, the hash function for each row simply being the element's value modulo that row's length. Because each row length is a different prime number, the n hash functions are linearly independent (a property required for accuracy guarantees).

So rather than allocating a single bit vector (or byte vector for the Count-min sketch) m units long and implementing k linearly independent hash functions (as commonly described for Bloom filters), khmer allocates k bit/byte vectors m/k units long each, using division modulo the vector length as the hash function for each of the k vectors.
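The scheme described above can be sketched in a few lines of Python. This is illustrative only; `TinyNodegraph` and `primes_below` are hypothetical names for this comment, not khmer's actual API, and the sketch assumes k-mers arrive already hashed to integers:

```python
def primes_below(target, count):
    """Find the `count` largest primes strictly below `target`, descending."""
    def is_prime(x):
        if x < 2:
            return False
        return all(x % d for d in range(2, int(x ** 0.5) + 1))
    primes = []
    candidate = target - 1
    while len(primes) < count and candidate >= 2:
        if is_prime(candidate):
            primes.append(candidate)
        candidate -= 1
    return primes

class TinyNodegraph:
    """Bloom-filter-style presence sketch: one bit vector per prime row length."""
    def __init__(self, tablesize, n_tables):
        self.sizes = primes_below(tablesize, n_tables)
        self.tables = [[False] * size for size in self.sizes]

    def add(self, hashed_kmer):
        # "Hash function" i is just: hashed_kmer modulo the i-th prime row length
        for table, size in zip(self.tables, self.sizes):
            table[hashed_kmer % size] = True

    def get(self, hashed_kmer):
        # Present only if the corresponding bit is set in every row
        return all(table[hashed_kmer % size]
                   for table, size in zip(self.tables, self.sizes))
```

For example, `TinyNodegraph(1000, 4)` allocates rows of lengths 997, 991, 983, and 977, so total m is roughly `n_tables * tablesize`.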

@olgabot
Contributor Author

olgabot commented Nov 6, 2019 via email
