Is a nodegraph a bloom filter? #1898

Open
olgabot opened this issue Nov 5, 2019 · 3 comments
Comments

@olgabot
Contributor

olgabot commented Nov 5, 2019

Hello! Seems like I should have figured this out by now but somehow I haven't.

I'm calculating the expected false positive rates for various table sizes and numbers of hash functions. Am I understanding correctly that for Nodegraph, `n_tables` is the same as the number of hash functions, i.e. *k* in the Wikipedia article on Bloom filters? And is `tablesize` the same as *m* from Wikipedia?

Here is some code to calculate the false positive probability that all k-mers from a read were false positives:

```python
import math

def per_read_false_positive_coding_rate(n_kmers_in_read, n_total_kmers=1e7,
                                        n_hash_functions=DEFAULT_N_TABLES,
                                        tablesize=DEFAULT_MAX_TABLESIZE):
    exponent = -n_hash_functions * n_total_kmers / tablesize
    print(f"exponent: {exponent}")

    # Probability that a single k-mer is (falsely) reported present
    # per_kmer_fpr = math.pow(1 - math.exp(exponent), n_hash_functions)

    # Use built-in `expm1(x)` = exp(x) - 1, so -expm1(x) = 1 - exp(x)
    per_kmer_fpr = math.pow(-math.expm1(exponent), n_hash_functions)
    print(f"per kmer false positive rate: {per_kmer_fpr}")

    # Probability that all k-mers in the read are false positives
    per_read_fpr = math.pow(per_kmer_fpr, n_kmers_in_read)
    return per_read_fpr
```

And here are some example error rates:

```
In [88]: per_read_false_positive_coding_rate(30, 1e7, 4, 1e7)
exponent: -4.0
per kmer false positive rate: 0.9287257558982396
Out[88]: 0.10879894741447617

In [89]: per_read_false_positive_coding_rate(30, 1e7, 2, 1e7)
exponent: -2.0
per kmer false positive rate: 0.7476450724155088
Out[89]: 0.00016250407621135139
```
@olgabot
Contributor Author

olgabot commented Nov 5, 2019

Also see: czbiohub-sf/orpheum#8 (comment)

@standage
Member

standage commented Nov 6, 2019

> Am I understanding correctly that for Nodegraph, `n_tables` is the same as the number of hash functions, i.e. *k* in the Wikipedia article on Bloom filters? And is `tablesize` the same as *m* from Wikipedia?

Close. You are correct that `n_tables` is equivalent to k in the Wikipedia description. But `tablesize` is not equivalent to m in the Wikipedia definition. Rather, m is (approximately) the sum of all the table sizes in the Nodegraph.

I typically visualize khmer's implementation of the Nodegraph/Nodetable/Countgraph/Counttable as a table with several rows, many columns, and a jagged right edge. Each row in this table is what khmer is referring to with n_tables, and tablesize is the target length of each row. I say "target" length here, because the rows aren't the same length. khmer selects the length of each row by starting at tablesize and finding the next n prime numbers smaller than tablesize.

When it comes time to insert into or query the filter, the element is hashed n times, the hash function for each row simply being the element's value modulo that row's length. Because each row length is a different prime number, the n hash functions are linearly independent (a property required for accuracy guarantees).

So rather than allocating a single bit vector (or byte vector for the Count-min sketch) m units long and implementing k linearly independent hash functions (as commonly described for Bloom filters), khmer allocates k bit/byte vectors m/k units long each, using division modulo the vector length as the hash function for each of the k vectors.
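The scheme described above can be sketched in a few lines of Python. This is illustrative only; `TinyNodegraph` and `primes_below` are hypothetical names for this comment, not khmer's actual API, and the sketch assumes k-mers arrive already hashed to integers:

```python
def primes_below(target, count):
    """Find the `count` largest primes strictly below `target`, descending."""
    def is_prime(x):
        if x < 2:
            return False
        return all(x % d for d in range(2, int(x ** 0.5) + 1))
    primes = []
    candidate = target - 1
    while len(primes) < count and candidate >= 2:
        if is_prime(candidate):
            primes.append(candidate)
        candidate -= 1
    return primes

class TinyNodegraph:
    """Bloom-filter-style presence sketch: one bit vector per prime row length."""
    def __init__(self, tablesize, n_tables):
        self.sizes = primes_below(tablesize, n_tables)
        self.tables = [[False] * size for size in self.sizes]

    def add(self, hashed_kmer):
        # "Hash function" i is just: hashed_kmer modulo the i-th prime row length
        for table, size in zip(self.tables, self.sizes):
            table[hashed_kmer % size] = True

    def get(self, hashed_kmer):
        # Present only if the corresponding bit is set in every row
        return all(table[hashed_kmer % size]
                   for table, size in zip(self.tables, self.sizes))
```

For example, `TinyNodegraph(1000, 4)` allocates rows of lengths 997, 991, 983, and 977, so total m is roughly `n_tables * tablesize`.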

@olgabot
Contributor Author

olgabot commented Nov 6, 2019 via email
