-
Notifications
You must be signed in to change notification settings - Fork 9
Plugins
Unlike Lucene and similar bitset-backed technologies, Miru is designed to grant low-level access to its indexes for greater versatility in how data is ingested and consumed.
While Miru handles ingress and indexing on behalf of clients, applications may use either the included or custom plugins to join and expose the underlying indexes.
For aggregates and counts, bitsets keep on giving. While any combination of boolean operations (AND, OR, XOR, AND NOT, etc.) is being performed, we can track the number of set (1) bits and, presto, we have counts! (Cardinality can be expensive to calculate for compressed bitmaps.) According to various performance tests, the overhead to keep track of the count is negligible. In fact, a similar technique is used in parallel to track the ID of the last set bit (also expensive for compressed bitmaps). The count in conjunction with the index of the last set bit allows us very efficiently to compute the raw count of aggregates, or to compose raw activities into streams with counts per aggregate.
To elaborate more on how we aggregate within Miru:
Aggregation of activities using bitsets is a relatively straightforward endeavor. We start by building a query centered around an "aggregation field" (but containing any number of term filters), which yields a subset of activities that should be considered. The resulting bitset is the starting point for aggregation. The first step is to take the index of the last set bit and look up the associated activity's term for the aggregation field. With the term in hand, we look up the term's complete bitset. After taking note of the current size (cardinality) of the initial result, we remove all of the set (1) bits from the initial result that are found in common with the last set bit's term index. In other words:
oldCardinality = resultBitSet.cardinality();
newResultBitSet = resultBitSet.andNot(lastActivity.aggregateField.term.bitSet);
newCardinality = newResultBitSet.cardinality();
Now take note of the size (cardinality) of the new result. We know that the last set bit's aggregate term contained oldCardinality - newCardinality
activities in common with the initial result. Since we have removed all the activities from the beginning result set, the basic subtraction of cardinalities gives us the number of activities contributing to the "count" of the initial aggregate term. At this point our new result no longer contains any activity relating to that term, and we can start the process all over by asking the new result set for its last set bit, and so on.
To put it another way, let's say you want the most recent 10 aggregate terms. We know that we will have to subtract 10 bitsets from the initial result set. We determine which bitsets to subtract by leveraging the index of the last set bit to look up the activity and get the associated aggregate term. We have shown that we can do approximately 10,000 bitset subtractions in about 50 milliseconds. Taking 50 milliseconds as the time required to build the initial result set, we'll need about 100 milliseconds to aggregate 10,000 terms. Note: These timings are approximate and downplay the cost of forward lookups, and they will vary based on the distribution and cardinalities of all the bitsets that come into play. Despite this disclaimer, we find that a typical request needs at most 50 aggregate terms, so the characteristics of bitset aggregation are well within our latency tolerance.
Akin to aggregate counts, Miru supports the collection of distinct terms for a particular field. However, gathering distincts additionally allows wildcard lookups in the form of term prefixes, e.g. gathering terms cat, camel, and caribou by requesting the term prefix ca. Term indexing additionally supports numeric prefixes for true numeric range lookups by storing a lexicographical representation of the numeric bytes. In the absence of additional term filters, Miru optimizes the distinct term lookup by performing a range scan directly against the term lookup table.
Forthcoming.