[draft] adapter: distributed timestamp oracle backed by "postgres" #21671
Conversation
left a few superficial comments from a very high-level scan, don't bother doing anything in response to them, i'm mostly flushing them out so github doesn't eat them. i'll do a deeper pass later this week
one high-level thing i noticed is that this makes quite a few non-async fns into async fns. from what I understand, it's possible that those were intentionally non-async so it's easier to reason about what's happening in the coord loop. we should check with folks on that
/// reported completed write timestamps, and strictly less than all subsequently
/// emitted write timestamps.
#[async_trait]
pub trait TimestampOracle<T> {
can this whole module be its own crate? if so, i'd lean toward doing it
@@ -261,6 +264,8 @@ fn parse_query_when(s: &str) -> QueryWhen {
/// Transaction isolation can also be set. The `determine` directive runs determine_timestamp and
/// returns the chosen timestamp. Append `full` as an argument to it to see the entire
/// TimestampDetermination.
// allow `futures::block_on` for testing.
this turns out to be a pretty big footgun in practice, even for tests. highly recommend figuring out a way to use tokio's block_on
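For context, a minimal sketch of the tokio-based alternative, assuming the test only needs to drive a single async oracle call to completion (the names below are illustrative, not the PR's actual code):

```rust
// A minimal sketch, not the PR's code: drive an async call from sync test code
// with a dedicated tokio runtime instead of `futures::executor::block_on`,
// which doesn't run tokio's IO/timer drivers and can hang if the future needs them.
use tokio::runtime::Runtime;

// Hypothetical stand-in for something like `oracle.read_ts().await`.
async fn fetch_read_ts() -> u64 {
    42
}

fn main() {
    let runtime = Runtime::new().expect("failed to build tokio runtime");
    let ts = runtime.block_on(fetch_read_ts());
    assert_eq!(ts, 42);
}
```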
use crate::coord::timeline::WriteTimestamp;

pub mod consensus;
are we planning on merging all three impls or just the final one?
use crate::coord::timeline::WriteTimestamp;

pub mod consensus;
pub mod durable;
maybe a better name for this than "durable" is "catalog"? they're all durable, no?
}

- pub fn schedule_storage_usage_collection(&self) {
+ pub async fn schedule_storage_usage_collection(&self) {
might be nice to keep this non-async so it's obvious that it doesn't do anything expensive, which I think could be as easy as passing in the ts as an arg?
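A hypothetical sketch of that suggestion, with illustrative names rather than the coordinator's actual fields: the caller determines the timestamp on the async side and passes it in, so the scheduling function stays synchronous and obviously cheap.

```rust
struct Coordinator {
    // Illustrative stand-in for whatever state the real coordinator keeps.
    pending_collections: Vec<u64>,
}

impl Coordinator {
    // Stays non-async: it only records the already-determined timestamp.
    fn schedule_storage_usage_collection(&mut self, collection_ts: u64) {
        self.pending_collections.push(collection_ts);
    }
}

// Hypothetical stand-in for an async timestamp oracle call.
async fn read_ts() -> u64 {
    100
}

#[tokio::main]
async fn main() {
    let mut coord = Coordinator { pending_collections: Vec::new() };
    // The async work happens at the call site...
    let ts = read_ts().await;
    // ...and the scheduling call itself is a cheap synchronous push.
    coord.schedule_storage_usage_collection(ts);
    assert_eq!(coord.pending_collections, vec![100]);
}
```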
UPDATE timestamp_oracle SET write_ts = GREATEST(write_ts, $2), read_ts = GREATEST(read_ts, $2)
WHERE timeline = $1;
"#;
let client = self.get_connection().await.expect("todo");
@aljoscha this looks like something that could use some more chaos testing and such. Please ping the QA team and/or add me as a reviewer once the time arrives. Thank you!
This looks good to me, I didn't have many useful comments.
/// Mark a write at `write_ts` completed.
///
/// All subsequent values of `self.read_ts()` will be greater or equal to
/// `write_ts`.
Maybe you meant `lower_bound`, or you meant to change the name of the parameter?
let new_read_ts = if write_ts > self.read_ts {
    write_ts.clone()
} else {
    self.read_ts.clone()
Can't you just return here, since the invariant "All subsequent values of `self.read_ts()` will be greater or equal to `write_ts`" is true?
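For illustration, a simplified in-memory oracle showing what that early return could look like; the field names are assumptions, not the PR's actual code, and it assumes the oracle maintains `write_ts >= read_ts`.

```rust
struct InMemoryOracle {
    read_ts: u64,
    write_ts: u64,
}

impl InMemoryOracle {
    fn apply_write(&mut self, write_ts: u64) {
        if write_ts <= self.read_ts {
            // `self.read_ts` is already >= `write_ts`; assuming the oracle keeps
            // `write_ts >= read_ts`, there is nothing left to update.
            return;
        }
        self.read_ts = write_ts;
        self.write_ts = self.write_ts.max(write_ts);
    }
}

fn main() {
    let mut oracle = InMemoryOracle { read_ts: 5, write_ts: 5 };
    oracle.apply_write(7);
    assert_eq!((oracle.read_ts, oracle.write_ts), (7, 7));
    oracle.apply_write(3); // early return: nothing changes
    assert_eq!((oracle.read_ts, oracle.write_ts), (7, 7));
}
```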
//! A timestamp oracle that relies on the [`Catalog`] for persistence/durability
//! and reserves ranges of timestamps.
Am I missing something, or is `TimestampOracle` never implemented in this file?
// TODO(aljoscha): These internal details of the oracle are leaking through to
// multiple places in the coordinator.
I'm somewhat on the fence about whose responsibility this should be. `TIMESTAMP_PERSIST_INTERVAL` doesn't make a lot of sense anymore, since we don't reserve ranges of timestamps, and `TIMESTAMP_INTERVAL_UPPER_BOUND` only really makes sense with the `EpochMillis` timeline. For other timelines, do we actually care about the timestamp oracle jumping ahead by a large amount? `monotonic_now` especially makes some assumptions that `now()` is hooked up to a wall clock that advances at a rate of 1 per millisecond.
Force-pushed from fc897f2 to 3245e31
Force-pushed from 3245e31 to 7c019ae
Really, a CRDB-backed oracle, of course...
Note that the consensus oracle is implemented in a non-ideal way and could be further simplified if all operations were turned into purely SQL statements against CRDB. Then we wouldn't need the cached read/write ts anymore.
…timestamp

We don't need to go through "linearize reads" anymore because the timestamp oracle by itself now linearizes peeks. We do still need this stage for peeks that are scheduled in the future (with respect to the current oracle-aware read timestamp).
…ites

When there are already pending writes, a group commit has already been triggered, so we can sneak into that one and don't have to trigger our own. Any superfluous group commits that get triggered will find an empty list of pending writes but still perform costly timestamp oracle operations and will forward all table uppers, which are costly persist operations. This becomes noticeable when timestamp operations such as peek_write_ts and write_ts are more expensive, and before this "optimization" we were doing a copious amount of those, even with nothing to write.
At least we comment out the assert ... 😅
We can do that after a previous change for batching calls to the timestamp oracle, which factored calls to the oracle out of determine_timestamp_for, which was why we had to make it async in the first place.
Force-pushed from 7c019ae to 47e9868
@aljoscha please re-request a review from the QA team once the merge conflicts are resolved. I would like to run another Nightly at the very least.
#22262 has been merged
Motivation
A part of MaterializeInc/database-issues#6635, but so far a draft for shopping the thing around.
The commits tell the story of the evolution, with intermediary steps, like the consensus-backed timestamp oracle that can be removed again. Also, there are more steps we could take, like sharing the final version of the postgres oracle using an `Arc<dyn TimestampOracle>`, meaning we wouldn't have to do the shallow clones anymore.

Tips for reviewer
Checklist
If this PR evolves an existing `$T ⇔ Proto$T` mapping (possibly in a backwards-incompatible way), then it is tagged with a `T-proto` label.