
[draft] adapter: distributed timestamp oracle backed by "postgres" #21671

Closed

Conversation

@aljoscha (Contributor) commented Sep 8, 2023

Motivation

Part of MaterializeInc/database-issues#6635, but so far a draft for shopping the thing around.

The commits tell the story of the evolution, with intermediate steps (like the consensus-backed timestamp oracle) that can be removed again. There are also further steps we could take, such as sharing the final version of the postgres oracle via an `Arc<dyn TimestampOracle>`, meaning we wouldn't have to do the shallow clones anymore.
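As a rough sketch of that last point (simplified and not the PR's actual code: only `read_ts`/`apply_write` are shown, the `Send + Sync` bounds and the alias name are assumptions):

```rust
use std::sync::Arc;

use async_trait::async_trait;

// With #[async_trait] the trait stays object safe, so a single oracle
// instance can sit behind a trait object and be shared cheaply.
#[async_trait]
pub trait TimestampOracle<T>: Send + Sync {
    /// The current read timestamp.
    async fn read_ts(&self) -> T;
    /// Mark a write at `write_ts` completed.
    async fn apply_write(&self, write_ts: T);
}

/// Every user holds a cheap `Arc` clone of the same oracle instead of
/// shallow-cloning the oracle itself.
pub type SharedTimestampOracle<T> = Arc<dyn TimestampOracle<T>>;
```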

Tips for reviewer

Checklist

  • This PR has adequate test coverage / QA involvement has been duly considered.
  • This PR has an associated up-to-date design doc, is a design doc (template), or is sufficiently small to not require a design.
  • If this PR evolves an existing $T ⇔ Proto$T mapping (possibly in a backwards-incompatible way), then it is tagged with a T-proto label.
  • If this PR will require changes to cloud orchestration or tests, there is a companion cloud PR to account for those changes that is tagged with the release-blocker label (example).
  • This PR includes the following user-facing behavior changes:

@aljoscha requested review from a team as code owners on September 8, 2023 13:14
@aljoscha requested a review from jkosh44 on September 8, 2023 13:14
@aljoscha marked this pull request as draft on September 8, 2023 13:14
@danhhz (Contributor) left a comment

Left a few superficial comments from a very high-level scan; don't bother doing anything in response to them, I'm mostly flushing them out so GitHub doesn't eat them. I'll do a deeper pass later this week.

One high-level thing I noticed is that this turns quite a few non-async fns into async fns. From what I understand, it's possible that those were intentionally non-async so that it's easier to reason about what's happening in the coord loop. We should check with folks on that.

/// reported completed write timestamps, and strictly less than all subsequently
/// emitted write timestamps.
#[async_trait]
pub trait TimestampOracle<T> {

Can this whole module be its own crate? If so, I'd lean toward doing it.

@@ -261,6 +264,8 @@ fn parse_query_when(s: &str) -> QueryWhen {
/// Transaction isolation can also be set. The `determine` directive runs determine_timestamp and
/// returns the chosen timestamp. Append `full` as an argument to it to see the entire
/// TimestampDetermination.
// allow `futures::block_on` for testing.

This turns out to be a pretty big footgun in practice, even for tests. I'd highly recommend figuring out a way to use tokio's block_on.
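A rough sketch of what that could look like in a test (not from the PR; assumes a `tokio` dev-dependency with the time feature enabled):

```rust
#[test]
fn block_on_with_tokio() {
    // A dedicated current-thread runtime: its block_on drives tokio timers
    // and IO, which futures::executor::block_on does not.
    let runtime = tokio::runtime::Builder::new_current_thread()
        .enable_all()
        .build()
        .expect("failed to build tokio runtime");

    let result = runtime.block_on(async {
        // Stand-in for awaiting real oracle calls in the test body.
        tokio::time::sleep(std::time::Duration::from_millis(1)).await;
        1 + 1
    });
    assert_eq!(result, 2);
}
```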


use crate::coord::timeline::WriteTimestamp;

pub mod consensus;

Are we planning on merging all three impls or just the final one?

use crate::coord::timeline::WriteTimestamp;

pub mod consensus;
pub mod durable;

Maybe a better name for this than "durable" is "catalog"? They're all durable, no?

}

-    pub fn schedule_storage_usage_collection(&self) {
+    pub async fn schedule_storage_usage_collection(&self) {

Might be nice to keep this non-async so it's obvious that it doesn't do anything expensive, which I think could be as easy as passing in the ts as an arg?
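A hypothetical sketch of that shape (names made up, not the PR's code): the async caller asks the oracle for the timestamp first and hands it in, so the scheduling fn itself stays synchronous and obviously cheap.

```rust
type Timestamp = u64;

struct StorageUsageSchedule {
    scheduled_at: Vec<Timestamp>,
}

impl StorageUsageSchedule {
    /// Stays non-async: it only records an already-determined timestamp.
    fn schedule_storage_usage_collection(&mut self, collection_ts: Timestamp) {
        self.scheduled_at.push(collection_ts);
    }
}

async fn caller(schedule: &mut StorageUsageSchedule, oracle_ts: Timestamp) {
    // In the real code the timestamp would come from an async oracle call
    // (e.g. `oracle.read_ts().await`) before crossing into the sync fn.
    schedule.schedule_storage_usage_collection(oracle_ts);
}
```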

UPDATE timestamp_oracle SET write_ts = GREATEST(write_ts, $2), read_ts = GREATEST(read_ts, $2)
WHERE timeline = $1;
"#;
let client = self.get_connection().await.expect("todo");

@philip-stoev (Contributor)

@aljoscha this looks like something that could use some more chaos testing and such. Please ping the QA team and/or add me as a reviewer once the time arrives. Thank you!

@jkosh44 (Contributor) left a comment

This looks good to me; I didn't have many useful comments.

/// Mark a write at `write_ts` completed.
///
/// All subsequent values of `self.read_ts()` will be greater or equal to
/// `write_ts`.

Maybe you meant lower_bound or you meant to change the name of the parameter?

let new_read_ts = if write_ts > self.read_ts {
    write_ts.clone()
} else {
    self.read_ts.clone()

Can't you just return here since the invariant

All subsequent values of self.read_ts() will be greater or equal to write_ts.

is true?
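A hypothetical, self-contained sketch of that early return (names made up, not the PR's code):

```rust
struct InMemoryOracle {
    read_ts: u64,
    write_ts: u64,
}

impl InMemoryOracle {
    fn apply_write(&mut self, write_ts: u64) {
        if write_ts <= self.read_ts {
            // The invariant "all subsequent read_ts() values are >= write_ts"
            // already holds, so there is nothing to advance or persist.
            return;
        }
        self.read_ts = write_ts;
        self.write_ts = self.write_ts.max(write_ts);
    }
}
```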

Comment on lines 10 to 11
//! A timestamp oracle that relies on the [`Catalog`] for persistence/durability
//! and reserves ranges of timestamps.

Am I missing something or is TimestampOracle never implemented in this file?

Comment on lines 70 to 258
// TODO(aljoscha): These internal details of the oracle are leaking through to
// multiple places in the coordinator.

I'm somewhat on the fence about whose responsibility this should be. TIMESTAMP_PERSIST_INTERVAL doesn't make a lot of sense anymore, since we don't reserve ranges of timestamps, and TIMESTAMP_INTERVAL_UPPER_BOUND only really makes sense with the EpochMillis timeline. For other timelines, do we actually care about the timestamp oracle jumping ahead by a large amount? monotonic_now especially makes some assumptions that now() is hooked up to some wall clock that is advancing at a rate of 1 per millisecond.

Really, a CRDB-backed oracle, of course...
Note that the consensus oracle is implemented in a non-ideal way and
could be further simplified if all operations were turned into purely
SQL statements against CRDB. Then we wouldn't need the cached read/write
ts anymore.
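For illustration, a hypothetical sketch (not the PR's code) of such a purely-SQL operation: allocating the next write timestamp in one statement that both advances and returns the value, so nothing needs to be cached client-side. The column types and the `$2` lower bound (e.g. the wall clock) are assumptions.

```rust
async fn allocate_write_ts(
    client: &tokio_postgres::Client,
    timeline: &str,
    lower_bound: i64,
) -> Result<i64, tokio_postgres::Error> {
    // GREATEST keeps the timestamp advancing even if several oracles race.
    let q = r#"
        UPDATE timestamp_oracle
            SET write_ts = GREATEST(write_ts + 1, $2)
            WHERE timeline = $1
            RETURNING write_ts;
    "#;
    let row = client.query_one(q, &[&timeline, &lower_bound]).await?;
    Ok(row.get(0))
}
```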
…timestamp

We don't need to go through "linearize reads" anymore because the
timestamp oracle by itself now linearizes peeks. We do still need this
stage for peeks that are scheduled in the future (with respect to the
current oracle-aware read timestamp).
…ites

When there are already pending writes, a group commit has already been
triggered, so we can sneak into that one and don't have to trigger our
own (sketched below).

Any superfluous group commits that get triggered will find an empty list
of pending writes but still perform costly timestamp oracle operations
and will forward all table uppers, which are costly persist operations.
This becomes noticeable when timestamp operations, such as peek_write_ts
and write_ts, are more expensive, and before this "optimization" we were
doing a copious amount of those, even with nothing to write.
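A minimal sketch of the triggering condition described above (types and names are made up, not the PR's code):

```rust
struct PendingWrite;

/// Returns true iff the caller should trigger a new group commit. If writes
/// were already pending, an earlier enqueue has triggered one and this write
/// can ride along with it.
fn enqueue_pending_write(pending: &mut Vec<PendingWrite>, write: PendingWrite) -> bool {
    let trigger_new_group_commit = pending.is_empty();
    pending.push(write);
    trigger_new_group_commit
}
```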
We can do that after a previous change for batching calls to the
timestamp oracle, which factored calls to the oracle out of
determine_timestamp_for; that was why we had to make it async in the
first place.
@aljoscha force-pushed the adapter-distributed-ts-oracle branch from 7c019ae to 47e9868 on October 5, 2023 13:13
@philip-stoev (Contributor)

@aljoscha please re-request a review from the QA team once the merge conflicts are resolved. I would like to run another Nightly at the very least.

@aljoscha (Contributor, Author)

#22262 has been merged

@aljoscha closed this Nov 23, 2023