RFC-0075: Faster Failover and Configuration Push #123

Open · avsej wants to merge 16 commits into `master`
Conversation

@avsej (Member) commented Jun 14, 2023

No description provided.

@avsej avsej force-pushed the 0075-faster-failover-and-configuration-push branch 8 times, most recently from fe812c0 to 5978c09 Compare June 20, 2023 11:58
@avsej avsej force-pushed the 0075-faster-failover-and-configuration-push branch 2 times, most recently from 0ffe096 to 6840048 Compare June 21, 2023 14:49
@avsej avsej force-pushed the 0075-faster-failover-and-configuration-push branch from 6840048 to 6914a45 Compare June 21, 2023 14:50
avsej added 4 commits June 27, 2023 14:43
…r-and-configuration-push

* origin/master:
  Update links in 0048-sdk3-bootstrapping.md
* clarify bootstrap changes
* explain SnappyEverywhere name
@avsej avsej requested review from chvck and brett19 June 27, 2023 12:37
@avsej avsej force-pushed the 0075-faster-failover-and-configuration-push branch from 49e926d to af58f5a Compare June 27, 2023 19:52

## Feature Checklist

> 1. `GetClusterConfigWithKnownVersion` (`0x1d`). The SDK should always supply the current configuration version (the `epoch`/`revision` pair). Once the configuration is applied, the SDK must check whether the new configuration routes the operation to a new endpoint or to a new vbucket on the old endpoint, and *immediately* dispatch the operation to the new endpoint (or to the same endpoint if the vbucketID has changed). In any other case, the SDK should send the operation to the retry orchestrator.

Contributor commented:

In the case that we're using known versions and CCCP polling: if the node we've polled returns success but no body, should we consider that a successful poll and retry again after the poll period, or should we try the next node to see if it has a newer config?
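The dispatch rule in the quoted text can be sketched as follows. This is an illustrative model only: the `Config` class and its lookups are hypothetical stand-ins, not any SDK's actual API.

```python
from dataclasses import dataclass

@dataclass
class Config:
    """Toy cluster config: maps keys to vbuckets and vbuckets to endpoints."""
    vb_map: dict      # key -> vbucket id
    endpoints: dict   # vbucket id -> "host:port"

def route_after_config(key, old: Config, new: Config):
    """Decide what to do with an operation after applying a newer config."""
    old_vb = old.vb_map[key]
    new_vb = new.vb_map[key]
    old_ep = old.endpoints[old_vb]
    new_ep = new.endpoints[new_vb]
    if new_ep != old_ep or new_vb != old_vb:
        # The op now routes to a new endpoint, or to a new vbucket on the
        # same endpoint: dispatch immediately, per the quoted rule.
        return ("dispatch", new_ep, new_vb)
    # Nothing changed for this operation: hand it to the retry orchestrator.
    return ("retry", None, None)
```

The "any other case" branch at the end is the one the review thread below debates.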
@programmatix (Contributor) commented Jul 5, 2023:

I'm not sure I get this part. Say the SDK is doing N concurrent operations per node. With dedup, just one of those operations will get an NMVB with a config; the other N-1 will get an NMVB with an empty body. Why is there a bunch of special-case logic for the single operation? It would be a lot simpler to implement in the JVM clients if we could asynchronously apply the NMVB config body (all config updating is async in the JVM world) and send all requests into the RetryOrchestrator, but on a tight retry of a few milliseconds. That would be pretty easy to implement, and those few milliseconds wouldn't seem to impact fast failover at all.

(There is a tiny risk that the config wouldn't have been applied before those few milliseconds elapsed. But that would just generate a bit of extra traffic and a few more NMVBs.)

@brett19 (Contributor) commented Jul 5, 2023:

@programmatix In that case, how would you provide determinism that the config had been applied before you attempted to execute the operation again?

Contributor replied:

I haven't pathfound in the JVM clients yet (David's assigned, but he asked me to look at it while he's away), but IIRC the config mechanism is an async pub-sub: something pushes a config, and all subscribers will, unless something has gone tragically wrong, get that config within nanoseconds. So I'm proposing (and I may want to revise this after pathfinding) permitting an alternative setup that's technically non-deterministic, but will be deterministic in practice in basically all cases: send the new config to the config publisher, and schedule the operation to be retried in X milliseconds. It should be easy to implement, and it's much less risky than introducing a new blocking mechanism into the config publisher that would be challenging to test and would only trigger during rebalances. And if something odd happens and the config publisher hasn't published within X milliseconds, then all that happens is the operations are retried, we get another NMVB, and we go around again. That would mean some more traffic to the cluster, but to stress: this is an in-memory pub-sub stream, and it should always have delivered the config within those X milliseconds.

Contributor commented:

I agree with Graham's commentary. "Immediately retry now" isn't feasible for the JVM SDKs.
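The alternative described above (publish the config asynchronously, then re-drive the operation on a short timer) can be sketched like this. The `ConfigPublisher`, the 2 ms delay, and the callback shapes are all hypothetical; this models the proposal, not any SDK's real internals.

```python
import threading

class ConfigPublisher:
    """Toy in-memory pub-sub: each published config fans out to subscribers."""
    def __init__(self):
        self._subscribers = []

    def subscribe(self, callback):
        self._subscribers.append(callback)

    def publish(self, config):
        for cb in self._subscribers:
            cb(config)

RETRY_DELAY_SECONDS = 0.002  # the "X milliseconds" from the comment; illustrative

def on_nmvb(publisher, config_body, op, dispatch):
    """Handle an NMVB response: apply config asynchronously, retry on a timer."""
    # Deduplicated responses may carry an empty body; only publish real configs.
    if config_body is not None:
        publisher.publish(config_body)
    # Schedule the retry. By the time it fires, the config has almost certainly
    # been applied; if not, we simply see another NMVB and go around again.
    timer = threading.Timer(RETRY_DELAY_SECONDS, dispatch, args=(op,))
    timer.start()
    return timer
```

In this model the N-1 operations whose NMVB carried an empty body take exactly the same path as the one carrying the config, so no special-case logic for the single config-bearing operation is needed.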

| .NET | Jeffry Morris | | |
| C/C++ | Sergey Avseyev | | |
| Go | Charles Dixon | | |
| Java/Kotlin | David Nault | | |

Contributor suggested a change: replace `| Java/Kotlin | David Nault | | |` with `| Java/Kotlin | David Nault | 2023-07-18 | #1 |`

dnault previously approved these changes Jul 18, 2023
@jeffrymorris (Contributor) left a comment:

A couple of questions regarding the packet layout for GET_CONFIG and CLUSTERMAP_CHANGE_NOTIFICATION. They look like typos, but they're confusing to me.

Co-authored-by: David Nault <dnault@users.noreply.github.com>
@ingenthr (Contributor) commented Jun 3, 2024:

Hey @avsej, @brett19, @dnault, et al. It looks like this is a bit stuck? What do we need to resolve to get it unstuck?

8 participants