
flexible reply for sending entries #787

Closed
drmingdrmer opened this issue Apr 24, 2023 · 9 comments

@drmingdrmer
Member

drmingdrmer commented Apr 24, 2023

BTW, I'm wondering if the backoff is the right way to go. It depends on the network implementation. If you have a relatively low volume of updates and use, for instance, UDP to send the messages, then yes, this will likely work very well.

What I am worried about is when you have a lot of data to send. Then, the server will retry sending the entire missing log at once. Maybe it would be better to allow the connection to report a partial success. Say, I want to send 100 entries; 50 get through and then the timeout hits. The next time, we'll try to re-send all 100 entries (and some) again, so the timeout hits after 50 entries again, and we'll never make progress. Of course, this can likely be handled in the Network implementation by noting down the last successful entry index before the send was killed by the timeout and next time sending just the rest.

But it might be more desirable to pass the ttl to send_append_entries() instead of limiting the time to send from the outside. As long as there is progress sending individual entries, the call would not time out. I.e., instead of timing out the entire call sending 100 entries, you'd time out sending individual entries in the network implementation. This could even be made backward-compatible by invoking a default-implemented wrapper method send_append_entries_with_timeout(), which would by default just call timeout(ttl, send_append_entries(...)).await (see the sketch below).

Any take on this?

Originally posted by @schreter in #462 (comment)
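A minimal sketch of that wrapper idea, assuming a simplified network trait; AppendEntriesRequest, AppendEntriesResponse, NetworkError, and the method signatures are illustrative stand-ins, not the actual openraft API:

```rust
use std::time::Duration;

struct AppendEntriesRequest;  // stand-in for the real RPC payload
struct AppendEntriesResponse; // stand-in for the real reply

#[derive(Debug)]
enum NetworkError {
    Timeout,
    Unreachable,
}

#[async_trait::async_trait]
trait RaftNetwork {
    async fn send_append_entries(
        &mut self,
        rpc: AppendEntriesRequest,
    ) -> Result<AppendEntriesResponse, NetworkError>;

    // Default implementation: apply the TTL to the whole call. An
    // implementation can override this to time out individual entries
    // or sub-batches instead, so the call keeps running as long as it
    // makes progress.
    async fn send_append_entries_with_timeout(
        &mut self,
        rpc: AppendEntriesRequest,
        ttl: Duration,
    ) -> Result<AppendEntriesResponse, NetworkError> {
        tokio::time::timeout(ttl, self.send_append_entries(rpc))
            .await
            .map_err(|_elapsed| NetworkError::Timeout)?
    }
}
```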

@github-actions

👋 Thanks for opening this issue!

Get help or engage by:

  • /help : to print help messages.
  • /assignme : to assign this issue to you.

@drmingdrmer
Member Author

@schreter:

I think @Licenser's requirement is to stop sending out anything if a node is known to be unreachable (send_append_entries() returns Unreachable), while the case you mentioned is a different problem (send_append_entries() returns NetworkError).

Passing a ttl to send_append_entries() would be great!

With the current framework, openraft cannot deal with a partially successful result; it will always resend the full batch of entries.

For now, maybe it is feasible to let the network implementation return a timeout hint back to openraft, so that the next time openraft will use a bigger timeout.

I think you are talking about a stream-like send-entries API, with which a big batch of entries is split into small chunks.
Let me add it when the storage v2 refactoring is done.

@Licenser
Contributor

I think this would be a really neat feature that improves network reliability and would work well hand in hand with the unreachable state or even initial joins.

One thing we've been doing in tremor is using dynamic backpressure from the downstream system to throttle messages (either dropping them, which isn't reasonable here, or slowing down sending, which might work well).

Perhaps a similar logic could be added here? I think with a timeout hint it could even be done inside the Network implementation without forcing openraft to make a decision: a downstream system could send back a "please slow down" message that the network protocol implements, and the sender would then increase timeouts or add delays as required (a rough sketch follows).
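A rough sketch of what such Network-internal throttling could look like; the FlowHint type and all field names are hypothetical, nothing here is part of the openraft API:

```rust
use std::time::Duration;

/// Hypothetical hint a follower piggybacks on its reply.
enum FlowHint {
    Ok,
    SlowDown { suggested_delay: Duration },
}

/// Sender-side state kept entirely inside the Network implementation.
struct AdaptiveSender {
    /// Delay inserted before each send; grows on SlowDown, decays on Ok.
    delay: Duration,
}

impl AdaptiveSender {
    fn on_reply(&mut self, hint: FlowHint) {
        match hint {
            // Back off: honor the follower's suggestion, but cap it.
            FlowHint::SlowDown { suggested_delay } => {
                self.delay = suggested_delay.min(Duration::from_secs(1));
            }
            // Recover: halve the delay so throughput ramps back up.
            FlowHint::Ok => self.delay /= 2,
        }
    }
}
```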

@schreter
Collaborator

I think with a timeout hint it could even be done inside the Network implementation and not force openraft to make a decision

Exactly, this was the intention :-).

@drmingdrmer
Member Author

I think with a timeout hint it could even be done inside the Network implementation and not force openraft to make a decision

Exactly, this was the intention :-).

So this is the partial-success reply? send_append_entries() has to be able to return which log entries are replicated and which are not, instead of a single bool.
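For illustration, such a reply might look like the following; AppendEntriesResult and its variants are assumptions, not existing openraft types:

```rust
/// Hypothetical reply that reports how far replication got,
/// instead of a plain success/failure bool.
enum AppendEntriesResult {
    /// Every entry in the request was accepted by the follower.
    Success,
    /// Entries up to and including `last_replicated` were accepted;
    /// the leader only needs to resend the entries after this index.
    PartialSuccess { last_replicated: u64 },
    /// The follower's log conflicts with the request; the leader
    /// must back up and probe for the last matching entry.
    Conflict,
}
```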

drmingdrmer changed the title from "flexible timeout for sending entries" to "flexible reply for sending entries" on Apr 24, 2023
@Licenser
Contributor

Many messaging systems use always-increasing message IDs and sparse acks to allow for batching (in this fashion) without needing full batches to be confirmed. I think we can "steal" the idea here. The basic idea is that not every message has to be acknowledged, but any acknowledgement given means that this message and everything before it was received and processed. I think that would satisfy @schreter's requirement and be quite scalable, as we would still have one ack per batch in most cases, even during partial acks.
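A minimal sketch of that cumulative-ack bookkeeping on the sender side, assuming monotonically increasing log indices; all names here are made up for illustration:

```rust
use std::collections::BTreeMap;

struct Outbox {
    /// Highest log index the peer has cumulatively acknowledged.
    acked: u64,
    /// Entries still in flight, keyed by their increasing log index.
    in_flight: BTreeMap<u64, Vec<u8>>,
}

impl Outbox {
    /// Process a (possibly sparse) ack: an ack for `index` covers
    /// every entry at or below `index`, so a single ack can confirm
    /// a whole batch, or part of one.
    fn on_ack(&mut self, index: u64) {
        if index > self.acked {
            self.acked = index;
            self.in_flight.retain(|&i, _| i > index);
        }
    }
}
```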

@drmingdrmer
Member Author

Certainly! At the moment, openraft follows a request/response design. Implementing a streaming pattern as you proposed would bring about increased flexibility and efficiency.

I have been focusing on refactoring the storage API to a stream-like interface, after which the network API will be revised accordingly.

@Licenser
Contributor

Yeah, no hurry, I'm just thinking out loud :)

@schreter
Collaborator

So this is the partial success reply? send_append_entries() has to be able to return what log entries are replicated and what not, instead of a single bool.

There are two parts to this:

  • timeout handling for large batches
  • partial success reporting

For the former, it's sufficient to pass the TTL into the method instead of timing out on the outside. The Network implementation can then apply the timeout to individual messages/sub-batches.

The latter is the possibility to advance the commit index before the entire batch is processed. Here, the Network implementation might return early, indicating Success(LogIndex) instead of plain Success, so the caller can adjust the commit index and do what's needed before diving back into the Network with the remaining entries to process.

Alternatively, it could be possible to completely decouple it (streaming API), so the Network would send back the status via a callback batch-wise, similar to the updates already done for logging. I.e., there would be no big difference between local logging and log replication. Potentially, even the same callback type could be used.
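As a sketch of that decoupled variant, assuming tokio channels stand in for the wire and the callback; ReplicationProgress and replicate_stream are hypothetical names, not an existing interface:

```rust
use tokio::sync::mpsc;

/// Hypothetical progress events the Network reports back, batch-wise.
enum ReplicationProgress {
    /// Entries up to and including this index reached the follower.
    Replicated(u64),
    /// The stream broke at this index; the caller decides how to retry.
    Failed(u64),
}

/// Drains a stream of (index, payload) entries and reports progress
/// asynchronously, so the leader can advance the commit index without
/// waiting for a whole batch to complete.
async fn replicate_stream(
    mut entries: mpsc::Receiver<(u64, Vec<u8>)>,
    progress: mpsc::Sender<ReplicationProgress>,
) {
    while let Some((index, _payload)) = entries.recv().await {
        // ... send the entry over the wire here ...

        // Report per-entry (or per-sub-batch) progress.
        if progress
            .send(ReplicationProgress::Replicated(index))
            .await
            .is_err()
        {
            break; // the leader side went away; stop streaming
        }
    }
}
```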
