
(WIP) Distributed in-memory cache for Singularity data #1965

Closed
wants to merge 31 commits

Conversation


@ssalinas ssalinas commented Jun 23, 2019

Even with recent updates to ZK usage, non-leader instances are still very hard on ZooKeeper and much slower than the leader, which has all data in memory. This PR aims to remedy that by keeping an eventually consistent view of the leader's data in memory on all non-leader instances as well. It will also unify caching across the different manager classes, since we currently have a leader cache, a web cache, and multiple ZkCache classes.

The distributed view of the data is currently accomplished using Atomix, with the goal of keeping all data in memory on the leader and replicating writes to the other cluster members (primary-backup mode in Atomix). Atomix member discovery is done via the ZK leader latch, since we already have ZK present (though DNS could be another option we provide). Atomix also provides leader election/Raft protocols, but we are using it mostly as a read-only cache, so we are leaving those alone for now.
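To make the primary-backup semantics above concrete, here is a minimal stdlib-only sketch of the replication idea (all names are hypothetical; this is an illustration of the semantics, not Atomix's actual API): the primary applies each write locally and pushes it to every backup, so reads on any member are served from that member's local copy.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Toy illustration of primary-backup replication semantics (hypothetical
// names; not Atomix's API). Writes go to the primary, which replicates them
// to backups; every member serves reads from its own in-memory copy.
class PrimaryBackupMap<K, V> {
  private final Map<K, V> local = new ConcurrentHashMap<>();
  private final List<PrimaryBackupMap<K, V>> backups;

  PrimaryBackupMap(List<PrimaryBackupMap<K, V>> backups) {
    this.backups = backups;
  }

  // Called on the primary: apply locally, then replicate to each backup.
  void put(K key, V value) {
    local.put(key, value);
    for (PrimaryBackupMap<K, V> backup : backups) {
      backup.local.put(key, value);
    }
  }

  // Reads never leave the local member's copy.
  V get(K key) {
    return local.get(key);
  }

  public static void main(String[] args) {
    PrimaryBackupMap<String, String> backup = new PrimaryBackupMap<>(List.of());
    PrimaryBackupMap<String, String> primary =
        new PrimaryBackupMap<>(List.of(backup));
    primary.put("request-1", "ACTIVE");
    System.out.println(backup.get("request-1")); // prints ACTIVE
  }
}
```

In real Atomix the replication is asynchronous over the network, which is why the view on non-leaders is only eventually consistent.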

TODOs:

  • Finish editing all manager classes to use the single cache class
    • Replace older zk cache and web cache usages with either distributed maps for data on the leader, or guava caches otherwise (e.g. task/deploy data)
  • Determine how we can tell if our current distributed view is in sync or not, so we can determine when/if non-leading instances should fall back to zk. This is mostly relevant for getAll type methods, since we can easily fall back to zk if checking a map and the key doesn't exist
  • Wire up a test atomix framework to run with unit tests
  • Test this in our staging cluster to make sure that startup/shutdown ordering is correct, as well as all available Atomix configuration params
    • Test that a single node works fine
    • Test that multiple nodes work on a rolling deploy
    • Test that multiple nodes work when starting from scratch
    • Test that a node can take leadership and correctly rewrite cache data from ZK
  • Update endpoints, UI calls, and the client to have a skipCache rather than useCache param, since the cache will be the new default
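The per-key fallback described in the TODOs can be sketched with stdlib types only (the `zkLoader` function and all names here are hypothetical stand-ins, not Singularity's actual classes): a read consults the distributed view first and falls back to ZK only when the key is absent, which is exactly why single-key reads are easy and getAll-style reads are the hard case.

```java
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Hypothetical sketch of the per-key read path: check the replicated
// in-memory view, and fall back to a ZooKeeper read only on a miss.
class CachedReader<K, V> {
  private final Map<K, V> distributedView = new ConcurrentHashMap<>();
  private final Function<K, Optional<V>> zkLoader; // stands in for a real ZK read

  CachedReader(Function<K, Optional<V>> zkLoader) {
    this.zkLoader = zkLoader;
  }

  // Invoked when a write from the leader is replicated to this member.
  void onReplicatedWrite(K key, V value) {
    distributedView.put(key, value);
  }

  // A miss on a single key is detectable, so falling back to ZK is cheap.
  Optional<V> get(K key) {
    V cached = distributedView.get(key);
    if (cached != null) {
      return Optional.of(cached);
    }
    return zkLoader.apply(key); // fall back to ZK on a miss
  }
}
```

For getAll-style methods this trick does not work: an incomplete map looks the same as a complete one, which is why the PR needs a separate way to tell whether the distributed view is in sync.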

@ssalinas
Member Author

Additions to this. When trying to get Atomix into our current unit testing setup, the tests were so slow and expensive that I couldn't run them on a local laptop. Even on our m5.12x it was taking 10+ minutes and sometimes failing. We've known about the slow tests for a while, and this was a good excuse to fix them. So this PR also includes an update to JUnit 5, which lets us use its BeforeAll/AfterAll methods with dependency injection. This means we only create the hk2/test ZK server/Atomix setup once per test class instead of once per test method. Average build times on our infra are down from 6-7 minutes to just under 2 minutes for the SingularityService module.
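The JUnit 5 lifecycle pattern referenced above looks roughly like this (class and fixture names are hypothetical, not Singularity's actual test classes; `TestingZkServer` stands in for an embedded ZK test server): static `@BeforeAll`/`@AfterAll` methods run once per test class, so expensive fixtures are paid for once rather than per test method.

```java
import org.junit.jupiter.api.AfterAll;
import org.junit.jupiter.api.BeforeAll;
import org.junit.jupiter.api.Test;

// Hedged sketch of the once-per-class fixture pattern in JUnit 5.
// TestingZkServer is a hypothetical stand-in for an embedded ZK server.
class SingularitySchedulerIT {
  static TestingZkServer zkServer;

  @BeforeAll
  static void startExpensiveFixtures() throws Exception {
    zkServer = new TestingZkServer();
    zkServer.start(); // runs once for the whole class, not per @Test method
  }

  @AfterAll
  static void stopExpensiveFixtures() throws Exception {
    zkServer.close();
  }

  @Test
  void eachTestReusesTheSharedServer() {
    // test body uses zkServer without paying the startup cost again
  }
}
```

In JUnit 4 the equivalent was `@BeforeClass`/`@AfterClass`; the JUnit 5 versions compose better with extension-based dependency injection, which is what makes sharing the hk2/ZK/Atomix setup across a class practical.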

@ssalinas ssalinas mentioned this pull request Jul 16, 2019
@ssalinas
Member Author

While a cool technology, this ended up overcomplicating the startup/leader procedure to the point where I'd be too worried about data consistency/integrity to move forward with it. I'm going to pick apart some of the more usable pieces of this PR into smaller PRs and find a different approach for speeding up some endpoints on non-leading instances.

@ssalinas ssalinas closed this Jul 16, 2019
@ssalinas ssalinas deleted the caching_update branch September 5, 2019 12:53