-
Notifications
You must be signed in to change notification settings - Fork 75
New issue
Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? # to your account
(TK-202) Add support for restart and HUP handling #198
(TK-202) Add support for restart and HUP handling #198
Conversation
@igalic thanks for the heads up. That's disconcerting :/ |
(init [this context] (lc-fn context :init-service3)) | ||
(start [this context] (lc-fn context :start-service3)) | ||
(stop [this context] (lc-fn context :stop-service3)) | ||
(service3-fn [this] (lc-fn nil :service3-fn)))])) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I might suggest moving these down so that they're just above the tests that use them.
+1 other than the one minor nitpick. Probably also worth doing a |
@pcarlisle do you want to review this? @haus added a test that actually exercises the signal. |
cool, looks fine to me |
(internal/register-sighup-handler [app]) | ||
(beckon/raise! "HUP") | ||
(let [start (System/currentTimeMillis)] | ||
(while (or (not= (count @call-seq) 15) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This loop logic is inverted. I'll get a fix out this afternoon.
supposedly travis is no longer overflowing its heap, so i kicked the tests. 🎱 |
this)))) | ||
this) | ||
(a/restart [this] | ||
(run-lifecycle-fns app-context s/stop "stop" (reverse ordered-services)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What should happen if one of these fail - e.g., if an unhandled exception is thrown up the stack from one of a service's lifecycle functions? Should the process that the apps are running in be shut down?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It makes sense to me to shut the app process down in that case, but maybe @cprice404 has thoughts here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think in shutdown!
we'll catch an Exception thrown from an individual service's stop function, log it, and continue calling the other services so they still have a chance to stop? Should we do that here too? Maybe we should share some of this logic between the restart and shutdown!
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've just pushed out a new commit that wraps the restart in a try-catch to achieve this effect.
Not that I think it's a big deal or even desirable, true statement that we won't reparse the bootstrap.cfg - potentially starting up different services than were present at initial start - after a SIGHUP, right? |
(is (= [:init-service1 :init-service2 :init-service3 | ||
:start-service1 :start-service2 :start-service3] | ||
:start-service1 :start-service2 :start-service3 | ||
:stop-service3 :stop-service2 :stop-service1] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice.
In earlier descriptions about this feature, I had the impression that it wasn't going to be possible for a service to store state in its context and then be able to get that state back after a restart. But it looks like it does actually preserve the service specific context across the restart. Is that intentional? I suppose that might be nice for some services that want to conditionally skip re-initializing something if they can preserve it across a restart but probably worth documenting that as such so that services are prepared for that. For example, it might be bad for the service to end up with a "dirty" context that is a combination of some new stuff the service wanted to add to the context after the restart but wasn't aware of old unintended stuff was implicitly preserved on the context from the previous service startup. |
@camlow325 good catch, no, that was not intentional, we should most definitely be clearing state. |
Overall, looks really good and seemed to do what I expected when testing it. I left some questions about expected error handling - like should the process get shut down if the restart fails - and whether or not data is supposed to be preserved in the service context across a restart. @cprice404 and I had also talked about the concern I had about the restart being executed from the signal handler's thread while concurrent activity might be happening on the Trapperkeeper main thread - and specifically how it might be better to make assurances that lifecycle functions only ever be executed from the same thread to avoid potential issues. That led to https://tickets.puppetlabs.com/browse/TK-300 being filed. I'm good with letting that question be sorted out there. |
@camlow325 I pushed some new commits to ensure context is being cleared on a restart. |
(bootstrap-services-with-empty-config services))] | ||
; We can't use the with-app-with-empty-config macro because we don't want to use its implicit tk-app/stop call | ||
; We're asserting that the stop will happen because of the exception. So instead, we use the tk/run-app here to | ||
; block on the app until the restart is called an explodes in an exception. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These lines are really long - we typically break them at ~80 characters (in Clojure).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
cool. will fix up. fixed up.
I took a quick peek. 🙈 |
okay travis. let's try this again. |
check out that green from travis |
@cprice404 sounds reasonable to me. |
(stop [this context] (throw (IllegalStateException. "Exploding Service")))))] | ||
(try | ||
(ks-testutils/with-no-jvm-shutdown-hooks | ||
(let [app (internal/throw-app-error-if-exists! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you could combine this let block with the next one.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yup. seems to work. pushed out a commit with that change.
👍 with just the remaining questions about stop/shutdown processing in the event of a restart failure. |
This commit adds a `restart` function to the `tk-app` protocol. The implementation simply stops all of the services, then calls their `init` and `start` functions to bring them back up. We intend to add another commit that adds support for handling SIGHUP signals, by calling this function for all running TK apps.
Prior to this commit, we had a single global var that we'd populate with a reference to the last TK app that was started. This was intended to make it possible to connect to a running process via nrepl, and get a reference to the app. This would cause issues if there were more than one TK app running in the same process; this commit changes the variable to be an atom that maintains a vector of all running TK apps.
This commit applies the indentation recommended by the pl clojure style guide to the services tests.
The conditional in the while loop was previously inverted and only worked because the count did eventually become 15. If it hadn't, it would have looped forever. This commit updates the conditional to break the loop if the count is correct or if 1000 milliseconds have elapsed.
Previously during a restart the contexts for various services would be left intact, which is undesirable. This commit updates the restart function to re-initialize all of the service contexts after they have stopped.
This commit adds a simple test to validate that the context is re-initialized after a restart.
If any step in the restart fails, it won't be a good idea to attempt to continue the restart process. This commit wraps the restart in a try catch that will cause a shutdown if an exception is thrown at any point in the restart.
(stop [this] "Stop the services") | ||
(restart [this] "Stop and restart the services")) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should probably return the tk to for consistency with start, stop and init.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks. updated to do so.
I'm seeing a test failure related to the optional deps. more to come. |
Ah, yes rebasing the optional-deps work in has broken those tests as they use build-app. |
I'm going to see about updating build-app to take both a function and maintain previous behavior. Ideally, my last commit can go away then and existing repls will continue to work. |
pushed out a commit that updates build-app to support a map or function for config-data. i've verified that the puppet-server repl works as is with these changes. |
To be consistent with the other app functions, this commit updates the restart to return `this`.
A previous iteration of test used with-test-logging-debug, but that is no longer the case so this commit removes the refer.
:post [(satisfies? app/TrapperkeeperApp %)]} | ||
(config/initialize-logging! config-data) | ||
(internal/build-app* services config-data (promise))) | ||
(let [config-data-fn #(if (map? config-data) (constantly config-data) config-data)] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need to add the map-or-fn support to the other similar functions below?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, do we need the anonymous function that wraps the if map?
or can we just get rid of that like:
(let [config-data-fn (if (map? config-data) (constantly config-data) config-data] ...)
An earlier commit had changed the signature of build-app in a backward incompatible manner. This commit updates build-app to handle the previous behavior, where config-data was a map, as well as the new behavior, where config-data is a function.
;; This adds the TrapperkeeperApp instance to the tk-apps list, so that | ||
;; it can be referenced in a remote nREPL session, etc. | ||
(swap! internal/tk-apps conj app) | ||
(internal/register-sighup-handler) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One thing I notice here:
We don't do the initial registration of the sighup handler until after the app is booted. I think this lessens the concerns that people had about a HUP causing indeterminate behavior if the user sent it before the app had finished booting up. The behavior won't be indeterminate -- it will just shut down the process.
We might want to revisit that in the future, and change this up so that the registration of the handler happens earlier and it's smart enough not to initiate the restart while the initial startup is in progress, but I think we can defer that for now. The fact that the behavior is not indeterminate seems to alleviate the most pressing concerns.
I left a few minor nits, and some suggestions for making one of the tests just a bit more complete. Other than I'm +1 and will merge this today unless anyone objects. (There is still some additional work to do before we'd roll out a release, I'm just talking about merging this PR... main implication there is that we can't do any other new TK releases that don't include this, after this is merged, unless we create a new branch... which would be nice to avoid.) @KevinCorcoran @camlow325 @pcarlisle @senior @adreyer if you have any concerns about me merging this today once @haus has had a chance to review the last round of feedback, let me know, but I feel like I've already pretty much checked with most of you and it seems like everyone's good with it. |
This commit corrects a typo, collects parens, and moves a service down a line to respond to feedback in the PR.
This commit updates the test-context-cleared-on-restart test to include some additional state and assertions as suggested by cprice. This includes validating that the context is cleared after a restart and that init is called twice during the test lifetime.
+1 |
(TK-202) Add support for restart and HUP handling
This PR adds a restart function for TK apps that allows trapperkeeper to cleanly handle SIGHUP by calling restart on all of the registered apps.