Observe non-working times feature #48

Closed
twildeboer opened this issue Dec 1, 2017 · 6 comments

Comments

@twildeboer

I'm proposing a feature addition to chaoskube that would add the ability to suspend the chaos during nights, weekends and holidays using the following command-line options. These are designed to be somewhat consistent with the current pattern of chaoskube options as well as the configuration options for Chaos Monkey. They should be self-explanatory:

--observe-off-times true # defaults to false
--location 'America/New_York'        # or 'UTC'. Req'd if observe-off-times is true
--offdays 'Saturday, Sunday'         # default
--workhours 'start=09:00, end=17:00' # default
--holidays '2017-12-25, 2018-01-01'  # defaults to empty list

The options above imply that both --observe-off-times true and --location '...' must be present for the feature to take effect. There is purposefully no default location so the user is forced to provide this, since most SRE staff is probably not working in the GMT timezone, so defaulting to UTC would not really make sense in this case.

Note that this requires an IANA Time Zone as opposed to a three-letter timezone abbreviation such as 'EDT' or 'EST', which would have to change with Daylight Saving conventions. Daylight Saving is automatically accounted for by using the IANA Time Zones.

I intend to post a PR as soon as I have this implemented, but wanted to get some feedback in case I'm missing something.

@klautcomputing
Contributor

I did some of the things you proposed in my PR already. If you want to extend it with the things I don't have, that'd be the easiest for you.

@linki
Owner

linki commented Dec 1, 2017

@twildeboer @klautcomputing I'll try to look at the PR again over the weekend. At first sight, the way to specify the time frame as well as the implementation seemed quite complicated to me.

@klautcomputing would you think defining the range similar to https://github.com/hjacobs/kube-downscaler#configuration would simplify usage as well as implementation and still be able to capture your use cases?

e.g.: be active at work time as well as midday on weekends would be:

--active-at "Fri-Fri 10:00-16:00 CET, Sat-Sun 10:00-12:00 CET"

@klautcomputing
Contributor

@linki could you leave a couple of comments on my code where you think my implementation is too complicated?

--active-at "Fri-Fri 10:00-16:00 CET, Sat-Sun 10:00-12:00 CET"

Did you maybe mean Mon-Fri? Because otherwise I don't see how that format is meant to work. If yes, then I think that'd be easily doable. Thinking about it again, we might not want this as a flag for chaoskube but instead as a label in the manifest, which would allow individual teams to specify their own schedule.

This raises the general question of whether we want chaoskube to be purely opt-in. Chaos engineering should not surprise a team; they should have made an active decision to test their systems with chaos. Given that, opt-in might be the right choice, and it would get rid of --percentage in my code and make it a little easier.

@twildeboer
Author

twildeboer commented Dec 1, 2017

Generally speaking, I suggest being careful to resist the temptation to over-engineer features. Rather, design and implement what you know is needed and then see how that goes and whether there is demand for more or something different.

Regarding this feature specifically, speaking only for our own use case, we do not have a need for both detailed "off-time" and "on-time" specifications. Our team has typical work hours and has an on-call rotation for non-working hours. I imagine that would generally describe the majority of the chaoskube users. Since chaoskube is (from our perspective) intended to be run as an ongoing stability test, all we care about is being able to limit which services are impacted, and not making on-call life harder on anyone unnecessarily. You may notice that chaosmonkey does not provide such detailed scheduling, AFAIK. If someone wants to run chaoskube on the weekend, they can just deploy another instance of it to do whatever they want. The scheduling will never be perfect anyway, since the holidays will need to be updated from time to time, at least. Finally, we view chaoskube as a tool that gives us confidence in the resilience of our systems, but it is not critical to our infrastructure and does not need precise scheduling capabilities.

Another reason to avoid precise scheduling capability is that it is significantly more difficult to implement correctly. You will have to include all kinds of logic to handle periods that span midnight and Daylight Saving jumps. And you will have to try to find a way to support such configuration that is not confusing. People will get confused about what their configuration really means, no matter how carefully you write your documentation, and then you will get all kinds of bug reports that are actually user-error or user misunderstanding.
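As one concrete instance of the midnight problem, here is a minimal Go sketch of a time-of-day window check. It assumes times are expressed as minutes since local midnight and that a window whose start is later than its end is meant to wrap past midnight; `inWindow` is a hypothetical helper, and Daylight Saving jumps are deliberately not handled.

```go
package main

import "fmt"

// inWindow reports whether minute-of-day m lies in the half-open window
// [start, end). If start > end, the window is taken to wrap past midnight.
func inWindow(m, start, end int) bool {
	if start <= end {
		return m >= start && m < end
	}
	return m >= start || m < end
}

func main() {
	const night = 22 * 60  // 22:00
	const morning = 6 * 60 // 06:00
	fmt.Println(inWindow(23*60, night, morning)) // 23:00 against a 22:00-06:00 window
	fmt.Println(inWindow(12*60, night, morning)) // noon against the same window
}
```

Even this small case forces a design decision (is 22:00-06:00 a wrapped window or a configuration error?), which is the kind of ambiguity that tends to surface as user confusion.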

You could, perhaps, if the need was shown to be significant, add the ability to override each global "off-time" attribute with service-specific ones through annotations. But I would wait and see if this is a real need, because it adds complexity. Our team does not need this.
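If that need ever materialised, the override could look roughly like the Go sketch below. The annotation key `example.com/offdays` and the function `effectiveOffDays` are invented for illustration only; nothing here reflects an existing chaoskube annotation.

```go
package main

import (
	"fmt"
	"strings"
)

// offDaysAnnotation is a hypothetical annotation key, invented for this sketch.
const offDaysAnnotation = "example.com/offdays"

// effectiveOffDays returns the service-specific off days when the annotation
// is present, and falls back to the global setting otherwise.
func effectiveOffDays(global []string, annotations map[string]string) []string {
	raw, ok := annotations[offDaysAnnotation]
	if !ok {
		return global
	}
	var days []string
	for _, d := range strings.Split(raw, ",") {
		if d = strings.TrimSpace(d); d != "" {
			days = append(days, d)
		}
	}
	return days
}

func main() {
	global := []string{"Saturday", "Sunday"}
	fmt.Println(effectiveOffDays(global, nil))
	fmt.Println(effectiveOffDays(global, map[string]string{
		offDaysAnnotation: "Friday, Saturday, Sunday",
	}))
}
```

Every attribute that can be overridden this way needs its own parsing, validation and documentation, which is the added complexity the comment above warns about.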

@twildeboer
Author

@linki - PR for this feature waiting for you.

@linki
Owner

linki commented Apr 18, 2018

@klautcomputing @twildeboer Thank you for all your input.

The above feature is part of v0.8.0, so I'm going to close this issue. I think we found a fairly easy way to configure it, although the equivalent of --workhours is defined as excluding rather than including, similar to --offdays and --holidays.

I also think that at some point some configuration should be overridable by annotations or moved entirely to annotations, e.g. for users defining a mean-time-to-failure on a per-pod basis and independent of the cluster size (the "percentage" feature, #20).

@linki linki closed this as completed Apr 18, 2018