Experiment stuck due to hitting Suggestion custom resource size limits #1847
Thanks for creating this issue. Can you provide more info about the experiment YAML and other relevant information for reproducibility? MaxTrials - 14500
Yes, please see below for the experiment:

apiVersion: kubeflow.org/v1beta1
kind: Experiment
metadata:
  name: debug
spec:
  objective:
    type: maximize
    goal: 500
    objectiveMetricName: cost
  algorithm:
    algorithmName: grid
  parallelTrialCount: 20
  maxTrialCount: 14641
  maxFailedTrialCount: 2000
  parameters:
    - name: a
      parameterType: double
      feasibleSpace:
        min: "1.30"
        max: "1.41"
        step: "0.01"
    - name: b
      parameterType: categorical
      feasibleSpace:
        list: ["0.0010", "0.0025", "0.0063", "0.0158", "0.0398", "0.1000", "0.2512", "0.6310", "1.5849", "3.9811", "10.0000"]
    - name: c
      parameterType: double
      feasibleSpace:
        min: "1.30"
        max: "1.41"
        step: "0.01"
    - name: d
      parameterType: categorical
      feasibleSpace:
        list: ["0.0010", "0.0025", "0.0063", "0.0158", "0.0398", "0.1000", "0.2512", "0.6310", "1.5849", "3.9811", "10.0000"]
  trialTemplate:
    retain: false
    primaryContainerName: training-container
    trialParameters:
      - name: a
        reference: a
        description: ""
      - name: b
        reference: b
        description: ""
      - name: c
        reference: c
        description: ""
      - name: d
        reference: d
        description: ""
    trialSpec:
      apiVersion: batch/v1
      kind: Job
      spec:
        template:
          spec:
            containers:
              - name: training-container
                image: docker.io/python:alpine3.15
                volumeMounts:
                  - name: script
                    mountPath: /app/run.py
                    subPath: run.py
                command:
                  - "python3"
                  - "/app/run.py"
                  - "${trialParameters.a}"
                  - "${trialParameters.b}"
                  - "${trialParameters.c}"
                  - "${trialParameters.d}"
            restartPolicy: Never
            volumes:
              - name: script
                configMap:
                  name: script

The mock implementation:

import sys
import time
time.sleep(4)
cost = sum([float(x) for x in sys.argv[1:]])
print(f"cost={cost}") EDIT: and to provide you an idea of the error messages arising from the {"level":"info","ts":1649925772.8911624,"logger":"suggestion-controller","msg":"Sync assignments","Suggestion":"simulation/simulation-nr-fb","Suggestion Requests":8165,"Suggestion Count":8143}
{"level":"info","ts":1649925774.5568578,"logger":"suggestion-client","msg":"Getting suggestions","Suggestion":"simulation/simulation-nr-fb","endpoint":"simulation-nr-fb-grid.simulation:6789","Number of current request parameters":22,"Number of response parameters":22}
{"level":"info","ts":1649925775.6414711,"logger":"suggestion-controller","msg":"Update suggestion instance status failed, reconciler requeued","Suggestion":"simulation/simulation-nr-fb","err":"rpc error: code = ResourceExhausted desc = trying to send message larger than max (2100613 vs. 2097152)"} |
I have the same problem running Sobol suggestions. You basically just need enough trials and it will collapse on you. The first failure is the etcd request size being too large, which you can "fix" by increasing the etcd max request size, but then you run into the issue in this thread: we are hitting CRD size limits. I don't think there is a way to work around this.
Hi @robertzsun-dev, could it be a limitation of the Goptuna library that we use for Sobol? Also, related issue: #1058.
@andreyvelich I think the issue is due to the Katib architecture:
We can see that new suggestions are simply appended to the Suggestion CR status, and I don't think it is K8s-kosher to just keep adding data to this resource indefinitely.
Regarding the related issue #1058: I don't think it is directly related, but it is indirectly related in that Katib is not really built for running mass-scale experiments. Both the fact that Suggestions/Trials are stored as etcd-backed resources and the other linked issue, which prevents long-running suggestions from working (necessary because computing the next suggestion over thousands of trials takes a while), keep users from achieving scale with Katib.
I think @robertzsun-dev is correct in his assessment. At work we moved to a custom (non-Katib) system for performing large-scale (> 20k) experiments, which does not rely on the Kubernetes custom resource model.
Thanks for the information @robertzsun-dev. Yes, you are right: the etcd default request size is 1.5 MiB, which makes it impossible to store large chunks of data in a custom resource. I understand that ordinary HP tuning Experiments might not require 10,000 Trials, but for some cases it can be useful. Since Katib lets you use its optimization algorithms for any type of task (as long as the Trial is set up), we can find a work-around for it. @robertzsun-dev @nielsmeima could you please describe your use case where you need to run Experiments with more than 10,000 Trials? As a solution, we could store such information in the Katib DB instead of the Suggestion CR or Experiment CR.
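To put the 1.5 MiB default in perspective, a back-of-envelope estimate (the per-entry byte count below is an assumption for illustration, not a measured value) already puts this experiment's suggestion list over the limit:

# Back-of-envelope estimate of the Suggestion status size (illustrative only).
# Assume each stored suggestion serializes to roughly 150 bytes
# (a trial name plus four parameter name/value pairs).
bytes_per_suggestion = 150
num_suggestions = 14641                       # the grid size from the experiment above
total_bytes = bytes_per_suggestion * num_suggestions
etcd_default_limit = int(1.5 * 1024 * 1024)   # default etcd request size (~1.5 MiB)
print(total_bytes, total_bytes > etcd_default_limit)   # ~2.2 MB -> exceeds the limit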
I would propose we set a limit on the number of suggestions stored in these fields:
katib/pkg/apis/controller/suggestions/v1beta1/suggestion_types.go, lines 46 to 49 at b9dc63e
katib/pkg/apis/controller/suggestions/v1beta1/suggestion_types.go, lines 54 to 55 at b9dc63e
wdyt?
I think storing such information in a ConfigMap might also be a problem, since the limit is 1 MiB: https://kubernetes.io/docs/concepts/configuration/configmap/#motivation That is why I suggested storing that info in the Katib DB if possible.
Yes, that's right.
I see. @tenzen-y do you have any objections to storing that info in the MySQL/Postgres DB?
Speaking from the perspective of an on-prem cluster administrator: currently, even if the katib-db crashes, it is easy to check the results of experiments since the CRs hold the experiment results in etcd. Storing the experiment results in the katib-db increases the importance of the katib-db, and I would rather not increase the number of storage systems we critically depend on.
My use case is heuristic-driven learning and other types of probabilistic models. Training or learning complex behaviors from scratch (for me, in robotics) is not really feasible, or takes a long time, so we introduce a set of heuristics to inform the algorithms. We can do this in multi-step processes: heuristics-only approaches first, then training with heuristics, and so on. Generally, evaluating the heuristics takes a short amount of time, maybe 5-20 minutes, and we want to run a hyperparameter search over this heuristic space. One example: say we will do something if the distance to an object is < X meters; what should X be? There might be many such heuristics, each with its own < Y, > Z, etc. Together, the combination of heuristics and hyperparameters chosen can greatly affect the quality of the algorithm. We can even search over how to weigh these heuristics against each other. So we use Katib and mass-scale hyperparameter search to close in on a good set of hyperparameters; we can even do a sort of distributed coordinate descent by first searching the X, Y, Z values above, then searching the weights, then going back to X, Y, Z, and so on. Once we arrive at a good set of heuristics and weights, we can do some more learning on top. The possibilities are endless. This is just one of many use cases, but it highlights the value of mass-scale hyperparameter search.
I tend to think a traditional DB is really the only way to do this properly. What makes etcd more robust or better for this application? The fact that it is HA? Etcd was chosen by K8s for its great consistency properties and its ability to do leader election very well. Is that necessarily a need for Katib? If we are worried about DB crashes or loss of data with a centralized DB, we could use Redis HA with AOF.
Thanks for the explanation @robertzsun-dev. I've added this item for discussion at one of the upcoming Katib Community Meetings: https://docs.google.com/document/d/1MChKfzrKAeFRtYqypFbMXL6ZIc_OgijjkvbqmwRV-64/edit#heading=h.vqsljon7kcug Also, we should consider adding this problem to one of our ROADMAP items for 2023: #2153. It sounds like a blocker for using Katib at scale.
I just realized the issue I posted here isn't only related to the max etcd request size. Once you increase that limit within etcd, you get a gRPC max message size error; the katib-controller error is the ResourceExhausted one shown above (trying to send message larger than max (2100613 vs. 2097152)). So even if the etcd size is fine, the gRPC message size is not big enough either.
Is there a way to provide different configs for the client here via the Katib ConfigMap? Maybe we could pass in a larger gRPC max message size.
Yes, we also noticed during our discussion that we have some limitations on the gRPC side. I think we can specify the max message size for the gRPC client here:
Note that the Suggestion servers should also be able to receive such big messages (we need to decide how to pass such settings to the Suggestions, e.g. via env vars and the Katib Config):
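The code those two references point to is elided above. As a rough illustration of the knobs involved, here is a generic Python gRPC sketch (not Katib's actual Go controller code; the endpoint name and the 10 MiB value are assumptions for illustration): both the caller and the Suggestion server need their message-size options raised, otherwise the server rejects large requests with ResourceExhausted as in the logs above.

from concurrent import futures

import grpc

MAX_MESSAGE_BYTES = 10 * 1024 * 1024  # assumed 10 MiB limit, for illustration only

# Client side: raise the send/receive limits on the channel used to call
# the Suggestion service (hypothetical endpoint name).
channel = grpc.insecure_channel(
    "example-suggestion-grid.example-namespace:6789",
    options=[
        ("grpc.max_send_message_length", MAX_MESSAGE_BYTES),
        ("grpc.max_receive_message_length", MAX_MESSAGE_BYTES),
    ],
)

# Server side: the Suggestion service must accept equally large messages.
server = grpc.server(
    futures.ThreadPoolExecutor(max_workers=4),
    options=[
        ("grpc.max_send_message_length", MAX_MESSAGE_BYTES),
        ("grpc.max_receive_message_length", MAX_MESSAGE_BYTES),
    ],
)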
Maybe we can add such a feature to the Katib Config once we redesign the UX for it: #2150 cc @tenzen-y. But we should also decide whether increasing the etcd and gRPC limits is the correct approach for large-scale Experiments.
Uhm...
@tenzen-y CRD resources can be freed by moving items to a DB (as a backup), but putting DBs in the active control path is not a good idea. It is difficult to keep the data consistent between etcd and the DB.
Increasing the etcd and gRPC limits is definitely not the right solution here. They are short-term fixes that I hoped were already available to bypass the issue, but if you have to write extra code to get this through, I'd recommend against it. Do you think I can help in any way? I could start by helping to write this document, or we can meet to "pair program" it. I can also join the roadmap or architecture meetings to give advice/opinions/reviews, and I can ask the architects at my company what kind of architecture might work well for the suggestion service problem.
@johnugeorge Yes, that's right. IIRC, we considered this issue back in the older alpha API days.
Yes, that's right. However, I think we can calculate the worst case and limit the amount the controller can write to CRs and ConfigMaps. In k/k, we do that when designing APIs.
I asked around and it seems like the best architecture would be backing suggestions with a database instead of the CRD resource. The Suggestion service still has to be a separately deployed service, since you might have "unlimited" experiments running at the same time and you don't want the katib-controller to use up unbounded resources, so it is still good to separate the two. However, which suggestions have been used and which suggestions map to which trials should be backed by a database. That way the controller and the suggestion service remain stateless and are robust to restarts. We could do pagination-style messaging if we want the controller and suggestions to iterate in lockstep, or just point to the indexes or some other column identifier in the request.
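A minimal sketch of the pagination idea in that comment, using SQLite and hypothetical table/column names (an illustration only; a real design would live behind the Katib DB manager):

import sqlite3

# Hypothetical schema: one row per generated suggestion, keyed by experiment.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE suggestions ("
    " id INTEGER PRIMARY KEY,"
    " experiment TEXT,"
    " trial_name TEXT,"        # NULL until a trial consumes the suggestion
    " assignments TEXT)"       # JSON-encoded parameter assignments
)

def fetch_unassigned(experiment, after_id, page_size=100):
    """Return the next page of suggestions not yet mapped to a trial."""
    return conn.execute(
        "SELECT id, assignments FROM suggestions "
        "WHERE experiment = ? AND trial_name IS NULL AND id > ? "
        "ORDER BY id LIMIT ?",
        (experiment, after_id, page_size),
    ).fetchall()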
Does "around" mean this issue? Or other places? If that means other places, can you share that? I'm interested in that.
IIUC, currently the katib-controller saves the information to Suggestion resources, and Suggestion resources aren't automatically removed even once the Experiment is completed. So I believe we can keep writing the information to the Suggestion resource and then, if the number of suggestion entries reaches the limit, back up the suggestion maps to a ConfigMap and flush the Suggestion status. Is your concern the case where you want to temporarily stop an Experiment (which removes the Experiment)? For that, I think we should introduce cancel semantics (currently not supported) instead of persistently saving to a DB. ref: #934
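A rough sketch of that overflow idea, with hypothetical names and thresholds; it assumes the official kubernetes Python client and simply archives the oldest assignments into a ConfigMap once the in-CR count passes a limit (keeping in mind the ConfigMap itself is also capped at 1 MiB, as noted earlier):

import json

from kubernetes import client, config

MAX_IN_CR = 5000  # assumed threshold for how many assignments stay in the Suggestion CR


def archive_overflow(namespace, suggestion_name, assignments):
    """Keep only the newest assignments in the CR; archive the rest in a ConfigMap."""
    if len(assignments) <= MAX_IN_CR:
        return assignments
    overflow, keep = assignments[:-MAX_IN_CR], assignments[-MAX_IN_CR:]

    config.load_incluster_config()
    core = client.CoreV1Api()
    cm = client.V1ConfigMap(
        metadata=client.V1ObjectMeta(name=f"{suggestion_name}-archive"),  # hypothetical name
        data={"assignments.json": json.dumps(overflow)},
    )
    core.create_namespaced_config_map(namespace, cm)
    return keep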
This means I proposed 2 features:
Haha, I asked my coworkers, who have far more experience writing operators and other distributed systems (with or without K8s) than me, and we quickly arrived at the DB solution as the most scalable. I wasn't really thinking about that specifically; I was mainly thinking about the high-level architecture of Katib, but I may be wrong, so please correct me if so:
I don't see how ConfigMaps can scale for this problem. A database effectively offloads the Katib controller's state storage so it is outside of process memory; it makes it easier to pass this state along to other services; and it makes it easy to paginate, which is critical for a scalable number of suggestions. I'm not incredibly attached to any particular proposal or architecture, but I do think back-and-forth over text will not get the points/analysis across effectively.
Yes, that should work @robertzsun-dev.
That's correct. You can read more about the Suggestion proposal that was introduced by @gaocegege in 2019 here: https://github.com/kubeflow/katib/blob/master/docs/proposals/suggestion.md
I believe it is not always true that the Suggestion services are stateless: some of them have state. For example, we store recorded Trials for the SkOpt Optimize Suggestion, which allows us to tell SkOpt about only the newly created Trials: https://github.com/kubeflow/katib/blob/master/pkg/suggestion/v1beta1/skopt/base_service.py#LL110C43-L110C58. I think we can start a Google doc to collaborate on whether we should choose the DB approach or ConfigMaps, and after that we can convert it into one of the Katib proposals. WDYT @tenzen-y @robertzsun-dev @johnugeorge @nielsmeima? Historically, we've been using the Katib DB to store data (e.g. metrics) that we can't store in etcd, and usually only for the Experiment results.
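A minimal sketch of the kind of per-service state the SkOpt example above refers to (illustrative only; the real logic lives in pkg/suggestion/v1beta1/skopt/base_service.py, and the class and method names here are hypothetical):

class RecordingSuggestionService:
    """Remembers which trials were already reported to the underlying optimizer,
    so each GetSuggestions call only feeds in the newly observed trials."""

    def __init__(self, optimizer):
        self.optimizer = optimizer    # e.g. a skopt.Optimizer instance
        self.recorded_trials = set()  # names of trials already passed to the optimizer

    def register_completed_trials(self, trials):
        # `trials` is assumed to be a list of {"name", "params", "objective"} dicts.
        for trial in trials:
            if trial["name"] in self.recorded_trials:
                continue
            self.optimizer.tell(trial["params"], trial["objective"])
            self.recorded_trials.add(trial["name"])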
Yeah, some of the algorithms we use (in skopt and some others) require state. Personally, I prefer ConfigMaps, but I look forward to the proposal.
Sounds good, happy to contribute to the doc. @gaocegege - what would happen if the suggestion pod crashes or gets preempted and it loses the in-memory state?
@robertzsun-dev Thanks for sharing.
@andreyvelich I'm happy to participate in the discussion on the Google doc, although my bandwidth for Katib is limited since I'm focusing on distributed training and job scheduling this quarter.
Good point. We may need to consider a cleaner architecture for the stable Katib version (v1).
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
/lifecycle frozen
/kind bug
What steps did you take and what happened:
Submitting a large experiment (i.e. one resulting in a large number of trials, in this case ~14500, with 4 hyperparameters and 10/11 values per hyperparameter) causes the Suggestion custom resource to reach the size limit Kubernetes imposes on custom resources, because all suggestions are stored in this resource. The Katib controller then outputs the following error when trying to update the Suggestion custom resource: "Request entity too large", and the experiment is unable to progress. This issue seems to describe the exact problem.
Argo Workflows seems to have encountered the same problem, described here, and solved it by allowing for 1) compression of the data stored in the status field of the custom resource and 2) storage of information under the status field in a relational database, as described here.
What did you expect to happen:
I expected Katib to be able to handle search spaces of arbitrary size.
Anything else you would like to add:
A workaround would be to manually split the experiment into smaller sub-experiments to circumvent the size limits of custom resources. Ideally, this is solved by following an approach similar to the one Argo uses for its Workflow custom resources.
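For reference, the first half of the Argo-style mitigation mentioned above (compressing what is stored in the status field) could look roughly like this sketch; gzip plus base64 is an assumption for illustration, not the exact encoding Argo or Katib would use:

import base64
import gzip
import json

def compress_status(status):
    """Compress a (potentially large) status object into a compact string."""
    raw = json.dumps(status, separators=(",", ":")).encode()
    return base64.b64encode(gzip.compress(raw)).decode()

def decompress_status(blob):
    return json.loads(gzip.decompress(base64.b64decode(blob)))

# Repeated parameter assignments compress very well, so a status that exceeds
# the ~1.5 MiB etcd request limit uncompressed may fit comfortably once packed.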
Impacted by this bug? Give it a 👍 We prioritize the issues with the most 👍