[BUG] gc fails with postgres backend, violates foreign key constraint #13254
Comments
@gmertes Could you try
And see if there're any experiments? Wondering if the run comes from these experiments |
Hi @serena-ruan There are deleted experiments, and it seems like it is indeed trying to delete a run from one of them:
import mlflow
from mlflow.entities import ViewType

# List the IDs of all deleted experiments
experiments = mlflow.search_experiments(view_type=ViewType.DELETED_ONLY)
for e in experiments:
    print(e.experiment_id)
# List the deleted runs in experiments 6 and 7
runs = mlflow.search_runs(experiment_ids=[6, 7], run_view_type=ViewType.DELETED_ONLY)
print(runs[['run_id', 'experiment_id']])
The run I am trying to run the gc on is |
@gmertes Could you try specify |
Adding the experiment id just gives me this error, since that experiment is not deleted:
output: |
I think you can use |
But I don't want to delete the whole experiment, just that one run. In any case, I tried it: deleted experiment 5, then ran the command again. No luck, still the exact same error ( Inspecting the metrics and runs table in my postgres db, there is no cascade set on the run_uuid foreign key. So it seems |
Yeah, that's true. It seems we don't have ondelete="CASCADE" on the ForeignKey setup.
But fixing this would require a db migration change. Are you willing to contribute? |
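To illustrate the behaviour being discussed, here is a minimal sketch of how a foreign key without ON DELETE CASCADE blocks deletion of a parent row, and how adding the clause changes that. It uses Python's built-in sqlite3 module rather than postgres, and the two-table schema (runs, metrics) is a simplified assumption, not MLflow's actual schema or migration code.

```python
import sqlite3


def delete_run(cascade: bool) -> str:
    """Try to delete a run that still has metric rows pointing at it."""
    conn = sqlite3.connect(":memory:")
    try:
        # SQLite only enforces foreign keys when this pragma is enabled.
        conn.execute("PRAGMA foreign_keys = ON")
        conn.execute("CREATE TABLE runs (run_uuid TEXT PRIMARY KEY)")
        # Simplified, hypothetical schema: one child table referencing runs.
        on_delete = "ON DELETE CASCADE" if cascade else ""
        conn.execute(
            "CREATE TABLE metrics ("
            " key TEXT, value REAL,"
            f" run_uuid TEXT REFERENCES runs(run_uuid) {on_delete})"
        )
        conn.execute("INSERT INTO runs VALUES ('run-a')")
        conn.execute("INSERT INTO metrics VALUES ('loss', 0.5, 'run-a')")
        try:
            conn.execute("DELETE FROM runs WHERE run_uuid = 'run-a'")
        except sqlite3.IntegrityError as exc:
            # Without a cascade, the child rows block the delete.
            return f"failed: {exc}"
        remaining = conn.execute("SELECT COUNT(*) FROM metrics").fetchone()[0]
        return f"deleted, {remaining} metric rows left"
    finally:
        conn.close()


print(delete_run(cascade=False))
print(delete_run(cascade=True))
```

In this sketch the first call fails with an integrity error while the second removes the run together with its metric rows. Postgres behaves analogously, although the exact error text differs.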
I'd be willing to contribute, but someone with knowledge of the code base could probably do it better and faster than me. I'm not sure where to start on the database migration, for example. I see there is a folder with db patches, but unsure how to proceed. Is it possible that not setting ondelete=cascade was a deliberate choice, and that gc is supposed to manually clean up the metrics first, as a sanity check? Also there seem to be 2 bugs/issues:
I had a brief look at the gc command but could not immediately see where number 2 comes from, because gc does filter on --run-ids |
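On the question of whether gc is meant to clean up the metrics manually before deleting the run: the alternative to a cascading foreign key is to delete the child rows first and the parent row last, inside one transaction. The following is only a sketch of that pattern, again using sqlite3. The names CHILD_TABLES, hard_delete_run and make_demo_db are hypothetical, and nothing here is taken from the actual gc implementation.

```python
import sqlite3

# Hypothetical, simplified list of tables that reference runs.run_uuid.
CHILD_TABLES = ("metrics", "params", "tags")


def hard_delete_run(conn: sqlite3.Connection, run_uuid: str) -> None:
    """Delete one run: child rows first, then the run itself."""
    with conn:  # commits on success, rolls back if any statement fails
        for table in CHILD_TABLES:
            conn.execute(f"DELETE FROM {table} WHERE run_uuid = ?", (run_uuid,))
        conn.execute("DELETE FROM runs WHERE run_uuid = ?", (run_uuid,))


def make_demo_db() -> sqlite3.Connection:
    """Build an in-memory database with two runs that each have a metric."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")
    conn.execute("CREATE TABLE runs (run_uuid TEXT PRIMARY KEY)")
    for table in CHILD_TABLES:
        # No ON DELETE CASCADE, mirroring the situation described above.
        conn.execute(
            f"CREATE TABLE {table} ("
            " key TEXT, value TEXT,"
            " run_uuid TEXT REFERENCES runs(run_uuid))"
        )
    conn.executemany("INSERT INTO runs VALUES (?)", [("run-a",), ("run-b",)])
    conn.executemany(
        "INSERT INTO metrics VALUES (?, ?, ?)",
        [("loss", "0.5", "run-a"), ("loss", "0.7", "run-b")],
    )
    conn.commit()
    return conn


conn = make_demo_db()
hard_delete_run(conn, "run-a")
print(conn.execute("SELECT run_uuid FROM runs").fetchall())
```

Because the WHERE clause is keyed on the requested run_uuid, only that run and its child rows are touched. The other run is left in place, which is the behaviour one would expect from --run-ids.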
@mlflow/mlflow-team Please assign a maintainer and start triaging this issue. |
Is there someone in the mlflow team who could assist with this? Or give pointers on where to start contributing? Because the |
Issues Policy acknowledgement
Where did you encounter this bug?
Local machine
Willingness to contribute
Yes. I would be willing to contribute a fix for this bug with guidance from the MLflow community.
MLflow version
System information
Describe the problem
mlflow gc fails with a postgres backend. This is the same issue as in #2542 and #6127. The latter was able to solve it by updating postgres, but that did not solve it for me (I'm on the latest version 16).
There seem to be 2 bugs:
The --run-ids parameter seems to be ignored, with gc trying to delete another deleted run. As you can see in the error trace, the run_uuid in the log is not the same as the one specified in --run-ids.
Tracking information
No response
Code to reproduce issue
MLFLOW_TRACKING_URI=https://*** /opt/venv/mlflow_test/bin/mlflow gc --backend-store-uri postgresql://*** --artifacts-destination s3://*** --run-ids 881f4c0ef58144ae874ec2877e899439
Stack trace
Note how the run_uuid in this log is different from the one specified in --run-ids
run_uuid 8921db49dff849a9b1419f0bf1d7d70d
--run-ids 881f4c0ef58144ae874ec2877e899439
Both of these do exist in the backend, as individual runs (they are not forks or child runs). But gc seems to be trying to delete the wrong one.
Other info / logs
No response
What component(s) does this bug affect?
area/artifacts: Artifact stores and artifact logging
area/build: Build and test infrastructure for MLflow
area/deployments: MLflow Deployments client APIs, server, and third-party Deployments integrations
area/docs: MLflow documentation pages
area/examples: Example code
area/model-registry: Model Registry service, APIs, and the fluent client calls for Model Registry
area/models: MLmodel format, model serialization/deserialization, flavors
area/recipes: Recipes, Recipe APIs, Recipe configs, Recipe Templates
area/projects: MLproject format, project running backends
area/scoring: MLflow Model server, model deployment tools, Spark UDFs
area/server-infra: MLflow Tracking server backend
area/tracking: Tracking Service, tracking client APIs, autologging
What interface(s) does this bug affect?
area/uiux: Front-end, user experience, plotting, JavaScript, JavaScript dev server
area/docker: Docker use across MLflow's components, such as MLflow Projects and MLflow Models
area/sqlalchemy: Use of SQLAlchemy in the Tracking Service or Model Registry
area/windows: Windows support
What language(s) does this bug affect?
language/r: R APIs and clients
language/java: Java APIs and clients
language/new: Proposals for new client languages
What integration(s) does this bug affect?
integrations/azure: Azure and Azure ML integrations
integrations/sagemaker: SageMaker integrations
integrations/databricks: Databricks integrations