Skip to content
New issue

Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? # to your account

[BUG] gc fails with postgres backend, violates foreign key constraint #13254

Open
2 of 23 tasks
gmertes opened this issue Sep 26, 2024 · 10 comments
Open
2 of 23 tasks

[BUG] gc fails with postgres backend, violates foreign key constraint #13254

gmertes opened this issue Sep 26, 2024 · 10 comments
Labels
area/server-infra MLflow Tracking server backend bug Something isn't working

Comments

@gmertes
Copy link

gmertes commented Sep 26, 2024

Issues Policy acknowledgement

  • I have read and agree to submit bug reports in accordance with the issues policy

Where did you encounter this bug?

Local machine

Willingness to contribute

Yes. I would be willing to contribute a fix for this bug with guidance from the MLflow community.

MLflow version

  • Client: 2.16.2
  • Tracking server: 2.16.2

System information

  • OS Platform and Distribution: Red Hat Enterprise Linux 9.4
  • Python version: 3.9.18
  • PostgreSQL 16.2 on x86_64-pc-linux-gnu

Describe the problem

mlflow gc fails with a postgres backend.

This is the same issue as in #2542 and #6127. The latter was able to solve it by updating postgres, but that did not solve it for me (I'm on the latest version 16).

There seem to be 2 bugs:

  • The run fails to be deleted because of foreign key constraints
  • The --run-ids parameter seems to be ignored, with gc trying to delete another deleted run. As you can see in the error trace, the run_uuid in the log is not the same as the one specified in --run-ids

Tracking information

No response

Code to reproduce issue

MLFLOW_TRACKING_URI=https://*** /opt/venv/mlflow_test/bin/mlflow gc --backend-store-uri postgresql://*** --artifacts-destination s3://*** --run-ids 881f4c0ef58144ae874ec2877e899439

Stack trace

Note how the run_uuid in this log is different from the one specified in --run-ids

run_uuid 8921db49dff849a9b1419f0bf1d7d70d
--run-ids 881f4c0ef58144ae874ec2877e899439

Both of these do exist in the backend, as individual runs (they are not forks or child runs). But gc seems to be trying to delete the wrong one.

/opt/venv/mlflow_test/lib/python3.9/site-packages/mlflow/store/db/utils.py:148: SAWarning: DELETE statement on table 'metrics' expected to delete 42 row(s); 18 were matched.  Please set confirm_deleted_rows=False within the mapper configuration to prevent this warning.
  session.commit()
Traceback (most recent call last):
  File "/opt/venv/mlflow_test/lib64/python3.9/site-packages/sqlalchemy/engine/base.py", line 1967, in _exec_single_context
    self.dialect.do_execute(
  File "/opt/venv/mlflow_test/lib64/python3.9/site-packages/sqlalchemy/engine/default.py", line 941, in do_execute
    cursor.execute(statement, parameters)
psycopg2.errors.ForeignKeyViolation: update or delete on table "runs" violates foreign key constraint "metrics_run_uuid_fkey" on table "metrics"
DETAIL:  Key (run_uuid)=(8921db49dff849a9b1419f0bf1d7d70d) is still referenced from table "metrics".


The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/opt/venv/mlflow_test/lib/python3.9/site-packages/mlflow/store/db/utils.py", line 148, in make_managed_session
    session.commit()
  File "/opt/venv/mlflow_test/lib64/python3.9/site-packages/sqlalchemy/orm/session.py", line 2028, in commit
    trans.commit(_to_root=True)
  File "<string>", line 2, in commit
  File "/opt/venv/mlflow_test/lib64/python3.9/site-packages/sqlalchemy/orm/state_changes.py", line 139, in _go
    ret_value = fn(self, *arg, **kw)
  File "/opt/venv/mlflow_test/lib64/python3.9/site-packages/sqlalchemy/orm/session.py", line 1313, in commit
    self._prepare_impl()
  File "<string>", line 2, in _prepare_impl
  File "/opt/venv/mlflow_test/lib64/python3.9/site-packages/sqlalchemy/orm/state_changes.py", line 139, in _go
    ret_value = fn(self, *arg, **kw)
  File "/opt/venv/mlflow_test/lib64/python3.9/site-packages/sqlalchemy/orm/session.py", line 1288, in _prepare_impl
    self.session.flush()
  File "/opt/venv/mlflow_test/lib64/python3.9/site-packages/sqlalchemy/orm/session.py", line 4352, in flush
    self._flush(objects)
  File "/opt/venv/mlflow_test/lib64/python3.9/site-packages/sqlalchemy/orm/session.py", line 4488, in _flush
    transaction.rollback(_capture_exception=True)
  File "/opt/venv/mlflow_test/lib64/python3.9/site-packages/sqlalchemy/util/langhelpers.py", line 146, in __exit__
    raise exc_value.with_traceback(exc_tb)
  File "/opt/venv/mlflow_test/lib64/python3.9/site-packages/sqlalchemy/orm/session.py", line 4448, in _flush
    flush_context.execute()
  File "/opt/venv/mlflow_test/lib64/python3.9/site-packages/sqlalchemy/orm/unitofwork.py", line 466, in execute
    rec.execute(self)
  File "/opt/venv/mlflow_test/lib64/python3.9/site-packages/sqlalchemy/orm/unitofwork.py", line 679, in execute
    util.preloaded.orm_persistence.delete_obj(
  File "/opt/venv/mlflow_test/lib64/python3.9/site-packages/sqlalchemy/orm/persistence.py", line 193, in delete_obj
    _emit_delete_statements(
  File "/opt/venv/mlflow_test/lib64/python3.9/site-packages/sqlalchemy/orm/persistence.py", line 1465, in _emit_delete_statements
    c = connection.execute(
  File "/opt/venv/mlflow_test/lib64/python3.9/site-packages/sqlalchemy/engine/base.py", line 1418, in execute
    return meth(
  File "/opt/venv/mlflow_test/lib64/python3.9/site-packages/sqlalchemy/sql/elements.py", line 515, in _execute_on_connection
    return connection._execute_clauseelement(
  File "/opt/venv/mlflow_test/lib64/python3.9/site-packages/sqlalchemy/engine/base.py", line 1640, in _execute_clauseelement
    ret = self._execute_context(
  File "/opt/venv/mlflow_test/lib64/python3.9/site-packages/sqlalchemy/engine/base.py", line 1846, in _execute_context
    return self._exec_single_context(
  File "/opt/venv/mlflow_test/lib64/python3.9/site-packages/sqlalchemy/engine/base.py", line 1986, in _exec_single_context
    self._handle_dbapi_exception(
  File "/opt/venv/mlflow_test/lib64/python3.9/site-packages/sqlalchemy/engine/base.py", line 2355, in _handle_dbapi_exception
    raise sqlalchemy_exception.with_traceback(exc_info[2]) from e
  File "/opt/venv/mlflow_test/lib64/python3.9/site-packages/sqlalchemy/engine/base.py", line 1967, in _exec_single_context
    self.dialect.do_execute(
  File "/opt/venv/mlflow_test/lib64/python3.9/site-packages/sqlalchemy/engine/default.py", line 941, in do_execute
    cursor.execute(statement, parameters)
sqlalchemy.exc.IntegrityError: (psycopg2.errors.ForeignKeyViolation) update or delete on table "runs" violates foreign key constraint "metrics_run_uuid_fkey" on table "metrics"
DETAIL:  Key (run_uuid)=(8921db49dff849a9b1419f0bf1d7d70d) is still referenced from table "metrics".

[SQL: DELETE FROM runs WHERE runs.run_uuid = %(run_uuid)s]
[parameters: {'run_uuid': '8921db49dff849a9b1419f0bf1d7d70d'}]
(Background on this error at: https://sqlalche.me/e/20/gkpj)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/venv/mlflow_test/bin/mlflow", line 8, in <module>
    sys.exit(cli())
  File "/opt/venv/mlflow_test/lib/python3.9/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
  File "/opt/venv/mlflow_test/lib/python3.9/site-packages/click/core.py", line 1078, in main
    rv = self.invoke(ctx)
  File "/opt/venv/mlflow_test/lib/python3.9/site-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/opt/venv/mlflow_test/lib/python3.9/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/opt/venv/mlflow_test/lib/python3.9/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "/opt/venv/mlflow_test/lib/python3.9/site-packages/mlflow/cli.py", line 637, in gc
    backend_store._hard_delete_run(run_id)
  File "/opt/venv/mlflow_test/lib/python3.9/site-packages/mlflow/store/tracking/sqlalchemy_store.py", line 687, in _hard_delete_run
    session.delete(run)
  File "/usr/lib64/python3.9/contextlib.py", line 126, in __exit__
    next(self.gen)
  File "/opt/venv/mlflow_test/lib/python3.9/site-packages/mlflow/store/db/utils.py", line 161, in make_managed_session
    raise MlflowException(message=e, error_code=BAD_REQUEST)
mlflow.exceptions.MlflowException: (psycopg2.errors.ForeignKeyViolation) update or delete on table "runs" violates foreign key constraint "metrics_run_uuid_fkey" on table "metrics"
DETAIL:  Key (run_uuid)=(8921db49dff849a9b1419f0bf1d7d70d) is still referenced from table "metrics".

[SQL: DELETE FROM runs WHERE runs.run_uuid = %(run_uuid)s]
[parameters: {'run_uuid': '8921db49dff849a9b1419f0bf1d7d70d'}]
(Background on this error at: https://sqlalche.me/e/20/gkpj)

Other info / logs

No response

What component(s) does this bug affect?

  • area/artifacts: Artifact stores and artifact logging
  • area/build: Build and test infrastructure for MLflow
  • area/deployments: MLflow Deployments client APIs, server, and third-party Deployments integrations
  • area/docs: MLflow documentation pages
  • area/examples: Example code
  • area/model-registry: Model Registry service, APIs, and the fluent client calls for Model Registry
  • area/models: MLmodel format, model serialization/deserialization, flavors
  • area/recipes: Recipes, Recipe APIs, Recipe configs, Recipe Templates
  • area/projects: MLproject format, project running backends
  • area/scoring: MLflow Model server, model deployment tools, Spark UDFs
  • area/server-infra: MLflow Tracking server backend
  • area/tracking: Tracking Service, tracking client APIs, autologging

What interface(s) does this bug affect?

  • area/uiux: Front-end, user experience, plotting, JavaScript, JavaScript dev server
  • area/docker: Docker use across MLflow's components, such as MLflow Projects and MLflow Models
  • area/sqlalchemy: Use of SQLAlchemy in the Tracking Service or Model Registry
  • area/windows: Windows support

What language(s) does this bug affect?

  • language/r: R APIs and clients
  • language/java: Java APIs and clients
  • language/new: Proposals for new client languages

What integration(s) does this bug affect?

  • integrations/azure: Azure and Azure ML integrations
  • integrations/sagemaker: SageMaker integrations
  • integrations/databricks: Databricks integrations
@gmertes gmertes added the bug Something isn't working label Sep 26, 2024
@github-actions github-actions bot added the area/server-infra MLflow Tracking server backend label Sep 26, 2024
@serena-ruan
Copy link
Collaborator

@gmertes Could you try

from mlflow.entities.view_type import ViewType

mlflow.search_experiments(view_type=ViewType.DELETED_ONLY)

And see if there're any experiments? Wondering if the run comes from these experiments

@gmertes
Copy link
Author

gmertes commented Sep 27, 2024

Hi @serena-ruan

There are deleted experiments, and it seems like it is indeed trying to delete a run from one of them:

experiments = mlflow.search_experiments(view_type=ViewType.DELETED_ONLY)

for e in experiments:
    print(e.experiment_id)
7
6
runs = mlflow.search_runs(experiment_ids=[6,7], run_view_type=ViewType.DELETED_ONLY)
print(runs[['run_id', 'experiment_id']])
                             run_id experiment_id
0  00ef598794c547f88495c40ea5d8c948             7
1  ec6950e4d01f48968e367f358065e5bb             7
2  f68c0b8303ee47fa942866353b2f02a7             7
3  0569de9633774229b7add67a85777d87             7
4  8ed2a5327ea1475e96d0813a77f48fc5             7
5  4e72ad3f9a1d403eb7e3df962a6edc15             7
6  8921db49dff849a9b1419f0bf1d7d70d             6  <----- this one

8921db49dff849a9b1419f0bf1d7d70d is the run id from the error trace, and it belongs to experiment id 6.

The run I am trying to run the gc on is 881f4c0ef58144ae874ec2877e899439 , which belongs to another experiment id (5). That one has not been deleted.

@serena-ruan
Copy link
Collaborator

@gmertes Could you try specify --experiment-ids with the experiment id that includes the run you want to delete and see if that can bypass this problem?

@gmertes
Copy link
Author

gmertes commented Sep 30, 2024

Adding the experiment id just gives me this error, since that experiment is not deleted:

MLFLOW_TRACKING_URI=https://*** /opt/venv/mlflow_test/bin/mlflow gc --backend-store-uri postgresql://*** --artifacts-destination s3://*** --run-ids 881f4c0ef58144ae874ec2877e899439 --experiment-ids 5

output:
mlflow.exceptions.MlflowException: Experiments ['5'] are not in the deleted lifecycle stage. Only experiments in the deleted lifecycle stage can be hard-deleted.

@serena-ruan
Copy link
Collaborator

I think you can use mlflow.delete_experiment first to move the experiment to deleted lifecycle stage

@gmertes
Copy link
Author

gmertes commented Oct 1, 2024

But I don't want to delete the whole experiment, just that one run.

In any case, I tried it: deleted experiment 5, then ran the command again. No luck, still the exact same error (mlflow.exceptions.MlflowException: (psycopg2.errors.ForeignKeyViolation) update or delete on table "runs" violates foreign key constraint "metrics_run_uuid_fkey" on table "metrics").

Inspecting the metrics and runs table in my postgres db, there is no cascade set on the run_uuid foreign key. So it seems gc is not properly cleaning up the metrics table first before trying to delete the run?

@serena-ruan
Copy link
Collaborator

Yea that's true, seems like we don't have ondelete="CASCADE" on the ForeignKey setup

run_uuid = Column(String(32), ForeignKey("runs.run_uuid"))

But this would require a db migration change, are you willing to contribute?

@gmertes
Copy link
Author

gmertes commented Oct 3, 2024

I'd be willing to contribute, but someone with knowledge of the code base could probably do it better and faster than me. I'm not sure where to start on the database migration, for example. I see there is a folder with db patches, but unsure how to proceed.

Is it possible that not setting ondelete=cascade was a deliberate choice, and that gc is supposed to manually clean up the metrics first, as a sanity check?

Also there seem to be 2 bugs/issues:

  1. The foreign key constraint
  2. gc throwing an error on a run that was not specified in --run-ids

I had a brief look at the gc command but could not immediately see where number 2 comes from, because gc does filter on --run-ids

Copy link

github-actions bot commented Oct 4, 2024

@mlflow/mlflow-team Please assign a maintainer and start triaging this issue.

@gmertes
Copy link
Author

gmertes commented Oct 9, 2024

Is there someone in the mlflow team who could assist with this? Or give pointers on where to start contributing? Because the gc command is effectively broken for me :(

# for free to join this conversation on GitHub. Already have an account? # to comment
Labels
area/server-infra MLflow Tracking server backend bug Something isn't working
Projects
None yet
Development

No branches or pull requests

2 participants