Extend Support for Dependency Management #1512

Open
wants to merge 16 commits into
base: feature/synapse-cred-configuration

Conversation

sundarshankar89
Collaborator

@sundarshankar89 sundarshankar89 commented Apr 3, 2025

Extend support for dependency management when executing a pipeline step from a Python file.

@sundarshankar89 sundarshankar89 marked this pull request as ready for review April 9, 2025 03:57
@sundarshankar89 sundarshankar89 requested a review from a team as a code owner April 9, 2025 03:57

github-actions bot commented Apr 9, 2025

❌ 13/15 passed, 2 failed, 1 skipped, 51s total

❌ test_run_python_dep_failure_pipeline: assert 'Script execution failed' in "Failed to install dependencies: ERROR: Invalid requirement: 'databricks_labs_ucx=0.1.0': Expected end or semicolon (after name and no valid version specifier)\n databricks_labs_ucx=0.1.0\n ^\nHint: = is not a valid operator. Did you mean == ?\n" (13.035s)
assert 'Script execution failed' in "Failed to install dependencies: ERROR: Invalid requirement: 'databricks_labs_ucx=0.1.0': Expected end or semicolon (after name and no valid version specifier)\n    databricks_labs_ucx=0.1.0\n                       ^\nHint: = is not a valid operator. Did you mean == ?\n"
 +  where "Failed to install dependencies: ERROR: Invalid requirement: 'databricks_labs_ucx=0.1.0': Expected end or semicolon (after name and no valid version specifier)\n    databricks_labs_ucx=0.1.0\n                       ^\nHint: = is not a valid operator. Did you mean == ?\n" = StepExecutionResult(step_name='package_status', status=<StepExecutionStatus.ERROR: 'ERROR'>, error_message="Failed to install dependencies: ERROR: Invalid requirement: 'databricks_labs_ucx=0.1.0': Expected end or semicolon (after name and no valid version specifier)\n    databricks_labs_ucx=0.1.0\n                       ^\nHint: = is not a valid operator. Did you mean == ?\n").error_message
[gw3] linux -- Python 3.10.17 /home/runner/work/remorph/remorph/.venv/bin/python
07:38 INFO [databricks.labs.remorph.assessments.pipeline] Creating a virtual environment for Python script execution: $/tmp/tmphdx_6mer/venv
07:38 ERROR [root] Failed to install dependencies: ERROR: Invalid requirement: 'databricks_labs_ucx=0.1.0': Expected end or semicolon (after name and no valid version specifier)
    databricks_labs_ucx=0.1.0
                       ^
Hint: = is not a valid operator. Did you mean == ?
❌ test_run_pipeline: AssertionError: Step usage_2 failed with status SKIPPED (22.897s)
AssertionError: Step usage_2 failed with status SKIPPED
assert <StepExecutionStatus.SKIPPED: 'SKIPPED'> == <StepExecutionStatus.COMPLETE: 'COMPLETE'>
  
  - COMPLETE
  + SKIPPED
[gw0] linux -- Python 3.10.17 /home/runner/work/remorph/remorph/.venv/bin/python
07:38 INFO [databricks.labs.remorph.assessments.pipeline] Creating a virtual environment for Python script execution: $/tmp/tmp8t8998gx/venv

Running from acceptance #570

@@ -60,6 +71,12 @@ def test_run_python_failure_pipeline(extractor, python_failure_config, get_logge
pipeline.execute()


def test_run_python_dep_failure_pipeline(extractor, pipeline_dep_failure_config, get_logger):

I think it makes sense to fail the entire Step if one of the dependencies cannot be installed. Out of curiosity, do you think that this should fail the entire Pipeline execution run too?
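One way to read "fail the entire Step" is sketched below. The helper `install_dependencies` is hypothetical, and `StepExecutionResult` and `StepExecutionStatus` are stand-ins that only mirror the names and fields visible in the CI output; they are not the project's actual definitions. The `run` parameter is an assumption added so the sketch can be exercised without network access.

```python
import subprocess
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Optional, Sequence


class StepExecutionStatus(Enum):  # stand-in for the project's enum
    COMPLETE = "COMPLETE"
    ERROR = "ERROR"


@dataclass
class StepExecutionResult:  # stand-in mirroring the fields seen in the CI log
    step_name: str
    status: StepExecutionStatus
    error_message: Optional[str] = None


def install_dependencies(
    step_name: str,
    venv_python: str,
    dependencies: Sequence[str],
    run: Callable[..., subprocess.CompletedProcess] = subprocess.run,
) -> StepExecutionResult:
    """Install all dependencies with pip; report ERROR if pip exits non-zero."""
    if not dependencies:
        return StepExecutionResult(step_name, StepExecutionStatus.COMPLETE)
    cmd = [venv_python, "-m", "pip", "install", *dependencies]
    try:
        run(cmd, check=True, capture_output=True, text=True)
    except subprocess.CalledProcessError as e:
        # One requirement that cannot be installed fails the whole step.
        return StepExecutionResult(
            step_name,
            StepExecutionStatus.ERROR,
            f"Failed to install dependencies: {e.stderr}",
        )
    return StepExecutionResult(step_name, StepExecutionStatus.COMPLETE)
```

Whether an `ERROR` result should also abort the remaining steps is the open question raised in this comment; the sketch leaves that decision to the caller.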

@@ -26,3 +26,6 @@ steps:
mode: overwrite
frequency: daily
flag: active
dependencies:
- pandas
- duckdb

Maybe we can add a test for a dependency with a version specified as well?
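As a rough illustration of what such a test could cover, the sketch below checks requirement strings with a deliberately simplified, standard-library-only pattern. The function `is_plausible_requirement` is hypothetical and is not a full PEP 508 parser (no extras, markers, or URLs); in practice pip itself or the `packaging` library's `Requirement` class would be the authoritative check.

```python
import re

# Simplified pieces: a distribution name and comma-separated version clauses.
_NAME = r"[A-Za-z0-9](?:[A-Za-z0-9._-]*[A-Za-z0-9])?"
_CLAUSE = r"(?:===|==|~=|!=|<=|>=|<|>)\s*[A-Za-z0-9.*+!_-]+"
_REQUIREMENT = re.compile(
    rf"^\s*{_NAME}\s*(?:{_CLAUSE}(?:\s*,\s*{_CLAUSE})*)?\s*$"
)


def is_plausible_requirement(spec: str) -> bool:
    """Return True if spec looks like 'name' or 'name<op>version[,...]'."""
    return _REQUIREMENT.match(spec) is not None


# Unpinned, pinned, and ranged specifiers are accepted.
assert is_plausible_requirement("pandas")
assert is_plausible_requirement("duckdb==1.2.0")
assert is_plausible_requirement("pandas>=2.0,<3")
# A single '=' is not a valid operator, as the pip error in the CI log shows.
assert not is_plausible_requirement("databricks_labs_ucx=0.1.0")
```

The version numbers used here are placeholders chosen for illustration only.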

Collaborator Author

Let me check.


@goodwillpunning goodwillpunning left a comment

PR looks great! I really like this design a lot. Added a few comments around runtime exceptions and also I think you may have left a debugging statement in. Other than that, I think this PR is ready to ship.

@sundarshankar89 sundarshankar89 changed the title from "Extend Support for Dependency Management and Env Variables" to "Extend Support for Dependency Management" Apr 17, 2025
Contributor

@asnare asnare left a comment

Looking good, I've left a few comments for consideration.

One thing I considered was whether we should skip the virtual environment if there aren't any dependencies, but it turns out that this also avoids a quirk prior to this PR where you don't know exactly what python refers to when executing a step. (And now we do.)
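The point about knowing exactly which interpreter runs a step can be sketched with the standard-library `venv` module. The helper name `create_step_venv` and the `venv` subdirectory layout are assumptions for illustration, not the PR's actual code.

```python
import os
import tempfile
import venv
from pathlib import Path


def create_step_venv(base_dir: Path, with_pip: bool = True) -> Path:
    """Create a virtual environment under base_dir and return its interpreter path."""
    venv_dir = base_dir / "venv"
    venv.create(venv_dir, with_pip=with_pip)
    # The interpreter location differs between Windows and POSIX layouts.
    if os.name == "nt":
        return venv_dir / "Scripts" / "python.exe"
    return venv_dir / "bin" / "python"


with tempfile.TemporaryDirectory() as tmp:
    # with_pip=False keeps this demonstration fast; installing
    # dependencies would need with_pip=True.
    python = create_step_venv(Path(tmp), with_pip=False)
    print(f"Steps would run with: {python}")
```

Because every step is launched through the returned path rather than a bare `python` looked up on `PATH`, the interpreter is unambiguous even when a step declares no dependencies.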

except json.JSONDecodeError:
    logging.info(f"Python script output: {result.stdout}")

except CalledProcessError as e:
Contributor

It looks like upon failure we drop anything that was written to stdout. Do you think it's useful to log that?
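A minimal sketch of logging the captured output on failure follows. `subprocess.CalledProcessError` carries `stdout` and `stderr` when the process is run with `capture_output=True`, so both can be logged before re-raising. The function `run_script` and its exact log messages are assumptions, not the PR's implementation.

```python
import json
import logging
import subprocess
import sys
from typing import Any, Optional, Sequence

logger = logging.getLogger(__name__)


def run_script(python: str, script_args: Sequence[str]) -> Optional[Any]:
    """Run a script; return parsed JSON from stdout, or None if it is not JSON."""
    try:
        result = subprocess.run(
            [python, *script_args], check=True, capture_output=True, text=True
        )
    except subprocess.CalledProcessError as e:
        # Keep whatever the script printed before it failed.
        if e.stdout:
            logger.error("Script stdout before failure: %s", e.stdout)
        logger.error("Script exited with code %d: %s", e.returncode, e.stderr)
        raise RuntimeError(f"Script execution failed: {e.stderr}") from e
    try:
        return json.loads(result.stdout)
    except json.JSONDecodeError:
        logger.info("Python script output: %s", result.stdout)
        return None


if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO)
    try:
        run_script(sys.executable, ["-c", "print('partial'); raise SystemExit(2)"])
    except RuntimeError as exc:
        print(f"Caught: {exc}")
```

Chaining with `from e` keeps the original `CalledProcessError`, including its captured output, available to callers through `__cause__`.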

@@ -22,6 +22,17 @@ def pipeline_config():
return config


@pytest.fixture(scope="module")
Contributor

I'm curious about why this is needed?

Collaborator Author

You mean the scope variable?

@sundarshankar89 sundarshankar89 changed the base branch from main to feature/synapse-cred-configuration April 30, 2025 09:47
@sundarshankar89 sundarshankar89 added the stacked PR Should be reviewed, but not merged label May 6, 2025
Labels
feature/profiler stacked PR Should be reviewed, but not merged
4 participants