Skip to content
New issue

Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? # to your account

#192: (WIP: Needs Work) Replace PythonDataflowTask With BeamDataflowJobTask #199

Open
wants to merge 5 commits into
base: master
Choose a base branch
from

Conversation

rclough
Copy link
Contributor

@rclough rclough commented Jun 18, 2019

Removes PythonDataflowTask and related files/tests to replace it with the luigi.contrib BeamDataflowJobTask

@rclough rclough changed the title #192: Replace PythonDataflowTask With BeamDataflowTask #192: Replace PythonDataflowTask With BeamDataflowJobTask Jun 18, 2019
@codecov
Copy link

codecov bot commented Jun 18, 2019

Codecov Report

Merging #199 into master will decrease coverage by 1.19%.
The diff coverage is 80%.

Impacted file tree graph

@@            Coverage Diff            @@
##           master     #199     +/-   ##
=========================================
- Coverage   89.11%   87.91%   -1.2%     
=========================================
  Files          12       11      -1     
  Lines         744      596    -148     
=========================================
- Hits          663      524    -139     
+ Misses         81       72      -9
Impacted Files Coverage Δ
spotify_tensorflow/luigi/tfx_task.py 66.66% <80%> (-27.78%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 8e807b9...7bb29e7. Read the comment docs.

Copy link
Contributor

@brianmartin brianmartin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we have a new version minimum for luigi?

luigi>=2.3.3,<3.0.0

@@ -52,6 +55,10 @@ def tfx_args(self):
"--schema_file=%s" % self.get_schema_file()
]

def dataflow_executable(self):
""" Must be overwritten from the BeamDataflowTaski """
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

typo?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yup, good catch

PR is still WIP cause the new beam task is refactored a bunch, trying to fix tests

@rclough rclough changed the title #192: Replace PythonDataflowTask With BeamDataflowJobTask #192: (WIP: Needs Work) Replace PythonDataflowTask With BeamDataflowJobTask Jun 21, 2019
@rclough
Copy link
Contributor Author

rclough commented Jun 21, 2019

There is still work to be done on this ticket. Documenting some discussions on how to proceed:

Switching to BeamDataflowJobTask is just a bit more involved than what I'd initially realized, with the refactoring that was done for Dataflow input arguments.

The most pertinent point is that now there is a class called DataflowParamKeys that needs to be passed to the BeamDataflowJobTask on initialization. Historically, beam tasks have simply initialized the object and directly accessed the variables in the class to set parameters.

There are, generally, 2 approaches that can be taken here:

  1. Pass the responsibility of passing in the DataflowParamKeys to the end user, and use the new BeamDataflowJobTask more or less as a drop in replacement with a few extra TF specific parameters. This would be a pretty big API break that would require a decent amount of work from end users to upgrade to
  2. Keep the current API, and find a way to convert it into DataflowParamKeys or otherwise work around the requirement. The former would require some magic (ie not directly inheriting the BeamDataflowJobTask, but rather initializing it on run when all the parameters have been set the old way). The latter seems like the easiest approach: pass an empty DataflowParamKeys on init, and allow the user to edit the params directly. If theres a refactored name, you can use a setter for that value to make sure its mapped to the new name.

# for free to join this conversation on GitHub. Already have an account? # to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants