[REVIEW] release HRP greenflow plugin #143
One of the unit tests failed.
test_max_drawdown (unit.test_max_drawdown.TestMaxDrawdown) ... FAIL
Minor flake8 warnings:
$ flake8 greenflow_hrp_plugin/
greenflow_hrp_plugin/__init__.py:41:79: W291 trailing whitespace
greenflow_hrp_plugin/kernels.py:592:80: E501 line too long (81 > 79 characters)
greenflow_hrp_plugin/loadCsvNode.py:109:13: F841 local variable 'client' is assigned to but never used
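For reference, all three warning classes are mechanical to fix. A hypothetical sketch follows; none of these names come from the plugin code:

```python
# Hypothetical snippets illustrating fixes for the flake8 codes above;
# these are not the actual plugin code.

# E501 (line too long): wrap long expressions inside parentheses
# instead of exceeding the 79-character limit.
total = (
    sum(x * x for x in range(100))
    + sum(x for x in range(100))
)

# F841 (local variable assigned but never used): if a call is needed
# only for its side effect, drop the unused binding.
def start_worker(factory):
    # before: client = factory()   # flake8 flags F841 when unused
    factory()  # after: call without binding the result

# W291 (trailing whitespace) is purely cosmetic: strip the spaces at
# the end of the offending line (autopep8 or black do this for you).
```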
I'm not done reviewing; some comments are just remarks. For now, the failing unit test above should be fixed.
gQuant/plugins/hrp_plugin/greenflow_hrp_plugin/aggregateTimeFeature.py (comment resolved)
gQuant/plugins/hrp_plugin/greenflow_hrp_plugin/transactionCostNode.py (comment resolved)
Addressed all the comments. RAPIDS 21.08 will break the dask_cudf dataframe type; RAPIDS 21.06 is fine.
There's a slight nuisance in the notebooks. Some of the workflows produce a warning that a port is missing in the incoming meta data:
Task "max_drawdown_feature" Node Type "[<class 'greenflow_hrp_plugin.featureNode.FeatureNode'>]" missing required port "feature_df" in incoming meta data. Should the port be connected?
Task "all_max_drawdown_feature" Node Type "[<class 'greenflow_hrp_plugin.featureNode.FeatureNode'>]" missing required port "feature_df" in incoming meta data. Should the port be connected?
The reason is that the node greenflow_hrp_plugin.featureNode.FeatureNode has a port feature_df that is optional, i.e. it is not required to be connected (but if it is connected, its data is used). For such a scenario there is a port spec flag, PortsSpecSchema.optional: True. I tweaked the code to inhibit the warning. These are the changes I made:
diff --git a/gQuant/plugins/hrp_plugin/greenflow_hrp_plugin/featureNode.py b/gQuant/plugins/hrp_plugin/greenflow_hrp_plugin/featureNode.py
index 8a8da3f..ba6b6f1 100644
--- a/gQuant/plugins/hrp_plugin/greenflow_hrp_plugin/featureNode.py
+++ b/gQuant/plugins/hrp_plugin/greenflow_hrp_plugin/featureNode.py
@@ -40,7 +40,8 @@ class FeatureNode(TemplateNodeMixin, Node):
port_type: [
"pandas.DataFrame", "cudf.DataFrame",
"dask_cudf.DataFrame", "dask.dataframe.DataFrame"
- ]
+ ],
+ PortsSpecSchema.optional: True
}
}
port_outports = {
diff --git a/greenflow/greenflow/dataframe_flow/_node_flow.py b/greenflow/greenflow/dataframe_flow/_node_flow.py
index 6fda95f..749c64f 100644
--- a/greenflow/greenflow/dataframe_flow/_node_flow.py
+++ b/greenflow/greenflow/dataframe_flow/_node_flow.py
@@ -858,6 +858,10 @@ class NodeTaskGraphMixin(NodeTaskGraphExtensionMixin):
dy = PortsSpecSchema.dynamic
if inports[iport].get(dy, False):
continue
+
+ if inports[iport].get(PortsSpecSchema.optional, False):
+ continue
+
# Is it possible that iport not connected? If so iport should
# not be in required. Should raise an exception here.
warn_msg = \
diff --git a/greenflow/greenflow/dataframe_flow/_node_taskgraph_extension_mixin.py b/greenflow/greenflow/dataframe_flow/_node_taskgraph_extension_mixin.py
index 0363df0..9be1fc4 100644
--- a/greenflow/greenflow/dataframe_flow/_node_taskgraph_extension_mixin.py
+++ b/greenflow/greenflow/dataframe_flow/_node_taskgraph_extension_mixin.py
@@ -1,3 +1,4 @@
+from copy import deepcopy
from .portsSpecSchema import (PortsSpecSchema, NodePorts)
from .metaSpec import (MetaDataSchema, MetaData)
@@ -56,12 +57,10 @@ class NodeTaskGraphExtensionMixin:
dynamic = None
inports = {}
for input_port in port_inports:
+ inports[input_port] = deepcopy(port_inports[input_port])
if input_port in input_connections:
determined_type = input_connections[input_port]
- inports[input_port] = {port_type: determined_type}
- else:
- types = port_inports[input_port][port_type]
- inports[input_port] = {port_type: types}
+ inports[input_port].update({port_type: determined_type})
if dy in port_inports[input_port]:
inports[input_port][dy] = True
I set that port as optional in featureNode.py. Then I had to tweak the logic a bit in _node_flow.py:_validate_connected_metadata and _node_taskgraph_extension_mixin.py:ports_setup_ext to account for the PortsSpecSchema.optional case.
I did not test this thoroughly, but the notebooks seem to run fine and the warning is now correctly handled.
What do you think of making these updates?
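For readers unfamiliar with optional ports, the runtime behavior they imply can be sketched in plain Python. This is an illustrative stand-in, not the greenflow API; the process function and its fallback default are made up:

```python
# Illustrative sketch of optional-port semantics: use the upstream
# data when the port is connected, otherwise fall back to a default.
# This is NOT greenflow code; all names here are hypothetical.
def process(inputs):
    feature_df = inputs.get("feature_df")  # None when not connected
    if feature_df is None:
        feature_df = []  # stand-in default when the port is unwired
    return len(feature_df)
```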
I finished reviewing. There's a bug when drawing the taskgraph with ports in ipynb mode.
taskGraph.draw(show='ipynb', show_ports=True)
That draw method is not called with the show_ports option in the notebooks; nevertheless it can be fixed via:
diff --git a/greenflow/greenflow/dataframe_flow/taskGraph.py b/greenflow/greenflow/dataframe_flow/taskGraph.py
index 70149b1..66fc0ff 100644
--- a/greenflow/greenflow/dataframe_flow/taskGraph.py
+++ b/greenflow/greenflow/dataframe_flow/taskGraph.py
@@ -353,7 +353,7 @@ class TaskGraph(object):
if (to_type == OUTPUT_TYPE):
continue
- task_node = get_node_obj(itask)
+ task_node = get_node_obj(itask, tgraph_mixin=True)
I had trouble running through the 17assets.ipynb notebook. Even after reducing the bootstrap samples I would run out of memory on the GPU. I had to restart the notebook kernel and run each section individually; that way I was able to run through the whole notebook. I think there's a bug with cudf.merge, as that method seems to cause a memory leak on the GPU.
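Short of restarting the kernel, explicitly dropping references and forcing garbage collection sometimes releases memory between sections. This is a generic Python sketch (the run_section helper is made up); whether it actually frees GPU memory depends on how cudf/RMM pools allocations:

```python
import gc

# Generic pattern: keep large intermediates in a local scope so they
# become unreachable once the section finishes, then force collection.
def run_section(build_df):
    df = build_df()      # stand-in for a section's heavy computation
    summary = len(df)    # keep only the small result
    del df               # drop the big object explicitly
    return summary

result = run_section(lambda: list(range(1000)))
gc.collect()             # reclaim unreachable objects
```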
My suggestion would be to split that notebook "17assets.ipynb" into several notebooks corresponding to 1 or 2 sections at a time. Name them something like "17assets_1of5.ipynb", "17assets_2of5.ipynb", and so on, or whatever makes sense. That might make it easier to run each notebook, clean the python kernel or shutdown the notebook, run the next notebook, and so on.
Let me know if you'd like to make these changes or not. I'll resolve this comment then, and approve/merge the PR.
Thanks.

To free GPU memory between notebooks, the kernel can be shut down programmatically from a cell:

import IPython
app = IPython.Application.instance()
app.kernel.do_shutdown(True)

Could you give it a try?
Ready for review.
Thank you. This looks good. I was able to run the notebook. I had to be a little patient after running the kernel shutdown cell. Otherwise this worked.
Did you re-run the 17assets.ipynb notebook completely? In the version posted on GitHub (https://github.com/yidong72/gQuant/blob/branch-hrp/gQuant/plugins/hrp_plugin/notebooks/17assets.ipynb), the sections after "Hyperparameter Optimization" still show the warning that was fixed. Running it myself, none of those warnings appeared anymore.
I added your 17assets notebook and the synthetic 10-stock dataset.
Looks great. Thanks!
Release the code for the HRP project.