-
Notifications
You must be signed in to change notification settings - Fork 322
New issue
Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? # to your account
Removed aggregation in util_list.py, get_dataframe() #1033
Conversation
Codecov Report
@@ Coverage Diff @@
## develop #1033 +/- ##
=============================================
+ Coverage 71.769% 72.448% +0.679%
=============================================
Files 225 221 -4
Lines 51895 51116 -779
=============================================
- Hits 37245 37033 -212
+ Misses 14650 14083 -567
|
Ok, this is not an ideal solution. The dataframe will become sparse when the number of stresses changes between periods. I think it would be more proper to concatenate different periods as rows instead of columns. That would require the period number as an additional index level. |
Any thoughts @mnfienen-usgs, @aleaf-usgs, and/or @jtwhite79?? |
For what it is worth, I think I am probably responsible for some of that aggregation business. From what I remember, it was to try and avoid issues with repeating indices in the resulting dataframe. I think @tomvansteijn is right though, it doesn't make much sense for some of the package list-types that now exist, so should be all good as long as repeated cells no longer cause and issue in the df construction. |
To be honest I think a better solution would be to add the period as an index level and concatenate the stress period data in row direction instead of the column direction. Instead of the dataframe looking like this:
Have this:
This would make aggregation no longer necessary and make it easier to get a Series over period index for a given k, i, j (or node). What do you think? |
@tomvansteijn - I think adding stress period as an index level is probably the best approach since it would let us slice and dice more easily. And Im fine with dropping the aggregation. |
@tomvansteijn, kper index level also works for me. I think we could still end up with repeated index though, if a cell entry is repeated within a stress period (which I think can happen in MODFLOW inputs) (kper, k, i, j) would be the same. Not sure how pandas deals with this now, previously it threw an error, hence the introduction of that aggregation step to ensure unique index. |
Good to know. I thinkt there is no problem when there are duplicates in the index, when concatenating the stress period data in along the same dimension. I will adjust the PR for this. The change may break some people's code though. |
Hey @tomvansteijn, one way to fix the duplicate issue would be to include boundary number as another column in the data frame. The boundary number would just correspond to which feature it is in the bound list. So if there are 10 wells, for a stress period, they could just be numbered zero to nine. That way, if the ten wells were all in the same cell, they could would at least have a useful indicator of which one is which. Not sure what to call that column, maybe IBNDNO? Or something like that? |
Hey @tomvansteijn, we are going to be making a flopy release in the next week or two. Any chance you are able to update this PR? |
Hi Chris, thanks for the reminder. I have made some changes to include "per" as the first index column. Seems to work, but I am still checking the tests. |
Some trouble here with test t004_test_utilarray.py::test_mflist. The last assert fails in this block:
The expression If you agree I will change the assertion to:
Honestly I think it is better to turn the squeeze option off by default, since it is changing the size of the input data implicitly. But then we would still want a test for |
That squeeze option sounds dangerous! I think if the sum of input flux is 25 then the assert is technically correct right? |
The assertion is not correct when two entries are removed in the squeeze operation. The remaining entries sum to 16.0. You can also consider to move the squeeze operation to a separate util function, or to remove it. |
Tests are passing so I think this is it. Let me know what you think. |
Clueless why t065 is failing now, as I did not change anything related to gridintersect. t065 is passing on my machine. |
@dbrakenhoff, we are getting some weird failures with this PR, and I've restarted a couple times, but they keep occurring. From what I can tell, the version of shapely has not changed since we've had a successful tests, so it's unclear what is going on here. This PR changes the behavior of util_list.py and retrieval of a pandas data frame. Any thoughts? PR failure can be seen here: https://github.com/modflowpy/flopy/pull/1033/checks?check_run_id=1856766475
|
I saw the comment and I already tried replicating the error locally but to no avail. The intersect result seems to be different for some reason. I'll try to figure out what's going on... |
I don't quite understand why yet, but the error seems to be caused by the odd behavior when intersecting LineStrings with a polygon representing the bounding box . The LineStrings in these cases intersect the bounding box, but the intersection result is an empty LineString. The snippet below illustrates the problem: from shapely.geometry import LineString, box
pl = box(0, 0, 20, 20)
shp = LineString([(5., 5.), (15., 5.)])
shp.intersects(pl) # returns True
intersect = shp.intersection(pl)
intersect.__geo_interface__ # Returns: {'type': 'LineString', 'coordinates': ()} Also weird, I can only replicate this error (thus far) using the environment file So I'm sort of at a loss now. Anyone else have any ideas? |
Hi all. I'm able to replicate the error using GEOS 3.9.0. This is found in a few recent commits from the development branch, e.g. GEOS 3.9.0 (used by Shapely) is the current version, released December 2020, which uses a new overlay engine dubbed OverlayNG. The snippet that @dbrakenhoff showed above indeed looks strange. The expected behaviour is found with geos==3.8.1, so my suggestion is to pin this version while I investigate why Shapely/GEOS is providing the wrong intersection with geos==3.9.0. |
Some good news: this issue (libgeos/geos#408) was just fixed 19 hours ago in the geos 3.9 branch (thanks @dr-jts !). The bad news that you'll need to avoid geos==3.9.0 and wait until geos==3.9.1 comes out -- and I'm not sure when this could be. Update: a release could be expected in the next few days. Update2: a release was announced. So it's just a matter of time for (e.g.) conda to catch-up to enable geos==3.9.1 |
@tomvansteijn, can you merge develop into this PR and resubmit? I think the t065 test should pass now. It seems like there is still a little bit of unsettled business with this squeeze option, but turning it off by default seems like the safest idea for now. The Pandas people might consider taking a closer look at this get_dataframe() method to ensure that records and data aren't being lost, which is what it sounds like now, when the squeeze option is used. |
Getting a "rate limit exceeded" while building Modflow using pymake in the test setup.. |
As for the squeeze option in the get_dataframe method. The original doc says:
That is, stress periods where none of the cells have changed, in any variable, since the previous stress period, are ommited from the dataframe. For example if the stress period data of period 5 is identical to the stress period data in period 4, period 5 is omitted from the dataframe. This would not lead to a different simulation after putting the dataframe back to the package data (unless the omitted period becomes an empty block). Moving the periods from the columns to the index led me to a different understanding of the squeeze operation: Omit any row (period, node) where the variables have not changed since the previous period. This different from the original implementation. For example the stress period data for node (1, 1, 3) in period 5 is equal to the stress period data in period 4, therefore node (1, 1, 3) is omitted from period 5 in the dataframe. I understand now that this would lead to a different result in a simulation. So this implementation is not correct. In general I think it is best to maintain a 1-to-1 relation between the package data and the dataframe, hence my suggestion to remove the squeeze option completely. Otherwise I can correct it. |
I agree. Squeeeze makes sense if the entire dataframe is identical between stress periods but otherwise, we lose that 1-to-1 relation. |
I have adjusted the squeeze option once more. Only fully duplicate periods are removed from the stress period data. See the additional test in t004. @langevin-usgs I have used your suggestion of adding a unique identifier for multiple entries in the same period and cell index. |
Thanks for the perseverance! |
See #764.
In the current implementation list input is grouped by layer, row, column when creating a dataframe. Variables are summed. This is incorrect for levels, e.g. stages or river bottoms.
Removed the groupby altogether. I believe it is not needed here? Not aware of the requirements for NetCDF, but a groupby can be performed after this function call.