
Batch performant discover_nhdplus_id #417

Open
mhweber opened this issue Dec 16, 2024 · 9 comments


@mhweber
Contributor

mhweber commented Dec 16, 2024

Currently the StreamCatTools sc_get_comid() function calls discover_nhdplus_id() to derive NHDPlus COMIDs for sets of latitude and longitude values. A number of users have recently been trying to speed this up by parallelizing or by sending batch requests that exceed server limits in the underlying NLDI service.

StreamCatTools has a similar function, lc_get_comid(), which calls the nhdplusTools get_waterbodies() function and pulls NHDPlus waterbody COMIDs from the subset features.

Would calling the NHDPlus subset service directly, or via nhdplusTools, be more performant and robust than discover_nhdplus_id() for deriving COMIDs for a large set of latitude and longitude values?
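
For reference, the pattern that runs into these limits is roughly one NLDI request per site. A minimal sketch (illustrative only; sc_get_comid()'s internals may differ):

library(sf)
library(nhdplusTools)

lons <- c(-122.802489389074, -122.691787093599)
lats <- c(43.85780225517, 43.9239837521485)

# One web request per point -- this is what becomes slow for long lists.
comids <- sapply(seq_along(lons), function(i) {
  pt <- st_sfc(st_point(c(lons[i], lats[i])), crs = 4269)
  discover_nhdplus_id(pt)
})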

@dblodgett-usgs
Collaborator

Thanks for prompting this, @mhweber -- I've run into this use case a few times where people have long lists and end up using patterns that don't scale well. I'll look at an alternate discover_nhdplus_id() implementation and put some thought into whether there is a faster way to do it via GeoServer services.

@dblodgett-usgs
Collaborator

I just merged a change that will help a bit. I'll leave this open and think about whether there's a more significant update where we could do a spatial join remotely to retrieve COMIDs.

@DEQathomps

This is timely. I'm really interested in using nhdplusTools for watershed delineations. Could you please explain how to batch process? Following the code from the vignette, this function works great when dealing with a single station (only the first lines of code are presented for simplicity):

start_point <- st_sfc(st_point(c(-122.802489389074, 43.85780225517)), crs = 4269)
start_comid <- discover_nhdplus_id(start_point)

However, processing multiple stations at once results in errors, server timeouts, etc. Example code below.

lon2 <- c(-122.802489389074, -122.691787093599)
lat2 <- c(43.85780225517, 43.9239837521485)
start_points <- st_sfc(st_point(c(lon2, lat2)), crs = 4269)
start_comids <- discover_nhdplus_id(start_points)

I run into similar complications with other steps (e.g., flowlines, catchments). I've tried multiple approaches and have the most recent version of the package installed.

Has anyone processed multiple stations simultaneously or been able to create batch watershed delineations? Any tips would be much appreciated!

@dblodgett-usgs
Collaborator

Under the hood, discover_nhdplus_id() for a point is doing a point-in-polygon query against the NHDPlusV2 catchments.

In a previous version, it called a web service with a little more overhead than the current implementation, but it's still basically just dropping your point into a catchment.

For batches of points, downloading the NHDPlusV2 catchments and using sf::st_join() to get the COMID for each point is going to be best.
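
A minimal sketch of that approach, assuming the catchments are fetched with nhdplusTools::get_nhdplus() and that the COMID column is named featureid (the column name can vary by catchment source):

library(sf)
library(nhdplusTools)

pts <- st_as_sf(
  data.frame(lon = c(-122.802489389074, -122.691787093599),
             lat = c(43.85780225517, 43.9239837521485)),
  coords = c("lon", "lat"), crs = 4269
)

# One request for all catchments covering the points, rather than one per point.
catchments <- get_nhdplus(AOI = st_as_sfc(st_bbox(pts)),
                          realization = "catchment")

# Point-in-polygon join: each point picks up the featureid (COMID) of the
# catchment that contains it.
joined <- st_join(st_transform(pts, st_crs(catchments)),
                  catchments, join = st_within)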

If you need to iterate against the web service, it is best to use httr::RETRY() or httr2::req_retry() with exponential backoff to ensure you don't overload the server. Any degree of parallelism is probably not advised, as the back-end systems are not architected for it.
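
For example, with httr2 (the NLDI URL and response structure below are illustrative of the position-based lookup, not necessarily the exact current route):

library(httr2)

resp <- request("https://api.water.usgs.gov/nldi/linked-data/comid/position") |>
  req_url_query(coords = "POINT(-122.802489 43.857802)") |>
  req_retry(max_tries = 5, backoff = function(i) 2^i) |>  # exponential backoff
  req_perform()

# Response structure assumed: GeoJSON with a comid property on the first feature.
comid <- resp_body_json(resp)$features[[1]]$properties$comid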

@mhweber
Contributor Author

mhweber commented Dec 27, 2024

Thanks @dblodgett-usgs! @DEQathomps I've put together geoparquet files for all NHDPlusV2 lake watersheds, am interested in publishing them in an S3 bucket, and am curious about potentially doing the same for NHDPlusV2 reach COMIDs. We have a method tied to StreamCat and LakeCat that uses staged numpy arrays to (fairly) quickly generate watersheds on the fly, but publishing as geoparquet seems like it would be a useful product. The intent would be to add functionality to StreamCatTools to request watersheds for given lakes or reaches. I'm looking into the feasibility of this at the moment.
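
For what it's worth, the consumer side of such a product could look something like this sketch, assuming sfarrow handles the geoparquet and using a hypothetical file name:

library(sfarrow)

# Read pre-staged lake watershed polygons (file name is hypothetical; an
# S3-hosted version could be read via arrow's S3 filesystem support).
lake_watersheds <- st_read_parquet("nhdplusv2_lake_watersheds.parquet")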

@dblodgett-usgs
Collaborator

I guess I'm missing something -- you are talking about "watershed delineation" and "generating watersheds," but discover_nhdplus_id() just returns a COMID for a place in the network. What does the rest of the workflow look like here?

@mhweber
Contributor Author

mhweber commented Dec 27, 2024

Sorry @dblodgett-usgs, not really related to this issue, my bad. I was just following up on @DEQathomps' question above ("has anyone been able to create batch watershed delineations?") and wanted to point out that I'm working toward potentially sharing staged watershed delineations for lakes via geoparquet in S3. That may or may not be a viable approach, but I have them all and want to make them easily accessible.

@mhweber
Contributor Author

mhweber commented Jan 27, 2025

@dblodgett-usgs I'm guessing you can close this with the recommendation to use httr::RETRY() or httr2::req_retry(), or alternatively to download the NHDPlusV2 catchments and use sf::st_join() to get the COMIDs, unless you want to leave it open for further developments you're thinking of?

@dblodgett-usgs
Collaborator

I'll leave it open. I want to think some more about whether there is a more scalable way to do discovery.
