Skip to content
New issue

Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? # to your account

TRT-571 - Update net2cog to process multiple variables. #38

Open
wants to merge 5 commits into
base: develop
Choose a base branch
from

Conversation

owenlittlejohns
Copy link
Collaborator

@owenlittlejohns owenlittlejohns commented Feb 3, 2025

Github Issue: #32 (also one of the items mentioned in #35).

Description

This PR adds support for requests asking for multiple variables, either explicitly, or when a Harmony request specifies "all".

Overview of work done

There's a fair bit of shuffling around, but essentially the code changes are:

  • If there is an empty list specified for variables, this is an "all" request, and so do all data variables.
  • The _write_cogtiff function has been rescoped to just write a single COG. The iteration now takes place in netcdf_converter.
  • I implemented the generate_output_filename from the harmony-service-lib-py module, which now has a reformatted suffix (note, this is a minor difference from converted). I removed a bunch of custom code that was geared around generating the output filename, as that is no longer needed.
  • I consolidated test fixtures into a conftest.py.

Overview of verification done

  • Updated unit tests (including consolidating test fixtures into tests/conftest.py). There are now tests that run to confirm multiple variables can be processed, either via explicitly requesting them or by using "all" in a Harmony request. These tests exist at the overall service level, and the netcdf_converter function level.
  • I also have updated the net2cog regression tests (see: TRT-571 - Update regression tests for net2cog. nasa/harmony-regression-tests#131)

Overview of integration done

PR checklist:

  • Linted
  • Updated unit tests
  • Updated changelog
  • Integration testing

See Pull Request Review Checklist for pointers on reviewing this pull request

raise Net2CogError(error) from error
except LookupError as err:
logger.info("Variable %s cannot be converted to tif: %s", variable_name, err)
return None
Copy link
Collaborator Author

@owenlittlejohns owenlittlejohns Feb 3, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure that return None is the best thing for each of these cases, but it was more of an equivalent to the previous continue when there was a loop in this function. I'd probably prefer to raise an exception instead in a few of these return None instances. What do you think @frankinspace?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes I think it makes sense to raise an exception here. Perhaps create a custom FailedConversion exception or similar so that we could in the future add custom error handling in netcdf_convert_harmony

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

279641

Currently the Net2CogError was only being raised within this function, so I opted to use that. Happy to make a child of that exception type, though, if that feels better.

if not var_list:
# Empty list means "all" variables, so get all variables in
# the `xarray.Dataset`.
var_list = list(xds.data_vars.keys())
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is something I wanted to call out.

I think using xarray.Dataset.data_vars will essentially prevent an "all" request from trying to convert coordinate variables, so it feels like the right thing to choose, instead of all varibles.

That said, if a user specifies the coordinate variables by name, we'll still hit the code that checks against EXCLUDED_VARIABLES. Maybe it would be more scaleable to check the xarray.Dataset.coordinates instead?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems like a reasonable step forward. The EXCLUDED_VARIABLES was a pretty simplistic way of trying to deal with not converting coordinate variables so this does improve that IMO

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just to confirm, what we have in this PR is good for now, and perhaps swapping EXCLUDED_VARIABLES for xarray.Dataset.coordinates is something for future work? Happy either way, just making sure I'm not missing some work to do here.

Comment on lines 79 to 83
# Rename the downloaded output from a SHA256 hash to the basename
# of the source data file
source_asset_basename = os.path.basename(asset.href)
os.rename(input_filename, os.path.join(output_dir, source_asset_basename))
input_filename = os.path.join(output_dir, source_asset_basename)
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was a bit on the fence with this change. It feels a bit clunky.

When Harmony downloads a file, it assigns a SHA256 hash as the filename, based on the URL it got the file from. But, we want to be passing around the base name of the original granule, so the output file names can be generated for every variable. I was considering two options:

  1. This: rename the downloaded file to have a base name that matches the source file, instead of the SHA 256 hash.
  2. Pass around the source granule base name an additional variable.

I opted for a bit more complexity here, to avoid having another string floating around and being confusing.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

May want to reconsider this approach given my previous comment about removing harmony dependency from netcdf_convert

Copy link
Collaborator Author

@owenlittlejohns owenlittlejohns Feb 3, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay - if we aren't using the generate_output_filename in netcdf_convert.py, then I think we can avoid having to rename things. So this can be cleaned up to just use the SHA256-derived name.

I think the other part of it, though, is making sure that there is a mapping in the output from variable name to file path, so that we can generate the correct file name in netcdf_convert_harmony.py. If you are fine with tweaking the output of netcdf_convert.py::netcdf_converter to something like a dictionary, then I can work up something like that and move the renaming out to netcdf_convert_harmony.py.

Update: A simpler alternative - the default (non-Harmony) file name could just be <variable name>.tif.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I like <variable name>.tif. Do we need to worry about special characters there?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need to worry about special characters there?

Great catch!

I had just run with using <variable_name>.tif in 279641, but hadn't accounted for the possibility of a slash (which will come up with hierarchical collections). I will at least do something to swap out slash characters for underscores and push that up. Should be a quick tweak, and I think that's compatible with what harmony-service-lib does to those characters as well.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There we go: 656f49

I just do a simple string replace for slashes.

Copy link
Member

@frankinspace frankinspace left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice work, thank you for the contribution! Just a few comments to consider

xarray dataset loaded from NetCDF file
variable_name: str
Name of the variable as passed in to Harmony.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit picky but shouldn't mention Harmony in the non-harmony module. Intent is that this netcdf_convert module could be used independently of Harmony.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated in 279641

raise Net2CogError(error) from error
except LookupError as err:
logger.info("Variable %s cannot be converted to tif: %s", variable_name, err)
return None
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes I think it makes sense to raise an exception here. Perhaps create a custom FailedConversion exception or similar so that we could in the future add custom error handling in netcdf_convert_harmony

if not var_list:
# Empty list means "all" variables, so get all variables in
# the `xarray.Dataset`.
var_list = list(xds.data_vars.keys())
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems like a reasonable step forward. The EXCLUDED_VARIABLES was a pretty simplistic way of trying to deal with not converting coordinate variables so this does improve that IMO

Comment on lines 79 to 83
# Rename the downloaded output from a SHA256 hash to the basename
# of the source data file
source_asset_basename = os.path.basename(asset.href)
os.rename(input_filename, os.path.join(output_dir, source_asset_basename))
input_filename = os.path.join(output_dir, source_asset_basename)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

May want to reconsider this approach given my previous comment about removing harmony dependency from netcdf_convert

# for free to join this conversation on GitHub. Already have an account? # to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants