[BUG] earthaccess.download gets different data than curl from an OPeNDAP service #887

Closed
1 task done
itcarroll opened this issue Dec 4, 2024 · 8 comments · Fixed by #920
Labels
type: bug Something isn't working

Comments

@itcarroll
Collaborator

Is this issue already tracked somewhere, or is this a new report?

  • I've reviewed existing issues and couldn't find a duplicate for this problem.

Current Behavior

Reporting an issue noted by @tsnow03 on the CryoCloud Slack.

Giving earthaccess.download a URL for the LAADS OPeNDAP service (in this case, one that returns a NetCDF4 formatted version of the archival HDF-EOS file) returns a gzipped file. Using curl on the same URL returns an uncompressed file. If it is intended that earthaccess.download get a compressed file, then some notification should be given. If not ...

Expected Behavior

I expect earthaccess.download to download a file identical to what curl downloads for a given URL.

Steps To Reproduce

Show that earthaccess writes a compressed file:

import earthaccess

url = "https://ladsweb.modaps.eosdis.nasa.gov/opendap/RemoteResources/laads/allData/61/MOD07_L2/2000/055/MOD07_L2.A2000055.0000.061.2017202185924.hdf.nc4"
earthaccess.download(url, "data")

with open("data/MOD07_L2.A2000055.0000.061.2017202185924.hdf.nc4", "rb") as f:
    print(f.read(4))
b'\x1f\x8b\x08\x00'

That looks to me like a gzipped file, and passing the file through gunzip does allow it to be opened with netCDF4.
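The manual check above (read the leading bytes, then gunzip) can be sketched as a small helper. This is only an illustration: `is_gzipped` and `gunzip_in_place` are hypothetical names, not part of earthaccess, and the sketch assumes the whole file is a single gzip stream.

```python
import gzip
import shutil
from pathlib import Path

GZIP_MAGIC = b"\x1f\x8b"  # the first two bytes of any gzip stream


def is_gzipped(path):
    """Return True if the file starts with the gzip magic number."""
    with open(path, "rb") as f:
        return f.read(2) == GZIP_MAGIC


def gunzip_in_place(path):
    """Decompress a gzipped file in place (hypothetical helper).

    Returns True if the file was decompressed, False if it was left alone.
    """
    path = Path(path)
    if not is_gzipped(path):
        return False
    tmp = path.with_name(path.name + ".tmp")
    # Copy in chunks so a large file is never held in memory at once.
    with gzip.open(path, "rb") as src, open(tmp, "wb") as dst:
        shutil.copyfileobj(src, dst)
    tmp.replace(path)
    return True
```

Calling `gunzip_in_place` on the file written by earthaccess.download should leave a file that starts with the same HDF5 signature bytes that curl produced.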

On the other hand, curl writes an uncompressed HDF5 file.

!curl -O "https://ladsweb.modaps.eosdis.nasa.gov/opendap/RemoteResources/laads/allData/61/MOD07_L2/2000/055/MOD07_L2.A2000055.0000.061.2017202185924.hdf.nc4"

with open("MOD07_L2.A2000055.0000.061.2017202185924.hdf.nc4", "rb") as f:
    print(f.read(4))
b'\x89HDF'

Environment

- OS: CryoCloud (ubuntu jammy)
- Python: 3.11.9

Additional Context

No response

@mfisher87 mfisher87 added the type: bug Something isn't working label Dec 4, 2024
@mfisher87
Collaborator

This is certainly weird. It must be OPeNDAP conditionally compressing the data based on headers or something like that? waves hands

I don't think this is intended.

@itcarroll
Collaborator Author

I had originally suspected that something was up with the OPeNDAP service, but the response from LAADS user services indicates they do not compress on their end. https://forum.earthdata.nasa.gov/viewtopic.php?t=6247

@mfisher87
Collaborator

I think we need to debug and compare the HTTP requests! Then we can probably force this to occur with curl as well through trial and error.
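One way to reason about header-dependent behavior without touching the real service is a local stand-in. The sketch below is an assumption-laden toy, not a model of LAADS: a hypothetical `GzipWhenAsked` handler compresses the body only when the request's Accept-Encoding mentions gzip, and a client that reads the undecoded body sees different leading bytes depending on that one header. By default curl sends no Accept-Encoding header unless `--compressed` is passed, which is one header difference worth checking first.

```python
import gzip
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

PAYLOAD = b"\x89HDF" + b"\x00" * 64  # stand-in for the start of an HDF5 file


class GzipWhenAsked(BaseHTTPRequestHandler):
    """Toy server: gzip the body only if the client advertises gzip support."""

    def do_GET(self):
        body = PAYLOAD
        self.send_response(200)
        if "gzip" in self.headers.get("Accept-Encoding", ""):
            body = gzip.compress(body)
            self.send_header("Content-Encoding", "gzip")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the demo quiet


def first_bytes(url, headers):
    """Fetch url and return the first 4 bytes of the body, undecoded."""
    # urllib never decodes Content-Encoding, so this mimics reading raw bytes.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    request = urllib.request.Request(url, headers=headers)
    with opener.open(request) as response:
        return response.read(4)


server = HTTPServer(("127.0.0.1", 0), GzipWhenAsked)
threading.Thread(target=server.serve_forever, daemon=True).start()
url = f"http://127.0.0.1:{server.server_port}/"

plain = first_bytes(url, {})  # no gzip advertised
compressed = first_bytes(url, {"Accept-Encoding": "gzip"})

server.shutdown()
server.server_close()

print(plain)       # starts with the HDF5 signature
print(compressed)  # starts with the gzip magic number
```

If the real service behaves like this toy, adding `-H "Accept-Encoding: gzip"` to the curl command would be a natural first experiment.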

@maxrjones

maxrjones commented Dec 12, 2024

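requests advertises gzip support in its Accept-Encoding request header by default, so a server may reply with a gzip Content-Encoding, and requests normally decodes such compressed content automatically. However, earthaccess bypasses this automatic decoding by using Response.raw here: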

with open(path, "wb") as f:
    # This is to cap memory usage for large files at 1MB per write to disk per thread
    # https://docs.python-requests.org/en/latest/user/quickstart/#raw-response-content
    shutil.copyfileobj(r.raw, f, length=1024 * 1024)

If you want to copy the automatically decoded content to the file instead, you could use Response.content instead of Response.raw, but that loads the whole response body into memory, which would defeat the memory usage cap strategy.

@maxrjones

The note in this section of the docs actually provides a better explanation of what's causing this issue - https://docs.python-requests.org/en/latest/user/quickstart/#raw-response-content

@itcarroll
Collaborator Author

itcarroll commented Dec 12, 2024

Thanks @maxrjones!

The practice recommended in those docs is to use Response.iter_content instead of raw, which also lets us set a chunk size.

Any opposition to testing that switch?
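A minimal sketch of what that switch could look like, assuming a response opened with stream=True. The helper name `write_decoded` is hypothetical, and this is not necessarily the code that was eventually merged. With stream=True, Response.iter_content yields the body in chunks after any gzip or deflate transfer encoding has been decoded, so the 1 MB cap per write is kept.

```python
def write_decoded(response, path, chunk_size=1024 * 1024):
    """Stream a requests-style response to disk in decoded chunks.

    `response` is assumed to be a requests.Response obtained with
    stream=True; only its iter_content method is used here.
    """
    with open(path, "wb") as f:
        # iter_content decodes gzip/deflate content, unlike response.raw.
        for chunk in response.iter_content(chunk_size=chunk_size):
            f.write(chunk)


# Hypothetical usage with an authenticated requests session:
#
#     with session.get(url, stream=True) as r:
#         r.raise_for_status()
#         write_decoded(r, path)
```

The file on disk should then match what curl writes, since both end up with the decoded body.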

@mfisher87
Collaborator

Good find @maxrjones , thank you!

The practice recommended in those docs is to use Response.iter_content instead of raw, which also lets us set a chunk size.

Any opposition to testing that switch?

🚀 🚀 🚀

@mfisher87
Collaborator

Thanks @itcarroll for the implementation!

Projects
Status: Done

3 participants