[BUG] earthaccess.download gets different data than curl from an OPeNDAP service #887

itcarroll · 2024-12-04T04:15:37Z

Is this issue already tracked somewhere, or is this a new report?

I've reviewed existing issues and couldn't find a duplicate for this problem.

Current Behavior

Reporting an issue noted by @tsnow03 on the CryoCloud slack.

Giving earthaccess.download a URL for the LAADS OPeNDAP service (in this case, one that returns a NetCDF4-formatted version of the archival HDF-EOS file) writes a gzipped file. Using curl on the same URL writes an uncompressed file. If earthaccess.download is intended to fetch a compressed file, then it should notify the user. If not ...

Expected Behavior

I expect earthaccess.download to download a file identical to what curl downloads for a given URL.

Steps To Reproduce

Show that earthaccess writes a compressed file:

import earthaccess

url = "https://ladsweb.modaps.eosdis.nasa.gov/opendap/RemoteResources/laads/allData/61/MOD07_L2/2000/055/MOD07_L2.A2000055.0000.061.2017202185924.hdf.nc4"
earthaccess.download(url, "data")

with open("data/MOD07_L2.A2000055.0000.061.2017202185924.hdf.nc4", "rb") as f:
    print(f.read(4))

b'\x1f\x8b\x08\x00'

That looks to me like a gzipped file, and passing the file through gunzip does allow it to be opened with netCDF4.

On the other hand, curl writes an uncompressed HDF5 file.

!curl -O "https://ladsweb.modaps.eosdis.nasa.gov/opendap/RemoteResources/laads/allData/61/MOD07_L2/2000/055/MOD07_L2.A2000055.0000.061.2017202185924.hdf.nc4"

with open("MOD07_L2.A2000055.0000.061.2017202185924.hdf.nc4", "rb") as f:
    print(f.read(4))

b'\x89HDF'
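
As a stopgap for files that were already downloaded, the magic-byte check above can be turned into a small helper that decompresses a file only when it starts with the gzip signature. This is a minimal sketch using only the standard library; the names `is_gzipped` and `gunzip_in_place` are hypothetical and not part of earthaccess.

```python
import gzip
import os
import shutil
import tempfile

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream


def is_gzipped(path):
    """Return True if the file at `path` starts with the gzip magic bytes."""
    with open(path, "rb") as f:
        return f.read(2) == GZIP_MAGIC


def gunzip_in_place(path, chunk_size=1024 * 1024):
    """Decompress `path` in place if it is gzipped; return True if it was."""
    if not is_gzipped(path):
        return False
    # Write to a temporary file in the same directory, then swap it in,
    # so a failure partway through does not clobber the original.
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as dst, gzip.open(path, "rb") as src:
            shutil.copyfileobj(src, dst, length=chunk_size)
        os.replace(tmp_path, path)
    except BaseException:
        os.remove(tmp_path)
        raise
    return True
```

For the file above, that would be `gunzip_in_place("data/MOD07_L2.A2000055.0000.061.2017202185924.hdf.nc4")`; calling it on a file that is already uncompressed leaves it untouched and returns `False`.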

Environment

- OS: CryoCloud (ubuntu jammy)
- Python: 3.11.9

Additional Context

No response


mfisher87 · 2024-12-04T17:17:54Z

This is certainly weird. It must be OPeNDAP conditionally compressing the data based on headers or something like that? waves hands

I don't think this is intended.

itcarroll · 2024-12-04T18:42:30Z

I had originally suspected that something was up with the OPeNDAP service, but the response from LAADS User Services indicates that they do not compress on their end: https://forum.earthdata.nasa.gov/viewtopic.php?t=6247

mfisher87 · 2024-12-04T19:07:52Z

I think we need to debug and compare the HTTP requests! Then we can probably force this to occur with curl as well through trial and error.

maxrjones · 2024-12-12T02:15:25Z

requests advertises gzip support in the Accept-Encoding request header, so the server may reply with Content-Encoding: gzip, and requests normally decodes such compressed content automatically. However, earthaccess bypasses this automatic decoding by using Response.raw in

earthaccess/earthaccess/store.py

Lines 679 to 682 in ffbfddd

    with open(path, "wb") as f:
        # This is to cap memory usage for large files at 1MB per write to disk per thread
        # https://docs.python-requests.org/en/latest/user/quickstart/#raw-response-content
        shutil.copyfileobj(r.raw, f, length=1024 * 1024)

If you want to copy the automatically decoded content to the file, you should use Response.content rather than Response.raw, but that would affect the memory usage cap strategy, since Response.content reads the whole body into memory.
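
Decoding and a bounded read size are not inherently at odds, because a gzip stream can be decoded incrementally. The sketch below illustrates the idea with only the standard library, using an in-memory `io.BytesIO` as a stand-in for the raw response body. `copy_decoded` is a hypothetical helper for illustration, not earthaccess or requests code.

```python
import gzip
import io
import zlib


def copy_decoded(raw, out, chunk_size=1024 * 1024):
    """Copy a gzip-encoded stream `raw` to `out`, decoding one chunk at a time."""
    # wbits = 16 + MAX_WBITS tells zlib to expect a gzip header and trailer.
    decoder = zlib.decompressobj(16 + zlib.MAX_WBITS)
    while True:
        chunk = raw.read(chunk_size)  # at most chunk_size compressed bytes
        if not chunk:
            break
        out.write(decoder.decompress(chunk))
    out.write(decoder.flush())


payload = b"\x89HDF" + bytes(10_000)           # pretend HDF5 file contents
raw_body = io.BytesIO(gzip.compress(payload))  # bytes as sent on the wire

print(raw_body.getvalue()[:2])  # gzip magic, like the file earthaccess wrote

decoded = io.BytesIO()
copy_decoded(raw_body, decoded, chunk_size=256)
print(decoded.getvalue()[:4])   # HDF5 signature, like the file curl wrote
```

Note that `chunk_size` here bounds the compressed bytes read per step; the decoded output of one step can be larger, so the cap is on reads rather than a strict limit on memory.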

maxrjones · 2024-12-12T02:19:56Z

The note in this section of the docs actually provides a better explanation of what's causing this issue - https://docs.python-requests.org/en/latest/user/quickstart/#raw-response-content

itcarroll · 2024-12-12T14:27:32Z

Thanks @maxrjones!

The practice recommended in those docs is to use Response.iter_content instead of raw, which also lets us set a chunk size.

Any opposition to testing that switch?
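
A minimal sketch of what that switch might look like, which is not necessarily what was eventually merged. `write_response` is a hypothetical helper, and `response` is assumed to be a `requests.Response` obtained from a call made with `stream=True`.

```python
def write_response(response, path, chunk_size=1024 * 1024):
    """Stream a response body to `path`, letting requests undo content encodings.

    `response` is assumed to be a requests.Response from a request made with
    stream=True, so the body is not loaded into memory all at once.
    """
    with open(path, "wb") as f:
        # iter_content yields decoded bytes (gzip/deflate already undone),
        # unlike response.raw, which exposes the bytes as sent on the wire.
        for chunk in response.iter_content(chunk_size=chunk_size):
            f.write(chunk)
```

With a session `s`, this could be called as `write_response(s.get(url, stream=True), path)`, which keeps the chunked writes of the current code while writing the decoded bytes.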

mfisher87 · 2024-12-12T15:52:18Z

Good find @maxrjones, thank you!

> The practice recommended in those docs is to use Response.iter_content instead of raw, which also lets us set a chunk size.
>
> Any opposition to testing that switch?

🚀 🚀 🚀

mfisher87 · 2025-01-07T21:04:27Z

Thanks @itcarroll for the implementation!

mfisher87 added the "type: bug" label Dec 4, 2024

itcarroll mentioned this issue Jan 7, 2025 in "download using chunk iteration rather than raw response" #920 (merged)

itcarroll closed this as completed in #920 Jan 7, 2025

