[parquet] stringify source to handle both URLs and local paths #1913

Merged: 1 commit, Jun 8, 2023

ajkerrigan (Collaborator) commented Jun 8, 2023

Stringify the source path that we feed into pq.read_table. This produces consistent results for local paths and URLs, where using a visidata.Path directly tries to load the underlying pathlib.Path object at _path.

Looks like the issue is that trying to tuck a URL inside a pathlib path drops a slash, turning http://www... into http:/www.... So it sort of looks like a URL but sort of like a local path...

Dog? Pig? Loaf of bread? Credit to https://fablefire.tumblr.com/post/650255824292397056/dog-pig-dog-pig-dog-pig-loaf-of-bread
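
The slash-dropping behavior can be seen with `pathlib` alone. A minimal sketch, using `PurePosixPath` so the result does not depend on the host OS (`visidata.Path` itself is not involved here):

```python
from pathlib import PurePosixPath

url = "https://www.example.com/test.parquet"

# pathlib treats the URL as an ordinary path and collapses the
# doubled slash after the scheme into a single separator.
mangled = str(PurePosixPath(url))

print(mangled)         # https:/www.example.com/test.parquet
print(mangled == url)  # False
```

The mangled string matches the value reported in the `ArrowInvalid` message in the stack trace.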

To reproduce this issue, use the following .vdj:

#!vd -p
{"sheet": "", "col": "", "row": "", "longname": "exec-python", "input": "visidata.loaders.parquet.ParquetSheet('', source=visidata.Path('https://www.example.com/test.parquet')).reload()", "comment": "execute Python statement with expression scope"}

The URL is invalid, but that doesn't really matter. The replay dies with this stack trace:

Traceback (most recent call last):
  File "/home/aj/.local/pipx/venvs/visidata/lib/python3.11/site-packages/visidata/basesheet.py", line 200, in execCommand
    escaped = super().execCommand2(cmd, vdglobals=vdglobals)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/aj/.local/pipx/venvs/visidata/lib/python3.11/site-packages/visidata/basesheet.py", line 73, in execCommand2
    exec(code, vdglobals, LazyChainMap(vd, self))
  File "exec-python", line 1, in <module>
    from . import kvpairs
                          
  File "<string>", line 1, in <module>
  File "/home/aj/.local/pipx/venvs/visidata/lib/python3.11/site-packages/visidata/vdobj.py", line 22, in _execAsync
    return visidata.vd.execAsync(func, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/aj/.local/pipx/venvs/visidata/lib/python3.11/site-packages/visidata/cmdlog.py", line 359, in <lambda>
    vd.execAsync = lambda func, *args, sheet=None, **kwargs: func(*args, **kwargs) # disable async
                                                             ^^^^^^^^^^^^^^^^^^^^^
  File "/home/aj/.local/pipx/venvs/visidata/lib/python3.11/site-packages/visidata/sheets.py", line 236, in reload
    for r in self.iterload():
  File "/home/aj/.local/pipx/venvs/visidata/lib/python3.11/site-packages/visidata/loaders/parquet.py", line 19, in iterload
    self.tbl = pq.read_table(self.source)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/aj/.local/pipx/venvs/visidata/lib/python3.11/site-packages/pyarrow/parquet/core.py", line 2926, in read_table
    dataset = _ParquetDatasetV2(
              ^^^^^^^^^^^^^^^^^^
  File "/home/aj/.local/pipx/venvs/visidata/lib/python3.11/site-packages/pyarrow/parquet/core.py", line 2452, in __init__
    finfo = filesystem.get_file_info(path_or_paths)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "pyarrow/_fs.pyx", line 571, in pyarrow._fs.FileSystem.get_file_info
  File "pyarrow/error.pxi", line 144, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 100, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: Expected a local filesystem path, got a URI: 'https:/www.example.com/test.parquet'

Hat tip to https://github.com/spren9er who reported this issue for S3 specifically in ajkerrigan/visidata-plugins#27.

@@ -16,7 +16,7 @@ def iterload(self):
         pq = vd.importExternal('pyarrow.parquet', 'pyarrow')
         from visidata.loaders.arrow import arrow_to_vdtype
 
-        self.tbl = pq.read_table(self.source)
+        self.tbl = pq.read_table(str(self.source))
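
To illustrate why `str()` helps, here is a rough sketch with a hypothetical `WrappedPath` class. It is not visidata's actual `Path` implementation. It only assumes, per the PR description, that the wrapper keeps a `pathlib` object at `_path`, and it further assumes that the consumer unwraps path-like objects through `os.fspath`.

```python
import os
from pathlib import PurePosixPath


class WrappedPath(os.PathLike):
    """Hypothetical stand-in for a path wrapper; not visidata.Path."""

    def __init__(self, given):
        self.given = given                 # the string exactly as supplied
        self._path = PurePosixPath(given)  # pathlib view, which mangles URLs

    def __fspath__(self):
        # What a library sees if it unwraps the object as a filesystem path.
        return os.fspath(self._path)

    def __str__(self):
        # What a library sees if the caller stringifies the object first.
        return self.given


source = WrappedPath("https://www.example.com/test.parquet")
as_fspath = os.fspath(source)
as_string = str(source)

print(as_fspath)  # https:/www.example.com/test.parquet
print(as_string)  # https://www.example.com/test.parquet
```

Under those assumptions, passing the object directly exposes the mangled `pathlib` form, while passing `str(source)` hands over the URL intact.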

Owner

This may remove the ability to use e.g. parquet files inside .zip files. (Not sure if this works at present either though--depends on if read_table can take a file object).

Collaborator Author

Hmm, good call. Just tested locally and that seems to work properly with or without this change 👍

Owner

Huh! How does that work? :) But if it does, I believe you, and we'll merge it.

Collaborator

👏

Collaborator Author

> Huh! How does that work? :) But if it does, I believe you, and we'll merge it.

I didn't like how the answer to this was "I have noooo idea" so I started following it and realized I was wrong. It's broken in both cases!

...but my test setup made it look like it worked. Because I just added a parquet file to a zip in the same directory:

├── test.zip
│   └── benchmark.parquet
└── benchmark.parquet

So when I ran vd and opened the zip file, then opened the parquet file inside it, it worked! But under the covers it was transforming the source path into the local path outside the zip. If I moved the uncompressed benchmark.parquet file elsewhere, things stopped working (with or without this change). Doh.
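
The false positive can be reproduced in miniature with the standard library. This is a simplified sketch, not how VisiData resolves sources: the placeholder bytes and the temporary directory are illustrative assumptions, and only the file names come from the layout above.

```python
import os
import tempfile
import zipfile

with tempfile.TemporaryDirectory() as workdir:
    sibling = os.path.join(workdir, "benchmark.parquet")
    archive = os.path.join(workdir, "test.zip")

    # Placeholder bytes; a real parquet payload is not needed for the point.
    with open(sibling, "wb") as f:
        f.write(b"placeholder")
    with zipfile.ZipFile(archive, "w") as zf:
        zf.write(sibling, arcname="benchmark.parquet")

    # Treating the member name as a plain local path "works" only
    # because an identically named file sits next to the archive.
    member_as_local_path = os.path.join(workdir, "benchmark.parquet")
    found_with_sibling = os.path.exists(member_as_local_path)

    # Remove the uncompressed copy: the same lookup now fails.
    os.remove(sibling)
    found_without_sibling = os.path.exists(member_as_local_path)

    # The member is still readable through the archive itself.
    with zipfile.ZipFile(archive) as zf:
        member_bytes = zf.read("benchmark.parquet")

print(found_with_sibling, found_without_sibling, member_bytes)
```

A test for zip support is more trustworthy when no file with the member's name exists outside the archive.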

Owner

Hah, okay. Thanks for re-checking. In any event, this change doesn't break existing behavior, so when I want to revisit my .parquetz file idea (in order to have multiple parquet tables in a single file), I'll see if I can fix it without breaking this use case.

Collaborator Author

For the record, opening this as a separate issue #1916

I'll try to take a look at it since I'm halfway in the hole already, but if anyone else sorts it out first that works too :)

@saulpw saulpw merged commit 321bdbf into saulpw:develop Jun 8, 2023
@ajkerrigan ajkerrigan deleted the fix/parquet-stringify-source branch June 8, 2023 15:39
@takacsd takacsd mentioned this pull request Nov 22, 2023