
Usage with pyarrow parquet #10

Open
tanguycdls opened this issue Apr 15, 2021 · 2 comments

Comments

@tanguycdls

Hello, I'm very interested in using the library; however, I'm struggling to apply it to any parquet file other than the dremel example.

from struct2tensor import expression_impl
import struct2tensor as s2t
import pyarrow as pa
import pyarrow.parquet as pq

# Write a minimal single-column parquet file with pyarrow.
tbl = pa.table([pa.array([0, 1])], names=['a'])
with pq.ParquetWriter('/tmp/test', tbl.schema) as writer:
    writer.write_table(tbl)

filenames = ["/tmp/test"]
batch_size = 2

# Build a struct2tensor expression from the parquet schema and project column 'a'.
exp = s2t.expression_impl.parquet.create_expression_from_parquet_file(filenames)
ps = exp.project(['a'])

# Materialize the projected values; iterating the result is where the crash shows up.
val = s2t.expression_impl.parquet.calculate_parquet_values([ps], exp,
                                                           filenames, batch_size)
for h in val:
    break

This segfaults after printing the following errors:
2021-04-15 15:30:40.254237: E struct2tensor/kernels/parquet/parquet_reader.cc:198]
The repetition type of the root node was 0, but should be 2. There may be something wrong with your supplied parquet schema. We will treat it as a repeated field.

2021-04-15 15:31:46.428109: W tensorflow/core/framework/dataset.cc:477]
Input of ParquetDatasetOp::Dataset will not be optimized because the dataset does not implement the AsGraphDefInternal() method needed to apply optimizations.

I also tried loading the dremel example file with pyarrow, writing it back out right away, and reading that copy: the error reproduces there as well.

How do you advise saving the parquet file?

Thanks for your help!

@andylou2
Contributor

Hi Tanguy,

The dremel example was created with Parquet's C++ API [1]. The last time I checked (~2 years ago), pyarrow's parquet writer/reader did not properly support structured data, but that could have changed since.

Do you have the full stack trace? The errors you listed are not fatal errors.

[1] https://github.com/apache/parquet-cpp

@tanguycdls
Author

Hello, thanks for the answer!

It's actually a core-dumping SEGFAULT. I tried gdb, but I don't have the symbols and sources configured, so the trace isn't very clear to me:

#0  _PyErr_GetTopmostException (tstate=0x0) at /home/conda/feedstock_root/build_artifacts/python_1613748395163/work/Python/errors.c:98
#1  PyErr_SetObject (exception=0x55d759d71d00 <_PyExc_RuntimeError>, value=0x7f74d422b390) at /home/conda/feedstock_root/build_artifacts/python_1613748395163/work/Python/errors.c:98
#2  0x000055d759b15b4d in PyErr_SetString (exception=0x55d759d71d00 <_PyExc_RuntimeError>, string=<optimized out>) at /home/conda/feedstock_root/build_artifacts/python_1613748395163/work/Python/errors.c:170
#3  0x00007f755e84cb78 in pybind11::detail::translate_exception(std::__exception_ptr::exception_ptr) () from /opt/conda/envs/model/lib/python3.7/site-packages/tensorflow/python/_pywrap_tfe.so
#4  0x00007f755e87ca1b in pybind11::cpp_function::dispatcher(_object*, _object*, _object*) () from /opt/conda/envs/model/lib/python3.7/site-packages/tensorflow/python/_pywrap_tfe.so
#5  0x000055d759b9c427 in _PyMethodDef_RawFastCallKeywords (method=<optimized out>, self=0x7f755ebeb210, args=0x55d75e4c1540, nargs=<optimized out>, kwnames=<optimized out>) at /home/conda/feedstock_root/build_artifacts/python_1613748395163/work/Objects/call.c:693
#6  0x000055d759b9dad8 in _PyCFunction_FastCallKeywords (kwnames=<optimized out>, nargs=<optimized out>, args=0x55d75e4c1540, func=0x7f755ebea960) at /home/conda/feedstock_root/build_artifacts/python_1613748395163/work/Objects/call.c:723
#7  call_function (pp_stack=0x7ffeaf5dd3c0, oparg=<optimized out>, kwnames=<optimized out>) at /home/conda/feedstock_root/build_artifacts/python_1613748395163/work/Python/ceval.c:4568
#8  0x000055d759bc874a in _PyEval_EvalFrameDefault (f=<optimized out>, throwflag=<optimized out>) at /home/conda/feedstock_root/build_artifacts/python_1613748395163/work/Python/ceval.c:3093
#9  0x000055d759b0baf2 in PyEval_EvalFrameEx (throwflag=0, f=0x55d75e4c1360) at /home/conda/feedstock_root/build_artifacts/python_1613748395163/work/Python/ceval.c:3930
#10 _PyEval_EvalCodeWithName (_co=<optimized out>, globals=<optimized out>, locals=<optimized out>, args=<optimized out>, argcount=<optimized out>, kwnames=<optimized out>, kwargs=<optimized out>, kwcount=<optimized out>, kwstep=<optimized out>, defs=<optimized out>, defcount=<optimized out>, kwdefs=<optimized out>, closure=<optimized out>, name=<optimized out>, qualname=<optimized out>) at /home/conda/feedstock_root/build_artifacts/python_1613748395163/work/Python/ceval.c:3930
#11 0x000055d759b3a030 in _PyFunction_FastCallKeywords (func=<optimized out>, stack=0x7f74e0065f68, nargs=1, kwnames=<optimized out>) at /home/conda/feedstock_root/build_artifacts/python_1613748395163/work/Objects/call.c:433
#12 0x000055d759b9d9c8 in call_function (pp_stack=0x7ffeaf5dd6c0, oparg=<optimized out>, kwnames=<optimized out>) at /home/conda/feedstock_root/build_artifacts/python_1613748395163/work/Python/ceval.c:4616
#13 0x000055d759bc51d9 in _PyEval_EvalFrameDefault (f=<optimized out>, throwflag=<optimized out>) at /home/conda/feedstock_root/build_artifacts/python_1613748395163/work/Python/ceval.c:3139
#14 0x000055d759b39e94 in PyEval_EvalFrameEx (throwflag=0, f=0x7f74e0065de0) at /home/conda/feedstock_root/build_artifacts/python_1613748395163/work/Python/ceval.c:544
#15 function_code_fastcall (globals=0x7f755c7d1140, nargs=<optimized out>, args=<optimized out>, co=<optimized out>) at /home/conda/feedstock_root/build_artifacts/python_1613748395163/work/Objects/call.c:283
#16 _PyFunction_FastCallKeywords (func=<optimized out>, stack=<optimized out>, nargs=<optimized out>, kwnames=<optimized out>) at /home/conda/feedstock_root/build_artifacts/python_1613748395163/work/Objects/call.c:408
#17 0x000055d759b9d9c8 in call_function (pp_stack=0x7ffeaf5dd8a0, oparg=<optimized out>, kwnames=<optimized out>) at /home/conda/feedstock_root/build_artifacts/python_1613748395163/work/Python/ceval.c:4616
#18 0x000055d759bc4544 in _PyEval_EvalFrameDefault (f=<optimized out>, throwflag=<optimized out>) at /home/conda/feedstock_root/build_artifacts/python_1613748395163/work/Python/ceval.c:3110
#19 0x000055d759b0cead in PyEval_EvalFrameEx (throwflag=0, f=0x7f74d4387050) at /home/conda/feedstock_root/build_artifacts/python_1613748395163/work/Python/ceval.c:544
(More stack frames follow...)

On the Python side there is nothing except the log shown right before the crash.

I remember some conversations about pyarrow's ability to store such data, but I thought that had been resolved. parquet-cpp, however, now seems to live in the Arrow repo.

I'll try to see if I can understand the difference between the two formats!
