Access to Undefined Reference #122

Open
Deduction42 opened this issue Dec 2, 2020 · 4 comments

@Deduction42

I can only partially iterate over a file created by parquet-mr. I can iterate through it once, but trying a second time yields:

iterate(cursor)
ERROR: UndefRefError: access to undefined reference
[1] getindex at ./array.jl:809 [inlined]
 [2] colcursor_values(::Parquet.ColCursor{String}, ::Int64, ::Type{Array{Union{Missing, String},1}}, ::Nothing) at /home/user/.julia/packages/Parquet/yx9gp/src/cursor.jl:289
 [3] (::Parquet.var"#55#58"{BatchedColumnsCursor{NamedTuple{...}}}})(::Tuple{Parquet.ColCursor{String},DataType}) at ./none:0
 [4] iterate at ./generator.jl:47 [inlined]
 [5] collect_to!(::Array{Array{T,1} where T,1}, ::Base.Generator{Base.Iterators.Zip{Tuple{Array{Parquet.ColCursor,1},Core.SimpleVector}},Parquet.var"#55#58"{BatchedColumnsCursor{NamedTuple{...}}}}}, ::Int64, ::Tuple{Int64,Int64}) at ./array.jl:732
 [6] collect_to!(::Array{Array{Union{Missing, Decimals.Decimal},1},1}, ::Base.Generator{Base.Iterators.Zip{Tuple{Array{Parquet.ColCursor,1},Core.SimpleVector}},Parquet.var"#55#58"{BatchedColumnsCursor{NamedTuple{...}) at ./array.jl:740
 [7] collect_to_with_first!(::Array{Array{Union{Missing, Decimals.Decimal},1},1}, ::Array{Union{Missing, Decimals.Decimal},1}, ::Base.Generator{Base.Iterators.Zip{Tuple{Array{Parquet.ColCursor,1},Core.SimpleVector}},Parquet.var"#55#58"{BatchedColumnsCursor{NamedTuple{...}) at ./array.jl:710
 [8] collect(::Base.Generator{Base.Iterators.Zip{Tuple{Array{Parquet.ColCursor,1},Core.SimpleVector}},Parquet.var"#55#58"{BatchedColumnsCursor{NamedTuple{...}}) at ./array.jl:691
 [9] iterate(::BatchedColumnsCursor{NamedTuple{...}, ::Int64) at /home/user/.julia/packages/Parquet/yx9gp/src/cursor.jl:336
 [10] iterate(::BatchedColumnsCursor{NamedTuple{...}) at /home/user/.julia/packages/Parquet/yx9gp/src/cursor.jl:350
 [11] top-level scope at REPL[5]:1

Note that NamedTuple{...} is abridged because the actual tuple is a very long list covering the entire file schema. I can't share the original file for this one, but I wouldn't be surprised if the cause is a mutable type being initialized with #undef and never populated. Some of the columns could have sizable gaps in their data. The file was created by parquet-mr:

Parquet file: Input/input_data.parquet
version: 1
nrows: 4887400
created by: parquet-mr version 1.9.0 (build 38262e2c80015d0935dad20f8e18f2d6f9fbd03c)
cached: 157 column chunks
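
For readers unfamiliar with #undef, the following minimal sketch shows how an UndefRefError of this kind arises in plain Julia. It uses only Base and does not touch Parquet.jl internals. The variable names are illustrative, and whether the reader actually leaves buffer slots unassigned is the hypothesis above, not something confirmed here.

```julia
# Minimal sketch using only Base Julia; `buf` is an illustrative name and
# this is NOT Parquet.jl's actual buffer handling.
# A Vector of a non-bits type (such as String) allocated with `undef`
# starts with unassigned slots, which the REPL displays as #undef.
buf = Vector{String}(undef, 3)
buf[1] = "a"   # populate the first slot
buf[3] = "c"   # populate the third slot; slot 2 stays unassigned (a "gap")

# isassigned checks a slot without triggering the error.
assigned = [isassigned(buf, i) for i in eachindex(buf)]

# Reading the unassigned slot throws UndefRefError.
err = try
    buf[2]
    nothing
catch e
    e
end

println(assigned)
println(err isa UndefRefError)
```

If the hypothesis holds, checking isassigned on the column buffer at the failing index would be one way to confirm it.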

@Deduction42

This issue may be related to Issue #120. A string column might fail to parse a delimiter, putting a large amount of data into a single cell and leaving the rest of the columns unpopulated.

@Deduction42

If I iterate through the file with no batch size specified, I get an InexactError (an attempt to convert NaN to Int32):

cursor = BatchedColumnsCursor(parFile, use_threads=false, reusebuffer=false)
df = DataFrame(iterate(cursor)[1])
ERROR: InexactError: Int32(NaN)
Stacktrace:
 [1] Int32 at ./float.jl:689 [inlined]
 [2] read_plain_values(::Parquet.InputState, ::Parquet.OutputState{Decimals.Decimal}, ::Int32, ::Parquet.var"#32#35"{Int32,DataType}, ::Int32) at /home/user/.julia/packages/Parquet/yx9gp/src/codec.jl:170
 [3] iterate(::Parquet.ColumnChunkPageValues{Decimals.Decimal}, ::Int64) at /home/user/.julia/packages/Parquet/yx9gp/src/reader.jl:283
 [4] iterate at /home/user/.julia/packages/Parquet/yx9gp/src/reader.jl:237 [inlined]
 [5] setrow(::Parquet.ColCursor{Decimals.Decimal}, ::Int64) at /home/user/.julia/packages/Parquet/yx9gp/src/cursor.jl:114
 [6] colcursor_advance at /home/user/.julia/packages/Parquet/yx9gp/src/cursor.jl:267 [inlined]
 [7] colcursor_values(::Parquet.ColCursor{Decimals.Decimal}, ::Int64, ::Type{Array{Union{Missing, Decimals.Decimal},1}}, ::Nothing) at /home/user/.julia/packages/Parquet/yx9gp/src/cursor.jl:296
 [8] (::Parquet.var"#55#58"{BatchedColumnsCursor{NamedTuple{...}}})(::Tuple{Parquet.ColCursor{Decimals.Decimal},DataType}) at ./none:0
 [9] iterate at ./generator.jl:47 [inlined]
 [10] collect_to!(::Array{Array{T,1} where T,1}, ::Base.Generator{Base.Iterators.Zip{Tuple{Array{Parquet.ColCursor,1},Core.SimpleVector}},Parquet.var"#55#58"{BatchedColumnsCursor{NamedTuple{...}}}}, ::Int64, ::Tuple{Int64,Int64}) at ./array.jl:732
 [11] collect_to!(::Array{Array{Union{Missing, Decimals.Decimal},1},1}, ::Base.Generator{Base.Iterators.Zip{Tuple{Array{Parquet.ColCursor,1},Core.SimpleVector}},Parquet.var"#55#58"{BatchedColumnsCursor{NamedTuple{...}}}}, ::Int64, ::Tuple{Int64,Int64}) at ./array.jl:740
 [12] collect_to_with_first!(::Array{Array{Union{Missing, Decimals.Decimal},1},1}, ::Array{Union{Missing, Decimals.Decimal},1}, ::Base.Generator{Base.Iterators.Zip{Tuple{Array{Parquet.ColCursor,1},Core.SimpleVector}},Parquet.var"#55#58"{BatchedColumnsCursor{NamedTuple{...}}}}, ::Tuple{Int64,Int64}) at ./array.jl:710
 [13] collect(::Base.Generator{Base.Iterators.Zip{Tuple{Array{Parquet.ColCursor,1},Core.SimpleVector}},Parquet.var"#55#58"{BatchedColumnsCursor{NamedTuple{...}}}}) at ./array.jl:691
 [14] iterate(::BatchedColumnsCursor{NamedTuple{...}}, ::Int64) at /home/user/.julia/packages/Parquet/yx9gp/src/cursor.jl:336
 [15] iterate(::BatchedColumnsCursor{NamedTuple{...}}) at /home/user/.julia/packages/Parquet/yx9gp/src/cursor.jl:350
 [16] top-level scope at REPL[12]:1
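
The error itself is easy to reproduce in isolation. The sketch below uses only Base Julia and shows that converting NaN (or any non-integral float) to Int32 throws InexactError. The helper try_int32 is hypothetical, written just for this illustration, and is not part of Parquet.jl. Where the NaN comes from inside read_plain_values is not established by this thread.

```julia
# Minimal sketch using only Base Julia. `try_int32` is a hypothetical
# helper for illustration and is NOT part of Parquet.jl.
function try_int32(x::Real)
    try
        return Int32(x)
    catch e
        # Only swallow the inexact-conversion case; rethrow anything else.
        e isa InexactError || rethrow()
        return missing
    end
end

# The bare conversion seen in the stack trace throws InexactError.
raised = try
    Int32(NaN)
    false
catch e
    e isa InexactError
end

println(raised)
println(try_int32(3.0))
println(try_int32(NaN))
```

Returning missing here is only a diagnostic convenience for spotting bad values; it is not a proposed fix for the reader.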

@Deduction42

Deduction42 commented Dec 4, 2020

I just verified the fix for Issue #120. Unfortunately, it doesn't fix this problem, so this issue is still open.

@nickrobinson251

nickrobinson251 commented Feb 12, 2021

I also get exactly this UndefRefError when trying to read a Parquet file written using python/pandas, reading with Parquet v0.8.0 on Julia v1.5.3.
