Error writing STRUCT to parquet in parallel: internal error: entered unreachable code: cannot downcast Int64 to byte array #8853
Comments
Similar to #8851, the parallelized parquet writer code is to blame here. There is something wrong with how that code is handling nested types.
I tried to make a minimal reproducer with just arrow-rs, but it appears to work fine. It must then be that there is an issue with the tokio implementation of this logic in DataFusion.

```rust
use std::sync::Arc;

use arrow_array::*;
use arrow_schema::*;
use parquet::arrow::arrow_to_parquet_schema;
use parquet::arrow::arrow_writer::{ArrowLeafColumn, compute_leaves, get_column_writers};
use parquet::file::properties::WriterProperties;
use parquet::file::writer::SerializedFileWriter;

fn main() {
    let schema = Arc::new(Schema::new(vec![Field::new(
        "struct",
        DataType::Struct(
            vec![
                Field::new("b", DataType::Boolean, false),
                Field::new("c", DataType::Int32, false),
            ]
            .into(),
        ),
        false,
    )]));

    // Compute the parquet schema
    let parquet_schema = arrow_to_parquet_schema(schema.as_ref()).unwrap();
    let props = Arc::new(WriterProperties::default());

    // Create writers for each of the leaf columns
    let col_writers = get_column_writers(&parquet_schema, &props, &schema).unwrap();

    // Spawn a worker thread for each column.
    // This is for demonstration purposes; a thread pool, e.g. rayon or tokio, would be better.
    let mut workers: Vec<_> = col_writers
        .into_iter()
        .map(|mut col_writer| {
            let (send, recv) = std::sync::mpsc::channel::<ArrowLeafColumn>();
            let handle = std::thread::spawn(move || {
                for col in recv {
                    col_writer.write(&col)?;
                }
                col_writer.close()
            });
            (handle, send)
        })
        .collect();

    // Create parquet writer
    let root_schema = parquet_schema.root_schema_ptr();
    let mut out = Vec::with_capacity(1024); // This could be a File
    let mut writer = SerializedFileWriter::new(&mut out, root_schema, props.clone()).unwrap();

    // Start row group
    let mut row_group = writer.next_row_group().unwrap();

    let boolean = Arc::new(BooleanArray::from(vec![false, false, true, true]));
    let int = Arc::new(Int32Array::from(vec![42, 28, 19, 31]));

    // Columns to encode
    let to_write = vec![Arc::new(StructArray::from(vec![
        (
            Arc::new(Field::new("b", DataType::Boolean, false)),
            boolean.clone() as ArrayRef,
        ),
        (
            Arc::new(Field::new("c", DataType::Int32, false)),
            int.clone() as ArrayRef,
        ),
    ])) as _];

    // Spawn work to encode columns
    let mut worker_iter = workers.iter_mut();
    for (arr, field) in to_write.iter().zip(&schema.fields) {
        for leaves in compute_leaves(field, arr).unwrap() {
            worker_iter.next().unwrap().1.send(leaves).unwrap();
        }
    }

    // Finish up parallel column encoding
    for (handle, send) in workers {
        drop(send); // Drop send side to signal termination
        let chunk = handle.join().unwrap().unwrap();
        chunk.append_to_row_group(&mut row_group).unwrap();
    }
    row_group.close().unwrap();

    let metadata = writer.close().unwrap();
    assert_eq!(metadata.num_rows, 4);
}
```
@tustvold is it apparent to you what the issue is within the DataFusion parallel parquet code? If not, I propose we disable the feature by default and add many more tests to cover writing nested parquet files and other data types like dictionaries (#8854). Then we can take more time, likely across multiple PRs, to bring the parallel parquet writer in DataFusion to feature parity with the non-parallel version.
I can take some time to look next week. My guess is it is something in the logic that performs slicing for row-group parallelism.
I don't think this issue should block the DataFusion release. @devinjdangelo set the feature to be disabled by default, and I updated this PR's description to mention that. Once we re-enable single-file parallelism by default, we should verify this query still works.
Describe the bug
I can't write a struct to parquet when trying to write in parallel; instead I get the following error:
internal error: entered unreachable code: cannot downcast Int64 to byte array
To Reproduce
Expected behavior
I expect the parquet file to be written successfully. Writing the same data to JSON works fine.
Additional context
No response