Skip to content

Using PARTITION BY in SQL generates 'Error: External(NotImplemented("it is not yet supported to write to hive partitions with datatype Float64"))' #13602

New issue

Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? # to your account

Closed
ajazam opened this issue Nov 29, 2024 · 4 comments
Labels
bug Something isn't working

Comments

@ajazam
Copy link

ajazam commented Nov 29, 2024

Describe the bug

I am trying to create a parquet file with hive partitioning, from csv data and get error

Error: External(NotImplemented("it is not yet supported to write to hive partitions with datatype Float64"))

To Reproduce

main.rs
use std::fs::File;
use std::io::Write;
use arrow::datatypes::{DataType, Field, Schema};
use datafusion::prelude::*;
use tempfile::tempdir;

#[tokio::main]
async fn main() -> datafusion::error::Result<()> {
let dir = tempdir()?;
let file_path = dir.path().join("example.csv");

let mut file = File::create(&file_path)?;
file.write_all(
        r#"dte,ot

2016-07-01 00:00:00,2
2016-07-01 06:45:00,3"#
.as_bytes())?;

let file_path = file_path.to_str().unwrap();

let ctx = SessionContext::new();
let csv_df = ctx.read_csv(file_path, CsvReadOptions::default()).await?;
csv_df.show().await?;

let schema = Schema::new(vec![
    Field::new("dte", DataType::Timestamp(arrow::datatypes::TimeUnit::Second, None), false),

    Field::new("ot", DataType::UInt16, false),
]);

ctx.register_csv("data" ,file_path, CsvReadOptions::new().schema(&schema).has_header(true)).await?;

let df = ctx.sql("copy (SELECT dte, ot, EXTRACT(YEAR FROM dte) AS year from data) to './partitioned_output' stored as parquet PARTITIONED BY (year)").await?;
df.count().await?;

Ok(())

}

cargo.toml
[package]
name = "datafusion_csv"
version = "0.1.0"
edition = "2021"

[dependencies]
tokio = { version = "1", features = ["full"] }
datafusion = "43.0.0"
arrow = "53.3.0"
tempfile = "3.14.0"

Expected behavior

I am expecting a folder year=2016 containing a parquet file

Additional context

I was original trying to have folders for month and day, couldn't get the application to work and then created this simpler example.

@ajazam ajazam added the bug Something isn't working label Nov 29, 2024
@ajazam
Copy link
Author

ajazam commented Nov 29, 2024

I've tried rust 1.82 and 1.83

@delamarch3
Copy link
Contributor

Looks like date_part was updated to return Int32 instead of Float64 in this PR #13466 which should fix this issue. As a workaround you could try casting it like arrow_cast(EXTRACT(..), 'Int64')

@Omega359
Copy link
Contributor

I didn't implement float64 for hive partitioning because, well, floats in general are not exact values. Best to cast to an int.

@ajazam
Copy link
Author

ajazam commented Dec 1, 2024

Thanks gents I got it working. For anybody else who comes up against this issue I made the following alteration

let df = ctx.sql("copy (SELECT dte, ot, arrow_cast(EXTRACT(YEAR FROM dte), 'Int32') AS year from data) to './partitioned_output' stored as parquet PARTITIONED BY (year)").await?;

@ajazam ajazam closed this as completed Dec 1, 2024
# for free to join this conversation on GitHub. Already have an account? # to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

3 participants