Skip to content

Regression: DataFrame::schema returns incorrect schema for NATURAL JOIN #14058

New issue

Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? # to your account

Closed
Tracked by #14008
DDtKey opened this issue Jan 9, 2025 · 4 comments · Fixed by #14102
Closed
Tracked by #14008

Regression: DataFrame::schema returns incorrect schema for NATURAL JOIN #14058

DDtKey opened this issue Jan 9, 2025 · 4 comments · Fixed by #14102
Assignees
Labels
bug Something isn't working help wanted Extra attention is needed regression Something that used to work no longer does

Comments

@DDtKey
Copy link
Contributor

DDtKey commented Jan 9, 2025

Describe the bug

Affected Version: 42.x, 43.x, 44.x (regression since 41.x)

The DataFrame::schema (=> LogicalPlan::schema) method returns a schema that includes all columns from the joined sources (using NATURAL JOIN), including columns not present in the final output. This behavior is incorrect and inconsistent with the documented behavior:

Returns the DFSchema describing the output of this DataFrame.

To Reproduce

Simple MRE here:

// Works for datafusion: 41.x and earlier
// Failed for datafusion: 42.x and later (including 44.x)

use datafusion::arrow::util::pretty;
use datafusion::prelude::*;

#[tokio::main]
async fn main() -> datafusion::error::Result<()> {
    let ctx = SessionContext::new();

    // Create table1
    ctx.sql(
        r#"
        CREATE TABLE table1 AS
        SELECT * FROM (
            VALUES
            (1, 'a'),
            (2, 'b'),
            (3, 'c')
        ) AS t(id, value1)
        "#,
    )
    .await?;

    // Create table2
    ctx.sql(
        r#"
        CREATE TABLE table2 AS
        SELECT * FROM (
            VALUES
            (1, 'x'),
            (3, 'y'),
            (4, 'z')
        ) AS t(id, value2)
        "#,
    )
    .await?;

    // Execute NATURAL JOIN query
    let df = ctx.sql("SELECT * FROM table1 NATURAL JOIN table2").await?;

    // Incorrect schema includes all columns from both tables
    let schema = df.schema().as_arrow().clone();
    println!("Schema: {:?}", schema);

    // Output does not include all columns
    let result = df.collect().await?;
    pretty::print_batches(&result)?;

    let result_schema = result.first().unwrap().schema();
    assert_eq!(&schema, &*result_schema, "Schema mismatch");

    Ok(())
}

Deps:

datafusion = "44.0.0"
tokio = { version = "1", features = ["full"] }

Expected behavior

The schema returned by DataFrame::schema should match the structure of the output produced by collect/collect_partitioned and etc. Specifically:

  • Excluded columns from the result of a NATURAL JOIN should not appear in the schema.

Or, if it was intended - the documentation should be aligned and be clear how to access the schema.
However, I find previous behavior correct and useful (e.g - get schema before methods like write_parquet/csv/json)

Additional context

This is a regression, as the method previously worked correctly in version 41.x.x and earlier.

Also, it probably points to the missing test coverage for particular code-paths. In a sense it's not enough to compare SQL execution results

@DDtKey DDtKey added the bug Something isn't working label Jan 9, 2025
@DDtKey
Copy link
Contributor Author

DDtKey commented Jan 9, 2025

Bisect shows that the behavior changed in 3438b355308afa23dba399f4aec5760969d054c5 commit
It's #11681

Used to work properly before merging this PR

@alamb
Copy link
Contributor

alamb commented Jan 10, 2025

I think @jonahgao is working on something else related to Natural join as well

@alamb alamb added regression Something that used to work no longer does help wanted Extra attention is needed labels Jan 10, 2025
@jonahgao jonahgao self-assigned this Jan 10, 2025
@jonahgao
Copy link
Member

I'll try to fix this.

@alamb
Copy link
Contributor

alamb commented Jan 13, 2025

I added this to the list of items that i think we should fix before releasing version 45.0.0:

# for free to join this conversation on GitHub. Already have an account? # to comment
Labels
bug Something isn't working help wanted Extra attention is needed regression Something that used to work no longer does
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants