Skip to content

Regression: Unneeded fields pushed to TableProvider if struct field is part of query #8735

New issue

Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? # to your account

Closed
simonvandel opened this issue Jan 3, 2024 · 3 comments · Fixed by #8774
Closed
Labels
bug Something isn't working help wanted Extra attention is needed

Comments

@simonvandel
Copy link
Contributor

Describe the bug

The following test works on Datafusion 33, but fails on 34:

    #[tokio::test]
    async fn struct_projection_regression() -> datafusion::error::Result<()> {
        let ctx = SessionContext::new();
        let schema = Arc::new(Schema::new(vec![
            Field::new("a", DataType::Int64, false),
            Field::new_struct(
                "s",
                vec![
                    Field::new("x", DataType::Int64, false),
                    Field::new("y", DataType::Int64, false),
                ],
                false,
            ),
        ]));

        struct TestProvider(SchemaRef);

        #[async_trait]
        impl TableProvider for TestProvider {
            fn as_any(&self) -> &dyn std::any::Any {
                self
            }

            fn schema(&self) -> SchemaRef {
                self.0.clone()
            }

            fn table_type(&self) -> TableType {
                TableType::Base
            }

            async fn scan(
                &self,
                _state: &SessionState,
                projection: Option<&Vec<usize>>,
                _filters: &[Expr],
                _limit: Option<usize>,
            ) -> datafusion::error::Result<Arc<dyn ExecutionPlan>> {
                dbg!(&projection);
                assert!(projection.unwrap().len() == 1);
                Ok(Arc::new(ValuesExec::try_new_from_batches(
                    self.schema().project(projection.unwrap())?.into(),
                    vec![RecordBatch::try_new(
                        self.schema(),
                        vec![
                            Arc::new(Int64Array::from(vec![1, 2, 3])),
                            Arc::new(StructArray::from(vec![
                                (
                                    Arc::new(Field::new("x", DataType::Int64, false)),
                                    Arc::new(Int64Array::from(vec![4, 5, 6])) as ArrayRef,
                                ),
                                (
                                    Arc::new(Field::new("y", DataType::Int64, false)),
                                    Arc::new(Int64Array::from(vec![7, 8, 9])) as ArrayRef,
                                ),
                            ])),
                        ],
                    )?
                    .project(projection.unwrap())?],
                )?))
            }
        }

        let df = ctx
            .read_table(Arc::new(TestProvider(schema)))?
            .select(vec![col("s").field("x")])?;

        let df_results = df.explain(false, false)?.collect().await?;

        assert_batches_eq!(
            [
                "+---------------+--------------------------------------------------+",
                "| plan_type     | plan                                             |",
                "+---------------+--------------------------------------------------+",
                "| logical_plan  | Projection: (?table?.s)[x]                       |",
                "|               |   TableScan: ?table? projection=[s]              |",
                "| physical_plan | ProjectionExec: expr=[(s@0).[x] as ?table?.s[x]] |",
                "|               |   ValuesExec                                     |",
                "|               |                                                  |",
                "+---------------+--------------------------------------------------+",
            ],
            &df_results
        );

        Ok(())
    }

Datafusion 34 fails this line assert!(projection.unwrap().len() == 1); because projection contains [1, 2].

If I only select col("s") without accessing the field x, the projection correctly only contains [1].

To Reproduce

A testcase is provided above.

Expected behavior

I would expect the projection passed to TableProvider to only contain [1] as that is the only field needed in the query.

Additional context

No response

@simonvandel simonvandel added the bug Something isn't working label Jan 3, 2024
@simonvandel simonvandel changed the title Regression: Unneeded fields pushed to TableProvider if struct field is part of projection Regression: Unneeded fields pushed to TableProvider if struct field is part of query Jan 3, 2024
@alamb alamb added the help wanted Extra attention is needed label Jan 5, 2024
@alamb
Copy link
Contributor

alamb commented Jan 5, 2024

So the regression is that projection pushdown isn't working correctly.

Maybe this is related to #8073 from @berkaysynnada ?

Though it seems like the logical plan has the correct projection (of just s) pushed down. I am not sure why that projection gets lost in the physical plan

                "+---------------+--------------------------------------------------+",
                "| plan_type     | plan                                             |",
                "+---------------+--------------------------------------------------+",
                "| logical_plan  | Projection: (?table?.s)[x]                       |",
                "|               |   TableScan: ?table? projection=[s]              |",

@berkaysynnada
Copy link
Contributor

Maybe this is related to #8073 from @berkaysynnada ?

I don't think so. ProjectionPushdown rule does not cover yet the relation between projections and ValuesExec. The plan is returned without modification.

@alamb
Copy link
Contributor

alamb commented Jan 8, 2024

Maybe this is related to #8073 from @berkaysynnada ?

I don't think so. ProjectionPushdown rule does not cover yet the relation between projections and ValuesExec. The plan is returned without modification.

I agree -- it seems like @haohuaijin 's fix #8774 is related to not handling all expression types correctly so I am not sure what caused this particular regression

# for free to join this conversation on GitHub. Already have an account? # to comment
Labels
bug Something isn't working help wanted Extra attention is needed
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants