[DISCUSSION] JOIN "task force" / project team #15885

alamb · 2025-04-28T21:52:46Z

What I see (what problem we are trying to solve)

DataFusion's current join implementations are fairly basic. They are functional enough to run TPC-H and TPC-DS, but lack other features such as larger-than-memory processing, ASOF joins, complete subquery support, and more.

There seems to be a nontrivial desire in the community to improve this.

Some examples of issues / tickets related to enhanced join support / features:

Subqueries (which are implemented as joins)

Join Features

Specialized Joins

Performance

What is blocking significant forward progress

In my mind, the major challenge is that "improving" JOINs can get arbitrarily complicated. There are dozens of academic papers each year on various aspects of join implementations, and designing / implementing join capabilities is a substantial engineering effort.

To give some sense of scale: I spent 6 years of my life doing joins at Vertica, where they accounted for around 50% of the optimizer's complexity.

I don't think the issue is that any particular feature is super complicated to understand. Rather, defining the overall goal and the framework that will accommodate it, and then breaking that down into implementable pieces, will itself require both specialized knowledge and substantial time.

What I suggest

I suggest that people with the relevant skills and time to invest gather together to drive this process forward:

Plan out a "join roadmap" (aka prioritize what join features they will push forward)
Figure out what, if any, new structures need to be in place
Start breaking it down into smaller tickets

I can't personally lead such an effort, but I am filing this ticket to try and help connect the relevant people in the community who can.

Some potential people that could help (sorry if I didn't list you)

@duongcongtoai -- the discussion on Decorrelate scalar subqueries with more complex filter expressions #14554 (comment)
@xudong963 who has experience in this area
@Dandandan @comphead and @korowa who contributed substantially to the existing joins
@mingmwang and @jackwener who contributed significantly to the original subquery implementation
@liukun4515 who likewise helped significantly
@suibianwanwank

Related content:

Related blogs (join ordering section in part 2): https://www.influxdata.com/blog/optimizing-sql-dataframes-part-two/

milenkovicm · 2025-04-29T13:30:57Z

not sure if it will help direction, cost nothing to share :) Debunking the Myth of Join Ordering: Toward Robust SQL Analytics

alamb · 2025-04-29T21:01:11Z

> not sure if it will help direction, cost nothing to share :) Debunking the Myth of Join Ordering: Toward Robust SQL Analytics

I have that paper on my reading list. Does anyone know of a production system that has implemented the RPT framework?

2010YOUY01 · 2025-05-02T15:51:17Z

> > not sure if it will help direction, cost nothing to share :) Debunking the Myth of Join Ordering: Toward Robust SQL Analytics
>
> I have that paper on my reading list. Does anyone know of a production system that has implemented the RPT framework?

DuckDB is integrating it duckdb/duckdb#17326

xudong963 · 2025-05-04T15:00:29Z

> > > not sure if it will help direction, cost nothing to share :) Debunking the Myth of Join Ordering: Toward Robust SQL Analytics
> >
> > I have that paper on my reading list. Does anyone know of a production system that has implemented the RPT framework?
>
> DuckDB is integrating it duckdb/duckdb#17326

Cool, looking forward to seeing the final result

alamb · 2025-05-15T12:54:53Z

In case others haven't heard, @irenjj is working on additional subquery support as part of a Google Summer of Code Project (where @jayzhan211 and I are helping mentor).

I am not quite sure what our next steps will be here.

irenjj mentioned this issue May 15, 2025

[Epic]: Google Summer of Code 2025 Correlated Subquery Support #16059

