Skip to content
New issue

Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? # to your account

feat: allow spread operators in to-many relationships #3640

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

laurenceisla
Copy link
Member

@laurenceisla laurenceisla commented Jul 5, 2024

Closes #3041

@laurenceisla
Copy link
Member Author

laurenceisla commented Jul 6, 2024

My approach right now is to generate this query for a to-many request:

curl 'localhost:3000/clients?select=name,...projects(name,id)'
SELECT "test"."clients"."name",
       "clients_projects_1"."name",
       "clients_projects_1"."id"
FROM "test"."clients"
LEFT JOIN LATERAL (
  SELECT json_agg("projects_1"."name") AS "name",
         json_agg("projects_1"."id") AS "id"
  FROM "test"."projects" AS "projects_1"
  WHERE "projects_1"."client_id" = "test"."clients"."id"
) AS "clients_projects_1" ON TRUE

Right now this gives the expected result. But aggregates are not working correctly, because they are designed to be selected in the top query with a GROUP BY. A solution would be to not do the json_agg() inside the sub-query and do it in the top one and treat it as another aggregate (with GROUP BY). Like this:

SELECT "test"."clients"."name",
      json_agg("clients_projects_1"."name") AS "name",
      json_agg("clients_projects_1"."id") AS "id"
FROM "test"."clients"
LEFT JOIN LATERAL (
  SELECT "projects_1"."name",
         "projects_1"."id"
  FROM "test"."projects" AS "projects_1"
  WHERE "projects_1"."client_id" = "test"."clients"."id"
) AS "clients_projects_1" ON TRUE
GROUP BY "test"."clients"."name"

Not sure which one is better/easier right now... I'm thinking the latter.

@wolfgangwalther
Copy link
Member

Right now this gives the expected result. But aggregates are not working correctly, because they are designed to be selected in the top query with a GROUP BY. A solution would be to not do the json_agg() inside the sub-query and do it in the top one and treat it as another aggregate (with GROUP BY).

Having the json_agg in the outer query would make the query cleaner, imho.

@laurenceisla
Copy link
Member Author

laurenceisla commented Jul 17, 2024

Some caveats I encountered:

Repeated values and order

Do we want to keep repeated values in the results? For example (not the best use case, just to illustrate):

curl 'localhost:3000/project?select=name,...tasks(tasks:name,due_dates:due_date)'
[
  {
    "name": "project 1",
    "tasks": ["task 1", "task 2", "task 3", "task 4"],
    "due_dates": [null, "2024-08-08", "2024-08-08", null]
  }
]

Here we're repeating null and "2024-08-08", so maybe we don't want to do this and just return [null, "2024-08-08"] (perhaps also remove null values?). Doing this will not guarantee the same dimensions for the same aggregated relationship, and definitely not the order of the results (which wasn't guaranteed before either). Doing a DISTINCT inside the json_agg() is a possible solution (the next caveat has an example query).

Nested To-Many Spreads

I have a doubt on what to expect with nested to-many spreads. For example, on a non-nested to-many spread like this one:

curl 'localhost:3000/entities?select=name,...child_entities(children:name)'

We would expect:

[
  {"name": "entity 1", "children": ["child entity 1", "child entity 2"]},
  {"name": "entity 2", "children": ["child entity 3"]},
  "..."
]

But what if we nest another to-many spread embedding with a new column to aggregate:

curl 'localhost:3000/entities?select=name,...child_entities(children:name,...grandchild_entities(grandchildren:name))'

I understand that we're hoisting all the aggregates to the top level, and not grouping by the intermediate columns (entities.name), because they should also be aggregated. I'm assuming that the result should be the same as above but also with the aggregated grandchild_entities.name.

[
  {"name": "entity 1", "children": ["child entity 1", "child entity 2"], "grandchildren": ["grandchild entity 1", "grandchild entity 2", "..."]},
  {"name": "entity 2", "children": ["child entity 3"], "grandchildren": []},
  "..."
]

This cannot be achieved by a simple GROUP BY, because duplicated values will be returned by entities.name (which, perhaps, is not what we want). A solution would also be to use DISTINCT. The query would look like this:

SELECT "api"."entities"."name",
       json_agg(DISTINCT "entities_child_entities_1"."children") AS "children",
       json_agg(DISTINCT "entities_child_entities_1"."grandchildren") AS "grandchildren"
FROM "api"."entities"
LEFT JOIN LATERAL (
  SELECT "child_entities_1"."name" AS "children",
         "child_entities_grandchild_entities_2"."grandchildren" AS "grandchildren"
  FROM "api"."child_entities" AS "child_entities_1"
  LEFT JOIN LATERAL (
    SELECT "grandchild_entities_2"."name" AS "grandchildren"
    FROM "api"."grandchild_entities" AS "grandchild_entities_2"
    WHERE "grandchild_entities_2"."parent_id" = "child_entities_1"."id"
  ) AS "child_entities_grandchild_entities_2" ON TRUE
  WHERE "child_entities_1"."parent_id" = "api"."entities"."id"
) AS "entities_child_entities_1" ON TRUE
GROUP BY "api"."entities"."name";

If there is no sensible interpretation of the query, another option is to prohibit these intermediate columns altogether (aggregates like sum, avg, etc. should still be possible).

@laurenceisla laurenceisla force-pushed the feat-spread-m2m branch 2 times, most recently from 6507878 to bd93514 Compare July 26, 2024 01:50
@laurenceisla
Copy link
Member Author

OK, this is what I got implemented so far. For example, using the tables in our spec test:

Factories <-02M-> processes <-M2M-> supervisors
curl 'localhost:3000/factories?select=name,...processes(processes:name,...supervisors(supervisors:name))'
[
 {
  "name": "Factory C",
  "processes": ["Process C1", "Process C2", "Process XX"],
  "supervisors": ["Peter", "Peter", null]
 },
 {
  "name": "Factory B",
  "process": ["Process B1", "Process B1", "Process B2", "Process B2"],
  "supervisors": ["Peter", "Sarah", "Mary", "John"]
 },
 {
  "name": "Factory A",
  "process": ["Process A1", "Process A2"],
  "supervisors": ["Mary", "John"]
 },
 {
  "name": "Factory D",
  "process": [null],
  "supervisors": [null]
 }
]
[
  {
  	"name":"Factory C",
  	"processes":["Process C1", "Process C2", "Process XX"],
  	"supervisors":[{"name": "Peter"}, {"name": "Peter"}, null]},
  {
  	"name":"Factory B",
  	"processes":["Process B1", "Process B1", "Process B2", "Process B2"],
  	"supervisors":[{"name": "Peter"}, {"name": "Sarah"}, {"name": "Mary"}, {"name": "John"}]},
  {
  	"name":"Factory A",
  	"processes":["Process A1", "Process A2"],
  	"supervisors":[{"name": "Mary"}, {"name": "John"}]},
  {
  	"name":"Factory D",
  	"processes":[null],
	"supervisors":[null]
  }
]

As I mentioned in previous comments, some values will repeat, since we're grouping by the factory "name" without doing a DISTINCT or NOT NULL. The next step would be to implement the .. operator as mentioned here: #3640 (comment), it shouldn't be too complicated.

There's a problem when the embeddings have no values, as seen in the "Factory D" example, which has no processes and no supervisors. This is the same issue as this SO question. One solution is to do a COALESCE(NULLIF(..., '[null]'), '[]'), but this does not take into consideration valid null values (the row exists but the column value is null). The best solution is in one of the answers (filtering by the PK of the relationship), but it doesn't seem like a trivial task.

@laurenceisla laurenceisla force-pushed the feat-spread-m2m branch 7 times, most recently from 9002110 to b3e5483 Compare September 18, 2024 23:37
Copy link
Member Author

@laurenceisla laurenceisla left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This feature should be ready for review now.

I'm leaving the .. for DISTINCT and NOT NULL for another PR to keep it cleaner. Edit: Nvm. I figured that it should be OK to include that feature here too, although in different commits.

Here are some comments on the changes done:

@laurenceisla laurenceisla changed the title feat: WIP allow spread operators in to-many relationships feat: allow spread operators in to-many relationships Sep 18, 2024
@laurenceisla laurenceisla marked this pull request as ready for review September 18, 2024 23:40
@laurenceisla laurenceisla force-pushed the feat-spread-m2m branch 4 times, most recently from 67e6419 to 87a13ef Compare September 25, 2024 22:14
@wolfgangwalther
Copy link
Member

I did not look at the code.

I'm not deep enough into the Planner and Query Builder to be able to properly review the code side of things. @steve-chavez could you look at it again? There have been quite some changes since your last review, I think.

Comment on lines 295 to 297
Unlike the to-one spreads, the columns inside a :ref:`to-many spread relationship <spread_to_many_embed>` are not treated as if they were part of the top-level resource.
The aggregates will be done :ref:`in the context of the to-many spread resource <aggregate_functions_embed_context>`.
For example:
Copy link
Member

@steve-chavez steve-chavez Jan 27, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These special cases make the feature really confusing. And this problem is a consequence of supporting multiple columns (#3640 (comment)), if we supported only a single column the feature would be simple.

And if users wanted multiple columns, they could still do:

...orders(amount),...orders(order_date)

Which already "conveys" why an order is necessary. If the user wants the columns coming from both orders if they want them to be correlated.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe some of the work it has been done here can be reused as a form of query optimization though. To not repeat work when two ...orders() are done.

But I would strongly suggest trying to eliminate these special cases.

Copy link
Member

@steve-chavez steve-chavez Jan 27, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe it's the way the documentation is presented though. I'll try to play with this and get back.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is a documentation issue. From looking at the test-cases, this felt very natural now.

Copy link
Member Author

@laurenceisla laurenceisla Jan 27, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is a documentation issue. From looking at the test-cases, this felt very natural now.

Yes I think the documentation needs to be improved. Maybe we could start with the "only aggregates" example, as it works exactly the same as to-one spreads. After that, we'd mention this case, and reword it to something like: "If you need to group by the columns inside the spread embed, then...`, then give the example.

...orders(amount.sum()),...orders(order_date)

(I added .sum() to the example) This can still be an alternative to not group by the columns. Just noticed that this doesn't work right now, because we don't allow aliases to the spread embed names. I'll fix that in another commit PR (to not complicate this one any more).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Docs are looking much better now. I think one problem tangential to this PR is that https://docs.postgrest.org/en/v12/references/api/aggregate_functions.html are too verbose. I made a pass at reducing verbosity on 33b69b5.

Copy link
Member Author

@laurenceisla laurenceisla Feb 14, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think one problem tangential to this PR is that https://docs.postgrest.org/en/v12/references/api/aggregate_functions.html are too verbose.

I gave it a go here: #3913

@jdgamble555
Copy link

Hey guys, what is preventing this from being approved?

@wolfgangwalther
Copy link
Member

Hey guys, what is preventing this from being approved?

I'd say this mostly needs a code review. Unfortunately, I am not deep enough into the haskell code in general right now, so I can't provide one.

@steve-chavez
Copy link
Member

steve-chavez commented Feb 13, 2025

I'd say this mostly needs a code review. Unfortunately, I am not deep enough into the haskell code in general right now, so I can't provide one.

The code looks good overall. The implementation is complex, but that's inevitable because it's a "big feature" (kinda like aggregate functions in #2925, or data representations #2523).

I think we just need to refine the docs. I've added some more feedback.

}
]

Note that the aggregate is now returned inside an array.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The two lines that follow don't seem to add much info or are confusing. I suggest to remove them or make them clearer.

Copy link
Member

@steve-chavez steve-chavez Feb 15, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As an aside, just to make sure I understand this.

If you have:

...orders(amount.sum())

That sum will be returned as a single value.

But if you add one column

....orders(amount.sum(),order_date)

Then sum will be returned as array right?

(That could be surprising and a a bit complicated to type in Typescript)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For consistency, I wonder if it would be better to always return an array. Even on cases of a single column spread.. an array of 1 might be more consistent. What would be the drawback?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the above is done, we could always say that a to-many spread returns an array. While to-one returns an object. That would make things more consistent IMO.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That sum will be returned as a single value.

But if you add one column

Then sum will be returned as array right?

Yes. The idea is that embedding an aggregation without grouping is essentially to-one, not to-many. We know it returns only a single row, so we shouldn't return an array.

(That could be surprising and a a bit complicated to type in Typescript)

I'm not sure how much relevance Typescript should have for our API output.

If the above is done, we could always say that a to-many spread returns an array. While to-one returns an object. That would make things more consistent IMO.

Yes, see above. That's essentially what we're doing. Adding an aggregation function to a "relation" changes that relation. An aggregated table is not the same as a non-aggregated table. Thus, when embedding an aggregated relation, it would also potentially have different to-many/to-one characteristics. I find this quite consistent.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Checking more thoroughly, the special case is not so much the above to-many-aggregate->to-one-response, but the hoisting of the to-one-aggregates.

I think we need to discuss a very basic question here: Do we want to error out on some cases with "not supported" or do we want to try to give a sane response in as many cases as possible?

We have at least the following cases:

  • to-one embed, no aggregate, no spread: returns an object
  • to-one embed, no aggregate, spread: spreads values into the parent object
  • to-one embed, with aggregate, no spread: returns an object (useless)
  • to-one embed, with aggregate, spread: hoisted into the parent object (currently discussed)
  • to-many embed, no aggregate, no spread: returns an array of objects
  • to-many embed, no aggregate, spread: spreads arrays of values into the parent object
  • to-many embed, with aggregate, no group by, no spread: returns single item array of object (inconsistent)
  • to-many embed, with aggregate, no group by, spread: spreads values into the parent object
  • to-many embed, with aggregate, group by, no spread: returns array of objects
  • to-many embed, with aggregate, group by, spread: spreads arrays of values into the parent object

"at least", because we don't make a difference with the number of columns in the embedding, but we could in theory :D.

There's multiple different ways to look at the list above:

  • "It's a mess": Everything looks different, we need to cut down on the cases, to be able to explain them in the docs.
  • "It's beautiful": We try to only return sensible stuff and try to capture the intent of the user to the best degree possible. We should remove inconsistencies and useless case and make them better.

I think we shouldn't discuss the individual cases back and forth, but instead decide on a general approach to take. I am in the "beautiful API" camp.

I think we should do the following:

  • Fix the "inconsistent" case (to-many, with aggregate, no group by, no spread) and make it return an object instead of an array of a single object. Return an array with a known-to-be-single-value is always bad - just think about how annoying that would be to type in Haskell, having a list, where you always need to catch cases that the list could be empty or have more items.. even though you know it can only contain exactly a single item.
  • Keep the "currently discussed" case about hoisting.
  • Actually take it a step further and change the "useless" case (to-one, with aggregate, no spread) and apply spreading automatically, because.. that's the only sensible thing to interpret this request.

Copy link
Member

@steve-chavez steve-chavez Feb 28, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we shouldn't discuss the individual cases back and forth, but instead decide on a general approach to take. I am in the "beautiful API" camp.

Doing it the way of the "beautiful API" scares me a lot because we already been through something similar with "duck typing" (#424, which then was removed in #1430, very painfully so as it was a major breaking change). It may capture "intent" for certain users, but some others might be confused (as me) because of the change of shape in the output.

Just think about how annoying that would be to type in Haskell, having a list, where you always need to catch cases that the list could be empty or have more items..

This was my point when referring to typescript above #3640 (comment), and yes a surprising change of output will be difficult to type in any client library in any language. That's why I think we need to be consistent, rather than beautiful.


Going back a bit, the whole "hoisting" and "automatic group by" confuses me a lot, and I assume users familiar with SQL will be too. I think we should make these explicit operations.

So a group by would be /tbl?select=id,cost.sum()&groupby=id and a "hoisted" groupby would be analogous to top-level ordering: /tbl?select=id,embed(cost.sum())&groupby=embed(id). This is much more consistent with our functionality, our whole top-level features could be a section in the docs now.

(An aggregate without a groupby wouldn't have to fail, I've floated the idea on how to do it before on #2066 (comment))

So yes, sometimes this might imply useless output, but if the user can quickly make sense of the syntax then it's easy to correct.

Overall I'd like to normalize the output of an aggregate, and not make it a "transformation" but a type of filtering. Only our spreads should be "transformation". This will help a lot with typing on clients too.

Copy link
Member

@wolfgangwalther wolfgangwalther Feb 28, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just think about how annoying that would be to type in Haskell, having a list, where you always need to catch cases that the list could be empty or have more items..

This was my point when referring to typescript above #3640 (comment), and yes a surprising change of output will be difficult to type in any client library in any language. That's why I think we need to be consistent, rather than beautiful.

Uh, that misses the point. This consistency leads to those "always-a-single-item lists". And those are hard to deal with in strongly typed languages. They might be easier to "type", but then harder to work with afterwards, imho. But that's just because... the typing is weaker. Weaker typing is always easier to "type".

and I assume users familiar with SQL will be too. I think we should make these explicit operations.

I think that both the TypeScript and the SQL arguments don't fly. We are not trying to create a TypeScript client for a PostgreSQL database. And we are not trying to expose SQL via http either.

We provide a REST API. This doesn't need to be "SQL-like" at all. Familiarity with SQL should give you exactly zero advantage when using this API. If it does.. then we designed the API in a bad way, because it suddenly requires SQL knowledge to use it well.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doing it the way of the "beautiful API" scares me a lot because we already been through something similar with "duck typing"

So IIUC correctly the problem back then was that you couldn't tell from the syntax of your request what the response would look like, correct?

This is certainly a problem with embeddings in general - and it still is today.

But that's not the case for the aggregation things we're discussing here, right? Those are predictable from syntax.

If you want consistency, then you'd need to return "to-one" embeddings as single-item array all the time. Because that's the part that is not predictable and hard to type.

But I don't think we want to go that way, right? (I certainly don't)


re TypeScript: My vision for first class types would be a custom root endpoint mimetype handler: Once we have split OpenApi off, I hope that we can work towards allowing multiple handlers for the root endpoint. Then, we can implement a handler that returns type definitions straight from the database. And those should be able to treat all those special cases much better, right?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm... I'm not yet convinced for or against the group_by proposal. I think adding explicit operations helps in understanding what is "expected" or not. But I still don't want to lose the simplicity of not needing a separate group_by when aggregating. Also when trying different queries I keep seeing/expecting the to-many aggregates as single responses (what was implemented before this convo), so I'm not against that either.

I'd like to complete the implementation of to-many spreads without aggregates in this PR (which got too big) and continue this debate in a different PR.

@steve-chavez
Copy link
Member

Just tried this:

$ curl 'localhost:3000/processes?select=*,...process_costs(cost.sum())'
{"code":"42803","details":null,"hint":null,"message":"column \"processes.id\" must appear in the GROUP BY clause or be used in an aggregate function"}

Is the error expected?

@laurenceisla
Copy link
Member Author

Is the error expected?

Oh, nice catch. It's not expected at all. This error is for to-one spreads, it's also failing in the main branch right now with the same error message, so it should be checked in a different issue/PR.

# for free to join this conversation on GitHub. Already have an account? # to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

Spread operator for many-to-many relationships, aliases, and aggregations
4 participants