feat: allow spread operators in to-many relationships #3640

laurenceisla · 2024-07-05T00:56:21Z

laurenceisla · 2024-07-06T00:13:57Z

My approach right now is to generate this query for a to-many request:

curl 'localhost:3000/clients?select=name,...projects(name,id)'

SELECT "test"."clients"."name",
       "clients_projects_1"."name",
       "clients_projects_1"."id"
FROM "test"."clients"
LEFT JOIN LATERAL (
  SELECT json_agg("projects_1"."name") AS "name",
         json_agg("projects_1"."id") AS "id"
  FROM "test"."projects" AS "projects_1"
  WHERE "projects_1"."client_id" = "test"."clients"."id"
) AS "clients_projects_1" ON TRUE

Right now this gives the expected result. But aggregates are not working correctly, because they are designed to be selected in the top query with a GROUP BY. A solution would be to not do the json_agg() inside the sub-query and do it in the top one and treat it as another aggregate (with GROUP BY). Like this:

SELECT "test"."clients"."name",
      json_agg("clients_projects_1"."name") AS "name",
      json_agg("clients_projects_1"."id") AS "id"
FROM "test"."clients"
LEFT JOIN LATERAL (
  SELECT "projects_1"."name",
         "projects_1"."id"
  FROM "test"."projects" AS "projects_1"
  WHERE "projects_1"."client_id" = "test"."clients"."id"
) AS "clients_projects_1" ON TRUE
GROUP BY "test"."clients"."name"

Not sure which one is better/easier right now... I'm thinking the latter.

src/PostgREST/Error.hs

wolfgangwalther · 2024-07-06T09:29:21Z

Right now this gives the expected result. But aggregates are not working correctly, because they are designed to be selected in the top query with a GROUP BY. A solution would be to not do the json_agg() inside the sub-query and do it in the top one and treat it as another aggregate (with GROUP BY).

Having the json_agg in the outer query would make the query cleaner, imho.

laurenceisla · 2024-07-17T01:19:59Z

Some caveats I encountered:

Repeated values and order

Do we want to keep repeated values in the results? For example (not the best use case, just to illustrate):

curl 'localhost:3000/project?select=name,...tasks(tasks:name,due_dates:due_date)'

[
  {
    "name": "project 1",
    "tasks": ["task 1", "task 2", "task 3", "task 4"],
    "due_dates": [null, "2024-08-08", "2024-08-08", null]
  }
]

Here we're repeating null and "2024-08-08", so maybe we don't want to do this and just return [null, "2024-08-08"] (perhaps also remove null values?). Doing this will not guarantee the same dimensions for the same aggregated relationship, and definitely not the order of the results (which wasn't guaranteed before either). Doing a DISTINCT inside the json_agg() is a possible solution (the next caveat has an example query).

Nested To-Many Spreads

I have a doubt on what to expect with nested to-many spreads. For example, on a non-nested to-many spread like this one:

curl 'localhost:3000/entities?select=name,...child_entities(children:name)'

We would expect:

[
  {"name": "entity 1", "children": ["child entity 1", "child entity 2"]},
  {"name": "entity 2", "children": ["child entity 3"]},
  "..."
]

But what if we nest another to-many spread embedding with a new column to aggregate:

curl 'localhost:3000/entities?select=name,...child_entities(children:name,...grandchild_entities(grandchildren:name))'

I understand that we're hoisting all the aggregates to the top level, and not grouping by the intermediate columns (entities.name), because they should also be aggregated. I'm assuming that the result should be the same as above but also with the aggregated grandchild_entities.name.

[
  {"name": "entity 1", "children": ["child entity 1", "child entity 2"], "grandchildren": ["grandchild entity 1", "grandchild entity 2", "..."]},
  {"name": "entity 2", "children": ["child entity 3"], "grandchildren": []},
  "..."
]

This cannot be achieved by a simple GROUP BY, because duplicated values will be returned by entities.name (which, perhaps, is not what we want). A solution would also be to use DISTINCT. The query would look like this:

SELECT "api"."entities"."name",
       json_agg(DISTINCT "entities_child_entities_1"."children") AS "children",
       json_agg(DISTINCT "entities_child_entities_1"."grandchildren") AS "grandchildren"
FROM "api"."entities"
LEFT JOIN LATERAL (
  SELECT "child_entities_1"."name" AS "children",
         "child_entities_grandchild_entities_2"."grandchildren" AS "grandchildren"
  FROM "api"."child_entities" AS "child_entities_1"
  LEFT JOIN LATERAL (
    SELECT "grandchild_entities_2"."name" AS "grandchildren"
    FROM "api"."grandchild_entities" AS "grandchild_entities_2"
    WHERE "grandchild_entities_2"."parent_id" = "child_entities_1"."id"
  ) AS "child_entities_grandchild_entities_2" ON TRUE
  WHERE "child_entities_1"."parent_id" = "api"."entities"."id"
) AS "entities_child_entities_1" ON TRUE
GROUP BY "api"."entities"."name";

If there is no sensible interpretation of the query, another option is to prohibit these intermediate columns altogether (aggregates like sum, avg, etc. should still be possible).

src/PostgREST/Plan.hs

laurenceisla · 2024-09-07T01:54:20Z

OK, this is what I got implemented so far. For example, using the tables in our spec test:

Factories <-02M-> processes <-M2M-> supervisors

curl 'localhost:3000/factories?select=name,...processes(processes:name,...supervisors(supervisors:name))'

[
 {
  "name": "Factory C",
  "processes": ["Process C1", "Process C2", "Process XX"],
  "supervisors": ["Peter", "Peter", null]
 },
 {
  "name": "Factory B",
  "process": ["Process B1", "Process B1", "Process B2", "Process B2"],
  "supervisors": ["Peter", "Sarah", "Mary", "John"]
 },
 {
  "name": "Factory A",
  "process": ["Process A1", "Process A2"],
  "supervisors": ["Mary", "John"]
 },
 {
  "name": "Factory D",
  "process": [null],
  "supervisors": [null]
 }
]⏎

[
  {
  	"name":"Factory C",
  	"processes":["Process C1", "Process C2", "Process XX"],
  	"supervisors":[{"name": "Peter"}, {"name": "Peter"}, null]},
  {
  	"name":"Factory B",
  	"processes":["Process B1", "Process B1", "Process B2", "Process B2"],
  	"supervisors":[{"name": "Peter"}, {"name": "Sarah"}, {"name": "Mary"}, {"name": "John"}]},
  {
  	"name":"Factory A",
  	"processes":["Process A1", "Process A2"],
  	"supervisors":[{"name": "Mary"}, {"name": "John"}]},
  {
  	"name":"Factory D",
  	"processes":[null],
	"supervisors":[null]
  }
]

As I mentioned in previous comments, some values will repeat, since we're grouping by the factory "name" without doing a DISTINCT or NOT NULL. The next step would be to implement the .. operator as mentioned here: #3640 (comment), it shouldn't be too complicated.

There's a problem when the embeddings have no values, as seen in the "Factory D" example, which has no processes and no supervisors. This is the same issue as this SO question. One solution is to do a COALESCE(NULLIF(..., '[null]'), '[]'), but this does not take into consideration valid null values (the row exists but the column value is null). The best solution is in one of the answers (filtering by the PK of the relationship), but it doesn't seem like a trivial task.

laurenceisla

This feature should be ready for review now.

~~I'm leaving the .. for DISTINCT and NOT NULL for another PR to keep it cleaner.~~ Edit: Nvm. I figured that it should be OK to include that feature here too, although in different commits.

Here are some comments on the changes done:

src/PostgREST/Query/SqlFragment.hs

src/PostgREST/ApiRequest/Types.hs

src/PostgREST/Plan/ReadPlan.hs

docs/references/api/resource_embedding.rst

wolfgangwalther · 2025-01-25T10:49:48Z

I did not look at the code.

I'm not deep enough into the Planner and Query Builder to be able to properly review the code side of things. @steve-chavez could you look at it again? There have been quite some changes since your last review, I think.

steve-chavez · 2025-01-27T17:38:08Z

docs/references/api/aggregate_functions.rst

+Unlike the to-one spreads, the columns inside a :ref:`to-many spread relationship <spread_to_many_embed>` are not treated as if they were part of the top-level resource.
+The aggregates will be done :ref:`in the context of the to-many spread resource <aggregate_functions_embed_context>`.
+For example:


These special cases make the feature really confusing. And this problem is a consequence of supporting multiple columns (#3640 (comment)), if we supported only a single column the feature would be simple.

And if users wanted multiple columns, they could still do:

...orders(amount),...orders(order_date)

Which already "conveys" why an order is necessary. If the user wants the columns coming from both orders if they want them to be correlated.

Maybe some of the work it has been done here can be reused as a form of query optimization though. To not repeat work when two ...orders() are done.

But I would strongly suggest trying to eliminate these special cases.

Maybe it's the way the documentation is presented though. I'll try to play with this and get back.

I think this is a documentation issue. From looking at the test-cases, this felt very natural now.

I think this is a documentation issue. From looking at the test-cases, this felt very natural now.

Yes I think the documentation needs to be improved. Maybe we could start with the "only aggregates" example, as it works exactly the same as to-one spreads. After that, we'd mention this case, and reword it to something like: "If you need to group by the columns inside the spread embed, then...`, then give the example.

...orders(amount.sum()),...orders(order_date)

(I added .sum() to the example) This can still be an alternative to not group by the columns. Just noticed that this doesn't work right now, because we don't allow aliases to the spread embed names. I'll fix that in another ~~commit~~ PR (to not complicate this one any more).

Docs are looking much better now. I think one problem tangential to this PR is that https://docs.postgrest.org/en/v12/references/api/aggregate_functions.html are too verbose. I made a pass at reducing verbosity on 33b69b5.

I think one problem tangential to this PR is that https://docs.postgrest.org/en/v12/references/api/aggregate_functions.html are too verbose.

I gave it a go here: #3913

test/spec/Feature/Query/SpreadQueriesSpec.hs

jdgamble555 · 2025-02-11T02:05:36Z

Hey guys, what is preventing this from being approved?

wolfgangwalther · 2025-02-11T20:39:33Z

Hey guys, what is preventing this from being approved?

I'd say this mostly needs a code review. Unfortunately, I am not deep enough into the haskell code in general right now, so I can't provide one.

docs/references/api/resource_embedding.rst

src/PostgREST/Plan.hs

steve-chavez · 2025-02-13T02:03:47Z

I'd say this mostly needs a code review. Unfortunately, I am not deep enough into the haskell code in general right now, so I can't provide one.

The code looks good overall. The implementation is complex, but that's inevitable because it's a "big feature" (kinda like aggregate functions in #2925, or data representations #2523).

I think we just need to refine the docs. I've added some more feedback.

docs/references/api/aggregate_functions.rst

steve-chavez · 2025-02-15T02:05:20Z

docs/references/api/aggregate_functions.rst

+    }
+  ]
+
+Note that the aggregate is now returned inside an array.


The two lines that follow don't seem to add much info or are confusing. I suggest to remove them or make them clearer.

As an aside, just to make sure I understand this.

If you have:

...orders(amount.sum())

That sum will be returned as a single value.

But if you add one column

....orders(amount.sum(),order_date)

Then sum will be returned as array right?

(That could be surprising and a a bit complicated to type in Typescript)

For consistency, I wonder if it would be better to always return an array. Even on cases of a single column spread.. an array of 1 might be more consistent. What would be the drawback?

If the above is done, we could always say that a to-many spread returns an array. While to-one returns an object. That would make things more consistent IMO.

That sum will be returned as a single value.

But if you add one column

Then sum will be returned as array right?

Yes. The idea is that embedding an aggregation without grouping is essentially to-one, not to-many. We know it returns only a single row, so we shouldn't return an array.

(That could be surprising and a a bit complicated to type in Typescript)

I'm not sure how much relevance Typescript should have for our API output.

If the above is done, we could always say that a to-many spread returns an array. While to-one returns an object. That would make things more consistent IMO.

Yes, see above. That's essentially what we're doing. Adding an aggregation function to a "relation" changes that relation. An aggregated table is not the same as a non-aggregated table. Thus, when embedding an aggregated relation, it would also potentially have different to-many/to-one characteristics. I find this quite consistent.

Checking more thoroughly, the special case is not so much the above to-many-aggregate->to-one-response, but the hoisting of the to-one-aggregates.

I think we need to discuss a very basic question here: Do we want to error out on some cases with "not supported" or do we want to try to give a sane response in as many cases as possible?

We have at least the following cases:

to-one embed, no aggregate, no spread: returns an object

to-one embed, no aggregate, spread: spreads values into the parent object

to-one embed, with aggregate, no spread: returns an object (useless)

to-one embed, with aggregate, spread: hoisted into the parent object (currently discussed)

to-many embed, no aggregate, no spread: returns an array of objects

to-many embed, no aggregate, spread: spreads arrays of values into the parent object

to-many embed, with aggregate, no group by, no spread: returns single item array of object (inconsistent)

to-many embed, with aggregate, no group by, spread: spreads values into the parent object

to-many embed, with aggregate, group by, no spread: returns array of objects

to-many embed, with aggregate, group by, spread: spreads arrays of values into the parent object

"at least", because we don't make a difference with the number of columns in the embedding, but we could in theory :D.

There's multiple different ways to look at the list above:

"It's a mess": Everything looks different, we need to cut down on the cases, to be able to explain them in the docs.

"It's beautiful": We try to only return sensible stuff and try to capture the intent of the user to the best degree possible. We should remove inconsistencies and useless case and make them better.

I think we shouldn't discuss the individual cases back and forth, but instead decide on a general approach to take. I am in the "beautiful API" camp.

I think we should do the following:

Fix the "inconsistent" case (to-many, with aggregate, no group by, no spread) and make it return an object instead of an array of a single object. Return an array with a known-to-be-single-value is always bad - just think about how annoying that would be to type in Haskell, having a list, where you always need to catch cases that the list could be empty or have more items.. even though you know it can only contain exactly a single item.

Keep the "currently discussed" case about hoisting.

Actually take it a step further and change the "useless" case (to-one, with aggregate, no spread) and apply spreading automatically, because.. that's the only sensible thing to interpret this request.

I think we shouldn't discuss the individual cases back and forth, but instead decide on a general approach to take. I am in the "beautiful API" camp.

Doing it the way of the "beautiful API" scares me a lot because we already been through something similar with "duck typing" (#424, which then was removed in #1430, very painfully so as it was a major breaking change). It may capture "intent" for certain users, but some others might be confused (as me) because of the change of shape in the output.

Just think about how annoying that would be to type in Haskell, having a list, where you always need to catch cases that the list could be empty or have more items..

This was my point when referring to typescript above #3640 (comment), and yes a surprising change of output will be difficult to type in any client library in any language. That's why I think we need to be consistent, rather than beautiful.

Going back a bit, the whole "hoisting" and "automatic group by" confuses me a lot, and I assume users familiar with SQL will be too. I think we should make these explicit operations.

So a group by would be /tbl?select=id,cost.sum()&groupby=id and a "hoisted" groupby would be analogous to top-level ordering: /tbl?select=id,embed(cost.sum())&groupby=embed(id). This is much more consistent with our functionality, our whole top-level features could be a section in the docs now.

(An aggregate without a groupby wouldn't have to fail, I've floated the idea on how to do it before on #2066 (comment))

So yes, sometimes this might imply useless output, but if the user can quickly make sense of the syntax then it's easy to correct.

Overall I'd like to normalize the output of an aggregate, and not make it a "transformation" but a type of filtering. Only our spreads should be "transformation". This will help a lot with typing on clients too.

Just think about how annoying that would be to type in Haskell, having a list, where you always need to catch cases that the list could be empty or have more items..

This was my point when referring to typescript above #3640 (comment), and yes a surprising change of output will be difficult to type in any client library in any language. That's why I think we need to be consistent, rather than beautiful.

Uh, that misses the point. This consistency leads to those "always-a-single-item lists". And those are hard to deal with in strongly typed languages. They might be easier to "type", but then harder to work with afterwards, imho. But that's just because... the typing is weaker. Weaker typing is always easier to "type".

and I assume users familiar with SQL will be too. I think we should make these explicit operations.

I think that both the TypeScript and the SQL arguments don't fly. We are not trying to create a TypeScript client for a PostgreSQL database. And we are not trying to expose SQL via http either.

We provide a REST API. This doesn't need to be "SQL-like" at all. Familiarity with SQL should give you exactly zero advantage when using this API. If it does.. then we designed the API in a bad way, because it suddenly requires SQL knowledge to use it well.

Doing it the way of the "beautiful API" scares me a lot because we already been through something similar with "duck typing"

So IIUC correctly the problem back then was that you couldn't tell from the syntax of your request what the response would look like, correct?

This is certainly a problem with embeddings in general - and it still is today.

But that's not the case for the aggregation things we're discussing here, right? Those are predictable from syntax.

If you want consistency, then you'd need to return "to-one" embeddings as single-item array all the time. Because that's the part that is not predictable and hard to type.

But I don't think we want to go that way, right? (I certainly don't)

re TypeScript: My vision for first class types would be a custom root endpoint mimetype handler: Once we have split OpenApi off, I hope that we can work towards allowing multiple handlers for the root endpoint. Then, we can implement a handler that returns type definitions straight from the database. And those should be able to treat all those special cases much better, right?

Hmm... I'm not yet convinced for or against the group_by proposal. I think adding explicit operations helps in understanding what is "expected" or not. But I still don't want to lose the simplicity of not needing a separate group_by when aggregating. Also when trying different queries I keep seeing/expecting the to-many aggregates as single responses (what was implemented before this convo), so I'm not against that either.

I'd like to complete the implementation of to-many spreads without aggregates in this PR (which got too big) and continue this debate in a different PR.

steve-chavez · 2025-02-28T00:28:53Z

Just tried this:

$ curl 'localhost:3000/processes?select=*,...process_costs(cost.sum())'
{"code":"42803","details":null,"hint":null,"message":"column \"processes.id\" must appear in the GROUP BY clause or be used in an aggregate function"}

Is the error expected?

laurenceisla · 2025-02-28T00:43:54Z

Is the error expected?

Oh, nice catch. It's not expected at all. This error is for to-one spreads, it's also failing in the main branch right now with the same error message, so it should be checked in a different issue/PR.

* Note: Aggregates are not implemented

laurenceisla mentioned this pull request Jul 5, 2024

Spread operator for many-to-many relationships, aliases, and aggregations #3041

Open

laurenceisla force-pushed the feat-spread-m2m branch from 1a02416 to dd89c12 Compare July 5, 2024 23:58

laurenceisla commented Jul 6, 2024

View reviewed changes

src/PostgREST/Error.hs Outdated Show resolved Hide resolved

laurenceisla force-pushed the feat-spread-m2m branch from dd89c12 to dce8597 Compare July 10, 2024 01:07

laurenceisla force-pushed the feat-spread-m2m branch 2 times, most recently from 6507878 to bd93514 Compare July 26, 2024 01:50

laurenceisla commented Jul 26, 2024

View reviewed changes

src/PostgREST/Plan.hs Outdated Show resolved Hide resolved

laurenceisla mentioned this pull request Aug 20, 2024

Fixes when using aggregates inside Spread Embedded Resources #3693

Merged

3 tasks

laurenceisla force-pushed the feat-spread-m2m branch 2 times, most recently from 38abc0e to 6e64707 Compare September 7, 2024 00:49

laurenceisla force-pushed the feat-spread-m2m branch 7 times, most recently from 9002110 to b3e5483 Compare September 18, 2024 23:37

laurenceisla commented Sep 18, 2024

View reviewed changes

src/PostgREST/Query/SqlFragment.hs Outdated Show resolved Hide resolved

src/PostgREST/Query/SqlFragment.hs Outdated Show resolved Hide resolved

src/PostgREST/ApiRequest/Types.hs Outdated Show resolved Hide resolved

src/PostgREST/Plan/ReadPlan.hs Outdated Show resolved Hide resolved

laurenceisla changed the title ~~feat: WIP allow spread operators in to-many relationships~~ feat: allow spread operators in to-many relationships Sep 18, 2024

laurenceisla marked this pull request as ready for review September 18, 2024 23:40

laurenceisla force-pushed the feat-spread-m2m branch 4 times, most recently from 67e6419 to 87a13ef Compare September 25, 2024 22:14

laurenceisla force-pushed the feat-spread-m2m branch from 87a13ef to 19466a8 Compare October 3, 2024 17:33

steve-chavez reviewed Oct 3, 2024

View reviewed changes

docs/references/api/resource_embedding.rst Show resolved Hide resolved

steve-chavez mentioned this pull request Jan 27, 2025

Schema cache speed went down after refactor #3882

Open

steve-chavez reviewed Jan 27, 2025

View reviewed changes

test/spec/Feature/Query/SpreadQueriesSpec.hs Show resolved Hide resolved

steve-chavez reviewed Jan 28, 2025

View reviewed changes

test/spec/Feature/Query/SpreadQueriesSpec.hs Show resolved Hide resolved

laurenceisla mentioned this pull request Jan 29, 2025

feat: allow aliases in spread embeddings #3885

Draft

laurenceisla force-pushed the feat-spread-m2m branch 2 times, most recently from 898cad3 to a51008f Compare February 4, 2025 01:02

laurenceisla requested a review from steve-chavez February 4, 2025 01:17

steve-chavez mentioned this pull request Feb 10, 2025

break: remove limited updated/delete feature #3907

Merged