Skip to content
New issue

Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? # to your account

Working query in EFCore 3.+ doesn't produce the same SQL query in EFCore 6/7+ #30938

Closed
FHTSean opened this issue May 20, 2023 · 13 comments
Closed

Comments

@FHTSean
Copy link

FHTSean commented May 20, 2023

Hello,

The following is the SQL query i am trying to produce in LINQ:

SELECT *
FROM `CustomerMainTable` AS `t`
WHERE `t`.`CustomerMainTableId` IN (
    SELECT MAX(`t0`.`CustomerMainTableId`)
    FROM `CustomerMainTable` AS `t0`
    GROUP BY `t0`.`CustomerMainTableCustomerId`
)

CustomerMainTableId is an autoincrement id, and CustomerMainTableCustomerId is the customer's unique ID. There can be multiple entries of the customer's transaction in the CustomerMainTableId, so the idea is to select the latest transaction for each customer by grouping and selecting the max id.

In entity framework core 3.1, this was achieved via:

var CustomerGroupQuery = _DB.CustomerMainTable.Where(p => _DB.CustomerMainTable
    .GroupBy(l => l.CustomerMainTableCustomerId)
    .Select(g => g.Max(c => c.CustomerMainTableId))
    .Contains(p.CustomerMainTableId));

However, the same query, when executed in EF Core 6 or 7 produces a wrong query in SQL:

SELECT *
FROM `CustomerMainTable` AS `t`
WHERE EXISTS (
    SELECT 1
    FROM `CustomerMainTable` AS `t0`
    GROUP BY `t0`.`CustomerMainTableCustomerId`
    HAVING MAX(`t0`.`CustomerMainTableId`) = `t`.`CustomerMainTableId`)

Interestingly enough, if we split the Linq query like so:

List<int> idlist = _DB.CustomerMainTable
    .GroupBy(l => l.CustomerMainTableCustomerId)
    .Select(g => g.Max(c => c.CustomerMainTableId)).ToList();

And then feed the list of integers back to the contains query:

var CustomerGroupQuery = _DB.CustomerMainTable.Where(p => idlist.Contains(p.CustomerMainTableId));

It produces the correct query. However, feeding the list of integers back to the contains query means this is a client side evaluation which is not ideal.

@ajcvickers
Copy link
Contributor

/cc @roji

@FHTSean FHTSean changed the title Working query in .net 3.1 doesn't produce the same SQL query in .net 7 Working query in EFCore 3.+ doesn't produce the same SQL query in EFCore 6/7+ May 24, 2023
@roji
Copy link
Member

roji commented May 25, 2023

@FHTSean how exactly is the second query wrong, they seem like equivalent SQLs that should return the same results? Are you seeing different rows returned by the two SQLs? If so, can you share a minimal, runnable repro or at least the kind of data showing the difference?

Note #30955 which I opened (before noticing this!), which would change our translation back to using IN instead of EXISTS in most cases - but AFAIK that's a SQL optimization (and possibly performance) change, rather than a correctness change.

@FHTSean
Copy link
Author

FHTSean commented May 25, 2023

The first query and second query produces vastly different results.
The first query scanned a few thousand rows and produces the correct count of customers in the database - around 2000 distinct customers.

The second query scanned about 200 million rows and produces 45 thousand customers from 2000 distinct customers.

@roji
Copy link
Member

roji commented May 25, 2023

Can you provide a minimal sample showing the different results?

@FHTSean
Copy link
Author

FHTSean commented May 25, 2023

customermaintable.txt
mysqlefcore

@FHTSean
Copy link
Author

FHTSean commented May 25, 2023

All i've done is create a table with two columns - CustomerMainTableId and CustomerMainTableCustomerId. I've used a loop to insert customer ids 1->2000 4 times, producing 8 thousand records.

There are 2000 distinct customer ids - and the screenshot shows that the 2nd query with the "Exists" statement examines the row 56 million times.

@roji
Copy link
Member

roji commented May 26, 2023

All i've done is create a table with two columns - CustomerMainTableId and CustomerMainTableCustomerId. I've used a loop to insert customer ids 1->2000 4 times, producing 8 thousand records.

Can you please share this (e.g. a SQL script to create the table with the dummy data), plus a minimal code sample, so that I can reproduce the same results here?

There are 2000 distinct customer ids - and the screenshot shows that the 2nd query with the "Exists" statement examines the row 56 million times.

The performance of the query is one question - it's indeed possible that one variant is faster than the other. But before that I'm concerned with its correctness, since you indicated above that you're seeing different results coming. This is what I'd like to understand better.

@FHTSean
Copy link
Author

FHTSean commented May 26, 2023

Hi Roji,
I am unable to reproduce the 45 k customers - you are correct, the query does appear to be accurate. Apologies for that - i must've made some modifications to the query while attempting to debug it.

Happy to move on to the performance bit.

@roji
Copy link
Member

roji commented May 26, 2023

Great, thanks for confirming about the correctness.

Re the performance - this comes at exactly the right time: I'm currently working on #30955 which should fix this. It would still be helpful for me to get a clear repro from you to confirm the perf improvement, i.e. a dummy database create script + two SQLs which perform differently.

@roji
Copy link
Member

roji commented May 28, 2023

@FHTSean PR #30976 is out to generate IN instead of EXISTS where possible. I'm interested in testing its effect on your scenario, so having the above repro from you would be very useful.

@FHTSean
Copy link
Author

FHTSean commented May 28, 2023

Hi Roji,
I've already included the dummy database create script above as: https://github.com/dotnet/efcore/files/11570096/customermaintable.txt

The two different SQLs were already provided above as:

SELECT *
FROM `CustomerMainTable` AS `t`
WHERE `t`.`CustomerMainTableId` IN (
    SELECT MAX(`t0`.`CustomerMainTableId`)
    FROM `CustomerMainTable` AS `t0`
    GROUP BY `t0`.`CustomerMainTableCustomerId`
)

AND

SELECT *
FROM `CustomerMainTable` AS `t`
WHERE EXISTS (
    SELECT 1
    FROM `CustomerMainTable` AS `t0`
    GROUP BY `t0`.`CustomerMainTableCustomerId`
    HAVING MAX(`t0`.`CustomerMainTableId`) = `t`.`CustomerMainTableId`)

Are you asking for something different?

@roji
Copy link
Member

roji commented May 29, 2023

@FHTSean thanks, I didn't notice the dummy database.

I could indeed reproduce the dramatic performance differences on MariaDB - and also less dramatic (but still very significant) differences on other databases. I've posted the full findings in #30955 (comment), we'll definitely be fixing this for 8.0.

Thanks again for raising this - it's always great to get clear, targeted issues with full repros as you've provided. I'll go ahead and close this as a duplicate of #30955.

@roji
Copy link
Member

roji commented May 29, 2023

Duplicate of #30955

@roji roji marked this as a duplicate of #30955 May 29, 2023
@roji roji closed this as not planned Won't fix, can't repro, duplicate, stale May 29, 2023
# for free to join this conversation on GitHub. Already have an account? # to comment
Projects
None yet
Development

No branches or pull requests

3 participants