OPIK-859: Reduce find spans query cost #1123

Merged 1 commit on Jan 23, 2025


thiagohora
Contributor

@thiagohora thiagohora commented Jan 23, 2025

Details

Despite the slight difference in parts and the large difference in granules read, I'm confident this new query is more efficient, since it postpones retrieving the fields that are not part of the sorting key. This has sped up the query considerably.

Before:

Expression (Project names)
  Limit
    LimitBy
      Expression ((Before LIMIT BY + (Before ORDER BY + Projection) [lifted up part]))
        Sorting (Sorting for ORDER BY)
          Expression ((Before ORDER BY + Projection))
            Expression
              ReadFromMergeTree (opik_prod.spans)
              Indexes:
                PrimaryKey
                  Keys: 
                    workspace_id
                    project_id
                  Condition: and((workspace_id in ['48d64607-f70d-4cb1-9207-dd658a423f8e', '48d64607-f70d-4cb1-9207-dd658a423f8e']), (project_id in ['019484c0-1a3c-7987-b2d4-0e3598e8717a', '019484c0-1a3c-7987-b2d4-0e3598e8717a']))
                  Parts: 16/20
                  Granules: 828/29114
[Screenshot 2025-01-23 at 14 46 11]

After:

CreatingSets (Create sets before main query execution)
  Expression ((Project names + (Before ORDER BY + Projection) [lifted up part]))
    Sorting (Sorting for ORDER BY)
      Expression ((Before ORDER BY + Projection))
        Expression
          ReadFromMergeTree (opik_prod.spans)
          Indexes:
            PrimaryKey
              Keys: 
                id
              Condition: (id in 200-element set)
              Parts: 14/14
              Granules: 25196/29116
[Screenshot 2025-01-23 at 14 49 01]
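To make the "big difference in granules" concrete, here is a small calculation using only the Parts and Granules figures copied from the two plans above. The dictionary layout and the helper name `selected_fraction` are illustrative, not part of the PR.

```python
# Parts and granules as (selected, total), copied from the two EXPLAIN outputs above.
plans = {
    "before": {"parts": (16, 20), "granules": (828, 29114)},
    "after": {"parts": (14, 14), "granules": (25196, 29116)},
}


def selected_fraction(selected: int, total: int) -> float:
    """Share of granules (or parts) the primary-key analysis selected."""
    return selected / total


granule_share = {
    name: selected_fraction(*plan["granules"]) for name, plan in plans.items()
}

for name, share in granule_share.items():
    print(f"{name}: {share:.1%} of granules selected")
```

By this measure the "before" read selects about 2.8% of granules and the "after" read about 86.5%, which is the difference the description refers to. The speed-up reported in the description therefore comes from something other than granule pruning on this step.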

Issues

OPIK-859

@thiagohora thiagohora requested a review from a team as a code owner January 23, 2025 13:41
Collaborator

@andrescrz andrescrz left a comment

This basically converts the JOIN to a SUBQUERY which seems to be more efficient in ClickHouse, or at least for this particular case.

Left some comments of things to double check or for the future, but we should be good to go.
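A minimal sketch of the pattern described above: first pick the ids of the requested page using only the filter and sort columns, then fetch the wide columns for just those ids. It runs on SQLite through Python's `sqlite3` as a stand-in for ClickHouse. The table layout, the `payload` column, and the parameter names are assumptions for illustration, not the PR's actual query.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE spans ("
    " id INTEGER PRIMARY KEY, workspace_id TEXT, project_id TEXT, payload TEXT)"
)
# Hypothetical data: odd ids belong to proj-1, even ids to proj-2.
rows = [
    (i, "ws-1", "proj-1" if i % 2 else "proj-2", f"large-payload-{i}")
    for i in range(1, 11)
]
conn.executemany("INSERT INTO spans VALUES (?, ?, ?, ?)", rows)

PAGE_SQL = """
SELECT id, payload
FROM spans
WHERE id IN (
    -- Inner query: reads only the filter and sort columns to choose the page.
    SELECT id
    FROM spans
    WHERE workspace_id = :workspace_id
      AND project_id = :project_id
    ORDER BY id DESC
    LIMIT :limit
)
ORDER BY id DESC
"""

page = conn.execute(
    PAGE_SQL, {"workspace_id": "ws-1", "project_id": "proj-1", "limit": 3}
).fetchall()
print(page)
```

The outer query touches the wide `payload` column only for the ids the inner query selected, which is the "postponed retrieval" idea from the description. How much this helps in ClickHouse depends on the table's sorting key and column sizes.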

Comment on lines +530 to +533
if(end_time IS NOT NULL AND start_time IS NOT NULL
AND notEquals(start_time, toDateTime64('1970-01-01 00:00:00.000', 9)),
(dateDiff('microsecond', start_time, end_time) / 1000.0),
NULL) AS duration_millis
Collaborator

Do you still need this duration_millis field here?

Contributor Author

It is still needed because there is a dynamic filter that assumes this field exists in the query. We can probably remove it by turning it into a materialized column instead, which would also make the query simpler.

Collaborator

OK, I was not clear about its use case, but I understand now. No need to change anything at the moment.
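For readers following the thread, here is a Python rendering of what the `duration_millis` expression in the diff computes. It is a sketch under two assumptions: the epoch check guards against an unset `start_time` (the default value of a non-nullable DateTime64 column in ClickHouse), and `dateDiff('microsecond', ...)` yields a whole number of microseconds. The function below is illustrative and not code from the PR.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

# Assumed sentinel: an unset start_time appears as the Unix epoch.
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def duration_millis(
    start_time: Optional[datetime], end_time: Optional[datetime]
) -> Optional[float]:
    """Mirror of the SQL if(...): NULL unless both timestamps are usable."""
    if start_time is None or end_time is None or start_time == EPOCH:
        return None
    # Whole microseconds between the two timestamps, then scaled to milliseconds.
    micros = (end_time - start_time) // timedelta(microseconds=1)
    return micros / 1000.0


start = datetime(2025, 1, 23, 13, 0, 0, tzinfo=timezone.utc)
print(duration_millis(start, start + timedelta(seconds=1.5)))
print(duration_millis(EPOCH, start))
```

A materialized column, as suggested above, would move this computation to insert time, so the query would no longer need to carry the expression itself.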

SELECT *
FROM feedback_scores
WHERE entity_type = 'span'
AND project_id = :project_id
Collaborator

In the future, we can filter by workspace_id here as well.

Contributor Author

Agreed, I will push this in a follow-up PR.
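A sketch of what the suggested extra predicate could look like, run on SQLite through Python's `sqlite3` as a stand-in for ClickHouse. The follow-up PR is not shown in this thread, so the columns `entity_id`, `name`, and `value`, the `:workspace_id` parameter, and the sample rows are all assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE feedback_scores ("
    " entity_id INTEGER, entity_type TEXT, workspace_id TEXT,"
    " project_id TEXT, name TEXT, value REAL)"
)
# Hypothetical rows: one matching span score, one trace score, one other workspace.
conn.executemany(
    "INSERT INTO feedback_scores VALUES (?, ?, ?, ?, ?, ?)",
    [
        (1, "span", "ws-1", "proj-1", "accuracy", 0.9),
        (2, "trace", "ws-1", "proj-1", "accuracy", 0.4),
        (3, "span", "ws-2", "proj-9", "accuracy", 0.7),
    ],
)

SCORES_SQL = """
SELECT *
FROM feedback_scores
WHERE entity_type = 'span'
  AND workspace_id = :workspace_id  -- extra predicate suggested in the review
  AND project_id = :project_id
"""

scores = conn.execute(
    SCORES_SQL, {"workspace_id": "ws-1", "project_id": "proj-1"}
).fetchall()
print(scores)
```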

<if(filters)> AND <filters> <endif>
<if(feedback_scores_filters)>
AND id in (
WHERE id IN (
Collaborator

Let's double check if we need a similar optimisation for the related count query.

Contributor Author

The count query follows the same structure we are using now (a subquery returning id and duration_millis) and then counts only the id, so we should be fine.
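A sketch of the count-query shape described above: the subquery exposes only `id` and `duration_millis` (so that a dynamic filter on duration can still apply), and the outer query counts ids. It runs on SQLite through Python's `sqlite3` as a stand-in for ClickHouse. Storing timestamps as integer microseconds (`start_us`, `end_us`) and the `:min_duration_millis` filter are assumptions made to keep the example self-contained.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE spans ("
    " id INTEGER PRIMARY KEY, project_id TEXT, start_us INTEGER, end_us INTEGER)"
)
# Hypothetical rows; a start_us of 0 plays the role of the epoch sentinel.
conn.executemany(
    "INSERT INTO spans VALUES (?, ?, ?, ?)",
    [
        (1, "p1", 1_000_000, 1_250_000),  # 250 ms
        (2, "p1", 0, 5_000_000),          # unset start: duration is NULL
        (3, "p1", 2_000_000, None),       # still running: duration is NULL
        (4, "p1", 3_000_000, 3_010_000),  # 10 ms
        (5, "p2", 1_000_000, 9_000_000),  # different project
    ],
)

COUNT_SQL = """
SELECT count(id) AS total
FROM (
    -- Narrow subquery: only the id plus the field the dynamic filter needs.
    SELECT id,
           CASE WHEN end_us IS NOT NULL AND start_us IS NOT NULL AND start_us <> 0
                THEN (end_us - start_us) / 1000.0
           END AS duration_millis
    FROM spans
    WHERE project_id = :project_id
) AS s
WHERE duration_millis >= :min_duration_millis
"""

total = conn.execute(
    COUNT_SQL, {"project_id": "p1", "min_duration_millis": 100}
).fetchone()[0]
print(total)
```

Only span 1 has a duration of at least 100 ms in project `p1`, so the count is 1. Rows with a NULL duration drop out of the comparison.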

@thiagohora thiagohora merged commit 34062db into main Jan 23, 2025
8 checks passed
@thiagohora thiagohora deleted the thiagohora/OPIK-859_reduce_find_spans_query_cost branch January 23, 2025 15:20