Session Aggregates are randomly dropped #2967

Closed
Dav1dde opened this issue Jan 18, 2024 · 8 comments

@Dav1dde
Member

Dav1dde commented Jan 18, 2024

Session aggregates are being dropped/lost before they reach Kafka/Storage.

We could verify by peeking into the Kafka Topic that the aggregates (as metrics) never make it into Kafka.

From the Kafka Topic:

 (datetime.datetime(2024, 1, 18, 10, 9), 30.0),
 (datetime.datetime(2024, 1, 18, 10, 11), 30.0),
 (datetime.datetime(2024, 1, 18, 10, 12), 30.0),
 (datetime.datetime(2024, 1, 18, 10, 14), 30.0),
 (datetime.datetime(2024, 1, 18, 10, 17), 30.0),
 (datetime.datetime(2024, 1, 18, 10, 19), 30.0),
 (datetime.datetime(2024, 1, 18, 10, 21), 30.0),
 (datetime.datetime(2024, 1, 18, 10, 23), 30.0),
 (datetime.datetime(2024, 1, 18, 10, 25), 30.0),
 (datetime.datetime(2024, 1, 18, 10, 29), 30.0),
 (datetime.datetime(2024, 1, 18, 10, 31), 30.0),
 (datetime.datetime(2024, 1, 18, 10, 32), 30.0),
 (datetime.datetime(2024, 1, 18, 10, 34), 30.0),
 (datetime.datetime(2024, 1, 18, 10, 35), 30.0),
 (datetime.datetime(2024, 1, 18, 10, 40), 30.0),
 (datetime.datetime(2024, 1, 18, 10, 49), 30.0),
 (datetime.datetime(2024, 1, 18, 10, 56), 30.0)]

In the UI:
image

Expected to see one entry every minute with a value of 30.
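
The gaps can be listed directly from the timestamps above. This is a minimal sketch, assuming the one-aggregate-per-minute cadence described here and looking only at the window covered by the listing (which may be an excerpt). The minute values are copied by hand from the Kafka output, and `missing_minutes` is a hypothetical helper name, not part of any Sentry tooling:

```python
from datetime import datetime, timedelta

# Minutes (10:MM on 2024-01-18) seen in the Kafka topic, copied from the listing above.
observed_minutes = [9, 11, 12, 14, 17, 19, 21, 23, 25, 29, 31, 32, 34, 35, 40, 49, 56]
observed = [datetime(2024, 1, 18, 10, m) for m in observed_minutes]


def missing_minutes(timestamps):
    """Return every whole minute between the first and last timestamp that is absent."""
    seen = set(timestamps)
    current, last = min(timestamps), max(timestamps)
    missing = []
    while current <= last:
        if current not in seen:
            missing.append(current)
        current += timedelta(minutes=1)
    return missing


gaps = missing_minutes(observed)
print(f"{len(observed)} of {len(observed) + len(gaps)} expected aggregates arrived")
```

For the listing above, this reports 17 of the 48 minutes between 10:09 and 10:56 as present.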

Script which generates an envelope containing a session aggregate every minute:

#!/usr/bin/env bash


URL="https://<INGEST URL>/api/<PROJECT ID>/envelope/?sentry_key=<KEY>&sentry_version=7&sentry_client=sentry.javascript.browser%2F7.93.0"

RELEASE="r1"
ENVIRONMENT="production"
USER_AGENT="Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"

while true; do
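    # Envelope header timestamp in UTC; the millisecond part is hardcoded.
    DT=$(date -u +'%Y-%m-%dT%H:%M:%S.398Z')
    echo "$DT"

    # Three lines: envelope header, item header, session aggregate payload.
    PAYLOAD="{\"sent_at\":\"${DT}\",\"sdk\":{\"name\":\"sentry.javascript.browser\",\"version\":\"7.93.0\"}}
{\"type\":\"sessions\"}
{\"attrs\":{\"release\":\"$RELEASE\", \"environment\":\"$ENVIRONMENT\", \"user_agent\":\"$USER_AGENT\"},\"aggregates\":[{\"started\":\"$(date -u +'%Y-%m-%dT%H:%M:00.000Z')\",\"exited\":100}]}"

    # Quote the variable so the newlines between envelope lines are printed as-is.
    echo "$PAYLOAD"
    # The empty 'User-Agent:' and 'Accept:' headers remove curl's default headers.
    curl -H 'User-Agent:' -H 'Accept:' -H "Content-Type: application/x-sentry-envelope" -X POST "$URL" -d "$PAYLOAD"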

    echo ""
    sleep 60
done
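
The escaped JSON in the script is hard to read, so here is a sketch of the same three-line envelope body built in Python. It only constructs the string and sends nothing. `build_session_envelope` is a hypothetical helper name; the field names and the hardcoded `.398` millisecond suffix are taken from the script above:

```python
import json
from datetime import datetime, timezone


def build_session_envelope(release, environment, user_agent, exited, now=None):
    """Build the newline-separated envelope body that the shell script posts."""
    now = now or datetime.now(timezone.utc)
    # Line 1: envelope header. The script hardcodes the millisecond part.
    envelope_header = {
        "sent_at": now.strftime("%Y-%m-%dT%H:%M:%S.398Z"),
        "sdk": {"name": "sentry.javascript.browser", "version": "7.93.0"},
    }
    # Line 2: item header announcing a session aggregates item.
    item_header = {"type": "sessions"}
    # Line 3: item payload with one aggregate bucket, truncated to the minute.
    payload = {
        "attrs": {
            "release": release,
            "environment": environment,
            "user_agent": user_agent,
        },
        "aggregates": [
            {"started": now.strftime("%Y-%m-%dT%H:%M:00.000Z"), "exited": exited}
        ],
    }
    return "\n".join(json.dumps(part) for part in (envelope_header, item_header, payload))


body = build_session_envelope("r1", "production", "Mozilla/5.0", exited=100)
print(body)
```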

I was not able to reproduce this with a local Sentry and Relay instance; Relay consistently delivered the metrics into the Kafka topic.

The local instance had the following non-standard settings:

SENTRY_FEATURES['organizations:metrics-extraction'] = True
SENTRY_FEATURES['organizations:transaction-metrics-extraction'] = True
SENTRY_FEATURES['organizations:custom-metrics'] = True
SENTRY_FEATURES['organizations:ddm-ui'] = True
SENTRY_FEATURES['organizations:release-health-drop-sessions'] = True

SENTRY_RELEASE_HEALTH = "sentry.release_health.metrics.MetricsReleaseHealthBackend"
SENTRY_RELEASE_MONITOR = "sentry.release_health.release_monitor.metrics.MetricReleaseMonitorBackend"

I manually verified that the sessionMetrics setting in the project config matches the production one: version: 2, drop: true.

@Dav1dde
Member Author

Dav1dde commented Jan 18, 2024

Ingesting the session aggregates directly into processing relays (skipping pops) with the same script as above does not show this issue (using a value of 100 instead of 30):

image

The test was performed by sending the payload to a single Relay instance every time; this does not rule out that there may be a problem with non-cached Project Configs.

Dav1dde added the bug label Jan 18, 2024
@Dav1dde
Member Author

Dav1dde commented Jan 18, 2024

Still no issues when running it against random processing relays (should rule out issues with non-cached project configs):

image

@Dav1dde
Member Author

Dav1dde commented Jan 18, 2024

It appears that running the script against a single pop instance also works.

@Dav1dde
Member Author

Dav1dde commented Jan 18, 2024

Running the script against random pop instances shows the same behaviour:

image

@jjbayer
Member

jjbayer commented Jan 23, 2024

Did not notice anything obvious in the code (session aggregate -> aggregation -> submission via global metrics endpoint), will try to repro the behavior with an integration test now.

@jjbayer
Member

jjbayer commented Jan 23, 2024

There's definitely an edge case in which we aggregate & attempt to send metrics even though the project has expired (see #2987), but I doubt that it explains the behavior above.

jjbayer added a commit that referenced this issue Jan 24, 2024
Metrics that arrive through the global metrics endpoint do not currently
trigger a refetch of the project config. If a single processing relay
only receives metrics traffic (no envelopes) for a specific project,
metrics might get stuck in the pre-aggregator a.k.a. metrics buffer.

ref: #2967
@jjbayer
Member

jjbayer commented Jan 24, 2024

Seems fixed by #2987:

image

The problem seems to have been that project configs were not actively fetched when the only incoming traffic (from the processing relay's point of view) is metrics via the global endpoint. See linked PR for details.
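
To make the failure mode concrete, here is a toy model of the behaviour described above. It is only an illustration under stated assumptions: the class, its methods, and the `refetch_on_metrics` flag are all hypothetical and do not mirror Relay's actual code, and the real project config fetch is asynchronous rather than instant. It shows how buckets can sit in a buffer indefinitely when nothing but envelope traffic triggers a config fetch:

```python
class ToyProcessingRelay:
    """Toy model only; names and structure are illustrative, not Relay's code."""

    def __init__(self, refetch_on_metrics):
        self.refetch_on_metrics = refetch_on_metrics
        self.config_valid = False  # project config missing or expired
        self.buffer = []           # stand-in for the metrics buffer
        self.forwarded = []        # stand-in for what reaches Kafka

    def _fetch_config(self):
        # Assumption: the fetch succeeds immediately.
        self.config_valid = True

    def _flush(self):
        # Buffered buckets are only forwarded once a valid config is present.
        if self.config_valid:
            self.forwarded.extend(self.buffer)
            self.buffer.clear()

    def handle_envelope(self):
        # Envelope traffic triggers a project config fetch.
        self._fetch_config()
        self._flush()

    def handle_metrics(self, bucket):
        # Metrics arriving through the global metrics endpoint.
        self.buffer.append(bucket)
        if self.refetch_on_metrics:
            self._fetch_config()
        self._flush()


before = ToyProcessingRelay(refetch_on_metrics=False)
after = ToyProcessingRelay(refetch_on_metrics=True)
for relay in (before, after):
    for minute in range(3):
        relay.handle_metrics({"minute": minute, "exited": 30})

print(len(before.forwarded), len(before.buffer))  # 0 3
print(len(after.forwarded), len(after.buffer))    # 3 0
```

In this model, an instance that only ever sees metrics traffic for a project never forwards anything unless metrics also trigger the fetch. Calling `handle_envelope()` on the `before` instance releases the buffered buckets.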

@Dav1dde
Member Author

Dav1dde commented Jan 26, 2024

Waiting for customer to confirm that the issue is fixed for them.

olksdr closed this as completed Feb 14, 2024