
ref(replays): Switch to streaming data scrubbing #1800

Merged: jan-auer merged 8 commits into master from ref/replay-streaming-pii on Feb 1, 2023

Conversation

@jan-auer (Member) commented on Jan 31, 2023

Current replay payload processing uses significant amounts of both CPU time and memory. In particular, we have noticed gradual bumps in Relay's permanent memory utilization. It's not entirely clear to us where these come from, so there could be an unrelated leak; however, we were clearly able to correlate them with the processing of large replays.

Replay processing consists of two major parts:

  1. Zlib decompression and JSON parsing, as well as the reverse. There are several intermediate buffer allocations between each of these steps. Additionally, the structures are highly recursive and create substantial memory fragmentation once parsed.
  2. Recursive data scrubbing, where each step creates several intermediate allocations and clones of the scrubbed string and the rules.

This PR contains a complete rewrite of Replay processing, which includes:

  • Removing the schema entirely. It will be added back eventually for other use cases.
  • Streaming decompression and JSON parsing instead of working from buffers. This reduces redundant intermediate allocations.
  • Streaming PII scrubbing by feeding fields one by one directly from the deserializer to the serializer (a rough sketch follows this list). This avoids building a complete AST of the replay payload, but it makes selective scrubbing harder in the future and also means we cannot perform any data normalization.
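
For readers unfamiliar with the technique, here is a minimal sketch of such a streaming pipeline. This is not the PR's actual code: the `serde-transcode` crate and the `transcode_replay` helper are illustrative assumptions, and the real implementation additionally hooks string scrubbing into the serialization side.

```rust
use flate2::read::ZlibDecoder;
use flate2::write::ZlibEncoder;
use flate2::Compression;

// Hypothetical helper, not Relay's API: decompress, transcode JSON
// token-by-token, and recompress, without ever building a full AST.
fn transcode_replay(compressed: &[u8]) -> Result<Vec<u8>, Box<dyn std::error::Error>> {
    let decoder = ZlibDecoder::new(compressed);
    let mut deserializer = serde_json::Deserializer::from_reader(decoder);

    let encoder = ZlibEncoder::new(Vec::new(), Compression::default());
    let mut serializer = serde_json::Serializer::new(encoder);

    // Tokens flow straight from the deserializer into the serializer; a
    // scrubbing wrapper around the serializer would rewrite string values
    // as they pass through.
    serde_transcode::transcode(&mut deserializer, &mut serializer)?;

    Ok(serializer.into_inner().finish()?)
}
```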

Next steps will be:

  • Optimizing the streaming layer, which accounts for 37% of the CPU time.
  • Optimizing data scrubbing, which accounts for 63% of the CPU time. This PR still uses the inefficient chunk-based scrubber.
  • A final benchmark comparing E2E performance before and after the change.

@cmanallen (Member) left a comment:

Just a couple of thoughts. No action items.

Currently we are deserializing to somewhat well-defined structs. This has the benefit of asserting, on some level, that the payload appears the way it should. We lose that with this implementation, but we gain a scrubber that is very memory efficient. We probably want to do some post-processing in our consumer. Depending on how this performs in canary, the replay team will want to consider porting the existing parser to a consumer environment where we can maybe be a little more "reckless" with memory. This is an issue for us to deal with, not you, though :P

Another thought I have is that with the previous implementation we were very targeted in our PII scrubbing. We would leave certain areas untouched (potentially leaving PII in the payload). Is this better? Will it "corrupt" replays in any way if it wrongly detects PII within a script or style tag? E.g. a minified stylesheet with a media query could trigger the email scrubber, turning `}@media xyz {` into `[email] xyz {`.
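
To make that concern concrete, here is a toy illustration; the pattern is deliberately loose and is not Relay's actual email rule:

```rust
use regex::Regex;

fn main() {
    // Deliberately loose email-like pattern, for illustration only.
    let email_like = Regex::new(r"[^\s@]+@[^\s@]+").unwrap();
    let css = "}@media (max-width: 600px) {";
    // A scrubber applied to raw CSS text would rewrite the media query itself.
    let scrubbed = email_like.replace_all(css, "[email]");
    assert_eq!(scrubbed, "[email] (max-width: 600px) {");
}
```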

You pointed this out to me, but I'll mention it here for completeness' sake: this does not transform `img` `src` attribute values to `#`. The current implementation hadn't perfected this either, but it's worth mentioning.

We currently use statically defined scrubbers, but there is a desire to use the user-configured PII scrubbers. I assume we can just swap out the static scrubber for a function argument containing the `PiiConfig`, but I thought I'd mention it in case there are deeper implications to such a change.


Great job on this! I know this was a huge unit of effort and a really technical change. Thanks for being so helpful through this. I'm sure you have other things you could be doing. Really excited to see it run on canary and see some results!

@jan-auer (Member Author) commented:

In some local benchmarks I noticed that a roundtrip with the 3MB replay takes 80ms. Of those, 28ms are decompression, streaming, and recompression. The remaining 50ms are the intermediate string allocations and scrubbing.

There's potential to speed up both, but that will require more benchmarks. I'll run a canary next to check for memory impact, then move on to speeding up the scrubbing itself.

@jan-auer (Member Author) commented:

Currently we are deserializing to somewhat well-defined structs. This has the benefit of asserting, on some level, that the payload appears the way it should.

That's a fair point, and we'll definitely have to consider bringing some of that back. I was reluctant to remove the struct declarations, but there's also an argument to be made that we might not need validation at this point. I view these replay payloads like video files, and we wouldn't want to validate/normalize video files during ingestion either. So this is maybe better done elsewhere.

Depending on how this performs in canary the replay team will want to consider porting the existing parser to a consumer environment where we can maybe be a little more "reckless" with memory.

We can restore the types and expose them for sure. I can clean that up if/when this PR works.

Another thought I have is that with the previous implementation we were very targeted in our PII scrubbing. We would leave certain areas untouched (potentially leaving PII in the payload). Is this better?

In an ideal state I would like to get some of that back. In particular, we can identify values that don't need scrubbing, or entire branches in the tree that we can skip. That adds more complexity, though, so we should focus on improving the scrubber first.

This does not transform img src attribute values to #. The current implementation hadn't perfected this either but worth mentioning.

That is the kind of transform that would be best moved to a consumer, or even to the place where this data is interpreted (i.e. the player).

We currently use statically defined scrubbers but there is a desire to use the user configured PII scrubbers. I assume we can just swap out the static scrubber with a function argument containing the PIIConfig.

Yes, that is still possible.

@cmanallen (Member) commented:

We can restore the types and expose them for sure. I can clean that up if/when this PR works.

@jan-auer Doesn't necessarily have to be in Relay. If we consider PII scrubbing complete at this step, I think our only dependency is serde. We could utilize an FFI in our Python consumer. Or whatever. Just thinking aloud.

@jan-auer jan-auer self-assigned this Feb 1, 2023
@jan-auer jan-auer marked this pull request as ready for review February 1, 2023 07:55
@jan-auer jan-auer requested a review from a team February 1, 2023 07:55
@@ -10,16 +10,17 @@ license-file = "../LICENSE"
publish = false

[dependencies]
flate2 = "1.0.19"
once_cell = "1.13.1"
@jan-auer (Member Author) commented:

`once_cell` is the only added dependency; the rest is just sorted alphabetically.

Comment on lines +65 to +67
match processor.process_string(value, &mut Meta::default(), &REPLAY_PII_STATE) {
Err(ProcessingAction::DeleteValueHard) => *value = String::new(),
Err(ProcessingAction::DeleteValueSoft) => *value = String::new(),
@jan-auer (Member Author) commented:

Processing "errors" are now handled inline by removing the value. This is the same behavior that `Annotated` would have.
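
For context, the surrounding match presumably looks roughly like this; only the deletion arms are taken from the diff above, the catch-all arm is an assumption:

```rust
match processor.process_string(value, &mut Meta::default(), &REPLAY_PII_STATE) {
    // Deletion "errors" are handled inline by clearing the string, which is
    // the same outcome Annotated-based processing would produce.
    Err(ProcessingAction::DeleteValueHard) | Err(ProcessingAction::DeleteValueSoft) => {
        *value = String::new()
    }
    // Any other outcome leaves the (possibly scrubbed-in-place) value as-is.
    _ => {}
}
```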


// Recording Processor

struct RecordingProcessor<'a> {
@jan-auer (Member Author) commented:

The recording processor does not exist anymore. Instead of walking a structure, we now call scrub_replay.

});

fn scrub_string(value: &mut String) {
let mut processor = PiiProcessor::new(REPLAY_PII_CONFIG.compiled());
@jan-auer (Member Author) commented:

Creating a new `PiiProcessor` instance is cheap. The expensive call here is `compiled()`, which is cached on the static `REPLAY_PII_CONFIG` instance.
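
As background, the caching pattern (and the reason `once_cell` was added) looks roughly like the sketch below. It assumes `PiiConfig` and `PiiProcessor` from Relay's PII module are in scope, and the rule set is a placeholder rather than the PR's actual configuration:

```rust
use once_cell::sync::Lazy;

// The static config is built once, on first access; `compiled()` then caches
// the expensive compiled form on this instance.
static REPLAY_PII_CONFIG: Lazy<PiiConfig> = Lazy::new(|| {
    serde_json::from_value(serde_json::json!({
        "applications": {
            "$string": ["@ip", "@email"]
        }
    }))
    .expect("invalid static PII config")
});

fn scrub_string(value: &mut String) {
    // Constructing the processor is cheap; the costly `compiled()` result is
    // reused across calls via the static above.
    let mut processor = PiiProcessor::new(REPLAY_PII_CONFIG.compiled());
    // ... feed `value` through the processor here ...
}
```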

assert!(parsed.contains("\"value\":\"[ip]\"")); // Assert texts were mutated.
assert!(parsed.contains("\"textContent\":\"[ip]\"")) // Assert text node was mutated.
}

#[test]
fn test_rrweb_snapshot_parsing() {
@jan-auer (Member Author) commented:

These tests check parsing of the protocol. They will be added back in a follow-up.

* master:
  ref(statsd): Add metric name as a tag for Sentry errors (#1797)
@olksdr (Contributor) left a comment:

Overall LGTM. You might want to wait for @jjbayer's approval though, as he was more involved in this.

}
}

struct Visitor<'a, S>(S, &'a dyn Fn(&mut String));
Member:

nit: Give names to fields?

@jan-auer (Member Author) replied:

Good idea; I'll give the transcoder a refactor in a follow-up.
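
A possible shape for that follow-up, with field names that are just a guess:

```rust
// Hypothetical refactor of the tuple struct above with named fields.
struct Visitor<'a, S> {
    serializer: S,
    scrub_fn: &'a dyn Fn(&mut String),
}
```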

@jan-auer jan-auer merged commit 0a871f5 into master Feb 1, 2023
@jan-auer jan-auer deleted the ref/replay-streaming-pii branch February 1, 2023 08:57
jan-auer added a commit to getsentry/sentry that referenced this pull request Feb 3, 2023
…4093)

Disables data scrubbing on replay recording payloads by default. The new
data scrubbing introduced in getsentry/relay#1800 can break payloads
because it scrubs strings that should be excluded from scrubbing.

The introduced feature flag is off by default. Once data scrubbing on
replay payloads has been stabilized, it can be re-enabled.

Note that the feature flag is NOT an entity feature, so it cannot be
controlled through flagr. This is intentional for all feature flags used
in Relay project configuration.
jan-auer added a commit that referenced this pull request Feb 3, 2023
Revised data scrubbing on replays from #1800 can break recording
payloads because it scrubs strings that should be excluded from
scrubbing. This breaks the payloads and makes them unusable.

This introduces a feature flag to enable data scrubbing on recordings
per organization. By default, this flag is off.
mikejihbe pushed a commit to getsentry/sentry that referenced this pull request Feb 6, 2023
wmak pushed a commit to getsentry/sentry that referenced this pull request Feb 16, 2023