
ref(replays): Switch to streaming data scrubbing #1800

Merged: jan-auer merged 8 commits into master from ref/replay-streaming-pii on Feb 1, 2023

Conversation

@jan-auer (Member) commented on Jan 31, 2023

Current replay payload processing uses significant amounts of both CPU time and memory. In particular, we have noticed gradual bumps in Relay's permanent memory utilization. It's not entirely clear to us where these come from, so there could be an unrelated leak; however, we were clearly able to correlate them with the processing of large replays.

Replay processing consists of two major parts:

  1. Zlib decompression and JSON parsing, as well as the reverse. There are several intermediate buffer allocations between each of these steps. Additionally, the structures are highly recursive and create substantial memory fragmentation once parsed.
  2. Recursive data scrubbing, where each step creates several intermediate allocations and clones of the scrubbed string and the rules.

This PR contains a complete rewrite of Replay processing, which includes:

  • Removing the schema entirely. It will be added back eventually for other use cases.
  • Streaming decompression and JSON parsing instead of working from buffers. This reduces redundant intermediate allocations.
  • Streaming PII scrubbing by feeding fields one by one directly from the deserializer to the serializer (a rough sketch follows this list). This avoids building a complete AST of the replay payload, but it makes selective scrubbing harder in the future and also means we cannot perform any data normalization.
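
For readers unfamiliar with the technique, here is a minimal sketch of such a streaming pipeline. This is not the PR's actual code: the `serde-transcode` crate and the `transcode_replay` helper are illustrative assumptions, and the real implementation additionally hooks string scrubbing into the serialization side.

```rust
use flate2::read::ZlibDecoder;
use flate2::write::ZlibEncoder;
use flate2::Compression;

// Hypothetical helper, not Relay's API: decompress, transcode JSON
// token-by-token, and recompress, without ever building a full AST.
fn transcode_replay(compressed: &[u8]) -> Result<Vec<u8>, Box<dyn std::error::Error>> {
    let decoder = ZlibDecoder::new(compressed);
    let mut deserializer = serde_json::Deserializer::from_reader(decoder);

    let encoder = ZlibEncoder::new(Vec::new(), Compression::default());
    let mut serializer = serde_json::Serializer::new(encoder);

    // Tokens flow straight from the deserializer into the serializer; a
    // scrubbing wrapper around the serializer would rewrite string values
    // as they pass through.
    serde_transcode::transcode(&mut deserializer, &mut serializer)?;

    Ok(serializer.into_inner().finish()?)
}
```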

Next steps will be:

  • Optimizing the streaming layer, which accounts for 37% of the CPU time.
  • Optimizing data scrubbing, which accounts for 63% of the CPU time. This PR still uses the inefficient chunk-based scrubber.
  • A final benchmark comparing E2E performance before and after the change.

@cmanallen (Member) left a comment:

Just a couple of thoughts. No action items.

Currently we are deserializing to somewhat well-defined structs. This has the benefit of asserting, on some level, that the payload appears the way it should. We lose that with this implementation, but we gain a scrubber that is very memory efficient. We probably want to do some post-processing in our consumer. Depending on how this performs in canary, the replay team will want to consider porting the existing parser to a consumer environment where we can maybe be a little more "reckless" with memory. This is an issue for us to deal with, not you, though :P

Another thought I have is that with the previous implementation we were very targeted in our PII scrubbing. We would leave certain areas untouched (potentially leaving PII in the payload). Is this better? Will it "corrupt" replays in any way if it wrongly detects PII within a script or style tag? E.g. a minified stylesheet with a media query could trigger the email scrubber, turning `}@media xyz {` into `[email] xyz {`.
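
To make that concern concrete, here is a toy illustration; the pattern is deliberately loose and is not Relay's actual email rule:

```rust
use regex::Regex;

fn main() {
    // Deliberately loose email-like pattern, for illustration only.
    let email_like = Regex::new(r"[^\s@]+@[^\s@]+").unwrap();
    let css = "}@media (max-width: 600px) {";
    // A scrubber applied to raw CSS text would rewrite the media query itself.
    let scrubbed = email_like.replace_all(css, "[email]");
    assert_eq!(scrubbed, "[email] (max-width: 600px) {");
}
```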

You pointed this out to me, but I'll mention it here for completeness' sake: this does not transform `img` `src` attribute values to `#`. The current implementation hadn't perfected this either, but it's worth mentioning.

We currently use statically defined scrubbers, but there is a desire to use the user-configured PII scrubbers. I assume we can just swap out the static scrubber for a function argument containing the `PiiConfig`, but I thought I'd mention it in case there are deeper implications to such a change.


Great job on this! I know this was a huge unit of effort and a really technical change. Thanks for being so helpful through this. I'm sure you have other things you could be doing. Really excited to see it run on canary and see some results!

@jan-auer (Member Author) commented:

In some local benchmarks I noticed that a roundtrip with the 3MB replay takes 80ms. Of those, 28ms are decompression, streaming, and recompression. The remaining 50ms are the intermediate string allocations and scrubbing.

There's potential to speed up both, but that will require more benchmarks. I'll run a canary next to check for memory impact, then move on to speeding up the scrubbing itself.

@jan-auer (Member Author) commented:

Currently we are deserializing to somewhat well-defined structs. This has the benefit of asserting, on some level, that the payload appears the way it should.

That's a fair point, and we'll definitely have to consider bringing some of that back. I was reluctant to remove the struct declarations, but there's also an argument to be made that we might not need validation at this point. I view these replay payloads like video files, and we wouldn't want to validate/normalize video files during ingestion either. So this is maybe better done elsewhere.

Depending on how this performs in canary the replay team will want to consider porting the existing parser to a consumer environment where we can maybe be a little more "reckless" with memory.

We can restore the types and expose them for sure. I can clean that up if/when this PR works.

Another thought I have is that with the previous implementation we were very targeted in our PII scrubbing. We would leave certain areas untouched (potentially leaving PII in the payload). Is this better?

In an ideal state I would like to get some of that back. In particular, we can identify values that don't need scrubbing, or entire branches in the tree that we can skip. That adds more complexity, though, so we should focus on improving the scrubber first.

This does not transform img src attribute values to #. The current implementation hadn't perfected this either but worth mentioning.

That is the kind of transform that would be best moved to a consumer, or even to the place where this data is interpreted (i.e. the player).

We currently use statically defined scrubbers but there is a desire to use the user configured PII scrubbers. I assume we can just swap out the static scrubber with a function argument containing the PIIConfig.

Yes, that is still possible.

@cmanallen (Member) commented:

We can restore the types and expose them for sure. I can clean that up if/when this PR works.

@jan-auer Doesn't necessarily have to be in Relay. If we consider PII scrubbing complete at this step, I think our only dependency is serde. We could utilize an FFI in our Python consumer. Or whatever. Just thinking aloud.

@jan-auer jan-auer self-assigned this Feb 1, 2023
@jan-auer jan-auer marked this pull request as ready for review February 1, 2023 07:55
@jan-auer jan-auer requested a review from a team February 1, 2023 07:55
@@ -10,16 +10,17 @@ license-file = "../LICENSE"
publish = false

[dependencies]
flate2 = "1.0.19"
once_cell = "1.13.1"
@jan-auer (Member Author) commented:

`once_cell` is the only added dependency; the rest is just sorted alphabetically.

Comment on lines +65 to +67
match processor.process_string(value, &mut Meta::default(), &REPLAY_PII_STATE) {
Err(ProcessingAction::DeleteValueHard) => *value = String::new(),
Err(ProcessingAction::DeleteValueSoft) => *value = String::new(),
@jan-auer (Member Author) commented:

Processing "errors" are now handled inline by removing the value. This is the same behavior that `Annotated` would have.
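
For context, the surrounding match presumably looks roughly like this; only the deletion arms are taken from the diff above, the catch-all arm is an assumption:

```rust
match processor.process_string(value, &mut Meta::default(), &REPLAY_PII_STATE) {
    // Deletion "errors" are handled inline by clearing the string, which is
    // the same outcome Annotated-based processing would produce.
    Err(ProcessingAction::DeleteValueHard) | Err(ProcessingAction::DeleteValueSoft) => {
        *value = String::new()
    }
    // Any other outcome leaves the (possibly scrubbed-in-place) value as-is.
    _ => {}
}
```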


// Recording Processor

struct RecordingProcessor<'a> {
@jan-auer (Member Author) commented:

The recording processor does not exist anymore. Instead of walking a structure, we now call scrub_replay.

});

fn scrub_string(value: &mut String) {
let mut processor = PiiProcessor::new(REPLAY_PII_CONFIG.compiled());
@jan-auer (Member Author) commented:

Creating a new `PiiProcessor` instance is cheap. The expensive call here is `compiled()`, which is cached on the static `REPLAY_PII_CONFIG` instance.
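
As background, the caching pattern (and the reason `once_cell` was added) looks roughly like the sketch below. It assumes `PiiConfig` and `PiiProcessor` from Relay's PII module are in scope, and the rule set is a placeholder rather than the PR's actual configuration:

```rust
use once_cell::sync::Lazy;

// The static config is built once, on first access; `compiled()` then caches
// the expensive compiled form on this instance.
static REPLAY_PII_CONFIG: Lazy<PiiConfig> = Lazy::new(|| {
    serde_json::from_value(serde_json::json!({
        "applications": {
            "$string": ["@ip", "@email"]
        }
    }))
    .expect("invalid static PII config")
});

fn scrub_string(value: &mut String) {
    // Constructing the processor is cheap; the costly `compiled()` result is
    // reused across calls via the static above.
    let mut processor = PiiProcessor::new(REPLAY_PII_CONFIG.compiled());
    // ... feed `value` through the processor here ...
}
```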

assert!(parsed.contains("\"value\":\"[ip]\"")); // Assert texts were mutated.
assert!(parsed.contains("\"textContent\":\"[ip]\"")) // Assert text node was mutated.
}

#[test]
fn test_rrweb_snapshot_parsing() {
@jan-auer (Member Author) commented:

These tests check parsing of the protocol. They will be added back in a follow-up.

* master:
  ref(statsd): Add metric name as a tag for Sentry errors (#1797)
@olksdr (Contributor) left a comment:

Overall LGTM. You might want to wait for @jjbayer's approval though, as he was more involved in this.

}
}

struct Visitor<'a, S>(S, &'a dyn Fn(&mut String));
Member:

nit: Give names to fields?

@jan-auer (Member Author) replied:

Good idea; I'll give the transcoder a refactor in a follow-up.
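
A possible shape for that follow-up, with field names that are just a guess:

```rust
// Hypothetical refactor of the tuple struct above with named fields.
struct Visitor<'a, S> {
    serializer: S,
    scrub_fn: &'a dyn Fn(&mut String),
}
```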

@jan-auer jan-auer merged commit 0a871f5 into master Feb 1, 2023
@jan-auer jan-auer deleted the ref/replay-streaming-pii branch February 1, 2023 08:57
jan-auer added a commit to getsentry/sentry that referenced this pull request Feb 3, 2023
…4093)

Disables data scrubbing on replay recording payloads by default. The new
data scrubbing introduced in getsentry/relay#1800 can break payloads
because it scrubs strings that should be excluded from scrubbing.

The introduced feature flag is off by default. Once data scrubbing on
replay payloads has been stabilized, it can be re-enabled.

Note that the feature flag is NOT an entity feature, so it cannot be
controlled through flagr. This is intentional for all feature flags used
in Relay project configuration.
jan-auer added a commit that referenced this pull request Feb 3, 2023
Revised data scrubbing on replays from #1800 can break recording
payloads because it scrubs strings that should be excluded from
scrubbing. This breaks the payloads and makes them unusable.

This introduces a feature flag to enable data scrubbing on recordings
per organization. By default, this flag is off.
mikejihbe pushed a commit to getsentry/sentry that referenced this pull request Feb 6, 2023
wmak pushed a commit to getsentry/sentry that referenced this pull request Feb 16, 2023