-
Notifications
You must be signed in to change notification settings - Fork 91
New issue
Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? # to your account
ref(replays): Switch to streaming data scrubbing #1800
Conversation
a036e96
to
35e0225
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just a couple of thoughts. No action items.
Currently we are de-serializing to somewhat well defined structs. This has the benefit of asserting on some level the payload appears the way it should. We lose that with this implementation but we gain a scrubber which is very memory efficient. We probably want to do some post-processing in our consumer. Depending on how this performs in canary the replay team will want to consider porting the existing parser to a consumer environment where we can maybe be a little more "reckless" with memory. This is an issue for us to deal with not you though :P
Another thought I have is that with the previous implementation we were very targeted in our PII scrubbing. We would leave certain areas untouched (potentially leaving PII in the payload). Is this better? Will it "corrupt" replays in anyway if it wrongly detects PII within a script or style tag? E.g. some compressed stylesheet with a media query triggers the email scrubber }@media xyz {
to [email] xyz {
.
You pointed this out to me but I'll mention it here for completions sake. This does not transform img
src
attribute values to #
. The current implementation hadn't perfected this either but worth mentioning.
We currently use statically defined scrubbers but there is a desire to use the user configured PII scrubbers. I assume we can just swap out the static scrubber with a function argument containing the PIIConfig. But thought I'd mention in case there were deeper implications to such a change.
Great job on this! I know this was a huge unit of effort and a really technical change. Thanks for being so helpful through this. I'm sure you have other things you could be doing. Really excited to see it run on canary and see some results!
In some local benchmarks I noticed that a roundtrip with the 3MB replay takes 80ms. Of those, 28ms are decompression, streaming, and recompression. The remaining 50ms are the intermediate string allocations and scrubbing. There's potential to speed up both, but that will require more benchmarks. I'll run a canary next to check for memory impact, then move on to speeding up the scrubbing itself. |
That's a fair point, and we'll definitely have to consider bringing some of that back. I was reluctant to remove the struct declarations, but there's also an argument to be made that we might not need validation at this point. I'm viewing those replay payloads like video files, and we also wouldn't want to validate/normalize video files in ingestion. So this is maybe better done at another place.
We can restore the types and expose them for sure. I can clean that up if/when this PR works.
In an ideal state I would like to get some of that back. Particularly, we can identify values that don't need scrubbing, or entire branches in the tree that we can skip. That adds more complexity, we should focus on improving the scrubber first.
That is a kind of transform that would best move to a consumer or even the place where this data is being interpreted (i.e. the player).
Yes, that is still possible. |
@jan-auer Doesn't necessarily have to be in Relay. If we consider PII scrubbing complete at this step, I think our only dependency is serde. We could utilize an FFI in our Python consumer. Or whatever. Just thinking aloud. |
@@ -10,16 +10,17 @@ license-file = "../LICENSE" | |||
publish = false | |||
|
|||
[dependencies] | |||
flate2 = "1.0.19" | |||
once_cell = "1.13.1" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
once_cell
is the only added dependency, the rest is sorted alphabetically.
match processor.process_string(value, &mut Meta::default(), &REPLAY_PII_STATE) { | ||
Err(ProcessingAction::DeleteValueHard) => *value = String::new(), | ||
Err(ProcessingAction::DeleteValueSoft) => *value = String::new(), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Processing "errors" are now handled in-line by removing the value. This is the same behavior as Annotated
would have.
|
||
// Recording Processor | ||
|
||
struct RecordingProcessor<'a> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The recording processor does not exist anymore. Instead of walking a structure, we now call scrub_replay
.
}); | ||
|
||
fn scrub_string(value: &mut String) { | ||
let mut processor = PiiProcessor::new(REPLAY_PII_CONFIG.compiled()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Creating a new PiiProcessor
instance is cheap. The expensive action here is compiled()
, which is cached on the static RELAY_PII_CONFIG
instance.
assert!(parsed.contains("\"value\":\"[ip]\"")); // Assert texts were mutated. | ||
assert!(parsed.contains("\"textContent\":\"[ip]\"")) // Assert text node was mutated. | ||
} | ||
|
||
#[test] | ||
fn test_rrweb_snapshot_parsing() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These tests check parsing of the protocol. This will be added in a follow-up.
* master: ref(statsd): Add metric name as a tag for Sentry errors (#1797)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall lgtm.
you might want to wait on @jjbayer approval though, as we was involved more in this.
} | ||
} | ||
|
||
struct Visitor<'a, S>(S, &'a dyn Fn(&mut String)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: Give names to fields?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good idea, I'll give the transcoder a refactor in a follow-up.
…4093) Disables data scrubbing on replay recording payloads by default. The new data scrubbing introduced in getsentry/relay#1800 can break payloads because it scrubs strings that should be excluded from scrubbing. The introduced feature flag is off by default. Once data scrubbing on replay payloads has been stabilized, it can be re-enabled. Note that the feature flag is NOT an entity feature, so it cannot be controlled through flagr. This is intentional for all feature flags used in Relay project configuration.
Revised data scrubbing on replays from #1800 can break recording payloads because it scrubs strings that should be excluded from scrubbing. This breaks the payloads and makes them unusable. This introduces a feature flag to enable data scrubbing on recordings per organization. By default, this flag is off.
…4093) Disables data scrubbing on replay recording payloads by default. The new data scrubbing introduced in getsentry/relay#1800 can break payloads because it scrubs strings that should be excluded from scrubbing. The introduced feature flag is off by default. Once data scrubbing on replay payloads has been stabilized, it can be re-enabled. Note that the feature flag is NOT an entity feature, so it cannot be controlled through flagr. This is intentional for all feature flags used in Relay project configuration.
…4093) Disables data scrubbing on replay recording payloads by default. The new data scrubbing introduced in getsentry/relay#1800 can break payloads because it scrubs strings that should be excluded from scrubbing. The introduced feature flag is off by default. Once data scrubbing on replay payloads has been stabilized, it can be re-enabled. Note that the feature flag is NOT an entity feature, so it cannot be controlled through flagr. This is intentional for all feature flags used in Relay project configuration.
Current replay payload processing uses significant CPU time and memory alike. Particularly, we have noticed gradual bumps in permanent memory utilization in Relay. It's not entirely clear to us where these are coming from, so there could be an unrelated leak. However, we were clearly able to correlate it with processing of large replays.
Replay processing consists of two major parts:
This PR contains a complete rewrite of Replay processing, which includes:
Next steps will be: