diffArrays RangeError: Maximum call stack size exceeded #72

Open · amiller-gh opened this issue Aug 7, 2020 · 3 comments

amiller-gh commented Aug 7, 2020

Hello! First off, thank you for a great library.

For reference, a version of this issue has been discussed previously here with no resolution: #10 (comment)

In my application I'm comparing objects that may contain very large arrays – sometimes up to 20-40k strings. Because this implementation's Levenshtein distance algorithm is recursive, it is guaranteed to blow out the stack every time it encounters one of these large objects.

A recursive algorithm may be more elegant, but when we can't know what data may be coming through the pipes, it's a footgun waiting to happen.
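
A minimal repro looks something like this (the document shape and array size here are just illustrative):

```ts
import { createPatch } from 'rfc6902';

const before = { items: [] as string[] };
const after = { items: Array.from({ length: 40_000 }, (_, i) => `item-${i}`) };

// The recursive distance calculation descends once per array element, so on
// arrays this large it throws "RangeError: Maximum call stack size exceeded"
// before ever producing a patch.
createPatch(before, after);
```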

I'm able to work around it by providing a custom diff function that bails out on large arrays, like so:

```ts
import { diffAny, Operation } from 'rfc6902/diff';

function customDiff(input: any, output: any, ptr: any): Operation[] | void {
  // Bail out with a single wholesale replace when either side is a large array.
  if ((Array.isArray(input) || Array.isArray(output)) && (input.length > 5000 || output.length > 5000)) {
    return [{ op: 'replace', path: ptr.toString(), value: output }];
  }
  // Otherwise defer to the library's default diff, threading this function
  // through so nested values get the same treatment.
  return diffAny(input, output, ptr, customDiff);
}
```
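
(Wired in via createPatch's optional third argument, assuming I'm reading the types right:)

```ts
import { createPatch } from 'rfc6902';

// Pass the custom diff as createPatch's optional third argument; it is
// consulted at each node before the default behavior.
const patch = createPatch(before, after, customDiff);
```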

Not ideal since these arrays often have a lot of overlap! Have you experimented any more with an iterative version of Levenshtein for this repo?

amiller-gh (Author) commented

Also, ref: danger/danger-js#1019

chbrown (Owner) commented Aug 13, 2020

I've implemented iterative/imperative versions of the Levenshtein algorithm — not here specifically, but AFAIK there's no technical / computer-science-y reason not to.
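
For illustration only (not this repo's code), the bottom-up shape looks something like the sketch below. Note that a real `diffArrays` replacement would need to keep the full table, or backpointers, to recover the actual edit script rather than just the distance:

```ts
// Sketch: the Levenshtein recurrence filled in iteratively with an explicit
// table instead of recursive calls, so stack depth stays constant regardless
// of array length. `eq` is a caller-supplied element-equality check.
function editDistance<T>(a: T[], b: T[], eq: (x: T, y: T) => boolean): number {
  // prev[j] = distance between the first i-1 elements of a and first j of b.
  let prev = Array.from({ length: b.length + 1 }, (_, j) => j);
  for (let i = 1; i <= a.length; i++) {
    const curr = [i]; // transforming a[0..i-1] into [] takes i deletions
    for (let j = 1; j <= b.length; j++) {
      curr[j] = eq(a[i - 1], b[j - 1])
        ? prev[j - 1] // elements match: no edit needed
        : 1 + Math.min(
            prev[j - 1], // substitution
            prev[j],     // deletion from a
            curr[j - 1], // insertion from b
          );
    }
    prev = curr;
  }
  return prev[b.length];
}
```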

I'm not sure that's the right solution though. In a way, I'm punting the problem of what to do with really long arrays down to the user's (default?) choice of stack size. It seems rude to `if (input.length > 5000 || output.length > 5000) { throw Error("Your arrays are too long") }` but by using recursion I can blame it on the JavaScript engine 😜

IDK; what do you think it should do? If given two arrays of 100K+ elements, should it really embark on a potentially O(n²) algorithm? (Which, in the 100K case, implies on the order of 10 billion operations!) Realistically, I think your custom diff is the "right" solution in your case — you had really long arrays, realized the default behavior of the library blew up in that case, and so you overrode that default behavior. For some users, the "right" solution might be to increase their engine's stack size (as in #10).

(Kinda just thinking out loud here...) What are we really trying to optimize? Maybe we say we want the smallest possible patch size. But what if that's potentially going to max out the CPU for some non-finite amount of time? There's a trade-off between how smart the library can be on whatever data you throw at it and it running down rabbit holes on pathological inputs. Where to draw the line?

E.g., `diffObjects` isn't very smart, compared to `diffArrays` — it doesn't currently handle renamed keys as `move` operations, though that would be a nice (smart) feature to have. But that would require extra checking that could get into the same sticky Levenshtein-esque situations.

Similarly, when diffing arrays (e.g., `[]` → `[1, 2, 3, 4, 5]`, or even more poignantly, `[1, 2, 3, 4, 5]` → `[]`), an attentive user may observe that it'd be more efficient to replace the entire array than generate a whole bunch of `remove` and `add` operations, but the library isn't that smart. Should it be? If so, at what cost? In addition to all the Levenshtein comparisons, we'd also have to compare `diffArrays`'s total patch size to the patch of a "dumb" wholesale replacement (how? encode as JSON and count bytes?). Maybe reaching for Levenshtein as a one-size-fits-all solution for all arrays is the wrong default behavior, and should be entirely opt-in? (In which case there are plenty of other JSON-diffing libraries that might be more to your taste 😉 )
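
(To make that last idea concrete, a hypothetical "pick the smaller patch" wrapper, using JSON byte count as one arbitrary cost metric, might look like this; sketch only, not a proposed API:)

```ts
import { createPatch } from 'rfc6902';
import { Operation } from 'rfc6902/diff';

// Hypothetical heuristic: generate the granular patch, compare it (by
// serialized size) against one wholesale replacement of the whole document,
// and return whichever is smaller.
function smallerPatch(input: any, output: any): Operation[] {
  const granular = createPatch(input, output);
  // "" is the JSON Pointer for the whole document (RFC 6901).
  const wholesale: Operation[] = [{ op: 'replace', path: '', value: output }];
  const bytes = (ops: Operation[]) => JSON.stringify(ops).length;
  return bytes(granular) <= bytes(wholesale) ? granular : wholesale;
}
```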

tl;dr: you bring up a good point, but it's not as simple as unfolding a recursive implementation. (Though that might be an incremental improvement — I'll have to think it over.)

P.S. I think your `isArray || isArray` should be `isArray && isArray`?

P.P.S. I was initially worried you were running into the potential regression described in #15 (comment) which was recently released in v4, but I'm pretty sure that's not the case here, since your issue is blowing out the stack (more similar to #10), not just degraded performance on v4 vs. v3.

amiller-gh (Author) commented Aug 15, 2020

I think there are two different concerns we should separate out here 🙂 I see them as:

  1. In the worst case, what should the behavior of this library be?

    Currently, the behavior is to crash the process. I would argue that an unbounded CPU operation is the lesser of two evils here! Either option will be quickly noticed by an engineer as a perf bottleneck, but only one of them comes with potential data loss.

  2. How can we best prevent the worst-case situation from happening?

    This is where we get into the very fun questions you pose above! I'd agree that Levenshtein as a one-size-fits-all solution is probably a little optimistic given the amazing array (pun intended) of use cases this library might see, and some mix of all the methods outlined above is likely the answer. Perhaps as pre-packaged array diff strategies, made available in the library to pass as a custom diff, so users don't always have to write their own (see the sketch below)!
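
As a strawman for what one such pre-packaged strategy could look like (hypothetical factory, not an existing export, building on my workaround above):

```ts
import { diffAny, Operation } from 'rfc6902/diff';

// Hypothetical strategy factory: caps how large an array the default
// Levenshtein diff will attempt, falling back to one wholesale replace of
// the array above that bound.
function boundedArrayDiff(limit: number) {
  const diff = (input: any, output: any, ptr: any): Operation[] | void => {
    if (Array.isArray(input) && Array.isArray(output) &&
        (input.length > limit || output.length > limit)) {
      return [{ op: 'replace', path: ptr.toString(), value: output }];
    }
    return diffAny(input, output, ptr, diff); // default behavior otherwise
  };
  return diff;
}

// usage: createPatch(input, output, boundedArrayDiff(5000))
```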

Because there is a simple – and backwards-compatible – way for this library to keep the exact same behavior but mitigate the worst-case risk for the end user, the answer to concern 1 seems fairly straightforward to me! Concern 2 comes with a lot more (very fun) optimization and use-case questions that are definitely deserving of a standalone RFC 😅 (Happy to tag-team drafting some proposals.)

P.S. Yup! Good catch.
