diffArrays RangeError: Maximum call stack size exceeded #72

Open · amiller-gh opened this issue Aug 7, 2020 · 3 comments

amiller-gh commented Aug 7, 2020

Hello! First off, thank you for a great library.

For reference, a version of this issue has been discussed previously here with no resolution: #10 (comment)

In my application I'm comparing objects that may contain very large arrays – sometimes up to 20-40k strings. Because this implementation's Levenshtein distance algorithm is recursive, it is guaranteed to blow out the stack every time it encounters one of these large objects.

A recursive algorithm may be more elegant, but when we can't know what data may be coming through the pipes, it's a footgun waiting to happen.
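
A minimal repro looks something like this (the document shape and array size here are just illustrative):

```ts
import { createPatch } from 'rfc6902';

const before = { items: [] as string[] };
const after = { items: Array.from({ length: 40_000 }, (_, i) => `item-${i}`) };

// The recursive distance calculation descends once per array element, so on
// arrays this large it throws "RangeError: Maximum call stack size exceeded"
// before ever producing a patch.
createPatch(before, after);
```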

I'm able to work around it by providing a custom diff function that bails out on large arrays, like so:

```ts
import { diffAny, Operation } from 'rfc6902/diff';

function customDiff(input: any, output: any, ptr: any): Operation[] | void {
  // Bail out with a single wholesale replace when either side is a large array.
  if ((Array.isArray(input) || Array.isArray(output)) && (input.length > 5000 || output.length > 5000)) {
    return [{ op: 'replace', path: ptr.toString(), value: output }];
  }
  // Otherwise defer to the library's default diff, threading this function
  // through so nested values get the same treatment.
  return diffAny(input, output, ptr, customDiff);
}
```
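
(Wired in via createPatch's optional third argument, assuming I'm reading the types right:)

```ts
import { createPatch } from 'rfc6902';

// Pass the custom diff as createPatch's optional third argument; it is
// consulted at each node before the default behavior.
const patch = createPatch(before, after, customDiff);
```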

Not ideal since these arrays often have a lot of overlap! Have you experimented any more with an iterative version of Levenshtein for this repo?

amiller-gh (Author) commented

Also, ref: danger/danger-js#1019

chbrown (Owner) commented Aug 13, 2020

I've implemented iterative/imperative versions of the Levenshtein algorithm — not here specifically, but AFAIK there's no technical / computer-science-y reason not to.
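
For illustration only (not this repo's code), the bottom-up shape looks something like the sketch below. Note that a real `diffArrays` replacement would need to keep the full table, or backpointers, to recover the actual edit script rather than just the distance:

```ts
// Sketch: the Levenshtein recurrence filled in iteratively with an explicit
// table instead of recursive calls, so stack depth stays constant regardless
// of array length. `eq` is a caller-supplied element-equality check.
function editDistance<T>(a: T[], b: T[], eq: (x: T, y: T) => boolean): number {
  // prev[j] = distance between the first i-1 elements of a and first j of b.
  let prev = Array.from({ length: b.length + 1 }, (_, j) => j);
  for (let i = 1; i <= a.length; i++) {
    const curr = [i]; // transforming a[0..i-1] into [] takes i deletions
    for (let j = 1; j <= b.length; j++) {
      curr[j] = eq(a[i - 1], b[j - 1])
        ? prev[j - 1] // elements match: no edit needed
        : 1 + Math.min(
            prev[j - 1], // substitution
            prev[j],     // deletion from a
            curr[j - 1], // insertion from b
          );
    }
    prev = curr;
  }
  return prev[b.length];
}
```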

I'm not sure that's the right solution though. In a way, I'm punting the problem of what to do with really long arrays down to the user's (default?) choice of stack size. It seems rude to `if (input.length > 5000 || output.length > 5000) { throw Error("Your arrays are too long") }` but by using recursion I can blame it on the JavaScript engine 😜

IDK; what do you think it should do? If given two arrays of 100K+ elements, should it really embark on a potentially O(n²) algorithm? (Which, in the 100K case, implies on the order of 10 billion operations!) Realistically, I think your custom diff is the "right" solution in your case — you had really long arrays, realized the default behavior of the library blew up in that case, and so you overrode that default behavior. For some users, the "right" solution might be to increase their engine's stack size (as in #10).

(Kinda just thinking out loud here...) What are we really trying to optimize? Maybe we say we want the smallest possible patch size. But what if that's potentially going to max out the CPU for some non-finite amount of time? There's a trade-off between how smart the library can be on whatever data you throw at it and it running down rabbit holes on pathological inputs. Where to draw the line?

E.g., `diffObjects` isn't very smart, compared to `diffArrays` — it doesn't currently handle renamed keys as `move` operations, though that would be a nice (smart) feature to have. But that would require extra checking that could get into the same sticky Levenshtein-esque situations.

Similarly, when diffing arrays (e.g., `[]` → `[1, 2, 3, 4, 5]`, or even more poignantly, `[1, 2, 3, 4, 5]` → `[]`), an attentive user may observe that it'd be more efficient to replace the entire array than generate a whole bunch of `remove` and `add` operations, but the library isn't that smart. Should it be? If so, at what cost? In addition to all the Levenshtein comparisons, we'd also have to compare `diffArrays`'s total patch size to the patch of a "dumb" wholesale replacement (how? encode as JSON and count bytes?). Maybe reaching for Levenshtein as a one-size-fits-all solution for all arrays is the wrong default behavior, and should be entirely opt-in? (In which case there are plenty of other JSON-diffing libraries that might be more to your taste 😉 )
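
(To make that last idea concrete, a hypothetical "pick the smaller patch" wrapper, using JSON byte count as one arbitrary cost metric, might look like this; sketch only, not a proposed API:)

```ts
import { createPatch } from 'rfc6902';
import { Operation } from 'rfc6902/diff';

// Hypothetical heuristic: generate the granular patch, compare it (by
// serialized size) against one wholesale replacement of the whole document,
// and return whichever is smaller.
function smallerPatch(input: any, output: any): Operation[] {
  const granular = createPatch(input, output);
  // "" is the JSON Pointer for the whole document (RFC 6901).
  const wholesale: Operation[] = [{ op: 'replace', path: '', value: output }];
  const bytes = (ops: Operation[]) => JSON.stringify(ops).length;
  return bytes(granular) <= bytes(wholesale) ? granular : wholesale;
}
```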

tl;dr: you bring up a good point, but it's not as simple as unfolding a recursive implementation. (Though that might be an incremental improvement — I'll have to think it over.)

P.S. I think your `isArray || isArray` should be `isArray && isArray`?

P.P.S. I was initially worried you were running into the potential regression described in #15 (comment) which was recently released in v4, but I'm pretty sure that's not the case here, since your issue is blowing out the stack (more similar to #10), not just degraded performance on v4 vs. v3.

amiller-gh (Author) commented Aug 15, 2020

I think there are two different concerns we should separate out here 🙂 I see them as:

  1. In the worst case, what should the behavior of this library be?

    Currently, the behavior is to crash the process. I would argue that an unbounded CPU operation is the lesser of two evils here! Either option will be quickly noticed by an engineer as a perf bottleneck, but only one of them comes with potential data loss.

  2. How can we best prevent the worst-case situation from happening?

    This is where we get into the very fun questions you pose above! I'd agree that Levenshtein as a one-size-fits-all solution is probably a little optimistic given the amazing array (pun intended) of use cases this library might see, and some mix of all the methods outlined above is likely the answer. Perhaps as pre-packaged array diff strategies, made available in the library to pass as a custom diff, so users don't always have to write their own (see the sketch below)!
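
As a strawman for what one such pre-packaged strategy could look like (hypothetical factory, not an existing export, building on my workaround above):

```ts
import { diffAny, Operation } from 'rfc6902/diff';

// Hypothetical strategy factory: caps how large an array the default
// Levenshtein diff will attempt, falling back to one wholesale replace of
// the array above that bound.
function boundedArrayDiff(limit: number) {
  const diff = (input: any, output: any, ptr: any): Operation[] | void => {
    if (Array.isArray(input) && Array.isArray(output) &&
        (input.length > limit || output.length > limit)) {
      return [{ op: 'replace', path: ptr.toString(), value: output }];
    }
    return diffAny(input, output, ptr, diff); // default behavior otherwise
  };
  return diff;
}

// usage: createPatch(input, output, boundedArrayDiff(5000))
```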

Because there is a simple – and backwards-compatible – way for this library to keep the exact same behavior but mitigate the worst-case risk for the end user, the answer to concern 1 seems fairly straightforward to me! Concern 2 comes with a lot more (very fun) optimization and use-case questions that are definitely deserving of a standalone RFC 😅 (Happy to tag-team drafting some proposals.)

P.S. Yup! Good catch.
