gsub lookahead cannot allocate memory #2354

Closed
stoat1 opened this issue Oct 3, 2021 · 2 comments · Fixed by #2641

@stoat1

stoat1 commented Oct 3, 2021

Describe the bug
When using a regex lookahead, jq freezes, consumes all system memory, prints an error message, and eventually aborts.

To Reproduce

$ jq -n '"qux" | gsub("(?<=u)"; "u")'
"quux"

$ free -g
              total        used        free      shared  buff/cache   available
Mem:             15           5           6           0           3           8
Swap:             1           0           1

$ jq -n '"qux" | gsub("(?=u)"; "u")'
error: cannot allocate memory
[1]    123456 abort (core dumped)  jq -n '"qux" | gsub("(?=u)"; "u")'

Expected behavior
jq outputs the result and exits.

Environment

  • Ubuntu 20.04.2 LTS
  • jq-1.6

Additional context
I'm aware that this may be an issue in Oniguruma, but I wasn't able to reproduce it there due to my lack of experience with C.

@pkoppstein
Contributor

pkoppstein commented Oct 3, 2021

TL;DR: jq -n '"a"|gsub("^";"")' and similar expressions (where the regex succeeds without consuming any characters) cause an infinite loop.

(1) The first example you gave (using "(?<=u)") involves a look-behind RE. Using it, I was not able to reproduce the problem on macOS, though it's interesting that gojq doesn't like the RE at all, and prints it out in a curious way. Details are below.

(2) gsub("(?=u)"; "u") does present problems for both jq and gojq, though for different reasons. jq goes into an infinite loop, as it does for all expressions like gsub("^";""). It handles sub("(?=u)"; "u") properly. Rightly or wrongly, with regard to gsub, jq tends to take the caveat emptor view, the point being that jq's gsub does not check for quiescence.
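
For readers following along, here is a minimal Python sketch of how a global substitution can fail to terminate on a zero-width match, and how a progress check avoids that. It is not jq's actual implementation: `naive_gsub`, `guarded_gsub`, and the `max_depth` cutoff are hypothetical names invented for this illustration, and Python's `re` module stands in for Oniguruma.

```python
import re


def naive_gsub(pattern, repl, text, max_depth=50):
    """Replace the first match, then recurse on the text after the match.

    Illustrative only. With a zero-width match the remaining text never
    shrinks, so the recursion never ends; max_depth (a hypothetical guard)
    turns the endless loop into an error for demonstration purposes.
    """
    if max_depth == 0:
        raise RecursionError("no progress: zero-width match repeats forever")
    m = re.search(pattern, text)
    if m is None:
        return text
    rest = text[m.end():]  # equals text[m.start():] when the match is empty
    return text[:m.start()] + repl + naive_gsub(pattern, repl, rest, max_depth - 1)


def guarded_gsub(pattern, repl, text):
    """Same idea, but always advance past a zero-width match."""
    rx = re.compile(pattern)
    out = []
    pos = 0
    while pos <= len(text):
        m = rx.search(text, pos)
        if m is None:
            break
        out.append(text[pos:m.start()])
        out.append(repl)
        if m.end() == m.start():
            # Zero-width match: copy one character through and step over it,
            # so the next search starts strictly later in the string.
            if m.start() < len(text):
                out.append(text[m.start()])
            pos = m.start() + 1
        else:
            pos = m.end()
    out.append(text[pos:])
    return "".join(out)


print(naive_gsub("(?<=u)", "u", "qux"))   # look-behind terminates: quux
print(guarded_gsub("(?=u)", "u", "qux"))  # lookahead with the guard: quux
print(guarded_gsub("^", "", "a"))         # a
```

In this sketch the look-behind case terminates even in the naive version because the text left after the match ("x") no longer has a "u" before any position, whereas the lookahead case keeps matching at the start of the same remainder ("ux"). Calling `naive_gsub("(?=u)", "u", "qux")` raises `RecursionError` here instead of exhausting memory.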

Details for look-behind:

for jq in jq1.5 jq1.6 jqMaster gojqMaster ; do
    $jq --version
    echo gsub
    $jq -n '"qux" | gsub("(?<=u)"; "u")'
    echo sub
    $jq -n '"qux" | sub("(?<=u)"; "u")'
    echo
done

Output:

jq-1.5
gsub
"quux"
sub
"quux"

jq-1.6
gsub
"quux"
sub
"quux"

jq-1.6-129-g80052e5-dirty
gsub
"quux"
sub
"quux"

gojq 0.12.5 (rev: 4ada50f/go1.17)
gsub
gojq: invalid regular expression "(?P<=u)": error parsing regexp: invalid named capture: `(?P<=u)`
sub
gojq: invalid regular expression "(?P<=u)": error parsing regexp: invalid named capture: `(?P<=u)`

@stoat1
Author

stoat1 commented Oct 9, 2021

@pkoppstein thank you for your findings.

I'd like to add that the same regex works as expected in tools that use other regex flavors, e.g.:

$ awk 'gsub("^", "")' <<<a
a

jshell> "a".replaceAll("^", "")
$1 ==> "a"

> console.log('a'.replaceAll(/^/g, ''))
a

$ python -c '
import re
print(re.sub("^", "", "a"))
'
a

itchyny added the bug label Jun 3, 2023
pkoppstein added a commit to pkoppstein/jq that referenced this issue Jun 29, 2023
… uniq(stream)

The primary purpose of this commit (which supersedes PR
jqlang#2624) is to rectify most problems
with `gsub` (and also `sub` with the "g" option), in particular jqlang#1425
('\b'), jqlang#2354 (lookahead), and jqlang#2532 (regex == "^(?!cd ).*$|^cd ";"")).

This commit also partly resolves jqlang#2148 and jqlang#1206 in that `gsub` no
longer loops infinitely; however, because the new `gsub` depends
critically on match(_;"g"), the behavior when regex == "" is sometimes
non-standard. [*1]

Since the new sub/3 relies on uniq/1, that has been added as well [*2].

The documentation has been updated to reflect the fact that `sub` and
`gsub` are intended to be regular in the second argument. [*3]

Also, _nwise/1 has been tweaked to take advantage of TCO.

Footnotes:

[*1] Using the new gsub, '"a" | gsub( ""; "a")' emits "aa" rather than
"aaa" as would be standard.  This is nevertheless better than the
infinite loop behavior of jq 1.6 in this case.

With one exception (as explained in [*2]), the new gsub is implemented
as though match/2 behavior is correct.  That is, bugs in `gsub`
behavior will most likely have their origin in `match/2`.
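
As a point of comparison for what "standard" means in [*1], here is a small Python sketch. It only shows how Python's `re` module treats an empty regex; it says nothing about how jq's `match/2` is implemented, and `empty_match_offsets` is a name made up for this illustration.

```python
import re

# An empty regex matches at every position, including the end of the string.
# For the one-character input "a" that means offsets 0 and 1.
empty_match_offsets = [m.start() for m in re.finditer("", "a")]
print(empty_match_offsets)  # [0, 1]

# Inserting "a" at both offsets gives three characters in total.
print(re.sub("", "a", "a"))  # aaa
```

A result of "aa" would correspond to only one of those two empty matches being replaced.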

[*2] `uniq/1` adopts the Unix/Linux name and semantics; it is needed for the following test case:

gsub("(?=u)"; "u")
"qux"
"quux"

Without this functionality:

Test jqlang#23: 'gsub("(?=u)"; "u")' at line number 100
*** Expected "quux", but got "quuux" for test at line number 102: gsub("(?=u)"; "u")

The root of the problem here is `match`: if `match` were fixed, then `gsub` would not need `uniq`.

The addition of `uniq` as a top-level function should be a non-issue
relative to general concern about builtin.jq bloat: the line count of
the new builtin.jq is significantly reduced overall, and the number of
defs is actually reduced by 1 (from 111 (ignoring a redundant def) to 110).

[*3] See e.g. jqlang#513 (comment)
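
To make footnote [*2] concrete, here is a rough Python sketch of Unix-style `uniq` semantics (collapse runs of adjacent equal items) and of why de-duplicating matches changes the result. The duplicated match list is an assumption made for illustration: the commit message only says the problem originates in `match`, not exactly what it emits. `apply_edits` and the `(offset, length)` tuples are hypothetical and do not mirror jq's internals.

```python
from itertools import groupby


def uniq(stream):
    """Collapse runs of adjacent equal items, like Unix `uniq`."""
    return [key for key, _ in groupby(stream)]


def apply_edits(text, edits, repl):
    """Insert repl at each (offset, length) edit, given in left-to-right order."""
    out = []
    pos = 0
    for offset, length in edits:
        out.append(text[pos:offset])
        out.append(repl)
        pos = offset + length
    out.append(text[pos:])
    return "".join(out)


# Hypothetical: the same zero-width match at offset 1 is reported twice.
duplicated = [(1, 0), (1, 0)]

print(apply_edits("qux", duplicated, "u"))        # quuux
print(apply_edits("qux", uniq(duplicated), "u"))  # quux
```

Under that assumption the unfiltered edits reproduce the "quuux" from the failing test, and filtering adjacent duplicates gives the expected "quux".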
itchyny added this to the 1.7 release milestone Jul 2, 2023
itchyny pushed a commit that referenced this issue Jul 3, 2023
The primary purpose of this commit is to rectify most problems with
`gsub` (and also `sub` with the `g` option), in particular fix #1425 ('\b'),
fix #2354 (lookahead), and fix #2532 (regex == `"^(?!cd ).*$|^cd "`).

This commit also partly resolves #2148 and resolves #1206 in that
`gsub` no longer loops infinitely; however, because the new `gsub`
depends critically on `match/2`, the behavior when regex == `""` is
sometimes non-standard.

The documentation has been updated to reflect the fact that `sub`
and `gsub` are intended to be regular in the second argument.

Also, `_nwise/1` has been tweaked to take advantage of TCO.