Make parsing of text be non-quadratic. #579

Open · wants to merge 1 commit into master

Conversation

@alexmv commented Feb 27, 2024

In Python, appending strings is not guaranteed to be constant-time, since they are documented to be immutable. In some corner cases, CPython is able to make these operations constant-time, but reaching into ETree objects is not such a case.

This leads to parse times that are quadratic in the size of the input text, in pathological cases where parsing outputs a large number of adjacent text nodes which must be combined (e.g. HTML-escaped values). Specifically, we would expect doubling the size of the input to approximately double the parse time; instead, we observe quadratic behavior:

```
In [1]: import html5lib

In [2]: %timeit -n1 -r5 html5lib.parse("<" * 200000)
2.99 s ± 269 ms per loop (mean ± std. dev. of 5 runs, 1 loop each)

In [3]: %timeit -n1 -r5 html5lib.parse("<" * 400000)
6.7 s ± 242 ms per loop (mean ± std. dev. of 5 runs, 1 loop each)

In [4]: %timeit -n1 -r5 html5lib.parse("<" * 800000)
19.5 s ± 1.48 s per loop (mean ± std. dev. of 5 runs, 1 loop each)
```

Switch from appending to the internal `str` to appending text chunks to a list, since list appends are amortized constant-time. Using `bytearray` is a similar solution, but it benchmarks slightly worse because the strings must be encoded before being appended.
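
For illustration, a minimal sketch of the chunk-list approach; the `TextBuffer` name appears in the discussion below, but the internals here are an assumption, not the PR's exact code:

```
class TextBuffer:
    """Accumulate text fragments in a list; join lazily on read."""

    def __init__(self, initial=""):
        # list.append is amortized O(1); str += would copy the whole buffer
        self._chunks = [initial] if initial else []

    def append(self, text):
        self._chunks.append(text)

    def __str__(self):
        # One O(total-length) join when the full text is actually needed
        return "".join(self._chunks)


buf = TextBuffer()
for _ in range(5):
    buf.append("chunk")
assert str(buf) == "chunk" * 5
```

With this pattern, n appends cost O(n) overall instead of O(n²), because no intermediate string is ever copied.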

This improves parsing of text documents noticeably:

```
In [1]: import html5lib

In [2]: %timeit -n1 -r5 html5lib.parse("<" * 200000)
2.3 s ± 373 ms per loop (mean ± std. dev. of 5 runs, 1 loop each)

In [3]: %timeit -n1 -r5 html5lib.parse("<" * 400000)
3.85 s ± 29.7 ms per loop (mean ± std. dev. of 5 runs, 1 loop each)

In [4]: %timeit -n1 -r5 html5lib.parse("<" * 800000)
8.04 s ± 317 ms per loop (mean ± std. dev. of 5 runs, 1 loop each)
```

Old flamegraph: (image not reproduced here)

New flamegraph: (image not reproduced here)

@andersk commented Feb 28, 2024

This solution can’t work, as it’s a breaking change to the public API. Before:

```
>>> html5lib.parse("hello")[1].text
'hello'
```

After:

```
>>> html5lib.parse("hello")[1].text
<html5lib.treebuilders.etree.TextBuffer object at 0x7ff2e31268d0>
```

@lopuhin commented Mar 10, 2025

From what I can see, there are also plenty of operations in `_tokenizer.py` which assume that appending a character to a string is O(1). That is often the case in CPython, but not in other implementations, where having a pure-Python parser can be especially valuable. E.g. here:

```
self.currentToken["data"][-1][1] += output
```
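
For illustration, the same chunk-buffering idea could apply to this pattern; the snippet below is a hypothetical restructuring, not the tokenizer's actual code:

```
# Quadratic: `+=` on a string stored inside a data structure copies the
# whole accumulated value on every append, because the stored string has
# more than one reference and cannot be resized in place.
token = {"data": [["attr", ""]]}
for ch in "some long attribute value":
    token["data"][-1][1] += ch

# Amortized O(1) alternative (hypothetical): accumulate chunks in a list
# and join once when the token is finalized.
token = {"data": [["attr", []]]}
for ch in "some long attribute value":
    token["data"][-1][1].append(ch)
token["data"][-1][1] = "".join(token["data"][-1][1])
assert token["data"][-1][1] == "some long attribute value"
```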

@andersk commented Mar 10, 2025

@lopuhin That line is slow even in CPython.

In CPython, appending a character is only O(1) if the string is a local variable inside a function with no other references. It is O(n) for an object property `obj.prop` or an array element `arr[i]` (even if the object or array itself is a local variable), or for a global or nonlocal variable; in all of those cases, the string has a refcount of at least 2, which prevents it from being safely mutated in place and forces it to be copied.

```
import timeit

def linear_local(n):
    s = ""
    for i in range(n):
        s += "a"  # fast

def quadratic_object(n):
    class C: pass
    c = C()
    c.s = ""
    for i in range(n):
        c.s += "a"  # slow

def quadratic_array(n):
    a = [""]
    for i in range(n):
        a[0] += "a"  # slow

def quadratic_global(n):
    global s
    s = ""
    for i in range(n):
        s += "a"  # slow

def quadratic_nonlocal(n):
    s = ""
    def inner():
        nonlocal s
        for i in range(n):
            s += "a"  # slow
    inner()

for f in [linear_local, quadratic_object, quadratic_array, quadratic_global, quadratic_nonlocal]:
    for n in [100000, 200000, 400000, 800000]:
        print(f.__name__, n, timeit.timeit(lambda: f(n), number=1))
```

Output with CPython 3.13.2:

```
linear_local 100000 0.006017955995048396
linear_local 200000 0.013165883996407501
linear_local 400000 0.027179232012713328
linear_local 800000 0.052238386997487396
quadratic_object 100000 0.11766406099195592
quadratic_object 200000 0.5580674420052674
quadratic_object 400000 2.6726826040103333
quadratic_object 800000 12.140160495007876
quadratic_array 100000 0.12400677500409074
quadratic_array 200000 0.5755963019910268
quadratic_array 400000 2.642135899004643
quadratic_array 800000 11.990410245998646
quadratic_global 100000 0.12772354800836183
quadratic_global 200000 0.5731496340013109
quadratic_global 400000 2.738810390001163
quadratic_global 800000 12.154955972000607
quadratic_nonlocal 100000 0.1292998229910154
quadratic_nonlocal 200000 0.5955325639952207
quadratic_nonlocal 400000 2.6306100980000338
quadratic_nonlocal 800000 11.95639012400352
```

@lopuhin commented Mar 10, 2025

Good point, thank you! Indeed I can reproduce the slowness on a particular HTML document under CPython as well, although the difference is smaller than under GraalPy.
