Make parsing of text be non-quadratic. #579
Conversation
In Python, appending strings is not guaranteed to be constant-time, since they are documented to be immutable. In some corner cases, CPython is able to make these operations constant-time, but reaching into ETree objects is not such a case.

This leads to parse times being quadratic in the size of the text in the input in pathological cases where parsing outputs a large number of adjacent text nodes which must be combined (e.g. HTML-escaped values). Specifically, we expect doubling the size of the input to approximately double the time to parse; instead, we observe quadratic behavior:

```
In [1]: import html5lib

In [2]: %timeit -n1 -r5 html5lib.parse("<" * 200000)
2.99 s ± 269 ms per loop (mean ± std. dev. of 5 runs, 1 loop each)

In [3]: %timeit -n1 -r5 html5lib.parse("<" * 400000)
6.7 s ± 242 ms per loop (mean ± std. dev. of 5 runs, 1 loop each)

In [4]: %timeit -n1 -r5 html5lib.parse("<" * 800000)
19.5 s ± 1.48 s per loop (mean ± std. dev. of 5 runs, 1 loop each)
```

Switch from appending text to the internal `str` to appending it to an array of text chunks, since those appends can be done in constant time. Using `bytearray` is a similar solution, but it benchmarks slightly worse because the strings must be encoded before being appended.

This improves parsing of text documents noticeably:

```
In [1]: import html5lib

In [2]: %timeit -n1 -r5 html5lib.parse("<" * 200000)
2.3 s ± 373 ms per loop (mean ± std. dev. of 5 runs, 1 loop each)

In [3]: %timeit -n1 -r5 html5lib.parse("<" * 400000)
3.85 s ± 29.7 ms per loop (mean ± std. dev. of 5 runs, 1 loop each)

In [4]: %timeit -n1 -r5 html5lib.parse("<" * 800000)
8.04 s ± 317 ms per loop (mean ± std. dev. of 5 runs, 1 loop each)
```
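A minimal sketch of the chunk-buffer idea (the class name and API here are illustrative, not the PR's actual implementation): chunks are appended to a list in amortized O(1), and the full string is materialized with a single O(n) join only when needed.

```python
class TextBuffer:
    """Accumulate text chunks in a list and join lazily.

    Illustrative sketch of the chunk-list approach, not code from this PR.
    """

    def __init__(self, text=""):
        self._chunks = [text] if text else []

    def append(self, text):
        # list.append is amortized O(1), so n appends cost O(n) total,
        # versus O(n^2) for repeated str concatenation on a non-local str.
        self._chunks.append(text)

    def __str__(self):
        # Join once when the full text is needed: O(total length).
        return "".join(self._chunks)


buf = TextBuffer()
for _ in range(3):
    buf.append("&lt;")
print(str(buf))  # → &lt;&lt;&lt;
```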
This solution can’t work, as it’s a breaking change to the public API.

Before:

```
>>> html5lib.parse("hello")[1].text
'hello'
```

After:

```
>>> html5lib.parse("hello")[1].text
<html5lib.treebuilders.etree.TextBuffer object at 0x7ff2e31268d0>
```
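One way to keep `.text` a plain `str` while still buffering appends is to accumulate chunks off to the side and flush them into the element before the tree is exposed. This is a hypothetical sketch (the helper class is my invention, not this PR's code), using the standard-library `xml.etree.ElementTree`:

```python
import xml.etree.ElementTree as ET


class NodeTextAccumulator:
    """Hypothetical helper: buffer text chunks per element, flush before
    handing the tree to callers so elem.text stays a plain str."""

    def __init__(self):
        self._pending = {}  # element -> list of text chunks

    def append_text(self, elem, chunk):
        # Amortized O(1) per chunk, instead of O(len(elem.text)).
        self._pending.setdefault(elem, []).append(chunk)

    def flush(self):
        # Join each buffer once and store a plain str on the element.
        for elem, chunks in self._pending.items():
            elem.text = (elem.text or "") + "".join(chunks)
        self._pending.clear()


root = ET.Element("p")
acc = NodeTextAccumulator()
for ch in "abc":
    acc.append_text(root, ch)
acc.flush()
print(root.text)  # → abc
```

With this shape, consumers of the tree never see the buffer type, so the public API is unchanged.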
From what I can see, there are also plenty of similar string-append operations elsewhere, e.g. in html5lib-python/html5lib/_tokenizer.py (line 215 at fd4f032).
@lopuhin That line is slow even in CPython. In CPython, appending a character is only O(1) if the string is a local variable inside a function with no other references. It is O(n) when the string is reachable through an object attribute, a container, a global, or a nonlocal, as the code below demonstrates:

```
import timeit

def linear_local(n):
    s = ""
    for i in range(n):
        s += "a"  # fast

def quadratic_object(n):
    class C: pass
    c = C()
    c.s = ""
    for i in range(n):
        c.s += "a"  # slow

def quadratic_array(n):
    a = [""]
    for i in range(n):
        a[0] += "a"  # slow

def quadratic_global(n):
    global s
    s = ""
    for i in range(n):
        s += "a"  # slow

def quadratic_nonlocal(n):
    s = ""
    def inner():
        nonlocal s
        for i in range(n):
            s += "a"  # slow
    inner()

for f in [linear_local, quadratic_object, quadratic_array, quadratic_global, quadratic_nonlocal]:
    for n in [100000, 200000, 400000, 800000]:
        print(f.__name__, n, timeit.timeit(lambda: f(n), number=1))
```

Output with CPython 3.13.2:
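The standard fix for the slow cases above is to accumulate pieces in a list and join once at the end; this stays linear regardless of where the buffer lives, even as an object attribute. A small sketch of mine (not from the thread):

```python
import timeit


def linear_join(n):
    # list.append is amortized O(1); one final join is O(total length).
    parts = []
    for i in range(n):
        parts.append("a")
    return "".join(parts)


class C:
    pass


def linear_object_join(n):
    # Unlike `c.s += "a"`, a list-based buffer stored on an object
    # attribute does not trigger quadratic behavior.
    c = C()
    c.parts = []
    for i in range(n):
        c.parts.append("a")
    return "".join(c.parts)


for n in [100000, 200000, 400000, 800000]:
    t = timeit.timeit(lambda: linear_object_join(n), number=1)
    print(n, t)  # time roughly doubles as n doubles
```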
Good point, thank you! Indeed I can reproduce the slowness on a particular HTML document under CPython as well, although the difference is smaller than under GraalPy.
Old flamegraph: *(image: flamegraph 08-26-46 html5lib)*

New flamegraph: *(image: flamegraph 08-27-00 html5lib)*