len=0 was previously OK if there had previously been a non-zero call.
It makes sense to allow an application to work the same way on a
newly-created parser, and not have to care if its incoming buffer
happens to be 0.
If we always run with the heuristic enabled, it may hide some bugs by
grouping up input into bigger parse attempts.
CI-fighting-assistance-by: Sebastian Pipping <sebastian@pipping.org>
When the parse buffer contains the starting bytes of a token but not
all of them, we cannot parse the token to completion. We call this a
partial token. When this happens, the parse position is reset to the
start of the token, and the parse() call returns. The client is then
expected to provide more data and call parse() again.
In extreme cases, this means that the bytes of a token may be parsed
many times: once for every buffer refill required before the full token
is present in the buffer.
Math:
Assume there's a token of T bytes
Assume the client fills the buffer in chunks of X bytes
We'll try to parse X, 2X, 3X, 4X ... until mX == T (technically >=)
That's (m²+m)X/2 = (T²/X+T)/2 bytes parsed (arithmetic progression)
While it is alleviated by larger refills, this amounts to O(T²)
Expat grows its internal buffer by doubling it when necessary, but has
no way to inform the client about how much space is available. Instead,
we add a heuristic that skips parsing when we've repeatedly stopped on
an incomplete token. Specifically:
* Only try to parse if we have a certain amount of data buffered
* Every time we stop on an incomplete token, double the threshold
* As soon as any token completes, the threshold is reset
This means that when we get stuck on an incomplete token, the threshold
grows exponentially, effectively making the client perform larger buffer
fills, limiting how many times we can end up re-parsing the same bytes.
Math:
Assume there's a token of T bytes
Assume the client fills the buffer in chunks of X bytes
We'll try to parse X, 2X, 4X, 8X ... until (2^k)X == T (or larger)
That's (2^(k+1)-1)X bytes parsed -- e.g. 15X if T = 8X
This is equal to 2T-X, which amounts to O(T)
We could've chosen a faster growth rate, e.g. 4 or 8. Those seem to
increase performance further, at the cost of further increasing the
risk of growing the buffer more than necessary. This can easily be
adjusted in the future, if desired.
This is all completely transparent to the client, except for:
1. possible delay of some callbacks (when our heuristic overshoots)
2. apps that never do isFinal=XML_TRUE could miss data at the end
For the affected testdata, this change shows a 100-400x speedup.
The recset.xml benchmark shows no clear change either way.
Before:
benchmark -n ../testdata/largefiles/recset.xml 65535 3
3 loops, with buffer size 65535. Average time per loop: 0.270223
benchmark -n ../testdata/largefiles/aaaaaa_attr.xml 4096 3
3 loops, with buffer size 4096. Average time per loop: 15.033048
benchmark -n ../testdata/largefiles/aaaaaa_cdata.xml 4096 3
3 loops, with buffer size 4096. Average time per loop: 0.018027
benchmark -n ../testdata/largefiles/aaaaaa_comment.xml 4096 3
3 loops, with buffer size 4096. Average time per loop: 11.775362
benchmark -n ../testdata/largefiles/aaaaaa_tag.xml 4096 3
3 loops, with buffer size 4096. Average time per loop: 11.711414
benchmark -n ../testdata/largefiles/aaaaaa_text.xml 4096 3
3 loops, with buffer size 4096. Average time per loop: 0.019362
After:
./run.sh benchmark -n ../testdata/largefiles/recset.xml 65535 3
3 loops, with buffer size 65535. Average time per loop: 0.269030
./run.sh benchmark -n ../testdata/largefiles/aaaaaa_attr.xml 4096 3
3 loops, with buffer size 4096. Average time per loop: 0.044794
./run.sh benchmark -n ../testdata/largefiles/aaaaaa_cdata.xml 4096 3
3 loops, with buffer size 4096. Average time per loop: 0.016377
./run.sh benchmark -n ../testdata/largefiles/aaaaaa_comment.xml 4096 3
3 loops, with buffer size 4096. Average time per loop: 0.027022
./run.sh benchmark -n ../testdata/largefiles/aaaaaa_tag.xml 4096 3
3 loops, with buffer size 4096. Average time per loop: 0.099360
./run.sh benchmark -n ../testdata/largefiles/aaaaaa_text.xml 4096 3
3 loops, with buffer size 4096. Average time per loop: 0.017956
When the parser is suspended, _XML_Parse_SINGLE_BYTES() will return
early. At that point, there could be some amount of bytes that haven't
been fed into Expat at all yet. This leaves us with an incomplete
document.
Furthermore, the last internal XML_Parse() call with isFinal=XML_TRUE
will not have happened, so the parser will not know that no more input
is to be expected. This is what allowed the test to pass when it was
originally changed to use SINGLE_BYTES.
With the new partial token heuristic, the lack of a final parse call
means that we don't even reach the "Ho" text, and fail the test.
The simplest solution is to go back to using XML_Parse() in this test.
Another option would be to let SINGLE_BYTES expose how far it got in
its loop, allowing for later continuation, but it doesn't seem worth the
extra complexity.
Some of these currently take a very long time to parse. I set those to
only run one loop in the run-benchmark make target.
4096 may be a fairly small buffer, and definitely make the problem worse
than it otherwise would've been, but similar sizes exist in real code:
* 2048 bytes in cpython Modules/pyexpat.c
* 4096 bytes in skia SkXMLParser.cpp
* BUFSIZ bytes (8192 on my machine) in expat/examples
The files, too, are inspired by real-life examples: Android stores
depth and gain maps as base64-encoded JPEGs inside the XMP data of
other JPEGs. Sometimes as a text element, sometimes as an attribute
value. I've seen attribute values slightly over 5 MiB in size.
clang-tidy output was:
> [..]/libexpat/expat/tests/misc_tests.c:112:23: note: The value '-1' provided to the cast expression is not in the valid range of values for 'XML_Error'
> 112 | if (XML_ErrorString((enum XML_Error) - 1) != NULL)
> | ^~~~~~~~~~~~~~~~~~~~
> [..]/libexpat/expat/tests/misc_tests.c:114:23: error: The value '100' provided to the cast expression is not in the valid range of values for 'XML_Error' [clang-analyzer-optin.core.EnumCastOutOfRange,-warnings-as-errors]
> 114 | if (XML_ErrorString((enum XML_Error)100) != NULL)
> | ^~~~~~~~~~~~~~~~~~~
Cppcheck output was:
> expat/lib/xmlparse.c:67:4: error: #error XML_GE (for general entities) must be defined, [..]
> # error XML_GE (for general entities) must be defined, [..]
> ^
clang-tidy output was:
> [..]/tests/handlers.h:502:64: error: parameter 'index' is const-qualified in the function declaration; const-qualification of parameters only has an effect in function definitions [readability-avoid-const-params-in-decls,-warnings-as-errors]
> 502 | _handler_record_get(const struct handler_record_list *storage, const int index,
> | ^~~~~
> [..]/tests/handlers.h:503:39: error: parameter 'line' is const-qualified in the function declaration; const-qualification of parameters only has an effect in function definitions [readability-avoid-const-params-in-decls,-warnings-as-errors]
> 503 | const char *file, const int line);
> | ^~~~~
clang-tidy output was:
> [..]/tests/acc_tests.c:368:9: warning: Null pointer passed to 1st parameter expecting 'nonnull' [clang-analyzer-core.NonNullParamChecker]
> 368 | if (strlen(printable) < (size_t)1)
> | ^ ~~~~~~~~~
Note: It was harmless because fail(..) right before catches that case.