File "doc/xmlwf.1" should not be cleaned when building with
"./configure --without-docbook", and re-compilation of the file
should take precedence over a pre-built copy where available.
Also, variable CLEANFILES can be used to simplify things a bit
in Makefile.am.
This removes the dependency on CLOCKS_PER_SEC that prevented this test
from running properly on some platforms, as well as the inherent
flakiness of time measurements.
Since later commits have introduced g_bytesScanned (and before that,
g_parseAttempts), we can use that value as a proxy for parse time
instead of clock().
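In test terms, the idea looks roughly like this (a hypothetical sketch;
the real test's setup, input, and threshold differ):

    /* Use the scanned-bytes counter as a machine-independent proxy
     * for parse effort, instead of clock() deltas. */
    g_bytesScanned = 0;
    assert_true(XML_Parse(parser, text, (int)strlen(text), XML_TRUE)
                == XML_STATUS_OK);
    /* Linear-time expectation: bytes scanned stay within a small
     * constant factor of the input size (the factor is illustrative). */
    assert_true(g_bytesScanned < 10 * strlen(text));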
The bypass works on the assumption that the application uses a
consistent fill size. Let's make some assertions about what should
happen when the application doesn't do that -- most importantly,
that parsing does happen eventually, and that the number of scanned
bytes doesn't explode.
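A hedged sketch of the shape of such assertions (doc, docLen, the
starting fill size, and the factor of 10 are all illustrative; assumes
the usual <string.h> include):

    /* Feed the parser with a halving -- i.e. inconsistent -- fill size. */
    size_t fill = 8192;
    size_t offset = 0;
    g_bytesScanned = 0;
    while (offset < docLen) {
      const size_t chunk = fill < docLen - offset ? fill : docLen - offset;
      void *buf = XML_GetBuffer(parser, (int)chunk);
      assert_true(buf != NULL);
      memcpy(buf, doc + offset, chunk);
      offset += chunk;
      assert_true(XML_ParseBuffer(parser, (int)chunk, offset == docLen)
                  != XML_STATUS_ERROR);
      if (fill > 1)
        fill /= 2; /* shrink the fill every iteration */
    }
    /* Parsing did happen eventually, and scanning stayed bounded. */
    assert_true(g_bytesScanned < 10 * docLen);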
The key is to have __attribute__((noreturn)) somewhere that clang-tidy
can see it. In this case, this is the _fail() function, which is
conditionally called from the assert_true() macro.
This will ensure that clang-tidy doesn't complain about NULL values
that we've asserted against in tests.
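The pattern, simplified (the real harness differs in details):

    /* Marked noreturn so clang-tidy knows a failed assertion never
     * falls through. */
    __attribute__((noreturn))
    static void _fail(const char *file, int line, const char *msg);

    #define assert_true(cond)                                          \
      do {                                                             \
        if (! (cond))                                                  \
          _fail(__FILE__, __LINE__, #cond);                            \
      } while (0)

    /* After assert_true(p != NULL), clang-tidy can prove that control
     * continues only when p is non-NULL. */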
...instead of only when approaching the maximum buffer size INT_MAX/2+1.
We'd like to give applications a chance to finish parsing a large token
before buffer reallocation, in case the reallocation fails.
By bypassing the reparse deferral heuristic when getting close to
filling the buffer, we give them this chance -- if the whole token is
present in the buffer, it will be parsed at that time.
This may come at the cost of some extra reparse attempts. For a token
of n bytes, these extra parses cause us to scan over a maximum of
2n bytes (... + n/8 + n/4 + n/2 + n). Therefore, parsing of big tokens
remains O(n) with regard to how many bytes we scan in attempts to parse.
The
cost in reality is lower than that, since the reparses that happen due
to the bypass will affect m_partialTokenBytesBefore, delaying the next
ratio-based reparse. Furthermore, only the first token that "breaks
through" a buffer ceiling takes that extra reparse attempt; subsequent
large tokens will only bypass the heuristic if they manage to hit the
new buffer ceiling.
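Spelled out (a quick sanity check of the bound above, in LaTeX
notation), the attempts form a geometric series:

    \sum_{k=0}^{\infty} \frac{n}{2^k} = n \cdot \frac{1}{1 - 1/2} = 2n

so the extra parse attempts at most double the total bytes scanned.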
Note that this cost analysis depends on the assumption that Expat grows
its buffer by doubling it (or, more generally, grows it exponentially).
If this changes, the cost of this bypass may increase. Hopefully, this
would be caught by test_big_tokens_take_linear_time or the new test.
The bypass logic assumes that the application uses a consistent fill.
If the app increases its fill size, it may miss the bypass (and the
normal heuristic will apply). If the app decreases its fill size, the
bypass may be hit multiple times for the same buffer size. The very
worst case would be to always fill half of the remaining buffer space,
in which case parsing of a large n-byte token becomes O(n log n).
As an added bonus, the new test case should be faster than the old one,
since it doesn't have to go all the way to 1GiB to check the behavior.
Finally, this change necessitated a small modification to two existing
tests related to reparse deferral. These tests exercise the
deferral-enabled setting and assume that reparsing will not happen for
any other reason. By pre-growing the buffer, we make sure that this new
bypass does not affect those test cases.
For huge tokens, we may end up in a situation where the partial token
parse deferral heuristic demands more bytes than Expat's maximum buffer
size (currently ~half of INT_MAX) could fit.
INT_MAX/2 is 1024 MiB on most systems. Clearly, a token of 950 MiB could
fit in that buffer, but the reparse threshold might be such that
callProcessor() will defer it, allowing the app to keep filling the
buffer until XML_GetBuffer() eventually returns a memory error.
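As a concrete, hypothetical illustration (the exact deferral ratio is an
internal detail): with a ratio of 2, if a previous attempt already saw
600 MiB of the partial token, the heuristic waits for 2 * 600 MiB =
1200 MiB of unparsed input before retrying -- which can never fit under
the ~1024 MiB ceiling.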
Bypassing the heuristic when we're getting close to the maximum buffer
size makes it once again possible to reliably parse tokens in the size
range INT_MAX/2/ratio < size < INT_MAX/2.
We subtract the last buffer fill size as a way to detect that the next
XML_GetBuffer() call has a risk of returning a memory error -- assuming
that the application is likely to keep using the same (or smaller) fill.
We subtract XML_CONTEXT_BYTES because that's the maximum number of bytes
that could remain at the start of the buffer, preceding the partial
token. Technically, it could be fewer bytes, but XML_CONTEXT_BYTES is
normally small relative to INT_MAX, and is much simpler to use.
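Putting the two subtractions together, the bypass condition is roughly
the following (a sketch with illustrative variable names, not Expat's
exact code; assumes lastFillSize stays well below the cap):

    /* Parse now if one more fill of the same size might no longer fit. */
    const size_t maxBuf = INT_MAX / 2;                  /* ~buffer cap */
    const size_t threshold = maxBuf - lastFillSize - XML_CONTEXT_BYTES;
    if (bytesInBuffer >= threshold) {
      /* bypass the deferral heuristic while XML_GetBuffer() can still
       * succeed */
    }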
Co-authored-by: Sebastian Pipping <sebastian@pipping.org>
The test is essentially a copy of the existing test for the setter,
adapted to run on the external parser instead of the original one.
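Roughly, inside an external-entity handler (a simplified sketch, where
context is the handler's context argument and error handling is
omitted):

    XML_Parser ext = XML_ExternalEntityParserCreate(parser, context, NULL);
    assert_true(ext != NULL);
    /* The setter must also work on external entity parsers. */
    assert_true(XML_SetReparseDeferralEnabled(ext, XML_FALSE) == XML_TRUE);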
Suggested-by: Sebastian Pipping <sebastian@pipping.org>
CI-fighting-assistance-by: Sebastian Pipping <sebastian@pipping.org>