Commit graph

902 commits

Author SHA1 Message Date
Sebastian Pipping
58ff7c39ea Sync file headers 2024-02-28 23:41:43 +01:00
Sebastian Pipping
dfe043fe6a Bump version to 2.6.1 2024-02-28 23:41:31 +01:00
Sebastian Pipping
1e028f2ef7 lib/expat.h: Expose billion laughs API for XML_DTD without XML_GE
Regression from commit caa2719863.
2024-02-28 20:47:45 +01:00
Sebastian Pipping
7e2a0da9ba lib: Hide some test-only code behind new macro XML_TESTING 2024-02-21 13:07:35 +01:00
Sebastian Pipping
a4a420eedc Autotools: Turn libexpatinternal.la into standalone library
.. so that we can now have code in say xmlparse.c that does not
end up in libexpat.so but still runs when executing the test suite.
2024-02-21 12:53:03 +01:00
Snild Dolkow
fe0177cd3f tests: Replace g_parseAttempts with g_bytesScanned
This was used to estimate the number of scanned bytes. Just exposing
that number directly will be more precise.
2024-02-13 13:57:35 +01:00
Taichi Haradaguchi
3f60a47cb5 Fix compiler warnings
> In file included from ./../lib/internal.h:149,
>                  from codepage.c:38:
> ./../lib/expat.h:1045:5: warning: "XML_GE" is not defined, evaluates to 0 [-Wundef]
>  1045 | #if XML_GE == 1
>       |     ^~~~~~
> ./../lib/internal.h:158:5: warning: "XML_GE" is not defined, evaluates to 0 [-Wundef]
>   158 | #if XML_GE == 1
>       |     ^~~~~~
2024-02-10 23:08:03 +09:00
clang-format 18.1.0
d4f958e345 Mass-apply clang-format 18.1.0 2024-02-08 15:21:53 +01:00
Sebastian Pipping
2a10e173ab Sync file headers 2024-02-06 14:13:00 +01:00
Sebastian Pipping
310a1977f4 Bump version to 2.6.0 2024-02-06 14:08:05 +01:00
Sebastian Pipping
9944b71234
Merge pull request #813 from libexpat/issue-812-protect-against-closing-entities-out-of-order
Protect against closing entities out of order (fixes #812)
2024-02-06 00:16:23 +01:00
clang-format 18.1.0
137a578087 Mass-apply clang-format 18.1.0 2024-01-30 22:57:09 +01:00
Sebastian Pipping
c4208e7fd1 lib/xmlparse.c: Protect against closing entities out of order 2024-01-30 02:40:31 +01:00
Snild Dolkow
8f8aaf5c8e tests: Check heuristic bypass with varying buffer fill sizes
The bypass works on the assumption that the application uses a
consistent fill size. Let's make some assertions about what should
happen when the application doesn't do that -- most importantly,
that parsing does happen eventually, and that the number of scanned
bytes doesn't explode.
2024-01-29 19:59:18 +01:00
Snild Dolkow
3d8141d26a Bypass partial token heuristic when nearing full buffer
...instead of only when approaching the maximum buffer size INT_MAX/2+1.

We'd like to give applications a chance to finish parsing a large token
before buffer reallocation, in case the reallocation fails.

By bypassing the reparse deferral heuristic when getting close to
filling the buffer, we give them this chance -- if the whole token is
present in the buffer, it will be parsed at that time.

This may come at the cost of some extra reparse attempts. For a token
of n bytes, these extra parses cause us to scan over a maximum of
2n bytes (... + n/8 + n/4 + n/2 + n). Therefore, parsing of big tokens
remains O(n) in regard to how many bytes we scan in attempts to parse. The
cost in reality is lower than that, since the reparses that happen due
to the bypass will affect m_partialTokenBytesBefore, delaying the next
ratio-based reparse. Furthermore, only the first token that "breaks
through" a buffer ceiling takes that extra reparse attempt; subsequent
large tokens will only bypass the heuristic if they manage to hit the
new buffer ceiling.

Note that this cost analysis depends on the assumption that Expat grows
its buffer by doubling it (or, more generally, grows it exponentially).
If this changes, the cost of this bypass may increase. Hopefully, this
would be caught by test_big_tokens_take_linear_time or the new test.

The bypass logic assumes that the application uses a consistent fill.
If the app increases its fill size, it may miss the bypass (and the
normal heuristic will apply). If the app decreases its fill size, the
bypass may be hit multiple times for the same buffer size. The very
worst case would be to always fill half of the remaining buffer space,
in which case parsing of a large n-byte token becomes O(n log n).

As an added bonus, the new test case should be faster than the old one,
since it doesn't have to go all the way to 1GiB to check the behavior.

Finally, this change necessitated a small modification to two existing
tests related to reparse deferral. These tests are testing the deferral
enabled setting, and assume that reparsing will not happen for any other
reason. By pre-growing the buffer, we make sure that this new deferral
does not affect those test cases.
2024-01-29 17:09:36 +01:00
Snild Dolkow
60b7420989 Bypass partial token heuristic when close to maximum buffer size
For huge tokens, we may end up in a situation where the partial token
parse deferral heuristic demands more bytes than Expat's maximum buffer
size (currently ~half of INT_MAX) could fit.

INT_MAX/2 is 1024 MiB on most systems. Clearly, a token of 950 MiB could
fit in that buffer, but the reparse threshold might be such that
callProcessor() will defer it, allowing the app to keep filling the
buffer until XML_GetBuffer() eventually returns a memory error.

By bypassing the heuristic when we're getting close to the maximum
buffer size, it will once again be possible to parse tokens in the size
range INT_MAX/2/ratio < size < INT_MAX/2 reliably.

We subtract the last buffer fill size as a way to detect that the next
XML_GetBuffer() call has a risk of returning a memory error -- assuming
that the application is likely to keep using the same (or smaller) fill.

We subtract XML_CONTEXT_BYTES because that's the maximum amount of bytes
that could remain at the start of the buffer, preceding the partial
token. Technically, it could be fewer bytes, but XML_CONTEXT_BYTES is
normally small relative to INT_MAX, and is much simpler to use.
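
The two subtractions described above can be sketched as a small predicate.
This is a hypothetical helper written for illustration, not Expat's actual
code: the function name, its parameters, and the ">=" comparison are all
assumptions; only the idea of subtracting the last fill size and the
context bytes from the maximum buffer size comes from the message above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper, not Expat's actual code.
 * Returns true when another fill of `last_fill` bytes might no longer fit
 * below `max_buffer`, i.e. when the deferral heuristic should be bypassed
 * and a parse attempted right away. */
bool should_bypass_deferral(size_t buffered, size_t last_fill,
                            size_t context_bytes, size_t max_buffer) {
  assert(max_buffer > 0);
  size_t threshold = max_buffer;
  /* Room that may be taken by context kept ahead of the partial token
   * (clamped so the subtraction cannot wrap around). */
  threshold -= (context_bytes < threshold) ? context_bytes : threshold;
  /* Room the application is likely to ask for on its next fill. */
  threshold -= (last_fill < threshold) ? last_fill : threshold;
  return buffered >= threshold;
}
```

With a 1 GiB maximum, 1024 context bytes and 4096-byte fills, a 950 MiB
partial token stays under the threshold, while one that is within 5120
bytes of the maximum triggers the bypass.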

Co-authored-by: Sebastian Pipping <sebastian@pipping.org>
2024-01-29 17:09:36 +01:00
Snild Dolkow
ad9c01be8e Make external entity parser inherit partial token heuristic setting
The test is essentially a copy of the existing test for the setter,
adapted to run on the external parser instead of the original one.

Suggested-by: Sebastian Pipping <sebastian@pipping.org>
CI-fighting-assistance-by: Sebastian Pipping <sebastian@pipping.org>
2024-01-29 17:09:36 +01:00
Snild Dolkow
8ddd8e86aa Try to parse even when incoming len is zero
If the reparse deferral setting has changed, it may be possible to
finish a token.
2024-01-29 17:09:36 +01:00
Snild Dolkow
1d3162da8a Add app setting for enabling/disabling reparse heuristic
Suggested-by: Sebastian Pipping <sebastian@pipping.org>
CI-fighting-assistance-by: Sebastian Pipping <sebastian@pipping.org>
2024-01-29 17:09:36 +01:00
Snild Dolkow
09957b8ced Allow XML_GetBuffer() with len=0 on a fresh parser
len=0 was previously OK if there had previously been a non-zero call.
It makes sense to allow an application to work the same way on a
newly-created parser, and not have to care if its incoming buffer
happens to be 0.
2024-01-29 17:09:36 +01:00
Snild Dolkow
9fe3672459 tests: Run both with and without partial token heuristic
If we always run with the heuristic enabled, it may hide some bugs by
grouping up input into bigger parse attempts.

CI-fighting-assistance-by: Sebastian Pipping <sebastian@pipping.org>
2024-01-29 17:09:36 +01:00
Snild Dolkow
1b9d398517 Don't update partial token heuristic on error
Suggested-by: Sebastian Pipping <sebastian@pipping.org>
2024-01-29 17:09:35 +01:00
Snild Dolkow
9cdf9b8d77 Skip parsing after repeated partials on the same token
When the parse buffer contains the starting bytes of a token but not
all of them, we cannot parse the token to completion. We call this a
partial token.  When this happens, the parse position is reset to the
start of the token, and the parse() call returns. The client is then
expected to provide more data and call parse() again.

In extreme cases, this means that the bytes of a token may be parsed
many times: once for every buffer refill required before the full token
is present in the buffer.

Math:
  Assume there's a token of T bytes
  Assume the client fills the buffer in chunks of X bytes
  We'll try to parse X, 2X, 3X, 4X ... until mX == T (technically >=)
  That's (m²+m)X/2 = (T²/X+T)/2 bytes parsed (arithmetic progression)
  While it is alleviated by larger refills, this amounts to O(T²)

Expat grows its internal buffer by doubling it when necessary, but has
no way to inform the client about how much space is available. Instead,
we add a heuristic that skips parsing when we've repeatedly stopped on
an incomplete token. Specifically:

 * Only try to parse if we have a certain amount of data buffered
 * Every time we stop on an incomplete token, double the threshold
 * As soon as any token completes, the threshold is reset

This means that when we get stuck on an incomplete token, the threshold
grows exponentially, effectively making the client perform larger buffer
fills, limiting how many times we can end up re-parsing the same bytes.

Math:
  Assume there's a token of T bytes
  Assume the client fills the buffer in chunks of X bytes
  We'll try to parse X, 2X, 4X, 8X ... until (2^k)X == T (or larger)
  That's (2^(k+1)-1)X bytes parsed -- e.g. 15X if T = 8X
  This is equal to 2T-X, which amounts to O(T)
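
The two calculations above can be checked with a small simulation. This is
a simplified model under stated assumptions, not Expat's implementation:
the function names are made up, the client always fills exactly `chunk`
bytes, and the threshold is set to twice the amount buffered at the last
failed attempt.

```c
#include <assert.h>
#include <stdint.h>

/* Bytes scanned when every refill of `chunk` bytes triggers a reparse of
 * the whole partial token (the behavior before the heuristic). */
uint64_t scanned_without_heuristic(uint64_t token_len, uint64_t chunk) {
  assert(chunk > 0);
  uint64_t buffered = 0, scanned = 0;
  while (buffered < token_len) {
    buffered += chunk;   /* client supplies another chunk */
    scanned += buffered; /* parser rescans from the token start */
  }
  return scanned;
}

/* Bytes scanned when a reparse is only attempted once the buffered amount
 * reaches a threshold that doubles after each failed attempt. */
uint64_t scanned_with_doubling(uint64_t token_len, uint64_t chunk) {
  assert(chunk > 0);
  uint64_t buffered = 0, scanned = 0, threshold = chunk;
  for (;;) {
    buffered += chunk;
    if (buffered < threshold)
      continue;          /* deferred: not enough new data yet */
    scanned += buffered; /* parse attempt scans everything buffered */
    if (buffered >= token_len)
      break;             /* token complete */
    threshold = 2 * buffered;
  }
  return scanned;
}
```

For T = 8X the first function returns 36X, matching (m²+m)X/2 with m = 8,
and the second returns 15X, matching 2T-X.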

We could've chosen a faster growth rate, e.g. 4 or 8. Those seem to
increase performance further, at the cost of further increasing the
risk of growing the buffer more than necessary. This can easily be
adjusted in the future, if desired.

This is all completely transparent to the client, except for:
1. possible delay of some callbacks (when our heuristic overshoots)
2. apps that never do isFinal=XML_TRUE could miss data at the end

For the affected testdata, this change shows a 100-400x speedup.
The recset.xml benchmark shows no clear change either way.

Before:
benchmark -n ../testdata/largefiles/recset.xml 65535 3
  3 loops, with buffer size 65535. Average time per loop: 0.270223
benchmark -n ../testdata/largefiles/aaaaaa_attr.xml 4096 3
  3 loops, with buffer size 4096. Average time per loop: 15.033048
benchmark -n ../testdata/largefiles/aaaaaa_cdata.xml 4096 3
  3 loops, with buffer size 4096. Average time per loop: 0.018027
benchmark -n ../testdata/largefiles/aaaaaa_comment.xml 4096 3
  3 loops, with buffer size 4096. Average time per loop: 11.775362
benchmark -n ../testdata/largefiles/aaaaaa_tag.xml 4096 3
  3 loops, with buffer size 4096. Average time per loop: 11.711414
benchmark -n ../testdata/largefiles/aaaaaa_text.xml 4096 3
  3 loops, with buffer size 4096. Average time per loop: 0.019362

After:
./run.sh benchmark -n ../testdata/largefiles/recset.xml 65535 3
  3 loops, with buffer size 65535. Average time per loop: 0.269030
./run.sh benchmark -n ../testdata/largefiles/aaaaaa_attr.xml 4096 3
  3 loops, with buffer size 4096. Average time per loop: 0.044794
./run.sh benchmark -n ../testdata/largefiles/aaaaaa_cdata.xml 4096 3
  3 loops, with buffer size 4096. Average time per loop: 0.016377
./run.sh benchmark -n ../testdata/largefiles/aaaaaa_comment.xml 4096 3
  3 loops, with buffer size 4096. Average time per loop: 0.027022
./run.sh benchmark -n ../testdata/largefiles/aaaaaa_tag.xml 4096 3
  3 loops, with buffer size 4096. Average time per loop: 0.099360
./run.sh benchmark -n ../testdata/largefiles/aaaaaa_text.xml 4096 3
  3 loops, with buffer size 4096. Average time per loop: 0.017956
2024-01-29 17:09:35 +01:00
Sebastian Pipping
226a1527cf clang-tidy: Address warning readability-named-parameter 2024-01-12 23:27:19 +01:00
Sebastian Pipping
8a6c61de4a lib: Add XML_GE to XML_GetFeatureList and XML_FeatureEnum
Co-authored-by: Snild Dolkow <snild@sony.com>
2023-11-07 13:00:42 +01:00
Sebastian Pipping
55fecd6aa4 Drop redundant "XML_GE == 1" guards
These are redundant because further out there is a guard
for "XML_GE == 1" already.  In the visual world, the pattern
is this:

> #if XML_GE == 1
> [..]
> #  if XML_GE == 1
> [..]
> #  endif
> [..]
> #endif

Spotted by Snild Dolkow, thanks!

Co-authored-by: Snild Dolkow <snild@sony.com>
2023-11-07 13:00:42 +01:00
Sebastian Pipping
caa2719863 Simplify "defined(XML_DTD) || XML_GE == 1" to "XML_GE == 1" 2023-11-07 13:00:42 +01:00
Sebastian Pipping
2b127c20b2 lib: Make XML_GE==0 use self-references as entity replacement text 2023-11-06 21:02:42 +01:00
Sebastian Pipping
b0975cb73a lib: Fail the build if XML_GE is not set to 1 or 0 2023-11-06 20:43:09 +01:00
Sebastian Pipping
0f075ec8ec lib|xmlwf|cmake: Extend scope of billion laughs attack protection
.. from "defined(XML_DTD)" to "defined(XML_DTD) || XML_GE==1".
2023-11-06 20:43:09 +01:00
Snild Dolkow
119ae277ab Grow buffer based on current size
Until now, the buffer size to grow to has been calculated based on the
distance from the current parse position to the end of the buffer. This
means that the size of any already-parsed data was not considered,
leading to inconsistent buffer growth.

There was also a special case in XML_Parse() when XML_CONTEXT_BYTES was
zero, where the buffer size would be set to twice the incoming string
length. This patch replaces this with an XML_GetBuffer() call.

Growing the buffer based on its total size makes its growth consistent.
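
As a rough illustration of growth that depends only on the total buffer
size, here is a minimal sketch. The helper name, its parameters, and the
convention of returning 0 on failure are assumptions made for this
example; it is not the code from the patch.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not Expat's actual code.
 * Doubles `current_size` until it can hold `needed` bytes. Returns 0 if
 * that would require growing beyond `max_size`. */
size_t grown_buffer_size(size_t current_size, size_t needed,
                         size_t max_size) {
  assert(current_size > 0);
  size_t size = current_size;
  while (size < needed) {
    if (size > max_size / 2)
      return 0; /* doubling again would exceed the maximum */
    size *= 2;
  }
  return size;
}
```

Because the result depends only on the current total size and the needed
size, the same inputs always produce the same growth, regardless of how
much already-parsed data sits in front of the parse position.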

The commit includes a test that checks that we can reach the max buffer
size (usually INT_MAX/2 + 1) regardless of previously parsed content.

GitHub CI couldn't allocate the full 1GiB with MinGW/wine32, though it
works locally with the same compiler and wine version. As a workaround,
the test tries to malloc 1GiB, and reduces `maxbuf` to 512MiB in case
of failure.
2023-10-26 08:21:51 +02:00
Sebastian Pipping
4eeaf49262 xmlparse.c: Fix NULL pointer dereference in XML_ExternalEntityParserCreate
.. for context NULL inside function setContext
when macro XML_DTD is not defined at compile time.
2023-10-23 18:14:56 +02:00
clang-format
a392427d3a Mass-apply clang-format 17.0.3 using ./apply-clang-format.sh 2023-10-20 23:49:51 +02:00
Sebastian Pipping
96985a1a07 lib/xmlparse.c: Make clang-format 16.0.6 happy again 2023-10-05 15:44:10 +02:00
Sebastian Pipping
23110a864d Be stricter about macro XML_CONTEXT_BYTES
- Start treating -DXML_CONTEXT_BYTES=0 as "no context"
  rather than "context of size 0".  Was documented as
  "must be set to a positive integer", previously.

- Enforce that macro XML_CONTEXT_BYTES is defined at build time to
  avoid accidental misbuilds lacking context in environments that
  bypass both of Expat's official build systems.

- Detect and reject use of negative context size at compile time.
2023-10-05 15:44:10 +02:00
Sebastian Pipping
acbcd0915d
Merge pull request #766 from libexpat/doc-parse-buffer-variables
lib/xmlparse.c: Improve parse buffer variables documentation
2023-10-05 14:50:10 +02:00
Sebastian Pipping
dd34d0e65c lib/xmlparse.c: Improve parse buffer variables documentation 2023-10-04 22:40:31 +02:00
Sebastian Pipping
ab43d8d116 Make inclusion to expat_config.h consistent
.. and prioritize the local build over the system header.
2023-10-04 19:58:28 +02:00
Sebastian Pipping
c1d4c439a1 docs: Mass-replace "re-use[d]" by "reuse[d]"
Pointed out by codespell.
2023-10-03 21:33:55 +02:00
Donghee Na
e52b6b8b8c Update legal name of Donghee Na (#754) 2023-09-24 18:13:03 +02:00
Snild Dolkow
b1e955449c Always consume BOM bytes when found in prolog
The byte order mark is not correctly consumed when followed by an
incomplete token in a non-final parse. This results in the BOM staying
in the buffer, causing an invalid token error later.

This was not detected by existing tests because they either parse
everything in one call, or add a single byte at a time.

By moving `s` forward when we find a BOM, we make sure that the BOM
bytes are properly consumed in all cases.
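
The idea of moving the start pointer past a BOM regardless of what follows
can be sketched as below. This is a hypothetical, UTF-8-only helper for
illustration; its name and signature are assumptions, and Expat's real
encoding detection also covers other encodings and lives elsewhere.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not Expat's actual code.
 * Returns the position just past a leading UTF-8 byte order mark, or `s`
 * unchanged when the input does not start with one. */
const char *skip_utf8_bom(const char *s, const char *end) {
  assert(s != NULL && end != NULL && s <= end);
  if (end - s >= 3 && (unsigned char)s[0] == 0xEF
      && (unsigned char)s[1] == 0xBB && (unsigned char)s[2] == 0xBF)
    return s + 3; /* consume the BOM even if the next token is incomplete */
  return s;
}
```

Here the input after the BOM may be an incomplete token such as "<doc";
the returned pointer still moves past the three BOM bytes, so they are not
seen again on the next parse call.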
2023-09-22 17:14:22 +02:00
Sebastian Pipping
81dd95d20a Document that glibc 2.36+ is bringing arc4random/arc4random_buf
Related:
https://sourceware.org/pipermail/libc-alpha/2022-August/141193.html
2023-05-28 17:04:49 +02:00
Rose
0685e01da2 debugLevel fields should be unsigned longs, not integers
This is a variation of https://github.com/libexpat/libexpat/pull/714 that makes more sense.
2023-05-23 16:26:32 -04:00
Sebastian Pipping
a256094b1d lib: Fix winconfig.h for clang-format 2023-03-22 18:20:04 +01:00
Orgad Shaneh
f86426fc8a winconfig: Avoid redefinition of WIN32_LEAN_AND_MEAN
If it is already defined externally, do not define it again.
2023-03-22 11:23:54 +02:00
Hanno Böck
50937b63bb
Use HTTPS where possible in URLs in code comments. 2023-03-16 20:12:49 +01:00
oda-gitso
1056758a33 lib: Address -Wunreachable-code for Clang 2023-02-20 01:13:58 +07:00
Sebastian Pipping
41c9daa337
Merge pull request #670 from seanm/more-const
Fix some -Wcast-qual Clang warnings
2022-11-02 18:15:14 +01:00
Sean McBride
cb7f93922e Simplify code by using SB_BYTE_TYPE macro
This also fixes -Wcast-qual warnings, which was the original motivation.
2022-11-02 12:17:18 -04:00
Sean McBride
c094e450a1 Fixed some clang -Wcast-qual warnings
Mostly added various missing consts, thus eliminating the casting away of constness. In a few cases, just removed unnecessary casts entirely.
2022-11-01 18:54:12 -04:00