Commit graph

4087 commits

Author SHA1 Message Date
Snild Dolkow
09957b8ced Allow XML_GetBuffer() with len=0 on a fresh parser
len=0 was already accepted after an earlier call with a non-zero len.
It makes sense to allow an application to work the same way on a
newly-created parser, and not have to care if its incoming buffer
happens to be 0.
2024-01-29 17:09:36 +01:00
Snild Dolkow
f1eea784d0 tests: Add max_slowdown info in test_big_tokens_take_linear_time
Suggested-by: Sebastian Pipping <sebastian@pipping.org>
2024-01-29 17:09:36 +01:00
Snild Dolkow
9fe3672459 tests: Run both with and without partial token heuristic
If we always run with the heuristic enabled, it may hide some bugs by
grouping up input into bigger parse attempts.

CI-fighting-assistance-by: Sebastian Pipping <sebastian@pipping.org>
2024-01-29 17:09:36 +01:00
Snild Dolkow
1b9d398517 Don't update partial token heuristic on error
Suggested-by: Sebastian Pipping <sebastian@pipping.org>
2024-01-29 17:09:35 +01:00
Snild Dolkow
9cdf9b8d77 Skip parsing after repeated partials on the same token
When the parse buffer contains the starting bytes of a token but not
all of them, we cannot parse the token to completion. We call this a
partial token.  When this happens, the parse position is reset to the
start of the token, and the parse() call returns. The client is then
expected to provide more data and call parse() again.

In extreme cases, this means that the bytes of a token may be parsed
many times: once for every buffer refill required before the full token
is present in the buffer.

Math:
  Assume there's a token of T bytes
  Assume the client fills the buffer in chunks of X bytes
  We'll try to parse X, 2X, 3X, 4X ... until mX == T (technically >=)
  That's (m²+m)X/2 = (T²/X+T)/2 bytes parsed (arithmetic progression)
  While it is alleviated by larger refills, this amounts to O(T²)
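
The arithmetic above can be checked with a small simulation. This is only
a sketch: bytes_parsed_linear() is a hypothetical helper, not part of
Expat, and it assumes every refill of chunk_len bytes triggers a parse
attempt that rescans the token from its start.

```c
#include <assert.h>

/* Hypothetical helper (not an Expat function): total bytes scanned when a
 * token of token_len bytes arrives in chunks of chunk_len bytes and each
 * refill causes the buffered part of the token to be parsed again. */
static unsigned long long
bytes_parsed_linear(unsigned long long token_len,
                    unsigned long long chunk_len) {
  assert(chunk_len > 0);
  unsigned long long buffered = 0;
  unsigned long long total = 0;
  while (buffered < token_len) {
    buffered += chunk_len; /* client refills the buffer by one chunk */
    total += buffered;     /* parser rescans everything buffered so far */
  }
  return total;
}
```

For T = 16384 and X = 4096 this gives 4096 * (1+2+3+4) = 40960 bytes,
matching (T²/X+T)/2.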

Expat grows its internal buffer by doubling it when necessary, but has
no way to inform the client about how much space is available. Instead,
we add a heuristic that skips parsing when we've repeatedly stopped on
an incomplete token. Specifically:

 * Only try to parse if we have a certain amount of data buffered
 * Every time we stop on an incomplete token, double the threshold
 * As soon as any token completes, the threshold is reset

This means that when we get stuck on an incomplete token, the threshold
grows exponentially, effectively making the client perform larger buffer
fills, limiting how many times we can end up re-parsing the same bytes.
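
A minimal sketch of the three rules above, for illustration only. The
struct and function names are hypothetical and do not reflect the actual
Expat implementation. Letting a final parse call bypass the threshold is
an assumption drawn from the isFinal note further down.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical state; the names are illustrative, not Expat's. */
struct partial_heuristic {
  size_t initial_threshold; /* bytes required before a first attempt */
  size_t threshold;         /* bytes currently required to attempt a parse */
};

static void
heuristic_init(struct partial_heuristic *h, size_t initial) {
  assert(initial > 0);
  h->initial_threshold = initial;
  h->threshold = initial;
}

/* Rule 1: only parse when enough data is buffered.
 * Assumption: a final call always parses, so no data is left behind. */
static bool
heuristic_should_parse(const struct partial_heuristic *h, size_t buffered,
                       bool is_final) {
  return is_final || buffered >= h->threshold;
}

/* Rule 2: stopping on an incomplete token doubles the threshold. */
static void
heuristic_on_partial(struct partial_heuristic *h) {
  if (h->threshold <= SIZE_MAX / 2) /* avoid overflow */
    h->threshold *= 2;
}

/* Rule 3: any completed token resets the threshold. */
static void
heuristic_on_token_complete(struct partial_heuristic *h) {
  h->threshold = h->initial_threshold;
}
```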

Math:
  Assume there's a token of T bytes
  Assume the client fills the buffer in chunks of X bytes
  We'll try to parse X, 2X, 4X, 8X ... until (2^k)X == T (or larger)
  That's (2^(k+1)-1)X bytes parsed -- e.g. 15X if T = 8X
  This is equal to 2T-X, which amounts to O(T)
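
The same kind of check works for the doubling case. Again a sketch:
bytes_parsed_doubling() is a hypothetical helper, not an Expat function,
and it assumes parse attempts happen only at buffer fills of X, 2X, 4X,
... bytes, as in the math above.

```c
#include <assert.h>

/* Hypothetical helper (not an Expat function): total bytes scanned when
 * parse attempts are made only at X, 2X, 4X, ... buffered bytes, until
 * the whole token of token_len bytes is available. */
static unsigned long long
bytes_parsed_doubling(unsigned long long token_len,
                      unsigned long long chunk_len) {
  assert(chunk_len > 0);
  unsigned long long attempt = chunk_len;
  unsigned long long total = 0;
  for (;;) {
    total += attempt; /* parser scans everything buffered so far */
    if (attempt >= token_len)
      break;          /* token is complete, no further rescans */
    attempt *= 2;     /* next attempt waits for twice as much data */
  }
  return total;
}
```

For T = 8X this gives (1+2+4+8)X = 15X = 2T-X. When T is not an exact
power-of-two multiple of X, the last attempt overshoots and the total is
somewhat larger, but it remains linear in T.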

We could've chosen a faster growth rate, e.g. 4 or 8. Those seem to
increase performance further, at the cost of further increasing the
risk of growing the buffer more than necessary. This can easily be
adjusted in the future, if desired.

This is all completely transparent to the client, except for:
1. possible delay of some callbacks (when our heuristic overshoots)
2. apps that never do isFinal=XML_TRUE could miss data at the end

For the affected testdata, this change shows a 100-400x speedup.
The recset.xml benchmark shows no clear change either way.

Before:
benchmark -n ../testdata/largefiles/recset.xml 65535 3
  3 loops, with buffer size 65535. Average time per loop: 0.270223
benchmark -n ../testdata/largefiles/aaaaaa_attr.xml 4096 3
  3 loops, with buffer size 4096. Average time per loop: 15.033048
benchmark -n ../testdata/largefiles/aaaaaa_cdata.xml 4096 3
  3 loops, with buffer size 4096. Average time per loop: 0.018027
benchmark -n ../testdata/largefiles/aaaaaa_comment.xml 4096 3
  3 loops, with buffer size 4096. Average time per loop: 11.775362
benchmark -n ../testdata/largefiles/aaaaaa_tag.xml 4096 3
  3 loops, with buffer size 4096. Average time per loop: 11.711414
benchmark -n ../testdata/largefiles/aaaaaa_text.xml 4096 3
  3 loops, with buffer size 4096. Average time per loop: 0.019362

After:
./run.sh benchmark -n ../testdata/largefiles/recset.xml 65535 3
  3 loops, with buffer size 65535. Average time per loop: 0.269030
./run.sh benchmark -n ../testdata/largefiles/aaaaaa_attr.xml 4096 3
  3 loops, with buffer size 4096. Average time per loop: 0.044794
./run.sh benchmark -n ../testdata/largefiles/aaaaaa_cdata.xml 4096 3
  3 loops, with buffer size 4096. Average time per loop: 0.016377
./run.sh benchmark -n ../testdata/largefiles/aaaaaa_comment.xml 4096 3
  3 loops, with buffer size 4096. Average time per loop: 0.027022
./run.sh benchmark -n ../testdata/largefiles/aaaaaa_tag.xml 4096 3
  3 loops, with buffer size 4096. Average time per loop: 0.099360
./run.sh benchmark -n ../testdata/largefiles/aaaaaa_text.xml 4096 3
  3 loops, with buffer size 4096. Average time per loop: 0.017956
2024-01-29 17:09:35 +01:00
Snild Dolkow
60dffa148c tests: Use normal XML_Parse in test_suspend_resume_internal_entity
When the parser is suspended, _XML_Parse_SINGLE_BYTES() will return
early. At that point, there could be some amount of bytes that haven't
been fed into Expat at all yet. This leaves us with an incomplete
document.

Furthermore, the last internal XML_Parse() call with isFinal=XML_TRUE
will not have happened, so the parser will not know that no more input
is to be expected. This is what allowed the test to pass when it was
originally changed to use SINGLE_BYTES.

With the new partial token heuristic, the lack of a final parse call
means that we don't even reach the "Ho" text, and fail the test.

The simplest solution is to go back to using XML_Parse() in this test.
Another option would be to let SINGLE_BYTES expose how far it got in
its loop, allowing for later continuation, but it doesn't seem worth the
extra complexity.
2024-01-29 17:09:35 +01:00
Snild Dolkow
3484383fa7 Add aaaaaa_*.xml with unreasonably large tokens
Some of these currently take a very long time to parse. I set those to
only run one loop in the run-benchmark make target.

4096 may be a fairly small buffer, and definitely makes the problem worse
than it otherwise would've been, but similar sizes exist in real code:

 * 2048 bytes in cpython Modules/pyexpat.c
 * 4096 bytes in skia SkXMLParser.cpp
 * BUFSIZ bytes (8192 on my machine) in expat/examples

The files, too, are inspired by real-life examples: Android stores
depth and gain maps as base64-encoded JPEGs inside the XMP data of
other JPEGs. Sometimes as a text element, sometimes as an attribute
value. I've seen attribute values slightly over 5 MiB in size.
2024-01-29 17:09:35 +01:00
Sebastian Pipping
183270d565
Merge pull request #810 from libexpat/clang-18
CI: Upgrade to Clang 18 (except clang-tidy and clang-format)
2024-01-26 19:10:31 +01:00
Sebastian Pipping
f7ada131b7
Merge pull request #808 from libexpat/clang-tidy-18
CI: Upgrade to clang-tidy 18
2024-01-26 18:30:17 +01:00
Sebastian Pipping
6880fe4948 CI: Upgrade to Clang 18 (except clang-tidy and clang-format) 2024-01-26 16:20:04 +01:00
Sebastian Pipping
fc0b026ce5 clang-format.yml: De-couple clang-format from Clang
.. so that we can bump their versions independently
2024-01-26 16:19:59 +01:00
Sebastian Pipping
7acda8d16a clang-tidy.yml: Upgrade to clang-tidy 18 2024-01-26 16:19:02 +01:00
Sebastian Pipping
737e8ea183 tests/misc_tests.c: Address clang-tidy 18 warning EnumCastOutOfRange
clang-tidy output was:
> [..]/libexpat/expat/tests/misc_tests.c:112:23: note: The value '-1' provided to the cast expression is not in the valid range of values for 'XML_Error'
>   112 |   if (XML_ErrorString((enum XML_Error) - 1) != NULL)
>       |                       ^~~~~~~~~~~~~~~~~~~~
> [..]/libexpat/expat/tests/misc_tests.c:114:23: error: The value '100' provided to the cast expression is not in the valid range of values for 'XML_Error' [clang-analyzer-optin.core.EnumCastOutOfRange,-warnings-as-errors]
>   114 |   if (XML_ErrorString((enum XML_Error)100) != NULL)
>       |                       ^~~~~~~~~~~~~~~~~~~
2024-01-26 16:19:02 +01:00
Sebastian Pipping
abd9542b32
Merge pull request #806 from libexpat/dependabot/github_actions/actions/upload-artifact-4.2.0
Actions(deps): Bump actions/upload-artifact from 4.0.0 to 4.2.0
2024-01-22 14:52:20 +01:00
dependabot[bot]
2c37fc7d7d
Actions(deps): Bump actions/upload-artifact from 4.0.0 to 4.2.0
Bumps [actions/upload-artifact](https://github.com/actions/upload-artifact) from 4.0.0 to 4.2.0.
- [Release notes](https://github.com/actions/upload-artifact/releases)
- [Commits](https://github.com/actions/upload-artifact/compare/v4...694cdabd8bdb0f10b2cea11669e1bf5453eed0a6)

---
updated-dependencies:
- dependency-name: actions/upload-artifact
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
2024-01-22 12:09:43 +00:00
Sebastian Pipping
86a3623a9a
Merge pull request #801 from catenacyber/fuzzcovstop
fuzz: Improve coverage by maybe stopping the parser
2024-01-17 17:36:43 +01:00
Sebastian Pipping
5b70d3ac44 fuzz/xml_parsebuffer_fuzzer.c: Be more robust towards out-of-memory 2024-01-17 10:08:42 +01:00
Philippe Antoine
34af886238 fuzz: improve coverage by maybe stopping parser 2024-01-16 11:08:44 +01:00
Sebastian Pipping
2640b1d97c
Merge pull request #799 from libexpat/ci-fuzzing
Make CI run fuzzing regression tests (fixes #367)
2024-01-16 02:26:43 +01:00
Sebastian Pipping
c47e191797
Merge pull request #803 from libexpat/fix-cppcheck-ci
Fix Cppcheck CI for Cppcheck 2.13.0
2024-01-16 01:14:12 +01:00
Sebastian Pipping
24ffba44bd Make CI run fuzzing regression tests 2024-01-15 23:57:02 +01:00
Sebastian Pipping
73ebe0bfb3 fuzz: Address warning -Wunused-function with regard to sip24_valid 2024-01-15 23:57:02 +01:00
Sebastian Pipping
ed38687779 mass-cppcheck.sh: Fix for Cppcheck 2.13.0
Cppcheck output was:
> expat/lib/xmlparse.c:67:4: error: #error XML_GE (for general entities) must be defined, [..]
> #  error XML_GE (for general entities) must be defined, [..]
>    ^
2024-01-15 23:29:19 +01:00
Sebastian Pipping
3ff1d00dc2 cppcheck.yml: Bump to macOS 12
Homebrew output was:
> Warning: You are using macOS 11.
> We (and Apple) do not provide support for this old version.
> [..]
2024-01-15 23:29:19 +01:00
Sebastian Pipping
9e603b35e0
Merge pull request #802 from libexpat/dependabot/github_actions/actions/upload-artifact-4.1.0
Actions(deps): Bump actions/upload-artifact from 4.0.0 to 4.1.0
2024-01-15 17:06:21 +01:00
dependabot[bot]
2d9bc9aec6
Actions(deps): Bump actions/upload-artifact from 4.0.0 to 4.1.0
Bumps [actions/upload-artifact](https://github.com/actions/upload-artifact) from 4.0.0 to 4.1.0.
- [Release notes](https://github.com/actions/upload-artifact/releases)
- [Commits](c7d193f32e...1eb3cb2b3e)

---
updated-dependencies:
- dependency-name: actions/upload-artifact
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
2024-01-15 12:39:38 +00:00
Sebastian Pipping
19af57f2dd
Merge pull request #800 from libexpat/clang-tidy-more
clang-tidy: Address warnings `readability-avoid-const-params-in-decls` and `readability-named-parameter`
2024-01-13 01:59:22 +01:00
Sebastian Pipping
226a1527cf clang-tidy: Address warning readability-named-parameter 2024-01-12 23:27:19 +01:00
Sebastian Pipping
225ebd45e1 clang-tidy: Address warning readability-avoid-const-params-in-decls
clang-tidy output was:
> [..]/tests/handlers.h:502:64: error: parameter 'index' is const-qualified in the function declaration; const-qualification of parameters only has an effect in function definitions [readability-avoid-const-params-in-decls,-warnings-as-errors]
>   502 | _handler_record_get(const struct handler_record_list *storage, const int index,
>       |                                                                ^~~~~
> [..]/tests/handlers.h:503:39: error: parameter 'line' is const-qualified in the function declaration; const-qualification of parameters only has an effect in function definitions [readability-avoid-const-params-in-decls,-warnings-as-errors]
>   503 |                     const char *file, const int line);
>       |                                       ^~~~~
2024-01-12 22:19:05 +01:00
Sebastian Pipping
7664ecdbae
Merge pull request #798 from libexpat/clang-tidy
Make GitHub Actions enforce clang-tidy clean code + address current clang-tidy warnings
2024-01-12 21:38:35 +01:00
Sebastian Pipping
f832f7b981 Make GitHub Actions enforce clang-tidy clean code 2024-01-12 17:26:50 +01:00
Sebastian Pipping
10cded2493 tests/basic_tests.c: Address clang-tidy warning clang-analyzer-core.NullDereference
clang-tidy output was:
> [..]/tests/basic_tests.c:2083:19: warning: Dereference of null pointer [clang-analyzer-core.NullDereference]
>  2083 |   errorFlags |= ((model[0].type == XML_CTYPE_SEQ) ? 0 : (1u << 2));
>       |                   ^~~~~~~~~~~~~
2024-01-12 17:25:27 +01:00
Sebastian Pipping
e23c300f25 tests/acc_tests.c: Address clang-tidy warning clang-analyzer-core.NonNullParamChecker
clang-tidy output was:
> [..]/tests/acc_tests.c:368:9: warning: Null pointer passed to 1st parameter expecting 'nonnull' [clang-analyzer-core.NonNullParamChecker]
>   368 |     if (strlen(printable) < (size_t)1)
>       |         ^      ~~~~~~~~~

Note: It was harmless because fail(..) right before catches that case.
2024-01-12 17:25:27 +01:00
Sebastian Pipping
0b424cb9ae examples/element_declarations.c: Simplify first call to stackPushMalloc
.. where stackTop is NULL anyway
2024-01-12 17:25:27 +01:00
Sebastian Pipping
0ebca2b10f examples/element_declarations.c: Fix memleak in dumpContentModel on OOM
clang-tidy output was:
> [..]/examples/element_declarations.c:163:16: warning: Potential leak of memory pointed to by 'stackTop' [clang-analyzer-unix.Malloc]
>   163 |         return false;
>       |                ^
2024-01-12 04:46:47 +01:00
Sebastian Pipping
716fd10bd4
Merge pull request #797 from catenacyber/fuzzcov
fuzz: improve coverage
2024-01-10 23:07:00 +01:00
Philippe Antoine
bb58abd4e0 fuzz: improve coverage 2024-01-10 22:06:37 +01:00
Sebastian Pipping
be47f6d5e8
Merge pull request #796 from libexpat/ci-control-flow-integrity
Make CI cover Clang's Control Flow Integrity sanitizer
2023-12-19 18:39:44 +01:00
Sebastian Pipping
64912b70fb
Merge pull request #795 from libexpat/autotools-install-shipped-xmlwf-manpage
Autotools: Make installation of shipped `doc/xmlwf.1` independent of docbook2man availability
2023-12-19 18:38:44 +01:00
Sebastian Pipping
18b44c980e linux.yml: Cover Clang's Control Flow Integrity sanitizer 2023-12-19 01:31:10 +01:00
Sebastian Pipping
9495cefd94 qa.sh: Fix dropping of QA_SANITIZER 2023-12-19 01:31:10 +01:00
Sebastian Pipping
4b878938bb qa.sh: Support Clang's Control Flow Integrity sanitizer
https://clang.llvm.org/docs/ControlFlowIntegrity.html
2023-12-19 01:31:10 +01:00
Sebastian Pipping
7384c88f9a configure.ac: Make installation of shipped doc/xmlwf.1 independent of docbook2man availability 2023-12-18 23:59:25 +01:00
Sebastian Pipping
822d1706b2
Merge pull request #794 from libexpat/dependabot/github_actions/actions/upload-artifact-4.0.0
Actions(deps): Bump actions/upload-artifact from 3.1.3 to 4.0.0
2023-12-18 17:48:34 +01:00
dependabot[bot]
8c87ca470d
Actions(deps): Bump actions/upload-artifact from 3.1.3 to 4.0.0
Bumps [actions/upload-artifact](https://github.com/actions/upload-artifact) from 3.1.3 to 4.0.0.
- [Release notes](https://github.com/actions/upload-artifact/releases)
- [Commits](a8a3f3ad30...c7d193f32e)

---
updated-dependencies:
- dependency-name: actions/upload-artifact
  dependency-type: direct:production
  update-type: version-update:semver-major
...

Signed-off-by: dependabot[bot] <support@github.com>
2023-12-18 12:12:08 +00:00
Sebastian Pipping
b9fcca0aaa
Merge pull request #793 from libexpat/fix-bug-report-target
CMake|Autotools: Fix `PACKAGE_BUGREPORT` variable to something working
2023-12-17 23:09:46 +01:00
Sebastian Pipping
5a3c419e6a CMake|Autotools: Fix PACKAGE_BUGREPORT variable to something working 2023-12-17 03:34:27 +01:00
Sebastian Pipping
85ee77d31f
Merge pull request #792 from libexpat/autotools-sync-cmake-files
autotools: Sync CMake templates with CMake 3.26
2023-12-16 16:50:45 +01:00
Sebastian Pipping
141cdab714 autotools: Sync CMake templates with CMake 3.26 2023-12-15 05:02:23 +01:00
Sebastian Pipping
fb702e6c0e
Merge pull request #790 from libexpat/cmake-build-benchmark-also
CMake: Build `tests/benchmark/benchmark.c` for `EXPAT_BUILD_TESTS`
2023-11-22 13:04:23 +01:00