🌿 Fast streaming XML parser written in C99 with >90% test coverage; moved from SourceForge to GitHub
Snild Dolkow 3d8141d26a Bypass partial token heuristic when nearing full buffer
...instead of only when approaching the maximum buffer size INT/2+1.

We'd like to give applications a chance to finish parsing a large token
before buffer reallocation, in case the reallocation fails.

By bypassing the reparse deferral heuristic when getting close to
filling the buffer, we give them this chance -- if the whole token is
present in the buffer, it will be parsed at that time.

This may come at the cost of some extra reparse attempts. For a token
of n bytes, these extra parses cause us to scan over a maximum of
2n bytes (... + n/8 + n/4 + n/2 + n). Therefore, parsing of big tokens
remains O(n) with regard to how many bytes we scan in attempts to parse. The
cost in reality is lower than that, since the reparses that happen due
to the bypass will affect m_partialTokenBytesBefore, delaying the next
ratio-based reparse. Furthermore, only the first token that "breaks
through" a buffer ceiling takes that extra reparse attempt; subsequent
large tokens will only bypass the heuristic if they manage to hit the
new buffer ceiling.

Note that this cost analysis depends on the assumption that Expat grows
its buffer by doubling it (or, more generally, grows it exponentially).
If this changes, the cost of this bypass may increase. Hopefully, this
would be caught by test_big_tokens_take_linear_time or the new test.
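
The 2n bound can be checked with a small stand-alone sketch. It is a
simplified model, not Expat code: it assumes, as the analysis does, that
the buffer doubles each time, so the bypass reparses scan n, n/2, n/4, ...
bytes. The name bypass_scan_total is hypothetical.

```c
#include <assert.h>

/* Hypothetical model, not part of Expat: total bytes scanned by the
 * bypass reparses for one n-byte token, assuming the buffer doubles
 * between reparses (so the attempts scan n, n/2, n/4, ... bytes). */
unsigned long long bypass_scan_total(unsigned long long n) {
  unsigned long long total = 0;
  unsigned long long size;
  for (size = n; size > 0; size /= 2)
    total += size; /* one bypass reparse scans `size` bytes */
  /* Geometric series: the sum stays below 2n (for n below 2^63). */
  assert(n == 0 || total < 2 * n);
  return total;
}
```

For n = 1024 the model gives 1024 + 512 + ... + 1 = 2047 bytes, just
under 2n = 2048.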

The bypass logic assumes that the application uses a consistent fill.
If the app increases its fill size, it may miss the bypass (and the
normal heuristic will apply). If the app decreases its fill size, the
bypass may be hit multiple times for the same buffer size. The very
worst case would be to always fill half of the remaining buffer space,
in which case parsing of a large n-byte token becomes O(n log n).

As an added bonus, the new test case should be faster than the old one,
since it doesn't have to go all the way to 1GiB to check the behavior.

Finally, this change necessitated a small modification to two existing
tests related to reparse deferral. These tests are testing the deferral
enabled setting, and assume that reparsing will not happen for any other
reason. By pre-growing the buffer, we make sure that this new deferral
does not affect those test cases.
2024-01-29 17:09:36 +01:00
.github Add app setting for enabling/disabling reparse heuristic 2024-01-29 17:09:36 +01:00
expat Bypass partial token heuristic when nearing full buffer 2024-01-29 17:09:36 +01:00
testdata Add aaaaaa_*.xml with unreasonably large tokens 2024-01-29 17:09:35 +01:00
.ci.sh CI: Upgrade to Clang 18 (except clang-tidy and clang-format) 2024-01-26 16:20:04 +01:00
.gitignore .gitignore: Add missing 2022-01-29 23:28:05 +01:00
.mailmap Update legal name of Donghee Na (#754) 2023-09-24 18:13:03 +02:00
appveyor.yml Sync file headers 2022-07-14 22:26:59 +02:00
Brewfile Actions: Split off Cppcheck to stop installing 13 unrelated Homebrew formulas 2021-04-06 23:52:35 +02:00
COPYING COPYING: cp expat/COPYING ./ 2022-07-14 22:27:25 +02:00
README.md Migrate README to Markdown 2017-07-29 16:21:39 +02:00
SECURITY.md Update SECURITY.md 2023-04-06 14:40:31 -03:00


Expat, Release 2.5.0

This is Expat, a C99 library for parsing XML 1.0 Fourth Edition, started by James Clark in 1997. Expat is a stream-oriented XML parser. This means that you register handlers with the parser before starting the parse. These handlers are called when the parser discovers the associated structures in the document being parsed. A start tag is an example of the kind of structures for which you may register handlers.

Expat supports the following compilers:

  • GNU GCC >=4.5
  • LLVM Clang >=3.5
  • Microsoft Visual Studio >=15.0/2017 (rolling ${today} minus 5 years)

Windows users can use the expat-win32bin-*.*.*.{exe,zip} download, which includes both pre-compiled libraries and executables, and source code for developers.

Expat is free software. You may copy, distribute, and modify it under the terms of the License contained in the file COPYING distributed with this package. This license is the same as the MIT/X Consortium license.

Using libexpat in your CMake-Based Project

There are two ways of using libexpat with CMake:

a) Module Mode

This approach leverages CMake's own module FindEXPAT.

Notice the uppercase EXPAT in the following example:

cmake_minimum_required(VERSION 3.0)  # or 3.10, see below

project(hello VERSION 1.0.0)

find_package(EXPAT 2.2.8 MODULE REQUIRED)

add_executable(hello
    hello.c
)

# a) for CMake >=3.10 (see CMake's FindEXPAT docs)
target_link_libraries(hello PUBLIC EXPAT::EXPAT)

# b) for CMake >=3.0
target_include_directories(hello PRIVATE ${EXPAT_INCLUDE_DIRS})
target_link_libraries(hello PUBLIC ${EXPAT_LIBRARIES})

b) Config Mode

This approach requires files from…

  • libexpat >=2.2.8 where packaging uses the CMake build system or
  • libexpat >=2.3.0 where packaging uses the GNU Autotools build system on Linux or
  • libexpat >=2.4.0 where packaging uses the GNU Autotools build system on macOS or MinGW.

Notice the lowercase expat in the following example:

cmake_minimum_required(VERSION 3.0)

project(hello VERSION 1.0.0)

find_package(expat 2.2.8 CONFIG REQUIRED char dtd ns)

add_executable(hello
    hello.c
)

target_link_libraries(hello PUBLIC expat::expat)

Building from a Git Clone

If you are building Expat from a check-out from the Git repository, you need to run a script that generates the configure script using the GNU autoconf and libtool tools. To do this, you need to have autoconf 2.58 or newer. Run the script like this:

./buildconf.sh

Once this has been done, follow the same instructions as for building from a source distribution.

Building from a Source Distribution

a) Building with the configure script (i.e. GNU Autotools)

To build Expat from a source distribution, you first run the configuration shell script in the top level distribution directory:

./configure

There are many options which you may provide to configure (which you can discover by running configure with the --help option). But the one of most interest is the one that sets the installation directory. By default, the configure script will set things up to install libexpat into /usr/local/lib, expat.h into /usr/local/include, and xmlwf into /usr/local/bin. If, for example, you'd prefer to install into /home/me/mystuff/lib, /home/me/mystuff/include, and /home/me/mystuff/bin, you can tell configure about that with:

./configure --prefix=/home/me/mystuff

Another interesting option is to enable 64-bit integer support for line and column numbers and the overall byte index:

./configure CPPFLAGS=-DXML_LARGE_SIZE

However, such a modification would be a breaking change to the ABI and is therefore not recommended for general use — e.g. as part of a Linux distribution — but rather for builds with special requirements.

After running the configure script, the make command will build things and make install will install things into their proper location. Have a look at the Makefile to learn about additional make options. Note that you need to have write permission into the directories into which things will be installed.

If you are interested in building Expat to provide document information in UTF-16 encoding rather than the default UTF-8, follow these instructions (after having run make distclean). Please note that we configure with --without-xmlwf as xmlwf does not support this mode of compilation (yet):

  1. Mass-patch Makefile.am files to use libexpatw.la for a library name:
    find . -name Makefile.am -exec sed -e 's,libexpat\.la,libexpatw.la,' -e 's,libexpat_la,libexpatw_la,' -i {} +

  2. Run automake to re-write Makefile.in files:
    automake

  3. For UTF-16 output as unsigned short (and version/error strings as char), run:
    ./configure CPPFLAGS=-DXML_UNICODE --without-xmlwf
    For UTF-16 output as wchar_t (incl. version/error strings), run:
    ./configure CFLAGS="-g -O2 -fshort-wchar" CPPFLAGS=-DXML_UNICODE_WCHAR_T --without-xmlwf
    Note: The latter requires libc compiled with -fshort-wchar, as well.

  4. Run make (which excludes xmlwf).

  5. Run make install (again, excludes xmlwf).

Using DESTDIR is supported. It works as follows:

make install DESTDIR=/path/to/image

overrides the in-makefile set DESTDIR, because variable-setting priority is

  1. commandline
  2. in-makefile
  3. environment

Note: The UTF-16 build described above only applies to the Expat library itself; building UTF-16 versions of xmlwf and the tests is currently not supported.

When using Expat with a project using autoconf for configuration, you can use the probing macro in conftools/expat.m4 to determine how to include Expat. See the comments at the top of that file for more information.

A reference manual is available in the file doc/reference.html in this distribution.

b) Building with CMake

The CMake build system is still experimental and may replace the primary build system based on GNU Autotools at some point when it is ready.

Available Options

For an idea of the available (non-advanced) options for building with CMake:

# rm -f CMakeCache.txt ; cmake -D_EXPAT_HELP=ON -LH . | grep -B1 ':.*=' | sed 's,^--$,,'
// Choose the type of build, options are: None Debug Release RelWithDebInfo MinSizeRel ...
CMAKE_BUILD_TYPE:STRING=

// Install path prefix, prepended onto install directories.
CMAKE_INSTALL_PREFIX:PATH=/usr/local

// Path to a program.
DOCBOOK_TO_MAN:FILEPATH=/usr/bin/docbook2x-man

// Build man page for xmlwf
EXPAT_BUILD_DOCS:BOOL=ON

// Build the examples for expat library
EXPAT_BUILD_EXAMPLES:BOOL=ON

// Build fuzzers for the expat library
EXPAT_BUILD_FUZZERS:BOOL=OFF

// Build pkg-config file
EXPAT_BUILD_PKGCONFIG:BOOL=ON

// Build the tests for expat library
EXPAT_BUILD_TESTS:BOOL=ON

// Build the xmlwf tool for expat library
EXPAT_BUILD_TOOLS:BOOL=ON

// Character type to use (char|ushort|wchar_t) [default=char]
EXPAT_CHAR_TYPE:STRING=char

// Install expat files in cmake install target
EXPAT_ENABLE_INSTALL:BOOL=ON

// Use /MT flag (static CRT) when compiling in MSVC
EXPAT_MSVC_STATIC_CRT:BOOL=OFF

// Build fuzzers via ossfuzz for the expat library
EXPAT_OSSFUZZ_BUILD:BOOL=OFF

// Build a shared expat library
EXPAT_SHARED_LIBS:BOOL=ON

// Treat all compiler warnings as errors
EXPAT_WARNINGS_AS_ERRORS:BOOL=OFF

// Make use of getrandom function (ON|OFF|AUTO) [default=AUTO]
EXPAT_WITH_GETRANDOM:STRING=AUTO

// Utilize libbsd (for arc4random_buf)
EXPAT_WITH_LIBBSD:BOOL=OFF

// Make use of syscall SYS_getrandom (ON|OFF|AUTO) [default=AUTO]
EXPAT_WITH_SYS_GETRANDOM:STRING=AUTO