When the parse buffer contains the starting bytes of a token but not all of them, we cannot parse the token to completion. We call this a partial token. When this happens, the parse position is reset to the start of the token, and the parse() call returns. The client is then expected to provide more data and call parse() again.

In extreme cases, this means that the bytes of a token may be parsed many times: once for every buffer refill required before the full token is present in the buffer.

Math:
  Assume there's a token of T bytes
  Assume the client fills the buffer in chunks of X bytes
  We'll try to parse X, 2X, 3X, 4X ... until mX == T (technically >=)
  That's (m²+m)X/2 = (T²/X+T)/2 bytes parsed (arithmetic progression)
  While it is alleviated by larger refills, this amounts to O(T²)

Expat grows its internal buffer by doubling it when necessary, but has no way to inform the client about how much space is available. Instead, we add a heuristic that skips parsing when we've repeatedly stopped on an incomplete token. Specifically:

 * Only try to parse if we have a certain amount of data buffered
 * Every time we stop on an incomplete token, double the threshold
 * As soon as any token completes, the threshold is reset

This means that when we get stuck on an incomplete token, the threshold grows exponentially, effectively making the client perform larger buffer fills, limiting how many times we can end up re-parsing the same bytes.

Math:
  Assume there's a token of T bytes
  Assume the client fills the buffer in chunks of X bytes
  We'll try to parse X, 2X, 4X, 8X ... until (2^k)X == T (or larger)
  That's (2^(k+1)-1)X bytes parsed -- e.g. 15X if T = 8X
  This is equal to 2T-X, which amounts to O(T)

We could've chosen a faster growth rate, e.g. 4 or 8. Those seem to increase performance further, at the cost of further increasing the risk of growing the buffer more than necessary. This can easily be adjusted in the future, if desired.
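The two estimates above can be checked with a small simulation. This is only a sketch of the arithmetic, not Expat's implementation: the function names are hypothetical, and starting the threshold at one chunk is an assumption made to match the X, 2X, 4X, ... sequence described above.

```c
/* Hypothetical model of the re-parsing cost; not taken from Expat's source. */

/* Bytes scanned when every refill of `chunk` bytes triggers a parse attempt
 * that restarts at the beginning of a `token`-byte token. */
unsigned long long scanned_without_heuristic(unsigned long long token,
                                             unsigned long long chunk) {
  unsigned long long buffered = 0, scanned = 0;
  if (chunk == 0)
    return 0; /* no progress is possible without data */
  while (buffered < token) {
    buffered += chunk;
    /* each attempt walks everything buffered so far, up to the token end */
    scanned += buffered < token ? buffered : token;
  }
  return scanned;
}

/* Same model, but a parse attempt is skipped until `threshold` bytes are
 * buffered, and the threshold doubles after every incomplete attempt.
 * Assumption: the threshold starts at one chunk. */
unsigned long long scanned_with_heuristic(unsigned long long token,
                                          unsigned long long chunk) {
  unsigned long long buffered = 0, scanned = 0, threshold = chunk;
  if (chunk == 0)
    return 0;
  for (;;) {
    buffered += chunk;
    if (buffered < threshold)
      continue; /* too little data buffered: skip this parse attempt */
    scanned += buffered < token ? buffered : token;
    if (buffered >= token)
      break; /* token complete */
    threshold *= 2; /* stopped on an incomplete token */
  }
  return scanned;
}
```

With T = 8X, the first function returns 36X, which matches (T²/X+T)/2, and the second returns 15X, which matches 2T-X.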
This is all completely transparent to the client, except for:
1. possible delay of some callbacks (when our heuristic overshoots)
2. apps that never do isFinal=XML_TRUE could miss data at the end

For the affected testdata, this change shows a 100-400x speedup. The recset.xml benchmark shows no clear change either way.

Before:
benchmark -n ../testdata/largefiles/recset.xml 65535 3
  3 loops, with buffer size 65535. Average time per loop: 0.270223
benchmark -n ../testdata/largefiles/aaaaaa_attr.xml 4096 3
  3 loops, with buffer size 4096. Average time per loop: 15.033048
benchmark -n ../testdata/largefiles/aaaaaa_cdata.xml 4096 3
  3 loops, with buffer size 4096. Average time per loop: 0.018027
benchmark -n ../testdata/largefiles/aaaaaa_comment.xml 4096 3
  3 loops, with buffer size 4096. Average time per loop: 11.775362
benchmark -n ../testdata/largefiles/aaaaaa_tag.xml 4096 3
  3 loops, with buffer size 4096. Average time per loop: 11.711414
benchmark -n ../testdata/largefiles/aaaaaa_text.xml 4096 3
  3 loops, with buffer size 4096. Average time per loop: 0.019362

After:
./run.sh benchmark -n ../testdata/largefiles/recset.xml 65535 3
  3 loops, with buffer size 65535. Average time per loop: 0.269030
./run.sh benchmark -n ../testdata/largefiles/aaaaaa_attr.xml 4096 3
  3 loops, with buffer size 4096. Average time per loop: 0.044794
./run.sh benchmark -n ../testdata/largefiles/aaaaaa_cdata.xml 4096 3
  3 loops, with buffer size 4096. Average time per loop: 0.016377
./run.sh benchmark -n ../testdata/largefiles/aaaaaa_comment.xml 4096 3
  3 loops, with buffer size 4096. Average time per loop: 0.027022
./run.sh benchmark -n ../testdata/largefiles/aaaaaa_tag.xml 4096 3
  3 loops, with buffer size 4096. Average time per loop: 0.099360
./run.sh benchmark -n ../testdata/largefiles/aaaaaa_text.xml 4096 3
  3 loops, with buffer size 4096. Average time per loop: 0.017956
Expat, Release 2.5.0
This is Expat, a C99 library for parsing XML 1.0 Fourth Edition, started by James Clark in 1997. Expat is a stream-oriented XML parser. This means that you register handlers with the parser before starting the parse. These handlers are called when the parser discovers the associated structures in the document being parsed. A start tag is an example of the kind of structures for which you may register handlers.
Expat supports the following compilers:
- GNU GCC >=4.5
- LLVM Clang >=3.5
- Microsoft Visual Studio >=15.0/2017 (rolling ${today} minus 5 years)
Windows users can use the expat-win32bin-*.*.*.{exe,zip} download, which includes both pre-compiled libraries and executables, and source code for developers.
Expat is free software. You may copy, distribute, and modify it under the terms of the License contained in the file COPYING distributed with this package. This license is the same as the MIT/X Consortium license.
Using libexpat in your CMake-Based Project
There are two ways of using libexpat with CMake:
a) Module Mode
This approach leverages CMake's own module FindEXPAT. Notice the uppercase EXPAT in the following example:
cmake_minimum_required(VERSION 3.0) # or 3.10, see below
project(hello VERSION 1.0.0)
find_package(EXPAT 2.2.8 MODULE REQUIRED)
add_executable(hello
    hello.c
)
# a) for CMake >=3.10 (see CMake's FindEXPAT docs)
target_link_libraries(hello PUBLIC EXPAT::EXPAT)
# b) for CMake >=3.0
target_include_directories(hello PRIVATE ${EXPAT_INCLUDE_DIRS})
target_link_libraries(hello PUBLIC ${EXPAT_LIBRARIES})
b) Config Mode
This approach requires files from…
- libexpat >=2.2.8 where packaging uses the CMake build system or
- libexpat >=2.3.0 where packaging uses the GNU Autotools build system on Linux or
- libexpat >=2.4.0 where packaging uses the GNU Autotools build system on macOS or MinGW.
Notice the lowercase expat in the following example:
cmake_minimum_required(VERSION 3.0)
project(hello VERSION 1.0.0)
find_package(expat 2.2.8 CONFIG REQUIRED char dtd ns)
add_executable(hello
    hello.c
)
target_link_libraries(hello PUBLIC expat::expat)
Building from a Git Clone
If you are building Expat from a check-out from the Git repository, you need to run a script that generates the configure script using the GNU autoconf and libtool tools. To do this, you need to have autoconf 2.58 or newer. Run the script like this:
./buildconf.sh
Once this has been done, follow the same instructions as for building from a source distribution.
Building from a Source Distribution
a) Building with the configure script (i.e. GNU Autotools)
To build Expat from a source distribution, you first run the configuration shell script in the top level distribution directory:
./configure
There are many options which you may provide to configure (which you can discover by running configure with the --help option). But the one of most interest is the one that sets the installation directory. By default, the configure script will set things up to install libexpat into /usr/local/lib, expat.h into /usr/local/include, and xmlwf into /usr/local/bin. If, for example, you'd prefer to install into /home/me/mystuff/lib, /home/me/mystuff/include, and /home/me/mystuff/bin, you can tell configure about that with:
./configure --prefix=/home/me/mystuff
Another interesting option is to enable 64-bit integer support for line and column numbers and the over-all byte index:
./configure CPPFLAGS=-DXML_LARGE_SIZE
However, such a modification would be a breaking change to the ABI and is therefore not recommended for general use — e.g. as part of a Linux distribution — but rather for builds with special requirements.
After running the configure script, the make command will build things and make install will install things into their proper location. Have a look at the Makefile to learn about additional make options. Note that you need to have write permission into the directories into which things will be installed.
If you are interested in building Expat to provide document information in UTF-16 encoding rather than the default UTF-8, follow these instructions (after having run make distclean). Please note that we configure with --without-xmlwf as xmlwf does not support this mode of compilation (yet):
- Mass-patch Makefile.am files to use libexpatw.la for a library name:
  find -name Makefile.am -exec sed -e 's,libexpat\.la,libexpatw.la,' -e 's,libexpat_la,libexpatw_la,' -i {} +
- Run automake to re-write Makefile.in files:
  automake
- For UTF-16 output as unsigned short (and version/error strings as char), run:
  ./configure CPPFLAGS=-DXML_UNICODE --without-xmlwf
  For UTF-16 output as wchar_t (incl. version/error strings), run:
  ./configure CFLAGS="-g -O2 -fshort-wchar" CPPFLAGS=-DXML_UNICODE_WCHAR_T --without-xmlwf
  Note: The latter requires libc compiled with -fshort-wchar, as well.
- Run make (which excludes xmlwf).
- Run make install (again, excludes xmlwf).
Using DESTDIR is supported. It works as follows:
make install DESTDIR=/path/to/image
overrides the in-makefile set DESTDIR, because variable-setting priority is
- commandline
- in-makefile
- environment
Note: This only applies to the Expat library itself; building UTF-16 versions of xmlwf and the tests is currently not supported.
When using Expat with a project using autoconf for configuration, you can use the probing macro in conftools/expat.m4 to determine how to include Expat. See the comments at the top of that file for more information.
A reference manual is available in the file doc/reference.html in this distribution.
b) Building with CMake
The CMake build system is still experimental and may replace the primary build system based on GNU Autotools at some point when it is ready.
Available Options
For an idea of the available (non-advanced) options for building with CMake:
# rm -f CMakeCache.txt ; cmake -D_EXPAT_HELP=ON -LH . | grep -B1 ':.*=' | sed 's,^--$,,'
// Choose the type of build, options are: None Debug Release RelWithDebInfo MinSizeRel ...
CMAKE_BUILD_TYPE:STRING=
// Install path prefix, prepended onto install directories.
CMAKE_INSTALL_PREFIX:PATH=/usr/local
// Path to a program.
DOCBOOK_TO_MAN:FILEPATH=/usr/bin/docbook2x-man
// Build man page for xmlwf
EXPAT_BUILD_DOCS:BOOL=ON
// Build the examples for expat library
EXPAT_BUILD_EXAMPLES:BOOL=ON
// Build fuzzers for the expat library
EXPAT_BUILD_FUZZERS:BOOL=OFF
// Build pkg-config file
EXPAT_BUILD_PKGCONFIG:BOOL=ON
// Build the tests for expat library
EXPAT_BUILD_TESTS:BOOL=ON
// Build the xmlwf tool for expat library
EXPAT_BUILD_TOOLS:BOOL=ON
// Character type to use (char|ushort|wchar_t) [default=char]
EXPAT_CHAR_TYPE:STRING=char
// Install expat files in cmake install target
EXPAT_ENABLE_INSTALL:BOOL=ON
// Use /MT flag (static CRT) when compiling in MSVC
EXPAT_MSVC_STATIC_CRT:BOOL=OFF
// Build fuzzers via ossfuzz for the expat library
EXPAT_OSSFUZZ_BUILD:BOOL=OFF
// Build a shared expat library
EXPAT_SHARED_LIBS:BOOL=ON
// Treat all compiler warnings as errors
EXPAT_WARNINGS_AS_ERRORS:BOOL=OFF
// Make use of getrandom function (ON|OFF|AUTO) [default=AUTO]
EXPAT_WITH_GETRANDOM:STRING=AUTO
// Utilize libbsd (for arc4random_buf)
EXPAT_WITH_LIBBSD:BOOL=OFF
// Make use of syscall SYS_getrandom (ON|OFF|AUTO) [default=AUTO]
EXPAT_WITH_SYS_GETRANDOM:STRING=AUTO