Some articles use non-breaking spaces between quantities and units,
which Wikipedia seems to wrap in a `<span>`. Elements with no text or
only whitespace text were previously removed in order to prune `<link>`s
and the parents of other removed elements, so these spans were dropped
along with the spaces inside them.
This fix preserves the internal whitespace of elements that would
otherwise be removed for being "empty". It does not distinguish between
"meaningful" whitespace and padding between elements that would be
collapsed by HTML formatting rules. It also cannot distinguish between
elements that _started_ with only whitespace and elements that contain
only whitespace because earlier steps removed their other contents. In
the latter case, later processing steps are likely to remove the
preserved whitespace anyway.
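
As a rough illustration of the behavior described above (not the actual
implementation), the following sketch uses a toy tree rather than the real
parser types. `Node`, `prune_empty`, and `text_of` are hypothetical names
introduced only for this example. It relies on `str::trim` treating U+00A0
as whitespace.

```rust
/// Toy DOM node, used only for this illustration (hypothetical type).
#[derive(Debug, Clone, PartialEq)]
enum Node {
    Text(String),
    Element { name: String, children: Vec<Node> },
}

/// Prunes elements that hold no non-whitespace text. An element that
/// held only whitespace is replaced by that whitespace as a bare text
/// node; an element with nothing left at all is dropped.
fn prune_empty(node: Node) -> Option<Node> {
    match node {
        Node::Text(text) => Some(Node::Text(text)),
        Node::Element { name, children } => {
            // Prune bottom-up so parents of removed elements are seen last.
            let children: Vec<Node> =
                children.into_iter().filter_map(prune_empty).collect();
            let has_content = children.iter().any(|child| match child {
                // `trim` also strips U+00A0 (non-breaking space).
                Node::Text(text) => !text.trim().is_empty(),
                Node::Element { .. } => true,
            });
            if has_content {
                return Some(Node::Element { name, children });
            }
            // Only whitespace text nodes (if anything) remain here.
            let whitespace: String = children
                .iter()
                .filter_map(|child| match child {
                    Node::Text(text) => Some(text.as_str()),
                    Node::Element { .. } => None,
                })
                .collect();
            if whitespace.is_empty() {
                None
            } else {
                Some(Node::Text(whitespace))
            }
        }
    }
}

/// Concatenates all text beneath a node, in document order.
fn text_of(node: &Node) -> String {
    match node {
        Node::Text(text) => text.clone(),
        Node::Element { children, .. } => children.iter().map(text_of).collect(),
    }
}
```

In this sketch, a paragraph like `<p>5<span>&nbsp;</span>km<link></p>` keeps
the non-breaking space between "5" and "km" while the empty `<link>` is
dropped. Like the fix itself, it makes no attempt to tell meaningful
whitespace from padding.
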
Fixes #47, fixes organicmaps/organicmaps#8651
Signed-off-by: Evan Lloyd New-Schmidt <evan@new-schmidt.com>
Whitespace behavior differs between `Html::html` and this half-working
pretty printer. The tests now match the parser output exactly.
Signed-off-by: Evan Lloyd New-Schmidt <evan@new-schmidt.com>
- Article contents are from the 2023-04-01 Wikipedia Enterprise Dump
- Add benchmark for HTML processing
Signed-off-by: Evan Lloyd New-Schmidt <evan@new-schmidt.com>