Commit graph

11 commits

Author SHA1 Message Date
Evan Lloyd New-Schmidt
bab29c0de9 Preserve whitespace of removed "empty" elements
Some articles use non-breaking spaces between quantities and units,
which Wikipedia seems to wrap in a span. Elements with no text or
whitespace-only text were previously removed to prune `<link>`s and
parents of other removed elements.

This fix preserves the internal whitespace of elements that would
otherwise be removed for being "empty". It does not distinguish between
"meaningful" whitespace and padding between elements that would be
collapsed by HTML formatting rules. It also cannot distinguish between
elements that _started_ with only whitespace and nodes that now contain
only whitespace after previous steps. The preserved whitespace in the
latter case is unlikely to remain because of later processing steps.

Fixes #47, fixes organicmaps/organicmaps#8651

Signed-off-by: Evan Lloyd New-Schmidt <evan@new-schmidt.com>
2024-07-08 17:13:55 -04:00
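
The behavior described in bab29c0de9 can be illustrated with a minimal sketch. It is not the project's code: the `Node` type and the `prune_empty` and `text_of` functions are hypothetical stand-ins for a real HTML tree, and the sketch assumes that "empty" means "no text, or only whitespace text" (so it would also drop text-free elements such as images, which the real implementation may treat differently). Like the commit, it makes no attempt to tell meaningful whitespace from padding.

```rust
/// A deliberately tiny stand-in for an HTML tree (hypothetical type).
#[derive(Debug, Clone, PartialEq)]
enum Node {
    Text(String),
    Element { tag: String, children: Vec<Node> },
}

/// Concatenate all text found beneath a node.
fn text_of(node: &Node) -> String {
    match node {
        Node::Text(t) => t.clone(),
        Node::Element { children, .. } => children.iter().map(text_of).collect(),
    }
}

/// Remove elements whose text is empty or whitespace-only, but keep any
/// whitespace they contained as a plain text node in the same position.
fn prune_empty(nodes: Vec<Node>) -> Vec<Node> {
    let mut out = Vec::new();
    for node in nodes {
        match node {
            Node::Text(_) => out.push(node),
            Node::Element { tag, children } => {
                // Prune bottom-up so parents of removed elements are
                // re-evaluated after their children have been handled.
                let children = prune_empty(children);
                let text: String = children.iter().map(text_of).collect();
                // `str::trim` uses Unicode whitespace, which includes U+00A0.
                if text.trim().is_empty() {
                    // "Empty" element: drop the tag, keep its whitespace.
                    if !text.is_empty() {
                        out.push(Node::Text(text));
                    }
                } else {
                    out.push(Node::Element { tag, children });
                }
            }
        }
    }
    out
}
```

For an input like `10<span>&nbsp;</span>km<link>`, this sketch drops both the `span` and the `link`, but the non-breaking space survives as a text node between `10` and `km`.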
Evan Lloyd New-Schmidt
3579410659 Remove pretty-printing
Whitespace behavior is different between Html::html and this
half-working pretty printer. Now the tests match the parser output
exactly.

Signed-off-by: Evan Lloyd New-Schmidt <evan@new-schmidt.com>
2024-07-08 17:13:55 -04:00
Evan Lloyd New-Schmidt
7d453d5e63 Reorganize html module
Signed-off-by: Evan Lloyd New-Schmidt <evan@new-schmidt.com>
2024-01-24 12:45:24 -08:00
Evan Lloyd New-Schmidt
1da8ec212a Add checks for article redirects, empty articles, and sniff language
Signed-off-by: Evan Lloyd New-Schmidt <evan@new-schmidt.com>
2024-01-24 12:45:24 -08:00
Evan Lloyd New-Schmidt
c7fe34f3ad Remove header ids
Signed-off-by: Evan Lloyd New-Schmidt <evan@new-schmidt.com>
2023-08-15 18:37:43 -04:00
Evan Lloyd New-Schmidt
6c02f4a569 Remove coordinates from output
Signed-off-by: Evan Lloyd New-Schmidt <evan@new-schmidt.com>
2023-08-15 18:37:43 -04:00
Evan Lloyd New-Schmidt
81783695d5 Remove doctype and html element
Signed-off-by: Evan Lloyd New-Schmidt <evan@new-schmidt.com>
2023-08-15 18:37:43 -04:00
Evan Lloyd New-Schmidt
58f32b43fd Remove empty sections after other removals
Signed-off-by: Evan Lloyd New-Schmidt <evan@new-schmidt.com>
2023-08-15 18:37:43 -04:00
Evan Lloyd New-Schmidt
81f528a350 Expand spans, sections, and body after removing head
Signed-off-by: Evan Lloyd New-Schmidt <evan@new-schmidt.com>
2023-08-15 18:37:43 -04:00
Evan Lloyd New-Schmidt
4b776f49d4 Add denylist from Extracts API
Signed-off-by: Evan Lloyd New-Schmidt <evan@new-schmidt.com>
2023-08-15 18:37:43 -04:00
Evan Lloyd New-Schmidt
75fa04407d Add snapshot tests for html output
- Article contents are from the 2023-04-01 Wikipedia Enterprise Dump
- Add benchmark for HTML processing

Signed-off-by: Evan Lloyd New-Schmidt <evan@new-schmidt.com>
2023-08-15 18:37:43 -04:00