Preserve whitespace of removed "empty" elements

Some articles use non-breaking spaces between quantities and units,
which Wikipedia seems to wrap in a span. Elements with no text or only
whitespace text were previously removed in order to prune `<link>`s and
parents of other removed elements, so these spans were dropped along
with the non-breaking space they contained.

This fix preserves the internal whitespace of elements that would
otherwise be removed for being "empty". It does not distinguish between
"meaningful" whitespace and padding between elements that would be
collapsed by HTML formatting rules. It also cannot distinguish between
elements that _started_ with only whitespace and nodes that now contain
only whitespace after previous steps. The preserved whitespace in the
latter case is unlikely to remain because of later processing steps.
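
To illustrate the difference between removing and expanding an "empty" element, here is a minimal sketch. It assumes a simplified, hypothetical `Node` tree rather than the `Html` document type the real code operates on, and `sample`, `render`, and `is_empty` are illustrative helpers, not functions from this repository. Note that Rust's `char::is_whitespace` follows the Unicode `White_Space` property, so U+00A0 (no-break space) counts as whitespace, which is presumably why such spans looked "empty" in the first place.

```rust
// Hypothetical, simplified document model (NOT the types used by the real code).
#[derive(Debug)]
enum Node {
    Text(String),
    Element { tag: String, children: Vec<Node> },
}

/// True if the node contains no text, or only whitespace.
/// U+00A0 is whitespace according to `char::is_whitespace`.
fn is_empty(node: &Node) -> bool {
    match node {
        Node::Text(t) => t.chars().all(char::is_whitespace),
        Node::Element { children, .. } => children.iter().all(is_empty),
    }
}

/// Sketch of the old behaviour: drop "empty" elements together with their contents.
fn remove_empty(nodes: Vec<Node>) -> Vec<Node> {
    let mut out = Vec::new();
    for node in nodes {
        match node {
            Node::Element { tag, children } => {
                let children = remove_empty(children);
                if !children.iter().all(is_empty) {
                    out.push(Node::Element { tag, children });
                }
            }
            text => out.push(text),
        }
    }
    out
}

/// Sketch of the new behaviour: drop only the wrapper, keep its contents.
fn expand_empty(nodes: Vec<Node>) -> Vec<Node> {
    let mut out = Vec::new();
    for node in nodes {
        match node {
            Node::Element { tag, children } => {
                let children = expand_empty(children);
                if children.iter().all(is_empty) {
                    // Splice the (whitespace-only) children into the parent.
                    out.extend(children);
                } else {
                    out.push(Node::Element { tag, children });
                }
            }
            text => out.push(text),
        }
    }
    out
}

/// Serialize the tree to a small HTML-like string for inspection.
fn render(nodes: &[Node]) -> String {
    let mut s = String::new();
    for node in nodes {
        match node {
            Node::Text(t) => s.push_str(t),
            Node::Element { tag, children } => {
                s.push_str(&format!("<{tag}>{}</{tag}>", render(children)));
            }
        }
    }
    s
}

/// "5,069<span>&nbsp;</span>ft" as a tree.
fn sample() -> Vec<Node> {
    vec![
        Node::Text("5,069".to_string()),
        Node::Element {
            tag: "span".to_string(),
            children: vec![Node::Text("\u{a0}".to_string())],
        },
        Node::Text("ft".to_string()),
    ]
}

fn main() {
    // Removing loses the no-break space; expanding keeps it.
    assert_eq!(render(&remove_empty(sample())), "5,069ft");
    assert_eq!(render(&expand_empty(sample())), "5,069\u{a0}ft");
}
```

As in the real change, this sketch makes no attempt to tell "meaningful" whitespace from padding: any whitespace-only contents are kept when the wrapper is dropped.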

Fixes #47, fixes organicmaps/organicmaps#8651

Signed-off-by: Evan Lloyd New-Schmidt <evan@new-schmidt.com>
Evan Lloyd New-Schmidt 2024-07-08 14:19:16 -04:00 committed by Evan Lloyd New-Schmidt
parent 3579410659
commit bab29c0de9
2 changed files with 7 additions and 4 deletions


@@ -202,7 +202,7 @@ pub fn simplify(document: &mut Html, lang: &str) {
     remove_empty_sections(document);
-    remove_empty(document);
+    expand_empty(document);
     remove_non_element_nodes(document);
@@ -305,7 +305,8 @@ fn remove_toplevel_whitespace(document: &mut Html) {
     remove_ids(document, to_remove.drain(..));
 }
-fn remove_empty(document: &mut Html) {
+/// Expand elements that contain no text or only whitespace, leaving only their contents.
+fn expand_empty(document: &mut Html) {
     let mut to_remove = Vec::new();
     for el in document
@@ -318,7 +319,9 @@ fn remove_empty(document: &mut Html) {
         }
     }
-    remove_ids(document, to_remove.drain(..));
+    for id in to_remove.drain(..) {
+        expand_id(document, id);
+    }
 }
 fn remove_empty_sections(document: &mut Html) {


@@ -7,7 +7,7 @@
 <li>Chatyr-Dag yayla</li>
 <li>Dologorukovskaya (Subatkan) yayla</li>
 <li>Demirci yayla</li>
-<li>Qarabiy yayla</li></ul><h2>Highest peaks</h2><p>The Crimea's highest peak is the Roman-Kosh (Ukrainian: <span lang="uk">Роман-Кош</span>; Russian: <span lang="ru">Роман-Кош</span>, Crimean Tatar: <span lang="crh">Roman Qoş</span>) on the Babugan Yayla at 1,545 metres (5,069ft). Other important peaks over 1,200 metres include:</p><ul><li>Demir-Kapu (Ukrainian: <span lang="uk">Демір-Капу</span>, Russian: <span lang="ru">Демир-Капу</span>, Crimean Tatar: <span lang="crh">Demir Qapı</span>) 1,540 m in the Babugan Yayla;</li>
+<li>Qarabiy yayla</li></ul><h2>Highest peaks</h2><p>The Crimea's highest peak is the Roman-Kosh (Ukrainian: <span lang="uk">Роман-Кош</span>; Russian: <span lang="ru">Роман-Кош</span>, Crimean Tatar: <span lang="crh">Roman Qoş</span>) on the Babugan Yayla at 1,545 metres (5,069&nbsp;ft). Other important peaks over 1,200 metres include:</p><ul><li>Demir-Kapu (Ukrainian: <span lang="uk">Демір-Капу</span>, Russian: <span lang="ru">Демир-Капу</span>, Crimean Tatar: <span lang="crh">Demir Qapı</span>) 1,540 m in the Babugan Yayla;</li>
 <li>Zeytin-Kosh (Ukrainian: <span lang="uk">Зейтин-Кош</span>; Russian: <span lang="ru">Зейтин-Кош</span>, Crimean Tatar: <span lang="crh">Zeytün Qoş</span>) 1,537 m in the Babugan Yayla;</li>
 <li>Kemal-Egerek (Ukrainian: <span lang="uk">Кемаль-Егерек</span>, Russian: <span lang="ru">Кемаль-Эгерек</span>, Crimean Tatar: <span lang="crh">Kemal Egerek</span>) 1,529 m in the Babugan Yayla;</li>
 <li>Eklizi-Burun (Ukrainian: <span lang="uk">Еклізі-Бурун</span>, Russian: <span lang="ru">Эклизи-Бурун</span>, Crimean Tatar: <span lang="crh">Eklizi Burun</span>) 1,527 m in the Chatyrdag Yayla;</li>