Preserve whitespace of removed "empty" elements #48

Merged
newsch merged 2 commits from nbsp into main 2024-07-08 21:13:55 +00:00
newsch commented 2024-07-08 19:08:38 +00:00 (Migrated from github.com)

_The first commit removes the pretty-printing from the test examples and adds a lot of noise to the diff._

Some articles use non-breaking spaces between quantities and units, which Wikipedia seems to wrap with a span. Elements with no text or whitespace-only text were previously removed to prune `<link>`s and parents of other removed elements.

This fix preserves the internal whitespace of elements that would otherwise be removed for being "empty". It does not distinguish between "meaningful" whitespace and padding between elements that would otherwise be collapsed by HTML formatting rules. It also cannot distinguish between elements that _started_ with only whitespace and nodes that now contain only whitespace after previous steps. The preserved whitespace in the latter case is unlikely to remain because of later processing steps.

Fixes #47, fixes organicmaps/organicmaps#8651
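
A minimal sketch of the behaviour described above, on a toy tree rather than the real parsed document. The `Node` type and `prune_empty` function are hypothetical names made up for illustration, and replacing a whitespace-only element with a plain text node is an assumption about how "preserving the internal whitespace" could look, not a copy of the code in this PR.

```rust
/// Toy DOM node: a hypothetical stand-in for the real parsed tree.
#[derive(Debug, Clone, PartialEq)]
enum Node {
    Text(String),
    Element { tag: String, children: Vec<Node> },
}

/// Concatenated text of a node and all of its descendants.
fn text_content(node: &Node) -> String {
    match node {
        Node::Text(t) => t.clone(),
        Node::Element { children, .. } => children.iter().map(text_content).collect(),
    }
}

/// Prune "empty" elements bottom-up.
/// An element with no text at all is dropped (e.g. `<link>`).
/// An element whose text is whitespace-only is replaced by that whitespace
/// (assumption: a bare text node), so a wrapped nbsp survives.
fn prune_empty(nodes: Vec<Node>) -> Vec<Node> {
    let mut out = Vec::new();
    for node in nodes {
        match node {
            Node::Text(t) => out.push(Node::Text(t)),
            Node::Element { tag, children } => {
                // Children are pruned first, so a parent left with only
                // whitespace after earlier removals is treated the same way
                // as an element that started out whitespace-only.
                let el = Node::Element { tag, children: prune_empty(children) };
                let text = text_content(&el);
                // `char::is_whitespace` covers U+00A0 (non-breaking space).
                if text.chars().all(char::is_whitespace) {
                    if !text.is_empty() {
                        out.push(Node::Text(text));
                    }
                } else {
                    out.push(el);
                }
            }
        }
    }
    out
}
```

With this sketch, `5<span>&nbsp;</span>km<link>` reduces to the three text nodes `5`, a non-breaking space, and `km`. As noted in the description, it makes no attempt to tell meaningful whitespace from inter-element padding.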
biodranik (Migrated from github.com) approved these changes 2024-07-08 19:26:25 +00:00
biodranik (Migrated from github.com) left a comment

Thanks!

biodranik (Migrated from github.com) commented 2024-07-08 19:26:18 +00:00

Is there any minification used later, to remove unnecessary line endings for final HTML pages?

biodranik (Migrated from github.com) commented 2024-07-08 19:25:07 +00:00

Would it save a bit more space if the nbsp were encoded directly as the literal character (U+00A0), instead of `&nbsp;`?
newsch (Migrated from github.com) reviewed 2024-07-08 21:11:58 +00:00
newsch (Migrated from github.com) commented 2024-07-08 21:11:58 +00:00

It would, but we don't control that part of the writing. `html5ever` [converts the literal to the escaped version](https://github.com/servo/html5ever/blob/e69b05c849031d6a2837d7d86372ff14b3f4080e/html5ever/src/serialize/mod.rs#L108), I assume because it is [part of the serialization spec](https://html.spec.whatwg.org/multipage/parsing.html#escapingString).

It's possible to write another `Serializer` like the pretty-printer that minifies instead, but I haven't figured out the whitespace collapsing rules well enough to write one. There aren't any crates that implement an `html5ever::Serializer` minifier, so adding an external minifier would need to re-parse the HTML.
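
For reference, a rough sketch of the text-content escaping being discussed, using only the standard library. `escape_text` is a hypothetical helper written for illustration, not `html5ever`'s API, and it covers text nodes only (attribute values are escaped differently). The size difference in question is small but real: U+00A0 is 2 bytes in UTF-8, while `&nbsp;` is 6.

```rust
/// Escape a string for use as HTML text content (not an attribute value):
/// `&`, U+00A0, `<` and `>` become character references.
/// Hypothetical helper for illustration only.
fn escape_text(input: &str) -> String {
    let mut out = String::with_capacity(input.len());
    for c in input.chars() {
        match c {
            '&' => out.push_str("&amp;"),
            // The literal non-breaking space is always written as an entity.
            '\u{a0}' => out.push_str("&nbsp;"),
            '<' => out.push_str("&lt;"),
            '>' => out.push_str("&gt;"),
            _ => out.push(c),
        }
    }
    out
}
```

For example, `escape_text("5\u{a0}km")` yields `5&nbsp;km`, which is 4 bytes longer than the unescaped input.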
newsch (Migrated from github.com) reviewed 2024-07-08 21:12:22 +00:00
newsch (Migrated from github.com) commented 2024-07-08 21:12:22 +00:00

See above - we don't do a proper minification step, so whitespace _within_ elements is left in the output.