Initial Html Processing #10

Merged
newsch merged 1 commit from html-processing into main 2023-07-10 14:58:56 +00:00
newsch commented 2023-06-29 20:02:31 +00:00 (Migrated from github.com)

While I'm waiting for more dump files to download, I got started with this.

This PR will bring the html output to functional parity with the scraped ones.
There will still be extra metadata and other bloat covered in #4.

Remaining steps:

  • convert all relative articles to absolute
  • Remove img/picture elements
  • Make sure interwiki links are handled correctly
  • Merge #6
While I'm waiting for more dump files to download, I got started with this. This PR will bring the html output to functional parity with the scraped ones. There will still be extra metadata and other bloat covered in #4. Remaining steps: - [x] convert all relative articles to absolute - [x] Remove `img`/`picture` elements - [x] Make sure interwiki links are handled correctly - [x] Merge #6
biodranik (Migrated from github.com) reviewed 2023-06-30 05:41:23 +00:00
@ -10,0 +10,4 @@
env_logger::Builder::new()
.filter_level(log::LevelFilter::Info)
.parse_default_env()
.try_init()?;
biodranik (Migrated from github.com) commented 2023-06-30 05:32:12 +00:00

What does it do?

What does it do?
@ -51,34 +52,65 @@ pub fn simplify(html: &str, lang: &str) -> String {
}
}
biodranik (Migrated from github.com) commented 2023-06-30 05:34:00 +00:00

Do we insert full URLs on every page now? That is an overhead, webview on iOS and Android should work properly with relative URLs. Let's investigate and fix it in a separate issue later (TODO may be good here too).

Do we insert full URLs on every page now? That is an overhead, webview on iOS and Android should work properly with relative URLs. Let's investigate and fix it in a separate issue later (TODO may be good here too).
biodranik (Migrated from github.com) commented 2023-06-30 05:35:35 +00:00

Is it possible to check whitespaces without modification? E.g. use is_whitespace ?

Is it possible to check whitespaces without modification? E.g. use is_whitespace ?
biodranik (Migrated from github.com) commented 2023-06-30 05:38:28 +00:00

Maybe it makes sense to strip cross-wiki links to reduce the HTML size. As our articles are localized and mostly used offline, it can be a good idea.

Maybe it makes sense to strip cross-wiki links to reduce the HTML size. As our articles are localized and mostly used offline, it can be a good idea.
biodranik (Migrated from github.com) commented 2023-06-30 05:41:09 +00:00

I don't see any links at all in the existing HTML dumps... Only formatting and headers tags.

I don't see any links at all in the existing HTML dumps... Only formatting and headers tags.
newsch (Migrated from github.com) reviewed 2023-06-30 15:17:22 +00:00
@ -10,0 +10,4 @@
env_logger::Builder::new()
.filter_level(log::LevelFilter::Info)
.parse_default_env()
.try_init()?;
newsch (Migrated from github.com) commented 2023-06-30 15:17:22 +00:00

It enables the logger and sets a default log level that can still be overridden by the default environment variable. I've run into trouble before when I do it in a different order and then the env var doesn't work, or can filter higher levels but not enable lower levels.

It enables the logger and sets a default log level that can still be overridden by the default environment variable. I've run into trouble before when I do it in a different order and then the env var doesn't work, or can filter higher levels but not enable lower levels.
newsch (Migrated from github.com) reviewed 2023-06-30 15:21:29 +00:00
@ -51,34 +52,65 @@ pub fn simplify(html: &str, lang: &str) -> String {
}
}
newsch (Migrated from github.com) commented 2023-06-30 15:21:28 +00:00

When looking at this more I found that the html does include a base element set to //lang.wikipedia.org/wiki/, but when opened as a file in firefox it assumes the scheme is file: so they don't work.
I'll remove this, and later if we run into a similar problem with the webviews, setting the scheme once in the base element should handle it.

When looking at this more I found that the html does include a [`base`](https://developer.mozilla.org/en-US/docs/Web/HTML/Element/base) element set to `//lang.wikipedia.org/wiki/`, but when opened as a file in firefox it assumes the scheme is `file:` so they don't work. I'll remove this, and later if we run into a similar problem with the webviews, setting the scheme once in the base element should handle it.
newsch (Migrated from github.com) reviewed 2023-06-30 15:36:14 +00:00
@ -51,34 +52,65 @@ pub fn simplify(html: &str, lang: &str) -> String {
}
}
newsch (Migrated from github.com) commented 2023-06-30 15:36:14 +00:00

We could do el.text().flat_map(str::chars).all(char::is_whitespace). I was going to say that working with characters might be less efficient, since the Patterns can work directly in UTF-8, but it looks like the implementation of trim also uses char::is_whitespace.

We could do `el.text().flat_map(str::chars).all(char::is_whitespace)`. I was going to say that working with characters might be less efficient, since the `Pattern`s can [work directly in UTF-8](https://doc.rust-lang.org/src/core/str/pattern.rs.html#374), but it looks like [the implementation of `trim`](https://doc.rust-lang.org/src/core/str/mod.rs.html#1867) also uses `char::is_whitespace`.
newsch (Migrated from github.com) reviewed 2023-06-30 15:38:28 +00:00
@ -51,34 +52,65 @@ pub fn simplify(html: &str, lang: &str) -> String {
}
}
newsch (Migrated from github.com) commented 2023-06-30 15:38:28 +00:00

Yes, the scraper uses the Extracts API which strips most of the non-text elements.
Would you prefer we removed the links altogether?

Yes, the scraper uses [the Extracts API](https://www.mediawiki.org/wiki/Extension:TextExtracts) which strips most of the non-text elements. Would you prefer we removed the links altogether?
biodranik (Migrated from github.com) reviewed 2023-06-30 16:08:23 +00:00
@ -51,34 +52,65 @@ pub fn simplify(html: &str, lang: &str) -> String {
}
}
biodranik (Migrated from github.com) commented 2023-06-30 16:08:23 +00:00

Links may be useful in the future, especially if we somehow link wiki articles that are already embedded in the offline maps data. But for now, they can be omitted, I didn't see them on Wiki pages in OM. Did you?

Links may be useful in the future, especially if we somehow link wiki articles that are already embedded in the offline maps data. But for now, they can be omitted, I didn't see them on Wiki pages in OM. Did you?
biodranik (Migrated from github.com) reviewed 2023-06-30 16:31:41 +00:00
@ -51,34 +52,65 @@ pub fn simplify(html: &str, lang: &str) -> String {
}
}
biodranik (Migrated from github.com) commented 2023-06-30 16:31:41 +00:00

As trim actually returns a slice, your implementation is already optimal )

As trim actually returns a slice, your implementation is already optimal )
newsch (Migrated from github.com) reviewed 2023-06-30 17:11:40 +00:00
@ -51,34 +52,65 @@ pub fn simplify(html: &str, lang: &str) -> String {
}
}
newsch (Migrated from github.com) commented 2023-06-30 17:11:39 +00:00

One thing I realized though is that trim still has to check the right side if it encounters non-whitespace characters from the left. trim_left is better, but both still need to construct the slice. I don't think the compiler can optimize that away.

I tried the three approaches with a selection of strings and found that the char iterator was fastest. Ultimately a micro-optimization but still interesting 😉.

const MIXED: &str =          " \tcd\nfg ";
const NOT_WHITESPACE: &str = "abcdefgh";
const WHITESPACE: &str =     " \t\n \t\t  ";
running 9 tests
test chars_mixed              ... bench:           3 ns/iter (+/- 0)
test chars_not_whitespace     ... bench:           0 ns/iter (+/- 0)
test chars_whitespace         ... bench:           7 ns/iter (+/- 1)
test trim_left_mixed          ... bench:           4 ns/iter (+/- 1)
test trim_left_not_whitespace ... bench:           3 ns/iter (+/- 0)
test trim_left_whitespace     ... bench:           9 ns/iter (+/- 0)
test trim_mixed               ... bench:          10 ns/iter (+/- 1)
test trim_not_whitespace      ... bench:           5 ns/iter (+/- 4)
test trim_whitespace          ... bench:          12 ns/iter (+/- 0)
One thing I realized though is that `trim` still has to check the right side if it encounters non-whitespace characters from the left. `trim_left` is better, but both still need to construct the slice. I don't think the compiler can optimize that away. I tried the three approaches with a selection of strings and found that the char iterator was fastest. Ultimately a micro-optimization but still interesting :wink:. ```rust const MIXED: &str = " \tcd\nfg "; const NOT_WHITESPACE: &str = "abcdefgh"; const WHITESPACE: &str = " \t\n \t\t "; ``` ``` running 9 tests test chars_mixed ... bench: 3 ns/iter (+/- 0) test chars_not_whitespace ... bench: 0 ns/iter (+/- 0) test chars_whitespace ... bench: 7 ns/iter (+/- 1) test trim_left_mixed ... bench: 4 ns/iter (+/- 1) test trim_left_not_whitespace ... bench: 3 ns/iter (+/- 0) test trim_left_whitespace ... bench: 9 ns/iter (+/- 0) test trim_mixed ... bench: 10 ns/iter (+/- 1) test trim_not_whitespace ... bench: 5 ns/iter (+/- 4) test trim_whitespace ... bench: 12 ns/iter (+/- 0) ```
newsch (Migrated from github.com) reviewed 2023-06-30 17:15:59 +00:00
@ -51,34 +52,65 @@ pub fn simplify(html: &str, lang: &str) -> String {
}
}
newsch (Migrated from github.com) commented 2023-06-30 17:15:59 +00:00

No, the current ones don't have any. I'll strip them from the output for now.

No, the current ones don't have any. I'll strip them from the output for now.
biodranik (Migrated from github.com) reviewed 2023-07-01 09:46:42 +00:00
@ -51,34 +52,65 @@ pub fn simplify(html: &str, lang: &str) -> String {
}
}
biodranik (Migrated from github.com) commented 2023-07-01 09:46:42 +00:00

This is an important outcome of an interesting project: to learn something new )

This is an important outcome of an interesting project: to learn something new )
biodranik (Migrated from github.com) approved these changes 2023-07-06 10:20:05 +00:00
biodranik (Migrated from github.com) commented 2023-07-06 10:19:19 +00:00

Can this comment be clarified?

Can this comment be clarified?
biodranik (Migrated from github.com) commented 2023-07-06 10:20:01 +00:00

nit: Using English sentences starting with a capital letter and ending with a dot may be more readable.

nit: Using English sentences starting with a capital letter and ending with a dot may be more readable.
newsch (Migrated from github.com) reviewed 2023-07-06 15:05:42 +00:00
newsch (Migrated from github.com) commented 2023-07-06 15:05:41 +00:00
        // Detach one of the links from the root tree (as if previously deleted) to ensure it handles orphan nodes nicely.

Some of the tree operations panic when the node doesn't have a parent, and we could be processing nodes that were removed from the tree in a previous pass.

```suggestion // Detach one of the links from the root tree (as if previously deleted) to ensure it handles orphan nodes nicely. ``` Some of the tree operations [panic when the node doesn't have a parent](https://docs.rs/ego-tree/latest/ego_tree/struct.NodeMut.html#method.insert_id_before), and we could be processing nodes that were removed from the tree in a previous pass.
Sign in to join this conversation.
No description provided.