Initial Html Processing #10
Reference: organicmaps/wikiparser#10
While I'm waiting for more dump files to download, I got started with this.
This PR will bring the HTML output to functional parity with the scraped output.
There will still be extra metadata and other bloat covered in #4.
Remaining steps:
img/picture elements
@ -10,0 +10,4 @@
env_logger::Builder::new()
.filter_level(log::LevelFilter::Info)
.parse_default_env()
.try_init()?;
What does it do?
@ -51,34 +52,65 @@ pub fn simplify(html: &str, lang: &str) -> String {
}
}
Do we insert full URLs on every page now? That is overhead; webviews on iOS and Android should work properly with relative URLs. Let's investigate and fix it in a separate issue later (a TODO may be good here too).
Is it possible to check for whitespace without modifying the string? E.g. use is_whitespace?
Maybe it makes sense to strip cross-wiki links to reduce the HTML size. As our articles are localized and mostly used offline, it can be a good idea.
I don't see any links at all in the existing HTML dumps... Only formatting and header tags.
@ -10,0 +10,4 @@
env_logger::Builder::new()
.filter_level(log::LevelFilter::Info)
.parse_default_env()
.try_init()?;
It enables the logger and sets a default log level that can still be overridden by the default environment variable (RUST_LOG). I've run into trouble before when doing this in a different order: the env var either doesn't work at all, or it can filter to higher levels but not enable lower ones.
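The precedence described above can be modelled without the logging crate. This is only a sketch of the idea (a hard-coded default that a valid environment value replaces), not env_logger's implementation: the Level enum, parse_level and effective_level are hypothetical names, and real RUST_LOG values can also carry per-module directives that this ignores.

```rust
/// Hypothetical stand-in for a log level; not the `log` crate's type.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Level {
    Error,
    Warn,
    Info,
    Debug,
    Trace,
}

/// Parse a bare level name such as "debug". Anything else yields `None`.
fn parse_level(s: &str) -> Option<Level> {
    match s.trim().to_ascii_lowercase().as_str() {
        "error" => Some(Level::Error),
        "warn" => Some(Level::Warn),
        "info" => Some(Level::Info),
        "debug" => Some(Level::Debug),
        "trace" => Some(Level::Trace),
        _ => None,
    }
}

/// Start from the default and let a valid environment value replace it.
/// Applying the two in the opposite order would make the default win instead.
fn effective_level(default: Level, env_value: Option<&str>) -> Level {
    env_value.and_then(parse_level).unwrap_or(default)
}

fn main() {
    // Read the variable once; an unset or non-UTF-8 value counts as absent.
    let from_env = std::env::var("RUST_LOG").ok();
    let level = effective_level(Level::Info, from_env.as_deref());
    println!("effective level: {:?}", level);
}
```

Because the default is applied first, the environment can both raise and lower verbosity, which is the behaviour the builder order in the diff is meant to preserve.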
@ -51,34 +52,65 @@ pub fn simplify(html: &str, lang: &str) -> String {
}
}
When looking at this more I found that the html does include a base element set to //lang.wikipedia.org/wiki/, but when opened as a file in firefox it assumes the scheme is file: so they don't work.
I'll remove this, and later if we run into a similar problem with the webviews, setting the scheme once in the base element should handle it.
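To illustrate why the scheme-relative base breaks for local files, here is a deliberately simplified sketch. resolve_scheme_relative is a hypothetical helper, "en" stands in for the lang placeholder, and real URL resolution has many more rules than this.

```rust
/// Resolve a scheme-relative URL ("//host/path") by borrowing the scheme of
/// the document that contains it. Other URLs are returned unchanged.
/// Simplified for illustration; not a full URL resolver.
fn resolve_scheme_relative(document_scheme: &str, url: &str) -> String {
    if url.starts_with("//") {
        format!("{}:{}", document_scheme, url)
    } else {
        url.to_string()
    }
}

fn main() {
    let base = "//en.wikipedia.org/wiki/";

    // Served over HTTPS, the base points at the real site.
    println!("{}", resolve_scheme_relative("https", base));

    // Opened from disk, the same base inherits the file scheme and
    // no longer points anywhere useful.
    println!("{}", resolve_scheme_relative("file", base));
}
```

Writing the scheme explicitly in the base element sidesteps the inheritance, which is why setting it once there should be enough if the webviews show the same problem.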
@ -51,34 +52,65 @@ pub fn simplify(html: &str, lang: &str) -> String {
}
}
We could do el.text().flat_map(str::chars).all(char::is_whitespace). I was going to say that working with characters might be less efficient, since the Patterns can work directly in UTF-8, but it looks like the implementation of trim also uses char::is_whitespace.
@ -51,34 +52,65 @@ pub fn simplify(html: &str, lang: &str) -> String {
}
}
Yes, the scraper uses the Extracts API which strips most of the non-text elements.
Would you prefer we removed the links altogether?
@ -51,34 +52,65 @@ pub fn simplify(html: &str, lang: &str) -> String {
}
}
Links may be useful in the future, especially if we somehow link wiki articles that are already embedded in the offline maps data. But for now, they can be omitted, I didn't see them on Wiki pages in OM. Did you?
@ -51,34 +52,65 @@ pub fn simplify(html: &str, lang: &str) -> String {
}
}
As trim actually returns a slice, your implementation is already optimal )
@ -51,34 +52,65 @@ pub fn simplify(html: &str, lang: &str) -> String {
}
}
One thing I realized though is that trim still has to check the right side if it encounters non-whitespace characters from the left. trim_left is better, but both still need to construct the slice. I don't think the compiler can optimize that away.
I tried the three approaches with a selection of strings and found that the char iterator was fastest. Ultimately a micro-optimization but still interesting 😉.
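For reference, the three approaches compared in this thread could look roughly like the sketch below. The function names are hypothetical, trim_start is used as the current name for trim_left, and a plain iterator of &str stands in for el.text(). It only shows that the checks agree, and makes no claim about timings.

```rust
/// Approach 1: trim both ends, then test whether anything is left.
fn is_blank_trim(s: &str) -> bool {
    s.trim().is_empty()
}

/// Approach 2: trim only the leading side, so the right side is never scanned.
fn is_blank_trim_start(s: &str) -> bool {
    s.trim_start().is_empty()
}

/// Approach 3: walk the characters and stop at the first non-whitespace one,
/// without constructing a trimmed slice.
fn is_blank_chars(s: &str) -> bool {
    s.chars().all(char::is_whitespace)
}

/// The same check over several text fragments, as an element's text
/// iterator would yield them.
fn is_blank_fragments<'a>(parts: impl Iterator<Item = &'a str>) -> bool {
    parts.flat_map(str::chars).all(char::is_whitespace)
}

fn main() {
    let samples = ["", "   \n\t", "  text  ", "x"];
    for &s in samples.iter() {
        // All three approaches must give the same answer.
        assert_eq!(is_blank_trim(s), is_blank_chars(s));
        assert_eq!(is_blank_trim_start(s), is_blank_chars(s));
        println!("{:?} is blank: {}", s, is_blank_chars(s));
    }

    let fragments = ["  ", "\n"];
    println!("fragments blank: {}", is_blank_fragments(fragments.iter().copied()));
}
```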
@ -51,34 +52,65 @@ pub fn simplify(html: &str, lang: &str) -> String {
}
}
No, the current ones don't have any. I'll strip them from the output for now.
@ -51,34 +52,65 @@ pub fn simplify(html: &str, lang: &str) -> String {
}
}
This is an important outcome of an interesting project: to learn something new )
Can this comment be clarified?
nit: Using English sentences starting with a capital letter and ending with a dot may be more readable.
Some of the tree operations panic when the node doesn't have a parent, and we could be processing nodes that were removed from the tree in a previous pass.
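A toy model of that situation, assuming nothing about the real tree library: Node, Tree, detach, parent_tag and sample_tree are hypothetical names for a tiny arena tree, used only to show why a later pass should check for a parent rather than assume one.

```rust
/// A node in a toy arena tree. `parent` is `None` for the root and for
/// nodes that have been detached.
struct Node {
    tag: &'static str,
    parent: Option<usize>,
}

struct Tree {
    nodes: Vec<Node>,
}

impl Tree {
    /// Detach a node. It stays in the arena, so ids collected earlier
    /// still resolve to it.
    fn detach(&mut self, id: usize) {
        self.nodes[id].parent = None;
    }

    /// Tag of the node's parent, or `None` if the node has no parent.
    fn parent_tag(&self, id: usize) -> Option<&'static str> {
        self.nodes[id].parent.map(|p| self.nodes[p].tag)
    }
}

/// body > div > span
fn sample_tree() -> Tree {
    Tree {
        nodes: vec![
            Node { tag: "body", parent: None },
            Node { tag: "div", parent: Some(0) },
            Node { tag: "span", parent: Some(1) },
        ],
    }
}

fn main() {
    let mut tree = sample_tree();

    // A first pass removes the div from the tree.
    tree.detach(1);

    // A later pass works from ids gathered before the removal, so it has to
    // handle the missing parent instead of unwrapping it.
    for &id in [1usize, 2].iter() {
        match tree.parent_tag(id) {
            Some(tag) => println!("node {} has parent <{}>", id, tag),
            None => println!("node {} has no parent, skipping", id),
        }
    }
}
```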