Proof of Concept #3 (organicmaps/wikiparser)
This is an initial implementation of searching for Wikipedia urls/Wikidata QIDs and simplifying html. It's a little messy, inefficient, and doesn't handle multiple languages or writing to an output directory.

Remaining work for this PR:

On the WikipediaTitleNorm design: parsing 3 million QIDs is almost instant; 1.5 million titles take around a minute.

I think so; at the very least it needs a lib.rs to make a separate binary for testing the html simplifying. I won't go overboard 😉.

I agree, but I wanted to get the logic working before worrying about that.

Good point, I'll keep that in mind if I run into any other work that needs to be done.
There are a number of crates related to the .pbf format, and bindings for libosmium, but I haven't come across any .o5m ones that seem thoroughly developed.

Can pbf be used to get fresh OpenStreetMap.org updates without converting it to o5m?
The planet downloads are available in pbf: https://planet.openstreetmap.org/pbf/
I think this is saying you can apply the daily/hourly diffs to pbf, but I'm not sure I'm reading it correctly: https://wiki.openstreetmap.org/wiki/Planet.osm/diffs#Using_the_replication_diffs
This makes it seem like diffs can be appended at the end of the file, which suggests you'd need to read the entire file to get the true state of any node/object?
https://wiki.openstreetmap.org/wiki/PBF_Format#What_are_the_replication_fields_for?
I don't understand enough about these formats or how the generator works to know if they'd be compatible.
That's why we use o5m. It is suitable for incremental updates, and the osmupdate tool creates a new planet.o5m file automatically by applying these updates.
It looks like osmupdate supports pbf for the same thing, but I don't understand how that works.
Tested, it works for pbf too. Cool! The whole planet can be stripped to 58 GB:
"$OSMUPDATE" -v --drop-authors --drop-version --hash-memory=16000 planet-230529.osm.pbf planet-updated.osm.pbf
Alright, here's what I've learned from comparing the html outputs:
The simplifying step of the python and the rust versions gives comparable output (proper minification is still needed), but the html that the python version gets from the api is much simpler than I expected.
It uses the Extracts API, which strips most of the markup.
The html in the dumps, on the other hand, seems much closer to the content of a complete article. Size-wise, the dump html is around 10x the size of the extracts' html for the subset I looked at.
To get to parity with that output, we'll need to add additional steps to the html processing to remove:
This is doable; it will involve some more exploration of what exactly to match on.
I'll make issues for that and the minification.
Minification is not a blocker and can be done later, if necessary. It would be great to compare final mwm sizes; AFAIR, some compression is used there.
Showing images for those who want them, and maybe leaving links too, may be a good idea.
👍
Good point. I've seen the compression referenced in text_storage.hpp. I'll look into doing this locally.
I know that's a goal, but right now the links/images are all relative urls, so they'll still need some processing.
Next steps are:
After that I think this will be ready to try out in production!
Is there a more robust way to exclude some sections for all languages?
Do you mean moving this to a configuration file, or something that works independent of the language?
Trying to parse the template text could be language independent, but I think that would be less robust than checking the header names.
We could collapse it into a single set and apply it to all languages.
No need to collapse; a config looks like a better option (keeping in mind that many other languages will be added, and potential contributors who don't read Rust code).
Do you want to load it at compile time or runtime?
I added a compile-time json config in 7e6b39a; adding a flag for loading one at runtime is straightforward if we want to do that later. Using a different config language is also a quick change.
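For context, here's a minimal sketch of the compile-time approach, assuming a schema that maps language codes to section headers; the field names, the once_cell/serde usage, and the helper are illustrative, not the actual wikiparser code:

```rust
use std::collections::{HashMap, HashSet};

use once_cell::sync::Lazy;
use serde::Deserialize;

#[derive(Deserialize)]
struct Config {
    // Section headers to strip, keyed by language code (e.g. "en").
    sections_to_remove: HashMap<String, HashSet<String>>,
}

// include_str! embeds the file in the binary, so an invalid config fails
// the first time it is parsed (or in a test) rather than depending on a
// file being present on disk at runtime.
static CONFIG: Lazy<Config> = Lazy::new(|| {
    serde_json::from_str(include_str!("../article_processing_config.json"))
        .expect("article_processing_config.json is invalid")
});

fn should_remove_section(lang: &str, header: &str) -> bool {
    CONFIG
        .sections_to_remove
        .get(lang)
        .map_or(false, |headers| headers.contains(header))
}
```

A runtime flag could reuse the same Config struct and just read the file with serde_json instead of include_str!.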
LGTM with a few comments
@ -4,0 +4,4 @@
## Usage
[`article_processing_config.json`](article_processing_config.json) should be updated when adding a new language.
... It defines the article's sections that are not important for users and should be removed.
Does it make sense to sort sections by name?
nit: Normal sentences are more readable in many cases. Here and in other places.
What's needed to get the right answers to these TODOs?
Should the title be trimmed?
@ -0,0 +76,4 @@
    if let Some(mut node) = document.tree.get_mut(id) {
        node.detach();
    }
}
Can copy-paste be avoided?
@ -0,0 +82,4 @@
}
#[cfg(test)]
mod test {
Is it hard to make a simple test for the function above?
@ -0,0 +97,4 @@
///
/// let with_q = WikidataQid::from_str("Q12345").unwrap();
/// let without_q = WikidataQid::from_str(" 12345 ").unwrap();
/// assert_eq!(with_q, without_q);
Does it make sense to test it?
@ -0,0 +128,4 @@
///
/// let title = WikipediaTitleNorm::from_title("Article Title", "en").unwrap();
/// let url = WikipediaTitleNorm::from_url("https://en.wikipedia.org/wiki/Article_Title#Section").unwrap();
/// assert_eq!(url, title);
Does it make sense to test it?
Thanks, I'll address those comments, rebase into individual changes, and merge.
Oops, that comment slipped through when I handled multiple languages in 6e5385d. I'll remove it.
Currently it will work with any language, but it only processes a single dump at a time. So when it reads an English dump, each article json has an .in_language.identifier field set to en, and the program writes that html to a QXXXXX/en.html file.

Running the program multiple times with different language dumps will fill in the various QXXXXX/$lang.html files.

We could extend it to process multiple dumps in parallel, but I don't expect there to be much of a speedup right now.
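As a sketch of that output layout (the helper name and signature here are mine, not the PR's):

```rust
use std::path::{Path, PathBuf};

// Build the per-article output path described above, e.g. "<out_dir>/Q42/en.html".
fn article_output_path(out_dir: &Path, qid: &str, lang: &str) -> PathBuf {
    out_dir.join(qid).join(format!("{lang}.html"))
}
```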
The doctest on the type definition verifies that the two constructors parse and normalize correctly, but doesn't check for various error cases.
Do you think there should be more?
I haven't tried it out yet, but there are a couple of options that come to mind:
- gunzip/tar, not sure about python pgzip
- xargs, parallel, etc.

Understood. I have been using the world list of urls/ids from the map generator with no problems, but if we switch to using OSM data directly I'll rethink this. The program will log any issues it has parsing titles/QIDs.
@ -0,0 +82,4 @@
}
#[cfg(test)]
mod test {
No, I just didn't bother since I'll be changing this in the next PR. I can add some if you'd like.
@ -0,0 +76,4 @@
    if let Some(mut node) = document.tree.get_mut(id) {
        node.detach();
    }
}
I'll be improving and refactoring this in the next PR; if this bit survives I'll move it to a function.
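If it does survive, the extracted function could look roughly like this (assuming document is a scraper::Html and the ids are ego-tree NodeIds; the name detach_nodes is hypothetical):

```rust
use ego_tree::NodeId;
use scraper::Html;

// Detach (remove) each node in `ids` from the parsed document, skipping
// ids that are no longer present in the tree.
fn detach_nodes(document: &mut Html, ids: impl IntoIterator<Item = NodeId>) {
    for id in ids {
        if let Some(mut node) = document.tree.get_mut(id) {
            node.detach();
        }
    }
}
```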
@ -0,0 +128,4 @@
///
/// let title = WikipediaTitleNorm::from_title("Article Title", "en").unwrap();
/// let url = WikipediaTitleNorm::from_url("https://en.wikipedia.org/wiki/Article_Title#Section").unwrap();
/// assert_eq!(url, title);
I added some checks for whitespace, empty strings, and tests for errors in 70f7edf; is there something else you think should be handled?
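For reference, the error cases I mean look roughly like this (assuming the constructors return a Result; which inputs are rejected is an assumption based on the checks described above):

```rust
#[cfg(test)]
mod title_norm_errors {
    use super::WikipediaTitleNorm;

    #[test]
    fn empty_or_whitespace_title_is_rejected() {
        assert!(WikipediaTitleNorm::from_title("", "en").is_err());
        assert!(WikipediaTitleNorm::from_title("   ", "en").is_err());
    }

    #[test]
    fn url_without_article_path_is_rejected() {
        // Hypothetical case: a URL that doesn't point at a /wiki/ article page.
        assert!(WikipediaTitleNorm::from_url("https://en.wikipedia.org/").is_err());
    }
}
```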
I need to look through a good sample of the articles and check for things like formatting in headers, tags that are used, and which wikimedia meta tags are responsible for a third of the document size.
That's part of the next work.
@ -0,0 +97,4 @@
///
/// let with_q = WikidataQid::from_str("Q12345").unwrap();
/// let without_q = WikidataQid::from_str(" 12345 ").unwrap();
/// assert_eq!(with_q, without_q);
Same as above.
It should be ok to run tasks in parallel using a bash for loop. We'll rethink it later if necessary.