Proof of Concept #3

Merged
newsch merged 6 commits from proof-of-concept into main 2023-06-23 19:50:04 +00:00
newsch commented 2023-06-01 20:00:39 +00:00 (Migrated from github.com)

This is an initial implementation of searching for Wikipedia urls/Wikidata QIDs and simplifying html.
It's a little messy, inefficient, ~~and doesn't handle multiple languages or writing to an output directory~~.

Remaining work for this PR:

- [x] Refactor into library/modules
- [x] Set up comparison of html processing with python implementation
  - processing of same html files
  - scraped vs enterprise dump
- [x] Add unit tests
- [x] Revisit `WikipediaTitleNorm` design (parsing 3 million QIDs is almost instant, 1.5 million titles around a minute)
- [x] Add logging/tracing
- [x] Add proper CLI
- [x] Look through TODOs
- [x] Add multiple language support
- [x] Write to directory
biodranik (Migrated from github.com) reviewed 2023-06-01 20:21:55 +00:00
biodranik (Migrated from github.com) left a comment
  1. Does it make sense to start splitting the code into logical modules in separate sources?
  2. Implementing convenient cmd line args may save some time.
  3. Some common data preparation tasks or common test scenarios can be also implemented in rust with a cmd line param if it saves time in the longer term.
  4. Is there a Rust library to read and extract tags from planet.o5m files directly? Just curious, using osmfilter or osmium-tool is also ok.
newsch commented 2023-06-01 21:05:35 +00:00 (Migrated from github.com)
> 1. Does it make sense to start splitting the code into logical modules in separate sources?

I think so, at the very least it needs a `lib.rs` to make a separate binary for testing the html simplifying. I won't go overboard 😉.

> 2. Implementing convenient cmd line args may save some time.

I agree, but I wanted to get the logic working before worrying about that.

> 3. Some common data preparation tasks or common test scenarios can be also implemented in rust with a cmd line param if it saves time in the longer term.

Good point, I'll keep that in mind if I run into any other work that needs to be done.

> 4. Is there a Rust library to read and extract tags from planet.o5m files directly? Just curious, using osmfilter or osmium-tool is also ok.

There are [a number of crates related to the `.pbf` format](https://lib.rs/search?q=osm+OR+openstreetmap), and bindings for libosmium, but I haven't come across any `.o5m` ones that seem thoroughly developed.

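For reference, a minimal sketch of scanning a `.pbf` file for wikipedia/wikidata tags with one of those crates, `osmpbf` (the file name and tag handling here are illustrative, not code from this PR):

```rust
use osmpbf::{Element, ElementReader};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let reader = ElementReader::from_path("planet.osm.pbf")?;
    // Stream every element and print any wikipedia/wikidata tags.
    reader.for_each(|element| {
        let tags: Vec<(&str, &str)> = match &element {
            Element::Node(n) => n.tags().collect(),
            Element::DenseNode(n) => n.tags().collect(),
            Element::Way(w) => w.tags().collect(),
            Element::Relation(r) => r.tags().collect(),
        };
        for (key, value) in tags {
            if key == "wikipedia" || key == "wikidata" {
                println!("{key}={value}");
            }
        }
    })?;
    Ok(())
}
```
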
biodranik commented 2023-06-01 22:01:29 +00:00 (Migrated from github.com)

Can pbf be used to get fresh OpenStreetMap.org updates without converting it to o5m?

newsch commented 2023-06-02 13:20:01 +00:00 (Migrated from github.com)

The planet downloads are available in pbf: https://planet.openstreetmap.org/pbf/

I think this is saying you can apply the daily/hourly diffs to pbf, but I'm not sure I'm reading it correctly: https://wiki.openstreetmap.org/wiki/Planet.osm/diffs#Using_the_replication_diffs

This makes it seem like diffs can be appended at the end of the file, in which case wouldn't you need to read the entire file to get the true state of any node/object?
https://wiki.openstreetmap.org/wiki/PBF_Format#What_are_the_replication_fields_for?

I don't understand enough about these formats or how the generator works to know if they'd be compatible.

biodranik commented 2023-06-02 13:21:44 +00:00 (Migrated from github.com)

That's why we use o5m. It is suitable for incremental updates, and the osmupdate tool creates a new planet.o5m file automatically by applying these updates.

newsch commented 2023-06-02 13:30:21 +00:00 (Migrated from github.com)

It looks like osmupdate supports pbf for the same thing, but I don't understand how that works.

biodranik commented 2023-06-02 21:12:33 +00:00 (Migrated from github.com)

Tested, it works for pbf too. Cool! The whole planet can be stripped to 58gb:

`"$OSMUPDATE" -v --drop-authors --drop-version --hash-memory=16000 planet-230529.osm.pbf planet-updated.osm.pbf`

newsch commented 2023-06-05 21:13:33 +00:00 (Migrated from github.com)

Alright, here's what I've learned from comparing the html outputs:

The simplifying step of the python and rust versions gives comparable output (proper minification is still needed), but the html that the python version gets from the api is much simpler than I expected.

It uses [the Extracts API](https://www.mediawiki.org/wiki/Extension:TextExtracts), which strips most of the markup.

The html in the dumps, on the other hand, seems much closer to the content of a complete article. Size-wise, the dump html is around 10x the size of the extracts for the subset I looked at.

To get to parity with that output, we'll need to add additional steps to the html processing to remove:

  • mw-specific tags and attributes
  • links
  • images
  • [info boxes](https://en.wikipedia.org/wiki/Help:Infobox)

This is doable, but it will involve some more exploration of what exactly to match on.

I'll make issues for that and the minification.

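A rough sketch of what that extra stripping could look like, assuming the `scraper` crate; the selectors are placeholders pending that exploration:

```rust
use scraper::{Html, Selector};

/// Detach every element matching `selector` from the document.
fn remove_matching(document: &mut Html, selector: &Selector) {
    // Collect the node ids first; the tree can't be mutated while iterating it.
    let ids: Vec<_> = document.select(selector).map(|el| el.id()).collect();
    for id in ids {
        if let Some(mut node) = document.tree.get_mut(id) {
            node.detach();
        }
    }
}

fn main() {
    let mut document = Html::parse_document(
        r#"<p>Text <a href="./Other_Article">link</a> <img src="thumb.jpg"></p>
           <table class="infobox"><tr><td>stats</td></tr></table>"#,
    );
    // Placeholder selectors for links, images, infoboxes, and mw-specific attributes.
    // Note: detaching <a> drops the link text too; a real pass would likely
    // unwrap links and keep their text.
    for sel in ["a", "img", "table.infobox", "[data-mw]"] {
        let selector = Selector::parse(sel).expect("selector is valid");
        remove_matching(&mut document, &selector);
    }
    println!("{}", document.root_element().html());
}
```
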
biodranik commented 2023-06-06 00:23:28 +00:00 (Migrated from github.com)

Minification is not a blocker and can be done later, if necessary. It would be great to compare final mwm sizes; AFAIR, some compression is used there.

Showing images for those who want them, and maybe leaving links too, may be a good idea.

newsch commented 2023-06-06 13:36:33 +00:00 (Migrated from github.com)

> Minification is not a blocker and can be done later, if necessary.

👍

> It would be great to compare final mwm sizes; AFAIR, some compression is used there.

Good point. I've seen the compression [referenced in `text_storage.hpp`](https://github.com/organicmaps/organicmaps/blob/acc7c0547db4285dd8841ae7f98811268e38d908/coding/text_storage.hpp#L28).
I'll look into doing this locally.

> Showing images for those who want them, and maybe leaving links too, may be a good idea.

I know that's a goal, but right now the links/images are all relative urls so they'll still need some processing.

newsch commented 2023-06-07 22:14:11 +00:00 (Migrated from github.com)

Next steps are:

  • reviewing html output (#4)
  • connecting output to map generator (mainly handling wikipedia -> wikidata mapping)
  • defining some custom errors for better handling
  • initial profiling of pipeline (right now it can roughly keep up with tar on my machine, and that's with writing the output files to a ramdisk)

After that I think this will be ready to try out in production!

biodranik (Migrated from github.com) reviewed 2023-06-08 08:27:41 +00:00
biodranik (Migrated from github.com) commented 2023-06-08 08:27:41 +00:00

Is there a more robust way to exclude some sections for all languages?

newsch (Migrated from github.com) reviewed 2023-06-08 13:06:45 +00:00
newsch (Migrated from github.com) commented 2023-06-08 13:06:45 +00:00

Do you mean moving this to a configuration file, or something that works independent of the language?

Trying to parse the template text could be language independent, but I think that would be less robust than checking the header names.

We could collapse it into a single set and apply it to all languages.

biodranik (Migrated from github.com) reviewed 2023-06-08 13:35:24 +00:00
biodranik (Migrated from github.com) commented 2023-06-08 13:35:24 +00:00

No need to collapse, a config looks like a better option (keeping in mind adding many other languages, and potential contributors who don't read Rust code).

newsch (Migrated from github.com) reviewed 2023-06-08 13:51:06 +00:00
newsch (Migrated from github.com) commented 2023-06-08 13:51:06 +00:00

Do you want to load it at compile time or runtime?

newsch (Migrated from github.com) reviewed 2023-06-08 14:39:06 +00:00
newsch (Migrated from github.com) commented 2023-06-08 14:39:06 +00:00

I added a compile-time json config in 7e6b39a; adding a flag for loading one at runtime is straightforward if we want to do that later. Using a different config language is also a quick change.

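A minimal sketch of the compile-time approach, assuming `serde`/`serde_json`; the schema (per-language section titles to strip) and file path are illustrative:

```rust
use std::collections::{BTreeMap, BTreeSet};

use serde::Deserialize;

#[derive(Deserialize)]
struct Config {
    /// Language code -> section titles to strip, e.g. "en" -> {"References", ...}.
    sections_to_remove: BTreeMap<String, BTreeSet<String>>,
}

fn load_config() -> Config {
    // include_str! embeds the file's contents in the binary at compile time,
    // so a missing config is a build error and a malformed one fails at startup.
    serde_json::from_str(include_str!("../article_processing_config.json"))
        .expect("article_processing_config.json is valid json")
}
```
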
biodranik (Migrated from github.com) approved these changes 2023-06-22 21:53:09 +00:00
biodranik (Migrated from github.com) left a comment

LGTM with a few comments

@ -4,0 +4,4 @@
## Usage
[`article_processing_config.json`](article_processing_config.json) should be updated when adding a new language.
biodranik (Migrated from github.com) commented 2023-06-22 17:27:08 +00:00

... It defines the article's sections that are not important for users and should be removed.

biodranik (Migrated from github.com) commented 2023-06-22 17:28:31 +00:00

Does it make sense to sort sections by name?

biodranik (Migrated from github.com) commented 2023-06-22 17:30:41 +00:00
    // Remove sections.

nit: Normal sentences are more readable in many cases. Here and in other places.

biodranik (Migrated from github.com) commented 2023-06-22 17:34:03 +00:00

What's needed to get right answers to these TODOs?

biodranik (Migrated from github.com) commented 2023-06-22 17:36:08 +00:00

Should title be trimmed?

@ -0,0 +76,4 @@
if let Some(mut node) = document.tree.get_mut(id) {
node.detach();
}
}
biodranik (Migrated from github.com) commented 2023-06-22 21:36:12 +00:00

Can copy-paste be avoided?

@ -0,0 +82,4 @@
}
#[cfg(test)]
mod test {
biodranik (Migrated from github.com) commented 2023-06-22 21:37:15 +00:00

Is it hard to make a simple test for the function above?

biodranik (Migrated from github.com) commented 2023-06-22 21:45:20 +00:00
  1. Is it in English only now?
  2. Does it make sense to test this function?
@ -0,0 +97,4 @@
///
/// let with_q = WikidataQid::from_str("Q12345").unwrap();
/// let without_q = WikidataQid::from_str(" 12345 ").unwrap();
/// assert_eq!(with_q, without_q);
biodranik (Migrated from github.com) commented 2023-06-22 21:47:45 +00:00

Does it make sense to test it?

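For context, the behavior that doctest exercises amounts to something like this sketch (not the PR's exact implementation): trim whitespace, allow an optional leading `Q`, and parse the remaining digits.

```rust
use std::str::FromStr;

#[derive(Debug, PartialEq, Eq)]
struct WikidataQid(u64);

impl FromStr for WikidataQid {
    type Err = std::num::ParseIntError;

    fn from_str(s: &str) -> Result<Self, Self::Err> {
        // Accept both "Q12345" and bare " 12345 ".
        let s = s.trim();
        let digits = s.strip_prefix('Q').unwrap_or(s);
        digits.parse().map(WikidataQid)
    }
}
```
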
@ -0,0 +128,4 @@
///
/// let title = WikipediaTitleNorm::from_title("Article Title", "en").unwrap();
/// let url = WikipediaTitleNorm::from_url("https://en.wikipedia.org/wiki/Article_Title#Section").unwrap();
/// assert_eq!(url, title);
biodranik (Migrated from github.com) commented 2023-06-22 21:47:31 +00:00

Does it make sense to test it?

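Likewise, the equivalence this doctest asserts boils down to normalization along these lines (illustrative, not the PR's code): take the language from the subdomain, the title from the `/wiki/` path, drop any `#fragment`, and normalize spaces to underscores.

```rust
/// Parse "https://en.wikipedia.org/wiki/Article_Title#Section" into ("en", "Article_Title").
fn parse_wikipedia_url(url: &str) -> Option<(String, String)> {
    let rest = url.strip_prefix("https://")?;
    let (host, path) = rest.split_once('/')?;
    let lang = host.strip_suffix(".wikipedia.org")?;
    let title = path.strip_prefix("wiki/")?;
    let title = title.split('#').next()?;
    Some((lang.to_string(), title.replace(' ', "_")))
}
```
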
newsch commented 2023-06-22 22:35:52 +00:00 (Migrated from github.com)

Thanks, I'll address those comments, rebase into individual changes, and merge.

newsch (Migrated from github.com) reviewed 2023-06-22 23:33:17 +00:00
newsch (Migrated from github.com) commented 2023-06-22 23:33:17 +00:00

Oops, that comment slipped through when I handled multiple languages in 6e5385d. I'll remove it.

> 1. Is it in English only now?

Currently it will work with any language, but it only processes a single dump at a time. So when it reads an English dump, each article json has a `.in_language.identifier` field set to `en`, and the program writes that html to a `QXXXXX/en.html` file.

Running the program multiple times with different language dumps will fill in the various `QXXXXX/$lang.html` files.
We could extend it to process multiple dumps in parallel, but I don't expect there to be much of a speedup right now.
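
The resulting layout is one directory per QID with one file per language, along the lines of this sketch (names illustrative):

```rust
use std::path::PathBuf;

fn article_path(output_dir: &str, qid: &str, lang: &str) -> PathBuf {
    PathBuf::from(output_dir).join(qid).join(format!("{lang}.html"))
}

// article_path("output", "Q12345", "en") -> "output/Q12345/en.html"
```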

> 2. Does it make sense to test this function?

The doctest on the type definition verifies that the two constructors parse and normalize correctly, but doesn't check for various error cases.
Do you think there should be more?

biodranik (Migrated from github.com) reviewed 2023-06-23 06:08:13 +00:00
biodranik (Migrated from github.com) commented 2023-06-23 06:08:13 +00:00
  1. Does it imply a wrapping script that launches the app for each language/file? That is ok (and can be parallelized in bash by launching several processes simultaneously), but should be documented. Or is there a better approach?
  2. It may make sense to check values that are in OSM. Users can make a lot of mistakes.
newsch (Migrated from github.com) reviewed 2023-06-23 14:41:22 +00:00
newsch (Migrated from github.com) commented 2023-06-23 14:41:22 +00:00
> 1. Does it imply a wrapping script that launches the app for each language/file? That is ok (and can be parallelized in bash by launching several processes simultaneously), but should be documented. Or is there a better approach?

I haven't tried it out yet, but there are a couple of options that come to mind:

  • Decompress the archives serially so they're all concatenated together into stdin. This looks possible with `gunzip`/`tar`, not sure about Python `pgzip`.
  • As you say, run the program repeatedly through a wrapper script, using a for loop, `xargs`, `parallel`, etc.
  • Pass the decompression command to the program and have it spawn the subprocess directly; it could do this in parallel and pass the results to the same worker pool (see the sketch below).
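
A sketch of that third option, assuming a `tar xzOf` invocation to stream the decompressed dump (the command and file name are illustrative):

```rust
use std::io::{BufRead, BufReader};
use std::process::{Command, Stdio};

fn main() -> std::io::Result<()> {
    // Spawn the decompressor and read its stdout instead of a file.
    let mut child = Command::new("tar")
        .args(["xzOf", "enwiki-NS0-ENTERPRISE-HTML.json.tar.gz"])
        .stdout(Stdio::piped())
        .spawn()?;

    let reader = BufReader::new(child.stdout.take().expect("stdout is piped"));
    for line in reader.lines() {
        let _article_json = line?;
        // ... hand each json line to the worker pool ...
    }
    child.wait()?;
    Ok(())
}
```
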
> 2. It may make sense to check values that are in OSM. Users can make a lot of mistakes.

Understood. I have been using the world list of urls/ids from the map generator with no problems, but if we switch to using OSM data directly I'll rethink this. The program will log any issues it has parsing titles/QIDs.

newsch (Migrated from github.com) reviewed 2023-06-23 15:00:08 +00:00
@ -0,0 +82,4 @@
}
#[cfg(test)]
mod test {
newsch (Migrated from github.com) commented 2023-06-23 15:00:08 +00:00

No, I just didn't bother since I'll be changing this in the next PR. I can add some if you'd like.

newsch (Migrated from github.com) reviewed 2023-06-23 15:01:22 +00:00
@ -0,0 +76,4 @@
if let Some(mut node) = document.tree.get_mut(id) {
node.detach();
}
}
newsch (Migrated from github.com) commented 2023-06-23 15:01:22 +00:00

I'll be improving and refactoring this in the next PR; if this bit survives I'll move it to a function.

newsch (Migrated from github.com) reviewed 2023-06-23 16:08:14 +00:00
@ -0,0 +128,4 @@
///
/// let title = WikipediaTitleNorm::from_title("Article Title", "en").unwrap();
/// let url = WikipediaTitleNorm::from_url("https://en.wikipedia.org/wiki/Article_Title#Section").unwrap();
/// assert_eq!(url, title);
newsch (Migrated from github.com) commented 2023-06-23 16:08:14 +00:00

I added some checks for whitespace, empty strings, and tests for errors in 70f7edf, is there something else you think should be handled?

newsch (Migrated from github.com) reviewed 2023-06-23 16:20:04 +00:00
newsch (Migrated from github.com) commented 2023-06-23 16:20:03 +00:00

I need to look through a good sample of the articles and check for things like formatting in headers, tags that are used, and which wikimedia meta tags are responsible for a third of the document size.
That's part of the next work.

newsch (Migrated from github.com) reviewed 2023-06-23 16:20:34 +00:00
@ -0,0 +97,4 @@
///
/// let with_q = WikidataQid::from_str("Q12345").unwrap();
/// let without_q = WikidataQid::from_str(" 12345 ").unwrap();
/// assert_eq!(with_q, without_q);
newsch (Migrated from github.com) commented 2023-06-23 16:20:34 +00:00

Same as [above](https://github.com/organicmaps/wikiparser/pull/3#discussion_r1239061793).

biodranik (Migrated from github.com) reviewed 2023-06-23 17:16:40 +00:00
biodranik (Migrated from github.com) commented 2023-06-23 17:16:40 +00:00

It should be ok to run tasks in parallel using a bash for loop. We'll rethink it if necessary later.

biodranik (Migrated from github.com) approved these changes 2023-06-23 17:17:03 +00:00