Generator directory format #6
I decided to break up the next steps into smaller PRs compared to the last one.
This PR updates the program to create the folder structure that the map generator expects.
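For example (a sketch with placeholder names; where the wikidata directory sits and what the symlinks point at are illustrative, see the Berlin example in the discussion below):

```
descriptions/
├── wikidata/
│   └── QXXXXXX/
│       └── en.html
└── en.wikipedia.org/
    └── wiki/
        └── Article_Title/
            └── en.html -> ../../../wikidata/QXXXXXX/en.html
```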
While the old description scraper would write duplicates for the same article's title and qid, this implementation writes symlinks in the wikipedia tree that point to the wikidata files.
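As a minimal sketch of that mechanism (not the PR's actual code; the function and the example paths are made up for illustration):

```rust
use std::os::unix::fs::symlink;
use std::path::Path;

/// Write each article once under its wikidata QID directory, then link the
/// wikipedia-url path to that file instead of writing a duplicate copy.
fn link_wikipedia_to_wikidata(
    wikidata_file: &Path,  // e.g. descriptions/wikidata/Q64/en.html
    wikipedia_file: &Path, // e.g. descriptions/en.wikipedia.org/wiki/Berlin/en.html
) -> std::io::Result<()> {
    // In practice the target would likely be made relative to the link's
    // directory; that bookkeeping is omitted in this sketch.
    symlink(wikidata_file, wikipedia_file)
}
```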
I know I can change what the generator looks for, but I figured it would be easier to have this working and then change them together instead of debugging both at the same time while neither works.
The goal is that with this PR, the parser will be a drop-in replacement for the current scraper, even if the speed and html size are not what we'd like.
Remaining work for this PR:

- (e.g. timestamps): moved to #9

Good approach :)
@ -40,0 +30,4 @@
page: &Page,
redirects: impl IntoIterator<Item = WikipediaTitleNorm>,
) -> anyhow::Result<PathBuf> {
let base = base.as_ref();
How are they processed in the generator?
For example?
Lang is used twice here in the path, but only one file is ever stored in the directory, right?
Can `/` be percent-escaped in such cases? How does the generator handle it now?
Is more than one slash in the title possible?
@ -40,0 +30,4 @@
page: &Page,
redirects: impl IntoIterator<Item = WikipediaTitleNorm>,
) -> anyhow::Result<PathBuf> {
let base = base.as_ref();
The generator only works with complete wikipedia urls (and wikidata QIDs). The OSM tags like `en:Article_Title` are converted to urls somewhere early in the OSM ingestion process. It dumps the urls to a file for the descriptions scraper, then when it adds them to the mwm files it strips the protocol, appends the url to the base directory, and looks for language html files in the folder at that location.
It doesn't do any special processing for articles with a slash in the title; they are just another subdirectory down. I'll update the diagram to show that.
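For illustration, a minimal sketch of that lookup (not the generator's or this PR's actual code; the function name is made up):

```rust
use std::path::{Path, PathBuf};

/// Strip the protocol from a wikipedia url and append the rest to the base
/// directory; the generator then looks for `<lang>.html` files inside it,
/// e.g. descriptions/en.wikipedia.org/wiki/Berlin/en.html.
fn article_dir(base: &Path, url: &str) -> PathBuf {
    let without_protocol = url
        .trim_start_matches("https://")
        .trim_start_matches("http://");
    // A slash in the title (e.g. "KXTV/KOVR/KCRA_Tower") just becomes
    // further nested subdirectories.
    base.join(without_protocol)
}
```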
@ -40,0 +30,4 @@
page: &Page,
redirects: impl IntoIterator<Item = WikipediaTitleNorm>,
) -> anyhow::Result<PathBuf> {
let base = base.as_ref();
The "incorrect links, directories" refers to updating a directory tree from a previous run, instead of starting from scratch. Right now the behavior is to skip any file that exists.
The behavior that the generator/scraper expects is to write all available translations in each directory.
So for the article for Berlin, if there are OSM tags for `wikipedia:en=Berlin`, `wikipedia:de=Berlin`, `wikipedia:fr=Berlin`, and `wikidata=Q64`, and the generator keeps them all, then there will be four folders with duplicates of all language copies.
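Roughly like this (a sketch of the omitted listing; I'm using `wikidata/Q64/` as a stand-in for wherever the generator looks up QIDs):

```
descriptions/
├── en.wikipedia.org/wiki/Berlin/
│   ├── en.html
│   ├── de.html
│   └── fr.html
├── de.wikipedia.org/wiki/Berlin/
│   ├── en.html
│   ├── de.html
│   └── fr.html
├── fr.wikipedia.org/wiki/Berlin/
│   ├── en.html
│   ├── de.html
│   └── fr.html
└── wikidata/Q64/
    ├── en.html
    ├── de.html
    └── fr.html
```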
Now, I don't understand exactly how the generator picks which tags to use yet, but just from looking at the Canada Yukon region map there are duplicated copies of wikipedia items there.
For this program, we only see one language at a time, so we write that copy to the master wikidata directory. When later we get the same article in a different language, we write it to the same wikidata directory.
Once all the languages have been processed, it would look like:
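(Sketch of the omitted listing, with the same illustrative paths as above; the relative symlink targets are just for illustration.)

```
descriptions/
├── wikidata/Q64/
│   ├── en.html
│   ├── de.html
│   └── fr.html
├── en.wikipedia.org/wiki/Berlin/
│   ├── en.html -> ../../../wikidata/Q64/en.html
│   ├── de.html -> ../../../wikidata/Q64/de.html
│   └── fr.html -> ../../../wikidata/Q64/fr.html
├── de.wikipedia.org/wiki/Berlin/   (same symlinks)
└── fr.wikipedia.org/wiki/Berlin/   (same symlinks)
```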
I guess it could be, I haven't looked for that. Wikipedia works with either.
See below for more details, but the generator should decode those before dumping the urls.
It looks like a handful of encoded titles still slip through, but none with `%2F` (= `/`). I made an issue with some notes about this in #7.
From my read of when it first adds a wikipedia tag and later writes it as a url:

- If the tag isn't already in `lang:Article Title` format, take what's after `.wikipedia.org/wiki/`, url decode it, replace underscores with spaces, then concat that with the lang at the beginning of the url and store it.
- When writing it back out as a url, special characters are encoded as `%`s and the result is added to the end of `https://lang.wikipedia.org/wiki/`.

Glancing at the url decoding, I don't think there's anything wrong with it - it should handle arbitrary characters, although neither the encoding nor the decoding looks unicode-aware.
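As a rough illustration of the first step (a sketch, not the generator's code; percent-decoding is skipped for brevity):

```rust
/// Split "https://en.wikipedia.org/wiki/Article_Title" into ("en", "Article Title").
/// The generator also url-decodes the title at this point, as described above;
/// that step is omitted here.
fn parse_wikipedia_url(url: &str) -> Option<(&str, String)> {
    let rest = url
        .strip_prefix("https://")
        .or_else(|| url.strip_prefix("http://"))?;
    let (host, path) = rest.split_once('/')?;
    let lang = host.strip_suffix(".wikipedia.org")?;
    let title = path.strip_prefix("wiki/")?;
    Some((lang, title.replace('_', " ")))
}
```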
Yes, there are a handful, for example https://en.wikipedia.org/wiki/KXTV/KOVR/KCRA_Tower.
There are 39 present in the generator urls.
@ -40,0 +30,4 @@
page: &Page,
redirects: impl IntoIterator<Item = WikipediaTitleNorm>,
) -> anyhow::Result<PathBuf> {
let base = base.as_ref();
Good. Please don't forget that if something can be simplified or improved by changing the current generator approach, then it makes sense to do it.
@ -40,0 +30,4 @@
page: &Page,
redirects: impl IntoIterator<Item = WikipediaTitleNorm>,
) -> anyhow::Result<PathBuf> {
let base = base.as_ref();
I'm working on a list of changes that would be helpful.
I ran them with all languages on my machine. I only have 4 cores, so more than two instances didn't show much of an improvement.
I didn't run into any errors, but there is a race condition between checking if the folder for a QID exists and creating it.
If we decide to do parallelism by running multiple instances, that should be handled. But I think we will be better off running multiple decompression threads internally.
Speaking of which, after investigating pgzip further, my understanding is that it can only parallelize decompression of files that it compressed in a specific way. I'll make another issue for investigating other gunzip implementations.
Parallelism is the next step; it can be done using existing tools. Let's lower its priority.
Why is there a race condition with QID? Aren't they created from a separate pass over the OSM dump?
When running multiple instances in parallel, they could process different translations of an article at the same time, and interleave between checking that the QID folder doesn't exist and creating it.
The same thing could hypothetically happen with article title folders, but since each dump is in a different language it shouldn't occur.
It is probably unlikely to occur, and it won't take down the entire program.
I can add special handling for the error to mitigate it.
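For example, something along these lines (a sketch, not the current code):

```rust
use std::{fs, io::ErrorKind, path::Path};

/// Treat "already exists" as success, so two instances racing to create the
/// same QID directory don't fail.
fn ensure_dir(path: &Path) -> std::io::Result<()> {
    match fs::create_dir(path) {
        Err(e) if e.kind() == ErrorKind::AlreadyExists => Ok(()),
        other => other,
    }
}
```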
Aren't file system operations atomic? Adding a handler for the case "tried to create it but it was already created by another process" is a good idea.
Yes, individual syscalls should be atomic, but I don't think there are any guarantees between the call to `path.is_dir()` and `fs::create_dir(&path)`. It looks like `create_dir_all` explicitly handles this though, by checking if the directory exists after getting an error, so it should not be a problem after all.

@ -2,7 +2,45 @@
_Extracts articles from [Wikipedia database dumps](https://en.wikipedia.org/wiki/Wikipedia:Database_download) for embedding into the `mwm` map files created by [the Organic Maps generator](https://github.com/organicmaps/organicmaps/blob/master/tools/python/maps_generator/README.md)._
List? Delimited by what? Any example? Is specifying a directory with dumps better?
Why should it be at PATH? Can it be run from any directory?
Is extracting ids directly from the osm pbf planet dump better than relying on the intermediate generator files? What are pros and cons?
nit: Start sentences with a capital letter and end them with a dot.
@ -9,0 +42,4 @@
--wikidata-ids wikidata_ids.txt \
--wikipedia-urls wikipedia_urls.txt \
descriptions/
done
Would a hint about om-wikiparser command line options be helpful?
Print file name too?
In which cases can dates be different?
How will it work now?
@ -2,7 +2,45 @@
_Extracts articles from [Wikipedia database dumps](https://en.wikipedia.org/wiki/Wikipedia:Database_download) for embedding into the `mwm` map files created by [the Organic Maps generator](https://github.com/organicmaps/organicmaps/blob/master/tools/python/maps_generator/README.md)._
I meant a shell list/array(?), separated by spaces.
One example is a glob, so using a directory and then referencing `$WIKIPEDIA_DUMP_DIRECTORY/*.json.tar.gz` might be clearer?

@ -2,7 +2,45 @@
_Extracts articles from [Wikipedia database dumps](https://en.wikipedia.org/wiki/Wikipedia:Database_download) for embedding into the `mwm` map files created by [the Organic Maps generator](https://github.com/organicmaps/organicmaps/blob/master/tools/python/maps_generator/README.md)._
It doesn't need to be; the example script read more clearly to me if it's in the context of the `intermediate_data` directory. It could also be run as `../../../wikiparser/target/release/om-wikiparser`, with `cargo run --release` from the wikiparser directory, or anything else.

@ -2,7 +2,45 @@
_Extracts articles from [Wikipedia database dumps](https://en.wikipedia.org/wiki/Wikipedia:Database_download) for embedding into the `mwm` map files created by [the Organic Maps generator](https://github.com/organicmaps/organicmaps/blob/master/tools/python/maps_generator/README.md)._
Pros:
Cons:
When I did this earlier, it was with the `osmfilter` tool, I only tested it on the Yukon region, and it output more entries than the generator did.

I can create an issue for this, but the rough steps to get that working are:

- Convert the `osmfilter` query to an `osmium` command so it can work on `pbf` files directly.
- Define the `osmium` output format for `wikiparser` to use.

@ -2,7 +2,45 @@
_Extracts articles from [Wikipedia database dumps](https://en.wikipedia.org/wiki/Wikipedia:Database_download) for embedding into the `mwm` map files created by [the Organic Maps generator](https://github.com/organicmaps/organicmaps/blob/master/tools/python/maps_generator/README.md)._
Sorry, old habits die hard.
That's an old TODO, I'll remove it. It returns any parse errors it encounters with the title and redirects.
The debug line above does that.
That's referring to #9, but I should remove that line now that it is designed to overwrite the directories from a previous run.
@ -2,7 +2,45 @@
_Extracts articles from [Wikipedia database dumps](https://en.wikipedia.org/wiki/Wikipedia:Database_download) for embedding into the `mwm` map files created by [the Organic Maps generator](https://github.com/organicmaps/organicmaps/blob/master/tools/python/maps_generator/README.md)._
It's better to mention list item separators explicitly and provide some example for clarity.
@ -2,7 +2,45 @@
_Extracts articles from [Wikipedia database dumps](https://en.wikipedia.org/wiki/Wikipedia:Database_download) for embedding into the `mwm` map files created by [the Organic Maps generator](https://github.com/organicmaps/organicmaps/blob/master/tools/python/maps_generator/README.md)._
...then why suggest installing the tool at PATH?
@ -2,7 +2,45 @@
_Extracts articles from [Wikipedia database dumps](https://en.wikipedia.org/wiki/Wikipedia:Database_download) for embedding into the `mwm` map files created by [the Organic Maps generator](https://github.com/organicmaps/organicmaps/blob/master/tools/python/maps_generator/README.md)._
It would be great to have a well-defined and independent API between the generator and wikiparser, to avoid complications when supporting it in the longer term. WDYT?
@ -2,7 +2,45 @@
_Extracts articles from [Wikipedia database dumps](https://en.wikipedia.org/wiki/Wikipedia:Database_download) for embedding into the `mwm` map files created by [the Organic Maps generator](https://github.com/organicmaps/organicmaps/blob/master/tools/python/maps_generator/README.md)._
So that you can always reference it as `om-wikiparser` wherever you are, without worrying about where it is relative to you, or copying it into your working directory.

I meant this as an explanation of how to use it, not a step-by-step for what to run on the build server filesystem.
Maybe writing a shell script to use on the maps server instead would be helpful?
Would you prefer:
or
@ -2,7 +2,45 @@
_Extracts articles from [Wikipedia database dumps](https://en.wikipedia.org/wiki/Wikipedia:Database_download) for embedding into the `mwm` map files created by [the Organic Maps generator](https://github.com/organicmaps/organicmaps/blob/master/tools/python/maps_generator/README.md)._
`cargo run -r` may be even better instead of a path to the binary :) But it's also ok to hard-code the path or use a `$WIKIPARSER_BINARY` var.

Think about me testing your code soon on a production server. Less surprises = less stress ;-)
@ -2,7 +2,45 @@
_Extracts articles from [Wikipedia database dumps](https://en.wikipedia.org/wiki/Wikipedia:Database_download) for embedding into the `mwm` map files created by [the Organic Maps generator](https://github.com/organicmaps/organicmaps/blob/master/tools/python/maps_generator/README.md)._
Btw, it may make sense to also print/measure time taken to execute some commands after the first run on the whole planet, to have some reference starting values.
@ -2,7 +2,45 @@
_Extracts articles from [Wikipedia database dumps](https://en.wikipedia.org/wiki/Wikipedia:Database_download) for embedding into the `mwm` map files created by [the Organic Maps generator](https://github.com/organicmaps/organicmaps/blob/master/tools/python/maps_generator/README.md)._
Absolutely agree!
I think so; do you mean the wikipedia/wikidata files or the mwm format in general?
By transformations, when I looked at it last, it looked like it was doing some sort of merging of ways/nodes/shapes to get a single parent object.
When I compared the OSM IDs that it output with the Wikidata ids, it didn't match up with what I got from `osmfilter`, even if the urls were the same. Not a problem for the wikiparser, as long as the QIDs/articles are all caught, but it was harder to tell if they were doing the same thing.

As we talked about before, there are also multiple layers of filtering nodes by amenity or other tags, and I only looked at the final layer when I was trying this `osmfilter` approach (based on `ftypes_matcher.cpp`).

As you say, the worst case isn't a problem for the end user, but I want to do more comparisons with the whole planet file to be confident that this is really a superset of them.
That was around 25%, but in the Yukon territory so not very many nodes and I would guess not comparable to the planet.
I haven't looked into `osmium` much, but my understanding is it is at least as powerful as `osmfilter`/`osmconvert`. I know we talked about using pbfs directly at some point, so that's why I mentioned it.

@ -2,7 +2,45 @@
_Extracts articles from [Wikipedia database dumps](https://en.wikipedia.org/wiki/Wikipedia:Database_download) for embedding into the `mwm` map files created by [the Organic Maps generator](https://github.com/organicmaps/organicmaps/blob/master/tools/python/maps_generator/README.md)._
I will update the README to be more of an explanation, and make another issue/PR for a script that handles the build directory, timing, backtraces, saving logs, etc.
@ -2,7 +2,45 @@
_Extracts articles from [Wikipedia database dumps](https://en.wikipedia.org/wiki/Wikipedia:Database_download) for embedding into the `mwm` map files created by [the Organic Maps generator](https://github.com/organicmaps/organicmaps/blob/master/tools/python/maps_generator/README.md)._
I meant those files that are required for wikiparser to work. It actually may make sense to keep it in README or some other doc, not in an issue.