Generator directory format #6

Merged
newsch merged 1 commit from generator-compat into main 2023-07-10 14:34:21 +00:00
newsch commented 2023-06-23 22:00:14 +00:00 (Migrated from github.com)

I decided to break up the next steps into smaller PRs compared to the last one.

This PR updates the program to create the folder structure that the map generator expects, e.g.:

```
.
├── de.wikipedia.org
│  └── wiki
│     ├── Coal_River_Springs_Territorial_Park
│     │  ├── de.html
│     │  └── ru.html
│     ├── Ni'iinlii_Njik_(Fishing_Branch)_Territorial_Park
│     │  ├── de.html
│     │  └── en.html
│    ...
├── en.wikipedia.org
│  └── wiki
│     ├── Arctic_National_Wildlife_Refuge
│     │  ├── de.html
│     │  ├── en.html
│     │  ├── es.html
│     │  ├── fr.html
│     │  └── ru.html
│     │
│     │ **NOTE: Article titles with a `/` are not escaped, so "Baltimore/Washington_International_Airport" becomes two subfolders as below.**
│     │
│     ├── Baltimore
│     │  └── Washington_International_Airport
│     │     ├── de.html
│     │     ├── en.html
│     │     ├── es.html
│     │     ├── fr.html
│     │     └── ru.html
│    ...
└── wikidata
   ├── Q59320
   │  ├── de.html
   │  ├── en.html
   │  ├── es.html
   │  ├── fr.html
   │  └── ru.html
   ├── Q120306
   │  ├── de.html
   │  ├── en.html
   │  ├── es.html
   │  ├── fr.html
   │  └── ru.html
  ...
```

While the old description scraper wrote duplicate copies under both an article's title and its QID, this implementation writes symlinks in the wikipedia tree that point to the wikidata files.
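
Roughly, the idea is something like the following (a simplified sketch, not the exact code in this PR; the function and parameter names are illustrative):

```rust
use std::fs;
use std::path::Path;

/// Simplified sketch: write the canonical copy under wikidata/<QID>/ and
/// point the wikipedia-title directory at it with a symlink.
/// All names (`write_article`, `host`, `qid`, ...) are illustrative only.
fn write_article(
    base: &Path,
    host: &str,  // e.g. "en.wikipedia.org"
    title: &str, // e.g. "Coal_River_Springs_Territorial_Park"
    qid: &str,   // e.g. "Q59320"
    lang: &str,  // e.g. "en"
    html: &str,
) -> std::io::Result<()> {
    // Canonical copy: wikidata/<QID>/<lang>.html
    let qid_dir = base.join("wikidata").join(qid);
    fs::create_dir_all(&qid_dir)?;
    fs::write(qid_dir.join(format!("{lang}.html")), html)?;

    // Title directory: <host>/wiki/<title>. A '/' in the title simply
    // produces nested directories, as in the tree above.
    let title_dir = base.join(host).join("wiki").join(title);
    if let Some(parent) = title_dir.parent() {
        fs::create_dir_all(parent)?;
    }
    // Symlink instead of duplicating the html files (Unix-only here).
    if !title_dir.exists() {
        std::os::unix::fs::symlink(&qid_dir, &title_dir)?;
    }
    Ok(())
}
```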

I know I can change what the generator looks for, but I figured it would be easier to have this working and then change them together instead of debugging both at the same time while neither works.

The goal is that with this PR, the parser will be a drop-in replacement for the current scraper, even if the speed and HTML size are not what we'd like.

Remaining work for this PR:

  • handle articles without QIDs (yes, they exist! 🤷)
  • only write symlinks for requested redirects
  • handle updating existing files (timestamps moved to #9)
  • do a test run with the generator and multiple languages
  • add documentation for running with multiple languages
biodranik (Migrated from github.com) reviewed 2023-06-24 05:50:00 +00:00
biodranik (Migrated from github.com) left a comment

Good approach )

@ -40,0 +30,4 @@
page: &Page,
redirects: impl IntoIterator<Item = WikipediaTitleNorm>,
) -> anyhow::Result<PathBuf> {
let base = base.as_ref();
biodranik (Migrated from github.com) commented 2023-06-24 05:45:00 +00:00

How are they processed in the generator?

biodranik (Migrated from github.com) commented 2023-06-24 05:45:09 +00:00

For example?

biodranik (Migrated from github.com) commented 2023-06-24 05:47:09 +00:00

Lang is used twice here in the path, but only one file is always stored in the directory, right?

biodranik (Migrated from github.com) commented 2023-06-24 05:48:53 +00:00

Can `/` be percent-escaped in such cases? How does the generator handle it now?

biodranik (Migrated from github.com) commented 2023-06-24 05:49:27 +00:00

Is more than one slash in the title possible?

newsch (Migrated from github.com) reviewed 2023-06-24 17:06:16 +00:00
@ -40,0 +30,4 @@
page: &Page,
redirects: impl IntoIterator<Item = WikipediaTitleNorm>,
) -> anyhow::Result<PathBuf> {
let base = base.as_ref();
newsch (Migrated from github.com) commented 2023-06-24 17:06:15 +00:00

The generator only works with complete wikipedia urls (and wikidata QIDs). The OSM tags like `en:Article_Title` are converted to urls somewhere early in the OSM ingestion process.
It [dumps the urls to a file](https://github.com/organicmaps/organicmaps/blob/acc7c0547db4285dd8841ae7f98811268e38d908/generator/wiki_url_dumper.cpp#L63) for the descriptions scraper, then when it adds them to the mwm files it [strips the protocol, appends the url to the base directory, and looks for language html files in the folder at that location](https://github.com/organicmaps/organicmaps/blob/34bbdf6a2f077b3d629b3f17e8e05bd18a4e4110/generator/descriptions_section_builder.cpp#L142).
It doesn't do any special processing for articles with a slash in the title; they are just another subdirectory down. I'll update the diagram to show that.
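
For illustration only (this is not the generator's C++ code, just the lookup it performs sketched in Rust):

```rust
use std::path::{Path, PathBuf};

// Illustrative sketch of the lookup described above: strip the protocol from
// the stored url, append the rest to the base descriptions directory, and
// look for a per-language html file in that folder.
fn description_path(base: &Path, url: &str, lang: &str) -> PathBuf {
    let without_protocol = url
        .trim_start_matches("https://")
        .trim_start_matches("http://");
    base.join(without_protocol).join(format!("{lang}.html"))
}

// description_path(Path::new("descriptions"), "https://en.wikipedia.org/wiki/Berlin", "de")
//   -> "descriptions/en.wikipedia.org/wiki/Berlin/de.html"
```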

newsch (Migrated from github.com) reviewed 2023-06-24 17:15:02 +00:00
@ -40,0 +30,4 @@
page: &Page,
redirects: impl IntoIterator<Item = WikipediaTitleNorm>,
) -> anyhow::Result<PathBuf> {
let base = base.as_ref();
newsch (Migrated from github.com) commented 2023-06-24 17:15:01 +00:00

The "incorrect links, directories" refers to updating a directory tree from a previous run, instead of starting from scratch. Right now the behavior is to skip any file that exists.

The "incorrect links, directories" refers to updating a directory tree from a previous run, instead of starting from scratch. Right now the behavior is to skip any file that exists.
newsch (Migrated from github.com) reviewed 2023-06-24 17:28:10 +00:00
newsch (Migrated from github.com) commented 2023-06-24 17:28:10 +00:00

The behavior that the generator/scraper expects is to write all available translations in each directory.
So for the article for [Berlin](https://en.wikipedia.org/wiki/Berlin), if there are OSM tags for `wikipedia:en=Berlin`, `wikipedia:de=Berlin`, `wikipedia:fr=Berlin` and `wikidata=Q64`, and the generator keeps them all, then there will be four folders with duplicates of all language copies:

```
en.wikipedia.org/wiki/Berlin/{en.html, de.html, fr.html, ...}
de.wikipedia.org/wiki/Berlin/{en.html, de.html, fr.html, ...}
fr.wikipedia.org/wiki/Berlin/{en.html, de.html, fr.html, ...}
wikidata/Q64/{en.html, de.html, fr.html, ...}
```

Now, I don't understand exactly how the generator picks which tags to use yet, but just from looking at the Canada Yukon region map there are duplicated copies of wikipedia items there.

For this program, we only see one language at a time, so we write that copy to the master wikidata directory. When later we get the same article in a different language, we write it to the same wikidata directory.

Once all the languages have been processed, it would look like:

```
en.wikipedia.org/wiki/Berlin/ -> wikidata/Q64/
de.wikipedia.org/wiki/Berlin/ -> wikidata/Q64/
fr.wikipedia.org/wiki/Berlin/ -> wikidata/Q64/
wikidata/Q64/{en.html, de.html, fr.html, ...}
```
newsch (Migrated from github.com) reviewed 2023-06-24 19:47:22 +00:00
newsch (Migrated from github.com) commented 2023-06-24 19:47:21 +00:00

I guess it could be, I haven't looked for that. Wikipedia works with either.

See below for more details, but the generator should decode those before dumping the urls.

It looks like a handful of encoded titles still slip through, but none with `%2F` (`/`).
I made an issue with some notes about this in #7.

From my read of when it [first adds a wikipedia tag](https://github.com/organicmaps/organicmaps/blob/34bbdf6a2f077b3d629b3f17e8e05bd18a4e4110/generator/osm2meta.cpp#L241) and later [writes it as a url](https://github.com/organicmaps/organicmaps/blob/34bbdf6a2f077b3d629b3f17e8e05bd18a4e4110/indexer/feature_meta.cpp#L19):

  1. If the tag looks like a url instead of the expected `lang:Article Title` format, take what's after `.wikipedia.org/wiki/`, url decode it, replace underscores with spaces, then concat that with the lang at the beginning of the url and store it.
  2. Otherwise attempt to check if it's a url, replace underscores with spaces, and store it.
  3. To transform it back into a url, replace spaces with underscores in the title, escape any `%`s, and add it to the end of `https://lang.wikipedia.org/wiki/`.

Glancing at the [url decoding](https://github.com/organicmaps/organicmaps/blob/34bbdf6a2f077b3d629b3f17e8e05bd18a4e4110/coding/url.cpp#L118), I don't think there's anything wrong with it - it should handle arbitrary characters, although neither the encoding nor the decoding looks unicode-aware.
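
As a rough illustration of step 3 (not the generator's code; the parsing here is simplified and the names are made up):

```rust
// Illustrative sketch of step 3: turn a stored "lang:Article Title" value
// back into a url. The real generator code is C++ and handles more cases.
fn title_to_url(stored: &str) -> Option<String> {
    let (lang, title) = stored.split_once(':')?;
    // Replace spaces with underscores, then escape any %s.
    let title = title.replace(' ', "_").replace('%', "%25");
    Some(format!("https://{lang}.wikipedia.org/wiki/{title}"))
}

// title_to_url("en:Coal River Springs Territorial Park")
//   -> Some("https://en.wikipedia.org/wiki/Coal_River_Springs_Territorial_Park".to_string())
```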

newsch (Migrated from github.com) reviewed 2023-06-24 19:53:50 +00:00
newsch (Migrated from github.com) commented 2023-06-24 19:53:50 +00:00

Yes, there are a handful, for example https://en.wikipedia.org/wiki/KXTV/KOVR/KCRA_Tower.

There are 39 present in the generator urls:

```
$ grep -E '^https://\w+\.wikipedia\.org/wiki/.+/.+/' /tmp/wikipedia_urls.txt | sort | uniq
https://de.wikipedia.org/wiki/Darum/Gretesch/Lüstringen
https://de.wikipedia.org/wiki/Kienhorst/Köllnseen/Eichheide
https://de.wikipedia.org/wiki/Liste_der_Baudenkmäler_in_Erlangen/A#Altstädter_Friedhof_2/3,_Altstädter_Friedhof
https://de.wikipedia.org/wiki/Liste_der_Baudenkmäler_in_Neuss_(1/001-1/099)
https://de.wikipedia.org/wiki/Liste_der_Baudenkmäler_in_Neuss_(1/001–1/099)
https://de.wikipedia.org/wiki/Liste_der_Baudenkmäler_in_Neuss_(1/001–1/099)#Evang._Christuskirche
https://de.wikipedia.org/wiki/Liste_der_Baudenkmäler_in_Neuss_(1/100–1/199)
https://de.wikipedia.org/wiki/Liste_der_Baudenkmäler_in_Neuss_(1/200–1/299)
https://de.wikipedia.org/wiki/Liste_der_Baudenkmäler_in_Neuss_(1/300–1/399)
https://de.wikipedia.org/wiki/Liste_der_Baudenkmäler_in_Neuss_(1/400–1/499)
https://de.wikipedia.org/wiki/Liste_der_Baudenkmäler_in_Neuss_(1/500–1/580)
https://de.wikipedia.org/wiki/Liste_der_Baudenkmäler_in_Neuss_(1/500–1/580)#Schulgeb.C3.A4ude
https://de.wikipedia.org/wiki/Rhumeaue/Ellerniederung/Gillersheimer_Bachtal
https://de.wikipedia.org/wiki/Speck_/_Wehl_/_Helpenstein
https://de.wikipedia.org/wiki/Veldrom/Feldrom/Kempen
https://de.wikipedia.org/wiki/VHS_Witten/Wetter/Herdecke
https://de.wikipedia.org/wiki/Wikipedia:WikiProjekt_Österreich/JE/Bach
https://de.wikipedia.org/wiki/Wikipedia:WikiProjekt_Österreich/JE/Judenberg
https://de.wikipedia.org/wiki/Wikipedia:WikiProjekt_Österreich/JE/Kramerberg
https://de.wikipedia.org/wiki/Wikipedia:WikiProjekt_Österreich/JE/Loasleiten
https://de.wikipedia.org/wiki/Wikipedia:WikiProjekt_Österreich/JE/Pelzereck
https://de.wikipedia.org/wiki/Wikipedia:WikiProjekt_Österreich/JE/Theresienberg
https://de.wikipedia.org/wiki/Wohnanlage_Arzbacher_Straße/Thalkirchner_Straße/Wackersberger_Straße/Würzstraße
https://en.wikipedia.org/wiki/Abura/Asebu/Kwamankese_District
https://en.wikipedia.org/wiki/Ajumako/Enyan/Essiam_District
https://en.wikipedia.org/wiki/Bibiani/Anhwiaso/Bekwai_Municipal_District
https://en.wikipedia.org/wiki/Clapp/Langley/Crawford_Complex
https://en.wikipedia.org/wiki/KXTV/KOVR/KCRA_Tower
https://en.wikipedia.org/wiki/SAIT/AUArts/Jubilee_station
https://en.wikipedia.org/wiki/Santa_Cruz/Graciosa_Bay/Luova_Airport
https://fr.wikipedia.org/wiki/Landunvez#/media/Fichier:10_Samson_C.jpg
https://gl.wikipedia.org/wiki/Moaña#/media/Ficheiro:Plano_de_Moaña.png
https://it.wikipedia.org/wiki/Tswagare/Lothoje/Lokalana
https://lb.wikipedia.org/wiki/Lëscht_vun_den_nationale_Monumenter_an_der_Gemeng_Betzder#/media/Fichier:Roodt-sur-Syre,_14_rue_d'Olingen.jpg
https://pt.wikipedia.org/wiki/Wikipédia:Wikipédia_na_Universidade/Cursos/Rurtugal/Gontães
https://ru.wikipedia.org/wiki/Алажиде#/maplink/0
https://uk.wikipedia.org/wiki/Вікіпедія:Вікі_любить_пам'ятки/Волинська_область/Старовижівський_район
https://uk.wikipedia.org/wiki/Вікіпедія:Вікі_любить_пам'ятки/Київська_область/Броварський_район
https://uk.wikipedia.org/wiki/Вікіпедія:Вікі_любить_пам'ятки/Полтавська_область/Семенівський_район
```
biodranik (Migrated from github.com) reviewed 2023-06-24 20:45:44 +00:00
@ -40,0 +30,4 @@
page: &Page,
redirects: impl IntoIterator<Item = WikipediaTitleNorm>,
) -> anyhow::Result<PathBuf> {
let base = base.as_ref();
biodranik (Migrated from github.com) commented 2023-06-24 20:45:44 +00:00

Good. Please don't forget that if something can be simplified or improved by changing the current generator approach, then it makes sense to do it.

newsch (Migrated from github.com) reviewed 2023-06-26 13:52:47 +00:00
@ -40,0 +30,4 @@
page: &Page,
redirects: impl IntoIterator<Item = WikipediaTitleNorm>,
) -> anyhow::Result<PathBuf> {
let base = base.as_ref();
newsch (Migrated from github.com) commented 2023-06-26 13:52:47 +00:00

I'm working on a list of changes that would be helpful.

newsch commented 2023-06-30 21:34:31 +00:00 (Migrated from github.com)

I ran them with all languages on my machine. I only have 4 cores, so more than two instances didn't show much of an improvement.
I didn't run into any errors, but there is a race condition between checking if the folder for a QID exists and creating it.
If we decide to do parallelism by running multiple instances, that should be handled. But I think we will be better off running multiple decompression threads internally.

Speaking of which, after investigating [pgzip](https://github.com/pgzip/pgzip) further, my understanding is that it can only parallelize decompressing files that *it* compressed in a specific way. I'll make another issue for investigating other gunzip implementations.

biodranik commented 2023-07-01 07:29:27 +00:00 (Migrated from github.com)

Parallelism is the next step; it can be done using existing tools. Let's lower its priority.

Why is there a race condition with QID? Aren't they created from a separate pass over the OSM dump?

newsch commented 2023-07-03 14:08:29 +00:00 (Migrated from github.com)

> Why is there a race condition with QID? Aren't they created from a separate pass over the OSM dump?

When running multiple instances in parallel, they could process different translations of an article at the same time, and interleave between checking that the QID folder doesn't exist and creating it.

The same thing could hypothetically happen with article title folders, but since each dump is in a different language it shouldn't occur.

It is probably unlikely to occur, and it won't take down the entire program.
I can add special handling for the error to mitigate it.

biodranik commented 2023-07-04 11:57:13 +00:00 (Migrated from github.com)

Aren't file system operations atomic? Adding a handler for the case "tried to create it but it was already created by another process" is a good idea.

newsch commented 2023-07-04 13:41:18 +00:00 (Migrated from github.com)

Yes, individual syscalls should be atomic, but I don't think there are any guarantees between the call to `path.is_dir()` and `fs::create_dir(&path)`.

It looks like [`create_dir_all` explicitly handles this though](https://doc.rust-lang.org/std/fs/fn.create_dir_all.html#errors), [by checking if the directory exists after getting an error](https://doc.rust-lang.org/src/std/fs.rs.html#2483). So it should not be a problem after all.
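
A minimal sketch of that fallback (assuming plain `std::fs` calls; `ensure_dir` is a made-up helper name):

```rust
use std::fs;
use std::io::ErrorKind;
use std::path::Path;

// Tolerate another instance creating the directory between our check and
// our create call.
fn ensure_dir(path: &Path) -> std::io::Result<()> {
    match fs::create_dir(path) {
        Ok(()) => Ok(()),
        // Another process won the race; fine as long as it is a directory.
        Err(e) if e.kind() == ErrorKind::AlreadyExists && path.is_dir() => Ok(()),
        Err(e) => Err(e),
    }
}
```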

biodranik (Migrated from github.com) approved these changes 2023-07-06 05:50:28 +00:00
@ -2,7 +2,45 @@
_Extracts articles from [Wikipedia database dumps](https://en.wikipedia.org/wiki/Wikipedia:Database_download) for embedding into the `mwm` map files created by [the Organic Maps generator](https://github.com/organicmaps/organicmaps/blob/master/tools/python/maps_generator/README.md)._
biodranik (Migrated from github.com) commented 2023-07-05 18:11:13 +00:00

List? Delimited by what? Any example? Is specifying a directory with dumps better?

biodranik (Migrated from github.com) commented 2023-07-05 18:11:36 +00:00

Why should it be on PATH? Can it be run from any directory?

biodranik (Migrated from github.com) commented 2023-07-05 18:13:39 +00:00

Is extracting ids directly from the osm pbf planet dump better than relying on the intermediate generator files? What are the pros and cons?

biodranik (Migrated from github.com) commented 2023-07-05 18:14:04 +00:00

nit: Start sentences with a capital letter and end them with a dot.

@ -9,0 +42,4 @@
--wikidata-ids wikidata_ids.txt \
--wikipedia-urls wikipedia_urls.txt \
descriptions/
done
biodranik (Migrated from github.com) commented 2023-07-05 18:17:35 +00:00

Would a hint about om-wikiparser command line options be helpful?

biodranik (Migrated from github.com) commented 2023-07-06 05:49:23 +00:00

Print file name too?

biodranik (Migrated from github.com) commented 2023-07-06 05:49:38 +00:00

In which cases can the dates be different?

biodranik (Migrated from github.com) commented 2023-07-06 05:50:22 +00:00

How will it work now?

newsch (Migrated from github.com) reviewed 2023-07-06 14:07:22 +00:00
@ -2,7 +2,45 @@
_Extracts articles from [Wikipedia database dumps](https://en.wikipedia.org/wiki/Wikipedia:Database_download) for embedding into the `mwm` map files created by [the Organic Maps generator](https://github.com/organicmaps/organicmaps/blob/master/tools/python/maps_generator/README.md)._
newsch (Migrated from github.com) commented 2023-07-06 14:07:21 +00:00

I meant a shell list/array(?), separated by spaces.

One example is a glob, so using a directory and then referencing `$WIKIPEDIA_DUMP_DIRECTORY/*.json.tar.gz` might be clearer?

newsch (Migrated from github.com) reviewed 2023-07-06 14:41:06 +00:00
@ -2,7 +2,45 @@
_Extracts articles from [Wikipedia database dumps](https://en.wikipedia.org/wiki/Wikipedia:Database_download) for embedding into the `mwm` map files created by [the Organic Maps generator](https://github.com/organicmaps/organicmaps/blob/master/tools/python/maps_generator/README.md)._
newsch (Migrated from github.com) commented 2023-07-06 14:41:06 +00:00

It doesn't need to be; the example script reads more clearly to me if it's in the context of the `intermediate_data` directory. It could also be run as `../../../wikiparser/target/release/om-wikiparser`, with `cargo run --release` from the wikiparser directory, or anything else.

newsch (Migrated from github.com) reviewed 2023-07-06 14:52:10 +00:00
@ -2,7 +2,45 @@
_Extracts articles from [Wikipedia database dumps](https://en.wikipedia.org/wiki/Wikipedia:Database_download) for embedding into the `mwm` map files created by [the Organic Maps generator](https://github.com/organicmaps/organicmaps/blob/master/tools/python/maps_generator/README.md)._
newsch (Migrated from github.com) commented 2023-07-06 14:52:10 +00:00

Pros:

  • Independent of the generator process. Can be run as soon as planet file is updated.

Cons:

  • Need to keep osm query in sync with generator's own multi-step filtering and transformation process.
  • Need to match generator's multi-step processing of urls exactly.

When I did this earlier, it was with the `osmfilter` tool, I only tested it on the Yukon region, and it output *more* entries than the generator did.

I can create an issue for this, but the rough steps to get that working are:

  • Convert the `osmfilter` query to an `osmium` command so it can work on `pbf` files directly.
  • Dig into the generator's map processing to try to improve querying.
  • Compare processing of a complete planet with the generator's output.
  • Write conversion of `osmium` output for `wikiparser` to use.
newsch (Migrated from github.com) reviewed 2023-07-06 14:52:54 +00:00
@ -2,7 +2,45 @@
_Extracts articles from [Wikipedia database dumps](https://en.wikipedia.org/wiki/Wikipedia:Database_download) for embedding into the `mwm` map files created by [the Organic Maps generator](https://github.com/organicmaps/organicmaps/blob/master/tools/python/maps_generator/README.md)._
newsch (Migrated from github.com) commented 2023-07-06 14:52:54 +00:00

Sorry, old habits die hard.

newsch (Migrated from github.com) reviewed 2023-07-06 14:54:31 +00:00
newsch (Migrated from github.com) commented 2023-07-06 14:54:30 +00:00

That's an old TODO; I'll remove it. It returns any parse errors it encounters with the title and redirects.

newsch (Migrated from github.com) reviewed 2023-07-06 14:56:27 +00:00
newsch (Migrated from github.com) commented 2023-07-06 14:56:27 +00:00

The debug line above does that.

newsch (Migrated from github.com) reviewed 2023-07-06 15:00:20 +00:00
newsch (Migrated from github.com) commented 2023-07-06 15:00:20 +00:00

That's referring to #9, but I should remove that line now that it is designed to overwrite the directories from a previous run.

biodranik (Migrated from github.com) reviewed 2023-07-06 15:03:40 +00:00
@ -2,7 +2,45 @@
_Extracts articles from [Wikipedia database dumps](https://en.wikipedia.org/wiki/Wikipedia:Database_download) for embedding into the `mwm` map files created by [the Organic Maps generator](https://github.com/organicmaps/organicmaps/blob/master/tools/python/maps_generator/README.md)._
biodranik (Migrated from github.com) commented 2023-07-06 15:03:40 +00:00

It's better to mention list item separators explicitly and provide some example for clarity.

biodranik (Migrated from github.com) reviewed 2023-07-06 15:04:39 +00:00
@ -2,7 +2,45 @@
_Extracts articles from [Wikipedia database dumps](https://en.wikipedia.org/wiki/Wikipedia:Database_download) for embedding into the `mwm` map files created by [the Organic Maps generator](https://github.com/organicmaps/organicmaps/blob/master/tools/python/maps_generator/README.md)._
biodranik (Migrated from github.com) commented 2023-07-06 15:04:39 +00:00

...then why suggest installing the tool on PATH?

biodranik (Migrated from github.com) reviewed 2023-07-06 15:15:48 +00:00
@ -2,7 +2,45 @@
_Extracts articles from [Wikipedia database dumps](https://en.wikipedia.org/wiki/Wikipedia:Database_download) for embedding into the `mwm` map files created by [the Organic Maps generator](https://github.com/organicmaps/organicmaps/blob/master/tools/python/maps_generator/README.md)._
biodranik (Migrated from github.com) commented 2023-07-06 15:15:48 +00:00
  1. Does it make sense to create an issue to document the existing generator's output format, and propose some improvements if necessary? What kind of complex transformations are done in the generator now, and why?
  2. What's wrong with outputting more URLs? I assume that the generator may now filter OSM POIs/types that we are not supporting yet. In the worst case, some more articles will be extracted from the planet, right? Do you remember how big the percentage of "unnecessary" articles is?
  3. osmfilter can work with o5m, osmconvert can process pbf. There is also https://docs.rs/osmpbf/latest/osmpbf/ for direct pbf processing if it makes the approach simpler. How good is the osmium tool compared to other options?

It would be great to have a well-defined and independent API between the generator and wikiparser, to avoid complications when supporting it in the longer term. WDYT?

newsch (Migrated from github.com) reviewed 2023-07-06 15:21:47 +00:00
@ -2,7 +2,45 @@
_Extracts articles from [Wikipedia database dumps](https://en.wikipedia.org/wiki/Wikipedia:Database_download) for embedding into the `mwm` map files created by [the Organic Maps generator](https://github.com/organicmaps/organicmaps/blob/master/tools/python/maps_generator/README.md)._
newsch (Migrated from github.com) commented 2023-07-06 15:21:47 +00:00

So that you can always reference it as `om-wikiparser` wherever you are, without worrying about where it is relative to you, or copying it into your working directory.

I meant this as an explanation of how to use it, not a step-by-step for what to run on the build server filesystem.

Maybe writing a shell script to use on the maps server instead would be helpful?

Would you prefer:

```shell
# Transform intermediate files from generator.
cut -f 2 id_to_wikidata.csv > wikidata_ids.txt
tail -n +2 wiki_urls.txt | cut -f 3 > wikipedia_urls.txt
# Begin extraction.
for dump in $WIKIPEDIA_ENTERPRISE_DUMPS
do
  tar xzf $dump | $WIKIPARSER_DIR/target/release/om-wikiparser \
    --wikidata-ids wikidata_ids.txt \
    --wikipedia-urls wikipedia_urls.txt \
    descriptions/
done
```

or

```shell
# Transform intermediate files from generator.
maps_build=~/maps_build/$BUILD_DATE/intermediate_data
cut -f 2 $maps_build/id_to_wikidata.csv > $maps_build/wikidata_ids.txt
tail -n +2 $maps_build/wiki_urls.txt | cut -f 3 > $maps_build/wikipedia_urls.txt
# Begin extraction.
for dump in $WIKIPEDIA_ENTERPRISE_DUMPS
do
  tar xzf $dump | ./target/release/om-wikiparser \
    --wikidata-ids $maps_build/wikidata_ids.txt \
    --wikipedia-urls $maps_build/wikipedia_urls.txt \
    $maps_build/descriptions/
done
```
biodranik (Migrated from github.com) reviewed 2023-07-06 16:04:06 +00:00
@ -2,7 +2,45 @@
_Extracts articles from [Wikipedia database dumps](https://en.wikipedia.org/wiki/Wikipedia:Database_download) for embedding into the `mwm` map files created by [the Organic Maps generator](https://github.com/organicmaps/organicmaps/blob/master/tools/python/maps_generator/README.md)._
biodranik (Migrated from github.com) commented 2023-07-06 16:04:06 +00:00
  1. Can it be wrapped in a helper script that can be easily customized and run on the generator, maybe directly from the wikiparser repo? :)
  2. `cargo run -r` may be even better than a path to the binary :) But it's also ok to hard-code the path or use a `$WIKIPARSER_BINARY` var.

Think about me testing your code soon on a production server. Less surprises = less stress ;-)

biodranik (Migrated from github.com) reviewed 2023-07-06 16:05:30 +00:00
@ -2,7 +2,45 @@
_Extracts articles from [Wikipedia database dumps](https://en.wikipedia.org/wiki/Wikipedia:Database_download) for embedding into the `mwm` map files created by [the Organic Maps generator](https://github.com/organicmaps/organicmaps/blob/master/tools/python/maps_generator/README.md)._
biodranik (Migrated from github.com) commented 2023-07-06 16:05:30 +00:00

Btw, it may make sense to also print/measure time taken to execute some commands after the first run on the whole planet, to have some reference starting values.

newsch (Migrated from github.com) reviewed 2023-07-06 16:25:02 +00:00
@ -2,7 +2,45 @@
_Extracts articles from [Wikipedia database dumps](https://en.wikipedia.org/wiki/Wikipedia:Database_download) for embedding into the `mwm` map files created by [the Organic Maps generator](https://github.com/organicmaps/organicmaps/blob/master/tools/python/maps_generator/README.md)._
newsch (Migrated from github.com) commented 2023-07-06 16:25:01 +00:00

> It would be great to have a well-defined and independent API between the generator and wikiparser, to avoid complications when supporting it in the longer term.

Absolutely agree!

> Does it make sense to create an issue to document the existing generator's output format, and propose some improvements if necessary? What kind of complex transformations are done in the generator now, and why?

I think so, do you mean the wikipedia/wikidata files or the mwm format in general?

As for the transformations: when I looked at it last, it looked like it was doing some sort of merging of ways/nodes/shapes to get a single parent object.
When I compared the OSM IDs that it output with the wikidata IDs, it didn't match up with what I got from `osmfilter`, even if the urls were the same. Not a problem for the wikiparser, as long as the QIDs/articles are all caught, but it was harder to tell if they were doing the same thing.

As we talked about before, there are also multiple layers of filtering nodes by amenity or other tags, and I only looked at the final layer when I was trying this `osmfilter` approach (based on [`ftypes_matcher.cpp`](https://github.com/organicmaps/organicmaps/blob/982c6aa92d7196a5690dcdc1564e427de7611806/indexer/ftypes_matcher.cpp#L473)).

> What's wrong with outputting more URLs? I assume that the generator may now filter OSM POIs/types that we are not supporting yet. In the worst case, some more articles will be extracted from the planet, right?

As you say, the worst case isn't a problem for the end user, but I want to do more comparisons with the whole planet file to be confident that this is really a superset of them.

> Do you remember how big the percentage of "unnecessary" articles is?

That was around 25%, but that was in the Yukon territory, so not very many nodes, and I would guess it is not comparable to the planet.

> How good is the osmium tool compared to other options?

I haven't looked into `osmium` much, but my understanding is it is at least as powerful as `osmfilter`/`osmconvert`. I know we talked about using pbfs directly at some point, so that's why I mentioned it.

newsch (Migrated from github.com) reviewed 2023-07-06 16:32:06 +00:00
@ -2,7 +2,45 @@
_Extracts articles from [Wikipedia database dumps](https://en.wikipedia.org/wiki/Wikipedia:Database_download) for embedding into the `mwm` map files created by [the Organic Maps generator](https://github.com/organicmaps/organicmaps/blob/master/tools/python/maps_generator/README.md)._
newsch (Migrated from github.com) commented 2023-07-06 16:32:06 +00:00

I will update the README to be more of an explanation, and make another issue/PR for a script that handles the build directory, timing, backtraces, saving logs, etc.

biodranik (Migrated from github.com) reviewed 2023-07-06 16:38:34 +00:00
@ -2,7 +2,45 @@
_Extracts articles from [Wikipedia database dumps](https://en.wikipedia.org/wiki/Wikipedia:Database_download) for embedding into the `mwm` map files created by [the Organic Maps generator](https://github.com/organicmaps/organicmaps/blob/master/tools/python/maps_generator/README.md)._
biodranik (Migrated from github.com) commented 2023-07-06 16:38:34 +00:00

> I think so, do you mean the wikipedia/wikidata files or the mwm format in general?

I meant those files that are required for wikiparser to work. It actually may make sense to keep it in the README or some other doc, not in an issue.
