Get all translations for articles matched by title #15

Closed

opened 2023-07-04 16:32:36 +00:00 by newsch · 0 comments

(Migrated from github.com)

Currently the program checks for matches against the list of article titles and Wikidata QIDs.
The QIDs are language-agnostic, so all translations of an article matched by QID will be picked up.

For titles, however, there's no way to tell whether an article is a translation of another title in the list, so only the article in the title's own language is matched.

Example

For the Eiffel Tower, if OSM has no `wikidata=` tag and only `wikipedia=fr:Tour Eiffel`, we don't know to extract `en:Eiffel Tower` or `ru:Эйфелева башня` until we process the page in the `fr` dump and get its Wikidata QID.

At the same time, there will be Russian-only tags that need to be mapped to other languages but can't be resolved until we process the `ru` dump.
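To make the gap concrete, here is a minimal sketch of the matching described above. The `Article` type and `is_match` function are hypothetical names for illustration, not the program's actual code; Q243 is the Wikidata item for the Eiffel Tower.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Article:
    """One page from a language dump (hypothetical shape, for illustration)."""
    lang: str
    title: str
    qid: Optional[str]  # None if the page has no Wikidata item


def is_match(article, wanted_titles, wanted_qids):
    """Match by QID (language-agnostic) or by (lang, title) (one language only)."""
    if article.qid is not None and article.qid in wanted_qids:
        return True
    return (article.lang, article.title) in wanted_titles


# OSM only gave us wikipedia=fr:Tour Eiffel, so there is no QID to match on.
wanted_titles = {("fr", "Tour Eiffel")}
wanted_qids = set()

fr_page = Article("fr", "Tour Eiffel", "Q243")
en_page = Article("en", "Eiffel Tower", "Q243")

fr_matched = is_match(fr_page, wanted_titles, wanted_qids)  # matched by title
en_matched = is_match(en_page, wanted_titles, wanted_qids)  # missed: same QID, but unknown to us
```

The English page is skipped even though it shares the French page's QID, because that QID only becomes known once the `fr` dump has been processed.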

For objects with a `wikidata=` tag this is not a problem. There are also `wikipedia:lang=` tags, but the generator needs to be updated to handle those, and not every OSM object has all of the tags.
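As a rough illustration of what handling those tags could involve, the sketch below collects `(lang, title)` pairs from both tag styles. The function name and the assumption that per-language keys look like `wikipedia:en` are mine, not taken from the generator.

```python
def wikipedia_titles(tags: dict[str, str]) -> set[tuple[str, str]]:
    """Collect (lang, title) pairs from wikipedia= and wikipedia:<lang>= tags."""
    found = set()
    for key, value in tags.items():
        if key == "wikipedia":
            # Value is expected to look like "fr:Tour Eiffel".
            lang, sep, title = value.partition(":")
            if sep and lang and title:
                found.add((lang, title))
        elif key.startswith("wikipedia:"):
            # Key carries the language, value is the bare title.
            lang = key[len("wikipedia:"):]
            if lang and value:
                found.add((lang, value))
    return found


example_tags = {
    "name": "Tour Eiffel",
    "wikipedia": "fr:Tour Eiffel",
    "wikipedia:en": "Eiffel Tower",
}
titles = wikipedia_titles(example_tags)
```

Even with this in place, objects that carry only a single `wikipedia=` tag still need the title-to-QID step described below.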

Solution

A complete mapping from title to QID would need to include all titles and redirects in each supported language.

We can build that by scanning through all the dumps initially, or by parsing some smaller dumps of redirects and QIDs, either using or doing something similar to the wikimapper project (https://github.com/jcklie/wikimapper).
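A minimal in-memory sketch of such a mapping, assuming the page and redirect data have already been extracted into tuples. The input shapes and the `build_title_to_qid` name are assumptions for illustration, and this is not wikimapper's API. The redirect in the example is made up.

```python
def build_title_to_qid(pages, redirects):
    """Build a (lang, title) -> QID lookup that also covers redirects.

    pages:     iterable of (lang, title, qid)
    redirects: iterable of (lang, from_title, to_title)
    """
    mapping = {(lang, title): qid for lang, title, qid in pages}
    for lang, source, target in redirects:
        # Only single-hop redirects to known pages are resolved here.
        qid = mapping.get((lang, target))
        if qid is not None:
            # Do not overwrite a real page that shares the redirect's title.
            mapping.setdefault((lang, source), qid)
    return mapping


pages = [
    ("fr", "Tour Eiffel", "Q243"),
    ("en", "Eiffel Tower", "Q243"),
]
redirects = [
    ("en", "Eiffel tower", "Eiffel Tower"),   # illustrative redirect
    ("en", "Dangling", "Missing target"),     # target unknown, so it is dropped
]
title_to_qid = build_title_to_qid(pages, redirects)
```

With a lookup like this, every title in the wanted list could be converted to a QID up front, and matching could then proceed on QIDs alone.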

Some options to resolve the problem:

- Build a complete mapping
- Build a partial mapping only of the required titles
- Save QIDs of articles matched by title, then find them again in another pass over all dumps

I think writing the missed QIDs out after the first scan is a good first step. If doing two passes increases runtime too much, we can investigate the smaller dump option.
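The two-pass idea could look roughly like the following. This is a sketch under simplifying assumptions: dumps are modelled as in-memory lists of `(title, qid)` per language, the missed QIDs are kept in a set rather than written to a file, and all function names are hypothetical.

```python
def first_pass(dumps, wanted_titles, wanted_qids):
    """Return QIDs of articles matched by title whose QID was not already wanted."""
    missed = set()
    for lang, articles in dumps.items():
        for title, qid in articles:
            if (lang, title) in wanted_titles and qid is not None and qid not in wanted_qids:
                missed.add(qid)
    return missed


def second_pass(dumps, qids):
    """Return every (lang, title) whose QID is in the given set."""
    return {
        (lang, title)
        for lang, articles in dumps.items()
        for title, qid in articles
        if qid in qids
    }


dumps = {
    "fr": [("Tour Eiffel", "Q243")],
    "en": [("Eiffel Tower", "Q243"), ("Unrelated page", "Q999999999")],  # placeholder QID
    "ru": [("Эйфелева башня", "Q243")],
}

missed_qids = first_pass(dumps, {("fr", "Tour Eiffel")}, set())
translations = second_pass(dumps, missed_qids)
```

The second pass touches every dump again, which is where the runtime concern comes from. The first pass's output is small, though, so persisting it between runs should be cheap.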
