Investigate using osmfilter/osmium for generating inputs #19
As discussed in #6, instead of waiting for the generator to reach the descriptions stage, we could process the OSM planet file directly to query the `wikipedia` and `wikidata` tags.
To separate them, we should be careful to:
Previous Work
I got a query with `osmfilter` working based on the filters in `ftypes_matcher.cpp`. I ran it on the Yukon territory map and found that it output additional articles compared to the generator.
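Roughly what that looked like, as a sketch only: the real query mirrors the type filters in `ftypes_matcher.cpp`, and the file names plus the catch-all wikipedia/wikidata filter here are assumptions.

```sh
# osmfilter can't read .pbf directly, so convert the extract to .o5m first.
osmconvert yukon.osm.pbf -o=yukon.o5m

# Illustrative filter: keep objects that carry a wikipedia or wikidata tag
# (an empty value in --keep should match any value); the real query also
# restricts by the feature types from ftypes_matcher.cpp.
osmfilter yukon.o5m --keep="wikipedia= or wikidata=" -o=yukon-wiki.o5m
```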
Next Steps
Convert the `osmfilter` command to `osmium` so `.pbf` files can be used directly.

Interestingly, OM stores wiki articles only for hard-coded attractions.
@vng does it make sense to store articles for other types? Does the generator store articles only for point- and area-like types, or also for linear ones? I think we can drop wiki for linear types at the moment (with a comment), because they are not selected intuitively and require a long tap.
@newsch I think the simplest approach would be to do any necessary filtering on the generator side and extract everything from the dump in Wikiparser. That way we can always decide in the C++ code whether an article should be embedded, without duplicating filtering logic in two different modules.
Ok, I will add a comparison of the current filtering against all the tags in the OSM dump.
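One rough way such a comparison could be run, assuming both the generator output and the dump extraction are first reduced to plain lists of QIDs (the file names here are placeholders):

```sh
# Entries that a full-dump extraction finds but the current generator
# filtering does not; comm needs sorted, deduplicated inputs.
sort -u generator-qids.txt > generator-sorted.txt
sort -u dump-qids.txt > dump-sorted.txt
comm -13 generator-sorted.txt dump-sorted.txt | wc -l
```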
This approach extracts all OSM wikidata/wikipedia tags into a TSV file. `osmium` doesn't support CSV output, but we can use `osmconvert` to convert its output into the same format I used above; a sketch of the pipeline follows below. It takes about 20 minutes to process the planet file on my computer.
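The exact commands aren't preserved in this copy of the thread, but a pipeline along these lines would do it (the file names and the column list are assumptions):

```sh
# Keep only nodes, ways, and relations (nwr) with a wikidata or wikipedia
# tag; osmium reads and writes .pbf directly.
osmium tags-filter planet-latest.osm.pbf nwr/wikidata nwr/wikipedia \
    -o planet-wiki.osm.pbf --overwrite

# Flatten the filtered objects to delimited text. @otype and @id are
# osmconvert's built-in columns; the default column separator is a tab,
# so the output is already a TSV.
osmconvert planet-wiki.osm.pbf \
    --csv="@otype @id wikidata wikipedia" --csv-headline \
    -o=planet-wiki.tsv
```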
Size-wise, compared to the generator files I have (not sure if these are from a recent build or the last successful scraping):
Worth noting that these counts are deduplicated entries, of which there are about 500,000 each, and that they use only the `wikidata` and `wikipedia` tags, not `wikipedia:*`. With that information I think processing all tags without initial filtering won't impact runtimes or output file size significantly.
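For reference, deduplicated counts like those can be derived directly from the TSV, assuming the column layout sketched above (@otype, @id, wikidata, wikipedia):

```sh
# Distinct wikidata QIDs (column 3) and wikipedia titles (column 4);
# tail skips the header row, grep drops rows missing that tag.
tail -n +2 planet-wiki.tsv | cut -f3 | grep -v '^$' | sort -u | wc -l
tail -n +2 planet-wiki.tsv | cut -f4 | grep -v '^$' | sort -u | wc -l
```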