# Investigate using osmfilter/osmium for generating inputs #19

Closed
opened 2023-07-10 17:38:17 +00:00 by newsch · 5 comments
newsch commented 2023-07-10 17:38:17 +00:00 (Migrated from github.com)

As [discussed in #6](https://github.com/organicmaps/wikiparser/pull/6#discussion_r1253465466), instead of waiting for the generator to reach the descriptions stage, we could process the OSM planet file directly to query the `wikipedia` and `wikidata` tags.

To extract them separately from the generator, we should be careful to:

- use the same criteria as the generator for filtering nodes
- process article titles in the same way as the generator

## Previous Work

I got a query with `osmfilter` working based on the filters in [`ftypes_matcher.cpp`](https://github.com/organicmaps/organicmaps/blob/982c6aa92d7196a5690dcdc1564e427de7611806/indexer/ftypes_matcher.cpp#L473).

```
osmfilter planet.o5m --keep="( wikipedia= or wikidata= ) and ( amenity=grave_yard or amenity=fountain or amenity=place_of_worship or amenity=theatre or amenity=townhall or amenity=university or boundary=national_park or building=train_station or highway=pedestrian or historic=archaeological_site or historic=boundary_stone or historic=castle or historic=fort or historic=memorial or historic=monument or historic=ruins or historic=ship or historic=tomb or historic=wayside_cross or historic=wayside_shrine or landuse=cemetery or leisure=garden or leisure=nature_reserve or leisure=park or leisure=water_park or man_made=lighthouse or man_made=tower or natural=beach or natural=cave_entrance or natural=geyser or natural=glacier or natural=hot_spring or natural=peak or natural=volcano or place=square or tourism=artwork or tourism=museum or tourism=gallery or tourism=zoo or tourism=theme_park or waterway=waterfall or tourism=viewpoint or tourism=attraction )" \
| osmconvert - --csv-headline --csv="@oname @id wikipedia wikidata"
```

I ran it on the Yukon territory map and found that it output additional articles compared to the generator.
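
To compare the two outputs, something like the following diff of the wikidata ID lists should work (a sketch only; the filenames and the generator's output format are assumptions):

```sh
# wikidata is the 4th column of the osmconvert output above (tab-separated);
# skip the --csv-headline header row before extracting it.
tail -n +2 osmfilter-output.tsv | cut -f4 | sort -u > osmfilter-ids.txt
sort -u generator-ids.txt > generator-ids.sorted.txt
# IDs present in the osmfilter output but missing from the generator's:
comm -23 osmfilter-ids.txt generator-ids.sorted.txt
```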

## Next Steps

- [x] Investigate earlier processing layers in the map generator to improve the query.
- [x] Try to convert the `osmfilter` command to `osmium` so `.pbf` files can be used directly (see the sketch after this list).
- [x] Run the query on the whole planet file and compare it with the generator output and with all wikipedia/wikidata tags in the planet file.
- [x] Update wikiparser to handle direct OSM inputs.
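
For the `osmium` conversion: `osmium tags-filter` ORs its filter expressions together, so reproducing the `and` in the `osmfilter` query would likely need two chained passes. A rough sketch (untested; filenames are placeholders and the type list is abbreviated):

```sh
# Pass 1: keep only objects carrying a wikipedia or wikidata tag.
osmium tags-filter planet.osm.pbf \
    wikipedia wikidata \
    --omit-referenced -o wiki-tagged.osm.pbf

# Pass 2: of those, keep only the attraction types from ftypes_matcher.cpp.
# Expressions are ORed; only a few of the type filters are shown here.
osmium tags-filter wiki-tagged.osm.pbf \
    amenity=grave_yard amenity=fountain amenity=place_of_worship \
    historic=castle tourism=museum tourism=attraction \
    --omit-referenced -o filtered.osm.pbf
```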
biodranik commented 2023-07-10 21:15:02 +00:00 (Migrated from github.com)

Interestingly, OM stores wiki articles only for hard-coded attractions.
@vng does it make sense to store articles for other types? Does the generator store articles only for point- and area-like types, or also for linear ones? I think we can drop wiki for linear types at the moment (with a comment), because they are not selected intuitively and require a long tap.

biodranik commented 2023-07-10 21:36:38 +00:00 (Migrated from github.com)

@newsch I think that the simplest approach would be to do any necessary filtering on the generator side, and extract _everything_ from the dump in Wikiparser. That way, we can always decide in the C++ code whether an article should be embedded, and we don't duplicate filtering logic in two different modules.

newsch commented 2023-07-10 23:13:17 +00:00 (Migrated from github.com)

Ok, I will add a comparison of the current filtering against all the tags in the OSM dump.

newsch commented 2023-08-01 18:35:45 +00:00 (Migrated from github.com)

This will extract all OSM wikidata/wikipedia tags into a TSV file. `osmium` doesn't support CSV output, but we can use `osmconvert` to convert it into the same format I used above:

```sh
osmium tags-filter \
~/Downloads/nova-scotia-latest.osm.pbf \
--omit-referenced --output-format osm \
wikidata wikipedia wikipedia:de wikipedia:en wikipedia:es wikipedia:fr wikipedia:ru \
| osmconvert - --csv-headline --csv="@id @oname @lon @lat name wikidata wikipedia wikipedia:de wikipedia:en wikipedia:es wikipedia:fr wikipedia:ru" \
| head
```
newsch commented 2023-08-01 20:17:56 +00:00 (Migrated from github.com)

It takes about 20 minutes to process the planet file on my computer.
Size-wise, compared to the generator files I have (not sure if these are from a recent build or the last successful scraping):

| tag | # from generator | # from osmium | # difference | % difference |
| ----------------- | ---------------- | ------------- | ------------ | ------------ |
| wikidata id | 2135422 | 2198188 | +62766 | +3% |
| wikipedia article | 1080547 | 1244377 | +163830 | +15% |

Worth noting that these counts are of deduplicated entries (about 500,000 duplicates each were removed), and use only the `wikidata` and `wikipedia` tags, not `wikipedia:*`.

With that information, I think processing all tags without initial filtering _won't_ impact runtime or output file size significantly.
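
For reference, a sketch of how such deduplicated counts can be produced from the extracted TSV (filename hypothetical; per the `--csv` header above, `wikidata` is column 6):

```sh
# Count distinct, non-empty wikidata values in the tab-separated output
# (skipping the header row added by --csv-headline).
tail -n +2 planet-wiki-tags.tsv | cut -f6 | grep -v '^$' | sort -u | wc -l
```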
