# Investigate using osmfilter/osmium for generating inputs #19

Closed
opened 2023-07-10 17:38:17 +00:00 by newsch · 5 comments
newsch commented 2023-07-10 17:38:17 +00:00 (Migrated from github.com)

As [discussed in #6](https://github.com/organicmaps/wikiparser/pull/6#discussion_r1253465466), instead of waiting for the generator to reach the descriptions stage, we could process the OSM planet file directly to query the `wikipedia` and `wikidata` tags.

To extract them separately from the generator, we should be careful to:

- use the same criteria as the generator for filtering nodes
- process article titles in the same way as the generator

## Previous Work

I got a query with `osmfilter` working based on the filters in [`ftypes_matcher.cpp`](https://github.com/organicmaps/organicmaps/blob/982c6aa92d7196a5690dcdc1564e427de7611806/indexer/ftypes_matcher.cpp#L473).

```
osmfilter planet.o5m --keep="( wikipedia= or wikidata= ) and ( amenity=grave_yard or amenity=fountain or amenity=place_of_worship or amenity=theatre or amenity=townhall or amenity=university or boundary=national_park or building=train_station or highway=pedestrian or historic=archaeological_site or historic=boundary_stone or historic=castle or historic=fort or historic=memorial or historic=monument or historic=ruins or historic=ship or historic=tomb or historic=wayside_cross or historic=wayside_shrine or landuse=cemetery or leisure=garden or leisure=nature_reserve or leisure=park or leisure=water_park or man_made=lighthouse or man_made=tower or natural=beach or natural=cave_entrance or natural=geyser or natural=glacier or natural=hot_spring or natural=peak or natural=volcano or place=square or tourism=artwork or tourism=museum or tourism=gallery or tourism=zoo or tourism=theme_park or waterway=waterfall or tourism=viewpoint or tourism=attraction )" \
| osmconvert - --csv-headline --csv="@oname @id wikipedia wikidata"
```

I ran it on the Yukon territory map and found that it output additional articles compared to the generator.
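
To compare the two outputs, something like the following diff of the wikidata ID lists should work (a sketch only; the filenames and the generator's output format are assumptions):

```sh
# wikidata is the 4th column of the osmconvert output above (tab-separated);
# skip the --csv-headline header row before extracting it.
tail -n +2 osmfilter-output.tsv | cut -f4 | sort -u > osmfilter-ids.txt
sort -u generator-ids.txt > generator-ids.sorted.txt
# IDs present in the osmfilter output but missing from the generator's:
comm -23 osmfilter-ids.txt generator-ids.sorted.txt
```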

## Next Steps

- [x] Investigate earlier processing layers in the map generator to improve the query.
- [x] Try to convert the `osmfilter` command to `osmium` so `.pbf` files can be used directly (see the sketch after this list).
- [x] Run the query on the whole planet file and compare it with the generator output and with all wikipedia/wikidata tags in the planet file.
- [x] Update wikiparser to handle direct OSM inputs.
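
For the `osmium` conversion: `osmium tags-filter` ORs its filter expressions together, so reproducing the `and` in the `osmfilter` query would likely need two chained passes. A rough sketch (untested; filenames are placeholders and the type list is abbreviated):

```sh
# Pass 1: keep only objects carrying a wikipedia or wikidata tag.
osmium tags-filter planet.osm.pbf \
    wikipedia wikidata \
    --omit-referenced -o wiki-tagged.osm.pbf

# Pass 2: of those, keep only the attraction types from ftypes_matcher.cpp.
# Expressions are ORed; only a few of the type filters are shown here.
osmium tags-filter wiki-tagged.osm.pbf \
    amenity=grave_yard amenity=fountain amenity=place_of_worship \
    historic=castle tourism=museum tourism=attraction \
    --omit-referenced -o filtered.osm.pbf
```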
biodranik commented 2023-07-10 21:15:02 +00:00 (Migrated from github.com)

Interestingly, OM stores wiki articles only for hard-coded attractions.
@vng does it make sense to store articles for other types? Does the generator store articles only for point- and area-like types, or also for linear ones? I think we can drop wiki for linear types at the moment (with a comment), because they are not selected intuitively and require a long tap.

biodranik commented 2023-07-10 21:36:38 +00:00 (Migrated from github.com)

@newsch I think that the simplest approach would be to do any necessary filtering on the generator side, and extract _everything_ from the dump in Wikiparser. That way, we can always decide in the C++ code whether an article should be embedded, and we don't duplicate filtering logic in two different modules.

newsch commented 2023-07-10 23:13:17 +00:00 (Migrated from github.com)

Ok, I will add a comparison of the current filtering against all the tags in the OSM dump.

newsch commented 2023-08-01 18:35:45 +00:00 (Migrated from github.com)

This will extract all OSM wikidata/wikipedia tags into a TSV file. `osmium` doesn't support CSV output, but we can use `osmconvert` to convert it into the same format I used above:

```sh
osmium tags-filter \
~/Downloads/nova-scotia-latest.osm.pbf \
--omit-referenced --output-format osm \
wikidata wikipedia wikipedia:de wikipedia:en wikipedia:es wikipedia:fr wikipedia:ru \
| osmconvert - --csv-headline --csv="@id @oname @lon @lat name wikidata wikipedia wikipedia:de wikipedia:en wikipedia:es wikipedia:fr wikipedia:ru" \
| head
```
newsch commented 2023-08-01 20:17:56 +00:00 (Migrated from github.com)

It takes about 20 minutes to process the planet file on my computer.
Size-wise, compared to the generator files I have (not sure if these are from a recent build or the last successful scraping):

| tag | # from generator | # from osmium | # difference | % difference |
| ----------------- | ---------------- | ------------- | ------------ | ------------ |
| wikidata id | 2135422 | 2198188 | +62766 | +3% |
| wikipedia article | 1080547 | 1244377 | +163830 | +15% |

Worth noting that these counts are of deduplicated entries (about 500,000 duplicates each were removed), and use only the `wikidata` and `wikipedia` tags, not `wikipedia:*`.

With that information, I think processing all tags without initial filtering _won't_ impact runtime or output file size significantly.
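
For reference, a sketch of how such deduplicated counts can be produced from the extracted TSV (filename hypothetical; per the `--csv` header above, `wikidata` is column 6):

```sh
# Count distinct, non-empty wikidata values in the tab-separated output
# (skipping the header row added by --csv-headline).
tail -n +2 planet-wiki-tags.tsv | cut -f6 | grep -v '^$' | sort -u | wc -l
```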
