Add osm tag file parsing #23

2023-08-03T13:57:54Z

newsch commented

2023-08-03 13:57:54 +00:00

(Migrated from github.com)

Parse wikipedia and wikidata tags from a tsv file of OSM tags,
compatible with the --csv output of osmconvert.

Closes #19.

Notes from parsing all planet wikipedia/wikidata tags:

6 UTF-8 TSV parsing errors
585 tag parsing errors
- 511 titles
  - no lang
  - ;/း instead of :
  - qid instead of title
- 74 qids
  - wikipedia title instead of qid
  - Q123;Q124
  - (Q123)
  - Warfstermolen (Q1866088)
  - yandex urls
  - amenity
  - coordinates

There are 50 wikipedia entries with url escaping, some are urls instead of titles and not handled correctly.

Remaining work:

add url checks in title parsing, match generator url handling (and handle mobile website urls)
~~serialize parse errors to disk for changes~~ Add structured parse errors (see #25 for rest), log summary
Use osmpbf crate to parse planet file to fix osmconvert truncation problem
Merge #21 and update run.sh to use new method

Parse wikipedia and wikidata tags from a tsv file of OSM tags, compatible with the `--csv` output of `osmconvert`. Closes #19. Notes from parsing all planet wikipedia/wikidata tags: - 6 UTF-8 TSV parsing errors - 585 tag parsing errors - 511 titles - no lang - `;`/`း` instead of `:` - qid instead of title - 74 qids - wikipedia title instead of qid - `Q123;Q124` - `(Q123)` - `Warfstermolen (Q1866088)` - yandex urls - amenity - coordinates There are 50 wikipedia entries with url escaping, some are urls instead of titles and not handled correctly. Remaining work: - [x] add url checks in title parsing, match generator url handling (and handle mobile website urls) - [x] ~~serialize parse errors to disk for changes~~ Add structured parse errors (see #25 for rest), log summary - [x] Use `osmpbf` crate to parse planet file to fix `osmconvert` truncation problem - [x] Merge #21 and update `run.sh` to use new method

biodranik commented

2023-08-03 17:49:41 +00:00

(Migrated from github.com)

Is there an invalid UTF in OpenStreetMap.org? Any examples? Can it be fixed?

It would be great to fix errors in OpenStreetMap.org directly.

On the other hand, it is important to log errors with OpenStreetMap.org data and report them to osmers or fix ourselves, while producing robust output.

Is there an invalid UTF in OpenStreetMap.org? Any examples? Can it be fixed? It would be great to fix errors in OpenStreetMap.org directly. On the other hand, it is important to log errors with OpenStreetMap.org data and report them to osmers or fix ourselves, while producing robust output.

👍 1

newsch commented

2023-08-03 18:23:28 +00:00

(Migrated from github.com)

I'm not sure yet if it's in the OSM db or something that got messed up by osmium/osmconvert, and I can't copy/paste them here, but I attached the relevant lines here:
planet-wikipedia-utf8-errors.csv

CSV parse error: record 602681 (line 602682, field: 6, byte: 48946539): invalid utf-8: invalid UTF-8 in field 6 near byte index 25
CSV parse error: record 1326111 (line 1326112, field: 4, byte: 102037202): invalid utf-8: invalid UTF-8 in field 4 near byte index 25
CSV parse error: record 1610617 (line 1610618, field: 6, byte: 118698840): invalid utf-8: invalid UTF-8 in field 6 near byte index 25
CSV parse error: record 1778686 (line 1778687, field: 6, byte: 128580834): invalid utf-8: invalid UTF-8 in field 6 near byte index 25
CSV parse error: record 2279654 (line 2279713, field: 6, byte: 158513245): invalid utf-8: invalid UTF-8 in field 6 near byte index 25
CSV parse error: record 2723754 (line 2723813, field: 4, byte: 186764436): invalid utf-8: invalid UTF-8 in field 4 near byte index 254

I'll add an option to output the errors in a structured way that we can deal with. Some are straightforward fixes, others might require someone with local knowledge that we can leave notes on.

I'm not sure yet if it's in the OSM db or something that got messed up by `osmium`/`osmconvert`, and I can't copy/paste them here, but I attached the relevant lines here: [planet-wikipedia-utf8-errors.csv](https://github.com/organicmaps/wikiparser/files/12253829/planet-wikipedia-utf8-errors.csv) ``` CSV parse error: record 602681 (line 602682, field: 6, byte: 48946539): invalid utf-8: invalid UTF-8 in field 6 near byte index 25 CSV parse error: record 1326111 (line 1326112, field: 4, byte: 102037202): invalid utf-8: invalid UTF-8 in field 4 near byte index 25 CSV parse error: record 1610617 (line 1610618, field: 6, byte: 118698840): invalid utf-8: invalid UTF-8 in field 6 near byte index 25 CSV parse error: record 1778686 (line 1778687, field: 6, byte: 128580834): invalid utf-8: invalid UTF-8 in field 6 near byte index 25 CSV parse error: record 2279654 (line 2279713, field: 6, byte: 158513245): invalid utf-8: invalid UTF-8 in field 6 near byte index 25 CSV parse error: record 2723754 (line 2723813, field: 4, byte: 186764436): invalid utf-8: invalid UTF-8 in field 4 near byte index 254 ``` I'll add an option to output the errors in a structured way that we can deal with. Some are straightforward fixes, others might require someone with local knowledge that we can leave notes on.

biodranik commented

2023-08-03 18:35:57 +00:00

(Migrated from github.com)

Thanks! Could it be some local locale issue? What is your terminal locale/code page? Does it support non-US ones? Or it could be some osmium issue.

There is nothing wrong with the output:
https://www.openstreetmap.org/api/0.6/node/1648348163
https://www.openstreetmap.org/node/1648348163

https://www.openstreetmap.org/api/0.6/way/53580264

https://www.openstreetmap.org/api/0.6/relation/7032757

Thanks! Could it be some local locale issue? What is your terminal locale/code page? Does it support non-US ones? Or it could be some osmium issue. There is nothing wrong with the output: https://www.openstreetmap.org/api/0.6/node/1648348163 https://www.openstreetmap.org/node/1648348163 https://www.openstreetmap.org/api/0.6/way/53580264 https://www.openstreetmap.org/api/0.6/relation/7032757

newsch commented

2023-08-03 18:55:23 +00:00

(Migrated from github.com)

My terminal is set to en_US.UTF-8, but it went straight from osmium -> osmconvert -> file.
I have an outdated planet file, I'll extract the full objects from there and see.

My terminal is set to `en_US.UTF-8`, but it went straight from `osmium` -> `osmconvert` -> file. I have an outdated planet file, I'll extract the full objects from there and see.

newsch commented

2023-08-03 19:17:18 +00:00

(Migrated from github.com)

osmium outputs it fine, but osmconvert doesn't.
I think it's a truncation issue, they're all around 257 bytes long, the error is at the end, and osmconvert has a fixed buffer of length 256 for keys/values.

It looks like these aren't titles but notes? or copied text from the articles? Wikipedia has a limit of 255 characters in titles.

`osmium` outputs it fine, but `osmconvert` doesn't. I think it's a truncation issue, they're all around 257 bytes long, the error is at the end, and [`osmconvert` has a fixed buffer of length 256 for keys/values](https://gitlab.com/osm-c-tools/osmctools/-/blob/f341f5f237737594c1b024338f0a2fc04fabdff3/src/osmconvert.c#L3177). It looks like these aren't titles but notes? or copied text from the articles? [Wikipedia has a limit of 255 characters in titles.](https://ru.wikipedia.org/wiki/%D1%81.%20%D0%A5%D0%B0%D1%80%D0%B1%D0%B0%D0%BB%D0%B0%201-%D1%8F%20-%20%D1%81%D0%B5%D0%BB%D1%8C%D1%81%D0%BA%D0%B8%D0%B9%20%D0%BD%D0%B0%D1%81%D0%B5%D0%BB%D0%B5%D0%BD%D0%BD%D1%8B%D0%B9%20%D0%BF%D1%83%D0%BD%D0%BA%D1%82%20(%D1%81%D0%B5%D0%BB%D0%BE),%20%D0%B5%D0%B4%D0%B8%D0%BD%D1%81%D1%82%D0%B2%D0%B5%D0%BD%D0%BD%D1%8B%D0%B9%20%D0%BD%D0%B0%D1%81%D0%B5%D0%BB%D0%B5%D0%BD%D0%BD%D1%8B%D0%B9%20%D0%BF%D1%83%D0%BD%D0%BA%D1%82%20%D0%B8%20%D0%B0%D0%B4%D0%BC%D0%B8%D0%BD%D0%B8%D1%81%D1%82%D1%80%D0%B0%D1%82%D0%B8%D0%B2%D0%BD%D1%8B%D0%B9%20%D1%86%D0%B5%D0%BD%D1%82%D1%80%20%D0%A1%D0%9F%20%C2%AB%D0%A5%D0%B0%D1%82%D1%8B%D0%BB%D1%8B%D0%BD%D1%81%D0%BA%D0%B8%D0%B9%20%D0%BD%D0%B0%D1%81%D0%BB%D0%B5%D0%B3%C2%BB%20%D0%A7%D1%83%D1%80%D0%B0%D0%BF%D1%87%D0%B8%D0%BD%D1%81%D0%BA%D0%BE%D0%B3%D0%BE%20%D1%83%D0%BB%D1%83%D1%81%D1%83%20(%D1%80%D0%B0%D0%B9%D0%BE%D0%BD%D0%B0)%20%D0%A0%D0%B5%D1%81%D0%BF%D1%83%D0%B1%D0%BB%D0%B8%D0%BA%D0%B8%20%D0%A1%D0%B0%D1%85%D0%B0%20(%D0%AF%D0%BA%D1%83%D1%82%D0%B8%D1%8F),%20%D1%80%D0%B0%D1%81%D0%BF%D0%BE%D0%BB%D0%BE%D0%B6%D0%B5%D0%BD%D0%BD%D1%8B%D0%B9%20%D0%B2%207%20%D0%BA%D0%B8%D0%BB%D0%BE%D0%BC%D0%B5%D1%82%D1%80%D0%B0%D1%85%20%D0%BE%D1%82%20%D0%A7%D1%83%D1%80%D0%B0%D0%BF%D1%87%D0%B8,%20%D0%B0%D0%B4%D0%BC%D0%B8%D0%BD%D0%B8%D1%81%D1%82%D1%80%D0%B0%D1%82%D0%B8%D0%B2%D0%BD%D0%BE%D0%B3%D0%BE%20%D1%86%D0%B5%D0%BD%D1%82%D1%80%D0%B0)

biodranik commented

2023-08-03 19:38:35 +00:00

(Migrated from github.com)

Good catch! You can patch osmconvert to increase the size, or try native rust approach.

👍 1

biodranik commented

2023-08-03 19:47:48 +00:00

(Migrated from github.com)

The link above says 255 urf-8 bytes in title, not 255 characters.

You may try to osmupdate your planet (don't forget to drop authors and history).

The link above says 255 urf-8 bytes in title, not 255 characters. You may try to osmupdate your planet (don't forget to drop authors and history).

newsch commented

2023-08-04 19:13:15 +00:00

(Migrated from github.com)

Sorry, I misspoke - yes, 255 bytes of utf-8. I'll try the rust parser you used.

biodranik (Migrated from github.com) reviewed 2023-08-09 18:53:35 +00:00

biodranik (Migrated from github.com) left a comment

Such a big PR! It would be easier to do it in parts.

How does simplification output look now?

Can we run and test it?

Such a big PR! It would be easier to do it in parts. How does simplification output look now? Can we run and test it?

benches/id_parsing.rs

					
				@ -4,22 +4,21 @@ use std::{collections::HashSet, str::FromStr};

				extern crate om_wikiparser;

				extern crate test;

biodranik (Migrated from github.com) commented

2023-08-09 18:50:27 +00:00

Why is it not in one line?

biodranik (Migrated from github.com) commented

2023-08-09 18:50:50 +00:00

A constant to avoid copy-paste?

benches/id_parsing.rs

					
				@ -23,3 +21,4 @@

				    let title = Title::from_url(TITLE).unwrap();

				    let mut set = HashSet::new();

				    b.iter(|| {

				        set.insert(&title);

biodranik (Migrated from github.com) commented

2023-08-09 18:51:02 +00:00

A constant to avoid copy-paste?

newsch commented

2023-08-09 19:15:15 +00:00

(Migrated from github.com)

Such a big PR! It would be easier to do it in parts.

I will try to break it up into better commits and let you know - some of it is refactoring that isn't helpful to see together.

How does simplification output look now?

This doesn't touch simplification, I am doing that next. I'll add my updates to #4 and open a PR with the changes.

Can we run and test it?

Sure! I've been testing it as I go, and run.sh is updated to extract the tags from a pbf file. The new format is documented in the script and the README, the TL;DR is that now you pass a pbf file as well:

./run.sh /tmp/output/ ~/Downloads/planet-latest.osm.pbf ~/Downloads/enwiki-NS0-20230401-ENTERPRISE-HTML.json.tar.gz ...

> Such a big PR! It would be easier to do it in parts. I will try to break it up into better commits and let you know - some of it is refactoring that isn't helpful to see together. > How does simplification output look now? This doesn't touch simplification, I am doing that next. I'll add my updates to #4 and open a PR with the changes. > Can we run and test it? Sure! I've been testing it as I go, and `run.sh` is updated to extract the tags from a pbf file. The new format is documented in the script and the README, the TL;DR is that now you pass a pbf file as well: ``` ./run.sh /tmp/output/ ~/Downloads/planet-latest.osm.pbf ~/Downloads/enwiki-NS0-20230401-ENTERPRISE-HTML.json.tar.gz ... ```

newsch (Migrated from github.com) reviewed 2023-08-09 19:16:18 +00:00

benches/id_parsing.rs

					
				@ -4,22 +4,21 @@ use std::{collections::HashSet, str::FromStr};

				extern crate om_wikiparser;

				extern crate test;

newsch (Migrated from github.com) commented

2023-08-09 19:16:18 +00:00

I'm not sure, renaming it to the shorter Title must have altered rustfmt's heuristics.

I'm not sure, renaming it to the shorter `Title` must have altered `rustfmt`'s heuristics.

newsch commented

2023-08-09 20:38:19 +00:00

(Migrated from github.com)

I think it will be easiest to review the remaining commits individually. They're reasonably separated and each has a meaningful commit message.

These contain meaningful changes:

Add new option to parse osm tag file (+114/-11)
Extract tags in parallel in rust (+559/-97)
Improve url handling (+26/-2)
Structure parse errors and only log warning if above threshold (+155/-40)

These move things around and don't change meaningful functionality:

Refactor into subcommands (+305/-277)
Refactor and rename title/qid wrappers (+211/-207)

I think it will be easiest to review the remaining commits individually. They're reasonably separated and each has a meaningful commit message. These contain meaningful changes: - [ ] [Add new option to parse osm tag file](https://github.com/organicmaps/wikiparser/pull/23/commits/a2c113a8850ced2c115fe2065ad169956ece49fa) (+114/-11) - [ ] [Extract tags in parallel in rust](https://github.com/organicmaps/wikiparser/pull/23/commits/3d48c39793cc39129d37d0e07851c2488160f3c4) (+559/-97) - [ ] [Improve url handling](https://github.com/organicmaps/wikiparser/pull/23/commits/2532d1365e5d50e77e533963df27f6465f6618cc) (+26/-2) - [ ] [Structure parse errors and only log warning if above threshold](https://github.com/organicmaps/wikiparser/pull/23/commits/b250dd4b1390bf23c673923f4e9a65f8c6d3a32e) (+155/-40) These move things around and don't change meaningful functionality: - [ ] [Refactor into subcommands](https://github.com/organicmaps/wikiparser/pull/23/commits/0ac935c175a161a1e60ac7ebe917d858721df308) (+305/-277) - [ ] [Refactor and rename title/qid wrappers](https://github.com/organicmaps/wikiparser/pull/23/commits/29cdbe2301dce8a615505ba4442208e50127f33d) (+211/-207)

biodranik (Migrated from github.com) approved these changes 2023-08-09 21:35:06 +00:00

biodranik (Migrated from github.com) left a comment

Thanks for clear commits and their descriptions!

src/get_articles.rs

					
				@ -0,0 +134,4 @@

				            .unwrap_or_default();

				        let matching_titles = if wikipedia_titles.is_empty() {

				            Default::default()

biodranik (Migrated from github.com) commented

2023-08-09 21:33:40 +00:00

What is the benefit of hiding errors under a threshold? Isn't it beneficial to see all errors and be able to estimate/compare the quality of the dump, and to easily grep/find what is most important, or feed the whole log to contributors for fixes?

src/wm/title.rs

biodranik (Migrated from github.com) commented

2023-08-09 21:29:57 +00:00

Does it make sense to print wrong hosts in a log to fix/support them?

biodranik (Migrated from github.com) commented

2023-08-09 21:30:14 +00:00

ditto

newsch (Migrated from github.com) reviewed 2023-08-09 22:22:05 +00:00

src/wm/title.rs

newsch (Migrated from github.com) commented

2023-08-09 22:22:04 +00:00

They are caught at a higher level and logged/saved with the full string

newsch (Migrated from github.com) reviewed 2023-08-09 22:40:00 +00:00

src/get_articles.rs

					
				@ -0,0 +134,4 @@

				            .unwrap_or_default();

				        let matching_titles = if wikipedia_titles.is_empty() {

				            Default::default()

newsch (Migrated from github.com) commented

2023-08-09 22:40:00 +00:00

The threshold only determines if the message is info vs error level.
When you use the run.sh script with multiple languages it prints a copy of the hundreds of errors for each language.
I think writing the parse errors to a file separately will be easier to read and deal with.

I'm open to other ideas.

The threshold only determines if the message is `info` vs `error` level. When you use the `run.sh` script with multiple languages it prints a copy of the hundreds of errors for each language. I think writing the parse errors to a file separately will be easier to read and deal with. I'm open to other ideas.

Sign in to join this conversation.

No reviewers

biodranik